Data planning typically happens at two levels: table-level planning and field-level planning. So what exactly can we plan for in metadata, and is such planning feasible? In this article, the author shares their perspective. Let's take a look.
Metadata is the foundation of a big data platform; the platform as a whole is built around it. A big data platform that manages its metadata well is already halfway to success. So, what can we plan for metadata, and is that planning feasible?
1. What to Plan for in Data Planning?
When planning data, there are typically two levels: table-level planning and field-level planning.
2. Table-Level Planning
Table-level planning is essentially the design of the data warehouse. It covers data warehouse layering and business domain partitioning.
1. Data Warehouse Layering
Data warehouse layering refers to the layers commonly heard of in the data warehouse field: ODS (operational data store), DWD (data warehouse detail), DWS (data warehouse summary), and so on.
In an ordinary table creation process, layers are distinguished simply by adding prefixes to table names. In a big data platform, however, we also want an explicit layer label on the table's metadata, so that the platform itself knows which layer a table belongs to.
If a wizard-based table creation process is used, a data warehouse layer selector can be added directly to the form, so the table's layer is defined at creation time. With script-based table creation, the label has to be maintained after the fact, because a plain SQL editor offers no way to mark a table's layer. The exception is when layering is logically tied to the underlying storage, i.e., different warehouse layers correspond to different databases (which seems to be the case in most real-world deployments); there, the layer can be derived from where the table lives.
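As a concrete illustration, here is a minimal sketch in Python of how that after-the-fact maintenance could be automated by inferring the layer label from the table-name prefix. The table names and layer set are made up; this is not any real platform's API.

```python
from typing import Optional

# Layers recognized by this sketch; extend as needed (e.g. ADS, DIM).
WAREHOUSE_LAYERS = {"ods", "dwd", "dws", "ads"}

def infer_layer(table_name: str) -> Optional[str]:
    """Return the layer label implied by the table-name prefix, if any."""
    prefix = table_name.split("_", 1)[0].lower()
    return prefix.upper() if prefix in WAREHOUSE_LAYERS else None

print(infer_layer("ods_user_login"))   # ODS
print(infer_layer("dws_daily_active")) # DWS
print(infer_layer("some_misc_table"))  # None -> needs manual labeling
```

When layers map to separate databases, the same label can be derived from the database name instead of the prefix.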
2. Business Domain Partitioning
In addition to its data warehouse layer, a table also needs to be classified by business domain. A data warehouse generally aggregates data from multiple business lines, with some business domains overlapping across lines and others unique to one, so the data has to be partitioned according to the actual shape of the business.

While data warehouse layering is a technical issue, business domain partitioning mixes business and technical considerations: it requires a solid understanding of the business and the ability to translate business domains into a technical representation with no duplication and no omissions. Labeling business domains on tables works like layer labeling: it can be done during creation with a wizard-based approach, while script-based approaches require maintenance afterward.
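To make the two table-level labels concrete, here is a hypothetical metadata record that carries both. The class and the domain names are illustrative, not taken from any real platform:

```python
from dataclasses import dataclass

@dataclass
class TableMeta:
    name: str
    layer: str   # warehouse layer, e.g. "ODS", "DWD", "DWS"
    domain: str  # business domain, e.g. "trade", "user"

catalog = [
    TableMeta("dwd_trade_order_detail", layer="DWD", domain="trade"),
    TableMeta("dws_user_retention_1d", layer="DWS", domain="user"),
]

# With both labels present, locating data or reviewing one layer of one
# business domain reduces to a simple filter over the catalog:
dwd_trade = [t for t in catalog if t.layer == "DWD" and t.domain == "trade"]
print([t.name for t in dwd_trade])  # ['dwd_trade_order_detail']
```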
3. Is Table-Level Planning Feasible?
Returning to the earlier question: is data planning feasible? Personally, I believe table-level planning is both feasible and necessary. With layer and business domain labels in place, it becomes much easier to locate data and, later on, to govern and review individual layers.
💡 Personally, I feel that the big data field is more of an experiential domain, where everyone has their own understanding. Different terms are not fully standardized, and interpretations come from various perspectives. This article is based more on my own practical experience, and my understanding may evolve as I encounter different aspects of the field in the future.
4. Field-Level Planning
The other level of planning is field-level planning. Is this feasible, and what can we plan for at the field level?
Data Metrics
Before data metrics can be used, the metrics themselves first have to be unified.
Before a supporting system is in place, metric unification is typically managed through an Excel spreadsheet that standardizes the required metrics and their definitions. At a small scale, such as within a single project team, this may be workable. But if unification has to extend to an entire company or group, Excel is no longer sufficient; a system is required, with processes for the creation, review, publication, and deprecation of metrics.
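The lifecycle such a system would need might look like the following state machine. The states and allowed transitions here are my own assumptions, sketched for illustration rather than taken from any real metric platform:

```python
from enum import Enum

class MetricStatus(Enum):
    DRAFT = "draft"           # just created
    IN_REVIEW = "in_review"   # submitted for review
    PUBLISHED = "published"   # approved; usable in modeling
    DEPRECATED = "deprecated" # retired; no longer selectable

# Allowed transitions between lifecycle states (an assumption, not a spec).
TRANSITIONS = {
    MetricStatus.DRAFT: {MetricStatus.IN_REVIEW},
    MetricStatus.IN_REVIEW: {MetricStatus.PUBLISHED, MetricStatus.DRAFT},
    MetricStatus.PUBLISHED: {MetricStatus.DEPRECATED},
    MetricStatus.DEPRECATED: set(),
}

def advance(current: MetricStatus, target: MetricStatus) -> MetricStatus:
    """Move a metric to `target`, rejecting illegal lifecycle jumps."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target

status = advance(MetricStatus.DRAFT, MetricStatus.IN_REVIEW)
status = advance(status, MetricStatus.PUBLISHED)  # now usable in modeling
```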
With unified data metrics in place, these metrics can be used in two scenarios: modeling and OLAP (Online Analytical Processing). This article focuses on field-level planning for metrics in the modeling scenario.
🔑 I classify the use of data metrics into two categories: those for modeling and those for OLAP. These two scenarios are somewhat similar but also have subtle differences. While I can’t say whether this classification is entirely accurate, I will note it for now.
Once a unified metric standard exists across the company, where can it be used in the modeling scenario? Personally, I believe only in graphical, wizard-based metadata creation: during table creation, field names are presented as dropdown selections, so users can choose only metrics that have already been published. This keeps metric names and definitions consistent and avoids fields that mean the same thing but are named differently.
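A minimal sketch of that constraint, with made-up metric names: the wizard's dropdown is populated only from published metrics, and any field name that is not a published metric is rejected at creation time:

```python
# Published metrics: name -> definition (names are made up).
published_metrics = {
    "order_cnt": "Count of orders placed",
    "gmv": "Gross merchandise volume",
    "uv": "Unique visitors",
}

def dropdown_options() -> list:
    """What the wizard's field-name dropdown would display."""
    return sorted(published_metrics)

def validate_fields(field_names) -> None:
    """Reject any field whose name is not a published metric."""
    unknown = [f for f in field_names if f not in published_metrics]
    if unknown:
        raise ValueError(f"not published metrics: {unknown}")

validate_fields(["order_cnt", "gmv"])  # passes
# validate_fields(["order_count"])     # would raise: name drift is caught
```

Enforcing the same check on script-based (SQL) creation is much harder, which is the feasibility doubt raised next.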
However, whether this wizard-based approach is practical in real data development, and whether restricting SQL-based creation is feasible at all, is debatable. I suspect it is not: it would slow development down, and in practice it may never be widely adopted.
5. Data Standards
Here I will only discuss how data standards are used, not how they are defined (that will be covered in a separate article). Data standards typically include code tables as well, so I refer to both collectively as data standards without further distinction.
Once the standards are established, where are they used? In my experience, they come into play during wizard-based metadata creation, where a data standard is selected and bound to a field. But once that binding relationship exists, what can actually be done with it?
If the goal is quality control, data that fails the standard can be filtered out, which ties into data quality. But if it is not about data quality, then once the standard is set, what happens next? There does not seem to be much more to it. This is why I have not yet fully understood the role of data standards in metadata creation; further study is needed.
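For the quality-control case specifically, the binding could work roughly as follows. The shape of the standard here (a code table plus a length limit, the latter nodding to the industrial cases mentioned below) is my own assumption for illustration:

```python
from dataclasses import dataclass
from typing import Optional, Set

@dataclass
class DataStandard:
    codes: Optional[Set[str]] = None  # allowed values, if a code table applies
    max_length: Optional[int] = None  # length limit, if one applies

    def conforms(self, value: str) -> bool:
        if self.codes is not None and value not in self.codes:
            return False
        if self.max_length is not None and len(value) > self.max_length:
            return False
        return True

# Bind a hypothetical gender-code standard to the `gender` field.
gender_std = DataStandard(codes={"M", "F", "U"}, max_length=1)

rows = [{"gender": "M"}, {"gender": "male"}, {"gender": "F"}]
clean = [r for r in rows if gender_std.conforms(r["gender"])]
print(clean)  # "male" violates the code table and is filtered out
```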
🔑 I haven’t worked with industrial data standards, but I imagine the industrial sector may have more use cases for data standards, particularly for data length, precision, etc.
6. Is Field-Level Planning Feasible?
Is field-level planning feasible in practice? To be honest, I haven’t seen any successful examples of it so far. There are two main reasons for this.
First, both data metrics and data standards require graphical support, i.e., wizard-based metadata creation. In these tools, every field has to be added one row at a time; with many fields, the configuration burden becomes so large that developers may refuse to accept it.
Second, most companies already have a certain amount of historical data, which can be considered a "legacy burden." It is impossible to simply redo everything.
7. Is Existence Justified?
If we accept the idea that "whatever exists has its reasons," then perhaps the value is real and I simply have not yet encountered the business scenarios that reveal it. The use of data metrics and data standards still requires deeper research.
There is an interesting point, though: all the major cloud providers seem to offer this module. It is unclear whether they have actually figured out how to use it, or whether they simply don't want to lose points in feature comparisons and leave the question of real-world usability to their users.
Rumor has it that Dataphin can make field-level planning work, because using Dataphin means starting from scratch: planning, then modeling, then development. If strict wizard-based table creation is enforced throughout, it could be feasible. This might also explain why Alibaba has both DataWorks and Dataphin: DataWorks carries the legacy systems, while Dataphin starts fresh. Or it may simply come down to abundant engineering manpower.