Big data platforms do not generate data. Most of the raw data actually comes from business systems, so we need to focus on the task of “transporting” data. This brings us to the concept of “data integration.” In this article, the author shares his insights and experiences, so let’s take a look.
I do not generate data; I am just a data porter.
In big data platforms, data is not generated, or in other words, the raw data originates from business systems. Therefore, the first task we need to accomplish is transporting the data from the business systems to the data platform.
Using a familiar saying, “I do not produce data; I am just a data porter.” This data porter is the first module of the data platform – data integration.
1. What Are the Different Terms for a Data Porter?
In product design, it is common to encounter various names for the same product. These names may seem to refer to the same thing, but if you look closely, there may be some differences.
In the design of data integration products, similar problems arise. Commonly used names include data integration, data synchronization, and data collection. Sometimes, you may also hear terms like data pipeline or data transmission.
I searched for the differences using a large language model.
💡 Data Integration refers to the process of combining data from different sources, types, and formats into a unified view or data warehouse for analysis and processing. Data integration typically involves multiple steps like extraction, transformation, cleaning, and loading, and also needs to address issues like data format conversion, data quality, and data security.
Data Synchronization, on the other hand, involves copying or transferring data from one data source to another to maintain consistency between the two sources. Data synchronization typically focuses on real-time or near-real-time data transmission and replication, used in distributed systems for data replication, backup, and disaster recovery.
It seems that data integration emphasizes the consolidation and processing of data, while data synchronization emphasizes data transmission and consistency.
💡 In the big data field, data collection and data integration are closely related concepts, but they serve different roles and purposes in data processing.
Data Collection is the process of obtaining data from external systems and inputting it into the system, either manually or automatically. During this process, the data may need to be cleaned or formatted for subsequent integration and use.
Data Integration involves bringing together data from different sources, formats, and types into a comprehensive, unified data view or warehouse, typically through extraction, transformation, cleaning, and loading steps. Its main purpose is to enable comprehensive data sharing and analysis.
Therefore, data collection can be seen as the foundation for data integration, and data integration is the follow-up process that enables the effective use and management of big data assets.
In this definition, data collection is a step that comes before data integration; in real-time scenarios, capturing changes from source databases is essentially what data collection refers to. These conclusions all came from the large language model. Once you grasp the basic concepts behind these terms, it is still important to interpret them according to company practices, departmental history, and the context in which they are used.
Personally, I often mix up the terms data integration and data synchronization. If a system offers both a wizard-based form and a drag-and-drop form, I would call the drag-and-drop one data integration, because it focuses more on data transformation and consolidation, and the wizard-based one data synchronization, because it emphasizes data transmission and consistency. As for data collection, I confuse it less often; I mainly understand it as the process of capturing changes from databases.
Again, this is entirely my personal classification.
2. Processing During Data Transportation
Does data need to be processed while it is being synchronized? Data synchronization is often compared with ETL (Extract, Transform, Load), but whether transformations belong in the synchronization step is debatable.
One-to-One Synchronization
The purpose of data synchronization is to retain the business data history. If history is to be preserved, even erroneous data should be kept, since errors are also part of the historical record. Synchronization is done one-to-one with the business system: even if the data brought over is erroneous or non-standard, it is kept as-is. This preserves the original business data, so that when discrepancies arise the data can be traced back to its source.

I personally think this idea is sound because it preserves the business history completely. The catch is that erroneous data will eventually be corrected in the business system, while offline synchronization does not happen in real time, so the two copies can diverge. Taken to this extreme, the approach demands a great deal from both people and processes.
Transformations and Cleaning During Synchronization
The second approach is less strict and more flexible. It allows row-level additions, deletions, and standardization, as well as field-level aggregation, association, and transformation.

From a product design perspective, supporting this form essentially covers one-to-one synchronization as well: if the system can transform and aggregate during synchronization but simply does not use those capabilities, it is effectively doing one-to-one synchronization. So one-to-one synchronization is better thought of as a norm or requirement than as a separate capability.
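To make the contrast concrete, here is a minimal Python sketch of the two approaches; the table and field names are hypothetical, and a real product would run this inside a synchronization engine rather than in plain Python.

```python
# Minimal sketch of the two synchronization styles; tables and fields are hypothetical.

def sync_one_to_one(source_rows, target):
    """One-to-one sync: land every source row unchanged, even if it looks wrong."""
    for row in source_rows:
        target.append(dict(row))  # keep erroneous / non-standard values for traceability


def sync_with_cleaning(source_rows, target):
    """Sync with transformation: drop bad rows and standardize fields on the way in."""
    for row in source_rows:
        if row.get("order_amount") is None:  # row-level deletion of obviously broken records
            continue
        cleaned = dict(row)
        cleaned["currency"] = (row.get("currency") or "CNY").upper()  # field standardization
        target.append(cleaned)


source = [
    {"order_id": 1, "order_amount": 99.0, "currency": "cny"},
    {"order_id": 2, "order_amount": None, "currency": None},  # erroneous record from the business system
]

raw_layer, cleaned_layer = [], []
sync_one_to_one(source, raw_layer)         # 2 rows, history preserved as-is
sync_with_cleaning(source, cleaned_layer)  # 1 row, cleaned and standardized
```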
3. Target Table Types for Data Transportation
The goal of transporting business data to the data platform is to retain history and ensure data traceability. However, since business system data is constantly changing, how to preserve the historical changes of the data involves the structure of target tables.
This is part of the data warehouse modeling field, but why mention it here? Let's first look at some common types of target tables: full tables, slice tables, and slowly changing dimension (SCD) tables.
Full Table
As the name suggests, a full table involves syncing all of the data to the target system. It is typically used for relatively static data such as code tables.

Slice Table
Slice tables are further divided into full slices and incremental slices. A full slice stores all of the business data as of a given day in that day's partition; an incremental slice stores only that day's new data in that day's partition.
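A minimal sketch of the two slice styles, assuming a daily partition key `ds` and a hypothetical orders table:

```python
from datetime import date

# Hypothetical source table: every order row carries its last-modified date.
orders = [
    {"order_id": 1, "status": "paid",    "updated_on": date(2024, 5, 1)},
    {"order_id": 2, "status": "created", "updated_on": date(2024, 5, 2)},
]

def full_slice(rows, ds):
    """Full slice: the partition for `ds` holds a complete copy of the table on that day."""
    return {ds: [dict(r, ds=ds) for r in rows]}

def incremental_slice(rows, ds):
    """Incremental slice: the partition for `ds` holds only that day's new or changed rows."""
    return {ds: [dict(r, ds=ds) for r in rows if r["updated_on"] == ds]}

ds = date(2024, 5, 2)
print(len(full_slice(orders, ds)[ds]))         # 2 -> the whole table as of that day
print(len(incremental_slice(orders, ds)[ds]))  # 1 -> only that day's changes
```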
Slowly Changing Dimension (SCD) Table

SCD tables are the most complex. They require a unique business key, and the system needs to detect whether a business record has changed; when a change is detected, a new record is added to the target table carrying the start and end times of that version. Some systems also add a version number or a "current" status flag. (SCD tables also depend on the synchronization time granularity: changes that happen at a finer granularity than the synchronization interval are hard to capture faithfully in the target system.)
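As an illustration of the pattern just described, here is a simplified Python sketch of SCD Type 2 handling for a hypothetical customer record keyed by customer_id and tracking a single attribute (city); real implementations also have to deal with versioning, status flags, and the time-granularity issue mentioned above.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # conventional "still current" end date

def apply_scd2(history, incoming, ds):
    """Close the current version if the tracked attribute changed, then open a new one."""
    current = next(
        (r for r in history
         if r["customer_id"] == incoming["customer_id"] and r["end_date"] == OPEN_END),
        None,
    )
    if current is not None and current["city"] == incoming["city"]:
        return history                 # nothing changed: keep the current version
    if current is not None:
        current["end_date"] = ds       # close out the old version
    history.append({**incoming, "start_date": ds, "end_date": OPEN_END})
    return history

history = []
apply_scd2(history, {"customer_id": 42, "city": "Hangzhou"}, date(2024, 5, 1))
apply_scd2(history, {"customer_id": 42, "city": "Shanghai"}, date(2024, 5, 2))
# history now holds two versions:
#   2024-05-01 .. 2024-05-02  city=Hangzhou
#   2024-05-02 .. open        city=Shanghai
```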
Why mention this here? Because data integration products need to support creating these target table types: full tables for full synchronization, slice tables for incremental slices (with the ability to filter out each day's changes), and the more complex SCD logic, which may need to be baked into the tool as a built-in capability (something I have only seen in Powercenter). All of this must be considered during data synchronization: it is not just about transporting data to the target, but about transporting it in a way that fits the target table's structure.
4. Interaction Forms for Data Transportation
During the data transportation process, there are typically three forms of interaction: script-based, drag-and-drop, and wizard-based.
Script-Based
As the name suggests, script-based synchronization means writing a script to perform the data synchronization. This form is more technical and is usually not a major focus of product design.

Common script-based synchronization tools include:
Sqoop: An older tool for batch data migration between structured data stores and Hadoop. Originally developed at Cloudera and later donated to the Apache Software Foundation, the project has since been retired to the Apache Attic.
DataX: Alibaba's open-source data synchronization tool, which defines synchronization jobs through script-style configuration and is designed for moving data between heterogeneous data sources.
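To give a feel for what "script-based" means in practice, here is a hypothetical, Python-flavored job definition loosely inspired by tools like those above; the structure and field names are illustrative only and do not reproduce any tool's actual configuration format.

```python
# Hypothetical script-style job definition, loosely in the spirit of DataX-like tools.
# The structure and field names are illustrative only, not any tool's real schema.
job = {
    "reader": {"type": "mysql", "table": "orders", "columns": ["order_id", "amount", "updated_on"]},
    "writer": {"type": "hive", "table": "ods_orders", "partition": "ds=2024-05-02"},
    "setting": {"channels": 4},  # degree of parallelism
}

def run(job):
    # A real engine would load reader/writer plugins and move data; here we only describe the job.
    reader, writer = job["reader"], job["writer"]
    print(f"sync {reader['table']} -> {writer['table']} "
          f"(partition {writer['partition']}, {job['setting']['channels']} channels)")

run(job)
```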
Drag-and-Drop
Drag-and-drop data integration products build an ETL DAG (Directed Acyclic Graph) by dragging operators onto a canvas.

When working in this mode, data integration takes on some aspects of data development, although the two remain distinct. Common drag-and-drop synchronization tools include:
Informatica Powercenter: A well-known product abroad, often used in finance sectors but less common in the internet industry.
IBM Datastage: Similar to Powercenter.
Kettle: An open-source, free data ETL tool.
Wizard-Based
Wizard-based data integration involves creating tasks through input fields or configuration options without needing to write code or drag operators. This is a common model used by most cloud providers for their data integration or synchronization products.
5. Timeliness
In my understanding, data integration can be divided into two categories: offline data integration and real-time data integration. Full synchronization, incremental synchronization, etc., are just options within these two major forms. Both can be implemented using script-based, drag-and-drop, or wizard-based methods. The format isn’t as important as whether the integration is real-time or offline.
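The split becomes clearer with a rough sketch: offline integration is a scheduled, bounded batch job, while real-time integration is a long-running consumer of change events. The functions below are hypothetical stand-ins, not any particular engine's API.

```python
import time

def offline_sync(extract_day, load_partition, ds):
    """Offline integration: a scheduled job pulls one day's data and loads it as a batch."""
    rows = extract_day(ds)        # e.g. a bounded query such as "all rows updated on ds"
    load_partition(ds, rows)      # land the batch into that day's partition

def realtime_sync(poll_changes, apply_change, should_stop):
    """Real-time integration: continuously consume change events and apply them one by one."""
    while not should_stop():
        for change in poll_changes():   # e.g. CDC events from a binlog or message queue
            apply_change(change)
        time.sleep(1)                   # back off briefly when no new events arrive

# Offline demo with in-memory stand-ins for the extract and load steps.
partitions = {}
offline_sync(lambda ds: [{"order_id": 1, "ds": ds}],
             lambda ds, rows: partitions.update({ds: rows}),
             "2024-05-02")
print(partitions)  # {'2024-05-02': [{'order_id': 1, 'ds': '2024-05-02'}]}
```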
Real-time data integration is gaining more and more attention; with the rise of concepts such as unified batch and stream processing, demand for it keeps growing.
However, I don’t think offline data integration will be completely replaced. Real-time integration is costlier, and the technologies involved differ significantly from offline integration. Not all existing technical architectures are ready for real-time integration.
Moreover, historical habits also play a role. Concepts like slice tables and SCD tables make the most sense in offline scenarios: they carry over smoothly in offline contexts but are often overlooked in real-time ones.
6. Supported Data Source Types
The number of supported data source types is often read as a measure of a platform's capability: the more sources it supports, the stronger it is considered to be. Open-source tools such as DataX publish lists of the data sources they support, which make a useful reference, and there is plenty of material online if you want to dig deeper.
In conclusion, data integration is the first step of a data platform and is crucial for bridging business systems and the data analysis platform. When designing or implementing data integration products, factors such as target table types, supported data sources, synchronization methods, interaction forms, and timeliness should all be considered carefully.