By Noreen Kendle
The typical approach to data integration has been a bottom-up, data-systems, tools-and-technology approach, in which a data integration tool is used to merge data columns from various database systems. This approach has been around for many years. If it worked, we would not be in the data disparity mess we are in today.
The data integration tools and the automation they offer are not the point of failure; the issue is the methodology, how the tools are applied. The tools move and merge the data as directed; the problem is how they are directed. The decisions on what gets “merged” or “pasted” together are typically based on data field names, content, and sometimes relationships within the existing data systems. Anyone who has worked in the guts of data knows how misleading a column name, or even its content, can be about its meaning. Add to that the fact that most legacy data systems are riddled with data issues and idiosyncrasies. When systems are integrated in this manner, the data abnormalities multiply.
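As a minimal sketch of this pitfall (the tables, systems, and status codes are hypothetical), consider two systems that both expose a column named `status` whose codes carry different business meanings. A merge driven purely by matching column names silently conflates them:

```python
# Hypothetical records from two legacy systems. Both use a column named
# "status", but the codes mean different things: in billing, "A" means
# "in arrears"; in CRM, "A" means "active".
billing = [{"cust_id": 1, "status": "A"}, {"cust_id": 2, "status": "C"}]
crm = [{"cust_id": 1, "status": "A"}, {"cust_id": 2, "status": "A"}]

def naive_merge(rows_a, rows_b, key):
    """Merge rows on matching column names, as a name-driven tool would."""
    merged = {r[key]: dict(r) for r in rows_a}
    for r in rows_b:
        merged.setdefault(r[key], {}).update(r)  # later value wins
    return list(merged.values())

result = naive_merge(billing, crm, "cust_id")
# Customer 2's billing status "C" has been silently overwritten by CRM's
# "A": same column name, different business meaning, no error raised.
print(result)
```

The merge succeeds mechanically, which is exactly the problem: nothing in the tool's view of the data distinguishes the two meanings of `status`.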
There is also the dimension of time, which plays a role in data meaning and integration. Data can change its meaning over time, as can the business it represents. Even when two data items appear the same, they may be representations of the same thing at different points in time and therefore not really the same thing at all. This is a challenge for tool-based, data-systems integration.
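A hypothetical sketch of the time problem: two records can hold identical values yet represent different facts, because each was true at a different point in time. Only when an effective date travels with the data can an integration process tell them apart:

```python
from datetime import date

# Hypothetical: the same customer's address captured by two systems.
# The values are identical, but one was true in 2015 and one in 2023;
# strip the effective date and they would wrongly be deduplicated.
rec_2015 = {"cust_id": 7, "city": "Austin", "effective": date(2015, 1, 1)}
rec_2023 = {"cust_id": 7, "city": "Austin", "effective": date(2023, 6, 1)}

def same_fact(a, b):
    """Records are the same fact only if values AND time context match."""
    return a == b  # dict equality includes the effective date

# Dropping the time dimension makes the records indistinguishable:
strip = lambda r: {k: v for k, v in r.items() if k != "effective"}
print(same_fact(rec_2015, rec_2023))       # different facts
print(strip(rec_2015) == strip(rec_2023))  # but they look the same
```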
Another challenge with this data integration methodology is capturing the data relationships. Data meaning depends on relationships with other data: the meaning of a data item, a column in a database, is relative to its relationships within the table where it resides, as well as its relationships and dependencies to other attributes and tables. When data is looked at individually, as a single column or attribute, void of its relationships, there is a high risk of losing its full meaning and thus compromising its integrity. Many organizations’ database systems today, even when they have the capability, do not implement referential integrity (RI) within the database or otherwise compensate for it. As a result, much of the data, even well-named and well-defined data, has lost its relative meaning.
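To illustrate lost relational meaning with hypothetical tables: a `rate` value is only interpretable through its relationship to a parent plan. Where RI is not enforced, orphaned rows carry values whose meaning can no longer be recovered:

```python
# Hypothetical tables: a rate only means something relative to its plan.
plans = {10: "monthly", 11: "annual"}
rates = [
    {"plan_id": 10, "rate": 9.99},
    {"plan_id": 99, "rate": 9.99},  # orphan: plan 99 deleted, RI never enforced
]

def check_ri(child_rows, parent_keys, fk):
    """Return child rows whose foreign key has no parent (RI violations)."""
    return [r for r in child_rows if r[fk] not in parent_keys]

orphans = check_ri(rates, plans.keys(), "plan_id")
# The two 9.99 values look identical, but the orphan's meaning
# (monthly? annual?) is unrecoverable without the relationship.
print(orphans)
```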
Data is not something that can be disassembled and treated linearly without risking its integrity. There is a very good reason why relational logic for data structures works well: data, like the real world it represents, is interrelated. Hierarchies are simply parent-child relationships. This does not necessarily mean data has to be stored or moved in a relational structure; rather, its relational or relative meaning must be preserved in some manner in order to maintain its integrity.
Data integration tools can only take a “best guess” in an attempt to derive a data field’s meaning, even if that guess is based on a good-quality name, content, and structure or relationships. The real meaning of data comes from its intended business meaning: how it represents the business. Without that business meaning, where the physical data is mapped and justified to a business data model that then drives the data integration, the result can only ever be a best guess, with or without a tool.
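By way of contrast, a top-down sketch (all names here are hypothetical, not a particular tool's API): physical columns are first mapped to a business data model, and the integration then merges on business meaning rather than on physical names:

```python
# Hypothetical mapping of physical (system, column) pairs to business
# data model terms. The mapping, not the column names, drives the merge.
business_map = {
    ("billing", "cust_id"): "customer_id",
    ("billing", "status"): "payment_standing",
    ("crm", "cust_id"): "customer_id",
    ("crm", "status"): "relationship_state",
}

def to_business(system, row):
    """Rename a physical row's columns to business-model terms."""
    return {business_map[(system, col)]: val for col, val in row.items()}

row = {"cust_id": 1, "status": "A"}
print(to_business("billing", row))
# The two physical "status" columns now land in distinct business
# concepts (payment_standing vs. relationship_state) and are never conflated.
```

The point of the sketch is the order of operations: the mapping to the business model is built first, by people who know the business meaning, and the mechanical integration follows from it.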