July 8, 2025

Understanding Semantic Mappings in Data Integration: Why Schema Mapping Matters

In today’s data-driven world, integrating data across systems is a fundamental challenge faced by organizations of all sizes. Whether you’re building a modern data warehouse or powering analytics through distributed data pipelines, a common thread is the need to bring data together from disparate sources in a meaningful way. But how do we ensure that this data, often stored in different formats and structures, can be understood and used cohesively?

This is where semantic mappings and mediated schemas come into play. In this post, we’ll explore how these concepts are used in data integration, why they matter, and how they are applied in real-world scenarios—including practical examples from my current work using Azure Data Factory and Databricks.


What Is a Mediated Schema?

In any data integration system, users typically interact with a mediated schema (also called a global schema). This schema serves as an abstraction layer against which queries are posed, independent of how the data is actually stored in the source systems.

Behind the scenes, the system uses mappings to define how data from various sources corresponds to the mediated schema. These mappings are a key component of what’s known as source descriptions—a concept detailed in Halevy, Rajaraman, & Ordille (2006). The role of these mappings is to establish the semantic relationships between fields across systems.
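
To make this concrete, here is a minimal sketch in Python of one way a mediated schema and its source descriptions could be represented. Everything below—the schema, the source systems, the tables, and the field names—is hypothetical and purely for illustration.

```python
# Hypothetical sketch: a mediated (global) schema plus source descriptions
# that say how each source system's fields correspond to it.

# The mediated schema that users and queries see, independent of any source layout.
MEDIATED_SCHEMA = {
    "customer": ["customer_id", "first_name", "last_name", "email"],
}

# Source descriptions: for each source system, which table holds the data and
# how its fields map onto fields of the mediated schema.
SOURCE_DESCRIPTIONS = {
    "crm_system": {
        "table": "Contacts",
        "field_map": {
            "ContactID": "customer_id",
            "FirstName": "first_name",
            "Surname": "last_name",
            "EmailAddr": "email",
        },
    },
    "billing_system": {
        "table": "customers",
        "field_map": {
            "cust_no": "customer_id",
            "name": "first_name",   # stored as a single field; may need splitting
            "email": "email",
        },
    },
}
```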


Why Are Semantic Mappings Needed?

To understand the importance of semantic mappings, consider a simple example:

  • In System A, a field is labeled FirstName
  • In System B, the equivalent data might be stored in a field called Name

Although both fields store similar data, they use different names, and possibly different formats or structures as well. This kind of discrepancy is known as semantic heterogeneity.

Without a clear mapping that says "FirstName in System A corresponds to Name in System B," it’s nearly impossible to automate data integration accurately. This is why semantic mapping is critical—not only for data analysts and engineers but also for the ETL (Extract, Transform, Load) pipelines that move data between systems.
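
As a sketch of how such a mapping gets applied in practice, the snippet below renames source columns to a common target vocabulary using PySpark—the kind of step a Databricks notebook might perform. The column names and toy data are invented for illustration, not taken from any real system.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical semantic mapping: source field name -> mediated (target) field name.
FIELD_MAP = {"Name": "FirstName", "Cust_No": "CustomerId"}

# Toy data standing in for System B.
df_system_b = spark.createDataFrame(
    [("Ada", 101), ("Grace", 102)],
    ["Name", "Cust_No"],
)

# Apply the mapping by renaming each source column to its mediated-schema name.
for source_col, target_col in FIELD_MAP.items():
    df_system_b = df_system_b.withColumnRenamed(source_col, target_col)

df_system_b.show()
# +---------+----------+
# |FirstName|CustomerId|
# +---------+----------+
# |      Ada|       101|
# |    Grace|       102|
# +---------+----------+
```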

In my current role, we routinely create mapping documents that explicitly define how fields in one application relate to those in another. This step comes before we build pipelines in tools like Azure Data Factory, and it ensures consistency and accuracy throughout the data integration lifecycle.
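
The format of a mapping document varies from team to team; below is a hypothetical sketch of the kind of entries such a document might capture, written as plain Python so it could double as pipeline configuration. The systems, fields, and transformation notes are illustrative only.

```python
# Hypothetical mapping-document entries: each entry records how a source field
# relates to a target field, plus any transformation needed in between.
FIELD_MAPPINGS = [
    {
        "source_system": "System A",
        "source_field": "FirstName",
        "target_system": "System B",
        "target_field": "Name",
        "transformation": "direct copy",
    },
    {
        "source_system": "System A",
        "source_field": "DateOfBirth",
        "target_system": "System B",
        "target_field": "dob",
        "transformation": "reformat MM/DD/YYYY -> YYYY-MM-DD",
    },
]
```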


The Dynamic Nature of Data: Schema Evolution Challenges

Mapping fields between systems is just one part of the equation. Another challenge arises when the schema changes over time. For example, what happens if a data source updates a field name or changes its format? Without a robust strategy, such changes can break downstream pipelines and introduce errors in reporting or analysis.
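
One defensive strategy—offered here as a sketch, not a prescription—is to validate an incoming dataset against the columns the pipeline expects before transforming it, so a renamed or dropped field fails loudly instead of silently corrupting downstream reports. The expected column set below is a hypothetical contract.

```python
# A minimal schema-drift guard: compare the columns a source actually delivers
# against the columns the pipeline was built to expect.
EXPECTED_COLUMNS = {"FirstName", "LastName", "Email"}  # hypothetical contract

def check_schema_drift(actual_columns):
    """Raise if expected columns are missing; warn about unexpected extras."""
    actual = set(actual_columns)
    missing = EXPECTED_COLUMNS - actual
    extra = actual - EXPECTED_COLUMNS
    if missing:
        raise ValueError(f"Schema drift: missing expected columns {sorted(missing)}")
    if extra:
        print(f"Warning: unexpected new columns {sorted(extra)}")

# Example: the source has renamed FirstName to Name, so the check fails fast.
try:
    check_schema_drift(["Name", "LastName", "Email"])
except ValueError as err:
    print(err)  # Schema drift: missing expected columns ['FirstName']
```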

This issue touches on a deeper problem: there is often no universal consensus on how data should be structured or labeled. This makes schema design one of the most fragile parts of a data integration project. In fact, as noted by Halevy et al. (2006), many data warehouse projects fail precisely at the schema design phase due to the difficulty of aligning terminology and structure across teams and systems.


A Real-World Solution: Incremental Integration

In the real world, data interoperability is often achieved not through rigid standardization, but via incremental integration. Instead of requiring all data sources to conform to a single, global standard, organizations build translators between small, related sets of sources. Over time, these translators evolve to support more systems.
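
As an illustration of the idea (all source names and record shapes here are invented), each new source gets its own small translator into a shared representation, and translators are registered one at a time as sources come online rather than forcing every system onto a global standard up front.

```python
# Incremental integration sketch: one small translator per source, registered
# as that source is added, all producing the same shared record shape.

TRANSLATORS = {}

def register_translator(source_name):
    """Decorator that registers a per-source translator function."""
    def wrap(func):
        TRANSLATORS[source_name] = func
        return func
    return wrap

@register_translator("crm_system")       # hypothetical first source
def from_crm(record):
    return {"first_name": record["FirstName"], "email": record["EmailAddr"]}

@register_translator("billing_system")   # added later, incrementally
def from_billing(record):
    first, _, _ = record["name"].partition(" ")
    return {"first_name": first, "email": record["email"]}

def integrate(source_name, record):
    """Translate a raw source record into the shared representation."""
    return TRANSLATORS[source_name](record)

print(integrate("crm_system", {"FirstName": "Ada", "EmailAddr": "ada@example.com"}))
print(integrate("billing_system", {"name": "Grace Hopper", "email": "grace@example.com"}))
```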

This concept is echoed in work by Halevy, Ives, Mork, & Tatarinov (2003), who argue that a scalable architecture for data integration—such as for the Semantic Web—must allow for the incremental addition of sources. Each new data source should be able to map to whichever existing sources it finds most compatible, rather than to a single, monolithic, hard-to-evolve standard.

In my current projects, we’re adopting this exact model. By parameterizing our data flows between Azure Data Factory and Azure Databricks, we ensure that when a field mapping or a piece of transformation logic is updated, we only need to change it in one location. That change then propagates across the entire system, keeping everything in sync and significantly reducing the maintenance burden.
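
A minimal sketch of that idea: keep the field mapping in one shared place, and have the Databricks notebook receive its remaining inputs as parameters, for example via notebook widgets populated by an Azure Data Factory activity. The widget names, paths, mapping, and toy data below are assumptions made for illustration, not our actual configuration.

```python
# Sketch of a parameterized Databricks notebook driven by Azure Data Factory.
# Widget names, paths, and the shared FIELD_MAP are hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Parameters passed in by the ADF pipeline run. dbutils is only defined inside
# Databricks notebooks, so a fallback keeps this sketch runnable elsewhere.
try:
    source_path = dbutils.widgets.get("source_path")    # hypothetical widget
    target_table = dbutils.widgets.get("target_table")  # hypothetical widget
except NameError:
    source_path, target_table = "/tmp/demo_source", "demo_target"

# The single shared mapping: change it here and every run picks it up.
FIELD_MAP = {"Name": "FirstName", "Cust_No": "CustomerId"}

# Toy stand-in for whatever would be read from source_path.
df = spark.createDataFrame([("Ada", 101)], ["Name", "Cust_No"])
for src, dst in FIELD_MAP.items():
    df = df.withColumnRenamed(src, dst)

df.show()
# In the real pipeline the result would then be written out to target_table.
```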


Conclusion

Semantic mappings and mediated schemas are foundational to modern data integration. They solve a critical challenge: how to understand and unify data from multiple, heterogeneous systems. While these mappings can be complex—especially as schemas evolve—they’re essential for building robust, scalable data pipelines.

Through careful planning, incremental integration, and the use of cloud-native tools like Azure Data Factory and Databricks, it's possible to build data systems that are both flexible and resilient.


References

  • Halevy, A., Rajaraman, A., & Ordille, J. (2006). Data integration: The teenage years. In Proceedings of the 32nd International Conference on Very Large Data Bases (pp. 9–16).
  • Halevy, A., Ives, Z. G., Mork, P., & Tatarinov, I. (2003). Piazza: Data management infrastructure for semantic web applications. ACM SIGMOD Record, 31(4), 54–60.