Throughout this Data cycle series, we look at the journey of our data from its generation to its storage and interpretation. In the first chapter, Data generation and collection, we explored the diverse nature of RWI data and how varied the data sources of our systems can be.

In this chapter, we will study how we can access this data and make the most of it for its future use and analysis.

What is it?

Data integration has been widely discussed in the IT world since the 1980s. As more and more systems were being designed, the need to retrieve and cross-reference information from more than one of them started to rise.

The first data integration systems based their operations on ETL (Extract, Transform and Load) tasks: a common schema is defined, and the data from different sources is read, reshaped and loaded into a central repository. Thus the concept of the Data Warehouse (DW) was born.
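As a rough sketch of such an ETL task, the snippet below reads two hypothetical water-level sources with different shapes and units, reshapes them into a common schema, and loads them into a central repository. The source names, fields and schema are all invented for illustration; a real warehouse would not be an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical raw readings from two different sources, in different shapes.
SOURCE_A_CSV = "station,timestamp,level_m\nA1,2024-01-01T00:00,3.2\n"
SOURCE_B_ROWS = [{"site": "B7", "time": "2024-01-01T00:00", "level_cm": 410}]

def extract_transform():
    """Read each source and reshape its records into a common schema."""
    rows = []
    for r in csv.DictReader(io.StringIO(SOURCE_A_CSV)):
        rows.append((r["station"], r["timestamp"], float(r["level_m"])))
    for r in SOURCE_B_ROWS:
        rows.append((r["site"], r["time"], r["level_cm"] / 100.0))  # cm -> m
    return rows

def load(rows):
    """Load the unified rows into the central repository (the warehouse)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE readings (station TEXT, ts TEXT, level_m REAL)")
    db.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    return db

db = load(extract_transform())
print(db.execute("SELECT COUNT(*) FROM readings").fetchone()[0])  # prints 2
```

Note that all the reshaping happens before the data reaches the warehouse: the repository only ever sees the agreed common schema.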

As the years passed and systems became more and more complex, but also more and more sophisticated, a question was raised: why do I need a copy of the data if it is already stored in this other system? Would it not be easier to ask for the data only when I need it? Terms like Service, Interoperability and API (Application Programming Interface) were brought to the table and are here to stay.

And what is coming? With the new Internet of Things and Big Data trends, data integration is also evolving. As opposed to data warehousing, new concepts are attracting more attention, such as data lakes. In a data lake, any type of data can be stored, whether structured, semi-structured or unstructured (only structured data can be stored in a DW), and the data is processed as schema-on-read: it is stored as-is and only structured and shaped when read.
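A minimal sketch of the schema-on-read idea, assuming hypothetical sensor records stored in the lake as raw JSON strings: the records are kept exactly as they arrived, even with inconsistent field names and types, and a common shape is imposed only at read time.

```python
import json

# The "lake": records stored as-is, with no schema enforced at write time.
lake = [
    '{"sensor": "S1", "temp_c": 21.5}',
    '{"sensor": "S2", "temperature": "20.1"}',  # different field name and type
]

def read_with_schema(raw_records):
    """Apply a common schema only when reading (schema-on-read)."""
    for raw in raw_records:
        rec = json.loads(raw)
        temp = rec.get("temp_c", rec.get("temperature"))
        yield {"sensor": rec["sensor"], "temp_c": float(temp)}

readings = list(read_with_schema(lake))
```

In a warehouse, the second record would have been rejected or reshaped before storage; here the reconciliation logic lives entirely in the reader.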

The reality is that no data integration strategy is better than another, and most of the time a mixed strategy is used, depending on the systems involved and the specific requirements.

Can it help me?

In the water and environment business, data integration is recognised as one of the biggest challenges to deal with. Business-centric information is usually spread across different systems, not only inside our organisation but also outside of it (external providers, governmental agencies, etc.). And bringing all this information together and making it accessible is not always easy.

Sometimes the data is stored in silos, or in systems not sophisticated enough to provide an interoperability layer. When one is provided, the interoperability differs from one system to another: the data comes in different formats and from different sources. And advanced systems with proper APIs and the capability to provide data on demand also require a compatible API client to be built and used.
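One common way to tame this per-system diversity is to hide each source behind a small adapter that exposes a shared interface. The sketch below is purely illustrative: all class names and data are invented, and the adapter bodies return canned values where real implementations would parse file exports or call a provider's REST API.

```python
from abc import ABC, abstractmethod

class SourceClient(ABC):
    """Hypothetical common interface; each external system gets an adapter."""

    @abstractmethod
    def fetch_readings(self, station):
        """Return a list of readings for a station, in a common shape."""

class AgencyCsvClient(SourceClient):
    def fetch_readings(self, station):
        # A real adapter would download and parse the agency's CSV export.
        return [{"station": station, "level_m": 3.2}]

class ProviderApiClient(SourceClient):
    def fetch_readings(self, station):
        # A real adapter would call the provider's API on demand.
        return [{"station": station, "level_m": 2.8}]

clients = [AgencyCsvClient(), ProviderApiClient()]
readings = [r for c in clients for r in c.fetch_readings("A1")]
```

The rest of the system then only ever talks to `SourceClient`, so adding a new external source means writing one new adapter rather than touching every consumer.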

But the effort IS worth it. Data integration minimises data inconsistency, reduces data complexity, simplifies access to the data, increases the value of your information, and provides you with more tools to understand your data and your business and to make smarter decisions.

So… what is next?

Well, the cycle does not end here. Once the data is integrated, new challenges appear. Is my data correct? Is it permanently archived or is it lost every year? Can I have access to it anytime and anywhere? Can I aggregate it or transform it to extract new insights? Some of these questions will be answered in the next chapter of our series: Data validation and consolidation.