New technologies developed by the biggest players in Big Data (Yahoo, Google, Facebook, LinkedIn, etc.) are changing the way data and information are managed, and sometimes very quickly. Some of these technologies are not easy to understand or apply to other businesses, but others are. One of them is Apache Kafka, a “unified, high-throughput and low-latency open-source message broker”.

A few months ago, I found a very interesting presentation by Martin Kleppmann, a researcher at the University of Cambridge Computer Laboratory, where he explained how the Kafka message broker can be used for real-time data integration.

Apache Kafka

A message broker is a piece of software (and also an architectural pattern) that mediates communication amongst applications. In Kafka, this mediation is done via streams.

Streams

Kafka was created by LinkedIn to manage trillions of messages per day. It was built to accommodate their growing membership and increasing site complexity, and to help transform their IT architecture from a monolithic application infrastructure into one based on services.

The basic idea of a message broker can also be applied to data integration to create highly efficient and robust real-time data pipelines.

The principle is really simple. Different “producers” are configured in every system that contains valuable information to be shared. Every time new data gets into the system, these producers push it to a queue in the message broker. On the other side, “consumers” are subscribed to those queues, so when new data arrives they are notified immediately and can decide what to do with it, as the sketch below illustrates.

Producers - consumers
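As a minimal sketch of those two roles, assuming a local Kafka broker on localhost:9092, the kafka-python client and a made-up topic name, it could look roughly like this:

```python
# Producer side: push each new piece of data to a topic on the broker.
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("example-topic", b"new data arrived")
producer.flush()  # make sure the message actually leaves the client

# Consumer side: subscribe to the same topic and react as messages arrive.
from kafka import KafkaConsumer

consumer = KafkaConsumer("example-topic", bootstrap_servers="localhost:9092")
for message in consumer:
    print(message.value)  # decide here what to do with the new data
```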

A practical example: let’s say we have a SCADA system that receives real-time information from meter readers, and we need this information to be synchronised with our financial system. Why wait until the end of the month to push it there? We can configure a producer that reads the meter readings from SCADA and pushes them onto a message queue.
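A hedged sketch of that producer, again with kafka-python; read_latest_readings() and the meter-readings topic are hypothetical stand-ins for whatever interface the SCADA system actually exposes:

```python
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def read_latest_readings():
    """Hypothetical SCADA call; replace with the real API of your system."""
    return [{"meter_id": "M-001", "value_m3": 12.4, "timestamp": time.time()}]

while True:
    for reading in read_latest_readings():
        # Keying by meter id keeps each meter's readings in order within a partition.
        producer.send("meter-readings", key=reading["meter_id"].encode(), value=reading)
    producer.flush()
    time.sleep(60)  # poll the SCADA system once per minute
```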

Then we configure a consumer that is subscribed to this message queue, so when new data arrives it pushes it directly into the financial system.
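The consumer side could then look like this sketch, where push_to_financial_system() is a placeholder for whatever import API the finance system offers and the group_id is arbitrary:

```python
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "meter-readings",
    bootstrap_servers="localhost:9092",
    group_id="finance-sync",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

def push_to_financial_system(reading):
    """Hypothetical call into the financial system's import interface."""
    print("billing", reading["meter_id"], reading["value_m3"])

# The loop blocks and yields messages as they arrive, so each reading is
# forwarded to the financial system as soon as it is published.
for message in consumer:
    push_to_financial_system(message.value)
```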

But what if we also want to use these meter readings in our reporting tool? We just need to configure another consumer on the same message queue (consumers read queued messages independently), so the meter readings are also pushed to our reporting tool's database in real time.
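Because consumers that belong to different consumer groups each receive their own copy of the stream, the reporting pipeline is simply another consumer on the same topic. In this sketch the group_id and load_into_reporting_db() are again placeholders:

```python
import json

from kafka import KafkaConsumer

def load_into_reporting_db(reading):
    """Hypothetical write into the reporting tool's database."""
    print("reporting", reading)

# A second, independent consumer group: it also gets every meter reading,
# without interfering with the finance consumer above.
reporting_consumer = KafkaConsumer(
    "meter-readings",
    bootstrap_servers="localhost:9092",
    group_id="reporting",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in reporting_consumer:
    load_into_reporting_db(message.value)
```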

Using this approach, we can create multiple queues with different topics (meter readings, sensor values, asset metadata updates, etc.) where producers (SCADA systems, sensors, external data sources, etc.) share data and consumers (BI and reporting tools, data aggregators, asset management systems, etc.) are notified instantly whenever and wherever new information becomes available. The combinations are limitless.
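Creating those topics can itself be scripted. A sketch using kafka-python's admin client, with illustrative topic names and partition counts:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics(new_topics=[
    NewTopic(name="meter-readings", num_partitions=3, replication_factor=1),
    NewTopic(name="sensor-values", num_partitions=3, replication_factor=1),
    NewTopic(name="asset-metadata-updates", num_partitions=1, replication_factor=1),
])
```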

At ADASA we have tested this technology and the results are promising. Simple ideas like this one, imported from the biggest technology trends and players, can also be applied to our IT architectures and bring us closer to our final goal of becoming data-driven organisations.

Presentation video recording on YouTube:

Presentation slides on Speakerdeck: