A brief background of Kafka
Kafka was created at LinkedIn in 2008 and, in early 2011, handed over to the Apache Software Foundation as a highly scalable messaging system. Written in Java and Scala, Kafka is a fault-tolerant event streaming platform designed for the continuous streaming of data in distributed real-time applications and data pipelines.
Consumers subscribe to topics and process the streams of records that data sources, called producers, publish to those topics.
How does Apache Kafka work?
Kafka is a software platform on which applications can define any number of topics. A Kafka topic can be thought of as a category or feed name to which records are stored and published.
The platform processes data as a continuous stream across parallel consumers: an application can process each record as soon as it arrives, instead of waiting for the output of a previous record. The data itself is stored in a durable, fault-tolerant way.
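As an illustration of this parallelism, here is a minimal Python sketch (a conceptual model, not the Kafka API) of how a consumer group might divide a topic's partitions among its members, so that each partition is processed by exactly one consumer and the group as a whole works in parallel:

```python
def assign_partitions(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Round-robin assignment: each partition goes to exactly one consumer
    in the group, so different consumers process different partitions
    in parallel (roughly what Kafka's group coordinator arranges)."""
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment
```

With four partitions and two consumers, each consumer ends up reading two partitions, which caps useful group parallelism at the partition count.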
In simple terms, Kafka producer applications write data to topics, and consumer applications read from those topics. A producer application connects to the platform and transfers records onto a topic. Records have the following properties:
- Records are byte arrays, so they can carry objects in any format.
- Records are stored in partitions and identified by unique offsets, enabling parallel access by consumers.
- Records have four attributes: a key, a value, a timestamp, and headers. The value carries the payload and is required; the key, timestamp, and headers are optional. When a key is present, it determines which partition the record lands in.
Why Apache Kafka?
• Kafka maintains fault tolerance: if a consumer fails to process records due to a backend failure, it can rewind and reprocess the data.
• It can process millions of records per second.
• It is scalable and high-performance, with latencies that can be under 10 ms.
• It can solve complex problems in a data-sharing, multi-application environment.
• It acts as middleware between producer and consumer applications.
• It combines data persistence with batch-like capabilities, allowing it to work as an ETL tool.
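The reprocessing ability in the first bullet comes from consumers tracking their own position (offset) in a durable log, rather than the broker deleting messages on delivery. A toy Python sketch of that idea (not the Kafka consumer API):

```python
class SimpleConsumer:
    """Reads from one partition's log and remembers its own offset.
    Because the log is durable, the consumer can rewind (seek) and
    reprocess records after a failure."""
    def __init__(self, log: list):
        self.log = log
        self.position = 0

    def poll(self) -> list:
        # Return everything from the current position to the log's end.
        records = self.log[self.position:]
        self.position = len(self.log)
        return records

    def seek(self, offset: int) -> None:
        # Rewind (or skip ahead) to reprocess from a chosen offset.
        self.position = offset
```

After a crash, a consumer that had committed offset 1 can simply `seek(1)` and pick up where it left off, rereading anything it had not finished processing.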
Use cases of Apache Kafka
Originally, Kafka was designed to track website activity. It can be used to create a user-activity pipeline, where an application tracks all the actions a user takes, such as uploads, page views, and searches.
Activity tracking often involves very high volumes, with many messages generated for each page view. The tracking can be extended to record and analyze the traffic flowing out of the APIs supporting the web activity. Combined, this data can be used to analyze customer behavior and further enhance the web solution.
Expanding on the web activity use case, Kafka can be used to capture the purchase interests of consumers based on their activity. Information such as product views, review reads on a typical e-commerce site, and hovers over advertisements can be collected in the platform. This can be further expanded by incorporating records from other sources, such as social media platforms.
Kafka can help maintain and keep track of all these topics, enabling consumer applications to pull back “similar” or “interest” product data. Consumer applications can use this to dynamically optimize what is presented to the user, filtering the information based on their behavior or attributes.
Kafka can also capture all the changes to an application’s state by recording them as a sequence of records. By monitoring servers, trigger alarms can be raised on significant, sudden changes in usage or on system faults; the input can include data from server agents and the server syslog. With Kafka Streams, alarms can be triggered by joining topics. This approach can be used to build a centralized logging system, with alarms and notifications made accessible to multiple downstream systems in a standard format.
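As a rough illustration of the alarm idea (a plain-Python stand-in, not an actual Kafka Streams topology), the function below scans a stream of server metrics and flags any sudden spike against the previous reading for the same host:

```python
def raise_alarms(metrics: list[dict], threshold: float) -> list[str]:
    """Emit an alarm whenever a host's CPU reading jumps by more than
    `threshold` since its previous reading. A stand-in for a streams job
    that watches monitoring topics and alerts on sudden changes."""
    alarms: list[str] = []
    previous: dict[str, float] = {}
    for m in metrics:
        host, value = m["host"], m["cpu"]
        if host in previous and value - previous[host] > threshold:
            alarms.append(f"{host}: cpu jumped {previous[host]} -> {value}")
        previous[host] = value
    return alarms
```

In a real deployment the alarms themselves could be written to another topic, making them available to any downstream notification system.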
Kafka plays an important role as a tool for reliably ingesting, communicating, and moving large amounts of data between the various elements of IT systems. It provides a scalable platform used by over 100,000 organizations globally, and a vibrant community keeps expanding its use cases and feature set, making it a solid choice for building high-performance, real-time data pipelines.
How does Kafka hold up in critical high-volume situations such as MLOps? Check back soon, when we describe how to use Kafka to process millions of documents through successive transformations in real time!