In our previous blog post on Apache Kafka, we introduced the product and discussed it at a high level. In this post we take a deeper dive and look in more detail at how Kafka can be used to process millions of documents in real time through a machine learning data analytics pipeline.
Case Study Overview
Today’s case study will consist of a data analytics pipeline in which we would like to:
- process a real-time stream of documents
- annotate them with various processors
- write these annotated documents to an Elastic instance
The goal of a system like this is to provide a way to make inferences from a single new instance of a streamed document, based on the history of the previously processed documents.
Many examples of documents could be used for this including: patent applications, resumes, news articles, survey responses, and any other document from which meaningful content can be extracted.
In our simple example today, we will process each document to identify:
- the organizations mentioned
- the identifiable people mentioned
- the topics discussed
The goal is to perform analytics on these documents such that we can create a model to predict correlations between organizations, people, and topics.
To achieve this, we will perform the following activities on each document:
- Documents that do not meet a certain set of criteria will be discarded. These could be empty documents, documents in a foreign language, or documents that are inappropriate in any number of other ways.
- Documents that can be identified as duplicates of previously processed documents will be discarded.
- Names found within the document that can be correlated with a set of known identified entities will be extracted and noted as ‘organizations’ or ‘persons’.
- Instances of previously identified words and phrases will be extracted and noted as ‘topics’.
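These per-document activities can be sketched as plain functions. This is a minimal, hypothetical model: the reference sets and field names are invented for illustration, and a real pipeline would use proper language detection, trained NER models, and a shared deduplication store rather than in-process state.

```python
import hashlib

# Hypothetical reference data; a real pipeline would load these from a store.
KNOWN_ORGS = {"Indellient", "Apache Software Foundation"}
KNOWN_PERSONS = {"Ada Lovelace"}
KNOWN_TOPICS = {"machine learning", "streaming"}

def passes_filter(doc: dict) -> bool:
    """Discard empty or otherwise unusable documents."""
    text = doc.get("text", "").strip()
    return bool(text) and doc.get("lang", "en") == "en"

def fingerprint(doc: dict) -> str:
    """Stable hash used to detect duplicates of previously seen documents."""
    return hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()

def annotate(doc: dict) -> dict:
    """Tag known organizations, persons, and topics found in the text."""
    text = doc["text"]
    doc["organizations"] = sorted(o for o in KNOWN_ORGS if o in text)
    doc["persons"] = sorted(p for p in KNOWN_PERSONS if p in text)
    doc["topics"] = sorted(t for t in KNOWN_TOPICS if t in text.lower())
    return doc

seen: set = set()  # stands in for a shared dedup store

def process(doc: dict):
    """Run one document through filter -> dedup -> annotate; None means discarded."""
    if not passes_filter(doc):
        return None
    fp = fingerprint(doc)
    if fp in seen:
        return None
    seen.add(fp)
    return annotate(doc)
```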
Before we get into the details of the implementation, we need to revisit the significant features of Kafka that we will be relying on.
**High Throughput and Low Latency:** What a powerful combination this is! A Kafka installation can handle a significant number of messages extremely quickly, and newly ingested messages are available to consumers within milliseconds.

**Fault Tolerance/Durability:** Another powerful combination. Once the producer has successfully written a message to the Kafka broker, it is guaranteed not to be lost. Even if an instance crashes, the messages are safe. Brokers can come and go, either intentionally or otherwise, without losing messages.

**Multiple Simultaneous Clients:** Consumers can be grouped so that one set of rules applies to all members. Many different consumer groups can also read from the same or different Kafka topics simultaneously, and each group is treated separately from the others.
To enable multi-processing, the topic must be configured with multiple partitions; within a consumer group, each partition is read by exactly one member.
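As a rough model of that rule, a round-robin assignment guarantees each partition lands on exactly one consumer in the group. This is a toy sketch, not Kafka's actual assignor (which also supports range and sticky strategies):

```python
def assign_partitions(partitions: list, consumers: list) -> dict:
    """Toy round-robin model of Kafka's consumer-group rule: within a group,
    each partition is read by exactly one consumer; a consumer may own
    several partitions, or none if the group outnumbers the partitions."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment
```

With four partitions, two consumers each own two; six consumers would leave two idle, which is why the partition count sets the ceiling on useful parallelism.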
**Numerous Supported Languages:** I am not even going to try to list all of the languages for which Kafka bindings are available. In poking around the internet, I found at least 20 supported bindings.
At Indellient, we have client applications written in C++, Java, Python and Ruby to name just four. Whichever language your consumer and producer applications are written in, you can interact directly with Kafka brokers and receive the benefits described above.
Your existing applications can be adapted to interact directly with Kafka. Do you have an application that reads data from a DB and performs transformations before writing it back? Repackage it as a Kafka consumer and producer and the existing processing code need not change.
Each step has three essential Kafka-specific elements:
- **Topic:** For our purposes we will assign a single topic to each step, with a descriptive name such as ‘Filtered Documents’.
- **Consumer:** The consumer reads from the topic written to by the previous step’s producer.
- **Producer:** The producer writes the resulting documents to the step’s configured topic.
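A single step can be wired together along these lines. This sketch uses the third-party kafka-python package with hypothetical topic and group names; `transform()` stands in for the step's real work:

```python
import json

def transform(doc: dict) -> dict:
    """Stand-in for the step's real work (filtering, dedup, annotation, ...)."""
    doc["processed"] = True
    return doc

def run_step(in_topic: str, out_topic: str, servers: str = "localhost:9092") -> None:
    """Consume documents from the previous step's topic, process each one,
    and produce the result to this step's topic."""
    # Imported here so transform() stays usable without a broker or the package.
    from kafka import KafkaConsumer, KafkaProducer

    consumer = KafkaConsumer(
        in_topic,
        bootstrap_servers=servers,
        group_id="filter-step",  # hypothetical consumer-group name
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    producer = KafkaProducer(
        bootstrap_servers=servers,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    for message in consumer:
        producer.send(out_topic, transform(message.value))
```

Against a reachable broker, the filtering step would then be started with something like `run_step("Ingested Documents", "Filtered Documents")`.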
In our example, the consumer and producer code are contained within the same application most of the time. The exceptions are the ingestion and storage steps: if ingestion comes not from a Kafka topic but from a database or filesystem, that step is a producer only, and since our final step writes to an Elastic database, it is a consumer only. The actual work of a step can be performed directly within the application, or packaged as a remotely deployed endpoint.
Packaging the work as an endpoint allows for the utmost flexibility. It permits our pipeline to use pre-existing tools to perform particular steps, tools which undoubtedly can be, and are, used in other contexts. It is also a design pattern that can be used to build multiple similar pipelines from different sets of shared steps.
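One way a step might call such an endpoint, with the HTTP client injected so the transport can be stubbed out in tests. The URL and payload shape here are illustrative, not a real API:

```python
from typing import Any, Callable

def annotate_via_endpoint(doc: dict, url: str, post: Callable[..., Any]) -> dict:
    """Send a document to a remotely deployed processor and return the
    annotated result. `post` is call-compatible with requests.post, so
    production code can pass that in while tests pass a stub."""
    response = post(url, json=doc)
    response.raise_for_status()  # surface HTTP errors instead of bad data
    return response.json()
```

In production this would be called as `annotate_via_endpoint(doc, annotator_url, requests.post)`, keeping the step itself ignorant of which tool sits behind the endpoint.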
Tuning the Steps
Invariably, some steps will take longer to perform than others. If every step can keep pace with the step before it, the pipeline will develop no lag. But a single slow step in a serialized pipeline will doom it: the backlog of inbound documents will grow and never be depleted.
To address this, we use Kafka's ability to reliably manage multiple consumers in a group. The key is to compute a reasonable throughput estimate for each step, then determine the number of simultaneous threads needed to keep documents moving through that step at the same pace as the rest of the pipeline. We use this number as the minimum number of partitions for the step's source topic.
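The arithmetic behind that sizing is simple. A hedged sketch, where both rates would come from measurement rather than guesswork:

```python
import math

def min_partitions(inbound_rate: float, per_consumer_rate: float) -> int:
    """Minimum number of parallel consumers -- and therefore source-topic
    partitions -- needed for a step to keep up with its inbound documents.

    inbound_rate: documents/second arriving from the previous step
    per_consumer_rate: documents/second one consumer thread can process
    """
    return max(1, math.ceil(inbound_rate / per_consumer_rate))
```

For example, a step receiving 1,000 documents per second whose processor handles only 150 per thread needs at least 7 partitions on its source topic.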
We then have the choice of building our processor as a multithreaded application or running multiple instances of the single threaded consumer. As we most often are deploying into a Kubernetes cluster, we can leverage that and scale up our processing step such that we have sufficient throughput.
If we scale the number of our processors to be more than the number of partitions, then we will simply have some idle consumers. Nothing will break, but resources will be wasted.
If we scale the number of our processors to be fewer than the number of partitions, some processors will handle more partitions than others and will therefore carry a heavier share of the load.
Having the same number of processors as partitions is the most efficient implementation.
A Final Thought
Kafka has many, and exceptionally different, use cases, ranging from millions of simple consumers receiving messages from one or more topics to a few consumer groups performing complicated transformations in parallel. The guarantee that every document is processed once and only once makes it a great choice for high-volume, computationally intensive pipelines.
For our pipeline, which processes millions of documents through processors of varying complexity and throughput, Kafka’s flexibility and stability made it a very strong choice.
In an upcoming blog entry, we will look at Apache Airflow and examine how it can be used to control a data analytics pipeline. In the meantime, if you have any questions or would like to discuss how we could build a pipeline to process whatever documents you have, please reach out!
Custom Cloud Applications
Indellient takes a customer-first approach to help you build a modern cloud strategy on Amazon Web Services, Microsoft Azure and Google Cloud Platform. Our team can help you build, replatform, migrate and integrate applications, so you can benefit from the scalability, agility, and performance available through cloud technologies.