An Introduction to Apache Kafka

Elizabeth Mabishi

Kafka's history

Before we dive deep into how Kafka works and get our hands dirty, here's a little backstory.

Kafka is named after the acclaimed German-language writer Franz Kafka. It was created at LinkedIn out of the growing need for a fault-tolerant, redundant way to handle its connected systems and ever-growing pool of data.

What is Kafka?

Briefly put, Apache Kafka is a distributed streaming platform.

Let's take a deeper look at what this means:

  1. Distribution: Generally, in software architecture, distribution describes the extent to which the components of a system can operate autonomously while working toward a common goal. Here are a few defining characteristics of a distributed system:
    • Several computational entities, workers, nodes or components working to achieve a singular goal.
    • Segregation of work between components in the system.
    • Concurrency of components: Each component performs operations in its own order, independent of the order of operations of other components in the system.
    • Lack of a global clock: The functions of the components in the system are not synchronised.
    • Independent failure: If a component in the system fails, then its failure should not affect other components.
    • Communication between the nodes of the system is achieved through message passing.
  2. Streaming: This refers to the real-time, continuous nature of the storage, processing and retrieval of Kafka messages.

In this guide, we'll see how Kafka achieves this.

Microservices Architecture

Before we discuss how Kafka works, I think a good place to start would be to explain the context in which Kafka functions.

With increasing frequency, the microservices architecture is becoming an indispensable paradigm for software engineering and development. As applications grow in scale, that is, as the data they consume, process and output increases, it becomes ever more important to find fault-tolerant, scalable ways to manage both the systems and the data they handle.

As the name suggests, a microservice is a piece of software that serves a singular purpose and works with other system components to perform a task.

It is here that Kafka shines. It lends itself well to enabling communication between microservices in a system through the passing of messages between them.

Kafka components

Kafka achieves messaging through a publish and subscribe system facilitated by the following components:

Topic

A topic is essentially a collection of messages; topics are how Kafka stores and organises messages across its system. Visualise them as logs in which Kafka stores messages. Topics can be replicated (copies are created) and partitioned (divided). The ability to replicate and partition topics is one of the factors behind Kafka's fault tolerance and scalability.

Producer

The producer publishes messages to a Kafka topic.

Consumer

The consumer subscribes to one or more topics, then reads and processes the messages in them.

Broker

Brokers are the servers that manage the storage of messages in topics. Several brokers working together form a Kafka cluster.

Zookeeper

Kafka uses zookeeper to provide the brokers with metadata about the processes running in the system and to facilitate health checking and broker leadership election.

Prerequisites

You'll need Java for this, so go ahead and download the JDK from here.

Download the Kafka release from here and unzip it to a directory of your choice. As of the writing of this article, the current release version is 0.10.1.0.

Doing this using the tar utility is trivial:

tar -xzf kafka_2.11-0.10.1.0.tgz

You can also download Kafka using Homebrew on macOS, but here, we'll use the method above in order to expose the directory structure of Kafka and give us easy access to its files.

Demonstration

Starting zookeeper

Let's begin by starting up zookeeper. Run the following command at the root of the uncompressed folder:

bin/zookeeper-server-start.sh config/zookeeper.properties

The indication that zookeeper has come alive is a stream of information output to your terminal window.
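By default, zookeeper listens for client connections on port 2181; this comes from the config/zookeeper.properties file we passed to the start script. The shipped defaults look roughly like this (the exact contents may vary slightly between releases):

# config/zookeeper.properties
dataDir=/tmp/zookeeper
clientPort=2181
maxClientCnxns=0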

In another shell, let's start a Kafka broker like this:

bin/kafka-server-start.sh config/server.properties

Of note is the fact that we can create multiple Kafka brokers simply by copying the server.properties file and making a few modifications to the values in the following fields, which must be unique to each broker (a sample copy follows this list):

  1. broker.id
  2. listeners: The address and port the broker listens on. The first broker was started at localhost:9092, so each additional broker on the same machine needs a different port.
  3. log.dirs: The physical location where each broker will store its messages.
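For example, a copy of server.properties saved as config/server-1.properties might differ from the original in just these three lines (the file name, broker id, port and log directory here are only illustrative):

broker.id=1
listeners=PLAINTEXT://localhost:9093
log.dirs=/tmp/kafka-logs-1

Starting the second broker is then just a matter of pointing the start script at the new file:

bin/kafka-server-start.sh config/server-1.properties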

Creating a Kafka topic

In another shell, create a Kafka topic called my-kafka-topic like this:

bin/kafka-topics.sh --create --topic my-kafka-topic --zookeeper localhost:2181 --partitions 1 --replication-factor 1

You should receive confirmation that the topic has been created. A few points of note:

  • We invoke zookeeper because it manages metadata about the Kafka brokers and the cluster, and therefore needs to know when topics are created.
  • The partition count and replication factor can be changed depending on how many partitions and topic replicas you require. We set both to 1 here; the defaults Kafka uses when it creates topics automatically live in the config/server.properties file (for example num.partitions).
  • The number of replicas must be equal to or less than the number of brokers currently operational.

If you now look at the zookeeper stream we began earlier, you'll notice that the broker has registered our newly created topic.
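You can also ask Kafka to describe the topic directly:

bin/kafka-topics.sh --describe --topic my-kafka-topic --zookeeper localhost:2181

The output lists the partition count, the replication factor and, for each partition, which broker is the leader and which replicas are in sync, all of which should reflect the single partition and single replica we asked for.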

Creating a Kafka Producer

In another shell, let's create a Kafka producer with this:

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic my-kafka-topic

A few points of note:

  • We point to the broker we started, listening at localhost:9092, because it manages the storage of messages in topics. More information about this port can be found in the config/server.properties file.
  • We will produce messages to the newly created topic my-kafka-topic.

Creating a Kafka Consumer

In yet another shell, run this to start a Kafka consumer:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic my-kafka-topic --from-beginning

Now the real fun begins.

In the producer shell, type some messages. Press enter after each one and watch what happens in the shell you started the consumer in.

Drumroll please!

The messages from the producer are appearing in the consumer shell. I must admit, the first time I saw this at work, I was quite impressed.
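For instance, typing the two lines below into the producer shell should make the very same lines appear, almost immediately, in the consumer shell:

Hello Kafka
This is my second message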

After you catch your breath, let's do a little snooping to find out how Kafka achieves this.


The broker keeps its messages on disk, by default under /tmp/kafka-logs. Take a look at the log for our topic by running:

cd /tmp/kafka-logs/my-kafka-topic-0
cat 00000000000000000000.log

Here, we find our produced and consumed messages. This is why Kafka is often called a distributed commit log. It functions as an immutable record of messages.

Of note is the fact that you can dictate where on disk the broker saves its messages. This setting can be found in the config/server.properties file.
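The relevant line in the default configuration is shown below; point it at any directory you prefer before starting the broker:

# config/server.properties
log.dirs=/tmp/kafka-logs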

Our first Apache Kafka application

In this section, we'll create an Apache Kafka producer in Python and a Kafka consumer in JavaScript.

Prerequisites

We'll need a few things:

  1. node.js
  2. virtualenv

Confirm you've installed both correctly using:

node --version && virtualenv --version

You should see similar results

v6.2.2
15.0.3

Directory Structure

Next, create the following folder structure.

├── scotch-kafka
    ├── producer
         ├── producer.py
    ├── consumer
         ├── consumer.js

Make sure you've started zookeeper and a broker as above.

Python producer

Make a virtual environment and, while inside it, install the kafka-python module by running:

pip install kafka-python
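If you haven't worked with virtualenv before, a typical sequence for creating and activating an environment looks like this (the environment name venv is just an example):

virtualenv venv
source venv/bin/activate
pip install kafka-python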

Write the following into the producer.py file:

from kafka import KafkaProducer
import json

# Create an instance of the Kafka producer
producer = KafkaProducer(bootstrap_servers='localhost:9092',
                         value_serializer=lambda v: json.dumps(v).encode('utf-8'))

# Call the producer.send method with a producer record
for message in range(5):
    producer.send('kafka-python-topic', {'values': message})

# send() is asynchronous, so flush() blocks until every buffered record has been delivered
producer.flush()

Let's break down the producer instantiation.

A few items are required for Kafka to create an instance of a producer. These are usually called producer configuration properties.

  1. A list of brokers to connect to: given here as bootstrap_servers.
  2. Key and value serialisers: These properties indicate to the producer how to turn the messages it will publish into bytes. If no serialiser is supplied, kafka-python expects to be handed raw bytes. In the producer above, we're specifically choosing a JSON serialisation format for the values.

Of note is that a Kafka producer instance can only send records whose keys and values match the serialisers it was configured with.

To send the messages, referred to in Kafka terminology as producer records, we call the producer.send function and supply at minimum a topic and a value.

Some optional values we could provide are listed below, with a short sketch after the list:

  1. Partition: Refers to the partition within the topic the message is to be published to.
  2. Timestamp: The time recorded with the message. Whether the broker keeps this producer-supplied time or replaces it with its own append time is controlled by the log.message.timestamp.type setting in the server.properties file.
  3. Key: In conjunction with the partition value, this is used to determine how and to which partition in a topic the Kafka producer will send a message.
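To make these optional values concrete, here is a minimal sketch of a send call that supplies a key, a partition and a timestamp explicitly. The key_serializer, the sensor-1 key and the payload are purely illustrative additions; the topic is the same kafka-python-topic used in this section.

from kafka import KafkaProducer
import json
import time

# key_serializer is added here so that string keys can be encoded to bytes
producer = KafkaProducer(bootstrap_servers='localhost:9092',
                         key_serializer=lambda k: k.encode('utf-8'),
                         value_serializer=lambda v: json.dumps(v).encode('utf-8'))

# Publish one record with an explicit key, partition and timestamp (in milliseconds)
producer.send('kafka-python-topic',
              value={'values': 42},
              key='sensor-1',
              partition=0,
              timestamp_ms=int(time.time() * 1000))

# Block until the buffered record has actually been delivered
producer.flush()

Records that share a key always land in the same partition, which is what makes keys useful for preserving per-key ordering.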

Note that the topic we're using has the name kafka-python-topic, so you'll have to create a topic of the same name.

bin/kafka-topics.sh --create --topic kafka-python-topic --zookeeper localhost:2181 --partitions 1 --replication-factor 1

JavaScript Consumer

In the scotch-kafka folder, run:

npm install no-kafka

Write the following into the consumer.js file.

const Kafka = require('no-kafka');

// Create an instance of the Kafka consumer
const consumer = new Kafka.SimpleConsumer();

// Handler called with every batch of messages fetched from the topic
var data = function (messageSet, topic, partition) {
    messageSet.forEach(function (m) {
        // Message values arrive as Buffers, so convert them back to strings
        console.log(topic, partition, m.offset, m.message.value.toString('utf8'));
    });
};

// Connect to the cluster, then subscribe to the Kafka topic
return consumer.init().then(function () {
    return consumer.subscribe('kafka-python-topic', data);
});

In order to create an instance of a consumer, Kafka requires the following:

  1. A list of brokers to connect to. Here, the no-kafka module falls back to a default address of localhost:9092.
  2. Key and value deserialisers: here, each message value arrives as a raw Buffer, which we convert back into the UTF-8 JSON string the producer sent.

At the root of the project, run the following to start the consumer:

node consumer/consumer.js

and in another shell,

python producer/producer.py

The consumer output should look something like this:

kafka-python-topic 0 216 {"values": 0}
kafka-python-topic 0 217 {"values": 1}
kafka-python-topic 0 218 {"values": 2}
kafka-python-topic 0 219 {"values": 3}
kafka-python-topic 0 220 {"values": 4}

Conclusion

Congratulations, you've just built your first application using Apache Kafka. No mean feat at all.

You'll find all the code we've used here.

I hope this tutorial will act as a foundation from which to learn more about Kafka, its use cases and the many advantages it has over traditional messaging systems.

Have a minute? Give me some feedback. Drop me a line in the comment box below.
