

I wrote a blog post about how LinkedIn uses Apache Kafka as a central publish-subscribe log for integrating data between applications, stream processing, and Hadoop data ingestion. To actually make this work, though, this "universal log" has to be a cheap abstraction. If you want to use a system as a central data hub it has to be fast, predictable, and easy to scale so you can dump all your data onto it. My experience has been that systems that are fragile or expensive inevitably develop a wall of protective process to prevent people from using them; a system that scales easily often ends up as a key architectural building block just because using it is the easiest way to get things built.

I've always liked the benchmarks of Cassandra that show it doing a million writes per second on three hundred machines on EC2 and Google Compute Engine. I'm not sure why, maybe it is a Dr. Evil thing, but doing a million of anything per second is fun.

In any case, one of the nice things about a Kafka log is that, as we'll see, it is cheap. A million writes per second isn't a particularly big thing. This is because a log is a much simpler thing than a database or key-value store. Indeed our production clusters take tens of millions of reads and writes per second all day long and they do so on pretty modest hardware.

But let's do some benchmarking and take a look.

To help understand the benchmark, let me give a quick review of what Kafka is and a few details about how it works. Kafka is a distributed messaging system originally built at LinkedIn and now part of the Apache Software Foundation and used by a variety of companies.

Producers send records to the cluster, which holds on to these records and hands them out to consumers:

This picture shows a producer process appending to the logs for the two partitions, and a consumer reading from the same logs. Each record in the log has an associated entry number that we call the offset. This offset is used by the consumer to describe its position in each of the logs. These partitions are spread across a cluster of machines, allowing a topic to hold more data than can fit on any one machine.

Note that unlike most messaging systems the log is always persistent. Messages are immediately written to the filesystem when they are received. Messages are not deleted when they are read but retained with some configurable SLA (say a few days or a week). This allows usage in situations where the consumer of data may need to reload data. It also makes it possible to support space-efficient publish-subscribe, as there is a single shared log no matter how many consumers; in traditional messaging systems there is usually a queue per consumer, so adding a consumer doubles your data size. This makes Kafka a good fit for things outside the bounds of normal messaging systems, such as acting as a pipeline for offline data systems such as Hadoop. These offline systems may load only at intervals as part of a periodic ETL cycle, or may go down for several hours for maintenance, during which time Kafka is able to buffer even TBs of unconsumed data if needed.
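The log abstraction described above — append-only partitions, a per-record offset, consumers that track their own position, and a single shared log for any number of subscribers — can be sketched in a few lines. This is a toy in-memory model to make the mechanics concrete, not the real Kafka API; all class and method names here are illustrative.

```python
# Toy in-memory model of Kafka's partitioned-log abstraction.
# Illustrative only -- not the real Kafka API.

class Partition:
    """An append-only log; each record gets a monotonically increasing offset."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # the new record's offset

    def read(self, offset, max_records=100):
        # Reading never deletes anything; it just slices from an offset.
        return self.records[offset:offset + max_records]


class Topic:
    """A topic is a set of partitions; records are sharded by key."""
    def __init__(self, num_partitions):
        self.partitions = [Partition() for _ in range(num_partitions)]

    def produce(self, key, value):
        p = hash(key) % len(self.partitions)
        return p, self.partitions[p].append((key, value))


class Consumer:
    """A consumer is just a set of per-partition positions over the shared
    log, so adding another consumer adds no data -- only more offsets."""
    def __init__(self, topic):
        self.topic = topic
        self.positions = [0] * len(topic.partitions)

    def poll(self):
        out = []
        for i, part in enumerate(self.topic.partitions):
            batch = part.read(self.positions[i])
            self.positions[i] += len(batch)
            out.extend(batch)
        return out


topic = Topic(num_partitions=2)
for i in range(4):
    topic.produce(key=f"user-{i}", value=f"event-{i}")

fast, slow = Consumer(topic), Consumer(topic)
print(len(fast.poll()))  # 4: sees every record
print(len(fast.poll()))  # 0: caught up
print(len(slow.poll()))  # 4: independent position over the same shared log
```

Real Kafka partitions are persistent, replicated files on disk rather than Python lists, but the offset mechanics are the same shape — which is also why reloading old data is just a matter of rewinding a consumer's position.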
