This demo shows users how to monitor Kafka streaming ETL deployments using Confluent Control Center.
The use case is a streaming pipeline built around live edits to real Wikipedia pages. The Wikimedia Foundation has IRC channels that publish edits happening to real wiki pages (e.g. #en.wikipedia, #en.wiktionary) in real time. Using Kafka Connect, a Kafka source connector kafka-connect-irc streams raw messages from these IRC channels, and a custom Kafka Connect transform kafka-connect-transform-wikiedit transforms these messages before they are written to Kafka. This demo uses KSQL for data enrichment, or you can optionally develop and run your own Kafka Streams application. A Kafka sink connector kafka-connect-elasticsearch then streams the data out of Kafka, applying another custom Kafka Connect transform called NullFilter. The data is materialized into Elasticsearch for analysis by Kibana.
Note: this is a Docker environment and has all services running on one host. Do not use this demo in production; it exists only to make it easy to demo the Confluent Platform. In production, Confluent Control Center should be deployed with a valid license and with its own dedicated metrics cluster, separate from the cluster with production traffic. Using a dedicated metrics cluster is more resilient because it continues to provide system health monitoring even if the production traffic cluster experiences issues.
- Since this repository uses submodules, git clone with the --recursive option:
$ git clone --recursive https://github.com/confluentinc/cp-demo
Otherwise, git clone and then use the submodule commands to initialize and update:
$ git clone https://github.com/confluentinc/cp-demo
$ cd cp-demo
$ git submodule init
Submodule 'kafka-connect-irc' (https://github.com/cjmatta/kafka-connect-irc) registered for path 'kafka-connect-irc'
Submodule 'kafka-connect-transform-wikiedit' (https://github.com/cjmatta/kafka-connect-transform-wikiedit) registered for path 'kafka-connect-transform-wikiedit'
$ git submodule update
- In the advanced Docker preferences settings, increase the memory available to Docker to at least 8GB (default is 2GB).
- From the cp-demo directory, run make clean all to build the IRC connector and the transformer that parses the Wikipedia edit messages into data. These are saved to the connect-plugins path, which is a volume shared with the connect Docker container.
$ make clean all
...
$ ls connect-plugins
Note: If make has a FATAL error as shown below, it means this git repo was not cloned with the submodules. Please go back to step 1 above and correct this.
[FATAL] Non-readable POM /private/tmp/cp-demo/kafka-connect-irc/pom.xml: /private/tmp/cp-demo/kafka-connect-irc/pom.xml (No such file or directory)
- Start Docker Compose. It will take about 2 minutes for all containers to start and for Confluent Control Center GUI to be ready.
$ docker-compose up -d
- Verify that all Docker containers show the "Up" state, except for the kafka-client container, which is expected to have the "Exit 0" state. If any containers are not up, verify in the advanced Docker preferences settings that the memory available to Docker is at least 8GB (default is 2GB).
$ docker-compose ps
Name                      Command                          State    Ports
------------------------------------------------------------------------------------------------------------------------------
cpdemo_connect_1          /etc/confluent/docker/run        Up       0.0.0.0:8083->8083/tcp, 9092/tcp
cpdemo_control-center_1   /etc/confluent/docker/run        Up       0.0.0.0:9021->9021/tcp
cpdemo_elasticsearch_1    /bin/bash bin/es-docker          Up       0.0.0.0:9200->9200/tcp, 0.0.0.0:9300->9300/tcp
cpdemo_kafka-client_1     bash -c echo Waiting for K ...   Exit 0
cpdemo_kafka1_1           /etc/confluent/docker/run        Up       0.0.0.0:29092->29092/tcp, 0.0.0.0:9092->9092/tcp
cpdemo_kafka2_1           /etc/confluent/docker/run        Up       0.0.0.0:29093->29093/tcp, 9092/tcp, 0.0.0.0:9093->9093/tcp
cpdemo_kibana_1           /bin/sh -c /usr/local/bin/ ...   Up       0.0.0.0:5601->5601/tcp
cpdemo_ksql-cli_1         perl -e while(1){ sleep 99 ...   Up       0.0.0.0:9098->9098/tcp
cpdemo_schemaregistry_1   /etc/confluent/docker/run        Up       0.0.0.0:8081->8081/tcp
cpdemo_zookeeper_1        /etc/confluent/docker/run        Up       0.0.0.0:2181->2181/tcp, 2888/tcp, 3888/tcp
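The state check can also be scripted rather than read by eye. The following is a minimal sketch and not part of the demo repository: the function name containers_not_ready is hypothetical, and it assumes that container names start with cpdemo_ and that the State column reads Up or Exit 0 as in the listing above.

```shell
# Hypothetical helper: reads `docker-compose ps` output on stdin and prints
# the name of every container that is not in its expected state.
# Assumption: container names start with "cpdemo_", as in the listing above.
containers_not_ready() {
  awk '
    /^cpdemo_/ {
      name = $1
      if (name ~ /kafka-client/) {
        # kafka-client is expected to have exited cleanly.
        if ($0 !~ / Exit 0( |$)/) print name
      } else if ($0 !~ / Up( |$)/) {
        # Every other container is expected to be Up.
        print name
      }
    }
  '
}
```

For example, docker-compose ps | containers_not_ready should print nothing when every container is in its expected state.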
- Wait until Confluent Control Center is fully running. It is ready when the logs show the following event:
$ docker-compose logs -f control-center | grep -e HTTP
control-center_1 | [2017-09-06 16:37:33,133] INFO Started NetworkTrafficServerConnector@26a529dc{HTTP/1.1}{0.0.0.0:9021} (org.eclipse.jetty.server.NetworkTrafficServerConnector)
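As an alternative to watching the logs, readiness can be polled. This is a sketch under stated assumptions: the helper name wait_until is hypothetical, curl is assumed to be installed on the host, and Control Center is assumed to answer plain HTTP requests on port 9021 once the connector in the log line above has started.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <max_attempts> <sleep_seconds> <command> [args...]
wait_until() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    # Succeed as soon as the probe command exits with status 0.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example (commented out so that sourcing this snippet has no side effects):
# wait_until 60 5 curl --silent --fail --output /dev/null http://localhost:9021 \
#   && echo "Control Center is up"
```

Keeping the probe as an argument means the same loop can wait on any other service in the demo by swapping the URL.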
- Decide how you want to run the rest of the demo, with or without KSQL. There are two ways to run the demo because KSQL does not support Avro with Schema Registry at this time. When KSQL supports Avro with Schema Registry, we will collapse the workflows into one.
# With KSQL: data streams from Wikipedia IRC to KSQL to Elasticsearch. The Kafka source and sink connectors use JSON
$ export DEMOPATH=scripts_ksql
# Without KSQL: data streams straight through Kafka from Wikipedia IRC to Elasticsearch. The Kafka source and sink connectors use Avro with Confluent Schema Registry
$ export DEMOPATH=scripts_pipeline
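Every later command in this demo expands $DEMOPATH, so an unset or mistyped value produces a confusing path error. A small guard can catch that early. This is only a sketch; the function name valid_demopath is hypothetical and not part of the demo scripts.

```shell
# Hypothetical guard: succeed only if the argument is one of the two
# workflow directories named in this demo.
valid_demopath() {
  case "$1" in
    scripts_ksql|scripts_pipeline) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (commented out so that sourcing this snippet has no side effects):
# valid_demopath "${DEMOPATH:-}" \
#   || echo "Set DEMOPATH to scripts_ksql or scripts_pipeline" >&2
```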
- Set up the cluster and connectors:
$ ./$DEMOPATH/setup.sh
- Use Google Chrome to view the Confluent Control Center GUI at http://localhost:9021. Click on the top right button that shows the current date, and change Last 4 hours to Last 30 minutes.
- View the data in the Kibana dashboard at http://localhost:5601/app/kibana#/dashboard/Wikipedia
- Monitoring --> System Health: Confluent Control Center landing page shows the overall system health of a given Kafka cluster. For capacity planning activities, view cluster utilization:
- CPU: look at network and thread pool usage, produce and fetch request latencies
- Network utilization: look at throughput per broker or per cluster
- Disk utilization: look at disk space used by all log segments, per broker
- Management --> Kafka Connect: Confluent Control Center uses the Kafka Connect API to manage Kafka connectors. The Kafka Connect Sources tab shows the connector wikipedia-irc. Click "Edit" to see the details of the connector configuration and custom transforms.
- The Kafka Connect Sinks tab shows the connector elasticsearch-ksql (or elasticsearch-pipeline if you are running without KSQL). Click "Edit" to see the details of the connector configuration and custom transforms.
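The same connector details are available directly from the Kafka Connect REST API. The sketch below assumes the API is reachable without authentication on port 8083 of the Docker host, as the port mapping in the docker-compose ps listing suggests; the helper name connector_url and the CONNECT_URL variable are hypothetical.

```shell
# Assumption: the Connect REST API listens on port 8083 of the Docker host.
CONNECT_URL="${CONNECT_URL:-http://localhost:8083}"

# Hypothetical helper: build the REST URL for a connector or one of its
# sub-resources.
# Usage: connector_url <connector_name> [config|status|pause|resume]
connector_url() {
  local name="$1" resource="${2:-}"
  if [ -n "$resource" ]; then
    echo "${CONNECT_URL}/connectors/${name}/${resource}"
  else
    echo "${CONNECT_URL}/connectors/${name}"
  fi
}

# Examples (commented out so that sourcing this snippet has no side effects):
# curl --silent "${CONNECT_URL}/connectors"                         # list connectors
# curl --silent "$(connector_url wikipedia-irc config)"             # show configuration
# curl --silent "$(connector_url elasticsearch-ksql status)"        # show connector state
# curl --silent -X PUT "$(connector_url elasticsearch-ksql pause)"  # pause the sink
```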
- Monitoring --> Data Streams --> Message Delivery: The Kafka Connect sink connector has a corresponding consumer group connect-elasticsearch-ksql consuming from the configured Kafka topic. This consumer group will be in the consumer group statistics in the stream monitoring charts.
- Management --> Topics --> Topic Information: For a given topic, click on the three dots "..." next to the topic name and click on "View details". View which brokers are leaders for which partitions and the number of consumer groups currently consuming from this topic. Click on the boxed consumer group count to select a consumer group and jump to its data streams.
- Monitoring --> Data Streams --> Message Delivery: hover over any chart to see number of messages and average latency within a minute time interval.
- Monitoring --> System Health: to identify bottlenecks, you can see a breakdown of produce and fetch latencies through the entire request lifecycle. Click on the line graph in the "Request latency" chart. The request latency values can be shown at the median, 95th, 99th, or 99.9th percentile. Depending on where the bottlenecks are, you can tune your brokers and clients appropriately.
If you ran the demo with KSQL, i.e. DEMOPATH=scripts_ksql, then there are additional things you can look at. If you did not run the demo with KSQL, skip this section.
- Run KSQL CLI to get more information on the queries, streams, and tables.
$ docker-compose exec ksql-cli ksql-cli remote http://localhost:8080
...
ksql> show queries;
ksql> describe wikipediabot;
ksql> select * from wikipediabot limit 3;
ksql> describe en_wikipedia_gt_1;
ksql> select * from en_wikipedia_gt_1 limit 3;
- Monitoring --> Data Streams --> Message Delivery: all KSQL queries are materialized in Confluent Control Center as consumer groups with names that begin with ksql_query_. To correlate these consumer groups to the actual KSQL query, note the query number and query string in the output of:
$ docker-compose exec ksql-cli ksql-cli remote http://localhost:8080 --exec "show queries;"
- Monitoring --> Data Streams --> Message Delivery: graphs for consumer groups EN_WIKIPEDIA_GT_1_COUNTS-consumer and ksql_query_5 display data at intervals instead of smoothly like the other consumer groups. This is because Confluent Control Center displays data based on message timestamps, and this particular stream of data uses a tumbling window with a window size of 5 minutes. All of its message timestamps are therefore set to the beginning of each 5-minute window, which is why the latency for these streams appears to be high. The tumbling windows are working as designed and Confluent Control Center is reporting them accurately.
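The effect of stamping each result with its window start can be worked through with a little arithmetic. The sketch below assumes windows aligned to the Unix epoch and uses hypothetical function names; the timestamp in the usage note is an arbitrary illustration, not a value taken from the demo.

```shell
# Start of the tumbling window that contains a timestamp.
# Both arguments are in milliseconds.
window_start() {
  local ts_ms="$1" size_ms="$2"
  echo $(( ts_ms - ts_ms % size_ms ))
}

# Apparent latency of a record that carries its window start as its
# timestamp but is consumed at consume_ms.
apparent_latency_ms() {
  local consume_ms="$1" size_ms="$2"
  echo $(( consume_ms - $(window_start "$consume_ms" "$size_ms") ))
}
```

With a 5-minute window (300000 ms), window_start 1504715853133 300000 gives 1504715700000, so a record consumed at that instant appears about 153 seconds late even though it was delivered promptly.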
Control Center shows which consumers in a consumer group are consuming from which partitions and on which brokers those partitions reside. Control Center updates as consumer rebalances occur in a consumer group.
- If your consumer group app is not running, start consuming from topic wikipedia.parsed with a new consumer group app with one consumer consumer_app_1. It will run in the background.
$ ./$DEMOPATH/start_consumer_app.sh 1
- Let this consumer group run for 2 minutes until Control Center stream monitoring shows the consumer group app with steady consumption. Click on the box "View Details" above the bar graph to drill down into consumer group details. This consumer group app has a single consumer consumer_app_1 consuming all of the partitions in the topic wikipedia.parsed. The first bar may be red because the consumer started in the middle of a time window and did not receive all messages produced during that window. This does not mean messages were lost.
- Add a second consumer consumer_app_2 to the existing consumer group app.
$ ./$DEMOPATH/start_consumer_app.sh 2
- Let this consumer group run for 2 minutes until Control Center stream monitoring shows the consumer group app with steady consumption. Notice that the consumers consumer_app_1 and consumer_app_2 now share consumption of the partitions in the topic wikipedia.parsed. When the second consumer was added, that bar may be red for both consumers because a consumer rebalance occurred during that time window. This does not mean messages were lost, as you can confirm at the consumer group level.
Streams monitoring in Control Center can highlight consumers that are slow to keep up with the producers. This is critical to monitor for real-time applications where consumers should consume produced messages with as low latency as possible. To simulate a slow consumer, we will use Kafka's quota feature to rate-limit consumption from the broker side, for just one of two consumers in a consumer group.
- Click on Data Streams, and "View Details" for the consumer group app. Click on the blue circle on the consumption line on the left to verify there are two consumers, consumer_app_1 and consumer_app_2, that were created in an earlier section. If these two consumers are not running, start them as described in the section consumer rebalances.
- Let this consumer group run for 2 minutes until Control Center stream monitoring shows the consumer group app with steady consumption.
- Add a consumption quota for one of the consumers in the consumer group app.
$ ./$DEMOPATH/throttle_consumer.sh 1 add
Note: you are running a Docker demo environment with all services running on one host, which you would never do in production. Depending on your system resource availability, applying the quota may sometimes stall the consumer (KAFKA-5871), so you may need to adjust the quota rate. See the ./$DEMOPATH/throttle_consumer.sh script for syntax on modifying the quota rate.
- If the latency of consumer group app does not increase, decrease the quota rate.
- If consumer group app seems to stall, increase the quota rate.
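For orientation, Kafka client quotas of this kind are generally set with the kafka-configs tool. The sketch below only prints a candidate command rather than running it. It is not the content of throttle_consumer.sh, which is not shown here. It assumes that the consumer's client id matches its name (for example consumer_app_1), that ZooKeeper is reachable as zookeeper:2181 from inside the broker container, and that the rate value is one you choose yourself; the function name quota_command is hypothetical.

```shell
# Hypothetical helper: print (do not run) a kafka-configs command that adds
# or deletes a consumer byte-rate quota for one client id.
# Usage: quota_command add <client_id> <bytes_per_sec>
#        quota_command delete <client_id>
quota_command() {
  local action="$1" client_id="$2" rate="${3:-}"
  # Assumption: ZooKeeper address as seen from inside the broker container.
  local base="kafka-configs --zookeeper zookeeper:2181 --alter --entity-type clients --entity-name ${client_id}"
  case "$action" in
    add)    echo "${base} --add-config consumer_byte_rate=${rate}" ;;
    delete) echo "${base} --delete-config consumer_byte_rate" ;;
    *)      echo "usage: quota_command add|delete <client_id> [bytes_per_sec]" >&2
            return 1 ;;
  esac
}
```

Printing the command first makes it easy to compare against the script in the repository before changing any rate.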
- View the details of the consumer group app again. consumer_app_1 now shows high latency, and consumer_app_2 shows normal latency.
- In the System Health dashboard, you see that the fetch request latency has likewise increased. This is because the broker that has the partition that consumer_app_1 is consuming from is taking longer to service requests.
- Click on the fetch request latency line graph to see a breakdown of produce and fetch latencies through the entire request lifecycle. The middle number does not necessarily equal the sum of the percentiles of individual segments because it is the total percentile latency.
- Remove the consumption quota for the consumer. Latency for consumer_app_1 recovers to steady state values.
$ ./$DEMOPATH/throttle_consumer.sh 1 delete
Streams monitoring in Control Center can highlight consumers that are over consuming some messages, which is an indication that consumers are processing a set of messages more than once. This may happen intentionally, for example when an application with a software bug consumed and processed Kafka messages incorrectly, got a fix, and then reprocesses previous messages correctly. This may also happen unintentionally if an application crashes before committing processed messages. To simulate over consumption, we will use Kafka's consumer offset reset tool to set the offset of the consumer group app to an earlier offset, thereby forcing the consumer group to reconsume messages it has previously read.
- Click on Data Streams, and "View Details" for the consumer group app. Click on the blue circle on the consumption line on the left to verify there are two consumers, consumer_app_1 and consumer_app_2, that were created in an earlier section. If these two consumers are not running and were never started, start them as described in the section consumer rebalances.
- Let this consumer group run for 2 minutes until Control Center stream monitoring shows the consumer group app with steady consumption.
- Stop the consumer group app to stop consuming from topic wikipedia.parsed. Note that the command below stops the consumers gracefully with kill -15, so the consumers follow the shutdown sequence.
$ ./$DEMOPATH/stop_consumer_app_group_graceful.sh
- Wait for 2 minutes to let messages continue to be written to the topics for a while, without being consumed by the consumer group app. Notice the red bar which highlights that during the time window when the consumer group was stopped, there were some messages produced but not consumed. These messages are not missing; they are just not consumed because the consumer group stopped.
- Reset the offset of the consumer group app by shifting 200 offsets backwards. The offset reset tool must be run when the consumer is completely stopped. Offset values in output shown below will vary.
$ docker-compose exec kafka1 kafka-consumer-groups --reset-offsets --group app --shift-by -200 --bootstrap-server kafka1:9092 --all-topics --execute
TOPIC PARTITION NEW-OFFSET
wikipedia.parsed 1 4071
wikipedia.parsed 0 7944
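The new offsets follow from simple arithmetic on each partition's current offset. The sketch below illustrates that arithmetic under the assumption that the tool keeps the result within the partition's earliest and latest offsets; the function name shifted_offset and the starting offset in the usage note are hypothetical. The reset tool also accepts --dry-run in place of --execute to preview the result without committing it.

```shell
# Offset that results from shifting by a signed amount, clamped to the
# partition's [earliest, latest] range.
# Usage: shifted_offset <current> <shift_by> <earliest> <latest>
shifted_offset() {
  local current="$1" shift_by="$2" earliest="$3" latest="$4"
  local target=$(( current + shift_by ))
  # Never move before the first or past the last available offset.
  if [ "$target" -lt "$earliest" ]; then target="$earliest"; fi
  if [ "$target" -gt "$latest" ]; then target="$latest"; fi
  echo "$target"
}
```

For a partition whose committed offset was 4271, shifted_offset 4271 -200 0 4271 yields 4071.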
- Restart consuming from topic wikipedia.parsed with the consumer group app with two consumers.
$ ./$DEMOPATH/start_consumer_app.sh 1
$ ./$DEMOPATH/start_consumer_app.sh 2
- Let this consumer group run for 2 minutes until Control Center stream monitoring shows the consumer group app with steady consumption. Notice several things:
- Even though the consumer group app was not running for some of this time, all messages are shown as delivered. This is because all bars are time windows relative to produce timestamp.
- For some time intervals, the bars are red and the consumption line is above expected consumption because some messages were consumed twice due to rewinding offsets.
- The latency peaks and then gradually decreases, because this is also relative to the produce timestamp.
Streams monitoring in Control Center can highlight consumers that are under consuming some messages. This may happen intentionally when consumers stop and restart and operators change the consumer offsets to the latest offset. This avoids the delay of processing messages that were produced while the consumers were stopped, which matters especially for real-time applications. This may also happen unintentionally if a consumer is offline for longer than the log retention period, or if a producer is configured for acks=0 and a broker suddenly fails before having a chance to replicate data to other brokers. To simulate under consumption, we will use Kafka's consumer offset reset tool to set the offset of the consumer group app to the latest offset, thereby skipping messages that will never be read.
- Click on Data Streams, and "View Details" for the consumer group app. Click on the blue circle on the consumption line on the left to verify there are two consumers, consumer_app_1 and consumer_app_2, that were created in an earlier section. If these two consumers are not running and were never started, start them as described in the section consumer rebalances.
- Let this consumer group run for 2 minutes until Control Center stream monitoring shows the consumer group app with steady consumption.
- Stop the consumer group app to stop consuming from topic wikipedia.parsed. Note that the command below stops the consumers ungracefully with kill -9, so the consumers do not follow the shutdown sequence.
$ ./$DEMOPATH/stop_consumer_app_group_ungraceful.sh
- Wait for 2 minutes to let messages continue to be written to the topics for a while, without being consumed by the consumer group app. Notice the red bar which highlights that during the time window when the consumer group was stopped, there were some messages produced but not consumed. These messages are not missing; they are just not consumed because the consumer group stopped.
- Wait for another few minutes and notice that the bar graph changes and there is a herringbone pattern to indicate that perhaps the consumer group stopped ungracefully.
- Reset the offset of the consumer group app by setting it to the latest offset. The offset reset tool must be run when the consumer is completely stopped. Offset values in output shown below will vary.
$ docker-compose exec kafka1 kafka-consumer-groups --reset-offsets --group app --to-latest --bootstrap-server kafka1:9092 --all-topics --execute
TOPIC PARTITION NEW-OFFSET
wikipedia.parsed 1 8601
wikipedia.parsed 0 15135
- Restart consuming from topic wikipedia.parsed with the consumer group app with two consumers.
$ ./$DEMOPATH/start_consumer_app.sh 1
$ ./$DEMOPATH/start_consumer_app.sh 2
- Let this consumer group run for 2 minutes until Control Center stream monitoring shows the consumer group app with steady consumption. Notice that during the time period that the consumer group app was not running, no produced messages are shown as delivered.
To simulate a failed broker, stop the Docker container running one of the two Kafka brokers.
- Stop the Docker container running Kafka broker 2.
$ docker-compose stop kafka2
- After a few minutes, observe that System Health shows the broker count has gone down from 2 to 1, and there are many under replicated partitions.
- View topic details to see that there are out of sync replicas on broker 2.
- Restart the Docker container running Kafka broker 2.
$ docker-compose start kafka2
- After about a minute, observe the System Health view in Confluent Control Center. The broker count has recovered to 2, and the topic partitions are back to reporting no under replicated partitions.
- Click on the broker count 2 inside the circle to view when the broker counts changed.
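Under replicated partitions can also be confirmed from the command line while broker 2 is stopped. The sketch below parses the per-partition lines that kafka-topics --describe prints (Topic, Partition, Leader, Replicas, Isr) and lists partitions whose in-sync replica set is smaller than the replica list. The function name under_replicated is hypothetical, and the ZooKeeper address in the usage note is an assumption about this Docker environment.

```shell
# Hypothetical helper: reads `kafka-topics --describe` output on stdin and
# prints "<topic> <partition>" for each partition whose in-sync replica set
# is smaller than its replica list.
under_replicated() {
  awk '
    /Partition:/ && /Replicas:/ && /Isr:/ {
      topic = ""; part = ""; replicas = ""; isr = ""
      # Each label is followed by its value in the next field.
      for (i = 1; i < NF; i++) {
        if ($i == "Topic:")     topic = $(i + 1)
        if ($i == "Partition:") part = $(i + 1)
        if ($i == "Replicas:")  replicas = $(i + 1)
        if ($i == "Isr:")       isr = $(i + 1)
      }
      # Compare the number of comma-separated broker ids in each list.
      if (split(isr, a, ",") < split(replicas, b, ",")) print topic, part
    }
  '
}
```

A possible invocation is docker-compose exec -T kafka1 kafka-topics --zookeeper zookeeper:2181 --describe | under_replicated. The kafka-topics tool also has its own --under-replicated-partitions option for --describe.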
There are many types of Control Center alerts and many ways to configure them. Use the Alerts management page to define triggers and actions, or click on a streams monitoring graph for consumer groups or topics to set up alerts from there.
- This demo already has pre-configured triggers and actions. View the Alerts "Overview" screen, and click "Edit" to see configuration details.
- The trigger Under Replicated Partitions happens when a broker reports non-zero under replicated partitions, and it causes an action Email Administrator.
- The trigger Consumption Difference happens when consumption difference for the Elasticsearch connector consumer group is greater than 0, and it causes an action Email Administrator.
- If you followed the steps in the failed broker section, view the Alert history to see that the trigger Under Replicated Partitions happened and caused an alert when you stopped broker 2.
- You can also trigger the Consumption Difference trigger. In the Kafka Connect -> Sinks screen, edit the running Elasticsearch sink connector.
- Pause the Elasticsearch sink connector by pressing the pause icon in the top left. This will stop consumption for the related consumer group.
- View the Alert history to see that this trigger happened and caused an alert.
- Viewing topic data: if you want to watch the live messages from the wikipedia.parsed topic:
$ ./$DEMOPATH/listen_wikipedia.parsed.sh
- Stop the consumer group app to stop consuming from topic wikipedia.parsed. Note that the command below stops the consumers gracefully with kill -15, so the consumers follow the shutdown sequence.
$ ./$DEMOPATH/stop_consumer_app_group_graceful.sh
- Stop the Docker demo, destroy all components and clear all Docker volumes.
$ ./$DEMOPATH/reset_demo.sh