Apache Druid offers various ways to monitor its health. Key methods include:
Request logs: logged by each service that can serve queries, request logs contain information on query metrics including query time, returned data size, query type, and more.
Metrics: necessary for detailed monitoring of query execution, ingestion, coordination, and more.
Alerts: generated when unexpected situations occur, emitted as JSON objects to a runtime log file or over HTTP (to services like Apache Kafka).
Metrics can also be emitted directly to other monitoring tools such as Prometheus using the Prometheus Emitter extension. This approach, however, requires adopting a separate technology and at least basic knowledge of PromQL to visualize the data in Grafana.
But what if we use Druid itself to monitor Druid? How could this approach be achieved, and is it a breakthrough method of monitoring Druid?
In this article, you will find answers to these questions and more. It presents the idea of monitoring Druid using Druid itself and provides guidance on how to set this up.
Our concept
The idea was not immediately obvious but evolved in response to client needs. Many companies already use Kafka in their data workflows, which means it is often quicker for developers and engineers to leverage Kafka than to learn and implement Prometheus in their internal systems. Another frequently occurring challenge is the difficulty of writing Grafana queries in PromQL when metrics are first emitted to Prometheus.
These challenges led us to create a new monitoring pipeline that uses only Druid, Kafka, and Grafana.
Using Druid for Druid monitoring
If you find our idea interesting and see the potential benefits of this approach, follow the second part of this tutorial to learn how to implement it in practice.
Deploying Kafka locally
To follow the next tutorial steps, deploy a Kafka instance locally using a simple Docker Compose file.
1. Create a 'kafka-compose.yml' file and copy in the code below:
# kafka-compose.yml
---
version: '2'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
      ZOOKEEPER_TICK_TIME: 2000
  kafka:
    # An important note about accessing Kafka from clients on other machines:
    # -----------------------------------------------------------------------
    #
    # The config used here exposes port 9092 for _external_ connections to the broker
    # i.e. those from _outside_ the docker network. This could be from the host machine
    # running docker, or maybe further afield if you've got a more complicated setup.
    # If the latter is true, you will need to change the value 'localhost' in
    # KAFKA_ADVERTISED_LISTENERS to one that is resolvable to the docker host from those
    # remote clients
    #
    # For connections _internal_ to the docker network, such as from other services
    # and components, use kafka:29092.
    #
    # See https://rmoff.net/2018/08/02/kafka-listeners-explained/ for details
    #
    image: confluentinc/cp-kafka:latest
    depends_on:
      - zookeeper
    ports:
      - 9092:9092
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      KAFKA_JMX_PORT: 9997
  kafka-ui:
    container_name: kafka-ui
    image: provectuslabs/kafka-ui:latest
    ports:
      - "9000:8080"
    environment:
      KAFKA_CLUSTERS_0_NAME: local
      KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka:29092
      KAFKA_CLUSTERS_0_METRICS_PORT: 9997
      DYNAMIC_CONFIG_ENABLED: "true"
    depends_on:
      - "kafka"
The presented 'kafka-compose.yml' was crafted using the documentation available in the Kafka UI and Kafka Images GitHub repositories.
2. Start the container:
docker-compose -f kafka-compose.yml up
3. Check whether Kafka is running correctly: Go to 'http://localhost:9000/' and check the connection.
4. Create a new topic named 'druid_metrics' exclusively for Druid metrics. The configuration shown is just an example; you can tune the topic according to your current needs.
5. Repeat the above step to create a dedicated topic for Druid alerts named 'druid_alerts' .
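If you prefer the command line over the Kafka UI, the two topics can also be created with the Kafka CLI inside the broker container. This is a sketch: the service name and the single-partition, single-replica settings are assumptions based on the compose file above; adjust them to your needs.

```shell
# Create the metrics topic inside the broker container
docker-compose -f kafka-compose.yml exec kafka \
  kafka-topics --bootstrap-server kafka:29092 \
  --create --topic druid_metrics --partitions 1 --replication-factor 1

# Create the alerts topic the same way
docker-compose -f kafka-compose.yml exec kafka \
  kafka-topics --bootstrap-server kafka:29092 \
  --create --topic druid_alerts --partitions 1 --replication-factor 1
```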
Emitting Druid Metrics to Kafka
For this tutorial, we are using the Apache Druid 29.0.1 release.
Download and Install Druid locally
Download the Apache Druid 29.0.1 release archive, then unpack it:
tar -xzf apache-druid-29.0.1-bin.tar.gz
cd apache-druid-29.0.1
Enable Emitting Metrics from Druid
Druid metrics can be emitted directly in JSON format to a Kafka topic using a community extension named Kafka Emitter. Specify all necessary information about your Kafka instance in the 'common.runtime.properties' file located at the 'apache-druid-29.0.1/conf/druid/auto/_common/' path.
1. Add the Kafka emitter extension to 'druid.extensions.loadList':
druid.extensions.loadList=["druid-hdfs-storage", "druid-kafka-indexing-service", "druid-datasketches", "druid-multi-stage-query", "kafka-emitter"]
2. Specify all necessary parameters of your Kafka instance to emit metrics:
druid.emitter.kafka.bootstrap.servers=localhost:9092
druid.emitter.kafka.event.types=["metrics"]
druid.emitter.kafka.metric.topic=druid_metrics
druid.emitter.kafka.producer.config={"max.block.ms":10000}
3. If you also want to emit alerts, add 'alerts' to the event types and point the emitter at the dedicated alert topic:
druid.emitter.kafka.event.types=["metrics","alerts"]
druid.emitter.kafka.alert.topic=druid_alerts
4. Specify the emitter and save the changes. In our case, it should be set to 'kafka':
druid.emitter=kafka
All available emitters are listed in the Druid docs.
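Putting the pieces together, the emitter-related section of 'common.runtime.properties' might look like the sketch below. It assumes the local Kafka instance and the topic names created earlier in this tutorial; tune the producer settings to your environment.

```properties
# Emit both metrics and alerts from Druid to the local Kafka broker
druid.emitter=kafka
druid.emitter.kafka.bootstrap.servers=localhost:9092
druid.emitter.kafka.event.types=["metrics","alerts"]
druid.emitter.kafka.metric.topic=druid_metrics
druid.emitter.kafka.alert.topic=druid_alerts
druid.emitter.kafka.producer.config={"max.block.ms":10000}
```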
5. Start Druid with the automatic single-machine configuration:
./bin/start-druid
6. Wait a few minutes, then check the Kafka web UI to see whether metrics are being emitted correctly.
Ingesting Metrics from Kafka to Druid
After successfully emitting metrics into Kafka, we can ingest them directly into Druid. Before doing so, we need to create an appropriate data schema that will cover the emitted data. In this tutorial, we will create a simple data schema to monitor query execution performance as an example.
Create a Data Schema for Metrics
Druid metrics have basic dimensions that are set for all available metrics. However, most metrics also include additional fields with extra information.
Fields available for all metrics:
timestamp: the time the metric was created
metric: the name of the metric
service: the name of the service that emitted the metric
host: the name of the host that emitted the metric
value: the numeric value associated with the metric
The next step involves ingesting data from Kafka into Druid, specifically into a new data source created exclusively for metrics. Therefore, it's important to design the most efficient data schema, including the necessary dimensions and metrics.
Note: additional fields dedicated to the metrics you use can be included. For example, if Druid emits metrics related to JVM health, fields from metrics such as 'jvm/pool/committed', 'jvm/pool/init', 'jvm/pool/max', etc. might be included for more in-depth analysis.
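To make the shape of the data concrete, the snippet below parses a metric event and keeps only the basic dimensions listed above. The event itself is a hypothetical example (field values are made up, not captured output); only the five basic field names come from the discussion above.

```python
import json

# A hypothetical metric event, shaped like what the Kafka emitter produces.
# The field names follow the basic dimensions listed above; values are made up.
raw_event = json.dumps({
    "timestamp": "2024-05-01T12:00:00.000Z",
    "metric": "query/time",
    "service": "druid/broker",
    "host": "localhost:8082",
    "value": 42,
    "dataSource": "wikipedia",   # extra field present on query metrics
    "type": "timeseries"         # extra field present on query metrics
})

event = json.loads(raw_event)

# Keep only the fields that every metric is guaranteed to carry.
schema_fields = ["timestamp", "metric", "service", "host", "value"]
row = {field: event[field] for field in schema_fields}
print(row["metric"], row["value"])
```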
The following is a proposal for a basic data schema to monitor query performance:
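Since the original schema is presented as a screenshot, the fragment below sketches what the corresponding dimensions section of a Kafka ingestion spec might look like. The column list is an assumption based on the fields discussed above, not the exact spec used in the original setup.

```json
"dimensionsSpec": {
  "dimensions": [
    "metric",
    "service",
    "host",
    "dataSource",
    "type",
    { "name": "value", "type": "long" }
  ]
}
```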
Set up Ingestion for Parsing the Metrics
With the data schema set, start the ingestion process into the newly created data source:
1. Select Load data/Streaming from the main Druid panel.
2. Click on the Apache Kafka button.
3. Connect to the Apache Kafka instance by providing the needed information:
Bootstrap servers - 'localhost:9092'
Topic - 'druid_metrics'
Click the Apply button.
If everything is configured correctly, you should see the example events on the left side of your screen.
4. Proceed through the next steps until you reach the Configure schema section. In this section, select the columns that are not needed for the tutorial and delete them. The final schema should look like the one below.
5. Complete the remaining steps of the ingestion process by choosing settings suitable for your case. For this tutorial, the segment granularity is set to an hour.
6. When the ingestion spec has been submitted correctly, you should be able to see the new running task in the Tasks section.
7. Go to the Query section and check whether the 'druid_metrics' data source is ready for querying. Run an example 'SELECT' query to see what your data looks like.
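For example, a minimal sanity check (assuming the data source name used above):

```sql
SELECT *
FROM "druid_metrics"
LIMIT 10
```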
Ingesting Alerts from Kafka to Druid
If you decide to emit alerts from Druid to a separate Kafka topic, you will be able to ingest them similarly to how metrics were ingested in the previous section. This time, there is no need to drastically change the data schema during ingestion.
Check whether a new task has been created and is currently running. If so, you should be able to query your data source.
Great! You have successfully ingested Druid metrics and alerts from Kafka into the newly created data sources. Follow the next steps to learn how to use Druid Explore, create dashboards to visualize the results, and configure alerts in Grafana.
Druid Explore
Druid provides a built-in experimental tool for visualizing data from Druid data sources, called Druid Explore.
Druid Explore is not as advanced as dedicated visualization tools, but it is sufficient for quick, ad-hoc visualizations. It offers a few chart types, filtering options, and other features specific to each chart type.
The Number of Executed Queries Grouped by Type
Understanding which queries are most frequently executed on a Druid instance can be extremely useful for engineers. To visualize this, a pie chart is ideal, as it effectively shows the percentage of each query type. To create a pie chart in the Druid Explore panel, follow these steps:
1. On the right side of the screen, select Pie chart and set the Slice column to type.
2. Filter the data by a metric (e.g., query/time ) to include only data related to the executed queries.
3. You can add additional filters to avoid displaying 'null' values.
4. Druid also offers easy options to filter the final graph if you want to view it with different settings. Click on a legend element on the left side or directly on the chart slice to filter.
Of course, we have more options than just visualizing the number of queries. You can use the same data for a quick query review with other types of visualizations.
The Total Execution Time for Each Query Type
To calculate the total execution time for each query type, change the metric from Count to Sum value. Be aware that this works because we have filtered the data by query/time. Changing the filter field to, for example, query/bytes will show the total amount of data returned for each query type.
Visualizing these metrics on a pie chart helps to quickly identify which query type has the biggest impact on the Druid instance. For quick access to specific values, a bar chart might be more suitable.
The Number of Executed Queries Over Time
Knowing the total number of executed queries is useful, but sometimes it's not enough. Visualizing the number of executed queries over time can help in debugging and identifying periods of system overload.
Using a time chart in the Druid Explore tool, you can visualize the number of executed queries over different time periods. Set the time Granularity to decide the time intervals and use the Stack by parameter to choose which dimension to distinguish values by. Examples are shown below.
We encourage you to try Druid Explore and experiment with all available options and settings on your own. The examples provided are just a glimpse of what’s possible.
Druid Explore is an attractive built-in tool that allows for a quick dive into data from Druid data sources. Unfortunately, it doesn't allow creating and saving dashboards that you can return to at any time. For this reason, the last two sections of this tutorial explain how to work with Grafana.
Druid Alerts in Grafana
Integrate Apache Druid with Grafana
To visualize alerts in Grafana, you first need to integrate Apache Druid with Grafana. A step-by-step tutorial is available on our blog.
Add New Alert Rules
Once you have Druid metrics and alerts set up as your data source, you can create Grafana alerts based on these metrics. Here's how you can set up two types of alerts:
I. Alert When the Mean Query Time Is Too High
What if we could alert our DevOps team when the average query time exceeds our expectations? This can easily be achieved by querying the 'druid_metrics' data source and setting up a dedicated condition.
Let’s see how to do that!
1. Go into Alerting and Alert rules to create a new alert rule.
2. Use the below query while defining the query:
select "dataSource", sum("value")/count("value")
from "druid_metrics"
where "metric" = 'query/time' and "dataSource" is not NULL
group by "dataSource"
3. Add a condition using expressions to set up the desired alert. For tutorial purposes, we'll use a Math expression that triggers an alert when the average time exceeds 90 ms. This is just an example to help visualize how the mechanism works with our data.
4. Set evaluation behavior and pending time for the newly created rule.
5. Configure labels and notification policy for your needs.
II. Alert When a New Alert Appears in Druid
We can also create an alert rule using the 'druid_alerts' data source and configure it to notify us when a specific alert is triggered. To achieve that, we have to create two separate queries when defining the alert rule.
The first query counts the number of alerts, distinguished by 'severity' and 'service', without any time range. The second query counts the alerts that occurred in the thirty minutes preceding the current timestamp. Comparing the two helps detect when a new alert is created on the Druid side.
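The exact queries are shown as screenshots in the original; the sketch below illustrates the idea in Druid SQL, assuming the alert events carry 'severity' and 'service' fields as discussed above.

```sql
-- Query A: total number of alerts, grouped by severity and service
SELECT "severity", "service", COUNT(*) AS total_alerts
FROM "druid_alerts"
GROUP BY "severity", "service";

-- Query B: alerts from the thirty minutes preceding the current timestamp
SELECT "severity", "service", COUNT(*) AS recent_alerts
FROM "druid_alerts"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '30' MINUTE
GROUP BY "severity", "service";
```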
Verify How the Alerts Are Working
After setting up the alerts, we received notifications starting from the very first one. We observed the state change from normal to pending, and finally to firing.
Sample Grafana Dashboards
With a dedicated Druid data source for collected metrics, Grafana can be used to monitor query performance effectively. If you are not yet familiar with Grafana, I suggest reading my previous article, Monitoring Apache Druid in Grafana | Deep.BI.
Let’s create a query monitoring dashboard together!
1. Create a new dashboard.
2. Add a new visualization.
3. Select a Druid data source. If you have trouble selecting the correct data source, please review the step-by-step tutorial on Integrating Grafana with Apache Druid: A Step-by-Step Tutorial | Deep.BI .
Example Visualization
I. Number of Queries and Their Mean Time for Queries Executed on the Brokers
Select Query as SQL to write a dedicated SQL query.
select "type", "service", count("value") as number_of_queries, sum("value")/count("value") as mean_query_time
from "druid_metrics"
where "metric" = 'query/time' and "type" is not NULL and "service" = 'druid/broker'
group by "type", "service"
order by number_of_queries desc
II. The Total Query Time for Queries Executed on the Brokers
Create a new visualization with the below query:
select "type", sum("value") as total_query_time
from "druid_metrics"
where "metric" = 'query/time' and "type" is not NULL and "service" = 'druid/broker'
group by "type"
III. The Number of Executed Queries Over the Day
Create a new visualization with the following query:
select "__time", count("value") as number_of_queries
from "druid_metrics"
where "metric" = 'query/time'
group by "__time"
IV. The Number of Queries Grouped by Data Source and Type
Create a new visualization with the following query:
select "dataSource", "type", count("value") as number_of_queries
from "druid_metrics"
where "metric" = 'query/time' and "dataSource" is not NULL
group by "dataSource", "type"
V. The Number of Queries Grouped by Their Type
Create a new visualization with the following query:
select "type", count("value") as number_of_queries
from "druid_metrics"
where "metric" = 'query/time' and "dataSource" is not NULL
group by "type"
order by number_of_queries desc
The Final Dashboard
By implementing the five visualizations described above, you can create a comprehensive query performance dashboard. This will provide your team with the necessary information to identify which query types or data sources need optimization for improved performance.
Conclusion
Monitoring Apache Druid using Druid itself, along with Kafka and Grafana, offers a streamlined and efficient approach to managing your data infrastructure. Following this tutorial, you've set up a robust real-time monitoring and alerting system. Check out our other articles for more guides and tutorials. If you have any questions or need further assistance, feel free to contact us.