Apache Druid offers various ways to monitor its health. One key method is emitting metrics directly to external monitoring tools such as Prometheus, using the Prometheus Emitter extension. This approach, however, requires running a separate technology and at least a basic knowledge of PromQL to visualize the data in Grafana.
But what if we use Druid itself to monitor Druid? How could this approach be achieved, and is it a breakthrough method of monitoring Druid?
In this article, you will find answers to these questions and more. It presents the idea of monitoring Druid using Druid itself and provides guidance on how to set this up.
The idea was not immediately obvious but evolved in response to client needs. Many companies already use Kafka in their data workflows, which means it can be quicker for developers and engineers to leverage Kafka rather than learn and roll out Prometheus in their internal systems. Another frequently recurring challenge is the difficulty of writing Grafana queries in PromQL when metrics are first emitted to Prometheus.
These challenges led us to create a new monitoring pipeline that uses only Druid, Kafka, and Grafana.
If you find our idea interesting and see the potential benefits of this approach, follow the second part of this tutorial to learn how to implement it in practice.
To follow the next tutorial steps, a Kafka instance must be deployed locally using a simple Docker Compose file.
1. Create a 'kafka-compose.yml' file and copy the code below:
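The exact Compose file is not reproduced here, but a minimal sketch could look like the following. It assumes a single-node KRaft broker from the apache/kafka image and the provectuslabs/kafka-ui image mapped to port 9000; image versions, listener names, and ports are assumptions you can adjust to your environment.

```yaml
services:
  kafka:
    image: apache/kafka:3.7.0
    container_name: kafka
    ports:
      - "9092:9092"                     # listener for clients on the host (e.g., Druid)
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_LISTENERS: PLAINTEXT://:19092,CONTROLLER://:9093,PLAINTEXT_HOST://:9092
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:19092,PLAINTEXT_HOST://localhost:9092
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,CONTROLLER:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_INTER_BROKER_LISTENER_NAME: PLAINTEXT
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  kafka-ui:
    image: provectuslabs/kafka-ui:latest
    container_name: kafka-ui
    ports:
      - "9000:8080"                     # Kafka UI reachable at http://localhost:9000/
    environment:
      KAFKA_CLUSTERS_0_NAME: local
      KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: kafka:19092
    depends_on:
      - kafka
```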
The presented Compose file was crafted using the documentation available in the Kafka UI and Kafka Images GitHub repositories.
2. Start the container:
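Assuming the file is named 'kafka-compose.yml':

```bash
docker compose -f kafka-compose.yml up -d
```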
3. Check whether Kafka is running correctly: Go to 'http://localhost:9000/' and check the connection.
4. Create a new topic named 'druid_metrics' exclusively for Druid metrics. The below configuration is just an example. You can tune the topic according to your current needs.
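You can create the topic either from the Kafka UI or from the command line. A sketch using the scripts bundled with the apache/kafka image (the script path and the single-partition settings are assumptions) might look like this:

```bash
# Hypothetical example: create the metrics topic from inside the Kafka container
docker exec -it kafka /opt/kafka/bin/kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --create --topic druid_metrics \
  --partitions 1 \
  --replication-factor 1
```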
5. Repeat the above step to create a dedicated topic for Druid alerts named 'druid_alerts'.
For this tutorial, we are using the Apache Druid 29.0.1 release.
Install Druid locally:
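One way to do this, assuming a Linux or macOS shell and the official release archive (the mirror URL may differ for you), is:

```bash
# Download and unpack the Apache Druid 29.0.1 binary release
curl -O https://archive.apache.org/dist/druid/29.0.1/apache-druid-29.0.1-bin.tar.gz
tar -xzf apache-druid-29.0.1-bin.tar.gz
cd apache-druid-29.0.1
```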
It is possible to emit Druid metrics directly in JSON format to a Kafka topic using a community extension named Kafka Emitter. Specify all the necessary information about your Kafka instance in the 'common.runtime.properties' file located under 'apache-druid-29.0.1/conf/druid/auto/_common/'.
1. Add the Kafka emitter extension to 'druid.extensions.loadList':
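For example (keep whatever extensions you already load; as a contrib extension, 'kafka-emitter' may first need to be fetched with Druid's pull-deps tool):

```properties
# conf/druid/auto/_common/common.runtime.properties
druid.extensions.loadList=["druid-kafka-indexing-service", "kafka-emitter"]
```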
2. Specify all necessary parameters of your Kafka instance to emit metrics:
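A minimal sketch, assuming the local broker from the Compose file above and the 'druid_metrics' topic created earlier:

```properties
druid.emitter.kafka.bootstrap.servers=localhost:9092
druid.emitter.kafka.metric.topic=druid_metrics
```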
3. Extra parameters have to be added if you also want to emit alerts:
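For example:

```properties
druid.emitter.kafka.event.types=["metrics", "alerts"]
druid.emitter.kafka.alert.topic=druid_alerts
```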
4. Specify the emitter and save the changes. In our case, it should be set to 'kafka'. All available emitters are listed in Druid docs.
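That is:

```properties
druid.emitter=kafka
```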
5. Start Druid:
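With the auto (single-machine) configuration used above, this is typically:

```bash
./bin/start-druid
```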
6. Wait a few minutes, then check the Kafka web UI to see if metrics are emitting correctly.
After successfully emitting metrics into Kafka, we can ingest them directly into Druid. Before doing so, we need to create an appropriate data schema that will cover the emitted data. In this tutorial, we will create a simple data schema to monitor query execution performance as an example.
Druid metrics share a set of basic dimensions that are present for every metric. However, most metrics also include additional fields with extra information.
Fields available for all metrics:
- 'timestamp' – the time at which the event was emitted
- 'metric' – the metric name, e.g., 'query/time'
- 'service' – the Druid service that emitted the metric
- 'host' – the host (and port) of the emitting service
- 'value' – the numeric value of the metric
The next step involves ingesting data from Kafka into Druid, specifically into a new data source created exclusively for metrics. Therefore, it's important to design the most efficient data schema, including the necessary dimensions and metrics.
Note: Additional fields specific to the metrics being emitted can also be included. For example, if Druid emits metrics related to JVM health, additional fields such as 'jvm/pool/committed', 'jvm/pool/init', 'jvm/pool/max', etc. might be included for more in-depth analysis.
The following is a proposal for a basic data schema to monitor query performance:
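A possible sketch of the corresponding ingestion-spec fragment is shown below. The exact dimension list is an assumption based on the fields discussed above; 'dataSource', 'type', and 'success' come from query-related metrics such as 'query/time'.

```json
{
  "timestampSpec": { "column": "timestamp", "format": "iso" },
  "dimensionsSpec": {
    "dimensions": [
      "metric",
      "service",
      "host",
      "dataSource",
      "type",
      "success",
      { "name": "value", "type": "long" }
    ]
  }
}
```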
With the data schema set, start the ingestion process into the newly created data source:
1. Select Load data/Streaming from the main Druid panel.
2. Click on the Apache Kafka button.
3. Connect to the Apache Kafka instance by providing the required connection details, such as the bootstrap servers (e.g., 'localhost:9092') and the topic name ('druid_metrics').
Click the Apply button.
If everything is configured correctly, you should see the example events on the left side of your screen.
4. Proceed through the next steps until you reach the Configure schema section. In this section, select the columns that are not needed for the tutorial and delete them. The final schema should look like the one below.
5. Complete the remaining steps of the ingestion process, choosing settings suitable for your case. For this tutorial, the segment granularity is set to an hour.
6. When the ingestion has been set up correctly, you should see the new task running in the Tasks section.
7. Go to the Query section and check whether the 'druid_metrics' data source is ready for querying. Run an example 'SELECT' query and see what your data looks like.
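For example:

```sql
SELECT __time, "metric", "service", "host", "value"
FROM druid_metrics
ORDER BY __time DESC
LIMIT 100
```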
If you decide to emit alerts from Druid to a separate Kafka topic, you will be able to ingest them similarly to how the metrics were ingested in the previous section. This time, there is no need to drastically change the data schema during ingestion.
Check whether a new task has been created and is currently running. If so, you should be able to query your data source.
Great! You have successfully ingested Druid metrics and alerts from Kafka into the newly created data sources. Follow the next steps to learn how to use Druid Explore, create dashboards to visualize the results, and configure alerts in Grafana.
Out of the box, Druid provides an experimental tool for visualizing data from Druid data sources, called Druid Explore.
Druid Explore is not as advanced as dedicated visualization tools, but it is sufficient for quick, ad-hoc visualizations. It offers a few chart types, filtering options, and other features specific to each chart type.
Understanding which queries are most frequently executed on a Druid instance can be extremely useful for engineers. To visualize this, a pie chart is ideal, as it effectively shows the share of each query type. To create a pie chart in the Druid Explore panel, follow these steps:
1. On the right side of the screen, select Pie chart and Slice column as type.
2. Filter the data by a metric (e.g., query/time) to include only data related to the executed queries.
3. You can add additional filters to avoid displaying 'null' values.
4. Druid also offers easy options to filter the final graph if you want to view it with different settings. Click on a legend element on the left side or directly on the chart slice to filter.
Of course, we have more options than just visualizing the number of queries. You can use the same data for a quick query review with other types of visualizations.
To calculate the total execution time for each query type, change the metric from Count to Sum value. Be aware that this works because we have filtered the data by query/time. Changing the filter field to, for example, query/bytes will show the total amount of data returned for each query type.
Visualizing these metrics on a pie chart helps to quickly identify which query type has the biggest impact on the Druid instance. For quick access to specific values, a bar chart might be more suitable.
Knowing the total number of executed queries is useful, but sometimes it's not enough. Visualizing the number of executed queries over time can help in debugging and identifying periods of system overload.
Using a time chart in the Druid Explore tool, you can visualize the number of executed queries over different time periods. Set the time Granularity to decide the time intervals and use the Stack by parameter to choose which dimension to distinguish values by. Examples are shown below.
We encourage you to try Druid Explore and experiment with all available options and settings on your own. The examples provided are just a glimpse of what’s possible.
Druid Explore is an attractive built-in tool in Druid that allows for a quick dive into data from Druid data sources. Unfortunately, it doesn't allow creating and saving dashboards that we can return to at any time. For this reason, the last two sections of this tutorial cover how to work with Grafana.
To visualize alerts in Grafana, you first need to integrate Apache Druid with Grafana. A step-by-step tutorial is available on our blog. Check it out by clicking here.
Once you have Druid metrics and alerts set up as your data source, you can create Grafana alerts based on these metrics. Here’s how you can set up two types of alerts:
What if we could alert our DevOps team when the average time of our queries exceeds our expectations? This can be easily achieved by querying the ‘druid_metrics’ data source and setting up a dedicated condition.
Let’s see how to do that!
1. Go to Alerting and then Alert rules to create a new alert rule.
2. Use the query below when defining the alert rule's query:
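The original query is not reproduced here; a sketch along these lines, averaging 'query/time' over five-minute buckets for the last hour, is one possibility:

```sql
SELECT
  TIME_FLOOR(__time, 'PT5M') AS "time",
  AVG("value") AS avg_query_time_ms
FROM druid_metrics
WHERE "metric" = 'query/time'
  AND __time > CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY 1
ORDER BY 1
```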
3. Add a condition using expressions to set up a desired alert. For tutorial purposes, we'll use a math condition to trigger an alert when the average time exceeds 90ms. This is just an example to help visualize how the mechanism works with our data.
4. Set evaluation behavior and pending time for the newly created rule.
5. Configure labels and notification policy for your needs.
We can also set up a new alert rule using the 'druid_alerts' data source and configure it to notify us when a specific alert is triggered. To achieve that, we have to create two separate queries when defining the alert rule.
The first query counts the number of alerts, distinguished by 'severity' and 'service', without any time range. The second query counts the alerts that occurred in the thirty minutes preceding the current timestamp. This approach helps detect when a new alert is created on the Druid side.
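As an illustration (the 'severity' and 'service' column names come from the alert events ingested earlier; the exact queries may differ in your setup):

```sql
-- Query A: total number of alerts per severity and service, over the whole data source
SELECT "severity", "service", COUNT(*) AS alert_count
FROM druid_alerts
GROUP BY "severity", "service"
```

```sql
-- Query B: alerts raised in the last thirty minutes
SELECT "severity", "service", COUNT(*) AS recent_alert_count
FROM druid_alerts
WHERE __time > CURRENT_TIMESTAMP - INTERVAL '30' MINUTE
GROUP BY "severity", "service"
```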
We set up the alerts so that notifications are sent starting from the first occurrence, and observed the state change from Normal to Pending, and finally to Firing.
With a dedicated Druid data source for the collected metrics, Grafana can be used to monitor query performance effectively. If you are not yet very familiar with Grafana, I suggest reading my previous article, Monitoring Apache Druid in Grafana | Deep.BI.
Let’s create a query monitoring dashboard together!
1. Create a new dashboard.
2. Add a new visualization.
3. Select a Druid data source. If you have trouble selecting the correct data source, please review the step-by-step tutorial on Integrating Grafana with Apache Druid: A Step-by-Step Tutorial | Deep.BI.
Set the query type to SQL to write a dedicated SQL query.
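The original query is not shown here; as a hypothetical first panel, average query time over time could be charted like this:

```sql
SELECT
  TIME_FLOOR(__time, 'PT1H') AS "time",
  AVG("value") AS avg_query_time_ms
FROM druid_metrics
WHERE "metric" = 'query/time'
GROUP BY 1
ORDER BY 1
```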
Create a new visualization with the below query:
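For example, an illustrative query counting executed queries per query type:

```sql
SELECT
  "type" AS query_type,
  COUNT(*) AS query_count
FROM druid_metrics
WHERE "metric" = 'query/time'
GROUP BY 1
ORDER BY query_count DESC
```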
Create a new visualization with the following query:
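One possible query here, showing average query time per data source:

```sql
SELECT
  "dataSource",
  AVG("value") AS avg_query_time_ms
FROM druid_metrics
WHERE "metric" = 'query/time'
GROUP BY 1
ORDER BY avg_query_time_ms DESC
```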
Create a new visualization with the following query:
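As another illustrative option, maximum query time per query type over time:

```sql
SELECT
  TIME_FLOOR(__time, 'PT1H') AS "time",
  "type" AS query_type,
  MAX("value") AS max_query_time_ms
FROM druid_metrics
WHERE "metric" = 'query/time'
GROUP BY 1, 2
ORDER BY 1
```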
Create a new visualization with the following query:
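Finally, a sketch of a panel summing the data returned per data source (this relies on the 'query/bytes' metric being emitted and ingested):

```sql
SELECT
  "dataSource",
  SUM("value") AS total_query_bytes
FROM druid_metrics
WHERE "metric" = 'query/bytes'
GROUP BY 1
ORDER BY total_query_bytes DESC
```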
By implementing the five visualizations described above, you can create a comprehensive query performance dashboard. This will provide your team with the necessary information to identify which query types or data sources need optimization for improved performance.
Monitoring Apache Druid using Druid itself, along with Kafka and Grafana, offers a streamlined and efficient approach to managing your data infrastructure. Following this tutorial, you've set up a robust real-time monitoring and alerting system. Check out our other articles here for more guides and tutorials. If you have any questions or need further assistance, feel free to contact us.