Modern systems, especially distributed ones, are increasingly tailored to business expectations and customer requirements. As a result, they are becoming more complex, and it’s not obvious how to track every operation they perform.
Fortunately, good system monitoring is gaining popularity and now goes beyond collecting and processing logs. Newer systems make it possible to collect the most important information about them as metrics.
System metrics are the various types of measurements available within a system. Each resource that can be observed for factors such as performance, availability, and reliability has one or more metrics from which data can be gathered.
In Druid, it is easy to configure the emission of metrics, which are fundamental for monitoring query execution and performance, the ingestion process, and exceptions. In the Druid documentation, you can look closely at every metric type, its dimensions, and a brief description.
These metrics are essential for monitoring the health, performance, and efficiency of Druid clusters, identifying bottlenecks, optimizing configurations, and ensuring the smooth operation of real-time analytics workloads. Additionally, the Druid integration with Grafana allows users to set up alerts based on predefined thresholds for these metrics.
Check our previous tutorial on Integrating Grafana with Druid for detailed guidelines on how to set up the environment correctly. In the rest of this article, we will show you sample dashboards that you may find useful for basic Druid monitoring.
Used metrics:
The most basic (but extremely useful) metrics are those that return the number of successful, failed, interrupted, or timed-out queries. This simple information can easily be used for Druid cluster monitoring.
In Grafana, we can display individual values depending on the time of day. Thanks to easy filtering by time, we can show a line graph for a particular time range.
Moreover, Grafana can display the results of mathematical expressions on charts. In our case, we wanted to see the query success rate. The ratio of successful queries to all queries gives us more detailed information about query performance.
The absolute number of queries depends on the system, but the ratio of successful queries should be high no matter how many queries are executed.
To configure the above chart, you must use the Grafana expression feature available while creating the panel.
First, choose the metrics for the number of successful, failed, and interrupted queries. To do this, write an appropriate PromQL query against Prometheus for each of them.
In the screenshot below, you can also see global Grafana variables such as “Servers” and “Jobs” added for filtering. They are not required to build the query success rate panel.
The expression below shows how to calculate the query success rate.
Remember to hide all queries except the expression with the final result, so that only the success rate is drawn on the chart.
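The arithmetic behind such an expression can be sketched in a few lines of Python. This is only an illustration of the calculation, not the Grafana expression itself: the function name and the sample counts are hypothetical, and it assumes the total is the sum of successful, failed, and interrupted queries, as in the panel described above.

```python
from typing import Optional


def success_rate(success: float, failed: float, interrupted: float) -> Optional[float]:
    """Return the share of successful queries, or None if no queries ran."""
    total = success + failed + interrupted
    if total == 0:
        # No queries in the window: the ratio is undefined, so avoid dividing by zero.
        return None
    return success / total


# Hypothetical counts for a single time window.
rate = success_rate(success=970, failed=20, interrupted=10)
print(f"{rate:.1%}")  # prints 97.0%
```

Returning no value for an empty window, instead of 0, keeps a quiet period from looking like a period in which every query failed.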
Used metrics:
Druid supports two types of query caching: per-segment caching and whole-query caching. Regardless of the type used, we want to check whether our caching is effective. Druid can report cache metrics as either delta or total values. The difference is significant, so we have to be careful to choose the correct one when writing the query: delta metrics cover the period since the last emission, while total metrics hold the cumulative cache values. We can monitor cache performance with the query/cache/delta/hitRate metric in a Grafana panel. Be sure to choose the delta metric rather than the total one if you want to reproduce the chart below.
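To see why the choice between delta and total values matters for a chart, consider the following Python sketch. The counter values are made up, and the helper names are hypothetical; the sketch simply assumes that a delta value is the difference between two successive cumulative samples, and that a hit rate is hits divided by hits plus misses.

```python
from typing import List, Optional, Tuple

# Hypothetical cumulative cache counters at successive emissions:
# each tuple is (total_hits, total_misses).
totals: List[Tuple[int, int]] = [(0, 0), (80, 20), (90, 110), (180, 120)]


def hit_rate(hits: int, misses: int) -> Optional[float]:
    """Hits divided by all lookups, or None when there were no lookups."""
    lookups = hits + misses
    return hits / lookups if lookups else None


def delta_hit_rates(samples: List[Tuple[int, int]]) -> List[Optional[float]]:
    """Hit rate for each emission period, from differences of cumulative counters."""
    rates = []
    for (prev_hits, prev_misses), (hits, misses) in zip(samples, samples[1:]):
        rates.append(hit_rate(hits - prev_hits, misses - prev_misses))
    return rates


print(delta_hit_rates(totals))  # per-period rates: [0.8, 0.1, 0.9]
print(hit_rate(*totals[-1]))    # cumulative rate: 0.6
```

In this made-up data the cumulative rate ends at 0.6 and hides the period in which the hit rate dropped to 0.1, while the per-period values show the drop clearly.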
Used metrics:
The average time of executed queries can show us how well our queries perform, but it’s hard to diagnose possible issues without extra details. Grouping the time metrics by query type reveals the relationship between query type and average execution time. With that, we can see which kinds of queries take the longest and are most likely to frustrate customers.
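The grouping itself is simple, as the following Python sketch shows. The samples are invented for illustration, and the function name is hypothetical; in a real setup the grouping would be done by the query against the metrics store rather than in application code.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# Hypothetical samples: (query type, query time in milliseconds).
samples = [
    ("timeseries", 40),
    ("groupBy", 900),
    ("topN", 120),
    ("groupBy", 1100),
    ("timeseries", 60),
]


def average_time_by_type(rows: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average query time per query type, slowest type first."""
    by_type = defaultdict(list)
    for query_type, millis in rows:
        by_type[query_type].append(millis)
    averages = {query_type: mean(times) for query_type, times in by_type.items()}
    # Sort descending by average time so the slowest query types come first.
    return dict(sorted(averages.items(), key=lambda item: item[1], reverse=True))


for query_type, avg_ms in average_time_by_type(samples).items():
    print(f"{query_type}: {avg_ms} ms")
```

With these sample values, the overall average of 444 ms says little on its own, while the per-type breakdown shows that groupBy queries dominate the execution time.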
Monitoring Druid in Grafana provides a comprehensive solution for tracking the health, performance, and efficiency of Druid clusters. Users can gain valuable insights into system operations with various metrics available, including query, indexing, coordinator, ingestion, and general health metrics. Grafana's flexibility enables the creation of informative dashboards, allowing users to visualize key metrics such as query success rates, cache hit rates, and query completion times. This integration helps users to optimize configurations, identify bottlenecks, and ensure the smooth operation of real-time analytics workloads in increasingly complex distributed systems.