If you have found this article, you are probably looking for an efficient way to use data from Druid segments in PySpark. There are several options for extracting data from Druid, but each comes with its own set of challenges. To address them, our team has developed the Spark Druid Segment Reader, which allows you to read data directly from Druid's Deep Storage as a Spark DataFrame in PySpark.
This tutorial will help you try out the Spark Druid Segment Reader in a Jupyter Notebook. If you want to learn more about the connector itself, see our previous blog article.
Download the JAR from the repository: deep-bi/spark-druid-segment-reader. Alternatively, you can clone the repository and build the project yourself with the ‘sbt assembly’ command.
To work with the custom JAR, add the “spark.jars” parameter to the Spark configuration when creating the session. You can use the following Python code:
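Below is a minimal sketch of such a setup. The variable name and the example path are placeholders, not part of the connector itself:

```python
from pyspark.sql import SparkSession

# Placeholder: point this at the JAR you downloaded or built in the previous step.
spark_druid_segment_reader_jar_path = "/path/to/spark-druid-segment-reader-assembly.jar"

# Create a SparkSession with the connector JAR attached via the "spark.jars" setting.
spark = (
    SparkSession.builder
    .appName("druid-segment-reader-demo")
    .config("spark.jars", spark_druid_segment_reader_jar_path)
    .getOrCreate()
)
```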
Remember to replace ‘spark_druid_segment_reader_jar_path’ with the actual path to the JAR file you downloaded or built in the previous step.
With the Spark Druid Segment Reader properly configured, you can now load a DataFrame from Druid segments using the code below:
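A minimal sketch of the read call is shown here. The format identifier ‘druid-segment-reader’ is an assumption used for illustration; the exact data source name, as well as the contents of ‘properties’ (defined in the next step), are specified in the project's README:

```python
# Sketch of reading Druid segments into a DataFrame.
# Assumes `spark` was created as shown above and `properties` holds the read
# configuration dictionary described below.
# NOTE: "druid-segment-reader" is an assumed format name — check the README.
df = (
    spark.read
    .format("druid-segment-reader")
    .options(**properties)
    .load()
)

df.printSchema()
df.show(5)
```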
The ‘properties’ parameter is a dictionary containing the read configuration. You can find its detailed specification in the README.
Here is an example definition of the ‘properties’ dictionary:
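The option names and values below are illustrative assumptions only; use the keys documented in the project's README for your deployment:

```python
# Illustrative read configuration — key names and values are assumptions,
# not the connector's documented options.
properties = {
    "data_source": "wikipedia",          # hypothetical Druid datasource name
    "deep_storage_type": "hdfs",         # hypothetical deep storage type (e.g. hdfs, s3, local)
    "deep_storage_path": "hdfs://namenode:8020/druid/segments",  # hypothetical segment location
    "start_date": "2023-01-01",          # hypothetical start of the interval to read
    "end_date": "2023-01-31",            # hypothetical end of the interval to read
}
```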
If you have followed all the steps correctly, you have read data from Druid segments and loaded it as a DataFrame. Now you can use the full potential of DataFrames for your own purposes. If you have any questions or need further technical support, feel free to contact us.