Drill, Baby, Drill

Why Distributed Queries

Depending on the size of a company and legal requirements for data protection, a company maintains different environments for their product to clearly separate development from live service. Common environments include Development, Testing, Integration, and Production. Access to these environments is usually restricted based on roles. For example, developers have full access to Development and Testing, but only read-only access to Integration and Production. Ideally, there would be no access to these environments, but debugging live issues sometimes requires it.

To have realistic data in non-production environments, a common practice is to periodically mirror the production environment to other environments and anonymize the data. In some cases, it might be useful to work with the anonymized data and compare it to live data, such as when testing complex calculations.

Some DBMSs offer a mechanism to “link” databases. This may be limited to databases of the same type or a variety of databases, depending on what you are using. From a developer’s perspective, a link simply provides access to the linked database within a schema, making it appear like a regular part of your database. Most of the time, links have limitations on what is supported (e.g., transactions). If you have a setup with different environments, you likely don’t want to link them due to security concerns.

All of this makes it difficult to run queries across multiple environments. One way to handle this is by exporting the data you need and then performing the aggregation on that export. You could use a database like DuckDB, which is excellent at working with different file formats like CSV, JSON, or Parquet. Although this seems like a good idea at first, exporting data can be cumbersome when you need to update your queries. For example, if you need to add fields or tables, you have to adjust the export each time, which slows down your workflow.

This is where Apache Drill can be very helpful.

What is Apache Drill

You can think of Apache Drill as doing for data what an API Gateway does for services. It provides a proxy database that combines different data stores into one. This allows smooth, distributed queries across various datasets.

On its website, Drill highlights its ability to query non-relational data sources, suggesting its original use in data-lake scenarios. However, relational data sources are also supported, and that's what we will focus on in this article.

Getting Drill to Work for an Oracle Database on a Developer Machine

This is the scenario I encountered, but most of what I discuss here will apply to other relational databases as well. I will only cover the setup for a single developer machine. Drill can run as a distributed system for availability and performance, but that is beyond the scope of what I'm describing here.

Embedded Mode is...

... somewhat disappointing. The easiest way to run Drill is to download it and run it in embedded mode. Drill is developed in Java, so it requires the Java Runtime to be installed. The setup is detailed here. Following the instructions, I encountered errors because it seems that white spaces in the path to the Java installation are not supported (which is the default in my language). I worked around this issue by creating an alias with the subst command, only to be met with an UNSUPPORTED_OPERATION ERROR. I found some posts from others with the same issue, with one suggesting that an older JRE version should be used. At this point, I decided that embedded mode is not for me.

Run as Docker Container

Running Drill as a Docker container works seamlessly. I use Podman for this:

podman run -it -p 8047:8047 -p 31010:31010 apache/drill

Port 8047 is for the web interface.
Port 31010 is for JDBC connections.

If everything works as it did for me, you can now access the web UI at http://localhost:8047.

Configuring Storage

In Drill, data sources you can query are called storage. The web UI can be a bit confusing when it comes to configuration. In the storage tab, there are two lists of storage plugins: enabled (on the left) and available (on the right). The available storage plugins include a JDBC plugin, which I want to use. You can enable this to make it appear on the left side, but this brings up the question of how to configure multiple JDBC storages.

The simplest way to do this is to use the Create button on the left side. This lets you add as many storages of the same type as you need.

The name of the plugin (in this case dev) will be the schema you use in later SQL queries. If you use a different database, make sure to change the driver and url fields.

Drill doesn't include a JDBC driver by default, so you need to ensure it's available for the container. The documentation provides examples of configurations for various databases.

I downloaded the Oracle driver to c:\Tools\Drill on my host machine and then mapped the driver into the container:

podman run -it -p 8047:8047 -p 31010:31010 -v /mnt/c/Tools/Drill/ojdbc17.jar:/opt/drill/jars/3rdparty/ojdbc17.jar apache/drill

With this setup, the configuration works, and you can query the relational store.

Saving the Storage Configuration

You can save the storage configuration, as this post explains, but I haven't tried this for my scenario.

Using the Web UI

I started running some queries using the web UI, but I realized this might not be the best way for serious work. The UI is not user-friendly (for example, you have to switch between multiple forms to write your queries and check the results). I also encountered a formatting issue with BIGINT numbers. Initially, I thought it was a problem with the datatype, but it turned out the web UI formats them using E Notation, which is not helpful at all.

Connecting with DBeaver

DBeaver is a great database query tool that supports many different databases, including Drill. To connect, add a new Drill connection. The first time you do this, DBeaver automatically fetches the latest driver—nice!

Select URL for Connect by and enter the JDBC URL jdbc:drill:drillbit=localhost. Leave Username and Password blank, and check the Save password box. With this setup, you can run queries against all stores within Drill.

When querying, remember that Drill is its own database, so you can't use regular PL/SQL syntax. For example, identifiers are not escaped with quotes but with backticks. However, Drill is ANSI compliant, so you shouldn't encounter too many issues. A reference is available here.

Conclusion

I'm not quite sure about the current status of the Drill project. On one hand, it received several updates this year, it is an Apache project, and the documentation seems to be in good shape.

On the other hand, there were significant issues when trying to get it to work, and the posts I found on Stack Overflow didn't help resolve the problem. The web UI seems very unfinished to me and should be avoided.

However, for what I'm doing here, it works. I hope this will save me (and you) some time in the future.

Drill, Baby, Drill

Why Distributed Queries

What is Apache Drill