This project consists of building an input/output dialect in MLIR for Substrait, the cross-language serialization format of database query plans (akin to an intermediate representation/IR for database queries). The immediate goal is to create common infrastructure that can be used to implement consumers, producers, optimizers, and transpilers of Substrait; the broader goal is to study the viability of using modern, general-purpose compiler infrastructure to implement database query compilers.
Licensed under the Apache license with LLVM Exceptions. See LICENSE for more information.
Check out the dedicated document for how to contribute.
Substrait defines a serialization format for data-intensive compute operations similar to relational algebra as they typically occur in database query plans and similar systems, i.e., an exchange format for database queries. This makes it possible to separate the development of user frontends such as dataframe libraries or SQL dialects (aka "Substrait producers") from that of backends such as database engines (aka "Substrait consumers") and, thus, to interoperate more easily between different data processing systems.
While Substrait has significant momentum and finds increasing adoption in mature systems, it is only concerned with the serialization format of query plans and leaves the handling of that format and, hence, the in-memory format or intermediate representation (IR) of plans up to the systems that adopt it. This will likely lead to repeated implementation effort for everything else required to deal with that intermediate representation, including serialization/deserialization to and from text and other formats; a host-language representation of the IR such as native classes; error and location tracking; rewrite engines, rewrite rules, and pass management; common optimizations such as common sub-expression elimination; and similar.
This project aims to create a base for any system dealing with Substrait by building a "dialect" for Substrait in MLIR. In a way, it aims to build an in-memory format for the concepts defined by Substrait, for which Substrait itself only describes the serialization format. MLIR is a generic compiler framework providing infrastructure for writing compilers from any domain, is part of the LLVM ecosystem, and has an active community with adoption from researchers and industry across many domains. It makes it easy to add new IR consisting of domain-specific operations, types, attributes, etc., which are organized in dialects (either in-tree or out-of-tree), as well as rewrites, passes, conversions, translations, etc. on those dialects. Creating a Substrait dialect and a number of common related transformations in such a mature framework has the potential to eliminate some of the repeated effort described above and, thus, to ease and eventually increase adoption of Substrait. By extension, building out a dialect for Substrait can show that MLIR is a viable base for any database-style query compiler.
The aim of the Substrait dialect is to support all of the following use cases:
- Implement the translation of the IR of a particular system to or from Substrait by converting it to or from the Substrait dialect (rather than Substrait's protobuf messages) and then using the serialization/deserialization routines from this project.
- Use the Substrait dialect as the sole in-memory format for the IR of a particular system, e.g., parsing some frontend format into its own dialect and then converting that into the Substrait dialect for export or converting from the Substrait dialect for import and then translating that into an execution plan.
- Implement simplifying and "canonicalizing" transformations of Substrait plans such as common sub-expression elimination, dead code elimination, sub-query/common table-expression inlining, selection and projection push-down, etc., for example, as part of a producer, consumer, or transpiler.
- Implement "compatibility rewrites" that transform plans using features that are unsupported by a particular consumer into equivalent plans using features that it does support, for example, as part of a producer, consumer, or transpiler.
- [Stretch] Implement a full-blown query optimizer using the dialect for both logical and physical plans. It is not clear whether this should be done with this dialect or rather with one or two additional ones that are specifically designed with query optimization in mind.
The main objective of the Substrait dialect is to allow handling Substrait plans in MLIR: it replicates the components of Substrait plans as a dialect in order to be able to tap into MLIR infrastructure. In the taxonomy of Niu and Amini, this means that the Substrait dialect is both an "input" and an "output" dialect for Substrait. As such, there is little freedom in designing the dialect. To guide the few remaining design choices, we follow this rationale (from most important to least important):
- Every valid Substrait plan MUST be representable in the dialect.
- Every valid Substrait plan MUST round-trip through the dialect to the same plan as the input. This includes names and ordering.
- The import routine MUST be able to report all constraint violations of Substrait plans (such as type mismatches, dangling references, etc.).
- The dialect MAY be able to represent programs that do not correspond to valid Substrait plans. It MAY be impossible to export those to Substrait. For example, this allows representing DAGs of operators rather than just trees.
- Every valid program in the Substrait dialect that can be exported to Substrait MUST round-trip through Substrait to a semantically equivalent program but MAY be different in terms of names, ordering, used operations, attributes, etc.
- The dialect SHOULD be understood easily by anyone familiar with Substrait. In particular, the dialect SHOULD use the same terminology as the Substrait specification wherever applicable.
- The dialect SHOULD follow MLIR conventions, idioms, and best practices.
- The dialect SHOULD reuse types, attributes, operations, and interfaces of upstream dialects wherever applicable.
- The dialect SHOULD allow simple optimizations and rewrites of Substrait plans without requiring other dialects.
- The serialization of the dialect (aka its "assembly") MAY change over time. (In other words, the dialect is not meant as an exchange format between systems -- that's what Substrait is for.)
MLIR provides infrastructure for virtually all aspects of writing a compiler. The following is a list of features that we inherit by using MLIR:
- Mostly declarative approach to defining relations and expressions (via ODS/tablegen).
- Documentation generation from declared relations and expressions (via ODS).
- Declarative serialization/parsing to/from human-readable text representation (via custom assembly).
- Syntax highlighting, auto-complete, as-you-type diagnostics, code navigation, etc. for the MLIR text format (via an LSP server).
- (Partially declarative) type deduction framework (via ODS constraints or C++ interface implementations).
- (Partially declarative) verification of arbitrary consistency constraints, declarative (via ODS constraints) or imperative (via C++ verifiers).
- Mostly declarative pass management (via tablegen).
- Versatile infrastructure for pattern-based rewriting (via DRR and C++ classes).
- Powerful imperative handling, creation, and modification of IR using native classes for operations, types, and attributes, walkers, builders, (IR) interfaces, etc. (via ODS and C++ infrastructure).
- Powerful location tracking and location-based error reporting.
- Generated Python bindings of IR components, passes, and generic infrastructure (via ODS).
- Powerful command line argument handling and customizable implementation of typical tools (X-opt, X-translate, X-lsp-server, ...).
- Testing infrastructure that is optimized for compilers (via lit and FileCheck).
- A collection of common types and attributes as well as dialects (i.e., operations) for more or less generic purposes that can be used in or combined with custom dialects and that come with transformations on and conversions to/from other dialects.
- A collection of interfaces and transformation passes on those interfaces, which makes it easy to extend existing transformations to new dialects.
- A support library with efficient data structures, platform-independent file system abstraction, string utilities, etc. (via MLIR and LLVM support libraries).
This project builds as part of the LLVM External Projects facility (see documentation for the LLVM_EXTERNAL_PROJECTS config setting).
You need to have the following software installed and in your PATH or discoverable by CMake:
- git
- ninja
- LLVM prerequisites and a C/C++ toolchain
- Protobuf (compiler, runtime, and headers)
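Before configuring, it can help to confirm that the command-line tools are actually on your PATH. The following is a minimal sketch, not part of this project: the helper name find_missing_tools is hypothetical, and the binary names (in particular protoc for the Protobuf compiler) are the usual ones but may differ on your system. It does not check headers, libraries, or the C/C++ toolchain.

```shell
# Print the name of every tool from the argument list that is not on PATH.
# (Hypothetical helper for illustration only.)
find_missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Assumed binary names; adapt the list to your platform.
missing=$(find_missing_tools git ninja cmake protoc python3)
if [ -n "$missing" ]; then
  # Left unquoted on purpose so that the names are joined on one line.
  echo "Missing tools:" $missing
else
  echo "All listed tools were found."
fi
```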
Define the following environment variables (adapted to your situation), ideally making them permanent in your $HOME/.bashrc or in the activate script of your Python virtual environment (see below):
export SUBSTRAIT_MLIR_SOURCE_DIR=$HOME/git/substrait-mlir-contrib
export SUBSTRAIT_MLIR_BUILD_DIR=${SUBSTRAIT_MLIR_SOURCE_DIR}/build
Clone this project recursively into the directory given by SUBSTRAIT_MLIR_SOURCE_DIR:
git clone --recursive \
https://github.com/substrait-io/substrait-mlir-contrib \
${SUBSTRAIT_MLIR_SOURCE_DIR}
If you have already cloned non-recursively, and every time a submodule is updated, run the following command inside the cloned repository instead:
cd ${SUBSTRAIT_MLIR_SOURCE_DIR}
git submodule update --recursive --init
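To see whether the update is needed, you can inspect the output of git submodule status, which prefixes the entry of every uninitialized submodule with a minus sign. The sketch below relies on that documented output format; the helper name uninitialized_submodules is hypothetical and not part of this project.

```shell
# Read `git submodule status` output on stdin and print the path of every
# submodule that has not been initialized yet (lines starting with "-").
# (Hypothetical helper for illustration only.)
uninitialized_submodules() {
  awk '/^-/ { print $2 }'
}

# Only query git if the source directory is actually a checkout.
if [ -n "${SUBSTRAIT_MLIR_SOURCE_DIR:-}" ] && [ -e "${SUBSTRAIT_MLIR_SOURCE_DIR}/.git" ]; then
  git -C "${SUBSTRAIT_MLIR_SOURCE_DIR}" submodule status --recursive \
    | uninitialized_submodules
fi
```

If this prints any path, running the submodule update command above should initialize it.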
Create a virtual environment, activate it, and install the dependencies from requirements.txt:
python3 -m venv ~/.venv/substrait-mlir
source ~/.venv/substrait-mlir/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install -r ${SUBSTRAIT_MLIR_SOURCE_DIR}/requirements.txt
For details, see the documentation of the MLIR Python Bindings.
Make some paths available in your Python environment by adding the following lines to the end of ~/.venv/substrait-mlir/bin/activate (then source that file again):
export SUBSTRAIT_MLIR_SOURCE_DIR=$HOME/git/substrait-mlir-contrib
export SUBSTRAIT_MLIR_BUILD_DIR=${SUBSTRAIT_MLIR_SOURCE_DIR}/build
export PATH=${SUBSTRAIT_MLIR_BUILD_DIR}/bin:$PATH
Run the command below to set up the build system, possibly adapting it to your needs. For example, you may choose not to compile clang, clang-tools-extra, lld, and/or the examples to save compilation time, or use a different variant than Debug. Similarly, you may want to set -DLLVM_ENABLE_LLD=OFF on some Macs that don't have lld.
cmake \
-DPython3_EXECUTABLE=$(which python) \
-DBUILD_SHARED_LIBS=ON \
-DCMAKE_EXPORT_COMPILE_COMMANDS=TRUE \
-DCMAKE_BUILD_TYPE=Debug \
-DLLVM_ENABLE_PROJECTS="mlir;clang;clang-tools-extra" \
-DLLVM_TARGETS_TO_BUILD="X86" \
-DLLVM_ENABLE_ASSERTIONS=ON \
-DLLVM_INCLUDE_TESTS=OFF \
-DLLVM_INCLUDE_UTILS=ON \
-DLLVM_INSTALL_UTILS=ON \
-DLLVM_BUILD_EXAMPLES=ON \
-DLLVM_EXTERNAL_PROJECTS=substrait_mlir \
-DLLVM_EXTERNAL_SUBSTRAIT_MLIR_SOURCE_DIR=${SUBSTRAIT_MLIR_SOURCE_DIR} \
-DLLVM_ENABLE_LLD=ON \
-DLLVM_CCACHE_BUILD=ON \
-DMLIR_INCLUDE_INTEGRATION_TESTS=ON \
-DMLIR_ENABLE_BINDINGS_PYTHON=ON \
-DMLIR_ENABLE_PYTHON_BENCHMARKS=ON \
-S${SUBSTRAIT_MLIR_SOURCE_DIR}/third_party/llvm-project/llvm \
-B${SUBSTRAIT_MLIR_BUILD_DIR} \
-G Ninja
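As an illustration of such adaptations, the following untested sketch assembles a leaner variant of the command above: it uses only flags that already appear there, but builds only mlir, switches to a Release build, and turns off the examples and lld. Whether this particular combination suits your setup is an assumption. The sketch only prints the resulting command so that you can inspect it first.

```shell
# Fall back to the locations used earlier in this document if unset.
: "${SUBSTRAIT_MLIR_SOURCE_DIR:=$HOME/git/substrait-mlir-contrib}"
: "${SUBSTRAIT_MLIR_BUILD_DIR:=${SUBSTRAIT_MLIR_SOURCE_DIR}/build}"

# Collect the flags as positional parameters (works in any POSIX shell).
set -- \
  -DPython3_EXECUTABLE="$(command -v python3)" \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS=mlir \
  -DLLVM_TARGETS_TO_BUILD=X86 \
  -DLLVM_ENABLE_ASSERTIONS=ON \
  -DLLVM_INCLUDE_UTILS=ON \
  -DLLVM_INSTALL_UTILS=ON \
  -DLLVM_BUILD_EXAMPLES=OFF \
  -DLLVM_EXTERNAL_PROJECTS=substrait_mlir \
  -DLLVM_EXTERNAL_SUBSTRAIT_MLIR_SOURCE_DIR="${SUBSTRAIT_MLIR_SOURCE_DIR}" \
  -DLLVM_ENABLE_LLD=OFF \
  -DMLIR_ENABLE_BINDINGS_PYTHON=ON \
  -S"${SUBSTRAIT_MLIR_SOURCE_DIR}/third_party/llvm-project/llvm" \
  -B"${SUBSTRAIT_MLIR_BUILD_DIR}" \
  -G Ninja

# Print the command for inspection; run `cmake "$@"` instead to configure.
echo cmake "$@"
```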
To build, run:
cd ${SUBSTRAIT_MLIR_BUILD_DIR} && ninja
substrait-opt --help
substrait-translate --help
You can run all tests with the following command:
cd ${SUBSTRAIT_MLIR_BUILD_DIR} && ninja check-substrait-mlir
You may also use lit to run a subset of the tests.
llvm-lit -v ${SUBSTRAIT_MLIR_SOURCE_DIR}/test
llvm-lit -v ${SUBSTRAIT_MLIR_SOURCE_DIR}/test/Target
llvm-lit -v ${SUBSTRAIT_MLIR_SOURCE_DIR}/test/python/dialects/substrait/dialect.py
The MLIR LSP servers allow editors to display as-you-type diagnostics, code navigation, and similar features. In order to extend this functionality to the dialects from this repository, use the following LSP server binaries:
${SUBSTRAIT_MLIR_BUILD_DIR}/bin/mlir-proto-lsp-server
${SUBSTRAIT_MLIR_BUILD_DIR}/bin/tblgen-lsp-server
${SUBSTRAIT_MLIR_BUILD_DIR}/bin/mlir-pdll-lsp-server
In VS Code, this is done via the mlir.server_path, mlir.pdll_server_path, and mlir.tablegen_server_path properties in settings.json.
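As a convenience, the matching settings.json entries can be generated from the build directory. This is a minimal sketch: the function name print_vscode_lsp_settings is hypothetical, and the pairing of each property with a binary is inferred from their names. It prints absolute paths on the assumption that settings.json does not expand shell variables.

```shell
# Fall back to the build location used earlier in this document if unset.
: "${SUBSTRAIT_MLIR_BUILD_DIR:=$HOME/git/substrait-mlir-contrib/build}"

# Print a settings.json fragment that points VS Code at the LSP servers.
# (Hypothetical helper for illustration only.)
print_vscode_lsp_settings() {
  cat <<EOF
{
  "mlir.server_path": "${SUBSTRAIT_MLIR_BUILD_DIR}/bin/mlir-proto-lsp-server",
  "mlir.pdll_server_path": "${SUBSTRAIT_MLIR_BUILD_DIR}/bin/mlir-pdll-lsp-server",
  "mlir.tablegen_server_path": "${SUBSTRAIT_MLIR_BUILD_DIR}/bin/tblgen-lsp-server"
}
EOF
}

print_vscode_lsp_settings
```

Merge the three printed properties into your existing settings.json rather than overwriting the file.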