Introduction

Vaivaswatha N edited this page Aug 15, 2024 · 17 revisions

pliron: Programming Languages Intermediate Representation

Background

For the larger part of my career as a compiler engineer, I've written C++ code, initially working on the GCC compiler, and then mostly LLVM. A few years back I worked on an interpreter (for a DSL) written in OCaml and went on to write a compiler for the same language in OCaml, targeting LLVM. More recently, as part of my day job, I'm now working on a compiler for the Sway language, written in Rust.

With that background, as a personal project, I ventured to start writing an extensible compiler framework in Rust. The design and ideas are mostly based on the MLIR framework. Extensible here means that the compiler does not have a fixed set of operations (opcodes) or type system, but instead can be (almost) arbitrarily extended.
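
To make "extensible" a little more concrete, here is a minimal sketch in Rust of an open set of operations. This is not pliron's actual API: the `Op` trait, the `Registry` type, the two op structs and the `make_*` helpers are hypothetical names made up for illustration. The only point is that operations are registered at runtime behind a trait, instead of being variants of a fixed enum baked into the compiler.

```rust
use std::collections::HashMap;

/// Hypothetical interface implemented by every operation (not pliron's API).
trait Op {
    /// Dialect-qualified name, e.g. "test.constant".
    fn name(&self) -> &'static str;
    /// Textual form of the operation.
    fn print(&self) -> String;
}

struct ConstantOp {
    value: i64,
}

impl Op for ConstantOp {
    fn name(&self) -> &'static str {
        "test.constant"
    }
    fn print(&self) -> String {
        format!("{} {}", self.name(), self.value)
    }
}

struct ReturnOp;

impl Op for ReturnOp {
    fn name(&self) -> &'static str {
        "test.return"
    }
    fn print(&self) -> String {
        self.name().to_string()
    }
}

// Constructors that a (hypothetical) dialect would hand to the registry.
fn make_constant() -> Box<dyn Op> {
    Box::new(ConstantOp { value: 0 })
}

fn make_return() -> Box<dyn Op> {
    Box::new(ReturnOp)
}

/// Maps operation names to constructors. New operations are added by
/// calling `register`, without touching the registry's own code.
#[derive(Default)]
struct Registry {
    ctors: HashMap<&'static str, fn() -> Box<dyn Op>>,
}

impl Registry {
    fn register(&mut self, name: &'static str, ctor: fn() -> Box<dyn Op>) {
        self.ctors.insert(name, ctor);
    }

    /// Returns `None` if no operation with this name has been registered.
    fn create(&self, name: &str) -> Option<Box<dyn Op>> {
        self.ctors.get(name).map(|ctor| ctor())
    }
}

fn main() {
    let mut registry = Registry::default();
    registry.register("test.constant", make_constant);
    // "test.return" is unknown until some dialect registers it.
    assert!(registry.create("test.return").is_none());
    registry.register("test.return", make_return);
    println!("{}", registry.create("test.constant").unwrap().print());
    println!("{}", registry.create("test.return").unwrap().print());
}
```

A closed `enum Opcode { ... }` would be simpler, but every new operation would then require editing the framework itself, which is exactly what an extensible design tries to avoid.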

Preview

A "hello world" IR in pliron, when printed, looks like this:

builtin.module @bar {
  ^block_1v1():
    builtin.func @foo: builtin.function<() -> (builtin.int<si64>)> {
      ^entry_block_2v1():
        c0_op_3v1_res0 = test.constant builtin.integer <0x0: builtin.int<si64>>;
        test.return c0_op_3v1_res0
    }
}

As with MLIR, module, func, constant and return are operations, prefixed with their dialect names. This code declares a module bar containing a function foo that returns a constant 0.

Motivation

For compilers, static analyzers and other related tools written in Rust, the only way to adopt MLIR today is to wrap MLIR's C bindings. Such an endeavour, however, comes at a cost: debugging is hard. Here's an illustrative example:

// Built against LLVM Debug 18.1.2
 1│#include <mlir-c/IR.h>
 2│
 3│int main() {
 4│ MlirContext ctx = mlirContextCreate();
 5│ MlirStringRef filename = mlirStringRefCreateFromCString("foo.mlir");
 6│ MlirLocation loc = mlirLocationFileLineColGet(ctx, filename, 1, 1);
 7│
 8│ MlirModule module1 = mlirModuleCreateEmpty(loc);
 9│ MlirOperation opr1 = mlirModuleGetOperation(module1);
10│
11│ // mlirOperationDestroy(opr1);
12│
13│ mlirOperationDump(opr1);
14│ MlirOperation opr2 = mlirOperationClone(opr1);
15│ mlirOperationDump(opr2);
16│
17│ return 0;
18│}

Running this code prints the following:

module {
}
module {
}

If line 11 is uncommented, then the following is printed and the program crashes.

"builtin.module"() ({
}) : () -> ()
test.out: llvm-project/mlir/lib/IR/Region.cpp:79: void
mlir::Region::cloneInto(mlir::Region *, Region::iterator, mlir::IRMapping
&): Assertion `this != dest && "cannot clone region into itself"' failed.
Aborted (core dumped)

What makes this hard to debug?

  1. Even after opr1 is destroyed, dumping it still works, giving the impression that everything is fine at that point.
  2. The crash message gives no indication of why the crash happened. To debug this in a large program, a developer must be familiar with MLIR internals, which is uncommon among Rust programmers using this API. Often, Rust programmers may not even be fluent in C++.
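
For contrast, here is a minimal sketch of one common Rust technique for making stale handles fail loudly and locally: a generational arena. It is a self-contained toy and not MLIR's or pliron's actual implementation; `Arena`, `Handle` and `Slot` are hypothetical names used only for this illustration. Each handle remembers the generation of the slot it was created for, so a handle to a destroyed object is detected on the next access instead of silently appearing to work.

```rust
/// Handle to a slot: an index plus the generation it was created in.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Handle {
    index: usize,
    generation: u32,
}

struct Slot<T> {
    generation: u32,
    value: Option<T>,
}

struct Arena<T> {
    slots: Vec<Slot<T>>,
}

impl<T> Arena<T> {
    fn new() -> Self {
        Arena { slots: Vec::new() }
    }

    fn insert(&mut self, value: T) -> Handle {
        // Reuse the first free slot, if any; its generation was already
        // bumped when the previous occupant was removed.
        if let Some(index) = self.slots.iter().position(|s| s.value.is_none()) {
            let slot = &mut self.slots[index];
            slot.value = Some(value);
            return Handle { index, generation: slot.generation };
        }
        self.slots.push(Slot { generation: 0, value: Some(value) });
        Handle { index: self.slots.len() - 1, generation: 0 }
    }

    fn remove(&mut self, handle: Handle) -> Option<T> {
        let slot = self.slots.get_mut(handle.index)?;
        if slot.generation != handle.generation {
            return None;
        }
        let value = slot.value.take()?;
        // Invalidate every outstanding handle to this slot.
        slot.generation += 1;
        Some(value)
    }

    fn get(&self, handle: Handle) -> Option<&T> {
        let slot = self.slots.get(handle.index)?;
        if slot.generation == handle.generation {
            slot.value.as_ref()
        } else {
            None
        }
    }
}

fn main() {
    let mut ops: Arena<String> = Arena::new();
    let opr1 = ops.insert("builtin.module".to_string());
    assert!(ops.get(opr1).is_some());

    ops.remove(opr1);
    // The stale handle is rejected right here, not at some later clone.
    assert!(ops.get(opr1).is_none());

    // Even when the slot is reused, the old handle stays invalid.
    let opr2 = ops.insert("builtin.func".to_string());
    assert_eq!(opr2.index, opr1.index);
    assert!(ops.get(opr1).is_none());
    assert!(ops.get(opr2).is_some());
}
```

With this scheme, the equivalent of dumping `opr1` after destroying it would return `None` at the point of misuse, which is much easier to trace than an assertion failure deep inside a later, unrelated call.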

LLVM/MLIR C-API: Limitations

The type-system exposed by the llvm-c (or mlir-c) API is fundamentally weaker than what can be natively expressed in Rust (or even C++).

As an example, the C++ API of LLVM provides an IntegerType::getBitWidth method. Its counterpart in the C-API is LLVMGetIntTypeWidth(LLVMTypeRef IntegerTy). The argument here is a generic LLVMTypeRef. Thus the type-system does not prevent us from calling this function with a type other than IntegerType. In the best case (with a debug-build), this hits an assert at runtime, but otherwise we end up with a non-deterministic value or a crash.

To overcome the type-system limitation, projects such as inkwell define Rust types over the llvm-c types to provide a safer API. This approach has its own limitations, however, because we cannot always validate the inputs to an llvm-c function. For example, GEP indices cannot be validated before we construct a GEP, leading to possible crashes.
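
The general idea of layering typed wrappers over an untyped handle can be sketched as follows. This is a self-contained toy, not inkwell's or llvm-c's actual API: `AnyType`, `TypeKind` and `IntType` are hypothetical names, with `AnyType` standing in for a generic handle such as `LLVMTypeRef`. The bit-width query exists only on `IntType`, and the only way to obtain an `IntType` is through a checked conversion.

```rust
use std::convert::TryFrom;

/// Kind tag carried by the untyped handle (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq)]
enum TypeKind {
    Integer { bits: u32 },
    Float,
}

/// Untyped handle, analogous in spirit to a generic type reference.
#[derive(Clone, Copy, Debug, PartialEq)]
struct AnyType {
    kind: TypeKind,
}

/// Typed wrapper that can only be built from an integer type.
#[derive(Clone, Copy, Debug, PartialEq)]
struct IntType {
    bits: u32,
}

impl TryFrom<AnyType> for IntType {
    type Error = String;

    /// The kind is checked once, at the conversion boundary.
    fn try_from(ty: AnyType) -> Result<Self, Self::Error> {
        match ty.kind {
            TypeKind::Integer { bits } => Ok(IntType { bits }),
            other => Err(format!("expected an integer type, found {:?}", other)),
        }
    }
}

impl IntType {
    /// Only available on `IntType`, so it cannot be called on a float.
    fn bit_width(&self) -> u32 {
        self.bits
    }
}

fn main() {
    let si64 = AnyType { kind: TypeKind::Integer { bits: 64 } };
    let float = AnyType { kind: TypeKind::Float };

    let int_ty = IntType::try_from(si64).expect("si64 is an integer type");
    println!("bit width: {}", int_ty.bit_width());

    // The misuse becomes a recoverable error instead of an assert or garbage.
    assert!(IntType::try_from(float).is_err());
}
```

The wrapper can only check what it is able to inspect, though. When the validity of an input depends on logic that the underlying C API does not expose, the check has to be re-implemented on the Rust side or skipped.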

This problem is further amplified by the fact that the llvm-c API does not expose many functionalities that are available in the C++ API. For example, when constructing an ArrayType, one could call ArrayType::isValidElementType in the C++ API to pre-validate the element type, but this is not available in the C-API. Similarly, the LLVM community was reluctant to expose GetElementPtrInst::getIndexedType, a public C++ method, in the LLVM-C API. Without it, we would need to re-implement the method ourselves to validate the indices before building a GEP.

More importantly, the C-API is limited in the higher-level compiler functionality that it provides. For example:

  • The MLIR-C API does not provide a means to create new dialects, operations, types or interfaces; it only allows using what is already defined in the MLIR codebase.
  • The LLVM-C API does not provide direct access to many of the analyses and transformations that are available in LLVM. For example, one cannot get the dominator tree of a function or run a SCEV analysis from the C-API.

In other words, the C-API is designed to interact with the compiler, not to extend it.

Finally, and obviously, Rust's static memory-safety guarantees are lost when interacting with a C++ library: they extend only to the code outside the Rust wrappers. A framework written natively in Rust keeps those guarantees throughout, as long as it stays within safe Rust.

Current Status

In its current state, pliron is a compiler infrastructure and not yet a useful compiler. In other words, the tools and data structures to represent an IR (or multiple dialects of it) are mostly there, but no useful algorithms (analyses / optimizations) are implemented yet. We have a proof-of-concept LLVM-IR dialect that is capable of representing a simple Fibonacci program.

What next?

In its current state, pliron only demonstrates that it is possible (and practical) to write an MLIR-like extensible compiler in Rust. There is plenty of work left to enable production use of pliron.

  1. Provide a proof-of-concept dialect for the cranelift IR.
  2. Complete the LLVM dialect.
  3. Add support for symbol tables.
  4. Generate print and parse functions for operations, types and attributes from a meta-language in derive macros. See discussion and a possible syntax.
  5. Integrate suitable APInt and APFloat libraries for numeric constants.

and a whole lot more ...

Further reading:

  1. A comparison of pliron with other compiler frameworks