HashQL

Compiler Pipeline

Three IRs, one pipeline

A query enters as source text and exits as backend-specific artifacts. Between those endpoints, it passes through three intermediate representations, each removing surface complexity while formalizing the structures that downstream stages depend on.

Pipeline overview

The compiler is frontend-agnostic by design. A surface syntax is parsed into an AST, which is lowered into the HIR, normalized into HIR/ANF, then reified into the MIR. From the MIR, execution analysis assigns basic blocks to backends, and code generation produces backend-specific artifacts. Each representation is simpler than the last, and each transformation makes the next one possible.

J-ExprfutureASTcanonicalizeHIRtype checkMIRoptimizeExec AnalysisplacementCode GendispatchPostgreSQLInterpreterEmbedding

Click any stage to expand its passes

J-Expr and the AST

The currently available frontend is J-Expr: a JSON-with-comments analogue of S-expressions. J-Expr was designed for machine generation, not hand-authoring. Most queries will be produced by host-language DSLs (Rust, TypeScript) that generate queries programmatically; a machine-friendly grammar makes these DSLs concise and type-safe. The S-expression shape means the syntax comprises just four forms: underscore, path expression, call expression, and data expression.

Other syntaxes can target the same compiler by producing the same AST. A future declarative layer could lower pattern-matching queries to functional calls; an imperative layer could adopt a Rust-like surface syntax. The AST resolves imports, formalizes special forms (type bindings, let bindings, conditionals) into dedicated nodes, and resolves all paths to their absolute form. Everything after the AST is syntax-independent.

HIR: the tree representation

The HIR preserves the programmer’s intent. Expressions nest freely, closures capture variables implicitly, and evaluation order is determined by the tree structure. This is where type inference and type checking happen: the type system generates subtyping constraints by walking the HIR and solves them to a fixpoint.

The language is an expression language: every construct evaluates to a value and may appear wherever a value is expected. HashQL separates names into two universes, a value universe (runtime constructs) and a type universe (compile-time constructs), so the same identifier can denote both a type and a value without conflict. After type checking, all type metadata is erased.

All interaction with the bi-temporal graph is isolated by graph effects. Graph operations are first-class and composable, but opaque: they can be executed only by the host runtime. This separation ensures that ordinary expressions remain side-effect free and equational reasoning is preserved.

The HIR supports tree-level optimizations (hoisting, dead branch elimination) while maintaining a close correspondence with the source program. But it impedes per-statement analysis: a single expression can contain nested subexpressions, implicit captures, and hidden evaluation dependencies. Placement requires reasoning at instruction granularity, so the HIR must be lowered further.

HIR/ANF: naming everything

Administrative Normal Form bridges the gap between the HIR and the MIR. ANF imposes a naming discipline: every intermediate computation is bound to a variable through a let-binding, and evaluation order becomes explicit in the binding sequence rather than implicit in tree nesting. The result is still tree-shaped, but flat: no nested computations in operand positions. ANF was chosen over Continuation-Passing Style (CPS); the two are interconvertible, but ANF operates on the existing HIR tree directly.

The transformation normalizes nested expressions, assembles graph operation chains into first-class GraphRead nodes (which become effectful terminators in the MIR), and promotes loop-invariant bindings out of graph traversal closures through effect hoisting: a form of LICM that operates by scope promotion rather than CFG preheader insertion. Top-level definitions are wrapped in thunks (zero-argument closures) so that every body in the MIR is explicitly invocable.

MIR: the control-flow graph

The MIR replaces the tree structure with a control-flow graph of basic blocks in SSA form. A program consists of bodies, each a CFG whose basic blocks are sequences of statements terminated by a single terminator. Terminators produce the CFG edges: they return to the caller, transfer control to another block, or suspend execution to yield to the orchestrator.

Because ANF and SSA share the same naming discipline, the translation from HIR/ANF to MIR is largely mechanical. The most involved step is closure conversion: implicit captures in the HIR are reified into explicit environment tuples, packed at the closure construction site and destructured at the callee entry block. The resulting closure aggregate pairs a function pointer with its environment, forming a fat pointer. Thunks (no captures) use thin pointers; type constructors always use fat pointers with an empty environment, because call sites cannot statically distinguish the two.

SSA form

The MIR is in Static Single Assignment form: each local is assigned exactly once. Because every use traces to a unique definition, dataflow analysis is straightforward. At join points where different control-flow paths converge, the MIR uses block parameters rather than phi-nodes. Each basic block declares parameters; each incoming edge supplies arguments. This makes dataflow between blocks explicit in the edge structure, not implicit inside blocks.

Effectful terminators

Graph operations (reads, traversals) are modeled as terminators that suspend execution and yield to the orchestrator. This is a design choice with consequences: unlike a function call that evaluates within the MIR’s execution model, an effectful terminator exits that model entirely. The orchestrator fulfills the request and resumes execution at the successor block.

Modeling effects as terminators means the same representation serves optimization, placement, and runtime dispatch. Control-flow analyses treat effect boundaries as ordinary edges. Transformation passes do not need special cases for graph operations. The cost is that any code after an effectful operation must live in a separate basic block, which increases the block count but enables finer-grained placement.

Control flowshape the CFGEffectfulexits the modelGotobb_n:goto bb_m(a)abb_m1 outgoing edgecarries argumentsSwitchIntbb_n:switchInt(%d)01kbb_abb_bbb_zk outgoing edgesone per armReturnbb_n:return %v%vcaller0 outgoing edgesleaves the bodyUnreachablebb_n:unreachable0 outgoing edgesproves dead codeGraphReadbb_n:GraphRead(q)yieldorchestratorresumebb_m1 outgoing edge + effectyields, then resumes

Five kinds, one per block. The four control-flow terminators differ only in their outgoing-edge structure; GraphRead is the only kind that exits the MIR’s execution model, yielding control to the orchestrator and resuming at the successor.

Instruction set

Statements come in two kinds: assignments and nops. An assignment evaluates an rvalue (load, binary operation, unary operation, aggregate construction, input access, or function application) and writes the result to a destination. Nops are placeholders left by dead code elimination; cleanup passes remove them. Function application is a statement, not a terminator: HashQL is referentially transparent with no exceptions or unwinding, so calls have a single continuation and no alternate control-flow paths.

Terminators come in four control-flow kinds (Goto, SwitchInt, Return, Unreachable) and one effectful kind (GraphRead). SwitchInt generalizes if-else to multi-way branching; because booleans are 1-bit integers, conditional branches use the same mechanism. The MIR has no notion of booleans: the discriminant is lowered to its integer equivalent (false → 0, true → 1).

Transformation pipeline

The MIR undergoes a series of transformation passes that prepare it for execution analysis. The passes are organized around inlining: a canonicalization pass runs before inlining to simplify the CFG, inlining expands call sites to expose per-statement costs, and a post-inlining canonicalization cleans up the result.

Canonicalization is a fixpoint loop of standard compiler optimizations: constant propagation, dead store elimination, CFG simplification, and SSA repair. These run until no pass makes progress. The optimizations are not novel individually; the contribution is applying them systematically to a query language whose functional, referentially transparent nature makes many of them directly applicable without the side-effect complications that imperative languages introduce.

Inlining is aggressive for graph operation bodies (filters, maps) because these are the bodies that undergo execution analysis. Every function call inside a filter body is a placement boundary: the interpreter cannot hand off mid-call to PostgreSQL. Inlining eliminates these boundaries by expanding call sites, exposing the full instruction sequence to the placement solver. Call graph analysis with SCC detection handles mutual recursion by identifying loop breakers.

hgres

HashQL is the query and execution layer at the heart of hgres, a multi-temporal knowledge graph made for use by both people and AI, where queries are typed programs and database systems are compilation targets.

Return home