Introduction
This repository, dboxide, is a rewrite of a DBML parser. It is a learning project to explore concepts in language theory, inspired by the work in the type-theory repository. The project builds on lessons learned from a previous parser, @dbml/parse.
The primary motivation for this rewrite is to address the shortcomings of previous versions and explore more advanced compiler design concepts. The original PEG.js parser was slow and lacked resilience, making it difficult to build a good developer experience (e.g., in a language server). The first rewrite, @dbml/parse, brought performance improvements and basic language services but suffered from design flaws that made it fragile and hard to maintain.
This project is not just about parsing DBML. It’s a research ground for:
- Compiler infrastructure: Investigating best practices for lexing, parsing, and representing code in a way that supports incremental updates and rich analysis.
- Developer experience: Building a foundation for a high-quality language server with features like precise error reporting, autocompletion, and code navigation.
- Modern techniques: Applying query-based (on-demand) compilation, inspired by Salsa, to create a scalable and efficient analysis engine.
The goal is to create a parser that is not only fast and correct but also serves as a solid foundation for a new generation of DBML tooling.
Goals
This document outlines the primary goals for dboxide. These goals are divided into two main categories: Project Goals, which focus on the tangible outcomes of the project, and Learning Goals, which focus on the practical techniques and concepts to be explored during development.
Project Goals
These goals are about building a high-quality, feature-rich, and performant DBML parser and associated tooling.
- Resilient and high-quality parser: Create a parser that can gracefully handle syntax errors, provide meaningful error messages, and produce a lossless syntax tree. This is crucial for a good developer experience.
- Language server implementation: Build a functional language server for DBML that provides features like:
- Autocompletion
- Go to definition
- Find references
- Real-time diagnostics
- Performance: The parser should be significantly faster than previous implementations, leveraging the performance benefits of Rust.
- SQL dialect awareness: The parser should be able to understand and handle different SQL dialects, making it more versatile.
- Modularity and extensibility: Design the parser and its components in a way that is easy to maintain, extend, and reuse.
Learning Goals
This project is also an opportunity to learn and apply modern compiler construction techniques. The focus is on understanding the “how” and “why” behind these techniques.
- Lexing and parsing techniques:
- Token representation: Investigate the ideal way to represent syntax tokens, including their position, kind, and value.
- Lossless syntax trees: Learn how to build and work with lossless syntax trees that preserve all source information, including whitespace and comments.
- Resilient parsing: Master techniques for error recovery and resilience in parsers.
- Incremental parsing and red-green trees: Understand and implement incremental parsing using red-green trees to efficiently re-parse only the changed parts of a file.
- Query-based compilation:
- Salsa framework: Learn how to use the `salsa` framework to build a query-based compiler architecture.
- Incremental computation: Understand how query-based systems enable incremental computation, where only the necessary computations are re-run when the input changes.
- Parallelism: Explore opportunities for parallelizing the compilation process to improve performance.
- Type-driven validation: Learn to apply the “Parse, don’t validate” principle to create a more robust and type-safe parser.
- Effective testing strategies: Learn how to write effective and maintainable unit and integration tests for a compiler.
Research
This section documents the research done for the dboxide project. It is divided into two main parts:
- Research Plan: This page outlines the research plan, including the list of topics to be investigated.
- Resources: This page documents the knowledge learnt from various resources.
- Glossary: This page contains brief explanations of the concepts and techniques researched for the project.
Research Plan
This page outlines the research plan for dboxide. The goal is to investigate and understand the following concepts and techniques, which are crucial for building a modern and efficient parser and language server.
Research Topics
- Lexing & Token representation:
- Ideal representation of a syntax token (source offset/pointer, kind, processed value).
- Handling of non-ASCII characters (UTF-8).
- Handling of trivia and error tokens in a lossless syntax tree.
- Handling of multi-word and unreserved keywords.
- Handling of ambiguous tokens (e.g., `<` as a less-than sign or a generic bracket).
- Efficient storage and computation of token positions for incremental parsing.
- On-demand lexing vs lexing all at once.
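As a starting point for these questions, here is a minimal sketch (all names hypothetical) of one candidate token representation: tokens store only kind and byte length, and absolute offsets are derived on demand, so an edit does not invalidate stored positions of unrelated tokens.

```rust
// Hypothetical sketch: a token stores only its kind and byte length;
// absolute offsets are recomputed on demand, which helps incremental reuse.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    Ident,
    Whitespace, // trivia is kept, so the tree can be lossless
    Error,      // unrecognized input becomes an error token, not a failure
}

#[derive(Debug, Clone, Copy)]
struct Token {
    kind: TokenKind,
    len: u32, // length in bytes; a UTF-8-aware lexer must keep char boundaries
}

/// Recompute the absolute byte offset of token `i` on demand.
fn offset_of(tokens: &[Token], i: usize) -> u32 {
    tokens[..i].iter().map(|t| t.len).sum()
}

fn main() {
    let tokens = [
        Token { kind: TokenKind::Ident, len: 5 },      // "Table"
        Token { kind: TokenKind::Whitespace, len: 1 }, // " "
        Token { kind: TokenKind::Ident, len: 3 },      // "abc"
    ];
    assert_eq!(offset_of(&tokens, 0), 0);
    assert_eq!(offset_of(&tokens, 2), 6);
}
```

The trade-off to evaluate: recomputing offsets costs time, while storing them costs invalidation work on every edit.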
- String interning:
- What is string interning and why is it useful in a compiler?
- Where and how is it typically applied in a compilation pipeline (e.g., during lexing, parsing, or semantic analysis)?
- How can it be implemented efficiently?
- What are the trade-offs of using string interning?
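To make these questions concrete, here is a minimal sketch of a string interner, assuming a simple `Rc`-based design (illustrative only, not how any particular compiler implements it): identical strings are stored once, and equality becomes a pointer comparison.

```rust
use std::collections::HashMap;
use std::rc::Rc;

// Minimal sketch of a string interner (names are hypothetical).
#[derive(Default)]
struct Interner {
    map: HashMap<Rc<str>, ()>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> Rc<str> {
        // Reuse the existing allocation if this string was seen before.
        if let Some((k, _)) = self.map.get_key_value(s) {
            return Rc::clone(k);
        }
        let k: Rc<str> = Rc::from(s);
        self.map.insert(Rc::clone(&k), ());
        k
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("users");
    let b = interner.intern("users");
    // Same allocation: equality is a pointer check, not a byte scan.
    assert!(Rc::ptr_eq(&a, &b));
}
```

The obvious trade-off visible even here: interned strings live as long as the interner, so memory is traded for fast equality and deduplication.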
- Parsing & Syntax tree:
- Good representation for a lossless syntax tree.
- Storing error nodes and partial nodes in the lossless syntax tree.
- Resilient parsing and error-recovery techniques.
- Designing good error messages.
- Incremental parsing & Red-green trees:
- What are red-green trees and how do they work?
- How can they be used to implement incremental parsing?
- What are the performance implications of using red-green trees?
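A minimal illustration of the red-green split (names hypothetical, heavily simplified compared to real implementations like `rowan`): green nodes are immutable, position-independent, and shared; red nodes are materialized on demand and know their absolute offset.

```rust
use std::rc::Rc;

// Green layer: immutable, position-independent, freely shared between trees.
#[derive(Debug)]
enum Green {
    Token { text: String },
    Node { children: Vec<Rc<Green>> },
}

impl Green {
    fn text_len(&self) -> usize {
        match self {
            Green::Token { text } => text.len(),
            Green::Node { children } => children.iter().map(|c| c.text_len()).sum(),
        }
    }
}

// Red layer: built lazily while descending; carries absolute offsets.
struct Red {
    green: Rc<Green>,
    offset: usize,
}

impl Red {
    fn children(&self) -> Vec<Red> {
        let mut out = Vec::new();
        if let Green::Node { children } = &*self.green {
            let mut offset = self.offset;
            for c in children {
                out.push(Red { green: Rc::clone(c), offset });
                offset += c.text_len();
            }
        }
        out
    }
}

fn main() {
    let tok = |s: &str| Rc::new(Green::Token { text: s.to_string() });
    let tree = Rc::new(Green::Node {
        children: vec![tok("Table"), tok(" "), tok("users")],
    });
    let root = Red { green: tree, offset: 0 };
    let offsets: Vec<usize> = root.children().iter().map(|r| r.offset).collect();
    assert_eq!(offsets, vec![0, 5, 6]);
    // Because green nodes carry no positions, an edit only rebuilds the spine
    // from the changed node to the root; untouched siblings are shared as-is.
}
```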
- Program analysis & Query-based compiler:
- Utilizing a query-based architecture (like `salsa`) for program analysis, lexing, and parsing.
- Module system features, including module resolution.
- Name resolution and related problems.
- Other topics:
- Language server implementation
- Parallelism in compilation
- SQL dialect awareness
Resources
This section documents the knowledge learnt from various sources during the implementation of dboxide:
- rust-analyzer: An implementation of LSP for the Rust programming language - a good reference model.
- libsyntax: The Rust compiler’s crate that documents some design notes on tree structure and parser.
- salsa: An incremental computation framework for building responsive compilers and language servers.
rust-analyzer
This section contains resources and notes about the rust-analyzer project.
rust-analyzer is an LSP implementation for Rust, used by many code editors.
Official site: Link.
High-Level Architecture & Conventions
This section describes the architecture of rust-analyzer.
Official site: Link.
- rust-analyzer input/output:
  - Input (Ground state): Source code data from the client. Everything is kept in memory.
    - Mapping from file paths to their contents.
    - Project structure metadata represented as a crate graph (crate roots, `cfg` flags, crate dependencies).
  - Output (Derived state): A “structured semantic model” of the code.
    - A representation of the project that is fully resolved - type-wise and reference-wise.
- Optimizations:
  - Incremental:
    - Input can be a delta of changes.
    - Output can be a fresh code model.
  - Lazy: The output is computed on-demand.
Parser - parser Crate
- A hand-written recursive descent tree-agnostic parser.
- Output: A sequence of events like “start node” and “finish node”, based on Kotlin’s parser, which can be used to learn about dealing with syntax errors and incomplete input.
- Some traits (`TreeSink` and `TokenSource`) are used to bridge the tree-agnostic parser with `rowan` trees.
Architecture Invariant - Tree-Agnostic Parser
- The parser functions as a pure transformer, converting one flat stream of events into another.
- Dual independence: The parser is not locked into:
- A specific tree structure (output format).
- A specific token representation (input format).
- Benefits:
- Token independence allows using the same logic to parse:
- Standard source code (text -> tokens).
- Macro expansion (token trees -> tokens).
- Synthetic code generated programmatically.
- Tree independence allows easily varying the syntax tree implementation + light-parsing.
- Avoid allocation of tree nodes.
- For tasks like “find all function names in this 10k line file,” the parser can simply emit “DefineName” events. A listener catches those names and ignores everything else, finishing the task in a fraction of the time.
- Token independence allows using the same logic to parse:
Architecture Invariant - Infallible Parser
- Parsing never fails.
- Parser returns `(T, Vec<Error>)`.
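The shape of such an infallible API can be sketched as follows (the toy grammar and all names here are invented for illustration; this is not rust-analyzer’s actual code): parsing always yields a tree plus a list of errors, never an `Err`.

```rust
// Sketch of the "infallible parser" shape: malformed input becomes an
// error node in the tree, and the function never fails outright.
#[derive(Debug)]
enum Node {
    Table(String),
    Error(String),
}

fn parse(input: &str) -> (Vec<Node>, Vec<String>) {
    let mut nodes = Vec::new();
    let mut errors = Vec::new();
    for line in input.lines() {
        match line.strip_prefix("Table ") {
            Some(name) => nodes.push(Node::Table(name.to_string())),
            None => {
                // Recover: record the bad input and keep going.
                nodes.push(Node::Error(line.to_string()));
                errors.push(format!("expected a table declaration, got `{line}`"));
            }
        }
    }
    (nodes, errors)
}

fn main() {
    let (nodes, errors) = parse("Table users\n???");
    assert_eq!(nodes.len(), 2); // the tree is complete even with bad input
    assert_eq!(errors.len(), 1);
}
```

The caller decides what to do with the errors; the tree is always usable for features like highlighting.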
Syntax Tree Structure & Parser - syntax Crate
Based on libsyntax-2.0.
- `rowan`: The underlying library used to construct the raw, untyped syntax trees (green/red trees).
- `ast` internal crate: Provides a type-safe API layer on top of the raw `rowan` tree.
- `ungrammar` internal crate: A grammar description format used to automatically generate the `syntax_kinds` and `ast` modules.
Architecture Invariant - syntax Crate as an API Boundary
- The `syntax` crate knows nothing about `salsa` and LSP. It’s an API boundary.
- Benefits:
- Allows it to be used for lightweight tooling without needing a full build or semantic analysis.
Architecture Invariant - Syntax Tree as a Value Type
- The syntax tree is self-contained, defined solely by its contents without relying on global context (like interners).
- Pure syntax: Unlike traditional compiler trees, it strictly excludes semantic information (such as type inference data).
- Benefits:
  - IDE optimization: Critical for tools like rust-analyzer, where assists and refactors require frequent tree modifications.
  - Simplified transformation: Keeping the tree “dumb” (purely structural) allows for easy code manipulation without the complexity of managing semantic state during edits.
Architecture Invariant - Syntax Tree per File
- A syntax tree is built for a single file.
- Benefits: Enable parallel parsing of all files.
Architectural Invariant - Incomplete Syntax Tree
- Syntax trees are designed to tolerate incomplete or invalid code (common during live editing).
- AST accessor methods return `Option` types to safely handle missing data.
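A small illustration of the same convention (the types here are invented for illustration): an accessor over a partially typed declaration returns `Option` instead of panicking, so features keep working on incomplete code.

```rust
// Sketch: AST accessors over an incomplete tree return Option instead of
// panicking. `FnDef` is a made-up stand-in for a real AST node type.
struct FnDef {
    name: Option<String>, // the user may still be typing `fn |`
}

fn completion_label(def: &FnDef) -> String {
    match def.name.as_deref() {
        Some(name) => format!("fn {name}"),
        None => "fn <missing name>".to_string(), // degrade gracefully
    }
}

fn main() {
    assert_eq!(completion_label(&FnDef { name: None }), "fn <missing name>");
    assert_eq!(
        completion_label(&FnDef { name: Some("run".into()) }),
        "fn run"
    );
}
```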
Query database - base-db Crate
- `salsa`: A crate used for incremental and on-demand computation.
  - `salsa` resembles a key-value store.
  - `salsa` can compute derived values with specified functions.
- Define most input queries.
Architecture Invariant - File-System Agnostic
- Nothing is known about the file system & file paths.
- `FileId`: An opaque type that represents a file.
Analyzer (Macro expansion, Name resolution, Type inference) - hir-xxx Crates
- `hir-expand`: Macro expansion.
- `hir-def`: Name resolution.
- `hir_ty`: Type inference. (Why does this one use an underscore?)
- Define various IRs of the core.
- `hir-xxx` is ECS-based (Entity-Component-System):
  - ECS architecture: Instead of rich objects, compiler entities (like functions or structs) are represented as raw integer IDs (handles), similar to game entities.
  - Database-driven: You cannot access data directly from an ID. You must query the central Salsa database (e.g., `db.function_data(id)`), which stores the actual content in “component” arrays.
- Zero abstraction: The code is intentionally explicit about database access. It avoids helper methods to keep dependency tracking transparent and overhead low.
- These crates “lower” (translate) Rust syntax into logic predicates, allowing the `chalk` engine to solve complex trait bounds and type inference.
Architecture Invariant - Incremental
- The separation of “Identity” (ID) from “Data” in ECS allows rust-analyzer to update only changed data without breaking references to the ID elsewhere, enabling millisecond-level updates.
High-Level IR - hir Crate
- An API boundary for consuming rust-analyzer as a library.
- `hir` acts as the high-level API boundary, wrapping internal raw IDs (ECS-style) into semantic structs (e.g., `Function`) to provide a familiar object-oriented interface for library consumers.
- “Thin handle” pattern: These structs hold no data (only the ID) and require the `db` to be passed into every method call (e.g., `func.name(db)`), effectively bridging the stateless handles with the stateful Salsa database.
- Analogy: Internally, the ECS-style code is like SQL & `hir` is like an ORM.
  - Syntax inversion (object-oriented vs functional):
    - In pure ECS, logic lives in external systems (e.g., `db.function_visibility(id)`).
    - The `hir` crate inverts this to an object-oriented style (`func.visibility(db)`), making the API discoverable via IDE autocomplete.
  - Encapsulated “joins”:
    - Pure ECS requires you to manually query multiple tables to piece together information (e.g., get parent module ID → look up module data → find visibility).
    - The `hir` crate abstracts these complex multi-step database lookups into single, coherent methods.
  - Semantic types:
    - ECS deals with efficient storage (raw `u32` IDs).
    - `hir` deals with high-level meaning, exposing semantic types (like struct `Type`) rather than implementation details (like struct `TypeId`).
Architecture Invariant - Inert Data Structure
- `hir` presents a fully resolved, inert view of the code, abstracting away the dynamic computations occurring in internal crates.
  - “Inert” here is relative to the `db` object.
- Syntax-to-HIR bridge: It manages the complex one-to-many mapping between raw syntax and semantic definitions (via the `Semantics` type).
- The “Uber-IDE” pattern: To resolve a specific syntax node to an HIR entity (essential for “Go to Definition”), it employs a recursive strategy used by Roslyn and Kotlin: it resolves the syntax parent to an HIR owner, then queries that owner’s children to re-identify the target node.
IDE - ide-xxx Crates
- Top-level API boundary: The ultimate entry point for external clients (LSP servers, text editors) to interact with rust-analyzer.
- `ide` consumes the semantic model provided by the `hir` crate to implement concrete user features like code completion, goto definition, and refactoring.
- `ide` is protocol-agnostic, designed to be used via LSP, custom protocols (like FlatBuffers), or directly as a library within an editor.
- This crate introduces the concept of change over time:
  - `AnalysisHost`: The mutable state container where you `apply_change`.
  - `Analysis`: An immutable, transactional snapshot of the state used for querying.
- Modular architecture:
  - `ide`: The public facade and home for smaller features.
  - `ide-db`: Shared infrastructure (e.g., reference search).
  - `ide-xxx`: Isolated crates for major features (completion, diagnostics, assists, SSR).
Architecture Invariant - View Layer
- View/ViewModel layer: `ide` acts as the “View” (MVC) or “ViewModel” (MVVM), translating complex compiler data into simple, editor-friendly terms (offsets, text labels) rather than internal definitions or syntax trees.
- The API is built with POD types. All inputs and outputs are conceptually serializable (no complex object graphs or HIR types exposed).
- The boundary is explicitly drawn at the “UI” level, following the philosophy popularized by the Language Server Protocol. It talks in the language of the text editor, not the language of the compiler.
rust-analyzer Crate
- Define the binary for the language server -> The entry point.
- It acts like the network/protocol adapter that connects the pure logic of the `ide` crate to the outside world -> functional core, imperative shell.
Architecture Invariant - LSP & JSON Awareness
- rust-analyzer is the only place where LSP types and JSON serialization exist.
- Lower crates (`ide`, `hir`) remain pure and protocol-agnostic. They are forbidden from deriving `Serialize` or `Deserialize` for LSP purposes.
- rust-analyzer maintains its own set of serializable types. It manually converts the `ide` crate’s Rust-native data structures (like `TextRange`) into LSP’s wire-format structures (like `Range` with line/character) before sending them over the wire.
Architecture Invariant - Protocol-Wise Statelessness
- The server is stateless, in the sense that it doesn’t know about the previous requests.
Utilities - stdx Crate
- rust-analyzer avoids small helper crates.
- `stdx` is the crate to store all small reusable utilities.
Macro Crates
- Core abstraction (`tt`): Macros are defined purely as `TokenTree` → `TokenTree` transforms, isolated from other compiler parts. The `tt` crate defines this structure (single tokens or delimited sequences).
- Declarative macros (`mbe`): The `mbe` crate implements “Macros By Example” (`macro_rules!`). It handles parsing, expansion, and the translation between the IDE’s syntax trees and the raw token trees.
- Procedural macros: Proc-macros run in a separate process to isolate the IDE from user code crashes.
  - Server (`proc-macro-srv`): Loads the dynamic libraries (built by Cargo) and executes the macros.
  - Client (`proc-macro-api`): Communicates with the server, sending/receiving token trees.
Architecture Invariant - Isolation
- Because arbitrary macro code can panic or segfault (crashing the editor), rust-analyzer executes them in a separate process. This allows the main IDE to survive fatal errors and recover gracefully.
- salsa’s incremental system assumes all functions are pure (deterministic). Since proc-macros can be non-deterministic (e.g., reading external files or random numbers), they violate this core assumption and require special handling to prevent database corruption or infinite invalidation loops.
Virtual File System - vfs-xxx and paths Crates
- Virtual file system (VFS): These crates provide an abstraction layer that generates consistent snapshots of the file system, insulating the compiler from raw, messy OS paths.
- The architecture does not assume a single unified file system. A single rust-analyzer process can serve multiple remote machines simultaneously, meaning the same path string could exist on two different machines and refer to different content.
- “Witness” API: To resolve this ambiguity, path APIs generally require a “file system witness” (an existing anchor path) to identify which specific file system context the operation targets.
Interning - intern Crate
- Use `Arc` (atomic reference counting) to ensure identical data (like strings or paths) is stored only once in memory.
- Optimized for “value types” that are defined by their content (e.g., `std::vec::Vec`), rather than “entities” defined by an ID (e.g., `Function #42`).
- db-independent: Unlike salsa’s integer IDs, `Interned<T>` owns its data, allowing access and inspection without needing a reference to the compiler database (`db`).
- Interning enables instant equality checks by comparing memory pointers instead of scanning content, which is critical for frequently compared items like file paths.
- Interning serves as a lower-level optimization layer for static, immutable data that doesn’t require the full overhead of incremental dependency tracking.
Architectural Policies
Stability Guarantees
- rust-analyzer avoids new stability guarantees to move fast.
- The internal `ide` API is explicitly unstable.
- Stability is only guaranteed at the LSP level (managed by the protocol) and at the input level (Rust language/Cargo).
- De-facto stability: `rust-project.json` became stable implicitly by virtue of having users — a lesson to explicitly mark APIs as unstable/opt-in before release.
Code Generation
- The APIs for syntax trees (`syntax::ast`) and manual sections (features, assists, config) are generated automatically.
- To simplify builds, rust-analyzer does not use itself for codegen. It uses `syn` and manual string parsing instead.
Cancellation (Concurrency)
- The problem: If the user types while the IDE is computing (e.g., highlighting), the result is immediately stale.
- The solution: The salsa database maintains a global revision counter.
  - When input changes, the counter is bumped.
  - Old threads checking the counter notice the mismatch and panic with a special `Canceled` token.
  - The `ide` boundary catches this panic and converts it into a `Result<T, Canceled>`.
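The mechanism can be sketched with the standard library alone (a deliberate simplification of what salsa actually does; all names are hypothetical): a long computation periodically compares its captured revision against a global counter and unwinds when an edit has bumped it, and the boundary converts the unwind into a `Result`.

```rust
use std::panic;
use std::sync::atomic::{AtomicU64, Ordering};

// Global revision counter: bumped on every input change.
static REVISION: AtomicU64 = AtomicU64::new(0);

struct Canceled;

fn check_canceled(started_at: u64) {
    // Cheap check sprinkled through long computations.
    if REVISION.load(Ordering::SeqCst) != started_at {
        panic::panic_any(Canceled);
    }
}

fn highlight(started_at: u64) -> Vec<&'static str> {
    let mut out = Vec::new();
    for token in ["Table", "users", "{", "}"] {
        check_canceled(started_at);
        out.push(token);
    }
    out
}

fn main() {
    panic::set_hook(Box::new(|_| {})); // keep the cancellation unwind quiet
    let started_at = REVISION.load(Ordering::SeqCst);
    REVISION.fetch_add(1, Ordering::SeqCst); // simulate the user typing
    // The boundary catches the panic and turns it into a Result.
    let result: Result<Vec<&str>, Canceled> =
        panic::catch_unwind(|| highlight(started_at)).map_err(|_| Canceled);
    assert!(result.is_err()); // stale computation was canceled, not completed
}
```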
Testing Strategy
- Tests are concentrated on three system boundaries:
  - Outer (`rust-analyzer` crate): “Heavy” integration tests via LSP/stdio. Validates the protocol but is slow (reads real files).
  - Middle (`ide` crate): The most important layer. Tests `AnalysisHost` (simulating an editor) against expectations.
  - Inner (`hir` crate): Tests semantic models using rich types and snapshot testing (via the `expect` crate).
Key Testing Invariants
- Data-driven: Tests use string fixtures (representing multiple files) rather than calling API setup functions manually. This allows significant API refactorings.
- No `libstd`: Tests do not link to `libstd`/`libcore` to ensure speed; all necessary code is defined within the test fixture.
Error Handling
- No IO in core: Internal crates (`ide`, `hir`) are pure and never fail (no `Result`). They return partial data plus errors: `(T, Vec<Error>)`.
- Panic resilience: Since bugs are inevitable, every LSP request is wrapped in `catch_unwind` so a crash in one feature doesn’t kill the server.
- Macros: Uses `always!` and `never!` macros to handle impossible states gracefully.
Observability
- Profiling: Includes a custom low-overhead hierarchical profiler (`hprof`) enabled via env vars (`RA_PROFILE`).
Serialization
- The trap of ease: While `#[derive(Serialize)]` is easy to add, it creates rigid IPC boundaries (backward-compatibility contracts) that are extremely difficult to change later.
- To strictly preserve internal flexibility, types in core crates like `ide` and `base_db` are not serializable by design.
- Serialization is forced to the “edge” (the client). External clients must define their own stable schemas (e.g., `rust-project.json`) and manually convert them into internal structures, isolating the core compiler from protocol versioning issues.
Coding Conventions
Official site: Link.
General Philosophy
rust-analyzer’s approach to clean code:
- Velocity over perfection: Do not block functional PRs on purely stylistic changes.
- “Show, don’t just tell”: For complex style issues, reviewers are encouraged to merge the PR and then send a follow-up cleanup PR themselves. This resolves the issue faster and teaches the author “by example” rather than through endless comment threads.
- If a review comment applies generally, update the Style Guide instead of leaving a one-off comment. This way, temporary feedback is turned into permanent documentation.
- Small, atomic cleanup PRs (even just renaming a variable) are explicitly encouraged to keep the codebase healthy.
Scale of Changes
- Generally, small & focused PRs are preferred; but sometimes, that isn’t possible.
- rust-analyzer categorizes PRs into 3 groups.
Internal Changes (Low Risk)
- Definition: Changes confined to the internals of a single component. No `pub` items are changed or added (no changes to the interfaces and no new dependencies).
- Review standard: Easy merge.
- Does the happy path work?
- Are there tests?
- Does it avoid panicking on the unhappy path?
API Expansion (Medium Risk)
- Definition: Adding new `pub` functions or types that expose internal capabilities to other crates.
- Review standard: High scrutiny.
  - The interface matters more than the implementation. It must be correct and future-proof.
- rust-analyzer’s guideline: If you start a “Type 1” change and realize you need to change the API, stop. Split the API change into its own separate, focused PR first.
Dependency Changes (High Risk)
- Definition: Introducing new connections between components via `pub use` re-exports or `Cargo.toml` dependencies.
- Review standard: Rare & dangerous.
  - These break encapsulation.
  - Even an innocent-looking `pub use` can accidentally degrade the architecture by leaking abstractions across boundaries.
Crates.io Dependencies
- Restrict external dependencies: Be extremely conservative with crates.io usage to minimize compile times and breakage risks.
- Do not use small “helper” libraries (allowed exceptions: `itertools`, `either`).
- Internalize utilities: Place general, reusable logic into the internal `stdx` crate rather than adding a dependency.
- Audit the dependency tree: Periodically review `Cargo.lock` to prune irrational transitive dependencies.
Rationale
- Compilation speed:
- Rust compiles dependencies from source.
- Avoiding bloat is the best way to keep build times and feedback loops fast.
- Transitive bloat:
- Small “helper” crates often pull in deep chains of hidden dependencies (the “iceberg” effect).
- Stability & Security:
- Reduce the risk of upstream abandonment, breaking changes, or supply chain attacks.
- Self-reliance: If logic is simple enough for a micro-crate, it belongs in the internal `stdx` library, not as an external liability.
Commit Style
- Document changes for the changelog to avoid putting the release burden on the maintainers.
- Changelogs > Clean history.
Git History
- Clean git history is strongly encouraged but not mandated.
- Use a rebase workflow. It is explicitly acceptable to rewrite history (force push) during the PR review process.
- Before the final merge, use interactive rebase to squash small “fixup” commits into logical units.
Commit Message & PR Description
- Do not `@mention` users in commit messages or PR descriptions.
  - Reason: Rebasing re-commits the message, spamming the mentioned user with duplicate notifications.
- User-centric titles: Write PR titles/descriptions describing the user benefit, not the implementation details.
  - Good: “Make goto definition work inside macros”.
  - Bad: “Use original span for `FileId`”.
- Changelog automation: You must categorize PRs so release notes can be auto-generated. Use one of two methods:
  - Title prefix: `feat:`, `fix:`, `internal:`, or `minor:` (e.g., `feat: Add hover support`).
  - Magic comment: `changelog [fix] Description here` in the PR body.
- Visuals: For UI changes, include a GIF in the description to demonstrate the feature.
Linting
- Clippy is used.
Code
Minimal Tests
- Tests must use the absolute minimum code necessary to reproduce the case. Aggressively strip “noise” from copy-pasted real-world code.
- Format declarative code densely (e.g., `enum E { A, B }` on a single line) to keep the test concise, provided it remains readable.
- Unindented raw strings: Use `r#...#` literals for multiline fixtures. Ensure the content is unindented (starts at column 0) so that character offsets in the test match the actual file positions exactly.
- Rationale:
  - Reduces visual noise and scrolling, making the actual test case immediately obvious.
  - Lowers execution time and keeps debug logs clean.
  - Unindented formatting allows you to use your editor’s “selection character count” to verify byte offsets directly, without needing to manually subtract indentation whitespace.
Marked Tests
- Marked test: A technique used to verify that a specific, often hard-to-reach line of code was actually executed during a test.
- Use `cov_mark::hit!` (in code) and `cov_mark::check!` (in tests) to create a strictly unique link between a specific edge case in the implementation and its corresponding test.
- Principle: Only maintain one mark per test and one mark per code branch.
- Never place multiple marks in a single test, and never reuse the same mark across different tests.
- Rationale: This ensures that searching for a mark immediately reveals the single canonical test responsible for verifying that specific code branch, eliminating ambiguity.
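Since `cov_mark` is an external crate, here is a stdlib-only sketch of the underlying idea (all names invented): the implementation “hits” a named mark, and the test “checks” that the mark was hit, proving the hard-to-reach branch actually ran.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Minimal re-implementation of the marked-test idea. In real code this
// would be cov_mark::hit!/cov_mark::check! rather than a hand-rolled counter.
static GOTO_DEF_IN_MACRO: AtomicUsize = AtomicUsize::new(0);

fn goto_definition(in_macro: bool) {
    if in_macro {
        // Equivalent of cov_mark::hit!(goto_def_in_macro):
        GOTO_DEF_IN_MACRO.fetch_add(1, Ordering::SeqCst);
        // ...hard-to-reach branch under test...
    }
}

fn main() {
    // Equivalent of cov_mark::check!(goto_def_in_macro):
    let before = GOTO_DEF_IN_MACRO.load(Ordering::SeqCst);
    goto_definition(true);
    let hits = GOTO_DEF_IN_MACRO.load(Ordering::SeqCst) - before;
    assert_eq!(hits, 1, "the edge-case branch must run exactly once");
}
```

Searching the codebase for the mark name then leads directly from the branch to its single canonical test.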
#[should_panic]
- `#[should_panic]` is prohibited - `None` and `Err` should be explicitly checked.
- Rationale:
  - `#[should_panic]` is a tool for library authors to make sure that the API does not fail silently when misused.
  - rust-analyzer is a long-running server, not a library. It must handle all input gracefully, even invalid input (returning `Err` or `None`). It should never intentionally crash.
  - Expected panics still dump stack traces into the test logs. This “noise” creates confusion, making it difficult to distinguish between a test verifying a panic and an actual bug causing a crash.
#[ignore]
- Never ignore tests. Explicitly assert the wrong behavior and add a `FIXME` comment.
- Rationale:
- Visibility: It ensures the test fails immediately if the bug is accidentally fixed (alerting you to update the test).
- Safety: It proves the bug causes incorrect output rather than a server crash (panic), which is a critical distinction for a long-running service.
Function Preconditions
Type Encoding
- Function’s assumptions should be expressed in types.
- The caller must be enforced to provide them.
```rust
// GOOD: the type encodes the precondition; the caller must supply an i32.
fn is_zero(n: i32) -> bool {
    n == 0
}

// BAD: the callee is forced to handle a case the caller should rule out.
fn is_zero(n: Option<i32>) -> bool {
    let n = match n {
        Some(it) => it,
        None => return false,
    };
    n == 0
}
```
- Rationale:
- The caller has more context as to why the callee’s assumptions might not hold.
- The control flow is therefore more explicit at the call site.
Parse, Don’t Validate
- Bad practice:
  - One function validates that the data is valid (validates the assumption).
  - Another function uses that data based on those assumptions.
- Good practice: Validate and immediately use the data in the same place (like `match` instead of a bare `if`).
- Reasons:
  - The bad practice is prone to decay over time: the maintainer has to memorize the assumptions and make sure that refactored checks still verify them.
  - The good practice always ensures that the assumptions hold when manipulating the data.
- Example from rust-analyzer:

```rust
// GOOD
fn main() {
    let s: &str = ...;
    if let Some(contents) = string_literal_contents(s) {
    }
}

fn string_literal_contents(s: &str) -> Option<&str> {
    if s.starts_with('"') && s.ends_with('"') {
        Some(&s[1..s.len() - 1])
    } else {
        None
    }
}

// BAD
fn main() {
    let s: &str = ...;
    if is_string_literal(s) {
        let contents = &s[1..s.len() - 1];
    }
}

fn is_string_literal(s: &str) -> bool {
    s.starts_with('"') && s.ends_with('"')
}
```
- Remarks:
- This pattern perfectly illustrates Robert Harper’s concept of “Boolean Blindness”. By reducing a complex check to a simple bool (true/false), we discard the proof of validity. The compiler sees a “true” flag, but it doesn’t see the “valid data,” forcing us to rely on faith later in the code.
- I learned this the hard way while building a DBML parser. I designed a system where the “validation phase” was separate from the “execution phase”. Because the validation step didn’t return a new, safe type (it just returned true), the execution phase had to blindly trust that the validation had run correctly.
- While systems like TypeScript use flow typing to mitigate this (by inferring types inside `if` blocks), that safety is often only local. As soon as you pass that variable into a different function, the “flow context” is lost unless explicitly redefined.
Control Flow
- Push `if`s up and `for`s down.
Assertions
- Use `stdx::never!` liberally instead of `assert!`.
  - `never!` checks a condition and logs a backtrace if it fails, but returns a `bool` instead of crashing. This allows you to write: `if stdx::never!(condition) { return; }`.
- Rationale:
  - rust-analyzer is a long-running server. A bug in a minor feature (like a specific completion) should log an error and bail out of that specific request, not crash the entire IDE session.
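A minimal sketch of a `never!`-style macro (the real one lives in `stdx`; this version is simplified and stdlib-only) shows the shape of the pattern: report the violated invariant, then let the caller bail out of the current request instead of crashing.

```rust
// Simplified never!-style macro: logs the violation, returns the bool.
macro_rules! never {
    ($cond:expr) => {{
        let failed = $cond;
        if failed {
            eprintln!("invariant violated at {}:{}", file!(), line!());
        }
        failed
    }};
}

fn label_for(item: Option<&str>) -> Option<String> {
    // A bug here should abort this request only, not the whole process.
    if never!(item.is_none()) {
        return None;
    }
    Some(format!("item: {}", item.unwrap()))
}

fn main() {
    assert_eq!(label_for(Some("users")).as_deref(), Some("item: users"));
    assert_eq!(label_for(None), None); // logged, but no crash
}
```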
Getters & Setters
- Two cases to consider:
  - No invariants: If a field can hold any value safely, just make it `pub`. Don’t write boilerplate code.
  - Invariants exist: If the data has rules (e.g., “cannot be empty”), make the field private, enforce the rule in the constructor, and provide a getter.
-
Never provide setters. If data needs to change, it should likely be done via a specific behavior method or by creating a new instance, ensuring invariants are never bypassed.
-
Getters should return borrowed data.
rust-analyzer’s example:#![allow(unused)] fn main() { struct Person { // Invariant: never empty first_name: String, middle_name: Option<String> } // GOOD impl Person { fn first_name(&self) -> &str { self.first_name.as_str() } fn middle_name(&self) -> Option<&str> { self.middle_name.as_ref() } } // BAD impl Person { fn first_name(&self) -> String { self.first_name.clone() } fn middle_name(&self) -> &Option<String> { &self.middle_name } } } -
- Rationale:
  - The APIs are internal, so (internal) breaking changes are acceptable in order to move fast:
    - A `pub` field (with no invariants) means less boilerplate, but it is a breaking change if the field is later given an invariant and has to become private.
    - An accessor can prevent breaking changes, but it implicitly promises a contract and imposes some maintenance boilerplate.
  - Privacy helps keep invariants local and prevents code rot.
  - A type that is too specific (borrowing owned types like `&String`) leaks irrelevant details (neither right nor wrong), which creates noise, and the client may accidentally rely on those irrelevant details.
Useless Types
- Prefer general types.
- If generality is not important, consistency is important.
| GOOD | BAD |
|---|---|
| `&[T]` | `&Vec<T>` |
| `&str` | `&String` |
| `Option<&T>` | `&Option<T>` |
| `&Path` | `&PathBuf` |
- Rationale:
- General types are more flexible.
- General types leak fewer irrelevant details (which the client may accidentally rely on).
Constructors
- If a `new` function accepts zero arguments, use the `Default` trait (either derived or manually implemented).
  - Rationale:
    - Less boilerplate.
    - Consistency: less cognitive load for the caller - no wondering "Should I call `new()` or `default()`?"
- Use `Vec::new` instead of `vec![]`.
  - Rationale:
    - Strength reduction.
    - Uniformity.
- Do not provide `Default` if the type has no sensible default value (many possible defaults, or defaults that encode invalid states).
  - Rationale:
    - Preserves invariants.
    - The user does not need to wonder whether the provided default matches their desired initial values.
```rust
// GOOD
#[derive(Default)] // 1. Best case: derive it automatically
struct Options {
    check_on_save: bool,
}

// GOOD (manual implementation)
struct Buffer {
    data: Vec<u8>,
}

impl Default for Buffer {
    fn default() -> Self {
        Self {
            // 2. Use Vec::new() instead of vec![] (strength reduction):
            // it is semantically lighter (function vs macro) and more uniform.
            data: Vec::new(),
        }
    }
}

// BAD
struct OptionsBad {
    check_on_save: bool,
}

impl OptionsBad {
    // 3. Avoid zero-arg new().
    // It forces users to remember "Do I call new() or default() for this type?"
    fn new() -> Self {
        Self { check_on_save: false }
    }
}
```
Functions Over Objects
- Public API: Prefer simple functions (`do_thing()`) over transient objects that exist only to execute one method (`ThingDoer::new().do()`).
- Internal logic: It is acceptable (and encouraged) to use "Context" structs inside the function to manage complex state or arguments during execution.
  - Rationale:
    - The "Iceberg" pattern: The user sees a simple function interface; the developer uses a structured object implementation behind the scenes.
    - The implementor API is not mixed with the user API.
- Middle ground: If a struct is preferred for namespacing, provide a static `do()` helper method that handles instantiation and execution in one step.
  - Rationale:
    - Reduces boilerplate for the caller.
    - Prevents implementation details (like temporary state management) from leaking into the public API.
```rust
// BAD (caller has to build and run)
ThingDoer::new(arg1, arg2).do();

// GOOD (caller just acts)
do_thing(arg1, arg2);

// ACCEPTABLE INTERNAL IMPLEMENTATION (using a struct to organize code)
pub fn do_thing(arg1: Arg1, arg2: Arg2) -> Res {
    // The struct is an implementation detail, hidden from the user
    let mut ctx = Ctx { arg1, arg2 };
    ctx.run()
}
```
Functions With Many Parameters
- Use a `Config` struct:

```rust
// BAD (call site is confusing)
annotations(db, file_id, true, false, true);

// GOOD (call site is explicit and flexible)
annotations(db, file_id, AnnotationConfig {
    binary_target: true,
    annotate_runnables: false,
    annotate_impls: true,
});
```

- Rationale:
  - The call site is clearer.
  - Encapsulating volatile parameters in a struct shields intermediate functions from breaking changes, allowing you to add new options without updating the signature of every function in the call chain.
- No `Default` for `Config`: Do not implement `Default`. Force the caller to provide explicit context.
  - Rationale: The caller knows better than the struct what the initial state should be.
- Pass configuration as an argument to the function; do not store it in the object's state.
  - Rationale: This allows the same object to handle multiple requests with different configurations dynamically.
- `Command` pattern: If a set of parameters can yield different return types (e.g., `Vec<T>` vs `Option<T>`), wrap the parameters in a "Command" struct.
  - Rationale: This avoids creating multiple top-level functions (`query_all`, `query_first`) with identical argument lists.
```rust
// GOOD (Command pattern)
// Captures the arguments once, offers multiple execution paths
pub struct Query {
    pub name: String,
    pub case_sensitive: bool,
}

impl Query {
    // Return type A
    pub fn all(self) -> Vec<Item> { ... }
    // Return type B
    pub fn first(self) -> Option<Item> { ... }
}

// BAD (parameter duplication)
// Requires repeating the arguments for every variation of the result
fn query_all(name: String, case_sensitive: bool) -> Vec<Item> { ... }
fn query_first(name: String, case_sensitive: bool) -> Option<Item> { ... }
```
- Remarks: Does the `Command` rule conflict with the rule "Do not store `Config` in state"?
  - The `Config` rule applies to long-lived services (e.g., `Database`). These should remain stateless so they can handle diverse requests without needing to be reset.
  - The `Command` rule applies to short-lived tasks (e.g., `Query`). These objects exist solely to bundle parameters for a single operation and are discarded immediately after use.
  - Relationship: The "Command" is effectively a temporary container that you pass to (or use with) the "Service".
Prefer Separate Functions Over Parameters
- Split "flag" arguments: If a function is only ever invoked with hardcoded literals (`true`/`None`), refactor it into distinct named functions (e.g., `process_fast()` vs `process_full()`) to eliminate internal branching and "false sharing" of unrelated logic.
- Rationale:
  - Functions with flag arguments often exhibit false sharing. There's often `if` branching to distinguish between the different cases:

```rust
// Caller
process_data(true);
process_data(false);

// Callee
fn process_data(is_fast: bool) {
    if is_fast {
        // Fast algorithm
    } else {
        // Accurate algorithm
    }
}
```

  - These functions seem to share logic, and merging them into one function seems to make sense. However, over time the branches diverge. Splitting the control flows into separate functions simplifies the logic and eliminates irrelevant details (for example, a maintainer may otherwise fail to see that there are no cross-dependencies between the `if` branches).
  - Extract the code that is common to the `if` branches into a shared helper instead.
- Remark: This binder-refactoring PR was a hard lesson for me, illustrating both the problem and the solution (splitting into separate classes and extracting common helpers). Although there is still plenty of room for improvement, the result is already significantly better.
Appropriate String Types
- When calling OS APIs (file systems, env vars, arguments), use `OsString` and `&OsStr`, never `String` or `&str`.
- Rust's `String` guarantees valid UTF-8. Operating systems do not:
  - Linux/Unix: Paths are arbitrary byte sequences (except for the null byte).
  - Windows: Paths are potentially ill-formed UTF-16 sequences.
- Rationale: This creates a strict type-level boundary.
  - If you hold a `String`, you know it is safe, clean text.
  - If you hold an `OsString`, you know it is "dirty" data from the outside world.
  - Using `OsString` prevents accidental panics or data corruption when encountering a file name with an invalid encoding.
- Avoid the standard `std::path::Path`; use the custom `AbsPathBuf` and `AbsPath` wrappers that guarantee the path inside is absolute.
- Rationale:
  - CWD is global mutable state.
  - If you use a relative path like `Path::new("src/main.rs")`, the OS resolves it relative to where the server binary started, not where the project is.
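A small sketch of the type-level boundary in practice (`file_name_utf8` is a hypothetical helper, not from any library): the conversion from "dirty" OS data to clean UTF-8 is explicit and fallible, instead of a hidden panic.

```rust
use std::ffi::OsStr;
use std::path::Path;

// Returns None instead of panicking when the file name is not valid UTF-8.
fn file_name_utf8(path: &Path) -> Option<&str> {
    path.file_name().and_then(OsStr::to_str)
}

fn main() {
    assert_eq!(file_name_utf8(Path::new("/tmp/data.dbml")), Some("data.dbml"));
    println!("ok");
}
```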
Premature Pessimization
Avoid Allocations
- Zero-allocation default: Prefer stack-based structures (iterators) over heap-based collections (`Vec`, `String`) unless you specifically need ownership or long-term storage.
- Lazy vs eager: `collect::<Vec<_>>()` eagerly processes the entire sequence and allocates memory immediately. Iterators are lazy and compute items only on demand.
```rust
use itertools::Itertools;

// BAD (heavy)
// 1. Allocates heap memory. 2. Processes the entire string. 3. Frees memory.
let parts: Vec<&str> = text.split(',').collect();
if parts.len() == 3 {
    process(parts[0], parts[1], parts[2]);
}

// GOOD (light)
// 1. No allocation. 2. Stops after 3 items. 3. Stores pointers on the stack.
if let Some((a, b, c)) = text.split(',').collect_tuple() {
    process(a, b, c);
}
```
Push Allocations to the Call Site
- Rationale: The cost of calling a function becomes more explicit.
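A sketch of the idea (the `render_*` names are hypothetical): the version that appends into a caller-provided buffer makes the allocation visible at the call site and lets the caller reuse the buffer across calls.

```rust
use std::fmt::Write;

// Allocation hidden inside the function:
fn render_hidden(n: i32) -> String {
    format!("item-{n}")
}

// Allocation pushed to the call site; the function only appends:
fn render_into(buf: &mut String, n: i32) {
    let _ = write!(buf, "item-{n}");
}

fn main() {
    assert_eq!(render_hidden(1), "item-1");

    let mut buf = String::new(); // the caller pays (and sees) the allocation
    for n in 0..3 {
        buf.clear(); // reuse the same heap buffer on every iteration
        render_into(&mut buf, n);
    }
    assert_eq!(buf, "item-2");
    println!("ok");
}
```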
Collection Types
- `rustc_hash`'s map and set (`FxHashMap`, `FxHashSet`) are preferred over those in `std::collections`.
- Rationale:
  - Faster hasher.
  - Slightly reduced code size if applied consistently.
Avoid Intermediate Collections
- Use an accumulator parameter (list, set, map) as the first parameter to collect values.
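A sketch of the accumulator style (the `Tree` type is hypothetical): a recursive walk fills one collection passed as the first parameter, instead of each call allocating and merging its own `Vec`.

```rust
enum Tree {
    Leaf(i32),
    Node(Vec<Tree>),
}

// The accumulator comes first; no intermediate Vecs are created or merged.
fn collect_leaves(acc: &mut Vec<i32>, t: &Tree) {
    match t {
        Tree::Leaf(x) => acc.push(*x),
        Tree::Node(children) => {
            for c in children {
                collect_leaves(acc, c);
            }
        }
    }
}

fn main() {
    let t = Tree::Node(vec![
        Tree::Leaf(1),
        Tree::Node(vec![Tree::Leaf(2), Tree::Leaf(3)]),
    ]);
    let mut acc = Vec::new();
    collect_leaves(&mut acc, &t);
    assert_eq!(acc, vec![1, 2, 3]);
    println!("ok");
}
```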
Avoid Monomorphization
- Minimize generics: Avoid heavy generic logic at crate boundaries to prevent compile-time bloat caused by monomorphization (code duplication).
- "Inner dyn" pattern: Use thin generic wrappers that immediately delegate to private, non-generic functions using dynamic dispatch (`dyn Trait`).
- Concrete types: Avoid `AsRef` polymorphism; prefer concrete types like `&Path` unless writing a widely-used library.
- Rationale: Optimize for compile speed by default; runtime performance only matters for the "hot" 20% of code, but compilation costs affect 100%.
```rust
use std::fmt::Display;

// --- 1. The wrapper (generic) ---
// This function is "monomorphized" (duplicated) for every type T.
// But since it's only one line, the duplication cost is negligible.
pub fn log_message<T: Display>(msg: T) {
    // Coerce the specific type T into a trait object (&dyn Display)
    log_message_impl(&msg);
}

// --- 2. The implementation (dynamic) ---
// This function is compiled ONLY ONCE.
// It handles the heavy lifting via dynamic dispatch (vtable).
fn log_message_impl(msg: &dyn Display) {
    // Imagine 500 lines of complex logging logic here...
    println!("Timestamp: [INFO] {}", msg);
    // ... extensive I/O, formatting, or network calls ...
}

fn main() {
    log_message("Hello"); // T is &str -> wrapper created for &str
    log_message(42);      // T is i32  -> wrapper created for i32
    log_message(3.14);    // T is f64  -> wrapper created for f64
    // Result: 3 tiny wrappers, but the heavy `log_message_impl` exists only once.
}
```
Style
Order of Imports
- Modules first: Declare modules (`mod x;`) at the top, before any imports, ordered by "suggested reading."
- Import structure:
  - Grouping: Use one `use` statement per crate (group items inside `{ ... }`).
  - Spacing: Separate different groups with blank lines.
  - Import order:
    - `std`: Standard library (`use std::...`).
    - External: Third-party and workspace crates.
    - Local: Current crate (`use crate::...`).
    - Relative: Parent/child modules (`use super::...`), though `crate::` is preferred.
- Re-exports: Place `pub use` after all imports, as they are treated as item definitions.
- Rationale: Ensures consistency, improves readability for new contributors, and highlights dependencies clearly.
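The layout above can be sketched in a single file (module names are hypothetical; the module is declared inline so the sketch compiles on its own):

```rust
// 1. Modules first, in "suggested reading" order.
mod config {
    pub struct Config(pub u16);
}

// 2. `std` imports (external and workspace crates would form the next group).
use std::collections::HashMap;

// 3. Local crate imports come after the external groups.
use crate::config::Config;

fn main() {
    let cfg = Config(8080);
    let mut map: HashMap<&str, u16> = HashMap::new();
    map.insert("port", cfg.0);
    assert_eq!(map["port"], 8080);
    println!("ok");
}
```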
Import Style
- Qualify layer types: Always qualify items from `hir` and `ast` to prevent ambiguity and clarify the architectural layer.
  - Good: `use syntax::ast; ... func: hir::Function`
  - Bad: `use hir::Function; ... func: Function`
- Trait implementations: Import the module (e.g., `std::fmt`), not the trait itself, when implementing standard traits.
  - Good: `impl fmt::Display for ...`
  - Rationale: Reduces typing and clearly distinguishes implementation from usage.
- Avoid local globs: Do not use `use MyEnum::*;` inside functions.
- Absolute paths: Prefer `use crate::foo` over relative paths like `super::` or `self::` for consistency.
- No re-exports: Avoid re-exports in non-library code to prevent multiple access paths and maintain consistency.
Order of Items
- Public API first: Always place public items (`pub` or `pub(crate)`) at the very top, before any private helpers or implementation details.
- Types before logic: Define data structures (`struct`, `enum`) before functions and `impl` blocks.
- Top-down: Order type definitions by dependency; place the "parent" container before the "child" component it contains.
- Rationale:
- Optimize for a new reader scanning top-to-bottom.
- When code is folded, the file structure should read like API documentation.
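The ordering can be sketched in one file (the `Schema`/`Table` names are hypothetical, chosen to fit a DBML parser):

```rust
// 1. Public parent type first.
pub struct Schema {
    pub tables: Vec<Table>,
}

// 2. Public child type it contains.
pub struct Table {
    pub name: String,
}

// 3. Public logic next.
pub fn parse_schema(src: &str) -> Schema {
    Schema { tables: helper(src) }
}

// 4. Private helpers last.
fn helper(src: &str) -> Vec<Table> {
    vec![Table { name: src.to_string() }]
}

fn main() {
    let s = parse_schema("users");
    assert_eq!(s.tables.len(), 1);
    assert_eq!(s.tables[0].name, "users");
    println!("ok");
}
```

Folded, the file reads top-down like API documentation: types, then entry points, then details.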
Context Parameters
- Context-first: Always pass "context" parameters (invariant data threaded through many calls) as the first arguments.
  - Rationale:
    - This creates a visual hierarchy: "setting" before "actors".
    - It mimics the `self` convention in OOP (where the context/object always comes first).
    - When scanning a function call, you immediately see where the operation is happening (the context) before seeing what is being processed (the variable data).
```rust
// BAD: the context (db) is hidden at the end.
// In a long list of arguments, you might miss which 'db' is being used.
fn transform_data(data: String, id: usize, force: bool, db: &Database) { ... }
// Usage
transform_data(raw_input, 42, true, &primary_db);

// GOOD: the context (db) is front and center.
// It establishes the environment immediately.
fn transform_data(db: &Database, data: String, id: usize, force: bool) { ... }
// Usage
transform_data(&primary_db, raw_input, 42, true);
```
- If there are multiple context parameters, bundle them into a struct and pass it as `&self`.
```rust
// BAD: parameter explosion.
// Every helper function needs to accept all three context items.
fn parse(db: &Db, cfg: &Config, cache: &Cache, text: &str) {
    validate(db, cfg, cache, text); // Tedious repetition
}

// GOOD: packed context.
struct ParseCtx<'a> {
    db: &'a Db,
    cfg: &'a Config,
    cache: &'a Cache,
}

impl<'a> ParseCtx<'a> {
    // The signature is clean. Context is implicit in `&self`.
    fn parse(&self, text: &str) {
        self.validate(text);
    }
    fn validate(&self, text: &str) { ... }
}
```
- The "dangling argument" problem:
  - Context-first works even better when the non-context parameters are lambdas (closures).
  - Rationale:
    - Rust closures often span multiple lines.
    - If the context parameter is placed after the closure, it ends up "dangling" after the closing brace `}`. This is visually confusing and looks like a syntax error or a forgotten fragment.
    - Placing the context first ensures the closure body is the last thing the eye sees, resulting in a clean termination of the statement.
```rust
// BAD: context last.
// The `&context` argument is easy to miss or looks like it belongs
// to a different statement because it follows a large code block.
apply_changes(|item| {
    // ... complex multi-line logic ...
    item.finalize()
}, &context); // <--- the "dangler"

// GOOD: context first.
// The statement starts with the context, and the closure flows naturally
// until the end of the function call.
apply_changes(&context, |item| {
    // ... complex multi-line logic ...
    item.finalize()
}); // <--- clean, standard closure syntax
```
Variable Naming
- Prioritize verbose, boring, but clear names: prefer long names that mirror the type (e.g., `global_state: GlobalState`). Rely on code completion, not brevity.
- Standard variables:
  - `res`: Function result.
  - `it`: Generic item (when identity is irrelevant).
  - `n_foos`: Count (preferred over `foo_count`).
  - `foo_idx`: Index.
- Keyword collisions: Avoid the `r#ident` syntax. Use these consistent replacements:

| Keyword | Replacement |
|---|---|
| `crate` | `krate` |
| `enum` | `enum_` |
| `fn` | `func` |
| `impl` | `imp` |
| `macro` | `mac` |
| `mod` | `module` |
| `struct` | `strukt` |
| `trait` | `trait_` |
| `type` | `ty` |

- Spelling & acronyms:
  - Use American spelling (`color`).
  - Avoid ad-hoc acronyms; stick to common ones (`db`, `ctx`).
Error Handling Trivia
- Use `anyhow::Result` instead of the bare `Result`.
  - Rationale: Makes the return type immediately clear without checking imports.
- Macro choice: Prefer `anyhow::format_err!` over `anyhow::anyhow!`.
  - Rationale: More "boring" (standard), consistent, and avoids the stuttering of `anyhow::anyhow!`.
- Message formatting:
  - There are no strict rules on message structure. The Rust standard library uses lowercase, while `anyhow` uses uppercase.
  - Do not end error or context messages with a period (`.`).
Early Returns
- Handle negative cases immediately with an early return rather than wrapping the "happy path" inside a large if/else block.
  - Rationale: Flattens code nesting and reduces "cognitive stack usage" (mental load).
- Explicit error returns: Use `return Err(e)` to exit with an error. Avoid using `Err(e)?` to simulate a throw.
  - Rationale: `return` evaluates to the "never type" (`!`), which allows the compiler to strictly identify dead code, whereas `?` resolves to a generic type that can mask unreachable code.
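A small sketch of the early-return style (`parse_port` is a hypothetical function): each negative case bails out immediately with an explicit `return Err(...)`, keeping the happy path flat.

```rust
fn parse_port(s: &str) -> Result<u16, String> {
    // Negative case 1: explicit early return, not Err(...)?
    if s.is_empty() {
        return Err("empty input".to_string());
    }
    let n: u32 = s.parse().map_err(|_| "not a number".to_string())?;
    // Negative case 2 (note the less-than style from the next section):
    if 65535 < n {
        return Err("out of range".to_string());
    }
    // Happy path, at zero extra nesting.
    Ok(n as u16)
}

fn main() {
    assert_eq!(parse_port("8080"), Ok(8080));
    assert!(parse_port("").is_err());
    assert!(parse_port("70000").is_err());
    println!("ok");
}
```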
Comparisons
- Prefer less-than:
  - Always use `<` or `<=` comparisons.
  - Avoid `>` or `>=`.
- Rationale:
  - Spatial intuition: Corresponds to the real number line, where values increase from left to right (`0 → ∞`).
  - Visual ordering: `lo <= x && x <= hi` visually places `x` in the middle, whereas `x >= lo` forces a mental "flip."
If-let
- Prefer `match` over `if let ... else`: When you need to handle both the "success" and "failure" cases, use a full `match` statement.
- Rationale:
  - Compactness: `match` is usually cleaner and requires less syntax for simple alternatives.
  - Precision: The `else` block in `if let` is implicit (it covers everything else). `match` forces you, or allows you, to be explicit about what the negative case is (e.g., `None` vs `Err(_)`), making the code more robust to type changes.
Match Ergonomics
- Avoid `ref`: Do not use the `ref` keyword in patterns.
- Rationale:
  - Obsolescence: `ref` is largely redundant due to "match ergonomics" (introduced in recent Rust editions).
  - Simplicity: Relying on match ergonomics is cleaner and avoids mixing legacy syntax with modern style.
Empty Match Arms
- Use the unit value `()` for empty match arms, rather than an empty block `{}`.

```rust
// GOOD
Err(err) => error!("{}", err),
Ok(_) => (), // <--- clean, single-line comma style

// BAD
Err(err) => error!("{}", err),
Ok(_) => {} // <--- block style breaks the visual rhythm
```

- Rationale:
  - In Rust, `()` is the value "nothing," while `{}` is a block of code that evaluates to nothing.
  - While they are functionally identical here, using `=> (),` keeps the match arm visibly distinct as a "value" rather than a "scope," maintaining a consistent visual rhythm in long match statements.
Functional Combinators
- Use functional combinators (`map`, `and_then`) only when they fit naturally.
- Prefer imperative control flow (`if`, `for`, `match`) over "forced" combinators like `bool::then` or `Option::filter`.
- The philosophy: Code should be dense in computation (doing work) but sparse in structure (few indirections per line).
- Rationale:
  - Rust has strong support for imperative flow (loops, early returns).
  - Rust functions are "less first-class" because of effects like `?` (the try operator) and `.await`. These do not compose well inside long chains of closures (e.g., trying to use `?` inside a `filter` closure is painful).
Turbofish
- Prefer explicit type ascription (`let x: Vec<i32> = ...`) over the "turbofish" syntax (`...collect::<Vec<i32>>()`).
- Rationale: We want to be able to read everything left-to-right without maintaining too much context:
  - Good: `let names: Vec<String> = users.iter().map(...).collect();`
    - Why: When you read the line left-to-right, you immediately know what is being built (`Vec<String>`). This context helps you understand the complex iterator chain that follows.
  - Bad: `let names = users.iter().map(...).collect::<Vec<String>>();`
    - Why: You have to read the entire chain until the very end to figure out what type is actually being produced.
- No placeholders (`_`): Avoid `let x: Vec<_> = ...`.
  - Rationale: If the compiler struggles to infer the type (forcing you to add a hint), a human reader will likely struggle too. Be kind to the reader and write the full type `Vec<i32>`.
Helper Functions
- Avoid single-use helpers: Do not create a separate function if it is only called once.
- Alternative: Use a block `{ ... }`.
- Rationale: This isolates the scope but keeps access to the context variables.
- Exception: Create a function if you need early returns (`return`) or error propagation (`?`).
- Local helpers: Place nested helper functions at the end of the enclosing function.
  - Structure: Main logic → `return result;` → `fn helper() { ... }`.
  - Limit: Do not nest more than one level deep.
Helper Variables
- Create boolean helper variables for complex conditions (e.g., inside match guards).
- Rationale:
  - Acts as a "cognitively cheap" abstraction (names the logic without hiding the context).
  - Makes debugging easier (you can inspect/print the variable).
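A sketch of the idea (the `classify` function and its thresholds are hypothetical): the helper variable names the compound condition, and a match guard stays readable.

```rust
fn classify(len: usize, is_doc: bool) -> &'static str {
    // Helper variable: names the logic, and is easy to print or inspect.
    let is_long_doc = is_doc && 100 < len;

    match (len, is_doc) {
        _ if is_long_doc => "long doc comment",
        (_, true) => "doc comment",
        _ => "other",
    }
}

fn main() {
    assert_eq!(classify(200, true), "long doc comment");
    assert_eq!(classify(10, true), "doc comment");
    assert_eq!(classify(10, false), "other");
    println!("ok");
}
```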
Syntax Macros
- Tokens: Use the `T![token]` macro instead of `SyntaxKind::TOKEN_KW`.
- Example: `T![true]` instead of `SyntaxKind::TRUE_KW`.
- Rationale: Familiar syntax, avoids ambiguity (e.g., `{` vs `[`).
Documentation
- Comments: Write proper sentences. Start with a capital letter, end with a dot (`.`).
- Markdown: Use sentence-per-line. Do not hard-wrap lines; press Enter after every sentence.
- Rationale:
- Makes diffs cleaner and editing easier.
- Formatting a comment as a sentence (capital + period) forces the brain to switch from “scribbling notes” to “explaining thoughts.”
- Context dump: It tricks you into emptying your "mental RAM" (assumptions, constraints) into the code, rather than leaving vague fragments like `// fix this`.
Syntax Tree & Parser
Official site: Link.
Commit link: 2020-01-09
Architectural Overview
- The system has three isolated components:
  - `rowan`: A generic, language-agnostic library for immutable syntax trees using the Red-Green tree model.
  - `syntax` (crate): Wraps `rowan` with a `rust-analyzer`-specific API. It's the only crate that knows about `rowan`.
  - `parser` (crate): Outputs events from parsing without knowing anything about the tree structure.
Design Goals
- Lossless: Everything is preserved—whitespace, comments, even invalid tokens.
- Semantic-less: Pure structure with no type checking or name resolution.
- Context-free: Trees are simple values that don’t depend on external context.
- Resilient: Always produce a tree, even from broken code (nodes may be partial).
- Performance: Optimized for memory and CPU, using `unsafe` where needed.
- Decoupled: Parser and tree evolve independently.
- Intuitive traversal: Easy navigation between parent, children, and siblings.
The Three-Layer Syntax Tree
The core idea is a “Red-Green Tree” (borrowed from Roslyn), where three layers balance memory usage against API convenience.
- Semi-transient: Trees don’t live in memory permanently—they get lowered to a compact form but can be rebuilt when needed.
Layer 1: GreenNode (The Storage)
- `GreenNode` is where the actual data lives—it's purely functional, immutable, and persistent, with arbitrary arity.
- Conceptual (unoptimized) example from `rust-analyzer`:

```rust
// Runtime tag for a node or token.
// Nodes and tokens share the same kind of tag.
#[derive(PartialEq, Eq, Clone, Copy)]
struct SyntaxKind(u16);

#[derive(PartialEq, Eq, Clone)]
struct Node {
    kind: SyntaxKind,
    // Only the byte-length is stored; the offset is not.
    text_len: usize,
    // A generic node with a sequence of child nodes and tokens.
    // `Arc` is used, so it allows for multithreading.
    children: Vec<Arc<Either<Node, Token>>>,
}

#[derive(PartialEq, Eq, Clone)]
struct Token {
    kind: SyntaxKind,
    text: String,
}
```

- A `Token` is a leaf node, while a `Node` represents an interior node.
- Some things to note:
  - There's just one generic node structure—nodes are untyped, with a runtime tag.
  - The original text comes from combining all the children's text.
  - Finding a specific child means scanning through siblings linearly.
  - Modifying the tree gets more expensive with height (due to path copying).
  - Invalid input gets wrapped in an `ERROR` node.
  - Errors live separately, not embedded in the tree itself.
The following optimizations are mostly attributed to CAD97.
Optimizations
```
[ RefCount | Kind | TextLen | N_Children | Child_1 | Child_2 | ... | Child_N ]
<------------------------- Single Heap Allocation ------------------------->
```
- Dynamically sized type (DST): Everything in one heap allocation instead of splitting the node and its children into separate allocations.
- Tagged pointers: A single pointer where the last bit indicates whether it's a `Node` or `Token` (cheaper than `Arc<Either<Node, Token>>`).
- Interning: Tokens get reused—`1 + 1` only stores one `1` token that's shared. The interner returns `Arc<Node>`, so trees are self-contained.
- `TextSize`: Just a newtyped `u32` for tracking text length.
- `SmolStr`: Small strings like `fn` or `if` live inline on the stack; no heap allocation needed.
  - Note: The text suggests moving this data directly into the interned token allocation in the future.
Alternative Designs
Dealing With Trivia
- Explicit nodes (Rowan/IntelliJ): Whitespace becomes regular sibling nodes. Lossless and easy to edit, but noisy when traversing.
- Attached trivia (Roslyn/Swift): Whitespace hides inside tokens as "leading/trailing" properties, so the tree only shows logical code.
- Linked list (Dart): A hybrid approach with a clean semantic tree plus a separate doubly-linked list for all tokens.
- There's no standard approach to handling trivia. For reference, `@dbml/parse` uses the attached-trivia approach.
| Strategy | Pros | Cons |
|---|---|---|
| Explicit Nodes (Rowan, IntelliJ, rust-analyzer) | Uniformity: Everything is a node; simple data model. Fidelity: Lossless representation of text. Refactoring: Moving a branch naturally moves its formatting. | Noise: Tree traversal requires manual filtering of whitespace. Memory: Higher node count (every space is an object). |
| Attached Trivia (Roslyn, Swift) | Clarity: Tree contains only logical code. Iteration: Iterating children yields immediate semantic tokens. | Complexity: “Re-stitching” text during refactoring is hard. Ownership: Ambiguous whether a comment belongs to the previous or next token. |
| Linked List (Dart) | Dual View: Optimized for both semantic analysis (Tree) and formatting (List). Scanning: Linear access to tokens is very fast. | Syncing: High maintenance cost to keep the Tree and List in perfect sync. Overhead: Requires managing parallel data structures. |
Accessing Children
- Linear scan (IntelliJ/Rowan):
  - Children live in a dynamic list with trivia mixed in.
  - You have to scan through siblings (O(N)) to find what you want.
  - Very memory efficient—missing nodes cost nothing.
- Fixed slots (Roslyn/Swift):
  - Every possible child gets a fixed slot in the grammar.
  - Direct O(1) access by index.
  - Uses more memory, since it allocates slots even for `None` values.
  - Trivia has to be removed to keep indices stable.
| Feature | Linear Scan (IntelliJ) | Fixed Slots (Roslyn/Swift) |
|---|---|---|
| Access Speed | O(N) (Must iterate/skip) | O(1) (Direct Index) |
| Memory Usage | Compact (Only stores present data) | Sparse (Allocates slots for None) |
| Missing Nodes | Implicitly absent | Explicitly stored as Option::None |
| Trivia | Included as sibling nodes | Removed or attached to tokens |
Mutable Trees
- Mutable trees (IntelliJ): Imperative style.
  - You modify nodes directly in place (`node.delete()`).
  - Simple API, but thread safety gets complicated (needs read/write locks).
- Immutable trees (`rust-analyzer`): Functional style.
  - You build up changes with a `Builder`, then generate a fresh tree.
  - More verbose API, but concurrency becomes trivial (snapshots are naturally safe).
```rust
// MUTABLE (IntelliJ style)
// Direct, in-place modification. Simple, but requires locking.
for bound in bounds {
    // The tree changes immediately
    bound.delete();
}

// IMMUTABLE (rust-analyzer style)
// Indirect. Must collect edits and rebuild the tree structure.
let mut builder = tree.edit();
for bound in bounds {
    // Schedule deletion on the 'builder', not the node itself
    builder.delete(bound);
}
// Apply generates a completely new root node
let new_tree_root = builder.apply();
```
| Feature | Mutable Trees (IntelliJ) | Immutable Trees (rust-analyzer) |
|---|---|---|
| Editing API | Simple (Direct node.delete()) | Verbose (Requires Builder / rewriting spine) |
| Concurrency | Hard (Needs complex R/W locks) | Easy (Snapshots are thread-safe) |
| Mental Model | “DOM Manipulation” | “Version Control” / Persistent Data Structures |
| Performance | Fast small edits | Slower (allocates new nodes for the path to root) |
Layer 2: SyntaxNode (The Cursor / RedNode)
- `GreenNode` has some limitations: there are no parent pointers (you can't go up the tree), and if the same `GreenNode` appears in multiple places, the occurrences are treated as identical.
- `SyntaxNode`/`RedNode` (also called cursors or zippers) solves this by adding parent pointers and position information.
- Conceptual (unoptimized) example from `rust-analyzer`:

```rust
type SyntaxNode = Arc<SyntaxData>;

struct SyntaxData {
    offset: usize,
    parent: Option<SyntaxNode>,
    green: Arc<GreenNode>,
}

impl SyntaxNode {
    fn new_root(root: Arc<GreenNode>) -> SyntaxNode {
        Arc::new(SyntaxData { offset: 0, parent: None, green: root })
    }
    fn parent(&self) -> Option<SyntaxNode> {
        self.parent.clone()
    }
    fn children(&self) -> impl Iterator<Item = SyntaxNode> {
        let mut offset = self.offset;
        self.green.children().map(|green_child| {
            let child_offset = offset;
            offset += green_child.text_len;
            Arc::new(SyntaxData {
                offset: child_offset,
                parent: Some(Arc::clone(self)),
                green: Arc::clone(green_child),
            })
        })
    }
}

impl PartialEq for SyntaxNode {
    fn eq(&self, other: &SyntaxNode) -> bool {
        self.offset == other.offset && Arc::ptr_eq(&self.green, &other.green)
    }
}
```
Optimizations
- Atomic avoidance: Use `Rc` (non-atomic) instead of `Arc` for parent pointers, since traversals are single-threaded anyway, avoiding atomic overhead.
- Thread movement: You can't send a `SyntaxNode` across threads directly, so you send lightweight coordinates like `(GreenNode, TextRange)` instead. The receiving thread rebuilds the node from those coordinates.
- Transient trees: Instead of keeping full trees in memory, `rust-analyzer` just stores `(FileId, TextRange)` and reparses when needed—trading CPU time for memory.
- Root-only `Arc`: Only the root holds an `Arc<GreenNode>` as the tree's memory anchor. Descendants use raw pointers (fast) plus an `Rc` to their parent (safe). Since each child's `Rc` keeps its parent alive all the way up to the root, the raw pointers never dangle. This means you only pay for atomic operations once at the root—the rest is just cheap `Rc` bumps.
- Object pooling: Keep a thread-local "free list" of node objects for reuse instead of calling `malloc` constantly. The pool only needs to match traversal depth (not tree size), so a few dozen slots can handle huge trees. This lets you return "owned" nodes cheaply (just a pointer move), which are nicer to work with than references.
Alternative Designs
Memoized RedNodes
- Memoized nodes (C#/Swift): Heavy `Arc` objects that permanently cache everything: offset, parent, children. You get true pointer equality (super fast comparison), but it doubles memory since the whole tree is duplicated in the “Red” layer. C# tries to claw back memory with weak references.
- Cursor approach (rust-analyzer): Lightweight views that calculate positions on demand. Memory scales with traversal depth, not tree size.
#![allow(unused)]
fn main() {
// Memoized Design (C# / Swift)
// Persistent, heavy wrapper around the data.
struct SyntaxData {
offset: usize,
parent: Option<SyntaxNode>, // Back-link
// Caches children forever. High memory cost.
children: Vec<OnceCell<SyntaxNode>>,
}
}
| Feature | Memoized Nodes (C#/Swift) | Cursors (rust-analyzer) |
|---|---|---|
| Node Identity | Pointer Equality (Fast) | Range Check (Slower) |
| Memory Cost | High (Doubles tree size) | Low (Proportional to depth) |
| Child Access | Cached (Immediate) | Computed (On-demand) |
| Persistence | Long-lived Objects | Transient / Re-created |
Layer 3: AST (The API)
- A strongly typed API is way more convenient than working with untyped GreenNodes directly (though untyped works fine for error or trivial nodes).
- For example, having a typed `FnDef` with well-known methods beats a generic node you have to manually inspect.
- Example `AstNode` trait (from rust-analyzer):

```rust
#![allow(unused)]
fn main() {
pub trait AstNode {
    fn cast(syntax: SyntaxNode) -> Option<Self>
    where
        Self: Sized;
    fn syntax(&self) -> &SyntaxNode;
}
}
```

- The classes are auto-generated. One example from rust-analyzer:

```rust
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct FnDef {
    syntax: SyntaxNode,
}

impl AstNode for FnDef {
    fn cast(syntax: SyntaxNode) -> Option<Self> {
        match syntax.kind {
            SyntaxKind::FN => Some(FnDef { syntax }),
            _ => None,
        }
    }
    fn syntax(&self) -> &SyntaxNode {
        &self.syntax
    }
}

impl FnDef {
    pub fn param_list(&self) -> Option<ParamList> {
        self.syntax.children().find_map(ParamList::cast)
    }
    pub fn ret_type(&self) -> Option<RetType> {
        self.syntax.children().find_map(RetType::cast)
    }
    pub fn body(&self) -> Option<BlockExpr> {
        self.syntax.children().find_map(BlockExpr::cast)
    }
    // ...
}
}
```

- Variants such as expressions are represented as enums. One example from rust-analyzer:

```rust
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum AssocItem {
    FnDef(FnDef),
    TypeAliasDef(TypeAliasDef),
    ConstDef(ConstDef),
}

impl AstNode for AssocItem { ... }
}
```

- Shared syntactic elements (like names, loop bodies, or doc comments) are modeled via traits:

```rust
#![allow(unused)]
fn main() {
trait HasName: AstNode {
    fn name(&self) -> Option<Name>;
}

impl HasName for FnDef {
    fn name(&self) -> Option<Name> {
        self.syntax.children().find_map(Name::cast)
    }
}

impl HasName for StructDef {
    fn name(&self) -> Option<Name> {
        self.syntax.children().find_map(Name::cast)
    }
}
}
```
Alternative Designs
Semantic Full AST
- IntelliJ’s PSI is a “rich” AST that abstracts away where code came from: a `PsiMethod` looks the same whether it comes from source (.java) or compiled (.jar), hiding all that complexity.
- Attached semantics: Nodes have methods like `resolve()` or `getSuperClass()` built in, so you can query semantics directly on the tree.
- Memory optimization: Nodes start as a lightweight “stub” (just serialized metadata) and only inflate to a full AST when you actually open the file.
| Feature | Pure AST (rust-analyzer) | Semantic PSI (IntelliJ) |
|---|---|---|
| Scope | Source code structure only | Source + Compiled Libraries |
| Backing | Always AST | Dynamic (index stub -> full tree) |
| Logic | External (analysis passes) | Internal (methods on nodes) |
| Memory | Low (transient cursors) | High (rich objects), mitigated by stubs |
Parsing - The Token Sequence Transformer
- You can think of parsing as doing a DFS traversal of the tree you want to build.
- When the parser finishes a node, it calls `GreenNodeBuilder` to construct it:

```rust
#![allow(unused)]
fn main() {
pub struct GreenNodeBuilder { ... }

impl GreenNodeBuilder {
    pub fn new() -> GreenNodeBuilder { ... }
    pub fn token(&mut self, kind: SyntaxKind, text: &str) { ... }
    pub fn start_node(&mut self, kind: SyntaxKind) { ... }
    pub fn finish_node(&mut self) { ... }
    pub fn finish(self) -> GreenNode { ... }
}
}
```

- There are two kinds of input: source text (full of trivia like whitespace and comments) and macro token trees (just structural tokens, no trivia).
- Input and output tokens don’t match 1-to-1. The parser uses abstract callbacks: `TokenSource` for reading input and `TreeSink` for writing output. An intermediary layer handles the mismatches, like stripping whitespace or merging `>` + `>` into `>>`.
- The parser interface (from rust-analyzer):

```rust
#![allow(unused)]
fn main() {
// Different structure than GreenNode's Token
pub struct Token {
    pub kind: SyntaxKind,
    pub is_joined_to_next: bool,
}

pub trait TokenSource {
    fn current(&self) -> Token;
    fn lookahead_nth(&self, n: usize) -> Token;
    fn is_keyword(&self, kw: &str) -> bool;
    fn bump(&mut self);
}

pub trait TreeSink {
    fn token(&mut self, kind: SyntaxKind, n_tokens: u8);
    fn start_node(&mut self, kind: SyntaxKind);
    fn finish_node(&mut self);
    fn error(&mut self, error: ParseError);
}

pub fn parse(
    token_source: &mut dyn TokenSource,
    tree_sink: &mut dyn TreeSink,
) { ... }
}
```

- The `parser` and `syntax` crates are completely separate with zero dependencies on each other, achieving strict modularity.
- The parser mostly works with `SyntaxKind` tags and ignores the actual text, except for one hack: checking contextual keywords like `union` or `default`.
- `TreeSink` isn’t atomic. When the parser emits a logical token like `>>`, the sink can consume multiple raw tokens (`>` and `>`).
Reporting Syntax Errors
- Errors aren’t stored in the tree; they’re collected separately in a `Vec<SyntaxError>`.
- This separation means you can build or modify trees manually without worrying about error state.
- The parser is intentionally permissive (like allowing `pub` on trait methods). These “soft” errors get caught later in a validation pass.
Macros
- Token hygiene: Macros need to remember where tokens came from so variables with the same name from different scopes don’t get confused. While building the tree, `TreeSink` also creates a map linking text ranges back to their original token IDs. So parsing gives you two things: the tree and a map tracing everything back to its source.
- Operator precedence: To avoid breaking precedence (like `$expr * 1` messing up order of operations), the parser adds invisible parentheses around macro expansions.
Whitespace & Comments
- The core parser works with a “clean” token stream—no whitespace or comments to worry about.
- But trivia isn’t thrown away. The `TreeSink` layer re-inserts it into the final tree as nodes get built.
- Comments don’t just get dumped linearly; they’re attached heuristically to their semantic parents. For example, a comment right before a function becomes a child of that `FnDef` node.
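As a toy illustration of that attachment idea (my own sketch, not rust-analyzer’s actual heuristic), a comment can be attached to the following item only when no blank line separates them:

```rust
// Sketch: decide whether a comment "belongs" to the item after it, based
// on the whitespace gap between the comment and the item. This is a
// simplified stand-in for rust-analyzer's trivia-attachment logic.
fn comment_attaches(gap_after_comment: &str) -> bool {
    // Two or more newlines in the gap means a blank line separates them,
    // which breaks the attachment.
    gap_after_comment.matches('\n').count() < 2
}

fn main() {
    // A comment directly above `fn foo` attaches to the FnDef node.
    assert!(comment_attaches("\n"));
    // A blank line in between keeps the comment a free-standing sibling.
    assert!(!comment_attaches("\n\n"));
}
```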
Incremental Reparse
- Green trees make modifications cheap—you can “patch” by swapping a single node pointer with a freshly parsed subtree. No external state needed.
- Block heuristic: Edits get isolated to the smallest `{}` block that contains them. This works because the parser keeps braces balanced even in broken code, giving you stable anchor points.
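To see why such patches are cheap, here is a minimal sketch (assuming a simplified `GreenNode`, not rust-analyzer’s real API): replacing one subtree re-allocates only the spine from the root down to the edit, while every other subtree is shared via `Arc`.

```rust
use std::sync::Arc;

// Simplified immutable green node (illustrative, no text or lengths).
#[derive(Debug)]
struct GreenNode {
    kind: &'static str,
    children: Vec<Arc<GreenNode>>,
}

// Returns a new root where the node at `path` (child indices from the
// root) is replaced by `new_child`. Only the spine is re-allocated;
// sibling subtrees are shared by cloning their `Arc`s.
fn replace_at(node: &Arc<GreenNode>, path: &[usize], new_child: Arc<GreenNode>) -> Arc<GreenNode> {
    match path {
        [] => new_child,
        [i, rest @ ..] => {
            let mut children = node.children.clone(); // clones Arcs, not subtrees
            children[*i] = replace_at(&node.children[*i], rest, new_child);
            Arc::new(GreenNode { kind: node.kind, children })
        }
    }
}

fn main() {
    let leaf = |k| Arc::new(GreenNode { kind: k, children: vec![] });
    let block = Arc::new(GreenNode { kind: "BLOCK", children: vec![leaf("LET_STMT")] });
    let root = Arc::new(GreenNode { kind: "FILE", children: vec![leaf("FN_KW"), block] });

    // "Incremental reparse": reparse just the block, then swap it in.
    let new_block = Arc::new(GreenNode { kind: "BLOCK", children: vec![leaf("EXPR_STMT")] });
    let new_root = replace_at(&root, &[1], new_block);

    // The untouched first child is literally shared between old and new roots.
    assert!(Arc::ptr_eq(&root.children[0], &new_root.children[0]));
    assert_eq!(new_root.children[1].children[0].kind, "EXPR_STMT");
}
```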
Q&A: How does the parser maintain balanced braces in broken code?
Q: How can braces be “balanced” when users type unmatched { or }?
A: The tree structure maintains balanced braces, not the source text. The parser uses error recovery to ensure every start_node(BLOCK) has a matching finish_node().
Q: What happens when a closing brace is missing?
A: The parser inserts an implicit/phantom closing brace. For example:
```rust
fn foo() {
    let x = 1;
// EOF - missing }
```
The tree structure acts as if the } exists at EOF, even though it’s not in the source text. This is implemented in the parser’s block parsing logic, which automatically closes unclosed blocks at EOF or when encountering incompatible tokens.
Q: What if there’s an unexpected token inside a block?
A: The parser uses early block termination. For example:
```rust
fn foo() {
    let x = 1;
    fn bar() { // unexpected fn - shouldn't be here
}
```
The parser:
- Realizes `fn` shouldn’t be inside `foo`’s block.
- Inserts an implicit `}` to close `foo`.
- Continues parsing `bar` at the outer level.
- Wraps the malformed content in an `ERROR` node.
This is part of the “panic mode” error recovery strategy implemented in the event-based parser.
Q: What about extra closing braces?
A: Unmatched } tokens are:
```rust
fn foo() {
}
} // extra closing brace
```
- Treated as error tokens.
- Wrapped in an `ERROR` node.
- Not matched with anything.
- The structural tree still has balanced blocks.
Q: How does this enable the block heuristic?
A: Because braces are always structurally balanced, the parser can reliably:
- Find the smallest enclosing `{}` block around any edit.
- Use these blocks as stable anchor points.
- Reparse just that subtree instead of the entire file.
This is implemented in the incremental reparsing logic in the `syntax` crate, particularly the `incremental_reparse` function and the `is_balanced` check.
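The core of such a balance check can be sketched in a few lines. This is a simplification over raw brace characters (the real check operates on the token sequence of the candidate range):

```rust
// Sketch: an edited range is only usable as a reparse anchor if its
// braces pair up exactly. An extra `}` or an unclosed `{` disqualifies it.
fn is_balanced(tokens: &[char]) -> bool {
    let mut depth: i32 = 0;
    for &t in tokens {
        match t {
            '{' => depth += 1,
            '}' => {
                depth -= 1;
                if depth < 0 {
                    return false; // extra `}`: can't anchor here
                }
            }
            _ => {} // non-brace tokens don't affect balance
        }
    }
    depth == 0 // an unclosed `{` also disqualifies the range
}

fn main() {
    let ok: Vec<char> = "fn f() { let x = 1; }".chars().collect();
    let bad: Vec<char> = "fn f() { {".chars().collect();
    assert!(is_balanced(&ok));
    assert!(!is_balanced(&bad));
}
```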
- In practice though, incremental reparsing often isn’t worth it. Modern parsers are fast enough to just reparse the whole file from scratch.
Parsing Algorithm
- Recursive descent: Each grammar rule maps to a function that calls other functions recursively to parse sub-expressions. This makes the parser easy to understand and debug.
- Pratt parsing: Handle operator precedence elegantly without building complex precedence tables. Used specifically for expressions with binary operators.
- Error recovery: The parser makes a special effort to continue parsing even when it encounters syntax errors. Instead of stopping at the first error, it attempts to recover and continue building the tree, marking problematic sections with `ERROR` nodes.
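As a minimal illustration of Pratt parsing (an illustrative sketch of the technique, not rust-analyzer’s expression grammar), each operator gets a pair of left/right binding powers, and a single recursive function replaces an explicit precedence table:

```rust
// Binding powers: higher numbers bind tighter. The (left, right) pair
// encodes both precedence and left-associativity.
fn infix_binding_power(op: char) -> Option<(u8, u8)> {
    match op {
        '+' | '-' => Some((1, 2)),
        '*' | '/' => Some((3, 4)), // binds tighter than +/-
        _ => None,
    }
}

// Parses single-digit operands and binary operators, returning a
// parenthesized string that makes the resulting tree shape visible.
fn expr(tokens: &mut std::iter::Peekable<std::str::Chars>, min_bp: u8) -> String {
    let mut lhs = tokens.next().expect("expected an operand").to_string();
    while let Some(&op) = tokens.peek() {
        let Some((l_bp, r_bp)) = infix_binding_power(op) else { break };
        if l_bp < min_bp {
            break; // this operator belongs to an enclosing call
        }
        tokens.next(); // consume the operator
        let rhs = expr(tokens, r_bp);
        lhs = format!("({lhs} {op} {rhs})");
    }
    lhs
}

fn main() {
    let mut toks = "1+2*3".chars().peekable();
    // `*` binds tighter, so the tree nests as 1 + (2 * 3).
    assert_eq!(expr(&mut toks, 0), "(1 + (2 * 3))");
}
```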
Parser Recap
The parser’s design achieves strict modularity through well-defined interfaces:
- `TokenSource` trait:
  - Defines how the parser reads input tokens.
  - The parser doesn’t know or care whether tokens come from source text (with trivia), macro expansions (without trivia), or any other source.
- `TreeSink` trait:
  - Defines how the parser writes output events (`start_node`, `finish_node`, `token`, `error`).
  - The parser has no knowledge of how these events are converted into the final tree structure.
Interaction Between the syntax and parser Crates
The parser and syntax crates have zero dependencies on each other. Both crates depend on shared interfaces (traits), but neither depends on the other’s implementation.
%%{init: {'theme':'neutral', 'themeVariables': {'fontSize':'16px'}}}%%
flowchart LR
A(["<i>string</i><br/><b>Source Text</b>"])
subgraph top["<code>syntax</code> crate"]
direction LR
B["<i>fn</i><br/><b>tokenize</b>"]
C(["<i>Vec<Token></i><br/><b>Token Stream</b>"])
B --> C
end
subgraph P["<code>parser</code> crate"]
direction LR
D{{"<i>trait</i><br/><b>TokenSource</b>"}}
E["<i>fn</i><br/><b>parse</b>"]
F{{"<i>trait</i><br/><b>TreeSink</b>"}}
D --> E --> F
end
subgraph bottom["<code>syntax</code> crate"]
direction RL
I{{"<i>trait</i><br/><b>AstNode</b>"}}
H(["<i>type</i><br/><b>SyntaxNode</b>"])
G(["<i>type</i><br/><b>GreenNode</b>"])
G --> H --> I
end
A --> B
C -.->|implements| D
F -.->|implemented by| G
click B "https://github.com/rust-lang/rust-analyzer/blob/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_syntax/src/parsing/lexer.rs#L30-L92" "View tokenize function"
click C "https://github.com/rust-lang/rust-analyzer/blob/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_syntax/src/parsing/lexer.rs#L9-L15" "View Token struct"
click D "https://github.com/rust-lang/rust-analyzer/blob/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_parser/src/lib.rs#L30-L44" "View TokenSource trait"
click E "https://github.com/rust-lang/rust-analyzer/blob/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_parser/src/lib.rs#L81-L84" "View parse function"
click F "https://github.com/rust-lang/rust-analyzer/blob/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_parser/src/lib.rs#L56-L69" "View TreeSink trait"
click G "https://github.com/rust-lang/rust-analyzer/blob/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_syntax/src/syntax_node.rs#L10-L10" "View GreenNode type"
click H "https://github.com/rust-lang/rust-analyzer/blob/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_syntax/src/syntax_node.rs#L33-L34" "View SyntaxNode type alias"
click I "https://github.com/rust-lang/rust-analyzer/blob/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_syntax/src/ast.rs#L31-L42" "View AstNode trait"
classDef syntaxNode fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#1e40af
classDef parserNode fill:#fed7aa,stroke:#f97316,stroke-width:2px,color:#9a3412
classDef dataNode fill:#e5e7eb,stroke:#6b7280,stroke-width:2px,color:#374151
classDef traitNode fill:#fce7f3,stroke:#db2777,stroke-width:2px,color:#831843
class B,E syntaxNode
class A,C,G,H dataNode
class D,F,I traitNode
style top fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,rx:5
style bottom fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,rx:5
style P fill:#fff7ed,stroke:#f97316,stroke-width:2px,rx:5
- The solid arrows represent direct data flow within the same crate. The dashed arrows represent trait implementation/interface contracts.
- The `parser` crate (orange box) only touches traits; it never directly depends on concrete types from `syntax`.
- The `syntax` crate appears in two boxes because it provides both the input preparation (tokenization) and the output consumption (tree building), while the parser sits independently in the middle.
- This architecture is like an “hourglass”: wide input capabilities (multiple token sources), narrow interface (two traits), wide output capabilities (multiple tree representations).
How They Work Together
- The `syntax` crate tokenizes the source text into a token stream.
- The `syntax` crate wraps its token stream in a struct that implements the `TokenSource` trait, providing the interface the parser expects.
- The `parser` crate’s `parse()` function reads from `TokenSource` and writes to `TreeSink`; it only knows about these trait interfaces, not the concrete implementations.
- The `syntax` crate provides a struct implementing `TreeSink` that receives parser events and builds the actual `GreenNode` tree structure.
- The `syntax` crate wraps the `GreenNode` in `SyntaxNode` and provides the typed AST layer.
Insight: The parser crate emits abstract events (start_node, token, finish_node, error) without knowing how they’ll be consumed. The syntax crate provides concrete implementations that convert these events into the tree structure.
What the parser knows vs doesn’t know:
- Knows: `SyntaxKind` tags (e.g., `FN_DEF`, `IDENT`, `LET_STMT`), the vocabulary of what syntactic constructs exist.
- Knows: When to emit events like `start_node(FN_DEF)` or `token(IDENT)` based on parsing logic.
- Doesn’t know: How these events are converted into actual tree nodes (`GreenNode`, `SyntaxNode`).
- Doesn’t know: The memory layout, data structures, or APIs of the tree.
SyntaxKind is just a simple enum/tag shared between both crates. The parser uses it to label events, while the syntax crate uses it to tag tree nodes. This shared vocabulary is the only coupling between them.
Illustrative example
Example glue code in the `syntax` crate:
#![allow(unused)]
fn main() {
// `syntax` crate
pub fn parse_source_file(text: &str) -> SourceFile {
// Step 1: Tokenize the source text (syntax crate's job)
let tokens = tokenize(text);
// Step 2: Create adapters that implement parser traits
let mut token_source = TokenSource::new(text, &tokens);
let mut tree_sink = TreeSink::new(text, &tokens);
// Step 3: Call the parser (parser crate's job)
// The parser only sees the trait interfaces, not the concrete types
parser::parse(&mut token_source, &mut tree_sink);
// Step 4: Extract the built tree (syntax crate's job)
let (green_node, errors) = tree_sink.finish();
// Step 5: Wrap in higher-level APIs
SourceFile::new(green_node, errors)
}
}
Motivation
This separation provides several benefits:
- The parser logic can change without touching tree representation, and vice versa.
- Different token sources (source files, macro expansions, synthetic tokens) can all use the same parser.
- Different tree sinks could produce different tree formats, or even non-tree outputs (e.g., streaming validation).
- The parser can be tested with mock token sources and tree sinks.
- The parser is truly language-agnostic: it could theoretically be reused for other languages by providing different `SyntaxKind` definitions.
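A sketch of the testability benefit above, using simplified trait shapes of my own (not the real `TreeSink` signature): a mock sink just records the event stream, so parser logic can be asserted on without building any tree at all.

```rust
// The event vocabulary a parser emits (simplified: kinds are plain strings).
#[derive(Debug, PartialEq)]
enum Event {
    StartNode(&'static str),
    Token(&'static str),
    FinishNode,
}

// Simplified stand-in for the TreeSink trait from the section above.
trait TreeSink {
    fn start_node(&mut self, kind: &'static str);
    fn token(&mut self, kind: &'static str);
    fn finish_node(&mut self);
}

// Mock sink: records events instead of building a GreenNode tree.
struct EventLog(Vec<Event>);

impl TreeSink for EventLog {
    fn start_node(&mut self, kind: &'static str) { self.0.push(Event::StartNode(kind)); }
    fn token(&mut self, kind: &'static str) { self.0.push(Event::Token(kind)); }
    fn finish_node(&mut self) { self.0.push(Event::FinishNode); }
}

// A toy "parser" that emits events for a `fn name` header.
fn parse_fn_header(sink: &mut dyn TreeSink) {
    sink.start_node("FN_DEF");
    sink.token("FN_KW");
    sink.token("IDENT");
    sink.finish_node();
}

fn main() {
    let mut log = EventLog(Vec::new());
    parse_fn_header(&mut log);
    // The test asserts directly on the event stream; no tree required.
    assert_eq!(
        log.0,
        vec![
            Event::StartNode("FN_DEF"),
            Event::Token("FN_KW"),
            Event::Token("IDENT"),
            Event::FinishNode,
        ]
    );
}
```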
Testing
Official site: Link.
Test Philosophy
rust-analyzer employs snapshot testing as its core testing strategy (surprised LOL). The essence: “A test is a piece of input text, usually Rust code, and some output text.”
The testing helper compares the actual result against the expected output. When tests fail, you can review the diff and either fix the code or update the expectation if the new behavior is correct.
Test Framework Components
The testing infrastructure combines two main elements:
- `expect-test` crate: A lightweight snapshot testing library that enables automated expectation updates.
- Custom testing helpers: Purpose-built functions for specific rust-analyzer features (type inference, diagnostics, etc.).
Common Test Helpers
Type Inference Tests
Located in crates/hir-ty/src/tests. Three main helper functions:
check_no_mismatches()
Validates that no type mismatches exist in the given code.
#[test]
fn simple_assignment() {
check_no_mismatches(
r#"
fn main() {
let x: i32 = 42;
let y: i32 = x;
}
"#,
);
}
Note: This relies on rust-analyzer’s analysis, not the Rust compiler. The analyzer might miss errors the compiler would catch, or vice versa.
check_types()
Asserts specific expression types using inline annotations. Uses ^ markers under code to denote type assertions.
#[test]
fn type_annotations() {
check_types(
r#"
fn main() {
let x = 1;
// ^ i32
let y = "hello";
// ^ &str
let z = x + 2;
// ^ i32
}
"#,
);
}
The // ^ type notation marks the expression on the line above and asserts its inferred type.
check_infer()
Uses expect-test to match complete type inference output against expected results. Displays byte ranges, source text, and inferred types for all expressions.
#![allow(unused)]
fn main() {
#[test]
fn infer_function_return() {
check_infer(
r#"
fn foo() -> i32 {
42
}
"#,
expect![[r#"
10..11 '{': expected i32, got ()
15..17 '42': i32
"#]],
);
}
}
Annotation Conventions
rust-analyzer tests use a special notation system for marking positions and ranges in code:
| Notation | Purpose | Example |
|---|---|---|
$0 | Cursor position (single point) | let x$0 = 1; |
$0...$0 | Range/selection | let $0x = 1$0; |
// ^^^^ | Type assertion label | x + 1// ^^^ i32 |
These annotations are processed by the fixture parser and removed before the code is analyzed.
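Conceptually, processing a `$0` marker amounts to recording its byte offset and stripping it before analysis. A simplified stand-in for what the fixture parser does (not the real helper):

```rust
// Finds the `$0` cursor marker in a fixture string, returning its byte
// offset plus the cleaned code that the analyzer would actually see.
fn extract_cursor(fixture: &str) -> (usize, String) {
    let offset = fixture.find("$0").expect("fixture must contain $0");
    let clean = fixture.replacen("$0", "", 1);
    (offset, clean)
}

fn main() {
    let (offset, code) = extract_cursor("let x$0 = 1;");
    assert_eq!(offset, 5); // cursor sits right after `x`
    assert_eq!(code, "let x = 1;"); // marker removed before analysis
}
```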
Fixture System
The fixture system is a mini-language for declaring multi-file, multi-crate test scenarios. It allows tests to simulate complex project structures without requiring actual file system operations.
Basic Fixture Structure
#![allow(unused)]
fn main() {
#[test]
fn test_cross_crate_reference() {
check_types(
r#"
//- minicore: sized
//- /lib.rs crate:foo
pub struct Foo;
//- /bar.rs crate:bar deps:foo
use foo::Foo;
fn use_foo(f: Foo) {
// ^^^ Foo
}
"#,
);
}
}
Fixture Directives
Minicore Flags
//- minicore: flag1, flag2, ...
Includes specific standard library components. This is the most commonly used directive.
Common minicore flags:
| Flag | Provides | When to Use |
|---|---|---|
sized | Sized trait | Almost always; recommended when stuck |
fn | Fn, FnMut, FnOnce traits | Required for closures |
async_fn | AsyncFn* traits | Required for async closures |
iterator | Iterator trait | When testing iteration |
option | Option<T> enum | When using Option |
result | Result<T, E> enum | When using Result |
unsize | Unsize trait | For unsized types (dyn Trait, slices) |
coerce_unsized | CoerceUnsized trait | For unsized type coercions |
dispatch_from_dyn | DispatchFromDyn trait | For trait object dispatch |
Important: If your test mysteriously fails with type errors, try adding sized first. Many operations implicitly require the Sized trait.
File Declarations
//- /path/to/file.rs crate:name deps:dep1,dep2 edition:2021
Declares a file in the fixture with optional metadata:
- Path: File path relative to project root
- crate: Crate name (defaults to file name)
- deps: Comma-separated dependency list
- edition: Rust edition (defaults to 2021)
#![allow(unused)]
fn main() {
//- /lib.rs crate:mylib
pub mod foo;
//- /foo.rs
pub struct Bar;
//- /main.rs crate:main deps:mylib edition:2021
use mylib::foo::Bar;
}
Complete Fixture Example
#[test]
fn test_workspace_with_dependencies() {
check_types(
r#"
//- minicore: sized, fn, iterator
//- /workspace/utils/lib.rs crate:utils
pub fn helper() -> i32 { 42 }
//- /workspace/core/lib.rs crate:core deps:utils
use utils::helper;
pub fn process() -> i32 {
helper()
// ^^^^^^^^ fn helper() -> i32
}
//- /workspace/app/main.rs crate:app deps:core
use core::process;
fn main() {
let x = process();
// ^ i32
}
"#,
);
}
This fixture creates a workspace with three crates (utils → core → app) with proper dependency relationships.
Fixture System Architecture
%%{init: {'theme':'neutral', 'themeVariables': {'fontSize':'16px'}}}%%
flowchart LR
A["Test Fixture String"]
B["Parse Directives"]
C["minicore Flags"]
D["File Declarations"]
E["Build In-Memory\nFile System"]
F["Create Crate Graph"]
G["Run rust-analyzer\nAnalysis"]
H["Extract Results"]
A --> B
B --> C
B --> D
C --> E
D --> E
E --> F
F --> G
G --> H
classDef inputNode fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#1e3a8a
classDef parseNode fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#78350f
classDef buildNode fill:#dcfce7,stroke:#22c55e,stroke-width:2px,color:#064e3b
classDef outputNode fill:#fce7f3,stroke:#db2777,stroke-width:2px,color:#831843
class A inputNode
class B,C,D parseNode
class E,F,G buildNode
class H outputNode
The fixture parser:
- Extracts directives (
minicore, file paths, metadata). - Builds an in-memory file system.
- Constructs a crate dependency graph.
- Initializes
rust-analyzerwith this synthetic project. - Runs analysis and returns results for assertions.
UPDATE_EXPECT Workflow
The expect-test crate supports automatic expectation updates via the UPDATE_EXPECT=1 environment variable.
Manual Update
UPDATE_EXPECT=1 cargo test
This runs all tests and automatically updates any failing expect![[...]] blocks with the actual output.
IDE Integration
Some IDEs (notably VSCode with the rust-analyzer extension) provide convenient UI buttons for updating expectations:
- “Update Expect” button in test function hover tooltips
- Inline code actions for specific
expect!blocks
Workflow Diagram
%%{init: {'theme':'neutral', 'themeVariables': {'fontSize':'16px'}}}%%
stateDiagram-v2
[*] --> WriteTest: Write test with expect!
WriteTest --> RunTest: cargo test
RunTest --> TestPasses: Output matches
RunTest --> TestFails: Output differs
TestPasses --> [*]
TestFails --> ReviewDiff: Examine actual vs expected
ReviewDiff --> FixCode: Bug in code
ReviewDiff --> UpdateExpect: Behavior is correct
FixCode --> RunTest
UpdateExpect --> RunWithUpdate: UPDATE_EXPECT=1 cargo test
RunWithUpdate --> Verify: Check updated expectation
Verify --> [*]
Best Practices
1. Import Required Types
The minicore system only includes explicitly requested components. If your test fails with “cannot find type” errors:
#![allow(unused)]
fn main() {
//- minicore: sized, fn, iterator, option // ← Add missing flags
}
2. Start with sized
When stuck with mysterious type errors, add sized to minicore flags. Many operations implicitly require the Sized trait:
#![allow(unused)]
fn main() {
//- minicore: sized // ← Add this first when debugging
}
3. Use Semantic Highlighting for Debugging
If code looks wrong in the test, check its syntax highlighting. Incorrect highlighting often indicates:
- Missing minicore flags
- Syntax errors in the fixture
- Incorrect file path declarations
4. Validate with Real IDEs
When test behavior seems suspicious, validate the code in an actual IDE with rust-analyzer. Sometimes tests simplify too much and miss real-world edge cases.
5. Unsized Types Need Special Flags
Working with dyn Trait, slices, or other unsized types? Add these flags:
#![allow(unused)]
fn main() {
//- minicore: sized, unsize, coerce_unsized, dispatch_from_dyn
}
6. Keep Tests Minimal
Follow the “Reproduction isolation” principle from rust-analyzer conventions: minimize test cases to the smallest code required to reproduce behavior. This reduces noise and debugging time.
Common Test Patterns
Testing Completion
#![allow(unused)]
fn main() {
#[test]
fn completes_struct_fields() {
check_edit(
"field",
r#"
struct S { field: i32 }
fn foo(s: S) {
s.$0
}
"#,
r#"
struct S { field: i32 }
fn foo(s: S) {
s.field$0
}
"#,
);
}
}
The $0 marker indicates cursor position before (trigger point) and after (expected position) completion.
Testing Diagnostics
#![allow(unused)]
fn main() {
#[test]
fn detects_type_mismatch() {
check_diagnostics(
r#"
fn foo() -> i32 {
"string"
// ^^^^^^^^ error: expected i32, found &str
}
"#,
);
}
}
Inline error annotations specify expected diagnostic messages at specific locations.
Testing Quick Fixes
#[test]
fn adds_missing_import() {
check_fix(
r#"
//- /lib.rs crate:foo
pub struct Foo;
//- /main.rs crate:main deps:foo
fn main() {
Foo$0 {};
}
"#,
r#"
use foo::Foo;
fn main() {
Foo {};
}
"#,
);
}
Tests that the quick fix correctly adds the missing import.
Test Organization
rust-analyzer organizes tests close to implementation code:
crates/
├── hir-ty/
│ └── src/
│ ├── infer.rs # Type inference logic
│ └── tests/
│ ├── simple.rs # Basic inference tests
│ ├── traits.rs # Trait-related tests
│ └── coercion.rs # Type coercion tests
├── ide/
│ └── src/
│ ├── completion.rs # Completion implementation
│ └── completion/
│ └── tests.rs # Completion tests
This co-location makes it easy to find relevant tests when modifying code.
Recap
rust-analyzer’s testing philosophy emphasizes:
- Data-driven testing: Tests defined by input/output pairs, resilient to refactoring.
- Fixture-based isolation: Each test builds a complete in-memory project.
- Snapshot automation: `UPDATE_EXPECT=1` eliminates manual expectation maintenance.
- Minimal reproduction: Tests include only what’s necessary to trigger behavior.
- Co-located tests: Tests live near implementation for easy discovery.
libsyntax
- Related RFC for `libsyntax-2.0`: Link.
  - Discusses design choices for the syntax tree structure and parser.
Salsa
Salsa: Incremental Computation Framework
Official site: Link.
What is Salsa?
A Rust framework for writing incremental, on-demand programs:
- Adapts to input changes while maintaining up-to-date outputs.
- Originally developed from incremental recompilation techniques in `rustc`.
- Tracks dependencies and memoizes results for efficient recomputation.
- Primary use case: Compilers and language tooling responding to source code modifications without full recompilation.
Core Concepts
Queries
The fundamental mechanism in Salsa:
- Functions whose results are cached (memoized) and tracked for dependencies.
Two types of queries:
- Input queries: Values provided from outside the system. When these change, dependent computations are invalidated.
- Derived queries: Computed from other queries (inputs or derived). Results are cached and recomputed only when dependencies change.
Database
The central runtime that manages query execution, dependency tracking, and memoization. All queries execute through the database, which coordinates:
- Active function tracking.
- Dependency recording.
- Cache invalidation.
- Result storage.
Tracked Structures
Salsa manages tracked and interned structs:
- Maintain state throughout computations.
- Integrate with the dependency tracking system.
- Enable Salsa to know when they change and which computations depend on them.
How It Works
Dependency Tracking
When a query executes:
- Salsa automatically records which other queries it reads.
- Creates a dependency graph.
When an input changes:
- Salsa marks affected queries as “potentially stale”.
- On next access, performs a “maybe changed after” check.
- Only recomputes if dependencies actually changed.
- Reuses cached results when possible.
Memoization
Query results are cached with their dependencies. Salsa stores:
- The computed result.
- Which queries were read during computation.
- The “revision” when the computation occurred.
Incremental Updates
The framework uses a revision counter that increments when inputs change.
For each cached result, Salsa can determine:
- Was this result computed before or after the last change?
- Have any of its dependencies changed since computation?
- Can we reuse the cached value?
Result: Efficient incremental updates where only affected computations run.
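The revision mechanism can be sketched with a toy database (my own simplification, not Salsa’s API): each cached result remembers the revision it was computed at, and any input change bumps a global counter.

```rust
// Toy revision-based memoization in the spirit of Salsa.
struct Db {
    revision: u64,
    input: String,
    // (revision the result was computed at, cached result)
    word_count_cache: Option<(u64, usize)>,
}

impl Db {
    // "Input query": setting it bumps the global revision counter.
    fn set_input(&mut self, text: &str) {
        self.revision += 1;
        self.input = text.to_string();
    }

    // "Derived query": word count of the input, memoized per revision.
    fn word_count(&mut self) -> usize {
        if let Some((rev, cached)) = self.word_count_cache {
            if rev == self.revision {
                return cached; // cache still valid: nothing changed since
            }
        }
        let result = self.input.split_whitespace().count();
        self.word_count_cache = Some((self.revision, result));
        result
    }
}

fn main() {
    let mut db = Db { revision: 0, input: String::new(), word_count_cache: None };
    db.set_input("hello incremental world");
    assert_eq!(db.word_count(), 3);
    assert_eq!(db.word_count(), 3); // second call served from the cache
    db.set_input("hello"); // revision bump invalidates the cache
    assert_eq!(db.word_count(), 1); // recomputed at the new revision
}
```

Real Salsa goes much further (per-dependency “maybe changed after” checks instead of a single global counter), but the revision bookkeeping is the same basic idea.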
Cycle Handling
When circular dependencies occur (Query A depends on B, B depends on A):
- Salsa detects the cycle.
- Provides mechanisms to handle it.
- Prevents infinite loops.
Example Workflow
#![allow(unused)]
fn main() {
// Input query - set from outside
#[salsa::input]
struct SourceText {
#[return_ref]
text: String,
}
// Derived query - computed from inputs
#[salsa::tracked]
fn parse(db: &dyn Db, input: SourceText) -> Ast {
// Parse the text
// Salsa automatically tracks that this depends on `input.text()`
}
#[salsa::tracked]
fn check_types(db: &dyn Db, input: SourceText) -> TypeCheckResult {
let ast = parse(db, input); // Depends on parse
// Type check the ast
}
// Later, when input changes:
// 1. Set new input: input.set_text(db).to(new_text)
// 2. Access check_types(db, input)
// 3. Salsa sees parse() needs recompute (input changed)
// 4. Salsa recomputes parse(), then check_types()
// 5. If parse() result unchanged, check_types() reuses cache
}
Key Features
On-Demand Evaluation
- Queries computed lazily.
- If no one asks for a result, it doesn’t get computed.
- Contrasts with systems that eagerly recompute everything after changes.
Durability
- Queries can be marked as “durable” if their inputs rarely change.
- Salsa uses durability levels to optimize invalidation checks.
LRU Caching
- Memoized results can be evicted using LRU (Least Recently Used) strategy.
- Manages memory by evicting old results.
- Results recomputed if accessed after eviction.
Parallel Execution
- Multiple queries execute concurrently.
- Salsa coordinates access and handles race conditions through its runtime.
Cancellation
- Long-running queries cancelled if inputs change during execution.
- Avoids wasted computation.
Runtime Architecture
The Salsa runtime manages:
- Active function tracking: Which queries are currently executing.
- Revision management: Monotonic counter for tracking change epochs.
- Dependency graph: Records which queries depend on which.
- Memoization storage: Cached results indexed by query + inputs.
- Worker coordination: Handles concurrent query execution.
Advantages
- Automatic dependency tracking: No manual specification of what depends on what.
- Fine-grained invalidation: Only recompute what actually changed.
- Transparent caching: Memoization handled by framework, not user code.
- Parallel-friendly: Built-in support for concurrent execution.
- Memory efficient: LRU eviction prevents unbounded cache growth.
Integration with rust-analyzer
rust-analyzer uses Salsa in its base-db crate for incremental compilation.
Example uses:
- Input queries: File contents, crate graph configuration.
- Derived queries: Parse trees, name resolution, type inference.
Benefits:
- When a file changes, only affected analysis passes rerun.
- Editing one function: type checking other functions reuses cached results.
- Responsive even on large codebases.
- Most edits trigger minimal recomputation.
Comparison: Reactive vs Traditional
Traditional approach:
File change → Reparse entire project → Recheck all types → Update all diagnostics
Salsa approach:
File change → Mark queries as potentially stale
User requests diagnostics → Salsa recomputes only affected queries → Return result
The Salsa approach is faster because:
- Only computes what’s actually needed (on-demand).
- Reuses unchanged intermediate results (memoization).
- Granular invalidation (file-level, not project-level).
Learning Resources
- Tutorial: calc language: End-to-end example building a compiler/interpreter
- Plumbing documentation: Internal mechanisms and algorithms
- Community: Zulip instance for discussions
Summary
Salsa provides incremental computation through:
- Query-based architecture: All computations are queries that depend on other queries.
- Automatic dependency tracking: Framework records dependencies during execution.
- Smart invalidation: Only recomputes when dependencies actually change.
- Transparent memoization: Caches results without explicit cache management.
- On-demand evaluation: Computes only what’s requested, when requested.
Result:
- Responsive compilers and language servers.
- Efficient change handling.
- Real-time feedback practical even for large codebases.
Research Summary
Operations
- Study established projects to inherit their mature operational workflows [1].
- Convert recurring code review feedback into a permanent “living style guide” that scales mentorship and prevents repetitive corrections [1].
- Structure commit metadata (titles and messages) to automate downstream documentation like changelogs, reducing manual release overhead [1].
Code Evolution
- Classify changes by their impact scope (internal vs. API/dependency) to apply proportional review scrutiny [1].
- Internalize trivial utilities (“micro-dependencies”) to reduce supply chain attack surface and compilation bloat [1].
- Prefer parsing data into types that guarantee validity over simple validation checks, which eliminates “boolean blindness” where the system loses the proof of validity after the check returns `true` [1].
- Restrict heavy generics at system boundaries to prevent “monomorphization bloat,” trading minor runtime overhead for significant compile-time gains [1].
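The parse-don’t-validate point can be sketched in Rust with a hypothetical `Identifier` newtype for a DBML-like language: the only way to obtain the value is through the check, so the proof of validity travels with the type instead of evaporating after a boolean returns:

```rust
/// A validated identifier. The field is private, so the `parse`
/// constructor is the only way to obtain one.
#[derive(Debug, PartialEq)]
struct Identifier(String);

impl Identifier {
    /// Parses rather than validates: on success the caller holds a value
    /// whose type *guarantees* the invariant, not just a `true`.
    fn parse(s: &str) -> Result<Identifier, String> {
        let starts_ok = s
            .chars()
            .next()
            .map_or(false, |c| c.is_alphabetic() || c == '_');
        if starts_ok && s.chars().all(|c| c.is_alphanumeric() || c == '_') {
            Ok(Identifier(s.to_string()))
        } else {
            Err(format!("invalid identifier: {s:?}"))
        }
    }

    fn as_str(&self) -> &str {
        &self.0
    }
}

/// Downstream code accepts `Identifier`, not `&str` plus a prior bool
/// check, so it cannot be called with unvalidated input.
fn qualify(table: &Identifier, column: &Identifier) -> String {
    format!("{}.{}", table.as_str(), column.as_str())
}
```

`qualify` never needs to re-check its arguments; the type system carries the evidence forward.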
Compiler Architecture
rust-analyzer’s high-level architecture, from low-level to higher-level [1]:
- `parser` crate: A recursive descent parser that is tree-agnostic and event-based.
- `syntax` crate: Syntax tree structure and parser that exposes a higher-level API for the code’s syntactic structure.
- `base-db` crate: Integration with `salsa`, allowing for incremental and on-demand computation.
- `hir-xxx` crates: Program analysis phases that explicitly integrate incremental computation.
- `hir` crate: A high-level API that provides a static, inert, and fully resolved view of the code.
- `ide` crate: Provides high-level IDE features on top of the `hir` semantic model, speaking IDE languages such as texts and offsets instead of syntax nodes.
- `rust-analyzer`: The language server that knows about the LSP and JSON protocols.
Token & Node
- Parser tokens have tags and their corresponding source text, while parser nodes have tags and source length, with children placed in a homogeneous vector [2].
- Tokens and nodes share the same `SyntaxKind` enum and are not as clearly distinguished as in `@dbml/parse`: tokens function as leaf nodes while nodes function as interior nodes [2].
- The explicit-nodes approach treats whitespace and comments as sibling nodes: rust-analyzer handles trivia and error nodes uniformly (everything is a node), while some parsers like `@dbml/parse` attach trivia to semantic parents and others use hybrid approaches [2].
- For context-sensitive keywords like `union` and `default`, the parser checks the actual text via `TokenSource::is_keyword()` rather than relying solely on token kind [2].
- An intermediary layer using the `TokenSource` and `TreeSink` traits merges tokens based on context, such as combining `>` + `>` into `>>` [2].
- Nodes store only `text_len` (not absolute offset) while tokens store text; `SyntaxNode` computes the offset on demand from its parent, which enables incremental parsing without position invalidation [2].
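The `text_len`/on-demand-offset idea can be sketched as follows (a simplified illustration without interning, tagged pointers, or `Arc`-sharing):

```rust
/// "Green" layer: nodes know only their length, never their position.
enum Green {
    Token { len: usize },
    Node { len: usize, children: Vec<Green> },
}

impl Green {
    fn len(&self) -> usize {
        match self {
            Green::Token { len } => *len,
            Green::Node { len, .. } => *len,
        }
    }
}

/// "Red" view of one level: absolute child offsets are *derived* during
/// traversal from the parent's offset plus preceding siblings' lengths.
/// Because no offset is stored, an edit elsewhere in the file never
/// invalidates positions inside an unchanged subtree.
fn child_offsets(node: &Green, node_offset: usize) -> Vec<usize> {
    let mut offsets = Vec::new();
    if let Green::Node { children, .. } = node {
        let mut acc = node_offset;
        for child in children {
            offsets.push(acc); // offset computed, not stored
            acc += child.len();
        }
    }
    offsets
}
```

The same subtree reports different absolute offsets depending on where traversal starts, which is exactly what makes subtree sharing after edits cheap.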
Parsing
- The three-layer tree architecture separates concerns effectively [2]:
  - GreenNode (storage): an immutable, persistent layer with optimizations including DST (single heap allocation), tagged pointers, token interning, and `Arc`-sharing, storing `text_len` rather than offset.
  - SyntaxNode (cursor/RedNode): adds parent pointers and on-demand position computation; memory scales with traversal depth rather than tree size, and nodes are transient (rebuilt from GreenNode when needed).
  - AST (typed API): auto-generated typed wrappers like `FnDef` and `ParamList` around `SyntaxNode` that provide ergonomic access to specific constructs.
- Everything is preserved, including whitespace, comments, and invalid tokens; invalid input gets wrapped in `ERROR` nodes, and the original text can be reconstructed by concatenating token text [2].
- Errors live in a separate `Vec<SyntaxError>` rather than being embedded in the tree, which enables manual tree construction without error-state management and produces parser output as `(green_node, errors)` [2].
- Resilient parsing combines multiple strategies [2]:
  - The algorithm uses recursive descent with Pratt parsing for expressions and is intentionally permissive, accepting invalid constructs that are validated later.
  - The event-based architecture has the parser emit abstract events via the `TokenSource` (input) and `TreeSink` (output) traits, keeping the parser agnostic to tree structure.
  - Error recovery employs panic mode (skipping to synchronization points like `}` or `;`), inserts implicit closing braces, performs early block termination, and wraps malformed content in `ERROR` nodes.
- Incremental parsing uses a sophisticated approach [2]:
  - The Red-Green model separates Green (immutable storage) from Red (cursors with positions); this separation enables cheap tree patches by swapping GreenNode pointers.
  - The block heuristic reparses only the smallest `{}` block containing the edit, which works because the parser maintains structurally balanced braces through implicit `}` insertion and `ERROR` wrapping for extras.
  - Pragmatically, incremental reparsing is often not worth the complexity since a full reparse is fast and simpler, though the architecture remains valuable for cheap tree edits and subtree sharing.
- Error messages use a layered approach where a permissive parser is followed by a separate validation pass for “soft” errors, allowing the parser to focus on structural recovery while validation uses semantic context for detailed diagnostics [2].
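The event-based split can be sketched with a simplified sink trait. The names mirror rust-analyzer’s `TreeSink`, but the code is an illustration, not its real API:

```rust
/// The parser emits abstract events and never touches a concrete tree.
trait TreeSink {
    fn start_node(&mut self, kind: &'static str);
    fn token(&mut self, text: &str);
    fn finish_node(&mut self);
}

/// One sink implementation renders the tree as an s-expression; another
/// sink could build green nodes from the very same event stream.
struct SExprSink {
    out: String,
}

impl TreeSink for SExprSink {
    fn start_node(&mut self, kind: &'static str) {
        self.out.push_str(&format!("({kind}"));
    }
    fn token(&mut self, text: &str) {
        self.out.push_str(&format!(" {text:?}"));
    }
    fn finish_node(&mut self) {
        self.out.push(')');
    }
}

/// A toy tree-agnostic "parser" for `key: value` input. On malformed
/// input it stays resilient: the content is wrapped in an ERROR node
/// rather than aborting.
fn parse_pair(input: &str, sink: &mut dyn TreeSink) {
    sink.start_node("PAIR");
    if let Some((key, value)) = input.split_once(':') {
        sink.token(key.trim());
        sink.token(value.trim());
    } else {
        sink.start_node("ERROR");
        sink.token(input.trim());
        sink.finish_node();
    }
    sink.finish_node();
}
```

Because `parse_pair` only sees the trait, swapping the output format (s-expression, green tree, token dump) requires no change to the parsing logic.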
Design Choices & General Architectures
- Avoid blind serialization of internal types, which implicitly couples public clients to private implementation details [1].
- Enforce opaque API boundaries (such as between analysis and IDE layers) to enable radical internal refactoring without breaking consumers [1].
- Codify architectural laws (like “core layers are I/O free”) to permanently guarantee non-functional requirements such as speed and deterministic testing [1].
- Order function arguments by stability (context → data) to align code structure with mental models (“setting” → “actors”) and reduce cognitive load during scanning [1].
- Use distinct types to segregate unverified external input (like “dirty” OS strings) from validated internal data, preventing logic errors from crossing trust boundaries [1].
- Enforce invariants via “construction & retrieval” (private fields with public getters) rather than “mutation” (setters), ensuring objects never enter invalid states [1].
- Encode assumptions into the type system (such as non-nullable types) to force callers to handle edge cases explicitly, preserving context at the call site [1].
- Encapsulate complex execution arguments into temporary structs to support multiple execution modes without duplicating function signatures [1].
- Split functions with boolean flags (like `do(true)`) into distinct named functions, which adheres to the Single Responsibility Principle and prevents unrelated logic paths from coupling [1].
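The “construction & retrieval” point above can be sketched with a hypothetical `Table` type: the invariant is established once in the constructor, and with no setter and a private field there is no mutation path that could break it later:

```rust
/// Invariant: a table name is never empty.
pub struct Table {
    name: String, // private: no external code can invalidate it
}

impl Table {
    /// Construction: the invariant is checked exactly once, here.
    pub fn new(name: &str) -> Result<Table, &'static str> {
        if name.is_empty() {
            Err("table name must not be empty")
        } else {
            Ok(Table { name: name.to_string() })
        }
    }

    /// Retrieval: callers can read the value but never corrupt it.
    pub fn name(&self) -> &str {
        &self.name
    }
}
```

Every function receiving a `Table` can rely on the invariant without re-checking, which is the same payoff as the parse-don’t-validate pattern.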
Implementation Patterns
- Prioritize imperative clarity over functional brevity: code should maximize “work per line” rather than minimizing line count via complex indirection [1].
- Use spatial operators (like `<` or `<=`) that map intuitively to the mental number line (0 → ∞), avoiding the mental effort required to “flip” comparisons [1].
- Prefer syntax that supports left-to-right reading (such as explicit type ascription), which reduces the “context window” required to understand a statement by declaring intent up front [1].
- Use blocks `{ ... }` to isolate temporary state, preventing variable pollution while retaining access to the parent context [1].
- Push resource allocation (memory and I/O) up to the call site (like passing a buffer in) to make performance costs visible and controllable by the caller [1].
- Use explicit namespaces or qualifiers to visually reinforce layer boundaries in code, such as distinguishing `ast::Node` from `hir::Node` [1].
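The buffer-passing point above, sketched with a hypothetical `render_column` helper: the callee appends into a caller-owned buffer instead of allocating a fresh `String` per call, so the allocation cost is both visible and amortizable at the call site:

```rust
/// Appends one rendered column definition to a caller-provided buffer.
/// The function itself performs no allocation of its own.
fn render_column(name: &str, ty: &str, out: &mut String) {
    out.push_str(name);
    out.push(' ');
    out.push_str(ty);
    out.push('\n');
}
```

A caller rendering many columns allocates once and reuses the buffer:

```rust
let mut buf = String::new(); // one allocation, reused across calls
render_column("id", "int", &mut buf);
render_column("name", "varchar", &mut buf);
// buf now holds "id int\nname varchar\n"
```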
Testing
- Tests are defined as input/output pairs where the framework compares actual results against expected output, and expectations are updated when behavior changes intentionally [3].
- Define tests via data (input and output) rather than API calls, which makes tests survive refactoring [1].
- Each test simulates a complete environment (multi-file, multi-crate) in memory without shared state [3].
- Failing tests can auto-update their expected output via a flag, which eliminates manual maintenance [3].
- Minimize test cases to the smallest input that reproduces the behavior, resulting in less noise and faster debugging [1].
- Place tests near their implementation to enable easy discovery during development [3].
- Never ignore test failures; instead, assert the incorrect behavior with a FIXME comment to keep the failure visible [1].
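The auto-updating expectation pattern can be sketched as follows. Real frameworks (like rust-analyzer’s expect-test) store expectations in files and toggle update mode with an environment flag such as `UPDATE_EXPECT=1`; this sketch uses a `bool` and an in-memory `String` for brevity:

```rust
/// Compares actual output against the stored expectation, or rewrites
/// the expectation when the behavior change is intentional.
fn check(actual: &str, expected: &mut String, update: bool) {
    if update {
        // Intentional change: record the new output as the expectation.
        expected.clear();
        expected.push_str(actual);
    } else {
        assert_eq!(actual, expected.as_str(), "output diverged from expectation");
    }
}
```

This is what makes snapshot tests cheap to maintain: when a change is deliberate, one flag rewrites every affected expectation instead of each being edited by hand.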
Documentation
Enforce full-sentence comments (starting with a capital letter and ending with a period) to psychologically shift the author from “note-taking” (describing what) to “explanation” (describing why and providing context) [1].
References
1. rust-analyzer high-level architecture and conventions
2. rust-analyzer syntax tree and parser
3. rust-analyzer testing
Glossary
Technical terms and concepts used throughout the dboxide project documentation.
API Boundary
An explicit interface layer that separates different components of a system. By making boundaries opaque (hiding implementation details), you can refactor internal code without breaking external consumers. Example: the separation between the hir analysis layer and the ide layer in rust-analyzer.
AST
Abstract Syntax Tree. A typed, high-level representation of source code structure that provides ergonomic access to syntactic constructs. In rust-analyzer, AST nodes like FnDef and ParamList are auto-generated wrappers around the lower-level SyntaxNode, offering convenient methods for traversing specific language constructs.
Block Heuristic
An incremental parsing optimization that reparses only the smallest {} block containing an edit. This works because parsers can maintain structurally balanced braces even in broken code by inserting implicit closing braces and wrapping errors. Often not worth the complexity since full reparse is typically fast enough.
Boolean Blindness
A situation where validation returns a simple boolean (true/false), causing the system to lose the proof of validity after the check. Instead of returning bool, prefer parsing into types that guarantee validity at the type level, preserving the validation evidence throughout the program.
Actually, when writing this, I figured this is related to proof theory’s distinction between Admissibility and Derivability (I learned about these in PFPL - see my type-theory repo). Returning bool is like admissibility (asserting “this is valid” without proof), while returning a validated type is like derivability (carrying constructive evidence). When you return bool, you lose the derivation - just like admissibility is a weaker assertion than derivability because it lacks the constructive proof tree.
DST
Dynamically Sized Type. A Rust type whose size is not known at compile time. Used in rust-analyzer’s GreenNode implementation to store node data and all children in a single heap allocation: [RefCount | Kind | TextLen | N_Children | Child_1 | ... | Child_N].
Event-Based Parser
A parser architecture where the parser emits abstract events (like start_node, token, finish_node) through trait interfaces (TokenSource for input, TreeSink for output) rather than directly constructing tree structures. This decouples the parsing algorithm from tree representation, enabling the same parser to work with different input sources and output formats.
GreenNode
The immutable storage layer in the Red-Green tree architecture. Stores the actual syntax tree data in a space-efficient, persistent structure that can be shared across threads via Arc. Contains only kind: SyntaxKind, text_len: usize, and children, with no position information or parent pointers. Optimizations include DST layout, tagged pointers, and token interning.
HIR
High-level Intermediate Representation. A fully resolved, semantic view of the code that abstracts away syntactic details. In rust-analyzer, the hir crate provides a static API for querying resolved types, names, and semantic relationships, built on top of the syntax tree and incremental analysis passes.
Incremental Parsing
A technique that reparses only the portions of a file affected by edits rather than reparsing the entire file. Enabled by immutable tree structures (GreenNode) that can be efficiently patched by swapping subtree pointers. Often not worth the implementation complexity as modern parsers are fast enough for full reparse.
Interning
An optimization technique that stores only one copy of duplicate data. In rust-analyzer, identical tokens (like multiple occurrences of 1 in 1 + 1) share the same allocated token, reducing memory usage.
Lossless Syntax Tree
A tree representation that preserves everything from the source code including whitespace, comments, and invalid tokens. Enables perfect reconstruction of the original source text by concatenating token text. Invalid input gets wrapped in ERROR nodes rather than being discarded.
LSP
Language Server Protocol. A standardized protocol for communication between editors and language servers, enabling features like auto-completion, go-to-definition, and diagnostics. Allows one language server implementation to work with multiple editors.
Monomorphization
The process where Rust’s compiler generates separate copies of generic functions for each concrete type used. Can lead to “monomorphization bloat” where excessive use of generics at system boundaries causes long compile times. Mitigated by restricting generics or using dynamic dispatch at boundaries.
Panic Mode
An error recovery technique where the parser skips tokens until it finds a synchronization point (like } or ;) after encountering a syntax error. Allows the parser to continue and find additional errors rather than stopping at the first one.
Pratt Parsing
An elegant technique for parsing expressions with operator precedence without building complex precedence tables. Handles infix, prefix, and postfix operators uniformly. Used by rust-analyzer for expression parsing alongside recursive descent for other constructs.
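A minimal sketch of the technique over single-digit integers, `+`, `*`, and prefix `-`, evaluating while it parses. Each operator carries a binding power, and the loop consumes operators only while their power meets the current minimum, which yields correct precedence without one grammar rule per level:

```rust
/// Evaluates an expression like "1 + 2 * 3" using Pratt parsing.
fn eval(input: &str) -> i64 {
    let tokens: Vec<char> = input.chars().filter(|c| !c.is_whitespace()).collect();
    let mut pos = 0;
    expr(&tokens, &mut pos, 0)
}

fn expr(tokens: &[char], pos: &mut usize, min_bp: u8) -> i64 {
    // Prefix position: a digit or unary minus.
    let mut lhs = match tokens[*pos] {
        '-' => {
            *pos += 1;
            -expr(tokens, pos, 5) // prefix minus binds tighter than + and *
        }
        c => {
            *pos += 1;
            c.to_digit(10).expect("expected a digit") as i64
        }
    };
    // Infix loop: continue while the next operator binds tightly enough.
    while *pos < tokens.len() {
        let (bp, op) = match tokens[*pos] {
            '+' => (1, '+'),
            '*' => (3, '*'),
            _ => break,
        };
        if bp < min_bp {
            break; // operator belongs to an enclosing call
        }
        *pos += 1;
        let rhs = expr(tokens, pos, bp + 1); // +1 makes operators left-associative
        lhs = match op {
            '+' => lhs + rhs,
            _ => lhs * rhs,
        };
    }
    lhs
}
```

For example, `eval("1 + 2 * 3")` returns `7`: the `*` (binding power 3) captures `2 * 3` before control returns to the weaker `+` (binding power 1).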
Red-Green Tree
A tree architecture that separates immutable storage (Green layer) from navigation and position information (Red layer). GreenNodes store actual data and can be shared across threads. SyntaxNodes (Red layer) are lightweight cursors that add parent pointers and compute positions on-demand. This separation enables efficient tree edits and incremental parsing.
Recursive Descent Parser
A top-down parsing technique where each grammar rule maps to a function that calls other parsing functions recursively. Makes the parser code easy to understand and debug since it directly mirrors the grammar structure. Used by rust-analyzer for most syntactic constructs.
Resilient Parsing
A parsing approach that always produces a tree even from syntactically invalid code. Employs error recovery techniques (panic mode, implicit token insertion, early termination) to maintain tree structure. Invalid sections get wrapped in ERROR nodes. Enables IDE features like completion to work even in broken code.
Salsa
An incremental computation framework that enables on-demand, cached evaluation of derived data. When inputs change, Salsa tracks dependencies and recomputes only what’s affected. Used in rust-analyzer’s base-db crate to make analysis incremental and efficient.
Snapshot Testing
A testing approach where tests are defined as input/output pairs. The framework compares actual output against expected output stored in test files. When behavior changes intentionally, a flag (like UPDATE_EXPECT=1) auto-updates expectations, eliminating manual maintenance.
TokenSource
A trait interface that defines how parsers read input tokens. Provides methods like current(), lookahead_nth(), and bump(). Abstracts over different token sources (source files, macro expansions, synthetic tokens) so the same parser can handle all inputs.
Tree-Agnostic Parser
A parser that emits abstract parsing events without knowledge of the final tree structure. By working through trait interfaces (TokenSource/TreeSink), the parser remains decoupled from tree representation, enabling the same parsing logic to produce different output formats.
TreeSink
A trait interface that defines how parsers write output. Receives events like start_node(), token(), finish_node(), and error(). The implementation decides how to build the actual tree structure. This abstraction enables different tree formats from the same parser.
Trivia
Non-semantic tokens like whitespace, comments, and formatting characters. Different parsers handle trivia differently: explicit nodes (trivia as sibling nodes), attached trivia (stored in leading/trailing properties of tokens), or linked lists (separate doubly-linked list alongside the tree). rust-analyzer uses explicit nodes for uniformity and losslessness.