Introduction
This repository, dboxide, is a rewrite of a DBML parser. It is a learning project to explore concepts in language theory, inspired by the work in the type-theory repository. The project builds on lessons learned from a previous parser, @dbml/parse.
The primary motivation for this rewrite is to address the shortcomings of previous versions and explore more advanced compiler design concepts. The original PEG.js parser was slow and lacked resilience, making it difficult to build a good developer experience (e.g., in a language server). The first rewrite, @dbml/parse, brought performance improvements and basic language services but suffered from design flaws that made it fragile and hard to maintain.
This project is not just about parsing DBML. It’s a research ground for:
- Compiler infrastructure: Investigating best practices for lexing, parsing, and representing code in a way that supports incremental updates and rich analysis.
- Developer experience: Building a foundation for a high-quality language server with features like precise error reporting, autocompletion, and code navigation.
- Modern techniques: Applying query-based (on-demand) compilation, inspired by Salsa, to create a scalable and efficient analysis engine.
The goal is to create a parser that is not only fast and correct but also serves as a solid foundation for a new generation of DBML tooling.
Goals
This document outlines the primary goals for dboxide. These goals are divided into two main categories: Project Goals, which focus on the tangible outcomes of the project, and Learning Goals, which focus on the practical techniques and concepts to be explored during development.
Project Goals
These goals are about building a high-quality, feature-rich, and performant DBML parser and associated tooling.
- Resilient and high-quality parser: Create a parser that can gracefully handle syntax errors, provide meaningful error messages, and produce a lossless syntax tree. This is crucial for a good developer experience.
- Language server implementation: Build a functional language server for DBML that provides features like:
- Autocompletion
- Go to definition
- Find references
- Real-time diagnostics
- Performance: The parser should be significantly faster than previous implementations, leveraging the performance benefits of Rust.
- SQL dialect awareness: The parser should be able to understand and handle different SQL dialects, making it more versatile.
- Modularity and extensibility: Design the parser and its components in a way that is easy to maintain, extend, and reuse.
Learning Goals
This project is also an opportunity to learn and apply modern compiler construction techniques. The focus is on understanding the “how” and “why” behind these techniques.
- Lexing and parsing techniques:
- Token representation: Investigate the ideal way to represent syntax tokens, including their position, kind, and value.
- Lossless syntax trees: Learn how to build and work with lossless syntax trees that preserve all source information, including whitespace and comments.
- Resilient parsing: Master techniques for error recovery and resilience in parsers.
- Incremental parsing and red-green trees: Understand and implement incremental parsing using red-green trees to efficiently re-parse only the changed parts of a file.
- Query-based compilation:
- Salsa framework: Learn how to use the `salsa` framework to build a query-based compiler architecture.
- Incremental computation: Understand how query-based systems enable incremental computation, where only the necessary computations are re-run when the input changes.
- Parallelism: Explore opportunities for parallelizing the compilation process to improve performance.
- Type-driven validation: Learn to apply the “Parse, don’t validate” principle to create a more robust and type-safe parser.
- Effective testing strategies: Learn how to write effective and maintainable unit and integration tests for a compiler.
Research
This section documents the research done for the dboxide project. It is divided into two main parts:
- Research Plan: This page outlines the research plan, including the list of topics to be investigated.
- Resources: This page documents the knowledge learnt from various resources.
- Glossary: This page contains brief explanations of the concepts and techniques researched for the project.
Research Plan
This page outlines the research plan for dboxide. The goal is to investigate and understand the following concepts and techniques, which are crucial for building a modern and efficient parser and language server.
Research Topics
- Lexing & Token representation:
- Ideal representation of a syntax token (source offset/pointer, kind, processed value).
- Handling of non-ASCII characters (UTF-8).
- Handling of trivia and error tokens in a lossless syntax tree.
- Handling of multi-word and unreserved keywords.
- Handling of ambiguous tokens (e.g., `<` as a less-than sign or a generic bracket).
- Efficient storage and computation of token positions for incremental parsing.
- On-demand lexing vs lexing all at once.
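As a starting point for these questions, here is a minimal sketch (all names hypothetical) of one candidate token representation: tokens store only kind and byte length, and absolute offsets are derived on demand, so an edit does not invalidate stored positions of unrelated tokens.

```rust
// Hypothetical sketch: a token stores only its kind and byte length;
// absolute offsets are recomputed on demand, which helps incremental reuse.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind {
    Ident,
    Whitespace, // trivia is kept, so the tree can be lossless
    Error,      // unrecognized input becomes an error token, not a failure
}

#[derive(Debug, Clone, Copy)]
struct Token {
    kind: TokenKind,
    len: u32, // length in bytes; a UTF-8-aware lexer must keep char boundaries
}

/// Recompute the absolute byte offset of token `i` on demand.
fn offset_of(tokens: &[Token], i: usize) -> u32 {
    tokens[..i].iter().map(|t| t.len).sum()
}

fn main() {
    let tokens = [
        Token { kind: TokenKind::Ident, len: 5 },      // "Table"
        Token { kind: TokenKind::Whitespace, len: 1 }, // " "
        Token { kind: TokenKind::Ident, len: 3 },      // "abc"
    ];
    assert_eq!(offset_of(&tokens, 0), 0);
    assert_eq!(offset_of(&tokens, 2), 6);
}
```

The trade-off to evaluate: recomputing offsets costs time, while storing them costs invalidation work on every edit.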
- String interning:
- What is string interning and why is it useful in a compiler?
- Where and how is it typically applied in a compilation pipeline (e.g., during lexing, parsing, or semantic analysis)?
- How can it be implemented efficiently?
- What are the trade-offs of using string interning?
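To make these questions concrete, here is a minimal sketch of a string interner, assuming a simple `Rc`-based design (illustrative only, not how any particular compiler implements it): identical strings are stored once, and equality becomes a pointer comparison.

```rust
use std::collections::HashMap;
use std::rc::Rc;

// Minimal sketch of a string interner (names are hypothetical).
#[derive(Default)]
struct Interner {
    map: HashMap<Rc<str>, ()>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> Rc<str> {
        // Reuse the existing allocation if this string was seen before.
        if let Some((k, _)) = self.map.get_key_value(s) {
            return Rc::clone(k);
        }
        let k: Rc<str> = Rc::from(s);
        self.map.insert(Rc::clone(&k), ());
        k
    }
}

fn main() {
    let mut interner = Interner::default();
    let a = interner.intern("users");
    let b = interner.intern("users");
    // Same allocation: equality is a pointer check, not a byte scan.
    assert!(Rc::ptr_eq(&a, &b));
}
```

The obvious trade-off visible even here: interned strings live as long as the interner, so memory is traded for fast equality and deduplication.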
- Parsing & Syntax tree:
- Good representation for a lossless syntax tree.
- Storing error nodes and partial nodes in the lossless syntax tree.
- Resilient parsing and error-recovery techniques.
- Designing good error messages.
- Incremental parsing & Red-green trees:
- What are red-green trees and how do they work?
- How can they be used to implement incremental parsing?
- What are the performance implications of using red-green trees?
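A minimal illustration of the red-green split (names hypothetical, heavily simplified compared to real implementations like `rowan`): green nodes are immutable, position-independent, and shared; red nodes are materialized on demand and know their absolute offset.

```rust
use std::rc::Rc;

// Green layer: immutable, position-independent, freely shared between trees.
#[derive(Debug)]
enum Green {
    Token { text: String },
    Node { children: Vec<Rc<Green>> },
}

impl Green {
    fn text_len(&self) -> usize {
        match self {
            Green::Token { text } => text.len(),
            Green::Node { children } => children.iter().map(|c| c.text_len()).sum(),
        }
    }
}

// Red layer: built lazily while descending; carries absolute offsets.
struct Red {
    green: Rc<Green>,
    offset: usize,
}

impl Red {
    fn children(&self) -> Vec<Red> {
        let mut out = Vec::new();
        if let Green::Node { children } = &*self.green {
            let mut offset = self.offset;
            for c in children {
                out.push(Red { green: Rc::clone(c), offset });
                offset += c.text_len();
            }
        }
        out
    }
}

fn main() {
    let tok = |s: &str| Rc::new(Green::Token { text: s.to_string() });
    let tree = Rc::new(Green::Node {
        children: vec![tok("Table"), tok(" "), tok("users")],
    });
    let root = Red { green: tree, offset: 0 };
    let offsets: Vec<usize> = root.children().iter().map(|r| r.offset).collect();
    assert_eq!(offsets, vec![0, 5, 6]);
    // Because green nodes carry no positions, an edit only rebuilds the spine
    // from the changed node to the root; untouched siblings are shared as-is.
}
```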
- Program analysis & Query-based compiler:
- Utilizing a query-based architecture (like `salsa`) for program analysis, lexing, and parsing.
- Module system features, including module resolution.
- Name resolution and related problems.
- Other topics:
- Language server implementation
- Parallelism in compilation
- SQL dialect awareness
Resources
This section documents the knowledge learnt from various sources during the implementation of dboxide:
- rust-analyzer: An implementation of LSP for the Rust programming language - a good reference model.
- libsyntax: The Rust compiler’s crate that documents some design notes on tree structure and parser.
- salsa: An incremental computation framework for building responsive compilers and language servers.
rust-analyzer
This section contains resources and notes about the rust-analyzer project.
rust-analyzer is an LSP implementation for Rust, used by many code editors.
Official site: Link.
High-Level Architecture & Conventions
This section describes the architecture of rust-analyzer.
Official site: Link.
- rust-analyzer input/output:
  - Input (Ground state): Source code data from the client. Everything is kept in memory.
    - Mapping from file paths to their contents.
    - Project structure metadata represented as a crate graph (crate roots, `cfg` flags, crate dependencies).
  - Output (Derived state): A “structured semantic model” of the code.
    - A representation of the project that is fully resolved - type-wise and reference-wise.
- Optimizations:
  - Incremental:
    - Input can be a delta of changes.
    - Output can be a fresh code model.
  - Lazy: The output is computed on-demand.
Parser - parser Crate
- A hand-written recursive descent tree-agnostic parser.
- Output: A sequence of events like “start node” and “finish node”, based on Kotlin’s parser, which can be used to learn about dealing with syntax errors and incomplete input.
- Some traits (`TreeSink` and `TokenSource`) are used to bridge the tree-agnostic parser with `rowan` trees.
Architecture Invariant - Tree-Agnostic Parser
- The parser functions as a pure transformer, converting one flat stream of events into another.
- Dual independence: The parser is not locked into:
- A specific tree structure (output format).
- A specific token representation (input format).
- Benefits:
- Token independence allows using the same logic to parse:
- Standard source code (text -> tokens).
- Macro expansion (token trees -> tokens).
- Synthetic code generated programmatically.
- Tree independence allows easily varying the syntax tree implementation + light-parsing.
- Avoid allocation of tree nodes.
- For tasks like “find all function names in this 10k line file,” the parser can simply emit “DefineName” events. A listener catches those names and ignores everything else, finishing the task in a fraction of the time.
- Token independence allows using the same logic to parse:
Architecture Invariant - Infallible Parser
- Parsing never fails.
- Parser returns `(T, Vec<Error>)`.
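The shape of such an infallible API can be sketched as follows (the toy grammar and all names here are invented for illustration; this is not rust-analyzer’s actual code): parsing always yields a tree plus a list of errors, never an `Err`.

```rust
// Sketch of the "infallible parser" shape: malformed input becomes an
// error node in the tree, and the function never fails outright.
#[derive(Debug)]
enum Node {
    Table(String),
    Error(String),
}

fn parse(input: &str) -> (Vec<Node>, Vec<String>) {
    let mut nodes = Vec::new();
    let mut errors = Vec::new();
    for line in input.lines() {
        match line.strip_prefix("Table ") {
            Some(name) => nodes.push(Node::Table(name.to_string())),
            None => {
                // Recover: record the bad input and keep going.
                nodes.push(Node::Error(line.to_string()));
                errors.push(format!("expected a table declaration, got `{line}`"));
            }
        }
    }
    (nodes, errors)
}

fn main() {
    let (nodes, errors) = parse("Table users\n???");
    assert_eq!(nodes.len(), 2); // the tree is complete even with bad input
    assert_eq!(errors.len(), 1);
}
```

The caller decides what to do with the errors; the tree is always usable for features like highlighting.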
Syntax Tree Structure & Parser - syntax Crate
Based on libsyntax-2.0.
- `rowan`: The underlying library used to construct the raw, untyped syntax trees (green/red trees).
- `ast` internal crate: Provides a type-safe API layer on top of the raw `rowan` tree.
- `ungrammar` internal crate: A grammar description format used to automatically generate the `syntax_kinds` and `ast` modules.
Architecture Invariant - syntax Crate as an API Boundary
- The `syntax` crate knows nothing about `salsa` and LSP. It’s an API boundary.
- Benefits:
- Allows it to be used for lightweight tooling without needing a full build or semantic analysis.
Architecture Invariant - Syntax Tree as a Value Type
- The syntax tree is self-contained, defined solely by its contents without relying on global context (like interners).
- Pure syntax: Unlike traditional compiler trees, it strictly excludes semantic information (such as type inference data).
- Benefits:
  - IDE optimization: Critical for tools like rust-analyzer, where assists and refactors require frequent tree modifications.
  - Simplified transformation: Keeping the tree “dumb” (purely structural) allows for easy code manipulation without the complexity of managing semantic state during edits.
Architecture Invariant - Syntax Tree per File
- A syntax tree is built for a single file.
- Benefits: Enable parallel parsing of all files.
Architectural Invariant - Incomplete Syntax Tree
- Syntax trees are designed to tolerate incomplete or invalid code (common during live editing).
- AST accessor methods return `Option` types to safely handle missing data.
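A small illustration of the same convention (the types here are invented for illustration): an accessor over a partially typed declaration returns `Option` instead of panicking, so features keep working on incomplete code.

```rust
// Sketch: AST accessors over an incomplete tree return Option instead of
// panicking. `FnDef` is a made-up stand-in for a real AST node type.
struct FnDef {
    name: Option<String>, // the user may still be typing `fn |`
}

fn completion_label(def: &FnDef) -> String {
    match def.name.as_deref() {
        Some(name) => format!("fn {name}"),
        None => "fn <missing name>".to_string(), // degrade gracefully
    }
}

fn main() {
    assert_eq!(completion_label(&FnDef { name: None }), "fn <missing name>");
    assert_eq!(
        completion_label(&FnDef { name: Some("run".into()) }),
        "fn run"
    );
}
```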
Query database - base-db Crate
- `salsa`: A crate used for incremental and on-demand computation.
  - `salsa` resembles a key-value store.
  - `salsa` can compute derived values with specified functions.
- Define most input queries.
Architecture Invariant - File-System Agnostic
- Nothing is known about the file system & file paths.
- `FileId`: An opaque type that represents a file.
Analyzer (Macro expansion, Name resolution, Type inference) - hir-xxx Crates
- `hir-expand`: Macro expansion.
- `hir-def`: Name resolution.
- `hir_ty`: Type inference. (Why does this one use an underscore?)
- Define various IRs of the core.
- `hir-xxx` is ECS-based (Entity-Component-System):
  - ECS architecture: Instead of rich objects, compiler entities (like functions or structs) are represented as raw integer IDs (handles), similar to game entities.
  - Database-driven: You cannot access data directly from an ID. You must query the central Salsa database (e.g., `db.function_data(id)`), which stores the actual content in “component” arrays.
- Zero abstraction: The code is intentionally explicit about database access. It avoids helper methods to keep dependency tracking transparent and overhead low.
- These crates “lower” (translate) Rust syntax into logic predicates, allowing the `chalk` engine to solve complex trait bounds and type inference.
Architecture Invariant - Incremental
- The separation of “Identity” (ID) from “Data” in ECS allows rust-analyzer to update only changed data without breaking references to the ID elsewhere, enabling millisecond-level updates.
High-Level IR - hir Crate
- An API boundary for consuming rust-analyzer as a library.
- `hir` acts as the high-level API boundary, wrapping internal raw IDs (ECS-style) into semantic structs (e.g., `Function`) to provide a familiar object-oriented interface for library consumers.
- “Thin handle” pattern: These structs hold no data (only the ID) and require the `db` to be passed into every method call (e.g., `func.name(db)`), effectively bridging the stateless handles with the stateful Salsa database.
- Analogy: Internally, the ECS-style code is like SQL & `hir` is like an ORM.
  - Syntax inversion (object-oriented vs functional):
    - In pure ECS, logic lives in external systems (e.g., `db.function_visibility(id)`).
    - The `hir` crate inverts this to an object-oriented style (`func.visibility(db)`), making the API discoverable via IDE autocomplete.
  - Encapsulated “joins”:
    - Pure ECS requires you to manually query multiple tables to piece together information (e.g., get parent module ID → look up module data → find visibility).
    - The `hir` crate abstracts these complex multi-step database lookups into single, coherent methods.
  - Semantic types:
    - ECS deals with efficient storage (raw `u32` IDs).
    - `hir` deals with high-level meaning, exposing semantic types (like struct `Type`) rather than implementation details (like struct `TypeId`).
Architecture Invariant - Inert Data Structure
- `hir` presents a fully resolved, inert view of the code, abstracting away the dynamic computations occurring in internal crates.
  - “Inert” here is relative to the `db` object.
- Syntax-to-HIR bridge: It manages the complex one-to-many mapping between raw syntax and semantic definitions (via the `Semantics` type).
- The “Uber-IDE” pattern: To resolve a specific syntax node to an HIR entity (essential for “Go to Definition”), it employs a recursive strategy used by Roslyn and Kotlin: it resolves the syntax parent to an HIR owner, then queries that owner’s children to re-identify the target node.
IDE - ide-xxx Crates
- Top-level API boundary: The ultimate entry point for external clients (LSP servers, text editors) to interact with rust-analyzer.
- `ide` consumes the semantic model provided by the `hir` crate to implement concrete user features like code completion, goto definition, and refactoring.
- `ide` is protocol-agnostic, designed to be used via LSP, custom protocols (like FlatBuffers), or directly as a library within an editor.
- This crate introduces the concept of change over time:
  - `AnalysisHost`: The mutable state container where you `apply_change`.
  - `Analysis`: An immutable, transactional snapshot of the state used for querying.
- Modular architecture:
  - `ide`: The public facade and home for smaller features.
  - `ide-db`: Shared infrastructure (e.g., reference search).
  - `ide-xxx`: Isolated crates for major features (completion, diagnostics, assists, SSR).
Architecture Invariant - View Layer
- View/ViewModel layer: `ide` acts as the “View” (MVC) or “ViewModel” (MVVM), translating complex compiler data into simple, editor-friendly terms (offsets, text labels) rather than internal definitions or syntax trees.
- The API is built with POD types. All inputs and outputs are conceptually serializable (no complex object graphs or HIR types exposed).
- The boundary is explicitly drawn at the “UI” level, following the philosophy popularized by the Language Server Protocol. It talks in the language of the text editor, not the language of the compiler.
rust-analyzer Crate
- Define the binary for the language server -> The entry point.
- It acts like the network/protocol adapter that connects the pure logic of the `ide` crate to the outside world -> functional core, imperative shell.
Architecture Invariant - LSP & JSON Awareness
- rust-analyzer is the only place where LSP types and JSON serialization exist.
- Lower crates (`ide`, `hir`) remain pure and protocol-agnostic. They are forbidden from deriving `Serialize` or `Deserialize` for LSP purposes.
- rust-analyzer maintains its own set of serializable types. It manually converts the `ide` crate’s Rust-native data structures (like `TextRange`) into LSP’s wire-format structures (like `Range` with line/character) before sending them over the wire.
Architecture Invariant - Protocol-Wise Statelessness
- The server is stateless, in the sense that it doesn’t know about the previous requests.
Utilities - stdx Crate
- rust-analyzer avoids small helper crates.
- `stdx` is the crate to store all small reusable utilities.
Macro Crates
- Core abstraction (`tt`): Macros are defined purely as `TokenTree` → `TokenTree` transforms, isolated from other compiler parts. The `tt` crate defines this structure (single tokens or delimited sequences).
- Declarative macros (`mbe`): The `mbe` crate implements “Macros By Example” (`macro_rules!`). It handles parsing, expansion, and the translation between the IDE’s syntax trees and the raw token trees.
- Procedural macros: Proc-macros run in a separate process to isolate the IDE from user code crashes.
  - Server (`proc-macro-srv`): Loads the dynamic libraries (built by Cargo) and executes the macros.
  - Client (`proc-macro-api`): Communicates with the server, sending/receiving token trees.
Architecture Invariant - Isolation
- Because arbitrary macro code can panic or segfault (crashing the editor), rust-analyzer executes them in a separate process. This allows the main IDE to survive fatal errors and recover gracefully.
- salsa’s incremental system assumes all functions are pure (deterministic). Since proc-macros can be non-deterministic (e.g., reading external files or random numbers), they violate this core assumption and require special handling to prevent database corruption or infinite invalidation loops.
Virtual File System - vfs-xxx and paths Crates
- Virtual file system (VFS): These crates provide an abstraction layer that generates consistent snapshots of the file system, insulating the compiler from raw, messy OS paths.
- The architecture does not assume a single unified file system. A single rust-analyzer process can serve multiple remote machines simultaneously, meaning the same path string could exist on two different machines and refer to different content.
- “Witness” API: To resolve this ambiguity, path APIs generally require a “file system witness” (an existing anchor path) to identify which specific file system context the operation targets.
Interning - intern Crate
- Use `Arc` (atomic reference counting) to ensure identical data (like strings or paths) is stored only once in memory.
- Optimized for “value types” that are defined by their content (e.g., `std::vec::Vec`), rather than “entities” defined by an ID (e.g., `Function #42`).
- db-independent: Unlike salsa’s integer IDs, `Interned<T>` owns its data, allowing access and inspection without needing a reference to the compiler database (`db`).
- Interning enables instant equality checks by comparing memory pointers instead of scanning content, which is critical for frequently compared items like file paths.
- Interning serves as a lower-level optimization layer for static, immutable data that doesn’t require the full overhead of incremental dependency tracking.
Architectural Policies
Stability Guarantees
- rust-analyzer avoids new stability guarantees to move fast.
- The internal `ide` API is explicitly unstable.
- Stability is only guaranteed at the LSP level (managed by the protocol) and at the input level (Rust language/Cargo).
- De-facto stability: `rust-project.json` became stable implicitly by virtue of having users — a lesson to explicitly mark APIs as unstable/opt-in before release.
Code Generation
- The APIs for syntax trees (`syntax::ast`) and manual sections (features, assists, config) are generated automatically.
- To simplify builds, rust-analyzer does not use itself for codegen. It uses `syn` and manual string parsing instead.
Cancellation (Concurrency)
- The problem: If the user types while the IDE is computing (e.g., highlighting), the result is immediately stale.
- The solution: The salsa database maintains a global revision counter.
  - When input changes, the counter is bumped.
  - Old threads checking the counter notice the mismatch and panic with a special `Canceled` token.
  - The `ide` boundary catches this panic and converts it into a `Result<T, Canceled>`.
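The mechanism can be sketched with the standard library alone (a deliberate simplification of what salsa actually does; all names are hypothetical): a long computation periodically compares its captured revision against a global counter and unwinds when an edit has bumped it, and the boundary converts the unwind into a `Result`.

```rust
use std::panic;
use std::sync::atomic::{AtomicU64, Ordering};

// Global revision counter: bumped on every input change.
static REVISION: AtomicU64 = AtomicU64::new(0);

struct Canceled;

fn check_canceled(started_at: u64) {
    // Cheap check sprinkled through long computations.
    if REVISION.load(Ordering::SeqCst) != started_at {
        panic::panic_any(Canceled);
    }
}

fn highlight(started_at: u64) -> Vec<&'static str> {
    let mut out = Vec::new();
    for token in ["Table", "users", "{", "}"] {
        check_canceled(started_at);
        out.push(token);
    }
    out
}

fn main() {
    panic::set_hook(Box::new(|_| {})); // keep the cancellation unwind quiet
    let started_at = REVISION.load(Ordering::SeqCst);
    REVISION.fetch_add(1, Ordering::SeqCst); // simulate the user typing
    // The boundary catches the panic and turns it into a Result.
    let result: Result<Vec<&str>, Canceled> =
        panic::catch_unwind(|| highlight(started_at)).map_err(|_| Canceled);
    assert!(result.is_err()); // stale computation was canceled, not completed
}
```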
Testing Strategy
- Tests are concentrated on three system boundaries:
  - Outer (`rust-analyzer` crate): “Heavy” integration tests via LSP/stdio. Validates the protocol but is slow (reads real files).
  - Middle (`ide` crate): The most important layer. Tests `AnalysisHost` (simulating an editor) against expectations.
  - Inner (`hir` crate): Tests semantic models using rich types and snapshot testing (via the `expect` crate).
Key Testing Invariants
- Data-driven: Tests use string fixtures (representing multiple files) rather than calling API setup functions manually. This allows significant API refactorings.
- No `libstd`: Tests do not link to `libstd`/`libcore` to ensure speed; all necessary code is defined within the test fixture.
Error Handling
- No IO in core: Internal crates (`ide`, `hir`) are pure and never fail (no `Result`). They return partial data plus errors: `(T, Vec<Error>)`.
- Panic resilience: Since bugs are inevitable, every LSP request is wrapped in `catch_unwind` so a crash in one feature doesn’t kill the server.
- Macros: Uses `always!` and `never!` macros to handle impossible states gracefully.
Observability
- Profiling: Includes a custom low-overhead hierarchical profiler (`hprof`) enabled via env vars (`RA_PROFILE`).
Serialization
- The trap of ease: While `#[derive(Serialize)]` is easy to add, it creates rigid IPC boundaries (backward-compatibility contracts) that are extremely difficult to change later.
- To strictly preserve internal flexibility, types in core crates like `ide` and `base_db` are not serializable by design.
- Serialization is forced to the “edge” (the client). External clients must define their own stable schemas (e.g., `rust-project.json`) and manually convert them into internal structures, isolating the core compiler from protocol versioning issues.
Coding Conventions
Official site: Link.
General Philosophy
rust-analyzer’s approach to clean code:
- Velocity over perfection: Do not block functional PRs on purely stylistic changes.
- “Show, don’t just tell”: For complex style issues, reviewers are encouraged to merge the PR and then send a follow-up cleanup PR themselves. This resolves the issue faster and teaches the author “by example” rather than through endless comment threads.
- If a review comment applies generally, update the Style Guide instead of leaving a one-off comment. This way, temporary feedback is turned into permanent documentation.
- Small, atomic cleanup PRs (even just renaming a variable) are explicitly encouraged to keep the codebase healthy.
Scale of Changes
- Generally, small & focused PRs are preferred; but sometimes, that isn’t possible.
- rust-analyzer categorizes PRs into 3 groups.
Internal Changes (Low Risk)
- Definition: Changes confined to the internals of a single component. No `pub` items are changed or added (no changes to the interfaces and no new dependencies).
- Review standard: Easy merge.
- Does the happy path work?
- Are there tests?
- Does it avoid panicking on the unhappy path?
API Expansion (Medium Risk)
- Definition: Adding new `pub` functions or types that expose internal capabilities to other crates.
- Review standard: High scrutiny.
  - The interface matters more than the implementation. It must be correct and future-proof.
- rust-analyzer’s guideline: If you start a “Type 1” change and realize you need to change the API, stop. Split the API change into its own separate, focused PR first.
Dependency Changes (High Risk)
- Definition: Introducing new connections between components via `pub use` re-exports or `Cargo.toml` dependencies.
- Review standard: Rare & dangerous.
  - These break encapsulation.
  - Even an innocent-looking `pub use` can accidentally degrade the architecture by leaking abstractions across boundaries.
Crates.io Dependencies
- Restrict external dependencies: Be extremely conservative with crates.io usage to minimize compile times and breakage risks.
- Do not use small “helper” libraries (allowed exceptions: `itertools`, `either`).
- Internalize utilities: Place general, reusable logic into the internal `stdx` crate rather than adding a dependency.
- Audit the dependency tree: Periodically review `Cargo.lock` to prune irrational transitive dependencies.
Rationale
- Compilation speed:
- Rust compiles dependencies from source.
- Avoiding bloat is the best way to keep build times and feedback loops fast.
- Transitive bloat:
- Small “helper” crates often pull in deep chains of hidden dependencies (the “iceberg” effect).
- Stability & Security:
- Reduce the risk of upstream abandonment, breaking changes, or supply chain attacks.
- Self-reliance: If logic is simple enough for a micro-crate, it belongs in the internal `stdx` library, not as an external liability.
Commit Style
- Document changes for the changelog to avoid putting the release burden on the maintainers.
- Changelogs > Clean history.
Git History
- Clean git history is strongly encouraged but not mandated.
- Use a rebase workflow. It is explicitly acceptable to rewrite history (force push) during the PR review process.
- Before the final merge, use interactive rebase to squash small “fixup” commits into logical units.
Commit Message & PR Description
- Do not `@mention` users in commit messages or PR descriptions.
  - Reason: Rebasing re-commits the message, spamming the mentioned user with duplicate notifications.
- User-centric titles: Write PR titles/descriptions describing the user benefit, not the implementation details.
  - Good: “Make goto definition work inside macros”.
  - Bad: “Use original span for `FileId`”.
- Changelog automation: You must categorize PRs so release notes can be auto-generated. Use one of two methods:
  - Title prefix: `feat:`, `fix:`, `internal:`, or `minor:` (e.g., `feat: Add hover support`).
  - Magic comment: `changelog [fix] Description here` in the PR body.
- Visuals: For UI changes, include a GIF in the description to demonstrate the feature.
Linting
- Clippy is used.
Code
Minimal Tests
- Tests must use the absolute minimum code necessary to reproduce the case. Aggressively strip “noise” from copy-pasted real-world code.
- Format declarative code densely (e.g., `enum E { A, B }` on a single line) to keep the test concise, provided it remains readable.
- Unindented raw strings: Use `r#...#` literals for multiline fixtures. Ensure the content is unindented (starts at column 0) so that character offsets in the test match the actual file positions exactly.
- Rationale:
  - Reduces visual noise and scrolling, making the actual test case immediately obvious.
  - Lowers execution time and keeps debug logs clean.
  - Unindented formatting allows you to use your editor’s “selection character count” to verify byte offsets directly, without needing to manually subtract indentation whitespace.
Marked Tests
- Marked test: A technique used to verify that a specific, often hard-to-reach line of code was actually executed during a test.
- Use `cov_mark::hit!` (in code) and `cov_mark::check!` (in tests) to create a strictly unique link between a specific edge case in the implementation and its corresponding test.
- Principle: Only maintain one mark per test and one mark per code branch.
- Never place multiple marks in a single test, and never reuse the same mark across different tests.
- Rationale: This ensures that searching for a mark immediately reveals the single canonical test responsible for verifying that specific code branch, eliminating ambiguity.
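Since `cov_mark` is an external crate, here is a stdlib-only sketch of the underlying idea (all names invented): the implementation “hits” a named mark, and the test “checks” that the mark was hit, proving the hard-to-reach branch actually ran.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Minimal re-implementation of the marked-test idea. In real code this
// would be cov_mark::hit!/cov_mark::check! rather than a hand-rolled counter.
static GOTO_DEF_IN_MACRO: AtomicUsize = AtomicUsize::new(0);

fn goto_definition(in_macro: bool) {
    if in_macro {
        // Equivalent of cov_mark::hit!(goto_def_in_macro):
        GOTO_DEF_IN_MACRO.fetch_add(1, Ordering::SeqCst);
        // ...hard-to-reach branch under test...
    }
}

fn main() {
    // Equivalent of cov_mark::check!(goto_def_in_macro):
    let before = GOTO_DEF_IN_MACRO.load(Ordering::SeqCst);
    goto_definition(true);
    let hits = GOTO_DEF_IN_MACRO.load(Ordering::SeqCst) - before;
    assert_eq!(hits, 1, "the edge-case branch must run exactly once");
}
```

Searching the codebase for the mark name then leads directly from the branch to its single canonical test.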
#[should_panic]
- `#[should_panic]` is prohibited - `None` and `Err` should be explicitly checked.
- Rationale:
  - `#[should_panic]` is a tool for library authors to make sure that the API does not fail silently when misused.
  - rust-analyzer is a long-running server, not a library. It must handle all input gracefully, even invalid input (returning `Err` or `None`). It should never intentionally crash.
  - Expected panics still dump stack traces into the test logs. This “noise” creates confusion, making it difficult to distinguish between a test verifying a panic and an actual bug causing a crash.
#[ignore]
- Never ignore tests. Explicitly assert the wrong behavior and add a `FIXME` comment.
- Rationale:
- Visibility: It ensures the test fails immediately if the bug is accidentally fixed (alerting you to update the test).
- Safety: It proves the bug causes incorrect output rather than a server crash (panic), which is a critical distinction for a long-running service.
Function Preconditions
Type Encoding
- Function’s assumptions should be expressed in types.
- The caller must be enforced to provide them.
```rust
// GOOD: the type encodes the precondition; the caller must supply an i32.
fn is_zero(n: i32) -> bool {
    n == 0
}

// BAD: the callee is forced to handle a case the caller should rule out.
fn is_zero(n: Option<i32>) -> bool {
    let n = match n {
        Some(it) => it,
        None => return false,
    };
    n == 0
}
```
- Rationale:
- The caller has more context as to why the callee’s assumptions might not hold.
- The control flow is therefore more explicit at the call site.
Parse, Don’t Validate
- Bad practice:
  - One function validates that the data is valid (validates the assumption).
  - Another function uses that data based on those assumptions.
- Good practice: Validate and immediately use the data in the same place (like `match` instead of a bare `if`).
- Reasons:
  - The bad practice is prone to decay over time: the maintainer has to memorize the assumptions and make sure that refactored checks still verify them.
  - The good practice always ensures that the assumptions hold when manipulating the data.
- Example from rust-analyzer:

```rust
// GOOD
fn main() {
    let s: &str = ...;
    if let Some(contents) = string_literal_contents(s) {
    }
}

fn string_literal_contents(s: &str) -> Option<&str> {
    if s.starts_with('"') && s.ends_with('"') {
        Some(&s[1..s.len() - 1])
    } else {
        None
    }
}

// BAD
fn main() {
    let s: &str = ...;
    if is_string_literal(s) {
        let contents = &s[1..s.len() - 1];
    }
}

fn is_string_literal(s: &str) -> bool {
    s.starts_with('"') && s.ends_with('"')
}
```
- Remarks:
- This pattern perfectly illustrates Robert Harper’s concept of “Boolean Blindness”. By reducing a complex check to a simple bool (true/false), we discard the proof of validity. The compiler sees a “true” flag, but it doesn’t see the “valid data,” forcing us to rely on faith later in the code.
- I learned this the hard way while building a DBML parser. I designed a system where the “validation phase” was separate from the “execution phase”. Because the validation step didn’t return a new, safe type (it just returned true), the execution phase had to blindly trust that the validation had run correctly.
- While systems like TypeScript use flow typing to mitigate this (by inferring types inside `if` blocks), that safety is often only local. As soon as you pass that variable into a different function, the “flow context” is lost unless explicitly redefined.
Control Flow
- Push `if`s up and `for`s down.
Assertions
- Use `stdx::never!` liberally instead of `assert!`.
  - `never!` checks a condition and logs a backtrace if it fails, but returns a `bool` instead of crashing. This allows you to write: `if stdx::never!(condition) { return; }`.
- Rationale:
  - rust-analyzer is a long-running server. A bug in a minor feature (like a specific completion) should log an error and bail out of that specific request, not crash the entire IDE session.
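A minimal sketch of a `never!`-style macro (the real one lives in `stdx`; this version is simplified and stdlib-only) shows the shape of the pattern: report the violated invariant, then let the caller bail out of the current request instead of crashing.

```rust
// Simplified never!-style macro: logs the violation, returns the bool.
macro_rules! never {
    ($cond:expr) => {{
        let failed = $cond;
        if failed {
            eprintln!("invariant violated at {}:{}", file!(), line!());
        }
        failed
    }};
}

fn label_for(item: Option<&str>) -> Option<String> {
    // A bug here should abort this request only, not the whole process.
    if never!(item.is_none()) {
        return None;
    }
    Some(format!("item: {}", item.unwrap()))
}

fn main() {
    assert_eq!(label_for(Some("users")).as_deref(), Some("item: users"));
    assert_eq!(label_for(None), None); // logged, but no crash
}
```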
Getters & Setters
- Two cases to consider:
  - No invariants: If a field can hold any value safely, just make it `pub`. Don’t write boilerplate code.
  - Invariants exist: If the data has rules (e.g., “cannot be empty”), make the field private, enforce the rule in the constructor, and provide a getter.
-
Never provide setters. If data needs to change, it should likely be done via a specific behavior method or by creating a new instance, ensuring invariants are never bypassed.
-
Getters should return borrowed data.
rust-analyzer’s example:#![allow(unused)] fn main() { struct Person { // Invariant: never empty first_name: String, middle_name: Option<String> } // GOOD impl Person { fn first_name(&self) -> &str { self.first_name.as_str() } fn middle_name(&self) -> Option<&str> { self.middle_name.as_ref() } } // BAD impl Person { fn first_name(&self) -> String { self.first_name.clone() } fn middle_name(&self) -> &Option<String> { &self.middle_name } } } -
- Rationale:
  - The APIs are internal, so (internal) breaking changes are acceptable in order to move fast:
    - A `pub` field (with no invariants) means less boilerplate, but it is a breaking change if the field is later given an invariant and has to become private.
    - An accessor can prevent breaking changes, but it implicitly promises a contract and imposes some maintenance boilerplate.
  - Privacy helps keep invariants local and prevents code rot.
  - A type that is too specific (borrowing owned types like `&String`) leaks irrelevant details (neither right nor wrong), which creates noise, and the client may accidentally rely on those irrelevant details.
Useless Types
- Prefer general types.
- If generality is not important, consistency is important.
| GOOD | BAD |
|---|---|
| `&[T]` | `&Vec<T>` |
| `&str` | `&String` |
| `Option<&T>` | `&Option<T>` |
| `&Path` | `&PathBuf` |
- Rationale:
- General types are more flexible.
- General types leak fewer irrelevant details (which the client may accidentally rely on).
Constructors
- If a `new` function accepts zero arguments, use the `Default` trait (either derived or manually implemented).
  - Rationale:
    - Less boilerplate.
    - Consistency: less cognitive load for the caller - no wondering "Should I call `new()` or `default()`?"
- Use `Vec::new` instead of `vec![]`.
  - Rationale:
    - Strength reduction.
    - Uniformity.
- Do not provide `Default` if the type has no sensible default value (many possible defaults, or defaults that encode invalid states).
  - Rationale:
    - Preserves invariants.
    - The user does not need to wonder whether the provided default matches their desired initial values.
```rust
// GOOD
#[derive(Default)] // 1. Best case: derive it automatically
struct Options {
    check_on_save: bool,
}

// GOOD (manual implementation)
struct Buffer {
    data: Vec<u8>,
}

impl Default for Buffer {
    fn default() -> Self {
        Self {
            // 2. Use Vec::new() instead of vec![] (strength reduction):
            // it is semantically lighter (function vs macro) and more uniform.
            data: Vec::new(),
        }
    }
}

// BAD
struct OptionsBad {
    check_on_save: bool,
}

impl OptionsBad {
    // 3. Avoid zero-arg new().
    // It forces users to remember "Do I call new() or default() for this type?"
    fn new() -> Self {
        Self { check_on_save: false }
    }
}
```
Functions Over Objects
- Public API: Prefer simple functions (`do_thing()`) over transient objects that exist only to execute one method (`ThingDoer::new().do()`).
- Internal logic: It is acceptable (and encouraged) to use "Context" structs inside the function to manage complex state or arguments during execution.
  - Rationale:
    - The "Iceberg" pattern: The user sees a simple function interface; the developer uses a structured object implementation behind the scenes.
    - The implementor API is not mixed with the user API.
- Middle ground: If a struct is preferred for namespacing, provide a static `do()` helper method that handles instantiation and execution in one step.
  - Rationale:
    - Reduces boilerplate for the caller.
    - Prevents implementation details (like temporary state management) from leaking into the public API.
```rust
// BAD (caller has to build and run)
ThingDoer::new(arg1, arg2).do();

// GOOD (caller just acts)
do_thing(arg1, arg2);

// ACCEPTABLE INTERNAL IMPLEMENTATION (using a struct to organize code)
pub fn do_thing(arg1: Arg1, arg2: Arg2) -> Res {
    // The struct is an implementation detail, hidden from the user
    let mut ctx = Ctx { arg1, arg2 };
    ctx.run()
}
```
Functions With Many Parameters
- Use a `Config` struct:

```rust
// BAD (call site is confusing)
annotations(db, file_id, true, false, true);

// GOOD (call site is explicit and flexible)
annotations(db, file_id, AnnotationConfig {
    binary_target: true,
    annotate_runnables: false,
    annotate_impls: true,
});
```

- Rationale:
  - The call site is clearer.
  - Encapsulating volatile parameters in a struct shields intermediate functions from breaking changes, allowing you to add new options without updating the signature of every function in the call chain.
- No `Default` for `Config`: Do not implement `Default`. Force the caller to provide explicit context.
  - Rationale: The caller knows better than the struct what the initial state should be.
- Pass configuration as an argument to the function; do not store it in the object's state.
  - Rationale: This allows the same object to handle multiple requests with different configurations dynamically.
- `Command` pattern: If a set of parameters can yield different return types (e.g., `Vec<T>` vs `Option<T>`), wrap the parameters in a "Command" struct.
  - Rationale: This avoids creating multiple top-level functions (`query_all`, `query_first`) with identical argument lists.
```rust
// GOOD (Command pattern)
// Captures the arguments once, offers multiple execution paths
pub struct Query {
    pub name: String,
    pub case_sensitive: bool,
}

impl Query {
    // Return type A
    pub fn all(self) -> Vec<Item> { ... }
    // Return type B
    pub fn first(self) -> Option<Item> { ... }
}

// BAD (parameter duplication)
// Requires repeating the arguments for every variation of the result
fn query_all(name: String, case_sensitive: bool) -> Vec<Item> { ... }
fn query_first(name: String, case_sensitive: bool) -> Option<Item> { ... }
```
- Remarks: Does the `Command` rule conflict with the rule "Do not store `Config` in state"?
  - The `Config` rule applies to long-lived services (e.g., `Database`). These should remain stateless so they can handle diverse requests without needing to be reset.
  - The `Command` rule applies to short-lived tasks (e.g., `Query`). These objects exist solely to bundle parameters for a single operation and are discarded immediately after use.
  - Relationship: The "Command" is effectively a temporary container that you pass to (or use with) the "Service".
Prefer Separate Functions Over Parameters
- Split "flag" arguments: If a function is only ever invoked with hardcoded literals (`true`/`None`), refactor it into distinct named functions (e.g., `process_fast()` vs `process_full()`) to eliminate internal branching and "false sharing" of unrelated logic.
- Rationale:
  - Functions with flag arguments often exhibit false sharing. There's often `if` branching to distinguish between the different cases:

```rust
// Caller
process_data(true);
process_data(false);

// Callee
fn process_data(is_fast: bool) {
    if is_fast {
        // Fast algorithm
    } else {
        // Accurate algorithm
    }
}
```

  - These functions seem to share logic, and merging them into one function seems to make sense. However, over time the branches diverge. Splitting the control flows into separate functions simplifies the logic and eliminates irrelevant details (for example, a maintainer may otherwise fail to see that there are no cross-dependencies between the `if` branches).
  - Extract the code that is common to the `if` branches into a shared helper instead.
- Remark: This binder-refactoring PR was a hard lesson for me, illustrating both the problem and the solution (splitting into separate classes and extracting common helpers). Although there is still plenty of room for improvement, the result is already significantly better.
Appropriate String Types
- When calling OS APIs (file systems, env vars, arguments), use `OsString` and `&OsStr`, never `String` or `&str`.
- Rust's `String` guarantees valid UTF-8. Operating systems do not:
  - Linux/Unix: Paths are arbitrary byte sequences (except for the null byte).
  - Windows: Paths are potentially ill-formed UTF-16 sequences.
- Rationale: This creates a strict type-level boundary.
  - If you hold a `String`, you know it is safe, clean text.
  - If you hold an `OsString`, you know it is "dirty" data from the outside world.
  - Using `OsString` prevents accidental panics or data corruption when encountering a file name with an invalid encoding.
- Avoid the standard `std::path::Path`; use the custom `AbsPathBuf` and `AbsPath` wrappers that guarantee the path inside is absolute.
- Rationale:
  - CWD is global mutable state.
  - If you use a relative path like `Path::new("src/main.rs")`, the OS resolves it relative to where the server binary started, not where the project is.
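A small sketch of the type-level boundary in practice (`file_name_utf8` is a hypothetical helper, not from any library): the conversion from "dirty" OS data to clean UTF-8 is explicit and fallible, instead of a hidden panic.

```rust
use std::ffi::OsStr;
use std::path::Path;

// Returns None instead of panicking when the file name is not valid UTF-8.
fn file_name_utf8(path: &Path) -> Option<&str> {
    path.file_name().and_then(OsStr::to_str)
}

fn main() {
    assert_eq!(file_name_utf8(Path::new("/tmp/data.dbml")), Some("data.dbml"));
    println!("ok");
}
```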
Premature Pessimization
Avoid Allocations
- Zero-allocation default: Prefer stack-based structures (iterators) over heap-based collections (`Vec`, `String`) unless you specifically need ownership or long-term storage.
- Lazy vs eager: `collect::<Vec<_>>()` eagerly processes the entire sequence and allocates memory immediately. Iterators are lazy and compute items only on demand.
```rust
use itertools::Itertools;

// BAD (heavy)
// 1. Allocates heap memory. 2. Processes the entire string. 3. Frees memory.
let parts: Vec<&str> = text.split(',').collect();
if parts.len() == 3 {
    process(parts[0], parts[1], parts[2]);
}

// GOOD (light)
// 1. No allocation. 2. Stops after 3 items. 3. Stores pointers on the stack.
if let Some((a, b, c)) = text.split(',').collect_tuple() {
    process(a, b, c);
}
```
Push Allocations to the Call Site
- Rationale: The cost of calling a function becomes more explicit.
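A sketch of the idea (the `render_*` names are hypothetical): the version that appends into a caller-provided buffer makes the allocation visible at the call site and lets the caller reuse the buffer across calls.

```rust
use std::fmt::Write;

// Allocation hidden inside the function:
fn render_hidden(n: i32) -> String {
    format!("item-{n}")
}

// Allocation pushed to the call site; the function only appends:
fn render_into(buf: &mut String, n: i32) {
    let _ = write!(buf, "item-{n}");
}

fn main() {
    assert_eq!(render_hidden(1), "item-1");

    let mut buf = String::new(); // the caller pays (and sees) the allocation
    for n in 0..3 {
        buf.clear(); // reuse the same heap buffer on every iteration
        render_into(&mut buf, n);
    }
    assert_eq!(buf, "item-2");
    println!("ok");
}
```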
Collection Types
- `rustc_hash`'s map and set (`FxHashMap`, `FxHashSet`) are preferred over those in `std::collections`.
- Rationale:
  - Faster hasher.
  - Slightly reduced code size if applied consistently.
Avoid Intermediate Collections
- Use an accumulator parameter (list, set, map) as the first parameter to collect values.
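A sketch of the accumulator style (the `Tree` type is hypothetical): a recursive walk fills one collection passed as the first parameter, instead of each call allocating and merging its own `Vec`.

```rust
enum Tree {
    Leaf(i32),
    Node(Vec<Tree>),
}

// The accumulator comes first; no intermediate Vecs are created or merged.
fn collect_leaves(acc: &mut Vec<i32>, t: &Tree) {
    match t {
        Tree::Leaf(x) => acc.push(*x),
        Tree::Node(children) => {
            for c in children {
                collect_leaves(acc, c);
            }
        }
    }
}

fn main() {
    let t = Tree::Node(vec![
        Tree::Leaf(1),
        Tree::Node(vec![Tree::Leaf(2), Tree::Leaf(3)]),
    ]);
    let mut acc = Vec::new();
    collect_leaves(&mut acc, &t);
    assert_eq!(acc, vec![1, 2, 3]);
    println!("ok");
}
```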
Avoid Monomorphization
- Minimize generics: Avoid heavy generic logic at crate boundaries to prevent compile-time bloat caused by monomorphization (code duplication).
- "Inner dyn" pattern: Use thin generic wrappers that immediately delegate to private, non-generic functions using dynamic dispatch (`dyn Trait`).
- Concrete types: Avoid `AsRef` polymorphism; prefer concrete types like `&Path` unless writing a widely-used library.
- Rationale: Optimize for compile speed by default; runtime performance only matters for the "hot" 20% of code, but compilation costs affect 100%.
```rust
use std::fmt::Display;

// --- 1. The wrapper (generic) ---
// This function is "monomorphized" (duplicated) for every type T.
// But since it's only one line, the duplication cost is negligible.
pub fn log_message<T: Display>(msg: T) {
    // Coerce the specific type T into a trait object (&dyn Display)
    log_message_impl(&msg);
}

// --- 2. The implementation (dynamic) ---
// This function is compiled ONLY ONCE.
// It handles the heavy lifting via dynamic dispatch (vtable).
fn log_message_impl(msg: &dyn Display) {
    // Imagine 500 lines of complex logging logic here...
    println!("Timestamp: [INFO] {}", msg);
    // ... extensive I/O, formatting, or network calls ...
}

fn main() {
    log_message("Hello"); // T is &str -> wrapper created for &str
    log_message(42);      // T is i32  -> wrapper created for i32
    log_message(3.14);    // T is f64  -> wrapper created for f64
    // Result: 3 tiny wrappers, but the heavy `log_message_impl` exists only once.
}
```
Style
Order of Imports
- Modules first: Declare modules (`mod x;`) at the top, before any imports, ordered by "suggested reading."
- Import structure:
  - Grouping: Use one `use` statement per crate (group items inside `{ ... }`).
  - Spacing: Separate different groups with blank lines.
  - Import order:
    - `std`: Standard library (`use std::...`).
    - External: Third-party and workspace crates.
    - Local: Current crate (`use crate::...`).
    - Relative: Parent/child modules (`use super::...`), though `crate::` is preferred.
- Re-exports: Place `pub use` after all imports, as they are treated as item definitions.
- Rationale: Ensures consistency, improves readability for new contributors, and highlights dependencies clearly.
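The layout above can be sketched in a single file (module names are hypothetical; the module is declared inline so the sketch compiles on its own):

```rust
// 1. Modules first, in "suggested reading" order.
mod config {
    pub struct Config(pub u16);
}

// 2. `std` imports (external and workspace crates would form the next group).
use std::collections::HashMap;

// 3. Local crate imports come after the external groups.
use crate::config::Config;

fn main() {
    let cfg = Config(8080);
    let mut map: HashMap<&str, u16> = HashMap::new();
    map.insert("port", cfg.0);
    assert_eq!(map["port"], 8080);
    println!("ok");
}
```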
Import Style
- Qualify layer types: Always qualify items from `hir` and `ast` to prevent ambiguity and clarify the architectural layer.
  - Good: `use syntax::ast; ... func: hir::Function`
  - Bad: `use hir::Function; ... func: Function`
- Trait implementations: Import the module (e.g., `std::fmt`), not the trait itself, when implementing standard traits.
  - Good: `impl fmt::Display for ...`
  - Rationale: Reduces typing and clearly distinguishes implementation from usage.
- Avoid local globs: Do not use `use MyEnum::*;` inside functions.
- Absolute paths: Prefer `use crate::foo` over relative paths like `super::` or `self::` for consistency.
- No re-exports: Avoid re-exports in non-library code to prevent multiple access paths and maintain consistency.
Order of Items
- Public API first: Always place public items (`pub` or `pub(crate)`) at the very top, before any private helpers or implementation details.
- Types before logic: Define data structures (`struct`, `enum`) before functions and `impl` blocks.
- Top-down: Order type definitions by dependency; place the "parent" container before the "child" component it contains.
- Rationale:
- Optimize for a new reader scanning top-to-bottom.
- When code is folded, the file structure should read like API documentation.
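The ordering can be sketched in one file (the `Schema`/`Table` names are hypothetical, chosen to fit a DBML parser):

```rust
// 1. Public parent type first.
pub struct Schema {
    pub tables: Vec<Table>,
}

// 2. Public child type it contains.
pub struct Table {
    pub name: String,
}

// 3. Public logic next.
pub fn parse_schema(src: &str) -> Schema {
    Schema { tables: helper(src) }
}

// 4. Private helpers last.
fn helper(src: &str) -> Vec<Table> {
    vec![Table { name: src.to_string() }]
}

fn main() {
    let s = parse_schema("users");
    assert_eq!(s.tables.len(), 1);
    assert_eq!(s.tables[0].name, "users");
    println!("ok");
}
```

Folded, the file reads top-down like API documentation: types, then entry points, then details.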
Context Parameters
- Context-first: Always pass "context" parameters (invariant data threaded through many calls) as the first arguments.
  - Rationale:
    - This creates a visual hierarchy: "setting" before "actors".
    - It mimics the `self` convention in OOP (where the context/object always comes first).
    - When scanning a function call, you immediately see where the operation is happening (the context) before seeing what is being processed (the variable data).
```rust
// BAD: the context (db) is hidden at the end.
// In a long list of arguments, you might miss which 'db' is being used.
fn transform_data(data: String, id: usize, force: bool, db: &Database) { ... }
// Usage
transform_data(raw_input, 42, true, &primary_db);

// GOOD: the context (db) is front and center.
// It establishes the environment immediately.
fn transform_data(db: &Database, data: String, id: usize, force: bool) { ... }
// Usage
transform_data(&primary_db, raw_input, 42, true);
```
- If there are multiple context parameters, bundle them into a struct and pass it as `&self`.
```rust
// BAD: parameter explosion.
// Every helper function needs to accept all three context items.
fn parse(db: &Db, cfg: &Config, cache: &Cache, text: &str) {
    validate(db, cfg, cache, text); // Tedious repetition
}

// GOOD: packed context.
struct ParseCtx<'a> {
    db: &'a Db,
    cfg: &'a Config,
    cache: &'a Cache,
}

impl<'a> ParseCtx<'a> {
    // The signature is clean. Context is implicit in `&self`.
    fn parse(&self, text: &str) {
        self.validate(text);
    }
    fn validate(&self, text: &str) { ... }
}
```
- The "dangling argument" problem:
  - Context-first works even better when the non-context parameters are lambdas (closures).
  - Rationale:
    - Rust closures often span multiple lines.
    - If the context parameter is placed after the closure, it ends up "dangling" after the closing brace `}`. This is visually confusing and looks like a syntax error or a forgotten fragment.
    - Placing the context first ensures the closure body is the last thing the eye sees, resulting in a clean termination of the statement.
```rust
// BAD: context last.
// The `&context` argument is easy to miss or looks like it belongs
// to a different statement because it follows a large code block.
apply_changes(|item| {
    // ... complex multi-line logic ...
    item.finalize()
}, &context); // <--- the "dangler"

// GOOD: context first.
// The statement starts with the context, and the closure flows naturally
// until the end of the function call.
apply_changes(&context, |item| {
    // ... complex multi-line logic ...
    item.finalize()
}); // <--- clean, standard closure syntax
```
Variable Naming
- Prioritize verbose, boring, but clear names: prefer long names that mirror the type (e.g., `global_state: GlobalState`). Rely on code completion, not brevity.
- Standard variables:
  - `res`: Function result.
  - `it`: Generic item (when identity is irrelevant).
  - `n_foos`: Count (preferred over `foo_count`).
  - `foo_idx`: Index.
- Keyword collisions: Avoid the `r#ident` syntax. Use these consistent replacements:

| Keyword | Replacement |
|---|---|
| `crate` | `krate` |
| `enum` | `enum_` |
| `fn` | `func` |
| `impl` | `imp` |
| `macro` | `mac` |
| `mod` | `module` |
| `struct` | `strukt` |
| `trait` | `trait_` |
| `type` | `ty` |

- Spelling & acronyms:
  - Use American spelling (`color`).
  - Avoid ad-hoc acronyms; stick to common ones (`db`, `ctx`).
Error Handling Trivia
- Use `anyhow::Result` instead of the bare `Result`.
  - Rationale: Makes the return type immediately clear without checking imports.
- Macro choice: Prefer `anyhow::format_err!` over `anyhow::anyhow!`.
  - Rationale: More "boring" (standard), consistent, and avoids the stuttering of `anyhow::anyhow!`.
- Message formatting:
  - There are no strict rules on message structure. The Rust standard library uses lowercase, while `anyhow` uses uppercase.
  - Do not end error or context messages with a period (`.`).
Early Returns
- Handle negative cases immediately with an early return rather than wrapping the "happy path" inside a large if/else block.
  - Rationale: Flattens code nesting and reduces "cognitive stack usage" (mental load).
- Explicit error returns: Use `return Err(e)` to exit with an error. Avoid using `Err(e)?` to simulate a throw.
  - Rationale: `return` evaluates to the "never type" (`!`), which allows the compiler to strictly identify dead code, whereas `?` resolves to a generic type that can mask unreachable code.
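A small sketch of the early-return style (`parse_port` is a hypothetical function): each negative case bails out immediately with an explicit `return Err(...)`, keeping the happy path flat.

```rust
fn parse_port(s: &str) -> Result<u16, String> {
    // Negative case 1: explicit early return, not Err(...)?
    if s.is_empty() {
        return Err("empty input".to_string());
    }
    let n: u32 = s.parse().map_err(|_| "not a number".to_string())?;
    // Negative case 2 (note the less-than style from the next section):
    if 65535 < n {
        return Err("out of range".to_string());
    }
    // Happy path, at zero extra nesting.
    Ok(n as u16)
}

fn main() {
    assert_eq!(parse_port("8080"), Ok(8080));
    assert!(parse_port("").is_err());
    assert!(parse_port("70000").is_err());
    println!("ok");
}
```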
Comparisons
- Prefer less-than:
  - Always use `<` or `<=` comparisons.
  - Avoid `>` or `>=`.
- Rationale:
  - Spatial intuition: Corresponds to the real number line, where values increase from left to right (`0 → ∞`).
  - Visual ordering: `lo <= x && x <= hi` visually places `x` in the middle, whereas `x >= lo` forces a mental "flip."
If-let
- Prefer `match` over `if let ... else`: When you need to handle both the "success" and "failure" cases, use a full `match` statement.
- Rationale:
  - Compactness: `match` is usually cleaner and requires less syntax for simple alternatives.
  - Precision: The `else` block in `if let` is implicit (it covers everything else). `match` forces you, or allows you, to be explicit about what the negative case is (e.g., `None` vs `Err(_)`), making the code more robust to type changes.
Match Ergonomics
- Avoid `ref`: Do not use the `ref` keyword in patterns.
- Rationale:
  - Obsolescence: `ref` is largely redundant due to "match ergonomics" (introduced in recent Rust editions).
  - Simplicity: Relying on match ergonomics is cleaner and avoids mixing legacy syntax with modern style.
Empty Match Arms
- Use the unit value `()` for empty match arms, rather than an empty block `{}`.

```rust
// GOOD
Err(err) => error!("{}", err),
Ok(_) => (), // <--- clean, single-line comma style

// BAD
Err(err) => error!("{}", err),
Ok(_) => {} // <--- block style breaks the visual rhythm
```

- Rationale:
  - In Rust, `()` is the value "nothing," while `{}` is a block of code that evaluates to nothing.
  - While they are functionally identical here, using `=> (),` keeps the match arm visibly distinct as a "value" rather than a "scope," maintaining a consistent visual rhythm in long match statements.
Functional Combinators
- Use functional combinators (`map`, `and_then`) only when they fit naturally.
- Prefer imperative control flow (`if`, `for`, `match`) over "forced" combinators like `bool::then` or `Option::filter`.
- The philosophy: Code should be dense in computation (doing work) but sparse in structure (few indirections per line).
- Rationale:
  - Rust has strong support for imperative flow (loops, early returns).
  - Rust functions are "less first-class" because of effects like `?` (the try operator) and `.await`. These do not compose well inside long chains of closures (e.g., trying to use `?` inside a `filter` closure is painful).
Turbofish
- Prefer explicit type ascription (`let x: Vec<i32> = ...`) over the "turbofish" syntax (`...collect::<Vec<i32>>()`).
- Rationale: We want to be able to read everything left-to-right without maintaining too much context:
  - Good: `let names: Vec<String> = users.iter().map(...).collect();`
    - Why: When you read the line left-to-right, you immediately know what is being built (`Vec<String>`). This context helps you understand the complex iterator chain that follows.
  - Bad: `let names = users.iter().map(...).collect::<Vec<String>>();`
    - Why: You have to read the entire chain until the very end to figure out what type is actually being produced.
- No placeholders (`_`): Avoid `let x: Vec<_> = ...`.
  - Rationale: If the compiler struggles to infer the type (forcing you to add a hint), a human reader will likely struggle too. Be kind to the reader and write the full type `Vec<i32>`.
Helper Functions
- Avoid single-use helpers: Do not create a separate function if it is only called once.
- Alternative: Use a block `{ ... }`.
- Rationale: This isolates the scope but keeps access to the context variables.
- Exception: Create a function if you need early returns (`return`) or error propagation (`?`).
- Local helpers: Place nested helper functions at the end of the enclosing function.
  - Structure: Main logic → `return result;` → `fn helper() { ... }`.
  - Limit: Do not nest more than one level deep.
Helper Variables
- Create boolean helper variables for complex conditions (e.g., inside match guards).
- Rationale:
  - Acts as a "cognitively cheap" abstraction (names the logic without hiding the context).
  - Makes debugging easier (you can inspect/print the variable).
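A sketch of the idea (the `classify` function and its thresholds are hypothetical): the helper variable names the compound condition, and a match guard stays readable.

```rust
fn classify(len: usize, is_doc: bool) -> &'static str {
    // Helper variable: names the logic, and is easy to print or inspect.
    let is_long_doc = is_doc && 100 < len;

    match (len, is_doc) {
        _ if is_long_doc => "long doc comment",
        (_, true) => "doc comment",
        _ => "other",
    }
}

fn main() {
    assert_eq!(classify(200, true), "long doc comment");
    assert_eq!(classify(10, true), "doc comment");
    assert_eq!(classify(10, false), "other");
    println!("ok");
}
```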
Syntax Macros
- Tokens: Use the `T![token]` macro instead of `SyntaxKind::TOKEN_KW`.
- Example: `T![true]` instead of `SyntaxKind::TRUE_KW`.
- Rationale: Familiar syntax, avoids ambiguity (e.g., `{` vs `[`).
Documentation
- Comments: Write proper sentences. Start with a capital letter, end with a dot (`.`).
- Markdown: Use sentence-per-line. Do not hard-wrap lines; press Enter after every sentence.
- Rationale:
- Makes diffs cleaner and editing easier.
- Formatting a comment as a sentence (capital + period) forces the brain to switch from “scribbling notes” to “explaining thoughts.”
- Context dump: It tricks you into emptying your "mental RAM" (assumptions, constraints) into the code, rather than leaving vague fragments like `// fix this`.
Syntax Tree & Parser
Official site: Link.
Commit link: 2020-01-09
Architectural Overview
- The system has three isolated components:
  - `rowan`: A generic, language-agnostic library for immutable syntax trees using the Red-Green tree model.
  - `syntax` (crate): Wraps `rowan` with a `rust-analyzer`-specific API. It's the only crate that knows about `rowan`.
  - `parser` (crate): Outputs events from parsing without knowing anything about the tree structure.
Design Goals
- Lossless: Everything is preserved—whitespace, comments, even invalid tokens.
- Semantic-less: Pure structure with no type checking or name resolution.
- Context-free: Trees are simple values that don’t depend on external context.
- Resilient: Always produce a tree, even from broken code (nodes may be partial).
- Performance: Optimized for memory and CPU, using `unsafe` where needed.
- Decoupled: Parser and tree evolve independently.
- Intuitive traversal: Easy navigation between parent, children, and siblings.
The Three-Layer Syntax Tree
The core idea is a “Red-Green Tree” (borrowed from Roslyn), where three layers balance memory usage against API convenience.
- Semi-transient: Trees don’t live in memory permanently—they get lowered to a compact form but can be rebuilt when needed.
Layer 1: GreenNode (The Storage)
- `GreenNode` is where the actual data lives—it's purely functional, immutable, and persistent, with arbitrary arity.
- Conceptual (unoptimized) example from `rust-analyzer`:

```rust
// Runtime tag for a node or token.
// Nodes and tokens share the same kind of tag.
#[derive(PartialEq, Eq, Clone, Copy)]
struct SyntaxKind(u16);

#[derive(PartialEq, Eq, Clone)]
struct Node {
    kind: SyntaxKind,
    // Only the byte-length is stored; the offset is not.
    text_len: usize,
    // A generic node with a sequence of child nodes and tokens.
    // `Arc` is used, so it allows for multithreading.
    children: Vec<Arc<Either<Node, Token>>>,
}

#[derive(PartialEq, Eq, Clone)]
struct Token {
    kind: SyntaxKind,
    text: String,
}
```

- A `Token` is a leaf node, while a `Node` represents an interior node.
- Some things to note:
  - There's just one generic node structure—nodes are untyped, with a runtime tag.
  - The original text comes from combining all the children's text.
  - Finding a specific child means scanning through siblings linearly.
  - Modifying the tree gets more expensive with height (due to path copying).
  - Invalid input gets wrapped in an `ERROR` node.
  - Errors live separately, not embedded in the tree itself.
The following optimizations are mostly attributed to CAD97.
Optimizations
```
[ RefCount | Kind | TextLen | N_Children | Child_1 | Child_2 | ... | Child_N ]
<------------------------- Single Heap Allocation ------------------------->
```
- Dynamically sized type (DST): Everything in one heap allocation instead of splitting the node and its children into separate allocations.
- Tagged pointers: A single pointer where the last bit indicates whether it's a `Node` or `Token` (cheaper than `Arc<Either<Node, Token>>`).
- Interning: Tokens get reused—`1 + 1` only stores one `1` token that's shared. The interner returns `Arc<Node>`, so trees are self-contained.
- `TextSize`: Just a newtyped `u32` for tracking text length.
- `SmolStr`: Small strings like `fn` or `if` live inline on the stack; no heap allocation needed.
  - Note: The text suggests moving this data directly into the interned token allocation in the future.
Alternative Designs
Dealing With Trivia
- Explicit nodes (Rowan/IntelliJ): Whitespace becomes regular sibling nodes. Lossless and easy to edit, but noisy when traversing.
- Attached trivia (Roslyn/Swift): Whitespace hides inside tokens as "leading/trailing" properties, so the tree only shows logical code.
- Linked list (Dart): A hybrid approach with a clean semantic tree plus a separate doubly-linked list for all tokens.
- There's no standard approach to handling trivia. For reference, `@dbml/parse` uses the attached-trivia approach.
| Strategy | Pros | Cons |
|---|---|---|
| Explicit Nodes (Rowan, IntelliJ, rust-analyzer) | Uniformity: Everything is a node; simple data model. Fidelity: Lossless representation of text. Refactoring: Moving a branch naturally moves its formatting. | Noise: Tree traversal requires manual filtering of whitespace. Memory: Higher node count (every space is an object). |
| Attached Trivia (Roslyn, Swift) | Clarity: Tree contains only logical code. Iteration: Iterating children yields immediate semantic tokens. | Complexity: “Re-stitching” text during refactoring is hard. Ownership: Ambiguous whether a comment belongs to the previous or next token. |
| Linked List (Dart) | Dual View: Optimized for both semantic analysis (Tree) and formatting (List). Scanning: Linear access to tokens is very fast. | Syncing: High maintenance cost to keep the Tree and List in perfect sync. Overhead: Requires managing parallel data structures. |
Accessing Children
- Linear scan (IntelliJ/Rowan):
  - Children live in a dynamic list with trivia mixed in.
  - You have to scan through siblings (O(N)) to find what you want.
  - Very memory efficient—missing nodes cost nothing.
- Fixed slots (Roslyn/Swift):
  - Every possible child gets a fixed slot in the grammar.
  - Direct O(1) access by index.
  - Uses more memory, since it allocates slots even for `None` values.
  - Trivia has to be removed to keep indices stable.
| Feature | Linear Scan (IntelliJ) | Fixed Slots (Roslyn/Swift) |
|---|---|---|
| Access Speed | O(N) (Must iterate/skip) | O(1) (Direct Index) |
| Memory Usage | Compact (Only stores present data) | Sparse (Allocates slots for None) |
| Missing Nodes | Implicitly absent | Explicitly stored as Option::None |
| Trivia | Included as sibling nodes | Removed or attached to tokens |
Mutable Trees
- Mutable trees (IntelliJ): Imperative style.
  - You modify nodes directly in place (`node.delete()`).
  - Simple API, but thread safety gets complicated (needs read/write locks).
- Immutable trees (`rust-analyzer`): Functional style.
  - You build up changes with a `Builder`, then generate a fresh tree.
  - More verbose API, but concurrency becomes trivial (snapshots are naturally safe).
```rust
// MUTABLE (IntelliJ style)
// Direct, in-place modification. Simple, but requires locking.
for bound in bounds {
    // The tree changes immediately
    bound.delete();
}

// IMMUTABLE (rust-analyzer style)
// Indirect. Must collect edits and rebuild the tree structure.
let mut builder = tree.edit();
for bound in bounds {
    // Schedule deletion on the 'builder', not the node itself
    builder.delete(bound);
}
// Apply generates a completely new root node
let new_tree_root = builder.apply();
```
| Feature | Mutable Trees (IntelliJ) | Immutable Trees (rust-analyzer) |
|---|---|---|
| Editing API | Simple (Direct node.delete()) | Verbose (Requires Builder / rewriting spine) |
| Concurrency | Hard (Needs complex R/W locks) | Easy (Snapshots are thread-safe) |
| Mental Model | “DOM Manipulation” | “Version Control” / Persistent Data Structures |
| Performance | Fast small edits | Slower (allocates new nodes for the path to root) |
Layer 2: SyntaxNode (The Cursor / RedNode)
- `GreenNode` has some limitations: there are no parent pointers (you can't go up the tree), and if the same `GreenNode` appears in multiple places, the occurrences are treated as identical.
- `SyntaxNode`/`RedNode` (also called cursors or zippers) solves this by adding parent pointers and position information.
- Conceptual (unoptimized) example from `rust-analyzer`:

```rust
type SyntaxNode = Arc<SyntaxData>;

struct SyntaxData {
    offset: usize,
    parent: Option<SyntaxNode>,
    green: Arc<GreenNode>,
}

impl SyntaxNode {
    fn new_root(root: Arc<GreenNode>) -> SyntaxNode {
        Arc::new(SyntaxData { offset: 0, parent: None, green: root })
    }
    fn parent(&self) -> Option<SyntaxNode> {
        self.parent.clone()
    }
    fn children(&self) -> impl Iterator<Item = SyntaxNode> {
        let mut offset = self.offset;
        self.green.children().map(|green_child| {
            let child_offset = offset;
            offset += green_child.text_len;
            Arc::new(SyntaxData {
                offset: child_offset,
                parent: Some(Arc::clone(self)),
                green: Arc::clone(green_child),
            })
        })
    }
}

impl PartialEq for SyntaxNode {
    fn eq(&self, other: &SyntaxNode) -> bool {
        self.offset == other.offset && Arc::ptr_eq(&self.green, &other.green)
    }
}
```
Optimizations
- Atomic avoidance: Use `Rc` (non-atomic) instead of `Arc` for parent pointers, since traversals are single-threaded anyway, avoiding atomic overhead.
- Thread movement: You can't send a `SyntaxNode` across threads directly, so you send lightweight coordinates like `(GreenNode, TextRange)` instead. The receiving thread rebuilds the node from those coordinates.
- Transient trees: Instead of keeping full trees in memory, `rust-analyzer` just stores `(FileId, TextRange)` and reparses when needed—trading CPU time for memory.
- Root-only `Arc`: Only the root holds an `Arc<GreenNode>` as the tree's memory anchor. Descendants use raw pointers (fast) plus an `Rc` to their parent (safe). Since each child's `Rc` keeps its parent alive all the way up to the root, the raw pointers never dangle. This means you only pay for atomic operations once at the root—the rest is just cheap `Rc` bumps.
- Object pooling: Keep a thread-local "free list" of node objects for reuse instead of calling `malloc` constantly. The pool only needs to match traversal depth (not tree size), so a few dozen slots can handle huge trees. This lets you return "owned" nodes cheaply (just a pointer move), which are nicer to work with than references.
Alternative Designs
Memoized RedNodes
- Memoized nodes (C#/Swift): Heavy `Arc` objects that permanently cache everything: offset, parent, children. You get true pointer equality (super fast comparison), but it doubles memory since the whole tree is duplicated in the “Red” layer. C# tries to claw back memory with weak references.
- Cursor approach (rust-analyzer): Lightweight views that calculate positions on demand. Memory scales with traversal depth, not tree size.
#![allow(unused)]
fn main() {
// Memoized Design (C# / Swift)
// Persistent, heavy wrapper around the data.
struct SyntaxData {
offset: usize,
parent: Option<SyntaxNode>, // Back-link
// Caches children forever. High memory cost.
children: Vec<OnceCell<SyntaxNode>>,
}
}
| Feature | Memoized Nodes (C#/Swift) | Cursors (rust-analyzer) |
|---|---|---|
| Node Identity | Pointer Equality (Fast) | Range Check (Slower) |
| Memory Cost | High (Doubles tree size) | Low (Proportional to depth) |
| Child Access | Cached (Immediate) | Computed (On-demand) |
| Persistence | Long-lived Objects | Transient / Re-created |
Layer 3: AST (The API)
- A strongly typed API is way more convenient than working with untyped GreenNodes directly (though untyped works fine for error or trivial nodes).
- For example, having a typed `FnDef` with well-known methods beats a generic node you have to manually inspect.
- Example `AstNode` trait (from rust-analyzer):

```rust
#![allow(unused)]
fn main() {
pub trait AstNode {
    fn cast(syntax: SyntaxNode) -> Option<Self>
    where
        Self: Sized;
    fn syntax(&self) -> &SyntaxNode;
}
}
```

- The classes are auto-generated. One example from rust-analyzer:

```rust
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct FnDef {
    syntax: SyntaxNode,
}

impl AstNode for FnDef {
    fn cast(syntax: SyntaxNode) -> Option<Self> {
        match syntax.kind {
            SyntaxKind::FN => Some(FnDef { syntax }),
            _ => None,
        }
    }
    fn syntax(&self) -> &SyntaxNode {
        &self.syntax
    }
}

impl FnDef {
    pub fn param_list(&self) -> Option<ParamList> {
        self.syntax.children().find_map(ParamList::cast)
    }
    pub fn ret_type(&self) -> Option<RetType> {
        self.syntax.children().find_map(RetType::cast)
    }
    pub fn body(&self) -> Option<BlockExpr> {
        self.syntax.children().find_map(BlockExpr::cast)
    }
    // ...
}
}
```

- Variants such as expressions are represented as enums. One example from rust-analyzer:

```rust
#![allow(unused)]
fn main() {
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum AssocItem {
    FnDef(FnDef),
    TypeAliasDef(TypeAliasDef),
    ConstDef(ConstDef),
}

impl AstNode for AssocItem { ... }
}
```

- Shared syntactic elements (like names, loop bodies, or doc comments) are modeled via traits:

```rust
#![allow(unused)]
fn main() {
trait HasName: AstNode {
    fn name(&self) -> Option<Name>;
}

impl HasName for FnDef {
    fn name(&self) -> Option<Name> {
        self.syntax.children().find_map(Name::cast)
    }
}

impl HasName for StructDef {
    fn name(&self) -> Option<Name> {
        self.syntax.children().find_map(Name::cast)
    }
}
}
```
Alternative Designs
Semantic Full AST
- IntelliJ’s PSI is a “rich” AST that abstracts away where code came from: a `PsiMethod` looks the same whether it comes from source (.java) or compiled (.jar), hiding all that complexity.
- Attached semantics: Nodes have methods like `resolve()` or `getSuperClass()` built in, so you can query semantics directly on the tree.
- Memory optimization: Nodes start as a lightweight “stub” (just serialized metadata) and only inflate to a full AST when you actually open the file.
| Feature | Pure AST (rust-analyzer) | Semantic PSI (IntelliJ) |
|---|---|---|
| Scope | Source code structure only | Source + Compiled Libraries |
| Backing | Always AST | Dynamic (index stub -> full tree) |
| Logic | External (analysis passes) | Internal (methods on nodes) |
| Memory | Low (transient cursors) | High (rich objects), mitigated by stubs |
Parsing - The Token Sequence Transformer
- You can think of parsing as doing a DFS traversal of the tree you want to build.
- When the parser finishes a node, it calls `GreenNodeBuilder` to construct it:

```rust
#![allow(unused)]
fn main() {
pub struct GreenNodeBuilder { ... }

impl GreenNodeBuilder {
    pub fn new() -> GreenNodeBuilder { ... }
    pub fn token(&mut self, kind: SyntaxKind, text: &str) { ... }
    pub fn start_node(&mut self, kind: SyntaxKind) { ... }
    pub fn finish_node(&mut self) { ... }
    pub fn finish(self) -> GreenNode { ... }
}
}
```

- There are two kinds of input: source text (full of trivia like whitespace and comments) and macro token trees (just structural tokens, no trivia).
- Input and output tokens don’t match 1-to-1. The parser uses abstract callbacks: `TokenSource` for reading input and `TreeSink` for writing output. An intermediary layer handles the mismatches, like stripping whitespace or merging `>` + `>` into `>>`.
- The parser interface (from rust-analyzer):

```rust
#![allow(unused)]
fn main() {
// Different structure than GreenNode's Token
pub struct Token {
    pub kind: SyntaxKind,
    pub is_joined_to_next: bool,
}

pub trait TokenSource {
    fn current(&self) -> Token;
    fn lookahead_nth(&self, n: usize) -> Token;
    fn is_keyword(&self, kw: &str) -> bool;
    fn bump(&mut self);
}

pub trait TreeSink {
    fn token(&mut self, kind: SyntaxKind, n_tokens: u8);
    fn start_node(&mut self, kind: SyntaxKind);
    fn finish_node(&mut self);
    fn error(&mut self, error: ParseError);
}

pub fn parse(
    token_source: &mut dyn TokenSource,
    tree_sink: &mut dyn TreeSink,
) { ... }
}
```

- The `parser` and `syntax` crates are completely separate with zero dependencies on each other, achieving strict modularity.
- The parser mostly works with `SyntaxKind` tags and ignores the actual text, except for one hack: checking contextual keywords like `union` or `default`.
- `TreeSink` isn’t atomic. When the parser emits a logical token like `>>`, the sink can consume multiple raw tokens (`>` and `>`).
Reporting Syntax Errors
- Errors aren’t stored in the tree; they’re collected separately in a `Vec<SyntaxError>`.
- This separation means you can build or modify trees manually without worrying about error state.
- The parser is intentionally permissive (like allowing `pub` on trait methods). These “soft” errors get caught later in a validation pass.
Macros
- Token hygiene: Macros need to remember where tokens came from so variables with the same name from different scopes don’t get confused. While building the tree, `TreeSink` also creates a map linking text ranges back to their original token IDs. So parsing gives you two things: the tree and a map tracing everything back to its source.
- Operator precedence: To avoid breaking precedence (like `$expr * 1` messing up order of operations), the parser adds invisible parentheses around macro expansions.
Whitespace & Comments
- The core parser works with a “clean” token stream—no whitespace or comments to worry about.
- But trivia isn’t thrown away. The `TreeSink` layer re-inserts it into the final tree as nodes get built.
- Comments don’t just get dumped linearly; they’re attached heuristically to their semantic parents. For example, a comment right before a function becomes a child of that `FnDef` node.
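As a toy illustration of that attachment idea (my own sketch, not rust-analyzer’s actual heuristic), a comment can be attached to the following item only when no blank line separates them:

```rust
// Sketch: decide whether a comment "belongs" to the item after it, based
// on the whitespace gap between the comment and the item. This is a
// simplified stand-in for rust-analyzer's trivia-attachment logic.
fn comment_attaches(gap_after_comment: &str) -> bool {
    // Two or more newlines in the gap means a blank line separates them,
    // which breaks the attachment.
    gap_after_comment.matches('\n').count() < 2
}

fn main() {
    // A comment directly above `fn foo` attaches to the FnDef node.
    assert!(comment_attaches("\n"));
    // A blank line in between keeps the comment a free-standing sibling.
    assert!(!comment_attaches("\n\n"));
}
```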
Incremental Reparse
- Green trees make modifications cheap—you can “patch” by swapping a single node pointer with a freshly parsed subtree. No external state needed.
- Block heuristic: Edits get isolated to the smallest `{}` block that contains them. This works because the parser keeps braces balanced even in broken code, giving you stable anchor points.
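To see why such patches are cheap, here is a minimal sketch (assuming a simplified `GreenNode`, not rust-analyzer’s real API): replacing one subtree re-allocates only the spine from the root down to the edit, while every other subtree is shared via `Arc`.

```rust
use std::sync::Arc;

// Simplified immutable green node (illustrative, no text or lengths).
#[derive(Debug)]
struct GreenNode {
    kind: &'static str,
    children: Vec<Arc<GreenNode>>,
}

// Returns a new root where the node at `path` (child indices from the
// root) is replaced by `new_child`. Only the spine is re-allocated;
// sibling subtrees are shared by cloning their `Arc`s.
fn replace_at(node: &Arc<GreenNode>, path: &[usize], new_child: Arc<GreenNode>) -> Arc<GreenNode> {
    match path {
        [] => new_child,
        [i, rest @ ..] => {
            let mut children = node.children.clone(); // clones Arcs, not subtrees
            children[*i] = replace_at(&node.children[*i], rest, new_child);
            Arc::new(GreenNode { kind: node.kind, children })
        }
    }
}

fn main() {
    let leaf = |k| Arc::new(GreenNode { kind: k, children: vec![] });
    let block = Arc::new(GreenNode { kind: "BLOCK", children: vec![leaf("LET_STMT")] });
    let root = Arc::new(GreenNode { kind: "FILE", children: vec![leaf("FN_KW"), block] });

    // "Incremental reparse": reparse just the block, then swap it in.
    let new_block = Arc::new(GreenNode { kind: "BLOCK", children: vec![leaf("EXPR_STMT")] });
    let new_root = replace_at(&root, &[1], new_block);

    // The untouched first child is literally shared between old and new roots.
    assert!(Arc::ptr_eq(&root.children[0], &new_root.children[0]));
    assert_eq!(new_root.children[1].children[0].kind, "EXPR_STMT");
}
```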
Q&A: How does the parser maintain balanced braces in broken code?
Q: How can braces be “balanced” when users type unmatched { or }?
A: The tree structure maintains balanced braces, not the source text. The parser uses error recovery to ensure every start_node(BLOCK) has a matching finish_node().
Q: What happens when a closing brace is missing?
A: The parser inserts an implicit/phantom closing brace. For example:
```rust
fn foo() {
    let x = 1;
// EOF - missing }
```
The tree structure acts as if the } exists at EOF, even though it’s not in the source text. This is implemented in the parser’s block parsing logic, which automatically closes unclosed blocks at EOF or when encountering incompatible tokens.
Q: What if there’s an unexpected token inside a block?
A: The parser uses early block termination. For example:
```rust
fn foo() {
    let x = 1;
    fn bar() { // unexpected fn - shouldn't be here
}
```
The parser:
- Realizes `fn` shouldn’t be inside `foo`’s block.
- Inserts an implicit `}` to close `foo`.
- Continues parsing `bar` at the outer level.
- Wraps the malformed content in an `ERROR` node.
This is part of the “panic mode” error recovery strategy implemented in the event-based parser.
Q: What about extra closing braces?
A: Unmatched } tokens are:
```rust
fn foo() {
}
} // extra closing brace
```
- Treated as error tokens.
- Wrapped in an `ERROR` node.
- Not matched with anything.
- The structural tree still has balanced blocks.
Q: How does this enable the block heuristic?
A: Because braces are always structurally balanced, the parser can reliably:
- Find the smallest enclosing `{}` block around any edit.
- Use these blocks as stable anchor points.
- Reparse just that subtree instead of the entire file.
This is implemented in the incremental reparsing logic in the `syntax` crate, particularly the `incremental_reparse` function and the `is_balanced` check.
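The core of such a balance check can be sketched in a few lines. This is a simplification over raw brace characters (the real check operates on the token sequence of the candidate range):

```rust
// Sketch: an edited range is only usable as a reparse anchor if its
// braces pair up exactly. An extra `}` or an unclosed `{` disqualifies it.
fn is_balanced(tokens: &[char]) -> bool {
    let mut depth: i32 = 0;
    for &t in tokens {
        match t {
            '{' => depth += 1,
            '}' => {
                depth -= 1;
                if depth < 0 {
                    return false; // extra `}`: can't anchor here
                }
            }
            _ => {} // non-brace tokens don't affect balance
        }
    }
    depth == 0 // an unclosed `{` also disqualifies the range
}

fn main() {
    let ok: Vec<char> = "fn f() { let x = 1; }".chars().collect();
    let bad: Vec<char> = "fn f() { {".chars().collect();
    assert!(is_balanced(&ok));
    assert!(!is_balanced(&bad));
}
```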
- In practice though, incremental reparsing often isn’t worth it. Modern parsers are fast enough to just reparse the whole file from scratch.
Parsing Algorithm
- Recursive descent: Each grammar rule maps to a function that calls other functions recursively to parse sub-expressions. This makes the parser easy to understand and debug.
- Pratt parsing: Handle operator precedence elegantly without building complex precedence tables. Used specifically for expressions with binary operators.
- Error recovery: The parser makes a special effort to continue parsing even when it encounters syntax errors. Instead of stopping at the first error, it attempts to recover and continue building the tree, marking problematic sections with `ERROR` nodes.
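As a minimal illustration of Pratt parsing (an illustrative sketch of the technique, not rust-analyzer’s expression grammar), each operator gets a pair of left/right binding powers, and a single recursive function replaces an explicit precedence table:

```rust
// Binding powers: higher numbers bind tighter. The (left, right) pair
// encodes both precedence and left-associativity.
fn infix_binding_power(op: char) -> Option<(u8, u8)> {
    match op {
        '+' | '-' => Some((1, 2)),
        '*' | '/' => Some((3, 4)), // binds tighter than +/-
        _ => None,
    }
}

// Parses single-digit operands and binary operators, returning a
// parenthesized string that makes the resulting tree shape visible.
fn expr(tokens: &mut std::iter::Peekable<std::str::Chars>, min_bp: u8) -> String {
    let mut lhs = tokens.next().expect("expected an operand").to_string();
    while let Some(&op) = tokens.peek() {
        let Some((l_bp, r_bp)) = infix_binding_power(op) else { break };
        if l_bp < min_bp {
            break; // this operator belongs to an enclosing call
        }
        tokens.next(); // consume the operator
        let rhs = expr(tokens, r_bp);
        lhs = format!("({lhs} {op} {rhs})");
    }
    lhs
}

fn main() {
    let mut toks = "1+2*3".chars().peekable();
    // `*` binds tighter, so the tree nests as 1 + (2 * 3).
    assert_eq!(expr(&mut toks, 0), "(1 + (2 * 3))");
}
```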
Parser Recap
The parser’s design achieves strict modularity through well-defined interfaces:
- `TokenSource` trait:
  - Defines how the parser reads input tokens.
  - The parser doesn’t know or care whether tokens come from source text (with trivia), macro expansions (without trivia), or any other source.
- `TreeSink` trait:
  - Defines how the parser writes output events (`start_node`, `finish_node`, `token`, `error`).
  - The parser has no knowledge of how these events are converted into the final tree structure.
Interaction Between the syntax and parser Crates
The parser and syntax crates have zero dependencies on each other. Both crates depend on shared interfaces (traits), but neither depends on the other’s implementation.
%%{init: {'theme':'neutral', 'themeVariables': {'fontSize':'16px'}}}%%
flowchart LR
A(["<i>string</i><br/><b>Source Text</b>"])
subgraph top["<code>syntax</code> crate"]
direction LR
B["<i>fn</i><br/><b>tokenize</b>"]
C(["<i>Vec<Token></i><br/><b>Token Stream</b>"])
B --> C
end
subgraph P["<code>parser</code> crate"]
direction LR
D{{"<i>trait</i><br/><b>TokenSource</b>"}}
E["<i>fn</i><br/><b>parse</b>"]
F{{"<i>trait</i><br/><b>TreeSink</b>"}}
D --> E --> F
end
subgraph bottom["<code>syntax</code> crate"]
direction RL
I{{"<i>trait</i><br/><b>AstNode</b>"}}
H(["<i>type</i><br/><b>SyntaxNode</b>"])
G(["<i>type</i><br/><b>GreenNode</b>"])
G --> H --> I
end
A --> B
C -.->|implements| D
F -.->|implemented by| G
click B "https://github.com/rust-lang/rust-analyzer/blob/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_syntax/src/parsing/lexer.rs#L30-L92" "View tokenize function"
click C "https://github.com/rust-lang/rust-analyzer/blob/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_syntax/src/parsing/lexer.rs#L9-L15" "View Token struct"
click D "https://github.com/rust-lang/rust-analyzer/blob/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_parser/src/lib.rs#L30-L44" "View TokenSource trait"
click E "https://github.com/rust-lang/rust-analyzer/blob/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_parser/src/lib.rs#L81-L84" "View parse function"
click F "https://github.com/rust-lang/rust-analyzer/blob/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_parser/src/lib.rs#L56-L69" "View TreeSink trait"
click G "https://github.com/rust-lang/rust-analyzer/blob/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_syntax/src/syntax_node.rs#L10-L10" "View GreenNode type"
click H "https://github.com/rust-lang/rust-analyzer/blob/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_syntax/src/syntax_node.rs#L33-L34" "View SyntaxNode type alias"
click I "https://github.com/rust-lang/rust-analyzer/blob/cf5bdf464cad7ceb9a67e07985a3f4d3799ec0b6/crates/ra_syntax/src/ast.rs#L31-L42" "View AstNode trait"
classDef syntaxNode fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#1e40af
classDef parserNode fill:#fed7aa,stroke:#f97316,stroke-width:2px,color:#9a3412
classDef dataNode fill:#e5e7eb,stroke:#6b7280,stroke-width:2px,color:#374151
classDef traitNode fill:#fce7f3,stroke:#db2777,stroke-width:2px,color:#831843
class B,E syntaxNode
class A,C,G,H dataNode
class D,F,I traitNode
style top fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,rx:5
style bottom fill:#eff6ff,stroke:#3b82f6,stroke-width:2px,rx:5
style P fill:#fff7ed,stroke:#f97316,stroke-width:2px,rx:5
- The solid arrows represent direct data flow within the same crate. The dashed arrows represent trait implementation/interface contracts.
- The `parser` crate (orange box) only touches traits; it never directly depends on concrete types from `syntax`.
- The `syntax` crate appears in two boxes because it provides both the input preparation (tokenization) and the output consumption (tree building), while the parser sits independently in the middle.
- This architecture is like an “hourglass”: wide input capabilities (multiple token sources), narrow interface (two traits), wide output capabilities (multiple tree representations).
How They Work Together
- The `syntax` crate tokenizes the source text into a token stream.
- The `syntax` crate wraps its token stream in a struct that implements the `TokenSource` trait, providing the interface the parser expects.
- The `parser` crate’s `parse()` function reads from `TokenSource` and writes to `TreeSink`; it only knows about these trait interfaces, not the concrete implementations.
- The `syntax` crate provides a struct implementing `TreeSink` that receives parser events and builds the actual `GreenNode` tree structure.
- The `syntax` crate wraps the `GreenNode` in `SyntaxNode` and provides the typed AST layer.
Insight: The parser crate emits abstract events (start_node, token, finish_node, error) without knowing how they’ll be consumed. The syntax crate provides concrete implementations that convert these events into the tree structure.
What the parser knows vs doesn’t know:
- Knows: `SyntaxKind` tags (e.g., `FN_DEF`, `IDENT`, `LET_STMT`), the vocabulary of what syntactic constructs exist.
- Knows: When to emit events like `start_node(FN_DEF)` or `token(IDENT)` based on parsing logic.
- Doesn’t know: How these events are converted into actual tree nodes (`GreenNode`, `SyntaxNode`).
- Doesn’t know: The memory layout, data structures, or APIs of the tree.
SyntaxKind is just a simple enum/tag shared between both crates. The parser uses it to label events, while the syntax crate uses it to tag tree nodes. This shared vocabulary is the only coupling between them.
Illustrative example
Example glue code in the `syntax` crate:
#![allow(unused)]
fn main() {
// `syntax` crate
pub fn parse_source_file(text: &str) -> SourceFile {
// Step 1: Tokenize the source text (syntax crate's job)
let tokens = tokenize(text);
// Step 2: Create adapters that implement parser traits
let mut token_source = TokenSource::new(text, &tokens);
let mut tree_sink = TreeSink::new(text, &tokens);
// Step 3: Call the parser (parser crate's job)
// The parser only sees the trait interfaces, not the concrete types
parser::parse(&mut token_source, &mut tree_sink);
// Step 4: Extract the built tree (syntax crate's job)
let (green_node, errors) = tree_sink.finish();
// Step 5: Wrap in higher-level APIs
SourceFile::new(green_node, errors)
}
}
Motivation
This separation provides several benefits:
- The parser logic can change without touching tree representation, and vice versa.
- Different token sources (source files, macro expansions, synthetic tokens) can all use the same parser.
- Different tree sinks could produce different tree formats, or even non-tree outputs (e.g., streaming validation).
- The parser can be tested with mock token sources and tree sinks.
- The parser is truly language-agnostic: it could theoretically be reused for other languages by providing different `SyntaxKind` definitions.
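A sketch of the testability benefit above, using simplified trait shapes of my own (not the real `TreeSink` signature): a mock sink just records the event stream, so parser logic can be asserted on without building any tree at all.

```rust
// The event vocabulary a parser emits (simplified: kinds are plain strings).
#[derive(Debug, PartialEq)]
enum Event {
    StartNode(&'static str),
    Token(&'static str),
    FinishNode,
}

// Simplified stand-in for the TreeSink trait from the section above.
trait TreeSink {
    fn start_node(&mut self, kind: &'static str);
    fn token(&mut self, kind: &'static str);
    fn finish_node(&mut self);
}

// Mock sink: records events instead of building a GreenNode tree.
struct EventLog(Vec<Event>);

impl TreeSink for EventLog {
    fn start_node(&mut self, kind: &'static str) { self.0.push(Event::StartNode(kind)); }
    fn token(&mut self, kind: &'static str) { self.0.push(Event::Token(kind)); }
    fn finish_node(&mut self) { self.0.push(Event::FinishNode); }
}

// A toy "parser" that emits events for a `fn name` header.
fn parse_fn_header(sink: &mut dyn TreeSink) {
    sink.start_node("FN_DEF");
    sink.token("FN_KW");
    sink.token("IDENT");
    sink.finish_node();
}

fn main() {
    let mut log = EventLog(Vec::new());
    parse_fn_header(&mut log);
    // The test asserts directly on the event stream; no tree required.
    assert_eq!(
        log.0,
        vec![
            Event::StartNode("FN_DEF"),
            Event::Token("FN_KW"),
            Event::Token("IDENT"),
            Event::FinishNode,
        ]
    );
}
```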
Testing
Official site: Link.
Test Philosophy
rust-analyzer employs snapshot testing as its core testing strategy (surprised LOL). The essence: “A test is a piece of input text, usually Rust code, and some output text.”
The testing helper compares the actual result against the expected output. When tests fail, you can review the diff and either fix the code or update the expectation if the new behavior is correct.
Test Framework Components
The testing infrastructure combines two main elements:
- `expect-test` crate: A lightweight snapshot testing library that enables automated expectation updates.
- Custom testing helpers: Purpose-built functions for specific rust-analyzer features (type inference, diagnostics, etc.).
Common Test Helpers
Type Inference Tests
Located in crates/hir-ty/src/tests. Three main helper functions:
check_no_mismatches()
Validates that no type mismatches exist in the given code.
#[test]
fn simple_assignment() {
check_no_mismatches(
r#"
fn main() {
let x: i32 = 42;
let y: i32 = x;
}
"#,
);
}
Note: This relies on rust-analyzer’s analysis, not the Rust compiler. The analyzer might miss errors the compiler would catch, or vice versa.
check_types()
Asserts specific expression types using inline annotations. Uses ^ markers under code to denote type assertions.
#[test]
fn type_annotations() {
check_types(
r#"
fn main() {
let x = 1;
// ^ i32
let y = "hello";
// ^ &str
let z = x + 2;
// ^ i32
}
"#,
);
}
The // ^ type notation marks the expression on the line above and asserts its inferred type.
check_infer()
Uses expect-test to match complete type inference output against expected results. Displays byte ranges, source text, and inferred types for all expressions.
#![allow(unused)]
fn main() {
#[test]
fn infer_function_return() {
check_infer(
r#"
fn foo() -> i32 {
42
}
"#,
expect![[r#"
10..11 '{': expected i32, got ()
15..17 '42': i32
"#]],
);
}
}
Annotation Conventions
rust-analyzer tests use a special notation system for marking positions and ranges in code:
| Notation | Purpose | Example |
|---|---|---|
$0 | Cursor position (single point) | let x$0 = 1; |
$0...$0 | Range/selection | let $0x = 1$0; |
// ^^^^ | Type assertion label | x + 1// ^^^ i32 |
These annotations are processed by the fixture parser and removed before the code is analyzed.
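Conceptually, processing a `$0` marker amounts to recording its byte offset and stripping it before analysis. A simplified stand-in for what the fixture parser does (not the real helper):

```rust
// Finds the `$0` cursor marker in a fixture string, returning its byte
// offset plus the cleaned code that the analyzer would actually see.
fn extract_cursor(fixture: &str) -> (usize, String) {
    let offset = fixture.find("$0").expect("fixture must contain $0");
    let clean = fixture.replacen("$0", "", 1);
    (offset, clean)
}

fn main() {
    let (offset, code) = extract_cursor("let x$0 = 1;");
    assert_eq!(offset, 5); // cursor sits right after `x`
    assert_eq!(code, "let x = 1;"); // marker removed before analysis
}
```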
Fixture System
The fixture system is a mini-language for declaring multi-file, multi-crate test scenarios. It allows tests to simulate complex project structures without requiring actual file system operations.
Basic Fixture Structure
#![allow(unused)]
fn main() {
#[test]
fn test_cross_crate_reference() {
check_types(
r#"
//- minicore: sized
//- /lib.rs crate:foo
pub struct Foo;
//- /bar.rs crate:bar deps:foo
use foo::Foo;
fn use_foo(f: Foo) {
// ^^^ Foo
}
"#,
);
}
}
Fixture Directives
Minicore Flags
//- minicore: flag1, flag2, ...
Includes specific standard library components. This is the most commonly used directive.
Common minicore flags:
| Flag | Provides | When to Use |
|---|---|---|
sized | Sized trait | Almost always; recommended when stuck |
fn | Fn, FnMut, FnOnce traits | Required for closures |
async_fn | AsyncFn* traits | Required for async closures |
iterator | Iterator trait | When testing iteration |
option | Option<T> enum | When using Option |
result | Result<T, E> enum | When using Result |
unsize | Unsize trait | For unsized types (dyn Trait, slices) |
coerce_unsized | CoerceUnsized trait | For unsized type coercions |
dispatch_from_dyn | DispatchFromDyn trait | For trait object dispatch |
Important: If your test mysteriously fails with type errors, try adding sized first. Many operations implicitly require the Sized trait.
File Declarations
//- /path/to/file.rs crate:name deps:dep1,dep2 edition:2021
Declares a file in the fixture with optional metadata:
- Path: File path relative to project root
- crate: Crate name (defaults to file name)
- deps: Comma-separated dependency list
- edition: Rust edition (defaults to 2021)
#![allow(unused)]
fn main() {
//- /lib.rs crate:mylib
pub mod foo;
//- /foo.rs
pub struct Bar;
//- /main.rs crate:main deps:mylib edition:2021
use mylib::foo::Bar;
}
Complete Fixture Example
#[test]
fn test_workspace_with_dependencies() {
check_types(
r#"
//- minicore: sized, fn, iterator
//- /workspace/utils/lib.rs crate:utils
pub fn helper() -> i32 { 42 }
//- /workspace/core/lib.rs crate:core deps:utils
use utils::helper;
pub fn process() -> i32 {
helper()
// ^^^^^^^^ fn helper() -> i32
}
//- /workspace/app/main.rs crate:app deps:core
use core::process;
fn main() {
let x = process();
// ^ i32
}
"#,
);
}
This fixture creates a workspace with three crates (utils → core → app) with proper dependency relationships.
Fixture System Architecture
%%{init: {'theme':'neutral', 'themeVariables': {'fontSize':'16px'}}}%%
flowchart LR
A["Test Fixture String"]
B["Parse Directives"]
C["minicore Flags"]
D["File Declarations"]
E["Build In-Memory\nFile System"]
F["Create Crate Graph"]
G["Run rust-analyzer\nAnalysis"]
H["Extract Results"]
A --> B
B --> C
B --> D
C --> E
D --> E
E --> F
F --> G
G --> H
classDef inputNode fill:#dbeafe,stroke:#3b82f6,stroke-width:2px,color:#1e3a8a
classDef parseNode fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#78350f
classDef buildNode fill:#dcfce7,stroke:#22c55e,stroke-width:2px,color:#064e3b
classDef outputNode fill:#fce7f3,stroke:#db2777,stroke-width:2px,color:#831843
class A inputNode
class B,C,D parseNode
class E,F,G buildNode
class H outputNode
The fixture parser:
- Extracts directives (
minicore, file paths, metadata). - Builds an in-memory file system.
- Constructs a crate dependency graph.
- Initializes
rust-analyzerwith this synthetic project. - Runs analysis and returns results for assertions.
UPDATE_EXPECT Workflow
The expect-test crate supports automatic expectation updates via the UPDATE_EXPECT=1 environment variable.
Manual Update
UPDATE_EXPECT=1 cargo test
This runs all tests and automatically updates any failing expect![[...]] blocks with the actual output.
IDE Integration
Some IDEs (notably VSCode with the rust-analyzer extension) provide convenient UI buttons for updating expectations:
- “Update Expect” button in test function hover tooltips
- Inline code actions for specific
expect!blocks
Workflow Diagram
%%{init: {'theme':'neutral', 'themeVariables': {'fontSize':'16px'}}}%%
stateDiagram-v2
[*] --> WriteTest: Write test with expect!
WriteTest --> RunTest: cargo test
RunTest --> TestPasses: Output matches
RunTest --> TestFails: Output differs
TestPasses --> [*]
TestFails --> ReviewDiff: Examine actual vs expected
ReviewDiff --> FixCode: Bug in code
ReviewDiff --> UpdateExpect: Behavior is correct
FixCode --> RunTest
UpdateExpect --> RunWithUpdate: UPDATE_EXPECT=1 cargo test
RunWithUpdate --> Verify: Check updated expectation
Verify --> [*]
Best Practices
1. Import Required Types
The minicore system only includes explicitly requested components. If your test fails with “cannot find type” errors:
#![allow(unused)]
fn main() {
//- minicore: sized, fn, iterator, option // ← Add missing flags
}
2. Start with sized
When stuck with mysterious type errors, add sized to minicore flags. Many operations implicitly require the Sized trait:
#![allow(unused)]
fn main() {
//- minicore: sized // ← Add this first when debugging
}
3. Use Semantic Highlighting for Debugging
If code looks wrong in the test, check its syntax highlighting. Incorrect highlighting often indicates:
- Missing minicore flags
- Syntax errors in the fixture
- Incorrect file path declarations
4. Validate with Real IDEs
When test behavior seems suspicious, validate the code in an actual IDE with rust-analyzer. Sometimes tests simplify too much and miss real-world edge cases.
5. Unsized Types Need Special Flags
Working with dyn Trait, slices, or other unsized types? Add these flags:
#![allow(unused)]
fn main() {
//- minicore: sized, unsize, coerce_unsized, dispatch_from_dyn
}
6. Keep Tests Minimal
Follow the “Reproduction isolation” principle from rust-analyzer conventions: minimize test cases to the smallest code required to reproduce behavior. This reduces noise and debugging time.
Common Test Patterns
Testing Completion
#![allow(unused)]
fn main() {
#[test]
fn completes_struct_fields() {
check_edit(
"field",
r#"
struct S { field: i32 }
fn foo(s: S) {
s.$0
}
"#,
r#"
struct S { field: i32 }
fn foo(s: S) {
s.field$0
}
"#,
);
}
}
The $0 marker indicates cursor position before (trigger point) and after (expected position) completion.
Testing Diagnostics
#![allow(unused)]
fn main() {
#[test]
fn detects_type_mismatch() {
check_diagnostics(
r#"
fn foo() -> i32 {
"string"
// ^^^^^^^^ error: expected i32, found &str
}
"#,
);
}
}
Inline error annotations specify expected diagnostic messages at specific locations.
Testing Quick Fixes
#[test]
fn adds_missing_import() {
check_fix(
r#"
//- /lib.rs crate:foo
pub struct Foo;
//- /main.rs crate:main deps:foo
fn main() {
Foo$0 {};
}
"#,
r#"
use foo::Foo;
fn main() {
Foo {};
}
"#,
);
}
Tests that the quick fix correctly adds the missing import.
Test Organization
rust-analyzer organizes tests close to implementation code:
crates/
├── hir-ty/
│ └── src/
│ ├── infer.rs # Type inference logic
│ └── tests/
│ ├── simple.rs # Basic inference tests
│ ├── traits.rs # Trait-related tests
│ └── coercion.rs # Type coercion tests
├── ide/
│ └── src/
│ ├── completion.rs # Completion implementation
│ └── completion/
│ └── tests.rs # Completion tests
This co-location makes it easy to find relevant tests when modifying code.
Recap
rust-analyzer’s testing philosophy emphasizes:
- Data-driven testing: Tests defined by input/output pairs, resilient to refactoring.
- Fixture-based isolation: Each test builds a complete in-memory project.
- Snapshot automation: `UPDATE_EXPECT=1` eliminates manual expectation maintenance.
- Minimal reproduction: Tests include only what’s necessary to trigger behavior.
- Co-located tests: Tests live near implementation for easy discovery.
libsyntax
- Related RFC for `libsyntax-2.0`: Link.
  - Discusses design choices for the syntax tree structure and parser.
Salsa
Salsa: Incremental Computation Framework
Official site: Link.
What is Salsa?
A Rust framework for writing incremental, on-demand programs:
- Adapts to input changes while maintaining up-to-date outputs.
- Originally developed from incremental recompilation techniques in `rustc`.
- Tracks dependencies and memoizes results for efficient recomputation.
- Primary use case: Compilers and language tooling responding to source code modifications without full recompilation.
Core Concepts
Queries
The fundamental mechanism in Salsa:
- Functions whose results are cached (memoized) and tracked for dependencies.
Two types of queries:
- Input queries: Values provided from outside the system. When these change, dependent computations are invalidated.
- Derived queries: Computed from other queries (inputs or derived). Results are cached and recomputed only when dependencies change.
Database
The central runtime that manages query execution, dependency tracking, and memoization. All queries execute through the database, which coordinates:
- Active function tracking.
- Dependency recording.
- Cache invalidation.
- Result storage.
Tracked Structures
Salsa manages tracked and interned structs:
- Maintain state throughout computations.
- Integrate with the dependency tracking system.
- Enable Salsa to know when they change and which computations depend on them.
How It Works
Dependency Tracking
When a query executes:
- Salsa automatically records which other queries it reads.
- Creates a dependency graph.
When an input changes:
- Salsa marks affected queries as “potentially stale”.
- On next access, performs a “maybe changed after” check.
- Only recomputes if dependencies actually changed.
- Reuses cached results when possible.
Memoization
Query results are cached with their dependencies. Salsa stores:
- The computed result.
- Which queries were read during computation.
- The “revision” when the computation occurred.
Incremental Updates
The framework uses a revision counter that increments when inputs change.
For each cached result, Salsa can determine:
- Was this result computed before or after the last change?
- Have any of its dependencies changed since computation?
- Can we reuse the cached value?
Result: Efficient incremental updates where only affected computations run.
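The revision mechanism can be sketched with a toy database (my own simplification, not Salsa’s API): each cached result remembers the revision it was computed at, and any input change bumps a global counter.

```rust
// Toy revision-based memoization in the spirit of Salsa.
struct Db {
    revision: u64,
    input: String,
    // (revision the result was computed at, cached result)
    word_count_cache: Option<(u64, usize)>,
}

impl Db {
    // "Input query": setting it bumps the global revision counter.
    fn set_input(&mut self, text: &str) {
        self.revision += 1;
        self.input = text.to_string();
    }

    // "Derived query": word count of the input, memoized per revision.
    fn word_count(&mut self) -> usize {
        if let Some((rev, cached)) = self.word_count_cache {
            if rev == self.revision {
                return cached; // cache still valid: nothing changed since
            }
        }
        let result = self.input.split_whitespace().count();
        self.word_count_cache = Some((self.revision, result));
        result
    }
}

fn main() {
    let mut db = Db { revision: 0, input: String::new(), word_count_cache: None };
    db.set_input("hello incremental world");
    assert_eq!(db.word_count(), 3);
    assert_eq!(db.word_count(), 3); // second call served from the cache
    db.set_input("hello"); // revision bump invalidates the cache
    assert_eq!(db.word_count(), 1); // recomputed at the new revision
}
```

Real Salsa goes much further (per-dependency “maybe changed after” checks instead of a single global counter), but the revision bookkeeping is the same basic idea.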
Cycle Handling
When circular dependencies occur (Query A depends on B, B depends on A):
- Salsa detects the cycle.
- Provides mechanisms to handle it.
- Prevents infinite loops.
Example Workflow
#![allow(unused)]
fn main() {
// Input query - set from outside
#[salsa::input]
struct SourceText {
#[return_ref]
text: String,
}
// Derived query - computed from inputs
#[salsa::tracked]
fn parse(db: &dyn Db, input: SourceText) -> Ast {
// Parse the text
// Salsa automatically tracks that this depends on `input.text()`
}
#[salsa::tracked]
fn check_types(db: &dyn Db, input: SourceText) -> TypeCheckResult {
let ast = parse(db, input); // Depends on parse
// Type check the ast
}
// Later, when input changes:
// 1. Set new input: input.set_text(db).to(new_text)
// 2. Access check_types(db, input)
// 3. Salsa sees parse() needs recompute (input changed)
// 4. Salsa recomputes parse(), then check_types()
// 5. If parse() result unchanged, check_types() reuses cache
}
Key Features
On-Demand Evaluation
- Queries computed lazily.
- If no one asks for a result, it doesn’t get computed.
- Contrasts with systems that eagerly recompute everything after changes.
Durability
- Queries can be marked as “durable” if their inputs rarely change.
- Salsa uses durability levels to optimize invalidation checks.
LRU Caching
- Memoized results can be evicted using LRU (Least Recently Used) strategy.
- Manages memory by evicting old results.
- Results recomputed if accessed after eviction.
Parallel Execution
- Multiple queries execute concurrently.
- Salsa coordinates access and handles race conditions through its runtime.
Cancellation
- Long-running queries cancelled if inputs change during execution.
- Avoids wasted computation.
Runtime Architecture
The Salsa runtime manages:
- Active function tracking: Which queries are currently executing.
- Revision management: Monotonic counter for tracking change epochs.
- Dependency graph: Records which queries depend on which.
- Memoization storage: Cached results indexed by query + inputs.
- Worker coordination: Handles concurrent query execution.
Advantages
- Automatic dependency tracking: No manual specification of what depends on what.
- Fine-grained invalidation: Only recompute what actually changed.
- Transparent caching: Memoization handled by framework, not user code.
- Parallel-friendly: Built-in support for concurrent execution.
- Memory efficient: LRU eviction prevents unbounded cache growth.
Integration with rust-analyzer
rust-analyzer uses Salsa in its base-db crate for incremental compilation.
Example uses:
- Input queries: File contents, crate graph configuration.
- Derived queries: Parse trees, name resolution, type inference.
Benefits:
- When a file changes, only affected analysis passes rerun.
- Editing one function: type checking other functions reuses cached results.
- Responsive even on large codebases.
- Most edits trigger minimal recomputation.
Comparison: Reactive vs Traditional
Traditional approach:
File change → Reparse entire project → Recheck all types → Update all diagnostics
Salsa approach:
File change → Mark queries as potentially stale
User requests diagnostics → Salsa recomputes only affected queries → Return result
The Salsa approach is faster because:
- Only computes what’s actually needed (on-demand).
- Reuses unchanged intermediate results (memoization).
- Granular invalidation (file-level, not project-level).
Learning Resources
- Tutorial: calc language: End-to-end example building a compiler/interpreter
- Plumbing documentation: Internal mechanisms and algorithms
- Community: Zulip instance for discussions
Summary
Salsa provides incremental computation through:
- Query-based architecture: All computations are queries that depend on other queries.
- Automatic dependency tracking: Framework records dependencies during execution.
- Smart invalidation: Only recomputes when dependencies actually change.
- Transparent memoization: Caches results without explicit cache management.
- On-demand evaluation: Computes only what’s requested, when requested.
Result:
- Responsive compilers and language servers.
- Efficient change handling.
- Real-time feedback practical even for large codebases.
Research Summary
Operations
- Study established projects to inherit their mature operational workflows [1].
- Convert recurring code review feedback into a permanent “living style guide” that scales mentorship and prevents repetitive corrections [1].
- Structure commit metadata (titles and messages) to automate downstream documentation like changelogs, reducing manual release overhead [1].
Code Evolution
- Classify changes by their impact scope (internal vs. API/dependency) to apply proportional review scrutiny [1].
- Internalize trivial utilities (“micro-dependencies”) to reduce supply chain attack surface and compilation bloat [1].
- Prefer parsing data into types that guarantee validity over simple validation checks, which eliminates “boolean blindness” where the system loses the proof of validity after the check returns `true` [1].
- Restrict heavy generics at system boundaries to prevent “monomorphization bloat,” trading minor runtime overhead for significant compile-time gains [1].
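The parse-don’t-validate point can be sketched in Rust with a hypothetical `Identifier` newtype for a DBML-like language: the only way to obtain the value is through the check, so the proof of validity travels with the type instead of evaporating after a boolean returns:

```rust
/// A validated identifier. The field is private, so the `parse`
/// constructor is the only way to obtain one.
#[derive(Debug, PartialEq)]
struct Identifier(String);

impl Identifier {
    /// Parses rather than validates: on success the caller holds a value
    /// whose type *guarantees* the invariant, not just a `true`.
    fn parse(s: &str) -> Result<Identifier, String> {
        let starts_ok = s
            .chars()
            .next()
            .map_or(false, |c| c.is_alphabetic() || c == '_');
        if starts_ok && s.chars().all(|c| c.is_alphanumeric() || c == '_') {
            Ok(Identifier(s.to_string()))
        } else {
            Err(format!("invalid identifier: {s:?}"))
        }
    }

    fn as_str(&self) -> &str {
        &self.0
    }
}

/// Downstream code accepts `Identifier`, not `&str` plus a prior bool
/// check, so it cannot be called with unvalidated input.
fn qualify(table: &Identifier, column: &Identifier) -> String {
    format!("{}.{}", table.as_str(), column.as_str())
}
```

`qualify` never needs to re-check its arguments; the type system carries the evidence forward.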
Compiler Architecture
rust-analyzer’s high-level architecture, from low-level to higher-level [1]:
- `parser` crate: A recursive descent parser that is tree-agnostic and event-based.
- `syntax` crate: Syntax tree structure and parser that exposes a higher-level API for the code’s syntactic structure.
- `base-db` crate: Integration with `salsa`, allowing for incremental and on-demand computation.
- `hir-xxx` crates: Program analysis phases that explicitly integrate incremental computation.
- `hir` crate: A high-level API that provides a static, inert, and fully resolved view of the code.
- `ide` crate: Provides high-level IDE features on top of the `hir` semantic model, speaking IDE languages such as texts and offsets instead of syntax nodes.
- `rust-analyzer`: The language server that knows about the LSP and JSON protocols.
Token & Node
- Parser tokens have tags and their corresponding source text, while parser nodes have tags and source length, with children placed in a homogeneous vector [2].
- Tokens and nodes share the same `SyntaxKind` enum and are not as clearly distinguished as in `@dbml/parse`: tokens function as leaf nodes while nodes function as interior nodes [2].
- The explicit-nodes approach treats whitespace and comments as sibling nodes: rust-analyzer handles trivia and error nodes uniformly (everything is a node), while some parsers like `@dbml/parse` attach trivia to semantic parents and others use hybrid approaches [2].
- For context-sensitive keywords like `union` and `default`, the parser checks the actual text via `TokenSource::is_keyword()` rather than relying solely on token kind [2].
- An intermediary layer using the `TokenSource` and `TreeSink` traits merges tokens based on context, such as combining `>` + `>` into `>>` [2].
- Nodes store only `text_len` (not absolute offset) while tokens store text; `SyntaxNode` computes the offset on demand from its parent, which enables incremental parsing without position invalidation [2].
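The `text_len`/on-demand-offset idea can be sketched as follows (a simplified illustration without interning, tagged pointers, or `Arc`-sharing):

```rust
/// "Green" layer: nodes know only their length, never their position.
enum Green {
    Token { len: usize },
    Node { len: usize, children: Vec<Green> },
}

impl Green {
    fn len(&self) -> usize {
        match self {
            Green::Token { len } => *len,
            Green::Node { len, .. } => *len,
        }
    }
}

/// "Red" view of one level: absolute child offsets are *derived* during
/// traversal from the parent's offset plus preceding siblings' lengths.
/// Because no offset is stored, an edit elsewhere in the file never
/// invalidates positions inside an unchanged subtree.
fn child_offsets(node: &Green, node_offset: usize) -> Vec<usize> {
    let mut offsets = Vec::new();
    if let Green::Node { children, .. } = node {
        let mut acc = node_offset;
        for child in children {
            offsets.push(acc); // offset computed, not stored
            acc += child.len();
        }
    }
    offsets
}
```

The same subtree reports different absolute offsets depending on where traversal starts, which is exactly what makes subtree sharing after edits cheap.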
Parsing
- The three-layer tree architecture separates concerns effectively [2]:
  - GreenNode (storage): an immutable, persistent layer with optimizations including DST (single heap allocation), tagged pointers, token interning, and `Arc`-sharing, storing `text_len` rather than offset.
  - SyntaxNode (cursor/RedNode): adds parent pointers and on-demand position computation; memory scales with traversal depth rather than tree size, and nodes are transient (rebuilt from GreenNode when needed).
  - AST (typed API): auto-generated typed wrappers like `FnDef` and `ParamList` around `SyntaxNode` that provide ergonomic access to specific constructs.
- Everything is preserved, including whitespace, comments, and invalid tokens; invalid input gets wrapped in `ERROR` nodes, and the original text can be reconstructed by concatenating token text [2].
- Errors live in a separate `Vec<SyntaxError>` rather than being embedded in the tree, which enables manual tree construction without error-state management and produces parser output as `(green_node, errors)` [2].
- Resilient parsing combines multiple strategies [2]:
  - The algorithm uses recursive descent with Pratt parsing for expressions and is intentionally permissive, accepting invalid constructs that are validated later.
  - The event-based architecture has the parser emit abstract events via the `TokenSource` (input) and `TreeSink` (output) traits, keeping the parser agnostic to tree structure.
  - Error recovery employs panic mode (skipping to synchronization points like `}` or `;`), inserts implicit closing braces, performs early block termination, and wraps malformed content in `ERROR` nodes.
- Incremental parsing uses a sophisticated approach [2]:
  - The Red-Green model separates Green (immutable storage) from Red (cursors with positions); this separation enables cheap tree patches by swapping GreenNode pointers.
  - The block heuristic reparses only the smallest `{}` block containing the edit, which works because the parser maintains structurally balanced braces through implicit `}` insertion and `ERROR` wrapping for extras.
  - Pragmatically, incremental reparsing is often not worth the complexity since a full reparse is fast and simpler, though the architecture remains valuable for cheap tree edits and subtree sharing.
- Error messages use a layered approach where a permissive parser is followed by a separate validation pass for “soft” errors, allowing the parser to focus on structural recovery while validation uses semantic context for detailed diagnostics [2].
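The event-based split can be sketched with a simplified sink trait. The names mirror rust-analyzer’s `TreeSink`, but the code is an illustration, not its real API:

```rust
/// The parser emits abstract events and never touches a concrete tree.
trait TreeSink {
    fn start_node(&mut self, kind: &'static str);
    fn token(&mut self, text: &str);
    fn finish_node(&mut self);
}

/// One sink implementation renders the tree as an s-expression; another
/// sink could build green nodes from the very same event stream.
struct SExprSink {
    out: String,
}

impl TreeSink for SExprSink {
    fn start_node(&mut self, kind: &'static str) {
        self.out.push_str(&format!("({kind}"));
    }
    fn token(&mut self, text: &str) {
        self.out.push_str(&format!(" {text:?}"));
    }
    fn finish_node(&mut self) {
        self.out.push(')');
    }
}

/// A toy tree-agnostic "parser" for `key: value` input. On malformed
/// input it stays resilient: the content is wrapped in an ERROR node
/// rather than aborting.
fn parse_pair(input: &str, sink: &mut dyn TreeSink) {
    sink.start_node("PAIR");
    if let Some((key, value)) = input.split_once(':') {
        sink.token(key.trim());
        sink.token(value.trim());
    } else {
        sink.start_node("ERROR");
        sink.token(input.trim());
        sink.finish_node();
    }
    sink.finish_node();
}
```

Because `parse_pair` only sees the trait, swapping the output format (s-expression, green tree, token dump) requires no change to the parsing logic.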
Design Choices & General Architectures
- Avoid blind serialization of internal types, which implicitly couples public clients to private implementation details [1].
- Enforce opaque API boundaries (such as between analysis and IDE layers) to enable radical internal refactoring without breaking consumers [1].
- Codify architectural laws (like “core layers are I/O free”) to permanently guarantee non-functional requirements such as speed and deterministic testing [1].
- Order function arguments by stability (context → data) to align code structure with mental models (“setting” → “actors”) and reduce cognitive load during scanning [1].
- Use distinct types to segregate unverified external input (like “dirty” OS strings) from validated internal data, preventing logic errors from crossing trust boundaries [1].
- Enforce invariants via “construction & retrieval” (private fields with public getters) rather than “mutation” (setters), ensuring objects never enter invalid states [1].
- Encode assumptions into the type system (such as non-nullable types) to force callers to handle edge cases explicitly, preserving context at the call site [1].
- Encapsulate complex execution arguments into temporary structs to support multiple execution modes without duplicating function signatures [1].
- Split functions with boolean flags (like `do(true)`) into distinct named functions, which adheres to the Single Responsibility Principle and prevents unrelated logic paths from coupling [1].
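The “construction & retrieval” point above can be sketched with a hypothetical `Table` type: the invariant is established once in the constructor, and with no setter and a private field there is no mutation path that could break it later:

```rust
/// Invariant: a table name is never empty.
pub struct Table {
    name: String, // private: no external code can invalidate it
}

impl Table {
    /// Construction: the invariant is checked exactly once, here.
    pub fn new(name: &str) -> Result<Table, &'static str> {
        if name.is_empty() {
            Err("table name must not be empty")
        } else {
            Ok(Table { name: name.to_string() })
        }
    }

    /// Retrieval: callers can read the value but never corrupt it.
    pub fn name(&self) -> &str {
        &self.name
    }
}
```

Every function receiving a `Table` can rely on the invariant without re-checking, which is the same payoff as the parse-don’t-validate pattern.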
Implementation Patterns
- Prioritize imperative clarity over functional brevity: code should maximize “work per line” rather than minimizing line count via complex indirection [1].
- Use spatial operators (like `<` or `<=`) that map intuitively to the mental number line (0 → ∞), avoiding the mental effort required to “flip” comparisons [1].
- Prefer syntax that supports left-to-right reading (such as explicit type ascription), which reduces the “context window” required to understand a statement by declaring intent up front [1].
- Use blocks `{ ... }` to isolate temporary state, preventing variable pollution while retaining access to the parent context [1].
- Push resource allocation (memory and I/O) up to the call site (like passing a buffer in) to make performance costs visible and controllable by the caller [1].
- Use explicit namespaces or qualifiers to visually reinforce layer boundaries in code, such as distinguishing `ast::Node` from `hir::Node` [1].
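The buffer-passing point above, sketched with a hypothetical `render_column` helper: the callee appends into a caller-owned buffer instead of allocating a fresh `String` per call, so the allocation cost is both visible and amortizable at the call site:

```rust
/// Appends one rendered column definition to a caller-provided buffer.
/// The function itself performs no allocation of its own.
fn render_column(name: &str, ty: &str, out: &mut String) {
    out.push_str(name);
    out.push(' ');
    out.push_str(ty);
    out.push('\n');
}
```

A caller rendering many columns allocates once and reuses the buffer:

```rust
let mut buf = String::new(); // one allocation, reused across calls
render_column("id", "int", &mut buf);
render_column("name", "varchar", &mut buf);
// buf now holds "id int\nname varchar\n"
```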
Testing
- Tests are defined as input/output pairs where the framework compares actual results against expected output, and expectations are updated when behavior changes intentionally [3].
- Define tests via data (input and output) rather than API calls, which makes tests survive refactoring [1].
- Each test simulates a complete environment (multi-file, multi-crate) in memory without shared state [3].
- Failing tests can auto-update their expected output via a flag, which eliminates manual maintenance [3].
- Minimize test cases to the smallest input that reproduces the behavior, resulting in less noise and faster debugging [1].
- Place tests near their implementation to enable easy discovery during development [3].
- Never ignore test failures; instead, assert the incorrect behavior with a FIXME comment to keep the failure visible [1].
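The auto-updating expectation pattern can be sketched as follows. Real frameworks (like rust-analyzer’s expect-test) store expectations in files and toggle update mode with an environment flag such as `UPDATE_EXPECT=1`; this sketch uses a `bool` and an in-memory `String` for brevity:

```rust
/// Compares actual output against the stored expectation, or rewrites
/// the expectation when the behavior change is intentional.
fn check(actual: &str, expected: &mut String, update: bool) {
    if update {
        // Intentional change: record the new output as the expectation.
        expected.clear();
        expected.push_str(actual);
    } else {
        assert_eq!(actual, expected.as_str(), "output diverged from expectation");
    }
}
```

This is what makes snapshot tests cheap to maintain: when a change is deliberate, one flag rewrites every affected expectation instead of each being edited by hand.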
Documentation
Enforce full-sentence comments (starting with a capital letter and ending with a period) to psychologically shift the author from “note-taking” (describing what) to “explanation” (describing why and providing context) [1].
References
1. rust-analyzer high-level architecture and conventions
2. rust-analyzer syntax tree and parser
3. rust-analyzer testing
Glossary
Technical terms and concepts used throughout the dboxide project documentation.
API Boundary
An explicit interface layer that separates different components of a system. By making boundaries opaque (hiding implementation details), you can refactor internal code without breaking external consumers. Example: the separation between the hir analysis layer and the ide layer in rust-analyzer.
AST
Abstract Syntax Tree. A typed, high-level representation of source code structure that provides ergonomic access to syntactic constructs. In rust-analyzer, AST nodes like FnDef and ParamList are auto-generated wrappers around the lower-level SyntaxNode, offering convenient methods for traversing specific language constructs.
Block Heuristic
An incremental parsing optimization that reparses only the smallest {} block containing an edit. This works because parsers can maintain structurally balanced braces even in broken code by inserting implicit closing braces and wrapping errors. Often not worth the complexity since full reparse is typically fast enough.
Boolean Blindness
A situation where validation returns a simple boolean (true/false), causing the system to lose the proof of validity after the check. Instead of returning bool, prefer parsing into types that guarantee validity at the type level, preserving the validation evidence throughout the program.
Actually, when writing this, I figured this is related to proof theory’s distinction between Admissibility and Derivability (I learned about these in PFPL - see my type-theory repo). Returning bool is like admissibility (asserting “this is valid” without proof), while returning a validated type is like derivability (carrying constructive evidence). When you return bool, you lose the derivation - just like admissibility is a weaker assertion than derivability because it lacks the constructive proof tree.
DST
Dynamically Sized Type. A Rust type whose size is not known at compile time. Used in rust-analyzer’s GreenNode implementation to store node data and all children in a single heap allocation: [RefCount | Kind | TextLen | N_Children | Child_1 | ... | Child_N].
Event-Based Parser
A parser architecture where the parser emits abstract events (like start_node, token, finish_node) through trait interfaces (TokenSource for input, TreeSink for output) rather than directly constructing tree structures. This decouples the parsing algorithm from tree representation, enabling the same parser to work with different input sources and output formats.
GreenNode
The immutable storage layer in the Red-Green tree architecture. Stores the actual syntax tree data in a space-efficient, persistent structure that can be shared across threads via Arc. Contains only kind: SyntaxKind, text_len: usize, and children, with no position information or parent pointers. Optimizations include DST layout, tagged pointers, and token interning.
HIR
High-level Intermediate Representation. A fully resolved, semantic view of the code that abstracts away syntactic details. In rust-analyzer, the hir crate provides a static API for querying resolved types, names, and semantic relationships, built on top of the syntax tree and incremental analysis passes.
Incremental Parsing
A technique that reparses only the portions of a file affected by edits rather than reparsing the entire file. Enabled by immutable tree structures (GreenNode) that can be efficiently patched by swapping subtree pointers. Often not worth the implementation complexity as modern parsers are fast enough for full reparse.
Interning
An optimization technique that stores only one copy of duplicate data. In rust-analyzer, identical tokens (like multiple occurrences of 1 in 1 + 1) share the same allocated token, reducing memory usage.
Lossless Syntax Tree
A tree representation that preserves everything from the source code including whitespace, comments, and invalid tokens. Enables perfect reconstruction of the original source text by concatenating token text. Invalid input gets wrapped in ERROR nodes rather than being discarded.
LSP
Language Server Protocol. A standardized protocol for communication between editors and language servers, enabling features like auto-completion, go-to-definition, and diagnostics. Allows one language server implementation to work with multiple editors.
Monomorphization
The process where Rust’s compiler generates separate copies of generic functions for each concrete type used. Can lead to “monomorphization bloat” where excessive use of generics at system boundaries causes long compile times. Mitigated by restricting generics or using dynamic dispatch at boundaries.
Panic Mode
An error recovery technique where the parser skips tokens until it finds a synchronization point (like } or ;) after encountering a syntax error. Allows the parser to continue and find additional errors rather than stopping at the first one.
Pratt Parsing
An elegant technique for parsing expressions with operator precedence without building complex precedence tables. Handles infix, prefix, and postfix operators uniformly. Used by rust-analyzer for expression parsing alongside recursive descent for other constructs.
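A minimal sketch of the technique over single-digit integers, `+`, `*`, and prefix `-`, evaluating while it parses. Each operator carries a binding power, and the loop consumes operators only while their power meets the current minimum, which yields correct precedence without one grammar rule per level:

```rust
/// Evaluates an expression like "1 + 2 * 3" using Pratt parsing.
fn eval(input: &str) -> i64 {
    let tokens: Vec<char> = input.chars().filter(|c| !c.is_whitespace()).collect();
    let mut pos = 0;
    expr(&tokens, &mut pos, 0)
}

fn expr(tokens: &[char], pos: &mut usize, min_bp: u8) -> i64 {
    // Prefix position: a digit or unary minus.
    let mut lhs = match tokens[*pos] {
        '-' => {
            *pos += 1;
            -expr(tokens, pos, 5) // prefix minus binds tighter than + and *
        }
        c => {
            *pos += 1;
            c.to_digit(10).expect("expected a digit") as i64
        }
    };
    // Infix loop: continue while the next operator binds tightly enough.
    while *pos < tokens.len() {
        let (bp, op) = match tokens[*pos] {
            '+' => (1, '+'),
            '*' => (3, '*'),
            _ => break,
        };
        if bp < min_bp {
            break; // operator belongs to an enclosing call
        }
        *pos += 1;
        let rhs = expr(tokens, pos, bp + 1); // +1 makes operators left-associative
        lhs = match op {
            '+' => lhs + rhs,
            _ => lhs * rhs,
        };
    }
    lhs
}
```

For example, `eval("1 + 2 * 3")` returns `7`: the `*` (binding power 3) captures `2 * 3` before control returns to the weaker `+` (binding power 1).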
Red-Green Tree
A tree architecture that separates immutable storage (Green layer) from navigation and position information (Red layer). GreenNodes store actual data and can be shared across threads. SyntaxNodes (Red layer) are lightweight cursors that add parent pointers and compute positions on-demand. This separation enables efficient tree edits and incremental parsing.
Recursive Descent Parser
A top-down parsing technique where each grammar rule maps to a function that calls other parsing functions recursively. Makes the parser code easy to understand and debug since it directly mirrors the grammar structure. Used by rust-analyzer for most syntactic constructs.
Resilient Parsing
A parsing approach that always produces a tree even from syntactically invalid code. Employs error recovery techniques (panic mode, implicit token insertion, early termination) to maintain tree structure. Invalid sections get wrapped in ERROR nodes. Enables IDE features like completion to work even in broken code.
Salsa
An incremental computation framework that enables on-demand, cached evaluation of derived data. When inputs change, Salsa tracks dependencies and recomputes only what’s affected. Used in rust-analyzer’s base-db crate to make analysis incremental and efficient.
Snapshot Testing
A testing approach where tests are defined as input/output pairs. The framework compares actual output against expected output stored in test files. When behavior changes intentionally, a flag (like UPDATE_EXPECT=1) auto-updates expectations, eliminating manual maintenance.
TokenSource
A trait interface that defines how parsers read input tokens. Provides methods like current(), lookahead_nth(), and bump(). Abstracts over different token sources (source files, macro expansions, synthetic tokens) so the same parser can handle all inputs.
Tree-Agnostic Parser
A parser that emits abstract parsing events without knowledge of the final tree structure. By working through trait interfaces (TokenSource/TreeSink), the parser remains decoupled from tree representation, enabling the same parsing logic to produce different output formats.
TreeSink
A trait interface that defines how parsers write output. Receives events like start_node(), token(), finish_node(), and error(). The implementation decides how to build the actual tree structure. This abstraction enables different tree formats from the same parser.
Trivia
Non-semantic tokens like whitespace, comments, and formatting characters. Different parsers handle trivia differently: explicit nodes (trivia as sibling nodes), attached trivia (stored in leading/trailing properties of tokens), or linked lists (separate doubly-linked list alongside the tree). rust-analyzer uses explicit nodes for uniformity and losslessness.