Research Summary
Operations
- Study established projects to inherit their mature operational workflows.[1]
- Convert recurring code review feedback into a permanent “living style guide” that scales mentorship and prevents repetitive corrections.[1]
- Structure commit metadata (titles and messages) to automate downstream documentation like changelogs, which reduces manual release overhead.[1]
Code Evolution
- Classify changes by their impact scope (internal vs. API/dependency) to apply proportional review scrutiny.[1]
- Internalize trivial utilities (“micro-dependencies”) to reduce supply-chain attack surface and compilation bloat.[1]
- Prefer parsing data into types that guarantee validity over simple validation checks, which eliminates “boolean blindness,” where the system loses the proof of validity after the check returns `true`.[1]
- Restrict heavy generics at system boundaries to prevent “monomorphization bloat,” trading minor runtime overhead for significant compile-time gains.[1]
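The “parse, don’t validate” point can be sketched as follows. This is an illustrative example; the `Email` type and its methods are invented for the sketch, not taken from the cited material:

```rust
/// A string proven (at construction time) to contain exactly one '@'.
/// The rule is deliberately simplistic; the point is the pattern.
#[derive(Debug, Clone, PartialEq)]
pub struct Email(String);

impl Email {
    /// Parse, don't validate: on success the proof of validity is carried
    /// by the `Email` type itself, so downstream code cannot "forget"
    /// that the check happened.
    pub fn parse(raw: &str) -> Result<Email, String> {
        if raw.chars().filter(|&c| c == '@').count() == 1 {
            Ok(Email(raw.to_string()))
        } else {
            Err(format!("not a valid email: {raw:?}"))
        }
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

// Contrast: `fn is_valid_email(&str) -> bool` returns `true` and then the
// proof is gone; every later function must either trust or re-check.
fn send_welcome(to: &Email) -> String {
    // No validation needed here: the type guarantees it already happened.
    format!("welcome sent to {}", to.as_str())
}

fn main() {
    let email = Email::parse("user@example.com").expect("valid");
    println!("{}", send_welcome(&email));
    assert!(Email::parse("not-an-email").is_err());
}
```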
Compiler Architecture
rust-analyzer’s high-level architecture, from low level to high level:[1]
- `parser` crate: a recursive-descent parser that is tree-agnostic and event-based.
- `syntax` crate: the syntax tree structure and a parser wrapper that exposes a higher-level API for the code’s syntactic structure.
- `base-db` crate: integration with `salsa`, allowing for incremental and on-demand computation.
- `hir-xxx` crates: program analysis phases that explicitly integrate incremental computation.
- `hir` crate: a high-level API that provides a static, inert, and fully resolved view of the code.
- `ide` crate: high-level IDE features on top of the `hir` semantic model, speaking IDE languages such as text and offsets instead of syntax nodes.
- `rust-analyzer` crate: the language server that knows about the LSP and JSON protocols.
Token & Node
- Parser tokens have tags and their corresponding source text, while parser nodes have tags and a source length, with child nodes placed in a homogeneous vector.[2]
- Tokens and nodes share the same `SyntaxKind` enum and are not as clearly distinguished as in `@dbml/parse`. Tokens function as leaf nodes while nodes function as interior nodes.[2]
- The explicit-nodes approach treats whitespace and comments as sibling nodes: `rust-analyzer` handles trivia and error nodes uniformly (everything is a node), while some parsers like `@dbml/parse` attach trivia to semantic parents, and others use hybrid approaches.[2]
- For context-sensitive keywords like `union` and `default`, the parser checks the actual text via `TokenSource::is_keyword()` rather than relying solely on the token kind.[2]
- An intermediary layer using the `TokenSource` and `TreeSink` traits merges tokens based on context, such as combining `>` + `>` into `>>`.[2]
- Nodes store only `text_len` (not an absolute offset) while tokens store text; `SyntaxNode` computes offsets on demand from its parent, which enables incremental parsing without position invalidation.[2]
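A minimal sketch of the length-only storage idea, using invented `Green` types rather than rust-analyzer’s real rowan API: nodes know only their `text_len`, and absolute offsets are derived on demand by summing the lengths of preceding siblings:

```rust
use std::sync::Arc;

// Illustrative types, not rowan's real API: only tokens own text;
// a node's length is derived from its children.
#[derive(Debug)]
enum Green {
    Token { text: String },
    Node { children: Vec<Arc<Green>> },
}

impl Green {
    fn text_len(&self) -> usize {
        match self {
            Green::Token { text } => text.len(),
            Green::Node { children } => children.iter().map(|c| c.text_len()).sum(),
        }
    }
}

/// Compute the absolute offset of `children[index]` on demand, given the
/// parent's own offset. Because no absolute positions are stored, editing
/// one subtree never invalidates positions elsewhere in the tree.
fn child_offset(parent_offset: usize, children: &[Arc<Green>], index: usize) -> usize {
    parent_offset + children[..index].iter().map(|c| c.text_len()).sum::<usize>()
}

fn main() {
    let a = Arc::new(Green::Token { text: "fn ".into() });
    let b = Arc::new(Green::Token { text: "main".into() });
    let node = Green::Node { children: vec![a, b] };
    assert_eq!(node.text_len(), 7);
    if let Green::Node { children } = &node {
        assert_eq!(child_offset(0, children, 1), 3); // "fn " has length 3
    }
}
```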
Parsing
- The three-layer tree architecture separates concerns effectively:[2]
  - `GreenNode` (storage): an immutable, persistent layer with optimizations including DSTs (a single heap allocation), tagged pointers, token interning, and `Arc` sharing; it stores `text_len` rather than offsets.
  - `SyntaxNode` (cursor, or “red node”): adds parent pointers and on-demand position computation, so memory scales with traversal depth rather than tree size; nodes are transient and rebuilt from the `GreenNode` when needed.
  - AST (typed API): auto-generated typed wrappers like `FnDef` and `ParamList` around `SyntaxNode` that provide ergonomic access to specific constructs.
- Everything is preserved, including whitespace, comments, and invalid tokens: invalid input gets wrapped in `ERROR` nodes, and the original text can be reconstructed by concatenating token text.[2]
- Errors live in a separate `Vec<SyntaxError>` rather than being embedded in the tree, which enables manual tree construction without error-state management and produces parser output as `(green_node, errors)`.[2]
- Resilient parsing combines multiple strategies:[2]
  - The algorithm uses recursive descent with Pratt parsing for expressions and is intentionally permissive, accepting invalid constructs that are validated later.
  - The event-based architecture has the parser emit abstract events via the `TokenSource` (input) and `TreeSink` (output) traits, keeping the parser agnostic to tree structure.
  - Error recovery employs panic mode (skipping to synchronization points like `}` or `;`), inserts implicit closing braces, performs early block termination, and wraps malformed content in `ERROR` nodes.
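The panic-mode part of this recovery can be sketched as follows. The token type and function names are illustrative, not rust-analyzer’s actual API: on unexpected input, skip forward to a synchronization token and hand back the skipped prefix, which would become an `ERROR` node:

```rust
// Hypothetical token set for the sketch.
#[derive(Debug, PartialEq)]
enum Tok {
    Ident(&'static str),
    Semi,   // ';'
    RBrace, // '}'
    Junk,   // anything the parser could not place
}

/// Panic-mode recovery: returns (skipped, rest). The skipped prefix would
/// be wrapped in an ERROR node; parsing resumes at the sync token.
fn recover(tokens: &[Tok]) -> (&[Tok], &[Tok]) {
    let sync = tokens
        .iter()
        .position(|t| matches!(t, Tok::Semi | Tok::RBrace))
        .unwrap_or(tokens.len());
    tokens.split_at(sync)
}

fn main() {
    let toks = [Tok::Junk, Tok::Junk, Tok::Semi, Tok::Ident("next")];
    let (skipped, rest) = recover(&toks);
    assert_eq!(skipped.len(), 2); // junk becomes an ERROR node
    assert_eq!(rest[0], Tok::Semi); // resume at the sync point
}
```

Because the malformed span is preserved (not discarded), the original text can still be reconstructed from the tree.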
- Incremental parsing uses a sophisticated approach:[2]
  - The red-green model separates green (immutable storage) from red (cursors with positions); this separation enables cheap tree patches by swapping `GreenNode` pointers.
  - The block heuristic reparses only the smallest `{}` block containing the edit, which works because the parser keeps braces structurally balanced through implicit `}` insertion and `ERROR` wrapping for extras.
  - Pragmatically, incremental reparsing is often not worth the complexity, since a full reparse is fast and simpler, though the architecture remains valuable for cheap tree edits and subtree sharing.
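The cheap-patch property can be illustrated with a toy green tree; the names here are hypothetical, not rowan’s real API. “Editing” rebuilds only the spine from the edited node to the root, while every untouched subtree is shared via `Arc`:

```rust
use std::sync::Arc;

// Illustrative green node: immutable, shared by reference counting.
#[derive(Debug)]
struct GreenNode {
    kind: &'static str,
    children: Vec<Arc<GreenNode>>,
}

/// Replace child `index` of `root`, producing a new root that shares
/// every untouched subtree with the old one.
fn with_child(root: &Arc<GreenNode>, index: usize, new_child: Arc<GreenNode>) -> Arc<GreenNode> {
    let mut children = root.children.clone(); // clones Arcs, not subtrees
    children[index] = new_child;
    Arc::new(GreenNode { kind: root.kind, children })
}

fn main() {
    let left = Arc::new(GreenNode { kind: "FN", children: vec![] });
    let right = Arc::new(GreenNode { kind: "STRUCT", children: vec![] });
    let root = Arc::new(GreenNode { kind: "FILE", children: vec![left, right] });

    let new_right = Arc::new(GreenNode { kind: "ENUM", children: vec![] });
    let new_root = with_child(&root, 1, new_right);

    // The untouched left subtree is literally the same allocation.
    assert!(Arc::ptr_eq(&root.children[0], &new_root.children[0]));
    assert_eq!(new_root.children[1].kind, "ENUM");
}
```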
- Error messages use a layered approach: a permissive parser is followed by a separate validation pass for “soft” errors, allowing the parser to focus on structural recovery while validation uses semantic context for detailed diagnostics.[2]
Design Choices & General Architectures
- Avoid blind serialization of internal types, which implicitly couples public clients to private implementation details.[1]
- Enforce opaque API boundaries (such as between analysis and IDE layers) to enable radical internal refactoring without breaking consumers.[1]
- Codify architectural laws (like “core layers are I/O-free”) to permanently guarantee non-functional requirements such as speed and deterministic testing.[1]
- Order function arguments by stability (context → data) to align code structure with mental models (“setting” → “actors”) and reduce cognitive load during scanning.[1]
- Use distinct types to segregate unverified external input (like “dirty” OS strings) from validated internal data, preventing logic errors from crossing trust boundaries.[1]
- Enforce invariants via “construction & retrieval” (private fields with public getters) rather than “mutation” (setters), ensuring objects never enter invalid states.[1]
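A minimal sketch of the construction-and-retrieval pattern, with an invented `Percentage` type: the field is private, the constructor enforces the invariant, and because there is no setter, no instance can ever hold an out-of-range value:

```rust
// Illustrative type: the 0..=100 invariant is checked exactly once,
// at construction, and can never be broken afterwards.
pub struct Percentage {
    value: u8, // private: only `new` can set it
}

impl Percentage {
    pub fn new(value: u8) -> Result<Self, String> {
        if value <= 100 {
            Ok(Percentage { value })
        } else {
            Err(format!("{value} is not a percentage"))
        }
    }

    /// Retrieval only; there is deliberately no `set_value`.
    pub fn value(&self) -> u8 {
        self.value
    }
}

fn main() {
    let p = Percentage::new(42).expect("in range");
    assert_eq!(p.value(), 42);
    assert!(Percentage::new(150).is_err());
}
```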
- Encode assumptions into the type system (such as non-nullable types) to force callers to handle edge cases explicitly, preserving context at the call site.[1]
- Encapsulate complex execution arguments into temporary structs to support multiple execution modes without duplicating function signatures.[1]
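The argument-struct idea can be sketched like this; `RunArgs` and `run` are hypothetical names. One entry point supports several execution modes, and new modes extend the struct instead of growing every call site’s parameter list:

```rust
// Illustrative argument bundle for a single entry point.
#[derive(Default)]
struct RunArgs<'a> {
    dry_run: bool,
    verbose: bool,
    filter: Option<&'a str>,
}

fn run(args: RunArgs) -> String {
    let mode = if args.dry_run { "dry-run" } else { "apply" };
    let scope = args.filter.unwrap_or("all");
    format!("{mode}:{scope}:verbose={}", args.verbose)
}

fn main() {
    // Callers name only the fields they care about; `Default` fills the rest.
    assert_eq!(run(RunArgs::default()), "apply:all:verbose=false");
    assert_eq!(
        run(RunArgs { dry_run: true, filter: Some("tests"), ..Default::default() }),
        "dry-run:tests:verbose=false"
    );
}
```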
- Split functions with boolean flags (like `do(true)`) into distinct named functions, which adheres to the single-responsibility principle and prevents unrelated logic paths from coupling.[1]
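A small sketch of the flag split, with invented function names: instead of a single `render(items, compact: bool)` that forces readers to decode `render(items, true)` at every call site, the two paths get their own names:

```rust
// Two named functions in place of one `render(items, compact: bool)`.
// Each has a single responsibility, and the call site reads as intent.
fn render_compact(items: &[&str]) -> String {
    items.join(",")
}

fn render_pretty(items: &[&str]) -> String {
    items.join(",\n")
}

fn main() {
    let items = ["a", "b"];
    assert_eq!(render_compact(&items), "a,b");
    assert_eq!(render_pretty(&items), "a,\nb");
}
```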
Implementation Patterns
- Prioritize imperative clarity over functional brevity: code should maximize “work per line” rather than minimize line count via complex indirections.[1]
- Use spatial operators (like `<` or `<=`) that map intuitively to the mental number line (0 → ∞), avoiding the mental effort required to “flip” comparisons.[1]
- Prefer syntax that supports left-to-right reading (such as explicit type ascription), which reduces the “context window” required to understand a statement by declaring intent up front.[1]
- Use blocks (`{ ... }`) to isolate temporary state, preventing variable pollution while retaining access to the parent context.[1]
- Push resource allocation (memory and I/O) up to the call site (like passing in a buffer) to make performance costs visible and controllable by the caller.[1]
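The buffer-passing point can be sketched as follows (illustrative names): the function appends into a caller-owned `String` instead of allocating and returning a fresh one, so the single allocation is visible at, and reusable by, the call site:

```rust
// The callee writes into a caller-provided buffer; it allocates nothing
// itself (beyond what `push_str` may grow in the caller's buffer).
fn write_greeting(name: &str, out: &mut String) {
    out.push_str("hello, ");
    out.push_str(name);
    out.push('\n');
}

fn main() {
    let mut buf = String::new(); // one allocation, owned and reused here
    for name in ["ana", "bo"] {
        write_greeting(name, &mut buf);
    }
    assert_eq!(buf, "hello, ana\nhello, bo\n");
}
```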
- Use explicit namespaces or qualifiers to visually reinforce layer boundaries in code, such as distinguishing `ast::Node` from `hir::Node`.[1]
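A toy sketch of the qualifier convention, with invented module contents: both layers define a `Node`, and call sites keep the layer prefix visible instead of importing the bare names:

```rust
// Two layers, each with its own `Node`; the qualifier marks the layer.
mod ast {
    pub struct Node {
        pub text: &'static str,
    }
}

mod hir {
    pub struct Node {
        pub name: &'static str,
    }
}

// The signature makes the layer crossing visible at a glance.
fn lower(syntax: &ast::Node) -> hir::Node {
    hir::Node { name: syntax.text }
}

fn main() {
    let n = lower(&ast::Node { text: "main" });
    assert_eq!(n.name, "main");
}
```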
Testing
- Tests are defined as input/output pairs where the framework compares actual results against expected output, and expectations are updated when behavior changes intentionally.[3]
- Define tests via data (input and output) rather than API calls, which makes tests survive refactoring.[1]
- Each test simulates a complete environment (multi-file, multi-crate) in memory without shared state.[3]
- Failing tests can auto-update their expected output via a flag, which eliminates manual maintenance.[3]
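The auto-update idea can be sketched in a few lines. This is in the spirit of snapshot-testing tools such as the `expect-test` crate but is not its real API: compare actual output against a stored expectation, and when an update flag is set, rewrite the expectation instead of failing:

```rust
/// Compare `actual` against the stored expectation. On mismatch, fail
/// unless `update` is set, in which case the expectation is rewritten.
/// (A real framework would persist the update back to the test file.)
fn check(actual: &str, expected: &mut String, update: bool) -> bool {
    if actual == expected {
        true
    } else if update {
        *expected = actual.to_string();
        true
    } else {
        false
    }
}

fn main() {
    let mut expected = String::from("1 + 1 = 2");
    assert!(check("1 + 1 = 2", &mut expected, false));

    // Behavior changed intentionally: with the update flag set, the
    // stored expectation is rewritten rather than the test failing.
    let mut stale = String::from("old output");
    assert!(check("new output", &mut stale, true));
    assert_eq!(stale, "new output");
}
```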
- Minimize test cases to the smallest input that reproduces the behavior, resulting in less noise and faster debugging.[1]
- Place tests near their implementation to enable easy discovery during development.[3]
- Never ignore test failures. Instead, assert the incorrect behavior with a FIXME comment to keep the failure visible.[1]
Documentation
Enforce full-sentence comments (starting with a capital letter and ending with a period) to psychologically shift the author from “note-taking” (describing what) to “explanation” (describing why and providing context).[1]