Research Summary
Operations
- Study established projects to inherit their mature operational workflows.[1]
- Convert recurring code review feedback into a permanent “living style guide” that scales mentorship and prevents repetitive corrections.[1]
- Structure commit metadata (titles and messages) to automate downstream documentation like changelogs, which reduces manual release overhead.[1]
Code Evolution
- Classify changes by their impact scope (internal vs. API/dependency) to apply proportional review scrutiny.[1]
- Internalize trivial utilities (“micro-dependencies”) to reduce supply-chain attack surface and compilation bloat.[1]
- Prefer parsing data into types that guarantee validity over simple validation checks, which eliminates “boolean blindness,” where the system loses the proof of validity after the check returns `true`.[1]
- Restrict heavy generics at system boundaries to prevent “monomorphization bloat,” trading minor runtime overhead for significant compile-time gains.[1]
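The “parse, don’t validate” point can be sketched as follows. This is an illustrative example; the `Email` type and its methods are invented for the sketch, not taken from the cited material:

```rust
/// A string proven (at construction time) to contain exactly one '@'.
/// The rule is deliberately simplistic; the point is the pattern.
#[derive(Debug, Clone, PartialEq)]
pub struct Email(String);

impl Email {
    /// Parse, don't validate: on success the proof of validity is carried
    /// by the `Email` type itself, so downstream code cannot "forget"
    /// that the check happened.
    pub fn parse(raw: &str) -> Result<Email, String> {
        if raw.chars().filter(|&c| c == '@').count() == 1 {
            Ok(Email(raw.to_string()))
        } else {
            Err(format!("not a valid email: {raw:?}"))
        }
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

// Contrast: `fn is_valid_email(&str) -> bool` returns `true` and then the
// proof is gone; every later function must either trust or re-check.
fn send_welcome(to: &Email) -> String {
    // No validation needed here: the type guarantees it already happened.
    format!("welcome sent to {}", to.as_str())
}

fn main() {
    let email = Email::parse("user@example.com").expect("valid");
    println!("{}", send_welcome(&email));
    assert!(Email::parse("not-an-email").is_err());
}
```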
Compiler Architecture
rust-analyzer’s high-level architecture, from low level to high level:[1]
- `parser` crate: a recursive-descent parser that is tree-agnostic and event-based.
- `syntax` crate: the syntax tree structure and a parser wrapper that exposes a higher-level API for the code’s syntactic structure.
- `base-db` crate: integration with `salsa`, allowing for incremental and on-demand computation.
- `hir-xxx` crates: program analysis phases that explicitly integrate incremental computation.
- `hir` crate: a high-level API that provides a static, inert, and fully resolved view of the code.
- `ide` crate: high-level IDE features on top of the `hir` semantic model, speaking IDE languages such as text and offsets instead of syntax nodes.
- `rust-analyzer` crate: the language server that knows about the LSP and JSON protocols.
Token & Node
- Parser tokens have tags and their corresponding source text, while parser nodes have tags and a source length, with child nodes placed in a homogeneous vector.[2]
- Tokens and nodes share the same `SyntaxKind` enum and are not as clearly distinguished as in `@dbml/parse`. Tokens function as leaf nodes while nodes function as interior nodes.[2]
- The explicit-nodes approach treats whitespace and comments as sibling nodes: `rust-analyzer` handles trivia and error nodes uniformly (everything is a node), while some parsers like `@dbml/parse` attach trivia to semantic parents, and others use hybrid approaches.[2]
- For context-sensitive keywords like `union` and `default`, the parser checks the actual text via `TokenSource::is_keyword()` rather than relying solely on the token kind.[2]
- An intermediary layer using the `TokenSource` and `TreeSink` traits merges tokens based on context, such as combining `>` + `>` into `>>`.[2]
- Nodes store only `text_len` (not an absolute offset) while tokens store text; `SyntaxNode` computes offsets on demand from its parent, which enables incremental parsing without position invalidation.[2]
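A minimal sketch of the length-only storage idea, using invented `Green` types rather than rust-analyzer’s real rowan API: nodes know only their `text_len`, and absolute offsets are derived on demand by summing the lengths of preceding siblings:

```rust
use std::sync::Arc;

// Illustrative types, not rowan's real API: only tokens own text;
// a node's length is derived from its children.
#[derive(Debug)]
enum Green {
    Token { text: String },
    Node { children: Vec<Arc<Green>> },
}

impl Green {
    fn text_len(&self) -> usize {
        match self {
            Green::Token { text } => text.len(),
            Green::Node { children } => children.iter().map(|c| c.text_len()).sum(),
        }
    }
}

/// Compute the absolute offset of `children[index]` on demand, given the
/// parent's own offset. Because no absolute positions are stored, editing
/// one subtree never invalidates positions elsewhere in the tree.
fn child_offset(parent_offset: usize, children: &[Arc<Green>], index: usize) -> usize {
    parent_offset + children[..index].iter().map(|c| c.text_len()).sum::<usize>()
}

fn main() {
    let a = Arc::new(Green::Token { text: "fn ".into() });
    let b = Arc::new(Green::Token { text: "main".into() });
    let node = Green::Node { children: vec![a, b] };
    assert_eq!(node.text_len(), 7);
    if let Green::Node { children } = &node {
        assert_eq!(child_offset(0, children, 1), 3); // "fn " has length 3
    }
}
```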
Parsing
- The three-layer tree architecture separates concerns effectively:[2]
  - `GreenNode` (storage): an immutable, persistent layer with optimizations including DSTs (a single heap allocation), tagged pointers, token interning, and `Arc` sharing; it stores `text_len` rather than offsets.
  - `SyntaxNode` (cursor, or “red node”): adds parent pointers and on-demand position computation, so memory scales with traversal depth rather than tree size; nodes are transient and rebuilt from the `GreenNode` when needed.
  - AST (typed API): auto-generated typed wrappers like `FnDef` and `ParamList` around `SyntaxNode` that provide ergonomic access to specific constructs.
- Everything is preserved, including whitespace, comments, and invalid tokens: invalid input gets wrapped in `ERROR` nodes, and the original text can be reconstructed by concatenating token text.[2]
- Errors live in a separate `Vec<SyntaxError>` rather than being embedded in the tree, which enables manual tree construction without error-state management and produces parser output as `(green_node, errors)`.[2]
- Resilient parsing combines multiple strategies:[2]
  - The algorithm uses recursive descent with Pratt parsing for expressions and is intentionally permissive, accepting invalid constructs that are validated later.
  - The event-based architecture has the parser emit abstract events via the `TokenSource` (input) and `TreeSink` (output) traits, keeping the parser agnostic to tree structure.
  - Error recovery employs panic mode (skipping to synchronization points like `}` or `;`), inserts implicit closing braces, performs early block termination, and wraps malformed content in `ERROR` nodes.
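The panic-mode part of this recovery can be sketched as follows. The token type and function names are illustrative, not rust-analyzer’s actual API: on unexpected input, skip forward to a synchronization token and hand back the skipped prefix, which would become an `ERROR` node:

```rust
// Hypothetical token set for the sketch.
#[derive(Debug, PartialEq)]
enum Tok {
    Ident(&'static str),
    Semi,   // ';'
    RBrace, // '}'
    Junk,   // anything the parser could not place
}

/// Panic-mode recovery: returns (skipped, rest). The skipped prefix would
/// be wrapped in an ERROR node; parsing resumes at the sync token.
fn recover(tokens: &[Tok]) -> (&[Tok], &[Tok]) {
    let sync = tokens
        .iter()
        .position(|t| matches!(t, Tok::Semi | Tok::RBrace))
        .unwrap_or(tokens.len());
    tokens.split_at(sync)
}

fn main() {
    let toks = [Tok::Junk, Tok::Junk, Tok::Semi, Tok::Ident("next")];
    let (skipped, rest) = recover(&toks);
    assert_eq!(skipped.len(), 2); // junk becomes an ERROR node
    assert_eq!(rest[0], Tok::Semi); // resume at the sync point
}
```

Because the malformed span is preserved (not discarded), the original text can still be reconstructed from the tree.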
- Incremental parsing uses a sophisticated approach:[2]
  - The red-green model separates green (immutable storage) from red (cursors with positions); this separation enables cheap tree patches by swapping `GreenNode` pointers.
  - The block heuristic reparses only the smallest `{}` block containing the edit, which works because the parser keeps braces structurally balanced through implicit `}` insertion and `ERROR` wrapping for extras.
  - Pragmatically, incremental reparsing is often not worth the complexity, since a full reparse is fast and simpler, though the architecture remains valuable for cheap tree edits and subtree sharing.
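The cheap-patch property can be illustrated with a toy green tree; the names here are hypothetical, not rowan’s real API. “Editing” rebuilds only the spine from the edited node to the root, while every untouched subtree is shared via `Arc`:

```rust
use std::sync::Arc;

// Illustrative green node: immutable, shared by reference counting.
#[derive(Debug)]
struct GreenNode {
    kind: &'static str,
    children: Vec<Arc<GreenNode>>,
}

/// Replace child `index` of `root`, producing a new root that shares
/// every untouched subtree with the old one.
fn with_child(root: &Arc<GreenNode>, index: usize, new_child: Arc<GreenNode>) -> Arc<GreenNode> {
    let mut children = root.children.clone(); // clones Arcs, not subtrees
    children[index] = new_child;
    Arc::new(GreenNode { kind: root.kind, children })
}

fn main() {
    let left = Arc::new(GreenNode { kind: "FN", children: vec![] });
    let right = Arc::new(GreenNode { kind: "STRUCT", children: vec![] });
    let root = Arc::new(GreenNode { kind: "FILE", children: vec![left, right] });

    let new_right = Arc::new(GreenNode { kind: "ENUM", children: vec![] });
    let new_root = with_child(&root, 1, new_right);

    // The untouched left subtree is literally the same allocation.
    assert!(Arc::ptr_eq(&root.children[0], &new_root.children[0]));
    assert_eq!(new_root.children[1].kind, "ENUM");
}
```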
- Error messages use a layered approach: a permissive parser is followed by a separate validation pass for “soft” errors, allowing the parser to focus on structural recovery while validation uses semantic context for detailed diagnostics.[2]
Design Choices & General Architectures
- Avoid blind serialization of internal types, which implicitly couples public clients to private implementation details.[1]
- Enforce opaque API boundaries (such as between analysis and IDE layers) to enable radical internal refactoring without breaking consumers.[1]
- Codify architectural laws (like “core layers are I/O-free”) to permanently guarantee non-functional requirements such as speed and deterministic testing.[1]
- Order function arguments by stability (context → data) to align code structure with mental models (“setting” → “actors”) and reduce cognitive load during scanning.[1]
- Use distinct types to segregate unverified external input (like “dirty” OS strings) from validated internal data, preventing logic errors from crossing trust boundaries.[1]
- Enforce invariants via “construction & retrieval” (private fields with public getters) rather than “mutation” (setters), ensuring objects never enter invalid states.[1]
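A minimal sketch of the construction-and-retrieval pattern, with an invented `Percentage` type: the field is private, the constructor enforces the invariant, and because there is no setter, no instance can ever hold an out-of-range value:

```rust
// Illustrative type: the 0..=100 invariant is checked exactly once,
// at construction, and can never be broken afterwards.
pub struct Percentage {
    value: u8, // private: only `new` can set it
}

impl Percentage {
    pub fn new(value: u8) -> Result<Self, String> {
        if value <= 100 {
            Ok(Percentage { value })
        } else {
            Err(format!("{value} is not a percentage"))
        }
    }

    /// Retrieval only; there is deliberately no `set_value`.
    pub fn value(&self) -> u8 {
        self.value
    }
}

fn main() {
    let p = Percentage::new(42).expect("in range");
    assert_eq!(p.value(), 42);
    assert!(Percentage::new(150).is_err());
}
```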
- Encode assumptions into the type system (such as non-nullable types) to force callers to handle edge cases explicitly, preserving context at the call site.[1]
- Encapsulate complex execution arguments into temporary structs to support multiple execution modes without duplicating function signatures.[1]
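The argument-struct idea can be sketched like this; `RunArgs` and `run` are hypothetical names. One entry point supports several execution modes, and new modes extend the struct instead of growing every call site’s parameter list:

```rust
// Illustrative argument bundle for a single entry point.
#[derive(Default)]
struct RunArgs<'a> {
    dry_run: bool,
    verbose: bool,
    filter: Option<&'a str>,
}

fn run(args: RunArgs) -> String {
    let mode = if args.dry_run { "dry-run" } else { "apply" };
    let scope = args.filter.unwrap_or("all");
    format!("{mode}:{scope}:verbose={}", args.verbose)
}

fn main() {
    // Callers name only the fields they care about; `Default` fills the rest.
    assert_eq!(run(RunArgs::default()), "apply:all:verbose=false");
    assert_eq!(
        run(RunArgs { dry_run: true, filter: Some("tests"), ..Default::default() }),
        "dry-run:tests:verbose=false"
    );
}
```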
- Split functions with boolean flags (like `do(true)`) into distinct named functions, which adheres to the single-responsibility principle and prevents unrelated logic paths from coupling.[1]
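A small sketch of the flag split, with invented function names: instead of a single `render(items, compact: bool)` that forces readers to decode `render(items, true)` at every call site, the two paths get their own names:

```rust
// Two named functions in place of one `render(items, compact: bool)`.
// Each has a single responsibility, and the call site reads as intent.
fn render_compact(items: &[&str]) -> String {
    items.join(",")
}

fn render_pretty(items: &[&str]) -> String {
    items.join(",\n")
}

fn main() {
    let items = ["a", "b"];
    assert_eq!(render_compact(&items), "a,b");
    assert_eq!(render_pretty(&items), "a,\nb");
}
```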
Implementation Patterns
- Prioritize imperative clarity over functional brevity: code should maximize “work per line” rather than minimize line count via complex indirections.[1]
- Use spatial operators (like `<` or `<=`) that map intuitively to the mental number line (0 → ∞), avoiding the mental effort required to “flip” comparisons.[1]
- Prefer syntax that supports left-to-right reading (such as explicit type ascription), which reduces the “context window” required to understand a statement by declaring intent up front.[1]
- Use blocks (`{ ... }`) to isolate temporary state, preventing variable pollution while retaining access to the parent context.[1]
- Push resource allocation (memory and I/O) up to the call site (like passing in a buffer) to make performance costs visible and controllable by the caller.[1]
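The buffer-passing point can be sketched as follows (illustrative names): the function appends into a caller-owned `String` instead of allocating and returning a fresh one, so the single allocation is visible at, and reusable by, the call site:

```rust
// The callee writes into a caller-provided buffer; it allocates nothing
// itself (beyond what `push_str` may grow in the caller's buffer).
fn write_greeting(name: &str, out: &mut String) {
    out.push_str("hello, ");
    out.push_str(name);
    out.push('\n');
}

fn main() {
    let mut buf = String::new(); // one allocation, owned and reused here
    for name in ["ana", "bo"] {
        write_greeting(name, &mut buf);
    }
    assert_eq!(buf, "hello, ana\nhello, bo\n");
}
```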
- Use explicit namespaces or qualifiers to visually reinforce layer boundaries in code, such as distinguishing `ast::Node` from `hir::Node`.[1]
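A toy sketch of the qualifier convention, with invented module contents: both layers define a `Node`, and call sites keep the layer prefix visible instead of importing the bare names:

```rust
// Two layers, each with its own `Node`; the qualifier marks the layer.
mod ast {
    pub struct Node {
        pub text: &'static str,
    }
}

mod hir {
    pub struct Node {
        pub name: &'static str,
    }
}

// The signature makes the layer crossing visible at a glance.
fn lower(syntax: &ast::Node) -> hir::Node {
    hir::Node { name: syntax.text }
}

fn main() {
    let n = lower(&ast::Node { text: "main" });
    assert_eq!(n.name, "main");
}
```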
Testing
- Tests are defined as input/output pairs where the framework compares actual results against expected output, and expectations are updated when behavior changes intentionally.[3]
- Define tests via data (input and output) rather than API calls, which makes tests survive refactoring.[1]
- Each test simulates a complete environment (multi-file, multi-crate) in memory without shared state.[3]
- Failing tests can auto-update their expected output via a flag, which eliminates manual maintenance.[3]
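The auto-update idea can be sketched in a few lines. This is in the spirit of snapshot-testing tools such as the `expect-test` crate but is not its real API: compare actual output against a stored expectation, and when an update flag is set, rewrite the expectation instead of failing:

```rust
/// Compare `actual` against the stored expectation. On mismatch, fail
/// unless `update` is set, in which case the expectation is rewritten.
/// (A real framework would persist the update back to the test file.)
fn check(actual: &str, expected: &mut String, update: bool) -> bool {
    if actual == expected {
        true
    } else if update {
        *expected = actual.to_string();
        true
    } else {
        false
    }
}

fn main() {
    let mut expected = String::from("1 + 1 = 2");
    assert!(check("1 + 1 = 2", &mut expected, false));

    // Behavior changed intentionally: with the update flag set, the
    // stored expectation is rewritten rather than the test failing.
    let mut stale = String::from("old output");
    assert!(check("new output", &mut stale, true));
    assert_eq!(stale, "new output");
}
```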
- Minimize test cases to the smallest input that reproduces the behavior, resulting in less noise and faster debugging.[1]
- Place tests near their implementation to enable easy discovery during development.[3]
- Never ignore test failures. Instead, assert the incorrect behavior with a FIXME comment to keep the failure visible.[1]
Documentation
Enforce full-sentence comments (starting with a capital letter and ending with a period) to psychologically shift the author from “note-taking” (describing what) to “explanation” (describing why and providing context).[1]