Introduction
Loupe is an all-in-one suite of ergonomic APIs and CLI tools for code analysis & codebase querying, aimed at code comprehension, refactoring & automation.
It leverages well-known program analysis techniques, algorithms, and heuristics to answer very common questions about codebases.
Components
Structure-wise, Loupe has 2 main parts: the API and the CLI.
Functionality-wise, Loupe is concerned with 4 aspects:
- Code analysis: understanding structure, symbols, and relationships within a codebase.
- Code querying: searching and pattern-matching over code in a structured, expressive way.
- Code refactoring: applying transformations like renaming, reorganizing, and restructuring.
- Quality assurance: detecting smells, bad patterns, and potential bugs.
Goals
- Ergonomic: easy-to-use APIs that feel natural to work with.
- Extensible: can be extended from the core API without fighting the design.
- Fast: instant results; must be meaningfully faster than AI doing manual search.
- Simple: self-contained and not bloated.
Research
This section collects related works across the 4 functional areas of Loupe: code analysis, querying, refactoring, and quality assurance. It covers seminal papers, modern techniques, and practical tools.
The goal here is not exhaustive coverage, but a curated map of the space, so it’s easier to understand where Loupe sits and what prior art is worth learning from.
Plan
Areas: Start with code analysis, then querying, then refactoring, then quality assurance.
- Analysis: The foundation for all other things. We need to understand ASTs, parsing, and code intelligence before anything else makes sense.
- Querying: Once we have a model of the code, querying is what we do with it.
- Refactoring: applies both analysis (know what to change) and querying (find where to change it).
- Quality assurance: it overlaps all three, and makes most sense once we understand the tools and techniques of the other areas.
Within each area, resources are ordered from overview to seminal papers, then modern trends, then practical tools. Start at the top and work down.
Code Analysis
Code analysis is about understanding the structure, symbols, and relationships within a codebase. This covers parsing (ASTs), call graphs, code intelligence protocols (LSP, SCIP), and navigation tools.
This is the foundation everything else in Loupe builds on. Without a solid model of what the code actually is, querying, refactoring, and QA are all guesswork.
Syntactic Analysis
Syntactic analysis is about parsing source code into structured representations, typically abstract syntax trees (ASTs) or concrete syntax trees (CSTs). This is the first step in any code analysis pipeline: before you can reason about symbols, types, or data flow, you need a reliable parse of the source text.
The key concerns here are incremental parsing (re-parsing only what changed after an edit), error recovery (producing a useful tree even when the code is incomplete or malformed), and multi-language support (handling many grammars without writing a custom parser for each).
For Loupe, the parser must support many languages through shared infrastructure (language-agnostic), patch existing trees on edits rather than re-parse from scratch (incremental), and stay off the critical path, since everything downstream depends on it (fast).
Subpages
- Tree-sitter - Strange Loop 2018 and Tree-sitter: the dominant incremental parsing framework, designed for editor use cases. Produces concrete syntax trees with error recovery, supports many languages via community grammars.
- Roslyn: the .NET Compiler Platform, which exposes rich syntax tree APIs including full-fidelity syntax trees with trivia, incremental parsing, and syntax walkers/rewriters for C# and VB.NET.
- Spoofax Language Workbench: a platform for defining languages declaratively, including their syntax, name binding, and type systems.
[talk] [seminal] Tree-sitter - A New Parsing System for Programming Tools
Source: Strange Loop 2018
Link: https://www.thestrangeloop.com/2018/tree-sitter—a-new-parsing-system-for-programming-tools.html
Video: https://www.youtube.com/watch?v=Jes3bD6P0To
Speaker: Max Brunsfeld, GitHub Atom team
Keywords:
- Parsing
- Incremental parsing
- Syntax tree
- Code navigation
Takeaway:
Tree-sitter takes the approach of a compiler generator:
- GLR parsing
- Error recovery, also built on GLR parsing
- Incremental parsing: re-parse, but with the old tree as a reference
Introduction
Developer tools that support multiple languages have historically been built on regex-based code analysis.
Syntax highlighting in Atom, VS Code, and Sublime Text uses TextMate grammars, which are sets of regular-expression patterns. Code folding uses indentation heuristics. Symbol search uses pattern matching over raw text.
Regex-based tooling has a hard ceiling:
- Regex cannot represent nested structure or scope boundaries. Balanced delimiters, block nesting, type-sensitive disambiguation - none of this is expressible.
- Reparsing an entire file on every keystroke is too slow for real-time use. Compiler parsers are designed for batch runs, not interactive editing.
- Incomplete code breaks everything. While typing, code is almost always syntactically invalid. A parser that rejects invalid input cannot be used in an editor.
- Every language needed its own editor plugin written from scratch, with no shared interface.
Tree-sitter was built to fix all of this.
- A library for parsing source code; the library itself is written in C/C++.
- Language-agnostic: covers a wide range of languages.
- Unified: a single API across languages.
- Incremental: No need to reparse the whole file.
- Error recovery: Recover on error instead of aborting.
IDEs’ limitations:
- Language-specific.
- No IDE parsers at the time were incremental.
Language servers’ limitations:
- Fixed feature set.
- RPC overhead + parsing from scratch on every change.
- Each maintains its own set of dependencies.
Tree-sitter’s strengths:
- Arbitrary analysis based on syntax tree.
- In-process incremental parsing.
- No dependencies.
Features
Syntax Highlighting
- In Sublime Text 3, VS Code, and old Atom, syntax highlighting is plagued with inconsistencies:
  - Different colors for types in different contexts.
  - The same variable could get different colors.
  - Fields were not highlighted.
- A concrete example: long lines of code (typical in bundled JavaScript) break regex-based highlighting.
- Improvements with tree-sitter:
  - None of the problems above.
  - Very fast.
Code Folding
- Typically based on indentation: does not always work, especially when unusual code conventions are used.
- With tree-sitter:
  - Folding always works.
  - Configurable code folding, unlocked by tree-sitter exposing a consistent syntax tree.
Extend Selection
- The feature where clicking at a position causes progressively larger enclosing code constructs to be selected.
- Works well with multi-cursor.
How Tree-sitter Works
Writing a Grammar
- Looks like most parser generators.
- A context-free grammar in tree-sitter is a JavaScript value in a file such as grammar.js.
  - Compiled into parser.c.
  - The parser can then be imported.
- Bindings exist.
Algorithms
Seminal Paper
Based on Tim A. Wagner's 1998 paper, Practical Algorithms for Incremental Software Development Environments:
- Laid out the methodology for efficient code analysis in IDEs.
- Starts from basic parsing theory.
- Extends the theory to incremental parsing.
- Gives formal proofs of performance properties.
- Covers how to handle ambiguities, errors, etc.
LR Parsing
Good source: wikipedia.org/wiki/LR_parser.
- Looks from start to end without ever backtracking.
- As it reads characters, it groups them into tokens and tokens into subtrees, which are kept on a stack for consultation; items on the stack are merged (reduced) eagerly.
LR parsing does not work for all languages. In practice, a well-designed language is rarely a problem, but some languages have quirks that fall outside LR.
Tree-sitter does some unique things based on LR Parsing.
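To make the shift-and-merge behavior concrete, here is a toy shift-reduce sketch (my illustration, not tree-sitter code): it evaluates left-associative subtraction over single digits the way an LR parser folds tokens on a stack without backtracking.

```c
#include <ctype.h>

/* Toy LR-style shift-reduce evaluation of digit expressions like "9-2-3".
 * Tokens are shifted onto a stack; as soon as a "value '-' value" pattern
 * sits on top, it is reduced (merged) into a single value, mirroring how
 * an LR parser folds tokens into subtrees in one left-to-right pass. */
int eval_left_assoc(const char *src) {
    int stack[64];
    int top = 0; /* number of items on the stack */
    for (const char *p = src; *p; p++) {
        if (isdigit((unsigned char)*p)) {
            stack[top++] = *p - '0';      /* shift a value */
            if (top == 3) {               /* stack holds [lhs, '-', rhs] */
                int rhs = stack[--top];
                --top;                    /* discard the operator marker */
                int lhs = stack[--top];
                stack[top++] = lhs - rhs; /* reduce: merged subtree result */
            }
        } else if (*p == '-') {
            stack[top++] = -1;            /* shift an operator marker */
        }
    }
    return stack[0];
}
```

The eager reduction is why the stack stays shallow: subtrees are merged as soon as a grammar rule completes, rather than after the whole input is read.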
Extension: GLR Parsing
When ambiguities happen, fork the branch to work on multiple interpretations.
=> Tree-sitter is based on this.
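A toy sketch of the fork-and-score idea (hypothetical names, not tree-sitter's implementation): each forked branch accumulates the dynamic precedences of the rules it used, and the branch with the highest total wins.

```c
#include <stddef.h>

/* Sum the dynamic precedences of the rules one branch used. */
static int branch_score(const int *precs, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++) total += precs[i];
    return total;
}

/* Given two forked interpretations of the same ambiguous input,
 * return the index (0 or 1) of the branch that survives. */
int resolve_ambiguity(const int *a, size_t na, const int *b, size_t nb) {
    return branch_score(b, nb) > branch_score(a, na) ? 1 : 0;
}
```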
Error Recovery
- The tree should include error nodes.
- Invalid code is a source of lots of ambiguities.
Tree-sitter uses the same idea of GLR for resolving ambiguities.
Incremental Parsing
- Walk the current tree and mark all nodes that contain the edited positions.
- Reparse as before, but now with the old tree as a reference.
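As a toy model (made-up types, not tree-sitter's data structures), the marking step reduces to an interval-overlap test per node: only subtrees touching the edited byte range need re-parsing, the rest can be reused.

```c
#include <stdbool.h>
#include <stdint.h>

/* A node's byte span in the old tree. */
typedef struct { uint32_t start; uint32_t end; } span_t;

/* A node must be re-parsed if its span touches the edited byte range;
 * subtrees disjoint from the edit can be reused from the old tree. */
bool must_reparse(span_t node, uint32_t edit_start, uint32_t edit_old_end) {
    return node.start <= edit_old_end && edit_start <= node.end;
}
```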
Tree-sitter
Source: tree-sitter
Link: https://tree-sitter.github.io
Abstract
This section covers tree-sitter: design philosophy, architecture, usage, and applications.
Problem:
- IDEs and editors (Visual Studio, Sublime Text) used regexes for syntax features before tree-sitter.
- This caused inconsistent highlighting, fragility, poor performance.
Solution: Tree-sitter provides fast, accurate, incremental parsing for editors and IDEs.
See Strange Loop Talk for more context.
[tool] [tool-dive] Tree-sitter Concepts
Source: tree-sitter
Link: https://tree-sitter.github.io
Takeaways
- Design for IDEs: Tree-sitter prioritizes fast incremental re-parsing and token position fidelity over theoretical parser properties.
- CST design balances detail and complexity by preserving all tokens while using named nodes (grammar rules) vs. anonymous nodes (inline strings).
- Incremental parsing only re-parses edited ranges by reusing unchanged subtrees, critical for real-time editor feedback.
- GLR explores all parse branches at runtime and scores with dynamic precedence, handling true ambiguities without grammar restrictions.
- Queries separate concerns: Tree navigation is independent of grammar design, enabling services to evolve without grammar changes.
Abstract
This section covers tree-sitter’s architecture, design principles, and core concepts.
- Concepts (this article): Design philosophy, architecture, IDE optimization.
- Grammar (Guideline & Syntax): Author grammars with design principles and DSL reference.
- Query System (Queries): Pattern matching for data extraction and tree navigation.
- Services (Services): Syntax highlighting, code navigation, language injection.
Introduction
Tree-sitter is:
- A parser generator tool, taking a grammar described in JS and yielding an incremental parsing library in C.
- Runtimes in various languages to load the generated C library.
The Context
Covered in the Strange Loop talk.
Roughly:
- Before tree-sitter, most editor features (VS Code, Sublime) such as syntax highlighting and code folding were implemented in terms of regexes.
- This caused a lot of problems:
- Inconsistent highlighting: the same type of token was not given the same color, in edge cases (unusual formatting conventions) or even in some pretty common cases.
- Regex highlighting can easily break, such as when inspecting bundled JS.
- Slow (at least slower than tree-sitter).
- So, tree-sitter was created to address these issues, reporting:
- Faster processing of source code.
- Consistent highlighting.
- Enable many DX features.
- Unified interface for tooling.
Design Philosophy
Some of the design philosophy of tree-sitter can be found in the Strange Loop talk:
- Arbitrary analysis based on syntax tree.
- In-process incremental parsing.
- No dependencies.
On the tree-sitter homepage, tree-sitter aims, quoted verbatim, to be:
- General enough to parse any programming language
- Fast enough to parse on every keystroke in a text editor
- Robust enough to provide useful results even in the presence of syntax errors
- Dependency-free so that the runtime library (which is written in pure C11) can be embedded in any application
Additional source: https://zed.dev/blog/syntax-aware-editing.
IDE-centric design:
- Tree-sitter was designed with IDE-centric workflow in mind.
- Concrete syntax trees retain all token locations to support many editor features.
- Flexible queries match structural patterns independently from grammar design.
- Grammar + query set = language support without custom traversal code.
This aligns with Loupe’s goal: syntactic code analysis and querying tool.
graph TB
A["Grammar.js<br/>User-defined rules"] -->|tree-sitter generate| B["C Parser<br/>Compiled from grammar"]
B --> C["TSParser<br/>Execution engine"]
C -->|ts_parser_parse_string| D["TSTree<br/>Concrete Syntax Tree"]
D --> E["TSNode<br/>Syntax tree element"]
F["Tree Queries<br/>Pattern matching"] -->|ts_query_cursor_exec| E
G["Edit API<br/>ts_tree_edit"] -->|Incremental| C
H["Language Binding<br/>Rust/JS/Python"] -->|Load| B
style A fill:#e1f5ff
style B fill:#fff3e0
style C fill:#f3e5f5
style D fill:#e8f5e9
style F fill:#fce4ec
style G fill:#fff9c4
Practical Use Cases
Source: https://zed.dev/blog/syntax-aware-editing.
These features in Zed are implemented using tree-sitter queries:
- Syntax highlighting (example query from their blog):
  ["do" "for" "while"] @keyword
  (function name: (identifier) @function)
  (pair
    key: (property_identifier) @function.method
    value: [(function) (arrow_function)])
- Symbol outlines (example query from their blog) - fuzzy search for symbols within another symbol/context:
  (impl_item
    "impl" @context
    trait: (_)? @name
    "for"? @context
    type: (_) @name) @item
  (function_item
    (visibility_modifier)? @context
    (function_modifiers)? @context
    "fn" @context
    name: (_) @name) @item
- Auto-indentation (example query from their blog):
  (statement_block "}" @end) @indent
  [
    (assignment_expression)
    (member_expression)
    (if_statement)
  ] @indent
- Language injection (example query from their blog) - sometimes a source file in one language embeds another (e.g. HTML & JS, or Rust macros):
  (script_element
    (raw_text) @content
    (#set! "language" "javascript"))
  Kind of looks like lexer modes in ANTLR.
User-facing API
Tree-sitter’s API follows a layered approach:
- Core APIs handle efficient parsing & a unified AST representation.
- High-level APIs provide convenience features.
Four main object types:
- Languages (TSLanguage in the C API):
  - An (opaque) object representing a language's parsing logic/rules (not the current parsing state).
  - The implementations for specific Language objects are generated by tree-sitter.
- Parsers (TSParser in the C API):
  - The language parser itself, which can be assigned a Language object.
  - Can accept some source code and produce a TSTree.
- Syntax trees (TSTree in the C API):
  - The syntax tree of a piece of source code.
  - Composed of TSNodes.
  - Can be edited to produce a new TSTree.
- Syntax nodes (TSNode in the C API):
  - A single node in the syntax tree.
  - Tracks start and end positions & relations to other nodes.
Example flow using a generated parser in C:
1. Create a TSParser instance.
2. Set the TSParser with a TSLanguage.
3. Trigger the parser to produce a TSTree.
4. Query the TSTree and retrieve TSNodes.
Example program:
#include <stdio.h>
#include <string.h>
#include <tree_sitter/api.h>
// **Language**
// Declare the external grammar function. This is generated by Tree-sitter
// and contains all the syntax rules for JSON.
const TSLanguage *tree_sitter_json(void);
int main() {
// The source code we want to parse
const char *source_code = "{\"name\": \"Gemini\"}";
uint32_t source_len = strlen(source_code);
// **Parser**
// Create the parser and equip it with the JSON language grammar
TSParser *parser = ts_parser_new(); // Dynamically allocated
ts_parser_set_language(parser, tree_sitter_json());
// **Parse tree**
// Instruct the parser to read the string and build the entire tree.
TSTree *tree = ts_parser_parse_string(
parser,
NULL, // Previous tree (used for incremental parsing, NULL for first time)
source_code, // The actual code
source_len // Length of the code
);
// **Tree node**
// Extract the absolute top node of the tree to begin our traversal.
TSNode root_node = ts_tree_root_node(tree);
// Print the root node's type (the root rule of the JSON grammar)
printf("Root node type: %s\n", ts_node_type(root_node));
// Cleanup memory
ts_tree_delete(tree);
ts_parser_delete(parser);
return 0;
}
Parser API
TSParser accepts input via a string buffer (const char * with a uint32_t length) or a custom TSInput interface.
- ts_parser_parse_string: accepts a string buffer, with incremental parsing support.
  TSTree *ts_parser_parse_string(
    TSParser *self,
    const TSTree *old_tree, // possibly for incremental parsing like in the Strange Loop talk
    const char *string,
    uint32_t length
  );
- ts_parser_parse: accepts a custom data-structure consumer:
  TSTree *ts_parser_parse(
    TSParser *self,
    const TSTree *old_tree, // possibly for incremental parsing like in the Strange Loop talk
    TSInput input
  );
Syntax Nodes (& Trees)
DOM-style interface for inspection:
Position APIs:
- ts_node_type: node type identifier corresponding to the grammar rule.
- ts_node_start_byte / ts_node_end_byte: raw byte offsets.
- ts_node_start_point / ts_node_end_point: zero-based row/column coordinates. Only \n counts as a newline separator.
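To make the coordinate convention concrete, here is a small sketch (a hypothetical helper of mine, not part of the tree-sitter API) deriving a row/column point from a byte offset using only '\n' as the row separator:

```c
#include <stdint.h>

typedef struct { uint32_t row; uint32_t column; } point_t;

/* Convert a byte offset into zero-based row/column coordinates the way
 * the position APIs describe: only '\n' terminates a row. */
point_t byte_to_point(const char *source, uint32_t byte) {
    point_t p = {0, 0};
    for (uint32_t i = 0; i < byte && source[i]; i++) {
        if (source[i] == '\n') { p.row++; p.column = 0; }
        else p.column++;
    }
    return p;
}
```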
Tree navigation:
- ts_tree_root_node: get the root node from a tree.
- ts_node_child: access children via array index.
- ts_node_next_sibling / ts_node_prev_sibling: navigate siblings (may return a null node).
- ts_node_parent: navigate to the parent node.
Named and Anonymous Nodes
Problem: CSTs preserve all tokens, creating deeply nested trees that are tedious to traverse.
Solution: Distinguish two node types for selective AST inclusion:
- Named nodes: explicit grammar rules (e.g., if_statement, expression). Always appear in the parse tree.
- Anonymous nodes: inline strings/regexes (e.g., "if", "(", ";"). Hidden from the parse tree by default.
if_statement: $ => seq("if", "(", $._expression, ")", $._statement);
^^^^^^^^^^^^ ^^^ ^^^ ^^^
named node anonymous nodes (hidden from AST)
This distinction enables unambiguous, detailed grammars while keeping trees manageable.
graph TD
A["if_statement<br/>named node"] -->|child| B["_expression<br/>named node<br/>hidden via _prefix"]
A -->|not in AST| C["'if'<br/>anonymous"]
A -->|not in AST| D["'('<br/>anonymous"]
A -->|not in AST| E["')'<br/>anonymous"]
A -->|child| F["_statement<br/>named node<br/>hidden via _prefix"]
Node Field Names
Access child nodes:
- By index: Default approach via children array.
- By field name: More convenient when field names are known.
Like ANTLR, supports both indexed and named accesses.
TSNode ts_node_child_by_field_name(
TSNode self,
const char *field_name, // Unique field name from grammar
uint32_t field_name_length
);
Field name lookup optimization:
- String comparisons are repeated and inefficient.
- Tree-sitter interns field names into TSFieldId for efficient lookup.
uint32_t ts_language_field_count(const TSLanguage *);
const char *ts_language_field_name_for_id(const TSLanguage *, TSFieldId);
TSFieldId ts_language_field_id_for_name(const TSLanguage *, const char *, uint32_t);
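The interning idea can be sketched as follows (a toy lookup table of my own, not tree-sitter's implementation): names are registered once, and all later comparisons use small integer ids instead of repeated strcmp calls.

```c
#include <string.h>

/* Toy field-name interning: each distinct name gets a small integer id,
 * so hot paths compare ids instead of strings. */
#define MAX_FIELDS 32
static const char *field_names[MAX_FIELDS];
static int field_count = 0;

int intern_field(const char *name) {
    for (int i = 0; i < field_count; i++)
        if (strcmp(field_names[i], name) == 0) return i; /* already known */
    field_names[field_count] = name;                     /* register once */
    return field_count++;
}
```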
Tree Edit
Problem: Source edits invalidate tree byte positions. Reparsing entire file is wasteful.
Solution: Two-step incremental update:
1. Edit the tree structure: tell tree-sitter what changed (old/new byte range, position delta).
   typedef struct {
     uint32_t start_byte;     // Edit start
     uint32_t old_end_byte;   // Old end position
     uint32_t new_end_byte;   // New end position
     TSPoint start_point;     // Start as row/column
     TSPoint old_end_point;   // Old end as row/column
     TSPoint new_end_point;   // New end as row/column
   } TSInputEdit;

   void ts_tree_edit(TSTree *, const TSInputEdit *);
2. Reparse with the old tree as context:
   TSTree *new_tree = ts_parser_parse_string(
     parser,
     old_tree,       // Enable incremental parsing
     updated_source,
     new_length
   );
Rough flow:
1. Source change.
2. Invoke ts_tree_edit to update the tree's ranges.
3. Invoke ts_parser_parse with the old tree.
4. New tree with reused structure.
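Conceptually, updating a stale byte position for an edit is just offset arithmetic. A sketch with hypothetical names (the real TSInputEdit also carries row/column points):

```c
#include <stdint.h>

/* Shift one cached byte position to account for an edit described by
 * (start, old_end, new_end), the way stale node positions must be
 * updated before re-parsing. */
uint32_t shift_pos(uint32_t pos, uint32_t start,
                   uint32_t old_end, uint32_t new_end) {
    if (pos < start) return pos;                         /* before the edit: unchanged */
    if (pos >= old_end) return pos - old_end + new_end;  /* after the edit: shifted */
    return new_end;                                      /* inside the edit: clamp */
}
```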
Why do you need to call ts_tree_edit first? Wouldn't re-parsing with the old tree suffice? My guess is that incremental parsing is more efficient if tree-sitter knows the edited positions from the get-go.
Warning!
Old extracted TSNode references become invalid after a reparse. They need to be updated via ts_node_edit(), or fresh nodes can be fetched instead.
My theory is that a TSNode does not carry source text information, so the edit API on its own cannot convey anything about the new source.
Language Injection Support
Many documents contain multiple languages (e.g., HTML with embedded JS, Bash with heredocs).
Tree-sitter stitches disjoint segments together as one unified syntax tree.
typedef struct {
TSPoint start_point;
TSPoint end_point;
uint32_t start_byte;
uint32_t end_byte;
} TSRange;
void ts_parser_set_included_ranges(
TSParser *self,
const TSRange *ranges,
uint32_t range_count
);
Example: ERB document with embedded HTML and Ruby.
<ul>
<% people.each do |person| %>
<li><%= person.name %></li>
<% end %>
</ul>
Steps:
1. Parse the entire document with the root language (ERB).
2. Identify nodes with embedded languages in the tree.
3. Extract their boundaries into separate TSRange arrays.
4. Switch the parser language via ts_parser_set_language.
5. Feed the ranges to the parser via ts_parser_set_included_ranges.
6. Re-parse to generate isolated syntax trees per language.
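The range-extraction step (step 3) can be sketched as a plain scan. This toy version (my own helper names; real code would walk the ERB syntax tree instead of the raw text) collects the byte ranges between `<%` and `%>` markers, i.e. the ranges that would be handed to the embedded-language parser:

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t start_byte; uint32_t end_byte; } range_t;

/* Scan an ERB-like document and collect the byte ranges between "<%"
 * and "%>" markers. Returns the number of ranges written to out. */
uint32_t collect_embedded_ranges(const char *doc, range_t *out, uint32_t cap) {
    uint32_t n = 0;
    const char *p = doc;
    while (n < cap && (p = strstr(p, "<%")) != NULL) {
        const char *start = p + 2;                /* content begins after "<%" */
        const char *end = strstr(start, "%>");
        if (!end) break;                          /* unterminated tag: stop */
        out[n].start_byte = (uint32_t)(start - doc);
        out[n].end_byte = (uint32_t)(end - doc);
        n++;
        p = end + 2;                              /* continue after "%>" */
    }
    return n;
}
```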
Concurrency Support
TSTree instances are not thread-safe. However, they can be cheaply copied via ts_tree_copy(TSTree *):
- TSTrees are immutable.
- An atomic reference count tracks the outstanding copies, so the underlying tree is freed only after every copy has been deleted.
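The mechanism can be sketched as a reference-counted immutable structure (a toy model with made-up names, not tree-sitter's code):

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Toy model of cheap tree copying: the data is immutable and shared;
 * "copying" just bumps an atomic reference count, and the data is
 * freed only when the last handle is deleted. */
typedef struct {
    atomic_int refcount;
    /* ... immutable tree data would live here ... */
} toy_tree;

toy_tree *toy_tree_new(void) {
    toy_tree *t = malloc(sizeof *t);
    atomic_init(&t->refcount, 1);
    return t;
}

toy_tree *toy_tree_copy(toy_tree *t) {
    atomic_fetch_add(&t->refcount, 1);  /* cheap: no deep copy */
    return t;
}

/* Returns 1 if the underlying tree was actually freed. */
int toy_tree_delete(toy_tree *t) {
    if (atomic_fetch_sub(&t->refcount, 1) == 1) { free(t); return 1; }
    return 0;
}
```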
(Efficient) Tree Traversal
The problem with the standard node fetching APIs (index/named access):
- Stateless.
- Therefore, repeated accesses would need to recalculate positions from scratch.
- Suitable for occasional accesses.
For bulk traversal, use tree cursors.
- ts_tree_cursor_new(TSNode): create a cursor from a node.
- ts_tree_cursor_goto_first_child: move to the first child.
- ts_tree_cursor_goto_next_sibling: move to the next sibling.
- ts_tree_cursor_goto_parent: move to the parent.
- ts_tree_cursor_current_node: get the full node (triggers lazy evaluation).
- ts_tree_cursor_current_field_name: get the field name.
- ts_tree_cursor_current_field_id: get the field ID.
// 1. Initialize the cursor (the cursor cannot move above this node)
TSTreeCursor cursor = ts_tree_cursor_new(start_node);
// 2. Traversal Flow
// Try moving down into the children (O(1) operation)
if (ts_tree_cursor_goto_first_child(&cursor)) {
// 3. Extract lightweight data directly
const char *field = ts_tree_cursor_current_field_name(&cursor);
// 4. Extract the full node struct only if actually needed
TSNode current = ts_tree_cursor_current_node(&cursor);
// 5. Move horizontally (O(1) operation)
if (ts_tree_cursor_goto_next_sibling(&cursor)) {
// Evaluates the next sibling
}
// 6. Move back up the tree
ts_tree_cursor_goto_parent(&cursor);
}
Resources for the Design & Implementation
- General design: tree-sitter doc
- Parsing functionality:
Grammar
How to author and design tree-sitter grammars.
Abstract
This section covers grammar authoring and design.
Grammar design directly affects parser quality, performance, usability.
- Well-designed: Unambiguous, efficient, maintainable.
- Poor design: Parsing conflicts, performance issues, difficult query patterns.
Tree-sitter Grammar Syntax
Source: tree-sitter
Link: https://tree-sitter.github.io
Dependencies
- JavaScript runtime: Node.js (to execute grammar.js).
- C compiler: GCC, Clang, or MSVC (to compile the generated C code).
- Tree-sitter CLI: installed via npm, Cargo, or a downloaded binary. For Nix, use tree-sitter.
Basic Flow
graph LR
A((Start)) --> B[**Setup** <br> Install CLI & Compiler]
B --> C[**Scaffold** <br> <code>tree-sitter init</code>]
C --> D[**Define Rules** <br> Edit <code>grammar.js</code>]
D --> E[**Compile** <br> <code>tree-sitter generate</code>]
E --> F[**Test** <br> <code>tree-sitter parse</code>]
F --> G((Syntax Tree AST))
classDef step fill:#24292f,stroke:#0969da,stroke-width:2px,color:#fff;
class B,C,D,E,F step;
1. Initialize with tree-sitter init: creates a project template with grammar.js, src/, and test files.
2. Define grammar in grammar.js:
   module.exports = grammar({
     name: "my_language",
     rules: {
       source_file: ($) => repeat($.statement),
       statement: ($) => seq($.identifier, ";"),
       identifier: ($) => /[a-z]+/,
     },
   });
3. Generate with tree-sitter generate: compiles grammar.js to C code in src/parser.c.
4. Test with tree-sitter parse <file>: parse a sample file and inspect the resulting syntax tree.
5. Iterate: refine rules, regenerate, and test until the grammar is correct.
Grammar DSL
The Grammar DSL is a set of JavaScript functions for expressing syntax rules. All patterns are built from simple, composable blocks.
Quick Reference
| Pattern | Example | Matches |
|---|---|---|
| Literal string | "if" | Exactly the text “if” |
| Regex | /\d+/ | One or more digits |
| Sequence | seq(a, b, c) | a followed by b followed by c |
| Choice | choice(a, b) | Either a or b |
| Repeat | repeat(a) | Zero or more of a |
| Repeat 1+ | repeat1(a) | One or more of a |
| Optional | optional(a) | Zero or one of a |
| Symbol ref | $.rule_name | Reference another rule |
| Token | token(a) | Squash a into single leaf node |
| Precedence | prec(1, a) | Resolve conflicts, higher number wins |
| Associativity | prec.left(a) or prec.right(a) | Group left-to-right or right-to-left |
Pattern Functions
Core DSL functions for building rules in grammar.js. All rules are functions accepting $, which provides access to other rules.
Basic Patterns
- Symbols (the $ object): reference other grammar rules.
  rules: {
    expression: $ => $.identifier, // Reference the 'identifier' rule
    identifier: $ => /[a-z]+/
  }
  Pitfall: avoid rule names starting with MISSING or UNEXPECTED, as tree-sitter test treats them specially.
- String literals: match exact text (become anonymous nodes in the AST).
  rules: {
    keyword: $ => 'if',   // Matches exactly "if"
    semicolon: $ => ';',  // Matches exactly ";"
  }
- Regular expressions: match patterns (become anonymous nodes in the AST).
  rules: {
    number: $ => /\d+/,                           // Digits
    identifier: $ => new RustRegex('(?i)[a-z]+')  // Case-insensitive
  }
  Tree-sitter compiles JS regexes to C using Rust's regex syntax, ignoring JS semantics. Lookaheads (?=) and lookbehinds (?<=) are not supported (they break LR(1) parsing).
- Sequences: match rules in strict order.
  rules: {
    return_statement: ($) => seq("return", $.expression, ";"),
  }
- Choices: match any one rule (logical OR).
  rules: {
    literal: ($) => choice("true", "false", $.number),
  }
  Order does not matter. If choices overlap, use prec() to resolve conflicts.
Repetition and Optionality
- Repetitions: match a rule zero or more times (repeat) or one or more times (repeat1).
  rules: {
    block: $ => seq('{', repeat($.statement), '}'), // 0+
    list: $ => repeat1($.identifier),               // 1+
  }
  Tip: use repeat(X) instead of optional(repeat1(X)).
- Options: match a rule zero or one time.
  rules: {
    class_declaration: ($) => seq(optional("export"), "class", $.identifier),
  }
Advanced Patterns
- Tokens: squash a complex sequence of strings and regexes into a single leaf node.
  rules: {
    html_comment: ($) => token(seq("<!--", /.*/, "-->")),
  }
  token() only accepts terminal rules (strings, regexes, seq, choice). Cannot pass non-terminal symbols (like $.statement).
- Immediate tokens: force a token to match only when physically touching the previous token (no intervening whitespace).
  rules: {
    decorator: ($) => seq("@", token.immediate(/[a-zA-Z_]\w*/)),
  }
- Field names: give child nodes semantic names for easier access.
  rules: {
    assignment: ($) => seq(field("target", $.identifier), "=", field("value", $.expression)),
  }
- Aliases: rename a rule's node type in the AST.
  rules: {
    expression: ($) => alias($.identifier, $.variable), // Appears as a 'variable' node
  }
Conflict Resolution
When multiple grammar rules could match the same input, use precedence and associativity to disambiguate.
- Precedence (prec(number, rule)): assign numerical priority to resolve conflicts. Higher number wins.
  rules: {
    unary_expression: $ => prec(1, seq('-', $.expression)),
    binary_expression: $ => seq($.expression, '-', $.expression)
  }
  There are two flavors of precedence depending on context:
  - Parse precedence (structural): resolves LR(1) conflicts at compile time. Higher number wins.
  - Lexical precedence (tokens): resolves which token definition matches at the character level. Wrap in token().
  Default precedence is 0. Use negative numbers to deprioritize. To apply lexical precedence, nest inside token(): token(prec(1, 'keyword')).
- Associativity: when precedence values are equal, associativity acts as a tie-breaker.
  rules: {
    subtraction: ($) => prec.left(1, seq($.expr, "-", $.expr)), // (1 - 2) - 3
  }
  rules: {
    assignment: ($) => prec.right(1, seq($.id, "=", $.expr)), // a = (b = c)
  }
  Only matters when precedence is tied. If rule A has precedence 2 and rule B has precedence 1, associativity is ignored.
- Dynamic precedence: resolves true runtime ambiguities during GLR parsing.
  rules: {
    ambiguous_rule: ($) => prec.dynamic(1, choice($.pattern_a, $.pattern_b)),
  }
  While prec() resolves conflicts at grammar compile time, prec.dynamic() scores competing parse trees at runtime. Tree-sitter explores all ambiguous branches (GLR), then picks the tree with the highest total dynamic precedence score. (See the Strange Loop talk for details.)
- Reserved keywords: contextually override forbidden keywords in specific rules, allowing keywords to be used as identifiers in some places.
  export default grammar({
    name: "my_language",
    reserved: {
      global: ($) => ["if", "else", "class"],
      allow_all: ($) => [],
    },
    rules: {
      variable_name: ($) => $.identifier, // Uses the global reserved set
      property_name: ($) => reserved("allow_all", $.identifier), // Allows keywords like `obj.if`
    },
  });
Top-Level Configuration
Beyond rules, tree-sitter grammars accept several top-level config fields that fine-tune parsing behavior, optimize tree structure, and handle special cases. Each solves a specific problem that grammar rules alone cannot address.
graph TD
A["Top-level Configs"] --> B["Structure"]
A --> C["Behavior"]
A --> D["Performance"]
B --> B1["inline: Inline nodes"]
B --> B2["supertypes: Abstract groups of grammar rules"]
C --> C1["extras: Allowed anywhere (trivia tokens that have no meanings)"]
C --> C2["conflicts: GLR fallback"]
C --> C3["reserved: Context keywords"]
C --> C4["externals: Custom lex"]
D --> D1["word: Keyword optimization"]
D --> D2["precedences: Named levels"]
Quick Reference
| Config | Purpose | Use when |
|---|---|---|
extras | Allow tokens anywhere (whitespace, comments) | You want to skip whitespace/comments without mentioning them in rules |
inline | Hide intermediate rule nodes from AST | Helper rules clutter the tree but organize your grammar |
conflicts | Enable GLR for ambiguous rules | Grammar has intentional ambiguities that can’t be resolved by precedence |
externals | Custom C lexer for context-sensitive tokens | Language needs stateful token rules (Python indentation, template strings) |
precedences | Named precedence levels instead of numbers | You want readable precedence definitions instead of magic numbers |
word | Keyword extraction optimization | Language has many keywords, want fast lookup |
supertypes | Abstract node categories | Multiple rules represent one semantic concept (all expressions, all statements) |
reserved | Contextual keywords | Words are keywords in some contexts, identifiers in others (e.g., obj.if) |
extras: Allow Tokens Anywhere
Allow tokens like whitespace and comments to appear anywhere (between tokens) without polluting the main rules.
export default grammar({
name: "my_language",
extras: ($) => [
/\s/,
/\/\/.*$/, // Line comments
/\/\*[\s\S]*?\*\//, // Block comments
],
rules: {
source_file: ($) => repeat($.statement),
statement: ($) => seq($.keyword, $.expression, ";"),
},
});
Defaults to whitespace. One can turn this off by specifying extras: $ => [].
inline: Flatten Intermediate Nodes
Some rules are lightweight wrappers around a set of other rules - they themselves should not appear in the AST to reduce noise. Essentially, we want both:
- The ease of writing grammars by utilizing helper grammar rules.
- The ease of traversing the parse tree without the noise of the helper nodes.
inline is used to eliminate nodes corresponding to helper rules in the AST, as if the helper rules are written inline in the rules they are used in.
export default grammar({
name: "my_language",
inline: ($) => [$._expression, $._statement],
rules: {
source_file: ($) => repeat($._statement),
_statement: ($) => choice($.if_statement, $.while_statement, $.assignment),
if_statement: ($) => seq("if", $._expression, $.block),
_expression: ($) => choice($.binary_op, $.number, $.identifier),
binary_op: ($) => seq($._expression, /[+\-*/]/, $._expression),
},
});
At compile time, _expression and _statement are expanded into their call sites. The runtime AST omits these intermediate nodes, simplifying tree navigation.
conflicts: Enable GLR for Ambiguities
conflicts is an array of rule sets (each itself an array of rules).
Some conflicts are inherent in a language; conflicts marks a set of rules as intentionally conflicting. These conflicts are resolved by GLR branching, and ties are broken in favor of the rule with the highest total dynamic precedence.
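As a sketch of the shape (the rule names here are illustrative, not from any real grammar), marking two intentionally ambiguous rules as a conflict set looks like:

```javascript
export default grammar({
  name: "my_language",
  // Each inner array is one set of rules allowed to conflict;
  // GLR branching explores both interpretations at parse time.
  conflicts: ($) => [[$.tuple, $.parenthesized_expression]],
  rules: {
    // ... definitions of tuple, parenthesized_expression, etc.
  },
});
```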
externals: Custom Lexing
Some languages have context-sensitive lexical rules that regexes (which can only express regular languages) cannot express. Common examples include:
- Indentation-based tokens (Python’s INDENT/DEDENT).
- Heredocs in Bash and Ruby.
- Percent strings in Ruby.
- Template string boundaries.
- Haskell’s significant whitespace.
External scanners allow plugging in custom C code to handle these advanced cases.
precedences: Named Precedence Levels
Using magic numbers to manage precedence can be fragile:
- If a new precedence level needs to be introduced, renumbering can break the precedence system.
- The magic numbers do not represent anything meaningful on their own.
precedences is an array of arrays of strings. Each inner array defines a list of named precedence levels in decreasing order; names in different inner arrays have no relative ordering.
export default grammar({
  name: "my_language",
  precedences: ($) => [
    // One ordering, highest precedence first
    [
      "multiplication",
      "addition",
      "comparison",
      "equality",
      "logical_and",
      "logical_or",
    ],
  ],
  rules: {
    logical_or: ($) =>
      prec.left("logical_or", seq($._expression, "||", $._expression)),
    logical_and: ($) =>
      prec.left("logical_and", seq($._expression, "&&", $._expression)),
  },
});
word: Keyword Extraction Optimization
word names the token that keywords will be matched against, enabling the keyword extraction optimization. Presumably the optimization can only handle one such designated token, so a grammar declares at most one word token.
export default grammar({
name: "my_language",
word: ($) => $.identifier_token,
rules: {
if_statement: ($) => seq("if", "(", $.expression, ")", $.block),
identifier_token: ($) => token(prec(1, /[a-zA-Z_]\w*/)),
},
});
supertypes: Abstract Node Categories
Multiple different grammar rules represent a single semantic concept (e.g., binary_op, unary_op, and function_call are all expressions), but they appear as separate nodes in the tree.
supertypes allows rules to be marked as supertypes; tree-sitter will automatically hide them from the parse tree, while still letting you reference them by name in the grammar.
reserved: Contextual Keywords
Some words are keywords in one context (e.g., if as a statement) but valid identifiers in another (e.g., obj.if as a property name).
grammar({
  name: "example",
  reserved: {
    // The first set is the global set, applied by default
    global: ($) => ["if", "else", "function"], // Globally forbidden as identifiers
    property_keywords: ($) => [], // Empty set to allow all words
  },
  rules: {
    // Uses the global reserved set; 'if' will fail to match as a variable
    variable: ($) => $.identifier,
    // Explicitly uses the empty 'property_keywords' set to allow 'if'
    property_name: ($) => reserved("property_keywords", $.identifier),
  },
});
Solution: Define named reserved word sets, apply the global set everywhere, and override with a permissive set in specific contexts.
The structure is similar to the rules property:
- Each reserved rule must be a terminal token.
- Each reserved rule must exist and be used in the grammar.
- The first reserved word set is the global word set, which applies by default. The reserved function can be called to switch to a different set in specific contexts.
Tree-sitter Grammar Guideline
Source: tree-sitter
Link: https://tree-sitter.github.io
Takeaways
We want to choose the best CFG, among the infinite number of possible ones, to describe a language.
The best CFG is determined by:
- Direct correspondence between symbols in the grammar and the recognizable constructs in the language (ignoring non-relevant symbols like trivia) -> easier to analyze.
- Should closely adhere to LR(1) for performance reasons.
Tree-sitter grammars are said to differ from Yacc and Bison grammars (in terms of grammar authoring) & to be different in nature from ANTLR grammars, PEGs, and the ambiguous grammars used in language specifications. (? Why)
Design Phase
Criteria for a tree-sitter grammar:
- Intuitive mapping: Grammar symbols should map to recognizable language constructs. The AST should mirror how developers think about code, not formal specifications.
- Breadth-first development: Start with skeleton rules covering major categories (declarations, statements, expressions), then incrementally flesh out details. Test frequently.
- LR(1) compatibility: Tree-sitter uses GLR parsing and performs best with LR(1)-compatible grammars (like Yacc/Bison), not ANTLR or PEG approaches.
Design Process
- Start with a formal specification of the language (probably containing a CFG, though this can be hard to trivially translate into a tree-sitter grammar).
- Work out the just-enough high-level constructs of the programming language, typically: Declarations, Definitions, Statements, Expressions, Types and Patterns.
- The start rule for the grammar is the first property in the rules object.
- Perform a breadth-first design: cover the language in breadth, not just a subset of it. Iteratively build down each high-level construct just enough, then switch to the other constructs. In the process, test often.
- Tests for each rule are placed in test/corpus.
Standard Rule Naming Conventions
Rule names are mostly unrestricted, but for the sake of comprehensibility, we should follow these established conventions:
| Convention | Example | Purpose |
|---|---|---|
| Root node | source_file | Represents entire file |
| Major constructs | expression, statement | Top-level language concepts, typically represented as a supertype/choice of the more specific sub-expression/sub-statement rules |
| Block scopes | block | Parent node for block scopes |
| Types | type | The types, such as int, choice and void |
| Identifiers | identifier | Variable names, function arguments and object fields, commonly used as the word token |
| Strings | string | Represent string literals |
| Comments | comment | Represent comments, commonly put in extras |
Good Grammar Rule Structure
Commonly, to write a grammar for an expression language, one defines a root rule that slowly propagates down through expression rules for operators of increasing precedence:
Expression -> Assignment
Assignment -> Conditional
Conditional -> Or
Or -> And
And -> Equality
Equality -> Comparison
Comparison -> Add
Add -> Mul
Mul -> Unary
However, the notorious problem with this is that it creates very deep nesting that is absolutely unnecessary. Furthermore, it’s hard to know which type of expression a node represents.
Although this section of the doc doesn’t specify how to properly do this, prec should probably be leveraged here to get the AST that is most appropriate from the user’s perspective.
Update: check out the Precedence Usage section below.
Precedence Usage
Continue from the above section.
In short, tree-sitter provides us with a whole lot of utilities for building more ergonomic ASTs, instead of weirdly deep structures.
grammar({
rules: {
// ...
expression: ($) =>
choice(
$.unary_expression,
$.binary_expression,
// ...
),
unary_expression: ($) =>
choice(
seq("-", $.expression),
// ...
),
binary_expression: ($) =>
choice(
seq($.expression, "*", $.expression),
seq($.expression, "-", $.expression),
// ...
),
},
});
This is a flat representation, but is highly ambiguous. Tree-sitter would complain:
Error: Unresolved conflict for symbol sequence:
'-' _expression • '*' …
Possible interpretations:
1: '-' (binary_expression _expression • '*' _expression)
2: (unary_expression '-' _expression) • '*' …
Possible resolutions:
1: Specify a higher precedence in `binary_expression` than in the other rules.
2: Specify a higher precedence in `unary_expression` than in the other rules.
3: Specify a left or right associativity in `unary_expression`
4: Add a conflict for these rules: `binary_expression` `unary_expression`
This is reasonable, as tree-sitter is unsure whether - 3 * 2 should be parsed as - (3 * 2) or (-3) * 2.
The • character indicates the ambiguity point.
Remedy with prec:
{
// ...
unary_expression: ($) =>
prec(
2,
choice(
seq("-", $.expression),
seq("!", $.expression),
// ...
),
);
}
In this case, the - 3 * 2 ambiguity is resolved.
Associativity Usage
This case still causes an ambiguity for 1 * 2 * 3 - because both * operators have the same precedence, these two derivations are both valid:
1 * 2 * 3 -> 1 * (2 * 3)
// expression
// -> binary_expression
// -> expression '*' expression
// -> number '*' (binary_expression)
// Result: (binary_expression 1 '*' (binary_expression 2 '*' 3))
1 * 2 * 3 -> (1 * 2) * 3
// expression
// -> binary_expression
// -> expression '*' expression
// -> binary_expression '*' number
// Result: (binary_expression (binary_expression 1 '*' 2) '*' 3)
prec.left and prec.right will tell tree-sitter whether to prioritize the former or the latter case.
In this case, prec.left should be used:
{
// ...
binary_expression: $ => choice(
prec.left(2, seq($.expression, '*', $.expression)),
prec.left(1, seq($.expression, '+', $.expression)),
// ...
),
}
Conflicts Usage
By default, tree-sitter marks all conflicts as errors. But they can be desirable.
The example in tree-sitter I think is a bit weird, as I think we can disambiguate array and array_pattern from the surrounding context alone (like on the left side of assignments or in declarations).
To mark something as intentionally conflicting, we use the top-level conflicts field:
{
name: "javascript",
conflicts: $ => [
[$.array, $.array_pattern],
],
rules: {
// ...
},
}
Hidden Rules (_rule)
Rules starting with _ are hidden from the AST, though they still exist in the parser.
I found this to be quite similar to inlined rules, but it seems that they differ a bit in that inlined rules are totally eliminated from the parser, but hidden rules are still known to the parser. I wonder how that would be different from the user’s perspective.
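For illustration (the rule names below are hypothetical), a hidden rule is just an ordinary rule whose name starts with an underscore:

```javascript
export default grammar({
  name: "my_language",
  rules: {
    source_file: ($) => repeat($._declaration),
    // `_declaration` participates in parsing but produces no node:
    // its children appear directly under `source_file` in the tree.
    _declaration: ($) =>
      choice($.function_declaration, $.variable_declaration),
    function_declaration: ($) => seq("func", $.identifier, "(", ")"),
    variable_declaration: ($) => seq("var", $.identifier, ";"),
    identifier: ($) => /[a-z]+/,
  },
});
```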
Fields Usage
Children can be given field names to allow named access instead of indexed access.
function_definition: ($) =>
seq(
"func",
field("name", $.identifier), // this one can be accessed via `ts_node_child_by_field_name`
field("parameters", $.parameter_list),
field("return_type", $._type),
field("body", $.block),
);
Extras Usage
“Extra” tokens like comments, spaces can appear between any two significant tokens.
grammar({
name: "my_language",
extras: ($) => [
/\s/, // whitespace
$.comment,
],
rules: {
comment: ($) =>
token(
choice(seq("//", /.*/), seq("/*", /[^*]*\*+([^/*][^*]*\*+)*/, "/")),
),
},
});
For complicated extras tokens, it’s preferable to associate the pattern with a named rule, to avoid bloating the parser (tree-sitter inlines bare patterns everywhere). However, bare patterns are fine for whitespace character classes like \s or [ \t\n\r], as tree-sitter auto-detects these.
Supertypes Usage
For abstract categories of syntax nodes (“expression”, “type”, “declaration”), they are represented as choices between other sub-rules:
expression: $ => choice(
$.identifier,
$.unary_expression,
$.binary_expression,
$.call_expression,
$.member_expression,
// ...
),
But, this causes expression to be added as a syntax node, creating an unnecessary nesting level. To eliminate these, we use supertypes:
module.exports = grammar({
name: "javascript",
supertypes: ($) => [$.expression],
rules: {
expression: ($) =>
choice(
$.identifier,
// ...
),
// ...
},
});
So how does supertypes differ from inlined rules or hidden rules? supertypes is optimized for this use case, and it can be queried (pattern matched), unlike inlined or hidden rules.
Deep-dive & Optimization Guide - Lexical Analysis
Tree-sitter’s parsing process = Parsing + Lexing.
Token Conflict Resolution
Token conflicts are much more common than grammar rule conflicts, such as the keyword "if" and identifiers /[a-z]+/.
There are various scenarios of token conflicts:
- Context-aware lexing: Tree-sitter performs lexing on-demand during parsing -> the lexer only matches tokens that are possible at the current position in the file.
- Lexical precedence: Precedence values inside a token function allow the lexer to break ties when two tokens can be matched at a given position.
- Match length: When multiple tokens of the same precedence can be matched at a given position, the token that matches the longest sequence is chosen.
- Match specificity: If multiple tokens of the same precedence match the same sequence of characters, bare-string tokens are prioritized over regexp tokens.
- Rule order: Otherwise, the first token in the grammar is prioritized.
In short, this is the prioritization order (only tokens that can appear at a given position are considered):
1. The token precedence specified via prec.
2. The length of the matched token.
3. String over regexp.
4. The earlier rule in the grammar is prioritized.
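The tie-breaking order above can be sketched as a comparator. This is a minimal illustration of the prioritization logic, not tree-sitter's actual implementation; the candidate-token shape is made up for the example.

```javascript
// Each candidate token: { name, precedence, matchLength, isString, ruleIndex }.
// Returns the token tree-sitter would (conceptually) pick at one position.
function pickToken(candidates) {
  return [...candidates].sort(
    (a, b) =>
      b.precedence - a.precedence || // 1. higher lexical precedence wins
      b.matchLength - a.matchLength || // 2. longer match wins
      (b.isString ? 1 : 0) - (a.isString ? 1 : 0) || // 3. string over regexp
      a.ruleIndex - b.ruleIndex // 4. earlier rule in the grammar wins
  )[0];
}

// "if" (bare-string token) vs the identifier regex, both matching "if":
console.log(
  pickToken([
    { name: "identifier", precedence: 0, matchLength: 2, isString: false, ruleIndex: 5 },
    { name: "if", precedence: 0, matchLength: 2, isString: true, ruleIndex: 9 },
  ]).name
); // "if"
```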
External scanners can produce tokens that are treated differently from regular tokens.
Lexical Precedence & Parse Precedence
- Lexical Precedence: Decides which token matches a specific piece of raw text. This is a lower-level operation and happens first.
- Parse Precedence: Decides which structural rule should be chosen to interpret a sequence of already-extracted tokens.
- Tip: when you get completely stuck debugging a grammar, it is almost always a lexical precedence problem, not a structural one.
- Syntax distinction:
  - token(prec(N, ...)) = applies lexical precedence (prioritizes the token generation).
  - prec(N, token(...)) = applies parse precedence (prioritizes the structural rule, which won’t help if the lexer is grabbing the wrong token).
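A hedged sketch of the distinction (the rule names are made up for illustration):

```javascript
// Inside a grammar's `rules`:
{
  // token(prec(...)): lexical precedence - nudges the LEXER to prefer
  // this token when several tokens could match the same text.
  regex_literal: ($) => token(prec(2, /\/[^/\n]+\//)),

  // prec(...) outside the token: parse precedence - ranks the RULE during
  // parsing; the lexer has already picked its token, so this cannot fix
  // a wrong tokenization.
  division: ($) => prec(1, seq($.expression, "/", $.expression)),
}
```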
Keywords
- Problem: In many languages, keywords (like if or instanceof) are just specific strings that overlap with your generic identifier regex (like /[a-z]+/).
- Example:
  if (a instanceofSomething) b();
  //    ^^^^^^^^^^^^^^^^^^^
  //    we're here
- Tree-sitter uses context-aware lexing:
  - In a normal JavaScript grammar, the parser expects a keyword/operator/etc. there (rather than an identifier).
  - Therefore, instanceofSomething would never be matched as an identifier by tree-sitter in this position.
  - Instead, tree-sitter would likely match the keyword first (instanceof), then chop it off and make Something the following identifier - which is incorrect, as an identifier cannot immediately follow a keyword.
- Tree-sitter's solution is the word top-level configuration rule.
- With word, tree-sitter uses a 2-step process:
  - It first matches the word token at the current position.
  - The matched text is then checked against the keywords valid there; if it isn't a keyword, lexing proceeds as normal.
- Benefits:
  - Improved error detection.
  - (Side effect) A smaller & simpler lexing function, which is more efficient.

To avoid ambiguity, the word token must not be reused in another rule. To allow reuse, alias the word token:

_type_identifier: $ => alias($.identifier, $.type_identifier),
So basically, alias works somewhat like inlined or hidden rules, but with different use cases in mind.
External Scanners
External scanners are custom scanners plugged into tree-sitter. These are needed as the approach of using regexes to specify tokens does not suffice for some non-regular token grammars such as:
- Off-side rule (indent and dedent tokens in Python).
- Heredocs in Bash & Ruby.
Tree-sitter resolves this by allowing authors to plug in external scanners.
Usage
1. Add an externals top-level configuration:

   grammar({
     // ...
     externals: ($) => [$.indent, $.dedent, $.newline],
   });

   This declares the names of all the external tokens, which can then be used elsewhere in the grammar.

2. Add src/scanner.c (mandatory) as a C source file containing the external scanner.

3. In src/scanner.c, define an enum containing the names of the external tokens. The order must match the order in the grammar's externals (names can differ).

   #include "tree_sitter/parser.h"
   #include "tree_sitter/alloc.h"
   #include "tree_sitter/array.h"

   enum TokenType {
     INDENT,
     DEDENT,
     NEWLINE,
   };

4. Write the custom scanning logic based on 5 actions: create, destroy, serialize, deserialize and scan.

   - create: Create the scanner object.

     void *tree_sitter_<your_language>_external_scanner_create() {}

   - destroy: Free the scanner object's memory.

     void tree_sitter_<your_language>_external_scanner_destroy(void *scanner) {}

   - serialize: Serialize the scanner's state into a byte buffer & return the number of written bytes. This is called by tree-sitter after the external scanner successfully recognizes a token.

     unsigned tree_sitter_<your_language>_external_scanner_serialize(
       void *scanner,
       char *buffer
     ) {}

   - deserialize: Restore the state of the scanner from the written bytes.

     void tree_sitter_<your_language>_external_scanner_deserialize(
       void *scanner,
       const char *buffer,
       unsigned length
     ) {}

   - scan: Tree-sitter triggers this method with the set of valid symbols for the external scanner to scan.

     bool tree_sitter_<your_language>_external_scanner_scan(
       void *scanner,
       TSLexer *lexer, // the tree-sitter lexer exposed to the external scanner
       const bool *valid_symbols // valid_symbols[TOKEN_TYPE] indicates whether that token type is currently valid
     ) {}

   The names follow the format tree_sitter_<your_language>_external_scanner_<action>.

So basically, we can think of src/scanner.c as an interface with methods that tree-sitter calls. Alternatively, we can think of src/scanner.c as utilizing these action hooks to hook into tree-sitter.
Utils
See more here: https://tree-sitter.github.io/tree-sitter/creating-parsers/4-external-scanners.html#external-scanner-helpers
Tree-sitter Query System
Source: tree-sitter
Link: https://tree-sitter.github.io
Takeaways
- Separate grammar from navigation: Queries enable applications without modifying grammar, keeping grammar focused on parsing and semantics.
- Declarative patterns are more maintainable and portable than imperative traversal code written in each binding language.
- Portability: The same .scm query file works across all language bindings and tools for standardized highlighting and navigation.
- Architecture: The C core extracts queries; language bindings evaluate predicates and directives. This splits parsing (fast, in C) from filtering (flexible, in bindings).
Quick Reference
| Pattern | Purpose | Example |
|---|---|---|
| Capture | Extract a matched node | (identifier) @name |
| Wildcard | Match any node | (_) matches any named node |
| Field name | Match by field | left: (identifier) matches specific child |
| String literal | Match anonymous node | "if" or "(" |
| Negation | Must not have field | !type_parameters |
| Quantifier | Repetition | (arg)* or (arg)+ |
| Alternation | Match one of several | ["if" "while" "for"] |
| Anchor | Position constraint | . before/after/between children |
| Predicate | Conditional filter | (#eq? @name "foo") |
| Directive | Metadata attachment | (#set! language "javascript") |
Syntax
- Query: A series of patterns used to extract certain code structures from the source code.
- Pattern: An S-expression matching a certain set of syntax nodes.
Pattern’s structure:
(<node-type> ...<children-patterns>)
For example:
(binary_expression (number_literal) (number_literal))
would match a binary expression with two children of type number literal.
Children can be omitted; the child patterns that remain describe the expected order of those children.
(<node-type> <child-1> <child_2>)
; ^^^^^^^^ ^^^^^^^^
; <child_1> needs only precede <child_2>, some other nodes may be in between
For example:
(binary_expression (number_literal))
would match any binary expression with a number literal as its child.
Fields: Child patterns can be prefixed with field names so they correspond to fields in the grammar, and doing so is recommended:
(assignment_expression
left: (member_expression
object: (call_expression)))
If the grammar rules define the two fields left and object, then this query returns all assignment_expressions whose left is a member_expression whose object is a call_expression.
Negated fields: A field name prefixed with ! indicates a lack of such field.
(class_declaration
name: (identifier) @class_name
!type_parameters)
Anonymous nodes: (<pattern>) only applies to named nodes. To reference anonymous nodes, double-quote the textual content.
(binary_expression
operator: "!="
right: (null))
Wildcard nodes: _ and (_) are used to match any node. The only difference is that (_) matches any named node, while _ matches any node, named or anonymous.
(call (_) @call.inner)
; ^^^^^^^^^^^
; any nodes inside a call
ERROR nodes: Represent unrecognized pieces of text, and can also be queried.
(ERROR) @error-node
MISSING nodes: Missing expected nodes will be inserted as special missing nodes in the syntax tree.
- To recover from syntax errors, the parser inserts zero-width “phantom” tokens (like a forgotten semicolon) to keep the syntax tree valid.
- Instead of allocating a completely new, heavyweight structural node in memory, it simply attaches an internal “missing” flag to that generated token.
- This approach saves memory and processing power because the token occupies zero physical characters in your source code.
Something’s a bit confusing here - Tree-sitter seems to mention nodes and tokens interchangeably.
(MISSING) @missing-node
(MISSING identifier) @missing-identifier
(MISSING ";") @missing-semicolon
Supertype nodes: Supertypes in queries can be used to match any subtypes.
Supertype here means roughly what it does in OOP: a grammar rule that can specialize into multiple sub-grammar rules. The supertype itself won’t be a standalone node in the AST - it’s more like a union type.
(expression) @any-expression
; ^^^^^^^^^
; match any subtypes of `expression`
Match specific subtypes (named or anonymous):
(expression/binary_expression) @binary-expression
(expression/"()") @empty-expression
Operators
The @ operator: Specific nodes in the pattern can be extracted by suffixing the node with a capture name:
(assignment_expression
left: (identifier) @the-function-name
; ^^^^^^^^^^^^^^^^^
right: (function))
The quantification operators + & *: Similar to those in regular expressions.
Sequence group:
- The sequence group () creates an anonymous block of sibling nodes.
- Its primary use case is grouping nodes so you can apply quantifiers (*, +, ?) to an entire repeating sequence.
- Example: Matching a repeating comma-separated list using ( (",") (identifier) )*.
- It does not enforce strict adjacency; undeclared nodes can still appear between the grouped items.
- Use () to repeat a pattern, and use . (the anchor operator, see below) to glue nodes tightly together.
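As a sketch (the node names parameters and identifier are assumed, not from a specific grammar), grouping lets a quantifier span a comma-plus-element pair:

```scheme
;; Capture every parameter in a list like `f(a, b, c)`:
;; the group ("," (identifier)) repeats as a unit.
(parameters
  (identifier) @param
  ("," (identifier) @param)*)
```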
Alternation operator []: Similar to character classes in regular expressions.
[
"break"
"delete"
"else"
"for"
"function"
"if"
"return"
"try"
"while"
] @keyword
The alternants can be quantified.
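For example (node names assumed for illustration), a quantified alternation can capture a run of either comment style:

```scheme
;; One or more consecutive comments, line or block style.
[
  (line_comment)
  (block_comment)
]+ @doc-comment
```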
Anchor operator .: Used to constrain the ways in which child patterns are matched, with different behaviors based on its position in a query.
- Before the first child within the parent pattern: That child is only matched when it’s the first named node of the parent.
(array . (identifier) @the-element)
; ^^^^^^^^^^
; must be the first
Without the anchor above, @the-element would be bound to every identifier in the array, and accessing it would return an array of all identifiers.
- After the last child within the parent pattern: similarly, that child is only matched when it’s the last named node of the parent.
-
Between two child patterns: Match immediate siblings.
(dotted_name
(identifier) @prev-id
.
(identifier) @next-id)
Note that anonymous nodes are not taken into account.
Predicates
The predicate syntax #...?: Add arbitrary conditions or metadata to a pattern. Predicate names start with # and end with ?, and can accept captures or strings as arguments.
The eq? predicate: Compares a capture’s text against a string or another capture.
- Includes the variants #not-eq?, #any-eq?, #any-not-eq?.
- By default, quantified captures require all matched nodes to pass the predicate. The any- prefix overrides this to match if at least one node passes.
((identifier) @variable.builtin
(#eq? @variable.builtin "self"))
(
(pair
key: (property_identifier) @key-name
value: (identifier) @value-name)
(#eq? @key-name @value-name)
)
The match? predicate: Matches a capture’s text against a regular expression.
- Also supports the not- and any- prefixes.
((identifier) @constant
(#match? @constant "^[A-Z][A-Z_]+"))
The any-of? predicate: Checks if a capture’s text strictly equals any string in a provided list.
((identifier) @variable.builtin
(#any-of? @variable.builtin
"arguments"
"module"
"console"))
The is? predicate: Asserts that a capture has a specific internal property (e.g., using #is-not? local to ensure a matched variable is not a local variable).
Directives
The directive syntax #...!: Associates arbitrary metadata with a pattern or alters its output. They function similarly to predicates but end with ! instead of ?.
The set! directive: Associates arbitrary key-value metadata pairs with a matched pattern (commonly used for language injections).
((comment) @injection.content
(#match? @injection.content "/[*\/][!*\/]<?[^a-zA-Z]")
(#set! injection.language "doxygen"))
The select-adjacent! directive: Takes two capture names and filters the first capture’s text so that only nodes adjacent to the second capture are preserved.
The strip! directive: Takes a capture and a regular expression, removing any matched text from that capture’s final output.
Note
- The core Tree-sitter C library does not execute predicates or directives.
- It simply extracts them from your query and passes them along as raw metadata.
- It is entirely up to the higher-level bindings (like Rust, Python, JS, WASM) to read that metadata and actually run the filtering logic (like evaluating #eq?).
So, if a predicate isn’t working, it might be because the specific language binding you are using hasn’t implemented support for it yet, not necessarily because your query syntax is wrong.
Query API
Creating a query (ts_query_new): You create a query by passing in the language and a string containing your patterns.
- If it fails, error_offset indicates the exact byte where it failed.
- error_type indicates the cause (Syntax, NodeType, Field, or Capture).
TSQuery *ts_query_new(
const TSLanguage *language,
const char *source,
uint32_t source_len,
uint32_t *error_offset,
TSQueryError *error_type
);
Thread safety and state (TSQuery vs TSQueryCursor):
- TSQuery: The parsed query itself. It is immutable and can be safely shared across multiple threads.
- TSQueryCursor: Holds the state required to actually run the query. It is not thread-safe, but you can reuse a single cursor for many executions to save memory.
TSQueryCursor *ts_query_cursor_new(void);
Executing a query (ts_query_cursor_exec): Prepares the cursor to find matches within a specific syntax node (usually the root node of your parsed code).
void ts_query_cursor_exec(TSQueryCursor *, const TSQuery *, TSNode);
Iterating matches (ts_query_cursor_next_match): Steps through the successful query matches one by one.
- Returns true and populates the match structure, or false when there are no more matches.
- TSQueryMatch contains metadata (which pattern matched) and an array of captured nodes.
- TSQueryCapture contains the actual TSNode from the syntax tree and the index of its capture name.
typedef struct {
TSNode node;
uint32_t index;
} TSQueryCapture;
typedef struct {
uint32_t id;
uint16_t pattern_index;
uint16_t capture_count;
const TSQueryCapture *captures;
} TSQueryMatch;
bool ts_query_cursor_next_match(TSQueryCursor *, TSQueryMatch *match);
Tree-sitter Services and Applications
Source: tree-sitter
Link: https://tree-sitter.github.io
Takeaways
- Services decouple from grammar: Syntax highlighting, code navigation, and language injection are query-based, not grammar-based. Grammar describes “what is”, queries describe “how to use it”.
- Capture names are standardized (e.g., @keyword, @definition.function) instead of language-specific logic. The same queries work across all tools and editors.
- Scope and context via queries: Local variable tracking and language injection are queries, not lexical features. This keeps grammars simpler while enabling sophisticated analysis.
- Standardization enables portability: The same .scm query file and role/kind vocabulary work across editors and languages.
[tool] [seminal] .NET Compiler Platform (“Roslyn”) Overview
Source: dotnet/roslyn (GitHub)
Link: https://github.com/dotnet/roslyn/blob/main/docs/wiki/Roslyn-Overview.md
Keywords:
- Syntax tree
- Semantic model
Takeaway:
- Roslyn represents a revolutionary step: Exposing the internal compilation pipeline as a public API.
- Two main API layers:
- Lower-level compiler APIs for detailed code analysis.
- Higher-level workspace APIs for managing complete solutions within a host environment.
- Core data models representing the code are completely immutable and preserve full fidelity to enable thread-safe analysis without data duplication.
- ASTs are composed of 3 items: nodes (non-terminals), tokens (terminals), and trivia.
- Trivia are attached to adjacent tokens instead of being in the ASTs.
- Each AST item exposes a span.
- Tokens and Trivia are expressed as value types.
- The state flow relies on an active, mutable workspace containing strictly immutable solution snapshots.
Abstract
This document details the motivation behind the Roslyn team’s decision to expose the compiler APIs and its object model, highlighting the trend of “compiler as a platform”.
The important part is their object model and compiler APIs. Object model is basically the data that the compiler API accepts or returns.
Introduction
Compilers are traditionally black boxes:
- Source code goes in.
- Object code comes out.
- The internal model is discarded, even though it contains a deep understanding of the original code.
However, that model, built through parsing, name resolution, and type checking, is exactly what IDE features need. IntelliSense, Go to Definition, Find All References, intelligent rename - all of these require compiler-level understanding of the code.
Roslyn is the .NET Compiler Platform for C# and Visual Basic. Its mission is to open up that black box: instead of every tool rebuilding a partial, inaccurate approximation of what the compiler already computed, Roslyn exposes the compiler’s own data structures as a stable public API.
The same object models that power Visual Studio’s own language features are available to any external tool. The VS IDE was rebuilt on top of the public Roslyn APIs as a validation of this claim - if the public API isn’t sufficient for world-class IDE features, it isn’t good enough.
This advocates for the view of “compilers as platforms”. To be more precise, compilers provide necessary information that powers all code IntelliSense.
Exposing the Compiler APIs
Compiler Pipeline
Roslyn exposes to the consumer each phase of the traditional compiler pipeline as a separate, queryable component:
| Phase (API) | What it does | Exposed as (Object model) |
|---|---|---|
| Parse | Tokenize and parse source text | Syntax tree |
| Declaration | Analyze declarations + imported metadata to form named symbols | Hierarchical symbol table |
| Bind | Match identifiers to symbols | Semantic model |
| Emit | Produce IL bytecode | Emit API |
Every phase is independently accessible. You can parse a file with no compilation context, or query the semantic model of a specific expression without touching the emit phase.
A visualization from Roslyn:

Because each phase has a well-defined object model, the compiler combines these phases into an end-to-end whole.
As in the picture above, some language services map to some phases and object models.
API Layers

Two main layers of APIs: Compiler APIs and Workspaces APIs.
Compiler APIs
The lowest layer. Exposes the object models corresponding to each pipeline phase, both syntactic and semantic. No dependency on Visual Studio.
Contains:
- Object models for each pipeline phase:
- Syntax trees.
- Symbol tables.
- Semantic models.
- An immutable snapshot of a single invocation of a compiler:
- Assembly references.
- Compiler options.
- Source code files.
C# and VB have separate but shape-compatible APIs in this layer. This layer has no dependencies on Visual Studio components.
Diagnostic APIs
Extensible diagnostics (can be user-defined) cover:
- Syntax errors.
- Semantic errors.
- Definite assignment errors.
- Warnings.
- Informational.
Hosting/Scripting APIs
Used for:
- Code snippet execution.
- Accumulation of a runtime execution context.
The REPL utilizes these.
Workspaces APIs
The higher-level layer:
- Organizes an entire solution (projects, files, references, options) into a single object model.
- Handles project-to-project dependency resolution automatically.
- Gives direct access to compiler layer objects without manual setup.
- Surfaces higher-level IDE APIs:
- Find All References.
- Formatting.
- Code Generation.
No dependency on Visual Studio.
Parse API / The Syntax Tree Object Model
Purposes:
- Allow tools to process the syntactic structure of a project.
- Enable tools to manipulate source code.
This seems to document the public interface, not how it is all organized internally.
Overview:
- 3 main syntactic categories: nodes (non-terminal), tokens (terminal), trivia (attached to tokens).
- Each node/token/trivia has a `Span` and a `FullSpan`, containing the first position + the character count. `Span` may be computed on the fly; otherwise, AST edits would invalidate a large number of spans in the AST. Also, if implemented that way, `SyntaxToken` could not be reused. Therefore, a red-green tree design is probably used underneath.
- Nodes have properties for both `Parent` and children. Red-green trees must be involved here for the tree nodes to be immutable.
Syntax Trees
Essentially a tree data structure:
- Internal nodes are parents of other nodes.
- Made up of nodes, tokens and trivia.
Three key properties:
- Full fidelity/Lossless: every token, whitespace, comment, preprocessor directive, and syntax error is preserved. Nothing is thrown away.
- Round-trippability: any subtree can reconstruct its exact source text. Editing the tree is equivalent to editing the source.
- Immutability & Thread-safe:
- Trees are snapshots that do not change.
- Allow structural sharing & node-reusing.
- Allow thread-safe accesses without locks.
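These three properties are exactly what the red-green tree design (which Roslyn is known to use internally) provides. A minimal sketch of the idea in Python — names and shapes are illustrative, not Roslyn's actual types: immutable "green" nodes store only kind, children, and width (no absolute positions), so they can be shared freely; lightweight "red" nodes wrap them with a parent pointer and a position computed on the fly.

```python
class GreenNode:
    """Immutable, position-free node: stores kind, children, and total width."""
    def __init__(self, kind, children=(), width=0):
        self.kind = kind
        self.children = tuple(children)
        # Leaf width is given; interior width is the sum of child widths.
        self.width = width if not children else sum(c.width for c in children)

    def replace_child(self, index, new_child):
        """An 'edit' builds a new green node, sharing all untouched children."""
        kids = list(self.children)
        kids[index] = new_child
        return GreenNode(self.kind, kids)

class RedNode:
    """Facade over a green node: adds parent and absolute position on demand."""
    def __init__(self, green, parent=None, position=0):
        self.green, self.parent, self.position = green, parent, position

    @property
    def span(self):
        return (self.position, self.green.width)

    def child(self, index):
        offset = sum(c.width for c in self.green.children[:index])
        return RedNode(self.green.children[index], self, self.position + offset)

# a + b  -->  three tokens of width 1 each (ignoring trivia for brevity)
a, plus, b = (GreenNode(k, width=1) for k in ("ident", "plus", "ident"))
expr = GreenNode("binary_expr", [a, plus, b])

# "Edit": replace `b`; the `a` and `plus` green nodes are shared, not copied.
expr2 = expr.replace_child(2, GreenNode("ident", width=1))
assert expr2.children[0] is a  # structural sharing

# Spans are computed on the fly from widths, so edits don't invalidate them.
root = RedNode(expr)
assert root.child(2).span == (2, 1)
```

Because green nodes never change, snapshots of old trees stay valid and thread-safe even while new trees are being built from them.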
Syntax Nodes
- Represent non-terminal elements: declarations, statements, clauses, expressions.
- Always non-terminal nodes in the syntax tree; only tokens can be terminal nodes.
The class hierarchy:
- A base class: `SyntaxNode`.
- Each category of syntax node is a subclass of `SyntaxNode`.
- New, custom node classes cannot be added.

The classes expose:
- `Parent` property: a node’s parent. The root node has a null parent.
- `ChildNodes` method: a sorted list of child nodes (does not include tokens).
- A collection of `Descendant*` methods:
  - `DescendantNodes`: the list of all nodes in the sub-tree rooted at a node. (In what order, though?)
  - `DescendantTokens`: the list of all tokens in the sub-tree rooted at a node. (Is it sorted?)
  - `DescendantTrivia`: the list of all trivia in the sub-tree rooted at a node.
- Type-safe children properties, some of which can be optional.
  - Example: `BinaryExpressionSyntax` additionally has `Left`, `OperatorToken`, and `Right`.
Syntax Tokens
- Terminals: keywords, identifiers, literals, punctuation.
- Are never parents of other elements.
Implementation:
- Implemented as CLR value types (one struct for all token kinds) for performance. The decoded value is typed as `Object`.
- Carry both the raw source text and a decoded value (`Value` for the typed result, `ValueText` for the Unicode-normalized string).
Syntax Trivia
- Insignificant to the understanding of the code: whitespace, comments, preprocessor directives.
- Still kept for full fidelity and round-trippability.
Relationship with tokens:
- Not tree children in the normal sense; attached to adjacent tokens via `LeadingTrivia`/`TrailingTrivia`. A token owns any trivia after it on the same line up to the next token.
- Do not have a parent.
- The first trivia of a line belongs to the first token on that line.
Implementation:
- `Token` property: the owning token.
- Also CLR value types: `SyntaxTrivia`.

Spans
The position & length information of a node, token or trivia.
- Text position = A 32-bit integer that is a zero-based Unicode character index.
- Text span = A beginning position + A count of characters. If the count is zero, it refers to the position after the beginning position but before the next character.
A node has 2 `TextSpan` properties:
- `Span`: text range of the node excluding trivia.
- `FullSpan`: text range including leading/trailing trivia.
Both are zero-based Unicode character indices.
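A tiny worked example of the `Span`/`FullSpan` distinction — the token layout here is assumed for illustration, not produced by a real lexer:

```python
# Source: two leading spaces, the token "foo", two trailing spaces.
source = "  foo  "
leading, text, trailing = "  ", "foo", "  "

start = len(leading)          # token text begins after the leading trivia
span = (start, len(text))     # (2, 3): position + count, trivia excluded
full_span = (0, len(leading) + len(text) + len(trailing))  # (0, 7): trivia included

assert source[span[0]:span[0] + span[1]] == "foo"
assert full_span[1] - span[1] == len(leading) + len(trailing)
```

Note that `FullSpan`s of adjacent tokens tile the file exactly (full fidelity), while `Span`s leave gaps wherever trivia sits.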
Kinds
- Every node, token, or trivia has a `RawKind` property, castable to a language-specific `SyntaxKind` enum.
  - This disambiguates elements sharing the same class - e.g., a `BinaryExpressionSyntax` can be an `AddExpression`, a `SubtractExpression`, etc. So `RawKind` doesn’t have an inherent meaning?
- For tokens and trivia, `RawKind` is the only way to distinguish element types.

So basically, `SyntaxNode` is the low-level underlying representation for the “higher-level” node types appearing in a specific programming language?
Error Recovery
Malformed code still produces a full, round-trippable tree. Two strategies:
- Missing token: an expected but absent token is inserted as a zero-width node (`IsMissing = true`).
- Skipped tokens: the parser may skip tokens during error recovery; unrecognized tokens are attached as `SkippedTokens` trivia. So lexing and parsing must be happening at the same time, as otherwise `SyntaxToken` wouldn’t be immutable.
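Both strategies can be sketched with a toy recursive-descent parser — this is not Roslyn's parser; the grammar (`(` identifier `)`) and the token/node shapes are invented for illustration:

```python
def parse_paren(tokens):
    """Parse '(' ident ')' with Roslyn-style recovery (sketch):
    missing tokens become zero-width missing nodes; junk becomes trivia."""
    i, out = 0, {}
    for kind in ("lparen", "ident", "rparen"):
        skipped = []
        # Unrecognized tokens are skipped but preserved as trivia,
        # keeping the tree round-trippable.
        while i < len(tokens) and tokens[i][0] != kind:
            skipped.append(tokens[i]); i += 1
        if i < len(tokens):
            out[kind] = {"text": tokens[i][1], "missing": False, "skipped": skipped}
            i += 1
        else:
            # Expected token absent: fabricate a zero-width missing token.
            out[kind] = {"text": "", "missing": True, "skipped": skipped}
    return out

# "(x" -- the ")" is inserted as a zero-width missing token.
tree = parse_paren([("lparen", "("), ("ident", "x")])
assert tree["rparen"]["missing"] and tree["rparen"]["text"] == ""

# "(x @@ )" -- the unrecognized "@@" is attached as skipped-token trivia.
tree2 = parse_paren([("lparen", "("), ("ident", "x"),
                     ("junk", "@@"), ("rparen", ")")])
assert tree2["rparen"]["skipped"] == [("junk", "@@")]
```

Either way, every input character ends up somewhere in the tree, so even malformed code round-trips back to its exact source text.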
Declaration/Binding API + Hierarchical symbol table/Semantic model
- Name resolution.
- References to compiled libraries.
Compilation
- Represents everything needed to compile a program: source files, assembly references, compiler options.
- Contains methods for finding and relating the symbols declared in source code or metadata from compiled libraries.
- Immutable - changes produce a new `Compilation` derived from the old one.
Entry point for all semantic queries: symbol lookup by name, access to the global namespace tree, getting a SemanticModel for any source file.
Symbols
Overview:
- Represent declared elements in the source code or in compiled libraries’ metadata.
- Namespaces, types, methods, properties, fields, events, parameters are all represented by symbols.
- A `Compilation` has methods and properties to retrieve symbols or the symbol table as a tree of symbols.
Class hierarchy:
- `ISymbol`: the interface that all symbols implement.
Symbol implementation:
- A symbol’s methods and properties can retrieve additional information, such as:
- Other referenced symbols.
Effects:
- A consistent representation for namespaces, types, and members across both source code and metadata.
- Elements from source code and imported metadata are treated exactly the same (e.g., a method from source code and a method from metadata are both represented by an `IMethodSymbol` with identical properties).
Symbols are somewhat similar to the CLR type system (the `System.Reflection` API), but they can model more than types: they are a representation of language concepts.
For instance, an iterator method is:
- A single language symbol.
- A CLR type plus multiple CLR methods.
Semantic Model
Scoped to a single source file. Answers:
- What symbol does this identifier resolve to?
- What is the type of this expression?
- What diagnostics apply here?
- How do variables flow in/out of this region?
Obtained from a `Compilation` by passing a `SyntaxTree`. Queries are lazy and cached.
Workspace API
- Entrypoint for whole-solution analysis and refactoring.
- Organizes information about all projects in a solution into one object model, with direct access to the compiler layer’s object models.
- In host environments (e.g. IDEs), a workspace corresponds to an open solution.
- Loading a solution also goes through this model.
Workspace
- Workspace: An active representation of a solution.
- Solution: A collection of projects.
- Project: A collection of documents.
- A workspace mirrors a host environment, which is constantly changing.
- Can be created and run independently without needing a host environment or UI.
- Update mechanism:
  - Fires events when the host environment changes (e.g. the user typing).
  - Automatically updates the `CurrentSolution` property.
  - The fired event indicates which workspace and which particular document changed, so consumers can listen for these.
Solutions, Projects, Documents
Overview
- Workspaces change continuously with user input, but they provide access to isolated, immutable solution models.
- Solutions are thread-safe snapshots of projects and documents that can be shared without locking or duplication.
Component hierarchy
- Projects are immutable solution components that bundle source documents, references, and options to grant direct access to compilations.
- Documents are immutable project components representing individual source files that provide access to raw text, syntax trees, and semantic models.
Modifications
- A solution instance obtained from the workspace will never change.
- You apply updates by constructing a new solution instance based on your specific changes and explicitly applying it back to the workspace.
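This snapshot-and-apply pattern can be sketched in a few lines of Python — the names (`Solution`, `Workspace`, `try_apply_changes`) echo the Roslyn concepts but are hypothetical simplifications, not the real API surface:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Solution:
    documents: tuple  # (path, text) pairs; frozen => an immutable snapshot

    def with_document_text(self, path, text):
        """Return a NEW solution with one document changed; old one untouched."""
        return Solution(tuple(
            (p, text if p == path else t) for p, t in self.documents))

class Workspace:
    def __init__(self, solution):
        self.current_solution = solution

    def try_apply_changes(self, new_solution):
        # A real implementation would detect conflicting concurrent edits here.
        self.current_solution = new_solution
        return True

ws = Workspace(Solution((("a.cs", "class A {}"),)))
snapshot = ws.current_solution                     # never changes under us
edited = snapshot.with_document_text("a.cs", "class A { int X; }")
ws.try_apply_changes(edited)

assert snapshot.documents[0][1] == "class A {}"    # old snapshot still intact
```

The payoff is that analyses can run against `snapshot` on background threads without locks, even while the workspace moves on to newer solutions.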
Visualization of the relationship from Roslyn:

Using Roslyn APIs to Build a .NET IL Interpreter
Name Resolution
Name resolution is about figuring out what each identifier in the code actually refers to. Given a variable, function call, or type annotation, which declaration does it point to? This requires understanding scoping rules, imports, visibility modifiers, and the module system of the language in question.
Getting this right is critical for go-to-definition, find-all-references, and rename refactoring. The approaches here range from formal scope graph models to Datalog-based relational encodings to tree-sitter-based graph construction DSLs.
Scoping rules vary wildly across languages, so Loupe needs a resolution model that encodes rules as data, not code (language-agnostic). When a file changes, only affected names should be re-resolved (incremental). And since every go-to-definition and rename triggers resolution, it has to be instant (fast).
Subpages
- Scope Graph: a formal model where scoping rules are encoded as a graph of declarations, references, and scope edges. Language-agnostic by design.
- Stack Graphs: GitHub’s adaptation of scope graphs for language-agnostic code navigation, designed to work without a full type checker.
- `tree-sitter-graph` DSL: a DSL for constructing graphs (including stack graphs) directly from tree-sitter parse trees.
Scope Graph
[tool] [tool-dive] Stack Graphs
Source: GitHub (tree-sitter team)
Link: https://github.com/github/stack-graphs
tree-sitter-graph DSL
Type Checking
Type checking is about verifying that operations in the code are applied to values of compatible types. Beyond catching errors, type information powers features like auto-completion, signature help, and more precise code navigation.
The challenge for a tool like Loupe is doing this in a language-agnostic or at least language-parametric way, without reimplementing a full type checker for every supported language.
This is where the language-agnostic goal is most in tension with correctness. Loupe can’t reimplement every language’s type system, so the question is how much it can infer in a language-parametric way and when to defer to external checkers (e.g., via LSP). Whatever type reasoning Loupe does must be incremental (no full re-checks on edits) and fast enough not to block the user from doing meaningful work. Refactoring should be fast, especially in the scaffolding phase.
Subpages
- Polymorphic Type Inference Engines & Template Checking: research on inference algorithms that can handle parametric polymorphism and template-style generics across languages.
- Statix & Scopes-as-Types: a constraint-based specification language that builds on scope graphs, treating scopes and types uniformly.
- Relational Semantics via Datalog & Souffle: encoding name resolution as Datalog relations, enabling declarative and incremental resolution using the Souffle engine.
- Lady Deirdre Framework: a Rust framework for building incremental compilers and analyzers, with a focus on error-resilient parsing.
Polymorphic Type Inference Engines & Template Checking
Statix
Relational Semantics via Datalog & Soufflé
Lady Deirdre Framework
Spoofax Language Workbench
Code Property Graphs (CPG) & the Joern Framework
Code Indexing & Navigation
Code indexing is about pre-computing a structured index of symbols, references, and relationships in a codebase so that navigation queries (go-to-definition, find references, call hierarchy) can be answered quickly without re-analyzing the code on every request.
The main design tension is between index freshness and cost. A full re-index on every change is expensive, but a stale index gives wrong answers. Formats like SCIP and LSIF try to standardize what gets stored, while systems like Glean and Kythe focus on how to store and query it at scale.
Loupe’s index format must not assume a specific language (language-agnostic), must support cheap partial updates when files change (incremental), and must answer navigation queries instantly even on large codebases (fast).
Subpages
- SCIP - A Better Code Indexing Format than LSIF: Sourcegraph’s comparison of SCIP vs LSIF, arguing for a simpler, more debuggable indexing format.
- Code Navigation for AI SWEs: a practical look at how AI coding agents can leverage code navigation infrastructure.
- SCIP: a deep dive into the SCIP indexing format itself.
- Glean: Meta’s system for collecting, deriving, and querying facts about code at scale.
- Google Kythe: Google’s ecosystem for building tools that work with code, centered on a language-agnostic graph schema.
- OpenGrok: Oracle’s fast source code search and cross-reference engine.
- livegrep: a regex-based code search tool optimized for speed over large repositories.
- Mycroft: a code indexing and navigation tool.
[blog] [overview] SCIP - A Better Code Indexing Format than LSIF
Source: sourcegraph’s blog
Link: https://sourcegraph.com/blog/announcing-scip
Keywords:
- Indexing format
Takeaways:
An indexing format should be aware of:
- Human-readability for debuggability.
- Allow for local reasoning.
- Allow for incrementalization.
Introduction
“Sourcegraph Inc. is a company developing code search and code intelligence tools that semantically index and analyze large codebases so that they can be searched across commercial, open-source, local, and cloud-based repositories.”
An indexing format is a quickly queryable representation of an entity. Previously, there was LSIF (Language Server Index Format), an open-source standard by Microsoft that serves as a standard indexing format for codebases, enabling rich code navigation, highlighting, etc. This article identifies some pain points of LSIF, and thus the motivation for SCIP.
As indexing formats enable a large part of the Sourcegraph ecosystem, this is a crucial component, especially for code navigation.
Background
There are two fundamentally different ways to implement code navigation: text/heuristic-based and semantic.
Text-based approaches (ctags, tree-sitter) are faster and language-agnostic, but they make mistakes. “Go to definition” might take you to a function with the same name in a different module. “Find references” might miss usages behind aliases or dynamic dispatch. Good enough for quick exploration, bad for anything that requires correctness.
It’s worth separating ctags and tree-sitter here, because they’re quite different in what they actually do.
- `ctags` is purely text/regex-based. It scans source files and extracts symbol names and their locations using patterns, with no real parsing. No understanding of scope, types, or imports. “Go to definition” just finds the nearest thing with that name. Inherently search-based.
- Tree-sitter is a proper incremental parser. It builds a full, language-aware AST, so it actually understands syntax. In principle it could do more than ctags. But it stops at the syntax level: it doesn’t resolve names. It can tell you “this token is an identifier,” but not “this identifier refers to that declaration over in that file.” That resolution step (name binding, type inference, import resolution) requires a type-checker or a full compiler front-end, which tree-sitter doesn’t do and isn’t meant to do.
So the bottleneck isn’t parsing, it’s name resolution. To know where a symbol actually comes from, you need the full semantic graph of the program, not just its syntax tree. That’s what language servers (via LSP) and indexers (via LSIF/SCIP) provide, sitting on top of a compiler.
Semantic approaches actually run a language-aware indexer, essentially a compiler front-end, that resolves symbols with full type information. The results are compiler-accurate: no false positives, no missed references, cross-repository aware. The catch is that this is expensive to set up. You need a working build environment, a language-specific indexer, and a pipeline to feed the output into wherever you’re querying from.
Sourcegraph built their precise navigation on LSIF. The flow is: a language-specific indexer runs against your codebase and dumps an LSIF file, you push that file to Sourcegraph, Sourcegraph ingests and indexes it, and from that point on you get compiler-accurate navigation across repos.
```mermaid
flowchart TB
    subgraph row1 [" "]
        direction LR
        A[Codebase] --> B[Language-specific indexer]
        B -->|LSIF file| C[Upload to Sourcegraph]
    end
    subgraph row2 [" "]
        direction LR
        D[Sourcegraph backend] -->|Stores index| E[(Database)]
        E --> F[Compiler-accurate code navigation]
    end
    C --> D
```
This is actually a new workflow for me…
Detour: LSIF Format
LSIF is newline-delimited JSON (one object per line). Every line is either a vertex or an edge, distinguished by a "type" field. Every object also has an "id" (a monotonically-increasing integer) and a "label" (the semantic kind).
```json
{ "id": 7, "type": "vertex", "label": "resultSet" }
{ "id": 10, "type": "edge", "label": "next", "outV": 8, "inV": 7 }
```
Vertices represent things: documents, ranges (spans of code), result hubs, hover content, definition locations, reference sets. Edges connect them using outV (from) and inV / inVs (to). The IDs carry no meaning on their own; they’re just handles for wiring the graph together.
Key vertex types:
| Label | What it represents |
|---|---|
| `document` | A source file |
| `range` | A span within a file (start/end line + character) |
| `resultSet` | A hub shared by all ranges of the same symbol |
| `hoverResult` | The hover popup content |
| `definitionResult` | Where a symbol is defined |
| `referenceResult` | All definition and reference locations |
Key edge types:
| Label | Meaning |
|---|---|
| `contains` | Document owns ranges. Project owns documents |
| `next` | Range delegates to its resultSet |
| `textDocument/hover` | resultSet points to hover content |
| `textDocument/definition` | resultSet points to definition locations |
| `textDocument/references` | resultSet points to reference set |
| `item` | Binds a result vertex back to concrete ranges |
The resultSet is the central design idea. Consider a symbol like foo that appears 50 times across a codebase — once as a definition and 49 as references. Naively, you’d attach hover content, definition location, and reference list to each of those 50 ranges individually. That’s a lot of duplication.
Instead, LSIF introduces a resultSet as a shared hub: all 50 ranges point to the same resultSet via a next edge, and the actual results (hover, definition, references) hang off the resultSet once. When you query “what’s the hover for range(19)?”, you follow next to get the resultSet, then follow textDocument/hover to get the result.
It’s a reasonable design for compactness, but it adds an indirection step to every query. And because everything is wired by opaque IDs, you have to load the entire graph into memory to efficiently resolve anything — you can’t jump to a range and read its data directly, you have to chase IDs across the whole file.
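That query path can be made concrete with a minimal LSIF-consumer sketch in Python. The vertex payloads are heavily simplified (a real `hoverResult` carries structured content, ranges carry start/end positions), but the shape of the lookup — buffer everything, then chase `next` and `textDocument/hover` edges — is faithful:

```python
import json

# A toy LSIF dump for `foo`: one document, one range, one resultSet, one hover.
lsif = """\
{"id": 4, "type": "vertex", "label": "document", "uri": "file:///main.ts"}
{"id": 7, "type": "vertex", "label": "resultSet"}
{"id": 8, "type": "vertex", "label": "range", "start": {"line": 0, "character": 9}}
{"id": 9, "type": "edge", "label": "next", "outV": 8, "inV": 7}
{"id": 11, "type": "vertex", "label": "hoverResult", "result": "function foo(): void"}
{"id": 12, "type": "edge", "label": "textDocument/hover", "outV": 7, "inV": 11}
"""

vertices, edges = {}, {}
for line in lsif.splitlines():        # must ingest the WHOLE file before querying
    obj = json.loads(line)
    if obj["type"] == "vertex":
        vertices[obj["id"]] = obj
    else:
        edges[(obj["outV"], obj["label"])] = obj["inV"]

def hover(range_id):
    result_set = edges[(range_id, "next")]                # range -> resultSet
    hover_id = edges[(result_set, "textDocument/hover")]  # resultSet -> hover
    return vertices[hover_id]["result"]

assert hover(8) == "function foo(): void"
```

Even in this six-line dump, answering one hover query takes two ID hops through in-memory tables — exactly the indirection-plus-full-load cost described above.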
Challenges of LSIF
Sourcegraph has built dozens of LSIF indexers, from prototype to production grade, covering Go, Java, Scala, Kotlin, TypeScript, and JavaScript.
At scale, that means 45k+ repos with precise navigation enabled and 4k+ LSIF uploads processed per day. At that point, the cracks in the protocol become hard to ignore.
Most of the pain traces back to one design decision: LSIF uses a graph encoding with opaque numeric IDs to connect vertices and edges. Here’s what that looks like for this TypeScript snippet:
```ts
// main.ts
function foo(): void {} // line 0: foo is defined here
const x = 1;
foo(); // line 3: foo is referenced here
```
The LSIF graph for this looks like (notation: vertexLabel(id) where the number is the opaque integer ID assigned by the indexer):
```mermaid
flowchart TB
    classDef docNode fill:#4A90D9,color:#fff,stroke:#2c6fad
    classDef rangeNode fill:#5CB85C,color:#fff,stroke:#3d8b3d
    classDef resultSetNode fill:#F0AD4E,color:#000,stroke:#c87f0a
    classDef resultNode fill:#9B59B6,color:#fff,stroke:#6c3483
    doc["document(4)<br/>main.ts"]:::docNode
    doc -->|contains| r8(["range(8)<br/>line 0: foo definition"]):::rangeNode
    doc -->|contains| r19(["range(19)<br/>line 3: foo reference"]):::rangeNode
    r8 -->|next| rs(("resultSet(7)")):::resultSetNode
    r19 -->|next| rs
    rs -->|textDocument/hover| hov["hoverResult(11)"]:::resultNode
    rs -->|textDocument/definition| def["definitionResult(13)"]:::resultNode
    rs -->|textDocument/references| ref["referenceResult(16)"]:::resultNode
    def -->|item| r8
    ref -->|item definitions| r8
    ref -->|item references| r19
```
Every connection is expressed as {id, type: "edge", outV: N, inV: M} where N and M are opaque integers. That single choice cascades into a bunch of problems:
- No machine-readable schema, no static types: LSIF has no formal schema. Every line is a free-form JSON object, so when writing an indexer or reader you’re working with raw maps and doing manual field access.
- Large in-memory data structures: Because edges reference IDs that may not have appeared yet in the file, you can’t process LSIF as a stream. You have to buffer everything first, then resolve references. For a large repo with millions of ranges and edges, this means holding the full graph in memory before you can answer a single query.
```mermaid
flowchart LR
    A["Read line 1<br/>range(8)"] --> B["Read line 2<br/>edge next outV=8 inV=7"]
    B --> C["resultSet(7) not seen yet<br/>must buffer"]
    C --> D["Read line 3<br/>resultSet(7)"]
    D --> E["Now resolve<br/>buffered edges"]
```
- Opaque IDs make debugging painful: When something goes wrong, you’re staring at a wall of numbers with no context. There’s no way to tell what ID 7 or ID 13 represents without mentally tracing the whole graph from the start.
```json
{"id":14,"type":"edge","label":"textDocument/definition","outV":7,"inV":13}
{"id":15,"type":"edge","label":"item","outV":13,"inVs":[8],"document":4}
```
What is outV:7? What is inV:13? What range is 8? You have to scroll back through the file and match IDs by hand.
- Incremental indexing is hard: IDs are globally monotonically increasing. If you index file A and get IDs 1 to 500, then index file B and get IDs 501 to 1000, you can’t independently re-index just file A later since the new IDs would collide or require renumbering. Merging two separately-generated LSIF outputs is also non-trivial because both start their IDs from 1.

```
Index run 1 (full):   range(8) -> resultSet(7) -> definitionResult(13)
Index run 2 (file A): range(1) -> resultSet(2) -> definitionResult(3)   // ID collision
```
The vibe I get is that the problem of LSIF is that it’s too monolithic and isn’t modular. Local analysis of an LSIF file is not really possible - a global picture is required.
SCIP addresses all of this by replacing the graph encoding with a Protobuf schema built around human-readable string IDs for symbols, dropping the concepts of “monikers” and “resultSet” entirely.
SCIP
The SCIP schema is a Protobuf schema. The design is heavily inspired by SemanticDB, a code indexing format that originated in the Scala ecosystem.
Sourcegraph reported a wide range of improvements:
- Development speed: Static types from the Protobuf schema give rich editor completions and catch typos at compile time. Human-readable string symbols make debugging much more straightforward. Abstractions like import/export monikers that could silently break navigation in LSIF are gone.
- Runtime performance: They saw a 10x speedup in CI when replacing `lsif-node` with `scip-typescript` (though not entirely attributable to the protocol itself).
- Index size: LSIF indexes are on average 4x larger gzip-compressed and around 5x larger uncompressed compared to equivalent SCIP payloads.
- Testability: Sourcegraph built a snapshot testing utility on top of SCIP that is reused across indexers. Snapshot testing with LSIF was painful by comparison.
Don Stewart from Meta integrated SCIP with Glean (Meta’s system for collecting and querying facts about code) and found SCIP to be 8x smaller and processable 3x faster than LSIF. The mapping into Glean was around 550 lines of code vs. 1500 for LSIF.
SCIP also unblocks use cases that LSIF struggled with:
- Incremental indexing: Because symbols are identified by human-readable strings rather than global IDs, re-indexing only the changed files becomes feasible. Users would wait less for precise navigation to become available after a push.
- Cross-language navigation: Navigating between, say, a Protobuf definition and its generated Java or Go bindings becomes possible. This kind of cross-language linking was essentially out of reach with LSIF.
Detour: SCIP Format
Sources:
- https://github.com/sourcegraph/scip/blob/main/scip.proto
- https://github.com/sourcegraph/scip/blob/main/docs/DESIGN.md
- https://github.com/sourcegraph/scip-typescript/tree/main/snapshots/output
For the same foo example, here’s what SCIP looks like (shown as JSON, actually Protobuf binary):
```json
{
  "metadata": {
    "toolInfo": { "name": "scip-typescript" },
    "projectRoot": "file:///project"
  },
  "documents": [
    {
      "language": "typescript",
      "relativePath": "src/main.ts",
      "occurrences": [
        {
          "range": [0, 9, 12],
          "symbol": "scip-typescript npm . . src/`main.ts`/foo().",
          "symbolRoles": 1
        },
        {
          "range": [3, 2, 5],
          "symbol": "scip-typescript npm . . src/`main.ts`/foo().",
          "symbolRoles": 0
        }
      ],
      "symbols": [
        {
          "symbol": "scip-typescript npm . . src/`main.ts`/foo().",
          "documentation": ["```ts\nfunction foo(): void\n```"],
          "kind": "Function"
        }
      ]
    }
  ]
}
```
Two occurrences, one symbol entry. No graph, no edges, no resultSet. The same LSIF equivalent required a document vertex, two range vertices, a resultSet, a hoverResult, a definitionResult, a referenceResult, and six edges.
Takeaway: So just a locally-processed form of a file.
Cross-file and cross-repo references work the same way: any occurrence in any document carrying the same symbol string is linked, purely by string equality. The consumer builds a hashtable at index load time.
Cross-file symbol resolution
There are no cross-file edges or pointers. It’s just a hashtable.
When the consumer loads all documents, it builds a map:
symbol string -> list of (document, range, roles)
Every occurrence in every document gets inserted into this map by its symbol string. “Go to definition” is a lookup: find all entries for this symbol string where roles & Definition != 0. “Find references” is the same lookup without the role filter.
So for foo defined in a.ts and referenced in b.ts:
```
# a.ts indexer emits:
{ range: [0, 9, 12], symbol: "...`a.ts`/foo().", symbolRoles: 1 }  // Definition

# b.ts indexer emits:
{ range: [5, 0, 3], symbol: "...`a.ts`/foo().", symbolRoles: 0 }   // Reference
```
Both carry the same symbol string. The consumer puts both into the map. When you hover foo in b.ts, it looks up "...a.ts/foo().", finds the definition entry in a.ts, and navigates there.
Cross-repo works identically. The <package> component (npm @example/a 1.0.0) makes the string globally unique across repos. Two indexes from different repos get loaded into the same map, and the lookup is no different.
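The whole consumer-side scheme fits in a few lines of Python — the occurrence shapes below are simplified from the real Protobuf messages, and the symbol strings are abbreviated, but the mechanism (one hashtable, role-bit filtering) is as described:

```python
from collections import defaultdict

DEFINITION = 1  # the SymbolRole.Definition bit

# Two indexed documents: foo defined in a.ts, referenced in b.ts.
documents = {
    "a.ts": [{"range": [0, 9, 12], "symbol": "... `a.ts`/foo().", "roles": 1}],
    "b.ts": [{"range": [5, 0, 3], "symbol": "... `a.ts`/foo().", "roles": 0}],
}

# Built once at index load time: symbol string -> occurrences.
index = defaultdict(list)
for path, occurrences in documents.items():
    for occ in occurrences:
        index[occ["symbol"]].append((path, occ["range"], occ["roles"]))

def go_to_definition(symbol):
    """Same lookup as find_references, filtered by the Definition role bit."""
    return [(p, r) for p, r, roles in index[symbol] if roles & DEFINITION]

def find_references(symbol):
    return [(p, r) for p, r, _ in index[symbol]]   # no role filter

assert go_to_definition("... `a.ts`/foo().") == [("a.ts", [0, 9, 12])]
assert len(find_references("... `a.ts`/foo().")) == 2
```

Compare this with the LSIF consumer: no edge chasing, no opaque IDs — string equality on the symbol is the entire linking mechanism, across files and repos alike.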
Personal Takeaways
An indexing format should be aware of:
- Human-readability for debuggability.
- Allow for local reasoning.
- Allow for incrementalization.
[blog] [overview] Code Navigation for AI SWEs: What We’ve Learned So Far
Source: engines.dev
Link: https://www.engines.dev/blog/code-navigation
Keywords:
- Code navigation
- LSP
Takeaways:
I think:
- Stack Graphs.
- Glean.
- SCIP.
are the ones that are worth looking into.
I want to solve the problem at the API/indexing format rather than wrapping existing binaries.
AI SWE is probably AI software engineer. It is AI as a software engineer, not an engineer working in AI.
Introduction
The core framing: AI agents need code navigation the same way humans do. You can’t just throw files at a model and hope it figures out where things are.
The question is how to help AI SWE navigate a codebase.
Existing Approaches
The field is split on how to solve this:
- SWE-Agent: brute-force string search across the entire codebase, then shows the LLM 100 lines at a time through a file viewer.
Basically, a naive heuristic where you grep through the codebase and show the matched lines in their surrounding context.
- CodeMonkey: passes all files into a small LLM and ranks them by relevance.
Basically, delegate to a narrower but faster LLM to act like a “semantic search” so that the bigger LLM can work more efficiently.
- Moatless: semantic search to identify relevant files.
Semantic search is a little vague? Does it mean Moatless performs semantic analysis?
- OpenHands / Engines: expose navigation as explicit agent tools, such as find all references and go to definition.
Basically my idea, focus on the tools first, then AI is just an integration.
The last approach maps cleanly onto what a language server already provides. The question is which backend to use.
Ideal system
- Scalable: index at most once per commit.
- Incremental: re-index only changed files on new commits.
- Flexible: navigate any arbitrary commit hash.
No existing open-source system hits all of these.
Systems Evaluated
- `lsproxy`: auto-detects the language, then spins up the right LSP server.
- Stack Graphs (GitHub):
  - An advanced data structure that powers GitHub’s web-based code navigation.
  - Desirable properties:
    - Incremental.
    - Theoretically language-agnostic.
    - Fast queries via path tracing on a prebuilt graph.
  - Under the hood, uses tree-sitter + custom `.tsg` grammar files per language to build the graph. Maintaining `.tsg` files is tiring.

The Stack Graphs architecture is the right idea: incremental, no running server required, commit-aware. The `.tsg` maintenance burden is the only real blocker.
- Glean (Meta):
- Stores arbitrary facts about a codebase (user-defined schema + Angle query language).
- Desirable properties:
- Incremental: Only reindex new changes.
- Scalable: Still work on Meta’s massive C++ codebase.
- Flexible: Many queries beyond simple code navigation are supported.
- Navigate arbitrary commits: Facts can be associated with any commit.
- Can ingest SCIP as a schema, so it covers a wide range of language support.
- Drawbacks:
- Glean uses Thrift for its RPC protocol. The Thrift definition Glean uses relies on Meta’s internal Thrift client.
Basically, Glean is the right direction, just some issues with open-source.
- multilspy: a Python wrapper over an LSP server.
  - Easy to set up.
  - Supports multiple languages.
  - Drawbacks:
    - A bit bloated, as it downloads language server binaries.
    - The LSP server’s lifetime is tied to the Python process.
- Sourcegraph: SCIP powers its precise code navigation API. However, Sourcegraph’s full solution is not open source.
Their Solution
They concluded that they want:
- The language-agnostic approach of Glean/Stack Graphs.
- The usability of lsproxy.
Their first step, though, was wrapping multilspy with:
- A server component with MCP support.
- A Docker container for sandboxing.
They are solving the problem at a totally different level: a ready-to-use binary rather than an API/indexing format. I want to research the solution at the Glean/SCIP level first.
Personal Takeaways
I think the ones worth looking into are:
- Stack Graphs
- Glean
- SCIP
I want to solve the problem at the API/indexing-format level rather than wrapping existing binaries.
[tool] [tool-dive] SCIP
Source: Sourcegraph
Link: https://github.com/sourcegraph/scip
[tool] [tool-dive] Glean
Source: Meta
Link: https://github.com/facebookincubator/Glean
[tool] [tool-dive] Google Kythe
Source: Wikipedia
Link: https://en.wikipedia.org/wiki/Google_Kythe
[tool] [tool-dive] OpenGrok
Source: GitHub
Link: https://github.com/oracle/opengrok
[tool] [tool-dive] livegrep
Source: GitHub
Link: https://github.com/livegrep/livegrep
Mycroft
Code Querying
Code querying is about searching and pattern-matching over code in a structured, expressive way. This covers everything from basic text search (ripgrep) to AST-level structural search (ast-grep, Semgrep) to full Datalog-based code-as-data query engines (CodeQL, Soufflé).
The key design question here is the tradeoff between expressiveness, speed, and ease of use.
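A tiny illustration of that tradeoff using Python’s stdlib: text search over-matches comments and identifier substrings, while a structural (AST-level) query pinpoints exactly the construct you mean. The sample source is invented for the demo.

```python
import ast
import re

source = """
def eval_config(path):    # defines eval_config, mentions eval in a comment
    data = open(path).read()
    return eval(data)     # the actual call we care about
"""

# Text search, grep-style: hits the definition, the comment, and the call.
text_hits = len(re.findall(r"eval", source))

# Structural search: walk the AST and match only calls to a name `eval`.
call_hits = [
    node.lineno
    for node in ast.walk(ast.parse(source))
    if isinstance(node, ast.Call)
    and isinstance(node.func, ast.Name)
    and node.func.id == "eval"
]

assert text_hits > 1     # the textual query over-matches
assert call_hits == [4]  # the AST query finds the one real call
```

Tools like ast-grep and Semgrep sit at this structural level but keep grep-like ergonomics; CodeQL pushes further toward full code-as-data querying at the cost of setup and learning curve.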
[blog] [overview] Design Space for Code Search Query
Source: ast-grep blog
Link: https://ast-grep.github.io/blog/code-search-design-space.html
[overview] [overview] Comparison With Other Frameworks
Source: ast-grep docs
Link: https://ast-grep.github.io/advanced/tool-comparison.html
[overview] [overview] Best Code Search Tools for Developers in 2024
Source: debugg.ai
Link: https://debugg.ai/resources/best-code-search-tools-for-developers-2024-navigate-understand-refactor
[paper] [seminal] codeQuest
Source: Springer
Link: https://link.springer.com/chapter/10.1007/11785477_2
[paper] [seminal] Declarative Static Analysis with CodeQL
Source: Wiley
Link: https://onlinelibrary.wiley.com/doi/abs/10.1002/spe.3199
[paper] [trend] Incrementalizing Production CodeQL Analyses
Source: ACM (ESEC/FSE 2023)
Link: https://dl.acm.org/doi/10.1145/3611643.3613860
[paper] [trend] CodeFuse-Query
Source: arXiv
Link: https://arxiv.org/html/2401.01571v1
[tool] [tool-dive] ripgrep
Source: GitHub
Link: https://github.com/BurntSushi/ripgrep
[tool] [tool-dive] ast-grep
Source: ast-grep
Link: https://ast-grep.github.io
[tool] [tool-dive] Semgrep
Source: GitHub
Link: https://github.com/semgrep/semgrep
[blog] [tool-dive] Semgrep: A Static Analysis Journey
Source: Semgrep blog
Link: https://semgrep.dev/blog/2021/semgrep-a-static-analysis-journey/
[tool] [tool-dive] GritQL
Source: grit.io
Link: https://docs.grit.io
[tool] [tool-dive] CodeQL
Source: GitHub
Link: https://codeql.github.com
[tool] [tool-dive] Soufflé
Source: Soufflé
Link: https://souffle-lang.github.io
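The Datalog end of the spectrum treats code as relations and evaluates rules to a fixpoint. A naive Python sketch of the transitive-closure rule that engines like Soufflé and CodeQL evaluate far more efficiently (the call edges are made up; in Datalog the rules would read `reaches(a, b) :- calls(a, b).` and `reaches(a, c) :- reaches(a, b), calls(b, c).`):

```python
# Base relation: direct call edges extracted from a (hypothetical) codebase.
calls = {("main", "parse"), ("parse", "lex"), ("lex", "read"), ("main", "log")}

# Naive bottom-up evaluation: apply the rules until no new facts appear.
reaches = set(calls)
while True:
    derived = {(a, c) for (a, b) in reaches for (b2, c) in calls if b == b2}
    if derived <= reaches:  # fixpoint reached
        break
    reaches |= derived

assert ("main", "read") in reaches      # main transitively reaches read
assert ("log", "read") not in reaches   # log calls nothing
```

Everything else in this family (points-to analysis, taint tracking) is the same recipe with more relations and more rules, which is why Datalog engines invest so heavily in indexing and semi-naive evaluation.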
Code Refactoring
Code refactoring covers applying transformations to a codebase: renaming symbols, reorganizing structure, rewriting patterns. The challenge is doing this correctly and safely, without leaving the code in a malformed state.
This is one of the core pain points that motivated Loupe. AI reaches for grep and sed, which is imprecise and error-prone. Proper refactoring requires understanding the code structurally, not just textually.
[paper] [seminal] Refactoring-Aware AST Differencing
Source: Concordia University (TOSEM 2024)
Link: https://users.encs.concordia.ca/~nikolaos/publications/TOSEM_2024.pdf
[paper] [trend] PyCraft
Source: University of Colorado (FSE 2024)
Link: https://danny.cs.colorado.edu/papers/PyCraft_FSE2024.pdf
[tool] [tool-dive] Comby
Source: comby.dev
Link: https://comby.dev
Quality Assurance
Quality assurance covers detecting code smells, bad patterns, and potential bugs. This is where static analysis meets practical tooling. The resources here range from empirical benchmarks of existing tools to ML-based detection methods.
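As a flavor of the simplest end of this space, a rule-based check for the classic “long parameter list” smell fits in a few lines of stdlib Python. The threshold is an arbitrary choice here; real linters make it configurable.

```python
import ast

MAX_ARGS = 5  # hypothetical threshold

def long_parameter_lists(source: str) -> list:
    """Flag functions whose parameter count exceeds MAX_ARGS."""
    smells = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            n = len(node.args.args) + len(node.args.kwonlyargs)
            if n > MAX_ARGS:
                smells.append((node.name, n))
    return smells

code = """
def ok(a, b): pass
def smelly(a, b, c, d, e, f, g): pass
"""
assert long_parameter_lists(code) == [("smelly", 7)]
```

The research below spans this rule-based end through to LLM-assisted detection, where the tradeoff shifts from precision-by-construction to recall on fuzzier smells.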
[paper] [overview] Static Analysis Tools for Secure Code Review
Source: arXiv
Link: https://arxiv.org/pdf/2407.12241
[paper] [trend] LLift
Source: OOPSLA 2024 / UC Riverside
Link: https://www.cs.ucr.edu/~zhiyunq/pub/oopsla24_llift.pdf