
Introduction

Loupe is an all-in-one suite of ergonomic APIs and CLI tools for code analysis and codebase querying, aimed at code comprehension, refactoring, and automation.

It leverages well-known program analysis techniques, algorithms, and heuristics to answer common questions about codebases.

Components

Structure-wise, Loupe has 2 main parts: the API and the CLI.

Functionality-wise, Loupe is concerned with 4 aspects:

  • Code analysis: understanding structure, symbols, and relationships within a codebase.
  • Code querying: searching and pattern-matching over code in a structured, expressive way.
  • Code refactoring: applying transformations like renaming, reorganizing, and restructuring.
  • Quality assurance: detecting smells, bad patterns, and potential bugs.

Goals

  • Ergonomic: easy-to-use APIs that feel natural to work with.
  • Extensible: can be extended from the core API without fighting the design.
  • Fast: instant results; must be meaningfully faster than an AI agent doing manual search.
  • Simple: self-contained and not bloated.

Research

This section collects related works across the 4 functional areas of Loupe: code analysis, querying, refactoring, and quality assurance. It covers seminal papers, modern techniques, and practical tools.

The goal here is not exhaustive coverage, but a curated map of the space, so it’s easier to understand where Loupe sits and what prior art is worth learning from.

Plan

Areas: Start with code analysis, then querying, then refactoring, then quality assurance.

  1. Analysis: the foundation for everything else. We need to understand ASTs, parsing, and code intelligence before anything else makes sense.
  2. Querying: once we have a model of the code, querying is what we do with it.
  3. Refactoring: applies both analysis (know what to change) and querying (find where to change it).
  4. Quality assurance: overlaps all three, and makes the most sense once we understand the tools and techniques of the other areas.

Within each area, resources are ordered from overview to seminal papers, then modern trends, then practical tools. Start at the top and work down.

Code Analysis

Code analysis is about understanding the structure, symbols, and relationships within a codebase. This covers parsing (ASTs), call graphs, code intelligence protocols (LSP, SCIP), and navigation tools.

This is the foundation everything else in Loupe builds on. Without a solid model of what the code actually is, querying, refactoring, and QA are all guesswork.

Syntactic Analysis

Syntactic analysis is about parsing source code into structured representations, typically abstract syntax trees (ASTs) or concrete syntax trees (CSTs). This is the first step in any code analysis pipeline: before you can reason about symbols, types, or data flow, you need a reliable parse of the source text.

The key concerns here are incremental parsing (re-parsing only what changed after an edit), error recovery (producing a useful tree even when the code is incomplete or malformed), and multi-language support (handling many grammars without writing a custom parser for each).

For Loupe, the parser must support many languages through shared infrastructure (language-agnostic), patch existing trees on edits rather than re-parse from scratch (incremental), and be fast, since it sits on the critical path and everything downstream depends on it.

Subpages

  • Tree-sitter (the Strange Loop 2018 talk and the project itself): the dominant incremental parsing framework, designed for editor use cases. Produces concrete syntax trees with error recovery and supports many languages via community grammars.
  • Roslyn: the .NET Compiler Platform, which exposes rich syntax tree APIs including full-fidelity syntax trees with trivia, incremental parsing, and syntax walkers/rewriters for C# and VB.NET.
  • Spoofax Language Workbench: a platform for defining languages declaratively, including their syntax, name binding, and type systems.

[talk] [seminal] Tree-sitter - A New Parsing System for Programming Tools

Source: Strange Loop 2018

Link: https://www.thestrangeloop.com/2018/tree-sitter—a-new-parsing-system-for-programming-tools.html

Video: https://www.youtube.com/watch?v=Jes3bD6P0To

Speaker: Max Brunsfeld, GitHub Atom team

Keywords:

  • Parsing
  • Incremental parsing
  • Syntax tree
  • Code navigation

Takeaway:

Tree-sitter takes the approach of a parser generator:

  • GLR parsing
  • Error recovery, also built on GLR
  • Incremental parsing by reparsing, with the old tree as a reference

Introduction

Developer tools that support multiple languages have historically been built on regex-based code analysis.

Syntax highlighting in Atom, VS Code, and Sublime Text uses TextMate grammars, which are regular-expression pattern sets. Code folding uses indentation heuristics. Symbol search uses pattern matching over raw text.

Regex-based tooling has a hard ceiling:

  • Regex cannot represent nested structure or scope boundaries. Balanced delimiters, block nesting, type-sensitive disambiguation - none of this is expressible.
  • Reparsing an entire file on every keystroke is too slow for real-time use. Compiler parsers are designed for batch runs, not interactive editing.
  • Incomplete code breaks everything. While typing, code is almost always syntactically invalid. A parser that rejects invalid input cannot be used in an editor.
  • Every language needed its own editor plugin written from scratch, with no shared interface.

Tree-sitter was built to fix all of this.

  • A library for parsing source code, itself written in C and C++.
  • Language-agnostic: covers a wide range of languages.
  • Unified: a single API across languages.
  • Incremental: no need to reparse the whole file.
  • Error recovery: recovers from errors instead of aborting.

IDEs’ limitations:

  • Language-specific.
  • None were incremental at the time.

Language servers’ limitations:

  • Fixed feature set.
  • RPC overhead, plus parsing from scratch on every change.
  • Each maintains its own set of dependencies.

Tree-sitter’s strengths:

  • Arbitrary analysis based on syntax tree.
  • In-process incremental parsing.
  • No dependencies.

Features

Syntax Highlighting

  • In Sublime Text 3, VS Code, and old Atom, syntax highlighting was plagued by inconsistencies:

    • Different colors for the same kind of token in different contexts.
    • The same variable had different colors.
    • Fields were not highlighted.
  • A concrete example: long lines of code (typical in bundled JavaScript) break regex-based highlighting.

  • Improvements with tree-sitter:

    • None of the problems above.
    • Very fast.

Code Folding

  • Typically based on indentation: does not always work, especially when an unusual code convention is used.

  • With tree-sitter:

    • Folding always works.
    • Configurable code folding, unlocked by tree-sitter exposing a consistent syntax tree.

Extend Selection

  • The feature where expanding the selection at a position selects successively larger enclosing syntax nodes.
  • Works well with multi-cursor editing.

How Tree-sitter Works

Writing a Grammar

  • It looks like most parser generators:
    • A context-free grammar in tree-sitter is a JavaScript value in a file such as grammar.js.
    • The grammar is compiled into parser.c.
    • The parser can then be imported.
  • Bindings exist for many languages.

Algorithms

Seminal Paper

Based on Tim A. Wagner’s 1998 paper: Practical Algorithms for Incremental Software Development Environments.

  • Laid out the methodology for efficient code analysis in IDEs.
  • Starts with basic parsing theory.
  • Extends the theory to incremental parsing.
  • Gives formal proofs of performance properties.
  • Covers how to handle ambiguities, errors, etc.

LR Parsing

Good source: wikipedia.org/wiki/LR_parser.

  • Scans the input from start to end without ever backtracking.
  • As characters are read, they are grouped into tokens, and tokens are grouped into subtrees stored on a stack for consultation; items on the stack are merged quickly as rules complete.

LR parsing does not work for all languages. In practice, a well-designed language shouldn’t be a problem, but some languages have quirks that demand more power.

Tree-sitter does some unique things based on LR Parsing.
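
A toy shift-reduce loop makes the mechanics concrete. This is an illustrative sketch in JavaScript, not tree-sitter's implementation, and the grammar (expr -> expr "-" expr | num, left-associative) is invented for the example; real LR parsers drive the same loop with precomputed state tables.

```javascript
// Toy shift-reduce parser for the invented grammar:
//   expr -> expr "-" expr | num   (left-associative)
// Illustrates the LR idea: shift tokens onto a stack, reduce
// completed rules into subtrees, scan left to right, never backtrack.
function parse(tokens) {
  const stack = [];
  for (const tok of tokens) {
    // Shift: push the next token (numbers become leaf nodes).
    stack.push(/^\d+$/.test(tok) ? { type: "num", value: tok } : tok);
    // Reduce greedily while a rule matches the top of the stack.
    let reduced = true;
    while (reduced) {
      reduced = false;
      const n = stack.length;
      if (n >= 1 && stack[n - 1].type === "num") {
        // expr -> num
        stack[n - 1] = { type: "expr", children: [stack[n - 1]] };
        reduced = true;
      } else if (
        n >= 3 &&
        stack[n - 3].type === "expr" &&
        stack[n - 2] === "-" &&
        stack[n - 1].type === "expr"
      ) {
        // expr -> expr "-" expr (reducing eagerly gives left associativity)
        const children = stack.splice(n - 3, 3);
        stack.push({ type: "expr", children });
        reduced = true;
      }
    }
  }
  return stack[0]; // the finished parse tree (if the input was valid)
}
```

For example, parse(["1", "-", "2", "-", "3"]) yields a left-nested expr tree, matching how an LR parser groups left-associative operators.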

Extension: GLR Parsing

When ambiguities arise, fork the parse stack to pursue multiple interpretations in parallel.

=> Tree-sitter is based on this.
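
The forking idea can be sketched like this (a toy model, not tree-sitter's GLR engine; the identifier/keyword overlap below is invented for illustration):

```javascript
// Toy GLR-style forking: each "stack" is one partial interpretation
// of the input. On each token, every live stack is forked once per
// interpretation that can accept the token; stacks that cannot make
// progress are dropped.
function glrParse(tokens, interpretations) {
  let stacks = [[]]; // start with one empty parse stack
  for (const tok of tokens) {
    const survivors = [];
    for (const stack of stacks) {
      for (const interpret of interpretations) {
        const next = interpret(stack, tok); // new stack, or null on failure
        if (next !== null) survivors.push(next); // this branch lives on
      }
    }
    stacks = survivors; // ambiguous input => several stacks remain
  }
  return stacks;
}

// Two overlapping interpretations: "x" is both a valid identifier and
// a (hypothetical) keyword, so parsing ["x"] leaves two stacks alive.
const identifier = (stack, tok) =>
  /^[a-z]+$/.test(tok) ? stack.concat([{ id: tok }]) : null;
const keywordX = (stack, tok) =>
  tok === "x" ? stack.concat([{ keyword: tok }]) : null;
```

glrParse(["x"], [identifier, keywordX]) leaves two surviving stacks (a genuine ambiguity), while ["y"] leaves one. Real GLR then scores the surviving trees, e.g. via dynamic precedence.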

Error Recovery

  • The tree should include error nodes.
  • Invalid code is a source of lots of ambiguities.

Tree-sitter uses the same GLR machinery to resolve the ambiguities that invalid code introduces.

Incremental Parsing

  • Walk the current tree and mark all nodes that contain the edited positions.
  • Reparse as before, but now with the old tree as a reference, so unchanged subtrees can be reused.
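
The marking step can be sketched on a toy tree whose nodes carry byte ranges (a simplification: the real pass also shifts positions and works hand-in-hand with the reparse):

```javascript
// Toy version of the marking pass: walk the old tree and flag every
// node whose byte range overlaps the edit. During the reparse,
// unmarked subtrees can be reused verbatim; marked ones are rebuilt.
function markEdited(node, editStart, editEnd) {
  const overlaps = node.start < editEnd && node.end > editStart;
  node.edited = overlaps;
  if (overlaps) {
    // Only overlapping nodes need their children inspected.
    for (const child of node.children ?? []) {
      markEdited(child, editStart, editEnd);
    }
  }
  return node;
}
```

For an edit inside one subtree, only that subtree (and its ancestors) gets marked; siblings stay reusable.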

Tree-sitter

Source: tree-sitter

Link: https://tree-sitter.github.io

Abstract

This section covers tree-sitter: design philosophy, architecture, usage, and applications.

Problem:

  • IDEs and editors (Visual Studio, Sublime Text) used regexes for syntax features before tree-sitter.
  • This caused inconsistent highlighting, fragility, poor performance.

Solution: Tree-sitter provides fast, accurate, incremental parsing for editors and IDEs.

See Strange Loop Talk for more context.

[tool] [tool-dive] Tree-sitter Concepts

Source: tree-sitter

Link: https://tree-sitter.github.io

Takeaways

  • Design for IDEs: Tree-sitter prioritizes fast incremental re-parsing and token position fidelity over theoretical parser properties.
  • CST design balances detail and complexity by preserving all tokens while using named nodes (grammar rules) vs. anonymous nodes (inline strings).
  • Incremental parsing only re-parses edited ranges by reusing unchanged subtrees, critical for real-time editor feedback.
  • GLR explores all parse branches at runtime and scores with dynamic precedence, handling true ambiguities without grammar restrictions.
  • Queries separate concerns: Tree navigation is independent of grammar design, enabling services to evolve without grammar changes.

Abstract

This section covers tree-sitter’s architecture, design principles, and core concepts.

  1. Concepts (this article): Design philosophy, architecture, IDE optimization.
  2. Grammar (Guideline & Syntax): Author grammars with design principles and DSL reference.
  3. Query System (Queries): Pattern matching for data extraction and tree navigation.
  4. Services (Services): Syntax highlighting, code navigation, language injection.

Introduction

Tree-sitter is:

  • A parser generator tool: it takes a grammar described in JavaScript and yields an incremental parsing library in C.
  • Runtimes in various languages that load the generated C library.

The Context

Covered in the Strange Loop talk.

Roughly:

  • Before tree-sitter, most IDE and editor features (such as syntax highlighting and code folding in VS Code or Sublime) were implemented in terms of regexes.
  • This caused a lot of problems:
    • Inconsistent highlighting: the same type of token was not given the same color in some edge cases (unusual formatting conventions) or even in some fairly common cases.
    • Regex-based highlighting can easily break, such as when inspecting bundled JS.
    • Slow (at least slower than tree-sitter).
  • So tree-sitter was created to address these issues, reporting:
    • Faster processing of source code.
    • Consistent highlighting.
    • Many developer-experience features enabled.
    • A unified interface for tooling.

Design Philosophy

Some of the design philosophy of tree-sitter can be found in the Strange Loop talk:

  • Arbitrary analysis based on syntax tree.
  • In-process incremental parsing.
  • No dependencies.

On the tree-sitter homepage, tree-sitter aims, quoted verbatim, to be:

  • General enough to parse any programming language
  • Fast enough to parse on every keystroke in a text editor
  • Robust enough to provide useful results even in the presence of syntax errors
  • Dependency-free so that the runtime library (which is written in pure C11) can be embedded in any application

Additional source: https://zed.dev/blog/syntax-aware-editing.

IDE-centric design:

  • Tree-sitter was designed with IDE-centric workflow in mind.
  • Concrete syntax trees retain all token locations to support many editor features.
  • Flexible queries match structural patterns independently from grammar design.
  • Grammar + query set = language support without custom traversal code.

    This aligns with Loupe’s goal: a syntactic code analysis and querying tool.

graph TB
    A["Grammar.js<br/>User-defined rules"] -->|tree-sitter generate| B["C Parser<br/>Compiled from grammar"]
    B --> C["TSParser<br/>Execution engine"]

    C -->|ts_parser_parse_string| D["TSTree<br/>Concrete Syntax Tree"]
    D --> E["TSNode<br/>Syntax tree element"]

    F["Tree Queries<br/>Pattern matching"] -->|ts_query_cursor_exec| E

    G["Edit API<br/>ts_tree_edit"] -->|Incremental| C

    H["Language Binding<br/>Rust/JS/Python"] -->|Load| B

    style A fill:#e1f5ff
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#e8f5e9
    style F fill:#fce4ec
    style G fill:#fff9c4

Practical Use Cases

Source: https://zed.dev/blog/syntax-aware-editing.

  • These features in Zed are implemented using tree-sitter queries:
    • Syntax highlighting (example query from their blog):

      ["do" "for" "while"] @keyword
      
      (function
      name: (identifier) @function)
      
      (pair
      key: (property_identifier) @function.method
      value: [(function) (arrow_function)])
      
    • Symbol outlines (example query from their blog) - fuzzy search symbols within another symbol/context:

      (impl_item
        "impl" @context
        trait: (_)? @name
        "for"? @context
        type: (_) @name) @item
      
      (function_item
        (visibility_modifier)? @context
        (function_modifiers)? @context
        "fn" @context
        name: (_) @name) @item
      
    • Auto-indentation (example query from their blog):

      (statement_block "}" @end) @indent
      
      [
        (assignment_expression)
        (member_expression)
        (if_statement)
      ] @indent
      
    • Language injection (example query from their blog) - sometimes source code in one language embeds another language (e.g. HTML & JS, or Rust macros):

      (script_element
        (raw_text) @content
        (#set! "language" "javascript"))
      

      Somewhat like lexer modes in ANTLR.

User-facing API

Tree-sitter’s API follows a layered approach:

  • Core APIs handle efficient parsing & a unified AST representation.
  • High-level APIs provide convenience features.

Four main object types:

  1. Languages (TSLanguage in the C API).
    • An (opaque) object representing the programming language parsing logic/rules (not the current parsing states).
    • The implementations for specific Language objects are generated by tree-sitter.
  2. Parsers (TSParser in the C API).
    • The language parser itself, which can be assigned a language object.
    • Can accept some source code and produce a TSTree.
  3. Syntax trees (TSTree in the C API).
    • The syntax tree of a piece of source code.
    • Composed of TSNodes.
    • Can be edited to produce a new TSTree.
  4. Syntax nodes (TSNode in the C API).
    • A single node in the syntax tree.
    • Tracks start and end positions & relation to other nodes.

Example flow using a generated parser in C:

  1. Create a TSParser instance.
  2. Set the TSParser with a TSLanguage.
  3. Trigger the parser to produce a TSTree.
  4. Query the TSTree and retrieve TSNodes.

Example program:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <tree_sitter/api.h>

// **Language**
// Declare the external grammar function. This is generated by Tree-sitter
// and contains all the syntax rules for JSON.
const TSLanguage *tree_sitter_json(void);

int main() {
    // The source code we want to parse
    const char *source_code = "{\"name\": \"Gemini\"}";
    uint32_t source_len = strlen(source_code);

    // **Parser**
    // Create the parser and equip it with the JSON language grammar
    TSParser *parser = ts_parser_new(); // Dynamically allocated
    ts_parser_set_language(parser, tree_sitter_json());

    // **Parse tree**
    // Instruct the parser to read the string and build the entire tree.
    TSTree *tree = ts_parser_parse_string(
        parser,
        NULL,           // Previous tree (used for incremental parsing, NULL for first time)
        source_code,    // The actual code
        source_len      // Length of the code
    );

    // **Tree node**
    // Extract the absolute top node of the tree to begin our traversal.
    TSNode root_node = ts_tree_root_node(tree);

    // Print the tree as an S-expression; the caller must free the string.
    char *sexp = ts_node_string(root_node);
    printf("%s\n", sexp);
    free(sexp);

    // Cleanup memory
    ts_tree_delete(tree);
    ts_parser_delete(parser);

    return 0;
}

Parser API

TSParser accepts input via a string buffer (a const char * plus a uint32_t length) or a custom TSInput interface.

  • ts_parser_parse_string: Accept string buffer with incremental parsing support.
    TSTree *ts_parser_parse_string (
      TSParser *self,
      const TSTree *old_tree, // possibly for incremental parsing like in the Strange Loop talk
      const char *string,
      uint32_t length
    )
    
  • ts_parser_parse: Accept a custom data structure consumer:
    TSTree *ts_parser_parse(
     TSParser *self,
     const TSTree *old_tree, // possibly for incremental parsing like in the Strange Loop talk
     TSInput input
    );
    

Syntax Nodes (& Trees)

DOM-style interface for inspection:

Position APIs:

  • ts_node_type: Node type identifier corresponding to grammar rule.
  • ts_node_start_byte/ts_node_end_byte: Raw byte offsets.
  • ts_node_start_point/ts_node_end_point: Zero-based row/column coordinates. Only \n is treated as a newline separator.
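
Given that convention, a point can be recovered from a byte offset as follows (an illustrative sketch assuming single-byte characters; the C library tracks points during parsing rather than recomputing them):

```javascript
// Compute a zero-based (row, column) point from a byte offset,
// treating only "\n" as a line separator, as tree-sitter's points do.
function pointForOffset(source, offset) {
  let row = 0;
  let lineStart = 0;
  for (let i = 0; i < offset; i++) {
    if (source[i] === "\n") {
      row += 1;          // next line
      lineStart = i + 1; // column counts restart here
    }
  }
  return { row, column: offset - lineStart };
}
```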

Tree navigation:

  • ts_tree_root_node: Get root node from tree.
  • ts_node_child: Access children via array index.
  • ts_node_next_sibling/ts_node_prev_sibling: Navigate siblings (may return null node).
  • ts_node_parent: Navigate to parent node.

Named and Anonymous Nodes

Problem: CSTs preserve all tokens, creating deeply nested trees that are tedious to traverse.

Solution: Distinguish two node types for selective AST inclusion:

  • Named nodes: Explicit grammar rules (e.g., if_statement, expression). Always appear in parse tree.
  • Anonymous nodes: Inline strings/regexes (e.g., "if", "(", ";"). Hidden from parse tree by default.
if_statement: $ => seq("if", "(", $._expression, ")", $._statement);
^^^^^^^^^^^^           ^^^   ^^^                 ^^^
named node        anonymous nodes (hidden from AST)

This distinction enables unambiguous, detailed grammars while keeping trees manageable.

graph TD
    A["if_statement<br/>named node"] -->|child| B["_expression<br/>named node<br/>hidden via _prefix"]
    A -->|not in AST| C["'if'<br/>anonymous"]
    A -->|not in AST| D["'('<br/>anonymous"]
    A -->|not in AST| E["')'<br/>anonymous"]
    A -->|child| F["_statement<br/>named node<br/>hidden via _prefix"]

Node Field Names

Access child nodes:

  • By index: Default approach via children array.
  • By field name: More convenient when field names are known.

Like ANTLR, tree-sitter supports both indexed and named access.

TSNode ts_node_child_by_field_name(
  TSNode self,
  const char *field_name,    // Unique field name from grammar
  uint32_t field_name_length
);

Field name lookup optimization:

  • Repeated string comparisons are inefficient.
  • Tree-sitter interns field names into TSFieldId values for efficient lookup.
uint32_t ts_language_field_count(const TSLanguage *);
const char *ts_language_field_name_for_id(const TSLanguage *, TSFieldId);
TSFieldId ts_language_field_id_for_name(const TSLanguage *, const char *, uint32_t);
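
The interning idea itself is simple; a generic sketch (not tree-sitter's internals):

```javascript
// Intern strings into small integer ids so later lookups compare
// integers instead of repeating string comparisons.
class Interner {
  constructor() {
    this.ids = new Map(); // name -> id
    this.names = [];      // id -> name
  }
  idFor(name) {
    let id = this.ids.get(name);
    if (id === undefined) {
      id = this.names.length; // next unused id
      this.ids.set(name, id);
      this.names.push(name);
    }
    return id;
  }
  nameFor(id) {
    return this.names[id];
  }
}
```

Interning the field name once and reusing the id mirrors looking up a TSFieldId with ts_language_field_id_for_name and then passing it to the *_by_field_id APIs.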

Tree Edit

Problem: Source edits invalidate tree byte positions. Reparsing entire file is wasteful.

Solution: Two-step incremental update:

  1. Edit tree structure: Tell tree-sitter what changed (old/new byte range, position delta).

    typedef struct {
      uint32_t start_byte;       // Edit start
      uint32_t old_end_byte;     // Old end position
      uint32_t new_end_byte;     // New end position
      TSPoint start_point;       // Start as row/column
      TSPoint old_end_point;     // Old end as row/column
      TSPoint new_end_point;     // New end as row/column
    } TSInputEdit;
    
    void ts_tree_edit(TSTree *, const TSInputEdit *);
    
  2. Reparse with old tree as context:

    TSTree *new_tree = ts_parser_parse_string(
      parser,
      old_tree,           // Enable incremental parsing
      updated_source,
      new_length
    );
    

Rough flow:

  1. Source change.
  2. Invoke ts_tree_edit to update the tree’s range
  3. Invoke ts_parser_parse with the old tree.
  4. New tree with reused structure.

Why do you need to call ts_tree_edit first? Wouldn’t reparsing with the old tree suffice? My guess: incremental reparsing is more efficient if tree-sitter knows the edited positions from the get-go; ts_tree_edit also shifts the stored positions in the old tree so they line up with the new source.

Warning!

Old extracted TSNode references become stale after a reparse. They need to be updated via ts_node_edit(), or fresh nodes can be refetched from the new tree instead.

My theory: a TSNode does not carry source text information, so the tree-level edit cannot fix up previously extracted nodes.

Language Injection Support

Many documents contain multiple languages (e.g., HTML with embedded JS, Bash with heredocs).

Tree-sitter stitches the disjoint segments into one unified syntax tree.

typedef struct {
  TSPoint start_point;
  TSPoint end_point;
  uint32_t start_byte;
  uint32_t end_byte;
} TSRange;

void ts_parser_set_included_ranges(
  TSParser *self,
  const TSRange *ranges,
  uint32_t range_count
);

Example: ERB document with embedded HTML and Ruby.

<ul>
  <% people.each do |person| %>
  <li><%= person.name %></li>
  <% end %>
</ul>

Steps:

  1. Parse entire document with root language (ERB).
  2. Identify nodes with embedded languages in AST.
  3. Extract boundaries into separate TSRange arrays.
  4. Switch parser language via ts_parser_set_language.
  5. Feed ranges to parser via ts_parser_set_included_ranges.
  6. Re-parse to generate isolated syntax trees per language.
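
Steps 2-3 can be sketched for a toy ERB-like syntax (assuming simple, non-nested <% ... %> markers; a real implementation would take the ranges from the ERB syntax tree instead of scanning text):

```javascript
// Toy version of steps 2-3: find embedded-language ranges in an
// ERB-like string by scanning for "<%"..."%>" markers, yielding
// TSRange-like { startByte, endByte } objects to hand to the parser.
function findEmbeddedRanges(source, open = "<%", close = "%>") {
  const ranges = [];
  let i = 0;
  while ((i = source.indexOf(open, i)) !== -1) {
    const start = i + open.length;
    const end = source.indexOf(close, start);
    if (end === -1) break; // unterminated marker: stop scanning
    ranges.push({ startByte: start, endByte: end });
    i = end + close.length;
  }
  return ranges;
}
```

The resulting ranges would then be fed to ts_parser_set_included_ranges (step 5) so the Ruby parser only sees those byte spans while keeping the original offsets.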

Concurrency Support

TSTree instances are not thread-safe. However, they can be cheaply copied via ts_tree_copy(TSTree *):

  • TSTrees are immutable.
  • Copying increments an atomic reference count; the underlying data is freed only once every copy has been deleted.
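
The copy semantics can be sketched conceptually (a single-threaded toy; in the real implementation the counter is atomic and lives with the shared tree data):

```javascript
// Sketch of cheap tree copies via reference counting: a "copy" does
// not duplicate the tree data, it shares it and bumps a counter. The
// data is actually freed only when the last reference is deleted.
function makeTree(data) {
  return { data, refs: { count: 1 } };
}
function copyTree(tree) {
  tree.refs.count += 1; // atomic in the real implementation
  return { data: tree.data, refs: tree.refs }; // shares data and counter
}
function deleteTree(tree) {
  tree.refs.count -= 1;
  return tree.refs.count === 0; // true: last owner, actually free
}
```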

(Efficient) Tree Traversal

The problem with the standard node fetching APIs (index/named access):

  • Stateless.
  • Therefore, repeated accesses must recalculate positions from scratch.
  • Suitable only for occasional access.

For bulk traversal, use tree cursors.

  • ts_tree_cursor_new(TSNode): Create cursor from node.

  • ts_tree_cursor_goto_first_child: Move to first child.

  • ts_tree_cursor_goto_next_sibling: Move to next sibling.

  • ts_tree_cursor_goto_parent: Move to parent.

  • ts_tree_cursor_current_node: Get full node (triggers lazy evaluation).

  • ts_tree_cursor_current_field_name: Get field name.

  • ts_tree_cursor_current_field_id: Get field ID.

// 1. Initialize the cursor (this node becomes the impenetrable boundary)
TSTreeCursor cursor = ts_tree_cursor_new(start_node);

// 2. Traversal Flow
// Try moving down into the children (O(1) operation)
if (ts_tree_cursor_goto_first_child(&cursor)) {

    // 3. Extract lightweight data directly
    const char *field = ts_tree_cursor_current_field_name(&cursor);

    // 4. Extract the full node struct only if actually needed
    TSNode current = ts_tree_cursor_current_node(&cursor);

    // 5. Move horizontally (O(1) operation)
    if (ts_tree_cursor_goto_next_sibling(&cursor)) {
        // Evaluates the next sibling
    }

    // 6. Move back up the tree
    ts_tree_cursor_goto_parent(&cursor);
}

Resources for the Design & Implementation

Grammar

How to author and design tree-sitter grammars.

Abstract

This section covers grammar authoring and design.

Grammar design directly affects parser quality, performance, and usability.

  • Well-designed: Unambiguous, efficient, maintainable.
  • Poor design: Parsing conflicts, performance issues, difficult query patterns.

Tree-sitter Grammar Syntax

Source: tree-sitter

Link: https://tree-sitter.github.io

Dependencies

  • JavaScript runtime: Node.js (to execute grammar.js)
  • C Compiler: GCC, Clang, or MSVC (to compile the generated C code)
  • Tree-sitter CLI: Installed via npm, Cargo, or downloaded binary. For Nix, use tree-sitter

Basic Flow

graph LR
    A((Start)) --> B[**Setup** <br> Install CLI & Compiler]
    B --> C[**Scaffold** <br> <code>tree-sitter init</code>]
    C --> D[**Define Rules** <br> Edit <code>grammar.js</code>]
    D --> E[**Compile** <br> <code>tree-sitter generate</code>]
    E --> F[**Test** <br> <code>tree-sitter parse</code>]
    F --> G((Syntax Tree AST))

    classDef step fill:#24292f,stroke:#0969da,stroke-width:2px,color:#fff;
    class B,C,D,E,F step;

  1. Initialize with tree-sitter init: Creates a project template with grammar.js, src/, and test files.

  2. Define grammar in grammar.js:

    module.exports = grammar({
      name: "my_language",
      rules: {
        source_file: ($) => repeat($.statement),
        statement: ($) => seq($.identifier, ";"),
        identifier: ($) => /[a-z]+/,
      },
    });
    
  3. Generate with tree-sitter generate: Compiles grammar.js to C code in src/parser.c.

  4. Test with tree-sitter parse <file>: Parse a sample file and inspect the resulting syntax tree.

  5. Iterate: Refine rules, regenerate, test until the grammar is correct.

Grammar DSL

The Grammar DSL is a set of JavaScript functions for expressing syntax rules. All patterns are built from simple, composable blocks.

Quick Reference

Pattern         Example                         Matches
Literal string  "if"                            Exactly the text “if”
Regex           /\d+/                           One or more digits
Sequence        seq(a, b, c)                    a followed by b followed by c
Choice          choice(a, b)                    Either a or b
Repeat          repeat(a)                       Zero or more of a
Repeat 1+       repeat1(a)                      One or more of a
Optional        optional(a)                     Zero or one of a
Symbol ref      $.rule_name                     Reference another rule
Token           token(a)                        Squash a into a single leaf node
Precedence      prec(1, a)                      Resolve conflicts; higher number wins
Associativity   prec.left(a) / prec.right(a)    Group left-to-right or right-to-left

Pattern Functions

Core DSL functions for building rules in grammar.js. All rules are functions accepting $, which provides access to other rules.

Basic Patterns

  • Symbols (the $ object): Reference other grammar rules.

    rules: {
      expression: $ => $.identifier,  // Reference the 'identifier' rule
      identifier: $ => /[a-z]+/
    }
    

    Pitfall: avoid rule names starting with MISSING or UNEXPECTED, as tree-sitter test treats them specially.

  • String literals: Match exact text (become anonymous nodes in the AST).

    rules: {
      keyword: $ => 'if',     // Matches exactly "if"
      semicolon: $ => ';',    // Matches exactly ";"
    }
    
  • Regular expressions: Match patterns (become anonymous nodes in the AST).

    rules: {
      number: $ => /\d+/,                           // Digits
      identifier: $ => new RustRegex('(?i)[a-z]+') // Case-insensitive
    }
    

    Tree-sitter compiles JS regex to C using Rust’s regex syntax, ignoring JS semantics. Lookaheads (?=) and lookbehinds (?<=) are not supported (break LR(1) parsing).

  • Sequences: Match rules in strict order.

    rules: {
      return_statement: ($) => seq("return", $.expression, ";");
    }
    
  • Choices: Match any one rule (logical OR).

    rules: {
      literal: ($) => choice("true", "false", $.number);
    }
    

    Order does not matter. If choices overlap, use prec() to resolve conflicts.

Repetition and Optionality

  • Repetitions: Match a rule zero or more times (repeat) or one or more times (repeat1).

    rules: {
      block: $ => seq('{', repeat($.statement), '}'),     // 0+
      list: $ => repeat1($.identifier),                     // 1+
    }
    

    Tip: use repeat(X) instead of optional(repeat1(X)).

  • Options: Match a rule zero or one time.

    rules: {
      class_declaration: ($) => seq(optional("export"), "class", $.identifier);
    }
    

Advanced Patterns

  • Tokens: Squash a complex sequence of strings and regexes into a single leaf node.

    rules: {
      html_comment: ($) => token(seq("<!--", /.*/, "-->"));
    }
    

    token() only accepts terminal rules (strings, regexes, seq, choice). Cannot pass non-terminal symbols (like $.statement).

  • Immediate tokens: Force a token to match only when physically touching the previous token (no intervening whitespace).

    rules: {
      decorator: ($) => seq("@", token.immediate(/[a-zA-Z_]\w*/));
    }
    
  • Field names: Give child nodes semantic names for easier access.

    rules: {
      assignment: ($) =>
        seq(field("target", $.identifier), "=", field("value", $.expression));
    }
    
  • Aliases: Rename a rule’s node type in the AST.

    rules: {
      expression: ($) => alias($.identifier, $.variable); // Appears as 'variable' node
    }
    

Conflict Resolution

When multiple grammar rules could match the same input, use precedence and associativity to disambiguate.

  • Precedence (prec(number, rule)): Assign numerical priority to resolve conflicts. Higher number wins.

    rules: {
      unary_expression: $ => prec(1, seq('-', $.expression)),
      binary_expression: $ => seq($.expression, '-', $.expression)
    }
    

    There are two flavors of precedence depending on context:

    • Parse precedence (structural): Resolves LR(1) conflicts at compile time. Higher number wins.
    • Lexical precedence (tokens): Resolves which token definition matches at the character level. Wrap in token().

    Default precedence is 0. Use negative numbers to deprioritize. To apply lexical precedence, nest inside token(): token(prec(1, 'keyword')).

  • Associativity: When precedence values are equal, associativity acts as a tie-breaker.

    rules: {
      subtraction: ($) => prec.left(1, seq($.expr, "-", $.expr)); // (1 - 2) - 3
    }
    
    rules: {
      assignment: ($) => prec.right(1, seq($.id, "=", $.expr)); // a = (b = c)
    }
    

    Only matters when precedence is tied. If rule A has precedence 2 and rule B has precedence 1, associativity is ignored.

  • Dynamic precedence: Resolves true runtime ambiguities during GLR parsing.

    rules: {
      ambiguous_rule: ($) => prec.dynamic(1, choice($.pattern_a, $.pattern_b));
    }
    

    While prec() resolves conflicts at grammar compile time, prec.dynamic() scores competing parse trees at runtime. Tree-sitter explores all ambiguous branches (GLR), then picks the tree with the highest total dynamic precedence score. (See Strange Loop talk for details.)

  • Reserved keywords: Contextually override forbidden keywords in specific rules, allowing keywords to be used as identifiers in some places.

    export default grammar({
      name: "my_language",
      reserved: {
        global: ($) => ["if", "else", "class"],
        allow_all: ($) => [],
      },
      rules: {
        variable_name: ($) => $.identifier, // Uses global reserved set
        property_name: ($) => reserved("allow_all", $.identifier), // Allows keywords like `obj.if`
      },
    });
    
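To contrast the parse-level and lexical flavors in one place, here is a hypothetical fragment (rule names invented; a sketch in the style of the examples above, not taken from the tree-sitter docs):

```javascript
rules: {
  // Parse precedence: resolves the grammar-level conflict, so unary
  // minus binds tighter than binary minus.
  unary_expression: $ => prec(2, seq('-', $.expression)),
  binary_expression: $ => prec.left(1, seq($.expression, '-', $.expression)),
  // Lexical precedence: nested inside token(), so the keyword wins
  // over the identifier regex when both match the same characters.
  keyword: $ => token(prec(1, 'let')),
  identifier: $ => /[a-z]+/,
}
```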

Top-Level Configuration

Beyond rules, tree-sitter grammars accept several top-level config fields that fine-tune parsing behavior, optimize tree structure, and handle special cases. Each solves a specific problem that grammar rules alone cannot address.

graph TD
    A["Top-level Configs"] --> B["Structure"]
    A --> C["Behavior"]
    A --> D["Performance"]

    B --> B1["inline: Inline nodes"]
    B --> B2["supertypes: Abstract groups of grammar rules"]

    C --> C1["extras: Allowed anywhere (trivia tokens that have no meanings)"]
    C --> C2["conflicts: GLR fallback"]
    C --> C3["reserved: Context keywords"]
    C --> C4["externals: Custom lex"]

    D --> D1["word: Keyword optimization"]
    D --> D2["precedences: Named levels"]

Quick Reference

| Config | Purpose | Use when |
|---|---|---|
| extras | Allow tokens anywhere (whitespace, comments) | You want to skip whitespace/comments without mentioning them in rules |
| inline | Hide intermediate rule nodes from AST | Helper rules clutter the tree but organize your grammar |
| conflicts | Enable GLR for ambiguous rules | Grammar has intentional ambiguities that can’t be resolved by precedence |
| externals | Custom C lexer for context-sensitive tokens | Language needs stateful token rules (Python indentation, template strings) |
| precedences | Named precedence levels instead of numbers | You want readable precedence definitions instead of magic numbers |
| word | Keyword extraction optimization | Language has many keywords, want fast lookup |
| supertypes | Abstract node categories | Multiple rules represent one semantic concept (all expressions, all statements) |
| reserved | Contextual keywords | Words are keywords in some contexts, identifiers in others (e.g., obj.if) |

extras: Allow Tokens Anywhere

Allows tokens such as whitespace and comments to appear anywhere (between any two tokens) without polluting the main rules.

export default grammar({
  name: "my_language",
  extras: ($) => [
    /\s/,
    /\/\/.*$/, // Line comments
    /\/\*[\s\S]*?\*\//, // Block comments
  ],
  rules: {
    source_file: ($) => repeat($.statement),
    statement: ($) => seq($.keyword, $.expression, ";"),
  },
});

The default is whitespace. It can be turned off by specifying extras: $ => [].

inline: Flatten Intermediate Nodes

Some rules are lightweight wrappers around a set of other rules - they themselves should not appear in the AST to reduce noise. Essentially, we want both:

  • The ease of writing grammars by utilizing helper grammar rules.
  • The ease of traversing the parse tree without the noise of the helper nodes.

inline eliminates the nodes corresponding to helper rules from the AST, as if the helper rules were written inline in the rules that use them.

export default grammar({
  name: "my_language",
  inline: ($) => [$._expression, $._statement],
  rules: {
    source_file: ($) => repeat($._statement),
    _statement: ($) => choice($.if_statement, $.while_statement, $.assignment),
    if_statement: ($) => seq("if", $._expression, $.block),
    _expression: ($) => choice($.binary_op, $.number, $.identifier),
    binary_op: ($) => seq($._expression, /[+\-*/]/, $._expression),
  },
});

At compile time, _expression and _statement are expanded into their call sites. The runtime AST omits these intermediate nodes, simplifying tree navigation.
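To see the effect on the emitted tree, here is a plain JavaScript sketch (not tree-sitter internals): nodes whose type starts with `_` are removed and their children are spliced into the parent, which is the shape inlining produces.

```javascript
// Illustrative sketch (not tree-sitter internals): remove helper nodes
// (types starting with "_") and splice their children into the parent,
// mimicking how `inline` flattens the emitted AST.
function flatten(node) {
  const children = (node.children ?? []).flatMap((child) =>
    child.type.startsWith("_") ? flatten(child).children : [flatten(child)],
  );
  return { type: node.type, children };
}

const raw = {
  type: "source_file",
  children: [
    {
      type: "_statement",
      children: [{ type: "if_statement", children: [] }],
    },
  ],
};

console.log(flatten(raw).children[0].type); // "if_statement"
```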

conflicts: Enable GLR for Ambiguities

An array of rule sets (which are arrays themselves).

Some conflicts are inherent in a language; conflicts marks a set of rules as intentionally conflicting. Such conflicts are resolved by GLR branching, and ties are broken in favor of the tree with the highest total dynamic precedence.

externals: Custom Lexing

Some languages have context-sensitive lexical rules that regexes (which can only express regular languages) cannot express. Common examples include:

  • Indentation-based tokens (Python’s INDENT/DEDENT).
  • Heredocs in Bash and Ruby.
  • Percent strings in Ruby.
  • Template string boundaries.
  • Haskell’s significant whitespace.

External scanners allow plugging in custom C code to handle these advanced cases.

precedences: Named Precedence Levels

Using magic numbers to manage prec can be fragile:

  • Introducing a new precedence level between existing ones may require renumbering everything.
  • The magic numbers do not mean anything on their own.

precedences is an array of arrays of strings. Each inner array defines a group of named precedence levels, ordered from highest to lowest precedence.

export default grammar({
  name: "my_language",
  precedences: ($) => [
    [
      "multiplication",
      "addition",
      "comparison",
      "equality",
      "logical_and",
      "logical_or",
    ],
  ],
  rules: {
    logical_or: ($) =>
      prec.left("logical_or", seq($._expression, "||", $._expression)),
    logical_and: ($) =>
      prec.left("logical_and", seq($._expression, "&&", $._expression)),
  },
});

word: Keyword Extraction Optimization

The name of the token that will match keywords, feeding the keyword extraction optimization.

Probably, the keyword extraction optimization can only handle one designated token type, so a grammar can specify only a single word token.

export default grammar({
  name: "my_language",
  word: ($) => $.identifier_token,
  rules: {
    if_statement: ($) => seq("if", "(", $.expression, ")", $.block),
    identifier_token: ($) => token(prec(1, /[a-zA-Z_]\w*/)),
  },
});

supertypes: Abstract Node Categories

Multiple different grammar rules represent a single semantic concept (e.g., binary_op, unary_op, and function_call are all expressions), but they appear as separate nodes in the tree.

supertypes allows rules to be marked as supertypes; tree-sitter automatically hides them from the parse tree while still letting you reference them by name in the grammar.

reserved: Contextual Keywords

Some words are keywords in one context (e.g., if as a statement) but valid identifiers in another (e.g., obj.if as a property name).

grammar({
  name: "example",
  reserved: {
    global: ($) => ["if", "else", "function"], // Globally forbidden as identifiers
    property_keywords: ($) => [], // Empty set to allow all words
  },
  rules: {
    // Uses the global reserved set; 'if' will fail to match as a variable
    variable: ($) => $.identifier,

    // Explicitly uses the empty 'property_keywords' set to allow 'if'
    property_name: ($) => reserved("property_keywords", $.identifier),
  },
});

Solution: Define named reserved word sets, apply the global set everywhere, and override with a permissive set in specific contexts.

The structure is similar to the rules property.

  • Each reserved rule must be a terminal token.
  • Each reserved rule must exist and be used in the grammar.
  • The first reserved word set is the global word set, which will apply by default. The reserved function can be called to change the reserved words.

Tree-sitter Grammar Guideline

Source: tree-sitter

Link: https://tree-sitter.github.io

Takeaways

We want to choose the best CFG among the infinite number of CFGs that can describe a language.

The best CFG is determined by:

  • Direct correspondence between symbols in the grammar and the recognizable constructs in the language (ignoring non-relevant symbols like trivia). -> Easier to analyze.
  • Should closely adhere to LR(1) for performance reasons.

Tree-sitter grammars are said to be different from Yacc and Bison (in terms of grammar authoring) & different in nature from ANTLR grammars, PEG, ambiguous grammar (in language specifications). (? Why)

Design Phase

Criteria for a tree-sitter grammar:

  • Intuitive mapping: Grammar symbols should map to recognizable language constructs. The AST should mirror how developers think about code, not formal specifications.
  • Breadth-first development: Start with skeleton rules covering major categories (declarations, statements, expressions), then incrementally flesh out details. Test frequently.
  • LR(1) compatibility: Tree-sitter uses GLR parsing and performs best with LR(1)-compatible grammars (like Yacc/Bison), not ANTLR or PEG approaches.

Design Process

  1. Start with a formal specification of the language (probably containing a CFG, but this can be hard to trivially translate into a tree-sitter grammar).
  2. Work out the just-enough high-level constructs of the programming language, typically: Declarations, Definitions, Statements, Expressions, Types and Patterns.

    The start rule for the grammar is the first property in the rules object. Perform a breadth-first design: Cover the language in breadth, not a subset of it.

  3. Iteratively build out each high-level construct just enough, then switch to other constructs. In the process, test often.

    Tests for each rule are placed in test/corpus.

Standard Rule Naming Conventions

Rule names are mostly unrestricted, but for the sake of comprehensibility, we should follow these established conventions:

| Convention | Example | Purpose |
|---|---|---|
| Root node | source_file | Represents the entire file |
| Major constructs | expression, statement | Top-level language concepts, typically represented as a supertype/choice of the more specific sub-expression/sub-statement rules |
| Block scopes | block | Parent node for block scopes |
| Types | type | The types, such as int, choice and void |
| Identifiers | identifier | Variable names, function arguments and object fields, commonly used as the word token |
| Strings | string | Represent string literals |
| Comments | comment | Represent comments, commonly put in extras |

Good Grammar Rule Structure

Commonly, to write a grammar for an expression language, one defines a root rule that cascades down through expression rules for operators of increasing precedence:

Expression        -> Assignment
Assignment        -> Conditional
Conditional       -> Or
Or                -> And
And               -> Equality
Equality          -> Comparison
Comparison        -> Add
Add               -> Mul
Mul               -> Unary

However, the notorious problem with this approach is that it creates very deep nesting that is absolutely unnecessary. Furthermore, it’s hard to know which type of expression a node represents.

Although this section of the docs doesn’t specify how to properly do this, prec should probably be leveraged here to produce an AST that is appropriate from the user’s perspective.

Update: Check out the Precedence Usage section below.

Precedence Usage

Continue from the above section.

In short, tree-sitter provides us with a whole lot of utilities for building more ergonomic ASTs, instead of weirdly deep structures.

grammar({
  rules: {
    // ...
    expression: ($) =>
      choice(
        $.unary_expression,
        $.binary_expression,
        // ...
      ),

    unary_expression: ($) =>
      choice(
        seq("-", $.expression),
        // ...
      ),

    binary_expression: ($) =>
      choice(
        seq($.expression, "*", $.expression),
        seq($.expression, "-", $.expression),
        // ...
      ),
  },
});

This is a flat representation, but is highly ambiguous. Tree-sitter would complain:

Error: Unresolved conflict for symbol sequence:

  '-'  _expression  •  '*'  …

Possible interpretations:

  1:  '-'  (binary_expression  _expression  •  '*'  _expression)
  2:  (unary_expression  '-'  _expression)  •  '*'  …

Possible resolutions:

  1:  Specify a higher precedence in `binary_expression` than in the other rules.
  2:  Specify a higher precedence in `unary_expression` than in the other rules.
  3:  Specify a left or right associativity in `unary_expression`
  4:  Add a conflict for these rules: `binary_expression` `unary_expression`

This is reasonable, as tree-sitter is unsure whether - 3 * 2 should be parsed as - (3 * 2) or (-3) * 2.

The • character indicates the ambiguity point.

Remedy with prec:

{
  // ...

  unary_expression: ($) =>
    prec(
      2,
      choice(
        seq("-", $.expression),
        seq("!", $.expression),
        // ...
      ),
      ),
}

In this case, the - 3 * 2 ambiguity is resolved.

Associativity Usage

Expressions like 1 * 2 * 3 still cause an ambiguity, because both * operators have the same precedence, so these two are valid derivations:

1 * 2 * 3 -> 1 * (2 * 3)
// expression
// -> binary_expression
// -> expression '*' expression
// -> number '*' (binary_expression)
// Result: (binary_expression 1 '*' (binary_expression 2 '*' 3))

1 * 2 * 3 -> (1 * 2) * 3
// expression
// -> binary_expression
// -> expression '*' expression
// -> binary_expression '*' number
// Result: (binary_expression (binary_expression 1 '*' 2) '*' 3)

prec.left and prec.right will tell tree-sitter whether to prioritize the former or the latter case.

In this case, prec.left should be used:

{
  // ...

  binary_expression: $ => choice(
    prec.left(2, seq($.expression, '*', $.expression)),
    prec.left(1, seq($.expression, '+', $.expression)),
    // ...
  ),
}
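As a plain JavaScript illustration (not parser code), the two groupings that prec.left and prec.right choose between can be reproduced with left and right folds over the operand list:

```javascript
// Illustrative sketch: grouping operands of equal precedence with left
// vs. right associativity, rendered as nested S-expression-like strings.
function groupLeft(operands, op) {
  // Fold from the left: ((1 * 2) * 3)
  return operands.reduce((acc, x) => `(${acc} ${op} ${x})`);
}
function groupRight(operands, op) {
  // Fold from the right: (1 * (2 * 3))
  return operands.reduceRight((acc, x) => `(${x} ${op} ${acc})`);
}

console.log(groupLeft([1, 2, 3], "*"));  // "((1 * 2) * 3)"
console.log(groupRight([1, 2, 3], "*")); // "(1 * (2 * 3))"
```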

Conflicts Usage

By default, tree-sitter marks all conflicts as errors. But they can be desirable.

The example in the tree-sitter docs seems a bit weird to me, as I think array and array_pattern can be disambiguated from the surrounding context alone (like the left side of assignments or declarations).

To mark something as intentionally conflicting, we use the top-level conflicts field:

{
  name: "javascript",

  conflicts: $ => [
    [$.array, $.array_pattern],
  ],

  rules: {
    // ...
  },
}

Hidden Rules (_rule)

Rules starting with _ are hidden from the AST, though they still exist in the parser.

I found this quite similar to inlined rules, but they seem to differ a bit: inlined rules are totally eliminated from the parser, while hidden rules are still known to it. I wonder how that differs from the user’s perspective.

Fields Usage

Fields can be given names to allow named accesses instead of indexed accesses.

function_definition: ($) =>
  seq(
    "func",
    field("name", $.identifier), // this one can be accessed via `ts_node_child_by_field_name`
    field("parameters", $.parameter_list),
    field("return_type", $._type),
    field("body", $.block),
  );

Extras Usage

“Extra” tokens like comments and spaces can appear between any two significant tokens.

grammar({
  name: "my_language",

  extras: ($) => [
    /\s/, // whitespace
    $.comment,
  ],

  rules: {
    comment: ($) =>
      token(
        choice(seq("//", /.*/), seq("/*", /[^*]*\*+([^/*][^*]*\*+)*/, "/")),
      ),
  },
});

For complicated extra tokens, it’s preferable to associate the pattern with a named rule to avoid bloating the parser, because tree-sitter inlines anonymous extras everywhere. However, inline patterns are fine for whitespace character classes like \s or [ \t\n\r], which tree-sitter auto-detects.

Supertypes usage

For abstract categories of syntax nodes (“expression”, “type”, “declaration”), they are represented as choices between other sub-rules:

expression: $ => choice(
  $.identifier,
  $.unary_expression,
  $.binary_expression,
  $.call_expression,
  $.member_expression,
  // ...
),

But, this causes expression to be added as a syntax node, creating an unnecessary nesting level. To eliminate these, we use supertypes:

module.exports = grammar({
  name: "javascript",

  supertypes: ($) => [$.expression],

  rules: {
    expression: ($) =>
      choice(
        $.identifier,
        // ...
      ),

    // ...
  },
});

So how does supertypes differ from inlined rules or hidden rules? supertypes is optimized for this use case, and it can be queried (pattern matched), unlike inlined or hidden rules.

Deep-dive & Optimization Guide - Lexical Analysis

Tree-sitter’s parsing process = Parsing + Lexing.

Token Conflict Resolution

Token conflicts are much more common than grammar rule conflicts, such as the keyword "if" and identifiers /[a-z]+/.

There are various scenarios of token conflicts:

  • Context-aware lexing: Tree-sitter performs lexing on demand during parsing. -> The lexer only matches tokens that are possible at the current position in the file.
  • Lexical precedence: Precedence values inside a token() function let the lexer break ties when two tokens can match at a given position.
  • Match length: When multiple tokens of the same precedence can match at a given position, the token that matches the longest sequence is chosen.
  • Match specificity: If multiple tokens of the same precedence match the same sequence of characters, bare-string tokens are prioritized over regexp tokens.
  • Rule order: Otherwise, the token that appears earlier in the grammar is prioritized.

In short, these are the prioritization order (only tokens that can appear at a given position are considered):

  1. The lexical precedence specified in the grammar.
  2. The length of the matched token.
  3. String over Regexp.
  4. The earlier rule in the grammar is prioritized.
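The prioritization order above can be sketched as a comparator over candidate token matches (illustrative JavaScript, not tree-sitter internals; the winning token sorts first):

```javascript
// Illustrative sketch of the token tie-breaking order (not tree-sitter
// internals). Candidates are assumed valid at the current position.
function compareTokens(a, b) {
  if (a.precedence !== b.precedence) return b.precedence - a.precedence; // 1. higher lexical precedence
  if (a.matchLength !== b.matchLength) return b.matchLength - a.matchLength; // 2. longer match
  if (a.isString !== b.isString) return a.isString ? -1 : 1; // 3. string over regexp
  return a.ruleIndex - b.ruleIndex; // 4. earlier rule in the grammar
}

const candidates = [
  { name: "identifier", precedence: 0, matchLength: 2, isString: false, ruleIndex: 5 },
  { name: '"if" keyword', precedence: 0, matchLength: 2, isString: true, ruleIndex: 1 },
];

// Same precedence and length, so the bare-string token wins:
console.log([...candidates].sort(compareTokens)[0].name); // '"if" keyword'
```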

External scanners can produce tokens that are treated differently from regular tokens.

Lexical Precedence & Parse Precedence

  • Lexical Precedence: Decides which token matches a specific piece of raw text. This is a lower-level operation and happens first.
  • Parse Precedence: Decides which structural rule should be chosen to interpret a sequence of already-extracted tokens.
  • The tip is that: When you get completely stuck debugging a grammar, it is almost always a lexical precedence problem, not a structural one.
  • Syntax distinction:
    • token(prec(N, ...)) = Applies Lexical Precedence (prioritizes the token generation).
    • prec(N, token(...)) = Applies Parse Precedence (prioritizes the structural rule, which won’t help if the lexer is grabbing the wrong token).

Keywords

  • Problem: In many languages, keywords (like if or instanceof) are just specific strings that overlap with your generic identifier regex (like /[a-z]+/).

  • Example:

    if (a instanceofSomething) b();
      //  ^^^^^^^^^^^^^^^^^^^
      //   we're here
    
  • Tree-sitter uses context-aware lexing:

    • In a normal JavaScript grammar, the parser expects a keyword/operator/etc. there (rather than an identifier).
    • Therefore, instanceofSomething would never even be tried as an identifier by tree-sitter at that position.
    • Instead, tree-sitter would likely match the keyword instanceof first and then chop off Something as the following identifier, which is incorrect, as an identifier cannot immediately follow a keyword.
  • The tree-sitter’s solution is to use the word top-level configuration rule.

  • With word, tree-sitter uses a 2-step process:

    1. It first matches the whole word token (e.g., the entire run of identifier characters).
    2. It then checks whether the matched text equals the expected keyword by string comparison; if not, the match is treated as an ordinary word token.
  • Benefits:

    • Improved error detection.
    • (Side effect) Smaller & simpler lexing function, which is more efficient.
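The two-step word process can be sketched in plain JavaScript (illustrative only): lex one whole word first, then classify it by comparing its text against the keyword set, so a keyword prefix never splits an identifier.

```javascript
// Illustrative sketch of word-token keyword extraction (not tree-sitter
// internals): lex one whole word, then classify it by string comparison.
const KEYWORDS = new Set(["if", "else", "instanceof"]);

function lexWord(input, pos) {
  const match = /^[a-zA-Z_]\w*/.exec(input.slice(pos));
  if (!match) return null;
  const text = match[0];
  return { type: KEYWORDS.has(text) ? "keyword" : "identifier", text };
}

// The whole word is consumed first, so a keyword prefix never splits it:
console.log(lexWord("instanceofSomething", 0));
// { type: "identifier", text: "instanceofSomething" }
console.log(lexWord("instanceof x", 0));
// { type: "keyword", text: "instanceof" }
```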

To avoid ambiguity: The word token must not be reused in another rule. To allow reusing, one must alias the word token:

_type_identifier: $ => alias($.identifier, $.type_identifier),

So basically, alias works somewhat like inlined or hidden rules, but with different use cases in mind.

External Scanners

External scanners are custom scanners plugged into tree-sitter. These are needed as the approach of using regexes to specify tokens does not suffice for some non-regular token grammars such as:

  • Off-side rule (indent and dedent tokens in Python).
  • Heredocs in Bash & Ruby.

Tree-sitter resolves this by allowing authors to plug in external scanners.

Usage

  1. Add an externals top-level configuration:

    grammar({
      // ...
      externals: ($) => [$.indent, $.dedent, $.newline],
    });
    

    This field declares the names of all the external tokens, which can then be used elsewhere in the grammar.

  2. Add src/scanner.c (mandatory) as a C source file containing the external scanner.

  3. In src/scanner.c, define an enum containing the names of the external tokens. The order must match the order in the grammar’s externals (names can differ).

    #include "tree_sitter/parser.h"
    #include "tree_sitter/alloc.h"
    #include "tree_sitter/array.h"
    
    enum TokenType {
      INDENT,
      DEDENT,
      NEWLINE
    };
    
  4. Write the custom scanning logic based on 5 actions: create, destroy, serialize, deserialize and scan.

    • create: Create the scanner object.

      void * tree_sitter_<your_language>_external_scanner_create() {}
      
    • destroy: Free memory of the scanner object.

      void tree_sitter_<your_language>_external_scanner_destroy(void *scanner) {}
      
    • serialize: Serialize the scanner’s state into a byte buffer & return the number of written bytes. This is called by tree-sitter after the external scanner successfully recognizes a token.

      unsigned tree_sitter_<your_language>_external_scanner_serialize(
        void *scanner,
        char *buffer
      ) {}
      
    • deserialize: Restore the state of the scanner from the written bytes.

      void tree_sitter_<your_language>_external_scanner_deserialize(
        void *scanner,
        const char *buffer,
        unsigned length
      ) {}
      
    • scan: Tree-sitter calls this function with the set of currently valid symbols so the external scanner can attempt to scan those tokens.

      bool tree_sitter_<your_language>_external_scanner_scan(
        void *scanner,
        TSLexer *lexer, // The tree-sitter lexer exposed to the external scanner, which the external scanner will call
        const bool *valid_symbols // a binary array such that `valid_symbols[TOKEN_TYPE]` indicates whether that token type is valid
      ) {}
      

    The names are in the following format: tree_sitter_<your_language>_external_scanner_<action>.

    So basically, we can think of src/scanner.c as an interface with methods that tree-sitter can call. Alternatively, we can think of src/scanner.c as using these action hooks to hook into tree-sitter.

Utils

See more here: https://tree-sitter.github.io/tree-sitter/creating-parsers/4-external-scanners.html#external-scanner-helpers

Tree-sitter Query System

Source: tree-sitter

Link: https://tree-sitter.github.io

Takeaways

  • Separate grammar from navigation: Queries enable applications without modifying grammar, keeping grammar focused on parsing and semantics.
  • Declarative patterns are more maintainable and portable than imperative traversal code written in each binding language.
  • Portability: The same .scm query file works across all language bindings and tools for standardized highlighting and navigation.
  • Architecture: C core extracts queries; language bindings evaluate predicates and directives. This splits parsing (fast, in C) from filtering (flexible, in bindings).

Quick Reference

| Pattern | Purpose | Example |
|---|---|---|
| Capture | Extract a matched node | (identifier) @name |
| Wildcard | Match any node | (_) matches any named node |
| Field name | Match by field | left: (identifier) matches specific child |
| String literal | Match anonymous node | "if" or "(" |
| Negation | Must not have field | !type_parameters |
| Quantifier | Repetition | (arg)* or (arg)+ |
| Alternation | Match one of several | ["if" "while" "for"] |
| Anchor | Position constraint | . before/after/between children |
| Predicate | Conditional filter | (#eq? @name "foo") |
| Directive | Metadata attachment | (#set! language "javascript") |

Syntax

  • Query: A series of patterns that is used to extract certain code structure out of the source code.
  • Pattern: An S-expression matching a certain set of syntax nodes.

Pattern’s structure:

(<node-type> ...<children-patterns>)

For example:

(binary_expression (number_literal) (number_literal))

would match a binary expression with two children of type number literal.

Some children can be omitted; the child patterns that remain describe only the expected order of those children.

(<node-type> <child-1> <child_2>)
;            ^^^^^^^^  ^^^^^^^^
;            <child_1> needs only precede <child_2>, some other nodes may be in between

For example:

(binary_expression (number_literal))

would match any binary expression with a number literal as its child.

Fields: Child patterns can be qualified with field names, which is recommended:

(assignment_expression
  left: (member_expression
    object: (call_expression)))

If the grammar rules define fields named left and object, then this query returns all assignment_expressions whose left is a member_expression whose object is a call_expression.

Negated fields: A field name prefixed with ! indicates a lack of such field.

(class_declaration
  name: (identifier) @class_name
  !type_parameters)

Anonymous nodes: (<pattern>) only applies to named nodes. To reference anonymous nodes, double-quote the textual content.

(binary_expression
  operator: "!="
  right: (null))

Wildcard nodes: _ and (_) match any node. The only difference is that (_) matches any named node, while _ matches any node, named or anonymous.

(call (_) @call.inner)
;         ^^^^^^^^^^^
;         any nodes inside a call

ERROR nodes: Represent unrecognized pieces of text, and can also be queried.

(ERROR) @error-node

MISSING nodes: Missing expected nodes will be inserted as special missing nodes in the syntax tree.

  • To recover from syntax errors, the parser inserts zero-width “phantom” tokens (like a forgotten semicolon) to keep the syntax tree valid.
  • Instead of allocating a completely new, heavyweight structural node in memory, it simply attaches an internal “missing” flag to that generated token.
  • This approach saves memory and processing power because the token occupies zero physical characters in your source code.

Something’s a bit confusing here - Tree-sitter seems to mention nodes and tokens interchangeably.

(MISSING) @missing-node
(MISSING identifier) @missing-identifier
(MISSING ";") @missing-semicolon

Supertype nodes: Supertypes in queries can be used to match any subtypes.

Supertype here means roughly what it does in OOP: a grammar rule that specializes into multiple sub-grammar rules, rather like a union type. The supertype itself won’t be a standalone node in the AST.

(expression) @any-expression
; ^^^^^^^^^
; match any subtypes of `expression`

Match specific subtypes (named or anonymous):

(expression/binary_expression) @binary-expression
(expression/"()") @empty-expression

Operators

The @ operator: Specific nodes in the pattern can be extracted by suffixing the node with a capture name:

(assignment_expression
  left: (identifier) @the-function-name
                    ; ^^^^^^^^^^^^^^^^^
  right: (function))

The quantification operators + & *: Similar to those in regular expressions.

Sequence group:

  • The sequence group () creates an anonymous block of sibling nodes.
  • Its primary use case is grouping nodes so you can apply quantifiers (*, +, ?) to an entire repeating sequence.
  • Example: Matching a repeating comma-separated list using ( (",") (identifier) )*.
  • It does not enforce strict adjacency; undeclared nodes can still appear between the grouped items.
  • Use () to repeat a pattern, and use . (the anchor operator, see below) to glue nodes tightly together.

Alternation operator []: Similar to character classes in regular expressions.

[
  "break"
  "delete"
  "else"
  "for"
  "function"
  "if"
  "return"
  "try"
  "while"
] @keyword

The alternants can be quantified.

Anchor operator .: Used to constrain the ways in which child patterns are matched, with different behaviors based on its position in a query.

  • Before the first child within the parent pattern: That child is only matched when it’s the first named node of the parent.
(array . (identifier) @the-element)
;         ^^^^^^^^^^
;         must be the first

Without the anchor above, @the-element would be bound to every identifier in the array, and accessing it would return an array of all identifiers.

  • After the last child within the parent pattern: Similarly to the above case.

  • Between two child patterns: Match immediate siblings.

(dotted_name
  (identifier) @prev-id
  .
  (identifier) @next-id)

Note that anonymous nodes are not taken into account.

Predicates

The predicate syntax #...?: Add arbitrary conditions or metadata to a pattern. Predicate names start with # and end with ?, and can accept captures or strings as arguments.

The eq? predicate: Compares a capture’s text against a string or another capture.

  • Includes variants: #not-eq?, #any-eq?, #any-not-eq?.
  • By default, quantified captures require all matched nodes to pass the predicate. The any- prefix overrides this to match if at least one node passes.
((identifier) @variable.builtin
  (#eq? @variable.builtin "self"))
(
  (pair
    key: (property_identifier) @key-name
    value: (identifier) @value-name)
  (#eq? @key-name @value-name)
)

The match? predicate: Matches a capture’s text against a regular expression.

  • Also supports the not- and any- prefixes.
((identifier) @constant
  (#match? @constant "^[A-Z][A-Z_]+"))

The any-of? predicate: Checks if a capture’s text strictly equals any string in a provided list.

((identifier) @variable.builtin
  (#any-of? @variable.builtin
        "arguments"
        "module"
        "console"))

The is? predicate: Asserts that a capture has a specific internal property (e.g., using #is-not? local to ensure a matched variable is not a local variable).

Directives

The directive syntax #...!: Associates arbitrary metadata with a pattern or alters its output. They function similarly to predicates but end with ! instead of ?.

The set! directive: Associates arbitrary key-value metadata pairs with a matched pattern (commonly used for language injections).

((comment) @injection.content
  (#match? @injection.content "/[*\/][!*\/]<?[^a-zA-Z]")
  (#set! injection.language "doxygen"))

The select-adjacent! directive: Takes two capture names and filters the first capture’s text so that only nodes adjacent to the second capture are preserved.

The strip! directive: Takes a capture and a regular expression, removing any matched text from that capture’s final output.

Note

  • The core Tree-sitter C library does not execute predicates or directives.
  • It simply extracts them from your query and passes them along as raw metadata.
  • It is entirely up to the higher-level bindings (like Rust, Python, JS, WASM) to read that metadata and actually run the filtering logic (like evaluating #eq?).

So, if a predicate isn’t working, it might be because the specific language binding you are using hasn’t implemented support for it yet, not necessarily because your query syntax is wrong.
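As a sketch of what a binding does (illustrative JavaScript; real bindings operate on TSNode objects and capture indices), here is a minimal evaluator for #eq? and #match? over a match’s captured text:

```javascript
// Illustrative sketch of binding-side predicate evaluation (not a real
// tree-sitter binding). `captures` maps capture names to source text.
function evalPredicate(pred, captures) {
  // Arguments starting with "@" refer to captures; others are string literals.
  const textOf = (arg) => (arg.startsWith("@") ? captures[arg.slice(1)] : arg);
  switch (pred.name) {
    case "eq?":
      return textOf(pred.args[0]) === textOf(pred.args[1]);
    case "match?":
      return new RegExp(pred.args[1]).test(textOf(pred.args[0]));
    default:
      return true; // unsupported predicates are simply not enforced
  }
}

const captures = { constant: "MAX_RETRIES" };
console.log(evalPredicate({ name: "match?", args: ["@constant", "^[A-Z][A-Z_]+"] }, captures)); // true
console.log(evalPredicate({ name: "eq?", args: ["@constant", "self"] }, captures)); // false
```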

Query API

Creating a query (ts_query_new): You create a query by passing in the language and a string containing your patterns.

  • If it fails, error_offset indicates the exact byte where it failed.
  • error_type indicates the cause (Syntax, NodeType, Field, or Capture).
TSQuery *ts_query_new(
  const TSLanguage *language,
  const char *source,
  uint32_t source_len,
  uint32_t *error_offset,
  TSQueryError *error_type
);

Thread safety and state (TSQuery vs TSQueryCursor):

  • TSQuery: The parsed query itself. It is immutable and can be safely shared across multiple threads.
  • TSQueryCursor: Holds the state required to actually run the query. It is not thread-safe, but you can reuse a single cursor for many executions to save memory.
TSQueryCursor *ts_query_cursor_new(void);

Executing a query (ts_query_cursor_exec): Prepares the cursor to find matches within a specific syntax node (usually the root node of your parsed code).

void ts_query_cursor_exec(TSQueryCursor *, const TSQuery *, TSNode);

Iterating matches (ts_query_cursor_next_match): Steps through the successful query matches one by one.

  • Returns true and populates the match structure, or false when there are no more matches.
  • TSQueryMatch contains metadata (which pattern matched) and an array of captured nodes.
  • TSQueryCapture contains the actual TSNode from the syntax tree and the index of its capture name.
typedef struct {
  TSNode node;
  uint32_t index;
} TSQueryCapture;

typedef struct {
  uint32_t id;
  uint16_t pattern_index;
  uint16_t capture_count;
  const TSQueryCapture *captures;
} TSQueryMatch;

bool ts_query_cursor_next_match(TSQueryCursor *, TSQueryMatch *match);

Tree-sitter Services and Applications

Source: tree-sitter

Link: https://tree-sitter.github.io

Takeaways

  • Services decouple from grammar: Syntax highlighting, code navigation, and language injection are query-based, not grammar-based. Grammar describes “what is”, queries describe “how to use it”.
  • Capture names are standardized (e.g., @keyword, @definition.function) instead of language-specific logic. Same queries work across all tools and editors.
  • Scope and context via queries: Local variable tracking and language injection are queries, not lexical features. This keeps grammars simpler while enabling sophisticated analysis.
  • Standardization enables portability: Same .scm query file and role/kind vocabulary work across editors and languages.

[tool] [seminal] .NET Compiler Platform (“Roslyn”) Overview

Source: dotnet/roslyn (GitHub)

Link: https://github.com/dotnet/roslyn/blob/main/docs/wiki/Roslyn-Overview.md

Keywords:

  • Syntax tree
  • Semantic model

Takeaway:

  • Roslyn represents a revolutionary step: Exposing the internal compilation pipeline as a public API.
  • Two main API layers:
    • Lower-level compiler APIs for detailed code analysis.
    • Higher-level workspace APIs for managing complete solutions within a host environment.
  • Core data models representing the code are completely immutable and preserve full fidelity to enable thread-safe analysis without data duplication.
    • ASTs are composed of 3 items: nodes (non-terminals), tokens (terminals), and trivia.
    • Trivia are attached to adjacent tokens instead of appearing as nodes in the AST.
    • Each AST item exposes a span.
    • Tokens and Trivia are expressed as value types.
  • The state flow relies on an active, mutable workspace containing strictly immutable solution snapshots.

Abstract

This document details the motivation behind the Roslyn team’s decision to expose the compiler APIs and its object model, highlighting the trend of “compiler as a platform”.

The important part is their object model and compiler APIs. The object model is essentially the data structures that the compiler APIs accept and return.

Introduction

Compilers are traditionally black boxes:

  • Source code goes in.
  • Object code comes out.
  • The internal model is discarded, even though it contains a deep understanding of the original code.

However, that model, built through parsing, name resolution, and type checking, is exactly what IDE features need. IntelliSense, Go to Definition, Find All References, intelligent rename - all of these require compiler-level understanding of the code.

Roslyn is the .NET Compiler Platform for C# and Visual Basic. Its mission is to open up that black box: instead of every tool rebuilding a partial, inaccurate approximation of what the compiler already computed, Roslyn exposes the compiler’s own data structures as a stable public API.

The same object models that power Visual Studio’s own language features are available to any external tool. The VS IDE was rebuilt on top of the public Roslyn APIs as a validation of this claim - if the public API isn’t sufficient for world-class IDE features, it isn’t good enough.

This advocates for the view of “compilers as platforms”. To be more precise, the information a compiler computes is what powers all code intelligence features.

Exposing the Compiler APIs

Compiler Pipeline

Roslyn exposes each phase of the traditional compiler pipeline to consumers as a separate, queryable component:

| Phase (API) | What it does | Exposed as (object model) |
|---|---|---|
| Parse | Tokenize and parse source text | Syntax tree |
| Declaration | Analyze declarations + imported metadata to form named symbols | Hierarchical symbol table |
| Bind | Match identifiers to symbols | Semantic model |
| Emit | Produce IL bytecode | Emit API |

Every phase is independently accessible. You can parse a file with no compilation context, or query the semantic model of a specific expression without touching the emit phase.

A visualization from Roslyn:

Roslyn’s compiler pipeline API

Because each phase has a well-defined object model, the compiler simply composes these phases into an end-to-end whole.

As in the picture above, each language service maps onto particular phases and their object models.

API Layers

Roslyn’s API layers

Two main layers of APIs: Compiler APIs and Workspaces APIs.

Compiler APIs

The lowest layer. Exposes the object models corresponding to each pipeline phase, both syntactic and semantic. No dependency on Visual Studio.

Contains:

  • Object models for each pipeline phase:
    • Syntax trees.
    • Symbol tables.
    • Semantic models.
  • An immutable snapshot of a single invocation of a compiler:
    • Assembly references.
    • Compiler options.
    • Source code files.

C# and VB have separate but shape-compatible APIs in this layer. This layer has no dependencies on Visual Studio components.

Diagnostic APIs

Extensible diagnostics (can be user-defined) cover:

  • Syntax errors.
  • Semantic errors.
  • Definite assignment errors.
  • Warnings.
  • Informational.

Hosting/Scripting APIs

Used for:

  • Code snippet execution.
  • Accumulating a runtime execution context.

The REPL utilizes these.

Workspaces APIs

The higher-level layer:

  • Organizes an entire solution (projects, files, references, options) into a single object model.
  • Handles project-to-project dependency resolution automatically.
  • Gives direct access to compiler layer objects without manual setup.
  • Surfaces higher-level IDE APIs:
    • Find All References.
    • Formatting.
    • Code Generation.

No dependency on Visual Studio.

Parse API / The Syntax Tree Object Model

Purposes:

  • Allow tools to process the syntactic structure of a project.
  • Enable tools to manipulate source code.

This seems to document the public interface, not how it is all organized internally.

Overview:

  • 3 main syntactic categories: nodes (non-terminal), tokens (terminal), trivia (attached to tokens).
  • Each node/token/trivia has a Span and a FullSpan, containing the first position + the character count.

    Spans are probably computed on the fly; otherwise, AST edits would invalidate a large number of stored spans, and SyntaxToken instances could not be reused. Therefore, a red-green tree design is probably used underneath.

  • Nodes have both properties for Parent and children.

    Red-green trees must be involved here for nodes to expose parents while remaining immutable.
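
The red-green speculation in the notes above can be sketched as a toy model (plain Python, not Roslyn’s actual implementation): green nodes are immutable and store only widths, so they can be shared and reused across edits; red nodes are thin facades created on demand that carry the parent pointer and compute absolute positions.

```python
# Toy red-green tree. Green nodes: immutable, position-free, shareable.
# Red nodes: lazy wrappers adding parent links and absolute offsets.

class GreenNode:
    """Immutable; knows only its kind, children, and total width."""
    def __init__(self, kind, children=(), width=0):
        self.kind = kind
        self.children = tuple(children)
        self.width = sum(c.width for c in children) if children else width

class RedNode:
    """Facade over a green node; positions are computed on demand."""
    def __init__(self, green, parent=None, position=0):
        self.green = green
        self.parent = parent
        self.position = position  # absolute start offset in the source

    @property
    def span(self):
        return (self.position, self.green.width)

    def child(self, i):
        offset = self.position + sum(c.width for c in self.green.children[:i])
        return RedNode(self.green.children[i], parent=self, position=offset)

# "x = 1;" tokenized as "x ", "= ", "1;" (widths 2, 2, 2).
ident = GreenNode("IdentifierToken", width=2)
eq = GreenNode("EqualsToken", width=2)
lit = GreenNode("LiteralToken", width=2)
stmt = GreenNode("Statement", children=[ident, eq, lit])

root = RedNode(stmt)
assert root.span == (0, 6)
assert root.child(2).span == (4, 2)  # literal starts at offset 4
assert root.child(2).parent is root  # red nodes expose a parent
```

Because positions live only in the red layer, an edit can reuse every unchanged green subtree and rebuild only the cheap red facade, which is consistent with spans being computed on the fly rather than stored.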

Syntax Trees

Essentially a tree data structure:

  • Internal nodes are parents of other nodes.
  • Made up of nodes, tokens and trivia.

Three key properties:

  • Full fidelity/Lossless: every token, whitespace, comment, preprocessor directive, and syntax error is preserved. Nothing is thrown away.
  • Round-trippability: any subtree can reconstruct its exact source text. Editing the tree is equivalent to editing the source.
  • Immutability & thread safety:
    • Trees are snapshots that do not change.
    • Allow structural sharing & node-reusing.
    • Allow thread-safe accesses without locks.

Syntax Nodes

  • Represent non-terminal elements: declarations, statements, clauses, expressions.
  • Always non-terminal nodes in the syntax tree - Only tokens can be terminal nodes.

The class hierarchy:

  • A base class: SyntaxNode.
  • Each category of syntax node is a subclass of SyntaxNode.
  • New, custom node classes cannot be added.

The classes:

  • Parent property: A node’s parent. The root node has a null parent.
  • ChildNodes method: A sorted list of child nodes (does not include tokens).
  • A collection of Descendant* methods:
    • DescendantNodes: The list of all nodes in the sub-tree rooted at a node. (In what order though?)
    • DescendantTokens: The list of all tokens in the sub-tree rooted at a node. (Is it sorted?).
    • DescendantTrivia: The list of all trivia in the sub-tree rooted at a node.
  • Type-safe children properties.
    • Example: BinaryExpressionSyntax has Left, OperatorToken and Right additionally.
    • Some can be optional.

Syntax Tokens

  • Terminals: keywords, identifiers, literals, punctuation.
  • Never the parents of other nodes.

Implementation:

  • Implemented as CLR value types (a single SyntaxToken struct for all token kinds) for performance.
  • Carry both the raw source text and a decoded value: Value (typed as Object) holds the decoded result, and ValueText holds its string form (e.g., a Unicode-escaped identifier normalized to plain text).

Syntax Trivia

  • Insignificant to the understanding of the code: whitespace, comments, preprocessor directives.
  • Still kept for full fidelity and round-trippability.

Relationship with tokens:

  • Not tree children in the normal sense - attached to adjacent tokens via LeadingTrivia / TrailingTrivia. A token owns any trivia after it on the same line up to the next token.
  • Do not have a parent node.
  • Trivia at the start of a line attaches to the following token as leading trivia.

Implementation:

  • Token property: The owning token.
  • Also CLR value types: SyntaxTrivia

Token and trivia

Spans

The position & length information of a node, token or trivia.

  • Text position = A 32-bit integer that is a zero-based Unicode character index.
  • Text span = A beginning position + A count of characters. A zero-length span refers to a point between two characters (an insertion position) rather than to any character.

A node has two TextSpan properties:

  • Span: text range of the node excluding trivia.
  • FullSpan: text range including leading/trailing trivia.

Both are zero-based Unicode character indices.
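
A small illustration of the Span/FullSpan difference (hypothetical offsets, not output from Roslyn):

```python
# A statement preceded by whitespace + a comment (leading trivia)
# and followed by two spaces (trailing trivia).
source = "  // note\nx = 1;  "

leading_trivia = "  // note\n"   # 10 characters before the node
node_text = "x = 1;"             # 6 characters of actual node text
trailing_trivia = "  "           # 2 trailing characters

span = (len(leading_trivia), len(node_text))                       # excludes trivia
full_span = (0, len(leading_trivia) + len(node_text) + len(trailing_trivia))

assert span == (10, 6)
assert full_span == (0, 18)
assert full_span[1] == len(source)  # root FullSpan covers the whole text
```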

Kinds

  • Every node, token or trivia has a RawKind castable to a language-specific SyntaxKind enum.
    • This disambiguates elements sharing the same class - e.g., BinaryExpressionSyntax can be AddExpression, SubtractExpression, etc.

      So RawKind doesn’t have an inherent meaning?

  • For tokens and trivia, RawKind is the only way to distinguish element types.

So basically, SyntaxNode is the low-level underlying representation for the “higher-level” node types appearing in a specific programming language?

Error Recovery

Malformed code still produces a full, round-trippable tree. Two strategies:

  • Missing token: Expected but absent token is inserted as a zero-width node (IsMissing = true).
  • Skipped tokens: The parser may skip tokens during error recovery. Unrecognized tokens are attached as SkippedTokens trivia.

    So lexing and parsing must be happening at the same time, as otherwise, SyntaxToken wouldn’t be immutable.
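
The missing-token strategy can be illustrated with a toy parser (not Roslyn code): an expected-but-absent token is materialized as a zero-width entry so the tree stays complete and round-trippable.

```python
# Parse "x = ;" against the expected shape IDENT '=' LITERAL ';'.
tokens = [("IdentifierToken", "x "), ("EqualsToken", "= "), ("SemicolonToken", ";")]

def parse_assignment(toks):
    out, i = [], 0
    for expected in ("IdentifierToken", "EqualsToken", "LiteralToken", "SemicolonToken"):
        if i < len(toks) and toks[i][0] == expected:
            out.append({"kind": expected, "text": toks[i][1], "missing": False})
            i += 1
        else:
            # Insert a zero-width missing token (IsMissing = true).
            out.append({"kind": expected, "text": "", "missing": True})
    return out

tree = parse_assignment(tokens)
assert [t["missing"] for t in tree] == [False, False, True, False]
# Round-trippable: concatenating token texts reproduces the source exactly.
assert "".join(t["text"] for t in tree) == "x = ;"
```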

Declaration/Binding API + Hierarchical symbol table/Semantic model

  • Name resolution.
  • References to compiled libraries.

Compilation

  • Represents everything needed to compile a program: source files, assembly references, compiler options.
  • Contains methods for finding and relating the symbols declared in source code or metadata from compiled libraries.
  • Immutable - changes produce a new Compilation derived from the old one.

Entry point for all semantic queries: symbol lookup by name, access to the global namespace tree, getting a SemanticModel for any source file.

Symbols

Overview:

  • Represents a declared element in the source code or in compiled libraries’ metadata.
  • Namespaces, types, methods, properties, fields, events, parameters are all represented by symbols.
  • A Compilation has methods and properties to retrieve symbols or the symbol table as a tree of symbols.

Class hierarchy:

  • ISymbol: The interface that all symbols implement.

Symbol implementation:

  • A symbol’s methods and properties expose additional information:
    • Other symbols it references.

Effects:

  • A consistent representation for namespaces, types, and members across both source code and metadata.
  • Elements from source code and imported metadata are treated exactly the same (e.g., a method from source code & metadata are represented by an IMethodSymbol with identical properties).

Symbols are somewhat similar to the CLR type system (the System.Reflection API), but they can model more than types and are a representation of language concepts.

For instance, an iterator method:

  • A single language symbol.
  • A CLR type and multiple CLR methods.

Semantic Model

Scoped to a single source file. Answers:

  • What symbol does this identifier resolve to?
  • What is the type of this expression?
  • What diagnostics apply here?
  • How do variables flow in/out of this region?

Obtained from Compilation by passing a SyntaxTree. Queries are lazy and cached.

Workspace API

  • Entrypoint for whole-solution analysis and refactoring.
  • Organizes information about the projects in a solution into one object model, with direct access to compiler-layer object models.
  • In host environments (e.g., IDEs), a workspace corresponds to an open solution.
  • Loading a solution also goes through this model.

Workspace

  • Workspace: An active representation of a solution.
    • Solution: A collection of projects.
    • Project: A collection of documents.
  • A workspace tracks a host environment, which is constantly changing.
    • Can be created and run independently without needing a host environment or UI.
  • Update mechanism:
    • Fires events when the host environment changes (e.g. user typing).
    • Automatically updates the CurrentSolution property.
    • The fired event indicates which workspace and which particular document changed -> consumers can listen to these.

Solutions, Projects, Documents

Overview

  • Workspaces change continuously with user input, but they provide access to isolated, immutable solution models.
  • Solutions are thread-safe snapshots of projects and documents that can be shared without locking or duplication.

Component hierarchy

  • Projects are immutable solution components that bundle source documents, references, and options to grant direct access to compilations.
  • Documents are immutable project components representing individual source files that provide access to raw text, syntax trees, and semantic models.

Modifications

  • A solution instance obtained from the workspace will never change.
  • You apply updates by constructing a new solution instance based on your specific changes and explicitly applying it back to the workspace.

Visualization of the relationship from Roslyn: Relationships between solution, projects and documents

Using Roslyn APIs to Build a .NET IL Interpreter

Name Resolution

Name resolution is about figuring out what each identifier in the code actually refers to. Given a variable, function call, or type annotation, which declaration does it point to? This requires understanding scoping rules, imports, visibility modifiers, and the module system of the language in question.

Getting this right is critical for go-to-definition, find-all-references, and rename refactoring. The approaches here range from formal scope graph models to Datalog-based relational encodings to tree-sitter-based graph construction DSLs.

Scoping rules vary wildly across languages, so Loupe needs a resolution model that encodes rules as data, not code (language-agnostic). When a file changes, only affected names should be re-resolved (incremental). And since every go-to-definition and rename triggers resolution, it has to be instant (fast).

Subpages

  • Scope Graph: a formal model where scoping rules are encoded as a graph of declarations, references, and scope edges. Language-agnostic by design.
  • Stack Graphs: GitHub’s adaptation of scope graphs for language-agnostic code navigation, designed to work without a full type checker.
  • tree-sitter-graph DSL: a DSL for constructing graphs (including stack graphs) directly from tree-sitter parse trees.

Scope Graph

[tool] [tool-dive] Stack Graphs

Source: GitHub (tree-sitter team)

Link: https://github.com/github/stack-graphs

tree-sitter-graph DSL

Type Checking

Type checking is about verifying that operations in the code are applied to values of compatible types. Beyond catching errors, type information powers features like auto-completion, signature help, and more precise code navigation.

The challenge for a tool like Loupe is doing this in a language-agnostic or at least language-parametric way, without reimplementing a full type checker for every supported language.

This is where the language-agnostic goal is most in tension with correctness. Loupe can’t reimplement every language’s type system, so the question is how much it can infer in a language-parametric way and when to defer to external checkers (e.g., via LSP). Whatever type reasoning Loupe does must be incremental (no full re-checks on edits) and fast enough not to block the user from doing meaningful work. Refactoring should be fast, especially in the scaffolding phase.

Subpages

Polymorphic Type Inference Engines & Template Checking

Statix

Relational Semantics via Datalog & Soufflé

Lady Deirdre Framework

Spoofax Language Workbench

Code Property Graphs (CPG) & the Joern Framework

Code Indexing & Navigation

Code indexing is about pre-computing a structured index of symbols, references, and relationships in a codebase so that navigation queries (go-to-definition, find references, call hierarchy) can be answered quickly without re-analyzing the code on every request.

The main design tension is between index freshness and cost. A full re-index on every change is expensive, but a stale index gives wrong answers. Formats like SCIP and LSIF try to standardize what gets stored, while systems like Glean and Kythe focus on how to store and query it at scale.

Loupe’s index format must not assume a specific language (language-agnostic), must support cheap partial updates when files change (incremental), and must answer navigation queries instantly even on large codebases (fast).

Subpages

  • SCIP - A Better Code Indexing Format than LSIF: Sourcegraph’s comparison of SCIP vs LSIF, arguing for a simpler, more debuggable indexing format.
  • Code Navigation for AI SWEs: a practical look at how AI coding agents can leverage code navigation infrastructure.
  • SCIP: a deep dive into the SCIP indexing format itself.
  • Glean: Meta’s system for collecting, deriving, and querying facts about code at scale.
  • Google Kythe: Google’s ecosystem for building tools that work with code, centered on a language-agnostic graph schema.
  • OpenGrok: Oracle’s fast source code search and cross-reference engine.
  • livegrep: a regex-based code search tool optimized for speed over large repositories.
  • Mycroft: a code indexing and navigation tool.

[blog] [overview] SCIP - A Better Code Indexing Format than LSIF

Source: sourcegraph’s blog

Link: https://sourcegraph.com/blog/announcing-scip

Keywords:

  • Indexing format

Takeaways:

An indexing format should:

  • Be human-readable, for debuggability.
  • Allow for local reasoning.
  • Allow for incrementalization.

Introduction

“Sourcegraph Inc. is a company developing code search and code intelligence tools that semantically index and analyze large codebases so that they can be searched across commercial, open-source, local, and cloud-based repositories.”

An indexing format is a quickly queryable representation of a codebase. Previously, we had LSIF (Language Server Index Format), an open standard by Microsoft that serves as a common indexing format for codebases, enabling rich code navigation, highlighting, etc. This article identifies some pain points of LSIF, and from them the motivation for SCIP.

As indexing formats enable a large part of the Sourcegraph ecosystem, this is a crucial component, especially for code navigation.

Background

There are two fundamentally different ways to implement code navigation: text/heuristic-based and semantic.

Text-based approaches (ctags, tree-sitter) are faster and language-agnostic, but they make mistakes. “Go to definition” might take you to a function with the same name in a different module. “Find references” might miss usages behind aliases or dynamic dispatch. Good enough for quick exploration, bad for anything that requires correctness.

It’s worth separating ctags and tree-sitter here, because they’re quite different in what they actually do.

  • ctags is purely text/regex-based. It scans source files and extracts symbol names and their locations using patterns, with no real parsing. No understanding of scope, types, or imports. “Go to definition” just finds the nearest thing with that name. Inherently search-based.

  • Tree-sitter is a proper incremental parser. It builds a full, language-aware AST, so it actually understands syntax. In principle it could do more than ctags. But it stops at the syntax level: it doesn’t resolve names. It can tell you “this token is an identifier,” but not “this identifier refers to that declaration over in that file.” That resolution step (name binding, type inference, import resolution) requires a type-checker or a full compiler front-end, which tree-sitter doesn’t do and isn’t meant to do.

So the bottleneck isn’t parsing, it’s name resolution. To know where a symbol actually comes from, you need the full semantic graph of the program, not just its syntax tree. That’s what language servers (via LSP) and indexers (via LSIF/SCIP) provide, sitting on top of a compiler.

Semantic approaches actually run a language-aware indexer, essentially a compiler front-end, that resolves symbols with full type information. The results are compiler-accurate: no false positives, no missed references, cross-repository aware. The catch is that this is expensive to set up. You need a working build environment, a language-specific indexer, and a pipeline to feed the output into wherever you’re querying from.

Sourcegraph built their precise navigation on LSIF. The flow is: a language-specific indexer runs against your codebase and dumps an LSIF file, you push that file to Sourcegraph, Sourcegraph ingests and indexes it, and from that point on you get compiler-accurate navigation across repos.

flowchart TB
    subgraph row1 [" "]
        direction LR
        A[Codebase] --> B[Language-specific indexer]
        B -->|LSIF file| C[Upload to Sourcegraph]
    end
    subgraph row2 [" "]
        direction LR
        D[Sourcegraph backend] -->|Stores index| E[(Database)]
        E --> F[Compiler-accurate code navigation]
    end
    C --> D

This is actually a new workflow for me…

Detour: LSIF Format

LSIF is newline-delimited JSON (one object per line). Every line is either a vertex or an edge, distinguished by a "type" field. Every object also has an "id" (a monotonically-increasing integer) and a "label" (the semantic kind).

{ "id": 7,  "type": "vertex", "label": "resultSet" }
{ "id": 10, "type": "edge",   "label": "next", "outV": 8, "inV": 7 }

Vertices represent things: documents, ranges (spans of code), result hubs, hover content, definition locations, reference sets. Edges connect them using outV (from) and inV / inVs (to). The IDs carry no meaning on their own; they’re just handles for wiring the graph together.

Key vertex types:

| Label | What it represents |
|---|---|
| document | A source file |
| range | A span within a file (start/end line + character) |
| resultSet | A hub shared by all ranges of the same symbol |
| hoverResult | The hover popup content |
| definitionResult | Where a symbol is defined |
| referenceResult | All definition and reference locations |

Key edge types:

| Label | Meaning |
|---|---|
| contains | Document owns ranges; project owns documents |
| next | Range delegates to its resultSet |
| textDocument/hover | resultSet points to hover content |
| textDocument/definition | resultSet points to definition locations |
| textDocument/references | resultSet points to reference set |
| item | Binds a result vertex back to concrete ranges |

The resultSet is the central design idea. Consider a symbol like foo that appears 50 times across a codebase — once as a definition and 49 as references. Naively, you’d attach hover content, definition location, and reference list to each of those 50 ranges individually. That’s a lot of duplication.

Instead, LSIF introduces a resultSet as a shared hub: all 50 ranges point to the same resultSet via a next edge, and the actual results (hover, definition, references) hang off the resultSet once. When you query “what’s the hover for range(19)?”, you follow next to get the resultSet, then follow textDocument/hover to get the result.

It’s a reasonable design for compactness, but it adds an indirection step to every query. And because everything is wired by opaque IDs, you have to load the entire graph into memory to efficiently resolve anything — you can’t jump to a range and read its data directly, you have to chase IDs across the whole file.
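
The indirection and the need to buffer the whole graph can be made concrete with a miniature LSIF reader (Python sketch over a handful of made-up vertices and edges; the real format has many more labels):

```python
import json

# A tiny LSIF dump for `foo`: one document, two ranges, one shared resultSet.
lsif = """\
{"id":4,"type":"vertex","label":"document","uri":"main.ts"}
{"id":7,"type":"vertex","label":"resultSet"}
{"id":8,"type":"vertex","label":"range","start":0,"end":0}
{"id":19,"type":"vertex","label":"range","start":3,"end":3}
{"id":9,"type":"edge","label":"next","outV":8,"inV":7}
{"id":20,"type":"edge","label":"next","outV":19,"inV":7}
{"id":11,"type":"vertex","label":"hoverResult","contents":"function foo(): void"}
{"id":12,"type":"edge","label":"textDocument/hover","outV":7,"inV":11}
"""

# Edges may reference IDs that appear later, so buffer everything first.
vertices, edges = {}, []
for line in lsif.splitlines():
    obj = json.loads(line)
    if obj["type"] == "vertex":
        vertices[obj["id"]] = obj
    else:
        edges.append(obj)

def hover_for(range_id):
    # range --next--> resultSet --textDocument/hover--> hoverResult
    rs = next(e["inV"] for e in edges
              if e["label"] == "next" and e["outV"] == range_id)
    hov = next(e["inV"] for e in edges
               if e["label"] == "textDocument/hover" and e["outV"] == rs)
    return vertices[hov]["contents"]

assert hover_for(19) == "function foo(): void"
assert hover_for(8) == hover_for(19)  # both ranges share the same hub
```

Even this toy query needs two edge scans plus the full vertex map in memory; nothing can be answered from a single line in isolation.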

Challenges of LSIF

Sourcegraph has built dozens of LSIF indexers, from prototype to production grade, covering Go, Java, Scala, Kotlin, TypeScript, and JavaScript.

At scale, that means 45k+ repos with precise navigation enabled and 4k+ LSIF uploads processed per day. At that point, the cracks in the protocol become hard to ignore.

Most of the pain traces back to one design decision: LSIF uses a graph encoding with opaque numeric IDs to connect vertices and edges. Here’s what that looks like for this TypeScript snippet:

// main.ts
function foo(): void {} // line 0: foo is defined here

const x = 1;
foo(); // line 3: foo is referenced here

The LSIF graph for this looks like (notation: vertexLabel(id) where the number is the opaque integer ID assigned by the indexer):

flowchart TB
    classDef docNode fill:#4A90D9,color:#fff,stroke:#2c6fad
    classDef rangeNode fill:#5CB85C,color:#fff,stroke:#3d8b3d
    classDef resultSetNode fill:#F0AD4E,color:#000,stroke:#c87f0a
    classDef resultNode fill:#9B59B6,color:#fff,stroke:#6c3483

    doc["document(4)<br/>main.ts"]:::docNode
    doc -->|contains| r8(["range(8)<br/>line 0: foo definition"]):::rangeNode
    doc -->|contains| r19(["range(19)<br/>line 3: foo reference"]):::rangeNode

    r8 -->|next| rs(("resultSet(7)")):::resultSetNode
    r19 -->|next| rs

    rs -->|textDocument/hover| hov["hoverResult(11)"]:::resultNode
    rs -->|textDocument/definition| def["definitionResult(13)"]:::resultNode
    rs -->|textDocument/references| ref["referenceResult(16)"]:::resultNode

    def -->|item| r8
    ref -->|item definitions| r8
    ref -->|item references| r19

Every connection is expressed as {id, type: "edge", outV: N, inV: M} where N and M are opaque integers. That single choice cascades into a bunch of problems:

  1. No machine-readable schema, no static types:

    • LSIF has no formal schema. Every line is a free-form JSON object, so when writing an indexer or reader you’re working with raw maps and doing manual field access.
  2. Large in-memory data structures: Because edges reference IDs that may not have appeared yet in the file, you can’t process LSIF as a stream. You have to buffer everything first, then resolve references. For a large repo with millions of ranges and edges, this means holding the full graph in memory before you can answer a single query.

flowchart LR
    A["Read line 1<br/>range(8)"] --> B["Read line 2<br/>edge next outV=8 inV=7"]
    B --> C["resultSet(7) not seen yet<br/>must buffer"]
    C --> D["Read line 3<br/>resultSet(7)"]
    D --> E["Now resolve<br/>buffered edges"]

  3. Opaque IDs make debugging painful: When something goes wrong, you’re staring at a wall of numbers with no context. There’s no way to tell what ID 7 or ID 13 represents without mentally tracing the whole graph from the start.
    {"id":14,"type":"edge","label":"textDocument/definition","outV":7,"inV":13}
    {"id":15,"type":"edge","label":"item","outV":13,"inVs":[8],"document":4}
    

What is outV:7? What is inV:13? What range is 8? You have to scroll back through the file and match IDs by hand.

  4. Incremental indexing is hard: IDs are globally monotonically increasing. If you index file A and get IDs 1 to 500, then index file B and get IDs 501 to 1000, you can’t independently re-index just file A later since the new IDs would collide or require renumbering. Merging two separately-generated LSIF outputs is also non-trivial because both start their IDs from 1.

    Index run 1 (full):   range(8) -> resultSet(7) -> definitionResult(13)
    Index run 2 (file A): range(1) -> resultSet(2) -> definitionResult(3)  // ID collision
    

The vibe I get is that the problem of LSIF is that it’s too monolithic & isn’t modular. Local analysis of an LSIF file is not really possible - a global picture is required.
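
A sketch of that merge problem (toy Python with made-up records; real LSIF edges also use inVs lists, which would need the same rewriting): merging two independently produced LSIF dumps requires renumbering every ID in one of them, while string-keyed indexes simply concatenate.

```python
# Both dumps start IDs at 1, so merging means rewriting id/outV/inV in one.
def merge_lsif(dump_a, dump_b):
    offset = max(obj["id"] for obj in dump_a)  # shift B past A's IDs
    rewritten = []
    for obj in dump_b:
        obj = dict(obj, id=obj["id"] + offset)
        for field in ("outV", "inV"):
            if field in obj:
                obj[field] += offset
        rewritten.append(obj)
    return dump_a + rewritten

a = [{"id": 1, "type": "vertex", "label": "range"},
     {"id": 2, "type": "vertex", "label": "resultSet"},
     {"id": 3, "type": "edge", "label": "next", "outV": 1, "inV": 2}]
b = [{"id": 1, "type": "vertex", "label": "range"},
     {"id": 2, "type": "vertex", "label": "resultSet"},
     {"id": 3, "type": "edge", "label": "next", "outV": 1, "inV": 2}]

merged = merge_lsif(a, b)
assert len({obj["id"] for obj in merged}) == 6  # collision-free, but B was rewritten

# SCIP-style: symbol strings are globally unique, so indexes just concatenate.
scip_a = [{"symbol": "npm a 1.0.0 `a.ts`/foo().", "range": [0, 9, 12]}]
scip_b = [{"symbol": "npm b 1.0.0 `b.ts`/bar().", "range": [5, 0, 3]}]
assert len(scip_a + scip_b) == 2  # no renumbering pass needed
```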

SCIP addresses all of this by replacing the graph encoding with a Protobuf schema built around human-readable string IDs for symbols, dropping the concepts of “monikers” and “resultSet” entirely.

SCIP

The SCIP schema is a Protobuf schema. The design is heavily inspired by SemanticDB, a code indexing format that originated in the Scala ecosystem.

Sourcegraph reported a wide range of improvements:

  1. Development speed: Static types from the Protobuf schema give rich editor completions and catch typos at compile time. Human-readable string symbols make debugging much more straightforward. Abstractions like import/export monikers that could silently break navigation in LSIF are gone.
  2. Runtime performance: They saw a 10x speedup in CI when replacing lsif-node with scip-typescript (though not entirely attributable to the protocol itself).
  3. Index size: LSIF indexes are on average 4x larger gzip-compressed and around 5x larger uncompressed compared to equivalent SCIP payloads.
  4. Testability: Sourcegraph built a snapshot testing utility on top of SCIP that is reused across indexers. Snapshot testing with LSIF was painful by comparison.

Don Stewart from Meta integrated SCIP with Glean (Meta’s system for collecting and querying facts about code) and found SCIP to be 8x smaller and processable 3x faster than LSIF. The mapping into Glean was around 550 lines of code vs. 1500 for LSIF.

SCIP also unblocks use cases that LSIF struggled with:

  1. Incremental indexing: Because symbols are identified by human-readable strings rather than global IDs, re-indexing only the changed files becomes feasible. Users would wait less for precise navigation to become available after a push.

  2. Cross-language navigation: Navigating between, say, a Protobuf definition and its generated Java or Go bindings becomes possible. This kind of cross-language linking was essentially out of reach with LSIF.

Detour: SCIP Format

Sources:

  • https://github.com/sourcegraph/scip/blob/main/scip.proto
  • https://github.com/sourcegraph/scip/blob/main/docs/DESIGN.md
  • https://github.com/sourcegraph/scip-typescript/tree/main/snapshots/output

For the same foo example, here’s what SCIP looks like (shown as JSON, actually Protobuf binary):

{
  "metadata": {
    "toolInfo": { "name": "scip-typescript" },
    "projectRoot": "file:///project"
  },
  "documents": [
    {
      "language": "typescript",
      "relativePath": "src/main.ts",
      "occurrences": [
        {
          "range": [0, 9, 12],
          "symbol": "scip-typescript npm . . src/`main.ts`/foo().",
          "symbolRoles": 1
        },
        {
          "range": [3, 2, 5],
          "symbol": "scip-typescript npm . . src/`main.ts`/foo().",
          "symbolRoles": 0
        }
      ],
      "symbols": [
        {
          "symbol": "scip-typescript npm . . src/`main.ts`/foo().",
          "documentation": ["```ts\nfunction foo(): void\n```"],
          "kind": "Function"
        }
      ]
    }
  ]
}

Two occurrences, one symbol entry. No graph, no edges, no resultSet. The same LSIF equivalent required a document vertex, two range vertices, a resultSet, a hoverResult, a definitionResult, a referenceResult, and six edges.

Takeaway: So a SCIP index is just a locally-processable form of each file.

Cross-file and cross-repo references work the same way: any occurrence in any document carrying the same symbol string is linked, purely by string equality. The consumer builds a hashtable at index load time.

Cross-file symbol resolution

There are no cross-file edges or pointers. It’s just a hashtable.

When the consumer loads all documents, it builds a map:

symbol string -> list of (document, range, roles)

Every occurrence in every document gets inserted into this map by its symbol string. “Go to definition” is a lookup: find all entries for this symbol string where roles & Definition != 0. “Find references” is the same lookup without the role filter.

So for foo defined in a.ts and referenced in b.ts:

# a.ts indexer emits:
{ range: [0, 9, 12], symbol: "...`a.ts`/foo().", symbolRoles: 1 }  // Definition

# b.ts indexer emits:
{ range: [5, 0, 3],  symbol: "...`a.ts`/foo().", symbolRoles: 0 }  // Reference

Both carry the same symbol string. The consumer puts both into the map. When you hover foo in b.ts, it looks up "...`a.ts`/foo().", finds the definition entry in a.ts, and navigates there.

Cross-repo works identically. The <package> component (npm @example/a 1.0.0) makes the string globally unique across repos. Two indexes from different repos get loaded into the same map, and the lookup is no different.
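The whole resolution scheme above fits in a few lines. This is a minimal sketch (my own function names, not the reference SCIP tooling; the 0x1 Definition bit matches SCIP's SymbolRole enum) of building the symbol map and answering both queries by string equality:

```python
# Sketch: cross-file "go to definition" / "find references" over SCIP-style
# documents, resolved purely by symbol-string equality.
from collections import defaultdict

DEFINITION_ROLE = 0x1  # SCIP SymbolRole.Definition bit


def build_symbol_map(documents):
    """Map: symbol string -> list of (relativePath, range, symbolRoles)."""
    index = defaultdict(list)
    for doc in documents:
        for occ in doc["occurrences"]:
            index[occ["symbol"]].append(
                (doc["relativePath"], occ["range"], occ["symbolRoles"])
            )
    return index


def go_to_definition(index, symbol):
    # Keep only occurrences whose role bitmask has the Definition bit set.
    return [(path, rng) for path, rng, roles in index[symbol]
            if roles & DEFINITION_ROLE]


def find_references(index, symbol):
    # Same lookup, no role filter.
    return [(path, rng) for path, rng, _ in index[symbol]]


# The a.ts / b.ts example from above:
docs = [
    {"relativePath": "a.ts",
     "occurrences": [{"range": [0, 9, 12],
                      "symbol": "...`a.ts`/foo().", "symbolRoles": 1}]},
    {"relativePath": "b.ts",
     "occurrences": [{"range": [5, 0, 3],
                      "symbol": "...`a.ts`/foo().", "symbolRoles": 0}]},
]
index = build_symbol_map(docs)
print(go_to_definition(index, "...`a.ts`/foo()."))       # [('a.ts', [0, 9, 12])]
print(len(find_references(index, "...`a.ts`/foo().")))   # 2
```

Loading a second repo's index just means appending its occurrences to the same map; since the package component makes symbol strings globally unique, nothing else changes.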

Personal Takeaways

An indexing format should:

  • Be human-readable, for debuggability.
  • Allow for local reasoning.
  • Allow for incrementalization.

[blog] [overview] Code Navigation for AI SWEs: What We’ve Learned So Far

Source: engines.dev

Link: https://www.engines.dev/blog/code-navigation

Keywords:

  • Code navigation
  • LSP

Takeaways:

I think:

  • Stack Graphs.
  • Glean.
  • SCIP.

are the ones that are worth looking into.

I want to solve the problem at the API/indexing format rather than wrapping existing binaries.

AI SWE is probably "AI software engineer": AI acting as a software engineer, not an engineer working on AI.

Introduction

The core framing: AI agents need code navigation the same way humans do. You can’t just throw files at a model and hope it figures out where things are.

The question is how to help AI SWE navigate a codebase.

Existing Approaches

The field is split on how to solve this:

  • SWE-Agent: brute-force string search across the entire codebase, then shows the LLM 100 lines at a time through a file viewer.

    Basically, a naive heuristic where you grep through the codebase and show the matched lines in their surrounding context.

  • CodeMonkey: passes all files into a small LLM and ranks them by relevance.

    Basically, delegate to a narrower but faster LLM to act like a “semantic search” so that the bigger LLM can work more efficiently.

  • Moatless: semantic search to identify relevant files.

    Semantic search is a little vague here. Does it mean Moatless performs semantic analysis?

  • OpenHands / Engines: expose navigation as explicit agent tools, such as find all references and go to definition.

    Basically my idea: focus on the tools first; then AI is just an integration.

The last approach maps cleanly onto what a language server already provides. The question is which backend to use.

Ideal system

  • Scalable: index at most once per commit.
  • Incremental: re-index only changed files on new commits.
  • Flexible: navigate any arbitrary commit hash.

No existing open-source system hits all three.

Systems Evaluated

  • lsproxy: auto-detects the language, then spins up the right LSP server.

  • Stack Graphs (GitHub):

    • An advanced data structure that powers GitHub’s web-based code navigation.
    • Desirable properties:
      • Incremental.
      • Theoretically language-agnostic.
      • Fast queries via path tracing on a prebuilt graph.
    • Under the hood, uses tree-sitter + custom .tsg grammar files per language to build the graph. Maintaining .tsg files is tiring.

The Stack Graphs architecture is the right idea: incremental, no running server required, commit-aware. The .tsg maintenance burden is the only real blocker.

  • Glean (Meta):
    • Stores arbitrary facts about a codebase (user-defined schema + Angle query language).
    • Desirable properties:
      • Incremental: Only re-indexes new changes.
      • Scalable: Still works on Meta’s massive C++ codebase.
      • Flexible: Supports many queries beyond simple code navigation.
      • Navigates arbitrary commits: Facts can be associated with any commit.
      • Can ingest SCIP as a schema, so it covers a wide range of languages.
    • Drawbacks:
      • Glean uses Thrift for its RPC protocol. The Thrift definition Glean uses relies on Meta’s internal Thrift client.

Basically, Glean is the right direction; there are just some issues with the open-source release.

  • multilspy: Python wrapper over an LSP server.
    • Easy to set up.
    • Supports multiple languages.
    • Drawbacks:
      • A bit bloated, as it downloads language server binaries.
      • The LSP server’s lifetime is tied to the Python process.

Sourcegraph: offers a precise code navigation API built on SCIP. However, the service itself is not open source.

Their Solution

They concluded that they want:

  • The language-agnostic approach of Glean/Stack Graphs.
  • The usability of lsproxy.

Their first step, though, was wrapping multilspy:

  • A server component with MCP support.
  • A Docker container for sandboxing.

They are solving the problem at a totally different level: a ready-to-use binary, rather than an API/indexing format. I want to research the solution at the Glean/SCIP level first.


[tool] [tool-dive] SCIP

Source: Sourcegraph

Link: https://github.com/sourcegraph/scip

[tool] [tool-dive] Glean

Source: Meta

Link: https://github.com/facebookincubator/Glean

[tool] [tool-dive] Google Kythe

Source: Wikipedia

Link: https://en.wikipedia.org/wiki/Google_Kythe

[tool] [tool-dive] OpenGrok

Source: GitHub

Link: https://github.com/oracle/opengrok

[tool] [tool-dive] livegrep

Source: GitHub

Link: https://github.com/livegrep/livegrep

Mycroft

Code Querying

Code querying is about searching and pattern-matching over code in a structured, expressive way. This covers everything from basic text search (ripgrep) to AST-level structural search (ast-grep, Semgrep) to full Datalog-based code-as-data query engines (CodeQL, Soufflé).

The key design question here is the tradeoff between expressiveness, speed, and ease of use.

[blog] [overview] Design Space for Code Search Query

Source: ast-grep blog

Link: https://ast-grep.github.io/blog/code-search-design-space.html

[overview] [overview] Comparison With Other Frameworks

Source: ast-grep docs

Link: https://ast-grep.github.io/advanced/tool-comparison.html

[overview] [overview] Best Code Search Tools for Developers in 2024

Source: debugg.ai

Link: https://debugg.ai/resources/best-code-search-tools-for-developers-2024-navigate-understand-refactor

[paper] [seminal] codeQuest

Source: Springer

Link: https://link.springer.com/chapter/10.1007/11785477_2

[paper] [seminal] Declarative Static Analysis with CodeQL

Source: Wiley

Link: https://onlinelibrary.wiley.com/doi/abs/10.1002/spe.3199

[paper] [trend] Incrementalizing Production CodeQL Analyses

Source: ACM (ESEC/FSE 2023)

Link: https://dl.acm.org/doi/10.1145/3611643.3613860

[paper] [trend] CodeFuse-Query

Source: arXiv

Link: https://arxiv.org/html/2401.01571v1

[tool] [tool-dive] ripgrep

Source: GitHub

Link: https://github.com/BurntSushi/ripgrep

[tool] [tool-dive] ast-grep

Source: ast-grep

Link: https://ast-grep.github.io

[tool] [tool-dive] Semgrep

Source: GitHub

Link: https://github.com/semgrep/semgrep

[blog] [tool-dive] Semgrep: A Static Analysis Journey

Source: Semgrep blog

Link: https://semgrep.dev/blog/2021/semgrep-a-static-analysis-journey/

[tool] [tool-dive] GritQL

Source: grit.io

Link: https://docs.grit.io

[tool] [tool-dive] CodeQL

Source: GitHub

Link: https://codeql.github.com

[tool] [tool-dive] Soufflé

Source: Soufflé

Link: https://souffle-lang.github.io

Code Refactoring

Code refactoring covers applying transformations to a codebase: renaming symbols, reorganizing structure, rewriting patterns. The challenge is doing this correctly and safely, without leaving the code in a malformed state.

This is one of the core pain points that motivated Loupe. AI reaches for grep and sed, which is imprecise and error-prone. Proper refactoring requires understanding the code structurally, not just textually.

[paper] [seminal] Refactoring-Aware AST Differencing

Source: Concordia University (TOSEM 2024)

Link: https://users.encs.concordia.ca/~nikolaos/publications/TOSEM_2024.pdf

[paper] [trend] PyCraft

Source: University of Colorado (FSE 2024)

Link: https://danny.cs.colorado.edu/papers/PyCraft_FSE2024.pdf

[tool] [tool-dive] Comby

Source: comby.dev

Link: https://comby.dev

Quality Assurance

Quality assurance covers detecting code smells, bad patterns, and potential bugs. This is where static analysis meets practical tooling. The resources here range from empirical benchmarks of existing tools to ML-based detection methods.

[paper] [overview] Static Analysis Tools for Secure Code Review

Source: arXiv

Link: https://arxiv.org/pdf/2407.12241

[paper] [trend] LLift

Source: OOPSLA 2024 / UC Riverside

Link: https://www.cs.ucr.edu/~zhiyunq/pub/oopsla24_llift.pdf