`[tool]` `[tool-dive]` Tree-sitter Concepts

Source: tree-sitter

Link: https://tree-sitter.github.io

Takeaways

Design for IDEs: Tree-sitter prioritizes fast incremental re-parsing and token position fidelity over theoretical parser properties.
CST design balances detail and complexity by preserving all tokens while using named nodes (grammar rules) vs. anonymous nodes (inline strings).
Incremental parsing only re-parses edited ranges by reusing unchanged subtrees, critical for real-time editor feedback.
GLR explores all parse branches at runtime and scores with dynamic precedence, handling true ambiguities without grammar restrictions.
Queries separate concerns: Tree navigation is independent of grammar design, enabling services to evolve without grammar changes.

Abstract

This section covers tree-sitter’s architecture, design principles, and core concepts.

Concepts (this article): Design philosophy, architecture, IDE optimization.
Grammar (Guideline & Syntax): Author grammars with design principles and DSL reference.
Query System (Queries): Pattern matching for data extraction and tree navigation.
Services (Services): Syntax highlighting, code navigation, language injection.

Introduction

Tree-sitter is:

A parser generator tool, taking a grammar described in JS and yielding an incremental parsing library in C.
Runtime in various languages to load the generated C library.

The Context

Covered in the Strange Loop talk.

Roughly:

Before tree-sitter, most IDE (vscode, sublime) features (such as syntax-highlighting, code folding) were implemented in terms of regexes.
This caused a lot of problems:
- Inconsistent highlighting: The same type of tokens were not in the same color in some edge cases (weird formatting convention) or even in some pretty common case.
- Regexes highlighting can easily break, such as when inspecting the bundled JS.
- Slow (at least slower than tree-sitter).
So, tree-sitter was created to addressed these issues, reporting:
- Faster processing of source code.
- Consistent highlighting.
- Enable many DX features.
- Unified interface for toolings.

Design Philosophy

Some of the design philosophy of tree-sitter can be found in the Strange Loop talk:

Arbitrary analysis based on syntax tree.
In-process incremental parsing.
No dependencies.

On the tree-sitter homepage, tree-sitter aims to, quoted in verbatim, be:

General enough to parse any programming language
Fast enough to parse on every keystroke in a text editor
Robust enough to provide useful results even in the presence of syntax errors
Dependency-free so that the runtime library (which is written in pure C11) can be embedded in any application

Additional source: https://zed.dev/blog/syntax-aware-editing.

IDE-centric design:

Tree-sitter was designed with IDE-centric workflow in mind.
Concrete syntax trees retain all token locations to support many editor features.
Flexible queries match structural patterns independently from grammar design.
Grammar + query set = language support without custom traversal code.

This aligns with Loupe’s goal: syntactic code analysis and querying tool.

graph TB
    A["Grammar.js<br/>User-defined rules"] -->|tree-sitter generate| B["C Parser<br/>Compiled from grammar"]
    B --> C["TSParser<br/>Execution engine"]

    C -->|ts_parser_parse_string| D["TSTree<br/>Concrete Syntax Tree"]
    D --> E["TSNode<br/>Syntax tree element"]

    F["Tree Queries<br/>Pattern matching"] -->|ts_query_cursor_exec| E

    G["Edit API<br/>ts_tree_edit"] -->|Incremental| C

    H["Language Binding<br/>Rust/JS/Python"] -->|Load| B

    style A fill:#e1f5ff
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#e8f5e9
    style F fill:#fce4ec
    style G fill:#fff9c4

Practical Use Cases

Source: https://zed.dev/blog/syntax-aware-editing.

These features in Zed are implemented using tree-sitter queries:

Syntax highlighting (example query from their blog):

["do" "for" "while"] @keyword

(function
name: (identifier) @function)

(pair
key: (property_identifier) @function.method
value: [(function) (arrow_function)])

Symbol outlines (example query from their blog) - fuzzy search symbols within another symbol/context:

(impl_item
  "impl" @context
  trait: (_)? @name
  "for"? @context
  type: (_) @name) @item

(function_item
  (visibility_modifier)? @context
  (function_modifiers)? @context
  "fn" @context
  name: (_) @name) @item

Auto-indentation (example query from their blog):

(statement_block "}" @end) @indent

[
  (assignment_expression)
  (member_expression)
  (if_statement)
] @indent

Language injection (example query from their blog) - sometimes, inside a source code in a language comes another language (e.g. HTML & JS or Rust macro):
```
(script_element
  (raw_text) @content
  (#set! "language" "javascript"))
```
Kind of look like Lexer mode in ANTLR.

User-facing API

Tree-sitter’s API follows a layered approach:

Core APIs handle efficient parsing & a unified AST representation.
High-level APIs provide convenience features.

Four main object types:

Languages (TSLanguage in the C API).
- An (opaque) object representing the programming language parsing logic/rules (not the current parsing states).
- The implementations for specific Language objects are generated by tree-sitter.
Parsers (TSParser in the C API).
- The language parser itself, which can be assigned a language object.
- Can accept some source code and produce a TSTree.
Syntax trees (TSTree in the C API).
- The syntax tree of a piece of a source code.
- Composed of TSNodes.
- Can be editted to produce a new TSTree.
Syntax nodes (TSNode in the C API).
- A single node in the syntax tree.
- Tracks start and end positions & relation to other nodes.

Example flow using a generated parser in C:

Create a TSParser instance.
Set the TSParser with a TSLanguage.
Trigger the parser to produce a TSTree.
Query the TSTree and retrieve TSNodes.

Example program:

#include <stdio.h>
#include <string.h>
#include <tree_sitter/api.h>

// **Language**
// Declare the external grammar function. This is generated by Tree-sitter
// and contains all the syntax rules for JSON.
const TSLanguage *tree_sitter_json(void);

int main() {
    // The source code we want to parse
    const char *source_code = "{\"name\": \"Gemini\"}";
    uint32_t source_len = strlen(source_code);

    // **Parser**
    // Create the parser and equip it with the JSON language grammar
    TSParser *parser = ts_parser_new(); // Dynamically allocated
    ts_parser_set_language(parser, tree_sitter_json());

    // **Parse tree**
    // Instruct the parser to read the string and build the entire tree.
    TSTree *tree = ts_parser_parse_string(
        parser,
        NULL,           // Previous tree (used for incremental parsing, NULL for first time)
        source_code,    // The actual code
        source_len      // Length of the code
    );

    // **Tree node**
    // Extract the absolute top node of the tree to begin our traversal.
    TSNode root_node = ts_tree_root_node(tree);


    // Cleanup memory
    ts_tree_delete(tree);
    ts_parser_delete(parser);

    return 0;
}

Parser API

TSParser accepts input via: string buffer (const * string with uint32_t length) or custom TSInput interface.

ts_parser_parse_string: Accept string buffer with incremental parsing support.

TSTree *ts_parser_parse_string (
  TSParser *self,
  const TSTree *old_tree, // possibly for incremental parsing like in the Strange Loop talk
  const char *string,
  uint32_t length
)

ts_parser_parse: Accept a custom data structure consumer:

TSTree *ts_parser_parse(
 TSParser *self,
 const TSTree *old_tree, // possibly for incremental parsing like in the Strange Loop talk
 TSInput input
);

Syntax Nodes (& Trees)

DOM-style interface for inspection:

Position APIs:

ts_node_type: Node type identifier corresponding to grammar rule.
ts_node_start_byte/ts_node_end_byte: Raw byte offsets.
ts_node_start_point/ts_node_end_point: Zero-based row/column coordinates. Only \n is newline separator.

Tree navigation:

ts_tree_root_node: Get root node from tree.
ts_node_child: Access children via array index.
ts_node_next_sibling/ts_node_prev_sibling: Navigate siblings (may return null node).
ts_node_parent: Navigate to parent node.

Named and Anonymous Nodes

Problem: CSTs preserve all tokens, creating deeply nested trees tedious to traverse.

Solution: Distinguish two node types for selective AST inclusion:

Named nodes: Explicit grammar rules (e.g., if_statement, expression). Always appear in parse tree.
Anonymous nodes: Inline strings/regexes (e.g., "if", "(", ";"). Hidden from parse tree by default.

if_statement: $ => seq("if", "(", $._expression, ")", $._statement);
^^^^^^^^^^^^           ^^^   ^^^                 ^^^
named node        anonymous nodes (hidden from AST)

This distinction enables unambiguous, detailed grammars while keeping trees manageable.

graph TD
    A["if_statement<br/>named node"] -->|child| B["_expression<br/>named node<br/>hidden via _prefix"]
    A -->|not in AST| C["'if'<br/>anonymous"]
    A -->|not in AST| D["'('<br/>anonymous"]
    A -->|not in AST| E["')'<br/>anonymous"]
    A -->|child| F["_statement<br/>named node<br/>hidden via _prefix"]

Node Field Names

Access child nodes:

By index: Default approach via children array.
By field name: More convenient when field names are known.

Like ANTLR, supports both indexed and named accesses.

TSNode ts_node_child_by_field_name(
  TSNode self,
  const char *field_name,    // Unique field name from grammar
  uint32_t field_name_length
);

Field name lookup optimization:

String comparisons are repeated and inefficient.
Tree-sitter interns field names into TSFieldId for efficient lookup.

uint32_t ts_language_field_count(const TSLanguage *);
const char *ts_language_field_name_for_id(const TSLanguage *, TSFieldId);
TSFieldId ts_language_field_id_for_name(const TSLanguage *, const char *, uint32_t);

Tree Edit

Problem: Source edits invalidate tree byte positions. Reparsing entire file is wasteful.

Solution: Two-step incremental update:

Edit tree structure: Tell tree-sitter what changed (old/new byte range, position delta).

typedef struct {
  uint32_t start_byte;       // Edit start
  uint32_t old_end_byte;     // Old end position
  uint32_t new_end_byte;     // New end position
  TSPoint start_point;       // Start as row/column
  TSPoint old_end_point;     // Old end as row/column
  TSPoint new_end_point;     // New end as row/column
} TSInputEdit;

void ts_tree_edit(TSTree *, const TSInputEdit *);

Reparse with old tree as context:

TSTree *new_tree = ts_parser_parse_string(
  parser,
  old_tree,           // Enable incremental parsing
  updated_source,
  new_length
);

Rough flow:

Source change.
Invoke ts_tree_edit to update the tree’s range
Invoke ts_parser_parse with the old tree.
New tree with reused structure.

Why do you need to call ts_tree_edit first? Wouldn’t call re-parse with the old tree suffice? My guess is that it would be more efficient incrementally if tree-sitter knows the editted position from the get-go.

Warning!

Old extracted TSNode references become invalid after reparse. These will need to be updated via ts_node_edit() or fresh nodes can be refetched instead.

My theory is that TSNode does not contain source text information? So, the edit API does not convey information about the new source.

Language Injection Support

Many documents contain multiple languages (e.g., HTML with embedded JS, Bash with heredocs).

Tree-sitter stitches disjointed segments as one unified syntax tree.

typedef struct {
  TSPoint start_point;
  TSPoint end_point;
  uint32_t start_byte;
  uint32_t end_byte;
} TSRange;

void ts_parser_set_included_ranges(
  TSParser *self,
  const TSRange *ranges,
  uint32_t range_count
);

Example: ERB document with embedded HTML and Ruby.

<ul>
  <% people.each do |person| %>
  <li><%= person.name %></li>
  <% end %>
</ul>

Steps:

Parse entire document with root language (ERB).
Identify nodes with embedded languages in AST.
Extract boundaries into separate TSRange arrays.
Switch parser language via ts_parser_set_language.
Feed ranges to parser via ts_parser_set_included_ranges.
Re-parse to generate isolated syntax trees per language.

Concurrency Support

TSTree instances are not thread-safe. However, they can be cheaply copied via ts_tree_copy(TSTree *):

TSTrees are immutable.
An atomic counter is incremented to keep track of completely freed TSTrees.

(Efficient) Tree Traversal

The problem with the standard node fetching APIs (index/named access):

Stateless.
Therefore, repeated accesses would need to recalculate positions from scratch.
Suitable for occasional accesses.

For bulk traversal, use tree cursors.

ts_tree_cursor_new(TSNode): Create cursor from node.
ts_tree_cursor_goto_first_child: Move to first child.
ts_tree_cursor_goto_next_sibling: Move to next sibling.
ts_tree_cursor_goto_parent: Move to parent.
ts_tree_cursor_current_node: Get full node (triggers lazy evaluation).
ts_tree_cursor_current_field_name: Get field name.
ts_tree_cursor_current_field_id: Get field ID.

// 1. Initialize the cursor (this node becomes the impenetrable boundary)
TSTreeCursor cursor = ts_tree_cursor_new(start_node);

// 2. Traversal Flow
// Try moving down into the children (O(1) operation)
if (ts_tree_cursor_goto_first_child(&cursor)) {

    // 3. Extract lightweight data directly
    const char *field = ts_tree_cursor_current_field_name(&cursor);

    // 4. Extract the full node struct only if actually needed
    TSNode current = ts_tree_cursor_current_node(&cursor);

    // 5. Move horizontally (O(1) operation)
    if (ts_tree_cursor_goto_next_sibling(&cursor)) {
        // Evaluates the next sibling
    }

    // 6. Move back up the tree
    ts_tree_cursor_goto_parent(&cursor);
}

Resources for the Design & Implementation

General design: tree-sitter doc
Parsing functionality:
- tree_sitter/api.h header file
- Other language bindings similar to tree_sitter/api.h, especially the official ones.

Keyboard shortcuts

Loupe