Tree-sitter Query System
Source: tree-sitter
Link: https://tree-sitter.github.io
Takeaways
- Separate grammar from navigation: Queries let applications extract structure without modifying the grammar, keeping the grammar focused on parsing and semantics.
- Declarative patterns are more maintainable and portable than imperative traversal code written in each binding language.
- Portability: The same .scm query file works across all language bindings and tools for standardized highlighting and navigation.
- Architecture: The C core compiles queries and matches patterns; language bindings evaluate predicates and directives. This splits structural matching (fast, in C) from filtering (flexible, in bindings).
Quick Reference
| Pattern | Purpose | Example |
|---|---|---|
| Capture | Extract a matched node | (identifier) @name |
| Wildcard | Match any node | (_) matches any named node |
| Field name | Match by field | left: (identifier) matches specific child |
| String literal | Match anonymous node | "if" or "(" |
| Negation | Must not have field | !type_parameters |
| Quantifier | Repetition | (arg)* or (arg)+ |
| Alternation | Match one of several | ["if" "while" "for"] |
| Anchor | Position constraint | . before/after/between children |
| Predicate | Conditional filter | (#eq? @name "foo") |
| Directive | Metadata attachment | (#set! language "javascript") |
Syntax
- Query: A series of patterns used to extract specific structures from source code.
- Pattern: An S-expression that matches a certain set of syntax nodes.
Pattern’s structure:
(<node-type> ...<children-patterns>)
For example:
(binary_expression (number_literal) (number_literal))
would match a binary expression with two children of type number literal.
Children can be omitted; the remaining child patterns only describe the expected order of the children.
(<node-type> <child_1> <child_2>)
;            ^^^^^^^^^ ^^^^^^^^^
; <child_1> only needs to precede <child_2>; other nodes may appear in between
For example:
(binary_expression (number_literal))
would match any binary expression with a number literal as its child.
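The ordering rule can be illustrated with a toy model that is independent of the Tree-sitter library: treat a node's children as a list of type names and the child patterns as a second list, and accept when the patterns appear among the children in the same relative order. The function name children_match_in_order and the string representation are assumptions made for illustration; this is a sketch of the rule, not Tree-sitter's matching algorithm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model (hypothetical helper, not part of Tree-sitter): returns true
 * when every pattern type appears among the children in the same relative
 * order. Other children may sit before, between, or after the matches. */
bool children_match_in_order(const char *children[], size_t child_count,
                             const char *patterns[], size_t pattern_count) {
    assert(children != NULL || child_count == 0);
    assert(patterns != NULL || pattern_count == 0);

    size_t next = 0; /* index of the next pattern still to be found */
    for (size_t i = 0; i < child_count && next < pattern_count; i++) {
        if (strcmp(children[i], patterns[next]) == 0) {
            next++;
        }
    }
    return next == pattern_count;
}
```

Under this model, a binary_expression with children number_literal, "+", number_literal satisfies both the two-child pattern and the one-child pattern shown above.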
Fields: Child patterns can be prefixed with the field name of the child they should match. This is recommended because it makes patterns more specific:
(assignment_expression
  left: (member_expression
    object: (call_expression)))
Here left is a field of assignment_expression and object is a field of member_expression, so this query matches every assignment_expression whose left child is a member_expression whose object is a call_expression.
Negated fields: A field name prefixed with ! matches only nodes that do not have that field.
(class_declaration
name: (identifier) @class_name
!type_parameters)
Anonymous nodes: The parenthesized form (<node-type>) only applies to named nodes. To match an anonymous node, write its textual content in double quotes.
(binary_expression
operator: "!="
right: (null))
Wildcard nodes: _ and (_) match any node. The only difference is that (_) matches any named node, while _ matches any named or anonymous node.
(call (_) @call.inner)
;     ^^^
; any named node inside a call
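ERROR nodes: When the parser encounters text it does not recognize, it represents it as an ERROR node in the syntax tree. These nodes can be queried like any other node.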
(ERROR) @error-node
MISSING nodes: When the parser recovers from an error by inserting an expected token that is absent from the source, the inserted node is marked as MISSING in the syntax tree.
- To recover from syntax errors, the parser inserts zero-width “phantom” tokens (like a forgotten semicolon) to keep the syntax tree valid.
- Instead of allocating a completely new, heavyweight structural node in memory, it simply attaches an internal “missing” flag to that generated token.
- This approach saves memory and processing power because the token occupies zero physical characters in your source code.
Note: this part is a bit confusing, because the Tree-sitter documentation seems to use "node" and "token" interchangeably here.
(MISSING) @missing-node
(MISSING identifier) @missing-identifier
(MISSING ";") @missing-semicolon
Supertype nodes: A supertype in a query matches any of its subtypes.
A supertype here is roughly like a supertype in OOP: a grammar rule that can specialize into multiple sub-rules. The supertype itself does not appear as a standalone node in the syntax tree; it behaves more like a union type.
(expression) @any-expression
; ^^^^^^^^^
; match any subtypes of `expression`
Match specific subtypes (named or anonymous):
(expression/binary_expression) @binary-expression
(expression/"()") @empty-expression
Operators
The @ operator: Specific nodes in a pattern can be extracted by suffixing the node with a capture name.
(assignment_expression
left: (identifier) @the-function-name
; ^^^^^^^^^^^^^^^^^
right: (function))
The quantification operators +, * and ?: Similar to those in regular expressions. + matches one or more repetitions, * matches zero or more, and ? marks a pattern as optional.
Sequence group:
- The sequence group () creates an anonymous block of sibling nodes.
- Its primary use case is grouping nodes so you can apply quantifiers (*, +, ?) to an entire repeating sequence.
- Example: Matching a repeating comma-separated list using ("," (identifier))*.
- It does not enforce strict adjacency; undeclared nodes can still appear between the grouped items.
- Use () to repeat a pattern, and use . (the anchor operator, see below) to glue nodes tightly together.
Alternation operator []: Similar to character classes in regular expressions.
[
"break"
"delete"
"else"
"for"
"function"
"if"
"return"
"try"
"while"
] @keyword
The alternants can be quantified.
Anchor operator .: Used to constrain the ways in which child patterns are matched, with different behaviors based on its position in a query.
- Before the first child within the parent pattern: That child is only matched when it’s the first named node of the parent.
(array . (identifier) @the-element)
; ^^^^^^^^^^
; must be the first
Without the anchor above, the pattern would match once for every identifier in the array, with @the-element bound to a different identifier in each match.
- After the last child within the parent pattern: That child is only matched when it's the last named node of the parent.
- Between two child patterns: The two children are only matched when they are immediate siblings.
(dotted_name
(identifier) @prev-id
.
(identifier) @next-id)
Note that the constraints imposed by an anchor ignore anonymous nodes.
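The effect of an anchor between two child patterns can be sketched with a toy model of a node's children. The ToyChild type and both counting functions are hypothetical names invented for this illustration, not Tree-sitter APIs; the sketch only mirrors the rules described above (immediate siblings, anonymous nodes skipped).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a child node: its type and whether it is named. */
typedef struct {
    const char *type; /* e.g. "identifier" or "." */
    bool named;       /* false for anonymous nodes such as punctuation */
} ToyChild;

/* Anchored: counts (first . second) pairs that are immediate siblings
 * once anonymous children are skipped. */
size_t count_anchored_pairs(const ToyChild *children, size_t count,
                            const char *first, const char *second) {
    assert(children != NULL || count == 0);
    size_t pairs = 0;
    const ToyChild *prev = NULL; /* most recent named child seen so far */
    for (size_t i = 0; i < count; i++) {
        if (!children[i].named) continue; /* anchors ignore anonymous nodes */
        if (prev != NULL && strcmp(prev->type, first) == 0 &&
            strcmp(children[i].type, second) == 0) {
            pairs++;
        }
        prev = &children[i];
    }
    return pairs;
}

/* Unanchored: counts every ordered pair, adjacent or not. */
size_t count_unanchored_pairs(const ToyChild *children, size_t count,
                              const char *first, const char *second) {
    assert(children != NULL || count == 0);
    size_t pairs = 0;
    for (size_t i = 0; i < count; i++) {
        if (!children[i].named || strcmp(children[i].type, first) != 0) continue;
        for (size_t j = i + 1; j < count; j++) {
            if (children[j].named && strcmp(children[j].type, second) == 0) {
                pairs++;
            }
        }
    }
    return pairs;
}
```

For a dotted name such as a.b.c, the anchored version counts the pairs (a, b) and (b, c), while the unanchored version also counts (a, c).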
Predicates
The predicate syntax #...?: Add arbitrary conditions or metadata to a pattern. Predicate names start with # and end with ?, and can accept captures or strings as arguments.
The eq? predicate: Compares a capture’s text against a string or another capture.
- Includes variants: #not-eq?, #any-eq?, #any-not-eq?.
- By default, quantified captures require all matched nodes to pass the predicate. The any- prefix overrides this to match if at least one node passes.
((identifier) @variable.builtin
(#eq? @variable.builtin "self"))
(
(pair
key: (property_identifier) @key-name
value: (identifier) @value-name)
(#eq? @key-name @value-name)
)
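The difference between the default behaviour and the any- prefix on a quantified capture can be sketched as two small functions over the captured texts. This is a toy model under stated assumptions: eq_all and eq_any are hypothetical names, the captured text is assumed to be available as C strings, and the result for zero captured nodes is a choice made by this sketch rather than documented binding behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Models #eq? on a quantified capture: every captured text must match.
 * With zero captures this sketch returns true (an assumption). */
bool eq_all(const char *texts[], size_t count, const char *expected) {
    assert(expected != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(texts[i], expected) != 0) return false;
    }
    return true;
}

/* Models #any-eq?: one matching captured text is enough. */
bool eq_any(const char *texts[], size_t count, const char *expected) {
    assert(expected != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strcmp(texts[i], expected) == 0) return true;
    }
    return false;
}
```

With captured texts "self" and "this", eq_all against "self" fails while eq_any succeeds, which is the distinction the any- prefix introduces.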
The match? predicate: Matches a capture’s text against a regular expression.
- Also supports the not- and any- prefixes.
((identifier) @constant
(#match? @constant "^[A-Z][A-Z_]+"))
The any-of? predicate: Checks if a capture’s text strictly equals any string in a provided list.
((identifier) @variable.builtin
(#any-of? @variable.builtin
"arguments"
"module"
"console"))
The is? predicate: Asserts that a capture has a specific internal property (e.g., using #is-not? local to ensure a matched variable is not a local variable).
Directives
The directive syntax #...!: Associates arbitrary metadata with a pattern or alters its output. Directives function similarly to predicates but end with ! instead of ?.
The set! directive: Associates arbitrary key-value metadata pairs with a matched pattern (commonly used for language injections).
((comment) @injection.content
(#match? @injection.content "/[*\/][!*\/]<?[^a-zA-Z]")
(#set! injection.language "doxygen"))
The select-adjacent! directive: Takes two capture names and filters the first capture’s text so that only nodes adjacent to the second capture are preserved.
The strip! directive: Takes a capture and a regular expression, removing any matched text from that capture’s final output.
Note
- The core Tree-sitter C library does not execute predicates or directives.
- It simply extracts them from your query and passes them along as raw metadata.
- It is entirely up to the higher-level bindings (like Rust, Python, JS, WASM) to read that metadata and actually run the filtering logic (like evaluating #eq?).
So, if a predicate isn't working, it might be because the specific language binding you are using hasn't implemented support for it yet, not necessarily because your query syntax is wrong.
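To make the division of labour concrete, here is a minimal sketch of what a binding might do with the raw predicate data. In the C API, ts_query_predicates_for_pattern exposes each pattern's predicates as a flat array of steps, where each step is a capture, a string, or a "done" marker that ends one predicate. The sketch below uses hypothetical stand-in types (ToyStep, ToyStepType) that store text directly instead of Tree-sitter's ids, assumes predicate names are stored without the leading #, and handles only eq? and not-eq?. Unknown predicates are silently ignored, which models a binding that has not implemented them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the step encoding; not the Tree-sitter types. */
typedef enum { STEP_DONE, STEP_CAPTURE, STEP_STRING } ToyStepType;

typedef struct {
    ToyStepType type;
    const char *text;     /* predicate name or literal, for STEP_STRING */
    size_t capture_index; /* index into capture_texts, for STEP_CAPTURE */
} ToyStep;

/* Resolves an operand to text: captures look up their captured text. */
static const char *operand_text(const ToyStep *step, const char *capture_texts[]) {
    return step->type == STEP_CAPTURE ? capture_texts[step->capture_index]
                                      : step->text;
}

/* Returns false as soon as a recognized predicate fails. */
bool toy_predicates_pass(const ToyStep *steps, size_t step_count,
                         const char *capture_texts[]) {
    assert(steps != NULL || step_count == 0);
    size_t start = 0;
    while (start < step_count) {
        /* Find the STEP_DONE that terminates this predicate. */
        size_t end = start;
        while (end < step_count && steps[end].type != STEP_DONE) end++;

        /* Expected shape: name, capture, then a capture or a string. */
        if (end - start == 3 && steps[start].type == STEP_STRING &&
            steps[start + 1].type == STEP_CAPTURE) {
            bool is_eq = strcmp(steps[start].text, "eq?") == 0;
            bool is_not_eq = strcmp(steps[start].text, "not-eq?") == 0;
            if (is_eq || is_not_eq) {
                const char *lhs = operand_text(&steps[start + 1], capture_texts);
                const char *rhs = operand_text(&steps[start + 2], capture_texts);
                bool equal = strcmp(lhs, rhs) == 0;
                if (equal != is_eq) return false;
            }
            /* Any other predicate name is ignored by this sketch. */
        }
        start = end + 1;
    }
    return true;
}
```

A match would be kept only when toy_predicates_pass returns true for the predicates of the pattern that produced it.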
Query API
Creating a query (ts_query_new): You create a query by passing in the language and a string containing your patterns.
- If it fails, error_offset indicates the exact byte where it failed.
- error_type indicates the cause (Syntax, NodeType, Field, or Capture).
TSQuery *ts_query_new(
  const TSLanguage *language,
  const char *source,
  uint32_t source_len,
  uint32_t *error_offset,
  TSQueryError *error_type
);
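Because error_offset is a byte offset into the query source, a small helper can turn it into a line and column for error messages. The sketch below is self-contained and does not call Tree-sitter; ToyPosition and offset_to_position are hypothetical names, lines and columns are 1-based, and columns are counted in bytes rather than characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: 1-based line and byte column. */
typedef struct {
    uint32_t line;
    uint32_t column;
} ToyPosition;

/* Converts a byte offset within source into a line/column position.
 * Offsets past the end are clamped to the end of the source. */
ToyPosition offset_to_position(const char *source, uint32_t source_len,
                               uint32_t offset) {
    assert(source != NULL);
    ToyPosition pos = {1, 1};
    if (offset > source_len) offset = source_len;
    for (uint32_t i = 0; i < offset; i++) {
        if (source[i] == '\n') {
            pos.line++;
            pos.column = 1;
        } else {
            pos.column++;
        }
    }
    return pos;
}
```

A caller could pass the same source and source_len it gave to ts_query_new, together with the reported error_offset, and print the resulting position next to the error_type.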
Thread safety and state (TSQuery vs TSQueryCursor):
- TSQuery: The parsed query itself. It is immutable and can be safely shared across multiple threads.
- TSQueryCursor: Holds the state required to actually run the query. It is not thread-safe, but you can reuse a single cursor for many executions to save memory.
TSQueryCursor *ts_query_cursor_new(void);
Executing a query (ts_query_cursor_exec): Prepares the cursor to find matches within a specific syntax node (usually the root node of your parsed code).
void ts_query_cursor_exec(TSQueryCursor *, const TSQuery *, TSNode);
Iterating matches (ts_query_cursor_next_match): Steps through the successful query matches one by one.
- Returns true and populates the match structure, or false when there are no more matches.
- TSQueryMatch contains metadata (which pattern matched) and an array of captured nodes.
- TSQueryCapture contains the actual TSNode from the syntax tree and the index of its capture name.
typedef struct {
  TSNode node;
  uint32_t index;
} TSQueryCapture;
typedef struct {
  uint32_t id;
  uint16_t pattern_index;
  uint16_t capture_count;
  const TSQueryCapture *captures;
} TSQueryMatch;
bool ts_query_cursor_next_match(TSQueryCursor *, TSQueryMatch *match);
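One detail that is easy to miss is that the index field of a capture identifies the capture name, not the capture's position inside the match. The sketch below mirrors the two structs with hypothetical stand-ins (ToyNode replaces TSNode, and the Toy prefix marks everything as illustrative) so that it compiles without the Tree-sitter headers; toy_find_capture is an assumed helper, not a library function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for TSNode: just enough to tell nodes apart. */
typedef struct {
    uint32_t start_byte;
    uint32_t end_byte;
} ToyNode;

/* Mirrors the shape of TSQueryCapture. */
typedef struct {
    ToyNode node;
    uint32_t index; /* which capture name this is, not its array position */
} ToyCapture;

/* Mirrors the shape of TSQueryMatch. */
typedef struct {
    uint32_t id;
    uint16_t pattern_index;
    uint16_t capture_count;
    const ToyCapture *captures;
} ToyMatch;

/* Returns the first capture bound to capture_index, or NULL if the
 * match does not contain that capture. */
const ToyCapture *toy_find_capture(const ToyMatch *match, uint32_t capture_index) {
    assert(match != NULL);
    for (uint16_t i = 0; i < match->capture_count; i++) {
        if (match->captures[i].index == capture_index) {
            return &match->captures[i];
        }
    }
    return NULL;
}
```

With the real API, the same loop would run over match.captures after each successful ts_query_cursor_next_match call, and the capture name for a given index can be looked up with ts_query_capture_name_for_id.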