Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

[talk] [seminal] Tree-sitter - A New Parsing System for Programming Tools

Source: Strange Loop 2018

Link: https://www.thestrangeloop.com/2018/tree-sitter—a-new-parsing-system-for-programming-tools.html

Video: https://www.youtube.com/watch?v=Jes3bD6P0To

Speaker: Max Brunsfeld, GitHub Atom team

Keywords:

  • Parsing
  • Incremental parsing
  • Syntax tree
  • Code navigation

Takeaway:

Tree-sitter takes the approach of a compiler generator:

  • GLR parsing
  • Error recovery with GLR parsing also
  • Incremental parsing by reparse but with the old AST as a reference

Introduction

Developer tools that support multiple languages have historically been built on regex-based code analysis.

Syntax highlighting in Atom, VS Code, and Sublime Text uses TextMate grammars, which are regular-expression pattern sets. Code folding uses indentation heuristics. Symbol search uses pattern matching over raw text. (Confirmed fact!)

Regex-based tooling has a hard ceiling:

  • Regex cannot represent nested structure or scope boundaries. Balanced delimiters, block nesting, type-sensitive disambiguation - none of this is expressible.
  • Reparsing an entire file on every keystroke is too slow for real-time use. Compiler parsers are designed for batch runs, not interactive editing.
  • Incomplete code breaks everything. While typing, code is almost always syntactically invalid. A parser that rejects invalid input cannot be used in an editor.
  • Every language needed its own editor plugin written from scratch, with no shared interface.

Tree-sitter was built to fix all of this.

  • A library for parsing source code written in C and C++.
  • Language-agnostic: Cover a wide range of languages.
  • Unified: Unified API for different languages.
  • Incremental: No need to reparse the whole file.
  • Error recovery: Recover on error instead of aborting.

IDEs’ limitations:

  • Language-specific.
  • No IDEs at the time were incremental.

Language servers’ limitations:

  • Fixed feature set.
  • RPC + parsing from scratch on every changes.
  • Each maintain a different set of dependencies.

Tree-sitter’s strengths:

  • Arbitrary analysis based on syntax tree.
  • In-process incremental parsing.
  • No dependencies.

Features

Syntax Highlighting

  • In Sublime Text 3/Vscode/Old Atom, syntax highlighting seems to be plagued with highlighting inconsistencies:

    • Different colors for different types in different contexts.
    • Same variables had different colors.
    • Fields were not highlighted.
  • A concrete example: Long lines of code (typically seen in bundled javascript) will break regex-based highlightings.

  • Improvements of tree-sitter:

    • None of the problems above.
    • Very fast.

Code Folding

  • Typically based on indentation: Do not always work, especially when some weird code convention is used!

  • With tree-sitter:

    • Folding always works.
    • Configurable code folding. This is unlocked by treesitter exposing a consistent AST.

Extend Selection

  • It’s the feature where clicking on a position cause semantically larger code items to be selected.
  • Works well with multi-cursor.

How Tree-sitter work

Writing a Grammar

  • Kind of look like most compiler generator.
    • A context-free grammar in tree-sitter is a Javascript value in a file such as grammar.js.
    • Compiled into parser.c
    • The parser can then be imported.
  • Bindings exist.

Algorithms

Seminal Paper

Based on Tim A. Wagner (1998)’s paper: Practical algorithms for Incremental Software Development Environments.

  • Laid out methodology for efficient code analysis in IDE.
  • Basic parsing theory to start with.
  • Extend the theory to incremental parsing.
  • Formally proofs of performance properties.
  • How to handle ambiguities, errors, etc.

LR Parsing

Good source: wikipedia.org/wiki/LR_parser.

  • Looks from start to end without ever backtracking.
  • As it reads the characters, they are grouped into tokens, tokens are grouped into subtrees, stored on a stack for consultation. The items on the stack are merged rapidly.

LR Parsing does not work for all languages. Although practically, a well-designed language shouldn’t be a problem, but some languages have weird quirks.

Tree-sitter does some unique things based on LR Parsing.

Extension: GLR Parsing

When ambiguities happen, fork the branch to work on multiple interpretations.

=> Tree-sitter is based on this.

Error Recovery

  • The tree should include error nodes.
  • Invalid code is a source of lots of ambiguities.

Tree-sitter uses the same idea of GLR for resolving ambiguities.

Incremental Parsing

  • Walk the current AST tree and mark all nodes that contain the editted positions.
  • Reparse like before, but it can now use the old tree as a reference.