Introduction

This document details the motivation and design of Typedown, along with the research into YAML 1.2, graph databases, and the semantic web (web 3), with the goal of unifying them and incorporating them into note management. In the end, notes are interlinked documents, which resemble a graph or a web of documents. Incorporating semantics into this web unlocks various kinds of analysis and language services (LSP) for note management. Static site generators can even be built on top of the written notes.

Research

Glossary

Graph Database

| Term | Definition | See |
|---|---|---|
| Cypher | Query language for property graph databases | The Graph Space |
| Global Graph Algorithm | An algorithm that operates on the entire graph (e.g. PageRank, community detection, connected components) | Local vs Global |
| Aggregate NoSQL Store | NoSQL database organizing data around self-contained chunks (key-value, document, or column-family) | Modeling Connectivity |
| Graph | A collection of nodes and relationships connecting them | The Concept of Graph |
| Graph Compute Engine | An engine for offline batch analytics over graph data (like OLAP) | The Graph Space, Graph Compute Engine |
| Graph Database | An online CRUD database exposing a graph data model, optimized for OLTP | The Graph Space, Graph Database |
| Hypergraph | A graph model where a single edge can connect more than two nodes | The Graph Space |
| Index-free Adjacency | Connected nodes physically point to each other, O(1) traversal per hop | Graph Database |
| Local Graph Operation | Starts at a specific node and traverses outward, touching a small portion of the graph | Local vs Global |
| Native Graph Storage | Storage engine purpose-built for graphs (vs serializing into relational/other backends) | Graph Database |
| Node | An entity in a graph, can hold properties (key-value pairs) | The Concept of Graph, The Property Graph Model |
| Pregel | Google’s paper describing a distributed graph compute engine, basis for most distributed implementations | Graph Compute Engine |
| Property | A key-value pair attached to a node or relationship | The Property Graph Model |
| Property Graph | The most popular graph model variant, nodes and relationships with properties | The Property Graph Model, The Graph Space |
| Relationship (Edge) | A named, directed connection between two nodes, can hold properties | The Concept of Graph, The Property Graph Model |
| Relationship Chain | Doubly linked lists of relationships connecting a node to its neighbors on disk | Native Graph Storage |
| Write-Ahead Log | Durability mechanism: changes are logged before being applied to store files | Transactions, Availability, Scale |

Semantic Web

| Term | Definition | See |
|---|---|---|
| Digital Signature | Encrypted data blocks proving information came from a trusted source | The Semantic Web |
| Equivalence Statement | Explicit links between different names for the same concept across databases (e.g. “zipcode” = “postal code”) | Semantic Web |
| Inference Rules | Logical deductions encoded in ontologies (e.g. “if X is in city Y, and Y is in state Z, then X is in state Z”) | Semantic Web |
| Knowledge Representation | Representing a knowledge base in a machine-readable fashion | The Semantic Web |
| Ontology | A document formally defining relations among terms: taxonomies, inference rules, and equivalence relations | Semantic Web, The Graph Space |
| Proof Verification | Verifying that claims are logically inferred from trusted sources via exchangeable mathematical proofs | The Semantic Web |
| RDF | Resource Description Framework: triple-based model (subject-predicate-object) encoding meaning via URIs | Semantic Web, The Graph Space |
| Semantic Preservation | Extracting universal logic from systems while discarding system-specific instructions, losing as little meaning as possible | The Semantic Web |
| Semantic Web | Extension of the WWW adding machine-readable meaning to data, enabling computer-human cooperation | Semantic Web |
| Software Agent | Software that harvests data from diverse sources, processes it, and exchanges results with other agents | The Semantic Web |
| SPARQL | Query language for RDF data | The Graph Space |
| Taxonomy | Class hierarchies (classes and subclasses) and their relations within an ontology | Semantic Web |
| URI | Unique identifier tying a concept to a discoverable definition, solving ambiguity across systems | Semantic Web |
| XML | Adds arbitrary structure to documents via custom tags, structure without meaning | Semantic Web |

Semantic Web

This is the main page of the Semantic Web research, containing links to the research sub-topics.

As a document itself, it gives an overview of the semantic web (web 3): its philosophy, practices, and recurring themes/ideas.

Key takeaways:

  • The semantic web is an extension of the world wide web, focusing on defining a consistent, unambiguous, machine-comprehensible semantic model of the world wide web components.
  • The semantic web has three main components:
    • XML: Defines custom entities. Entities are identified uniquely by URIs.
    • RDF: Defines relationships between entities.
    • Ontology: A taxonomy for entities plus inference rules based on the relationships between entities. An ontology can also define equivalence relations between its concepts and other ontologies’ concepts.

Here is the list of resources used in this document:

See related resources:

Refer to Research Topics for more in-depth details of the semantic web.

Other Relevant Resources

Expected Topics to Research

  • The philosophy of Semantic Web.
  • The model of Semantic Web.
  • RDF.
  • Ontology.
  • Query and processing Semantic Web.

Overview

A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities. - “The Semantic Web” article by Tim Berners-Lee.

The semantic web aims to add a machine-readable semantic layer to the largely unstructured traditional web, which was mostly aimed at humans. It incorporates a suite of philosophies and technologies to make the traditional web easily analyzable by tractable algorithms. This, in turn, allows machines/web crawlers/search engines to construct a graph of the information found on the web.

Note that the traditional web is already a graph; however, the nodes in this graph are opaque web pages from which useful knowledge can only somewhat be extracted. With the semantic web, the nodes are more fine-grained: it no longer works at the grain of web pages - it works at the grain of the knowledge the web pages contain.

Check out this example of an imagined scenario and possible use case that the semantic web enables (written by Tim Berners-Lee, the web standard’s creator): Link (first section).

  • Each person has a “hand-held” web-browser and a personal web agent.
  • Each website also provides an agent.
  • Agents can work collaboratively on the structured information that the semantic web offers to make informed decisions.
  • The interaction between the human and the agent seems largely explorative (like our modern-day interaction with AI agents).

Motivation of The Semantic Web

  • The World Wide Web today:

    1. Machines can reliably parse Web pages.
    2. Machines can reliably detect web page layouts and perform routine processing.
    3. Machines cannot process the semantics of the web pages.
  • The Semantic Web:

    1. Bring structure to the unstructured web content (HTML, etc.).
    2. Enable roaming agents to carry out sophisticated tasks.
    3. Lighter-weight than the current sophisticated AI but still well-informed.
  • Tim Berners-Lee’s vision:

    • Developers can use “off-the-shelf” software to write Semantic Web pages.
    • Normal users have agents roaming the pages to make well-informed decisions on their behalf.

Definition

The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation.

Design Philosophy

The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation. - “The Semantic Web” article by Tim Berners-Lee.

The Semantic Web is an extension of the existing Web, not a replacement. Essentially, it’s a semantic layer/enhancement/augmentation over the traditional web.

Infrastructure is already partially in place; the near-term goal is to make machines capable of processing the data they currently only display.

The Web’s core property is universality: anything can link to anything, without discrimination between document types, cultures, or media. So far, this universality has mostly served human-readable documents (pages, articles, media). The Semantic Web extends this to machine-processable data.

Like the Internet, the Semantic Web is designed to be decentralized, trading total consistency for unchecked growth. The trade-off is familiar: “404 Not Found” is the cost of allowing anyone to publish anything without central coordination.

Typedown doesn’t really prioritize decentralization (notes are mostly centralized) and favors total consistency more.

According to Knowledge Representation, the semantic web essentially unifies the language for specifying documents/knowledge with the language for defining rules for reasoning about that data, in a system where anyone can publish documents.

Knowledge Representation

Knowledge representation is the act of representing the knowledge base in a machine-readable fashion.

Traditional knowledge-representation systems are centralized and rigid:

  • Everyone must share the same definitions.
  • Rules are narrow and idiosyncratic.
  • Questions are carefully limited to avoid unanswerable queries.

However, echoing Kurt Gödel’s theorem (“any system that is complex enough to be useful also encompasses unanswerable questions”), traditional knowledge-representation systems stay highly restricted precisely to avoid such questions.

Semantic Web accepts a different trade-off: paradoxes and unanswerable questions are unavoidable, but this enables versatility and openness. Any system can export its rules onto the Web without central coordination.

We make the language for the rules as expressive as needed to allow the Web to reason as widely as desired. - “The Semantic Web” article by Tim Berners-Lee.

Challenge

The challenge of semantic web is “to provide a language that expresses both data and rules for reasoning about the data and that allows rules from any existing knowledge-representation system to be exported onto the Web”.

  • The logic expressible should cover most complex properties of objects.
  • The logic must be limited enough not to express paradoxes.

Fortunately, a large majority of the information we want to express is along the lines of “a hex-head bolt is a type of machine bolt,” which is readily written in existing languages with a little extra vocabulary. - “The Semantic Web” article by Tim Berners-Lee.

Key Technologies

Related technologies: XML & RDF.

Two key technologies enable this:

  • XML: adds arbitrary structure to documents via custom tags (e.g., <zip_code>). Structure without meaning.
    • Kind of like YAML.
    • YAML only describes data; semantics and data types are encoded in both the YAML processor and the language runtime.
  • RDF: encodes meaning as triples (subject, verb, object), like elementary sentences. Each part is identified by a URI, so concepts are globally unambiguous.

URIs solve the fundamental problem: human language tolerates ambiguity (“address” can mean mailing address, street address, or a speech). Machines do not, so a URI is used to tie a concept to a unique definition that everyone can find on the web. Using distinct URIs for each concept (mailing_address vs street_address) prevents confusion when data crosses system boundaries.

In essence:

  • RDF triples form webs of information.
  • URIs ensure concepts are tied to unique, discoverable definitions on the Web.

Ontologies (The Third Semantic Web Component)

RDF and URIs are not the end of the story. To motivate the concept of “ontology”:

  1. RDF and URIs solve local ambiguity, but a broader problem remains: different databases may independently define similar concepts with different identifiers.
  2. A program needs to know that zip_code (in one system) and postal_code (in another) refer to the same thing. Without this knowledge, systems cannot meaningfully combine information across sources.

In philosophy, an ontology defines what types of things exist, and ontology as a discipline studies these questions.

To AI and Web researchers, “an ontology is a document or file that formally defines the relations among terms”. Basically, an ontology is a schema, like a YAML schema.

The most typical types of web ontologies often involve a taxonomy and a set of inference rules.

How Ontologies Solve It

Ontologies are documents that formally define relations among terms.

They provide:

  • Taxonomies: Class hierarchies (classes and subclasses) and relations.

    For example:

    • An address is a type of location.
    • City codes apply only to locations.

    “If city codes must be of type city and cities generally have Web sites, we can discuss the Web site associated with a city code even if no database links a city code directly to a Web site.”

    Classes can inherit properties from parent classes, reducing redundancy.

  • Inference rules: Encode logical deductions.

    Example: “If a city code is associated with a state, and an address uses that city code, then the address is in that state”. A computer can then deduce that a Cornell University address in Ithaca must be in New York, which is in the U.S.

  • Equivalence relations: map concepts across ontologies. If you publish an ontology defining “zip code” and I publish one defining “postal code”, either or both ontologies can state: my:zip_code is equivalent to your:postal_code.

Agents (The Application of The Semantic Web)

Related technology: DAML (DARPA Agent Markup Language).

Because the Semantic Web aims at making web pages more machine-readable, software agents emerge that harvest data from diverse sources, process it, and exchange the results.

Note that the above 3 components of the Semantic Web define the infrastructure. Agents are not really a component of the semantic web structure itself, if I understand correctly.

The power grows exponentially as machine-readable content increases. The core idea is that: agents designed independently can collaborate when data carries explicit semantics - structured meaning lets them understand and act on unfamiliar data.

Verification

The web is decentralized and so are the agents. Therefore, it’s very possible for agents to fabricate data or try to spoof some other party.

Regarding this, there are two key aspects of agents: Proof verification and Digital signature.

  • Digital signature: Used to verify that the sources (knowledge) come from legitimate parties.
  • Proof verification: Used to verify that claims are legitimately inferred from those sources.

Proof verification:

  • Agents build trust by exchanging proofs in Semantic Web’s unifying language.
  • Example:
    1. Your agent finds Ms. Cook in Johannesburg via an online service.
    2. You ask the service for proof.
    3. It translates its internal reasoning to Semantic Web language.
    4. Your inference engine verifies the chain of logic and shows source pages.

This enables verification without human intervention.

Digital signatures verify source authenticity: Encrypted data blocks prove information came from a trusted source, not a forgery.

  • Example: verify that an accounting statement from a retailer is genuine, not spoofed by a neighbor.
  • Agents must verify sources before trusting assertions on Semantic Web.

Service Discovery

Agents need to find services performing specific functions.

Service discovery requires a common language describing what a service does and how to use it. Services advertise in directories (like the Yellow Pages).

Current schemes (UPnP - a device discovery and control protocol suite, Jini) work at the syntax level only and rely on pre-standardized function descriptions. The problem is that standards can’t anticipate all future needs (not sure what is really implied here?). The Semantic Web enables open discovery through meaningful service descriptions any agent can understand.

Semantic Web is more flexible compared to the previous schemes:

  • Agents exchange ontologies for shared understanding, allowing a shared vocabulary.
  • Agents can bootstrap new reasoning when discovering new ontologies.
  • Semantics enables partial matches, exploiting services even if they don’t perfectly fit requests.

Value Chains

Agents form value chains: data passes agent-to-agent, each adds value.

Distributed Web data is transformed progressively into high-value output for the user.

Example: agents find providers, filter by insurance plan, and match appointments to the user’s schedule. Raw data (worthless, scattered across the Web) -> refined result (actionable plan).

The Semantic Web

Link: https://www.w3.org/2000/Talks/0906-xmlweb-tbl/text.htm.

Authors: Tim Berners-Lee, Dan Connolly, Lynn Andrea Stein, Ralph Swick.

Key takeaways:

  1. Machine-readable web:
     • The Semantic Web structures data using frameworks like RDF (Subject-Verb-Object).
     • Machines can understand and process information, acting as a massive, decentralized database rather than just linking documents for humans.
  2. Semantic Web abandons the old Knowledge Representation (KR) goal of building perfect logic engines, prioritizing massive data publication and trusting that powerful navigation tools will evolve to process it.
  3. Semantic preservation: It breaks down data silos by extracting universal logic from custom systems and using “equivalence statements” to link different terms (e.g. “zipcode” = “postal code”) without forcing a single global standard.
  4. Verifiable trust & proofs: Systems collaborate securely by exchanging mathematical “logical proofs” and digital signatures instead of sharing private internal rules, ensuring every piece of data has a verifiable chain of trust.

Context

To re-iterate, the semantic web is an extension of the world wide web. We differentiate the two concepts:

  • The world wide web is an interlinked network of documents, called hypertext.

    • It emphasises linking related resources together, allowing humans to browse through a network of related resources.
    • Its essence lies in universality:
      1. Hypertext links allow anything to link to anything.
      2. Anything must be able to be put on the web.
    • Implications: The web technology is inclusive with regard to cultures, media, and quality.
    • Shortcomings: Machines do not have a well-defined, unambiguous way to derive meaning from documents in the world wide web.
  • The semantic web aims to give data a well-defined meaning so machines can readily process it.

“Weblike”

  • Weblike qualities:

    • Decentralized: No single party controls the web content.
    • Universal: Any resource can link to any resource.
    • Compromised “total consistency”: Because no single party wholly controls the web, no one can ensure assumptions actually hold up.
  • Semantic web must be weblike.

  • Analogy:

    1. The world wide web is like one big book.
    2. The semantic web is like one big database/mathematical formula.
  • The semantic web is not a separate web: It’s another dimension of information added to the world wide web.

Before entering the next sections, it’s useful to view this Semantic Web tower (image taken from w3.org):

Semantic Web Tower

The First Level of Semantic Web: RDF

  • XML (eXtensible Markup Language): Allows creating structured documents that can express various information, but the interpretation of that information is application-specific, which means there’s no universally unambiguous way to describe its meaning.

    • For example, the same <name> can mean different things in different contexts.
  • RDF (Resource Description Framework): Provides a model to convey the meaning of information as sets of triples (Subject + Verb + Object). The triples form directed graphs, creating a web of concepts.

  • RDF treats subjects, verbs and objects as resources.

    • Verb is also called an “RDF property”.
  • In RDF, both subjects and objects are identified by URIs.

    • RDF utilizes XML namespaces to assign URIs to every concept. <name> in XML now is no longer ambiguous.
  • An RDF document asserts that particular things have particular properties with particular values.

Generalizing About Current Knowledge Representation (KR) Systems

KR is a widely researched topic. The authors identified KR as a field at the time as follows:

Knowledge Representation as a field is indeed in a state rather similar to the state of Hypertext technology before the web: it is clearly a good idea, and some very nice demonstrations exist, but it has not yet changed the world. In both cases, the idea is good but to be powerful, they must form a single global system. - “The Semantic Web”.

Past Challenges

In the past, KR systems did not scale:

  • Some used a centralized data technology, requiring data to be centralized too.
  • Some were “conceptually centralized”, requiring everyone to maintain the same definition of concepts.
  • Some used fuzzy reasoning systems (maybe it means to imply imprecision or best-effort reasoning), but they did not scale.

KR systems that possess weblike qualities should allow concepts to be independently conceived and retrospectively linked.

Another reason worth noting regarding the challenges of interoperability:

  • Knowledge consists of two parts:
    • Data (facts - easy to share): e.g, “Mary is John’s sister”.
    • Rules (logic - hard to share): e.g, “A parent’s sister is an aunt”.
  • Systems use a specific program (an inference engine) to apply rules to data and answer questions. Because computers cannot answer infinite or purely general questions, these engines are heavily customized for very specific domains. System creators had to strike a delicate balance when designing their rule languages:
    • Too simple: The system cannot understand or process complex, real-world scenarios.
    • Too powerful: The computer might get stuck in endless loops or fail to guarantee a predictable answer (computational intractability).
  • Because every system made different compromises to achieve that balance, their rules became unique to their own engines, making interoperability (sharing rules between different systems) nearly impossible.

Solution of The Semantic Web

So how does the semantic web resolve this?

  • The Semantic Web prioritizes maximum expressive power for now, completely sidelining the immediate need for computers to process and answer questions about the data.
    • It’s making the exact same gamble the early internet did: let people publish a massive, messy ocean of data first, and trust that the tools to navigate it will be invented later.
    • To draw an analogy, detractors of the World Wide Web after its invention pointed out that without a central database and tree structure, it’s not possible to find everything. Although that is indeed true, search engines have nonetheless become remarkably good.
    • So, the creators of the Semantic Web expect the same.

Challenges of The Semantic Web

The main challenge is building a unified language that can express both the data and the underlying rules of any existing Knowledge Representation (KR) or database system.

  • The Semantic Web accepts a hard truth: rules exported from System A probably won’t import cleanly into System B.
  • Instead, the Semantic Web cares more about semantic preservation: It doesn’t care about perfect, symmetrical cross-system compatibility. Its core philosophy is simply to get the information out into the open while losing as little meaning as possible.

How does the Semantic Web allow for semantic preservation?

  • Traditional rules contain both a universal truth and a system-specific instruction manual.

    As an example: “If you find a daughter, check whether she has a daughter, and if so you have a granddaughter”. This has two parts:

    • The logical assertion: The daughter of a daughter is a granddaughter.
    • The hint: The inference engine should use this fact whenever it finds a fact about a daughter.
  • The Semantic Web extracts and saves the universal truth (the logic), while completely throwing away the instruction manual (the operational hints).

A Web of Logic

The current primary task: Figuring out how to successfully inject logic into the web. The logic language must hit a very specific sweet spot:

  • Powerful enough to handle complex abstractions (like defining the properties of other properties).
  • Restrictive enough to prevent the system from getting trapped in logical paradoxes.

However, the problem sounds harder than it actually is in practice. The vast majority of existing information is basic classification (e.g., “a hex-head-bolt is a type of machine-bolt”). This simple data doesn’t stress the system’s logic and can be easily handled by existing foundational tools like RDF with just a little extra vocabulary.

Equivalence statements link different names for the same concept across separate databases, breaking down data silos.

  • The problem is that unlinked databases treat identical concepts (e.g, “zipcode” vs “postal code”) as completely unrelated.
  • The solution the semantic web offers is that “equivalence statements” explicitly link these terms, enabling seamless cross-database queries.
  • This allows for:
    • Smooth evolution of knowledge bases: They can act as built-in translators between old and new versions of standards or vocabularies.
    • Fast, custom local terms that easily map to slower, broader global standards.

Image taken from w3.org:

Semantic Web concepts

Information isn’t found by querying a central database. Systems discover new data by actively following links (URIs) embedded within documents.

When a computer reads a document, it parses the content into a mathematical, logical formula it can process. These logical formulas point to other documents. A system will evaluate if the current, trusted document confers “trust” to the external link before deciding to actually fetch and read it. Because every piece of information is linked back to its origin and the chain of documents that led to it, data always has a verifiable context. A system never has to take a random fact at face value without checking its history.

The Eating

About this, check the Verification section in the entry note.

The Semantic Web doesn’t need perfect, universal inference engines to be useful today. Instead of sharing complex internal rules, systems can securely interoperate right now by sharing logical proofs (not the inference engines) backed by digital signatures:

  • Systems with completely different logic engines can still collaborate.
  • Scenario:
    1. When System A solves a query, it exports its step-by-step reasoning as a universal mathematical “proof”.
    2. System B doesn’t need to know how to solve the problem. If it receives a conclusion, it simply asks for the proof, verifies the logic, and decides if it trusts the initial assumptions.

Real-world example: A company (like IBM) doesn’t have to expose its private, complex HR rules to an external site (like W3C) to authorize an employee. It simply issues the employee a mathematically verified, digitally signed proof of employment.

Interesting insight: Guaranteeing a computer can solve any complex real-world problem is mathematically intractable. The Semantic Web sidesteps this by not requiring universal problem-solving. Instead, it lets the data flow and relies on a mix of simple, predictable agents and clever heuristic tools (like search engines) to extract practical value.

The Semantic Web Made Easy

Source: Link

This goes into why, even though all the pieces of the Semantic Web seem to be in place, it has not seen mainstream adoption.

Context

The Semantic Web is defined as an upgraded web where data has clear, machine-readable meaning, conceptually guided by the famous “Semantic Web Tower” (see below - image taken from w3.org).

  • The ultimate goal is to enable better, more seamless cooperation between humans and computers.
  • The architecture of this vision is based on the “Semantic Web Tower”, an iconic whiteboard drawing by Berners-Lee that is universally used to explain the concept to both developers and the general public.

Semantic Web Tower

The Question

However, there’s a harsh truth: the technical building blocks of the Semantic Web are successfully being built, yet the promised revolution hasn’t happened.

The “P” Axis

The Semantic Web stalled because it built a system for machines, but completely forgot about the humans.

  • The completed “Tower” of standards successfully achieves two goals: it extends the current web, and it allows computers to cooperate.
  • The definition explicitly promised to enable computers and people to work together.
  • The reason the Semantic Web hasn’t taken off is that it neglected the human element. It built the infrastructure but failed to actually engage or empower the “people who are awaiting”.

Image from w3.org:

The perception of the tower

The “P (Perception/People) Axis” is a proposed new dimension for the Semantic Web designed specifically to bridge the gap between complex machine logic and human usability.

  • The “P” axis stands for Perception (or People). Its entire purpose is to adapt and translate the deeply technical layers of the Semantic Web tower into something humans can actually interact with.
  • This axis requires its own dedicated technologies and standards focused on human communication.
  • To make the Semantic Web accessible to people, the P axis intentionally sacrifices some of the raw “expressive power” (machine complexity) of the underlying tower in exchange for user-friendliness.

Graph Technologies: Semantic Web vs Property Graph

For a detailed comparison of RDF vs property graphs (structural differences, performance, reasoning), see The Graph Space: Property Graph vs RDF.

Key Differences

Both represent data as nodes connected by named edges and can model the same data. The real differences:

  • Property graphs optimize for developer ergonomics and traversal performance. Properties live directly on nodes and edges. Schema-free and ad-hoc.
  • Semantic web (RDF) optimizes for global interoperability and machine reasoning. Everything is a triple. Formal ontologies (OWL) enable automatic inference. Standardized by W3C.

The choice is about ecosystem and purpose, not structural capability.

Semantic Web Technologies

The semantic web tower, bottom to top:

URI: Global Identity

Resources

Spec

In Semantic Web

Others

XML: Structure Without Meaning

Resources

Specs

  • W3C XML 1.0 Spec - The definitive spec. Verbose but authoritative.
  • W3C XML Namespaces - How XML avoids naming collisions using URIs. Critical for understanding how RDF serializes to XML.

Tutorials

In Semantic Web

  • XML provides structure but no meaning: <name> could mean anything without external agreement. This is exactly the gap that RDF fills by adding URIs to every concept. See RDF: Meaning as Triples.

RDF: Meaning as Triples

Resources

Specs

In Semantic Web

Others

Alternative Serialization Formats

RDF is an abstract model, not a file format. Multiple serializations exist:

  • Turtle - Human-readable, most commonly used for hand-written RDF.
  • JSON-LD - JSON-based, popular in web APIs and schema.org.
  • RDF/XML - Original format, verbose, mostly legacy.
  • N-Triples - One triple per line, simple but verbose. Good for streaming.

OWL: Ontologies and Reasoning

Resources

Specs

  • W3C OWL 2 Overview - Maps the OWL 2 document suite and explains the three profiles (EL, QL, RL) which trade expressiveness for computational tractability.
  • W3C OWL 2 Primer - Gentle intro with examples. Covers classes, properties, restrictions, and reasoning.

Books

  • Semantic Web for the Working Ontologist (Allemang & Hendler) - The practical guide. Covers RDF, RDFS, OWL, and SPARQL with a focus on building real ontologies.
  • A Semantic Web Primer (Antoniou & van Harmelen) - More academic. Good for understanding the theoretical foundations.

Others

In Semantic Web

  • OWL ontologies enable the automatic inference discussed in The Graph Space. Property graphs have no equivalent, reasoning must be custom code.

RDFS vs OWL

RDFS (RDF Schema) and OWL are both for defining vocabularies, but at different levels:

  • RDFS: Lightweight. Classes, subclasses, domain/range of properties. Enough for simple taxonomies.
  • OWL: Full reasoning. Cardinality constraints, transitive/symmetric properties, equivalence, disjointness. Needed when you want machines to infer new facts.

Most real-world ontologies use a mix: RDFS for the basics, OWL where inference is needed.

SPARQL: Query Language for RDF

Resources

Specs

In Semantic Web

  • SPARQL is to RDF what Cypher is to property graphs. For the comparison, see The Graph Space.
  • SPARQL queries pattern-match on triples, while Cypher pattern-matches on nodes and relationships. Structurally similar, but SPARQL has built-in support for reasoning (if the triple store supports inference).
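
To make the comparison concrete, here is a minimal Cypher sketch of the property-graph side (the :Person label, name property, and :KNOWS relationship type are illustrative assumptions); the corresponding SPARQL query would pattern-match triples such as ?alice foaf:knows ?friend instead.

// Illustrative property-graph data: (:Person {name: ...}) nodes linked by :KNOWS.
// Cypher pattern-matches on nodes and relationships rather than triples.
MATCH (alice:Person {name: 'Alice'})-[:KNOWS]->(friend:Person)
RETURN friend.name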

Live Playgrounds

Tutorials

  • Apache Jena SPARQL Tutorial - Fast, practical course. Covers SELECT, FILTER, OPTIONAL, UNION with runnable examples.
  • SPARQL.dev - Cheat sheet / quick reference. Good for looking up syntax while writing queries.
  • Wikidata SPARQL Tutorial - Learn by querying real data (Wikidata). The most motivating way to learn.

Graph Database

Resources

Expected Topics to Research

  • Philosophy of Graph Databases.
  • The data model of Graph Databases.
  • Query language of Graph Databases.
  • Processing phases of Graph Databases.
  • Implementation technique of Graph Databases.

Graph Databases by Ian Robinson, Jim Webber & Emil Eifrem

This is a book aimed at practicality, focusing on managing and querying highly connected data using schema-free graph models.

The book is authored by engineers at Neo4j.

Key takeaways:

  • Graph databases excel at highly connected data where relational databases and aggregate NoSQL stores struggle: Query patterns that require traversing various records via links.
  • Graph databases are more adept at querying graph-like properties, like all transitive friends of a person, where relational joins degrade exponentially with depth. Graph traversal stays constant (O(1) per hop with index-free adjacency).
  • The property graph model (nodes + relationships, both with key-value properties) is simple yet covers the vast majority of use cases.
  • Graph databases are schema-free and additive: New node types and relationships can be added without breaking existing queries.

The book covers the following:

  • The Cypher query language.
  • The property graph data model.
  • Best practices/commons pitfalls and architectural patterns when working with graph databases.
  • How to design data with the graph data model?

A Little of History

Around 1999, the founders of Neo4j were struggling with an enterprise content management system where 50% of their engineering effort was spent “fighting” relational databases. While SQL handled discrete data well, it was extremely slow and complex for managing connected data, which represented most of their use cases.

Failing to find an existing solution on the web (using “AltaVista”, an antique search engine), the team built a native graph database from scratch. Their goal was to combine the reliability of relational databases (ACID transactions) with a graph-centric model for the 21st century (?).

Since then, graph databases have become mission-critical in logistics (routing), finance (entitlements), airlines, and healthcare.

Basically, the philosophy is to embrace “graph thinking” to empower data connectivity, which allows unlocking insights that traditional databases cannot.

The Concept of Graph

A graph is a collection of nodes (entities) and relationships (edges) connecting them. This schema-free structure can model virtually anything: Social networks, supply chains, road systems, etc.

For example, a social network where users follow each other:

graph LR
    A -->|FOLLOWS| B
    B -->|FOLLOWS| A
    C -->|FOLLOWS| A
    C -->|FOLLOWS| B

Relationships carry semantic meaning - we can immediately see who follows whom and whether it’s mutual.

Graphs can also mix entity types. For instance, adding posts to a user node as a linked list:

graph TD
    A -->|LATEST| Post_3
    Post_3 -->|PREV| Post_2
    Post_2 -->|PREV| Post_1

LATEST points to the most recent post, and the PREV chain forms a timeline.

Different concerns (social connections + content) coexist naturally in the same graph.

The Property Graph Model

The most popular graph model variant. A property graph is made up of nodes, relationships, and properties.

  • Nodes contain properties: Arbitrary key-value pairs where keys are strings and values are arbitrary data types. Think of nodes as documents.
  • Relationships connect and structure nodes. A relationship always has a direction, a label, a start node, and an end node: There are no dangling relationships. Direction and label together add semantic clarity.
  • Both nodes and relationships can hold properties. Relationship properties are particularly useful for:
    • Providing metadata for graph algorithms.
    • Adding semantics (quality, weight).
    • Constraining queries at runtime.
graph LR
    A["Person<br>{name: 'A', age: 30}"] -->|"KNOWS<br>{since: 2020}"| B["Person<br>{name: 'B', age: 25}"]

Simple enough to be intuitive, yet expressive enough for the vast majority of graph use cases. These primitives are all that’s needed to create sophisticated, semantically rich models.
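
As a concrete sketch of these primitives, the diagram above could be created and queried roughly like this in Cypher (names, labels, and property values are illustrative):

// Two Person nodes with properties, connected by a KNOWS relationship that
// itself carries a property.
CREATE (a:Person {name: 'A', age: 30})-[:KNOWS {since: 2020}]->(b:Person {name: 'B', age: 25})

// Reading it back: who does A know, and since when?
MATCH (a:Person {name: 'A'})-[k:KNOWS]->(b:Person)
RETURN b.name, k.since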

The Graph Space

The graph space can be split along two axes: by usage and by data model.

By Usage

graph TD
    GraphTech[Graph Technologies] --> GDB[Graph Databases]
    GraphTech --> GCE[Graph Compute Engines]
    GDB --> OLTP["Transactional / real-time<br>(like OLTP in relational world)"]
    GCE --> OLAP["Batch analytics<br>(like data mining / OLAP)"]
  • Graph databases: The graph equivalent of a regular OLTP database.

    • Always running.
    • Always serving low-latency reads/writes with full ACID guarantees.
    • Used as the primary data store by an application.

    This is the main focus of this book.

  • Graph compute engines - for offline batch analytics over graph data (similar to data mining/OLAP).

By Data Model

Three dominant graph data models exist:

  • Property graph - the most popular, used by most graph databases. Covered throughout this book.
  • RDF triples - Resource Description Framework, common in semantic web.
  • Hypergraphs - a single edge can connect more than two nodes at once.

Property Graph vs RDF

Both represent data as nodes + named edges and can model the same data. The differences are in conventions and ecosystem:

| | Property Graph | RDF |
|---|---|---|
| Properties | Key-value pairs directly on nodes and edges | Extra triples (e.g. Alice --hasAge--> 30) |
| Relationships | First-class, can hold properties | Triples - adding metadata to a relationship requires extra nodes |
| Schema | Ad-hoc, no predefined schema | Formal ontologies (OWL/RDFS) |
| Optimized for | Fast traversal, app queries | Semantic reasoning, cross-org interoperability |
| Query language | Cypher, Gremlin | SPARQL |

Property graphs trade semantic rigor for compactness and speed. RDF trades compactness for global interoperability and machine reasoning.

RDF’s formal ontologies (OWL/RDFS) enable automatic inference - the system derives new facts from existing data without custom code. For example, given Dog subclassOf Animal and Rex type Dog, an RDF store infers Rex type Animal. You could build similar reasoning on a property graph, but it would be custom logic rather than a standardized format other systems understand.
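
As a rough illustration of what that custom logic might look like on a property graph (the :Class nodes and the IS_A/SUBCLASS_OF relationship types below are hypothetical modeling choices, not a standard), the inferred facts have to be materialized by hand in Cypher:

// Walk the subclass chain upward from Rex's declared class and explicitly
// write the inferred IS_A relationships - work an RDF reasoner does for free.
MATCH (rex {name: 'Rex'})-[:IS_A]->(:Class {name: 'Dog'})-[:SUBCLASS_OF*0..]->(super:Class)
MERGE (rex)-[:IS_A]->(super)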

Graph Database

  • An online database with CRUD operations that exposes a graph data model.
  • Optimized for transactional (OLTP) workloads - transactional integrity and operational availability.

Relationships are first-class citizens, unlike relational databases (foreign keys) or other NoSQL stores (map-reduce).

This allows building models that map closely to the problem domain.

Two Dimensions

Graph databases vary along two axes: the underlying storage (native vs non-native graph storage) and the processing engine (native vs non-native graph processing).

quadrantChart
    title The Graph Database Space
    x-axis "Non-native storage" --> "Native storage"
    y-axis "Non-native processing" --> "Native processing"
    Neo4j: [0.9, 0.9]
    OrientDB: [0.8, 0.95]
    Titan: [0.4, 0.85]
    InfiniteGraph: [0.25, 0.8]
    FlockDB: [0.15, 0.2]
    AllegroGraph: [0.8, 0.2]

Legend:

  • Storage = how data is persisted on disk.
  • Processing = how the engine traverses/queries data at runtime.

| | Native Processing | Non-native Processing |
|---|---|---|
| Native Storage | Neo4j, OrientDB | AllegroGraph |
| Non-native Storage | Titan, InfiniteGraph | FlockDB |

These are engineering trade-offs, not inherently good or bad. Native/native is fastest for graph workloads. Non-native/non-native is easiest to operate.

Native Graph Storage

Custom storage format designed for graphs - nodes, relationships, and properties stored in dedicated structures. E.g. Neo4j uses fixed-size record files for nodes and relationships.

  • Pro: Engineered for graph performance and scalability.
  • Con: Less battle-tested than mature backends.

Non-native Graph Storage

Graph data serialized into an existing backend (e.g. MySQL, Cassandra). The graph is an abstraction layer on top.

  • Pro: Ops teams already know the backend.
  • Con: Overhead from translating between graph and non-graph representations.

Native Graph Processing (Index-Free Adjacency)

Each node physically stores a pointer to its neighbors. Traversing an edge = following a pointer.

  • Pro: O(1) per hop, regardless of total graph size.
  • Con: Non-traversal queries can be difficult or memory-intensive.

Non-native Graph Processing

To find a node’s neighbors, the engine does an index lookup (e.g. B-tree scan).

  • Pro: Index-based lookups are well understood and flexible.
  • Con: O(log n) per hop, degrades as the graph grows.

Local vs Global Graph Operations

There are two fundamentally different ways to work with graph data:

Local Operations

Start at a specific node and traverse outward. Only visit a small portion of the graph.

Examples:

  • “Who does Alice follow?”
  • “Find friends-of-friends of Bob”
  • “What’s the shortest path from X to Y?”

These are fast because they touch few nodes. This is what graph databases are built for: Real-time, low-latency responses.
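
As a minimal sketch, the first example above could look like this in Cypher (assuming a :Person label and a :FOLLOWS relationship type):

// Local operation: anchor on Alice, then follow outgoing FOLLOWS edges one hop.
MATCH (alice:Person {name: 'Alice'})-[:FOLLOWS]->(followed:Person)
RETURN followed.name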

Global Operations

Operate on the entire graph. Must touch most or all nodes, often in multiple passes.

Examples:

  • PageRank: Score every node by importance based on incoming links. Requires iterating over all nodes multiple times until scores converge.
  • Community detection: Find clusters/groups by scanning everyone’s connections.
  • Connected components: Identify all disconnected subgraphs.
  • Betweenness centrality: Find which nodes act as bridges between different parts of the graph.

These are slow by nature: You can’t answer “who is the most important person in the network” without looking at everyone. This is what graph compute engines are built for: Offline batch processing.

Significance

A graph database trying to run a global algorithm would block all other queries while scanning the entire dataset. A graph compute engine can’t serve real-time app requests because it’s designed for batch throughput, not low latency. Different tools for different jobs.

Graph Compute Engine

A batch processing framework specialized for graph data, think Hadoop/Spark, but for graphs. It’s the graph equivalent of data mining/OLAP in the relational world.

It is not a database. It doesn’t store data persistently, handle real-time queries, or support CRUD operations. It’s a compute layer that takes graph data in (fed from a database via ETL), runs global graph algorithms across the entire dataset, and spits results out. The “engine” part means it handles the parallelization for you.

If a graph database is the waiter taking orders in real time, the graph compute engine is the kitchen doing meal prep overnight.

Graph Compute Engine vs Graph Database

They often work together: The graph database handles real-time app queries, the graph compute engine periodically crunches the whole graph offline and writes computed results back into the graph database for the app to use. Think of it like PostgreSQL (graph database) vs Spark (graph compute engine): One is your live database, the other is your batch analytics layer. Same relationship, just graph-flavored.

graph LR
    SOR["Graph Database<br>(MySQL, Oracle, Neo4j)"] -->|"ETL<br>(periodic)"| GCE["Graph Compute Engine<br>- In-memory processing<br>- Working storage"]
    GCE -->|"Write results back"| SOR

Example: You have a social network stored in Neo4j (your graph database, serving your app). Once a day, you ETL that data into Giraph (your graph compute engine), run PageRank to find influential users, and write the results back. Your app then reads those precomputed scores from Neo4j.

Processing Models

  • In-memory/single machine: E.g. Cassovary.
  • Distributed: E.g. Pegasus, Giraph; most are based on Google’s Pregel paper (Google’s system for large-scale graph processing).

Graph Database Pros

Three key advantages over relational databases and NoSQL stores:

Performance

With relational databases, join-heavy queries get slower as the dataset grows. With graph databases, performance stays relatively constant regardless of total dataset size. This is because queries are localized: Execution time is proportional to the portion of the graph traversed, not the size of the overall graph.

This makes graph databases especially compelling for connected data, where relational joins would otherwise deteriorate by orders of magnitude.

Flexibility

The graph data model is schema-free and naturally additive: You can add new node types, relationships, and subgraphs without breaking existing queries or application logic. This means:

  • Structure and schema emerge alongside your understanding of the domain, rather than being imposed upfront when you know least about the data.
  • Fewer migrations, reducing maintenance overhead and risk.
  • No need to model the domain in exhaustive detail ahead of time.

Remark: Well, I think there’s maybe an unknown trade-off.

Agility

Graph databases align well with incremental, iterative software delivery. The schema-free model lets the data evolve in step with the application.

The tradeoff: No schema means no schema-oriented governance (like relational constraints). Instead, governance is applied programmatically through tests that validate the data model, queries, and business rules. This fits naturally with test-driven development practices.

Remark: Well, I think there’s maybe an unknown trade-off for this point too.

Modeling Connectivity in Various Data Models

This document compares how relational databases, aggregate NoSQL stores, and graph databases handle connected data.

Context: Aggregate NoSQL Stores & Graph Databases

Aggregate NoSQL stores are databases that organize data around aggregates. An aggregate (from Domain-Driven Design) is a cluster of related data treated as a single unit for storage and retrieval.

For example, an “Order” aggregate bundles the order, its line items, shipping address, and payment info together. In a relational database this would be spread across multiple tables joined by foreign keys, but in an aggregate store the entire order is stored and fetched as one unit.

Three main types:

  • Key-value stores (e.g. Redis, DynamoDB): Data is a blob associated with a key. The store doesn’t care what’s inside.
  • Document stores (e.g. MongoDB, CouchDB): Data is a structured document (JSON/BSON). The store understands the internal structure.
  • Column-family stores (e.g. Cassandra, HBase): Data grouped by columns rather than rows, optimized for reading specific attributes across many records.

These stores excel at reading/writing a single aggregate, but struggle when querying across aggregates or following relationships between them. That’s exactly where graph databases shine.

The fourth NoSQL type is graph databases themselves, which are the opposite: They optimize for relationships rather than aggregates.

Relational Databases and Relationships

Ironically, relational databases deal poorly with relationships. “Relations” only exist as a means of joining tables via foreign keys: They can’t carry semantics, weight, or properties of their own.

Problems:

  • Join tables cause accidental complexity:
    • The relational model itself doesn’t allow relationships between tables to have properties. Junction tables must be used in this case. However, they are usually sparse and large for graph databases’ use cases.
    • Junction tables mix both business and metadata.
    • Example: A simple “Alice is friends with Bob” requires a dedicated table with only foreign keys.
  • Queries degrade with depth due to increased joins:
    • “Bob’s friends” is cheap (1 join).
    • “Friends-of-friends” needs 4 joins.
    • At 4-6 degrees of depth, cost grows exponentially.
  • Reciprocal queries are asymmetric:
    • Some types of queries are cheap, some are not.
    • Example: “What did customer X buy?” is cheap, but “which customers bought product Y?” scans the entire join table.
  • Schema is rigid and brittle: Sparse nullable columns and special-case code are required to handle ad-hoc connections. Migrations are costly as the application evolves.

NoSQL Aggregate Stores and Relationships

Most NoSQL aggregate stores do not store relationships. A well-known workaround is to have a field in one aggregate store another aggregate’s id.

Aggregate stores fake relationships by embedding foreign aggregate IDs inside fields (e.g. order: 1234 inside a user document). This looks graph-like but has fundamental problems:

  • Relationships aren’t first-class:
    • The application must build and maintain connections manually, the database is agnostic of the relationships, so relationship-involving queries are not well-optimized.
    • ACID guarantees may not even hold: if the application fails to update/delete references in tandem, dangling references accumulate.
  • Forward-only links: References only point one way. “What did Alice order?” is easy (look up her order IDs). “Who bought product X?” requires a brute-force scan of the entire dataset (O(n)).
  • Denormalization is expensive: Adding backward links (e.g. friended_by) trades write latency and disk for read convenience, and each hop still requires an index lookup (O(log n) per hop) since aggregates have no locality.
  • Performance still degrades with depth: Immediate friends are feasible. Friends-of-friends or deeper traversals become impractical because every hop is a fake relationship resolved through index lookups, not direct pointers.

Performance Comparison

| Operation | Aggregate Store | Graph Database |
|---|---|---|
| “Bob’s friends” (forward lookup) | Index lookups per friend | O(1) per hop, follow direct pointers |
| “Who is friends with Bob?” (reverse) | Brute-force scan, O(n) | O(1) per hop, follow incoming edges |
| Friends-of-friends (depth 2+) | Impractical, each hop = index lookup | O(1) per hop, stays constant |

The core difference: Graph databases provide index-free adjacency (nodes physically point to neighbors). Aggregate stores must resolve every relationship through index lookups or full scans. Many systems try to simulate graph-like processing on aggregate stores, but end up doing it in batch rather than real-time.

Graph Databases and Relationships

In relational and aggregate stores, connections are implicit: The application infers and builds them. In a graph database, connected data is stored as connected data. Where there are connections in the domain, there are connections in the data.

Flexibility

A graph naturally accommodates semi-structured, densely connected domains. New nodes and relationship types can be added without breaking the existing network or migrating data. A social network can mix FRIEND_OF, COLLEAGUE_OF, BOSS_OF, MARRIED_TO, LOVES, and DISLIKES in the same graph with no schema changes.
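
A minimal sketch of what this additivity looks like in practice (labels and names are illustrative): introducing a brand-new relationship type is just another write, with no migration or schema change.

// MARRIED_TO did not exist anywhere in the graph before this statement.
MATCH (a:Person {name: 'Alice'}), (b:Person {name: 'Bob'})
MERGE (a)-[:MARRIED_TO]->(b)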

Performance

Benchmark from Neo4j in Action (Partner & Vukotic): Finding friends-of-friends in a social network of 1,000,000 people with ~50 friends each:

| Depth | RDBMS (s) | Neo4j (s) | Records returned |
|---|---|---|---|
| 2 | 0.016 | 0.01 | ~2,500 |
| 3 | 30.267 | 0.168 | ~110,000 |
| 4 | 1,543.505 | 1.359 | ~600,000 |
| 5 | Unfinished | 2.132 | ~800,000 |

At depth 2, both are fine. At depth 3, the RDBMS takes 30 seconds (unusable for real-time) while Neo4j takes 168ms. At depth 5, the RDBMS can’t finish at all. Both aggregate stores and relational databases fall short when managing connected data: Anything beyond a shallow traversal is slow because of index lookups. Graphs use index-free adjacency to keep traversal fast regardless of depth.
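
A hedged sketch of what the benchmarked query could look like in Cypher (the :Person label and :FRIEND relationship type are assumptions; the depth is simply the upper bound of the variable-length pattern):

// Friends-of-friends-of-friends: everyone within 3 hops of Alice.
MATCH (alice:Person {name: 'Alice'})-[:FRIEND*1..3]->(other:Person)
WHERE other <> alice
RETURN DISTINCT other.name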

Multi-Domain Graphs

Graphs are naturally multi-dimensional. Different domains can be joined into a single graph, enabling queries that cross domain boundaries, for example:

  • Purchase history: User -[PLACED]-> Order -[CONTAINS]-> Product. A MOST_RECENT linked list supports browsing order history.
  • Geospatial data: Modeled using R-Trees, graph-like indexes that describe bounding boxes around geographies, forming overlapping location hierarchies (e.g. postal code -> district -> city -> country). This allows “people who live nearby” queries.
  • Recommendations: Combine all of the above. “Find ice cream flavors liked by people who live near user X, enjoy espresso, but dislike Brussels sprouts” is a single graph traversal joining purchase patterns, social connections, and geospatial data.
  • Fraud detection: Patterns outside the norm for a given user are easily detected by comparing against graph-wide buying patterns.

These pattern-matching queries are extremely difficult in SQL, laborious against aggregate stores, and perform poorly in both. Graph databases are optimized for exactly this kind of traversal, providing millisecond responses in many cases.
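
To give a feel for such a traversal, here is a rough Cypher sketch of the ice-cream example (every label, relationship type, and property name below is a hypothetical modeling choice):

// Flavors liked by people who live near Alice and like espresso,
// excluding anyone who likes Brussels sprouts.
MATCH (alice:Person {name: 'Alice'})-[:LIVES_IN]->(:Area)<-[:LIVES_IN]-(neighbor:Person),
      (neighbor)-[:LIKES]->(:Product {name: 'Espresso'}),
      (neighbor)-[:LIKES]->(flavor:IceCreamFlavor)
WHERE NOT (neighbor)-[:LIKES]->(:Product {name: 'Brussels Sprouts'})
RETURN DISTINCT flavor.name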

Neo4j Cypher Query Language

Cypher is a query language that is:

  • Expressive yet compact, with syntax close to how we draw graphs on whiteboards.
  • Declarative & pattern-matching (like tree query in tree-sitter).
  • Specific to Neo4j.

Cypher describes graphs using ASCII art: (a)-[:KNOWS]->(b) reads as “a KNOWS b”.

There are some alternatives to Cypher:

  • Many (including Neo4j) support SPARQL (RDF query language).

    Remark: So there’s an adapter or something?

  • Gremlin (imperative, path-based).

Philosophy

The main priority is human readability, for both technical and non-technical users.

  • Cypher lets users describe what to find, not how to find it.

  • A Cypher query anchors part of the pattern to starting locations in the graph (typically found via an index), then flexes the unanchored parts to find local matches.

    In other words, Cypher first locates the starting nodes and then navigates outward from there.

Syntax Basics

Cypher uses ASCII art to represent graph patterns. Parentheses for nodes, square brackets for relationships, curly braces for properties:

| Pattern | Syntax | Example |
|---|---|---|
| Any node (anonymous) | () | (a)<-[:VENUE]-() matches any node connected to a via VENUE, without binding it to a variable. Use when you don’t need the node’s details elsewhere in the query. |
| Node (bound to variable) | (identifier) | (a) |
| Node with properties | ({key: value}) | (a {name: 'Alice'}) |
| Node with label | (:Label) | (a:Person) |
| Relationship (directed) | -[:TYPE]-> | (a)-[:KNOWS]->(b) |
| Relationship (undirected) | -[:TYPE]- | (a)-[:KNOWS]-(b) |
| Relationship with properties | -[:TYPE {key: val}]-> | (a)-[:KNOWS {since: 2020}]->(b) |
| Relationship type OR | -[:TYPE1\|TYPE2]-> | (a)-[:STREET\|CITY]->(b) matches either STREET or CITY |
| Variable-length path | -[:TYPE*min..max]-> | (a)-[:KNOWS*1..3]->(b) matches paths 1 to 3 hops long |
| Combined OR + variable-length | -[:T1\|T2*1..N]-> | (a)-[:STREET\|CITY*1..2]->(b) either type, 1 to 2 hops |
| Bind relationship to variable | -[r:TYPE]-> | (a)-[w:WROTE]->(b) then access w.year |
| Property access | node.property | a.name |
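
A small sketch combining several rows of the table (the data is illustrative): an anchored node, a relationship bound to a variable, property access, and a variable-length path.

// One-hop pattern with the relationship bound to a variable.
MATCH (a {name: 'Alice'})-[k:KNOWS]->(b)
RETURN b.name, k.since

// Variable-length form: everyone reachable within two KNOWS hops.
MATCH (a {name: 'Alice'})-[:KNOWS*1..2]->(b)
RETURN DISTINCT b.name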

Syntax & Examples

graph LR
    a -->|KNOWS| b
    a -->|KNOWS| c
    b -->|KNOWS| c

The equivalent ASCII art in Cypher:

(a)-[:KNOWS]->(b)-[:KNOWS]->(c), (a)-[:KNOWS]->(c)

Remark: Not quite sure, the book’s graph direction is reversed, I’ve fixed that to match the ASCII graph. But I don’t know if I’m correct.

Since a query is one-dimensional text but graphs are two-dimensional, the pattern is split into comma-separated subpatterns.

A full query anchors the pattern to a starting node via index lookup:

START a=node:user(name='Michael')
MATCH (a)-[:KNOWS]->(b)-[:KNOWS]->(c), (a)-[:KNOWS]->(c)
RETURN b, c

The simplest Cypher queries consist of START (find anchor points), MATCH (draw the pattern), and RETURN (specify output).

Clauses Overview

Clause                 | Purpose                                  | Category
START                  | Find starting nodes via index lookup     | Read
MATCH                  | Draw a pattern to find in the graph      | Read
RETURN                 | Specify what to return to the client     | Read
WHERE                  | Filter pattern matching results          | Read
WITH                   | Chain query parts (pipe results)         | Read
CREATE / CREATE UNIQUE | Create nodes and relationships           | Write
SET                    | Set property values                      | Write
DELETE                 | Remove nodes, relationships, properties  | Write
FOREACH                | Run an update for each element in a list | Write
UNION                  | Merge results from multiple queries      | Read

Assume the following graph exists for all examples:

CREATE (alice {name:'Alice', age:30}),
       (bob {name:'Bob', age:25}),
       (charlie {name:'Charlie', age:35}),
       (alice)-[:KNOWS {since:2018}]->(bob),
       (bob)-[:KNOWS {since:2020}]->(charlie),
       (alice)-[:KNOWS {since:2019}]->(charlie)
  • This CREATE does two things in one statement: Creates nodes (with properties), and relates them (with relationship properties).
  • It runs in a single transaction: Either the full graph is created, or none of it is (ACID).
  • Identifiers like alice, bob, charlie are temporary, scoped to this query only.
  • Within the query, they can be reused to attach relationships. Once the query ends, they’re gone.
  • To give nodes long-lived names, add them to an index.
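
As a side note (not from the book), here is a minimal sketch of how such queries could be issued from application code using the official neo4j Python driver. The connection URI and credentials are assumptions, and the query uses modern MATCH-based syntax rather than the legacy START shown in these notes.

from neo4j import GraphDatabase

# Assumption: a local Neo4j instance at bolt://localhost:7687 with placeholder credentials.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    result = session.run(
        "MATCH (a {name: $name})-[:KNOWS]->(friend) "
        "RETURN friend.name AS name, friend.age AS age",
        name="Alice",
    )
    for record in result:
        print(record["name"], record["age"])

driver.close()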

START Clause

Cypher queries always begin from one or more well-known starting points in the graph, called bound nodes.

START provides these by:

  • Looking them up from an index.
  • Specifying node IDs directly.

These bound nodes become the anchor points from which the rest of the graph is explored.

The index lookup syntax is node:indexName(key='value'):

  • node - Look up a node (vs a relationship).
  • indexName - The name of the index to search (e.g. user, venue, city). Different node types go in different indexes.
  • (key='value') - The property to match.

Note: This is legacy index syntax from Neo4j 1.x/2.x. Modern Neo4j uses schema indexes with MATCH (n:Label {prop: value}) instead.

START a=node:user(name='Alice'),
      b=node:user(name='Bob')
MATCH (a)-[r:KNOWS]->(b)
RETURN r.since

Output: 2018

This query starts from Alice and Bob, matches the relationship KNOWS between them, then extracts the since property from the relationship.

START scopes the query to a local region of the graph. Without it, a pattern could match anywhere. You can bind from different indexes in one START:

START theater=node:venue(name='Theatre Royal'),
      city=node:city(name='Newcastle'),
      author=node:author(lastname='Shakespeare')

MATCH Clause

Draws an ASCII art pattern to find in the graph. () are nodes, -[:TYPE]-> are relationships. An empty () means “any node, I don’t care about binding it to a variable”.

Bound identifiers from START anchor the pattern. Unbound identifiers (friend below) flex to match all possibilities.

START a=node:user(name='Alice')
MATCH (a)-[:KNOWS]->(friend)-[:KNOWS]->(fof)
WHERE fof <> a
RETURN friend.name AS friend, fof.name AS friend_of_friend

Output:

friend | friend_of_friend
“Bob”  | “Charlie”

a is anchored to Alice. MATCH finds all nodes one hop away (friend), then another hop (fof). The pattern can match multiple times, producing one row per match.

RETURN Clause

Processes matched data before returning it. Supports:

  • Aliasing: RETURN friend.name AS name.
  • DISTINCT: Deduplicates when a node matches via multiple paths.
  • Aggregation: count(), collect(), sum(), avg(), etc.
  • Ordering: ORDER BY ... ASC/DESC.
  • Limiting: LIMIT N, SKIP N.

Simple return with aliasing and ordering:

START a=node:user(name='Alice')
MATCH (a)-[:KNOWS]->(friend)
RETURN friend.name AS name, friend.age AS age
ORDER BY age DESC

Output:

name      | age
“Charlie” | 35
“Bob”     | 25

Aggregation: Bind a relationship to a variable in MATCH, then count it in RETURN:

START a=node:user(name='Alice')
MATCH (a)-[r:KNOWS]->()
RETURN count(r) AS friends_count

Output: 2

WHERE Clause

Filters matched subgraphs. Where MATCH describes structure, WHERE tunes the results by checking:

  • Certain paths must be present (or absent).
  • Specific properties must exist (or not), regardless of value.
  • Properties must have specific values.
  • Arbitrary expression predicates.
START a=node:user(name='Alice')
MATCH (a)-[:KNOWS]->(friend)
WHERE friend.age > 28
RETURN friend.name

Output:

friend.name
“Charlie”

CREATE / CREATE UNIQUE Clause

Creates nodes and relationships.

  • CREATE: Always adds to the graph. Use when duplication is acceptable.
  • CREATE UNIQUE: Ensures a subgraph structure exists. Only creates missing parts. Use when duplication is not permitted by the domain.
START alice=node:user(name='Alice')
CREATE (dave {name:'Dave', age:28})-[:KNOWS]->(alice)

Output: 1 node created, 1 relationship created.

START alice=node:user(name='Alice')
CREATE UNIQUE (alice)-[:KNOWS]->(bob {name:'Bob'})

Output: If Alice already KNOWS a Bob, nothing is created. Otherwise, 1 node and 1 relationship created.

SET Clause

Sets property values on existing nodes/relationships.

START a=node:user(name='Alice')
SET a.age = 31
RETURN a.age

Output: 31

DELETE Clause

Removes nodes, relationships, and properties. Must delete relationships before their nodes.

START a=node:user(name='Bob')
MATCH (a)-[r]-()
DELETE r, a

Output: 1 node deleted, 2 relationships deleted.

WITH

Sometimes a single MATCH isn’t enough. WITH chains query parts, piping results from one to the next (like Unix pipes). It also enforces discipline by separating read-only clauses from write-centric SET operations.

START a=node:user(name='Alice')
MATCH (a)-[k:KNOWS]->(friend)
WITH friend
ORDER BY friend.age DESC
RETURN collect(friend.name) AS friends

The first part matches and orders friends by age. WITH pipes the ordered results into collect(), which produces a list.

Output:

friends
[“Charlie”, “Bob”]

FOREACH

Runs an update for each element in a list.

FOREACH (name IN ['Dave','Eve'] |
  CREATE ({name: name}))

Output: 2 nodes created.

UNION

Merges results from two or more queries. Column names must match.

START a=node:user(name='Alice')
MATCH (a)-[:KNOWS]->(f)
RETURN f.name AS name
UNION
START a=node:user(name='Bob')
MATCH (a)-[:KNOWS]->(f)
RETURN f.name AS name

Output:

name
“Bob”
“Charlie”

(UNION deduplicates. Use UNION ALL to keep duplicates.)

These clauses are intentionally familiar to SQL developers, but different enough to emphasize that we’re dealing with graphs, not relational sets.

Graph Data Modeling

This document covers:

  • How to solve a real-world problem by using graph modeling techniques.
  • Comparing relational and graph modeling techniques.
  • Guidelines for structuring nodes and relationships & common pitfalls.
  • How to evolve a graph over time.

Modeling

This section records my own remarks; it does not follow the book.

Modeling is a common activity in science: The world is messy and we can’t possibly predict every behavior of it. Modeling is a simplification process where we selectively expose certain useful aspects of the world to easily study it.

With a good model, we can predict the behavior of the world surprisingly precisely.

Note that a model is just an approximation of the world. So, a model can break down if it tries to extrapolate too far from its original assumptions. That doesn’t make a model useless, however. The sole role of the model is to enable accurate-enough reasoning about the world within some known set of constraints.

Graph Modeling as a Data Modeling Technique

A graph model is a tool for modeling the world; it allows useful queries about that world.

Compared to relational data modeling techniques, the graph data model’s distinguishing point is the close affinity between its logical and physical models.

Relational techniques force us to deviate from our natural representation of the domain through multiple transformations, each introducing semantic dissonance:

The Relational Path for Modeling

Here are the typical steps for relational modeling:

  1. Whiteboard sketch:
    • Understand entities, how they interrelate, rules governing state transitions.
    • Done informally with domain experts.
    • The result is already a graph.
  2. E-R diagram:
    • A more rigorous form of the whiteboard sketch.
    • However, E-R diagrams only allow single, undirected relationships.
    • A poor fit for domains where relationships are numerous and semantically diverse.
  3. Normalize into tables:
    • Map E-R diagram into tables and relations.
    • Even the simplest case introduces accidental complexity.
    • Foreign key constraints (for 1:N) and join tables (for M:N) clutter the model with metadata that serves the database, not the user.
  4. Denormalize for performance:
    • Normalized models are generally not fast enough for production.
    • Duplicate data and abandon domain fidelity to suit the database engine.
    • Requires RDBMS expertise and accepts substantial data redundancy.
  5. Migration:
    • Introducing structural change is slow, risky, and expensive (weeks/months with downtime).
    • Unlike code refactoring (seconds/minutes), database refactoring is a heavyweight operation.
    • The denormalized model resists rapid evolution.

The result:

  • A gulf between the conceptual world and the physical data layout.
  • Business stakeholders can’t collaborate past the relational threshold.
  • Changed business requirements lag behind because translating them into entrenched relational structures is difficult.
  • Failed migrations risk data integrity.

The Graph Path for Modeling

  1. Whiteboard sketch:
    • Same as relational. Understand entities and their interrelations with domain experts.
  2. Enrich the graph:
    • Instead of transforming into tables, enrich the whiteboard sketch.
    • Add properties to nodes and named, directed relationships.
    • The model is a purposeful abstraction attuned to the application’s data needs.
  3. Store directly:
    • No normalization, no denormalization, no join tables.
    • What you sketch is what you store.
    • Domain modeling is isomorphic to graph modeling.

Testing the Model

Once the domain model is refined, test it before building the application. Bad design decisions baked in early are harder to fix later. Two techniques:

1. Read the Graph Aloud

Pick a start node, follow relationships, read each node’s role and relationship name. It should form sensible sentences:

  • “Alice wrote Post X, which has Comment Y, which Bob authored”
  • “Post X is tagged with Topic Z, which Category W contains”

If it reads well, the model is faithful to the domain.

2. Design for Queryability

Write the queries you expect to run and verify the graph supports them. This requires understanding end users’ goals.

Example: In a blog platform, find all posts by authors that a user follows:

START user=node:users(name = 'Alice')
MATCH (user)-[:FOLLOWS]->(author)-[:WROTE]->(post)
RETURN author.name, post.title
ORDER BY post.date DESC

If such a query is readily supported by the graph, the design is fit for purpose. If not, the model needs restructuring before any code is written.

Cross-Domain Models

Interesting remarks:

  • Relationships both partition a graph into separate domains and connect them.
  • Shared nodes eliminate data duplication across domains.
  • Traversal crosses domain boundaries seamlessly.
  • Scaling is additive (more nodes and relationships, no schema changes).

Example: A blog platform with three domains (content, social, geospatial):

graph TD
    Alice -->|WROTE| Post1[Post: 'Graph DBs']
    Alice -->|FOLLOWS| Bob
    Bob -->|WROTE| Post2[Post: 'Neo4j Tips']
    Post1 -->|TAGGED| GraphDB[Topic: 'Graph DB']
    Post2 -->|TAGGED| GraphDB
    Alice -->|LIVES_IN| London[City: 'London']
    Bob -->|LIVES_IN| London
    London -->|IN| UK[Country: 'UK']

London and GraphDB are shared nodes: They participate in multiple domains without duplication. A single traversal can answer “find posts about Graph DB written by people who live near me.”

Node Labels

Relationships establish semantic context, but are a weak indicator of what a node is. In the graph above, following WROTE tells you the end node is a post, but nothing on the node itself says so. Labels fix this:

graph LR
    A["Alice:User"] -->|WROTE| P["Post1:Post"]
    P -->|TAGGED| T["GraphDB:Topic"]
    A -->|LIVES_IN| L["London:City"]

Labels attach one or more type tags to a node as first-class citizens: (alice:User:Customer). Queries can filter by label (MATCH (n:User)), and labels can carry constraints (e.g. uniqueness).

Common Modeling Pitfalls

Expressivity is no guarantee a graph is fit for purpose.

Don’t Encode Entities as Relationships

The most common mistake: Folding a noun into a verb. Everyday language encourages “Alice reviewed the restaurant” which loses the review entity.

Bad model:

graph LR
    Alice -->|REVIEWED| Restaurant
    Alice -->|RATED| Restaurant

This tells us Alice reviewed the restaurant, but we can’t see:

  • The review text or the rating value.
  • When the review was written.
  • Whether the REVIEWED and RATED refer to the same visit.

Even adding properties to REVIEWED doesn’t help: You still can’t correlate it with the RATED relationship.

The root cause: English shortens “Alice wrote a review of the restaurant” into “Alice reviewed the restaurant”. The noun (review) disappears, folded into a relationship.

Good model:

graph TD
    Alice -->|WROTE| Review
    Review -->|OF| Restaurant

Now the review is a first-class node with its own properties (text, rating, date). Multiple reviews by the same user or of the same restaurant are distinct nodes.

Once entities are modeled as nodes, powerful queries become possible. For example, “find all restaurants that Alice rated higher than 3”:

START alice=node:user(name='Alice')
MATCH (alice)-[:WROTE]->(review)-[:OF]->(restaurant)
WHERE review.rating > 3
RETURN restaurant.name, review.rating

This query only works because the review is a node with its own properties (rating, text, date). With the bad model (REVIEWED as a relationship), filtering by rating or correlating review details would be impossible.

Don’t Conflate Entities and Relationships

Use relationships to convey how things are related. Domain entities aren’t always obvious from everyday language: Think carefully about the nouns. If you find yourself wanting to attach properties to a relationship or connect a relationship to more than two nodes, it should probably be a node.

Don’t Optimize Writes at the Cost of Model Fidelity

Trust the graph database to handle performance. Model according to the questions you want to ask. Graph databases maintain fast query times even when storing vast amounts of fine-grained data.

Domain Evolution

Migrations in graph databases are simpler than in relational databases:

  • Adding new relationship types (e.g. REPLY_TO, FORWARD_OF): Completely safe. Existing queries don’t know about them, so nothing breaks.
  • Adding new nodes: Safe. Extends the graph without affecting existing structure.
  • Changing existing relationship types or node properties: Might affect existing queries. Run a representative set of queries to verify.

These are the same operations performed during normal database usage, so migration in a graph world is just business as usual.

The Same Pitfall Applies to Evolution

When adding new features, the “entities as relationships” pitfall recurs. For example, adding a “reply” feature to a review platform:

Bad:

(bob)-[:REPLIED_TO]->(review)

Lossy: Can’t see what Bob actually said, or distinguish between multiple replies.

Good: A reply is itself a new node:

graph TD
    Alice -->|WROTE| R1[Review]
    R1 -->|OF| Restaurant
    Bob -->|WROTE| Reply1[Reply 1]
    Reply1 -->|REPLY_TO| R1
    Alice -->|WROTE| Reply2[Reply 2]
    Reply2 -->|REPLY_TO| Reply1

This enables queries like “find the full reply chain for a review”:

START review = node:review(id = '1')
MATCH p=(review)<-[:REPLY_TO*1..4]-()<-[:WROTE]-(replier)
RETURN replier.name AS replier, length(p) - 1 AS depth
ORDER BY depth

Neo4j Internals: Native Graph Processing

Part of the Neo4j internals series: Native Graph Processing | Native Graph Storage | Transactions, Availability, Scale

These sections peek under the hood of graph database implementation, using Neo4j as the reference. Neo4j has both native graph processing and native graph storage, and is open source.

A graph database has native graph processing if it uses index-free adjacency: Each node maintains direct references to its adjacent nodes, acting as a micro-index of nearby nodes.

Index-Free Adjacency vs Global Indexes

                              | Native (index-free adjacency)          | Non-native (global indexes)
“Alice’s friends”             | O(1) per hop, follow pointers          | O(log n) per hop, index lookup
“Who is friends with Alice?”  | O(1) per hop, follow incoming pointers | O(m log n), scan index for each potential friend
m-step traversal              | O(m), independent of graph size        | O(m log n), degrades with graph size

With index-free adjacency, bidirectional joins are effectively precomputed and stored as relationships. No index lookup needed. This is why graph databases can traverse in either direction (tail to head, or head to tail) equally cheaply.

Index lookups are fine for small networks, but too costly for complex queries over larger graphs. Graph databases make relationships concrete and navigable at query time instead of relying on index indirection.
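
A toy illustration of the difference (a minimal sketch, not Neo4j’s implementation): with index-free adjacency, each node record carries its own adjacency lists, so each hop is a direct reference rather than an index lookup.

# Toy property graph with index-free adjacency: each node carries direct
# references (lists of node ids) to its neighbours, a "micro-index" of nearby nodes.
nodes = {
    "alice": {"out": ["bob", "charlie"], "in": []},
    "bob": {"out": ["charlie"], "in": ["alice"]},
    "charlie": {"out": [], "in": ["alice", "bob"]},
}

def neighbours(node_id, direction="out"):
    # O(1) access per hop: no global index is consulted.
    return nodes[node_id][direction]

print(neighbours("alice"))          # Alice's friends: ['bob', 'charlie']
print(neighbours("charlie", "in"))  # Who is friends with Charlie: ['alice', 'bob']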

Neo4j Internals: Native Graph Storage

Part of the Neo4j internals series: Native Graph Processing | Native Graph Storage | Transactions, Availability, Scale

  • Neo4j stores graph data in a number of different store files.
  • Each store file contains the data for a specific part of the graph (nodes, relationships, properties).
  • This division of storage responsibilities separates graph structure from property data, similar to columnar storage or ECS (Entity Component System) in game engines: Group data by concern, not by entity, so the hot path (traversal) doesn’t waste I/O loading data it’s just passing through.

Architecture

graph TD
    API["Traversal API / Core API / Cypher"] --> OC[Object Cache]
    OC --> FSC[File System Cache]
    API --> TM[Transaction Management]
    FSC --> RF[Record Files]
    TM --> TL[Transaction Log]
    RF --> D[Disks]
    TL --> D

Store Files

Neo4j maintains multiple store files. The reason for splitting them:

  • Traversal only needs to know which nodes are connected, not what data they hold.
  • By splitting structure (nodes, relationships) from content (properties), traversal touches only the small, fixed-size structure records and skips property data entirely until explicitly requested.

All stores use fixed-size records so any record’s location can be computed by ID * record_size. This makes lookups O(1).

  • Node store: Where each node lives, and entry points to its relationships and properties.
  • Relationship store: How nodes are connected, and the linked list pointers for traversal.
  • Property store: The actual data (key-value pairs) attached to nodes and relationships. Only loaded when needed.

Node Store (neostore.nodestore.db)

Each node record acts as an “address book”: It tells you where to find the node’s relationships and properties. It stores no actual user data, just pointers.

  • Fixed-size records (9 bytes).
  • Contains:
    • In-use flag (1 byte): Whether this record is active or reclaimable.
    • Pointer to first relationship (4 bytes): Entry point into the relationship chain (a linked list of all relationships connected to this node).
    • Pointer to first property (4 bytes): Entry point into the property chain (a linked list of key-value pairs for this node).
  • A node record stores no actual data. It’s just pointers to where the data lives (relationships and properties are in their own store files).

Relationship Store (neostore.relationshipstore.db)

This store holds the wiring between nodes. Each record knows which two nodes it connects and how to find the next relationship for both of them. It stores no user data, just structure.

  • Fixed-size records (33 bytes).
  • Contains:
    • Start node ID (4 bytes): The node at the start of this relationship.
    • End node ID (4 bytes): The node at the end of this relationship.
    • Relationship type pointer (4 bytes): Points to the relationship type store.
    • Next/prev relationship pointers for start node (4+4 bytes): Doubly linked list of all relationships visible from the start node.
    • Next/prev relationship pointers for end node (4+4 bytes): Same for end node.
    • Next property pointer (4 bytes): Entry point into this relationship’s property chain.
  • A relationship record “belongs” to both its start and end nodes. The doubly linked list pointers form the relationship chain: Iterating a node’s relationships means following this chain. Doubly linked allows traversal in either direction and efficient insert/delete.

Property Store (neostore.propertystore.db)

This is where the actual user data lives as key-value pairs. It is only touched when a query explicitly requests property values, not during traversal.

  • Fixed-size records with four property blocks each.
  • Singly linked list (pointer to next property record).
  • Property names are deduplicated via a separate property index file.
  • Small values (numbers, short strings) can be inlined directly in the property block, avoiding extra I/O.
  • Large values (long strings, arrays) go to separate dynamic stores.

How a Traversal Works

  1. Start at a node record. Follow its pointer to the first relationship in the chain.
  2. Compute the relationship’s byte offset: relationship_ID * 33. O(1) lookup.
  3. From the relationship record, read the other node’s ID.
  4. Compute that node’s byte offset: node_ID * 9. O(1) lookup.
  5. Repeat.

No index involved at any step. Just pointer arithmetic on fixed-size records.
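
A toy simulation of the offset arithmetic (a sketch only, not Neo4j’s on-disk format): because records are fixed-size, a record’s byte position is simply id * record_size.

# Fixed-size record stores: record location = id * RECORD_SIZE, so lookups are O(1).
NODE_RECORD_SIZE = 9
REL_RECORD_SIZE = 33

node_store = bytearray(10 * NODE_RECORD_SIZE)  # pretend store file
rel_store = bytearray(10 * REL_RECORD_SIZE)

def node_offset(node_id: int) -> int:
    return node_id * NODE_RECORD_SIZE          # no index needed

def rel_offset(rel_id: int) -> int:
    return rel_id * REL_RECORD_SIZE

def read_node_record(node_id: int) -> bytes:
    start = node_offset(node_id)
    return bytes(node_store[start:start + NODE_RECORD_SIZE])

print(node_offset(3))   # 27: byte position of node 3's record
print(rel_offset(2))    # 66: byte position of relationship 2's record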

Neo4j Internals: Transactions, Availability, Scale

Part of the Neo4j internals series: Native Graph Processing | Native Graph Storage | Transactions, Availability, Scale

Transactions

Neo4j supports full ACID transactions:

  • Every modification runs inside a transaction.
  • Write-ahead log ensures durability: Changes are logged before being applied to store files.
  • Transactions can span multiple nodes, relationships, and properties.
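
A minimal write-ahead-log sketch of the durability point above (an illustration only, not Neo4j’s log format): a change is appended and flushed to the log before the store is mutated, so a crash between the two steps can be repaired by replaying the log on restart.

import json

store = {}
log = open("tx.log", "a")

def apply_change(key, value):
    log.write(json.dumps({"key": key, "value": value}) + "\n")
    log.flush()              # durable in the log first...
    store[key] = value       # ...then applied to the "store file"

apply_change("alice.age", 31)
print(store)                 # {'alice.age': 31}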

Recoverability

  • Write-ahead log for crash recovery.
  • On restart, Neo4j replays transactions from the log that were committed but not yet applied to the store files.

Availability

  • Master-slave replication with automatic failover.
  • Slaves serve reads immediately. Writes go to master and propagate.
  • On master failure, a slave is elected as new master.

Scale

  • Reads: Scale horizontally by adding slaves.
  • Writes: Scale vertically (single master).
  • Cache sharding: Route requests to instances with relevant portions of the graph in memory.
  • Graph sharding (splitting the graph across machines): Remains an open research problem. Cutting a graph inevitably severs relationships, degrading traversal performance.

YAML 1.2

Link: https://yaml.org/spec/1.2.2/

Overview

YAML 1.2 has evolved to:

  • Be a strict superset of JSON.
  • Remove many obscure rules like implicit typing.

The general important goals of YAML:

  1. Be human-readable.
  2. Be portable across languages.
  3. Be easy to parse/implement/use.

Context: PyYAML is generally considered the reference implementation. LibYAML is generally used in other YAML frameworks.

Structure

To understand YAML, one should:

  • Understand its information model, that is, the abstract model of how YAML decides to organize data. (See Processes and Models)
  • Understand the translation of that information model to the YAML textual format.

Features Overview

This section lists all features of YAML, to motivate the YAML specification later.

Three basic building blocks: Scalars, Sequences, Mappings. (for more details check Scalars onwards)

Syntax

Block Syntax

  • Indentation is used for scope: Same indentation = Same list/hash.
  • A line is used for specifying an entry within the list/hash.

Entry syntax:

  • Block sequences use a dash and a space (- ) for each entry.
  • Block mappings use a colon and a space (: ) for each key/value pair.

Comments use #.

Example:

house_1:
  - person_1 # First entry of the first key/value pair
  - hrnax # Second entry of the first key/value pair

Example from the spec:

- name: Mark McGwire
  hr: 65
  avg: 0.278
- name: Sammy Sosa
  hr: 63
  avg: 0.288

For a more compact representation, see Compact Nested Mapping.

Flow Syntax

  • Indicators are used to denote scopes.
  • Punctuation is used to denote entries.

Entry syntax:

  • Flow sequences use a comma-separated list within square brackets.
  • Flow mappings use a comma-separated list of key/value pairs within curly braces.
flow mapping: { a: 1, b: 2 }

flow sequence:
  - [1, 2, 3]
  - [4, 5, 6]

Data Sharing

In data structures, it’s common for an object/list to be shared in more than one place.

YAML allows this by anchors (&) and aliases (*), kind of like C++ address-of and dereferencing operators.

Example from the spec:

hr:
  - Mark McGwire
  # Following node labeled SS
  - &SS Sammy Sosa
rbi:
  - *SS # Subsequent occurrence
  - Ken Griffey
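
Loading the spec example above with a YAML processor resolves the alias back to the anchored value (a minimal sketch using PyYAML, assuming it is installed):

import yaml

doc = """
hr:
  - Mark McGwire
  # Following node labeled SS
  - &SS Sammy Sosa
rbi:
  - *SS # Subsequent occurrence
  - Ken Griffey
"""
data = yaml.safe_load(doc)
print(data["rbi"][0])                   # Sammy Sosa
print(data["hr"][1] == data["rbi"][0])  # True: the alias reuses the anchored value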

Complex Mapping Keys

Sometimes mapping keys need to be more than simple scalars, e.g using a list as a key.

A question mark and space (? ) indicates a complex mapping key. Within a block collection, key/value pairs can start immediately following the dash, colon or question mark.

Example from the spec:

? - Detroit Tigers
  - Chicago cubs
: - 2001-07-23

[New York Yankees, Atlanta Braves]: [2001-07-02, 2001-08-12, 2001-08-14]

Compact Nested Mapping

It’s common to have a list of objects (e.g. a product catalog).

Within a block sequence, key/value pairs can start immediately after the dash, allowing compact representation of lists of mappings.

Example from the spec:

# Products purchased
- item: Super Hoop
  quantity: 1
- item: Basketball
  quantity: 4
- item: Big Shoes
  quantity: 1

Structures

YAML files can be made up of directives and document content.

  • Directives: instructions for the YAML processor, not part of the data. YAML 1.2 defines 2 directives: %YAML (version) and %TAG (tag shorthands).
  • Document content: the actual data described in the above section, such as scalars, sequences, and mappings.

Syntactically:

  • --- is used to separate directives from content, and signals start of a new document (even without directives, effectively just document separators).

    Example from the spec:

    # Ranking of 1998 home runs
    ---
    - Mark McGwire
    - Sammy Sosa
    - Ken Griffey
    
    # Team ranking
    ---
    - Chicago Cubs
    - St Louis Cardinals
    
  • ... is used to signal the end of a document without starting a new one.

    Example from the spec:

    ---
    time: 20:03:20
    player: Sammy Sosa
    action: strike (miss)
    ...
    ---
    time: 20:03:47
    player: Sammy Sosa
    action: grand slam
    ...
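
Multi-document streams map onto a processor’s load-all style API (a minimal PyYAML sketch, reusing the spec example above):

import yaml

stream = """\
---
time: 20:03:20
player: Sammy Sosa
action: strike (miss)
...
---
time: 20:03:47
player: Sammy Sosa
action: grand slam
...
"""
for doc in yaml.safe_load_all(stream):
    print(doc["action"])
# strike (miss)
# grand slam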
    

Scalars

Scalars are YAML’s atomic values: strings, numbers, booleans, etc. Unlike sequences and mappings, they hold a single value.

Scalar content can be written in two notations: block and flow.

Block Scalars

Block scalars are useful for multi-line strings where whitespace and newline control matters (e.g. embedded config, prose, ASCII art).

In block scalars, the base indentation is determined by the first non-empty content line. That indentation is stripped from all lines.

Block scalars have two styles:

  • Literal style (|): all line breaks are preserved as-is.
  • Folded style (>): newlines are folded into spaces, except for blank lines and lines more indented than the base, which preserve their newlines.

Examples:

  • In literal style (|), newlines are preserved.

    Example:

    newlines_preserved: |
      First line
      Second line
      Third line
    

    Interpreted: "First line\nSecond line\nThird line\n"

  • In folded style (>), newlines become spaces. Newlines are preserved for more-indented and blank lines.

    Example from the spec:

    description: >
      Mark McGwire's
      year was crippled
      by a knee injury.
    

    Interpreted: "Mark McGwire's year was crippled by a knee injury.\n"

    Example from the spec (preserved newlines for more-indented and blank lines):

    summary: >
      Sammy Sosa completed another
      fine season with great stats.
    
        63 Home Runs
        0.288 Batting Average
    
      What a year!
    

    Interpreted: "Sammy Sosa completed another fine season with great stats.\n\n 63 Home Runs\n 0.288 Batting Average\n\nWhat a year!\n"

  • A document’s root node can itself be a scalar; the block scalar indicator then follows --- directly (§9.1.3. Bare Documents):

    --- |
      entire document
      is one literal scalar
    
  • Indentation determines scope (§6.1. Indentation Spaces): a block scalar ends when indentation drops back to the parent’s level. Lines below the base indentation but above the parent’s level are invalid (§8.1.1. Block Scalar Headers):

    Example from the spec:

    name: Mark McGwire
    accomplishment: >
      Mark set a major league
      home run record in 1998.
    stats: |
      65 Home Runs
      0.278 Batting Average
    

Flow Scalars

Flow scalars are inline, useful for short strings, values with special characters, or when you want to stay on one line.

YAML’s flow scalars have 3 styles:

  • The plain style
  • Two quoted styles:
    • The double-quoted style provides escape sequences.
    • The single-quoted style is useful when escaping is not needed.

All flow scalars can span multiple lines. Line breaks are always folded.

  • Quoted scalars:

    Example from the spec:

    unicode: "Sosa did fine.\u263A"
    control: "\b1998\t1999\t2000\n"
    hex esc: "\x0d\x0a is \r\n"
    
    single: '"Howdy!" he cried.'
    quoted: " # Not a 'comment'."
    tie-fighter: '|\-*-/|'
    
  • Multi-line flow scalars:

    Example from the spec:

    plain: This unquoted scalar
      spans many lines.
    
    quoted: "So does this
      quoted scalar.\n"
    

Tags

Every YAML node has a tag that denotes its type (e.g !!str, !!int, !!seq).

For simplicity though, most nodes don’t specify one explicitly: they are untagged and the YAML processor resolves a tag automatically based on the active schema, which defines the set of available tags and how untagged nodes are resolved (e.g 123 -> !!int, true -> !!bool).

YAML 1.2 defines three built-in schemas:

  • Fail safe schema: the minimum every processor must support. Only !!str, !!seq, !!map.
  • JSON schema (recommended): adds !!null, !!bool, !!int, !!float.
  • Core schema (recommended): extends JSON schema with human-friendly notations (e.g octal 0o14, hex 0xC, ~ for null).

The examples in this specification generally use the seq, map and str types from the fail safe schema. A few examples also use the int, float and null types from the JSON schema.

  • Integers:

    Example from the spec:

    canonical: 12345
    decimal: +12345
    octal: 0o14
    hexadecimal: 0xC
    
  • Floating point:

    Example from the spec:

    canonical: 1.23015e+3
    exponential: 12.3015e+02
    fixed: 1230.15
    negative infinity: -.inf
    not a number: .nan
    
  • Miscellaneous:

    Example from the spec:

    null:
    booleans: [true, false]
    string: "012345"
    
  • Timestamps:

    Example from the spec:

    canonical: 2001-12-15T02:59:43.1Z
    iso8601: 2001-12-14t21:59:43.10-05:00
    spaced: 2001-12-14 21:59:43.10 -5
    date: 2002-12-14
    

Explicit Tags

Explicit typing is denoted with a tag using the exclamation point (!) symbol.

There are two kinds of tags:

  • Local tags:
    • Motivation: Sometimes tags only need to have meaning within the consuming application.
    • Syntax: Start with ! and are application-specific (e.g !circle).
    • Don’t need to be declared, just used inline.
  • Global tags:
    • Motivation: When interoperability matters, tags that are universally recognized across different processors are useful.
    • Syntax: full URIs that are universally unique (e.g tag:yaml.org,2002:str), standardized across all processors.

Since global tags are verbose, YAML provides tag handles (also called tag shorthands): short prefixes that expand to a URI prefix, declared via the %TAG directive.

Two handles are built-in:

  • !! defaults to tag:yaml.org,2002:, so !!str expands to tag:yaml.org,2002:str.
  • ! is the primary handle for local tags.

Custom handles can also be defined:

%TAG !e! tag:example.com,2002:
---
- !e!circle   # expands to tag:example.com,2002:circle
  radius: 7
  • Example from the spec:

    %TAG ! tag:clarkevans.com,2002:
    ---
    !shape
    # Use the ! handle for presenting
    # tag:clarkevans.com,2002:circle
    - !circle
      center: &ORIGIN { x: 73, y: 129 }
      radius: 7
    - !line
      start: *ORIGIN
      finish: { x: 89, y: 102 }
    - !label
      start: *ORIGIN
      color: 0xFFEEBB
      text: Pretty vector drawing.
    
  • Unordered sets:

    Example from the spec:

    # Sets are represented as a
    # Mapping where each key is
    # associated with a null value
    --- !!set
    ? Mark McGwire
    ? Sammy Sosa
    ? Ken Griffey
    
  • Ordered mappings:

    Example from the spec:

    # Ordered maps are represented as
    # A sequence of mappings, with
    # each mapping having one key
    --- !!omap
    - Mark McGwire: 65
    - Sammy Sosa: 63
    - Ken Griffey: 58
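
Returning to local tags: a hedged sketch of how an application can give them meaning, using PyYAML. The Circle class is hypothetical, and the !circle tag mirrors the spec example above (used here without the %TAG handle).

import yaml

class Circle:
    def __init__(self, center, radius):
        self.center, self.radius = center, radius

def construct_circle(loader, node):
    # Build a Circle from the mapping content of a !circle node.
    data = loader.construct_mapping(node, deep=True)
    return Circle(data["center"], data["radius"])

# Register the application-specific constructor for the local tag.
yaml.add_constructor("!circle", construct_circle, Loader=yaml.SafeLoader)

doc = """
- !circle
  center: {x: 73, y: 129}
  radius: 7
"""
shapes = yaml.safe_load(doc)
print(shapes[0].radius)  # 7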
    

Processes and Models

YAML serves two consumers: machines that process data, and humans that read it. To bridge these perspectives, YAML defines two complementary concepts:

  • YAML representations: an abstract data model (graph of typed nodes) that captures what the data is, independent of any textual format.
  • YAML stream: a concrete character stream for presenting those representations in a human-readable way.

A YAML processor (e.g PyYAML, libyaml) converts between these two views. It works on behalf of an application (e.g a config loader, a deployment tool): the processor handles YAML mechanics, the application decides what the data means.

Three stages

The conversion between representations and streams is broken into three stages:

  1. Representation: native data structures -> a directed graph of typed nodes.
  2. Serialization: the graph -> an ordered event tree (linearized for sequential output).
  3. Presentation: the event tree -> a human-readable character stream.

Note: “serialization” in YAML does not mean producing text. It means linearizing the graph into an ordered form. The actual text output is the presentation stage.

The event-based approach (decoupling structure from output via an event stream) is a common pattern in parser design. rust-analyzer uses a similar technique, where the parser emits events (start-node, token, end-node) to build a lossless syntax tree. The difference: YAML events discard presentation details, while rust-analyzer events preserve everything (whitespace, comments, errors).

Processing Overview (source: YAML 1.2 spec)

Processes

A processor need not expose all three stages. It may translate directly between native data structures and a character stream (dump and load). However, even when skipping stages, it should behave as if it went through all three. Native data structures should only depend on information in the representation (node kinds, tags, content), not on presentation or serialization details like key order, comments, or tag handles.

In practice, this means applications that rely on YAML comments or key ordering are operating outside the spec’s guarantees.
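
A minimal PyYAML sketch of this point (the keys are hypothetical): comments and quoting do not survive a load/dump round trip, and output key order is the processor’s choice (PyYAML sorts keys by default).

import yaml

text = """
# deployment settings (this comment is a presentation detail)
replicas: 3
image: "nginx"   # quoting is a presentation detail too
"""

data = yaml.safe_load(text)   # load: text -> native dict
print(data)                   # {'replicas': 3, 'image': 'nginx'}

print(yaml.dump(data))        # dump: native dict -> text
# image: nginx
# replicas: 3
# Comments are gone, quotes dropped, keys re-ordered (sort_keys=True by default).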

Dump

Dumping converts native data structures into a character stream: native data -> graph -> events -> text.

  1. Representation: Native data -> Abstract graph

    Native data structures are mapped to YAML’s abstract model: a directed graph of typed nodes.

    Three node kinds:

    • Sequence: an ordered series of entries (like arrays/lists).
    • Mapping: an unordered set of (key, value) pairs (like hash tables/dicts). Keys and values are themselves nodes.
    • Scalar: a leaf node (strings, integers, dates, etc).

    The result is a directed graph, not a tree, because nodes can be shared (via anchors/aliases, where multiple parents reference the same node). Each node also carries a tag specifying its data type. This simple model can represent any data structure independent of programming language.

  2. Serialization: Abstract graph -> Event/Serialization tree

    A character stream is sequential: one character after another. A graph with shared nodes and unordered keys cannot be written to a stream directly, so it must be linearized first.

    The serialization process resolves this by:

    • Imposing an ordering on mapping keys.
    • Replacing shared node references with placeholders called aliases.

    The result is an event tree: an ordered sequence of events (e.g start-mapping, scalar, end-sequence). YAML does not specify how key order or anchor names are chosen.

    These are called serialization details.

    The YAML processor should choose a sensible human-friendly key order and anchor names.

    The serialization tree is suitable for one-pass processing of YAML data.

  3. Presentation: Event tree -> Text

    The final stage formats the event tree as a human-readable character stream. This is where all stylistic choices are made: block vs flow, indentation, quoting, tag handles, directives, comments, etc. These are called presentation details.

    These details are all up to the preferences of the user & may require guidance.

Load

Loading is the inverse: text -> events -> graph -> native data. Each stage strips away the details added by its dump counterpart.

  1. Parsing: Text -> Event Tree

    Takes a character stream, produces an event tree.

    Discards presentation details (styles, indentation, comments).

    Can fail on ill-formed input.

  2. Composing: Event Tree -> Graph

    Reconstructs the representation graph from the event tree.

    Resolves aliases back into shared node references, discards serialization details (key order, anchor names).

    Can fail on unresolved aliases.

  3. Constructing: Graph -> Native data structure

    Converts the representation graph into native data structures (dicts, lists, strings, etc).

    Must only rely on information in the representation (node kinds, tags, content), not presentation or serialization details.

    Can fail if required native types are unavailable.

Information Models

The above section specifies the phases/procedures. This section specifies the interfaces: the data structures the phases agree upon.

As an analogy, in compiler construction, lexing and parsing are the processes, while tokens and ASTs are the information models.

Information Models (source: YAML 1.2 spec)

The diagram shows three models, each inheriting from the previous and adding new properties:

  • Representation Graph:
    • Tag (the data type):
      • Must have a name.
      • Only specific/explicit tags are allowed here.
    • Node: A graph node.
      • Sequence node: The node representing a sequence.
        • Contains ordered sequence of nodes.
      • Mapping node: The node representing a mapping.
        • Contains unordered key/value pairs.
        • Keys and values are both nodes.
      • Scalar node: The node representing a scalar.
        • Canonical form: The interpreted value of the scalar.
  • Serialization Tree (+): Inherit everything from Representation Graph and…
    • Tag:
      • (+) Add non-specific tag: Implicit tags.
    • (+) Alias node: To support serialization, alias nodes are introduced.
    • Node:
      • (+) Anchor: For aliasing.
      • Scalar node:
        • (+) Formatted content.
  • Presentation Stream (++):
    • (++) Directive: Instruction to the YAML processor.
      • (++) Name.
      • (++) Parameter.
    • (++) Comment.
    • Node:
      • (++) Style, spacing, etc.

Each layer’s additions (+, ++) are details that should not leak into other layers. Applications should not treat key order, comments, or indentation as meaningful data. Keeping these layers separate ensures YAML representations stay consistent and portable across programming environments.

Representation Graph

YAML represents native data as a rooted, connected, directed graph of tagged nodes:

  • Rooted: one starting node from which all others are reachable.
  • Connected: every node is reachable from the root.
  • Directed: edges have direction (parent points to child).

Other properties:

  • Cycles are allowed.
  • Nodes can have multiple incoming edges (shared via aliases).

Two categories: collections (nodes defined in terms of other nodes) and scalars (independent nodes).

Representation Model (source: YAML 1.2 spec)

Nodes

Each node has content of one of three kinds, plus a tag:

  • Scalar: an opaque datum presentable as zero or more Unicode characters.
  • Sequence: an ordered series of zero or more nodes. May contain the same node more than once, or even itself.
  • Mapping: an unordered set of key/value node pairs. Keys must be unique.
Tags

Tags represent type information & meta information about a node:

  • Global tags are URIs, globally unique. The tag: URI scheme is recommended.
  • Local tags start with !, application-specific.

YAML provides a TAG directive to make tag notation less verbose.

Notes:

  • Tags that share the same URI prefix are not related in any special way.
  • By convention, / is used for namespace hierarchies and # for variants, but each tag may use its own rules (e.g Perl uses ::, Java uses .).

Each tag must specify:

  • The expected node kind (scalar, sequence, or mapping).
  • For scalar tags: a mechanism for converting formatted content to a canonical form (for equality testing).
  • Optionally:
    • Allowed content values for validation.
    • Tag resolution rules.
    • Other metadata applicable to all of the tag’s nodes.
Node Comparison

Mapping keys must be unique, so YAML needs a way to test node equality:

  • Canonical form: every scalar tag must specify how to produce a canonical form from formatted content. For example, 0o13 (octal) and 0xB (hex) both have canonical form 11.
  • Equality: two nodes are equal when they have the same tag and content. Scalars compare canonical forms character-by-character. Collections compare recursively.
  • Uniqueness: a mapping’s keys are unique if no two keys are equal.

Notes:

  • Tag equality can use simple character-by-character comparison (since processors can’t know every URI scheme’s equality rules). Tags must therefore be presented in canonical form. The tag: URI scheme also uses this comparison method.
  • If a node has itself as a descendant (via an alias, i.e circular reference), equality checking could loop forever. The spec leaves this to each processor to decide.
  • A processor may treat equal scalars as identical.

Serialization Tree

The representation graph can’t be consumed sequentially as-is: keys are unordered and nodes can be shared. The serialization tree resolves this, giving each node an ordered set of children.

When constructing native data structures from the serialization tree, key order and anchor names should not be used to preserve application data.

Serialization Model (source: YAML 1.2 spec)

Mapping Key Order
  • Mapping keys have no order in the representation. Serialization imposes one, but this is a serialization detail, not application data.
  • Where order matters, use a sequence instead (e.g an ordered mapping = a sequence of single-pair mappings).
Anchors and Aliases
  • A node can appear in multiple collections (shared node). To serialize this, the first occurrence gets an anchor (&name), and subsequent occurrences become alias nodes (*name) referring back to it.
  • Anchor names are serialization details, discarded after composing back to a graph.
  • An alias refers to the most recent event with that anchor name. So anchors don’t need to be unique within a serialization: if two nodes use &a, then *a refers to whichever &a came last.
  • An anchor need not have any alias referring to it (anchoring without aliasing is valid).
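
A small PyYAML sketch of anchor names being serialization details: after loading, the alias is just a shared object, and on re-serialization the processor invents its own anchor name.

import yaml

doc = """
base: &defaults {retries: 3}
copy: *defaults
"""
data = yaml.safe_load(doc)
print(data["base"] is data["copy"])   # True: the alias became a shared object
print(yaml.dump(data))
# base: &id001
#   retries: 3
# copy: *id001
# The original anchor name "defaults" is gone; the emitter picks its own (id001).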

Presentation Stream

The presentation stream is the final human-readable output: a stream of Unicode characters using styles, scalar formats, comments, directives, etc. A single stream can contain multiple documents separated by markers.

Presentation Model (source: YAML 1.2 spec)

Node Styles
  • Each node is presented in some style depending on its kind. Styles are presentation details, not reflected in the serialization tree or representation graph.
  • Two groups:
    • Block styles: use indentation to denote structure.
    • Flow styles: use explicit indicators.
  • Scalar styles: literal (|), folded (>), plain, single-quoted, double-quoted.

Kind/Style Combinations (source: YAML 1.2 spec)

Scalar Formats
  • Scalars can be presented in several formats (e.g integer 11 might also be written as 0xB). Tags must specify how to convert formatted content to canonical form.
  • Like node style, format is a presentation detail.
Comments
  • Comments are a presentation detail, they must not affect the serialization tree or representation graph.
  • Not associated with any particular node. Must not appear inside scalars, but may be interleaved with scalars inside collections.
Directives
  • Each document may have directives (a name and optional parameters). Like all presentation details, not reflected in the serialization tree or representation graph.
  • YAML 1.2 defines two: YAML and TAG.

Loading Failure Points

Loading native data structures from a YAML stream can fail at multiple points.

Loading Failure Points (source: YAML 1.2 spec)

Two levels of representation:

  • Partial representation: tags may be unresolved, canonical forms may be missing. Useful when type information is incomplete.
  • Complete representation: every node has a resolved tag and canonical form. Required for constructing native data structures.

Well-Formed Streams and Identified Aliases

  • Character stream must match the BNF productions defined in the spec. Every alias must refer to a previous node identified by an anchor.
  • Processors should reject malformed streams and unidentified aliases.
# Malformed stream: bad indentation
key: value
  bad: indent   # error: mapping values are not allowed here

# Unidentified alias: *unknown was never anchored
name: *unknown  # error: no anchor named "unknown"
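
With PyYAML, both cases surface as subclasses of yaml.YAMLError (a minimal sketch):

import yaml

malformed = "key: value\n  bad: indent\n"
unidentified = "name: *unknown\n"

for text in (malformed, unidentified):
    try:
        yaml.safe_load(text)
    except yaml.YAMLError as err:
        print(type(err).__name__, "-", err)
# ScannerError - mapping values are not allowed here ...
# ComposerError - found undefined alias 'unknown' ...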

Resolved Tags

  • Most tags are not explicit in the character stream.

  • During parsing, untagged nodes get a non-specific tag:

    • ! for non-plain scalars (quoted or block: '...', "...", |, >). These are always strings.
    • ? for everything else (plain scalars like 8080, true, hello, and all collections). These need further resolution.
  • Composing a complete representation requires resolving each non-specific tag to a specific one (global or local).

  • Tag resolution depends only on:

    • The node’s non-specific tag (? or !).
    • The path from root to that node (the chain of parents, i.e. structural position in the tree).
    • The node’s content and kind (scalar, sequence, or mapping).
  • Otherwise, tag resolution must not consider:

    • Presentation details (comments, indentation, styles).
    • Sibling node content.
    • Associated value/key content.
  • These restrictions ensure tag resolution can happen as soon as a node is first encountered, typically before its content is fully parsed. -> This makes one-pass resolution both possible and practical.

Resolution conventions:

  • ! non-specific tag should resolve to !!seq, !!map, or !!str depending on node kind (scalar, sequence, or mapping). This lets authors “disable” tag resolution by explicitly writing !, forcing a vanilla type.
  • ? non-specific tag is where application-specific rules apply:
    • Most commonly for plain scalars: matching content against regexes for integers, floats, timestamps, etc.
    • Applications may also match mapping keys to resolve structured types (e.g. points, complex numbers).
  • Processors should provide mechanisms for applications to override and expand these default rules.
  • Unresolved tags are not an error. If a document contains unresolved tags, the processor can still compose a partial representation: the graph structure is built (kinds, content, relationships are all known), but some nodes keep their non-specific tag (? or !) instead of a specific one like !!int or !!str. A complete representation requires every node to have a specific tag, so native data structures can’t be constructed from a partial one.
# Resolution by content (non-specific tag ?):
port: 8080       # ? + content "8080" -> !!int
host: localhost   # ? + content "localhost" -> !!str
debug: true      # ? + content "true" -> !!bool

# Resolution by non-specific tag (! always -> vanilla type):
name: 'Alice'    # ! -> !!str (quoted, regardless of content)
count: '8080'    # ! -> !!str (even though content looks like integer)

Path from root can also influence resolution. An application might use the parent chain to override default rules:

--- !server
port: 8080       # app knows path is !server -> port, so resolves to !!int
                 # even without checking content pattern
host: localhost

# But resolution must NOT look at siblings or associated nodes.
# e.g. resolving "port" cannot depend on what "host" contains.
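
In practice, this matches what a processor’s resolver does for these cases (a PyYAML sketch; note PyYAML implements YAML 1.1 resolution rules, which agree with 1.2 for these examples):

import yaml

print(type(yaml.safe_load("port: 8080")["port"]))    # <class 'int'>   (? -> !!int)
print(type(yaml.safe_load("debug: true")["debug"]))  # <class 'bool'>  (? -> !!bool)
print(type(yaml.safe_load("port: '8080'")["port"]))  # <class 'str'>   (! -> !!str)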

Recognized and Valid Tags

  • For a node to be fully processable, two conditions must hold:
    • The tag is recognized: the processor knows what the tag means (has rules for it).
    • The content is valid: the content matches what the tag expects (e.g. !!int expects integer-like content).
  • Unrecognized or invalid scalars only allow partial representation.
  • Unrecognized or invalid collections still allow complete representation (collection equality doesn’t depend on data type knowledge), but native data structures cannot be constructed.
# Unrecognized tag: processor doesn't know what !matrix means
transform: !matrix [[1,0],[0,1]]

# Invalid content: tag is recognized but content doesn't match
count: !!int "not a number"
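
With a safe loader, the unrecognized-tag case typically surfaces at construction time (a PyYAML sketch; only the first case above is shown):

import yaml

try:
    yaml.safe_load("transform: !matrix [[1, 0], [0, 1]]")
except yaml.YAMLError as err:
    print(type(err).__name__, "-", err)
# ConstructorError - could not determine a constructor for the tag '!matrix' ...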

Available Tags

  • The processing environment may lack native types for certain tags. When unavailable, native construction fails, but a complete representation (the graph of typed nodes) can still be composed. The application can work with it as a generic structure of (tag, content) pairs without native conversion.
# !!timestamp is a valid YAML tag, but not every language
# has a native timestamp type.
created: !!timestamp 2002-12-14
# Native construction (fails if no timestamp type):
{"created": datetime(2002, 12, 14)}

# Using complete representation directly (always works):
{"created": Node(tag="tag:yaml.org,2002:timestamp", content="2002-12-14")}
# The app knows it's a timestamp (from the tag) and has
# the raw content. It can format, pass along, or convert manually.

Design

Typedown YAML Schema & Processor

Typedown YAML Schema is just YAML.

  • A valid Typedown YAML document is a valid YAML document.
  • A YAML document using the features defined by the YAML 1.2.2 spec and the built-in “core” schema is processable by the Typedown YAML processor.

The Typedown YAML processor extends YAML’s built-in core schema with many more tags for better content management.

Typedown Abstract Model

Graph database.

Resources

Meta-resources

Properties

Labels