Graph Data Modeling

This document covers:

How to solve a real-world problem by using graph modeling techniques.
Comparing relational and graph modeling techniques.
Guidelines for structuring nodes and relationships & common pitfalls.
How to evolve graph over time.

Modeling

This notes my own remarks, not according to the book.

Modeling is a common activity in science: The world is messy and we can’t possibly predict every behavior of it. Modeling is a simplification process where we selectively expose certain useful aspects of the world to easily study it.

With a good model, we can predict surprisingly quite precisely the behavior of the world.

Note that a model is just an approximation of the world. So, a model can break down if it tries to extrapolate too far from its original assumptions. That doesn’t make a model useless, however. The sole role of the model is to enable accurate-enough reasoning about the world within some known set of constraints.

Graph Modeling as a Data Modeling Technique

Graph models are a tool for modeling the world. It allows useful queries about the world.

Compared to relational data modeling techniques, the graph data models’ distinguishing point is the close affinity between the logical and physical models.

Relational techniques force us to deviate from our natural representation of the domain through multiple transformations, each introducing semantic dissonance:

The Relational Path for Modeling

Here are the typical steps for relational modeling:

Whiteboard sketch:
- Understand entities, how they interrelate, rules governing state transitions.
- Done informally with domain experts.
- The result is already a graph.
E-R diagram:
- A more rigorous form of the whiteboard sketch.
- However, E-R diagrams only allow single, undirected relationships.
- A poor fit for domains where relationships are numerous and semantically diverse.
Normalize into tables:
- Map E-R diagram into tables and relations.
- Even the simplest case introduces accidental complexity.
- Foreign key constraints (for 1:N) and join tables (for M:N) clutter the model with metadata that serves the database, not the user.
Denormalize for performance:
- Normalized models are generally not fast enough for production.
- Duplicate data and abandon domain fidelity to suit the database engine.
- Requires RDBMS expertise and accepts substantial data redundancy.
Migration:
- Introducing structural change is slow, risky, and expensive (weeks/months with downtime).
- Unlike code refactoring (seconds/minutes), database refactoring is a heavyweight operation.
- The denormalized model resists rapid evolution.

The result:

A gulf between the conceptual world and the physical data layout.
Business stakeholders can’t collaborate past the relational threshold.
Changed business requirements lag behind because translating them into entrenched relational structures is difficult.
Failed migrations risk data integrity.

The Graph Path for Modeling

Whiteboard sketch:
- Same as relational. Understand entities and their interrelations with domain experts.
Enrich the graph:
- Instead of transforming into tables, enrich the whiteboard sketch.
- Add properties to nodes and named, directed relationships.
- The model is a purposeful abstraction attuned to the application’s data needs.
Store directly:
- No normalization, no denormalization, no join tables.
- What you sketch is what you store.
- Domain modeling is isomorphic to graph modeling.

Testing the Model

Once the domain model is refined, test it before building the application. Bad design decisions baked in early are harder to fix later. Two techniques:

1. Read the Graph Aloud

Pick a start node, follow relationships, read each node’s role and relationship name. It should form sensible sentences:

“Alice wrote Post X, which has Comment Y, which Bob authored”
“Post X is tagged with Topic Z, which Category W contains”

If it reads well, the model is faithful to the domain.

2. Design for Queryability

Write the queries you expect to run and verify the graph supports them. This requires understanding end users’ goals.

Example: In a blog platform, find all posts by authors that a user follows:

START user=node:users(name = 'Alice')
MATCH (user)-[:FOLLOWS]->(author)-[:WROTE]->(post)
RETURN author.name, post.title
ORDER BY post.date DESC

If such a query is readily supported by the graph, the design is fit for purpose. If not, the model needs restructuring before any code is written.

Cross-Domain Models

Interesting remarks:

Relationships both partition a graph into separate domains and connect them.
Shared nodes eliminate data duplication across domains.
Traversal crosses domain boundaries seamlessly.
Scaling is additive (more nodes and relationships, no schema changes).

Example: A blog platform with three domains (content, social, geospatial):

graph TD
    Alice -->|WROTE| Post1[Post: 'Graph DBs']
    Alice -->|FOLLOWS| Bob
    Bob -->|WROTE| Post2[Post: 'Neo4j Tips']
    Post1 -->|TAGGED| GraphDB[Topic: 'Graph DB']
    Post2 -->|TAGGED| GraphDB
    Alice -->|LIVES_IN| London[City: 'London']
    Bob -->|LIVES_IN| London
    London -->|IN| UK[Country: 'UK']

London and GraphDB are shared nodes: They participate in multiple domains without duplication. A single traversal can answer “find posts about Graph DB written by people who live near me.”

Node Labels

Relationships establish semantic context, but are a weak indicator of what a node is. In the graph above, following WROTE tells you the end node is a post, but nothing on the node itself says so. Labels fix this:

graph LR
    A["Alice:User"] -->|WROTE| P["Post1:Post"]
    P -->|TAGGED| T["GraphDB:Topic"]
    A -->|LIVES_IN| L["London:City"]

Labels attach one or more type tags to a node as first-class citizens: (alice:User:Customer). Queries can filter by label (MATCH (n:User)), and labels can carry constraints (e.g. uniqueness).

Common Modeling Pitfalls

Expressivity is no guarantee a graph is fit for purpose.

Don’t Encode Entities as Relationships

The most common mistake: Folding a noun into a verb. Everyday language encourages “Alice reviewed the restaurant” which loses the review entity.

Bad model:

graph LR
    Alice -->|REVIEWED| Restaurant
    Alice -->|RATED| Restaurant

This tells us Alice reviewed the restaurant, but we can’t see:

The review text or the rating value.
When the review was written.
Whether the REVIEWED and RATED refer to the same visit.

Even adding properties to REVIEWED doesn’t help: You still can’t correlate it with the RATED relationship.

The root cause: English shortens “Alice wrote a review of the restaurant” into “Alice reviewed the restaurant”. The noun (review) disappears, folded into a relationship.

Good model:

graph TD
    Alice -->|WROTE| Review
    Review -->|OF| Restaurant

Now the review is a first-class node with its own properties (text, rating, date). Multiple reviews by the same user or of the same restaurant are distinct nodes.

Once entities are modeled as nodes, powerful queries become possible. For example, “find all restaurants where Alice gave a higher rating than the average”:

START alice=node:user(name='Alice')
MATCH (alice)-[:WROTE]->(review)-[:OF]->(restaurant)
WHERE review.rating > 3
RETURN restaurant.name, review.rating

This query only works because the review is a node with its own properties (rating, text, date). With the bad model (REVIEWED as a relationship), filtering by rating or correlating review details would be impossible.

Don’t Conflate Entities and Relationships

Use relationships to convey how things are related. Domain entities aren’t always obvious from everyday language: Think carefully about the nouns. If you find yourself wanting to attach properties to a relationship or connect a relationship to more than two nodes, it should probably be a node.

Don’t Optimize Writes at the Cost of Model Fidelity

Trust the graph database to handle performance. Model according to the questions you want to ask. Graph databases maintain fast query times even when storing vast amounts of fine-grained data.

Domain Evolution

Migrations in graph databases are simpler than in relational databases:

Adding new relationship types (e.g. REPLY_TO, FORWARD_OF): Completely safe. Existing queries don’t know about them, so nothing breaks.
Adding new nodes: Safe. Extends the graph without affecting existing structure.
Changing existing relationship types or node properties: Might affect existing queries. Run a representative set of queries to verify.

These are the same operations performed during normal database usage, so migration in a graph world is just business as normal.

The Same Pitfall Applies to Evolution

When adding new features, the “entities as relationships” pitfall recurs. For example, adding reply/forward support:

For example, adding a “reply” feature to a review platform:

Bad:

(bob)-[:REPLIED_TO]->(review)

Lossy: Can’t see what Bob actually said, or distinguish between multiple replies.

Good: A reply is itself a new node:

graph TD
    Alice -->|WROTE| R1[Review]
    R1 -->|OF| Restaurant
    Bob -->|WROTE| Reply1[Reply 1]
    Reply1 -->|REPLY_TO| R1
    Alice -->|WROTE| Reply2[Reply 2]
    Reply2 -->|REPLY_TO| Reply1

This enables queries like “find the full reply chain for a review”:

START review = node:review(id = '1')
MATCH p=(review)<-[:REPLY_TO*1..4]-()<-[:WROTE]-(replier)
RETURN replier.name AS replier, length(p) - 1 AS depth
ORDER BY depth

Keyboard shortcuts

Typedown