
neo4j-expert

Expert knowledge of Neo4j graph modeling, Cypher query language, graph data science, Graphiti for AI agent memory, vector similarity search, APOC procedures, and schema design. Trigger phrases: when working with Neo4j, Cypher queries, graph database modeling,

MoltbotDen
Data & Analytics

Neo4j Expert

Neo4j excels at queries where the number of hops matters — relationship traversal is O(1) per hop regardless of total graph size, unlike SQL joins which scale with table size. The key insight is that Neo4j is optimized for local graph traversal, not global analytics. Cypher is a pattern-matching language: you describe the shape of the subgraph you want to find, and Neo4j finds all matching instances.

Core Mental Model

Everything in Neo4j is either a Node (entity, noun) or a Relationship (verb, always directional). Properties live on both. Labels categorize nodes; relationship types categorize edges. Think about your queries first: if you're frequently traversing "User → FOLLOWS → User → POSTED → Post", model that traversal path directly in your schema. Super-nodes (nodes with millions of relationships) are the equivalent of hot partitions — they create traversal bottlenecks and must be designed around.
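The node/relationship/property model can be sketched as a toy in-memory graph in Python (illustrative only — not the Neo4j API; all names here are hypothetical). Note how a hop reads only the start node's adjacency list, which is why per-hop cost is independent of total graph size:

```python
from collections import defaultdict

class ToyGraph:
    """Toy property graph: labeled nodes, typed directed relationships."""
    def __init__(self):
        self.nodes = {}               # node_id -> {"labels": set, "props": dict}
        self.out = defaultdict(list)  # node_id -> [(rel_type, props, target_id)]

    def add_node(self, node_id, labels, **props):
        self.nodes[node_id] = {"labels": set(labels), "props": props}

    def relate(self, src, rel_type, dst, **props):
        self.out[src].append((rel_type, props, dst))

    def traverse(self, start, rel_type):
        """One hop: cost depends on this node's degree, not on graph size."""
        return [dst for (t, _, dst) in self.out[start] if t == rel_type]

g = ToyGraph()
g.add_node("alice-1", ["User"], name="Alice")
g.add_node("bob-1", ["User"], name="Bob")
g.add_node("post-1", ["Post"], title="Graphs 101")
g.relate("alice-1", "FOLLOWS", "bob-1", weight=1.0)
g.relate("bob-1", "AUTHORED", "post-1")

# "User -> FOLLOWS -> User -> AUTHORED -> Post" as two local hops
followed = g.traverse("alice-1", "FOLLOWS")
posts = [p for u in followed for p in g.traverse(u, "AUTHORED")]
```

A super-node in this model is simply a node whose `out` list is huge: one hop from it must scan millions of entries, which is exactly the bottleneck described above.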


Graph Modeling

Core Design Principles

Nodes:         entities — Person, Product, Order, Article, Agent, Concept
Labels:        node type classification — :User, :Article, :Tag
Relationships: verbs between entities — FOLLOWS, PURCHASED, AUTHORED, MENTIONS
Properties:    attributes on nodes or relationships — name, created_at, weight

Rules:
  ✅ Model relationships as relationships (not as properties)
  ✅ Put temporal/contextual data as properties ON the relationship
  ✅ Use multiple labels to classify nodes (e.g. :Person:Author:Editor)
  ❌ Avoid super-nodes: a node with > 100K relationships is a performance bottleneck
  ❌ Don't model everything as nodes (use properties for simple attributes)

// Create nodes and relationships
CREATE (:User {id: "alice-1", name: "Alice", email: "[email protected]", created_at: datetime()})
CREATE (:User {id: "bob-1",   name: "Bob",   email: "[email protected]"})

// MERGE: create if not exists (idempotent)
MERGE (u:User {id: "alice-1"})
ON CREATE SET u.created_at = datetime(), u.status = "active"
ON MATCH  SET u.last_seen  = datetime()

// Create relationship with properties
MATCH (a:User {id: "alice-1"})
MATCH (b:User {id: "bob-1"})
MERGE (a)-[r:FOLLOWS {since: date(), weight: 1.0}]->(b)

// Variable length path: find all users Alice follows transitively (up to 3 hops)
MATCH (alice:User {id: "alice-1"})-[:FOLLOWS*1..3]->(user:User)
WHERE user.id <> "alice-1"
RETURN DISTINCT user.name, user.id

Cypher Deep Dive

// MATCH with multiple patterns in one statement
MATCH (user:User {id: $userId})-[:AUTHORED]->(post:Post)-[:TAGGED]->(tag:Tag)
WHERE post.published_at > datetime() - duration('P30D')
  AND tag.name IN ['AI', 'machine-learning', 'LLM']
RETURN post.title, collect(tag.name) AS tags, post.view_count
ORDER BY post.view_count DESC
LIMIT 20;

// WITH: pipeline stages (like SQL CTE)
MATCH (u:User)-[:FOLLOWS]->(followed:User)
WHERE u.id = $userId
WITH followed
MATCH (followed)-[:AUTHORED]->(post:Post)
WHERE post.published_at > datetime() - duration('P7D')
WITH post, followed
ORDER BY post.published_at DESC
RETURN post.title, post.id, followed.name AS author
LIMIT 50;

// UNWIND: expand a list into rows
WITH ['Alice', 'Bob', 'Carol'] AS names
UNWIND names AS name
MATCH (u:User {name: name})
RETURN u;

// COLLECT and aggregation
MATCH (u:User)-[:AUTHORED]->(p:Post)
RETURN u.name,
       count(p) AS post_count,
       collect(p.title)[0..5] AS recent_posts,
       avg(p.view_count) AS avg_views
ORDER BY post_count DESC;

// OPTIONAL MATCH: like LEFT JOIN (no result removed if pattern missing)
MATCH (u:User {id: $userId})
OPTIONAL MATCH (u)-[:AUTHORED]->(p:Post)
RETURN u.name, count(p) AS posts;  // returns the user even with zero posts

// WHERE on relationship properties
MATCH (a:User)-[r:FOLLOWS]->(b:User)
WHERE r.since > date() - duration('P30D')
RETURN a.name, b.name, r.since;

Index Types

// Range index (default for most property lookups)
CREATE INDEX user_id_idx FOR (u:User) ON (u.id);
CREATE INDEX post_published_idx FOR (p:Post) ON (p.published_at);

// Composite index
CREATE INDEX user_status_created FOR (u:User) ON (u.status, u.created_at);

// Text index (full-text search)
CREATE FULLTEXT INDEX post_fulltext FOR (p:Post) ON EACH [p.title, p.body];

// Use full-text index in query
CALL db.index.fulltext.queryNodes("post_fulltext", "machine learning transformers")
YIELD node, score
RETURN node.title, score
ORDER BY score DESC
LIMIT 10;
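The core idea behind a full-text index — an inverted index mapping tokens to the documents containing them, with per-document scoring — can be sketched in Python (a toy with made-up documents; the real index adds analyzers, stemming, and Lucene scoring):

```python
from collections import defaultdict

docs = {
    "p1": "attention is all you need for machine learning",
    "p2": "graph databases and machine learning pipelines",
    "p3": "cooking with cast iron",
}

# Build inverted index: token -> set of doc ids containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search(query):
    """Score docs by count of matching query tokens (toy scoring)."""
    scores = defaultdict(int)
    for token in query.lower().split():
        for doc_id in index.get(token, ()):
            scores[doc_id] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

results = search("machine learning transformers")
```

Unmatched tokens ("transformers") simply contribute nothing, which is why partial matches still rank.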

// Vector index (kNN similarity search)
CREATE VECTOR INDEX article_embeddings
FOR (a:Article) ON (a.embedding)
OPTIONS {
  indexConfig: {
    `vector.dimensions`: 1536,
    `vector.similarity_function`: 'cosine'
  }
};

// Vector search
CALL db.index.vector.queryNodes(
  'article_embeddings',
  10,                              // top K
  $queryEmbedding
) YIELD node, score
RETURN node.title, node.id, score;
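What the cosine similarity function computes — and what a top-K query returns — can be sketched as brute-force kNN in Python (illustrative; the vector index approximates this with an ANN structure rather than scanning every node):

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query, items, k):
    """Brute-force top-K by cosine score over {id: embedding}."""
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in items.items()]
    return sorted(scored, key=lambda kv: -kv[1])[:k]

# Toy 2-d embeddings (a real index would hold e.g. 1536-d vectors)
items = {"a1": [1.0, 0.0], "a2": [0.7, 0.7], "a3": [0.0, 1.0]}
top = knn([1.0, 0.1], items, 2)
```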

Shortest Path Algorithms

// Built-in shortest path (unweighted BFS — counts hops, ignores weights)
MATCH (start:User {id: "alice-1"}), (end:User {id: "carol-3"})
MATCH path = shortestPath((start)-[:FOLLOWS*]-(end))
RETURN path, length(path) AS degrees_of_separation;

// All shortest paths
MATCH (a:User {id: $from}), (b:User {id: $to})
MATCH paths = allShortestPaths((a)-[:FOLLOWS*]-(b))
RETURN paths
LIMIT 5;
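`shortestPath` over an unweighted pattern behaves like a breadth-first search: the first time the end node is reached is necessarily via a minimum-hop path. A sketch over a plain adjacency dict (hypothetical FOLLOWS edges, treated as undirected like the Cypher above):

```python
from collections import deque

def shortest_path(adj, start, end):
    """BFS: first arrival at `end` is via a minimum-hop path."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == end:
            return path
        for nxt in adj.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no path exists

follows = {
    "alice-1": ["bob-1"],
    "bob-1": ["alice-1", "carol-3"],
    "carol-3": ["bob-1"],
}
path = shortest_path(follows, "alice-1", "carol-3")
degrees = len(path) - 1  # hop count, like length(path) in Cypher
```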

// Weighted shortest path with APOC
MATCH (start:Location {name: $from}), (end:Location {name: $to})
CALL apoc.algo.dijkstra(start, end, 'CONNECTED_TO', 'distance')
YIELD path, weight
RETURN [node IN nodes(path) | node.name] AS route, weight AS total_distance;

// APOC path expander with relationship type + direction config
MATCH (user:User {id: $userId})
CALL apoc.path.expandConfig(user, {
  relationshipFilter: "FOLLOWS>|AUTHORED>|TAGGED>",
  minLevel: 1,
  maxLevel: 3,
  uniqueness: "NODE_GLOBAL",
  limit: 100
}) YIELD path
RETURN path;

Graph Data Science (GDS Library)

// Step 1: Project a graph into GDS memory
CALL gds.graph.project(
  'social-graph',                          // graph name
  'User',                                  // node projection
  {
    FOLLOWS: { orientation: 'UNDIRECTED' } // relationship projection
  }
);

// PageRank: identify influential nodes
CALL gds.pageRank.stream('social-graph', {
  maxIterations: 20,
  dampingFactor: 0.85
})
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS user, score
ORDER BY score DESC
LIMIT 20
RETURN user.name, user.id, score;
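What PageRank computes can be sketched as power iteration in pure Python (same damping factor and iteration count as the call above; a toy, not the GDS implementation): rank flows along out-edges each round, damped by a uniform teleport term.

```python
def pagerank(adj, damping=0.85, iterations=20):
    """Power iteration over {node: [out-neighbors]}; ranks sum to 1."""
    nodes = list(adj)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, targets in adj.items():
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new[t] += share   # rank flows along out-edges
            else:
                for t in nodes:       # dangling node: redistribute evenly
                    new[t] += damping * rank[n] / len(nodes)
        rank = new
    return rank

# "c" is followed by both "a" and "b", so it should rank highest
follows = {"a": ["c"], "b": ["c"], "c": []}
ranks = pagerank(follows)
```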

// Community detection (Louvain method)
CALL gds.louvain.stream('social-graph')
YIELD nodeId, communityId
WITH gds.util.asNode(nodeId) AS user, communityId
RETURN communityId, count(*) AS size, collect(user.name)[0..10] AS members
ORDER BY size DESC;

// Node similarity: find users with similar follow patterns
CALL gds.nodeSimilarity.stream('social-graph', {
  topK: 10,
  similarityCutoff: 0.5
})
YIELD node1, node2, similarity
WITH gds.util.asNode(node1) AS userA,
     gds.util.asNode(node2) AS userB,
     similarity
RETURN userA.name, userB.name, round(similarity, 3) AS similarity
ORDER BY similarity DESC
LIMIT 50;
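By default `gds.nodeSimilarity` scores node pairs by Jaccard similarity of their neighbor sets; a sketch with hypothetical follow sets, applying the same `topK`-style cutoff idea:

```python
def jaccard(a, b):
    """|intersection| / |union| of two neighbor sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def similar_pairs(neighbors, cutoff=0.5):
    """All node pairs whose neighbor-set Jaccard similarity >= cutoff."""
    users = sorted(neighbors)
    pairs = []
    for i, u in enumerate(users):
        for v in users[i + 1:]:
            s = jaccard(neighbors[u], neighbors[v])
            if s >= cutoff:
                pairs.append((u, v, s))
    return pairs

follows = {
    "alice": {"x", "y", "z"},
    "bob":   {"x", "y"},
    "carol": {"q"},
}
pairs = similar_pairs(follows)  # alice/bob share 2 of 3 accounts -> 0.667
```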

// Betweenness centrality: identify bridge nodes
CALL gds.betweenness.stream('social-graph')
YIELD nodeId, score
WITH gds.util.asNode(nodeId) AS user, score
ORDER BY score DESC
LIMIT 10
RETURN user.name, score AS betweenness;

// Clean up projected graph
CALL gds.graph.drop('social-graph');

Graphiti for AI Agent Memory

Graphiti (https://github.com/getzep/graphiti) is a Python library that uses Neo4j as a temporal knowledge graph backend for AI agent memory. It stores episodes (conversations, events), extracts entities and relationships, and enables semantic + graph search over agent memory.

from graphiti_core import Graphiti
from graphiti_core.nodes import EpisodeType
from datetime import datetime

# Initialize Graphiti with Neo4j + LLM
graphiti = Graphiti(
    "bolt://localhost:7687",  # Neo4j bolt URI
    "neo4j",                  # username
    "password",               # password
)

await graphiti.build_indices_and_constraints()

# Add an episode (conversation turn, event, observation)
await graphiti.add_episode(
    name="User Query - Order Status",
    episode_body="""
    User [email protected] asked about order ORD-12345 placed on 2024-01-15.
    The order contains 2x Wireless Headphones totaling $149.99.
    Current status: shipped, estimated delivery 2024-01-20.
    """,
    source_description="Customer service chat",
    reference_time=datetime.now().astimezone(),  # timezone-aware timestamp
    source=EpisodeType.text
)

# Search agent memory
results = await graphiti.search(
    query="What did Alice ask about recently?",
    num_results=10
)

for result in results:
    print(f"Fact: {result.fact}")
    print(f"Valid from: {result.valid_at}")
    print(f"Source: {result.source_description}")

// Direct Cypher for Graphiti memory queries
// Find recent episodes involving a specific entity
MATCH (entity:Entity {name: "Alice"})<-[:MENTIONS]-(episode:Episode)
WHERE episode.created_at > datetime() - duration('P7D')
RETURN episode.content, episode.created_at
ORDER BY episode.created_at DESC
LIMIT 10;

// Temporal fact retrieval: what was true at a specific time
// (label and property names here depend on the Graphiti schema version)
MATCH (e:Edge)
WHERE e.valid_at <= $queryTime
  AND (e.invalid_at IS NULL OR e.invalid_at > $queryTime)
  AND e.source_node_uuid IN $relevantEntityIds
RETURN e.fact_text, e.valid_at, e.source_node_name, e.target_node_name
ORDER BY e.valid_at DESC;
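The valid_at / invalid_at pattern above is a point-in-time lookup: a fact holds at time t iff it became valid at or before t and has not yet been invalidated. A minimal Python sketch (field names follow the Cypher above and may differ by Graphiti version):

```python
from datetime import datetime

def facts_valid_at(facts, query_time):
    """Facts that hold at query_time: valid_at <= t < invalid_at (or open-ended)."""
    return [
        f for f in facts
        if f["valid_at"] <= query_time
        and (f["invalid_at"] is None or f["invalid_at"] > query_time)
    ]

facts = [
    {"fact": "Alice lives in Paris",
     "valid_at": datetime(2023, 1, 1), "invalid_at": datetime(2024, 6, 1)},
    {"fact": "Alice lives in Berlin",
     "valid_at": datetime(2024, 6, 1), "invalid_at": None},
]

then = facts_valid_at(facts, datetime(2024, 1, 1))  # Paris was true then
now = facts_valid_at(facts, datetime(2025, 1, 1))   # Berlin is true now
```

Invalidating a fact means setting `invalid_at` rather than deleting it, so the graph can still answer "what did we believe at time t".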

APOC Procedures

// APOC: useful utility procedures (install apoc plugin)

// Load JSON from URL
CALL apoc.load.json("https://api.example.com/data.json")
YIELD value
RETURN value;

// Periodic commit for large imports (batched)
CALL apoc.periodic.iterate(
  "MATCH (u:User) WHERE u.migrated IS NULL RETURN u",
  "SET u.migrated = true, u.updated_at = datetime()",
  { batchSize: 1000, parallel: false }
) YIELD batches, total
RETURN batches, total;
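The batching semantics of `apoc.periodic.iterate` — run the outer query, apply the inner update in fixed-size chunks, each chunk committed separately so one huge transaction never builds up — can be sketched as (toy sequential version; `apoc` also supports parallel batches):

```python
def iterate_in_batches(rows, apply, batch_size=1000):
    """Apply `apply` to rows in fixed-size batches; returns (batches, total)."""
    batches = 0
    total = 0
    for i in range(0, len(rows), batch_size):
        batch = rows[i:i + batch_size]
        for row in batch:  # in Neo4j, each batch commits as its own transaction
            apply(row)
        batches += 1
        total += len(batch)
    return batches, total

users = [{"id": n, "migrated": None} for n in range(2500)]

def migrate(user):
    user["migrated"] = True

batches, total = iterate_in_batches(users, migrate, batch_size=1000)
```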

// Build a map keyed by a node property (a function, not a procedure)
RETURN apoc.map.fromNodes('User', 'id') AS users_by_id;

// UUID generation
RETURN apoc.create.uuid() AS uuid;  // Neo4j 5+: prefer built-in randomUUID()

// Run Cypher statements from a file
CALL apoc.cypher.runFile("migration-001.cypher");

// Graph export
CALL apoc.export.json.all("graph-export.json", {useTypes: true});

Anti-Patterns

// ❌ Super-nodes: a :Tag with 1M incoming TAGGED relationships kills traversal
MATCH (t:Tag {name: "python"})<-[:TAGGED]-(p:Post)
RETURN count(p);  // scans 1M relationships

// ✅ Use intermediate nodes to break up super-nodes, or filter with properties
// ✅ Or use a property on the relationship for faster filtering

// ❌ Cartesian product (no relationship between patterns = O(n*m))
MATCH (u:User), (p:Post)
RETURN u.name, p.title;  // returns every combination!
// ✅ Always connect patterns with relationships or WHERE

// ❌ MATCH without index on large graphs
MATCH (u:User {email: "[email protected]"})  // if no index on email → full scan
// ✅ Create index first
CREATE INDEX user_email FOR (u:User) ON (u.email);

// ❌ Collecting unbounded lists
MATCH (u:User)-[:AUTHORED]->(p:Post)
RETURN u.name, collect(p) AS all_posts;  // can return millions of nodes
// ✅ Limit collections
RETURN u.name, collect(p)[0..100] AS posts, count(p) AS total;

// ❌ Variable length paths without upper bound
MATCH (u:User)-[:FOLLOWS*]->(other)  // * = unbounded, can traverse entire graph
// ✅ Always bound variable length paths
MATCH (u:User)-[:FOLLOWS*1..5]->(other)

Quick Reference

Node Naming:     PascalCase labels (:User, :Order, :Article)
Relationship:    UPPER_SNAKE_CASE (:FOLLOWS, :AUTHORED, :TAGGED_WITH)
Properties:      snake_case (created_at, user_id, view_count)

Index Strategy:
  Equality lookup            → Range index on property
  Full-text search           → FULLTEXT index on text properties
  kNN / semantic search      → VECTOR index on embedding property
  Composite filter           → Composite range index on (prop1, prop2)

Traversal Gotchas:
  Unbounded *                → always add upper bound *1..N
  Cartesian product          → always connect patterns with relationship
  Super-nodes                → model around them, filter with properties

Graphiti Memory Workflow:
  1. add_episode(content)    → LLM extracts entities + facts
  2. search(query)           → hybrid graph + semantic search
  3. Direct Cypher for temporal queries (valid_at / invalid_at)

GDS Workflow:
  1. gds.graph.project       → load subgraph into memory
  2. gds.algorithm.stream    → run algorithm
  3. gds.graph.drop          → free memory when done
