Ontology in Graph Databases: Lessons from Building a Medical Diagnostics Knowledge Graph
Ontology in a graph database is not an academic exercise — it is the difference between a graph that answers questions and a graph that is a tangled hairball. An ontology defines what kinds of things exist in your domain, what properties they carry, and how they relate to one another. Without one, your graph grows without structure and querying becomes guesswork.
The LeadGraph project offers a concrete, production-grade example of ontology design. It is a market intelligence platform for Siemens Healthineers that ingests data from 15 external sources (FDA 510(k) clearances, clinical trials, patents, conference exhibitor lists, grant databases, GitHub repositories, and more), normalises everything into a Neo4j knowledge graph, and scores companies by commercial relevance.
This article walks through the ontology decisions in LeadGraph and extracts the general principles that apply to any graph database project.
What Ontology Means in Neo4j
In Neo4j, ontology manifests as:
- Node labels: :Company, :Product, :Application, :Signal. These are your entity types.
- Property constraints: :Company.normalizedName is unique; :Signal.type is indexed.
- Relationship types: SUPPLIES, DEVELOPS, HAS_SIGNAL, USED_IN. These encode the semantic edges between entities.
- Value ranges: :Signal.confidence is a float from 0 to 1; :Signal.type is one of 12 enumerated signal types.
A well-designed ontology answers three questions: what exists, what matters about it, and how it connects.
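The same ingredients can be mirrored in application code as types and a runtime check. The following is a minimal sketch, not LeadGraph's actual source: the `Signal` shape follows the properties named in this article, the type list contains only the six signal types that appear here (the other six are not shown, so the list is deliberately partial), and `isValidSignal` is a hypothetical helper.

```typescript
// Signal types that appear in this article. LeadGraph defines 12 in total;
// the remaining ones are not shown, so this list is intentionally partial.
const KNOWN_SIGNAL_TYPES = [
  "FDA_CLEARANCE",
  "CLINICAL_TRIAL",
  "PATENT",
  "CONFERENCE",
  "RESEARCH_PUBLICATION",
  "FUNDING",
] as const;

type SignalType = (typeof KNOWN_SIGNAL_TYPES)[number];

interface Signal {
  type: SignalType;
  date: string; // ISO 8601 date, e.g. "2025-12-10"
  confidence: number; // expected range: 0 to 1 inclusive
  description: string;
  url?: string;
}

// Shape of data before it has been checked against the ontology.
interface UncheckedSignal {
  type: string;
  date: string;
  confidence: number;
  description: string;
}

// Hypothetical runtime guard: verifies the value ranges the ontology promises.
function isValidSignal(candidate: UncheckedSignal): candidate is Signal {
  const typeOk = KNOWN_SIGNAL_TYPES.some((t) => t === candidate.type);
  const confidenceOk =
    isFinite(candidate.confidence) &&
    candidate.confidence >= 0 &&
    candidate.confidence <= 1;
  const dateOk = !isNaN(Date.parse(candidate.date));
  return typeOk && confidenceOk && dateOk;
}
```

Checking values at the application boundary complements the database constraints: Neo4j enforces uniqueness, while enumerations and numeric ranges like these are typically validated before the write.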
The LeadGraph Ontology
The ontology is seeded explicitly — not discovered. The seedGraph() function in ontology.ts creates constraints, nodes, and relationships in a single transaction-friendly batch:
CREATE CONSTRAINT IF NOT EXISTS FOR (c:Company) REQUIRE c.normalizedName IS UNIQUE;
CREATE CONSTRAINT IF NOT EXISTS FOR (a:Application) REQUIRE a.name IS UNIQUE;
CREATE INDEX IF NOT EXISTS FOR (s:Signal) ON (s.type);
These three lines capture the ontology's backbone: no two companies may share a normalised name, no two application areas may share a name, and signals can be looked up efficiently by type.
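One way to keep such statements reviewable is to derive them from a declarative rule list. The sketch below is an assumption about how this could be organised, since the body of seedGraph() is not shown in this article; `SchemaRule`, `SCHEMA_RULES`, and `toCypher` are hypothetical names.

```typescript
// Declarative description of the schema rules shown above.
type SchemaRule =
  | { kind: "unique"; label: string; property: string }
  | { kind: "index"; label: string; property: string };

const SCHEMA_RULES: SchemaRule[] = [
  { kind: "unique", label: "Company", property: "normalizedName" },
  { kind: "unique", label: "Application", property: "name" },
  { kind: "index", label: "Signal", property: "type" },
];

// Renders one rule as an idempotent Cypher statement (IF NOT EXISTS),
// so the seed step can be rerun without failing on existing schema.
function toCypher(rule: SchemaRule): string {
  const v = rule.label.charAt(0).toLowerCase(); // short variable, e.g. "c"
  if (rule.kind === "unique") {
    return `CREATE CONSTRAINT IF NOT EXISTS FOR (${v}:${rule.label}) REQUIRE ${v}.${rule.property} IS UNIQUE;`;
  }
  return `CREATE INDEX IF NOT EXISTS FOR (${v}:${rule.label}) ON (${v}.${rule.property});`;
}

const schemaStatements = SCHEMA_RULES.map(toCypher);
```

Adding a new constraint then becomes a one-line change to the rule list, which is easy to spot in code review.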
Node Labels and Their Semantics
The ontology defines seven node labels:
| Label | Purpose | Key Properties | Uniqueness |
|---|---|---|---|
| Company | An organisation in the diagnostics market | name, normalizedName, domain, segment, region | normalizedName |
| Product | A specific Siemens product or reagent | catalogId, name, category | catalogId |
| Application | A clinical application area | name, category | name |
| Signal | An external indicator of activity | type, date, confidence, description, url | none (event) |
| Contact | A person at a company | name, email, role | none |
| PipelineStage | A sales pipeline milestone | stage, enteredAt | none |
| Activity | A logged interaction | type, note, date | none |
The distinction between Company and Product is straightforward. The interesting design choice is Application as a separate node rather than a property on Company. This enables multi-hop traversal: you can find companies that develop assays in the same application area as a given Siemens product. If application areas were stored as a property on Company, the same question would require scanning companies and comparing property values instead of following edges.
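The traversal described above can be sketched with a small in-memory stand-in for the graph. The data and the function name `companiesSharingApplication` are illustrative assumptions; the Cypher in the comment assumes the labels and relationship types used in this article.

```typescript
// Minimal in-memory stand-in for the two relationship types involved.
// All data below is illustrative, not taken from the LeadGraph database.
const usedIn: Record<string, string[]> = {
  // (:Product)-[:USED_IN]->(:Application)
  "example-reagent": ["Hemostasis & Thrombosis"],
};
const develops: Record<string, string[]> = {
  // (:Company)-[:DEVELOPS]->(:Application)
  "company-a": ["Hemostasis & Thrombosis", "Cardiac Markers"],
  "company-b": ["Autoimmune Diagnostics"],
};

// Roughly equivalent Cypher, assuming the schema in this article:
//   MATCH (p:Product {catalogId: $id})-[:USED_IN]->(a:Application)
//         <-[:DEVELOPS]-(c:Company)
//   RETURN DISTINCT c.normalizedName
function companiesSharingApplication(productId: string): string[] {
  const productApps = usedIn[productId] ?? [];
  return Object.keys(develops)
    .filter((company) =>
      develops[company].some((app) => productApps.indexOf(app) !== -1),
    )
    .sort();
}
```

The Application node is the join point: both relationship types end at it, so the question is answered by walking two edges rather than comparing strings across entities.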
Relationship Semantics
The seven relationship types encode the domain's business logic:
(:Company)-[:SUPPLIES]->(:Product) // Siemens manufactures this product
(:Company)-[:DEVELOPS]->(:Application) // Company works in this application area
(:Company)-[:HAS_SIGNAL]->(:Signal) // Company triggered this external signal
(:Product)-[:USED_IN]->(:Application) // Product is relevant to this clinical area
(:Contact)-[:CONTACT_AT]->(:Company) // Person works at this organisation
(:Contact)-[:HAS_ACTIVITY]->(:Activity) // Person had an interaction
(:Contact)-[:IN_STAGE]->(:PipelineStage) // Current pipeline status
The DEVELOPS relationship is where the ontology does its heaviest lifting. Every external data point — an FDA clearance, a conference appearance, a new hire — is mapped through the application area classification. When the scoring engine runs, it computes productFitScore as the overlap ratio between a company's application areas and Siemens' product portfolio:
// scorer.ts (simplified)
const overlap = company.applications.filter(
  app => siemensApps.has(app)
).length;
const productFitScore = (overlap / siemensApps.size) * 30; // [0-30]
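This calculation works because the ontology maintains (:Company)-[:DEVELOPS]->(:Application) and (:Product)-[:USED_IN]->(:Application) as first-class paths, so both a company's application areas and Siemens' portfolio areas can be read directly from the graph.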
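As a worked example of the overlap ratio, here is a hedged sketch of a `computeProductFit` function. The name matches the one used in the scoring test later in this article, but the body is an assumption; in particular, the empty-set guard and the deduplication step are additions of this sketch.

```typescript
interface CompanyApps {
  applications: string[];
}

// Share of Siemens application areas that the company also covers,
// scaled to the 0-30 range used in the article.
function computeProductFit(company: CompanyApps, siemensApps: Set<string>): number {
  if (siemensApps.size === 0) return 0; // avoid dividing by zero
  // Deduplicate first so a repeated area cannot push the score above 30.
  const distinct = new Set(company.applications);
  let overlap = 0;
  distinct.forEach((app) => {
    if (siemensApps.has(app)) overlap += 1;
  });
  return (overlap / siemensApps.size) * 30;
}

const exampleSiemensApps = new Set(["Hemostasis & Thrombosis", "Cardiac Markers"]);
const exampleFit = computeProductFit(
  { applications: ["Hemostasis & Thrombosis"] },
  exampleSiemensApps,
);
// One of two Siemens areas overlaps: (1 / 2) * 30 = 15
```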
The Signal Normalisation Pattern
The most instructive design in LeadGraph is the signal normalisation pattern. The project ingests from 15 wildly different data sources:
| Source | Raw Data | Normalised to |
|---|---|---|
| FDA 510(k) API | Medical device clearance records | Signal { type: "FDA_CLEARANCE", confidence, date } |
| ClinicalTrials.gov | Trial registry entries | Signal { type: "CLINICAL_TRIAL", ... } |
| EPO OPS (patents) | European patent filings | Signal { type: "PATENT", ... } |
| MEDICA API | Trade fair exhibitor list | Signal { type: "CONFERENCE", ... } |
| GitHub Search API | Repositories matching diagnostic keywords | Signal { type: "RESEARCH_PUBLICATION", ... } |
| FÖKAT (grants) | German government research grants | Signal { type: "FUNDING", ... } |
| DRKS (clinical trials DE) | German clinical trial registry | Signal { type: "CLINICAL_TRIAL", ... } |
Each source has an adapter implementing a common interface:
interface SourceAdapter {
  readonly id: string;
  fetch(): Promise<RawLead[]>;
  normalize(raw: RawLead): LeadCandidate;
  healthCheck(): Promise<boolean>;
}
The normalize() method is where each adapter maps its domain-specific data into the shared ontology. An FDA 510(k) clearance becomes:
{
  companyName: "Euroimmun AG",
  domain: "euroimmun.com",
  applicationAreas: ["Autoimmune Diagnostics"],
  signals: [{
    type: "FDA_CLEARANCE",
    date: "2025-12-10",
    confidence: 0.9,
    description: "FDA 510(k) for Euroimmun Anti-dsDNA ELISA"
  }]
}
A conference or trade fair listing becomes:
{
  companyName: "DIARECT AG",
  applicationAreas: ["Autoimmune Diagnostics"],
  signals: [{
    type: "CONFERENCE",
    date: "2025-10-15",
    confidence: 0.6,
    description: "Presenting new autoimmune panel at ADLM 2025"
  }]
}
Different raw data, same ontology. This is the central value of a well-designed graph ontology: it makes disparate data sources queryable through a single model.
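To make the adapter contract concrete, here is a minimal sketch of one adapter. The `RawLead` and `LeadCandidate` shapes are not shown in this article, so their fields are assumptions inferred from the examples above; the payload field names (`applicant`, `decisionDate`, `deviceName`, `area`) are invented for illustration, and `fetch()` returns a fixed record instead of calling a real API.

```typescript
// Hypothetical shapes: assumed from the normalised examples in the article.
interface RawLead {
  source: string;
  payload: Record<string, string>;
}

interface LeadCandidate {
  companyName: string;
  domain?: string;
  applicationAreas: string[];
  signals: { type: string; date: string; confidence: number; description: string }[];
}

interface SourceAdapter {
  readonly id: string;
  fetch(): Promise<RawLead[]>;
  normalize(raw: RawLead): LeadCandidate;
  healthCheck(): Promise<boolean>;
}

// Illustrative adapter backed by a fixed in-memory record rather than HTTP.
const exampleFdaAdapter: SourceAdapter = {
  id: "fda-510k-example",
  async fetch() {
    return [
      {
        source: "fda-510k-example",
        payload: {
          applicant: "Euroimmun AG",
          decisionDate: "2025-12-10",
          deviceName: "Euroimmun Anti-dsDNA ELISA",
          area: "Autoimmune Diagnostics",
        },
      },
    ];
  },
  // Source-specific field names are translated into the shared ontology here.
  normalize(raw) {
    return {
      companyName: raw.payload.applicant,
      applicationAreas: [raw.payload.area],
      signals: [
        {
          type: "FDA_CLEARANCE",
          date: raw.payload.decisionDate,
          confidence: 0.9,
          description: `FDA 510(k) for ${raw.payload.deviceName}`,
        },
      ],
    };
  },
  async healthCheck() {
    return true;
  },
};
```

Keeping normalize() free of I/O, as in this sketch, makes each adapter's mapping testable without network access.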
Why Confidence Is Part of the Ontology
Notice the confidence field on every signal. The ontology encodes uncertainty because not all data sources are equally reliable. An FDA clearance (confidence 0.9) is a stronger signal than a news article mentioning a company (confidence 0.5). By embedding confidence as a property, the scoring engine can weight signals by source reliability:
const signalScore = signals.reduce((sum, s) =>
  sum + (weights[s.type] ?? 1) * s.confidence * recencyMultiplier(s.date), 0
); // [0-40]
This is a deliberate ontology decision: uncertainty is a first-class property of your data model, not an afterthought.
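The weighting can be sketched end to end as follows. The per-type weights, the 180-day half-life, and the explicit clamp to 40 are assumptions of this sketch, since the article does not give the real values; `recencyMultiplier` here also takes the reference time as a parameter so the result is deterministic.

```typescript
interface ScoredSignal {
  type: string;
  date: string; // ISO 8601 date
  confidence: number; // 0 to 1
}

// Assumed per-type weights; the real values are not given in the article.
const weights: Record<string, number> = {
  FDA_CLEARANCE: 10,
  CONFERENCE: 4,
};

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Assumed decay: a signal loses half its weight every 180 days.
function recencyMultiplier(date: string, now: Date, halfLifeDays = 180): number {
  const ageDays = Math.max(0, (now.getTime() - Date.parse(date)) / MS_PER_DAY);
  return Math.pow(0.5, ageDays / halfLifeDays);
}

// Sums weight * confidence * recency over all signals.
function computeSignalScore(signals: ScoredSignal[], now: Date): number {
  const raw = signals.reduce(
    (sum, s) => sum + (weights[s.type] ?? 1) * s.confidence * recencyMultiplier(s.date, now),
    0,
  );
  return Math.min(40, raw); // clamp to the 0-40 range
}
```

Under these assumptions, a high-confidence clearance from today outweighs a conference appearance of the same age, and both fade as they get older.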
What Makes a Good Graph Ontology
Drawing from the LeadGraph example and general graph database practice, here are the principles:
1. Entities Are Nodes, Values Are Properties
A common mistake is storing important domain concepts as properties on other nodes. In LeadGraph, Application is a separate node, not a string array on Company. This seems trivial but has major implications:
- You can query all companies in an application area without full scans.
- You can attach metadata to the application area itself (market size, growth rate).
- You can join through application areas across different entity types (companies and products).
Rule of thumb: if you query by it, filter on it, or join through it, it should be a node.
2. Relationships Are Named, Not Tagged
Another common anti-pattern is using a generic RELATED_TO relationship with a type property to distinguish semantics. This forces every query to filter by property values, destroying performance and readability.
LeadGraph uses distinct relationship types (DEVELOPS, SUPPLIES, HAS_SIGNAL) that are self-documenting and let a query name exactly the edges it wants to traverse.
3. The Ontology Must Be Explicitly Seeded
LeadGraph seeds its ontology in code — the application areas and Siemens product portfolio are defined as TypeScript arrays:
const APPLICATION_AREAS = [
  "Hemostasis & Thrombosis",
  "Plasma Proteins",
  "Infectious Disease & Serology",
  "Cardiac Markers",
  "Oncology & Tumor Markers",
  "Autoimmune Diagnostics",
  "Specialty Proteins & Reagents",
];
This is not an accident. An ontology that emerges organically from data is rarely coherent. Explicit seeding means every node label, relationship type, and property constraint is a conscious design decision.
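A seed list like this can be turned into a single parameterised, rerunnable write. The sketch below is an assumption about how that might look, not the contents of ontology.ts; `buildApplicationSeed` and `SeedStatement` are hypothetical names, and the resulting query and parameters would be handed to whatever driver call the project uses.

```typescript
const SEED_APPLICATION_AREAS = [
  "Hemostasis & Thrombosis",
  "Plasma Proteins",
  "Infectious Disease & Serology",
  "Cardiac Markers",
  "Oncology & Tumor Markers",
  "Autoimmune Diagnostics",
  "Specialty Proteins & Reagents",
] as const;

interface SeedStatement {
  query: string;
  params: { names: string[] };
}

// Builds one parameterised statement for all areas. MERGE only creates an
// Application node when none with that name exists, so reruns are safe.
function buildApplicationSeed(names: readonly string[]): SeedStatement {
  const seen = new Set<string>();
  for (const name of names) {
    // Fail fast on duplicates in the seed list itself.
    if (seen.has(name)) throw new Error(`Duplicate application area: ${name}`);
    seen.add(name);
  }
  return {
    query: "UNWIND $names AS name MERGE (a:Application {name: name})",
    params: { names: names.slice() },
  };
}

const applicationSeed = buildApplicationSeed(SEED_APPLICATION_AREAS);
```

Passing the names as a parameter rather than interpolating them into the query string keeps characters such as "&" from needing any escaping.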
4. External Data Must Be Normalised, Not Imported Raw
The adapter pattern is critical. Each data source maps to the shared ontology via normalize(). If you import raw FDA data in FDA's schema and raw patent data in the patent office's schema, you have not built a knowledge graph — you have a data lake.
5. Constraints Are Ontology Enforcement
Neo4j constraints (REQUIRE ... IS UNIQUE) are not optional performance hints. They enforce the ontology at the database level. If two source adapters produce a company with the same name but different spellings, the normalizedName constraint catches the collision. The normalizeCompanyName() function handles deduplication:
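export function normalizeCompanyName(name: string): string {
  return name
    .toLowerCase()
    // Drop punctuation and other non-alphanumeric characters.
    .replace(/[^a-z0-9\s]/g, "")
    // Remove legal suffixes only as whole words; without the \b anchors,
    // "ag" would also be stripped from inside words such as "diagnostics".
    .replace(/\b(?:gmbh|ag|limited|ltd|inc|corp|llc)\b/g, "")
    .trim()
    .replace(/\s+/g, "-");
}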
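To see how normalisation surfaces duplicates before they reach the constraint, the sketch below groups raw names by their normalised key. Both function names are hypothetical, and the normaliser is a variant that strips legal suffixes only as whole words, which is an assumption of this sketch rather than a statement about LeadGraph's exact behaviour.

```typescript
// Variant normaliser: legal suffixes are removed only when they are whole
// words, so "ag" inside "diagnostics" is left alone.
function normalizeNameSketch(name: string): string {
  return name
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, "")
    .replace(/\b(?:gmbh|ag|limited|ltd|inc|corp|llc)\b/g, "")
    .trim()
    .replace(/\s+/g, "-");
}

// Groups raw names by normalised key. Any group with more than one entry
// is a set of spellings that must resolve to a single Company node.
function groupByNormalizedName(names: string[]): Record<string, string[]> {
  const groups: Record<string, string[]> = {};
  for (const name of names) {
    const key = normalizeNameSketch(name);
    (groups[key] = groups[key] ?? []).push(name);
  }
  return groups;
}

const nameGroups = groupByNormalizedName([
  "Euroimmun AG",
  "EUROIMMUN ag.",
  "Acme Diagnostics GmbH", // invented company name for illustration
]);
```

Here the two Euroimmun spellings share one key, so an upsert keyed on normalizedName would write to the same node instead of creating a second company.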
Common Ontology Mistakes
From building and iterating on the LeadGraph ontology, the recurring pitfalls are:
Over-normalisation. Splitting everything into nodes creates traversal hell. A company's region (EUROPE, NORTH_AMERICA) is a property, not a node: querying "all companies in Europe" then needs only a property index, not an extra traversal through a Region node.
Under-normalisation. Storing application areas as a comma-separated string on Company loses the ability to query across the application dimension. The rule of thumb above applies.
Ignoring time. Signals have dates, contacts have activity timestamps, pipeline stages have entry dates. If your ontology does not model time, your graph cannot answer "what changed."
Treating all relationships as equal. Putting a signal's type as a property on a HAS_SIGNAL relationship is not equivalent to pointing HAS_SIGNAL at a Signal node that carries the type. The node-based design lets the signal hold its own properties (confidence, date, description) and exist independently of the relationship.
Testing the Ontology
LeadGraph tests the ontology through integration tests that seed the graph and verify query results:
// From neo4j.test.ts
test("seeded ontology has correct structure", async () => {
  const result = await seedGraph();
  expect(result.constraintsCreated).toBe(3);
  expect(result.companiesSeeded).toBeGreaterThan(1);
  expect(result.applicationAreas).toBe(7);
  expect(result.productsSeeded).toBeGreaterThan(0);
});
More importantly, the scoring tests validate that the ontology supports the expected queries:
test("product fit score uses application overlap", async () => {
  const company = { applications: ["Hemostasis & Thrombosis"] };
  const siemensApps = new Set(["Hemostasis & Thrombosis", "Cardiac Markers"]);
  const score = computeProductFit(company, siemensApps);
  expect(score).toBeGreaterThan(0);
});
If the ontology changes in a way that breaks these expectations, the tests fail, which provides a safety net for schema evolution.
Summary
The LeadGraph project illustrates that ontology design is among the most impactful decisions in a graph database project. It determines what queries are possible, how performant they are, and whether the graph remains coherent as new data sources are added.
The key takeaways for building your own graph ontology:
- Explicitly seed your ontology: do not let it emerge from data.
- Entities are nodes, values are properties: if you query by it, make it a node.
- Name your relationships: generic RELATED_TO is a code smell.
- Normalise external data into your ontology: each source adapter is a translation layer.
- Model uncertainty: confidence scores are first-class ontology properties.
- Enforce with constraints: UNIQUE and INDEX are ontology guarantees, not performance hacks.
- Test the ontology: if a schema change breaks queries, your tests should catch it.
A graph without ontology is just a collection of nodes. A graph with ontology is a knowledge graph.
Further Reading
- Robinson, I., Webber, J. & Eifrem, E. Graph Databases, 2nd Edition. O'Reilly Media, 2015. ISBN 978-1-449-35625-5. — The definitive reference on graph database modeling, covering node labels, relationship types, and the property graph model.
- Gainey, C. Ontologies in Neo4j: Semantics and Knowledge Graphs. Neo4j Blog, 2020. https://neo4j.com/blog/knowledge-graph/ontologies-in-neo4j-semantics-and-knowledge-graphs/ — Covers the three characteristics of formal ontologies (formal representation, explicit description, consensuated knowledge) and how to use neosemantics for RDF/OWL import and inference in Neo4j.
- Howard, R. RDF Triple Stores vs. Property Graphs: What's the Difference? Neo4j Blog, 2024. https://neo4j.com/blog/knowledge-graph/rdf-vs-property-graphs-knowledge-graphs/ — Compares RDF/OWL ontology approaches with property graph modeling, arguing for property graphs by default with ontological principles layered in where needed.
- Neo4j. Graph Data Modeling Core Principles. GraphAcademy. https://neo4j.com/graphacademy/training-gdm-40/03-graph-data-modeling-core-principles/ — Official training covering node design, relationship specificity, data accessibility hierarchy, and the gather-and-inspect anti-pattern.
- Neo4j. Data Modeling Best Practices. https://neo4j.com/developer/industry-use-cases/_attachments/neo4j_data_model_best_practices.txt — Consolidated best practices document covering naming conventions, node/relationship design, property strategy, and a validation checklist.
- Neo4j. Modeling Designs. https://neo4j.com/docs/getting-started/data-modeling/modeling-designs/ — Reference for common graph modeling patterns: intermediate nodes, linked lists, fanout, and timeline trees.
- Neo4j. Importing Ontologies — Neosemantics. https://neo4j.com/labs/neosemantics/4.2/importing-ontologies/ — Reference for loading RDFS/OWL/SKOS ontologies into Neo4j using the neosemantics plugin.
- Mungall, C. Biological Knowledge Graph Modeling Design Patterns. 2019. https://douroucouli.wordpress.com/2019/03/14/biological-knowledge-graph-modeling-design-patterns/ — Practical design patterns from large-scale biological KG projects: ontology classification as labels, relationship specificity, edge properties, and the trade-off between knowledge graph modeling and OWL-level logical modeling.
- Ferilli, S. LPG-based Ontologies as Schemas for Graph DBs. CEUR-WS Vol. 3194, 2022. https://ceur-ws.org/Vol-3194/paper31.pdf — Academic paper proposing the GraphBRAIN formalism for expressing ontologies as schemas on Labeled Property Graphs, bridging the gap between Neo4j's schema-less model and formal ontology languages.