Ontology in Graph Databases: Lessons from Building a Medical Diagnostics Knowledge Graph

Ontology in a graph database is not an academic exercise — it is the difference between a graph that answers questions and a graph that is a tangled hairball. An ontology defines what kinds of things exist in your domain, what properties they carry, and how they relate to one another. Without one, your graph grows without structure and querying becomes guesswork.

The LeadGraph project offers a concrete, production-grade example of ontology design. It is a market intelligence platform for Siemens Healthineers that ingests data from 15 external sources (FDA 510(k) clearances, clinical trials, patents, conference exhibitor lists, grant databases, GitHub repositories, and more), normalises everything into a Neo4j knowledge graph, and scores companies by commercial relevance.

This article walks through the ontology decisions in LeadGraph and extracts the general principles that apply to any graph database project.

What Ontology Means in Neo4j

In Neo4j, ontology manifests as:

Node labels: :Company, :Product, :Application, :Signal. These are your entity types.
Constraints and indexes: :Company.normalizedName is unique (a constraint); :Signal.type is indexed (a lookup aid, not a constraint).
Relationship types: SUPPLIES, DEVELOPS, HAS_SIGNAL, USED_IN. These encode the semantic edges between entities.
Value ranges: :Signal.confidence is a float from 0 to 1; :Signal.type is one of 12 enumerated signal types.

A well-designed ontology answers three questions: what exists, what matters about it, and how it connects.

The LeadGraph Ontology

The ontology is seeded explicitly — not discovered. The seedGraph() function in ontology.ts creates constraints, nodes, and relationships in a single transaction-friendly batch:

CREATE CONSTRAINT IF NOT EXISTS FOR (c:Company) REQUIRE c.normalizedName IS UNIQUE;
CREATE CONSTRAINT IF NOT EXISTS FOR (a:Application) REQUIRE a.name IS UNIQUE;
CREATE INDEX IF NOT EXISTS FOR (s:Signal) ON (s.type);

These three statements capture the ontology's backbone: every company must have a unique normalised name, every application area must have a unique name, and signals are indexed so they can be looked up efficiently by type.
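Neo4j rejects data writes in a transaction that has already performed schema updates, so a seeding routine generally applies constraints and indexes first and writes nodes afterwards. The sketch below illustrates that ordering under stated assumptions: CypherRunner, SCHEMA_STATEMENTS, and applySchema are hypothetical names, not part of ontology.ts, and the runner is synchronous only to keep the example short.

```typescript
// Hypothetical: CypherRunner stands in for whatever executes one statement
// (for example, a thin wrapper around a driver session).
type CypherRunner = (statement: string) => void;

const SCHEMA_STATEMENTS: readonly string[] = [
  "CREATE CONSTRAINT IF NOT EXISTS FOR (c:Company) REQUIRE c.normalizedName IS UNIQUE",
  "CREATE CONSTRAINT IF NOT EXISTS FOR (a:Application) REQUIRE a.name IS UNIQUE",
  "CREATE INDEX IF NOT EXISTS FOR (s:Signal) ON (s.type)",
];

// Runs each schema statement on its own, before any data is written.
// IF NOT EXISTS makes every statement safe to re-run.
function applySchema(run: CypherRunner): number {
  SCHEMA_STATEMENTS.forEach((statement) => run(statement));
  return SCHEMA_STATEMENTS.length;
}

// Usage with a recording runner instead of a live database.
const executedStatements: string[] = [];
const appliedCount = applySchema((statement) => {
  executedStatements.push(statement);
});
```

Injecting the runner keeps the schema list testable without a database connection.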

Node Labels and Their Semantics

The ontology defines seven node labels:

Label	Purpose	Key Properties	Uniqueness
`Company`	An organisation in the diagnostics market	`name`, `normalizedName`, `domain`, `segment`, `region`	`normalizedName`
`Product`	A specific Siemens product or reagent	`catalogId`, `name`, `category`	`catalogId`
`Application`	A clinical application area	`name`, `category`	`name`
`Signal`	An external indicator of activity	`type`, `date`, `confidence`, `description`, `url`	none (event)
`Contact`	A person at a company	`name`, `email`, `role`	none
`PipelineStage`	A sales pipeline milestone	`stage`, `enteredAt`	none
`Activity`	A logged interaction	`type`, `note`, `date`	none

The distinction between Company and Product is straightforward. The interesting design choice is modelling Application as a separate node rather than as a property on Company. This enables multi-hop traversal: you can find companies that develop assays in the same application area as a given Siemens product by following USED_IN and DEVELOPS edges through the shared Application node. With application areas stored as a property on Company, the same question would require scanning and comparing property values.
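To make the traversal concrete, here is a minimal in-memory sketch of the two-hop join through Application. The edge lists, product IDs, and company names are invented for illustration, and the Cypher in the comment is one plausible way to express the same pattern, not a query taken from LeadGraph.

```typescript
// Edge lists standing in for (:Product)-[:USED_IN]->(:Application)
// and (:Company)-[:DEVELOPS]->(:Application). All values are illustrative.
interface Edge {
  from: string;
  to: string;
}

const usedInEdges: Edge[] = [
  { from: "product-1", to: "Hemostasis & Thrombosis" },
  { from: "product-2", to: "Cardiac Markers" },
];
const developsEdges: Edge[] = [
  { from: "Company A", to: "Hemostasis & Thrombosis" },
  { from: "Company B", to: "Autoimmune Diagnostics" },
  { from: "Company C", to: "Hemostasis & Thrombosis" },
];

// Roughly equivalent Cypher pattern:
//   MATCH (p:Product {catalogId: $id})-[:USED_IN]->(a:Application)<-[:DEVELOPS]-(c:Company)
//   RETURN DISTINCT c.name
function companiesSharingApplication(productId: string): string[] {
  // Hop 1: application areas the product is used in.
  const areas = new Set(
    usedInEdges.filter((e) => e.from === productId).map((e) => e.to)
  );
  // Hop 2: companies that develop in any of those areas.
  const companies = new Set(
    developsEdges.filter((e) => areas.has(e.to)).map((e) => e.from)
  );
  return Array.from(companies).sort();
}

const sharingCompanies = companiesSharingApplication("product-1");
// sharingCompanies is ["Company A", "Company C"]
```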

Relationship Semantics

The seven relationship types encode the domain's business logic:

(:Company)-[:SUPPLIES]->(:Product)       // Siemens manufactures this product
(:Company)-[:DEVELOPS]->(:Application)    // Company works in this application area
(:Company)-[:HAS_SIGNAL]->(:Signal)       // Company triggered this external signal
(:Product)-[:USED_IN]->(:Application)     // Product is relevant to this clinical area
(:Contact)-[:CONTACT_AT]->(:Company)      // Person works at this organisation
(:Contact)-[:HAS_ACTIVITY]->(:Activity)   // Person had an interaction
(:Contact)-[:IN_STAGE]->(:PipelineStage)  // Current pipeline status

The DEVELOPS relationship is where the ontology does its heaviest lifting. Every external data point — an FDA clearance, a conference appearance, a new hire — is mapped through the application area classification. When the scoring engine runs, it computes productFitScore as the overlap ratio between a company's application areas and Siemens' product portfolio:

// scorer.ts (simplified)
const overlap = company.applications.filter(
  app => siemensApps.has(app)
).length;
const productFitScore = (overlap / siemensApps.size) * 30; // [0-30]

This calculation works because the ontology maintains (:Company)-[:DEVELOPS]->(:Application) and (:Product)-[:USED_IN]->(:Application) as first-class paths, so both sides of the overlap can be read directly from the graph.
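A worked instance of the overlap ratio may help. The sketch below is not the project's scorer.ts; it assumes a CompanyProfile shape and adds two guards (an empty portfolio and duplicate application names) that the simplified excerpt does not show.

```typescript
interface CompanyProfile {
  applications: string[];
}

const MAX_PRODUCT_FIT = 30; // upper bound of the productFitScore range

function computeProductFit(
  company: CompanyProfile,
  siemensApps: Set<string>
): number {
  // An empty portfolio would otherwise produce 0 / 0 = NaN.
  if (siemensApps.size === 0) return 0;
  // Deduplicate so a repeated application area is not counted twice.
  const distinctApps = new Set(company.applications);
  let overlap = 0;
  distinctApps.forEach((app) => {
    if (siemensApps.has(app)) overlap += 1;
  });
  return (overlap / siemensApps.size) * MAX_PRODUCT_FIT;
}

const portfolioApps = new Set(["Hemostasis & Thrombosis", "Cardiac Markers"]);
const fitScore = computeProductFit(
  { applications: ["Hemostasis & Thrombosis"] },
  portfolioApps
);
// fitScore is 15: one of two portfolio areas overlaps, so (1 / 2) * 30
```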

The Signal Normalisation Pattern

The most instructive design in LeadGraph is the signal normalisation pattern. The project ingests from 15 very different data sources; the table below shows seven of them:

Source	Raw Data	Normalised to
FDA 510(k) API	Medical device clearance records	`Signal { type: "FDA_CLEARANCE", confidence, date }`
ClinicalTrials.gov	Trial registry entries	`Signal { type: "CLINICAL_TRIAL", ... }`
EPO OPS (patents)	European patent filings	`Signal { type: "PATENT", ... }`
MEDICA API	Trade fair exhibitor list	`Signal { type: "CONFERENCE", ... }`
GitHub Search API	Repositories matching diagnostic keywords	`Signal { type: "RESEARCH_PUBLICATION", ... }`
FÖKAT (grants)	German government research grants	`Signal { type: "FUNDING", ... }`
DRKS (clinical trials DE)	German clinical trial registry	`Signal { type: "CLINICAL_TRIAL", ... }`

Each source has an adapter implementing a common interface:

interface SourceAdapter {
  readonly id: string;
  fetch(): Promise<RawLead[]>;
  normalize(raw: RawLead): LeadCandidate;
  healthCheck(): Promise<boolean>;
}

The normalize() method is where each adapter maps its domain-specific data into the shared ontology. An FDA 510(k) clearance becomes:

{
  companyName: "Euroimmun AG",
  domain: "euroimmun.com",
  applicationAreas: ["Autoimmune Diagnostics"],
  signals: [{
    type: "FDA_CLEARANCE",
    date: "2025-12-10",
    confidence: 0.9,
    description: "FDA 510(k) for Euroimmun Anti-dsDNA ELISA"
  }]
}

A conference exhibitor listing becomes:

{
  companyName: "DIARECT AG",
  applicationAreas: ["Autoimmune Diagnostics"],
  signals: [{
    type: "CONFERENCE",
    date: "2025-10-15",
    confidence: 0.6,
    description: "Presenting new autoimmune panel at ADLM 2025"
  }]
}

Different raw data, same ontology. This is the central value of a well-designed graph ontology: it makes disparate data sources queryable through a single model.
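One way to keep every adapter honest about the shared model is to validate signals before they are written. The following is a sketch under assumptions: validateSignal and SignalInput are hypothetical names, and KNOWN_SIGNAL_TYPES lists only the six types that appear in this article, not the full enumeration.

```typescript
// Only the signal types named in this article; the complete set is not shown.
const KNOWN_SIGNAL_TYPES = new Set([
  "FDA_CLEARANCE",
  "CLINICAL_TRIAL",
  "PATENT",
  "CONFERENCE",
  "RESEARCH_PUBLICATION",
  "FUNDING",
]);

interface SignalInput {
  type: string;
  date: string;
  confidence: number;
  description: string;
}

// Returns a list of problems; an empty list means the signal conforms.
function validateSignal(s: SignalInput): string[] {
  const problems: string[] = [];
  if (!KNOWN_SIGNAL_TYPES.has(s.type)) {
    problems.push("unknown type: " + s.type);
  }
  // Written as a negation so that NaN is also rejected.
  if (!(s.confidence >= 0 && s.confidence <= 1)) {
    problems.push("confidence outside [0, 1]");
  }
  if (!/^\d{4}-\d{2}-\d{2}$/.test(s.date)) {
    problems.push("date is not YYYY-MM-DD");
  }
  return problems;
}

const conformingProblems = validateSignal({
  type: "CONFERENCE",
  date: "2025-10-15",
  confidence: 0.6,
  description: "example",
});
const rejectedProblems = validateSignal({
  type: "UNLISTED_TYPE",
  date: "15.10.2025",
  confidence: 1.4,
  description: "example",
});
```

Here conformingProblems is empty, while rejectedProblems reports all three violations.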

Why Confidence Is Part of the Ontology

Notice the confidence field on every signal. The ontology encodes uncertainty because not all data sources are equally reliable. An FDA clearance (confidence 0.9) is a stronger signal than a news article mentioning a company (confidence 0.5). By embedding confidence as a property, the scoring engine can weight signals by source reliability:

const signalScore = signals.reduce((sum, s) =>
  sum + (weights[s.type] ?? 1) * s.confidence * recencyMultiplier(s.date), 0
); // intended range [0-40]; the raw sum is not clamped in this excerpt

This is a deliberate ontology decision: uncertainty is a first-class property of your data model, not an afterthought.
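The article does not show recencyMultiplier or how the score stays within its range, so the sketch below fills both in with assumptions: an exponential decay with a 180-day half-life and a simple clamp at 40. The weight of 10 for FDA_CLEARANCE is likewise invented for the example, and the current time is passed in to keep the function deterministic.

```typescript
interface ScoredSignal {
  type: string;
  date: string; // YYYY-MM-DD
  confidence: number;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;
const HALF_LIFE_DAYS = 180; // assumed value, chosen only for this example
const MAX_SIGNAL_SCORE = 40;

// Exponential decay: 1 for a signal dated today, 0.5 after one half-life.
function recencyMultiplier(isoDate: string, now: Date): number {
  const ageDays = Math.max(0, (now.getTime() - Date.parse(isoDate)) / MS_PER_DAY);
  return Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
}

function computeSignalScore(
  signals: ScoredSignal[],
  weights: Record<string, number>,
  now: Date
): number {
  const rawSum = signals.reduce(
    (sum, s) =>
      sum + (weights[s.type] ?? 1) * s.confidence * recencyMultiplier(s.date, now),
    0
  );
  return Math.min(MAX_SIGNAL_SCORE, rawSum); // keep the result within [0, 40]
}

const scoringDate = new Date("2025-12-10T00:00:00Z");
const clearanceScore = computeSignalScore(
  [{ type: "FDA_CLEARANCE", date: "2025-12-10", confidence: 0.9 }],
  { FDA_CLEARANCE: 10 },
  scoringDate
);
// clearanceScore is 10 * 0.9 * 1 = 9 for a clearance dated on the scoring day
```

The same clearance scored 180 days later would contribute half as much, which is how a high-confidence but stale signal ends up ranked below a fresh one.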

What Makes a Good Graph Ontology

Drawing from the LeadGraph example and general graph database practice, here are the principles:

1. Entities Are Nodes, Values Are Properties

A common mistake is storing important domain concepts as properties on other nodes. In LeadGraph, Application is a separate node, not a string array on Company. This seems trivial but has major implications:

You can query all companies in an application area without full scans.
You can attach metadata to the application area itself (market size, growth rate).
You can join through application areas across different entity types (companies and products).

Rule of thumb: if you join through it, attach metadata to it, or share it across entity types, it should be a node. If you only filter on it, an indexed property is usually enough.

2. Relationships Are Named, Not Tagged

Another common anti-pattern is using a generic RELATED_TO relationship with a type property to distinguish semantics. This forces every query to filter by property values, destroying performance and readability.

LeadGraph uses distinct relationship types (DEVELOPS, SUPPLIES, HAS_SIGNAL) that are self-documenting and let Neo4j follow only the relevant edges without inspecting relationship properties.
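Named relationship types can also be written down as a typed schema in application code, so that an edge with the wrong endpoints is caught before it reaches the database. This is a sketch, assuming the labels and relationship directions listed in the Relationship Semantics section; NodeLabel, RelType, REL_SCHEMA, and isValidEdge are hypothetical names.

```typescript
type NodeLabel =
  | "Company" | "Product" | "Application" | "Signal"
  | "Contact" | "PipelineStage" | "Activity";

type RelType =
  | "SUPPLIES" | "DEVELOPS" | "HAS_SIGNAL" | "USED_IN"
  | "CONTACT_AT" | "HAS_ACTIVITY" | "IN_STAGE";

// Allowed start and end labels for each relationship type.
const REL_SCHEMA: Record<RelType, { from: NodeLabel; to: NodeLabel }> = {
  SUPPLIES: { from: "Company", to: "Product" },
  DEVELOPS: { from: "Company", to: "Application" },
  HAS_SIGNAL: { from: "Company", to: "Signal" },
  USED_IN: { from: "Product", to: "Application" },
  CONTACT_AT: { from: "Contact", to: "Company" },
  HAS_ACTIVITY: { from: "Contact", to: "Activity" },
  IN_STAGE: { from: "Contact", to: "PipelineStage" },
};

// True only when the edge matches the declared direction and labels.
function isValidEdge(from: NodeLabel, rel: RelType, to: NodeLabel): boolean {
  const rule = REL_SCHEMA[rel];
  return rule.from === from && rule.to === to;
}

const validDevelops = isValidEdge("Company", "DEVELOPS", "Application"); // true
const invalidDevelops = isValidEdge("Product", "DEVELOPS", "Application"); // false
```

Because RelType is a closed union, a misspelled or generic type such as RELATED_TO fails at compile time rather than silently creating a new kind of edge.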

3. The Ontology Must Be Explicitly Seeded

LeadGraph seeds its ontology in code — the application areas and Siemens product portfolio are defined as TypeScript arrays:

const APPLICATION_AREAS = [
  "Hemostasis & Thrombosis",
  "Plasma Proteins",
  "Infectious Disease & Serology",
  "Cardiac Markers",
  "Oncology & Tumor Markers",
  "Autoimmune Diagnostics",
  "Specialty Proteins & Reagents",
];

This is not an accident. An ontology that emerges organically from data is rarely coherent. Explicit seeding means every node label, relationship type, and property constraint is a conscious design decision.
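How the array reaches the database is not shown in the article. One common approach is a single parameterised MERGE, sketched below; buildApplicationSeed and CypherCall are hypothetical names, and the query is an assumption about how such a seed could be written rather than the statement seedGraph() actually issues.

```typescript
const SEED_APPLICATION_AREAS: readonly string[] = [
  "Hemostasis & Thrombosis",
  "Plasma Proteins",
  "Infectious Disease & Serology",
  "Cardiac Markers",
  "Oncology & Tumor Markers",
  "Autoimmune Diagnostics",
  "Specialty Proteins & Reagents",
];

interface CypherCall {
  query: string;
  params: { names: string[] };
}

// MERGE keeps seeding idempotent: re-running it matches existing
// Application nodes instead of creating duplicates.
function buildApplicationSeed(names: readonly string[]): CypherCall {
  return {
    query: "UNWIND $names AS areaName MERGE (a:Application {name: areaName})",
    params: { names: names.slice() },
  };
}

const applicationSeed = buildApplicationSeed(SEED_APPLICATION_AREAS);
// applicationSeed.params.names holds all seven application areas
```

Passing the names as a parameter rather than concatenating them into the query avoids quoting problems with values such as "Hemostasis & Thrombosis".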

4. External Data Must Be Normalised, Not Imported Raw

The adapter pattern is critical. Each data source maps to the shared ontology via normalize(). If you import raw FDA data in FDA's schema and raw patent data in the patent office's schema, you have not built a knowledge graph — you have a data lake.
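A stub adapter shows what this translation layer can look like. Everything beyond the SourceAdapter interface is an assumption here: the RawLead fields (applicant, decisionDate, device) are invented rather than taken from any real API schema, and LeadCandidate is inferred from the normalised examples earlier in the article.

```typescript
// Hypothetical raw record; real sources each have their own schema.
interface RawLead {
  applicant: string;
  decisionDate: string;
  device: string;
}
interface CandidateSignal {
  type: string;
  date: string;
  confidence: number;
  description: string;
}
interface LeadCandidate {
  companyName: string;
  domain?: string;
  applicationAreas: string[];
  signals: CandidateSignal[];
}
interface SourceAdapter {
  readonly id: string;
  fetch(): Promise<RawLead[]>;
  normalize(raw: RawLead): LeadCandidate;
  healthCheck(): Promise<boolean>;
}

// Serves fixed records instead of calling an external API.
class StaticClearanceAdapter implements SourceAdapter {
  readonly id = "static-clearances";
  private readonly records: RawLead[];

  constructor(records: RawLead[]) {
    this.records = records;
  }

  fetch(): Promise<RawLead[]> {
    return Promise.resolve(this.records);
  }

  // Source-specific fields go in; the shared ontology shape comes out.
  normalize(raw: RawLead): LeadCandidate {
    return {
      companyName: raw.applicant,
      applicationAreas: [], // classification is out of scope for this sketch
      signals: [{
        type: "FDA_CLEARANCE",
        date: raw.decisionDate,
        confidence: 0.9,
        description: "510(k) clearance: " + raw.device,
      }],
    };
  }

  healthCheck(): Promise<boolean> {
    return Promise.resolve(this.records.length > 0);
  }
}

const stubAdapter = new StaticClearanceAdapter([]);
const normalisedLead = stubAdapter.normalize({
  applicant: "Example Diagnostics",
  decisionDate: "2025-01-01",
  device: "Example Assay",
});
```

Downstream code only ever sees LeadCandidate, so adding a new source means writing one more normalize() rather than touching the graph model.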

5. Constraints Are Ontology Enforcement

Neo4j constraints (REQUIRE ... IS UNIQUE) are not optional performance hints. They enforce the ontology at the database level. If two source adapters report the same company under different spellings, both spellings should normalise to the same normalizedName, and the uniqueness constraint then prevents a second Company node from being created. The normalizeCompanyName() function produces that key:

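export function normalizeCompanyName(name: string): string {
  return name
    .toLowerCase()
    // Drop punctuation; keep ASCII letters, digits, and whitespace.
    .replace(/[^a-z0-9\s]/g, "")
    // Strip legal-form suffixes as whole words only, so "ag" inside
    // "diagnostics" or "inc" inside "princeton" is left intact.
    .replace(/\b(gmbh|ag|limited|ltd|inc|corp|llc)\b/g, "")
    .trim()
    // Collapse remaining whitespace runs into single hyphens.
    .replace(/\s+/g, "-");
}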
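Once names share a key, records from different sources can be folded into one company. The sketch below groups sightings by normalised name; mergeByNormalizedName, Sighting, and MergedCompany are hypothetical, and the normaliser is injected so the example does not depend on any particular implementation. The demoNormalise function is a deliberately crude stand-in used only for the demonstration.

```typescript
interface Sighting {
  companyName: string;
  source: string;
}
interface MergedCompany {
  key: string;
  names: string[];
  sources: string[];
}

// Groups sightings whose names normalise to the same key.
function mergeByNormalizedName(
  sightings: Sighting[],
  normalise: (rawName: string) => string
): MergedCompany[] {
  const byKey = new Map<string, MergedCompany>();
  sightings.forEach((s) => {
    const key = normalise(s.companyName);
    const entry: MergedCompany =
      byKey.get(key) ?? { key: key, names: [], sources: [] };
    if (entry.names.indexOf(s.companyName) === -1) {
      entry.names.push(s.companyName);
    }
    entry.sources.push(s.source);
    byKey.set(key, entry);
  });
  return Array.from(byKey.values());
}

// Stand-in for the demo only: lower-case and drop a trailing " ag".
const demoNormalise = (rawName: string): string =>
  rawName.toLowerCase().replace(/\s+ag$/, "");

const mergedCompanies = mergeByNormalizedName(
  [
    { companyName: "Euroimmun AG", source: "fda" },
    { companyName: "EUROIMMUN", source: "medica" },
  ],
  demoNormalise
);
// mergedCompanies has one entry with key "euroimmun" and both sources
```

In the database itself, the same effect is usually achieved by a MERGE on normalizedName, with the uniqueness constraint as the backstop.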

Common Ontology Mistakes

From building and iterating on the LeadGraph ontology, the recurring pitfalls are:

Over-normalisation. Splitting everything into nodes creates traversal hell. A company's region (EUROPE, NORTH_AMERICA) is a property, not a node: querying "all companies in Europe" requires only a property index, not an extra hop through a Region node.

Under-normalisation. Storing application areas as a comma-separated string on Company loses the ability to query across the application dimension. The rule of thumb above applies.

Ignoring time. Signals have dates, contacts have activity timestamps, pipeline stages have entry dates. If your ontology does not model time, your graph cannot answer "what changed."

Treating all relationships as equal. Putting a type property on a HAS_SIGNAL relationship is not the same as pointing a HAS_SIGNAL relationship at a Signal node that carries the type. The latter lets the signal hold its own properties (confidence, date, description) and exist independently of any single relationship.
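The "ignoring time" pitfall is easiest to appreciate with a query that depends on dates. The sketch below answers "what changed since a given day" over an in-memory list; signalsSince and DatedSignal are hypothetical names, and it assumes dates are stored as YYYY-MM-DD strings, as in the normalised examples earlier.

```typescript
interface DatedSignal {
  company: string;
  type: string;
  date: string; // YYYY-MM-DD
}

// ISO dates in YYYY-MM-DD form sort lexicographically, so plain string
// comparison is enough to select and order them.
function signalsSince(signals: DatedSignal[], sinceIso: string): DatedSignal[] {
  return signals
    .filter((s) => s.date >= sinceIso)
    .sort((a, b) => (a.date < b.date ? 1 : a.date > b.date ? -1 : 0)); // newest first
}

const recentSignals = signalsSince(
  [
    { company: "Euroimmun AG", type: "FDA_CLEARANCE", date: "2025-12-10" },
    { company: "DIARECT AG", type: "CONFERENCE", date: "2025-10-15" },
  ],
  "2025-11-01"
);
// recentSignals contains only the FDA clearance from 2025-12-10
```

Without a date on the Signal node, this question cannot be asked at all, which is why time belongs in the ontology rather than in ingestion logs.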

Testing the Ontology

LeadGraph tests the ontology through integration tests that seed the graph and verify query results:

// From neo4j.test.ts
test("seeded ontology has correct structure", async () => {
  const result = await seedGraph();
  expect(result.constraintsCreated).toBe(3);
  expect(result.companiesSeeded).toBeGreaterThan(1);
  expect(result.applicationAreas).toBe(7);
  expect(result.productsSeeded).toBeGreaterThan(0);
});

More importantly, the scoring tests validate that the ontology supports the expected queries:

test("product fit score uses application overlap", async () => {
  const company = { applications: ["Hemostasis & Thrombosis"] };
  const siemensApps = new Set(["Hemostasis & Thrombosis", "Cardiac Markers"]);
  const score = computeProductFit(company, siemensApps);
  expect(score).toBeGreaterThan(0);
});

If the ontology changes in a way that breaks seeding or scoring, these tests fail, which provides a safety net for schema evolution.

Summary

The LeadGraph project demonstrates that ontology design is among the most impactful decisions in a graph database project. It determines what queries are possible, how performant they are, and whether the graph remains coherent as new data sources are added.

The key takeaways for building your own graph ontology:

Explicitly seed your ontology: do not let it emerge from data.
Entities are nodes, values are properties: if you join through it or attach metadata to it, make it a node.
Name your relationships: generic RELATED_TO is a code smell.
Normalise external data into your ontology: each source adapter is a translation layer.
Model uncertainty: confidence scores are first-class ontology properties.
Enforce with constraints: uniqueness constraints are ontology guarantees, not performance hacks, and indexes keep the ontology's key lookups fast.
Test the ontology: if a schema change breaks queries, your tests should catch it.

A graph without ontology is just a collection of nodes. A graph with ontology is a knowledge graph.