Aman Panjwani
12 min read

Building the Knowledge Graph: From Structured Data

Knowledge Graphs, Structured Data, Graph Modeling, Neo4j

Continuing the Knowledge Graph Journey

This article builds directly on the ideas introduced earlier in this series.

So far, we’ve taken a layered approach to understanding Knowledge Graphs, starting with first principles and gradually moving closer to implementation.

We began with Knowledge Graphs 101: Building on Four Pillars of Intelligence, where we established what makes a system “intelligent” beyond storage. Structure, meaning, context, and reasoning are not features. They are foundations.

Next, in Knowledge Acquisition in AI: How Knowledge Graphs and LLMs Learn Differently, we compared two very different learning mechanisms. Knowledge Graphs learn by explicitly connecting facts. Large Language Models learn by absorbing patterns in language. One provides clarity and traceability. The other provides flexibility and intuition. Understanding this difference is critical before trying to combine them.

Then, in Designing the Mental Model: The Thinking Framework Behind Knowledge Graphs, we slowed down deliberately. Before writing code or importing data, we focused on how to think. Using a CRISP-DM–inspired framework, we translated business goals into modeling decisions. What to include. What to ignore. What questions the graph must be able to answer.

This article picks up from that point.

Here, we move from thinking to structure.
From intention to representation.

Using a concrete domain and simple structured data, we will walk through how to turn CSV files into a clean, inspectable Knowledge Graph. The goal is not analytics or AI. The goal is correctness. A graph that accurately reflects the reality it is meant to represent.

Everything that comes later depends on getting this part right.

1. Why Structured Data Is the Right Place to Start

When people think about Knowledge Graphs, they often jump straight to unstructured data. Documents, PDFs, text, embeddings, LLMs.

That jump is understandable, but it’s usually the wrong starting point if you are exploring.

Most organizations already have something far more valuable: structured data that represents how their business actually works. Orders, products, suppliers, systems, transactions. These datasets may look boring, but they encode real relationships. They just happen to be flattened into tables.

Instead of asking, “Which tables do I need to join?”, we start asking, “How do these things connect?” That shift matters more than any algorithm we add later.

Starting with structured data gives us a few clear advantages.

We already know the entities. The relationships already exist, even if they’re buried in tables. We’re not guessing meaning. We’re making it explicit.

It also gives us control. We can enforce identity, inspect the graph as it grows, and trace problems back to specific data or decisions.

And it forces discipline. By holding off on AI, we make sure the graph can stand on its own. If the structure is weak, no layer of intelligence will fix it.

The goal here isn’t complexity. It’s trust.

Once the structure is right, intelligence becomes an addition. Not a crutch.

Next, we need to be clear about the question this graph is meant to answer. Without that, even a clean graph turns into noise.

2. The Question Comes First

A Knowledge Graph without a question is just a collection of connected facts. It might look impressive, but it won’t be useful.

For this example, we’ll work with a simple but realistic question:

How does supplier risk impact products, warehouses, and customer orders?

This question forces us to think beyond isolated tables. Risk doesn’t live in one place. It flows. From suppliers to products. From products to warehouses. From warehouses to orders.

That flow is exactly what graphs are good at representing.

By starting with the question, we constrain the model. We know which entities matter. We know which relationships must exist. And just as importantly, we know what we can ignore for now.

This step comes straight from the business understanding phase of the mental model we discussed earlier. It keeps the graph grounded in purpose instead of growing in random directions.

With the question clear, we can now define the domain and scope of what we’re going to model.

3. Domain and Scope

To keep the focus on structure, we’ll use a supply chain domain.

Not because it’s flashy, but because it’s familiar. Most people understand suppliers, products, warehouses, and orders. That lets us spend our time on modeling decisions instead of domain explanations.

We’ll be intentional about scope.

We’ll model suppliers, the products they provide, where those products are stored, and how orders are fulfilled. That’s enough to answer our core question about risk and impact.

We’ll also leave things out on purpose.

There are no customers as entities. No real-time streams. No optimization logic. Those details matter later, but they would distract us here.

This isn’t about modeling everything. It’s about modeling the right things.

With the domain and boundaries set, we can look at the data we’ll use to build the graph.

4. The Data We’ll Use

We’ll build this graph entirely from structured data, using a small set of CSV files.

Each file represents something the business already understands. Suppliers. Products. Warehouses. Orders. Movements between them.

What matters is how these files relate to each other.

Some files describe core entities. Others exist only to describe relationships or events. That distinction is important, because it determines what becomes a node and what becomes a relationship in the graph.

suppliers.csv : Core supplier information, including geography and risk tier.

products.csv : The products that move through the supply chain.

supplier_products.csv : Which suppliers provide which products, including lead time and cost.

warehouses.csv : Physical locations where products are stored.

inventory.csv : Current product availability at each warehouse.

orders.csv : Customer orders and how they are fulfilled.

shipments.csv : Movement of goods from suppliers to warehouses.

Individually, each of these files is easy to understand. The value appears when we connect them into a single structure.

Before we import anything, we need to decide how these tables translate into a graph model. That’s where rows stop being records and start becoming knowledge.

5. From Tables to a Graph Model

This is the point where most confusion starts.

In tables, everything looks the same. Rows and columns. Joins when you need them. In a graph, we have to be more deliberate.

Not every table becomes a node.
Not every row deserves its own identity.

Some datasets describe things that exist on their own. Those become nodes. Others exist only to describe how things relate or interact. Those become relationships.

In our case, suppliers, products, warehouses, orders, and shipments all have their own identity. They represent real-world entities or events. They become nodes.

Files like supplier to product mappings or inventory snapshots don’t stand on their own. Their job is to connect things or add context. Those become relationships, often with properties attached.
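To make the distinction concrete, here is a minimal sketch of the two shapes, written in the Cypher we'll use for the import shortly. One supplier and one product become nodes; a row from the supplier-to-product mapping becomes a relationship carrying that row's columns. The IDs and values here are made up for illustration, and the real imports follow later.

Cypher
// Things with their own identity become nodes (illustrative IDs)
MERGE (s:Supplier {supplier_id: 'SUP001'})
MERGE (p:Product {product_id: 'PRD001'})
// A row from a mapping file becomes a relationship, and its columns become properties
MERGE (s)-[r:SUPPLIES]->(p)
SET r.lead_time_days = 14, r.cost_per_unit = 2.5;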

This is where the mental model from the previous article shows up in practice.

We’re not modeling data for storage. We’re modeling reality for reasoning.

Once we’re clear on what becomes a node and what becomes a relationship, we can lock down identity. Because without stable identity, even the best-looking graph quietly breaks.

6. Connecting to Neo4j and Getting Ready to Import

At this point, we’re done with setup and decisions. It’s time to build.

Open Neo4j and connect to a running database. If you’re using Neo4j Desktop, this means starting the database and opening the Neo4j Browser. If you’re using another setup, the goal is the same. Get access to a Cypher prompt connected to your graph.

Nothing about the model or data is tied to Neo4j specifically. Any property graph database with support for nodes, relationships, and properties would work. Neo4j simply gives us a clean environment to demonstrate the process.

One important detail before we start importing. We’ll load the CSV files directly from GitHub using raw URLs. That means there’s no local setup required. As long as your database can access the internet, you can follow along exactly.

From here on, each import step follows the same pattern. We stream rows from a CSV file, match or create the relevant nodes, and then connect them.
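In schematic form, every import in the next section is a variation of the template below. The labels, relationship type, column names, and URL here are placeholders rather than a real file; the concrete versions come next.

Cypher
// Schematic only: the URL, labels, and properties are placeholders
LOAD CSV WITH HEADERS FROM '<raw-github-csv-url>' AS row
MERGE (a:EntityA {id: row.a_id})          // match or create one node
WITH row, a
MATCH (b:EntityB {id: row.b_id})          // find the node it connects to
MERGE (a)-[r:RELATES_TO]->(b)             // make the connection explicit
SET r.some_property = row.some_property;  // keep context on the relationship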

We’ll start with the core entities first, then layer relationships and events on top. This keeps the graph stable as it grows and makes it easier to inspect at each step.

Let’s begin with the first import.

7. Create the Database and Import Everything

Now we build. Copy these queries into Neo4j Browser and run them in order.

Two rules:

  • Run the database creation and constraint queries first.
  • For LOAD CSV, keep each whole pipeline in a single query. The only semicolon goes at the very end.
Cypher
CREATE DATABASE supplychain IF NOT EXISTS;
:use supplychain

This creates an isolated database so you don’t mix experiments with other graphs.

Cypher
CREATE CONSTRAINT supplier_id IF NOT EXISTS FOR (s:Supplier) REQUIRE s.supplier_id IS UNIQUE;
CREATE CONSTRAINT product_id IF NOT EXISTS FOR (p:Product) REQUIRE p.product_id IS UNIQUE;
CREATE CONSTRAINT warehouse_id IF NOT EXISTS FOR (w:Warehouse) REQUIRE w.warehouse_id IS UNIQUE;
CREATE CONSTRAINT order_id IF NOT EXISTS FOR (o:Order) REQUIRE o.order_id IS UNIQUE;
CREATE CONSTRAINT shipment_id IF NOT EXISTS FOR (sh:Shipment) REQUIRE sh.shipment_id IS UNIQUE;

These constraints stop silent duplicates. If you rerun imports, you won’t accidentally create a second “SUP001”.
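If you want to confirm they exist before importing anything, Neo4j can list them:

Cypher
SHOW CONSTRAINTS;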

Import core entities first

Cypher
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/aman-panjwani/knowledge-graph/main/suppliers.csv' AS row
MERGE (s:Supplier {supplier_id: row.supplier_id})
SET s.supplier_name = row.supplier_name,
    s.country = row.country,
    s.risk_tier = row.risk_tier;
Cypher
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/aman-panjwani/knowledge-graph/main/products.csv' AS row
MERGE (p:Product {product_id: row.product_id})
SET p.product_name = row.product_name,
    p.category = row.category;
Cypher
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/aman-panjwani/knowledge-graph/main/warehouses.csv' AS row
MERGE (w:Warehouse {warehouse_id: row.warehouse_id})
SET w.location = row.location,
    w.capacity_units = toInteger(row.capacity_units);

This establishes the “things” in the graph before we start connecting them.
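Before layering relationships on top, a quick peek confirms the core entities landed with the properties set in the imports above:

Cypher
MATCH (s:Supplier)
RETURN s.supplier_id, s.supplier_name, s.country, s.risk_tier
LIMIT 5;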

Add relationships and state

Cypher
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/aman-panjwani/knowledge-graph/main/supplier_products.csv' AS row
MATCH (s:Supplier {supplier_id: row.supplier_id})
MATCH (p:Product {product_id: row.product_id})
MERGE (s)-[r:SUPPLIES]->(p)
SET r.lead_time_days = toInteger(row.lead_time_days),
    r.cost_per_unit = toFloat(row.cost_per_unit);

This converts a join table into an actual relationship with properties (lead time, cost).
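You can inspect the result directly; the property names are the same ones set in the import above:

Cypher
MATCH (s:Supplier)-[r:SUPPLIES]->(p:Product)
RETURN s.supplier_name, p.product_name, r.lead_time_days, r.cost_per_unit
LIMIT 5;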

Cypher
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/aman-panjwani/knowledge-graph/main/inventory.csv' AS row
MATCH (w:Warehouse {warehouse_id: row.warehouse_id})
MATCH (p:Product {product_id: row.product_id})
MERGE (w)-[r:STORES]->(p)
SET r.on_hand_units = toInteger(row.on_hand_units),
    r.last_updated_date = date(row.last_updated_date);

This turns inventory into “Warehouse stores Product”, with on-hand units attached where they belong.

Import orders (node + links)

Cypher
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/aman-panjwani/knowledge-graph/main/orders.csv' AS row
MERGE (o:Order {order_id: row.order_id})
SET o.order_date = date(row.order_date),
    o.customer_region = row.customer_region,
    o.quantity = toInteger(row.quantity)
WITH row, o
MATCH (p:Product {product_id: row.product_id})
MERGE (o)-[:FOR_PRODUCT]->(p)
WITH row, o
MATCH (w:Warehouse {warehouse_id: row.warehouse_id})
MERGE (o)-[:FULFILLED_FROM]->(w);

Orders are events, so we model them as nodes and connect them to their product and warehouse. The WITH clauses keep row and o in scope for the whole pipeline.
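One behavior worth knowing: because the pipeline relies on MATCH, an order row whose product or warehouse ID doesn't resolve is silently dropped at that point, leaving an unlinked Order node rather than an error. A quick check for that, using only the labels and relationship types defined above:

Cypher
// Orders that were created but never linked to a product or a warehouse
MATCH (o:Order)
WHERE NOT (o)-[:FOR_PRODUCT]->() OR NOT (o)-[:FULFILLED_FROM]->()
RETURN o.order_id, o.order_date
LIMIT 25;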

Cypher
LOAD CSV WITH HEADERS FROM 'https://raw.githubusercontent.com/aman-panjwani/knowledge-graph/main/shipments.csv' AS row
MERGE (sh:Shipment {shipment_id: row.shipment_id})
SET sh.shipment_date = date(row.shipment_date),
    sh.expected_delivery_date = date(row.expected_delivery_date),
    sh.status = row.status
WITH row, sh
MATCH (s:Supplier {supplier_id: row.supplier_id})
MERGE (s)-[:SENT]->(sh)
WITH row, sh
MATCH (w:Warehouse {warehouse_id: row.warehouse_id})
MERGE (sh)-[:DESTINED_FOR]->(w);

Shipments are also events, so they become nodes that link each supplier to the warehouse it ships to.
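To confirm the chain reads the way the model intends, you can trace a few shipments end to end; the properties shown are the ones set in the imports above:

Cypher
MATCH (s:Supplier)-[:SENT]->(sh:Shipment)-[:DESTINED_FOR]->(w:Warehouse)
RETURN s.supplier_name, sh.shipment_id, sh.status, w.location
LIMIT 5;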

8. Inspect the Graph in 60 Seconds

Run these right after import. They tell you fast if the graph is sane.

1) Node counts by type

Cypher
MATCH (n)
RETURN labels(n) AS label, count(*) AS count
ORDER BY count DESC;

You should see non-zero counts for Supplier, Product, Warehouse, Order, Shipment.

2) Relationship counts by type

Cypher
MATCH ()-[r]->()
RETURN type(r) AS rel, count(*) AS count
ORDER BY count DESC;

This confirms your edges actually got created (SUPPLIES, STORES, FOR_PRODUCT, FULFILLED_FROM, SENT, DESTINED_FOR).
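A third check worth running is a scan for isolated nodes. In a well-formed load, every node in this model should have at least one relationship, so anything this returns usually points to an ID mismatch between files:

Cypher
// Nodes that ended up with no relationships at all
MATCH (n)
WHERE NOT (n)--()
RETURN labels(n) AS label, count(*) AS isolated_nodes;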

Quick spot-check: high-risk suppliers and what they impact

Cypher
MATCH (s:Supplier {risk_tier:'high'})-[:SUPPLIES]->(p:Product)
OPTIONAL MATCH (o:Order)-[:FOR_PRODUCT]->(p)
RETURN s.supplier_name, count(DISTINCT p) AS products, count(DISTINCT o) AS orders
ORDER BY orders DESC;

This validates the core question. Risk should connect to products, and ideally to orders through those products.

9. Seeing Risk From Two Angles and One Graph

Up to now, we’ve been preparing the ground. This is where the graph starts paying us back.

We started with a simple question:

How does supplier risk flow through products, warehouses, and orders?

Instead of answering it with one big table, we’ll look at it from two practical angles. Then we’ll look at the graph itself.

View 1. Warehouse impact

This view answers a very operational question.

If a high-risk supplier causes disruption, which warehouses feel it first?

Cypher
MATCH (s:Supplier {risk_tier:'high'})-[:SUPPLIES]->(p:Product)
MATCH (o:Order)-[:FOR_PRODUCT]->(p)
MATCH (o)-[:FULFILLED_FROM]->(w:Warehouse)
RETURN w.warehouse_id AS warehouse,
       w.location AS location,
       count(DISTINCT o) AS orders_impacted,
       sum(o.quantity) AS units_impacted,
       count(DISTINCT p) AS products_exposed,
       count(DISTINCT s) AS high_risk_suppliers
ORDER BY orders_impacted DESC, units_impacted DESC
LIMIT 25;

This tells you where risk concentrates. Warehouses with high order and unit impact are the ones you watch closely.

View 2. Product impact

Now we flip the perspective.
Instead of asking where risk lands, we ask what spreads it.

Cypher
MATCH (s:Supplier {risk_tier:'high'})-[:SUPPLIES]->(p:Product)
MATCH (o:Order)-[:FOR_PRODUCT]->(p)
OPTIONAL MATCH (o)-[:FULFILLED_FROM]->(w:Warehouse)
RETURN p.product_id AS product_id,
       p.product_name AS product,
       p.category AS category,
       count(DISTINCT o) AS orders_impacted,
       sum(o.quantity) AS units_impacted,
       count(DISTINCT w) AS warehouses_fulfilling
ORDER BY orders_impacted DESC, units_impacted DESC
LIMIT 25;
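Products at the top of this list are the ones spreading risk the widest: many impacted orders and units, with the last column showing how many warehouses that exposure is spread across.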

The graph itself

Numbers are useful. Seeing the connections is better.
This query returns the full path so Neo4j can render it visually.

Cypher
MATCH path = (s:Supplier {risk_tier:'high'})-[:SUPPLIES]->(p:Product)
             <-[:FOR_PRODUCT]-(o:Order)-[:FULFILLED_FROM]->(w:Warehouse)
RETURN path
LIMIT 50;

When you run this in Neo4j Browser, you’ll see suppliers, products, orders, and warehouses linked together. This is the same information as the tables above, but expressed as structure instead of aggregation.

At this point, the value of the Knowledge Graph should feel obvious. We’re not querying rows anymore. We’re following how risk actually moves through the system.

Wrapping Up

What we did here was deliberately simple.

We didn’t add AI.
We didn’t add embeddings.
We didn’t optimize anything.

We focused on one thing only. Representing reality correctly.

Starting from structured CSVs, we turned isolated tables into a connected model that can answer questions tables struggle with. We made relationships explicit. We made flows visible. And we did it in a way that you can inspect, validate, and trust.

That’s the real work in building a Knowledge Graph.

Once the structure is right, everything else becomes easier. Queries become shorter. Explanations become clearer. And mistakes become easier to spot.

Most teams rush past this step. They jump straight to intelligence and end up reasoning on top of shaky foundations. This article was about not doing that.

If you can build a clean graph from boring structured data, you can build one from anything.

Up next:

In the upcoming articles, we’ll start adding intelligence on top of this graph.

Not by changing the structure, but by building on it. We’ll use this same model to introduce graph analytics, reasoning, and AI-driven workflows. Community detection, impact propagation, and LLM-based interaction all make more sense once the underlying structure is solid.

The important point is this. Intelligence doesn’t replace structure. It amplifies it.

By starting with a graph that accurately represents reality, we give every layer above it something reliable to work with. That’s the foundation we’ll keep building on next.
