Recommendation Systems (8): Knowledge Graph-Enhanced Recommendation

Learn how knowledge graphs supercharge recommendation systems by adding semantic understanding. Covers RippleNet, KGCN, KGAT, CKE, and path-based reasoning -- with intuitive explanations, real-world analogies, and working Python code.

When you search for The Dark Knight on a streaming platform, the system does not merely log that you watched it. It knows Christian Bale played Batman, Christopher Nolan directed it, it belongs to the Batman trilogy, and it shares cinematic DNA with other cerebral action films. This rich semantic web is a knowledge graph (KG) – a structured network of entities (movies, actors, directors, genres) connected by typed relations (acted_in, directed_by, part_of).

Why does this matter for recommendations? Because pure collaborative filtering has a blind spot: it can only recommend items that already have interaction history. A brand-new film with zero views is invisible. But if that film shares a director with movies you love, a knowledge graph sees the connection on day one. KGs transform recommendation from raw pattern matching into semantic reasoning.


What You Will Learn

  • What knowledge graphs are and how they encode real-world facts
  • How KG embeddings (TransE, TransR, DistMult) represent entities and relations as vectors
  • Propagation-based methods: RippleNet spreads user preferences like ripples in water
  • Graph convolutional methods: KGCN and KGAT learn item representations from KG neighbors
  • Embedding fusion: CKE combines collaborative, structural, and textual signals
  • Path-based reasoning for explainable recommendations
  • Working PyTorch code for every major method

Prerequisites

  • Basic Python and PyTorch (tensors, nn.Module, training loops)
  • Familiarity with graph neural networks (Part 7 of this series)
  • Comfort with embedding concepts (Part 5 )

Foundations of Knowledge Graphs

What Is a Knowledge Graph?

Movie knowledge graph: movies, people, genres and collections linked by typed relations

A knowledge graph stores facts as (head, relation, tail) triples. Each triple states one atomic fact:

  • (The Dark Knight, directed_by, Christopher Nolan)
  • (The Dark Knight, starred, Christian Bale)
  • (The Dark Knight, has_genre, Action)
  • (Christian Bale, acted_in, The Prestige)

Formally, a knowledge graph is a set $\mathcal{G} = \{(h, r, t)\}$ where $h \in \mathcal{E}$ is the head entity, $r \in \mathcal{R}$ is the relation type, and $t \in \mathcal{E}$ is the tail entity. $\mathcal{E}$ is the set of all entities and $\mathcal{R}$ is the set of all relation types.

Analogy. Think of a knowledge graph as a Wikipedia-scale fact database, but stored as a graph instead of prose. Each article is a node, and every link between articles is a labeled edge – the label tells you why they are connected.

Real-World Knowledge Graphs

Knowledge GraphEntitiesFactsUsed by
Freebase39M1.9BGoogle (deprecated)
Wikidata100M+1.4B+Wikipedia, Google
DBpedia6M580MAcademic research
Amazon Product GraphBillionsTrillionsAmazon recommendations

Knowledge Graph Structure for Recommendation

Recommendation KGs are typically heterogeneous – they contain multiple entity and relation types:

Entity types

  • Users: $U = \{u_1, u_2, \ldots, u_m\}$
  • Items: $I = \{i_1, i_2, \ldots, i_n\}$
  • Attributes: genres, actors, directors, brands, …

Relation types

  • User-item: (user, interacted, item)
  • Item-attribute: (movie, has_genre, Action), (movie, directed_by, Nolan)
  • Attribute-attribute: (Nolan, collaborated_with, Bale)

Knowledge Graph Embeddings

Before feeding a knowledge graph into a recommendation model, we must convert entities and relations into dense vectors. The key question: how do we train these embeddings so they respect the graph’s structure?

TransE: head plus relation vector lands near the tail; training pulls valid tails close and pushes negatives outside the margin

TransE – the simplest and most intuitive method – says: for any valid triple $(h, r, t)$, the vector equation $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$ should hold.

$$\mathcal{L} = \sum_{(h,r,t) \in \mathcal{G}}\; \sum_{(h',r,t') \notin \mathcal{G}} \bigl[\gamma + \|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_2 - \|\mathbf{h}' + \mathbf{r} - \mathbf{t}'\|_2\bigr]_+$$

Plain English. “Push valid triples close together ($\mathbf{h} + \mathbf{r} \approx \mathbf{t}$) and invalid triples far apart. The margin $\gamma$ controls how much separation you require.” The right panel of the figure above shows this: the green dot (valid tail) is pulled inside the margin circle, while gray negatives are pushed outside.

MethodScoring functionBest for
TransE$\|\mathbf{h} + \mathbf{r} - \mathbf{t}\|$Simple 1-to-1 relations
TransR$\|\mathbf{h} M_r + \mathbf{r} - \mathbf{t} M_r\|$Relations needing different vector spaces
DistMult$\mathbf{h}^T \text{diag}(\mathbf{r})\, \mathbf{t}$Symmetric relations
ComplEx$\text{Re}(\mathbf{h}^T \text{diag}(\mathbf{r})\, \bar{\mathbf{t}})$Asymmetric relations

TransE analogy. Think of cities on a map. “Paris” + “fly_east_2000km” should land near “Moscow.” “Paris” + “fly_south_1500km” should land near “Algiers.” The relation vector behaves like a displacement on the map.

Implementation: Knowledge Graph Construction

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
import torch
import torch.nn as nn
from collections import defaultdict
from typing import List, Tuple


class KnowledgeGraph:
    """Simple knowledge graph for recommendation systems."""

    def __init__(self):
        self.entities = {}       # entity_name -> entity_id
        self.relations = {}      # relation_name -> relation_id
        self.triples = []        # List of (h_id, r_id, t_id)
        self.entity_adj = defaultdict(list)  # entity_id -> [(r_id, t_id), ...]

    def add_entity(self, name: str) -> int:
        if name not in self.entities:
            self.entities[name] = len(self.entities)
        return self.entities[name]

    def add_relation(self, name: str) -> int:
        if name not in self.relations:
            self.relations[name] = len(self.relations)
        return self.relations[name]

    def add_triple(self, head: str, relation: str, tail: str):
        h_id = self.add_entity(head)
        r_id = self.add_relation(relation)
        t_id = self.add_entity(tail)
        self.triples.append((h_id, r_id, t_id))
        self.entity_adj[h_id].append((r_id, t_id))

    def get_neighbors(self, entity_id: int) -> List[Tuple[int, int]]:
        return self.entity_adj[entity_id]


# Build a movie knowledge graph
kg = KnowledgeGraph()
kg.add_triple("TheDarkKnight", "directed_by", "ChristopherNolan")
kg.add_triple("TheDarkKnight", "starred", "ChristianBale")
kg.add_triple("TheDarkKnight", "has_genre", "Action")
kg.add_triple("TheDarkKnight", "has_genre", "Crime")
kg.add_triple("Inception", "directed_by", "ChristopherNolan")
kg.add_triple("Inception", "starred", "LeonardoDiCaprio")
kg.add_triple("Inception", "has_genre", "SciFi")
kg.add_triple("ThePrestige", "directed_by", "ChristopherNolan")
kg.add_triple("ThePrestige", "starred", "ChristianBale")
kg.add_triple("ThePrestige", "has_genre", "Drama")

# Explore the graph
dk_id = kg.entities["TheDarkKnight"]
print(f"TheDarkKnight neighbors: {kg.get_neighbors(dk_id)}")
# (directed_by, Nolan), (starred, Bale), (has_genre, Action), (has_genre, Crime)

Why Knowledge Graphs Help Recommendation

Knowledge graphs solve four hard problems that plague pure collaborative filtering.

Four recurring failure modes of collaborative filtering and how a knowledge graph addresses each

1. Cold Start

Problem. A new movie with zero interactions is invisible to collaborative filtering – there is nothing for the model to compare against.

KG solution. Even on day one, the new movie has attributes (director, actors, genre) that connect it to the rest of the graph. If you loved Nolan’s other films, the KG can recommend his new movie immediately, with no interaction data required.

2. Data Sparsity

Problem. Most users interact with a tiny fraction of items. The interaction matrix is 99%+ zeros, leaving the model little signal to work with.

KG solution. The knowledge graph fills the gaps with dense semantic connections. Even if two items share no users, they might share a director, genre, or production studio – and that link still carries information.

3. Explainability

Problem. “Users who liked X also liked Y” is not a satisfying explanation.

KG solution. “We recommend Inception because it was directed by Christopher Nolan, who also directed The Dark Knight, which you rated 5 stars.” The KG yields concrete, interpretable reasoning paths.

4. Diversity

Problem. Collaborative filtering tends to create filter bubbles – recommending more of the same.

KG solution. Different relation paths lead to different kinds of recommendations. Following same_director yields different results than same_genre or same_actor, naturally diversifying the recommendation list.

Concrete Example: After Watching The Dark Knight

Reasoning over a small knowledge graph: from one watch to four candidate recommendations

The figure traces explicit paths from a user’s single watch to four candidate movies:

  • Inception – via Dark Knight -> directed_by -> Nolan -> directed -> Inception
  • The Prestige – two converging paths through Nolan and Bale (a strong signal)
  • Batman Begins – via shared cast and shared genre
  • The system can produce a different ranking and an explanation for each candidate, something pure CF cannot do.

Types of KG-Enhanced Methods

CategoryApproachKey models
Propagation-basedSpread user preferences through the KGRippleNet
Graph convolutionalLearn item embeddings by aggregating KG neighborsKGCN, KGAT
Embedding-basedLearn joint embeddings of users, items, and KG entitiesCKE
Path-basedReason over multi-hop paths for explainabilityKPRN, PathRec

RippleNet: Preference Propagation

RippleNet: user preferences spread outward from history items through hop-1 and hop-2 KG neighbors

The Ripple Analogy

Drop a stone in a pond and ripples spread outward. RippleNet (Wang et al., CIKM 2018) does the same thing with user preferences: when a user interacts with an item, that preference ripples outward through the knowledge graph, activating related entities at increasing distances.

How It Works

Given user $u$ with historical items $V_u = \{v_1, v_2, \ldots\}$:

  1. Hop 0: Start with the user’s items as the initial preference set $\mathcal{S}_u^0 = V_u$.
  2. Hop 1: Find every entity connected to $\mathcal{S}_u^0$ via any relation. These form $\mathcal{S}_u^1$.
  3. Hop 2: Find every entity connected to $\mathcal{S}_u^1$. These form $\mathcal{S}_u^2$.
  4. Aggregate: Compute the user’s enhanced embedding by attention-weighting entities at each hop.

At hop $h$, the relevance of a tail entity $t$ to candidate item $v$ is:

$$p_i^h = \text{softmax}(\mathbf{v}^T \mathbf{R}_i\, \mathbf{t})$$

Plain English. “How relevant is this KG entity to the item we are scoring? Use the relation matrix $\mathbf{R}$ to measure compatibility.”

The user’s preference vector at hop $h$:

$$\mathbf{o}_u^h = \sum_{(h,r,t) \in \mathcal{S}_u^h} p_i^h\, \mathbf{t}$$

The final user embedding combines all hops:

$$\mathbf{u} = \sum_{h=0}^{H} \alpha_h\, \mathbf{o}_u^h$$

Walkthrough. A user watched The Dark Knight:

  • Hop 0: {Dark Knight}
  • Hop 1: {Nolan, Bale, Action, Crime}
  • Hop 2: {Inception, Prestige, Batman Begins, DiCaprio, Drama, …}

The ripples discover that this user might like Inception (connected through Nolan) and The Prestige (connected through both Nolan and Bale – a doubly-supported signal).

Implementation: RippleNet

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
import torch
import torch.nn as nn
import torch.nn.functional as F


class RippleNet(nn.Module):
    """RippleNet: propagating user preferences on the knowledge graph."""

    def __init__(self, num_users, num_items, num_entities, num_relations,
                 embedding_dim=64, n_hop=2, n_memory=32):
        super().__init__()
        self.n_hop = n_hop
        self.n_memory = n_memory
        self.embedding_dim = embedding_dim

        self.user_emb = nn.Embedding(num_users, embedding_dim)
        self.item_emb = nn.Embedding(num_items, embedding_dim)
        self.entity_emb = nn.Embedding(num_entities, embedding_dim)
        self.relation_emb = nn.Embedding(num_relations, embedding_dim)

        self.hop_weights = nn.Parameter(torch.ones(n_hop + 1) / (n_hop + 1))

        for emb in [self.user_emb, self.item_emb,
                    self.entity_emb, self.relation_emb]:
            nn.init.xavier_uniform_(emb.weight)

    def forward(self, user_ids, item_ids, ripple_sets):
        """
        Args:
            user_ids: [batch_size]
            item_ids: [batch_size]
            ripple_sets: ripple_sets[i][h] holds (relation_id, tail_id)
                         tuples for user i at hop h.
        """
        user_base = self.user_emb(user_ids)        # [B, d]
        item_vec = self.item_emb(item_ids)          # [B, d]

        user_enhanced = self._propagate(user_base, item_vec, ripple_sets)
        scores = (user_enhanced * item_vec).sum(dim=1)
        return scores

    def _propagate(self, user_base, item_vec, ripple_sets):
        batch_size = user_base.size(0)
        hop_memories = [user_base]

        for hop in range(self.n_hop):
            hop_embs = []
            for i in range(batch_size):
                rs = ripple_sets[i][hop] if hop < len(ripple_sets[i]) else []
                if not rs:
                    hop_embs.append(torch.zeros(self.embedding_dim))
                    continue

                rs = rs[:self.n_memory]               # cap memory
                rels = torch.LongTensor([r for r, _ in rs])
                tails = torch.LongTensor([t for _, t in rs])

                rel_e = self.relation_emb(rels)       # [K, d]
                tail_e = self.entity_emb(tails)       # [K, d]

                # Score each KG neighbor against the candidate item
                scores = (item_vec[i].unsqueeze(0) * rel_e * tail_e).sum(1)
                probs = F.softmax(scores, dim=0)      # [K]
                hop_embs.append((probs.unsqueeze(1) * tail_e).sum(0))

            hop_memories.append(torch.stack(hop_embs))

        # Weighted combination across hops
        stacked = torch.stack(hop_memories, dim=1)    # [B, H+1, d]
        weights = F.softmax(self.hop_weights, dim=0)
        return (weights.unsqueeze(0).unsqueeze(2) * stacked).sum(dim=1)

    @staticmethod
    def build_ripple_sets(user_items, kg_adj, max_hops=2):
        """Build ripple sets for one user via BFS."""
        ripple_sets = []
        current = set(user_items)
        for _ in range(max_hops):
            next_hop = []
            for entity in current:
                for r_id, t_id in kg_adj.get(entity, []):
                    next_hop.append((r_id, t_id))
            ripple_sets.append(next_hop)
            current = {t for _, t in next_hop}
        return ripple_sets

KGCN: Knowledge Graph Convolutional Networks

A Different Perspective

While RippleNet propagates user preferences through the KG, KGCN (Wang et al., WWW 2019) takes the opposite stance: it builds better item representations by aggregating information from each item’s KG neighborhood.

Analogy. RippleNet asks, “Starting from what you liked, what is nearby in the knowledge graph?” KGCN asks, “For each candidate item, what does its KG neighborhood tell us about it?”

How KGCN Works

For an item $i$ with KG neighbors $\mathcal{N}_i$, KGCN:

  1. Samples $K$ neighbors from the KG.
  2. Weights each neighbor by a learned, relation-aware attention score.
  3. Aggregates the weighted neighbor embeddings.
  4. Combines the aggregate with the item’s own embedding.

At layer $l$:

$$\mathbf{e}_{\mathcal{N}_i}^{(l)} = \sum_{(r,e) \in \mathcal{N}_i} \pi_r(i, e)\; \mathbf{e}_e^{(l-1)}$$

where $\pi_r(i, e)$ is a relation-aware attention weight. The item embedding then updates as:

$$\mathbf{e}_i^{(l)} = \sigma\!\bigl(\mathbf{W}^{(l)}[\mathbf{e}_i^{(l-1)} \,\|\, \mathbf{e}_{\mathcal{N}_i}^{(l)}] + \mathbf{b}^{(l)}\bigr)$$

Plain English. “Look at what is connected to this item in the KG (actors, directors, genres). Weight each connection by how important that relation type is. Blend the neighborhood signal with the item’s own embedding.”

Implementation: KGCN

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
import torch
import torch.nn as nn
import torch.nn.functional as F


class KGCNLayer(nn.Module):
    """Single KGCN layer with relation-aware aggregation."""

    def __init__(self, embedding_dim, num_relations):
        super().__init__()
        self.embedding_dim = embedding_dim

        self.rel_transforms = nn.ModuleList([
            nn.Linear(embedding_dim, embedding_dim, bias=False)
            for _ in range(num_relations)
        ])
        self.attn = nn.ModuleList([
            nn.Linear(embedding_dim, 1)
            for _ in range(num_relations)
        ])
        self.combine = nn.Linear(embedding_dim * 2, embedding_dim)

    def forward(self, item_embs, neighbors, relation_embs):
        num_items = item_embs.size(0)
        aggregated = []

        for i in range(num_items):
            nbrs = neighbors.get(i, [])
            if not nbrs:
                aggregated.append(torch.zeros(self.embedding_dim))
                continue

            nbr_embs, scores = [], []
            for r_id, n_id in nbrs:
                transformed = self.rel_transforms[r_id](item_embs[n_id])
                nbr_embs.append(transformed)
                compatibility = item_embs[i] + relation_embs[r_id]
                scores.append(self.attn[r_id](compatibility))

            nbr_embs = torch.stack(nbr_embs)
            scores = torch.cat(scores)
            weights = F.softmax(scores, dim=0)
            aggregated.append((weights.unsqueeze(1) * nbr_embs).sum(0))

        aggregated = torch.stack(aggregated)
        combined = torch.cat([item_embs, aggregated], dim=1)
        return self.combine(combined)


class KGCN(nn.Module):
    """Knowledge Graph Convolutional Network for recommendation."""

    def __init__(self, num_users, num_items, num_entities, num_relations,
                 embedding_dim=64, n_layers=2):
        super().__init__()
        self.num_users = num_users
        self.num_items = num_items

        self.user_emb = nn.Embedding(num_users, embedding_dim)
        self.entity_emb = nn.Embedding(num_entities, embedding_dim)
        self.relation_emb = nn.Embedding(num_relations, embedding_dim)

        self.layers = nn.ModuleList([
            KGCNLayer(embedding_dim, num_relations)
            for _ in range(n_layers)
        ])

        for emb in [self.user_emb, self.entity_emb, self.relation_emb]:
            nn.init.xavier_uniform_(emb.weight)

    def forward(self, user_ids, item_ids, item_neighbors):
        user_vec = self.user_emb(user_ids)
        item_vec = self.entity_emb.weight[:self.num_items]
        rel_vec = self.relation_emb.weight

        for idx, layer in enumerate(self.layers):
            if idx < len(item_neighbors):
                item_vec = layer(item_vec, item_neighbors[idx], rel_vec)
                item_vec = F.relu(item_vec)

        item_final = item_vec[item_ids]
        return (user_vec * item_final).sum(dim=1)

KGAT: Knowledge Graph Attention Network

KGAT: attention weights highlight which KG neighbors of an item matter most for the user

What KGAT Adds

KGAT (Wang et al., KDD 2019) goes further than KGCN by building a Collaborative Knowledge Graph (CKG) that merges user-item interactions with the knowledge graph into one unified graph. It then applies attention-based aggregation over this combined structure.

$$\mathcal{G}_\text{CKG} = \underbrace{\{(u, \text{interact}, i)\}}_{\text{user-item edges}} \;\cup\; \underbrace{\{(h, r, t)\}}_{\text{KG edges}}$$

Why this matters. In KGCN, user and item embeddings live in separate spaces. In KGAT, they are all nodes in the same graph, so collaborative signals and semantic signals propagate together.

Attention Mechanism

For an entity $e$ (which may be a user, item, or attribute), KGAT computes attention over neighbors:

$$\pi(e, r, e') = \frac{\exp\!\bigl(\text{LeakyReLU}(\mathbf{a}^T [\mathbf{e}_e \,\|\, \mathbf{e}_r \,\|\, \mathbf{e}_{e'}])\bigr)}{\sum_{(r'', e'') \in \mathcal{N}(e)} \exp\!\bigl(\text{LeakyReLU}(\mathbf{a}^T [\mathbf{e}_e \,\|\, \mathbf{e}_{r''} \,\|\, \mathbf{e}_{e''}])\bigr)}$$

Plain English. “For each neighbor, concatenate the entity, relation, and neighbor embeddings, run them through a learned scoring function, and normalize with softmax. This tells the model which connections matter most.” The figure above visualizes this: edges to Nolan and DiCaprio are thick (high attention) while the edge to 2010 (release year) is thin.

The entity update:

$$\mathbf{e}_e^{(l)} = \sigma\!\bigl(\mathbf{W}^{(l)}[\mathbf{e}_e^{(l-1)} \,\|\, \mathbf{e}_{\mathcal{N}(e)}^{(l-1)}] + \mathbf{b}^{(l)}\bigr)$$

Multi-Head Attention

Like GAT (Part 7), KGAT uses multiple attention heads to capture different aspects of the relationships:

$$\mathbf{e}_{\mathcal{N}(e)} = \big\|_{k=1}^{K}\; \sum_{(r,e') \in \mathcal{N}(e)} \pi^{(k)}(e, r, e')\, \mathbf{e}_{e'}^{(k)}$$

Implementation: KGAT

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
import torch
import torch.nn as nn
import torch.nn.functional as F


class KGATLayer(nn.Module):
    """Single KGAT layer with relation-aware attention."""

    def __init__(self, embedding_dim, num_heads=1, dropout=0.1):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.num_heads = num_heads
        self.dropout = dropout

        # Score [entity || relation || neighbor]
        self.attention = nn.Linear(embedding_dim * 3, num_heads)
        self.W_v = nn.Linear(embedding_dim, embedding_dim)
        self.output = nn.Linear(embedding_dim, embedding_dim)

    def forward(self, entity_embs, relation_embs, neighbors):
        outputs = []
        for i in range(entity_embs.size(0)):
            nbrs = neighbors.get(i, [])
            if not nbrs:
                outputs.append(entity_embs[i])
                continue

            scores, values = [], []
            for r_id, n_id in nbrs:
                combined = torch.cat([
                    entity_embs[i], relation_embs[r_id], entity_embs[n_id]
                ])
                scores.append(self.attention(combined))
                values.append(self.W_v(entity_embs[n_id]))

            scores = F.softmax(torch.stack(scores), dim=0)  # [K, heads]
            scores = F.dropout(scores, p=self.dropout, training=self.training)
            values = torch.stack(values)                     # [K, d]

            aggregated = (scores.mean(dim=1, keepdim=True) * values).sum(0)
            out = entity_embs[i] + aggregated
            outputs.append(F.relu(self.output(out)))

        return torch.stack(outputs)


class KGAT(nn.Module):
    """Knowledge Graph Attention Network for recommendation."""

    def __init__(self, num_users, num_items, num_entities, num_relations,
                 embedding_dim=64, n_layers=2, num_heads=2, dropout=0.1):
        super().__init__()
        self.num_users = num_users

        self.entity_emb = nn.Embedding(num_entities, embedding_dim)
        self.user_emb = nn.Embedding(num_users, embedding_dim)
        self.relation_emb = nn.Embedding(num_relations, embedding_dim)

        self.layers = nn.ModuleList([
            KGATLayer(embedding_dim, num_heads, dropout)
            for _ in range(n_layers)
        ])

        for emb in [self.entity_emb, self.user_emb, self.relation_emb]:
            nn.init.xavier_uniform_(emb.weight)

    def forward(self, user_ids, item_ids, ckg_neighbors):
        # Combined embedding table: [users | entities]
        all_embs = self.entity_emb.weight.clone()
        all_embs[:self.num_users] = self.user_emb.weight
        rel_embs = self.relation_emb.weight

        for idx, layer in enumerate(self.layers):
            if idx < len(ckg_neighbors):
                all_embs = layer(all_embs, rel_embs, ckg_neighbors[idx])

        user_final = all_embs[user_ids]
        item_final = all_embs[self.num_users + item_ids]
        return (user_final * item_final).sum(dim=1)

CKE: Collaborative Knowledge Base Embedding

The Multi-Signal Approach

CKE (Zhang et al., KDD 2016) takes a different philosophy: instead of propagating through the graph at inference time, it pre-learns embeddings from three complementary sources and combines them additively.

$$\mathbf{i} = \mathbf{i}_\text{CF} + \mathbf{i}_\text{KG} + \mathbf{i}_\text{text}$$
ComponentSourceLearning method
$\mathbf{i}_\text{CF}$User-item interactionsMatrix factorization
$\mathbf{i}_\text{KG}$Knowledge graph structureTransR embedding
$\mathbf{i}_\text{text}$Item descriptionsCNN text encoder

Why three signals? Each captures something the others miss:

  • CF knows who likes what, but not why.
  • KG knows semantic relationships, but not user behavior.
  • Text captures nuanced descriptions that structured triples cannot express.

Joint Training

CKE trains all three components together with a combined loss:

$$\mathcal{L} = \mathcal{L}_\text{CF} + \lambda_1 \mathcal{L}_\text{KG} + \lambda_2 \mathcal{L}_\text{text} + \lambda_3 \mathcal{L}_\text{reg}$$

The CF loss is standard matrix factorization with $\hat{r}_{ui} = \mathbf{u}^T \mathbf{i}$. The KG loss uses TransR: push valid triples together and invalid triples apart.

Implementation: CKE

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransR(nn.Module):
    """TransR: relation-specific entity projection."""

    def __init__(self, num_entities, num_relations, entity_dim, relation_dim):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, entity_dim)
        self.relation_emb = nn.Embedding(num_relations, relation_dim)
        self.proj_matrices = nn.Embedding(num_relations,
                                          entity_dim * relation_dim)
        self.entity_dim = entity_dim
        self.relation_dim = relation_dim

        for emb in [self.entity_emb, self.relation_emb, self.proj_matrices]:
            nn.init.xavier_uniform_(emb.weight)

    def forward(self, heads, relations, tails,
                neg_heads=None, neg_tails=None):
        h = self.entity_emb(heads)
        r = self.relation_emb(relations)
        t = self.entity_emb(tails)

        proj = self.proj_matrices(relations).view(
            -1, self.entity_dim, self.relation_dim
        )
        h_proj = torch.bmm(h.unsqueeze(1), proj).squeeze(1)
        t_proj = torch.bmm(t.unsqueeze(1), proj).squeeze(1)

        pos_score = (h_proj + r - t_proj).norm(p=2, dim=1) ** 2

        if neg_heads is not None and neg_tails is not None:
            nh = self.entity_emb(neg_heads)
            nt = self.entity_emb(neg_tails)
            nh_proj = torch.bmm(nh.unsqueeze(1), proj).squeeze(1)
            nt_proj = torch.bmm(nt.unsqueeze(1), proj).squeeze(1)
            neg_score = (nh_proj + r - nt_proj).norm(p=2, dim=1) ** 2
            return F.relu(1.0 + pos_score - neg_score).mean()

        return pos_score


class CKE(nn.Module):
    """Collaborative Knowledge base Embedding."""

    def __init__(self, num_users, num_items, num_entities, num_relations,
                 embedding_dim=64, relation_dim=32,
                 kg_lambda=0.1, reg_lambda=0.01):
        super().__init__()
        self.num_entities = num_entities
        self.kg_lambda = kg_lambda
        self.reg_lambda = reg_lambda

        # CF component
        self.user_emb = nn.Embedding(num_users, embedding_dim)
        self.item_cf_emb = nn.Embedding(num_items, embedding_dim)

        # KG component (TransR)
        self.transr = TransR(num_entities, num_relations,
                             embedding_dim, relation_dim)

        # Project KG embeddings into the CF space
        self.kg_proj = nn.Linear(embedding_dim, embedding_dim)

        nn.init.xavier_uniform_(self.user_emb.weight)
        nn.init.xavier_uniform_(self.item_cf_emb.weight)

    def forward(self, user_ids, item_ids):
        user_vec = self.user_emb(user_ids)
        item_cf = self.item_cf_emb(item_ids)
        item_kg = self.kg_proj(self.transr.entity_emb(item_ids))

        item_combined = item_cf + item_kg
        return (user_vec * item_combined).sum(dim=1)

    def compute_loss(self, user_ids, item_ids, ratings,
                     kg_heads=None, kg_rels=None, kg_tails=None):
        # CF loss
        preds = self.forward(user_ids, item_ids)
        cf_loss = F.mse_loss(preds, ratings.float())

        # KG loss
        kg_loss = torch.tensor(0.0)
        if kg_heads is not None:
            neg_heads = torch.randint(0, self.num_entities, kg_heads.shape)
            neg_tails = torch.randint(0, self.num_entities, kg_tails.shape)
            kg_loss = self.transr(kg_heads, kg_rels, kg_tails,
                                  neg_heads, neg_tails)

        # Regularization
        reg = (self.user_emb.weight.norm(2) ** 2 +
               self.item_cf_emb.weight.norm(2) ** 2)

        return cf_loss + self.kg_lambda * kg_loss + self.reg_lambda * reg

Path-Based Reasoning for Explainable Recommendations

Why Paths Matter

The methods above learn embeddings – dense vectors that are powerful but opaque. Path-based reasoning takes a different route: it finds concrete paths through the knowledge graph from a user’s history to a candidate item, and uses those paths as both features and explanations.

Example path. You rated The Dark Knight highly -> The Dark Knight directed_by Christopher Nolan -> Christopher Nolan directed Inception -> Recommend Inception.

This path is not just a feature for the model – it doubles as a human-readable explanation.

Implementation: Multi-Hop Path Reasoning

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import deque


class PathReasoning(nn.Module):
    """Find and score KG paths for explainable recommendations."""

    def __init__(self, num_entities, num_relations, embedding_dim=64):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, embedding_dim)
        self.relation_emb = nn.Embedding(num_relations, embedding_dim)

        self.path_scorer = nn.Sequential(
            nn.Linear(embedding_dim * 3, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, 1),
        )

        nn.init.xavier_uniform_(self.entity_emb.weight)
        nn.init.xavier_uniform_(self.relation_emb.weight)

    def find_paths(self, kg_adj, start, end, max_hops=3, max_paths=10):
        """BFS to find paths from start to end entity."""
        paths = []
        queue = deque([(start, [], [])])
        visited = set()

        while queue and len(paths) < max_paths:
            current, rel_path, ent_path = queue.popleft()

            if current == end and rel_path:
                paths.append((rel_path, ent_path + [end]))
                continue
            if len(rel_path) >= max_hops:
                continue

            for r_id, t_id in kg_adj.get(current, []):
                if (current, r_id, t_id) not in visited:
                    visited.add((current, r_id, t_id))
                    queue.append((t_id,
                                  rel_path + [r_id],
                                  ent_path + [current]))

        return paths[:max_paths]

    def score_path(self, path):
        """Score a single path with learned embeddings."""
        rel_ids, ent_ids = path
        if not rel_ids:
            return torch.tensor(0.0)

        start = self.entity_emb(torch.tensor([ent_ids[0]]))
        rels = self.relation_emb(torch.tensor(rel_ids)).mean(0, keepdim=True)
        end = self.entity_emb(torch.tensor([ent_ids[-1]]))

        combined = torch.cat([start, rels, end], dim=1)
        return self.path_scorer(combined).squeeze()

    def recommend_with_explanations(self, user_items, kg_adj,
                                    candidates, max_hops=3):
        """Return [(item_id, score, best_path)] sorted by score."""
        results = []
        for item_id in candidates:
            best_score = float('-inf')
            best_path = None
            for src_item in user_items:
                for path in self.find_paths(kg_adj, src_item, item_id,
                                            max_hops=max_hops):
                    score = self.score_path(path).item()
                    if score > best_score:
                        best_score = score
                        best_path = path
            if best_path is not None:
                results.append((item_id, best_score, best_path))

        results.sort(key=lambda x: x[1], reverse=True)
        return results

With KG vs Without KG: The Bottom Line

Quantitative lift on MovieLens-1M and qualitative comparison of recommendation lists

The left panel shows KGAT delivering double-digit relative gains over a vanilla BPR baseline on every standard metric – and a much larger gain on cold-start Recall@20, which is exactly the regime where collaborative filtering struggles most.

The right panel makes the qualitative difference visceral. Without a KG, the model falls back on raw popularity and surfaces unrelated blockbusters; the brand-new Nolan film is invisible because no user has seen it yet. With a KG, the model surfaces films that are thematically connected – and the new Nolan release is recommended on day one because the graph already knows Nolan directed it.


Practical Tips

Building a Knowledge Graph

SourceTypeExample
Wikidata / DBpediaPublic structured KGsMovie metadata, company info
Product catalogsInternal databasesBrand, category, specifications
NER + relation extractionFrom unstructured textReviews, descriptions

Entity Alignment

Items in your recommendation system need to be linked to entities in the knowledge graph. Common approaches:

  1. Exact string matching – fast but brittle.
  2. Fuzzy matching – handles typos and abbreviations.
  3. Embedding similarity – learn embeddings on both sides and match nearest neighbors.
  4. Manual curation – highest quality but expensive.

Training Strategies

StrategyWhen to use
Joint trainingEnough data to train KG and rec objectives together
Pre-train + fine-tuneKG data is much larger than interaction data
Multi-task learningYou want shared representations across tasks

Choosing the Number of Hops

  • 1 hop: direct attributes only. Fast but shallow.
  • 2 hops: the sweet spot for most tasks. Captures “same director” or “same actor” patterns.
  • 3 hops: richer paths but introduces noise. Use attention to filter.
  • 4+ hops: rarely helpful. The signal-to-noise ratio drops sharply.

Complete Training Pipeline

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
import torch
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np


class KGRecDataset(Dataset):
    """Dataset for KG-enhanced recommendation."""

    def __init__(self, user_ids, item_ids, ratings):
        self.user_ids = torch.LongTensor(user_ids)
        self.item_ids = torch.LongTensor(item_ids)
        self.ratings = torch.FloatTensor(ratings)

    def __len__(self):
        return len(self.user_ids)

    def __getitem__(self, idx):
        return {
            'user_id': self.user_ids[idx],
            'item_id': self.item_ids[idx],
            'rating': self.ratings[idx],
        }


def train_cke(model, train_loader, val_loader, kg_triples=None,
              num_epochs=10, lr=0.001):
    """Train a CKE model with optional KG loss."""
    optimizer = optim.Adam(model.parameters(), lr=lr)
    best_val = float('inf')

    for epoch in range(num_epochs):
        model.train()
        train_loss = 0

        for batch in train_loader:
            optimizer.zero_grad()
            loss = model.compute_loss(
                batch['user_id'], batch['item_id'], batch['rating'],
                *kg_triples if kg_triples else (None, None, None)
            )
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch in val_loader:
                preds = model(batch['user_id'], batch['item_id'])
                val_loss += F.mse_loss(preds, batch['rating'].float()).item()

        train_loss /= len(train_loader)
        val_loss /= len(val_loader)
        print(f"Epoch {epoch+1}/{num_epochs} | "
              f"Train: {train_loss:.4f} | Val: {val_loss:.4f}")

        if val_loss < best_val:
            best_val = val_loss
            torch.save(model.state_dict(), 'best_kg_model.pt')


if __name__ == "__main__":
    num_users, num_items = 1000, 500
    num_entities, num_relations = 2000, 10

    train_data = KGRecDataset(
        np.random.randint(0, num_users, 8000),
        np.random.randint(0, num_items, 8000),
        np.random.uniform(1, 5, 8000),
    )
    val_data = KGRecDataset(
        np.random.randint(0, num_users, 2000),
        np.random.randint(0, num_items, 2000),
        np.random.uniform(1, 5, 2000),
    )

    model = CKE(num_users, num_items, num_entities, num_relations)
    train_cke(
        model,
        DataLoader(train_data, batch_size=64, shuffle=True),
        DataLoader(val_data, batch_size=64),
        num_epochs=10,
    )

Frequently Asked Questions

How do knowledge graphs help with the cold-start problem?

Even when a new item has zero interactions, it still has attributes in the knowledge graph (genre, director, cast). These attributes connect it to other items that do have interaction history. A user who loved Nolan’s films can receive a recommendation for his latest movie on release day, purely through KG connections.

RippleNet vs. KGCN – what is the difference?

RippleNet is user-centric: it starts from the user’s history and ripples outward through the KG. KGCN is item-centric: it builds enriched item representations by aggregating from each item’s KG neighborhood. In practice, KGCN tends to scale better because item neighborhoods are more stable than user preference ripples.

When should I use KGAT over KGCN?

Use KGAT when you want to model the interaction between collaborative signals and semantic signals in a unified graph. KGAT’s Collaborative Knowledge Graph merges user-item edges with KG edges, so attention can learn across both types. KGCN keeps them separate.

How does CKE compare to graph-based methods?

CKE is simpler and faster – it pre-learns KG embeddings and combines them additively. Graph-based methods (KGCN, KGAT) are more powerful because they propagate information through the graph at inference time, but they are also more expensive. Start with CKE for a quick baseline, then upgrade to KGAT if you need better accuracy.

Can knowledge graphs improve recommendation diversity?

Yes. Different relation types lead to different kinds of connections. Following same_director gives different results than same_genre or same_actor. By exploring multiple relation paths, the system naturally surfaces diverse recommendations instead of creating a filter bubble.

How do I handle noisy or incomplete knowledge graphs?

  1. Attention mechanisms (KGAT) naturally down-weight noisy connections.
  2. TransR embeddings learn to ignore inconsistent triples during training.
  3. Data augmentation through relation inference: predict missing edges.
  4. Multi-task learning shares information across tasks, making the model more robust to missing data.

What are the computational bottlenecks?

The main bottleneck is neighbor aggregation on large KGs (millions of entities). Solutions:

  • Neighbor sampling – limit to $K$ neighbors per node.
  • Hierarchical aggregation – aggregate attributes first, then items.
  • Mini-batch training with subgraph sampling.
  • Pre-computation of KG embeddings (CKE approach).

Can knowledge graphs provide explainable recommendations?

This is one of their biggest strengths. Path-based methods generate explanations like: “We recommend Movie X because it shares director Y with Movie Z, which you rated highly.” These explanations are concrete, human-readable, and grounded in factual relationships – much better than “users similar to you also liked this.”

What is the latest in KG-enhanced recommendation?

Recent trends include:

  1. Temporal KGs that model how preferences and facts evolve over time.
  2. Multi-modal KGs combining structured knowledge with images and text.
  3. Transformer-based KG methods that replace GCN aggregation with self-attention.
  4. Pre-trained KG embeddings from large-scale foundation models.
  5. LLM + KG hybrids that use language models to extract from and reason over knowledge graphs.

Key Takeaways

  • Knowledge graphs add semantic understanding to recommendation systems, going beyond pure collaborative signals.
  • Cold start is the killer app: KGs enable recommendations for items with zero interaction history.
  • RippleNet propagates user preferences outward through the KG like ripples in water.
  • KGCN and KGAT build enriched item representations by aggregating from KG neighborhoods.
  • CKE combines three signal types (collaborative, structural, textual) into a unified item embedding.
  • Path-based reasoning provides explainability – concrete, human-readable reasons for each recommendation.
  • 2 hops is the sweet spot for most KG-enhanced methods; more hops introduce noise.

Series Navigation

This article is Part 8 of the 16-part Recommendation Systems series.

PartTopic
1Fundamentals and Core Concepts
2Collaborative Filtering
3Deep Learning Basics
4CTR Prediction
5Embedding Techniques
6Sequential Recommendation
7Graph Neural Networks and Social Recommendation
8Knowledge Graph-Enhanced Recommendation (you are here)
9Multi-Task and Multi-Objective Learning
10Deep Interest Networks
11Contrastive Learning
12LLM-Enhanced Recommendation
13Fairness and Explainability
14Cross-Domain and Cold Start
15Real-Time and Online Learning
16Industrial Practice

Liked this piece?

Follow on GitHub for the next one — usually one a week.

GitHub