Recommendation Systems (8): Knowledge Graph-Enhanced Recommendation

Learn how knowledge graphs supercharge recommendation systems by adding semantic understanding. Covers RippleNet, KGCN, KGAT, CKE, and path-based reasoning -- with intuitive explanations, real-world analogies, and working Python code.

When you search for The Dark Knight on a streaming platform, the system doesn’t just log that you watched it. It knows that Christian Bale played Batman, Christopher Nolan directed it, it’s part of the Batman trilogy, and it shares cinematic DNA with other cerebral action films. This rich semantic web is a knowledge graph (KG) — a structured network of entities (movies, actors, directors, genres) connected by typed relations (acted_in, directed_by, part_of).

Why does this matter for recommendations? Pure collaborative filtering has a blind spot: it can only recommend items with existing interaction history. A brand-new film with zero views is invisible. But if that film shares a director with movies you love, a knowledge graph sees the connection right away. KGs transform recommendations from raw pattern matching into semantic reasoning.


Recommendation Systems (8): Knowledge Graph-Enhanced Recommendation — Chapter overview

What You Will Learn#

  • What knowledge graphs are and how they encode real-world facts
  • How KG embeddings (TransE, TransR, DistMult) represent entities and relations as vectors
  • Propagation-based methods: RippleNet spreads user preferences like ripples in water
  • Graph convolutional methods: KGCN and KGAT learn item representations from KG neighbors
  • Embedding fusion: CKE combines collaborative, structural, and textual signals
  • Path-based reasoning for explainable recommendations
  • Working PyTorch code for every major method

Prerequisites#

  • Basic Python and PyTorch (tensors, nn.Module, training loops)
  • Familiarity with graph neural networks (Part 7 of this series)
  • Comfort with embedding concepts (Part 5 )

Foundations of Knowledge Graphs#

What Is a Knowledge Graph?#

Movie knowledge graph: movies, people, genres and collections linked by typed relations

A knowledge graph stores facts as (head, relation, tail) triples. Each triple represents one atomic fact:

  • (The Dark Knight, directed_by, Christopher Nolan)
  • (The Dark Knight, starred, Christian Bale)
  • (The Dark Knight, has_genre, Action)
  • (Christian Bale, acted_in, The Prestige)

Formally, a knowledge graph is a set $\mathcal{G} = \{(h, r, t)\}$ where $h \in \mathcal{E}$ is the head entity, $r \in \mathcal{R}$ is the relation type, and $t \in \mathcal{E}$ is the tail entity. $\mathcal{E}$ is the set of all entities and $\mathcal{R}$ is the set of all relation types.

Analogy. Think of a knowledge graph as a Wikipedia-scale fact database, but stored as a graph instead of prose. Each article is a node, and every link between articles is a labeled edge — the label explains why they are connected.

Real-World Knowledge Graphs#

Knowledge GraphEntitiesFactsUsed by
Freebase39M1.9BGoogle (deprecated)
Wikidata100M+1.4B+Wikipedia, Google
DBpedia6M580MAcademic research
Amazon Product GraphBillionsTrillionsAmazon recommendations

Knowledge Graph Structure for Recommendation#

Recommendation KGs are typically heterogeneous — they contain multiple entity and relation types:

Entity types

  • Users: $U = \{u_1, u_2, \ldots, u_m\}$
  • Items: $I = \{i_1, i_2, \ldots, i_n\}$
  • Attributes: genres, actors, directors, brands, …

Relation types

  • User-item: (user, interacted, item)
  • Item-attribute: (movie, has_genre, Action), (movie, directed_by, Nolan)
  • Attribute-attribute: (Nolan, collaborated_with, Bale)

Knowledge Graph Embeddings#

Before feeding a knowledge graph into a recommendation model, we must convert entities and relations into dense vectors. The key question is: how do we train these embeddings to respect the graph’s structure?

TransE: head plus relation vector lands near the tail; training pulls valid tails close and pushes negatives outside the margin

$$\mathcal{L} = \sum_{(h,r,t) \in \mathcal{G}}\; \sum_{(h',r,t') \notin \mathcal{G}} \bigl[\gamma + \|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_2 - \|\mathbf{h}' + \mathbf{r} - \mathbf{t}'\|_2\bigr]_+$$

Plain English. “Push valid triples close together ($\mathbf{h} + \mathbf{r} \approx \mathbf{t}$ ) and invalid triples far apart. The margin $\gamma$ controls how much separation you require.” The right panel of the figure above shows this: the green dot (valid tail) is pulled inside the margin circle, while gray negatives are pushed outside.

MethodScoring functionBest for
TransE$\mid\mathbf{h} + \mathbf{r} - \mathbf{t}\mid$Simple 1-to-1 relations
TransR$\mid\mathbf{h} M_r + \mathbf{r} - \mathbf{t} M_r\mid$Relations needing different vector spaces
DistMult$\mathbf{h}^T \text{diag}(\mathbf{r})\, \mathbf{t}$Symmetric relations
ComplEx$\text{Re}(\mathbf{h}^T \text{diag}(\mathbf{r})\, \bar{\mathbf{t}})$Asymmetric relations

TransE analogy. Think of cities on a map. “Paris” + “fly_east_2000km” should land near “Moscow.” “Paris” + “fly_south_1500km” should land near “Algiers.” The relation vector acts like a displacement on the map.

Implementation: Knowledge Graph Construction#

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
import torch
import torch.nn as nn
from collections import defaultdict
from typing import List, Tuple

class KnowledgeGraph:
    """Simple knowledge graph for recommendation systems."""

    def __init__(self):
        self.entities = {}       # entity_name -> entity_id
        self.relations = {}      # relation_name -> relation_id
        self.triples = []        # List of (h_id, r_id, t_id)
        self.entity_adj = defaultdict(list)  # entity_id -> [(r_id, t_id), ...]

    def add_entity(self, name: str) -> int:
        if name not in self.entities:
            self.entities[name] = len(self.entities)
        return self.entities[name]

    def add_relation(self, name: str) -> int:
        if name not in self.relations:
            self.relations[name] = len(self.relations)
        return self.relations[name]

    def add_triple(self, head: str, relation: str, tail: str):
        h_id = self.add_entity(head)
        r_id = self.add_relation(relation)
        t_id = self.add_entity(tail)
        self.triples.append((h_id, r_id, t_id))
        self.entity_adj[h_id].append((r_id, t_id))

    def get_neighbors(self, entity_id: int) -> List[Tuple[int, int]]:
        return self.entity_adj[entity_id]

# Build a movie knowledge graph
kg = KnowledgeGraph()
kg.add_triple("TheDarkKnight", "directed_by", "ChristopherNolan")
kg.add_triple("TheDarkKnight", "starred", "ChristianBale")
kg.add_triple("TheDarkKnight", "has_genre", "Action")
kg.add_triple("TheDarkKnight", "has_genre", "Crime")
kg.add_triple("Inception", "directed_by", "ChristopherNolan")
kg.add_triple("Inception", "starred", "LeonardoDiCaprio")
kg.add_triple("Inception", "has_genre", "SciFi")
kg.add_triple("ThePrestige", "directed_by", "ChristopherNolan")
kg.add_triple("ThePrestige", "starred", "ChristianBale")
kg.add_triple("ThePrestige", "has_genre", "Drama")

# Explore the graph
dk_id = kg.entities["TheDarkKnight"]
print(f"TheDarkKnight neighbors: {kg.get_neighbors(dk_id)}")
# (directed_by, Nolan), (starred, Bale), (has_genre, Action), (has_genre, Crime)

Why Knowledge Graphs Help Recommendation#

Knowledge graphs solve four hard problems that plague pure collaborative filtering.

Four recurring failure modes of collaborative filtering and how a knowledge graph addresses each

Cold Start#

Problem. A new movie with zero interactions is invisible to collaborative filtering — there’s nothing for the model to compare against.

KG solution. Even on day one, the new movie has attributes (director, actors, genre) that connect it to the rest of the graph. If you loved Nolan’s other films, the KG can recommend his new movie immediately, without any interaction data.

Data Sparsity#

Problem. Most users interact with a tiny fraction of items. The interaction matrix is 99%+ zeros, leaving the model with little signal to work with.

KG solution. The knowledge graph fills the gaps with dense semantic connections. Even if two items share no users, they might share a director, genre, or production studio — and that link still carries information.

Explainability#

Problem. “Users who liked X also liked Y” is not a satisfying explanation.

KG solution. “We recommend Inception because it was directed by Christopher Nolan, who also directed The Dark Knight, which you rated 5 stars.” The KG provides concrete, interpretable reasoning paths.

Diversity#

Problem. Collaborative filtering tends to create filter bubbles — recommending more of the same.

KG solution. Different relation paths lead to different kinds of recommendations. Following same_director yields different results than same_genre or same_actor, naturally diversifying the recommendation list.

Concrete Example: After Watching The Dark Knight#

Reasoning over a small knowledge graph: from one watch to four candidate recommendations

The figure traces explicit paths from a user’s single watch to four candidate movies:

  • Inception — via Dark Knight -> directed_by -> Nolan -> directed -> Inception
  • The Prestige — two converging paths through Nolan and Bale (a strong signal)
  • Batman Begins — via shared cast and shared genre
  • The system can produce a different ranking and an explanation for each candidate, something pure CF cannot do.

Types of KG-Enhanced Methods#

CategoryApproachKey models
Propagation-basedSpread user preferences through the KGRippleNet
Graph convolutionalLearn item embeddings by aggregating KG neighborsKGCN, KGAT
Embedding-basedLearn joint embeddings of users, items, and KG entitiesCKE
Path-basedReason over multi-hop paths for explainabilityKPRN, PathRec

RippleNet: Preference Propagation#

Recommendation Systems (8): Knowledge Graph-Enhanced Recommendation — Chapter summary

RippleNet: user preferences spread outward from history items through hop-1 and hop-2 KG neighbors

The Ripple Analogy#

Drop a stone in a pond and ripples spread outward. RippleNet (Wang et al., CIKM 2018) does the same thing with user preferences: when a user interacts with an item, that preference ripples outward through the knowledge graph, activating related entities at increasing distances.

How It Works#

Given user $u$ with historical items $V_u = \{v_1, v_2, \ldots\}$ :

  1. Hop 0: Start with the user’s items as the initial preference set $\mathcal{S}_u^0 = V_u$ .
  2. Hop 1: Find every entity connected to $\mathcal{S}_u^0$ via any relation. These form $\mathcal{S}_u^1$ .
  3. Hop 2: Find every entity connected to $\mathcal{S}_u^1$ . These form $\mathcal{S}_u^2$ .
  4. Aggregate: Compute the user’s enhanced embedding by attention-weighting entities at each hop.
$$p_i^h = \text{softmax}(\mathbf{v}^T \mathbf{R}_i\, \mathbf{t})$$

Plain English. “How relevant is this KG entity to the item we are scoring? Use the relation matrix $\mathbf{R}$ to measure compatibility.”

$$\mathbf{o}_u^h = \sum_{(h,r,t) \in \mathcal{S}_u^h} p_i^h\, \mathbf{t}$$ $$\mathbf{u} = \sum_{h=0}^{H} \alpha_h\, \mathbf{o}_u^h$$

Walkthrough. A user watched The Dark Knight:

  • Hop 0: {Dark Knight}
  • Hop 1: {Nolan, Bale, Action, Crime}
  • Hop 2: {Inception, Prestige, Batman Begins, DiCaprio, Drama, …}

The ripples discover that this user might like Inception (connected through Nolan) and The Prestige (connected through both Nolan and Bale — a doubly-supported signal).

Implementation: RippleNet#

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
import torch
import torch.nn as nn
import torch.nn.functional as F

class RippleNet(nn.Module):
    """RippleNet: propagating user preferences on the knowledge graph."""

    def __init__(self, num_users, num_items, num_entities, num_relations,
                 embedding_dim=64, n_hop=2, n_memory=32):
        super().__init__()
        self.n_hop = n_hop
        self.n_memory = n_memory
        self.embedding_dim = embedding_dim

        self.user_emb = nn.Embedding(num_users, embedding_dim)
        self.item_emb = nn.Embedding(num_items, embedding_dim)
        self.entity_emb = nn.Embedding(num_entities, embedding_dim)
        self.relation_emb = nn.Embedding(num_relations, embedding_dim)

        self.hop_weights = nn.Parameter(torch.ones(n_hop + 1) / (n_hop + 1))

        for emb in [self.user_emb, self.item_emb,
                    self.entity_emb, self.relation_emb]:
            nn.init.xavier_uniform_(emb.weight)

    def forward(self, user_ids, item_ids, ripple_sets):
        """
        Args:
            user_ids: [batch_size]
            item_ids: [batch_size]
            ripple_sets: ripple_sets[i][h] holds (relation_id, tail_id)
                         tuples for user i at hop h.
        """
        user_base = self.user_emb(user_ids)        # [B, d]
        item_vec = self.item_emb(item_ids)          # [B, d]

        user_enhanced = self._propagate(user_base, item_vec, ripple_sets)
        scores = (user_enhanced * item_vec).sum(dim=1)
        return scores

    def _propagate(self, user_base, item_vec, ripple_sets):
        batch_size = user_base.size(0)
        hop_memories = [user_base]

        for hop in range(self.n_hop):
            hop_embs = []
            for i in range(batch_size):
                rs = ripple_sets[i][hop] if hop < len(ripple_sets[i]) else []
                if not rs:
                    hop_embs.append(torch.zeros(self.embedding_dim))
                    continue

                rs = rs[:self.n_memory]               # cap memory
                rels = torch.LongTensor([r for r, _ in rs])
                tails = torch.LongTensor([t for _, t in rs])

                rel_e = self.relation_emb(rels)       # [K, d]
                tail_e = self.entity_emb(tails)       # [K, d]

                # Score each KG neighbor against the candidate item
                scores = (item_vec[i].unsqueeze(0) * rel_e * tail_e).sum(1)
                probs = F.softmax(scores, dim=0)      # [K]
                hop_embs.append((probs.unsqueeze(1) * tail_e).sum(0))

            hop_memories.append(torch.stack(hop_embs))

        # Weighted combination across hops
        stacked = torch.stack(hop_memories, dim=1)    # [B, H+1, d]
        weights = F.softmax(self.hop_weights, dim=0)
        return (weights.unsqueeze(0).unsqueeze(2) * stacked).sum(dim=1)

    @staticmethod
    def build_ripple_sets(user_items, kg_adj, max_hops=2):
        """Build ripple sets for one user via BFS."""
        ripple_sets = []
        current = set(user_items)
        for _ in range(max_hops):
            next_hop = []
            for entity in current:
                for r_id, t_id in kg_adj.get(entity, []):
                    next_hop.append((r_id, t_id))
            ripple_sets.append(next_hop)
            current = {t for _, t in next_hop}
        return ripple_sets

KGCN: Knowledge Graph Convolutional Networks#

A Different Perspective#

While RippleNet propagates user preferences through the KG, KGCN (Wang et al., WWW 2019) takes the opposite stance: it builds better item representations by aggregating information from each item’s KG neighborhood.

Analogy. RippleNet asks, “Starting from what you liked, what is nearby in the knowledge graph?” KGCN asks, “For each candidate item, what does its KG neighborhood tell us about it?”

How KGCN Works#

For an item $i$ with KG neighbors $\mathcal{N}_i$ , KGCN:

  1. Samples $K$ neighbors from the KG.
  2. Weights each neighbor by a learned, relation-aware attention score.
  3. Aggregates the weighted neighbor embeddings.
  4. Combines the aggregate with the item’s own embedding.
$$\mathbf{e}_{\mathcal{N}_i}^{(l)} = \sum_{(r,e) \in \mathcal{N}_i} \pi_r(i, e)\; \mathbf{e}_e^{(l-1)}$$ $$\mathbf{e}_i^{(l)} = \sigma\!\bigl(\mathbf{W}^{(l)}[\mathbf{e}_i^{(l-1)} \,\|\, \mathbf{e}_{\mathcal{N}_i}^{(l)}] + \mathbf{b}^{(l)}\bigr)$$

Plain English. “Look at what is connected to this item in the KG (actors, directors, genres). Weight each connection by how important that relation type is. Blend the neighborhood signal with the item’s own embedding.”

Implementation: KGCN#

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
import torch
import torch.nn as nn
import torch.nn.functional as F

class KGCNLayer(nn.Module):
    """Single KGCN layer with relation-aware aggregation."""

    def __init__(self, embedding_dim, num_relations):
        super().__init__()
        self.embedding_dim = embedding_dim

        self.rel_transforms = nn.ModuleList([
            nn.Linear(embedding_dim, embedding_dim, bias=False)
            for _ in range(num_relations)
        ])
        self.attn = nn.ModuleList([
            nn.Linear(embedding_dim, 1)
            for _ in range(num_relations)
        ])
        self.combine = nn.Linear(embedding_dim * 2, embedding_dim)

    def forward(self, item_embs, neighbors, relation_embs):
        num_items = item_embs.size(0)
        aggregated = []

        for i in range(num_items):
            nbrs = neighbors.get(i, [])
            if not nbrs:
                aggregated.append(torch.zeros(self.embedding_dim))
                continue

            nbr_embs, scores = [], []
            for r_id, n_id in nbrs:
                transformed = self.rel_transforms[r_id](item_embs[n_id])
                nbr_embs.append(transformed)
                compatibility = item_embs[i] + relation_embs[r_id]
                scores.append(self.attn[r_id](compatibility))

            nbr_embs = torch.stack(nbr_embs)
            scores = torch.cat(scores)
            weights = F.softmax(scores, dim=0)
            aggregated.append((weights.unsqueeze(1) * nbr_embs).sum(0))

        aggregated = torch.stack(aggregated)
        combined = torch.cat([item_embs, aggregated], dim=1)
        return self.combine(combined)

class KGCN(nn.Module):
    """Knowledge Graph Convolutional Network for recommendation."""

    def __init__(self, num_users, num_items, num_entities, num_relations,
                 embedding_dim=64, n_layers=2):
        super().__init__()
        self.num_users = num_users
        self.num_items = num_items

        self.user_emb = nn.Embedding(num_users, embedding_dim)
        self.entity_emb = nn.Embedding(num_entities, embedding_dim)
        self.relation_emb = nn.Embedding(num_relations, embedding_dim)

        self.layers = nn.ModuleList([
            KGCNLayer(embedding_dim, num_relations)
            for _ in range(n_layers)
        ])

        for emb in [self.user_emb, self.entity_emb, self.relation_emb]:
            nn.init.xavier_uniform_(emb.weight)

    def forward(self, user_ids, item_ids, item_neighbors):
        user_vec = self.user_emb(user_ids)
        item_vec = self.entity_emb.weight[:self.num_items]
        rel_vec = self.relation_emb.weight

        for idx, layer in enumerate(self.layers):
            if idx < len(item_neighbors):
                item_vec = layer(item_vec, item_neighbors[idx], rel_vec)
                item_vec = F.relu(item_vec)

        item_final = item_vec[item_ids]
        return (user_vec * item_final).sum(dim=1)

KGAT: Knowledge Graph Attention Network#

KGAT: attention weights highlight which KG neighbors of an item matter most for the user

What KGAT Adds#

$$\mathcal{G}_\text{CKG} = \underbrace{\{(u, \text{interact}, i)\}}_{\text{user-item edges}} \;\cup\; \underbrace{\{(h, r, t)\}}_{\text{KG edges}}$$

Why this matters. In KGCN, user and item embeddings live in separate spaces. In KGAT, they are all nodes in the same graph, so collaborative signals and semantic signals propagate together.

Attention Mechanism#

$$\pi(e, r, e') = \frac{\exp\!\bigl(\text{LeakyReLU}(\mathbf{a}^T [\mathbf{e}_e \,\|\, \mathbf{e}_r \,\|\, \mathbf{e}_{e'}])\bigr)}{\sum_{(r'', e'') \in \mathcal{N}(e)} \exp\!\bigl(\text{LeakyReLU}(\mathbf{a}^T [\mathbf{e}_e \,\|\, \mathbf{e}_{r''} \,\|\, \mathbf{e}_{e''}])\bigr)}$$

Plain English. “For each neighbor, concatenate the entity, relation, and neighbor embeddings, run them through a learned scoring function, and normalize with softmax. This tells the model which connections matter most.” The figure above visualizes this: edges to Nolan and DiCaprio are thick (high attention) while the edge to 2010 (release year) is thin.

$$\mathbf{e}_e^{(l)} = \sigma\!\bigl(\mathbf{W}^{(l)}[\mathbf{e}_e^{(l-1)} \,\|\, \mathbf{e}_{\mathcal{N}(e)}^{(l-1)}] + \mathbf{b}^{(l)}\bigr)$$

Multi-Head Attention#

$$\mathbf{e}_{\mathcal{N}(e)} = \big\|_{k=1}^{K}\; \sum_{(r,e') \in \mathcal{N}(e)} \pi^{(k)}(e, r, e')\, \mathbf{e}_{e'}^{(k)}$$

Implementation: KGAT#

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
import torch
import torch.nn as nn
import torch.nn.functional as F

class KGATLayer(nn.Module):
    """Single KGAT layer with relation-aware attention."""

    def __init__(self, embedding_dim, num_heads=1, dropout=0.1):
        super().__init__()
        self.embedding_dim = embedding_dim
        self.num_heads = num_heads
        self.dropout = dropout

        # Score [entity || relation || neighbor]
        self.attention = nn.Linear(embedding_dim * 3, num_heads)
        self.W_v = nn.Linear(embedding_dim, embedding_dim)
        self.output = nn.Linear(embedding_dim, embedding_dim)

    def forward(self, entity_embs, relation_embs, neighbors):
        outputs = []
        for i in range(entity_embs.size(0)):
            nbrs = neighbors.get(i, [])
            if not nbrs:
                outputs.append(entity_embs[i])
                continue

            scores, values = [], []
            for r_id, n_id in nbrs:
                combined = torch.cat([
                    entity_embs[i], relation_embs[r_id], entity_embs[n_id]
                ])
                scores.append(self.attention(combined))
                values.append(self.W_v(entity_embs[n_id]))

            scores = F.softmax(torch.stack(scores), dim=0)  # [K, heads]
            scores = F.dropout(scores, p=self.dropout, training=self.training)
            values = torch.stack(values)                     # [K, d]

            aggregated = (scores.mean(dim=1, keepdim=True) * values).sum(0)
            out = entity_embs[i] + aggregated
            outputs.append(F.relu(self.output(out)))

        return torch.stack(outputs)

class KGAT(nn.Module):
    """Knowledge Graph Attention Network for recommendation."""

    def __init__(self, num_users, num_items, num_entities, num_relations,
                 embedding_dim=64, n_layers=2, num_heads=2, dropout=0.1):
        super().__init__()
        self.num_users = num_users

        self.entity_emb = nn.Embedding(num_entities, embedding_dim)
        self.user_emb = nn.Embedding(num_users, embedding_dim)
        self.relation_emb = nn.Embedding(num_relations, embedding_dim)

        self.layers = nn.ModuleList([
            KGATLayer(embedding_dim, num_heads, dropout)
            for _ in range(n_layers)
        ])

        for emb in [self.entity_emb, self.user_emb, self.relation_emb]:
            nn.init.xavier_uniform_(emb.weight)

    def forward(self, user_ids, item_ids, ckg_neighbors):
        # Combined embedding table: [users | entities]
        all_embs = self.entity_emb.weight.clone()
        all_embs[:self.num_users] = self.user_emb.weight
        rel_embs = self.relation_emb.weight

        for idx, layer in enumerate(self.layers):
            if idx < len(ckg_neighbors):
                all_embs = layer(all_embs, rel_embs, ckg_neighbors[idx])

        user_final = all_embs[user_ids]
        item_final = all_embs[self.num_users + item_ids]
        return (user_final * item_final).sum(dim=1)

CKE: Collaborative Knowledge Base Embedding#

The Multi-Signal Approach#

$$\mathbf{i} = \mathbf{i}_\text{CF} + \mathbf{i}_\text{KG} + \mathbf{i}_\text{text}$$
ComponentSourceLearning method
$\mathbf{i}_\text{CF}$User-item interactionsMatrix factorization
$\mathbf{i}_\text{KG}$Knowledge graph structureTransR embedding
$\mathbf{i}_\text{text}$Item descriptionsCNN text encoder

Why three signals? Each captures something the others miss:

  • CF knows who likes what, but not why.
  • KG knows semantic relationships, but not user behavior.
  • Text captures nuanced descriptions that structured triples cannot express.

Joint Training#

$$\mathcal{L} = \mathcal{L}_\text{CF} + \lambda_1 \mathcal{L}_\text{KG} + \lambda_2 \mathcal{L}_\text{text} + \lambda_3 \mathcal{L}_\text{reg}$$

The CF loss is standard matrix factorization with $\hat{r}_{ui} = \mathbf{u}^T \mathbf{i}$ . The KG loss uses TransR: push valid triples together and invalid triples apart.

Implementation: CKE#

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransR(nn.Module):
    """TransR: relation-specific entity projection."""

    def __init__(self, num_entities, num_relations, entity_dim, relation_dim):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, entity_dim)
        self.relation_emb = nn.Embedding(num_relations, relation_dim)
        self.proj_matrices = nn.Embedding(num_relations,
                                          entity_dim * relation_dim)
        self.entity_dim = entity_dim
        self.relation_dim = relation_dim

        for emb in [self.entity_emb, self.relation_emb, self.proj_matrices]:
            nn.init.xavier_uniform_(emb.weight)

    def forward(self, heads, relations, tails,
                neg_heads=None, neg_tails=None):
        h = self.entity_emb(heads)
        r = self.relation_emb(relations)
        t = self.entity_emb(tails)

        proj = self.proj_matrices(relations).view(
            -1, self.entity_dim, self.relation_dim
        )
        h_proj = torch.bmm(h.unsqueeze(1), proj).squeeze(1)
        t_proj = torch.bmm(t.unsqueeze(1), proj).squeeze(1)

        pos_score = (h_proj + r - t_proj).norm(p=2, dim=1) ** 2

        if neg_heads is not None and neg_tails is not None:
            nh = self.entity_emb(neg_heads)
            nt = self.entity_emb(neg_tails)
            nh_proj = torch.bmm(nh.unsqueeze(1), proj).squeeze(1)
            nt_proj = torch.bmm(nt.unsqueeze(1), proj).squeeze(1)
            neg_score = (nh_proj + r - nt_proj).norm(p=2, dim=1) ** 2
            return F.relu(1.0 + pos_score - neg_score).mean()

        return pos_score

class CKE(nn.Module):
    """Collaborative Knowledge base Embedding."""

    def __init__(self, num_users, num_items, num_entities, num_relations,
                 embedding_dim=64, relation_dim=32,
                 kg_lambda=0.1, reg_lambda=0.01):
        super().__init__()
        self.num_entities = num_entities
        self.kg_lambda = kg_lambda
        self.reg_lambda = reg_lambda

        # CF component
        self.user_emb = nn.Embedding(num_users, embedding_dim)
        self.item_cf_emb = nn.Embedding(num_items, embedding_dim)

        # KG component (TransR)
        self.transr = TransR(num_entities, num_relations,
                             embedding_dim, relation_dim)

        # Project KG embeddings into the CF space
        self.kg_proj = nn.Linear(embedding_dim, embedding_dim)

        nn.init.xavier_uniform_(self.user_emb.weight)
        nn.init.xavier_uniform_(self.item_cf_emb.weight)

    def forward(self, user_ids, item_ids):
        user_vec = self.user_emb(user_ids)
        item_cf = self.item_cf_emb(item_ids)
        item_kg = self.kg_proj(self.transr.entity_emb(item_ids))

        item_combined = item_cf + item_kg
        return (user_vec * item_combined).sum(dim=1)

    def compute_loss(self, user_ids, item_ids, ratings,
                     kg_heads=None, kg_rels=None, kg_tails=None):
        # CF loss
        preds = self.forward(user_ids, item_ids)
        cf_loss = F.mse_loss(preds, ratings.float())

        # KG loss
        kg_loss = torch.tensor(0.0)
        if kg_heads is not None:
            neg_heads = torch.randint(0, self.num_entities, kg_heads.shape)
            neg_tails = torch.randint(0, self.num_entities, kg_tails.shape)
            kg_loss = self.transr(kg_heads, kg_rels, kg_tails,
                                  neg_heads, neg_tails)

        # Regularization
        reg = (self.user_emb.weight.norm(2) ** 2 +
               self.item_cf_emb.weight.norm(2) ** 2)

        return cf_loss + self.kg_lambda * kg_loss + self.reg_lambda * reg

Path-Based Reasoning for Explainable Recommendations#

Why Paths Matter#

The methods above learn embeddings — dense vectors that are powerful but opaque. Path-based reasoning takes a different route: it finds concrete paths through the knowledge graph from a user’s history to a candidate item, and uses those paths as both features and explanations.

Example path. You rated The Dark Knight highly -> The Dark Knight directed_by Christopher Nolan -> Christopher Nolan directed Inception -> Recommend Inception.

This path is not just a feature for the model — it doubles as a human-readable explanation.

Implementation: Multi-Hop Path Reasoning#

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import deque

class PathReasoning(nn.Module):
    """Find and score KG paths for explainable recommendations."""

    def __init__(self, num_entities, num_relations, embedding_dim=64):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, embedding_dim)
        self.relation_emb = nn.Embedding(num_relations, embedding_dim)

        self.path_scorer = nn.Sequential(
            nn.Linear(embedding_dim * 3, embedding_dim),
            nn.ReLU(),
            nn.Linear(embedding_dim, 1),
        )

        nn.init.xavier_uniform_(self.entity_emb.weight)
        nn.init.xavier_uniform_(self.relation_emb.weight)

    def find_paths(self, kg_adj, start, end, max_hops=3, max_paths=10):
        """BFS to find paths from start to end entity."""
        paths = []
        queue = deque([(start, [], [])])
        visited = set()

        while queue and len(paths) < max_paths:
            current, rel_path, ent_path = queue.popleft()

            if current == end and rel_path:
                paths.append((rel_path, ent_path + [end]))
                continue
            if len(rel_path) >= max_hops:
                continue

            for r_id, t_id in kg_adj.get(current, []):
                if (current, r_id, t_id) not in visited:
                    visited.add((current, r_id, t_id))
                    queue.append((t_id,
                                  rel_path + [r_id],
                                  ent_path + [current]))

        return paths[:max_paths]

    def score_path(self, path):
        """Score a single path with learned embeddings."""
        rel_ids, ent_ids = path
        if not rel_ids:
            return torch.tensor(0.0)

        start = self.entity_emb(torch.tensor([ent_ids[0]]))
        rels = self.relation_emb(torch.tensor(rel_ids)).mean(0, keepdim=True)
        end = self.entity_emb(torch.tensor([ent_ids[-1]]))

        combined = torch.cat([start, rels, end], dim=1)
        return self.path_scorer(combined).squeeze()

    def recommend_with_explanations(self, user_items, kg_adj,
                                    candidates, max_hops=3):
        """Return [(item_id, score, best_path)] sorted by score."""
        results = []
        for item_id in candidates:
            best_score = float('-inf')
            best_path = None
            for src_item in user_items:
                for path in self.find_paths(kg_adj, src_item, item_id,
                                            max_hops=max_hops):
                    score = self.score_path(path).item()
                    if score > best_score:
                        best_score = score
                        best_path = path
            if best_path is not None:
                results.append((item_id, best_score, best_path))

        results.sort(key=lambda x: x[1], reverse=True)
        return results

With KG vs Without KG: The Bottom Line#

Quantitative lift on MovieLens-1M and qualitative comparison of recommendation lists

The left panel shows KGAT delivering double-digit relative gains over a vanilla BPR baseline on every standard metric — and a much larger gain on cold-start Recall@20, which is exactly the regime where collaborative filtering struggles most.

The right panel makes the qualitative difference visceral. Without a KG, the model falls back on raw popularity and surfaces unrelated blockbusters; the brand-new Nolan film is invisible because no user has seen it yet. With a KG, the model surfaces films that are thematically connected — and the new Nolan release is recommended on day one because the graph already knows Nolan directed it.


Practical Tips#

Building a Knowledge Graph#

SourceTypeExample
Wikidata / DBpediaPublic structured KGsMovie metadata, company info
Product catalogsInternal databasesBrand, category, specifications
NER + relation extractionFrom unstructured textReviews, descriptions

Entity Alignment#

Items in your recommendation system need to be linked to entities in the knowledge graph. Common approaches:

  1. Exact string matching — fast but brittle.
  2. Fuzzy matching — handles typos and abbreviations.
  3. Embedding similarity — learn embeddings on both sides and match nearest neighbors.
  4. Manual curation — highest quality but expensive.

Training Strategies#

StrategyWhen to use
Joint trainingEnough data to train KG and rec objectives together
Pre-train + fine-tuneKG data is much larger than interaction data
Multi-task learningYou want shared representations across tasks

Choosing the Number of Hops#

  • 1 hop: direct attributes only. Fast but shallow.
  • 2 hops: the sweet spot for most tasks. Captures “same director” or “same actor” patterns.
  • 3 hops: richer paths but introduces noise. Use attention to filter.
  • 4+ hops: rarely helpful. The signal-to-noise ratio drops sharply.

Complete Training Pipeline#

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
import torch
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np

class KGRecDataset(Dataset):
    """Dataset for KG-enhanced recommendation."""

    def __init__(self, user_ids, item_ids, ratings):
        self.user_ids = torch.LongTensor(user_ids)
        self.item_ids = torch.LongTensor(item_ids)
        self.ratings = torch.FloatTensor(ratings)

    def __len__(self):
        return len(self.user_ids)

    def __getitem__(self, idx):
        return {
            'user_id': self.user_ids[idx],
            'item_id': self.item_ids[idx],
            'rating': self.ratings[idx],
        }

def train_cke(model, train_loader, val_loader, kg_triples=None,
              num_epochs=10, lr=0.001):
    """Train a CKE model with optional KG loss."""
    optimizer = optim.Adam(model.parameters(), lr=lr)
    best_val = float('inf')

    for epoch in range(num_epochs):
        model.train()
        train_loss = 0

        for batch in train_loader:
            optimizer.zero_grad()
            loss = model.compute_loss(
                batch['user_id'], batch['item_id'], batch['rating'],
                *kg_triples if kg_triples else (None, None, None)
            )
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        model.eval()
        val_loss = 0
        with torch.no_grad():
            for batch in val_loader:
                preds = model(batch['user_id'], batch['item_id'])
                val_loss += F.mse_loss(preds, batch['rating'].float()).item()

        train_loss /= len(train_loader)
        val_loss /= len(val_loader)
        print(f"Epoch {epoch+1}/{num_epochs} | "
              f"Train: {train_loss:.4f} | Val: {val_loss:.4f}")

        if val_loss < best_val:
            best_val = val_loss
            torch.save(model.state_dict(), 'best_kg_model.pt')

if __name__ == "__main__":
    num_users, num_items = 1000, 500
    num_entities, num_relations = 2000, 10

    train_data = KGRecDataset(
        np.random.randint(0, num_users, 8000),
        np.random.randint(0, num_items, 8000),
        np.random.uniform(1, 5, 8000),
    )
    val_data = KGRecDataset(
        np.random.randint(0, num_users, 2000),
        np.random.randint(0, num_items, 2000),
        np.random.uniform(1, 5, 2000),
    )

    model = CKE(num_users, num_items, num_entities, num_relations)
    train_cke(
        model,
        DataLoader(train_data, batch_size=64, shuffle=True),
        DataLoader(val_data, batch_size=64),
        num_epochs=10,
    )

FAQ#

How do knowledge graphs help with the cold-start problem?#

Even when a new item has zero interactions, it still has attributes in the knowledge graph (genre, director, cast). These attributes connect it to other items that do have interaction history. A user who loved Nolan’s films can receive a recommendation for his latest movie on release day, purely through KG connections.

RippleNet vs. KGCN — what is the difference?#

RippleNet is user-centric: it starts from the user’s history and ripples outward through the KG. KGCN is item-centric: it builds enriched item representations by aggregating from each item’s KG neighborhood. In practice, KGCN tends to scale better because item neighborhoods are more stable than user preference ripples.

When should I use KGAT over KGCN?#

Use KGAT when you want to model the interaction between collaborative signals and semantic signals in a unified graph. KGAT’s Collaborative Knowledge Graph merges user-item edges with KG edges, so attention can learn across both types. KGCN keeps them separate.

How does CKE compare to graph-based methods?#

CKE is simpler and faster — it pre-learns KG embeddings and combines them additively. Graph-based methods (KGCN, KGAT) are more powerful because they propagate information through the graph at inference time, but they are also more expensive. Start with CKE for a quick baseline, then upgrade to KGAT if you need better accuracy.

Can knowledge graphs improve recommendation diversity?#

Yes. Different relation types lead to different kinds of connections. Following same_director gives different results than same_genre or same_actor. By exploring multiple relation paths, the system naturally surfaces diverse recommendations instead of creating a filter bubble.

How do I handle noisy or incomplete knowledge graphs?#

  1. Attention mechanisms (KGAT) naturally down-weight noisy connections.
  2. TransR embeddings learn to ignore inconsistent triples during training.
  3. Data augmentation through relation inference: predict missing edges.
  4. Multi-task learning shares information across tasks, making the model more robust to missing data.

What are the computational bottlenecks?#

The main bottleneck is neighbor aggregation on large KGs (millions of entities). Solutions:

  • Neighbor sampling — limit to $K$ neighbors per node.
  • Hierarchical aggregation — aggregate attributes first, then items.
  • Mini-batch training with subgraph sampling.
  • Pre-computation of KG embeddings (CKE approach).

Can knowledge graphs provide explainable recommendations?#

This is one of their biggest strengths. Path-based methods generate explanations like: “We recommend Movie X because it shares director Y with Movie Z, which you rated highly.” These explanations are concrete, human-readable, and grounded in factual relationships — much better than “users similar to you also liked this.”

What is the latest in KG-enhanced recommendation?#

Recent trends include:

  1. Temporal KGs that model how preferences and facts evolve over time.
  2. Multi-modal KGs combining structured knowledge with images and text.
  3. Transformer-based KG methods that replace GCN aggregation with self-attention.
  4. Pre-trained KG embeddings from large-scale foundation models.
  5. LLM + KG hybrids that use language models to extract from and reason over knowledge graphs.

Summary#

  • Knowledge graphs add semantic understanding to recommendation systems, going beyond pure collaborative signals.
  • Cold start is the killer app: KGs enable recommendations for items with zero interaction history.
  • RippleNet propagates user preferences outward through the KG like ripples in water.
  • KGCN and KGAT build enriched item representations by aggregating from KG neighborhoods.
  • CKE combines three signal types (collaborative, structural, textual) into a unified item embedding.
  • Path-based reasoning provides explainability — concrete, human-readable reasons for each recommendation.
  • 2 hops is the sweet spot for most KG-enhanced methods; more hops introduce noise.
In this series

Recommendation Systems 16 parts

  1. 01 Recommendation Systems (1): Fundamentals and Core Concepts
  2. 02 Recommendation Systems (2): Collaborative Filtering and Matrix Factorization
  3. 03 Recommendation Systems (3): Deep Learning Foundations
  4. 04 Recommendation Systems (4): CTR Prediction and Click-Through Rate Modeling
  5. 05 Recommendation Systems (5): Embedding and Representation Learning
  6. 06 Recommendation Systems (6): Sequential Recommendation and Session-based Modeling
  7. 07 Recommendation Systems (7): Graph Neural Networks and Social Recommendation
  8. 08 Recommendation Systems (8): Knowledge Graph-Enhanced Recommendation you are here
  9. 09 Recommendation Systems (9): Multi-Task Learning and Multi-Objective Optimization
  10. 10 Recommendation Systems (10): Deep Interest Networks and Attention Mechanisms
  11. 11 Recommendation Systems (11): Contrastive Learning and Self-Supervised Learning
  12. 12 Recommendation Systems (12): Large Language Models and Recommendation
  13. 13 Recommendation Systems (13): Fairness, Debiasing, and Explainability
  14. 14 Recommendation Systems (14): Cross-Domain Recommendation and Cold-Start Solutions
  15. 15 Recommendation Systems (15): Real-Time Recommendation and Online Learning
  16. 16 Recommendation Systems (16): Industrial Architecture and Best Practices

Liked this piece?

Follow on GitHub for the next one — usually one a week.

GitHub