Graph

Definition

A graph $G = (V, E)$ is a pair of a set $V$ of vertices (nodes) and a set $E$ of pairs of vertices (edges).

Link to original

Node Embeddings

Node Embedding

Definition

The goal of node embedding is to find an encoder that maps nodes into an embedding space while preserving the similarity of the nodes.

The encoder maps a node to its embedding vector in $\mathbb{R}^{d}$, and the decoder extracts useful information, such as the local neighborhood or classification label of the corresponding node, from the embedding, where $d$ is the dimension of the embedded vectors.

Link to original

DeepWalk

Definition

DeepWalk is a random walk based Node Embedding method capturing the structural information of a Graph by performing random walks on it. This method can handle large graphs efficiently, and can be applied to various types of graphs (directed, undirected, weighted). However, it does not incorporate node attributes or features.

Algorithm

  1. Run short fixed-length uniform random walks starting from each node in the graph.
  2. These walks generate sequences of nodes for each node, similar to sentences in natural language processing.
  3. The sequences are then fed into a Word2Vec-like model to learn vector representations for each node.
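
A minimal sketch of the walk-generation step, assuming the graph is given as a plain adjacency-list dict (a hypothetical input format); the resulting "sentences" would then be passed to a skip-gram (Word2Vec-style) model:

```python
import random

# Minimal DeepWalk-style uniform random walks (illustrative sketch).
# `graph` is assumed to be an adjacency list: {node: [neighbor, ...]}.
def generate_walks(graph, num_walks=10, walk_length=40, seed=0):
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        nodes = list(graph)
        rng.shuffle(nodes)                 # start one walk from every node
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph[walk[-1]]
                if not neighbors:          # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbors))  # uniform transition
            walks.append([str(v) for v in walk])
    return walks

# The walks play the role of "sentences"; a skip-gram model (e.g. a
# Word2Vec implementation) would then learn one vector per node from them.
```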
Link to original

Node2Vec

Definition

Node2Vec is an extension of DeepWalk that introduces more flexible random walks. The model can capture both local and global network structure.

Algorithm

  1. Run short fixed-length biased random walks starting from each node in the graph, using a biased random-walk strategy controlled by two parameters $p$ and $q$.

Consider a random walk that just traversed edge $(t, v)$ and now resides at node $v$. The unnormalized transition probability from $v$ to a candidate node $x$ is $\pi_{vx} = \alpha_{pq}(t, x)\cdot w_{vx}$, where

$$\alpha_{pq}(t, x) = \begin{cases} \frac{1}{p} & \text{if } d_{tx} = 0 \\ 1 & \text{if } d_{tx} = 1 \\ \frac{1}{q} & \text{if } d_{tx} = 2 \end{cases}$$

  • $p$ is the return parameter, controlling the likelihood of immediately revisiting a node in the walk
  • $q$ is the in-out parameter, controlling the search strategy (depth-first vs. breadth-first)
  • $d_{tx}$ is the shortest-path distance between nodes $t$ and $x$
  2. These walks generate sequences of nodes for each node, similar to sentences in natural language processing.
  3. The sequences are then fed into a Word2Vec-like model to learn vector representations for each node.
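
A small sketch of one biased step and one walk under these $p$/$q$ rules, assuming an unweighted graph given as a dict of neighbor sets and that every node has at least one neighbor (all hypothetical inputs):

```python
import random

# One node2vec-style biased step (unweighted graph, illustrative sketch).
def biased_step(graph, prev, curr, p, q, rng):
    candidates = list(graph[curr])
    weights = []
    for x in candidates:
        if x == prev:                 # d(prev, x) == 0: return to previous node
            weights.append(1.0 / p)
        elif x in graph[prev]:        # d(prev, x) == 1: stay close (BFS-like)
            weights.append(1.0)
        else:                         # d(prev, x) == 2: move outward (DFS-like)
            weights.append(1.0 / q)
    return rng.choices(candidates, weights=weights, k=1)[0]

def node2vec_walk(graph, start, walk_length, p, q, seed=0):
    rng = random.Random(seed)
    walk = [start, rng.choice(list(graph[start]))]  # first step is uniform
    while len(walk) < walk_length:
        walk.append(biased_step(graph, walk[-2], walk[-1], p, q, rng))
    return walk
```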
Link to original

Graph Neural Networks

Graph Neural Network

Definition

Graph neural network (GNN) is a Neural Network model for processing data that can be represented as graphs. The model can learn and extract features from both node attributes and graph structure. The node features are iteratively updated by aggregating information from neighboring nodes.

Link to original

Graph Convolutional Network

Definition

Graph convolutional network (GCN) is a CNN-style architecture operating on Graph-structured data that takes two inputs: the node feature matrix $X$ and the graph's Adjacency Matrix $A$. GCN learns hidden layer vectors that encode both node features and local graph structure on a fixed graph.

GCN Layer

The GCN layer is a function of the node features and the adjacency matrix:

$$H^{(l+1)} = \sigma\left(AH^{(l)}W^{(l)}\right)$$

where:

  • $H^{(l)}$ is the node feature matrix at layer $l$, and $H^{(0)} = X$ is the input node feature matrix.
  • $A$ is the Adjacency Matrix of the graph
  • $W^{(l)}$ is the learnable weight matrix for layer $l$. It works similarly to a CNN filter with multiple channels; it is shared across all nodes and independent of the graph size.
  • $\sigma$ is a non-linear Activation Function

Normalized GCN Layer

To increase numerical stability and enforce self-loops, the Adjacency Matrix is transformed by adding an identity matrix and normalizing with the Degree Matrix:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}\right)$$

where $\tilde{A} = A + I$ and $\tilde{D}$ is the Degree Matrix of $\tilde{A}$.
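
A minimal numpy sketch of this normalized layer, assuming a dense adjacency matrix and ReLU as the (arbitrarily chosen) non-linearity:

```python
import numpy as np

# One normalized GCN layer (illustrative sketch).
# A: (n x n) dense adjacency matrix, H: (n x d_in) node features,
# W: (d_in x d_out) learnable weights.
def gcn_layer(A, H, W):
    A_tilde = A + np.eye(A.shape[0])           # add self-loops
    deg = A_tilde.sum(axis=1)                  # degrees of the self-looped graph
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))   # D̃^{-1/2}
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)      # ReLU non-linearity
```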

GCN Layer With Self-Transform Term

In some scenarios, a self-transformation term is added to the GCN layer to help preserve node-specific information:

$$H^{(l+1)} = \sigma\left(AH^{(l)}W^{(l)} + H^{(l)}B^{(l)}\right)$$

where $B^{(l)}$ is the learnable self-transformation matrix for layer $l$.

GCN Layer With Skip Connection

A skip connection can ease the over-smoothing problem.

Neighborhood-Based Approach

The GCN layer can equivalently be written per node, aggregating over each node's neighbors:

$$h_{i}^{(l+1)} = \sigma\left(W^{(l)}\sum_{j \in \mathcal{N}(v_{i})}\frac{h_{j}^{(l)}}{|\mathcal{N}(v_{i})|}\right)$$

where $\mathcal{N}(v_{i})$ is the set of neighbors of node $v_{i}$.

The aggregation function could be a mean, sum, or max operation.

Facts

The output of a GCN is permutation invariant: it does not depend on the ordering of nodes in the input graph. The output of each layer in a GCN is permutation equivariant: permuting the nodes of the input graph permutes the rows of the layer output in the same way.

Link to original

GraphSAGE

Definition

Graph SAmple and aggreGatE (GraphSAGE) is a Node Embedding method that samples and aggregates features from each node’s local neighborhood. It can generalize to unseen data.

Architecture

Unlike GCN, GraphSAGE does not require the adjacency matrix of the entire graph; instead, it uses a fixed number of sampled neighbors for each node.

Aggregators

Mean aggregator:

$$\operatorname{AGG}_{k}^{\text{mean}} = \frac{1}{|\mathcal{N}_{k}(v_{i})|}\sum_{v_{j}\in \mathcal{N}_{k}(v_{i})} h^{(k-1)}_{j}$$

LSTM aggregator:

$$\operatorname{AGG}_{k}^{\text{LSTM}} = \operatorname{LSTM}(\{h^{(k-1)}_{j}|v_{j}\in \mathcal{N}_{k}(v_{i})\})$$

where the order of the feature sequence is random.

Pooling aggregator:

$$\operatorname{AGG}_{k}^{\text{pool}} = \max(\{\operatorname{MLP}(h^{(k-1)}_{j})|v_{j}\in \mathcal{N}_{k}(v_{i})\})$$

Algorithm

  1. For $k=1,\dots, K$:
    1. For each node $v_{i} \in V$:
      1. Sample a fixed-size set of neighbors $\mathcal{N}_{k}(v_{i})$.
      2. Aggregate the features of the sampled neighbors using an aggregator function (mean, LSTM, pooling). $$h^{(k)}_{\mathcal{N}(v_{i})} = \operatorname{AGG}_{k}(\{h^{(k-1)}_{j}|v_{j}\in \mathcal{N}_{k}(v_{i})\})$$
      3. Combine the aggregated neighborhood information with the node's own features and apply a non-linear Activation Function. $$h_{i}^{(k)} = \sigma(W^{(k)} \operatorname{CONCAT}(h_{i}^{(k-1)}, h_{\mathcal{N}(v_{i})}^{(k)}))$$ where $\sigma$ is a non-linear Activation Function, and $h_{i}^{(k)}$ is the feature of node $v_{i}$ at the $k$-th stage.
    2. Normalize the feature embeddings: $$h_{i}^{(k)} = \frac{h_{i}^{(k)}}{||h_{i}^{(k)}||_{2}},\forall v_{i}\in V$$
  2. After $K$ iterations, the final vectors are the output node embeddings: $$z_{i} = h_{i}^{(K)},\forall v_{i} \in V$$

Link to original
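
A minimal sketch of a single GraphSAGE update step with a mean aggregator, assuming dense numpy features, a precomputed neighbor list per node, and at least one neighbor per node (all hypothetical inputs):

```python
import numpy as np

# One GraphSAGE step with a mean aggregator (illustrative sketch).
# H: (n x d) current node features; neighbors[i]: list of neighbor indices
# of node i; W: (2d x d_out) learnable weights.
def graphsage_mean_layer(H, neighbors, W, sample_size=5, seed=0):
    rng = np.random.default_rng(seed)
    out = []
    for i in range(H.shape[0]):
        nbrs = neighbors[i]
        sampled = rng.choice(nbrs, size=min(sample_size, len(nbrs)), replace=False)
        h_nbr = H[sampled].mean(axis=0)                 # AGG: mean of sampled neighbors
        h_cat = np.concatenate([H[i], h_nbr])           # CONCAT(self, neighborhood)
        out.append(np.maximum(h_cat @ W, 0.0))          # ReLU
    Z = np.stack(out)
    return Z / np.linalg.norm(Z, axis=1, keepdims=True) # L2-normalize embeddings
```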

Graph Attention Network

Definition

Graph attention network (GAT) applies an Attention mechanism to the aggregation stage of the GraphSAGE model. Instead of assigning the same weight to all neighbors, GAT computes attention coefficients between nodes.

Architecture

Attention

The attention values are calculated as

$$e_{ij} = \operatorname{LeakyReLU}\left(a^{\intercal}[Wh_{i} \,\|\, Wh_{j}]\right), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(v_{i})}\exp(e_{ik})}$$

where $a$ and $W$ are learnable parameters shared across all nodes.

Aggregation

The feature of node $v_{i}$ at the $k$-th stage is calculated with attention-weighted aggregation:

$$h_{i}^{(k)} = \sigma\left(\sum_{j \in \mathcal{N}(v_{i})}\alpha_{ij}W^{(k)}h_{j}^{(k-1)}\right)$$

where $\sigma$ is a non-linear Activation Function.

Multi-Head Attention

To stabilize the learning process, GAT employs multi-head attention: $K$ independent attention heads are computed on the same input, and their output features are concatenated or averaged, where $K$ is the number of attention heads.
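
A small numpy sketch of a single attention head for one node, assuming dense features and placeholder parameters $W$ and $a$ (hypothetical shapes noted in the comments):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

# Single-head GAT update for one node i (illustrative sketch).
# H: (n x d_in) node features; nbrs: indices of i's neighbors (incl. i);
# W: (d_in x d_out) shared linear map; a: (2*d_out,) attention vector.
def gat_update(H, i, nbrs, W, a):
    Wh_i = H[i] @ W
    scores = []
    for j in nbrs:
        Wh_j = H[j] @ W
        scores.append(leaky_relu(a @ np.concatenate([Wh_i, Wh_j])))
    alpha = np.exp(np.array(scores))
    alpha = alpha / alpha.sum()                           # softmax over the neighborhood
    return sum(alpha[k] * (H[j] @ W) for k, j in enumerate(nbrs))  # weighted aggregation
```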

Link to original

Differentiable Pooling

Definition

Differentiable pooling (DiffPool) is a hierarchical graph pooling method that reduces the size of graph representations. The model is designed for graph-level tasks, such as graph classification, regression, and matching.

Architecture

DiffPool learns a soft assignment matrix that maps nodes in the input graph to a set of clusters and uses the assignment to generate a coarsened graph with fewer nodes.

The DiffPool layer is constructed from two GNNs: an embedding GNN and a pooling GNN.

The embedding GNN is used to generate an embedding matrix

$$Z^{(l)} = \operatorname{GNN}_{l,\text{embed}}(X^{(l)}, A^{(l)})$$

The pooling GNN is used to generate an assignment matrix representing a soft clustering of nodes

$$S^{(l)} = \operatorname{softmax}\left(\operatorname{GNN}_{l,\text{pool}}(X^{(l)}, A^{(l)})\right)$$

where $X^{(l)}$ and $A^{(l)}$ are the input cluster features and adjacency matrix at layer $l$ respectively, and the number of columns of $S^{(l)}$ is a hyperparameter smaller than the number of input nodes.

The DiffPool layer then generates a new coarsened adjacency matrix and a new embedding matrix using the outputs of the embedding and pooling GNNs:

$$X^{(l+1)} = {S^{(l)}}^{\intercal}Z^{(l)}, \qquad A^{(l+1)} = {S^{(l)}}^{\intercal}A^{(l)}S^{(l)}$$

Link to original

Graph Isomorphism Network

Definition

The Graph isomorphism network (GIN) is a GNN architecture with greater expressive power and the ability to distinguish different graph structures through injective aggregation.

Architecture

A GNN generates a Node Embedding using the computational graph corresponding to a subtree rooted at each node. The most expressive GNN maps different computational graphs (rooted subtrees) to different node embeddings. If each step of the GNN’s neighbor aggregation is injective, then the model retains the full neighborhood information, so the generated node embeddings can distinguish different rooted subtrees.

Any injective multi-set function can be expressed as

$$\Phi\left(\sum_{x \in S} f(x)\right)$$

where $\Phi$ and $f$ are non-linear functions.

Thanks to the Universal Approximation Theorem, both functions can be approximated with multi-layer perceptrons.

GIN uses this injective aggregation over the features of a node and its neighbors:

$$\begin{aligned} h_{i}^{(k)} &= \operatorname{MLP}_{\phi}^{(k)}\left((1+\epsilon^{(k)})\operatorname{MLP}^{(k)}_{f}(h_{i}^{(k-1)})+ \sum\limits_{j \in \mathcal{N}(v_{i})}\operatorname{MLP}^{(k)}_{f}(h_{j}^{(k-1)}) \right)\\ &\approx \operatorname{MLP}^{(k)}\left((1+\epsilon^{(k)})h_{i}^{(k-1)} + \sum\limits_{j \in \mathcal{N}(v_{i})} h_{j}^{(k-1)}\right) \end{aligned}$$

where $\epsilon^{(k)}$ is a learnable scalar that weights the central node's own features.

Link to original
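
A minimal numpy sketch of the simplified GIN update above, assuming a dense adjacency matrix and a two-layer MLP with hypothetical placeholder weights:

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2   # two-layer MLP with ReLU

# Simplified GIN layer (illustrative sketch): sum neighbor features, add
# (1 + eps) times the node's own features, then apply an MLP.
def gin_layer(A, H, eps, W1, b1, W2, b2):
    agg = (1.0 + eps) * H + A @ H                   # injective sum aggregation
    return mlp(agg, W1, b1, W2, b2)
```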

Heterogeneous Graphs

Relational GCN

Definition

The Relational graph convolutional network (RGCN) is an extension of the GCN designed to handle graphs with multiple types of edges or relations between nodes.

Architecture

The model uses relation-specific weight matrices. The information from a node's neighbors is aggregated while taking the different relation types into account:

$$h_{i}^{(l+1)} = \sigma\left(W_{0}^{(l)}h_{i}^{(l)} + \sum_{r \in \mathcal{R}}\sum_{j \in \mathcal{N}_{i}^{r}}\frac{1}{|\mathcal{N}_{i}^{r}|}W_{r}^{(l)}h_{j}^{(l)}\right)$$

where:

  • $h_{i}^{(l)}$ is the feature vector of node $v_{i}$ at layer $l$
  • $\mathcal{R}$ is the set of relations (edge types) in the graph
  • $\mathcal{N}_{i}^{r}$ is the set of neighbors of node $v_{i}$ under relation $r$.
  • $W_{r}^{(l)}$ is the learnable weight matrix for relation $r$ at layer $l$ ($W_{0}^{(l)}$ is for the self-connection).
  • $\sigma$ is a non-linear Activation Function
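
A small numpy sketch of one RGCN layer, assuming each relation is given as its own dense adjacency matrix and that the per-relation and self-connection weights are hypothetical placeholders:

```python
import numpy as np

# One RGCN layer (illustrative sketch).
# A_r: dict {relation: (n x n) adjacency matrix}
# W_r: dict {relation: (d_in x d_out) weight matrix}, W_self: self-connection weights.
def rgcn_layer(H, A_r, W_r, W_self):
    out = H @ W_self                                          # self-connection term
    for r, A in A_r.items():
        deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)   # |N_i^r|, avoid division by zero
        out = out + (A / deg) @ H @ W_r[r]                    # normalized relation-specific sum
    return np.maximum(out, 0.0)                               # ReLU
```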

Regularization Methods

The number of parameters of the model increases rapidly with the number of relations, which causes overfitting and scalability issues. Techniques such as basis decomposition or block-diagonal decomposition of the weight matrices can be employed to reduce the number of parameters and improve model efficiency.

Basis Decomposition

In basis decomposition, instead of learning a separate weight matrix for each relation, each relation-specific weight matrix is expressed as a linear combination of a smaller set of shared basis matrices:

$$W_{r}^{(l)} = \sum_{b=1}^{B} a_{rb}^{(l)} V_{b}^{(l)}$$

where:

  • $B$ is the number of basis matrices (typically $B \ll |\mathcal{R}|$)
  • $V_{b}^{(l)}$ are the learnable basis matrices
  • $a_{rb}^{(l)}$ is the learnable importance weight of basis matrix $V_{b}^{(l)}$ for relation $r$.

This method reduces the number of parameters from $O(|\mathcal{R}|\,d^{(l)}d^{(l+1)})$ to $O(B\,d^{(l)}d^{(l+1)} + B|\mathcal{R}|)$, where $d^{(l)}$ is the hidden dimension in layer $l$.

Block-Diagonal Decomposition

In block-diagonal decomposition, each relation-specific weight matrix is constrained to be block-diagonal:

$$W_{r}^{(l)} = \begin{bmatrix} Q_{1r}^{(l)} & & \\ & \ddots & \\ & & Q_{Br}^{(l)} \end{bmatrix}$$

where each $Q_{br}^{(l)}$ is a low-dimensional matrix of size $\frac{d^{(l)}}{B} \times \frac{d^{(l+1)}}{B}$.

This method reduces the number of parameters from $O(|\mathcal{R}|\,d^{(l)}d^{(l+1)})$ to $O\left(|\mathcal{R}|\,\frac{d^{(l)}d^{(l+1)}}{B}\right)$, where $d^{(l)}$ is the hidden dimension in layer $l$.

It can be viewed as grouping the channels of the node feature vector and transforming the channel values only within each group.

Link to original

Heterogeneous Graph Transformer

Definition

Heterogeneous graph transformer (HGT) is an extension of the GCN designed to handle graphs with multiple types of nodes and edges.

Architecture

Meta-relation triplets

For a target node $t$ with source nodes $s_{1}$ and $s_{2}$ connected by edges $e_{1}$ and $e_{2}$, the corresponding meta-relations are $\langle \tau(s_{1}), \phi(e_{1}), \tau(t)\rangle$ and $\langle \tau(s_{2}), \phi(e_{2}), \tau(t)\rangle$, where $\tau(\cdot)$ denotes the node type and $\phi(\cdot)$ the edge type.

HGT uses meta-relation triplets (source node type, edge type, target node type) to describe the relationships between different entity types in a graph.

Heterogeneous Mutual Attention

HGT uses an attention mechanism that considers node types and edge types. The attention weights are computed based on the meta-relation triplets, allowing for type-specific information propagation.

For a meta-relation , the attention is calculated by where is the neighbors of the node , and is the number of attention heads.

The attention head is calculated by where:

  • is a query vector
  • is a key vector
  • and are the feature vectors of the source and target nodes at layer .
  • , , and are learnable parameter matrices specific to the meta-relation.

Heterogeneous Message Passing

The information of the source nodes is passed to the target node. As in the attention process, the meta-relations of the edges are incorporated into the message-passing process.

For a meta-relation , the multi-head message is calculated by where is the number of message heads.

The message head is calculated by where:

  • is a value vector
  • , are learnable parameter matrices specific to the meta-relation.

Target-Specific Aggregation

The heterogeneous multi-head attention and messages are aggregated from the source nodes to the target node. The attention values are used as weights to average the corresponding messages from the source nodes.

The updated vector is calculated by

The final output, contextualized representation of the node, is calculated using the updated vector. where

  • is the feature vectors of the target node at layer , and works as a residual connection.
  • is learnable parameter matrix specific to the meta-relation.
  • is a non-linear Activation Function
Link to original

Knowledge Graphs

Knowledge Graph

Definition

Knowledge graph (KG) is a structured representation of information that organizes data in the form of entities and their relationships. Commonly, knowledge graphs are massive and incomplete. So, knowledge graph completion (KG completion) is an important task.

Knowledge Graph Completion Task

Edges in a KG are represented as triplets $(h, r, t)$, where $h$ is a head, $r$ is a relation, and $t$ is a tail. Given a triplet $(h, r, t)$, the goal of KG completion is that the embedding of $(h, r)$ should be close to the embedding of $t$. A score function $f_{r}(h, t)$ is high if the input triplet is probable, and low otherwise.

Relation Patterns in Knowledge Graph

Types of relations embedding models can represent

| Model | Score | Embedding | Symmetric | Antisymmetric | Inverse | Transitive | 1-to-N |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TransE | $-\lVert \mathbf{h} + \mathbf{r} - \mathbf{t}\rVert$ | $\mathbf{h}, \mathbf{r}, \mathbf{t} \in \mathbb{R}^{k}$ | ✗ | ✓ | ✓ | ✓ | ✗ |
| TransR | $-\lVert M_{r}\mathbf{h} + \mathbf{r} - M_{r}\mathbf{t}\rVert$ | $\mathbf{h}, \mathbf{t} \in \mathbb{R}^{k}$, $\mathbf{r} \in \mathbb{R}^{d}$, $M_{r} \in \mathbb{R}^{d \times k}$ | ✓ | ✓ | ✓ | ✗ | ✓ |
| DistMult | $\langle \mathbf{h}, \mathbf{r}, \mathbf{t}\rangle$ | $\mathbf{h}, \mathbf{r}, \mathbf{t} \in \mathbb{R}^{k}$ | ✓ | ✗ | ✗ | ✗ | ✓ |
| ComplEx | $\operatorname{Re}(\langle \mathbf{h}, \mathbf{r}, \bar{\mathbf{t}}\rangle)$ | $\mathbf{h}, \mathbf{r}, \mathbf{t} \in \mathbb{C}^{k}$ | ✓ | ✓ | ✓ | ✗ | ✓ |

Predictive Queries on Knowledge Graph

One-Hop Query

Path Query

Conjunctive Query

Examples

Node types: drug, disease, adverse event, protein, pathways Relation types: has_func, causes, assoc, treats, is_a

Link to original

TransE

Definition

Translating embeddings (TransE) is a Knowledge Graph embedding method. TransE represents entities and relations as vectors in the same low-dimensional space. The model is trained through Contrastive Learning.

For a triplet $(h, r, t)$, let $\mathbf{h}, \mathbf{r}, \mathbf{t} \in \mathbb{R}^{k}$ be the embedding vectors. The objective of TransE is to find vectors satisfying

$$\mathbf{h} + \mathbf{r} \approx \mathbf{t}$$

The score function is defined as

$$f_{r}(h, t) = -\lVert \mathbf{h} + \mathbf{r} - \mathbf{t}\rVert$$

TransE cannot model symmetric and 1-to-N relations.
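
A tiny numpy sketch of the TransE score and a contrastive margin loss over one (positive, corrupted) pair of triplets; the vectors here are hypothetical inputs:

```python
import numpy as np

# TransE scoring (illustrative sketch): h + r should land close to t.
def transe_score(h, r, t):
    return -np.linalg.norm(h + r - t)      # higher score = more plausible triplet

# Margin-based contrastive loss for one positive and one corrupted triplet.
def transe_margin_loss(pos, neg, margin=1.0):
    h, r, t = pos
    h_n, r_n, t_n = neg
    return max(0.0, margin - transe_score(h, r, t) + transe_score(h_n, r_n, t_n))
```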

Link to original

TransR

Definition

Translating embeddings in relational space (TransR) is a Knowledge Graph embedding method. It extends the idea of the TransE model by introducing separate embedding spaces for entities and relations.

For a triplet $(h, r, t)$, let $\mathbf{h}, \mathbf{t} \in \mathbb{R}^{k}$ be vectors in the entity space, and $\mathbf{r} \in \mathbb{R}^{d}$ be a vector in the relation space.

The objective of TransR is to find vectors satisfying

$$\mathbf{h}_{\perp} + \mathbf{r} \approx \mathbf{t}_{\perp}$$

where $\mathbf{h}_{\perp} = M_{r}\mathbf{h}$ and $\mathbf{t}_{\perp} = M_{r}\mathbf{t}$ are the projections of $\mathbf{h}$ and $\mathbf{t}$ into the relation space, and $M_{r} \in \mathbb{R}^{d \times k}$ is the projection matrix of relation $r$.

The score function is defined as

$$f_{r}(h, t) = -\lVert \mathbf{h}_{\perp} + \mathbf{r} - \mathbf{t}_{\perp}\rVert$$

Link to original

DistMult

Definition

DistMult is a Knowledge Graph embedding method representing entities and relations as vectors in the same low-dimensional space. It uses a bilinear scoring function with a diagonal matrix for each relation.

For a triplet $(h, r, t)$, let $\mathbf{h}, \mathbf{r}, \mathbf{t} \in \mathbb{R}^{k}$ be the embedding vectors.

The score function of DistMult is defined as

$$f_{r}(h, t) = \langle \mathbf{h}, \mathbf{r}, \mathbf{t}\rangle = \sum_{i}\mathbf{h}_{i}\,\mathbf{r}_{i}\,\mathbf{t}_{i}$$

DistMult cannot model antisymmetric, inverse, and transitive relations.

Link to original

ComplEx

Definition

Complex embedding (ComplEx) is a Knowledge Graph embedding method. It extends the idea of DistMult in a vector space to the complex domain.

For a triplet $(h, r, t)$, let $\mathbf{h}, \mathbf{r}, \mathbf{t} \in \mathbb{C}^{k}$ be complex embedding vectors.

The score function of ComplEx is defined as

$$f_{r}(h, t) = \operatorname{Re}\left(\sum_{i}\mathbf{h}_{i}\,\mathbf{r}_{i}\,\bar{\mathbf{t}}_{i}\right)$$

ComplEx cannot model transitive relations.
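
For comparison, a small numpy sketch of the DistMult and ComplEx score functions side by side (the embedding vectors are hypothetical inputs; ComplEx uses complex-valued arrays):

```python
import numpy as np

# DistMult: bilinear score with a diagonal relation matrix (illustrative sketch).
def distmult_score(h, r, t):
    return float(np.sum(h * r * t))

# ComplEx: real part of <h, r, conj(t)> over complex-valued vectors.
def complex_score(h, r, t):
    return float(np.real(np.sum(h * r * np.conj(t))))
```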

Link to original

Reasoning over Knowledge Graphs

One-Hop Query

Definition

A one-hop query is the same task as the knowledge graph completion task. KG completion: is the link $(h, r, t)$ in the KG? One-hop query: is $t$ an answer to the query $(h, (r))$?

Link to original

Path Query

Definition

A path query is a generalization of the One-Hop Query with more relations on the path.

An $n$-hop path query can be represented by $q = (v_{a}, (r_{1}, \dots, r_{n}))$, where $v_{a}$ is an anchor entity and $r_{i}$ is the $i$-th relation.

Traversing Knowledge Graph

KGs are incomplete, so we cannot identify all the answer entities by simple traversal. For this reason, path-based queries over an incomplete Knowledge Graph are answered with a KG embedding method that can handle transitive relations, such as TransE.

Examples

Query: (Fulvestrant, (Causes, Assoc))

Link to original

Conjunctive Query

Definition

A conjunctive query has multiple anchor entities, each with its own path query.

A conjunctive query can be represented by $q = ((v_{a}^{1}, (r_{1}^{1}, \dots, r_{n_{1}}^{1})), \dots, (v_{a}^{m}, (r_{1}^{m}, \dots, r_{n_{m}}^{m})))$, where $v_{a}^{i}$ is the anchor entity of the $i$-th Path Query, and $r_{j}^{i}$ is the $j$-th relation of the $i$-th path.

Traversing Knowledge Graph

Each intermediate node of a conjunctive query represents a set of entities, so we cannot use the method used for a Path Query. The Query2Box model can handle this problem.

Link to original

Query2Box

Definition

Query2Box is a model for calculating path-based queries over an incomplete Knowledge Graph. It represents queries as boxes in a vector space. To answer a query, the model computes the final query box and ranks entities based on their distance to the box.

Architecture

Box Embedding

In the Query2Box model, queries are represented as boxes in a vector space, and entities are represented as points (zero-size box) in the space. The box embedding allows for modeling uncertainty and capturing a set of possible answers.

A query box in $\mathbb{R}^{d}$ is defined as $\mathbf{q} = (\operatorname{Cen}(q), \operatorname{Off}(q))$, where $\operatorname{Cen}(q)$ is the center of the box and $\operatorname{Off}(q)$ is the offset of the box.

The embedding box is defined using the query box:

$$\operatorname{Box}_{q} = \{v \in \mathbb{R}^{d} : \operatorname{Cen}(q) - \operatorname{Off}(q) \preceq v \preceq \operatorname{Cen}(q) + \operatorname{Off}(q)\}$$

where $\preceq$ is element-wise inequality.

Query Operations

Query2Box consists of the two operations: projection, and intersection.

Projection Operation

The projection operation expands the box to include related entities: given the current box and a relation $r$ with embedding $(\operatorname{Cen}(r), \operatorname{Off}(r))$, the new box has center $\operatorname{Cen}(q) + \operatorname{Cen}(r)$ and offset $\operatorname{Off}(q) + \operatorname{Off}(r)$.

Intersection Operation

The intersection operation combines information from multiple sub-queries by intersecting their boxes. Instead of directly applying a geometric intersection over a set of box embeddings, the intersection is calculated using an Attention-like operation.

The center is calculated as where

The offset is calculated as where is a Sigmoid Function, and

Entity-to-Box Distance

Given a query box $\mathbf{q}$ and an entity vector $\mathbf{v}$, the distance between them is defined as

$$d_{\text{box}}(\mathbf{q}, \mathbf{v}) = d_{\text{out}}(\mathbf{q}, \mathbf{v}) + \alpha \cdot d_{\text{in}}(\mathbf{q}, \mathbf{v})$$

where $d_{\text{out}}$ is the distance from $\mathbf{v}$ to the box boundary, $d_{\text{in}}$ is the distance from the box center to the point where $\mathbf{v}$ (or its projection onto the box) lies, and $0 < \alpha < 1$ is a fixed scalar that downweights the distance inside the box.
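
A minimal numpy sketch of this entity-to-box distance, assuming L1 distances and an illustrative value of $\alpha$:

```python
import numpy as np

# Entity-to-box distance (illustrative sketch). The box is given by its
# center and non-negative offset; alpha downweights the inside distance.
def box_distance(center, offset, v, alpha=0.2):
    lo, hi = center - offset, center + offset
    d_out = np.linalg.norm(np.maximum(v - hi, 0) + np.maximum(lo - v, 0), ord=1)
    proj = np.minimum(hi, np.maximum(lo, v))        # v clipped into the box
    d_in = np.linalg.norm(center - proj, ord=1)
    return d_out + alpha * d_in
```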

Link to original

Subgraph and Network Motifs

Subgraph

Definition

A subgraph of a Graph $G$ is another graph formed from a subset of the vertices and edges of $G$.

Node-Induced Subgraph

Take a subset of the nodes and all edges induced by those nodes:

$$G' = (V', E'), \quad \text{where } V' \subseteq V \text{ and } E' = \{(u, v) \in E : u, v \in V'\}$$

Edge-Induced Subgraph

Take a subset of the edges and all corresponding nodes:

$$G' = (V', E'), \quad \text{where } E' \subseteq E \text{ and } V' = \{v \in V : v \text{ is an endpoint of some } e \in E'\}$$

Link to original

Graph Isomorphism

Definition

An isomorphism of Graphs $G$ and $H$ is a Bijection $f$ between the vertex sets of the graphs such that any two vertices $u$ and $v$ of $G$ are adjacent in $G$ if and only if $f(u)$ and $f(v)$ are adjacent in $H$.

If there exists an isomorphism between two graphs, the graphs are isomorphic.

Link to original

Subgraph Isomorphism

Definition

Given two graphs $G_{1}$ and $G_{2}$, if $G_{2}$ contains a Subgraph that is isomorphic to $G_{1}$, then $G_{1}$ is subgraph-isomorphic to $G_{2}$.

Link to original

Graph-Level Subgraph Frequency

Definition

Graph-level subgraph frequency refers to the number of times a particular Subgraph pattern appears in the entire network.

Given a target Graph $G_{T}$ and a Subgraph of interest $G_{Q}$, the graph-level frequency of $G_{Q}$ in $G_{T}$ is the number of unique subsets of nodes of $G_{T}$ whose induced subgraph is isomorphic to $G_{Q}$.

Link to original

Node-Level Subgraph Frequency

Definition

Given a Subgraph of interest $G_{Q}$, its anchor node $u$, and a target Graph $G_{T}$, the node-level frequency is defined as the number of nodes $v$ in $G_{T}$ for which some subgraph of $G_{T}$ is isomorphic to $G_{Q}$ and the isomorphism maps $v$ to $u$.

Link to original

Network Motif

Definition

Network motifs are recurrent and significant interconnected subgraphs of a larger Graph. To be considered a motif, a subgraph must occur more frequently than expected by chance in randomized graphs with the same statistics (e.g. number of nodes, edges, degree sequence).

Z-Score

The significance of a network motif can be measured by the Z-score:

$$Z_{i} = \frac{N_{i}^{\text{real}} - \bar{N}_{i}^{\text{rand}}}{\operatorname{std}(N_{i}^{\text{rand}})}$$

where $N_{i}^{\text{real}}$ is the occurrence count of subgraph $i$ in the real graph, $\bar{N}_{i}^{\text{rand}}$ is the average occurrence count of subgraph $i$ in the randomized graphs, and $\operatorname{std}(N_{i}^{\text{rand}})$ is the standard deviation of those counts.

The occurrences are measured by Graph-Level Subgraph Frequency or Node-Level Subgraph Frequency.

A high Z-score indicates that the Subgraph appears significantly more often in the real network than expected by chance, suggesting it may be a functionally important motif.
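
A tiny sketch of this Z-score computation, assuming the occurrence counts for the real graph and a list of randomized graphs have already been obtained (hypothetical inputs):

```python
import numpy as np

# Motif Z-score (illustrative sketch): compare a subgraph's count in the real
# graph against its counts in an ensemble of randomized graphs.
def motif_z_score(count_real, counts_random):
    counts_random = np.asarray(counts_random, dtype=float)
    return (count_real - counts_random.mean()) / counts_random.std()
```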

Network Significance Profile

Motifs perform specific functions within a graph, contributing to its overall behavior. The network significance profile is the vector of normalized Z-scores, $SP_{i} = Z_{i}/\sqrt{\sum_{j}Z_{j}^{2}}$, which allows networks of different sizes to be compared. Different types of graphs (e.g. gene regulation networks, neural networks, social networks, and word connectivity) often have characteristic motifs.

Link to original

Subgraph Matching

Neural Subgraph Matching

Definition

Neural subgraph matching is a model designed to solve the subgraph matching problem using neural networks.

Architecture

Consider a query subgraph $G_{Q}$ and a target graph $G_{T}$.

Embedding

For each node in the query graph $G_{Q}$ and the target graph $G_{T}$, obtain a $k$-hop neighborhood around the anchor. The neighborhoods are embedded using a GNN by computing the embeddings of the anchor nodes within their respective neighborhoods. The GNN is trained so that the embedding space has a partial-order relation.

Order Embedding

Order embedding enforces a partial-order relationship between the node embeddings of subgraphs. It is used to capture the hierarchical relationship between graphs and their subgraphs: a partial order is defined in the embedding space where larger graphs are located above their subgraphs.

The subgraph isomorphism relationship can be encoded in the order embedding space: $G_{Q}$ (anchored at $q$) is a subgraph of $G_{T}$ (anchored at $t$) iff $z_{q}[i] \le z_{t}[i]$ for all $i = 1, \dots, D$, where $D$ is the dimension of the embedding space and $z_{q}[i]$ is the $i$-th value of the embedding vector of subgraph anchor $q$.

The GNN for the embedding is trained by minimizing a max-margin loss over Contrastive Learning examples. The margin (order violation) between the graph embeddings of $G_{q}$ and $G_{t}$ is defined as

$$E(G_{q}, G_{t}) = \lVert\max(0, z_{q} - z_{t})\rVert_{2}^{2}$$

The max-margin loss is defined as

$$\mathcal{L} = \sum_{(G_{q}, G_{t}) \in P} E(G_{q}, G_{t}) + \sum_{(G_{q}, G_{t}) \in N} \max(0, \alpha - E(G_{q}, G_{t}))$$

where $P$ is a set of positive pairs and $N$ is a set of negative pairs.

The max-margin loss prevents the model from embedding all vectors too far from the origin.

Subgraph Prediction

Consider a query graph $G_{q}$ anchored at node $q$ and a target graph $G_{t}$ anchored at node $t$. By reusing the margin $E(G_{q}, G_{t})$ from the embedding step, we can check whether $G_{q}$ is a node-anchored subgraph of the target graph $G_{t}$: predict that it is if $E(G_{q}, G_{t}) < \epsilon$, where $\epsilon$ is a hyperparameter that works as a threshold.

To check if $G_{Q}$ is isomorphic to a subgraph of $G_{T}$, repeat this process for all anchor pairs $q \in G_{Q}$ and $t \in G_{T}$ and aggregate the results to make the binary prediction for the subgraph-matching decision problem. Here, $G_{q}$ is the neighborhood around the anchor node $q$.

Link to original

SPMiner

Definition


Subgraph pattern miner (SPMiner) is a model for identifying frequent subgraphs (network motifs) in a graph.

Architecture

The frequent subgraph counting problem consists of two stages: searching over all possible motifs, and counting the frequency of each motif in the graph. SPMiner correspondingly consists of two steps: embedding candidate subgraphs, and a motif search procedure.

Embedding candidate subgraphs

SPMiner decomposes the input Graph into many overlapping node-anchored neighborhoods around each node. It then encodes each neighborhood into an order embedding space.

Motif Search Procedure

SPMiner then reasons directly in the embedding space to identify frequent motifs. It searches for a $k$-step walk in the embedding space that stays to the lower left of as many neighborhood embeddings as possible. The walk is performed by iteratively adding nodes and edges to the current motif candidate and tracking its embedding.

Given an order embedding encoder $f$, let a Graph generation procedure be $G_{0} \to G_{1} \to \dots \to G_{k}$, where at any step $t$, $G_{t}$ is generated by adding a node to $G_{t-1}$. Then the sequence of embeddings $f(G_{0}), f(G_{1}), \dots, f(G_{k})$ is a monotonic walk in the order embedding space. Finding a frequent motif is the same as finding a walk that ends to the lower left of as many neighborhood embeddings as possible.

SPMiner can quickly count the number of occurrences of a given motif by simply checking the number of neighborhoods that are embedded to the top-right of it in the embedding space

Link to original

GNNs for Recommendation System

Recall at K

Definition

Recall at K (Recall@K) is a metric that helps evaluate the performance of a recommendation system. It is the proportion of correctly identified relevant items in the top-K recommendations out of the total number of relevant items in the dataset.

For each user $u$, let $P_{u}$ be the set of positive items the user will interact with, and $R_{u}$ be the set of items recommended by the model; in top-$K$ recommendation, $|R_{u}| = K$. Then $\operatorname{Recall@K}(u) = \frac{|P_{u} \cap R_{u}|}{|P_{u}|}$.

Recall@K is not differentiable.
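
A tiny sketch of the per-user computation, assuming the recommended list and the set of relevant items are given as plain Python collections:

```python
# Recall@K for one user (illustrative sketch): fraction of the user's relevant
# items that appear among the top-K recommendations.
def recall_at_k(recommended_top_k, relevant_items):
    relevant = set(relevant_items)
    hits = sum(1 for item in recommended_top_k if item in relevant)
    return hits / len(relevant) if relevant else 0.0
```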

Link to original

Binary Loss

Definition

Binary loss is a metric that helps evaluate the performance of a recommendation system. It treats the recommendation problem as a binary classification task.

Let $U$ be the set of all users, $V$ be the set of all items, $E$ be the set of observed (positive) user-item interactions, and $E_{\text{neg}}$ be a set of negative edges. The binary loss function is defined as

$$\mathcal{L} = -\frac{1}{|E|}\sum_{(u,v) \in E}\log \sigma(f_{\theta}(u,v)) - \frac{1}{|E_{\text{neg}}|}\sum_{(u,v) \in E_{\text{neg}}}\log(1 - \sigma(f_{\theta}(u,v)))$$

where:

  • $\sigma$ is a Sigmoid Function
  • $\mathbf{e}_{u}$ and $\mathbf{e}_{v}$ are the embedding vectors of $u$ and $v$.
  • $f_{\theta}(u, v) = \mathbf{e}_{u}^{\intercal}\mathbf{e}_{v}$ is the score between $u$ and $v$.

Since the binary loss is non-personalized, it pushes the scores of all positive edges above the scores of all negative edges, regardless of which user they belong to.

Link to original

Bayesian Personalized Ranking

Definition

Bayesian personalized ranking (BPR) is a metric that helps evaluate the performance of a recommendation system. It focuses on the relative order of items for each user rather than absolute scores.

Let $U$ be the set of all users, $V$ be the set of all items, $E$ be the set of observed user-item interactions, and $E_{\text{neg}}$ be a set of negative edges.

For each user $u^{*}$, the rooted positive and negative edges are defined as $E(u^{*}) = \{(u^{*}, v) \in E\}$ and $E_{\text{neg}}(u^{*}) = \{(u^{*}, v) : v \in V, (u^{*}, v) \notin E\}$. The BPR loss for user $u^{*}$ is defined as

$$\mathcal{L}_{u^{*}} = \frac{1}{|E(u^{*})|\,|E_{\text{neg}}(u^{*})|}\sum_{(u^{*},v_{\text{pos}}) \in E(u^{*})}\;\sum_{(u^{*},v_{\text{neg}}) \in E_{\text{neg}}(u^{*})} -\log \sigma\left(f_{\theta}(u^{*},v_{\text{pos}}) - f_{\theta}(u^{*},v_{\text{neg}})\right)$$

The final BPR loss is the average over all users:

$$\mathcal{L}_{\text{BPR}} = \frac{1}{|U|}\sum_{u^{*} \in U}\mathcal{L}_{u^{*}}$$

where:

  • $\sigma$ is a Sigmoid Function
  • $\mathbf{e}_{u}$ and $\mathbf{e}_{v}$ are the embedding vectors of $u$ and $v$.
  • $f_{\theta}(u, v) = \mathbf{e}_{u}^{\intercal}\mathbf{e}_{v}$ is the score between $u$ and $v$.
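
A small numpy sketch of the per-user BPR loss with inner-product scores, assuming the user embedding and the positive/negative item embeddings are given as plain arrays (hypothetical inputs):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# BPR loss for one user (illustrative sketch): every observed (positive) item
# should score higher than every sampled negative item for that user.
def bpr_loss_for_user(e_user, pos_item_embs, neg_item_embs):
    losses = []
    for e_pos in pos_item_embs:
        for e_neg in neg_item_embs:
            diff = e_user @ e_pos - e_user @ e_neg      # score difference
            losses.append(-np.log(sigmoid(diff)))
    return float(np.mean(losses))
```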
Link to original

Neural Graph Collaborative Filtering

Definition

Neural graph collaborative filtering (NGCF) is a recommendation algorithm that uses GNN to enhance collaborative filtering.

Architecture

NGCF constructs a bipartite graph where users and items are nodes, and interactions form edges.

Embedding Propagation Layers

The embeddings of NGCF are updated through propagation:

$$\begin{aligned} \mathbf{e}_{u}^{(l+1)} &= \sigma\left( \sum_{i \in \mathcal{N}_{u} \cup \{u\}} \frac{1}{\sqrt{|\mathcal{N}_{u}||\mathcal{N}_{i}|}} (W_{1}\mathbf{e}_{i}^{(l)} + W_{2}(\mathbf{e}_{i}^{(l)} \odot \mathbf{e}_{u}^{(l)})) \right)\\ \mathbf{e}_{i}^{(l+1)} &= \sigma\left( \sum_{u \in \mathcal{N}_{i} \cup \{i\}} \frac{1}{\sqrt{|\mathcal{N}_{u}||\mathcal{N}_{i}|}} (W_{1}\mathbf{e}_{u}^{(l)} + W_{2}(\mathbf{e}_{i}^{(l)} \odot \mathbf{e}_{u}^{(l)})) \right) \end{aligned}$$

where

  • $\mathbf{e}_{u}^{(l)}$ is the embedding of user $u$ at the $l$-th layer
  • $\mathbf{e}_{i}^{(l)}$ is the embedding of item $i$ at the $l$-th layer
  • $\mathcal{N}_{u}$ is the set of items interacted with by user $u$
  • $W_{1}$ and $W_{2}$ are learnable weight matrices
  • $\odot$ denotes element-wise multiplication
  • $\sigma$ is a non-linear Activation Function

It can be written in matrix form:

$$E^{(l+1)} = \sigma\left( (\mathcal{L}+I) E^{(l)} W_{1}^{(l+1)} + \mathcal{L} (E^{(l)} \odot E^{(l)}) W_{2}^{(l+1)} \right)$$

where:

  • $\mathcal{L} = D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$ is the Laplacian Matrix for the user-item graph
  • $A = \begin{bmatrix} O&R\\R^{\intercal}&O \end{bmatrix}$ is the Adjacency Matrix, where $R$ is the $N \times M$ user-item interaction matrix
  • $D$ is the Degree Matrix
  • $W_{1}^{(l+1)}$ and $W_{2}^{(l+1)}$ are learnable weight matrices at layer $l+1$

Model Prediction

The final user and item embeddings are constructed by concatenating the embeddings from all layers:

$$\begin{aligned}\mathbf{e}_{u}^{*} &= \operatorname{CONCAT}(\{\mathbf{e}_{u}^{(l)}|l=1,\dots,L\})\\ \mathbf{e}_{i}^{*} &= \operatorname{CONCAT}(\{\mathbf{e}_{i}^{(l)}|l=1,\dots,L\}) \end{aligned}$$

where $L$ is the total number of layers.

The score between $u$ and $i$ is calculated using the inner product:

$$\operatorname{score}_{\text{NGCF}}(u, i) = {\mathbf{e}_{u}^{*}}^{\intercal}\mathbf{e}_{i}^{*}$$

Link to original

LightGCN

Definition

LightGCN (Light Graph Convolutional Network) is a simplified version of NGCF.

Architecture

Light Graph Convolution

In light graph convolution (LGC), only the normalized sum of neighbor embeddings is propagated to the next layer; other operations such as self-connections, feature transformation, and non-linear activation are all removed.

The graph convolution operation in LightGCN is defined as

$$\begin{aligned} \mathbf{e}_{u}^{(l+1)} &= \sum_{i \in \mathcal{N}_{u}} \frac{1}{\sqrt{|\mathcal{N}_{u}||\mathcal{N}_{i}|}} \mathbf{e}_{i}^{(l)}\\ \mathbf{e}_{i}^{(l+1)} &= \sum_{u \in \mathcal{N}_{i}} \frac{1}{\sqrt{|\mathcal{N}_{u}||\mathcal{N}_{i}|}} \mathbf{e}_{u}^{(l)} \end{aligned}$$

where

  • $\mathbf{e}_{u}^{(l)}$ is the embedding of user $u$ at the $l$-th layer
  • $\mathbf{e}_{i}^{(l)}$ is the embedding of item $i$ at the $l$-th layer
  • $\mathcal{N}_{u}$ is the set of items interacted with by user $u$

It can be written in matrix form:

$$E^{(l+1)} = \mathcal{L} E^{(l)}$$

where:

  • $\mathcal{L} = D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$ is the normalized adjacency matrix for the user-item graph
  • $A = \begin{bmatrix} O&R\\R^{\intercal}&O \end{bmatrix}$ is the Adjacency Matrix, where $R$ is the $N \times M$ user-item interaction matrix
  • $D$ is the Degree Matrix

The only learnable parameters are the embeddings at the $0$-th layer.

Model Prediction

The final user and item embeddings are constructed as a weighted sum of the embeddings in each layer:

$$E_{\text{final}} = \alpha_{0}E^{(0)} + \alpha_{1}E^{(1)} + \dots + \alpha_{K}E^{(K)}$$

where $\alpha_{k}$ is a hyperparameter (set uniformly to $\alpha_{k} = \frac{1}{K+1}$ in the paper).

The score between $u$ and $i$ is calculated using the inner product:

$$\operatorname{score}_{\text{LGC}}(u, i) = {\mathbf{e}_{u}^{*}}^{\intercal}\mathbf{e}_{i}^{*}$$

Link to original
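
A minimal numpy sketch of the LightGCN propagation in matrix form, assuming a dense user-item interaction matrix and uniform layer weights (all hypothetical inputs):

```python
import numpy as np

# LightGCN propagation (illustrative sketch). R: (num_users x num_items)
# interaction matrix; E0 stacks user and item embeddings row-wise and is
# the only learnable parameter of the model.
def lightgcn_final_embeddings(R, E0, num_layers=3):
    n_u, n_i = R.shape
    A = np.block([[np.zeros((n_u, n_u)), R],
                  [R.T, np.zeros((n_i, n_i))]])
    deg = np.maximum(A.sum(axis=1), 1.0)
    L = A / np.sqrt(np.outer(deg, deg))              # D^{-1/2} A D^{-1/2}
    layers, E = [E0], E0
    for _ in range(num_layers):
        E = L @ E                                    # light graph convolution
        layers.append(E)
    return np.mean(layers, axis=0)                   # uniform alpha_k = 1/(K+1)
```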

PinSAGE

Definition

PinSAGE (Pinterest SAGE) is a Graph Convolutional Network model for large-scale recommendation systems.

Architecture

PinSAGE is based on the GraphSAGE architecture.

Importance-Based Neighbors

We simulate random walks starting from node $u$ and count the visits to each node along the walks. The neighborhood of $u$, $\mathcal{N}(u)$, is then defined as the top-$K$ nodes with the highest visit counts with respect to node $u$.

Importance Pooling

PinSAGE assigns different weights (normalized visit counts) to different neighbors when aggregating information from a node’s neighborhood. This allows the model to focus on more relevant or influential neighbors.

Curriculum Training

PinSAGE implements a curriculum training strategy, starting with easier examples and gradually moving to more difficult ones. The difficulty of negative samples is based on their visit counts.

Link to original

Deep Generative Models for Graphs

GraphRNN

Definition

Graph recurrent neural network (GraphRNN) is a graph generation model. It generates graphs in an autoregressive manner, treating graph generation as a sequential process: it generates nodes one at a time, and for each new node it decides its connections to the previously generated nodes. The model consists of two main components: a node-level RNN and an edge-level RNN.

Architecture

A graph $G$ with node ordering $\pi$ can be uniquely mapped into a sequence of node and edge additions $S^{\pi}$. The sequence has two levels: node-level and edge-level. The node-level sequence adds one node at a time, and each node-level step is itself an edge-level sequence that adds edges between the new node and the previously generated nodes. Formally, $S^{\pi} = (S_{1}^{\pi}, \dots, S_{n}^{\pi})$, where $S^{\pi}$ is the entire graph sequence and $S_{i}^{\pi}$ is the $i$-th node-level step (an edge-level sequence), and $S_{i}^{\pi} = (S_{i,1}^{\pi}, \dots, S_{i,i-1}^{\pi})$, where $S_{i,j}^{\pi}$ is the $j$-th edge-level step in the $i$-th node-level step.

The sequence is created from the Adjacency Matrix of a Graph.

Node-Level RNN

At each step, node-level RNN outputs a hidden state that summarizes the graph generated so far. The hidden state is used to initialize the edge-level RNN.

Edge-Level RNN

The edge-level RNN sequentially predicts whether the new node will connect to each of the previously generated nodes.

Link to original

More Expressive GNNs

Position-Aware GNN

Definition

Position-aware GNN (P-GNN) is designed to compute position-aware Node Embedding.

Architecture

An effective Node Embedding should be able to distinguish two nodes that occupy different positions in the graph. However, a standard GNN cannot classify two such nodes into different classes based on the network structure alone when the nodes are symmetric/isomorphic in the graph, because their GNN rooted subtrees used for message aggregation are identical.

Anchor-Set

By using the distance of a given target node to each anchor-set as an augmented node feature, P-GNN can handle position-aware tasks, where the anchor-sets are randomly constructed.

P-GNN first samples sets of anchor nodes, computes the distance of a given target node to each anchor-set, and then learns a permutation-invariant aggregation over the anchor-sets. The model can capture positions/locations of nodes with respect to the anchor nodes.

Link to original

Identity-Aware GNN

Definition

Identity-aware GNN (ID-GNN) is designed to compute structure-aware Node Embedding.

Architecture

Across all example tasks, a traditional GNN will always assign the same embedding to the nodes, edges, and graphs in question, because their computational graphs are identical. In contrast, the colored computational graphs produced by ID-GNN allow for clear differentiation between nodes of label A and label B, as the colored computational graphs are no longer identical across the tasks.

Heterogeneous Message Passing

ID-GNN applies inductive node coloring and uses different message/aggregation functions for nodes with different colorings.

To embed a node $v$, extract the $v$-centered local network (ego network) and assign a unique coloring to the central node of the network. The message passing and aggregation functions differ depending on the node's coloring.

ID-GNN-Fast

The ID-GNN model can be simplified by adding the cycle counts at each hop as augmented node features, without using heterogeneous message passing.

Link to original

Graph Transformers

Transformer-Based GNN

Definition

Transformer-based GNNs address some of the limitations of traditional GNNs (such as cycle counting) while leveraging the Transformer architecture to process Graph-structured data.

Architecture

Node Embedding

Each node is embedded into a vector using their features.

Positional Encoding

Unlike in a standard Transformer, where positional encodings represent sequential order, in the graph setting these encodings are used to capture the structural information of the graph. They are constructed from the Adjacency Matrix of the graph.

There exist various approaches to constructing positional encoding vectors.

Relative Distance

This method directly uses the idea of Position-Aware GNN: the distances from a given target node to each anchor-set are used as its positional encoding.

Laplacian Eigenvectors

Calculate the eigenvector matrix of the Laplacian Matrix, and use each row of the eigenvector matrix as a positional encoding of each node.

The signs of the eigenvectors are arbitrary and may change the model's prediction. SignNet uses a neural network to obtain a sign-invariant positional encoding to prevent this problem:

$$\operatorname{PE}_{i} = \rho\left(\phi(v_{i}) + \phi(-v_{i})\right)$$

where $v_{i}$ is the $i$-th Eigenvector of the Laplacian Matrix, and $\phi$ and $\rho$ are neural networks (MLP, GNN, etc.).

Self Attention

The edge features are used for adjusting the attention weights of the nodes.

If there is an edge between nodes $v_{i}$ and $v_{j}$, its features are linearly transformed into a scalar. If there is no edge, find the shortest edge path between $v_{i}$ and $v_{j}$ and average the transformed features of the edges along that path. The resulting value is then added to the corresponding attention weight, where the linear transformation weights are learnable parameters.

Link to original

GNNs for Large Graphs

Neighbor Sampling

Definition

Neighbor sampling is used to efficiently compute node embeddings in large graphs, where considering the entire neighborhood of a node becomes computationally expensive. Instead of using the entire $K$-hop neighborhood of a node to compute its embedding, we sample a subset of neighbors at each hop. It is used in the GraphSAGE model.

Constructing the exact computational graph for each node is computationally heavy: the computational graph grows exponentially with the number of layers $K$ and explodes when it hits a hub (high-degree) node. For this reason, neighbor sampling randomly samples at most $H$ neighbors at each hop.

Algorithm

  1. Randomly sample a mini-batch of root nodes, which is much smaller than the total number of nodes in the graph.
  2. For each sampled root node $v$, construct the computational graph by randomly sampling at most $H$ neighbors at each hop (see the sketch below).
  3. Update the embeddings of the root nodes.
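
A small sketch of building a sampled computational graph for one root node, assuming an adjacency-list dict and hypothetical hop/sample sizes; the returned levels would then drive the layer-wise aggregation:

```python
import random

# Sampled computational graph for one root node (illustrative sketch):
# sample at most `h` neighbors per hop, for `k` hops.
def sample_computation_graph(graph, root, k=2, h=5, seed=0):
    rng = random.Random(seed)
    levels = [[root]]
    for _ in range(k):
        frontier = []
        for node in levels[-1]:
            nbrs = list(graph[node])
            frontier.extend(rng.sample(nbrs, min(h, len(nbrs))))
        levels.append(frontier)
    return levels
```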

Limitations

  • A smaller $H$ leads to more efficient neighborhood aggregation, but the results are less stable due to the larger variance in neighbor aggregation.
  • Even with neighbor sampling, the size of the computational graph is still exponential with respect to the number of GNN layers.
  • Random sampling is fast, but it may sample unimportant nodes.
Link to original

Cluster-GCN

Definition

Cluster-GCN is an algorithm for training GCN on large-scale graphs. The main idea of Cluster-GCN is to address the computational redundancy occurring in Neighbor Sampling, when nodes in a mini-batch share many neighbors.

Instead of sampling individual nodes or neighbors, Cluster-GCN samples small subgraphs from the large graph. It then performs efficient layer-wise node embedding updates over these subgraphs, where the subgraphs should retain edge connectivity structure of the original graph as much as possible.

If only a single group is used per mini-batch, the induced subgraph removes the between-group links, and because the community detection algorithm puts similar nodes together in the same group, the sampled nodes are not diverse enough to represent the entire graph structure. For this reason, Cluster-GCN aggregates multiple node groups per mini-batch.

Algorithm

  1. The given graph is partitioned into groups of nodes (subgraphs) using a community detection algorithm such as Louvain or METIS.
  2. For each mini-batch, randomly sample multiple node groups and construct the node-induced subgraph of the aggregated node groups.
  3. Update the embeddings of the nodes of the subgraph.
Link to original

Simple Graph Convolution

Definition

Simple graph convolution (SGC) is a simplified variant of GCN. It simplifies GCN without losing much performance by removing the non-linear Activation Function. Due to the simplicity, the SGC can be more computationally efficient, especially for larger graphs. The simplification strategy is very similar to the one used by LightGCN for recommender systems.

The $K$-hop simple graph convolution for a classification task can be written as

$$\hat{Y} = \operatorname{softmax}\left(\tilde{A}^{K}XW\right)$$

where:

  • $\tilde{A} = D^{-\frac{1}{2}}AD^{-\frac{1}{2}}$ is the normalized adjacency matrix of the graph
  • $A$ is the Adjacency Matrix of the graph.
  • $D$ is the Degree Matrix of the graph.
  • $X$ is the fixed input feature matrix of the nodes.
  • $W$ is the only learnable weight matrix.

SGC doesn’t require building a computational graph or sampling a subgraph, so it can be applied to large graphs.
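
A minimal numpy sketch of $K$-hop SGC prediction, assuming a dense adjacency matrix; self-loops are added here before normalization (an assumption for numerical stability, not stated above):

```python
import numpy as np

def softmax(X):
    Z = np.exp(X - X.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

# K-hop simple graph convolution (illustrative sketch): apply the normalized
# adjacency matrix K times with no non-linearity, then one linear classifier.
def sgc_predict(A, X, W, K=2):
    A_tilde = A + np.eye(A.shape[0])                 # self-loops (assumption)
    deg = A_tilde.sum(axis=1)
    S = A_tilde / np.sqrt(np.outer(deg, deg))        # normalized adjacency
    H = X
    for _ in range(K):
        H = S @ H                                    # no activation between hops
    return softmax(H @ W)
```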

Link to original

In-Context Learning Over Graphs

In-Context Learning

Definition

In-context learning (ICL) is a method that utilizes a pre-trained model to solve new tasks without fine-tuning by showing the model a few examples of the desired task, allowing it to infer the pattern and apply it to new instances. Unlike traditional machine learning, ICL doesn’t involve updating the model’s parameters; the model uses its existing knowledge to interpret and apply the new information.

Link to original

PRODIGY

Definition

Pretraining Over Diverse In-Context Graph Systems (PRODIGY) model is a pretraining framework that enables In-Context Learning over graphs. The key idea of PRODIGY is to formulate in-context learning over graphs with a novel prompt graph representation. It is similar to the few-shot prompting widely used for LLM.

Architecture

Prompt Graph

  1. Data graphs of the input nodes/edges/subgraphs are contextualized by an embedding model such as a GCN or GAT. For a node classification problem, the embedding of the root node of the data graph is used as the data node embedding. For a link classification problem, the data node embedding is calculated from the embeddings of the edge's endpoint nodes using learnable parameters.
    • Link Prediction
    • Node Classification
    • Graph Classification
  2. Task graph is constructed using the contextualized data graphs and the labels. The edges between the data and label node groups are fully connected.
  3. The task graph is fed into the another GAT to obtain updated representation of data nodes and label nodes.
  4. The classification is performed by the cosine similarity of the embedded nodes of the target graph.

Pretraining

The PRODIGY model is trained with the neighbor-matching task and the multi-task (edge/node/subgraph classification) objective.

Neighbor Matching

We sample multiple subgraphs from the pretraining graph as local neighborhoods, and a node belongs to a local neighborhood if it is in the sampled subgraph. The sampled subgraphs are used as the prompt/query data graphs.

Multitask

In the pretraining stage, each data graph of the prompts and queries is constructed by sampling $k$-hop neighborhoods of randomly sampled nodes.

Link to original