Which AI for structured data?
Graph Neural Networks
Romain Raveaux
Associate Professor (Maître de conférences)
Université de Tours
Computer Science Laboratory (LIFAT)
RFAI team
Outline
• Graph Neural Network
  • Encoder
  • Decoder
• The model
  • Matrix formalization
  • Message passing formalization
• The losses
Previous episode
• Introduction and Graph-based problems
  • http://romain.raveaux.free.fr/document/cours%20IA%20DI5%20graphs%20introV2.pdf
Graph Neural Network (GNN)
Taxonomy: Graph/node embedding
Explicit embedding
Through feature extraction
End-to-end learning: this is where GNNs belong
Implicit embedding
Graph space
Graph Neural Network
• Input: A graph
• Output: Node embeddings
• Assumptions: stationarity and compositionality
• The goal:
  • Graph Neural Networks (GNN) perform end-to-end learning, including feature extraction and classification.
Graph Neural Networks as an encoder
Output: a node embedding
Two Key Components
• Encoder maps each node to a low-dimensional vector:
  $\mathrm{ENC}(v) = z_v$, where $v$ is a node in the input graph and $z_v$ is its d-dimensional embedding.
• Similarity function specifies how relationships in vector space map to relationships in the original network:
  $\mathrm{similarity}(u, v) \approx z_v^{\top} z_u$, i.e. the similarity of u and v in the original network is approximated by the dot product between node embeddings.
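The two components can be sketched in a few lines, assuming the simplest possible encoder (a lookup table of embeddings) and a dot-product similarity. The names `Z`, `encode` and `similarity` and the toy sizes are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

num_nodes, d = 5, 3                      # toy sizes (assumed)
Z = rng.normal(size=(num_nodes, d))      # one d-dimensional row per node

def encode(v):
    """Encoder: map node index v to its d-dimensional embedding."""
    return Z[v]

def similarity(u, v):
    """Similarity in embedding space: dot product between the two embeddings."""
    return float(encode(u) @ encode(v))

print(similarity(0, 1))
```

A GNN replaces the lookup table with a function of the node features and the graph structure, but the similarity side can stay the same.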
What is the goal of the encoder?
• Optimize the encoder such that similar nodes in the original network get similar embeddings.
• Should two nodes have similar embeddings if they:
  1. are connected?
  2. share neighbors?
  3. have a high Pr(v|u), the estimated probability of visiting v from u?
Decoder:
10
Basics of “conventional” neural networks
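A Linear model
• $X \in \mathbb{R}^{m \times 1}$ : input
• $W \in \mathbb{R}$
• $B \in \mathbb{R}$
• $Y \in \mathbb{R}^{m \times 1}$ : output (prediction)
• $Y = XW + B$
Example with $m = 3$:
$$X = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}, \quad W = [w_1], \quad XW = \begin{bmatrix} x_1 w_1 \\ x_2 w_1 \\ x_3 w_1 \end{bmatrix}$$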
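A Linear model : for multi-dimensional input
• $X \in \mathbb{R}^{m \times d}$ : input
• $W \in \mathbb{R}^{d \times 1}$
• $B \in \mathbb{R}$
• $Y \in \mathbb{R}^{m \times 1}$ : output (prediction)
• $Y = XW + B$
Example with $m = 3$ and $d = 2$:
$$X = \begin{bmatrix} x_{1a} & x_{1b} \\ x_{2a} & x_{2b} \\ x_{3a} & x_{3b} \end{bmatrix}, \quad W = \begin{bmatrix} w_a \\ w_b \end{bmatrix}, \quad XW = \begin{bmatrix} x_{1a} w_a + x_{1b} w_b \\ x_{2a} w_a + x_{2b} w_b \\ x_{3a} w_a + x_{3b} w_b \end{bmatrix}$$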
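A Linear model : for multi-dimensional input and output
• $X \in \mathbb{R}^{m \times d}$ : input
• $W \in \mathbb{R}^{d \times n}$
• $B \in \mathbb{R}^{n}$
• $Y \in \mathbb{R}^{m \times n}$ : output (prediction)
• $Y = XW + B$
Example with $m = 3$, $d = 2$ and $n = 2$:
$$X = \begin{bmatrix} x_{1a} & x_{1b} \\ x_{2a} & x_{2b} \\ x_{3a} & x_{3b} \end{bmatrix}, \quad W = \begin{bmatrix} w_{1a} & w_{1b} \\ w_{2a} & w_{2b} \end{bmatrix}$$
$$XW = \begin{bmatrix} x_{1a} w_{1a} + x_{1b} w_{2a} & x_{1a} w_{1b} + x_{1b} w_{2b} \\ x_{2a} w_{1a} + x_{2b} w_{2a} & x_{2a} w_{1b} + x_{2b} w_{2b} \\ x_{3a} w_{1a} + x_{3b} w_{2a} & x_{3a} w_{1b} + x_{3b} w_{2b} \end{bmatrix}$$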
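The shapes of the linear model can be checked numerically. This is a small NumPy sketch with made-up values for X, W and B (the numbers are assumptions chosen only to make the arithmetic easy to follow).

```python
import numpy as np

m, d, n = 3, 2, 2                        # toy sizes matching the 3x2 example
X = np.arange(m * d, dtype=float).reshape(m, d)   # input, shape (m, d)
W = np.array([[1.0, 0.5],
              [2.0, -1.0]])              # weights, shape (d, n)
B = np.array([0.1, 0.2])                 # bias, shape (n,), broadcast over the rows

Y = X @ W + B                            # prediction, shape (m, n)

# Entry (i, j) is the dot product of row i of X with column j of W, plus B[j].
print(Y.shape)
```

Each row of X is processed independently with the same W and B, which is why the number of parameters does not depend on m.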
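A non-linear model : for multi-dimensional input/output
• $X \in \mathbb{R}^{m \times d}$ : input
• $W \in \mathbb{R}^{d \times n}$
• $B \in \mathbb{R}^{n}$
• $R(a)$, $\sigma(a)$ : a non-linear function
• $Y \in \mathbb{R}^{m \times n}$ : output (prediction)
• $Y = \sigma(XW + B)$
Image credit: Julien Mille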
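Can we stack the model? Endomorphism property
• $X \in \mathbb{R}^{m \times d}$ : input
• $W^{(1)} \in \mathbb{R}^{d \times n}$
• $B^{(1)} \in \mathbb{R}^{n}$
• $Y^{(1)} \in \mathbb{R}^{m \times n}$ : intermediate results
• $W^{(2)} \in \mathbb{R}^{n \times o}$
• $B^{(2)} \in \mathbb{R}^{o}$
• $Y^{(2)} \in \mathbb{R}^{m \times o}$ : output (prediction)
• $Y^{(1)} = \sigma(XW^{(1)} + B^{(1)})$
• $Y^{(2)} = \sigma(Y^{(1)}W^{(2)} + B^{(2)})$
• $Y^{(2)} = \sigma(\sigma(XW^{(1)} + B^{(1)})W^{(2)} + B^{(2)})$
• $Y^{(l)} = f(Y^{(l-1)}, W^{(l)}, B^{(l)}) = f(Y^{(l-1)}, \theta^{(l)})$
• $Y^{(l)} = f(f(X, \theta^{(l-1)}), \theta^{(l)})$ : composition of functions
• $F(X, \Theta) = f_{\theta^{(l)}} \circ f_{\theta^{(l-1)}}$ : composition of functions
Multi Layer Perceptron (MLP): a sequence of linear operations and non-linear operations
Why these non-linear operations?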
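One way to see why the non-linear operations matter: without them, two stacked layers collapse into a single linear model. The sketch below assumes ReLU as the non-linearity and random toy parameters; the helper names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n, o = 4, 3, 5, 2
X = rng.normal(size=(m, d))
W1, B1 = rng.normal(size=(d, n)), rng.normal(size=n)
W2, B2 = rng.normal(size=(n, o)), rng.normal(size=o)

def relu(a):
    return np.maximum(a, 0.0)

def layer(Y_prev, W, B, sigma):
    """One layer f(Y_prev, theta) with theta = (W, B)."""
    return sigma(Y_prev @ W + B)

# Two stacked non-linear layers: Y2 = sigma(sigma(X W1 + B1) W2 + B2).
Y2 = layer(layer(X, W1, B1, relu), W2, B2, relu)

# Without sigma, the two layers are equivalent to one linear model.
identity = lambda a: a
Y2_linear = layer(layer(X, W1, B1, identity), W2, B2, identity)
collapsed = X @ (W1 @ W2) + (B1 @ W2 + B2)
print(np.allclose(Y2_linear, collapsed))
```

Stacking therefore only adds expressive power when a non-linearity sits between the linear maps.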
The basics of artificial neural networks
The basics of graph neural networks
Adjacency matrix
Degree matrix
Exponentiation of the adjacency matrix
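The three matrix notions can be illustrated on a toy graph. This is a sketch assuming an undirected, unweighted path graph with four nodes; the graph itself is an assumption made for the example.

```python
import numpy as np

# Undirected path graph 0 - 1 - 2 - 3 (toy example, assumed).
edges = [(0, 1), (1, 2), (2, 3)]
num_nodes = 4

A = np.zeros((num_nodes, num_nodes), dtype=int)
for u, v in edges:
    A[u, v] = 1
    A[v, u] = 1                     # symmetric because the graph is undirected

D = np.diag(A.sum(axis=1))          # degree matrix: node degrees on the diagonal

A2 = np.linalg.matrix_power(A, 2)   # A2[i, j] = number of walks of length 2 from i to j
A3 = np.linalg.matrix_power(A, 3)   # A3[i, j] = number of walks of length 3 from i to j
print(A2)
```

Entry (i, j) of the k-th power of A counts the walks of length k between i and j, which is why powers of A describe k-hop neighborhoods.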
Key idea and Intuition [Kipf and Welling, 2016]
• The key idea is to generate node embeddings based on local neighborhoods.
• The intuition is to aggregate node information from their neighbors using neural networks.
• Nodes have embeddings at each layer, and the neural network can be of arbitrary depth. The “layer-0” embedding of node u is its input feature, i.e. $F_u$.
GNN: a pictorial model
23
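Topology and Features
• $A \in \mathbb{R}^{|V| \times |V|}$ : Adjacency matrix (topology/structure)
• $H \in \mathbb{R}^{|V| \times d}$ : d-dimensional feature vector for each node
• $AH \in \mathbb{R}^{|V| \times d}$ : Feature aggregation
Example with $|V| = 2$ and $d = 2$:
$$A = \begin{bmatrix} a_{1a} & a_{1b} \\ a_{2a} & a_{2b} \end{bmatrix}, \quad H = \begin{bmatrix} h_{1a} & h_{1b} \\ h_{2a} & h_{2b} \end{bmatrix}$$
$$AH = \begin{bmatrix} a_{1a} h_{1a} + a_{1b} h_{2a} & a_{1a} h_{1b} + a_{1b} h_{2b} \\ a_{2a} h_{1a} + a_{2b} h_{2a} & a_{2a} h_{1b} + a_{2b} h_{2b} \end{bmatrix}$$
Simple example of f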
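Topology and Features
• $AH \in \mathbb{R}^{|V| \times d}$ : Feature aggregation
• $W \in \mathbb{R}^{d \times l}$ : Parameter matrix that does not depend on |V|
Example with $|V| = 2$, $d = 2$ and $l = 3$:
$$AH = \begin{bmatrix} ah_{1a} & ah_{1b} \\ ah_{2a} & ah_{2b} \end{bmatrix}, \quad W = \begin{bmatrix} W_{1a} & W_{1b} & W_{1c} \\ W_{2a} & W_{2b} & W_{2c} \end{bmatrix}$$
$$AHW = \begin{bmatrix} ah_{1a} W_{1a} + ah_{1b} W_{2a} & ah_{1a} W_{1b} + ah_{1b} W_{2b} & ah_{1a} W_{1c} + ah_{1b} W_{2c} \\ ah_{2a} W_{1a} + ah_{2b} W_{2a} & ah_{2a} W_{1b} + ah_{2b} W_{2b} & ah_{2a} W_{1c} + ah_{2b} W_{2c} \end{bmatrix}$$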
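Because W has shape d × l, the same parameters can be applied to graphs with different numbers of nodes. The sketch below assumes the simple propagation rule f(H, A) = σ(AHW) with ReLU as σ; the function name `propagate` and the toy graphs are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, l = 2, 3
W = rng.normal(size=(d, l))          # shared parameters, shape (d, l)

def propagate(A, H, W):
    """Simple rule f(H, A) = ReLU(A H W): aggregate neighbor features, then transform."""
    return np.maximum(A @ H @ W, 0.0)

# Graph with 2 nodes connected by one edge.
A_small = np.array([[0.0, 1.0],
                    [1.0, 0.0]])
H_small = np.array([[1.0, 2.0],
                    [3.0, 4.0]])

# Triangle graph with 3 nodes: the same W is reused unchanged.
A_tri = np.ones((3, 3)) - np.eye(3)
H_tri = rng.normal(size=(3, d))

out_small = propagate(A_small, H_small, W)   # shape (2, l)
out_tri = propagate(A_tri, H_tri, W)         # shape (3, l)
print(out_small.shape, out_tri.shape)
```

In the two-node graph, A @ H simply swaps the two feature rows, since each node has the other as its only neighbor.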
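The Math
• Basic approach: Average neighbor messages and apply a neural network.
• Initial “layer 0” embeddings are equal to node features: $h_v^{0} = F_v$
• For $k > 0$: $h_v^{k} = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k h_v^{k-1} \right)$
  • $h_v^{k}$ : kth layer embedding of v
  • $\sigma$ : non-linearity (e.g., ReLU or tanh)
  • $\sum_{u \in N(v)} h_u^{k-1} / |N(v)|$ : average of the neighbors’ previous layer embeddings
  • $h_v^{k-1}$ : previous layer embedding of v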
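The basic approach (average the neighbors' previous-layer embeddings, add a transformed copy of the node's own previous embedding, apply a non-linearity) can be sketched for all nodes at once. The two trainable matrices `W` and `B`, the ReLU choice, the layer sizes and the star graph are assumptions made for illustration.

```python
import numpy as np

def mean_aggregation_layer(A, H_prev, W, B):
    """ReLU(mean of neighbor embeddings @ W + own embedding @ B), for all nodes at once.

    Rows of H_prev are node embeddings, so W and B are applied on the right.
    """
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1.0)   # avoid division by zero
    neighbor_mean = (A @ H_prev) / deg                     # average of neighbor embeddings
    return np.maximum(neighbor_mean @ W + H_prev @ B, 0.0)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)          # star graph centered on node 0
F = rng.normal(size=(3, 4))                     # input node features
H = F                                           # layer-0 embeddings are the features
dims = [4, 8, 2]                                # embedding sizes per layer (assumed)
for d_in, d_out in zip(dims[:-1], dims[1:]):
    W = rng.normal(size=(d_in, d_out))
    B = rng.normal(size=(d_in, d_out))
    H = mean_aggregation_layer(A, H, W, B)
Z = H                                           # output embeddings after K = 2 layers
print(Z.shape)
```

Here the parameters are random; in practice they would be trained by feeding Z into a loss function.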
Training the Model
• After K layers of neighborhood aggregation, we get output embeddings for each node.
• We can feed these embeddings into any loss function and run stochastic gradient descent to train the aggregation parameters.
• The trainable matrices are what we learn.
Inductive Capability
• Inductive node embeddings generalize to entirely unseen graphs: train on one graph, generalize to a new graph.
• e.g., train on a protein interaction graph from model organism A and generate embeddings on newly collected data about organism B.
Inductive Capability
• Many application settings constantly encounter previously unseen nodes, e.g., Reddit, YouTube, Google Scholar, ….
• Need to generate new embeddings “on the fly”: train with a snapshot, a new node arrives, generate an embedding for the new node.
Representation Learning on Networks, snap.stanford.edu/proj/embeddings-www, WWW 2018
2 issues of this simple example
• Issue 1:
  • For every node, f sums up the feature vectors of all neighboring nodes but not the node itself.
  • Fix: simply add the identity matrix to A, i.e. $\hat{A} = A + I$.
• Issue 2:
  • A is typically not normalized, so the multiplication with A completely changes the scale of the feature vectors.
  • Fix: normalize A such that all rows sum to one: $D^{-1}A$, where D is the diagonal degree matrix.
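Both fixes are one-liners on a dense adjacency matrix. The sketch below uses a toy star graph and made-up features (assumptions) to show that, after adding self-loops and row-normalizing, each node receives the mean over itself and its neighbors.

```python
import numpy as np

A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [4.0, 4.0]])

# Fix 1: add self-loops so each node also sees its own features.
A_hat = A + np.eye(A.shape[0])

# Fix 2: row-normalize so that every row sums to one (D^-1 A_hat).
D_hat_inv = np.diag(1.0 / A_hat.sum(axis=1))
A_norm = D_hat_inv @ A_hat

aggregated = A_norm @ H     # each row is the mean over the node and its neighbors
print(aggregated)
```

Without the normalization, high-degree nodes would accumulate much larger feature values than low-degree ones.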
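Altogether: [Kipf and Welling, 2016]
• The two patches mentioned before +
• A better (symmetric) normalization of the adjacency matrix:
  $f(H, A) = \sigma\left( \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} H W \right)$, with $\hat{A} = A + I$ and $\hat{D}$ the diagonal degree matrix of $\hat{A}$.
O(|E|) time complexity overall.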
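A minimal dense sketch of a layer with self-loops and symmetric normalization, assuming ReLU as the non-linearity. The helper names `normalize_adjacency` and `gcn_layer` are illustrative and not from any library.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D^-1/2 (A + I) D^-1/2, with D the degree matrix of A + I."""
    A_hat = A + np.eye(A.shape[0])                   # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))    # degrees >= 1 thanks to self-loops
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A, H, W):
    """One layer: ReLU(A_sym H W)."""
    return np.maximum(normalize_adjacency(A) @ H @ W, 0.0)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))

A_sym = normalize_adjacency(A)
H_next = gcn_layer(A, H, W)
print(H_next.shape)
```

The dense matrices are used here for readability only; the O(|E|) cost relies on storing A as a sparse matrix.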
Graph Convolutional Networks
• Kipf et al.’s Graph Convolutional Networks (GCNs) are a slight variation on the neighborhood aggregation idea:
  $h_v^{k} = \sigma\left( W_k \sum_{u \in N(v) \cup \{v\}} \frac{h_u^{k-1}}{\sqrt{|N(u)||N(v)|}} \right)$
  • The same matrix is used for self and neighbor embeddings.
  • The normalization is per neighbor.
Graph Convolutional Networks
• Basic Neighborhood Aggregation vs. GCN Neighborhood Aggregation:
  • GCN uses the same transformation matrix for self and neighbor embeddings.
  • Instead of a simple average, the normalization varies across neighbors.
Graph Convolutional Networks
• Empirically, they found this configuration to give the best results.
• More parameter sharing.
• Down-weights high degree neighbors.
From matrix computations to message passing algorithms
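The link between the two views can be checked on a toy graph: summing the messages received along each edge gives the same result as the matrix product AH. The edge list, features and function name below are assumptions made for the example.

```python
import numpy as np

def aggregate_by_message_passing(edges, H):
    """Each node sums the messages (embeddings) sent by its neighbors."""
    out = np.zeros_like(H)
    for u, v in edges:          # undirected edge: messages flow both ways
        out[v] += H[u]
        out[u] += H[v]
    return out

edges = [(0, 1), (0, 2), (2, 3)]
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0],
              [3.0, 1.0]])

# Matrix view of the same graph.
A = np.zeros((4, 4))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

messages = aggregate_by_message_passing(edges, H)
print(np.allclose(messages, A @ H))
```

The loop touches each edge once, which is the message passing counterpart of a sparse matrix product.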
More complex models [Nowak et al., 2017]
46
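GraphSAGE Idea
• So far we have aggregated the neighbor messages by taking their (weighted) average. Can we do better?
GraphSAGE Idea
• Simple neighborhood aggregation:
  $h_v^{k} = \sigma\left( W_k \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|} + B_k h_v^{k-1} \right)$
• GraphSAGE:
  $h_v^{k} = \sigma\left( \left[ A_k \cdot \mathrm{AGG}(\{h_u^{k-1}, \forall u \in N(v)\}),\; B_k h_v^{k-1} \right] \right)$
  • AGG: any differentiable function that maps a set of vectors to a single vector.
GraphSAGE Differences
• Generalized aggregation.
• Concatenate self embedding and neighbor embedding.
GraphSAGE Variants
• Mean: $\mathrm{AGG} = \sum_{u \in N(v)} \frac{h_u^{k-1}}{|N(v)|}$
• Pool: transform neighbor vectors and apply a symmetric vector function (element-wise mean/max): $\mathrm{AGG} = \gamma(\{Q h_u^{k-1}, \forall u \in N(v)\})$
• LSTM: apply an LSTM to a random permutation $\pi$ of the neighbors: $\mathrm{AGG} = \mathrm{LSTM}([h_u^{k-1}, \forall u \in \pi(N(v))])$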
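The mean and pool variants can be sketched with a pluggable aggregator and a concatenation of neighbor and self embeddings. This is an illustration under assumptions: the names `mean_agg`, `make_pool_agg`, `graphsage_layer`, the matrices `W_neigh`, `W_self`, `Q`, the ReLU choice and the toy adjacency list are all hypothetical, and the LSTM variant is left out.

```python
import numpy as np

def mean_agg(neigh):
    """Mean aggregator; neigh has shape (num_neighbors, d)."""
    return neigh.mean(axis=0)

def make_pool_agg(Q):
    """Pool aggregator: transform each neighbor with Q, then element-wise max."""
    return lambda neigh: (neigh @ Q).max(axis=0)

def graphsage_layer(adj_list, H, W_neigh, W_self, agg):
    """ReLU of the concatenation [agg(neighbors) W_neigh, h_v W_self] for every node v."""
    rows = []
    for v, neighbors in enumerate(adj_list):
        aggregated = agg(H[neighbors])             # set of vectors -> single vector
        rows.append(np.concatenate([aggregated @ W_neigh, H[v] @ W_self]))
    return np.maximum(np.array(rows), 0.0)

rng = np.random.default_rng(0)
adj_list = [[1, 2], [0], [0]]                      # every node has at least one neighbor
H = rng.normal(size=(3, 2))
W_neigh, W_self = rng.normal(size=(2, 3)), rng.normal(size=(2, 3))
Q = rng.normal(size=(2, 2))

out_mean = graphsage_layer(adj_list, H, W_neigh, W_self, mean_agg)
out_pool = graphsage_layer(adj_list, H, W_neigh, W_self, make_pool_agg(Q))
print(out_mean.shape, out_pool.shape)
```

Both aggregators give the same result whatever the order of the neighbors, which is the property expected from a function defined on a set.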
Graph Attention Network (GAT)
• The contributions of neighboring nodes to the central node are not identical, unlike in GraphSAGE, GCN, …
• Learn the relative weights between two connected nodes.
P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, “Graph attention networks,” in Proc. of ICLR, 2018
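A single-head sketch of learned neighbor weights, following the general recipe of the cited paper (shared linear transform, a scoring vector applied to the concatenated pair, LeakyReLU, softmax over the neighborhood). The function names, the dense mask and the random parameters are assumptions for illustration.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def attention_weights(A, H, W, a):
    """alpha[i, j]: weight of neighbor j for node i; each row sums to one."""
    HW = H @ W                                   # transformed features, shape (|V|, d_out)
    d_out = HW.shape[1]
    # a^T [HW_i || HW_j] splits into a part for node i and a part for node j.
    scores = leaky_relu((HW @ a[:d_out])[:, None] + (HW @ a[d_out:])[None, :])
    mask = (A + np.eye(A.shape[0])) > 0          # attend to neighbors and to the node itself
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]], dtype=float)
H = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))
a = rng.normal(size=4)                           # scoring vector of size 2 * d_out

alpha = attention_weights(A, H, W, a)
H_next = alpha @ (H @ W)                         # weighted sum of transformed neighbors
print(alpha)
```

Unlike a fixed average, the weights in `alpha` depend on the features of both endpoints and are learned together with W.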
GNN as an encoder
52
Applications and losses
• Let us recall that Z is the output of the GNN.
Unsupervised learning
• The graph factorization problem is the problem of predicting whether two nodes are linked or not.
• The prediction relies on a similarity measure between two node embeddings.
• Variations of graph factorization:
  • DeepWalk [Perozzi et al., 2014]
  • node2vec [Grover and Leskovec, 2016]
Unsupervised embedding: Graph factorization
• Use stochastic gradient descent (SGD) as a general optimization method.
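A minimal sketch of graph factorization, assuming the dot product as similarity and the squared Frobenius norm of Z Zᵀ − A as loss. For brevity it uses plain full-batch gradient descent instead of SGD, and it optimizes free embeddings directly rather than the parameters of a GNN; the graph, learning rate and step count are assumptions.

```python
import numpy as np

def factorization_loss(Z, A):
    """Squared Frobenius norm of Z Z^T - A."""
    return float(np.sum((Z @ Z.T - A) ** 2))

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Z = 0.1 * rng.normal(size=(4, 2))      # small random initial embeddings

loss_before = factorization_loss(Z, A)
lr = 0.01
for _ in range(500):
    residual = Z @ Z.T - A
    grad = 4.0 * residual @ Z          # gradient of the loss when A is symmetric
    Z -= lr * grad
loss_after = factorization_loss(Z, A)
print(loss_before, loss_after)
```

After training, the dot product of two rows of Z serves as a link score for the corresponding pair of nodes.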
Supervised learning
• Node classification: social network
Supervised learning
• Node classification:
Node-wise classification
Supervised learning: Node classification
• Last layer:
  • For the last layer, the activation function is a softmax (a Gibbs distribution over the classes).
• The loss:
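A sketch of the last layer and a loss for node classification, assuming the cross entropy as the loss (the loss named for graph classification in these slides) and evaluating it on labeled nodes only. The scores, labels and helper names are made up for the example.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax: one probability distribution over classes per node."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels, labeled_idx):
    """Mean negative log-likelihood over the labeled nodes only."""
    picked = probs[labeled_idx, labels[labeled_idx]]
    return float(-np.log(picked).mean())

logits = np.array([[2.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0],
                   [1.0, 1.0, 1.0],
                   [0.0, 0.0, 4.0]])     # last-layer scores, one row per node
labels = np.array([0, 1, 2, 2])
labeled_idx = np.array([0, 1, 3])        # node 2 is treated as unlabeled here

Z = softmax(logits)
loss = cross_entropy(Z, labels, labeled_idx)
print(loss)
```

Restricting the loss to `labeled_idx` is also how the same code would be used when only a subset of the nodes carries labels.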
Supervised learning:
• Graph classification
  • A global average pooling layer must be added to gather all the node embeddings of a given graph.
  • The pooled embedding Z’ can be fed to an MLP for classification.
  • The loss is the cross entropy.
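The graph classification pipeline (pool the node embeddings, feed the pooled vector to an MLP, apply the cross entropy) can be sketched as follows. The one-hidden-layer MLP, its sizes and the random node embeddings are assumptions; in practice the embeddings would come from the GNN.

```python
import numpy as np

rng = np.random.default_rng(0)

def global_average_pool(Z):
    """Gather all node embeddings of one graph into a single vector."""
    return Z.mean(axis=0)

def mlp_classifier(z_graph, W1, b1, W2, b2):
    """One hidden ReLU layer followed by a softmax over the classes."""
    hidden = np.maximum(z_graph @ W1 + b1, 0.0)
    logits = hidden @ W2 + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()

d, hidden_dim, num_classes = 4, 8, 2
W1, b1 = rng.normal(size=(d, hidden_dim)), np.zeros(hidden_dim)
W2, b2 = rng.normal(size=(hidden_dim, num_classes)), np.zeros(num_classes)

Z_small = rng.normal(size=(3, d))         # node embeddings of a 3-node graph
Z_large = rng.normal(size=(10, d))        # node embeddings of a 10-node graph

p_small = mlp_classifier(global_average_pool(Z_small), W1, b1, W2, b2)
p_large = mlp_classifier(global_average_pool(Z_large), W1, b1, W2, b2)
loss_small = -np.log(p_small[1])          # cross entropy if the true class is 1
print(p_small, p_large)
```

Pooling makes the classifier independent of the number of nodes and of their ordering, so graphs of different sizes share the same MLP.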
Semi-Supervised learning:
• Node classification:
  • The problem of classifying nodes in a graph, where labels are only available for a small subset of nodes.
  • Label information is smoothed over the graph via some form of explicit graph-based regularization.
  • The assumption is that connected nodes in the graph are likely to share the same label (class).
  • This is true, for instance, for a neighborhood graph.
• So, $\ell_{reg} = \lVert Z Z^{\top} - A \rVert_F^2$ (squared Frobenius norm).
Some applications
• Node classification on citation networks.
  • The input is a citation network where nodes are papers, edges are citation links, and nodes optionally carry bag-of-words features. The target for each node is a paper category (e.g. stat.ML, cs.LG, ...).
Image/molecule classification
• Graph classification: an image is represented as a graph, based either on raw pixels (a regular grid, so all images have the same graph) or on superpixels (irregular graph).
Graph classification
• Molecular graph:
  1. cancerous or not cancerous molecules
  2. determination of the boiling point
Taken from M. Bronstein, CVPR Tutorial 2017
Summary on GNN
66
Recommended reading
• Tutorials:
  • Geometric Deep Learning, Tutorial, CVPR, 2017. http://geometricdeeplearning.com/
  • Deep Learning on Graphs with Graph Convolutional Networks. http://deeploria.gforge.inria.fr/thomasTalk.pdf
  • Graph-based Methods in Pattern Recognition & Document Image Analysis. http://gmprdia.univ-lr.fr/
• List of papers:
  • Gilmer et al., Neural Message Passing for Quantum Chemistry, 2017. https://arxiv.org/abs/1704.01212
  • Kipf et al., Semi-Supervised Classification with Graph Convolutional Networks, ICLR 2017. https://arxiv.org/abs/1609.02907
  • Defferrard et al., Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering, NIPS 2016. https://arxiv.org/abs/1606.09375
  • Bruna et al., Spectral Networks and Locally Connected Networks on Graphs, ICLR 2014. https://arxiv.org/abs/1312.6203
  • Duvenaud et al., Convolutional Networks on Graphs for Learning Molecular Fingerprints, NIPS 2015. https://arxiv.org/abs/1509.09292
  • Li et al., Gated Graph Sequence Neural Networks, ICLR 2016. https://arxiv.org/abs/1511.05493
  • Battaglia et al., Interaction Networks for Learning about Objects, Relations and Physics, NIPS 2016. https://arxiv.org/abs/1612.00222
  • Kearnes et al., Molecular Graph Convolutions: Moving Beyond Fingerprints, 2016. https://arxiv.org/abs/1603.00856