
Which AI for structured data?

Graph Neural Networks

Romain Raveaux

romain.raveaux@univ-tours.fr

Associate Professor (Maître de conférences)

Université de Tours

Laboratoire d'informatique (LIFAT)

RFAI team

1

Outline

• Graph Neural Networks
  • Encoder
  • Decoder

• The model
  • Matrix formalization
  • Message passing formalization

• The losses

2

Previous episode

• Introduction and graph-based problems: http://romain.raveaux.free.fr/document/cours%20IA%20DI5%20graphs%20introV2.pdf

3

Graph Neural Network (GNN)

Taxonomy: graph/node embedding

• Explicit embedding
  • Through feature extraction
  • End-to-end learning: this is where GNNs sit

• Implicit embedding
  • Graph space

4

Graph Neural Network

• Input: A graph

• Output: Node embeddings

• Assumptions: stationarity and compositionality

• The goal: Graph Neural Networks (GNNs) perform end-to-end learning, including feature extraction and classification.

5

Graph Neural Networks as an encoder

Output: a node embedding

6

Graph Neural Networks as an encoder

7

Two Key Components

8

• Encoder maps each node to a low-dimensional vector.

• Similarity function specifies how relationships in vector space map to relationships in the original network.

ENC(v) = z_v : a node v in the input graph is mapped to a d-dimensional embedding z_v.

similarity(u, v) ≈ z_u · z_v : the similarity of u and v in the original network is approximated by the dot product between the node embeddings.

What is the goal of the encoder?

• Optimize the encoder so that two nodes have similar embeddings if they…
  1. are connected?
  2. share neighbors?
  3. have a high Pr(v | u), the estimated probability of visiting v starting from u?

9

Decoder:

10
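The simplest decoder consistent with the similarity definition above is a dot product between two embeddings. A minimal NumPy sketch; the embedding values and node indices are made up for illustration:

```python
import numpy as np

# Z holds one d-dimensional embedding per node (here 4 nodes, d = 3);
# the values are arbitrary, just for illustration.
Z = np.array([[0.1, 0.9, 0.0],
              [0.2, 0.8, 0.1],
              [0.9, 0.1, 0.7],
              [0.8, 0.0, 0.9]])

def decode(Z, u, v):
    # Dot-product decoder: similarity(u, v) ~ z_u . z_v
    return Z[u] @ Z[v]

print(decode(Z, 0, 1))  # large: nodes 0 and 1 have similar embeddings
print(decode(Z, 0, 2))  # small: nodes 0 and 2 do not

# All pairwise similarities at once
S = Z @ Z.T
```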

Basics of "conventional" neural networks

11

A Linear model

• 𝑋 ∈ ℝ𝑚×1 : input

• W∈ ℝ

• B∈ ℝ

• 𝑌 ∈ ℝ𝑚×1 : output (prediction)

• 𝑌 = 𝑋𝑊 + 𝐵

12

𝑋 = [𝑥1; 𝑥2; 𝑥3]    𝑊 = [𝑤1]

𝑋𝑊 = [𝑥1𝑤1; 𝑥2𝑤1; 𝑥3𝑤1]

A Linear model : for multi-dimensional input

• 𝑋 ∈ ℝ𝑚×𝑑 : input

• W∈ ℝ𝑑×1

• B∈ ℝ

• 𝑌 ∈ ℝ𝑚×1 : output (prediction)

• 𝑌 = 𝑋𝑊 + 𝐵

13

𝑋 = [𝑥1𝑎 𝑥1𝑏; 𝑥2𝑎 𝑥2𝑏; 𝑥3𝑎 𝑥3𝑏]    𝑊 = [𝑤𝑎; 𝑤𝑏]

𝑋𝑊 = [𝑥1𝑎𝑤𝑎 + 𝑥1𝑏𝑤𝑏; 𝑥2𝑎𝑤𝑎 + 𝑥2𝑏𝑤𝑏; 𝑥3𝑎𝑤𝑎 + 𝑥3𝑏𝑤𝑏]

A Linear model : for multi-dimensional input and output

• 𝑋 ∈ ℝ𝑚×𝑑 : input

• W∈ ℝ𝑑×𝑛

• B∈ ℝ𝑛

• 𝑌 ∈ ℝ𝑚×𝑛 : output (prediction)

• 𝑌 = 𝑋𝑊 + 𝐵

14

𝑋 = [𝑥1𝑎 𝑥1𝑏; 𝑥2𝑎 𝑥2𝑏; 𝑥3𝑎 𝑥3𝑏]    𝑊 = [𝑤1𝑎 𝑤1𝑏; 𝑤2𝑎 𝑤2𝑏]

𝑋𝑊 = [𝑥1𝑎𝑤1𝑎 + 𝑥1𝑏𝑤2𝑎   𝑥1𝑎𝑤1𝑏 + 𝑥1𝑏𝑤2𝑏;
      𝑥2𝑎𝑤1𝑎 + 𝑥2𝑏𝑤2𝑎   𝑥2𝑎𝑤1𝑏 + 𝑥2𝑏𝑤2𝑏;
      𝑥3𝑎𝑤1𝑎 + 𝑥3𝑏𝑤2𝑎   𝑥3𝑎𝑤1𝑏 + 𝑥3𝑏𝑤2𝑏]

A non-linear model : for multi-dimensional input/output

• 𝑋 ∈ ℝ𝑚×𝑑 : input

• W∈ ℝ𝑑×𝑛

• B∈ ℝ𝑛

• 𝑅(𝑎), 𝜎(𝑎) : a non-linear function

• 𝑌 ∈ ℝ𝑚×𝑛 : output (prediction)

• 𝑌 = 𝜎(𝑋𝑊 + 𝐵)

Image credit: Julien Mille

15

Can we stack the model? The endomorphism property

• 𝑋 ∈ ℝ𝑚×𝑑 : input
• 𝑊(1) ∈ ℝ𝑑×𝑛, 𝐵(1) ∈ ℝ𝑛
• 𝑌(1) ∈ ℝ𝑚×𝑛 : intermediate result
• 𝑊(2) ∈ ℝ𝑛×𝑜, 𝐵(2) ∈ ℝ𝑜
• 𝑌(2) ∈ ℝ𝑚×𝑜 : output (prediction)

• 𝑌(1) = 𝜎(𝑋𝑊(1) + 𝐵(1))
• 𝑌(2) = 𝜎(𝑌(1)𝑊(2) + 𝐵(2))
• 𝑌(2) = 𝜎(𝜎(𝑋𝑊(1) + 𝐵(1))𝑊(2) + 𝐵(2))

• 𝑌(𝑙) = 𝑓(𝑌(𝑙−1), 𝑊(𝑙), 𝐵(𝑙)) = 𝑓(𝑌(𝑙−1), 𝜃(𝑙))
• 𝑌(𝑙) = 𝑓(𝑓(𝑋, 𝜃(𝑙−1)), 𝜃(𝑙)) : composition of functions
• 𝐹(𝑋, Θ) = 𝑓𝜃(𝑙) ∘ 𝑓𝜃(𝑙−1) : composition of functions

Multi-Layer Perceptron (MLP): a sequence of linear operations and non-linear operations.

Why these non-linear operations? Without them, a stack of linear layers collapses into a single linear model.

16
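To make the stacking concrete, here is a small NumPy sketch of the two-layer model above; the sizes m, d, n, o and the random values are arbitrary. It also checks the point about non-linearities: with σ removed, the two layers collapse into a single linear model.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n, o = 5, 4, 3, 2          # arbitrary sizes, for illustration only

X  = rng.normal(size=(m, d))     # input
W1 = rng.normal(size=(d, n)); B1 = rng.normal(size=n)
W2 = rng.normal(size=(n, o)); B2 = rng.normal(size=o)

sigma = lambda a: np.maximum(a, 0.0)   # ReLU as the non-linearity

Y1 = sigma(X @ W1 + B1)          # Y(1) = sigma(X W(1) + B(1)), shape (m, n)
Y2 = sigma(Y1 @ W2 + B2)         # Y(2) = sigma(Y(1) W(2) + B(2)), shape (m, o)

# Without sigma, the composition is still a single linear model:
# X W1 W2 + (B1 W2 + B2) has the same form as X W + B.
Y_linear = (X @ W1 + B1) @ W2 + B2
assert np.allclose(Y_linear, X @ (W1 @ W2) + (B1 @ W2 + B2))
```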

The basics of artificial neural networks

17

The basics of graph neural networks

18

Adjacency matrix

19

Degree matrix

20

Exponentiation of the adjacency matrix

21
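As a minimal illustration of these three objects, the sketch below builds the adjacency matrix A, the degree matrix D and A² for a small made-up graph; the entry (A^k)[i, j] counts the walks of length k from node i to node j.

```python
import numpy as np

# Undirected toy graph on 4 nodes with edges (0-1), (1-2), (2-3), (0-2)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

# Degree matrix: diagonal matrix of the row sums of A
D = np.diag(A.sum(axis=1))

# Exponentiation of the adjacency matrix:
# (A @ A)[i, j] = number of walks of length 2 from i to j
A2 = A @ A
print(A2[0, 3])   # 1: the single length-2 walk 0 -> 2 -> 3
print(A2[0, 0])   # 2: the walks 0 -> 1 -> 0 and 0 -> 2 -> 0
```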

Key idea and intuition [Kipf and Welling, 2016]

• The key idea is to generate node embeddings based on local neighborhoods.

• The intuition is to aggregate node information from their neighbors using neural networks.

• Nodes have embeddings at each layer, and the neural network can be of arbitrary depth. The "layer-0" embedding of node u is its input feature vector, i.e. 𝐹𝑢.

22

GNN: a pictorial model

23

Topology and Features

• 𝐴 ∈ ℝ|𝑉|×|𝑉| : adjacency matrix (topology/structure)

• 𝐻 ∈ ℝ|𝑉|×𝑑 : a d-dimensional feature vector for each node

• 𝐴𝐻 ∈ ℝ|𝑉|×𝑑 : feature aggregation

24

𝐴 = [𝑎1𝑎 𝑎1𝑏; 𝑎2𝑎 𝑎2𝑏]    𝐻 = [ℎ1𝑎 ℎ1𝑏; ℎ2𝑎 ℎ2𝑏]

𝐴𝐻 = [𝑎1𝑎ℎ1𝑎 + 𝑎1𝑏ℎ2𝑎   𝑎1𝑎ℎ1𝑏 + 𝑎1𝑏ℎ2𝑏;
      𝑎2𝑎ℎ1𝑎 + 𝑎2𝑏ℎ2𝑎   𝑎2𝑎ℎ1𝑏 + 𝑎2𝑏ℎ2𝑏]
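The same aggregation in NumPy, on a made-up 3-node graph: row u of AH is the sum of the feature vectors of u's neighbors.

```python
import numpy as np

# Adjacency of a toy graph on 3 nodes: 0-1 and 1-2 are edges
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])

# One 2-dimensional feature vector per node (arbitrary values)
H = np.array([[1., 10.],
              [2., 20.],
              [3., 30.]])

AH = A @ H
# Row 1 of AH = H[0] + H[2] = [4., 40.]: node 1 aggregates its neighbors' features
print(AH)
```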

Simple example of f

25

Topology and Features

• 𝐴𝐻 ∈ ℝ|𝑉|×𝑑= Feature aggregation

• 𝑊 ∈ ℝ𝑑×𝑙 : parameter matrix whose size does not depend on |V|

26

𝐴𝐻 = [𝑎ℎ1𝑎 𝑎ℎ1𝑏; 𝑎ℎ2𝑎 𝑎ℎ2𝑏]    𝑊 = [𝑊1𝑎 𝑊1𝑏 𝑊1𝑐; 𝑊2𝑎 𝑊2𝑏 𝑊2𝑐]

𝐴𝐻𝑊 = [𝑎ℎ1𝑎𝑊1𝑎 + 𝑎ℎ1𝑏𝑊2𝑎   𝑎ℎ1𝑎𝑊1𝑏 + 𝑎ℎ1𝑏𝑊2𝑏   𝑎ℎ1𝑎𝑊1𝑐 + 𝑎ℎ1𝑏𝑊2𝑐;
       𝑎ℎ2𝑎𝑊1𝑎 + 𝑎ℎ2𝑏𝑊2𝑎   𝑎ℎ2𝑎𝑊1𝑏 + 𝑎ℎ2𝑏𝑊2𝑏   𝑎ℎ2𝑎𝑊1𝑐 + 𝑎ℎ2𝑏𝑊2𝑐]

The Math

27

• Basic approach: average the neighbor messages and apply a neural network.

ℎ𝑣(0) = 𝐹𝑣 : the initial "layer-0" embedding of 𝑣 is equal to its node features.

ℎ𝑣(𝑘) = 𝜎( 𝑊𝑘 Σ𝑢∈𝑁(𝑣) ℎ𝑢(𝑘−1) / |𝑁(𝑣)| + 𝐵𝑘 ℎ𝑣(𝑘−1) ),  𝑘 = 1, …, 𝐾

where ℎ𝑣(𝑘) is the kth-layer embedding of 𝑣, 𝜎 is a non-linearity (e.g., ReLU or tanh), the first term is the average of the neighbors' previous-layer embeddings, and the last term involves the previous-layer embedding of 𝑣 itself.

Training the Model

28

• After K-layers of neighborhood aggregation, we get output embeddings for each node.

• We can feed these embeddings into any loss function and run stochastic gradient descent to train the aggregation parameters.

• 𝑊𝑘 and 𝐵𝑘 are the trainable matrices (i.e., what we learn).

Inductive Capability

29

• Inductive node embeddings generalize to entirely unseen graphs.

• E.g., train on a protein interaction graph from model organism A and generate embeddings on newly collected data about organism B: train on one graph, generalize to a new graph.

Inductive Capability

30

• Train on a snapshot of the graph; when a new node arrives, generate an embedding for the new node.

• Many application settings constantly encounter previously unseen nodes (e.g., Reddit, YouTube, Google Scholar, …).

• Need to generate new embeddings "on the fly".

Source: Representation Learning on Networks, snap.stanford.edu/proj/embeddings-www, WWW 2018.

Two issues with this simple example

• Issue 1: for every node, f sums up the feature vectors of all neighboring nodes, but not of the node itself.
  • Fix: simply add the identity matrix to A.

• Issue 2: A is typically not normalized, so the multiplication with A completely changes the scale of the feature vectors.
  • Fix: normalize A such that all rows sum to one (i.e., multiply by the inverse degree matrix, D⁻¹A).

31

Altogether: [Kipf and Welling, 2016]

32

• The two patches mentioned before, plus

• A better (symmetric) normalization of the adjacency matrix: use Â = D̃^(−1/2)(A + I)D̃^(−1/2), where D̃ is the degree matrix of A + I.

O(|E|) time complexity overall.
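Putting the two fixes and the symmetric normalization together, here is a compact NumPy sketch of one propagation layer in the spirit of [Kipf and Welling, 2016]. The toy graph, features and weights are invented, and a dense implementation is used for readability (a sparse one would give the O(|E|) cost mentioned above).

```python
import numpy as np

def gcn_layer(A, H, W, sigma=lambda a: np.maximum(a, 0.0)):
    """One layer: sigma(D^-1/2 (A + I) D^-1/2 H W)."""
    A_tilde = A + np.eye(A.shape[0])            # fix 1: add self-loops
    d = A_tilde.sum(axis=1)                     # degrees of A + I
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))      # fix 2: (symmetric) normalization
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt
    return sigma(A_hat @ H @ W)

# Toy graph, features and weights (arbitrary values)
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))

H1 = gcn_layer(A, H, W)    # new node embeddings, shape (3, 4)
```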

Graph Convolutional Networks

33

• Kipf et al.’s Graph Convolutional Networks (GCNs) are a slight variation on the neighborhood aggregation idea:

Graph Convolutional Networks

34

Basic neighborhood aggregation vs. GCN neighborhood aggregation:

• GCN uses the same transformation matrix for the self embedding and the neighbor embeddings.

• Instead of a simple average, GCN uses a per-neighbor normalization that varies across neighbors.

Graph Convolutional Networks

35

• Empirically, they found this configuration to give the best results.

• More parameter sharing.

• Down-weights high degree neighbors.

From matrix computations to message passing algorithms

36
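The matrix products above can equivalently be written as a per-node message-passing procedure: every node collects its neighbors' embeddings (messages), aggregates them, and updates its own embedding. Below is a rough sketch of that view, with a plain sum as the aggregation and an invented toy graph; the final assertion checks that it matches the matrix form σ(AHW).

```python
import numpy as np

# Adjacency list of a toy 3-node path graph
neighbors = {0: [1], 1: [0, 2], 2: [1]}
H = np.array([[1., 10.],
              [2., 20.],
              [3., 30.]])
W = np.array([[1., 0., 1.],
              [0., 1., 1.]])
sigma = lambda a: np.maximum(a, 0.0)

H_new = np.zeros((len(neighbors), W.shape[1]))
for v, neigh in neighbors.items():
    # 1. message: each neighbor u sends its current embedding H[u]
    messages = [H[u] for u in neigh]
    # 2. aggregate: a plain sum, i.e. the same thing as row v of A @ H
    m_v = np.sum(messages, axis=0)
    # 3. update: linear transform + non-linearity
    H_new[v] = sigma(m_v @ W)

# The per-node loop equals the matrix computation sigma(A H W)
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
assert np.allclose(H_new, sigma(A @ H @ W))
```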


More complex models [Nowak et al., 2017]

46

GraphSAGE Idea

47

• So far we have aggregated the neighbor messages by taking their (weighted) average; can we do better?

GraphSAGE Idea

48

• Simple neighborhood aggregation: take the average of the neighbors' embeddings.

• GraphSAGE: replace the average by AGG, any differentiable function that maps a set of vectors to a single vector.

GraphSAGE Differences

49

• Generalized aggregation (AGG instead of a simple average).

• Concatenate the self embedding and the aggregated neighbor embedding.

GraphSAGE Variants

50

• Mean: take the element-wise mean of the neighbor vectors.

• Pool: transform the neighbor vectors and apply a symmetric vector function (element-wise mean/max).

• LSTM: apply an LSTM to a random permutation of the neighbors.
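A sketch of the GraphSAGE-style update with the mean aggregator: aggregate the neighbors, concatenate with the node's own embedding, then apply a linear map and a non-linearity. The graph, features and weights are made up, and the neighbor sampling and output L2-normalization used in the full method are omitted.

```python
import numpy as np

neighbors = {0: [1, 2], 1: [0], 2: [0]}
H = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])
rng = np.random.default_rng(1)
d_in, d_out = H.shape[1], 3
W = rng.normal(size=(2 * d_in, d_out))   # acts on [self || aggregated neighbors]
sigma = lambda a: np.maximum(a, 0.0)

H_new = np.zeros((len(neighbors), d_out))
for v, neigh in neighbors.items():
    agg = np.mean([H[u] for u in neigh], axis=0)   # mean aggregator (AGG)
    concat = np.concatenate([H[v], agg])           # self || neighbors
    H_new[v] = sigma(concat @ W)
```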

Graph Attention Network (GAT)

• The contributions of neighboring nodes to the central node are neither identical (as in GraphSAGE) nor pre-determined (as in GCN).

• GAT learns the relative weights between two connected nodes.

51

P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," in Proc. of ICLR, 2018.
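A rough single-head sketch of the attention mechanism described in the cited paper: attention logits come from a shared linear map and a learned vector a applied to the concatenated pair, pass through a LeakyReLU, and are softmax-normalized over each node's neighborhood (self-loops included). All numeric values below are invented.

```python
import numpy as np

neighbors = {0: [0, 1, 2], 1: [1, 0], 2: [2, 0]}   # self-loops included
H = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])
rng = np.random.default_rng(2)
W = rng.normal(size=(2, 3))         # shared linear transform
a = rng.normal(size=(2 * 3,))       # attention vector on [Wh_i || Wh_j]
leaky = lambda x: np.where(x > 0, x, 0.2 * x)

Wh = H @ W
H_new = np.zeros_like(Wh)
for i, neigh in neighbors.items():
    # Unnormalized attention logits e_ij = LeakyReLU(a^T [Wh_i || Wh_j])
    e = np.array([leaky(np.concatenate([Wh[i], Wh[j]]) @ a) for j in neigh])
    alpha = np.exp(e) / np.exp(e).sum()            # softmax over the neighborhood
    # Weighted combination of neighbors replaces the uniform average
    H_new[i] = sum(a_j * Wh[j] for a_j, j in zip(alpha, neigh))
```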

GNN as an encoder

52

Applications and losses

• Let us recall that Z is the output of the GNN.

53

Unsupervised learning

• The graph factorization problem is the problem of predicting whether two nodes are linked or not.

• The loss involves a similarity measure between two node embeddings (e.g., the dot product z_u · z_v).

• Variations of graph factorization:
  • DeepWalk [Perozzi et al., 2014]
  • node2vec [Grover and Leskovec, 2016]

54

Unsupervised embedding: Graph factorization

• Use stochastic gradient descent (SGD) as a general optimization method.

The graph factorization problem is the problem of predicting if two nodes are linked or not.

55
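A minimal sketch of a graph-factorization style objective in this spirit: push the dot product of embeddings of linked nodes toward 1 and of non-linked pairs toward 0, here with a squared-error loss (the exact loss form is an assumption for illustration) optimized by gradient descent directly on Z. In a GNN, Z would be the encoder output and the gradients would flow back into the encoder parameters.

```python
import numpy as np

A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
rng = np.random.default_rng(3)
Z = rng.normal(scale=0.1, size=(3, 2))   # stands in for the GNN output

lr = 0.02
for step in range(1000):
    S = Z @ Z.T                 # predicted similarities (dot-product decoder)
    R = S - A                   # residual w.r.t. "linked or not"
    grad = 4 * R @ Z            # d/dZ ||Z Z^T - A||_F^2 (A and Z Z^T are symmetric)
    Z -= lr * grad

# Z Z^T now approximates A: the connected pairs (0,1) and (1,2)
# score higher than the non-edge (0,2).
print((Z @ Z.T).round(2))
```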

Supervised learning

• Node classification: e.g., in a social network.

56

Supervised learning

• Node classification :

57

Node-wise classification

58

Supervised learning: node classification

• Last layer: the activation function of the last layer is a softmax.

• The loss: the cross-entropy between the predicted distribution and the true node labels.

59

(The softmax corresponds to a Gibbs distribution over the classes.)
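A sketch of this classification head: a softmax over the final node embeddings Z and a cross-entropy loss over the labelled nodes; all values are invented.

```python
import numpy as np

Z = np.array([[2.0, 0.1, -1.0],     # one row of class scores per node
              [0.3, 1.5,  0.2],
              [-0.5, 0.1, 2.2]])
y = np.array([0, 1, 2])             # ground-truth class of each node

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

P = softmax(Z)                                    # class probabilities per node
loss = -np.log(P[np.arange(len(y)), y]).mean()    # cross-entropy over labelled nodes
print(loss)
```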

Supervised learning: graph classification

• A global average pooling layer must be added to gather all the node embeddings of a given graph into a single vector Z'.

• Z' can be fed to an MLP for classification.

• The loss is the cross-entropy.

60
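A sketch of the readout described above: global average pooling of the node embeddings of each graph gives one vector Z' per graph, which can then be fed to an MLP; sizes and values are arbitrary.

```python
import numpy as np

# Node embeddings Z for two graphs, stored with a graph id per node
Z = np.array([[1., 2.],
              [3., 4.],
              [5., 6.],
              [7., 8.]])
graph_id = np.array([0, 0, 0, 1])   # nodes 0-2 belong to graph 0, node 3 to graph 1

# Global average pooling: one embedding per graph
Z_prime = np.stack([Z[graph_id == g].mean(axis=0) for g in np.unique(graph_id)])
print(Z_prime)   # shape (2, 2); can now be fed to an MLP + cross-entropy
```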

Semi-supervised learning: node classification

• The problem of classifying nodes in a graph where labels are only available for a small subset of nodes.

• Label information is smoothed over the graph via some form of explicit graph-based regularization.

• The assumption is that connected nodes in the graph are likely to share the same label (class).

• This holds, for instance, for a neighborhood graph.

61

So, 𝑙reg = ||𝑍𝑍ᵀ − 𝐴||²Frobenius.
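Combining the pieces as a sketch: a cross-entropy term on the few labelled nodes plus the graph regularizer l_reg = ||ZZᵀ − A||²_F above; the mixing weight λ and all numeric values are made up.

```python
import numpy as np

A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
Z = np.array([[1.2, -0.3],          # GNN output: 3 nodes, 2 classes
              [0.4,  0.5],
              [-0.8, 1.1]])
labeled = np.array([0])             # only node 0 is labelled
y = np.array([0])                   # its class
lam = 0.1                           # regularization weight (arbitrary)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

P = softmax(Z)
l_sup = -np.log(P[labeled, y]).mean()            # supervised part, labelled nodes only
l_reg = np.linalg.norm(Z @ Z.T - A, 'fro') ** 2  # graph-based regularization
loss = l_sup + lam * l_reg
```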

Some applications

• Node classification on citation networks: the input is a citation network where nodes are papers, edges are citation links, optionally with bag-of-words features on the nodes. The target for each node is a paper category (e.g., stat.ML, cs.LG, …).

62

Image/molecule classification

• Graph classification: an image is represented as a graph, either from raw pixels (a regular grid, so all images share the same graph) or from superpixels (an irregular graph).

64

Graph classification

Taken from M. Bronstein, CVPR Tutorial, 2017

65

Examples on molecular graphs:
1. cancerous or non-cancerous molecules
2. determination of the boiling point

Summary on GNN

66

Recommended reading

• Tutorials:
  • Geometric Deep Learning, Tutorial, CVPR 2017. http://geometricdeeplearning.com/
  • Deep Learning on Graphs with Graph Convolutional Networks. http://deeploria.gforge.inria.fr/thomasTalk.pdf
  • Graph-based Methods in Pattern Recognition & Document Image Analysis. http://gmprdia.univ-lr.fr/

• List of papers:
  • Gilmer et al., Neural Message Passing for Quantum Chemistry, 2017. https://arxiv.org/abs/1704.01212
  • Kipf et al., Semi-Supervised Classification with Graph Convolutional Networks, ICLR 2017. https://arxiv.org/abs/1609.02907
  • Defferrard et al., Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering, NIPS 2016. https://arxiv.org/abs/1606.09375
  • Bruna et al., Spectral Networks and Locally Connected Networks on Graphs, ICLR 2014. https://arxiv.org/abs/1312.6203
  • Duvenaud et al., Convolutional Networks on Graphs for Learning Molecular Fingerprints, NIPS 2015. https://arxiv.org/abs/1509.09292
  • Li et al., Gated Graph Sequence Neural Networks, ICLR 2016. https://arxiv.org/abs/1511.05493
  • Battaglia et al., Interaction Networks for Learning about Objects, Relations and Physics, NIPS 2016. https://arxiv.org/abs/1612.00222
  • Kearnes et al., Molecular Graph Convolutions: Moving Beyond Fingerprints, 2016. https://arxiv.org/abs/1603.00856

67
