GraphMix: Improved Training of GNNs for Semi-Supervised Learning

Vikas Verma1,2, Meng Qu1, Kenji Kawaguchi3, Alex Lamb1, Yoshua Bengio1, Juho Kannala2, Jian Tang1

1 Mila - Québec Artificial Intelligence Institute, Montréal, Canada, 2 Aalto University, Finland, 3 Massachusetts Institute of Technology (MIT), USA

Abstract

We present GraphMix, a regularization method for Graph Neural Network based semi-supervised object classification, whereby we propose to train a fully-connected network jointly with the graph neural network via parameter sharing and interpolation-based regularization. Further, we provide a theoretical analysis of how GraphMix improves the generalization bounds of the underlying graph neural network, without making any assumptions about the "aggregation" layer or the depth of the graph neural networks. We experimentally validate this analysis by applying GraphMix to various architectures such as Graph Convolutional Networks, Graph Attention Networks and Graph-U-Net. Despite its simplicity, we demonstrate that GraphMix can consistently improve or closely match state-of-the-art performance using even simpler architectures such as Graph Convolutional Networks, across three established graph benchmarks: the Cora, Citeseer and Pubmed citation network datasets, as well as three newly proposed datasets: Cora-Full, Coauthor-CS and Coauthor-Physics.

1 Introduction
Due to the presence of graph-structured data across a wide variety of domains, such as biological networks, citation networks and social networks, there have been several attempts to design neural networks, known as graph neural networks (GNNs), that can process arbitrarily structured graphs. Early work includes (Gori, Monfardini, and Scarselli 2005; Scarselli et al. 2009), which propose a neural network that can directly process most types of graphs, e.g., acyclic, cyclic, directed, and undirected graphs. More recent approaches include (Bruna et al. 2013; Henaff, Bruna, and LeCun 2015; Defferrard, Bresson, and Vandergheynst 2016; Kipf and Welling 2016; Gilmer et al. 2017; Hamilton, Ying, and Leskovec 2017; Veličković et al. 2018, 2019; Qu, Bengio, and Tang 2019; Gao and Ji 2019; Ma et al. 2019), among others. Many of these approaches are designed for addressing the problem of semi-supervised learning over graph-structured data (Zhou et al. 2018), and much of this research effort has been dedicated to developing novel architectures.

Here we instead propose an architecture-agnostic method for regularized training of GNNs for semi-supervised node classification.

Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Recently, regularization based on data augmentation has been shown to be very effective in other types of neural networks, but how to apply these techniques to GNNs is still under-explored. Our proposed method GraphMix¹ is a unified framework that draws inspiration from interpolation-based data augmentation (Zhang et al. 2018; Verma et al. 2019a) and self-training-based data augmentation (Laine and Aila 2016; Tarvainen and Valpola 2017; Verma et al. 2019b; Berthelot et al. 2019). We show that with our proposed method, we can achieve state-of-the-art performance even when using simpler GNN architectures such as Graph Convolutional Networks (Kipf and Welling 2017), with no additional memory cost and with minimal additional computation cost. Further, we conduct a theoretical analysis to demonstrate the effectiveness of the proposed method over the underlying GNNs.

2 Problem Definition and Preliminaries
Problem Setup: We are interested in the problem of semi-supervised node and edge classification using graph-structured data. We can formally define such graph-structured data as an undirected graph G = (V, A, X), where V = V_l ∪ V_u is the union of labeled (V_l) and unlabeled (V_u) nodes in the graph with cardinalities n_l and n_u, A is the adjacency matrix representing the edges between the nodes of V, and X ∈ R^{(n_l+n_u)×d} is the matrix of input node features. Each node v belongs to one of C classes and can be labeled with a C-dimensional one-hot vector y_v ∈ R^C. Given the labels Y_l ∈ R^{n_l×C} of the labeled nodes V_l, the task is to predict the labels Y_u ∈ R^{n_u×C} of the unlabeled nodes V_u.
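The following snippet is a minimal, hypothetical sketch (not taken from the paper's code) of how this setup can be represented as dense tensors; the variable names and Cora-like sizes are illustrative only.

```python
import torch

# Illustrative sizes only (roughly Cora-shaped): n nodes, d features, C classes.
n, d, C = 2708, 1433, 7
features = torch.rand(n, d)                     # X in R^{(n_l + n_u) x d}
adj = torch.eye(n)                              # dense adjacency matrix A (undirected => symmetric)
labels = torch.randint(0, C, (n,))              # one class id per node

train_mask = torch.zeros(n, dtype=torch.bool)   # V_l: labeled nodes
train_mask[:140] = True                         # e.g. 140 labeled nodes, as in Cora's standard split
unlabeled_mask = ~train_mask                    # V_u: nodes whose labels Y_u must be predicted
```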

Graph Neural Networks: Graph neural networks (GNNs) learn the l-th layer representations of a sample i by leveraging the representations of the samples NB(i) in the neighbourhood of i. This is done by using an aggregation function that takes as input the representations of all the samples along with the graph structure, and outputs the aggregated representation. The aggregation function can be defined using the Graph Convolution layer (Kipf and Welling 2017), the Graph Attention layer (Veličković et al. 2018), or any general message-passing layer (Gilmer et al. 2017). Formally,

¹ Code available at https://github.com/vikasverma1077/GraphMix. Correspondence: [email protected]


Figure 1: The procedure for training with GraphMix. The labeled and unlabeled nodes are shown with different colors in the graph. GraphMix augments the training of a baseline Graph Neural Network (GNN) with a Fully-Connected Network (FCN). The FCN is trained by interpolating the hidden states and the corresponding labels. This leads to better features, which are transferred to the GNN via sharing the linear transformation parameters W (in Equation 1) of the GNN and FCN layers. Furthermore, the predictions made by the GNN for unlabeled data are used to augment the input data for the FCN. The FCN and GNN losses are minimized jointly by alternating minimization.

let h^{(l)} ∈ R^{n×k} be a matrix containing the k-dimensional representations of the n nodes in the l-th layer; then:

h^{(l+1)} = σ(AGGREGATE(h^{(l)} W, A))    (1)

where W ∈ R^{k×k′} is a linear transformation matrix, k′ is the dimension of the (l+1)-th layer, AGGREGATE is the aggregation function that utilizes the graph adjacency matrix A to aggregate the hidden representations of neighbouring nodes, and σ is a non-linear activation function, e.g. ReLU.
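As a concrete illustration of Eq. 1, the sketch below implements one such layer with GCN-style neighbourhood averaging as the AGGREGATE step; this is an illustrative example rather than the paper's implementation, and the (already normalized) adjacency matrix is assumed to be given.

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    """One layer of Eq. (1): h^{(l+1)} = sigma(AGGREGATE(h^{(l)} W, A)).
    AGGREGATE here is multiplication by a (normalized) adjacency matrix, i.e.
    GCN-style neighbourhood averaging; attention or general message passing
    would fit the same template."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)  # the linear map W

    def forward(self, h, adj):
        support = self.weight(h)        # h^{(l)} W
        aggregated = adj @ support      # AGGREGATE(., A): weighted sum over neighbours
        return torch.relu(aggregated)   # sigma
```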

Interpolation-Based Regularization Techniques: Recently, interpolation-based techniques have been proposed for regularizing neural networks. We briefly describe some of these techniques here. Mixup (Zhang et al. 2018) trains a neural network on the convex combination of inputs and targets, whereas Manifold Mixup (Verma et al. 2019a) trains a neural network on the convex combination of the hidden states of a randomly chosen hidden layer and the targets. While Mixup regularizes a neural network by enforcing that the model output should change linearly in between the examples in the input space, Manifold Mixup regularizes the neural network by learning better (more discriminative) hidden states.

Formally, suppose T_θ(x) = (f ∘ g)_θ(x) is a neural network parametrized by θ such that g : x → h is a function that maps an input sample to hidden states, f : h → y is a function that maps hidden states to the predicted output, λ is a random variable drawn from a Beta(α, α) distribution, Mix_λ(a, b) = λ·a + (1−λ)·b is an interpolation function, D is the data distribution, (x, y) and (x′, y′) is a pair of labeled examples sampled from the distribution D, and ℓ is a loss function such as the cross-entropy loss. Then the Manifold Mixup loss is defined as:

L_MM(D, T_θ, α) = E_{(x,y)∼D} E_{(x′,y′)∼D} E_{λ∼Beta(α,α)} ℓ(f(Mix_λ(g(x), g(x′))), Mix_λ(y, y′))    (2)

We use the above Manifold Mixup loss for training an auxiliary fully-connected network, as described in Section 3.
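A minimal Monte-Carlo sketch of Eq. 2 in PyTorch is given below, assuming the split T_θ = f ∘ g; the pairing via a random permutation and the soft cross-entropy are illustrative choices, not prescribed by the paper.

```python
import torch
import torch.nn.functional as F

def mix(lam, a, b):
    # Mix_lambda(a, b) = lam * a + (1 - lam) * b
    return lam * a + (1 - lam) * b

def manifold_mixup_loss(g, f, x, y_onehot, alpha=1.0):
    """One stochastic estimate of L_MM: pair each example with a randomly permuted
    partner, interpolate their hidden states g(.) and targets with the same
    lambda ~ Beta(alpha, alpha), and apply a soft cross-entropy to f(.)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))              # (x', y') partners
    h = g(x)
    h_mix = mix(lam, h, h[perm])
    y_mix = mix(lam, y_onehot, y_onehot[perm])
    logits = f(h_mix)
    return -(y_mix * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```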

3 GraphMix
3.1 Motivation
Data augmentation is one of the simplest and most efficient techniques for regularizing a neural network. In the domains of computer vision, speech and natural language, there exist efficient data augmentation techniques, for example, random cropping, translation or Cutout (Devries and Taylor 2017) for computer vision, (Ko et al. 2015) and (Park et al. 2019) for speech, and (Xie et al. 2017) for natural language. However, data augmentation for graph-structured data remains under-explored. There exists some recent work along these lines, but the prohibitive computation cost (see Section 5) introduced by these methods makes them impractical for real-world large graph datasets. Based on these limitations, our main objective is to propose an efficient data augmentation technique for graph datasets.

Recent work based on interpolation-based data augmentation (Zhang et al. 2018; Verma et al. 2019a) has seen sizable improvements in regularization performance across a number of tasks. However, these techniques are not directly applicable to graphs for an important reason: although we can create additional nodes by interpolating the features and corresponding labels, it remains unclear how these new nodes must be connected to the original nodes via synthetic edges such that the structure of the whole graph is preserved.

To alleviate this issue, we propose to train an auxiliary Fully-Connected Network (FCN) using Manifold Mixup, as discussed in Section 3.2. Note that the FCN only uses the node features (not the graph structure), thus the Manifold Mixup loss in Eq. 2 can be directly used for training the FCN.

Interpolation-based data augmentation techniques have an added advantage for training GNNs: a vanilla GNN learns the representation of each node by iteratively aggregating information from the neighbors of that node (Equation 1). However, this induces the problem of oversmoothing (Li, Han, and Wu 2018; Xu et al. 2018) when training GNNs with many layers. Due to this limitation, GNNs are trained with only a few layers, and thus they can only leverage the local neighbourhood of each node for learning its representations, without leveraging the representations of nodes that are multiple hops away in the graph. This limitation can be addressed using an interpolation-based method such as Manifold Mixup: in Manifold Mixup, the representations of a randomly chosen pair of nodes are used to facilitate better representation learning, and it is possible that the randomly chosen pair of nodes will not be in the local neighbourhood of each other. Based on these challenges and motivations, we present our proposed approach GraphMix for training GNNs in the following section.

3.2 Method
We first describe GraphMix at a high level and then give a more formal description. GraphMix augments the vanilla GNN with a Fully-Connected Network (FCN). The FCN loss is computed using Manifold Mixup as discussed below, and the GNN loss is computed normally. Manifold Mixup training of the FCN facilitates learning more discriminative node representations (Verma et al. 2019a). An important question is how these more discriminative node representations can be transferred to the GNN. One potential approach could involve maximizing the mutual information between the hidden states of the FCN and the GNN using formulations similar to those proposed by (Hjelm et al. 2019; Sun et al. 2020). However, this requires optimizing additional network parameters. Instead, we propose parameter sharing between the FCN and the GNN to facilitate the transfer of discriminative node representations from the FCN to the GNN. This is a viable option because, as mentioned in Eq. 1, a GNN layer typically performs an additional operation (AGGREGATE) on the linear transformations of node representations (which are essentially the pre-activation representations of the FCN layer). Using the more discriminative representations of the nodes from the FCN, as well as the graph structure, the GNN loss is computed in the usual way to further refine the node representations. In this way we can exploit the improved representations from Manifold Mixup for training GNNs.
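The following sketch illustrates this parameter-sharing idea under simplifying assumptions (two layers, GCN-style aggregation, no dropout); the class and method names are hypothetical and not taken from the released code.

```python
import torch
import torch.nn as nn

class SharedFCNGNN(nn.Module):
    """The FCN applies only the shared linear maps W1, W2 to the node features,
    while the GNN wraps the very same maps with an AGGREGATE step over the
    adjacency matrix A, so gradients from either loss update the same weights."""
    def __init__(self, in_dim, hid_dim, n_classes):
        super().__init__()
        self.W1 = nn.Linear(in_dim, hid_dim, bias=False)     # shared by FCN and GNN
        self.W2 = nn.Linear(hid_dim, n_classes, bias=False)  # shared by FCN and GNN

    def fcn_forward(self, x):                 # F_theta: node features only
        return self.W2(torch.relu(self.W1(x)))

    def gnn_forward(self, x, adj):            # G_theta: features + graph structure
        h = torch.relu(adj @ self.W1(x))      # AGGREGATE(x W1, A), GCN-style
        return adj @ self.W2(h)
```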

In Section 3.3, without making any assumptions about the aggregation function and the depth of the graph neural network, we show that GraphMix improves the generalization of the underlying graph neural network.

This makes GraphMix applicable to various kinds of architectures with different aggregation functions, such as weighted averaging in GCN (Kipf and Welling 2016), attention-based aggregation in GAT (Veličković et al. 2018), and graph-pooling/unpooling operations in Graph U-Nets (Gao and Ji 2019). In this sense, the GraphMix procedure is highly flexible: it can be applied to any underlying GNN, as long as the underlying GNN applies parametric transformations to the node features.

Additionally, we propose to use the predicted targets from the GNN to augment the training set of the FCN. In this way, both the FCN and the GNN facilitate each other's learning process. The FCN loss and the GNN loss are optimized in an alternating fashion during training. At inference time, predictions are made using only the GNN.

A diagram illustrating GraphMix is presented in Figure 1, and the full algorithm is presented in Appendix A.3. Further, we discuss the similarities and differences of GraphMix with respect to the Co-training framework in Appendix A.2.

So far we have presented the general design of GraphMix; now we present it more formally. Given a graph G, let (X_l, Y_l) be the input features and labels of the labeled nodes V_l, and let X_u be the input features of the unlabeled nodes V_u. Let F_θ and G_θ be an FCN and a GNN, respectively, which share the parameters θ. The FCN loss from the labeled data is computed using Eq. 2 as follows:

L_supervised = L_MM((X_l, Y_l), F_θ, α)    (3)

For the unlabeled nodes V_u, we compute the predictions Ŷ_u using the GNN:

Ŷ_u = G_θ(X_u)    (4)

We note that recent state-of-the-art semi-supervised learning methods use a teacher model to accurately predict targets for the unlabeled data. The teacher model can be realized as a temporal ensemble of the student model (the model being trained) (Laine and Aila 2016) or by using an Exponential Moving Average (EMA) of the parameters of the student model (Tarvainen and Valpola 2017; Verma et al. 2019b). Different from these approaches, we use the GNN as a teacher model for predicting the targets for the FCN. This is due to the fact that GNNs leverage the graph structure, which, in practice, allows them to make more accurate predictions than a temporal ensemble or an EMA ensemble of the FCN (although there is no theoretical guarantee for this).

Moreover, to improve the accuracy of the predicted targets in Eq. 4, we average the model's predictions over K random perturbations of an input sample and apply sharpening, as discussed in Appendix A.1.

Using the predicted targets for the unlabeled nodes, we create a new training set (X_u, Ŷ_u). The loss from the unlabeled data for the FCN is computed as:

L_unsupervised = L_MM((X_u, Ŷ_u), F_θ, α)    (5)

The total loss for training the FCN is given as the weighted sum of the above two loss terms:

L_FCN = L_supervised + w(t) · L_unsupervised    (6)

where w(t) is a sigmoid ramp-up function (Tarvainen and Valpola 2017) that increases from zero to its maximum value γ during the course of training.
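One common form of such a ramp-up, sketched below, follows the exp(−5(1 − t/T)²) schedule used by Tarvainen and Valpola (2017); the step count and γ are placeholder hyperparameters, and the paper's exact parameterization may differ.

```python
import math

def ramp_up_weight(t, ramp_up_steps=1000, gamma=1.0):
    """Sigmoid-shaped ramp-up: w(t) rises from 0 to gamma over ramp_up_steps updates."""
    if t >= ramp_up_steps:
        return gamma
    phase = 1.0 - t / ramp_up_steps
    return gamma * math.exp(-5.0 * phase * phase)

# Eq. (6): loss_fcn = loss_supervised + ramp_up_weight(step) * loss_unsupervised
```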

Now let us assume that the loss for the underlying GNN is L_GNN = ℓ(G_θ(X_l), Y_l).

The overall GraphMix loss for the joint training of the FCN and the GNN can then be defined as the weighted sum of the GNN loss and the FCN loss:

L_GraphMix = L_GNN + λ · L_FCN    (7)

However, throughout our experiments, optimizing the FCN loss and the GNN loss alternately at each training epoch achieved better test accuracy (discussed in Appendix A.12). This has the added benefit of removing the need to tune the weighting hyperparameter λ.

For Manifold Mixup training of the FCN, we apply mixup only in the hidden layer. Note that in (Verma et al. 2019a), the authors recommended applying mixing in a randomly chosen layer (which also includes the input layer) at each training update. However, we observed under-fitting when applying mixup randomly at the input layer or the hidden layer. Applying mixup only in the input layer also resulted in under-fitting and did not improve test accuracy.

Memory and Computational Requirements: One of the major limitations of current GNNs, which prohibits their application to real-world large datasets, is their memory complexity. For example, the fastest implementation of GCN, which stores the entire adjacency matrix A in memory, has memory complexity O(|V|²). Implementations with lower memory requirements are possible, but they incur a higher latency cost due to repeatedly loading parts of the adjacency matrix into memory. For these reasons, methods that have additional memory requirements in comparison to the baseline GNNs are less appealing in practice. In GraphMix, since the parameters of the FCN and the GNN are shared, there is no additional memory requirement. Furthermore, GraphMix does not add any significant computation cost over the underlying GNN, because the underlying GNN is trained in the standard way and the FCN training requires only trivial additional computation for computing the predicted targets (Appendix A.1) and the interpolation function (Mix_λ(a, b) in Section 2).

3.3 Analysis
In this subsection, we study how GraphMix impacts the generalization bound of an underlying GNN. Our analysis, which is based on Rademacher complexity (Bartlett and Mendelson 2002), provides a new generalization bound for GraphMix, which shows how adding the regularization of training the FCN with pseudolabels improves the overall generalization. We rely on the effect of changing one sample in Manifold Mixup and on the fact that the weights are shared by the GNN and the corresponding FCN in GraphMix.

Let G be a fixed graph with n total nodes. Define z_i = (x_i, y_i) to be the pair of the feature and the true label of the i-th node. Without loss of generality, let S = (z_1, ..., z_m) be the training set with the labeled nodes, where m = n_l and the data points z_1, ..., z_m are sampled according to an unknown distribution D to form the labeled node set S. In this subsection, we follow previous work on graph neural networks by assuming that all samples are i.i.d. (including replacement samples) (Verma and Zhang 2019).

Let Γ be a finite set of hyperparameters. For every hyperparameter γ ∈ Γ, define F_γ to be a distribution-dependent hypothesis space corresponding to the hyperparameter γ; that is, F_γ = {f_γ : (∃S ∈ 𝒮)[f_γ = A_γ(S)]}, where A_γ is an algorithm that outputs the hypothesis f_γ given a dataset S, and 𝒮 is the set of training datasets such that the probability of S ∈ 𝒮 according to D is one. For each f_γ ∈ F_γ, f_γ(·; G) represents the GNN with the graph G, and f_γ(·; G_0) represents the FCN, where G_0 is the null-graph version of G (i.e., G without edges). Let R^ℓ_m(F_γ) be the Rademacher complexity (Bartlett and Mendelson 2002) of the set {(x, y) ↦ ℓ(f_γ(x; G), y) : f_γ ∈ F_γ}.

Let L_GNN(S, f_γ) be the loss L_GNN with the GNN f_γ(·; G) and labeled data points S, i.e., L_GNN(S, f_γ) = (1/m) Σ_{i=1}^{m} ℓ(f_γ(x_i; G), y_i). Let L_FCN(S, f_γ) be the loss L_FCN with the FCN f_γ(·; G_0) and labeled data points S, i.e.,

L_FCN(S, f_γ) = L_MM(S, f_γ(·; G_0), α) + (n_u/m) L_MM((X^S_u, Ŷ^S_u), f_γ(·; G_0), α),

where (X^S_u, Ŷ^S_u) are the unlabeled nodes (X_u, Ŷ_u) that correspond to the labeled node set S. Let L_GraphMix(S, f_γ) = L_GNN(S, f_γ) + λ L_FCN(S, f_γ). Let c be an upper bound on the per-sample loss, i.e., c ≥ ℓ(f_γ(x_i; G), y_i) and c ≥ ℓ(f_γ(x_i; G_0), y_i). For example, c = 1 for the training and test error (0-1 loss).

Theorem 1 provides a generalization bound for GraphMix, which shows that GraphMix can improve the generalization bound of the underlying GNN under the condition V > 0, as discussed below.

Theorem 1. For any δ > 0, with probability at least 1 − δ, the following holds: for all γ ∈ Γ and all f_γ ∈ F_γ,

E_{(x,y)∼D}[ℓ(f_γ(x; G), y)] − L_GraphMix(S, f_γ) ≤ R^ℓ_m(F_γ) + c √(ln(|Γ|/δ) / (2m)) − λV,

where V = E_{S′∼D^m}[inf_{f_γ∈F_γ} L_FCN(S′, f_γ)] − 4c √(ln(|Γ|/δ) / (2m)).

The proof of Theorem 1 is given in Section 3.4. The generalization bound in Theorem 1 becomes a generalization bound for the GNN without GraphMix when λ = 0, as desired. By comparing the generalization bound with λ = 0 (no GraphMix) and λ > 0 (with GraphMix), we can see that GraphMix improves the generalization bound of the underlying GNN when V = E_{S′∼D^m}[inf_{f_γ∈F_γ} L_FCN(S′, f_γ)] − 4c √(ln(|Γ|/δ) / (2m)) > 0. Here, the first term E_{S′∼D^m}[inf_{f_γ∈F_γ} L_FCN(S′, f_γ)] increases as the hypothesis space F_γ gets "smaller" or has less complexity. Thus, GraphMix improves the generalization bound of an underlying GNN when the hyperparameter search over γ ∈ Γ results in an F_γ of moderate complexity such that the first term is greater than 4c √(ln(|Γ|/δ) / (2m)). The first term contains the Manifold Mixup loss L_FCN(S′, f_γ) over random training datasets S′ (≠ S), which is expected to be larger than the corresponding standard loss without Manifold Mixup.

For each fixed F_γ, the generalization bound in Theorem 1 goes to zero as m → ∞, since R^ℓ_m(F_γ) → 0 and V → V_0 ≥ 0 as m → ∞. The training loss is also minimized when the trained model f_γ ∈ F_γ fits well to the particular given training dataset S.

Therefore, given a particular training dataset S, the expected loss is minimized if we conduct a hyperparameter search over γ ∈ Γ such that f_γ ∈ F_γ minimizes the training loss for the given S, while F_γ has moderate complexity so that it cannot minimize the Manifold Mixup losses for other datasets S′ (≠ S) drawn from D. Unlike standard data points, the data points generated during Manifold Mixup in GraphMix are typically not memorizable or interpolated by the FCN.

3.4 Proof of Theorem 1
In the proof, we analyse the effect of changing one sample in Manifold Mixup and in the GNN-generated targets by utilizing the fact that the weights are shared by the GNN and the corresponding FCN. The fact that the weights are shared also yields a generalization bound that relates the generalization of the FCN via Manifold Mixup to the generalization of the GNN.

Proof. Let γ ∈ Γ be fixed. Define ϕ(S) = sup_{f_γ∈F_γ} E_{(x,y)∼D}[ℓ(f_γ(x; G), y)] − L_GraphMix(S, f_γ). We first provide an upper bound on ϕ(S) by using McDiarmid's inequality. To apply McDiarmid's inequality, we compute an upper bound on |ϕ(S) − ϕ(S′)|, where S and S′ are two training datasets differing by exactly one point at an arbitrary index i_0; i.e., S_i = S′_i for all i ≠ i_0 and S_{i_0} ≠ S′_{i_0}.

ϕ(S′) − ϕ(S) ≤ sup_{f_γ∈F_γ} [L_GraphMix(S, f_γ) − L_GraphMix(S′, f_γ)]
             = sup_{f_γ∈F_γ} [(L_GNN(S, f_γ) − L_GNN(S′, f_γ)) + λ(L_FCN(S, f_γ) − L_FCN(S′, f_γ))],

where the first line follows from the property of the supremum, sup(a) − sup(b) ≤ sup(a − b), and the second line follows from the definition of L_GraphMix.

For the first term,

L_GNN(S, f_γ) − L_GNN(S′, f_γ) = (1/m) (ℓ(f_γ(x_{i_0}; G), y_{i_0}) − ℓ(f_γ(x′_{i_0}; G), y′_{i_0})) ≤ c/m,

where we used the fact that, given a fixed G and a fixed f_γ, ℓ(f_γ(x_i; G), y_i) = ℓ(f_γ(x′_i; G), y′_i) for i ≠ i_0. This holds since f_γ(·; G) does not depend on S given a G and an f_γ, even though f_γ(x_i; G) contains the aggregation functions over the graph G.

For the second term,

L_MM(S, f_γ(·; G_0), α) − L_MM(S′, f_γ(·; G_0), α) ≤ c(2m − 1)/m² ≤ 2c/m,

where we use the fact that L_MM(S, f_γ(·; G_0), α) has m² terms, of which 2m − 1 terms differ for S and S′, each bounded by the constant c. Similarly, L_MM((X^S_u, Ŷ^S_u), f_γ(·; G_0), α) − L_MM((X^{S′}_u, Ŷ^{S′}_u), f_γ(·; G_0), α) ≤ 2c/n_u, since the labels Ŷ^S_u and Ŷ^{S′}_u are determined by f_γ(x_i; G), and f_γ(x_i; G) = f_γ(x′_i; G) for i ≠ i_0, given a fixed G and a fixed f_γ. Therefore,

L_FCN(S, f_γ) − L_FCN(S′, f_γ) ≤ 4c/m.

Using these upper bounds,

ϕ(S′) − ϕ(S) ≤ c(1 + 4λ)/m.

Similarly, ϕ(S) − ϕ(S′) ≤ c(1 + 4λ)/m. Thus, by McDiarmid's inequality, for any δ > 0, with probability at least 1 − δ/|Γ|,

ϕ(S) ≤ E_S[ϕ(S)] + c(1 + 4λ) √(ln(|Γ|/δ) / (2m)).

Moreover,

E_S[ϕ(S)] + λ E_S[inf_{f_γ∈F_γ} L_FCN(S, f_γ)]
  ≤ E_S[sup_{f_γ∈F_γ} E_{(x′,y′)∼D}[ℓ(f_γ(x′; G), y′)] − L_GNN(S, f_γ)]
  ≤ E_{S,S′}[sup_{f_γ∈F_γ} (1/m) Σ_{i=1}^{m} (ℓ(f_γ(x′_i; G), y′_i) − ℓ(f_γ(x_i; G), y_i))]
  ≤ E_{ξ,S,S′}[sup_{f_γ∈F_γ} (1/m) Σ_{i=1}^{m} ξ_i (ℓ(f_γ(x′_i; G), y′_i) − ℓ(f_γ(x_i; G), y_i))]
  ≤ 2 R^ℓ_m(F_γ),

where the second line and the last line use the subadditivity of the supremum, the third line uses Jensen's inequality and the convexity of the supremum, and the fourth line follows from the fact that, for each ξ_i ∈ {−1, +1}, the distribution of each term ξ_i(ℓ(f_γ(x′_i; G), y′_i) − ℓ(f_γ(x_i; G), y_i)) is the same as the distribution of (ℓ(f_γ(x′_i; G), y′_i) − ℓ(f_γ(x_i; G), y_i)), since S and S′ are drawn i.i.d. from the same distribution.

Therefore, for any δ > 0, with probability at least 1 − δ/|Γ|, for all f_γ ∈ F_γ,

E_{(x,y)∼D}[ℓ(f_γ(x; G), y)] − L_GraphMix(S, f_γ) ≤ R^ℓ_m(F_γ) + c √(ln(|Γ|/δ) / (2m)) − λV.

Taking the union bound over all γ ∈ Γ yields the statement of the theorem.

4 Experiments
We present results for the GraphMix algorithm using standard benchmark datasets and standard architectures in Sections 4.1 and 4.3. We also conduct an ablation study on GraphMix in Section A.5 to understand the contribution of its various components to its performance. Refer to Appendix A.4 for dataset details and Appendix A.8 for implementation and hyperparameter details.

4.1 Semi-supervised Node Classification
For baselines, we choose GCN (Kipf and Welling 2017) and the recent state-of-the-art graph neural networks GAT (Veličković et al. 2018), GMNN (Qu, Bengio, and Tang 2019) and Graph U-Net (Gao and Ji 2019), as well as recently proposed regularizers for graph neural networks. GraphMix (GCN), GraphMix (GAT) and GraphMix (Graph U-Net) refer to the methods where the underlying GNNs are GCN, GAT and Graph U-Net, respectively. Refer to Appendix A.8 for implementation and hyperparameter details.

Table 1: Results of node classification (% test accuracy) on the standard split of the datasets. [*] means the results are taken from the corresponding papers. We conduct 100 trials and report the mean and standard deviation over the trials (refer to Table 8 in the Appendix for a comparison with other methods on the standard train/validation/test split).

Algorithm                  Cora          Citeseer      Pubmed
GCN                        81.30±0.66    70.61±0.22    79.86±0.34
GAT                        82.70±0.21    70.40±0.35    79.05±0.64
Graph U-Net                81.74±0.54    67.69±1.10    77.73±0.98
BVAT*                      83.6±0.5      74.0±0.6      79.9±0.4
DropEdge*                  82.8          72.3          79.6
GraphSGAN*                 83.0±1.3      73.1±1.8      -
GraphAT*                   82.5          73.4          -
GraphVAT*                  82.6          73.7          -
GraphMix (GCN)             83.94±0.57    74.72±0.59    80.98±0.55
GraphMix (GAT)             83.32±0.18    73.08±0.23    81.10±0.78
GraphMix (Graph U-Net)     82.47±0.76    69.31±1.52    78.91±1.25

Table 2: Results of node classification (% test accuracy) using 10 random train/validation/test splits of the datasets. We conduct 100 trials and report the mean and standard deviation over the trials.

Algorithm                  Cora          Citeseer      Pubmed
GCN                        77.84±1.45    72.56±2.46    78.74±0.99
GAT                        77.74±1.86    70.41±1.81    78.48±0.96
Graph U-Net                77.59±1.60    67.55±0.69    76.79±2.45
GraphMix (GCN)             82.07±1.17    76.45±1.57    80.72±1.08
GraphMix (GAT)             80.63±1.31    74.08±1.26    80.14±1.51
GraphMix (Graph U-Net)     80.18±1.62    72.85±1.71    78.47±0.64

(Shchur et al. 2018) demonstrated that the performance of the current state-of-the-art graph neural networks on the standard train/validation/test split of the popular benchmark datasets (such as Cora, Citeseer and Pubmed) is significantly different from their performance on random splits. For fair evaluation, they recommend using multiple random partitions of the datasets. Along these lines, we created 10 random splits of Cora, Citeseer and Pubmed with the same number of train/validation/test samples as in the standard split. We also provide results for the standard train/validation/test split (Table 1); the results for the random splits are presented in Table 2. We observe that GraphMix always improves the accuracy of the underlying GNNs, such as GCN, GAT and Graph U-Net, across all the datasets, with GraphMix (GCN) achieving the best results. We further present results with fewer labeled samples in Appendix A.7. We observe that the relative increase in test accuracy from GraphMix over the baseline GNN is more pronounced when fewer labeled samples are available. This makes GraphMix particularly appealing for problems with very few labeled samples.

4.2 Results on Larger Datasets
We also provide results on three recently proposed datasets that are relatively larger than the standard benchmark datasets (Cora/Citeseer/Pubmed). We use the Cora-Full dataset proposed in (Bojchevski and Günnemann 2018) and the Coauthor-CS and Coauthor-Physics datasets proposed in (Shchur et al. 2018). The results are presented in Table 3.²

² We do not provide results for GAT-based experiments in Table 3 and Table 5 because we ran out of the GPU memory required to run these experiments on the larger (higher number of nodes) datasets.

Table 3: Comparison of GraphMix with other methods (% test accuracy) on Cora-Full, Coauthor-CS and Coauthor-Physics. ∗ refers to the results reported in (Shchur et al. 2018).

Algorithm                  Cora-Full     Coauthor-CS   Coauthor-Physics
GCN*                       62.2±0.6      91.1±0.5      92.8±1.0
GAT*                       51.9±1.5      90.5±0.6      92.5±0.9
MoNet*                     59.8±0.8      90.8±0.6      92.5±0.9
GS-Mean*                   58.6±1.6      91.3±2.8      93.0±0.8
GCN                        60.13±0.57    91.27±0.56    92.90±0.92
Graph U-Net                59.82±0.39    90.89±0.43    92.57±0.81
GraphMix (GCN)             61.80±0.54    91.83±0.51    94.49±0.84
GraphMix (Graph U-Net)     60.92±0.51    91.44±0.46    93.78±0.79

We observe that GraphMix (GCN) improves the results over GCN on all three datasets by a significant margin. We note that we did only a minimal hyperparameter search for GraphMix (GCN), as mentioned in Appendix A.8. The details of the datasets are given in Appendix A.4.

Table 4: Ablation study results using 10 labeled samples per class (% test accuracy). We report the mean and standard deviation over ten trials. See Section A.5 for the meaning of the methods in the leftmost column.

Ablation                                              Cora          Citeseer      Pubmed
Without Manifold Mixup and without predicted targets  68.78±3.54    61.01±1.24    72.56±1.08
With predicted targets                                69.08±5.03    62.66±1.80    73.07±0.94
With Manifold Mixup                                   73.55±3.29    65.72±2.10    75.74±1.69
With Manifold Mixup and predicted targets             79.30±1.36    70.78±1.41    77.13±3.60

4.3 Semi-supervised Link Classification
In the semi-supervised link classification problem, the task is to predict the labels of the remaining links, given a graph and the labels of a few links. Following (Taskar et al. 2004), we can formulate the link classification problem as a node classification problem: given an original graph G, we construct a dual graph G′ whose node set V′ corresponds to the link set E′ of the original graph. Two nodes in the dual graph G′ are connected if their corresponding links in the graph G share a node. The attributes of a node in the dual graph are defined as the indices of the nodes of the corresponding link in the original graph. Using this formulation, we present results on link classification on the Bit OTC and Bit Alpha benchmark datasets in Table 5. As the numbers of positive and negative edges are strongly imbalanced, we report the F1 score.
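The dual-graph construction can be sketched as follows; this is an illustrative helper (the function name and return format are hypothetical), not the preprocessing code used in the paper.

```python
from collections import defaultdict
from itertools import combinations

def build_dual_graph(edges):
    """Dual graph G': one node per original edge, with the edge's endpoint indices
    as its attributes; two dual nodes are connected iff their edges share a node in G."""
    dual_features = {i: (u, v) for i, (u, v) in enumerate(edges)}
    incident = defaultdict(list)                 # original node -> ids of incident edges
    for i, (u, v) in enumerate(edges):
        incident[u].append(i)
        incident[v].append(i)
    dual_edges = set()
    for edge_ids in incident.values():
        for a, b in combinations(edge_ids, 2):   # every pair of edges meeting at this node
            dual_edges.add((min(a, b), max(a, b)))
    return dual_features, sorted(dual_edges)

# Example: a triangle plus one pendant edge.
features, dual_edges = build_dual_graph([(0, 1), (1, 2), (0, 2), (2, 3)])
```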


Table 5: Results on link classification (% F1 score). ∗ means the results are taken from the corresponding papers.

Algorithm                                       Bit OTC       Bit Alpha
DeepWalk (Perozzi, Al-Rfou, and Skiena 2014)    63.20         62.71
GMNN* (Qu, Bengio, and Tang 2019)               66.93         65.86
GCN                                             65.72±0.38    64.00±0.19
GraphMix (GCN)                                  66.35±0.41    65.34±0.19

Figure 2: Two-dimensional representation of the hidden states of the Citeseer dataset using (a) GCN, (b) GCN (self-training), (c) GraphMix (GCN), and (d) the class-specific Soft-Rank. GraphMix (GCN) learns better-separated representations.

Our results show that GraphMix (GCN) improves the performance over the baseline GCN method and is comparable with the recently proposed state-of-the-art method GMNN (Qu, Bengio, and Tang 2019) on both datasets.

4.4 Visualization of the Learned Features
In Figure 2, we present an analysis of the features learned by GraphMix for the Cora dataset using a t-SNE (van der Maaten and Hinton 2008) based 2D visualization of the hidden states. We observe that GraphMix learns hidden states that are better separated and more condensed than those of GCN and GCN (self-training). Here, GCN (self-training) refers to training a GCN in the normal way but with additional self-prediction-based targets for the unlabeled samples. We further evaluate the Soft-Rank (refer to Appendix A.10) of the class-specific hidden states to demonstrate that GraphMix (GCN) makes the class-specific hidden states more concentrated than GCN and GCN (self-training), as shown in Figure 2d. Refer to Appendix A.11 for 2D representations of the other datasets.

5 Related Work
• Semi-supervised Learning over Graph Data: There exists a long line of work on semi-supervised learning over graph data (Zhu and Ghahramani 2002; Zhu, Ghahramani, and Lafferty 2003). Contrary to previous methods, the recent GNN-based approaches define convolutional operators using the neighbourhood information of the nodes (Kipf and Welling 2017; Veličković et al. 2018). These convolution-operator-based methods exhibit state-of-the-art results for semi-supervised learning over graph data, hence much of the recent attention is dedicated to proposing architectural changes to these methods (Qu, Bengio, and Tang 2019; Gao and Ji 2019; Ma et al. 2019). Unlike these methods, we propose a regularization technique that can be applied to any of these GNNs that uses a parameterized transformation of the node features.
• Data Augmentation: It is well known that the generalization of a learning algorithm can be improved by enlarging the training data size. Since labeling more samples is labour-intensive and costly, data augmentation has become a popular technique for enlarging the training data size. Mixing-based algorithms are a particular class of data augmentation methods in which additional training samples are generated by interpolating the samples (either in the input or the hidden space) and/or their corresponding targets. Mixup (Zhang et al. 2018), Manifold Mixup (Verma et al. 2019a), AMR (Beckham et al. 2019) and ISD (Jeong et al. 2020) are notable examples of this class of algorithms. Unlike these methods, which have been proposed for fixed-topology datasets such as images, we propose interpolation-based data augmentation for graph-structured data.
• Regularizing Graph Neural Networks: Regularizing graph neural networks has drawn some attention recently. GraphSGAN (Ding, Tang, and Zhang 2018) first uses an embedding method such as DeepWalk (Perozzi, Al-Rfou, and Skiena 2014) and then trains generator-classifier networks in an adversarial learning setting to generate fake samples in the low-density regions between sub-graphs. BVAT (Deng, Dong, and Zhu 2019) and (Feng et al. 2019) generate adversarial perturbations to the features of the nodes while taking the graph structure into account. These methods require significant additional computation: GraphSGAN requires computing node embeddings as a preprocessing step, and BVAT and (Feng et al. 2019) require additional gradient computations for the adversarial perturbations. Unlike these methods, GraphMix does not introduce any significant additional computation, since it is based on interpolation-based techniques and self-generated targets.

6 Discussion
GraphMix is a simple and efficient regularizer for semi-supervised node classification using graph neural networks. Our extensive experiments demonstrate state-of-the-art performance using GraphMix on benchmark datasets. Our theoretical analysis compares the generalization bounds of GraphMix and the underlying GNNs. The strong empirical results of GraphMix suggest that, in parallel to designing new architectures, exploring better regularization for graph-structured data is a promising avenue for research. A future research direction is to jointly model the node features and the edges of the graph such that they can be further used for generating synthetic interpolated nodes and their corresponding connectivity to the other nodes in the graph.

References

Bartlett, P. L.; and Mendelson, S. 2002. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research 3(Nov): 463–482.

Beckham, C.; Honari, S.; Verma, V.; Lamb, A.; Ghadiri, F.; Devon Hjelm, R.; Bengio, Y.; and Pal, C. 2019. On Adversarial Mixup Resynthesis. arXiv e-prints arXiv:1903.02709.

Belkin, M.; Niyogi, P.; and Sindhwani, V. 2006. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. J. Mach. Learn. Res. 7: 2399–2434. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1248547.1248632.

Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; and Raffel, C. 2019. MixMatch: A Holistic Approach to Semi-Supervised Learning. arXiv e-prints arXiv:1905.02249.

Blum, A.; and Mitchell, T. 1998. Combining Labeled and Unlabeled Data with Co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT '98, 92–100. New York, NY, USA: ACM. ISBN 1-58113-057-0. doi:10.1145/279943.279962. URL http://doi.acm.org/10.1145/279943.279962.

Bojchevski, A.; and Günnemann, S. 2018. Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking. In International Conference on Learning Representations. URL https://openreview.net/forum?id=r1ZdKJ-0W.

Bruna, J.; Zaremba, W.; Szlam, A.; and LeCun, Y. 2013. Spectral Networks and Locally Connected Networks on Graphs. CoRR abs/1312.6203.

Chapelle, O.; Schölkopf, B.; and Zien, A. 2010. Semi-Supervised Learning. The MIT Press, 1st edition. ISBN 0262514125, 9780262514125.

Defferrard, M.; Bresson, X.; and Vandergheynst, P. 2016. Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering. In Lee, D. D.; Sugiyama, M.; Luxburg, U. V.; Guyon, I.; and Garnett, R., eds., Advances in Neural Information Processing Systems 29, 3844–3852.

Deng, Z.; Dong, Y.; and Zhu, J. 2019. Batch Virtual Adversarial Training for Graph Convolutional Networks. CoRR abs/1902.09192. URL http://arxiv.org/abs/1902.09192.

Devries, T.; and Taylor, G. W. 2017. Improved Regularization of Convolutional Neural Networks with Cutout. CoRR abs/1708.04552. URL http://arxiv.org/abs/1708.04552.

Ding, M.; Tang, J.; and Zhang, J. 2018. Semi-supervised Learning on Graphs with Generative Adversarial Nets. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM '18, 913–922. New York, NY, USA: ACM. ISBN 978-1-4503-6014-2. doi:10.1145/3269206.3271768. URL http://doi.acm.org/10.1145/3269206.3271768.

Feng, F.; He, X.; Tang, J.; and Chua, T. 2019. Graph Adversarial Training: Dynamically Regularizing Based on Graph Structure. CoRR abs/1902.08226. URL http://arxiv.org/abs/1902.08226.

Gao, H.; and Ji, S. 2019. Graph U-Nets. In Chaudhuri, K.; and Salakhutdinov, R., eds., Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, 2083–2092. Long Beach, California, USA: PMLR. URL http://proceedings.mlr.press/v97/gao19a.html.

Gilmer, J.; Schoenholz, S. S.; Riley, P. F.; Vinyals, O.; and Dahl, G. E. 2017. Neural Message Passing for Quantum Chemistry. In ICML.

Gori, M.; Monfardini, G.; and Scarselli, F. 2005. A new model for learning in graph domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, volume 2, 729–734. IEEE.

Grandvalet, Y.; and Bengio, Y. 2005. Semi-supervised Learning by Entropy Minimization. In Saul, L. K.; Weiss, Y.; and Bottou, L., eds., Advances in Neural Information Processing Systems 17, 529–536.

Hamilton, W.; Ying, Z.; and Leskovec, J. 2017. Inductive representation learning on large graphs. In NIPS.

Henaff, M.; Bruna, J.; and LeCun, Y. 2015. Deep Convolutional Networks on Graph-Structured Data. ArXiv abs/1506.05163.

Hjelm, R. D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; and Bengio, Y. 2019. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations. URL https://openreview.net/forum?id=Bklr3j0cKX.

Jeong, J.; Verma, V.; Hyun, M.; Kannala, J.; and Kwak, N. 2020. Interpolation-based semi-supervised learning for object detection.

Kipf, T. N.; and Welling, M. 2016. Variational graph auto-encoders. arXiv preprint arXiv:1611.07308.

Kipf, T. N.; and Welling, M. 2017. Semi-supervised classification with graph convolutional networks. In ICLR.

Ko, T.; Peddinti, V.; Povey, D.; and Khudanpur, S. 2015. Audio augmentation for speech recognition. In INTERSPEECH.

Kumar, S.; Hooi, B.; Makhija, D.; Kumar, M.; Faloutsos, C.; and Subrahmanian, V. 2018. Rev2: Fraudulent user prediction in rating platforms. In WSDM.

Kumar, S.; Spezzano, F.; Subrahmanian, V.; and Faloutsos, C. 2016. Edge weight prediction in weighted signed networks. In ICDM.

Laine, S.; and Aila, T. 2016. Temporal Ensembling for Semi-Supervised Learning. CoRR abs/1610.02242. URL http://arxiv.org/abs/1610.02242.

Lee, D.-H. 2013. Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks.

Li, Q.; Han, Z.; and Wu, X.-M. 2018. Deeper Insights into Graph Convolutional Networks for Semi-Supervised Learning. In AAAI.

Lu, Q.; and Getoor, L. 2003. Link-based Classification. In Proceedings of the Twentieth International Conference on Machine Learning, ICML '03, 496–503. AAAI Press. ISBN 1-57735-189-4. URL http://dl.acm.org/citation.cfm?id=3041838.3041901.

Ma, J.; Cui, P.; Kuang, K.; Wang, X.; and Zhu, W. 2019. Disentangled Graph Convolutional Networks. In ICML.

Miyato, T.; Maeda, S.-i.; Koyama, M.; and Ishii, S. 2018. Virtual Adversarial Training: A Regularization Method for Supervised and Semi-supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Monti, F.; Boscaini, D.; Masci, J.; Rodola, E.; Svoboda, J.; and Bronstein, M. M. 2016. Geometric deep learning on graphs and manifolds using mixture model CNNs. CoRR abs/1611.08402. URL http://arxiv.org/abs/1611.08402.

Park, D. S.; Chan, W.; Zhang, Y.; Chiu, C.-C.; Zoph, B.; Cubuk, E. D.; and Le, Q. V. 2019. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. arXiv e-prints arXiv:1904.08779.

Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. DeepWalk: Online learning of social representations. In KDD.

Qu, M.; Bengio, Y.; and Tang, J. 2019. GMNN: Graph Markov Neural Networks. In Chaudhuri, K.; and Salakhutdinov, R., eds., Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, 5241–5250. Long Beach, California, USA: PMLR.

Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2009. The Graph Neural Network Model. IEEE Transactions on Neural Networks 20(1): 61–80. ISSN 1045-9227. doi:10.1109/TNN.2008.2005605. URL http://dx.doi.org/10.1109/TNN.2008.2005605.

Shchur, O.; Mumme, M.; Bojchevski, A.; and Günnemann, S. 2018. Pitfalls of Graph Neural Network Evaluation. CoRR abs/1811.05868. URL http://arxiv.org/abs/1811.05868.

Sun, F.-Y.; Hoffman, J.; Verma, V.; and Tang, J. 2020. InfoGraph: Unsupervised and Semi-supervised Graph-Level Representation Learning via Mutual Information Maximization. In International Conference on Learning Representations. URL https://openreview.net/forum?id=r1lfF2NYvH.

Tarvainen, A.; and Valpola, H. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems 30, 1195–1204.

Taskar, B.; Wong, M.-F.; Abbeel, P.; and Koller, D. 2004. Link prediction in relational data. In NIPS.

van der Maaten, L.; and Hinton, G. 2008. Visualizing Data using t-SNE. Journal of Machine Learning Research 9: 2579–2605. URL http://www.jmlr.org/papers/v9/vandermaaten08a.html.

Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Liò, P.; and Bengio, Y. 2018. Graph Attention Networks. In ICLR.

Veličković, P.; Fedus, W.; Hamilton, W. L.; Liò, P.; Bengio, Y.; and Hjelm, R. D. 2019. Deep Graph Infomax. In ICLR.

Verma, S.; and Zhang, Z.-L. 2019. Stability and generalization of graph convolutional neural networks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1539–1548.

Verma, V.; Lamb, A.; Beckham, C.; Najafi, A.; Mitliagkas, I.; Lopez-Paz, D.; and Bengio, Y. 2019a. Manifold Mixup: Better Representations by Interpolating Hidden States. In Chaudhuri, K.; and Salakhutdinov, R., eds., Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, 6438–6447. Long Beach, California, USA: PMLR. URL http://proceedings.mlr.press/v97/verma19a.html.

Verma, V.; Lamb, A.; Juho, K.; Bengio, Y.; and Lopez-Paz, D. 2019b. Interpolation Consistency Training for Semi-supervised Learning. In Kraus, S., ed., Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10–16, 2019. ijcai.org. doi:10.24963/ijcai.2019. URL https://doi.org/10.24963/ijcai.2019.

Weston, J.; Ratle, F.; Mobahi, H.; and Collobert, R. 2012. Deep Learning via Semi-Supervised Embedding. In Montavon, G.; Orr, G.; and Müller, K. R., eds., Neural Networks: Tricks of the Trade. Springer, second edition.

Xie, Z.; Wang, S. I.; Li, J.; Lévy, D.; Nie, A.; Jurafsky, D.; and Ng, A. Y. 2017. Data Noising as Smoothing in Neural Network Language Models. ArXiv abs/1703.02573.

Xu, K.; Li, C.; Tian, Y.; Sonobe, T.; Kawarabayashi, K.-i.; and Jegelka, S. 2018. Representation Learning on Graphs with Jumping Knowledge Networks. In Dy, J.; and Krause, A., eds., Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, 5453–5462. Stockholmsmässan, Stockholm, Sweden: PMLR. URL http://proceedings.mlr.press/v80/xu18c.html.

Yang, Z.; Cohen, W.; and Salakhudinov, R. 2016. Revisiting Semi-Supervised Learning with Graph Embeddings. In ICML.

Zhang, H.; Cisse, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2018. mixup: Beyond Empirical Risk Minimization. In International Conference on Learning Representations. URL https://openreview.net/forum?id=r1Ddp1-Rb.

Zhou, J.; Cui, G.; Zhang, Z.; Yang, C.; Liu, Z.; and Sun, M. 2018. Graph Neural Networks: A Review of Methods and Applications. CoRR abs/1812.08434. URL http://arxiv.org/abs/1812.08434.

Zhu, X.; and Ghahramani, Z. 2002. Learning from Labeled and Unlabeled Data with Label Propagation. Technical report.

Zhu, X.; Ghahramani, Z.; and Lafferty, J. D. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML.

A Appendix
A.1 Accurate Target Prediction for Unlabeled Data
A recently proposed method for accurate target prediction for unlabeled data uses the average of predicted targets across K random augmentations of the input sample (Berthelot et al. 2019). Along these lines, in GraphMix we compute the predicted targets as the average of the predictions made by the GNN on K dropout versions of the input sample.

Many recent semi-supervised learning algorithms (Laine and Aila 2016; Miyato et al. 2018; Tarvainen and Valpola 2017; Verma et al. 2019b) are based on the cluster assumption (Chapelle, Schölkopf, and Zien 2010), which posits that the class boundary should pass through the low-density regions of the marginal data distribution. One way to enforce this assumption is to explicitly minimize the entropy of the model's predictions p(y|x, θ) on unlabeled data by adding an extra loss term to the original loss (Grandvalet and Bengio 2005). Entropy minimization can also be achieved implicitly by modifying the model's predictions on the unlabeled data such that the predictions have low entropy, and using these low-entropy predictions as targets for the further training of the model. Examples include "Pseudolabels" (Lee 2013) and "Sharpening" (Berthelot et al. 2019). Pseudolabeling constructs hard (one-hot) labels for the unlabeled samples that have "high-confidence predictions". Since many of the unlabeled samples may have "low-confidence predictions", they cannot be used in the Pseudolabeling technique. On the other hand, Sharpening does not require "high-confidence predictions", and thus it can be used for all the unlabeled samples. Hence, in this work, we use Sharpening for entropy minimization. The Sharpening function over the model prediction p(y|x, θ) can be formally defined as follows (Berthelot et al. 2019), where T is the temperature hyperparameter and C is the number of classes:

Sharpen(p_i, T) := p_i^{1/T} / Σ_{j=1}^{C} p_j^{1/T}    (8)
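A small sketch of Eq. 8 together with the K-perturbation averaging described above is given below; the helper names are hypothetical, and the caller is assumed to keep dropout active (model.train()) so that each forward pass is a different random perturbation.

```python
import torch

def sharpen(p, T=0.5):
    """Eq. (8): Sharpen(p, T)_i = p_i^{1/T} / sum_j p_j^{1/T}; lower T means lower entropy."""
    p_t = p.pow(1.0 / T)
    return p_t / p_t.sum(dim=-1, keepdim=True)

@torch.no_grad()
def predicted_targets(gnn_forward, x, adj, K=10, T=0.5):
    """Average the GNN's softmax outputs over K stochastic (dropout) forward passes,
    then sharpen the averaged distribution to obtain low-entropy targets."""
    probs = torch.stack([torch.softmax(gnn_forward(x, adj), dim=1) for _ in range(K)])
    return sharpen(probs.mean(dim=0), T)
```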

A.2 Connection to Co-training
The GraphMix approach can be seen as a special instance of the Co-training framework (Blum and Mitchell 1998). Co-training assumes that the description of an example can be partitioned into two distinct views, either of which would be sufficient for learning given sufficient labeled data. In this framework, two learning algorithms are trained separately on each view, and then the predictions of each learning algorithm on the unlabeled data are used to enlarge the training set of the other. Our method has some important differences from and similarities to the Co-training framework. Similar to Co-training, we train two neural networks, and the predictions from the GNN are used to enlarge the training set of the FCN. An important difference is that, instead of using the predictions from the FCN to enlarge the training set of the GNN, we employ parameter sharing for passing the learned information from the FCN to the GNN. In our experiments, directly using the predictions of the FCN for GNN training resulted in reduced accuracy.

Algorithm 1: GraphMix — a procedure for improved training of Graph Neural Networks (GNNs)

1: Input: a GCN G_θ(X, A); an FCN F_θ(X) that shares parameters with the GCN; Beta distribution parameter α for Manifold Mixup; number of random perturbations K; sharpening temperature T; maximum value γ of the weight in the weighted averaging of the supervised and unsupervised FCN losses; number of parameter updates N; ramp-up function w(t) for increasing the importance of the unsupervised loss in FCN training. (X_l, Y_l) denotes the labeled samples and X_u the unlabeled samples.
2: for t = 1 to N do
3:   i = random(0, 1)  // randomly pick 0 or 1
4:   if i = 0 then
5:     L_supervised = L_MM((X_l, Y_l), F_θ, α)  // supervised FCN loss using Manifold Mixup
6:     for k = 1 to K do
7:       X_{u,k} = RandomPerturbations(X_u)  // apply the k-th round of random perturbation to X_u
8:     end for
9:     Ŷ_u = (1/K) Σ_k g(Y | X_{u,k}; θ, A)  // average predictions across the K perturbations of X_u using the GCN
10:    Ŷ_u = Sharpen(Ŷ_u, T)  // apply temperature sharpening to the averaged prediction
11:    L_unsupervised = L_MM((X_u, Ŷ_u), F_θ, α)  // unsupervised FCN loss using Manifold Mixup
12:    L = L_supervised + w(t) · L_unsupervised  // weighted sum of the supervised and unsupervised FCN losses
13:  else
14:    L = ℓ(G_θ(X_l), Y_l)  // loss of the vanilla GCN
15:  end if
16:  Minimize L using a gradient-descent-based optimizer such as SGD.
17: end for

This is due to the fact that the number of labeled samples for training the FCN is small, and hence the FCN does not make accurate enough predictions. Another important difference is that, unlike the Co-training framework, the FCN and the GNN do not use completely distinct views of the data: the FCN uses the feature vectors X, whereas the GNN uses both the feature vectors and the adjacency matrix (X, A).

A.3 Algorithm
The procedure for GraphMix training is given in Algorithm 1.
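For illustration only, the sketch below translates Algorithm 1 into a PyTorch-style training loop, reusing the hypothetical helpers sketched earlier in this document (manifold_mixup_loss, predicted_targets, ramp_up_weight, and the SharedFCNGNN module); it omits dropout and other details of the released implementation.

```python
import random
import torch
import torch.nn.functional as F

def train_graphmix(model, optimizer, x, adj, y, train_mask, unlabeled_mask,
                   steps=2000, alpha=1.0, K=10, T=0.5):
    """Alternate between (i) an FCN update combining Manifold Mixup on labeled nodes
    with GNN-predicted targets on unlabeled nodes, and (ii) a vanilla GNN update."""
    y_onehot = F.one_hot(y, num_classes=int(y.max()) + 1).float()
    g = lambda z: torch.relu(model.W1(z))            # shared hidden-state map of the FCN
    for t in range(steps):
        model.train()
        if random.random() < 0.5:                    # FCN step (lines 5-12 of Algorithm 1)
            sup = manifold_mixup_loss(g, model.W2, x[train_mask],
                                      y_onehot[train_mask], alpha)
            y_hat = predicted_targets(model.gnn_forward, x, adj, K, T)
            unsup = manifold_mixup_loss(g, model.W2, x[unlabeled_mask],
                                        y_hat[unlabeled_mask], alpha)
            loss = sup + ramp_up_weight(t) * unsup
        else:                                        # GNN step (line 14 of Algorithm 1)
            logits = model.gnn_forward(x, adj)
            loss = F.cross_entropy(logits[train_mask], y[train_mask])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```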

A.4 Datasets
We use three standard benchmark citation network datasets for semi-supervised node classification, namely Cora, Citeseer and Pubmed. In all these datasets, nodes correspond to documents and edges correspond to citations. Node features correspond to the bag-of-words representation of the document. Each node belongs to one of C classes. During training, the algorithm has access to the feature vectors and edge connectivity of all the nodes, but has access to the class labels of only a few of the nodes.

For semi-supervised link classification, we use two datasets, Bitcoin Alpha and Bitcoin OTC (Kumar et al. 2016, 2018). The nodes in these datasets correspond to Bitcoin users, and the edge weights between them correspond to the degree of trust between the users. Following (Qu, Bengio, and Tang 2019), we treat edges with weights greater than 3 as positive instances and edges with weights less than -3 as negative ones. Given a few labeled edges, the task is to predict the labels of the remaining edges.

The statistics of the standard benchmark datasets, as well as the number of training/validation/test nodes, are presented in Table 6. The statistics of the larger datasets are given in Table 7.

Table 6: Dataset statistics.

Dataset          # Nodes   # Edges   # Features   # Classes   # Training   # Validation   # Test
Cora               2,708     5,429        1,433           7          140            500     1,000
Citeseer           3,327     4,732        3,703           6          120            500     1,000
Pubmed            19,717    44,338          500           3           60            500     1,000
Bitcoin Alpha      3,783    24,186        3,783           2          100            500     3,221
Bitcoin OTC        5,881    35,592        5,881           2          100            500     5,947

For the larger datasets of Section 4.2, we took the processed versions of the datasets available at https://github.com/shchur/gnn-benchmark. We made 10 random splits of the data into train/validation/test sets. For the classes with more than 100 samples, we chose 20 samples per class for training, 30 samples per class for validation, and the remaining samples as test data. For the classes with fewer than 100 samples, we chose 20% of the samples per class for training, 30% for validation, and the remaining samples for testing. For each split, we ran experiments with 100 random seeds. The statistics of these datasets are presented in Table 7.
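The per-class split rule above can be summarized with the following short Python sketch; the function name, the use of NumPy, and the example labels are our assumptions, while the 20/30-per-class and 20%/30% thresholds follow the text:

```python
import numpy as np

def per_class_split(labels, rng):
    train_idx, val_idx, test_idx = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        if len(idx) > 100:
            n_train, n_val = 20, 30            # fixed counts for large classes
        else:
            n_train = int(0.2 * len(idx))      # 20% train for small classes
            n_val = int(0.3 * len(idx))        # 30% validation for small classes
        train_idx.extend(idx[:n_train])
        val_idx.extend(idx[n_train:n_train + n_val])
        test_idx.extend(idx[n_train + n_val:])
    return np.array(train_idx), np.array(val_idx), np.array(test_idx)

# Example: one of the 10 random splits with synthetic labels
rng = np.random.default_rng(0)
labels = rng.integers(0, 67, size=18703)        # e.g. Cora-Full: 67 classes, 18703 nodes
tr, va, te = per_class_split(labels, rng)
```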

Table 7: Dataset statistics for the larger datasets.

Dataset             Classes   Features    Nodes      Edges
Cora-Full                67      8,710   18,703     62,421
Coauthor-CS              15      6,805   18,333     81,894
Coauthor-Physics          5      8,415   34,493    247,962
NELL                    210      5,414   65,755    266,144

A.5 Ablation Study
Since GraphMix consists of several components, some of which are shared with the existing semi-supervised learning literature, we study the effect of the various components by systematically removing or adding each one to GraphMix. We measure the effect of the following:

• Removing both Manifold Mixup and the predicted targets from the FCN training.

• Using only Manifold Mixup training for the FCN (no predicted targets for FCN training).

• Using only the predicted targets for the FCN training (no Manifold Mixup in FCN training).


• Using both Manifold Mixup and the predicted targets for FCN training.

The ablation results for semi-supervised node classification are presented in Table 4. We did not do any hyperparameter tuning for the ablation study and used the best-performing hyperparameters found for the results presented in Table 1. We observe that all the components of GraphMix contribute to its performance, with Manifold Mixup training of the FCN possibly contributing the most.

A.6 Comparison with State-of-the-art Methods
We present the comparison of GraphMix with recent state-of-the-art methods as well as earlier methods, using the standard train/validation/test split, in Table 8. We additionally include self-training baselines, in which FCN, GCN, GAT and Graph-U-Net are trained with self-generated targets; these are named FCN (self-training), GCN (self-training), GAT (self-training) and Graph-U-Net (self-training), respectively, in Table 8. For generating the predicted targets in these baselines, we followed the procedure of Appendix A.1.

A.7 Results with Fewer Labeled Samples
We further evaluate the effectiveness of GraphMix in learning regimes with fewer labeled samples. For each class, we randomly sampled K ∈ {5, 10} samples for training and the same number of samples for validation, and used all the remaining labeled samples as the test set. We repeated this process 10 times. The results in Table 9 show that GraphMix achieves even larger improvements when labeled samples are scarce.

A.8 Implementation and Hyperparameter Details
We use the standard benchmark architecture used in GCN (Kipf and Welling 2017), GAT (Velickovic et al. 2018) and GMNN (Qu, Bengio, and Tang 2019), among others. This architecture has one hidden layer, and the graph convolution is applied twice: on the input layer and on the output of the hidden layer. The FCN in GraphMix shares its parameters with the GCN.

GraphMix introduces four additional hyperparameters: the α parameter of the Beta distribution used in Manifold Mixup training of the FCN; the maximum weighting coefficient γ, which controls the trade-off between the supervised loss and the unsupervised loss (the loss computed using the predicted targets) of the FCN; the temperature T used in sharpening; and the number of random perturbations K applied to the input data when averaging the predictions.

We conducted a minimal hyperparameter search over only α and γ, and fixed T and K to 0.1 and 10, respectively. The other hyperparameters were set to the best values for the underlying GNN (GCN or GAT), including the learning rate, the L2 decay rate, the number of units in the hidden layer, etc. We observed that GraphMix is not very sensitive to the values of α and γ, and similar values of these hyperparameters work well across all the benchmark datasets. Refer to the remainder of this section and Appendix A.9 for the hyperparameter values and the procedure used to select the best hyperparameters.


Table 8: Comparison of GraphMix with other methods (% test accuracy) for Cora, Citeseer and Pubmed.

Method                                                        Cora          Citeseer      Pubmed

Results reported from the literature:
MLP                                                           55.1%         46.5%         71.4%
ManiReg (Belkin, Niyogi, and Sindhwani 2006)                  59.5%         60.1%         70.7%
SemiEmb (Weston et al. 2012)                                  59.0%         59.6%         71.7%
LP (Zhu, Ghahramani, and Lafferty 2003)                       68.0%         45.3%         63.0%
DeepWalk (Perozzi, Al-Rfou, and Skiena 2014)                  67.2%         43.2%         65.3%
ICA (Lu and Getoor 2003)                                      75.1%         69.1%         73.9%
Planetoid (Yang, Cohen, and Salakhudinov 2016)                75.7%         64.7%         77.2%
Chebyshev (Defferrard, Bresson, and Vandergheynst 2016)       81.2%         69.8%         74.4%
GCN (Kipf and Welling 2017)                                   81.5%         70.3%         79.0%
MoNet (Monti et al. 2016)                                     81.7±0.5%     —             78.8±0.3%
GAT (Velickovic et al. 2018)                                  83.0±0.7%     72.5±0.7%     79.0±0.3%
GraphScan (Ding, Tang, and Zhang 2018)                        83.3±1.3      73.1±1.8      —
DisenGCN (Ma et al. 2019)                                     83.7%         73.4%         80.5%
Graph U-Net (Gao and Ji 2019)                                 84.4%         73.2%         79.6%
BVAT (Deng, Dong, and Zhu 2019)                               83.6±0.5      74.0±0.6      79.9±0.4

Our experiments:
GCN                                                           81.30±0.66    70.61±0.22    79.86±0.34
GAT                                                           82.70±0.21    70.40±0.35    79.05±0.64
Graph U-Net                                                   81.74±0.54    67.69±1.10    77.73±0.98
FCN (self-training)                                           80.30±0.75    71.50±0.80    77.40±0.37
GCN (self-training)                                           82.03±0.43    73.38±0.35    80.42±0.36
GAT (self-training)                                           82.95±0.23    72.87±0.51    79.67±0.69
Graph-U-Net (self-training)                                   82.01±0.91    68.23±1.57    78.12±1.23

GraphMix (GCN)                                                83.94±0.57    74.52±0.59    80.98±0.55
GraphMix (GAT)                                                83.32±0.18    73.08±0.23    81.10±0.78
GraphMix (Graph U-Net)                                        82.18±0.63    69.00±1.32    78.76±1.09

For results reported in Section 4.1: For GCN and GraphMix(GCN), we used the Adam optimizer with learning rate 0.01 and L2 decay 5e-4; the number of units in the hidden layer was 16, and the dropout rates in the input layer and hidden layer were set to 0.5 and 0.0, respectively. For GAT and GraphMix(GAT), we used the Adam optimizer with learning rate 0.005 and L2 decay 5e-4; the number of units in the hidden layer was 8, and the dropout rates in the input layer and hidden layer were searched over the values {0.2, 0.5, 0.8}.

For α and γ of GraphMix(GCN) and GraphMix(GAT), we searched over the values in the sets [0.0, 0.1, 1.0, 2.0] and [0.1, 1.0, 10.0, 20.0], respectively.

For GraphMix(GCN): α = 1.0 works best across all the datasets; γ = 1.0 works best for Cora and Citeseer, and γ = 10.0 works best for Pubmed.

For GraphMix(GAT): α = 1.0 works best for Cora and Citeseer, and α = 0.1 works best for Pubmed; γ = 1.0 works best for Cora and Citeseer, and γ = 10.0 works best for Pubmed. An input dropout rate of 0.5 and a hidden dropout rate of 0.5 work best for Cora and Citeseer, while an input dropout rate of 0.2 and a hidden dropout rate of 0.2 work best for Pubmed.

We conducted all the experiments for 2000 epochs. The value of the weighting coefficient w(t) in Algorithm 1 is increased from 0 to its maximum value γ from epoch 500 to 1000 using the sigmoid ramp-up of Mean-Teacher (Tarvainen and Valpola 2017).

Hyperparameter details for results in Section 4.2: For all these experiments we used the standard architecture described in Section A.8 and the Adam optimizer with learning rate 0.001 and 64 units in the hidden layer. For Coauthor-CS and Coauthor-Physics, we trained the network for 2000 epochs. For Cora-Full, we trained the network for 5000 epochs because we observed that the training loss on the Cora-Full dataset takes longer to converge.

For Coauthor-CS and Coauthor-Physics: We set the input-layer dropout rate to 0.5 and the weight decay to 0.0005, both for GCN and GraphMix(GCN). We did not conduct any hyperparameter search over the GraphMix hyperparameters α, λmax, temperature T and the number of random perturbations K for GraphMix(GCN) on these two datasets, and set these values to 1.0, 1.0, 0.1 and 10, respectively.

For the Cora-Full dataset: We found an input-layer dropout rate of 0.2 and weight decay of 0.0 to be best for both GCN and GraphMix(GCN). For GraphMix(GCN), we fixed α, the temperature T and the number of random perturbations K to 1.0, 0.1 and 10, respectively. For λmax, we searched over {1.0, 10.0, 20.0} and found that 10.0 works best.


Table 9: Results using fewer labeled samples (% test accuracy). K refers to the number of labeled samples per class.

Algorithm                  Cora                      Citeseer                  Pubmed
                           K = 5        K = 10       K = 5        K = 10       K = 5        K = 10
GCN                        66.39±4.26   72.91±3.10   55.61±5.75   64.19±3.89   66.06±3.85   75.57±1.58
GAT                        68.17±5.54   73.88±4.35   55.54±1.82   61.63±0.42   64.24±4.79   73.60±1.85
Graph U-Net                64.42±5.44   71.48±3.03   49.43±5.81   61.16±3.47   65.05±4.69   68.65±3.69
GraphMix (GCN)             71.99±6.46   79.30±1.36   58.55±2.26   70.78±1.41   67.66±3.90   77.13±3.60
GraphMix (GAT)             72.01±6.68   75.82±2.73   57.60±0.64   62.24±2.90   66.61±3.69   75.96±1.70
GraphMix (Graph U-Net)     66.84±5.10   73.14±3.17   54.39±5.07   64.36±3.48   67.40±5.33   70.43±3.75


For all the GraphMix(GCN) experiments, the value of the weighting coefficient w(t) in Algorithm 1 is increased from 0 to its maximum value γmax from epoch 500 to 1000 using the sigmoid ramp-up of Mean-Teacher (Tarvainen and Valpola 2017).

For results reported in Section 4.3: For α of GraphMix(GCN), we searched over the values in the set [0.0, 0.1, 0.5, 1.0] and found that 0.1 works best for both datasets. For γ, we searched over the values in the set [0.1, 1.0, 10.0] and found that 0.1 works best for both datasets. We conducted all the experiments for 150 epochs. The value of the weighting coefficient w(t) in Algorithm 1 is increased from 0 to its maximum value γ from epoch 75 to 125 using the sigmoid ramp-up of Mean-Teacher (Tarvainen and Valpola 2017).

For both GraphMix(GCN) and GCN, we used the Adam optimizer with learning rate 0.01 and L2 decay 0.0; the number of units in the hidden layer was 128, and the dropout rate in the input layer was set to 0.5.

For results reported in Section A.7: For α of GraphMix(GCN), we searched over the values in the set [0.0, 0.1, 0.5, 1.0] and found that 0.1 works best across all the datasets. For γ, we searched over the values in the set [0.1, 1.0, 10.0] and found that 0.1 and 1.0 work best across all the datasets. The rest of the details for GraphMix(GCN) and GCN are the same as in Section A.8.

A.9 Hyperparameter Selection

For each configuration of hyperparameters, we run the experiments with 100 random seeds. We select the hyperparameter configuration that has the best validation accuracy averaged over these 100 trials. With this best hyperparameter configuration, we train the model again with 100 random seeds and use the validation set for model selection (i.e., we report the test accuracy at the epoch that has the best validation accuracy).
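The following Python sketch makes this two-step protocol explicit; `train_and_eval` is a hypothetical helper (not part of the GraphMix code) that trains one model for a given configuration and seed and returns per-epoch validation and test accuracies:

```python
import numpy as np

# `configs` could be, e.g., [(alpha, gamma) for alpha in [0.0, 0.1, 1.0, 2.0]
#                                          for gamma in [0.1, 1.0, 10.0, 20.0]].
def select_and_report(configs, train_and_eval, n_seeds=100):
    # Step 1: pick the configuration with the best mean validation accuracy over all seeds.
    mean_val = {cfg: np.mean([max(train_and_eval(cfg, s)["val"]) for s in range(n_seeds)])
                for cfg in configs}
    best_cfg = max(mean_val, key=mean_val.get)

    # Step 2: retrain with the best configuration; for each seed, report the test
    # accuracy at the epoch with the best validation accuracy.
    test_accs = []
    for s in range(n_seeds):
        hist = train_and_eval(best_cfg, s)       # {"val": [...], "test": [...]}
        best_epoch = int(np.argmax(hist["val"]))
        test_accs.append(hist["test"][best_epoch])
    return best_cfg, float(np.mean(test_accs)), float(np.std(test_accs))
```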

A.10 Soft-Rank
Let H be a matrix containing the hidden states of all the samples from a particular class. The Soft-Rank of the matrix H is defined as the sum of the singular values of the matrix divided by its largest singular value. A lower Soft-Rank implies fewer dimensions with substantial variability, and it provides a continuous analogue of the notion of rank from matrix algebra. This provides evidence that the concentration of class-specific hidden states observed when using GraphMix in Figure 3 can be measured directly from the hidden states and is not an artifact of the t-SNE visualization.
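As a concrete reference, a minimal NumPy sketch of this quantity (the function name and example data are ours) is:

```python
import numpy as np

def soft_rank(H):
    # Soft-Rank of a class-wise hidden-state matrix H:
    # sum of singular values divided by the largest singular value.
    s = np.linalg.svd(H, compute_uv=False)
    return float(s.sum() / s.max())

H = np.random.randn(200, 16)   # hypothetical hidden states of one class
print(soft_rank(H))            # well above 1 for isotropic noise; lower when states concentrate
```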

A.11 Feature Visualization
We present the 2D visualization of the hidden states learned using GCN and GraphMix(GCN) for the Cora, Pubmed and Citeseer datasets in Figure 3. We observe that for Cora and Citeseer, GraphMix learns substantially better hidden states than GCN. For Pubmed, although there is no clear separation between classes, the "Green" and "Red" classes overlap less with GraphMix, resulting in better hidden states.

A.12 Joint Optimization vs. Alternate Optimization

In this section, we discuss the effect of the hyperparameter λ, which is used to compute the weighted sum of the GCN and FCN losses in Eq. 7. In Figure 4, we see that a wide range of λ (from 0.1 to 1.0) achieves better validation accuracy than the vanilla GCN. Furthermore, alternate optimization of the FCN loss and the GCN loss achieves substantially better validation accuracy than simultaneous optimization.


Figure 3: t-SNE of hidden states for Cora (left), Pubmed (middle), and Citeseer (right). The top row is the GCN baseline; the bottom row is GraphMix.

Figure 4: GCN training loss and validation loss for alternate optimization vs. weighted joint optimization. lambda = X.X denotes the value of λ used for simultaneous optimization in Eq. 7.