link prediction survey

Assignment 1: Application Survey on Data Mining and Data Warehousing

CSCI 4144, Winter 2016

ID: B00707506, Student Name: Patrick Walter

1/15/2016

Link-prediction in Social Networks: A Survey

Introduction

A social network consists of two main components, a set of social actors and a set of

connections. In many cases the social actors represent people, while the connections represent

any form of social interaction, collaboration or influence. It follows that a social network can be

easily represented by a graph with the actors being nodes, and the connections being edges.

The popularity of social networks online has exploded over the past decade. Social networks

have expanded from the contexts of networks of researchers who have collaborated with each

other or employees at a company who have worked together to social networks which can

connect anyone in the world together.

Given that social networks are often based on people, they are often highly dynamic

with actors constantly making new interactions and connections with each other. In many

applications it is beneficial to be able to make predictions about these future connections. The

link-prediction problem was defined by Jon Kleinberg and David Liben-Nowell as follows:

“Given a snapshot of a social network at time t, we seek to accurately predict the edges that

will be added to the network during the interval from time t to a given future time t’”

(Liben-Nowell & Kleinberg, 2007). Using link-prediction, a system can model the evolution of the

network based on features that are intrinsic to the network. An example of the link-prediction

problem is seen in social networks such as Facebook and other web-based social networks.

Facebook has systems that suggest that users connect with other users whom they may

know, or with companies they may like. These suggestions may create a more engaging

experience for users when they can easily make connections with their friends. Link-predictions

can also be used by companies to suggest employees who should work together

on new projects. Thus many companies have a vested interest in developing effective link-

prediction systems.

Using Location-based Data to Make Better Predictions

Many link-prediction systems rely heavily on making predictions based on 2-hop

neighbours, or friends-of-friends. This is a result of the scale of most social networks being in the

millions of nodes, and the likelihood of two nodes making a connection declining exponentially

with each hop. Social networks that deploy location-based information such as check-ins can

offer a way to predict links that do not form between neighbouring nodes. By exploiting

the location data of nodes, link-predictions can be made for nodes sharing one or more of these

locations. These nodes may not be within the 2-hop neighbourhood of each other and

therefore the link between them could not be made by a friends-of-friends system. The new

link made by these place-friends can be predicted by using the check-in information of the two

nodes. Thus the problem, as defined by a group of researchers from the University of Cambridge, is:

“how do we design a link prediction system which exploits data about user check-ins” (Scellato,

Noulas, & Mascolo, 2011).
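
The difference between friends-of-friends candidates and place-friends candidates can be illustrated with a minimal Python sketch. The toy users, friendships, and check-ins below are hypothetical and are not taken from the paper; the two helper functions are illustrative names, not the authors' implementation.

```python
from itertools import combinations

# Hypothetical toy data: friendship lists and check-ins (user -> set of places).
friends = {
    "ann": {"bob"},
    "bob": {"ann", "cat"},
    "cat": {"bob"},
    "dan": set(),
}
checkins = {
    "ann": {"cafe", "gym"},
    "bob": {"office"},
    "cat": {"library"},
    "dan": {"gym"},
}

def friends_of_friends(friends):
    """Unconnected pairs that share at least one friend (2-hop neighbours)."""
    pairs = set()
    for u, v in combinations(sorted(friends), 2):
        if v not in friends[u] and friends[u] & friends[v]:
            pairs.add((u, v))
    return pairs

def place_friends(friends, checkins):
    """Unconnected pairs that have checked in at one or more common places."""
    pairs = set()
    for u, v in combinations(sorted(friends), 2):
        if v not in friends[u] and checkins[u] & checkins[v]:
            pairs.add((u, v))
    return pairs

fof = friends_of_friends(friends)
pf = place_friends(friends, checkins)
print(fof)       # candidates reachable through a common friend
print(pf - fof)  # candidates that only the check-in data can surface
```

In this toy network, ann and dan share no friend, so a friends-of-friends system would never consider them, but their common check-in makes them a place-friends candidate.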

Solution Technology

The solution that Scellato, Noulas, & Mascolo used came in the form of supervised

learning. For each pair of users the link prediction is based on a set of features that describe the

pair. These features are based on both common social links and common and overlapping

location data. To create the training data, simple labelling is applied. For each snapshot, the

features of every unconnected pair of users are computed; then, in the next snapshot, the pairs that

become connected are labelled positive and the others are labelled negative. Using the created

training data, classifiers are trained to construct models which can classify test data. Because the

data have a heavily skewed class distribution, using a supervised method allows

for effective discovery of inter-class boundaries to perform better classification (2011).
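
The snapshot-based labelling step can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the two features (number of common friends and number of common places) stand in for the richer feature set used in the paper, and the function names and toy data are hypothetical.

```python
from itertools import combinations

def common_friends(graph, u, v):
    """Number of friends shared by u and v (a social feature)."""
    return len(graph[u] & graph[v])

def common_places(checkins, u, v):
    """Number of places where both u and v have checked in (a place feature)."""
    return len(checkins[u] & checkins[v])

def build_training_set(snapshot_t, snapshot_next, checkins):
    """Label every pair that is unconnected at time t by its state in the next snapshot."""
    rows = []
    for u, v in combinations(sorted(snapshot_t), 2):
        if v in snapshot_t[u]:
            continue  # already linked at time t, so there is nothing to predict
        features = (common_friends(snapshot_t, u, v), common_places(checkins, u, v))
        label = 1 if v in snapshot_next[u] else 0  # positive if the link appears later
        rows.append(((u, v), features, label))
    return rows

# Hypothetical toy snapshots: ann and dan become friends between the two snapshots.
snapshot_t = {"ann": {"bob"}, "bob": {"ann", "cat"}, "cat": {"bob"}, "dan": set()}
snapshot_next = {"ann": {"bob", "dan"}, "bob": {"ann", "cat"}, "cat": {"bob"}, "dan": {"ann"}}
checkins = {"ann": {"cafe", "gym"}, "bob": {"office"}, "cat": {"library"}, "dan": {"gym"}}

rows = build_training_set(snapshot_t, snapshot_next, checkins)
for pair, features, label in rows:
    print(pair, features, label)
```

Even in this tiny example only one of the four candidate pairs is positive, which hints at the skewed class distribution mentioned above.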

Evaluation

Using multiple supervised learning implementations, Scellato, Noulas, & Mascolo were

able to empirically show that using place-data increased the performance of a link-prediction

system. Random forests and model trees with linear regression gave the best performance in

their research. It was noted that the link-prediction was more accurate in predicting links

that would be made by place-friends since they were able to exploit location-based user activity

(2011).

Allowing for Positive and Negative Links in Link-prediction Networks

In the real world, not all connections between actors in a social network are positive.

Some online social networks have implemented this concept by having actors able to create

connections that can be either positive or negative, for example “friend” or “foe”. A group of

researchers from Stanford and Cornell University “study online social networks in which

relationships can be either positive (indicating relations such as friendship) or negative

(indicating relations such as opposition or antagonism).” (Leskovec, Huttenlocher, &

Kleinberg, 2010). In their research, Leskovec, Huttenlocher, & Kleinberg discuss how the

sign of a given link interacts with other links in the same neighbourhood or other links

throughout the entire network. In terms of the link-prediction problem, they ask what predictions can

be made about the configurations of link signs in a real social network (2010). They define the

edge sign prediction problem as follows: “given a social network with signs on all its edges, but

the sign on the edge from node u to node v, denoted s(u, v), has been “hidden.” How reliably

can we infer this sign s(u, v) using the information provided by the rest of the network?”

(Leskovec, Huttenlocher, & Kleinberg, 2010).

Solution Technology

To solve the edge sign prediction problem, Leskovec, Huttenlocher and Kleinberg

implemented a solution using a logistic regression classifier, a form of supervised learning. Since

most networks exhibited a skewed distribution of positively and negatively signed links, the group

used two approaches. One approach used a full dataset which had only about one fifth of the

connections being negative, and the other used a balanced dataset with an equal distribution of

signs. In order to use this machine-learning approach features must be defined that describe

pairs of actors with a hidden link. There are two sets of features used. One set, called the

degree features, is based on the signed degrees of the two nodes (2010). The

other set, called the triad features, is based on the joint relationships the two nodes have with

other nodes in their neighbourhood, similar to the friends-of-friends features used in Scellato,

Noulas, and Mascolo’s research.
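
The two feature families can be illustrated with a short Python sketch. This is a simplified reading of the idea, not the authors' code: the toy network, the function names, and the exact choice of counts are assumptions made for illustration. Each triad is described by the direction and sign of the link between u and a common neighbour w, and of the link between v and w.

```python
from collections import Counter

# Hypothetical signed, directed toy network: (source, target) -> +1 or -1.
edges = {
    ("u", "w1"): +1,
    ("w1", "v"): +1,
    ("w2", "u"): -1,
    ("v", "w2"): +1,
    ("u", "x"): -1,
}

def signed_degrees(edges, u, v):
    """Signed out-degrees of u and signed in-degrees of v, ignoring the hidden edge."""
    counts = Counter()
    for (a, b), sign in edges.items():
        if (a, b) == (u, v):
            continue  # the hidden edge itself is never used as evidence
        tag = "+" if sign > 0 else "-"
        if a == u:
            counts["u_out" + tag] += 1
        if b == v:
            counts["v_in" + tag] += 1
    return counts

def links_between(edges, a, w):
    """All (direction, sign) links between a and w; direction is seen from a."""
    found = []
    if (a, w) in edges:
        found.append(("out", edges[(a, w)]))
    if (w, a) in edges:
        found.append(("in", edges[(w, a)]))
    return found

def triad_counts(edges, u, v):
    """Count triads through each common neighbour w by direction and sign."""
    nodes = {n for edge in edges for n in edge} - {u, v}
    counts = Counter()
    for w in nodes:
        for u_link in links_between(edges, u, w):
            for v_link in links_between(edges, v, w):
                counts[(u_link, v_link)] += 1
    return counts

degree_features = signed_degrees(edges, "u", "v")
triad_features = triad_counts(edges, "u", "v")
print(dict(degree_features))
print(dict(triad_features))
```

With two directions and two signs on each of the two links, there are 2 x 2 x 2 x 2 = 16 possible triad types, so the triad counts form a 16-dimensional vector. In the sketch, node x contributes to the degree counts of u but forms no triad because it has no link to v.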

Evaluation

In total there are 23 features used to describe each hidden link, 7 degree features and

16 triad features. Leskovec, Huttenlocher, & Kleinberg evaluated the solution on the

basis of each set of features by representing each set by a vector. What stood out the most in

the evaluation was that predictions based on their models significantly outperformed a

previous study which used propagation to go beyond the 2-hop neighbourhood on the same

dataset. This means that sign prediction can be understood based solely on the signs of other

links in the same one-step neighbourhood. In general using the full dataset gained much higher

accuracy, with about 15% improvement from random guessing (2010).

Using Continuous-valued Links in Link-prediction Networks

In the previously mentioned case of link-prediction using location-based information,

the researchers treated links as binary relations, and in the edge sign prediction problem the

links were evaluated as being ternary relations. Researchers at Purdue University believe that

“in online social networks the low cost of link formation can lead to networks with

heterogeneous relationship strengths (e.g., acquaintances and best friends mixed together).”

(Xiang, Neville, & Rogati, 2010). Xiang, Neville, & Rogati developed a model to predict and

estimate the strength of links in a social network based on their interaction activity and

similarity. This challenge extends the link-prediction problem, as the group believes that

treating links as binary relations increases the amount of noise learned by a prediction

model because strong and weak links are treated equally. In most online social networks, creating links

comes at such a low cost that many links may be much less significant than others. Including

these insignificant links in the learned model can greatly degrade the performance of the

system (2010).

Solution Technology

To build their model, Xiang, Neville, & Rogati implemented an

unsupervised method to infer the strength of links in a network. These strength values are

continuous to represent a range of weak to strong relationships (2010). More specifically the

researchers “formulate a latent variable model to infer (hidden) relationship strengths and

develop a coordinate ascent optimization procedure for inference.” (Xiang, Neville, & Rogati,

2010). A Gaussian distribution is used to model the conditional probability of a link's strength given

the similarity of the actors involved in the link. The latent variable model is estimated by

maximum likelihood, and a gradient-based method is used to optimize the

parameters of the model (Xiang, Neville, & Rogati, 2010).
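
A minimal Python sketch of the Gaussian part of such a model is shown below. It assumes, for illustration only, that the mean of the strength is a weighted sum of similarity features; the weights, the variance, and the function name are hypothetical and are not taken from the paper.

```python
import math

def strength_log_density(z, similarity, weights, variance=0.5):
    """Log of a Gaussian density for strength z, with mean = weights . similarity.

    Assumption for illustration: the mean is a linear function of the
    similarity features, and the variance is a fixed constant.
    """
    mean = sum(w * s for w, s in zip(weights, similarity))
    return -0.5 * math.log(2 * math.pi * variance) - (z - mean) ** 2 / (2 * variance)

# Hypothetical pair of actors with two similarity features and hand-picked weights.
similarity = [1.0, 0.0]
weights = [0.8, 0.3]

likely = strength_log_density(0.8, similarity, weights)    # z equals the mean
unlikely = strength_log_density(0.0, similarity, weights)  # z far from the mean
print(likely > unlikely)
```

A gradient-based optimizer would adjust the weights to increase such log-densities, together with the terms for the observed interactions, across all links.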

Evaluation

Evaluation was done based on two measures, the autocorrelation improvement and the

classification improvement. In terms of autocorrelation, “the relationship-strength network has

significantly higher autocorrelation than the friendship graph in all cases” (Xiang, Neville, &

Rogati, 2010). Using a Gaussian random field semi-supervised classification algorithm and

comparing with other works, the group reports that their model “results in the highest classification

performance for all tasks, suggesting that [their] approach to summarizing the rich profile and

interaction information in online social networks leads to a single meaningful relationship graph

which can improve subsequent knowledge discovery and prediction tasks.” (Xiang, Neville, &

Rogati, 2010).

Drivers and Enablers of Data Mining and Data Warehousing

There are many factors that create a demand for data mining and data warehousing

technologies. Many companies, organizations, and institutions have an interest in extracting

information and knowledge from their stored and incoming data. Some groups seek to use their

data to create monetary value while others seek to understand how to serve their customers or

employees better. With today’s widespread use of technology and the World Wide Web, society is

creating new data at alarming rates. To handle this endless stream of data many

companies turn to data mining and warehousing technologies. Many companies can use data

mining to make better business decisions, better target their customers, and find new ways to

market their products and services. The amount of data created and stored far exceeds the

capabilities of any traditional data analysis tools and creates a demand for data mining.

The decreasing cost of computational power and storage is facilitating the widespread

use of data mining and data warehousing in the business world. Globalization is also driving

these technologies as the world becomes more interconnected in online communities. The

increasing availability of data collection devices such as smart phones is also contributing to the

use of data mining. Increasingly datasets are becoming openly available to the public from

many governments and organizations. The abundance of data, the low cost of computational

power, and the use of open and free software create an environment that fosters data mining.

References

Leskovec, J., Huttenlocher, D., & Kleinberg, J. (2010). Predicting Positive and Negative Links in Online Social Networks. Proceedings of the 19th International Conference on World Wide Web (WWW '10) (pp. 641-650). New York, NY, USA: ACM.

Liben-Nowell, D., & Kleinberg, J. (2007). The Link-Prediction Problem for Social Networks. Journal of the American Society for Information Science and Technology, 58(7), 1019-1031.

Scellato, S., Noulas, A., & Mascolo, C. (2011). Exploiting Place Features in Link Prediction on Location-based Social Networks. Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11) (pp. 1046-1054). New York, NY, USA: ACM.

Xiang, R., Neville, J., & Rogati, M. (2010). Modeling Relationship Strength in Online Social Networks. Proceedings of the 19th International Conference on World Wide Web (WWW '10) (pp. 981-990). New York, NY, USA: ACM.

Questions

a) Why are DM and DW technologies becoming important tools for today's business world?

With the growth of data being collected by businesses, data warehousing technologies are

becoming more important. Companies need data warehousing technologies to easily access

aggregate information from their data. Businesses also seek to integrate data from multiple

different database systems with different designs and schemas. Data warehousing technology

allows for a company to store their data based on groupings. With all this data companies need

to make sense of it all. Data mining technologies allow for businesses to turn the information

stored in their data warehousing technologies into knowledge. Data mining aids businesses in

making decisions and sheds light on interesting correlations that would be otherwise unknown.

In today’s online world, data is what drives businesses and data mining is the methodology of

producing knowledge from vast amounts of data.

b) What are the main differences between data mining, traditional statistics data analysis,

and information retrieval?

Data mining is the process of extracting knowledge from large amounts of data which

involves several steps that turn raw data into knowledge that is easily understood by

humans. Traditional statistical data analysis is typically not designed to scale to such large amounts of data.

Information retrieval, in terms of database systems, only involves accessing and retrieving

data, creating aggregate values, or performing deductive queries.

c) How is a data warehouse model different from a relational database model? Why is DW

technology more advanced in supporting business management?

A relational database is simply a collection of tables. Each table has columns and rows and

each cell can be accessed independently or an aggregate query may be applied to a subset

of cells. In order to access any data from a relational database queries must be made in a

relational query language. This is quite different from a data warehouse, which is a

repository of information from many sources stored under a unified schema. Data in a data

warehouse is stored in a way that it can provide information in a historical perspective and

in a summarized manner. Data warehouses are multidimensional and each cell contains

some aggregate measure. These properties make a data warehouse better suited to supporting business

management. For example a manager can easily access the aggregate sales of a particular

product by region, or year, or region and year, or any other combination of attributes.
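
The idea of storing an aggregate measure for every combination of attributes can be sketched in Python as follows. The sales records, dimension names, and the build_cube helper are hypothetical and only illustrate the concept; a real data warehouse would use dedicated storage and indexing rather than in-memory dictionaries.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales records with three dimensions and one measure.
sales = [
    {"product": "tea", "region": "east", "year": 2014, "amount": 10},
    {"product": "tea", "region": "east", "year": 2015, "amount": 20},
    {"product": "tea", "region": "west", "year": 2015, "amount": 5},
    {"product": "coffee", "region": "east", "year": 2015, "amount": 7},
]
DIMENSIONS = ("product", "region", "year")

def build_cube(rows, dimensions, measure):
    """Precompute the summed measure for every combination of dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(int)
            for row in rows:
                totals[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(totals)
    return cube

cube = build_cube(sales, DIMENSIONS, "amount")
print(cube[("product", "region")][("tea", "east")])  # tea sales in the east, all years
print(cube[("product", "year")][("tea", 2015)])      # tea sales in 2015, all regions
print(cube[()][()])                                  # grand total
```

Once the cube is built, each of the manager's questions becomes a dictionary lookup instead of a scan over the raw records.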

d) What are the main differences between using OLAP on DW and using SQL on traditional

database for supporting business decision making?

Using on-line analytical processing operations allows data to be presented in different

layers of abstraction to accommodate different viewpoints. This is useful in a business

environment as different departments may want to see the company’s data in different

ways. OLAP queries are typically much faster than SQL aggregate queries because the aggregates are

precomputed and do not require computationally expensive operations such as joins.
