Updated Top List of Data Mining IEEE Projects (DotNet and Java) 2016-17 for ME/MTech, BE/BTech Final...


Given a point p and a set of points S, the kNN operation finds the k closest points to p in S. It is a computationally intensive task with a wide range of applications, such as knowledge discovery and data mining. However, as the volume and the dimension of data increase, only distributed approaches can perform such a costly operation in reasonable time. Recent works have focused on implementing efficient solutions using the MapReduce programming model, because it is suitable for distributed large-scale data processing. Although these works provide different solutions to the same problem, each one has particular constraints and properties. In this paper, we compare the existing approaches for computing kNN on MapReduce, first theoretically and then through an extensive experimental evaluation. To make the solutions comparable, we identify three generic steps for kNN computation on MapReduce: data pre-processing, data partitioning, and computation. We then analyze each step in terms of load balancing, accuracy, and complexity. The experiments use a variety of datasets and analyze the impact of data volume, data dimension, and the value of k from many perspectives, such as time and space complexity and accuracy. The experimental evaluation reveals additional advantages and shortcomings, which are discussed for each algorithm. To the best of our knowledge, this is the first paper that compares kNN computing methods on MapReduce both theoretically and experimentally with the same setting. Overall, this paper can be used as a guide to tackle kNN-based practical problems in the context of big data.

ETPL DM-001: K Nearest Neighbour Joins for Big Data on MapReduce: A Theoretical and Experimental Analysis

High utility itemsets (HUIs) mining is an emerging topic in data mining, which refers to discovering

all itemsets having a utility meeting a user-specified minimum utility threshold min_util. However,

setting min_util appropriately is a difficult problem for users. Generally speaking, finding an

appropriate minimum utility threshold by trial and error is a tedious process for users. If min_util is set

too low, too many HUIs will be generated, which may cause the mining process to be very inefficient.

On the other hand, if min_util is set too high, it is likely that no HUIs will be found. In this paper, we

address the above issues by proposing a new framework for top-k high utility itemset mining, where k

is the desired number of HUIs to be mined. Two types of efficient algorithms named TKU (mining

Top-K Utility itemsets) and TKO (mining Top-K utility itemsets in One phase) are proposed for mining

such itemsets without the need to set min_util. We provide a structural comparison of the two

algorithms with discussions on their advantages and limitations. Empirical evaluations on both real and

synthetic datasets show that the performance of the proposed algorithms is close to that of the optimal

case of state-of-the-art utility mining algorithms.

ETPL DM-002: Efficient Algorithms for Mining Top-K High Utility Itemsets
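
As a rough illustration of how such top-k miners avoid a user-supplied min_util, the sketch below keeps the k highest utilities seen so far in a min-heap and uses the smallest of them as a rising internal border threshold for pruning. The class name and structure are assumptions for illustration, not the TKU/TKO implementation.

import java.util.PriorityQueue;

// Sketch of the internal-threshold idea behind top-k high utility itemset mining:
// keep the k highest utilities seen so far in a min-heap and use the smallest of
// them as the current (rising) border threshold instead of a user-given min_util.
public class TopKUtilityBorder {
    private final int k;
    private final PriorityQueue<Long> topUtilities = new PriorityQueue<>(); // min-heap

    public TopKUtilityBorder(int k) {
        this.k = k;
    }

    // Current border: any itemset whose (upper-bound) utility is below this
    // value can be pruned without affecting the final top-k result.
    public long border() {
        return topUtilities.size() < k ? 0L : topUtilities.peek();
    }

    // Offer the utility of a newly evaluated itemset; raise the border if needed.
    public void offer(long utility) {
        if (topUtilities.size() < k) {
            topUtilities.add(utility);
        } else if (utility > topUtilities.peek()) {
            topUtilities.poll();
            topUtilities.add(utility);
        }
    }

    public static void main(String[] args) {
        TopKUtilityBorder border = new TopKUtilityBorder(3);
        for (long u : new long[]{40, 12, 75, 8, 60, 91}) {
            border.offer(u);
            System.out.println("after " + u + " border = " + border.border());
        }
        // Final border is 60: only itemsets with utility >= 60 stay in the top-3.
    }
}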

Textual documents created and distributed on the Internet are ever changing in various forms. Most existing works are devoted to topic modelling and the evolution of individual topics, while sequential relations of topics in successive documents published by a specific user are ignored. In this paper, in order to characterize and detect personalized and abnormal behaviours of Internet users, we propose Sequential Topic Patterns (STPs) and formulate the problem of mining User-aware Rare Sequential Topic Patterns (URSTPs) in document streams on the Internet. They are rare on the whole but relatively frequent for specific users, and so can be applied in many real-life scenarios, such as real-time monitoring of abnormal user behaviours. We present a group of algorithms to solve this innovative mining problem through three phases: pre-processing to extract probabilistic topics and identify sessions for different users, generating all the STP candidates with (expected) support values for each user by pattern-growth, and selecting URSTPs by applying user-aware rarity analysis to the derived STPs. Experiments on both real (Twitter) and synthetic datasets show that our approach can indeed discover special users and interpretable URSTPs effectively and efficiently, and that these significantly reflect users' characteristics.

ETPL DM-003: Mining User-Aware Rare Sequential Topic Patterns in Document Streams

Sequence classification is an important task in data mining. We address the problem of sequence

classification using rules composed of interesting patterns found in a dataset of labelled sequences and

accompanying class labels. We measure the interestingness of a pattern in a given class of sequences

by combining the cohesion and the support of the pattern. We use the discovered patterns to generate

confident classification rules, and present two different ways of building a classifier. The first classifier

is based on an improved version of the existing method of classification based on association rules,

while the second ranks the rules by first measuring their value specific to the new data object.

Experimental results show that our rule-based classifiers outperform existing comparable classifiers in

terms of accuracy and stability. Additionally, we test a number of pattern-feature-based models that use

different kinds of patterns as features to represent each sequence as a feature vector. We then apply a

variety of machine learning algorithms for sequence classification, experimentally demonstrating that

the patterns we discover represent the sequences well, and prove effective for the classification task.

ETPL DM-004: Pattern Based Sequence Classification

We propose an algorithm for detecting patterns exhibited by anomalous clusters in high dimensional

discrete data. Unlike most anomaly detection (AD) methods, which detect individual anomalies, our

proposed method detects groups (clusters) of anomalies; i.e. sets of points which collectively exhibit

abnormal patterns. In many applications this can lead to better understanding of the nature of the

atypical behavior and to identifying the sources of the anomalies. Moreover, we consider the case

where the atypical patterns are exhibited in only a small (salient) subset of the very high-dimensional feature

space. Individual AD techniques and techniques that detect anomalies using all the features typically

fail to detect such anomalies, but our method can detect such instances collectively, discover the shared

anomalous patterns exhibited by them, and identify the subsets of salient features. In this paper, we

focus on detecting anomalous topics in a batch of text documents, developing our algorithm based on

topic models. Results of our experiments show that our method can accurately detect anomalous topics

and salient features (words) under each such topic in a synthetic data set and two real-world text corpora

and achieves better performance compared to both standard group AD and individual AD techniques.

All required code to reproduce our experiments is available from https://github.com/hsoleimani/ATD.

ETPL DM-005: ATD: Anomalous Topic Discovery in High Dimensional Discrete Data

Some important data management and analytics tasks cannot be completely addressed by automated

processes. These “computer-hard” tasks such as entity resolution, sentiment analysis, and image

recognition, can be enhanced through the use of human cognitive ability. Human Computation is an

effective way to address such tasks by harnessing the capabilities of crowd workers (i.e., the crowd).

Thus, crowdsourced data management has become an area of increasing interest in research and

industry. There are three important problems in crowdsourced data management. (1) Quality Control:

Workers may return noisy results and effective techniques are required to achieve high quality; (2)

Cost Control: The crowd is not free, and cost control aims to reduce the monetary cost; (3) Latency

Control: The human workers can be slow, particularly in contrast to computing time scales, so latency-

control techniques are required. There has been significant work addressing these three factors for

designing crowdsourced tasks, developing crowdsourced data manipulation operators, and optimizing

plans of multiple operators. In this paper, we survey and synthesize a wide spectrum of existing studies

on crowdsourced data management. Based on this analysis we then outline key factors that need to be

considered to improve crowdsourced data management.

ETPL DM-006: Crowdsourced Data Management: A Survey

Since Jeff Howe introduced the term Crowdsourcing in 2006, this human-powered problem-solving

paradigm has gained a lot of attention and has been a hot research topic in the field of Computer

Science. Even though a lot of work has been conducted on this topic, so far we do not have a

comprehensive survey of the most relevant work done in the crowdsourcing field. In this paper, we aim to

offer an overall picture of the current state of the art techniques in general-purpose crowdsourcing.

According to their focus, we divide this work into three parts, which are: incentive design, task

assignment and quality control. For each part, we start with different problems faced in that area

followed by a brief description of existing work and a discussion of pros and cons. In addition, we also

present a real scenario on how the different techniques are used in implementing a location-based

crowdsourcing platform, gMission. Finally, we highlight the limitations of the current general-purpose

crowdsourcing techniques and present some open problems in this area.

ETPL DM-007: A Survey of General-Purpose Crowdsourcing Techniques

General health examination is an integral part of healthcare in many countries. Identifying the participants

at risk is important for early warning and preventive intervention. The fundamental challenge of learning a

classification model for risk prediction lies in the unlabeled data that constitutes the majority of the collected

dataset. Particularly, the unlabeled data describes the participants in health examinations whose health

conditions can vary greatly from healthy to very-ill. There is no ground truth for differentiating their states

of health. In this paper, we propose a graph-based, semi-supervised learning algorithm called SHG-Health

(Semi-supervised Heterogeneous Graph on Health) for risk predictions to classify a progressively

developing situation with the majority of the data unlabeled. An efficient iterative algorithm is designed

and the proof of convergence is given. Extensive experiments based on both real health examination datasets and synthetic datasets are performed to show the effectiveness and efficiency of our method.

ETPL DM-008: Mining Health Examination Records — A Graph-based Approach

Twitter has become one of the largest microblogging platforms for users around the world to share

anything happening around them with friends and beyond. A bursty topic in Twitter is one that triggers

a surge of relevant tweets within a short period of time, which often reflects important events of mass

interest. How to leverage Twitter for early detection of bursty topics has therefore become an important

research problem with immense practical value. Despite the wealth of research work on topic

modelling and analysis in Twitter, it remains a challenge to detect bursty topics in real-time. As existing

methods can hardly scale to handle the task with the tweet stream in real-time, we propose in this paper

TopicSketch, a sketch-based topic model together with a set of techniques to achieve real-time

detection. We evaluate our solution on a tweet stream with over 30 million tweets. Our experiment

results show both the efficiency and effectiveness of our approach. In particular, we also demonstrate that TopicSketch, on a single machine, can potentially handle hundreds of millions of tweets per day, which is on the same scale as the total number of daily tweets on Twitter, and can present bursty events at a finer granularity.

ETPL DM-009: TopicSketch: Real-time Bursty Topic Detection from Twitter

The development of a topic in a set of topic documents is constituted by a series of person interactions

at a specific time and place. Knowing the interactions of the persons mentioned in these documents is

helpful for readers to better comprehend the documents. In this paper, we propose a topic person

interaction detection method called SPIRIT, which classifies the text segments in a set of topic

documents that convey person interactions. We design the rich interactive tree structure to represent

syntactic, context, and semantic information of text, and this structure is incorporated into a tree-based

convolution kernel to identify interactive segments. Experimental results based on real-world topics

demonstrate that the proposed rich interactive tree structure effectively detects the topic person

interactions and that our method outperforms many well-known relation extraction and protein-protein

interaction methods.

ETPL DM-010: SPIRIT: A Tree Kernel-based Method for Topic Person Interaction Detection

The ubiquity of smartphones has led to the emergence of mobile crowdsourcing tasks such as the

detection of spatial events when smartphone users move around in their daily lives. However, the

credibility of those detected events can be negatively impacted by unreliable participants with low-

quality data. Consequently, a major challenge in mobile crowdsourcing is truth discovery, i.e., to

discover true events from diverse and noisy participants' reports. This problem is uniquely distinct from

its online counterpart in that it involves uncertainties in both participants' mobility and reliability.

Decoupling these two types of uncertainties through location tracking will raise severe privacy and

energy issues, whereas simply ignoring missing reports or treating them as negative reports will

significantly degrade the accuracy of truth discovery. In this paper, we propose two new unsupervised

models, i.e., Truth finder for Spatial Events (TSE) and Personalized Truth finder for Spatial Events

(PTSE), to tackle this problem. In TSE, we model location popularity, location visit indicators, truths

of events, and three-way participant reliability in a unified framework. In PTSE, we further model

personal location visit tendencies. These proposed models are capable of effectively handling various

types of uncertainties and automatically discovering truths without any supervision or location

tracking. Experimental results on both real-world and synthetic datasets demonstrate that our proposed

models outperform existing state-of-the-art truth discovery approaches in the mobile crowdsourcing

environment.

ETPL DM-011: Truth Discovery in Crowdsourced Detection of Spatial Events

Feature selection is a challenging problem for high dimensional data processing, which arises in many

real applications such as data mining, information retrieval, and pattern recognition. In this paper, we

study the problem of unsupervised feature selection. The problem is challenging due to the lack of

label information to guide feature selection. We formulate the problem of unsupervised feature

selection from the viewpoint of graph regularized data reconstruction. The underlying idea is that the

selected features not only preserve the local structure of the original data space via graph regularization,

but also approximately reconstruct each data point via linear combination. Therefore, the graph

regularized data reconstruction error becomes a natural criterion for measuring the quality of the

selected features. By minimizing the reconstruction error, we are able to select the features that best

preserve both the similarity and discriminant information in the original data. We then develop an

efficient gradient algorithm to solve the corresponding optimization problem. We evaluate the

performance of our proposed algorithm on text clustering. The extensive experiments demonstrate the

effectiveness of our proposed approach.

ETPL DM-012: Graph Regularized Feature Selection with Data Reconstruction

The last few years have witnessed the emergence and evolution of a vibrant research stream on a large

variety of online social media network (SMN) platforms. Recognizing anonymous, yet identical users

among multiple SMNs is still an intractable problem. Clearly, cross-platform exploration may help

solve many problems in social computing in both theory and applications. Since public profiles can be

duplicated and easily impersonated by users with different purposes, most current user identification

resolutions, which mainly focus on text mining of users’ public profiles, are fragile. Some studies have

attempted to match users based on the location and timing of user content as well as writing style.

However, the locations are sparse in the majority of SMNs, and writing style is difficult to discern from

the short sentences of leading SMNs such as Sina Microblog and Twitter. Moreover, since online

SMNs are quite symmetric, existing user identification schemes based on network structure are not

effective. The real-world friend cycle is highly individual and virtually no two users share a congruent

friend cycle. Therefore, it is more accurate to use a friendship structure to analyze cross-platform

SMNs. Since identical users tend to set up partially similar friendship structures in different SMNs, we

propose the Friend Relationship-Based User Identification (FRUI) algorithm. FRUI calculates a

match degree for all candidate User Matched Pairs (UMPs), and only UMPs with top ranks are

considered identical users. We also develop two propositions to improve the efficiency of the

algorithm. Results of extensive experiments demonstrate that FRUI performs much better than current

network structure-based algorithms.

ETPL DM-013: Cross-Platform Identification of Anonymous Identical Users in Multiple Social Media Networks

Taxonomy learning is an important task for knowledge acquisition, sharing, and classification as well

as application development and utilization in various domains. To reduce human effort to build a

taxonomy from scratch and improve the quality of the learned taxonomy, we propose a new taxonomy

learning approach, named TaxoFinder. TaxoFinder takes three steps to automatically build a taxonomy.

First, it identifies domain-specific concepts from a domain text corpus. Second, it builds a graph

representing how such concepts are associated together based on their co-occurrences. As the key

method in TaxoFinder, we propose a method for measuring associative strengths among the concepts,

which quantifies how strongly they are associated in the graph, using similarities between sentences and

spatial distances between sentences. Lastly, TaxoFinder induces a taxonomy from the graph using a

graph analytic algorithm. TaxoFinder aims to build a taxonomy in such a way that it maximizes the

overall associative strength among the concepts in the graph. We evaluate

TaxoFinder using gold-standard evaluation on three different domains: emergency management for

mass gatherings, autism research, and disease domains. In our evaluation, we compare TaxoFinder

with a state-of-the-art subsumption method and show that TaxoFinder is an effective approach

significantly outperforming the subsumption method.

ETPL DM-014: TaxoFinder: A Graph-Based Approach for Taxonomy Learning

As more and more applications produce streaming data, clustering data streams has become an

important technique for data and knowledge engineering. A typical approach is to summarize the data

stream in real-time with an online process into a large number of so-called micro-clusters. Micro-

clusters represent local density estimates by aggregating the information of many data points in a

defined area. On demand, a (modified) conventional clustering algorithm is used in a second offline

step to recluster the micro-clusters into larger final clusters. For reclustering, the centers of the micro-

clusters are used as pseudo points with the density estimates used as their weights. However,

information about density in the area between micro-clusters is not preserved in the online process and

reclustering is based on possibly inaccurate assumptions about the distribution of data within and

between micro-clusters (e.g., uniform or Gaussian). This paper describes DBSTREAM, the first micro-

cluster-based online clustering component that explicitly captures the density between micro-clusters

via a shared density graph. The density information in this graph is then exploited for reclustering

based on actual density between adjacent micro-clusters. We discuss the space and time complexity of

maintaining the shared density graph. Experiments on a wide range of synthetic and real data sets

highlight that using shared density improves clustering quality over other popular data stream

clustering methods which require the creation of a larger number of smaller micro-clusters to achieve

comparable results.

ETPL DM-015: Clustering Data Streams Based on Shared Density between Micro-Clusters
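
The shared-density idea can be illustrated with a small sketch: alongside the usual per-micro-cluster weights, a weight is kept for every pair of micro-clusters that absorb the same points, and the offline step merges pairs whose shared weight is high enough. This is a simplified stand-in, not the DBSTREAM code; the names, decay rule, and thresholds are illustrative.

import java.util.HashMap;
import java.util.Map;

// Simplified sketch of a shared density graph: a weight per pair of micro-clusters,
// incremented whenever a point falls into both, decayed over time, and queried by
// the offline reclustering step to decide whether two micro-clusters are connected.
public class SharedDensityGraph {
    private final Map<Long, Double> sharedWeight = new HashMap<>();

    private static long key(int a, int b) {               // order-independent pair key
        int lo = Math.min(a, b), hi = Math.max(a, b);
        return ((long) lo << 32) | (hi & 0xffffffffL);
    }

    // Called by the online component when a point is absorbed by both micro-clusters.
    public void pointSharedBy(int mc1, int mc2) {
        sharedWeight.merge(key(mc1, mc2), 1.0, Double::sum);
    }

    // Exponential decay applied periodically, as for micro-cluster weights.
    public void decay(double lambda) {
        sharedWeight.replaceAll((k, w) -> w * Math.pow(2, -lambda));
    }

    // Offline step: should the two micro-clusters be merged into one final cluster?
    public boolean connected(int mc1, int mc2, double threshold) {
        return sharedWeight.getOrDefault(key(mc1, mc2), 0.0) >= threshold;
    }

    public static void main(String[] args) {
        SharedDensityGraph g = new SharedDensityGraph();
        g.pointSharedBy(1, 2);
        g.pointSharedBy(2, 1);
        g.decay(0.25);
        System.out.println(g.connected(1, 2, 1.0)); // true: shared weight is about 1.68
    }
}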

Social media networks are dynamic. As such, the order in which network ties develop is an important

aspect of the network dynamics. This study proposes a novel dynamic network model, the Nodal

Attribute-based Temporal Exponential Random Graph Model (NATERGM) for dynamic network

analysis. The proposed model focuses on how the nodal attributes of a network affect the order in

which the network ties develop. Temporal patterns in social media networks are modeled based on the

nodal attributes of individuals and the time information of network ties. Using social media data

collected from a knowledge sharing community, empirical tests were conducted to evaluate the

performance of the NATERGM on identifying the temporal patterns and predicting the characteristics

of the future networks. Results showed that the NATERGM demonstrated an enhanced pattern testing

capability and an increased prediction accuracy of network characteristics compared to benchmark

models. The proposed NATERGM model helps explain the roles of nodal attributes in the formation

process of dynamic networks.

ETPL DM-016: NATERGM: A Model for Examining the Role of Nodal Attributes in Dynamic Social Media Networks

Graph classification aims to learn models to classify structured data. To date, all existing graph

classification methods are designed to target one single learning task and require a large number of

labeled samples for learning good classification models. In reality, each real-world task may only have

a limited number of labeled samples, yet multiple similar learning tasks can provide useful knowledge

to benefit all tasks as a whole. In this paper, we formulate a new multi-task graph classification (MTG)

problem, where multiple graph classification tasks are jointly regularized to find discriminative

subgraphs shared by all tasks for learning. The niche of MTG stems from the fact that with a limited

number of training samples, subgraph features selected for one single graph classification task tend to

overfit the training data. By using additional tasks as evaluation sets, MTG can jointly regularize

multiple tasks to explore high quality subgraph features for graph classification. To achieve this goal,

we formulate an objective function which combines multiple graph classification tasks to evaluate the

informativeness score of a subgraph feature. An iterative subgraph feature exploration and multi-task

learning process is further proposed to incrementally select subgraph features for graph classification.

Experiments on real-world multi-task graph classification datasets demonstrate significant

performance gain.

ETPL DM-018: Joint Structure Feature Exploration and Regularization for Multi-Task Graph Classification

Resource Description Framework (RDF) has been widely used in the Semantic Web to describe

resources and their relationships. The RDF graph is one of the most commonly used representations

for RDF data. However, in many real applications such as the data extraction/integration, RDF graphs

integrated from different data sources may often contain uncertain and inconsistent information (e.g.,

uncertain labels or labels that violate facts/rules), due to the unreliability of data sources. In this paper, we

formalize the RDF data by inconsistent probabilistic RDF graphs, which contain both inconsistencies

and uncertainty. With such a probabilistic graph model, we focus on an important problem, quality-

aware subgraph matching over inconsistent probabilistic RDF graphs (QA-gMatch), which retrieves

subgraphs from inconsistent probabilistic RDF graphs that are isomorphic to a given query graph and

with high quality scores (considering both consistency and uncertainty). In order to efficiently answer

QA-gMatch queries, we provide two effective pruning methods, namely adaptive label pruning and

quality score pruning, which can greatly filter out false alarms of subgraphs. We also design an

effective index to facilitate our proposed pruning methods, and propose an efficient approach for

processing QA-gMatch queries. Finally, we demonstrate the efficiency and effectiveness of our

proposed approaches through extensive experiments.

ETPL DM-017: Quality-Aware Subgraph Matching Over Inconsistent Probabilistic Graph Databases

General health examination is an integral part of healthcare in many countries. Identifying the

participants at risk is important for early warning and preventive intervention. The fundamental

challenge of learning a classification model for risk prediction lies in the unlabeled data that constitutes

the majority of the collected dataset. Particularly, the unlabeled data describes the participants in health

examinations whose health conditions can vary greatly from healthy to very-ill. There is no ground

truth for differentiating their states of health. In this paper, we propose a graph-based, semi-supervised

learning algorithm called SHG-Health (Semi-supervised Heterogeneous Graph on Health) for risk

predictions to classify a progressively developing situation with the majority of the data unlabeled. An

efficient iterative algorithm is designed and the proof of convergence is given. Extensive experiments

based on both real health examination datasets and synthetic datasets are performed to show the

effectiveness and efficiency of our method.

ETPL DM-019: Mining Health Examination Records — A Graph-based Approach

In this paper, we propose a semantic-aware blocking framework for entity resolution (ER). The

proposed framework is built using locality-sensitive hashing (LSH) techniques, which efficiently

unifies both textual and semantic features into an ER blocking process. In order to understand how

similarity metrics may affect the effectiveness of ER blocking, we study the robustness of similarity

metrics and their properties in terms of LSH families. Then, we present how the semantic similarity of

records can be captured, measured, and integrated with LSH techniques over multiple similarity spaces.

In doing so, the proposed framework can support efficient similarity searches on records in both textual

and semantic similarity spaces, yielding ER blocking with improved quality. We have evaluated the

proposed framework over two real-world data sets, and compared it with the state-of-the-art blocking

techniques. Our experimental study shows that the combination of semantic similarity and textual

similarity can considerably improve the quality of blocking. Furthermore, due to the probabilistic

nature of LSH, this semantic-aware blocking framework enables us to build fast and reliable blocking

for performing entity resolution tasks in a large-scale data environment.

ETPL DM-020: Semantic-Aware Blocking for Entity Resolution
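
As a rough sketch of LSH-based blocking in the textual similarity space only, the following example computes MinHash signatures over record token sets and groups records that collide on any signature band. The hash family and parameters are illustrative assumptions, and the semantic similarity spaces handled in the paper are not modelled here.

import java.util.*;

// Minimal sketch of LSH blocking: each record's token set gets a MinHash signature,
// the signature is cut into bands, and records that collide on any band land in the
// same candidate block. Parameters and hashing are illustrative.
public class LshBlocking {
    static final int NUM_HASHES = 12;   // signature length
    static final int ROWS_PER_BAND = 3; // 4 bands of 3 rows

    static int[] minhash(Set<String> tokens) {
        int[] sig = new int[NUM_HASHES];
        Arrays.fill(sig, Integer.MAX_VALUE);
        for (String t : tokens) {
            for (int i = 0; i < NUM_HASHES; i++) {
                // cheap family of hash functions: mix the token hash with the index
                int h = Objects.hash(t, i);
                if (h < sig[i]) sig[i] = h;
            }
        }
        return sig;
    }

    // Maps "band id + band contents" to the list of record ids falling in that block.
    static Map<String, List<Integer>> block(List<Set<String>> records) {
        Map<String, List<Integer>> blocks = new HashMap<>();
        for (int id = 0; id < records.size(); id++) {
            int[] sig = minhash(records.get(id));
            for (int b = 0; b < NUM_HASHES / ROWS_PER_BAND; b++) {
                int[] band = Arrays.copyOfRange(sig, b * ROWS_PER_BAND, (b + 1) * ROWS_PER_BAND);
                String key = b + ":" + Arrays.toString(band);
                blocks.computeIfAbsent(key, k -> new ArrayList<>()).add(id);
            }
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<Set<String>> recs = List.of(
                Set.of("john", "smith", "ny"),
                Set.of("jon", "smith", "ny"),
                Set.of("alice", "jones", "la"));
        block(recs).values().stream()
                .filter(ids -> ids.size() > 1)
                .forEach(ids -> System.out.println("candidate block: " + ids));
    }
}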

Introducing recent advances in the machine learning techniques to state-of-the-art discrete choice

models, we develop an approach to infer the unique and complex decision making process of a

decision-maker (DM), which is characterized by the DM’s priorities and attitudinal character, along

with the interaction among attributes, to name a few. On the basis of exemplary preference information in the

form of pairwise comparisons of alternatives, our method seeks to induce a DM’s preference model in

terms of the parameters of recent discrete choice models. To this end, we reduce our learning function

to a constrained non-linear optimization problem. Our learning approach is a simple one that takes into

consideration the interaction among the attributes along with the priorities and the unique attitudinal

character of a DM. The experimental results on standard benchmark datasets suggest that our approach

is not only intuitively appealing and easily interpretable but also competitive to state-of-the-art

methods.

ETPL DM-021: On Learning of Choice Models with Interactive Attributes

In many applications, there is a need to identify to which of a group of sets an element x belongs, if any. For example, in a router, this functionality can be used to determine the next hop of an incoming packet. This problem is generally known as set separation and has been widely studied. Most existing solutions make use of hash-based algorithms, particularly when a small percentage of false positives is allowed. A known approach is to use a collection of Bloom filters in parallel. Such schemes can require several memory accesses, a significant limitation for some implementations. We propose an approach using Block Bloom Filters, where each element is first hashed to a single memory block that stores a small Bloom filter tracking the element and the set or sets the element belongs to. In a naïve solution, when an element x in a set S is stored, it necessarily increases the false positive probability of finding that x is in another set T. In this paper, we introduce our One Memory Access Set Separation (OMASS) scheme to avoid this problem. OMASS is designed so that, for a given element x, the corresponding Bloom filter bits for each set map to different positions in the memory word. This ensures that the false positive rates of the Bloom filters for element x under other sets are not affected. In addition, OMASS requires fewer hash functions compared to the naïve solution.

ETPL DM-022: OMASS: One Memory Access Set Separation
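
The core idea can be sketched as follows: hash every element to one machine word and give each set its own bit positions inside that word, so that storing x for set S cannot raise the false positive rate of x under another set T. The sketch below uses fixed per-set bit ranges, which is a simplification of OMASS (the paper maps the per-set bits with hashing rather than fixed ranges); all constants and names are illustrative.

// Sketch of one-memory-access set separation: an element is hashed to a single
// 64-bit word; inside that word each set owns a disjoint range of bit positions,
// so inserting x for set S never touches the bits that answer queries for x
// under another set T. Simplified relative to OMASS; constants are illustrative.
public class BlockSetSeparation {
    static final int WORDS = 1024;     // memory blocks, one word each
    static final int NUM_SETS = 4;     // sets to separate
    static final int BITS_PER_SET = 64 / NUM_SETS;
    static final int K = 2;            // bits set per (element, set)

    private final long[] memory = new long[WORDS];

    private static int wordIndex(String element) {
        return Math.floorMod(element.hashCode() * 0x9E3779B9, WORDS);
    }

    private static int bitPosition(String element, int set, int i) {
        int inSet = Math.floorMod((element + "#" + i).hashCode(), BITS_PER_SET);
        return set * BITS_PER_SET + inSet;  // stays inside the set's own bit range
    }

    public void add(String element, int set) {
        int w = wordIndex(element);
        for (int i = 0; i < K; i++) {
            memory[w] |= 1L << bitPosition(element, set, i);
        }
    }

    public boolean mightContain(String element, int set) {
        int w = wordIndex(element);
        for (int i = 0; i < K; i++) {
            if ((memory[w] & (1L << bitPosition(element, set, i))) == 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        BlockSetSeparation filter = new BlockSetSeparation();
        filter.add("10.0.0.1", 2);
        System.out.println(filter.mightContain("10.0.0.1", 2)); // true
        System.out.println(filter.mightContain("10.0.0.1", 3)); // false: set 3's bit range was never written
    }
}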

Items shared through Social Media may affect more than one user's privacy, e.g., photos that depict multiple users, comments that mention multiple users, or events to which multiple users are invited. The lack of multi-party privacy management support in current mainstream Social Media infrastructures leaves users unable to appropriately control with whom these items are actually shared. Computational mechanisms that are able to merge the privacy preferences of multiple users into a single policy for an item can help solve this problem. However, merging multiple users' privacy preferences is not an easy task, because privacy preferences may conflict, so methods to resolve conflicts are needed. Moreover, these methods need to consider how users would actually reach an agreement about a solution to the conflict, in order to propose solutions that are acceptable to all of the users affected by the item to be shared. Current approaches are either too demanding or only consider fixed ways of aggregating privacy preferences. In this paper, we propose the first computational mechanism to resolve conflicts for multi-party privacy management in Social Media that is able to adapt to different situations by modelling the concessions that users make to reach a solution to the conflicts. We also present results of a user study in which our proposed mechanism outperformed other existing approaches in terms of how many times each approach matched users' behaviour.

ETPL DM-023: Resolving Multi-party Privacy Conflicts in Social Media

Data exchange is the process of generating an instance of a target schema from an instance of a source schema such that the source data is reflected in the target. Generally, data exchange is performed using a schema mapping, which represents high-level relations between the source and target schemas. In this paper, we argue that data exchange based solely on schema-level information limits the ability to express semantics in data exchange. We show that such schema-level mappings not only may result in entity fragmentation, but are also unable to resolve some ambiguous data exchange scenarios. To address this problem, we propose Scalable Entity Preserving Data Exchange (SEDEX), a hybrid method based on data and schema mapping that employs similarities between relation trees of source and target relations to find the best relations that can host source instances. Our experiments show that SEDEX outperforms other methods in terms of quality and scalability of data exchange.

ETPL DM-024: SEDEX: Scalable Entity Preserving Data Exchange

Despite recent advances in distributed RDF data management, processing large amounts of RDF data

in the cloud is still very challenging. In spite of its seemingly simple data model, RDF actually encodes

rich and complex graphs mixing both instance and schema-level data. Sharding such data using

classical techniques or partitioning the graph using traditional min-cut algorithms leads to very

inefficient distributed operations and to a high number of joins. In this paper, we describe DiploCloud,

an efficient and scalable distributed RDF data management system for the cloud. Contrary to previous

approaches, DiploCloud runs a physiological analysis of both instance and schema information prior

to partitioning the data. In this paper, we describe the architecture of DiploCloud, its main data

structures, as well as the new algorithms we use to partition and distribute data. We also present an

extensive evaluation of DiploCloud showing that our system is often two orders of magnitude faster

than state-of-the-art systems on standard workloads.

ETPL DM-025: DiploCloud: Efficient and Scalable Management of RDF Data in the Cloud

The rapid advance of location acquisition technologies boosts the generation of trajectory data, which track

the traces of moving objects. A trajectory is typically represented by a sequence of time-stamped

geographical locations. A wide spectrum of applications can benefit from the trajectory data mining.

Bringing unprecedented opportunities, large-scale trajectory data also pose great challenges. In this

paper, we survey various applications of trajectory data mining, e.g., path discovery, location

prediction, movement behaviour analysis, and so on. Furthermore, this paper reviews an extensive

collection of existing trajectory data mining techniques and discusses them in a framework of trajectory

data mining. This framework and the survey can be used as a guideline for designing future trajectory

data mining solutions.

ETPL DM-026: A Survey on Trajectory Data Mining: Techniques and Applications

In this paper, we consider a new insider threat for the privacy preserving work of distributed kernel-

based data mining (DKBDM), such as distributed support vector machine. Among several known data

breaching problems, those associated with insider attacks have been rising significantly, making this

one of the fastest growing types of security breaches. Once considered a negligible concern, insider

attacks have risen to be one of the top three central data violations. Insider-related research involving

the distribution of kernel-based data mining is limited, resulting in substantial vulnerabilities in

designing protection against collaborative organizations. Prior works often fall short by addressing a

multifactorial model that is more limited in scope and implementation than addressing insiders within

an organization colluding with outsiders. A faulty system allows collusion to go unnoticed when an

insider shares data with an outsider, who can then recover the original data from message transmissions

(intermediary kernel values) among organizations. This attack requires only accessibility to a few data

entries within the organizations rather than requiring the encrypted administrative privileges typically

found in the distribution of data mining scenarios. To the best of our knowledge, we are the first to

explore this new insider threat in DKBDM. We also analytically demonstrate the minimum amount of

insider data necessary to launch the insider attack. Finally, we follow up by introducing several

proposed privacy-preserving schemes to counter the described attack.

ETPL DM-027: Insider Collusion Attack on Privacy-Preserving Kernel-Based Data Mining Systems

Frequent sequence mining is a well-known and well-studied problem in data mining. The output of the algorithm is used in many other areas such as bioinformatics, chemistry, and market basket analysis. Unfortunately, frequent sequence mining is computationally quite expensive. In this paper, we present a novel parallel algorithm for mining frequent sequences based on static load-balancing. The static load-balancing is done by measuring the computational time using a probabilistic algorithm. For instances of reasonable size, the algorithms achieve speedups of up to P, where P is the number of processors. In the experimental evaluation, we show that our method performs significantly better than the current state-of-the-art methods. The presented approach is very universal: it can be used for static load-balancing of other pattern mining algorithms, such as itemset, tree, or graph mining algorithms.

ETPL DM-028: Probabilistic Static Load-Balancing of Parallel Mining of Frequent Sequences
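
The general idea of static load-balancing can be sketched independently of the mining itself: once a per-task cost estimate is available (here supplied directly, standing in for the paper's probabilistic time measurement), tasks are assigned to processors up front with a greedy longest-processing-time rule. This is an illustrative sketch, not the paper's algorithm.

import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

// Sketch of static load-balancing: given estimated per-task costs, assign the
// heaviest tasks first, each to the currently least loaded processor, so the
// work is partitioned before the parallel mining run starts.
public class StaticLoadBalance {

    // Returns, for each task index, the processor it is assigned to.
    static int[] assign(double[] estimatedCost, int processors) {
        Integer[] order = new Integer[estimatedCost.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        // heaviest tasks first
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -estimatedCost[i]));

        // min-heap of (current load, processor id)
        PriorityQueue<double[]> loads = new PriorityQueue<>(Comparator.comparingDouble(a -> a[0]));
        for (int p = 0; p < processors; p++) loads.add(new double[]{0.0, p});

        int[] assignment = new int[estimatedCost.length];
        for (int task : order) {
            double[] least = loads.poll();          // least loaded processor so far
            assignment[task] = (int) least[1];
            least[0] += estimatedCost[task];
            loads.add(least);
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[] cost = {8.0, 3.0, 5.0, 7.0, 2.0, 4.0};
        System.out.println(Arrays.toString(assign(cost, 2)));
    }
}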

As more and more applications produce streaming data, clustering data streams has become an

important technique for data and knowledge engineering. A typical approach is to summarize the data

stream in real-time with an online process into a large number of so-called micro-clusters. Micro-

clusters represent local density estimates by aggregating the information of many data points in a

defined area. On demand, a (modified) conventional clustering algorithm is used in a second offline

step to recluster the micro-clusters into larger final clusters. For reclustering, the centers of the micro-

clusters are used as pseudo points with the density estimates used as their weights. However,

information about density in the area between micro-clusters is not preserved in the online process and

reclustering is based on possibly inaccurate assumptions about the distribution of data within and

between micro-clusters (e.g., uniform or Gaussian). This paper describes DBSTREAM, the first micro-

cluster-based online clustering component that explicitly captures the density between micro-clusters

via a shared density graph. The density information in this graph is then exploited for reclustering

based on actual density between adjacent micro-clusters. We discuss the space and time complexity of

maintaining the shared density graph. Experiments on a wide range of synthetic and real data sets

highlight that using shared density improves clustering quality over other popular data stream

clustering methods which require the creation of a larger number of smaller micro-clusters to achieve

comparable results.

ETPL DM-029: Clustering Data Streams Based on Shared Density between Micro-Clusters

Existing parallel mining algorithms for frequent itemsets lack a mechanism that enables automatic

parallelization, load balancing, data distribution, and fault tolerance on large clusters. As a solution to

this problem, we design a parallel frequent itemsets mining algorithm called FiDoop using the

MapReduce programming model. To achieve compressed storage and avoid building conditional

pattern bases, FiDoop incorporates the frequent items ultrametric tree, rather than conventional FP

trees. In FiDoop, three MapReduce jobs are implemented to complete the mining task. In the crucial

third MapReduce job, the mappers independently decompose itemsets, the reducers perform

combination operations by constructing small ultrametric trees, and then mine these trees

separately. We implement FiDoop on our in-house Hadoop cluster. We show that FiDoop on the cluster

is sensitive to data distribution and dimensions, because itemsets with different lengths have different

decomposition and construction costs. To improve FiDoop's performance, we develop a workload

balance metric to measure load balance across the cluster's computing nodes. We develop FiDoop-HD,

an extension of FiDoop, to speed up the mining performance for high-dimensional data analysis.

Extensive experiments using real-world celestial spectral data demonstrate that our proposed solution

is efficient and scalable.

ETPL DM-030: FiDoop: Parallel Mining of Frequent Itemsets Using MapReduce
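
As context, the logic of the first of the three MapReduce jobs (counting item supports and keeping the frequent items) can be sketched in plain Java as below; the real system runs this as a Hadoop map/reduce pair, and the ultrametric-tree construction of the later jobs is not shown. Names are illustrative.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java sketch of the logic of the first MapReduce job: count the support
// of every item across all transactions and keep only the frequent ones.
public class FrequentItems {

    static Map<String, Integer> frequentItems(List<List<String>> transactions, int minSupport) {
        Map<String, Integer> support = new HashMap<>();
        for (List<String> t : transactions) {          // "map": emit (item, 1)
            for (String item : t) {
                support.merge(item, 1, Integer::sum);  // "reduce": sum the counts
            }
        }
        return support.entrySet().stream()
                .filter(e -> e.getValue() >= minSupport)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }

    public static void main(String[] args) {
        List<List<String>> db = List.of(
                List.of("a", "b", "c"),
                List.of("a", "c"),
                List.of("b", "c"),
                List.of("a", "c", "d"));
        System.out.println(frequentItems(db, 3)); // a=3, c=4 (b and d are filtered out)
    }
}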

Mining communities or clusters in networks is valuable in analyzing, designing, and optimizing many

natural and engineering complex systems, e.g. protein networks, power grid, and transportation

systems. Most of the existing techniques view the community mining problem as an optimization

problem based on a given quality function (e.g., modularity); however, none of them is grounded in

a systematic theory to identify the central nodes in the network. Moreover, how to reconcile the mining

efficiency and the community quality still remains an open problem. In this paper, we attempt to

address the above challenges by introducing a novel algorithm. First, a kernel function with a tunable

influence factor is proposed to measure the leadership of each node; the nodes with the highest local

leadership can be viewed as candidate central nodes. Then, we use a discrete-time dynamical system

to describe the dynamical assignment of community membership, and formulate several conditions

to guarantee the convergence of each node’s dynamic trajectory, by which the hierarchical community

structure of the network can be revealed. The proposed dynamical system is independent of the quality

function used, so could also be applied in other community mining models. Our algorithm is highly

efficient: the computational complexity analysis shows that the execution time is nearly linearly

dependent on the number of nodes in sparse networks. We finally give demonstrative applications of

the algorithm to a set of synthetic benchmark networks and also real-world networks to verify the

algorithmic performance.

ETPL DM-031: Fast and accurate mining the community structure: integrating center locating and membership optimization

In mobile communication, spatial queries pose a serious threat to user location privacy because the

location of a query may reveal sensitive information about the mobile user. In this paper, we study

approximate k nearest neighbour (kNN) queries where the mobile user queries the location-based

service (LBS) provider about approximate k nearest points of interest (POIs) on the basis of his current

location. We propose a basic solution and a generic solution for the mobile user to preserve his location

and query privacy in approximate kNN queries. The proposed solutions are mainly built on the Paillier

public-key cryptosystem and can provide both location and query privacy. To preserve query privacy,

our basic solution allows the mobile user to retrieve one type of POIs, for example, approximate k

nearest car parks, without revealing to the LBS provider what type of points is retrieved. Our generic

solution can be applied to multiple discrete type attributes of private location-based queries. Compared

with existing solutions for kNN queries with location privacy, our solution is more efficient.

Experiments have shown that our solution is practical for kNN queries.

ETPL DM-032: Practical Approximate k Nearest Neighbour Queries with Location and Query Privacy
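
The solutions rest on the additive homomorphism of the Paillier cryptosystem: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The sketch below is a textbook Paillier implementation with a toy key size, included only to illustrate that property; it is not the paper's query protocol.

import java.math.BigInteger;
import java.security.SecureRandom;

// Textbook Paillier cryptosystem sketch, showing the additively homomorphic
// property: E(m1) * E(m2) mod n^2 decrypts to m1 + m2. Toy key size, for
// illustration only.
public class PaillierDemo {
    final BigInteger n, nSquared, g, lambda, mu;
    final SecureRandom rnd = new SecureRandom();

    PaillierDemo(int bits) {
        BigInteger p = BigInteger.probablePrime(bits / 2, rnd);
        BigInteger q = BigInteger.probablePrime(bits / 2, rnd);
        n = p.multiply(q);
        nSquared = n.multiply(n);
        g = n.add(BigInteger.ONE);                       // standard choice g = n + 1
        BigInteger p1 = p.subtract(BigInteger.ONE), q1 = q.subtract(BigInteger.ONE);
        lambda = p1.multiply(q1).divide(p1.gcd(q1));     // lcm(p-1, q-1)
        mu = lambda.modInverse(n);                       // valid because g = n + 1
    }

    BigInteger encrypt(BigInteger m) {
        BigInteger r = new BigInteger(n.bitLength() - 1, rnd).add(BigInteger.ONE);
        return g.modPow(m, nSquared).multiply(r.modPow(n, nSquared)).mod(nSquared);
    }

    BigInteger decrypt(BigInteger c) {
        BigInteger l = c.modPow(lambda, nSquared).subtract(BigInteger.ONE).divide(n);
        return l.multiply(mu).mod(n);
    }

    public static void main(String[] args) {
        PaillierDemo ph = new PaillierDemo(512);
        BigInteger c1 = ph.encrypt(BigInteger.valueOf(41));
        BigInteger c2 = ph.encrypt(BigInteger.valueOf(17));
        // Ciphertext multiplication corresponds to plaintext addition.
        System.out.println(ph.decrypt(c1.multiply(c2).mod(ph.nSquared))); // 58
    }
}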

With advances in geo-positioning technologies and geo-location services, there is a rapidly growing

amount of spatio-textual objects collected in many applications such as location-based services and

social networks, in which an object is described by its spatial location and a set of keywords (terms).

Consequently, the study of spatial keyword search which explores both location and textual description

of the objects has attracted great attention from the commercial organizations and research

communities. In the paper, we study two fundamental problems in the spatial keyword queries: top k

spatial keyword search (TOPK-SK), and batch top k spatial keyword search (BTOPK-SK). Given a set

of spatio-textual objects, a query location and a set of query keywords, the TOPK-SK retrieves the

closest k objects each of which contains all keywords in the query. BTOPK-SK is the batch processing

of sets of TOPK-SK queries. Based on the inverted index and the linear quadtree, we propose a novel

index structure, called inverted linear quadtree (IL-Quadtree), which is carefully designed to exploit

both spatial and keyword based pruning techniques to effectively reduce the search space. An efficient

algorithm is then developed to tackle top k spatial keyword search. To further enhance the filtering

capability of the signature of linear quadtree, we propose a partition based method. In addition, to deal

with BTOPK-SK, we design a new computing paradigm which partitions the queries into groups based

on both spatial proximity and the textual relevance between queries. We show that the IL-Quadtree

technique can also efficiently support BTOPK-SK. Comprehensive experiments on real and synthetic

data clearly demonstrate the efficiency of our methods.

ETPL DM-033: Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search
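
The two ingredients of the index can be sketched briefly: a linear quadtree cell is a Morton (Z-order) key obtained by interleaving the bits of a cell's x and y coordinates, and the inverted part maps each keyword to the Morton keys of the objects containing it. The sketch below uses a fixed resolution and illustrative names; it is not the authors' index structure or its pruning machinery.

import java.util.*;

// Sketch of an inverted linear quadtree: (1) a Morton key per cell, obtained by
// interleaving coordinate bits, and (2) an inverted index from keyword to the
// Morton keys of objects containing that keyword. Fixed resolution, simplified.
public class InvertedLinearQuadtree {
    static final int LEVELS = 16;   // quadtree depth -> 2^16 x 2^16 grid

    // Interleave the low 16 bits of x and y into a 32-bit Morton key.
    static long mortonKey(int x, int y) {
        long key = 0;
        for (int i = 0; i < LEVELS; i++) {
            key |= (long) ((x >> i) & 1) << (2 * i);
            key |= (long) ((y >> i) & 1) << (2 * i + 1);
        }
        return key;
    }

    private final Map<String, TreeSet<Long>> invertedIndex = new HashMap<>();

    void addObject(int x, int y, Set<String> keywords) {
        long cell = mortonKey(x, y);
        for (String kw : keywords) {
            invertedIndex.computeIfAbsent(kw, k -> new TreeSet<>()).add(cell);
        }
    }

    // Cells whose objects collectively cover all query keywords (candidate cells;
    // a cell may still be a false positive if different objects cover different keywords).
    SortedSet<Long> candidateCells(Set<String> queryKeywords) {
        TreeSet<Long> result = null;
        for (String kw : queryKeywords) {
            TreeSet<Long> cells = invertedIndex.getOrDefault(kw, new TreeSet<>());
            if (result == null) result = new TreeSet<>(cells);
            else result.retainAll(cells);
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        InvertedLinearQuadtree idx = new InvertedLinearQuadtree();
        idx.addObject(100, 200, Set.of("coffee", "wifi"));
        idx.addObject(105, 210, Set.of("coffee"));
        System.out.println(idx.candidateCells(Set.of("coffee", "wifi")));
    }
}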

We propose TrustSVD, a trust-based matrix factorization technique for recommendations. TrustSVD

integrates multiple information sources into the recommendation model in order to reduce the data

sparsity and cold start problems and their degradation of recommendation performance. An analysis of

social trust data from four real-world data sets suggests that not only the explicit but also the implicit

influence of both ratings and trust should be taken into consideration in a recommendation model.

TrustSVD therefore builds on top of a state-of-the-art recommendation algorithm, SVD++ (which uses the

explicit and implicit influence of rated items), by further incorporating both the explicit and implicit

influence of trusted and trusting users on the prediction of items for an active user. The proposed

technique is the first to extend SVD++ with social trust information. Experimental results on the four

data sets demonstrate that TrustSVD achieves better accuracy than ten counterpart

recommendation techniques.

ETPL DM-034: A Novel Recommendation Model Regularized with User Trust and Item Ratings
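
A minimal sketch of the resulting prediction rule, assuming the factor vectors have already been learned: the SVD++ user factor is shifted by the (normalized) implicit influence of the items the user rated and of the users they trust. Symbols and method names are illustrative; the training procedure and regularization terms are omitted.

import java.util.List;

// Sketch of a TrustSVD-style rating prediction: the SVD++ prediction is extended
// so that, besides the implicit influence of rated items (y factors), the implicit
// influence of trusted users (w factors) also shifts the user factor.
public class TrustSvdPredict {

    static double predict(double globalMean, double userBias, double itemBias,
                          double[] userFactor, double[] itemFactor,
                          List<double[]> ratedItemFactors,     // y_i for items rated by the user
                          List<double[]> trustedUserFactors) { // w_v for users the user trusts
        int d = userFactor.length;
        double[] p = userFactor.clone();

        double normI = ratedItemFactors.isEmpty() ? 0 : 1.0 / Math.sqrt(ratedItemFactors.size());
        for (double[] y : ratedItemFactors)
            for (int f = 0; f < d; f++) p[f] += normI * y[f];

        double normT = trustedUserFactors.isEmpty() ? 0 : 1.0 / Math.sqrt(trustedUserFactors.size());
        for (double[] w : trustedUserFactors)
            for (int f = 0; f < d; f++) p[f] += normT * w[f];

        double dot = 0;
        for (int f = 0; f < d; f++) dot += itemFactor[f] * p[f];
        return globalMean + userBias + itemBias + dot;
    }

    public static void main(String[] args) {
        double r = predict(3.6, 0.2, -0.1,
                new double[]{0.3, -0.1}, new double[]{0.5, 0.4},
                List.of(new double[]{0.1, 0.0}),
                List.of(new double[]{0.2, 0.1}));
        System.out.println(r);
    }
}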


Although the matrix completion paradigm provides an appealing solution to the collaborative filtering

problem in recommendation systems, some major issues, such as data sparsity and cold-start problems,

still remain open. In particular, when the rating data for a subset of users or items is entirely missing,

commonly known as the cold-start problem, the standard matrix completion methods are inapplicable

due to the non-uniform sampling of available ratings. In recent years, there has been considerable interest

in dealing with cold-start users or items that are principally based on the idea of exploiting other sources

of information to compensate for this lack of rating data. In this paper, we propose a novel and general

algorithmic framework based on matrix completion that simultaneously exploits the similarity

information among users and items to alleviate the cold-start problem. In contrast to existing methods,

our proposed recommender algorithm, dubbed DecRec, decouples the following two aspects of the

cold-start problem to effectively exploit the side information: (i) the completion of a rating sub-matrix,

which is generated by excluding cold-start users/items from the original rating matrix; and (ii) the

transduction of knowledge from existing ratings to cold-start items/users using side information. This

crucial difference prevents the error propagation of completion and transduction, and also significantly

boosts the performance when appropriate side information is incorporated. The recovery error of the

proposed algorithm is analyzed theoretically and, to the best of our knowledge, this is the first algorithm

that addresses the cold-start problem with provable guarantees on performance. Additionally, we also

address the problem where both cold-start user and item challenges are present simultaneously. We

conduct thorough experiments on real datasets that complement our theoretical results. These

experiments demonstrate the effectiveness of the proposed algorithm in handling the cold-start

users/items problem and mitigating the data sparsity issue.

ETPL DM-036: Cold-Start Recommendation with Provable Guarantees: A Decoupled Approach