Massive data analysis: applications and challenges

Vijay Raghavan, University of Louisiana at Lafayette

Jayasimha Katukuri, eBay

Ying Xie, Kennesaw State University

Agenda

• Trends and Perspectives

• Kinds of Big Data problems

• Big Data Application scenarios

• Current State of the Art

• Big Data Applications: Examples

• Big Data Analysis: Research Areas

• Conclusions

12/30/2013

Trends and Perspectives

• McKinsey estimated that by 2009 nearly all sectors of the US economy held, on average, at least 200 terabytes of stored data per organization with more than 1,000 employees (Manyika et al., 2011).

• As an example, Walmart’s customer transaction database was reported to be 110 terabytes in 2000. By 2004 it had grown to more than half a petabyte (Schuman, 2004).

• About 80% of the data that organizations own, a growing share, can be classified as unstructured: for example, data held in emails, social media, and multimedia.

Trends and Perspectives (Contd …)

• Given average annual data growth of 59% (Pettey & Goasduff, 2011), the share of unstructured data will likely be much higher in a few years.

• Not only are more people connected to the Internet; the number of physical devices connected to the Internet is also increasing significantly.

• Volume is not the only problem: variety and velocity are also issues we need to look at (Russom, 2011).

Trends and Perspectives (Contd …)

• Big Data: data that is complex in terms of volume, variety, velocity and/or its relation to other data, which makes it hard to handle using traditional database management systems or tools.

• “Through 2015, more than 85% of Fortune 500 organizations will fail to effectively exploit big data for competitive advantage.” (Gartner’s Top Predictions 2012).

• Analysts need to i) cope with massive data distributed across locations; ii) treat data as a resource to understand underlying phenomena (NRC Study, 2013).

The Meaning of Big Data – 3V’s

• Big Volume
  - With simple and complex (SQL) analytics
  - Scaling complex operations

• Big Velocity: drink from the fire hose
  - Beyond OLTP, NoSQL

• Big Variety: large number of diverse data sources to integrate
  - Beyond global schema-based approaches

Velocity- Time to action vs. Value (Hackathorn, 2002)

Kinds of Big Data Problems (Davis, 2012)

Big Data – Big Analytics

• Complex math operations (machine learning, clustering, trend detection, …)
  - mostly specified as linear algebra on array data
  - in the stock market domain, the world of “quants”

• A dozen or so common “inner loops”
  - Matrix multiply
  - QR decomposition
  - Singular value decomposition (SVD)
  - Linear regression

Big Data – Big Analytics- An Example

• Consider the closing price on all trading days for the last 5 years for two stocks, A and B

• What is the covariance between the two time series?

(1/N) * sum_i (A_i - mean(A)) * (B_i - mean(B))

• Now make it more challenging …

All pairs of 4000 selected stocks: a 4000 x 1000 matrix

Hourly, instead of daily?

All securities?
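
As a minimal sketch of the computation above, the following Python code evaluates the covariance of two series and then the all-pairs version for several series. The function names and the toy price values are hypothetical and chosen only for illustration; a production system at the 4000 x 1000 scale would more likely use an array library, which is not assumed here.

```python
from statistics import fmean


def covariance(a, b):
    """Population covariance: (1/N) * sum((a_i - mean(a)) * (b_i - mean(b)))."""
    if len(a) != len(b) or not a:
        raise ValueError("series must be non-empty and of equal length")
    mean_a, mean_b = fmean(a), fmean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def covariance_matrix(series):
    """All-pairs covariance for a list of equal-length series.

    Each row is centered once; entry (i, j) is then a dot product of two
    centered rows divided by N, i.e. one cell of a matrix multiply.
    """
    n = len(series[0])
    centered = [[x - fmean(row) for x in row] for row in series]
    return [[sum(p * q for p, q in zip(r1, r2)) / n for r2 in centered]
            for r1 in centered]


# Hypothetical closing prices: 3 stocks over 4 trading days.
prices = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
cov = covariance_matrix(prices)
```

With S stocks the matrix has S x S entries, so the all-pairs case grows quadratically in the number of securities, which is what makes the hourly and all-securities variants demanding.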

Big Data Application Scenarios-Detecting anomalies or emerging events

• Visa’s fraud detection program

• HP’s compliance detection using its event management solution

• Detecting abnormal situations in ICU

• Detecting server attacks, marketing keywords, environmental hazards

• Detecting terror and diseases

• Detecting national security risks: Singapore’s RAHS (Risk Assessment & Horizon Scanning), covering disease and financial risk

Big Data Application Scenarios-Predicting near future & Trend analysis

• CRM: churn prediction

• Crime prevention by predicting likely locations of criminal activity

• Defect prediction (Volvo)

• Google Flu Trends

• Personalized recommender systems (Amazon)

• Personalized labor support system (Germany, saving 10B euros)

Big Data Application Scenarios-Real-time analysis and Decision Support

• CRM

• Healthcare applications in ICU

• Marketing support

• Navigation service

• Real-time Q/A systems

Big Data Application Scenarios-Pattern Learning

• Google’s automatic language translation

• Apple’s Siri, Google Now

• IBM Watson (Seton Healthcare Family uses Watson to learn from data on 2M patients annually)

Current State of the Art

• Rise of the cloud
  - Big analytics as a service
  - Amazon DynamoDB, Google BigQuery, Windows Azure Tables

• Hadoop (open source): the heart of big data analytics
  - HDFS does not index data
  - Designed to run big jobs over big files, rather than small jobs as fast as possible
  - Several variants: Cloudera, Amazon Elastic MapReduce, IBM InfoSphere

Current State of the Art (contd.)

• Machine learning for massive data sets
  - Hadoop requires mappers and reducers to communicate with each other through a file system (HDFS). Some alternative technologies in this space are:
    - GraphLab (http://graphlab.org/)
    - Apache Spark (http://spark.incubator.apache.org/)

• Real-time analytics
  - Hadoop is not ideal for real-time analytics. Apache Storm (http://storm-project.net/) is one technology that aims to address real-time analytics.

Current State of the Art (contd.)

• In-memory analytics
  - Focuses on the velocity part of big data
  - Oracle Exalytics In-Memory Machine, 1 terabyte RAM
  - SAS High-Performance Analytics (unstructured data)
  - Non-commercial: VoltDB

Big Data Applications- Hypothesis Discovery

Motivation for Literature-based Hypotheses Discovery Systems

• Biomedical research is divided into highly specialized fields and subfields, with poor communication between them.

• The rate of growth of publications makes it difficult for a researcher to derive connections between concepts from different research specialties. It is also an opportunity: literature-based discovery becomes more useful as the literature grows, because more data makes statistical methods more reliable.

• Mining hidden connections among biomedical concepts from large amounts of scientific literature is one of the important goals pursued in this field [1].

• Pfizer uses text mining software to gain a broader understanding before making major investments in specific compounds. It is estimated that $18 billion is spent per year on compounds that never reach market, while $30 billion is spent reinventing what is already in the literature.

Hypothesis Discovery from Biomedical Literature : Example

• Swanson found the hidden connection between “Fish Oil” and “Raynaud's Disease” by finding the concepts common to the document sets of “Fish Oil” and “Raynaud's Disease” [4,5].

[Figure: Fish Oil and Raynaud’s disease linked through the intermediate concepts “High blood viscosity” and “Platelet aggregation”]
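
The idea of finding concepts shared by two literatures can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each document is represented as a set of concept names, the function name `linking_concepts` is hypothetical, and the toy documents below are made up to mirror the figure rather than taken from Medline.

```python
def linking_concepts(docs_a, docs_c):
    """Return the concepts that occur in both literatures.

    Each argument is a list of documents; a document is a set of
    concept names. The result is the set of intermediate concepts
    that could connect topic A to topic C.
    """
    concepts_a = set().union(*docs_a)
    concepts_c = set().union(*docs_c)
    return concepts_a & concepts_c


# Toy document sets (hypothetical), one per literature.
fish_oil_docs = [
    {"fish oil", "high blood viscosity"},
    {"fish oil", "platelet aggregation"},
]
raynaud_docs = [
    {"raynaud's disease", "high blood viscosity"},
    {"raynaud's disease", "platelet aggregation", "vasoconstriction"},
]

shared = linking_concepts(fish_oil_docs, raynaud_docs)
```

Here `shared` contains the two intermediate concepts from the figure; a real system would also need to rank candidates, since large literatures share many uninformative concepts.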

Link Discovery Methods in Biomedical Literature

• The problem of hypotheses discovery in biomedical literature is similar to the link discovery problem.

• The existing approaches for hypotheses discovery have not explored the network topology features used in the link discovery methods.

• The existing approaches do not provide an automated way of evaluating the results.

• Supervised learning methods have not been explored.

Proposed Method: Supervised Link Discovery

• Concept Network: model the whole Medline literature repository as a complex network of biomedical concepts.

• Generate labeled data automatically using Concept Networks corresponding to two different time periods.

• Extract a set of features from the concept network for concept pairs.

• Use a supervised learning approach to learn a model for link discovery.

Concept Network

Each node represents a biomedical concept.

Node attributes:
  - concept name
  - semantic type
  - related authors
  - document frequency

Each edge represents an association between two concepts.

Edge attributes:
  - co-occurrence frequency

Concept Network – Map-Reduce

[Figure: Documents Doc-1, Doc-2 and Doc-3 are processed by Mapper-1, Mapper-2 and Mapper-3, each producing a CCM_local. Reducer-1 and Reducer-2 aggregate the mapper output and write to HDFS. Key: (c1, c2, year); Value: co-count]
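
The data flow in this slide can be imitated in plain Python to show what the mappers and reducers compute. This is a single-process sketch, not Hadoop code: the function names are hypothetical, each document is assumed to arrive as a list of concepts plus a publication year, and the local counter stands in for what the slide calls CCM_local.

```python
from collections import Counter
from itertools import combinations


def map_document(concepts, year):
    """Mapper: count each unordered concept pair once per document.

    Returns a local counter keyed by (c1, c2, year), with c1 < c2 so
    that the same pair always produces the same key.
    """
    local = Counter()
    for c1, c2 in combinations(sorted(set(concepts)), 2):
        local[(c1, c2, year)] += 1
    return local


def reduce_counts(local_counts):
    """Reducer: sum the co-counts that share the same (c1, c2, year) key."""
    total = Counter()
    for local in local_counts:
        total.update(local)
    return total


# Toy corpus (hypothetical): (concepts, year) per document.
docs = [
    (["aspirin", "headache"], 2001),
    (["headache", "aspirin", "fever"], 2001),
    (["aspirin", "fever"], 2002),
]
totals = reduce_counts(map_document(c, y) for c, y in docs)
```

Keeping the year in the key is what later allows the network to be split into snapshots for different time periods.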

Concept Network Statistics

• Total number of concept pairs = 17,356,486

• Total number of documents = 11,021,605

• Total number of concepts = 165,674

Automatic Generation of Labeled Concept Pairs

Gtf and Gts denote the concept networks for the earlier and later time periods, respectively.

For each pair whose connection is strong in Gts: if it has no direct connection in Gtf, we label the pair positive.

For each pair whose connection is weak in Gts: if it has no direct connection in Gtf, we label the pair negative.

Select a random sample of the nodes in Gtf and generate concept pairs from the sample: if a pair has no connection in Gtf and none in Gts, we label it negative.
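
The labeling rules above can be sketched as follows. The code is an illustration under stated assumptions, not the authors' implementation: each network is represented as a dictionary from a concept pair to its co-occurrence count, "strong" is taken to mean a count of at least `min_support`, and "weak" is taken to mean anything below that, including no connection at all. The function and variable names are hypothetical.

```python
def label_pairs(g_tf, g_ts, candidate_pairs, min_support):
    """Label concept pairs using two snapshots of the concept network.

    g_tf, g_ts: dicts mapping (c1, c2) to a co-occurrence count in the
    earlier and later period. Pairs already connected in g_tf are
    skipped, because they are not new links.
    """
    labels = {}
    for pair in candidate_pairs:
        if g_tf.get(pair, 0) > 0:
            continue  # directly connected in the earlier period
        later = g_ts.get(pair, 0)
        # Assumption: strong means later count >= min_support.
        labels[pair] = 1 if later >= min_support else 0
    return labels


# Toy snapshots (hypothetical counts).
g_tf = {("a", "b"): 3}
g_ts = {("a", "b"): 5, ("a", "c"): 4, ("b", "c"): 1}
candidates = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

labels = label_pairs(g_tf, g_ts, candidates, min_support=3)
```

In this toy run ("a", "c") becomes a positive example, ("b", "c") and the never-connected ("c", "d") become negatives, and ("a", "b") is excluded because the link already existed.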

Features

• In addition to the commonly used network topological features, we extract the following features:

• Cycle Free Effective Conductance (CFEC)

• The Semantic-CFEC

• The Author_List Jaccard
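
Of the three features, the author-list Jaccard is the simplest to illustrate. The sketch below assumes the feature is the standard Jaccard coefficient between the sets of authors associated with the two concepts; the function name and the sample author identifiers are hypothetical.

```python
def author_jaccard(authors_a, authors_b):
    """Jaccard coefficient of two author lists: |A & B| / |A | B|."""
    a, b = set(authors_a), set(authors_b)
    union = a | b
    # Two concepts with no known authors are treated as having overlap 0.
    return len(a & b) / len(union) if union else 0.0


score = author_jaccard(["x", "y", "z"], ["y", "z", "w"])
```

A score near 1 means largely the same researchers write about both concepts; a score near 0 means the two concepts are studied by mostly separate communities.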

Feature Extraction

• For each labeled pair, we extract the set of features described earlier from the snapshot of the concept network Gtf.

• To scale feature extraction to a large number of labeled pairs, it is implemented on a Map-Reduce cluster.

• The distributed implementation works as follows:
  - Trim Gtf so that it contains only edges with strength greater than or equal to the minimum support. Store the trimmed Gtf in each mapper’s main memory.
  - Distribute the labeled pairs among the mappers. Each mapper extracts the features for a subset of concept pairs using the trimmed Gtf.
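
The two steps above can be sketched in a single process as follows. This is not the authors' Map-Reduce job: the function names are hypothetical, the round-robin split is just one possible way to distribute pairs, and the common-neighbor count is used only as a stand-in for the topological features computed by each mapper.

```python
def trim_graph(edges, min_support):
    """Keep only edges whose strength is at least min_support."""
    return {pair: w for pair, w in edges.items() if w >= min_support}


def adjacency(trimmed):
    """Build an undirected neighbor map from the trimmed edge set."""
    adj = {}
    for c1, c2 in trimmed:
        adj.setdefault(c1, set()).add(c2)
        adj.setdefault(c2, set()).add(c1)
    return adj


def partition(pairs, n_mappers):
    """Split the labeled pairs among mappers in round-robin order."""
    return [pairs[i::n_mappers] for i in range(n_mappers)]


def map_features(pairs, adj):
    """Mapper: compute the common-neighbor count for each assigned pair."""
    return {(a, b): len(adj.get(a, set()) & adj.get(b, set()))
            for a, b in pairs}


# Toy network (hypothetical edge strengths).
edges = {("a", "x"): 5, ("b", "x"): 4, ("a", "y"): 3,
         ("b", "y"): 6, ("c", "x"): 3, ("c", "y"): 1}
trimmed = trim_graph(edges, min_support=3)
adj = adjacency(trimmed)  # held in memory by every mapper

labeled_pairs = [("a", "b"), ("a", "c"), ("b", "c")]
features = {}
for chunk in partition(labeled_pairs, n_mappers=2):
    features.update(map_features(chunk, adj))
```

Because every mapper holds the same trimmed graph, the pairs can be processed independently, and trimming is what keeps that graph small enough to fit in memory.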

min_support

All measures improved as the value of the parameter min_support increased. As min_support increases, there are fewer positive examples.

10-fold cross-validation was used in all experiments.

Different Classifiers

SVM achieved about 1.5% to 2% higher classification accuracy than decision trees.

Case Study

Prostatic Neoplasms

Adenosine Triphosphate

Oligopeptides

Tumor Necrosis Factor-alpha

Tetradecanoylphorbol Acetate

NF-kappaB inhibitor alpha

Big Data Applications- Recommendations in e-Commerce

eBay Today

Introduction

• Challenges in a dynamic marketplace like eBay
  - Huge inventory
    - Several hundreds of millions
  - Seller-defined listings
    - Listings are short-lived
  - Wide variety
    - From electronics to unique collectibles
    - Majority are unstructured and w/o a product catalog
  - Listing quality
    - Condition, price, shipping, etc.
    - Seller trustworthiness

• Goal for a Recommendation System in eBay
  - Address challenges associated with a dynamic marketplace
  - Scalable and efficient
    - Computationally intensive tasks during offline model generation
    - Efficient online performance system

Motivation – Pre-purchase

• User couldn’t purchase a listing s/he showed interest in
  - Placed a bid but lost the auction
  - “Watched” an item but someone else bought it before s/he was ready to buy

• Similar Item Recommendation (SIR)
  - Recommend replacement items

Motivation – Post-purchase

• User just purchased an item

• Related Item Recommendation (RIR)
  - Inspire incremental purchases
  - Recommend complementary/related items

System Architecture - Overview

[Figure: System architecture in three parts. Offline Model Generation: Clusters Model Generation and Related Clusters Model Generation. The Data Store: Inventory, Clickstream, Transactions, Conceptual Knowledgebase, Clusters, Cluster-Cluster Relations. Real-time Performance System: the Similar Items Recommender (SIR) answers ?similarTo(item) for a lost item and returns similar items; the Related Items Recommender (RIR) answers ?relatedTo(item) for a bought item and returns related items]

Data Store

• Glue between offline and real-time systems

• Raw data
  - Inventory data
  - Clickstream data
  - Transaction data

• Conceptual Knowledgebase
  - Category tree
  - Stop words, spell corrections, synonyms, etc.
  - Term dictionary

• Models
  - Item Clusters
    - “clarks women shoe pumps classics”
    - “authentic handmade amish quilt”
  - Cluster-Cluster Relations
    - “samsung galaxy s4” – “samsung galaxy s4 screen protector”
    - “wolfgang puck electric pressure cooker” – “kitchenaid food processor”

Model Generation - Clusters

• Global clustering not feasible
  - Inventory size in several hundreds of millions
  - Varied inventory ranging from electronic goods to unique collectibles

• Partition input data by user queries
  - Takes advantage of users’ perspective of item similarity

• Parallel distributed K-Means in Hadoop MapReduce

• Feature set
  - Title tokens
  - Category hierarchy
  - Attributes or concepts

• Dedupe and merge overlapping clusters

• 100X reduction in size over inventory with over 90% coverage
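
To make the K-Means step concrete, here is a single-process sketch of how one K-Means iteration splits into a map step and a reduce step. It is not the eBay Hadoop job: the function names are hypothetical, items are assumed to be already encoded as numeric feature vectors, and in the setting described above the routine would be run separately within each query partition.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda k: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[k])))


def kmeans_map(points, centroids):
    """Map step: emit (centroid index, point) for every point."""
    return [(nearest(p, centroids), p) for p in points]


def kmeans_reduce(assignments, centroids):
    """Reduce step: each new centroid is the mean of its assigned points."""
    groups = defaultdict(list)
    for k, p in assignments:
        groups[k].append(p)
    updated = list(centroids)  # centroids with no points stay unchanged
    for k, members in groups.items():
        updated[k] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return updated


def kmeans(points, centroids, iterations=10):
    """Repeat map and reduce for a fixed number of iterations."""
    for _ in range(iterations):
        centroids = kmeans_reduce(kmeans_map(points, centroids), centroids)
    return centroids


# Toy feature vectors (hypothetical) for the items of one query partition.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
final = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

The map step is independent per item, so it parallelizes across mappers, while the reduce step only needs the points grouped by centroid index.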

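[Figure: Clusters Model Generation. Inputs from the Data Store (Inventory, Clickstream): items and user queries. Query-Recall Generation produces query-to-items; Cluster Generation uses these together with concepts and categories to produce new clusters, which are stored as Clusters in the Data Store]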

Model Generation – Related Clusters

• Transactional data
  - Item-Item co-purchase matrix

• Cluster assignment
  - Cluster-Cluster directed graph

• Rank outgoing edges
  - Collaborative filtering
    - Edge strength, i.e., the number of users with the co-purchase
  - Cluster-Cluster content similarity
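
The steps above can be sketched as a small in-memory routine. This is an illustration under assumptions, not the deployed system: co-purchases are assumed to arrive as (user, first item, later item) records, the function and cluster names are hypothetical, and edges are ranked by edge strength only, leaving out the content-similarity signal.

```python
from collections import defaultdict


def cluster_graph(co_purchases, item_to_cluster):
    """Build a ranked, directed cluster-cluster graph from co-purchases.

    Edge strength is the number of distinct users who bought an item
    from the source cluster and later an item from the target cluster.
    """
    users = defaultdict(set)
    for user, first_item, later_item in co_purchases:
        src = item_to_cluster.get(first_item)
        dst = item_to_cluster.get(later_item)
        if src is None or dst is None or src == dst:
            continue  # unassigned items and self-loops are ignored
        users[(src, dst)].add(user)

    graph = defaultdict(list)
    for (src, dst), who in users.items():
        graph[src].append((dst, len(who)))
    for src in graph:
        # Strongest edges first; ties broken by cluster name.
        graph[src].sort(key=lambda edge: (-edge[1], edge[0]))
    return dict(graph)


# Toy data (hypothetical items, users and cluster names).
item_to_cluster = {"p1": "phone", "p2": "phone",
                   "s1": "screen protector", "c1": "case"}
co_purchases = [("u1", "p1", "s1"), ("u2", "p2", "s1"),
                ("u2", "p2", "c1"), ("u1", "p1", "s1")]

related = cluster_graph(co_purchases, item_to_cluster)
```

Working at the cluster level rather than the item level matters here because individual listings are short-lived, while clusters persist long enough to accumulate co-purchase evidence.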

[Figure: Related Clusters Model Generation. Inputs from the Data Store: Transactions (bought item-item), Clusters, and the Conceptual Knowledgebase (concepts, categories). Cluster Assignment turns bought item-item pairs into bought cluster-cluster pairs; Cluster-to-Cluster Model Generation produces related cluster-cluster pairs, stored as Cluster-Cluster Relations in the Data Store]

Experimental Results

• A/B tests comparing against legacy systems
  - SIR legacy system
    - Completely online
    - Naïve approach of using seed item title as a search query
  - RIR legacy system
    - Chen, Y. and J.F. Canny, Recommending ephemeral items at web scale, ACM SIGIR 2011
    - Collaborative filtering on stable representations of items

• Significant improvements at the 90% confidence level
  - SIR resulted in 38.18% higher user engagement (CTR)
  - RIR resulted in 10.5% higher CTR

• Statistically significant improvement in site-wide business metrics from both SIR & RIR

Recommendations in e-Commerce-Conclusions

• Balance between similarity and quality is crucial in driving user engagement and conversion

• Clusters of similar items in the inventory

• Local clustering in the coverage set of user queries

• Offline models built using Map-Reduce

• Huge input datasets including inventory, clickstream and transactional data

• Efficient real-time performance system

• Currently deployed on ebay.com

Big Data Analytics- Research Areas

• Data representation, including transformations that reduce representational complexity

• Computational complexity issues to characterize computational resource needs and tradeoffs

• Statistical model-building in massive data settings having messy data validation issues

• Sampling: both as data gathering and for data reduction

• Methods to include humans in the data analysis loop

Conclusions

• Great opportunity in improving the functioning of many disciplines by leveraging the data and turning the data into knowledge

• Requires an interdisciplinary approach to solving problems of massive data

• A major need exists for software targeted to end users

• Concerted effort is needed to educate students and the workforce in statistical thinking and computational thinking

References

• Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity.

• Schuman, E. (2004, October 13). At Wal-Mart, World's Largest Retail Data Warehouse Gets Even Larger. eWeek. Retrieved August 9, 2012, from http://www.eweek.com/c/a/Enterprise-Applications/At-WalMart-Worlds-Largest-Retail-Data-Warehouse-Gets-Even-Larger/

• Roberts, L. G. (2000). Beyond Moore's law: Internet growth trends. Computer, 33(1), 117–119.

• Pettey, C., & Goasduff, L. (2011, June 27). Gartner Says Solving “Big Data” Challenge Involves More Than Just Managing Volumes of Data. Stamford: Gartner. Retrieved from http://www.gartner.com/it/page.jsp?id=1731916

References (cont’d)

• Gantz, J. F., Mcarthur, J., & Minton, S. (2007). The Expanding Digital Universe. Director, 285(6). doi:10.1002/humu.21252

• Russom, P. (2011). Big Data Analytics. TDWI Research.

• Pettey, C. (2012, October 18). Gartner Identifies the Top 10 Strategic Technologies for 2012. Gartner.

• Hackathorn, R. (2002). Current practices in active data warehousing. Available: http://www.dmreview.com/whitepaper/WID489.pdf

• Seguine, H. (n.d.). Billions and billions: Big Data Becomes a Big Deal. Deloitte. Retrieved from http://www.deloitte.com/view/en_GX/global/insights/c22d83274d1b4310VgnVCM2000001b56f00aRCRD.htm

• Lee, P., & Steward, D. (2012). Technology, Media & Telecommunications Predictions 2012. Deloitte.


• NRC of the National Academies (2013). Frontiers in Massive Data Analysis. The National Academy Press, Washington, D.C. Retrieved from http://www.nap.edu/catalog.php?record_id=18374

• Katukuri, J., Xie, Y., Raghavan, V., and Gupta, A. (2012). “Hypotheses generation as supervised link discovery with automated class labeling on large-scale biomedical concept networks”. BMC Genomics, 13(Suppl 3):S5.

• Katukuri, J., Mukherjee, R., and Konik, T. (2013). “Large scale recommendations in a dynamic marketplace”. ACM RecSys (LSRS workshop).


• Berman, D. K. (n.d.). “Big Data” Firm Raises $84 Million. The Wall Street Journal. Retrieved September 14, 2011, from http://online.wsj.com/article/SB10001424053111903532804576569133957145822.html

• Davis, J. (2012). What Kind of Big Data Problem Do You Have? SAS Blogs. Retrieved December 16, 2012, from http://blogs.sas.com/content/corneroffice/2012/10/08/what-kind-of-big-data-problem-do-you-have/

• Brynjolfsson, E., Hitt, L., & Kim, H. (2011). Strength in Numbers: How Does Data-Driven Decision Making Affect Firm Performance? Last retrieved December 16, 2012.

• Mouthaan, N. (2012). Effects of Big Data Analytics on Organizations’ Value Creation. Master's thesis, University of Amsterdam. Retrieved December 16, 2012, from http://nielsmouthaan.nl/big-data-thesis.pdf