Massive data analysis: challenges and applications
DESCRIPTION
We highlight a few trends in the massive data available to corporations, government agencies and researchers, and some examples of opportunities for turning this data into knowledge. We provide a brief overview of some state-of-the-art technologies in the massive data analysis landscape. Then, we describe in detail two applications from two diverse areas: recommendations in e-commerce and link discovery from biomedical literature. Finally, we present some challenges and open problems in the field of massive data analysis.
TRANSCRIPT
Massive data analysis: applications and challenges
Vijay Raghavan, University of Louisiana at Lafayette
Jayasimha Katukuri, eBay
Ying Xie, Kennesaw State University
Agenda
• Trends and Perspectives
• Kinds of Big Data Problems
• Big Data Application Scenarios
• Current State of the Art
• Big Data Applications: Examples
• Big Data Analysis: Research Areas
• Conclusions
12/30/2013
Trends and Perspectives
• In 2009, McKinsey estimated that nearly all sectors in the US economy had, on average, at least 200 terabytes of stored data per organization (for organizations with more than 1,000 employees).
• As an example, Walmart’s customer transaction database was reported to be 110 terabytes in 2000. By 2004 it had grown to over half a petabyte (Schuman, 2004).
• About 80% of the data that organizations own, a share that is increasing, can be classified as unstructured: for example, data in emails, social media and multimedia.
Trends and Perspectives (Contd …)
• Given average annual data growth of 59% (Pettey & Goasduff, 2011), the share of unstructured data will likely be much higher in a few years.
• Not only is an increasing number of people connected to the Internet; the number of physical devices connected to the Internet is also increasing significantly.
• Volume is not the only problem: the variety and velocity of data are also issues we need to look at (Russom, 2011).
Trends and Perspectives (Contd …)
• Big Data: data that is complex in terms of volume, variety, velocity and/or its relation to other data, which makes it hard to handle using traditional database management systems or tools.
• “Through 2015, more than 85% of Fortune 500 organizations will fail to effectively exploit big data for competitive advantage.” (Gartner’s Top Predictions 2012).
• Analysts need to i) cope with massive data distributed across locations; ii) treat data as a resource to understand underlying phenomena (NRC Study, 2013).
The Meaning of Big Data – 3V’s
• Big Volume
  - With simple and complex (SQL) analytics
  - Scaling complex operations
• Big Velocity: drink from the fire hose
  - Beyond OLTP, NoSQL
• Big Variety: large number of diverse data sources to integrate
  - Beyond global schema-based approaches
Velocity: Time to action vs. Value (Hackathorn, 2002)
Kinds of Big Data Problems (Davis, 2012)
Big Data – Big Analytics
• Complex math operations (machine learning, clustering, trend detection, …)
  - mostly specified as linear algebra on array data
  - in the stock market domain, the world of “quants”
• A dozen or so common “inner loops”
  - Matrix multiply
  - QR decomposition
  - Singular value decomposition (SVD)
  - Linear regression
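To make two of the "inner loops" listed above concrete, here is a minimal plain-Python sketch of a matrix multiply and a one-variable least-squares fit. The function names `matmul` and `fit_line` and the toy inputs are illustrative assumptions, not part of the slides; real workloads would rely on an optimized linear algebra library rather than nested Python loops.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive triple-loop matrix product; O(n*m*p) multiply-adds."""
    n, m, p = len(a), len(b), len(b[0])
    if any(len(row) != m for row in a):
        raise ValueError("inner dimensions do not match")
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy inputs, chosen only so the results are easy to check by hand.
product = matmul([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]])
slope, intercept = fit_line([0.0, 1.0, 2.0], [1.0, 3.0, 5.0])
```

Both routines reduce to sums of products over array data, which is why these operations dominate the cost of analytics at scale.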
Big Data – Big Analytics: An Example
• Consider the closing price on all trading days for the last 5 years for two stocks, A and B
• What is the covariance between the two time series?
(1/N) * sum over i of (A_i – mean(A)) * (B_i – mean(B))
• Now make it more challenging …
  - All pairs of 4000 selected stocks: a 4000 x 1000 matrix
  - Hourly, instead of daily?
  - All securities?
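The covariance formula above can be sketched in plain Python as follows. The function names `covariance` and `covariance_matrix` and the four-day price series are illustrative assumptions; the slides do not prescribe an implementation.

```python
from typing import List, Sequence


def covariance(a: Sequence[float], b: Sequence[float]) -> float:
    """Population covariance: (1/N) * sum((a_i - mean(a)) * (b_i - mean(b)))."""
    if len(a) != len(b) or not a:
        raise ValueError("series must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def covariance_matrix(series: List[Sequence[float]]) -> List[List[float]]:
    """All-pairs covariance; work grows with (number of series)^2 * N."""
    k = len(series)
    cov = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i, k):
            cov[i][j] = cov[j][i] = covariance(series[i], series[j])
    return cov


# Hypothetical closing prices for two stocks over four trading days.
stock_a = [1.0, 2.0, 3.0, 4.0]
stock_b = [2.0, 4.0, 6.0, 8.0]
cov_ab = covariance(stock_a, stock_b)
cov_all = covariance_matrix([stock_a, stock_b])
```

For 4000 stocks the all-pairs version computes roughly 8 million pairwise covariances, which is what turns a one-line formula into a scaling problem.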
Big Data Application Scenarios: Detecting anomalies or emerging events
• Visa’s fraud detection program
• HP’s compliance detection using its event management solution
• Detecting abnormal situations in ICU
• Detecting server attacks, marketing keywords, environmental hazards
• Detecting terror and diseases
• Detecting national security risks (Singapore’s RAHS (Risk Assessment & Horizon Scanning), against disease and financial risk)
Big Data Application Scenarios: Predicting near future & Trend analysis
• CRM: churn prediction
• Crime prevention by predicting likely locations of criminal activities
• Defect prediction (Volvo)
• Google Flu Trends
• Personalized recommender systems (Amazon)
• Personalized labor support system (Germany, saving 10B euro)
Big Data Application Scenarios: Real-time analysis and Decision Support
• CRM
• Healthcare applications in ICU
• Marketing support
• Navigation service
• Real-time Q/A systems
Big Data Application Scenarios: Pattern Learning
• Google’s automatic language translation
• Apple’s Siri, Google Now
• IBM Watson (Seton Healthcare Family uses Watson to learn from 2M patients' data annually)
Current State of the Art
• Rise of the cloud
  - Big analytics as a service
  - Amazon DynamoDB, Google BigQuery, Windows Azure Tables
• Hadoop: open source, the heart of big data analytics
  - HDFS does not index data
  - Designed to run big jobs on big files, rather than small jobs as fast as possible
  - Several variants: Cloudera, Amazon Elastic MapReduce, IBM InfoSphere
Current State of the Art (contd.)
• Machine learning for massive data sets
  - Hadoop requires mappers and reducers to communicate with each other through a file system (HDFS). Some of the alternative technologies in this space are:
    GraphLab (http://graphlab.org/)
    Apache Spark (http://spark.incubator.apache.org/)
• Real-time analytics
  - Hadoop is not ideal for real-time analytics. Apache Storm (http://storm-project.net/) is one technology that is trying to address real-time analytics.
Current State of the Art (contd.)
• In-memory analytics
  - Focuses on the velocity part of big data
  - Oracle Exalytics In-Memory Machine, 1 terabyte RAM
  - SAS High-Performance Analytics (unstructured data)
  - Non-commercial: VoltDB
Big Data Applications: Hypothesis Discovery
Motivation for Literature-based Hypotheses Discovery Systems
• Biomedical research is divided into highly specialized fields and subfields, with poor communication between them.
• The rate of growth of publications makes it difficult for a researcher to derive connections between concepts from different research specialties. It is also an opportunity: literature-based discovery becomes more useful as the literature grows, since more data makes statistical methods more reliable.
• Mining hidden connections among biomedical concepts from large amounts of scientific literature is one of the important goals pursued in this field [1].
• Pfizer uses text mining software to move to a broader understanding before making major investments in specific compounds. It is estimated that $18 billion is spent per year on compounds that never reach market, while $30 billion is spent reinventing what is in the literature.
Hypothesis Discovery from Biomedical Literature: Example
• Swanson found the hidden connection between “Fish Oil” and “Raynaud's Disease” by finding the common concepts in the document sets of “Fish Oil” and “Raynaud's Disease” [4,5].
[Figure: Fish Oil and Raynaud’s disease linked through the intermediate concepts High blood viscosity and Platelet aggregation]
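The idea of finding common concepts between two document sets can be sketched as a set intersection. This is a toy illustration, not Swanson's actual procedure: the function name `linking_concepts` is an assumption, and only the two shared concepts come from the slide; the other entries ("omega-3", "cold exposure") are made up to show that non-shared concepts are filtered out.

```python
from typing import Dict, Set


def linking_concepts(cooccurs: Dict[str, Set[str]], a: str, c: str) -> Set[str]:
    """Concepts that co-occur with both A and C in their document sets."""
    return (cooccurs.get(a, set()) & cooccurs.get(c, set())) - {a, c}


# Toy co-occurrence sets; only the two shared concepts appear on the slide.
cooccurs = {
    "fish oil": {"high blood viscosity", "platelet aggregation", "omega-3"},
    "raynaud's disease": {"high blood viscosity", "platelet aggregation",
                          "cold exposure"},
}
links = linking_concepts(cooccurs, "fish oil", "raynaud's disease")
```

A non-empty result for a pair with no direct co-occurrence is what suggests a candidate hypothesis worth examining.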
Link Discovery Methods in Biomedical Literature
• The problem of hypotheses discovery in biomedical literature is similar to the link discovery problem.
• The existing approaches for hypotheses discovery have not explored the network topology features used in the link discovery methods.
• The existing approaches do not provide an automated way of evaluating the results.
• Supervised learning methods have not been explored.
Proposed Method: Supervised Link Discovery
• Supervised Link Discovery
  - Concept Network: model the whole Medline literature repository as a complex network of biomedical concepts.
  - Generate labeled data automatically using Concept Networks corresponding to two different time periods.
  - Extract a set of features from the concept network for concept pairs.
  - Use a supervised learning approach to learn a model for link discovery.
Concept Network
Each node represents a biomedical concept.
Node attributes:
  - concept name
  - semantic type
  - related authors
  - document frequency
Each edge represents an association between two concepts.
Edge attributes:
  - co-occurrence frequency
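Concept Network – Map-Reduce
[Figure: documents (Doc-1, Doc-2, Doc-3) are processed by mappers (Mapper-1, Mapper-2, Mapper-3), each producing a local result (CCM_local); reducers (Reducer-1, Reducer-2) combine these, with HDFS as storage. Key: (c1, c2, year); value: co-count.]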
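The key/value scheme named on this slide, key (c1, c2, year) and value co-count, can be simulated in a single process as below. This is a sketch under assumptions: the function names, the choice of one mapper call per document, and the sample documents are all hypothetical, and a real job would run on Hadoop rather than in one Python interpreter.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str, int]  # (c1, c2, year)


def map_document(concepts: Iterable[str], year: int) -> Counter:
    """Mapper: local co-occurrence count for every concept pair in one document."""
    local: Counter = Counter()
    # Sorting gives each unordered pair a single canonical key.
    for c1, c2 in combinations(sorted(set(concepts)), 2):
        local[(c1, c2, year)] += 1
    return local


def reduce_counts(mapper_outputs: Iterable[Counter]) -> Dict[Key, int]:
    """Reducer: sum the co-counts that share the same (c1, c2, year) key."""
    totals: Dict[Key, int] = defaultdict(int)
    for local in mapper_outputs:
        for key, count in local.items():
            totals[key] += count
    return dict(totals)


# Hypothetical documents: (concepts mentioned, publication year).
docs = [
    (["fish oil", "platelet aggregation"], 1985),
    (["platelet aggregation", "fish oil", "blood viscosity"], 1985),
    (["blood viscosity", "raynaud disease"], 1986),
]
co_counts = reduce_counts(map_document(c, y) for c, y in docs)
```

Keeping the year in the key is what later allows snapshots of the network to be built for different time periods.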
Concept Network Statistics
• Total number of concept pairs = 17,356,486
• Total number of documents = 11,021,605
• Total number of concepts = 165,674
Automatic Generation of Labeled Concept Pairs
• For each pair whose connection is strong in Gts, if it has no direct connection in Gtf, we assign positive to this pair.
• For each pair whose connection is weak in Gts, if it has no direct connection in Gtf, we assign negative to this pair.
• Select a random sample of the nodes in Gtf and generate concept pairs from the selected random sample. If a pair has no connection in both Gtf and Gts, we assign negative to it.
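The three labeling rules can be sketched as follows. Several things here are assumptions rather than statements from the slides: that Gtf is the earlier snapshot and Gts the later one, that "strong" means a co-occurrence strength of at least a threshold (called `min_support` here), and the names `label_pairs`, `g_tf` and `g_ts`.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Pair = FrozenSet[str]
Graph = Dict[Pair, int]  # concept pair -> co-occurrence strength


def label_pairs(g_tf: Graph, g_ts: Graph, min_support: int,
                sampled_pairs: Iterable[Pair]) -> List[Tuple[Pair, int]]:
    """Return (pair, label) with 1 = positive and 0 = negative."""
    labeled = []
    for pair, strength in g_ts.items():
        if pair in g_tf:
            continue  # already directly connected in the earlier snapshot
        # Strong new connection -> positive; weak new connection -> negative.
        labeled.append((pair, 1 if strength >= min_support else 0))
    for pair in sampled_pairs:
        # Sampled pairs that are connected in neither snapshot are negatives.
        if pair not in g_tf and pair not in g_ts:
            labeled.append((pair, 0))
    return labeled


# Hypothetical snapshots over concepts a, b, c, d.
g_tf = {frozenset({"a", "b"}): 5}
g_ts = {frozenset({"a", "b"}): 9, frozenset({"a", "c"}): 4,
        frozenset({"b", "c"}): 1}
sampled = [frozenset({"a", "d"}), frozenset({"a", "c"})]
labels = dict(label_pairs(g_tf, g_ts, min_support=3, sampled_pairs=sampled))
```

Because the labels come from comparing two snapshots, no manual annotation is needed.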
Features
• In addition to the commonly used network topological features, we extract the following features:
  - Cycle Free Effective Conductance (CFEC)
  - The Semantic-CFEC
  - The Author_List Jaccard
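As an illustration of the last feature, the sketch below computes a Jaccard coefficient between two author sets. It assumes that Author_List Jaccard compares the sets of authors associated with each concept, which the slide does not spell out; the function name and author names are hypothetical.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection over size of the union; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical author lists attached to two concept nodes.
authors_c1 = {"smith", "lee", "garcia"}
authors_c2 = {"lee", "garcia", "chen"}
score = jaccard(authors_c1, authors_c2)
```

A higher score would indicate that the two concepts are studied by overlapping groups of researchers.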
Feature Extraction
• For each labeled pair, we extract the set of features described before from the snapshot of the concept network Gtf.
• To scale feature extraction to a large number of labeled pairs, it is implemented on a Map-Reduce cluster.
• The distributed implementation of feature extraction can be described in the following way:
  - Trim Gtf such that it only contains edges with strength greater than or equal to the minimum support. Store the trimmed Gtf in each mapper’s main memory.
  - Distribute the labeled pairs among the mappers. Each mapper extracts the features for a subset of concept pairs using the trimmed Gtf.
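The two distributed steps can be simulated in one process as below. This is a sketch under assumptions: the common-neighbor count stands in for the actual feature set, the round-robin `partition` is just one possible way to distribute pairs, and all names and edge weights are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def trim(edges: Dict[Edge, int], min_support: int) -> Dict[str, Set[str]]:
    """Keep edges with strength >= min_support; return an adjacency map."""
    adj: Dict[str, Set[str]] = {}
    for (u, v), strength in edges.items():
        if strength >= min_support:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    return adj


def mapper(pairs: List[Edge], adj: Dict[str, Set[str]]) -> Dict[Edge, int]:
    """One mapper: common-neighbor count for its share of the labeled pairs."""
    return {(u, v): len(adj.get(u, set()) & adj.get(v, set())) for u, v in pairs}


def partition(pairs: List[Edge], n_mappers: int) -> List[List[Edge]]:
    """Round-robin split of the labeled pairs among mappers."""
    return [pairs[i::n_mappers] for i in range(n_mappers)]


# Hypothetical weighted edges of Gtf and labeled pairs to featurize.
edges = {("a", "x"): 5, ("c", "x"): 4, ("a", "y"): 3,
         ("c", "y"): 6, ("b", "x"): 3, ("b", "y"): 1}
adj = trim(edges, min_support=3)  # the trimmed graph every mapper holds
pairs = [("a", "c"), ("a", "b"), ("b", "c")]
features: Dict[Edge, int] = {}
for chunk in partition(pairs, n_mappers=2):
    features.update(mapper(chunk, adj))
```

Since each mapper only reads the trimmed graph, the pairs can be processed independently with no communication between mappers.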
min_support
All the measures improved as we increased the value of the parameter min_support. As min_support increases, there are fewer positive examples.
10-fold cross-validation is used in all the experiments.
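For readers unfamiliar with the evaluation protocol, k-fold cross-validation splits the labeled examples into k folds and tests on each fold once. The sketch below generates the index splits only; the function name is an assumption, and in practice the examples would usually be shuffled (and often stratified by label) before splitting.

```python
from typing import Iterator, List, Tuple


def k_fold_indices(n: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each example is tested exactly once."""
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    indices = list(range(n))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(25, k=10))
```

Reported measures are then averaged over the k test folds.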
Different Classifiers
SVM provided around 1.5%-2% better classification accuracy than decision trees.
Case Study
[Figure: case study involving the concepts Prostatic Neoplasms, Adenosine Triphosphate, Oligopeptides, Tumor Necrosis Factor-alpha, Tetradecanoylphorbol Acetate, and NF-kappaB inhibitor alpha]
Big Data Applications: Recommendations in e-Commerce
eBay Today
Introduction
• Challenges in a dynamic marketplace like eBay
  - Huge inventory
    - Several hundreds of millions
  - Seller-defined listings
    - Listings are short-lived
  - Wide variety
    - From electronics to unique collectibles
    - Majority are unstructured and w/o a product catalog
  - Listing quality
    - Condition, price, shipping, etc.
  - Seller trustworthiness
• Goal for a Recommendation System in eBay
  - Address challenges associated with a dynamic marketplace
  - Scalable and efficient
    - Computationally intensive tasks during offline model generation
    - Efficient online performance system
Motivation – Pre-purchase
• User couldn’t purchase a listing s/he showed interest in
  - Placed a bid but lost the auction
  - “Watched” an item but someone else bought it before s/he was ready to buy
• Similar Item Recommendation (SIR)
  - Recommend replacement items
Motivation – Post-purchase
• User just purchased an item
• Related Item Recommendation (RIR)
  - Inspire incremental purchases
  - Recommend complementary/related items
System Architecture - Overview
[Figure: system architecture in three parts. Offline Model Generation: Clusters Model Generation and Related Clusters Model Generation. The Data Store: Inventory, Clickstream, Transactions, Conceptual Knowledgebase, Clusters, Cluster-Cluster Relations. Real-time Performance System: the Similar Items Recommender (SIR) answers ?similarTo(item) for a lost item with similar items; the Related Items Recommender (RIR) answers ?relatedTo(item) for a bought item with related items.]
Data Store
[Figure: data store components: Inventory, Clickstream, Transactions, Conceptual Knowledgebase, Clusters, Cluster-Cluster Relations]
• Glue between offline and real-time systems
• Raw data
  - Inventory data
  - Clickstream data
  - Transaction data
• Conceptual Knowledgebase
  - Category tree
  - Stop words, spell corrections, synonyms, etc.
  - Term dictionary
• Models
  - Item Clusters
    - “clarks women shoe pumps classics”
    - “authentic handmade amish quilt”
  - Cluster-Cluster Relations
    - “samsung galaxy s4” – “samsung galaxy s4 screen protector”
    - “wolfgang puck electric pressure cooker” – “kitchenaid food processor”
Model Generation - Clusters
• Global clustering not feasible
  - Inventory size in several hundreds of millions
  - Varied inventory ranging from electronic goods to unique collectibles
• Partition input data by user queries
  - Take advantage of users’ perspective of item similarity
• Parallel distributed K-Means in Hadoop MapReduce
• Feature set
  - Title tokens
  - Category hierarchy
  - Attributes or concepts
• Dedupe and merge overlapping clusters
• 100X reduction in size over inventory with over 90% coverage
[Figure: Clusters Model Generation. Stages: Query-Recall Generation and Cluster Generation. Data Store components: Inventory, Clickstream, Conceptual Knowledgebase, Clusters. Data flows: items, user queries, query-to-items, concepts and categories, new clusters.]
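One way to picture K-Means expressed as map and reduce phases is the single-process sketch below. It is an illustration under assumptions, not the deployed system: the function names, the 2-D feature vectors, and the initial centroids are hypothetical, and the slides indicate that clustering is applied within the item set of a user query rather than over the whole inventory.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, ...]


def nearest(point: Point, centroids: List[Point]) -> int:
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))


def map_phase(points: List[Point], centroids: List[Point]) -> List[Tuple[int, Point]]:
    """Mapper: emit (centroid index, point) for each point in one input split."""
    return [(nearest(p, centroids), p) for p in points]


def reduce_phase(pairs: List[Tuple[int, Point]], centroids: List[Point]) -> List[Point]:
    """Reducer: new centroid = mean of assigned points; empty clusters are kept."""
    groups: Dict[int, List[Point]] = defaultdict(list)
    for idx, point in pairs:
        groups[idx].append(point)
    updated = list(centroids)
    for idx, members in groups.items():
        updated[idx] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return updated


def kmeans(splits: List[List[Point]], centroids: List[Point],
           iterations: int = 10) -> List[Point]:
    """Run a fixed number of map/reduce rounds over the input splits."""
    for _ in range(iterations):
        emitted = [pair for split in splits for pair in map_phase(split, centroids)]
        centroids = reduce_phase(emitted, centroids)
    return centroids


# Two hypothetical input splits of one query's item set, as 2-D feature vectors.
splits = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 10.0), (10.0, 11.0)]]
final = kmeans(splits, centroids=[(1.0, 1.0), (9.0, 9.0)])
```

Each iteration is one MapReduce job, so the mappers can scan their splits in parallel while only the small centroid list is shared between rounds.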
Model Generation – Related Clusters
• Transactional data
  - Item-Item co-purchase matrix
• Cluster Assignment
  - Cluster-Cluster directed graph
• Rank outgoing edges
  - Collaborative filtering
    - Edge strength, i.e. no. of users with co-purchase
  - Cluster-Cluster content similarity
[Figure: Related Clusters Model Generation. Stages: Cluster Assignment and Cluster-to-Cluster Model Generation. Data Store components: Transactions, Conceptual Knowledgebase, Clusters, Cluster-Cluster Relations. Data flows: bought item-item, clusters, bought cluster-cluster, concepts and categories, related cluster-cluster.]
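The steps from co-purchases to a ranked cluster-cluster graph can be sketched as follows. This covers only the edge-strength part (number of distinct users with a co-purchase); the content-similarity signal is omitted. The function name, the item IDs, and the "phone case" cluster are hypothetical, while the two Samsung cluster names are taken from the Data Store slide.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def related_clusters(purchases: List[Tuple[str, str, str]],
                     item_cluster: Dict[str, str]) -> Dict[str, List[Tuple[str, int]]]:
    """purchases: (user, first_item, later_item). Returns ranked outgoing edges."""
    users: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for user, first, later in purchases:
        src, dst = item_cluster.get(first), item_cluster.get(later)
        if src is None or dst is None or src == dst:
            continue  # unassigned items and self-loops are skipped
        users[(src, dst)].add(user)  # a set counts each user once per edge
    ranked: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for (src, dst), who in users.items():
        ranked[src].append((dst, len(who)))
    for src in ranked:
        # Strongest edge first; ties are broken alphabetically for determinism.
        ranked[src].sort(key=lambda edge: (-edge[1], edge[0]))
    return dict(ranked)


# Hypothetical item-to-cluster assignment and co-purchase records.
item_cluster = {
    "i1": "samsung galaxy s4",
    "i2": "samsung galaxy s4",
    "i3": "samsung galaxy s4 screen protector",
    "i4": "phone case",
}
purchases = [("u1", "i1", "i3"), ("u2", "i2", "i3"),
             ("u2", "i2", "i4"), ("u1", "i1", "i3")]
graph = related_clusters(purchases, item_cluster)
```

Aggregating at the cluster level is what lets short-lived listings inherit purchase signals from earlier, similar items.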
Experimental Results
• A/B tests comparing against legacy systems
  - SIR legacy system
    - Completely online
    - Naïve approach of using seed item title as a search query
  - RIR legacy system
    - Chen, Y. and J.F. Canny, Recommending ephemeral items at web scale, ACM SIGIR 2011
    - Collaborative filtering on stable representations of items
• Significant improvements at the 90% confidence level
  - SIR resulted in 38.18% higher user engagement (CTR)
  - RIR resulted in 10.5% higher CTR
• Statistically significant improvement in site-wide business metrics from both SIR & RIR
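The slides do not say which statistical test was used. As one common way to evaluate a CTR difference in an A/B test, the sketch below computes the relative lift and a two-sided p-value from a pooled two-proportion z-test. The function name and all click and view counts are made up for illustration and are not the eBay experiment data.

```python
import math
from typing import Tuple


def ctr_lift_and_pvalue(clicks_a: int, views_a: int,
                        clicks_b: int, views_b: int) -> Tuple[float, float]:
    """Relative CTR lift of B over A and the two-sided p-value of a pooled z-test."""
    ctr_a, ctr_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (ctr_b - ctr_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return (ctr_b - ctr_a) / ctr_a, p_value


# Made-up counts for illustration only.
lift, p_value = ctr_lift_and_pvalue(clicks_a=1000, views_a=100_000,
                                    clicks_b=1100, views_b=100_000)
significant_at_90 = p_value < 0.10
```

With these hypothetical counts the treatment shows a 10% relative lift, and the difference would be judged significant at the 90% level.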
Recommendations in e-Commerce: Conclusions
• Balance between similarity and quality crucial in driving user engagement and conversion
• Clusters of similar items in the inventory
• Local clustering in the coverage set of user queries
• Offline models built using Map-Reduce
• Huge input datasets including inventory, clickstream and transactional data
• Efficient real-time performance system
• Currently deployed on ebay.com
Big Data Analytics: Research Areas
• Data representation, including transformations that reduce representational complexity
• Computational complexity issues to characterize computational resource needs and tradeoffs
• Statistical model-building in massive data settings having messy data validation issues
• Sampling, both as data gathering and for data reduction
• Methods to include humans in the data analysis loop
Conclusions
• Great opportunity in improving the functioning of many disciplines by leveraging the data and turning the data into knowledge
• Requires an interdisciplinary approach to solving problems of massive data
• A major need exists for software targeted to end users
• Concerted effort is needed to educate students and the workforce in statistical thinking and computational thinking
References
• Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity.
• Schuman, E. (2004, October 13). At Wal-Mart, World's Largest Retail Data Warehouse Gets Even Larger. eWeek. Retrieved August 9, 2012, from http://www.eweek.com/c/a/Enterprise-Applications/At-WalMart-Worlds-Largest-Retail-Data-Warehouse-Gets-Even-Larger/
• Roberts, L. G. (2000). Beyond Moore's law: Internet growth trends. Computer, 33(1), 117–119.
• Pettey, C., & Goasduff, L. (2011, June 27). Gartner Says Solving “Big Data” Challenge Involves More Than Just Managing Volumes of Data. Stamford: Gartner. Retrieved from http://www.gartner.com/it/page.jsp?id=1731916
• Gantz, J. F., Mcarthur, J., & Minton, S. (2007). The Expanding Digital Universe. Director, 285(6). doi:10.1002/humu.21252
• Russom, P. (2011). Big Data Analytics. TDWI Research.
• Pettey, C. (2012, October 18). Gartner Identifies the Top 10 Strategic Technologies for 2012. Gartner.
• Hackathorn, R. (2002). Current practices in active data warehousing. Available: http://www.dmreview.com/whitepaper/WID489.pdf
• Seguine, H. (n.d.). Billions and billions: Big Data Becomes a Big Deal. Deloitte. Retrieved from http://www.deloitte.com/view/en_GX/global/insights/c22d83274d1b4310VgnVCM2000001b56f00aRCRD.htm
• Lee, P., & Steward, D. (2012). Technology, Media & Telecommunications Predictions 2012. Deloitte.
• NRC of the National Academies. (2013). Frontiers in Massive Data Analysis. The National Academy Press, Washington, D.C. Retrieved from http://www.nap.edu/catalog.php?record_id=18374
• Katukuri, J., Xie, Y., Raghavan, V., & Gupta, A. (2012). Hypotheses generation as supervised link discovery with automated class labeling on large-scale biomedical concept networks. BMC Genomics, 13(Suppl 3):S5.
• Katukuri, J., Mukherjee, R., & Konik, T. (2013). Large scale recommendations in a dynamic marketplace. ACM RecSys (LSRS workshop).
• Berman, D. K. (n.d.). “Big Data” Firm Raises $84 Million. The Wall Street Journal. Retrieved September 14, 2011, from http://online.wsj.com/article/SB10001424053111903532804576569133957145822.html
• Davis, J. (2012). What Kind of Big Data Problem Do You Have? SAS Blogs. Retrieved December 16, 2012, from http://blogs.sas.com/content/corneroffice/2012/10/08/what-kind-of-big-data-problem-do-you-have/
• Brynjolfsson, E., Hitt, L., & Kim, H. (2011). Strength in Numbers: How Does Data-Driven Decision Making Affect Firm Performance? Last retrieved on December 16, 2012.
• Mouthaan, N. (2012). Effects of Big Data Analytics on Organizations’ Value Creation. Master Thesis, University of Amsterdam. Retrieved December 16, 2012, from http://nielsmouthaan.nl/big-data-thesis.pdf