Big Data & Analytics: Moving Beyond Hype to Insight
TRANSCRIPT
Big Data: Beyond Hype
Srinath Perera, VP Research, WSO2; Apache Member (@srinath_perera) [email protected]
[Figure: Operational Manager Dashboard]
Big Data Technology | Works? | Understand how to use it | Understand how to run it in production
Search | +1 (Lucene) | +1 | +1
NoSQL | 0+, but DBs are striking back | 0 | 0-
Distributed File Systems | +1 (HDFS) | +1 | +1
Batch Processing | +1 (Hadoop, Spark) | +1 | +1
Realtime Analytics | +1 (CEP, Storm, Flink) | 0- | 0-
Predictive Analytics | 0 (MLlib, R, GraphLab) | 0- | 0-
Visualizations | +1 (D3) | 0- | +1
Success Stories
• Moneyball (baseball drafting)
• Nate Silver correctly predicted the outcome in 49 of the 50 states in the 2008 U.S. presidential election
• Cancer detection from biopsy cells (Big Data found 12 patterns while we only knew 9), http://go.ted.com/CseS
• Bristol-Myers Squibb reduced the time it takes to run clinical trial simulations by 98%
• Xerox used big data to reduce the attrition rate in its call centers by 20%
• Kroger loyalty programs (growth in 45 consecutive quarters)
Premise of Big Data
If you collect data about your business and feed it to a Big Data system, you will find useful insights that provide a competitive advantage.
– E.g. analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on". [Wikipedia]
The underlying assumption is that the way we operate, and our organizations, are inefficient.
Big Data as a Way to Optimize
• Assumption: once you identify your sickness, you are halfway cured
• You must know what is worth optimizing: "premature optimization is the root of all evil"
“Big Data Washing”
You can tick yes, but it is unlikely to make a difference
How to Big Data Wash Your System in 24 Hours?
• Publish or collect whatever data you can with minimal effort
• Do a lot of simple aggregations
• Figure out which data combinations make the prettiest pictures
• Throw in some machine learning algorithms and predict something, but don’t compare the results against anything
• Create a cool dashboard, do a cool demo, and say that you are just scratching the surface!!
Are Insights Automatic?
• I wish
• Only if we have the right data
• Only if we look in the right place
• Only if such insights are there
• Only if we find the insights
What can Big Data do?
• Enterprise Performance Management
• Daily Operational Controls and Reports
• Operational Management (logistics, decision support)
• Social and Community Intelligence (e.g. sentiments, finding champions)
• Sales and Marketing (targeting, channel analytics, SEO analytics, funnel analytics)
• Customer Service (segmentation, recommendations, and churn prediction)
• Preventative maintenance
• Fraud detection
Big Data Tools
• KPIs
• Analytics (batch, real-time, interactive, predictive)
• Visualizations, dashboards
• Alerts
• Sensors (and other data collection plumbing)
KPIs and their Role
• KPIs (Key Performance Indicators) are numbers that give you an idea of how well something is performing
– E.g. countries have them (GDP, per capita income, HDI, etc.)
• Examples
– Company revenue
– Lifetime value of a customer
– Revenue per square foot (in the retail industry)
• The idea is to define them and monitor them. But defining them is hard work!!
• Often one indicator tells only half the story, and you need several that cover different angles
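As a minimal sketch of two of the KPI examples above, the following Python computes revenue per square foot and a simplified customer lifetime value. The function names, the sample figures, and the lifetime-value formula (average order value times orders per year times years retained) are illustrative assumptions, not definitions from the talk.

```python
def revenue_per_square_foot(revenue: float, floor_area_sqft: float) -> float:
    """Retail KPI: revenue generated per square foot of floor space."""
    if floor_area_sqft <= 0:
        raise ValueError("floor area must be positive")
    return revenue / floor_area_sqft


def customer_lifetime_value(avg_order_value: float,
                            orders_per_year: float,
                            years_retained: float) -> float:
    """Simplified lifetime value; ignores margins and discounting (assumption)."""
    return avg_order_value * orders_per_year * years_retained


# Hypothetical numbers, for illustration only.
store_kpi = revenue_per_square_foot(revenue=1_200_000.0, floor_area_sqft=4_000.0)
clv = customer_lifetime_value(avg_order_value=50.0, orders_per_year=6, years_retained=3)
print(store_kpi, clv)
```

Neither number says much alone: a store can have high revenue per square foot and still lose its best customers, which is why several indicators are usually monitored together.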
What is a Dashboard?
• Think of a car dashboard
• It gives you an idea of the overall system at a glance
• It is boring when all is good, and grabs attention when something is wrong
• It often supports drilling down to find the root cause
Alerts
• Notifications (sent via email, SMS, pager, etc.)
• The goal is to give you peace of mind (not having to check all the time)
• They should be specific
• They should be infrequent
• They should have very few false positives
• Let users control sensitivity
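One way to read the alert properties above is as parameters of a rule. The sketch below is a hypothetical threshold alert, not a WSO2 API: `threshold` and `min_consecutive` are the user-controlled sensitivity, requiring several breaches in a row is one simple way to cut false positives, and `cooldown` keeps alerts infrequent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    """Hypothetical threshold alert with user-tunable sensitivity."""
    name: str
    threshold: float          # a reading above this counts as a breach
    min_consecutive: int = 3  # breaches in a row required before alerting
    cooldown: int = 10        # readings to stay silent after an alert
    _streak: int = 0
    _silent_for: int = 0

    def observe(self, value: float) -> Optional[str]:
        """Return an alert message, or None if no alert should be sent."""
        if self._silent_for > 0:
            self._silent_for -= 1
        self._streak = self._streak + 1 if value > self.threshold else 0
        if self._streak >= self.min_consecutive and self._silent_for == 0:
            self._silent_for = self.cooldown
            self._streak = 0
            return f"{self.name}: {value} exceeded {self.threshold}"
        return None


# Hypothetical latency readings in milliseconds.
rule = AlertRule("latency_ms", threshold=500, min_consecutive=2, cooldown=3)
readings = [100, 600, 650, 700, 720, 90, 800, 900]
alerts = [msg for msg in (rule.observe(v) for v in readings) if msg]
print(alerts)
```

With these settings the single spike at 600 is ignored, the sustained breach raises one alert, and the readings that follow during the cooldown stay silent.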
You need a Human in the Loop
Systems that digest your data, take decisions, and run by themselves can, as yet, be used only in limited applications (e.g. algorithmic trading, showing advertisements, or war).
Decisions, Actions, and Drill down
• Operators need to see the data in context, and drill down into the details to understand the root cause
• The typical model is to start from an alert or dashboard, see the data in context (other transactions around the same time, what the same user did before and after, etc.), and then let the user drill down
• For example, http://wso2.com/videos/wso2-fraud-detection-solution
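The "data in context" step above can be sketched as a small lookup. This is an illustrative assumption about the record layout (`user`, `ts`, `amount` fields) and about what counts as "around the same time" (a hypothetical 300-second window); it is not taken from the WSO2 solution linked above.

```python
def context_for(alerted, transactions, window_seconds=300):
    """Collect the context an operator would want next to an alerted transaction."""
    others = [t for t in transactions if t is not alerted]
    return {
        # Everything the same user did, in time order (before and after).
        "same_user": sorted((t for t in others if t["user"] == alerted["user"]),
                            key=lambda t: t["ts"]),
        # Other users' transactions close in time to the alerted one.
        "around_same_time": [t for t in others
                             if t["user"] != alerted["user"]
                             and abs(t["ts"] - alerted["ts"]) <= window_seconds],
    }


# Hypothetical records; ts is seconds since some epoch.
transactions = [
    {"id": 1, "user": "alice", "ts": 1000, "amount": 20},
    {"id": 2, "user": "bob", "ts": 1100, "amount": 35},
    {"id": 3, "user": "alice", "ts": 1200, "amount": 9000},  # the alerted one
    {"id": 4, "user": "carol", "ts": 5000, "amount": 15},
    {"id": 5, "user": "alice", "ts": 1300, "amount": 8500},
]
context = context_for(transactions[2], transactions)
print(context)
```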
Role of Realtime Analytics
• Used to detect something very fast, within a few milliseconds to a few seconds
• Very powerful in detecting conditions over time (e.g. ball possession in a football game)
• Alerts are implemented through realtime analytics
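Detecting a condition over time usually means evaluating events inside a moving window. The sketch below is a toy stand-in for what a CEP or stream-processing engine does, written in plain Python; the class name, the 5-second window, and the limit of 3 events are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events inside a moving time window (a toy stand-in for a CEP query)."""

    def __init__(self, window_seconds: float, max_events: int):
        self.window_seconds = window_seconds
        self.max_events = max_events
        self._timestamps = deque()

    def add(self, timestamp: float) -> bool:
        """Record an event; return True if the window now holds more than max_events."""
        self._timestamps.append(timestamp)
        # Drop events that have fallen out of the window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_events


# Hypothetical failed-login timestamps (seconds): flag more than 3 within 5 seconds.
detector = SlidingWindowCounter(window_seconds=5.0, max_events=3)
flags = [detector.add(ts) for ts in [0.0, 1.0, 1.5, 2.0, 30.0, 30.2]]
print(flags)
```

Each event is processed once as it arrives, which is what lets this kind of check respond within milliseconds rather than waiting for a batch job.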
Role of Predictive Analytics
• Predictive analytics learns to solve a problem from examples
– E.g. learning to drive
• The two main cases are
– Predicting the next value or values (e.g. electricity load prediction)
– Predicting a category (e.g. whether an email is spam or not)
• Used for grouping, to generate alerts, or to augment visualizations
• Needs a lot of expertise to create correct models and use them
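Part of that expertise is checking a model against a trivial baseline before trusting it. The sketch below, for the "predict the next value" case, compares a 3-point moving-average forecast with a naive "same as the last reading" baseline using mean absolute error. The load figures are made up for illustration and both forecasters are deliberately simple stand-ins for a real model.

```python
from statistics import mean


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


# Hypothetical hourly electricity load readings.
hourly_load = [100, 110, 105, 115, 120, 118, 125, 130]
window = 3

actual = hourly_load[window:]
# Baseline: predict that the next value equals the previous one.
naive = hourly_load[window - 1:-1]
# "Model": predict the mean of the previous `window` readings.
moving_avg = [mean(hourly_load[t - window:t]) for t in range(window, len(hourly_load))]

naive_mae = mean_absolute_error(actual, naive)
ma_mae = mean_absolute_error(actual, moving_avg)
print(naive_mae, ma_mae)
```

On this upward-trending series the moving average lags behind and does worse than the naive baseline, which is exactly the kind of result that goes unnoticed when predictions are never compared with anything.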
Big Data Pipeline
Doing it Once is Cheap, Setting up a system to do it continuously is Expensive
Do your scenarios ad-hoc first (hire some expertise if you must), before setting up a system that does it every day
Templates for Big Data Projects
• Use an existing dataset: I already have a data set and a list of potential problems, and need to figure out how to fix them.
• Fix a known problem: Find a problem, collect data about it, analyze, visualize, build a model, and improve. Then build a dashboard to monitor.
• Improve the overall process: Instrument processes (start with the most crucial), find KPIs, analyze and visualize the processes, and improve.
• Find correlations: Collect all available data, mine or visualize the data, and find interesting correlations.
Actionable Insights are the Key!!
• Insights are about significant events that warrant attention (e.g. more than two technical issues would lead a customer to churn)
• Decision makers can identify the context associated with the insight (e.g. operators can see the history of the customers who qualify)
• Decision makers can do something about the insight (e.g. work with the customers to reassure them and fix the issues)
Think Deeply about Who Will Use the System and How
Challenges: Keeping the System Running
● Incorporate continuous data
o Integrate data continuously
o We get feedback about the effectiveness of decisions (e.g. accuracy of fraud detection)
● Track and update models
o Trends change
o Generate models in batch mode and update
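Tracking a model can start as simply as comparing its predictions with the feedback that arrives later. The sketch below keeps a rolling window of outcomes and flags when accuracy falls below a floor; the class name, the window size, and the 0.75 floor are hypothetical choices, and a real system would trigger the batch retraining step at that point.

```python
from collections import deque


class ModelMonitor:
    """Rolling accuracy over the most recent feedback (illustrative sketch)."""

    def __init__(self, window: int = 100, min_accuracy: float = 0.9):
        self.outcomes = deque(maxlen=window)  # True where prediction matched feedback
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual) -> None:
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Fraction of recent predictions that were correct, or None if no feedback yet."""
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else None

    def needs_retraining(self) -> bool:
        """True once the window is full and accuracy is below the floor."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.min_accuracy


# Hypothetical (predicted, confirmed) labels from fraud investigations.
monitor = ModelMonitor(window=4, min_accuracy=0.75)
for predicted, actual in [("fraud", "fraud"), ("ok", "ok"), ("ok", "fraud"), ("fraud", "ok")]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_retraining())
```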
Challenges: Causality
• Correlation does not imply causality!! (the "send a book home" example [1])
• To establish causality
– Repeat the experiment under identical conditions
– If you can't, do a randomized test (A/B test)
– With Big Data we can do neither
• Option 1: We can act on a correlation if we can verify the guess or if correctness is not critical (starting an investigation, checking for a disease, marketing)
• Option 2: We verify correlations using A/B testing or propensity analysis
[1] http://www.freakonomics.com/2008/12/10/the-blagojevich-upside/
[2] https://hbr.org/2014/03/when-to-act-on-a-correlation-and-when-not-to/
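For the A/B testing route in Option 2, one common check is a two-proportion z-test on the outcome rates of the two randomized groups. The sketch below uses only the standard library; the function name and the sample counts are hypothetical, and the normal approximation it relies on assumes reasonably large groups.

```python
from math import erfc, sqrt


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Hypothetical experiment: 200 of 2000 convert in control, 260 of 2000 in treatment.
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)
print(z, p_value)
```

A small p-value here supports acting on the change because the groups were assigned at random; the same difference found in observational data would still be only a correlation.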
Curious Case of Missing Data
http://www.fastcodesign.com/1671172/how-a-story-from-world-war-ii-shapes-facebook-today, Pic from http://www.phibetaiota.net/2011/09/defdog-the-importance-of-selection-bias-in-statistics/
• WW II: returned aircraft, and data on where they were hit
• How would you add armour?
Challenges: Taking Decisions (Context)
Summary
• Big Data provides a way to optimize
• Tools
– KPIs
– Analytics (batch, real-time, interactive, predictive)
– Visualizations, dashboards
– Alerts
– Sensors (and other data collection plumbing)
• Start small
• Try things out with data sets before setting up a system
• Find a high-impact problem and make it work end to end
• Pay attention to user experience
Thank You