How to Avoid Pitfalls in Big Data Analytics Webinar

© 2014 Datameer, Inc. All rights reserved. How to Avoid Pitfalls in Big Data Analytics

Upload: datameer

Post on 27-Jan-2015


Category:

Technology



DESCRIPTION

Big data analytics is revolutionizing the way businesses collect, store, and, more importantly, analyze data. However, the adoption of a big data analytics solution has its share of failures and false starts. Watch this webinar to learn how to navigate the most common obstacles of big data analytics. Datameer and MapR have worked with customers to identify and solve the common pitfalls organizations face when deploying Hadoop-based analytics. In this webinar, we will show you how to:

• Find the balance between infrastructure and business use cases
• Overcome challenges of using multiple tools that address big data analytics
• Leverage all your resources (data scientists, IT and analysts) most effectively

TRANSCRIPT

Page 1: How to Avoid Pitfalls in Big Data Analytics Webinar

© 2014 Datameer, Inc. All rights reserved.

How to Avoid Pitfalls in Big Data Analytics

Page 2: How to Avoid Pitfalls in Big Data Analytics Webinar

View Recording

You can view the recording of this webinar at:

http://info.datameer.com/Online-Slideshare-How-to-Avoid-Pitfalls-in-Big-Data-Analytics-OnDemand.html

Page 3: How to Avoid Pitfalls in Big Data Analytics Webinar

© 2013 Datameer, Inc. All rights reserved.

About Our Speaker

Matt Schumpert, @datameer
Senior Director, Solutions Engineering

Matt has been working in the enterprise infrastructure software space for over 14 years in various capacities, including sales engineering, strategic alliances and consulting. Matt currently runs the pre-sales engineering team at Datameer, supporting all technical aspects of customer engagement from initial contact through roll-out of customers into production. Matt holds a BS in Computer Science from the University of Virginia.

#datameer @datameer

Page 4: How to Avoid Pitfalls in Big Data Analytics Webinar

© 2013 Datameer, Inc. All rights reserved.

About Our Speaker

Dale Kim, @MapR
Director, Product Marketing

Dale Kim is the Director of Product Marketing at MapR. His background includes a variety of technical and management roles at information technology companies. While his experience includes work with relational databases, much of his career pertains to non-relational data in the areas of search, content management, and NoSQL. Dale holds an MBA from Santa Clara University, and a BA in Computer Science from the University of California, Berkeley.

#mapr @mapr

Page 5: How to Avoid Pitfalls in Big Data Analytics Webinar

Agenda

▪ Quick introduction to Hadoop
▪ Overview of analytics on Hadoop
▪ Quick tips on big data analytics
▪ Our 5 big data pitfalls to avoid

Page 6: How to Avoid Pitfalls in Big Data Analytics Webinar

Quick Introduction to Apache Hadoop

▪ What is Apache Hadoop
  – Software framework for reliable, scalable, distributed computing
  – “Divide-and-conquer” approach to processing large data sets
▪ Hadoop does analytics
  – Hadoop is the platform of choice for big data
  – If you have big data, then you are analyzing big data
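The “divide-and-conquer” idea on this slide can be illustrated with a small, single-machine sketch. It is not Hadoop code and does not use any Hadoop API; the function names (`split_into_chunks`, `map_chunk`, `combine`) and the sample lines are hypothetical, chosen only to show the shape of the approach: split the input, process each piece independently, then merge the partial results.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(records, num_chunks):
    """Divide: partition the input into roughly equal chunks."""
    size = max(1, -(-len(records) // num_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def map_chunk(chunk):
    """Conquer: count words in one chunk, independently of the others."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def combine(partials):
    """Combine: merge the per-chunk counts into one result."""
    return reduce(lambda a, b: a + b, partials, Counter())


def word_count(records, num_chunks=3):
    chunks = split_into_chunks(records, num_chunks)
    return combine(map_chunk(chunk) for chunk in chunks)


# Hypothetical sample input, for illustration only.
lines = [
    "big data analytics",
    "big data pitfalls",
    "hadoop does analytics",
    "big clusters",
]
totals = word_count(lines)
print(totals.most_common(3))
```

In a real cluster each chunk would be processed on a different machine, close to where the data is stored; here the chunks are simply processed one after another to keep the sketch runnable anywhere.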

Page 7: How to Avoid Pitfalls in Big Data Analytics Webinar

Types of Analytics for Hadoop

▪ Descriptive – what happened, and why
  – The “why” is also known as “diagnostic”
  – Data mining, management reporting

Page 8: How to Avoid Pitfalls in Big Data Analytics Webinar

Types of Analytics for Hadoop [2]

▪ Predictive – what will happen
  – Cross-sell/up-sell (recommendations), fraud/anomaly detection
▪ Prescriptive – what should I do
  – Preventative maintenance, smart meter analysis

Better with more data

Page 9: How to Avoid Pitfalls in Big Data Analytics Webinar

Common Data Types for Hadoop

▪ Clickstream/user behavior history

▪ Sensor/machine/event logs

▪ Social media profiles & communication

▪ Data warehouse data (structured, SoR)

▪ Long-tail/archive data

Page 10: How to Avoid Pitfalls in Big Data Analytics Webinar

The Foundation for an Analytics Platform

▪ Performance – Make sure you get results in a timely manner

▪ Scalability – Let your platform grow as your data grows

▪ Reliability – Keep your users productive

▪ Ease-of-use – Give users an end-to-end, self-service platform that delivers fast time-to-insight

Page 11: How to Avoid Pitfalls in Big Data Analytics Webinar

Quick Tips on Big Data Analytics

▪ Minimize copying large data volumes across the wire
▪ Plan for production issues (system responsiveness, performance, high availability, disaster recovery, audits)
▪ Start by looking for ways Hadoop can supplement, not supplant, your existing system
▪ Be wary of reusing a classic app. virtualization stack
▪ Choose “built-on”, not “connects-to”, Hadoop vendors
▪ Be wary of lofty claims around machine learning (e.g., IBM Watson)
▪ As Hadoop is an emerging technology, pick innovative rather than legacy vendors

Page 12: How to Avoid Pitfalls in Big Data Analytics Webinar

Common Pitfalls in Big Data Implementations

1. Incomplete plan for scaling up
2. Not architecting for maximum uptime
3. Over-use of immature technologies
4. Excessive/insufficient data governance
5. Wasting data scientists’ time with data preparation

Page 13: How to Avoid Pitfalls in Big Data Analytics Webinar

Incomplete Plan for Scaling Up

RDBMS
• Monolithic, RDBMS-based system
• Vertical scaling
• Large upgrade expenditure

VS.

Hadoop
• Commodity server-based Hadoop system
• Horizontal scaling
• Incremental expenditure

Page 14: How to Avoid Pitfalls in Big Data Analytics Webinar

Incomplete Plan for Scaling Up [2]

▪ Relatively easy to extrapolate existing data load to the future
▪ But, must also factor in:
  – Larger time windows of data
    • Expanding beyond a 3-month time window broke the system
    • Now can store 18 months, which results in more accurate analytics
  – More data sources
    • Typically, new sources that could not be added before
  – More use cases and users
    • More divisions want to join the system
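As a rough illustration of why simple extrapolation falls short, the sketch below projects storage when the retention window and the set of data sources both grow. Only the 3-month and 18-month windows come from the slide; the 30 TB starting volume, the 2 TB/month of new sources, and the function and parameter names are hypothetical. The replication factor of 3 is the HDFS default and is an assumption about the target cluster. The estimate is deliberately coarse: it fills the whole window at the future ingest rate, and it does not model compute demand from additional use cases and users.

```python
def project_storage_tb(current_tb, current_window_months, target_window_months,
                       new_sources_tb_per_month=0.0, monthly_growth_rate=0.0,
                       horizon_months=12, replication_factor=3):
    """Return (logical_tb, raw_tb) needed at the planning horizon.

    A coarse upper estimate: the whole retention window is assumed to be
    filled at the ingest rate reached at the end of the horizon.
    """
    # Ingest rate implied by what is stored today.
    monthly_ingest = current_tb / current_window_months
    # Existing sources grow at a compounded monthly rate.
    future_ingest = monthly_ingest * (1 + monthly_growth_rate) ** horizon_months
    # New sources that could not be added before.
    future_ingest += new_sources_tb_per_month
    # A longer retention window multiplies what must be kept online.
    logical_tb = future_ingest * target_window_months
    return logical_tb, logical_tb * replication_factor


# Hypothetical figures: 30 TB held in a 3-month window today,
# moving to an 18-month window with 2 TB/month of new sources.
logical, raw = project_storage_tb(
    current_tb=30,
    current_window_months=3,
    target_window_months=18,
    new_sources_tb_per_month=2,
)
print(f"logical: {logical:.0f} TB, raw with replication: {raw:.0f} TB")
```

Even with zero growth in the existing sources, the longer window and the extra sources dominate the result, which is the point the slide makes about factoring in more than the current data load.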

Page 15: How to Avoid Pitfalls in Big Data Analytics Webinar

Not Architecting for Maximum Uptime

Separate user communities and data are isolated, but… greater infrastructure complexity and risk

Page 16: How to Avoid Pitfalls in Big Data Analytics Webinar

Not Architecting for Maximum Uptime [2]

▪ Separate physical clusters for separate “tenants” appear easy
▪ Multiple clusters lead to:
  – Infrastructural complexity, more risk of error
  – More points of failure
▪ Instead, leverage software components to help logically separate users/data

Page 17: How to Avoid Pitfalls in Big Data Analytics Webinar

Not Architecting for Maximum Uptime [3]

▪ Global Storage Solutions Company
▪ Deployed file-serving HBase application
▪ Introduced ad-hoc analytics in same cluster
▪ No resource fencing, poor workload mgmt.
▪ Result: Significant downtime
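The case above describes two workloads sharing one cluster with no resource fencing. The toy model below sketches what fencing means in principle: each tenant gets a fixed share of capacity, and a request that would exceed that share is refused instead of starving the other tenant. The class, the tenant names, the slot counts and the percentages are all hypothetical; in a real deployment this role is played by the cluster scheduler and platform features rather than by application code like this.

```python
from dataclasses import dataclass, field


@dataclass
class SharedCluster:
    """Toy model of one cluster logically divided among tenants."""
    total_slots: int
    share_percent: dict                      # tenant -> percent of capacity
    in_use: dict = field(default_factory=dict)

    def quota(self, tenant):
        # Integer arithmetic avoids floating-point rounding surprises.
        return self.total_slots * self.share_percent[tenant] // 100

    def request(self, tenant, slots):
        """Grant the request only if the tenant stays within its quota."""
        used = self.in_use.get(tenant, 0)
        if used + slots > self.quota(tenant):
            return False  # fenced: would eat into another tenant's share
        self.in_use[tenant] = used + slots
        return True

    def release(self, tenant, slots):
        """Return slots to the tenant's pool, never dropping below zero."""
        self.in_use[tenant] = max(0, self.in_use.get(tenant, 0) - slots)


# Hypothetical split: most capacity reserved for the serving workload.
cluster = SharedCluster(
    total_slots=100,
    share_percent={"file_serving": 70, "adhoc_analytics": 30},
)

first = cluster.request("adhoc_analytics", 25)   # fits within the 30-slot quota
second = cluster.request("adhoc_analytics", 10)  # would exceed the quota
serving = cluster.request("file_serving", 70)    # serving capacity is untouched
print(first, second, serving)
```

The design choice to hard-cap each tenant is the simplest form of fencing; schedulers in practice often allow borrowing idle capacity with preemption, which this sketch does not attempt to model.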

Page 18: How to Avoid Pitfalls in Big Data Analytics Webinar

Over-Use of Hadoop Ecosystem Technologies

▪ Research group at a Fortune 500
▪ Anxious to deliver the first NoSQL project
▪ Built an overly complex data model
▪ Deployed HBase with no support/expertise
▪ Lack of integration/analytics = limited success

Page 19: How to Avoid Pitfalls in Big Data Analytics Webinar

Excessive / Insufficient Data Governance

▪ Under-Governed
  – Users deleting “unused data” after a project
  – Incorrectly interpreted as data loss by others
  – Result: panic

▪ Over-Governed
  – Fortune 500 deployed Hadoop as a shared IT service
  – Needed chargebacks based on data volume
  – Set up a “walled garden” for each project
  – Result: no sharing, no collaboration, fewer insights

Page 20: How to Avoid Pitfalls in Big Data Analytics Webinar

Wasting Data Scientists’ Time with Data Prep

▪ DS groups are often the first tenants on Hadoop
▪ Traditional DS tools are weak in data prep
▪ Hadoop tools like Pig unfamiliar to DS users
▪ Result: 80% of time spent on data wrangling

Page 21: How to Avoid Pitfalls in Big Data Analytics Webinar

Demo …

Page 22: How to Avoid Pitfalls in Big Data Analytics Webinar

Datameer: Purpose-Built for Hadoop

Page 23: How to Avoid Pitfalls in Big Data Analytics Webinar

The #1 Data Discovery Platform

Source: GigaOM, 03/14

Page 24: How to Avoid Pitfalls in Big Data Analytics Webinar

MapR Distribution for Hadoop

[Figure: award badges. Recoverable labels: Big Data, Best Product, Business Impact, Hadoop, Top Ranked, Production Success]

Look for our follow-up blog post at: www.mapr.com/blog

Page 25: How to Avoid Pitfalls in Big Data Analytics Webinar

The Power of the Open Source Community

[Diagram: the MapR Data Platform and the Apache Hadoop and OSS ecosystem. Layer labels: Management, Security, Execution Engines, Data Governance and Operations. Group labels: Batch, Streaming, NoSQL & Search, ML/Graph, SQL, Provisioning & coordination, Workflow & Data Governance, Data Integration & Access. Components shown: YARN, Pig, Cascading, Spark, Spark Streaming, Storm*, HBase, Solr, Juju, Savannah*, Mahout, MLLib, GraphX, MapReduce v1 & v2, Tez*, Accumulo*, Hive, Impala, Shark, Drill*, Sentry*, Oozie, ZooKeeper, Sqoop, Knox*, Whirr, Falcon*, Flume, HttpFS, Hue]

* Certification/support planned for 2014

Page 26: How to Avoid Pitfalls in Big Data Analytics Webinar

Projects to Follow

▪ Apache Spark – fast, large-scale data processing engine
  – MapR is the only distribution for Hadoop to support the entire Spark stack

▪ Apache Drill – fast query execution engine
  – MapR-initiated open source project
  – Supports instant querying and a broad range of data formats

Page 27: How to Avoid Pitfalls in Big Data Analytics Webinar
Page 28: How to Avoid Pitfalls in Big Data Analytics Webinar

For more information

Learn more
▪ http://www.datameer.com
▪ http://www.mapr.com
▪ @datameer
▪ @MapR

Contact
▪ [email protected]
▪ [email protected]

#datameer @datameer