decision making in the era of cloud computing and big data

Post on 26-Jan-2015

DESCRIPTION

a talk on cloud computing, big data, data science

TRANSCRIPT

AN INTRODUCTION TO BIG DATA ANALYTICS AND CLOUD COMPUTING

A talk on Decision Making in the Big Data and Cloud Computing Era

May 10, 2014 (1400-1600 Hrs) in Room no. 511, Fifth floor, Department of Management Studies, Vishwakarma Bhawan, IIT Delhi

Your speaker

Ajay Ohri

R for Business Analytics http://www.springer.com/statistics/book/978-1-4614-4342-1

My requirements

What are the key challenges to Data Analytics? How to capture unstructured data and then process it, so that it can be used for analysis? Which methodology can be more efficient to handle Big Data? What are the key technologies that can help to process Big Data?

What skill sets are required to become a Data Scientist? What are the possible key areas for research in Big Data Analytics? What level of programming skills is required to work in this area? Which packages/algorithms are useful? Does R support some inbuilt functionality to make efficient use of multi-core processors?

How can R be used to do data mining in Social Network Data? Can it help HR persons to do analytics to hire the right set of people (HR Analytics)?

How can R be used to perform Regression, Classification, Clustering, Structural Equation Modeling and Data Envelopment Analysis? Illustrate with a real-life example.

How can Cloud computing help in processing or analyzing data efficiently? What are the risks associated with using a Cloud-based model?

My requirements - let’s sort this out

What are the key challenges to Data Analytics? How to capture unstructured data and then process it, so that it can be used for analysis?

How can Cloud computing help in processing or analyzing data efficiently? What are the risks associated with using a Cloud-based model?

Which methodology can be more efficient to handle Big Data? What are the key technologies that can help to process Big Data?

What skill sets are required to become a Data Scientist? What are the possible key areas for research in Big Data Analytics? What level of programming skills is required to work in this area? Can it help HR persons to do analytics to hire the right set of people (HR Analytics)?

How can R be used to do data mining in Social Network Data?

How can R be used to perform Regression, Classification, Clustering, Structural Equation Modeling and Data Envelopment Analysis? Illustrate with a real-life example. Which packages/algorithms are useful? Does R support some inbuilt functionality to make efficient use of multi-core processors?

My requirements - let’s tag this down

Data Analytics and Cloud Computing

Big Data

R

R (Data Science Careers)

My requirements - let’s check this again

Data Analytics and Cloud Computing

Big Data

R

R (Data Science Careers) Incorrect Classification?

Topics to be covered

Business Analytics

Data Science

Big Data

Cloud Computing

R

Sub-topics to be covered

Business Analytics - methodologies, challenges, structured/unstructured data, HR analytics

Data Science - careers, skills

Big Data - technology, skills

Cloud Computing - technology, risks

R - ???

Sub-topics that won’t be covered

R - Data Envelopment Analysis

http://professorjf.webs.com/DEA%202013.pdf

http://www.uri.edu/artsci/ecn/burkett/DEAnotes.pdf

R - Structural Equation Modeling

http://socserv.socsci.mcmaster.ca/jfox/Misc/sem/SEM-paper.pdf

http://cran.r-project.org/doc/contrib/Fox-Companion/appendix-sems.pdf

and if time permits

HR Analytics http://www.slideshare.net/ajayohri/hr-analytics-34080636

Business Analytics

Definition

Business analytics (BA) refers to the field of exploration and investigation of data generated by businesses.

Business Intelligence (BI) is the seamless dissemination of information through the organization, which primarily involves business metrics both past and current for the use of decision support in businesses.

Data Mining (DM) is the process of discovering new patterns from large data using algorithms and statistical methods.

To differentiate between the three, BI is mostly current reports, BA is models to predict and strategize and DM matches patterns in big data.

Business Analytics

Definition

- Information Ladder

- CRISP DM

- KDD

- SEMMA

Business Analytics

- Information Ladder

Data → Information → Knowledge → Understanding → Insight → Wisdom

Whereas the first two steps can be scientifically exactly defined, the upper parts belong to the domain of psychology and philosophy.

Also DIKW

CRISP DM

KDD

SEMMA

Data Mining - a good map http://www.saedsayad.com/

What is a Data Scientist

a data scientist is simply a person who can

write code

understand statistics

derive insights from data

Oh really, is this a Data Scientist?

a data scientist is simply a person who can write code

= in R, Python, Java, SQL, Hadoop (Pig, HQL, MR) etc.

= for data storage, querying, summarization, visualization

= how efficiently, and in time (fast results?)

= where: on databases, on cloud, servers

and understand enough statistics

to derive insights from data so business can make decisions
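As a small illustration of the "querying, summarization" part of that definition, here is a minimal sketch in base R. The built-in mtcars dataset only stands in for business data, and the 100 hp cut-off is an arbitrary assumption for the example.

```r
# Query: keep only cars with more than 100 horsepower (built-in mtcars data)
powerful <- subset(mtcars, hp > 100, select = c(mpg, cyl, hp))

# Summarise: average fuel efficiency (mpg) per cylinder count
avg_mpg <- aggregate(mpg ~ cyl, data = powerful, FUN = mean)
print(avg_mpg)
```

The same two steps (filter, then aggregate) map directly onto a SQL WHERE and GROUP BY, which is why SQL sits next to R in the tool list.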

Some tools

Linux+Java /Python/Pig+R+SQL

Cheat Sheets for Data Scientists http://www.slideshare.net/ajayohri/cheat-sheets-for-data-scientists

Data Scientist Programming Skills

Java http://www.learnjavaonline.org/

Python http://www.codecademy.com/tracks/python

SQL http://www.w3schools.com/sql/

R http://bigdatauniversity.com/bdu-wp/bdu-course/introduction-to-data-analysis-using-r/

http://www.statmethods.net/

Hadoop http://hortonworks.com/hadoop-training/

Linux https://github.com/WilliamHackmore/linuxgems/blob/master/cheat_sheet.org.sh

Other places to learn

MOOCs

1 https://www.edx.org/

2 https://www.coursera.org/

3 https://www.udacity.com/

4 https://www.udemy.com/

Books

Courses

Workshops

Big Data

Statistics on Facebook https://newsroom.fb.com/company-info/

● 802 million daily active users on average in March 2014

● 609 million mobile daily active users on average in March 2014

● 1.28 billion monthly active users as of March 31, 2014

● 1.01 billion mobile monthly active users as of March 31, 2014

Statistics on Twitter https://about.twitter.com/company

● 255 million monthly active users

● 500 million Tweets are sent per day

● 78% of Twitter active users are on mobile

● 77% of accounts are outside the U.S.

Big Data

is changing marketing

is changing marketing models

is much more quantitative compared to earlier marketing

Hadoop - infrastructure for Big Data http://hadoop.apache.org/

Hadoop- evolving ecosystem

Cloud Computing-Google https://cloud.google.com/products/

Compute Engine is Google’s Infrastructure-as-a-Service (IaaS).

App Engine is Google’s Platform-as-a-Service (PaaS).

Storage

Cloud SQL -a fully-managed, relational MySQL database.

Cloud Storage -a simple API that allows you to manage your data programmatically

Cloud Datastore provides a managed, NoSQL, schemaless database for storing non-relational data

Big Data

BigQuery - run fast, SQL-like queries against multi-terabyte datasets in seconds

https://github.com/GoogleCloudPlatform

Cloud Computing - Amazon http://aws.amazon.com/products/

More on Cloud Computing

Challenges and Opportunities for India (from http://chennai.vit.ac.in/isbcc/) http://www.slideshare.net/ajayohri/data-analytics-using-the-cloud-challenges-and-opportunities-for-india

Big Data Big Analytics (http://krishnarajpm.com/bigdata/abstract.pdf Workshop on Statistical Machine Learning and Game Theory Approaches for Large Scale Data Analysis)

http://www.slideshare.net/ajayohri/big-data-big-analytics

R http://www.r-project.org/

Open Source

Free

5000+ Packages

Growing Faster

>2 million users

RAM constraints??

R http://www.r-project.org/

Object Oriented

has GUI and IDE

has Commercial offerings

R - Rattle - Data Mining GUI

R - R Commander

R - RStudio

R - Revolution Analytics

Free for Academics World Wide !!

RevoScaleR package for Big Data

Recommended Install - http://info.revolutionanalytics.com/free-academic.html

My favorite places to learn R

http://www.swirlstats.com/

http://www.twotorials.com/

http://tryr.codeschool.com/

https://www.coursera.org/course/rprog

also see http://blog.datacamp.com/complete-list-of-coursera-courses-using-r-ranked-by-popularity/
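The requirements asked how R can be used for regression, classification and clustering. A minimal sketch using only base R and its built-in datasets might look like the following; the choice of mtcars and iris, the predictors, the 0.5 cut-off and the three clusters are all assumptions made for illustration, not examples from the talk.

```r
# Regression: model fuel efficiency from weight and horsepower (mtcars)
reg <- lm(mpg ~ wt + hp, data = mtcars)

# Classification: logistic regression separating two iris species
two <- droplevels(subset(iris, Species != "setosa"))
clf <- glm(Species ~ Petal.Length + Petal.Width, data = two, family = binomial)
# The fitted probability refers to the second factor level, "virginica"
pred <- ifelse(predict(clf, type = "response") > 0.5, "virginica", "versicolor")
accuracy <- mean(pred == two$Species)  # in-sample accuracy only

# Clustering: k-means with three clusters on the scaled numeric columns
set.seed(42)
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 10)
table(km$cluster, iris$Species)
```

Calling summary(reg) and summary(clf) prints the coefficient tables. The accuracy here is measured on the training data, so a real project would hold out a test set.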

R Case Study

Who are my Facebook friends?

Step 1

http://thinktostart.wordpress.com/2013/11/19/analyzing-facebook-with-r/

Step 2

https://gist.github.com/decisionstats/f18126aea544be324169

Big Data Social Network Analysis

Analyzing a Big Social Network using R and distributed graph engines http://thinkaurelius.com/2012/02/05/graph-degree-distributions-using-r-over-hadoop/

Big Data Social Media Analysis

Can be used for Customers (and also for latent influencers) - http://www.r-bloggers.com/an-example-of-social-network-analysis-with-r-using-package-igraph/

Big Data Social Media Analysis

The R package twitteR http://cran.r-project.org/web/packages/twitteR/index.html can be used for prototyping, but Twitter's API is rate limited to 1500 per hour(?)/day, so for larger volumes we can use the Datasift API http://datasift.com/pricing#costs

Big Data Social Media Analysis

How does information propagate through a social network? http://www.r-bloggers.com/information-transmission-in-a-social-network-dissecting-the-spread-of-a-quora-post/

Big Data Social Network Analysis

Can be used for Terrorists (and also for potential protestors) - Drew Conway http://riskecon.com/wp-content/uploads/2012/02/Conway-Socio_Terrorism.pdf

Primary focus is on three aspects of network analysis

1. Identifying leadership and key actors

2. Revealing underlying structure and intra-network community structure

3. Evolution and decay of social networks
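The first aspect, identifying key actors, can be sketched with degree centrality, which counts each person's direct connections. The example below sticks to base R so that it runs without extra packages (the igraph package linked earlier offers this and much more); the edge list and the names in it are made up for illustration.

```r
# Hypothetical undirected friendship edge list; the names are invented
edges <- data.frame(
  from = c("A", "A", "A", "B", "C", "D"),
  to   = c("B", "C", "D", "C", "E", "E"),
  stringsAsFactors = FALSE
)

# Build a symmetric adjacency matrix indexed by person
people <- sort(unique(c(edges$from, edges$to)))
adj <- matrix(0, length(people), length(people),
              dimnames = list(people, people))
adj[cbind(edges$from, edges$to)] <- 1
adj[cbind(edges$to, edges$from)] <- 1

# Degree centrality: number of direct connections per person
degree <- rowSums(adj)
key_actors <- names(degree)[degree == max(degree)]
print(sort(degree, decreasing = TRUE))
```

In this toy network A and C come out on top with three connections each. Degree is only the simplest centrality measure; betweenness or eigenvector centrality may rank actors differently.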

R - Big Data Packages http://cran.r-project.org/web/views/HighPerformanceComputing.html

● The RHIPE package, started by Saptarshi Guha and now developed by a core team via GitHub, provides an interface between R and Hadoop for analysis of large complex data wholly from within R using the Divide and Recombine approach to big data.

● The rmr package by Revolution Analytics also provides an interface between R and Hadoop for a Map/Reduce programming framework.

● A related package, segue by Long, permits easy execution of embarrassingly parallel tasks on Elastic Map Reduce (EMR) at Amazon.

● The RProtoBuf package provides an interface to Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data. This package can be used in R code to read data streams from other systems in a distributed MapReduce setting where data is serialized and passed back and forth between tasks.

● The HistogramTools package provides a number of routines useful for the construction, aggregation, manipulation, and plotting of large numbers of Histograms such as those created by Mappers in a MapReduce application.
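On the earlier question of whether R has inbuilt multi-core support: the parallel package ships with R itself. The sketch below spreads a bootstrap over two cores with mclapply(); the boot_mean helper, the use of mtcars and the choice of two cores are assumptions for illustration. mclapply() relies on forking, so on Windows one would use makeCluster() with parLapply() from the same package instead.

```r
library(parallel)  # part of the standard R distribution

# Hypothetical CPU-bound task: the mean of one bootstrap resample
boot_mean <- function(i, x) mean(sample(x, replace = TRUE))

x <- mtcars$mpg

RNGkind("L'Ecuyer-CMRG")  # random number generator suited to parallel streams
set.seed(123)

# Run 200 resamples across 2 forked worker processes
res <- mclapply(1:200, boot_mean, x = x, mc.cores = 2)
boot_means <- unlist(res)
summary(boot_means)
```

detectCores() reports how many cores the machine has, which helps when picking mc.cores.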

R - Hadoop Packages https://github.com/RevolutionAnalytics/RHadoop/wiki

● plyrmr - higher level plyr-like data processing for structured data, powered by rmr

● rmr - functions providing Hadoop MapReduce functionality in R

● rhdfs - functions providing file management of the HDFS from within R

● rhbase - functions providing database management for the HBase distributed database from within R
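For readers new to MapReduce, the programming model behind these packages can be sketched on a single machine with base R. This is not the rmr API; it only imitates the map and reduce steps with lapply() and Reduce(), and the documents and the helper names count_words and merge_counts are invented for the example.

```r
# Hypothetical input: each string plays the role of one input split
docs <- c("big data in the cloud", "data science with R", "big big data")

# Map step: count the words of each split independently
count_words <- function(d) {
  tab <- table(strsplit(d, " ", fixed = TRUE)[[1]])
  setNames(as.numeric(tab), names(tab))
}
mapped <- lapply(docs, count_words)

# Reduce step: merge two sets of counts key by key
merge_counts <- function(a, b) {
  keys <- union(names(a), names(b))
  out <- setNames(numeric(length(keys)), keys)
  out[names(a)] <- out[names(a)] + a
  out[names(b)] <- out[names(b)] + b
  out
}
word_counts <- Reduce(merge_counts, mapped)
print(word_counts)
```

Because each map call touches only its own split and the merge is associative, a framework such as Hadoop can run the same two functions across many machines.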

http://amplab-extras.github.io/SparkR-pkg/

SparkR is an R package that provides a light-weight frontend to use Apache Spark from R.

https://github.com/nexr/RHive

RHive is an R extension facilitating distributed computing via HIVE query. RHive allows easy usage of HQL (Hive SQL) in R, and allows easy usage of R objects and R functions in Hive.

R - Cloud Computing http://cran.r-project.org/web/views/WebTechnologies.html

R - Big Data Packages http://cran.r-project.org/web/views/HighPerformanceComputing.html

Large memory and out-of-memory data

● The biglm package by Lumley uses incremental computations to offer lm() and glm() functionality to data sets stored outside of R's main memory.

● The ff package by Adler et al. offers file-based access to data sets that are too large to be loaded into memory, along with a number of higher-level functions.

● The bigmemory package by Kane and Emerson permits storing large objects such as matrices in memory (as well as via files) and uses external pointer objects to refer to them.

● A large number of database packages, and database-alike packages (such as sqldf by Grothendieck and data.table)

● The HadoopStreaming package provides a framework for writing map/reduce scripts for use in Hadoop Streaming; it also facilitates operating on data in a streaming fashion which does not require Hadoop.

● The speedglm package fits (generalised) linear models to large data.

● The biglars package by Seligman et al. can use ff to support larger-than-memory datasets for least-angle regression, lasso and stepwise regression.

● The bigrf package provides a Random Forests implementation with support for parallel execution and large memory.

● The MonetDB.R package allows R to access the MonetDB column-oriented, open source database system as a backend.
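The "incremental computations" idea behind out-of-memory regression can be sketched in base R: process the data chunk by chunk, keep only small running totals, and solve at the end. This is an illustration of the principle, not biglm's implementation; it accumulates the normal equations (X'X and X'y), which is simpler but less numerically robust than what a dedicated package would use. The simulated data and the chunk size of 1000 rows are assumptions.

```r
# Simulated data standing in for a file too large to load at once
set.seed(1)
n <- 10000
full <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
full$y <- 2 + 3 * full$x1 - 1.5 * full$x2 + rnorm(n)

# Split the row indices into chunks of 1000 rows
chunk_ids <- split(seq_len(n), ceiling(seq_len(n) / 1000))

# Running totals: only a 3x3 and a 3x1 matrix are kept in memory
xtx <- matrix(0, 3, 3)
xty <- matrix(0, 3, 1)
for (rows in chunk_ids) {
  X <- cbind(1, as.matrix(full[rows, c("x1", "x2")]))  # intercept + predictors
  y <- full$y[rows]
  xtx <- xtx + crossprod(X)     # adds this chunk's X'X
  xty <- xty + crossprod(X, y)  # adds this chunk's X'y
}

# Solve the normal equations once all chunks have been seen
beta <- drop(solve(xtx, xty))
print(beta)
```

The result matches coef(lm(y ~ x1 + x2, data = full)) up to floating-point error, even though no step ever needed all rows at once.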

R in Finance http://www.rinfinance.com/

R in Finance http://www.quantmod.com/

C’est la vie

IN INDUSTRY - an R expert is one who knows which package to use from

IN RESEARCH - an R expert is one who creates a new, popular and improved package

CRAN Views help experts http://cran.r-project.org/web/views/

SAP with R

Departure of Aeroplanes - SAP HANA 200m http://allthingsr.blogspot.in/#!/2012/04/big-data-r-and-hana-analyze-200-million.html

R using SAP Hana

http://www.decisionstats.com/interview-blag-sap-labs-montreal-using-sap-hana-with-rstats/

Oracle R Enterprise

Case Studies and Examples http://www.oracle.com/technetwork/database/options/advanced-analytics/r-enterprise/index.html

Additional

http://www.slideshare.net/ajayohri/open-source-analytics

Open Source in Analytics (OSSCamp 2014) http://osscamp.in/

http://osscamp.in/events/6/open-source-analytics-overview-r-python-and-others

How does this affect decision making

Lots of Data

IT is not a support function

Analytical Organizations with cross functional domains

and

Employees as first line of analysis

Are education and research keeping up?

Let's do a revision

Requirements and People

# Requirements met vs. unmet (illustrative counts from the slide)
a <- data.frame(req = c("Met", "Unmet"), counts = c(50, 50))
a
pie(a$counts, labels = a$req)

# Audience mood, coloured with an RColorBrewer palette
library(RColorBrewer)
p <- data.frame(req = c("Satisfied", "Unsatisfied", "Busy Sleeping"),
                counts = c(50, 40, 10))
pie(p$counts, labels = p$req, col = brewer.pal(3, "Set1"))

Thanks for listening

Contact - ohri2007@gmail.com

LinkedIn - http://linkedin.com/in/ajayohri

Questions please?

One more thing

a movie on a murdered IIM batchmate of mine fighting against corruption just released yesterday

http://www.imdb.com/title/tt3056632/

Dedicated to
