How Big Data, Cloud Computing, Data Science Can Help Business

Analytics Talk By Ajay Ohri at Allianz Trivandrum 9 October 2016

Upload: ajay-ohri

Post on 19-Feb-2017

Category:

Data & Analytics


TRANSCRIPT

Page 1: How Big Data, Cloud Computing, Data Science Can Help Business

Analytics Talk by Ajay Ohri at Allianz

Trivandrum, 9 October 2016

Page 2

Analytics Session

Introduction to Big Data, Cloud Computing, Data Science and How They Affect You

Page 3

Agenda

Big Data - definition and explanation

Cloud Computing

Data Science

Business Strategy Models

Case Studies in Insurance

Page 4

Big Data

What is Big Data?

"Big data" is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.

Examples include web logs, RFID, sensor networks, social networks, social data (due to the social data revolution), Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and often interdisciplinary scientific research, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce.

Page 5

Big Data

What is Big Data?

"Extremely large data sets that may be analysed computationally to reveal patterns, trends, and associations, especially relating to human behaviour and interactions."

1. "much IT investment is going towards managing and maintaining big data"

https://en.wikipedia.org/wiki/Big_data

Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy.

Page 6

Big Data: Statistics

IBM- http://www-01.ibm.com/software/data/bigdata/

Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.

Page 7

Big Data: Moving Fast

IBM- https://www.ibm.com/big-data/us/en/

Big data is being generated by everything around us at all times. Every digital process and social media exchange produces it. Systems, sensors and mobile devices transmit it. Big data is arriving from multiple sources at an alarming velocity, volume and variety. To extract meaningful value from big data, you need optimal processing power, analytics capabilities and skills.

Page 8

4V of BIG DATA

http://www.ibmbigdatahub.com/infographic/four-vs-big-data

Page 13

VALUE

http://www.ibmbigdatahub.com/infographic/extracting-business-value-4-vs-big-data

Page 14

Veracity and Variety

http://www.ibmbigdatahub.com/infographic/extracting-business-value-4-vs-big-data

Page 15

Volume and Velocity

http://www.ibmbigdatahub.com/infographic/extracting-business-value-4-vs-big-data

Page 18

Who uses Big Data

http://www.sas.com/en_us/insights/big-data/what-is-big-data.html

Banking: With large amounts of information streaming in from countless sources, banks are faced with finding new and innovative ways to manage big data. While it’s important to understand customers and boost their satisfaction, it’s equally important to minimize risk and fraud while maintaining regulatory compliance. Big data brings big insights, but it also requires financial institutions to stay one step ahead of the game with advanced analytics.

Education: Educators armed with data-driven insight can make a significant impact on school systems, students and curriculums. By analyzing big data, they can identify at-risk students, make sure students are making adequate progress, and can implement a better system for evaluation and support of teachers and principals.

Government: When government agencies are able to harness and apply analytics to their big data, they gain significant ground when it comes to managing utilities, running agencies, dealing with traffic congestion or preventing crime. But while there are many advantages to big data, governments must also address issues of transparency and privacy.

Page 19

Who uses Big Data

http://www.sas.com/en_us/insights/big-data/what-is-big-data.html

Health Care: Patient records. Treatment plans. Prescription information. When it comes to health care, everything needs to be done quickly, accurately – and, in some cases, with enough transparency to satisfy stringent industry regulations. When big data is managed effectively, health care providers can uncover hidden insights that improve patient care.

Manufacturing: Armed with insight that big data can provide, manufacturers can boost quality and output while minimizing waste – processes that are key in today’s highly competitive market. More and more manufacturers are working in an analytics-based culture, which means they can solve problems faster and make more agile business decisions.

Retail: Customer relationship building is critical to the retail industry – and the best way to manage that is to manage big data. Retailers need to know the best way to market to customers, the most effective way to handle transactions, and the most strategic way to bring back lapsed business. Big data remains at the heart of all those things.

Page 20

Big Data: Hadoop Stack

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

The project includes these modules:

Hadoop Common: The common utilities that support the other Hadoop modules.

Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

Hadoop YARN: A framework for job scheduling and cluster resource management.

Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

http://hadoop.apache.org/
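To make the MapReduce idea concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names and the sample log lines are assumptions made up for illustration.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all counts emitted for one word."""
    return key, sum(values)

def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

# Hypothetical input: three short log lines.
logs = ["claim opened", "claim closed", "policy opened"]
counts = word_count(logs)
```

On a real cluster the map and reduce calls would run on different machines, with the framework handling the shuffle and any failures.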

Page 21

Big Data: Hadoop Stack

Hadoop-related projects at Apache include:

Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health such as heatmaps and ability to view MapReduce, Pig and Hive applications visually along with features to diagnose their performance characteristics in a user-friendly manner.

Avro™: A data serialization system.

Cassandra™: A scalable multi-master database with no single points of failure.

Chukwa™: A data collection system for managing large distributed systems.

HBase™: A scalable, distributed database that supports structured data storage for large tables.

Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.

Mahout™: A scalable machine learning and data mining library.

Pig™: A high-level data-flow language and execution framework for parallel computation.

Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.

Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use-cases. Tez is being adopted by Hive™, Pig™ and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.

ZooKeeper™: A high-performance coordination service for distributed applications.

Page 22

Big Data: Hadoop Stack

Page 23

Big Data: Hadoop Stack

Page 24

Big Data: Hadoop Stack

Page 25

NoSQL

A NoSQL (Not-only-SQL) database is one that has been designed to store, distribute and access data using methods that differ from relational databases (RDBMS’s). NoSQL technology was originally created and used by Internet leaders such as Facebook, Google, Amazon, and others who required database management systems that could write and read data anywhere in the world, while scaling and delivering performance across massive data sets and millions of users.

Page 26

NoSQL

https://www.datastax.com/nosql-databases

Page 27

NoSQL

https://www.datastax.com/nosql-databases

Page 28

How NoSQL Databases Differ From Each Other

https://www.datastax.com/nosql-databases

There are a variety of different NoSQL databases on the market with the key differentiators between them being the following:

Architecture: Some NoSQL databases like MongoDB are architected in a master/slave model in somewhat the same way as many RDBMS’s. Others (like Cassandra) are designed in a ‘masterless’ fashion where all nodes in a database cluster are the same. The architecture of a NoSQL database greatly impacts how well the database supports requirements such as constant uptime, multi-geography data replication, predictable performance, and more.

Data Model: NoSQL databases are often classified by the data model they support. Some support a wide-row tabular store, while others sport a model that is either document-oriented, key-value, or graph.

Data Distribution Model: Because of their architecture differences, NoSQL databases differ on how they support the reading, writing, and distribution of data. Some NoSQL platforms like Cassandra support writes and reads on every node in a cluster and can replicate / synchronize data between many data centers and cloud providers.

Development Model: NoSQL databases differ on their development API’s with some supporting SQL-like languages (e.g. Cassandra’s CQL).
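The four data models named above can be hard to picture in the abstract. The sketch below shapes one hypothetical policy record for each model using only plain Python structures. No real database or driver is involved, and every name and value is an assumption made up for illustration.

```python
import json

# Key-value: the store only sees an opaque value behind a single key.
key_value = {"policy:1001": json.dumps({"holder": "H1", "premium": 1200})}

# Document: a nested, self-describing record that can be queried by field.
document = {
    "_id": 1001,
    "holder": {"id": "H1", "city": "Trivandrum"},
    "claims": [{"year": 2015, "amount": 300}],
}

# Wide-row: one row key with a set of columns that may differ per row.
wide_row = {1001: {"holder": "H1", "claim:2015": 300}}

# Graph: entities are nodes and relationships are labelled edges.
edges = [
    ("H1", "OWNS", "policy:1001"),
    ("policy:1001", "HAS_CLAIM", "claim:2015"),
]

def neighbours(node):
    """Return (relationship, target) pairs for edges leaving `node`."""
    return [(rel, dst) for src, rel, dst in edges if src == node]

premium = json.loads(key_value["policy:1001"])["premium"]
city = document["holder"]["city"]
owned = neighbours("H1")
```

The same facts are stored each time; what changes is which questions are cheap to ask (lookup by key, filter by nested field, scan a row's columns, or walk relationships).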

Page 29

Big Data Strategy

Page 30

Cloud Computing

Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models, and four deployment models.

http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf

--National Institute of Standards and Technology

Page 31

Cloud Computing: Types

Five essential characteristics:

1. On-demand self-service
2. Broad network access
3. Resource pooling
4. Rapid elasticity
5. Measured service

Page 32

Cloud Computing

1. The practice of using a network of remote servers hosted on the Internet to store, manage, and process data, rather than a local server or a personal computer.

http://csrc.nist.gov/publications/nistpubs/800-145/SP800-145.pdf

Page 33

Cloud Computing: Types

three service models (SaaS, PaaS and IaaS)

Page 34

Cloud Computing: Types

four deployment models (private, public, community and hybrid).

Key enabling technologies include:

1. fast networks,
2. inexpensive computers, and
3. virtualization for commodity hardware.

Page 35

Cloud Computing: Types

Major barriers to broader cloud adoption are security, interoperability, and portability.

Explained in simple, short terms for a layman, cloud computing is a lot of scalable and customizable computing power, available for rent by the hour and accessible remotely. It can help in doing more computing at a fraction of the cost.

Page 36

Data Driven Decision Making

- using data and trending historical data
- validating assumptions, if any
- using champion challenger to test scenarios
- using experiments
- using baselines
- continuous improvement

- customer experiences
- costs
- revenues

If you can't measure it, you can't manage it -Peter Drucker
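A champion challenger test usually ends with one question: is the challenger's response rate really better than the champion's, or is the gap just noise? One common way to check is a two-proportion z-test, sketched below with the standard library only. The function name and the campaign numbers are hypothetical, and a pooled normal approximation is assumed, which needs reasonably large samples.

```python
from math import erf, sqrt

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Standard normal CDF written with erf, then doubled for a two-sided test.
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - normal_cdf)

# Hypothetical campaign: champion (A) vs challenger (B), 5000 offers each.
z, p_value = two_proportion_z(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
challenger_wins = z > 0 and p_value < 0.05
```

The 0.05 cut-off is a convention, not a rule; the baseline and the decision threshold should be agreed before the experiment starts.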

Page 37

BCG Matrix for Product Lines

The BCG Matrix is best used to analyze your own or a target organization’s product portfolio. It is applicable to companies with multiple products.

It helps corporations analyze their business units or product lines, which in turn helps the company allocate resources.
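As a rough illustration of how the BCG Matrix sorts a portfolio, the sketch below assigns each product line to one of the four standard quadrants from its relative market share and market growth rate. The cut-off values (relative share of 1.0, growth of 10%) are commonly quoted conventions taken here as assumptions, and the product lines and figures are invented.

```python
def bcg_quadrant(relative_share, market_growth, share_cut=1.0, growth_cut=0.10):
    """Classify one product line into a BCG quadrant."""
    high_share = relative_share >= share_cut
    high_growth = market_growth >= growth_cut
    if high_share and high_growth:
        return "Star"
    if high_share:
        return "Cash Cow"
    if high_growth:
        return "Question Mark"
    return "Dog"

# Hypothetical portfolio: name -> (relative market share, market growth rate).
portfolio = {
    "Motor": (1.8, 0.04),
    "Health": (1.3, 0.15),
    "Cyber": (0.4, 0.22),
    "Travel": (0.5, 0.02),
}
quadrants = {name: bcg_quadrant(*values) for name, values in portfolio.items()}
```

The classification is only the starting point; the resource allocation decision still depends on strategy and judgment.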

Page 38

Porter’s 5 Forces Model for Industries

It draws upon industrial organization (IO) economics to derive five forces that determine the competitive intensity and therefore attractiveness of a market. Attractiveness in this context refers to the overall industry profitability. An “unattractive” industry is one in which the combination of these five forces acts to drive down overall profitability. A very unattractive industry would be one approaching “pure competition”, in which available profits for all firms are driven to normal profit.

Page 39

Porter’s Diamond Model

An economic model developed by Michael Porter in his book The Competitive Advantage of Nations, where he published his theory of why particular industries become competitive in particular locations.

Page 40

McKinsey 7S Framework

To check which teams work and which teams don't (within an organization), use this framework from the famous consulting company. It offers a strategic vision for groups, including businesses, business units, and teams. The 7S are structure, strategy, systems, skills, style, staff and shared values. The model is most often used as a tool to assess and monitor changes in the internal situation of an organization.

Page 41

Greiner Model for Organizational Growth

Developed by Larry E. Greiner, it is helpful when examining the problems that growth creates for organizations and the impact of change on employees.

It can be argued that growing organizations move through five relatively calm periods of evolution, each of which ends with a period of crisis and revolution.

Each evolutionary period is characterized by the dominant management style used to achieve growth, while each revolutionary period is characterized by the dominant management problem that must be solved before growth will continue.

Page 42

Marketing Model

The 4P and 4C models help you identify the marketing mix.

4P: Product, Price, Promotion, Place

4C: Consumer, Cost, Communication, Convenience

Page 43

Business Canvas Model

The Business Model Canvas is a strategic management template for developing new or documenting existing business models. It is a visual chart with elements describing a firm’s value proposition, infrastructure, customers, and finances. It assists firms in aligning their activities by illustrating potential trade-offs.

Page 44

Motivation Models

Herzberg motivation-hygiene theory: job satisfaction and job dissatisfaction act independently of each other

Leading to satisfaction

Achievement

Recognition

Work itself

Responsibility

Advancement

Growth

Leading to dissatisfaction

Company policy

Supervision

Relationship with boss

Work conditions

Salary

Relationship with peers

Security

Page 45

Motivation Models

Maslow Hierarchy of Needs

Page 46

Business Strategy Models

http://decisionstats.com/2013/12/19/business-strategy-models/

1. Porter’s 5 Forces Model- To analyze industries

2. Business Canvas

3. BCG Matrix- To analyze product portfolios

4. Porter’s Diamond Model- To analyze locations

5. McKinsey 7S Model- To analyze teams

6. Greiner Theory- To analyze growth of organizations

7. Herzberg Hygiene Theory- To analyze soft aspects of individuals

8. Marketing Mix Model- To analyze marketing mix

Page 47

Data Science

What is a data scientist? A data scientist is someone with interdisciplinary skills across programming, statistics and business domains, who creates actionable insights based on experiments or summaries from data.

Page 48

Data Science

On a daily basis, a data scientist is simply a person

who can write some code in one or more of the languages of R, Python, Java, SQL, Hadoop (Pig, HQL, MR)

for data storage, querying, summarization, and visualization, efficiently and in time,

on databases, on cloud, and on servers, and who understands enough statistics to derive insights from data so business can make decisions

What should a data scientist know? He should know how to get data, store it, query it, manage it, and turn it into actionable insights.

Page 49

Big Data Social Media Analysis

https://rdatamining.wordpress.com/2012/05/17/an-example-of-social-network-analysis-with-r-using-package-igraph/

Social Network Analysis

Page 50

How does information propagate through a social network?

http://www.r-bloggers.com/information-transmission-in-a-social-network-dissecting-the-spread-of-a-quora-post/
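One simple way to reason about propagation is to treat the network as a graph and count how many sharing rounds it takes for a post to reach each person. The sketch below does this with a breadth-first search. It assumes, unrealistically, that everyone reshares to all of their contacts, and the names and connections are invented; it is not the method used in the linked article.

```python
from collections import deque

def spread_rounds(graph, seed):
    """Return the round in which each reachable person first sees the post."""
    seen = {seed: 0}
    queue = deque([seed])
    while queue:
        person = queue.popleft()
        for contact in graph.get(person, []):
            if contact not in seen:
                seen[contact] = seen[person] + 1
                queue.append(contact)
    return seen

# Hypothetical follower graph: person -> people who see what they share.
graph = {
    "asha": ["bala", "chitra"],
    "bala": ["dev"],
    "chitra": ["dev", "esha"],
    "esha": ["farid"],
}
rounds = spread_rounds(graph, "asha")
```

A more realistic model would give each contact a probability of resharing, but the graph traversal underneath stays the same.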

Page 51

Fraud Analysis

Anomaly detection (also outlier detection) is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset.
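A minimal sketch of one anomaly detection technique: flag values whose distance from the median is large relative to the median absolute deviation (a "modified z-score"). The 0.6745 scaling constant and the 3.5 threshold are commonly used defaults taken here as assumptions, and the claim amounts are invented. Real fraud models use many variables, not one.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds `threshold`."""
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        # All deviations are (nearly) identical; nothing can be ranked.
        return []
    return [v for v in values if 0.6745 * abs(v - centre) / mad > threshold]

# Hypothetical claim amounts, one of which does not fit the pattern.
claims = [120, 135, 128, 140, 131, 126, 2400, 133]
flagged = mad_outliers(claims)
```

The median is used instead of the mean because a single extreme value would otherwise drag the baseline towards itself and hide the anomaly.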

Page 52

How they affect you: Financial Profitability

Data storage is getting cheaper, but the way it is stored is changing (from company servers to external cloud)

Big Data helps to store every interaction and transaction with the customer, but this also increases the complexity of data

Data Science is getting cheaper (open source), but more skilled professionals in analytics are required

Page 53

How they affect you: Sales and Marketing

Which customers to target and who not to target (traditional propensity models)

Where to target (geocoded)

When to target

Forecast Demand
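As a baseline for the demand forecasting item, the sketch below predicts next month's demand as the average of the last few months. This is deliberately the simplest possible method, shown only as a reference point that any richer model should beat; the function name, window size and monthly figures are assumptions.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(history) < window:
        raise ValueError("need at least `window` observations")
    return sum(history[-window:]) / window

# Hypothetical number of policies sold per month.
monthly_policies = [410, 432, 425, 440, 455, 461]
forecast = moving_average_forecast(monthly_policies)
```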

Page 54

How they affect you: Operations

Optimize cost and logistics

Maximize output per resource

Can also be combined with IoT

Page 55

How they affect you: Human Resources

Which employee is likely to leave first

Which skill is most likely to be crucial in the next 12 to 24 months

Forecast for skills, employees

Page 56

Insurance Examples

http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-progressive-insurance-35951-1.html

Agents increasingly want mobile enablement, and not just the ability to quote, but to bind and sell policies on smartphones and tablets. -Progressive

Progressive Snapshot https://www.progressive.com/auto/snapshot/

To participate, you attach the Snapshot device to the computer in your car, which collects data about your driving habits. According to Progressive, the device records your vehicle identification number (VIN), how many miles you drive each day and how often you drive between midnight and 4 a.m.

After driving with Snapshot for 30 days, you return it to Progressive and, depending on your driving habits, the company says you can get a discount of up to 30%.

Page 57

Insurance Examples

MassMutual http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-massmutual-35952-1.html

MassMutual created Haven Life, an online insurance agency that uses an algorithmic underwriting tool and a series of related decisions, created in collaboration with a team of data scientists.

insurance companies are vast decision-making engines that take and manage risk. The inputs into this engine are data, and the capabilities created by the field of data science can and will impact every process in the company — from underwriting to claims management to security,

Page 58

Insurance Examples

CNA is applying big data technology to workers compensation claims and adjusters’ notes.

“That is a classic, unstructured big data kind of problem,” says Nate Root, SVP of CNA’s shared service organization. “We have hundreds of thousands of workers compensation claims, and claims adjuster notes, and there is tremendous value in those notes.”

Root says the insurer recently began identifying workers’ compensation claims that have the potential to turn into a total disability, or partial permanent disability, without the right sort of attention. By examining the unstructured data, CNA has developed a hundred different variables that can predict a propensity for a claim to become serious, and then assign a nurse case manager to help the insured get necessary treatments for a better patient outcome, get them back to work and lower the overall cost of coverage. For example, the program can find people who are missing appointments or who are not engaged with physical therapy and should be.

http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-cna-35959-1.html

Page 59

Insurance Examples

American Family Insurance licensed APT’s Test & Learn software (http://www.predictivetechnologies.com/products/test-learn.aspx) to enhance customer engagement and increase support for agents. “This is a statistical tool that enables us to create and analyze statistical tests.”

For example, call-routing techniques affect wait times and, ultimately, claims satisfaction. The insurer also tracks how claims are handled, and by whom, and whether agents are involved in resolution. Using APT, the insurer can isolate variables and accurately determine the success of one design vs. another for various products, geographies or demographics.

http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-american-family-insurance-35953-1.html

Page 60

Insurance Examples

http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-american-family-insurance-35953-1.html

American Family Insurance

Unstructured data, such as that collected in call center transcripts, also can be studied to better understand what approaches are best for different situations, he says. “Hadoop and other tools enable natural-language processing and sentiment analysis,” Cruz says. “We can look for key words or patterns in those words, do counts and build models off textual indicators that enable us to identify three things:

1. when there could be fraud involved,

2. where there might be severity issues,

3. or how we can get ahead of that and plan for it.”

Customer communication, web design and direct mail are other areas where the insurer is, or soon will be, using APT, asking questions such as:

1. Do we see greater lift in these geographies vs. those?

2. Do we see greater lift from younger vs. older customers?

3. Do we see greater lift from customers that came from these insurance companies vs. other insurance companies?
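The "look for key words and do counts" step that Cruz describes for call center transcripts can be sketched in a few lines. This is only an illustration of keyword counting, not the insurer's actual pipeline; the indicator words and the transcript text are invented, and a real system would feed such counts into a model rather than act on them directly.

```python
import re
from collections import Counter

# Hypothetical indicator words; a real list would come from domain experts.
FRAUD_TERMS = {"staged", "inconsistent", "prior"}

def keyword_counts(text, terms):
    """Count how often each indicator word appears in one transcript."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token in terms)

transcript = (
    "Caller gave an inconsistent date. "
    "The story was inconsistent with the prior claim."
)
counts = keyword_counts(transcript, FRAUD_TERMS)
```

Each count becomes one textual indicator (one column) per transcript, which is the kind of input a fraud or severity model can then be built on.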

Page 61

Insurance Examples

Like MassMutual, Nationwide has partnered with a local college: Ohio State University, the university with the third-largest enrollment in the country. The Nationwide Center for Advanced Customer Insights (NCACI) gives OSU students in advanced degree programs the ability to work with real-world data to solve some of the biggest insurance business problems. Faculty and students from the marketing, statistics, psychology, economics and computer science departments work with Nationwide to develop predictive models and data mining techniques aimed at:

1. improving marketing and distribution,
2. identifying consumer behavior patterns, and
3. increasing customer satisfaction and lifetime value.

Page 62

Insurance Examples

John Hancock

His team set out to find a way to leverage the wealth of data collected by wearable technologies, including the popular FitBit and recently released Apple Watch, to give something back to their customers. The end result was John Hancock Vitality, a new life insurance product that offers up to a 15 percent premium discount to customers who track their healthy habits with wearables and turn that information over to the insurance company. New buyers even get their own FitBit to begin tracking.

http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-john-hancock-35954-1.html

Fitbit Inc. is an American company known for its products of the same name, which are activity trackers, wireless-enabled wearable technology devices that measure data such as the number of steps walked, heart rate, quality of sleep, steps climbed, and other personal metrics.

Page 63

Insurance Examples

Swiss Re is using more public data to improve underwriting results and decrease the number of questions the insurer has to ask consumers to underwrite them. Swiss Re is looking at big data in terms of two major streams. In the first, big data is being used to help reduce costs and improve the efficiency of current processes throughout the insurance value chain, including claims and fraud management, cyber risk, customer management, pricing, risk assessment and selection, distribution and service management, product innovation, and research and development.

In the second stream, big data also offers a new framework to think bigger in terms of market disruption. Swiss Re has created more than 100 prototypes internally, and as a result the entire organization sees the value and importance of big data and smart analytics.

http://www.insurancenetworking.com/news/data-analytics/big-datas-big-guns-swiss-re-35957-1.html

Page 64

Insurance Examples

‘How do you take that operationally efficient data and turn it into a customer/household view and understand all the products attached to a person?’”

Allstate has focused heavily on master data management and data governance, creating party and household IDs for data. The company is also building a team to work across business areas on analytics projects rather than siloing big data projects within certain units.

“Something meant for a single purpose often leads to other insights. We know, for example based on some call-volume analysis in our call center, how often customers defect.”

“We have an application in claims, QuickFoto, where a policyholder that isn’t in a major accident can snap a picture of the damage and send it to us. But whereas in the past, that would’ve gone into a physical folder and then a filing cabinet, now I have all those pictures of cars in a database, and there’s a lot more that I can do.”

Page 65

Questions?

Page 66

Data Science Tools and Techniques for extracting maximum value from Customer Data and Interactions

Page 67

Agenda

Data Science Approach

Data Science Tools

Data Science Techniques

Page 68

Data Science Approach

On a daily basis, a data scientist is simply a person

who can write some code in one or more of the languages of R, Python, Java, SQL, Hadoop (Pig, HQL, MR)

for data storage, querying, summarization, and visualization, efficiently and in time,

on databases, on cloud, and on servers, and who understands enough statistics to derive insights from data so business can make decisions

Page 69

Data Science Approach

What should a data scientist know? He should know how to get data, store it, query it, manage it, and turn it into actionable insights. The following approach elaborates on this simple and sequential premise.
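The get, store, query and summarize sequence can be shown end to end in a few lines with Python's built-in sqlite3 module and an in-memory database. The table name, columns and rows below are hypothetical; in practice the "get" step would read from an API, a log or a file rather than a hard-coded list.

```python
import sqlite3

# Get: hypothetical claim records (region, amount).
rows = [("north", 120.0), ("south", 80.0), ("north", 60.0)]

# Store: load the records into an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (region TEXT, amount REAL)")
conn.executemany("INSERT INTO claims VALUES (?, ?)", rows)

# Query and summarize: claim count and average amount per region.
summary = conn.execute(
    "SELECT region, COUNT(*), AVG(amount) "
    "FROM claims GROUP BY region ORDER BY region"
).fetchall()
conn.close()
```

The insight step is then a matter of reading the summary against a business question, for example which region has unusually high average claims.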

Page 70

Where to get Data

A data scientist needs data to do science on, right? Some of the usual sources of data for a data scientist are:

APIs- API is an acronym for Application Programming Interface. We cover APIs in detail in Chapter 6. APIs are how the current big data paradigm is enabled, as they enable machines to talk to each other and fetch data from each other programmatically. For a list of articles written by the same author on APIs, see https://www.programmableweb.com/profile/ajayohri.

Internet Clickstream Logs- Internet clickstream logs refer to the data generated by humans when they click specific links within a webpage. This data is time stamped, and the uniqueness of the person clicking the link can be approximately established by IP address. IP addresses can be looked up in registries like https://www.arin.net/whois or http://www.apnic.net/whois to examine location (country and city), internet service provider, and owner of the address (for website owners this can be done using the website http://who.is/). The command ipconfig on Windows, or ifconfig on Linux systems, shows the IP address of your own machine. To learn more about IP addresses, see http://en.wikipedia.org/wiki/IP_address. Software like Clicky (http://getclicky.com) and Google Analytics (www.google.com/analytics) also provides data which can then be parsed using their APIs. (See https://code.google.com/p/r-google-analytics/ for Google Analytics using R.)

Machine Generated Data - Machines generate a lot of data, especially from sensors that check that the machine is working properly. This data can be logged and used together with events like cracks or failures for predictive asset maintenance, or M2M (Machine to Machine) analytics.

Page 71: How Big Data ,Cloud Computing ,Data Science can help business

Where to get Data

Surveys - Surveys are mostly questionnaires filled in by humans. They used to be administered manually on paper, but online surveys are now the definitive trend. Surveys reveal valuable data about the current preferences of current and potential customers. They do suffer from the bias inherent in the design of questions by the creator. Since customer preferences evolve, surveys help in getting primary data about current preferences. Coupled with stratified random sampling, they can be a powerful method for collecting data. SurveyMonkey is one company that helps create online questionnaires (https://www.surveymonkey.com/pricing/).

Commercial Databases - Commercial databases are proprietary databases that have been collected over time and are sold or rented by vendors. They can be used for prospect calling, appending information to an existing database, and refining internal database quality.

Credit Bureaus - Credit bureaus collect financial information about people, and this information is then available to marketing organizations (subject to legal and privacy guidelines). The cost of such information is balanced by the added information about customers.

Social Media - Social media is a relatively new source of data and offers powerful insights, albeit through a lot of unstructured data. Companies like Datasift offer social media data, and companies like Salesforce/Radian6 offer social media tools (http://www.salesforcemarketingcloud.com/). Facebook had 829 million daily active users on average in June 2014, with 1.32 billion monthly active users. Twitter had 255 million monthly active users, and 500 million Tweets were sent per day. That generates a lot of data about what current and potential customers are thinking and writing about your products.

Page 72: How Big Data ,Cloud Computing ,Data Science can help business

Where to process data?

Now you have the data. We need computers to process it.

Local Machine - The benefit of storing data on a local machine is ease of access. The potential risks include machine outages, data recovery, data theft (especially for laptops) and limited scalability. A local machine is also much more expensive in terms of processing and storage, and becomes obsolete within a relatively short period of time.

Server - Servers respond to requests across networks. They can be thought of as centralized resources that help cut down the cost of processing and storage. They can be an intermediate solution between local machines and clouds, though they have huge capital expenditure upfront. Not all data that can fit on a laptop should be stored on a laptop. You can store data in virtual machines on your server and connect through thin shell clients with secure access.

Cloud - The cloud can be thought of as a highly scalable, metered service that allows requests from remote networks. It can be thought of as a large bank of servers, but that is a simplistic definition.

A hindrance to adoption of the cloud is resistance within existing IT departments, whose members are not trained to transition to and maintain the network over the cloud as they used to do for enterprise networks.

Page 73: How Big Data ,Cloud Computing ,Data Science can help business

Cloud Computing Providers

We expand on the cloud processing part.

Amazon EC2 - Amazon Elastic Compute Cloud (Amazon EC2) provides scalable processing power in the cloud. It has a web-based management console and a command line tool, and offers resources for Linux and Windows virtual images. Further details are available at http://aws.amazon.com/ec2/ . Amazon EC2 is generally considered the industry leader. For beginners, a 12-month basic preview is available for free at http://aws.amazon.com/free/ that lets practitioners build up familiarity.

Google Compute - https://cloud.google.com/products/compute-engine/

Microsoft Azure - https://azure.microsoft.com/en-us/pricing/details/virtual-machines/ Azure Virtual Machines enable you to deploy Windows Server, Linux, or third-party software images to Azure. You can select images from a gallery or bring your own customized images. Virtual Machines are charged by the minute. Discounts can range from 20% to 32%, depending on whether you prepay for a 6-month or 12-month plan and on the usage tier.

IBM - IBM shut down its SmartCloud Enterprise cloud computing platform by Jan. 31, 2014 and migrated those customers to its SoftLayer cloud computing platform, which came from a company IBM acquired. https://www.softlayer.com/virtual-servers

Oracle - Oracle's plans for the cloud are still in preview for enterprise customers at https://cloud.oracle.com/compute

Page 74: How Big Data ,Cloud Computing ,Data Science can help business

Where to store data

Data needs to be stored in a secure and reliable environment for speedy and repeated access. There is a cost of storing this data, and there is a cost of losing the data due to some technical accident.

You can store data in the following ways:

csv files, spreadsheets and text files stored locally, especially for smaller files. Note that while this increases ease of access, it also creates problems of version control as well as security of confidential data.

relational databases (RDBMS) and data warehouses

Hadoop based storage

Page 75: How Big Data ,Cloud Computing ,Data Science can help business

Where to store data

NoSQL databases - These are non-relational, distributed, open-source and horizontally scalable. A complete list of NoSQL databases is at http://nosql-database.org/ . Notable NoSQL databases include MongoDB and CouchDB.

key value store - Key-value stores use the map or dictionary as their fundamental data model. In this model, data is represented as a collection of key-value pairs, such that each possible key appears at most once in the collection.

Redis - Redis is an open source, BSD licensed, advanced key-value store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets and sorted sets (http://redis.io/).

Riak - Riak is an open source, distributed database. http://basho.com/riak/

MemcacheDB - MemcacheDB is a persistence enabled variant of memcached.

column oriented databases

cloud storage
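The key-value model described above can be illustrated with a plain Python dictionary. This is only a toy in-memory sketch, not how Redis or Riak are implemented; the function names and the "customer:42:..." key layout are made up for illustration.

```python
# A plain Python dict used as a toy in-memory key-value store.
store = {}

def put(key, value):
    """Insert or overwrite the value stored under key."""
    store[key] = value

def get(key, default=None):
    """Return the value for key, or default if the key is absent."""
    return store.get(key, default)

put("customer:42:name", "Asha")
put("customer:42:city", "Trivandrum")
put("customer:42:city", "Kochi")   # same key again: the old value is replaced

print(get("customer:42:city"))
print(len(store))
```

Writing to an existing key replaces its value, which is what "each possible key appears at most once" means in practice.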

Page 76: How Big Data ,Cloud Computing ,Data Science can help business

Cloud Storage

Amazon - Amazon Simple Storage Service (S3) - Amazon S3 provides a simple web-services interface that can be used to store and retrieve any amount of data, at any time, from anywhere on the web. http://aws.amazon.com/s3/ . Cost is a maximum of 3 cents per GB per month. There are three types of storage: Standard Storage, Reduced Redundancy Storage and Glacier Storage. Reduced Redundancy Storage (RRS) is a storage option within Amazon S3 that enables customers to reduce their costs by storing non-critical, reproducible data at lower levels of redundancy than Amazon S3’s standard storage. Amazon Glacier stores data for as little as $0.01 per gigabyte per month, and is optimized for data that is infrequently accessed and for which retrieval times of 3 to 5 hours are suitable. These details can be seen at http://aws.amazon.com/s3/pricing/

Google - Google Cloud Storage https://cloud.google.com/products/cloud-storage/ . It has two kinds of storage. Durable Reduced Availability Storage enables you to store data at lower cost, with the tradeoff of lower availability than standard Google Cloud Storage. Prices are 2.6 cents per GB per month for Standard Storage and 2 cents per GB per month for Durable Reduced Availability (DRA) Storage. They can be seen at https://developers.google.com/storage/pricing#storage-pricing

Azure - Microsoft has different terminology for its cloud infrastructure. Storage is classified in three types, with a fourth type (Files) being available as a preview. There are three levels of redundancy: Locally Redundant Storage (LRS), Geographically Redundant Storage (GRS) and Read-Access Geographically Redundant Storage (RA-GRS). You can see details and prices at https://azure.microsoft.com/en-us/pricing/details/storage/

Oracle - Oracle Storage is available at https://cloud.oracle.com/storage and costs around $30 per TB per month.
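To get a feel for these prices, here is a rough monthly cost comparison using the per-GB figures quoted on this slide. It is a flat-rate sketch only: it assumes no volume tiers, request fees or retrieval charges, and actual provider pricing has changed since.

```python
# Per-GB monthly prices in US dollars, as quoted on the slide.
PRICE_PER_GB_MONTH = {
    "Amazon S3 (maximum)": 0.03,
    "Amazon Glacier": 0.01,
    "Google Standard": 0.026,
    "Google DRA": 0.02,
}

def monthly_cost(size_gb, price_per_gb):
    """Flat-rate storage cost for one month, ignoring tiers and request fees."""
    return size_gb * price_per_gb

size_gb = 500   # hypothetical data volume
for name, price in PRICE_PER_GB_MONTH.items():
    print(f"{name}: ${monthly_cost(size_gb, price):.2f} per month")
```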

Page 77: How Big Data ,Cloud Computing ,Data Science can help business

Databases on the Cloud - Amazon

Amazon RDS - Managed MySQL, Oracle and SQL Server databases. http://aws.amazon.com/rds/ While relational database engines provide robust features and functionality, scaling requires significant time and expertise.

DynamoDB - Managed NoSQL database service. http://aws.amazon.com/dynamodb/ Amazon DynamoDB focuses on providing seamless scalability and fast, predictable performance. It runs on solid state disks (SSDs) for low-latency response times, and there are no limits on the request capacity or storage size for a given table. This is because Amazon DynamoDB automatically partitions your data and workload over a sufficient number of servers to meet the scale requirements you provide.

Redshift - It is a managed, petabyte-scale data warehouse service that makes it simple and cost-effective to efficiently analyze all your data using your existing business intelligence tools. You can start small for just $0.25 per hour and scale to a petabyte or more for $1,000 per terabyte per year. http://aws.amazon.com/redshift/

SimpleDB - It is a highly available and flexible non-relational data store that offloads the work of database administration. Developers simply store and query data items via web services requests http://aws.amazon.com/simpledb/. A table in Amazon SimpleDB has a strict storage limitation of 10 GB and is limited in the request capacity it can achieve (typically under 25 writes/second); it is up to you to manage the partitioning and re-partitioning of your data over additional SimpleDB tables if you need additional scale. While SimpleDB has scaling limitations, it may be a good fit for smaller workloads that require query flexibility. Amazon SimpleDB automatically indexes all item attributes and thus supports query flexibility at the cost of performance and scale.

Page 78: How Big Data ,Cloud Computing ,Data Science can help business

Databases on the Cloud - Others

Google

Google Cloud SQL - Relational databases in Google's cloud. https://developers.google.com/cloud-sql/

Google Cloud Datastore - Managed NoSQL data storage service. https://developers.google.com/datastore/

Google BigQuery - Enables you to write queries on huge datasets. BigQuery uses a columnar data structure, which means that for a given query, you are only charged for data processed in each column, not the entire table. https://cloud.google.com/products/bigquery/

Azure SQL Database - https://azure.microsoft.com/en-in/services/sql-database/ SQL Database is a relational database service in the cloud based on the Microsoft SQL Server engine, with mission-critical capabilities. Because it’s based on the SQL Server engine, SQL Database supports existing SQL Server tools, libraries and APIs, which makes it easier for you to move and extend to the cloud.

Page 79: How Big Data ,Cloud Computing ,Data Science can help business

Basic Statistics

Some of the basic statistics that every data scientist should know are given here. This assumes rudimentary knowledge of statistics (like measures of central tendency or variation) and basic familiarity with some of the terminology used by statisticians.

Random Sampling - In truly random sampling, the sample should be representative of the entire data. Random sampling remains relevant in the era of Big Data and Cloud Computing.

Distributions - A data scientist should know the distributions (normal, Poisson, Chi Square, F) and also how to determine the distribution of data.

Hypothesis Testing - Hypothesis testing is meant for testing assumptions statistically regarding values of central tendency (mean, median) or variation. A good example of easy-to-use software for statistical testing is the “test” tab in the Rattle GUI in R.

Outliers - Checking for outliers is a good way for a data scientist to see anomalies as well as assess data quality. The box plot (exploratory data analysis) and the outlierTest function from the car package (Bonferroni Outlier Test) are ways to bring statistical rigor to outlier detection.
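The box plot check mentioned above can be sketched in a few lines of Python. This uses the common box plot convention of flagging points more than 1.5 times the interquartile range beyond the quartiles; it is not the Bonferroni test from the car package, and the claim amounts are made up.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical claim amounts with one suspicious entry.
claims = [120, 135, 128, 140, 132, 125, 138, 130, 900]
print(iqr_outliers(claims))
```

A flagged value is a prompt to look closer: it may be a data entry error or a genuine large claim.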

Page 80: How Big Data ,Cloud Computing ,Data Science can help business

Basic Techniques

Some of the basic techniques that a data scientist must know are listed as follows-

Text Mining - In text mining, text data is analyzed for frequencies, associations and correlation for predictive purposes. The tm package from R greatly helps with text mining.

Sentiment Analysis - In sentiment analysis, the text data is classified based on a sentiment lexicon (e.g. one which says happy is less positive than delighted but more positive than sad) to create sentiment scores of the text data mined.

Social Network Analysis - In social network analysis, the direction of relationships, the quantum of messages and the study of nodes, edges and graphs is done to give insights.

Time Series Forecasting - Data is said to be autoregressive with regard to time if a future value of a variable depends on its current value. Techniques such as ARIMA and exponential smoothing, and R packages like forecast, greatly assist in time series forecasting.

Web Analytics

Social Media Analytics

Data Mining or Machine Learning
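As a small illustration of the text mining and sentiment analysis ideas above, the sketch below counts word frequencies and scores text against a lexicon. The three-word lexicon, its scores and the sample reviews are invented for this example; real lexicons are far larger.

```python
import re
from collections import Counter

# Hypothetical lexicon: higher scores are more positive.
LEXICON = {"delighted": 3, "happy": 2, "sad": -2}

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())

def sentiment_score(text):
    """Sum the lexicon scores of the words in text; unknown words count 0."""
    return sum(LEXICON.get(word, 0) for word in tokenize(text))

reviews = [
    "Delighted with the quick claim settlement",
    "Happy with the agent, sad about the premium",
    "The renewal reminder arrived late",
]
word_counts = Counter(w for r in reviews for w in tokenize(r))
scores = [sentiment_score(r) for r in reviews]
print(word_counts.most_common(3))
print(scores)
```

Note that the second review scores zero because its positive and negative words cancel out, which shows the limits of simple word-level scoring.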

Page 81: How Big Data ,Cloud Computing ,Data Science can help business

Data Science Tools

- R

- Python

- Tableau

- Spark with ML

- Hadoop (Pig and Hive)

- SAS

- SQL

Page 82: How Big Data ,Cloud Computing ,Data Science can help business

R

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.

R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes an effective data handling and storage facility, a suite of operators for calculations on arrays, in particular matrices, a large, coherent, integrated collection of intermediate tools for data analysis, graphical facilities for data analysis and display either on-screen or on hardcopy, and a well-developed, simple and effective programming language

https://www.r-project.org/about.html

Page 83: How Big Data ,Cloud Computing ,Data Science can help business

Python

http://python-history.blogspot.in/ and https://www.python.org/

Page 84: How Big Data ,Cloud Computing ,Data Science can help business

SAS

http://www.sas.com/en_in/home.html

Page 85: How Big Data ,Cloud Computing ,Data Science can help business

Big Data: Hadoop Stack with Spark

http://spark.apache.org/ Apache Spark™ is a fast and general engine for large-scale data processing.

Page 86: How Big Data ,Cloud Computing ,Data Science can help business

Big Data: Hadoop Stack with Mahout

https://mahout.apache.org/

The Apache Mahout™ project's goal is to build an environment for quickly creating scalable performant machine learning applications.

Apache Mahout Samsara Environment includes

- Distributed Algebraic optimizer
- R-Like DSL Scala API
- Linear algebra operations
- Ops are extensions to Scala
- IScala REPL based interactive shell
- Integrates with compatible libraries like MLLib
- Runs on distributed Spark, H2O, and Flink

Apache Mahout Samsara Algorithms included

- Stochastic Singular Value Decomposition (ssvd, dssvd)
- Stochastic Principal Component Analysis (spca, dspca)
- Distributed Cholesky QR (thinQR)
- Distributed regularized Alternating Least Squares (dals)
- Collaborative Filtering: Item and Row Similarity

Page 87: How Big Data ,Cloud Computing ,Data Science can help business

Big Data: Hadoop Stack with Mahout

https://mahout.apache.org/

Apache Mahout software provides three major features:

- A simple and extensible programming environment and framework for building scalable algorithms
- A wide variety of premade algorithms for Scala + Apache Spark, H2O, Apache Flink
- Samsara, a vector math experimentation environment with R-like syntax which works at scale

Page 88: How Big Data ,Cloud Computing ,Data Science can help business

Data Science Techniques

- Machine Learning

- Regression

- Logistic Regression

- K Means Clustering

- Association Analysis

- Decision Trees

- Text Mining

- Social Network Analysis

- Time Series Forecasting

- LTV and RFM Analysis

- Pareto Analysis

Page 89: How Big Data ,Cloud Computing ,Data Science can help business

What is an algorithm

a process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer.

a self-contained step-by-step set of operations to be performed

a procedure or formula for solving a problem, based on conducting a sequence of specified action

a procedure for solving a mathematical problem (as of finding the greatest common divisor) in a finite number of steps that frequently involves repetition of an operation; broadly : a step-by-step procedure for solving a problem or accomplishing some end especially by a computer.
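The greatest common divisor mentioned in the last definition makes a compact example. Euclid's method, sketched below in Python, finishes in a finite number of steps and repeats one operation (taking a remainder) until it is done.

```python
def gcd(a, b):
    """Greatest common divisor of two non-negative integers (Euclid)."""
    while b != 0:
        a, b = b, a % b   # repeat the same operation until the remainder is 0
    return a

print(gcd(48, 18))  # 6
```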

Page 90: How Big Data ,Cloud Computing ,Data Science can help business

Machine Learning

Machine learning concerns the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages.

Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).

In the terminology of machine learning, classification is considered an instance of supervised learning, i.e. learning where a training set of correctly identified observations is available.

In machine learning, the problem of unsupervised learning is that of trying to find hidden structure in unlabeled data. Since the examples given to the learner are unlabeled, there is no error or reward signal to evaluate a potential solution. This distinguishes unsupervised learning from supervised learning.

The corresponding unsupervised procedure is known as clustering or cluster analysis, and involves grouping data into categories based on some measure of inherent similarity (e.g. the distance between instances, considered as vectors in a multi-dimensional vector space).
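Clustering by distance can be sketched with a bare-bones k-means in plain Python. This is a teaching sketch under simplifying assumptions: the first k points are used as starting centroids, a fixed number of iterations is run, and the six points are invented. Libraries such as scikit-learn provide more robust versions.

```python
import math

def kmeans(points, k, iterations=10):
    """Group points into k clusters; returns (centroids, labels)."""
    centroids = [points[i] for i in range(k)]   # naive init: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return centroids, labels

points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

No labels are supplied to the function; the two groups emerge only from the distances between points.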

Page 91: How Big Data ,Cloud Computing ,Data Science can help business

CRAN VIEW Machine Learning

http://cran.r-project.org/web/views/MachineLearning.html

Page 92: How Big Data ,Cloud Computing ,Data Science can help business

Machine Learning in Python

http://scikit-learn.org/stable/

Page 93: How Big Data ,Cloud Computing ,Data Science can help business

Classification

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. The individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables, features, etc. These properties may variously be categorical (e.g. "A", "B", "AB" or "O", for blood type), ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number of occurrences of a particular word in an email) or real-valued (e.g. a measurement of blood pressure).

Some algorithms work only in terms of discrete data and require that real-valued or integer-valued data be discretized into groups (e.g. less than 5, between 5 and 10, or greater than 10).
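Discretization of this kind can be sketched as a small Python function. The treatment of the boundary values 5 and 10 (both placed in the middle group) is an assumption, since the text does not say which side they fall on.

```python
def discretize(value):
    """Map a number to one of three groups: below 5, 5 to 10, above 10."""
    if value < 5:
        return "less than 5"
    if value <= 10:
        return "between 5 and 10"
    return "greater than 10"

word_counts = [0, 3, 5, 7.5, 10, 42]   # e.g. occurrences of a word per email
print([discretize(v) for v in word_counts])
```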

Page 94: How Big Data ,Cloud Computing ,Data Science can help business

Regression

Regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables.

More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed.

Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables – that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile, or other location parameter of the conditional distribution of the dependent variable given the independent variables.
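For one independent variable, the conditional expectation can be estimated with ordinary least squares. The sketch below uses the closed-form slope and intercept; the age and claims figures are made up so that the fitted line is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: policyholder age (x) and annual claims amount (y).
ages = [25, 35, 45, 55, 65]
claims = [300, 400, 500, 600, 700]
intercept, slope = fit_line(ages, claims)
print(intercept, slope)
print(intercept + slope * 50)   # estimated average claims at age 50
```

The slope says how much the typical claims amount changes per additional year of age in this toy data.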

Page 95: How Big Data ,Cloud Computing ,Data Science can help business

kNN

Page 96: How Big Data ,Cloud Computing ,Data Science can help business

Support Vector Machines

http://axon.cs.byu.edu/Dan/678/miscellaneous/SVM.example.pdf

Page 97: How Big Data ,Cloud Computing ,Data Science can help business

Association Rules

http://en.wikipedia.org/wiki/Association_rule_learning

Based on the concept of strong rules, Rakesh Agrawal et al. introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, a rule found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements. In addition to the above example from market basket analysis, association rules are employed today in many application areas including Web usage mining, intrusion detection, continuous production, and bioinformatics. As opposed to sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions.

Concepts - Support, Confidence, Lift

In R - apriori() in the arules package

In Python - http://orange.biolab.si/docs/latest/reference/rst/Orange.associate/
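Support, confidence and lift for the onions and potatoes rule can be computed directly from a list of baskets. The five transactions below are invented for illustration, and this sketch only evaluates one given rule; it does not search for rules the way apriori() does.

```python
# Hypothetical point-of-sale baskets.
transactions = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger", "milk"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"burger", "bread"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Estimated P(rhs | lhs): support of the union over support of lhs."""
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    """Confidence relative to the baseline frequency of rhs."""
    return confidence(lhs, rhs) / support(rhs)

lhs, rhs = {"onions", "potatoes"}, {"burger"}
print(support(lhs | rhs), confidence(lhs, rhs), lift(lhs, rhs))
```

A lift above 1 suggests the items on both sides of the rule occur together more often than they would if they were independent.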

Page 98: How Big Data ,Cloud Computing ,Data Science can help business

Gradient Descent

Gradient descent is a first-order iterative optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.

http://econometricsense.blogspot.in/2011/11/gradient-descent-in-r.html

Start at some x value, use the derivative at that value to tell us which way to move, and repeat. That is gradient descent.

http://www.cs.colostate.edu/%7Eanderson/cs545/Lectures/week6day2/week6day2.pdf
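The "start, look at the derivative, move, repeat" recipe can be sketched for a one-variable function. The function f(x) = (x - 3)^2, the learning rate of 0.1 and the step count are choices made for this example only.

```python
def gradient_descent(derivative, x, learning_rate=0.1, steps=100):
    """Repeatedly step against the derivative, starting from x."""
    for _ in range(steps):
        x = x - learning_rate * derivative(x)
    return x

# Minimise f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x=0.0)
print(round(x_min, 4))
```

Each step is proportional to the negative of the derivative, so the steps shrink as x approaches the minimum at 3.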

Page 99: How Big Data ,Cloud Computing ,Data Science can help business

Gradient Descent

https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/ A standard approach to solving this type of problem is to define an error function (also called a cost function) that measures how “good” a given line is.

initial_b = 0 # initial y-intercept guess
initial_m = 0 # initial slope guess
num_iterations = 1000
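A minimal sketch of how those starting values might be used to fit a line y = m * x + b by gradient descent on the mean squared error. The function names, the four data points and the learning rate of 0.05 are assumptions for illustration, not taken from the linked article.

```python
def step_gradient(b, m, points, learning_rate):
    """One gradient step on the mean squared error of y = m * x + b."""
    n = len(points)
    grad_b = sum(-2 * (y - (m * x + b)) for x, y in points) / n
    grad_m = sum(-2 * x * (y - (m * x + b)) for x, y in points) / n
    return b - learning_rate * grad_b, m - learning_rate * grad_m

def mean_squared_error(b, m, points):
    """The error (cost) function: average squared vertical distance."""
    return sum((y - (m * x + b)) ** 2 for x, y in points) / len(points)

points = [(1, 3), (2, 5), (3, 7), (4, 9)]   # made-up data on the line y = 2x + 1
b, m = 0.0, 0.0            # the initial intercept and slope guesses
num_iterations = 1000
learning_rate = 0.05       # assumed value
for _ in range(num_iterations):
    b, m = step_gradient(b, m, points, learning_rate)
print(round(b, 2), round(m, 2), mean_squared_error(b, m, points))
```

If the learning rate is set too high the updates overshoot and the error grows instead of shrinking, so it usually needs some tuning.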

Page 100: How Big Data ,Cloud Computing ,Data Science can help business

Decision Trees

http://select.cs.cmu.edu/class/10701-F09/recitations/recitation4_decision_tree.pdf

Page 101: How Big Data ,Cloud Computing ,Data Science can help business

Decision Trees

http://www.ise.bgu.ac.il/faculty/liorr/hbchap9.pdf

Page 102: How Big Data ,Cloud Computing ,Data Science can help business

Random Forest

Random Forests grows many classification trees. To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).

Each tree is grown as follows:

1. If the number of cases in the training set is N, sample N cases at random - but with replacement, from the original data. This sample will be the training set for growing the tree.

2. If there are M input variables, a number m<<M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.

3. Each tree is grown to the largest extent possible. There is no pruning.

In the original paper on random forests, it was shown that the forest error rate depends on two things:

- The correlation between any two trees in the forest. Increasing the correlation increases the forest error rate.
- The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier. Increasing the strength of the individual trees decreases the forest error rate.

https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#intro
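The two sources of randomness and the voting step can be sketched in Python without growing any trees. This is only the scaffolding around the trees: the helper names, the placeholder rows and the vote labels are made up, and a real forest would fit a decision tree on each bootstrap sample.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Step 1: draw N rows at random, with replacement, from N rows."""
    return [rng.choice(rows) for _ in rows]

def candidate_features(n_features, m, rng):
    """Step 2: pick m of the M feature indices at random for one split."""
    return rng.sample(range(n_features), m)

def forest_vote(tree_predictions):
    """The forest's answer is the class with the most tree votes."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)                      # fixed seed so runs repeat
rows = [("row", i) for i in range(8)]       # placeholder training cases
sample = bootstrap_sample(rows, rng)
features = candidate_features(n_features=10, m=3, rng=rng)
print(len(sample), sorted(features))
print(forest_vote(["claim", "no claim", "claim"]))
```

Because sampling is with replacement, some rows appear more than once in a sample and others not at all, which is what makes the trees differ from each other.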

Page 103: How Big Data ,Cloud Computing ,Data Science can help business

Bagging

Bagging, aka bootstrap aggregation, is a relatively simple way to increase the power of a predictive statistical model by taking multiple random samples(with replacement) from your training data set, and using each of these samples to construct a separate model and separate predictions for your test set. These predictions are then averaged to create a, hopefully more accurate, final prediction value.

http://www.vikparuchuri.com/blog/build-your-own-bagging-function-in-r/
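Bagging can be sketched in plain Python with a simple least-squares line as the base model. The base model, the number of models, the seed and the six training points are all assumptions for this example; the linked post builds its own version in R.

```python
import random

def fit_line(points):
    """Least-squares line through (x, y) points; returns (intercept, slope)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    if sxx == 0:                       # all x equal: fall back to the mean
        return my, 0.0
    slope = sum((x - mx) * (y - my) for x, y in points) / sxx
    return my - slope * mx, slope

def bagged_predict(train, x_new, n_models=25, seed=0):
    """Average the predictions of models fitted on bootstrap samples."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]   # with replacement
        intercept, slope = fit_line(sample)
        predictions.append(intercept + slope * x_new)
    return sum(predictions) / len(predictions)

# Made-up training data that roughly follows y = 2x.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1), (6, 12.0)]
print(bagged_predict(train, x_new=7))
```

Bagging tends to help most with unstable base models such as deep decision trees; a straight line is used here only to keep the sketch short.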

Page 104: How Big Data ,Cloud Computing ,Data Science can help business

Boosting

Boosting is one of several classic methods for creating ensemble models, along with bagging, random forests, and so forth. Boosting means that each tree is dependent on prior trees, and learns by fitting the residual of the trees that preceded it. Thus, boosting in a decision tree ensemble tends to improve accuracy with some small risk of less coverage.

XGBoost is a library designed and optimized for boosted tree algorithms. XGBoost is used in more than half of the winning solutions in machine learning challenges hosted at Kaggle.

http://xgboost.readthedocs.io/en/latest/model.html# And http://dmlc.ml/rstats/2016/03/10/xgboost.html
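The idea of fitting each new tree to the residual of the earlier ones can be sketched with one-split trees (stumps) on one-dimensional data. This is a bare-bones illustration for squared error, not XGBoost's algorithm; the shrinkage value, tree count and data are assumptions.

```python
LEARNING_RATE = 0.5   # shrinkage applied to every stump; assumed value

def fit_stump(xs, residuals):
    """Best single-split regression stump: returns (threshold, left, right)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def predict(stumps, x):
    """Sum the shrunken outputs of every stump fitted so far."""
    return sum(LEARNING_RATE * (left if x <= t else right)
               for t, left, right in stumps)

def boost(xs, ys, n_trees=50):
    """Each new stump is fitted to the residual left by the earlier ones."""
    stumps = []
    for _ in range(n_trees):
        residuals = [y - predict(stumps, x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return stumps

def mse(stumps, xs, ys):
    """Mean squared error of the ensemble on the given data."""
    return sum((y - predict(stumps, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 6.0, 6.2]   # made-up step-like data
print(mse(boost(xs, ys, n_trees=1), xs, ys), mse(boost(xs, ys), xs, ys))
```

The training error keeps falling as stumps are added, which is also why boosted models are usually checked on held-out data to guard against overfitting.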

Page 105: How Big Data ,Cloud Computing ,Data Science can help business

Data Science Process

By Farcaster at English Wikipedia, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=40129394

Page 106: How Big Data ,Cloud Computing ,Data Science can help business

LTV Analytics

Life Time Value (LTV) will help us answer 3 fundamental questions:

1. Did you pay enough to acquire customers from each marketing channel?

2. Did you acquire the best kind of customers?

3. How much could you spend on keeping them sweet with email and social media?

Page 107: How Big Data ,Cloud Computing ,Data Science can help business

LTV Analytics: Case Study

https://blog.kissmetrics.com/how-to-calculate-lifetime-value/

Page 108: How Big Data ,Cloud Computing ,Data Science can help business

LTV Analytics

https://blog.kissmetrics.com/how-to-calculate-lifetime-value/

Page 109: How Big Data ,Cloud Computing ,Data Science can help business

LTV Analytics

https://blog.kissmetrics.com/how-to-calculate-lifetime-value/

Page 110: How Big Data ,Cloud Computing ,Data Science can help business

LTV Analytics

https://blog.kissmetrics.com/how-to-calculate-lifetime-value/

Page 111: How Big Data ,Cloud Computing ,Data Science can help business

LTV Analytics

http://www.kaushik.net/avinash/analytics-tip-calculate-ltv-customer-lifetime-value/

Page 112: How Big Data ,Cloud Computing ,Data Science can help business

LTV Analytics

Download the zip file from http://www.kaushik.net/avinash/avinash_ltv.zip

Page 113: How Big Data ,Cloud Computing ,Data Science can help business

Pareto principle

The Pareto principle (also known as the 80–20 rule, the law of the vital few, and the principle of factor sparsity) states that, for many events, roughly 80% of the effects come from 20% of the causes

80% of a company's profits come from 20% of its customers

80% of a company's complaints come from 20% of its customers

80% of a company's profits come from 20% of the time its staff spend

80% of a company's sales come from 20% of its products

80% of a company's sales are made by 20% of its sales staff

Several criminology studies have found 80% of crimes are committed by 20% of criminals.
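Whether the 80-20 pattern holds for your own customers is easy to check. The sketch below computes the share of total profit that comes from the top 20% of customers; the ten profit figures are invented so that the answer comes out at exactly 80%.

```python
def top_share(values, top_fraction=0.2):
    """Share of the total contributed by the largest top_fraction of values."""
    ranked = sorted(values, reverse=True)
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)

# Hypothetical annual profit per customer.
profits = [5000, 3000, 400, 350, 300, 250, 250, 200, 150, 100]
print(top_share(profits))
```

On real data the split is rarely exactly 80-20, so the computed share is more useful than the rule of thumb.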

Page 114: How Big Data ,Cloud Computing ,Data Science can help business

RFM Analysis

RFM is a method used for analyzing customer value.

Recency - How recently did the customer purchase?

Frequency - How often do they purchase?

Monetary Value - How much do they spend?

A method

Recency = 10 - the number of months that have passed since the customer last purchased

Frequency = number of purchases in the last 12 months (maximum of 10)

Monetary = value of the highest order from a given customer (benchmarked against $10k)

Alternatively, one can create categories for each attribute. For instance, the Recency attribute might be broken into three categories: customers with purchases within the last 90 days; between 91 and 365 days; and longer than 365 days.

Such categories may be arrived at by applying business rules, or using a data mining technique, to find meaningful breaks.

A commonly used shortcut is to use deciles. One is advised to look at distribution of data before choosing breaks.
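The scoring method listed above can be sketched as a small Python function. Two details are assumptions, since the slide does not spell them out: recency is floored at 0 for customers inactive for more than 10 months, and "benchmarked against $10k" is read as scaling the highest order to a 0 to 10 range with $10,000 mapping to 10.

```python
def rfm_score(months_since_last, purchases_12m, highest_order):
    """Return (recency, frequency, monetary), each on a 0-10 scale."""
    recency = max(0, 10 - months_since_last)            # floor at 0: assumed
    frequency = min(10, purchases_12m)                  # capped at 10
    monetary = min(10.0, 10 * highest_order / 10_000)   # scaled to $10k: assumed
    return recency, frequency, monetary

# Hypothetical customer: last bought 2 months ago, 14 purchases, top order $2,500.
print(rfm_score(months_since_last=2, purchases_12m=14, highest_order=2_500))
# (8, 10, 2.5)
```

The three scores can be kept separate for segmentation or combined into one number, depending on how the business weighs recency, frequency and spend.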

Page 115: How Big Data ,Cloud Computing ,Data Science can help business

Are you ready to use more Data Science?