data science introduction - data science: what art thou?

Post on 29-Jan-2018

Page 1: Data Science Introduction - Data Science: What Art Thou?

Data Science What art thou?

Gregg Barrett I Partner I SignalRunner

Page 2: Data Science Introduction - Data Science: What Art Thou?

Disclosure

This is a paid for engagement for the Analytics & Big Data in Insurance conference.

This is not a marketing engagement.

Where reference is made to third party products and services it is merely for illustrative purposes and should not be viewed as an endorsement of any kind unless stated so.

The intent of this engagement is to build a better understanding of data science. To this end the engagement should be viewed as introductory.

Should you have any questions or require any further information please contact us at:

[email protected]

Biographical information of the presenter can be found at:

http://www.linkedin.com/in/greggbarrett

Page 3: Data Science Introduction - Data Science: What Art Thou?

Outline

Data Science: What art thou?

- Building an understanding

- Definition

- The approach

- Understanding the data science capability

- Purple people

- Data lake/Enterprise data hub

- Unstructured Information Management Architecture

- University program

- Start-up program

- You never created the world but you can blow it up

- Tutorials

- A note on Big Data

- Curse of dimensionality

- On the insurance front

- Quick note: adverse selection

- Think

- Strategy

Page 4: Data Science Introduction - Data Science: What Art Thou?

Building an understanding

Describing data science is like trying to describe the sunset.

It should be easy, but somehow capturing the words is impossible. (Booz Allen Hamilton, 2015)

Page 5: Data Science Introduction - Data Science: What Art Thou?

[1] The math behind NUMB3RS: http://bit.ly/2vRarOR

[2] The Numbers Behind NUMB3RS: Solving Crime with Mathematics: http://amzn.to/2wwKEsb

Page 6: Data Science Introduction - Data Science: What Art Thou?

Building an understanding

Definition:

Data science is the utilisation of a vast set of tools for modelling and understanding complex datasets.

To simplify matters we shall consider:

- analytics as being equivalent to data science

- machine learning as a subset of data science

Page 7: Data Science Introduction - Data Science: What Art Thou?

The approach

Strategy

Freedom & Responsibility

People I Process I Technology

University Program I Start-up Program

[3] Freedom and Responsibility: http://bit.ly/2mg9ZEZ

Page 8: Data Science Introduction - Data Science: What Art Thou?

Understanding the data science capability

Page 9: Data Science Introduction - Data Science: What Art Thou?

Understanding the data science capability

Where things go wrong:

- Lack of competent leadership

- Poor business understanding when data science is housed in and/or driven by IT – lack of business acumen

- Poor technical understanding of tasks by business – lack of technical acumen

- Business and technology fail to work together – different pages

- Lone wolf data scientist myth

Requirement:

- “Purple people” are a necessity.

- Data science is a team effort – pulling together people and expertise from multiple domains [4]

[4] Why Teams?: http://bit.ly/2vRv5yn

Page 10: Data Science Introduction - Data Science: What Art Thou?

Purple people

Blue -> business

-> Purple people

Red -> technology

“…..people who live at the confluence of disparate approaches and opinions have a broader perspective. They see connections and possibilities that others miss. They speak multiple languages and gracefully move between different groups and norms. They continuously translate, synthesize, and unify. As a result, they imagine new ways to solve old problems, and they reinvent old ways to tackle new challenges. They are powerful change agents and value creators.

In the world of analytics, I call these men and women “purple people”. They are not “blue” in the business or “red” in technology, but a blend of the two, hence purple.” (Eckerson, 2012)

What to look for:

- Inspire

- Entrepreneurial

Page 11: Data Science Introduction - Data Science: What Art Thou?

Understanding the data science capability

Where things go wrong:

- Poor data strategy – leading to data governance issues, among other things. [5]

- Data silos persist and sometimes grow.

- Unstructured data is not addressed.

Requirement:

- Constant vigilance by leadership (purple person) to ensure business and technical are on the same page.

- Unstructured data MUST be addressed.

- Points of discussion include data lakes and UIMA, for example.

This step can be viewed more as the data strategy.

[5] DeepMind's Access to U.K. Health Data Deemed 'Inappropriate': https://bloom.bg/2pREa2B

Page 12: Data Science Introduction - Data Science: What Art Thou?

Data lake/Enterprise data hub

Page 13: Data Science Introduction - Data Science: What Art Thou?

Data lake/Enterprise data hub

Page 14: Data Science Introduction - Data Science: What Art Thou?

Data lake/Enterprise data hub

Page 15: Data Science Introduction - Data Science: What Art Thou?

Unstructured Information Management Architecture

Page 16: Data Science Introduction - Data Science: What Art Thou?

Understanding the data science capability

Where things go wrong:

- Lack of data engineers. [6]

Note:

- Algos are the fussiest of fussy eaters.

- Data wrangling [7]

Requirement:

- Realise now that you will need data engineers.

This step can be viewed more as tactical -> data strategy execution.

[6] Gone Fishing – For Data: http://bit.ly/2uPTVNy

[7] Wrangling and governing unstructured data: https://ibm.co/2urTxTh

Page 17: Data Science Introduction - Data Science: What Art Thou?

Understanding the data science capability

Where things go wrong:

- Lack of breadth and depth of technical understanding.

- Technical persons can lack street smarts and business acumen.

- HR is left to source and validate human talent.

- Poor model validation [8]

- Technology strategies that reduce optionality

Requirement:

- Spend your time hiring the right people.

- Take model validation seriously.

- Technology is a means to an end, NOT an end in and of itself.

[8] Big data: A big mistake? http://bit.ly/2urzg03

Page 18: Data Science Introduction - Data Science: What Art Thou?

University program

- Advance programs in the graduate field [9]

- Creates a win-win relationship [10, 11]

- Provide cleansed data -> students and faculty generate solutions

- Generate solutions to real world problems

- Students and faculty get access to real world data and problems

- Identify promising students for recruitment [12]

- Input to syllabus formulation

[9] The 6 Best Data Science Master's Degree Courses In the US: http://bit.ly/2tTeHrq

[10] Allstate University Hackathon: http://bit.ly/2uYYzty

[11] How a $26 Billion Hedge Fund Lures the Beautiful Minds: https://bloom.bg/2msuc9T

[12] Geomagnetically Induced Current (GIC) Earth-based Early Warning Predictions based on the Horizontal Polar Component’s Horizontal Magnetic Field Intensity: http://bit.ly/2tMvD83

Page 19: Data Science Introduction - Data Science: What Art Thou?

Start-up program

- Maintain surveillance of new start-up offerings in the data science space

- Important because:

- Talent

- Service

- Product

- Equity

Page 20: Data Science Introduction - Data Science: What Art Thou?

Start-up program: Landscape is bigger than you think

Page 21: Data Science Introduction - Data Science: What Art Thou?

Start-up program: Landscape is bigger than you think

Page 22: Data Science Introduction - Data Science: What Art Thou?

Start-up program: Landscape is bigger than you think

Page 23: Data Science Introduction - Data Science: What Art Thou?

Understanding the data science capability

Where things go wrong:

- Failure to map the modelling effort back to business understanding and the realisation of business value.

Requirement:

- Data science participants need to have skin-in-the-game to align interests and counter moral hazard when it arises.

Page 24: Data Science Introduction - Data Science: What Art Thou?

You never created the world but you can blow it up

Most people use statistics the way a drunkard uses a lamp post, more for support than illumination.

The Modelers' Hippocratic Oath

- I will remember that I didn't make the world, and it doesn't satisfy my equations.

- Though I will use models boldly to estimate value, I will not be overly impressed by mathematics. [13]

- I will never sacrifice reality for elegance without explaining why I have done so.

- Nor will I give the people who use my model false comfort about its accuracy. Instead, I will make explicit its assumptions and oversights.

- I understand that my work may have enormous effects on society and the economy, many of them beyond my comprehension. [14]

[13] Recipe for Disaster: The Formula That Killed Wall Street: http://bit.ly/2uQytrN

[14] Weapons of Math Destruction: http://bit.ly/29Lm92P

Page 25: Data Science Introduction - Data Science: What Art Thou?

Understanding the data science capability

Where things go wrong:

- Poor change management and communication – strategy and execution.

- Poor project management – planning fallacy – unrealistic expectations.

Requirement:

- Get serious about change management and communication – and don’t leave it up to someone else.

- Tutorials, eLearning, trailer style corporate videos, workshops.

- Modellers must articulate their models clearly – if they can’t explain it – they can’t use it.

Page 26: Data Science Introduction - Data Science: What Art Thou?

Tutorials: Khan Academy

Page 27: Data Science Introduction - Data Science: What Art Thou?

Tutorials: The New Boston

Page 28: Data Science Introduction - Data Science: What Art Thou?

A note on Big Data

Big Data:

Big Data refers to a data environment that cannot be handled by traditional technologies.

Hadoop:

Is an open source software stack that runs on a cluster of machines. Hadoop provides distributed storage and distributed processing for very large data sets.
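As a rough illustration of what "distributed processing" means here, the sketch below mimics the map, shuffle and reduce steps of the MapReduce programming model associated with Hadoop. It is a toy that runs in a single Python process and does not use Hadoop itself; the function names and the two text chunks are made up for illustration, with each chunk standing in for data held on a different machine.

```python
from collections import defaultdict

def map_phase(chunk):
    """Runs independently on each chunk: emit a (word, 1) pair per word."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Combine the values collected for each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical chunks, each standing in for a block stored on a separate node.
chunks = ["big data big models", "data lakes hold data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because the map step touches only its own chunk, it can in principle run wherever that chunk is stored, which is the idea behind pairing distributed storage with distributed processing.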

High dimensional data:

High dimensional data brings with it the curse of dimensionality.

Dimension reduction can be a non-trivial undertaking [15].

[15] Beware the Big Errors of 'Big Data': http://bit.ly/2gsVTcd

Page 29: Data Science Introduction - Data Science: What Art Thou?

Curse of dimensionality

Page 30: Data Science Introduction - Data Science: What Art Thou?

On the insurance front

Insurance:

- Fundamentally about pricing risk

- To price risk -> data science

- Data science is fertile ground for insurance [16]

- Those doing a poor job face adverse selection

- At the very least it is on the verge of the Bezos bullseye

Models:

For prediction -> of interest is the prediction error

For estimation -> of interest is the accuracy of a function

For explanation -> requires the use of more elaborate inferential tools

[16] That Drone Hovering Over Your Home? It’s the Insurance Inspector http://on.wsj.com/2feVhMd

Page 31: Data Science Introduction - Data Science: What Art Thou?

Quick note: adverse selection

Do they have the same expected loss?

Page 32: Data Science Introduction - Data Science: What Art Thou?

Quick note: adverse selection

Turns out the expected loss of each is different

The problem becomes adverse selection in a competitive marketplace.
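A small worked example may help make the mechanism concrete. The sketch below assumes two groups of customers who look the same to Insurer A but have different expected losses, and a competitor, Insurer B, that can tell them apart and prices each group at its own expected loss. All figures (claim probabilities, claim size, group sizes) are hypothetical, and expenses and profit loadings are ignored.

```python
def expected_loss(claim_probability, claim_severity):
    """Expected loss per policy = probability of a claim x average claim size."""
    return claim_probability * claim_severity

def insurer_a_position(n_low=1000, n_high=1000):
    """Insurer A charges one pooled premium; Insurer B prices each group at its
    own expected loss. Customers buy from whichever insurer is cheaper."""
    low_loss = expected_loss(0.05, 10_000)    # hypothetical low-risk group
    high_loss = expected_loss(0.15, 10_000)   # hypothetical high-risk group
    pooled_premium = (n_low * low_loss + n_high * high_loss) / (n_low + n_high)

    # A keeps only the groups for which its pooled premium is not dearer
    # than B's risk-based premium (which equals the group's expected loss).
    retained = [(n, loss) for n, loss in ((n_low, low_loss), (n_high, high_loss))
                if pooled_premium <= loss]

    premium_income = sum(n for n, _ in retained) * pooled_premium
    expected_claims = sum(n * loss for n, loss in retained)
    return pooled_premium, premium_income - expected_claims

pooled_premium, underwriting_result = insurer_a_position()
print(f"Pooled premium: {pooled_premium:.0f}")
print(f"Insurer A expected underwriting result: {underwriting_result:.0f}")
```

Under these assumptions the low-risk customers leave for the cheaper risk-based price, Insurer A is left holding only the high-risk group at a premium below its expected loss, and its expected underwriting result turns negative.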

Page 33: Data Science Introduction - Data Science: What Art Thou?

Think

- People, Process, Technology

- Raw material -> data

- Customer service

- Ability to execute

- Amazon Merchant Services [17]

[17] Amazon’s Lending Business for Online Merchants Gains Momentum: https://bloom.bg/2sGSm12

Page 34: Data Science Introduction - Data Science: What Art Thou?

Strategy

- Data science is at the core

- Benefit from driving adverse selection

- Don’t make survival contingent on a single outcome – pursue optionality through data driven business models – innovate new products and services [18, 19]

- Weave data science into the fabric of the organisation – not isolated to a centre of excellence.

[18] The Rise of Distributed Organisations: http://bit.ly/1Bt4EWF and Distributed Organizations: http://bit.ly/2uu31iJ

[19] Top trends among AI power users: https://ibm.co/2eKBJyQ

Page 35: Data Science Introduction - Data Science: What Art Thou?

Reference

Apache UIMA. (2017). What is UIMA? [webpage]. Retrieved from https://uima.apache.org/

Bengio, Y. (2017). The curse of dimensionality. [figure]. Retrieved from Bengio, Y. (2017). The need for non-local generalization and distributed representations. [webpage]. Retrieved from http://www.iro.umontreal.ca/~bengioy/yoshua_en/research.html

Booz Allen Hamilton. (2015). The field guide to data science. [pdf]. Retrieved from https://www.boozallen.com/content/dam/boozallen/documents/2015/12/2015-Field-Guide-To-Data-Science.pdf

CRISP-DM. (2000). Generic tasks (bold) and outputs (italic) of the CRISP-DM reference model. [figure]. Retrieved from CRISP-DM. (2000). CRISP-DM 1.0. [pdf]. Retrieved from https://the-modeling-agency.com/crisp-dm.pdf

Derman, E., & Wilmott, P. (2009). The financial modeler’s manifesto. [pdf]. Retrieved from http://www.uio.no/studier/emner/sv/oekonomi/ECON4135/h09/undervisningsmateriale/FinancialModelersManifesto.pdf

Eckerson, W. (2012). Secrets of analytical leaders: insights from information insiders (1st ed.). Technics Publications, LLC. [ISBN 10: 1935504347]

EMC. (2016). How data lakes work. [figure]. Retrieved from Kumaar, A. (2016). Building data lake using open source technologies. [webpage]. Retrieved from https://www.linkedin.com/pulse/building-data-lake-using-open-source-technologies-aneel

Khan Academy. (2017). Homepage. [webpage]. Retrieved from https://www.khanacademy.org/

MapR. (2013). Enterprise data hub. [figure]. Retrieved from MapR. (2013). Set the bar high for enterprise data hub requirements. [blog]. Retrieved from https://mapr.com/blog/set-the-bar-high-for-enterprise-data-hub-requirements/

Microsoft. (2016). The data lake approach. [figure]. Retrieved from Microsoft. (2016). Azure data lake and u-sql. [ppt]. Retrieved from https://www.slideshare.net/MichaelRys/azure-data-lake-and-usql

The New Boston. (2017). Homepage. [webpage]. Retrieved from https://thenewboston.com/

Images:

Slide 31 and 32

http://looneytunes.wikia.com/wiki/Wile_E._Coyote_and_the_Road_Runner/Gallery

Page 36: Data Science Introduction - Data Science: What Art Thou?

Slide notes

Slide 5:

Numbers (stylized as NUMB3RS) is an American crime drama television series that ran on CBS from January 23, 2005, to March 12, 2010. The series was created by Nicolas Falacci and Cheryl Heuton, and follows FBI Special Agent Don Eppes (Rob Morrow) and his brother Charlie Eppes (David Krumholtz), a college mathematics professor and prodigy who helps Don solve crimes for the FBI.

The show focuses equally on the relationships among Don Eppes, his brother Charlie Eppes, and their father, Alan Eppes (Judd Hirsch), and on the brothers' efforts to fight crime, normally in Los Angeles. A typical episode begins with a crime, which is subsequently investigated by a team of FBI agents led by Don and mathematically modeled by Charlie, with the help of Larry Fleinhardt (Peter MacNicol) and Amita Ramanujan (Navi Rawat). The insights provided by Charlie's mathematics were always in some way crucial to solving the crime.

Slide 7:

Strategy drives People, Process and Technology

The university program and start-up program support People, Process and Technology

This is all done within an environment that fosters Freedom and Responsibility

Slide 9:

The rationale for this is simple: data and analytics is a business requirement, driven by business, used by business to solve business challenges and to drive business opportunities. Certainly IT has a seat at the table, and is critical to enablement, but ultimately IT should not own the strategy. (Forrester, 2015)

Slide 12:

A Data Lake is a central source in which data can be used in a variety of ways for many different internal customers, some currently of interest, others to be discovered in the future. Importantly a Data Lake provides the organisation with the centralization of data, a capability required in order to break down unwanted data silos. The growing use of Data Lakes has been made possible by the relatively low cost of large-scale storage on Hadoop.

Page 37: Data Science Introduction - Data Science: What Art Thou?

Slide notes

Slide 15:

Unstructured Information Management applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities, such as persons, places, organizations; or relations, such as works-for or located-at.

UIMA enables applications to be decomposed into components, for example "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)". Each component implements interfaces defined by the framework and provides self-describing metadata via XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++; the data that flows between components is designed for efficient mapping between these languages.

UIMA additionally provides capabilities to wrap components as network services, and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes
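The decomposition into components described above can be sketched in a few lines. This is a toy illustration of the pipeline idea only and is not the UIMA API: the component functions, the shared dictionary passed between them, the stop-word hints and the small entity lookup table are all assumptions made for the example.

```python
import re

def identify_language(doc):
    """Naive language identification: look for a few common English words."""
    english_hints = {"the", "and", "is", "in", "for"}
    words = set(re.findall(r"[a-z]+", doc["text"].lower()))
    doc["language"] = "en" if words & english_hints else "unknown"
    return doc

def detect_sentences(doc):
    """Sentence boundary detection: split after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", doc["text"])
    doc["sentences"] = [part.strip() for part in parts if part.strip()]
    return doc

def detect_entities(doc):
    """Entity detection against a tiny hypothetical lookup table."""
    known = {"Alice": "person", "London": "place", "Acme": "organization"}
    tokens = re.findall(r"[A-Za-z]+", doc["text"])
    doc["entities"] = [(token, known[token]) for token in tokens if token in known]
    return doc

def run_pipeline(text, components):
    """Pass one shared analysis record through each component in order."""
    doc = {"text": text}
    for component in components:
        doc = component(doc)
    return doc

PIPELINE = [identify_language, detect_sentences, detect_entities]
result = run_pipeline("Alice works for Acme. She is based in London.", PIPELINE)
print(result["language"], result["sentences"], result["entities"])
```

Each component reads from and adds to the same record, so components can be swapped or reordered without changing the others, which is the property the framework description emphasises.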

Slide 16:

For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights

http://nyti.ms/2kl1V3Y

Slide 17:

Model validation: Google Flu Trends, a spectacular success and then a spectacular failure.

Page 38: Data Science Introduction - Data Science: What Art Thou?

Slide notes

Slide 19:

Technology start-up advisory and development:

We maintain surveillance of the new start-up offerings in the data science space and identify and work with those that we think have a strong offering. Start-ups are good to work with for the following reasons:

- Talent: start-ups frequently have the best talent.

- Service: start-ups actively seek clients and will go the extra mile, so one can often obtain levels of service that cannot be matched by large tech firms.

- Product: start-ups frequently offer products that are niche or solve challenges in a different way, thus opening up new opportunities beyond the conventional commercial offerings.

- Equity: there is the opportunity to get an equity position in these start-ups, and working with them and experiencing the value proposition first hand is the best way to distinguish those that have strong potential value going forward from those that don’t.

Slide 24:

A model describes something in terms of likeness. Instrument x behaves LIKE this. There is error. Models are tools for approximate thinking.

Theory/law describes the essence of something. The behaviour of instrument x IS this. There is no error.

See: http://www.emanuelderman.com/books/models-behaving-badly

The first two points are addressed through hiring the right people – people who know this!

Points three and four are addressed in the implementation process – including the use of tutorials

Point five is addressed through approaches that create skin-in-the-game – contracts, equity etc

Page 39: Data Science Introduction - Data Science: What Art Thou?

Slide notes

Slide 28:

A key principle in the analysis of high dimensional data is known as the curse of dimensionality. One might think that as the number of features used to fit a model increases, the quality of the fitted model will increase as well.

In general, adding additional signal features that are truly associated with the response will improve the fitted model, in the sense of leading to a reduction in test set error. However, adding noise features that are not truly associated with the response will lead to a deterioration in the fitted model, and consequently an increased test set error. This is because noise features increase the dimensionality of the problem, exacerbating the risk of overfitting (since noise features may be assigned nonzero coefficients due to chance associations with the response on the training set) without any potential upside in terms of improved test set error. Thus, we see that new technologies that allow for the collection of measurements for thousands or millions of features are a double-edged sword: they can lead to improved predictive models if these features are in fact relevant to the problem at hand, but will lead to worse results if the features are not relevant. Even if they are relevant, the variance incurred in fitting their coefficients may outweigh the reduction in bias that they bring.

(Efron, Hastie, 2016)
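The effect of noise features on test set error can be checked with a small simulation. The sketch below is an illustration under stated assumptions, not a result from the cited source: the response depends on five signal features with true coefficients of 1, pure-noise features are appended, and a least squares model is fitted on 50 training observations and scored on 1,000 test observations. It uses only the Python standard library, so the least squares fit is done by solving the normal equations with a hand-written Gaussian elimination.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def heldout_mse(n_noise, n_train=50, n_test=1000, seed=0):
    """Fit least squares on 5 signal + n_noise noise features; return test MSE."""
    rng = random.Random(seed)
    n_signal = 5
    p = n_signal + n_noise

    def make_data(n):
        X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
        # Only the first n_signal columns drive the response (coefficients of 1).
        y = [sum(row[:n_signal]) + rng.gauss(0, 1) for row in X]
        return X, y

    X_train, y_train = make_data(n_train)
    X_test, y_test = make_data(n_test)

    # Normal equations: (X'X) beta = X'y
    XtX = [[sum(r[i] * r[j] for r in X_train) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(X_train, y_train)) for i in range(p)]
    beta = solve(XtX, Xty)

    squared_errors = [(sum(b * v for b, v in zip(beta, row)) - yi) ** 2
                      for row, yi in zip(X_test, y_test)]
    return sum(squared_errors) / n_test

results = {k: heldout_mse(k) for k in (0, 20, 40)}
for k, mse in results.items():
    print(f"{k:>2} noise features -> test MSE {mse:.2f}")
```

With 40 noise features the model has 45 coefficients to fit from 50 observations, so chance associations in the training set are picked up and the test error is noticeably higher than for the signal-only fit.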

Slide 29:

Also see slide 8:

https://lagunita.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/statistical_learning.pdf

To maintain a fixed level of accuracy for a given nonparametric estimator, the sample size must increase exponentially with the increase in dimensions.
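A common textbook way to see this is to assume data spread uniformly over a unit hypercube. The sketch below computes two quantities under that assumption: the edge length of a sub-cube that captures a given fraction of the data, and the sample size needed to keep the same sampling density as a given number of points on a line. The function names and the chosen figures (10% of the data, 100 points) are illustrative.

```python
def edge_length(fraction, dims):
    """Edge of a sub-cube holding `fraction` of uniformly spread data
    in a unit hypercube with `dims` dimensions."""
    return fraction ** (1.0 / dims)

def points_for_same_density(points_1d, dims):
    """Sample size needed in `dims` dimensions to match the sampling
    density of `points_1d` points on a unit interval."""
    return points_1d ** dims

for dims in (1, 2, 3, 10):
    edge = edge_length(0.10, dims)
    needed = points_for_same_density(100, dims)
    print(f"dims={dims:>2}  edge for 10% of data={edge:.3f}  points needed={needed:.0e}")
```

In ten dimensions a "local" neighbourhood holding 10% of the data already spans roughly 80% of the range of each feature, and matching the density of 100 points on a line takes 100 to the power of 10 points, which is the exponential growth referred to above.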

Slide 30:

Prediction -> pricing

Estimation -> capital requirements

Explanation/Inference -> risk mitigation

Page 40: Data Science Introduction - Data Science: What Art Thou?

Slide notes

Slide 34:

What happened? A company that wasn’t even in your industry launched a new product and has completely flattened you. Sound familiar? It does for anyone who’s familiar with Uber. Uber first launched as a transportation service, using data and analytics to provide customers with easy, accessible and fast transportation directly from their phone. Now, Uber has since expanded to beyond just transportation, offering additional services from consumers’ phones such as meals and delivery.

(IBM, 2016)