data science: philosopher's stone

Open Data Science Conference, San Francisco

November 2015

Vin Sharma

@ciphr | [email protected]


Data science: Philosopher’s stone

Data Science has grown from a tongue-in-cheek epithet (see “rocket science”) into a real profession. Data Scientists now have great power in enterprises. We hold the Philosopher's Stone that transforms raw data into intelligence. But with great power comes great responsibility.

For Data Science to evolve into a peer of physical sciences like chemistry, our community needs to help it develop the essential character of a Science:

Openness, methodological consistency, substantive body of knowledge, reuse, reproducibility, open research questions, ethics and professional responsibility.

Our team at Intel has been working on these issues, helping Data Science evolve from alchemy to chemistry.


From alchemy to chemistry

Transmutation of Data into Value

THINGS (50 Billion) + DATA (35 ZB) = VALUE (Revenue Growth, Cost Savings, Margin Gain)

Value and Innovation: Personalized, Ubiquitous, New Ventures, Higher Productivity, Greater Efficiency, Better Products, Engaged Customers, New Solutions

Delays and detours

THINGS (50 Billion) + DATA (35 ZB) = VALUE (Revenue Growth, Cost Savings, Margin Gain)

NO TRUST, NO INSIGHT, NO PROOF

• Fail to Scale

• Lack of Use Cases

• Fail to Secure

• Scarcity of Skills

• Complexity of Systems

• Fail to show ROI

Concept solutions

Data Source: Use Cases

IoT Developer Platform: Maker solutions on Intel® Galileo & Intel® Edison

Wearables Developer Platform: Customer device usage analyses for fashion watch ODM

Parkinson’s Research Platform: Disease progression tracking via sensors

Retail Analytics Solutions: RFID-based inventory tracking; social media based demand forecasting

Power Distribution Analytics: Grid overlay network data analysis

Digital Oil Field: Preventive maintenance for oil field assets

Population Genomics: Compare the anonymized genome data of a local patient with genome data in public data sets

Science friction

Data Science:

• Iterative error-prone drudgery

• One-off, ad hoc models in isolation

Analytics Processing:

• Single-threaded, single-node processing

• Proprietary, fixed-function solutions

Application Code:

• Monolithic architecture

• Legacy components

From data science to big data analytics: Less alchemy, more chemistry


The Trusted Analytics Platform (TAP) is an open source software project to accelerate the creation of cloud-native apps driven by big data analytics. TAP provides a shared environment in which app developers collaborate with data scientists, making it easier to use advanced analytics on big data in the cloud.

Trusted Analytics Platform

Graph

Connectors: Message Brokers & Queues (Kafka, RabbitMQ, MQTT, WS, REST…)

Processors: Stream & Batch (Hadoop, Spark, GearPump…)

Stores: Polyglot Persistence (HDFS, HBase, PostgreSQL, MySQL, Redis, MongoDB, InfluxDB, Objectivity, etc…)

Models: Intel, DataRobot, DL4J, H2O

Runtimes: Polyglot App Runtime (Python, R, Java, Scala, Go…)

Manage: Orchestration, Telemetry, Security

Infrastructure (IaaS), Appliance

Data Scientist: Develop, train, evaluate, deploy models as services

App Developer: Develop, test, push applications; manage lifecycle

System Operator

Model building services


Data Preparation: Join, filter, and cleanse data sets

Model Evaluation: Accuracy measures, cross-validation

Application Integration: Invoke model via APIs

Hypothesis Selection: Define inferential or predictive hypothesis

Model Training: Use ML to find β

Model Deployment: Run in scoring engine, track concept drift
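The training, evaluation, and deployment stages listed above can be illustrated with a minimal sketch. It assumes a logistic-regression model fitted by gradient descent on synthetic data; the function names (`train`, `score`, `cross_validate`) and the data are hypothetical and are not part of TAP or any library API.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(beta, x):
    """Model deployment: the function a scoring engine could expose behind an API."""
    return sigmoid(beta[0] + sum(b * xj for b, xj in zip(beta[1:], x)))

def train(rows, labels, epochs=200, lr=0.5):
    """Model training: fit the coefficients beta by batch gradient descent."""
    beta = [0.0] * (len(rows[0]) + 1)  # beta[0] is the intercept
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for x, y in zip(rows, labels):
            err = score(beta, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        beta = [b - lr * g / n for b, g in zip(beta, grad)]
    return beta

def cross_validate(rows, labels, k=5):
    """Model evaluation: mean accuracy over k held-out folds."""
    accuracies = []
    for fold in range(k):
        test_idx = set(range(fold, len(rows), k))
        train_x = [r for i, r in enumerate(rows) if i not in test_idx]
        train_y = [y for i, y in enumerate(labels) if i not in test_idx]
        fold_beta = train(train_x, train_y)
        hits = sum(
            (score(fold_beta, rows[i]) >= 0.5) == bool(labels[i]) for i in test_idx
        )
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / k

# Data preparation stand-in: synthetic, already-clean rows where only the
# first feature determines the label (an assumption made for illustration).
rng = random.Random(0)
rows = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(200)]
labels = [1 if x[0] > 0 else 0 for x in rows]

cv_accuracy = cross_validate(rows, labels)
beta = train(rows, labels)
print(f"cross-validated accuracy: {cv_accuracy:.2f}")
print(f"P(y=1 | x=[0.8, 0.0]) = {score(beta, [0.8, 0.0]):.2f}")
```

Tracking concept drift would then amount to recomputing an accuracy measure on newly labeled data and comparing it with `cv_accuracy`; the threshold for retraining is a design choice this sketch leaves open.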

TAP community


Case study: patient readmission prediction at Penn Medicine


LDA-derived medication features led to a 15% improvement in accuracy

Raw Medication Lists (42,358 features): data are noisy and sparse

Cleaned Medication Lists (text processing methods, regular expressions; 23,663 features): data are less noisy, but sparse

LDA-derived Features (23,663 features reduced to 20 features): data are neither noisy nor sparse
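The cleaning step between raw and cleaned medication lists (text processing methods, regular expressions) might look like the following sketch. The regular expressions and the sample strings are assumptions made for illustration; the slides do not show the actual rules used at Penn Medicine, and a real pipeline would need a clinically reviewed pattern list.

```python
import re

# Hypothetical patterns: strip dose amounts, dosage forms, and punctuation
# so that variants of the same drug collapse into one token.
DOSE = re.compile(r"\b\d+(\.\d+)?\s*(mg|mcg|g|ml|units?)\b", re.IGNORECASE)
FORM = re.compile(
    r"\b(tablets?|tabs?|capsules?|caps?|oral|solution|injection)\b", re.IGNORECASE
)
NON_ALPHA = re.compile(r"[^a-z\s]")

def clean_medication(raw):
    """Normalize one free-text medication entry to a bare drug name."""
    text = raw.lower()
    text = DOSE.sub(" ", text)
    text = FORM.sub(" ", text)
    text = NON_ALPHA.sub(" ", text)
    return " ".join(text.split())  # collapse runs of whitespace

# Made-up entries standing in for a raw medication list.
raw_list = [
    "Furosemide 40 MG Oral Tablet",
    "furosemide 20mg tab",
    "LISINOPRIL 10 mg tablets",
    "Lisinopril, 5mg",
]
cleaned = [clean_medication(m) for m in raw_list]
print(sorted(set(cleaned)))
```

In this toy list, four distinct raw strings collapse into two distinct tokens, which is the kind of vocabulary reduction the feature counts above describe.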

Penn Medicine wants to identify and stratify heart failure patients at risk of re-admission within 30 or 90 days of discharge.

• Patient phenotype approach to risk classification

• Use of patient medication history

• Applying unsupervised text analytics algorithms, such as Latent Dirichlet Allocation (LDA), to model relationship between medications and medical conditions

• Using this model with patient health records to identify high-risk patient profiles

• Evaluating individual patient risk of re-admission for new and existing patients
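To make the LDA step concrete, here is a minimal sketch that treats each patient's cleaned medication list as a document and returns a short topic-proportion vector per patient, usable as features for a risk classifier. It fits LDA with collapsed Gibbs sampling, which is one common inference method and not necessarily the one used in the case study; the function `lda_gibbs`, its hyperparameters (`alpha`, `eta`), and the patient data are all hypothetical.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, eta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return topic proportions per document."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens per (doc, topic)
    topic_word = [[0] * n_words for _ in range(n_topics)]  # tokens per (topic, word)
    topic_total = [0] * n_topics                           # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token, then resample its topic given all others.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + eta)
                    / (topic_total[t] + n_words * eta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Smoothed topic proportions: one dense, low-dimensional row per document.
    features = []
    for d, doc in enumerate(docs):
        denom = len(doc) + n_topics * alpha
        features.append([(doc_topic[d][t] + alpha) / denom for t in range(n_topics)])
    return features

# Made-up cleaned medication lists, one per patient.
patients = [
    ["furosemide", "lisinopril", "carvedilol"],
    ["furosemide", "carvedilol", "spironolactone"],
    ["metformin", "insulin", "glipizide"],
    ["metformin", "insulin", "lisinopril"],
]
features = lda_gibbs(patients, n_topics=2)
for row in features:
    print([round(p, 2) for p in row])
```

Each patient is now described by `n_topics` numbers instead of one sparse indicator per medication, which is how a feature space of thousands of columns can shrink to a few dozen dense ones.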

