Data Science: Philosopher's Stone
Open Data Science Conference, San Francisco
November 2015
Vin Sharma
@ciphr | [email protected]
Data science: Philosopher’s stone
Data Science has grown from a tongue-in-cheek epithet (see “rocket science”) into a real profession. Data Scientists now have great power in enterprises. We hold the Philosopher's Stone that transforms raw data into intelligence. But with great power comes great responsibility.
For Data Science to evolve into a peer of physical sciences like chemistry, our community needs to help it develop the essential character of a Science:
Openness, methodological consistency, substantive body of knowledge, reuse, reproducibility, open research questions, ethics and professional responsibility.
Our team at Intel has been working on these issues, helping to evolve Data Science from alchemy to chemistry.
From alchemy to chemistry
Transmutation of Data into Value
THINGS (50 Billion) + DATA (35 ZB) = VALUE
• Value: Revenue Growth, Cost Savings, Margin Gain
• Innovation: Personalized, Ubiquitous, New Ventures, Higher Productivity, Greater Efficiency, Better Products, Engaged Customers, New Solutions
Delays and detours
NO TRUST, NO INSIGHT, NO PROOF
• Fail to Scale
• Lack of Use Cases
• Fail to Secure
• Scarcity of Skills
• Complexity of Systems
• Fail to show ROI
Concept solutions
Data Source: Use Cases
• IoT Developer Platform: Maker solutions on Intel® Galileo & Intel® Edison
• Wearables Developer Platform: Customer device usage analyses for fashion watch ODM
• Parkinson’s Research Platform: Disease progression tracking via sensors
• Retail Analytics Solutions: RFID-based inventory tracking; social media based demand forecasting
• Power Distribution Analytics: Grid overlay network data analysis
• Digital Oil Field: Preventive maintenance for oil field assets
• Population Genomics: Compare the anonymized genome data of a local patient with genome data in public data sets
Science friction
Data Science:
• Iterative error-prone drudgery
• One-off, ad hoc models in isolation
Analytics Processing:
• Single-threaded, single-node processing
• Proprietary, fixed-function solutions
Application Code:
• Monolithic architecture
• Legacy components
From data science to big data analytics: Less alchemy, more chemistry
Trusted Analytics Platform (TAP)
An open source software project to accelerate the creation of cloud-native apps driven by big data analytics. TAP provides a shared environment for app developers to collaborate with data scientists, making it easier to use advanced analytics on big data in the cloud.
Graph
Trusted Analytics Platform
• Connectors: Message Brokers & Queues (Kafka, RabbitMQ, MQTT, WS, REST…)
• Processors: Stream & Batch (Hadoop, Spark, GearPump…)
• Stores: Polyglot Persistence (HDFS, HBase, PostgreSQL, MySQL, Redis, MongoDB, InfluxDB, Objectivity, etc…)
• Models: Develop, train, evaluate, deploy models as services (Intel, DataRobot, DL4J, H2O)
• Runtimes: Polyglot App Runtime (Python, R, Java, Scala, Go…)
• Manage: Orchestration, Telemetry, Security
• Infrastructure (IaaS), Appliance
Roles:
• Data Scientist: Develop, Deploy
• App Developer: Develop, test, push applications; manage lifecycle
• System Operator
Model building services
• Hypothesis Selection: Define inferential or predictive hypothesis
• Data Preparation: Join, filter, and cleanse data sets
• Model Training: Use ML to find β
• Model Evaluation: Accuracy measures, cross-validation
• Model Deployment: Run in scoring engine, track concept drift
• Application Integration: Invoke model via APIs
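The six steps above can be sketched end to end in a few lines. This is a minimal illustration, not TAP's API: the data, the helper names (`prepare`, `train`, `score`, `mse`, `cross_validate`), and the choice of one-variable least squares as the "ML" that finds β are all assumptions made for the example.

```python
from statistics import mean

# Hypothesis selection (assumed for illustration): y depends linearly on x,
# y ≈ b0 + b1 * x, so "finding β" means estimating the pair (b0, b1).

def prepare(xs, ys):
    """Data preparation: join two keyed data sets and drop incomplete rows."""
    rows = []
    for key in sorted(xs.keys() & ys.keys()):      # join on shared keys
        x, y = xs[key], ys[key]
        if x is not None and y is not None:        # cleanse missing values
            rows.append((float(x), float(y)))
    return rows

def train(rows):
    """Model training: ordinary least squares for a single feature."""
    x_bar = mean(x for x, _ in rows)
    y_bar = mean(y for _, y in rows)
    sxx = sum((x - x_bar) ** 2 for x, _ in rows)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    b1 = sxy / sxx
    return (y_bar - b1 * x_bar, b1)

def score(beta, x):
    """Model deployment: the scoring function an application would call."""
    b0, b1 = beta
    return b0 + b1 * x

def mse(beta, rows):
    """Accuracy measure: mean squared error of the model on the given rows."""
    return mean((score(beta, x) - y) ** 2 for x, y in rows)

def cross_validate(rows, k=3):
    """Model evaluation: k-fold cross-validation, mean held-out MSE."""
    errors = []
    for i in range(k):
        held_out = rows[i::k]
        training = [r for j, r in enumerate(rows) if j % k != i]
        errors.append(mse(train(training), held_out))
    return mean(errors)

# Hypothetical data: exactly y = 2x + 1, with one incomplete record (r9).
xs = {f"r{i}": i for i in range(9)}
xs["r9"] = None
ys = {f"r{i}": 2 * i + 1 for i in range(10)}

rows = prepare(xs, ys)
beta = train(rows)
cv_error = cross_validate(rows)
prediction = score(beta, 10)   # application integration: invoke the model
```

In a real deployment `score` would sit behind an API, and tracking concept drift could be as simple as recomputing `mse` on fresh labeled data and watching for it to rise; that monitoring loop is left out of this sketch.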
TAP community
Case study: patient readmission prediction at Penn Medicine
LDA-derived medication features led to a 15% improvement in accuracy
• Raw Medication Lists: data are noisy and sparse (42,358 features)
• Cleaned Medication Lists (text processing methods, regular expressions): data are less noisy, but sparse (23,663 features)
• LDA-derived Features: data are neither noisy nor sparse (20 features)
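The cleaning step can be illustrated with a small regular-expression pass. The sketch below is not Penn Medicine's pipeline: the sample strings, the patterns, and the `clean_medication` helper are hypothetical, chosen only to show how normalizing raw text collapses many spellings of one medication into a single feature.

```python
import re

# Hypothetical patterns: dose amounts, dosage-form words, and leftover symbols.
DOSE = re.compile(r"\b\d+(\.\d+)?\s*(mg|mcg|g|ml|units?)\b")
FORM = re.compile(r"\b(tablets?|tabs?|capsules?|caps?|oral|po)\b")
NON_ALPHA = re.compile(r"[^a-z\s]")

def clean_medication(raw):
    """Lowercase an entry, strip dose and form, and collapse whitespace."""
    text = raw.lower()
    text = DOSE.sub(" ", text)
    text = FORM.sub(" ", text)
    text = NON_ALPHA.sub(" ", text)
    return " ".join(text.split())

# Hypothetical raw entries: five distinct strings, two underlying medications.
raw_list = [
    "Furosemide 40 MG Tablet",
    "furosemide 20mg tab",
    "FUROSEMIDE (oral)",
    "Lisinopril 10 mg",
    "lisinopril 2.5mg PO",
]

cleaned = [clean_medication(m) for m in raw_list]
raw_features = set(raw_list)      # one feature per distinct raw string
clean_features = set(cleaned)     # one feature per distinct cleaned string
```

Treating each distinct string as a feature, the raw list yields five features and the cleaned list two, which mirrors on a toy scale the reduction in feature count reported above.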
Penn Medicine wants to identify and stratify heart failure patients at risk of re-admission within 30 or 90 days of discharge.
• Patient phenotype approach to risk classification
• Use of patient medication history
• Applying unsupervised text analytics algorithms, such as Latent Dirichlet Allocation (LDA), to model relationship between medications and medical conditions
• Using this model with patient health records to identify high-risk patient profiles
• Evaluating individual patient risk of re-admission for new and existing patients
Vin Sharma / @ciphr / [email protected]