The Science of Data Science (Data plus Semantics yields Knowledge)
Prof. James Hendler
Tetherless World Constellation Chair of Computer, Web and
Cognitive Sciences
Director, The Rensselaer IDEA
The Rensselaer Institute for Data
Exploration and Applications
Performance Plan to Budget Presentation
February 2015
The Rensselaer Institute for Data Exploration and Applications (IDEA) is a breakthrough initiative that brings together key research areas and advanced technologies to revolutionize the way we use data in science, engineering, and virtually every other research and educational discipline. By bridging the gaps between analytics, modeling, and simulation, we continue the Rensselaer tradition as a leader in applying critical technologies to improving everyday life and meeting the challenges of the future.
The Rensselaer Institute for Data Exploration and Applications
• Business Systems
• Built and Natural Environments
• Cyber-Resiliency
• Policy, Ethics and Stewardship
• Materials Informatics
• Data-driven Physical/Life Sciences
• Healthcare Analytics and Mobile Health
• Social Network Analytics
• Agents and Augmented Reality
IDEA project examples
• Healthcare in Context: Data mining/analytics to improve public health from a systems perspective at the individual to national scales.
• Data-Centric Engineering Design: Data-driven design and control under uncertainty via data fusion across multiple scales and sources.
• Supply Chain Resilience through Information Visibility: Demonstrate uses of supply chain information visibility for anticipating, mitigating and recovering from disruptive events.
• Accelerated design of functional materials/Material Ontology: Address basic materials processing data-based informatics for complex, multifunctional (often nano) materials.
• Biome-informatics: Develop data aggregation and computational tools to integrate disparate datasets into large ecosystem models using data collected on the microbial communities that inhabit the base of most ecosystems.
• Deducing Structure to Function in Biomedicine: Develop systematic data-resourced methods for discovering and exploiting structure-to-function relationships.
KDD Pipeline – as usually presented
Data Storage (Big Data Warehouse)
KDD Pipeline – in the real world
Data is increasingly being brought in from external sources, with mixed provenance, and increasingly outside the analyzers’ control, at increasing rates and scales.
[Figure: multiple data stores fed by external sources (sensors and apps, social media, customer behaviors, Web, partners, open data); ingestion requires formatting, standards use, data cleansing, data bias analysis, …]
Tough data integration challenges
Enterprise analytics
Open Data Integration
Hard problems!
Closing the loop on (big) data
IDEA is focusing on key data science areas which are revolutionizing engineering, science and business with significant social impact
Predictive Analytics, Discovery Informatics, Data Exploration
Theme 1: Predictive Analytics
From “what is” to “what if”
Courtesy of Eric Schadt, Mount Sinai
Example: Healthcare Data Analytics
The Digital Universe of Data to Better Diagnose and Treat Patients
Courtesy of Eric Schadt, Mount Sinai
Identifying predictive features in data
Each factor must be separately analyzed for its “Predictivity”
• Mutual information measure
The “black art” of predictive analytics is finding the right ones
• Use too few, the model is weak
• Use too many, the model becomes slow and dominated by noise
Algorithms are required because the overwhelming number of “weak” factors defies human ability to combine them
• Machine learning identifies key features
  • some require “roll ups”
  • some require “pull outs”
• Mathematical techniques then reduce the dimensionality
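The mutual information measure mentioned on this slide can be sketched for discrete factors as follows. This is a minimal illustration, not the pipeline used in the talk: the toy data, the feature names ("smoker", "coin"), and the helper names are all assumptions made for the example.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information I(X;Y), in bits, between two discrete sequences."""
    n = len(xs)
    px = Counter(xs)
    py = Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # Each term is p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with counts.
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

def rank_features(rows, labels):
    """Score every factor separately and sort from most to least predictive."""
    names = list(rows[0].keys())
    scores = {name: mutual_information([r[name] for r in rows], labels)
              for name in names}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: "smoker" determines the label, "coin" is pure noise.
rows = [
    {"smoker": 1, "coin": 0},
    {"smoker": 1, "coin": 1},
    {"smoker": 0, "coin": 0},
    {"smoker": 0, "coin": 1},
]
labels = [1, 1, 0, 0]
ranking = rank_features(rows, labels)
```

Keeping only the top few entries of `ranking` is one simple way to trade off between "too few" and "too many" factors; where to cut is a judgment call the sketch does not make.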
Predictive analytics in sensors
Extend-o-hand (Josh Shinavier, PhD)
Classification of the sensor data (via machine learning) allows predictive recognition of different gestures (i.e. before the gesture is finished).
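Recognizing a gesture before it is finished can be illustrated by classifying a growing prefix of the sensor stream. The sketch below swaps in a simple nearest-template rule for whatever learned classifier the project used; the template values, gesture names, and the `margin` threshold are hypothetical.

```python
import math

def prefix_distance(stream, template):
    """Euclidean distance between the observed samples and the start of a template."""
    m = min(len(stream), len(template))
    return math.sqrt(sum((stream[i] - template[i]) ** 2 for i in range(m)))

def recognize_early(stream, templates, margin=2.0):
    """Return (label, samples_used) once the best template beats the runner-up
    by a factor of `margin`; return (None, len(stream)) if that never happens.
    Assumes at least two templates."""
    for t in range(1, len(stream) + 1):
        dists = sorted((prefix_distance(stream[:t], tpl), name)
                       for name, tpl in templates.items())
        best, second = dists[0], dists[1]
        if second[0] > 0 and best[0] * margin <= second[0]:
            return best[1], t
    return None, len(stream)

# Hypothetical one-dimensional sensor templates, eight samples per gesture.
templates = {
    "wave": [0, 1, 0, -1, 0, 1, 0, -1],
    "point": [0, 1, 2, 3, 4, 4, 4, 4],
}
# Only the first four samples of a noisy "point" gesture have arrived so far.
result = recognize_early([0.0, 1.1, 1.9, 3.0], templates)
```

Here the decision is made as soon as the two templates diverge, which is the sense in which the recognition is predictive rather than retrospective.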
Predictive analytics in large scale behaviors
“List clusters at risk for Asian Clams <1 mile Cook’s Bay.”
Machine learning predicts future distributions of invasive species in Lake George based on current distributions and bathymetry similarity.
Predictive Social Network Analytics (with RPI NeST center)
Social Networks in Action
Analyzing cascading failures
Modeling (supply chain) networks… and predicting (cascading) network risks.
Modeling network stressors (including human cognitive element)
Understanding network dynamics
Data Science Research Center: tools for data analytics
Theory & Algorithms
• Randomized
• Optimization
• Approximation
• Multilinear Algebra
Applications
Statistics
• Multivariate analysis
• Optimal Experimental Design
Dimension reduction by randomized algorithms for numerical linear algebra, for identifying significant components and visualizing petabyte-scale data matrices (P. Drineas, CSCI)
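As a small illustration of randomized dimension reduction, the sketch below uses a Gaussian random projection, one of the simplest techniques in this family. It is a swapped-in example, not the algorithm from the cited work; the dimensions, seeds, and function names are assumptions.

```python
import math
import random

def random_projection_matrix(d, k, seed=0):
    """k x d matrix with N(0, 1/k) entries, so squared lengths are
    preserved in expectation after projection."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]

def project(matrix, x):
    """Map a d-dimensional vector to k dimensions."""
    return [sum(r_j * x_j for r_j, x_j in zip(row, x)) for row in matrix]

def distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Hypothetical data: four random points in 500 dimensions, reduced to 200.
d, k = 500, 200
rng = random.Random(1)
points = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(4)]
R = random_projection_matrix(d, k)
low = [project(R, p) for p in points]

# Ratio of projected to original distance for every pair; values near 1
# mean the geometry survived the reduction.
ratios = [distance(low[i], low[j]) / distance(points[i], points[j])
          for i in range(4) for j in range(i + 1, 4)]
```

The projection matrix is chosen without looking at the data, which is what makes this kind of method attractive at very large scale: the data can be reduced in a single streaming pass.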
Parallel Factor Analysis for tensor systems creates a scalable
solution, on AMOS, for a critical data-processing component of
data analytics for large graphs. (B. Yener, CSCI)
Computational concerns
• Scaling
• Cyber Security for Data
Adding Semantics: Discovery Informatics
From “what if” to “Why”
Scientific data: Microbiome informatics
Human Biome
Environmental Biome
Built Environment
Data Analytics
Semantic Data Integration
While microbes are among the smallest
organisms on the planet, they are also
the largest influence on mass and
nutrient transport in the biosphere. They
are the base of most natural ecosystems,
as well as the purveyors of air and water
quality. It is also microbes that primarily
govern disease transfer and human
health in our built environments.
Materials Processing Ontology (cMDIS/IDEA)
The materials field has made much progress on systematically understanding materials structure-to-property relationships, but lacks an organized model of processing-to-property relations.
A critical need for systematic development of new materials technologies!
Goal: Create a (machine-readable) ontology
for materials processing.
By combining our expertise in data science,
materials and manufacturing, we are creating a
key missing link in the Materials Genome
Initiative.
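To make "machine-readable" concrete, processing-to-property relations can be written as subject/predicate/object triples that a program can query. The sketch below is purely illustrative: every term in it (Annealing, affectsProperty, influences, and so on) is a hypothetical placeholder, not vocabulary from the actual ontology.

```python
# Hypothetical triples; none of these terms come from the real ontology.
triples = {
    ("Annealing", "isA", "ProcessingStep"),
    ("Annealing", "hasParameter", "Temperature"),
    ("Annealing", "affectsProperty", "GrainSize"),
    ("GrainSize", "influences", "YieldStrength"),
    ("Sintering", "isA", "ProcessingStep"),
    ("Sintering", "affectsProperty", "Porosity"),
}

def match(pattern, store):
    """Return all triples matching a (subject, predicate, object) pattern.
    None acts as a wildcard in any position."""
    return sorted(t for t in store
                  if all(p is None or p == v for p, v in zip(pattern, t)))

def properties_reached_from(step, store):
    """Follow affectsProperty once, then influences transitively, to list
    every property a processing step can reach."""
    frontier = [o for _, _, o in match((step, "affectsProperty", None), store)]
    seen = []
    while frontier:
        prop = frontier.pop()
        if prop not in seen:
            seen.append(prop)
            frontier.extend(o for _, _, o in match((prop, "influences", None), store))
    return sorted(seen)

steps = [s for s, _, _ in match((None, "isA", "ProcessingStep"), triples)]
reached = properties_reached_from("Annealing", triples)
```

The second query is the kind of processing-to-property question the slide says is currently hard to answer systematically; a real ontology would typically use a standard such as RDF/OWL rather than Python tuples.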
Some questions need a qualitative answer
Platform for Experimental Collaborative Ethnography
Discovery Informatics Requires Unstructured Data
Integration of text analytics, natural language processing, network-based multimedia analysis and structured/unstructured data integration
Requires unstructured data (real-time feeds/images/video)
DOE SEAB report on HPC:
How might a neuromorphic “accelerator” type processor be
used to improve the application performance, power
consumption and overall system reliability of future
exascale systems?
Power Consumption (w/IBM)
Network Learning (sensors)
Sparse Distributed Representations
Hybrid Neural/Symbolic Systems
Neuromorphic Computing: software systems that implement models
inspired by neural systems to analyze data tied to perception, motor control,
or multisensory integration.
Neuromorphic Computing (CCI/IDEA)
Joint CCI/IDEA project to use a supercomputer to model state-of-the-art neuromorphic processors
Use for improving AMOS energy use (like autonomic control)
Use for exploring inputs from data-sensing systems (extrinsic control)
Neuromorphic Computing requires critical Rensselaer technologies
Integrating data analytics (on the fly) with simulation and modeling
CCI (AMOS) allows us to explore new variants on neuromorphic approaches
IDEA provides learning models and analytics capabilities for evaluation
Together, these allow us to attack audio/visual streaming data
Theme 3: Data Exploration
From “why” to “what is”
From visualization to exploration
… Unfortunately, visualization too often becomes an end product of scientific analysis,
rather than an exploration tool that scientists can use throughout the research life cycle.
However, new database technologies, coupled with emerging Web-based technologies,
may hold the key to lowering the cost of visualization generation and allow it to become
a more integral part of the scientific process.
From what is, to what if, to why (and back)
These capabilities are critical in “closing the loop” between data,
simulation and modeling in scientific discovery, engineering
design, and business innovation.
A “Data Science” Research Agenda
Multiscale
Sparsity
Abductive
Agent-oriented
• Gathering and representing information from multiple sources
  • topic of CODS talk
• Systematic (and scalable) methods for predictive analytics
  • example: Parallel search for best kernel functions
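A parallel search over kernel functions can be sketched as scoring every (kernel, bandwidth) candidate independently and keeping the best. The example below assumes kernel regression scored by leave-one-out error, which may differ from the setting meant in the talk; the candidate kernels, bandwidths, and data are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Candidate kernel functions (an assumed, illustrative set).
KERNELS = {
    "gaussian": lambda u: math.exp(-0.5 * u * u),
    "epanechnikov": lambda u: max(0.0, 1.0 - u * u),
    "triangular": lambda u: max(0.0, 1.0 - abs(u)),
}

def loo_error(config, xs, ys):
    """Leave-one-out mean squared error of kernel-weighted regression
    for one (kernel name, bandwidth) candidate."""
    name, h = config
    kernel = KERNELS[name]
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num = den = 0.0
        for j, (xj, yj) in enumerate(zip(xs, ys)):
            if i != j:
                w = kernel((xi - xj) / h)
                num += w * yj
                den += w
        pred = num / den if den > 0 else 0.0
        total += (pred - yi) ** 2
    return total / len(xs)

def parallel_kernel_search(xs, ys, bandwidths, workers=4):
    """Score all candidates concurrently; return (error, (kernel, bandwidth))."""
    configs = list(product(KERNELS, bandwidths))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        errors = list(pool.map(lambda c: loo_error(c, xs, ys), configs))
    return min(zip(errors, configs))

# Hypothetical data: 30 samples of a sine curve.
xs = [i / 10 for i in range(30)]
ys = [math.sin(x) for x in xs]
best_error, (best_kernel, best_bandwidth) = parallel_kernel_search(
    xs, ys, [0.15, 0.5, 2.0])
```

Threads keep the sketch self-contained but do not speed up CPU-bound pure Python; because each candidate is scored independently, the same structure maps onto process pools or cluster schedulers where real speedups are available.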
Supporting the Scientific agenda
• New Data Exploration platforms
  • example: Patent pending on new multi-user collaborative device
• Cognitive and immersive platforms
• Data sharing standards
  • Research Data Alliance
  • W3C
The Rensselaer IDEA
Summary
• Data is not just the “oil” of the new generation
  • information is the new power source generated from that “oil”
• Using data for prediction is becoming less of an art,
but still needs systematicity
• Scaling tools beyond MapReduce
• Better methods for rapid customization
• Turning data into causal or design knowledge is in its
early stages
• Closing the loop from data to design requires new informatics,
new mathematics, and new ways of thinking beyond data mining