the science of data science

The Science of Data Science (Data plus Semantics yields Knowledge)

Prof. James Hendler

Tetherless World Constellation Chair of Computer, Web and Cognitive Sciences

Director, The Rensselaer IDEA

The Rensselaer Institute for Data Exploration and Applications

Performance Plan to Budget Presentation

February 2015

The Rensselaer Institute for Data Exploration and Applications (IDEA) is a breakthrough initiative that brings together key research areas and advanced technologies to revolutionize the way we use data in science, engineering, and virtually every other research and educational discipline. By bridging the gaps between analytics, modeling, and simulation, we continue the Rensselaer tradition as a leader in applying critical technologies to improving everyday life and meeting the challenges of the future.

The Rensselaer Institute for Data Exploration and Applications

• Business Systems

• Built and Natural Environments

• Cyber-Resiliency

• Policy, Ethics and Stewardship

• Materials Informatics

• Data-driven Physical/Life Sciences

• Healthcare Analytics and Mobile Health

• Social Network Analytics

• Agents and Augmented Reality

IDEA project examples

• Healthcare in Context: Data mining/analytics to improve public health from a systems perspective at the individual to national scales.

• Data-Centric Engineering Design: Data-driven design and control under uncertainty via data fusion across multiple scales and sources.

• Supply Chain Resilience through Information Visibility: Demonstrate uses of supply chain information visibility for anticipating, mitigating and recovering from disruptive events.

• Accelerated design of functional materials/Material Ontology: Address basic materials processing data-based informatics for complex, multifunctional (often nano) materials.

• Biome-informatics: Develop data aggregation and computational tools to integrate disparate datasets into large ecosystem models using data collected on the microbial communities that inhabit the base of most ecosystems.

• Deducing Structure to Function in Biomedicine: Develop systematic data-resourced methods for discovering and exploiting structure-to-function relationships.

KDD Pipeline – as usually presented

Data Storage (Big Data Warehouse)

KDD Pipeline – in the real world

Data is increasingly being brought in from external sources, with mixed provenance, and increasingly outside the analyzers’ control.

At increasing rates and scales

[Figure: data sources (sensors and apps, social media, customer behaviors, Web, partners, open data) and multiple Data Storage nodes]

Formatting, standards use, data cleansing, data bias analysis, …

Tough data integration challenges

[Figure labels: Enterprise analytics; Open Data; Integration; Hard problems!]

Closing the loop on (big) data

IDEA is focusing on key data science areas which are revolutionizing engineering, science and business with significant social impact

Predictive Analytics | Discovery Informatics | Data Exploration

Theme 1: Predictive Analytics

From “what is” to “what if”

Courtesy of Eric Schadt, Mount Sinai

Example: Healthcare Data Analytics

The Digital Universe of Data to Better Diagnose and Treat Patients

Courtesy of Eric Schadt, Mount Sinai

Identifying predictive features in data

Each factor must be separately analyzed for its “Predictivity”

• Mutual information measure

The “black art” of predictive analytics is finding the right ones

• Use too few, the model is weak

• Use too many, the model becomes slow and dominated by noise

Algorithms are required to do this because the overwhelming number of “weak” factors defies human ability to combine them

• Machine learning identifies key features

• some require “roll ups”

• some require “pull outs”

• Mathematical techniques then reduce the dimensionality
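
The per-factor "predictivity" check named above (a mutual information measure) can be sketched in a few lines. This is a minimal illustration for discrete features, not the pipeline used in the talk; the feature names and the toy table are hypothetical.

```python
import math
from collections import Counter

def mutual_information(feature, label):
    """Mutual information I(X;Y) in bits for two equal-length discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, label))
    px = Counter(feature)
    py = Counter(label)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

def rank_features(table, label):
    """Return (name, MI) pairs sorted from most to least predictive."""
    scores = {name: mutual_information(col, label) for name, col in table.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy data: one outcome column and three candidate factors.
label = [1, 1, 0, 0, 1, 1, 0, 0]
table = {
    "pure_noise":   [1, 0, 1, 0, 1, 0, 1, 0],  # statistically independent of label
    "weak_signal":  [1, 1, 1, 0, 1, 1, 0, 0],  # agrees with label in 7 of 8 rows
    "copies_label": [1, 1, 0, 0, 1, 1, 0, 0],  # identical to label
}

ranking = rank_features(table, label)
for name, score in ranking:
    print(f"{name:13s} {score:.3f} bits")
```

Ranking factors one at a time like this only addresses the first step on the slide; choosing how many to keep, and combining many weak factors, is the part that needs learning algorithms and dimensionality reduction.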

Predictive analytics in sensors

Extend-o-hand (Josh Shinavier, PhD)

Classification of the sensor data (via machine learning) allows predictive recognition of different gestures (i.e., before the gesture is finished).
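
To make "recognition before the gesture is finished" concrete, here is a small sketch that matches the samples seen so far against the prefix of each known gesture and commits as soon as one candidate is clearly ahead. Nearest-template matching stands in for the machine-learned classifier; the templates, the one-dimensional signal, and the `margin` threshold are all hypothetical and are not taken from the Extend-o-hand project.

```python
import math

# Hypothetical 1-D sensor templates, one per gesture (illustrative values only).
TEMPLATES = {
    "wave":  [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5],
    "point": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.0, 1.0],
    "tap":   [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
}

def prefix_distance(partial, template):
    """Euclidean distance between the samples seen so far and the template prefix."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(partial, template)))

def predict_early(partial, templates=TEMPLATES, margin=0.3):
    """Return the best gesture once it beats the runner-up by `margin`, else None."""
    ranked = sorted((prefix_distance(partial, t), name) for name, t in templates.items())
    (best_d, best), (second_d, _) = ranked[0], ranked[1]
    return best if second_d - best_d >= margin else None

def recognize(stream):
    """Consume samples one at a time; stop at the first confident prediction."""
    seen = []
    for sample in stream:
        seen.append(sample)
        guess = predict_early(seen)
        if guess is not None:
            return guess, len(seen)
    return None, len(seen)

# A noisy "wave" is identified after 3 of its 8 samples.
gesture, samples_used = recognize([0.0, 0.45, 0.95, 0.55, 0.0, -0.5, -1.0, -0.5])
print(gesture, samples_used)
```

The margin trades earliness against errors: a larger value waits for more of the gesture before committing.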

Predictive analytics in large scale behaviors

“List clusters at risk for Asian Clams <1 mile Cook’s Bay.”

Machine learning predicts future distributions of invasive species in Lake George based on current distributions and bathymetry similarity.

Predictive Social Network Analytics (with RPI NeST center)

Social Networks in Action

Analyzing cascading failures

Modeling (supply chain) networks… and predicting (cascading) network risks.

Modeling network stressors (including human cognitive element)

Understanding network dynamics

Data Science Research Center: tools for data analytics

Theory & Algorithms

• Randomized

• Optimization

• Approximation

• Multilinear Algebra

Applications

Statistics

• Multivariate analysis

• Optimal Experimental Design

Dimension reduction by randomized algorithms for numerical linear algebra, for identifying significant components and visualizing petabyte-scale data matrices (P. Drineas, CSCI)
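
As a rough illustration of randomized dimension reduction, the sketch below applies a generic Gaussian random projection (in the spirit of the Johnson-Lindenstrauss lemma) and checks that a pairwise distance is approximately preserved. It is not the specific algorithm from the work cited above; the dimensions, seeds, and function names are assumptions chosen for the example.

```python
import math
import random

def random_projection_matrix(d, k, seed=0):
    """k x d matrix with i.i.d. N(0, 1/k) entries (Gaussian random projection)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]

def project(R, x):
    """Map a d-dimensional vector to k dimensions: y = R x."""
    return [sum(r * v for r, v in zip(row, x)) for row in R]

def distance(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Two hypothetical 1000-dimensional data points, reduced to 200 dimensions.
data_rng = random.Random(1)
a = [data_rng.random() for _ in range(1000)]
b = [data_rng.random() for _ in range(1000)]

R = random_projection_matrix(d=1000, k=200)
ratio = distance(project(R, a), project(R, b)) / distance(a, b)
print(f"projected / original distance: {ratio:.3f}")  # close to 1 in expectation
```

The projection matrix is data-independent, which is what makes this family of methods attractive at scale: it can be generated on the fly and applied in a single pass over the data.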

Parallel Factor Analysis for tensor systems creates a scalable solution, on AMOS, for a critical data-processing component of data analytics for large graphs. (B. Yener, CSCI)

Computational concerns

• Scaling

• Cyber Security for Data

Adding Semantics: Discovery Informatics

From “what if” to “Why”

Scientific data: Microbiome informatics

Human Biome

Environmental Biome

Built Environment

Data Analytics

Semantic Data Integration

While microbes are among the smallest organisms on the planet, they are also the largest influence on mass and nutrient transport in the biosphere. They are the base of most natural ecosystems, as well as the purveyors of air and water quality. It is also microbes that primarily govern disease transfer and human health in our built environments.

Materials Processing Ontology (cMDIS/IDEA)

The materials field has made much progress on systematically understanding materials structure-to-property relationships, but lacks an organized model of processing-to-property relations.

A critical need for systematic development of new materials technologies!

Goal: Create a (machine-readable) ontology for materials processing.

By combining our expertise in data science, materials and manufacturing, we are creating a key missing link in the Materials Genome Initiative.

Some questions need a qualitative answer

Platform for Experimental Collaborative Ethnography

Discovery Informatics Requires Unstructured data

Integration of text analytics, natural language processing, network-based multimedia analysis and structured/unstructured data integration

Requires Unstructured data (real-time feeds/images/video)

DOE SEAB report on HPC: How might a neuromorphic “accelerator” type processor be used to improve the application performance, power consumption and overall system reliability of future exascale systems?

Power Consumption (w/IBM)

Network Learning (sensors)

Sparse Distributed Representations

Hybrid Neural/Symbolic Systems

Neuromorphic Computing: software systems that implement models inspired by neural systems to analyze data tied to perception, motor control, or multisensory integration.

Neuromorphic Computing (CCI/IDEA)

Joint CCI/IDEA project to use a supercomputer to model state-of-the-art neuromorphic processors

Use for improving AMOS energy use (like autonomic control)

Use for exploring inputs from data-sensing systems (extrinsic control)

Neuromorphic Computing requires critical Rensselaer technologies

Integrating data analytics (on the fly) with simulation and modeling

CCI (AMOS) allows us to explore new variants on neuromorphic approaches

IDEA provides learning models and analytics capabilities for evaluation

Together, they allow us to attack audio/visual streaming data

Theme 3: Data Exploration

From “why” to “what is”

From visualization to exploration

… Unfortunately, visualization too often becomes an end product of scientific analysis, rather than an exploration tool that scientists can use throughout the research life cycle. However, new database technologies, coupled with emerging Web-based technologies, may hold the key to lowering the cost of visualization generation and allow it to become a more integral part of the scientific process.

From what is, to what if, to why (and back)

These capabilities are critical in “closing the loop” between data, simulation and modeling in scientific discovery, engineering design, and business innovation.

A “Data Science” Research Agenda

Multiscale | Sparsity | Abductive | Agent-oriented

• Gathering and representing information from multiple sources

  • topic of CODS talk

• Systematic (and scalable) methods for predictive analytics

  • example: Parallel search for best kernel functions
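
One way a parallel search over kernel functions could look is sketched below: each candidate kernel is scored independently, so the scoring jobs can be farmed out to a worker pool. The scoring criterion (kernel-target alignment), the three candidate kernels, and the toy data are assumptions made for illustration; the slide does not say how the search is actually performed.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def linear(x, z):
    return sum(a * b for a, b in zip(x, z))

def poly2(x, z):
    return (1.0 + linear(x, z)) ** 2

def rbf(x, z, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def alignment(kernel, X, y):
    """Kernel-target alignment <K, yy^T>_F / (||K||_F * ||yy^T||_F), labels in {-1, +1}."""
    n = len(X)
    num = 0.0
    sq = 0.0
    for i in range(n):
        for j in range(n):
            k = kernel(X[i], X[j])
            num += k * y[i] * y[j]
            sq += k * k
    return num / (math.sqrt(sq) * n)  # ||yy^T||_F equals n for +/-1 labels

def best_kernel(candidates, X, y, workers=4):
    """Score every candidate concurrently; return the winner and all scores."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda item: alignment(item[1], X, y), candidates.items()))
    ranked = sorted(zip(scores, candidates), reverse=True)
    return ranked[0][1], dict(zip(candidates, scores))

# Hypothetical XOR-style data: no linear kernel can align with these labels.
X = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
y = [1, 1, -1, -1]
CANDIDATES = {"linear": linear, "poly2": poly2, "rbf": rbf}

best, scores = best_kernel(CANDIDATES, X, y)
print(best, scores)
```

Threads keep the sketch simple and portable; for CPU-bound scoring on real data, a process pool or a cluster scheduler would be the natural substitute, since the candidates share no state.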

Supporting the Scientific agenda

• New Data Exploration platforms

  • example: Patent pending on new multi-user collaborative device

• Cognitive and immersive platforms

• Data sharing standards

  • Research Data Alliance

  • W3C

The Rensselaer IDEA

Summary

• Data is not just the “oil” of the new generation

• information is the new power source generated from that “oil”

• Using data for prediction is becoming less of an art, but still needs systematicity

• Scaling tools beyond MapReduce

• Better methods for rapid customization

• Turning data into causal or design knowledge is in its early stages

• Closing the loop from data to design requires new informatics, new mathematics, and new ways of thinking beyond data mining
