From Data Mining to Discovery Analytics

Naren Ramakrishnan, CS@VT

CS Open House, March 25, 2011



What is data mining?

• Extracting non-trivial and actionable patterns from data (lots of data)

• Integrates ideas from
  – Algorithms
  – Databases
  – Statistical Inference
  – Visualization

Data mining in daily life

Mining supermarket baskets

Tracking flu spread

Man vs Manhole

Data mining research at CS@VT

• Algorithmic innovations motivated by real applications

• One technique, multiple uses
  – Storytelling
    • Life sciences, intelligence analysis
  – Event sequence discovery
    • Manufacturing, neuroscience, sustainability
  – Graph mining
    • Social networks, biochemical networks


Storytelling

• A “Connecting the dots” Problem

• Input
  – Documents (lots of them)

L. Garczarek, N. Ramakrishnan, D. Kumar, R.F. Helm, and M. Potts, Global cross-over points in the genome responses of Synechocystis sp. PCC 6803, to dehydration, UV-irradiation, and other stresses, Technical Report, Department of Biochemistry, Virginia Tech, 2010.

M.B. Roth and T. Nystul, Buying time in suspended animation, Scientific American, Vol. 292, No. 6, pages 48-55, June 2005.

?

Connecting the dots

L. Garczarek, N. Ramakrishnan, D. Kumar, R.F. Helm, and M. Potts, Global cross-over points in the genome responses of Synechocystis sp. PCC 6803, to dehydration, UV-irradiation, and other stresses, Technical Report, Department of Biochemistry, Virginia Tech, 2010.

M.B. Roth and T. Nystul, Buying time in suspended animation, Scientific American, Vol. 292, No. 6, pages 48-55, June 2005.

L. Schmitt and R. Tampe, Structure and mechanism of ABC transporters, Current Opinion in Structural Biology, Vol. 14, No. 4, pages 426-431, Aug 2004.

J.W. Scott, S.A. Hawley, K.A. Green, M. Anis, G. Stewart, G.A. Scullion, D.G. Norman, and D.G. Hardie, CBS domains form energy-sensing modules whose binding of adenosine ligands is disrupted by disease mutations, Journal of Clinical Investigation, Vol. 113, No. 2, pages 182-184, Jan 2004.

C. Tang, X. Li and J. Du, Hydrogen sulfide as a new endogenous gaseous transmitter in the cardiovascular system, Current Vascular Pharmacology, Vol. 4, No. 1, pages 17-22, Jan 2006.

CBS domains

ABC transporters

Ligands bound to CBS domains

Hydrogen sulfide

Connecting the dots
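One way to picture the "connecting the dots" problem is as a path search: given a start and an end document, look for a chain of intermediate documents in which each neighboring pair shares enough vocabulary. The sketch below is a minimal illustration under stated assumptions, not the method used in the talk. The Jaccard overlap measure, the `min_sim` threshold, the function names, and the toy term sets (`d1` to `d5`, made up for illustration and not extracted from the cited papers) are all hypothetical.

```python
from collections import deque


def jaccard(a, b):
    """Jaccard overlap between two sets of terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_story(docs, start, end, min_sim=0.15):
    """Breadth-first search for a shortest chain of documents from start to end.

    docs maps a document id to its set of terms. Two documents may be adjacent
    in a chain only if their Jaccard overlap is at least min_sim.
    Returns the chain as a list of ids, or None if no chain exists.
    """
    if start not in docs or end not in docs:
        raise KeyError("start and end must both be keys of docs")
    parent = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == end:
            path = []
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for other in docs:
            if other not in parent and jaccard(docs[current], docs[other]) >= min_sim:
                parent[other] = current
                queue.append(other)
    return None


# Toy term sets, invented for illustration only.
docs = {
    "d1": {"dehydration", "stress", "cbs domain"},
    "d2": {"cbs domain", "ligand", "energy sensing"},
    "d3": {"ligand", "abc transporter", "energy sensing"},
    "d4": {"hydrogen sulfide", "abc transporter"},
    "d5": {"hydrogen sulfide", "suspended animation"},
}

story = find_story(docs, "d1", "d5")
print(story)
```

Raising `min_sim` demands stronger links between neighbors and can make a story disappear altogether; lowering it admits weaker, possibly spurious connections.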

The Storygrapher

Bidirectional exploration of “leads” on System X

Stories mined from Wikileaks

[Figure: graph over the entities Spain, USA, Venezuela, European Union, Libya, Al-Qaeda, Ghana, Afghanistan, and UN, with the link annotations listed below.]

• USA is concerned about ships with Venezuela
• USA creates pressure on Netherlands to boycott Venezuela
• Spain claims that their relationship with Venezuela is strictly financial and political
• Netherlands want to maintain good relationship with Venezuela
• US Embassy in Venezuela is trying to create positive impression about USA among Venezuelans
• Spanish Foreign Minister visits Cuba. Spain has relation with Cuba.
• Al-Qaeda is a concern to Libya
• Spain suspects Al-Qaeda for Madrid bombing
• Spain tries to convince EU to involve Cuba in the political community
• Libya provides help to boost agriculture in Ghana
• Libyan company invests in Liberia
• Libya lends tractors to Mozambique
• Spain wants to increase presence in Afghanistan
• USA and Libya tie about counter-terrorism cooperation, prospective military-to-military ties, and petroleum resources.
• Human rights violation in Cuba concerns UN

Story 1: 05MADRID1604 05MADRID703 08CARACAS420 07THEHAGUE2012
Story 2: 05MADRID1604 05MADRID1879 09MADRID1121 09LONDON2592
Story 3: 09TRIPOLI221 08TRIPOLI680 09TRIPOLI73 04MADRID974 06MADRID2657

Three automatically discovered stories summarized by an analyst


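Event sequence discovery

One long sequence of events

$(E_1, t_1), (E_2, t_2), \ldots, (E_n, t_n)$

[Figure: raster plot with event types A, B, C, D on the vertical axis (Events) and Time on the horizontal axis; a 1 marks each occurrence.]

Event of Type A occurred at t = 6.1

Event of Type D occurred at t = 5.2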

Application contexts

Multi-neuronal spike trains

Assembly lines

Reactor systems

From Events to Episodes to Structures

[Figure: events over time → Episode Mining → episodes ACDEFH, BDCFEH, ACDEFH, BCDEFH, ... → Structure Discovery → a structure over the event types A to H joined by (or), (and), and (not) connectives.]

Why is this problem difficult?


CQLQSOQKDRQXCDRZSNQRVDXPDTBHYOCJUSWLPEDFTEQEDYASTKRYIVDTGZJUYEUPXFEQYCTCEFSSFAEOJSOBKREKSWIEQEKLSISRNSMEDWNCRESXNDQEFNSXEBYSBYRRYQTWDAOOWPKJEIINAUECBIMSFEFSSRJIBOEIPSWEEXYQTDXIRMSISNMEAREARSDSJDKFJHCIOWBSEUPAUSSBXEHYTMTLBPWERYKQDIHEBJOWSMUEROXPYFKPTINEOSSASJPEKNEBIGNMESQLCQLQIGWFIELJMELSRLNSGCTZPXJXETZSOOPAEKIERSKOIQDIPKTEDFIASCNHTDONDCYUXSNYHEXTOIXEXKGEJSYIMOLECKZXIKGIFMUESSSICEHSYZOUISERBEEADASCAW


Episode      Frequency
C→S          682
O→P→E→N      439
H→O→U→S→E    260
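The slide lists frequencies for serial episodes such as C→S but does not say how frequency is defined. The sketch below assumes one common definition, the number of non-overlapped occurrences, where an occurrence is the episode's event types appearing in order (not necessarily adjacent) and a new occurrence may start only after the previous one ends. The function name and the toy event stream are hypothetical, and the code is not expected to reproduce the counts on the slide.

```python
from typing import Iterable, Sequence


def count_nonoverlapped(events: Iterable[str], episode: Sequence[str]) -> int:
    """Count non-overlapped occurrences of a serial episode in an event stream.

    A single pointer tracks the next event type the episode is waiting for.
    Whenever the whole episode has been seen in order, the count goes up and
    the pointer resets, so occurrences never share events.
    """
    if not episode:
        raise ValueError("episode must contain at least one event type")
    count = 0
    pos = 0
    for event_type in events:
        if event_type == episode[pos]:
            pos += 1
            if pos == len(episode):
                count += 1
                pos = 0
    return count


# Toy stream of (event type, time) pairs, already sorted by time.
stream = [("H", 0.5), ("X", 1.1), ("O", 1.4), ("U", 2.0), ("S", 2.2), ("E", 3.1),
          ("H", 3.5), ("O", 4.0), ("U", 4.2), ("S", 5.0), ("X", 5.2), ("E", 6.1)]
types = [event_type for event_type, _ in stream]

house_count = count_nonoverlapped(types, "HOUSE")
print(house_count)
```

The difficulty the slide points at is that the interesting episodes are not known in advance: the number of candidate episodes grows exponentially with episode length, so a practical miner needs a way to prune candidates rather than counting each one independently.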

Mining multi-neuronal spike trains

Record Activity → Find Repeating Patterns → Infer Network Connectivity

Modeling data center chillers

3 air cooled chillers operating

2 air cooled + 1 water cooled


Graph mining

• Input
  – Graphs (lots of them)

Biochemistry by search

“How do cells remember? What is the biochemical basis of memory?”

“Family tree” of > 3000 switches discovered by mining > 100 CPU years of simulation results

Biochemistry by search

• Which combinations of reactions endow a system with bistability?
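To make "bistability" concrete: a bistable system has two distinct stable steady states, and which one it settles into depends on where it starts. The sketch below is a toy illustration under stated assumptions, not one of the reaction networks from the talk. It uses a hypothetical one-variable rate law, dx/dt = x^2 / (1 + x^2) - 0.4 x (self-activation with linear decay), integrates it with forward Euler from several starting points, and collects the distinct end states. All names and parameter values are made up for illustration.

```python
def rate(x, decay=0.4):
    """Hypothetical rate law: cooperative self-activation minus linear decay."""
    return x * x / (1.0 + x * x) - decay * x


def steady_state(x0, dt=0.01, steps=20000):
    """Integrate dx/dt = rate(x) with forward Euler and return the final value."""
    x = x0
    for _ in range(steps):
        x += dt * rate(x)
    return x


def distinct_steady_states(initials, tol=1e-3):
    """Run from each initial value and keep end states that differ by more than tol."""
    found = []
    for x0 in initials:
        xs = steady_state(x0)
        if all(abs(xs - y) > tol for y in found):
            found.append(xs)
    return sorted(found)


states = distinct_steady_states([0.1, 0.4, 0.6, 1.0, 3.0])
print(len(states), "stable steady states:", [round(s, 3) for s in states])
```

For this rate law the fixed points are x = 0, 0.5, and 2; runs that start below 0.5 settle near 0 and runs that start above it settle near 2, which is the switch-like memory behavior the slides describe. Searching over reaction networks amounts to asking this same question for many candidate systems.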

Experiences redux

• Data mining research can be organized into “horizontals” and “verticals”

• Data mining is beneficial when
  – First-principles answers are not available
  – Information integration is key

Discovery Analytics Center

• A new ICTAS center focused on the use of analytics for scientific discovery

• Brings together
  – Core faculty from CS, STAT, MATH, ECE
  – Applications faculty from various other departments

• Some initial areas of emphasis
  – Intelligence analysis, sustainability, neuroscience

For more info

• Contact
  – Naren Ramakrishnan
  – 2050 Torgersen Hall
  – [email protected]
  – http://www.cs.vt.edu/~naren