NPACI AHM 2001 Tutorial on Data Mining for Scientific Applications

Upload: tommy96

Post on 20-Jan-2015

Page 1: NPACI AHM 2001 Tutorial on Data Mining for Scientific

San Diego Supercomputer Center

National Partnership for Advanced Computational Infrastructure

NPACI AHM2001

Tutorial on

Data Mining for Scientific Applications

Chaitan Baru

Tony Fountain

San Diego Supercomputer Center

Page 2: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Tutorial Objectives

• Provide overview of the infrastructure – technologies and techniques – for:
  • data mining, database systems

• Provide some illustrative examples of how the infrastructure can be used in scientific applications

• Present plans for the SDSC Knowledge and Information Discovery Lab (SKIDL)

• Identify potential collaborations – for applications as well as infrastructure
  – Our emphasis is on the infrastructure

Page 3: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Tutorial Outline

8:00 - 8:15 Data intensive computing in NPACI (Baru)

8:15 - 9:15 Introduction to data mining (Fountain)

9:15 - 10:15 DBMS support for analysis of large-scale data (Baru)

10:15 - 10:30 BREAK

10:30 - 12:00 Examples of data mining tools (Fountain)

Next steps...

Page 4: NPACI AHM 2001 Tutorial on Data Mining for Scientific

NPACI DICE

• Focus on data, information, and knowledge management:

• Persistent archives

• Use of XML and archival storage systems (e.g. HPSS) for data storage

• Metadata-based access to data sets (Extensible Metadata Catalog, eMCAT)

• Distributed data handling (Storage Resource Broker, SRB)

• Information mediation (Mediation of Information using XML, MIX)

• Model-based mediation (NeuroMIX), use of Topic Maps

Page 5: NPACI AHM 2001 Tutorial on Data Mining for Scientific

HPSS

• Capacity
  • Total: >400TB
  • Current usage: >240TB stored

• Load
  • Transfer rate: 1TB/day

• SRB provides a “container” mechanism for better usage and improved efficiency

Page 6: NPACI AHM 2001 Tutorial on Data Mining for Scientific

The SDSC Storage Resource Broker

• Metadata-based access to data sets stored in distributed, heterogeneous storage resources

[Figure labels: Application (SRB client); SRB Middleware (SRB Servers, MCAT); Distributed Storage Resources (DB2, Oracle, ObjectStore; HPSS, UniTree; UNIX, ftp); platforms listed: Solaris, Linux, NT, AIX, HP-UX, IRIX]

Page 7: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Current Usage of SRB

• Collections
  • Digital Sky: ~4TB, ~8 million files
  • Digital Embryo: ~700GB, millions of files
  • Digital library collections (ADL, UCB, Michigan): ~1 million files
  • HyperLTER – hyperspectral data
  • Particle Physics Data Grid

• Upcoming collections
  • SLAC...

Page 8: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Mediation of Information using XML (MIX)

[Figure: MIX architecture. Data sources, including an XML data source, sit behind wrappers that export XML view(s) to the MIX mediator (MIXm). The mediated view is defined in the XML Matching And Structuring (XMAS) query language, and XMAS queries are evaluated lazily using DOM-VXD. The Blended Browsing and Querying (BBQ) interface sits on top of the mediator.]

Page 9: NPACI AHM 2001 Tutorial on Data Mining for Scientific

From data management infrastructure to knowledge discovery infrastructure

• The Affymetrix story
  • “Technology built for Wall Street helps bioinformatics companies as well…”

• The “scientist in the middle”
  • The infrastructure is a tool to help the scientist, not a replacement!

[Figure labels: infrastructure, KDD]

Page 10: NPACI AHM 2001 Tutorial on Data Mining for Scientific

The Infrastructure Supports:

• Exploratory data analysis of large data sets
  • Efficient ad hoc statistical processing

• Parallel data access, subsetting, and analysis

• Data intensive approach to model building and verification
  • Including fusion of different forms of data (e.g. database tables, instrument outputs, remote sensing data, maps, …)

– Employ, and build upon, existing (commercial, freeware) tools and software packages, as much as possible

Page 11: NPACI AHM 2001 Tutorial on Data Mining for Scientific

The SDSC Knowledge and Information Discovery Lab (SKIDL)

• Initial hardware platform
  • 2-processor Sun, 512MB memory, 36 GB local disk

• Upgrade to:
  • 20-processor Sun, 6 GB memory, 400 GB local disk
  • Access to additional disk storage via storage area network (SAN)

• Possible further upgrade (via CalIT2)
  • Additional 4 GB memory, 1 TB SAN disk, Gigabit Ethernet capability

• Software
  • High-performance, parallel database systems and file systems
    • DB2
    • Oracle, GPFS
  • Suite of data mining tools
    • Intelligent Miner, MineSet, Bayesian network tools
    • S-Plus, Darwin, Clementine, SAS
  • Presentation, visualization: ESRI ArcIMS, ...

Page 12: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Data Mining

Tony Fountain

NPACI ESS

SDSC Knowledge & Information Discovery Lab

Page 14: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Overview (DM101)

• Part 1:
  • Definition
  • Motivations
  • Methods, Techniques, & Tools

• Database 605 – Chaitan Baru

• Part 2:
  • Examples & Demos
  • Data Mining to Decision Support

Page 15: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Outline (DM101)

Part 1 – What is data mining?
1. Direct
2. Contributions from other disciplines
3. Motivations & context
4. Example applications
5. Analytical methods:
  • Association Rules
  • Classification & Prediction
  • Clustering
  • OLAP
6. MSU data set

Page 21: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Definition

The search for interesting patterns,

in large databases,

that were collected for other applications,

using machine learning algorithms,

and high-performance computers,

for fun and profit!

Page 22: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Definition

The search for interesting patterns,

in large databases,

that were collected for other applications,

using machine learning algorithms,

and high-performance computers,

for science and society!

Page 23: NPACI AHM 2001 Tutorial on Data Mining for Scientific

KDD Process: Knowledge Discovery and Data Mining

Collection

Processing/Cleansing/Correction/Formatting

Mining/Analysis/Modeling

Presentation/Visualization

Application/Decision Support

Management/Integration/Warehousing

Page 24: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Data Mining & Knowledge Discovery KD, KDD, KDD(D)*

What’s in a name?
• Database
• Data Mining
• Discovery
• Derivation
• Decision Support

Page 26: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Contributions

Data Mining draws on:
• Artificial Intelligence
• High Performance Computing
• Statistics
• Database Systems
• Operations Research
• GIS
• Visualization

Page 27: NPACI AHM 2001 Tutorial on Data Mining for Scientific

The Case for Data Mining: Data Reality

• Controlled experimental data collection is an ideal
• Legacy archives and independent collection activities
• Deluge from new sources
  • Remote Sensing
  • Instrumentation & Wireless Communications
  • Simulation Models
• Growth of data collections vs. analysts
• Many types of data, many uses, many types of queries
• Advances in computational infrastructure provide new opportunities for access and integration
• Paradigm shift: hypothesis-driven data collection to data mining (KDD)

Page 28: NPACI AHM 2001 Tutorial on Data Mining for Scientific

The Revolution in Ecology

• Computational Ecology and Eco-Informatics
• Instrumentation & Remote Sensing
  • Amphibian urls and hyperspectral data
  • Tropical glaciers in Ohio
• Computer Simulations
  • Coupled biogeochemistry, ocean, atmosphere…

• Ecology without boots!

Page 29: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Classic Applications - Commercial

• Fraud Detection – credit card
• Churning – long-distance carriers
• Targeted Marketing – customer profiles
• Stock Market – futures trading
• Market Basket Analysis

• Soon to be classic: FL 2000 election

Page 31: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Classic Applications - Science

• Volcanoes on Venus - Classification

• Burl, et al., NASA, Cal Tech.

• Astronomical clustering – Autoclass, Bayesian Clustering

• Cheeseman, Stutz, NASA

• Oil spills from remote sensing data – Decision Trees

• Kubat, et al., Ottawa

• Biodiversity analysis – Genetic algorithms, Bayesian Nets

• Stockwell, SDSC/UCSD

• YOUR NAME HERE!! (1800-SKIDLME)

Page 32: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Data Mining Tools (suites)

• SPSS - Clementine

• http://www.spss.com/clementine/

• Oracle - Darwin

• http://www.oracle.com/ip/analyze/warehouse/datamining/

• SGI - MineSet

• http://www.sgi.com/software/mineset/

• IBM - Intelligent Miner

• http://www-4.ibm.com/software/data/iminer/fordata/

• http://www.kdnuggets.com/software/index.html

Page 33: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Data Mining Analytical Techniques (patterns, hypotheses, models)

• Statistical Methods
  • Descriptive, Modeling, Data Reduction…

• Associations
  • Simple relations in categorical data

• Classification & Prediction
  • Model induction - Supervised learning

• Clustering
  • Concept discovery - Unsupervised learning

Page 34: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Association Rule Mining

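• Associations
  • Simple rules in categorical data

• Sample applications
  • Market Basket Analysis
      Buys(Milk) => Buys(Eggs)
  • Transaction Processing
      Income(Hi) & Single(Y) => Owns(Computer)

• Search for Strong Rules
  • Support R(A => B) = P(A & B), the fraction of records that contain both A and B (sometimes written P(A U B), where U is the union of the two item sets, not a logical “or”)
  • Confidence R(A => B) = P(B | A) = P(A & B) / P(A)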

Page 35: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Association Rule Mining

70 Bird Antelope Lion

80 Hyena Bird Snake

70 Tiger Snake Antelope

70 Bird Lion Hyena

70 Snake Lion Bird

R1: [70 => (Bird & Lion)]

Support: P(70 & Bird & Lion) = 3/5 = 60%

Confidence: P((Bird & Lion) | 70) = P(Bird & Lion & 70) / P(70) = (3/5) / (4/5) = 75%
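
The support and confidence of R1 can be recomputed from the five records. This is a minimal sketch, not the output of any tool named in this tutorial: it takes support to be the fraction of records containing both sides of the rule, and the helper names `support` and `confidence` are illustrative assumptions.

```python
# The five records from the slide: one numeric value plus three animals each.
RECORDS = [
    {"70", "Bird", "Antelope", "Lion"},
    {"80", "Hyena", "Bird", "Snake"},
    {"70", "Tiger", "Snake", "Antelope"},
    {"70", "Bird", "Lion", "Hyena"},
    {"70", "Snake", "Lion", "Bird"},
]


def support(items, records=RECORDS):
    """Fraction of records that contain every item in `items`."""
    return sum(1 for record in records if items <= record) / len(records)


def confidence(lhs, rhs, records=RECORDS):
    """P(rhs | lhs) = support(lhs and rhs) / support(lhs)."""
    return support(lhs | rhs, records) / support(lhs, records)


lhs, rhs = {"70"}, {"Bird", "Lion"}
rule_support = support(lhs | rhs)
rule_confidence = confidence(lhs, rhs)
print(f"support={rule_support:.0%} confidence={rule_confidence:.0%}")
```

A rule is usually called strong when both values exceed thresholds chosen by the analyst; the thresholds themselves are not given in the slides.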

Page 36: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Classification

• Classification and prediction
  • Create model for distinguishing concepts
  • Labeled training data
  • Metrics based on accuracy rates and cross-validation

• Numerous methods
  • Decision trees
  • Neural Nets
  • Bayesian Networks
  • Regression

• Many applications
  • Identifying credit risks
  • Predicting biological productivity
  • Medical diagnosis
  • Classifying toxic risks…

Page 37: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Classification – Decision Tree

Ecosystem   Precipitation
Desert      2
Forest      120
Forest      104
Desert      5
Forest      116
Prairie     63

Page 38: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Classification – Decision Tree

[Figure: the six training records split on Precipitation < 60. Yes branch: Desert 2, Desert 5. No branch: Forest 120, Forest 104, Forest 116, Prairie 63.]

Page 39: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Classification – Decision Tree

[Figure: two-level tree over the six training records. The first split, Precip < 60, separates Desert 2 and Desert 5 from the rest. The second split, Precip < 100, separates Prairie 63 from Forest 120, Forest 104, Forest 116.]

Page 40: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Classification – Decision Tree

Ecosystem   Precipitation
Desert      2
Forest      120
Forest      104
Desert      5
Forest      116
Prairie     63

IF (Precip < 60) then Desert
Else If (Precip < 100) then Prairie
Else Forest
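
The rule set read off the tree can be checked against the six training records. This is only a sketch of the finished tree as written on the slide; it does not show how a tree-induction algorithm would choose the thresholds 60 and 100, and the function name `classify` is an assumption.

```python
# Training records from the slide: (ecosystem, precipitation).
TRAINING = [
    ("Desert", 2), ("Forest", 120), ("Forest", 104),
    ("Desert", 5), ("Forest", 116), ("Prairie", 63),
]


def classify(precip):
    """Apply the two-split decision tree from the slide."""
    if precip < 60:
        return "Desert"
    if precip < 100:
        return "Prairie"
    return "Forest"


correct = sum(1 for label, precip in TRAINING if classify(precip) == label)
accuracy = correct / len(TRAINING)
print(f"training accuracy: {accuracy:.0%}")
```

Accuracy on the training records says nothing about unseen data, which is why the Classification slide lists cross-validation among the metrics.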

Page 41: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Pruned Decision Tree

[Figure: tree pruned to the single split Precipitation < 60. Yes branch: Desert 2, Desert 5. No branch: Forest 120, Forest 104, Forest 116, Prairie 63.]

Page 42: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Pruned Decision Tree

[Figure: tree pruned to the single split Precipitation < 60. Yes branch: Desert 2, Desert 5. No branch: Forest 120, Forest 104, Forest 116, Prairie 63.]

IF (Precip < 60) then Desert
Else [P(Forest) = .75] & [P(Prairie) = .25]
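
The probabilities attached to the pruned leaf are the class frequencies of the training records that reach it. A minimal sketch, with `leaf_distribution` as an assumed helper name:

```python
from collections import Counter

# Training records from the slide: (ecosystem, precipitation).
TRAINING = [
    ("Desert", 2), ("Forest", 120), ("Forest", 104),
    ("Desert", 5), ("Forest", 116), ("Prairie", 63),
]


def leaf_distribution(records, reaches_leaf):
    """Class frequencies among the records whose precipitation reaches the leaf."""
    counts = Counter(label for label, precip in records if reaches_leaf(precip))
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


dry_leaf = leaf_distribution(TRAINING, lambda precip: precip < 60)
wet_leaf = leaf_distribution(TRAINING, lambda precip: precip >= 60)
print(dry_leaf, wet_leaf)
```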

Page 43: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Clustering

• Cluster Analysis – Concept Discovery
  • Create models for discovered concepts
  • No known class labels
  • Metrics based on cluster similarity

• Numerous methods
  • K-means (partitioning)
  • Bayesian Networks
  • Hierarchical clustering
  • Neural Networks

• Example applications
  • Identifying common subpopulations
  • Creating taxonomies (biological, manufacturing, commerce)
  • Discovering failure patterns in manufactured parts
  • Locating environmental risk areas…

Page 44: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Clustering – K-Means

Precipitation Temperature

8 81

71 70

62 63

49 45

17 76

32 49
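
K-means on these six records can be sketched in a few lines of plain Python. The choice of k = 3 and of the first, second, and fourth records as starting centroids is an assumption made for illustration; the slides do not say how the demo was initialised, and a different start can give a different result.

```python
# (precipitation, temperature) pairs from the table above.
POINTS = [(8, 81), (71, 70), (62, 63), (49, 45), (17, 76), (32, 49)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and centroid update until labels stop changing."""
    centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster empties out
                centroids[k] = mean(members)
    return labels, centroids


labels, centroids = kmeans(POINTS, [POINTS[0], POINTS[1], POINTS[3]])
print(labels, centroids)
```

With this start the records group into a hot, dry pair, a warm, wet pair, and a cool, moderately wet pair.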

Page 45: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Clustering – K-Means

[Figure: scatter plot of the six records, Temperature (axis 30 to 90) against Precipitation (axis 0 to 80).]

Page 51: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Clustering – K-Means

[Figure: scatter plot of Temperature (axis 30 to 90) against Precipitation (axis 0 to 80), with three clusters marked.]

Cluster   Temperature   Precipitation
C1        70 - 85       0 - 25
C2        35 - 60       25 - 55
C3        50 - 80       50 - 80

Page 53: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Clustering – K-Means

[Figure: scatter plot of Temperature (axis 30 to 90) against Precipitation (axis 0 to 80), with three clusters marked.]

Cluster   Temperature   Precipitation   Ecosystem
C1        70 - 85       0 - 25          Desert
C2        35 - 60       25 - 55         Prairie
C3        50 - 80       50 - 80         Forest

Page 54: NPACI AHM 2001 Tutorial on Data Mining for Scientific

On-line Analytical Processing (OLAP)

• On-line Transaction Processing (OLTP) vs. OLAP
  • Analysis & decision support are more compute intensive

• Concept hierarchies - (representing forests & trees)
  • Space: site, county, state, country…
  • Time: day, week, month….
  • Taxonomic hierarchies …

• Methods: rules, explicit specification, clustering
• Multidimensional data & efficient access/selection
• Operations: slice, dice, roll up, drill down, pivot

Page 55: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Concept Hierarchy for Precipitation

0-12 inches
  0-3 (low)
    0-1
    2-3
  4-8 (med)
    4-6
    7-8
  9-12 (high)
    9-10
    11-12
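
A concept hierarchy like this one can be held as nested ranges and used to generalise a raw value to any level. The sketch below assumes whole-inch values and assumes the labels attach as low = 0-3, med = 4-8, high = 9-12; the names `LEAVES`, `BANDS`, and `generalize` are illustrative.

```python
# Leaf and mid-level ranges from the hierarchy, in whole inches (inclusive).
LEAVES = [(0, 1), (2, 3), (4, 6), (7, 8), (9, 10), (11, 12)]
BANDS = [((0, 3), "low"), ((4, 8), "med"), ((9, 12), "high")]  # labels assumed


def generalize(inches):
    """Return the path [leaf, band, root] for a whole-inch precipitation value."""
    if not 0 <= inches <= 12:
        raise ValueError("precipitation outside the 0-12 inch hierarchy")
    leaf = next(f"{lo}-{hi}" for lo, hi in LEAVES if lo <= inches <= hi)
    band = next(
        f"{lo}-{hi} ({name})" for (lo, hi), name in BANDS if lo <= inches <= hi
    )
    return [leaf, band, "0-12 inches"]


path_for_5 = generalize(5)
path_for_11 = generalize(11)
print(path_for_5, path_for_11)
```

Rolling up replaces a value by an entry further along its path; drilling down moves back toward the leaf.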

Page 56: NPACI AHM 2001 Tutorial on Data Mining for Scientific

OLAP Examples

• Slice
  • For (precip = “4-8 inches”)

• Dice
  • For (precip = “4-8 inches” AND week = “120”)

• Drill down (specialization)
  • On time from months to weeks

• Roll up (generalization, summarization)
  • On space from counties to states
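
The four operations can be imitated on a handful of rows. Everything in this sketch is hypothetical: the fact rows, the county names, and the helpers `slice_`, `dice`, and `aggregate` are invented for illustration and do not come from the MSU data or from any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: (state, county, month, week, precip_band, rainfall).
FACTS = [
    ("MI", "Ingham", 1, 1, "4-8 inches", 5.0),
    ("MI", "Ingham", 1, 2, "0-3 inches", 2.0),
    ("MI", "Eaton", 1, 2, "4-8 inches", 6.0),
    ("OH", "Franklin", 1, 1, "4-8 inches", 4.5),
    ("OH", "Franklin", 1, 2, "9-12 inches", 10.0),
]


def slice_(rows, band):
    """Slice: fix one dimension (precipitation band)."""
    return [row for row in rows if row[4] == band]


def dice(rows, band, week):
    """Dice: fix two dimensions (precipitation band and week)."""
    return [row for row in rows if row[4] == band and row[3] == week]


def aggregate(rows, key):
    """Sum rainfall per group; `key` picks the grouping level."""
    totals = defaultdict(float)
    for row in rows:
        totals[key(row)] += row[5]
    return dict(totals)


sliced = slice_(FACTS, "4-8 inches")
diced = dice(FACTS, "4-8 inches", 1)
by_county = aggregate(FACTS, lambda r: (r[0], r[1]))
by_state = aggregate(FACTS, lambda r: r[0])          # roll up: counties to states
by_month = aggregate(FACTS, lambda r: r[2])
by_week = aggregate(FACTS, lambda r: (r[2], r[3]))   # drill down: months to weeks
print(by_state, by_week)
```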

Page 57: NPACI AHM 2001 Tutorial on Data Mining for Scientific

MSU Data Set

• Agricultural productivity simulation
  • Integrates land use, climate, ecosystem data
  • Remote sensing, computer simulations, field observations

• Inputs – geographic & climatic parameters
  • Max and min temperatures
  • Solar radiation
  • Precipitation ….

• Outputs – ecosystem
  • Leaf area index
  • Crop yield
  • Soil Water …

Page 58: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Statistics of MSU Simulation Data

• 20 years, daily records
• 1053 regions
• 5 million rows
• Approx 300MB

• Stuart Gage, MSU Computational Ecology and Visualization Lab

• http://www.cevl.msu.edu/index.html

Page 59: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Example: DBMS support for OLAP

• SQL support for rollup

    SELECT region, week, day_of_week, sum(solrad)
    FROM msu.details_table
    GROUP BY ROLLUP (region, week, day_of_week)
    ORDER BY region, week, day_of_week

• Output is summation of solrad by
  • (region, week, day_of_week)
  • (region, week, –)
  • (region, –, –)
  • (–, –, –)
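
What ROLLUP produces can be imitated in plain Python by summing each row into every prefix of its grouping key. The rows below are hypothetical, and `rollup` is an illustrative helper rather than a DBMS function; `None` stands in for the “–” placeholder in the list above.

```python
from collections import defaultdict

# Hypothetical rows: (region, week, day_of_week, solrad).
ROWS = [
    (1, 1, 1, 10),
    (1, 1, 2, 20),
    (1, 2, 1, 30),
    (2, 1, 1, 40),
]


def rollup(rows, n_keys):
    """Sum the value column for every prefix of the first n_keys columns."""
    totals = defaultdict(int)
    for row in rows:
        keys, value = row[:n_keys], row[n_keys]
        for depth in range(n_keys, -1, -1):
            group = keys[:depth] + (None,) * (n_keys - depth)
            totals[group] += value
    return dict(totals)


result = rollup(ROWS, 3)
for group in sorted(result, key=lambda g: tuple((v is None, v or 0) for v in g)):
    print(group, result[group])
```

The grand total row `(None, None, None)` corresponds to the last grouping level in the slide.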

Page 60: NPACI AHM 2001 Tutorial on Data Mining for Scientific

DBMS Support for Large Data Analysis

• Large database support

• Parallel processing

• OLAP functions

• New data types, object extensions, spatial data, XML…

• Distributed databases

Page 61: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Dealing with large databases

• In the beginning…
  • Database size ≤ max logical filesystem size (2GB) (UNIX)

• Tablespaces
  • A tablespace can have multiple tablespace containers
  • Size of tablespace container ≤ max filesystem size

[Figure: three layouts. A database with tables T1, T2, T3 ... in a single /filesystem; a database with T1, T2, ... in a single /fs; and a database with T1 and T3 in Tablespace1 and T2 in Tablespace2, spread over four /fs containers.]

Page 62: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Tablespaces...

• Different types of tablespace containers
  • DBMS managed (“raw”)
  • File system managed (“cooked”)

• Different types of tablespaces
  • Regular data and indexes (typical max size of 64GB)
  • Large objects (LOB’s) and temporary data (typical max is 2TB)

• Larger page sizes for containers (4K to 32K)
  • Max. TS size for regular data increases to 512GB

• What if a given table is > 512GB?

Page 63: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Loading large databases

• The relevant industry benchmark is TPC-H (www.tpc.org)

• Evolved from TPC-D Benchmark
  • First audited benchmark was performed in December 1995
  • 100GB database, 32-node IBM SP

• Current largest benchmark runs are for 1TB database

• Largest table in benchmark has
  • ~ 70% of data (700GB)
  • 6 billion rows

• Measures single user performance (“power metric”) and multi-user performance (“throughput metric”)

Page 64: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Large Database Benchmarks

• Results from IBM, April 2000
  • Loading 1TB database takes about 7.5 hours
  • Total disk = 9.7 times the raw size of database
  • Hardware configuration
    • 32 4-way IBM SP nodes, 4GB/node (128GB), 35x9GB disks/node
  • Total 5-year cost of system: $9.3M
  • Power: 12,812; QphH: 12,867; Price/perf: $725

• Results from HP, Feb. 13th, 2001
  • Loading database takes 5.25 hours
  • Total disk = 10.2 times raw size of database
  • Hardware configuration
    • 64-processor Superdome, 96GB memory, 3 disk arrays with 558 18.2GB drives
  • Total 5-year cost of system: $9.6M
  • Power: 13,730; QphH: 9,755; Price/perf: $985

Page 65: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Large Database Benchmarks

• IBM (cluster of SMP) vs HP (SMP) – based solely on analysis of published TPC-H numbers:
  • HP is 7.2% better in power (12,812 vs 13,730)
  • IBM is 24.2% better in throughput (12,867 vs 9,755)
  • IBM is 3.2% better in price ($9.3M vs $9.6M)
  • IBM is 36% better in price/performance ($725 vs $985)

• TPC-C Benchmark example – IBM
  • 32x4 processors, 4GB/node (128GB), 218 18GB disks/node
  • Total managed storage of ~125TB
  • 440,879 tpm-c
  • Total cost: $14.2M

• See www.tpc.org for all results
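
The IBM vs HP percentages can be rechecked from the published figures. The slide does not state its baseline; the sketch below assumes each percentage is the absolute difference divided by the IBM figure, which is the reading that reproduces all four numbers. `pct_diff` is an illustrative helper name.

```python
def pct_diff(ibm, hp):
    """Absolute IBM/HP difference as a percentage of the IBM figure (assumed baseline)."""
    return abs(hp - ibm) / ibm * 100


# (IBM, HP) pairs as quoted on the slide.
COMPARISONS = {
    "power": (12_812, 13_730),
    "throughput": (12_867, 9_755),
    "price ($M)": (9.3, 9.6),
    "price/performance ($)": (725, 985),
}

results = {
    name: round(pct_diff(ibm, hp), 1) for name, (ibm, hp) in COMPARISONS.items()
}
for name, pct in results.items():
    print(f"{name}: {pct}%")
```

Taking the HP figure as the baseline instead would give different percentages, so the comparison is sensitive to which system is treated as the reference.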

Page 66: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Large Database Benchmarks

• High-end database sizes
  • “several customers with 100TB of managed disk” – IBM
  • “customer has requested 1PB (that’s petabyte) of on-line storage for bioinformatics application over next 5 years” – Sun
  • “TB’s are passé, think PB’s” – IBM Life Sciences rep
  • Legacy formats are files, but newer data will be in DBMS

• Dealing with very large data sizes
  • Interfacing to archival storage
  • Parallelism

Page 67: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Linking DBMS to archival storage: The DB2/HPSS Project

• Joint project with IBM TJ Watson Research Center
• DB2 provides link to Tivoli ADSM
• Oracle also supports interface to archival storage

    Create Tablespace HPSS-TSPACE
    Managed By Database
    Using FILE (HPSS <hpss-filename> <size> DISKBUF <path> <size>);

[Figure labels: DB2, database table, HPSS_TSPACE, containers C1 C2 C3 C4 C5, DB2 disk buffer, HPSS disk cache, HPSS]

Page 68: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Parallelism in Database Systems

• Example database

INPUT table:
    region    int       -- spatial region, county
    year      smallint  -- year (1972 - 1990)
    day       smallint  -- day of year (1-366)
    solrad    int       -- solar radiation
    tmax      float     -- max day temp. (-33, 44)
    tmin      float     -- min day temp. (-45, 29.5)
    pp        float     -- precipitation (mm)
    dd        float     -- degree days (heat)

OUTPUT table:
    region    int
    year      smallint
    day       smallint
    x_albers  int       -- x-coordinate
    y_albers  int       -- y-coordinate
    tdd10     float     -- total degree days
    add       float     -- total anthesis degree days
    tlai      float     -- total leaf area index
    seed      float     -- total seed biomass (gr/m2)
    yield     float     -- final yield (tons/ha)
    twater    float     -- total soil water evaporation + total transpiration
    ttsw      float     -- maximum water available

Page 69: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Generating query graphs

• Convert SQL queries to query execution plans consisting of low-level query operators

• Q1: Select all regions where max temp is greater than 40 degrees, over the entire period of the study:

    SELECT distinct(region) FROM Input WHERE tmax>40

• Q2: Select solar radiation and total leaf area index values for all days and regions in the year 1978:

    SELECT solrad, tlai FROM Input A, Output B
    WHERE A.region=B.region AND A.year=B.year AND A.day=B.day AND A.year=1978

[Figure: query graphs. Q1: Read INPUT table, then Apply tmax>40, then Remove duplicates, format output. Q2: Read INPUT table and Read OUTPUT table feed Join (region, day, year), then Format output.]
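
Both queries can be tried on a toy database. The sketch below uses Python's built-in `sqlite3` module rather than DB2 or Oracle, keeps only the columns the two queries touch, and loads a few made-up rows; the explicit `year = 1978` predicate follows the textual description of Q2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Input  (region INT, year INT, day INT, solrad INT, tmax REAL);
    CREATE TABLE Output (region INT, year INT, day INT, tlai REAL);
""")
# Made-up rows for illustration only.
conn.executemany("INSERT INTO Input VALUES (?, ?, ?, ?, ?)", [
    (1, 1978, 1, 100, 41.0),
    (1, 1978, 2, 110, 35.0),
    (2, 1978, 1, 120, 42.5),
    (2, 1979, 1, 130, 30.0),
    (1, 1979, 1, 90, 43.0),
])
conn.executemany("INSERT INTO Output VALUES (?, ?, ?, ?)", [
    (1, 1978, 1, 0.5),
    (1, 1978, 2, 0.6),
    (2, 1978, 1, 0.7),
    (2, 1979, 1, 0.8),
    (1, 1979, 1, 0.4),
])

# Q1: scan, filter on tmax, remove duplicates.
q1 = [row[0] for row in conn.execute(
    "SELECT DISTINCT region FROM Input WHERE tmax > 40 ORDER BY region")]

# Q2: join INPUT and OUTPUT on (region, year, day), restricted to 1978.
q2 = conn.execute("""
    SELECT A.solrad, B.tlai
    FROM Input A JOIN Output B
      ON A.region = B.region AND A.year = B.year AND A.day = B.day
    WHERE A.year = 1978
    ORDER BY A.region, A.day
""").fetchall()
print(q1, q2)
```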

Page 70: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Levels of Query Parallelism

• Inter-query
  • Execute multiple queries (Q1 and Q2) at the same time

• Inter-operator (intra-query)
  • Concurrently execute multiple operators in the query
  • Pipeline through the operators, e.g. read and join

[Figure: query graph in which Read INPUT table and Read OUTPUT table feed Join (region, day, year), followed by Format output.]
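
Pipelining between operators can be illustrated with Python generators: each operator pulls one row at a time from the operator below it, so no intermediate table is materialised. This shows only the data flow, not true concurrent execution, and the operator names and sample rows are assumptions.

```python
def read_table(rows):
    """Scan operator: emit rows one at a time."""
    for row in rows:
        yield row


def select_hot(rows, threshold=40):
    """Filter operator: pass on the region of rows with tmax above threshold."""
    for region, tmax in rows:
        if tmax > threshold:
            yield region


def distinct(values):
    """Duplicate-removal operator."""
    seen = set()
    for value in values:
        if value not in seen:
            seen.add(value)
            yield value


# Hypothetical (region, tmax) rows.
INPUT = [(1, 41.0), (1, 35.0), (2, 42.5), (2, 30.0), (1, 43.0)]

hot_regions = list(distinct(select_hot(read_table(INPUT))))
print(hot_regions)
```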

Page 71: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Levels of Parallelism...

• Intra-operator
  • Data parallelism
  • Employ multiple processes for each operator

[Figure labels: INPUT table, OUTPUT table, Read table (twice), Join, Format output]
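
Intra-operator parallelism splits the input of one operator across several workers and merges the partial results. The sketch below applies a tmax filter to two partitions using a thread pool from the standard library; the rows and names are hypothetical, and threads are used only to show the structure (in CPython they do not speed up CPU-bound work the way the separate processes of a parallel DBMS would).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical (region, tmax) rows.
INPUT = [(1, 41.0), (1, 35.0), (2, 42.5), (2, 30.0), (3, 39.0), (1, 43.0)]


def scan_partition(partition, threshold=40):
    """One worker's share of the filter: regions with tmax above threshold."""
    return {region for region, tmax in partition if tmax > threshold}


def parallel_hot_regions(rows, n_workers=2):
    """Partition the rows round-robin, filter each partition, merge the results."""
    partitions = [rows[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(scan_partition, partitions))
    return sorted(set().union(*partial_results))


parallel_result = parallel_hot_regions(INPUT)
serial_result = sorted(scan_partition(INPUT))
print(parallel_result)
```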

Page 72: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Parallel Architecture models and DBMS

• Shared-everything
  • Memory, process space, disk subsystem are all common

• Shared disk
  • Separate memory/process space
  • Disk subsystem/filesystem is common

• Shared nothing
  • Separate memory, disks, OS…
  • Only communication “bus” is shared

Page 73: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Shared Everything

[Figure labels: Disk, Processor, Memory]

• SMP: Symmetric Multi-Processors
• Provide well-balanced systems
• Shared workload, resilient to “unexpected” workload
• Dynamic allocation of processes to query operators (inter- as well as intra-query)
• Expensive and don’t scale to large configurations

Page 74: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Shared Disk

(Diagram: Disk, Processor, Memory)

• Some classic architectures map to this model, e.g. VAXcluster and IBM mainframes (could make a comeback with SANs)
• Can share I/O workload, dynamic partitioning of data
• Only need to scale the I/O subsystem, not memory

Page 75: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Shared Nothing

(Diagram: Disk, Processor, Memory)

• Highly scalable
• Static partitioning of data
• Cannot share workload
• Cluster of SMPs provides advantages of both shared-nothing and SMPs
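Static partitioning and its workload consequence can be illustrated with a small sketch. The placement rule (`region % n_nodes`), the node count, and the region ids are assumptions chosen for illustration, not the scheme used by any particular DBMS.

```python
from collections import Counter

N_NODES = 4

def node_for(region, n_nodes=N_NODES):
    """Static placement rule: a row's node is fixed by its partitioning key."""
    return region % n_nodes

# Hypothetical table: 8 regions x 250 days, identified by (region, day).
rows = [(region, day) for region in range(8) for day in range(1, 251)]

# Rows stored per node: evenly spread across the four nodes.
placement = Counter(node_for(region) for region, _ in rows)

# A query touching only region 5 is served entirely by one node;
# the other nodes cannot take over part of that work.
hot_query = Counter(node_for(region) for region, _ in rows if region == 5)
```

Storage is balanced, but a query restricted to one region keeps a single node busy, which is the sense in which a shared-nothing system cannot share workload.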

Page 76: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Combining Nodegroups and Tablespaces

(Diagram: SN System with a single Nodegroup containing Tablespace1 (INPUT) and Tablespace2 (OUTPUT))
(Diagram: SN System with Nodegroup1 and Nodegroup2, containing Tablespace1 (INPUT) and Tablespace2 (OUTPUT))
(Diagram: Q2 plan: Read INPUT table, Read OUTPUT table -> Join (region, day, year) -> Format output)

Page 77: NPACI AHM 2001 Tutorial on Data Mining for Scientific

The DBMS/Application bottleneck

• Serial communication between DBMS and application

(Diagram label: Application)

Page 78: NPACI AHM 2001 Tutorial on Data Mining for Scientific

The DBMS/Application bottleneck

• Parallel communication between DBMS and application

(Diagram labels: multiple App boxes)

Page 79: NPACI AHM 2001 Tutorial on Data Mining for Scientific

DBMS / DM software connection

(Diagram: Database Platform and Data Mining Platform, with the steps: Extract data subsets; Generate results; Presentation (e.g. GIS, 3D); Store session results)

Page 80: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Performance Tuning

• Sample set of Database Manager configuration parameters:
  • CPU speed (millisec/instruction) (CPUSPEED) = 9.700848e-07
  • Comm. bandwidth (MB/sec) (COMM_BANDWIDTH) = 1.000000e+00
  • Max number of existing agents (MAXAGENTS) = 400
  • Initial number of agents in pool (NUM_INITAGENTS) = 0
  • Max number of coord. agents (MAX_COORDAGENTS)
  • Max no. of concurrent coord. agents (MAXCAGENTS)
  • Maximum query degree of parallelism (MAX_QUERYDEGREE) = ANY
  • Enable intra-partition parallelism (INTRA_PARALLEL) = NO

Page 81: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Database Tuning

• Sample set of Database configuration parameters:
  • Default query optimization class (DFT_QUERYOPT) = 9
  • Degree of parallelism (DFT_DEGREE) = 1
  • Database heap (4KB) (DBHEAP) = 1200
  • Catalog cache size (4KB) (CATALOGCACHE_SZ) = 64
  • Log buffer size (4KB) (LOGBUFSZ) = 8
  • Utilities heap size (4KB) (UTIL_HEAP_SZ) = 5000
  • Buffer pool size (pages) (BUFFPAGE) = 128000
  • Max storage for lock list (4KB) (LOCKLIST) = 100
  • Number of asynch page cleaners (NUM_IOCLEANERS) = 1
  • Number of I/O servers (NUM_IOSERVERS) = 3
  • Sequential detect flag (SEQDETECT) = YES
  • Default prefetch size (pages) (DFT_PREFETCH_SZ) = 32

Page 82: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Examples of data exploration

• Testing temporal relationship (sensitivity analysis)
  • Can conditions from day N-1 be used to predict output of day N?
  • How far back can we go?
• Input table:

  Region  Year  Day  Inputs  Outputs
  1       78    1    I1      O1
  1       78    2    I2      O2
  1       78    3    I3      O3
  1       78    4    I4      O4
  2       78    1    I1      O1
  2       78    2    I2      O2

• Generate output:
  • Region, Year, Day, Input(i), Output(i), Output(i-1)

Page 83: NPACI AHM 2001 Tutorial on Data Mining for Scientific

“Flattening” the table

• E.g. SQL query:
  SELECT A.region, A.year, A.day, A.solrad, A.tlai, B.day, B.tlai
  FROM msu.combined A, msu.combined B
  WHERE A.region=B.region AND A.year=B.year AND A.day=B.day-1

• Query Explain facility
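The self-join can be tried on a toy table with Python's built-in `sqlite3` module. This is a sketch under assumptions: the table is named `combined` without the `msu` schema, the five rows are invented, and SQLite stands in for the DBMS used in the tutorial.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE combined (region INTEGER, year INTEGER, day INTEGER, solrad REAL, tlai REAL)"
)
# Hypothetical rows: region 1 has days 1-3, region 2 has days 1-2.
conn.executemany(
    "INSERT INTO combined VALUES (?, ?, ?, ?, ?)",
    [
        (1, 78, 1, 210.0, 0.5),
        (1, 78, 2, 250.0, 0.6),
        (1, 78, 3, 230.0, 0.7),
        (2, 78, 1, 190.0, 0.4),
        (2, 78, 2, 200.0, 0.45),
    ],
)

# Pair each day (A) with the following day (B) in the same region and year.
flat = conn.execute(
    """
    SELECT A.region, A.year, A.day, A.solrad, A.tlai, B.day, B.tlai
    FROM combined A, combined B
    WHERE A.region = B.region AND A.year = B.year AND A.day = B.day - 1
    ORDER BY A.region, A.year, A.day
    """
).fetchall()
```

Each result row carries one day's inputs and outputs next to the adjacent day's output, so a mining tool can treat the pair as a single flat record. The last day of each region has no partner and drops out of the result.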

Page 84: NPACI AHM 2001 Tutorial on Data Mining for Scientific

“Flattening” the table

Access Table Name = MSU.COMBINED
|  #Columns = 5
|  Relation Scan
|  |  Prefetch: Eligible
|  Insert Into Sorted Temp Table ID = t1
|  |  #Columns = 4
|  |  #Sort Key Columns = 1
|  |  |  Key 1: YEAR (Ascending)
Access Temp Table ID = t1
|  Relation Scan
|  |  Prefetch: Eligible
Merge Join
|  Access Table Name = MSU.COMBINED
|  |  #Columns = 4
|  |  Relation Scan
|  |  |  Prefetch: Eligible
|  |  Insert Into Sorted Temp Table ID = t2
|  |  |  #Columns = 4
|  |  |  #Sort Key Columns = 1
|  |  |  |  Key 1: YEAR (Ascending)
|  Access Temp Table ID = t2
|  |  Relation Scan
|  |  |  Prefetch: Eligible
|  Residual Predicate(s)
|  |  #Predicates = 2
Return Data to Application
|  #Columns = 7

Page 85: NPACI AHM 2001 Tutorial on Data Mining for Scientific

“Flattening” the table, with indexing

Access Table Name = MSU.COMBINED
|  #Columns = 5
|  Relation Scan
|  |  Prefetch: Eligible
|  Insert Into Sorted Temp Table ID = t1
|  |  #Columns = 4
|  |  #Sort Key Columns = 1
|  |  |  Key 1: REGION (Ascending)
Access Temp Table ID = t1
|  Relation Scan
|  |  Prefetch: Eligible
Nested Loop Join
|  Access Table Name = MSU.COMBINED
|  |  #Columns = 4
|  |  Index Scan: Name = MSU.C_RYD
|  |  |  Index Columns:
|  |  |  |  1: REGION (Ascending)
|  |  |  |  2: YEAR (Ascending)
|  |  |  |  3: DAY (Ascending)
|  |  |  Data Prefetch: Eligible 157
|  |  |  Index Prefetch: Eligible 157
|  |  Return Data to Application
|  |  |  #Columns = 7
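The effect of adding an index on (region, year, day) can be observed in a similar way with SQLite's `EXPLAIN QUERY PLAN`, used here as a stand-in for the Explain facility shown above. The table name `combined`, the index name `c_ryd`, and the generated rows are assumptions; the plan text differs between SQLite versions and from the output on the slides, so the sketch only collects the plan steps for comparison.

```python
import sqlite3

QUERY = """
    SELECT A.region, A.year, A.day, A.solrad, A.tlai, B.day, B.tlai
    FROM combined A, combined B
    WHERE A.region = B.region AND A.year = B.year AND A.day = B.day - 1
"""

def plan(conn):
    """Return the textual plan steps (last column of EXPLAIN QUERY PLAN rows)."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE combined (region INTEGER, year INTEGER, day INTEGER, solrad REAL, tlai REAL)"
)
# Hypothetical data: 2 regions x 250 days.
conn.executemany(
    "INSERT INTO combined VALUES (?, ?, ?, ?, ?)",
    [(r, 78, d, 200.0 + d, d / 100) for r in (1, 2) for d in range(1, 251)],
)

plan_before = plan(conn)  # no user-defined index available
conn.execute("CREATE INDEX c_ryd ON combined (region, year, day)")
plan_after = plan(conn)   # the optimizer can now look up join partners via c_ryd

for step in plan_before + ["--- after CREATE INDEX ---"] + plan_after:
    print(step)
```

Comparing the two plans mirrors the change on the slides from a sort-and-merge join to a nested loop join that probes the index.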

Page 86: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Declustering the table

• Partition the table by Region and/or Year
  • Linearly scalable join operation
• Testing spatial relationships/sensitivity
  • Compare region R with a specified neighborhood of R
  • Compare region R with other “similar” regions (spatial clustering)
  • Decluster table by year/day

Page 87: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Built-in support for OLAP

• Example table
  • INPUT (Region, Week, Day_of_week, Solrad)
  • 2 regions, 1978, 250 days/year (500 rows)
• SQL support for rollup
  SELECT region, week, day_of_week, sum(solrad)
  FROM Input
  GROUP BY ROLLUP (region, week, day_of_week)
  ORDER BY region, week, day_of_week
• Output is summation of solrad by
  • (region, week, day_of_week), (region, week, –)
  • (region, –, –), (–, –, –)

Page 88: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Region  Week  Day_of_Week  SUM(Solrad)
17003   1     1               1661.0
17003   1     2               2654.0
17003   1     3               2709.0
17003   1     4               2101.0
17003   1     5               1197.0
17003   1     6               1605.0
17003   1     7               1133.0
17003   1     -              13060.0
….
17003   36    1               6030.0
17003   36    2               6222.0
17003   36    3               6351.0
17003   36    4               6387.0
17003   36    5               6160.0
17003   36    -              31150.0
17003   -     -            1206273.0
-       -     -            2398149.0
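The rollup semantics can be checked by hand with a short sketch that sums a measure over every prefix of the grouping columns. The `rollup` function is a hypothetical emulation, not the DBMS implementation. The rows are the week 1 and week 36 daily values from the table above, so the weekly subtotals match the table, while the region and grand totals cover only these two weeks.

```python
from collections import defaultdict

def rollup(rows, n_dims):
    """Sum the measure for every prefix of the first n_dims columns.

    None marks a rolled-up column, like "-" in the table above.
    """
    totals = defaultdict(float)
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        for k in range(n_dims, -1, -1):
            totals[dims[:k] + (None,) * (n_dims - k)] += measure
    return dict(totals)

# (region, week, day_of_week, solrad): weeks 1 and 36 of region 17003.
rows = [
    (17003, 1, 1, 1661.0), (17003, 1, 2, 2654.0), (17003, 1, 3, 2709.0),
    (17003, 1, 4, 2101.0), (17003, 1, 5, 1197.0), (17003, 1, 6, 1605.0),
    (17003, 1, 7, 1133.0),
    (17003, 36, 1, 6030.0), (17003, 36, 2, 6222.0), (17003, 36, 3, 6351.0),
    (17003, 36, 4, 6387.0), (17003, 36, 5, 6160.0),
]

totals = rollup(rows, n_dims=3)
week1 = totals[(17003, 1, None)]    # subtotal for week 1
week36 = totals[(17003, 36, None)]  # subtotal for week 36
```

For three grouping columns, ROLLUP produces four grouping levels: the full key, two partial prefixes, and the grand total.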

Page 89: NPACI AHM 2001 Tutorial on Data Mining for Scientific

The “cube” operator

• SQL query
  SELECT region, week, day_of_week, sum(solrad)
  FROM Input
  GROUP BY CUBE (region, week, day_of_week)
  ORDER BY region, week, day_of_week
• Output is summation of solrad by
  • (region, week, day_of_week), (region, week, –), (region, –, –), (–, –, –)
  • (region, –, day_of_week)
  • (–, week, day_of_week)
  • (–, week, –)
  • (–, –, day_of_week)
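Where ROLLUP aggregates over prefixes of the grouping columns, CUBE aggregates over every subset of them. A hypothetical emulation, using four daily solar radiation values for region 17003 (weeks 1 and 36, days 1 and 2) as sample rows:

```python
from collections import defaultdict
from itertools import product

def cube(rows, n_dims):
    """Sum the measure for every subset of the first n_dims columns.

    None marks an aggregated-away column.
    """
    totals = defaultdict(float)
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        for keep in product((True, False), repeat=n_dims):
            key = tuple(v if k else None for v, k in zip(dims, keep))
            totals[key] += measure
    return dict(totals)

# (region, week, day_of_week, solrad)
rows = [
    (17003, 1, 1, 1661.0), (17003, 1, 2, 2654.0),
    (17003, 36, 1, 6030.0), (17003, 36, 2, 6222.0),
]

totals = cube(rows, n_dims=3)

# Distinct grouping patterns: 2**3 = 8 for three grouping columns.
patterns = {tuple(v is None for v in key) for key in totals}
```

The extra groupings, such as (–, –, day_of_week), are the ones ROLLUP does not produce: for example `totals[(None, None, 1)]` sums day 1 across both weeks.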

Page 90: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Distributed data mining

• “Function shipping” vs. “data shipping”
  • Generalization of the “operator pushdown” notion
  • “DataCutter” operations in SRB
  • Source/wrapper-side processing in MIX
• Need to understand which operations can be distributed and how
• Web-based infrastructure for OLAP and DM
  • XML for Analysis
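As one illustration of an operation that distributes well, a mean can be computed by shipping a small function to each site and combining the partial results, instead of shipping every row to the client. The site names, values, and helper functions below are hypothetical.

```python
def local_summary(values):
    """Runs at the data source: reduce local rows to a (sum, count) pair."""
    return (sum(values), len(values))

def combine(summaries):
    """Runs at the client: merge the partial results into a global mean."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count

# Hypothetical solar radiation readings held at two remote sites.
sites = {
    "site_a": [1661.0, 2654.0, 2709.0, 2101.0, 1197.0, 1605.0, 1133.0],
    "site_b": [6030.0, 6222.0, 6351.0, 6387.0, 6160.0],
}

# Function shipping: each site returns two numbers.
summaries = [local_summary(values) for values in sites.values()]
mean_function_shipping = combine(summaries)
numbers_moved_function = 2 * len(summaries)

# Data shipping: every value crosses the network before aggregation.
all_values = [v for values in sites.values() for v in values]
mean_data_shipping = sum(all_values) / len(all_values)
numbers_moved_data = len(all_values)
```

Sums, counts, minima, and maxima combine this way directly, and a mean follows from a sum and a count. Something like an exact median has no such small per-site summary, which is one reason it matters to know which operations can be distributed.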

Page 91: NPACI AHM 2001 Tutorial on Data Mining for Scientific

“Remote” operations in SRB

(Diagram: Application (SRB client); SRB Middleware with SRB Servers and MCAT; DataCutter, other “remote” operations)

Page 92: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Wrapper-side processing in MIX

(Diagram: Application, MIXm Mediator, Wrappers, Data Source, XML Data Source, Data Source)

Page 93: NPACI AHM 2001 Tutorial on Data Mining for Scientific

The role of XML

• Representing, exchanging metadata
  • Image headers, instrumentation information, descriptive metadata...
• Expressing service descriptions
  • Web-based services
• Exchanging data among services
  • “Raw” data: sequence information, GIS information…
  • Results of analysis: rowsets, multidimensional cubes,...

Page 94: NPACI AHM 2001 Tutorial on Data Mining for Scientific

XML for Analysis

(Diagram: Client Web Service: Client Functionality, UI, Client Functions, Discover/Execute calls over SOAP and HTTP. Provider Web Service: XML for Analysis Provider Implementation, Discover/Execute calls over SOAP and HTTP, Server, Data Source. Discover and Execute pass between client and provider; Data is returned.)

Page 95: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Examples - Overview

• Intelligent Miner - Data Analysis and Mining
  • Interface, database connectivity, data creation
  • Statistical routines
  • Classification

• Decision Tree

• Neural Network

• Clustering

• Netica - Probabilistic Modeling and Decision Support
  • Belief networks, probabilistic queries
  • Statistical decision theory, decision models, influence diagrams

Page 108: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Probabilistic Modeling and Bayesian Belief Networks

(Diagram: belief network with nodes Productivity, Precipitation, Solar Radiation, Yield)

Page 111: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Statistical Decision Theory*

• Normative model of rational decision making
  • Decision: Irrevocable allocation of resources
  • Beliefs: Probability theory
  • Preferences: Utility theory
  • Expected Utility = sum over outcomes of Probability * Utility
  • Value of Information = EU(A | I) - EU(A)
• Principle of Rationality: Maximize Expected Utility

• (*Rational agents are your friends.)
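A small worked example of expected utility and value of information. The states, actions, probabilities, and utilities are all hypothetical, and the information considered is perfect information about the state, which is the simplest case of EU(A | I).

```python
# Beliefs: probability of each state of the world (hypothetical).
P = {"dry": 0.5, "wet": 0.5}

# Preferences: utility of each action in each state (hypothetical).
U = {
    "irrigate": {"dry": 80.0, "wet": 60.0},
    "skip": {"dry": 20.0, "wet": 100.0},
}

def expected_utility(action):
    """EU(action) = sum over states of P(state) * U(action, state)."""
    return sum(P[s] * U[action][s] for s in P)

# Principle of rationality: choose the action with maximum expected utility.
best_action = max(U, key=expected_utility)
eu_without_info = expected_utility(best_action)

# With perfect information the best action is chosen per state,
# and the results are weighted by how likely each state is.
eu_with_info = sum(P[s] * max(U[a][s] for a in U) for s in P)

value_of_information = eu_with_info - eu_without_info
```

Here irrigating has expected utility 70 and skipping has 60, so the rational choice without further information is to irrigate. Knowing the state in advance raises the expected utility to 90, so the information is worth up to 20 utility units.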

Page 114: NPACI AHM 2001 Tutorial on Data Mining for Scientific

BK2 Induction Algorithm

• Data Mining via Probabilistic Model Induction
  • Discover Network Structure and Parameters
  • Greedy Algorithm – ML gradient search
  • Encode background Knowledge – Preferences

Page 115: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Model Day 1

Page 116: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Model Day 120

Page 117: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Other Mining Applications

• Spatial Data Mining
• Time Series
• Sequence Mining
• Text Data Mining
• Multimedia Database Mining
• Web Mining
• Network Traffic Analysis

Page 118: NPACI AHM 2001 Tutorial on Data Mining for Scientific

Acknowledgements

• Students
  • Peter Shin, Ankur Jain
• Science Collaborator
  • Stuart Gage, MSU: shared his data set and many insights about the data
• SDSC
  • Mike Vildibill, Deputy Dir, providing hardware resources for SKIDL
  • Josh Polterock / Dave Archbell: help with software installation, maintenance
• Funding support
  • NPACI ESS: support for Tony Fountain, Ankur Jain
  • NPACI DICE: support for Chaitan Baru
  • NSF REU: support for Peter Shin