© Vipin Kumar, August 20, 2003

Discovery of Patterns in the Global Climate System using Data Mining

Vipin Kumar

Army High Performance Computing Research Center

Department of Computer Science

University of Minnesota

http://www.cs.umn.edu/~kumar

Research sponsored by AHPCRC/ARL, DOE, NASA, and NSF


What is Data Mining?

Many Definitions

– Non-trivial extraction of implicit, previously unknown and potentially useful information from data

– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns


What is (not) Data Mining?

What is not Data Mining?

– Look up phone number in phone directory

– Query a Web search engine for information about “Amazon”

What is Data Mining?

– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, … in Boston area)

– Group together similar documents returned by search engine according to their context (Amazon rainforest, Amazon.com, etc.)


Why Mine Data? Commercial Viewpoint

Lots of data is being collected and warehoused

– Web data: Yahoo! collects 10GB/hour

– Purchases at department/grocery stores: Walmart records 20 million transactions per day

– Bank/Credit Card transactions

Computers have become cheaper and more powerful

Competitive Pressure is Strong

– Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint

Data collected and stored at enormous speeds (GB/hour)

– remote sensors on a satellite: NASA EOSDIS archives over 1 petabyte of Earth Science data per year

– telescopes scanning the skies: sky survey data

– gene expression data

– scientific simulations: terabytes of data generated in a few hours

Traditional techniques infeasible for raw data

Data mining may help scientists

– in automated analysis of massive data sets

– in hypothesis formation


Mining Large Data Sets - Motivation

There is often information “hidden” in the data that is not readily evident

Human analysts may take too long to discover useful information

Much of the data is never analyzed at all

[Figure: The Data Gap. Total new disk (TB) since 1995 versus number of analysts, 1995 to 1999; vertical axis from 0 to 4,000,000]

Ref: R. Grossman, C. Kamath, V. Kumar, Data Mining for Scientific and Engineering Applications


Origins of Data Mining

Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

Traditional techniques may be unsuitable due to

– Enormity of data

– High dimensionality of data

– Heterogeneous, distributed nature of data

[Figure: Data Mining drawing on Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]


Role of Parallel & Distributed Computing

High Performance Computing (HPC) is often critical for scalability to large data sets

– Many algorithms use more than O(n) computation time

– Sequential computers have limited memory, thus requiring multiple, expensive I/O passes over data

Distributed computing is needed because data is distributed

– due to privacy reasons

– physically dispersed over many different geographic locations

[Figure: Data Mining drawing on Machine Learning/Pattern Recognition, Statistics/AI, High Performance Computing, and Database systems]

Data Mining Tasks...

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
13   No      Single          85K             Yes
15   No      Single          90K             Yes

[Figure: Data feeding four data mining tasks: Predictive Modeling, Clustering, Association Rules, Anomaly Detection]


Predictive Modeling

Find a model for class attribute as a function of the values of other attributes

[Figure: decision tree with nodes labeled Married, Income 100K, and Income 80K, Yes/No branches, and YES/NO leaf labels]

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Attribute types: Refund (categorical), Marital Status (categorical), Taxable Income (continuous), Evade (class)

Learn Classifier

Model for predicting tax evasion
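One common way to learn such a classifier is a decision tree that chooses, at each node, the attribute test with the lowest weighted Gini impurity. The slides do not say which learner or split criterion was used, so the following is only a sketch under that assumption. It uses the ten records from the Data Mining Tasks table; the helper names `gini` and `weighted_gini` and the three candidate tests are illustrative choices, not taken from the talk.

```python
from collections import Counter

# (refund, marital_status, taxable_income_in_K, evade), from the ten-record table
RECORDS = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def gini(records):
    """Gini impurity of the class label (last field) in a list of records."""
    if not records:
        return 0.0
    counts = Counter(r[-1] for r in records)
    n = len(records)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini(records, test):
    """Impurity after a binary split: size-weighted Gini of the two branches."""
    left = [r for r in records if test(r)]
    right = [r for r in records if not test(r)]
    n = len(records)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical candidate tests; a real learner would enumerate many more.
CANDIDATE_SPLITS = {
    "Refund = Yes": lambda r: r[0] == "Yes",
    "Marital Status = Married": lambda r: r[1] == "Married",
    "Taxable Income <= 80K": lambda r: r[2] <= 80,
}

scores = {name: weighted_gini(RECORDS, test) for name, test in CANDIDATE_SPLITS.items()}
best_split = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: weighted Gini = {score:.3f}")
print("lowest impurity:", best_split)
```

Among these three candidates the Married test gives the purest split, because all four married records have class No.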

Predictive Modeling: Applications

Targeted Marketing

Customer Attrition/Churn

Classifying Galaxies

Class:

• Stages of Formation (Early, Intermediate, Late)

Attributes:

• Image features

• Characteristics of light waves received, etc.

Sky Survey Data Size:

• 72 million stars, 20 million galaxies

• Object Catalog: 9 GB

• Image Database: 150 GB

Courtesy: http://aps.umn.edu



Clustering

Given a set of data points, find groupings such that

– Data points in one cluster are more similar to one another

– Data points in separate clusters are less similar to one another
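As a minimal illustration of this definition, the sketch below runs plain k-means on a handful of 2-D points. K-means is used here only because it is a widely known algorithm; the talk does not name it, and the climate clustering discussed later is described only as a novel technique. The sample points and the initial centroids are made up for the example.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical data: two visually separate groups.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, clusters = kmeans(POINTS, centroids=[(1.0, 1.0), (8.0, 8.0)])

for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Points within each resulting group are closer to their own centroid than to the other one, which is the "more similar within, less similar across" property stated above, with Euclidean distance standing in for similarity.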


Clustering: Applications

Market Segmentation

Gene expression clustering

Document Clustering


Association Rule Discovery

Given a set of records, find dependency rules which will predict occurrence of an item based on occurrences of other items in the record

Applications

– Marketing and Sales Promotion

– Supermarket shelf management

– Inventory Management

TID Items

1 Bread, Coke, Milk

2 Beer, Bread

3 Beer, Coke, Diaper, Milk

4 Beer, Bread, Diaper, Milk

5 Coke, Diaper, Milk

Rules Discovered:

{Milk} --> {Coke}

{Diaper, Milk} --> {Beer}
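The strength of such rules is usually summarized by support and confidence, the same two measures quoted for the FPAR rule later in the talk. The sketch below computes both for the two rules over the five transactions listed above. The function names are illustrative, and this only evaluates given rules; it is not a rule-mining algorithm.

```python
# The five market-basket transactions from the slide.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


RULES = [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]

for lhs, rhs in RULES:
    sup = support(lhs | rhs, TRANSACTIONS)
    conf = confidence(lhs, rhs, TRANSACTIONS)
    print(f"{sorted(lhs)} --> {sorted(rhs)}: sup={sup:.0%}, conf={conf:.1%}")
```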


Deviation/Anomaly Detection

Detect significant deviations from normal behavior

Applications:

– Credit Card Fraud Detection

– Network Intrusion Detection

Typical network traffic at University level may reach over 100 million connections per day
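A very simple way to flag a "significant deviation" is a z-score test: mark any observation that lies more than a chosen number of standard deviations from the mean. The sketch below assumes that approach purely for illustration; the talk does not describe a specific detector, and the connection counts and the threshold of 2.0 are invented.

```python
from statistics import mean, pstdev


def zscore_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose distance from the mean exceeds threshold standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing deviates
    return [(i, v) for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical connections per minute observed for one host.
counts = [100, 102, 98, 101, 99, 250]
anomalies = zscore_anomalies(counts)
print(anomalies)
```

With so few observations the outlier itself inflates the mean and standard deviation, which is one reason practical detectors often rely on more robust statistics or learned models of normal behavior.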

Discovery of Patterns in the Earth Science Data

[Figure: stacked global snapshots of SST, Precipitation, NPP, and Pressure over time; axes: Longitude, Latitude, Time; spatial units: grid cell, zone]

Global snapshots of values for a number of variables on land surfaces or water

Data sources:

– weather observation stations

– earth orbiting satellites (since 1981)

– model-based data

NASA ESE questions:

– How is the global Earth system changing?

– What are the primary forcings?

– How does the Earth system respond to natural & human-induced changes?

– What are the consequences of changes in the Earth system?

– How well can we predict future changes?


Climate Indices: Connecting the Ocean/Atmosphere and the Land

[Figure: Correlation Between ANOM 1+2 and Land Temp (>0.2); axes: longitude, latitude; color scale from -0.8 to 0.8]

[Figure: Nino 1+2 Index, with El Nino Events marked]

A climate index is a time series of sea surface temperature or sea level pressure

Climate indices capture teleconnections

– The simultaneous variation in climate and related processes over widely separated points on the Earth
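A correlation map like the one for ANOM 1+2 and land temperature can be sketched by computing the Pearson correlation between the index time series and the temperature series at each land grid cell, then keeping cells above a cutoff. The code below is a minimal sketch under stated assumptions: the series values and cell coordinates are invented, and applying the 0.2 cutoff to the absolute correlation is an assumption about how the figure was thresholded.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return cov / (norm_x * norm_y)


# Hypothetical monthly values of a climate index.
index = [0.5, -0.2, 1.1, 0.3, -0.8, 0.0]

# Hypothetical land temperature series keyed by (latitude, longitude).
land_temp = {
    (-10.0, -75.0): [1.0, -0.4, 2.2, 0.6, -1.6, 0.0],  # follows the index closely
    (45.0, 10.0): [1.0, 1.0, 0.0, 0.0, 0.0, 0.0],      # essentially unrelated
}

corr_map = {cell: pearson(index, series) for cell, series in land_temp.items()}
teleconnected = [cell for cell, r in corr_map.items() if abs(r) > 0.2]

print(corr_map)
print(teleconnected)
```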


Discovery of Climate Indices Using Clustering

A novel clustering technique was developed to identify regions of uniform behavior in spatio-temporal data. The use of clustering for discovering climate indices is driven by the intuition that a climate phenomenon is expected to involve a significant region of the ocean or atmosphere where the behavior is relatively uniform over the entire area.

A cluster-based approach for discovering climate indices provides better physical interpretation than approaches based on the SVD/EOF paradigm, and provides candidate indices with better predictive power than known indices for some land areas.

Some SST clusters reproduce well-known climate indices. In particular, we were able to replicate the four El Nino SST-based indices: cluster 94 corresponds to NINO 1+2, 67 to NINO 3, 78 to NINO 3.4, and 75 to NINO 4. The correlations of these clusters to their corresponding indices are higher than 0.9.

Some SST clusters, e.g., cluster 29, are significantly different from known indices, but provide better correlation with land climate variables than known indices for many parts of the globe. The bottom figure shows the difference in correlation to land temperature between cluster 29 and the El Nino indices. Areas in yellow indicate where cluster 29 has higher correlation.

[Figure: Cluster 29 versus El Nino Indices; axes: longitude, latitude; color scale from -0.6 to 0.6]

[Figure: SST Clusters With Relatively High Correlation to Land Temperature; axes: longitude, latitude; labeled clusters: 29, 75, 78, 67, 94]


Mining the Climate Data: Clustering

Clusters of SST that have high impact on land temperature

El Nino Regions Defined by Earth Scientists

Niño Region (Cluster)  Longitude Range  Latitude Range
1+2 (94)               90°W-80°W        10°S-0°
3 (67)                 150°W-90°W       5°S-5°N
3.4 (78)               170°W-120°W      5°S-5°N
4 (75)                 160°E-150°W      5°S-5°N

# grid points: 67K Land, 40K Ocean

Current data size range: 20 – 400 MB

Monthly data over a range of 17 to 50 years

Cluster  Nino Index  Correlation
94       NINO 1+2    0.9225
67       NINO 3      0.9462
78       NINO 3.4    0.9196
75       NINO 4      0.9165

SST Cluster Moderately Correlated to Known Indices

[Figure: Cluster 62, global map]

[Figure: Cluster 62 - SOI ANOM12 ANOM3 ANOM4 ANOM34 (mincorr = 0.20), global map; color scale from -0.6 to 0.6]

Ref: Steinbach et al 2002/2003 (KDD 2003)

Correlation of Known Indices with SST Cluster Centroids and SVD Components

               Cluster Centroids                        SVD Components
Climate Index  Best-shifted Correlation  Best Centroid  Best SVD Correlation  Best Component
SOI            -0.7006                   75 (G0)        -0.5427               3
NAO            -0.2973                   19 (G2)         0.1774               8
AO             -0.2383                   29 (G1)         0.2301               8
PDO             0.5172                   20 (G1)        -0.4684               7
QBO            -0.2675                   20 (G1)         0.3187               11
CTI             0.9147                   67 (G0)         0.6316               3
WP              0.2590                   78 (G0)         0.1904               3
NINO1+2         0.9225                   94 (G0)        -0.5419               1
NINO3           0.9462                   67 (G0)        -0.6449               1
NINO3.4         0.9196                   78 (G0)        -0.6844               1
NINO4           0.9165                   75 (G0)        -0.6894               1


SLP Clusters

[Figure: map of SLP clusters, with regions labeled DMI, SOI, NAO, and AO]


Pair of SLP Clusters that Correspond to SOI

[Figure: Centroids of SLP clusters 13 and 20; x-axis: 87 to 99; y-axis: -3 to 3; legend: Centroid 20, Centroid 13]

[Figure: Cluster centroid 20 – 13 versus SOI; x-axis: 87 to 99; y-axis: -3 to 3; legend: Centroid 13 - 20, SOI]

Correlation = 0.75


Finding New Patterns: Indian Monsoon Dipole Mode Index

Recently a new index, the Indian Ocean Dipole Mode index (DMI), has been discovered.

DMI is defined as the difference in SST anomaly between the region 5S-5N, 55E-75E and the region 0-10S, 85E-95E.

DMI is an indicator of a weak monsoon over the Indian subcontinent and heavy rainfall over East Africa.

We can reproduce this index as a difference of pressure indices of clusters 16 and 22.

Plot of cluster 16 – cluster 22 versus the Indian Ocean Dipole Mode index. (Indices smoothed using 12 month moving average.)
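Both the DMI definition and its cluster-based reproduction are a difference of two time series, smoothed with a 12-month moving average before plotting. The sketch below shows those two steps. The function names and the input values are illustrative, and the trailing window alignment is an assumption, since the slide does not say whether the average is trailing or centered.

```python
def moving_average(series, window=12):
    """Moving average over consecutive windows; returns len(series) - window + 1 values."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [sum(series[i:i + window]) / window for i in range(len(series) - window + 1)]


def difference_index(series_a, series_b):
    """Element-wise difference a - b of two equal-length monthly series."""
    if len(series_a) != len(series_b):
        raise ValueError("series must have the same length")
    return [a - b for a, b in zip(series_a, series_b)]


# Hypothetical monthly centroid values for two clusters (14 months each).
cluster_a = [0.2, 0.4, 0.1, -0.3, 0.0, 0.5, 0.6, 0.2, -0.1, -0.4, 0.3, 0.1, 0.7, 0.2]
cluster_b = [0.1, 0.0, 0.3, 0.2, -0.2, 0.1, 0.0, 0.4, 0.2, -0.1, -0.3, 0.0, 0.1, 0.5]

raw_index = difference_index(cluster_a, cluster_b)
smoothed_index = moving_average(raw_index, window=12)
print(smoothed_index)
```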


Mining the Climate Data: Associations

FPAR-Hi ==> NPP-Hi (sup=5.9%, conf=55.7%)

Grassland/Shrubland areas

The association rule is interesting because it appears mainly in regions with grassland/shrubland vegetation type

Ref: Tan et al 2001


Release: 03-51AR

NASA DATA MINING REVEALS A NEW HISTORY OF NATURAL DISASTERS

NASA is using satellite data to paint a detailed global picture of the interplay among natural disasters, human activities and the rise of carbon dioxide in the Earth's atmosphere during the past 20 years.

Detection of Ecosystem Disturbances

Detection of sudden changes in greenness over extensive areas from these large global satellite data sets required development of automated techniques that take into account the timing, location, and magnitude of such changes.

An algorithm was designed to identify any significant and sustained declines in FPAR during an 18 year time period. This algorithm transforms a non-stationary time series to a sequence of disturbance events. Techniques were also developed to discover associations between ecosystem disturbance regimes and historical climate anomalies.
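The release does not give the details of the algorithm, so the following is only a toy sketch of the general idea of turning a time series into a sequence of disturbance events: report every run in which the value stays well below a reference level for a minimum number of consecutive months. The function name, the fixed baseline, the drop size, and the minimum run length are all hypothetical choices, not the published method.

```python
def sustained_declines(series, baseline, drop=0.2, min_length=12):
    """Return (start, end) index pairs, end exclusive, for runs where the series
    stays at least `drop` below `baseline` for `min_length` or more steps."""
    events = []
    start = None
    for i, value in enumerate(series):
        low = value <= baseline - drop
        if low and start is None:
            start = i  # a decline begins
        elif not low and start is not None:
            if i - start >= min_length:
                events.append((start, i))  # long enough to count as sustained
            start = None
    # Handle a decline that is still in progress at the end of the record.
    if start is not None and len(series) - start >= min_length:
        events.append((start, len(series)))
    return events


# Hypothetical monthly FPAR values: a short dip, then a long decline.
fpar = [0.8] * 5 + [0.5] * 3 + [0.8] * 4 + [0.4] * 14 + [0.8] * 2
events = sustained_declines(fpar, baseline=0.8)
print(events)
```

The three-month dip is ignored and only the fourteen-month decline is reported, which illustrates the "significant and sustained" criterion.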

These algorithms and techniques have allowed Earth Science researchers to gain a deeper insight into the interplay among natural disasters, human activities and the rise of carbon dioxide in Earth's atmosphere during two recent decades.

http://amesnews.arc.nasa.gov/releases/2003/03_51AR.html


Understanding Global Teleconnections of Climate to Regional Model Estimates of Amazon Ecosystem Carbon Fluxes

[Figure: map; axes: longitude (-90 to -30), latitude (30 to -60)]

[Figure: Average NPP at 55.0 W, 15.0 S vs. Average AO; x-axis: 82 to 98; y-axis: -3 to 3; legend: NPP, AO]

Discovered, using correlation analysis, a strong connection between the rainfall patterns generated by the South American monsoon system and terrestrial greenness over a large section of the southern Amazon region.

This is the first direct evidence of large-scale effects of the Atlantic Ocean rainfall systems on yearly greenness changes in the Amazon region, and the finding has important implications for the impacts of "slash and burn" deforestation on this crucial ecosystem of the world.

High Resolution EOS Data

EOS satellites provide high resolution measurements

– Finer spatial grids: an 8 km × 8 km grid produces 10,848,672 data points; a 1 km × 1 km grid produces 694,315,008 data points

– More frequent measurements

– Multiple instruments

Generates terabytes of data per day
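The two grid sizes are consistent with simple scaling: shrinking the cell edge from 8 km to 1 km splits every cell into 8 × 8 = 64 smaller cells. A quick check, with an illustrative helper name:

```python
POINTS_8KM = 10_848_672
POINTS_1KM = 694_315_008


def refined_point_count(points, coarse_km, fine_km):
    """Number of grid points after refining square cells from coarse_km to fine_km edges."""
    if coarse_km % fine_km != 0:
        raise ValueError("coarse_km must be a multiple of fine_km")
    factor = coarse_km // fine_km
    return points * factor ** 2  # each cell splits into factor x factor cells


print(refined_point_count(POINTS_8KM, coarse_km=8, fine_km=1))
```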

High resolution data allows us to answer more detailed questions:

– Detecting patterns such as trajectories, fronts, and movements of regions with uniform properties

– Finding relationships between leaf area index (LAI) and topography of a river drainage basin

– Finding relationships between fire frequency and elevation as well as topographic position

Earth Observing System (e.g., Terra and Aqua satellites)

http://www.crh.noaa.gov/lmk/soo/docu/basicwx.htm


Discovery of Changes from the Global Carbon Cycle and Climate System Using Data Mining: Journal Publications

Potter, C., Tan, P., Steinbach, M., Klooster, S., Kumar, V., Myneni, R., Genovese, V., 2003. Major disturbance events in terrestrial ecosystems detected using global satellite data sets. Global Change Biology, July, 2003.

Potter, C., Klooster, S. A., Myneni, R., Genovese, V., Tan, P., Kumar, V., 2003. Continental scale comparisons of terrestrial carbon sinks estimated from satellite data and ecosystem modeling 1982-98. Global and Planetary Change (in press)

Potter, C., Klooster, S. A., Steinbach, M., Tan, P., Kumar, V., Shekhar, S., Nemani, R., Myneni, R., 2003. Global teleconnections of climate to terrestrial carbon flux. J. Geophys. Res. - Atmospheres (in press).

Potter, C., Klooster, S., Steinbach, M., Tan, P., Kumar, V., Myneni, R., Genovese, V., 2003. Variability in Terrestrial Carbon Sinks Over Two Decades: Part 1 – North America. Geophysical Research Letters (in press)

Potter, C., Klooster, S., Steinbach, M., Tan, P., Kumar, V., Shekhar, S. and C. Carvalho, 2002. Understanding Global Teleconnections of Climate to Regional Model Estimates of Amazon Ecosystem Carbon Fluxes. Global Change Biology (in press)

Potter, C., Zhang, P., Shekhar, S., Kumar, V., Klooster, S., and Genovese, V., 2002. Understanding the Controls of Historical River Discharge Data on Largest River Basins. (in preparation)


Discovery of Changes from the Global Carbon Cycle and Climate System Using Data Mining: Conference/Workshop Publications

Steinbach, M., Tan, P., Kumar, V., Potter, C. and Klooster, S., 2003. Discovery of Climate Indices Using Clustering, KDD 2003, Washington, D.C., August 24-27, 2003.

Zhang, P., Huang, Y., Shekhar, S., and Kumar, V., 2003. Exploiting Spatial Autocorrelation to Efficiently Process Correlation-Based Similarity Queries, Proc. of the 8th Intl. Symp. on Spatial and Temporal Databases (SSTD '03)

Zhang, P., Huang, Y., Shekhar, S., and Kumar, V., 2003. Correlation Analysis of Spatial Time Series Datasets: A Filter-And-Refine Approach, Proc. of the Seventh Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD '03)

Ertoz, L., Steinbach, M., and Kumar, V., 2003. Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data, Proc. of Third SIAM International Conference on Data Mining.

Tan, P., Steinbach, M., Kumar, V., Potter, C., Klooster, S., and Torregrosa, A., 2001. Finding Spatio-Temporal Patterns in Earth Science Data, KDD 2001 Workshop on Temporal Data Mining, San Francisco

Kumar, V., Steinbach, M., Tan, P., Klooster, S., Potter, C., and Torregrosa, A., 2001. Mining Scientific Data: Discovery of Patterns in the Global Climate System, Proc. of the 2001 Joint Statistical Meeting, Atlanta
