analysis of complex real-time atmospheric data sets: a ...dgross/poster_acs2005_mel...analysis of...


Analysis of Complex Real-Time Atmospheric Data Sets: A Data Mining Approach

Melanie Yuen1, Andrew Ault1, Deborah S. Gross1, Ben Anderson2, Anna Ritz2, David R. Musicant2
1) Department of Chemistry and 2) Department of Mathematics and Computer Science, Carleton College, Northfield, MN 55057

James J. Schauer3, Lei Chen4, Bee-Chung Chen4, Raghu Ramakrishnan4
3) Engineering Technology and Chemistry and 4) Department of Computer Sciences, University of Wisconsin, Madison, WI 53706

THE CHALLENGE

Real-time single-particle instruments, such as the ATOFMS (TSI 3800, Shoreview, MN), obtain complex information about each detected aerosol particle. The ATOFMS provides the particle’s aerodynamic diameter (Da), the time it was sampled, and two complete mass spectra (one each for positive and negative ions), each of which contains 32,000 data points. A sample particle is shown below, in Figure 1. Each peak in the spectrum represents at least one ion composition having that specific mass-to-charge (m/z) value. A dataset consists of particle spectra collected over a user-selected time period. Each dataset may contain hundreds to millions of particles, and the analysis is therefore time consuming. Prior to interpretation, the single-particle data must be calibrated (both the Da and the m/z). Upon calibration, the dataset can be analyzed using available tools, including those being developed as part of EnChIlADA.

Many different instruments can obtain mass spectra of single atmospheric aerosol particles in real time. These instruments are used to measure aerosol particles from both natural and anthropogenic sources. Challenges arise not only in acquiring data with these instruments, but also in analyzing the large amount of complex data obtained, often thousands to hundreds of thousands of particle spectra per day. To understand an atmospheric system more fully, the single-particle mass spectra must be analyzed in conjunction with a variety of other measurement results, including meteorological data and other real-time aerosol measurements. This compounds the difficult problem posed by the complex single-particle data, as the various data sets are typically analyzed individually prior to merging them to look for correlations.

We have launched the Exploratory Data Analysis and Management (EDAM) project to bring the expertise of the data-mining community to bear on atmospheric data analysis. We are currently developing a software package that will more efficiently analyze data from a variety of instruments that collect real-time aerosol information. Our goal is to search for new and existing trends in the data by implementing data mining algorithms in our software. Development of this package, called “Environmental Chemistry through Intelligent Aerosol Data Analysis” (EnChIlADA), has been in progress for only a few months. Here we present initial results of the project, using data obtained from the single-particle Aerosol Time-of-Flight Mass Spectrometer (ATOFMS) to assess the newly developed data-mining tools.

The Data

Figure 1 shows sample single-particle mass spectra from a particle detected in East St. Louis, during December 2003. The spectrum is only displayed up to 300 m/z, due to the scarcity of peaks above this value. There may be more than one chemical composition associated with each m/z. At any particular location, there will be different compound combinations that are associated with one another, due to local source profiles. Thus, the seemingly simple task of figuring out what compounds are represented by the peaks in each spectrum is remarkably complex. The integration of the rules of chemistry into data-mining algorithms can make them applicable to this dataset, and can simplify the problem somewhat. For example, the naturally abundant isotopes of various elements place limits on the possible peak identifications, such as those seen for iron in Figure 1. The blue peaks in the background represent the natural abundance of the iron isotopes.
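The isotope constraint described above can be illustrated with a short sketch. It is not the poster's algorithm: the input format (a dict of integer m/z to peak area), the function name, and the tolerance values are all assumptions made for illustration. Only the natural abundances of the iron isotopes are standard reference values.

```python
# Natural isotopic abundances of iron, as fractions of all Fe atoms.
FE_ISOTOPES = {54: 0.05845, 56: 0.91754, 57: 0.02119, 58: 0.00282}


def matches_isotope_pattern(peaks, isotopes, tolerance=0.5, min_abundance=0.01):
    """Return True if peak areas are consistent with an element's isotope pattern.

    peaks: dict of integer m/z -> peak area (hypothetical input format).
    isotopes: dict of integer m/z -> natural abundance fraction.
    tolerance: allowed relative deviation of each observed isotope ratio (assumed).
    min_abundance: rarer isotopes are ignored, since they are likely below noise.
    """
    major = max(isotopes, key=isotopes.get)  # most abundant isotope
    if peaks.get(major, 0.0) <= 0.0:
        return False
    for mz, abundance in isotopes.items():
        if mz == major or abundance < min_abundance:
            continue
        expected = abundance / isotopes[major]
        observed = peaks.get(mz, 0.0) / peaks[major]
        if abs(observed - expected) > tolerance * expected:
            return False
    return True


# Toy peak lists (made-up areas, not measured data).
iron_like = {54: 640.0, 56: 10000.0, 57: 230.0}
lone_peak_at_56 = {56: 10000.0}  # no partners at m/z 54 or 57

print(matches_isotope_pattern(iron_like, FE_ISOTOPES))
print(matches_isotope_pattern(lone_peak_at_56, FE_ISOTOPES))
```

A peak at m/z 56 that lacks the expected partners at m/z 54 and 57 fails the check, which is how an isotope pattern can help rule out an iron assignment for an isobaric ion.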

INTRODUCTION

MS-Analyze

MS-Analyze is provided with TSI’s ATOFMS instrument. This program is designed to calibrate acquired data and allow for organized database storage. Through MS-Analyze’s interface to Microsoft Access, SQL queries can be run on the datasets. Thus, MS-Analyze is designed to help the user obtain detailed information about an acquired ATOFMS dataset, based on what the user knows to look for.

ANALYSIS TOOLS

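Figure 1. Sample spectrum from ATOFMS
[Positive and negative ion mass spectra; x-axis: m/z, 0 to 150, with an inset covering m/z 50 to 60. Labeled positive ions: Na+, Al+, C2+, C3+, Ca+, Fe+ (54Fe+, 56Fe+, 57Fe+). Labeled negative ions: O-, OH-, NO2-, NO3-, H(NO3)2-.]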

Figure 4. Population of 9 Cluster Centers for a Test Data Set
[Bar charts comparing ART-2a and K-means; particle types: EC, Inorganic-1, Inorganic-2, Mixture, OC-1, OC-2, OC-3; y-axis: # particles/10, 0 to 100.]

Figure 5. Comparison of Cluster Centers for East St. Louis Data
[Twelve panels overlaying the ART-2a and K-means cluster centers; x-axis: m/z, 0 to 300.]

Figure 4 shows the distribution of test dataset particles into nine clusters, two more than the inherent number of seed particles in the data. The two algorithms do a remarkably similar job of distributing the particles into the clusters. The cluster centers that have similar populations contain particles that were originally from the same seed particle, but which have different amounts of noise, causing them to cluster differently. To cluster the data into the same number of clusters, K-means required 26 % ± 10 % of the passes through the entire dataset that ART-2a required.

Figure 5 shows the overlaid cluster centers for twelve clusters obtained from ART-2a and K-means for the real data from East St. Louis. In this example, K-means required 19 % ± 14 % of the passes through the entire dataset that ART-2a required to cluster the data into the same number of clusters. The remarkable similarity in these cluster centers indicates that the two algorithms obtain nearly identical results.

K-means is faster than ART-2a, and generates very similar results, both in terms of the way particles are partitioned into clusters and the resulting cluster centers. K-means is a better-known algorithm, and thus more is known about its behavior. It shows great promise for the analysis of single-particle mass spectrometry datasets.

Clustering Single-Particle Data:

Clustering has been used successfully with ATOFMS data (1). Here we compare the commonly used ART-2a algorithm with the best-known clustering algorithm, K-means (2). Both algorithms cluster particles based on their similarity. ART-2a has two user-adjustable parameters (learning rate and vigilance factor), while K-means has only one, the number of clusters (“K”). Although the correct number of clusters for atmospheric aerosol data is not known in advance, procedures have been developed to determine an appropriate number of clusters (3). EnChIlADA contains both the ART-2a and K-means algorithms. We compare their performance by forcing each algorithm to cluster the data into the same number of clusters. In K-means, this required changing the user-defined cluster number. In ART-2a, it meant user adjustment of the vigilance factor. Figure 3 shows the comparison of the errors (average of the squares of the distances of all points to their nearest cluster centers) for each algorithm, as a function of the number of clusters. Because ART-2a does not include outlier particles in the existing clusters, its error is artificially lowered. Also in Figure 3, we see that K-means requires significantly fewer passes through the entire dataset than ART-2a to cluster data into the same number of clusters. On large data sets, the time spent on these passes dominates the overall time to cluster the data, clearly showing an advantage of K-means.
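The two quantities compared in Figure 3, the number of passes through the dataset and the average squared distance to the nearest cluster center, can be made concrete with a minimal K-means sketch. This is textbook K-means in plain Python, not the EnChIlADA implementation (which is written in Java); the random initialization, the stopping rule, and all names are assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def k_means(points, k, max_passes=100, seed=0):
    """Cluster points into k groups; return (centers, labels, passes, avg_error)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]  # assumed: random seeding
    labels = [None] * len(points)
    passes = 0
    while passes < max_passes:
        passes += 1  # one pass = one sweep over the entire dataset
        new_labels = [nearest(p, centers) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # Error as defined in the text: mean squared distance to the nearest center.
    avg_error = sum(
        min(squared_distance(p, c) for c in centers) for p in points
    ) / len(points)
    return centers, labels, passes, avg_error


# Toy data: two well-separated groups of 2-D points (not mass spectra).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers, labels, passes, avg_error = k_means(points, k=2)
print(labels, passes, avg_error)
```

In a comparison like the one described, `passes` is the quantity plotted against the number of clusters, and `k` is the single user-set parameter.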

RESULTS

EnChIlADA

The software program that we are developing, EnChIlADA, shown in Figure 2, is designed to use data mining algorithms to find patterns in the single-particle data (and eventually in other real-time data) without the need for the user to specify what to look for. EnChIlADA is being developed in Java and will be open-source, with the expectation that it will be modified by future users to include any relevant data sets. The aspects of the software that we have developed to date are described here.

Figure 2. Screen shot of EnChIlADA

The Datasets:

We compared the performance of our data mining software on two datasets of single-particle mass spectra acquired with the ATOFMS instrument:

1. A test dataset created from seven single-particle mass spectra of representative types (six distinct types, and a seventh that is similar to one of the original six). Each particle was reproduced, with the random addition of noise, to create a dataset of 2000 particles.
2. A dataset of 2699 particles acquired in East St. Louis during February 2004. These particles are a small fraction of the millions of single-particle mass spectra acquired during this study.
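As a rough illustration of how a test dataset of this kind could be built, the sketch below makes noisy copies of a few seed spectra. The poster does not state its noise model, so the additive Gaussian noise, the `noise_scale` parameter, and the toy seed spectra are all assumptions.

```python
import random


def make_test_dataset(seed_spectra, n_particles, noise_scale=0.05, seed=0):
    """Return (spectra, seed_ids): noisy copies of randomly chosen seed spectra.

    seed_spectra: list of equal-length lists of non-negative peak intensities.
    noise_scale: standard deviation of the additive Gaussian noise, relative to
                 each seed's largest peak (an assumed noise model).
    """
    rng = random.Random(seed)
    spectra, seed_ids = [], []
    for _ in range(n_particles):
        idx = rng.randrange(len(seed_spectra))
        base = seed_spectra[idx]
        sigma = noise_scale * max(base)
        # Clip at zero so that no peak intensity becomes negative.
        noisy = [max(0.0, v + rng.gauss(0.0, sigma)) for v in base]
        spectra.append(noisy)
        seed_ids.append(idx)
    return spectra, seed_ids


# Three toy seed "spectra" with five m/z bins each (made-up values).
seeds = [
    [1.0, 0.0, 0.2, 0.0, 0.0],
    [0.0, 0.8, 0.0, 0.3, 0.0],
    [0.1, 0.0, 0.0, 0.0, 1.0],
]
spectra, seed_ids = make_test_dataset(seeds, n_particles=2000)
print(len(spectra), sorted(set(seed_ids)))
```

Keeping `seed_ids` alongside the spectra makes it possible to check afterwards whether particles derived from the same seed end up in the same cluster.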

Figure 3. Comparison of Errors and Number of Data Passes
[Plots of average error versus number of clusters and of number of data passes versus number of clusters, comparing K-means and ART-2a (including ART-2a best-case and worst-case errors). Shown for the Test Data (2000 particles from 7 seed particles) and the Real Data (2699 particles from East St. Louis).]

Figure 6. Automatically Labeled Single Particle Mass Spectra
[Two positive-ion mass spectra with automatically generated peak labels; x-axis: m/z, 0 to 200. The second spectrum has an inset covering m/z 55 to 70 that shows the TiO+ isotope peaks (62TiO+ to 66TiO+). The output of the labeling algorithm for each spectrum is tabulated below.]

Spectrum 1

Ion       Peak Area
C3+       18332.7
C5+       14642.0
C11+      10530.0
K+        10163.3
C4+       8728.1
C10+      8589.3
C7+       8176.8
AlO+      4982.7
C9+       4482.9
C3H+      4147.0
FeO+      3943.0
C6+       3908.2
C12+      3813.0
C5H+      3343.0
Ti+       3018.3
C8+       2371.4
C+        1907.2
NaNO3+    1784.8
C15+      1739.1
C2H5+     1712.1
C3H5+     1676.7
C3H4+     1597.7
C14+      1490.6

Spectrum 2

Ion       Peak Area
C3H3+     41012.7
Na+       16725.0
TiO+      9562.3
C3H5+     7493.8
Ti+       4378.6
Al+       2084.1
C4H8+     2057.8
C3H6+     1509.8
CaOH+     1506.5
AlO+      1342.9
SiO+      1090.9
C2H+      590.8
C2+       559.3
C3H4+     521.4
Zn+       517.7
NO2+      388.5

Figure 7. Labeling Large Datasets, Elements as a Function of Time
[Time series for Fe, Pb, Zn, Cu, Cd, and Se; x-axis: date, 01/31/04 to 03/02/04; y-axis: total peak area of element per 30 min, 0 to 2,500,000.]

CONCLUSIONS AND FUTURE WORK

Labeling Spectra

Another focus of the data-mining research is to create an algorithm that labels all the peaks in a single-particle mass spectrum. The algorithm attempts to determine the chemical composition of every particle, and uses the isotopic contributions of the elements to resolve isobaric ions. A list of commonly observed peaks in ATOFMS spectra, plus elemental and PAH molecular weights, was provided as the source of possible ion compositions. This list can be easily augmented. Figure 6 shows two positive ion mass spectra labeled by this algorithm (negative ions will be implemented soon). These particles were sampled in East St. Louis. All peak labels were automatically generated. Due to spectral congestion, not all peak labels are shown. The tables next to the spectra include the output from the labeling algorithm. Figure 7 shows results obtained with this algorithm for metals found in data acquired in East St. Louis during February 2004, as total peak area due to specific elements as a function of time.
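The aggregation behind a plot like Figure 7, total peak area per element in 30-minute intervals, can be sketched as follows. The input format (a flat list of timestamp, element, peak area tuples) and the function names are hypothetical; the poster does not describe how the labeling output is stored.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bin_start(timestamp, minutes=30):
    """Round a datetime down to the start of its bin (minutes must divide 60)."""
    return timestamp - timedelta(
        minutes=timestamp.minute % minutes,
        seconds=timestamp.second,
        microseconds=timestamp.microsecond,
    )


def total_area_by_element(labeled_peaks, minutes=30):
    """Sum peak areas per (bin start, element).

    labeled_peaks: iterable of (timestamp, element, peak_area) tuples, a
    hypothetical flattened form of the labeling algorithm's output.
    """
    totals = defaultdict(float)
    for timestamp, element, area in labeled_peaks:
        totals[(bin_start(timestamp, minutes), element)] += area
    return dict(totals)


# Made-up labeled peaks for illustration only.
peaks = [
    (datetime(2004, 2, 1, 10, 5), "Fe", 1000.0),
    (datetime(2004, 2, 1, 10, 25), "Fe", 500.0),
    (datetime(2004, 2, 1, 10, 40), "Fe", 200.0),
    (datetime(2004, 2, 1, 10, 10), "Zn", 300.0),
]
totals = total_area_by_element(peaks)
for (start, element), area in sorted(totals.items()):
    print(start, element, area)
```

Each key pairs the start of a 30-minute interval with an element, so one time series per element can be read directly from the result.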

We have implemented a simpler and faster clustering algorithm (K-means) that works as well as ART-2a, the most commonly used clustering algorithm for single-particle mass spectrometry data. We have developed an algorithm to automatically label the ions represented by each peak in a mass spectrum, which can be used to look at individual spectra, or to look at aggregate data as a function of time. Future work will include extensions of this system to other data streams (e.g., AMS and other real-time aerosol measurements), optimization for analysis of large datasets, and implementation of real-time analysis capabilities.

Acknowledgements: This research is funded by NSF ITR grant IIS-0326328 (to UW-Madison and Carleton College) and by Carleton College.

References:
1. X.-H. Song, P. K. Hopke, D. P. Fergenson, and K. A. Prather, "Classification of Single Particles Analyzed by ATOFMS Using an Artificial Neural Network, ART-2A," Analytical Chemistry, vol. 71, pp. 860-865, 1999.
2. J. Han and M. Kamber, Data Mining: Concepts and Techniques: Morgan Kaufmann Publishers, 2000.
3. A. Gordon, Classification. London: Chapman and Hall/CRC Press, 1999; G. W. Milligan and M. C. Cooper, "An examination of procedures for determining the number of clusters in a data set," Psychometrika, vol. 50, pp. 159-179, 1985.
