alignment and preprocessing for data analysisdepts.washington.edu/cpac/activities/meetings/... ·...

Post on 22-Mar-2020

38 Views

Category:

Documents

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Alignment and Preprocessing for Data Analysis

• Preprocessing tools for chromatography

• Basics of alignment

• GC‐FID (1D) data and issues– PCA– F‐Ratios

• GC‐MS (2D) data and issues– PCA– F‐Ratios– PARAFAC

• Piecewise Alignment GUI (available online)– synoveclab.chem.washington.edu/Downloads.htm

– Email for username/password

Tools for Analysis: Classification

• Principal Component Analysis (PCA)

• Degree of Class Separation (DCS)

,

22

A B

A B

DCSD

s s

2 2

, ( ) ( )A B A B A BX X Y YD DCS

Scor

e PC

2

Score PC1

DCS

PCA Loadings`

Retention Time

Retention Time

Sign

al

Scor

e PC

2

Score PC1

ScoresX + Error

Why Align?

• Reduction in classification– PCA

• Increase in uncertainty for quantification– PARAFAC

• Misalignment occurs frequently– Daily instrument variation causes misalignment

– Correction is necessary to apply these methods

Retention Time Precision & PCA

DCS=32

DCS=3

PCA

PCA

10 Replicates withlimited shifting

10 Replicates withlarge shifting

Loadings

Retention Time

Retention Time

Retention Time

Retention Time

Sign

alSi

gnal

__

Scor

e PC

2

Score PC1

Score PC1

Scor

e PC

2

Loadings

Scores

Scores

Basics of Alignment

• Types of alignment algorithms– Cross correlation coefficient– Correlation Optimized Warping (COW)

– *Piecewise alignment

• Alignment Parameters– Window Size

– Shift

• Target Selection– PCA– Correlation Coefficient– Windowed Target

Alignment Algorithms

• Cross correlation coefficient– Move the entire chromatogram to maximize correlation

• Correlation Optimized Warping (COW)– Separate the chromatogram into windows

– Warp and move the windows to optimize the correlation

– Find the best alignment path to correct the data

• *Piecewise alignment– Separate the chromatogram into windows

– Shift the windows to optimize the correlation

– Find the best alignment path to correct the data

Alignment Parameters

Parameter

Window Size, W

Shift, L

Description EffectToo Small

Relative movement

high, difficult to determine

quality of alignment

Insufficient movement of

segments

EffectToo Large

Insufficient flexibility to correct peak

to peak shifting

Increases time

Window Size ~1 pk

width

0<shift<=maximum shift

Determining correct values

Alignment Metric

Alignment Metric

L  tR

Target Selection

• *Global Approach– All chromatograms are initially collected

– *User chosen target– PCA optimized target

• Scores are produced for every sample

• The sample with the minimum distance from the center is the target

– *Maximum correlation target• Calculate the product of each chromatograms correlation to the others

• The maximum correlation is the target

• Online Approach– An initial target is set– Chromatograms are aligned as they are collected

– The target changes as new chromatograms are collected

Target Selection (Maximum Correlation)

0 2 4 6 8 10 120

10

20

30

40

50

60

8.05 8.1 8.15 8.2

0

1

2

3

4

5

6

7

8

9

10

0 5 10 15 200.1

0.15

0.2

0.25

0.3

0.35

0.4

0.45

0.5

0.55

0.6Sample # 9 is the target

8.05 8.1 8.15 8.2

0

2

4

6

8

10

Type 1Type 2Type 3

TargetType 1Type 2Type 3

Retention Time, min

Sign

al

Retention Time, min

Sign

al

Sample #

Pro

duct

Cor

rela

tion

Alignment of GC‐FID (1D) Data and Issues

• Removal of artifacts and solvent peaks

• Baseline correction and normalization

• Alignment

• Improving PCA– F‐Ratio

Preprocessing Tools for Chromatography

• Noise filtering– Median filter

• Baseline correction• Normalization

• Alignment

Experimental

• GC‐FID Separation• 4 diesel sample types

• 9 minute separation

• 17 replicate injections over 5 days

7 7.2 7.4 7.6 7.8

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

Noise Filtering (Solvent, Spikes, and Outliers)

2 4 6 8 10

0

2

4

6

8

10

12

14

7 7.2 7.4 7.6 7.80.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2

Median Filtered Data 

(minimize spike)

Retention Time, min

Sign

al

GC-FID Separation

Sign

alSi

gnal

Retention Time, min

Noise Spike

Noise Spike

Noise Filtering (Solvent, Spikes, and Outliers)

0 2 4 6 80

2

4

6

8

10

12

0.8 1 1.2 1.4 1.6 1.8 2 2.2x 10

9

-2.5

-2

-1.5

-1

-0.5

0

0.5

1

1.5 x 108

Outlier (low signal)

Solvent Dominated Info

Retention Time, min

Sign

al

GC-FID Separation

PC1PC

2

Scores Plot

Solvent

Noise Filtering (Solvent, Spikes, and Outliers)

0 2 4 6 8 100

2

4

6

8

10

12

14

1.5 2 2.5 3 3.5 4 4.5 5x 10

7

-3

-2

-1

0

1

2

3

4

5 x 106

Outlier (Low Signal)

Outlier Low Signal

Retention Time, min

Sign

al

GC-FID Separation

PC1PC

2

Scores Plot

0 2 4 6 8 100

2

4

6

8

10

12

14

Noise Filtering (Solvent, Spikes, and Outliers)

No Outlier

3.4 3.6 3.8 4 4.2 4.4 4.6x 10

7

-3

-2

-1

0

1

2

3

4

5 x 106

No Outlier (Baseline / 

Normalization on PC1)

Retention Time, min

Sign

al

GC-FID Separation

PC1PC

2

Scores Plot

Noise Filtering (Solvent, Spikes, and Outliers)

6.2 6.3 6.4 6.5 6.6 6.7 6.8x 10

-3

-4

-2

0

2

4

6

8 x 10-4

0 2 4 6 8 10-0.5

0

0.5

1

1.5

2

2.5 x 10-9

Tight Clustering in PCA

Baseline Corrected and Normalized Data

Subtract offset line from data

Then divide by total signal

Retention Time, min

Nor

mal

ized

Sig

nal

PC1PC

2

Scores Plot

Experimental

• GC Separation• 3 gasoline sample types

• 15 minute separation

• 6 replicate injections over 2 days• Misalignment is due to day to day instrument 

variation

Alignment GUI

Initially blank to guide the user through alignment

Load from workspace

Load from file

Alignment GUI

Zoom & Pan

Choose or Estimate TargetChoose or Estimate Parameters

Load & View Data Sizes

Align Data

Alignment GUI

Now save data to 

workspace or .mat file

Data after alignment

Data before 

alignment

Estimated 

target and 

parameters

Alignment GUI

PCA Classification of Gasoline

0.038 0.0385 0.039 0.0395 0.04 0.0405 0.041-5

-4

-3

-2

-1

0

1

2

3 x 10-3

8.05 8.1 8.15 8.2 8.25 8.3 8.35

0

0.5

1

1.5

2

2.5

3

x 10-8

We want better 

separation of groups

Retention Time, min

Sign

al

PC1PC

2

Scores Plot

Sample 1

Sample 2

Col 1

Sample 3

GC‐MS 

Separations

Fisher Ratio Method (F‐Ratio)

Works well for samples that have large amounts of within 

class variance

Works best when comparing a small number of sample 

classes

For each mass channel calculate Fisher Ratio at each point in 2D

space,  

Fisher Ratio =2

2b tw c la s s

w i th in c la s s

Fisher Ratios

Col 1

0 5 10 150

20

40

60

80

100

120

140

160

180

200

0 5 10 150

100

200

300

400

500

600

700

800

Fisher Ratio Method (F‐Ratio)

8.05 8.1 8.15 8.2 8.25 8.3 8.35

0

0.5

1

1.5

2

2.5

3

x 10-8

No alignment

After alignment

Select Top Ten

Retention Time, min

Sign

al

Selected Chromatographic Region

Retention Time, min

F-R

atio

Retention Time, min

F-R

atio

0 5 10 15-1

0

1

2

3

4

5

6

7 x 10-8

Fisher Ratio Method (F‐Ratio)

5 6 7 8 9 10x 10

-3

-1

-0.5

0

0.5

1

1.5

2

2.5 x 10-3

Retention Time, min

Sign

al

PC1PC

2

Scores Plot

Regions of interest

Tight clustering of sample types

Summary

• Removal of solvents or artifacts is essential

• Baseline correction is an important step

• Alignment is essential for improving classification

• The F‐Ratio algorithm can further improve classification

Alignment of GC‐MS (2D) Data and Issues

• Removal of artifacts and solvent peaks

• Baseline correction and normalization

• Alignment

• Improving PCA– F‐Ratio

• PARAFAC

Experimental

• GC‐MS Separation

• 3 gasoline sample types

• 15 minute separation

• 6 replicate injections over 2 days• Misalignment is due to day to day instrument 

variation

Alignment GUI

Initially blank to guide the user through alignment

Load from workspace

Load from file

Alignment GUI

Zoom & Pan

Choose or Estimate TargetChoose or Estimate Parameters

Load & View Data Sizes

Align Data

Alignment GUI

Now save data to 

workspace or .mat file

Data after alignment

Data before 

alignment

Estimated 

target and 

parameters

Alignment GUI

0 2 4

5

10

15

20

Alignment

TIC

Retention Time

Retention Time

ShiftAlignment

Total Ion Current Shift Function (TIC‐SF)SampleTarget

0 2 4

2

4

6

8

10

12

Retention Time

Shift

Total Ion Current Shift Function (TIC‐SF)

GC‐MS Data

Retention Time

Apply 

alignment

0 2 4

2

4

6

8

10

12

Retention Time

SampleTarget

GC‐MS Data

Alignment

7.1 7.15 7.2 7.25 7.30

0.5

1

1.5

2

2.5

x 10-8

7.1 7.15 7.2 7.25 7.30

0.5

1

1.5

2

2.5

x 10-8

0 5 10 15-1

0

1

2

3

4

5

6

7 x 10-8

Align

Alignment

Retention Time, min

TIC

Sig

nal

TIC

Sig

nal

Retention Time, min

TIC

Sig

nal

7.05 7.1 7.15 7.2 7.250

0.5

1

1.5

2

2.5

3x 10

-8

0.013 0.0135 0.014 0.0145 0.015-2

-1.5

-1

-0.5

0

0.5

1

1.5x 10-3

PCA w/ full MS

Classification of Full MS Data

Retention Time, min

TIC

Sig

nal

PC1PC

2

Scores Plot

Sample 1m/z

Col 1

m/z

Col 1

Sample 2

m/z

Col 1

Sample 3

Fisher Ratios 

for Samples

m/z

Col 1

GC‐MS 

Separations

Fisher Ratio Method

Works well for samples that have large amounts of within 

class variance

Works best when comparing a small number of sample 

classes

For each mass channel calculate Fisher Ratio at each point in 2D

space,  

Fisher Ratio =2

2b tw c la s s

w i th in c la s s

Sum Fisher Ratios along mass 

channel dimension

Col 1

Simulated Example Using F‐Ratios for PCA

PCA

W = 400L = 1000 Align

Retention Time

m/z

Scores 2D F‐Ratios

Shows differences 

between all samples

TICTIC

F-Ratios

GC-M

S Da

ta

4.8 5 5.2 5.4 5.640

50

60

70

80

90

100

100

200

300

400

500

600

700

0 5 10 1540

60

80

100

120

140

100

200

300

400

500

600

700

0 5 10 15-1

0

1

2

3

4

5

6

7 x 10-8

F‐Ratios

Top Fisher Ratios

GC‐MS Fisher Ratios

GC‐MS Fisher Ratios

GC‐MS Separation

Retention Time, min

TIC

Sig

nal

Procedure

• Select masses using mass spectral information

• Select time regions using F‐Ratios

• Combine to reduce the data set

• Improve results of PCA

0 5 10 15-1

0

1

2

3

4

5

6

7 x 10-8

Classification of Full MS Data

1.8 1.9 2 2.1 2.2 2.3 2.4x 10

-3

-2.5

-2

-1.5

-1

-0.5

0

0.5

1

1.5x 10-4

Clear Separation on 1st

PC

PC1P

C2

Scores Plot of Nearest Samples

Retention Time, min

Sign

al

Regions of interest (with MS values)

Quantification of GC‐MS Data

• Use aligned, baseline corrected and  normalized data

• Use PARAFAC of small regions for analysis– Match values to mass spectra

– Peak Sums

– Peak Profiles

Quantification

• Target analyte Parallel Factor Analysis (PARAFAC) isolates the 

pure component peak and mass spectral information from 

overlapping peaks and background for both identification

and 

quantification.

Data Matrix = Analyte 1 + Analyte 2 + ... Noisex x

+ +…+=

1 2

y1 y2

z1 z 2

R E

Compound : malate, TMS# of factors: 3Chosen comp.: #3Match value : 948Peak sum : 4413882Peak height : 376188

PARAFAC Results

Col. 1 Time Col. 2 Time

Intensity

Intensity

1.0 1.2 1.4 1.6

0

2

4

6

8

PARAFAC for GC‐MS Data

0 2 4

2

4

6

8

10

12

14

0 2 4

2

4

6

8

10

12

Stack Samples

Retention Time

Sample

Run PARAFAC

Peak Profile

Mass Spectrum

Analyte Concentration

Sample 1

Sample 2

Sample 1

Sample 2

Experimental

• GC‐MS Separation

• 3 gasoline sample types

• 15 minute separation

• 6 replicate injections over 2 days• Misalignment is due to day to day instrument 

variation

• Looking for isobutyl benzene in the gasoline

Isobutyl Benzene Spectrum

20 40 60 80 100 120 140 1600

10

20

30

40

50

60

70

80

90

100

Mass

Sign

al

7.04 7.06 7.08 7.1 7.12 7.14 7.16 7.180

1

2

3

4

5

6 x 10-9

0 5 10 15 20 25 30 350

1

2

3

4

5

6

7

8

9 x 10-3

0 5 10 15 200

0.5

1

1.5

2

2.5

3

3.5 x 10-3

20 40 60 80 100 120 140 1600

1

2

3

4

5

6

7 x 10-5

PARAFAC Results for GC‐MS Data

Retention Time, min

TIC

Sig

nal

Quantitative Info

Qualitative Info

Qualitative Info

Mass Spectrum

Peak Profile

Peak Sum

Selected Analysis Region for PARAFAC 1

2

3

1

2

3

1

20 40 60 80 100 120 140 1600

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

20 40 60 80 100 120 140 1600

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

20 40 60 80 100 120 140 1600

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

MVIBB

=403 MVIBB

=788 MVIBB

=910

PARAFAC Results for GC‐MS Data

Component 1 Component 2 Component 3

Best match to isobutyl benzene

Summary

• Mass spectral chromatographic data should  be aligned

• An available alignment algorithm is able to  align 1‐D and 2‐D data

• F‐Ratio methods can be used to improve  classification

• PARAFAC can be effectively used to separate  overlapped analytes

top related