The urban wastewater treatment
Yuwu Chen
Department of Chemical Engineering
12/4/2014
Introduction
Wastewater treatment is the process of removing contaminants from wastewater
Water quality index
Chemical oxygen demand (COD): the amount of oxygen consumed when a strong
chemical oxidizing agent breaks down the organic material present in a given
water sample at a certain temperature over a specific time period.
Biological oxygen demand (BOD): the amount of dissolved oxygen needed by
aerobic biological organisms to break down the organic material
present in a given water sample at a certain temperature over a specific time
period.
Both indirectly measure the amount of organic compounds in water. COD and
BOD should be correlated.
Suspended solids (SS)
Volatile suspended solids (SSV)
Sediments (SED)
Inorganic elements (N-NH3, P, S, etc.)
pH
These directly measure the amount of a specific contaminant in water.
Data Description
The dataset comes from daily sensor measurements in an urban wastewater treatment
plant.
The data were collected by Manel Poch at Universitat Autonoma de Barcelona, Bellaterra,
Barcelona, Spain.
The full dataset was donated by Javier Bejar and Ulises Cortes at Universitat Politecnica
de Catalunya, Barcelona, Spain, and is available at:
http://archive.ics.uci.edu/ml/machine-learning-databases/water-treatment/
Date
In dd/mm/yy format: 1/1/90 to 30/10/91. Some days in this period are not
included.
Water volume
The daily flow volume to the plant in m3: 10005 to 60081
Water quality index (28 variables)
Water quality indices were recorded before and/or after a process step:
BOD, COD, SS, SSV, SED ...
Performance (9 variables)
Performance variables were calculated directly from the water quality indices. They
can be used to evaluate the performance of each process unit. Range: 0.6% to 100%
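The slides do not give the formula behind the performance variables. As a minimal sketch, assuming they follow the usual percent-removal definition (inlet minus outlet, divided by inlet), the calculation might look like this. The function name and the readings are hypothetical.

```python
def removal_percent(inlet: float, outlet: float) -> float:
    """Percent of a contaminant removed between inlet and outlet.

    Assumption: performance = (inlet - outlet) / inlet * 100.
    The slides do not state the formula used in the dataset.
    """
    if inlet <= 0:
        raise ValueError("inlet value must be positive")
    return (inlet - outlet) / inlet * 100.0


# Hypothetical BOD readings before and after one process unit.
bod_in, bod_out = 200.0, 50.0
print(removal_percent(bod_in, bod_out))  # 75.0
```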
Data Management
Data transformation
The original variable “date” is a character string and too long, so I transform it into
a day-number variable “day”:
date          day
1/1/1990      1
2/1/1990      2
……
30/10/1991    668
Then rename the rows of the data frame with the variable day.
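The date-to-day mapping above is consistent with counting calendar days from 1/1/1990, so skipped days leave gaps in the numbering. A minimal sketch of that conversion follows; the function name `day_index` is illustrative and not taken from the original analysis.

```python
from datetime import date, datetime

START = date(1990, 1, 1)  # first day in the dataset, mapped to day 1


def day_index(text: str) -> int:
    """Map a dd/mm/yyyy string to its day number, with 1/1/1990 as day 1."""
    d = datetime.strptime(text, "%d/%m/%Y").date()
    return (d - START).days + 1


print(day_index("1/1/1990"))    # 1
print(day_index("2/1/1990"))    # 2
print(day_index("30/10/1991"))  # 668
```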
Correct the wrong format in the variable BOD.in3
Subset data
In this study, five water quality indices of the influent/effluent were used: pH, COD, BOD,
SS, SED.
Omit the missing values in each subset
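The subsetting and missing-value step can be sketched as follows. The rows, the column names such as `pH.in1`, and the use of "?" as the missing-value marker are all assumptions made for illustration; only the idea of keeping complete cases comes from the slide.

```python
def complete_cases(rows, columns, missing="?"):
    """Keep the chosen columns and drop any row with a missing value in them."""
    kept = {}
    for row in rows:
        values = [row[c] for c in columns]
        if missing in values:
            continue  # omit the whole day if any selected index is missing
        kept[row["day"]] = [float(v) for v in values]
    return kept


# Hypothetical rows; "?" as the missing-value marker is an assumption.
rows = [
    {"day": 1, "pH.in1": "7.8", "COD.in1": "407", "BOD.in1": "?", "SS.in1": "166", "SED.in1": "4.5"},
    {"day": 2, "pH.in1": "7.6", "COD.in1": "443", "BOD.in1": "182", "SS.in1": "214", "SED.in1": "5.5"},
    {"day": 3, "pH.in1": "7.9", "COD.in1": "?", "BOD.in1": "150", "SS.in1": "172", "SED.in1": "4.0"},
    {"day": 4, "pH.in1": "7.7", "COD.in1": "390", "BOD.in1": "160", "SS.in1": "180", "SED.in1": "3.5"},
]
columns = ["pH.in1", "COD.in1", "BOD.in1", "SS.in1", "SED.in1"]

influent1 = complete_cases(rows, columns)
print(sorted(influent1))  # [2, 4]
```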
[Diagram: influent1 >> Pretreatment >> influent2 >> Primary >> influent3 >> Secondary >> effluent]
Data Summary
Paired plot example: influent1 (influent to the pretreatment unit)
Method Description
Step 1: Principal component analysis (PCA) on each influent/effluent subset
Visualize the data to see the relationships among the observations and
variables in low dimensions
Step 2: Clustering days based on the daily performance
Identify subgroups of similar days based on the daily performance of each
process unit or the whole plant
Step 1: Principal component analysis (PCA) on influent1 subset
Principal component loading vector of influent1 (influent to the pretreatment unit)
Proportion of variance explained (PVE) by each PC and cumulative PVE
[Figure: two plots against Principal Component (1 to 6); y-axes: Proportion of Variance Explained and Cumulative Proportion of Variance Explained (0.0 to 1.0)]
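As a rough illustration of how PVE and cumulative PVE are obtained, each component's variance is divided by the total variance and the shares are then summed in order. The variances below are hypothetical and are not the values from this study.

```python
from itertools import accumulate


def pve(variances):
    """Proportion of variance explained by each principal component."""
    total = sum(variances)
    return [v / total for v in variances]


# Hypothetical variances of six principal components (not the study's values).
variances = [3.0, 1.5, 0.75, 0.5, 0.125, 0.125]

shares = pve(variances)
cumulative = list(accumulate(shares))

for k, (p, c) in enumerate(zip(shares, cumulative), start=1):
    print(f"PC{k}: PVE = {p:.3f}, cumulative PVE = {c:.3f}")
```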
Step 1: Principal component analysis (PCA) on influent1 subset
Biplot for influent1
Step 1: Principal component analysis (PCA) on other three influent/effluent subsets
Biplots for the other three influent/effluent subsets
[Figure: three biplots (PC1 vs. PC2), observations labeled by day number.
Influent2 (pretreatment >> primary): variables volume, pH.in2, BOD.in2, SS.in2, SED.in2
Influent3 (primary >> secondary): variables volume, pH.in3, BOD.in3, COD.in3, SS.in3, SED.in3
Effluent (out of plant): variables volume, pH.out, BOD.out, COD.out, SS.out, SED.out]
Step 2: Clustering days based on the daily performance
What dissimilarity measure should be used to cluster the days?
If Euclidean distance is used, then days on which the process unit/the whole plant
has similar overall performance will be clustered together (yes, this is
desirable).
If correlation-based distance is used, then days with similar “preferences” (e.g.
days with better BOD and COD performance but worse SS and SED
performance) will be clustered together, even if some days with these
“preferences” had better overall performance than others.
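The difference between the two dissimilarity measures can be seen on a toy example. The three daily performance profiles below are hypothetical, and correlation-based distance is taken here as 1 minus the Pearson correlation, which is one common choice and an assumption about what was meant.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two performance profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def correlation_distance(a, b):
    """1 minus the Pearson correlation between two performance profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / (sd_a * sd_b)


# Hypothetical profiles: BOD, COD, SS, SED performance in %.
day_a = [90.0, 85.0, 60.0, 55.0]  # good overall, BOD/COD better than SS/SED
day_b = [60.0, 55.0, 30.0, 25.0]  # same pattern as day_a, 30 points worse
day_c = [88.0, 84.0, 63.0, 57.0]  # overall level close to day_a

print(euclidean(day_a, day_b), euclidean(day_a, day_c))
print(correlation_distance(day_a, day_b), correlation_distance(day_a, day_c))
```

Under Euclidean distance day_a is far from day_b and close to day_c. Under correlation-based distance day_a and day_b are essentially identical despite the 30-point gap in overall performance.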
Scale to unit variance or not?
The data must be scaled; otherwise the water volume will dominate.
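A short sketch of why scaling matters: volume is measured in the tens of thousands while pH varies by less than one unit, so unscaled distances are driven almost entirely by volume. The sketch centers each column and divides by its sample standard deviation. The middle volume and the pH values are hypothetical.

```python
import statistics


def scale(column):
    """Center a column and divide by its sample standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)
    return [(x - mean) / sd for x in column]


# Volume endpoints come from the data description; other values are hypothetical.
volume = [10005.0, 35000.0, 60081.0]
ph = [7.3, 7.8, 8.1]

print(statistics.stdev(volume), statistics.stdev(ph))  # very different spreads
print(scale(volume))
print(scale(ph))  # both columns now have unit variance
```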
Hierarchical clustering will be used.
K-means or K-medoids?
K-medoids is more robust than K-means in the presence of outliers.
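One way to see the robustness claim: a K-means center is a mean, which an outlier can drag away from the bulk of the cluster, while a K-medoids center must be an actual observation. The sketch below compares the two centers for a single hypothetical cluster. It is not a full K-medoids implementation.

```python
import math


def mean_point(points):
    """Coordinate-wise mean, the cluster center used by K-means."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def medoid(points):
    """Observation with the smallest total distance to all other observations."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))


# Hypothetical cluster: four similar days and one outlier.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.0), (10.0, 10.0)]

print(mean_point(points))  # pulled toward the outlier
print(medoid(points))      # stays inside the bulk of the cluster
```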
Hierarchical clustering: Average linkage
[Figure: dendrogram of days, average linkage; leaves labeled by day number; y-axis: Height (0 to 12)]
Hierarchical clustering: Complete linkage
[Figure: dendrogram of days, complete linkage; leaves labeled by day number; y-axis: Height (0 to 15)]
Hierarchical clustering: Single linkage
[Figure: dendrogram of days, single linkage; leaves labeled by day number; y-axis: Height (0 to 10)]
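The three linkages differ only in how the distance between two clusters is summarized from the pairwise distances between their members: single takes the minimum, complete the maximum, and average the mean. A minimal sketch with two hypothetical clusters:

```python
import math
from itertools import product


def linkage_distance(cluster_a, cluster_b, method):
    """Distance between two clusters under single, complete or average linkage."""
    pairwise = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage method: {method}")


# Hypothetical clusters of two observations each.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

for method in ("single", "average", "complete"):
    print(method, linkage_distance(cluster_a, cluster_b, method))
```

Since complete linkage uses the largest pairwise distance, its merge heights are at least as large as those of the other two linkages on the same data, which is consistent with the taller height axis in the complete-linkage dendrogram.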
K-medoids clustering
[Figure: clusplot(pam(x = sdata, k = k, diss = diss)); axes: Component 1, Component 2. These two components explain 80.33 % of the point variability.]
Silhouette plot of pam(x = sdata, k = k, diss = diss)
n = 430, 2 clusters. Average silhouette width: 0.37
cluster j   size nj   average silhouette width
1           149       -0.01
2           281       0.57
[Figure: clusplot(pam(x = globalscale2, k = 3)); axes: Component 1, Component 2. These two components explain 80.33 % of the point variability.]
Silhouette plot of pam(x = globalscale2, k = 3)
n = 430, 3 clusters. Average silhouette width: 0.19
cluster j   size nj   average silhouette width
1           91        -0.17
2           157       0.25
3           182       0.32
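The overall average silhouette width is the size-weighted mean of the per-cluster averages. The short check below uses the rounded cluster sizes and widths reported above and reproduces both overall values. The function name is illustrative.

```python
def overall_silhouette(clusters):
    """Size-weighted mean of per-cluster average silhouette widths."""
    n = sum(size for size, _ in clusters)
    return sum(size * width for size, width in clusters) / n


# (cluster size, average silhouette width) as reported on the slides.
two_clusters = [(149, -0.01), (281, 0.57)]
three_clusters = [(91, -0.17), (157, 0.25), (182, 0.32)]

print(round(overall_silhouette(two_clusters), 2))    # 0.37
print(round(overall_silhouette(three_clusters), 2))  # 0.19
```

By this measure the two-cluster solution separates the days better than the three-cluster one, although cluster 1 has an average width near or below zero in both solutions.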
Conclusion
The water quality indices and flow volume of the influent/effluent
have been visualized by PCA to show the relationships
among the observations and variables in low dimensions.
Clustering methods have been used to identify subgroups
of similar days.