![Page 1: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/1.jpg)
Automated Procedures for Improving the Accuracy of Sensor-Based
Monitoring Data
Rebecca Buchheit
AIS Lab
![Page 2: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/2.jpg)
Background
• sporadic use of KDD techniques in civil infrastructure
• relative youth of data mining research
• difficult to systematically apply KDD process
• KDD process tools (CRISP-DM) still under development
• KDD process highly domain dependent
• time consuming to teach data mining analysts domain knowledge
![Page 3: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/3.jpg)
Research Objectives
• develop a framework for systematically applying KDD process to civil infrastructure data analysis needs
  – set of guidelines for inexperienced analysts
  – checklist for more experienced analysts
• describe intersection of KDD process characteristics and civil infrastructure
  – what problems are well-suited to KDD?
  – what characteristics are unique to infrastructure?
![Page 4: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/4.jpg)
Summary
• increased data collection => increased need to intelligently analyze data
• KDD process as a “power tool” for analyzing data for high-level knowledge
• civil infrastructure problems are well-suited to data mining but will need to apply entire KDD process to get good results
• proposed framework will help researchers to systematically apply KDD process to their data analysis problems
![Page 5: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/5.jpg)
Data Quality
• What is it?
  – in this talk, “accuracy”
  – how close is the observed value to the true value?
  – “ground truth” is rare
  – look for anomalous patterns
• Why is it important?
  – poor quality data may taint analyses
  – patterns of poor quality data may overwhelm data mining/machine learning algorithms
![Page 6: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/6.jpg)
Mn/ROAD Data
• weigh-in-motion data
  – axle spacings and weights, speed, lane, error codes
• derived quantities
  – equivalent standard axle loads (ESALs)
  – FHWA vehicle type
  – gross vehicle weight
  – total vehicle length
• trucks only (type >= 4)
• Jan 1 ’98 to Dec 31 ’00
• about 3 million vehicles
courtesy Mn/ROAD
![Page 7: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/7.jpg)
Sample Data
![Page 8: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/8.jpg)
Overview of Approach
• use statistical analysis and data mining algorithms to separate anomalies from normal data
  – clustering
  – regression
  – physical constraints
  – statistical properties
• focus on differences between anomalies and normal data to help discover causation
![Page 9: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/9.jpg)
Clustering
• group data into “natural classes”
• anomalies separated from normal data
• used Autoclass clustering algorithm
![Page 10: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/10.jpg)
Clustering Results
![Page 11: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/11.jpg)
Regression
• confidence interval of 95%
• R-square (fit) = 0.923
• if error > 15% then identify as anomaly
∑ESAL = (3.531 ± 0.176) ∑vehicles − (1.252 ± 0.099) ∑axles + (0.066 ± 0.003) ∑GVW − (139.000 ± 79.813)
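The 15% rule above can be sketched as follows. This is a minimal illustration, not the talk's implementation: it uses only the point estimates of the fitted coefficients, and it assumes that the sums are taken over one aggregation period (for example, one day) and that "error" means the relative difference between the observed and predicted ∑ESAL. The function and parameter names are hypothetical.

```python
# Point estimates of the fitted coefficients, taken from the slide.
COEF_VEHICLES = 3.531
COEF_AXLES = -1.252
COEF_GVW = 0.066
INTERCEPT = -139.000


def predicted_esal_sum(n_vehicles, n_axles, gvw_sum):
    """Predicted sum of ESALs for one aggregation period."""
    return (COEF_VEHICLES * n_vehicles
            + COEF_AXLES * n_axles
            + COEF_GVW * gvw_sum
            + INTERCEPT)


def is_regression_anomaly(observed_esal_sum, n_vehicles, n_axles, gvw_sum,
                          tolerance=0.15):
    """Flag the period if the relative error exceeds the tolerance (15%).

    Assumption: error is measured relative to the predicted value.
    """
    predicted = predicted_esal_sum(n_vehicles, n_axles, gvw_sum)
    if predicted == 0:
        # Relative error is undefined; treat any non-zero observation as anomalous.
        return observed_esal_sum != 0
    relative_error = abs(observed_esal_sum - predicted) / abs(predicted)
    return relative_error > tolerance
```

With made-up inputs of 1,000 vehicles, 5,000 axles and a gross weight sum of 60,000, the prediction is about 1,092, so an observed sum of 1,100 passes while 1,500 is flagged.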
![Page 12: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/12.jpg)
Regression Results
![Page 13: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/13.jpg)
Binary Constraints (1)
| constraint | # violations (3,068,384 total) |
|---|---|
| offscale hit error | 61,129 (1.99%) |
| significant weight difference error | 11,107 (0.36%) |
| different axle counts error | 69,521 (2.27%) |
| tailgating | 10,211 (0.33%) |
| speed >= 64.37 km/h | 51,114 (1.86%) |
| speed <= 128.74 km/h | 3,723 (0.12%) |
![Page 14: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/14.jpg)
Binary Constraints (2)
| constraint | # violations (3,068,384 total) |
|---|---|
| gross weight <= 45,359 kg | 24,897 (0.81%) |
| length <= 22.86 m | 79,454 (2.59%) |
| unknown vehicle type | 190,191 (6.20%) |
| number of axles != 0 | 47 (0.00%) |
| number of axles <= 8 | 57,114 (1.86%) |
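The numeric constraints in the two tables can be checked per vehicle record along these lines. This is a sketch under assumptions: the record field names (`speed_kmh`, `gross_weight_kg`, `length_m`, `n_axles`) and the labels "no axles" and "too many axles" are hypothetical, and the rows that depend on sensor error codes or vehicle classification (offscale hit, weight difference, axle count mismatch, tailgating, unknown type) are left out because their encoding is not given here.

```python
# Each entry maps a violation label to a predicate that is True when the
# record SATISFIES the constraint. Limits are the ones listed in the tables.
CONSTRAINTS = {
    "slow speed": lambda v: v["speed_kmh"] >= 64.37,
    "high speed": lambda v: v["speed_kmh"] <= 128.74,
    "overweight": lambda v: v["gross_weight_kg"] <= 45359,
    "length over limit": lambda v: v["length_m"] <= 22.86,
    "no axles": lambda v: v["n_axles"] != 0,
    "too many axles": lambda v: v["n_axles"] <= 8,
}


def violated_constraints(vehicle):
    """Return the labels of all binary constraints the record violates."""
    return [label for label, satisfied in CONSTRAINTS.items()
            if not satisfied(vehicle)]


# Example with invented records.
ok_truck = {"speed_kmh": 100.0, "gross_weight_kg": 30000,
            "length_m": 18.0, "n_axles": 5}
odd_truck = {"speed_kmh": 50.0, "gross_weight_kg": 50000,
             "length_m": 25.0, "n_axles": 9}
print(violated_constraints(ok_truck))
print(violated_constraints(odd_truck))
```

Keeping the constraints in one table-like dictionary makes it straightforward to count violations per constraint over the whole dataset.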
![Page 15: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/15.jpg)
Constraint Interactions
| c1 | c2 | % interactions |
|---|---|---|
| slow speed | length over limit | 63.5% |
| length over limit | slow speed | 45.7% |
| tailgating | unknown type | 31.7% |
| high speed | unknown type | 28.7% |
| overweight | diff axle counts | 25.2% |
| tailgating | slow speed | 21.1% |
| tailgating | length over limit | 15.2% |
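One way to compute such an interaction percentage is sketched below. It assumes that "% interactions" is the share of vehicles violating c1 that also violate c2, which would explain why swapping c1 and c2 gives a different value; the slides do not define the measure, so treat this as an interpretation. Each vehicle is represented by the set of constraint labels it violates.

```python
def interaction_percent(violation_sets, c1, c2):
    """Percent of records violating c1 that also violate c2.

    violation_sets: iterable of sets of violated constraint labels,
    one set per vehicle record.
    """
    with_c1 = 0
    with_both = 0
    for violated in violation_sets:
        if c1 in violated:
            with_c1 += 1
            if c2 in violated:
                with_both += 1
    if with_c1 == 0:
        return 0.0
    return 100.0 * with_both / with_c1


# Invented example: 2 slow vehicles (1 also over length), 3 over-length vehicles.
records = [
    {"slow speed", "length over limit"},
    {"slow speed"},
    {"length over limit"},
    {"length over limit"},
    set(),
]
print(interaction_percent(records, "slow speed", "length over limit"))
print(interaction_percent(records, "length over limit", "slow speed"))
```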
![Page 16: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/16.jpg)
Distribution Constraints
• use a goodness-of-fit test to compare distributions from the same day of week
  – length
  – gross weight
  – ESALs
  – lane
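The slide does not say which goodness-of-fit test was used. As one possible illustration, the sketch below computes a Pearson chi-square statistic that compares one day's binned counts (for example, gross weight bins) against baseline counts from the same day of week. The binning, the function name and the choice of test are all assumptions.

```python
def chi_square_statistic(observed_counts, baseline_counts):
    """Pearson chi-square statistic of one day's histogram against a baseline.

    Expected counts are the baseline proportions scaled to the day's total.
    Assumption: bins with zero baseline count are skipped.
    """
    if len(observed_counts) != len(baseline_counts):
        raise ValueError("histograms must have the same number of bins")
    total_observed = sum(observed_counts)
    total_baseline = sum(baseline_counts)
    statistic = 0.0
    for observed, baseline in zip(observed_counts, baseline_counts):
        expected = total_observed * baseline / total_baseline
        if expected > 0:
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Invented counts in three gross-weight bins.
baseline = [50, 30, 20]
typical_day = [50, 30, 20]
unusual_day = [20, 30, 50]

# 5.991 is the chi-square critical value for 2 degrees of freedom at the 5% level.
CRITICAL_VALUE_DF2 = 5.991
print(chi_square_statistic(typical_day, baseline) > CRITICAL_VALUE_DF2)
print(chi_square_statistic(unusual_day, baseline) > CRITICAL_VALUE_DF2)
```

A day whose statistic exceeds the critical value is unlikely to have come from the baseline distribution and becomes a candidate anomaly.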
![Page 17: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/17.jpg)
Anomaly Identification
• identify days with higher than normal concentrations of binary constraint violations
• identify days that are not likely to have come from the baseline distributions for length, ESALs, gross weight and lane
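The first step, flagging days with an unusually high concentration of binary constraint violations, might look like the sketch below. The slides do not state how "higher than normal" was defined; here it is assumed to mean a daily violation rate more than three scaled median absolute deviations above the median rate. The function name and threshold are hypothetical.

```python
from statistics import median


def high_violation_days(daily_rates, k=3.0):
    """Return the days whose violation rate is unusually high.

    daily_rates: dict mapping a day label to the fraction of that day's
    vehicles that violate a given binary constraint.
    Assumption: "unusually high" means above median + k * 1.4826 * MAD.
    """
    rates = list(daily_rates.values())
    center = median(rates)
    mad = median(abs(rate - center) for rate in rates)
    threshold = center + k * 1.4826 * mad
    return sorted(day for day, rate in daily_rates.items() if rate > threshold)


# Invented daily rates; only the last day stands out.
rates_by_day = {
    "day 1": 0.020,
    "day 2": 0.021,
    "day 3": 0.019,
    "day 4": 0.020,
    "day 5": 0.150,
}
print(high_violation_days(rates_by_day))
```

A median-based threshold is used so that a few extreme days do not inflate the notion of "normal".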
![Page 18: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/18.jpg)
Binary Constraints Results
![Page 19: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/19.jpg)
Distribution Constraints Results
![Page 20: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/20.jpg)
A Quick Refresher
• used four different procedures to detect anomalies
  – clustering
  – regression
  – binary (physical) constraints
  – distribution constraints
• next up
  – what is causing the anomalies?
  – can we fix them?
![Page 21: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/21.jpg)
Gross Vehicle Weight
![Page 22: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/22.jpg)
Lane
![Page 23: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/23.jpg)
What Happened?
• two vehicles traveling slowly and close together (tailgating) may be recorded as a single vehicle
• lightweight vehicles are tailgating cars
  – cars not supposed to be in database
  – mis-classified because of tailgating
  – this causes the “high” vehicle counts
• very heavy vehicles are tailgating trucks
• lane 1 (right-hand side) data is missing for all “low” vehicle count days
![Page 24: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/24.jpg)
Can It Be Fixed? (1)
• removed all tailgating cars
  – lightweight
  – short
  – 2 or 3 axles
  – error code
• “halved” all tailgating trucks
  – very long
  – very heavy
  – more than 9 axles
  – error code
![Page 25: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/25.jpg)
Can It Be Fixed? (2)
• inserted lane 1 vehicles from same time period in 2000
• “shifted” days to make sure day of week was constant
  – Tuesday Sept 8 1998 => Tuesday Sept 5 2000
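The day shift can be sketched with the standard `datetime` module. The exact rule used in the talk is not given; this version assumes the replacement day is the date in the target year nearest to the same calendar date that falls on the same weekday, which reproduces the example above. The function name is hypothetical.

```python
from datetime import date, timedelta


def shift_to_same_weekday(source, target_year):
    """Nearest date in target_year with the same weekday as source.

    Assumption: source is not Feb 29 mapped into a non-leap year
    (date.replace would raise ValueError in that case).
    """
    candidate = source.replace(year=target_year)
    # Offset in days, folded into the range -3..+3.
    offset = (source.weekday() - candidate.weekday()) % 7
    if offset > 3:
        offset -= 7
    return candidate + timedelta(days=offset)


print(shift_to_same_weekday(date(1998, 9, 8), 2000))  # 2000-09-05
```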
![Page 26: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/26.jpg)
Summary
• statistical analysis and data mining algorithms can be used to detect systematic anomalies in data
  – focus on differences between anomalies and normal data to help discover causation
  – need domain knowledge to understand causation
![Page 27: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/27.jpg)
Current Progress/Future Work
• integrate algorithms into data quality assessment program == automation
  – physical constraints
  – distribution constraints
  – other statistical characteristics of data
  – clustering
  – regression, neural networks
• will support infrastructure-related data collection activities
• use algorithms to identify and “clean” anomalies
![Page 28: Automated Procedures for Improving the Accuracy of Sensor-Based Monitoring Data](https://reader035.vdocuments.site/reader035/viewer/2022070414/56814e44550346895dbbbcf2/html5/thumbnails/28.jpg)
Acknowledgements
• Minnesota Department of Transportation, especially Maggi Chalkline
• based upon work supported by the National Science Foundation, under Grant Numbers 9987871 and DGE 9553380