CleanAir: Air Quality Dashboard
Puspal Hore
Insight Data Engineering Fellow
Air Pollution
• Air Pollution is a Health Hazard
• Many Types of Pollutants
–Ozone
–Particulate matter (pm2.5, pm10)
–NO2
–SO2
–CO
• Sensitive Groups: Children, Elderly, Patients with Lung and Heart Disease etc.
Data Source: EPA
• EPA tracks air quality
• Monitoring stations across the country
• Hourly and daily data feeds are available as flat files, with a delay (at present, only data up to 2015 is available)
AQI
• Air Quality Index
• Composite Index based on
–Ozone, pm2.5, pm10, NO2, SO2, CO
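The slide does not show how the index is computed. As a rough sketch, EPA's published method interpolates linearly inside a breakpoint band to get each pollutant's sub-index, then reports the highest sub-index as the AQI. The breakpoints below are the 2012 PM2.5 (24-hour) table and should be checked against the current EPA table before reuse. The function names are illustrative, and rounding the concentration to one decimal is a simplification.

```python
# Sketch of the EPA AQI calculation, shown for PM2.5 only.
# Breakpoints: 2012 EPA table for 24-hour PM2.5; verify before reuse.
PM25_BREAKPOINTS = [
    # (conc_low, conc_high, aqi_low, aqi_high), concentrations in ug/m3
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def sub_index(concentration, breakpoints):
    """Linearly interpolate a pollutant concentration to its AQI sub-index."""
    c = round(concentration, 1)  # simplification: round to the table's precision
    for c_lo, c_hi, i_lo, i_hi in breakpoints:
        if c_lo <= c <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (c - c_lo) + i_lo)
    raise ValueError(f"concentration {concentration} is outside the table")


def composite_aqi(sub_indices):
    """The reported AQI is the highest of the per-pollutant sub-indices."""
    return max(sub_indices.values())


if __name__ == "__main__":
    pm25 = sub_index(12.0, PM25_BREAKPOINTS)
    # The ozone value here is a made-up sub-index, for illustration only.
    print(composite_aqi({"pm2.5": pm25, "ozone": 42}))
```

Other pollutants would need their own breakpoint tables and averaging periods, which are not covered here.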
Data
• Location Data : State, County, Site
• Time Data : Date, Hour, Minute
• Value : AQI, individual pollutant concentration
Data Volume
• Daily data : 52 MB/ year for each pollutant
• Hourly data : 1.3 – 3 GB / year / pollutant
• 2 – 6 million rows
Data Pipeline
Simulated Air Pollution Data
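The deck does not say how the data was simulated. One possible approach, sketched here under assumptions, is a daily cycle plus random noise for each monitoring site. The function name, the late-afternoon peak, and every default parameter are hypothetical choices, not the project's actual generator.

```python
import math
import random
from datetime import datetime, timedelta


def simulate_hourly_aqi(state, county, site, start, hours,
                        base=60.0, amplitude=25.0, noise=8.0, seed=None):
    """Generate synthetic hourly AQI readings for one site.

    Assumed model: base level + 24-hour cosine cycle + Gaussian noise,
    clamped to the 0-500 AQI range.
    """
    rng = random.Random(seed)
    readings = []
    for h in range(hours):
        ts = start + timedelta(hours=h)
        # Assumption: the daily cycle peaks at 17:00.
        diurnal = amplitude * math.cos(2 * math.pi * (ts.hour - 17) / 24)
        value = base + diurnal + rng.gauss(0, noise)
        readings.append({
            "state": state,
            "county": county,
            "site": site,
            "time": ts,
            "aqi": max(0, min(500, round(value))),
        })
    return readings


if __name__ == "__main__":
    # Site identifiers below are placeholders.
    for row in simulate_hourly_aqi("NJ", "Essex", "0003",
                                   datetime(2016, 1, 1), hours=3, seed=1):
        print(row)
```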
Demo: https://youtu.be/6ILrpyf8zPQ
Challenges: Forecasting
• Poorer forecasting support in Python than in R
• SparkR vs Python / rpy2
• Python / Statsmodels.tsa
• Cloudera Spark-TS
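For context on the kind of forecast involved, here is a dependency-free baseline: simple exponential smoothing in plain Python. It uses none of the libraries listed above and is not the project's method. The function name and the default `alpha` are assumptions for illustration.

```python
def exponential_smoothing_forecast(values, alpha=0.3):
    """One-step-ahead forecast by simple exponential smoothing.

    level_t = alpha * x_t + (1 - alpha) * level_{t-1}
    The forecast for the next step is the final level.
    """
    if not values:
        raise ValueError("need at least one observation")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = float(values[0])
    for x in values[1:]:
        level = alpha * x + (1 - alpha) * level
    return level


if __name__ == "__main__":
    hourly_aqi = [52, 55, 61, 70, 68, 64]  # made-up readings
    print(exponential_smoothing_forecast(hourly_aqi, alpha=0.5))
```

A baseline like this ignores daily seasonality, which is one reason the seasonal models in R or Statsmodels are attractive for hourly pollution data.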
About Me
• Electrical Engineer, IIT, India
• DBA, Systems Engineer
• MD, Rutgers NJMS
• Books, Travel, Photography
InfluxDB
• Time Series DB with clustering
• CRud: create and read are prioritized over update and delete
• SQL-like query language over an HTTP API with JSON responses
• Continuous Queries and Downsampling
• Built-in data retention policy
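As a sketch of how a reading could be written to InfluxDB, the helper below formats one point in the line protocol (`measurement,tags fields timestamp`). The measurement, tag, database, and policy names are hypothetical, and the two InfluxQL statements for downsampling and retention should be checked against the documentation for the InfluxDB version in use.

```python
def _escape(text):
    """Escape commas, equals signs and spaces in tag/field keys and tag values."""
    return (str(text).replace(",", "\\,")
                     .replace("=", "\\=")
                     .replace(" ", "\\ "))


def _format_field(value):
    """Render a field value: booleans, integers (i suffix), floats, strings."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    return '"' + str(value).replace('"', '\\"') + '"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line-protocol point; timestamp is nanoseconds since the epoch."""
    name = str(measurement).replace(",", "\\,").replace(" ", "\\ ")
    tag_part = "".join(f",{_escape(k)}={_escape(v)}"
                       for k, v in sorted(tags.items()))
    field_part = ",".join(f"{_escape(k)}={_format_field(v)}"
                          for k, v in sorted(fields.items()))
    return f"{name}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical downsampling and retention statements (InfluxQL).
DAILY_MEAN_CQ = (
    'CREATE CONTINUOUS QUERY "cq_aqi_daily" ON "airquality" BEGIN '
    'SELECT mean("value") INTO "aqi_daily" FROM "aqi" '
    'GROUP BY time(1d), * END'
)
ONE_YEAR_RETENTION = (
    'CREATE RETENTION POLICY "one_year" ON "airquality" '
    'DURATION 52w REPLICATION 1'
)

if __name__ == "__main__":
    print(to_line_protocol(
        "aqi",
        {"state": "NJ", "county": "Essex", "site": "0003"},
        {"value": 42.0},
        1451606400000000000,
    ))
    # prints: aqi,county=Essex,site=0003,state=NJ value=42.0 1451606400000000000
```

Location fields go in as tags so they can be used in `GROUP BY`; the measured value goes in as a field.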
Versions…
• Kafka v0.8.2.2 with Scala 2.10
• Zookeeper v3.4.6
• Spark v1.5.2 with Hadoop v2.4+
• Hadoop v2.7.1
• InfluxDB v0.10.0
• Grafana v2.6.0