![Page 1: Data Care, Feeding, and Maintenance](https://reader034.vdocuments.site/reader034/viewer/2022052621/558a8c01d8b42a095f8b45f7/html5/thumbnails/1.jpg)
Data Care, Feeding, and Maintenance
Mercedes Coyle Data Infrastructure Engineer at
@benzobot
![Page 2: Data Care, Feeding, and Maintenance](https://reader034.vdocuments.site/reader034/viewer/2022052621/558a8c01d8b42a095f8b45f7/html5/thumbnails/2.jpg)
• Online Video Syndication platform
• Connect content providers, video publishers, advertising partners
• 2-3 million streams/day
![Page 3: Data Care, Feeding, and Maintenance](https://reader034.vdocuments.site/reader034/viewer/2022052621/558a8c01d8b42a095f8b45f7/html5/thumbnails/3.jpg)
Where does your data come from?
• One-time use analytics, or continual collection and processing?
• How much control do you have over data content and formatting?
• public datasets (gov, twitter) - little control
• application logging - more control
![Page 4: Data Care, Feeding, and Maintenance](https://reader034.vdocuments.site/reader034/viewer/2022052621/558a8c01d8b42a095f8b45f7/html5/thumbnails/4.jpg)
How is your data formatted?
• Universal data format, or Normalize All the Data
• Pre- vs Post- processing
• Mapping data to a schema, even if it doesn’t have one
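As a rough illustration of mapping loosely formatted records onto a schema, here is a minimal Python sketch. The field names, the alias table, and the `flash` player value are all made up for the example and do not come from the talk; the idea is only that every source gets renamed and coerced into one shape, with missing or malformed fields set to `None`.

```python
from datetime import datetime, timezone

# Hypothetical target schema: field name -> function that coerces a raw value.
SCHEMA = {
    "timestamp": lambda v: datetime.fromtimestamp(float(v), tz=timezone.utc).isoformat(),
    "player_type": str,
    "streams": int,
}

# Hypothetical aliases that different sources might use for the same field.
ALIASES = {"ts": "timestamp", "time": "timestamp", "player": "player_type", "count": "streams"}


def normalize(record):
    """Map a raw record onto SCHEMA; missing or malformed fields become None."""
    renamed = {ALIASES.get(key, key): value for key, value in record.items()}
    normalized = {}
    for field, coerce in SCHEMA.items():
        try:
            normalized[field] = coerce(renamed[field])
        except (KeyError, TypeError, ValueError):
            normalized[field] = None
    return normalized


print(normalize({"ts": "0", "player": "flash", "count": "12"}))
```

Doing this at ingestion time is the pre-processing option; running the same function over stored raw records later is the post-processing one.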
![Page 5: Data Care, Feeding, and Maintenance](https://reader034.vdocuments.site/reader034/viewer/2022052621/558a8c01d8b42a095f8b45f7/html5/thumbnails/5.jpg)
Storage and Analytics tools
• Hadoop - distributed map reduce batch processing for large data sets
• Powerful querying tools (SQL-like Hive, Pig)
• Automated processing tasks for data ingestion and processing
• Slow - analyzing large data takes time, so no realtime results
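To make the map reduce idea concrete, the sketch below runs the three phases (map, shuffle, reduce) in a single Python process. It is not Hadoop code; the log format and publisher names are assumptions for illustration. Hadoop distributes these same phases across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit (publisher, 1) for each log line of the assumed form 'publisher,event'."""
    publisher, _event = line.split(",")
    yield publisher, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def run_job(lines):
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical sample log, three stream events from two publishers.
LOG = ["pub_a,stream", "pub_b,stream", "pub_a,stream"]
print(run_job(LOG))
```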
![Page 6: Data Care, Feeding, and Maintenance](https://reader034.vdocuments.site/reader034/viewer/2022052621/558a8c01d8b42a095f8b45f7/html5/thumbnails/6.jpg)
Storage and Analytics tools
• Realtime infrastructure - instantly available analytics and data storage
• Storm, Spark, MongoDB, Logstash & Elasticsearch
• Can create aggregations and analytics jobs on the fly, and get results in seconds
• Quickly detect issues and make informed decisions
• Not always simple to query backwards over time series
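A minimal sketch of an on-the-fly aggregation, assuming events arrive in timestamp order: a sliding-window counter that keeps only recent events in memory. The class name and window size are hypothetical, and this is plain Python rather than Storm or Spark code. Because old events are discarded, it also shows why querying backwards over a time series is harder in this style of system.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def add(self, timestamp):
        """Record one event; assumes timestamps arrive in non-decreasing order."""
        self._timestamps.append(timestamp)

    def count(self, now):
        """Drop events at or before the window cutoff, then return what is left."""
        cutoff = now - self.window_seconds
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


counter = SlidingWindowCounter(window_seconds=60)
for t in (0, 30, 59, 90):
    counter.add(t)
print(counter.count(now=100))
```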
![Page 7: Data Care, Feeding, and Maintenance](https://reader034.vdocuments.site/reader034/viewer/2022052621/558a8c01d8b42a095f8b45f7/html5/thumbnails/7.jpg)
Storage and Analytics tools
• Small datasets? Reach for some more familiar tools
• CSVs can be handy for quick data analysis on a sample set of your data, especially for biz folks
• Don’t forget about command line tools: grep, awk, sort -u, sum
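For a small CSV sample, the same kind of quick look can be done in a few lines of Python. The sketch below uses made-up column names and data; `unique_values` roughly mirrors piping one column through `sort -u`, and `column_total` mirrors summing a column with `awk`.

```python
import csv
import io

# Hypothetical sample data; in practice this would be a file on disk.
SAMPLE = """publisher,browser,streams
pub_a,chrome,10
pub_b,firefox,4
pub_a,firefox,6
"""


def unique_values(rows, column):
    """Return the sorted distinct values in one column."""
    return sorted({row[column] for row in rows})


def column_total(rows, column):
    """Return the integer sum of one column."""
    return sum(int(row[column]) for row in rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(unique_values(rows, "publisher"))
print(column_total(rows, "streams"))
```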
![Page 8: Data Care, Feeding, and Maintenance](https://reader034.vdocuments.site/reader034/viewer/2022052621/558a8c01d8b42a095f8b45f7/html5/thumbnails/8.jpg)
You have data - now what?!
• What do you want to learn from your data?
• How quickly do you need results?
• Is your dataset one time use, or will you add to it over time?
• How accurate do your results need to be?
• Where does your data need to end up?
![Page 9: Data Care, Feeding, and Maintenance](https://reader034.vdocuments.site/reader034/viewer/2022052621/558a8c01d8b42a095f8b45f7/html5/thumbnails/9.jpg)
Data Infrastructure at
• 75-100 million documents per day
• Lambda Architecture
• Batch processing with Hadoop
• Homegrown Realtime Processing system using RSyslog, Logstash, Elasticsearch and Kibana (currently undergoing rewrite with Storm)
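In a Lambda Architecture, a query is typically answered by combining a precomputed batch view with a speed-layer view of recent data. The sketch below shows that merge step only, under the assumption that the speed view holds just the events that arrived after the last batch run; the function name and sample counts are hypothetical and not taken from the system described here.

```python
def merge_views(batch_view, speed_view):
    """Add recent speed-layer counts onto batch totals without mutating either input."""
    merged = dict(batch_view)
    for key, count in speed_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged


# Hypothetical stream counts per publisher.
batch_view = {"pub_a": 1000, "pub_b": 400}   # totals as of the last batch run
speed_view = {"pub_a": 25, "pub_c": 3}       # events since that run
print(merge_views(batch_view, speed_view))
```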
![Page 10: Data Care, Feeding, and Maintenance](https://reader034.vdocuments.site/reader034/viewer/2022052621/558a8c01d8b42a095f8b45f7/html5/thumbnails/10.jpg)
Data Infrastructure at
![Page 11: Data Care, Feeding, and Maintenance](https://reader034.vdocuments.site/reader034/viewer/2022052621/558a8c01d8b42a095f8b45f7/html5/thumbnails/11.jpg)
Log All the Data, but don’t monitor All the Data
• Alert Fatigue
• Vanity Metrics
• Alerts and metrics can only be intelligent and actionable if they are relatable
![Page 12: Data Care, Feeding, and Maintenance](https://reader034.vdocuments.site/reader034/viewer/2022052621/558a8c01d8b42a095f8b45f7/html5/thumbnails/12.jpg)
Data Investigation: Rapid Stream Decline
Whoops!
![Page 13: Data Care, Feeding, and Maintenance](https://reader034.vdocuments.site/reader034/viewer/2022052621/558a8c01d8b42a095f8b45f7/html5/thumbnails/13.jpg)
Data Investigation: Rapid Stream Decline
• Our graphs only showed one metric (streams). Why did it decrease so much?
• Two player types, only one was affected.
• System performance metrics and monitoring showed no outages at this time.
![Page 14: Data Care, Feeding, and Maintenance](https://reader034.vdocuments.site/reader034/viewer/2022052621/558a8c01d8b42a095f8b45f7/html5/thumbnails/14.jpg)
Data Investigation: Digging Deeper
• Publishers provided page load data
• Correlated batch summaries of player loads with page load counts
• Cross-checked data in the Speed Layer to rule out batch processing issues
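One way to picture the correlation step is as a ratio of player loads to page loads per publisher. This is a sketch with invented numbers and names, not the analysis actually run: a publisher whose ratio is far below the others is one where pages still load but the player does not.

```python
def load_ratios(page_loads, player_loads):
    """Return player loads divided by page loads for each publisher with page traffic."""
    ratios = {}
    for publisher, pages in page_loads.items():
        if pages > 0:
            ratios[publisher] = player_loads.get(publisher, 0) / pages
    return ratios


# Hypothetical daily counts.
page_loads = {"pub_a": 1000, "pub_b": 500, "pub_c": 0}
player_loads = {"pub_a": 900, "pub_b": 100}
print(load_ratios(page_loads, player_loads))
```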
![Page 15: Data Care, Feeding, and Maintenance](https://reader034.vdocuments.site/reader034/viewer/2022052621/558a8c01d8b42a095f8b45f7/html5/thumbnails/15.jpg)
Data Investigation: Digging Deeper
![Page 16: Data Care, Feeding, and Maintenance](https://reader034.vdocuments.site/reader034/viewer/2022052621/558a8c01d8b42a095f8b45f7/html5/thumbnails/16.jpg)
Data Investigation: Digging Deeper
• Further data investigation revealed browser compatibility issues with our players
• Our batch reporting layer visualization highlighted the problem
• Ad-hoc queries in the speed layer allowed quick analysis to determine what caused the issue
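An ad-hoc breakdown of this kind can be sketched as a group-and-count over event records. The field names, browser values, and player types below are assumptions for illustration; the point is that splitting a single stream count by extra dimensions exposes a drop that is confined to one combination.

```python
from collections import Counter


def streams_by(events, *fields):
    """Count events grouped by the given fields, e.g. browser and player type."""
    return Counter(tuple(event[field] for field in fields) for event in events)


# Hypothetical stream events.
events = [
    {"browser": "chrome", "player": "type_1"},
    {"browser": "chrome", "player": "type_2"},
    {"browser": "firefox", "player": "type_1"},
    {"browser": "chrome", "player": "type_1"},
]
print(streams_by(events, "browser", "player"))
```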
![Page 17: Data Care, Feeding, and Maintenance](https://reader034.vdocuments.site/reader034/viewer/2022052621/558a8c01d8b42a095f8b45f7/html5/thumbnails/17.jpg)
Data Investigation: Next Steps
• More intelligent realtime reporting
• Refine our data visualization tools to better represent our metrics
• Better communication with the teams/products we collect data on to inform analytics and dashboards
![Page 18: Data Care, Feeding, and Maintenance](https://reader034.vdocuments.site/reader034/viewer/2022052621/558a8c01d8b42a095f8b45f7/html5/thumbnails/18.jpg)
Resources
• Hortonworks Hadoop Sandbox - http://hortonworks.com/products/hortonworks-sandbox/
• Storm Starter - https://github.com/nathanmarz/storm-starter and storm-project.net
• MongoDB Aggregation - http://docs.mongodb.org/manual/core/aggregation-introduction/
• Common Event Expression - http://cee.mitre.org/about/