Managing A Multi-Tenant Data Lake
Copyright 2016 Comcast Corporation. All Rights Reserved
Agenda
Timeline for Evolution
Why Governance
Multi-Tenancy Anti-Patterns / Warning Signs
Instituting Governance
Managing through Chaos
Monitoring/Metrics
Environment
Tools
SLA Management
Support and Staffing
Demo - Command Center
Timeline – 2013
2013 – “The Experiment”
Started with a 10-node cluster
Experimentation with batch processing and enrichment of event data
Team assembled from across organization
Primarily solving single use case
30 nodes by end of trial
2 Racks
Timeline – 2014 (H1)
2014 Production “Honeymoon”
Added 70 more nodes along with lower environments (Dev & QA)
Onboarded ~20 additional data sets through batch ETL
Supporting a dozen use cases
5 Racks
Timeline – 2014 (H2)
2014 Production “Tiger’s Tail”
Total of 200 nodes to support additional use cases (data science)
Total of ~30 more data sets through batch ETL
Supporting several dozen use cases and ad-hoc exploration
Starting to have difficulty managing resource requests
9 Racks
Timeline – 2015
2015 Production “Cortez”
Adding 250 more nodes to production environment
Fully embraced governance
Supporting 24x7 production use cases
19 Racks
Timeline – 2016
2016 Production “Planetary”
Adding 1300 more nodes to production environment
Standing up a separate 500-node data science cluster
Spinning off critical compute to boundary satellite clusters
Reaping benefits from governance and resource planning
48 Racks
Why Governance?
It’s about establishing acceptable behaviors for the benefit of the community
Minimize user/application impact on cluster
Users will do whatever is technically possible
Everyone has been conditioned to work “smarter not harder”
Establish guardrails, not edicts
Multi-Tenancy Anti-Patterns
Speculative Execution
Optional User Training
Lack of Resource Isolation
Lack of Testing and Measurement
Ad-hoc Communication Channels
Excessive Resource Utilization/Reservation
Informal Service Level Agreements (SLAs)
Public Domain: Plynn9
Signs of Looming Disaster
Pending Jobs
Queue Fidgeting
Job Rescheduling
Non-Predictive Workloads
Cluster Storage Out Of Balance
Public Domain: US DOE
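One way to watch for the "Cluster Storage Out Of Balance" sign is to compare each DataNode's disk utilization with the cluster average. The sketch below is illustrative only: the node names and figures are hypothetical, the 10-point threshold mirrors the HDFS balancer's default, and it assumes equally sized nodes so that a simple mean can stand in for cluster-wide utilization.

```python
from statistics import mean

def unbalanced_nodes(used_pct_by_node, threshold_pct=10.0):
    """Return nodes whose utilization deviates from the cluster mean
    by more than threshold_pct percentage points, with the deviation."""
    cluster_avg = mean(used_pct_by_node.values())
    return {
        node: round(used - cluster_avg, 1)
        for node, used in used_pct_by_node.items()
        if abs(used - cluster_avg) > threshold_pct
    }

# Hypothetical figures: percent of each DataNode's capacity in use.
sample = {"dn01": 62.0, "dn02": 58.0, "dn03": 91.0, "dn04": 29.0}
print(unbalanced_nodes(sample))  # {'dn03': 31.0, 'dn04': -31.0}
```

A positive deviation marks a node that is fuller than average, a negative one a node that is emptier.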
Instituting Governance
Governance is not a technology problem
Governance must be solved using:
  People – Who
  Processes – What / When / How
  Policy – Why
Always employ technology to help with enforcement and measurement
Setting Out Governance Standards – Starting Out
Involve the business users to define lightweight policies and processes:
  Onboarding users/applications/tools
  Resource Utilization Worksheets
  Deployment checklists
  Service Level Agreements / Penalties
  Updates of Governance Standards
You MUST socialize these policies and processes and educate your community on them
Strive for evolution not revolution
Setting Out Governance Standards – Measurement
Define universally accepted performance measures:
  Storage
  Compute
  System Availability
  Issues and MTTR
  Average Completion Time
  Average Pending Apps
Be transparent with results and make them available to entire community
Establish monthly performance reviews with key stakeholders
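A few of the measures named above reduce to simple averages. The following is a minimal sketch of how MTTR, average completion time, and average pending apps could be computed for a monthly review. All records are hypothetical sample data, and the slides do not say how these figures are actually collected.

```python
from datetime import datetime, timedelta

def mttr(incidents):
    """Mean time to repair: average of (resolved - opened) per incident."""
    durations = [resolved - opened for opened, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)

def average(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

# Hypothetical incident log: (opened, resolved) pairs.
incidents = [
    (datetime(2016, 3, 1, 9, 0), datetime(2016, 3, 1, 10, 30)),   # 90 minutes
    (datetime(2016, 3, 9, 22, 0), datetime(2016, 3, 9, 22, 30)),  # 30 minutes
]
completion_minutes = [12, 48, 30]     # hypothetical job run times
pending_app_samples = [0, 4, 11, 5]   # hypothetical periodic samples of pending apps

print(mttr(incidents))                # 1:00:00
print(average(completion_minutes))    # 30.0
print(average(pending_app_samples))   # 5.0
```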
Setting Out Governance Standards – Enforcement
Lock down as many resources as possible
Monitor resource utilization for compliance
Automate corrective measures
It's all about transitioning from defense to offense and eliminating surprises!
Setting Out Governance Standards – Enforcement
Hadoop provides some base capabilities:
  YARN Queues for compute
  HDFS Quotas/ACLs for storage
Implement custom solutions for proactive offensive capabilities:
  Job monitoring and migration (Penalty Box)
  Dynamic Allocation / Queue Flexing
  Monitor and track leading indicators (Command Center)
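The slides do not describe how the Penalty Box is implemented. As an illustration of the idea of job monitoring and migration, the sketch below picks applications to move out of a queue that has exceeded a memory guardrail, largest consumer first, until the queue is back under its limit. The `App` type, the queue names, the limits, and the `penalty_box` queue name are all assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class App:
    app_id: str
    queue: str
    memory_gb: int  # memory currently held by the application

def penalty_box_candidates(apps, queue_limits_gb, penalty_queue="penalty_box"):
    """Return (app_id, from_queue, to_queue) moves that bring every
    over-limit queue back under its memory guardrail."""
    by_queue = {}
    for app in apps:
        by_queue.setdefault(app.queue, []).append(app)

    moves = []
    for queue, members in by_queue.items():
        limit = queue_limits_gb.get(queue)
        if limit is None:
            continue  # no guardrail defined for this queue
        used = sum(a.memory_gb for a in members)
        # Move the largest consumers first until usage is within the limit.
        for app in sorted(members, key=lambda a: a.memory_gb, reverse=True):
            if used <= limit:
                break
            moves.append((app.app_id, queue, penalty_queue))
            used -= app.memory_gb
    return moves

# Hypothetical snapshot of running applications and guardrails.
apps = [
    App("app_1", "etl", 400),
    App("app_2", "etl", 150),
    App("app_3", "adhoc", 900),
    App("app_4", "adhoc", 300),
]
limits = {"etl": 600, "adhoc": 1000}
print(penalty_box_candidates(apps, limits))  # [('app_3', 'adhoc', 'penalty_box')]
```

The function only proposes moves. Acting on them, for example by asking YARN to move an application to another queue, is left out of the sketch.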
Multi-Tenancy: Understanding the Chaos - Monitoring/Metrics
Image Attribution: Pixabay - Creative Commons CC0
Use Case – Extreme Ad Hoc (Data Science)
Challenges? You bet!
Challenges Monitoring and Managing a Multi-tenant Hadoop Environment – Diverse User Community
Diverse User Community
Images: Creative Commons
Challenges Monitoring and Managing a Multi-tenant Hadoop Environment - SLAs
Diverse SLAs
Challenges Monitoring and Managing a Multi-tenant Hadoop Environment - Governance
Images: Creative Commons
Challenges Monitoring and Managing a Multi-tenant Hadoop Environment – Monitoring & Forecasting
Images: Creative Commons
Environment
Our Environment - Tools for Monitoring
Standard Hadoop Monitoring
Environment - Tools for Monitoring
Command Center
Pepperdata
SLA Management
Application Timing
Resource Management
Capacity Management
Images: Creative Commons
Support & Staffing
Images: Creative Commons
Takeaways for DevOps Model in Hadoop
Train Your Teams (!!!)
Measure, Forecast and Model
Automation and Frameworks
Comcast Command Center
The Command Center: Our Focus
Visualizations & Design
Ease Of Use
Extensibility
Alerting
The Command Center for Monitoring and Alerting
• Missed SLAs
• Guardrails broken
• Definitions
• Links
• Containers
• Queue capacity
• Status
• Measures
• HDFS Usage
• Queue Usage
Continuous Evolution
Continuous Engagement
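The "Missed SLAs" alert can be pictured as a comparison of job completion times against agreed deadlines. This is a hedged sketch rather than the Command Center's actual logic, which the slides do not show. The job names, deadlines, and the `missed_slas` function are hypothetical.

```python
from datetime import datetime

def missed_slas(finish_times, deadlines, now):
    """Return alert messages for jobs that finished late, or that are
    still unfinished after their SLA deadline has passed."""
    alerts = []
    for job, deadline in sorted(deadlines.items()):
        finished = finish_times.get(job)
        if finished is None:
            if now > deadline:
                alerts.append(f"{job}: still running past SLA deadline {deadline:%H:%M}")
        elif finished > deadline:
            late_min = int((finished - deadline).total_seconds() // 60)
            alerts.append(f"{job}: finished {late_min} min after SLA deadline")
    return alerts

# Hypothetical schedule for one day.
deadlines = {
    "billing_etl": datetime(2016, 6, 1, 6, 0),
    "usage_rollup": datetime(2016, 6, 1, 7, 0),
    "nightly_export": datetime(2016, 6, 1, 7, 30),
    "ad_hoc_report": datetime(2016, 6, 1, 9, 0),
}
finish_times = {
    "billing_etl": datetime(2016, 6, 1, 5, 40),
    "usage_rollup": datetime(2016, 6, 1, 7, 25),
}
for alert in missed_slas(finish_times, deadlines, now=datetime(2016, 6, 1, 8, 0)):
    print(alert)
```

With this sample data, `nightly_export` is flagged as still running past its 07:30 deadline and `usage_rollup` as finishing 25 minutes late. The other two jobs raise no alert.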
Monitoring and Alerting at Comcast
The Command Center!
Thanks!
Ray Harrison, Principal DevOps Architect
Mike Fagan, Principal Big Data Architect
We Are Hiring!