BSidesSF 02/12/2017 - Make Alerts Great Again
Post on 11-Apr-2017
Daniel Popescu
dpopes@yelp.com / @danielpopes
Make Alerts Great Again
2015 - now():
- Security Engineer - Yelp
2008 - 2015:
- Software Engineer - MSFT
Daniel Popescu
Yelp’s Mission
Connecting people with great local businesses.
Yelp Stats
As of Q3 2016
97M  32  74%  115M
3000+ Servers
4000+ Employees
500+ Microservices
A TON of logs
Yelp Infrastructure
Collect Logs -> Stream -> Index -> Visualize and Alert
osquery, ElastAlert, Kibana
Yelp Security Infrastructure
- Lack of Visibility
- Not Actionable
- No Standards
- Functionally Correct?
- False Positives
Common Alerting Pitfalls
- Historical Statistics
- Actionable
- Incident Response Steps
- Functionally Correct
Yelp Security Alerts
Alerts span multiple systems
- Elasticsearch
- Splunk
Alert metrics unknown
- Count
- Frequency
No comprehensive dashboard
Lack of Visibility: Problem
Alert Reporter
- Weekly Report
- Multiple Alert Sources
- Insights
- Frequencies
- Self Service
- Delivery Mechanism
Lack of Visibility: Solution
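A minimal sketch of the kind of aggregation a weekly alert report needs, assuming alerts from every source can be normalized to a source name, a rule name, and a firing time. The record shape, the `weekly_report` function, and the sample rule names are hypothetical and are not taken from Yelp's Alert Reporter.

```python
from collections import Counter
from datetime import datetime, timedelta


def weekly_report(alerts, week_end):
    """Count alerts per (source, rule) fired in the 7 days before week_end."""
    week_start = week_end - timedelta(days=7)
    counts = Counter(
        (alert["source"], alert["rule"])
        for alert in alerts
        if week_start <= alert["fired_at"] < week_end
    )
    # Highest counts first, so the noisiest rules lead the report.
    return counts.most_common()


if __name__ == "__main__":
    week_end = datetime(2017, 2, 12)
    # Hypothetical normalized alerts from two sources.
    sample = [
        {"source": "elasticsearch", "rule": "iam_group_change",
         "fired_at": week_end - timedelta(days=1)},
        {"source": "elasticsearch", "rule": "iam_group_change",
         "fired_at": week_end - timedelta(days=3)},
        {"source": "splunk", "rule": "open_security_group",
         "fired_at": week_end - timedelta(days=2)},
    ]
    for (source, rule), count in weekly_report(sample, week_end):
        print(f"{source}\t{rule}\t{count}")
```

Keeping the source in the key lets one report cover both Elasticsearch and Splunk alerts while still showing where each rule lives.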
Email
- Only a reporting channel
- No ownership
Ticketing
- Better than email
- No enforcement
Not Actionable: Problem
{ "eventName": "CreateRole", "requestParameters": { "roleName": "rds-monitoring-role" }, "userIdentity": { "userName": "ioannis+admin" }}
{ "eventName": "AddUserToGroup", "requestParameters": { "groupName": "admins", "userName": "jsendor" }, "userIdentity": { "userName": "mattc" }}
{ "eventName": "RemoveUserFromGroup", "requestParameters": { "groupName": "RequireMFA", "userName": "martin" }, "userIdentity": { "userName": "martin" }}
{ "eventName": "AuthorizeSecurityGroupIngress", "requestParameters": { "cidrIp": "0.0.0.0/0", "fromPort": 1, "toPort": 65535 }, "userIdentity": { "userName": "lmatthew" }}
No more emails
JIRA Service Desk
- SLAs
- Queues
Not Actionable: Solution
Actionable Alerting Service (AAS)
- Finds assignees for alerts
- Escalates when the SLA is breached
- Looks at JIRA ticket metadata
Not Actionable: Solution
Alerts automatically assigned to actors
- Common Administrative Tasks
- Infrastructure Changes
- Honor System (kinda)
- Mistakes
- Malware?
Self Service Alerts
Self Service - Human
duo_data { "action": "integration_create", "description": { "iname": "Auth API", "type": "rest" }, "eventtype": "administrator", "object": "Auth API", "username": "alect"}
Self Service - Non Human
{ "actor": "svc-dasher", "event": { "name": "CREATE_ORG_UNIT", "parameters": { "ORG_UNIT_NAME": "TestJMA" }, "type": "ORG_SETTINGS" }}
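The sample events above carry the acting user in different fields depending on the log source. The sketch below shows one way to pull the actor out of each shape before assigning the alert. The function names are hypothetical, and treating the `svc-` prefix as the marker of a non-human account is an assumption based only on the `svc-dasher` example.

```python
def extract_actor(event):
    """Return the user name responsible for an event, or None if unknown."""
    if "userIdentity" in event:
        # CloudTrail-style record: {"userIdentity": {"userName": ...}}
        return event["userIdentity"].get("userName")
    if "username" in event:
        # Duo-style administrator log: {"username": ...}
        return event["username"]
    # Fallback shape seen above: {"actor": "svc-dasher", ...}
    return event.get("actor")


def is_service_account(actor, prefix="svc-"):
    """Assumption: non-human accounts share a naming prefix."""
    return actor is not None and actor.startswith(prefix)


if __name__ == "__main__":
    event = {
        "eventName": "AddUserToGroup",
        "requestParameters": {"groupName": "admins", "userName": "jsendor"},
        "userIdentity": {"userName": "mattc"},
    }
    actor = extract_actor(event)
    print(actor, is_service_account(actor))
```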
Schedule name in alert metadata
Assign alert to current on-point
PagerDuty Schedule
Alert Owner
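One possible way to combine the two assignment strategies, sketched under stated assumptions: a human actor owns the alert, otherwise the alert goes to whoever is currently on the schedule named in its metadata. The ordering, the `metadata.schedule` field name, and the `lookup_on_call` callable are hypothetical. `lookup_on_call` stands in for a real schedule query and does not call the PagerDuty API.

```python
def pick_owner(alert, lookup_on_call, service_prefix="svc-"):
    """Return the user who should own the alert, or None if nobody is found."""
    actor = alert.get("actor")
    if actor and not actor.startswith(service_prefix):
        # A human actor can confirm or deny their own action.
        return actor
    schedule = alert.get("metadata", {}).get("schedule")
    if schedule:
        # Hypothetical lookup: schedule name -> user currently on point.
        return lookup_on_call(schedule)
    return None


if __name__ == "__main__":
    # Stand-in for a schedule service; names are made up for illustration.
    on_call = {"security-primary": "alice"}
    alert = {"actor": "svc-dasher", "metadata": {"schedule": "security-primary"}}
    print(pick_owner(alert, on_call.get))
```

Passing the lookup in as a function keeps the assignment logic testable without a live schedule service.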
What to do when SLA is breached
- Ping user in JIRA
- Ping user in IRC / Slack channel
- Ping user’s manager in JIRA
Escalation Channels
SLA Past Due - JIRA Ping
SLA Past Due - CC Manager
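The escalation channels above can be read as a ladder that is climbed one step at a time once a ticket is past its SLA. A minimal sketch, assuming the checker runs periodically and records how many steps it has already taken for each ticket. The step names, the `next_escalation` function, and the one-step-per-run timing are hypothetical.

```python
from datetime import datetime, timedelta

# Ordered as in the slide: assignee in JIRA, assignee in chat, then manager.
ESCALATION_STEPS = [
    "ping_assignee_in_jira",
    "ping_assignee_in_chat",
    "cc_manager_in_jira",
]


def next_escalation(created_at, sla, now, steps_taken):
    """Return the next escalation step for a ticket, or None if none is due."""
    if now - created_at <= sla:
        return None  # SLA not breached yet
    if steps_taken >= len(ESCALATION_STEPS):
        return None  # ladder exhausted, nothing further to automate
    return ESCALATION_STEPS[steps_taken]


if __name__ == "__main__":
    created = datetime(2017, 2, 1, 9, 0)
    print(next_escalation(created, timedelta(days=2), datetime(2017, 2, 4, 9, 0), 0))
```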
No RFC for authoring alerts
Feature Set Awareness
No Standards: Problem
Standards
New Alerts Runbook
- Priorities
- Mandatory Fields
- Delivery Mechanism
- Feature Set
- Testing
No Standards: Solution
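A runbook with mandatory fields lends itself to an automated check on every new alert definition. The sketch below is illustrative only: the field names, the priority labels, and the `validate_alert_definition` function are assumptions, since the slides do not list the actual mandatory fields.

```python
# Hypothetical mandatory fields; the real runbook's list is not in the slides.
REQUIRED_FIELDS = ("name", "priority", "delivery", "response_steps")
ALLOWED_PRIORITIES = {"P1", "P2", "P3"}  # hypothetical priority labels


def validate_alert_definition(definition):
    """Return a list of problems; an empty list means the definition passes."""
    problems = [
        f"missing field: {field}"
        for field in REQUIRED_FIELDS
        if not definition.get(field)
    ]
    priority = definition.get("priority")
    if priority and priority not in ALLOWED_PRIORITIES:
        problems.append(f"unknown priority: {priority}")
    return problems


if __name__ == "__main__":
    draft = {"name": "iam_group_change", "priority": "P2", "delivery": "jira"}
    for problem in validate_alert_definition(draft):
        print(problem)
```

Running a check like this in code review or CI would catch an alert that ships without response steps before it ever fires.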
Alert Definition Bugs
- Typos
- Bad assumptions
Data Sources
- Flatlines
- Drop in volume
Functionally Correct?: Problem
End-to-End Testing
- For 100% of new alerts
- Subset of existing alerts
Flatline alerts
- Test them too
Functionally Correct?: Solution
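An alert cannot fire if its data source has gone quiet, so flatlines and volume drops need their own checks. A minimal sketch, assuming per-window event counts are already available for each data source. The `check_data_source` function and the 50% drop threshold are hypothetical, and this is not how any particular alerting tool implements its flatline rule.

```python
from statistics import mean


def check_data_source(counts, drop_ratio=0.5):
    """Classify the latest window of a per-window event count series.

    Returns "flatline", "drop", or "ok".
    """
    if not counts:
        return "flatline"  # no windows at all: nothing is arriving
    *history, latest = counts
    if latest == 0:
        return "flatline"
    # Compare the latest window against the average of earlier windows.
    if history and latest < drop_ratio * mean(history):
        return "drop"
    return "ok"


if __name__ == "__main__":
    # Synthetic series, in the spirit of "test them too".
    print(check_data_source([120, 110, 130, 0]))
    print(check_data_source([120, 110, 130, 40]))
    print(check_data_source([120, 110, 130, 100]))
```

Feeding the check synthetic series like these is a cheap stand-in for an end-to-end test. A full test would inject events into the real pipeline and confirm that a ticket appears.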
There will be false positives
- Windows malware on Mac
- New production services
- Sketchy? DNS requests
False Positives: Problem
Automation
- Incident Response
- Tools and Scripts
Constant alert improvement
False Positives: Solution
Measuring Success
Active Tickets
- Manageable
- SLA Met > 50%
Positive Reception
- Corp Eng
- Operations
Security Team
- Happy
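The "SLA Met" figure can be computed from ticket data as the fraction of resolved tickets closed within their SLA. A small sketch, assuming each ticket records how long it took to resolve. The `resolved_after` field and the `sla_met_ratio` function are hypothetical names.

```python
from datetime import timedelta


def sla_met_ratio(tickets, sla):
    """Fraction of resolved tickets that were closed within `sla`."""
    resolved = [t for t in tickets if t.get("resolved_after") is not None]
    if not resolved:
        return 0.0
    met = sum(1 for t in resolved if t["resolved_after"] <= sla)
    return met / len(resolved)


if __name__ == "__main__":
    tickets = [
        {"resolved_after": timedelta(hours=4)},
        {"resolved_after": timedelta(days=3)},
        {"resolved_after": None},  # still open, excluded from the ratio
    ]
    print(sla_met_ratio(tickets, timedelta(days=1)))
```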
Recap
Problem | Solution
Visibility | Alert Reporter + JIRA Service Desk
Actionability | JIRA + Self Service Alerts + PagerDuty + Actionable Alerting Service
Standardization | Runbook For New Alerts
Functional Correctness | End-To-End Tests
False Positives | Automation
Make your alerts actionable!
Make sure you have visibility into your alerting metrics!
Make sure your alerts actually work!
TL;DR
@YelpEngineering
fb.com/YelpEngineers
engineeringblog.yelp.com
github.com/yelp
Q&A
Daniel Popescu
dpopes@yelp.com / @danielpopes