BSidesSF 02/12/2017 - Make Alerts Great Again

Posted on 11-Apr-2017


Daniel Popescu
dpopes@yelp.com / @danielpopes

Make Alerts Great Again

2015 - now(): Security Engineer - Yelp

2008 - 2015: Software Engineer - MSFT

Daniel Popescu

Yelp's Mission

Connecting people with great local businesses.

Yelp Stats (as of Q3 2016)

97M, 32, 74%, 115M

- 3000+ Servers
- 4000+ Employees
- 500+ Microservices

A TON of logs

Yelp Infrastructure

Collect Logs, Stream, Index, Visualize and Alert

osquery

ElastAlert, Kibana

Yelp Security Infrastructure

- Lack of Visibility
- Not Actionable
- No Standards
- Functionally Correct?
- False Positives

Common Alerting Pitfalls

- Historical Statistics
- Actionable
- Incident Response Steps
- Functionally Correct

Yelp Security Alerts

Alerts span multiple systems
- Elasticsearch
- Splunk

Alert metrics unknown
- Count
- Frequency

No comprehensive dashboard

Lack of Visibility: Problem

Alert Reporter
- Weekly Report
- Multiple Alert Sources
- Insights
- Frequencies
- Self Service
- Delivery Mechanism

Lack of Visibility: Solution
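
The Alert Reporter bullets above (weekly report, multiple alert sources, frequencies) can be illustrated with a minimal sketch. The talk does not show the reporter's code, so the record shape (`source`, `name`, `fired_at`), the alert names, and the `weekly_report` function are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def weekly_report(alerts, now, days=7):
    """Count alert firings per (source, name) over the last `days` days."""
    cutoff = now - timedelta(days=days)
    counts = Counter(
        (a["source"], a["name"])
        for a in alerts
        if cutoff <= a["fired_at"] <= now
    )
    # Most frequent alerts first; ties are broken alphabetically.
    rows = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        {"source": src, "name": name, "count": n, "per_day": round(n / days, 2)}
        for (src, name), n in rows
    ]


# Hypothetical firings collected from two alert sources.
now = datetime(2017, 2, 12)
alerts = [
    {"source": "elasticsearch", "name": "iam_group_change", "fired_at": datetime(2017, 2, 10)},
    {"source": "elasticsearch", "name": "iam_group_change", "fired_at": datetime(2017, 2, 11)},
    {"source": "splunk", "name": "open_security_group", "fired_at": datetime(2017, 2, 9)},
    {"source": "splunk", "name": "open_security_group", "fired_at": datetime(2017, 1, 1)},  # outside the window
]
report = weekly_report(alerts, now)
```

Normalizing every source into one record shape before counting is what lets a single report cover both Elasticsearch and Splunk alerts.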

Email
- Only a reporting channel
- No ownership

Ticketing
- Better than email
- No enforcement

Not Actionable: Problem

{ "eventName": "CreateRole", "requestParameters": { "roleName": "rds-monitoring-role" }, "userIdentity": { "userName": "ioannis+admin" }}

{ "eventName": "AddUserToGroup", "requestParameters": { "groupName": "admins", "userName": "jsendor" }, "userIdentity": { "userName": "mattc" }}

{ "eventName": "RemoveUserFromGroup", "requestParameters": { "groupName": "RequireMFA", "userName": "martin" }, "userIdentity": { "userName": "martin" }}

{ "eventName": "AuthorizeSecurityGroupIngress", "requestParameters": { "cidrIp": "0.0.0.0/0", "fromPort": 1, "toPort": 65535 }, "userIdentity": { "userName": "lmatthew" }}

No more emails

JIRA Service Desk
- SLAs
- Queues

Not Actionable: Solution

Actionable Alerting Service (AAS)
- Finds assignees for alerts
- Escalates when SLA is breached
- Looks at JIRA ticket metadata

Not Actionable: Solution

Alerts automatically assigned to actors
- Common Administrative Tasks
- Infrastructure Changes
- Honor System (kinda)
- Mistakes
- Malware?

Self Service Alerts

Self Service - Human

duo_data { "action": "integration_create", "description": { "iname": "Auth API", "type": "rest" }, "eventtype": "administrator", "object": "Auth API", "username": "alect"}

Self Service - Non Human

{ "actor": "svc-dasher", "event": { "name": "CREATE_ORG_UNIT", "parameters": { "ORG_UNIT_NAME": "TestJMA" }, "type": "ORG_SETTINGS" }}
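
A self-service alert needs to know who triggered the event. The sketch below pulls the actor out of the three event shapes shown on the slides and separates humans from service accounts. It is only an illustration: the `extract_actor` and `is_service_account` helpers are hypothetical, and treating the `svc-` prefix as the marker of a non-human account is an assumption based on the single `svc-dasher` example.

```python
def extract_actor(event):
    """Return the user name responsible for an event, or None if absent."""
    identity = event.get("userIdentity")
    if isinstance(identity, dict) and identity.get("userName"):
        # Shape of the CreateRole / AddUserToGroup samples.
        return identity["userName"]
    # Shapes of the Duo sample ("username") and the org-unit sample ("actor").
    for key in ("username", "actor"):
        if event.get(key):
            return event[key]
    return None


def is_service_account(actor, prefix="svc-"):
    """Assumption: non-human accounts share a naming prefix."""
    return actor.startswith(prefix)


group_change = {
    "eventName": "AddUserToGroup",
    "requestParameters": {"groupName": "admins", "userName": "jsendor"},
    "userIdentity": {"userName": "mattc"},
}
duo_event = {"action": "integration_create", "username": "alect"}
org_unit_event = {"actor": "svc-dasher", "event": {"name": "CREATE_ORG_UNIT"}}

actors = [extract_actor(e) for e in (group_change, duo_event, org_unit_event)]
```

Note that for `AddUserToGroup` the actor is `mattc` (who made the change), not `jsendor` (who was added), which is why the lookup reads `userIdentity` rather than `requestParameters`.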

- Schedule name in alert metadata
- Assign alert to current on-point

Pagerduty Schedule

Pagerduty Schedule
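
Routing by schedule name can be sketched as follows. This does not call the PagerDuty API: `on_call_lookup` is a hypothetical callable that maps a schedule name to whoever is currently on point, and the `pagerduty_schedule` metadata key, the schedule name, and the user `alice` are assumed for illustration.

```python
def resolve_assignee(alert, on_call_lookup):
    """Pick a ticket assignee for an alert.

    If the alert metadata names a schedule, assign to whoever that
    schedule says is currently on point; otherwise fall back to the
    actor recorded on the alert (None if there is none).
    """
    schedule = alert.get("metadata", {}).get("pagerduty_schedule")
    if schedule:
        return on_call_lookup(schedule)
    return alert.get("actor")


# Stand-in for a real schedule lookup: schedule name -> current on-point user.
current_on_point = {"security-primary": "alice"}

scheduled_alert = {
    "name": "org_unit_created",
    "actor": "svc-dasher",
    "metadata": {"pagerduty_schedule": "security-primary"},
}
human_alert = {"name": "iam_group_change", "actor": "mattc"}

assignees = [
    resolve_assignee(scheduled_alert, current_on_point.get),
    resolve_assignee(human_alert, current_on_point.get),
]
```

Passing the lookup in as a callable keeps the routing logic testable without network access.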

Alert Owner

What to do when SLA is breached
- Ping user in JIRA
- Ping user in IRC / Slack channel
- Ping user's manager in JIRA

Escalation Channels

SLA Past Due - JIRA Ping

SLA Past Due - CC Manager
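
The three escalation channels can be read as a ladder that is climbed the longer a ticket stays past its SLA. The slides do not say when each step fires, so the one-day spacing between steps below is purely an assumption, as are the step names and the `escalation_steps` function.

```python
from datetime import datetime, timedelta


def escalation_steps(sla_due, now, step=timedelta(days=1)):
    """Return the escalation actions that apply to a ticket at time `now`.

    Assumed policy: ping the user in JIRA as soon as the SLA is breached,
    add a chat ping after one `step`, and CC the manager after two.
    """
    if now <= sla_due:
        return []
    overdue = now - sla_due
    steps = ["ping_user_in_jira"]
    if overdue >= step:
        steps.append("ping_user_in_chat")
    if overdue >= 2 * step:
        steps.append("ping_manager_in_jira")
    return steps


due = datetime(2017, 2, 10, 12, 0)
on_time = escalation_steps(due, datetime(2017, 2, 10, 9, 0))
just_late = escalation_steps(due, datetime(2017, 2, 10, 13, 0))
very_late = escalation_steps(due, datetime(2017, 2, 13, 12, 0))
```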

- No RFC for authoring alerts
- Feature Set Awareness

No Standards: Problem

Standards

New Alerts Runbook
- Priorities
- Mandatory Fields
- Delivery Mechanism
- Feature Set
- Testing

No Standards: Solution

Alert Definition Bugs
- Typos
- Bad assumptions

Data Sources
- Flatlines
- Drop in volume

Functionally Correct?: Problem

End-to-End Testing
- For 100% of new alerts
- Subset of existing alerts

Flatline alerts
- Test them too

Functionally Correct?: Solution
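
A flatline check asks whether a data source has gone quiet, and it can be tested like any other alert by feeding it synthetic events. The sketch below is a stand-alone illustration rather than the ElastAlert implementation; the `is_flatlined` function, the one-hour window, and the threshold of one event are assumptions.

```python
from datetime import datetime, timedelta


def is_flatlined(event_times, now, window=timedelta(hours=1), threshold=1):
    """True if fewer than `threshold` events arrived in the last `window`."""
    start = now - window
    recent = sum(1 for t in event_times if start < t <= now)
    return recent < threshold


# Test the flatline alert too: inject synthetic timestamps and check
# both the quiet case and the healthy case.
now = datetime(2017, 2, 12, 12, 0)
stale_source = [now - timedelta(hours=3), now - timedelta(hours=2)]
healthy_source = stale_source + [now - timedelta(minutes=5)]

stale_result = is_flatlined(stale_source, now)
healthy_result = is_flatlined(healthy_source, now)
```

Raising `threshold` turns the same check into a drop-in-volume alert, since it then fires when traffic falls below an expected floor instead of only when it stops.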

There will be false positives
- Windows malware on Mac
- New production services
- Sketchy? DNS requests

False Positives: Problem

Automation
- Incident Response
- Tools and Scripts

Constant alert improvement

False Positives: Solution

Measuring Success

Active Tickets
- Manageable
- SLA Met > 50%

Positive Reception
- Corp Eng
- Operations

Security Team
- Happy
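
The "SLA Met > 50%" figure is a simple ratio over resolved tickets. A minimal sketch, assuming each ticket carries hypothetical `sla_due` and `resolved_at` timestamps and that still-open tickets are left out of the calculation:

```python
from datetime import datetime


def sla_met_rate(tickets):
    """Fraction of resolved tickets closed on or before their SLA deadline.

    Returns None when no ticket has been resolved yet.
    """
    resolved = [t for t in tickets if t.get("resolved_at") is not None]
    if not resolved:
        return None
    met = sum(1 for t in resolved if t["resolved_at"] <= t["sla_due"])
    return met / len(resolved)


tickets = [
    {"sla_due": datetime(2017, 2, 3), "resolved_at": datetime(2017, 2, 2)},  # met
    {"sla_due": datetime(2017, 2, 3), "resolved_at": datetime(2017, 2, 3)},  # met
    {"sla_due": datetime(2017, 2, 3), "resolved_at": datetime(2017, 2, 6)},  # missed
    {"sla_due": datetime(2017, 2, 9), "resolved_at": None},                  # still open
]
rate = sla_met_rate(tickets)
target_met = rate is not None and rate > 0.5
```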

Recap

Problem                  Solution
Visibility               Alert Reporter + JIRA Service Desk
Actionability            JIRA + Self Service Alerts + Pagerduty + Actionable Alerting Service
Standardization          Runbook For New Alerts
Functional Correctness   End-To-End Tests
False Positives          Automation

Make your alerts actionable!

Make sure you have visibility into your alerting metrics!

Make sure your alerts actually work!

TL;DR

@YelpEngineering

fb.com/YelpEngineers

engineeringblog.yelp.com

github.com/yelp

Q&A

Daniel Popescu
dpopes@yelp.com / @danielpopes
