Monitoring Patterns for Mitigating Technical Risk

@Forter

#1 risk: Slow or bad (500) API responses

Auto-healing (because humans are slow)
SLA, Failover, Degradation, Throttling

Alerting
Detect, Filter, Alert, Diagnostics

SLA

                    Performance       Data Loss     Business Logic
TX Processing       Low Latency       Nope          Best Effort
Stream Processing   High Throughput   Best Effort   Best Effort
Batch Processing    High Volume       Nope          Reconciliation

Automatic Failover
http fencing (Incapsula)
http load balancing (ELB)
instance restart (Scaling Group)
process restart (upstart)

exceptions bubble up and crash

Graceful Degradation

nginx (lua)

expressjs (nodejs)

storm (java)

Stability

Code Changes

Throttling (without back-pressure)
request priority reduced when TX/sec > threshold

Different priority → Different queue → Different worker

lower priority inside queue for test probes
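The throttling idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ThrottlingRouter`, the queue names `normal` and `reduced`, and the one-second sliding window are hypothetical and not taken from the slides. Nothing is rejected (no back-pressure); requests arriving above the threshold are simply demoted to a separate queue, and test probes sort behind real traffic inside each queue.

```python
import heapq
import itertools
import time
from collections import deque


class ThrottlingRouter:
    """Demote requests to a lower-priority queue when TX/sec exceeds a threshold."""

    def __init__(self, threshold_tx_per_sec, clock=time.monotonic):
        self.threshold = threshold_tx_per_sec
        self.clock = clock                  # injectable for testing
        self._recent = deque()              # arrival times within the last second
        self._seq = itertools.count()       # tie-breaker that keeps FIFO order
        # One heap per priority; each would be drained by its own worker.
        self.queues = {"normal": [], "reduced": []}

    def submit(self, request, is_test_probe=False):
        """Enqueue a request and return the name of the queue it landed in."""
        now = self.clock()
        while self._recent and now - self._recent[0] >= 1.0:
            self._recent.popleft()
        self._recent.append(now)
        name = "reduced" if len(self._recent) > self.threshold else "normal"
        # Test probes get rank 1 so they sort after real traffic (rank 0).
        rank = 1 if is_test_probe else 0
        heapq.heappush(self.queues[name], (rank, next(self._seq), request))
        return name

    def take(self, name):
        """Pop the next request for the worker of the given queue, or None."""
        queue = self.queues[name]
        return heapq.heappop(queue)[2] if queue else None
```

A worker per queue would call `take("normal")` or `take("reduced")` in a loop; because the router never blocks or rejects, upstream callers see no back-pressure.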

Detect -> Filter -> Alert -> Manual Diagnostics

Alerting

Detection

filter & route

diagnostics

Redundancy
CloudWatch/CollectD - fast, no root cause
App events (exceptions) - too noisy, root cause
Pingdom probes - low coverage, reliable
Internal probes - better coverage, false alarms

cloudwatch → pagerduty alert (no riemann)

system test → pagerduty alert (riemann needed)

filter tests using a state machine

    (tagged "apisRegression"
      (pagerduty-test-dispatch "1234567892ed295d91"))

    (defn pagerduty-test-dispatch [key]
      (let [pd (pagerduty key)]
        (changed-state {:init "passed"}
          (where (state "passed") (:resolve pd))
          (where (state "failed") (:trigger pd)))))

re-open manually resolved alert

    (tagged "apisRegression"
      (pagerduty-test-dispatch "1234567892ed295d91"))

    (defn pagerduty-test-dispatch [key]
      (let [pd (pagerduty key)]
        (sdo
          (changed-state {:init "passed"}
            (where (state "passed") (:resolve pd)))
          (where (state "failed")
            (by [:host :service]
              (throttle 1 60 (:trigger pd)))))))
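The filtering logic of this second Riemann config can be restated as a small Python sketch, which may help readers unfamiliar with Riemann streams. It is an approximation under assumptions: the class name `ProbeAlertFilter` and its method are hypothetical, and Riemann's `throttle` is modelled as "at most one trigger per 60 seconds per host/service". A resolve is sent only when a probe changes to "passed"; a trigger is sent for failed probes regardless of the previous state, so an alert that a human resolved by hand is re-opened while the probe keeps failing.

```python
class ProbeAlertFilter:
    """Decide which probe results become PagerDuty actions."""

    def __init__(self, retrigger_seconds=60.0, initial_state="passed"):
        self.retrigger_seconds = retrigger_seconds
        self.initial_state = initial_state   # mirrors {:init "passed"}
        self._last_state = {}                # (host, service) -> last seen state
        self._last_trigger = {}              # (host, service) -> time of last trigger

    def handle(self, host, service, state, now):
        """Return "trigger", "resolve", or None for one probe event."""
        key = (host, service)
        previous = self._last_state.get(key, self.initial_state)
        self._last_state[key] = state

        if state == "passed":
            # Resolve only on a state change, not on every passing probe.
            return "resolve" if previous != "passed" else None

        if state == "failed":
            # Re-trigger periodically so manually resolved alerts re-open.
            last = self._last_trigger.get(key)
            if last is None or now - last >= self.retrigger_seconds:
                self._last_trigger[key] = now
                return "trigger"
        return None
```

With these defaults, a probe failing at t=10 triggers, a repeat failure at t=30 is suppressed, and a failure at t=70 triggers again.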

Diagnostics - Storm topology timing

Diagnostics - Storm timelines

#2 risk: Slowing down the merchant's website

Alerting
Monitor each and every browser
Aggregations (per browser type)
Notify on thresholds

Monitoring our JavaScript snippet

Timeouts
Exceptions by browser
Exception aggregation
Monitoring new versions

Riemann's Index (server monitoring)

key (host+service)   event                 TTL
10.0.0.1-redisfree   {"metric": "5"}       60
10.0.0.1-probe1      {"state": "failed"}   300
10.0.0.2-probe1      {"state": "passed"}   300

Riemann's Index

key (host+service)   event                     TTL
199.25.1.1-1234      {"state": "loaded"}       300
199.25.2.1-4567      {"state": "downloaded"}   300
199.25.3.1-8901      {"state": "loaded"}       300

For our use case: host=browser-ip, service=cookie

Riemann's state machine

(index) stores the last event and creates expired events (TTL)

(by [:host :service] stream) creates a new stream for each host/service

(by-host-service stream) - Forter's fork only; also closes the stream when the TTL expires

    (defn calc-load-time [& children]
      (by-host-service
        (changed :state {:pairs? true}
          (smap (fn [[previous current]]
                  (cond
                    (and (= (:state previous) "downloaded")
                         (= (:state current) "loaded"))
                    (assoc previous :metric (- (:time current) (:time previous)))

                    (and (= (:state previous) "downloaded")
                         (= (:state current) "expired"))
                    (assoc previous :metric (* JS_TIMEOUT 1000))))
                children))))

    (defn aggregate-by-browser [& children]
      (by [:browser]
        (fixed-time-window 60
          (sdo
            (smap folds/median (tag "median-load-time" children))
            (smap folds/count (tag "load-count" children))))))
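For readers who do not use Riemann, the two stream functions above can be approximated in plain Python. This is a sketch under assumptions, not the deck's implementation: the event field names (`host`, `service`, `browser`, `state`, `time_ms`), the `JS_TIMEOUT_MS` value, and both function names are hypothetical, and the 60-second windowing is left out, so `aggregate_by_browser` summarises whatever batch of samples it is given.

```python
from collections import defaultdict
from statistics import median

JS_TIMEOUT_MS = 10_000  # hypothetical timeout charged when a browser never reports "loaded"


def load_times(events):
    """Yield (browser, load_time_ms) from time-ordered browser events.

    A "downloaded" event followed by "loaded" for the same host/service yields
    the elapsed time; "downloaded" followed by "expired" yields the timeout.
    """
    pending = {}  # (host, service) -> the "downloaded" event still waiting
    for event in events:
        key = (event["host"], event["service"])
        if event["state"] == "downloaded":
            pending[key] = event
        elif key in pending:
            start = pending.pop(key)
            if event["state"] == "loaded":
                yield start["browser"], event["time_ms"] - start["time_ms"]
            elif event["state"] == "expired":
                yield start["browser"], JS_TIMEOUT_MS


def aggregate_by_browser(samples):
    """Return the median load time and sample count per browser type."""
    by_browser = defaultdict(list)
    for browser, load_time_ms in samples:
        by_browser[browser].append(load_time_ms)
    return {
        browser: {"median-load-time": median(values), "load-count": len(values)}
        for browser, values in by_browser.items()
    }
```

Counting expired sessions at the full timeout keeps browsers that never finished loading from disappearing out of the median, which is the point of tracking the TTL expiry.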

#3 risk: Wrong decision (approve/decline)

Alerting
Anomaly detection

Motivation
Control false alarms mathematically
Threshold per customer
Threshold seasonality

Alert me if
the probability that we decline more than k out of n transactions, given decline probability p, is less than 1 in a million (t = 0.0001%)

n = number of txs (30 minutes)
k = number of declined txs (30 minutes)
p = per-customer declined/total (24 hours)
t = alert threshold
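The alert rule above amounts to a binomial tail test: with X ~ Binomial(n, p), compute P(X > k) and alert when it falls below t. A minimal Python sketch follows, assuming declines are independent; the function names and the log-space summation (used so large n does not overflow) are illustrative choices, not taken from the slides.

```python
from math import exp, lgamma, log


def decline_tail_probability(n: int, k: int, p: float) -> float:
    """Return P(X > k) for X ~ Binomial(n, p)."""
    if n < 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need n >= 0 and 0 <= p <= 1")
    if k >= n:
        return 0.0          # cannot decline more than n out of n
    if k < 0:
        return 1.0
    if p == 0.0:
        return 0.0          # X is always 0, and k >= 0 here
    if p == 1.0:
        return 1.0          # X is always n, and n > k here
    log_p, log_q = log(p), log(1.0 - p)
    total = 0.0
    for i in range(k + 1, n + 1):
        # log of C(n, i) * p^i * (1 - p)^(n - i)
        log_pmf = (lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
                   + i * log_p + (n - i) * log_q)
        total += exp(log_pmf)
    return min(total, 1.0)


def should_alert(n: int, k: int, p: float, t: float = 1e-6) -> bool:
    """Alert when observing more than k declines would be rarer than t."""
    return decline_tail_probability(n, k, p) < t
```

For a customer with a 10% baseline decline rate and 100 transactions in the window, `should_alert(100, 30, 0.1)` fires while `should_alert(100, 12, 0.1)` does not. Because p is estimated per customer, each customer effectively gets its own threshold.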

Binomial Distribution Assumption
External events are uncorrelated

What happens when a customer retries the same Tx because the first one was declined?

Questions?
Email itai@forter.com

http://tech.forter.com
http://www.softwarearchitectureaddict.com
