production testing through monitoring
TRANSCRIPT
![Page 1: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/1.jpg)
@papa_fire
Troubleshooting with monitoring
Testing in production
DevOps monitoring [something] testing [something]
monitoring [something] in production
Leon Fayer
![Page 2: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/2.jpg)
❖ @papa_fire ❖ [email protected] ❖ fayerplay.com ❖ slideshare.net/LeonFayer1
THAT’S ME
WHO AM I?
๏ engineer for 20+ years
๏ professional cynic
๏ @ OmniTI
๏ build and operate big systems
๏ we are hiring! ๏ omniti.com/is/hiring
![Page 3: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/3.jpg)
I HATE TESTING
![Page 4: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/4.jpg)
testing is required
![Page 5: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/5.jpg)
testing is not enough
![Page 6: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/6.jpg)
> unit testing
> functional testing
> resilience testing
> performance testing
> …
![Page 7: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/7.jpg)
testing can give a false sense of security
![Page 8: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/8.jpg)
testing is deterministic
![Page 9: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/9.jpg)
data problem
![Page 10: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/10.jpg)
> quantity of data
> frequency of data
> quality of data
![Page 11: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/11.jpg)
example
Wolfe+585
![Page 12: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/12.jpg)
example
Hubert Blaine Wolfeschlegelsteinhausenbergerdorffwelchevoralternwaren-gewissenhaftschaferswessenschafewarenwohlgepflegeundsorgfaltigkeitbe
schutzenvorangreifendurchihrraubgierigfeindewelchevoralternzwolfhunderttausendjahresvorandieerscheinenvonderersteerdemenschderraumschiff
genachtmittungsteinundsiebeniridiumelektrischmotorsgebrauchlichtalsseinursprungvonkraftgestartseinlangefahrthinzwischensternartigraumaufdersuchennachbarschaftdersternwelchegehabtbewohnbarplanetenkreisedrehensichundwo
hinderneuerassevonverstandigmenschlichkeitkonntefortpflanzenundsicherfreuenanlebenslanglichfreudeundruhemitnichteinfurchtvorangreifenvor
andererintelligentgeschopfsvonhinzwischensternartigraum, Sr.
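
The data problem can be sketched in a few lines. This is a minimal illustration, not anything from the talk: the 35-character limit, the fixture names, and the `fits_column` helper are all hypothetical, and the long surname is a stand-in of the length implied by "Wolfe+585" rather than the real string.

```python
# Hypothetical column width chosen by someone who only saw "normal" names.
MAX_SURNAME_LENGTH = 35


def fits_column(surname: str, limit: int = MAX_SURNAME_LENGTH) -> bool:
    """Return True when the surname fits the assumed column width."""
    return len(surname) <= limit


# Fixtures a test suite would typically use (assumed, not from the talk).
test_fixtures = ["Smith", "Fayer", "Wolfe"]

# Stand-in for a surname of "Wolfe" plus 585 more letters.
production_surname = "Wolfe" + "x" * 585

print(all(fits_column(name) for name in test_fixtures))  # every fixture passes
print(fits_column(production_surname))                   # production data does not
```

The tests are deterministic and green; the data that breaks the assumption only exists in production.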
![Page 13: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/13.jpg)
user problem
![Page 14: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/14.jpg)
“Users (n) - distributed fault injection test suite for production”
![Page 15: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/15.jpg)
example
Corrupted Blood bug
![Page 16: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/16.jpg)
example
![Page 17: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/17.jpg)
other factors
![Page 18: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/18.jpg)
> lack of foresight (Y2K bug)
> too many use-cases (female Tauren bug)
> change to assumptions
![Page 19: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/19.jpg)
testing is great for “known knowns”
![Page 20: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/20.jpg)
testing is ok for “known unknowns”
![Page 21: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/21.jpg)
testing is bad for “unknown unknowns”
![Page 22: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/22.jpg)
enter monitoring
![Page 23: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/23.jpg)
why monitor?
![Page 24: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/24.jpg)
because testing isn’t enough
![Page 25: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/25.jpg)
> software is never perfect
> systems are complex
> external dependency worry
> proactive is better than reactive
> …
![Page 26: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/26.jpg)
because things change
![Page 27: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/27.jpg)
because things change in production
![Page 28: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/28.jpg)
what to monitor?
![Page 29: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/29.jpg)
“in God we trust, all others we monitor”
![Page 30: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/30.jpg)
> systems
> databases
> applications
> integration points
> performance
> user behavior
> …
![Page 31: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/31.jpg)
is it enough?
![Page 32: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/32.jpg)
is it too much?
![Page 33: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/33.jpg)
what is important?
![Page 34: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/34.jpg)
what is important?
(i.e. what to alert on)
![Page 35: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/35.jpg)
example
> servers up and running
> HTTP checks return 200
> tweets are lost
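
The gap between "HTTP checks return 200" and "tweets are lost" can be sketched as two checks side by side. This is a rough illustration under stated assumptions: `CheckResult`, `http_status_check`, `round_trip_check`, and the simulated publish/read callables are all hypothetical names, not part of any real monitoring tool or of the system in the talk.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CheckResult:
    name: str
    ok: bool
    detail: str


def http_status_check(fetch_status: Callable[[], int]) -> CheckResult:
    """System-level check: does the endpoint answer with 200?"""
    status = fetch_status()
    return CheckResult("http_status", status == 200, f"status={status}")


def round_trip_check(publish: Callable[[str], None],
                     read_back: Callable[[], List[str]],
                     marker: str) -> CheckResult:
    """Business-level check: does a published item actually show up?"""
    publish(marker)
    found = marker in read_back()
    return CheckResult("round_trip", found, "found" if found else "marker lost")


# Simulated failure mode: the API answers 200 but silently drops writes.
stored: List[str] = []
results = [
    http_status_check(lambda: 200),
    round_trip_check(lambda text: None, lambda: stored, "canary-123"),
]
for result in results:
    print(result.name, "OK" if result.ok else "FAIL", result.detail)
```

The status check stays green while the round-trip check fails, which is the case a system-level check alone would miss.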
![Page 36: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/36.jpg)
s/system checks/unit tests/
![Page 37: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/37.jpg)
“I don’t give a **** if the datacenter is on fire as long as I am still making money”
— CEO
![Page 38: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/38.jpg)
we monitor because things change
![Page 39: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/39.jpg)
changes affect business
![Page 40: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/40.jpg)
top-down approach
> understand business
> define baseline
> correlate data
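
The "define baseline" step might look roughly like the sketch below. It assumes a single business metric (orders per minute), made-up history values, and a simple three-sigma rule; the talk does not say which metric or threshold was used, and real baselines usually need to account for time of day and seasonality.

```python
from statistics import mean, stdev
from typing import List, Tuple


def baseline(history: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of past observations."""
    return mean(history), stdev(history)


def deviates(value: float, history: List[float], max_sigmas: float = 3.0) -> bool:
    """True when value is more than max_sigmas deviations from the baseline."""
    mu, sigma = baseline(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) > max_sigmas * sigma


# Hypothetical orders-per-minute for the same hour on previous days.
orders_history = [118.0, 124.0, 121.0, 119.0, 123.0, 120.0, 122.0]

print(deviates(121.0, orders_history))  # a typical value
print(deviates(80.0, orders_history))   # a sharp drop
```

Alerting on the business metric first, then drilling down, is what makes the approach top-down.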
![Page 41: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/41.jpg)
example
๏ online marketing company
๏ major e-commerce component
๏ ~100 million users
๏ 1 billion emails/month
๏ 300,000 lines of code
๏ 5600+ metrics collected
![Page 42: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/42.jpg)
it all starts with a call …
![Page 43: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/43.jpg)
revenue
![Page 44: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/44.jpg)
revenue + traffic
![Page 45: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/45.jpg)
revenue + traffic + load time
![Page 46: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/46.jpg)
revenue + traffic + load time + db
![Page 47: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/47.jpg)
revenue + traffic + load time + db + email
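
Overlaying metrics one at a time is a manual form of correlation. A small sketch of the same idea in code, assuming purely invented numbers in which email volume falls alongside revenue while the other series stay flat; the metric names and values are illustrative and do not come from the talk.

```python
from math import sqrt
from typing import Dict, List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / (spread_x * spread_y)


# Invented samples over the same six time buckets.
revenue = [100.0, 98.0, 101.0, 70.0, 55.0, 52.0]
metrics: Dict[str, List[float]] = {
    "traffic": [1000.0, 990.0, 1010.0, 1005.0, 995.0, 1000.0],
    "load_time_ms": [300.0, 310.0, 295.0, 305.0, 300.0, 298.0],
    "db_latency_ms": [12.0, 13.0, 12.0, 12.0, 13.0, 12.0],
    "emails_sent": [500.0, 495.0, 505.0, 340.0, 260.0, 250.0],
}

# Rank candidate metrics by how strongly they move with revenue.
ranked = sorted(metrics,
                key=lambda name: abs(pearson(revenue, metrics[name])),
                reverse=True)
for name in ranked:
    print(f"{name}: {pearson(revenue, metrics[name]):+.2f}")
```

A strong correlation only points at where to look next; it does not establish the cause.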
![Page 48: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/48.jpg)
what if …
… email wasn’t monitored?
![Page 49: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/49.jpg)
what if …
… email wasn’t monitored?
(it would be after this)
![Page 50: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/50.jpg)
instrumentation is never done
![Page 51: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/51.jpg)
example
> same symptoms
> higher decline rates
> all metrics are within norm
![Page 52: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/52.jpg)
example
> same symptoms
> higher decline rates
> all metrics are within norm
AmEx blocked
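
One way to surface a case like this is to break the decline rate down by card type instead of watching only the total. The sketch below is an assumption-laden illustration: the transaction tuples, the counts, and the `decline_rates` helper are invented, and the talk does not describe how the block was actually found.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def decline_rates(transactions: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Decline rate per card type from (card_type, approved) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    declined: Dict[str, int] = defaultdict(int)
    for card_type, approved in transactions:
        totals[card_type] += 1
        if not approved:
            declined[card_type] += 1
    return {card: declined[card] / totals[card] for card in totals}


# Invented sample: two card types behave normally, one is fully declined.
transactions = (
    [("visa", True)] * 570 + [("visa", False)] * 30
    + [("mastercard", True)] * 285 + [("mastercard", False)] * 15
    + [("amex", False)] * 40
)

overall = sum(1 for _, approved in transactions if not approved) / len(transactions)
print(f"overall: {overall:.1%}")
for card, rate in decline_rates(transactions).items():
    print(f"{card}: {rate:.1%}")
```

The overall rate moves only modestly, while the per-card view shows one segment failing completely.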
![Page 53: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/53.jpg)
tl;dr
![Page 54: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/54.jpg)
testing and monitoring,
not testing or monitoring
![Page 55: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/55.jpg)
understand the business
![Page 56: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/56.jpg)
continuous improvement
![Page 57: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/57.jpg)
{also bad at conclusions}
![Page 58: Production testing through monitoring](https://reader033.vdocuments.site/reader033/viewer/2022051404/586f8fe61a28ab54768b77b9/html5/thumbnails/58.jpg)
THANK YOU
questions?