Engineer's guide to data analysis
![Page 2: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/2.jpg)
Wix in numbers
~ 400 Engineers
~ 1400 employees
~ 100M Sites
~ 250 micro services
![Page 3: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/3.jpg)
IaaS (Insult as a Service)
▪Thin API, written in Flask (Python)
▪CouchDB
▪Apache proxy
▪StatsD, Graphite, ELK
![Page 4: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/4.jpg)
Architecture
StatsD
![Page 5: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/5.jpg)
Graphite
▪Metrics collector, storage and UI
▪Math functions
▪Common
▪De-facto standard
![Page 6: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/6.jpg)
Oops, I think something is broken
![Page 7: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/7.jpg)
What is this “metric” you speak of?
![Page 8: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/8.jpg)
A metric is
▪Numeric data
▪Often with timestamp (time series)
▪A “measurement” of something
▪Discrete
![Page 9: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/9.jpg)
Where do metrics come from?
▪Events with numeric data
▪Counting/aggregating
▪Sampling
![Page 10: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/10.jpg)
Sampling
![Page 11: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/11.jpg)
Sampling
![Page 12: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/12.jpg)
Events
▪Data about something that happened
▪Has a timestamp (time series data)
▪Has properties - numeric and non-numeric
{
  "timestamp": "2016-11-15T18:43:39+00:00",
  "host": "test01.example.net",
  "status": "ok",
  "latency": 14.31
}
![Page 13: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/13.jpg)
How much data?
10000 events/sec x 0.5 KB/event = ~ 400 GB a day
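A quick sanity check of that figure, as a rough sketch. It assumes decimal units (1 GB = 1,000,000 KB) and a constant event rate for a full day; the variable names are illustrative only.

```python
# Back-of-the-envelope daily telemetry volume.
# Assumptions (not from the slides): decimal units, constant rate for 24 hours.
EVENTS_PER_SEC = 10_000
KB_PER_EVENT = 0.5
SECONDS_PER_DAY = 24 * 60 * 60

kb_per_day = EVENTS_PER_SEC * KB_PER_EVENT * SECONDS_PER_DAY
gb_per_day = kb_per_day / 1_000_000  # 1 GB = 1,000,000 KB in decimal units

print(f"{gb_per_day:.0f} GB/day")  # 432 GB/day, i.e. roughly 400 GB
```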
![Page 14: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/14.jpg)
Telemetry is a big data problem
![Page 15: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/15.jpg)
Aggregates are lossy compression
We must decide in advance how we’ll use the metric
![Page 16: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/16.jpg)
Aggregates
▪Max, Min, Sum, Average, etc
▪Last, random point
▪Percentiles (quantiles)
▪Histograms, reverse quantiles
▪Each is suitable for a particular use case
![Page 17: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/17.jpg)
Averages are mean to me
![Page 18: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/18.jpg)
Percentiles
p99 - The sampled value that is larger than 99% of the other samples
▪O(n) memory complexity
▪ O(n*log n) computation complexity
▪Some shortcuts for p50 (median), p100 (max), p0 (min)
Use when clients experience individual values
![Page 19: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/19.jpg)
Percentiles
▪Percentiles are not additive
▪ You cannot average percentiles
Example:
s1 (100 points) = [0, 0, ....., 100, 100] => p99 = 100
s2 (100 points) = [0, 0, …., 50, 50] => p99 = 50
p99(s1 : s2) = 50, avg(p99(s1), p99(s2)) = 75 => Fail
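The example can be reproduced with a small sketch. The slide elides the exact data, so the two series below are hypothetical stand-ins with the same shape (98 zeros plus two slow points), and `percentile` is an illustrative nearest-rank implementation, not the only possible definition of a percentile.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: sort (O(n log n)), then pick rank ceil(p * n / 100)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical data: the original slide does not list every point.
s1 = [0] * 98 + [100, 100]
s2 = [0] * 98 + [50, 50]

p99_s1 = percentile(s1, 99)
p99_s2 = percentile(s2, 99)
p99_merged = percentile(s1 + s2, 99)   # percentile of the combined samples
avg_of_p99 = (p99_s1 + p99_s2) / 2     # averaging the two percentiles instead

print(p99_s1, p99_s2, p99_merged, avg_of_p99)  # 100 50 50 75.0
```

The merged p99 is 50 while the average of the two p99 values is 75, so the average does not describe the combined population.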
![Page 20: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/20.jpg)
Histograms
Distribution visualization of sample
▪Count of events in each bin
▪Bins are usually evenly spaced
▪Use logarithmically spaced bins for long tails
▪ Additive
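A minimal sketch of why histograms are additive, assuming power-of-two bins as the logarithmic spacing. The `log_bin` and `histogram` helpers and the sample values are made up for illustration.

```python
import math
from collections import Counter

def log_bin(value):
    """Bin k covers [2**k, 2**(k+1)); values below 1 share bin -1."""
    if value < 1:
        return -1
    return math.floor(math.log2(value))

def histogram(samples):
    """Count of events per bin."""
    return Counter(log_bin(v) for v in samples)

# Hypothetical latencies (ms) from two hosts.
host_a = [1, 2, 3, 5, 8, 120]
host_b = [1, 1, 4, 9, 300]

# Merging histograms is just adding bin counts...
merged = histogram(host_a) + histogram(host_b)

# ...and gives the same result as a histogram of all the raw samples.
print(merged == histogram(host_a + host_b))  # True
```

This only works when both histograms share the same bin schema, which is one reason the schema has to be decided up front.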
![Page 21: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/21.jpg)
Histograms :-(
So why aren’t we all using this?
▪Storage
▪Have to decide on bins schema
▪ Not many tools support this
![Page 22: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/22.jpg)
Choosing the right aggregate
▪ Percentiles/histograms for latency
▪ Max/min for latency and sizes
▪ Histogram analysis for sizes and latency
▪ Sums/averages for capacity and money
▪ Aggregate per domain
▪ Look for deviations
![Page 23: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/23.jpg)
Resolution
▪ Humans need ~5 data points to see a trend
▪ Hides faster changes
▪ Rollups/downscaling is hard
▪ Multi tier FTW!
![Page 24: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/24.jpg)
“It ain’t what you don’t know that gets you into trouble.
It’s what you know for sure that just ain’t so.”
![Page 25: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/25.jpg)
Peak Erasure/Spike erosion
■ When lowering resolution, data points are aggregated
■ Default aggregation is average
■ Peaks are erased
■ This can happen in storage or visualization
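A small sketch of peak erasure, using a made-up per-second latency series with one spike. `downsample` is an illustrative helper, not the API of any particular storage engine.

```python
def downsample(points, factor, aggregate):
    """Collapse every `factor` consecutive points into one using `aggregate`."""
    return [aggregate(points[i:i + factor]) for i in range(0, len(points), factor)]

def mean(chunk):
    return sum(chunk) / len(chunk)

# Hypothetical per-second latency (ms) with a single one-second spike.
series = [10, 10, 10, 10, 10, 10, 10, 500, 10, 10, 10, 10]

by_mean = downsample(series, 6, mean)  # the 500 ms spike shrinks to about 92 ms
by_max = downsample(series, 6, max)    # the spike survives: [10, 500]

print(by_mean, by_max)
```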
![Page 26: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/26.jpg)
Peak Erasure/Spike erosion
■ Storages down-sample to save space
■ Aggregation function may be configurable
■ Metric collectors aggregate too
○ carbon-cache uses last value
○ StatsD - gauges, timers, counters
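For reference, a sketch of how the three StatsD metric types look on the wire, using the plain-text `<name>:<value>|<type>` format. The metric names are hypothetical, and the UDP sender assumes a StatsD daemon on the conventional port 8125.

```python
import socket

STATSD_TYPES = {"counter": "c", "gauge": "g", "timer": "ms"}

def statsd_line(name, value, kind):
    """Format one metric in the StatsD plain-text protocol."""
    if kind not in STATSD_TYPES:
        raise ValueError(f"unknown metric kind: {kind}")
    return f"{name}:{value}|{STATSD_TYPES[kind]}"

def send(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send (host and port are assumptions)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))

# Hypothetical metric names for the insult service.
print(statsd_line("insults.served", 1, "counter"))    # insults.served:1|c
print(statsd_line("insults.queue_depth", 7, "gauge")) # insults.queue_depth:7|g
print(statsd_line("insults.latency", 14.31, "timer")) # insults.latency:14.31|ms
```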
![Page 27: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/27.jpg)
Counters vs Gauges
Behaviour in low res time window
■ Low res sampling erases fast changes
■ “Round numbers” syndrome
■ Counters smear changes, but don’t erase them
TLDR: use counters when possible
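A sketch of the difference, with hypothetical numbers: a steady 5 requests per second with a 3-second burst that falls between two low-resolution reads. The gauge never sees the burst; the counter spreads it over the window but keeps it.

```python
# Hypothetical per-second request counts: steady 5 req/s with a short burst.
per_second = [5] * 30
per_second[12:15] = [200, 200, 200]

WINDOW = 10  # low-resolution sampling interval, in seconds

# Gauge: read the instantaneous value once per window.
gauge_samples = [per_second[t] for t in range(0, len(per_second), WINDOW)]

# Counter: a monotonically increasing total, read at the end of each window.
totals = []
running = 0
for count in per_second:
    running += count
    totals.append(running)
counter_samples = [totals[t + WINDOW - 1] for t in range(0, len(per_second), WINDOW)]

# Rate per window = difference between consecutive counter reads / window length.
previous_reads = [0] + counter_samples[:-1]
rates = [(cur - prev) / WINDOW for prev, cur in zip(previous_reads, counter_samples)]

print(gauge_samples)  # [5, 5, 5]
print(rates)          # [5.0, 63.5, 5.0]
```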
![Page 28: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/28.jpg)
Mixed modes
Aggregating multiple modes reduces usability of aggregates
■ Different transaction types differ in latencies/sizes
■ Errors, successes have very different latencies/sizes
■ Makes your graphs weird
TLDR: use separate metrics for different things
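A sketch with made-up latencies: nine fast successes and one slow error. The mixed average describes neither mode, while the per-mode averages are both easy to read.

```python
from statistics import mean

# Hypothetical latencies in ms: successes are fast, the error is a timeout.
ok_latencies = [20, 22, 19, 21, 18, 20, 23, 21, 19]
error_latencies = [3000]

mixed_avg = mean(ok_latencies + error_latencies)  # one metric for everything
ok_avg = mean(ok_latencies)                       # separate metric per mode
error_avg = mean(error_latencies)

print(f"mixed: {mixed_avg:.1f} ms, ok: {ok_avg:.1f} ms, errors: {error_avg:.1f} ms")
```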
![Page 29: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/29.jpg)
Building useful graphs
![Page 30: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/30.jpg)
Visualization
■ Timeframe
■ No more than 3 series
■ Be wary of multiple Y scales, but scale if needed
■ Only related series on the same graph
■ Never mix X scales
■ Visual references: bounds, Y min/max values, legend
![Page 31: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/31.jpg)
Metric design
■ Choose your aggregates wisely
■ Decide on a proper resolution, sampling rate, aggregation time windows
■ Explore the distribution
■ Separate known modes to independent metrics
![Page 32: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/32.jpg)
Separate signal from noise
■ Use low-pass filters to smooth
■ Trend changes
■ Timeshifts
■ Filter out outliers
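One of the simplest low-pass filters is a trailing moving average. The sketch below uses a hypothetical noisy series with a level shift; `moving_average` is an illustrative helper, and the window size is an arbitrary choice.

```python
def moving_average(points, window):
    """Trailing moving average: each output is the mean of the last `window` points."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(points)):
        chunk = points[max(0, i - window + 1):i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Hypothetical noisy metric whose level shifts from about 10 to about 30.
noisy = [10, 12, 8, 11, 9, 30, 31, 29, 32, 30]
smooth = moving_average(noisy, 3)

print([round(v, 1) for v in smooth])
```

The noise is damped, but the trend change also shows up a couple of points late, which is the usual trade-off with smoothing.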
![Page 33: Engineers guide to data analysis](https://reader036.vdocuments.site/reader036/viewer/2022081605/58ef83da1a28ab18418b45f9/html5/thumbnails/33.jpg)
Working with clusters
■ Most-deviant/outliers
■ Max/Min
■ Sum (capacity)
■ Pre-aggregate percentiles
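A sketch of the most-deviant idea for a cluster: rank hosts by their distance from the cluster median. The host names and values are hypothetical, and distance from the median is just one possible definition of "deviant".

```python
from statistics import median

def most_deviant(per_host, n=1):
    """Return the `n` hosts whose value is farthest from the cluster median."""
    center = median(per_host.values())
    ranked = sorted(per_host, key=lambda host: abs(per_host[host] - center), reverse=True)
    return ranked[:n]

# Hypothetical p99 latency (ms) per host.
p99_by_host = {"web01": 41, "web02": 39, "web03": 40, "web04": 180, "web05": 42}

print(most_deviant(p99_by_host))   # ['web04']
print(max(p99_by_host.values()))   # 180
print(min(p99_by_host.values()))   # 39
```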