Statistics 101 for System Administrators
DESCRIPTION
Learning and using elements of statistics (distributions, standard deviation, linear correlation) in Python is very simple. The slides show an example of managing some data series for network troubleshooting.
TRANSCRIPT
Statistics 101 for System Administrators
EuroPython 2014, 22nd July - Berlin
Roberto Polli - [email protected]
Babel Srl - P.zza S. Benedetto da Norcia, 33 - 00040 Pomezia (RM) - www.babel.it
22 July 2014 - Roberto Polli - [email protected]
Who? What? Why?
• Using (and learning) elements of statistics with Python.
• Roberto Polli - Community Manager @ Babel.it. Loves writing in C, Java and Python. Red Hat Certified Engineer and Virtualization Administrator.
• Babel - Proud sponsor of this talk ;) Delivers large mail infrastructures based on Open Source software for Italian ISPs and public administration. Contributes to various FLOSS projects.
Agenda
• A latency issue: what happened?
• Correlation in 30”
• Combining data
• Plotting time
• modules: scipy, matplotlib
A Latency Issue
• Episodic network latency issues
• Log traces: message size, #peers, retransmissions
• Do we need to scale? Was it a peak problem?
Find a rapid answer with python!
Basic statistics
Python (via numpy) provides basic statistics, like
from numpy import mean  # x̄
from numpy import std   # σX
T = { 'ts': (1, 2, 3, ...),
      'late': (0.12, 6.31, 0.43, ...),
      'peers': (2313, 2313, 2312, ...),
      ...}
# one row per series: name, max, min, mean, standard deviation
print([(k, max(X), min(X), mean(X), std(X)) for k, X in T.items()])
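The same summary can be sketched with only the standard library. This is a minimal illustration, not the talk's code: the values in `T` are made-up sample data and `summarize` is a hypothetical helper name. `statistics.pstdev` is the population standard deviation, which is what numpy's `std` computes by default.

```python
from statistics import mean, pstdev

# Hypothetical sample data standing in for the parsed log series.
T = {
    "ts": (1, 2, 3, 4, 5),
    "late": (0.12, 6.31, 0.43, 0.25, 0.38),
    "peers": (2313, 2313, 2312, 2310, 2311),
}


def summarize(table):
    """Return {name: (min, max, mean, population std dev)} for each series."""
    return {
        name: (min(values), max(values), mean(values), pstdev(values))
        for name, values in table.items()
    }


summary = summarize(T)
for name, stats in summary.items():
    print(name, stats)
```

A single outlier such as the 6.31 latency sample pulls both the mean and the standard deviation up, which is one reason to look at the whole distribution next.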
Distributions
Data distribution - aka δX - shows event frequency.
# The fastest way to get a
# distribution is
from matplotlib import pyplot as plt
freq, bins, _ = plt.hist(T['late'])
# plt.hist returns the frequencies and the bin edges
distribution = zip(bins, freq)
A ping rtt distribution
[Figure: "Ping RTT distribution" histogram; x-axis: rtt in ms, from 158.0 to 162.0]
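What a histogram computes can be sketched without plotting anything. The helper below is a hypothetical, standard-library stand-in for the binning step (it is not how matplotlib is implemented), and `late_ms` is made-up data. Like `plt.hist`, it returns one more edge than there are bins.

```python
def histogram(values, bins=10):
    """Return (edges, counts) for equal-width bins over the data range."""
    low, high = min(values), max(values)
    # Fall back to a width of 1.0 when all values are equal.
    width = (high - low) / bins or 1.0
    counts = [0] * bins
    for v in values:
        # The maximum value is counted in the last bin.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    edges = [low + i * width for i in range(bins + 1)]
    return edges, counts


# Hypothetical latency samples in milliseconds.
late_ms = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
edges, counts = histogram(late_ms, bins=4)
for left, count in zip(edges, counts):
    print("from", left, ":", "#" * count)
```

Pairing `edges` with `counts` through `zip` silently drops the last edge, which is also what happens in `zip(bins, freq)`.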
Correlation I
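Are two data series X, Y related? Given $\Delta x_i = x_i - \bar{x}$, Mr. Pearson answered with this formula:

$$\rho(X, Y) = \frac{\sum_i \Delta x_i \, \Delta y_i}{\sqrt{\sum_i \Delta x_i^2 \; \sum_i \Delta y_i^2}} \in [-1, +1] \qquad (1)$$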
ρ identifies if the values of X and Y ‘move’ together on the same line.
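Formula (1) translates almost line by line into plain Python. The function below is an illustrative sketch (the name `pearson` and the error handling are assumptions, not part of the talk); in practice scipy's `pearsonr` does the same job and also returns a p-value.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient, following formula (1)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 samples")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]  # the Δx_i terms
    dy = [y - mean_y for y in ys]  # the Δy_i terms
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        raise ValueError("ρ is undefined for a constant series")
    return numerator / denominator


# Two series on the same line: ρ is (numerically) +1 or -1.
print(pearson(range(0, 100), range(0, 400, 4)))
print(pearson([1, 2, 3], [3, 2, 1]))
```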
You must (scatter) plot
ρ doesn’t find non-linear correlation!
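A small numeric sketch of why plotting matters: a parabola is completely determined by x, yet its ρ over a symmetric range is zero. The `pearson` helper here is a compact, hypothetical stand-in for scipy's `pearsonr`, and the data is synthetic.

```python
from math import sqrt


def pearson(xs, ys):
    """Compact Pearson's ρ for two equal-length, non-constant series."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy)
    )


xs = list(range(-5, 6))
line = [2 * x + 1 for x in xs]   # linear relation
parabola = [x * x for x in xs]   # non-linear relation

rho_line = pearson(xs, line)
rho_parabola = pearson(xs, parabola)
print("line:", rho_line)
print("parabola:", rho_parabola)
```

A scatter plot of `xs` against `parabola` shows the relation at a glance, while ρ alone would suggest there is none.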
Probability Indicator
Python scipy provides a correlation function, returning two values:
• the ρ correlation coefficient ∈ [−1,+1]
• the probability that such datasets are produced by uncorrelated systems
from random import randint
from scipy.stats import pearsonr  # our beloved ρ
a, b = range(0, 100), range(0, 400, 4)
c, d = [randint(0, 100) for x in a], [randint(0, 100) for x in a]
correlation, probability = pearsonr(a, b)  # ρ = 1.000, p = 0.000
# c and d are random, so the values change at every run
correlation, probability = pearsonr(c, d)  # e.g. ρ = −0.041, p = 0.683
Combinations
itertools is a gold mine of useful tools.
from itertools import combinations
# returns all possible combinations of
# items grouped by N at a time
items = "heart spades clubs diamonds".split()
combinations(items, 2)
# And now all possible combinations between
# dataset fields!
combinations(T, 2)
Combining 4 suits, 2 at a time.
♥♠ ♥♣ ♥♦ ♠♣ ♠♦ ♣♦
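A runnable sketch of the two calls above. `combinations` returns a lazy iterator, so it is wrapped in `list` here to show the pairs; the contents of `T` are made-up sample data.

```python
from itertools import combinations

suits = "heart spades clubs diamonds".split()
pairs = list(combinations(suits, 2))
print(len(pairs), "pairs")
for pair in pairs:
    print(pair)

# Hypothetical dataset: iterating a dict yields its keys,
# while .items() yields (name, values) tuples.
T = {
    "ts": (1, 2, 3),
    "late": (0.12, 6.31, 0.43),
    "peers": (2313, 2313, 2312),
}
field_pairs = list(combinations(T, 2))
item_pairs = list(combinations(T.items(), 2))
print(field_pairs)
```

The order of the items inside a pair does not matter to `combinations`, so each couple of series is visited exactly once.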
Netfishing correlation I
# Now we have all the ingredients for
# net-fishing relations between our data!
for (k1, v1), (k2, v2) in combinations(T.items(), 2):
    # Look for correlations between every dataset!
    corr, prob = pearsonr(v1, v2)
    if corr > .6:
        print("Series", k1, k2, "can be correlated", corr)
    elif prob < 0.05:
        print("Series", k1, k2, "probability lower than 5%", prob)
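The loop can be tried end to end on toy data. This is a sketch under stated assumptions, not the talk's code: the series in `T` are invented, `pearson` is a plain-Python stand-in for scipy's `pearsonr` (so there is no p-value), and, as a variation on the slide, the check uses `abs(rho)` so that strong negative correlations are reported too.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Compact Pearson's ρ for two equal-length, non-constant series."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    dx = [x - mx for x in xs]
    dy = [y - my for y in ys]
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(
        sum(a * a for a in dx) * sum(b * b for b in dy)
    )


def strongly_related(table, threshold=0.6):
    """Return [(name1, name2, rho)] for pairs whose |rho| exceeds threshold."""
    found = []
    for (k1, v1), (k2, v2) in combinations(table.items(), 2):
        rho = pearson(v1, v2)
        if abs(rho) > threshold:
            found.append((k1, k2, rho))
    return found


# Hypothetical series: 'retrans' grows with 'size', 'free_slots'
# shrinks with it, 'noise' is unrelated to the others.
T = {
    "size": (1, 2, 3, 4, 5, 6),
    "retrans": (2, 4, 6, 8, 10, 12),
    "free_slots": (6, 5, 4, 3, 2, 1),
    "noise": (3, 1, 2, 2, 1, 3),
}

found = strongly_related(T)
for k1, k2, rho in found:
    print("Series", k1, k2, "can be correlated", round(rho, 3))
```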
Netfishing correlation II
Now plot all combinations: there’s more than meets the eye!
# Plot everything, and insert data in plots!
for (k1, v1), (k2, v2) in combinations(T.items(), 2):
    corr, prob = pearsonr(v1, v2)
    plt.scatter(v1, v2)
    # 3 digit precision on title
    plt.title("R={:0.3f} P={:0.3f}".format(corr, prob))
    plt.xlabel(k1); plt.ylabel(k2)
    # save and close the plot
    plt.savefig("{}_{}.png".format(k1, k2)); plt.close()
Color is the 3rd dimension
from itertools import cycle
colors = cycle("rgb")  # use more than 3 colors!
labels = cycle("morning afternoon night".split())
# datalen is the number of samples in each series
size = datalen // 3  # 3 colors, right?
for (k1, v1), (k2, v2) in combinations(T.items(), 2):
    # one scatter call per slice, each with its own color and label
    for i in range(0, datalen, size):
        plt.scatter(v1[i:i+size], v2[i:i+size],
                    color=next(colors), label=next(labels))
    # set title, save plot & co
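The slicing step can be checked on its own, without plotting. The sketch below assumes the samples are ordered by time and that each period covers roughly a third of them; `chunk_by_period` is a hypothetical helper. It rounds the chunk size up, so a series whose length is not a multiple of three still yields exactly three slices.

```python
def chunk_by_period(series, labels=("morning", "afternoon", "night")):
    """Split a time-ordered series into len(labels) consecutive chunks."""
    # Ceiling division: no sample is left over for an extra chunk.
    size = -(-len(series) // len(labels))
    return [
        (label, series[i * size:(i + 1) * size])
        for i, label in enumerate(labels)
    ]


samples = list(range(10))  # hypothetical series with 10 samples
chunks = chunk_by_period(samples)
for label, chunk in chunks:
    print(label, chunk)
```

Each `(label, chunk)` pair can then be passed to one `plt.scatter` call with its own color.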
Latency Solution
• Latency wasn’t related to packet size or system throughput
• Errors were not related to packet size
• Discovered system throughput
Wrap Up
• Use statistics: it’s easy
• Don’t use ρ to exclude relations
• Plot, Plot, Plot
• Continue collecting results
That’s all folks!
Thank you for your attention!
Roberto Polli - [email protected]