Statistics for Programmers

Upload: grob

Post on 07-Jul-2018


  • 8/19/2019 Statistics for Programmers

    1/129

    Jake VanderPlas @jakevdp

    Sept 17, 2015

    About Me

    Jake VanderPlas (Twitter/Github: jakevdp)

    - Astronomer by training
    - Data Scientist at UW eScience Institute
    - Active in Python science & open source
    - Blog at Pythonic Perambulations
    - Author of two books:

    Statistics is Hard.

    Using programming skills, it can be easy.


    My thesis today: 

    If you can write a for-loop, you can do statistics

    Warm-up: Coin Toss

    You toss a coin 30 times and see 22 heads. Is it a fair coin?

    A fair coin should show 15 heads in 30 tosses. This coin is biased.

    Even a fair coin could show 22 heads in 30 tosses. It might be just chance.

    Classic Method:

    Assume the Skeptic is correct: test the Null Hypothesis.

    Assuming a fair coin, compute the probability of seeing 22 heads simply by chance.

    Classic Method:

    Start computing probabilities . . .

    - Number of arrangements (binomial coefficient)
    - Probability of N_H heads
    - Probability of N_T tails

    0.8 %

    Given a fair coin, the probability of seeing at least 22 heads is 0.8% (i.e. p = 0.008).
    → reject fair coin hypothesis at p < 0.05
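The formula itself did not survive in this transcript, but the three annotations (number of arrangements, probability of the heads, probability of the tails) match the standard binomial probability. A minimal sketch, assuming that reading and that the 0.8 % is the tail probability of 22 or more heads; the function name `binomial_tail` is made up for illustration:

```python
from math import comb


def binomial_tail(n_tosses, min_heads, p_heads=0.5):
    """Probability of at least `min_heads` heads in `n_tosses` tosses."""
    return sum(
        comb(n_tosses, k)                          # number of arrangements
        * p_heads ** k                             # probability of k heads
        * (1 - p_heads) ** (n_tosses - k)          # probability of the tails
        for k in range(min_heads, n_tosses + 1)
    )


p_value = binomial_tail(30, 22)
print(f"{p_value:.4f}")  # prints 0.0081
```

The result agrees with the 0.8 % quoted on the slide.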

    Could there be an easier way?

    Easier Method:

    Just simulate it!

    from numpy.random import randint

    M = 0
    for i in range(10000):
        trials = randint(2, size=30)  # 30 tosses: 0 = tails, 1 = heads
        if trials.sum() >= 22:
            M += 1
    p = M / 10000  # 0.008149

    → reject fair coin at p = 0.008

    In general . . .

    Computing the Sampling Distribution is Hard.

    Simulating the Sampling Distribution is Easy.

    Four Recipes for Hacking Statistics:

    1. Direct Simulation
    2. Shuffling
    3. Bootstrapping
    4. Cross Validation

    Sneetches: Stars and Intelligence

    Now, the Star-Belly Sneetches had bellies with stars.
    The Plain-Belly Sneetches had none upon thars . . .

    *adapted from John Rauser’s Statistics Without All The Agonizing Pain

    Sneetches: Stars and Intelligence

    Test Scores

    ★        ❌
    84 72    81 69
    57 46    74 61
    63 76    56 87
    99 91    69 65
             66 44
             62 69

    ★ mean: 73.5
    ❌ mean: 66.9
    difference: 6.6

    Is this difference of 6.6 statistically significant?

    Classic Method (Student’s t distribution)

    Degree of Freedom: “The number of independent ways by which a dynamic system can move, without violating any constraint imposed on it.”
    -Wikipedia

    Classic Method (Welch–Satterthwaite equation)

    https://en.wikipedia.org/wiki/Welch%E2%80%93Satterthwaite_equation

    Classic Method

    1.7959

    “The difference of 6.6 is not significant at the p=0.05 level”

    Let’s use a sampling method instead

    Problem:
    Unlike coin flipping, we don’t have a probabilistic model . . .

    Solution:
    Shuffling

    Shuffle Labels

    ★        ❌
    84 72    81 69
    57 46    74 61
    63 76    56 87
    99 91    69 65
             66 44
             62 69

    Idea: Simulate the distribution by shuffling the labels repeatedly and computing the desired statistic.

    Motivation: if the labels really don’t matter, then switching them shouldn’t change the result!

    1. Shuffle Labels
    2. Rearrange
    3. Compute means

    ★        ❌
    84 81    72 69
    61 69    74 57
    65 76    56 87
    99 44    46 63
             66 91
             62 69

    ★ mean: 72.4
    ❌ mean: 67.6
    difference: 4.8

    ★        ❌
    84 56    72 69
    61 63    74 57
    65 66    81 87
    62 44    46 69
             76 91
             99 69

    ★ mean: 62.6
    ❌ mean: 74.1
    difference: -11.6

    ★        ❌
    74 56    72 69
    61 63    84 57
    87 76    81 65
    91 99    46 69
             66 62
             44 69

    ★ mean: 75.9
    ❌ mean: 65.3
    difference: 10.6

    [Figure: histogram of score differences from the shuffles; x-axis: score difference, y-axis: number; annotated: 16 %]

    “A difference of 6.6 is not significant at p = 0.05.”

    That day, all the Sneetches forgot about stars
    And whether they had one, or not, upon thars.

    Notes on Shuffling:

    - Works when the Null Hypothesis assumes two groups are equivalent
    - Like all methods, it will only work if your samples are representative; always be careful about selection biases!
    - For more discussion & references, see Statistics is Easy by Shasha & Wilson
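The shuffle, rearrange, and compute-means recipe fits in one for-loop. The sketch below is one possible reading of it, not the speaker's own code: it assumes the first two columns of the score table are the ★ group and the rest the ❌ group (this split reproduces the quoted means of 73.5 and 66.9), and the function name, the fixed seed, and the 10,000 shuffles are illustrative choices.

```python
import random

# Test scores, split by label (assumed reading of the slide's table).
star = [84, 72, 57, 46, 63, 76, 99, 91]
plain = [81, 69, 74, 61, 56, 87, 69, 65, 66, 44, 62, 69]


def mean(values):
    return sum(values) / len(values)


def shuffled_differences(group_a, group_b, n_shuffles=10000, seed=0):
    """Difference of means after repeatedly shuffling the group labels."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    diffs = []
    for _ in range(n_shuffles):
        rng.shuffle(pooled)                                     # 1. shuffle labels
        new_a, new_b = pooled[:n_a], pooled[n_a:]               # 2. rearrange
        diffs.append(mean(new_a) - mean(new_b))                 # 3. compute means
    return diffs


observed = mean(star) - mean(plain)
diffs = shuffled_differences(star, plain)

# Fraction of shuffles with a difference at least as large as the observed one.
p_value = sum(d >= observed for d in diffs) / len(diffs)
print(f"observed difference: {observed:.1f}")
print(f"fraction of shuffles at least as large: {p_value:.2f}")
```

The fraction should land near the 16 % shown in the deck; the exact value depends on the seed and the number of shuffles.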

    Four Recipes for Hacking Statistics:

    1. Direct Simulation
    2. Shuffling
    3. Bootstrapping
    4. Cross Validation

    How High can Yertle stack his turtles?

    Observe 20 of Yertle’s turtle towers . . .

    # of turtles:
    48 24 32 61 51 12 32 18 19 24
    21 41 29 21 25 23 42 18 23 13

    - What is the mean of the number of turtles in Yertle’s stack?
    - What is the uncertainty on this estimate?


    Classic Method:

    Sample Mean:

    Standard Error of the Mean:
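The two formulas on this slide are not reproduced in the transcript. A minimal sketch, assuming they are the textbook definitions (sample mean = sum of the values divided by N; standard error = sample standard deviation with the N - 1 denominator, divided by the square root of N), applied to the 20 tower heights listed earlier:

```python
import math
import statistics

# The 20 observed tower heights from the deck.
heights = [48, 24, 32, 61, 51, 12, 32, 18, 19, 24,
           21, 41, 29, 21, 25, 23, 42, 18, 23, 13]

sample_mean = statistics.mean(heights)

# statistics.stdev uses the N - 1 denominator (sample standard deviation).
standard_error = statistics.stdev(heights) / math.sqrt(len(heights))

print(f"{sample_mean:.1f} ± {standard_error:.1f}")
```

Rounded to whole turtles this gives 29 ± 3, the same estimate the deck later recovers by resampling.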

    What assumptions go into these formulae?

    Can we use sampling instead?

    Problem:
    We need a way to simulate samples, but we don’t have a generating model . . .

    Solution:
    Bootstrap Resampling

    Bootstrap Resampling:

    48 24 51 12
    21 41 25 23
    32 61 19 24
    29 21 23 13
    32 18 42 18

    Idea: Simulate the distribution by drawing samples with replacement.

    Motivation: The data estimates its own distribution; we draw random samples from this distribution.

    21 19 25 24 23 19 41 23 41 18
    61 12 42 42 42 19 18 61 29 41
    → 31.05

    Repeat this several thousand times . . .

    Recovers The Analytic Estimate!

    from numpy import mean, std, zeros
    from numpy.random import randint

    # N: NumPy array of the 20 observed tower heights
    xbar = zeros(10000)
    for i in range(10000):
        sample = N[randint(20, size=20)]  # draw 20 heights with replacement
        xbar[i] = mean(sample)
    mean(xbar), std(xbar)
    # (28.9, 2.9)

    Height = 29 ± 3 turtles

    Bootstrap sampling can be applied even to more involved statistics

    Bootstrap on Linear Regression:

    for i in range(10000):
        idx = randint(20, size=20)  # resample indices with replacement
        slope, intercept = fit(x[idx], y[idx])
        results[i] = (slope, intercept)
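The slide's `fit`, `x`, and `y` are not defined in the transcript. The sketch below is one way the regression bootstrap could be made runnable, under two loudly stated assumptions: `numpy.polyfit` with degree 1 stands in for `fit`, and the data are synthetic points scattered around a hypothetical line y = 2x + 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data (not from the talk): 20 noisy points around y = 2x + 1.
x = np.linspace(0.0, 10.0, 20)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)

n_boot = 2000
results = np.empty((n_boot, 2))
for b in range(n_boot):
    # Resample (x, y) pairs with replacement by drawing random indices.
    idx = rng.integers(0, x.size, size=x.size)
    # polyfit returns coefficients from the highest power down: slope, intercept.
    results[b] = np.polyfit(x[idx], y[idx], 1)

slope_mean, intercept_mean = results.mean(axis=0)
slope_std, intercept_std = results.std(axis=0)
print(f"slope = {slope_mean:.2f} ± {slope_std:.2f}")
print(f"intercept = {intercept_mean:.2f} ± {intercept_std:.2f}")
```

The spread of the resampled slopes and intercepts plays the same role as `std(xbar)` did for the mean: it is the bootstrap estimate of the uncertainty on each fitted parameter.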

    Notes on Bootstrapping:

    - Bootstrap resampling rests on sound theoretical grounds.
    - Bootstrapping doesn’t work well for rank-based statistics (e.g. maximum value)
    - Works poorly with very few samples (N > 20 is a good rule of thumb)
    - Always be careful about selection biases!

    Four Recipes for Hacking Statistics:

    1. Direct Simulation
    2. Shuffling
    3. Bootstrapping
    4. Cross Validation

    Onceler Industries: Sales of Thneeds

    I'm being quite useful! This thing is a Thneed.
    A Thneed's a Fine-Something-That-All-People-Need!

    Thneed sales seem to show a trend with temperature . . .

    y = a + b x
    y = a + b x + c x^2

    But which model is a better fit?

    Can we judge by root-mean-square error?

    y = a + b x: RMS error = 63.0
    y = a + b x + c x^2: RMS error = 51.5

    In general, more flexible models will always have a lower RMS error.

    y = a + b x
    y = a + b x + c x^2
    y = a + b x + c x^2 + d x^3
    y = a + b x + c x^2 + d x^3 + e x^4
    y = a + ⋯

    RMS error does not tell the whole story.

    y = a + b x + c x^2 + d x^3 + e x^4 + f x^5 + ⋯ + n x^14

    Not to worry: Statistics has figured this out.

    Classic Method

    Difference in Mean Squared Error follows chi-square distribution:

    Can estimate degrees of freedom easily because the models are nested . . .

    Now plug in all our numbers and . . .

    Easier Way: Cross Validation

    Cross-Validation

    1. Randomly Split data
    2. Find the best model for each subset
    3. Compare models across subsets
    4. Compute RMS error for each

       RMS = 48.9
       RMS = 55.1
       RMS estimate = 52.1

    5. Compare cross-validated RMS for models:

    Best model minimizes the cross-validated error.
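The five steps can be sketched with NumPy as follows. The Thneed sales data are not in the transcript, so this example uses a hypothetical synthetic dataset (a quadratic trend plus noise); the function names, the seed, and the choice of polynomial degrees are likewise illustrative assumptions.

```python
import numpy as np


def rms_error(coeffs, x, y):
    """Root-mean-square difference between the polynomial and the data."""
    return float(np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2)))


def two_fold_cv_rms(x, y, degree, rng):
    """Fit on one random half, score on the other, swap, and average."""
    order = rng.permutation(x.size)              # 1. randomly split data
    half = x.size // 2
    fold_a, fold_b = order[:half], order[half:]
    errors = []
    for train, test in ((fold_a, fold_b), (fold_b, fold_a)):
        coeffs = np.polyfit(x[train], y[train], degree)     # 2. fit each subset
        errors.append(rms_error(coeffs, x[test], y[test]))  # 3-4. score on the other
    return sum(errors) / len(errors)


rng = np.random.default_rng(0)

# Hypothetical stand-in for the sales data: quadratic trend plus noise.
temperature = np.linspace(0.0, 10.0, 40)
sales = (50 + 40 * temperature - 4 * temperature**2
         + rng.normal(scale=10.0, size=temperature.size))

train_rms = {}
cv_rms = {}
for degree in (1, 2, 6):                         # 5. compare models
    full_fit = np.polyfit(temperature, sales, degree)
    train_rms[degree] = rms_error(full_fit, temperature, sales)
    cv_rms[degree] = two_fold_cv_rms(temperature, sales, degree, rng)
    print(f"degree {degree}: training RMS = {train_rms[degree]:.1f}, "
          f"cross-validated RMS = {cv_rms[degree]:.1f}")
```

The training RMS can only go down as the degree grows, which is why it cannot pick a model on its own. The cross-validated RMS scores each fit on data it has not seen, so it penalizes models that are too rigid or too flexible.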

    . . . I biggered the loads of the thneeds I shipped out! I was shipping them forth, to the South, to the East, to the West, to the North!

    Notes on Cross-Validation:

    - This was “2-fold” cross-validation; other CV schemes exist & may perform better for your data (see e.g. scikit-learn docs)
    - Cross-validation is the go-to method for model evaluation in machine learning, as statistics of the models are often not known in the classical sense.

    Four Recipes for Hacking Statistics:

    1. Direct Simulation
    2. Shuffling
    3. Bootstrapping
    4. Cross Validation

    Sampling Methods allow you to replace non-intuitive statistical rules with intuitive computational approaches!

    If you can write a for-loop you can do statistical analysis.

    Things I didn’t have time for:

    - Bayesian Methods: very intuitive & powerful approaches to more sophisticated modeling. (See Bayesian Methods for Hackers by Cam Davidson-Pilon)
    - Selection Bias: if you get data selection wrong, you’ll have a bad time. (See Chris Fonnesbeck’s Scipy 2015 talk, Statistical Thinking for Data Science)
    - Detailed considerations on use of sampling, shuffling, and bootstrapping. (I recommend Statistics Is Easy, a book by Shasha & Wilson)

    ~ Thank You! ~

    Email: [email protected]
    Twitter: @jakevdp
    Github: jakevdp
    Web: http://vanderplas.com/
    Blog: http://jakevdp.github.io/

    Slides available at
    https://speakerdeck.com/jakevdp/statistics-for-hackers/