The Winter Institute on Statistical Literacy for Librarians Demystifying statistics for the practitioner Anna Bombak, Chuck Humphrey, Larry Laliberte, David Sulz, and Amanda Wakaruk February 18-20, 2015



Outline

• Introductions
• A framework for understanding statistics
• How geography shapes statistics
• Official statistics: national
• Official statistics: international
• Non-official statistics
• Applying what you have learned


Introductions: your backgrounds

Please introduce yourself:
• Your name
• Your institutional affiliation
• Your current occupation or activity


Introductions: your backgrounds

A little over three-fourths are from an academic library setting. The split in earlier Institutes was closer to 50/50.

The largest group, with 11, is from universities other than the U of A.

The second-largest groups, with 6 each, are from SLIS and Government.

[Bar chart: number of participants by setting. Academic library setting: Other Universities (11), UAL (4), SLIS (6). Non-academic libraries: Government (6), Public / Special (0).]


Introductions: your backgrounds

Geographically, 11 of you are from outside Alberta.

Nine are from four other provinces: six BC, one each from NS, ON, and SK.

Sixteen are from the Edmonton region.

Two are from outside Canada: one from the United States and one from South Africa, welcome!

[Bar chart: number of participants by region. Alberta (16), Outside Alberta (9), Outside Canada (2).]


Uses of quantitative evidence

• To provide a description: This typically entails answering a question about the scale or scope of something observable and its characteristics.

• To make a comparison: This usually involves establishing the degree of similarity or dissimilarity among observables.

• To identify a relationship: This method looks at the correlation among characteristics of observables, that is, how are things related?
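
As a rough illustration of these three uses, the sketch below computes one description, one comparison, and one relationship from a small table. The regions, debt figures, and income figures are invented for illustration and do not come from any source cited in this deck.

```python
import math

# Hypothetical observations: (region, average household debt, average household income),
# both in thousands of dollars. All values are invented for illustration.
observations = [
    ("Region A", 62.0, 71.0),
    ("Region B", 85.0, 90.0),
    ("Region C", 48.0, 55.0),
    ("Region D", 97.0, 104.0),
]

debts = [debt for _, debt, _ in observations]
incomes = [income for _, _, income in observations]


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Description: what is the scale of household debt?
average_debt = mean(debts)

# Comparison: how dissimilar are the highest and lowest regions?
debt_gap = max(debts) - min(debts)

# Relationship: do debt and income move together?
debt_income_r = pearson_r(debts, incomes)

print(f"Average debt: {average_debt:.1f}")
print(f"Gap between highest and lowest region: {debt_gap:.1f}")
print(f"Correlation of debt with income: {debt_income_r:.2f}")
```

A correlation near 1 here only says the two invented columns rise together; it says nothing about why.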


Examples of quantitative evidence

• Description: “Inside the 2014 Forbes billionaires list: facts and figures,” Forbes, March 3, 2014

• Comparison: “New alarm bells over household debt,” The Globe and Mail, Feb 5, 2015

• Relationship: “American gun ownership and suicide rates,” The Economist, Feb 2, 2015


Statistics are ubiquitous

“Statistics are generated today about nearly every activity on the planet. Never before have we had so much statistical information about the world in which we live. Why is this type of information so abundant? For one thing, statistics have become a form of currency in today’s information society. Through computing technology, society has become very proficient in calculating statistics from the vast quantities of data that are collected. As a result, our lives involve daily transactions revolving around some use of statistical information.”

Data Basics, page 1.1


More statistics in electronic formats


In the past 25 years we have had a decline in the publication of official and non-official statistics in print, while the publication of these statistics in electronic formats has grown exponentially.

This has just heightened the problem of finding statistics.


Statistics: what are we talking about?

Statistics and data are related but different



Microdata


Microdata record layout


How statistics and data differ

Statistics
• numeric facts and figures that provide summaries
• derived from data, i.e., already processed
• requires definitions and classifications
• presentation-ready
• published

Data
• numeric files created and organized for processing
• requires processing to be useful
• requires detailed documentation
• not display-ready
• disseminated, not published
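
To make the contrast concrete, the sketch below starts from a few microdata records (one row per respondent) and processes them into a small summary table. The records and the field names `sex` and `smoker` are hypothetical and are not drawn from any real survey file.

```python
from collections import Counter

# Data: one record per respondent. Not display-ready; needs processing.
# Records and field names are hypothetical.
microdata = [
    {"id": 1, "sex": "F", "smoker": True},
    {"id": 2, "sex": "M", "smoker": False},
    {"id": 3, "sex": "F", "smoker": False},
    {"id": 4, "sex": "M", "smoker": True},
    {"id": 5, "sex": "M", "smoker": True},
    {"id": 6, "sex": "F", "smoker": False},
]

# Processing: count respondents and smokers within each category of sex.
respondents = Counter(record["sex"] for record in microdata)
smokers = Counter(record["sex"] for record in microdata if record["smoker"])

# Statistics: a summary derived from the data, ready for presentation.
percent_smokers = {
    sex: 100 * smokers[sex] / respondents[sex] for sex in sorted(respondents)
}

print("Sex  Respondents  Smokers (%)")
for sex, pct in percent_smokers.items():
    print(f"{sex:<4} {respondents[sex]:>11}  {pct:>10.1f}")
```

The list of records is what gets disseminated as data; the printed table is what gets published as statistics.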


Statistics are presentation ready

Tables, charts, and graphs are typically used to display statistics. You will find statistics sprinkled in text as part of a narrative describing some phenomenon; but tables and charts are the primary methods of organizing and presenting statistics.


A statistic isn’t real without data

A ‘real’ statistic requires a data source. If the publisher of a statistic can’t tell you the data source behind it, you should question whether the statistic is ‘real.’ After all, people do make up statistics.

Notorious example: In an interview on the December 19, 2010 episode of CBS’ 60 Minutes, Meredith Whitney claimed that 50 to 100 “sizable” cities and counties in the U.S. would default on billions of dollars of municipal bonds. Her estimate sparked a mini-panic on the bond market. She refused to release the report behind these predictions on the grounds that her research is proprietary. Bloomberg revealed on February 1, 2011 that she “doesn’t have any numbers to back up her assertions -- she pulled the numbers out of thin air.”


Fabricated statistics


Fox News guest Steven Emerson says Birmingham is 'totally Muslim'

He has since admitted that is inaccurate, and for good reason. The stats say Birmingham is about 20 percent Muslim. Kathie Sanders, Jan 14, 2015


Fabricated data


Diederik Stapel, a Dutch social psychologist, perpetrated an audacious academic fraud by making up studies that told the world what it wanted to hear about human nature. Yudhijit Bhattacharjee, “The Mind of a Con Man,” New York Times Magazine, April 26, 2013


Misinterpretation of statistics

Some make wrong generalizations from statistics.

Notorious example: Approximately two years ago during the Republican Party presidential primaries, Rick Santorum claimed on television and on the campaign trail that “62 percent of kids who enter college with some kind of faith commitment leave without it.” Stephen Colbert suggested that this statistic had to be taken “on faith.” Jonathan Hill reported that “Studies using comparable data from recent cohorts of young people (for example, the National Longitudinal Survey of Youth 1997, the National Longitudinal Study of Adolescent Health, and the National Study of Youth and Religion) have found virtually no overall differences on most measures of identity, practice, and belief between those who [go] to college and those who do not.”


Same statistical concept but different data sources

A statistical concept may be derived from different data sources and show different results.

Notorious example: A long-standing debate erupted over a Lancet article published in 2004 that estimated the number of civilian deaths in Iraq in the 18 months following the invasion to be around 98,000. The Iraq Body Count project compiled a database of reported civilian deaths showing between 11,000 and 13,000 deaths in this same period. The UK government embraced statistics from the Iraq Ministry of Health, which reported 3,853 civilian deaths and 15,517 injuries over six months in 2004.


Quality data needed

A statistic may have been derived from poor-quality data and, consequently, may be of limited value. Nevertheless, it remains a ‘real’ statistic.

The desire is to have quality statistics that are derived from quality data.

Statistics Canada uses six criteria to define quality statistics, or statistics “fit for use”: relevance, accuracy, timeliness, accessibility, interpretability, and coherence.


Methods producing data

Observational methods
• Focus is on developing observational instruments to collect data
• Correlation
• Replicate the analysis (same data or similar)
• Statistics summarize observations

Experimental methods
• Focus is on manipulating causal agents to measure change in a response agent
• Causation
• Replicate the experiment
• Statistics summarize experiment results

Computational methods
• Focus is on modeling phenomena through mathematical equations
• Prediction
• Replicate the simulation
• Statistics summarize simulation results


Lifecycle production of data

The production of data across these three methods happens through a lifecycle process. Understanding the basics of the lifecycle process in which statistics are derived from data can help in the search for statistics.


Lifecycle of survey statistics

1 Program objective

2 Survey unit organized

3 Questionnaire & sample

4 Data collection

5 Data production & release

6 Analysis

7 Findings released

8 Popularizing findings

9 Needs & gaps evaluation


Lifecycle applied to health statistics

1 Program objectives

increased emphasis on health promotion and disease prevention;

decentralization of accountability and decision-making;

shift from hospital to community-based services;

integration of agencies, programs and services; and

increased efficiency and effectiveness in service delivery.

Health Information Roadmap Initiative

2 Survey unit organized

3 Questionnaire & sample

4 Data collection

5 Data production & release

6 Analysis

7 Official findings released


Reconstructing statistics

One way to see the relationship between statistics and the data from which they were derived is to reconstruct statistics that someone else has produced from data that are publicly accessible.

[Diagram: the nine lifecycle steps, from program objective to needs and gaps evaluation, repeated for the Health Information Roadmap Initiative.]

The statistics that we will reconstruct are reported in “Health Facts from the 1994 National Population Health Survey,” Canadian Social Trends, Spring 1996, pp. 24-27.

The steps we will follow are:
• identify the characteristics of the respondents in the article;
• identify the data source;
• locate these characteristics in the data documentation;
• find the original questions used to collect the data;
• retrieve the data; and
• run an analysis to reproduce the statistics.

The findings to be reproduced

[Image: page 26 of the article]

Summary of variables identified

• Findings apply to Canadian adults: likely need age of respondents
• Men and women: look for the sex of respondents
• Type of drinkers: look for frequency of drinking or a variable categorizing types of drinkers
• Age: look for actual age or age in categories
• Smokers: look for smoking status

Identify the data source

Survey title is identified: National Population Health Survey, 1994-95

Public-use microdata file is announced

Page 25 of the article


Locate the variables

• Examine the data documentation for the National Population Health Survey, 1994-95
  • PDF version is on-line
  • Use TOC and link to “Data Dictionary for Health”
• Identify the variables from their content
  NOTE: check how missing data were handled
• Trace the variables back to the questionnaire
• Did the sampling method require weighting cases?
  NOTE: in addition to the other variables, is a weight variable needed to adjust for the sampling method?
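
The weighting question matters because a raw count and a weighted estimate can tell different stories. A minimal sketch, assuming each record carries a survey weight equal to the number of people in the population that the respondent represents. The records, field names, and weights are hypothetical, and using `None` for a missing answer is an assumption, not the NPHS coding.

```python
# Hypothetical respondent records. "weight" is the number of people in the
# population each respondent is assumed to represent; None marks a missing answer.
records = [
    {"smoker": True, "weight": 500.0},
    {"smoker": False, "weight": 1500.0},
    {"smoker": True, "weight": 400.0},
    {"smoker": False, "weight": 1600.0},
    {"smoker": None, "weight": 900.0},  # missing answer: excluded below
]

# Handle missing data explicitly before computing anything.
valid = [r for r in records if r["smoker"] is not None]

# Unweighted: every respondent counts once.
unweighted_pct = 100 * sum(r["smoker"] for r in valid) / len(valid)

# Weighted: every respondent counts as many times as their weight.
total_weight = sum(r["weight"] for r in valid)
smoker_weight = sum(r["weight"] for r in valid if r["smoker"])
weighted_pct = 100 * smoker_weight / total_weight

print(f"Unweighted: {unweighted_pct:.1f}% smokers")
print(f"Weighted:   {weighted_pct:.1f}% smokers")
```

In this sketch the two figures differ because the smokers happen to carry smaller weights. A published table should state in its notes which kind of figure it reports.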


Retrieve and analyze the data

The microdata for the NPHS is available through Statistics Canada without cost. However, a licence regarding the terms of use must be signed. Universities subscribed to the Data Liberation Initiative (DLI) can download the data directly.

Make use of local data services to retrieve data from the NPHS.


Lessons from the NPHS example

This example demonstrates the distinction between producing statistics and interpreting statistics that have been published by others.

This is an important distinction because:
• Choices are made in creating statistics.
• Interpreting statistics requires an ability to understand the choices that were made.

Searching for statistics that others have published can be facilitated by understanding these points.


Attributes of statistics

• Content or subject: What was observed?

• Geographic setting: Where was it observed?

• Time coverage: When was it observed?

• Metric of measurement: How was it measured?


There are six dimensions or variables in this table. The cells in the table are the estimated number of smokers.
• Geography: Region
• Time: Periods
• Social content: Smokers, Education, Age, Sex

Statistics are about definitions


Statistics are dependent on definitions. You may think of statistics as numbers, but the numbers represent measurements or observations based on specific definitions.

As just shown, tables are structured around geography, time, and content based on the attributes of the unit of observation. These properties all depend on definitions.


Consider the following example from the Canadian Census on the data behind statistics about visible minorities. This table displays the size of the visible minority population in Canada from the 2006 Census.

Visible Minority Groups (15), Generation Status (4), Age Groups (9) and Sex (3) for the Population 15 Years and Over of Canada, Provinces, Territories, Census Metropolitan Areas and Census Agglomerations, 2006 Census - 20% Sample Data
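
The numbers in parentheses in the table title appear to give the number of categories in each dimension, so the title alone suggests how many cells the table holds for each geographic area. A short worked calculation, with the category counts taken from the title:

```python
import math

# Category counts as given in parentheses in the table title.
dimensions = {
    "Visible Minority Groups": 15,
    "Generation Status": 4,
    "Age Groups": 9,
    "Sex": 3,
}

# Each cell is one combination of categories, so the cell count is the product.
cells_per_area = math.prod(dimensions.values())

print(f"Cells per geographic area: {cells_per_area}")
```

Category counts of this kind usually include a total (Sex (3) is probably Total, Male, Female), so many of these cells are subtotals, not distinct groups. Knowing which categories are totals depends on the definitions behind each dimension.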


• How is visible minority status identified in the Census?
• Are Aboriginal people counted among the visible minority population in Canada?
• What is the definition of visible minority?


Statistics involve classifications

Classifications
• Sex: Total, Male, Female
• Periods: 1994-1995, 1996-1997

Some classifications are based on standards while others are based on convention or practice.

For example, Standard Geography classifications


The definitions that shape statistics specify the metric of the data they summarize (for example, Canadian dollars) or the categories used to classify things if a statistic represents counts or frequencies. In this latter case, classification systems are used to identify categories of membership in a concept’s definition.

Examples of standard classifications include the North American Industry Classification System (NAICS), the National Occupational Classification for Statistics (NOC-S), and the International Classification of Diseases (ICD). Look at these examples and describe the coding systems used.
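
Many standard classifications are hierarchical: each additional digit in a code narrows the category. As a sketch of that idea, the function below splits a six-digit NAICS-style code into its levels (sector, subsector, industry group, industry, national industry). The code value `123456` is a made-up placeholder, not a real NAICS entry; consult the published classification for actual codes and titles.

```python
# Level names paired with the number of leading digits that identify each level
# in a NAICS-style code.
LEVELS = [
    ("sector", 2),
    ("subsector", 3),
    ("industry group", 4),
    ("industry", 5),
    ("national industry", 6),
]


def split_code(code):
    """Return the prefix of `code` that identifies each level of the hierarchy."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError("expected a six-digit code")
    return {name: code[:digits] for name, digits in LEVELS}


# "123456" is a placeholder, not a real classification entry.
for level, prefix in split_code("123456").items():
    print(f"{level:<18} {prefix}")
```

Counts published at the two-digit level are sums over everything beneath them, which is why the level of a classification matters when comparing two tables.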


A quick review

To this point, we have established that:

Statistics are ‘real’ only if they are derived from data;

Statistics are dependent on definitions of the concepts they summarize;

Statistics that represent counts of things in the data employ classification systems, which are based either on standards or convention;

Statistics are numeric summaries over geography, time, and content; and

Statistics are typically organized for display using tables or charts.


Metadata for statistics


“Tips for Reading a Statistical Table” provides a list of information that a table should provide.

Metadata for statistical tables

1. A title for the table containing references to the content, geography, time, and unit of measurement;

2. A reference in the title or a note about the identity of the individual or organization who produced the table;

3. A date when the table was published;

4. Definitions of concepts or sources to classification systems or standards used;

5. The type of classification system used for the headings of columns and rows;


6. Notes helpful for interpreting the information in the cells of the table, such as a description of any special steps taken in preparing the statistics;

7. Information indicating whether the statistics were derived from a raw or a weighted sample (or possibly both);

8. The sample size if the data come from a sample;

9. The unit of observation for the data used; and

10. The agency that produced the data file.
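
The ten items can be treated as a checklist when evaluating a table. A minimal sketch, assuming the table's metadata has been recorded by hand in a dictionary; the key names and the sample values are hypothetical shorthand for the ten items.

```python
# Shorthand keys for the ten metadata items, in the order they are listed.
CHECKLIST = [
    "title",
    "producer",
    "publication_date",
    "definitions",
    "classification_system",
    "notes",
    "weighting",
    "sample_size",
    "unit_of_observation",
    "data_producer",
]


def missing_metadata(table_metadata):
    """Return the checklist items that are absent or empty for a table."""
    return [item for item in CHECKLIST if not table_metadata.get(item)]


# Hypothetical metadata recorded while reading a table.
example = {
    "title": "Estimated smokers by region, age, and sex",
    "producer": "Example Statistical Agency",
    "publication_date": "2015-02-18",
    "unit_of_observation": "person",
    "sample_size": None,  # not stated in the table
}

print("Missing:", ", ".join(missing_metadata(example)))
```

A long list of missing items does not make a statistic wrong, but it does make it harder to interpret and to trace back to its data source.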


Looking for statistics

Who would publish this statistic?
• What organization would publish this statistic?
• Is this a statistic that would be published by a public or a private organization?
• Is this a statistic that would be published as an operational requirement?

Can you identify a data source for the statistic?
• What data source would be used to produce this statistic?
• Who would produce this data source?
• Would there be a distributor for the data source?

What view of the data would be shown in this statistic?
• What would be the level of geography?
• What time period would be shown?
• What social characteristics would be shown?
• Why would someone show this view of the data?

What metadata would describe this statistic?
• What definitions would describe the geography, time, or social characteristics?
• What standard classification system would be used for the categories of the statistic?

Search strategies for statistics

Over the next two days, we will talk about two general search strategies for finding statistics.

The “publisher” strategy is to identify an organization that would produce and publish such a statistic. This approach relies on knowledge of statistical producers. For public sector sources, for example, this means understanding governmental structure and the content for which its agencies are responsible.

The data strategy is to identify a data source from which the statistics were derived. This approach relies on knowledge of data sources produced by agencies or organizations.


Framework

Statistical information
• Official statistics: agency publications, data sources
• Non-official statistics: organization publications, data sources