
The Present and Future of Statistics

Challenges and Opportunities

Marie Davidian

Department of Statistics
North Carolina State University

http://www4.stat.ncsu.edu/~davidian

1/47


New York Times, August 6, 2009

“I keep saying that the sexy job in the next 10 years will be statisticians” – Hal Varian, Chief Economist, Google

2/47

McKinsey Report, 2011

2011 McKinsey Global Institute report:

Big data: The next frontier for innovation, competition, and productivity

“A significant constraint. . . will be a shortage of . . . people with deep expertise in statistics and data mining. . . a talent gap of 140K - 190K positions in 2018 (in the US)”

http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation

3/47

The Wall Street Journal, March 1, 2013

4/47

Advanced placement statistics

[Figure: AP Statistics Exam Participation, 1997−2013. Horizontal axis: Year. Vertical axis: Number of Students (thousands), 0 to 200.]

5/47

Rock star

Nate Silver attracted ∼4,000 statisticians to his president’s invited address at the 2013 Joint Statistical Meetings

6/47

Bestsellers

7/47

The Wall Street Journal, November 15, 2013

“Statistics is cool” – Ron Wasserstein, ASA Executive Director

8/47

Rock stars

2013 MacArthur Fellow Susan Murphy

9/47

Rock stars

2013 Prime Minister’s Prize for Science (Australia) recipient Terry Speed

10/47

Rock stars

Sir David Spiegelhalter (aka “Professor Risk”)

11/47

Essential

“Statistical rigor is necessary to justify the inferential leap from data to knowledge”

12/47

Big opportunities

The opportunities for statistics to have major impact are endless

13/47

Big challenge

Big Data and data science

14/47

Big hype

15/47

Big hype, receding

“We are now past the ‘peak of inflated expectations’ of the hype cycle” (http://en.wikipedia.org/wiki/Hype_cycle)

16/47


Investment

17/47

Investment

18/47

Investment

19/47

What is data science?

20/47

Big challenge remains

• Perception that other disciplines are more relevant to data-driven science and business
• Computer science, machine learning, mathematics, engineering, physics, analytics, . . .
• Perception that statistics is old-fashioned and rigid
• Institutes, centers, degree and certificate programs; in many cases statistics is MIA

21/47

For example

“Statistics departments and journals still strongly emphasize a very narrow range of topics and methods and techniques, all driven by a tiny handful of results, many dating from the 1930s. . . the older zombie methods persist in the statistics literature and teaching.”

22/47

For example

Science, February 11, 2011

“My impression is that scientists view statistics not so much as a science but as a ‘bag of tools.’ You have a visibility problem in Science and AAAS.” – Alan Leshner, Chief Executive Officer of the American Association for the Advancement of Science (AAAS), to representatives of the ASA in 2011

23/47

Weird juxtaposition

• We continue to have daily impact in our critical collaborative roles
• We have gotten lots of great press
• Enrollments in undergraduate programs and applications to graduate programs are skyrocketing
• But statistics is still often absent from the discourse on Big Data and data science
• And the importance of statistical rigor is sometimes lost on fellow scientists
• Statistics and statistical principles continue to be misunderstood

24/47


ASA impact

The American Statistical Association has undertaken numerous initiatives to
• Promote the role of statistics in all the sciences
• Highlight the unique perspectives statistics brings to business and policy
• Increase awareness of opportunities for statisticians among students
and thereby enhance our impact on science and society

25/47

Key ASA Staff

Ron Wasserstein, Steve Pierson, Jeff Myers

26/47

AAAS and Science

AAAS – A focal point for science
• World’s largest general scientific society, ∼120K members
• 261 affiliated scientific societies (including ASA)
• Publisher of Science
• 24 AAAS sections, including Section U on Statistics

2013 presidential initiative
• Raise the profile of statistics within AAAS

27/47

AAAS and Science

• Button campaigns encouraging statisticians to join AAAS
• Section U membership increased 14% from 2012 to 2013
• Section U invited session proposals for AAAS Annual Meetings
• Nominations for AAAS Fellow

28/47

Impacting AAAS and Science

September 26, 2013
• A group of ASA representatives met with Alan Leshner
• Very positive reception!
• We also met with new Science editor-in-chief Marcia McNutt and several senior editors
• Great interest in enhancing the role of statistics and statisticians

29/47

Impacting Science

Science Editor-in-Chief Marcia McNutt

30/47

Science, January 17, 2014

Reproducibility

Science advances on a foundation of trusted discoveries. Reproducing an experiment is one important approach that scientists use to gain confidence in their conclusions. Recently, the scientific community was shaken by reports that a troubling proportion of peer-reviewed preclinical studies are not reproducible. Because confidence in results is of paramount importance to the broad scientific community, we are announcing new initiatives to increase confidence in the studies published in Science. For preclinical studies (one of the targets of recent concern), we will be adopting recommendations of the U.S. National Institute of Neurological Disorders and Stroke (NINDS) for increasing transparency.* Authors will indicate whether there was a pre-experimental plan for data handling (such as how to deal with outliers), whether they conducted a sample size estimation to ensure a sufficient signal-to-noise ratio, whether samples were treated randomly, and whether the experimenter was blind to the conduct of the experiment. These criteria will be included in our author guidelines.

There are a number of reasons why peer-reviewed preclinical studies may not be reproducible. The system under investigation may be more complex than previously thought, so that the experimenter is not actually controlling all independent variables. Authors may not have divulged all of the details of a complicated experiment, making it irreproducible by another lab. It is also expected that through random chance, a certain number of studies will produce false positives. If researchers are not alert to this possibility and have not set appropriately stringent significance tests for their results, the outcome is a study with irreproducible results. Although there is always the possibility that an occasional study is fraudulent, the number of preclinical studies that cannot be reproduced is inconsistent with the idea that all irreproducibility results from misconduct in such research.

It is unlikely that the issues with irreproducibility are confined to preclinical studies (social science has been equally noted, for example). Unfortunately, there are no equivalents to the NINDS recommendations for other disciplines that provide a basis for requiring transparency across all fields. For the next 6 months, we will be asking reviewers and editors to identify papers submitted to Science that demonstrate excellence in transparency and instill confidence in the results. This will inform the next steps in implementing reproducibility guidelines. Science Translational Medicine, a sister journal of Science, already enforces the NINDS guidelines for preclinical studies. Both journals also are open to improving on the NINDS recommendations for preclinical studies.

There is also a wide range of sophistication in the application of statistics displayed in research analysis, ranging from practically no statistics, to the routine use of generic software packages, to the application of advanced methods that extract subtle signals from noise. Because reviewers who are chosen for their expertise in the subject matter of a study may not be authorities in statistics as well, statistical errors in manuscripts may slip through undetected. For that reason, with the advice of the American Statistical Association and others, we are adding new members to our Board of Reviewing Editors from the statistics community to ensure that manuscripts receive appropriate scrutiny in their methods of data analysis.

Science’s standards have always been high, and these measures add to steps we have already taken to increase transparency, such as requiring data accessibility. Nevertheless, journals can only do so much to assure readers of the validity of the studies they publish. The ultimate responsibility lies with authors to be completely open with their methods, all of their findings, and the possible pitfalls that could invalidate their conclusions.

10.1126/science.1250475

– Marcia McNutt

EDITORIAL

Marcia McNutt is Editor-in-Chief of Science.

www.sciencemag.org SCIENCE VOL 343 17 JANUARY 2014

*S. C. Landis et al., Nature 490, 187 (2012).

Published by AAAS


31/47

Science, July 4, 2014

SCIENCE sciencemag.org 4 JULY 2014 • VOL 345 Issue 6192 9

Raising the bar

Numbers. Lots and lots of numbers. It is hard to find a paper published in Science or any other journal that is not full of numbers. Interpretation of those numbers provides the basis for the conclusions, as well as an assessment of the confidence in those conclusions. But unfortunately, there have been far too many cases where the quantitative analysis of those numbers has been flawed, causing doubt about the authors’ interpretation and uncertainty about the result. Furthermore, it is not realistic to expect that a technical reviewer, chosen for her or his expertise in the topical subject matter or experimental protocol, will also be an expert in data analysis. For that reason, with much help from the American Statistical Association, Science has established, effective 1 July 2014, a Statistical Board of Reviewing Editors (SBoRE), consisting of experts in various aspects of statistics and data analysis, to provide better oversight of the interpretation of observational data.

For those familiar with the role of Science’s Board of Reviewing Editors (BoRE), the function of the SBoRE will be slightly different. Members of the BoRE perform a rapid quality check of manuscripts and recommend which should receive in-depth review by technical specialists. Members of the SBoRE will receive manuscripts that have been identified by editors, BoRE members, or possibly reviewers as needing additional scrutiny of the data analysis or statistical treatment. The SBoRE member assesses what the issue is that requires screening and suggests experts from the statistics community to provide it.

So why is Science taking this additional step? Readers must have confidence in the conclusions published in our journal. We want to continue to take reasonable measures to verify the accuracy of those results. We believe that establishing the SBoRE will help avoid honest mistakes and raise the standards for data analysis, particularly when sophisticated approaches are needed. But even when taking added precautions, no review system is infallible, and no combination of reviewers can be expected to uncover all of the ways in which the interpretation of results may have gone wrong. In particular, it is difficult for reviewers to detect whether authors have approached the study with a lack of bias in their data collection and presentation.

I recall a study that I conducted years ago involving a global analysis of some oceanographic features that I was modeling to understand the rheology of oceanic plates on million-year time scales. I had only a handful of data points – perhaps a dozen or so – and the fit to my model failed a significance test. Clearly, throwing out a few of the data points by declaring them “outliers” would have improved the fit dramatically, and in fact I even recall a reviewer of the paper commenting: “Can’t you make the data fit the model better?” Really?

The editor published the paper despite the lousy fit of the model to the data. It was not too long before it was realized that those “outliers” were the key to a more complete understanding of the long-term rheological behavior of the oceanic plates. Although the model in the earlier paper needed an overhaul, the fundamental observations, because they were presented without bias, inspired much further progress in the field.

In the years since, I have been amazed at how many scientists have never considered that their data might be presented with bias. There are fundamental truths that may be missed when bias is unintentionally overlooked, or worse yet, when data are “massaged.” Especially as we enter an era of “big data,” we should raise the bar ever higher in scrutinizing the analyses that take us from observations to understanding.

Marcia McNutt is Editor-in-Chief of Science.

EDITORIAL

– Marcia McNutt

10.1126/science.1257891

“Readers must have confidence in the conclusions published in our journal.”


Published by AAAS


32/47

Science Statistics Board of Reviewing Editors

“. . . with much help from the American Statistical Association, Science has established, effective 1 July 2014, a Statistical Board of Reviewing Editors (SBoRE), consisting of experts in various aspects of statistics and data analysis, to provide better oversight of the interpretation of observational data.”

“I have been amazed at how many scientists have never considered that their data might be presented with bias. . . Especially as we enter an era of ‘big data,’ we should raise the bar ever higher in scrutinizing the analyses that take us from observations to understanding.”

33/47

Science Statistics Board of Reviewing Editors

Ron Brookmeyer, UCLA
Alison Motsinger-Reif, NC State University
Giovanni Parmigiani, Dana-Farber Cancer Institute
Richard Smith, University of North Carolina at Chapel Hill
Jane-Ling Wang, University of California, Davis
Chris Wikle, University of Missouri
Ian A. Wilson, The Scripps Research Institute

34/47

Science Statistics Perspectives

35/47

Impacting US federal research priorities

Whitepapers
• Make case that statisticians are essential to tackling national research priorities
• Drive research funding
• National Science Foundation (NSF)
• White House Office of Science and Technology Policy (OSTP)
• Modeled after the success of Computing Community Consortium (CCC) whitepapers
• See Steve’s article in the July 2014 Amstat News

36/47

OSTP Big Data initiative


Discovery with Data: Leveraging Statistics with Computer Science to Transform Science and Society

July 2, 2014

A Working Group of the American Statistical Association¹

Summary:

The Big Data Research and Development Initiative is now in its third year and making great strides to address the challenges of Big Data. To further advance this initiative, we describe how statistical thinking can help tackle the many Big Data challenges, emphasizing that often the most productive approach will involve multidisciplinary teams with statistical, computational, mathematical, and scientific domain expertise. With a major Big Data objective of turning data into knowledge, statistics is an essential scientific discipline because of its sophisticated methods for statistical inference, prediction, quantification of uncertainty, and experimental design. Such methods have helped and will continue to enable researchers to make discoveries in science, government, and industry. The paper discusses the statistical components of scientific challenges facing many broad areas being transformed by Big Data—including healthcare, social sciences, civic infrastructure, and the physical sciences—and describes how statistical advances made in collaboration with other scientists can address these challenges. We recommend more ambitious efforts to incentivize researchers of various disciplines to work together on national research priorities in order to achieve better science more quickly. Finally, we emphasize the need to attract, train, and retain the next generation of statisticians necessary to address the research challenges outlined here.

1 Authors: Cynthia Rudin, MIT (Chair); David Dunson, Duke University; Rafael Irizarry, Harvard University; Hongkai Ji, Johns Hopkins University; Eric Laber, North Carolina State University; Jeffrey Leek, Johns Hopkins University; Tyler McCormick, University of Washington; Sherri Rose, Harvard University; Chad Schafer, Carnegie Mellon University; Mark van der Laan, University of California, Berkeley; Larry Wasserman, Carnegie Mellon University; Lingzhou Xue, Pennsylvania State University. Affiliations are for identification purposes only and do not imply an institution’s endorsement of this document.

Cynthia Rudin, Chair
David Dunson, Rafael Irizarry, Hongkai Ji, Eric Laber, Jeff Leek, Tyler McCormick, Sherri Rose, Chad Schafer, Mark van der Laan, Larry Wasserman, Lingzhou Xue

37/47

OSTP BRAIN initiative

STATISTICAL RESEARCH AND TRAINING UNDER THE BRAIN INITIATIVE

A Working Group of the American Statistical Association∗

April 2014

1 Introduction and Summary

The BRAIN (Brain Research through Advancing Innovative Neurotechnologies) Initiative aims to produce a sophisticated understanding of the link between brain and behavior and to uncover new ways to treat, prevent and cure brain disorders.¹ Success in meeting these multifaceted challenges will require scientific and technological paradigms that incorporate novel statistical methods for data acquisition and analysis. Our purpose here is to substantiate this proposition, and to identify implications for training.

Brain research relies on a wide variety of existing methods for collecting human and animal neural data, including neuroimaging (radiography, fMRI, MEG, PET), electrophysiology from multiple electrodes (EEG, ECoG, LFP, spike trains), calcium imaging, optical imaging, optogenetics, and anatomical methods (diffusion imaging, electron microscopy, fluorescent microscopy). Each of these modalities produces data with its own set of statistical and analytical challenges. As neuroscientists improve these techniques and develop new ones, data are being acquired at very large scales. For example, advances in multiple-electrode recording and two-photon calcium imaging have led to an exponential growth in the size of neural populations that can be observed simultaneously, at single-cell resolution (2; 3; 52). Similarly, new anatomical methods have led to a rapid rise in the size and the scale of data, and the resulting level of detail with which brain structure can be investigated (17; 10; 32; 39). Furthermore, both new and existing technologies are often used together, and are increasingly accompanied by rich characterizations of individuals and their behavior, ranging from genetic information to sensor-based monitors of activity.

∗ The working group included Robert E. Kass, Carnegie Mellon University (Chair); Genevera Allen, Rice University; Brian Caffo, Johns Hopkins University; John Cunningham, Columbia University; Uri Eden, Boston University; Timothy D. Johnson, University of Michigan; Martin A. Lindquist, Johns Hopkins University; Thomas A Nichols, University of Warwick; Hernando Ombao, University of California, Irvine; Liam Paninski, Columbia University; Russell T. Shinohara, University of Pennsylvania; Bin Yu, University of California, Berkeley. Affiliations are for identification purposes only and do not imply an institution’s endorsement of this document.

¹ http://www.whitehouse.gov/share/brain-initiative

Rob Kass, Chair
Genevera Allen, Brian Caffo, John Cunningham, Uri Eden, Timothy Johnson, Martin Lindquist, Thomas Nichols, Hernando Ombao, Liam Paninski, Russell T. Shinohara, Bin Yu

38/47

OSTP climate change initiative

Statistical Science: Contributions to the Administration’s Research Priority on Climate Change

April 2014

A White Paper of the American Statistical Association’s Advisory Committee for Climate Change Policy¹

EXECUTIVE SUMMARY

Data are fundamental to all of science. Data enhance scientific theories and their statistical analysis suggests new avenues of research and data collection. Climate science is no exception. Earth’s climate system is complex, involving the interaction of many different kinds of physical processes and many different time scales. Thus this area of science has a critical dependence on the examination of all relevant data and the application of statistics for its interpretation. Climate datasets are increasing in number, size, and complexity and challenge traditional methods of data analysis. Satellite remote sensing campaigns, automated weather monitoring networks, and climate-model experiments have contributed to a data explosion that provides a wealth of new information but can overwhelm standard approaches. Developing new statistical approaches is an essential part of understanding climate and its impact on society in the presence of uncertainty. Experience has shown that rapid progress can be made when “big data” is used with statistics to derive new technologies. Crucial to this success are new statistical methods that recognize uncertainties in the measurements and the scientific processes but are also tailored to the unique scientific questions being studied.

This white paper makes the case for the National Science Foundation (NSF) to establish an interdisciplinary research program around climate, where statisticians have the opportunity to collaborate with researchers from other disciplines to advance the understanding of the climate system (e.g., quantification of uncertainties, the development of powerful tests of scientific hypotheses). Although NSF supports basic and applied statistical research, these efforts often do not involve scientists and statisticians in partnerships or in teams to address problems in climate science. This program would also address the critical need for training a new generation of interdisciplinary researchers who can tackle challenging scientific problems that require complex data analysis by developing and using the necessary sophisticated statistical methods.

¹ Authors: Bruno Sanso, University of California, Santa Cruz (Chair); L. Mark Berliner, Ohio State University; Daniel S. Cooley, Colorado State University; Peter Craigmile, Ohio State University; Noel A. Cressie, University of Wollongong; Murali Haran, Pennsylvania State University; Robert B. Lund, Clemson University; Douglas W. Nychka, National Center for Atmospheric Research; Chris Paciorek, University of California, Berkeley; Stephan R. Sain, National Center for Atmospheric Research; Richard L. Smith, Statistical and Applied Mathematical Sciences Institute; Michael L. Stein, University of Chicago. Affiliations are for identification purposes only and do not imply an institution’s endorsement of this document.

Bruno Sanso, Chair
Mark Berliner, Daniel Cooley, Peter Craigmile, Noel Cressie, Murali Haran, Robert Lund, Doug Nychka, Chris Paciorek, Stephan Sain, Richard Smith, Michael Stein

39/47

Impacting our role in Big Data/data science

Big Data/data science initiative by the ASA presidents (Bob, Marie, Nat) (See my June 2013 Amstat News column)
• Workgroup on curriculum development
• Meetings between ASA representatives and stakeholders from business and technology companies at the forefront of data science (Alexandria, Cincinnati, New York, Palo Alto)
• Training in text analytics at 2014 CSP and JSM

40/47

Data science meetings

Major messages
• Great shortage of statistical talent, concerns over pipeline
• Concerns over ability of fresh PhDs to work independently, identify problems
• Concerns over computational and data manipulation skills – Python favored over R
• Communication, collaboration, and leadership skills
• Must be able to “make it to the middle”

41/47

Preparing statisticians to make an impact

• Report of the curriculum development workgroup, 2014 JSM panel (Monday)
• Professional skills development
• Training in statistical leadership (see Nat’s May 2014 Amstat News column)

42/47

Public relations campaign

First national PR campaign for statistics
• Promote the profession, increase visibility
• Students and those who advise/influence them
• Careers in statistics, importance of statistical literacy in everyday life
• Consulting with Stanton Communications
• Not just an “ASA thing”
• See Nat’s June 2014 Amstat News column

43/47

This is Statistics

“It’s not what you think it is.”

44/47

This is Statistics website

http://www.thisisstatistics.org

(going live on August 18, 2014)

45/47

This is Statistics approaches

• Profiles of young statisticians in “cool” positions
• Exploit social media
• Pitch statistics career stories to the media

46/47

Making an impact

I have touched on just a few of the many things the ASA is doing to enhance the impact of statistics

Please contact me or Ron ([email protected]) if you are interested in participating

47/47
