Information is what we want, but data are what we’ve got. (2015-08-12)



Information is what we want, but data are what we’ve got.

DATA COMPUTING introduces wrangling and visualization, the techniques for turning data into information.

Ideal for self-study or as a classroom text, Data Computing shows how to condense and combine data from multiple sources to present them in a way that informs discovery and decision making. An accessible introduction to technical computing for those whose primary interest is with data, Data Computing builds on breakthrough software developments that let even beginners exploit the power of professional-level tools.

Based on a no-prerequisite short course the author developed at Macalester College, Data Computing uses R, a leading software application for statistics and data analysis which is freely available for all widely-used platforms. The book’s short chapters and the clarity of notation supported by Hadley Wickham’s hugely popular dplyr and ggplot2 packages help the reader to develop skills at a measured pace.

Visit the web site: Data-Computing.org

Daniel Kaplan is a Harvard-trained, award-winning teacher. He has two decades of experience teaching statistics and modeling, computing, and applied mathematics. His graduate training in biomedical engineering and economics, as well as extensive consulting, give him an applied perspective: using data to serve a purpose.

No matter what you do, you can use data to do it better. Gaining that superpower requires you to learn some programming tools and some mental tools. This book will teach you both: you’ll get the mental building blocks to think about data analysis, and the computational tools to turn those thoughts into code. If you’re just learning to swim in the data ocean, Danny’s lucid writing and thoughtful approach makes this book a great place to start!

— Hadley Wickham, Chief Scientist, RStudio

The book covers in a systematic way not only the stuff I’ve picked up in a fragmentary, bit-at-a-time way, but more important for me, stuff I wanted to know, and stuff I needed to know without realizing it. Better yet, Kaplan’s book makes it all easily accessible. Just what our profession needs!

— George Cobb, Professor Emeritus of Mathematics and Statistics, Mt. Holyoke College

"27","690",110023,32,16085.125,4922.1875,"GORDON HOSPITAL","1035 RED BUD ROAD","CALHOUN","GA",30701,"GA - Rome"

"28","064",390073,58,25636.10345,10050.2069,"ALTOONA REGIONAL HEALTH SYSTEM","620 HOWARD AVENUE","ALTOONA","PA"

"29","189",140166,27,25354.74074,8035.925926,"ST MARYS HOSPITAL","1800 E LAKE SHORE DR","DECATUR","IL",62521,"I

"30","065",220108,18,16092.88889,7422.444444,"MILTON HOSPITAL INC","199 REEDSDALE ROAD","MILTON","MA",2186,"MA

"31","638",240063,17,17730.29412,5531,"ST JOSEPH’S HOSPITAL","45 WEST 10TH STREET","SAINT PAUL","MN",55102,"MN

"32","066",70010,21,25608.52381,8243.190476,"BRIDGEPORT HOSPITAL","267 GRANT STREET","BRIDGEPORT","CT",6610,"CT

"33","389",220105,22,9957.636364,5855,"WINCHESTER HOSPITAL","41 HIGHLAND AVENUE","WINCHESTER","MA",1890,"MA - B

"34","190",50517,20,34379.3,10507.75,"VICTOR VALLEY COMMUNITY HOSPITAL","15248 11TH ST","VICTORVILLE","CA",9239

"35","101",450044,18,22382.72222,5868,"UT SOUTHWESTERN UNIVERSITY HOSPITAL","5909 HARRY HINES BLVD","DALLAS","T

"36","178",330239,25,20413.36,8496.96,"WOMAN’S CHRISTIAN ASSOCIATION","207 FOOTE AVENUE","JAMESTOWN","NY",14701

"37","872",440150,23,25628.08696,6593.652174,"SUMMIT MEDICAL CENTER","5655 FRIST BLVD","HERMITAGE","TN",37076,"

"38","243",670034,11,30896.09091,13386.27273,"SCOTT & WHITE HOSPITAL-ROUND ROCK","300 UNIVERSITY BLVD","ROUND R

"39","243",290039,25,108888.08,17310.92,"MOUNTAINVIEW HOSPITAL","3100 N TENAYA WAY","LAS VEGAS","NV",89128,"NV

"40","189",260119,45,49716.57778,8235.844444,"POPLAR BLUFF REGIONAL MEDICAL CENTER","2620 N WESTWOOD BLVD","POP

"41","291",50262,65,87656.32308,23882.6,"RONALD REAGAN UCLA MEDICAL CENTER","757 WESTWOOD PLAZA","LOS ANGELES",

"42","418",300012,14,24523.42857,10788.5,"ELLIOT HOSPITAL","1 ELLIOT WAY","MANCHESTER","NH",3103,"NH - Manchest

"43","699",230092,11,18600,5511.363636,"ALLEGIANCE HEALTH","205 N EAST AVE","JACKSON","MI",49201,"MI - Ann Arbo

"44","176",260091,15,29213.53333,8293.4,"SSM ST MARYS HEALTH CENTER","6420 CLAYTON RD","RICHMOND HEIGHT","MO",6

"45","390",240132,16,17871.75,5602.375,"UNITY HOSPITAL","550 OSBORNE ROAD","FRIDLEY","MN",55432,"MN - Minneapol

"46","603",100173,57,20832.66667,4959.508772,"FLORIDA HOSPITAL TAMPA","3100 E FLETCHER AVE","TAMPA","FL",33613,

"47","191",510001,91,19745.17582,8046.527473,"WEST VIRGINIA UNIVERSITY HOSPITALS","MEDICAL CENTER DRIVE","MORGA

"48","690",380082,15,9025.333333,6077.4,"PROVIDENCE MILWAUKIE HOSPITAL","10150 SE 32ND AVENUE","MILWAUKIE","OR"

"49","065",310083,19,50481.57895,11328.21053,"EAST ORANGE GENERAL HOSPITAL","300 CENTRAL AVE","EAST ORANGE","NJ

"50","536",450686,25,20344.16,6008.6,"UNIVERSITY MEDICAL CENTER","602 INDIANA AVENUE","LUBBOCK","TX",79415,"TX

"51","330",170020,21,42909.28571,16447.2381,"HUTCHINSON REGIONAL MEDICAL CENTER INC","1701 E 23RD AVENUE","HUTC

"52","252",150017,47,101009.1915,22528.06383,"LUTHERAN HOSPITAL OF INDIANA","7950 W JEFFERSON BLVD","FORT WAYNE

"53","392",260219,37,12271.64865,3993.972973,"PROGRESS WEST HEALTHCARE CENTER","2 PROGRESS POINT PKWY","O FALLO

"54","419",140294,15,42847.33333,7572.866667,"CROSSROADS COMMUNITY HOSPITAL","8 DOCTORS PARK ROAD","MOUNT VERNO

"55","039",100038,14,66214.78571,7246.357143,"MEMORIAL REGIONAL HOSPITAL","3501 JOHNSON ST","HOLLYWOOD","FL",33

"56","552",330061,75,16056.21333,5513.8,"LAWRENCE HOSPITAL CENTER","55 PALMER AVENUE","BRONXVILLE","NY",10708,"

"57","853",310029,26,223865.6154,40468.26923,"OUR LADY OF LOURDES MEDICAL CENTER","1600 HADDON AVENUE","CAMDEN"

"58","312",360077,112,18022.44643,4422.642857,"FAIRVIEW HOSPITAL","18101 LORAIN AVENUE","CLEVELAND","OH",44111,

"59","378",490130,32,15228.625,5327.21875,"RIVERSIDE WALTER REED HOSPITAL","7519 HOSPITAL ROAD","GLOUCESTER","V

"60","313",180011,54,13743.96296,3580.555556,"SAINT JOSEPH HOSPITAL LONDON","1001 SAINT JOSEPH LANE","LONDON","

"61","208",70016,42,32458.07143,18761.47619,"SAINT MARYS HOSPITAL","56 FRANKLIN ST","WATERBURY","CT",6706,"CT -

"62","292",10120,28,12364,5204.571429,"MONROE COUNTY HOSPITAL","2016 SOUTH ALABAMA AVENUE","MONROEVILLE","AL",3

"63","392",140250,136,17888.22059,5458.036765,"ADVOCATE SOUTH SUBURBAN HOSPITAL","17800 S KEDZIE AVE","HAZEL CR

"64","552",450639,12,20740.58333,4367.083333,"TEXAS HEALTH HARRIS METHODIST HURST-EULESS-BEDFORD","1600 HOSPITA

"65","918",220024,11,9872.818182,4161.545455,"HOLYOKE MEDICAL CENTER","575 BEECH STREET","HOLYOKE","MA",1040,"M

"66","812",220016,18,11285.94444,4971.833333,"BAYSTATE FRANKLIN MEDICAL CENTER","164 HIGH STREET","GREENFIELD",

"67","853",330214,44,195905.5455,57527,"NYU HOSPITALS CENTER","550 FIRST AVENUE","NEW YORK","NY",10016,"NY - Ma

"68","641",50390,81,15930.23457,5040.580247,"HEMET VALLEY MEDICAL CENTER","1117 EAST DEVONSHIRE","HEMET","CA",9

"69","641",40026,92,15188.5,3900.695652,"ST JOSEPHS MERCY HEALTH CENTER INC","300 WERNER STREET","HOT SPRINGS",

"70","482",220002,15,16174.53333,11676.53333,"MOUNT AUBURN HOSPITAL","330 MOUNT AUBURN STREET","CAMBRIDGE","MA"

"71","314",100180,15,132940.6,11593,"ST PETERSBURG GENERAL HOSPITAL","6500 38TH AVE N","SAINT PETERSBUR","FL",3

"72","191",390219,35,14810.94286,6026.828571,"EXCELA HEALTH LATROBE HOSPITAL","ONE MELLON WAY","LATROBE","PA",1

"73","192",440173,14,10147.28571,3910.857143,"PARKWEST MEDICAL CENTER","9352 PARK WEST BLVD","KNOXVILLE","TN",3

"74","536",230216,12,9865.5,4183.583333,"PORT HURON HOSPITAL","1221 PINE GROVE AVE","PORT HURON","MI",48060,"MI

"75","195",100228,23,32893.13043,4118.26087,"WESTSIDE REGIONAL MEDICAL CENTER","8201 W BROWARD BLVD","PLANTATIO

"76","602",520030,18,24881.11111,8846.777778,"ASPIRUS WAUSAU HOSPITAL","333 PINE RIDGE BLVD","WAUSAU","WI",5440

"77","689",450042,27,23352.03704,6469.296296,"PROVIDENCE HEALTH CENTER","6901 MEDICAL PARKWAY","WACO","TX",7671

DATA COMPUTING

AN INTRODUCTION TO

WRANGLING AND VISUALIZATION WITH R

DANIEL KAPLAN

For Review Only


4 Files and Documents

In your work with data, you will be using and creating computer files of various sorts. Of particular importance are three basic roles for files:

1. storing data tables
2. listing R instructions
3. writing reports with narrative, graphics and conclusions

Different file types are used for each of these roles. These types can be referred to by the filename extension of the files.

Table 4.1: Common file types and their uses

File Role        Common file types          Software
Data storage     .csv, .xlsx, etc.          Spreadsheets
R instructions   .Rmd                       Text editor, compiler
End reports      .html, .pdf, .docx, etc.   web browser, word processor

Filename Extensions

The filename extension is one or more letters following the last period in the name: .xlsx, .docx, .html, .mpeg. (Other widely used extensions are .pdf, .zip, .png.) These letters indicate the format of the file and often set which program will be used to open the file: a spreadsheet, a word processor, a browser, or a music player in the examples. Or, in plainer language, the filename extension tells you the kind of file.

filename extension: The letters following the last period in a file name. This suffix indicates the file type, that is, what software to use to interpret the file.

If you are looking at files through RStudio, the filename extension will always be displayed. If you are using your own computer’s file browser, your system may have been set up to hide the extension.

When referring to files within R statements or an Rmd file, you must always include the filename extension.
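As a small illustration of how R itself sees extensions, the sketch below pulls the extension out of a few file names using file_ext() from the tools package, which ships with R. The file names are made up for the example and do not refer to real files.

```r
# Hypothetical file names, chosen only for illustration
file_names <- c("engines.csv", "report.Rmd", "notes.final.html")

# file_ext() returns the letters after the LAST period, without the period
extensions <- tools::file_ext(file_names)
print(extensions)
```

Note that only the letters after the final period count: notes.final.html is an .html file, not a .final file.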


Paths

A filename is analogous to a person’s first name. Just as first names are unique within a nuclear family, so filenames must be unique within a folder or directory.

To run on with the family metaphor ... You are identified within a household by your first name. The path would tell you which specific nuclear family you belong to, perhaps in the form of your address, like this: USA/Saint Paul/55105/703 Lincoln Avenue.

You are probably used to seeing folders and the files they contain organized on your computer as in Figure 4.1. The file path is the set of successive folders that bring you to the file.

file path: Information that specifies the location of a file in your file system.

Figure 4.1: Folders contained within folders, as shown by a file browser on Apple OS X.

There is a standard format for file paths. An example:

/Users/kaplan/Downloads/0021_001.pdf

Here the filename is 0021_001, the filename extension is .pdf, and the file itself is in the Downloads folder contained in the kaplan folder, which is in turn contained in the Users folder. The starting / means “on this computer”.
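The same decomposition can be done inside R. The following is a minimal sketch using base R functions (dirname(), basename(), file.path()) and two helpers from the tools package that ships with R; the variable names are only illustrative.

```r
path <- "/Users/kaplan/Downloads/0021_001.pdf"

folder    <- dirname(path)                         # the folders leading to the file
file_name <- basename(path)                        # filename plus extension
stem      <- tools::file_path_sans_ext(file_name)  # filename without the extension
extension <- tools::file_ext(file_name)            # extension without the period

# file.path() glues the pieces back together with "/" separators
rebuilt <- file.path(folder, file_name)

print(c(folder = folder, stem = stem, extension = extension))
```

Building paths with file.path() rather than pasting strings by hand avoids doubled or missing separators.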

The R function file.choose(), which should be used only in the console and not in an Rmd file, brings up an interactive file browser. You can select a file with the browser. The returned value will be a quoted character string with the path name.

file.choose() # then select a file

[1] "/Users/kaplan/Downloads/0021_001.pdf"

In RStudio, the Files tab will display the path near the top. In Figure 4.2, the ten files listed are all in the same folder, whose path ends with DCF-2014/ReOrganizedJune10/CourseNotes/DataOrganization.

Figure 4.2: Filenames and their file path shown in the Files tab in RStudio. Only part of the path is shown: the folders closest to the files.

Quotes or not for a file path? When you are referring to a file path in R, it will always be in quotation marks; it’s a character string. Other software such as web browsers don’t use quotes when specifying a path.
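A short sketch of why the quotes matter: with quotes, the path is an ordinary character string; without them, R tries to read the text as code and fails. Here parse() is used only to demonstrate that failure safely, and the variable names are illustrative.

```r
# Quoted: an ordinary character string that R can store and pass around
path <- "/Users/kaplan/Downloads/0021_001.pdf"
is.character(path)

# Unquoted: R would try to read the text as code. parse() lets us check
# that without stopping the session; the error is captured, not raised.
attempt <- tryCatch(
  parse(text = "/Users/kaplan/Downloads/0021_001.pdf"),
  error = function(e) e
)
inherits(attempt, "error")
```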

URLs

You have probably noticed URLs in the locator window near the top of your browser. In Figure 4.3, the URL is:

http://tiny.cc/dcf/index.html.

Figure 4.3: A browser directed to the URL http://tiny.cc/dcf/index.html.

A URL includes in its path name the location of the server on which the file is stored (e.g. tiny.cc) followed by the path to the file on that server. Here, the path is /dcf and the file is index.html.
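To make the parts concrete, here is a rough sketch that splits the example URL into server, folder path and file using base R string functions. It is an illustration under simple assumptions (a plain URL with no port, query string or fragment), not a general URL parser, and the variable names are invented for the example.

```r
url <- "http://tiny.cc/dcf/index.html"

# Drop the leading scheme, e.g. "http://"
no_scheme <- sub("^[a-zA-Z][a-zA-Z0-9+.-]*://", "", url)

# The server is everything before the first "/"
server <- sub("/.*$", "", no_scheme)

# What remains, starting at the first "/", locates the file on that server
full_path <- sub("^[^/]*", "", no_scheme)
folder    <- dirname(full_path)
file_name <- basename(full_path)

print(c(server = server, folder = folder, file = file_name))
```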

You will sometimes need to copy URLs into your work in R, to access a dataset, to make a link to some reference, etc. Remember to copy the entire URL, including the http:// part if it is there.

Some common filename extensions for the sort of web resources you will be using:

• .png for pictures
• .jpg or .jpeg for photographs
• .csv or .Rdata for data files
• .Rmd for the human editable text of a document
• .html for web pages themselves

Data Files

Data tables are often stored individually as files. There are many formats for data files. Among the most common is the .csv spreadsheet format, popular because reading it is a standard feature of many data analysis packages (including R).

If you use R extensively, you will also encounter data in .Rda (or .Rdata), an efficient format for storing data and other information specifically for R. When you get data from an R package, like this:

data(CPS85, package="mosaicData")

you are in fact reading in an .Rda file associated with the package.

There are many other formats for files containing data tables or the information needed to put the contents in data-table format. These are discussed in Chapter 15. Increasingly, data are accessed through database systems. In such a case, rather than reading the database as a whole, you make “queries” to access specific data you need.

database: A computer system used to store, update, and access data. Relational databases consist of data tables.

Files containing data tables are often distributed via the web. These files are not any different than files on your own computer. You access such files through a Uniform Resource Locator, better known as a URL. For instance, on the Data Computing web site, there are a number of .csv files. You can access them from R with commands like this:

URL: A name for a specific resource, such as a file, on the Internet.

library(mosaic)  # read.file() is provided by the mosaic package
Engines <- read.file("http://tiny.cc/mosaic/engines.csv")
Engines

Engine          mass ncylinder strokes displacement  bore stroke  BHP   RPM
Webra Speedy    0.14         1       2         1.80 13.50  12.50 0.45 22000
Motori Cipolla  0.15         1       2         2.50 15.00  14.00 1.00 26000
Webra Speed 20  0.25         1       2         3.40 16.50  16.00 0.78 22000

... and so on for 39 rows

So far as accessing data is concerned, there’s nothing fundamentally different in reading a file from a URL than reading a file on your own computer.
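The point can be sketched with base R's read.csv(), used here in place of read.file() so that the example needs no extra packages. The sketch writes a tiny .csv file to a temporary folder (the name engines_sample.csv is hypothetical, and the two rows are taken from the first columns of the engines table printed above) and reads it back from its path. Reading from the web uses the identical call with a URL in place of the path; that line is left commented out so the example runs without a network connection.

```r
# Write a small sample file; the file name is made up for this example
csv_text <- c(
  "Engine,mass,ncylinder",
  "Webra Speedy,0.14,1",
  "Motori Cipolla,0.15,1"
)
local_path <- file.path(tempdir(), "engines_sample.csv")
writeLines(csv_text, local_path)

# Reading from a path on your own computer
FromFile <- read.csv(local_path)
print(FromFile)

# Reading from the web: same function, with a URL as the quoted string
# FromWeb <- read.csv("http://tiny.cc/mosaic/engines.csv")
```

In both cases the argument is a quoted character string; only its contents differ.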

Documenting your work with Rmd files

The purpose of data wrangling and visualization is communication: condensing and presenting data in a form that conveys information. An important part of communication is documentation and reporting.

Writing is not a linear process. Ideas are presented, revised or abandoned, corrected, re-focussed, and re-arranged. Data-oriented technical reports tie together narrative, graphics and summary tables and are based on potentially complex computer commands. There is an interplay between the computer commands and the narrative. Results from the computer may drive reconsideration of the narrative. Gaps in the narrative may point to shortcomings or omissions in the computer commands. And, always, there is the possibility of errors in writing commands and the need to document commands so that they can be checked and corrected. As well, data are commonly updated, corrected or extended.

The familiar practice of cutting and pasting from the computer console into a word processor does not address these features of technical reports. Cutting and pasting makes it hard to revise or update a report; you’ve got to cut out the old and paste in the new, figuring out for yourself which is which. This introduces the likelihood of error. And, there’s nothing to document the linkages between the computer commands and the word-processed document.

An important concept in data-driven reporting is “reproducibility.” The idea is to be able to reproduce your entire document without any manual intervention, and, more important, to be easily able to generate a new report in response to changes in data or revisions in computer commands. In other words, reproducible reports contain