Foundations of Data Science - Data Science with R



    Data Science with R

    Foundations of Data Science

    1. Outline the chapter as
        a. definitions
            o. an observation has a unit
                i. individuals are one type of unit
                ii. groups are a high-level unit
            i. a natural law applies to individual (low-level) observations
            ii. a statistical law applies to (high-level) groups of observations
            iii. a unit of analysis is a property of the question you are trying to answer
        b. how data reveals patterns
            i. patterns can be there even if you do not see them
            ii. you can use patterns to test hypotheses and make predictions
        c. Simple, right? But why do things go wrong so often?
        c. but data science is not magic, it relies on similarity
            i. your observations must be similar to the points you want to predict
            ii. and your group of points must be similar to the population at large
            iii. because patterns are like statistics, they describe group behaviour
        d. data scientists are not the first to try to learn about the world.
            i. Epistemologists have trodden this ground before and they proved that you cannot know with certainty
            ii. the problem of induction
            iii. but data science is the most pragmatic thing that you can do

    O'Reilly publishes nine books on data science and one of them is named "What is Data Science?" When you open any of these books you should ask yourself what you are getting into. As a term, data science has come to mean several things.

    At one level, data science is a body of knowledge, a collection of useful information related to a specific task. For example, library science and managerial science are bodies of knowledge. Library science collects the best ways to run a library, and managerial science collects the best ways to run a business. Data science collects the best ways to store, retrieve, and manage data. As a result, a data scientist might know how to set up a Hadoop cluster or run the latest type of non-relational database. This is probably what most people think of when they think of data science, but this is not the type of data science that I will teach you.

    At another level, data science is a way of doing science. Data scientists use data, models, and visualizations to make scientific discoveries, just as other scientists use experiments. In fact, you can think of data science as a method of science that complements experimental science. Experimental scientists use the experimental method to solve scientific problems, and data scientists use the data science method. Many scientists use both.

    This book will teach you the method of data science. You will learn how to use data to make discoveries, and to justify those discoveries once they are made. Along the way, you will learn how to visualize data, build models, and make predictions.

    In this chapter, you will learn the strategy behind data science: data scientists search for evidence of natural laws in the structure of data. They then judge the strength of the evidence that they find, and are able to develop insights based on the laws they discover. This strategy guides the techniques that you will learn in later chapters: techniques like data wrangling, exploratory data analysis (EDA), bootstrap sampling, and cross-validation.

    The scientific worldview

    As a method of science, data science is based on two simple ideas. First, that the best way to learn about the world is to observe it. And second, that the universe operates according to natural laws. These ideas summarize the worldview shared by many scientists, and they provide the vocabulary that will help us talk about data science.

    A natural law is a rule that describes a part of the natural world, like $E = Mc^2$ or $F = MA$. Natural laws help scientists understand, control, and make predictions about natural processes.

    You can write down a natural law as a relationship between variables. For example, $E = Mc^2$ is a natural law that states that the energy content of a system ($E$) is always equal to the mass of the system ($M$) multiplied by the speed of light squared ($c^2$).

    $F = MA$ is a natural law that explains that a force ($F$) exerted upon an object will cause the object to accelerate ($A$) at a rate proportional to the force and inversely proportional to the mass of the object ($M$), an insight that has many applications in the field of physics.

    Natural laws deal with variables, values, and observations. We use these terms in everyday speech, but they have a technical meaning when associated with science.

    A variable is a quantity, quality, or property that can be measured.

    A value is the apparent state of a variable when you measure it. The value of a variable may change from measurement to measurement.

    An observation is a set of values that are measured on multiple variables under similar (ideally identical) conditions.

    You can think of an observation as a snapshot of the world. An observation shows what a group of variables looked like together for a brief moment before they changed.

    Natural laws deal with variables, but they operate on values that appear in the same observation. For example, the law $F = MA$ states that when you measure the force, mass, and acceleration associated with the same particle at the same time, you will observe a trio of values such that

    $f_1 = m_1 a_1$

    or

    $f_2 = m_2 a_2$

    or

    $f_3 = m_3 a_3$

    and so on.

    In the notation above, the lowercase letters denote specific values of the variables $F$, $M$, and $A$. Throughout the book, I will refer to variable names with a capital letter and individual values with a lowercase letter.

    The subscripts denote which observation each of the values belongs to. If a set of values belongs to the same observation, it implies that the values were measured under similar conditions.

    To see how this works, consider what the three observations above may represent. The observations may have been taken at three different times. For example, $f_1$, $m_1$, and $a_1$ may have been measured at time one, $f_2$, $m_2$, and $a_2$ measured at time two, and so on. Alternatively, the observations may describe three different particles. For example, $f_1$, $m_1$, and $a_1$ may have been measured on one particle, $f_2$, $m_2$, and $a_2$ may have been measured on a second particle, and $f_3$, $m_3$, and $a_3$ may have been measured on a third particle.

    Observations play a very important role in science. A natural law implies that a relationship will exist between values of variables that appear in the same observation. However, a natural law does not imply that a relationship will exist between values in different observations. You wouldn't think that the force you exert on one object would equal the mass times the acceleration that you measure on a different object. Or, in the notation above, you wouldn't think that $f_1$ should equal $m_2$ times $a_2$. You would expect $f_1$ to equal $m_1$ times $a_1$.
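    If it helps to see this distinction concretely, here is a minimal Python sketch. The three observations and their numbers are invented for illustration (they are not measurements from this book), and the helper name law_holds is hypothetical. The sketch checks the law of force, mass, and acceleration within one observation and then across two different observations.

```python
# Three hypothetical observations of mass and acceleration.
# The numbers are invented for illustration only.
observations = [
    {"obs": 1, "m": 2.0, "a": 3.0},
    {"obs": 2, "m": 5.0, "a": 1.5},
    {"obs": 3, "m": 0.5, "a": 8.0},
]

# Assume the law holds within each observation: f_i = m_i * a_i.
for row in observations:
    row["f"] = row["m"] * row["a"]


def law_holds(f, m, a, tol=1e-9):
    """Return True when f equals m * a up to a small tolerance."""
    return abs(f - m * a) <= tol


# Values taken from the same observation satisfy the law.
same_observation = all(law_holds(r["f"], r["m"], r["a"]) for r in observations)

# Values taken from different observations have no reason to satisfy it:
# here f_1 is paired with m_2 and a_2.
f1 = observations[0]["f"]
m2, a2 = observations[1]["m"], observations[1]["a"]
cross_observation = law_holds(f1, m2, a2)

print(same_observation, cross_observation)  # prints: True False
```

    The list of dictionaries is only one possible layout. What matters is that every value is tied to both a variable (the key) and an observation (the row).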

    Natural laws provide a goal for science. Scientists attempt to discover natural laws and thereby explain natural phenomena. You can think of science as a collection of methods that use observations to discover natural laws.

    Data science is one of those methods. It uses a specific tool to reveal natural laws, and that tool is data.

    Data, or more precisely a data set, is a collection of values that have been organized in a specific way: each value in a data set is associated with a variable and an observation.

    For example, you can use the values $f_1$, $m_1$, $a_1$, $f_2$, $m_2$, $a_2$, $f_3$, $m_3$, and $a_3$ to compose a data set, like the one below.

    obs    F        M        A
    1      $f_1$    $m_1$    $a_1$
    2      $f_2$    $m_2$    $a_2$
    3      $f_3$    $m_3$    $a_3$

    Now that you know the vocabulary of data science, let's look at the method.

    Patterns

    Recall that a data set organizes values so that each value is associated with a variable as well as an observation. This organization makes data sets particularly useful for discovering natural laws. If a natural law exists between the variables in a data set, the law will appear as a pattern that recurs in each observation. Or to put it more simply, a natural law will appear as a pattern in data.

    In our example data set, the relationship described by the law $F = MA$ will be present in each observation. As a result, the data set will reveal what the natural law implies:

    obs    F        =    M        ×    A
    1      $f_1$    =    $m_1$    ×    $a_1$
    2      $f_2$    =    $m_2$    ×    $a_2$
    3      $f_3$    =    $m_3$    ×    $a_3$

    This is easy to verify if you measure the real forces, masses, and accelerations associated with several dozen particles, like in the data set below. Each row of values displays the relationship $F = MA$.

    [Table: observations 1 to 8 of F, M, and A; the measured values did not survive extraction.]

    A data set transforms a law into a pattern, which makes data a very useful tool for science. The tool isn't perfect: patterns can be very difficult to spot in raw data. But you can optimize how you search for patterns.

    First, you can transform your variables or compute summary statistics in a way that would make patterns easier to spot. Data scientists often transform their data, a process known as data wrangling, to prepare for the steps that follow.

    Second, you can visualize raw data to make patterns easier to spot. Notice how the pattern between $F$, $M$, and $A$ becomes obvious when you visualize the data with a three-dimensional, or even a two-dimensional, graph. The relationship $F = MA$ appears as a surface in three dimensions that each of the data points falls on. This surface resembles a line when it is projected into a two-dimensional graph.

    Third, you can use a computer algorithm to search for patterns within data, which is exactly what data scientists do when they use statistical modeling or machine learning techniques.

    Moreover, you can count on laws to appear as patterns in data under a wide variety of conditions. Consider what would happen if your data failed in some way, for example, if your measurements were inaccurate, or if your data did not contain all of the variables in a law.

    If your values are contaminated by measurement errors, the errors will add noise to your data. As long as the errors are relatively small, laws will still emerge in your data as discernible, but noisy, patterns.

    You can see measurement errors at work in the graphs below. The graph on the left displays two variables that are related by the law $Y = X$. However, the measurements were made in a sloppy fashion that resulted in inaccurate values. The graph on the right displays the same data after the measurement errors have been corrected. Notice that you can still perceive the underlying pattern even when it has been contaminated by measurement errors.
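    You can also check the effect of measurement errors with a small simulation. The sketch below is only an illustration under stated assumptions: it assumes a simple law in which Y equals X, invents 200 true values, and adds random errors of a size I chose arbitrarily. The pearson helper is written out by hand so that the example needs nothing beyond Python's standard library.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(1)  # fixed seed so the run is repeatable

# True values follow the assumed law exactly: Y = X.
x_true = [rng.uniform(0, 10) for _ in range(200)]
y_true = list(x_true)

# Sloppy measurements: each recorded value carries a small random error.
x_measured = [x + rng.gauss(0, 0.5) for x in x_true]
y_measured = [y + rng.gauss(0, 0.5) for y in y_true]

r_true = pearson(x_true, y_true)
r_measured = pearson(x_measured, y_measured)

print(f"correlation without errors: {r_true:.3f}")
print(f"correlation with errors:    {r_measured:.3f}")
```

    The correlation drops below one once errors are added, but it should stay high as long as the errors are small relative to the spread of the true values. Increase the error size (the 0.5 above) to watch the pattern dissolve.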

    You will face a similar situation if your data contains some, but not all, of the variables in a law. When this happens, a pattern will still appear between the variables in your data set that are connected by a law. The influence of the missing variables will appear as noise in the pattern.

    You can see this happen in the two-dimensional graph between $F$ and $A$. The graph ignores the $M$ variable, as if it were not part of the data set. As a result, the $M$ variable adds noise to the linear pattern between $F$ and $A$, but the pattern is still discernible.

    Noise in your data is not a cause for defeat. As long as you capture the most influential variables in a law, and do not let measurement errors get so big that they swamp your data, you are likely to find a pattern that will point to the law.
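    A short simulation can make the missing-variable case concrete. This is a sketch with invented numbers, not data from the book: it generates hypothetical masses and accelerations, computes force from them exactly, and then measures how strongly force and acceleration move together when mass is left out.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(2)  # fixed seed so the run is repeatable

# 200 hypothetical particles; masses and accelerations are invented values.
mass = [rng.uniform(1.0, 2.0) for _ in range(200)]
accel = [rng.uniform(0.0, 10.0) for _ in range(200)]
force = [m * a for m, a in zip(mass, accel)]  # the law holds exactly

# Pretend mass was never recorded and look only at force and acceleration.
r_force_accel = pearson(accel, force)

# With mass restored, force is recovered exactly from the other two variables.
max_residual = max(abs(f - m * a) for f, m, a in zip(force, mass, accel))

print(f"correlation of force and acceleration, mass ignored: {r_force_accel:.3f}")
print(f"largest error when mass is included: {max_residual}")
```

    The correlation is strong but not perfect: the unrecorded mass shows up as scatter around the line. How much scatter you get depends on how widely the missing variable ranges, which is an assumption baked into the numbers above.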

    Correlations

    So far we have considered how data will display $F = MA$, a natural law that describes a causal relationship between physical variables. But now we can see that data will also display patterns that do not involve causal relationships (or physical variables).

    Consider two everyday variables that are strongly correlated. For example, consider the price of Chevron stock and the price of BP stock. These two companies compete against each other, but the prices of their stocks tend to rise and fall at the same times. This is because the companies both sell oil. As the price of oil rises, so does the price of each stock.

    This correlation forms a relationship between the prices, but the relationship is not causal. The price of BP stock does not cause the price of Chevron stock.

    Will data display a non-causal relationship between variables? Yes it will, and it is easy to see why. The price of each stock is caused by the price of oil plus some company-specific variables that determine how profitable each company is, i.e.

    $\text{price}_{\text{Chevron}} = \text{price}_{\text{Oil}} + \text{Chevron-specific variables}$

    and

    $\text{price}_{\text{BP}} = \text{price}_{\text{Oil}} + \text{BP-specific variables}$

    Simple algebraic substitution shows that this arrangement implies a relationship between the price of Chevron and BP stock.

    $\text{price}_{\text{Chevron}} = \text{price}_{\text{BP}} - \text{BP-specific variables} + \text{Chevron-specific variables}$

    This relationship will appear as a pattern whenever you collect data on Chevron and BP stock. Since we do not collect data on the BP-specific and Chevron-specific variables, these variables will show up as noise in the pattern. In short, our graph may look something like the noisy graph between $X$ and $Y$ above.

    To summarize, data will display any sort of relationship between variables as a pattern, whether or not that relationship involves a causal association.
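    The common-cause argument is easy to try out. The sketch below is a toy model with invented numbers, not real market data: each simulated stock price is the oil price plus its own company-specific noise, and neither stock price is ever used to compute the other.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(3)  # fixed seed so the run is repeatable
n_days = 500

# Hypothetical oil prices; the scale is arbitrary.
oil = [rng.gauss(100, 10) for _ in range(n_days)]

# Each stock depends on oil plus its own company-specific variables.
# Note that neither stock appears in the formula for the other.
chevron = [price + rng.gauss(0, 5) for price in oil]
bp = [price + rng.gauss(0, 5) for price in oil]

r_stocks = pearson(chevron, bp)
print(f"correlation between the two stock prices: {r_stocks:.3f}")
```

    The two simulated prices are clearly correlated even though the code contains no causal link between them. The unrecorded company-specific terms are what keep the correlation below one.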

    To account for this, I will use the term natural law loosely in this book. A natural law is a causal relationship or reliable correlation between variables. You might argue that correlations are not natural laws but are more like natural shorthand rules. You are right, but correlations do provide valuable information. As you will see later in the chapter, data scientists tend to make as much use of correlations as other scientists do of causal laws (and a correlation can suggest the operation of some unidentified causal law).

    Sample effects

    You have seen that data provides a simple way to spot natural laws, and that this method works in a variety of situations. Why then, does data science have such a fearsome reputation?

    Unfortunately, natural laws are not the only thing that can cause patterns to appear in data. Sometimes data sets display patterns that do not exist in real life. These patterns are illusions and lead to false results. How can you tell whether the patterns that you do find are real and not an illusion? Before we answer that question, let's examine why a data set might contain patterns that do not exist in real life.

    Most data sets are much smaller than they could be. For example, if you wanted to research a question like, "How is an adult's height related to their age?", you could collect a very big data set: the measurements of every single adult on the planet. But that wouldn't be necessary. A pattern between height and age would become clear well before you finish measuring every adult on the planet (and if it doesn't, a pattern between your data collection efforts and your quality of life certainly would).

    Data scientists refer to the universe of possible observations that you could collect as a population, and the set of observations that you actually collect as a sample. The process of collecting a sample of data is known as sampling, and it has important consequences for data science. Sampling opens the door for illusions to creep into a data set.

    Consider the two data sets visualized below.

    The graph on the left shows the relationship between the age and height of 1000 adults. In adults, these two variables are not closely related. As a result, the points appear as an unstructured cloud, with no patterns.

    The graph on the right displays the relationship between height and weight for the same adults. An adult's height is related to their weight, and the data points display a pattern as a result. The pattern is noisy because other variables (such as diet and exercise) also play a role in a person's weight, and their effect appears here as noise in the pattern.

    Let's do a simple thought experiment. Imagine that these 1000 adults are the only adults on the planet. In other words, imagine that these data sets display the entire population of adults. Now suppose that you only observed 50 of these adults. What would your data look like?

    We can randomly select 50 of the data points above to see. More than likely, the 50 points would display a less dense, but still unstructured cloud on the left and a less dense, but still noticeable pattern on the right. For example, here are 50 points randomly selected from the original data sets.

    [Figure: two scatterplots of the 50 selected adults, one of age and height (left) and one of height and weight (right).]

    However, by coincidence you might collect 50 observations that display an illusion. For example, any of the collections below would suggest that a natural law exists between height and age.

    And any of the collections below would suggest that a natural law does not exist between height and weight, or worse, the last pattern suggests that an inverse relationship exists between height and weight.

    These patterns are illusions. They are not caused by natural laws; they are caused by omission and coincidence. We did not collect all of the possible observations (which would've revealed the true law). The observations that we did collect happened to make an unusual set.

    Notice how diabolical this situation can be. The individual measurements in each of these samples are correct, and yet the patterns displayed by the measurements do not exist in real life.
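    You can reproduce this thought experiment with a few lines of Python. The sketch below is an illustration with invented numbers: it builds a hypothetical population of 1000 adults in which age and height are generated independently, draws 2000 random samples of 50, and records the correlation that each sample displays. The 0.3 cutoff for calling a sample "misleading" is my own arbitrary choice.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def correlation_of(rows):
    """Correlation between age and height in a list of (age, height) pairs."""
    ages = [age for age, _ in rows]
    heights = [height for _, height in rows]
    return pearson(ages, heights)


rng = random.Random(4)  # fixed seed so the run is repeatable

# Hypothetical population: age and height are generated independently,
# so no law connects them.
population = [(rng.uniform(20, 80), rng.gauss(170, 10)) for _ in range(1000)]
r_population = correlation_of(population)

# Draw many samples of 50 adults and see what each sample displays.
sample_rs = [correlation_of(rng.sample(population, 50)) for _ in range(2000)]
strongest = max(sample_rs, key=abs)
share_misleading = sum(abs(r) > 0.3 for r in sample_rs) / len(sample_rs)

print(f"correlation in the whole population: {r_population:.3f}")
print(f"strongest correlation in any sample:  {strongest:.3f}")
print(f"share of samples with |r| above 0.3:  {share_misleading:.3f}")
```

    Most samples look like the population, but a small share display a noticeable pattern that the population does not contain. Every individual value in those samples is correct; only the selection is unusual.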

    Due to sampling effects, data sets often display patterns that do not exist in real life, which creates a challenge for data scientists. As a data scientist, your main source of evidence for natural laws will be patterns (or descriptions of patterns) that you find in data. Will you be able to tell when your patterns are caused by natural laws and when they are caused by sampling effects?

    In theory, there is no way to use a data set to determine whether the patterns contained in the data exist in real life. Or, more precisely, there is no way to determine with absolute certainty whether the patterns exist in real life.

    In practice, there is a way forward. You can calculate the probability that a pattern is the result of random chance.

    Probability

    Probability is the branch of mathematics that describes random behavior. We will take a look at probability later in the book, but for now let's consider how you can use probability to spot real patterns.

    Recall that sampling is the source of illusions when illusions appear in your data. In other words, which observations you decide to collect will determine which patterns you see (if any).

    If you use a random method to select observations, then random chance will be the only mechanism that could cause sampling effects to appear in the data. You could then calculate the probability that a pattern in the data is a result of random chance, and not a natural law.

    This system reduces patterns in data from proofs of natural laws to evidence of natural laws. Each pattern that you find is evidence of a natural law. If the pattern is likely to be caused by random chance, then the evidence is weak. If the pattern is not likely to be caused by random chance, then the evidence is strong.

    A probability calculation will tell you exactly how weak or how strong your evidence is. As a data scientist, you will need to decide for yourself how strong the evidence must be before you are willing to believe it.

    It is important to realize that probability does not eliminate the uncertainty associated with patterns. There will always be a small possibility that even the most striking patterns are caused by random chance. Probability calculations do not eliminate this possibility; they quantify it, which makes it easier for you to reason about your evidence.
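    This chapter does not say which calculation to use, so the sketch below picks one common option, a permutation test, purely as an illustration. It shuffles one variable many times to break any real relationship, then counts how often chance alone produces a correlation at least as strong as the one observed. The data are invented, and the function name chance_probability is hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def chance_probability(xs, ys, rng, n_shuffles=2000):
    """Estimate how often shuffled data shows a pattern as strong as the real one."""
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    at_least_as_strong = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # destroys any real link between xs and ys
        if abs(pearson(xs, shuffled)) >= observed:
            at_least_as_strong += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (at_least_as_strong + 1) / (n_shuffles + 1)


rng = random.Random(5)  # fixed seed so the run is repeatable

# A hypothetical random sample of 50 adults with a built-in relationship.
heights = [rng.gauss(170, 10) for _ in range(50)]
weights = [0.9 * (h - 100) + rng.gauss(0, 8) for h in heights]

# A second variable with no relationship to height at all.
unrelated = [rng.gauss(0, 1) for _ in heights]

p_real = chance_probability(heights, weights, rng)
p_unrelated = chance_probability(heights, unrelated, rng)

print(f"height and weight:         {p_real:.4f}")
print(f"height and unrelated data: {p_unrelated:.4f}")
```

    A small probability means the pattern would rarely arise by chance, so the evidence is strong. The calculation never reaches zero, which matches the point above: probability quantifies the uncertainty but does not remove it.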

    Data scientists use probability calculations to augment the simple system of discovery presented by data. This arrangement creates the method of data science, which can be described with a basic outline.

    The Method of Data Science

    Data scientists search for evidence of natural laws in the structure of data. They then judge the strength of the evidence that they find. To do this, they:

    1. Collect data in a way that minimizes the chance that patterns will appear by coincidence. Often this involves some type of random selection.

    2. Search for patterns that provide evidence of natural laws. During this search a data scientist will often:

        Wrangle data - make patterns more apparent by reshaping, subsetting, or transforming the data.
        Visualize data - display data in a graph, which exposes patterns to the human visual system.
        Model data - search for patterns with computer algorithms that can be automated, calibrated, and optimized.

    3. Judge patterns - calculate the probability that a pattern is due to random chance, and not a natural law. You can view this step as measuring the strength of the evidence provided by an analysis.

    This method involves a level of uncertainty. In many ways, as a data scientist, you will be a specialist in uncertainty. You will not work with proofs, like a mathematician, but with evidence that comes with a certain probability that it might be wrong.

    Given this ambiguity, you may wonder why anyone would practice data science. There are some very good reasons.

    Why do Data Science?

    Data science complements other methods of scientific inquiry. To see the strengths of data science, let's compare it to experimental science, a well-known way to do science. To summarize loosely, experimental scientists use the following method to learn about natural laws:

    1. Formulate a hypothesis about a natural law.

    2. Make a testable prediction deduced from the hypothesis.

    3. Conduct an experiment to test the prediction.

    4. Reject the hypothesis if the prediction is incorrect.

    Discovery and confirmation

    You may notice that the experimental method begins with a hypothesis and then uses observations to test the hypothesis. This approach makes the experimental method very good for confirming hypotheses. Experimental scientists can quickly winnow false hypotheses from very plausible hypotheses.

    However, the experimental method does not answer a very important question: how should scientists think up useful hypotheses to test? Data science provides the answer. A scientist can begin with observations and then search them for patterns that suggest hypotheses about natural laws. In short, data science provides a system of discovery for scientists to use.

    Causation and prediction

    Experiments are designed to show causation, a specific type of relationship between variables. An experimenter manipulates an explanatory variable to observe the effect that the manipulation causes in a response variable. This design makes experiments less effective at discovering non-causal relationships.

    Why would you want to discover a non-causal relationship? Whenever a relationship exists between variables, you can use the relationship to make better predictions. You can use the value of one variable to predict the value of another variable that it is related to. This works even if the relationship is a non-causal correlation.

    Consider, for example, how Netflix knows which movies you will like. By studying data, Netflix has learned that people who like The Matrix also tend to like The Terminator and vice versa. This relationship is very useful, but it is not causal: your opinion of The Matrix does not cause your opinion of The Terminator.

    In contrast to experimental science, data science makes it easy to spot any type of relationship between variables. Data science will expose both non-causal and causal relationships as patterns in the data. Data science will not tell you which relationships are causal and which are not, but if you are only interested in making accurate predictions, you may not mind.
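    To see how a non-causal correlation can still improve predictions, consider the following sketch. It is a toy model with invented scores, not Netflix data or Netflix's method: each hypothetical viewer has a hidden taste that drives their scores for both movies, and a least-squares line (my choice of technique for the illustration) predicts one score from the other.

```python
import random

rng = random.Random(6)  # fixed seed so the run is repeatable


def make_viewer():
    """Return invented (matrix_score, terminator_score) for one viewer."""
    taste = rng.gauss(0, 1)  # hidden common cause; never observed directly
    matrix = 3 + taste + rng.gauss(0, 0.5)
    terminator = 3 + taste + rng.gauss(0, 0.5)
    return matrix, terminator


def fit_line(pairs):
    """Least-squares slope and intercept for predicting y from x."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def mean_squared_error(predictions, actual):
    return sum((p - a) ** 2 for p, a in zip(predictions, actual)) / len(actual)


train = [make_viewer() for _ in range(300)]
test = [make_viewer() for _ in range(100)]

slope, intercept = fit_line(train)
mean_terminator = sum(y for _, y in train) / len(train)

actual = [y for _, y in test]
# Prediction that uses the correlation with the other movie's score.
mse_line = mean_squared_error([intercept + slope * x for x, _ in test], actual)
# Baseline that ignores the other movie and always guesses the average.
mse_mean = mean_squared_error([mean_terminator] * len(test), actual)

print(f"error using the correlation: {mse_line:.3f}")
print(f"error ignoring it:           {mse_mean:.3f}")
```

    Neither score causes the other in this model, yet knowing one score reduces the prediction error for the other. The size of the improvement depends on the noise levels assumed above.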

    Flexibility and control

    Consider for a moment why experiments prove causation. An experimenter does more than manipulate an explanatory variable to see the effect on a response variable. An experimenter also holds constant any other variable that could influence the response during the experiment. For example, an experimenter will ensure that the temperature, humidity, local magnetic fields, etc. do not fluctuate during an experiment.

    As a result, the experimenter can rule out the possibility that something other than the explanatory variable caused the effect in the response variable. This method is almost foolproof, but it requires a tremendous amount of control over the process being studied.

    In many research settings, this amount of control is impossible or unethical. For example, you could not control each of the variables that influences something like the stock market, or a nation's economy. Nor should you control variables like how much alcohol a pregnant woman ingests or how much pollution an asthmatic person inhales if doing so would cause unnecessary harm.

    Data science requires much less control than experimental science, which makes data science adaptable to a broader range of research questions. As a data scientist, you do not need to intervene in a process to study it. You can collect data passively by observing nature as it is, which can free you from the ethical and logistical burdens that an experimental scientist would face.

    Take Caution

    We are starting to learn that most published data science findings may be wrong. In 2012, Amgen determined that only 6 of 53 landmark medical studies had results that could be reproduced. From a scientific point of view, this means that these studies should be considered unreliable, if not wrong. In 2011, the Bayer company found it could only reproduce 25% of published findings in cardiovascular disease, cancer, and women's health studies. Bayer shelved development of two thirds of its new drug projects as a result.

    Data science goes wrong in other fields too. The 2008 Financial Crisis was enabled by a misapplication of the Gaussian copula, a data analysis technique. In another case, NASA analyzed global ozone data for seven years without noticing the hole in the ozone layer. The most famous data analysis failure probably happened in 1986. Engineers at Morton Thiokol, the builder of space shuttle booster rockets, predicted that the Challenger would explode on launch. They had a chance to prevent the launch, but changed their minds after misreading data that proved them right.

    Even famous statisticians can get data wrong. Sir Ronald Fisher invented much of modern statistics, but he spent the end of his career using data to show that cigarettes do not cause cancer.

    This doesn't mean you should avoid data. Looking at data will always create better understanding than ignoring it, but remember that data is not a cure-all. Good science also requires good reasoning and skepticism.

    Summary and Parting Advice

    The method of data science is very simple and very effective. Data scientists search for evidence of natural laws in the structure of data. If a natural law exists between the variables in a data set, it will appear as a pattern in the data.

    This method is very useful for discovering laws and for collecting information that can lead to better predictions. Moreover, you can apply data science to any situation in which you can collect data.

    But data can be very deceptive. Patterns can be hidden in noise and may not appear at first glance. Moreover, coincidences, or biases, that occur when you collect your data can introduce patterns into your data that do not occur in real life. These things cause formidable challenges; let the sidebar serve as a cautionary tale.

    How can you do better than the people mentioned in the sidebar? You already have one advantage. Many people who practice (and fail at) data science do not study data science and might not appreciate how deceptive data can be. You can further protect yourself by adopting two traits that will safeguard your work. You can be curious enough to explore a data set thoroughly, exposing any patterns that are there. Then you can be skeptical enough to question every pattern that you find and to search for alternative explanations.

    John Tukey, one of the first data scientists, often compared data science to detective work. I like this metaphor because detectives are both curious and skeptical. Also, detective work is risky business, and so is data science. But I would extend the metaphor further. If you think of yourself as a detective, you should think of data as the mysterious blonde who walks into your office: sexy on the surface, murky and treacherous beneath.

    Garrett Grolemund. Pre-order Data Science with R at shop.oreilly.com.