dsc 201: data analysis & visualizationdkoop/dsc201-2017fa/lectures/lecture16.pdf · read_csv...

DSC 201: Data Analysis & Visualization

Data Cleaning

Dr. David Koop

D. Koop, DSC 201, Fall 2017

File Handling in Python

• Use open and f.close():

    f = open('huck-finn.txt', 'r')
    for line in f:
        if 'Huckleberry' in line:
            print(line.strip())
    f.close()

• Open flags indicate whether the file is being read ('r') or written ('w')
• Using a with statement, close can be done automatically:

    with open('huck-finn.txt', 'w') as f:
        for line in lines:
            if 'Huckleberry' in line:
                f.write(line)

• Closing the file is important when writing files!
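The read and write patterns above can be combined in one small round trip. This is only a sketch: the sample lines and the file name filtered.txt are hypothetical stand-ins for the text of huck-finn.txt, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

# Hypothetical sample lines standing in for the text of huck-finn.txt.
lines = [
    "Huckleberry Finn went down the river.\n",
    "Tom stayed behind.\n",
    "Jim and Huckleberry built a raft.\n",
]

# Work in a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "filtered.txt")

    # 'w' opens for writing; leaving the with block closes (and flushes) the file.
    with open(path, "w") as f:
        for line in lines:
            if "Huckleberry" in line:
                f.write(line)

    # 'r' opens for reading; iterating over the file yields one line at a time.
    with open(path, "r") as f:
        matches = [line.strip() for line in f]

print(matches)
```

Because the write happens inside a with block, the data is on disk before the file is reopened for reading.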


Comma-separated values (CSV) Format

• Comma is a field separator, newlines denote records

    a,b,c,d,message
    1,2,3,4,hello
    5,6,7,8,world
    9,10,11,12,foo

• May have a header (a,b,c,d,message), but not required
• No type information: we do not know what the columns are (numbers, strings, floating point, etc.)
  - Default: just keep everything as a string
  - Type inference: Figure out what type to make each column based on what they look like
• What about commas in a value? → double quotes
• Can use other delimiters (|, <space>, <tab>)
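The quoting rule and the lack of type information can be seen with Python's standard csv module. This is a minimal sketch: the sample text and the infer helper are made up for illustration, and the helper is a deliberately naive form of type inference, not what pandas does internally.

```python
import csv
import io

# Hypothetical CSV text; the last record has a comma inside a quoted value.
text = 'a,b,message\n1,2,hello\n3,4,"hello, world"\n'

rows = list(csv.reader(io.StringIO(text)))
header, records = rows[0], rows[1:]

# Every field comes back as a string: the format carries no type information.
print(records[0])


def infer(value):
    """Naive type inference (illustrative only): try int, else keep the string."""
    try:
        return int(value)
    except ValueError:
        return value


typed = [[infer(v) for v in rec] for rec in records]
print(typed)
```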


Reading Data in Pandas


[W. McKinney, Python for Data Analysis]

CHAPTER 6

Data Loading, Storage, and File Formats

Accessing data is a necessary first step for using most of the tools in this book. I’m going to be focused on data input and output using pandas, though there are numerous tools in other libraries to help with reading and writing data in various formats.

Input and output typically falls into a few main categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network sources like web APIs.

6.1 Reading and Writing Data in Text Format

pandas features a number of functions for reading tabular data as a DataFrame object. Table 6-1 summarizes some of them, though read_csv and read_table are likely the ones you’ll use the most.

Table 6-1. Parsing functions in pandas

Function         Description
read_csv         Load delimited data from a file, URL, or file-like object; use comma as default delimiter
read_table       Load delimited data from a file, URL, or file-like object; use tab ('\t') as default delimiter
read_fwf         Read data in fixed-width column format (i.e., no delimiters)
read_clipboard   Version of read_table that reads data from the clipboard; useful for converting tables from web pages
read_excel       Read tabular data from an Excel XLS or XLSX file
read_hdf         Read HDF5 files written by pandas
read_html        Read all tables found in the given HTML document
read_json        Read data from a JSON (JavaScript Object Notation) string representation
read_msgpack     Read pandas data encoded using the MessagePack binary format
read_pickle      Read an arbitrary object stored in Python pickle format
read_sas         Read a SAS dataset stored in one of the SAS system’s custom storage formats
read_sql         Read the results of a SQL query (using SQLAlchemy) as a pandas DataFrame
read_stata       Read a dataset from Stata file format
read_feather     Read the Feather binary file format

I’ll give an overview of the mechanics of these functions, which are meant to convert text data into a DataFrame. The optional arguments for these functions may fall into a few categories:

Indexing
    Can treat one or more columns as the returned DataFrame, and whether to get column names from the file, the user, or not at all.

Type inference and data conversion
    This includes the user-defined value conversions and custom list of missing value markers.

Datetime parsing
    Includes combining capability, including combining date and time information spread over multiple columns into a single column in the result.

Iterating
    Support for iterating over chunks of very large files.

Unclean data issues
    Skipping rows or a footer, comments, or other minor things like numeric data with thousands separated by commas.

Because of how messy data in the real world can be, some of the data loading functions (especially read_csv) have grown very complex in their options over time. It’s normal to feel overwhelmed by the number of different parameters (read_csv has over 50 as of this writing). The online pandas documentation has many examples about how each of them works, so if you’re struggling to read a particular file, there might be a similar enough example to help you find the right parameters.

Some of these functions, like pandas.read_csv, perform type inference, because the column data types are not part of the data format. That means you don’t necessarily have to specify which columns are numeric, integer, boolean, or string. Other data formats, like HDF5, Feather, and msgpack, have the data types stored in the format.

Handling dates and other custom types can require extra effort. Let’s start with a small comma-separated (CSV) text file:

    In [8]: !cat examples/ex1.csv
    a,b,c,d,message
    1,2,3,4,hello


Types of arguments for readers

• Indexing: choose a column to index the data, get column names from file or user
• Type inference and data conversion: automatic or user-defined
• Datetime parsing: can combine information from multiple columns
• Iterating: deal with very large files
• Unclean Data: skip rows (e.g. comments) or deal with formatted numbers (e.g. 1,000,345)


Reading and Writing CSV data with pandas

• Reading:
  - Basic: df = pd.read_csv(fname)
  - Use a different delimiter:
    • df = pd.read_csv(fname, sep='\t')
  - Skip the first few rows:
    • df = pd.read_csv(fname, skiprows=3)
• Writing:
  - Basic: df.to_csv(<fname>)
  - Change delimiter with sep kwarg:
    • df.to_csv('example.dsv', sep='|')
  - Change missing value representation
    • df.to_csv('example.dsv', na_rep='NULL')
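The reading and writing options above can be tried together without touching the file system. This is a sketch under assumptions: the tab-separated sample text is invented, and an in-memory io.StringIO object stands in for fname.

```python
import io

import pandas as pd

# Hypothetical tab-separated input with three junk lines before the header.
raw = (
    "junk line 1\n"
    "junk line 2\n"
    "junk line 3\n"
    "a\tb\tmessage\n"
    "1\t2\thello\n"
    "3\t\tworld\n"
)

# sep='\t' changes the delimiter; skiprows=3 skips the junk lines.
df = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=3)
print(df)

# With no path given, to_csv returns the text instead of writing a file.
# Use '|' as the delimiter and write missing values as NULL.
out = df.to_csv(sep="|", na_rep="NULL", index=False)
print(out)
```

The empty field in the last record is read as a missing value, which is why column b becomes floating point and the output shows NULL in its place.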


JavaScript Object Notation (JSON)

• A format for web data
• Looks very similar to python dictionaries and lists
• Example:

    {"name": "Wes",
     "places_lived": ["United States", "Spain", "Germany"],
     "pet": null,
     "siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"},
                  {"name": "Katie", "age": 33, "pet": "Cisco"}]
    }

• Only contains literals (no variables) but allows null
• Values: strings, arrays, dictionaries, numbers, booleans, or null
  - Dictionary keys must be strings
  - Quotation marks help differentiate string or numeric values
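The mapping between JSON and Python values can be checked with the standard json module. The sketch below parses the example record from this slide; the round_trip variable is an extra illustration of the rule that dictionary keys must be strings.

```python
import json

# The example record from the slide, as a JSON string.
text = """
{"name": "Wes",
 "places_lived": ["United States", "Spain", "Germany"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"},
              {"name": "Katie", "age": 33, "pet": "Cisco"}]}
"""

obj = json.loads(text)
print(type(obj))                   # a JSON object becomes a dict
print(obj["pet"])                  # JSON null becomes None
print(obj["siblings"][0]["age"])   # unquoted numbers stay numeric

# Non-string dictionary keys are converted to strings when writing JSON.
round_trip = json.loads(json.dumps({1: "one"}))
print(round_trip)
```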


eXtensible Markup Language (XML)

• Older, self-describing format with nesting
• Each field has tags
• Example:

    <INDICATOR>
      <INDICATOR_SEQ>373889</INDICATOR_SEQ>
      <PARENT_SEQ></PARENT_SEQ>
      <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
      <INDICATOR_NAME>Escalator Avail.</INDICATOR_NAME>
      <PERIOD_YEAR>2011</PERIOD_YEAR>
      <PERIOD_MONTH>12</PERIOD_MONTH>
      <CATEGORY>Service Indicators</CATEGORY>
      <FREQUENCY>M</FREQUENCY>
      <YTD_TARGET>97.00</YTD_TARGET>
    </INDICATOR>

• Top element is the root


Reading and Writing JSON with pandas

• pd.read_json(<filename>, orient=<orientation>)
• df.to_json(<filename>, orient=<orientation>)
• Possible JSON orientations
  - split: dict like {index -> [index], columns -> [columns], data -> [values]}
  - records: list like [{column -> value}, ... , {column -> value}]
  - index: dict like {index -> {column -> value}}
  - columns: dict like {column -> {index -> value}}
  - values: just the values array
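To see what each orientation looks like, the same small DataFrame can be serialized five ways. This is an illustrative sketch: the frame and its labels are made up, and json.loads is used only to turn each result back into Python objects for easy comparison.

```python
import json

import pandas as pd

# Hypothetical two-by-two frame with a labeled index.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

# Serialize the same frame under each orientation and parse the result back.
layouts = {
    orient: json.loads(df.to_json(orient=orient))
    for orient in ["split", "records", "index", "columns", "values"]
}

for orient, layout in layouts.items():
    print(orient, layout)
```

Note that records and values drop the index labels, so those two orientations cannot reproduce the original index when read back.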


Assignment 4 and Test 2

• Assignment 4:
  - Out soon
  - Similar analysis as A3 but using pandas
• Test 2:
  - Currently scheduled for Nov. 14
  - Interest in Nov. 16?
  - Focus on topics from Test 1 through this week
  - Still requires understanding of topics from rest of the course


What if your JSON/XML doesn't match a specific orientation/format?


Write your own code to create the DataFrame

• Use json library to read in the data and organize the pieces as needed
• Create the DataFrame from a list of dictionaries, etc.
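One way the two steps above might look in practice is sketched here. The nested record is a shortened, hypothetical version of the earlier JSON example, and the sibling_of column is an invented name used to show how a value from the outer object can be attached to each row.

```python
import json

import pandas as pd

# Hypothetical nested JSON that matches none of the pandas orientations.
text = """
{"name": "Wes",
 "siblings": [{"name": "Scott", "age": 25, "pet": "Zuko"},
              {"name": "Katie", "age": 33, "pet": "Cisco"}]}
"""
obj = json.loads(text)

# Organize the pieces: one dictionary per row, keeping only the wanted fields.
records = [
    {"name": s["name"], "age": s["age"], "sibling_of": obj["name"]}
    for s in obj["siblings"]
]

# Create the DataFrame from the list of dictionaries.
siblings = pd.DataFrame(records, columns=["name", "age", "sibling_of"])
print(siblings)
```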


XML

• No built-in method in pandas at the time of this lecture (pandas 1.3 and later add pd.read_xml)
• Use lxml library (also can use ElementTree)

    from lxml import objectify
    import pandas as pd

    path = 'datasets/mta_perf/Performance_MNR.xml'
    parsed = objectify.parse(open(path))
    root = parsed.getroot()

    data = []
    skip_fields = ['PARENT_SEQ', 'INDICATOR_SEQ',
                   'DESIRED_CHANGE', 'DECIMAL_PLACES']

    # Build one dictionary per INDICATOR element
    for elt in root.INDICATOR:
        el_data = {}
        for child in elt.getchildren():
            if child.tag in skip_fields:
                continue
            el_data[child.tag] = child.pyval
        data.append(el_data)

    perf = pd.DataFrame(data)
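The slide mentions ElementTree as an alternative to lxml. A rough equivalent using the standard library's xml.etree.ElementTree is sketched below. The PERFORMANCE root element is a made-up wrapper, the record is a shortened copy of the earlier XML example, and the data is parsed from a string rather than from the Performance_MNR.xml file.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: an invented root element wrapping one INDICATOR record.
text = """
<PERFORMANCE>
  <INDICATOR>
    <INDICATOR_SEQ>373889</INDICATOR_SEQ>
    <PARENT_SEQ></PARENT_SEQ>
    <AGENCY_NAME>Metro-North Railroad</AGENCY_NAME>
    <PERIOD_YEAR>2011</PERIOD_YEAR>
    <YTD_TARGET>97.00</YTD_TARGET>
  </INDICATOR>
</PERFORMANCE>
"""
root = ET.fromstring(text)
skip_fields = ["PARENT_SEQ", "INDICATOR_SEQ"]

data = []
for elt in root.findall("INDICATOR"):
    # ElementTree exposes only text, so every value here is a string (or None).
    el_data = {child.tag: child.text for child in elt
               if child.tag not in skip_fields}
    data.append(el_data)

print(data)
```

Unlike lxml's pyval, ElementTree does no type conversion, so numeric columns would still need converting after the list of dictionaries is passed to pd.DataFrame.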



Binary Formats

• CSV, JSON, and XML are all text formats
• What is a binary format?
• Pickle: Python's built-in serialization
• HDF5: Library for storing large scientific data
  - Hierarchical Data Format
  - Interfaces in C, Java, MATLAB, etc.
  - Supports compression
  - Use pd.HDFStore to access
  - Shortcuts: read_hdf/to_hdf, need to specify object
• Excel: need to specify sheet when a spreadsheet has multiple sheets
  - pd.ExcelFile or pd.read_excel
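As a small illustration of a binary format, the sketch below round-trips a Python object through pickle. The record is hypothetical; the point is that the output is bytes rather than text and that the Python types survive the trip.

```python
import pickle

# Hypothetical record mixing several Python types.
record = {"name": "Wes", "scores": [1.5, 2.5], "pet": None, "ids": (1, 2)}

# dumps produces bytes, not text; the types are stored in the format itself.
blob = pickle.dumps(record)
restored = pickle.loads(blob)

print(type(blob))
print(restored == record)
```

Pickle can execute code while loading, so it should only be used on data from a trusted source.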


Databases


[Wikipedia]

https://en.wikipedia.org/wiki/File:Star-schema-example.png

Databases

• Relational databases are similar to multiple data frames but have many more features
  - links between tables via foreign keys
  - SQL to create, store, and query data
• sqlite3 is a simple database with built-in support in python
• Python has a database API which lets you access most database systems through a common API.


Python DBAPI Example

    import sqlite3

    query = """CREATE TABLE test(a VARCHAR(20), b VARCHAR(20),
                                 c REAL, d INTEGER);"""
    con = sqlite3.connect('mydata.sqlite')
    con.execute(query)
    con.commit()

    # Insert some data
    data = [('Atlanta', 'Georgia', 1.25, 6),
            ('Tallahassee', 'Florida', 2.6, 3),
            ('Sacramento', 'California', 1.7, 5)]
    stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
    con.executemany(stmt, data)
    con.commit()
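Reading the rows back uses the same connection object. The sketch below repeats the table setup so that it runs on its own, and it uses an in-memory database (':memory:') instead of mydata.sqlite so that it can be rerun without a "table already exists" error. The WHERE clause and its threshold are illustrative.

```python
import sqlite3

# ':memory:' creates a throwaway in-memory database instead of mydata.sqlite.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE test(a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER);")

data = [("Atlanta", "Georgia", 1.25, 6),
        ("Tallahassee", "Florida", 2.6, 3),
        ("Sacramento", "California", 1.7, 5)]
con.executemany("INSERT INTO test VALUES(?, ?, ?, ?)", data)
con.commit()

# execute returns a cursor; the ? placeholder keeps values out of the SQL text.
cursor = con.execute("SELECT a, d FROM test WHERE d > ? ORDER BY d", (4,))
rows = cursor.fetchall()  # a list of tuples, one per record

# cursor.description holds one tuple per column; the column name comes first.
columns = [col[0] for col in cursor.description]
con.close()

print(columns, rows)
```

The rows and column names together are enough to build a DataFrame by hand, which is the step that pd.read_sql automates.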



Databases

• Similar syntax from other database systems (MySQL, Microsoft SQL Server, Oracle, etc.)
• SQLAlchemy: Python package that abstracts away differences between different database systems
• SQLAlchemy gives support for reading queries to data frame:

    import pandas as pd
    import sqlalchemy as sqla

    db = sqla.create_engine('sqlite:///mydata.sqlite')
    pd.read_sql('select * from test', db)



Pandas Analysis

• Lots of easy analysis:
  - df.describe()
  - df["column"].sum()
• Can plot the data from pandas:
  - df.plot.scatter(x="price", y="numSold")
• Can pass data to machine learning tools like scikit-learn


… but what if data isn't correct/trustworthy/in the right format?


Dirty Data


[Flickr]

http://farm3.static.flickr.com/2558/3717487523_f197ac2fbf.jpg

LA Crime Map: Geocoding Error


[LA Crime Map (2009), via CNET]

https://www.cnet.com/news/geocoding-error-distorts-l-a-crime-statistics/

Numeric Outliers

Adapted from Joe Hellerstein’s 2012 CS 194 Guest Lecture

Numeric Outliers


[J. Hellerstein via J. Canny et al.]

https://bcourses.berkeley.edu/files/50707513/download?download_frd=1&verifier=njoObzWKAmeihDjqFN9EMrY0IRlDbUWy2mFegnXN

FINDINGS

we got about the future of the data science, the most salient takeaway was how excited our respondents were about the evolution of the field. They cited things in their own practice, how they saw their jobs getting more interesting and less repetitive, all while expressing a real and broad enthusiasm about the value of the work in their organization.

As data science becomes more commonplace and simultaneously a bit demystified, we expect this trend to continue as well. After all, last year’s respondents were just as excited about their work (about 79% were “satisfied” or better).

[Chart: Data scientist job satisfaction]

How a Data Scientist Spends Their Day

Here’s where the popular view of data scientists diverges pretty significantly from reality. Generally, we think of data scientists building algorithms, exploring data, and doing predictive analysis. That’s actually not what they spend most of their time doing, however.

[Chart: What data scientists spend the most time doing]

• Cleaning and organizing data: 60%
• Collecting data sets: 19%
• Mining data for patterns: 9%
• Other: 5%
• Refining algorithms: 4%
• Building training sets: 3%

As you can see from the chart above, 3 out of every 5 data scientists we surveyed actually spend the most time cleaning and organizing data. You may have heard this referred to as “data wrangling” or compared to digital janitor work. Everything from list verification to removing commas to debugging databases–that time adds up and it adds up immensely. Messy data is by far the more time-consuming aspect of the typical data scientist’s work flow. And nearly 60% said they simply spent too much time doing it.

This takes a lot of time!


[CrowdFlower Data Science Report, 2016]

http://visit.crowdflower.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf

FINDINGS

Why That’s a Problem

Simply put, data wrangling isn’t fun. It takes forever. In fact, a few years back, the New York Times estimated that up to 80% of a data scientist’s time is spent doing this sort of work.

Here, it’s necessary to point out that data cleaning is incredibly important. You can’t do the sort of work data scientists truly enjoy doing with messy data. It needs to be cleaned, labeled, and enriched before you can trust the output.

The problem here is two fold. One: data scientists simply don’t like doing this kind of work, and, as mentioned, this kind of work takes up most of their time. We asked our respondents what was the least enjoyable part of their job.

They had this to say:

[Chart: What’s the least enjoyable part of data science?]

• Cleaning and organizing data: 57%
• Collecting data sets: 21%
• Building training sets: 10%
• Other: 5%
• Refining algorithms: 4%
• Mining data for patterns: 3%

Note how those last two charts mirror each other. The things data scientists do most are the things they enjoy least. Last year, we found that respondents far prefer doing the more creative, interesting parts of their job, things like predictive analysis and mining data for patterns. That’s where the real value comes. But again, you simply can’t do that work unless the data is properly labeled. And nobody likes labeling data.

Do Data Scientists Have What They Need?

With a shortage of data scientists out there in the world, we wanted to find out if they thought they were properly supported in their job. After all, when you need more data scientists, you’ll often find a single person doing the work of several.

…and it isn't the most fun thing to do


[CrowdFlower Data Science Report, 2016]

http://visit.crowdflower.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf

Dirty Data: Statistician's View

• Some process produces the data
• Want a model but have non-ideal samples:
  - Distortion: some samples corrupted by a process
  - Selection bias: likelihood of a sample depends on its value
  - Left and right censorship: users come and go from scrutiny
  - Dependence: samples are not independent (e.g. social networks)
• You can add/augment models for different problems, but cannot model everything
• Trade-off between accuracy and simplicity


[J. Canny et al.]


Dirty Data: Database Expert's View

• Got a dataset
• Some values are missing, corrupted, wrong, duplicated
• Results are absolute (relational model)
• Better answers come from improving the quality of values in the dataset


[J. Canny et al.]


Dirty Data: Domain Expert's View

• Data doesn't look right
• Answer doesn't look right
• What happened?
• Domain experts carry an implicit model of the data they test against
• You don't always need to be a domain expert to do this
  - Can a person run 50 miles an hour?
  - Can a mountain on Earth be 50,000 feet above sea level?
  - Use common sense


[J. Canny et al.]


Dirty Data: Data Scientist's View

• Combination of the previous three views
• All of the views present problems with the data
• The goal may dictate the solutions:
  - Median value: don't worry too much about crazy outliers
  - Generally, aggregation is less susceptible to numeric errors
  - Be careful, the data may be correct…
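The point about the median can be checked with a few numbers. In this sketch the running speeds are invented, with one obviously wrong entry of the kind the domain expert's view would flag.

```python
import statistics

# Hypothetical running speeds in miles per hour; 500.0 is a data-entry error.
speeds = [6.1, 5.8, 7.2, 6.5, 500.0]

mean_speed = statistics.mean(speeds)      # pulled far upward by the outlier
median_speed = statistics.median(speeds)  # the middle value is unaffected

print(mean_speed, median_speed)
```

If the goal is the median, the bad value can be left alone; if the goal is the mean, it has to be dealt with first.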


[J. Canny et al.]


Be careful how you detect dirty data

• The appearance of a hole in the earth’s ozone layer over Antarctica, first detected in 1976, was so unexpected that scientists didn’t pay attention to what their instruments were telling them; they thought their instruments were malfunctioning.

  – National Center for Atmospheric Research


[Wikimedia]

https://commons.wikimedia.org/wiki/File:Agujero_en_la_capa_de_ozono_2008.jpg

Where does dirty data originate?

• Source data is bad, e.g. person entered it incorrectly
• Transformations corrupt the data, e.g. certain values processed incorrectly due to a software bug
• Integration of different datasets causes problems
• Error propagation: one error is magnified


[J. Canny et al.]


Types of Dirty Data Problems

• Separator Issues: e.g. CSV without respecting double quotes
• Naming Conventions: NYC vs. New York
• Missing required fields, e.g. key
• Different representations: 2 vs. two
• Truncated data: "Janice Keihanaikukauakahihuliheekahaunaele" becomes "Janice Keihanaikukauakahihuliheek" on Hawaii license
• Redundant records: may be exactly the same or have some overlap
• Formatting issues: 2017-11-07 vs. 07/11/2017 vs. 11/07/2017
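The formatting issue in the last bullet is easy to reproduce. The sketch below parses one of the date strings from the slide under two different assumptions about field order, using the standard datetime module.

```python
from datetime import datetime

raw = "07/11/2017"

# The same string parses to two different dates depending on the assumed order.
day_first = datetime.strptime(raw, "%d/%m/%Y").date()
month_first = datetime.strptime(raw, "%m/%d/%Y").date()

print(day_first.isoformat(), month_first.isoformat())
```

Both calls succeed without any error, so this kind of problem has to be caught by knowing the source's convention rather than by waiting for a parsing failure.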


[J. Canny et al.]


Data Wrangling

• Data wrangling: transform raw data to a more meaningful format that can be better analyzed
• Data cleaning: getting rid of inaccurate data
• Data transformations: changing the data from one representation to another
• Data reshaping: reorganizing the data
• Data merging: combining two datasets


Data Cleaning


Handling Missing Data

• Filtering out missing data:
  - Can choose rows or columns
• Filling in missing data:
  - with a default value
  - with an interpolated value
• In pandas:



In [10]: string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [11]: string_data
Out[11]:
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [12]: string_data.isnull()
Out[12]:
0    False
1    False
2     True
3    False
dtype: bool

In pandas, we’ve adopted a convention used in the R programming language by referring to missing data as NA, which stands for not available. In statistics applications, NA data may either be data that does not exist or that exists but was not observed (through problems with data collection, for example). When cleaning up data for analysis, it is often important to do analysis on the missing data itself to identify data collection problems or potential biases in the data caused by missing data.

The built-in Python None value is also treated as NA in object arrays:

In [13]: string_data[0] = None

In [14]: string_data.isnull()
Out[14]:
0     True
1    False
2     True
3    False
dtype: bool

There is work ongoing in the pandas project to improve the internal details of how missing data is handled, but the user API functions, like pandas.isnull, abstract away many of the annoying details. See Table 7-1 for a list of some functions related to missing data handling.

Table 7-1. NA handling methods

Method    Description
dropna    Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
fillna    Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
isnull    Return boolean values indicating which values are missing/NA.
notnull   Negation of isnull.
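The filtering and filling options listed above might be used as in the following sketch. The Series and DataFrame values are invented, and interpolate is included as one way to fill "with an interpolated value".

```python
import numpy as np
import pandas as pd

# Hypothetical numeric series with two missing values.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

dropped = s.dropna()            # filter out missing values
filled = s.fillna(0)            # fill with a default value
interpolated = s.interpolate()  # fill with linearly interpolated values

# For a DataFrame, filtering can work on rows or on columns.
df = pd.DataFrame({"a": [1.0, np.nan], "b": [2.0, 3.0]})
rows_kept = df.dropna()         # drop rows containing any NA
cols_kept = df.dropna(axis=1)   # drop columns containing any NA

print(interpolated)
print(rows_kept)
print(cols_kept)
```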


Filling in missing data

• fillna arguments:

Table 7-2. fillna function arguments

Argument   Description
value      Scalar value or dict-like object to use to fill missing values
method     Interpolation; by default 'ffill' if function called with no other arguments
axis       Axis to fill on; default axis=0
inplace    Modify the calling object without producing a copy
limit      For forward and backward filling, maximum number of consecutive periods to fill
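Two of the arguments in Table 7-2 are sketched below on an invented frame: a dict-like value, and limit combined with forward filling. The forward fill is written with the ffill method rather than fillna(method='ffill'), because recent pandas releases deprecate the method argument of fillna; treat that substitution as a choice made for this sketch rather than part of the table.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with runs of missing values in both columns.
df = pd.DataFrame({"a": [1.0, np.nan, np.nan, 4.0],
                   "b": [np.nan, 2.0, np.nan, np.nan]})

# value as a dict: a different fill value for each column.
by_column = df.fillna({"a": 0.0, "b": -1.0})

# Forward fill, but fill at most one consecutive missing value per gap.
forward = df.ffill(limit=1)

print(by_column)
print(forward)
```

Forward filling cannot fill a missing value at the start of a column, since there is no earlier value to carry forward.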

7.2 Data Transformation

So far in this chapter we’ve been concerned with rearranging data. Filtering, cleaning, and other transformations are another class of important operations.

Removing Duplicates

Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example:

In [45]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
   ....:                      'k2': [1, 1, 2, 3, 3, 4, 4]})

In [46]: data
Out[46]:
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4

The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not:

In [47]: data.duplicated()
Out[47]:
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Relatedly, drop_duplicates returns a DataFrame where the duplicated array is False:
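A short sketch of drop_duplicates on the same data follows. The subset and keep arguments are part of the pandas method; the variable names are chosen here for illustration.

```python
import pandas as pd

data = pd.DataFrame({"k1": ["one", "two"] * 3 + ["two"],
                     "k2": [1, 1, 2, 3, 3, 4, 4]})

# Drop rows that repeat an earlier row in every column.
deduped = data.drop_duplicates()

# Consider only column k1 when deciding what counts as a duplicate.
by_k1 = data.drop_duplicates(subset=["k1"])

# Keep the last occurrence of each duplicated row instead of the first.
keep_last = data.drop_duplicates(keep="last")

print(deduped)
print(by_k1)
print(keep_last)
```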
