cis 602-01: scalable data analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists...

62
CIS 602-01: Scalable Data Analysis Python Dr. David Koop D. Koop, CIS 602-01, Fall 2017

Upload: others

Post on 29-Jun-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

CIS 602-01: Scalable Data Analysis

Python Dr. David Koop

D. Koop, CIS 602-01, Fall 2017

Page 2: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Obtaining Python (and Jupyter Notebook)• https://www.anaconda.com/download/ • Anaconda comes with Jupyter Notebook • Anaconda Navigator

- GUI application for managing Python environment - Should be installed with new Anaconda installations - Has a shortcut to start Jupyter Notebook - Allows you to add install packages (libraries)

2D. Koop, CIS 602-01, Fall 2017

Page 3: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

NYC Taxi Data

3D. Koop, CIS 602-01, Fall 2017

[Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance, T. W. Schneider]

Page 4: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Fig. 2. Taxis as sensors of city life. The plot on the top shows how the number of trips varies over 2011 and 2012. While some patterns are regularand appear on both years, some anomalies are clear, e.g., the drops in August 2011 (Hurricane Irene) and in October 2012 (Hurricane Sandy). Inthe bottom, we show pickups (blue) and dropoffs (orange) in Manhattan on May 1st from 7am to 11am. Notice that from 8-10am, there are virtuallyno trips along 6th Avenue, indicating the traffic was blocked.

over 2011 and 2012. There is a lot of regularity: the plot lines are verysimilar for the two years. For example, on Thanksgiving, Christmasand New Year’s eve, there is a substantial drop in the number of trips.But the plot also shows some anomalies. There are big drops in Au-gust 2011 and October 2012, which correspond to hurricanes Irene andSandy, respectively. Looking at the data at a finer grain, other inter-esting patterns emerge. The maps in Fig. 2 show the density of taxisacross Manhattan from 7am to 11am, on May 1st, 2011. From 8amto 10am, taxis disappear along 6th avenue, from Midtown to Down-town; and then, at 10am they reappear. As it turns out, during thisperiod, the streets were closed to traffic for the Five Boro Bike Tour.1As we discuss later, other useful information can be discovered byanalyzing these data, from popular night spots and economically dis-advantaged neighborhoods that are underserved by taxis, to mobilitypatterns across regions at different times and days (see Fig. 1).

Like many urban data sets, taxi trips contain geographical and tem-poral components. In addition, they encode information about move-ment: a trip is associated with pickup and dropoff locations and times.A trip also contains other attributes including the taxi id, distance trav-eled, fare and tip amount, which enable, for example, the study of theeconomics of fare structure and optimal fleet size. Not surprisingly,exploring these data is challenging. We have carried out interviewswith social scientists and engineers that have used this data set in theirresearch to better understand their needs. Their analyses can be com-plex and have been greatly limited by the size of the data. Tools thatare commonly used, including R, MatLab, Stata, ArcGIS and Excel,cannot handle large data sets. This prevents scientists from analyz-ing the whole data. Instead, they first select small slices and then loadthem into these tools for analysis. This process is both tedious and timeconsuming. Furthermore, while these tools support complex analysis,users must be familiar with the underlying languages. For example, inArcGIS, users have to construct SQL-like queries, a task that is out ofreach for scientists without database training.

To address these limitations, in this paper, we propose a new vi-sual query model that supports complex spatio-temporal queries overorigin-destination (OD) data. Users need not be experts in any textual

1http://www.nycbikemaps.com/spokes/five-boro-bike-tour-sunday-may-1st-2011

query language: they can directly query the data using visual opera-tions. We show that this model supports a wide range of queries, andin particular, the three classes of queries in Peuquet’s typology [29]:identify a set of objects at a given location and time; given a time anda set of objects, describe the locations occupied by the objects; anddescribe the times a set of objects occupied a given set of locations.This model also supports origin-destination queries that are needed toexplore taxi movement. While visual languages have been proposedfor spatio-temporal data and moving objects, they targeted a differentkind of data—continuous data (i.e., complete trajectories), and madeuse of a pictorial language [18, 12]. Instead of requiring users to sketchqueries, our model allows them to directly (and visually) manipulatethe data. The ability to specify queries using graphical widgets andvisualize their results was in part inspired by the seminal work byAhlberg et al. [2] on dynamic queries. Our focus, however, is dif-ferent: we aim to support the exploration of large, spatio-temporalOD data, and provide visualization services that are both usable andefficient. Another important feature of our model is that each queryis associated with a set of trips. As a result, not only can queries becomposed and refined, but also query results can be aggregated anddifferent visual representations can be applied while still maintainingthe spatial and temporal contexts. Query composition also enables theuse of cross-filtered views [37], which is key in our model to supportthe query classes in Peuquet’s typology.

We have built TaxiVis, an analysis environment that implementsthis model. It combines a number of interaction capabilities that enableusers to pose queries over all the dimensions of the data and flexiblyexplore the attributes associated with the taxi trips. Another importantfeature of the system is the ability to compare spatio-temporal slicesthrough multiple coordinated views. Users can interactively composeand refine queries as well as generalize them by performing parametersweeps. To deal with scalability, the system implements a number ofstrategies to support interactive response times and the rendering of alarge number of graphical primitives on a map. As discussed in Sec. 5,these include efficient data storage, and the use of adaptive level-of-detail rendering to provide clutter-free visualization of the results.

We demonstrate the usefulness of our system through a series ofcase studies motivated by traffic engineers and economists whoseneeds have driven our design. These case studies show how our model

NYC Taxi Data: Day analysis

4D. Koop, CIS 602-01, Fall 2017

[Ferreira et al., 2013]

Page 5: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

++

--

Leaflet | Map data © OpenStreetMap contributors

Marine Traffic Data

5D. Koop, CIS 602-01, Fall 2017

Page 6: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Fig. 9: Ball and fielder trajectories vary for hits that arise from differ-ent pitch types.

on hitter characteristics, pitch types, and weather conditions. WithBaseball4D, we show that full-field tracking presents opportunities tocontinue both aggregation and filtering for other parts of the game,including hit trajectories, fielding, and baserunning. In addition, thisfiltering can be extended to utilize pitch information as well as theother game characteristics.

One feature that Baseball4D provides is the opportunity to selec-tively view and analyze multiple plays at once. Coaches and playersoften watch video to improve their own play or scout their opponents.For example, an analyst might view a set of ground balls hit to theshortstop and notice that when he moves to his left, his throws to firstbase tend to more accurate than when he moves to his right. We couldalso time-shift plays so that we could see balls as they are caught toevaluate the different trajectories and speeds. Figure 1 (right) showshow Baseball4D can present both multiple hits by the same batter andmultiple plays by the same fielder.

As with pitch data, we can aggregate all plays to generate heatmapsof hit ball trajectories. The heat map shown in Figure 7, for example,shows all hits and suggests that the 3rd baseman and the shortstop fieldmost of the hits, perhaps suggesting a larger number of right-handed atbats. However, further investigations might combine this aggregationwith filtering to confirm differences between left- and right-handedbatters or differences between pitchers. Such selection allows morefine-grained analysis and presents the opportunity to find new correla-tions between aspects of the game.

One interesting direction that uses filtering is to combine pitch char-acteristics with hit ball trajectories and fielder trajectories. If a man-ager wants to induce a double-play, for example, he may signal in aparticular type of pitch. Analysis of the hit trajectories of that type of

Fig. 10: A data-based partition of which regions a particular fielder isresponsible for. Each field point is assigned to the player that made themost catches in that location during the 2013 season. The availabilityof actual fielding data may help to determine the regions each playeris responsible for.

pitch might enhance the understanding of how effective such strate-gies are. Note that a successful double-play depends on the quality ofthe pitch, the hitter’s decision, the ball trajectory, the fielders’ initialpositions, the baserunners’ movements, and the execution of all of thefielders’ throws. Thus, understanding hit ball trajectories and fieldermovements for different pitch types (see Figure 9) presents a step to-ward improving our understanding of such complicated plays. In thefigures, heat maps show the outcomes of different pitch types, andspray charts show the trajectories of the balls and fielders. Both visu-alizations may be also filtered by other gameplay attributes includingthe pitcher name, the batter name, the game, inning, number of outs,pitch result, and pre-pitch balls and strikes.

6.3 True Defensive Range: Expanding Defensive MetricsDespite the large number of statistics that allow in-depth analysis ofpitching and hitting, it has been much more difficult to evaluate perfor-mance once the ball is in play. Often, the number of errors is the onlymeasure available to compare fielding performance. Perhaps someevaluations can be made based on number of outs versus times theball was hit to a particular region the player was responsible for, butthese statistics often lack information about where the ball landed orif the player was positioned according to a particular strategy (e.g., totake away potential extra-base hits). With the reconstruction of entiregameplays, we can compute defensive metrics that evaluate a player’sfielding performance. Not only can we see where the ball is hit to butalso the speed with which a fielder reacts.

The measurement of defensive skills has been significantly im-proved in recent years. Defensive metrics can be derived from bothdefensive events (putout, assists, errors, total chances) and fielding in-formation. The latter has proven to be more a reliable indicator of thefielding ability, but suffers from the lack of accurate data on battedballs (this data is only available since 1989) and players’ positions.The early tracking approaches (such as the zone rating of STATS Inc.and Baseball Info Solutions) were based on a discretization of the fieldinto zones. The zones helped the reporters (zone rating operators) tovisually determine where the ball landed and what player was sup-posed to field it (zones were assigned to specific players). The assign-ment of zones to specific fielders was put to discussion later, however,and the new data shows that players’ zones may overlap significantlyon the field (see, e.g., Figure 10).

With new tracking abilities, we can better track fielding perfor-mance, and this has already led to more discussion and new metricsabout defensive skills. Greg Rybarczyk has proposed a measure of

Baseball Data

6D. Koop, CIS 602-01, Fall 2017

[Deitrich et al., 2014]

Page 7: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Data Science Venn Diagram

7D. Koop, CIS 602-01, Fall 2017

[D. Conway, The Data Science Venn Diagram, 2013]

Page 8: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Scalability• “Big Data”

- What is “big”? For whom is it “big”? - variety, velocity, volume, …

• Lots of data that was big is not an issue now • Understanding the scalability of techniques is important • There will always be larger datasets, want to understand

- how methods scale - performance bounds - storage constraints

8D. Koop, CIS 602-01, Fall 2017

Page 9: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

About this course• Course web page is authoritative:

- http://www.cis.umassd.edu/~dkoop/cis602-2017fa - Schedule, Readings, Assignments will be posted online - Check the web site before emailing me

• Topics course - A current research area the professor works in - A chance to be on the “cutting edge” of research

• Requires your participation - Reading responses - Project presentations

9D. Koop, CIS 602-01, Fall 2017

Page 10: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

About this course• Balance of techniques and research ideas • Some background (Python) followed by topic areas and readings • Assignments at the beginning of course, project at end • Two tests • Topic areas:

- Exploratory Data Analysis and Visualization - Data Acquisition - Data Storage and Access - Cloud Computing and Scalable Computation - Applications and specific data considerations

• Schedule: Move cloud computing earlier…

10D. Koop, CIS 602-01, Fall 2017

Page 11: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

About this course• Course Registration:

- Make sure you have registered in COIN for the course - Add/drop deadline is today

• Review of course policies: - Plagiarism and academic honesty - If you have any concerns or questions, please email me as soon

as possible • If you are not sure if this course is a good fit, please email me or talk

to me

11D. Koop, CIS 602-01, Fall 2017

Page 12: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Dataset Types

12D. Koop, CIS 602-01, Fall 2017

[Munzner (ill. Maguire), 2014]

Tables

Attributes (columns)

Items (rows)

Cell containing value

Networks

Link

Node (item)

Trees

Fields (Continuous)

Attributes (columns)

Value in cell

Cell

Multidimensional Table

Value in cell

Grid of positions

Geometry (Spatial)

Position

Dataset Types

Page 13: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Attribute Types

13D. Koop, CIS 602-01, Fall 2017

[Munzner (ill. Maguire), 2014]

Attributes

Attribute Types

Ordering Direction

Categorical Ordered

Ordinal Quantitative

Sequential Diverging Cyclic

Page 14: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

241 = Quantitative2 = Nominal3 = Ordinal

quantitative ordinal categorical

Categorial, Ordinal, and Quantitative

14D. Koop, CIS 602-01, Fall 2017

Page 15: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Background Reading• Listed on course schedule:

- Challenges and Opportunities with Big Data, D. Agrawal et al. - Toward Scalable Systems for Big Data Analysis: A Technology Tutorial,

H. Hu et al. - Big Data computing and clouds: Trends and future directions,

M. D. Assuncao et al.

15D. Koop, CIS 602-01, Fall 2017

Page 16: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Python• Started in December 1989 by Guido van Rossum • “Python has surpassed Java as the top language used to introduce

U.S. students to programming…” (ComputerWorld, 2014) • Python and R are the two top languages for data science • High-level, interpreted language • Supports multiple paradigms (OOP, procedural, imperative) • Help programmers write readable code • Use less code to do more • Lots of libraries for python

- Designed to be extensible - Easy to wrap code from other languages like C/C++

• Open-source with a large, passionate community

16D. Koop, CIS 602-01, Fall 2017

Page 17: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Python Compared to Java• Dynamic Typing

- A variable does not have a fixed type - Example: a = 1; a = “abc”

• Indentation - Braces define blocks in Java, good style is to indent but not

required - Indentation is critical in Python - Example:

z = 20 if x > 0: if y > 0: z = 100 else: z = 10

17D. Koop, CIS 602-01, Fall 2017

Page 18: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Learning Python Resources• https://software-carpentry.org/lessons.html • https://wiki.python.org/moin/BeginnersGuide • https://learnxinyminutes.com/docs/python3/ • http://www.pythontutor.com • http://www.python-course.eu • http://thepythonguru.com • https://wiki.python.org/moin/IntroductoryBooks • https://en.wikibooks.org/wiki/A_Beginner%27s_Python_Tutorial • https://learnpythonthehardway.org • learnpython.org

18D. Koop, CIS 602-01, Fall 2017

Page 19: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Obtaining Python (and Jupyter Notebook)• https://www.anaconda.com/download/ • Anaconda comes with Jupyter Notebook • Anaconda Navigator

- GUI application for managing Python environment - Should be installed with new Anaconda installations - Has a shortcut to start Jupyter Notebook - Allows you to add install packages (libraries)

19D. Koop, CIS 602-01, Fall 2017

Page 20: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Jupyter Notebook• Display rich representations and

text • Uses Web technology • Cell-based • Built-in editor • GitHub displays notebooks

20D. Koop, CIS 602-01, Fall 2017

[Jupyter]

Page 21: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Python 2 and 3• https://docs.python.org/3/whatsnew/3.0.html • Key Differences:

- print as a function: print "Hello" vs. print("Hello") - Views and iterators instead of lists - Integer divison: 5/2 = 2.5, 5//2 = 2 - Unicode as standard - String formatting:

• Py2: "Hello %s. You are %d years old" % (name, age) • Py3: "Hello {}. You are {} years old".format(name, age) • Py3.6: f"Hello {name}. You are {age} years old"

21D. Koop, CIS 602-01, Fall 2017

Page 22: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Exercise• Given variables x and y, print the long division answer of x divided

by y with the remainder. • Examples:

- x =11, y = 4 should print "2R3" - x = 15, y =2 should print "7R1"

22D. Koop, CIS 602-01, Fall 2017

Page 23: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Chicago Food Inspections Exploration• Data: https://data.cityofchicago.org/Health-Human-Services/Food-

Inspections/4ijn-s7e5/data - Google: "chicago food inspections data"

• Figure out how many locations failed inspection • Figure out which chain had the most failed inspections (percentage) • Figure out which location had the most inspections in 2017 • Count the occurrences of different types of violations in the data

23D. Koop, CIS 602-01, Fall 2017

Page 24: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Print and Comments• General print: print("Hello World") • Can also print variables:

name = "Jane" print("Hello,", name)

• In the notebook, putting a variable as the last line of a cell will display it

• Any text after # is a comment in python - Until the end of the line (just like // in Java) - Remember, can also have entire cells that are text in a notebook

24D. Koop, CIS 602-01, Fall 2017

Page 25: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Python Variables and Types• No type declaration necessary • Variables are names, not memory locations

a = 0 a = "abc" a = 3.14159

• Don't worry about types, but think about types • Strings are a type • Integers are as big as you want them • Floats can hold large numbers, too (double-precision)

25D. Koop, CIS 602-01, Fall 2017

Page 26: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Python Math and String "Math"• Standard Operators: +, -, *, /, % • Division "does what you want" (new in v3) -5 / 2 = 2.5

- 5 // 2 = 2 # use // for integer division • Shortcuts: +=, -=, *= • No ++, -- • Exponentiation (Power): ** • Order of operations and parentheses:

- 4 - 3 - 1 - 4 - (3 - 1)

• "abc" + "def"

• "abc" * 3

26D. Koop, CIS 602-01, Fall 2017

Page 27: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Exercise• Given variables x and y, print the long division answer of x divided

by y with the remainder. • Examples:

- x =11, y = 4 should print "2R3" - x = 15, y =2 should print "7R1"

27D. Koop, CIS 602-01, Fall 2017

Page 28: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Loops• while <condition>: <indented block> # end of while block (indentation done)

• Remember the colon! • a = 5 while a > 0: print(a) a -= 2

• a > 0 is the condition • Python has standard boolean operators (<, >, <=, >=, ==, !=)

- What does a boolean operation return? - Linking boolean comparisons (and, or)

28D. Koop, CIS 602-01, Fall 2017

Page 29: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Conditionals• if, else

- Again, indentation is required • elif

- Shorthand for else: if: • Same type of boolean expressions

29D. Koop, CIS 602-01, Fall 2017

Page 30: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

True and False• True and False (captialized) are defined values in Python • v == 0 will evaluate to either True or False

30D. Koop, CIS 602-01, Fall 2017

Page 31: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Functions• Calling functions is as expected:

mul(2,3) # computes 2*3 (mul from operator package) - Values passed to the function are parameters - May be variables! a = 5 b = 7 mul(a,b)

• print is a function print("This line doesn't end", end="") print("See it continues")

- end is also a parameter, but this has a different syntax - Keyword argument!

31D. Koop, CIS 602-01, Fall 2017

Page 32: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Defining Functions• def keyword • Arguments have names but no types

def hello(name): print(f"Hello {name}")

• Can have defaults: def hello(name="Jane Doe"): print(f"Hello {name}")

• With defaults, we can skip the parameter hello() or hello("John")

• Also can pick and choose arguments: def hello(name1="Joe", name2="Jane"): print(f"Hello {name1} and {name2}") hello(name2="Mary")

32D. Koop, CIS 602-01, Fall 2017

Page 33: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Return statement• Return statement gives back a value:

def mul(a,b): return a * b

• Variables changed in the function won't be updated: def increment(a): a += 1 return a b = 12 c = increment(b) print(b,c)

33D. Koop, CIS 602-01, Fall 2017

Page 34: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Python Containers• Container: store more than one value • Mutable versus immutable: Can we update the container?

- Yes → mutable - No → immutable - Lists are mutable, tuples are immutable

• Lists and tuples may contain values of different types: • List: [1,"abc",12.34] • Tuple: (1, "abc", 12.34) • You can also put functions in containers! • len function: number of items: len(l)

34D. Koop, CIS 602-01, Fall 2017

Page 35: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Indexing and Slicing• Just like with strings • Indexing:

- Where do we start counting? - Use brackets [] to retrieve one value - Can use negative values (count from the end)

• Slicing: - Use brackets plus a colon to retrieve multiple values:

[<start>:<end>] - Returns a new list (b = a[:]) - Don't need to specify the beginning or end - Can add a second colon to specify the increment

[<start>:<end>:<step>]

35D. Koop, CIS 602-01, Fall 2017

Page 36: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Examples• Suppose a = ['a', 'b', 'c', 'd'] • What are?

- a[0] - a[1:2] - a[3:] - a[:-2] - a[::-1]

36D. Koop, CIS 602-01, Fall 2017

Page 37: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Tuples• months = ('January','February','March','April','May','June','July','August','September','October','November','December')

• Useful when you know you're not going to change the contents or add or delete values

• Can index and slice • Also, can create new tuples from existing ones:

- t = (1,2,3) u = (4,5,6) v = t + u

37D. Koop, CIS 602-01, Fall 2017

Page 38: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Modifying Lists• Add to a list l:

- l.append(v): add one value (v) to the end of the list - l.extend(vlist): add multiple values (vlist) to the end of l - l.insert(i, v): add one value (v) at index i

• Remove from a list l: - del l[i]: deletes the value at index i - l.pop(i): removes the value at index i (and returns it) - l.remove(v): removes the first occurrence of value v (careful!)

• Changing an entry: - l[i] = v: changes the value at index i to v - Watch out for IndexError

38D. Koop, CIS 602-01, Fall 2017

Page 39: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Modifying Lists• Also can create new lists by "addition" as with tuples:

- a = [1,2,3]; b = [4,5,6] c = a + b

- Does not work: c = a + 4

39D. Koop, CIS 602-01, Fall 2017

Page 40: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

in: Checking for a value • The in operator:

- 'a' in l - 'a' not in l

• Not very fast for lists

40D. Koop, CIS 602-01, Fall 2017

Page 41: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

For loops• Used much more frequently than while loops • Is actually a "for-each" type of loop • In Java, this is:

- for (String item : someList) { System.out.println(item); }

• In Python, this is: - for item in someList: print(item)

• Grabs each element of someList in order and puts it into item • Be very careful about modifying the container when using it in a for

loop! (e.g. someList.append(new_item))

41D. Koop, CIS 602-01, Fall 2017

Page 42: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

What about counting?• In C++: • for(int i = 0; i < 100; i++) { cout << i << endl; }

• In Python: • for i in range(0,100): # or range(100) print(i)

• range(100) vs. list(range(100)) • What about only even integers?

42D. Koop, CIS 602-01, Fall 2017

Page 43: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

break and continue• break stops the execution of the loop • continue skips the rest of the loop and goes to the next iteration

43D. Koop, CIS 602-01, Fall 2017

Page 44: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Dictionaries• One of the most useful features of Python • Also known as associative arrays • Exist in other languages but a core feature in Python • Associate a key with a value • When I want to find a value, I give the dictionary a key, and it returns

the value • Example: InspectionID (key) → InspectionRecord (value) • Keys must be immutable (technically, hashable):

- Normal types like numbers, strings are fine - Tuples work, not lists (TypeError: unhashable type: 'list')

• There is only one value per key!

44D. Koop, CIS 602-01, Fall 2017

Page 45: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Dictionaries• Defining a dictionary: curly braces • states = {'MA': 'Massachusetts, 'RI': 'Road Island', 'CT': 'Connecticut'}

• Accessing a value: use brackets! • states['MA']

• Adding a value: • states['NH'] = 'New Hampshire'

• Checking for a key: • 'ME' in states → returns True or False • Removing a value: del states['CT'] • Changing a value: states['RI'] = 'Rhode Island'

45D. Koop, CIS 602-01, Fall 2017

Page 46: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Example: Counting Letters• Write code that takes a string s and creates a dictionary with that

counts how often each letter appears in s • count_letters("Mississippi") → {'s': 4, 'i': 4, 'p': 2', …}

46D. Koop, CIS 602-01, Fall 2017

Page 47: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Dictionaries• Combine dictionaries: d1.update(d2)

- update overwrites any key-value pairs in d1 when the same key appears in d2

• len(d) is the number of entries in d

47D. Koop, CIS 602-01, Fall 2017

Page 48: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Extracting Parts of a Dictionary• d.keys(): the keys only • d.values(): the values only • d.items(): key-value pairs as a collection of tuples: [(k1, v1), (k2, v2), …]

• Unpacking a tuple or list - t = (1,2) a, b = t

• Iterating through a dictionary: for (k,v) in d.items(): if k % 2 == 0: print(v)

• Important: keys, values, and items are not in any specific order!

48D. Koop, CIS 602-01, Fall 2017

Page 49: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Example: Counting Letters• Write code that takes a string s and creates a dictionary with that

counts how often each letter appears in s • count_letters("Mississippi") → {'s': 4, 'i': 4, 'p': 2', …}

49D. Koop, CIS 602-01, Fall 2017

Page 50: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Example: Counting Letters• def count_letters(s): letter_counts = {} for ch in s: if ch not in letter_counts: letter_counts[ch] = 1 else: letter_counts[ch] += 1

• Cases? s.upper, s.lower • get method

50D. Koop, CIS 602-01, Fall 2017

Page 51: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Sets• Sets are like dictionaries but without any values: • s = {'MA', 'RI', 'CT', 'NH'}

• {} is an empty dictionary, set() is an empty set • Adding values: s.add('ME') • Removing values: s.discard('CT')

51D. Koop, CIS 602-01, Fall 2017

Page 52: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Nesting Containers• Can have lists inside of lists, tuples inside of tuples, dictionaries

inside of dictionaries • Can also have dictionaries inside of lists, tuples inside of

dictionaries, … • d = {"Brady": [(2015, 4770, 36), (2014, 4109, 33)], "Luck": [(2015, 1881, 15), (2014, 4761, 40)], … }

• JavaScript Object Notation (JSON) looks very similar for literal values; Python allows variables in these types of structures

52D. Koop, CIS 602-01, Fall 2017

Page 53: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Nesting Code• Can have loops inside of loops, if statements inside of if statements • Careful with variable names: • l = {1: 3, 4: 5, 9: 12} for i in range(100): square = i ** 2 max_val = l[square] for i in range(max_val): print(i)

• Strange behavior, likely unintended, but Python won't complain!

53D. Koop, CIS 602-01, Fall 2017

Page 54: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

None• Like null in other languages • Used as a placeholder when no value exists • The value returned from a function that doesn't return a value

def f(name): print("Hello,", name) v = f("Patricia") # v will have the value None

• Also used when you need to create a new list or dictionary: def add_letters(s, d=None): if d is None: d = {} d.update(count_letters(s))

• Looks like d={} would make more sense, but that causes issues • None serves as a sentinel value in add_letters

54D. Koop, CIS 602-01, Fall 2017

Page 55: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

is and ==• == does a normal equality comparison • is checks to see if the object is the exact same object • Common style to write statements like if d is None: … • Weird behavior:

- a = 4 - 3 a is 1 # True

- a = 10 ** 3 a is 1000 # False

- a = 10 ** 3 a == 1000 # True

• Python caches common integer objects • Generally, avoid is unless writing is None

55D. Koop, CIS 602-01, Fall 2017

Page 56: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Objects• d = dict() # construct an empty dictionary object

• l = list() # construct an empty list object

• s = set() # construct an empty set object

• s = set([1,2,3,4]) # construct a set with 4 numbers

• Calling methods: - l.append('abc') - d.update({'a': 'b'}) - s.add(3)

• The method is tied to the object preceding the dot (e.g. append modifies l to add 'abc')

56D. Koop, CIS 602-01, Fall 2017

Page 57: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Python Modules• Python module: a file containing definitions and statements • Import statement: like Java, get a module that isn't a Python builtin

import collections d = collections.defaultdict(list) d[3].append(1)

• from…import…: don't need to refer to the module from collections import defaultdict d = defaultdict(list) d[3].append(1)

57D. Koop, CIS 602-01, Fall 2017

Page 58: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Other Collections• collections.defaultdict: specify a default value for any item in

the dictionary (instead of KeyError) • collections.OrderedDict: keep entries ordered according to

when the key was inserted - dict objects will be ordered in future versions of Python so this

will be obsolete then • collections.Counter: counts hashable objects, has a most_common method

58D. Koop, CIS 602-01, Fall 2017

Page 59: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Iterators• Remember range, values, keys, items? • They return iterators: objects that traverse containers, only need to

know how to get the next element • Given iterator it, next(it) gives the next element • StopIteration exception if there isn't another element • Generally, we don't worry about this as the for loop handles

everything automatically • …but you cannot index or slice an iterator • d.values()[0] and range(100)[-1] will not work! • If you need to index or slice, construct a list from an iterator • list(d.values())[0] or list(range(100))[-1] • In general, this is slower code so we try to avoid creating lists

59D. Koop, CIS 602-01, Fall 2017

Page 60: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

List Comprehensions• Shorthand for transformative or filtering for loops • squares = [] for i in range(10): squares.append(i**2)

• squares = [i**2 for i in range(10)]

• Equivalent code, just moved the loop inside of list definition • Advantages: concise, readable • Filtering: • squares = [] for i in range(10): if i % 3 != 1: squares.append(i ** 2)

• squares = [i**2 for i in range(10) if i % 3 != 1]

• if clause follows the for clause

60D. Koop, CIS 602-01, Fall 2017

Page 61: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Dictionary Comprehensions• Similar idea, but allow dictionary construction • dict([(k, v) for k,v in … if …])

• {k: v for k, v in … if …}

• Could do this with a for loop as well

61D. Koop, CIS 602-01, Fall 2017

Page 62: CIS 602-01: Scalable Data Analysisdkoop/cis602-2017fa/lectures/lecture02.pdfwith social scientists and engineers that have used this data set in their research to better understand

Exceptions• errors but potentially something that can be addressed • try-except: except allows specifications of exactly the error(s) you

wish to address

• finally: always runs (even if the program is about to crash)

• can also raise exceptions using the raise keyword • …and define your own

62D. Koop, CIS 602-01, Fall 2017