DSC 201: Data Analysis & Visualization

Visualization

Dr. David Koop

D. Koop, DSC 201, Fall 2017

Quiz


[Figure: Tableau map (Sheet 3) based on Longitude (generated) and Latitude (generated). Color shows sum of Refugees (incl. refugee-like situations) on a scale from 0 to 1,000,000. Details are shown for Origin. The data is filtered on Year, which ranges from 2015 to 2015.]

Assignment 1
• http://www.cis.umassd.edu/~dkoop/dsc201-2017fa/assignment1.html
• Due next Thursday (Sept. 28)
• Goals:
  - Using Tableau
  - Exploratory Data Analysis
  - Visualization
• Data: UN Persons of Concern
• Find outliers, trends, etc.



Exploratory Data Analysis
• John W. Tukey
  - Born in New Bedford
  - 1977: Highly influential book
• Emphasis on value of visualization in discovering trends, relationships
• From a review of the book: “Tukey favors analysis of data with little more than pencil and paper. Specifically, there is no need for a calculator, a computer, or a lettering guide to do the analyses he proposes” [R.M. Church, 1979]


Types of EDA
• Univariate (one attribute) vs. multivariate (2+ attributes)
• Non-graphical vs. graphical
  - Non-graphical ~ statistics
  - Graphical ~ visualizations
• All are important!


Univariate Non-graphical EDA
• Categorical Data:
  - Frequency counts, proportions
  - Groupings
• Quantitative Data:
  - Distribution
  - Summary statistics: mean, median, mode, variance, standard deviation, quantiles
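The univariate summaries listed on this slide can be sketched with Python's standard library. The two small datasets are made up for illustration, and the quartiles depend on the interpolation convention (here the default of `statistics.quantiles`), so other tools may report slightly different values.

```python
from collections import Counter
import statistics

# Hypothetical categorical attribute: one label per record.
colors = ["red", "blue", "red", "green", "blue", "red"]
counts = Counter(colors)                      # frequency counts
proportions = {label: n / len(colors) for label, n in counts.items()}

# Hypothetical quantitative attribute.
values = [2, 4, 4, 5, 7, 9, 11]
summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "mode": statistics.mode(values),
    "variance": statistics.variance(values),  # sample variance (divides by n - 1)
    "stdev": statistics.stdev(values),        # sample standard deviation
    "quartiles": statistics.quantiles(values, n=4),
}

print(counts.most_common())
print(proportions)
print(summary)
```

Frequency counts and proportions apply to the categorical attribute; the remaining statistics only make sense for the quantitative one.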


Univariate Graphical EDA
• Categorical Data: grouping, bar charts
• Quantitative Data: strip charts, stem-and-leaf, histograms, boxplots
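As a pencil-and-paper style illustration, a stem-and-leaf display can be sketched in a few lines of Python. This is a minimal version that assumes non-negative integers, uses the tens digit as the stem, and skips stems that have no leaves; the function name and sample values are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return one text line per stem: tens as the stem, units as the leaves."""
    stems = defaultdict(list)
    for value in sorted(values):
        stems[value // 10].append(value % 10)
    return [
        f"{stem} | {' '.join(str(leaf) for leaf in leaves)}"
        for stem, leaves in sorted(stems.items())
    ]

for line in stem_and_leaf([12, 15, 21, 23, 23, 30]):
    print(line)
```

Each row reads like a sideways histogram bar while still preserving the individual values.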


[Figure: bar chart with categories A through Z on the x-axis and a percentage scale from 0% to 12% on the y-axis]

Histograms and Distributions


[Cloudera]

https://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-sheet/

Boxplots
• Show distribution
• Multiple summary statistics can be read from the chart
• Also provides a general shape of the data
• Best for unimodal data
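The summary statistics a boxplot encodes can be computed directly. The sketch below assumes one common convention that the slide does not specify: whiskers extend to the most extreme points within 1.5 times the interquartile range of the quartiles, and anything beyond is drawn as an outlier. The function name and sample data are hypothetical.

```python
import statistics

def boxplot_stats(values):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR convention."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

print(boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30]))
```

Plotting libraries differ in how they interpolate quartiles, so the box edges they draw may not match these numbers exactly.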


[N. Yao]

https://flowingdata.com/2008/02/15/how-to-read-and-use-a-box-and-whisker-plot/

Multivariate Non-graphical EDA
• Crosstabs and Pivot Tables
  - What is in the data? Count
• Correlation and covariance
  - Correlation: how related are different attributes?
    • Positive correlation (related)
    • Negative correlation (related)
    • Zero (unrelated)
  - Covariance: how do two attributes change together?


Crosstabs & Pivot Tables
• Count groups and subgroups
• At least two different attributes
• Can subdivide vertically and horizontally for more subgroups
• Sometimes totals are useful
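A crosstab is a count of records for every combination of two attributes, optionally with row and column totals. A minimal sketch using only the standard library follows; the records, attribute values, and the `crosstab` helper are made up for illustration.

```python
from collections import Counter

# Hypothetical records: (plan, region) for each customer.
records = [
    ("Basic", "East"), ("Basic", "East"), ("Basic", "West"),
    ("Premium", "East"), ("Premium", "West"), ("Premium", "West"),
    ("Premium", "West"),
]

def crosstab(pairs):
    """Count each (row, column) combination and add marginal totals."""
    counts = Counter(pairs)
    rows = sorted({row for row, _ in pairs})
    cols = sorted({col for _, col in pairs})
    table = {row: {col: counts[(row, col)] for col in cols} for row in rows}
    row_totals = {row: sum(table[row].values()) for row in rows}
    col_totals = {col: sum(table[row][col] for row in rows) for col in cols}
    return table, row_totals, col_totals

table, row_totals, col_totals = crosstab(records)
print(table)
print(row_totals, col_totals)
```

Subdividing further, as the slide suggests, amounts to using a tuple of attributes as the row or column key.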


Correlation
• Pearson Correlation Coefficient:

  $r = \dfrac{\mathrm{covariance}(x, y)}{\mathrm{stddev}(x)\,\mathrm{stddev}(y)}$

• r > 0 (+ correlation), r < 0 (- correlation), |r| ~ 1 (strong correlation)
• Examples:
  - Correlation(Number of people who drowned by falling into a pool, Films Nicolas Cage appeared in) = 0.666004
  - Correlation(Divorce rate in Maine, Per capita consumption of margarine) = 0.992558
  - Correlation(People who drowned after falling out of a fishing boat, Marriage rate in Kentucky) = 0.952407
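The Pearson coefficient is the covariance of the two attributes divided by the product of their standard deviations. A minimal sketch, assuming sample (n - 1) normalization for both quantities; the function names mirror the slide's notation and the input lists are illustrative.

```python
import math

def covariance(xs, ys):
    """Sample covariance: how two attributes change together."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def stddev(xs):
    """Sample standard deviation, the square root of the variance."""
    return math.sqrt(covariance(xs, xs))

def pearson_r(xs, ys):
    """Covariance rescaled to the range [-1, 1]."""
    return covariance(xs, ys) / (stddev(xs) * stddev(ys))

print(pearson_r([1, 2, 3], [2, 4, 6]))   # perfectly linear, increasing
print(pearson_r([1, 2, 3], [3, 2, 1]))   # perfectly linear, decreasing
```

Because the same normalization appears in the numerator and the denominator, using population (n) formulas instead would give the same r. A high |r| only measures linear association, which is why the examples on the slide can score so high without any causal link.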

Multivariate Graphical EDA
• Categorical Data:
  - Grouped bar charts
  - Parallel sets
• Quantitative Data:
  - Scatterplots: look for correlation
    • Usually put outcome on y-axis
    • Can encode other variables
  - Side-by-side boxplots
  - Parallel coordinates


extracat: New Approaches in Visualization of Categorical Data in R

Figure 4: Multiple barchart of Cont (x), Type (y), Infl (x) and Sat (x) generated with the Mondrian software.

frequencies is called doubledeckerplots (Hofmann 2001), which has all explanatory variables on the x-axis. Its interpretability decreases with the number of displayed combinations. For relatively small examples such as the Copenhagen housing data, the graphic is among the best visual representations as Figure 3

R> doubledecker(xtabs(Freq ~ Cont + Type + Infl + Sat, data = housing))

illustrates: It is easy to compare combinations with different levels of influence or different types of residence but less easy to compare the two different levels of contact.

Figure 4 shows the multiple barchart visualisation of the example discussed in Figures 1 and 2 created with the interactive software Mondrian (Theus and Urbanek 2008). The main difference between the multiple barchart and the rmb plot is that the multiple barchart displays absolute frequencies whereas the rmb plot shows their factorization into conditional relative frequencies and weights. The advantage of this becomes apparent in the last two rows ("Atrium" and "Terrace") of the left side of the plot (low contact) where the bars are very small and hardly comparable. Within each combination of Type, Infl and Cont the ratio of any two absolute frequencies is obviously the same as that of the corresponding conditional relative frequencies. Unfortunately this does not hold for two different combinations of these three variables and thus only the ratios of the bars can be compared.

Figure 2 and 4 show that it is at least possible to judge strong differences in the shape of the distributions of a target variable in both classical mosaicplots and multiple barcharts. E.g., the strong positive relationship of Infl and Sat is apparent in both graphics. Nevertheless in many examples the rmb plot provides a better overview and allows for more precise com-

Bar Chart Matrix


[Pilhöfer and Unwin]

Journal of Statistical Software

Variable  Description                       Levels
Cont      Contact to other residents        "Low", "High"
Infl      Influence on housing conditions   "Low", "Medium", "High"
Type      Type of residence                 "Tower", "Atrium", "Apartment", "Terrace"
Sat       Satisfaction                      "Low", "Medium", "High"

Table 1: The Copenhagen housing dataset.

In principle, classical mosaicplots (see Friendly 1994; Hartigan and Kleiner 1981) also show both $p_{i|jk}$ and $n_{+jk}$ but while the space is efficiently used, it becomes harder to establish the relation between the rectangles and the corresponding variable combinations with every additional variable. Comparing the proportions of a target category in different combinations of explanatory variables is only possible in a qualitative manner, because the corresponding rectangles neither share a common axis nor have a common scale.

By contrast multiple barcharts and fluctuation diagrams display only the total number of observations $n_{ijk}$ but allocate the information in equal-sized rectangles in a hierarchical grid layout (see Hofmann 2000). The allocation along the grid makes it easier to read the plot and also allows better comparisons especially within the rows or columns because all combinations now share the same x- and y-axis scales. In multiple barcharts the y-axis is set to $[0, \max(n_{ijk})]$ and the x-axis is cut into equal segments for the target categories (or vice versa). Unfortunately comparisons of the conditional distributions of a target variable are quite hard: Comparing absolute frequencies $n_{i|s}$ and $n_{i|t}$ of target category $i$ in two explanatory combinations $s$ and $t$ is obviously not equivalent to the comparison of the relative frequencies $p_{i|s}$ and $p_{i|t}$ and hence it is necessary to use ratios of the form

$$\frac{n_{i|s}}{n_{j|s}} = \frac{p_{i|s}}{p_{j|s}} \quad \text{and} \quad \frac{n_{i|t}}{n_{j|t}} = \frac{p_{i|t}}{p_{j|t}}$$

instead.

The basic version of rmb plots is constructed as follows: Consider a set of $m$ categorical variables including one target variable. The basis of the plot is a multiple barchart of the $m - 1$ explanatory variables displaying the observed frequencies $n_{+jk}$ of their combinations. The plot uses horizontal bars which means that all bars have an equal height and their widths are proportional to the ratios

$$\frac{n_{+jk}}{\max(n_{+jk})}.$$

The conditional distributions of the target categories defined by the probabilities $p_{i|jk}$ are displayed inside these bars. The basic type of visualization is again a barchart with vertical bars. An alternative which is discussed in Section 3 is the generalized spineplot version which splits each bar from the basis plot vertically into segments according to their relative frequencies, just as in classical mosaicplots or spineplots. In both versions the x- and y-axis scales are the same, namely $[0, \max(n_{+jk})]$ and $[0, 1]$ respectively.

A first introductory example using the well-known Copenhagen housing dataset (c.f. Venables and Ripley 2002) is shown in Figure 1. In R the dataset is available from the MASS package and the variables are listed in Table 1.

Figure 1 shows the variables Cont and Infl on the x-axis, Type on the y-axis and Sat as the target variable which is by convention on the x-axis. The graphic reveals the weak influence of the Cont variable and the strong positive correlation between Infl and Sat: The differences between the distributions on the left side (low contact) and the corresponding counterparts on the right side (high contact) are quite small and hence the influence of the Cont variable on the satisfaction of the respondents is weak. In contrast the variable Infl shows a strong positive correlation with the target variable: The people who judged their influence to be low

[Figure: parallel sets with dimensions Survived (Survived, Perished), Sex (Female, Male), Age (Child, Adult), and Class (Second Class, First Class, Third Class, Crew). Data: Robert J. MacG. Dawson.]

Parallel Sets


[Titanic Data, J. Davies]

https://www.jasondavies.com/parallel-sets/

Scatterplot


Scatterplots and Correlation


[Figure: parallel coordinates plot with axes economy (mpg), cylinders, displacement (cc), power (hp), weight (lb), 0-60 mph (s), and year]

Parallel Coordinates


[M. Bostock]

https://bl.ocks.org/mbostock/1341021

Multiple Boxplots


Visualization


MTA Fare Data Visualization


“Computer-based visualization systems provide visual representations of datasets designed to help people carry out tasks more effectively.”

— T. Munzner





NYC Subway Fare Data




Find Interesting NYC Subway Ridership Patterns

NYC Subway Fare Data

Why People?
• Certain tasks can be totally automated
  - Statistical computations
  - Machine learning algorithms
  - We don’t need visualization for these tasks (although perhaps for debugging them…)
• Analysis problems are often ill-specified
  - What is the correct question?
  - Exploit human visual system, pattern detection capabilities
  - Goal may be an automated solution or a visual analysis system
• Presentation
  - It is often easier to show someone something than to tell them a bunch of facts about the data (and let them explore it)


Why Computers?


[Cerebral, Barsky et al., 2007]

http://www.cs.ubc.ca/labs/imager/tr/2007/barskya_cerebral_appnote/

Resource Limitations
• Memory and space constraints
• How many pixels do I have?
• Information Density


[McGuffin & Robert, 2010]

Fig. 2. A tree of regions and major islands of the Philippines, drawn using squarified treemaps (top) and using icicle diagrams (bottom). The two diagrams on the left weight leaf nodes by geographic area, whereas the two diagrams on the right give equal area to leaf nodes. Labels are rotated when necessary to maximize their size.

arbitrarily deep. Our analysis allows us to rank tree representations by their efficiency, which is useful for helping designers choose the most efficient representation allowable within other given constraints.

Our work also quantifies an interesting difference between representations in how they distribute area across nodes. For example, the icicle diagrams in Figures 1C, 2C, and 2D allocate equal area to each level of the tree: the root node has the same area as all the leaf nodes together. Treemaps, in contrast, typically allocate more area to deeper nodes. There is a tradeoff here, since we would like users to be able to see as many deep nodes as possible (which tend to also be the most numerous nodes), while at the same time providing some information about shallow nodes (for example, to give an overview of subtrees, and/or to guide the user in zooming operations). This article develops a new metric, the mean area exponent, that describes the distribution of area across levels of a tree representation, to quantify this tradeoff.

Finally, we also present a set of design guidelines for using tree representations, as well as a few novel tree representations, including a variation on squarified treemaps that allows for larger labels within the nodes.

2 RELATED WORK

Different tree representations, including classical node-link, icicle, nested enclosure, and indented outline, were identified decades ago in [4, 16], and an interactive version of the indented outline representation (now popular in file browsers such as Microsoft Explorer) was presented in [11]. Subsequent years have seen variations on these representations proposed. Treemaps are a relatively recent innovation, and are a kind of nested enclosure representation. Treemaps are often described as space-filling, a highly desirable property for space-efficiency.

The term “space-filling” can sometimes be problematic, however. For example, a view sometimes expressed [21] holds that tree representations can be divided into two classes: (1) node-link diagrams, that illustrate parent-child relationships with line segments or curves, and (2) space-filling representations, which include treemaps and concentric circles such as Sunburst (Sunburst was described as space-filling in [27], and [21] similarly describe [2] as space-filling.) However, these two classes seem to not be disjoint, because some node-link diagrams also “fill space” [19, 20]. The 2nd class also ignores an interesting difference between treemaps and concentric circles, namely that parent nodes in treemaps enclose their children, whereas parents in concentric circle diagrams are adjacent to their children. Finally, the term “space-filling” suggests increased space-efficiency, however it is easy to design a treemap layout algorithm that occupies all available space without making good use of it, for example, by using excessively thick margins, or by concentrating child nodes in only one corner of their parent, leaving the rest of the parent empty and unused. Would such a treemap cease to be considered space-filling, even though its root node covers all the available space? Without a precise definition of “space-filling”, we recommend being cautious about using this term to refer to a category of tree representations, since the name seems to imply that members of the category are more space-efficient than non-members. As an alternative, categories of representations could instead be based on how the nodes are drawn (e.g. representations where the nodes are mapped to points, and those where the nodes are mapped to areas) or on how parent-child relationships are shown (e.g. through line segments, enclosure, adjacency, or relative positioning). The space-efficiency of a given representation can be treated as a separate matter, and evaluated by several metrics, as demonstrated in this article.

Within the graph drawing community, a common approach for evaluating space-efficiency is to compare the total area required by different drawings (i.e., representations) of the same graph or tree. Since any drawing can be scaled arbitrarily in x and y, to ensure a meaningful comparison, the “resolution” of the representations is fixed, often by requiring that nodes be positioned on a grid (i.e. with integer coordinates) [9]. There are problems with this general approach, however, especially when comparing representations of trees rather than graphs. For example, allowing only grid positions may be misleading, because non-integer coordinates can significantly reduce total area without compromising the clarity of the representation or the space available for labels (Figure 3). As a potential remedy, instead of positioning nodes on a grid, we might instead impose a minimum distance between nodes, or a minimum size for non-overlapping labels centered over the nodes. Unfortunately, matters are complicated by the fact that some tree representations (such as Figures 1C, 1E, 1F, 1G) involve nodes that have an area and shape, and there may be nodes and labels of different sizes within a single representation (e.g. deeper nodes may be smaller and have smaller labels). This makes it less clear how to impose a fixed resolution in a way that is fair across tree representations. Note that this issue does not arise in traditional graph drawing, where nodes are typically mapped to points.

Fig. 3. A and B are adapted from a comparison in Figure 5 of [1], and show two different graphical representations of the same tree where nodes are constrained to positions with integer coordinates. B is clearly more compact than A. In C, however, we have redrawn the representation from A with the integer coordinate constraint relaxed, and the resulting graphical representation has a convex hull whose area is only about 5% greater than that in B. Notice also that the minimum horizontal spacing between nodes in B and C is the same, allowing nodes to be overlaid with horizontally oriented labels of the same size in both cases.

In our work, rather than comparing total area with a fixed resolution, we fix the total area available, and fix its aspect ratio. Representations

“Computer-based visualization systems provide visual representations of datasets designed to help people carry out tasks more effectively”


Why Visual?


[F. J. Anscombe]

        I             II            III            IV
    x      y      x      y      x      y      x      y
  10.0   8.04   10.0   9.14   10.0   7.46    8.0   6.58
   8.0   6.95    8.0   8.14    8.0   6.77    8.0   5.76
  13.0   7.58   13.0   8.74   13.0  12.74    8.0   7.71
   9.0   8.81    9.0   8.77    9.0   7.11    8.0   8.84
  11.0   8.33   11.0   9.26   11.0   7.81    8.0   8.47
  14.0   9.96   14.0   8.10   14.0   8.84    8.0   7.04
   6.0   7.24    6.0   6.13    6.0   6.08    8.0   5.25
   4.0   4.26    4.0   3.10    4.0   5.39   19.0  12.50
  12.0  10.84   12.0   9.13   12.0   8.15    8.0   5.56
   7.0   4.82    7.0   7.26    7.0   6.42    8.0   7.91
   5.0   5.68    5.0   4.74    5.0   5.73    8.0   6.89

Why Visual?


[F. J. Anscombe]


Mean of x       9
Variance of x   11
Mean of y       7.50
Variance of y   4.122
Correlation     0.816
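These shared summary statistics can be checked directly from the table. The sketch below transcribes the four datasets and computes the mean, sample variance, and Pearson correlation for each; the variable and function names are illustrative, and small differences in the last decimal place of the y variance across the four datasets are expected.

```python
import math
import statistics

# Anscombe's four datasets, transcribed from the table on this slide.
x_shared = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8.0] * 7 + [19.0] + [8.0] * 3,
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def pearson_r(xs, ys):
    """Pearson correlation from sums of squared and cross deviations."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

summary = {
    name: {
        "mean_x": statistics.mean(xs),
        "var_x": statistics.variance(xs),   # sample variance (n - 1)
        "mean_y": statistics.mean(ys),
        "var_y": statistics.variance(ys),
        "r": pearson_r(xs, ys),
    }
    for name, (xs, ys) in quartet.items()
}

for name, stats in summary.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four rows of output are nearly identical, which is the point of the example: the non-graphical summary cannot distinguish datasets that look completely different once plotted.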

[Figure: four scatterplots (x1 vs. y1, x2 vs. y2, x3 vs. y3, x4 vs. y4), each with x-axis ticks from 4 to 18 and y-axis ticks from 4 to 12]

Why Visual?


[F. J. Anscombe]

Visual Pop-out


[C. G. Healey, http://www.csc.ncsu.edu/faculty/healey/PP/]


Visual Pop-out




Visual Perception Limitations




Other Human Limitations
• Visual working memory is small
• Change blindness: Large changes go unnoticed when we are working on something else in our view


Design Iteration


[19 Sketches of Quarterback Timelines, K. Quealy]

http://kpq.github.io/chartsnthings/2013/09/19-sketches-of-quarterback-timelines.html

Design Iteration





Another Design Example


[M. Stefaner, 2013]

http://well-formed-data.net/archives/972/where-the-wild-bees-are

Why Effectiveness?
• “It’s not just about pretty pictures”
• Any depiction of data requires the designer to make choices about how that data is visually represented
  - Analogy to photography
  - Lots of possibilities (see quarterback study)
• Effectiveness measures how well the visualization helps a person with their tasks
  - How? insight, engagement, efficiency?
  - Benchmarks and user studies


Effectiveness


[S. Hayward, 2015]

http://www.powerlineblog.com/archives/2015/10/the-only-global-warming-chart-you-need-from-now-on.php

Effectiveness


[@bizweekgraphics]

https://twitter.com/bizweekgraphics/status/676533647567114240

Effectiveness


[S. Hayward, 2015]

http://www.powerlineblog.com/archives/2015/10/the-only-global-warming-chart-you-need-from-now-on.php

Tableau Example

