lecture 2: the pipeline - ucla statisticscocteau/stat202a/lectures/lecture2.pdf · lecture 2: the...

Lecture 2: The pipeline

Upload: dangnguyet

Post on 30-Jul-2018



Lecture 2: The pipeline


Last time

• We presented some (admittedly rough) context for the material to be covered in this course: From a comparison between Stat202a from 2004 and 2010 we saw somewhat incredible shifts in mechanisms by which data are distributed

• We also considered the term “data science” and compared current attempts at a definition with older (some really old) approaches and the classic cycle of exploratory data analysis

• We ended by examining some early work on bringing data analysis to high school students, curricular approaches that should have something to say about how we train statisticians


Today

• We will look at the Unix operating system, the philosophy underlying its design and some basic tools

• Given the abstract nature of our first lecture, we will emphasize some hands-on tasks -- We will use as our case study the last week of traffic across the department’s website www.stat.ucla.edu

• You will have a chance to kick the tires on these tools and address some simple web site usage statistics


The shell: Why we teach it

• Many of you know the computer as a Desktop and a handful of applications (a browser, Word, Excel, maybe R); to introduce the shell means having a discussion about the structure of the computer, about operating systems, about filesystems, about history

• The shell offers programmatic access to a computer’s constituent parts, allows students to “do” data analysis on directories (folders), on processes, on their network

• Given that there are so many flavors of shell, it is also the first time we can talk about choosing tools (that the choice is yours to make!), about evaluating which shell is best for you, and about the shifting terrain of software development


The shell: Why we teach it

• As a practical matter, shell tools are an indispensable part of my own practice, for data “cleaning,” for preliminary data processing, for exploratory analysis -- By design, they are simple, lightweight and easily combined to perform much more complex computations

• The shell also becomes important as you start to make use of shared course resources (data, software, hardware) -- This year, most of our work will be done on computers outside the Statistics Department, and the shell will be your window onto these other systems

• On the computers in front of you, you can access the shell through a Terminal Window (Applications -> Utilities -> Terminal) -- For those of you with a Windows machine, you will have to download an emulator like Cygwin...


Operating systems

• Most devices that contain a computer of some kind will have an OS; they tend to emerge when the appliance will have to deal with new applications, complex user-input and possibly changing requirements of its function

• Your Tivo, smartphone and even your automobile all have operating systems


Operating systems

• An operating system is a piece of software (code) that organizes and controls hardware and other software so your computer behaves in a flexible but predictable way

• For home computers, Windows, MacOS and Linux are among the most commonly used operating systems


Your computer

• The Central Processing Unit (CPU) or microprocessor is a complete computational engine, capable of carrying out a number of basic commands (performing basic arithmetic, storing and retrieving information from its memory, and so on)

• The CPU itself has the capacity to store some amount of information; when it needs more space, it moves data to another kind of chip known as Random Access Memory (RAM) -- “random access” as opposed to, say, sequential access to memory locations

• Your computer also has one or more storage devices that can be used to organize and store data; hard disks or drives store data magnetically, while solid state drives again use semiconductor technology


Your computer

• Your operating system, then, manages all of your computer’s resources, providing layers of abstraction for both you, the user, as well as developers who are writing new programs for your computer

• With the emergence of so-called cloud computing, we imagine a variety of computing resources “out there” on the web that we can execute; in this model, computations are performed elsewhere, and your own computer might function more as a “browser” receiving results -- Google’s Chrome operating system is “minimalist” in this sense


http://www.youtube.com/watch?v=0QRO3gKj3qw



But let’s step back a bit...

• The computers you are sitting at are running the Mac OS, which is built on a Unix platform -- let’s spend some time talking about Unix (we will have more to say about three fundamental abstractions for computer components -- the memory, the interpreter and the communication link -- next time)

• We’ll start with a little history that might help explain why these tools look the way they do, their underlying design philosophy and how they can help you in the early (or late, even) stages of an analysis


Some history

• In 1964, Bell Labs partnered with MIT and GE to create Multics (for Multiplexed Information and Computing Service) -- here is the vision they had for computing

“Such systems must run continuously and reliably 7 days a week, 24 hours a day in a way similar to telephone or power systems, and must be capable of meeting wide service demands: from multiple man-machine interaction to the sequential processing of absentee-user jobs; from the use of the system with dedicated languages and subsystems to the programming of the system itself”


Some history

• In 1969, Bell Labs pulled out of the Multics project and a group of researchers at Bell Labs started work on Unics (Uniplexed Information and Computing System), so named because initially it could support only one user; as the system matured, it was renamed Unix, which isn’t an acronym for anything

• Ritchie simply says that Unix is a “somewhat treacherous pun on Multics”


From “The Art of Unix Programming” by Raymond


The Unix filesystem

• In Multics, we find the first notion of a hierarchical file system -- software for organizing and storing computer files and the data they contain; in Unix, files are arranged in a tree structure that allows separate users to have control of their own areas

• In general, a file system provides a level of abstraction for the user, keeping track of how data are stored on an associated physical storage device (say, a hard drive or a CD)

• Unix began (more or less) as a file system and then an interactive shell emerged to let you examine its contents and perform basic operations


The Unix kernel and its shell

• The Unix kernel is the part of the operating system that provides other programs access to the system’s resources (the computer’s CPU or central processing unit, its memory and various I/O or input/output devices)

• The Unix shell is a command-line interface to the kernel -- keep in mind that Unix was designed by computer scientists for computer scientists and the interface is not optimized for novices


Unix shells

• The Unix shell is a type of program called an interpreter; in this case, think of it as a text-based interface to the kernel

• The shell operates in a simple loop: It accepts a command, interprets it, executes the command and waits for another -- The shell displays a prompt to tell you that it is ready to accept a command
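The accept-interpret-execute loop can be imitated in a few lines of shell. This is only a toy sketch, not how sh itself is implemented: mini_shell is a made-up name, it prints no prompt, and it leans on the real shell's eval to do the interpreting.

```shell
# A toy read-interpret-execute loop (an illustration, not a real shell).
# mini_shell is a hypothetical name; it reads commands from standard input.
mini_shell() {
    while IFS= read -r line; do        # accept a command
        [ -z "$line" ] && continue     # skip empty lines
        [ "$line" = "exit" ] && break  # stop when asked to
        eval "$line"                   # interpret and execute it
    done                               # then wait for the next one
}

# Feed two commands to the loop instead of typing them at a prompt.
loop_output=$(printf 'echo hello\necho world\n' | mini_shell)
echo "$loop_output"
```

A real shell does the interpreting itself (splitting words, expanding variables, finding the program to run) instead of handing the line to eval.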


Shells in general

• The term “shell” is often applied to any program that you type commands into (we’ll see similar kinds of interfaces when we talk about Python, R, and even sqlite3 and mongo) -- the idea is that the shell is the outermost interface to the inner workings of the system it surrounds

• It is sometimes useful to take a broader view of the shell concept, for example, considering a browser (like Firefox) a shell for a system that renders web pages -- in such cases, we might usefully distinguish between text-based shells (command line interfaces) and graphics-based shells (graphical user interfaces)

• Now, open up the terminal window on your Mac and let’s explore a little...


* A boring slide, but full of potential!


Unix shells

• The shell is itself a program that the Unix operating system runs for you (a program is referred to as a process when it is running)

• The kernel manages many processes at once, many of which are the result of user commands (others provide services that keep the computer running)

• Some commands are built into the shell, others have been added by users -- Either way, the shell waits until the command is executed
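One way to see the difference between built-in commands and separate programs is the type command, which reports how the shell would interpret a name. The sketch below assumes a Bourne-style shell (sh, bash, ksh, zsh); the exact wording of the messages varies from shell to shell.

```shell
# `type` reports how the shell would interpret a command name.
cd_kind=$(type cd)   # cd has to be built in: it changes the shell's own directory
ls_kind=$(type ls)   # ls is normally a separate program found on your PATH
echo "$cd_kind"
echo "$ls_kind"

# $$ expands to the process ID of the running shell itself.
echo "this shell is process $$"
```

The process ID printed on the last line is the one you would look for in the output of top to find your own shell.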


The result of typing in the command top; a printout of all the processes running on your computer

Process ID

Name of the command

How hard the computer is thinking about it

How much memory is being used


% hostinfo # neolab.stat.ucla.edu

Mach kernel version:
    Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386
Kernel configured for up to 8 processors.
8 processors are physically available.
8 processors are logically available.
Processor type: i486 (Intel 80486)
Processors active: 0 1 2 3 4 5 6 7
Primary memory available: 16.00 gigabytes
Default processor set: 159 tasks, 364 threads, 8 processors
Load average: 10.14, Mach factor: 1.43


% hostinfo # my laptop

Mach kernel version:
    Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386
Kernel configured for up to 2 processors.
2 processors are physically available.
2 processors are logically available.
Processor type: i486 (Intel 80486)
Processors active: 0 1
Primary memory available: 4.00 gigabytes
Default processor set: 110 tasks, 442 threads, 2 processors
Load average: 0.25, Mach factor: 1.74
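The hostinfo command is specific to Mac OS. On other Unix systems a rough counterpart is uname, which asks the kernel to identify itself; the sketch below uses only its standard -s, -r and -m options, and the variable names are chosen just for illustration.

```shell
# uname is a portable way to ask which kernel you are talking to.
kernel_name=$(uname -s)      # kernel name, e.g. Darwin on a Mac
kernel_release=$(uname -r)   # kernel release string
machine=$(uname -m)          # hardware architecture as the kernel reports it
echo "kernel: $kernel_name $kernel_release ($machine)"
```

Unlike hostinfo, uname says nothing about processors, memory or load; for those you would still turn to top.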


Running top on neolab.stat.ucla.edu

How many processes are running?

How much RAM is available?


Running top on taia.stat.ucla.edu (last year)

How many processes are running?

How much RAM is available?

What do you reckon this computer does?


Operating systems

• Processor management

• Schedules jobs (formally referred to as processes) to be executed by the computer

• Memory and storage management

• Allocate space required for each running process in main memory (RAM) or in some other temporary location if space is tight; and supervise the storage of data onto disk


Operating systems

• Device management

• A program called a driver translates data (files from the filesystem) into signals that devices like printers can understand; an operating system manages the communication between devices and the CPU, for example

• Application interface

• An API (application programming interface) lets programmers use functions of the computer and the operating system without having to know how something is done

• User interface

• Finally, the operating system turns and looks at you; the UI is a program that defines how users interact with the computer -- some are graphical (Windows is a GUI) and some are text-based (your Unix shell)


Unix shell(s)

• There are, in fact, many different kinds of Unix shells -- The table below lists a few of the most popular

• Why so many?

Bourne shell /bin/sh The oldest and most standardized shell. Widely used for system startup files (scripts run during system startup). Installed in Mac OS X.

Bash (Bourne Again SHell) /bin/bash Bash is an improved version of sh. Combines features from csh, sh, and ksh. Very widely used, especially on Linux systems. Installed in Mac OS X. http://www.gnu.org/manual/bash/

C shell /bin/csh Provides scripting features that have a syntax similar to that of the C programming language (originally written by Bill Joy). Installed in Mac OS X.

Korn shell /bin/ksh Developed at AT&T by David Korn in the early 1980s. Ksh is widely used for programming. It is now open-source software, although you must agree to AT&T's license to install it. http://www.kornshell.com

TC Shell /bin/tcsh An improved version of csh. The t in tcsh comes from the TENEX and TOPS-20 operating systems, which provided a command-completion feature that the creator (Ken Greer) of tcsh included in his new shell. Wilfredo Sanchez, formerly lead engineer on Mac OS X for Apple, worked on tcsh in the early 1990s at the Massachusetts Institute of Technology.

Z shell /bin/zsh Created in 1990, zsh combines features from tcsh, bash, and ksh, and adds many of its own. Installed in Mac OS X. http://zsh.sourceforge.net


Why the choices?

• A shell program was originally meant to take commands, interpret them and then execute some operation

• Inevitably, one wants to collect a number of these operations into programs that execute compound tasks; at the same time you want to make interaction on the command line as easy as possible (a history mechanism, editing capabilities and so on)

• The original Bourne shell is ideal for programming; the C-shell and its variants are good for interactive use; the Korn shell is a combination of both

Steve Bourne, creator of sh, in 2005


The original UNIX system shell was a simple program written by Ken Thompson at Bell Laboratories, as the interface to the new UNIX operating system. It allowed the user to invoke single commands, or to connect commands together by having the output of one command pass through a special file called a pipe and become input for the next command. The Thompson shell was designed as a command interpreter, not a programming language. While one could put a sequence of commands in a file and run them, i.e., create a shell script, there was no support for traditional language facilities such as flow control, variables, and functions. When the need for some flow control surfaced, the commands /bin/if and /bin/goto were created as separate commands. The /bin/if command evaluated its first argument and, if true, executed the remainder of the line. The /bin/goto command read the script from its standard input, looked for the given label, and set the seek position at that location. When the shell returned from invoking /bin/goto, it read the next line from standard input from the location set by /bin/goto.

Unlike most earlier systems, the Thompson shell command language was a user-level program that did not have any special privileges. This meant that new shells could be created by any user, which led to a succession of improved shells. In the mid-1970s, John Mashey at Bell Laboratories extended the Thompson shell by adding commands so that it could be used as a primitive programming language. He made commands such as if and goto built-ins for improved performance, and also added shell variables.

At the same time, Steve Bourne at Bell Laboratories wrote a version of the shell which included programming language techniques. A rich set of structured flow control primitives was part of the language; the shell processed commands by building a parse tree and then evaluating the tree. Because of the rich flow control primitives, there was no need for a goto command. Bourne introduced the "here-document" whereby the contents of a file are inserted directly into the script. One of the often overlooked contributions of the Bourne shell is that it helped to eliminate the distinction between programs and shell scripts. Earlier versions of the shell read input from standard input, making it impossible to use shell scripts as part of a pipeline.

By the late 1970s, each of these shells had sizable followings within Bell Laboratories. The two shells were not compatible, leading to a division as to which should become the standard shell. Steve Bourne and John Mashey argued their respective cases at three successive UNIX user group meetings. Between meetings, each enhanced their shell to have the functionality available in the other. A committee was set up to choose a standard shell. It chose the Bourne shell as the standard.

At the time of these so-called “shell wars”, I worked on a project at Bell Laboratories that needed a form entry system. We decided to build a form interpreter, rather than writing a separate program for each form. Instead of inventing a new script language, we built a form entry system by modifying the Bourne shell, adding built-in commands as necessary. The application was coded as shell scripts. We added a built-in to read form template description files and create shell variables, and a built-in to output shell variables through a form mask. We also added a built-in named let to do arithmetic using a small subset of the C language expression syntax. An array facility was added to handle columns of data on the screen. Shell functions were added to make it easier to write modular code, since our shell scripts tended to be larger than most shell scripts at that time.

Since the Bourne shell was written in an Algol-like variant of C, we converted our version of it to a more standard K&R version of C. We removed the restriction that disallowed I/O redirection of built-in commands, and added echo, pwd, and test built-in commands for improved performance. Finally, we added a capability to run a command as a coprocess so that the command that processed the user-entered data and accessed the database could be written as a separate process.

At the same time, at the University of California at Berkeley, Bill Joy put together a new shell called the C shell. Like the Mashey shell, it was implemented as a command interpreter, not a programming language. While the C shell contained flow control constructs, shell variables, and an arithmetic facility, its primary contribution was a better command interface. It introduced the idea of a history list and an editing facility, so that users didn't have to retype commands that they had entered incorrectly.

I created the first version of ksh soon after I moved to a research position at Bell Laboratories. Starting with the form scripting language, I removed some of the form-specific code, and added useful features from the C shell such as history, aliases, and job control.

In 1982, the UNIX System V shell was converted to K&R C, echo and pwd were made built-in commands, and the ability to define and use shell functions was added. Unfortunately, the System V syntax for function definitions was different from that of ksh. In order to maintain compatibility with the System V shell and preserve backward compatibility, I modified ksh to accept either syntax.

The popular inline editing features (vi and emacs mode) of ksh were created by software developers at Bell Laboratories; the vi line editing mode by Pat Sullivan, and the emacs line editing mode by Mike Veach. Each had independently modified the Bourne shell to add these features, and both were in organizations that wanted to use ksh only if ksh had their respective inline editor. Originally the idea of adding command line editing to ksh was rejected in the hope that line editing would move into the terminal driver. However, when it became clear that this was not likely to happen soon, both line editing modes were integrated into ksh and made optional so that they could be disabled on systems that provided editing as part of the terminal interface.

As use of ksh grew, the need for more functionality became apparent. Like the original shell, ksh was first used primarily for setting up processes and handling I/O redirection. Newer uses required more string handling capabilities to reduce the number of process creations. The 1988 version of ksh, the one most widely distributed at the time this is written, extended the pattern matching capability of ksh to be comparable to that of the regular expression matching found in sed and grep.

In spite of its wide availability, ksh source is not in the public domain. This has led to the creation of bash, the “Bourne again shell”, by the Free Software Foundation; and pdksh, a public domain version of ksh. Unfortunately, neither is compatible with ksh.

In 1992, the IEEE POSIX 1003.2 and ISO/IEC 9945-2 shell and utilities standards were ratified. These standards describe a shell language that was based on the UNIX System V shell and the 1988 version of ksh. The 1993 version of ksh is a version of ksh which is a superset of the POSIX and ISO/IEC shell standards. With few exceptions, it is backward compatible with the 1988 version of ksh.

The awk command was developed in the late 1970s by Al Aho, Brian Kernighan, and Peter Weinberger of Bell Laboratories as a report generation language. A second-generation awk developed in the early 1980s was a more general-purpose scripting language, but lacked some shell features. It became very common to combine the shell and awk to write script applications. For many applications, this had the disadvantage of being slow because of the time consumed in each invocation of awk. The perl language, developed by Larry Wall in the mid-1980s, is an attempt to combine the capabilities of the shell and awk into a single language. Because perl is freely available and performs better than combined shell and awk, perl has a large user community, primarily at universities.

From "ksh - An Extensible High Level Language" by David G. Korn
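The pipe that Korn describes, where the output of one command becomes the input of the next, is written with the | character. Here is a small sketch in the spirit of our web-traffic case study; the page names are made up for illustration and are not taken from the department's logs.

```shell
# Made-up sample data: one requested page per line.
pages='/index.html
/courses
/index.html
/people
/index.html
/courses'

# Each stage reads the previous stage's output through a pipe:
#   sort      puts identical lines next to each other
#   uniq -c   collapses repeats and prefixes each line with a count
#   sort -rn  orders by that count, largest first
#   head      keeps only the top line
top_page=$(printf '%s\n' "$pages" | sort | uniq -c | sort -rn | head -n 1)
echo "$top_page"
```

None of the four tools knows anything about web pages; the analysis comes entirely from how they are combined.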


And while we are at it...

• Unix itself comes in different flavors; the 1980s saw an incredible proliferation of Unix versions, somewhere around 100 (System V, AIX, Berkeley BSD, SunOS, Linux, ...)

• Vendors provided (diverging) versions of Unix, optimized for their own computer architectures and supporting different features

• Despite the diversity, it was still easier to “port” applications between versions of Unix than it was between different proprietary operating systems

• In the 90s, some consolidation took place; today Linux dominates the low-end market, while Solaris, AIX and HP-UX are leaders in the mid-to-high end

A few common commands

First, commands to explore your file system; walk through directories and list files

• pwd, ls, cd

• mkdir, rmdir

• cp, mv, rm

• (These commands seem somewhat terse -- why not “remove” or “copy”? -- and some have speculated that the brevity came from the effort required to type on the early ASR-33 teletype keyboards)
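To get a feel for these commands, here is a small session you could paste into a terminal. It is only a sketch: it works in a throwaway directory, and the names sandbox, notes.txt, copy.txt and renamed.txt are made up for illustration. It also assumes mktemp is available, which is the case on Linux and OS X.

```shell
# Work in a scratch directory so none of your own files are touched.
start=$(pwd)                 # remember where we began
scratch=$(mktemp -d)         # create a throwaway directory
cd "$scratch"

mkdir sandbox                # make a new directory
cd sandbox                   # step into it
pwd                          # print the present working directory

echo "hello" > notes.txt     # create a small file to play with
cp notes.txt copy.txt        # copy it
mv copy.txt renamed.txt      # rename (move) the copy
ls                           # list what is here now
listing=$(ls)                # keep the listing so we can inspect it

rm notes.txt renamed.txt     # remove both files
cd ..                        # go back up one level
rmdir sandbox                # rmdir only removes empty directories

cd "$start"                  # return to where we started
rmdir "$scratch"             # and remove the scratch directory itself
```

Note that rm and rmdir do not ask for confirmation and there is no trash can, which is a good reason to practice in a scratch directory first.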

Another view of the filesystem; here, your Mac will display directories as folders and you navigate by clicking rather than typing commands

An example

[obstacle 12 lectures] ls -al
total 8
drwxr-xr-x    8 cocteau staff  272 Sep 30 08:11 .
drwxr-xr-x   70 cocteau staff 2380 Sep 30 08:08 ..
drwxr-xr-x    4 cocteau staff  136 Sep 30 08:05 lecture1
drwxr-xr-x@ 282 cocteau staff 9588 Sep 30 08:13 lecture1.key
-rw-r--r--    1 cocteau staff  139 Sep 21 15:34 lecture1.txt
drwxr-xr-x   11 cocteau staff  374 Sep 30 10:24 lecture2
drwxr-xr-x@  80 cocteau staff 2720 Sep 30 10:22 lecture2.key

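Reading the listing, field by field:

• total 8 -- number of 512-byte blocks used by the files in the directory

• First character of each line -- type of file: - normal file, d directory, s socket file, l link file

• Next nine characters (for example rwxr-xr-x) -- permissions; what you (and others) can do to the file

• cocteau -- your user name

• staff -- the group you belong to that owns the file

• 272, 2380, ... -- the file’s size in bytes

• Sep 30 08:11 -- when the file was last modified

• Last field -- the file’s name

• . -- shorthand for your present working directory (where you’re at)

• .. -- shorthand for the directory one level above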

http://www.thegeekstuff.com/2009/07/linux-ls-command-examples/

Options

• Notice that Unix commands can also be modified by adding one or more options; in the case of ls, we added the option “-l” for the “long” form of the output and “-a” to include all entries, even those whose names begin with a “.” -- Another useful option is “-h” for human-readable file sizes

• The command man will provide you with help on Unix commands; you simply supply the name of the command you are interested in as an operand -- try typing “man ls”
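The effect of these options is easiest to see side by side. The sketch below assumes nothing about your own files: it builds a throwaway directory holding one ordinary file and one dot file (data.txt and .config are made-up names) and lists it several ways.

```shell
scratch=$(mktemp -d)
echo "visible" > "$scratch/data.txt"
echo "hidden"  > "$scratch/.config"

ls "$scratch"                  # plain listing: names starting with "." are left out
ls -a "$scratch"               # -a adds ., .. and .config
ls -l "$scratch"               # -l gives the long form, one entry per line
ls -lah "$scratch"             # options combine; -h prints sizes in units like K and M

plain=$(ls "$scratch")         # capture two of the listings to compare them
all=$(ls -a "$scratch")

rm -r "$scratch"               # clean up
# man ls                       # opens the manual page (press q to leave)
```

The man call is left as a comment because it opens an interactive pager; type it yourself at the prompt.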

Permission bits

• Unix can support many users (remember, that was part of its initial design goals) on a single system and each user can belong to one or more groups

• Every file in a Unix filesystem is owned by some user and one of that user’s groups; each file also has a set of permissions specifying which users can read (r), write or modify (w), or execute (x) the file

• These permissions are specified with three “bits” and we need three sets of bits to define what the user can do, what their group (that owns the file) can do and what others can do

Permission bits

[obstacle 12 lectures] ls -al
total 8
drwxr-xr-x    8 cocteau staff  272 Sep 30 08:11 .
drwxr-xr-x   70 cocteau staff 2380 Sep 30 08:08 ..
drwxr-xr-x    4 cocteau staff  136 Sep 30 08:05 lecture1
drwxr-xr-x@ 282 cocteau staff 9588 Sep 30 08:13 lecture1.key
-rw-r--r--    1 cocteau staff  139 Sep 21 15:34 lecture1.txt
drwxr-xr-x   11 cocteau staff  374 Sep 30 10:24 lecture2
drwxr-xr-x@  80 cocteau staff 2720 Sep 30 10:22 lecture2.key

Permission bits

drwxr-xr-x    8 cocteau staff
drwxr-xr-x   70 cocteau staff
drwxr-xr-x    4 cocteau staff
drwxr-xr-x@ 282 cocteau staff
-rw-r--r--    1 cocteau staff
drwxr-xr-x   11 cocteau staff
drwxr-xr-x@  80 cocteau staff

• First character: the type of file

• Characters 2-4: what you can do to the file (u)

• Characters 5-7: what the owning group can do (g)

• Characters 8-10: what others can do (o)

In general...

• The command chmod changes the permissions on a file; here are some examples

• % chmod g+x lecture1.txt

• % chmod ug-x lecture1.txt

• % chmod a+w lecture1.txt

• % chmod go-w lecture1.txt

• You can also use binary to express the permissions; so if we think of bits ordered as rwx, then

rwx   $1 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 = 7$

r-x   $1 \times 2^2 + 0 \times 2^1 + 1 \times 2^0 = 5$

r--   $1 \times 2^2 + 0 \times 2^1 + 0 \times 2^0 = 4$

• and we can specify permissions with these values

• % chmod 754 dates.sh
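One way to convince yourself of what each chmod call does is to look at the permission string before and after. The sketch below is an illustration under a few assumptions: it works on a throwaway copy of a file called dates.sh whose contents are made up, and it uses a small helper, perms, that is not a standard command but simply cuts the first ten characters out of ls -l.

```shell
# Helper (our own, hypothetical): print the type-and-permission string of a file.
perms() { ls -l "$1" | cut -c1-10; }

scratch=$(mktemp -d)
f="$scratch/dates.sh"
echo 'date' > "$f"

chmod 644 "$f"                 # start from a known state: rw-r--r--
p0=$(perms "$f")

chmod g+x "$f"                 # add execute for the owning group
p1=$(perms "$f")

chmod a+w "$f"                 # add write for user, group and others
p2=$(perms "$f")

chmod go-w "$f"                # take write away from group and others
p3=$(perms "$f")

chmod 754 "$f"                 # rwx for the user, r-x for the group, r-- for others
p4=$(perms "$f")

# The three digits are the bit sums from the slide.
user=$((  1*4 + 1*2 + 1*1 ))   # rwx
group=$(( 1*4 + 0*2 + 1*1 ))   # r-x
other=$(( 1*4 + 0*2 + 0*1 ))   # r--

echo "$p0 $p1 $p2 $p3 $p4 $user$group$other"
rm -r "$scratch"
```

Cutting only the first ten characters sidesteps the extra marker (such as the @ in the listing above) that some systems append to the permission string.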

Kinds of files

• What you’ll notice right away is that there are different types of files having different permissions -- we will talk about different kinds of data storage and representation next time

• Unix filesystem conventions place (shared, commonly used) executable files in directories like /usr/bin or /usr/local/bin (navigate to one of these directories and have a look!)

• Different files are opened by different kinds of programs; in OS X, there is a beautiful command called open that decides which program to use

Kinds of files

• Filenames which contain special characters like * and ~ are candidates for filename substitution

• ~ refers to your home directory and * is a wildcard for any number of characters (this is referred to as globbing -- you can try the command glob to see what happens, something we’ll do shortly)

• Other special characters like {, [ and ? can also be expanded, but we’ll get to them when we learn a bit more about regular expressions
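A quick way to watch filename substitution happen is to hand the pattern to echo, which just prints whatever arguments the shell gives it. This is a sketch with made-up file names (lecture1.txt, lecture2.txt, notes.md) in a throwaway directory.

```shell
start=$(pwd)
scratch=$(mktemp -d)
cd "$scratch"
echo a > lecture1.txt
echo b > lecture2.txt
echo c > notes.md

echo *.txt            # the shell replaces *.txt with the matching names
echo lecture*         # any number of characters after "lecture"
echo "*.txt"          # quoting switches the substitution off
echo ~                # your home directory

txt=$(echo *.txt)     # capture two of the expansions to inspect them
quoted=$(echo "*.txt")

cd "$start"
rm -r "$scratch"
```

The point to notice is that the command never sees the *; the shell does the expansion first and passes along the resulting list of names.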

Unix shells

• There are many flavors of Unix shells; that is, there are many kinds of programs that operate as shells

sh, csh, tcsh, bash, ksh

• They are all programs and can be found on the file system

which sh
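If you want to see which of these shells exist on the machine in front of you, something along these lines should work. It is a sketch that uses command -v, the POSIX counterpart of which, on the assumption that which itself may not be installed everywhere; the output will differ from system to system.

```shell
sh_path=$(command -v sh)      # ask where the program sh lives
echo "$sh_path"               # often /bin/sh or /usr/bin/sh
ls -l "$sh_path"              # an ordinary executable (or a link to one)

echo "$SHELL"                 # your login shell, if the variable is set

# Check which of the shells named on the slide are installed here.
for s in sh csh tcsh bash ksh; do
    if command -v "$s" > /dev/null 2>&1; then
        echo "$s: $(command -v "$s")"
    else
        echo "$s: not installed"
    fi
done
```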

An example: HTTP access logs

• www.stat.ucla.edu

• The department runs an Apache web server on taia.stat.ucla.edu

• Each request, each click by a user out on the Internet browsing our site, is logged -- there are standards for these files, but in general, they can be a bit hairy to “parse”

Log files

• A Unix system is constantly maintaining logs about its performance; various applications log their activities as well -- some are visible to everyone, some only to users with “root” access

• Scraping through logs for anomalies, for system profiling, for system management, is a fairly common exercise; the logs we’ll look at have taken on extra importance given all the transactions that take place across the web

Into the cloud...

• This morning, I spun up an EC2 instance from Amazon Web Services (EC2 stands for Elastic Compute Cloud), assigned it to a domain (stat202a.org) and copied our relatively small data set over to the new machine

• The machine is running Ubuntu and you’ll have access to all the Unix tools we’ve been discussing so far (and more!)

From machine to machine

• In courses you’ve had previously, you probably ran R or some other program on the computer on your desk -- The Statistics Department has a number of computers available to you that are more powerful (memory, speed) than those on your desk and with EC2, we can spin up any number of machines to do our bidding

• You can navigate between machines by invoking the command ssh (secure shell) and move files around using scp (secure copy, example shortly)...

% ssh [email protected]

Moving data

• You can also move data to your own machine (assuming it’s not too big -- at least one of the data sets we’ll consider this term will be too big to budge!)

• If the data, for example, are stored on homework.stat202a.org, you can bring them to your desktop with the command

• % scp [email protected]:/data/weblogs/access_log.1267056000 .

The secure copy command; friends with ssh

Copy the file access_log.1267056000 in the directory /data/weblogs from the computer homework.stat202a.org

Copy it to a file of the same name on your local machine in your current working directory

HTTP access logs

• A bit of digging...

Commands

• pwd, ls, cd

• more/less, tail, wc

• cut, sort, uniq

% cd /data/weblogs
% ls

% head access_log.1267056000

seminars.stat.ucla.edu 128.97.86.247 - - [24/Feb/2010:23:00:31 -0800] "GET /upcoming_through/2010-03-02.xml HTTP/1.0" 200 3426 "-" "-"
ces.stat.ucla.edu 128.97.55.242 - - [24/Feb/2010:23:02:12 -0800] "GET /rss/ HTTP/1.1" 404 1515 "-" "Apple-PubSub/65.11"
seminars.stat.ucla.edu 128.97.86.247 - - [24/Feb/2010:23:03:48 -0800] "GET /upcoming_through/2010-03-02.xml HTTP/1.0" 200 3426 "-" "-"
seminars.stat.ucla.edu 128.97.86.247 - - [24/Feb/2010:23:03:53 -0800] "GET /upcoming_through/2010-03-02.xml HTTP/1.0" 200 3426 "-" "-"
seminars.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:37 -0800] "GET /2007-04-10/3:00pm/1260-franz HTTP/1.0" 200 3459 "-" "UCLA%20Google%20Sea
csrcb.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:38 -0800] "GET /robots.txt HTTP/1.0" 200 99 "-" "UCLA%20Google%20Search%20Appliance%20%2301
csrcb.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:38 -0800] "GET /projects/ HTTP/1.0" 200 1179 "-" "UCLA%20Google%20Search%20Appliance%20%230
seminars.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:39 -0800] "GET /archive?page=24 HTTP/1.0" 200 4329 "-" "UCLA%20Google%20Search%20Applian
seminars.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:39 -0800] "GET /2007-02-23/3:00pm/5200-boelter HTTP/1.0" 200 3560 "-" "UCLA%20Google%20S
seminars.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:39 -0800] "GET /1999-03-01/3:00pm/6627-ms HTTP/1.0" 200 3005 "-" "UCLA%20Google%20Search

% wc access_log.1267056000

1215118 24435535 306343201 access_log.1267056000

% tail access_log.1267056000

forums.stat.ucla.edu 164.67.132.220 - - [03/Mar/2010:23:59:54 -0800] "GET /feed.php?16,102,type=rss HTTP/1.0" 200 1570 "-" "UCLA%20Google%20Search%20Appliance%20%2302%20%28contact%3A%20jhuang%40ais.ucla.edu%29 (Enterprise; S5-GAAQFZSZNSNAS; [email protected])"
cran.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:55 -0800] "GET /robots.txt HTTP/1.0" 200 98261 "-" "UCLA%20Google%20Search%20Appliance%20%2301%20%28contact%3A%20jhuang%40ais.ucla.edu%29 (Enterprise; S5-J4JEBZS9PUJJA; [email protected])"
cran.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:55 -0800] "GET /web/checks/check_results_zic.html HTTP/1.0" 200 4965 "-" "UCLA%20Google%20Search%20Appliance%20%2301%20%28contact%3A%20jhuang%40ais.ucla.edu%29 (Enterprise; S5-J4JEBZS9PUJJA; [email protected])"
www.socr.ucla.edu 207.46.199.191 - - [03/Mar/2010:23:59:56 -0800] "GET /robots.txt HTTP/1.1" 200 58 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
www.stat.ucla.edu 67.195.110.174 - - [03/Mar/2010:23:59:57 -0800] "GET /~jkokkin/gray/pb_lrn_bound_1_sharp_2_dc_2_1_nm/253055_counts.mat HTTP/1.0" 200 481 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)"
directory.stat.ucla.edu 164.67.132.220 - - [03/Mar/2010:23:59:57 -0800] "GET /info.php?directory=443 HTTP/1.0" 200 2905 "-" "UCLA%20Google%20Search%20Appliance%20%2302%20%28contact%3A%20jhuang%40ais.ucla.edu%29 (Enterprise; S5-GAAQFZSZNSNAS; [email protected])"
cran.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:58 -0800] "GET /robots.txt HTTP/1.0" 200 98261 "-" "UCLA%20Google%20Search%20Appliance%20%2301%20%28contact%3A%20jhuang%40ais.ucla.edu%29 (Enterprise; S5-J4JEBZS9PUJJA; [email protected])"
cran.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:58 -0800] "GET /web/checks/check_results_xlsReadWrite.html HTTP/1.0" 200 5185 "-" "UCLA%20Google%20Search%20Appliance%20%2301%20%28contact%3A%20jhuang%40ais.ucla.edu%29 (Enterprise; S5-J4JEBZS9PUJJA; [email protected])"
www.stat.ucla.edu 67.180.31.126 - - [03/Mar/2010:23:59:58 -0800] "GET /projects/datasets/risk_perception.html HTTP/1.1" 200 5818 "http://www.stat.ucla.edu/projects/datasets/social_sciences.php" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6"
www.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:59 -0800] "GET /cgi-bin/man-cgi?ypmatch HTTP/1.0" 404 - "-" "UCLA%20Google%20Search%20Appliance%20%2301%20%28contact%3A%20jhuang%40ais.ucla.edu%29 (Enterprise; S5-J4JEBZS9PUJJA; [email protected])"

Combined log format

virtual host

IP address

Identity

Userid

date

Request

Status

Bytes

Referrer

Agent

Slicing out a field...

For speed, we’ll work with just one day’s worth of logs; look at the file access_log.1267056000 -- what time period does this cover?

Many of the built-in Unix tools are designed to digest log files of this kind; in a way, this is a very old dance, one that has been enacted for decades...

Examples

• cut -d" " -f1,11 access_log.1267056000

• cut -d" " -f11 access_log.1267056000

In general...

cut -d" " -f1,5,9    select the first, fifth and ninth fields
cut -d" " -f1-5      select the first through the fifth fields
cut -d" " -f1        select just the first field
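To see where the field numbers come from, it can help to run cut on a single record. The sketch below copies the first record of the head output into a throwaway file (one_record.log is a made-up name) and slices it at single spaces. Because the date and the request themselves contain spaces, they spill over several fields; the status code ends up in field 10 and the byte count in field 11 for records shaped like this one.

```shell
scratch=$(mktemp -d)
log="$scratch/one_record.log"

# One record from the log, written to a file with ">".
printf '%s\n' 'seminars.stat.ucla.edu 128.97.86.247 - - [24/Feb/2010:23:00:31 -0800] "GET /upcoming_through/2010-03-02.xml HTTP/1.0" 200 3426 "-" "-"' > "$log"

host=$(cut -d" " -f1 "$log")       # virtual host
ip=$(cut -d" " -f2 "$log")         # IP address
when=$(cut -d" " -f5-6 "$log")     # the date is split over two fields
status=$(cut -d" " -f10 "$log")    # status code
bytes=$(cut -d" " -f11 "$log")     # bytes sent
both=$(cut -d" " -f1,11 "$log")    # several fields at once, still space-separated

echo "$host | $ip | $when | $status | $bytes | $both"
rm -r "$scratch"
```

Counting by spaces is fragile further to the right: agent strings contain spaces of their own, so fields past the referrer do not line up from record to record.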

Unix pipes

• Rather than have the output scroll out on the screen, it would be better (useful, even) to “catch” the output with a more command

• Many programs take some kind of input and generate some kind of output; Unix tools often take input from the user and print the output to the screen

• Sending the output of one program directly into another is controlled by a construction known as a pipe, written “|”; “redirection” of data to and from files uses “>” and “<”

% cut -d" " -f1 access_log.1267056000 | more

seminars.stat.ucla.edu
ces.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
csrcb.stat.ucla.edu
csrcb.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
csrcb.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
seminars.stat.ucla.edu
...

Sending output to a file with “>”

With this form of redirection, we take a stream of processed data and store it in a file -- For example

• % cut -d" " -f2 access_log.1267056000 > ~/ips.txt

• % ls ~

Taking input from a file with “<”

With this form of redirection, we create an input stream from a file -- For example

wc < ~/ips.txt
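The two directions of redirection can be tried on a file small enough to check by eye. This sketch writes three made-up IP addresses to a throwaway ips.txt and counts the lines two ways; it also shows “>>”, which is not on the slides but is the standard shell operator for appending rather than overwriting.

```shell
scratch=$(mktemp -d)

# Three made-up addresses, one per line, sent to a file with ">".
printf '10.0.0.2\n10.0.0.1\n10.0.0.2\n' > "$scratch/ips.txt"

wc -l "$scratch/ips.txt"          # wc opens the file itself and prints its name
wc -l < "$scratch/ips.txt"        # wc reads its standard input, so no name is printed

n=$(wc -l < "$scratch/ips.txt")   # capture the count
n=$((n + 0))                      # drop the padding some versions of wc add

# ">" overwrites its target; ">>" appends to it instead.
echo "10.0.0.3" >> "$scratch/ips.txt"
m=$(wc -l < "$scratch/ips.txt")
m=$((m + 0))

echo "$n lines, then $m"
rm -r "$scratch"
```

Since “>” silently replaces an existing file, it is worth double-checking the name on the right-hand side before pressing return.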

Tallying

• As its name suggests, the command sort will order the rows in our file; by default it uses alphabetical order -- the option -n lets you sort numerically instead

• The command uniq will remove repeated adjacent lines; so if your file is sorted, it will return just the unique rows

• sort ~/ips.txt > ~/ips_sorted.txt

• head ~/ips_sorted.txt

• uniq ~/ips_sorted.txt | more
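The word “adjacent” matters more than it might seem. In this sketch (made-up addresses in a throwaway file), uniq removes nothing from the unsorted file because no two equal lines sit next to each other; after sorting, it finds the two distinct values. The last two lines illustrate the difference between alphabetical and numeric order.

```shell
scratch=$(mktemp -d)
printf '10.0.0.2\n10.0.0.1\n10.0.0.2\n10.0.0.1\n10.0.0.2\n' > "$scratch/ips.txt"

# uniq only collapses neighbours, so the unsorted file keeps all five lines.
unsorted=$(uniq "$scratch/ips.txt" | wc -l)
unsorted=$((unsorted + 0))

# Sorting first puts equal lines next to each other.
sort "$scratch/ips.txt" > "$scratch/ips_sorted.txt"
distinct=$(uniq "$scratch/ips_sorted.txt" | wc -l)
distinct=$((distinct + 0))

# Alphabetical versus numeric order on the same three numbers.
alpha=$(printf '10\n9\n100\n' | sort | head -n 1)       # compared as text
numeric=$(printf '10\n9\n100\n' | sort -n | head -n 1)  # compared as numbers

echo "$unsorted $distinct $alpha $numeric"              # prints: 5 2 10 9
rm -r "$scratch"
```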

The pipeline

We seem to be proliferating files -- one version of the IPs that is raw, one that is sorted, maybe one that has just the unique entries -- we can use the pipe construction to combine many of these steps into one

As the name might imply, you can connect pipes and have data stream from process to process

Example

• cut -d" " -f2 access_log.1267056000 | sort | uniq | wc

• cut -d" " -f2 access_log.1267056000 | sort | uniq -c | sort -rn

• cut -d" " -f10 access_log.1267056000 | sort | uniq -c
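Before running these on more than a million records, it may be worth trying the same pipelines on a log small enough to tally by hand. The sketch below uses five made-up records in a hypothetical file access_log.sample, laid out like the real ones but without referrer and agent. It also uses awk, which the slides have not introduced, purely as a convenient way to pick the second column out of the uniq -c output.

```shell
scratch=$(mktemp -d)
log="$scratch/access_log.sample"

# Five made-up records: 10.0.0.1 appears three times, the others once each.
cat > "$log" <<'EOF'
www.example.edu 10.0.0.1 - - [24/Feb/2010:23:00:31 -0800] "GET /a HTTP/1.0" 200 100
www.example.edu 10.0.0.2 - - [24/Feb/2010:23:00:32 -0800] "GET /b HTTP/1.0" 404 50
www.example.edu 10.0.0.1 - - [24/Feb/2010:23:00:33 -0800] "GET /c HTTP/1.0" 200 300
www.example.edu 10.0.0.3 - - [24/Feb/2010:23:00:34 -0800] "GET /a HTTP/1.0" 200 100
www.example.edu 10.0.0.1 - - [24/Feb/2010:23:00:35 -0800] "GET /a HTTP/1.0" 304 0
EOF

# Slice out the IPs, group equal ones, count each group, rank by count.
cut -d" " -f2 "$log" | sort | uniq -c | sort -rn

# How many distinct IPs are there?
n_ips=$(cut -d" " -f2 "$log" | sort | uniq | wc -l)
n_ips=$((n_ips + 0))

# The busiest IP: first line of the ranking, second column.
top_ip=$(cut -d" " -f2 "$log" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')

# The most common status code (field 10), found the same way.
top_status=$(cut -d" " -f10 "$log" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')

echo "$n_ips distinct IPs; busiest is $top_ip; most common status is $top_status"
rm -r "$scratch"
```

No intermediate files are created along the way; each program reads what the previous one writes.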

% cut -d" " -f2 access_log.1267056000 | sort | uniq | wc

28564 28564 404062 # what does this mean?

% cut -d" " -f10 access_log.1267056000 | sort | uniq -c | sort -rn | head -10

933724 200
127624 304
 72112 404
 46878 206
 22145 302
  9682 301
  2415 403
   239 405
   208 401
    43 416

What are these numbers?

• A fast Google search gives us a list of possible errors

• Note that Error 200 actually means a success

• Error 206 means that only part of the file was delivered; this is the server’s reply when a client asks for just a byte range of a file, as resumed downloads and download accelerators do

• Error 304 is “not modified”; sometimes clients perform conditional GET requests

HTTP Error 101: Switching Protocols. Again, not really an "error", this HTTP Status Code means everything is working fine.

HTTP Error 200: Success. This HTTP Status Code means everything is working fine. However, if you receive this message on screen, obviously something is not right... Please contact the server's administrator if this problem persists. Typically, this status code (as well as most other 200 Range codes) will only be written to your server logs.

HTTP Error 201: Created. A new resource has been created successfully on the server.

HTTP Error 202: Accepted. Request accepted but not completed yet, it will continue asynchronously.

HTTP Error 203: Non-Authoritative Information. Request probably completed successfully but can't tell from original server.

HTTP Error 204: No Content. The request completed successfully but the resource requested is empty (has zero length).

HTTP Error 205: Reset Content. The request completed successfully but the client should clear down any cached information as it may now be invalid.

HTTP Error 206: Partial Content. The request was canceled before it could be fulfilled. Typically the user gave up waiting for data and went to another page. Some download accelerator programs produce this error as they submit multiple requests to download a file at the same time.

HTTP Error 300: Multiple Choices. The request is ambiguous and needs clarification as to which resource was requested.

HTTP Error 301: Moved Permanently. The resource has permanently moved elsewhere, the response indicates where it has gone to.

HTTP Error 302: Moved Temporarily. The resource has temporarily moved elsewhere, the response indicates where it is at present.

HTTP Error 303: See Other/Redirect. A preferred alternative source should be used at present.

% cut -d" " -f2 access_log.1267056000 | sort | uniq -c | sort -rn | more

130258 128.97.86.247
129354 164.67.132.220
123756 164.67.132.219
 83956 164.67.132.221
 83205 164.67.132.222
 23554 67.195.110.174
 23058 128.97.55.186
 21205 76.89.146.27
 15941 128.97.55.188
 12967 128.97.55.185
 12365 67.195.114.250
  8029 137.110.114.8
  7748 208.115.111.248
  5256 216.9.17.231
  4894 128.97.55.3
  4554 66.249.67.236
  4057 69.235.86.81
  3857 66.249.67.183
  3691 67.195.110.173
  3622 67.195.113.236
  3181 77.88.29.246
  3144 74.6.149.154
  2825 98.222.49.118
...

The pipeline

• In 1972, pipes appear in Unix, and with them a philosophy, albeit after some struggle for the syntax; should it be

• more(sort(cut))

• [Remember this; S/R has this kind of functional syntax]

• The development of pipes led to the concept of tools -- software programs that would be in a tool box, available when you need them

“And that’s, I think, when we started to think consciously about tools, because then you could compose things together... compose them at the keyboard and get them right every time.”

from an interview with Doug McIlroy

Remember, read the man pages!

As we did with ls, if the command uniq is unfamiliar, you can look up its usage

Example

man uniq
man host

% host 128.97.86.247
247.86.97.128.in-addr.arpa domain name pointer limen.stat.ucla.edu.

% host 164.67.132.220
220.132.67.164.in-addr.arpa domain name pointer gsa2.search.ucla.edu.

% host 67.195.110.174
174.110.195.67.in-addr.arpa domain name pointer b3091077.crawl.yahoo.net.

But what are we really after?

• Rather than splitting up the data in access_log.1267056000 by day, we might consider dividing it by IP -- that is, extracting all the hits from a given IP

• Once we have such a thing, we can use the command wc to tell us about the number of accesses from each user

• We can also start to “fit” user-level models that can be used to predict navigation

Rudimentary pattern matching

grep can be used to skim lines from a file that have (or don’t have) a particular pattern -- patterns are specified via regular expressions, something we will learn more about later

The name of the command comes from a command in the Unix editor ed: g/re/p (globally search for a regular expression and print the matching lines)

Example

host 98.222.49.118

grep 98.222.49.118 access_log.1267056000

grep /~dinov access_log.1267056000
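One subtlety is already lurking in these examples: in a regular expression, “.” matches any single character, so the pattern 98.222.49.118 matches more than that one address. The sketch below makes this visible on three made-up records in a hypothetical file access_log.sample; the third record requests an image whose name happens to contain the same digits separated by hyphens. The options shown (-c, -v, -F) are standard grep options.

```shell
scratch=$(mktemp -d)
log="$scratch/access_log.sample"

cat > "$log" <<'EOF'
www.example.edu 98.222.49.118 - - [26/Feb/2010:23:20:37 -0800] "GET /~zyyao/ HTTP/1.1" 200 17692
www.example.edu 10.0.0.7 - - [26/Feb/2010:23:20:38 -0800] "GET /~dinov/index.html HTTP/1.1" 200 512
www.example.edu 10.0.0.7 - - [26/Feb/2010:23:20:39 -0800] "GET /img/98-222-49-118.png HTTP/1.1" 404 0
EOF

grep '98.222.49.118' "$log"                    # prints records 1 and 3: "." also matches "-"
loose=$(grep -c '98.222.49.118' "$log")        # -c counts matching lines instead of printing them
exact=$(grep -c -F '98.222.49.118' "$log")     # -F treats the pattern as a plain string
dinov=$(grep -c '/~dinov' "$log")              # hits on one user's pages
others=$(grep -v -c '98.222.49.118' "$log")    # -v keeps the lines that do NOT match

echo "$loose $exact $dinov $others"            # prints: 2 1 1 1
rm -r "$scratch"
```

Even with -F, grep looks for the string anywhere in the line, so it would still pick up the address if it appeared inside a longer one or in a referrer; cutting out the field first and comparing it whole is the stricter test.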

% grep 98.222.49.118 access_log.1267056000 | head -25

www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:37 -0800] "GET /~zyyao/ HTTP/1.1" 200 17692 "http://www.google.com/search?q=zhenyu+yao&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.8) Gecko/20100202 Firefox/3.5.8"

www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:38 -0800] "GET /~zyyao/ HTTP/1.1" 200 17692 "http://www.google.com/search?q=zhenyu+yao&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.8) Gecko/20100202 Firefox/3.5.8"

www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:39 -0800] "GET /~zyyao/common.css HTTP/1.1" 200 3762 "http://www.stat.ucla.edu/~zyyao/" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.8) Gecko/20100202 Firefox/3.5.8"

www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:40 -0800] "GET /favicon.ico HTTP/1.1" 200 318 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.8) Gecko/20100202 Firefox/3.5.8"

www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:40 -0800] "GET /~zyyao/CV_ZhenyuYao.pdf HTTP/1.1" 200 36901 "http://www.google.com/search?q=zhenyu+yao&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.8) Gecko/20100202 Firefox/3.5.8"

www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:42 -0800] "GET /~zyyao/CV_ZhenyuYao.pdf HTTP/1.1" 206 4352 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.8) Gecko/20100202 Firefox/3.5.8"

www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:35 -0800] "GET /~zyyao/projects/ABSS/all_exp.html HTTP/1.1" 200 525 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"

www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:35 -0800] "GET /favicon.ico HTTP/1.1" 200 318 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"

www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:37 -0800] "GET /%7Ezyyao/projects/ABSS/walk.html HTTP/1.1" 200 22230 "http://www.stat.ucla.edu/~zyyao/projects/ABSS/all_exp.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"

www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/000template101.png HTTP/1.1" 200 2369 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"

www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1001.png HTTP/1.1" 200 11315 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"

www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1003.png HTTP/1.1" 200 11362 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"

www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1003sketch.png HTTP/1.1" 200 2224 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"

www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1004.png HTTP/1.1" 200 11191 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"

www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1004sketch.png HTTP/1.1" 200 2293 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"

www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1005.png HTTP/1.1" 200 11214 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"

www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1005sketch.png HTTP/1.1" 200 2251 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"

www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1006.png HTTP/1.1" 200 11507 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"

www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1006sketch.png HTTP/1.1" 200 2179 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"

www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1007.png HTTP/1.1" 200 11226 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"

www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1007sketch.png HTTP/1.1" 200 2268 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"

www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1008.png HTTP/1.1" 200 11321 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"

www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1008sketch.png HTTP/1.1" 200 2214 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"

www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1009.png HTTP/1.1" 200 10986 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"

www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1009sketch.png HTTP/1.1" 200 2246 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"


% cut -d" " -f12 access_log.1267056000 | grep google | head -25

"http://www.google.com/search?hl=en&newwindow=1&q=filled+%22rounded+box%22+image+beamer&aq=f&aqi=&aql=&oq="

"http://www.google.com/imgres?imgurl=http://scc.stat.ucla.edu/page_attachments/0000/0064/map.jpg&imgrefurl=http://scc.stat.ucla.edu/contact-us/map&h=640&w=779&sz=307&tbnid=uHJ18tjP58PzjM:&tbnh=117&tbnw=142&prev=/images%3Fq%3Dboelter%2Bhall%2Bucla&hl=en&usg=__uoy38imK_LrO3Pe3j4z42srd9qY=&ei=BS6GS6nAMIXWsgPA0PzWDQ&sa=X&oi=image_result&resnum=5&ct=image&ved=0CBEQ9QEwBA"

"http://www.google.com.ph/search?client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&channel=s&hl=tl&source=hp&q=HISTORY+OF+STATISTICS&meta=&btnG=Hanapin+sa+Google"

"http://www.google.com/search?sourceid=navclient&ie=UTF-8&rlz=1T4SNNT_enUS355US355&q=SOCR+distributions"

"http://www.google.com/search?sourceid=navclient&ie=UTF-8&rlz=1T4SNNT_enUS355US355&q=SOCR+distributions"

"http://translate.google.gr/translate?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/"

"http://translate.googleusercontent.com/translate_c?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/&rurl=translate.google.gr&usg=ALkJrhjZRaez9AkwycLOBDTWUPwBXEuthw"

"http://translate.googleusercontent.com/translate_c?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/&rurl=translate.google.gr&usg=ALkJrhjZRaez9AkwycLOBDTWUPwBXEuthw"

"http://translate.googleusercontent.com/translate_c?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/index_head.php&rurl=translate.google.gr&usg=ALkJrhiQwFYwCUEiCMaFMRNh7aY_UFOEYA"

"http://www.google.com/search?q=socr&rls=com.microsoft:ko:IE-SearchBox&ie=UTF-8&oe=UTF-8&sourceid=ie7&rlz=1I7HPAB_ko"

"http://translate.googleusercontent.com/translate_c?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/index_body.php&rurl=translate.google.gr&usg=ALkJrhgWBIYc3prqMNQUlZTad_UKiabFWg"

"http://www.google.co.in/search?q=Lienhart+ICME+2003+face&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a"

"http://www.google.com.my/search?hl=en&client=firefox-a&channel=s&rls=org.mozilla:en-US:official&q=markov+chain+monte+carlo+mcmc&revid=1701745188&ei=BS-GS5OFI4qOkQXb7JFC&sa=X&oi=revisions_inline&resnum=0&ct=broad-revision&cd=1&ved=0CEQQ1QIoADgU"

"http://www.google.com/search?q=stata+goodness+of+fit+test&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a"

"http://translate.googleusercontent.com/translate_c?hl=zh-CN&sl=en&u=http://www.stat.ucla.edu/~yuille/courses/Stat202C/ga_tutorial.pdf&prev=/search%3Fq%3DA%2Bgenetic%2Balgorithm%2Btutorial%26hl%3Dzh-CN%26newwindow%3D1%26client%3Daff-cs-360se-channel%26hs%3DHyp%26sa%3DN%26channel%3Dbookmark%26source%3Dhp&rurl=translate.google.cn&usg=ALkJrhhyeLuWRdcFd6ege_jTb1a8hMbEWw"

"http://www.google.lt/url?sa=t&source=web&ct=res&cd=1&ved=0CAYQFjAA&url=http%3A%2F%2Fwww.stat.ucla.edu%2F~bk%2F&rct=j&q=brian+kriegler&ei=zC-GS4i7O4OmnQOj5_TWCw&usg=AFQjCNHTadqBAC3mNZuf6W0Oc6ELJOPeAw"

"http://www.google.nl/search?hl=nl&source=hp&q=f+distribution+table&meta=&aq=1&oq=F+distrib"

"http://www.google.com.my/search?q=case+in+suicide&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a"

"http://www.google.com/url?sa=t&source=web&ct=res&cd=9&ved=0CCgQFjAI&url=http%3A%2F%2Fwww.stat.ucla.edu%2F~sczhu%2FProject_pages%2FFace_Aging%2FFaceAging.htm&rct=j&q=aging+process+face&ei=4y-GS9WFI4OIsgOwqLzSDQ&usg=AFQjCNGniQVp4_079POXSewtZ7-Thaul4g"

"http://www.google.com/search?client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&channel=s&hl=en&source=hp&q=t-statistic+chart&btnG=Google+Search"

"http://www.google.co.in/search?hl=en&q=Genetic+algorithm+tutorial&meta=&aq=o&oq="

"http://www.google.com/search?hl=en&client=firefox-a&rls=org.mozilla:en-US:official&q=an+introduction+to+stochastic+modeling+taylor+karlin&start=170&sa=N"

"http://www.google.com/search?hl=en&client=firefox-a&hs=YlE&rls=org.mozilla%3Aen-US%3Aofficial&q=dinov+ucla+stats&aq=f&aqi=g1&aql=&oq="

"http://www.google.com/search?hl=en&lr=&client=safari&rls=en&q=the+reading+on+a+voltage+meter+connected+to+a+test+circuit+is+uniformly+distributed+over+the+interval&aq=f&aqi=&aql=&oq="

"http://www.google.co.jp/search?hl=ja&q=Ryacas%E3%80%80r-commander+install&btnG=%E6%A4%9C%E7%B4%A2&lr=&aq=f&oq="


A puzzle

• In working through these notes, you might have noticed that I’ve been protecting you a little from some nastiness -- Suppose, for example, we decided to use the cut command to extract the 10th field in each line of the logfile, tabulate the values, and look at just the first 10 lines of output

• The code 200 is by far the most frequent, with second place going to code 304; but by looking at only the first 10 lines I intentionally hid information from you...

count   code
933724  200
127624  304
72112   404
46878   206
22145   302
9682    301
2415    403
239     405
208     401
43      416
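A tabulation like the one in this table can be sketched as a short pipeline. The six log lines below are a made-up miniature (hypothetical IP addresses, paths and user agent) that only mimics the layout of the real file; the commands themselves are the standard cut, sort, uniq, head and awk.

```shell
# Hypothetical six-line log with the same space-separated layout as access_log.*
sample_log='www.stat.ucla.edu 10.0.0.1 - - [01/Mar/2010:19:02:38 -0800] "GET /a.html HTTP/1.1" 200 2369 "-" "agent"
www.stat.ucla.edu 10.0.0.2 - - [01/Mar/2010:19:02:39 -0800] "GET /b.html HTTP/1.1" 200 1100 "-" "agent"
www.stat.ucla.edu 10.0.0.1 - - [01/Mar/2010:19:02:40 -0800] "GET /a.html HTTP/1.1" 304 - "-" "agent"
www.stat.ucla.edu 10.0.0.3 - - [01/Mar/2010:19:02:41 -0800] "GET /c.html HTTP/1.1" 200 512 "-" "agent"
www.stat.ucla.edu 10.0.0.3 - - [01/Mar/2010:19:02:42 -0800] "GET /missing.html HTTP/1.1" 404 - "-" "agent"
www.stat.ucla.edu 10.0.0.2 - - [01/Mar/2010:19:02:43 -0800] "GET /b.html HTTP/1.1" 304 - "-" "agent"'

# cut: keep field 10; sort + uniq -c: count each distinct value;
# sort -rn: largest count first; head -10: keep only the top 10 rows;
# awk: drop the padding that uniq -c puts in front of each count
status_table=$(printf '%s\n' "$sample_log" |
    cut -d" " -f10 | sort | uniq -c | sort -rn | head -10 |
    awk '{print $1, $2}')

printf 'count code\n%s\n' "$status_table"
```

On the real data the same pipeline would presumably read from one of the access_log.* files instead of the sample variable; note that head -10 is exactly the step that hides the tail of the table.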


A puzzle

• Looking at the full output from the command and not just the first 10 lines, you see something a bit different; at the right we present that table

• What else do you notice?

count   code
933724  200
127624  304
72112   404
46878   206
22145   302
9682    301
2415    403
239     405
208     401
43      416
29      400
8       HTTP/1.0"
3       prep.R
3       HTTP/1.1"
2       1103
1       Post-rej/project_II.htm
1       500
1       2008


A puzzle

• About 18 of the hits in this file are logged in such a way that the 10th field (as defined by a space) is not the code we were after

• We’ve made an assumption about the data that isn’t correct, and the result shows up when we have a large enough sample -- to chase down the problem, we extracted just the log line referring to a request for project_II.htm

• So, what happened?

www.stat.ucla.edu 208.115.111.248 - - [25/Feb/2010:07:23:49 -0800] "GET /~wxl/research/proteomic/_Project_Rej vs Post-rej/project_II.htm HTTP/1.1" 404 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, [email protected])"


Backstory

• We specify web pages using a Uniform Resource Locator (URL); for example, the page http://www.stat.ucla.edu/research will be interpreted as a request for /research from the host www.stat.ucla.edu

• The request and response are handled via a communication protocol called HTTP (or the Hypertext Transfer Protocol); you will see two versions of HTTP in our log files (HTTP/1.0 which opens a new “connection” per request, and HTTP/1.1 which can reuse a connection)

• HTTP is the protocol by which clients (your web browser, or in the parlance of your logs, a user agent) make requests to one or more servers (computers hosting web sites; also called HTTP servers) and receive data in return (web pages, media items, and so on)
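To make the link between a URL and the request that ends up in the log concrete, here is a minimal sketch that splits the example URL into host and path using plain shell parameter expansion. It assumes the URL has a scheme and at least one / after the host; the variable names are ours, not part of any tool.

```shell
url='http://www.stat.ucla.edu/research'

rest=${url#*://}     # drop the scheme:      www.stat.ucla.edu/research
host=${rest%%/*}     # up to the first /:    www.stat.ucla.edu
path=/${rest#*/}     # from the first / on:  /research

# First two lines of the request an HTTP/1.1 client would send for this URL
request_line="GET $path HTTP/1.1"
host_header="Host: $host"
printf '%s\n%s\n' "$request_line" "$host_header"
```

The request line is the same quoted string ("GET ... HTTP/1.1") that appears in each entry of the access log.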


Backstory

• In addition to its particular format, a URL can be made up of only certain classes of symbols taken from the US ASCII* character set -- alphanumeric characters, some special characters such as $-_.+!*() and some reserved characters that have their own special purpose

• I don’t want to go into the specifics of naming here, as we will likely take it up in earnest later, except to say that certain characters cannot appear in a URL or should only appear with a special meaning (like / to specify the path)

• When you need to include a special character, it needs to be URL encoded; that is, it is replaced with a numerical equivalent...

• *The American Standard Code for Information Interchange


URL Encoding

• URL encoding replaces characters with their ASCII hexadecimal equivalent, prefaced by a %; recall that the hexadecimal number system represents values in “base” 16 -- that is, it uses the sixteen symbols 0,1,...,9,A,B,C,D,E,F

• Two hexadecimal digits are equivalent to 8 binary digits (8 bits or 1 byte) and represent an integer between 0 and 255 -- we will have more to say about data representation next time

• For now, it’s enough to know that the %20 is the URL encoding of the space “ “; does this bring us any closer to understanding the three lines a few slides back?
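The arithmetic behind an encoding like %20 can be checked from the command line. This is only a sketch: the path below is a hypothetical example, and the sed substitution encodes spaces and nothing else, so it is not a general URL encoder.

```shell
# Hex 20 is 2*16 + 0 = 32, the ASCII code of the space character
space_code=$((0x20))

# Going the other way: printf reports the character code of whatever follows the quote
space_hex=$(printf '%02X' "' ")   # space
tilde_hex=$(printf '%02X' "'~")   # tilde, which shows up as %7E in some of the log lines

# Encode only the spaces in a hypothetical path
raw_path='/my files/report 1.pdf'
encoded_path=$(printf '%s\n' "$raw_path" | sed 's/ /%20/g')

echo "0x20 = $space_code"
echo "space -> %$space_hex, ~ -> %$tilde_hex"
echo "$raw_path -> $encoded_path"
```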


Puzzle solved

• Notice that the first of the five lines has spaces in the URL that was requested -- spaces that were not URL encoded -- and this means that counting spaces is not really the same as identifying items from the log

• Technically, your browser takes care of this encoding (try typing in a URL with a space in it); it’s worth noting that the request came from a crawler, a robot that (presumably) doesn’t have this check in place

% grep Post-rej/project_II.htm access_log.* | head -5

access_log.1267056000:www.stat.ucla.edu 208.115.111.248 - - [25/Feb/2010:07:23:49 -0800] "GET /~wxl/research/proteomic/_Project_Rej vs Post-rej/project_II.htm HTTP/1.1" 404 - "-" "Mozilla/5.0 (compatible; DotBot/1.1;

access_log.1267660800:www.stat.ucla.edu 67.195.110.174 - - [06/Mar/2010:04:29:33 -0800] "GET /%7Ewxl/research/proteomic/_Project_Rej%20vs%20Post-rej/project_II.htm HTTP/1.0" 200 12078 "-" "Mozilla/5.0 (compatible; Yah

access_log.1267660800:www.stat.ucla.edu 206.53.226.235 - - [10/Mar/2010:08:33:20 -0800] "GET /~wxl/research/proteomic/_Project_Rej%20vs%20Post-rej/project_II.htm HTTP/1.1" 200 12078 "

access_log.1267660800:www.stat.ucla.edu 206.53.226.235 - - [10/Mar/2010:08:33:21 -0800] "GET /~wxl/research/proteomic/_Project_Rej%20vs%20Post-rej/lightblue.jpg HTTP/1.1" 200 672 "http://www.stat.ucla.edu/~wxl/researchproteomic/_Project_Rej%20vs%20Post-rej/project_II.htm" "Mozilla/4.0 (

access_log.1267660800:www.stat.ucla.edu 206.53.226.235 - - [10/Mar/2010:08:33:28 -0800] "GET /~wxl/research/proteomic/_Project_Rej%20vs%20Post-rej/other%20plot/CART.pdf HTTP/1.1" 200 214177 "http://www.stat.ucla.edu/~wxl/research/proteomic/_Project_Rej%20vs%20Post-rej/project_II.htm"
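The shift in field positions can be seen directly by running cut on the two kinds of request. In this sketch the request paths, addresses, timestamps and status codes are taken from the log lines above; the referrer and user agent are replaced by short placeholders ("-" and "agent") because the originals are truncated on the slide.

```shell
# The crawler's request, with unencoded spaces in the path
raw_line='www.stat.ucla.edu 208.115.111.248 - - [25/Feb/2010:07:23:49 -0800] "GET /~wxl/research/proteomic/_Project_Rej vs Post-rej/project_II.htm HTTP/1.1" 404 - "-" "agent"'
# The same page requested with the spaces encoded as %20
encoded_line='www.stat.ucla.edu 206.53.226.235 - - [10/Mar/2010:08:33:20 -0800] "GET /~wxl/research/proteomic/_Project_Rej%20vs%20Post-rej/project_II.htm HTTP/1.1" 200 12078 "-" "agent"'

raw_f10=$(printf '%s\n' "$raw_line" | cut -d" " -f10)
raw_f12=$(printf '%s\n' "$raw_line" | cut -d" " -f12)
encoded_f10=$(printf '%s\n' "$encoded_line" | cut -d" " -f10)

echo "raw line,     field 10: $raw_f10"
echo "raw line,     field 12: $raw_f12"
echo "encoded line, field 10: $encoded_f10"

# Keep only the lines whose 10th field is not a three-digit code
# (awk splits on runs of whitespace, which matches cut here because
# the fields are separated by single spaces)
suspects=$(printf '%s\n%s\n' "$raw_line" "$encoded_line" |
    awk '$10 !~ /^[0-9][0-9][0-9]$/')
```

The two extra spaces in the raw path push the status code from field 10 to field 12, and the awk filter is one possible way to pull such lines out of a full log for inspection.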


The lesson

• This explains just one of the handful of errors we encountered in the one file, but the basic problem is the same -- we are relying on spaces to define a pattern for us, but this is not a completely reliable way of identifying the content we’re after

• Instead, we need a more elaborate language for specifying patterns; and today we’ll see an extremely powerful construction known as regular expressions

• (Note: This material is perhaps slightly out of time order -- I decided to present it now to get us out of this puzzle -- and next time we will take a small step back and talk about data formats like CSV, XML, JSON and the relational model)
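As a first taste of what a pattern language buys us, here is a minimal sketch that finds the status code by its context rather than by counting spaces: it takes the three digits that follow the closing quote of the request. The helper name status_of is ours, the user agent is a placeholder, and the pattern assumes the request itself contains no double quote, so it is an illustration rather than a complete log parser.

```shell
# Print the three-digit status code that follows the quoted request, one per line
status_of() {
    sed -n 's/^[^"]*"[^"]*" \([0-9][0-9][0-9]\) .*$/\1/p'
}

# One request with raw spaces in the path, one with the spaces encoded as %20
raw_line='www.stat.ucla.edu 208.115.111.248 - - [25/Feb/2010:07:23:49 -0800] "GET /~wxl/research/proteomic/_Project_Rej vs Post-rej/project_II.htm HTTP/1.1" 404 - "-" "agent"'
encoded_line='www.stat.ucla.edu 206.53.226.235 - - [10/Mar/2010:08:33:20 -0800] "GET /~wxl/research/proteomic/_Project_Rej%20vs%20Post-rej/project_II.htm HTTP/1.1" 200 12078 "-" "agent"'

codes=$(printf '%s\n%s\n' "$raw_line" "$encoded_line" | status_of)
printf '%s\n' "$codes"
```

Both lines now yield a proper code, regardless of how many spaces the requested path contains; piping the result through sort and uniq -c would rebuild the count table.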


Quien es mas macho?

• In online marketing, hits rule the roost; the more raw traffic you attract, the greater your opportunities for making a sale

• Alright, so it’s not as simple as that, but let’s see what we can learn about centers of activity on our website


Quien es mas macho?

• Compute the number of hits to the portions of the site owned by Song-Chun Zhu, Vivian Lew, Jan deLeeuw and Ivo Dinov

• Who received the most hits in August? In March?

• What can you say about the kinds of files that were downloaded?

• What was the most popular portion of each site?


Then...

• Pull back a little and tell me about the site in general and the habits of its visitors; specifically, think about

• When is the site active? When is it quiet?

• Do the visitors stay for very long? Do they download any of our papers or software? What applications do they run?

• On balance, is our traffic “real” or mostly the result of robots or automated processes?


Finally

• Unix tools can be incredibly powerful; they will be much faster than most other options for what they’re good at

• That said, there are plenty of kinds of analysis that are not workable with collections of simple Unix facilities; as you do your homework, keep track of what seems hard and why (in particular

• Our programming resources will keep getting more and more elaborate as the quarter progresses; this is just a first step