lecture 2: the pipeline - ucla statisticscocteau/stat202a/lectures/lecture2.pdf · lecture 2: the...
TRANSCRIPT
Lecture 2: The pipeline
Last time
• We presented some (admittedly rough) context for the material to be covered in this course: From a comparison between Stat202a from 2004 and 2010 we saw somewhat incredible shifts in mechanisms by which data are distributed
• We also considered the term “data science” and compared current attempts at a definition with older (some really old) approaches and the classic cycle of exploratory data analysis
• We ended by examining some early work on bringing data analysis to high school students, curricular approaches that should have something to say about how we train statisticians
Today
• We will look at the Unix operating system, the philosophy underlying its design and some basic tools
• Given the abstract nature of our first lecture, we will emphasize some hands-on tasks -- We will use as our case study the last week of traffic across the department’s website www.stat.ucla.edu
• You will have a chance to kick the tires on these tools and address some simple web site usage statistics
The shell: Why we teach it
• Many of you know the computer as a Desktop and a handful of applications (a browser, Word, Excel, maybe R); to introduce the shell means having a discussion about the structure of the computer, about operating systems, about filesystems, about history
• The shell offers programmatic access to a computer’s constituent parts, allows students to “do” data analysis on directories (folders), on processes, on their network
• Given that there are so many flavors of shell, it is also the first time we can talk about choosing tools (that the choice is yours to make!), about evaluating which shell is best for you, and about the shifting terrain of software development
The shell: Why we teach it
• As a practical matter, shell tools are an indispensable part of my own practice, for data “cleaning,” for preliminary data processing, for exploratory analysis -- By design, they are simple, lightweight and easily combined to perform much more complex computations
• The shell also becomes important as you start to make use of shared course resources (data, software, hardware) -- This year, most of our work will be done on computers outside the Statistics Department, and the shell will be your window onto these other systems
• On the computers in front of you, you can access the shell through a Terminal Window (Applications -> Utilities -> Terminal) -- For those of you with a Window’s machine, you will have to download an emulator like Cygwin...
Operating systems
• Most devices that contain a computer of some kind will have an OS; they tend to emerge when the appliance will have to deal with new applications, complex user-input and possibly changing requirements of its function
• Your Tivo, smartphone and even your automobile all have operating systems
Operating systems
• An operating system is a piece of software (code) that organizes and controls hardware and other software so your computer behaves in a flexible but predictable way
• For home computers, Windows, MacOS and Linux are among the most commonly used operating systems
Your computer
• The Central Processing Unit (CPU) or microprocessor is a complete computational engine, capable of carrying out a number of basic commands (basic arithmetic, store and retrieve information from its memory, and so on)
• The CPU itself has the capacity to store some amount of information; when it needs more space, it moves data to another kind of chip known as Random Access Memory (RAM) -- “random access” as opposed to, say, sequential access to memory locations
• Your computer also has one or more storage devices that can be used to organize and store data; hard disks or drives store data magnetically, while solid state drives again use semiconductor technology
Your computer
• Your operating system, then, manages all of your computer’s resources, providing layers of abstraction for both you, the user, as well as developers who are writing new programs for your computer
• With the emergence of so-called cloud computing, we imagine a variety of computing resources “out there” on the web that we can execute; in this model, computations are performed elsewhere, and your own computer might function more as a “browser” receiving results -- Google’s Chrome operating system is “minimalist” in this sense
http://www.youtube.com/watch?v=0QRO3gKj3qw
http://www.youtube.com/watch?v=0QRO3gKj3qw
http://www.youtube.com/watch?v=0QRO3gKj3qw
http://www.youtube.com/watch?v=0QRO3gKj3qw
But let’s step back a bit...
• The computers you are sitting at are running the Mac OS which is built on a Unix platform -- let’s spend some time talking about Unix (we will have more to say about three fundamental abstractions for computer components -- the memory, the interpreter and the communication link -- next time)
• We’ll start with a little history that might help explain why these tools look the way they do, their underlying design philosophy and how they can help you in the early (or late, even) stages of an analysis
Some history
• In 1964, Bell Labs partnered with MIT and GE to create Multics (for Multiplexed Information and Computing Service) -- here is the vision they had for computing
“Such systems must run continuously and reliably 7 days a week, 24 hours a day in a way similar to telephone or power systems, and must be capable of meeting wide service demands: from multiple man-machine interaction to the sequential processing of absentee-user jobs; from the use of the system with dedicated languages and subsystems to the programming of the system itself”
Some history
• Bell Labs pulled out of the Multics project in 1969, a group of researchers at Bell Labs started work on Unics (Uniplexed information and computing system) because initially it could only support one user; as the system matured, it was renamed Unix, which isn’t an acronym for anything
• Richie simply says that Unix is a “somewhat treacherous pun on Multics”
From “The Art of Unix Programming” by Raymond
The Unix filesystem
• In Multics, we find the first notion of a hierarchical file system -- software for organizing and storing computer files and the data they contain in Unix, files are arranged in a tree structure that allows separate users to have control of their own areas
• In general, a file systems provide a level of abstraction for the user, keeping track of how data are stored on an associated physical storage device (say, a hard drive or a CD)
• Unix began (more or less) as a file system and then an interactive shell emerged to let you examine its contents and perform basic operations
The Unix kernel and its shell
• The Unix kernel is the part of the operating system that provides other programs access to the system’s resources (the computer’s CPU or central processing unit, its memory and various I/O or input/output devices)
• The Unix shell is a command-line interface to the kernel -- keep in mind that Unix was designed by computer scientists for computer scientists and the interface is not optimized for novices
Unix shells
• The Unix shell is a type of program called an interpreter; in this case, think of it as a text-based interface to the kernel
• The shell operates in a simple loop: It accepts a command, interprets it, executes the command and waits for another -- The shell displays a prompt to tell you that it is ready to accept a command
Shells in general
• The term “shell” is often applied to any program that you type commands into (we’ll see similar kinds of interfaces when we talk about Python, R, and even sqlite3 and mongo) -- the idea is that the shell is the outermost interface to the inner workings of the system it surrounds
• It is sometimes useful to take a broader view of the shell concept, for example, considering a browser (like Firefox) a shell for a system that renders web pages -- in such cases, we might usefully distinguish between text-based shells (command line interfaces) and graphics-based shells (graphical user interfaces)
• Now, open up the terminal window on your Mac and let’s explore a little...
* A boring slide, but full of potential!
Unix shells
• The shell is itself a program that the Unix operating system runs for you (a program is referred to as a process when it is running)
• The kernel manages many processes at once, many of which are the result of user commands (others provide services that keep the computer running)
• Some commands are built into the shell, others have been added by users -- Either way, the shell waits until the command is executed
The result of typing in the command top; a printout of all the processes running on your computer
Process ID
Name of the commandHow hard the computer is thinking about it
How much memory isbeing used
% hostinfo # neolab.stat.ucla.edu
Mach kernel version:! Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01 PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386Kernel configured for up to 8 processors.8 processors are physically available.8 processors are logically available.Processor type: i486 (Intel 80486)Processors active: 0 1 2 3 4 5 6 7Primary memory available: 16.00 gigabytesDefault processor set: 159 tasks, 364 threads, 8 processorsLoad average: 10.14, Mach factor: 1.43
% hostinfo # my laptop
Mach kernel version:! Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386Kernel configured for up to 2 processors.2 processors are physically available.2 processors are logically available.Processor type: i486 (Intel 80486)Processors active: 0 1Primary memory available: 4.00 gigabytesDefault processor set: 110 tasks, 442 threads, 2 processorsLoad average: 0.25, Mach factor: 1.74
Running top on neolab.stat.ucla.edu
! How many processes are running?
How much RAM is available?
Running top on taia.stat.ucla.edu (last year)
! How many processes are running?
How much RAM is available?
What do you reckon this computer does?
Operating systems
• Processor management
• Schedules jobs (formally referred to as processes) to be executed by the computer
• Memory and storage management
• Allocate space required for each running process in main memory (RAM) or in some other temporary location if space is tight; and supervise the storage of data onto disk
Operating systems
• Device management
• A program called a driver translates data (files from the filesystem) into signals that devices like printers can understand; an operating system manages the communication between devices and the CPU, for example
• Application interface
• An API (application programming interface) let programmers use functions of the computer and the operating system without having to know how something is done
• User interface
• Finally, the operating system turns and looks at you; the UI is a program that defines how users interact with the computer -- some are graphical (Windows is a GUI) and some are text-based (your Unix shell)
Unix shell(s)
• There are, in fact, many different kinds of Unix shells -- The table on the right lists a few of the most popular
• Why so many?
Bourne shell /bin/sh The oldest and most standardized shell. Widely used for system startup files (scripts run during system startup). Installed in Mac OS X.
Bash (Bourne Again SHell) /bin/bashBash is an improved version of sh. Combines features from csh, sh, and ksh. Very widely used, especially on Linux systems. Installed in Mac OS X.http://www.gnu.org/manual/bash/
C shell /bin/csh Provides scripting features that have a syntax similar to that of the C programming language (originally written by Bill Joy). Installed in Mac OS X.
Korn shell /bin/ksh Developed at AT&T by David Korn in the early 1980s. Ksh is widely used for programming. It is now open-source software, although you must agree to AT&T's license to install it. http://www.kornshell.com
TC Shell /bin/tcsh An improved version of csh. The t in tcsh comes from the TENEX and TOPS-20 operating systems, which provided a command-completion feature that the creator (Ken Greer) of tcsh included in his new shell. Wilfredo Sanchez, formerly lead engineer on Mac OS X for Apple, worked on tcsh in the early 1990s at the Massachusetts Institute of Technology.
Z shell /bin/zsh Created in 1990, zsh combines features from tcsh, bash, and ksh, and adds many of its own. Installed in Mac OS X. http://zsh.sourceforge.net
Why the choices?
• A shell program was originally meant to take commands, interpret them and then execute some operation
• Inevitably, one wants to collect a number of these operations into programs that execute compound tasks; at the same time you want to make interaction on the command line as easy as possible (a history mechanism, editing capabilities and so on)
• The original Bourne shell is ideal for programming; the C-shell and its variants are good for interactive use; the Korn shell is a combination of both
Steve Bourne, creator of sh, in 2005
The
orig
inal
UN
IX s
yste
m s
hell
was
a s
impl
e pr
ogra
m w
ritte
n by
Ken
Tho
mps
on a
t Bel
l Lab
orat
orie
s, a
s th
e in
terfa
ce to
th
e ne
w U
NIX
ope
ratin
g sy
stem
. It a
llow
ed th
e us
er to
invo
ke
sing
le c
omm
ands
, or t
o co
nnec
t com
man
ds to
geth
er b
y ha
ving
the
outp
ut o
f one
com
man
d pa
ss th
roug
h a
spec
ial
file
calle
d a
pipe
and
bec
ome
inpu
t for
the
next
com
man
d.
The
Thom
pson
she
ll w
as d
esig
ned
as a
com
man
d in
terp
rete
r, no
t a p
rogr
amm
ing
lang
uage
. Whi
le o
ne c
ould
pu
t a s
eque
nce
of c
omm
ands
in a
file
and
run
them
, i.e
., cr
eate
a s
hell
scrip
t, th
ere
was
no
supp
ort f
or tr
aditi
onal
la
ngua
ge fa
cilit
ies
such
as
flow
con
trol,
varia
bles
, and
fu
nctio
ns. W
hen
the
need
for s
ome
flow
con
trol s
urfa
ced,
the
com
man
ds /b
in/if
and
/bin
/got
o w
ere
crea
ted
as s
epar
ate
com
man
ds. T
he /b
in/if
com
man
d ev
alua
ted
its fi
rst a
rgum
ent
and,
if tr
ue, e
xecu
ted
the
rem
aind
er o
f the
line
. The
/bin
/got
o co
mm
and
read
the
scrip
t fro
m it
s st
anda
rd in
put,
look
ed fo
r th
e gi
ven
labe
l, an
d se
t the
see
k po
sitio
n at
that
loca
tion.
W
hen
the
shel
l ret
urne
d fro
m in
voki
ng /b
in/g
oto,
it re
ad th
e ne
xt li
ne fr
om s
tand
ard
inpu
t fro
m th
e lo
catio
n se
t by
/bin
/go
to.
Unl
ike
mos
t ear
lier s
yste
ms,
the
Thom
pson
she
ll co
mm
and
lang
uage
was
a u
ser-
leve
l pro
gram
that
did
not
hav
e an
y sp
ecia
l priv
ilege
s. T
his
mea
nt th
at n
ew s
hells
cou
ld b
e cr
eate
d by
any
use
r, w
hich
led
to a
suc
cess
ion
of im
prov
ed
shel
ls. I
n th
e m
id-1
970s
, Joh
n M
ashe
y at
Bel
l Lab
orat
orie
s ex
tend
ed th
e Th
omps
on s
hell
by a
ddin
g co
mm
ands
so
that
it
coul
d be
use
d as
a p
rimiti
ve p
rogr
amm
ing
lang
uage
. He
mad
e co
mm
ands
suc
h as
if a
nd g
oto
built
-ins
for i
mpr
oved
pe
rform
ance
, and
als
o ad
ded
shel
l var
iabl
es.
At t
he s
ame
time,
Ste
ve B
ourn
e at
Bel
l Lab
orat
orie
s w
rote
a
vers
ion
of th
e sh
ell w
hich
incl
uded
pro
gram
min
g la
ngua
ge
tech
niqu
es. A
rich
set
of s
truct
ured
flow
con
trol p
rimiti
ves
was
par
t of t
he la
ngua
ge; t
he s
hell
proc
esse
d co
mm
ands
by
build
ing
a pa
rse
tree
and
then
eva
luat
ing
the
tree.
Bec
ause
of
the
rich
flow
con
trol p
rimiti
ves,
ther
e w
as n
o ne
ed fo
r a
goto
com
man
d. B
ourn
e in
trodu
ced
the
"her
e-do
cum
ent"
whe
reby
the
cont
ents
of a
file
are
inse
rted
dire
ctly
into
the
scrip
t. O
ne o
f the
ofte
n ov
erlo
oked
con
tribu
tions
of t
he
Bou
rne
shel
l is
that
it h
elpe
d to
elim
inat
e th
e di
stin
ctio
n be
twee
n pr
ogra
ms
and
shel
l scr
ipts
. Ear
lier v
ersi
ons
of th
e sh
ell r
ead
inpu
t fro
m s
tand
ard
inpu
t, m
akin
g it
impo
ssib
le to
us
e sh
ell s
crip
ts a
s pa
rt of
a p
ipel
ine.
By
the
late
197
0s, e
ach
of th
ese
shel
ls h
ad s
izab
le
follo
win
gs w
ithin
Bel
l Lab
orat
orie
s. T
he tw
o sh
ells
wer
e no
t co
mpa
tible
, lea
ding
to a
div
isio
n as
to w
hich
sho
uld
beco
me
the
stan
dard
she
ll. S
teve
Bou
rne
and
John
Mas
hey
argu
ed
thei
r res
pect
ive
case
s at
thre
e su
cces
sive
UN
IX u
ser g
roup
m
eetin
gs. B
etw
een
mee
tings
, eac
h en
hanc
ed th
eir s
hell
to
have
the
func
tiona
lity
avai
labl
e in
the
othe
r. A
com
mitt
ee w
as
set u
p to
cho
ose
a st
anda
rd s
hell.
It c
hose
the
Bou
rne
shel
l as
the
stan
dard
.
At t
he ti
me
of th
ese
so-c
alle
d ``
shel
l war
s'',
I wor
ked
on a
pr
ojec
t at B
ell L
abor
ator
ies
that
nee
ded
a fo
rm e
ntry
sys
tem
. W
e de
cide
d to
bui
ld a
form
inte
rpre
ter,
rath
er th
an w
ritin
g a
sepa
rate
pro
gram
for e
ach
form
. Ins
tead
of i
nven
ting
a ne
w
scrip
t lan
guag
e, w
e bu
ilt a
form
ent
ry s
yste
m b
y m
odify
ing
the
Bou
rne
shel
l, ad
ding
bui
lt-in
com
man
ds a
s ne
cess
ary.
Th
e ap
plic
atio
n w
as c
oded
as
shel
l scr
ipts
. We
adde
d a
built
-in
to re
ad fo
rm te
mpl
ate
desc
riptio
n fil
es a
nd c
reat
e sh
ell
varia
bles
, and
a b
uilt-
in to
out
put s
hell
varia
bles
thro
ugh
a fo
rm m
ask.
We
also
add
ed a
bui
lt-in
nam
ed le
t to
do
arith
met
ic u
sing
a s
mal
l sub
set o
f the
C la
ngua
ge e
xpre
ssio
n sy
ntax
. An
arra
y fa
cilit
y w
as a
dded
to h
andl
e co
lum
ns o
f dat
a on
the
scre
en. S
hell
func
tions
wer
e ad
ded
to m
ake
it ea
sier
to
writ
e m
odul
ar c
ode,
sin
ce o
ur s
hell
scrip
ts te
nded
to b
e la
rger
than
mos
t she
ll sc
ripts
at t
hat t
ime.
Sin
ce th
e B
ourn
e sh
ell w
as w
ritte
n in
an
Alg
ol-li
ke v
aria
nt o
f C, w
e co
nver
ted
our v
ersi
on o
f it t
o a
mor
e st
anda
rd K
&R
ver
sion
of C
. We
rem
oved
the
rest
rictio
n th
at d
isal
low
ed I/
O re
dire
ctio
n of
bu
ilt-in
com
man
ds, a
nd a
dded
ech
o, p
wd,
and
test
bui
lt-in
co
mm
ands
for i
mpr
oved
per
form
ance
. Fin
ally,
we
adde
d a
capa
bilit
y to
run
a co
mm
and
as a
cop
roce
ss s
o th
at th
e co
mm
and
that
pro
cess
ed th
e us
er-e
nter
ed d
ata
and
acce
ssed
the
data
base
cou
ld b
e w
ritte
n as
a s
epar
ate
proc
ess.
At t
he s
ame
time,
at t
he U
nive
rsity
of C
alifo
rnia
at B
erke
ley,
B
ill J
oy p
ut to
geth
er a
new
she
ll ca
lled
the
C s
hell.
Lik
e th
e M
ashe
y sh
ell,
it w
as im
plem
ente
d as
a c
omm
and
inte
rpre
ter,
not a
pro
gram
min
g la
ngua
ge. W
hile
the
C s
hell
cont
aine
d flo
w c
ontro
l con
stru
cts,
she
ll va
riabl
es, a
nd a
n ar
ithm
etic
fa
cilit
y, it
s pr
imar
y co
ntrib
utio
n w
as a
bet
ter c
omm
and
inte
rface
. It i
ntro
duce
d th
e id
ea o
f a h
isto
ry li
st a
nd a
n ed
iting
fa
cilit
y, s
o th
at u
sers
did
n't h
ave
to re
type
com
man
ds th
at
they
had
ent
ered
inco
rrec
tly.
I cre
ated
the
first
ver
sion
of k
sh s
oon
afte
r I m
oved
to a
re
sear
ch p
ositi
on a
t Bel
l Lab
orat
orie
s. S
tarti
ng w
ith th
e fo
rm
scrip
ting
lang
uage
, I re
mov
ed s
ome
of th
e fo
rm-s
peci
fic
code
, and
add
ed u
sefu
l fea
ture
s fro
m th
e C
she
ll su
ch a
s hi
stor
y, a
liase
s, a
nd jo
b co
ntro
l.
In 1
982,
the
UN
IX S
yste
m V
she
ll w
as c
onve
rted
to K
&R
C,
echo
and
pw
d w
ere
mad
e bu
ilt-in
com
man
ds, a
nd th
e ab
ility
to
def
ine
and
use
shel
l fun
ctio
ns w
as a
dded
. Unf
ortu
nate
ly,
the
Sys
tem
V s
ynta
x fo
r fun
ctio
n de
finiti
ons
was
diff
eren
t fro
m th
at o
f ksh
. In
orde
r to
mai
ntai
n co
mpa
tibili
ty w
ith th
e S
yste
m V
she
ll an
d pr
eser
ve b
ackw
ard
com
patib
ility
, I
mod
ified
ksh
to a
ccep
t eith
er s
ynta
x.
The
popu
lar i
nlin
e ed
iting
feat
ures
(vi a
nd e
mac
s m
ode)
of
ksh
wer
e cr
eate
d by
sof
twar
e de
velo
pers
at B
ell
Labo
rato
ries;
the
vi li
ne e
ditin
g m
ode
by P
at S
ulliv
an, a
nd
the
emac
s lin
e ed
iting
mod
e by
Mik
e Ve
ach.
Eac
h ha
d in
depe
nden
tly m
odifi
ed th
e B
ourn
e sh
ell t
o ad
d th
ese
feat
ures
, and
bot
h w
ere
in o
rgan
izat
ions
that
wan
ted
to u
se
ksh
only
if k
sh h
ad th
eir r
espe
ctiv
e in
line
edito
r. O
rigin
ally
the
idea
of a
ddin
g co
mm
and
line
editi
ng to
ksh
was
reje
cted
in
the
hope
that
line
edi
ting
wou
ld m
ove
into
the
term
inal
driv
er.
How
ever
, whe
n it
beca
me
clea
r tha
t thi
s w
as n
ot li
kely
to
happ
en s
oon,
bot
h lin
e ed
iting
mod
es w
ere
inte
grat
ed in
to
ksh
and
mad
e op
tiona
l so
that
they
cou
ld b
e di
sabl
ed o
n sy
stem
s th
at p
rovi
ded
editi
ng a
s pa
rt of
the
term
inal
in
terfa
ce.
As
use
of k
sh g
rew
, the
nee
d fo
r mor
e fu
nctio
nalit
y be
cam
e ap
pare
nt. L
ike
the
orig
inal
she
ll, k
sh w
as fi
rst u
sed
prim
arily
fo
r set
ting
up p
roce
sses
and
han
dlin
g I/O
redi
rect
ion.
New
er
uses
requ
ired
mor
e st
ring
hand
ling
capa
bilit
ies
to re
duce
the
num
ber o
f pro
cess
cre
atio
ns. T
he 1
988
vers
ion
of k
sh, t
he
one
mos
t wid
ely
dist
ribut
ed a
t the
tim
e th
is is
writ
ten,
ex
tend
ed th
e pa
ttern
mat
chin
g ca
pabi
lity
of k
sh to
be
com
para
ble
to th
at o
f the
regu
lar e
xpre
ssio
n m
atch
ing
foun
d in
sed
and
gre
p.
In s
pite
of i
ts w
ide
avai
labi
lity,
ksh
sou
rce
is n
ot in
the
publ
ic
dom
ain.
Thi
s ha
s le
d to
the
crea
tion
of b
ash,
the
``B
ourn
e ag
ain
shel
l'', b
y th
e Fr
ee S
oftw
are
Foun
datio
n; a
nd p
dksh
, a
publ
ic d
omai
n ve
rsio
n of
ksh
. Unf
ortu
nate
ly, n
eith
er is
co
mpa
tible
with
ksh
.
In 1
992,
the
IEE
E P
OS
IX 1
003.
2 an
d IS
O/IE
C 9
945-
2 sh
ell
and
utili
ties
stan
dard
s w
ere
ratif
ied.
The
se s
tand
ards
de
scrib
e a
shel
l lan
guag
e th
at w
as b
ased
on
the
UN
IX
Sys
tem
V s
hell
and
the
1988
ver
sion
of k
sh. T
he 1
993
vers
ion
of k
sh is
a v
ersi
on o
f ksh
whi
ch is
a s
uper
set o
f the
P
OS
IX a
nd IS
O/IE
C s
hell
stan
dard
s. W
ith fe
w e
xcep
tions
, it
is b
ackw
ard
com
patib
le w
ith th
e 19
88 v
ersi
on o
f ksh
.
The
awk
com
man
d w
as d
evel
oped
in th
e la
te 1
970s
by
Al
Aho
, Bria
n K
erni
ghan
, and
Pet
er W
einb
erge
r of B
ell
Labo
rato
ries
as a
repo
rt ge
nera
tion
lang
uage
. A s
econ
d-ge
nera
tion
awk
deve
lope
d in
the
early
198
0s w
as a
mor
e ge
nera
l-pur
pose
scr
iptin
g la
ngua
ge, b
ut la
cked
som
e sh
ell
feat
ures
. It b
ecam
e ve
ry c
omm
on to
com
bine
the
shel
l and
aw
k to
writ
e sc
ript a
pplic
atio
ns. F
or m
any
appl
icat
ions
, thi
s ha
d th
e di
sadv
anta
ge o
f bei
ng s
low
bec
ause
of t
he ti
me
cons
umed
in e
ach
invo
catio
n of
aw
k. T
he p
erl l
angu
age,
de
velo
ped
by L
arry
Wal
l in
the
mid
-198
0s, i
s an
atte
mpt
to
com
bine
the
capa
bilit
ies
of th
e sh
ell a
nd a
wk
into
a s
ingl
e la
ngua
ge. B
ecau
se p
erl i
s fre
ely
avai
labl
e an
d pe
rform
s be
tter t
han
com
bine
d sh
ell a
nd a
wk,
per
l has
a la
rge
user
co
mm
unity
, prim
arily
at u
nive
rsiti
es.
From "ksh - An Extensible High Level Language" by David G. Korn
And while we are at it...
• Unix itself comes in different flavors; the 1980s saw an incredible proliferation of Unix versions, somewhere around 100 (System V, AIX, Berkeley BSD, SunOS, Linux, ...)
• Vendors provided (diverging) version of Unix, optimized for their own computer architectures and supporting different features
• Despite the diversity, it was still easier to “port” applications between versions of Unix than it was between different proprietary OS
• In the 90s, some consolidation took place; today Linux dominates the low-end market, while Solaris, AIX and HP-UX are leaders in the mid-to-high end
A few common commands
First, commands to explore your file system; walk through directories and list files
• pwd, ls, cd
• mkdir, rmdir
• cp, mv, rm
• (These commands seem somewhat terse -- Why not “remove” or “copy” -- and some have speculated that it was because of the effort required to enter data on the early ASR-33 teletype keyboards)
Another view of the filesystem; here, your Mac will display directories as folders and you navigate by clicking rather than typing commands
An example
• [obstacle 12 lectures] ls -al• total 8• drwxr-xr-x 8 cocteau staff 272 Sep 30 08:11 .• drwxr-xr-x 70 cocteau staff 2380 Sep 30 08:08 ..• drwxr-xr-x 4 cocteau staff 136 Sep 30 08:05 lecture1• drwxr-xr-x@ 282 cocteau staff 9588 Sep 30 08:13 lecture1.key• -rw-r--r-- 1 cocteau staff 139 Sep 21 15:34 lecture1.txt• drwxr-xr-x 11 cocteau staff 374 Sep 30 10:24 lecture2• drwxr-xr-x@ 80 cocteau staff 2720 Sep 30 10:22 lecture2.key
Your user name
The group you belong to that owns the file
The file’s size in bytes
When the file was last modified
The file’s name
Shorthand for your present working directory (where you’re at)
Shorthand for the directory one level above
Number of 512-byte blocks used by the files in the directory
Permissions; whatyou (and others) can do to the file
Type of file: - normal file d directory s socket file l link file
http://www.thegeekstuff.com/2009/07/linux-ls-command-examples/
Options
• Notice that Unix commands can also be modified by adding one or more options; in the case of ls, we added the option -l for the “long” form of the output and “-a” for all directories that begin with a “.” -- Another useful option is “-h” for humanly readable output
• The command man will provide you with help on Unix commands; you simply supply the name of the command you are interested in as an operand -- try typing “man ls”
Permission bits
• Unix can support many users (remember, that was part of its initial design goals) on a single system and each user can belong to one or more groups
• Every file in a Unix filesystem is owned by some user and one of that user’s groups; each file also has a set of permissions specifying which users can
• r: read • w: write (modify) or• x: execute
• the file; these are specified with three “bits” and we need three sets of bits to define what the user can do, what their group (that owns the file) can do and what others can do
Permission bits
• [obstacle 12 lectures] ls -al• total 8• drwxr-xr-x 8 cocteau staff 272 Sep 30 08:11 .• drwxr-xr-x 70 cocteau staff 2380 Sep 30 08:08 ..• drwxr-xr-x 4 cocteau staff 136 Sep 30 08:05 lecture1• drwxr-xr-x@ 282 cocteau staff 9588 Sep 30 08:13 lecture1.key• -rw-r--r-- 1 cocteau staff 139 Sep 21 15:34 lecture1.txt• drwxr-xr-x 11 cocteau staff 374 Sep 30 10:24 lecture2• drwxr-xr-x@ 80 cocteau staff 2720 Sep 30 10:22 lecture2.key
What you can do to the file (u)
What the owning group can do (g)
What others can do (o)
The type of filePermission bits
• drwxr-xr-x 8 cocteau staff • drwxr-xr-x 70 cocteau staff • drwxr-xr-x 4 cocteau staff • drwxr-xr-x@ 282 cocteau staff • -rw-r--r-- 1 cocteau staff • drwxr-xr-x 11 cocteau staff • drwxr-xr-x@ 80 cocteau staff
In general...
• The command chmod changes the permissions on a file; here are some examples
• % chmod g+x lecture1.txt
• % chmod ug-x lecture1.txt
• % chmod a+w lecture1.txt
• % chmod go-w lecture1.txt
• You can also use binary to express the permissions; so if we think of bits ordered as rwx, then
• and we can specify permissions with these values
• % chmod 754 dates.sh
r-x 1 ! 22 + 0 ! 21 + 1 ! 20 = 5
rwx 1 ! 22 + 1 ! 21 + 1 ! 20 = 7
r-- 1 ! 22 + 0 ! 21 + 0 ! 20 = 4
Kinds of files
• What you’ll notice right away is that there are different types of files having different permissions -- we will talk about different kinds of data storage and representation next time
• Unix filesystem conventions places (shared, commonly used) executable files in places like /usr/bin or /usr/local/bin (navigate to one of these directories and have a look!)
• Different files are opened by different kinds of programs; in OSX, there is a beautiful command called open that decides which program to use
Kinds of files
• Filenames which contain special characters like * and ~ are candidates for filename substitution
• ~ refers to your home directory and * is a wildcard for any number of characters (it’s referred to globbing -- you can try the command glob to see what happens, something we’ll do shortly)
• Other special characters like {, [ and ? can also be expanded, but we’ll get to them when we learn a bit more about regular expressions
Unix shells
• There are many flavors of Unix Shells; that is, there are many kinds of programs that operate as shells
sh, csh, tcsh, bash, ksh
• They are all programs and can be found on the file system
which sh
An example: HTTP access logs
• www.stat.ucla.edu
• The department runs an Apache Web server running on taia.stat.ucla.edu
• Each request, each click by a user out on the Internet browsing our site, is logged -- there are standards for these files, but in general, they can be a bit hairy to “parse”
Log files
• A Unix system is constantly maintaining logs about its performance; various applications log their activities as well -- some are visible by everyone, some by users with “root” access
• Scraping through logs for anomalies, for system profiling, for system management, is a fairly common exercise; the logs we’ll look at have taken on extra importance given all the transactions that take place across the web
Into the cloud...
• This morning, I spun up an EC2 instance from Amazon Web Services -- EC2 stands for Elastic Compute Cloud, assigned it to a domain (stat202a.org) and copied our relatively small data set over to the new machine
• The machine is running Ubuntu and you’ll have access to all the Unix tools we’ve been discussion so far (and more!)
From machine to machine
• In courses you’ve had previously, you probably ran R or some other program on the computer on your desk -- The Statistics Department has a number of computers available to you that are more powerful (memory, speed) than those on your desk and with EC2, we can spin up any number of machines to do our bidding
• You can navigate between machines by invoking the commands ssh (secure shell) and move files around using scp (secure copy, example shortly)...
% ssh [email protected]
Moving data
• You can also move data to your own machine (assuming it’s not too big -- at least one of the data sets we’ll consider this term will be too big to budge!)
• If the data, for example, are stored on homework.stat202a.org, you can bring them to your desktop with the command
• % scp [email protected]:/data/weblogs/access_log.1267056000 .
The secure copy command;friends with ssh
Copy the file access_log.1267056000 in the directory /data/weblogs from the computer homework.stat202a.org
Copy it to a file of the same name on your local machine in your current working directory
HTTP access logs
• A bit of digging...
Commands
• pwd, ls, cd
• more/less, tail, wc
• cut, sort, uniq
% cd /data/weblogs% ls
% head access_log.1267056000
seminars.stat.ucla.edu 128.97.86.247 - - [24/Feb/2010:23:00:31 -0800] "GET /upcoming_through/2010-03-02.xml HTTP/1.0" 200 3426 "-" "-"ces.stat.ucla.edu 128.97.55.242 - - [24/Feb/2010:23:02:12 -0800] "GET /rss/ HTTP/1.1" 404 1515 "-" "Apple-PubSub/65.11"seminars.stat.ucla.edu 128.97.86.247 - - [24/Feb/2010:23:03:48 -0800] "GET /upcoming_through/2010-03-02.xml HTTP/1.0" 200 3426 "-" "-"seminars.stat.ucla.edu 128.97.86.247 - - [24/Feb/2010:23:03:53 -0800] "GET /upcoming_through/2010-03-02.xml HTTP/1.0" 200 3426 "-" "-"seminars.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:37 -0800] "GET /2007-04-10/3:00pm/1260-franz HTTP/1.0" 200 3459 "-" "UCLA%20Google%20Seacsrcb.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:38 -0800] "GET /robots.txt HTTP/1.0" 200 99 "-" "UCLA%20Google%20Search%20Appliance%20%2301csrcb.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:38 -0800] "GET /projects/ HTTP/1.0" 200 1179 "-" "UCLA%20Google%20Search%20Appliance%20%230seminars.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:39 -0800] "GET /archive?page=24 HTTP/1.0" 200 4329 "-" "UCLA%20Google%20Search%20Applianseminars.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:39 -0800] "GET /2007-02-23/3:00pm/5200-boelter HTTP/1.0" 200 3560 "-" "UCLA%20Google%20Sseminars.stat.ucla.edu 164.67.132.221 - - [24/Feb/2010:23:04:39 -0800] "GET /1999-03-01/3:00pm/6627-ms HTTP/1.0" 200 3005 "-" "UCLA%20Google%20Search
% wc access_log.1267056000
1215118 24435535 306343201 access_log.1267056000
% tail access_log.1267056000
forums.stat.ucla.edu 164.67.132.220 - - [03/Mar/2010:23:59:54 -0800] "GET /feed.php?16,102,type=rss HTTP/1.0" 200 1570 "-" "UCLA%20Google%20Search%20Appliance%20%2302%20%28contact%3A%20jhuang%40ais.ucla.edu%29 (Enterprise; S5-GAAQFZSZNSNAS; [email protected])"cran.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:55 -0800] "GET /robots.txt HTTP/1.0" 200 98261 "-" "UCLA%20Google%20Search%20Appliance%20%2301%20%28contact%3A%20jhuang%40ais.ucla.edu%29 (Enterprise; S5-J4JEBZS9PUJJA; [email protected])"cran.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:55 -0800] "GET /web/checks/check_results_zic.html HTTP/1.0" 200 4965 "-" "UCLA%20Google%20Search%20Appliance%20%2301%20%28contact%3A%20jhuang%40ais.ucla.edu%29 (Enterprise; S5-J4JEBZS9PUJJA; [email protected])"www.socr.ucla.edu 207.46.199.191 - - [03/Mar/2010:23:59:56 -0800] "GET /robots.txt HTTP/1.1" 200 58 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"www.stat.ucla.edu 67.195.110.174 - - [03/Mar/2010:23:59:57 -0800] "GET /~jkokkin/gray/pb_lrn_bound_1_sharp_2_dc_2_1_nm/253055_counts.mat HTTP/1.0" 200 481 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)"directory.stat.ucla.edu 164.67.132.220 - - [03/Mar/2010:23:59:57 -0800] "GET /info.php?directory=443 HTTP/1.0" 200 2905 "-" "UCLA%20Google%20Search%20Appliance%20%2302%20%28contact%3A%20jhuang%40ais.ucla.edu%29 (Enterprise; S5-GAAQFZSZNSNAS; [email protected])"cran.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:58 -0800] "GET /robots.txt HTTP/1.0" 200 98261 "-" "UCLA%20Google%20Search%20Appliance%20%2301%20%28contact%3A%20jhuang%40ais.ucla.edu%29 (Enterprise; S5-J4JEBZS9PUJJA; [email protected])"cran.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:58 -0800] "GET /web/checks/check_results_xlsReadWrite.html HTTP/1.0" 200 5185 "-" "UCLA%20Google%20Search%20Appliance%20%2301%20%28contact%3A%20jhuang%40ais.ucla.edu%29 (Enterprise; S5-J4JEBZS9PUJJA; [email protected])"www.stat.ucla.edu 67.180.31.126 - - [03/Mar/2010:23:59:58 -0800] "GET /projects/datasets/risk_perception.html HTTP/1.1" 200 5818 "http://www.stat.ucla.edu/projects/datasets/social_sciences.php" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6"www.stat.ucla.edu 164.67.132.219 - - [03/Mar/2010:23:59:59 -0800] "GET /cgi-bin/man-cgi?ypmatch HTTP/1.0" 404 - "-" "UCLA%20Google%20Search%20Appliance%20%2301%20%28contact%3A%20jhuang%40ais.ucla.edu%29 (Enterprise; S5-J4JEBZS9PUJJA; [email protected])"
Combined log format
virtual host
IP address
Identity
Userid
date
Request
Status
Bytes
Referrer
Agent
Slicing out a field...
For speed, we’ll work with just one day’s worth of logs; look at the file access_log.1267056000 -- what time period does this cover?
Many of the built-in Unix tools are designed to digest log files of this kind; in a way, this is a very old dance, one that has been enacted for decades...
Examples
• cut -d" " -f1,11 access_log.1267056000
• cut -d" " -f11 access_log.1267056000
In general...
cut -d" " -f1,5,9 select the first, fifth and ninthcut -d" " -f1-5 select the first through the fifthcut -d" " -f1 select just the first
Unix pipes
• Rather than have the output scroll out on the screen, it would be better (useful, even) to “catch” the output with a more command
• Many programs take some kind of input and generate some kind of output; Unix tools often take input from the user and print the output to the screen
• “Redirection” of data to and from files or programs is controlled by a construction known as a pipe
% cut -d" " -f1 access_log.1267056000 | more
seminars.stat.ucla.educes.stat.ucla.eduseminars.stat.ucla.eduseminars.stat.ucla.eduseminars.stat.ucla.educsrcb.stat.ucla.educsrcb.stat.ucla.eduseminars.stat.ucla.eduseminars.stat.ucla.eduseminars.stat.ucla.eduseminars.stat.ucla.educsrcb.stat.ucla.eduseminars.stat.ucla.eduseminars.stat.ucla.eduseminars.stat.ucla.eduseminars.stat.ucla.eduseminars.stat.ucla.eduseminars.stat.ucla.eduseminars.stat.ucla.eduseminars.stat.ucla.eduseminars.stat.ucla.eduseminars.stat.ucla.eduseminars.stat.ucla.edu...
Sending output to a file with “>”
With this form of redirection, we take a stream of processed data and store it in a file -- For example
• % cut -d" " -f2 access_log.1267056000 > ~/ips.txt
• % ls ~
Taking input from a file with “<”
With this form of redirection, we create an input stream from a file -- For example
wc < ~/ips.txt
Tallying
• As its name suggests, the command sort will order the rows in our file; by default it uses alphabetical order -- the option -n let you sort numerically instead
• The command uniq will remove repeated adjacent lines; so if your file is sorted, it will return just the unique rows
• sort ~/ips.txt > ~/ips_sorted.txt
• head ~/ips_sorted.txt
• uniq ~/ips_sorted.txt | more
The pipeline
We seem to be proliferating files -- one version of the IPs that is raw, one that is sorted, maybe one that has just the unique entries -- we can use the pipe construction to combine many of these steps into one
As the name might imply, you can connect pipes and have data stream from process to process
Example
• cut -d" " -f2 access_log.1267056000 | sort | uniq | wc
• cut -d" " -f2 access_log.1267056000 | sort | uniq -c | sort -rn
• cut -d" " -f10 access_log.1267056000 | sort | uniq -c
% cut -d" " -f2 access_log.1267056000 | sort | uniq | wc
28564 28564 404062 # what does this mean?
% cut -d" " -f10 access_log.1267056000 | sort | uniq -c | sort -rn | head -10
933724 200 127624 304 72112 404 46878 206 22145 302 9682 301 2415 403 239 405 208 401 43 416
What are these numbers?
• A fast Google search gives us a list of possible errors
• Note that Error 200 actually means a success
• Error 206 means that only part of the file was delivered; the user cancelled the request before it could be delivered
• Error 304 is “not modified”; sometimes clients perform conditional GET requests
HTTP Error 101Switching Protocols. Again, not really an "error", this HTTP Status Code means everything is working fine.
HTTP Error 200Success. This HTTP Status Code means everything is working fine. However, if you receive this message on screen, obviously something is not right... Please contact the server's administrator if this problem persists. Typically, this status code (as well as most other 200 Range codes) will only be written to your server logs.
HTTP Error 201Created. A new resource has been created successfully on the server.
HTTP Error 202Accepted. Request accepted but not completed yet, it will continue asynchronously.
HTTP Error 203Non-Authoritative Information. Request probably completed successfully but can't tell from original server.
HTTP Error 204No Content. The requested completed successfully but the resource requested is empty (has zero length).
HTTP Error 205Reset Content. The requested completed successfully but the client should clear down any cached information as it may now be invalid.
HTTP Error 206Partial Content. The request was canceled before it could be fulfilled. Typically the user gave up waiting for data and went to another page. Some download accelerator programs produce this error as they submit multiple requests to download a file at the same time.
HTTP Error 300Multiple Choices. The request is ambiguous and needs clarification as to which resource was requested.
HTTP Error 301Moved Permanently. The resource has permanently moved elsewhere, the response indicates where it has gone to.
HTTP Error 302Moved Temporarily. The resource has temporarily moved elsewhere, the response indicates where it is at present.
HTTP Error 303See Other/Redirect. A preferred alternative source should be used at present.
% cut -d" " -f2 access_log.1267056000 | sort | uniq -c | sort -rn | more
130258 128.97.86.247 129354 164.67.132.220 123756 164.67.132.219 83956 164.67.132.221 83205 164.67.132.222 23554 67.195.110.174 23058 128.97.55.186 21205 76.89.146.27 15941 128.97.55.188 12967 128.97.55.185 12365 67.195.114.250 8029 137.110.114.8 7748 208.115.111.248 5256 216.9.17.231 4894 128.97.55.3 4554 66.249.67.236 4057 69.235.86.81 3857 66.249.67.183 3691 67.195.110.173 3622 67.195.113.236 3181 77.88.29.246 3144 74.6.149.154 2825 98.222.49.118...
The pipeline
• In 1972, pipes appear in Unix, and with them a philosophy, albeit after some struggle for the syntax; should it be
• more(sort(cut)))
• [Remember this; S/R has this kind of functional syntax]
• The development of pipes led to the concept of tools -- software programs that would be in a tool box, available when you need them
“And that’s, I think, when we started to think consciously about tools, because then you could compose things together... compose them at the keyboard and get them right every time.”
from an interview with Doug McIlroy
Remember, read the man pages!
As we did with ls, if the command uniq is unfamiliar, you can look up its usage
Example
man uniqman host
% host 128.97.86.247247.86.97.128.in-addr.arpa domain name pointer limen.stat.ucla.edu.
% host 164.67.132.220220.132.67.164.in-addr.arpa domain name pointer gsa2.search.ucla.edu.
% host 67.195.110.174174.110.195.67.in-addr.arpa domain name pointer b3091077.crawl.yahoo.net.
But what are we really after?
• Rather than splitting up the data in access_log.1267056000 by day, we might consider dividing it by IP -- that is extract all the hits from a given IP
• Once we have such a thing, we can use the command wc to tell us about the number of accesses from each user
• We can also start to “fit” user-level models that can be used to predict navigation
Rudimentary pattern matching
grep can be used to skim lines from a file that have (or don’t have) a particular pattern -- patterns are specified via regular expressions, something we will learn more about later
The name of the command comes from an editing operation on Unix: g/re/p
Example
host 98.222.49.118
grep 98.222.49.118 access_log.1267056000
grep /~dinov access_log.1267056000
%grep 98.222.49.118 access_log.1267056000| head -25
www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:37 -0800] "GET /~zyyao/ HTTP/1.1" 200 17692 "http://www.google.com/search?q=zhenyu+yao&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.8) Gecko/20100202 Firefox/3.5.8"
www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:38 -0800] "GET /~zyyao/ HTTP/1.1" 200 17692 "http://www.google.com/search?q=zhenyu+yao&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.8) Gecko/20100202 Firefox/3.5.8"
www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:39 -0800] "GET /~zyyao/common.css HTTP/1.1" 200 3762 "http://www.stat.ucla.edu/~zyyao/" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.8) Gecko/20100202 Firefox/3.5.8"
www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:40 -0800] "GET /favicon.ico HTTP/1.1" 200 318 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.8) Gecko/20100202 Firefox/3.5.8"
www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:40 -0800] "GET /~zyyao/CV_ZhenyuYao.pdf HTTP/1.1" 200 36901 "http://www.google.com/search?q=zhenyu+yao&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.8) Gecko/20100202 Firefox/3.5.8"
www.stat.ucla.edu 98.222.49.118 - - [26/Feb/2010:23:20:42 -0800] "GET /~zyyao/CV_ZhenyuYao.pdf HTTP/1.1" 206 4352 "-" "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.8) Gecko/20100202 Firefox/3.5.8"
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:35 -0800] "GET /~zyyao/projects/ABSS/all_exp.html HTTP/1.1" 200 525 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:35 -0800] "GET /favicon.ico HTTP/1.1" 200 318 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:37 -0800] "GET /%7Ezyyao/projects/ABSS/walk.html HTTP/1.1" 200 22230 "http://www.stat.ucla.edu/~zyyao/projects/ABSS/all_exp.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/000template101.png HTTP/1.1" 200 2369 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1001.png HTTP/1.1" 200 11315 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1003.png HTTP/1.1" 200 11362 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1003sketch.png HTTP/1.1" 200 2224 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1004.png HTTP/1.1" 200 11191 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1004sketch.png HTTP/1.1" 200 2293 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1005.png HTTP/1.1" 200 11214 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1005sketch.png HTTP/1.1" 200 2251 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1006.png HTTP/1.1" 200 11507 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1006sketch.png HTTP/1.1" 200 2179 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:38 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1007.png HTTP/1.1" 200 11226 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1007sketch.png HTTP/1.1" 200 2268 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1008.png HTTP/1.1" 200 11321 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1008sketch.png HTTP/1.1" 200 2214 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1009.png HTTP/1.1" 200 10986 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"
www.stat.ucla.edu 98.222.49.118 - - [01/Mar/2010:19:02:39 -0800] "GET /%7Ezyyao/projects/ABSS/walkpng/101I1009sketch.png HTTP/1.1" 200 2246 "http://www.stat.ucla.edu/%7Ezyyao/projects/ABSS/walk.html" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100106 Ubuntu/9.10 (karmic) Firefox/3.5.7"
% cut -d" " -f12 access_log.1267056000|grep google | head -25
"http://www.google.com/search?hl=en&newwindow=1&q=filled+%22rounded+box%22+image+beamer&aq=f&aqi=&aql=&oq="
"http://www.google.com/imgres?imgurl=http://scc.stat.ucla.edu/page_attachments/0000/0064/map.jpg&imgrefurl=http://scc.stat.ucla.edu/contact-us/map&h=640&w=779&sz=307&tbnid=uHJ18tjP58PzjM:&tbnh=117&tbnw=142&prev=/images%3Fq%3Dboelter%2Bhall%2Bucla&hl=en&usg=__uoy38imK_LrO3Pe3j4z42srd9qY=&ei=BS6GS6nAMIXWsgPA0PzWDQ&sa=X&oi=image_result&resnum=5&ct=image&ved=0CBEQ9QEwBA"
"http://www.google.com.ph/search?client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&channel=s&hl=tl&source=hp&q=HISTORY+OF+STATISTICS&meta=&btnG=Hanapin+sa+Google"
"http://www.google.com/search?sourceid=navclient&ie=UTF-8&rlz=1T4SNNT_enUS355US355&q=SOCR+distributions"
"http://www.google.com/search?sourceid=navclient&ie=UTF-8&rlz=1T4SNNT_enUS355US355&q=SOCR+distributions"
"http://translate.google.gr/translate?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/"
"http://translate.googleusercontent.com/translate_c?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/&rurl=translate.google.gr&usg=ALkJrhjZRaez9AkwycLOBDTWUPwBXEuthw"
"http://translate.googleusercontent.com/translate_c?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/&rurl=translate.google.gr&usg=ALkJrhjZRaez9AkwycLOBDTWUPwBXEuthw"
"http://translate.googleusercontent.com/translate_c?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/index_head.php&rurl=translate.google.gr&usg=ALkJrhiQwFYwCUEiCMaFMRNh7aY_UFOEYA"
"http://www.google.com/search?q=socr&rls=com.microsoft:ko:IE-SearchBox&ie=UTF-8&oe=UTF-8&sourceid=ie7&rlz=1I7HPAB_ko"
"http://translate.googleusercontent.com/translate_c?hl=el&langpair=en%7Cel&u=http://www.stat.ucla.edu/history/index_body.php&rurl=translate.google.gr&usg=ALkJrhgWBIYc3prqMNQUlZTad_UKiabFWg"
"http://www.google.co.in/search?q=Lienhart+ICME+2003+face&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a"
"http://www.google.com.my/search?hl=en&client=firefox-a&channel=s&rls=org.mozilla:en-US:official&q=markov+chain+monte+carlo+mcmc&revid=1701745188&ei=BS-GS5OFI4qOkQXb7JFC&sa=X&oi=revisions_inline&resnum=0&ct=broad-revision&cd=1&ved=0CEQQ1QIoADgU"
"http://www.google.com/search?q=stata+goodness+of+fit+test&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a"
"http://translate.googleusercontent.com/translate_c?hl=zh-CN&sl=en&u=http://www.stat.ucla.edu/~yuille/courses/Stat202C/ga_tutorial.pdf&prev=/search%3Fq%3DA%2Bgenetic%2Balgorithm%2Btutorial%26hl%3Dzh-CN%26newwindow%3D1%26client%3Daff-cs-360se-channel%26hs%3DHyp%26sa%3DN%26channel%3Dbookmark%26source%3Dhp&rurl=translate.google.cn&usg=ALkJrhhyeLuWRdcFd6ege_jTb1a8hMbEWw"
"http://www.google.lt/url?sa=t&source=web&ct=res&cd=1&ved=0CAYQFjAA&url=http%3A%2F%2Fwww.stat.ucla.edu%2F~bk%2F&rct=j&q=brian+kriegler&ei=zC-GS4i7O4OmnQOj5_TWCw&usg=AFQjCNHTadqBAC3mNZuf6W0Oc6ELJOPeAw"
"http://www.google.nl/search?hl=nl&source=hp&q=f+distribution+table&meta=&aq=1&oq=F+distrib"
"http://www.google.com.my/search?q=case+in+suicide&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a"
"http://www.google.com/url?sa=t&source=web&ct=res&cd=9&ved=0CCgQFjAI&url=http%3A%2F%2Fwww.stat.ucla.edu%2F~sczhu%2FProject_pages%2FFace_Aging%2FFaceAging.htm&rct=j&q=aging+process+face&ei=4y-GS9WFI4OIsgOwqLzSDQ&usg=AFQjCNGniQVp4_079POXSewtZ7-Thaul4g"
"http://www.google.com/search?client=firefox-a&rls=org.mozilla%3Aen-US%3Aofficial&channel=s&hl=en&source=hp&q=t-statistic+chart&btnG=Google+Search"
"http://www.google.co.in/search?hl=en&q=Genetic+algorithm+tutorial&meta=&aq=o&oq="
"http://www.google.com/search?hl=en&client=firefox-a&rls=org.mozilla:en-US:official&q=an+introduction+to+stochastic+modeling+taylor+karlin&start=170&sa=N"
"http://www.google.com/search?hl=en&client=firefox-a&hs=YlE&rls=org.mozilla%3Aen-US%3Aofficial&q=dinov+ucla+stats&aq=f&aqi=g1&aql=&oq="
"http://www.google.com/search?hl=en&lr=&client=safari&rls=en&q=the+reading+on+a+voltage+meter+connected+to+a+test+circuit+is+uniformly+distributed+over+the+interval&aq=f&aqi=&aql=&oq="
"http://www.google.co.jp/search?hl=ja&q=Ryacas%E3%80%80r-commander+install&btnG=%E6%A4%9C%E7%B4%A2&lr=&aq=f&oq="
A puzzle
• In working through these notes, you might have noticed that I’ve been protecting you a little from some nastiness -- Suppose, for example, we decided to use the cut command to extract the 10th field in each line of the logfile and look at just the first 10 lines of ouptut
• The code 200 is the most frequent by far the most frequent with second place going to code 304, but by looking at the first 10 lines I intentionally hid information from you...
count code 933724 200 127624 304 72112 404 46878 206 22145 302 9682 301 2415 403 239 405 208 401 43 416
A puzzle
• Looking at the full output from the command and not just the first 10 lines, you see something a bit different; at the right we present that table
• What else do you notice?
count code 933724 200 127624 304 72112 404 46878 206 22145 302 9682 301 2415 403 239 405 208 401 43 416 29 400 8 HTTP/1.0" 3 prep.R 3 HTTP/1.1" 2 1103 1 Post-rej/project_II.htm 1 500 1 2008
A puzzle
• About 18 of the hits in this file are logged in such a way that the 10th field (as defined by a space) is not the code we were after
• We’ve made an assumption about the data that isn’t correct, and the result shows up when we have a large enough sample -- to chase down the problem, we extracted just the log line referring to a request for project_II.htm
• So, what happened?
www.stat.ucla.edu 208.115.111.248 - - [25/Feb/2010:07:23:49 -0800] "GET /~wxl/research/proteomic/_Project_Rej vs Post-rej/project_II.htm HTTP/1.1" 404 - "-" "Mozilla/5.0 (compatible; DotBot/1.1; http://www.dotnetdotcom.org/, [email protected])"
Backstory
• We specify web pages using a Uniform Resource Locator (URL); for example, the page http://www.stat.ucla.edu/research will be interpreted as a request for /research from the host www.stat.ucla.edu
• The request and response are handled via a communication protocol called HTTP (or the Hypertext Transfer Protocol); you will see two versions of HTTP in our log files (HTTP/1.0 which opens a new “connection” per request, and HTTP/1.1 which can reuse a connection)
• HTTP is the protocol by which clients (your web browser, or in the parlance of your logs, a user agent) makes requests from one or more servers (computers hosting web sites; also called an HTTP server) and receives data in return (web pages, media items, and so on)
Backstory
• In addition to its particular format, a URL can be made up of only certain classes of symbols taken from the US ASCII* character set-- alphanumeric characters, some the special characters $-_.+!*() and some reserved characters that have their own special purpose
• I don’t want to go into the specifics of naming here, as we will likely take it up in earnest later, except to say that certain characters cannot appear in a URL or should only appear with a special meaning (like / to specify the path)
• When you need to include a special character, they need to be URL encoded; that is, they are replaced with a numerical equivalent...
• *The American Standard Code for Information Interchange
URL Encoding
• URL encoding replaces characters with their ASCII hexadecimal equivalent, prefaced a %; recall that the hexadecimal number system represents values in “base” 16 -- that is, it uses the sixteen symbols 0,1,...,9,A,B,C,D,E,F
• Two hexadecimal digits are equivalent to 8 binary digits (8 bits or 1 byte) and represent an integer between 0 and 255 -- we will have more to say about data representation next time
• For now, it’s enough to know that the %20 is the URL encoding of the space “ “; does this bring us any closer to understanding the three lines a few slides back?
Puzzle solved
• Notice that the first line of the five has “ “‘s in the URL that was requested -- a “ “ that was not URL encoded -- and this means that counting spaces is not really the same as identifying items from the log
• Technically, your browser takes care of this encoding (try typing in a URL with a space in it); it’s worth noting that the request came from a crawler; a robot that (presumably) doesn’t have this check in place
% grep Post-rej/project_II.htm access_log.* | head -5
access_log.1267056000:www.stat.ucla.edu 208.115.111.248 - - [25/Feb/2010:07:23:49 -0800] "GET /~wxl/research/proteomic/_Project_Rej vs Post-rej/project_II.htm HTTP/1.1" 404 - "-" "Mozilla/5.0 (compatible; DotBot/1.1;
access_log.1267660800:www.stat.ucla.edu 67.195.110.174 - - [06/Mar/2010:04:29:33 -0800] "GET /%7Ewxl/research/proteomic/_Project_Rej%20vs%20Post-rej/project_II.htm HTTP/1.0" 200 12078 "-" "Mozilla/5.0 (compatible; Yah
access_log.1267660800:www.stat.ucla.edu 206.53.226.235 - - [10/Mar/2010:08:33:20 -0800] "GET /~wxl/research/proteomic/_Project_Rej%20vs%20Post-rej/project_II.htm HTTP/1.1" 200 12078 "
access_log.1267660800:www.stat.ucla.edu 206.53.226.235 - - [10/Mar/2010:08:33:21 -0800] "GET /~wxl/research/proteomic/_Project_Rej%20vs%20Post-rej/lightblue.jpg HTTP/1.1" 200 672 "http://www.stat.ucla.edu/~wxl/researchproteomic/_Project_Rej%20vs%20Post-rej/project_II.htm" "Mozilla/4.0 (
access_log.1267660800:www.stat.ucla.edu 206.53.226.235 - - [10/Mar/2010:08:33:28 -0800] "GET /~wxl/research/proteomic/_Project_Rej%20vs%20Post-rej/other%20plot/CART.pdf HTTP/1.1" 200 214177 "http://www.stat.ucla.edu/~wxl/research/proteomic/_Project_Rej%20vs%20Post-rej/project_II.htm"
The lesson
• This explains just one of the handful of errors we encountered in the one file, but the basic problem is the same -- we are relying on spaces to define a pattern for us, but this is not a completely reliable way of identifying the content we’re after
• Instead, we need a more elaborate language for specifying patterns; and today we’ll see an extremely powerful construction known as regular expressions
• (Note: This material is perhaps slightly out of time order -- I decided to present it now to get us out of this puzzle -- and next time we will take a small step back and talk about data formats like CSV, XML, JSON and the relational model)
Quien es mas macho?
• In online marketing, hits rule the roost; the more raw traffic you attract, the greater your opportunities for making a sale
• Alright, so it’s not as simple as that, but let’s see what we can learn about centers of activity on our website
Quien es mas macho?
• Compute the number of hits to the portions of the site owned by Song-Chun Zhu, Vivian Lew, Jan deLeeuw and Ivo Dinov
• Who received the most hits in August? In March?
• What can you say about the kinds of files that were downloaded?
• What was the most popular portion of each site?
Then...
• Pull back a little and tell me about the site in general and the habits of its visitors; specifically, think about
• When is the site active? When is it quiet?
• Do the visitors stay for very long? Do they download any of our papers or software? What applications do they run?
• On the balance, is our traffic “real” or mostly the result of robots or automated processes?
Finally
• Unix tools can be incredibly powerful; they will be much faster than most other options for what they’re good at
• That said, there are plenty of kinds of analysis that are not workable with collections of simple Unix facilities; as you do your homework, keep track of what seems hard and why (in particular
• Our programming resources will keep getting more and more elaborate as the quarter progresses; this is just a first step