researchdirect.westernsydney.edu.au/islandora/object/uws:8960
COLLEGE OF HEALTH AND SCIENCE
SCHOOL OF COMPUTING AND MATHEMATICS

DETECTION OF BYPASSING TRAFFIC

Eliezer IDJALAHOUE

Master of Science (Honours)

Thesis supervisors: Dr Hon CHEUNG, Dr Ewa HUEBNER

June 2011
TABLE OF CONTENTS
1 Introduction........................................................................................................................1
1.1 Problem definition.......................................................................................................2
1.2 Scope and limitations..................................................................................................4
1.3 Research method........................................................................................................5
1.4 Thesis overview...........................................................................................................6
2 Background........................................................................................................................7
2.1 Basic concepts.............................................................................................................7
2.1.1 Network security overview................................................................................7
2.1.2 Firewalls and packet filtering techniques...........................................................8
2.1.3 Types of firewalls and their weaknesses to stop bypassing traffic...................11
2.1.3.1 Traditional Packet filters.........................................................................12
2.1.3.2 Dynamic packet filters............................................................................14
2.1.3.3 Circuit level gateways.............................................................................14
2.1.3.4 Application level gateways.....................................................................15
2.1.3.5 Distributed firewalls...............................................................................16
2.2 Web proxy server and filtering mechanisms.............................................................17
2.2.1 Web proxy........................................................................................................17
2.2.2 Proxy filtering mechanisms..............................................................................18
2.3 Bypassing a web proxy..............................................................................................19
2.3.1 Definition of bypassing.....................................................................................19
2.3.2 Bypassing mechanism......................................................................................19
2.3.3 Bypassing techniques.......................................................................................21
2.3.3.1 Encrypted tunnels...................................................................................22
2.3.3.2 Anonymizer or bypassing software.........................................................23
2.3.3.3 CGI proxy server.....................................................................................25
2.4 Risks of bypassing a firewall......................................................................................26
2.4.1 Financial impact..................................................................................................27
2.4.2 Productivity loss.................................................................................................27
2.4.3 Shortage in resources.........................................................................................28
2.4.4 Privacy concerns.................................................................................................28
2.5 Summary.......................................................................................................................29
3 Previous works..................................................................................................................30
3.1 Encrypted tunnels.......................................................................................................30
3.2 Anonymizers or bypassing software...........................................................................32
3.3 CGI proxy servers........................................................................................................33
3.4 Summary.....................................................................................................................34
4 Goals and experiments............................................................................................... 35
4.1 Goals...........................................................................................................................35
4.2 Experiments................................................................................................................36
4.3 Summary.....................................................................................................................37
5 Design and implementation..............................................................................................39
5.1 Network profiles.........................................................................................................40
5.1.1 Detection parameters.......................................................................................41
5.1.1.1 Size of embedded objects........................................................................41
5.1.1.2 Inter‐arrival time......................................................................................44
5.1.1.3 TCP flows.................................................................................................45
5.2 Implementation of the testing network.....................................................................46
5.2.1 Topology of the testing network.......................................................................46
5.2.2 Hardware and configuration.............................................................................48
5.2.2.1 Physical machine......................................................................................48
5.2.2.2 Virtual machine 1: Proxy Firewall............................................................49
5.2.2.3 Virtual machine 2: Blocked server............................................................50
5.2.2.4 Virtual machine 3: Bypassing proxy.........................................................51
5.2.2.5 Virtual machine 4: Routing server............................................................52
5.2.2.6 Virtual machine 5: Client computer.........................................................53
5.2.3 Software............................................................................................................53
5.2.3.1 VMWare Workstation..............................................................................53
5.2.3.2 ISA server 2004........................................................................................55
5.2.3.3 XAMPP for windows.................................................................................55
5.2.3.4 Wireshark.................................................................................................56
5.2.3.5 Fiddler 2................................................................................................ 56
5.2.3.6 Glype proxy script....................................................................................57
5.2.3.7 Traffic generator......................................................................................57
5.3 Summary.....................................................................................................................60
6 Findings: Results and Analyses.....................................................................................61
6.1 Initial experiment: Profile building.............................................................................61
6.2 Single webpage results...............................................................................................62
6.3 Aggregated results.....................................................................................................64
6.4 Summary.....................................................................................................................67
7 Additional experiments.....................................................................................................69
7.1 Physical network for accuracy evaluation................................................................. 69
7.2 Accuracy evaluation script.........................................................................................70
7.3 Accuracy evaluation of the detection approach........................................................73
7.3.1 Building phase of network profiles.................................................................. 73
7.3.2 Frequency distribution of the size of embedded objects………........................ 73
7.3.3 Frequency distribution of the header size of embedded objects.....................76
7.3.4 Frequency distribution of the payload size of embedded objects...................79
7.3.5 Inter‐arrival time.............................................................................................. 80
7.3.6 Number of TCP flows........................................................................................82
7.4 Detection rules............................................................................................................83
7.5 Results of the accuracy of the detection approach....................................................84
7.5.1 Results of HTTP bypassing mode…...................................................................85
7.5.1.1 Frequency distribution of the size of payload.........................................85
7.5.1.2 Frequency distribution of payload combined with inter‐arrival time.....86
7.5.1.3 Frequency distribution of payload combined with inter‐arrival time
and the number of TCP flows..................................................................88
7.5.2 Results of HTTPS bypassing mode....................................................................90
7.5.2.1 Frequency distribution of the size of payload.........................................90
7.5.2.2 Frequency distribution of payload combined with inter‐arrival time.....91
7.5.2.3 Frequency distribution of payload combined with inter‐arrival time
and the number of TCP flows..................................................................92
8 Conclusion........................................................................................................................94
8.1 Contribution...............................................................................................................94
8.2 Future work...............................................................................................................94
References.............................................................................................................................95
Appendix.............................................................................................................................101
TABLE OF FIGURES
1.1 Plethora of articles and tutorials on the Internet on bypassing a proxy firewall............2
1.2 CGI bypassing vs. SSH tunnel bypassing.........................................................................4
2.1 Illustration of a firewall..................................................................................................9
2.2 Structure of network packets across the layers...........................................................10
2.3 Parties involved during a bypassing scenario...............................................................20
2.4 Bypassing through an encrypted tunnel......................................................................23
2.5 Anonymizer: single‐point vs. networked design..........................................................24
2.6 Implementation of a CGI proxy to bypass firewall restrictions....................................26
4.1 Model for detecting CGI proxies’ traffic.......................................................................38
5.1 Retrieval of a web page requiring multiple GET to fetch each object..........................43
5.2 Inter‐arrival time illustration........................................................................................44
5.3 Illustration of TCP flows and inter‐arrival time of each flow during a session.............46
5.4 Detail topology of the virtual network.........................................................................48
5.5 Flow chart of the traffic generator...............................................................................59
6.1 Single webpage results................................................................................................63
6.2 Webpage 1 results.......................................................................................................65
6.3 Webpage 2 results.......................................................................................................65
6.4 Webpage 3 results.......................................................................................................66
6.5 Webpage 9 results.......................................................................................................66
6.6 Webpage 10 results.....................................................................................................67
7.1 Topology of the physical network for the accuracy evaluation...................................70
7.2 Flow chart of the evaluation of the efficiency of the detection approach...................72
7.3 Comparison of the frequency distribution of the size of embedded objects within
a web page in direct access, HTTP bypassing access and HTTPS bypassing access.....75
7.4 Comparison of the frequency distribution of the header size of embedded objects
within a web page in direct access, HTTP bypassing access and HTTPS bypassing
access...........................................................................................................................78
7.5 Comparison of the frequency distribution of the payload size of embedded
objects within a web page in direct access, HTTP bypassing and HTTPS
bypassing modes..........................................................................................................80
7.6 Retrieval of www.uws.edu.au through the CGI proxy www.glypeproxy.com..............82
7.7 Evaluation of the accuracy of the detection approach according to the frequency
distribution in HTTP bypassing mode...........................................................................86
7.8 Evaluation of the accuracy of the detection approach according to the
frequency distribution and the inter‐arrival time in HTTP bypassing mode................88
7.9 Evaluation of the accuracy of the detection approach according to the
frequency distribution, the inter‐arrival time and the number of TCP flows
in HTTP bypassing mode..............................................................................................89
7.10 Evaluation of the accuracy of the detection approach according to the frequency
distribution in HTTPS bypassing mode........................................................................90
7.11 Evaluation of the accuracy of the detection approach according to the
frequency distribution and the inter‐arrival time in HTTPS bypassing mode..............92
7.12 Evaluation of the accuracy of the detection approach according to the
frequency distribution, the inter‐arrival time and the number of TCP flows
in HTTPS bypassing mode............................................................................................93
TABLE OF TABLES
4.1 Total number of accesses for the experiments............................................................37
5.1 Description of the physical machine............................................................................49
5.2 Description of the proxy server (virtual machine).......................................................50
5.3 Description of the blocked or blacklisted server (virtual machine).............................51
5.4 Description of the bypassing proxy (virtual machine).................................................51
5.5 Description of the Routing Server (virtual machine)...................................................52
5.6 Description of the client computer (virtual machine)..................................................53
6.1 Traffic profile of initial access......................................................................................62
6.2 Detection conditions of a CGI proxy traffic..................................................................68
7.1 Repartition of web pages in relation to the percentage of matches of the size of
embedded objects in direct access compared to HTTP and HTTPS bypassing
accesses.......................................................................................................................76
7.2 Repartition of web pages in relation to the percentage of matches of the header
size of embedded objects in direct access compared to HTTP and HTTPS bypassing
accesses.......................................................................................................................78
7.3 Repartition of web pages in relation to the percentage of matches of the payload
size of embedded objects in direct access compared to HTTP and HTTPS bypassing
modes..........................................................................................................................80
7.4 Repartition of web pages in relation to the inter‐arrival time of the packets in
direct access compared to HTTP and HTTPS bypassing accesses................................81
7.5 Repartition of web pages in relation to the number of TCP flows in direct access
compared to HTTP and HTTPS bypassing accesses......................................................83
7.6 Detection rules of bypassing traffic.............................................................................84
LIST OF ACRONYMS
CGI : Common Gateway Interface
DNS : Domain Name Service
DoS : Denial of Service
FTP : File Transfer Protocol
HTML : Hypertext Markup Language
HTTP : Hyper Text Transfer Protocol
HTTPS : Hyper Text Transfer Protocol Secure
IP : Internet Protocol
ISA Server : Internet Security and Acceleration Server
NIC : Network Interface Card
NTFS : New Technology File System
OSI : Open System Interconnection
P2P : Peer‐to‐Peer
PHP : Hypertext Preprocessor
RAM : Random Access Memory
SSH : Secure Shell
SSL : Secure Socket Layer
SOCKS : Abbreviation from SOCKetS. Internet Protocol that enables Client/Server application to communicate transparently through firewalls
TCP : Transmission Control Protocol
TELNET : Network Virtual Terminal Protocol
URL : Uniform Resource Locator
VoIP : Voice over Internet Protocol
VPN : Virtual Private Network
ABSTRA
The inte
business
are some
increase
has also
maliciou
boundar
risk of co
security
restrictio
the sec
investiga
network
This Mas
The det
embedd
packets
system is
virtual n
blocked
web pag
sequenti
between
of the m
network
used for
ACT
ernet throug
ses as well a
e of the serv
d through t
o been obs
s programs
ry of private
omputers g
policies. H
ons of many
urity polici
ated in this
and thus av
ster’s thesis
ection mod
ed object o
and the n
s tested on
etwork rep
web server
ges stored o
ial accesses
n direct acce
model in a v
to evaluat
the experim
gh the year
as our daily
vices relying
the decades
served. Ma
s by installi
e networks.
etting infec
However, s
y proxy firew
ies of the
s thesis, ho
void the byp
s covers the
del is built
of a webpag
umber of T
a virtual ne
roduces the
r, a CGI pro
on the block
s are made
ess, HTTP a
irtual netwo
e the efficie
ments is arti
rs has beco
life. Emails
g on the Inte
s, a huge su
ny networ
ng antivirus
The two m
cted by mal
sophisticate
walls, grant
ir organisa
ow to detec
passing of se
e design and
from four
ge, inter‐arr
TCP flows e
etwork in or
e bypassing
oxy and the
ed server in
to each we
and HTTPS b
ork, bypass
ency of the
ificially gene
me, not a
s, social net
ernet to ope
rge of virus
k specialist
s programs
main functio
icious progr
ed tools ha
ting unlimite
ation. Here
ct traffic em
ecurity polic
d evaluation
r non‐paylo
ival time of
emulated b
der to evalu
scenario inv
client. An
n order to c
ebpage in H
bypassing a
ing experim
e model in a
erated due t
De
luxury asse
tworking, on
erate. As th
es and spyw
ts respond
and deplo
ons of proxy
rams as we
ave been
ed access to
in lays t
mulated by
cies.
n of a detec
oad propert
f inbound p
by a brows
uate the cor
volving fou
initial test
reate netwo
TTP and HT
ccesses. Aft
ments are th
a more rea
to the lack o
etection of
et, but a ne
nline shopp
he popularit
ware circula
to the th
oying proxy
y firewalls a
ll as definin
developed
o internal u
the fundam
y CGI proxi
ction mode
ties of IP p
packets, ave
ing session
rrectness of
r parties: a
is run by ac
ork traffic p
TTPS to find
ter proving
hen conduct
listic situati
of physical u
bypassing t
Pag
ecessary too
ping and ban
y of the Inte
ating on the
reats pose
firewalls a
are to lowe
ng and enfo
to bypass
users contra
mental pro
es on a pr
l of CGI pro
packets: siz
erage size of
. The dete
f the model
proxy firew
ccessing dir
profiles. Two
d the correl
the correct
ted in a phy
ion. The da
users.
raffic
ge | x
ol for
nking
ernet
e web
d by
t the
r the
orcing
s the
ary to
blem
rivate
oxies.
ze of
f TCP
ction
. This
wall, a
rectly
o sub
ation
tness
ysical
taset
STATEMENT OF AUTHENTICATION

The work presented in this thesis is, to the best of my knowledge and belief, original except as acknowledged in the text. I hereby declare that I have not submitted this material, either in full or in part, for a degree at this or any other institution.

Eliezer IDJALAHOUE
ACKNOWLEDGMENTS

First of all, I am grateful to GOD for his infinite love towards me. May all the glory be given to him for leading me through the years in reaching this level of my studies.

Secondly, I owe my deepest gratitude to my supervisor Dr Hon CHEUNG for his support and valuable feedback. My thanks go to him for motivating and encouraging me throughout this research. His experience in this discipline has contributed tremendously to the completion of this thesis.

I would also like to thank Dr Ewa HUEBNER, my second supervisor, who has made her support available during the first half of my research.

I would like to show my gratitude to Carolle AKPAKA, my wife, for her continual love and care. This work could not have been completed without her patience and understanding.

Thanks also to Reinier VEERMAN for sponsoring my studies in Australia and Michael JOSSEP for proofreading my thesis.

Also thanks to my family, for always praying for me and supporting me. Even though thousands of kilometres separate us, your emails and phone calls always bring a fresh breath to my life and offer me new hope.

Lastly, I extend my sincere gratitude to Harris TCHABOSSOUANANI, CISCO expert, for spending hours with me to implement a working platform for my experiments.
CHAPTER 1
INTRODUCTION
Nowadays, the Internet has become a key instrument in the expansion and performance of
many companies. Organisations such as hospitals, universities and government institutions
are proceeding with the migration of their activities to the web platform allowing the
acceleration of transactions and making information easy to access for their employees as
well as for their targeted population. E‐commerce, web banking, online conferencing and
social networking are just some of the services relying on the Internet to operate. However,
the plethora of inter‐connected computer networks around the globe has triggered a
massive hunt for sensitive information and copyrighted materials by cyber criminals. In the
same way, the growing popularity of the Internet over the years has been accompanied by an
exponential growth of viruses, spyware and Trojan horses. Therefore, the privacy and
protection of private networks have emerged as a major concern for computer specialists. According to [1], an
unprotected computer can be infected by malicious programs in less than five minutes after
connecting to the Internet.
In the late 80’s [2], a new technology known as a firewall was introduced to combat threats
posed by cyber criminals and malicious programs. Through the years, firewalls evolved from
inspecting packets to highly sophisticated proxies performing complex tasks such as Deep
Packet Inspection (DPI), intrusion detection (IDS), Network Address Translation
(NAT) and caching [1, 3]. Firewalls are deployed at the boundary of private or corporate
networks to create a security perimeter between the Internet and the private network.
They enhance the security of the network by inspecting and analyzing inbound and
outbound traffic. In other words, a firewall can automatically detect and block some attacks
originating from the Internet. In addition, firewalls can be used as a tool to limit access to
some resources on the Internet by enforcing an Access Control Policy (ACP). Firewalls have gained a lot of momentum in network security with a large number of institutions investing huge amounts of money every year to acquire the latest firewall technology in order to minimise the risk of attacks. The World Wide Web (WWW) is the main service through which malicious codes are introduced into private networks [4]. Therefore a filtering mechanism for web traffic is mostly implemented in a lot of firewalls to restrict access only to websites considered safe and productive for the company. However, several techniques exist to bypass the security policies of corporate firewalls and thereby gain unlimited access to the web. By the time this research was conducted, the keywords “bypass a firewall”, “bypass a censorship” or “bypass a proxy” introduced into the search engine “google.com” returned more than a million articles, tutorials and forum discussions on the topic (see figure 1.1).

Figure 1.1: Plethora of articles and tutorials on the Internet on bypassing a proxy firewall

1.1 PROBLEM DEFINITION

A previous research [5], completed in 2008, investigated the bypassing anomalies observed on a private network resulting from the use of three bypassing techniques: SSH tunnels, Anonymizers (Lozdodge) and CGI proxies. This was achieved by choosing five network metrics which were: throughput, amount of data received and sent by a client, number of packets retransmitted, the format of the URLs and the average time required for the
retrieval of a webpage. From the experiments carried out throughout this research, it was
discovered that the use of SSH tunnels to bypass a firewall generated a high amount of data
sent, a low throughput and a high average time to complete an HTTP session [5]. As for the
anonymizer (in this case Lozdodge), anomalies were only observed with the average time
and the throughput [5]. Furthermore, the experiments on CGI proxies outlined anomalies
related to the throughput and the amount of data sent [5]. Yet, the anomalies detected
with the different bypassing techniques were not enough to implement a detection system.
Viruses and spyware can generate similar anomalies, which would cause the detection
system to trigger many false alerts, also known as “false positives”. A more specific
investigation, focused on the properties of the TCP packets exchanged during a bypassing
session instead of aggregated statistics of the session, was then necessary to increase the
robustness of the proposed detection system.
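To make this idea concrete, the kind of per-session comparison it implies can be sketched in a few lines of Python. This is only an illustration of the general approach, not the detection rules developed later in the thesis; the sample sizes and the bin width are invented for the example:

```python
from collections import Counter

def size_distribution(payload_sizes, bin_width=100):
    """Bin observed payload sizes (in bytes) into a relative
    frequency distribution keyed by bin index."""
    bins = Counter(size // bin_width for size in payload_sizes)
    total = len(payload_sizes)
    return {b: count / total for b, count in bins.items()}

def match_percentage(profile, observed):
    """Percentage of the observed probability mass that overlaps
    the stored profile distribution."""
    shared = sum(min(profile.get(b, 0.0), p) for b, p in observed.items())
    return 100.0 * shared

# Direct access yields varied object sizes; a proxied session that
# re-wraps every object can shift the whole distribution.
direct = size_distribution([120, 480, 510, 950, 4800])
proxied = size_distribution([1480, 1480, 1480, 1480, 1480])
print(round(match_percentage(direct, direct)))   # 100
print(round(match_percentage(direct, proxied)))  # 0
```

A real detector would build the profile from many direct accesses and combine several parameters, as the later chapters do, rather than judge from a single session.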
In this research, the investigation is narrowed down to CGI proxies especially those
implemented with both HTTP and HTTPS protocols. This decision is motivated by the fact
that CGI proxies are free, easy to use and the most popular technique. In fact, thousands of
CGI proxies are accessible on the Internet free of cost to bypass proxy firewalls. On one
hand, the SSH and anonymizer techniques require the installation of software and, in some
cases, advanced networking knowledge such as mastering port forwarding, configuring
an SSH server and modifying a web browser's settings to use a
SOCKS proxy. On the other hand, with the CGI proxy bypassing technique, a user can bypass
a proxy firewall by simply typing the URL of the CGI proxy into his web browser. In other
words, no software installation and no configuration are required for the CGI proxy
technique (see figure 1.2).
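The mechanics behind this simplicity are that a CGI proxy is itself an ordinary web page that fetches the blocked URL on the user's behalf, so the blocked address travels as a query parameter of a permitted site. A minimal sketch, assuming a Glype-style script that accepts the target in a query parameter (the hostname and the parameter name `q` are invented for illustration; real scripts often encode the target differently):

```python
from urllib.parse import quote

def cgi_proxy_url(proxy_base, target):
    """Build the request a browser sends when a user submits a
    blocked address through a CGI proxy's form: the target URL is
    percent-encoded and carried as a plain query parameter."""
    return f"{proxy_base}?q={quote(target, safe='')}"

url = cgi_proxy_url("http://cgiproxy.example/browse.php",
                    "http://blocked.example/index.html")
print(url)
# http://cgiproxy.example/browse.php?q=http%3A%2F%2Fblocked.example%2Findex.html
```

To the firewall, the connection appears to go only to the proxy host, which is why URL-based blacklists fail unless the proxy itself has been blacklisted.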
The main problem addressed in this research is how to find patterns related to the use of
CGI proxies on private networks. Hence, the goal of this thesis is to test the correctness of a
detection model for the fingerprinting of CGI proxies’ traffic that can be used to increase
the efficiency of proxy firewalls. This investigation will design and test a detection
mechanism of proxy firewall bypassing traffic in a virtual network. A real world platform
made of the different parties involved during a bypassing technique will be reproduced in a
virtual network. This thesis shall investigate possible patterns related to CGI proxies through
network profiles built from:
- The size of the objects embedded within a webpage
- The inter‐arrival time of inbound packets
- The number of TCP flows
- The average size of the packets
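As a rough sketch of what building such a profile involves (not the implementation used in chapter 5), three of these parameters can be derived from a parsed packet trace as follows. The tuple layout and addresses are invented, and recovering the sizes of embedded objects would additionally require reassembling the HTTP responses, which is omitted here:

```python
def profile_session(packets):
    """Compute per-session profile parameters from inbound packets,
    given as (timestamp, src_ip, src_port, dst_ip, dst_port, size)
    tuples -- a simplified stand-in for a parsed capture file."""
    times = sorted(p[0] for p in packets)
    gaps = [b - a for a, b in zip(times, times[1:])]
    flows = {(p[1], p[2], p[3], p[4]) for p in packets}  # distinct 4-tuples
    sizes = [p[5] for p in packets]
    return {
        "avg_inter_arrival": sum(gaps) / len(gaps),  # seconds between packets
        "avg_packet_size": sum(sizes) / len(sizes),  # bytes per packet
        "tcp_flows": len(flows),                     # number of TCP flows
    }

pkts = [
    (0.00, "10.0.0.2", 80, "10.0.0.9", 50001, 1500),
    (0.02, "10.0.0.2", 80, "10.0.0.9", 50001, 900),
    (0.06, "10.0.0.2", 80, "10.0.0.9", 50002, 1500),
]
prof = profile_session(pkts)
print(prof["tcp_flows"], round(prof["avg_packet_size"]))  # 2 1300
```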
Figure 1.2: CGI bypassing vs. SSH tunnel bypassing
1.2 SCOPE AND LIMITATIONS
Many techniques exist to bypass proxy firewalls, but the most popular are encrypted
tunnels, anonymizers and CGI proxies. Over the past decades, many investigations, with
good detection accuracy, have been conducted on identifying anonymizers and encrypted
tunnels carrying unauthorized traffic. However, few
[Figure 1.2 contrasts the two paths: (a) CGI proxy bypassing, where the user needs no configuration and simply uses the CGI proxy's URL to reach the blocked server; (b) SSH tunnel bypassing, where the user must configure an SSH client and web browser, set up port forwarding and a SOCKS proxy, while the remote end needs an installed and configured SSH server with an open port on the router.]
investigations have been done on CGI proxies. This thesis will provide background
information on the three main bypassing techniques. However, the investigation will focus
more on CGI proxies rather than on encrypted tunnels and anonymizers. The experiments will be carried out in a virtual network. However, the accuracy of the proposed detection approach will also be tested in a real network.
1.3 RESEARCH METHOD
In [6], Gordana identified three main scientific methods to approach computer science
problems:
Theory: This scientific approach uses logic and sound mathematics to build theories and to prove or derive theorems. In this branch of computer science, researchers seek to design and evaluate the performance of new algorithms, understand computational problems and investigate solutions.
Experiment: In this discipline, scientific experiments are conducted on computational phenomena with the aim of verifying hypotheses or creating new models.
Simulation: In this approach, scientists investigate a real-world situation or computational phenomenon by conducting experiments in virtual laboratories instead of building physical ones. In this quest, researchers rely intensively on applied mathematics, experimentation and applied theory [6].
Prior to using one of the scientific approaches mentioned above, modelling is always applied to the computational phenomenon. During the modelling process, the phenomenon is analyzed, simplified and reduced to an understandable model that can be studied. In this thesis, the main approach has been the simulation of a detection model of bypassing traffic in a virtual network. Nonetheless, theory was used to formulate the problem while experimentation contributed to the evaluation of the performance of the detection model.
1.4 THESIS OVERVIEW
This chapter has described the network breach investigated in the present research and
outlined the scope and limitations of the study. Moreover, both the methodology adopted
for the investigation and the aims of the thesis are defined in this chapter. The rest of the
thesis is organized as follows. Chapter 2 will present some relevant background information
on firewalls and provide an overview about web proxy and content filtering concepts. That
chapter will also explain how the bypassing of proxy firewalls is achieved and will discuss
the potential threats posed by CGI proxies to corporate networks. Related work in the
discovery and blocking of the bypassing of proxy firewalls will be presented in Chapter 3. In
Chapter 4, the goals of this work are clearly outlined. Chapter 5 will provide more details
about the design of the detection model and the implementation of the virtual network
used for testing. The results of the correctness of the detection model will be presented and
discussed in Chapter 6. Chapter 7 evaluates the accuracy of the detection model in a more realistic environment and with a larger dataset. Finally, Chapter 8 will provide a conclusion to the investigation and identify further work.
CHAPTER 2
BACKGROUND
This chapter presents some relevant background information. Section 2.1 introduces basic
information on firewalls while Section 2.2 focuses on the concepts of web proxy and
content filtering. Section 2.3 explains the mechanism utilised to perform the bypassing of a
proxy firewall. Finally, Section 2.4 outlines the risks posed by bypassing traffic to private and
corporate networks.
2.1 BASIC CONCEPTS
To help in the understanding of the work presented in this thesis, this section provides an overview of firewalls. In addition, firewall classifications according to the OSI architecture and the main purpose of each type of firewall are outlined. The concepts of a web proxy, content filtering and blacklists are also clarified.
2.1.1 Network security overview
The Internet was created to provide connectivity between computers and offer an
infrastructure for resource and service sharing [2]. Communities such as corporations, universities, schools, government institutions, hospitals, banks and private users rapidly joined the Internet to speed up or improve their activities. The openness of the Internet to various communities also means it is open to more sinister communities of hackers and other cyber criminals. The exponential growth of
users, throughout the years, also resulted in a dramatic increase of the number of attacks
and infections [7, 8, 9]. This can be easily explained by the fact that security mechanisms
were not implemented in the technology supporting the Internet, such as its protocols [9]. The Internet was initially developed to share information between trusted parties. Therefore, security was not considered an issue as the Internet was not a public tool. TCP/IP, for
instance, has many weaknesses that can lead to attacks such as DoS attacks, impersonation
of a trusted party through IP spoofing and the interception of messages [9]. Through the
years, many filtering techniques have been developed to overcome some of the
weaknesses of the TCP/IP protocol. Some of these techniques involve the scanning of
inbound traffic for viruses, Trojan horses and spyware, deep packet inspection, the blocking
of vulnerable services and intrusion detection systems.
2.1.2 Firewalls and packet filtering techniques
Firewalls were introduced in the late 1980s [2] to prevent and minimise attacks on private networks as well as to increase users' privacy. Before the introduction of firewalls, routers were the first tools used to separate networks [2], but they rarely provided any security.
According to [7], a firewall can be defined as:
“A network security product that acts as a barrier between two or more network
segments. The firewall is a system that provides an access control mechanism
between your network and the network(s) on the other side of it.”
A firewall is a computer system made of software, hardware or a combination of both [10]
that enforces an access control policy between a trusted network and an untrusted network
such as the Internet as seen in Figure 2.1. Mostly, a firewall is located at the outer boundary
of a private or corporate network and controls the outgoing and incoming traffic. Many
corporations and organizations deploy a firewall as the first line of defence against attacks.
A firewall offers the possibility, to a network administrator, to block or allow access to some resources according to a security policy. For instance, HTTP traffic can be allowed to pass through a firewall using port 80, while FTP traffic, which is commonly found on port 21, is blocked. The filtering of port numbers is one of the keys to setting the allowed and disallowed access in a firewall security policy.
Furthermore, a firewall is also a good tool for monitoring a network and logging sensitive information about security breaches [7, 3]. For example, the Uniform Resource Locators (URLs) visited by a user, the amount of data received or sent during a session, and the date, time and IP addresses of the computers connected to a server are some of the statistics that a firewall can record. This information is very important when performing forensic investigations after an attack has been detected on a network.
Figure 2.1: Illustration of a firewall placed between a trusted network and the Internet. In this figure the firewall sits at the edge boundary of the private network.
The OSI architecture is made of 7 layers: physical, data link, network, transport, session, presentation and application. In practice, the 7 layers of the OSI architecture are reduced to 4 main layers [11]: the application, transport, internet and physical layers. On one hand, a header is appended to the existing data at each layer as the packet is transmitted from a higher layer to a lower layer when packets are exiting a network. On the other hand, when packets are entering a network, the header data corresponding to each layer is stripped from the packet as it travels from a lower layer to a higher layer. As seen in Figure 2.2, a packet is made of two main parts: a header and a payload. Filtering mechanisms are enforced on a network by either scrutinising the information contained in the header of a packet or inspecting the payload of the packet for keywords or anomaly signatures. However, this is only achievable if the packets are not encrypted.
Figure 2.2: Structure of network packets across the layers.
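The layer-by-layer header handling described above can be sketched with a toy encapsulation model (the bracketed header strings are placeholders, not real protocol encodings):

```python
# Toy model of encapsulation: headers are appended on the way down the
# stack (exiting a network) and stripped on the way up (entering one).
LAYERS = ["TCP Header", "IP Header", "Ethernet Header"]

def encapsulate(payload: str) -> str:
    packet = payload
    for header in LAYERS:            # application -> physical
        packet = f"[{header}]{packet}"
    return packet

def decapsulate(packet: str) -> str:
    for header in reversed(LAYERS):  # physical -> application
        prefix = f"[{header}]"
        assert packet.startswith(prefix), "malformed packet"
        packet = packet[len(prefix):]
    return packet
```

Encapsulating `"GET /index.html"` produces `"[Ethernet Header][IP Header][TCP Header]GET /index.html"`, matching the physical-layer row of Figure 2.2.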
Overall, four filtering mechanisms can be identified:
Physical layer filtering: At this stage, the filtering of network packets is performed by inspecting the Ethernet header of the packets; the only parameter of interest at the physical layer is therefore the MAC address. An attack is possible at this layer if the attacker has direct access to a private network device such as a user's computer or a router. ARP spoofing and packet sniffing can then be used to collect network traffic and scan for vulnerabilities such as user accounts and passwords.
Internet layer filtering: The source IP and destination IP are two main parameters
for filtering network traffic at this layer [12, 11]. These parameters are contained in
the IP header of the packets. Most often, a blacklist and a whitelist are enforced on
the perimeter of a private network to restrict and allow access to some IP addresses,
respectively. An Intrusion Detection System (IDS) can fingerprint policy violations by
inspecting the source IP and destination IP of the packets [13] derived from the IP
header of the packets. In addition, spoofed IP and inconsistent IP headers are more
likely to be detected by an IDS at this layer.
[Figure 2.2 content: application layer — Payload (Plaintext); transport layer — TCP Header, Payload; internet layer — IP Header, TCP Header, Payload; physical layer — Ethernet Header, IP Header, TCP Header, Payload.]
Transport layer filtering: At this layer, security policies are established based on the
source port and destination port contained in the TCP header. The port number is a
reliable parameter to identify the service running on a host machine. For example,
Ports 80, 443 and 22 are the default ports for HTTP, HTTPS and SSH traffic, respectively. Therefore, at the transport layer, security policies rely on the source and destination port numbers to block or allow services on a private network. Furthermore, some network attacks are traceable by an Intrusion Detection System (IDS) using the source and destination ports of TCP packets. These attacks range from DoS and SYN attacks to port scanning [13].
Application layer filtering: The application layer is the highest layer in the OSI
architecture. At this stage, the filtering is applied on the payload of the packets. As a
result, the payload is searched for keywords in the access control policies. In
addition, the payload can be scanned for viruses, spyware and Trojan horses.
Malware and suspicious code are detected either by using signatures or by detecting anomalies related to the malware. Intrusion Detection Systems implemented at this
layer are able to detect anomalies and attacks related to protocols such as HTTP,
DNS and FTP [13].
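Taken together, the four mechanisms amount to rule checks over the fields visible at each layer. The sketch below is a simplified illustration; the field names, rule sets and example values are assumptions, not rules from any real product:

```python
# Assumed rule sets, one per layer of the reduced OSI model.
BLOCKED_MACS = {"aa:bb:cc:dd:ee:ff"}             # physical layer
IP_BLACKLIST = {"203.0.113.9"}                   # internet layer
BLOCKED_PORTS = {23}                             # transport layer (e.g. TELNET)
BANNED_KEYWORDS = {"virus", "download spyware"}  # application layer

def allow(packet: dict) -> bool:
    """Apply the four filtering mechanisms in layer order."""
    if packet["src_mac"] in BLOCKED_MACS:
        return False
    if packet["src_ip"] in IP_BLACKLIST or packet["dst_ip"] in IP_BLACKLIST:
        return False
    if packet["dst_port"] in BLOCKED_PORTS:
        return False
    payload = packet.get("payload", "")  # only inspectable when unencrypted
    return not any(k in payload for k in BANNED_KEYWORDS)
```

Note that the final, application-layer check is the only one defeated merely by encrypting the payload, which foreshadows the bypassing techniques discussed later.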
2.1.3 Types of firewalls and their weaknesses in stopping bypassing traffic
William et al. [8] classify firewalls into five types according to the OSI architecture. As seen in Figure 2.2, each type of firewall is implemented at a specific layer of the OSI architecture. These five types of firewall are:
Traditional packet filters;
Dynamic packet filters;
Circuit level gateways;
Application level gateways;
Distributed firewalls.
2.1.3.1 Traditional Packet filters
Packet filtering firewalls are implemented at the network layer of the OSI architecture [7, 3]. They allow or prevent a packet from entering or exiting a network based on information contained in the headers of the packet [8, 14]. The source IP address, destination IP address, source port, destination port and transport level protocol [2] are some of the properties used for filtering a packet. The blocking mechanism of packet filtering firewalls is fast, but the filtering rules are difficult to implement [2]. This mechanism is present in many routers through a program incorporated into the hardware [8]. Although the use of packet filters on private networks is widespread, they do not stop malicious traffic, such as bypassing traffic, from crossing the security perimeter.
Packet filters are unable to stop bypassing traffic because of the immutability of some fields
of the IP header such as the IP address. The weaknesses of packet filters are:
Spoofing: This attack is achieved by changing the source IP address of a packet to a random or spoofed IP address. Spoofing tricks a user or a computer into thinking that a packet originates from a trusted source when it does not [15]. This is easily achievable because the authentication of other parties is not implemented in TCP [16]. Spoofing is mostly used to bypass the access control policies of packet filters and is difficult to detect for many firewalls and intrusion detection systems [16]. In the CGI bypassing scenario, the CGI proxy spoofs the packets received from a blacklisted server by replacing the source IP of the packets with its own IP address.
Source routing: Source routing is a technique used to specify the route a packet
must take through a network [17, 18]. The route of a packet is either specified by the
sender (source) or the network device receiving the packet. If the path of a packet is
not defined by the sender, the router or network device receiving the packet will
decide on which route to forward the packet. Source routing is an ideal tool utilized
by hackers to bypass firewalls [17] and access computers which are normally blocked
by the firewall. This technique is similar to the spoofing attack during the CGI
bypassing technique. By adding a CGI proxy between a packet filter firewall and a
blocked server, a user is able to go around the access control rules.
Fragmentation attacks: The transmission of large packets is enabled by the Internet
protocol (IP) through a mechanism called fragmentation. This mechanism consists of splitting a large packet into smaller packets, each containing an offset for reassembly. These fragments are transmitted through a network and reassembled
at the other end to reconstruct the original packet [12]. Packet filtering firewalls
check the authenticity of the first fragmented packet of the original packet and
allow the remaining fragmented packets to pass through if the header data of the
first packet complies with the access control policies [15]. By doing this, a firewall
can permit unauthorized traffic to enter the network. Fragmentation attacks are categorized into two main groups: the tiny fragment attack and the overlapping attack [53, 15]. In a tiny fragment attack, the TCP header information is sent to a packet filtering firewall in three small fragments [52]. Packet filtering firewalls will fail to block the first fragmented packet from entering the network because the first packet does not contain all the TCP header information necessary to authenticate the packet. Since the TCP header data is split across the three smaller fragments [51, 15], the filtering mechanism is unable to check the legitimacy of the following incoming packets. The overlapping fragment attack is achieved by sending a zero offset packet
containing incomplete data or a legitimate TCP header complying with the firewall
rules. Additional non‐zero offset packets are then transmitted to modify the TCP
header data during the reassembling process [15] resulting in a malicious packet.
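A naive reassembler makes the overlapping attack concrete: a later non-zero-offset fragment overwrites bytes that the firewall already approved in the zero-offset fragment. This is a toy byte-level model for illustration, not real IP reassembly:

```python
def reassemble(fragments):
    """Naively reassemble (offset, data) fragments; later fragments
    overwrite earlier bytes -- the behaviour overlapping attacks abuse."""
    buf = bytearray()
    for offset, data in fragments:
        end = offset + len(data)
        if end > len(buf):
            buf.extend(b"\x00" * (end - len(buf)))  # grow the buffer
        buf[offset:end] = data                      # overlap overwrites
    return bytes(buf)
```

Reassembling the zero-offset fragment alone yields the data the firewall approved; adding an overlapping fragment silently rewrites it after the check.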
2.1.3.2 Dynamic packet filters
Dynamic packet filters are another method of preventing attacks on private networks. They
are also referred to as stateful packet inspection firewalls. They are implemented at the
transport layer of OSI architecture [7, 17] and offer transparency to users while applying
security measures. Dynamic packet filters apply security mechanisms during the
establishment of a connection by recording session information such as source IP address,
destination IP address, source port and destination port. This allows them to maintain an
array of active and authorized connections in order to monitor the traffic. All incoming
packets are then analysed against the active connections table to determine whether the
packet is legitimate or unwanted. Dynamic packet filters offer a higher security level [8]
than packet filtering firewalls by keeping track of the state of open connections and
matching them with inbound traffic to detect unwanted traffic.
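The connection table described above can be sketched as a set of session 4-tuples against which inbound packets are matched (the class and field names are illustrative assumptions):

```python
class StatefulFilter:
    """Toy stateful packet filter: record sessions opened from inside,
    then accept inbound packets only for recorded sessions."""

    def __init__(self):
        self.sessions = set()  # (src_ip, src_port, dst_ip, dst_port)

    def outbound(self, src_ip, src_port, dst_ip, dst_port):
        self.sessions.add((src_ip, src_port, dst_ip, dst_port))

    def accept_inbound(self, src_ip, src_port, dst_ip, dst_port):
        # a legitimate inbound packet mirrors an outbound session tuple
        return (dst_ip, dst_port, src_ip, src_port) in self.sessions
```

Because a CGI-proxy session is opened to the proxy's own address, content relayed from a blocked server arrives inside a recorded session and passes this check, which is exactly the weakness discussed next.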
Dynamic packet filters have similar weaknesses to traditional packet filters. Spoofing the IP address of incoming packets will allow the packets to enter the private network. Because the connection to the blocked server is established through the CGI proxy, dynamic packet filters are powerless to detect traffic originating from a blocked server.
2.1.3.3 Circuit level gateways
Circuit‐level gateways work at the transport layer of OSI architecture. During the
establishment of TCP connections, they create a virtual circuit between the source and the
destination by acting like a relay host or man in the middle [8]. A TCP connection initiated
by a client is terminated at the circuit level gateway which establishes another TCP
connection with the external server in order to handle the user’s requests [18]. Contrary to
the packet filters, this type of firewall does not allow packets to flow from end to end. The
IP address of the clients and other connection information are concealed by the circuit level
gateway. For example, an external server will only see the IP address of the circuit level
gateway instead of the client's IP address. In the early days of the Internet, circuit level gateways were used to bridge two networks [8]. This type of firewall hides the topology of
private networks and provides authentication, audit and logging mechanisms. By doing
that, circuit level gateways provide a higher security environment to private networks
compared to simple packet filters. In addition, statistics recorded by circuit level gateways
are very useful for forensics investigations to reconstruct the source of an attack.
As for the two previous types of firewalls, circuit level gateways are vulnerable to IP spoofing. This vulnerability can therefore be exploited to route illegal traffic around the gateway through a CGI proxy on the World Wide Web.
2.1.3.4 Application level gateways
Application level gateways are the most advanced type of firewalls. They are implemented
at the application layer of the OSI architecture. Mostly, they are deployed on private
networks and act as an intermediary [7, 15] between internal users and the Internet during TCP sessions. All requests made by internal users go through the application level gateway. Authorized requests are then appended with the identification information of the application level gateway and forwarded to the intended server. This transformation protects internal users and hides the topology of the private network. Apart from acting as an intermediary, an application level gateway can perform deep packet inspection on
inbound and outbound traffic. In other words, they can scrutinise the payload of incoming
and outgoing packets because they are working at the application layer of the OSI model.
For instance, an application level gateway acting as a web proxy can block all HTTP requests
containing the word “hackers” or “virus” or “download spyware”. In addition, the
monitoring and logging [7, 10, 18] of users’ activities are easily achievable by application
level firewalls. For example, they can record the URLs visited by users, the attempts at
connections made to a server and the date and time of the sessions. Network
administrators can use this information to identify the source of an attack or investigate the
entrance point of infections.
Contrary to the other types of firewalls, application level gateways perform a deep
inspection of the packets by analysing the content of the payload. In other words, the
payload is searched for keywords categorised as illegal in the access control policies.
However, the bypassing of application level gateways is still achievable by using an encrypted channel to hide the payload and thereby defeat the content filtering of the firewall.
2.1.3.5 Distributed firewalls
A distributed firewall is a newer concept and implementation of a firewall system, more secure and efficient than the traditional types of firewalls that have been operating for decades. With this type of firewall, the client is responsible for enforcing the security policies, which are provided by the main firewall. The main firewall's role is to provide the security rules and to supervise the client's enforcement [8] of these rules; the enforcement of the security policies is thus decentralised to the clients [8]. A distributed firewall operates according to the client/server concept. On one hand, the server, representing the central firewall, maintains a database of security rules. This central firewall is responsible for providing the security rules to each client and ensuring that these rules are enforced. On the other hand, the client drops a packet or allows a packet to be transmitted on the network in accordance with the access control policies.
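The server/client split can be sketched as a central rule store queried by clients that then enforce the rules locally (the class names and rule format are assumptions for illustration):

```python
class PolicyServer:
    """Central firewall: maintains the database of security rules."""

    def __init__(self, blocked_ips):
        self.blocked_ips = set(blocked_ips)

    def rules_for(self, client_id):
        # rules could differ per client; here every client gets the same set
        return {"blocked_ips": set(self.blocked_ips)}

class FirewallClient:
    """Client host: fetches the rules and enforces them locally."""

    def __init__(self, client_id, server):
        self.rules = server.rules_for(client_id)  # decentralised enforcement

    def allow(self, dst_ip):
        return dst_ip not in self.rules["blocked_ips"]
```

The packet decision happens entirely on the client, which is the decentralisation the text describes; the server's remaining role is rule distribution and supervision.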
Distributed firewalls are more efficient compared to the other types of firewalls. More
specifically, network traffic is analysed on the server and the client side of the firewall to
detect policy violations. However, distributed firewalls are also vulnerable to IP spoofing.
2.2 WEB PROXY SERVER AND FILTERING MECHANISMS
This section clarifies the common confusion between a firewall and a web proxy. The aim of a web proxy is to filter web traffic, while a firewall is limited to a specific function depending on its type, as described in Section 2.1. A censorship mechanism is mostly installed on a firewall to filter web traffic by inspecting web protocols such as HTTP, HTTPS, DNS and FTP. In general, a web proxy is a combination of hardware and software. First, the hardware is used to separate two networks and to define the rules for allowing traffic in and out of the private network. This hardware can be a router or a computer. A piece of software is then installed on this hardware to handle web traffic and restrict access to some resources on the World Wide Web.
2.2.1 Web proxy
A proxy server or proxy firewall is a computer system or application located between two networks or two computers that acts as an intermediary. For example, the employees of a company may use a proxy server to connect to the Internet. This is done to ensure the security of the private network and to improve network performance through caching.
A web proxy is aimed at the filtering of web traffic. In other words, a web proxy applies
filtering mechanisms on web related protocols such as HTTP, FTP and HTTPS [19]. User
authentication enforcement is also part of the role performed by a web proxy [20]. A web
proxy receives requests from users in the form of a Uniform Resource Locator (URL). All
requests conforming to the access control policies are completed while unauthorized
requests are simply rejected by the web proxy. In most cases, a web proxy is the point of
entrance and exit of web traffic. So, it is the best place to control the browsing activities of
users by applying filtering mechanisms.
2.2.2 Proxy filtering mechanisms
Many filtering mechanisms are implemented in web proxies to restrict the access to
external web servers classified as unsafe for private networks. According to Michael E.
Whiteman [21], a basic proxy server has at least two filtering mechanisms:
URL filtering: This filtering technique can be achieved in two modes [19]. In the first
mode, a network administrator creates and regularly updates a list of forbidden websites, called a blacklist, to perform URL blocking. Blacklists are also commercialised by web filtering companies and can be acquired by network administrators and uploaded onto the proxy firewall. Access to a website is denied by using a domain name such as www.example.com. In other words, when a domain name is blacklisted, all the URLs referring to this domain are automatically banned by the proxy server. URLs are mostly categorised according to the content of the website, such as news or games, and stored in plain text [21]. In this mode, a user has access to all websites except those listed in the blacklist. The first mode is also referred to as "block some and allow the rest" filtering. The concept of the second mode is the opposite of the first. Instead of blocking some URLs and allowing the others, in the second mode network administrators use a list of authorized URLs called a whitelist and block all other websites.
Content filtering: In this filtering technique, network traffic is deemed legal or illegal by scrutinising the payload of the packets. That is to say, a deeper inspection is performed on the packet in search of keywords explicitly classified by the network administrator as harmful. For example, universities and schools block websites containing the keyword "drugs" to avoid students being exposed to illicit drugs.
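Both mechanisms reduce to simple membership and substring checks, sketched below; the blacklist entry, keyword and function names are made-up examples, not a real product's rules:

```python
from urllib.parse import urlparse

BLACKLISTED_DOMAINS = {"example.com"}  # "block some and allow the rest"
HARMFUL_KEYWORDS = {"drugs"}

def url_allowed(url: str) -> bool:
    """URL filtering: a blacklisted domain bans every URL under it."""
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d)
                   for d in BLACKLISTED_DOMAINS)

def content_allowed(payload: str) -> bool:
    """Content filtering: scan the (plaintext) payload for keywords."""
    text = payload.lower()
    return not any(k in text for k in HARMFUL_KEYWORDS)
```

In whitelist mode the logic simply inverts: membership in the list allows the request and everything else is blocked.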
In addition to the two filtering mechanisms described above, Ari Luotonen [21] outlined
that the filtering can also be applied to the headers of the packets. For instance, headers
containing users’ credentials such as username and password should not be forwarded to
the Internet without removing this information [21].
2.3 BYPASSING A WEB PROXY
This section clarifies the concept of bypassing a firewall. It also explains the mechanism
used to bypass a firewall. The three techniques commonly used on the Internet are also
presented in this section.
2.3.1 Definition of bypassing
The bypassing of a firewall is a breach of the security policies enforced on a private network. It is the routing of unauthorized traffic through a bypassing proxy in order to get around the access control rules of a firewall. This is achieved by routing illegal traffic through encrypted tunnels or CGI web servers with the purpose of fooling the firewall into believing that the traffic originates from a trusted source.
2.3.2 Bypassing mechanism
Three main parties are involved in the bypassing of a firewall. These are as follows:
An anonymous user: A computer inside a private network is called an anonymous
user. This computer connects to the bypassing proxy and redirects all its requests to
this server instead of connecting directly to an external server on the Internet. An
anonymous user is also called the client or source of a request.
Unauthorized server: This is a computer located outside a private network, generally
on the Internet. An unauthorized server is a server blocked by a firewall. Access to
this server is disallowed to users in order to prevent infections by viruses and
spyware. An unauthorized server is also known as the destination of a request.
A bypassing server: This server is located on the Internet and acts as an
intermediary system between an anonymous user and an unauthorized server. The
requests received by a bypassing proxy from a client are signed with the
identification information of the bypassing proxy such as the IP address and then
forwarded to the intended server. After the retrieval of the data from the intended
server, the packets are updated once again with the identification information of the
bypassing proxy and relayed back to the client. Figure 2.3 outlines the different
phases and transformations performed on the packets during a bypassing process.
Figure 2.3: The three parties involved in the bypassing scenario of the firewall.
As can be seen in Figure 2.3, the packets originating from an anonymous user go through a modification process before being forwarded to the unauthorized server. The same applies to the packets returned by the unauthorized server. During the bypassing process, the anonymous user is not directly connected to the unauthorized server but to the bypassing server. The unauthorized server never sees the anonymous user because the requests arriving at it originate from the bypassing server.
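The two hops of address rewriting in Figure 2.3 can be sketched as follows, using the figure's labels IP 1 to IP 3 (the dictionary packet representation is an assumption for illustration):

```python
# Figure 2.3 labels: the user, the bypassing server and the blocked server.
USER, PROXY, SERVER = "IP 1", "IP 2", "IP 3"

def relay_request(packet):
    """The bypassing server re-signs the user's request with its own IP."""
    assert packet == {"src": USER, "dst": PROXY}
    return {"src": PROXY, "dst": SERVER}

def relay_reply(packet):
    """The blocked server's reply is relayed back under the proxy's IP."""
    assert packet == {"src": SERVER, "dst": PROXY}
    return {"src": PROXY, "dst": USER}
```

The firewall only ever observes traffic exchanged between IP 1 and the proxy's IP 2, which is why it cannot link the session to the blocked IP 3.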
[Figure 2.3 content: the anonymous user (IP 1) inside the private network sends packets with source IP 1 and destination IP 2 to the bypassing server (IP 2), which forwards them with source IP 2 and destination IP 3 to the unauthorized server (IP 3); the replies follow the reverse path.]
2.3.3 Bypassing techniques
Nowadays, three common techniques are used to bypass firewalls. These techniques are easy to perform because several ports are opened by network administrators on firewalls in order to communicate with other networks, mainly the Internet. A good firewall must allow a corporation to conduct its activities on the Internet while guarding the corporation's private network from all sorts of attacks. To achieve this, many network administrators allow HTTP (port 80), HTTPS (port 443), SSH (port 22) and DNS (port 53) traffic. Ports 80 and 443 are mostly opened for users to browse and retrieve information from the Internet. In addition, the encryption, authentication and integrity mechanisms offered by the SSH protocol [22] make it one of the favourite tools for network administrators to access remote servers.
The SSH protocol is preferred to the TELNET protocol because TELNET lacks the three
mechanisms offered by SSH. That is to say that TELNET can easily be exploited to perform
attacks. DNS traffic is authorised for the resolution of addresses and is necessary for
accessing resources on the Internet. Some of the protocols listed above can be exploited to
encapsulate other traffic. For example, a user can exploit the forwarding and encapsulation
possibilities of SSH to convey HTTP traffic.
All three bypassing techniques require an external server located on the Internet. This
external server must not be blocked in the access control policies of the firewalls. The most
common techniques used to get around censorship are:
Encrypted tunnels;
Anonymizers or bypassing software;
CGI proxies.
2.3.3.1 Encrypted tunnels
To overcome the lack of encryption and authentication mechanisms in the TCP protocol,
tunnelling has been introduced to secure the traffic transmitted between two computers.
Tunnelling allows two computers to communicate or exchange data through an encrypted
channel. Protocols such as SSH and VPN offer the possibilities to users around the world to
access their organisation’s servers through a secure tunnel [22, 23, 24]. For instance, SSH
tunnels are used by many university students to access their files from home while
protecting their privacy through encryption and authentication. However, legal and illegal
traffic can both be transmitted through encrypted tunnels. Firewalls are unable to
perform their filtering functions because the payload of the packets transmitted over an
SSH or VPN tunnel is encrypted.
SSH and VPN are built according to the Client/Server architecture. The client connects to
the server and sends requests which are handled by the server. To bypass a firewall using
the tunnelling technique, a user needs the following tools:
A tunnelling client: A tunnelling client refers to a piece of software used to
communicate with a tunnelling server. For example, Putty [23] is an SSH client
developed to exchange data with an SSH server. The tunnelling client is mostly
installed and configured on the anonymous user’s computer and resides inside the
private network.
A tunnelling server: The tunnelling server is the server version of a tunnelling client.
In other words, it is a program which listens on a specific port for commands sent by
a client. To bypass a firewall, the tunnel server must be implemented outside the
private network. A home computer running an SSH server or VPN server is mostly
designated as the bypassing proxy.
An open port: the communication between a tunnelling client and a tunnelling
server is only possible if the firewall allows this traffic through an open port. This
port must be explicitly allowed in the access control policies.
A typical scenario showing access to an unauthorised server through an SSH tunnel is
illustrated in Figure 2.4. A user performs the bypassing of his organisation’s firewall by
firstly installing an SSH server on his home computer. This server must be connected to the
Internet and configured to accept connections from SSH clients on a dedicated port. The
user can also host the SSH server with an Internet Service Provider (ISP) or pay to use the
service of one of the dedicated SSH servers present on the Internet. An SSH client, such as
Putty, can then be used to connect to the SSH server from the organisation’s
network and forward the traffic on a local port dynamically. Finally, the user must configure
the web browsers to use the tunnel instead of accessing the Internet through the proxy
firewall. This is achieved through a SOCKS proxy, implemented with many web browsers.
Figure 2.4: Bypassing performed by the use of an encrypted tunnel. The filtering mechanism
of the firewall is defeated by the encrypted traffic flowing between the client side and the
server side of an encrypted tunnel.
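The handshake between the browser and the local end of such a tunnel uses the SOCKS protocol. The following Python fragment (illustrative only, not part of the thesis toolchain) builds the two messages a SOCKS5 client sends to the dynamically forwarded local port, following the message layout of RFC 1928:

```python
def socks5_greeting() -> bytes:
    # VER = 5, one authentication method offered, method 0x00 = "no auth"
    return bytes([0x05, 0x01, 0x00])

def socks5_connect(host: str, port: int) -> bytes:
    # VER = 5, CMD = 1 (CONNECT), RSV = 0, ATYP = 3 (domain name),
    # followed by the length-prefixed host name and the port in network byte order
    name = host.encode()
    return bytes([0x05, 0x01, 0x00, 0x03, len(name)]) + name + port.to_bytes(2, "big")
```

The browser writes these bytes to the forwarded local port (for example 127.0.0.1:1080); the SSH client then opens the requested connection from the far end of the encrypted tunnel, out of sight of the firewall.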
2.3.3.2 Anonymizer or bypassing software
An anonymizer is a program designed for the bypassing of proxy firewalls. This
bypassing tool is installed on a computer and acts like a proxy server between the Internet
and end‐users. The identity of users is concealed by the proxy server making their online
activities untraceable. A large variety of bypassing software is freely accessible by users on
the Internet. Anonymizers are classified in two main groups [25]:
Single‐point design: As seen from Figure 2.5 (a), an anonymizer implemented with
the single‐point design routes all the traffic through a single machine [25]. The
requests of the user travel through a single bypassing server and then hit the
intended server. The response from the server is also relayed to the user through
the same bypassing server. Lozdodge [26], a popular bypassing tool, is based
on the single‐point design.
Networked design: Contrary to the single‐point design, this design is more complex
and offers more privacy. In this design, the user’s requests go through a network of
computers before reaching their target. A random path is defined each time the user
makes a new request. From Figure 2.5 (b), the packets exchanged between the client
and the server travel through computers A, D, C, G and H. Tor [24] is a widespread
application using the networked approach for proxy bypassing. A user joins the Tor
network by installing a Tor client on his home computer to avoid his company or
school firewall.
Figure 2.5: Anonymizer: Single‐Point vs. Networked design
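The routing difference between the two designs can be sketched in a few lines of Python. This is a toy illustration only: real networked systems such as Tor add layered encryption and relay-selection policies on top of simple path choice, and the relay names here are arbitrary.

```python
import random

def single_point_path(server: str) -> list:
    # Single-point design: every request crosses the same bypassing server.
    return [server]

def networked_path(relays: list, length: int = 3) -> list:
    # Networked design: a fresh random path of distinct relays per request.
    return random.sample(relays, length)
```

Calling networked_path twice on the same relay set will generally return different paths, which is precisely what makes the networked design harder to trace than the single-point design.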
2.3.3.3 CGI proxy server
The Common Gateway Interface (CGI) is a standard that specifies the way web servers and
client programs such as web browsers interact [27]. The Common Gateway Interface
enables a web browser to send requests to a web server. At the same time, this mechanism
allows a web server to dynamically extract, process and forward the requested data in a
proper manner to a web browser [27]. CGI scripts are executable programs installed on web
servers to perform a specific task. For example, when a student requests the average of his
marks, a CGI script is executed on the web server to extract his marks, sum them up and
compute the average. This computation is automatically executed on the web server and
transparent to the student.
Through the years, many CGI scripts have been developed including bypassing scripts. A
bypassing script is a piece of code mostly written in Perl or PHP which is used by other
computers to retrieve web pages. A CGI proxy server or a web based proxy server is a
computer hosting a bypassing script accessible from the Internet in order to get around the
restrictions of proxy firewalls. Many CGI scripts, which allow users to bypass their
company’s firewalls, are available on the Internet. Contrary to the other bypassing
techniques mentioned above, this technique does not require the installation of software
nor does it require a deep understanding of the Internet protocols. Thousands of free CGI
proxies are advertised online and can easily be found using a search engine. After receiving
a request from a user, a CGI proxy retrieves the web page from the Internet and stores it
locally. The URLs of the objects contained in the webpage such as images, frames and
videos are then modified to point to the CGI proxy instead of the source server. Finally the
modified webpage is sent back to the client. Figure 2.6 depicts the use of a CGI proxy to
access a banned server indirectly.
Figure 2.6: Implementation of CGI proxy to bypass firewall restrictions. The traffic in blue is
blocked by the firewall. The traffic in red represents the use of a CGI proxy to access the
banned server illegally.
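The URL-rewriting step performed by a CGI proxy can be sketched as follows. This is a naive regex-based illustration rather than the actual Perl or PHP code of real bypassing scripts, and the proxy address is a made-up example:

```python
import re

def rewrite_urls(html: str, proxy_base: str) -> str:
    # Point every src/href attribute at the CGI proxy so that embedded
    # objects (images, frames, videos) are also fetched through it.
    def repl(match):
        attr, url = match.group(1), match.group(2)
        return f'{attr}="{proxy_base}?u={url}"'
    return re.sub(r'(src|href)="([^"]+)"', repl, html)
```

For instance, `<img src="http://banned.example/a.png">` becomes `<img src="http://proxy.example/p.php?u=http://banned.example/a.png">`, so the browser requests the image from the proxy rather than from the banned server.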
2.4 RISKS OF BYPASSING A FIREWALL
Bypassing a firewall can have serious consequences for a private or corporate network. In
recent years, the Internet went from being a safe platform for sharing data to an
environment infested with threats. Hundreds of malicious tools called “crimeware” have
been developed by cybercriminals to conduct their attacks. As defined by Aaron Emigh et al.
[28], “Crimeware is software that performs illegal actions unanticipated by a user running
the software; these actions are intended to yield financial benefits to the distributor of the
software”. In many cases, computers are infected by crimeware during their online
activities on the Internet. Malicious codes are embedded in emails or transmitted through
malicious URLs or social networking websites. The 2009 report of SOPHOS [29] on security
threats clearly states that every 4.5 seconds a malicious webpage is detected, which leads
to 7,008,000 new threats every year. According to the same report [29], a large number of
anonymizers or CGI websites present on the Internet are infected with malware. The
consequences posed by proxy bypassing are classified in four main groups: financial impact,
productivity loss, shortage of resources and privacy concerns. This section presents these
four consequences.
2.4.1 Financial impact
In 2006, Computer Economics estimated the damage caused by malware at US$13.2 billion
[30]. In the same way, the Federal Bureau of Investigation (FBI) estimated in 2005 a loss of
US$67.5 billion by national organisations due to cybercrime [31]. Bypassing traffic is a good
route to introduce malware into private networks. Once installed, malware can propagate
to the rest of the network and leak private information such as copyright materials valued
at thousands of dollars. For example, in mid‐December 2009, Google lost sensitive data due
to the breach of their network. The malware responsible for the leakage of information was
embedded in an email as a link pointing to a malicious website. Further investigations
identified 20 other companies, including Adobe, as victims of the same infection [32].
Apart from disclosing sensitive data, malware can also carry out DoS attacks on private
networks and disrupt some services. DoS attacks are also very costly to companies such as
online stores, banks and universities. The recovery of an attacked system is time consuming
for network administrators and costs thousands of dollars to organisations and corporations
[33].
2.4.2 Productivity loss
In general, a proxy firewall is referred to as a tool for security enhancement but it can also
help to increase the productivity of employees in workplaces. News, sports, social
networking, emails and video streaming websites are common websites visited by the
majority of people. At the present time, the majority of internet users have an email
account with some having a social networking account in addition. According to the
statistics of Compete Inc. [34], Facebook, Youtube and Myspace were some of the most
visited websites in 2009. Many employees do not limit their browsing habits to their home
or personal computers. They are more likely to visit the same websites even when in their
workplace. They tend to access their emails or post a message on social networking
websites while at work. Blocking access to these websites is a good way for employers to
decrease the loss of concentration during working hours and in the same way to maximise
productivity. Instead of giving unlimited access to employees during working hours, a
restricted list of websites relevant to the activities of the organisation is enforced on the
proxy firewall. By doing this, network administrators minimise the risk of infection by
malware while increasing the productivity of employees.
2.4.3 Shortage of resources
The third possible consequence of proxy bypassing is a shortage of network resources. As
mentioned earlier, bypassing activities maximise the risk of malware infections and hence
can cause the disruption of network resources and services. In many cases, the bandwidth
of many organisations is illegally exploited by employees to download large files such as
movies, music files or games instead of conducting their organisation’s activities. An
organisation with a limited bandwidth can experience a shortage in network resources if
illegal usage is made of it by users. All in all, the deployment of a restricting policy on the
corporation firewall can help to ensure that a good usage of network resources is made by
internal users.
2.4.4 Privacy concerns
A bypassing proxy helps a user to browse the Internet while remaining anonymous. The
privacy of the user is protected by the fact that the bypassing proxy removes the identifying
information of the user before forwarding his requests to other servers. The same task is
performed by the bypassing proxy on the data received from other servers before passing
the data back to the requesting user. The bypassing process is completely controlled by the
proxy server which raises privacy concerns. In fact, an untrustworthy proxy can log all the
traffic exchanged between a user and other servers. Authentication credentials and
personal information such as usernames, passwords, credit card, driver’s licence and bank
account numbers are at risk of being disclosed by bypassing proxies [35]. Additionally,
a bypassing proxy is a good tool for phishing. As an intermediary between a user and the
other servers, a bypassing proxy can collect personal information on users by displaying
illegitimate web forms. For example, a user trying to access his bank account online through
a bypassing proxy can be asked by the bypassing proxy to enter his bank account number,
password and additional information such as date of birth and home address.
2.5 SUMMARY
In this chapter, background information was provided to explain the different concepts
examined in this thesis. The risks to private networks posed by the bypassing of a proxy
firewall are seen to range from financial losses to the disclosure of sensitive information. In
addition, the operational mechanism of CGI proxies, encrypted tunnels and anonymizers
has been clarified in this chapter. The encryption capabilities of some web protocols and
the intermediary role played by bypassing proxies contribute significantly to avoiding
censorship and evading the restrictions of proxy firewalls.
CHAPTER 3
PREVIOUS WORK
In this chapter, related work in the discovery and blocking of the bypassing of proxy
firewalls will be presented. The three following sections will cover three common
techniques utilised for proxy bypassing: encrypted tunnels, anonymizers and CGI proxy
servers. Section 3.1 will present the relevant investigations that have been performed
through the years to detect the illegal use of encrypted tunnels to carry bypassing
traffic. Section 3.2 will explore the different approaches to minimizing the use of
anonymizers on private networks. Finally, the last section will focus on related work for
detecting CGI proxy servers.
3.1 ENCRYPTED TUNNELS
The increasing number of filtering mechanisms and access control infrastructures has been
closely accompanied by the intensive use of encrypted tunnels to bypass these restrictions.
Encrypted tunnels, operating mostly at the application layer of the OSI architecture, are
commonly utilised to carry illegal traffic past these restrictions. This is done by protocol
encapsulation, i.e. wrapping one protocol inside another. This enables unwanted
traffic to cross through firewalls and enter private networks. More specifically, a trusted
and authorised protocol is used in many cases as an envelope to carry illegitimate traffic in
and out of the security perimeter.
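Protocol encapsulation can be made concrete with a toy sketch: an inner protocol message (here an SSH banner) is hidden inside a field of an outer, firewall-authorised HTTP request. The host name and query field are invented for illustration, and real tunnels encrypt rather than merely encode the inner traffic:

```python
import base64

def encapsulate(inner: bytes) -> bytes:
    # Hide the inner protocol message inside a query field of an outer
    # HTTP request that the firewall is configured to allow.
    payload = base64.b64encode(inner).decode()
    return (f"GET /sync?d={payload} HTTP/1.1\r\n"
            f"Host: tunnel.example\r\n\r\n").encode()

def decapsulate(outer: bytes) -> bytes:
    # The cooperating endpoint strips the envelope and recovers the payload.
    field = outer.split(b"?d=")[1].split(b" HTTP/1.1")[0]
    return base64.b64decode(field)
```

To a packet filter that only checks the outer protocol, this exchange is indistinguishable from an ordinary, authorised web request.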
Much research has been done through the years to fingerprint encrypted tunnels that
deviate from their normal use. In general, an illegitimate use of an encrypted tunnel is
detectable by looking at the payload or non‐payload statistics of the traffic generated by
the tunnel. One of the major studies in this field was carried out by Manuel et al. [36],
who designed and implemented a technique to detect, with high accuracy, illegal HTTP
and SSH flows crossing a network. By analysing the inter‐arrival time, the size and order of
the packets transmitted during a session, their proposed statistical mechanism can predict
with high accuracy the encapsulation of other protocols in HTTP and SSH traffic. In previous
research [37], the conclusion was reached that by monitoring the behaviour of the three IP
properties mentioned earlier (size, inter‐arrival time and order of the packets), it was
possible to derive the protocol used for the data exchange. This discovery was then applied
on encrypted tunnels by collecting legitimate HTTP and SSH profiles and then comparing
them to a dataset made up of sessions of encapsulated protocols inside HTTP and SSH
protocols as well as sessions recorded from acceptable use of these protocols. Their
approach detected with high accuracy encapsulated protocols inside HTTP and SSH
sessions. Their detection mechanism was later tested on real network traffic with great
success.
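The non-payload statistics used in [36] and [37] can be derived from packet metadata alone. A minimal sketch, assuming each session is given as a list of (timestamp, size) pairs in arrival order:

```python
from statistics import mean

def flow_features(packets):
    """Per-session non-payload features from (timestamp_seconds, size_bytes) pairs."""
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    # inter-arrival time of consecutive packets
    inter_arrival = [b - a for a, b in zip(times, times[1:])]
    return {
        "n_packets": len(packets),
        "mean_size": mean(sizes),
        "mean_inter_arrival": mean(inter_arrival) if inter_arrival else 0.0,
    }
```

A legitimate profile is then a collection of such feature vectors computed over known-good sessions; a live session whose features deviate markedly from the profile suggests encapsulated traffic.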
In [38] Jeffrey Horton and Rei Safavi‐Naini investigated the inappropriate use of SSH tunnels
to hide unwanted traffic. They approached the problem by investigating the deviation
between a normal and an abnormal SSH session based on two Internet Protocol (IP)
properties: size and inter‐arrival time of IP packets. After analysing a large dataset of SSH
sessions, they established a direct relationship between the size of the packets and the use
of SSH protocol. Their investigation pointed out that during a normal SSH session such as
remote access or an interactive session with an SSH server, smaller IP packets are
exchanged between the SSH client and the SSH server. However, protocols such as HTTP
and FTP, respectively encapsulated inside SSH tunnels for web browsing and file transfer,
emulated larger IP packets. A similar investigation to [38] has also been conducted by Riyad
et al. in [39] using two supervised learning algorithms: AdaBoost and RIPPER. The goal of
their study was to classify SSH traffic and non‐SSH traffic using the two algorithms and thus
utilise the most efficient algorithm to predict the service running behind the SSH traffic. A
capture of network traffic emulated by many protocols, including SSH, was used as a
dataset. RIPPER was shown to be the best classifier with a 99% prediction accuracy of the
protocol encapsulated inside SSH traffic [39]. Finally, Kevin et al. identified HTTP tunnels
concealing illegitimate traffic generated by spyware and viruses in [40]. Their approach
relied on four non‐payload properties.
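The packet-size observation of [38] suggests a simple rule of thumb, sketched below. The fixed 500-byte threshold is an arbitrary illustration; the studies above fit statistical models to measured traffic rather than hard-coding a cut-off:

```python
def classify_ssh_session(mean_packet_size: float, threshold: float = 500.0) -> str:
    # Interactive SSH (keystrokes of a remote shell) exchanges small packets;
    # HTTP or FTP encapsulated in the tunnel produces markedly larger ones.
    return "interactive" if mean_packet_size < threshold else "suspected-tunnel"
```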
3.2 ANONYMIZERS OR BYPASSING SOFTWARE
Many firewalls incorporate a filtering mechanism to control the connectivity of internal
programs executed by users to the Internet. This mechanism is designed to enhance the
security of private networks by enforcing a list of appropriate software to be run by users.
The detection of unsafe programs on a private network has been intensively investigated
during the recent decades. In [41], Liang et al. investigated the detection of Skype traffic
circumvented through proxy firewalls. Skype is a popular program used for chatting but
mostly for Voice over IP (VoIP). Skype makes VoIP possible by
maintaining a continuous stream of data between the caller and the receiver. Their
investigation outlines that Skype traffic, even though encrypted, can be detected by
analysing payload as well as non‐payload statistics of the data transmitted.
A more significant study, on blocking specific applications from accessing the Internet, was
conducted on P2P traffic by Subhabrata et al. [42]. Signatures related to five common P2P
applications were collected and matched to real network traffic. The designed classifier
identified with high accuracy not only P2P traffic but also the application emulating the traffic.
Although much progress has been made in detecting unwanted traffic generated by illegal
programs, there are few, if any, investigations focusing on the detection of
anonymizers within a private network. All in all, an application can be blocked by a proxy
firewall if a detection system, based on signatures or patterns inherent to the application, is
deployed.
3.3 CGI PROXY SERVERS
Little research has been done in identifying the usage of CGI proxies to bypass proxy
firewalls. Many investigations have concentrated on detecting the encapsulation of a
protocol inside an encrypted tunnel and the fingerprinting of traffic generated by a specific
program. The traffic emulated by CGI proxies is very similar to normal web traffic which
makes it difficult to detect. Contrary to encrypted tunnels, CGI proxies are not real‐time
applications and do not need to maintain a tunnel for the data exchange. Data is
transmitted between the CGI proxy and the client in chunks of packets for a short period of
time. CGI proxies implemented through the HTTP protocol can be blocked with deep packet
inspection, to a certain extent. However, the encryption possibilities offered through the
Secure Socket Layer (SSL) and exploited by many CGI proxies nullify the efficiency of content
based filtering mechanisms. In addition, the absence of a bypassing program on users’
computers increases the complexity of characterising and detecting CGI proxy traffic. A
previous investigation [5] to classify CGI proxy traffic based on non‐payload statistics only
revealed low throughput, a high amount of data sent and an irregular URL format associated
with their use on a private network.
In [54], Heyning Cheng et al. investigated the source of a web page, retrieved with the
HTTPS protocol, based on the size of received objects. Their investigation was conducted on
a local mirror of an external website. Even though the traffic was encrypted, their
investigation revealed that the origin of a web page can be derived by scrutinising the size
of the objects sent by a web server to a browser. A similar investigation on HTTP traffic was
carried out by Andrew Hintz [55] on 5 web pages using the same parameter as in [54]. As in
[54], the size of received objects proved to be a reliable parameter for fingerprinting a web
page. Another study on HTTPS traffic was also conducted by Qixiang Sun et al. [56] on
100,000 web pages. In this case, the detection parameters were: the size of objects
received and the number of objects received. A large number of web pages were
successfully fingerprinted using these two parameters.
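The fingerprinting approach of [54], [55] and [56] can be sketched as a comparison between the object sizes observed on the wire and pre-built per-page profiles. The page names and the 5% tolerance below are illustrative assumptions:

```python
def match_page(observed_sizes, profiles, tolerance=0.05):
    """Return the names of profiled pages whose object sizes all match the
    observed ones within a relative tolerance."""
    hits = []
    for name, expected in profiles.items():
        if len(expected) != len(observed_sizes):
            continue  # the number of objects already rules this page out
        pairs = zip(sorted(observed_sizes), sorted(expected))
        if all(abs(o - e) <= tolerance * e for o, e in pairs):
            hits.append(name)
    return hits
```

Because only sizes and counts are compared, the sketch works whether the payload is readable or encrypted, which is the key property exploited by these studies.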
Overall, traffic analysis is a useful tool to infer the source of the data whether the traffic is
encrypted or not. However, the approach has not been applied to bypassing traffic, especially
that generated by the Glype script [49]. This research covers the analysis of HTTP and
HTTPS bypassing traffic to fingerprint the banned web pages by looking at the size of
received objects, the inter‐arrival time of packets, the number of TCP flows and the average
size of the packets.
3.4 SUMMARY
Many investigations have been carried out through the years to eradicate circumventing
traffic. However, the widespread use of encryption algorithms and the complexity of
network topologies require more investigation to keep up with existing and new bypassing
techniques. This chapter presented previous work conducted to fingerprint and detect
illegal traffic carried by encrypted tunnels, CGI proxies and dedicated bypassing programs.
The evidence suggests that CGI proxies are favoured for getting around restrictions because
of their flexibility and their vast number.
CHAPTER 4
GOALS AND EXPERIMENTS
This chapter starts by defining the goals and expectations of this investigation. It also
provides a brief description of the experiments which were formulated to attempt to solve
the problem investigated in this thesis.
4.1 GOALS
CGI proxies are efficient tools to bypass censorship. Filtering mechanisms based on the IP
address of inbound packets were acceptable in the early years of networking and the
Internet, but these mechanisms lost their effectiveness with the advent of CGI proxies. The IP
address, on its own, is not enough to deem inbound traffic trustworthy or legitimate. IP
spoofing is indirectly performed by CGI proxies on packets forwarded to a client, making
the packets appear as if they originate from a trusted source instead of a banned
server. Payload properties are essential in many cases to detect anomalies and illegal
activities. However, the approach, adopted in this investigation, will use non‐payload
properties of inbound packets. These detection properties are: the size of embedded
objects within a web page, the inter‐arrival time of the packets, the number of TCP flows
emulated during browsing activity and the average size of the packets. These properties are
derived just by observing network traffic exchanged between a client and a server during
web page retrieval.
An experimental setup of a network in a virtual environment will be used to investigate
the correlation between the direct access, HTTP and HTTPS bypassing accesses in terms of
size of objects embedded within a webpage, inter‐arrival time of the packets and number
of TCP flows. A successful classification of CGI bypassing traffic will prevent unwanted
traffic from entering private networks.
The self‐designed network profiles are built from the non‐payload parameters mentioned
previously. In addition, it is assumed that a blacklist is clearly defined on the web proxy.
Therefore, prior to launching the detection of illegal access to any entry of the blacklist,
each entry of a blacklist is accessed in order to build network profiles for each blacklisted
domain or URL.
4.2 EXPERIMENTS
The experiments conducted are focused on a small dataset. This is justified by the lack of
physical users to emulate large traffic. Training physical users and allowing them to bypass
the proxy server of the University of Western Sydney (UWS) was considered unsafe and
compromising the security of the network. Moreover, the experiments could be expanded
to a large dataset if the proposed detection mechanism proves efficient. The correctness of
the detection model, if proven through the experiments, will lead to evaluating the
detection prototype on a large dataset or orientate the investigation toward other aspects
of bypassing traffic. The studied dataset was made of 10 heterogeneous web pages, each
containing objects of different sizes. Each web page is accessed three separate times and the
resulting network traffic recorded.
A direct access to each web page is performed, followed by two subsequent accesses in HTTP
and HTTPS bypassing modes. For each webpage, the network profile in direct mode is
compared to those in HTTP and HTTPS bypassing mode in order to find out the correlation
between the direct access and the HTTP and HTTPS bypassing accesses with respect to the
detection parameters investigated in this study. If significant correlations are established between the
three accesses, this could lead to the detection of bypassing traffic. The total number of
times a web page is accessed to carry out the experiment is described in the following table
(Table 4.1). The evaluation of the correctness of the proposed model, if successful, would
lead to the expansion of this study by carrying out additional experiments in a more realistic
environment to confirm the results obtained in the virtual network.
Table 4.1: Total number of accesses for the experiments

                           Single webpage    Overall (10 web pages)
 Direct Access                   1                    10
 HTTP bypassing access           1                    10
 HTTPS bypassing access          1                    10
 Total                           3                    30
4.3 SUMMARY
The proposed detection mechanism is illustrated in Figure 4.1. During phase one, the
blacklisted URLs and domains are provided to the detection mechanism installed on a
computer. A web browser is then used to access each entry of the blacklist during phase
two. The statistics, generated by the traffic from phase two for each blocked URL or
domain, are then collected and stored in a data structure (phase three). The aggregate,
made of the size of embedded objects within a web page, the inter‐arrival time of the
packets and two characteristics of the TCP flow in particular the number of flows,
constitutes the pre‐built profiles for detecting further accesses to the same web pages. The
detection rules are established during the initial experiments by investigating the
correlation between bypassing traffic and direct access traffic. In other words, the initial
experiment investigates if the size of the object retrieved in direct access mode is similar to
those of the objects fetched in two bypassing modes. The same comparison is then made
for the inter‐arrival time of the packets and the number of TCP flows between the traffic in
direct access and the traffic in the two bypassing modes.
Web traffic generated during the browsing activity of a user (phase four) is matched to the
pre‐built profiles in phase 5. Bypassing traffic is then fingerprinted in phase 6 if live traffic
matches a pre‐built profile according to the rules established during the building phase of
the profiles. The efficiency of the proposed detection mechanism will be examined through
the results presented in the chapters to come.
Figure 4.1: Model for detecting CGI proxies’ traffic
CHAPTER 5
DESIGN AND IMPLEMENTATION
This chapter will present the parameters investigated in this research and the
implementation of the testing platform. Anomaly detection systems rely on patterns or
signatures to perform their task. These patterns or signatures are mostly derived from the
IP headers of network packets containing information such as the source IP address, the
destination IP address, the source port and the destination port. In the same way, the
detection model, proposed in this research, is based on some properties of the IP headers
of the packets exchanged between the bypassing server and the client.
In this approach, bypassing traffic is fingerprinted by comparing pre‐built profiles to live
HTTP and HTTPS sessions. To obtain pre‐built profiles, each entry of the proxy firewall
blacklist is retrieved in direct access mode and the detection parameters related to each
entry are stored in a text file. These parameters are the size of the objects embedded
within a webpage, the inter‐arrival time of the packets, the number of TCP flows required to
fetch each webpage and the average size of the packets. A positive alarm is triggered by the
detection system if a live HTTP or HTTPS session matches only one of the pre‐built profiles.
A virtual network, composed of the different parties involved in a bypassing scenario, was
set up to test the correctness of the proposed detection model.
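The alarm rule described above (a positive alarm only when a live session matches exactly one pre-built profile) can be sketched as follows; `matches` stands for whichever profile-comparison predicate is in use, such as the object-size comparison of this chapter, and the profile names are invented for illustration:

```python
def detect_bypassing(live_session, prebuilt_profiles, matches):
    """Return the blacklist entry whose profile the live session matches,
    or None when zero or more than one profile matches."""
    hits = [name for name, profile in prebuilt_profiles.items()
            if matches(live_session, profile)]
    return hits[0] if len(hits) == 1 else None
```

Requiring exactly one match keeps the rule conservative: an ambiguous session that resembles several blacklisted profiles at once does not raise an alarm.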
Section 5.1 will describe the network metrics involved in the building of the network
profiles. Section 5.2 will focus on the description of the hardware, software and network
configuration of the testing platform. The different programs, written in Python and JScript,
used to carry out the experiments are also provided in section 5.2.
5.1 NETWORK PROFILES
This section will describe the network metrics chosen to investigate a way to detect CGI
bypassing traffic on a private network. The metrics investigated in the course of this
research are: the frequency distribution of the size of the objects embedded within a web
page, the inter‐arrival time of the packets and the number of TCP flows. The detection system
relies on pre‐built network profiles. The process for building network profiles involves two
main steps, the first one being the data collection phase and the other the validation of the
correctness of this detection approach.
During the first phase, a preliminary experiment is run by fetching a list of web pages and
the statistics related to the network metrics investigated are collected. This phase is
decisive in proving the correctness of the proposed detection model. The purpose of
this phase is to establish the correlation between normal traffic and bypassing traffic. This is
achieved by retrieving a list of blocked web pages in three different modes:
1‐ Direct Access: In this mode a webpage is retrieved directly from the source server. In
other words, no proxy server is intercepting the request and relaying it to the source
server and sending the replies back to the client.
2‐ Bypassing access through HTTP protocol: Most Common Gateway Interface (CGI)
proxy servers are implemented through the Hypertext Transfer Protocol (HTTP). In
this mode, the objects contained in a webpage such as images, videos, CSS files and
scripts are transferred unencrypted. However, the IP address of the objects of the
webpage is modified to point to the bypassing server instead of the source server.
3‐ Bypassing access through HTTPS protocol: This mode is similar to the previous
mode. However, encryption is enabled through the use of the Secure Sockets Layer
(SSL) to enhance the privacy of the user and conceal the nature of the data exchange
between the client and the bypassing server.
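The URL rewriting described in mode 2 can be illustrated with a short sketch. The `browse.php?u=` query format is an assumption modelled on Glype‐style CGI proxies, and `bypass.example` is a placeholder host, not the thesis setup:

```python
import re
from urllib.parse import quote

# Illustrative sketch of the rewriting performed in HTTP bypassing mode:
# every absolute object URL in the fetched page is replaced by a link that
# goes through the bypassing server. The "browse.php?u=" format is an
# assumed Glype-style convention.
PROXY = "http://bypass.example/browse.php?u="

def rewrite(html):
    """Point each embedded object URL at the bypassing server."""
    return re.sub(
        r'(src|href)="(https?://[^"]+)"',
        lambda m: f'{m.group(1)}="{PROXY}{quote(m.group(2), safe="")}"',
        html,
    )

page = '<img src="http://blocked.example/logo.png">'
print(rewrite(page))
```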
Each web page from the blacklist is accessed three separate times corresponding to each of
the three specific modes previously mentioned. A web page in the blacklist is represented
by a URL. The statistics of the metrics related to the direct access of each webpage are then
extracted from the IP headers of the packets and stored in a data structure to obtain a
profile. Once all the network profiles from direct access mode are obtained, each web page
is retrieved two more times: one in HTTP bypassing mode and the other in HTTPS bypassing
mode. The network profiles derived from the two subsequent accesses are then compared
to the preliminary profiles to establish the rules of the proposed detection approach.
The second phase is the evaluation of the correctness of the detection approach. To do this,
the sizes of the objects embedded within each web page retrieved in HTTP and HTTPS
bypassing modes are compared with the network profile obtained during the direct access
mode. The aim is to find out for each web page the percentage of object matches between
bypassing mode and direct access mode in terms of the size of embedded objects. A high
percentage of object matches would be an accurate indicator of the real source of the webpage
accessed in HTTP or HTTPS bypassing mode. The next step is to compare the inter‐arrival
time of the packets and the number of TCP flows in bypassing mode and direct access
mode. The expectation is to observe a high inter‐arrival time and number of TCP flows in
bypassing mode compared to direct access mode. This can be justified by the fact that a
bypassing server adds one or more hops between the client and the source server.
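The object‐size comparison of this second phase might look like the following sketch. Exact‐size matching is a simplification of the frequency‐distribution comparison, and all names are hypothetical:

```python
def match_percentage(direct_sizes, bypass_sizes):
    """Percentage of direct-access object sizes also observed in
    bypassing mode. A high value suggests the bypassed page is the
    same page seen during direct access."""
    if not direct_sizes:
        return 0.0
    observed = set(bypass_sizes)
    hits = sum(1 for size in direct_sizes if size in observed)
    return 100.0 * hits / len(direct_sizes)

# Direct-access profile of a page vs. sizes seen in bypassing mode;
# one object is assumed wrapped/altered by the CGI proxy.
direct = [551, 1613, 7602, 388]
bypass = [551, 1613, 7602, 941]
print(match_percentage(direct, bypass))   # → 75.0
```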
5.1.1 Detection parameters
5.1.1.1 Size of embedded objects
Definition: The size of a webpage object is defined as the number of bytes, kilobytes or
megabytes occupied by this object. A webpage is a collection of objects, such as graphics,
scripts, Cascading Style Sheet (CSS) files and Hypertext Markup Language (HTML) files,
accessible to a web browser through the Internet. A client retrieves web pages from a web
server by sending requests to the hosting web server. The usual method to do this is for a
user to click on a hyperlink or enter the URL of a web page into a web browser. The
request is then submitted to the hosting web server through the GET or POST methods
implemented in the HTTP protocol [43]. The GET and POST methods are used to request
web pages and send data to a web server, respectively. Mostly, the retrieval of a web page
requires the use of subsequent HTTP requests to fetch the different objects that are present
in the webpage. It can be seen from Figure 5.1 that a request for the web page
www.example.com created four other requests to download objects 1, 2, 3 and 4. The
frequency distribution of the size of embedded objects within a webpage is an array made
of two columns. The first column of each entry in the array represents a distinct size of
objects contained within a webpage while the second column corresponds to the frequency
or number of objects matching the same size.
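Such a frequency distribution can be built directly from the list of observed object sizes. The sizes in this sketch are hypothetical:

```python
from collections import Counter

def size_distribution(object_sizes):
    """Two-column array: each entry holds a distinct object size and the
    number of objects within the page matching that size."""
    return sorted([size, count] for size, count in Counter(object_sizes).items())

# Hypothetical page containing two objects of identical size (150 kB).
sizes = [150_000, 100_000, 50_000, 150_000]
print(size_distribution(sizes))   # → [[50000, 1], [100000, 1], [150000, 2]]
```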
Justification: The size of the objects contained within a web page can be a key element to
identifying the source of a web page even if the traffic is encrypted. Sophisticated CGI
proxies use HTTP over Secure Sockets Layer to defeat the deep packet inspection
mechanism of proxy firewalls. Moreover, many CGI proxies wrap web pages fetched from a
source server with some proprietary information (mostly the CGI script) making the
detection of blacklisted web pages harder. For example, to retrieve the web page
www.example.com (see Figure 5.1), a CGI proxy will return four consecutive objects to the
client with respective sizes of 150kB, 100kB, 50kB and 150kB.
By creating a profile based on the size of the objects embedded within a webpage for each
entry of a blacklist and then monitoring the size of the different objects as they are received
by the client during a live HTTP or HTTPS session, the proposed detection system will try to
identify if a web page transmitted is blacklisted or not. However, this rule will be more
accurate on heterogeneous web pages. In other words, web pages received from different
sources, but containing identical objects with nearly the same size, will be more difficult to
differentiate. The dataset of this research is generated by retrieving web pages of different
sizes. These web pages are made of a large range of objects such as PDF files, graphic files,
video files, flash animation files, plaintext files and scripts.
Figure 5.1: Retrieval of a web page requiring multiple GETs to fetch embedded objects
5.1.1.2 Inter-arrival time
Definition: The inter‐arrival time between packet (n) and packet (n+1) is defined as the
difference (see Figure 5.2) between the arrival time of packet (n+1) and packet (n).
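A minimal sketch of this definition, using hypothetical capture timestamps in milliseconds:

```python
def inter_arrival_times(arrival_times):
    """Inter-arrival time between packet (n) and packet (n+1): the
    difference between the arrival time of packet (n+1) and packet (n)."""
    return [nxt - cur for cur, nxt in zip(arrival_times, arrival_times[1:])]

# Hypothetical arrival timestamps T1..T4, in milliseconds.
print(inter_arrival_times([0, 30, 55, 90]))   # → [30, 25, 35]
```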
Justification: This metric, derived from the IP properties of two consecutive packets, has
been used as an anomaly detection parameter in many cases [37, 38] with satisfactory
results. However, no major experiment has been performed on CGI proxies using the inter‐
arrival time as an anomaly detection parameter. With this in mind the expectation is to
detect anomalies related to the inter‐arrival time of packets transmitted by CGI proxies. In
fact, the use of a CGI proxy to bypass a firewall adds one or more additional hops to the
route of the packets transmitted between the source of a web page and the destination.
The intermediary role performed by a bypassing proxy may impact the inter‐arrival time of
the packets. Fingerprinting the source of a web page based on the size of the objects
received by the client would be a step forward. Furthermore, the discovery of a CGI proxy on
the route of incoming packets through the identification of a correlation between the use of
CGI proxies and the inter‐arrival time will increase the accuracy of the detection
mechanism.
Figure 5.2: Inter‐arrival time illustration (packets 1 to 4 arriving at times T1 to T4, with
Inter‐arrival Time 1 = T2 − T1, Inter‐arrival Time 2 = T3 − T2 and Inter‐arrival Time 3 = T4 − T3)
5.1.1.3 TCP Flows
Definition: During a TCP session, one or more streams of packets are exchanged between
two processes located on two separate machines. The transmission of the data is achieved
through a TCP socket implemented on both the client and the server side of the
communication. A TCP socket is a pair made of <IP address, Port number>. A TCP flow is a
unique TCP stream made of the combination of the client and server sockets that occurs
during a TCP session. In other words, a TCP flow is identified by the four‐tuple consisting of
<Source IP, Source Port, Destination IP, Destination Port>. Therefore, one or more TCP
flows are needed to retrieve a web page from a web server depending on its structure.
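The four‐tuple definition above translates directly into code. The packet records in this sketch are hypothetical:

```python
def count_tcp_flows(packets):
    """Count distinct TCP flows, each identified by the four-tuple
    <source IP, source port, destination IP, destination port>."""
    return len({(p["src"], p["sport"], p["dst"], p["dport"]) for p in packets})

# Hypothetical packet records: two packets on one connection, one on another.
pkts = [
    {"src": "192.168.1.2", "sport": 1025, "dst": "172.168.18.2", "dport": 80},
    {"src": "192.168.1.2", "sport": 1025, "dst": "172.168.18.2", "dport": 80},
    {"src": "192.168.1.2", "sport": 1026, "dst": "172.168.18.2", "dport": 80},
]
print(count_tcp_flows(pkts))   # → 2
```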
Justification: During an HTTP or HTTPS session, the web browser creates threads to handle
the transfer of data received from the server hosting the web page. Once the first three‐
way handshake is completed and the connection is established, the web browser sends the
first GET to retrieve the home page. The response from the web server is then parsed by
the web browser in order to identify embedded objects. Depending on the number and
type of the embedded objects, the web browser decides to spawn more threads to
download each object or reuse currently opened connections. A web browser can initiate
concurrent connections to the web server in order to speed up the transmission of the
data (see Figure 5.3). However, when the web server is heavily loaded, a limited number of
connections will be opened by the web server to transfer the data to the client. A CGI script,
commonly written in Java or PHP, plays the role of a web server by servicing the requests of
a client during a CGI bypassing technique. By inspecting the TCP flows generated during an
HTTP session, the expectation is to prove the presence of a CGI proxy in the route of
transmitted packets. More precisely, the aim is to find out the difference between the TCP
flows produced during the direct access and bypassing access of a web page in terms of the
number of flows required to fetch the web page and the average size of the packets
transmitted by the bypassing server.
Figure 5.3: TCP flows emulated during a TCP session and inter‐arrival time of each flow
5.2 IMPLEMENTATION OF THE TESTING NETWORK
5.2.1 Topology of the testing network
The implementation of a testing platform, comprising the parties involved in a bypassing
scenario, was a key element to this study. The purpose of the testing network is to:
‐ Validate the three parameters identified in this study to detect HTTP and HTTPS
bypassing traffic and establish the rules of the detection model by comparing the
statistics of these parameters in direct access mode, HTTP and HTTPS bypassing
modes.
‐ Evaluate the correctness of the detection model proposed in this research. The
testing network will evaluate the correctness of the detection model based on the
lower bound situation. As can be seen in Figure 5.4, only one router separated the
bypassing server, the blocked server and the proxy firewall.
The avoidance of censorship involves four different parties:
‐ A client or internal user is a computer from which the bypassing process is initiated.
This computer represents the first endpoint of the bypassing traffic.
‐ A proxy firewall is either a hardware device or software or a combination of both. Its
main function is to prevent the users within a private network from accessing
resources explicitly categorized as contrary to the security policies of that network.
The access control policies are enforced either by blocking some services such as
FTP, SSH and TELNET or filtering the traffic according to some predefined rules.
‐ A bypassing server or external proxy is the second endpoint of the bypassing traffic.
This endpoint is generally a web server hosting a bypassing program such as a CGI
script.
‐ A blocked server or blacklisted server represents a web server to which access is
restricted for the private network users.
A virtual network environment was reproduced for this investigation (see Figure 5.4). This
environment contained the four different parties involved in the bypassing of a proxy
firewall. More specifically, the testing network was made of five virtual machines: a proxy
server, a client computer, a bypassing server, a blocked server and a routing server. The
virtual machines were
created using VMware [50]. In addition, a total of 10 web pages, hosted on the blocked
server, were blacklisted on the proxy firewall and therefore represented the blacklist of the
bypassing environment. The role of each machine is described in the next section of this
chapter. Internal traffic originating from the private network was routed through the proxy
server (192.168.1.1) which acted as a default gateway for internal hosts. The
communication between the proxy server, the bypassing server (172.168.17.2) and the
blocked server (172.168.18.2) was enabled through the routing server.
Figure 5.4: Detailed topology of the virtual network
5.2.2 Hardware and Configuration
5.2.2.1 Physical machine
The testing platform was set up and run on a single physical machine. The description of the
different hardware resources of the physical computer is outlined in Table 5.1. It can be
seen from Table 5.1 that Windows® Vista® Professional was installed on the physical
computer along with VMWare workstation [50] which was used to create virtual machines
in order to reproduce the different actors involved in a bypassing scenario. No additional
software was installed on the physical machine to avoid interference and therefore allocate
all the hardware resources to the virtual machines.
Table 5.1: Description of the physical machine
Physical Machine
Processor Intel® Core™2 Duo CPU E7200 @ 2.53 GHz
Physical Memory (RAM) 4 GB
File system size 500 GB
File system type NTFS
Operating system Windows® Vista® Professional
Network interface cards Intel ® 82566DM‐2 Gigabit Network Connection
2 x VMware Virtual Ethernet Adapter
Software & Services VMware Workstation 6.5.2
5.2.2.2 Virtual machine 1: Proxy firewall
The first virtual machine was the proxy server. Two network cards were implemented on
the proxy server. The first network interface card was connected to the internal network
(192.168.1.0) while the second card allowed the proxy to communicate with external
networks (172.168.16.1). The proxy server played a key role in the investigation by
restricting access to the web pages hosted on the blocked server. Microsoft Internet
Security and Acceleration Server (ISA) 2004 was installed on the proxy server and used as
the firewall application. Web traffic through HTTP and HTTPS protocols was the only traffic
allowed by the proxy server. Moreover, the proxy server was configured to automatically
reject inbound connections to internal hosts while at the same time scrutinising outbound
connections to ensure their compliance with the security policies. Access to the blocked
server as well as the web pages hosted on it were explicitly denied to the private network
clients in the firewall rules. Wireshark was installed on the proxy server to sniff network
traffic. The description of the hardware and configuration of the proxy server is outlined in
Table 5.2.
Table 5.2: Description of the proxy server (virtual machine)
Virtual Machine: Proxy Server
Processor Intel® Core™2 Duo CPU E7200 @ 2.53 GHz
Physical Memory (RAM) 512 MB
File system size 60 GB
File system type NTFS
Operating system Microsoft® Windows® Server 2003 R2 EE SP2
Network interface cards Intel ® PRO/1000 MT Network Connection
VMware Accelerated AMD PCNet Adapter
Software & Services Microsoft ISA Server 2004
Wireshark 1.2.4
5.2.2.3 Virtual machine 2: Blocked server
The blocked server represents the unauthorized server accessed by a user using a bypassing
server. A virtual machine running a web server application (XAMPP) was implemented in
the testing platform to impersonate the banned server. The blocked server hosted web
pages blacklisted by the proxy server but accessible to the bypassing server. That is to say
that the blocked server was unreachable by hosts located on the private network
(192.168.1.0). The web pages hosted on the blocked server were accessible to internal users
through the bypassing server in two modes: HTTP bypassing mode or HTTPS bypassing mode.
This was made possible by enabling the OpenSSL library embedded within the web server
application XAMPP. Apart from XAMPP, no additional programs were installed on the
blocked server to maximise its performance. A virtual network interface card with the IP
address 172.168.18.2 was connecting the blocked server to the rest of the network
topology. Detailed information about the configuration of the blocked server is provided in
Table 5.3.
Table 5.3: Description of the blocked or blacklisted server (virtual machine)
Virtual Machine: Blocked Server
Processor Intel® Core™2 Duo CPU E7200 @ 2.53 GHz
Physical Memory (RAM) 512 MB
File system size 60 GB
File system type NTFS
Operating system Microsoft® Windows® Server 2003 R2 EE SP2
Network interface cards VMware Accelerated AMD PCNet Adapter
Software & Services XAMPP for Windows Version 1.7.3
5.2.2.4 Virtual machine 3: Bypassing Proxy
The configuration of the bypassing server was similar to the blocked server except that a
Common Gateway Interface (CGI) bypassing script was installed on the bypassing server for
the experiments (see Table 5.4). The bypassing server (172.168.17.2) was not blacklisted in
the proxy firewall security policies. This server was simulating the services offered by CGI
bypassing proxies available on the Internet to bypass censorship. A single virtual network
interface card was required to connect the bypassing server with the rest of the network.
Table 5.4: Description of the bypassing proxy (virtual machine)
Virtual Machine: Bypassing Proxy
Processor Intel® Core™2 Duo CPU E7200 @ 2.53 GHz
Physical Memory (RAM) 512 MB
File system size 60 GB
File system type NTFS
Operating system Microsoft® Windows® Server 2003 R2 EE SP2
Network interface cards VMware Accelerated AMD PCNet Adapter
Software & Services XAMPP for Windows Version 1.7.3 and Glype proxy v1.1
5.2.2.5 Virtual machine 4: Routing server
The routing server was simulating an Internet Service Provider (ISP) by routing the traffic
between the different networks (172.168.16.0, 172.168.17.0 and 172.168.18.0) except for
the internal network (192.168.1.0). The remote access and VPN functionalities of
Microsoft® Windows® Server 2003 were configured on this virtual machine to achieve the
routing. Three network interface cards were necessary on this virtual machine. It can be
seen from Figure 5.4 that the first card (172.168.16.2) was assigned to incoming and
outgoing traffic to and from the proxy server while the second network card (172.168.17.1)
and third card (172.168.18.1) were used for communicating with the bypassing server
(172.168.17.2) and the blocked server (172.168.18.2), respectively. By using only one router
to separate the different networks, the experiments will test the lower boundary of the
inter‐arrival time of the packets. In a real life scenario, bypassing packets will cross several
routers to reach their destination. Therefore, if a high inter‐arrival time is recorded in this
virtual network then these results can be applied to a more realistic situation. The hardware
configuration of the routing server is described in Table 5.5.
Table 5.5: Description of the Routing Server (virtual machine)
Virtual Machine: Routing Server
Processor Intel® Core™2 Duo CPU E7200 @ 2.53 GHz
Physical Memory (RAM) 512 MB
File system size 60 GB
File system type NTFS
Operating system Microsoft® Windows® Server 2003 R2 EE SP2
Network interface cards 2 x Intel ® PRO/1000 MT Network Connection
VMware Accelerated AMD PCNet Adapter
Software & Services Remote Access/VPN Server
5.2.2.6 Virtual machine 5: Client computer
The fifth virtual machine was the client computer (192.168.1.2) located inside the private
network. This computer was subjected to the restrictions of the proxy server. In other
words, the client computer was denied access to the blocked server but allowed to access
the bypassing server. This virtual machine was used for bypassing the access control rules of
the proxy firewall by accessing the blocked server. The hardware configuration of the
client computer is described in Table 5.6.
Table 5.6: Description of the client computer (virtual machine)
Virtual Machine: Client Computer
Processor Intel® Core™2 Duo CPU E7200 @ 2.53 GHz
Physical Memory (RAM) 512 MB
File system size 20 GB
File system type NTFS
Operating system Microsoft® Windows® XP Professional SP2
Network interface cards VMware Accelerated AMD PCNet Adapter
Software & Services Fiddler2
Traffic Generator Script
5.2.3 Software
5.2.3.1 VMware Workstation
VMware Workstation is a virtualization application used to emulate multiple virtual
machines, also called guests, on a physical machine, also known as host [50]. The physical
machine can be a desktop, laptop or server computer. For this investigation a desktop was
used. The hardware resources of the physical machine such as the processor, RAM, network
interface card(s) and the hard drive are shared between the virtual machines [50]. In
addition, VMware Workstation is capable of mounting peripherals such as DVD‐CD ROM
drive, USB, serial and parallel ports. Furthermore, it is possible to create virtual networks
with VMware Workstation. Besides the virtual network adapters that can be assigned to
virtual machines, VMware Workstation provides a virtual network that allows virtual
machines to communicate with each other. The virtual network platform generated by
VMware is similar to a TCP/IP network interconnecting physical machines through a switch.
The connectivity of virtual machines to physical networks is achieved through various
techniques such as:
‐ Creating a bridge between the NIC of the virtual machine and the NIC of the physical
machine;
‐ Using Network Address Translation to share the IP address of the host;
‐ Using a virtual network which is connected to the logical or physical network
interface card of the host.
VMware Workstation is mainly used for the following purposes:
‐ To run many operating systems on a single PC;
‐ To implement a testing environment;
‐ To develop and test software updates and patches;
‐ To provide computer‐assisted training.
5.2.3.2 ISA server 2004
Microsoft ISA Server 2004 is a layer 7 firewall that protects a private network against
threats from the Internet [44]. The ISA server also provides users with a secure method to
remotely access their data and applications. This is achieved by implementing a secure
channel between two separate networks.
Microsoft® ISA Server 2004 offers several features such as [44]:
‐ Caching: ISA server accelerates web traffic by storing a local copy of web pages
accessed by users;
‐ Advanced firewall functions: packet filtering, application filtering, content filtering,
access control rules and a web proxy;
‐ Server publishing functions: secure web publishing, preservation of source IP
addresses in Web publishing rules and inspection of SSL packets;
‐ VPN functions: implementation of a Virtual Private Network between two remote
sites, including filtering and inspection of VPN traffic, publishing of VPN servers and
IPSec tunnel mode for point‐to‐point VPN connections;
‐ Intrusion detection system: detection of attacks such as ping of death, IP half scan,
UDP bomb and port scanning.
5.2.3.3 XAMPP for Windows
XAMPP is an easy‐to‐install Apache distribution designed for developers. XAMPP is an
acronym where X is for multi platform (such as Windows and Linux), A for Apache, M for
MySQL, P for PHP and Perl for the last P [45]. XAMPP is available on Linux platforms,
Windows, Mac OS X and Solaris. The distribution for Windows contains Apache, MySQL,
PHP + PEAR, Perl, mod_php, mod_perl, mod_ssl, OpenSSL, phpMyAdmin, Webalizer,
Mercury Mail Transport System for Win32 and NetWare Systems v3.32, Ming, JpGraph,
FileZilla FTP Server, mcrypt, eAccelerator, SQLite, and WEB‐DAV + mod_auth_mysql [46].
XAMPP is licensed under the GNU General Public License and was not designed to be executed in a production
environment but in a development environment. As a result, the security configuration of
XAMPP is as open as possible for testing purposes. In this study, XAMPP was configured to
host the bypassing script and allowed both HTTP and HTTPS accesses to the script.
5.2.3.4 Wireshark
Wireshark is an open source “packet sniffer” that captures network packets and analyses
live network traffic or an image of network traffic that has been previously saved on a mass
storage [47]. According to [47], Wireshark is the most popular packet analyser used by
network professionals to troubleshoot network problems, understand protocols and
examine network traffic for security holes. In addition, spyware, virus activities and other
network anomalies are detectable using Wireshark [47]. In this research, Wireshark was
installed on the proxy firewall to capture network traffic to be used in later analysis.
5.2.3.5 Fiddler2
Fiddler2 [48] is an application that displays HTTP and HTTPS traffic generated by a web
browser. It is a tool which offers the ability to record and view HTTP/HTTPS interactions
between a web browser and a web server [48]. It is a useful application for debugging,
repairing, optimizing and verifying the safety of web sites. In addition, Fiddler2 can be used
to analyse the characteristics of web traffic such as HTTP headers, cookies, query strings
and the length of queries. As a result, Fiddler2 was used to capture direct access traffic,
HTTP and HTTPS bypassing traffic and to generate network statistics after the completion
of the user’s request.
Fiddler2 was essential in this research because of the JScript.NET scripting capabilities
embedded within this software. This feature allows users to write scripts to manipulate the
raw data captured by Fiddler2. Taking advantage of the capabilities offered by Fiddler2, a
script was written in JScript.NET to automatically compute the statistics of the three
parameters of the detection system. The output was saved in a Comma‐Separated Values
(CSV) file and the raw data was dumped in a file for further investigations. The source code
of this script is provided in Appendix (WebTrafficStats).
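The dump step of that script can be sketched in Python for illustration (the thesis script itself is JScript.NET). The field names and values here are illustrative, not the thesis CSV format:

```python
import csv
import io

# Hedged Python analogue of the statistics dump: one CSV row per URL with
# the three profile metrics. Field names and values are illustrative.
rows = [
    {"url": "http://172.168.18.2/page1.html",
     "object_sizes": "551;1613;7602;388",
     "mean_inter_arrival_ms": 0.03,
     "tcp_flows": 4},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue())
```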
5.2.3.6 Glype Proxy Script
Glype Proxy Script is a free PHP script installed on a web server to bypass censorship [49].
As a web based script, Glype Proxy Script downloads the web page(s) requested by a client’s
computer and then transfers them back to the client. This service is offered by many online
web proxies and allows users to browse the Internet anonymously. In other words, the web
proxy hides the IP address of the client’s computer by using its own IP to access the server
hosting the requested web page/web pages. Contrary to other bypassing techniques such
as the use of SSH and VPN tunnels, Glype Proxy Script eliminates the need to modify the
web browser settings in order to bypass censorship [49]. That is to say that no software
installation is required on the client’s computer. After accessing the web based proxy by
entering its IP address or domain name in a web browser, the user can start immediately to
browse the web anonymously through that proxy.
5.2.3.7 Traffic Generator
A dataset of bypassing traffic was an important factor in the experiments. Due to the lack of
physical users, a script called “traffic generator” was written in Python to simulate network
traffic by requesting web pages that have been blacklisted on the proxy firewall. A list of 10
URLs, each directing to a banned web page, was provided to the script in the form of a text
file. Each URL was then accessed sequentially by the script through Microsoft Internet
Explorer 8. The resultant traffic generated was captured on both the client’s machine using
Fiddler2 and the proxy firewall using Wireshark. The source of the traffic generator is
provided in the Annexes.
The different steps executed by the traffic generator are presented in Figure 5.5. The flow
chart can be divided in five main steps:
‐ Initialisation: During this phase, the traffic generator fetches the file containing URLs
randomly selected from the Internet. Also, Fiddler2 is started on the client machine
and is ready to capture web traffic.
‐ Decision: After reading the first line of the URLs file, the script moves to the next
step, the retrieval of a URL, if the End Of File (EOF) has not been reached.
Otherwise the execution of the script is stopped. This step allows the script to iterate
through the URLs.
‐ URL retrieval: A URL can be retrieved in three modes as mentioned before (Section
5.1): direct access, HTTP bypassing access or HTTPS bypassing access. Before the
retrieval of a URL, the cache and the cookies of previous web accesses are deleted to
ensure that the data retrieved is not served to the web browser from the cache but
directly fetched from the source server. This is critical for the investigation because
caching can reduce the inter‐arrival time of the packets and therefore compromise
the results. In a real world situation, caching is disabled by CGI proxy servers on the
client computer to erase any traces of bypassing activities. The next action is to
create a Microsoft Internet Explorer object. In direct access mode, the address bar of
the newly created Microsoft Internet Explorer is filled with the current URL. However,
in HTTP and HTTPS bypassing modes, the bypassing server URL is accessed first by the
script through the Microsoft Internet Explorer object and the current URL is then
passed to the bypassing server for retrieval.
‐ Statistics computation: The resulting web traffic generated during the access of a
URL is automatically captured by Fiddler2. Once the webpage is fully loaded, a
command written in JScript.NET is executed on Fiddler2 to compute the statistics of
the parameters of the network profile for each URL.
‐ Dumping of data: During this step, the raw data and the statistics are dumped
respectively in Fiddler format (SAZ) and Comma Separated Values (CSV) files. Finally,
the Microsoft Internet Explorer object is cleared to release the space in the RAM.
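The URL‐retrieval step reduces to constructing the address fetched in each mode; driving Internet Explorer and Fiddler2 is omitted here. The `browse.php?u=` format is an assumed Glype‐style convention, not necessarily the exact scheme of the thesis setup:

```python
from urllib.parse import quote

# Sketch of the URL-retrieval step: for each blacklisted URL, build the
# address requested in each of the three access modes. The bypassing-server
# paths below are assumed Glype-style conventions.
HTTP_PROXY = "http://172.168.17.2/browse.php?u="
HTTPS_PROXY = "https://172.168.17.2/browse.php?u="

def access_urls(blacklisted_url):
    """Return the address requested in direct, HTTP bypassing and HTTPS
    bypassing modes for one blacklisted URL."""
    encoded = quote(blacklisted_url, safe="")
    return {
        "direct": blacklisted_url,
        "http_bypass": HTTP_PROXY + encoded,
        "https_bypass": HTTPS_PROXY + encoded,
    }

for url in ["http://172.168.18.2/page1.html"]:
    print(access_urls(url))
```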
Figure 5.5: Flow chart of the traffic generator
5.3 SUMMARY
The simulation of a CGI bypassing scenario was fundamental to evaluate the accuracy of the
proposed detection approach. As a result, a bypassing environment was reproduced by
implementing a virtual network made of five virtual machines. The hardware configuration
and software necessary to reproduce the bypassing scenario were provided in this chapter.
Moreover, a description of the different parameters used to create a network profile and
the justification of the choice of each parameter are also explained in this chapter.
Successfully distinguishing and classifying CGI traffic in a virtual network is necessary to
expand the experiments to a large dataset generated from real world traffic.
CHAPTER 6
FINDINGS: RESULTS AND ANALYSES
This chapter presents the results of the experiments performed and the evaluation of the
efficiency of the proposed model. After creating a model to simulate a bypassing scenario
and implementing it in a virtual network, experiments were then carried out to determine a
possible way to detect CGI proxy bypassing traffic. Section 6.1 will describe the building
phase of the network profiles. After completing the building of initial network profiles, each
webpage is accessed randomly in HTTP and HTTPS bypassing modes. The live HTTP and
HTTPS bypassing traffic profiles are then compared with the pre‐built profiles to fingerprint
the webpage. Section 6.2 will present the comparison of the bypassing network profiles of
the first webpage with its corresponding pre‐built profile. The aggregation of the results of
all the web pages is then provided in Section 6.3 to validate the trends observed with a
single web page. Finally, the last section focuses on proposing a solution from the
aggregation of the different results.
6.1 INITIAL EXPERIMENT: PROFILE BUILDING
In the first experiment, web pages hosted by the blocked server were directly accessed from the client computer to collect, for each webpage, the size of embedded objects, the inter‐arrival time of the packets, the number of TCP flows and the average size of the packets. The direct access to the web pages was not routed through the bypassing server.
The collected data was then manually analysed and classified to obtain a network profile for
each webpage. An example of a profile is described in Table 6.1. It can be seen from this
table that the first webpage under investigation contains 4 embedded objects occupying
respectively 551 bytes, 1613 bytes, 7602 bytes and 388 bytes. In addition, the inter‐arrival
time of the packets was on average 0.03ms. In total, the web browser emulated 4 TCP flows
to acquire the first web page. The average size of packets transmitted was 582 bytes. HTTP
was the protocol used to download the web pages while similar profiles were created for
the rest of the web pages.
Table 6.1: Traffic profile of initial access
Network Profile
Text/HTML             551 bytes
Text/CSS              1613 bytes
Image 1               7602 bytes
Image 2               388 bytes
Inter‐arrival time    0.03 ms
TCP flows             4
Average packet size   582 bytes
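As an illustration, the profile of Table 6.1 can be held in a simple structure; the class and field names are assumptions made for this sketch, not names used in the thesis.

```python
from dataclasses import dataclass

@dataclass
class NetworkProfile:
    """Network profile of one webpage; field names are illustrative."""
    object_sizes: list       # sizes of the embedded objects, in bytes
    inter_arrival_ms: float  # average packet inter-arrival time, in ms
    tcp_flows: int           # number of TCP flows opened for the page
    avg_packet_size: int     # average packet size, in bytes

# Values taken from Table 6.1 for the first webpage.
profile_1 = NetworkProfile(
    object_sizes=[551, 1613, 7602, 388],  # Text/HTML, Text/CSS, Image 1, Image 2
    inter_arrival_ms=0.03,
    tcp_flows=4,
    avg_packet_size=582,
)
```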
6.2 SINGLE WEBPAGE RESULTS
Three network traffic profiles were created from the statistics of the traffic generated by
the 3 subsequent accesses to each web page. In this case, the bypassing server was used to
access each web page in HTTP and HTTPS bypassing modes. It was expected that the pre‐built profiles from the direct access of each web page would identically match the profiles collected through the bypassing server. The results obtained confirmed that even though a web page is accessed securely with HTTPS through a bypassing proxy, it is possible to predict with high accuracy the source of the web page based on the size of embedded objects.
It can be seen from Figure 6.1 that the direct access profile collected for web page 1 matches those collected later in terms of the size of embedded objects. During the direct
access, the size of the four embedded objects was 551 bytes, 1613 bytes, 7602 bytes and
388 bytes. The CGI traffic produces comparable results in HTTP and HTTPS bypassing mode.
The HTTP bypassing access recorded four objects with the size 564 bytes, 1642 bytes, 7611
bytes and 391 bytes. The difference between the size of each object for direct access, HTTP
and HTTPS bypassing accesses is negligible. However, a higher inter‐arrival time was
observed while accessing web pages through the bypassing server. In addition, the
bypassing traffic initiated 5 TCP flows in bypassing mode instead of 4 required for the
retrieval of the webpage in direct access mode. By analysing the additional TCP flow, it was
discovered that this flow was initiated in order to download the bypassing script necessary
for the user for future bypassing accesses. As mentioned earlier, CGI proxies can wrap a
requested webpage with proprietary information making the total size of the web page
larger. Additional TCP flows are then needed to download the extra data that is appended
to the webpage. Overall, according to the results of the experiments, the size of embedded
objects contained in a web page is a reliable parameter to predict the origin of a web page
as long as a profile of the web page was obtained beforehand. Nonetheless, the aggregated results of all the web pages investigated in the experiments are crucial to confirm this hypothesis.
Figure 6.1: Single webpage results (size in bytes of the embedded objects Text/HTML, Text/CSS, Image 1 and Image 2 of web page 1 in direct access, HTTP bypassing and HTTPS bypassing modes)
6.3 AGGREGATED RESULTS
The aggregated results of the experiments on five out of the ten web pages accessed are shown in Figure 6.2, Figure 6.3, Figure 6.4, Figure 6.5 and Figure 6.6. It can be seen from
these figures that the trends described by the single web page results are also observed for
the rest of the web pages. It was also observed during the simulation that the size of
embedded objects remained nearly constant for each access. Additionally, more TCP flows
are occurring while using the CGI proxy to access a web page. The inter‐arrival time of the
packets remained higher at 0.04 milliseconds for the bypassing traffic throughout the
experiments. The average size of the packets transmitted using the circumventing method
compared to the direct access of each web page was lower (see annexes). As seen from
Figure 6.2 to Figure 6.6, the similarity between the pre‐built profiles and the bypassing
traffic profiles, which were obtained by accessing each web page in direct access mode and
bypassing modes, is crucial in predicting the source of the web page. This is evident even
when the HTTPS protocol is being utilized to access the web page. The variation of the
inter‐arrival time, number of flows and average size of the packets will then enable the
detection system to confirm the presence of a CGI proxy.
The web pages being investigated in this simulation remained static throughout the
experiments. In addition the cache was cleared after each round in order to ensure that the
web pages are fetched from the blocked server and not served from the cache. In a real
world scenario, a monitoring mechanism would be necessary to track the updating of
blacklisted web pages and to re‐build network profiles. In other words, 100 updates of a
web page during a day will result in the web page being accessed 100 times by the
monitoring system to re‐build a new profile based on the new objects appended to the web
page. Furthermore, it can be seen from Figure 6.2 that web pages originating from the
bypassing proxy are easily detectable as blacklisted by matching them to the pre‐built
profiles. However, a conflict can occur between the data appended to a web page by a CGI
proxy and existing objects embedded in the webpage if the two types of objects are similar in size. In this case, the detection system will mismatch web pages and raise many false alarms.
Figure 6.2: Web page 1 results (size in bytes of the embedded objects Text/HTML, Text/CSS, Image 1 and Image 2 in direct access, HTTP bypassing and HTTPS bypassing modes)

Figure 6.3: Web page 2 results (size in bytes of the embedded objects Image 1 to Image 5 in direct access, HTTP bypassing and HTTPS bypassing modes)
Figure 6.4: Web page 3 results (size in bytes of the embedded objects Image 1 to Image 5 in direct access, HTTP bypassing and HTTPS bypassing modes)

Figure 6.5: Web page 9 results (size in bytes of the embedded objects in direct access, HTTP bypassing and HTTPS bypassing modes)
Figure 6.6: Web page 10 results (size in bytes of the embedded objects Image 1 to Image 8 in direct access, HTTP bypassing and HTTPS bypassing modes)
6.4 SUMMARY
The results of the experiments outlined the necessity to implement two sub‐mechanisms to
detect bypassing traffic. The fingerprinting of a blocked web page is performed by the first
sub‐mechanism by analysing the size of the embedded objects of a web page while the
second sub‐mechanism inspects the traffic for anomalies related to the average size of the
packets, number of TCP flows and the inter‐arrival time of the packets. According to the
results obtained, network traffic generated by a CGI proxy is characterised by a high inter‐arrival time of the packets and an abnormal average size of the packets transmitted.
From the results of the experiments, a bypassing proxy can be detected on a virtual
network by applying the rules outlined in Table 6.2.
Table 6.2: Detection condition of a CGI proxy traffic
Pre‐built profile Condition CGI profile
Size of object <= or >= Size of object
Inter‐arrival time < Inter‐arrival time
Average size of packets > Average size of packets
Number of TCP flows < Number of TCP flows
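A minimal sketch of applying the conditions of Table 6.2, assuming profiles are plain dictionaries; the 5% size tolerance and the field names are assumptions, not values from the thesis.

```python
def is_cgi_bypassing(pre, live, size_tol=0.05):
    """Apply the detection conditions of Table 6.2 to two profiles.

    pre/live are dicts with keys: sizes (list of object sizes in bytes),
    inter_arrival, avg_packet_size, tcp_flows. pre is the pre-built
    (direct access) profile, live is the observed traffic profile.
    """
    # Object sizes may be slightly smaller or larger in the CGI profile
    # (Table 6.2: "<= or >="); a tolerance approximates that condition.
    sizes_match = len(pre["sizes"]) <= len(live["sizes"]) and all(
        abs(a - b) <= size_tol * a for a, b in zip(pre["sizes"], live["sizes"])
    )
    return (sizes_match
            and pre["inter_arrival"] < live["inter_arrival"]      # higher through proxy
            and pre["avg_packet_size"] > live["avg_packet_size"]  # smaller packets
            and pre["tcp_flows"] < live["tcp_flows"])             # extra flows
```

With the web page 1 numbers from this chapter (direct: 551/1613/7602/388 bytes, 4 flows; bypassing: 564/1642/7611/391 bytes, 5 flows), the bypassing access satisfies all four conditions.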
CHAPTER 7
ADDITIONAL EXPERIMENTS
Additional investigation has been carried out to evaluate the accuracy of the proposed
detection approach in a more realistic situation. Firstly, a blacklist made of 70 web pages
was accessed to validate the reliability of the detection parameters in a physical network,
define the rules of the detection mechanism and build the network profile for each entry of
the blacklist. Then, 542 random accesses were made on the 70 web pages to evaluate the
efficiency of the detection approach. More specifically, 271 accesses were made in HTTP
bypassing mode and 271 in HTTPS bypassing mode. The results of the accuracy of the
detection model are presented in this chapter.
7.1 Physical network for accuracy evaluation
The purpose of the implementation of a physical network is to evaluate the accuracy of the
detection model in a more realistic situation (see Figure 7.1). This physical network was
made of three physical machines located on different networks across the Internet: a proxy
firewall, a client computer and a bypassing server. In addition, a total of 70 websites,
located on the Internet, were blacklisted on the proxy firewall and therefore represented
the blocked servers of the bypassing environment. It can be seen from the network
topology (see Figure 7.1) that the client of the private network (192.168.1.0) was connected
to the proxy firewall (192.168.1.1) via an IP router. A blacklist was enforced on the proxy
firewall to deny access to 70 websites. Two home computers, connected to BIGPOND
network, were used as proxy firewall and client. The bypassing server was a third computer
hosting an Apache web server containing a CGI script. This server was connected to the
Internet through the Internet Service Provider (ISP) IINET and was not explicitly listed as a blocked server in the blacklist.

Figure 7.1: Topology of the physical network for the accuracy evaluation

7.2 Accuracy evaluation script

The aim of this script is to evaluate the accuracy of the detection approach covered in this research. This script takes as inputs: a random web page selected from the proxy firewall blacklist, the pre‐built profiles obtained during the direct access of each web page of the blacklist and the detection rules. The different steps executed for the evaluation of the accuracy are presented in Figure 7.2. The flow chart can be divided into five main steps:

Initialisation: Fiddler is started to capture network traffic and the blacklist file containing URLs is fetched during this step.

Loop: The script will loop through until the maximum number of executions is reached. During each loop, a random web page is selected from the blacklist.
Retrieval of webpage in HTTP and HTTPS bypassing modes: During this step, the
random web page chosen during the previous step is retrieved in HTTP bypassing
mode. Once the web page is fully loaded, Fiddler2 computes automatically the live
traffic profile of the random web page. The same process is then repeated in HTTPS
bypassing mode.
Comparison of live traffic profile to pre‐built profiles: The live traffic profile of the
random webpage is fingerprinted during this step by comparing it to the pre‐built
profiles. At the end of this process, the live traffic profile will match zero, one or
many pre‐built profiles.
Classification of the web page: A webpage can be classified as a positive alarm, a
false alarm or unknown. After the comparison of the live traffic profile to the pre‐
built profiles, a positive alarm is raised if a unique pre‐built profile matches the
random webpage. This pre‐built profile must correspond to the network profile of
the direct access of the random webpage. In case the live traffic profile matches two
or more pre‐built profiles, the webpage is classified as a false alarm. An unknown
flag is triggered if no pre‐built profile is similar to the live traffic profile.
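The classification step above can be sketched as follows; treating a single match against the wrong pre-built profile as a false alarm is an assumption, since the text only defines the unique-and-correct case explicitly.

```python
def classify(matches, true_profile_id):
    """Classify a live traffic profile from its matching pre-built profiles.

    matches: ids of the pre-built profiles that matched the live profile.
    true_profile_id: id of the pre-built profile of the page actually fetched.
    """
    if len(matches) == 0:
        return "unknown"           # no pre-built profile is similar
    if len(matches) == 1 and matches[0] == true_profile_id:
        return "positive alarm"    # unique match, and it is the right page
    return "false alarm"           # two or more matches (or a wrong match)
```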
Figure 7.2: Flow chart of the evaluation of the efficiency of the detection approach
7.3 ACCURACY EVALUATION OF THE DETECTION APPROACH
7.3.1 Building phase of network profiles
The dataset of the experiments was increased from 10 web pages to 70 web pages to
evaluate the accuracy of the detection approach. In the initial experiment, the 70 web
pages were directly accessed from the client’s computer to determine the size of
embedded objects for each webpage, the inter‐arrival time and the number of TCP flows.
The direct access of each web page was then followed with two separate accesses of the
same web page: one in HTTP bypassing mode and the other in HTTPS bypassing mode.
Thus, three initial accesses were necessary for each web page to identify the correlation
between the three access modes.
Contrary to the web pages used in the virtual network, the web pages retrieved in this
physical network did not remain static throughout the experiments. This was due to the fact
that they were fetched from the Internet. Therefore, some websites were regularly updated
with new information. As for the virtual network, the cache was cleared after each access in
order to ensure that the web pages are fetched from the blocked server and not served
from the cache.
7.3.2 Frequency distribution of the size of embedded objects
The size of an object embedded on a web page is obtained by summing up the header size
and the payload size of the IP packets received during the downloading of the object.
Size of embedded object = ∑ (IP packets size)
or
Size of embedded object = ∑ (Header Size) + ∑ (Payload Size)
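These sums can be expressed directly; the `(header_size, payload_size)` tuples below stand in for real capture records.

```python
def object_size(packets):
    """Total size of an embedded object: sum of the sizes of its IP packets.

    packets: iterable of (header_size, payload_size) tuples, in bytes.
    """
    return sum(h + p for h, p in packets)

def header_size(packets):
    """Sum of the header sizes of the object's IP packets."""
    return sum(h for h, _ in packets)

def payload_size(packets):
    """Sum of the payload sizes of the object's IP packets."""
    return sum(p for _, p in packets)
```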
Many researchers have identified the size of the objects embedded within a web page as a
reliable parameter to fingerprint a web page. Therefore, it was expected that the frequency
distribution of the size of the objects embedded within a web page would be similar in
direct access mode, HTTP and HTTPS bypassing modes. As mentioned earlier, 3 initial
accesses are performed on each web page during the profile building phase. After the first
access, which is the direct access of a web page, the frequency distribution of the size of the
objects embedded within each web page in direct access mode is then compared with
those in HTTP and HTTPS bypassing modes. The percentage of objects for each web page in
HTTP and HTTPS bypassing modes matching the objects of the same web page in direct
access mode, in terms of size, is depicted in Figure 7.3. It can be seen from this figure that
the percentage of matches in the two bypassing modes compared to the percentage in
direct access mode is below 20% throughout the 70 web pages accessed in this
investigation. For some web pages, no matches were found between direct access and
bypassing modes.
The discrepancy between the size of the objects transmitted in direct access, HTTP and
HTTPS bypassing mode, can be explained by the fact that CGI proxies alter the headers’
information of the data retrieved from the source server before forwarding the data back to
the client. By doing that, the original headers are replaced by those of the CGI proxies. This
alteration, performed by CGI proxies on the headers of the IP packets relayed to the client,
increases or decreases the size of embedded objects.
An in depth view of the trends observed in Figure 7.3 is provided in Table 7.1. It can be seen
from this table that the objects of more than 50% of the web pages accessed in HTTP and
HTTPS bypassing did not match any objects of the same web pages accessed in direct mode.
37 and 39 web pages were recorded with no matches in HTTP and HTTPS bypassing modes
respectively (row in red). In addition, the majority of the other half of the web pages had a
match ranging between 1% and 6% of the objects fetched in direct access (rows in blue).
The intermediary role played by a CGI proxy between a source server and a client minimises
the possibilities of fingerprinting a web page. For this reason, the size of the objects
received by a web browser during an HTTP or HTTPS session is not a reliable parameter for
the detection of bypassing traffic carried out by CGI proxies. As a result, the frequency
distribution of each web page was divided into two: the first based on the header size and
the second on the payload of embedded objects.
The purpose of dividing the frequency distribution of the size of embedded objects is to
understand the origin of the discrepancies observed between the three access modes. An
increase or decrease of the size of the header or the payload of the packets in the HTTP or
HTTPS bypassing mode could justify the difference in size of the objects received in direct
access compared to those received in bypassing mode.
Figure 7.3: Comparison of the frequency distribution of the size of embedded objects within
a web page in direct access, HTTP bypassing access and HTTPS bypassing access.
Table 7.1: Repartition of web pages in relation to the percentage of matches of the size of
embedded objects in direct access compared to HTTP and HTTPS bypassing accesses.
Range (%)     HTTP Bypassing   HTTPS Bypassing
0%            37               39
0% - 1%       4                2
1% - 2%       7                6
2% - 3%       7                5
3% - 4%       5                4
4% - 5%       2                6
5% - 6%       5                1
6% - 7%       0                2
7% - 8%       0                2
8% - 9%       1                1
9% - 10%      0                0
10% - 20%     2                2
20% - 100%    0                0
TOTAL         70               70
7.3.3 Frequency distribution of the header size of embedded objects
An IP packet is made of two main parts: a header and a payload. The information contained
in the header of an IP packet, such as source IP, destination IP, source port and destination
port, is used to route the packet to its final destination. The total header size of an
embedded object is obtained by summing up the header size of the IP packets transmitted
to a web browser during the retrieval of the object.
Header Size of embedded object = ∑ (Header Size)
The frequency distribution of the header size of embedded objects is a good approach to
understand the inconsistency observed between the size of embedded objects within a web
page fetched in direct and the size of the same objects retrieved in HTTP and HTTPS
bypassing modes. In fact, if the header size of the packets received in direct mode is
marginally higher or lower than the header size of the packets received in HTTP and HTTPS
bypassing modes then this would explain the discrepancy observed in the frequency
distribution of the size of embedded objects within each web page (see Figure 7.4).
Figure 7.4 outlines the classification of web pages in relation to the percentage of matches
of the header size of embedded objects in direct access compared to HTTP and HTTPS
bypassing accesses. As can be seen from this figure, only a tiny fraction of the header sizes
of embedded objects in HTTP and HTTPS bypassing modes match the header sizes of the
object received in direct mode for the same web page. More specifically, there were no
matches between the header size of embedded objects of 31 and 30 web pages in HTTP
and HTTPS bypassing modes respectively when compared to the same web pages retrieved
in direct mode (see Table 7.2, row in red). Additionally, the matching percentage was marginal in HTTP and HTTPS bypassing modes for the rest of the web pages. As shown in
Table 7.2, 64 web pages accessed in HTTP bypassing mode had a matching percentage
between 0% and 10% (rows in blue). Of the remaining 6 web pages, only 1 webpage reached nearly 30%. The same trend was also observed in HTTPS bypassing mode: 63 web pages out of 70 had a matching rate between 0% and 10%, and only 2 web pages reached nearly 30%.
The findings made in this section imply that neither the size of embedded objects within a
web page nor the size of the header of the same objects relayed back to a client during an
HTTP or HTTPS bypassing session is a trustworthy parameter to fingerprint the origin of a
web page. If the alteration of original headers is necessary to hide circumventing traffic, the
actual data retrieved from a web server is unmodified by most CGI proxies. For that reason,
investigating the frequency distribution of the size of the payload may be an alternative
way for identifying a reliable parameter which would be almost constant in direct mode,
HTTP and HTTPS bypassing modes.
Figure 7.4: Comparison of the frequency distribution of the header size of embedded
objects within a web page in direct access, HTTP bypassing access and HTTPS bypassing
access.
Table 7.2: Repartition of web pages in relation to the percentage of matches of the header
size of embedded objects in direct access compared to HTTP and HTTPS bypassing accesses.
Range (%)     HTTP Bypassing   HTTPS Bypassing
0%            31               30
0% - 1%       1                1
1% - 2%       5                5
2% - 3%       6                5
3% - 4%       5                6
4% - 5%       5                3
5% - 6%       5                4
6% - 7%       1                3
7% - 8%       2                2
8% - 9%       3                0
9% - 10%      1                4
10% - 20%     5                5
20% - 100%    1                2
TOTAL         70               70
7.3.4 Frequency distribution of the payload size of embedded objects
The payload is the second part of an IP packet. It contains the data which is being
exchanged between a client and a server. The total payload size of an embedded object is
obtained by summing up the payload size of the IP packets transmitted to a web browser
during the retrieval of the object.
Payload Size of embedded object = ∑ (Payload Size)
It was expected that the size of the payload of embedded objects fetched in direct access
would be similar to the size of the same objects fetched in HTTP and HTTPS bypassing
modes. It can be seen from the trends shown in Figure 7.5 that the size of the payload of
embedded objects within a web page in HTTP and HTTPS bypassing modes match most of
the objects in direct access mode.
As shown in Table 7.3, more than 50% of the objects collected from the direct access of a
web page are identical to those collected in HTTP bypassing mode for 64 web pages (rows
in blue). Only a few web pages fetched in HTTP bypassing mode (rows in red) recorded a
matching percentage below 50%. The same observations were made with the HTTPS
bypassing mode. Even though the traffic was encrypted, a total of 65 web pages accessed in
direct mode had more than 50% of their embedded object payloads matching the payloads
of the same web pages accessed in HTTPS bypassing mode. Altogether, more than 90% of
the web pages composing the dataset of this research recorded a high percentage of
matches related to the payload of the objects embedded within each web page in direct
access, HTTP and HTTPS bypassing accesses. Consequently, the findings made in this section
indicate that the size of the payload of the objects embedded within a web page is a
reliable parameter for tracing the source of a web page.
Figure 7.5: Comparison of the frequency distribution of the payload size of embedded
objects within a web page in direct access, HTTP bypassing and HTTPS bypassing modes.
Table 7.3: Repartition of web pages in relation to the percentage of matches of the payload
size of embedded objects in direct access compared to HTTP and HTTPS bypassing modes.
Range (%)     HTTP Bypassing   HTTPS Bypassing
0% - 10%      0                1
10% - 20%     0                0
20% - 30%     1                2
30% - 40%     3                2
40% - 50%     2                0
50% - 60%     10               11
60% - 70%     13               14
70% - 80%     23               21
80% - 90%     16               17
90% - 100%    2                2
Total         70               70
7.3.5 Inter-arrival time
The average inter‐arrival time of a TCP flow is obtained by dividing the sum of the inter‐
arrival of the packets of the TCP flow by the number of packets.
The fetching of a web page through a CGI proxy adds one or more hops to the path of the
packets exchanged between a client and the source server. For this reason, it was expected
that the inter‐arrival time of the packets in bypassing mode would be higher than the inter‐arrival time of the packets in direct access mode. The results obtained from the experiments largely confirmed this expectation. It is clear from Table 7.4 that 54 web pages, equal to 77.14% of the dataset, recorded a higher inter‐arrival time in HTTP bypassing mode compared to direct access mode. In the same way, the average inter‐arrival time of 56 web pages, representing 80% of the web pages accessed in HTTPS bypassing mode, was higher compared to the inter‐arrival time of the packets in direct access
mode. Thus, it is evident from the observations made from the comparison between the
inter‐arrival time of packets in direct access and bypassing access that the inter‐arrival time
is a potential parameter to identify bypassing traffic. However, the evaluation of the
accuracy of the detection system will establish the degree of reliability of this parameter in
detecting bypassing traffic.
Table 7.4: Repartition of web pages in relation to the inter‐arrival time of the packets in
direct access compared to HTTP and HTTPS bypassing accesses.
                                   HTTP bypassing           HTTPS Bypassing
Inter‐arrival time                 Web pages   Percentage   Web pages   Percentage
Direct access < bypassing access   54          77.14%       56          80%
Direct access > bypassing access   16          22.86%       14          20%
Average inter‐arrival time = ∑ (inter‐arrival of packets) / Number of packets
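Computed from packet timestamps, the average defined in Section 7.3.5 can be sketched as follows; the division by the number of packets follows the formula given here, and the timestamps are illustrative.

```python
def avg_inter_arrival(timestamps):
    """Average inter-arrival time of a TCP flow's packets.

    timestamps: packet arrival times in seconds, in capture order.
    Per the formula above, the sum of the inter-arrival gaps is divided
    by the number of packets.
    """
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    n = len(timestamps)
    return sum(gaps) / n if n else 0.0
```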
7.3.6 Number of TCP flows

The objects embedded within a webpage are downloaded from the source server by the web browser through a TCP connection. The number of TCP flows is the total number of connections established between a client and a server during the retrieval of a web page. This number can vary depending on the size of the web page or the load of the server. The expectation, when investigating the number of TCP flows, was that bypassing traffic will generate more TCP flows than direct access traffic. This can be explained by the fact that additional objects, such as the CGI bypassing scripts, are added to the web page requested by a client's computer in order to allow the user to make further requests. Additional information implies that more TCP flows are needed in bypassing traffic to fetch all the objects. As an example, it can be seen from Figure 7.6 that during the fetching of the homepage of the university website (www.uws.edu.au) through the CGI proxy (www.glypeproxy.com), the web browser Firefox received not only the data related to the homepage but also a toolbar representing the CGI script (red square on Figure 7.6).

Figure 7.6: Retrieval of www.uws.edu.au through the CGI proxy www.glypeproxy.com.
The results from the experiments related to the number of TCP flows are highlighted in Table 7.5. It is evident from this table that around 21% and 17% of the web pages of the dataset did not match the expectation in HTTP and HTTPS accesses respectively. In other words, the number of TCP flows of 15 web pages was lower in HTTP bypassing mode than in direct access mode for the same web pages. In HTTPS bypassing access, the same trend was observed for only 12 web pages. Even though these percentages are marginal compared to the percentages of web pages matching the expectation (78.57% for HTTP bypassing access and 82.86% for HTTPS), these findings can have a significant impact on the
accuracy of the detection system. As for the inter‐arrival time, this parameter would need
to be validated as a reliable indicator of bypassing traffic during the evaluation of the
accuracy of the detection approach.
Table 7.5: Repartition of web pages in relation to the number of TCP flows in direct access compared to HTTP and HTTPS bypassing accesses.

                                   HTTP bypassing           HTTPS Bypassing
Number of TCP flows                Web pages   Percentage   Web pages   Percentage
Direct access < bypassing access   55          78.57%       58          82.86%
Direct access > bypassing access   15          21.43%       12          17.14%

7.4 DETECTION RULES

During the experiments, it was observed that the size of the payload of embedded objects remained nearly constant in direct access, HTTP and HTTPS bypassing accesses. This observation is crucial in predicting the source of the web page. Additionally, for the majority of web pages, more TCP flows occur while using the CGI proxy to access a web page, with either the HTTP protocol or the HTTPS protocol. Also, the inter‐arrival time of the packets remained higher for the bypassing traffic throughout the experiments.
The results of the experiments, carried out during the profile building phase, outlined the
necessity to implement two sub‐mechanisms to detect bypassing traffic. The fingerprinting
of a blocked web page is performed by the first sub‐mechanism by matching the payload
size of embedded objects within a web page while the second sub‐mechanism inspects the
traffic for anomalies related to the number of TCP flows and the inter‐arrival time of the
packets. According to the results obtained, the bypassing traffic generated by a CGI proxy is
generally characterised by a high inter‐arrival of the packets and an abnormal number of
TCP flows initiated to fetch a web page.
To sum up, a web page retrieved through HTTP or HTTPS bypassing traffic can be detected by comparing pre‐built profiles with live traffic profiles according to the rules outlined in Table 7.6. If the profile of live network traffic matches one of the pre‐built profiles after applying the detection rules, a positive alarm is then raised.
Table 7.6: Detection rules of bypassing traffic

Live traffic profile                        Rule                 Pre‐built profile
Frequency distribution of the payload      Match at least 50%   Frequency distribution of the payload
size of embedded objects                                        size of embedded objects
Inter‐arrival time                         >                    Inter‐arrival time
Number of TCP flows                        >                    Number of TCP flows
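A sketch of applying the rules of Table 7.6; the dictionary field names and the exact-size payload matching are assumptions made for this illustration.

```python
def payload_match_ratio(pre_payloads, live_payloads):
    """Fraction of pre-built payload sizes found among the live ones.

    Each live payload size can account for at most one match.
    """
    remaining = list(live_payloads)
    hits = 0
    for size in pre_payloads:
        if size in remaining:
            remaining.remove(size)
            hits += 1
    return hits / len(pre_payloads) if pre_payloads else 0.0

def matches_bypassing_rules(pre, live, threshold=0.5):
    """Apply the rules of Table 7.6: at least 50% payload-size matches,
    plus a higher inter-arrival time and TCP flow count in the live
    (bypassing) traffic than in the pre-built profile."""
    return (payload_match_ratio(pre["payloads"], live["payloads"]) >= threshold
            and live["inter_arrival"] > pre["inter_arrival"]
            and live["tcp_flows"] > pre["tcp_flows"])
```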
7.5 RESULTS OF THE ACCURACY OF THE DETECTION APPROACH
The significance of this study lies in the evaluation of the accuracy of the proposed
approach. Therefore the accuracy of the detection approach will be analysed based on the
combination of the parameters, as follows:
- Frequency distribution of payload
- Frequency distribution of payload and average inter-arrival time
- Frequency distribution of payload, inter-arrival time and number of TCP flows
7.5.1 Results of HTTP bypassing mode
7.5.1.1 Frequency distribution of the size of payload
The first evaluation of the accuracy of the detection model was carried out with the
frequency distribution of the size of the payload of embedded objects as the only detection
parameter. It is clear from Figure 7.7 that the accuracy of the detection mechanism was
above 50% for matching percentages between 50% and 70%. More
specifically, 233 web pages accessed randomly were successfully fingerprinted when the
percentage of payload size matches between pre‐built profiles and live HTTP bypassing
traffic was set between 50% and 55%. At the same time, 31 web pages were classified as
unknown as no match was found for these web pages when compared to the pre‐built
profiles. That is to say that the frequency distribution of the payload of these web pages did
not match at least 50% of objects from the collection of embedded objects of any of the
pre‐built profiles. Finally, the detection recorded more than 1 match for 7 web pages.
It can also be seen from Figure 7.7 that, as the required percentage of payload size matches
increases, the number of positive alarms drops sharply while, simultaneously, more web
pages are left unclassified. However, raising the matching percentage also reduced the
number of false alarms, from 7 web pages to 1 once the threshold exceeded 70%.
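This trade-off (a higher matching threshold trims false alarms but pushes pages into the unknown class) can be sketched as a threshold sweep. The per-page match scores below are hypothetical; only the sweep logic mirrors the evaluation procedure:

```python
# Sketch: classify each page at increasing match thresholds. A page whose best
# match score falls below the threshold becomes "unknown"; a page matching the
# wrong profile above the threshold is a false alarm. Scores are made up.
def sweep(pages, thresholds):
    """pages: list of (best match ratio, fingerprinted correctly?) per web page."""
    out = {}
    for t in thresholds:
        positive = sum(1 for score, ok in pages if ok and score >= t)
        false = sum(1 for score, ok in pages if not ok and score >= t)
        out[t] = (positive, false, len(pages) - positive - false)
    return out

# Six hypothetical pages: (score, correctly fingerprinted?)
pages = [(0.9, True), (0.72, True), (0.55, True), (0.52, False), (0.4, True), (0.3, True)]
print(sweep(pages, [0.5, 0.7]))  # {0.5: (3, 1, 2), 0.7: (2, 0, 4)}
```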
Figure 7.7: Evaluation of the accuracy of the detection approach according to the frequency
distribution in HTTP bypassing mode.
7.5.1.2 Frequency distribution of payload combined with inter-arrival time
The second evaluation of the detection model accuracy was carried out with the frequency
distribution and the inter‐arrival time as the parameters for detection. In this evaluation a
web page is fingerprinted if the following rules are met:
Live traffic profile                             Rule                 Pre-built profile
Frequency distribution of the size of payload    Match at least 50%   Frequency distribution of the size of payload
of embedded objects                                                   of embedded objects
Inter-arrival time                               >                    Inter-arrival time
Figure 7.7 data (number of web pages per outcome, with the percentage of the 271 sampled
pages, for each range of payload size matches between direct access and bypassing access):

Match range   Positive        False        Unknown
50%-55%       233 (85.98%)    7 (2.58%)    31 (11.44%)
55%-60%       218 (80.44%)    4 (1.48%)    49 (18.08%)
60%-65%       206 (76.01%)    4 (1.48%)    61 (22.51%)
65%-70%       163 (60.15%)    4 (1.48%)    104 (38.38%)
70%-75%       135 (49.82%)    1 (0.37%)    135 (49.82%)
75%-80%       103 (38.01%)    1 (0.37%)    167 (61.62%)
80%-85%       68 (25.09%)     1 (0.37%)    202 (74.54%)
85%-90%       28 (10.33%)     1 (0.37%)    242 (89.30%)
90%-95%       12 (4.43%)      1 (0.37%)    258 (95.20%)
95%-100%      0 (0.00%)       1 (0.37%)    270 (99.63%)
The results of the second evaluation are shown in Figure 7.8. From this figure, it is evident
that adding the inter‐arrival time as a detection parameter did not increase the accuracy of
the detection approach. Contrary to the first evaluation, where 85.98% of web pages were
successfully fingerprinted, it can be seen from Figure 7.8 that the accuracy of the detection
system dropped to 75.66% for a matching percentage set between 50% and 55%. For
instance, the detection system recorded 205 positive alarms, 7 false alarms and 59
unclassified web pages in this evaluation, compared to 233 positive alarms, 7 false alarms
and 31 unclassified web pages in the first evaluation for the same matching range. The
same trends were observed throughout this evaluation when the matching percentage
increased. However, the number of false alarms remained steady in the first and second
evaluation. It is clear from this observation that the accuracy of the detection system
dropped due to an increase of unclassified web pages.
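The accuracy percentages quoted throughout this section are simply the positive-alarm counts over the 271 randomly accessed pages, taken here from the 50%-55% bin of the figures:

```python
# Accuracy figures as outcome counts over the 271 sampled web pages
# (counts from the 50%-55% bin of the first and third evaluations).
total = 271
print(round(100 * 233 / total, 2))  # 85.98  (first evaluation, positives)
print(round(100 * 147 / total, 2))  # 54.24  (third evaluation, positives)
```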
Overall, the findings from the second evaluation showed that the inter-arrival time is not a
reliable parameter to detect bypassing traffic. This parameter decreased the accuracy of the
detection approach by increasing the rate of unclassified web pages. At the same time, the
inter‐arrival time had no effect on reducing the number of false alarms.
Figure 7.8: Evaluation of the accuracy of the detection approach according to the frequency
distribution and the inter‐arrival time in HTTP bypassing mode.
7.5.1.3 Frequency distribution of payload combined with inter-arrival time and the
number of TCP flows
The third evaluation of the detection approach accuracy was carried out combining the
frequency distribution, the inter‐arrival time and the number of TCP flows as parameters
for detection. In this evaluation, a web page is fingerprinted if, in addition to the rules of the
second evaluation, the number of TCP flows recorded for the live traffic profile is greater
than that of the pre-built profile. The results obtained from the third evaluation are
presented in Figure 7.9. It can clearly be seen from this figure that the accuracy of the
detection system dropped by nearly half throughout this evaluation in relation to the
205 192
181
146 120
91 59
27 12
0 7 4 4 4 1 1 1 1 1 1
59 75 86
121 150
179 211
243 258 270
0
50
100
150
200
250
300
50 ‐55 55 ‐ 60 60 ‐ 65 65 ‐ 70 70 ‐ 75 75 ‐ 80 80 ‐ 85 85 ‐ 90 90 ‐ 95 95 ‐ 100
Number o
f webpage(s)
Percentage of payload size matches between direct Accees and Bypassing Access
DETECTION ACCURACY: PAYLOAD SIZE + INTER‐ARRIVAL TIME
POSITIVE FALSE UNKNOWN
50%‐55%
55%‐60%
60%‐65%
65%‐70%
70%‐75%
75%‐80%
80%‐85%
85%‐90%
90%‐95%
95% ‐100%
POSITIVE 75.65 70.85 66.79 53.87 44.28 33.58 21.77 9.96 4.43 0.00 FALSE 2.58 1.48 1.48 1.48 0.37 0.37 0.37 0.37 0.37 0.37 UNKNOWN 21.77 27.68 31.73 44.65 55.35 66.05 77.86 89.67 95.20 99.63 TOTAL 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00 100.00
Chapter 6 – Findings: Results and Analyses
Page |
89
accuracy recorded during the first evaluation. The accuracy decreased from about 75% in
the second evaluation to just 54% during the third evaluation. As in the second evaluation,
adding the number of TCP flows to the detection parameters caused a sharp rise in
unclassified web pages and a dramatic fall in positive alarms. However, the rate of false
alarms decreased steadily, reaching 0 when the matching percentage was set between 70%
and 75%. Overall, the number of TCP flows observed during an HTTP session is not a reliable
indicator to expose bypassing activities on a private network. However, this parameter is a
good metric to minimise the false alarms raised by a detection mechanism of bypassing
traffic in HTTP mode.
Figure 7.9: Evaluation of the accuracy of the detection approach according to the frequency
distribution, the inter‐arrival time and the number of TCP flows in HTTP bypassing mode.
Figure 7.9 data (number of web pages per outcome, with the percentage of the 271 sampled
pages, for each range of payload size matches between direct access and bypassing access):

Match range   Positive        False        Unknown
50%-55%       147 (54.24%)    7 (2.58%)    117 (43.17%)
55%-60%       140 (51.66%)    3 (1.11%)    128 (47.23%)
60%-65%       133 (49.08%)    3 (1.11%)    135 (49.82%)
65%-70%       103 (38.01%)    3 (1.11%)    165 (60.89%)
70%-75%       87 (32.10%)     0 (0.00%)    184 (67.90%)
75%-80%       61 (22.51%)     0 (0.00%)    210 (77.49%)
80%-85%       36 (13.28%)     0 (0.00%)    235 (86.72%)
85%-90%       12 (4.43%)      0 (0.00%)    259 (95.57%)
90%-95%       4 (1.48%)       0 (0.00%)    267 (98.52%)
95%-100%      0 (0.00%)       0 (0.00%)    271 (100.00%)
7.5.2 Results of HTTPS bypassing mode
7.5.2.1 Frequency distribution of the size of payload
The rules applied during the first evaluation of the accuracy in HTTP bypassing mode were
identical to those used in this evaluation. The trends observed in HTTP bypassing mode
were very similar to those observed in HTTPS. As shown in Figure 7.10, an accuracy of
84.13% was recorded when the matching percentage was set between 50% and 55%. As for
HTTP, the accuracy dropped as the matching percentage increased. However, it is
important to notice that the number of false alarms stayed low compared to the results
obtained in HTTP bypassing mode.
Figure 7.10: Evaluation of the accuracy of the detection approach according to the frequency distribution in HTTPS bypassing mode
Figure 7.10 data (number of web pages per outcome, with the percentage of the 271 sampled
pages, for each range of payload size matches between direct access and bypassing access):

Match range   Positive        False        Unknown
50%-55%       228 (84.13%)    9 (3.32%)    34 (12.55%)
55%-60%       217 (80.07%)    2 (0.74%)    52 (19.19%)
60%-65%       200 (73.80%)    0 (0.00%)    71 (26.20%)
65%-70%       165 (60.89%)    0 (0.00%)    106 (39.11%)
70%-75%       129 (47.60%)    0 (0.00%)    142 (52.40%)
75%-80%       111 (40.96%)    0 (0.00%)    160 (59.04%)
80%-85%       61 (22.51%)     0 (0.00%)    210 (77.49%)
85%-90%       35 (12.92%)     0 (0.00%)    236 (87.08%)
90%-95%       14 (5.17%)      0 (0.00%)    257 (94.83%)
95%-100%      0 (0.00%)       0 (0.00%)    271 (100.00%)
7.5.2.2 Frequency distribution of payload combined with inter-arrival time
The results of the second investigation are shown in Figure 7.11. The same detection rules
as those applied in the second investigation of HTTP bypassing mode were used in this
investigation. From Figure 7.11, it can be seen that the trends of the second evaluation in
HTTP and HTTPS bypassing modes are almost identical. However, the impact of the inter-
arrival time on the accuracy of the detection is marginal in the HTTPS bypassing mode. In
other words, the majority of the web pages randomly accessed in HTTPS bypassing mode
had a higher inter-arrival time compared to the direct access mode.
As seen in Figure 7.11, the accuracy dropped from 84.13% in the first investigation to 80%
in this investigation for a matching percentage set between 50% and 55%. For the same
investigation in HTTP bypassing mode, the accuracy dropped almost by 10% which is double
the figures recorded for HTTPS mode. In fact, during the HTTPS bypassing scenario,
encryption is applied to the packets exchanged between a client and a server. The process
of encrypting and decrypting IP packets delays the arrival of the packets to the client. This
could explain why the arrival time of the packets is affected in this mode. In addition, it is
clear from Figure 7.11 that the inter-arrival time had no impact in decreasing the number of
false alarms.
Overall, the inter‐arrival time is a reliable parameter to find out the origin of a web page
when HTTPS is used to bypass censorship. High accuracy was scored by the detection model
when the matching range was set between 50% and 65%. This optimal range was larger in
HTTP bypassing mode where high accuracy was still recorded up to 75% of matching
percentage. As a result, it is crucial to fine‐tune the detection system depending on the
protocol used during a bypassing mode to fingerprint the majority of the web pages
blacklisted on the proxy firewall.
Figure 7.11: Evaluation of the accuracy of the detection approach according to the
frequency distribution and the inter‐arrival time in HTTPS bypassing mode.
7.5.2.3 Frequency distribution of payload combined with inter-arrival time and the
number of TCP flows
The results obtained from the use of the three parameters are summarized in Figure 7.12. It
can be seen from this figure that the accuracy of the detection approach declined to 51.29%
for the first matching percentage range (50%-55%). The same observations were noticed
across the 271 web pages accessed randomly in HTTPS bypassing mode. As in the HTTP
bypassing mode, the number of TCP flows is not a good indicator to identify the source of a
web page with high accuracy.
Figure 7.11 data (number of web pages per outcome, with the percentage of the 271 sampled
pages, for each range of payload size matches between direct access and bypassing access):

Match range   Positive        False        Unknown
50%-55%       217 (80.07%)    9 (3.32%)    45 (16.61%)
55%-60%       208 (76.75%)    2 (0.74%)    61 (22.51%)
60%-65%       191 (70.48%)    0 (0.00%)    80 (29.52%)
65%-70%       157 (57.93%)    0 (0.00%)    114 (42.07%)
70%-75%       124 (45.76%)    0 (0.00%)    147 (54.24%)
75%-80%       106 (39.11%)    0 (0.00%)    165 (60.89%)
80%-85%       61 (22.51%)     0 (0.00%)    210 (77.49%)
85%-90%       35 (12.92%)     0 (0.00%)    236 (87.08%)
90%-95%       14 (5.17%)      0 (0.00%)    257 (94.83%)
95%-100%      0 (0.00%)       0 (0.00%)    271 (100.00%)
Figure 7.12: Evaluation of the accuracy of the detection approach according to the
frequency distribution, the inter‐arrival time and the number of TCP flows in HTTPS
bypassing mode.
Figure 7.12 data (number of web pages per outcome, with the percentage of the 271 sampled
pages, for each range of payload size matches between direct access and bypassing access):

Match range   Positive        False        Unknown
50%-55%       139 (51.29%)    8 (2.95%)    124 (45.76%)
55%-60%       133 (48.90%)    1 (0.37%)    138 (50.74%)
60%-65%       125 (46.13%)    0 (0.00%)    146 (53.87%)
65%-70%       96 (35.42%)     0 (0.00%)    175 (64.58%)
70%-75%       81 (29.89%)     0 (0.00%)    190 (70.11%)
75%-80%       70 (25.83%)     0 (0.00%)    201 (74.17%)
80%-85%       34 (12.55%)     0 (0.00%)    237 (87.45%)
85%-90%       15 (5.54%)      0 (0.00%)    256 (94.46%)
90%-95%       6 (2.21%)       0 (0.00%)    265 (97.79%)
95%-100%      0 (0.00%)       0 (0.00%)    271 (100.00%)
CHAPTER 8
CONCLUSION

8.1 Contribution

The design and testing of a mechanism to detect and block circumventing traffic was the
main aim of this research. The bypassing of proxy firewalls was investigated in this thesis by
using simulation. Many firewall products lack an efficient mechanism to detect and block
circumventing traffic. A detection approach, to detect CGI bypassing traffic, was proposed
in the research and evaluated by testing it in a virtual network. The detection is performed
in two phases. In the first phase, network profiles of blacklisted web pages were created.
The incoming traffic was then matched to pre-built profiles to fingerprint traffic emanating
from a blocked server. The size of the different objects embedded in the web page is
utilized as the detection parameter during the first phase. Once a blacklisted web page is
successfully fingerprinted, an inspection of the TCP flows used to fetch the web page, the
average size of the packets transmitted and the inter-arrival time of the packets is then
compared with the statistics of incoming traffic to predict if the traffic is originating from a
CGI proxy or a normal web server.

8.2 Future work

Integrating CGI proxy detection in proxy firewalls will greatly increase the privacy and
security of sensitive data. This research raised a lot of issues that need to be investigated in
the future. The proposed detection model was only tested in a virtual network. Therefore,
the next step of our investigation will be to carry out the same experiments on a large
dataset generated from a real world scenario. Furthermore, the investigation will be
expanded to cover more bypassing scripts, because only the bypassing script glype [46] was
covered in this thesis. The implementation of different bypassing scripts can produce
different results than those obtained in this investigation. For the detection of CGI proxies,
our proposed model is a start, but more testing needs to be carried out to overcome the
problem presented in the thesis.
REFERENCES
[1] Myth: A connected PC will be infected in less than 5 minutes
Available at: http://en.kioskea.net/faq/455‐myth‐a‐connected‐pc‐will‐be‐infected‐in‐
less‐than‐5‐minutes
Accessed in 2009.
[2] Kenneth Ingham and Stephanie Forrest. Rep. A History and Survey of Network
Firewalls, University of New Mexico Computer Science Department, 2002.
[3] Vacca, John R. Jumpstart for network and systems administrators, Elsevier Digital
Press, 2005.
[4] Brian Baskin, Tony Bradley, Jeremy Faircloth, Craig A. Schiller, Ken Caruso, Paul
Piccard, et al. Combating spyware in the enterprise, Syngress Publishing, 2006.
[5] Eliezer Idjalahoue, Approaches for limiting the bypassing of proxy firewalls, University
of Western Sydney, 2008.
[6] Gordana Dodig-Crnkovic. Scientific Methods in Computer Science, Department of
Computer Science, Mälardalen University, Västerås, Sweden.
Available at: http://www.mrtc.mdh.se/~gdc/work/cs_method.pdf
Accessed in 2009.
[7] Thomas W Shinder. The Best Damn Firewall Book Period, Second Edition, Syngress
publishing, Elsevier: December 2007.
[8] William R. Cheswick, Steven M. Bellovin, Aviel D. Rubin. Firewalls and Internet
Security, Second Edition: Repelling the Wily Hacker. Addison‐Wesley Longman
Publishing, 2003.
[9] John Chirillo. Hack attacks revealed: A complete reference with custom security
hacking toolkit, John Wiley & Sons, New York, 2001.
[10] Stephen Hochstetler, Harry Tanner, Ramachandra Kulkarni, Sebastian Mika. Extending
Network Management Through Firewalls, IBM, June 2001.
[11] Elizabeth D. Zwicky, Simon Cooper, D. Brent Chapman
Building Internet Firewalls, Second Edition
O'Reilly Media, June 2000
[12] Behrouz A. Forouzan, Sophia Chung Fegan. TCP/IP Protocol Suite Third edition,
McGraw Hill, 2006.
[13] Karen Scarfone, Peter Mell
Guide to Intrusion Detection and Prevention Systems (IDPS)
Recommendations of the National Institute of Standards and Technology
February 2007
[14] Brian Komar, Ronald Beekelaar, and Joern Wettern. Firewalls for dummies, second
edition, Wiley Publishing, 2003.
[15] Eric Cole, Ronald Krutz, James W. Conley. Network security bible, second edition, John
Wiley & Sons 2005.
[16] Jan L. Harrington.
Network Security: A Practical Approach, Elsevier 2005.
[17] Joachim von zur Gathen. How to bypass a firewall, Bonn‐Aachen International Centre
for Information Technology, 2006.
[18] William Stallings. Network security essentials, Applications and standards, third
edition”, Prentice Hall, 2007.
[19] Ari Luotonen.
Web proxy servers, Prentice Hall, 1997.
[20] John W. Rittinghouse, William M. Hancock. Cybersecurity operations handbook,
Elsevier Digital Press, 2003.
[21] Michael E. Whiteman, Herbert J. Mattord, Richard D. Austin, Greg Holden.
Guide to firewalls and network security: with intrusion detection and VPNs, Second
edition, Course Technology Press Boston, MA, United States, 2003.
[22] Daniel J. Barrett, Richard E. Silverman. SSH, the Secure Shell: The Definitive Guide,
O’Reilly Media, Inc., 2005.
[23] Srinivas Sampalli. Security in Virtual Private Networks, in Network Security: Current
Status and Future Directions, C. Douligeris and D. Serpanos (editors), Wiley‐IEEE Press,
March 2007
[24] Floss manuals, sesame. Bypassing Internet Censorship
Available at: http://www.scribd.com/doc/12714224/how‐to‐bypass‐internet‐
censorship, Accessed in 2009.
[25] The living Internet
Available at: http://www.livinginternet.com/i/is_anon_work.htm
Accessed in 2009.
[26] Lozdodge: proxy avoidance application
Available at: http://www.lozware.com/
Accessed in 2009.
[27] Jeffry Dwight, Michael Erwin et al. Using CGI, Special edition, Que Corp.
Indianapolis, IN, USA, 1997.
[28] Markus Jakobsson, Zulfikar Ramzan. Crimeware, understanding new attacks and
defences, Addison‐Wesley Professional, 2008.
[29] SOPHOS, Security threat report: 2009
Available at : http://www.sophos.com/sophos/docs/eng/marketing_material/sophos‐
security‐threat‐report‐jan‐2009‐na.pdf
Accessed in 2009.
[30] Computer Economics. 2007 Malware Report: The Economic Impact of Viruses,
Spyware, Adware, Botnets and other Malicious Code, Tech. rep., June 2007.
[31] Cybercrime: Public and Private Entities Face Challenges in Addressing Cyber Threats
Available at: http://www.gao.gov/new.items/d07705.pdf, June 2007
Accessed in 2009.
[32] ScanSafe: Annual global threat report 2009
Available at: http://www.scansafe.com/downloads/gtr/2009_AGTR.pdf
Accessed in 2009.
[33] Michael Erbschloe. Trojans, Worms, and Spyware: A computer security professional’s
guide to malicious code, MA: Elsevier Butterworth‐Heinemann, 2005.
[34] Compete Inc: web traffic analysis
Available at: http://www.compete.com/
Accessed in 2009.
[35] IEEE Computer Society, Guy‐Vincent (University of Ottawa)
Centralized Web Proxy Services: security and privacy considerations, pp. 46‐52.
December 2007.
[36] Manuel Crotti, Maurizio Dusi, Francesco Gringoli, Luca Salgarelli. Detecting HTTP
Tunnels with Statistical Mechanisms, in Proceedings of the 42nd IEEE International
Conference on Communications (ICC 2007), (Glasgow, Scotland), pp. 6162–6168, June
2007.
[37] Manuel Crotti, Maurizio Dusi, Francesco Gringoli, Luca Salgarelli. Traffic Classification
through Simple Statistical Fingerprinting, Computer Communications Review, 37(1):7–
16, 2007.
[38] Jeffrey Horton and Rei Safavi‐Naini. Detecting policy violations through traffic analysis,
22nd Annual Computer Security Applications Conference (ACSAC '06), Miami Beach,
Florida, USA, December 2006, 109‐120.
[39] Riyad Alshammari, Nur Zincir‐Heywood. A Flow Based Approach For SSH Traffic
Detection, In Systems, Man and Cybernetics, IEEE International Conference on, pages
296–301, Oct. 2007.
[40] Kevin Borders, Atul Prakash. Web Tap: Detecting Covert Web Traffic, In Proceedings of
ACM CCS, October 2004.
[41] Liang Lu, Jeffrey Horton, Reihaneh Safavi‐Naini, and Willy Susilo. Transport Layer
Identification of Skype Traffic, International Conference ICOIN 2007, Estoril, Portugal,
January 2007.
[42] Sen, S., Spatscheck, O., Wang, D.: Accurate, Scalable In‐Network Identification of P2P
Traffic Using Application Signatures. In: Proceedings International WWW
Conference, New York, USA (2004).
[43] Stephen Thomas. HTTP essentials: Protocols for secure, scalable web sites, John Wiley
& Sons Inc, 2001.
[44] Bud Ratliff and Jason Ballard with the Microsoft ISA server team.
Microsoft® Internet Security and Acceleration (ISA) Server 2004, Administrator’s
Pocket consultant, Sams Indianapolis, IN, USA, 2005.
[45] Nils‐Erik Frantzell, IBM
Install XAMPP for easy, integrated development
Available at: http://www.ibm.com/developerworks/linux/library/l‐xampp/
Accessed in 2009.
[46] Apache friends - XAMPP
Available at: http://www.apachefriends.org/en/xampp.html
Accessed in 2009.
[47] Angela Orebaugh, Gilbert Ramirez, Josh Burke, Larry Pesce, Joshua Wright, Greg
Morris
Wireshark & Ethereal: Network Protocol Analyzer Toolkit, Syngress Publishing, Inc.
2007.
[48] Fiddler2
Available at: http://www.fiddler2.com/fiddler2/
Accessed in 2009.
[49] Glype proxy: free browsing
Available at: http://www.glype.com/
Accessed in 2009.
[50] Eric Hammersley,
Professional VMwareServer, Wrox Press Ltd., Birmingham, UK, 2006.
[51] G. Ziemba, D. Reed and P. Traina. RFC 1858, Security Considerations for IP Fragment
Filtering
Available at: http://www.ietf.org/rfc/rfc1858.txt
Accessed in 2009.
[52] I. Miller. RFC 3128, Protection against a variant of the tiny fragment Attack, June 2001
Available at: http://tools.ietf.org/html/rfc3128
Accessed in 2009.
[53] J. Anderson. An Analysis of Fragmentation Attacks
Available at: http://www.ouah.org/fragma.html
Accessed in 2009.
[54] Heyning Cheng and Ron Avnur,
Traffic Analysis of SSL Encrypted Web Browsing
[55] Andrew Hintz, Workshop on Privacy Enhancing Technologies PET2002
Fingerprinting websites using traffic analysis. The university of Texas at Austin
[56] Qixiang Sun; Simon, D.R.; Yi‐Min Wang; Russell, W.; Padmanabhan, V.N.; Lili Qiu,
2002 IEEE Symposium on Security and Privacy
Statistical Identification of Encrypted Web Browsing Traffic
APPENDIX

JSCRIPT.NET EMBEDDED WITH FIDDLER2 TO COMPUTE STATISTICS: WEBTRAFFICSTATS

// This function is designed to generate and dump the statistics of a web session (HTTP and HTTPS)
// The statistics of each session are saved in a unique file: WebStatFile + FileIndex + .csv
static function WebTrafficStats(FileIndex)
{
    //Select all the streams generated by a web session after the completion of a request
    FiddlerObject.UI.actSelectAll();

    //Dump the raw data of the web session in a ZIP file
    FiddlerObject.UI.actSaveSessionsToZip(CONFIG.GetPath("Captures") + "dump" + FileIndex + ".saz");

    //Process the streams and extract statistics of interest
    var WebTraffic = FiddlerApplication.UI.GetAllSessions();
    if (null == WebTraffic || WebTraffic.Length < 1)
    {
        FiddlerObject.StatusText = "No web traffic available for analysis!";
        return;
    }

    var StatFilename = CONFIG.GetPath("Captures") + "WebStatFile" + FileIndex + ".csv";
    var WebStatFile:StreamWriter = null;
    try
    {
        //Create a statistic file for the web session
        //The statistic file is saved in CSV format.
        //This file is later passed to Microsoft Excel to generate graphs
        WebStatFile = File.CreateText(StatFilename);
        //Heading of the statistics file
        WebStatFile.Write("ExecutionOrder,");
        WebStatFile.Write("ProcessID,");
        WebStatFile.Write("Protocol,");
        WebStatFile.Write("Method,");
        WebStatFile.Write("ServerName,");
        WebStatFile.Write("ServerIP,");
        WebStatFile.Write("ServerPort,");
        WebStatFile.Write("ClientIP,");
        WebStatFile.Write("ClientPort,");
        WebStatFile.Write("BytesReceived,");
        WebStatFile.Write("HeaderSize,");
        WebStatFile.Write("DataSize,");
        WebStatFile.WriteLine("InterArrivalTime" + "\t");

        // This code goes through the traffic streams generated during a web session
        // and dumps all their statistics in the log file
        for (var webstream = 0; webstream < WebTraffic.Length; webstream++)
        {
            var BytesRcv = 0;
            var BytesHeader = 0;
            var BytesData = 0;
            var InterArrivalTime = 0;

            // The following lines of code compute:
            // 1- Total size of data received by a stream
            // 2- Total size of the headers
            // 3- Total size of the payload
            InterArrivalTime = WebTraffic[webstream].Timers.ServerDoneResponse - WebTraffic[webstream].Timers.ServerBeginResponse;
            if (null != WebTraffic[webstream].responseBodyBytes)
            {
                BytesData = WebTraffic[webstream].responseBodyBytes.LongLength;
            }
            if ((null != WebTraffic[webstream].oResponse) && (null != WebTraffic[webstream].oResponse.headers))
            {
                BytesHeader = WebTraffic[webstream].oResponse.headers.ByteCount();
            }
            BytesRcv = BytesHeader + BytesData;

            //Write the statistics in the log file
            WebStatFile.Write(webstream + ",");
            WebStatFile.Write(WebTraffic[webstream].oFlags["x-ProcessInfo"] + ",");
            WebStatFile.Write(WebTraffic[webstream].oRequest.headers.UriScheme + ",");
            WebStatFile.Write(WebTraffic[webstream].oRequest.headers.HTTPMethod + ",");
            WebStatFile.Write(WebTraffic[webstream].hostname + ",");
            WebStatFile.Write(WebTraffic[webstream].oFlags["x-hostIP"] + ",");
            WebStatFile.Write(WebTraffic[webstream].port + ",");
            WebStatFile.Write(WebTraffic[webstream].oFlags["x-clientIP"] + ",");
            WebStatFile.Write(WebTraffic[webstream].oFlags["x-clientport"] + ",");
            WebStatFile.Write(BytesRcv + ",");
            WebStatFile.Write(BytesHeader + ",");
            WebStatFile.Write(BytesData + ",");
            WebStatFile.WriteLine(InterArrivalTime + "\t");
        }
    }
    catch (ErrorMsg)
    {
        MessageBox.Show(ErrorMsg);
    }
    finally
    {
        if (WebStatFile != null)
        {
            WebStatFile.Close();
            WebStatFile.Dispose();
            FiddlerObject.UI.actRemoveAllSessions();
        }
    }
}

TRAFFIC GENERATOR

#! /usr/bin/env python
#==================================================================================
# IMPORTATION OF LIBRARIES
# cPAMIE is a high-level library used to automate the Microsoft Internet Explorer client.
# cPAMIE is used in this project to simulate user browsing activities by:
# 1- Creating a Microsoft Internet Explorer object
# 2- Passing an URL to the object
# 3- Retrieving automatically the URL
# Python scripting is used to interact and send commands to the Microsoft Internet Explorer object
#==================================================================================
from cPAMIE import PAMIE
import os, sys, time, datetime, subprocess
from time import sleep

#==================================================================================
# This function displays error messages during execution
#==================================================================================
def ExecMessage(ExecOutput, nMode):
    if ExecOutput == 0:
        print("%s statistics Generation... DONE" % nMode)
        os.system('ExecAction.exe "clear"')
        sleep(2)
    elif ExecOutput == 1:
        print("Number of arguments to Fiddler incorrect")
    elif ExecOutput == 2:
        print("Fiddler not working")
    else:
        print("%s statistics Generation... FAILED" % nMode)

#==================================================================================
# This function simulates the bypassing traffic in HTTP or HTTPS mode
# The Microsoft Internet Explorer object first accesses the bypassing server and then
# passes the URL to the bypassing server
# BypassServer: Name of the bypassing server
# CurrentUrl: Current URL
# CurrentRound: Current round if many rounds are specified, 1 by default
# CurrentFile: Index of the current file to store the statistics and raw data
# BypassMode: bypassing mode, HTTP or HTTPS
#==================================================================================
def ExecBypassing(BypassServer, CurrentUrl, CurrentRound, CurrentFile, BypassMode):
    global ie
    bText = 'u'
    bButton = 'Go'
    print("%s Bypassing traffic for: %s" % (BypassMode, CurrentUrl))
    print("%s bypassing traffic simulation... In Progress" % BypassMode)
    ie.navigate(BypassServer)
    # Wait until the page is completely loaded
    while ie.getIE().Busy and ie.getIE().readyState != "complete":
        sleep(3)
    sleep(10)
    # Pass the URL to retrieve to the bypassing server
    os.system('ExecAction.exe "clear"')
    ie.setTextBox(bText, CurrentUrl)
    ie.clickButton(bButton)
    # Wait until the page is completely loaded
    while ie.getIE().Busy and ie.getIE().readyState != "complete":
        sleep(3)
    sleep(60)
    while ie.getIE().Busy and ie.getIE().readyState != "complete":
        sleep(3)
    FiddlerExec = 'ExecAction.exe ' + '"stats ' + CurrentRound + '-' + CurrentFile + BypassMode + '"'
    print("Generating statistics in file.... %s" % FiddlerExec)
    ExecResp = os.system(FiddlerExec)
    sleep(3)
    ExecMessage(ExecResp, "Bypassing Traffic")

#==================================================================================
# This function retrieves a URL from the Internet in HTTP and HTTPS bypassing modes
#==================================================================================
def WebBrowsing(nRound, uUrlList, HTTPBypassServer, HTTPSBypassServer):
    global ie
    print("Starting web browser")
    FileCount = 0
    if len(uUrlList) > 0:
        for line in uUrlList:
            ie = PAMIE()
            os.system('ExecAction.exe "clear"')
            sleep(2)
            print("Normal traffic for: %s" % line.strip())
            ie.navigate(line.strip())
            # Wait until the page is completely loaded
            while ie.getIE().Busy and ie.getIE().readyState != "complete":
                sleep(3)
            sleep(40)
            while ie.getIE().Busy and ie.getIE().readyState != "complete":
                sleep(3)
            FileCount += 1
            FiddlerExec = 'ExecAction.exe ' + '"stats ' + str(nRound) + '-' + str(FileCount) + '"'
            sleep(2)
            print("Generating statistics in file.... %s" % FiddlerExec)
            ExecResp = os.system(FiddlerExec)
            os.system('ExecAction.exe "clear"')
            sleep(2)
            ExecMessage(ExecResp, "Normal Traffic")
            ExecBypassing(HTTPBypassServer, line.strip(), str(nRound), str(FileCount), "HTTP")
            ExecBypassing(HTTPSBypassServer, line.strip(), str(nRound), str(FileCount), "HTTPS")
            ie.quit()
            sleep(5)
    else:
        print("No Url to retrieve")
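Both scripts detect page-load completion by polling `ie.getIE().Busy` and `readyState` in a loop interleaved with fixed `sleep()` calls. Below is a minimal, browser-independent sketch of the same polling pattern, with a timeout added so that a page that never finishes loading cannot stall a round indefinitely. `wait_until_ready` and `FakePage` are illustrative names introduced here, not part of the thesis scripts.

```python
import time

def wait_until_ready(is_busy, ready_state, poll_interval=0.01, timeout=5.0):
    """Poll until the page reports ready, or give up after `timeout` seconds.

    is_busy: callable returning True while the browser is still working
    ready_state: callable returning the current readyState string
    """
    deadline = time.monotonic() + timeout
    while is_busy() or ready_state() != "complete":
        if time.monotonic() >= deadline:
            return False  # gave up: the page never finished loading
        time.sleep(poll_interval)
    return True  # page finished loading within the timeout

# Simulated page object that becomes ready after a few polls
class FakePage:
    def __init__(self, polls_needed):
        self.polls = 0
        self.polls_needed = polls_needed
    def busy(self):
        self.polls += 1
        return self.polls < self.polls_needed
    def state(self):
        return "complete" if self.polls >= self.polls_needed else "loading"

page = FakePage(polls_needed=3)
print(wait_until_ready(page.busy, page.state))  # → True
```

The fixed `sleep(10)`/`sleep(40)`/`sleep(60)` pauses in the scripts serve the same settling purpose; a timeout-bounded loop simply caps the worst case instead of waiting a fixed interval.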
#==================================================================================
# This function performs the following tasks:
# 1- Retrieve a webpage using its URL
# 2- Collect all the traffic streams generated during the retrieval of the webpage
# 3- Compute the statistics of interest
# 4- Store the statistics in a CSV file and dump the raw data in a ZAR file
#==================================================================================
def ProfileGenerator():
    HTTPBypassingServer = 'SPECIFY THE HTTP BYPASSING SERVER HERE'
    HTTPSBypassingServer = 'SPECIFY THE HTTPS BYPASSING SERVER HERE'
    os.system('cls')
    os.system('ExecAction.exe "start"')
    os.system('ExecAction.exe "clear"')
    print("Starting browsing simulator at: %s" % (str(datetime.datetime.now())))
    if len(sys.argv) > 1:
        try:
            file = open(sys.argv[2], 'r')
            UrlList = file.readlines()
            file.close()
            print("Parameters loaded successfully")
            Round = 1
            while Round <= int(sys.argv[1]):
                os.system('ExecAction.exe "nuke"')
                sleep(5)
                WebBrowsing(Round, UrlList, HTTPBypassingServer, HTTPSBypassingServer)
                Round += 1
            else:
                print("Profile Generation ...Done")
        except:
            print("Generator initialization... FAILED")
    else:
        print("Usage: python simulator.py <Number of rounds> <path of the file containing the URLs>")

#==================================================================================
# Usage of the Script: python ProfileGenerator.py parameter1 parameter2
# Parameter1: Number of rounds
# Parameter2: Path of the file containing the list of URLs to retrieve
#==================================================================================
if len(sys.argv) == 3:
    if int(sys.argv[1]) >= 1:
        ProfileGenerator()
    else:
        print("The number of rounds must be greater than or equal to 1")
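The simulator loads its URL list with `readlines()` and strips each line before navigating, so the input file is simply one URL per line. The sketch below shows how such a file might be prepared and parsed; the file contents and the blank-line filtering are illustrative additions, not taken from the thesis scripts.

```python
import tempfile, os

# A hypothetical URL list: one URL per line, as the simulator expects
url_text = "http://example.com\nhttp://example.org/page\n\n"

# Write it the way a user would prepare the input file
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as f:
    f.write(url_text)

# Read it back the way the script does: readlines() then strip() per line,
# additionally skipping blank lines that would navigate to an empty URL
with open(path) as f:
    urls = [line.strip() for line in f.readlines() if line.strip()]
os.remove(path)

print(urls)  # → ['http://example.com', 'http://example.org/page']
```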
EFFICIENCY OF DETECTION APPROACH SCRIPT

#! /usr/bin/env python
#==================================================================================
#= IMPORTATION OF LIBRARIES
#==================================================================================
from cPAMIE import PAMIE
import os, sys, time, datetime, random
from time import sleep

#==================================================================================
# This function converts the inter-arrival time from hh:mm:ss into milliseconds
#==================================================================================
def TimeToMillisecondes(TimeElapsed):
    TotalTime = 0
    hours, minutes, seconds = TimeElapsed.split(":")
    TotalTime += 3600 * float(hours) + 60 * float(minutes) + float(seconds)
    return (TotalTime * 1000)

#==================================================================================
# This function displays error messages during execution
#==================================================================================
def ExecMessage(ExecOutput, nMode):
    if ExecOutput == 0:
        print("%s statistics Generation... DONE" % nMode)
        os.system('ExecAction.exe "clear"')
        sleep(2)
    elif ExecOutput == 1:
        print("Number of arguments to Fiddler incorrect")
    elif ExecOutput == 2:
        print("Fiddler not working")
    else:
        print("%s statistics Generation... FAILED" % nMode)

#==================================================================================
# This function retrieves a web page in HTTP or HTTPS bypassing modes and computes
# the live traffic profile for each URL
#==================================================================================
def ExecBypassing(BypassServer, CurrentUrl, BypassMode):
    global ie
    LiveProfile = '"livestats C:\Python31\Projects\Masters\LiveProfile.csv"'
    bText = 'u'
    bButton = 'Go'
    print("%s bypassing traffic simulation... In Progress" % BypassMode)
    ie = PAMIE()
    ie.navigate(BypassServer)
    # Wait until the page is completely loaded
    while ie.getIE().Busy and ie.getIE().readyState != "complete":
        sleep(3)
    sleep(10)
    # Pass the URL to retrieve to the bypassing server
    os.system('ExecAction.exe "clear"')
    ie.setTextBox(bText, CurrentUrl)
    ie.clickButton(bButton)
    # Wait until the page is completely loaded
    while ie.getIE().Busy and ie.getIE().readyState != "complete":
        sleep(3)
    sleep(40)
    while ie.getIE().Busy and ie.getIE().readyState != "complete":
        sleep(3)
    FiddlerExec = 'ExecAction.exe ' + LiveProfile
    print("Generating statistics in file.... %s" % FiddlerExec)
    ExecResp = os.system(FiddlerExec)
    sleep(3)
    ExecMessage(ExecResp, "Bypassing Traffic")
    ie.quit()

#==================================================================================
# This function compares the objects of a pre-built profile to the objects of a
# live profile and outputs the number of matches
#==================================================================================
def MatchObject(cLiveProfileArray, ObjSize, ObjPosition):
    ObjCounter = 0
    TotalObjects = 0
    InterArrivaltime = 0
    for line in cLiveProfileArray:
        TrafficArray = line.strip().split(',')
        if ObjSize == float(TrafficArray[ObjPosition].strip()):
            ObjCounter += 1
        if float(TrafficArray[9].strip()) > 0:
            InterArrivaltime += TimeToMillisecondes(TrafficArray[12].strip())
            TotalObjects += 1
    if TotalObjects > 0:
        InterArrivaltime = InterArrivaltime / TotalObjects
    return ObjCounter, InterArrivaltime

#==================================================================================
# This function saves the statistics of HTTP and HTTPS bypassing traffic in a CSV file
#==================================================================================
def UpdateFingerprintFile(FileIndex, InputLine, BypassMode, ParamMode):
    if BypassMode == "HTTP":
        cFileName = 'HTTPFingerprint' + str(ParamMode) + '-' + str(FileIndex) + '.csv'
    elif BypassMode == "HTTPS":
        cFileName = 'HTTPSFingerprint' + str(ParamMode) + '-' + str(FileIndex) + '.csv'
    FingerprintFile = open(cFileName, 'a')
    FingerprintFile.write(InputLine)
    FingerprintFile.close()

#==================================================================================
# This function searches for a pre-built profile matching the live traffic.
# The function compares each pre-built profile to the live profile and outputs
# the number of matches recorded for the live traffic profile
#==================================================================================
def FingerprintURL(RandURL, LowerBoundary, ObjPosition, BypassingMode):
    print("Fingerprinting Webpage... STARTING")
    print("Reading Live Profile statistics")
    ProfileFile = 'LiveProfile.csv'
    sFile = open(ProfileFile, 'r')
    LiveProfileArray = sFile.readlines()
    if len(LiveProfileArray) > 1:
        LiveProfileArray.pop(0)
    sFile.close()
    IndexBlackList = 1
    while IndexBlackList <= 70:
        cProfileFilename = 'Profile' + str(IndexBlackList) + '.csv'
        print("File: %s" % cProfileFilename)
        sFile = open(cProfileFilename, 'r')
        BlackListProfileArray = sFile.readlines()
        if len(BlackListProfileArray) > 2:
            InterArrivaltimeArray = BlackListProfileArray[len(BlackListProfileArray) - 1].split(',')
            InterArrivaltime = float(InterArrivaltimeArray[1])
            BlackListProfileArray.pop(0)
            BlackListProfileArray.pop(len(BlackListProfileArray) - 1)
        sFile.close()
        NbrElements = 0
        MatchCounter = 0
        for CurrentObject in BlackListProfileArray:
            ObjectArray = CurrentObject.strip().split(',')
            NbrElements += int(ObjectArray[1])
            ObjectCounter, LiveInterArrivaltime = MatchObject(LiveProfileArray, float(ObjectArray[0]), ObjPosition)
            if (ObjectCounter > 0) and (ObjectCounter <= int(ObjectArray[1])):
                MatchCounter += ObjectCounter
            elif ObjectCounter > int(ObjectArray[1]):
                MatchCounter += int(ObjectArray[1])
        BoundaryRange = LowerBoundary
        while BoundaryRange <= 100:
            RulesCounter1 = 0
            RulesCounter2 = 0
            RulesCounter3 = 0
            if (MatchCounter / len(LiveProfileArray)) * 100 >= BoundaryRange:
                RulesCounter1 += 1
            if len(LiveProfileArray) <= NbrElements:
                RulesCounter2 = RulesCounter1 + 1
            if LiveInterArrivaltime > InterArrivaltime:
                RulesCounter3 = RulesCounter2 + 1
            UpdateIndex = 1
            while UpdateIndex <= 3:
                if UpdateIndex == 1:
                    RulesCounter = RulesCounter1
                if UpdateIndex == 2:
                    RulesCounter = RulesCounter2
                if UpdateIndex == 3:
                    RulesCounter = RulesCounter3
                if IndexBlackList == 1:
                    UpdateFingerprintFile(BoundaryRange, str(RandURL + 1), BypassingMode, UpdateIndex)
                if RulesCounter == UpdateIndex:
                    if (BoundaryRange >= LowerBoundary) and (BoundaryRange <= 100):
                        UpdateFingerprintFile(BoundaryRange, ',' + str(IndexBlackList), BypassingMode, UpdateIndex)
                if IndexBlackList == 70:
                    UpdateFingerprintFile(BoundaryRange, '\n', BypassingMode, UpdateIndex)
                UpdateIndex += 1
            BoundaryRange += 5
        IndexBlackList += 1

#==================================================================================
# This function performs the following tasks:
# 1- Generate a random URL to retrieve
# 2- The random URL is then accessed in HTTP bypassing and HTTPS bypassing modes
# 3- The live traffic profile obtained after each access is fingerprinted
#    (compared to pre-built profiles)
#==================================================================================
def RandomBrowsing(NbrURLs, URLFilename, ObjPosition, Boundary):
    HTTPBypassingServer = 'SPECIFY THE HTTP BYPASSING SERVER HERE'
    HTTPSBypassingServer = 'SPECIFY THE HTTPS BYPASSING SERVER HERE'
    ExecFlag = 0
    os.system('cls')
    print("Starting Efficiency Testing engine at: %s" % (str(datetime.datetime.now())))
    try:
        file = open(URLFilename, 'r')
        UrlList = file.readlines()
        file.close()
        print("Parameters loaded successfully")
        ExecFlag = 1
    except:
        print("Loading of URLs file... FAILED")
    if ExecFlag == 1:
        nRound = 1
        while nRound <= 4:
            IndexRun = 0
            RandomRun = []
            while IndexRun < NbrURLs:
                RandomRun.append(IndexRun)
                IndexRun += 1
            random.shuffle(RandomRun)
            print(RandomRun)
            IndexURL = 0
            for RandomURL in RandomRun:
                IndexURL += 1
                print("Current Round: %s Run: %s" % (str(nRound), str(IndexURL)))
                ExecBypassing(HTTPBypassingServer, UrlList[RandomURL], "HTTP")
                FingerprintURL(int(RandomURL), Boundary, ObjPosition, "HTTP")
                sleep(3)
                ExecBypassing(HTTPSBypassingServer, UrlList[RandomURL], "HTTPS")
                FingerprintURL(int(RandomURL), Boundary, ObjPosition, "HTTPS")
            nRound += 1

#==================================================================================
# Usage of the Script: python EfficiencyTest.py parameter1 parameter2 parameter3 parameter4
# Parameter1: Total number of URLs
# Parameter2: Path of the URLs' file
# Parameter3: Position of the parameter as follows:
#   0- ExecutionOrder  1- ProcessID  2- Protocol  3- Method  4- ServerName  5- ServerIP
#   6- ServerPort  7- ClientIP  8- ClientPort  9- BytesReceived  10- HeaderSize  11- PayloadSize
# Parameter4: Lower boundary
#==================================================================================
if len(sys.argv) == 5:
    if len(sys.argv[1]) < 5:
        RandomBrowsing(int(sys.argv[1]), sys.argv[2], int(sys.argv[3]), int(sys.argv[4]))
else:
    print("Usage: Python EfficiencyTest <Maximum URLs> <URL File Path> <Payload Index> <Boundary>")
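The core of the fingerprinting step counts live objects whose size equals an entry in a pre-built profile, capping each size's contribution at the count recorded for that entry, and then expresses the matches as a percentage of the live profile. The following self-contained sketch illustrates that rule with toy data; `match_counter` and all the sizes and counts are illustrative, not taken from the thesis profiles.

```python
# Toy live profile: one object size (in bytes) per retrieved object
live_objects = [512.0, 1024.0, 1024.0, 2048.0, 4096.0]

# Toy pre-built profile: (object size, expected count) pairs, as stored per URL
prebuilt_profile = [(1024.0, 2), (2048.0, 1), (8192.0, 1)]

def match_counter(live, profile):
    """Count live objects whose size matches a profile entry, capping each
    size's contribution at the count recorded in the pre-built profile."""
    matches = 0
    for size, expected in profile:
        seen = sum(1 for obj in live if obj == size)
        matches += min(seen, expected)
    return matches

score = match_counter(live_objects, prebuilt_profile)
percentage = score / len(live_objects) * 100
print(score, percentage)  # → 3 60.0
```

Here two 1024-byte objects and one 2048-byte object match, giving 3 matches out of 5 live objects, i.e. 60%; `FingerprintURL()` then compares such a percentage against `BoundaryRange` thresholds stepped by 5 from the lower boundary up to 100.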