Measuring CDN performance and why you're doing it wrong


Uploaded by fastly on 16-Apr-2017


TRANSCRIPT

Page 1: Measuring CDN performance and why you're doing it wrong

Measuring CDN Performance

Hooman Beheshti

VP Technology

Why this matters

• Performance is one of the main reasons we use a CDN
• Measurement is often used during the evaluation phase to compare CDNs
  – Most of what we’ll talk about is in this context
• Seems easy, but isn’t
• Heavily vendor-influenced
  – “Ok Google: define irony!”

Goals

• What does the measurement landscape look like?
• Share measurement experiences
• Help guide you towards a good testing plan if/when you decide to do this

Background

Delivery: static/cached objects

(Diagram: Client, CDN Node, Origin)

Delivery: dynamic/uncached objects

What we’ll be focusing on

• Only delivery, not all the other features CDNs provide
• How we measure
• Metrics to measure
• What to measure
• Some gotchas, misconceptions, and common mistakes

Measurement Techniques

(how we measure)

Measurement techniques

• Pretend users
  – Synthetic tests
  – Not actual users
• Real users
  – In the browser
  – Actual users

Synthetic testing

• Usually a large network of test nodes all over the globe
• Highly scalable, can run lots of tests at once
• Many vendors use this model
  – Examples: Catchpoint, Dynatrace (Gomez), Keynote, Pingdom, etc.

• Built to do full performance and availability testing
  – Lots of “monitors” emulating what real users do
  – DNS, Traceroute, Ping, Streaming, Mobile
  – HTTP
    • Object
    • Browser
    • Transactions/Flows
• Tests are set up to run repeatedly at some frequency
  – Aggregates are reported

Backbone nodes

• Test machines sitting in datacenters all around the globe
• Really good at:
  – Availability and reachability
  – Scale
  – Backend problems
  – Global reach
• Often terrible indicators of raw performance
  – No latency
  – Infinite bandwidth

Image source: https://www.flickr.com/photos/stars6/4381851322/

Last mile nodes

• Test machines sitting behind a real, home-like internet connection
• Much better at reporting what you can expect from users, but sometimes unreliable
• Also not as densely deployed

(Chart: backbone vs. last mile)

Real users (RUM)

RUM

• Use JavaScript to collect timing metrics
• Can collect lots of things through browser APIs
  – Page metrics, asset metrics, user-defined metrics

Use test assets

• Use this model to initiate tests in the browser
• Some vendors:
  – Cedexis, TurboBytes, CloudHarmony, more…
  – Usually this isn’t their business, but the data drives their main business objectives
• You can build this yourself too
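A do-it-yourself version of the test-asset model might look like the sketch below. It assumes you host the same test object on each candidate CDN; the `cdn-x`/`cdn-y` names and `example.com` URLs are hypothetical placeholders. Timing a `fetch()` from the page is only a rough wall-clock number; the Resource Timing entry for the same URL gives the per-phase breakdown.

```javascript
// Hypothetical test objects: the same file hosted on each CDN under evaluation.
const CANDIDATES = [
  { cdn: "cdn-x", url: "https://cdn-x.example.com/test-object.jpg" },
  { cdn: "cdn-y", url: "https://cdn-y.example.com/test-object.jpg" },
];

// Fetch one randomly chosen candidate and time it. fetchImpl, now, and random are
// injectable so the sketch can be exercised outside a browser; in a page the
// defaults (fetch, performance.now, Math.random) are used.
async function measureTestAsset(
  candidates,
  fetchImpl = fetch,
  now = () => performance.now(),
  random = Math.random
) {
  const choice = candidates[Math.floor(random() * candidates.length)];
  const start = now();
  // "no-store" keeps the browser cache from answering instead of the CDN.
  const response = await fetchImpl(choice.url, { cache: "no-store" });
  await response.arrayBuffer(); // read the full body so the time includes the download
  return { cdn: choice.cdn, status: response.status, ms: now() - start };
}
```

In a page you would call `measureTestAsset(CANDIDATES)` on each view and send the result to your own collection endpoint; picking one candidate at random per view spreads tests evenly without doubling the cost for any single user.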

Use real assets in the page

• Collect timings from actual objects
  – Resource Timing
• Vendors
  – SOASTA, New Relic, most synthetic vendors
  – Boomerang (open source)
  – Google Analytics User Timings

DATA, DATA, DATA

• For either RUM technique, we need A LOT of data
• Too much variance
  – Most vendors don’t use averages
  – Medians, percentiles, and histograms
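To see why averages mislead, here is a small sketch that summarizes a batch of timing samples. It uses the nearest-rank percentile definition, which is one choice among several (many tools interpolate between ranks), and the sample values are made up for illustration.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("need at least one sample");
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort, not the default string sort
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Mean next to median and p95, so the effect of a few slow outliers is visible.
function summarize(samples) {
  const mean = samples.reduce((sum, x) => sum + x, 0) / samples.length;
  return { mean, median: percentile(samples, 50), p95: percentile(samples, 95) };
}

// Made-up TTFB samples in ms: nine fast responses and one very slow one.
const ttfbMs = [20, 21, 22, 23, 24, 25, 26, 27, 28, 900];
console.log(summarize(ttfbMs)); // → { mean: 111.6, median: 24, p95: 900 }
```

The single 900 ms outlier pulls the mean to more than four times the median, while the median still describes what most requests experienced.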

Measurement Metrics

(Diagram, built up over several slides: a request between Client and Server, marked 1 x RTT, then the phases DNS, TCP, (TLS), HTTP (TTFB), and HTTP (Download))

Phases along the time axis: DNS → TCP → (TLS) → TTFB → Download (TTLB-TTFB)

What each phase depends on:
• DNS: RTT to DNS server, DNS iterations, DNS caching and TTLs
• TCP: RTT to cache server (CDN footprint & routing algorithms)
• (TLS): RTT to cache server (or RTTs, depending on TLS False Start), efficiency of TLS engine
• TTFB: RTT to where the object is stored + storage efficiency (different for requests to origin); lower bound = network RTT
• Download (TTLB-TTFB): Bandwidth, congestion avoidance algorithms (and RTT!)
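The same breakdown can be read from a Resource Timing entry. The sketch below maps the W3C Resource Timing attributes onto the phases above; it takes any object with those attribute names (so it also works on entries collected and beaconed elsewhere), and the sample numbers are invented. Following the phase order above, TTFB here runs from `requestStart` to `responseStart`, after connection setup.

```javascript
// Split one Resource Timing entry (PerformanceResourceTiming or a plain object with the
// same attribute names, in ms) into the phases DNS, TCP, TLS, TTFB, and Download.
function phases(entry) {
  // secureConnectionStart is 0 when the connection was not secure
  // (or when a cross-origin response withholds detailed timings).
  const usedTls = entry.secureConnectionStart > 0;
  const tcpEnd = usedTls ? entry.secureConnectionStart : entry.connectEnd;
  return {
    dns: entry.domainLookupEnd - entry.domainLookupStart,
    tcp: tcpEnd - entry.connectStart,
    // connectEnd includes the TLS handshake, so TLS is the tail of the connect interval.
    tls: usedTls ? entry.connectEnd - entry.secureConnectionStart : 0,
    ttfb: entry.responseStart - entry.requestStart,
    download: entry.responseEnd - entry.responseStart, // TTLB - TTFB
  };
}

// Invented numbers for a first request on a fresh HTTPS connection.
const sample = {
  domainLookupStart: 0, domainLookupEnd: 10,
  connectStart: 10, secureConnectionStart: 25, connectEnd: 45,
  requestStart: 45, responseStart: 75, responseEnd: 120,
};
console.log(phases(sample)); // → { dns: 10, tcp: 15, tls: 20, ttfb: 30, download: 45 }
```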

Core object metrics

• Not every request experiences every metric:
  – DNS: once per domain
  – TCP/TLS setup: once per connection
  – TTFB/Download: for every object (not already in the browser cache)
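One way to check this on a real page is to count how many resource entries actually paid for each setup phase. This is a heuristic sketch: a zero-length DNS or connect interval usually means a cached lookup or a reused connection, but cross-origin entries served without a `Timing-Allow-Origin` header also report zeros, so filter those out before reading too much into the counts.

```javascript
// Count how many entries incurred DNS, TCP, and TLS setup time.
function setupCosts(entries) {
  const counts = { total: entries.length, dns: 0, tcp: 0, tls: 0 };
  for (const e of entries) {
    if (e.domainLookupEnd > e.domainLookupStart) counts.dns++; // fresh DNS lookup
    if (e.connectEnd > e.connectStart) counts.tcp++; // new connection opened
    // A TLS handshake ran if the secure phase started and took time before connectEnd.
    if (e.secureConnectionStart > 0 && e.connectEnd > e.secureConnectionStart) counts.tls++;
  }
  return counts;
}

// In a browser: setupCosts(performance.getEntriesByType("resource"))
```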

 

Resource timing

http://www.w3.org/TR/resource-timing/

window.performance.getEntries()
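`getEntries()` returns every entry on the page. A minimal sketch for narrowing it to objects served from one CDN hostname follows; `cdn.example.com` is a placeholder. Cross-origin entries only expose detailed timings (such as `requestStart`) when the response includes a `Timing-Allow-Origin` header; otherwise those attributes are zero, which the second helper checks for.

```javascript
// Keep only resource entries whose URL is on the given hostname.
function entriesForHost(entries, hostname) {
  return entries.filter((e) => {
    try {
      return new URL(e.name).hostname === hostname;
    } catch {
      return false; // entry name was not an absolute URL
    }
  });
}

// Without Timing-Allow-Origin, cross-origin entries report 0 for requestStart and friends.
function hasDetailedTiming(entry) {
  return entry.requestStart > 0;
}

// In a browser:
// entriesForHost(performance.getEntriesByType("resource"), "cdn.example.com").filter(hasDetailedTiming)
```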

Mistakes we make

(when evaluating)

CDN X vs. CDN Y

“I’ll pick an image from my home page, use backbone synthetic tests from all over the world, and pick the CDN that has the fastest average time”

“Let’s test an asset via RUM on a million page views a day and pick the fastest CDN”

“Let’s run WebPageTest on both CDNs and go with whichever has a faster page load time”

~$ time curl -v http://…

we measure the wrong thing

Web application: objects

• Your application should determine what you test:
  – Objects served from the edge
  – Objects served from origin (through the CDN)
• If HTML is served from origin (through the CDN), we must measure it
  – Essential to critical page metrics

Web application: object sizes

• On any page
  – DNS queries only happen a small number of times
  – 6 TCP connections per domain
  – 1 TLS setup per connection
  – Many, many, many HTTP fetches
• Core metrics
  – TTFB
  – Download (TTLB-TTFB), if there are important large objects
  – You should still have a good idea of DNS/TCP/TLS, but they are less critical

Web application

• If the CDN is only used for static/cacheable objects:
  – One or two representative assets
  – TTFB, and maybe download, are most important

(Diagram: Client, CDN Node)


X-Cache: HIT
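When comparing CDNs on cacheable objects, it helps to confirm that each test was actually a hit and to report hits and misses separately. A minimal sketch over synthetic-test results follows; the record shape `{ xCache, ttfb }` is hypothetical, and header names and values differ between CDNs (some report several comma-separated values), so this only treats a value starting with `HIT` as a hit.

```javascript
// Median of a numeric array, or null when it is empty.
function median(values) {
  if (values.length === 0) return null;
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

// Group TTFB samples (ms) by cache status read from an X-Cache-style header value.
function ttfbByCacheStatus(samples) {
  const hit = [];
  const miss = [];
  for (const { xCache, ttfb } of samples) {
    const isHit = (xCache || "").trim().toUpperCase().startsWith("HIT");
    (isHit ? hit : miss).push(ttfb);
  }
  return {
    hit: { count: hit.length, medianTtfb: median(hit) },
    miss: { count: miss.length, medianTtfb: median(miss) }, // also counts missing headers
  };
}
```

Keeping the two groups apart stops a CDN with a colder cache during the test window from looking slower (or faster) for reasons unrelated to delivery.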

Web application

• If the CDN is also used for the whole site (HTML going through the CDN)
  – Sample of key HTML pages, delivered from origin
  – TTFB will show efficiency of routing (and connection management) to origin
  – TTLB will show efficiency of delivery

(Diagram: Client, CDN Node, Web Server; a second version of the diagram adds a second CDN Node)

we measure the wrong way

Backbone Nodes

(For true performance measurements)

(Chart: TCP Connect Time Histogram (BB nodes); y-axis: % of tests, x-axis: msec)
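A histogram like this is straightforward to build from raw connect times. The sketch below uses fixed-width bins and reports the share of tests in each; the 10 ms bin width and the sample values are arbitrary choices, not taken from the chart.

```javascript
// Bucket samples (ms) into fixed-width bins; return each bin's share of all tests in %.
function histogram(samples, binWidth) {
  const counts = new Map();
  for (const ms of samples) {
    const lower = Math.floor(ms / binWidth) * binWidth;
    counts.set(lower, (counts.get(lower) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => a[0] - b[0])
    .map(([lower, n]) => ({ fromMs: lower, toMs: lower + binWidth, pctOfTests: (100 * n) / samples.length }));
}

console.log(histogram([1, 2, 3, 12, 25], 10));
// → bins 0-10 ms: 60%, 10-20 ms: 20%, 20-30 ms: 20%
```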

object metrics or page metrics

Download: 15 Mbps, Upload: 5 Mbps, Latency: 10 ms and 25 ms

(Charts: results at 10 msec and 25 msec latency, comparing onload, Speed Index, and Start Render)

What the…???

• We always assume “all things equal”
• Too many factors affect page load time
  – 3rd parties (sometimes varying), content from origin, layout, JS execution, etc.
• Too much variance

Source: httparchive.org

To be clear…

• Always use WebPageTest (or something like it) to understand your application’s performance profile
• Continue to monitor application performance, and always spot check
• Be extremely careful when using it to compare CDN performance; it can mislead you
  – If using RUM to measure page metrics, with lots of data, things become a little more meaningful (data volume handles variance)

we overgeneralize and draw the wrong conclusions

Cache hit ratios

Cache hit ratio: traditional calculation

$$1 - \frac{\text{Requests to Origin}}{\text{Total Requests}}$$

(Diagram sequence: Origin and Cache, with TCP connections and HTTP requests between them; HOT and COLD caches; a cache “hit”)

Isn’t this better?

$$\frac{\text{Hits}}{\text{Hits} + \text{Misses}} \quad \text{(@edge)}$$

Cache hit ratio: offload vs. performance

• Offload: $1 - \frac{\text{Requests to Origin}}{\text{Total Requests}}$
• Performance: $\frac{\text{Hits}}{\text{Hits} + \text{Misses}}$ (@edge)
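The two ratios can diverge when an edge miss is answered by another cache layer instead of the origin. The sketch below computes both from hypothetical counters; the numbers assume, purely for illustration, a setup where 150 of 200 edge misses were filled by an intermediate cache.

```javascript
// Offload: share of all requests the origin never saw.
// Edge hit ratio: share of edge lookups answered from the edge's own cache.
function cacheRatios({ totalRequests, originRequests, edgeHits, edgeMisses }) {
  return {
    offload: 1 - originRequests / totalRequests,
    edgeHitRatio: edgeHits / (edgeHits + edgeMisses),
  };
}

// Hypothetical counters: 1000 requests, 800 edge hits, 200 edge misses,
// of which only 50 reached the origin.
const ratios = cacheRatios({ totalRequests: 1000, originRequests: 50, edgeHits: 800, edgeMisses: 200 });
// → offload ≈ 0.95, edgeHitRatio = 0.8: the origin is well protected,
//   but one in five requests still waits on a miss at the edge.
```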

Effect on long tail content

(long tail: cacheable but seldom fetched)

Popular vs. Medium Tail (1hr) vs. Long tail (6hr)

Median timings:
• Popular: Connect 14 msec, Wait 19 msec
• Medium tail (1hr): Connect 15 msec, Wait 26 msec
• Long tail (6hr): Connect 16 msec, Wait 32 msec

(Sample sizes: 6,400+, 77,000+, and 38,000+ measurements)

(Charts: Popular, Medium Tail (1hr), Long tail (6hr); “Isn’t this better?”)

After all that…

How much of this really matters?

(when trying to choose between multiple CDNs)

The bigger picture

• It’s really easy to lock in on a metric
• Performance absolutely matters
• True performance isn’t always as easy to measure

We must ask questions…

What’s the storage model, and how does it affect long tail content?

What should I expect from cache hit ratios, for offload and for performance?

Footprint?

(is what I’m testing the same as what I’m buying?)

HTTP vs. TLS footprint?

Can I serve stale content if necessary?

(stale-while-revalidate & stale-if-error)
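Both directives are `Cache-Control` extensions defined in RFC 5861. The sketch below shows the decision they enable once a cached copy has outlived `max-age`; it is a simplified model (unquoted directive values only, the copy's age passed in directly), not how any particular CDN implements it, and the result labels are made up for illustration.

```javascript
// Parse "max-age=60, stale-while-revalidate=30" into { "max-age": 60, ... }.
// Simplified: ignores quoted values; valueless directives become true.
function parseCacheControl(header) {
  const directives = {};
  for (const part of header.split(",")) {
    const [name, value] = part.trim().split("=");
    if (name) directives[name.toLowerCase()] = value === undefined ? true : Number(value);
  }
  return directives;
}

// Decide what a cache may do with a copy that is ageSec seconds old.
function staleDecision(cacheControl, ageSec, originFailed) {
  const cc = parseCacheControl(cacheControl);
  const maxAge = typeof cc["max-age"] === "number" ? cc["max-age"] : 0;
  if (ageSec < maxAge) return "fresh";
  const staleFor = ageSec - maxAge; // seconds past freshness
  // A directive only applies if it is present with a numeric value.
  const within = (limit) => typeof limit === "number" && staleFor <= limit;
  if (originFailed && within(cc["stale-if-error"])) return "serve-stale-on-error";
  if (!originFailed && within(cc["stale-while-revalidate"])) return "serve-stale-and-revalidate";
  return "cannot-serve-stale";
}
```

With `max-age=60, stale-while-revalidate=30, stale-if-error=86400`, a 75-second-old copy can be served while the cache refreshes it in the background, and during an origin outage a copy up to a day past freshness can still be served.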

What if I can cache something I didn’t think I could?

Key takeaways

• Everything is application-dependent
  – Evaluate how your application works and what impacts performance the most
• Don’t get locked into a single number/metric
• Always know your application’s performance and bottlenecks
• Be mindful of the bigger picture
• Don’t stop measuring!

Thank you!

[email protected]

Office hours: Friday @ lunch