Web Caching
Dr. Yingwu Zhu
What is Web Caching
• Proxy servers placed at certain points in the network cache Web documents so that clients can access them faster.
• Comparable to the cache memory in a computer system
Proxy Cache
[Figure: clients send requests (Req.) to the proxy, which forwards them to the servers; replies (Reply) return along the same path.]
How?
• Clients send requests to the proxy.
• If the requested document is in its cache, the proxy serves the request from its cache.
• Otherwise, the proxy forwards the request to the server.
• The server replies to the request through the proxy (the proxy keeps a copy of the requested document).
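The request flow above can be sketched in a few lines. This is a minimal illustration only: `SimpleProxy`, `fake_origin`, and the hit/miss counters are hypothetical names, and freshness checks are left out on purpose.

```python
from typing import Callable, Dict


class SimpleProxy:
    """Toy proxy: serve from cache on a hit, fetch and keep a copy on a miss."""

    def __init__(self, fetch_from_server: Callable[[str], bytes]) -> None:
        self.fetch_from_server = fetch_from_server  # hypothetical origin fetcher
        self.cache: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def handle(self, url: str) -> bytes:
        if url in self.cache:
            self.hits += 1
            return self.cache[url]          # served from the proxy's cache
        self.misses += 1
        body = self.fetch_from_server(url)  # forward the request to the server
        self.cache[url] = body              # keep a copy for later requests
        return body


def fake_origin(url: str) -> bytes:
    """Stand-in for a real web server (assumption for this sketch)."""
    return ("content of " + url).encode()


proxy = SimpleProxy(fake_origin)
first = proxy.handle("/pub/www/index.html")   # miss: goes to the server
second = proxy.handle("/pub/www/index.html")  # hit: served from cache
```

The second request never reaches `fake_origin`, which is the bandwidth and latency saving the slides refer to.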
Why Web Caching?
• HTTP traffic has grown rapidly into the largest share of Internet traffic, causing more network congestion and server unavailability.
• The number of static Web pages almost doubles every year
• Some old data
– Number of unique pages: 800M < X < 2.2B
– Number of unique web sites: 8,500,000
– Static pages: 30% - 40%
– Pages revisited: 80%
– Expected hit rate: 24% - 32%
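The expected hit rate quoted above matches the product of the static-page share and the revisit share. That reading is an assumption, since the slide does not say how the figure was derived; the function name below is hypothetical.

```python
def expected_hit_rate_percent(static_percent: int, revisit_percent: int) -> int:
    """Share of requests that are both cacheable (static) and repeat visits.

    Assumes the two properties are independent, which the slide does not state.
    """
    return static_percent * revisit_percent // 100


low = expected_hit_rate_percent(30, 80)   # 30% static, 80% revisited
high = expected_hit_rate_percent(40, 80)  # 40% static, 80% revisited
print(f"expected hit rate: {low}% - {high}%")
```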
Why Web Caching?
• Bandwidth
• Latency
• Performance = Response Time
• Server Load
• Failure Redundancy
Expected Gains
• Bandwidth saving
• Improving content availability
• Improving web server availability
• Server load balancing
• Reducing user-perceived latency
What: Content and Protocols
• HTTP 1.0: basic protocol
– Send a request based on a fixed number of verbs
  • GET
  • HEAD
  • POST
– Receive response, meta-data, content
What: Content and Protocols
• HTTP Request
Request = Simple-Request | Full-Request
Simple-Request = "GET" SP Request-URI CRLF
Full-Request = Request-Line
               *( General-Header
                | Request-Header
                | Entity-Header )
               CRLF
               [ Entity-Body ]
What: Content and Protocols
• Example: GET /pub/www/index.html HTTP/1.0
• Response:
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Sat, 19 Oct 2002 05:46:53 GMT
Expires: Sun, 20 Oct 2002 16:00:00 GMT
Content-Length: 2291
Content-Type: text/html
Cache-control: private
What: Content and Protocols
• Example "if-modified-since":
GET /pub/www/index.html HTTP/1.0
If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT
• Response:
HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
Date: Thu, 13 Jul 2000 05:46:53 GMT
Expires: Sun, 20 Oct 2002 16:00:00 GMT
Content-Length: 2291
Content-Type: text/html
Cache-control: private
What: Content and Protocols
• Example "if-modified-since":
GET /pub/www/index.html HTTP/1.0
If-Modified-Since: Sat, 19 Oct 2002 19:43:31 GMT
• Response:
HTTP/1.1 304 Not Modified
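The two "if-modified-since" outcomes (200 with a full body, or 304 Not Modified) come down to one date comparison on the server side. The sketch below is a simplified illustration, not how any particular server implements it; `conditional_status` and its parameters are hypothetical names.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def conditional_status(last_modified: datetime,
                       if_modified_since: Optional[str]) -> int:
    """Return 304 if the resource is unchanged since the client's date, else 200.

    `last_modified` is assumed to be a timezone-aware UTC datetime.
    """
    if if_modified_since is None:
        return 200  # plain GET: always send the full response
    try:
        client_time = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable date: ignore the header
    if client_time.tzinfo is None:
        client_time = client_time.replace(tzinfo=timezone.utc)
    return 304 if last_modified <= client_time else 200


unchanged = conditional_status(
    datetime(2002, 10, 19, 5, 0, 0, tzinfo=timezone.utc),
    "Sat, 19 Oct 2002 19:43:31 GMT",
)
changed = conditional_status(
    datetime(2002, 10, 20, 0, 0, 0, tzinfo=timezone.utc),
    "Sat, 19 Oct 2002 19:43:31 GMT",
)
```

A 304 response carries no body, so the cache keeps serving its stored copy and only the headers cross the network.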
HTTP support for caching
• Conditional requests (IMS)
• Servers can set expires and max-age
• Request indirection: application level routing
• Range requests, entity tag
• Cache-control header
– Requests: min-fresh, max-stale, no-transform
– Responses: must-revalidate, public, private, no-cache
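A cache has to split the Cache-control header into its directives before it can act on them. The parser below is a rough sketch under simplifying assumptions: `parse_cache_control` is a hypothetical helper, and quoted values that contain commas are not handled.

```python
from typing import Dict, Optional


def parse_cache_control(header_value: str) -> Dict[str, Optional[str]]:
    """Split a Cache-control value into {directive: argument or None}.

    Simplification: splits on every comma, so quoted arguments that
    themselves contain commas would be parsed incorrectly.
    """
    directives: Dict[str, Optional[str]] = {}
    for part in header_value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, value = part.partition("=")
        directives[name.strip().lower()] = value.strip().strip('"') if sep else None
    return directives


response_directives = parse_cache_control("private")
request_directives = parse_cache_control("max-stale=60, no-transform")
```

With the response from the earlier example, a shared proxy would see `private` in the result and decline to store the page.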
Where
[Figure: cache locations between browsers and content servers. Labels: browser cache, intranet, reverse proxy, local ISP cache, L4 switch, ISP cache, CDN cache, data center, content servers.]
Cache Types
• Proxy Caching
• Reverse Proxy Caching
• Transparent Caching
• Adaptive Caching
• Push Caching
• Active Caching
Proxy Caching
• Harvest/Squid
• Provide web content for a fixed user base
• Deployed at the network edges (company or institutional gateway or firewall hosts)
• Standalone operation
• Manual configuration in web browsers
• Commodity product/technology
• Single point of failure
Reverse Proxy Caching
• Designed to offload duties from one or more specific servers
• Data size is limited to the size of the static content on the server
• Challenge is fast, disk-less operation
• Cache consistency is easy
Transparent Caching
• Intercept HTTP requests and redirect them to web cache servers or cache clusters
• No client configuration
• Violates end-to-end paradigm
– Client thinks it is talking directly to server
– Server thinks it is talking to cache
• Implemented as: L4 switch
– A Layer 4 switch makes switching decisions based on the TCP or UDP port number, e.g., port 80 for HTTP
Adaptive Caching
• ISP-level caching, global data placement optimization
• Multiple cooperating distributed caches
• Operate as a cache mesh based on content demand
• Cache Group Management Protocol
– How meshes are formed
– How individual caches join/leave the meshes
• Content Routing Protocol sends requests to the appropriate cache within the meshes
• Uses distributed cache meshes to solve the hot spot problem
• Caches dynamically join and leave the groups based on content demand
• Administrative boundaries must be relaxed
Push Caching
• Keep data close to the clients requesting it
• Send the data out proactively
• Assumption: we are able to launch caches that may cross administrative boundaries
• Incurs cost (storage and transmission)
Active Caching
• Applies caching to dynamic documents
• 30% of client HTTP requests contain cookies
• The server provides the cache with the objects and any associated cache applets
– Use an applet inside the cache to customize dynamic pages on the fly
Cache Placement/Deployment
• Close to clients/content consumers
– Proxy caching
– Transparent proxy caching
• Close to servers/content providers
– Improve access to logical sets of data
– Delay-sensitive data: video, audio
– Reverse proxy caching
– Push caching
• Network choke points: strategic deployment
– Adaptive caching
– Problem with administrative control
Zipf's Law vs. Web Access
• Zipf's Law
• Web Access
• Caching?
Zipf’s Law
• Zipf's law: the frequency of an event P as a function of rank i is a power law function:
$P_i = \Omega / i^{\alpha}$, where $\alpha \le 1$
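To make the formula concrete, the sketch below computes the probabilities for a finite set of n ranked items. It assumes Ω is chosen so that the probabilities sum to 1, which the slide does not spell out; `zipf_probabilities` is a hypothetical helper.

```python
from typing import List


def zipf_probabilities(n: int, alpha: float) -> List[float]:
    """Return P_i = omega / i**alpha for ranks i = 1..n.

    Assumption: omega normalizes the distribution so the values sum to 1.
    """
    weights = [1.0 / (rank ** alpha) for rank in range(1, n + 1)]
    omega = 1.0 / sum(weights)
    return [omega * w for w in weights]


probs = zipf_probabilities(1000, 0.8)
ratio_first_to_second = probs[0] / probs[1]  # equals 2**alpha by construction
```

With α below 1 the tail decays slowly, so the most popular item is only 2^α times as likely as the second one rather than twice as likely.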
Zipf’s Law
• Observed to be true for
– Frequency of written words in English texts
– Population of cities
– Income of a company as a function of rank
Zipf’s Law vs. Web Access
• For a given server, page access by rank follows Zipf’s law
• Web requests from a fixed population of users follow Zipf's law with 0.64 < α < 0.83
Observations
• The top 1% of all documents account for 20% - 35% of proxy requests
• The top 10% account for 45% - 55% of requests
• It takes 25% to 40% of all documents to account for 70% of requests
• It takes 70% to 80% of all documents to account for 90% of requests
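One way to explore observations like these is to compute, under a pure Zipf model, what share of requests the most popular documents receive. This is a sketch only: the document count of 10,000 is an arbitrary assumption, `top_fraction_share` is a hypothetical helper, and real traces need not match the model.

```python
def top_fraction_share(num_docs: int, alpha: float, fraction: float) -> float:
    """Fraction of requests that go to the most popular `fraction` of documents,
    assuming request probability proportional to 1 / rank**alpha."""
    weights = [1.0 / (rank ** alpha) for rank in range(1, num_docs + 1)]
    top_count = max(1, int(num_docs * fraction))
    return sum(weights[:top_count]) / sum(weights)


# Evaluate both ends of the alpha range quoted for Web requests.
for alpha in (0.64, 0.83):
    shares = [top_fraction_share(10_000, alpha, f) for f in (0.01, 0.10, 0.40)]
    print(alpha, [round(s, 2) for s in shares])
```

The same function gives the hit rate of an idealized cache that holds exactly the top fraction of documents. It shows why a small cache already pays off, and why each additional increment of cache size buys less.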
Zipf’s Law and Caching
Discussion
• How does this help in cache design?
Basic caching algorithm
Pages may be
• Fresh: up-to-date
• Expired: current date > expiration date
• Stale: “old”
Basic caching algorithm - #2
If (page is in the cache)
    If (page is expired or stale)
        Get from server with if-modified-since
        If not modified, get from cache
        Else get from server
    Else get from cache
Else get from server
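The lookup logic above can be written as a small function. This is a sketch under assumptions: `Entry`, `get_page`, and the `fetch` callback signature are hypothetical, times are plain seconds, and "stale" is folded into "expired" for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Entry:
    body: bytes
    last_modified: float  # seconds since epoch, as reported by the origin
    expires: float        # seconds since epoch


# Hypothetical origin interface: fetch(url, if_modified_since) returns
# (status, body, last_modified, expires); a 304 reply carries an empty body.
Fetch = Callable[[str, Optional[float]], Tuple[int, bytes, float, float]]


def get_page(cache: Dict[str, Entry], url: str, now: float, fetch: Fetch) -> bytes:
    entry = cache.get(url)
    if entry is not None and now <= entry.expires:
        return entry.body                      # fresh: get from cache
    # Expired or missing: ask the server, conditionally if we hold a copy.
    ims = entry.last_modified if entry is not None else None
    status, body, last_modified, expires = fetch(url, ims)
    if status == 304 and entry is not None:
        entry.expires = expires                # not modified: get from cache
        return entry.body
    cache[url] = Entry(body, last_modified, expires)  # get from server
    return body
```

A 304 reply refreshes only the expiry time, so the cached body is reused without being transferred again.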
Basic caching algorithm - #3
If cache has space
    Store the file
Else
    1. Delete expired from cache
    2. Delete stale from cache
    3. Delete LRU from cache
    4. Delete largest/smallest from cache?
Cache Replacement
• Cache size is limited, so a replacement policy is needed
• LRU
• LFU
• Greedy-dual size
• Many others
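As an illustration of the first policy in the list, here is a minimal LRU cache bounded by total bytes. It is a sketch, not Squid's implementation; `LRUCache` and its fields are hypothetical names, and LFU or Greedy-dual size would differ only in how the victim is chosen.

```python
from collections import OrderedDict
from typing import Optional


class LRUCache:
    """Byte-bounded cache that evicts the least recently used object first."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.entries: "OrderedDict[str, bytes]" = OrderedDict()  # oldest first

    def get(self, url: str) -> Optional[bytes]:
        if url not in self.entries:
            return None
        self.entries.move_to_end(url)  # mark as most recently used
        return self.entries[url]

    def put(self, url: str, body: bytes) -> None:
        if url in self.entries:
            self.used_bytes -= len(self.entries.pop(url))
        if len(body) > self.capacity_bytes:
            return  # larger than the whole cache: do not store it
        while self.used_bytes + len(body) > self.capacity_bytes:
            _, evicted = self.entries.popitem(last=False)  # drop the LRU object
            self.used_bytes -= len(evicted)
        self.entries[url] = body
        self.used_bytes += len(body)


lru = LRUCache(capacity_bytes=10)
lru.put("/a", b"aaaa")
lru.put("/b", b"bbbb")
lru.get("/a")            # touch /a so that /b becomes the LRU object
lru.put("/c", b"cccc")   # does not fit: /b is evicted
```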
Cache Consistency
• Multiple copies of objects are created
– How and when should the copies be renewed?
• Goals
– Avoid stale copies
– Keep non-useful traffic as low as possible
Cache Consistency: Polling
Solution 1: polling every time
implemented in HTTP using the optional "if-modified-since" request header field
Benefit: strong consistency
Drawback: very slow cache hit
Cache Consistency: Polling
Solution 2: polling if TTL expires, widely used
– Associate a TTL (12 hours or 2 days) with each cached object
implemented in HTTP using the optional "expires" header field
Benefit: fast cache hit
Drawback: weak cache consistency (5% stale) because the TTL is an a priori estimate of an object's lifetime
Cache Consistency
• Solution 3: Invalidation protocols
• The server helps the proxy in maintaining consistency
• Invalidation protocols
– When the proxy makes a request,
  • Piggyback cache validation (PCV): the proxy provides some other potentially stale copies for the server to validate
  • Piggyback cache invalidation (PCI): the server provides some copies which have been updated since the last access
– Use of volumes
  • Volume lease:
    – The client receives a lease from the server
    – During the lease validity the client can retrieve copies from the proxy
    – When the lease expires the client has to renew it
• Problems: scalability; servers need to keep cache state
Cache Cooperation
• Hierarchical caching
– Cache servers form a hierarchy of tree-like structures
– Parent servers: top of the hierarchy, receive requests from child servers. If they do not have the requested objects, they ask either their parents or the origin web servers
– Sibling servers: if the local cache does not have the requested object, it asks its sibling caches. If the sibling caches do not have the object, the local cache asks the parent cache
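The lookup order described above (local cache, then siblings, then parent, then origin server) can be sketched as follows. `CacheNode`, `counting_origin`, and the in-process method calls are hypothetical; a real deployment would query peers over the network.

```python
from typing import Callable, Dict, List, Optional


class CacheNode:
    """One cache in a hierarchy: local store, optional siblings and parent."""

    def __init__(self, name: str, origin: Callable[[str], bytes],
                 parent: Optional["CacheNode"] = None) -> None:
        self.name = name
        self.origin = origin            # hypothetical origin-server fetcher
        self.parent = parent
        self.siblings: List["CacheNode"] = []
        self.store: Dict[str, bytes] = {}

    def lookup_local(self, url: str) -> Optional[bytes]:
        return self.store.get(url)

    def get(self, url: str) -> bytes:
        body = self.lookup_local(url)
        if body is None:                      # local miss: ask the siblings
            for sibling in self.siblings:
                body = sibling.lookup_local(url)
                if body is not None:
                    break
        if body is None:                      # sibling miss: go up the tree
            if self.parent is not None:
                body = self.parent.get(url)
            else:
                body = self.origin(url)       # top of the hierarchy
        self.store[url] = body                # keep a local copy
        return body


origin_calls: List[str] = []


def counting_origin(url: str) -> bytes:
    origin_calls.append(url)
    return b"page:" + url.encode()


parent = CacheNode("parent", counting_origin)
child_a = CacheNode("a", counting_origin, parent)
child_b = CacheNode("b", counting_origin, parent)
child_a.siblings, child_b.siblings = [child_b], [child_a]

child_a.get("/x")  # misses everywhere, fetched once from the origin
child_b.get("/x")  # answered by sibling a without contacting the origin
```

Each node on the path keeps its own copy of the object. That duplication is what spends capacity on popular objects.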
Cache Hierarchies
• Use hierarchy to scale a proxy
– Why?
  • Larger population = higher hit rate (fewer compulsory misses)
  • Larger effective cache size
– Why is the population for a single proxy limited?
  • Performance, administration, policy, etc.
• NLANR cache hierarchy
– Most popular
– 9 top-level caches
– Internet Cache Protocol (ICP) based
– Squid/Harvest proxy
• How to locate content?
ICP (Internet cache protocol)
• Simple protocol to query another cache for content
• Uses UDP – why?
• ICP message contents
– Type: query, hit, hit_obj, miss
– Other: identifier, URL, version, sender address
– Special message types used with UDP echo port
  • Used to probe server or "dumb cache"
• Query and then wait till time-out (2 sec)
• Transfers between caches still done using HTTP
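To show what such a message might look like on the wire, the sketch below packs an ICP query. The field layout and opcode values follow my reading of ICP version 2 (RFC 2186) and should be checked against the specification; the function names are hypothetical and no network I/O is performed.

```python
import socket
import struct

# Opcode values as understood from RFC 2186 (assumption; verify before use).
ICP_OP_QUERY = 1
ICP_OP_HIT = 2
ICP_OP_MISS = 3
ICP_VERSION = 2

# Header: opcode, version, message length, request number,
# options, option data, sender host address (20 bytes in total).
HEADER = struct.Struct("!BBHIII4s")


def build_icp_query(request_number: int, url: str,
                    requester_ip: str = "0.0.0.0") -> bytes:
    """Build an ICP query: header + requester address + NUL-terminated URL."""
    payload = socket.inet_aton(requester_ip) + url.encode("ascii") + b"\x00"
    length = HEADER.size + len(payload)
    header = HEADER.pack(ICP_OP_QUERY, ICP_VERSION, length,
                         request_number, 0, 0, socket.inet_aton("0.0.0.0"))
    return header + payload


def parse_icp_opcode(message: bytes) -> int:
    """Return the opcode of a received ICP message (e.g. hit or miss)."""
    opcode, _version, _length, _reqnum, _opts, _optdata, _sender = \
        HEADER.unpack_from(message)
    return opcode


query = build_icp_query(7, "http://www.squid-cache.org/")
```

In use, the datagram would go out through a UDP socket with a timeout (`socket.settimeout(2.0)` matches the 2-second wait above); a peer that does not answer in time is treated as a miss.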
Squid
[Figure sequence: a Squid hierarchy of one parent and three child caches serving a client. Frames show a web page request followed by ICP queries that are answered with ICP MISS, then a second web page request whose ICP queries are answered with one ICP MISS and two ICP HITs.]
Hierarchical caching
• Ideally, want the cache mesh to behave as a single cache with equivalent capacity and processing capability
• ICP: many copies of popular objects are created, so capacity is wasted
• High latency: more than one hop may be needed to find an object
• How to improve? Discuss!
Problems with caching
• Over 50% of all HTTP objects are uncacheable.
• Sources:
– Dynamic data: stock prices, frequently updated content
– CGI scripts: results based on passed parameters
– SSL: encrypted data is not cacheable
  • Most web clients don't handle mixed pages well, so many generic objects are transferred with SSL
– Cookies: results may be based on passed data
– Hit metering: the owner wants to measure the number of hits for revenue, etc., so cache busting is used
Risks of Using Proxy
• Benefits: reduce latency, bandwidth saving, etc.
• Risks
– Obsolete data
– Violate client privacy: the proxy can keep a log file telling which objects the client has requested
– Data integrity
Real Proxy Servers
• Squid: the most widely used. It works well and is free. http://www.squid-cache.org/
• Microsoft ISA Server 2004: Microsoft developed ISA to replace Microsoft Proxy Server. It is fully functional with Active Directory. http://www.microsoft.com/isaserver/
• Apache: the Apache web server has a module to do reverse caching (experimental). http://httpd.apache.org/docs-2.0/mod/mod_cache.html
• Cisco Cache Engine: sits next to (mostly) Cisco routers and receives transparently redirected HTTP requests. http://www.cisco.com/warp/public/cc/pd/cxsr/500/index.shtml
• CERN/W3C HTTPd: the original proxy server. http://www.w3.org/hypertext/WWW/Daemon/Status.html