web servers
Web Servers & Log Analysis
• What can we learn from looking at Web server logs?
  - What server resources were requested
  - When the files were requested
  - Who requested them (where IP address = who)
  - How they requested them (browser types & OS)
• Some assumptions
  - A request for a resource means the user did receive it
  - A resource is viewable & understandable to each user
  - Users are identified within a loose set of parameters
• How does knowing request patterns affect or help IA?
Types of Web Server Logs
• Proxy-based
  - Web access servers to control access or cache popular files
• Client-based
  - Local cache files
  - Browser History file(s)
• Network-based
  - Routers, firewalls & access points
• Server-based
  - Web servers to serve content
Using Web Servers
• Apache HTTP Server (The Apache Software Foundation)
• Microsoft Internet Information Server (Services)
• These applications “serve”
  - Text: HTML, XML, plain text
  - Graphics: JPEG, GIF, PNG
  - CGI, servlets, XMLHttpRequest & other logic
  - Other MIME types such as movies & sound
• Most servers can log requests for these files
  - Daily, weekly or monthly
  - Cannot always log CGI or related logic (specifically or “out of the box”)
How Servers Work
• Hypertext Transfer Protocol (HTTP)
  1. The browser requests a file
  2. The request is transferred via the network
  3. The server receives the request (& logs it)
  4. The server provides the file (& logs it)
  5. The browser displays the file
• Almost all Web servers work this way
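
The request/response steps above can be sketched in a few lines of Python. This is a simplified illustration rather than how any particular server is implemented: `FILES`, `handle_request` and the client address are hypothetical names and values, and the log entry only approximates a common access-log layout.

```python
from datetime import datetime, timezone

# Hypothetical in-memory "document root": path -> file contents.
FILES = {"/index.html": b"<html>Hello</html>"}

def handle_request(request_line, client_host, files=FILES, now=None):
    """Serve one request and return (status, body, log_entry)."""
    now = now or datetime.now(timezone.utc)
    # Step 3: the server receives the request line sent by the browser.
    method, path, protocol = request_line.split(" ")
    # Step 4: the server provides the file, or an error status.
    if method == "GET" and path in files:
        status, body = 200, files[path]
    else:
        # Anything else is treated as "not found" in this sketch.
        status, body = 404, b""
    # Steps 3 and 4 are both recorded in a single log entry here.
    timestamp = now.strftime("%d/%b/%Y:%H:%M:%S %z")
    log_entry = (f'{client_host} - - [{timestamp}] '
                 f'"{request_line}" {status} {len(body)}')
    return status, body, log_entry

status, body, entry = handle_request("GET /index.html HTTP/1.1", "192.0.2.1")
print(entry)
```

The point of the sketch is that the log entry is a by-product of serving: the server records what it was asked for and what it sent, not what the user actually saw.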
Types of Server Logs
• Access Log
  - Logs information such as page served or time served
• Referer Log
  - Logs name of the server and page that links to the currently served page
  - Not always recorded
  - Can be from any Web site
• Agent Log
  - Logs browser type and operating system
    • Mozilla
    • Windows
Log File Format
• Extended Log File Format: W3C Working Draft WD-logfile-960323
• Key advantage:
  - Computer storage cost decreases while paper cost rises
• Every server generates slightly different logs
Extended Log File Formats
• World Wide Web Consortium (W3C) standards
• Will automatically record much of what is programmatically done now
  - Faster
  - More accurate
  - Standard baselines for comparison
  - Graphics standards
What is a log file?
• A delimited text file with information about what the server is doing
  - IP Address or Domain name
  - Date/Time
  - Method used & Page Requested
  - Protocol, Response Code & Bytes Returned
  - Referring Page (sometimes)
  - UserAgent & Operating System
p0016c74ea.us.kpmg.com - - [01/Sep/2004:08:17:21 -0500] "GET /images/sanchez.jpg HTTP/1.1" 200 - "http://www.ischool.utexas.edu/research/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)"
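
As an illustration, the sample line above can be split into its fields with a regular expression. This is a minimal sketch that assumes the line follows the layout shown (host, two placeholder fields, timestamp, request, status, bytes, then optional referrer and user agent); `LOG_PATTERN` and `parse_line` are names chosen for this example, and real logs may need a more forgiving pattern.

```python
import re

# Pattern for the layout of the sample line; field names are our own.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" in the bytes column means no byte count was recorded.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields

SAMPLE = ('p0016c74ea.us.kpmg.com - - [01/Sep/2004:08:17:21 -0500] '
          '"GET /images/sanchez.jpg HTTP/1.1" 200 - '
          '"http://www.ischool.utexas.edu/research/" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows XP)"')

entry = parse_line(SAMPLE)
print(entry["host"], entry["path"], entry["status"])
```

Lines that do not match return `None`, which is one way a tool could count the "corrupt logfile lines" that analysis reports mention.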
In Search of Reliable Data
• Not as Foolproof as Paper
  - You can see when someone is reading a page
  - You can know the page is turned
  - You can know the book is checked out
• No State Information
  - The same person or another person could be reading page 1, then page 2
  - You really can’t tell how many users you have
• Server Hits not perfectly Representative
  - Counters inaccurate
  - Caching & Robots can skew counts up (+) or down (-)
• Floods/Bandwidth can Stop “intended” usage
What is a “hit”?
• Technically, a hit is simply any file requested from the server
  - That is logged
  - That represents (usually) part of a request to “see” a whole Web page
• Hits combine to represent a “page view”
• Page views combine to represent an “episode” or “session”
  - Episode is one activity or question a user performs or requests on a Web site
  - Session is a series of episodes that embodies all the interactions a user undertakes using a Web site (bounded by time, based on averages of around 30 min.)
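
One common way to approximate sessions from a log is to group each host's requests and start a new session whenever the gap between requests exceeds a timeout. The sketch below assumes a 30-minute timeout and treats a host name as a user, which is only a rough stand-in; `sessionize` and the sample data are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed inactivity timeout; the slide only says "around 30 min."
SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Group (host, timestamp) pairs into per-host sessions.

    Returns {host: [session, ...]} where each session is a list of
    timestamps in chronological order.
    """
    sessions = {}
    for host, when in sorted(requests, key=lambda r: r[1]):
        host_sessions = sessions.setdefault(host, [])
        # Continue the last session if the gap is within the timeout.
        if host_sessions and when - host_sessions[-1][-1] <= timeout:
            host_sessions[-1].append(when)
        else:
            host_sessions.append([when])
    return sessions

# Hypothetical requests from two hosts.
requests = [
    ("host-a", datetime(2004, 9, 1, 8, 0)),
    ("host-b", datetime(2004, 9, 1, 8, 5)),
    ("host-a", datetime(2004, 9, 1, 8, 10)),
    ("host-a", datetime(2004, 9, 1, 9, 0)),
]
for host, host_sessions in sessionize(requests).items():
    print(host, len(host_sessions), "session(s)")
```

Because several people can share one address and one person can appear under several, counts produced this way are estimates, not head counts.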
Making Servers More Reliable
• Keep system setups simple
  - unique file and directory names
  - clear, consistent structure
• Configure CMS for logging/serving
• Use an FTP server for file transfer
  - frees up logs and server!
• Judicious use of links
• Wise MIME types
  - some hard/impossible to log
Clever Web Server Setup
• Redirect CGI to find referrer
• Use a database
  - store web content
  - record usage data
• Create state information with programming
  - NSAPI
  - ActiveX
• Have contact information
• Have purpose statements
Managing Log Files
• Backup
• Store Results or Logs?
• Beginning New Logs
• Posting Results
Log Analysis Tools
• Analog
• Webalizer
• Sawmill
• WebTrends
• AWStats
• WWWStat
• GetStats
• Perl Scripts
• Data Mining & Business Intelligence tools
WebTrends
• A whole industry of analytics
• Most popular commercial application
Log Analysis Cumulative Sample
• Program started at Tue-03-Dec-1996 01:20 local time.
• Analysed requests from Thu-28-Jul-1994 20:31 to Mon-02-Dec-1996 23:59 (858.1 days).
• Total successful requests: 4 282 156 (88 952)
• Average successful requests per day: 4 990 (12 707)
• Total successful requests for pages: 1 058 526 (17 492)
• Total failed requests: 88 633 (1 649)
• Total redirected requests: 14 457 (197)
• Number of distinct files requested: 9 638 (2 268)
• Number of distinct hosts served: 311 878 (11 284)
• Number of new hosts served in last 7 days: 7 020
• Corrupt logfile lines: 262
• Unwanted logfile entries: 976
• Total data transferred: 23 953 Mbytes (510 619 kbytes)
• Average data transferred per day: 28 582 kbytes (72 946 kbytes)
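
The per-day averages in the sample report follow directly from the cumulative totals and the 858.1-day span. A quick check, using only figures from the report and assuming (our assumption, not stated in the report) that 1 Mbyte = 1024 kbytes:

```python
# Totals copied from the sample report.
DAYS_COVERED = 858.1
SUCCESSFUL_REQUESTS = 4_282_156
MBYTES_TRANSFERRED = 23_953

def per_day(total, days):
    """Average of a cumulative total over the reporting period."""
    return total / days

requests_per_day = per_day(SUCCESSFUL_REQUESTS, DAYS_COVERED)
# Assumption: 1 Mbyte = 1024 kbytes in the report's units.
kbytes_per_day = per_day(MBYTES_TRANSFERRED * 1024, DAYS_COVERED)

print(f"{requests_per_day:,.0f} requests/day")
print(f"{kbytes_per_day:,.0f} kbytes/day")
```

The request average rounds to the reported 4 990 per day. The data-transfer average lands within a few kbytes of the reported 28 582, a gap that is consistent with the totals and the day count having been rounded for display.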
How about the iSchool Web site?
• Our server log files are collected constantly
  - Daily
  - Weekly
  - Monthly
  - Even yearly
• What does a quick look tell us?
  - How well is the server working?
    • Uptime, server errors, logging errors
  - How popular is our site?
    • Number of hits, popular files
  - Who is visiting the site?
    • Countries, types of companies
  - What searches led people here?
UT & its Web server logs
• UT Web log reports (figures in parentheses refer to the 7 days to 28-Mar-2004 03:00)
Successful requests: 39,826,634 (39,596,364)
Average successful requests per day: 5,690,083 (5,656,623)
Successful requests for pages: 4,189,081 (4,154,717)
Average successful requests for pages per day: 598,499 (593,530)
Failed requests: 442,129 (439,467)
Redirected requests: 1,101,849 (1,093,606)
Distinct files requested: 479,022 (473,341)
Corrupt logfile lines: 427
Data transferred: 278.504 Gbytes (276.650 Gbytes)
Average data transferred per day: 39.790 Gbytes (39.521 Gbytes)
Neat Analysis Tricks
• Use a search engine to find references
  - “link:www.ischool.utexas.edu/~donturn”
    • key to using unique names
  - use many engines
    • update times are different
    • blocking mechanisms are different
• Use Web searches (or Yahoo, Bloglines…)
  - look for references
  - look for IP addresses of users
Neat Tricks, cont.
• Walking up the Links
  - follow URLs upward
• Reverse Sort
  - look for relations
• Use your own robot to index
  - Test
Web Surveys, an alternative
• Surveys actually ask users what they did, what they sought & if it helped
• GVU, Nielsen and GNN
  - Qualitative questions
    • phone
    • web forms
  - Self-selected sample problems
    • random selection
    • oversample
Analysis of a Very Large Search Log
• What kinds of patterns can we find?
• Request = query and results page
• 280 GB: Six Weeks of Web Queries
  - Almost 1 Billion Search Requests, 850K valid, 575K queries
  - 285 Million User Sessions (cookie issues)
  - Large volume, less trendy
  - Why are unique queries important?
• Web Users:
  - Use Short Queries in short sessions (63.7% one request)
  - Mostly Look at the First Ten Results only
  - Seldom Modify Queries
• Traditional IR Isn’t Accurately Describing Web Search
• Phrase Searching Could Be Augmented
• Silverstein, Henzinger, Marais, Moricz (1998)
Analysis of a Very Large Search Log, cont.
• 2.35 Average Terms Per Query
  - 0 = 20.6% (?)
  - 1 = 25.8%
  - 2 = 26.0% (0 to 2 terms combined = 72.4%)
• Operators Per Query
  - 0 = 79.6%
• Terms Predictable
• First Set of Results Viewed Only = 85%
• Some (Single Term Phrase) Query Correlation
  - Augmentation
  - Taxonomy Input
  - Robots vs. Humans
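
Figures such as average terms per query and the share of queries with 0, 1 or 2 terms can be computed from a query log in a few lines. The sketch below is illustrative only: it splits on whitespace, which is a simplifying assumption about what counts as a term, and the sample queries are made up.

```python
from collections import Counter

def term_count_distribution(queries):
    """Return {number_of_terms: percentage_of_queries}."""
    counts = Counter(len(query.split()) for query in queries)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {n: 100 * c / total for n, c in sorted(counts.items())}

def mean_terms(queries):
    """Average number of whitespace-separated terms per query."""
    return sum(len(query.split()) for query in queries) / len(queries)

# Hypothetical queries; an empty string stands for an empty request.
queries = ["", "austin", "ut austin", "web server logs"]
print(term_count_distribution(queries))
print(mean_terms(queries))
```

Whether empty requests are included changes both numbers noticeably, which may be why the 0-term share on the slide carries a question mark.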
Real Life Information Retrieval
• 51K Queries from Excite (1997)
• Mean Search Terms per Query = 2.21
• Number of Terms
  - 1 = 31%, 2 = 31%, 3 = 18% (80% Combined)
• Logic & Modifiers (by User)
  - Infrequent
  - AND, “+”, “-”
• Logic & Modifiers (by Query)
  - 6% of Users
  - Less Than 10% of Queries
  - Lots of Mistakes
• Uniqueness of Queries
  - 35% successive
  - 22% modified
  - 43% identical
Real Life Information Retrieval, cont.
• Queries per user: 2.8
• Sessions
  - Flawed Analysis (User ID)
  - Some Revisits to Query (Result Page Revisits)
• Page Views
  - Accurate, but not by User
• Use of Relevance Feedback (more like this)
  - Not Used Much (~11%)
• Terms Used Typical & frequent
• Mistakes
  - Typos
  - Misspellings
  - Bad (Advanced) Query Formulation
• Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998)
Downie & Web Usage
• Server logs are like library usage
• User-based analyses
  - who
  - where
  - what
• File-based analyses
  - amount
• Request analyses
  - conform (loosely) to Zipf’s Law
• Byte-based analyses
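
Zipf’s Law says that the frequency of an item is roughly inversely proportional to its rank, so a plot of log(frequency) against log(rank) is close to a straight line with slope near -1. A rough way to check request data against it is to fit that slope. The sketch below uses a plain least-squares fit; `zipf_slope` and the synthetic request list are assumptions for illustration, and it needs at least two distinct files.

```python
import math
from collections import Counter

def zipf_slope(paths):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(Counter(paths).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Synthetic requests: the file at rank r is requested 120 // r times,
# an idealised Zipf distribution.
paths = [f"/page{r}.html" for r in range(1, 6) for _ in range(120 // r)]
print(round(zipf_slope(paths), 3))
```

Real request logs only conform loosely, so a fitted slope somewhere in the neighbourhood of -1 is the most one should expect.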
Web use analysis & IA?
• Another tool to begin to understand how people use your Web-provided resources
• With a small amount of setup, you can learn a large amount
• Server use can be integrated into site usage for users
  - Lists of popular pages & more interlinking pages
  - Adding search terms that found the page to related pages
  - Adjust metadata to reflect searches that find pages
  - Add pages to the site index or site map
• First-cut usability information
  - Pages 1 & 2 were accessed, but not 3. Why?
  - Navigation usage, link ordering and design understanding
  - Knowing what browsers & OS are in use helps tailor design and media types
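
Both the "popular pages" list and the "pages 1 & 2 but not 3" question can be answered by comparing requested paths against a list of the site's pages. A minimal sketch, assuming the paths have already been extracted from the log; `page_report` and the sample paths are hypothetical.

```python
from collections import Counter

def page_report(requested_paths, site_pages, top_n=10):
    """Return (most popular pages, pages never requested)."""
    known = set(site_pages)
    # Count only requests for known pages, ignoring images and the like.
    counts = Counter(p for p in requested_paths if p in known)
    popular = counts.most_common(top_n)
    never_requested = sorted(known - set(counts))
    return popular, never_requested

# Hypothetical data: three pages on the site, four logged requests.
site_pages = ["/page1.html", "/page2.html", "/page3.html"]
requested = ["/page1.html", "/page2.html", "/page1.html", "/logo.gif"]

popular, never_requested = page_report(requested, site_pages)
print(popular)
print(never_requested)
```

A page that never shows up may be poorly linked, badly labelled or simply unneeded; the log raises the question but does not answer it.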
BREAK!
• No Presentation this week
  - Next week: Asset management, content management & version control
• Break up media development work
• Examine current pages, style sheets & designs
• Set up next set of pair & individual deliverables
Media Development work
• We need to find & create graphics for the new site
• Content about:
  - Austin
  - UT
  - iSchool
  - People at the iSchool
  - Students at work in the iSchool (classes, labs)
• Screen grab from videos
• Search the Web for copyright-free images
• Take our own pictures
Current Pages & Designs
• First version of main iSchool page template and CSS complete
• Secondary page template & CSS complete
  - Some secondary pages already built
• Index page template set
• Site map page initially set
  - Big Map
  - Main pages map
Next steps
• In class
  - Test & evaluate current CSS and templates
  - Improvise secondary home page based on initial design
  - Examine new Alumni section
  - Examine new Course Listing page
• For homework
  - Complete secondary page migration to new design
  - Rotate design work
    • Alumni
    • Site Map
    • Home page design ideas
  - Picture/Media creation work