Reading Cyber Tracks: Analyzing Log Files and Search Logs
Darlene Fichter
Data Library Coordinator, U of S Library
January 29, 2004
Overview
Log Files
– How can log files help?
– Getting up close and personal with log files
– 7 things my server logs told me
– Error logs
Search Logs
– Content synopsis
– Site search performance
– Intranet usability
– “Best bets”
Why super heroes read log files
Macro picture
Rich source of information about
– User behaviors
– Link choices
– Typical paths through a web site
Point out trouble spots
Help inform redesigns
Detractors
– The least useful type of data for understanding users
– Doesn’t measure outcomes
– We don’t know the intent of the visitor
Hits are meaningless
– True
– Not true, if you’re estimating service capacity and performance
Imprecise and incomplete
Server logs can tell you
Who is using your site?
Who never uses your site?
Where do they enter?
What route do they follow?
What do they use?
How long do they stay?
Big picture
Average Number of Visits per Day on Weekdays 6254
Average Number of Hits per Day on Weekdays 110437
Average Number of Visits per Weekend 8157
Average Number of Hits per Weekend 113500
Most Active Day of the Week Mon
Least Active Day of the Week Sat
Most Active Date December 04 2003
Number of Hits on Most Active Date 173527
Least Active Date December 25 2003
Number of Hits on Least Active Date 38078
Most Active Hour of the Day 14:00-14:59
Least Active Hour of the Day 01:00-01:59
Page duration
How long do most people spend on a page?
– An inordinately long time could mean
• Very confusing
• Very worthwhile
• Went for coffee?
– Skip averages and look for the mode or median
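A minimal sketch of why the median or mode is a better guide than the average. The durations are made-up values in seconds, and the 10-second binning is an assumption added so the mode is meaningful for noisy timings.

```python
from statistics import median, multimode

def summarize_durations(seconds, bin_size=10):
    """Return (mean, median, modes) for per-page viewing times in seconds."""
    if not seconds:
        raise ValueError("need at least one duration")
    # Bin the timings so that near-identical values count as the same mode.
    binned = [bin_size * (s // bin_size) for s in seconds]
    mean = sum(seconds) / len(seconds)
    return mean, median(seconds), multimode(binned)

# Hypothetical data: most visitors stay under a minute, one went for coffee.
durations = [12, 15, 18, 14, 45, 16, 1800]
mean, mid, modes = summarize_durations(durations)
print(f"mean={mean:.0f}s median={mid}s modes={modes}")
```

One long idle visit drags the mean to several minutes, while the median and mode stay close to what most visitors actually did.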
Exit pages
The point where someone leaves your site may offer some interesting clues
Related links – fine
Find an article page listing databases
– A caveat to keep in mind
• Use of the back button may not show up when pages are loaded from the browser cache
Forms
What is the completion rate for forms?
How many people abandon the ILL loan process part way through?
Forms
Does your system for marking required fields “work”, or are people presented with error upon error on submission?
Are employees entering bogus responses in form fields to circumvent bad design?
What can you measure?
Depends on what is recorded in the log file
Web server access log files
– ASCII file that records each request
– Two common web server log file types
• Common
• Combined – more data
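A rough sketch of reading one access log line in either format. The regular expression is a simplified pattern written for this illustration, not an official grammar, and the sample lines are invented (only the /data?top request comes from these slides).

```python
import re

# Common format fields, optionally followed by the two extra
# Combined format fields (referrer and user agent).
LOG_LINE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    m = LOG_LINE.match(line)
    return m.groupdict() if m else None

# Invented sample lines for illustration.
common_sample = ('192.0.2.10 - - [27/Jan/2004:03:08:11 -0600] '
                 '"GET /data?top HTTP/1.0" 200 5120')
combined_sample = common_sample + (' "http://library.usask.ca/" '
                                   '"Mozilla/4.0 (compatible; MSIE 6.0)"')

print(parse_line(common_sample)["referer"])    # None: Common has no referrer
print(parse_line(combined_sample)["referer"])
```

The "more data" in the Combined format is what makes referrer and browser reports possible.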
Status codes
In general
– 200 codes are successful requests by a client
– 300’s report server redirects
– 400’s are used for client errors
– 500’s are used for server errors
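As an illustration, status codes can be grouped by their first digit to get a quick health check. The class labels and the sample codes below are made up for this sketch.

```python
from collections import Counter

# Labels chosen for this sketch, keyed by the first digit of the status code.
CLASSES = {2: "success", 3: "redirect", 4: "client error", 5: "server error"}

def status_class(code):
    """Map a status code such as '404' to its general class."""
    return CLASSES.get(int(code) // 100, "other")

def tally(codes):
    """Count how many requests fall into each class."""
    return Counter(status_class(c) for c in codes)

# Hypothetical status column pulled from an access log.
codes = ["200", "200", "304", "404", "404", "500", "200"]
print(tally(codes))
```

A rising share of client errors is usually the first hint of link rot.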
404
Log analysis software
Produce summary tables, charts and graphs
Popular ones are:
– WebTrends (commercial, Windows, Unix)
– Analog (free, Unix, Windows)
– Wusage (free, Unix, Windows)
– Many more
Yahoo Log Analysis Tools > Titles
Sample: Summary top files requested
Meaningful filenames rather than id=1232 help make this report understandable
How visible are your links and menus?
Are you tuning your site? Is the new button or label working?
Is anyone clicking on the special announcement information?
Run a special report and see what links are used the most on your home page
Redesign of U of S home page
Help was removed.
Homepage Clickthrus: http://www.usask.ca/analog/homepage/
Before and after
Does the new top menu work?
Click Tracks [www.clicktracks.com] displays all the links on the page and the % of visitors who click on each.
Digging for evidence
Are people able to get from here to there?
Specific example
– Evaluating a site wide menu
– Trying to make the case that generic terms rather than “brand names” were more effective
– Team response was polite nods
Looked up how many people actually selected this area from the home page based on the brand name label rather than the generic term.
Possible because the links had different syntax
Tip: Add tracking code to the end of a link
http://library.usask.ca/data?top
Log file: - [27/Jan/2004:03:08:11 -0600] "GET /data?top
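A small sketch of counting how often each tagged version of a link was followed. Only the ?top tag comes from the slide; the ?side tag, the hosts and the other log lines are invented for illustration.

```python
import re
from collections import Counter

# Capture the requested path and whatever follows the "?" (the tracking tag).
REQUEST = re.compile(r'"GET (?P<path>[^?\s"]+)(?:\?(?P<tag>[^\s"]*))?')

def count_link_tags(lines, path="/data"):
    """Count requests for one page, grouped by the tag after the '?'."""
    counts = Counter()
    for line in lines:
        m = REQUEST.search(line)
        if m and m.group("path") == path:
            counts[m.group("tag") or "(no tag)"] += 1
    return counts

# Invented log lines; "side" is a hypothetical tag for a second link location.
log = [
    '192.0.2.1 - - [27/Jan/2004:03:08:11 -0600] "GET /data?top HTTP/1.0" 200 512',
    '192.0.2.2 - - [27/Jan/2004:03:09:40 -0600] "GET /data?side HTTP/1.0" 200 512',
    '192.0.2.3 - - [27/Jan/2004:03:10:02 -0600] "GET /data?side HTTP/1.0" 200 512',
    '192.0.2.4 - - [27/Jan/2004:03:11:15 -0600] "GET /data HTTP/1.0" 200 512',
]
print(count_link_tags(log))
```

Because each link carries its own tag, the same destination page can be split by which label or position the visitor chose.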
Log files to the rescue
We discovered
A quick glance at the log file revealed that in the prior two days
– 200 accesses resulted from the brand name label
– 1000 accesses for the generic term in a less prominent location
Where do you post announcements?
Need to get everyone’s attention
– Branch closure
– Pay fines now in order to convocate
Not everyone enters your site at the home page
Find the entry pages
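One possible way to find entry pages, sketched under several assumptions: a visitor is identified by host alone, a gap of more than 30 minutes starts a new visit, and the log lines arrive in time order. The sample lines and page names are invented.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

LINE = re.compile(r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "GET (?P<path>\S+)')
TIMEOUT = timedelta(minutes=30)  # assumed gap that ends a visit

def entry_pages(lines):
    """Count the first page requested in each visit."""
    last_seen = {}
    entries = Counter()
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        when = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        host = m.group("host")
        previous = last_seen.get(host)
        # First request from this host, or first after a long gap: a new visit.
        if previous is None or when - previous > TIMEOUT:
            entries[m.group("path")] += 1
        last_seen[host] = when
    return entries

# Invented log lines for illustration.
log = [
    '192.0.2.1 - - [27/Jan/2004:03:00:00 -0600] "GET /hours.html HTTP/1.0" 200 512',
    '192.0.2.1 - - [27/Jan/2004:03:02:00 -0600] "GET /index.html HTTP/1.0" 200 512',
    '192.0.2.2 - - [27/Jan/2004:03:05:00 -0600] "GET /index.html HTTP/1.0" 200 512',
    '192.0.2.1 - - [27/Jan/2004:04:10:00 -0600] "GET /databases.html HTTP/1.0" 200 512',
]
print(entry_pages(log).most_common())
```

Pages that rank high here are candidates for carrying the announcement alongside the home page.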
What’s hot and what’s not?
What areas or pages are popular?
How is it changing over time?
Popular may ≠ good
– Custom 404 pages are often #1 on a site with link rot
– High use may mean people are lost, if your site doesn’t have a followed link colour
Popularity questions
What’s popular but shouldn’t be?
– Overdependence on site search may signal site navigation weaknesses
What should be popular but isn’t?
– If you expect high usage and it’s not happening, recheck links, labels and position. Is the link to the underutilized area prominent? Is it plain language or jargon?
Does anyone care?
Are we posting new announcements and no one reads them (ever)?
Are the only hits from search engines spidering the site?
What should we add more of?
Getting down to the details
When can you move to CSS layouts?
When can you downgrade support for Netscape 4.78?
What web browsers do you need to support?
[Chart: Microsoft Explorer Browsers. Per cent of total hits by Explorer version (1.x through 7.x), Mon 12/01/2003 - Wed 12/31/2003 (1 month scale)]
Retrace someone’s footsteps
What page referred them to the library site?
– No referrer? Bookmark, typed in URL (or a robot)
What path did they follow?
– Sometimes even what link they clicked
– What data they may have typed in a search box?
Where did they leave?
Follow the top paths
Pay attention to where they stopped and restarted
A jump between two areas with no direct link between them may indicate they used their back button
Error log captures
Date
Error level
Client IP address or hostname
Error message or path to requested file
[Wed Jan 28 00:15:26 2004] [error] [client 24.69.255.237] File does not exist: /data/www/northwest/images/spacer.gif, referer: http://library.usask.ca/northwest/contents.html
[Wed Jan 28 00:16:30 2004] [error] [client 66.77.73.89] File does not exist: /data/www/education/chldawrd.html
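A sketch of pulling those four fields out of error log lines and tallying the missing files. The pattern is written only against the two sample lines above; other error log layouts may need a different expression.

```python
import re
from collections import Counter

# Date, error level, client, then the message with an optional trailing referer.
ERROR_LINE = re.compile(
    r'\[(?P<date>[^\]]+)\] \[(?P<level>\w+)\] \[client (?P<client>[^\]]+)\] '
    r'(?P<message>.*?)(?:, referer: (?P<referer>\S+))?$'
)
MISSING = "File does not exist: "

def parse_error(line):
    """Return a dict of fields, or None if the line does not match."""
    m = ERROR_LINE.match(line.strip())
    return m.groupdict() if m else None

def missing_files(lines):
    """Count how often each missing file was requested."""
    counts = Counter()
    for line in lines:
        entry = parse_error(line)
        if entry and entry["message"].startswith(MISSING):
            counts[entry["message"][len(MISSING):]] += 1
    return counts

# The two sample lines from the slide.
log = [
    "[Wed Jan 28 00:15:26 2004] [error] [client 24.69.255.237] File does not exist: "
    "/data/www/northwest/images/spacer.gif, referer: "
    "http://library.usask.ca/northwest/contents.html",
    "[Wed Jan 28 00:16:30 2004] [error] [client 66.77.73.89] File does not exist: "
    "/data/www/education/chldawrd.html",
]
print(missing_files(log).most_common())
```

When a referer is present it names the page that holds the broken link, which is where the fix belongs.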
Also log
Some types of authentication failures
Authentication problems may indicate a need to add:
– Directions – usernames are case sensitive
– A password reminder feature
Redesign or launch of new service
Watch your log files in “real time” or every few seconds
On a UNIX server, use this command:
tail -f path-to-error_log-file
For example:
tail -f /usr/local/apache/logs/error_log
Site search and search logs
Rich source of data
Often underutilized
Tap into people’s expectations
Site search engine audit can help you to tune your search engine, web pages, and results.
Useful for finding out about:
– Content synopsis
– Site search performance
– Clues about web site usability
– Creating “best bets”
Search engines generate two data files
Robot logging
– What URLs
– What files
– How many terms are indexed
Search query log
– Similar to web server log
Strength – generate term frequencies
Weakness – most don’t show IF the user clicked any results
Content synopsis
Discovery tool if many independent content developers and/or servers
Bird’s eye view of breadth and depth
The robot report can tell you:
– How many HTML pages?
– How many PDFs?
– How many unique words?
– How many bad links/URLs?
Sample of Swish-e directory report
Checking dir "/data/resources/usability/readersurvey_files"...
Checking dir "/data/resources/usability/templates"...
Checking dir "/data/resources/usability/ugasurvey"...
Checking dir "/data/software.purchased"...
Checking dir "/data/ssh"...
Checking dir "/data/staffsessionmaterial"...
Checking dir "/data/staffsessionmaterial/groupware.presentation"...
Checking dir "/data/surveys"...
Checking dir "/data/sysinfo"...
Checking dir "/data/sysinfo/CVS"...
Checking dir "/data/sysinfo/apps"...
Checking dir "/data/sysinfo/ntnetwork"...
Excerpt of Swish-e summary
Removing very common words... no words removed.
Writing main index... 23414 unique words indexed.
Writing file index... 752 files indexed.
Running time: 21 seconds.
Indexing done!
Site search performance: coverage
Log analysis is just one part of the assessment
Strength is the ability to see 1000’s of real queries
Identify and repeat the top 50, 100 or 200 queries
BBCi – BBC search engines
Receive hourly reports so they can track trends [1]
Tune results
– Space shuttle Columbia disaster
– Columbia the country
[1] http://www.currybet.net/articles/audiences/
Sample queries by frequency
Verify Ultraseek search term query report– Site-wide and collection specific queries
Queries by Frequency Results from past 1 month
14722 Total queries
174 "staff"
169 url:lights.ca, url:hr url:marketing url:sysinfo url:contact, || domain
146 url:lights.ca, url:docs url:hr url:sysinfo url:contact, || stats
139 "domain setup"
105 url:lights.ca, url:docs url:hr url:sysinfo url:contact || orientation
92 url:lights.ca, url:docs url:hr url:sysinfo url:contact, || "hot+spare"
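If your search engine has no built-in report, a frequency table like this can be approximated from a raw query log. The sketch assumes the queries have already been extracted one per entry; the sample queries are invented, and the case and whitespace folding is an assumption about what counts as the "same" query.

```python
from collections import Counter

def normalize(query):
    """Fold case and collapse whitespace so trivial variants count together."""
    return " ".join(query.lower().split())

def top_queries(queries, n=50):
    """Return the n most frequent queries as (query, count) pairs."""
    counts = Counter(normalize(q) for q in queries if q.strip())
    return counts.most_common(n)

def coverage(queries, n):
    """Share of all searches accounted for by the n most frequent queries."""
    cleaned = [q for q in queries if q.strip()]
    if not cleaned:
        return 0.0
    return sum(count for _, count in top_queries(cleaned, n)) / len(cleaned)

# Invented query log for illustration.
queries = ["staff", "Staff", "domain setup", "staff ", "orientation", "domain  setup"]
print(top_queries(queries, 2))
print(f"top 2 cover {coverage(queries, 2):.0%} of searches")
```

The coverage figure helps decide how far down the list (top 50, 100 or 200) is worth reviewing by hand.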
Mining for gold
What did people expect to find?
– Unlike web server logs, we can capture users’ own words for what they are seeking
– Safely assume they had a “hope” that we’d have the information
– How well do we measure up?
Do we have content for the top 25 queries?
If not, why not? Should we add it?
Wild and wacky requests
Is the site search labeled clearly?
– Do people know where they are?
– Do they know what they’re searching?
– If they’re looking for MP3 downloads, Hawaiian vacations, and foot fetishes, then the visual identity and “sense of place” is needed to distinguish this from Internet engines
Book related queries
Lots of book titles and authors
– This may be okay – depends on your site content
– May mean they are lost and looking for the catalogue
Check your labeling
Guide them – Looking for books? Search the catalogue
Missing content
Are they looking for events?
Employee directory information
Subject pages (stored in a database and not in the site search)
– May be as simple as adding it to the robot
– May require a simple (or sophisticated) metasearch
Search and silos
Challenging to have people recognize what they are searching
Federated searching presents possibilities of breaking down silos
Specific example
Dozens of queries for Italian magazines
It is imperative that our sites reflect the language of our users and, in academia, the language of faculty.
A professor of Italian literature gave an assignment and sent his students to find “Italian magazines” on the library’s web site
Nowhere do the words “Italian magazines” appear on a web page
Site search performance: interface
Default settings for search page
Most searchers accept your defaults
How well do these mesh?
– “anding” or “not anding” all keywords
Assessing performance
Look at your top 50 and 100 search queries and see if the defaults are helping or impeding good result sets
How often is there a likely looking item in the top 3 results? Is it relevant?
Is there a clear mismatch between the searcher’s terminology and the site terminology?
Is the best page ranking low due to its construction?
– Graphics, no title tag, etc.
“No matches found”
Dig deep
– Is it typographic or spelling errors? Does your search offer suggestions?
– Is it a language gap?
– Is it variant terms for the same thing?
Situational analysis
Where is the leaping off point for the search?
– Run a referrer report for your search page
– When do many users abandon browsing and head for search?
Turn to your web server logs
– Referring URL and then the search terms entered
– What was the page like just before they searched?
– Did it have the answer to the question?
– Was the choice on a menu but the vocabulary in the label different?
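A sketch of a referrer report for the search page, pairing each referring URL with the terms entered. It assumes Combined format logs, a search page at the hypothetical path /search, and a hypothetical query parameter named q; the sample lines and the library.example host are invented.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Requested URL, then status and size, then the quoted referrer.
LINE = re.compile(r'"GET (?P<url>\S+)[^"]*" \d{3} \S+ "(?P<referer>[^"]*)"')

def search_referrers(lines, search_path="/search", param="q"):
    """Count (referring page, search terms) pairs for requests to the search page."""
    counts = Counter()
    for line in lines:
        m = LINE.search(line)
        if not m:
            continue
        url = urlsplit(m.group("url"))
        if url.path != search_path:
            continue
        terms = parse_qs(url.query).get(param, [""])[0]
        counts[(m.group("referer"), terms)] += 1
    return counts

# Invented Combined format lines for illustration.
log = [
    '192.0.2.1 - - [28/Jan/2004:10:00:00 -0600] "GET /search?q=italian+magazines HTTP/1.0" '
    '200 900 "http://library.example/journals.html" "Mozilla/4.0"',
    '192.0.2.2 - - [28/Jan/2004:10:05:00 -0600] "GET /search?q=italian+magazines HTTP/1.0" '
    '200 900 "http://library.example/journals.html" "Mozilla/4.0"',
    '192.0.2.3 - - [28/Jan/2004:10:07:00 -0600] "GET /hours.html HTTP/1.0" '
    '200 400 "http://library.example/" "Mozilla/4.0"',
]
for (referer, terms), n in search_referrers(log).most_common():
    print(n, referer, repr(terms))
```

Pages that keep sending people to search with the same terms are the ones to open and inspect for missing or mislabeled links.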
From browse to search
Usability testing shows that people move back and forth
P. Gremett found, when analyzing Amazon, that the majority of users browsed until the browsing areas became too busy, ambiguous or lacked relevant content
P. Gremett. Utilizing a User’s Context to Improve Search Results. CHI 2003.
Typical response: blame the search engine
Reality check
– Garbage in, garbage out
– Use the terms our visitors are using
– Make sure pages are designed to rank effectively
IF the web team can change the content
Example: Recent site audit
More specific queries ranked lower than a single general term
– Employment
– Employment income
– Employment income occupation
It is counter-intuitive that the last phrase would rank higher on the first result set
From research to action
“Best bets”
– Richard Wiggins, Louis Rosenfeld, Martin Belam and others have written about the performance increase you can achieve with “Best Bets”
– Wiggins found that 50% of searches could be matched to 1000 unique queries
What’s a “best bet”?
Rather than relying on the search engine to rank, human editors designate one or more sites as the best bet for the top 20, 200 or 500 terms
The “best bet” appears at the top of the result set
– Example: a search for library brings up the Library home page, not C programming libraries or the library budget.
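A minimal sketch of the idea, assuming the editors’ picks live in a simple lookup table keyed by the normalized query. The function name, table and result titles are hypothetical; a real search engine would have its own mechanism for this.

```python
def with_best_bets(query, engine_results, best_bets):
    """Put the editors' picks first, then the engine's own ranking."""
    key = " ".join(query.lower().split())
    pinned = best_bets.get(key, [])
    # Avoid listing a pinned page a second time further down.
    rest = [r for r in engine_results if r not in pinned]
    return pinned + rest

# Hypothetical editor-maintained table and engine output.
best_bets = {"library": ["Library home page"]}
engine_results = ["C programming libraries", "Library budget", "Library home page"]

print(with_best_bets("Library", engine_results, best_bets))
print(with_best_bets("budget", engine_results, best_bets))
```

Queries without an entry fall through to the engine’s ranking unchanged, so the table only needs to cover the top terms.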
In summary
Web logs and search logs
– Rich sources of information
– Give us clues and help improve our sites
– Fast and easy and readily at hand
Help you create new value added services
Log data
Best used as part of redesign process that includes web site audits, usability testing and log analysis