
Reading Cyber Tracks: Analyzing Log Files and Search Logs

Darlene Fichter

Data Library Coordinator, U of S Library

January 29, 2004

Overview

Log Files
– How can log files help?
– Getting up close and personal with log files
– 7 things my server logs told me
– Error logs

Search Logs
– Content synopsis
– Site search performance
– Intranet usability
– “Best bets”

Why super heroes read log files

Macro picture
Rich source of information about
– User behaviors
– Link choices
– Typical paths through a web site

Point out trouble spots
Help inform redesigns

Detractors
– The least useful type of data for understanding users
– Doesn’t measure outcomes
– We don’t know the intent of the visitor

Hits are meaningless
– True
– Not true, if you’re estimating service capacity and performance

Imprecise and incomplete

Server logs can tell you

Who is using your site?
Who never uses your site?
Where do they enter?
What route do they follow?
What do they use?
How long do they stay?

Big picture

Average Number of Visits per Day on Weekdays 6254

Average Number of Hits per Day on Weekdays 110437

Average Number of Visits per Weekend 8157

Average Number of Hits per Weekend 113500

Most Active Day of the Week Mon

Least Active Day of the Week Sat

Most Active Date December 04 2003

Number of Hits on Most Active Date 173527

Least Active Date December 25 2003

Number of Hits on Least Active Date 38078

Most Active Hour of the Day 14:00-14:59

Least Active Hour of the Day 01:00-01:59

Page duration

How long do most people spend on a page?
– Inordinately long time could mean
• Very confusing
• Very worthwhile
• Went for coffee?
– Skip averages and look for the mode or median
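A minimal sketch of why the median or mode is the safer summary. It assumes time-on-page values (in seconds) have already been estimated from the gaps between a visitor's consecutive requests; the numbers are made up for illustration.

```python
from statistics import mean, median, mode

# Hypothetical time-on-page values in seconds for one page.
# The 1800-second value stands in for a visitor who went for coffee.
durations = [12, 15, 15, 20, 22, 15, 30, 1800]

print("mean:  ", mean(durations))    # pulled far upward by the single outlier
print("median:", median(durations))  # middle of the sorted values
print("mode:  ", mode(durations))    # most common value
```

One long visit is enough to push the mean into minutes, while the median and mode stay close to what most visitors actually did.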

Exit pages

The point where someone leaves your site may offer some interesting clues
– Related links – fine
– Find an article page listing databases
– A caveat to keep in mind
• Use of the back button may not show up when pages are loaded from the browser cache

Forms

What is the completion rate for forms?
How many people abandon the ILL loan process part way through?

Forms

Does your system for marking required fields work, or are people presented with error upon error on submission?

Are employees entering in bogus responses in form fields to circumvent bad design?

What can you measure?

Depends on what is recorded in the log file
Web server access log files
– ASCII file that records each request
– Two common web server log file types
• Common
• Combined – more data

Example: Apache combined log format

Who and when?

IP address or hostname
Identity or login (seldom used)
Username recorded
Date, time

What did they ask for? Did it work?

Method
Path
Protocol (http) and version
Status code

Status codes

In general
– 200 codes are successful requests by a client
– 300’s report server redirects
– 400’s are used for client errors
– 500’s are used for server errors

404

Page immediately before this request

Bytes transferred

Referring site

User Agent: Browser and OS

Browser – Mozilla
OS – Windows NT
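The fields walked through above can be pulled apart with a regular expression. This is a sketch only: the sample line is invented for illustration, and a production parser would need to cope with malformed or escaped entries.

```python
import re

# Apache combined log format:
# host identity user [time] "method path protocol" status bytes "referrer" "agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None

# Made-up sample line, for illustration only.
sample = ('192.0.2.10 - - [27/Jan/2004:03:08:11 -0600] '
          '"GET /data/index.html HTTP/1.1" 200 5120 '
          '"http://library.usask.ca/" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"')

entry = parse_line(sample)
status_class = entry["status"][0] + "xx"  # 2xx success, 3xx redirect, 4xx/5xx errors
print(entry["host"], entry["path"], entry["status"], status_class, entry["referrer"])
```

Named groups keep the who/when/what split from the slides visible in the code, and a line that fails to match returns None rather than raising.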

Log analysis software

Produce summary tables, charts and graphs
Popular ones are:
– WebTrends (commercial, Windows, Unix)
– Analog (free, Unix, Windows)
– Wusage (free, Unix, Windows)
– Many more

Yahoo Log Analysis Tools > Titles

Sample: Top domains chart

Sample: Summary top files requested

Meaningful filenames rather than id=1232 help make this report understandable

What your logs can tell you, if you listen…

Specific areas where logs are useful
Specific examples

How visible are your links and menus?

Are you tuning your site? Is the new button or label working?

Is anyone clicking on the special announcement information?

Run a special report and see what links are used the most on your home page
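One way such a report could be produced, as a rough sketch: keep only the requests whose referrer is the home page and rank the pages they led to. The home page URL and the request data below are hypothetical.

```python
from collections import Counter

HOME = "http://www.example.edu/"  # hypothetical home page URL

# Hypothetical (referrer, requested path) pairs already parsed from a combined log.
requests = [
    ("http://www.example.edu/", "/departments/"),
    ("http://www.example.edu/", "/search/"),
    ("http://www.example.edu/", "/departments/"),
    ("http://www.example.edu/students/", "/admissions/"),
    ("-", "/search/"),
]

# Keep only requests that arrived from the home page, then rank the targets.
clicks = Counter(path for referrer, path in requests if referrer == HOME)
total = sum(clicks.values())
for path, count in clicks.most_common():
    print(f"{path:15} {count:3} {count / total:.0%}")
```

This only works with a log format that records the referrer, such as the combined format.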

Redesign of E-Journal page

Subject browse was #2.

Redesign of U of S home page

Help was removed.

Homepage Clickthrus: http://www.usask.ca/analog/homepage/

Redesign of U of S home page

1. Departments
2. Search
3. PAWS
4. Students
5. Admissions

Redesign of Health Sciences Library page

Home page clickthrus used to set priority order.

Before and after
Does the new top menu work?

ClickTracks [www.clicktracks.com] displays all the links on the page and the % of visitors that click on each.

Digging for evidence

Are people able to get from here to there?
Specific example

Evaluating a site-wide menu
– Trying to make the case that generic terms rather than “brand names” were more effective
– Team response was polite nods

Looked up how many people actually selected this area from the home page based on the brand name label rather than the generic term.

Possible because the links had different syntax

Tip: Add tracking code to the end of a link
http://library.usask.ca/data?top

Log file: - [27/Jan/2004:03:08:11 -0600] "GET /data?top
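A sketch of how tagged links could be tallied once the request paths are extracted from the log. The "top" tag is the one shown on the slide; "side" is an invented second tag, and the list of paths is made up.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical request paths taken from an access log. "top" comes from the
# slide; "side" is an invented tag for a second link to the same page.
paths = ["/data?top", "/data?side", "/data?side", "/data", "/data?side"]

tags = Counter()
for path in paths:
    parts = urlsplit(path)
    if parts.path == "/data":
        # The query string is the tracking tag; no tag means an untagged link.
        tags[parts.query or "(untagged)"] += 1

for tag, count in tags.most_common():
    print(tag, count)
```

Because every link variant resolves to the same page, the tag costs nothing for the visitor but lets the log distinguish which link was followed.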

Log files to the rescue

We discovered

A quick glance at the log file revealed that in the prior two days
– 200 accesses resulted from the brand name label
– 1000 accesses resulted from the generic term in a less prominent location

Where do you post announcements?

Need to get everyone’s attention
– Branch closure
– Pay fines now in order to convocate

Not everyone enters your site at the home page

Find the entry pages

Top entry pages
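A simplified sketch of an entry-page count. It treats each IP address as a single visit, which is an assumption; real reports usually split visits by a timeout as well. The request data is invented.

```python
from collections import Counter

# Hypothetical (visitor, path) pairs in time order. Each IP is treated as
# one visit here; real sessionizing would also apply an inactivity timeout.
requests = [
    ("192.0.2.1", "/"),
    ("192.0.2.2", "/ejournals/"),
    ("192.0.2.1", "/catalogue/"),
    ("192.0.2.3", "/ejournals/"),
    ("192.0.2.2", "/"),
]

first_page = {}
for visitor, path in requests:
    first_page.setdefault(visitor, path)  # keeps only the first page seen

entry_pages = Counter(first_page.values())
print(entry_pages.most_common())
```

Overwriting the stored page on every request instead of keeping the first one would give exit pages by the same approach.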

What’s hot and what’s not?

What areas or pages are popular?
How is it changing over time?
Popular may ≠ good
– Custom 404 pages are often #1 on a site with link rot
– High use may mean people are lost, if your site doesn’t have a followed link colour

Link rot?

http://www.bio.cornell.edu/stats/01/07/default_01_b.htm

Top directories

Popularity questions

What’s popular but shouldn’t be?
– Overdependence on site search may signal site navigation weaknesses

What should be popular but isn’t?
– If you expect high usage and it’s not happening, recheck links, labels and position. Is the link to underutilized area prominent? Is it plain language or jargon?

Does anyone care?

Are we posting new announcements and no one reads them (ever)?

Are the only hits from search engines spidering the site?
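A rough sketch of separating spider traffic from people by user agent. The list of substrings is an assumption and far from complete, and the hits are invented for illustration.

```python
# Substrings that commonly appear in crawler user-agent strings. This short
# list is an assumption for illustration and is far from complete.
ROBOT_HINTS = ("bot", "crawler", "spider", "slurp")

def is_robot(agent):
    """Guess whether a user-agent string belongs to a crawler."""
    agent = agent.lower()
    return any(hint in agent for hint in ROBOT_HINTS)

# Hypothetical (path, user agent) pairs from a combined log.
hits = [
    ("/news/closure.html", "Googlebot/2.1 (+http://www.google.com/bot.html)"),
    ("/news/closure.html", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"),
    ("/news/old.html", "Googlebot/2.1 (+http://www.google.com/bot.html)"),
]

human_pages = {path for path, agent in hits if not is_robot(agent)}
all_pages = {path for path, _ in hits}
robot_only = sorted(all_pages - human_pages)
print("Read only by robots:", robot_only)
```

Pages that show up only in the robot-only list are candidates for the "does anyone care?" conversation.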

What should we add more of?

Is a feature used?

After a debate, quick links and audience menus were added to the site.

Quick links – very popular #3 and #5

Audience menus

Over time the “student” option on the audience menus has increased

Getting down to the details

When can you move to CSS layouts?
When can you downgrade support for Netscape 4.78?

What web browsers do you need to support?

[Chart: Microsoft Explorer Browsers. Per cent of total hits (0 to 80) by Explorer version (1.x through 7.x), Mon 12/01/2003 - Wed 12/31/2003 (1 Month Scale)]

Cross platform testing

Table

Retrace someone’s footsteps

What page referred them to the library site?
– No referrer? Bookmark, typed in URL (or a robot)
What path did they follow?
– Sometimes even what link they clicked
– What data they may have typed in a search box?
Where did they leave?

Log analysis tools – “top paths”

http://www.bio.cornell.edu/stats/01/07/default_01_b.htm

A sad tale

Paths

An even sadder tale

Or a programmer doing debugging?

Follow the top paths

Pay attention to where they stopped and restarted

No direct links from one area to another may indicate they used their back button
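A small sketch of how a "top paths" report could be assembled: group requests by visitor in time order, then count identical trails. The visitors and pages are invented, and each visitor is assumed to make a single visit.

```python
from collections import Counter, defaultdict

# Hypothetical (visitor, path) pairs in time order, one visit per visitor.
requests = [
    ("a", "/"), ("b", "/"), ("a", "/ejournals/"),
    ("b", "/ejournals/"), ("c", "/"), ("c", "/catalogue/"),
    ("a", "/ejournals/a-z.html"), ("b", "/ejournals/a-z.html"),
]

# Rebuild each visitor's trail in the order the requests were logged.
trails = defaultdict(list)
for visitor, path in requests:
    trails[visitor].append(path)

# Count how many visitors followed exactly the same trail.
top_paths = Counter(" -> ".join(trail) for trail in trails.values())
for trail, count in top_paths.most_common():
    print(count, trail)
```

A trail that jumps between pages with no direct link between them is a hint that the back button, served from the browser cache, filled the gap.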

Error logs

Usually well used by development teams
Only touch on a few points

Error log captures

Date
Error level
Client IP address or hostname
Error message or path to requested file

[Wed Jan 28 00:15:26 2004] [error] [client 24.69.255.237] File does not exist: /data/www/northwest/images/spacer.gif, referer: http://library.usask.ca/northwest/contents.html
[Wed Jan 28 00:16:30 2004] [error] [client 66.77.73.89] File does not exist: /data/www/education/chldawrd.html
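The two sample entries above can be mined for broken links with a pair of regular expressions. This is a sketch that assumes the error log keeps the layout shown in those entries; other server versions or configurations may differ.

```python
import re

# [date] [level] [client address] message
ERROR_LINE = re.compile(
    r'\[(?P<date>[^\]]+)\] \[(?P<level>\w+)\] \[client (?P<client>[^\]]+)\] '
    r'(?P<message>.*)'
)
# "File does not exist" messages, with an optional referer at the end.
MISSING = re.compile(
    r'File does not exist: (?P<file>[^,]+)(?:, referer: (?P<referer>\S+))?'
)

lines = [
    "[Wed Jan 28 00:15:26 2004] [error] [client 24.69.255.237] File does not exist: "
    "/data/www/northwest/images/spacer.gif, referer: "
    "http://library.usask.ca/northwest/contents.html",
    "[Wed Jan 28 00:16:30 2004] [error] [client 66.77.73.89] File does not exist: "
    "/data/www/education/chldawrd.html",
]

broken = []
for line in lines:
    entry = ERROR_LINE.match(line)
    missing = MISSING.match(entry["message"]) if entry else None
    if missing:
        broken.append((missing["file"], missing["referer"]))

for path, referer in broken:
    # A referer points at the page that carries the broken link.
    print(path, "<-", referer or "(no referer)")
```

Entries with a referer tell you which page to fix; entries without one usually come from bookmarks, typed URLs or robots.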

Also log

Some types of authentication failures
Authentication problems may indicate a need to add:
– Directions – usernames are case sensitive
– Implement a password reminder feature

Redesign or launch of new service

Watch your log files in “real time” or every few seconds

tail -f Path to error_log file

For example on a UNIX server, use this command:

tail -f /usr/local/apache/logs/error_log

Site search and search logs

Rich source of data
Often underutilized
Tap into people’s expectations

Site search engine audit can help you to tune your search engine, web pages, and results.

Useful for finding out about:

Content synopsis
Site search performance
Clues about web site usability
Creating “best bets”

Search engines generate two data files

Robot logging
– What URLs
– What files
– How many terms are indexed

Search query log
– Similar to web server log
– Strength – generate term frequencies
– Weakness – most don’t show IF the user clicked any results
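A minimal sketch of generating term frequencies from a query log. The queries are invented, and a real log would carry timestamps and other fields that need stripping first.

```python
from collections import Counter

# Hypothetical raw queries, one per search.
queries = ["Staff", "staff ", "domain setup", "orientation", "staff", "Domain  Setup"]

def normalize(query):
    """Lower-case and collapse whitespace so trivial variants count together."""
    return " ".join(query.lower().split())

frequencies = Counter(normalize(q) for q in queries)
for query, count in frequencies.most_common(3):
    print(f"{count:5} {query}")
```

Normalizing before counting keeps "Staff" and "staff " from being reported as separate queries.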

Content synopsis

Discovery tool if many independent content developers and/or servers

Bird’s eye view of breadth and depth
The robot report can tell you:
– How many HTML pages?
– How many PDFs?
– How many unique words?
– How many bad links/URLs?

“Smarter” robots may report
– How many secure areas?
– Refresh rate – how many documents have changed

Sample of Swish-e directory report

Checking dir "/data/resources/usability/readersurvey_files"...
Checking dir "/data/resources/usability/templates"...
Checking dir "/data/resources/usability/ugasurvey"...
Checking dir "/data/software.purchased"...
Checking dir "/data/ssh"...
Checking dir "/data/staffsessionmaterial"...
Checking dir "/data/staffsessionmaterial/groupware.presentation"...
Checking dir "/data/surveys"...
Checking dir "/data/sysinfo"...
Checking dir "/data/sysinfo/CVS"...
Checking dir "/data/sysinfo/apps"...
Checking dir "/data/sysinfo/ntnetwork"...

Excerpt of Swish-e summary

Removing very common words... no words removed.
Writing main index... 23414 unique words indexed.
Writing file index... 752 files indexed.
Running time: 21 seconds.
Indexing done!

Site search performance: coverage

Log analysis is just one part of the assessment

Strength is the ability to see 1000’s of real queries

Identify and repeat the top 50, 100 or 200 queries

BBCi – BBC search engines

Receive hourly reports so they can track trends [1]

Tune results
– Space shuttle Columbia disaster
– Columbia the country

[1] http://www.currybet.net/articles/audiences/

Sample queries by frequency

Verity Ultraseek search term query report
– Site-wide and collection specific queries

Queries by Frequency Results from past 1 month

   14722 Total queries

     174 "staff"
     169 url:lights.ca, url:hr url:marketing url:sysinfo url:contact, || domain
     146 url:lights.ca, url:docs url:hr url:sysinfo url:contact, || stats
     139 "domain setup"
     105 url:lights.ca, url:docs url:hr url:sysinfo url:contact || orientation
      92 url:lights.ca, url:docs url:hr url:sysinfo url:contact, || "hot+spare"

Mining for gold

What did people expect to find?
– Unlike web server logs, we can capture users’ own words for what they are seeking
– Safely assume they had a “hope” that we’d have the information
– How well do we measure up?

Do we have content for the top 25 queries?
If not, why not? Should we add it?

“External” terms for U of S Library

Wild and wacky requests

Is the site search labeled clearly?
– Do people know where they are?
– Do they know what they’re searching?
– If they’re looking for MP3 downloads, Hawaiian vacations, and foot fetishes, then visual identity and a “sense of place” are needed to distinguish the site search from Internet search engines

Search terms

Book related queries

Lots of book titles and authors
– This may be okay – depends on your site content
– May mean they are lost and looking for the catalogue

Check your labeling
Guide them – Looking for books? Search the catalogue

Missing content

Are they looking for events?
Employee directory information
Subject pages (stored in a database and not in the site search)
– May be as simple as adding it to the robot
– May require a simple (or sophisticated) metasearch

Search and silos

Challenging to have people recognize what they are searching

Federated searching presents possibilities of breaking down silos

Specific example

Dozens of queries for Italian magazines

It is imperative that our sites reflect the language of our users and, in academia, the language of faculty.

A professor of Italian literature gave an assignment and sent his students to find “Italian magazines” on the library’s web site

Nowhere do the words “Italian magazines” appear on a web page

Site search performance: interface

Default settings for search page
Most searchers accept your defaults
How well do these mesh
– “anding” or “not anding” all keywords

Assessing performance

Look at your top 50 and 100 search queries and see if the defaults are helping or impeding good result sets

How often is there a likely looking item in the top 3 results? Is it relevant?

Is there a clear mismatch between the searcher’s terminology and the site terminology?

Is the best page ranking low due to its construction?
– Graphics, no title tag, etc.

Results may surprise you!

Library was a stop word.
Made a best bet.

U of S internal search engine report

“No matches found”

Dig deep
– Is it typographic or spelling errors? Does your search offer suggestions?
– Is it a language gap?
– Is it variant terms for the same thing?
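A sketch of a first pass over the "no matches" queries, using Python's standard difflib to separate likely spelling slips from everything else. The vocabulary, the failed queries and the 0.8 similarity cutoff are all assumptions for illustration.

```python
from difflib import get_close_matches

# Hypothetical vocabulary of terms that do appear on the site.
site_terms = ["catalogue", "databases", "interlibrary loan", "journals", "renewals"]

# Hypothetical queries that returned no matches.
failed_queries = ["catalog", "jurnals", "italian magazines"]

# For each failed query, look for one site term that is a near match.
suggested = {
    query: get_close_matches(query, site_terms, n=1, cutoff=0.8)
    for query in failed_queries
}

for query, matches in suggested.items():
    if matches:
        print(f"{query!r}: probable spelling variant of {matches[0]!r}")
    else:
        print(f"{query!r}: no close term; possibly a language gap or missing content")
```

Queries with no close term still need a human look: string similarity cannot tell a language gap from content the site simply does not have.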

Situational analysis
Where is the leaping off point for the search?
– Run a referrer report for your search page
– When do many users abandon browsing and head for search?

Turn to your web server logs
– Referring URL and then the search terms entered
– What was the page like just before they searched?
– Did it have the answer to the question?
– Was the choice on a menu but the vocabulary in the label different?

From browse to search

Usability testing shows that people move back and forth

P. Gremett found, when analyzing Amazon, that the majority of users browsed until the browsing areas became too busy or ambiguous, or lacked relevant content

P. Gremett. Utilizing a User’s Context to Improve Search Results. CHI 2003.

Typical response: blame the search engine

Reality check
– Garbage in, garbage out
– Use the terms our visitors are using
– Make sure pages are designed to rank effectively

IF the web team can change the content

Example: Recent site audit

More specific queries ranked lower than a single general term

– Employment
– Employment income
– Employment income occupation

It is counter-intuitive that the last phrase would rank higher on the first result set

From research to action

“Best bets”
– Richard Wiggins, Louis Rosenfeld, Martin Belam and others have written about the performance increase you can achieve with “Best Bets”
– Wiggins found that 50% of searches could be matched to 1000 unique queries
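The same kind of coverage figure can be estimated for your own site from query frequencies. This is a sketch with invented counts; a real log would have thousands of distinct queries.

```python
from collections import Counter

# Hypothetical query counts taken from a search log.
counts = Counter({"library": 400, "staff": 250, "parking": 150,
                  "hours": 100, "fines": 60, "mp3": 40})

def coverage(counts, top_n):
    """Share of all searches accounted for by the top_n most frequent queries."""
    total = sum(counts.values())
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / total

for n in (1, 2, 3):
    print(f"top {n}: {coverage(counts, n):.0%} of searches")
```

Plotting coverage against the number of queries shows how many best bets are worth maintaining before the returns flatten out.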

What’s a “best bet”?

Rather than relying on the search engine to rank, human editors designate one or more sites as the best bet for the top 20, 200 or 500 terms

The “best bet” appears at the top of the result set
– Example: a search for library brings up the Library home page, not C programming libraries or the library budget.
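A minimal sketch of the idea, assuming the search engine returns a ranked list of URLs. The lookup table, function name and example.edu URLs are hypothetical; real search products expose best bets through their own configuration.

```python
# Hypothetical editor-maintained table: normalized query -> best-bet URLs.
BEST_BETS = {
    "library": ["http://library.usask.ca/"],
}

def search_with_best_bets(query, engine_results):
    """Put editor-chosen pages ahead of the engine's own ranking."""
    key = " ".join(query.lower().split())
    bets = BEST_BETS.get(key, [])
    # Drop duplicates so a best bet is not listed twice.
    rest = [url for url in engine_results if url not in bets]
    return bets + rest

# Stand-in for ranked results returned by a search engine (invented URLs).
ranked = ["http://www.example.edu/cs/c-libraries.html",
          "http://library.usask.ca/",
          "http://www.example.edu/budget/library.html"]

print(search_with_best_bets(" Library", ranked))
```

Queries without an entry in the table fall through to the engine's ranking unchanged, so editors only need to cover the most frequent terms.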

BBC – “Columbia”

BBC – “library”

BBCi Web – “Mars”

In summary

Web logs and search logs
– Rich sources of information
– Give us clues and help improve our sites
– Fast and easy and readily at hand

Help you create new value added services

Log data

Best used as part of a redesign process that includes web site audits, usability testing and log analysis

Questions

Darlene Fichter
– [email protected]
– library.usask.ca/~fichter/