How accurate are IR usage statistics?

Post on 11-Apr-2017


UCD Library (Leabharlann UCD)

University College Dublin, Belfield, Dublin 4, Ireland

Joseph Greene
Research Repository Librarian
University College Dublin
joseph.greene@ucd.ie
http://researchrepository.ucd.ie

How accurate are IR usage statistics?

Open Repositories 2016, Dublin, 16 June

Usage statistics are important for OA repositories

• How is the service used overall?
• Advocacy
– Connects with authors on what is most important to them: the use of their research
• KPI for return on investment
– Usage of a Library service
– Visibility of university’s research

Monthly email sent to all depositors

Infographic distributed semi-annually by College Liaison Librarians

How accurate are they? Web robots

• Some follow rules
– Search engines, Internet Archive, link checkers, Twitterbot, etc.
– robots.txt, naming themselves in the user agent string
• Others do not
– Email spammers, comment spammers, dictionary attackers, phishers, etc.
– Often mimic human users
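The first group of robots can be caught simply because they name themselves. A minimal sketch of that check follows; the pattern list and the function name `declares_itself_robot` are hypothetical and chosen for illustration, not taken from any repository platform.

```python
import re

# Hypothetical pattern list for illustration only; production systems
# rely on maintained lists of known robot user agents.
SELF_DECLARED_ROBOT = re.compile(
    r"bot|crawler|spider|archiver|link ?check", re.IGNORECASE
)


def declares_itself_robot(user_agent: str) -> bool:
    """Return True when the user agent string names itself as a robot."""
    return bool(SELF_DECLARED_ROBOT.search(user_agent or ""))


examples = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Twitterbot/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
]
flags = [declares_itself_robot(ua) for ua in examples]
```

A robot that copies a browser's user agent string, as the second group often does, passes this check untouched, which is why user agent matching alone is not enough.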

Experimental study

• Simple random sample of 2 years of UCD repository’s download data
– n=341, N=3.3 million; 96.20% certainty
• Manually checked to determine if robot or human
• Compared findings against our robot detection technique
– U. Minho DSpace Stats Add-on
– Monthly outlier exclusion (manual)

Greene, J. Web robot detection in scholarly Open Access institutional repositories. Library Hi Tech, July 2016
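Drawing a simple random sample of this size can be sketched as below. This is an illustration under assumptions, not the procedure used in the study: the download records are assumed to be addressable by row number, and the seed and the helper name `draw_sample` are hypothetical.

```python
import random

POPULATION_SIZE = 3_300_000  # N, approximate number of download records
SAMPLE_SIZE = 341            # n, records to check by hand


def draw_sample(population_size: int, sample_size: int, seed: int) -> list[int]:
    """Pick row numbers uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample can be redrawn
    return sorted(rng.sample(range(population_size), sample_size))


# Row numbers of the download records to inspect manually.
sample_rows = draw_sample(POPULATION_SIZE, SAMPLE_SIZE, seed=2016)
```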

First finding

85% of the Research Repository UCD’s unfiltered downloads come from robots
• This is confirmed by a 2013 IRUS-UK white paper on 20 IRs, which also found 85% of downloads to be from robots

Catching more robots improves stats (but how much depends on the number of robots)

[Figure: accuracy of download stats (inverse precision) plotted against recall (robots), both on a scale of 0 to 1. Curves: typical website, 15% robot traffic; OA journal, 40% robot; Internet Archive, 91% robot; OA repositories, 85% robot. Annotations: "Get better stats", "Catch more robots".]
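The shape of these curves follows from the definitions. As a sketch, write ρ for the share of all downloads made by robots, R for recall and P for precision (the symbols are introduced here, not on the slide), and express everything as a fraction of all downloads:

```latex
\begin{align*}
TP &= R\rho, & FN &= (1-R)\rho,\\
FP &= TP\,\frac{1-P}{P}, & TN &= (1-\rho) - FP,\\
\text{inverse precision} &= \frac{TN}{TN + FN}.
\end{align*}
% Special case of a detector that never mislabels a human (P = 1, so FP = 0):
\[
\text{inverse precision} = \frac{1-\rho}{1 - R\rho}.
\]
```

In the P = 1 case, catching 90% of robots on a site with 85% robot traffic gives 0.15 / 0.235, or about 0.64, while the same recall on a site with 15% robot traffic gives 0.85 / 0.865, or about 0.98. The more robot-heavy the traffic, the more each missed robot costs.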

How did we do at UCD?

• What proportion of robot downloads did we catch? (Recall)
– Our method catches 94% of all robots
• How often were we correct -- how many are actually human? (Precision)
– 98.9% of downloads that we label robots really are robots
• How accurate are the download stats -- how many are actually made by human beings? (Inverse precision)
– 73% of the download statistics as reported are human
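The three measures can be written down from the four possible outcomes of labelling a download. The sketch below uses hypothetical counts chosen for easy arithmetic, not the study's data, and the class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    robot_flagged: int  # robots labelled as robots (true positives)
    robot_missed: int   # robots left in the download stats (false negatives)
    human_flagged: int  # humans wrongly labelled as robots (false positives)
    human_kept: int     # humans left in the download stats (true negatives)

    def recall(self) -> float:
        """Share of all robot downloads that were caught."""
        return self.robot_flagged / (self.robot_flagged + self.robot_missed)

    def precision(self) -> float:
        """Share of downloads labelled robot that really are robots."""
        return self.robot_flagged / (self.robot_flagged + self.human_flagged)

    def inverse_precision(self) -> float:
        """Share of the reported downloads that were made by humans."""
        return self.human_kept / (self.human_kept + self.robot_missed)


# Hypothetical counts for illustration only.
counts = Confusion(robot_flagged=90, robot_missed=10, human_flagged=2, human_kept=18)
summary = (counts.recall(), counts.precision(), counts.inverse_precision())
```

With these made-up counts recall is 90 / 100, precision is 90 / 92 and inverse precision is 18 / 28: a detector can look strong on the first two and still leave many robots in the published figures.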

How does that compare?

• Who knows? There are no other studies like this on repositories!

• Applied DSpace's and EPrints' web robot detection algorithms to our data
– Experimental
– Real data
– Same dataset used for each ‘system’
– Algorithms easy to mimic in vitro
– But SEO, crawl behaviour may be different for different systems

Robot detection techniques used

Systems compared: DSpace, EPrints, Minho DSpace Statistics Add-on

Rate of requests: ✓ [3]
User agent string: ✓ ✓ ✓
robots.txt access: ✓
Volume of requests: ✓ [2], ✓ [3]
List of known robot IP addresses: ✓ ✓
Reverse DNS name lookup: ✓ [1]
Trap file: ✓
User agents per IP address: (none)
Width of traversal in the URL space: ✓ [3]

[1] Only implemented nominally or experimentally
[2] Via the repeat download or ‘double-click’ filter
[3] Data available as a configurable report for manual decision making
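As one example of the techniques listed, a volume-of-requests check can be sketched as a per-IP daily count compared against a cut-off. This is not the implementation of DSpace, EPrints or the Minho add-on; the threshold, the log format and the function name `flag_heavy_ips` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical cut-off; a real value would be tuned against local traffic.
MAX_DOWNLOADS_PER_IP_PER_DAY = 100


def flag_heavy_ips(log, threshold=MAX_DOWNLOADS_PER_IP_PER_DAY):
    """Return the (ip, date) pairs whose download count exceeds the threshold.

    `log` is an iterable of (ip, date) tuples, one per download.
    """
    counts = Counter(log)
    return {key for key, n in counts.items() if n > threshold}


# Made-up log using documentation-only IP ranges.
log = [("203.0.113.7", "2016-06-01")] * 150 + [("198.51.100.4", "2016-06-01")] * 3
heavy = flag_heavy_ips(log)
```

A fixed cut-off like this is blunt: a shared campus proxy can exceed it legitimately, which is one reason to expose the counts as a report for manual decision making rather than filter automatically.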

Results

Robots detected (Recall)
DSpace: 0.897
EPrints: 0.911
Minho (no manual outlier checking): 0.890
Minho plus monthly manual checking (UCD): 0.942

Accuracy of detection (Precision)
DSpace: 1.000
EPrints: 0.940
Minho (no manual outlier checking): 0.989
Minho plus monthly manual checking (UCD): 0.989

Accuracy of download stats (Inverse precision)
DSpace: 0.620
EPrints: 0.552
Minho (no manual outlier checking): 0.590
Minho plus monthly manual checking (UCD): 0.730
Without filtration: 0.144

I.e. 38% of DSpace’s reported downloads are made by robots, etc.
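The three charts are linked: given the share of robot traffic, inverse precision follows from recall and precision. The sketch below recomputes it from the recall and precision values shown in the charts, assuming the robot share is 1 minus the "without filtration" figure of 0.144 and that all values come from the same sample. The function name and the dictionary layout are illustrative.

```python
def inverse_precision(robot_share: float, recall: float, precision: float) -> float:
    """Share of reported (unfiltered-out) downloads that are human."""
    tp = recall * robot_share               # robots caught
    fn = robot_share - tp                   # robots left in the stats
    fp = tp * (1 - precision) / precision   # humans wrongly removed
    tn = (1 - robot_share) - fp             # humans left in the stats
    return tn / (tn + fn)


# Assumption: robot share taken from the "without filtration" bar.
ROBOT_SHARE = 1 - 0.144

# (recall, precision, inverse precision as reported in the charts)
reported = {
    "DSpace": (0.897, 1.000, 0.620),
    "EPrints": (0.911, 0.940, 0.552),
    "Minho (no manual outlier checking)": (0.890, 0.989, 0.590),
    "Minho plus monthly manual checking (UCD)": (0.942, 0.989, 0.730),
}

derived = {
    name: inverse_precision(ROBOT_SHARE, recall, precision)
    for name, (recall, precision, _) in reported.items()
}
for name, value in derived.items():
    print(f"{name}: derived {value:.3f}, reported {reported[name][2]:.3f}")
```

Under these assumptions the derived values should land close to the reported ones, with small differences attributable to rounding in the chart labels.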

Robot detection in OA IR systems

[Figure: recall, precision and negative precision (accuracy of download stats), on a scale of 0 to 1, for DSpace, EPrints, Minho, Minho with monthly manual checking (UCD), and no robot detection.]

Thank you!
