aggregation and analysis of usage data: a structural, quantitative perspective johan bollen

49
Quick TIFF are ne Research Library, Los Alamos National Laboratory @ November 2007, NISO - Dallas, TX Digital Library Research & Prototyping Team QuickTime™ TIFF (Uncompr are needed t Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen Digital Library Research & Prototyping Team Los Alamos National Laboratory - Research Library jbollen@lanl . gov Acknowledgements: Herbert Van de Sompel (LANL), Marko A. Rodriguez (LANL), Lyudmila L. Balakireva (LANL) Wenzhong Zhao (LANL), Aric Hagberg (LANL) Research supported by the Andrew W. Mellon Foundation. NISO Usage Data Forum • November 1-2, 2007 • Dallas, TX

Upload: sally

Post on 10-Jan-2016

22 views

Category:

Documents


0 download

DESCRIPTION

Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen Digital Library Research & Prototyping Team Los Alamos National Laboratory - Research Library [email protected] Acknowledgements: - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Aggregation and Analysis of Usage Data:

A Structural, Quantitative Perspective

Johan BollenDigital Library Research & Prototyping Team

Los Alamos National Laboratory - Research Library

[email protected]

Acknowledgements:

Herbert Van de Sompel (LANL), Marko A. Rodriguez (LANL), Lyudmila L. Balakireva (LANL)

Wenzhong Zhao (LANL), Aric Hagberg (LANL)

Research supported by the Andrew W. Mellon Foundation.

NISO Usage Data Forum • November 1-2, 2007 • Dallas, TX

Page 2: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Batgirl:“Who’s going to be batman now?”

Page 3: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Usage data has arrived.

Value of usage data/statistics is undeniable:• Business intelligence• Scholarly assessment• Monitoring of scholarly trends• Enhanced end-user services

Production chain:• Recording• Aggregation• Analysis

Challenges and opportunities at all links in chain.

But of interest to this workshop in particular:• Recording: data models, standardization• Aggregation: restricted sampling vs. representativeness• Analysis: frequentist vs. structural

This presentation: overview of lessons learned by MESUR project at LANL.

Technical project details later this afternoon.

Page 4: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Structure of presentation.

1) MESUR: introduction and overview

2) Aggregation: value proposition

3) Data models: 1) Usage data representation

2) Aggregation frameworks

4) Usage data providers: technical and socio-cultural issues, a survey

5) Sampling strategies: theoretical issues

6) Discussion: way forward, roadmap

Page 5: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Structure of presentation.

1) MESUR: introduction and overview

2) Aggregation: value proposition

3) Data models: 1) Usage data representation

2) Aggregation frameworks

4) Usage data providers: technical and socio-cultural issues, a survey

5) Sampling strategies: theoretical issues

6) Discussion: way forward, roadmap

Page 6: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

The MESUR project.

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Johan Bollen (LANL): Principal investigator.Herbert Van de Sompel (LANL): Architectural consultant.Aric Hagberg (LANL): Mathematical and statistical consultant.Marko Rodriguez (LANL): PhD student (Computer Science, UCSC).Lyudmila Balakireva (LANL): Database management and development.Wenzhong Zhao (LANL): Data processing, normalization and ingestion.

“The Andrew W. Mellon Foundation has awarded a grant to Los Alamos National Laboratory (LANL) in support of a two-year project that will investigate metrics derived from the network-based usage of scholarly information. The Digital Library Research & Prototyping Team of the LANL Research Library will carry out the project. The project's major objective is enriching the toolkit used for the assessment of the impact of scholarly communication items, and hence of scholars, with metrics that derive from usage data.”

Page 7: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Project data flow and work plan.

1 2 3 4a 4b

negotiation aggregation ingestion referencedata set

metricssurvey

Page 8: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Project timeline.

We are here!

Page 9: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Lecture focus: MESUR lessons learned

Page 10: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Structure of presentation.

1) MESUR: introduction and overview

2) Aggregation: value proposition

3) Data models: 1) Usage data representation

2) Aggregation frameworks

4) Usage data providers: technical and socio-cultural issues, a survey

5) Sampling strategies: theoretical issues

6) Discussion: way forward, roadmap

Page 11: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Lesson 1: aggregation adds significant value.

Focus on institution-oriented services and analysis

Multiple institutions, each with focus on institution-oriented services and analysis

Aggregating usage data provides:1. Validation: reliability and span2. Perspective: group-based services and analysis

Note: MESUR=not services

Page 12: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Aggregation value: validation

Aggregating usage data provides:1. Reliability: overlaps and intersections

• Validation: error checking• Reliability: repeated

measurements2. Span:

• Diversity: multiple communities• Representativeness: sample of

scholarly community

Page 13: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Aggregation value: convergence.

CSU Usage PR IF (2003) Title (abbv.)

1 78.565 21.455 JAMA-J AM MED ASSOC

2 71.414 29.781 SCIENCE

3 60.373 30.979 NATURE

4 40.828 3.779 J AM ACAD CHILD PSY

5 39.708 7.157 AM J PSYCHIAT

LANL Usage PR IF (2003) Title (abbv.)

1 60.196 7.035 PHYS REV LETT

2 37.568 2.950 J CHEM PHYS

3 34.618 1.179 J NUCL MATER

4 31.132 2.202 PHYS REV E

5 30.441 2.171 J APPL PHYS

MSR Usage PR IF (2005) Title (abbv.)

1 15.830 30.927 SCIENCE

2 15.167 29.273 NATURE

3 12.798 10.231 PNAS

4 10.131 0.402 LECT NOTES COMP SCI

5 8.409 5.854 J BIOL CHEM

Convergence!• Open research questions:

o Is this guaranteed?o To what? A common-baseline?

• What we do know: o Institutional perspective can be

contrasted to baseline.o As aggregation increases in size,

so does value.

Page 14: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

We have COUNTER/SUSHI. How about the aggregation of item-level usage data?

If there is value in aggregating COUNTER and other reports, there isconsiderable value in aggregating item-level usage data.

Page 15: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Structure of presentation.

1) MESUR: introduction and overview

2) Aggregation: value proposition

3) Data models: 1) Usage data representation

2) Aggregation frameworks

4) Usage data providers: technical and socio-cultural issues, a survey

5) Sampling strategies: theoretical issues

6) Discussion: way forward, roadmap

Page 16: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Lesson 2: We need a standardized representation framework for item-level usage data.

MESUR: Ad hoc parsing is highly problematic- field semantics- field relations- various data models

Framework objectives: 1. Minimize data loss:

1. Preserve event info2. Preserve sequence info3. Preserve document metadata

2. “realistic”: scalability and granularity.3. Apply to variety of usage data, i.e. no inherent

bias towards specific type of usage data

Page 17: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Requirements for usage data representation framework.

Implications1. Sequence: session ID and date/time preserves sequence2. Privacy: session ID groups events not by user ID3. Request types: filter on types of usage

Needs to minimally represent following concepts:1. Event ID: distinguish usage events2. Referent ID: DOI, SICI, metadata3. User/Session ID: define groups of events related by user4. Date and time ID: identify data and time of event5. Request types: identiby type of request issued by user

Page 18: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

COUNTER reports: information loss

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

From: www.niso.org/presentations/MEC06-03-Shepherd.pdf

1. Event ID: distinguish usage events ?2. Referent ID: DOI, SICI, metadata3. User/Session ID: define groups of events related by user?4. Date and time ID: identify data and time of event5. Request types: identiby type of request issued by user

Page 19: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

From same usage data: journal usage graphs

journal1

journal2

Page 20: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

From same data: article usage graphs

Page 21: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Usage graphs connect this domain to50 years of network science

Heer (2005) - Large-Scale Online Social Network Visualization

• social network analysis

• small world graphs

• network science

• graph theory

• social modeling

Good reads:•Barabasi (2003) Linked.•Wasserman (1994). Social network analysis.

Page 22: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Structure of presentation.

1) MESUR: introduction and overview

2) Aggregation: value proposition

3) Data models: 1) Usage data representation

2) Aggregation frameworks

4) Usage data providers: technical and socio-cultural issues, a survey

5) Sampling strategies: theoretical issues

6) Discussion: way forward, roadmap

Page 23: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Standardization objectives similar to work done for COUNTER and SUSHI:

1. Serialization (~COUNTER): - standard to serialize usage data- Suitable for large-scale, open aggregation- Event provenance and identification

2. Transfer protocol (~SUSHI):- Communication of usage data between log archive and

aggregator- Allow open aggregation across stakeholders in scholarly

community3. Privacy standard:

- Standards need to address privacy concerns- Should allow emergence of trusted intermediaries: aggregation

ecology

Lesson 3: aggregating item-level usage data requires standardized aggregation framework.

LANL has made proposals based on community standards

Page 24: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

OpenURL ContextObject to represent usage data

Referent * identifier * metadata

Requester* User or user proxy: IP, session, …

ServiceType

<?xml version=“1.0” encoding=“UTF-8”?><ctx:context-object timestamp=“2005-06-01T10:22:33Z” … identifier=“urn:UUID:58f202ac-22cf-11d1-b12d-002035b29062” …>…<ctx:referent> <ctx:identifier>info:pmid/12572533</ctx:identifier> <ctx:metadata-by-val> <ctx:format>info:ofi/fmt:xml:xsd:journal</ctx:format> <ctx:metadata> <jou:journal xmlns:jou=“info:ofi/fmt:xml:xsd:journal”> … <jou:atitle>Isolation of common receptor for coxsackie B … <jou:jtitle>Science</jou:jtitle>…</ctx:referent>… <ctx:requester> <ctx:identifier>urn:ip:63.236.2.100</ctx:identifier> </ctx:requester>… <ctx:service-type> … <full-text>yes</full-text> … </ctx:service-type> … Resolver… Referrer… ….</ctx:context-object>

Resolver:* identifier of linking server

Event information: * event datetime * globally unique event ID

Page 25: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Aggregation framework: existing standards

Aggregatedlogs

Log DB

Serialization:OpenURL

ContextObjects

LogRepository 1

Logharvester

COCOCO

COCOCO

COCOCO

AggregatedUsage Data

Johan Bollen and Herbert Van de Sompel. An Architecture for the aggregation and analysis of scholarly usage data. In Joint Conference on Digital Libraries (JCDL2006), pages 298-307, June 2006.

LogRepository 2

LogRepository 3 Tranfer protocol:

OAI-PMH

Page 26: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

The issue of anonymization

Privacy and anonymization concerns play on multiple levels that standard needs to address:

1. Institutions: where was usage data recorded?2. Providers: who provided usage data?3. Users: who is the user?

- Goes beyond “naming” and masking identity: simple statistics can reveal identity

- User identity can be inferred from activity pattern (AOL search data)

- Law enforcement issues

MESUR:1. Session ID: preserve sequence without any references to

individual users2. Negotiated filtering of usage data

Page 27: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Structure of presentation.

1) MESUR: introduction and overview

2) Aggregation: value proposition

3) Data models: 1) Usage data representation

2) Aggregation frameworks

4) Usage data providers: technical and socio-cultural issues, a survey

5) Sampling strategies: theoretical issues

6) Discussion: way forward, roadmap

Page 28: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Lesson 4: socio-cultural issues matter.

BUT:• 12 months• +550 email messages• Countless teleconferences

Correspondence reveals socio-cultural issues for each class of providers

o Privacyo Business modelso Ownershipo World-view

Study by means of email term analysis of correspondence

o Term frequencies: tag cloudso Term co-occurrences: concept networks

MESUR’s efforts

Negotiated with 1. Publishers2. Aggregators3. Institutions

Results:• +1B usage events obtained• Usage data provided

ranging from 100M to 500K• Spanning multiple

communities and collections

Page 29: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

The world according to usage data providers:institutions

Note: strong focus on technical matters related to sharing SFX data

Page 30: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

The world according to usage data providers:institutions

Page 31: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

The world according to usage data providers:publishers

Note: strong focus on procedural matters related to agreements and project objectives

Page 32: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

The world according publishers

Page 33: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

The world according to usage data providers:aggregators

Note: strong focus on procedural matters and project/research objectives

Page 34: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

The world according to usage data providers:aggregators

Page 35: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

The world according to usage data providers:Open Access Publishers

Note: strong focus on technical matters related to sharing of data

Page 36: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

The world according to usage data providers:Open Access publishers

Page 37: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Summary diagram of provider concerns

Page 38: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Lesson 5: size isn’t everything

“Return on investment” analysis• Value generated relative to investment• Investment = correspondence• Value = usage events loaded

Extracted from each provider correspondence with MESUR:• Investment:

1. Delay between first and last email2. Number of emails3. Number of bytes in emails4. Timelines of email “intensity”

• Value:o Number of usage events loadedo (bytes won’t work: different formats)

MESUR’s efforts:• 12 months• +550 email messages

Good results but large variation:• Usage data provided

ranging from 100M to 500K

Page 39: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Return on investment (ROI)

Days Emails Bytes Events (M) Type

1 167 47 39858 149.5 A

2 6 38 18185 22.5 OAP

3 178 93 53380 25.3 P

4 126 82 43427 15.1 I

5 190 43 29304 5.6 A

6 55 70 34209 3.4 I

7 114 41 39410 1.0 A

ROI = value / work value = usage events loadedwork = 1) days between first contact

2) emails sent3) bytes communicated

Page 40: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Return on investment (ROI)

Type Asked Positive Negative % pos

Publishers 14 8 6 57%

Aggregators 3 3 0 100%

Institutions 4 4 0 100%

Totals 21 14 7

ROI = value / work Value = usage data obtainedWork = asking

Survivor bias! Another way to look at it:

Page 41: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Return on investment: ETA?

All

Publisher

Aggregator

Page 42: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Return on investment: timelines

Institution

Note: same pattern, but much less traffic.

Page 43: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Structure of presentation.

1) MESUR: introduction and overview

2) Aggregation: value proposition

3) Data models: 1) Usage data representation

2) Aggregation frameworks

4) Usage data providers: technical and socio-cultural issues, a survey

5) Sampling strategies: theoretical issues

6) Discussion: way forward, roadmap

Page 44: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Lesson 6: Sampling is (seriously) difficult

• Usage data = sample• Sample of whom?

o Different communitieso Different collectionso Different interfaces

• Who to aggregate from?o Different providerso Interfaceso Systemso Representation frameworks

• Result = difficult choices

Page 45: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

A tapestry of usage data providers:

Main players:• Individual institutions • Aggregators• Publishers

We will hear from several in this workshop

Each represent different, and possibly overlapping, samples of the scholarly community.

Institutions:• Institutional communities• Many collectionsAggregators:• Many communities• Many collectionsPublishers:• Many communities• Publisher collection

Page 46: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

A tapestry of usage data providers

* A: Technicalo Aggregationo Normalizationo Deduplication

MESUR’s research

MESUR: significant project overhead can be reduced by standards

Unresolved. However, MESUR lessons may elucidate.

* C: Socio-culturaloWorld viewsoConcernsoBusiness models

* B: TheoreticaloWho: Sample characteristics?oWhat: resulting aggregate?oWhy: sampling objective?

Which providers to choose? Different problems:

Page 47: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Structure of presentation.

1) MESUR: introduction and overview

2) Aggregation: value proposition

3) Data models: 1) Usage data representation

2) Aggregation frameworks

4) Usage data providers: technical and socio-cultural issues, a survey

5) Sampling strategies: theoretical issues

6) Discussion: way forward, roadmap

Page 48: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

ConclusionThere exists considerable interest in aggregating item-level usage data.This requires standardization on multiple levels:1) Recording and processing

• Field semantics: what do the various recorded usage items mean?• Requests: standardize on request types and semantics? • Standardized representation needs to be scalable yet minimize information loss

2) Aggregation:• Serialization: standardized serialization for resulting usage data• Transfer: standard protocol for exposing and harvesting of usage data• Privacy: protect identity and rights of providers, institutions and users

Although technical issues can be resolved:• Socio-cultural issues:

1) Worldviews, business models and ownership shape aggregation strategies.2) Greatly improved by existing standards and infrastructure for usage recording, processing and aggregating

• Theoretical questions of sampling: MESUR’s research project will provide insights.

Page 49: Aggregation and Analysis of Usage Data: A Structural, Quantitative Perspective Johan Bollen

QuickTime™ and aTIFF (LZW) decompressorare needed to see this picture.Research Library, Los Alamos National Laboratory

@ November 2007, NISO - Dallas, TX

Digital Library Research & Prototyping Team

QuickTime™ and aTIFF (Uncompressed) decompressorare needed to see this picture.

Some relevant publications.

Marko A. Rodriguez, Johan Bollen and Herbert Van de Sompel. A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage, In Proceedings of the Joint Conference on Digital Libraries, Vancouver, June 2007

Johan Bollen and Herbert Van de Sompel. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. Journal of the American Society for Information Science and technology, 59(1), pages 001-014 (cs.DL/0610154).

Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly usage data. In Joint Conference on Digital Libraries (JCDL2006), pages 298-307, June 2006.

Johan Bollen and Herbert Van de Sompel. Mapping the structure of science through usage. Scientometrics, 69(2), 2006.

Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3), December 2006 (arxiv.org:cs.DL/0601030)

Johan Bollen, Herbert Van de Sompel, Joan Smith, and Rick Luce. Toward alternative metrics of journal impact: a comparison of download and citation data. Information Processing and Management, 41(6):1419-1440, 2005.