Page 1: Crushing, Blending, and Stretching Data

Crushing, Blending, and Stretching Data

Data Warehousing and Mining Data from Library and University Information Systems for Assessment of Library Operations: A Case Study in Progress

Ecole des sciences de l'information, Rabat, Morocco, Monday, April 13, 2009

Ray Schwartz, Systems Specialist Librarian

Cheng Library, William Paterson University, Wayne, New Jersey, USA

schwartzr2 @ wpunj.edu

Page 2: Crushing, Blending, and Stretching Data

Outline

• Why Assessment and Why Now?
• What is Data Mining and Data Warehousing and Why Do We Do It?
• Our Library and University
• Groups and Services
• Steps
• Reporting

Page 3: Crushing, Blending, and Stretching Data

Have We Always Assessed?

• Anecdotally: Yes.
• Systematically: Not usually.
– Large-scale assessment of manual systems (such as serials check-in, card catalogs, and circulation files) is not practical.
– Smaller-scale, directed assessment is possible.

Page 4: Crushing, Blending, and Stretching Data

What changed since the days of manual systems?

Page 5: Crushing, Blending, and Stretching Data

• For many institutions in the West, the Integrated Library System (ILS) has been in use for over 20 years.
• Larger scale assessment is now possible with the electronic systems.
– Counts of circulation transactions
– Fund codes for purchases of library materials
• Reports from vendor services
– Bibliographic utilities
– Subscription agents
– Book jobbers

Page 8: Crushing, Blending, and Stretching Data

What is different now?

• New services have come into existence.
– Inside libraries
  • Full-Text Databases
  • Link Resolvers
– Outside of libraries
  • Google
  • Amazon

Page 10: Crushing, Blending, and Stretching Data

What is Data Mining and Data Warehousing?

• Extracting data from legacy systems and other resources;
• cleaning, scrubbing and preparing data for decision support;
• maintaining data in appropriate data stores;
• accessing and analysing data using a variety of end user tools;
• and mining data for significant relationships.

• Chaffey, D., Mayer, R., Johnston, K., & Ellis-Chadwick, F. (2002). Internet Marketing: Strategy, Implementation and Practice (2nd ed.). Financial Times/ Prentice Hall.

Page 11: Crushing, Blending, and Stretching Data

• The primary purpose of these efforts is to provide easy access to specifically prepared data that can be used with decision support applications such as management reports, queries, decision support systems, executive information systems and data mining.

• Chaffey, D., Mayer, R., Johnston, K., & Ellis-Chadwick, F. (2002). Internet Marketing: Strategy, Implementation and Practice (2nd ed.). Financial Times/ Prentice Hall.

Page 12: Crushing, Blending, and Stretching Data

Of course there are many ways to measure

– Scott Nicholson’s Measurement Model

Page 13: Crushing, Blending, and Stretching Data

Measurement Matrix with methodologies

Perspective | Topic: Library System | Topic: Use
Internal (Library System) | Procedures and Standards: staff survey and interviews; audits of collections, systems, or staff | Recorded interactions with interface & materials: bibliomining; transaction/web log analysis; observation of user behavior
External (User) | Usability: effectiveness of the system for the staff and institution | Knowledge states and user citations to materials: How useful is the library system?; focus groups, user citation tracking

Nicholson, Scott (2004). A Conceptual framework for the holistic measurement and cumulative evaluation of library services. Journal of Documentation 60(2), pp. 164-181.

Page 14: Crushing, Blending, and Stretching Data

Our University

• 9000 undergraduates
• 1000 graduates (mostly education majors)
• 400 faculty
• 800 adjuncts
• 1000 staff

Page 15: Crushing, Blending, and Stretching Data

Our Library

• 19 librarians and 26 library staff
• 350,000 volumes
• 18,000 audiovisual items
• 22,000 print and electronic periodicals
• 100 general and subject specific databases

Page 16: Crushing, Blending, and Stretching Data

Our Systems since 2005

• Voyager ILS
• Online Periodical Database (OPD)
• Clio ILL Software
• EZProxy Server
• Banner – University ERP
• University Networked Drive K:
• University Email Server
• University Web Server

Page 17: Crushing, Blending, and Stretching Data

Systems Chart – ca. 2005

[Diagram not reproduced. Labels recoverable from the chart: Integrated Library System (Voyager: Patrons, Searches, Circulation, Media Scheduling, Web Server); Online Periodicals Database (DBMS, Scripting Language); Banner (University ERP System: SIS, HRS); Proxy Server (EZProxy Log: Off Campus Dbase Hits & ILL Form); University Networked Drive K: (ILL Cliodata: Patrons, Materials); University Email Server; www.wpunj.edu Web Server (Scripting Language; ILL Form Page, ER Micro Form, Serials Form); OCLC Bibliographic Utility (WorldCat, ILL); Serials Solutions A to Z; Other Vendors' Database Services & Usage Reports. Legend: Current Relationships; Internal only WPUNJ Server; Externally accessible WPUNJ Server; Non-WPUNJ Server.]

Page 18: Crushing, Blending, and Stretching Data

Vendor Services

• Serials Solutions
• OCLC – Bibliographic Utility
• Blackwell – Book Jobber
• Ebsco – Subscription Agent
• Marcive – Authority Control
• Database Vendors

Page 19: Crushing, Blending, and Stretching Data

The Question

Which categories of patrons are accessing which services?

Page 20: Crushing, Blending, and Stretching Data

First Step – Patron Statistical Categories

Page 21: Crushing, Blending, and Stretching Data

• Voyager Patron Database allows a maximum of 10 statistical categories per patron record.

• Decide which statistical categories are needed for each patron group defined.

• Work with your University Information Systems Department to extract the relevant data from the relevant sources.

Page 22: Crushing, Blending, and Stretching Data

Groups and Services

Groups
• Major
• Status
– Undergrad or Grad
– Faculty, Adjunct Faculty or Staff
• Department
• College
• Degree
• No. of Credits
• Year of Study
• Campus Location

Services
• Circulation
– Books
– Media
– Reserve
– By Fund Code
– Location
• ILL / Document Delivery
• Databases
• Library Web Pages
– Subject Area Resource Guides
– Reference Requests
• Catalog
• Other Vendor Services
– Serials Solutions

Page 23: Crushing, Blending, and Stretching Data

History Department - 12 months - Feb. 2008

PATRON STATUS | BOOK CIRC | MEDIA CIRC | EQUIP CIRC | TOTAL CIRC | MEMBERS | BORROWERS | % BORROWING | CIRC/MEMBER | CIRC/BORROWER
UNDERGRADUATE STUDENTS | 2,715 | 250 | 698 | 3,663 | 238 | 186 | 78% | 15.39 | 19.69
GRADUATE STUDENTS | 419 | 13 | 76 | 508 | 14 | 13 | 93% | 36.29 | 39.08
ADJUNCT FACULTY | 100 | 65 | 20 | 185 | 32 | 20 | 63% | 5.78 | 9.25
FULL-TIME FACULTY | 159 | 115 | 194 | 468 | 24 | 23 | 96% | 19.50 | 20.35
HISTORY TOTALS | 3,393 | 443 | 988 | 4,824 | 308 | 242 | 79% | 15.66 | 19.93
LIBRARY TOTALS | 23,370 | 8,713 | 20,703 | 52,756 | 7,418 | 4,981 | 67% | 7.11 | 10.59

DEFINITIONS:

BOOK CIRCULATION = books, book disks, maps, oversize, Curriculum materials, reserve books, NJ History, Leisure Lounge

MEDIA CIRCULATION = audio & video materials, including media reserves

EQUIPMENT CIRCULATION = camcorders, overhead & data projectors, laptops, easels, DVD players, etc.

MEMBER = declared major or department member

BORROWER = any member who borrowed materials

Library Total = declared undergrad & grad majors, adjuncts & full time faculty borrowers
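The three derived columns in the History Department table follow from the counts: % borrowing is borrowers divided by members, and the two circulation rates divide total circulation by members and by borrowers. The following is a minimal Python sketch that recomputes them from the slide's counts. The function name `derived_columns` is illustrative, and half-up rounding of the percentage is an assumption, chosen because it reproduces the slide's figures.

```python
# Counts copied from the History Department table:
# (total circulation, members, borrowers) per patron status.
COUNTS = {
    "UNDERGRADUATE STUDENTS": (3663, 238, 186),
    "GRADUATE STUDENTS": (508, 14, 13),
    "ADJUNCT FACULTY": (185, 32, 20),
    "FULL-TIME FACULTY": (468, 24, 23),
    "HISTORY TOTALS": (4824, 308, 242),
}


def derived_columns(total_circ, members, borrowers):
    """Return (% borrowing, circ/member, circ/borrower)."""
    # Half-up rounding (assumed); Python's round() would turn 62.5 into 62.
    pct_borrowing = int(100 * borrowers / members + 0.5)
    return (
        pct_borrowing,
        round(total_circ / members, 2),
        round(total_circ / borrowers, 2),
    )


for status, counts in COUNTS.items():
    pct, per_member, per_borrower = derived_columns(*counts)
    print(f"{status:24} {pct:3d}% {per_member:6.2f} {per_borrower:6.2f}")
```

Circulation per borrower is always at least circulation per member, because borrowers are a subset of members; the gap between the two shows how much of a group never borrows.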

Page 24: Crushing, Blending, and Stretching Data

Problems with Configuration of Services

• Little to no linkage of data
• Need to search multiple services to get complete picture of serial holdings
• Multiple user IDs for authentication

Page 25: Crushing, Blending, and Stretching Data

Retirement of the OPD

• Serials holdings data was extracted from the OPD and added to Voyager catalog

• From Voyager catalog, serials holdings data is extracted and added to Serials Solutions A to Z list

Page 26: Crushing, Blending, and Stretching Data

• Authentication of the ILL form is routed through the EZProxy server.

• A web bug is placed in the microform request page to record each submission in Voyager's web server logfile.

Page 27: Crushing, Blending, and Stretching Data

New Services Added

• Serials Solutions MARC Record Service
• Serials Solutions Link Resolver
• OCLC WorldCat Collection Analysis

Page 28: Crushing, Blending, and Stretching Data

Second Step – Set Up an Application Server

Page 29: Crushing, Blending, and Stretching Data

Our Systems in 2008

• Voyager ILS
• Shared Application Server
• Clio ILL Software
• EZProxy Server
• Banner – University ERP
• University Networked Drive K:
• University Email Server
• University Web Server

Page 30: Crushing, Blending, and Stretching Data

Systems Chart - 2008

[Diagram not reproduced. Labels recoverable from the chart: Integrated Library System (Voyager: Patrons, Searches, Circulation, Media Scheduling, Web Server); Banner (University ERP System: SIS, HRS); University Networked Drive K: (ILL Cliodata: Patrons, Materials); Proxy Server (EZProxy Log: Off Campus Dbase Hits & ILL Form); University Email Server; Application Server (Scripting Language, Web Server, DBMS) holding off-campus database usage by patron groups, ILL patrons/materials requested, and ILL patrons/materials received; www.wpunj.edu Web Server (Scripting Language; ILL Form Page, ER Micro Form, Serials Form); Serials Solutions (A to Z, MARC Records, Link Resolver); OCLC Bibliographic Utility (WorldCat, ILL, WCA); Other Vendors' Database Services & Usage Reports. Legend: Current Relationships; Internal only WPUNJ Server; Externally accessible WPUNJ Server; Non-WPUNJ Server.]

Page 31: Crushing, Blending, and Stretching Data

What is an Application Server?

• A machine or its software that works in conjunction with a web server to deliver application services such as the dynamic creation of a webpage from content stored in a database. From http://www.webtools.ca.gov/help/Glossary.asp
• Web Server Software (Apache or IIS)
• Database Management System – DBMS (MySQL, Oracle, MS SQL Server)
• Scripting Language (Perl, PHP, ColdFusion, ASP)
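The "dynamic creation of a webpage from content stored in a database" can be illustrated in miniature. In this sketch Python stands in for the scripting language and an in-memory SQLite database stands in for the DBMS; a web server would call `render_report` on each request. The table name `db_hits`, its columns, and both function names are hypothetical, and the two sample rows reuse counts reported later in this deck.

```python
import html
import sqlite3


def build_demo_db():
    """Create an in-memory database standing in for the DBMS (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE db_hits (major TEXT, host TEXT, hits INTEGER)")
    conn.executemany(
        "INSERT INTO db_hits VALUES (?, ?, ?)",
        [
            ("M- Nursing", "ebscohost.com", 3377),
            ("M- Business", "proquest.umi.com", 1246),
        ],
    )
    return conn


def render_report(conn):
    """Build an HTML page from database rows, as the scripting layer would per request."""
    rows = conn.execute(
        "SELECT major, host, hits FROM db_hits ORDER BY hits DESC"
    ).fetchall()
    cells = "\n".join(
        f"<tr><td>{html.escape(major)}</td><td>{html.escape(host)}</td><td>{hits}</td></tr>"
        for major, host, hits in rows
    )
    return f"<html><body><table>\n{cells}\n</table></body></html>"


print(render_report(build_demo_db()))
```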

Page 32: Crushing, Blending, and Stretching Data

Why an Application Server?

• Relevant data in logfiles needs to be in a database to be analyzed.

• Need your own DBMS to create new tables and queries.

Page 33: Crushing, Blending, and Stretching Data

• Decide how you will use the Application Server.

• Decide on the best and most plausible configuration.

Page 34: Crushing, Blending, and Stretching Data

One of Our Projects

• Mining EZProxy logfiles and linking to patron statistical categories from the Voyager Patron Database
– What majors and departments are accessing which database services?
– What majors and departments are accessing the ILL services?
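The linking step amounts to a join between log entries and patron records, followed by a count per group. This is a minimal Python sketch with made-up data: the user IDs, the sample hits, and the function name are all hypothetical, and the real project does this work in a DBMS rather than in memory.

```python
from collections import Counter

# Hypothetical sample data: (user id, database host) pairs taken from proxy log
# lines, and a lookup from user id to a patron statistical category.
LOG_HITS = [
    ("user1", "ebscohost.com"),
    ("user2", "ebscohost.com"),
    ("user1", "lexis-nexis.com"),
    ("user3", "ebscohost.com"),
]
PATRON_CATEGORY = {
    "user1": "M- Business",
    "user2": "M- Nursing",
    "user3": "M- Nursing",
}


def hits_by_category_and_host(log_hits, patron_category):
    """Count hits per (category, host); unknown users are grouped as 'Unknown'."""
    counts = Counter()
    for user_id, host in log_hits:
        category = patron_category.get(user_id, "Unknown")
        counts[(category, host)] += 1
    return counts


report = hits_by_category_and_host(LOG_HITS, PATRON_CATEGORY)
for (category, host), n in report.most_common():
    print(f"{category:15} {n:5d} {host}")
```

Only the category survives into the report; the user ID is needed for the lookup and can be dropped afterwards.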

Page 35: Crushing, Blending, and Stretching Data

Systems Chart - 2008

[Diagram not reproduced. Same systems as the previous 2008 chart: Integrated Library System (Voyager), Banner (University ERP System: SIS, HRS), University Networked Drive K: (ILL Cliodata), Proxy Server (EZProxy Log), University Email Server, Application Server (Scripting Language, Web Server, DBMS), www.wpunj.edu Web Server, Serials Solutions (A to Z, MARC Records, Link Resolver), OCLC (WorldCat, ILL, WCA), and Other Vendors' Database Services & Usage Reports. Two outputs of the Application Server are highlighted: ILL Collection and Patron Group Analyses, and Off Campus Database Hits by Patron Group. Legend: Current Relationships; Internal only WPUNJ Server; Externally accessible WPUNJ Server; Non-WPUNJ Server.]

Page 36: Crushing, Blending, and Stretching Data

ILL request form authentications by major – Academic year 07/08

Book Count | Major
90 | M- History
28 | M- Non-Degree
25 | M- Pub Pol & Intl Affairs
20 | M- Spanish
18 | M- English
16 | M- Undecided
14 | M- Art
14 | M- Education
11 | M- Sociology
10 | M- Biology
9 | M- Music
9 | M- Special Programs
8 | M- Psychology
7 | M- Biotechnology
7 | M- Political Science
6 | M- Anthropology
6 | M- Music - Jazz Studies
4 | M- Business
4 | M- Communication
4 | M- Nursing

Article Count | Major
62 | M- Psychology
60 | M- Sociology
42 | M- Applied Clinical Psych
35 | M- Education
31 | M- History
30 | M- Spanish
29 | M- Nursing
19 | M- Communication Disorders
19 | M- Communication
14 | M- Biotechnology
14 | M- Counseling
14 | M- English
12 | M- Non-Degree
10 | M- Community/Sch Health
7 | M- Biology
7 | M- Political Science
6 | M- Undecided
5 | M- Comm Media Studies
5 | M- Reading
4 | M- Business

Page 37: Crushing, Blending, and Stretching Data

07/29/08

Which Databases are accessed by Majors and Departments?

Page 38: Crushing, Blending, and Stretching Data

By Major and Host

Major | Count | Host
M- Nursing | 3377 | ebscohost.com
M- Non-Degree | 3010 | ebscohost.com
M- Psychology | 2303 | ebscohost.com
M- Counseling | 1487 | ebscohost.com
M- Communication | 1359 | ebscohost.com
M- Education | 1267 | ebscohost.com
M- Business | 1246 | proquest.umi.com
M- Sociology | 1152 | ebscohost.com
M- Business | 1145 | lexis-nexis.com
M- Undecided | 1100 | ebscohost.com
M- Applied Clinical Psych | 1075 | ebscohost.com
M- English | 1034 | ebscohost.com
M- Sociology | 916 | csa.com
M- Business | 794 | ebscohost.com
M- Accounting | 738 | lexis-nexis.com
M- Reading | 683 | ebscohost.com
M- Physical Education | 653 | ebscohost.com
M- Special Programs | 600 | ebscohost.com
M- Non-Degree | 463 | ereserve.wpunj.edu

Page 39: Crushing, Blending, and Stretching Data

By Dept and Host

Department | Count | Host
S- Information Systems | 933 | webscript.exe?fs.scr
S- Psychology Dept. | 742 | ebscohost.com
S- Accounting and Law | 559 | lexis-nexis.com
S- Political Sci Dept. | 308 | lexis-nexis.com
S- Nursing Dept. | 204 | ebscohost.com
S- Market & Mgt. Dept. | 175 | proquest.umi.com
S- Library | 167 | ebscohost.com
S- Sociology Dept. | 151 | ebscohost.com
S- Sociology Dept. | 134 | csa.com
S- History Dept. | 121 | serials.abc-clio.com
S- Exercise & Mov Sci | 110 | ebscohost.com
S- Political Sci Dept. | 104 | ebscohost.com
S- Library | 103 | ILL_article.cfm
S- Library | 100 | webscript.exe?fs.scr
S- History Dept. | 94 | webscript.exe?fs.scr

Page 40: Crushing, Blending, and Stretching Data

By Dept and Service

Department | Count | Service
S- Information Systems | 933 | http://www.wpunj.edu/scripts/webscript.exe?fs.scr
S- Accounting and Law | 549 | http://www.lexis-nexis.com/universe
S- Psychology Dept. | 364 | http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=psych
S- Nursing Dept. | 114 | http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=c8h
S- Sociology Dept. | 96 | http://www.csa.com/htbin/dbrng.cgi?&db=socioabs-set-c&adv=1
S- Sociology Dept. | 75 | http://search.ebscohost.com/login.asp?profile=asp
S- Philosophy Dept. | 74 | http://webspirs4.silverplatter.com:8900/c119646?sp.form.first.p=srchmain.htm&sp.dbid.p=S(PHIL
S- Library | 65 | http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=asp
S- Anthropology Dept. | 62 | http://www.sciencedirect.com/
S- History Dept. | 61 | http://serials.abc-clio.com/active/start?_appname=serials&initialdb=AHL
S- Psychology Dept. | 61 | http://search.ebscohost.com/login.asp?profile=psyart
S- History Dept. | 58 | http://serials.abc-clio.com/active/start?_appname=serials&initialdb=HA
S- Psychology Dept. | 54 | http://search.ebscohost.com/login.asp?profile=psych
S- Psychology Dept. | 42 | http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=psyart
S- English Dept. | 42 | http://search.ebscohost.com/login.aspx?authtype=ip,uid&profile=mzh

Page 41: Crushing, Blending, and Stretching Data

Admin VLANs

Vlan ID | Vlan Name
2 | Servers
4 | Admin
5 | Science
6 | Test Servers
7 | NAS
101 | Energy Management
102 | Diebold
104 | Xerox
150 | Media Services
161 | Dorms Offices
162 | RBI
163 | Police
164 | Maintenance

Labs VLANs

Vlan ID | Vlan Name
3 | Lab Servers
9 | Imaging
160 | Lib Labs
174 | STU VPN
175 | Ben Shahn Lab
178 | Hobart Lab
179 | SCI Lab
187 | CS Lab
192 | Atrium
209 | Labs
212 | Resnet Labs
214 | Raub Labs
228 | VR Labs

IP Address Location = 149.151.VlanID.*
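Because the third octet of an on-campus address is the VLAN ID, the location of a logged request can be read straight from its IP address. A small Python sketch of that lookup, assuming the 149.151.VlanID.* rule stated above; the function name and the sample addresses are illustrative, and only a few rows of the VLAN table are copied in.

```python
# A few VLAN IDs and names copied from the table; the full table has more rows.
VLAN_NAMES = {
    2: "Servers",
    4: "Admin",
    160: "Lib Labs",
    161: "Dorms Offices",
    175: "Ben Shahn Lab",
    212: "Resnet Labs",
}


def vlan_for_ip(ip_address):
    """Return (VLAN ID, VLAN name) for a 149.151.x.x address, or None otherwise."""
    octets = ip_address.split(".")
    if len(octets) != 4 or octets[:2] != ["149", "151"]:
        return None  # not a campus address under the rule above
    if not octets[2].isdigit():
        return None
    vlan_id = int(octets[2])
    return vlan_id, VLAN_NAMES.get(vlan_id, "Unlisted VLAN")


print(vlan_for_ip("149.151.160.27"))  # hypothetical address in the Lib Labs VLAN
print(vlan_for_ip("10.0.0.5"))        # hypothetical off-campus address
```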

Page 42: Crushing, Blending, and Stretching Data

Some concerns

Patron Privacy and Standards

Page 43: Crushing, Blending, and Stretching Data

Using Voyager as the model for Patron Privacy

Page 44: Crushing, Blending, and Stretching Data

• Active Circ transactions are stored in a table with patron ID and statistical categories.

• Completed Circ transactions are stored in a table without the patron ID, but still with the patron statistical categories.

• The Patron Table contains the total counts of transactions for each patron, but no link to which transactions they are.

Page 45: Crushing, Blending, and Stretching Data

• EZProxy transactions would be stored in one table with patron statistical categories, but without the user ID.

• User IDs would be stored in another table with counts for each service divided by academic year.

• Logs are collected, loaded, and deleted monthly.
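The two-table design can be sketched as follows, using an in-memory SQLite database in place of the project's DBMS. The table names, column names, and the `record_hit` helper are hypothetical, as is the choice to keep a timestamp on each transaction; the point is only that no single table holds both the user ID and the individual transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Transactions keep the statistical category but never the user ID.
conn.execute("CREATE TABLE proxy_hits (category TEXT, service TEXT, hit_time TEXT)")
# Per-user totals keep the user ID but not which transactions they were.
conn.execute(
    "CREATE TABLE user_counts ("
    " user_id TEXT, service TEXT, academic_year TEXT, hits INTEGER,"
    " PRIMARY KEY (user_id, service, academic_year))"
)


def record_hit(conn, user_id, category, service, hit_time, academic_year):
    """Store one log entry split across the two tables."""
    conn.execute(
        "INSERT INTO proxy_hits VALUES (?, ?, ?)", (category, service, hit_time)
    )
    updated = conn.execute(
        "UPDATE user_counts SET hits = hits + 1"
        " WHERE user_id = ? AND service = ? AND academic_year = ?",
        (user_id, service, academic_year),
    ).rowcount
    if updated == 0:
        # First hit for this user, service, and year: start the counter at 1.
        conn.execute(
            "INSERT INTO user_counts VALUES (?, ?, ?, 1)",
            (user_id, service, academic_year),
        )


record_hit(conn, "theuser", "M- History", "ebscohost.com", "2008-01-01 04:25:15", "07/08")
record_hit(conn, "theuser", "M- History", "ebscohost.com", "2008-01-02 10:00:00", "07/08")
print(conn.execute("SELECT * FROM proxy_hits").fetchall())
print(conn.execute("SELECT * FROM user_counts").fetchall())
```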

Page 46: Crushing, Blending, and Stretching Data

Example of EZProxy log entry

• IP address: nj.dhcp.embarqhsd.net
• (Not used): -
• User ID: theuser
• Date/time: 1/1/2008 4:25:15 AM
• Method: GET
• Page retrieved: http://ezproxy.wpunj.edu:2048/connect?session=sGHMbeSss121YxZa&url=http://www.wpunj.edu/scripts/webscript.exe?fs.scr
• Version: HTTP/1.1
• Response code: 302
• No. of bytes: 537
• Referring URL: http://ezproxy.wpunj.edu:2048/login?url=http://www.wpunj.edu/scripts/webscript.exe?fs.scr
• User agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322)
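For reports by host or service, the useful part of the "page retrieved" field is the destination that follows `url=`. A short Python sketch of that extraction, tested against the example entry above; the function names are illustrative, and any further grouping of hosts into shorter labels (as in the reports) is not shown.

```python
from urllib.parse import urlsplit


def target_of(page_retrieved):
    """Return the URL that follows 'url=' in a proxy request, or None if absent."""
    query = urlsplit(page_retrieved).query
    # Take everything after the first 'url=' so that any '&' or '?' in the
    # destination URL is kept intact.
    _, found, target = query.partition("url=")
    return target if found else None


def target_host(page_retrieved):
    """Return the host name of the destination URL, or None if there is none."""
    target = target_of(page_retrieved)
    return urlsplit(target).hostname if target else None


EXAMPLE = (
    "http://ezproxy.wpunj.edu:2048/connect?session=sGHMbeSss121YxZa"
    "&url=http://www.wpunj.edu/scripts/webscript.exe?fs.scr"
)
print(target_of(EXAMPLE))
print(target_host(EXAMPLE))
```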

Page 47: Crushing, Blending, and Stretching Data

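Perl Script for loading ezproxy log into MySQL

use strict;
use warnings;

# Map month abbreviations in the log timestamp to two-digit numbers.
my %month = (Jan=>'01', Feb=>'02', Mar=>'03', Apr=>'04', May=>'05', Jun=>'06',
             Jul=>'07', Aug=>'08', Sep=>'09', Oct=>'10', Nov=>'11', Dec=>'12');

# One log line: four space-separated fields, [dd/Mon/yyyy:hh:mm:ss zone],
# "method target version", status, bytes, "referrer", "user agent".
my $pattern = '^(\S*) (\S*) (\S*) (\S*) '
            . '\[(..)\/(...)\/(....):(..):(..):(..) .....\]'
            . ' "(\S*) (\S*) (\S*)" '
            . '(\d*) (-|\d*) "([^"]*)" "([^"]*)"';

while (<>) {
    if (m/$pattern/) {
        my ($tgt, $ref, $agt) = (esc($12), esc($16), esc($17));
        # A "-" in the bytes field means no byte count was logged.
        my $byt = $15 eq '-' ? 'NULL' : $15;
        print "INSERT INTO ezproxylogs VALUES ('$1','$2','$3',"
            . " TIMESTAMP '$7/$month{$6}/$5 $8:$9:$10','$11','$tgt',"
            . "'$13',$14,$byt,'$ref','$agt');\n";
    } else {
        # MySQL needs a space after "--" for the line to count as a comment.
        print "-- Skipped line $.\n";
    }
}

# Escape single quotes for use inside an SQL string literal.
sub esc {
    my ($p) = @_;
    $p =~ s/'/''/g;
    return $p;
}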

Page 48: Crushing, Blending, and Stretching Data

Created table to assist the linking

SELECT PATRON_ADDRESS.ADDRESS_TYPE,
       Left([ADDRESS_LINE1], InStr([ADDRESS_LINE1], "@") - 1) AS usr,
       PATRON_ADDRESS.PATRON_ID,
       PATRON_ADDRESS.ADDRESS_STATUS,
       PATRON_ADDRESS.EFFECT_DATE,
       PATRON_ADDRESS.EXPIRE_DATE,
       PATRON_ADDRESS.MODIFY_DATE,
       PATRON_ADDRESS.MODIFY_OPERATOR_ID
INTO emailprefix
FROM PATRON_ADDRESS
WHERE (((PATRON_ADDRESS.ADDRESS_TYPE)="3"));
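The query above keeps the part of the address before the "@" as `usr`, which appears to be what gets matched against the user ID in the EZProxy log. The same idea can be tried out in Python with an in-memory SQLite database, where `substr` and `instr` take the place of `Left` and `InStr`. The sample rows and the reduced column set are hypothetical, and reading address type 3 as the email address is an inference from the query, not something the slide states.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PATRON_ADDRESS ("
    " PATRON_ID INTEGER, ADDRESS_TYPE TEXT, ADDRESS_LINE1 TEXT)"
)
conn.executemany(
    "INSERT INTO PATRON_ADDRESS VALUES (?, ?, ?)",
    [
        (101, "3", "theuser@example.edu"),  # hypothetical email address row
        (101, "1", "300 Example Road"),     # other address type, filtered out
    ],
)

# Keep the part of the email address before '@' as the linking key.
conn.execute(
    "CREATE TABLE emailprefix AS"
    " SELECT PATRON_ID,"
    "        substr(ADDRESS_LINE1, 1, instr(ADDRESS_LINE1, '@') - 1) AS usr"
    " FROM PATRON_ADDRESS"
    " WHERE ADDRESS_TYPE = '3'"
)

rows = conn.execute("SELECT usr, PATRON_ID FROM emailprefix").fetchall()
print(rows)
```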

Page 49: Crushing, Blending, and Stretching Data

Reporting and Standards

• Reporting
– Emailed periodically, e.g., daily dossiers, and other event-triggered reports.
– On demand, via email, web pages or a printer.
• Standards
– Share data for comparative research.
– Groups of libraries and consortia

Page 50: Crushing, Blending, and Stretching Data

Questions?

Ray Schwartz, Systems Specialist Librarian

Cheng Library, William Paterson University, Wayne, New Jersey, USA

schwartzr2 @ wpunj.edu