analysis of url references in etds: a case study at the university of north texas mark e. phillips...

22
Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips [email protected] du Assistant Dean for Digital Libraries Daniel G. Alemneh [email protected] u Digital Curation Coordinator for Digital Libraries Brenda Reyes Ayala [email protected] Graduate Assistant, UNT Web Archiving Team

Upload: valentine-shepherd

Post on 05-Jan-2016

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

Analysis of URL References in ETDs: A Case Study at the University of North

TexasMark E. [email protected]

Assistant Dean for Digital Libraries

Daniel G. [email protected]

Digital Curation Coordinator for Digital Libraries

Brenda Reyes [email protected]

Graduate Assistant, UNT Web Archiving Team

Page 2: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

• Background• ETD at UNT

• Curating cited URLs • URL references, linking patterns,

• Methods• URL Extraction & Indexing

• Findings• Summary

Outline

Page 3: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

Background: ETD at UNT

Page 4: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

UNT & ETD

The University of North Texas (UNT) began accepting theses and dissertations in electronic format in 1999. ◦UNT is one of the early adopters of what was

to become the ETD movement in higher education

◦One of the first three American universities to require ETDs for graduation.

Page 5: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

UNT & ETD

• The UNT Libraries play an active role in facilitating access to UNT’s ETDs– Digital Projects Unit took on a stewardship role

• Develop appropriate Metadata• Integrate Value added services into the ETDs

– Multiple formats (PDF, JPG, )– Integrate Related contents (Datasets, videos, audios e.g. recitals)

– Started retrospective conversion projects: • Digital retro-conversion (in-house project) for pre-1999 theses and

dissertations previously available only in paper or microform.

Page 6: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries
Page 7: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

Visits from 200+ Countries: http://digital.library.unt.edu/explore/collections/UNTETD/browse/

Page 8: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

Curating Cited URLs

Page 9: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

• The UNT Libraries carried out this research to better understand what effect this shift to the Web had on the use of Web resources as the research focus, or primary citation target of theses and dissertations.

• In order to answer this question, the authors analyzed the scope of referencing Web resources, how it differs between academic degree levels, and how it has changed over the past twelve years at UNT.

Why Case Study

Page 10: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

54%46%

UNT ETDs SizeDoctoral Master's

Degree Level

Total # of

Documents

# of Documents

without URLs

# and % of ETDs that

contain URLs

Average URLs per item

# %

Doctoral 2,347 745 1,602 68.2% 14.02

Master’s 1,988 877 1,111 55.8% 11.12

Total 4,335 1,622 2,713 62.6% 12.83

Page 11: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

URL Range # of ETDs % of ETDs Cumulative %

0 1,622 37.42% 37.42%

1 399 9.20% 46.62%

2-9 1,292 29.80% 76.42%

10-30 769 17.74% 94.16%

31+ 253 5.84% 100.00%

UNT ETDs breakdown of several ranges of URLs per document

Page 12: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

• The average number of URLs per document in the overall UNT ETD dataset is 8.03 with a standard deviation of 21.6.

• These numbers represent documents which contained from 0 to 809 URLs each.

• Removing the documents that did not contain URLs and re-computing the average changed it to 12.83 URLs per document with a standard deviation of 26.19.

UNT ETDs breakdown of URLs per document

Page 13: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

Top-Level Domain

# of Documents # of Documents with URLs

Remark

com 1,814 66.86%

org 1,579 58.20%

edu 1,212 44.67%

gov 1,084 39.96%

net 445 16.40%

us 361 13.31%

uk 269 9.92%

ca 184 6.78%

au 112 4.13%

de 100 3.69%

Top-Level Domain Reference

Page 14: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

Second-Level Sub-Domain

# of Documents # of Documents with URLs

Remark

ed.gov 295 10.87%

state.tx.us 274 10.10%

unt.edu 233 8.59%

census.gov 185 6.82%

cdc.gov 128 4.72%

wikipedia.org 96 3.54%

nih.gov 84 3.10%

utexas.edu 80 2.95%

microsoft.com 75 2.76%

nytimes.com 71 2.62%

Ten Most Referenced Second Level Sub-Domain

Page 15: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

Year # of ETDs # of ETDs with URLs % of ETDs with URLs

1999 120 28 23.33%

2000 315 127 40.32%

2001 290 116 40.00%

2002 298 146 48.99%

2003 328 198 60.37%

2004 304 181 59.54%

2005 284 199 70.07%

2006 326 235 72.09%

2007 349 258 73.93%

2008 336 242 72.02%

2009 311 132* 42.44%*

2010 366 286 78.14%

2011 418 335 80.14%

2012 290 230 79.31%

No. of ETDs with URLs by domain names

Page 16: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

Year # of ETDs with URLs

.com .org .edu .gov .net

# % # % # % # % # %

1999 28 14 42.9% 9 32.1% 11 39.3% 4 14.3% 4 14.3%

2000 127 66 52.0% 70 55.1% 53 41.7% 41 32.3% 19 15.0%

2001 116 63 54.3% 67 57.8% 57 49.1% 39 33.6% 20 17.2%

2002 146 102 69.9% 73 50.0% 62 42.5% 47 32.2% 18 12.3%

2003 198 133 67.2% 96 48.5% 84 42.4% 70 35.4% 24 12.1%

2004 181 122 67.4% 89 49.2% 84 46.4% 66 36.5% 23 12.7%

2005 199 141 70.9% 112 56.3% 100 50.3% 83 41.7% 43 21.6%

2006 235 155 66.0% 143 60.9% 116 49.4% 98 41.7% 40 17.0%

2007 258 182 70.5% 157 60.9% 122 47.3% 116 45.0% 35 16.6%

2008 242 166 68.6% 140 57.9% 99 41.0% 91 37.6% 40 16.5%

2009 132 87 65.9% 83 62.9% 69 52.3% 50 37.9% 25 19.0%

2010 286 199 69.6% 170 59.4% 127 44.4% 129 45.1% 50 17.5%

2011 335 231 69.0% 215 64.2% 134 40.0% 146 43.6% 66 20.0%

2012 230 153 66.5% 155 67.4% 94 40.9% 104 45.2% 38 16.5%

Longitudinal Data For ETDs with URLs

Page 17: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

62% of the publications analyzed in this work included URLs. Doctoral level publications at 68.2%Master’s level at 55.8%

The percentage of ETDs that include URLs consistently increasedFrom 23% in 1999 to almost 80% in

2012

Summary

Page 18: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

Across the years, there were more doctoral dissertations than masters’ theses with URLs referenced.

The .gov domain is the fourth most referenced top-level domain, however, it accounts for nearly half of the top ten most referenced domainsA further investigation at the domain or subdomain level

could reveal additional patterns that may show more content based information about the URL references.

Summary …

Page 19: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

Looking Ahead

Page 20: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

The URLs referenced in a large corpus of ETDs may be present interesting insight into the subjects, disciplines and patterns in these documents which warrants further investigation.

This research provides a preliminary framework for technical methods appropriate for approaching future analysis of the data. A deeper investigation into the scope of the target URLs across an entire ETD corpus could provide a better understanding of the content-based URL linking patterns

Additionally an investigation into how specific disciplines or subject areas are referencing URLs in their ETDs would be helpful in identifying particularly high areas of URL linking versus lower levels.

An analysis of URL inclusion in ETDs across institutions and even nations would make a logical follow-on investigation that would show if higher level trends exist in ETDs.

Future Works

Page 21: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

Finally a further investigation into URL extraction from text would be beneficial to the ETD community in several ways:

It would allow libraries to extract URLs not only from born digital ETDs but also from theses and dissertations that are being retrospectively digitized in institutions that have not had longstanding ETD policies.

It would allow for investigation into ways of normalizing or completing malformed URLs that may provide for better analysis of content referenced and its availability in Web archives.

Future Works…

Page 22: Analysis of URL References in ETDs: A Case Study at the University of North Texas Mark E. Phillips mark.phillips@unt.edu Assistant Dean for Digital Libraries

Thank You!Ameseginalehu

!Gracias!