Term Co-occurrence Analysis as an Interface to Digital Libraries
Jan W. Buzydlowski
Howard D. White
Xia Lin
College of Information Science and Technology
Drexel University, Philadelphia, Pennsylvania, USA
Digital Library Research
First Wave
– How to store it
Next Wave
– How to retrieve it (IR)
• Text Mining
• Visual Information Retrieval Interface (VIRI)
Term Co-occurrence Analysis (TCA)
– Co-occurrence vs. lexical associations
– Maps vs. lists
Term Definition
Unit of Analysis
– Words
– Documents
– Authors
– Journals
Section of Focus
– Abstract/Text
– Title
– Bibliography
– Keywords
Example
Words in Title
– Term
– Co-occurrence
– Analysis
– Interface
– Digital
– Library
Authors in Bibliography
– Salton-G
– Chen-C
– White-HD
– Ding-Y
– Cleveland-W
– McCain-K
– Lin-X
– Schvaneveldt-R
– Kamada-T
– Fruchterman-T
Term Co-occurrence Methodology
User determines which terms are of interest
– Via a seed term
– From a pre-defined list
The system returns the pair-wise co-occurrence counts of the terms over the collection of records
Example
Unit: Author; Section: Bibliography
User Supplied List: Plato, Aristotle, Smith, Brown
For a given data set (N = 4 unique terms)
– Article 1: Plato, Aristotle, Smith, …
– Article 2: Plato, Smith, …
– Article 3: Plato, Aristotle, Smith, Brown, …
The following co-citations (C(4,2) = 6) are found
COMBINATION            COUNT   ARTICLES
Plato and Smith        3       1, 2, 3
Plato and Aristotle    2       1, 3
Plato and Brown        1       3
Aristotle and Smith    2       1, 3
Aristotle and Brown    1       3
Smith and Brown        1       3
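The counting step in this example can be sketched in a few lines of Java. This is only an illustration of the idea, not the system's code: the class name PairCounter, the method countPairs, and the representation of each article as a set of cited authors are all assumptions made for the sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of the system described in the slides.
class PairCounter {

    // For every unordered pair of terms, counts how many records contain both.
    // Keys look like "Plato and Smith" and follow the order in which terms were supplied.
    static Map<String, Integer> countPairs(List<String> terms, List<Set<String>> records) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i < terms.size(); i++) {
            for (int j = i + 1; j < terms.size(); j++) {
                int count = 0;
                for (Set<String> record : records) {
                    if (record.contains(terms.get(i)) && record.contains(terms.get(j))) {
                        count++;
                    }
                }
                counts.put(terms.get(i) + " and " + terms.get(j), count);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The user-supplied list and the three articles from the example.
        List<String> terms = List.of("Plato", "Aristotle", "Smith", "Brown");
        List<Set<String>> records = List.of(
            Set.of("Plato", "Aristotle", "Smith"),
            Set.of("Plato", "Smith"),
            Set.of("Plato", "Aristotle", "Smith", "Brown"));
        countPairs(terms, records).forEach((pair, n) -> System.out.println(pair + ": " + n));
    }
}
```

Run on the three articles, the sketch yields the same six counts as the table. It scans every record for every pair, which is fine for a toy example; a real collection would presumably rely on an index instead.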
Term Co-occurrence Significance
The frequent co-occurrence of a term pair within a set of documents indicates a strong association between those terms, whereas an infrequent co-occurrence indicates a weak one
– The association you would expect is borne out by the frequency
– The frequency you compute suggests a level of association
Pain and Management vs. Pain and Obtainment
Plato and Aristotle vs. Plato and Cher
Science and Nature vs. Science and National Tattler
A and B vs. C and D
Term Co-occurrence Uses
Allows a user to get a “foothold” with just one term
– One seed term returns many other related terms
Allows a user to get an “overview” with user-supplied/system-supplied terms
– Co-occurrence counts with visualization
Seeding
User types in
– One term, e.g., Plato
– Boolean expression, e.g., Plato AND Brown
System supplies top n terms, in ranked order of frequency of co-occurrence with the initial term
Example
For Plato seed:
ARISTOTLE, PLUTARCH, CICERO, HOMER, BIBLE, EURIPIDES, ARISTOPHANES, XENOPHON, AUGUSTINE, HERODOTUS, KANT-I, AESCHYLUS,
SOPHOCLES, THUCYDIDES, OVID, HESIOD, DIOGENES-LAERTI, HEIDEGGER-M, DERRIDA-J, PINDAR, NIETZSCHE-F, HEGEL-GWF, VERGIL, AQUINAS-T
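The seeding step (one seed term in, the top n co-occurring terms out) might look roughly like the following Java sketch. The class SeedRanker, the method topRelated, the in-memory records, and the alphabetical tie-break are assumptions for illustration. Boolean seed expressions are not handled here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of seeding; names and data layout are illustrative only.
class SeedRanker {

    // Returns up to n terms ranked by how many records they share with the seed.
    // Ties are broken alphabetically (an arbitrary choice made for this sketch).
    static List<String> topRelated(String seed, int n, List<Set<String>> records) {
        Map<String, Integer> counts = new HashMap<>();
        for (Set<String> record : records) {
            if (!record.contains(seed)) {
                continue; // only records containing the seed contribute
            }
            for (String term : record) {
                if (!term.equals(seed)) {
                    counts.merge(term, 1, Integer::sum);
                }
            }
        }
        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(n, ranked.size())));
    }

    public static void main(String[] args) {
        // Toy records reusing the author example from the methodology slide.
        List<Set<String>> records = List.of(
            Set.of("Plato", "Aristotle", "Smith"),
            Set.of("Plato", "Smith"),
            Set.of("Plato", "Aristotle", "Smith", "Brown"));
        System.out.println(topRelated("Plato", 2, records)); // prints [Smith, Aristotle]
    }
}
```

With the three toy records, Smith shares three articles with Plato and Aristotle two, so they rank first and second.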
Need for Visualization
Given a list of user- / system-supplied terms
– Find the frequency of co-occurrence of each pair-wise combination of terms
• Plato AND Aristotle = 1,920
• Plato AND Plutarch = 380
• …
– Too many numbers to take in at once
• C(25, 2) = (25 * 24) / 2 = 300 pairs
Three major visualization techniques
– Multidimensional Scaling (MDS)
– Self-Organizing (Kohonen) Maps (SOMs)
– PathFinder Networks (PFNETs)
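Of the three techniques, the Pathfinder idea is the simplest to sketch: a direct link between two terms is kept only if no indirect path connects them more closely. The Java below is a minimal illustration under stated assumptions, not the mapping procedure used in the system. It assumes the common PFNET(r = ∞, q = n − 1) variant, where a path is as long as its weakest (longest) link, and it assumes counts are turned into distances as 1 / count. The class and method names are hypothetical.

```java
// Hypothetical sketch; class and method names are illustrative, not from the system in the slides.
class PathfinderSketch {

    // Assumption: a pair that co-occurs more often is "closer", so distance = 1 / count.
    // Pairs that never co-occur get an infinite distance (no link).
    static double[][] toDistances(int[][] counts) {
        int n = counts.length;
        double[][] dist = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (i == j) {
                    dist[i][j] = 0.0;
                } else if (counts[i][j] > 0) {
                    dist[i][j] = 1.0 / counts[i][j];
                } else {
                    dist[i][j] = Double.POSITIVE_INFINITY;
                }
            }
        }
        return dist;
    }

    // keep[i][j] is true when the direct link i-j survives pruning: its distance is no
    // larger than the best "minimax" path, i.e. the path whose longest link is shortest.
    static boolean[][] prune(double[][] dist) {
        int n = dist.length;
        double[][] minimax = new double[n][n];
        for (int i = 0; i < n; i++) {
            minimax[i] = dist[i].clone();
        }
        // Floyd-Warshall style pass with max() in place of addition.
        for (int k = 0; k < n; k++) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    double viaK = Math.max(minimax[i][k], minimax[k][j]);
                    minimax[i][j] = Math.min(minimax[i][j], viaK);
                }
            }
        }
        boolean[][] keep = new boolean[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                keep[i][j] = i != j
                    && !Double.isInfinite(dist[i][j])
                    && dist[i][j] <= minimax[i][j];
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        String[] terms = {"A", "B", "C"};
        // Made-up counts: A-B and B-C co-occur often, A-C rarely.
        int[][] counts = {
            {0, 10, 2},
            {10, 0, 8},
            {2, 8, 0}};
        boolean[][] keep = prune(toDistances(counts));
        for (int i = 0; i < terms.length; i++) {
            for (int j = i + 1; j < terms.length; j++) {
                if (keep[i][j]) {
                    System.out.println(terms[i] + " - " + terms[j]);
                }
            }
        }
    }
}
```

In the made-up data the weak A-C link is dropped because the path through B is tighter, leaving A - B and B - C. Pruning of this kind is what turns 300 pairwise counts into a sparse network a reader can take in.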
[Figure: White’s MDS map of 15 co-cited classificationists, ca. 1990. Authors shown: RR Sokal, PHA Sneath, JC Gower, JH Ward, JD Carroll, JB Kruskal, VE McGee, RN Shepard, JA Hartigan, HA Skinner, SC Johnson, M Wish, P Arabie, RK Blashfield, PE Green]
[Figure: White’s PFNet of co-cited authors in Biblical and literary hermeneutics, 1988-1997. Authors shown: SCHLEIERMACHER F, GADAMER HG, KANT I, HEGEL GWF, BARTH K, DILTHEY W, HEIDEGGER M, PLATO, BIBLE, ARISTOTLE, HABERMAS J, DERRIDA J, RICOEUR P, GOETHE JWV, BULTMANN R, FRANK M, NIETZSCHE F, TILLICH P, FICHTE JG, PANNENBERG W, TROELTSCH E, SCHELLING FWJ, SCHLEGEL FV, LUTHER M, EBELING G]
Our System
Three tiered
– User interface
– Server
– Database
Real-time and interactive
Significant data sources
– ISI AHCI
– MedLine
Live interface for retrieval
[Architecture diagram. Component labels: BRS Search Engine, Web Server, Java Servlets, Web-based Map Interface, Java Applet, Mapping Procedures, Application Server, Oracle Databases, PUBMED Search Engine]
Database Interface API
– String[] findRel(String, int)
– int[] findOcc(String[])
Implemented on:
– BRS
• API via a wrapper
– Oracle
• API via JDBC
– Noah
• Specialized co-occurrence database
• API via JNI
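The slide gives only the two signatures, so the following Java sketch fills in the rest with assumptions. It assumes findRel takes a seed term and a count n, and that findOcc returns one count per term pair, upper triangle first, row by row. The interface name CoOccurrenceDatabase and the helper OccurrenceMatrix are hypothetical. The helper shows how a flat result of n(n − 1)/2 counts could be expanded into the symmetric matrix that a mapping procedure would typically take as input.

```java
// The two method signatures come from the slide; the interface name, the parameter
// meanings, and the order of the returned counts are assumptions for this sketch.
interface CoOccurrenceDatabase {
    String[] findRel(String seed, int n); // assumed: top-n terms co-occurring with the seed
    int[] findOcc(String[] terms);        // assumed: pair counts, upper triangle, row by row
}

// Hypothetical helper, not part of the described system.
class OccurrenceMatrix {

    // Expands n(n-1)/2 pair counts into a symmetric n x n matrix with a zero diagonal.
    static int[][] fromPairCounts(int n, int[] pairCounts) {
        if (pairCounts.length != n * (n - 1) / 2) {
            throw new IllegalArgumentException(
                "expected " + (n * (n - 1) / 2) + " counts, got " + pairCounts.length);
        }
        int[][] matrix = new int[n][n];
        int k = 0;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                matrix[i][j] = pairCounts[k];
                matrix[j][i] = pairCounts[k];
                k++;
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        // Stand-in for a findOcc result on {Plato, Aristotle, Smith, Brown},
        // using the counts from the co-citation example earlier in the deck.
        int[] flat = {2, 3, 1, 2, 1, 1};
        int[][] matrix = fromPairCounts(4, flat);
        for (int[] row : matrix) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

Under the same assumptions, a 25-term request would return 300 counts. Keeping the interface this small is what lets the same two calls sit in front of BRS, Oracle, or Noah.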
Future Plans
User Study
– Preference
• Type of map, etc.
– Cognitive map
• How well does the map match experts’ mental models
Larger datasets
Additional data sources