X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for
![Page 1: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/1.jpg)
X-Informatics Web Search; Text Mining B
2013
Geoffrey Fox
gcf@indiana.edu
http://www.infomall.org/X-InformaticsSpring2013/index.html
Associate Dean for Research and Graduate Studies, School of Informatics and Computing
Indiana University Bloomington
2013
![Page 2: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/2.jpg)
![Page 3: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/3.jpg)
The Course in One Sentence
Study Clouds running Data Analytics processing Big Data to solve problems in X-Informatics
![Page 4: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/4.jpg)
Document Preparation
![Page 15: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/15.jpg)
Inverted Index
![Page 16: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/16.jpg)
![Page 17: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/17.jpg)
![Page 20: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/20.jpg)
Index Construction
![Page 23: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/23.jpg)
Then sort by termID and then docID
http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws
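The sort step can be sketched as follows. This is a minimal illustration, not the slide's own implementation: the three-document collection, the rule of assigning termIDs in order of first appearance, and the helper name `postings` are all assumptions made for the example.

```python
from itertools import groupby

# Toy collection: docID -> text (hypothetical example data).
docs = {
    1: "web search and text mining",
    2: "text mining of web pages",
    3: "search engines index pages",
}

# Assign each distinct term an integer termID in order of first appearance
# and emit one (termID, docID) pair per distinct term per document.
term_ids = {}
pairs = []
for doc_id, text in docs.items():
    for term in dict.fromkeys(text.split()):  # de-duplicate, keep order
        term_id = term_ids.setdefault(term, len(term_ids))
        pairs.append((term_id, doc_id))

# Sort by termID and then docID: tuples compare element by element.
pairs.sort()

# Group the sorted pairs into postings lists: termID -> sorted docIDs.
index = {
    term_id: [doc_id for _, doc_id in group]
    for term_id, group in groupby(pairs, key=lambda p: p[0])
}


def postings(term):
    """Return the sorted docID list for a term (empty if the term is unseen)."""
    return index.get(term_ids.get(term), [])


print(postings("web"))
print(postings("pages"))
```

Because the pairs are sorted before grouping, every postings list comes out already ordered by docID, which is what makes later list merging for multi-term queries cheap.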
![Page 27: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/27.jpg)
Query Structure and Processing
![Page 28: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/28.jpg)
![Page 29: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/29.jpg)
![Page 30: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/30.jpg)
![Page 34: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/34.jpg)
Link Structure Analysis including PageRank
![Page 35: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/35.jpg)
![Page 36: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/36.jpg)
![Page 37: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/37.jpg)
![Page 38: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/38.jpg)
![Page 39: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/39.jpg)
![Page 40: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/40.jpg)
![Page 41: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/41.jpg)
![Page 42: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/42.jpg)
![Page 43: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/43.jpg)
![Page 44: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/44.jpg)
Size of face proportional to PageRank
![Page 45: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/45.jpg)
PageRank d=0.85
![Page 46: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/46.jpg)
d = 0.85
![Page 47: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/47.jpg)
PageRank
• PageRank is the probability that a page will be visited by a surfer who clicks each link on a page with equal probability
  – minor corrections for pages with no outgoing links
• Found iteratively, with each page contributing at each iteration, to every page it links to, an amount equal to its PageRank divided by the number of links on the page
• $PR(i) = \sum_{j \to i} PR(j) / N_j$, where the sum runs over pages j pointing at page i and $N_j$ is the number of pages linked on page j
• One adds to this the chance 1 - d that the surfer types a random URL into the web browser.
• That takes PageRank to d times the above plus (1 - d) divided by the total number of pages on the web
• On general principles, this will converge whatever the starting point
  – It can be written as iterative matrix multiplication
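The iteration described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the four-page graph is hypothetical, the fixed iteration count stands in for a real convergence test, and the handling of pages with no outgoing links (spreading their rank evenly over all pages) is one common choice, not necessarily the "minor correction" the slide has in mind.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(i) = (1 - d)/N + d * sum over j -> i of PR(j) / outlinks(j).

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}  # uniform start; any start should converge
    for _ in range(iterations):
        # Every page starts from the random-URL term (1 - d) / N.
        new = {p: (1.0 - d) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Share this page's rank equally among the pages it links to.
                share = d * pr[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Assumed correction for a page with no outgoing links:
                # spread its rank evenly over all pages.
                share = d * pr[page] / n
                for t in pages:
                    new[t] += share
        pr = new
    return pr


# Hypothetical four-page web graph.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

In this toy graph page D has no incoming links, so its score stays at the floor value (1 - d)/N, while page C, which three pages point at, ends up ranked highest.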
![Page 48: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/48.jpg)
Related Applications
• Thinking of PageRank as reputation
• A version of PageRank has recently been proposed as a replacement for the traditional Institute for Scientific Information (ISI) impact factor, and implemented at eigenfactor.org. Instead of merely counting total citations to a journal, the "importance" of each citation is determined in a PageRank fashion.
  – Impact Factor is the average number of citations per article
  – The Eigenfactor score of a journal is an estimate of the percentage of time that library users spend with that journal. The Eigenfactor algorithm corresponds to a simple model of research in which readers follow chains of citations as they move from journal to journal.
• A similar new use of PageRank is to rank academic doctoral programs based on their records of placing their graduates in faculty positions. In PageRank terms, academic departments link to each other by hiring their faculty from each other (and from themselves).
![Page 49: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/49.jpg)
EF = Eigenfactor
AI = Article Influence over the first five years after publication
Eigenfactor scores are scaled so that the sum of the Eigenfactor scores of all journals listed in Thomson's Journal Citation Reports (JCR) is 100
Article Influence scores are normalized so that the mean article in the entire Thomson Journal Citation Reports database has an article influence of 1.00
![Page 50: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/50.jpg)
None done here!
![Page 51: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/51.jpg)
![Page 52: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/52.jpg)
Summary Issues
![Page 53: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/53.jpg)
![Page 54: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/54.jpg)
![Page 55: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/55.jpg)
![Page 56: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/56.jpg)
![Page 57: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/57.jpg)
Crawling the Web
![Page 58: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/58.jpg)
![Page 59: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/59.jpg)
![Page 60: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/60.jpg)
![Page 61: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/61.jpg)
Web Advertising and Search
![Page 62: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/62.jpg)
![Page 63: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/63.jpg)
![Page 64: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/64.jpg)
![Page 65: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/65.jpg)
![Page 66: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/66.jpg)
![Page 67: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/67.jpg)
![Page 68: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/68.jpg)
![Page 69: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/69.jpg)
CS236621 Technion
![Page 70: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/70.jpg)
Clustering and Topics
![Page 71: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/71.jpg)
Grouping Documents Together
• The responses to a search query give you a group of documents
• If we represent documents as points in a space, we can try to identify regions
  – Clustering: Nearby regions of points
  – Support Vector Machine: Chop space up into parts
  – (Gaussian) Mixture Models: A type of fuzzy clustering
  – K-Nearest Neighbors (if have examples)
• Alternatively we can determine “hidden meaning” with a topic model
  – Latent Semantic Indexing
  – Latent Dirichlet Allocation
  – With lots of variants of these methods to find “latent factors”
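To make "documents as points in a space" concrete, the sketch below turns each document into a term-count vector and groups the vectors with a simple k-means-style loop. Everything specific here is an assumption for illustration: the four short documents, the use of cosine similarity as the closeness measure, and the choice of seed documents; the slide names clustering only generically.

```python
import math
from collections import Counter


def vectorize(text):
    """Represent a document as a sparse term-count vector (bag of words)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Mean of a list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})


def kmeans(vectors, seeds, rounds=10):
    """k-means-style clustering; seeds are indices of the initial centers."""
    centers = [vectors[i] for i in seeds]
    labels = []
    for _ in range(rounds):
        # Assign each document to its most similar center.
        labels = [
            max(range(len(centers)), key=lambda k: cosine(v, centers[k]))
            for v in vectors
        ]
        # Move each center to the mean of its assigned documents.
        for k in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = centroid(members)
    return labels


# Hypothetical mini-collection: two documents about links, two about topics.
texts = [
    "pagerank ranks web pages by links",
    "web pages link to other web pages",
    "topic models find latent topics in documents",
    "latent topics group documents",
]
labels = kmeans([vectorize(t) for t in texts], seeds=[0, 2])
print(labels)
```

The grouping depends only on shared words, which is exactly the limitation that topic models try to get past.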
![Page 72: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/72.jpg)
Topic Models
• Illustrated by Google News
• These try to group documents by Topics such as “Presidential Election” and not by inclusion of particular phrases
• You imagine each document is a set of topics (the latent factors) and each topic is a bag of words.
• Find the best set of topics and best set of words in topics
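The "document is a set of topics, topic is a bag of words" picture can be written down directly. The sketch below shows only the forward model, in which the word distribution of a document is a topic-weighted mixture of the per-topic word distributions; it does not show how the best topics are found. The two topics, their word probabilities, and the document's topic weights are all made-up numbers for illustration.

```python
# Hypothetical topics: each topic is a bag of words, given as P(word | topic).
topics = {
    "election": {"vote": 0.5, "president": 0.4, "game": 0.1},
    "sports": {"game": 0.6, "team": 0.3, "vote": 0.1},
}

# Hypothetical document: a mixture of topics, given as P(topic | document).
doc_topics = {"election": 0.75, "sports": 0.25}


def word_distribution(doc_topics, topics):
    """P(word | document) = sum over topics of P(topic | document) * P(word | topic)."""
    dist = {}
    for topic, weight in doc_topics.items():
        for word, p in topics[topic].items():
            dist[word] = dist.get(word, 0.0) + weight * p
    return dist


dist = word_distribution(doc_topics, topics)
for word, p in sorted(dist.items(), key=lambda kv: -kv[1]):
    print(word, round(p, 3))
```

Fitting a topic model runs this in reverse: given only the observed words in many documents, it searches for the topic word lists and per-document topic weights that best explain them.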
![Page 73: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/73.jpg)
A Latent Factor Finding Method
http://www.ifis.cs.tu-bs.de/teaching/ss-11/irws
![Page 74: X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox gcf@indiana.edu Associate Dean for](https://reader035.vdocuments.site/reader035/viewer/2022070408/56649e6b5503460f94b68bd1/html5/thumbnails/74.jpg)
https://portal.futuregrid.org
An example of DA-PLSA
Top 10 popular words of the AP news dataset for 30 topics. Processed by DA-PLSA; only 5 of the 30 topics are shown