![Page 1: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/1.jpg)
Automatic Classification Document and Filing
Jonathan McElroy
Advisor: Franz J. Kurfess
![Page 2: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/2.jpg)
Overview
Introduction
Classification Techniques
Hidden Markov Models
Similar Systems
Novel Approach
![Page 3: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/3.jpg)
Introduction
Creating an assistant document filer that learns from the user.
Novelty – combines different classification approaches to form a hierarchical folder system based on the user’s filing patterns. Also uses natural and specific learning and Markov Models to determine the user’s style of filing.
![Page 4: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/4.jpg)
Classification - Bayesian
Probabilistic method using Bayes' Theorem [1] [12]
Bayes' Theorem
Now sum up the probabilities that a word in A will be in class B.
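Bayes' Theorem, referenced on this slide, can be written in its standard form as follows, reading A as the words of a document and B as a class (the formula itself did not survive in the transcript, so this is the textbook statement rather than a copy of the slide):

```latex
P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}
```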
![Page 5: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/5.jpg)
Classification – Bayesian (cont.)
Each word is assumed to be independent of the others.
Often performs just as well as more complicated techniques like decision trees, rule-based learning and instance-based learning.
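A minimal sketch of a naive Bayes text classifier under the word-independence assumption above. The function names, the Laplace smoothing, the use of summed log-probabilities, and the tiny training set are all illustrative assumptions, not details taken from the slides.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (text, class_label) pairs (hypothetical format)."""
    class_counts = Counter()            # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def classify(text, model):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Start from the log prior of the class.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            # Laplace smoothing (an assumption) avoids zero probabilities.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            # Independence lets the per-word log-probabilities be summed.
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data for illustration only.
training = [
    ("invoice payment due amount", "finance"),
    ("bank statement balance payment", "finance"),
    ("lecture notes homework exam", "school"),
    ("exam schedule course syllabus", "school"),
]
model = train_naive_bayes(training)
print(classify("payment amount overdue", model))
```

Working in log space keeps the product of many small probabilities from underflowing, which is why the sketch adds logarithms instead of multiplying.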
![Page 6: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/6.jpg)
Classification – Vector Based
The text documents are turned into vectors [1]
Support Vector Machines [14]
Supervised learning.
Forms a divide between examples mapped in space.
New objects are mapped into the same space and classified based on which side of the divide they fall on.
![Page 7: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/7.jpg)
Classification – Vector (cont.)
T-Route [1]
An average document vector is created for each of the K classes.
Uses a term-document matrix.
W_TR is size M × K.
W_ij represents the number of times t_i occurs in c_j.
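The matrix described on this slide can be sketched as follows: one row per term, one column per class, each entry counting how often the term occurs in that class. The function name, whitespace tokenization, and sample data are assumptions for illustration; the slide does not specify them.

```python
from collections import Counter

def build_term_class_matrix(labeled_docs):
    """Return (terms, classes, W) where W[i][j] counts term i in class j."""
    terms = sorted({w for text, _ in labeled_docs for w in text.lower().split()})
    classes = sorted({label for _, label in labeled_docs})
    counts = {c: Counter() for c in classes}
    for text, label in labeled_docs:
        counts[label].update(text.lower().split())
    # M rows (terms) by K columns (classes).
    W = [[counts[c][t] for c in classes] for t in terms]
    return terms, classes, W

# Hypothetical documents for illustration only.
docs = [
    ("payment invoice payment", "finance"),
    ("invoice due", "finance"),
    ("exam notes", "school"),
]
terms, classes, W = build_term_class_matrix(docs)
for term, row in zip(terms, W):
    print(term, row)
```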
![Page 8: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/8.jpg)
Classification – Vector (cont.)
Vectorization [1]
![Page 9: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/9.jpg)
Classification – Vector (cont.)
T-Trans [1]
A unique document vector is created for each document in the K classes.
W_TT is size M × N.
W_ij represents the number of times t_i occurs in d_j.
A document is assigned the same class as the column vector in W_TT with the smallest Euclidean distance from the document.
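A minimal sketch of the nearest-column rule described above: each training document is kept as its own count vector, and a new document takes the class of the closest one by Euclidean distance. The helper names, whitespace tokenization, and sample data are hypothetical.

```python
import math
from collections import Counter

def vectorize(text, terms):
    """Count vector of `text` over a fixed, ordered term list."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in terms]

def classify_nearest(text, train_texts, train_labels, terms):
    query = vectorize(text, terms)
    # Each training vector plays the role of one column of the M x N matrix.
    columns = [vectorize(t, terms) for t in train_texts]
    distances = [math.dist(query, col) for col in columns]
    return train_labels[distances.index(min(distances))]

# Hypothetical training documents for illustration only.
train_texts = ["invoice payment due", "exam notes homework", "bank payment balance"]
train_labels = ["finance", "school", "finance"]
terms = sorted({w for t in train_texts for w in t.lower().split()})

print(classify_nearest("homework notes", train_texts, train_labels, terms))
```

`math.dist` requires Python 3.8 or later; words in the new document that are not in the term list are simply ignored in this sketch.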
![Page 10: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/10.jpg)
Classification - Improvements
Latent Semantic Analysis
Looks at relationships between words and documents and then forms concepts that link them to each other.
![Page 11: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/11.jpg)
Classification - Improvements
Term Weighting [1] [15]
Term Frequency – Inverse Document Frequency
The importance of a word increases proportionally to the number of times it appears in the document but is offset by the frequency of the word in the corpus.
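One common way to compute such a weight is sketched below, using relative term frequency multiplied by the logarithm of the inverse document frequency. This particular variant, the function name, and the sample corpus are assumptions; the slides do not say which TF-IDF formula was used.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            # Term frequency rises with repetition in the document;
            # the log factor shrinks for terms common across the corpus.
            t: (c / len(tokens)) * math.log(n_docs / doc_freq[t])
            for t, c in tf.items()
        })
    return weights

# Hypothetical corpus for illustration only.
corpus = ["report budget budget", "report meeting", "report holiday"]
weights = tf_idf(corpus)
print(weights[0])
```

A word such as "report" that appears in every document gets a weight of zero here, while "budget" is weighted up because it is repeated in one document and absent from the others.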
![Page 12: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/12.jpg)
Classification - Improvements
Term Weighting [1] [15]
Mutual Information: looks at two different classes and infers which keywords used to classify one of them will also lead to a misclassification of the other.
• Measures their mutual dependence
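As a rough illustration of measuring mutual dependence, the sketch below computes the standard mutual information (in bits) between the presence of a term and membership in a class. It is the textbook definition rather than the specific weighting scheme from [1] or [15], and the function name and sample data are hypothetical.

```python
import math

def mutual_information(labeled_docs, term, target_class):
    """Mutual information between 'term present' and 'doc in target_class'."""
    n = len(labeled_docs)
    # joint[(t, c)] counts documents with term-present flag t and class flag c.
    joint = {(t, c): 0 for t in (0, 1) for c in (0, 1)}
    for text, label in labeled_docs:
        has_term = int(term in text.lower().split())
        in_class = int(label == target_class)
        joint[(has_term, in_class)] += 1
    mi = 0.0
    for (t, c), count in joint.items():
        if count == 0:
            continue  # a zero joint probability contributes nothing
        p_tc = count / n
        p_t = (joint[(t, 0)] + joint[(t, 1)]) / n
        p_c = (joint[(0, c)] + joint[(1, c)]) / n
        mi += p_tc * math.log2(p_tc / (p_t * p_c))
    return mi

# Hypothetical labeled documents for illustration only.
docs = [
    ("invoice payment", "finance"),
    ("payment due", "finance"),
    ("exam notes", "school"),
    ("exam payment", "school"),
]
print(mutual_information(docs, "exam", "school"))
print(mutual_information(docs, "payment", "finance"))
```

In this toy data "exam" separates the two classes perfectly, while "payment" appears in both classes and so carries less information about either.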
![Page 13: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/13.jpg)
Classification - Improvements
Term Weighting [1] [15]
Bellegarda – Combines the global weighting with a localized weighting for a word.
Creates a new term-document weight W_ij for term t_i in document d_j.
![Page 14: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/14.jpg)
Hidden Markov Models [4]
Method for learning patterns. Specifically for filing patterns.
HMMs describe two related discrete-time stochastic processes.
First – hidden state
Second – visible variables
![Page 15: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/15.jpg)
Hidden Markov Models [4]
Example: User files using 2 different types of filing: Date, Area of Interest.
Observations about the documents in each node lead to an inferred filing type, using the probabilities of each type and the node data.
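A minimal sketch of how the filing type could be inferred from observations, using the Viterbi algorithm over two hidden states (Date and Area). All probabilities and observation labels below are made-up values chosen for illustration; the slides do not give the actual model parameters or say which HMM inference procedure was used.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) of the most likely hidden-state sequence."""
    # best[s] holds the best probability and path ending in state s.
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit_p[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])

# Hypothetical parameters: hidden states are the user's filing types.
states = ("Date", "Area")
start_p = {"Date": 0.5, "Area": 0.5}
trans_p = {
    "Date": {"Date": 0.8, "Area": 0.2},
    "Area": {"Date": 0.2, "Area": 0.8},
}
emit_p = {
    "Date": {"similar_dates": 0.7, "related_documents": 0.1, "unrelated_documents": 0.2},
    "Area": {"similar_dates": 0.1, "related_documents": 0.8, "unrelated_documents": 0.1},
}

# One observation per folder node visited (hypothetical sequence).
observations = ["similar_dates", "similar_dates", "related_documents"]
prob, path = viterbi(observations, states, start_p, trans_p, emit_p)
print(path, prob)
```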
![Page 16: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/16.jpg)
Hidden Markov Models [4]
[Diagram: hidden states Date and Area; observations Unrelated Documents, Similar Dates, Related Documents]
![Page 17: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/17.jpg)
Similar Systems
Email Classification/Routing Systems
Hierarchical Systems
Semantic Desktops
![Page 18: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/18.jpg)
Similar Systems
Email Classification/Routing Systems
[6] System reroutes information from a central database to multiple users with different profiles by using evolving classifying agents that filter the data.
[1] Continually receives new text-based documents and works to classify them and extract important information from them.
![Page 19: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/19.jpg)
Similar Systems
Hierarchical Systems [7]
At each level, a context-sensitive signature and feature selection are created and then focused to cut out noise and stop words.
Bayesian > Vector
![Page 20: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/20.jpg)
Similar Systems
Semantic Desktops
CALO [13]
• Project led by SRI International that focused on the development of a smart desktop
• Aims to automate interrelated decision-making tasks that have resisted automation and to react appropriately to unusual situations.
![Page 21: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/21.jpg)
Similar Systems
Semantic Desktops
DEVONthink [10]
• Seeks to make an all-inclusive information gatherer and organizer
• Sorts, classifies and shows relationships between documents automatically, but has shortcomings
![Page 22: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/22.jpg)
My Approach
Hierarchical approach to classification
Classifying each node in a directory
Also uses natural and specific learning, letting the user choose how involved to be in the learning.
Uses Markov Models to determine the user’s style of filing.
Automatic placement of files that do not fit into any current node.
![Page 23: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/23.jpg)
My Approach
The user is able to drag newly received files and drop them onto the program.
Files are classified by their content and put in the location where the user would most likely have placed them.
Any changes made by the user are recorded and added to the classification.
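The suggest-then-correct loop described on this slide could look roughly like the sketch below. The class name, method names, folder names, and the simple word-overlap scoring are all hypothetical stand-ins; the slides do not describe the actual classifier or interface.

```python
from collections import Counter, defaultdict

class FilingAssistant:
    """Toy filer: suggests a folder and learns from user corrections."""

    def __init__(self):
        self.folder_words = defaultdict(Counter)  # word counts per folder

    def learn(self, text, folder):
        self.folder_words[folder].update(text.lower().split())

    def suggest(self, text):
        if not self.folder_words:
            return None  # nothing learned yet
        words = text.lower().split()

        def score(folder):
            counts = self.folder_words[folder]
            total = sum(counts.values()) or 1  # guard against empty folders
            return sum(counts[w] for w in words) / total

        return max(self.folder_words, key=score)

    def record_correction(self, text, chosen_folder):
        # A user's move is treated as a new training example.
        self.learn(text, chosen_folder)

# Hypothetical folders and documents for illustration only.
assistant = FilingAssistant()
assistant.learn("invoice payment due", "Finance/2009")
assistant.learn("exam notes homework", "School/Courses")
first = assistant.suggest("payment reminder")

# The user files a document somewhere new; the assistant records it.
assistant.record_correction("reminder dentist appointment", "Personal")
second = assistant.suggest("dentist reminder")
print(first, second)
```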
![Page 24: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/24.jpg)
References
[1] Tailby, R., Dean, R., Milner, B., and Smith, D. 2006. Email classification for automated service handling. In Proceedings of the 2006 ACM Symposium on Applied Computing (Dijon, France, April 23 - 27, 2006). SAC '06. ACM, New York, NY, 1073-1077. http://doi.acm.org/10.1145/1141277.1141530
[2] Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Comput. Surv. 34, 1 (Mar. 2002), 1-47. http://doi.acm.org/10.1145/505282.505283
[3] Fu, Y., Ke, W., and Mostafa, J. 2005. Automated text classification using a multi-agent framework. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries (Denver, CO, USA, June 07 - 11, 2005). JCDL '05. ACM, New York, NY, 157-158. http://doi.acm.org/10.1145/1065385.1065420
[4] Frasconi, P., Soda, G., and Vullo, A. 2002. Hidden Markov Models for Text Categorization in Multi-Page Documents. J. Intell. Inf. Syst. 18, 2-3 (Mar. 2002), 195-217. http://dx.doi.org/10.1023/A:1013681528748
[5] Cohen, W. W. and Singer, Y. 1999. Context-sensitive learning methods for text categorization. ACM Trans. Inf. Syst. 17, 2 (Apr. 1999), 141-173. http://doi.acm.org/10.1145/306686.306688
[6] Clack, C., Farringdon, J., Lidwell, P., and Yu, T. 1997. Autonomous document classification for business. In Proceedings of the First international Conference on Autonomous Agents (Marina del Rey, California, United States, February 05 - 08, 1997). AGENTS '97. ACM, New York, NY, 201-208. http://doi.acm.org/10.1145/267658.267716
[7] Chakrabarti, S., Dom, B., Agrawal, R., and Raghavan, P. 1998. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. The VLDB Journal 7, 3 (Aug. 1998), 163-178. http://dx.doi.org/10.1007/s007780050061
[8] Baker, L. D. and McCallum, A. K. 1998. Distributional clustering of words for text classification. In Proceedings of the 21st Annual international ACM SIGIR Conference on Research and Development in information Retrieval (Melbourne, Australia, August 24 - 28, 1998). SIGIR '98. ACM, New York, NY, 96-103. http://doi.acm.org/10.1145/290941.290970
[9] Cognitive Assistant that Learns and Organizes. http://caloproject.sri.com/
[10] http://www.devon-technologies.com/products/devonthink/index.html
[11] http://nepomuk.semanticdesktop.org/
[12] Fan, H. and Ramamohanarao, K. 2003. A Bayesian approach to use emerging patterns for classification. In Proceedings of the 14th Australasian Database Conference - Volume 17 (Adelaide, Australia). K. Schewe and X. Zhou, Eds. ACM International Conference Proceeding Series, vol. 143. Australian Computer Society, Darlinghurst, Australia, 39-48.
[13] Cognitive Assistant that Learns and Organizes. http://caloproject.sri.com/
[14] Support Vector Machines. December, 2009. http://en.wikipedia.org/wiki/Support_vector_machine.
[15] K. Yu, X. Xu, M. Ester, H.-P. Kriegel. Feature weighting and instance selection for collaborative filtering. Knowledge and Information Systems, 5(2), 201-224, 2003
![Page 25: Automatic Classification Document and Filing Jonathan McElroy Advisor: Franz J. Kurfess](https://reader035.vdocuments.site/reader035/viewer/2022081515/56649ed35503460f94be4101/html5/thumbnails/25.jpg)
Questions
Do you?