Automatic Syllabus Classification: JCDL – Vancouver – 22 June 2007
Automatic Syllabus Classification. JCDL – Vancouver – 22 June 2007. Edward A. Fox (presenting co-author), Xiaoyan Yu, Manas Tungare, Weiguo Fan, Manuel Perez-Quinones, William Cameron, GuoFang Teng, and Lillian (“Boots”) Cassel.
Automatic Syllabus Classification
JCDL – Vancouver – 22 June 2007
Edward A. Fox (presenting co-author),
Xiaoyan Yu, Manas Tungare, Weiguo Fan, Manuel Perez-Quinones, William Cameron, GuoFang Teng, and Lillian (“Boots”) Cassel
Why Study the Syllabus Genre?
►Educational resource
►Importance to the educational community
  Educators
  Students
  Self-learners
►Thanks to NSF DUE grant 5328255 (personalization support for NSDL)
Where to look for a specific syllabus?
►Non-standard publishing mechanisms:
  Instructor’s website
  CMSs (courseware management systems, e.g., Sakai)
  Catalogs
►Limited access outside the university
►Search on the Web
  Many non-relevant links in search results
Syllabus Library
►Bootstrapping
  Identify true syllabi from search results
  Store in a repository
  Develop tools & applications
►Scaling up
  Encourage contributions from educational communities
An Essential Step towards Syllabus Library: Classification
►Classification Objects:
  Potential syllabi in Computer Science: search on the Web, using syllabus keywords, only in the educational domains
►Class Definition
►Feature Selection
►Model Selection
►Training and Testing
Four Classes
Class distribution on 1020 documents manually tagged:
  Full Syllabus 49%
  Partial Syllabus 20%
  Entry Page 13%
  Noise 18%
Syllabus Components
►course code
►title
►class time & location
►offering institution
►teaching staff
►course description
►objectives
►web site
►prerequisite
►textbook
►grading policy
►schedule
►assignment
►exam and resources
Features
►84 Genre-specific Features
  the occurrences of keywords
  the positions of keywords, and
  the co-occurrences of keywords and links
►A series of keywords for each syllabus component
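The three feature types above can be sketched roughly as follows. This is a minimal illustration only: the slides do not list the actual 84 features or keyword series, so `COMPONENT_KEYWORDS`, `extract_features`, and the feature names are hypothetical, and the regex-based HTML handling is a simplification.

```python
import re

# Hypothetical keyword lists; the real keyword series are not given in the slides.
COMPONENT_KEYWORDS = {
    "textbook": ["textbook", "required text"],
    "grading": ["grading", "grade"],
    "prerequisite": ["prerequisite"],
}

ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def extract_features(html):
    """Return a dict of illustrative keyword features for one HTML page."""
    # Text found inside <a>...</a> elements, used for keyword/link co-occurrence.
    link_text = " ".join(ANCHOR_RE.findall(html)).lower()
    # Crude tag stripping; a real pipeline would use an HTML parser.
    text = TAG_RE.sub(" ", html).lower()
    features = {}
    for component, keywords in COMPONENT_KEYWORDS.items():
        # Occurrence: how often any keyword of this component appears.
        features[component + "_count"] = sum(text.count(k) for k in keywords)
        # Position: relative offset of the first hit (0.0 = top), -1.0 if absent.
        positions = [text.find(k) for k in keywords if k in text]
        features[component + "_first_pos"] = (
            min(positions) / len(text) if positions else -1.0
        )
        # Co-occurrence with links: 1 if a keyword appears in link text, else 0.
        features[component + "_in_link"] = int(
            any(k in link_text for k in keywords)
        )
    return features
```

A page whose only mention of "prerequisite" sits inside a link would get `prerequisite_in_link == 1`, which is the kind of signal that might separate an entry page from a full syllabus.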
Classification Models
►Discriminative Models
  Support Vector Machines (SVM)
  SMO-L: Sequential Minimal Optimization, accelerating the training process of SVM
  SMO-P: SMO with a polynomial kernel
►Generative Models
  Naïve Bayes (NB)
  NB-K: Applying kernel methods to estimate the distribution of numeric attributes in NB modeling
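To illustrate the difference between NB and NB-K for one numeric attribute, here is a minimal sketch. It assumes a Gaussian kernel with a hand-picked bandwidth; the slides do not say which kernel or bandwidth was used, and the function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def single_gaussian_density(x, values):
    """Plain NB: fit one Gaussian to a class's training values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # Floor the std so a constant attribute does not divide by zero.
    return gaussian_pdf(x, mean, max(math.sqrt(var), 1e-6))


def kernel_density(x, values, bandwidth):
    """Kernel variant: average one Gaussian centred on each training value."""
    return sum(gaussian_pdf(x, v, bandwidth) for v in values) / len(values)
```

For a bimodal attribute such as `[0, 0, 1, 9, 10, 10]`, the kernel estimate puts little mass near 5, while the single Gaussian peaks there. Whether that extra flexibility helps depends on the data.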
Evaluation
►Training corpus: 1020 out of the 8000+ potential syllabi
►All in HTML, PDF, PostScript, or Text
►Manual tagging on the training corpus
  Unanimous agreement by three co-authors
►Evaluation strategy: ten-fold cross validation
►Metrics: F1 (an overall measure of classification performance)
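The evaluation setup can be sketched as follows, assuming the usual definition F1 = 2PR / (P + R) computed per class, and a simple unshuffled fold assignment. The helper names are hypothetical, and the slides do not say how folds were drawn.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) for each of k folds."""
    # Item i goes to fold i % k; a real run would usually shuffle first.
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


def f1_for_class(gold, predicted, label):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With 1020 tagged documents and ten folds, each round would train on 918 documents and test on the remaining 102.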
Results w. random set
Best items are in purple boxes.
Acctr: Classification accuracy on the training set.
Results (Cont’d)
►SVM outperforms NB regarding our syllabus classification on average.
►All classifiers fail in identifying the partial syllabus class.
►The kernel settings for NB are not helpful in the syllabus classification task.
►Classification accuracy on training data is not that good.
Future Work
►Feature selection
  Add general feature selection methods on text classification, e.g., Document Frequency, Information Gain, and Mutual Information
  Hybrid: combine our genre-specific features with the general features
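As an illustration of one of the general methods named above, Information Gain for a single term could be computed along these lines. This is a sketch under the assumption of a binary term-presence feature; the function names are hypothetical and nothing here reflects the authors' planned implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(has_term, labels):
    """Reduction in class entropy from knowing whether a term is present."""
    with_term = [y for t, y in zip(has_term, labels) if t]
    without_term = [y for t, y in zip(has_term, labels) if not t]
    n = len(labels)
    # Entropy of the labels after splitting on term presence, weighted by size.
    conditional = (
        len(with_term) / n * entropy(with_term)
        + len(without_term) / n * entropy(without_term)
    )
    return entropy(labels) - conditional
```

Terms would then be ranked by this score and the top-scoring ones kept alongside the genre-specific features.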
Future Work (Cont’d)
►Syllabus Library
  Welcome to http://doc.cs.vt.edu
  Share your favorite course resources – not limited to the syllabus genre.
►Information Extraction
  Semantic search
►Personalization
Summary
►Towards a syllabus library
  Starting from search results on the web
  Classification of the search results for true syllabi
►SVM is a better choice for our syllabus classification task.
►Towards an educational on-line community around the syllabus library
Q & A