Movie Categorization According to Subtitles -- NLP Course Project
![Page 1: Movie Categorization According to Subtitles -- NLP Course Project](https://reader036.vdocuments.site/reader036/viewer/2022062318/5525d7f35503467c6f8b4aac/html5/thumbnails/1.jpg)
Information Extraction from Movie Subtitles
Extraction of information from subtitles to index movies
Dogan Kaya [email protected]
CS 578
Natural Language Processing
Graduate Course
Computer Engineering
Bilkent University – Turkey
![Page 2: Movie Categorization According to Subtitles -- NLP Course Project](https://reader036.vdocuments.site/reader036/viewer/2022062318/5525d7f35503467c6f8b4aac/html5/thumbnails/2.jpg)
Proposed System in One Sentence
A platform for movie indexing via subtitle analysis
![Page 3: Movie Categorization According to Subtitles -- NLP Course Project](https://reader036.vdocuments.site/reader036/viewer/2022062318/5525d7f35503467c6f8b4aac/html5/thumbnails/3.jpg)
Outline
Introduction
Video Categorization Method
WordNet Domains
Conclusions - Future Work
![Page 4: Movie Categorization According to Subtitles -- NLP Course Project](https://reader036.vdocuments.site/reader036/viewer/2022062318/5525d7f35503467c6f8b4aac/html5/thumbnails/4.jpg)
Introduction
Multimedia databases are becoming popular
Most video classification methods are based on visual/audio signal processing
Text processing is more lightweight than visual/audio processing
High-level semantics are more closely related to human language than to visual features
Subtitles capture the semantics of the corresponding video
![Page 5: Movie Categorization According to Subtitles -- NLP Course Project](https://reader036.vdocuments.site/reader036/viewer/2022062318/5525d7f35503467c6f8b4aac/html5/thumbnails/5.jpg)
Text Pre-processing
Subtitles are segmented into sentences
A part-of-speech tagger is applied to each sentence (Stanford Log-linear Part-Of-Speech Tagger)
Stop words are removed based on a stop-word list
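The segmentation and stop-word steps above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the regex-based sentence splitter, the tiny `STOP_WORDS` set, and the function names are all assumptions, and the part-of-speech tagging step (done with the Stanford tagger in the slides) is left out.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "to", "of", "in",
              "on", "at", "and", "or", "it", "i", "you", "he", "she", "we", "they"}


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a rough heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def remove_stop_words(sentence):
    """Lowercase, keep runs of letters/apostrophes, and drop stop words."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(subtitle_text):
    """Return one list of content tokens per sentence."""
    return [remove_stop_words(s) for s in split_sentences(subtitle_text)]


if __name__ == "__main__":
    print(preprocess("The wolf is in the forest. It hunts at night!"))
```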
![Page 6: Movie Categorization According to Subtitles -- NLP Course Project](https://reader036.vdocuments.site/reader036/viewer/2022062318/5525d7f35503467c6f8b4aac/html5/thumbnails/6.jpg)
Keyword Extraction
TextRank algorithm to extract keywords
TextRank:
represents the text as a graph
applies a ranking algorithm based on Google's PageRank
sorts vertices in decreasing rank order
extracts the top-ranked vertices for further processing
TextRank: Mihalcea, R., Tarau, P.: TextRank: Bringing Order into Texts. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), Barcelona, Spain, July 2004
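A rough sketch of the TextRank idea, assuming an undirected, unweighted word co-occurrence graph and a fixed number of PageRank-style iterations. The window size, damping factor, iteration count, and function names are illustrative choices, not values taken from the slides.

```python
from collections import defaultdict


def build_graph(tokens, window=2):
    """Undirected co-occurrence graph: link words at most `window` positions apart."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return graph


def textrank(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Score vertices with a PageRank-style update and return the top_k words."""
    graph = build_graph(tokens, window)
    scores = {w: 1.0 for w in graph}
    for _ in range(iterations):
        new_scores = {}
        for w in graph:
            # Each neighbour shares its score equally among its own neighbours.
            incoming = sum(scores[n] / len(graph[n]) for n in graph[w])
            new_scores[w] = (1 - damping) + damping * incoming
        scores = new_scores
    # Decreasing score; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


if __name__ == "__main__":
    words = ["wolf", "forest", "wolf", "pack", "wolf", "night", "wolf", "moon"]
    print(textrank(words, window=1, top_k=1))
```

In this toy input every other word is linked only to "wolf", so "wolf" ends up with the highest score.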
![Page 7: Movie Categorization According to Subtitles -- NLP Course Project](https://reader036.vdocuments.site/reader036/viewer/2022062318/5525d7f35503467c6f8b4aac/html5/thumbnails/7.jpg)
WordNet
"WordNet is a semantic lexicon for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets." (en.wikipedia.org)
![Page 8: Movie Categorization According to Subtitles -- NLP Course Project](https://reader036.vdocuments.site/reader036/viewer/2022062318/5525d7f35503467c6f8b4aac/html5/thumbnails/8.jpg)
WordNet Relations
hypernyms: Y is a hypernym of X if every X is a (kind of) Y (canine is a hypernym of dog)
hyponyms: Y is a hyponym of X if every Y is a (kind of) X (dog is a hyponym of canine)
coordinate terms: Y is a coordinate term of X if X and Y share a hypernym (wolf is a coordinate term of dog, and dog is a coordinate term of wolf)
holonym: Y is a holonym of X if X is a part of Y (building is a holonym of window)
meronym: Y is a meronym of X if Y is a part of X (window is a meronym of building)
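The relation definitions above can be checked on a toy example. The two dictionaries below are a hypothetical stand-in for WordNet data (they are not read from WordNet), and the helper names are made up for illustration. The hypernym test follows is-a links transitively, matching the "every X is a (kind of) Y" wording.

```python
# Toy is-a and part-of links; hypothetical stand-ins for WordNet data.
HYPERNYM_OF = {"dog": "canine", "wolf": "canine", "canine": "carnivore"}
PART_OF = {"window": {"building"}}


def is_hypernym(y, x):
    """True if every x is a kind of y, following is-a links transitively."""
    current = HYPERNYM_OF.get(x)
    while current is not None:
        if current == y:
            return True
        current = HYPERNYM_OF.get(current)
    return False


def is_hyponym(y, x):
    """Y is a hyponym of X exactly when X is a hypernym of Y."""
    return is_hypernym(x, y)


def are_coordinate(x, y):
    """True if x and y are distinct words sharing a direct hypernym."""
    parent = HYPERNYM_OF.get(x)
    return x != y and parent is not None and parent == HYPERNYM_OF.get(y)


def is_holonym(y, x):
    """True if x is a part of y."""
    return y in PART_OF.get(x, set())


def is_meronym(y, x):
    """Y is a meronym of X exactly when X is a holonym of Y."""
    return is_holonym(x, y)
```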
![Page 9: Movie Categorization According to Subtitles -- NLP Course Project](https://reader036.vdocuments.site/reader036/viewer/2022062318/5525d7f35503467c6f8b4aac/html5/thumbnails/9.jpg)
Word Sense Disambiguation
Words have many possible meanings, called senses
A Word Sense Disambiguation (WSD) algorithm is needed to determine the correct sense of each word
WSD is based on the lexical database WordNet
WSD: Banerjee, S., Pedersen, T.: An Adapted Lesk Algorithm for Word Sense Disambiguation Using WordNet. In: Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-02), Mexico City, Mexico (2002)
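To make the idea concrete, here is a sketch of the simplified Lesk heuristic: pick the sense whose gloss shares the most words with the context. Note that this is not the adapted Lesk algorithm cited above, which also uses glosses of related synsets. The `SENSES` inventory and its glosses are hypothetical toy data, not WordNet entries.

```python
# Hypothetical toy sense inventory; real glosses would come from WordNet.
SENSES = {
    "bank": {
        "bank#1": "financial institution that accepts deposits and lends money",
        "bank#2": "sloping land beside a body of water such as a river",
    },
}


def simplified_lesk(word, context_tokens):
    """Return the sense id whose gloss overlaps most with the context words.

    Ties go to the sense listed first; unknown words yield None.
    """
    context = set(context_tokens)
    best_sense, best_overlap = None, -1
    for sense_id, gloss in SENSES.get(word, {}).items():
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense


if __name__ == "__main__":
    print(simplified_lesk("bank", ["fishing", "river", "water"]))
```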
![Page 10: Movie Categorization According to Subtitles -- NLP Course Project](https://reader036.vdocuments.site/reader036/viewer/2022062318/5525d7f35503467c6f8b4aac/html5/thumbnails/10.jpg)
WordNet Domains Extraction
Augment WordNet with domain labels
A taxonomy of ~200 domain labels
Each synset is annotated with at least one domain label
WN domains: WordNet Domains, http://wndomains.itc.it/wordnetdomains.html
![Page 11: Movie Categorization According to Subtitles -- NLP Course Project](https://reader036.vdocuments.site/reader036/viewer/2022062318/5525d7f35503467c6f8b4aac/html5/thumbnails/11.jpg)
Example WordNet Domain
![Page 12: Movie Categorization According to Subtitles -- NLP Course Project](https://reader036.vdocuments.site/reader036/viewer/2022062318/5525d7f35503467c6f8b4aac/html5/thumbnails/12.jpg)
WordNet Domains Extraction II
For each video:
Extract the WordNet domains for each keyword's sense
Calculate the occurrence frequency of each domain label
Sort domain labels in decreasing order of occurrence frequency
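The three per-video steps above amount to a count-and-sort. In the sketch below, `SENSE_DOMAINS` is a hypothetical mapping from disambiguated senses to domain labels standing in for the WordNet Domains resource, and the sense ids are invented for illustration.

```python
from collections import Counter

# Hypothetical sense-to-domain mapping; real labels would come from WordNet Domains.
SENSE_DOMAINS = {
    "wolf#1": ["animals", "biology"],
    "ant#1": ["animals", "entomology"],
    "cell#2": ["biology"],
}


def rank_domains(keyword_senses, sense_domains):
    """Count domain labels over all keyword senses and sort by decreasing frequency."""
    counts = Counter()
    for sense in keyword_senses:
        counts.update(sense_domains.get(sense, []))
    # Ties are broken alphabetically so the order is deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    print(rank_domains(["wolf#1", "ant#1", "cell#2"], SENSE_DOMAINS))
```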
![Page 13: Movie Categorization According to Subtitles -- NLP Course Project](https://reader036.vdocuments.site/reader036/viewer/2022062318/5525d7f35503467c6f8b4aac/html5/thumbnails/13.jpg)
Correspondences between categories & WN domains
For each category label:
Look up in WordNet the senses related to it (include senses related through hypernym & hyponym relations)
Obtain the corresponding WordNet domains
Calculate the occurrence score for each domain
Sort domains in decreasing occurrence order
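One possible reading of the category-side steps, sketched with toy data. All three dictionaries are hypothetical placeholders for WordNet and WordNet Domains lookups, and "occurrence score" is assumed here to be a plain count, since the slides do not define it further.

```python
from collections import Counter

# Hypothetical toy data standing in for WordNet and WordNet Domains lookups.
LABEL_SENSES = {"war": ["war#1"]}
RELATED_SENSES = {"war#1": ["conflict#1", "battle#1"]}  # hypernym and hyponym neighbours
DOMAINS = {
    "war#1": ["military", "history"],
    "conflict#1": ["military"],
    "battle#1": ["military", "history"],
}


def category_domains(label):
    """Return the category's domains ordered by decreasing occurrence count."""
    senses = list(LABEL_SENSES.get(label, []))
    # Add senses reachable through one hypernym or hyponym link.
    for sense in list(senses):
        senses.extend(RELATED_SENSES.get(sense, []))
    counts = Counter(d for s in senses for d in DOMAINS.get(s, []))
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [domain for domain, _ in ordered]


if __name__ == "__main__":
    print(category_domains("war"))
```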
![Page 14: Movie Categorization According to Subtitles -- NLP Course Project](https://reader036.vdocuments.site/reader036/viewer/2022062318/5525d7f35503467c6f8b4aac/html5/thumbnails/14.jpg)
| WordNet domains | Category |
| --- | --- |
| medicine, biology, mathematics | science |
| military, history | war |
| animals, biology, entomology | animals |
![Page 15: Movie Categorization According to Subtitles -- NLP Course Project](https://reader036.vdocuments.site/reader036/viewer/2022062318/5525d7f35503467c6f8b4aac/html5/thumbnails/15.jpg)
Category label assignment
Compare the ordered list of WN domains of each video with the ordered list of WN domains of each category

| WordNet domains | Category |
| --- | --- |
| medicine, biology, mathematics | science |
| military, history | war |
| animals, biology, entomology | animals |

Example:

| WN domains of a video | Category |
| --- | --- |
| animals, entomology, biology | animals |
| biology, mathematics, physics | science |
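The slides do not say how the two ordered lists are compared. As one hedged possibility, the sketch below scores each shared domain by the product of its reciprocal ranks in both lists and assigns the category with the highest total. The scoring formula and function names are assumptions; the category table is the one shown in the slides.

```python
# Category table from the slides: category -> ordered WordNet domains.
CATEGORY_DOMAINS = {
    "science": ["medicine", "biology", "mathematics"],
    "war": ["military", "history"],
    "animals": ["animals", "biology", "entomology"],
}


def similarity(video_domains, category_domains):
    """Sum 1 / ((i + 1) * (j + 1)) over domains shared at positions i and j.

    This rank-weighted overlap is an assumed measure, not the authors' method.
    """
    score = 0.0
    for i, domain in enumerate(video_domains):
        if domain in category_domains:
            j = category_domains.index(domain)
            score += 1.0 / ((i + 1) * (j + 1))
    return score


def assign_category(video_domains, categories=CATEGORY_DOMAINS):
    """Return the category whose domain list is most similar to the video's."""
    return max(categories, key=lambda c: similarity(video_domains, categories[c]))


if __name__ == "__main__":
    print(assign_category(["animals", "entomology", "biology"]))
    print(assign_category(["biology", "mathematics", "physics"]))
```

Under this measure the two example videos from the slide are assigned to "animals" and "science" respectively, in line with the example.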
![Page 16: Movie Categorization According to Subtitles -- NLP Course Project](https://reader036.vdocuments.site/reader036/viewer/2022062318/5525d7f35503467c6f8b4aac/html5/thumbnails/16.jpg)
![Page 17: Movie Categorization According to Subtitles -- NLP Course Project](https://reader036.vdocuments.site/reader036/viewer/2022062318/5525d7f35503467c6f8b4aac/html5/thumbnails/17.jpg)
Conclusions & Future Work
Conclusions
An approach that is based only on text and uses natural language processing techniques
No training phase (unsupervised approach)
WordNet Domains mapping
Future Work
Definition of domain knowledge closer to movie classification (MPEG-7)
Improved WSD
![Page 18: Movie Categorization According to Subtitles -- NLP Course Project](https://reader036.vdocuments.site/reader036/viewer/2022062318/5525d7f35503467c6f8b4aac/html5/thumbnails/18.jpg)
Thank you!
Questions & Comments
http://doganberktas.com