novel representations and methods in text classification

Novel representations and methods in text classification

Manuel Montes, Hugo Jair EscalanteInstituto Nacional de Astrofísica, Óptica y Electrónica, Mexico.

http://ccc.inaoep.mx/~mmontesg/

http://ccc.inaoep.mx/~hugojair/{mmontesg, hugojair}@inaoep.mx

7th Russian Summer School in Information RetrievalKazan, Russia, September 2013

OVERVIEW OF THIS COURSENovel represensations and methods in text classification

Text classification

• Text classification is the assignment of free-text documents to one or more predefined categories based on their content

• In this course we will review the basics of text classification (document preprocessing and representation and classifier construction) with emphasis on novel or extended document representations

Day 1 (today): Introduction to text classification

• We will provide a general introduction to the task of automatic text classification:– Text classification: manual vs

automatic approach– Machine learning approach to

text classification– BOW: Standard representation for

documents– Main learning algorithms for text

classification – Evaluation of text classification

Day 2: Concept-based representations

• This session elaborates on concept-based representations of documents. That is document representations extending the standard representation with –latent– semantic information. • Distributional representations • Random indexing• Concise semantic analysis• Multimodal representations• Applications in short-text classification, author profiling

and authorship attribution

Day 3: Modeling sequential and syntactic information

• This session elaborates on different alternatives that extend the BOW approach by including sequential and syntactic information. – N-grams– Maximal frequent sequences– Pattern sequences– Locally weighted bag of words– Syntactic representations– Applications in authorship attribution

Day 4: Non-conventional classification methods

• This session considers text classification techniques specially suited to work with low quality training sets.

– Self-training– PU-learning – Consensus classification– Applications in short-text classification, cross-lingual

text classification and opinion spam detection.

Day 5: Automatic construction of classification models

• This session presents methods for the automated construction of classification models in the context of text classification. – Introduction to full model selection – Related works– PSMS– Applications in authorship attribution

Some relevant material

ccc.inaoep.mx/~mmontesg/

• Notes of this courseteaching Russir 2013

• Notes of our course on text miningteaching text mining

• Notes of our course on information retrievalteaching information retrieval

INTRODUCTION TO TEXT CLASSIFICATION

Novel represensations and methods in text classification

Text classification

• Text classification is the assignment of free-text documents to one or more predefined categories based on their content

Documents (e.g., news articles)Categories/classes

(e.g., sports, religion, economy)

Manual classification

• Very accurate when job is done by experts– Different to classify news in general categories than

biomedical papers into subcategories. • But difficult and expensive to scale– Different to classify thousands than millions

• Used by Yahoo!, Looksmart, about.com, ODP, Medline, etc.

Hand-coded rule based systems

• Main approach in the 80s• Disadvantage knowledge acquisition bottleneck– too time consuming, too difficult, inconsistency issues

Experts

Labeleddocuments

Knowledgeengineers

Rule 1, if … then … elseRule N, if … then …

Classifier

New document

Document’s category

Example: filtering spam email• Rule-based classifier

Hastie et al. The Elements of Statistical Learning, 2007, Springer.

Machine learning approach (1)

• A general inductive process builds a classifier by learning from a set of preclassified examples.– Determines the characteristics associated with each one

of the topics.

Ronen Feldman and James Sanger, The Text Mining Handbook

Machine learning approach (2)

Labeleddocuments(training set)

Rules, trees, probabilities,

prototypes, etc.

Classifier

Newdocument

Document’scategory

Learningalgorithm

Experts

How to represent document’s information?

Representation of documents

• First step: transform documents, which typically are strings of characters, into a representation suitable for the learning algorithm.– Generation of a document-attribute representation

• The most common used document representation is the bag of words.– Documents are represent by the set of different words

in all of the documents– Word order is not capture by this representation– There is no attempt for understanding their content

Representation of documents

t1 t1 … tn

: wi,j

All documents(one vector per document)

Weight indicating the contributionof word j in document i.

Vocabulary from the collection(set of different words)

Each different word is a feature!How to compute their weights?

Preprocessing• Eliminate information about style, such as html or

xml tags.– For some applications this information may be useful. For

instance, only index some document sections.• Remove stop words– Functional words such as articles, prepositions,

conjunctions are not useful (do not have an own meaning).• Perform stemming or lemmatization– The goal is to reduce inflectional forms, and sometimes

derivationally related forms.

am, are, is → be car, cars, car‘s → car Have to do them always?

Term weighting - two main ideas

• The importance of a term increases proportionally to the number of times it appears in the document.– It helps to describe document’s content.

• The general importance of a term decreases proportionally to its occurrences in the entire collection.– Common terms are not good to discriminate between

different classes

Term weighting – main approaches• Binary weights: – wi,j = 1 iff document di contains term tj , otherwise 0.

• Term frequency (tf):– wi,j = (no. of occurrences of tj in di)

• tf x idf weighting scheme:

– wi,j = tf(tj, di) × idf(tj), where:

• tf(tj, di) indicates the ocurrences of tj in document di

• idf(tj) = log [N/df(tj)], where df(tj) is the number of

documets that contain the term tj.

Normalization?

Extended document representations• BOW is simple and tend to produce good results, but it has

important limitations– Does not capture word order neither semantic information

• New representations attempt to handle these limitations. Some examples are:– Distributional term representations – Locally weighted bag of words – Bag of concepts – Concise semantic analysis – Latent semantic indexing– Topic modeling– …

The topic of this course

Dimensionality reduction

• A central problem in text classification is the high dimensionality of the features space.– Exist one dimension for each unique word found in

the collection can reach hundreds of thousands– Processing is extremely costly in computational terms– Most of the words (features) are irrelevant to the

categorization task

How to select/extract relevant features?How to evaluate the relevance of the features?

Two main approaches• Feature selection• Idea: removal of non-informative words according to

corpus statistics• Output: subset of original features– Main techniques: document frequency, mutual information

and information gain• Re-parameterization– Idea: combine lower level features (words) into higher-level

orthogonal dimensions– Output: a new set of features (not words)– Main techniques: word clustering and Latent semantic

indexing (LSI)Out of the scope of this course

Learning the classification model

novel representations and methods in text classification

Documents

traffic monitoring and application classification: a novel...

a novel two approach for record pair classification

a novel multiclass text classification algorithm based on

a novel kuhnian ontology for epistemic classification of

classification and toxicity mechanisms of novel flame...

classification and novel class detection in data streams...

novel representations and methods in text...

novel multiscale representations of data sets for...

a novel wide area protection classification technique for

categories of representations and classification of von...

a novel deep-learning-based bug severity classification

rosch, prototype classification and logical classification -...

data stream classification and novel class detection

a novel chemometric classification for ftir spectra of

novel phonotactic learning: syllable-level and co-occurrence...

classification of a novel marine bacterial species: htcc...

novel representations and methods in text...

a great novel in a small language: representations of the...

a novel analysis of performance classification and

learning feature representations for music...