

Representation of hypertext documents based on terms, links and text compressibility

Julian Szymański
Department of Computer Systems Architecture, Gdańsk University of Technology, Poland

Włodzisław Duch
Department of Informatics, Nicolaus Copernicus University, Toruń, Poland

School of Computer Engineering, Nanyang Technological University, Singapore

Outline

Text representations: words, references, compression

Evaluation of text representations: Wikipedia data, SVM & PCA, experimental results and conclusions

Future directions

Text representation
The amount of information on the Internet grows rapidly, so machine support is needed for:

Categorization (supervised or unsupervised)
Searching / retrieval

Humans understand text; machines do not. To process text, a machine needs it in a computable form.

The results of text processing depend strongly on the method used for text representation.

There are several approaches to processing natural language: logic (ontologies), statistical processing of large text corpora, and geometry, the approach mainly used in machine learning.

Machine learning for NLP uses text features.

The aim of the experiments presented here is to find a hypertext representation suitable for automatic categorization.

Information Retrieval 4 Wiki project: improvement of the existing Wikipedia category system.

Text representation with features

A convenient form of a text for machine processing is a vector of features.

A text set is represented as a matrix of N features, where feature n is related to text k by the weight c.

Where do features come from?

k – document number, n – feature number, c – feature value
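Written out explicitly, using the k, n, c notation above (a reconstruction; the original slide showed the matrix as an image):

```latex
% K documents, N features; c_{kn} weights feature n in document k
T =
\begin{pmatrix}
c_{11} & c_{12} & \cdots & c_{1N} \\
c_{21} & c_{22} & \cdots & c_{2N} \\
\vdots & \vdots & \ddots & \vdots  \\
c_{K1} & c_{K2} & \cdots & c_{KN}
\end{pmatrix},
\qquad
\text{document } k \;\mapsto\; (c_{k1}, \ldots, c_{kN})
```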

Words
The most intuitive approach is to take words as features; the words a text contains should describe its subject well.

The n-th word has value c in the context of the k-th document, calculated as:

c_{kn} = tf_{kn} · idf_n

where
tf – term frequency: how many times word n appears in document k.
idf – inverse document frequency: how seldom word n appears in the whole text set; the ratio of the number of all documents to the number of documents containing the given word.

Problem: high-dimensional sparse vectors.

BOW – Bag of Words, which loses syntax.

Preprocessing: stopwords, stemming. Features → terms.

Some other possibilities: n-grams, profiles of letter frequencies.
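A minimal sketch of this weighting in Python, using raw counts and the plain document-frequency ratio exactly as defined above (production systems usually take a logarithm of the idf and apply the stopword/stemming preprocessing first):

```python
from collections import Counter

def tfidf_vectors(documents):
    """Weight c_kn = tf_kn * idf_n, with idf_n = |D| / df_n as described above."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({w for doc in tokenized for w in doc})
    # df_n: number of documents containing word n
    df = {w: sum(1 for doc in tokenized if w in doc) for w in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)  # tf_kn: raw count of word n in document k
        vectors.append([tf[w] * (len(documents) / df[w]) for w in vocab])
    return vocab, vectors

vocab, vecs = tfidf_vectors([
    "trees grow in forests",
    "algebra studies abstract structures",
    "trees appear in algebra too",
])
```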

References
Scientific articles contain bibliographies; web documents contain hyperlinks. These can be used as a representation space in which a document is represented by the other documents it references.

Typically a binary vector: 0 – no reference to the given document, 1 – the reference exists (sketched in code below).

Some possible extensions:
Not all articles are equal. Ranking algorithms such as PageRank or HITS measure the importance of documents and can replace the binary value with a weight that describes the importance of one article pointing to another.

We can use references of higher order, capturing references not only from direct neighbours but also from further away.

As with the word representation, the vectors are sparse, but of much lower dimension.

Poor at capturing semantics.
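A sketch of the binary link vectors and of the higher-order extension mentioned above (the link structure here is hypothetical; second-order references are obtained by following links of links):

```python
import numpy as np

# Hypothetical link structure: doc k -> set of docs it references
links = {0: {1, 2}, 1: {2}, 2: {0}, 3: {2}}
n_docs = 4

# First-order binary reference matrix: A[k, n] = 1 if doc k links to doc n
A = np.zeros((n_docs, n_docs), dtype=int)
for k, refs in links.items():
    A[k, list(refs)] = 1

# Higher order: A @ A marks documents reachable in two hops;
# clip to keep the representation binary
A2 = np.clip(A + A @ A, 0, 1)
```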

Compression
Usually we need to show differences and similarities between texts in a repository. These can be calculated using, e.g., the cosine distance, which is suitable for high-dimensional sparse vectors.

The result is a square matrix describing text similarity.

Another possibility is to build the representation space on algorithmic information, estimated using standard file compression techniques.

Key idea: if two documents are similar, compressing their concatenation yields a file only slightly larger than a single compressed file.

Two similar files compress better than two different ones.

The complexity-based similarity measure is the fraction by which the sum of the separately compressed files exceeds the size of the jointly compressed file; written out following that verbal definition:

s(A, B) = (|A_p| + |B_p| − |(AB)_p|) / |(AB)_p|

where A and B denote text files, and the suffix p denotes the compression operation.
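A minimal sketch of this measure, with zlib standing in for the compressor p (any standard compressor could be substituted):

```python
import zlib

def csize(text: str) -> int:
    """|X_p|: size of the compressed text in bytes."""
    return len(zlib.compress(text.encode("utf-8")))

def compression_similarity(a: str, b: str) -> float:
    """Fraction by which the separately compressed sizes exceed the joint one."""
    joint = csize(a + b)
    return (csize(a) + csize(b) - joint) / joint

# Similar texts should score higher than unrelated ones
print(compression_similarity("the cat sat on the mat", "the cat sat on a mat"))
print(compression_similarity("the cat sat on the mat", "volcanic eruption types"))
```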


The data
The three ways of generating numerical representations of texts have been compared on a set of articles selected from Wikipedia.

Articles belonging to subcategories of the super-category Science:
Chemistry → Chemical compounds
Biology → Trees
Mathematics → Algebra
Computer science → MS (Microsoft) operating systems
Geology → Volcanology

Rough view of the class distribution
PCA projections of the data onto the two principal components with the highest variance.

Projection of the dataset onto the two leading principal components for the text representations based on terms, links, and compression.

Number of components needed to account for 90% of the variance, and the cumulative sum of the principal components' variance, for the successive text representations.
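A sketch of how the 90%-of-variance component count can be computed (X stands for any of the three feature matrices; scikit-learn's PCA is assumed):

```python
import numpy as np
from sklearn.decomposition import PCA

def components_for_variance(X, threshold=0.90):
    """Number of leading principal components whose cumulative
    explained variance reaches the given threshold."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    return int(np.searchsorted(cumulative, threshold) + 1)
```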

SVM classification
Classification may be used as a method of validating the text representations: the better the classifier's results, the better the representation. The information extracted by the different text representations can be estimated by comparing classifier errors in the various feature spaces.

Multiclass SVM classification with the one-versus-rest approach has been used, with two-fold cross-validation repeated 50 times for accurate averaging of the results (a sketch of the protocol follows below).

The raw representation based on complexity gives the best results.

Reducing the dimensionality by removing features related to only one article improves the results.

Introducing a cosine kernel improves the results considerably.
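A sketch of the evaluation protocol, assuming a feature matrix X and labels y. A one-vs-rest linear SVM on L2-normalized rows stands in for the original setup: on unit-length vectors the linear kernel equals the cosine kernel.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import normalize
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate(X, y, repeats=50):
    """One-vs-rest SVM with cosine kernel, 2-fold CV repeated `repeats` times."""
    Xn = normalize(X)   # row-normalize: dot product becomes cosine similarity
    clf = LinearSVC()   # one-vs-rest multiclass scheme by default
    scores = []
    for seed in range(repeats):
        cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=seed)
        scores.extend(cross_val_score(clf, Xn, y, cv=cv))
    return float(np.mean(scores))
```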

SVM and PCA reduction
Selecting the components that account for 90% of the variance has been used for dimensionality reduction.

This worsens the classification results for terms and links (too strong a reduction?).

PCA does not influence the complexity representation.

As in the previous results, introducing a cosine kernel improves classification.

For terms it is then even slightly better.

Summary

The complexity measure allowed a much more compact representation, as seen from the cumulative contribution of the principal components, and achieved the best accuracy in the PCA-reduced space with only 36 dimensions.

After applying the cosine kernel, the term-based representation is slightly more accurate.

Explicit representation of the kernel spaces and the use of a linear SVM classifier make it possible to find the reference documents important for a given category, as well as to identify the collocations and phrases that characterize each category.

Distance-type kernels improve the results and reduce the dimensionality of the terms and links representations.

There is also an improvement for the representation based on complexity, where distance-based similarity is a second-order transformation.

Future directions

Different methods of representation extract different information from texts; they show different aspects of the documents.

In the future we plan to combine the representations into a single joint representation.

We plan to introduce more background knowledge and capture some semantics.

WordNet can be used as a semantic space onto which the words of an article are mapped.

WordNet is a network of interconnected synsets – elementary atoms that carry meaning.

The mapping requires word-sense disambiguation techniques. It allows us to use activations of the WordNet semantic network and then calculate distances between them, which should give better semantic similarity measures (a small illustration follows below).

A large-scale classifier for the whole of Wikipedia.
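A small illustration of the WordNet mapping idea using NLTK's interface (choosing the right synset for a word is exactly the disambiguation problem mentioned above; path_similarity is one simple network-based measure):

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

# A word maps to several synsets; picking the correct one needs disambiguation
print(wn.synsets("tree")[:3])

plant = wn.synset("tree.n.01")  # the botanical sense of "tree"
oak = wn.synset("oak.n.02")     # oak in its tree sense (synset id assumed)

# Similarity derived from the synset network: 1 / (shortest path length + 1)
print(plant.path_similarity(oak))
```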

Thank you for your attention