Intelligent Database Systems Presenter : CHANG, SHIH-JIE Authors : ADNAN YAHYA and ALI SALHI 2014. ACM TALIP. Arabic Text Categorization Based on Arabic Wikipedia


Page 1: Presenter   : CHANG, SHIH-JIE  Authors     :  ADNAN YAHYA and ALI SALHI  2014. ACM  TALIP

Intelligent Database Systems Lab

Presenter : CHANG, SHIH-JIE

Authors : ADNAN YAHYA and ALI SALHI

2014. ACM TALIP.

Arabic Text Categorization Based on Arabic Wikipedia

Page 2:

Outlines

• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments

Page 3:

Motivation

Arabic text categorization is challenging because certain subcategories are correlated and main categories overlap.

EX:

Page 4:

Objectives

• To address this, we use a categorization algorithm and further adopt two approaches.

Page 5:

CATEGORIZATION CORPORA - Training Data

Related Tags Approach


Page 7:

Testing Data

10 categories with 40 documents in each category

Page 8:

Methodology - PREPROCESSING TECHNIQUES

• Root Extraction (RE)
• Light Stemming (LS)
• Special Expressions Extraction
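To make the light-stemming step more concrete, here is a minimal sketch of affix stripping. It is not the stemmer used by the authors: the affix lists, the `MIN_STEM` length guard, and the function name `light_stem` are all assumptions chosen for illustration.

```python
# Illustrative affix lists (assumed, not taken from the slides).
# Longer prefixes come first so that "وال" is tried before "ال" or "و".
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")
MIN_STEM = 3  # hypothetical guard: never shrink a word below 3 letters


def light_stem(word: str) -> str:
    """Strip at most one prefix and one suffix from an Arabic word."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[:-len(suffix)]
            break
    return word
```

Under these assumed lists, `light_stem("الكتاب")` drops the definite article and returns `"كتاب"`. Root extraction goes further and would need a morphological analyzer, which this sketch does not attempt.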

Page 9:

Methodology - CATEGORIZATION PROCESS

Categorize the input text in two phases:

Phase one: we categorize the text into one of the main categories.

Phase two: we further categorize the input text based on subcategories.
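The two phases can be sketched as a pair of nested selections. The scoring function below (a count of input tokens found in a category's word list) is a placeholder assumption, not the scoring used in the paper, and the `taxonomy` layout and all names are hypothetical.

```python
def score(text_words, category_words):
    """Placeholder score: number of input tokens found in the category's word list."""
    vocabulary = set(category_words)
    return sum(1 for word in text_words if word in vocabulary)


def categorize_two_phase(text_words, taxonomy):
    """Pick a main category first, then a subcategory inside it.

    taxonomy maps main category -> {"words": [...], "subs": {sub: [...]}}.
    """
    # Phase one: choose the best-scoring main category.
    main = max(taxonomy, key=lambda m: score(text_words, taxonomy[m]["words"]))
    # Phase two: choose the best-scoring subcategory of that main category.
    subs = taxonomy[main]["subs"]
    sub = max(subs, key=lambda s: score(text_words, subs[s]))
    return main, sub


# Toy data, invented for the example.
taxonomy = {
    "sports": {
        "words": ["ball", "team", "goal", "match"],
        "subs": {"football": ["goal", "ball"], "tennis": ["racket", "serve"]},
    },
    "science": {
        "words": ["atom", "cell", "energy"],
        "subs": {"physics": ["atom", "energy"], "biology": ["cell"]},
    },
}
result = categorize_two_phase(["team", "goal", "ball"], taxonomy)
```

Because phase two only looks inside the chosen main category, a mistake in phase one cannot be repaired later in this simple form.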


Page 11:

Methodology - Basic Categorization Algorithm (BCA)

Page 12:

Methodology - Percentage and Difference Categorization (PDC) Algorithm

has frequency 7 in the 300-word

Page 13:

Methodology - Percentage and Difference Categorization (PDC) Algorithm

The category with the highest sum of flag values is considered to be the best match for the input text.
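As a rough illustration of this selection step, the sketch below sums per-word flag values for each category and picks the largest total. The slides do not show how flags are assigned, so the `flag` rule (comparing a word's percentage in the input text with its percentage in the category), the `tolerance` parameter, and all function names are assumptions.

```python
def percentage(frequency, total_words):
    """Share of the text taken by one word, in percent."""
    return 100.0 * frequency / total_words


def flag(text_pct, category_pct, tolerance=0.5):
    """Hypothetical flag rule: the closer the two percentages, the higher the flag."""
    difference = abs(text_pct - category_pct)
    if difference <= tolerance:
        return 1.0
    if difference <= 2 * tolerance:
        return 0.5
    return 0.0


def best_category(text_pcts, category_pcts):
    """Return the category with the highest sum of flag values, plus all totals."""
    totals = {}
    for category, pcts in category_pcts.items():
        # Words absent from the category contribute no flag.
        totals[category] = sum(
            flag(pct, pcts[word]) for word, pct in text_pcts.items() if word in pcts
        )
    return max(totals, key=totals.get), totals


# Toy percentages, invented for the example.
text_pcts = {"goal": 2.0, "team": 1.5}
category_pcts = {
    "sports": {"goal": 2.2, "team": 1.2},
    "science": {"atom": 3.0, "team": 4.0},
}
winner, totals = best_category(text_pcts, category_pcts)
```

For scale, a word with frequency 7 in a 300-word text has `percentage(7, 300)`, roughly 2.33 percent.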

Page 14:

Methodology – PDC Algorithm vs. BCA Algorithm

Page 15:

Methodology – Enhancing Main/Subcategories Grouping

(1) Overlapping Main Categories for Phase Two

Problem: subcategories of different main categories may be highly correlated.

Page 16:

Methodology – Enhancing Main/Subcategories Grouping

(2) Replacing Main Categories by Groups of Related Categories

Page 17:

Methodology – Enhancing Main/Subcategories Grouping

Page 18:

Methodology - Word Filtration Techniques within Categories

Page 19:

Methodology - The result of applying the three techniques

Page 20:

Modified PDC with N Scales

Define a scaling of flag values:

1, 0.5, 0

1, 0.75, 0.5, 0.25, 0
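Both scales on this slide are evenly spaced between 1 and 0, so one plausible reading is that an N-scale variant uses N evenly spaced flag values. The helper below generates such a scale; the name `flag_scale` and the even-spacing rule for other values of N are assumptions.

```python
def flag_scale(n):
    """Return n evenly spaced flag values from 1 down to 0."""
    if n < 2:
        raise ValueError("a scale needs at least two values")
    return [1 - i / (n - 1) for i in range(n)]


three_scale = flag_scale(3)  # [1.0, 0.5, 0.0]
five_scale = flag_scale(5)   # [1.0, 0.75, 0.5, 0.25, 0.0]
```

A finer scale lets partially matching words contribute intermediate amounts instead of being rounded to 0, 0.5, or 1.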

Page 21:

Further Testing on the PDC Algorithm

• Tool: Root Extraction
• Tool: Light Stemming & Light10
• Tool: Double Words
• Tool: Expressions Extraction

Page 22:

Using Testing Data from the Reference Categories

Page 23:

Training Data Characteristics

Page 24:

COMPARISON WITH RELATED WORK

Page 25:

Using Testing Data from the Reference Categories

Page 26:

Conclusions

– Training and testing data can be taken from the same source by splitting the corpus into training and test components. This consistently gives better results.

– However, we believe that the second method (testing data from a different source) makes more sense, as the tests will be more credible and indicative of performance in real-life environments.
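For the first method, a same-source split might look like the sketch below. The `test_fraction` and `seed` values and the function name are hypothetical; the slides do not state how the corpus was split.

```python
import random


def split_corpus(documents, test_fraction=0.25, seed=0):
    """Shuffle one corpus and split it into training and test components."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)  # seeded so the split is reproducible
    n_test = int(len(docs) * test_fraction)
    return docs[n_test:], docs[:n_test]


# Toy corpus of 100 document ids, invented for the example.
train_docs, test_docs = split_corpus(range(100))
```

The second method needs no split at all: the whole corpus is used for training, and the test documents are collected from a separate source.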

Page 27:

Comments

• Advantages

– To.

• Applications

– Arabic text categorization.