Project 1: Machine Learning Using Neural Networks, Ver 1.1


Upload: josephine-newman

Post on 03-Jan-2016


Page 1: Project 1: Machine Learning Using Neural Networks Ver 1.1

Project 1: Machine Learning Using Neural Networks

Ver 1.1


(C) 2006, SNU Biointelligence Laboratory

Outline

Classification using ANN
  Learn and classify text documents
  Estimate several statistics on the dataset


Network Structure

[Figure: network diagram with an Input layer and three output nodes labeled Class 1, Class 2, and Class 3]
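As a rough illustration of the structure in the figure, the following sketch runs one forward pass through a network with a single hidden layer and three output units. The slides do not state the hidden-layer size, the activation functions, or the weight initialization, so the 10 hidden units, the sigmoid and softmax functions, and all function names here are assumptions. The 100 inputs match the 100 selected terms mentioned later in the slides.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Return (weights, biases); weights[j][i] connects input i to unit j."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(x, hidden, output):
    """Input -> sigmoid hidden layer -> softmax over the three classes."""
    w_h, b_h = hidden
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w_h, b_h)]
    w_o, b_o = output
    scores = [sum(w * hi for w, hi in zip(row, h)) + b for row, b in zip(w_o, b_o)]
    return softmax(scores)

rng = random.Random(0)
n_inputs, n_hidden, n_classes = 100, 10, 3   # n_hidden is an assumed value
hidden = init_layer(n_inputs, n_hidden, rng)
output = init_layer(n_hidden, n_classes, rng)

x = [rng.randint(0, 3) for _ in range(n_inputs)]   # placeholder term counts
probs = forward(x, hidden, output)
predicted_class = probs.index(max(probs)) + 1      # 1, 2, or 3
print(predicted_class, probs)
```

The weights are untrained, so the prediction is meaningless; the sketch only shows how an input vector maps to three class outputs.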


CLASSIC3 Dataset


CLASSIC3

Three categories: 3,891 documents
  CISI: 1,460 document abstracts on information retrieval from the Institute of Scientific Information.
  CRAN: 1,398 document abstracts on aeronautics from the Cranfield Institute of Technology.
  MED: 1,033 biomedical abstracts from MEDLINE.


Text Presentation in Vector Space

[Figure: a document collection is converted into term vectors. After stemming, stop-word elimination, and feature selection, each document d1, d2, d3, ..., dn becomes a vector of term counts (Bag-of-Words representation, e.g. 1 0 1 0 0 0 0 2) over terms such as baseball, specs, graphics, hockey, unix, and space. Together the vectors form the term-document matrix (VSM representation), which is the dataset format.]
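The bag-of-words step in the figure can be sketched as follows. This is a minimal illustration, not the preprocessing used to build the project dataset: the stop-word list and the example documents are made up, and stemming is left out to keep the sketch short.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "of", "on", "and", "in", "is"}

def tokenize(text):
    """Lowercase, split on whitespace, drop non-alphabetic tokens and stop words."""
    return [w for w in text.lower().split() if w.isalpha() and w not in STOP_WORDS]

def build_term_document_matrix(docs):
    """Return (vocabulary, matrix) where matrix[d][t] is the count of term t in document d."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for t in vocab] for c in counts]
    return vocab, matrix

docs = [
    "hockey and baseball and hockey",
    "unix graphics specs",
    "space graphics on unix",
]
vocab, matrix = build_term_document_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row is one document vector; the column order is fixed by the sorted vocabulary.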


Dimensionality Reduction

Feature selection:
  term (or feature) vectors -> scoring measure (on individual features) -> individual feature scores -> sort by score -> choose terms with higher values -> ML algorithm

Term weighting:
  TF or TF x IDF -> documents in vector space
  TF: term frequency
  IDF: inverse document frequency

$\mathrm{IDF}(w_i) = \log(N / n_i)$
  N: number of documents
  n_i: number of documents that contain the i-th word
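A small sketch of TF x IDF weighting using the IDF definition above. The slides do not specify the base of the logarithm, so the natural logarithm is an assumption, as are the function names and the toy count matrix.

```python
import math

def idf_weights(matrix):
    """IDF(w_i) = log(N / n_i), with N documents and n_i documents containing term i."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    idf = []
    for i in range(n_terms):
        n_i = sum(1 for row in matrix if row[i] > 0)
        # A term that appears in no document gets weight 0 to avoid dividing by zero.
        idf.append(math.log(n_docs / n_i) if n_i > 0 else 0.0)
    return idf

def tf_idf(matrix):
    """Multiply each raw term frequency by the IDF weight of its term."""
    idf = idf_weights(matrix)
    return [[tf * w for tf, w in zip(row, idf)] for row in matrix]

# Toy matrix: 3 documents (rows) x 3 terms (columns) of raw counts.
counts = [
    [1, 0, 2],
    [0, 3, 1],
    [0, 0, 1],
]
weighted = tf_idf(counts)
for row in weighted:
    print(row)
```

The third term occurs in every document, so its IDF is log(3/3) = 0 and it contributes nothing after weighting.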


Construction of Document Vectors

Controlled vocabulary
  Stopwords are removed.
  Stemming is used.
  Words whose document frequency is less than 5 are removed.
  Term size: 3,850
A document is represented as a 3,850-dimensional vector whose elements are word frequencies.
Words are sorted according to their information gain.
  Top 100 terms are selected.
  3,830 (examples) x 100 (terms) matrix
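The selection of top terms by information gain can be sketched as below. The slides do not say how information gain was computed, so this assumes the common formulation that treats each term as a binary feature (present or absent in a document). The function names and the toy data are illustrative only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(column, labels):
    """IG of a term, treating it as present (count > 0) or absent in each document."""
    present = [y for v, y in zip(column, labels) if v > 0]
    absent = [y for v, y in zip(column, labels) if v == 0]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional

def select_top_terms(matrix, labels, k):
    """Return the indices of the k highest-scoring terms and the reduced matrix."""
    n_terms = len(matrix[0])
    scores = [information_gain([row[i] for row in matrix], labels) for i in range(n_terms)]
    ranked = sorted(range(n_terms), key=lambda i: scores[i], reverse=True)
    keep = ranked[:k]
    reduced = [[row[i] for i in keep] for row in matrix]
    return keep, reduced

# Toy data: 4 documents x 3 terms, two classes.
matrix = [
    [2, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 3],
]
labels = ["CISI", "CISI", "MED", "MED"]
keep, reduced = select_top_terms(matrix, labels, k=2)
print(keep)
print(reduced)
```

With the project data, `k` would be 100 and the reduced matrix would have one row per example.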


Experimental Results


Data Setting for the Experiments

Basically, training and test sets are given.
  Training: 2,683 examples
  Test: 1,147 examples
N-fold cross-validation (optional)
  The dataset is divided into N subsets.
  The holdout method is repeated N times.
  Each time, one of the N subsets is used as the test set and the other (N-1) subsets are put together to form a training set.
  The average performance across all N trials is computed.
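The N-fold procedure described above can be sketched as follows. The shuffling, the seed, and the `dummy_evaluate` function are assumptions for illustration; a real `evaluate` would train the network on the training indices and return its accuracy on the test indices.

```python
import random

def n_fold_splits(n_examples, n_folds, seed=0):
    """Yield (train_indices, test_indices) for each of the N folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into N subsets of nearly equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

def cross_validate(n_examples, n_folds, evaluate):
    """Average the score returned by evaluate(train_idx, test_idx) over all folds."""
    scores = [evaluate(train, test) for train, test in n_fold_splits(n_examples, n_folds)]
    return sum(scores) / len(scores)

def dummy_evaluate(train_idx, test_idx):
    # Hypothetical stand-in: returns the test fraction instead of a real accuracy.
    return len(test_idx) / (len(train_idx) + len(test_idx))

average = cross_validate(3830, 5, dummy_evaluate)
print(average)
```

Every example appears in exactly one test set, so each one is used for testing once and for training N-1 times.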


Number of Epochs


Number of Hidden Units

Minimum 10 runs for each setting.

                  Train                              Test
# Hidden Units    Average   SD   Best   Worst       Average   SD   Best   Worst
Setting 1
Setting 2
Setting 3
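One way to fill a row of the table is to repeat the experiment 10 times per setting and summarize the accuracies. In this sketch `run_once` is a hypothetical stand-in that returns a random placeholder value, not a real result, and the hidden-unit counts are made up because the slides leave Settings 1 to 3 open. The sample standard deviation (dividing by n-1) is also an assumption.

```python
import random
import statistics

def run_once(n_hidden, seed):
    """Hypothetical stand-in for one training run; returns a placeholder accuracy."""
    rng = random.Random(seed * 1000 + n_hidden)
    return rng.uniform(0.0, 1.0)

def summarize(values):
    """Average, sample standard deviation, best, and worst over repeated runs."""
    return {
        "average": statistics.mean(values),
        "sd": statistics.stdev(values),
        "best": max(values),
        "worst": min(values),
    }

settings = [5, 10, 20]   # assumed hidden-unit counts for Settings 1-3
n_runs = 10              # the slides ask for at least 10 runs per setting

results = {}
for n_hidden in settings:
    accuracies = [run_once(n_hidden, seed) for seed in range(n_runs)]
    results[n_hidden] = summarize(accuracies)

for n_hidden, s in results.items():
    print(f"{n_hidden:>3}  {s['average']:.3f}  {s['sd']:.3f}  {s['best']:.3f}  {s['worst']:.3f}")
```

The same summary would be computed separately for training and test accuracy to fill both halves of the table.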


Other Methods/Parameters

Normalization method for input vectors
Class decision policy
Learning rates
…
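As one possible combination of the first two items, the sketch below normalizes an input vector to unit length and picks the class with the largest network output. The slides do not prescribe either choice; L2 normalization, the argmax policy, and the order of the class names are assumptions made for illustration.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

def decide_class(outputs, class_names):
    """Winner-take-all policy: return the class whose output unit is largest."""
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    return class_names[best]

x = l2_normalize([3, 0, 4])
label = decide_class([0.2, 0.7, 0.1], ["CISI", "CRAN", "MED"])
print(x, label)
```

Other policies, such as rejecting a document when no output exceeds a threshold, could be compared in the same way.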


ANN Sources

Source codes
  Free software: Weka
  NN libraries (C, C++, Java, …)
  MATLAB toolbox
Web sites
  http://www.cs.waikato.ac.nz/~ml/weka/
  http://www.faqs.org/faqs/ai-faq/neural-nets/part5/


Submission

Due date: April 18 (Tue)
Both hardcopy and email
  Software used and running environments
  Experimental results with various parameter settings
  Analysis and explanation of the results in your own way
FYI, it is not important to achieve the best performance.