stat2 25 09

STAT Requirement Analysis

2/24/09

New Goal: Avoiding “analysis paralysis”

Outline
• Use Case
• Flowchart and domain object definitions
• “The” Domain Model
• Sample Code
• Class Diagram
• Survey Result
• Revised project plan, tickets, etc.

Flowchart and Definitions

CorpusReader reads text from a source into a Corpus. No processing is done, and everything (label, metadata, etc.) stays in text format.

FeatureExtractor converts the text in Documents to features*

Annotator transforms the corpus into another corpus by adding annotations

Corpus is a set of Documents in text format.

Dataset is a set of Instances, which are feature representations of Document text.

Learner uses the dataset to learn a model

Model is a set of parameters* learned from the data by the Learner

* Not modeled

Classifier uses the model to predict classes and produces a Classification

Classification contains predictions and information about them

ClassificationEvaluator computes the evaluation metrics for the Classification

ClassificationEvaluation contains the evaluation metrics

[Flowchart: Text Data → CorpusReader → Corpus → FeatureExtractor → Dataset → Learner → Model → Classifier → Classification → ClassificationEvaluator → ClassificationEvaluation]
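The flow of objects defined above can be sketched in a few lines of Python. This is only an illustrative sketch, not STAT's actual API: the class names follow the definitions, but every method name (`read`, `extract`, `learn`, `classify`, `evaluate`), the tab-separated input format, the bag-of-words features, and the majority-class learner are assumptions chosen to keep the example short.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Document:
    text: str
    label: str  # stays a plain string, as in the Corpus definition


@dataclass
class Instance:
    features: dict  # token -> count (hypothetical feature representation)
    label: str


class CorpusReader:
    """Reads raw text into a Corpus (here: a list of Documents)."""

    def read(self, lines):
        # Hypothetical input format: "label<TAB>text", one document per line.
        corpus = []
        for line in lines:
            label, text = line.split("\t", 1)
            corpus.append(Document(text=text, label=label))
        return corpus


class FeatureExtractor:
    """Converts Document text to features, producing a Dataset."""

    def extract(self, corpus):
        return [Instance(dict(Counter(d.text.lower().split())), d.label) for d in corpus]


class Learner:
    """Learns a Model from a Dataset. Here the model is just the majority label."""

    def learn(self, dataset):
        counts = Counter(inst.label for inst in dataset)
        return {"majority_label": counts.most_common(1)[0][0]}


class Classifier:
    """Uses the Model to predict classes and produces a Classification."""

    def __init__(self, model):
        self.model = model

    def classify(self, dataset):
        return [(inst, self.model["majority_label"]) for inst in dataset]


class ClassificationEvaluator:
    """Computes evaluation metrics for a Classification."""

    def evaluate(self, classification):
        correct = sum(1 for inst, predicted in classification if inst.label == predicted)
        return {"accuracy": correct / len(classification)}


raw = ["spam\tbuy cheap pills", "ham\tlunch at noon", "spam\tcheap pills now"]
corpus = CorpusReader().read(raw)
dataset = FeatureExtractor().extract(corpus)
model = Learner().learn(dataset)
classification = Classifier(model).classify(dataset)
evaluation = ClassificationEvaluator().evaluate(classification)
print(evaluation)
```

Each stage consumes only the object produced by the previous one, so any single stage can be swapped out without changing the others.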

STAT Domain Model – v5

[Domain model diagram. Classes shown include Document and Instance; relationship labels: contains, processed-by, produces, learns-from, produced-by, used-by, evaluated-by, classified-by; multiplicities are 1 or *.]

CorpusReader provides the protected variation for input sources (file, web, etc.)
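The CorpusReader note describes a Protected Variations design: the rest of the system depends on one stable reader interface, so new input sources can be added without touching downstream code. A minimal Python sketch of that idea follows, assuming a hypothetical `read()` method and two hypothetical subclasses (`FileCorpusReader`, `StringCorpusReader`) that are not part of the original design.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class CorpusReader(ABC):
    """Stable interface; callers never see where the text comes from."""

    @abstractmethod
    def read(self):
        """Return the corpus as a list of raw document strings."""


class FileCorpusReader(CorpusReader):
    # Hypothetical reader: one document per non-empty line of a text file.
    def __init__(self, path):
        self.path = path

    def read(self):
        with open(self.path, encoding="utf-8") as handle:
            return [line.strip() for line in handle if line.strip()]


class StringCorpusReader(CorpusReader):
    # Hypothetical reader for text already in memory (e.g., fetched from the web).
    def __init__(self, text):
        self.text = text

    def read(self):
        return [line.strip() for line in self.text.splitlines() if line.strip()]


def count_documents(reader: CorpusReader) -> int:
    # Downstream code depends only on the CorpusReader interface.
    return len(reader.read())


# Write a small temporary file so the file-based reader has something to read.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first document\nsecond document\n")

file_count = count_documents(FileCorpusReader(tmp.name))
string_count = count_documents(StringCorpusReader("only one document"))
os.remove(tmp.name)
print(file_count, string_count)
```

Adding a web-based reader would mean writing one more subclass; `count_documents` and any other consumer would stay unchanged.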

Instances are the representations of documents needed for machine learning

Classification holds the predictions produced by the Classifier

[Diagram relation: evaluate-on, multiplicity 1 to 1]

ClassificationEvaluation holds the evaluation results calculated by ClassificationEvaluator, along with references to all relevant objects needed to produce useful reports
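To illustrate why the evaluation object keeps references alongside the metrics, here is a small Python sketch. The field names (`metrics`, `gold_labels`, `predictions`, `model_name`) and the `report()` method are hypothetical and only show one possible shape for such an object.

```python
from dataclasses import dataclass


@dataclass
class ClassificationEvaluation:
    # Metrics computed by the evaluator.
    metrics: dict
    # References kept so a report can explain the numbers (all hypothetical fields).
    gold_labels: list
    predictions: list
    model_name: str

    def report(self):
        lines = [f"model: {self.model_name}"]
        lines += [f"{name}: {value:.2f}" for name, value in sorted(self.metrics.items())]
        errors = [
            index
            for index, (gold, predicted) in enumerate(zip(self.gold_labels, self.predictions))
            if gold != predicted
        ]
        lines.append(f"misclassified instances: {errors}")
        return "\n".join(lines)


class ClassificationEvaluator:
    def evaluate(self, gold_labels, predictions, model_name):
        correct = sum(gold == predicted for gold, predicted in zip(gold_labels, predictions))
        metrics = {"accuracy": correct / len(gold_labels)}
        return ClassificationEvaluation(metrics, list(gold_labels), list(predictions), model_name)


evaluation = ClassificationEvaluator().evaluate(
    ["a", "b", "a", "b"], ["a", "b", "b", "b"], "toy-model"
)
print(evaluation.report())
```

Because the gold labels and predictions travel with the metrics, the report can list which instances were misclassified without going back to the Classifier.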

[Diagram relation: Annotator modifies Corpus; multiplicities 1 and 0..1]

Document could contain Labels, Annotations, MetaData, etc.
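One way to picture a Document that carries Labels, Annotations, and MetaData is a small record type. The sketch below is an assumption for illustration only: the field names and the `Annotation` span layout are hypothetical, and everything is kept as text to match the Corpus definition.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    # Hypothetical span annotation: character offsets plus a text tag.
    start: int
    end: int
    tag: str


@dataclass
class Document:
    text: str
    labels: list = field(default_factory=list)       # e.g., ["sports"]
    annotations: list = field(default_factory=list)  # Annotation objects
    metadata: dict = field(default_factory=dict)     # e.g., {"source": "file"}


@dataclass
class Corpus:
    documents: list = field(default_factory=list)


doc = Document(text="Dogs bark", labels=["animals"], metadata={"source": "example"})
doc.annotations.append(Annotation(start=0, end=4, tag="NNS"))
corpus = Corpus(documents=[doc])

# Recover the annotated span from the document text.
first = doc.annotations[0]
print(doc.text[first.start:first.end], first.tag)
```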

Annotator: any class that mutates the corpus, e.g., POSTagger.

FeatureExtractor: any class that converts the document components to a feature vector. Also includes feature selection, aggregation, etc.
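The two notes above separate classes that change the corpus from classes that turn documents into feature vectors. A rough Python sketch of that split follows. It is not the project's actual design: the method names `annotate` and `extract`, the toy `UppercaseTagger`, and the `BagOfWordsExtractor` with its fixed vocabulary are all hypothetical.

```python
from abc import ABC, abstractmethod


class Annotator(ABC):
    """Any class that mutates the corpus by adding annotations."""

    @abstractmethod
    def annotate(self, corpus):
        """Add annotations to each document in place and return the corpus."""


class FeatureExtractor(ABC):
    """Any class that converts document components to feature vectors."""

    @abstractmethod
    def extract(self, corpus):
        """Return one feature vector per document."""


class UppercaseTagger(Annotator):
    # Toy stand-in for something like a POS tagger: marks capitalized tokens.
    def annotate(self, corpus):
        for doc in corpus:
            doc["annotations"] = [
                (token, "CAP" if token[:1].isupper() else "LOW")
                for token in doc["text"].split()
            ]
        return corpus


class BagOfWordsExtractor(FeatureExtractor):
    # Counts occurrences of each vocabulary word; the vocabulary is fixed up front.
    def __init__(self, vocabulary):
        self.vocabulary = list(vocabulary)

    def extract(self, corpus):
        vectors = []
        for doc in corpus:
            tokens = doc["text"].lower().split()
            vectors.append([tokens.count(word) for word in self.vocabulary])
        return vectors


corpus = [{"text": "The cat saw the dog"}, {"text": "A dog barked"}]
corpus = UppercaseTagger().annotate(corpus)
vectors = BagOfWordsExtractor(["the", "dog", "cat"]).extract(corpus)
print(vectors)
```

The annotator returns a corpus (still text plus annotations), while the extractor leaves the corpus behind and returns numeric vectors, which mirrors the Corpus versus Dataset distinction in the definitions.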

* Notably absent from this diagram are classes needed by Corpus (e.g., WordList, Split, etc.) or by machine learning components (e.g., DistanceMetric, ProbabilityDistribution, etc.)
