stat2 25 09
STAT Requirement Analysis
2/24/09
New Goal: Avoiding “analysis paralysis”
Outline
• Use Case
• Flowchart and domain object definitions
• “The” Domain Model
• Sample Code
• Class Diagram
• Survey Result
• Revised project plan, tickets, etc.
Flowchart and Definitions
CorpusReader reads text from a source into Corpus. No processing is done and everything (label, metadata, etc.) stays in text format.
FeatureExtractor converts the text in Documents to features*
Annotator transforms the corpus into another corpus by adding annotations
Corpus is a set of Documents in text format.
Dataset is a set of Instances, which are feature representations of Document text.
Learner uses the dataset to learn a model
Model is a set of parameters* learned from the data by Learner
* Not modeled
Classifier uses the model to predict classes and produces a Classification
Classification contains predictions and information about them
ClassificationEvaluator computes the evaluation metrics for the Classification
ClassificationEvaluation contains evaluation metrics
[Flowchart stages: Text Data → Corpus → Dataset → Model → Classification → ClassificationEvaluation]
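The definitions above can be read as a pipeline. The following is a minimal sketch of that flow, not the STAT implementation: the method names (`read`, `extract`, `learn`, `classify`, `evaluate`), the bag-of-words features, the word-count "model", and the toy spam/ham data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Document:
    text: str
    label: str  # labels stay as plain text inside the Corpus


@dataclass
class Instance:
    features: Counter  # feature representation of one Document
    label: str


class CorpusReader:
    """Reads (label, text) pairs into a Corpus; no processing is done."""
    def read(self, source):
        return [Document(text=text, label=label) for label, text in source]


class FeatureExtractor:
    """Converts Document text to features (bag of words is an assumed choice)."""
    def extract(self, corpus):
        return [Instance(Counter(d.text.lower().split()), d.label) for d in corpus]


class Learner:
    """Learns a Model from a Dataset; here the model is per-class word counts."""
    def learn(self, dataset):
        model = defaultdict(Counter)
        for inst in dataset:
            model[inst.label].update(inst.features)
        return dict(model)


class Classifier:
    """Uses the Model to predict one class per Instance."""
    def __init__(self, model):
        self.model = model

    def classify(self, dataset):
        def score(label, inst):
            # Sum of learned word counts, weighted by occurrences in the instance.
            return sum(self.model[label][w] * n for w, n in inst.features.items())
        # sorted() makes tie-breaking between labels deterministic.
        return [max(sorted(self.model), key=lambda lab: score(lab, inst))
                for inst in dataset]


class ClassificationEvaluator:
    """Computes evaluation metrics (only accuracy in this sketch)."""
    def evaluate(self, classification, dataset):
        correct = sum(p == inst.label for p, inst in zip(classification, dataset))
        return {"accuracy": correct / len(dataset)}


# Hypothetical toy data, evaluated on the training set for brevity.
source = [("spam", "buy cheap pills now"), ("ham", "meeting notes for monday"),
          ("spam", "cheap pills cheap"), ("ham", "notes from the meeting")]
corpus = CorpusReader().read(source)
dataset = FeatureExtractor().extract(corpus)
model = Learner().learn(dataset)
classification = Classifier(model).classify(dataset)
evaluation = ClassificationEvaluator().evaluate(classification, dataset)
```

Each stage only consumes the object produced by the previous one, which is the property the flowchart is meant to convey.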
STAT Domain Model – v5
[Class diagram. Classes: CorpusReader, FeatureExtractor, Dataset, Learner, Model, Classifier, Classification, ClassificationEvaluator, ClassificationEvaluation, Document, Instance. Relationship labels: contains, processed-by, produces, learns-from, produced-by, used-by, evaluated-by, classified-by. Multiplicities are 1 throughout, apart from two 1-to-* associations.]
CorpusReader provides the protected variation for input sources (file, web, etc.)
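The protected-variation role of CorpusReader can be illustrated with a small sketch. It assumes a `read()` method and two hypothetical subclasses, `FileCorpusReader` and `InMemoryCorpusReader`; none of these names come from the slides, and a Corpus is modeled as a plain list of strings.

```python
import io
from abc import ABC, abstractmethod


class CorpusReader(ABC):
    """Stable interface: callers never see where the text came from."""

    @abstractmethod
    def read(self):
        """Return a Corpus, modeled here as a list of raw text documents."""


class FileCorpusReader(CorpusReader):
    """Hypothetical reader: one document per non-blank line of a file-like object."""

    def __init__(self, handle):
        self.handle = handle

    def read(self):
        return [line.rstrip("\n") for line in self.handle if line.strip()]


class InMemoryCorpusReader(CorpusReader):
    """Hypothetical reader for text already held in memory."""

    def __init__(self, texts):
        self.texts = texts

    def read(self):
        return list(self.texts)


def corpus_size(reader: CorpusReader) -> int:
    # Downstream code depends only on the CorpusReader interface.
    return len(reader.read())


# io.StringIO stands in for an open file so the sketch runs without disk access.
file_like = io.StringIO("first document\nsecond document\n")
from_file = FileCorpusReader(file_like).read()
from_memory = InMemoryCorpusReader(["first document", "second document"]).read()
```

A web-backed reader would be one more subclass; `corpus_size` and any other consumer would stay unchanged.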
Instances are the representations of documents needed for machine learning
Classification holds the predictions produced by Classifier
[Diagram relationship: evaluate-on, multiplicity 1 to 1]
ClassificationEvaluation holds the evaluation results calculated by ClassificationEvaluator, along with references to all the relevant objects needed to produce useful reports
[Diagram relationship: Annotator modifies Corpus, multiplicities 1 and 0..1]
Document could contain Labels, Annotations, MetaData, etc.
Annotator: any class that mutates the corpus, e.g., POSTagger
FeatureExtractor: any class that converts the document components to a feature vector. Also includes feature selection, aggregation, etc.
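As a rough illustration of the Annotator role, the sketch below maps a corpus to another corpus with annotations added. The `annotate` and `annotate_document` methods, the `annotations` dictionary on Document, and the `CapitalizedTokenAnnotator` class are hypothetical; a real POSTagger would plug into the same slot. This version returns new Documents rather than changing them in place, which is one possible reading of "transforms the corpus into another corpus".

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Document:
    text: str
    annotations: dict = field(default_factory=dict)


class Annotator:
    """Base class: maps a corpus to another corpus with extra annotations."""

    name = "annotation"  # key under which results are stored (assumed convention)

    def annotate_document(self, doc):
        raise NotImplementedError

    def annotate(self, corpus):
        # Build new Documents so the input corpus is left untouched.
        return [
            replace(doc, annotations={**doc.annotations,
                                      self.name: self.annotate_document(doc)})
            for doc in corpus
        ]


class CapitalizedTokenAnnotator(Annotator):
    """Toy stand-in for a tagger: records tokens that start with a capital."""

    name = "capitalized"

    def annotate_document(self, doc):
        return [tok for tok in doc.text.split() if tok[:1].isupper()]


corpus = [Document("Alice met Bob in Paris"), Document("no names here")]
annotated = CapitalizedTokenAnnotator().annotate(corpus)
```

Because the output is again a corpus, annotators can be chained before feature extraction.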
* Notably absent from this diagram are classes needed by Corpus (e.g., WordList, Split, etc.) or by machine learning components (e.g., DistanceMetric, ProbabilityDistribution, etc.)