
STAT Technical Report Version 0.1

by the Stat Team

Mehrbod Sharifi

Jing Yang

The Stat Project, guided by

Professor Eric Nyberg and Anthony Tomasic

March 5, 2009


Chapter 1

Introduction to STAT

In this chapter, we give a brief introduction to the Stat project. We explain the background, the motivation, the scope, and the stakeholders of this project so that readers can understand why we are doing it, what we are going to do, and who may be interested in our project.

1.1 Overview

Stat is an open source machine learning framework in Java for text analysis, with a focus on semi-supervised learning algorithms. Its main goal is to facilitate common textual data analysis tasks for researchers and engineers, so that they can get their work done straightforwardly and efficiently.

Applying machine learning approaches to extract information and uncover patterns from textual data has become extremely popular in recent years. Accordingly, many software packages have been developed to let people use machine learning for text analytics and automate the process. Users, however, find many of these existing packages difficult to use, even when they just want to carry out a simple experiment; they have to spend much time learning the software, and may finally find out that they still need to write their own programs to preprocess data before their target package will accept it.

We have noticed this situation and observe that much of it can be simplified. A new software framework should be developed to ease the process of doing text analytics; we believe researchers and engineers using our framework for textual data analysis will find the process convenient, comfortable, and perhaps even enjoyable.

1.2 Purpose

Existing software for applying machine learning to linguistic analysis has tremendously helped researchers and engineers make new discoveries based on textual data, which is unarguably one of the most common forms of data in the real world.

As a result, many more researchers, engineers, and students are increasingly interested in using machine learning approaches in their text analytics. However, the bar for entering this area is not low. These people, some of them even experienced users, find that existing software packages are generally neither easy to learn nor convenient to use.


For example, although Weka has a comprehensive suite of machine learning algorithms, it is not designed for text analysis and lacks natural built-in support for representing and processing linguistic concepts. MinorThird, on the other hand, though designed specifically as a package for text analysis, turns out to be rather complicated and difficult to learn. It also does not support semi-supervised or unsupervised learning, which are becoming increasingly important machine learning approaches.

Another problem with many existing packages is that they often adopt their own specific input and output formats. Real-world textual data, however, are generally in other formats that are not readily understood by those packages. Researchers and engineers who want to make use of those packages often find themselves spending much time seeking or writing ad hoc format conversion code. This ad hoc code, which could have been reusable, is often written over and over again by different users.

Researchers and engineers, when presented with common text analysis tasks, usually want a text-specific, lightweight, reusable, understandable, and easy-to-learn package that helps them get their work done efficiently and straightforwardly. Stat is designed to meet these requirements. Motivated by the needs of users who want to simplify their work and experiments on textual data learning, we initiated the Stat project, dedicated to providing suitable toolkits that facilitate their analytics tasks on textual data.

In a nutshell, Stat is an open source framework aimed at providing researchers and engineers with an integrated set of simplified, reusable, and convenient toolkits for textual data analysis. Based on this framework, researchers can carry out machine learning experiments on textual data conveniently and comfortably, and engineers can build their own small text analytics applications straightforwardly and efficiently.

1.3 Scope

This project involves developing a simplified and reusable framework (a collection of foundation classes) in Java that provides basic and common capabilities for people to easily perform machine learning analysis on various kinds of textual data.

The previous section may give the impression of an impossible task. In this section, we state clearly what is and is not included in this project.

The main deliverable of this project is a set of specifications defining a simplified framework for text analysis based on NLP and machine learning. We explain how succinctly the framework can be used and how easily it can be extended.

We also provide introductory implementations of the framework, including tools and packages serving as its foundation classes. They are:

• Dataset and framework object adaptors: A set of classes that allow reading and writing files in various formats, supporting importing and exporting datasets as well as loading and saving framework objects.

• Linguistic and machine learning package wrappers: A set of classes that integrate existing tools for NLP and machine learning so they can be used within the framework. These wrappers hide the implementation and variation details of those packages to provide a set of simplified and unified interfaces to framework users.


• Semi-supervised algorithms: Implementations of certain semi-supervised learning algorithms that are not available in existing packages.
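The adaptor layer above can be pictured as a small Java interface with one implementation per format. The sketch below is purely illustrative: the names CorpusAdaptor and LineCorpusAdaptor are hypothetical, not part of the actual STAT API. It reads a plain-text corpus with one document per line:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical names for illustration only; this is not the actual STAT API.
interface CorpusAdaptor {
    List<String> readDocuments(Path source) throws IOException;
}

// A trivial adaptor: one document per line of a plain-text file.
class LineCorpusAdaptor implements CorpusAdaptor {
    @Override
    public List<String> readDocuments(Path source) throws IOException {
        return new ArrayList<>(Files.readAllLines(source));
    }
}

public class AdaptorSketch {
    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("corpus", ".txt");
        Files.write(tmp, List.of("first document", "second document"));
        CorpusAdaptor adaptor = new LineCorpusAdaptor();
        System.out.println(adaptor.readDocuments(tmp).size()); // prints 2
    }
}
```

Supporting a new format (CSV, XML, arff, etc.) would then mean writing one more implementation of the same interface, rather than new ad hoc conversion code in every experiment.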

Finally, we provide a set of documents that reflect our development process and give users guidance on how to use our framework. These documents are:

• Technical report: A report summarizing our major artifacts and documenting the vision, goals, motivation, and decisions. It also includes the results of the requirements and design phases and of the final evaluation. This report gives an overall understanding to anyone who wants to know our package, as well as our software development process, well.

• Executive summary: A summary that gives a brief introduction to users of our framework, explaining the benefits that this framework can bring to their text analysis.

• JavaDocs, tutorials, and examples: JavaDoc API specifications extracted from comments in the code. Tutorials and concrete examples are also provided to ease the process of learning the framework.


1.4 Stakeholders

Below is the list of stakeholders and how this project will affect them:

• Researchers, particularly in language technology but also in other fields, will be able to save time by focusing on their experiments instead of dealing with the various input/output formats that text processing routinely requires. They can also easily switch among the various tools available and even contribute to STAT so that others can save time by using their adaptors and algorithms.

• Software engineers who are not familiar with machine learning can start using the package in their programs after a very short learning phase. STAT can help them develop clear concepts of machine learning quickly. They can easily build applications using the functionality provided by STAT and achieve a high level of performance.

• Developers of learning packages can provide plug-ins for STAT to allow easy integration of their packages. They can also delegate some of their interoperability needs to this program (some of which may be more time-consuming to address within their own packages).

• Beginners in text processing and mining, who want fundamental and easy-to-learn capabilities for discovering patterns in text. They will benefit from this project, which saves them time, facilitates their learning process, and sparks their interest in the area of language technology.


Chapter 2

Survey Analysis

This project faced many challenges from the beginning. There are many questions, some of a subjective nature, that really need to be answered by our target audience. For this reason, we designed a survey to obtain a better understanding and provide a more suitable solution to this problem. In this chapter, we explain the process of designing the survey, collecting responses, and analyzing the collected data.

2.1 Designing the Survey

The primary goals of the survey were the following:

• Understanding the potential users of the package: their programming habits, problem-solving strategies, experience with various areas and tools, etc.

• Setting priorities for which criteria to focus on in our design and implementation

The survey needed to be short, and the questions very specific, to get good responses. The maximum number of questions was set at 10. Several drafts of the questions were reviewed within the STAT group and by the software engineering class students and instructors until finalized. We also obtained and incorporated advice from other departments. The final survey was built on SurveyMonkey.com.

2.2 Distribution

The target users of STAT form two main groups with different needs: researchers and industry programmers. The survey contains questions to distinguish these two groups, but the final framework should address the needs of both. After conducting a test run with the STAT group and the class, we sent the survey out to the Language Technologies Institute student mailing list (representing researchers) and also to students in iLab (Prof. Ramayya Krishnan, Heinz School of Business), representing industry programmers.

2.3 Analysis of Results

As of 2/25/09, we had received 23 responses, which were reviewed by STAT members both individually and in aggregate. Below we summarize the findings of the survey, with some charts:


• While many different programming languages are used (Python, R, C++), over 90% of respondents mentioned Java as one of their languages (25% consider themselves experts). Reported programming hours range from 2 to 60.

• Users do not seem to distinguish much between industry and research applications; perhaps more research is needed for the difference to become apparent.

• Most users are not familiar with Operations Research, but everyone is at least somewhat familiar with Machine Learning (if not specifically with text classification or data mining).

• As expected, data types were mostly textual (plain text, XML, HTML, etc., as opposed to Excel, though it was mentioned), and sources were files, databases, and the web.

• Over 50% chose: “I write a program to preprocess data and then use an external machine learning package.”

• Ease of API use, performance, and extensibility were the top three design choices, but in addition, in the free-text responses users mostly pointed out problems with input and output formats.

Figure 2.1: Distribution of familiarity with packages


Figure 2.2: Distribution of design preference


Chapter 3

Analysis of Related Packages

In this chapter, we analyze a few main competitors of our project. We focus on two academic toolkits, Weka and MinorThird. We comment on their strengths, explore their limitations, and discuss why and how we can do better than these competitors.

3.1 Weka

Weka is a comprehensive collection of machine learning algorithms for solving data mining problems, written in Java and open sourced under the GPL.

3.1.1 Strengths of Weka

Weka is very popular machine learning software, owing to its main strengths:

• Provides comprehensive machine learning algorithms. Weka supports most current machine learning approaches for classification, clustering, regression, and association rules.

• Covers most aspects of a full data mining process. In addition to learning, Weka supports common data preprocessing methods, feature selection, and visualization.

• Freely available. Weka is open source, released under the GNU General Public License.

• Cross-platform. Weka is fully implemented in Java and is therefore cross-platform.

Because of its comprehensive support for machine learning algorithms, Weka is often used for analytics on many forms of data, including textual data.

3.1.2 Limitations of using Weka for text analysis

However, Weka is not designed specifically for textual data analysis. The most critical drawback of using Weka to process text is that Weka does not provide “built-in” constructs for the natural representation of linguistic concepts.¹ Users interested in using Weka for text analysis often find themselves needing to write ad hoc programs for text preprocessing and conversion to Weka's representation.

• Not good at understanding various text formats. Weka is good at understanding its standard .arff format, which is, however, not a convenient way of representing text. Users have to worry about how to convert textual data in various original formats, such as raw plain text, XML, HTML, CSV, Excel, PDF, MS Word, Open Office documents, etc., into something Weka can understand. As a result, they need to spend time seeking or writing external tools to complete this task before performing their actual analysis.

¹ Though there are classes in Weka supporting basic natural language processing, they are viewed as auxiliary utilities. They make basic textual data processing with Weka possible, but neither convenient nor straightforward.

• Unnecessary data type conversion. Weka is superior at processing nominal (i.e., categorical) and numeric attributes, but not string attributes. In Weka, non-numeric attributes are by default imported as nominal attributes, which is usually not a desirable type for text (imagine treating different chunks of text as different values of a categorical attribute). One has to use filters explicitly to perform a conversion that could have been done automatically if Weka knew the input was text.

• Lack of specialized support for linguistic preprocessing. Linguistic preprocessing is a very important aspect of textual data analysis but is not a concern of Weka, which does not (or at least is not dedicated to) take care of this issue for users. Weka has a StringToWordVector class that performs all-in-one basic linguistic preprocessing, including tokenization, stemming, stopword removal, tf-idf transformation, etc. However, it is not very flexible and lacks other techniques (such as part-of-speech tagging and n-gram processing) for users who want fine-grained and advanced linguistic control.

• Unnatural representation of textual data learning concepts. Weka is designed for general-purpose machine learning tasks and therefore has to accommodate many variations. As a result, domain concepts in Weka are abstract and high-level, the package hierarchy is deep, and the number of classes explodes. For example, one has to use Instance rather than Document and Instances rather than Corpus. Concepts such as Attribute are obscure in meaning for text processing. Having to add many Attribute objects to a cryptic FastVector, which is then passed to an Instances object in order to construct a dataset, appears very awkward to users processing text. Categorizing filters first by attribute/instance and then by supervised/unsupervised confuses non-expert users and makes it hard to find the right filter. Many users may feel uncomfortable using Weka programmatically to carry out their text-related experiments.
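To make concrete the kind of preprocessing that StringToWordVector bundles together, here is a minimal plain-Java sketch of the tokenization, stopword-removal, and term-counting steps. This is an illustration only; it uses no Weka code, and the tiny stopword list is made up for the example:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class PreprocessSketch {
    // A tiny illustrative stopword list; a real one would be much larger.
    static final Set<String> STOPWORDS = Set.of("the", "a", "of", "is");

    // Lowercase, split on non-letters, drop stopwords, count term frequencies.
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String tok : text.toLowerCase().split("[^a-z]+")) {
            if (tok.isEmpty() || STOPWORDS.contains(tok)) continue;
            tf.merge(tok, 1, Integer::sum);
        }
        return tf;
    }

    public static void main(String[] args) {
        // prints {cat=2, sat=1, on=1, mat=1, slept=1}
        System.out.println(termFrequencies("The cat sat on the mat, the cat slept."));
    }
}
```

A text-oriented framework can expose each of these steps (tokenizer, stopword filter, stemmer) as a separately pluggable component instead of one all-in-one filter.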

In summary, users who want an enjoyable experience performing text analysis need built-in capabilities that naturally support representing and processing text. They need specialized and convenient tools that help them finish the most common text analysis tasks straightforwardly and efficiently. Despite its comprehensive tools, Weka cannot provide this, due to its general-purpose nature.

Figure 3.1 shows the domain model we extracted from Weka for basic text analysis.


[Figure 3.1 presents a preliminary partial UML domain model of Weka, relating Attribute (with possibleValues), Instance (with attributeValues), Instances, StringToWordVector, NominalToString, Classifier, and Evaluation. Note: when ClassA “contains” a number of ClassB, Weka probably implements it as ClassA maintaining a FastVector whose elements are instances of ClassB.]

Figure 3.1: Partial domain model for Weka for basic text analysis

3.2 MinorThird

Figure 3.2 shows the domain model we extracted from MinorThird for basic text analysis.


Figure 3.2: Partial domain model for MinorThird for basic text analysis


Chapter 4

Requirements Specifications

Here we first explain in detail the major features of our framework.

• Simplified. APIs are clear, consistent, and straightforward. Users with reasonable Java programming knowledge can learn our package without much effort, understand its logical flow quickly, get started within a small amount of time, and finish the most common tasks with a few lines of code. Since our framework is not designed for general purposes or to include comprehensive features, there is room for us to simplify the APIs and optimize for the most typical and frequent operations.

• Extensible and reusable. Built-in modular support is provided for the core routines across the phases of text analysis, including text format transformation, linguistic processing, machine learning, and experimental evaluation. Additional functionality can easily be built on top of the core framework, and user-defined specifications are pluggable. Existing code can be used across environments and can interoperate with related external packages such as Weka, MinorThird, and OpenNLP.

• High performance. The speed of the algorithms we wrap and implement should be acceptable for typical experiments and datasets. Specifically, there should be no significant degradation in performance when using capabilities provided by external packages through our framework, and the performance of the algorithms we implement should not degrade much from their best known complexity because of implementation flaws.

4.1 Functional Requirements

In this section, we define the most common use cases of our framework and address them at the level of detail of casual use cases. The “functional requirement” of this project is that users can use the libraries provided by our framework to complete these use cases more easily and comfortably than without it.

Actors

Since our framework assumes that all users of interest program against our APIs, there is only one human actor role, namely the programmer. This human actor is always the primary actor. There are some possible secondary and system actors, namely the external packages our framework integrates, depending on which specific use case the primary actor is performing.


Fully-dressed Use Cases

Use Case UC1: Document Classification Experiment

Scope: Text analysis application using STAT framework

Level: User goal

Primary Actor: Researcher

Stakeholders and Interests:

• Researcher: wants to test and evaluate a classification algorithm (supervised, semi-supervised, or unsupervised) by applying it to a (probably well-known) corpus; the task needs to be done efficiently with easy and straightforward coding

Preconditions:

• STAT framework is correctly installed and configured

• The corpus is placed in a source readable by the STAT framework

Postconditions:

• A model is trained and the test documents in the corpus are classified. Evaluation results are displayed

Main Success Scenario:

1. Researcher imports the corpus from its source into memory. Specifically, the system reads data from the source, parses the raw format, extracts information according to the schema, and constructs an in-memory object to store the corpus

2. Researcher performs preprocessing on the corpus. Specifically, for each document, the researcher tokenizes the text, removes stopwords, stems the tokens, and performs filtering and/or other potential preprocessing on the body text and metadata

3. Researcher converts the corpus into the feature vectors needed for machine learning. The feature vectors are created by analyzing the documents in the corpus, deriving or filtering features, adding or removing documents, sampling documents, handling missing entries, normalizing features, selecting features, and/or other potential processing

4. Researcher splits the processed corpus into training and testing sets

5. Researcher chooses a machine learning algorithm, sets its parameters, and uses it to train a model on the training set

6. Researcher classifies the documents in the test set based on the model trained

7. Researcher evaluates the classification based on the classification results obtained on the test set and its true labels. Classification is evaluated mainly on classification accuracy and classification time or, if it is unsupervised, on other unsupervised metrics such as the Adjusted Rand Index.

8. Researcher displays the final evaluation result
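The main success scenario above can be exercised end-to-end with a deliberately tiny, self-contained sketch. It uses plain Java and no STAT or external APIs; the "learning algorithm" is a toy token-overlap nearest-neighbor classifier standing in for a real one:

```java
import java.util.*;

public class MiniExperiment {
    record Doc(String text, String label) {}

    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().split("[^a-z]+")));
    }

    // Toy "model": the training documents themselves; classification is
    // 1-nearest-neighbor by shared-token count (a stand-in for a real learner).
    static String classify(List<Doc> train, String text) {
        Set<String> t = tokens(text);
        Doc best = train.get(0);
        int bestOverlap = -1;
        for (Doc d : train) {
            Set<String> shared = tokens(d.text());
            shared.retainAll(t);
            if (shared.size() > bestOverlap) { bestOverlap = shared.size(); best = d; }
        }
        return best.label();
    }

    // Step 7: evaluate accuracy against the test set's true labels.
    static double accuracy(List<Doc> train, List<Doc> test) {
        int correct = 0;
        for (Doc d : test)
            if (classify(train, d.text()).equals(d.label())) correct++;
        return (double) correct / test.size();
    }

    public static void main(String[] args) {
        List<Doc> train = List.of(
            new Doc("the striker scored a late goal", "sports"),
            new Doc("parliament passed the budget bill", "politics"));
        List<Doc> test = List.of(
            new Doc("a goal was scored in injury time", "sports"),
            new Doc("the bill passed after a long debate", "politics"));
        System.out.println("accuracy = " + accuracy(train, test)); // prints accuracy = 1.0
    }
}
```

A real run would replace the in-code corpus with step 1's import, insert the preprocessing and feature extraction of steps 2-3, and swap in a genuine learning algorithm at step 5; the shape of the pipeline stays the same.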


Extensions:

1a. The framework is unable to find the specified source.

1. Throw source not found exception

1b. Researcher loads a previously saved corpus in native format from a file on disk directly into a memory object; thus the researcher does not handle source, format, or schema explicitly.

1a. File not found:

1. Throw file not found exception

1b. Malformed native format:

1. Throw malformed native format exception

4a. Researcher specifies a parameter k larger than the number of documents or smaller than 1

1. Throw invalid argument exception

1-3, 5a. Researcher saves the in-memory objects of the corpus representation at different levels of processing to disk in native format, which can be loaded back later, after finishing each step.

1-3, 5b. Researcher exports the in-memory objects of different processed corpus representations to disk in external formats (e.g., Weka arff, CSV) that can be processed by external software.

6a. Researcher saves the in-memory model object to disk, so it can be loaded back later.

6b. Researcher loads a previously saved model in native format from a file on disk directly into a memory object.

1a. File not found:

1. Throw file not found exception

1b. Malformed native format:

1. Throw malformed native format exception

4-8b. To perform k-fold cross-validation, the corpus is split into k parts in step 4, and steps 5-8 are repeated k times, each time using one split as the test set and the rest as training. The researcher combines the evaluations obtained on the different test sets in the previous steps to form a final classification evaluation

6c. Unsupported learning parameters (the learning algorithm cannot handle the combinationof parameters the researcher specifies)

1. Throw unsupported learning parameters exception

6d. Unsupported learning capability (the learning algorithm cannot handle the format and data in the training set, potentially caused by unsupported feature types, class types, missing values, etc.)

1. Identify exception cause(s)

2. Throw corresponding exception(s)


8a. Incompatibility between the test set and the classification (potentially caused by a difference in schema between the training set and the test set)

1. Throw incompatible evaluation exception

8b. The researcher customizes the display instead of using the default display format.

1. The researcher obtains specific fields of the evaluation via the interfaces provided

2. The researcher constructs a customized format using the fields he/she extracts

3. The researcher displays the customized format and/or writes it to a destination
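The fold construction described in extension 4-8b, together with the invalid-k check of extension 4a, can be sketched in a few lines of plain Java (illustrative only; the class and method names are made up):

```java
import java.util.ArrayList;
import java.util.List;

public class KFoldSketch {
    // Split document indices 0..n-1 into k folds; fold sizes differ by at most one.
    // Rejects k outside 1..n, mirroring the invalid-argument extension 4a.
    public static List<List<Integer>> folds(int n, int k) {
        if (k < 1 || k > n) throw new IllegalArgumentException("need 1 <= k <= n");
        List<List<Integer>> out = new ArrayList<>();
        for (int f = 0; f < k; f++) out.add(new ArrayList<>());
        for (int i = 0; i < n; i++) out.get(i % k).add(i);
        return out;
    }

    public static void main(String[] args) {
        // Each fold serves once as the test set; the remaining folds form the training set.
        for (List<Integer> test : folds(10, 3)) {
            System.out.println("test fold: " + test);
        }
    }
}
```

Splitting by index rather than by copying documents also hints at one answer to the open issue below about splitting the corpus without creating extra objects.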

Special Requirements:

• Pluggable preprocessors in steps 2-3

• Pluggable learning algorithms in step 6

• The learning algorithm should be scalable to large corpora

• Researcher should be able to visualize results after the various steps to trace the state of different objects (e.g., preprocessed corpus, models, classifications, evaluations)

• Researcher should be able to customize the visualization output

Open Issues:

• How to address the variation issues in reading different sources

• How (and in what form) to let the researcher specify parameters for different learning algorithms

• What, specifically, needs to be exportable, persistable, and visualizable?

• How to implement corpus splitting efficiently (without creating extra objects)

• How to deal with the performance issues of storing large corpora in memory

• How to represent the dataset internally with efficient data structures

4.2 Non-functional Requirements

• Open source. It should be made available for public collaboration, allowing users to use, change, improve, and redistribute the software.

• Portability. It should install, configure, and run consistently across platforms, given its design and implementation on the Java runtime environment.

• Performance. It should not be the bottleneck of machine learning analysis. That is, wrapping other existing packages should introduce no significant performance impact, and for the algorithms we implement, our code should achieve performance as close as possible to the algorithms' best complexity.

• Documentation. Its code should be readable, self-explanatory, and documented clearly and unambiguously in critical or tricky parts. It should include an introductory guide for users to get started and, preferably, provide sample datasets, tutorials, and demos so users can run examples out of the box.


4.3 Domain Model

In this section, we present the domain model diagram and some explanations of it. Much time was spent on this domain model, and it has evolved into a relatively stable one, which will guide our design in iteration I.

4.3.1 Domain Model Diagram

Figure 4.1 shows the domain model diagram of the STAT project (for the first iteration). This domain model is intended to give a top-level understanding of the concepts in the project rather than a comprehensive one that includes every detail. A number of concepts, such as “Annotation”, “Label”, “ProbabilityDistribution”, “DistanceMetric”, and “Partition” (or maybe “Split”), are not shown in this diagram.

Figure 4.1: Partial domain model of STAT for basic text analysis

Note that the diagram also lacks some top-level concepts related to unsupervised learning. These are topics for the milestone requirements and design iteration II; a new domain model incorporating those concepts will be proposed in that phase. For now, we focus on this domain model and clarify in detail what the current concepts are.

4.3.2 Domain Concept Clarifications


• CorpusReader. A CorpusReader reads text from a source into a Corpus. No content transformation is done; everything (labels, metadata, body, etc.) stays in text format.

• Corpus. A Corpus is a set of Documents in text format.

• Annotator. An Annotator transforms a Corpus into another Corpus by adding annotations.


• FeatureExtractor. A FeatureExtractor transforms a Corpus into a Dataset by converting Documents in text format into Instances in feature representation.

• Dataset. A Dataset is a set of Instances, which are feature representations of Document text.

• Learner. A Learner learns a Model from a Dataset.

• Model. A Model is a collection of parameters learned from a Dataset by a Learner.

• Classifier. A Classifier uses a Model to assign classes to the Instances in a Dataset and produces a Classification with respect to that Dataset.

• Classification. A Classification contains the classification results and descriptive information about the classification process, e.g., which model and classifier were used to produce the results.

• ClassificationEvaluator. A ClassificationEvaluator computes the evaluation metrics for a Classification and produces a ClassificationEvaluation.

• ClassificationEvaluation. A ClassificationEvaluation contains the evaluation results and descriptive information about the classification and evaluation process.
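The concepts above can be rendered as small Java interfaces to show how they chain together. The sketch below is hypothetical: the names follow the domain model, but the signatures are illustrative (a Corpus is simplified to a List of Documents), and the lambdas in main form a degenerate pipeline, not a real learner:

```java
import java.util.List;

// Hypothetical sketch of the domain concepts; signatures are illustrative,
// not the actual STAT API. A Corpus is simplified to a List<Document>.
public class DomainSketch {
    record Document(String text, String label) {}
    record Instance(List<Double> features, String label) {}

    interface CorpusReader { List<Document> read(String source); }
    interface Annotator { List<Document> annotate(List<Document> corpus); }
    interface FeatureExtractor { List<Instance> extract(List<Document> corpus); }
    interface Learner { Model learn(List<Instance> dataset); }
    interface Model { String classify(Instance instance); }
    interface ClassificationEvaluator { double accuracy(Model m, List<Instance> test); }

    public static void main(String[] args) {
        // A degenerate pipeline just to show how the concepts chain together:
        // one feature (text length), and a threshold "model".
        FeatureExtractor fe = corpus -> corpus.stream()
                .map(d -> new Instance(List.of((double) d.text().length()), d.label()))
                .toList();
        Learner learner = dataset -> inst ->
                inst.features().get(0) > 10 ? "long" : "short";
        List<Document> corpus = List.of(new Document("tiny", "short"),
                                        new Document("a much longer document", "long"));
        Model m = learner.learn(fe.extract(corpus));
        System.out.println(m.classify(new Instance(List.of(4.0), null))); // prints short
    }
}
```

Keeping each stage behind its own interface is what makes the preprocessors and learning algorithms pluggable, as required by use case UC1.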

