augustus overview open source analytics

Training in Analytics

2

Website and Community

Augustus is an open source scoring engine for statistical and data mining models based on the Predictive Model Markup Language (PMML).

It is written in Python and is freely available.

http://augustus.googlecode.com



Getting Augustus

● Releases can be downloaded from the website under the Download tab.

● Current releases are also listed in the main page's Featured sidebar.

● Augustus can be directly checked out from source control. We use Subversion.

● Project members can be granted commit access.



Source

● All of the source files are viewable online with markup and revision history.

● The raw version of each file is also available.

http://augustus.googlecode.com/source/browse



Documentation and Community

WIKI

▼ The wiki is intended for people who want to install and use Augustus, and possibly develop new features.

FORUM

▼ The forum is open for any general discussion regarding Augustus.



Using Augustus

● Model Development

● Use Cycle

● Work Flow



Development and Use Cycle

The typical model development and use cycle with Augustus is as follows:

1. Identify suitable data with which to construct a new model.

2. Provide a model schema which prescribes the requirements for the model.

3. Run the Augustus producer to obtain a new model.

4. Run the Augustus consumer on new data to effect scoring.



Development and Use Cycle

[Diagram: 1. Data inputs and 2. Model schema feed into Running Augustus: 3. Obtain new model with Producer, then 4. Score with Consumer.]



Work Flows

● Augustus is typically used to construct models and to score data with them.

● Augustus includes a dedicated application for creating, or producing, predictive models rendered as PMML-compliant files. Scoring is accomplished by consuming PMML-compliant files describing an appropriate model.



Components

● Pre-processing

● Producers

● Consumers

● Post-processing



Producers and Consumers

● The Producers and Consumers are configured with XML-formatted files.

● Supplying the schema, configuration, and training data to the Producer yields a completely specified model.

● The Consumers allow some configuration of the output, and post-processing can be used to render the output according to the user's needs.



Post Processing

● Augustus can accommodate a post-processing step. While not necessary, this is often useful to:

▼ Re-normalize the scoring results or perform an additional transformation.

▼ Supplement the results with global meta-data such as timestamps.

▼ Format the results.

▼ Select certain interesting values from the results.

▼ Restructure the data for use with other applications.
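The post-processing tasks listed above can be sketched in a few lines of Python. This is only an illustration, written for current Python 3, and it is not the postprocess.py script shipped with Augustus. The function name `postprocess`, the record layout (a list of dicts with `score` and `alert` keys), the min-max re-normalization, and the CSV output are all assumptions made for the example.

```python
import csv
import io
from datetime import datetime, timezone


def postprocess(events, run_time=None):
    """Hypothetical post-processing step (not part of Augustus)."""
    run_time = run_time or datetime.now(timezone.utc)
    scores = [e["score"] for e in events]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid dividing by zero when all scores are equal

    rows = []
    for e in events:
        if not e["alert"]:
            continue  # select only the interesting values: events that alerted
        rows.append({
            "score": e["score"],
            "normalized": (e["score"] - lo) / span,  # re-normalize to [0, 1]
            "timestamp": run_time.isoformat(),       # global meta-data
        })

    # Format and restructure: emit CSV text for use by other applications.
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["score", "normalized", "timestamp"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Made-up events, purely for illustration.
events = [
    {"score": 0.2, "alert": True},
    {"score": 0.6, "alert": False},
    {"score": 1.0, "alert": True},
]
report = postprocess(events, run_time=datetime(2010, 1, 1, tzinfo=timezone.utc))
print(report)
```

Each of the four tasks maps to one step in the function, so any of them can be dropped or replaced without touching the others.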



Segments

Segments are covered elsewhere. Augustus supports them, and they can be declared at the Producer level.

● Augustus was originally written to an Open Data draft RFC for segmented models. Augustus 0.3.x conforms to the RFC.

● PMML 4 formalized the specification for segments, and it deviates somewhat from the RFC. Augustus 0.4.x conforms to this standard.

● Augustus 0.3.x and 0.4.x both support segments, but they differ in how they handle them.

Result of Scoring



Case Study: Auto

● Auto is an example distributed with Augustus, found in the examples directory.

▼ It consists of four simple examples of applying vector channel analysis to a single field of a stream of input records.

▼ The examples use two types of data files.

▼ The data consists of records with three entries: Date, Color, and Automaker.

▼ The Weighted examples have an additional 'weight' column, named Count. Identical tuples in the non-weighted data are collapsed into one record, and the Count field records how many times that tuple occurred.
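The relationship between the non-weighted and weighted files can be illustrated with a short Python sketch. The sample rows and the helper name `to_weighted` are made up for the example; the real data files are read by Augustus itself, not by this code.

```python
from collections import Counter

# Made-up sample rows in the (Date, Color, Automaker) layout described above.
records = [
    ("2009-01-01", "Red", "Ford"),
    ("2009-01-01", "Red", "Ford"),
    ("2009-01-01", "Blue", "Honda"),
    ("2009-01-02", "Red", "Ford"),
]


def to_weighted(rows):
    """Collapse identical tuples into one row with a trailing Count value."""
    counts = Counter(rows)
    # Counter keeps keys in first-seen order (Python 3.7+).
    return [row + (n,) for row, n in counts.items()]


weighted = to_weighted(records)
for row in weighted:
    print(row)
```

The sum of the Count values always equals the number of rows in the non-weighted data, which is a quick sanity check when preparing weighted input.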



Work Flow Overview



Auto: Weighted Batch

Using the Baseline for Training:

$ cd WeightedBatch
`-- scripts
    |-- consume.py
    |-- postprocess.py
    `-- produce.py

http://code.google.com/p/augustus/source/browse/#svn/trunk/examples/auto/WeightedBatch



Input for the Producer

The Producer takes the training data set. In the code, we have declared how we want to test the data:

import augustus.modellib.baseline.producer.Producer as Producer

# Excerpt: the definitions of uni, ET, and root are not shown on the slide.
def makeConfigs(inFile, outFile, inPMML, outPMML):
    # open data file
    inf = uni.UniTable().fromfile(inFile)
    # start the configuration file
    test = ET.SubElement(root, "test")
    test.set("field", "Automaker")
    test.set("weightField", "Count")
    test.set("testStatistic", "dDist")
    test.set("testType", "threshold")
    test.set("threshold", "0.475")



Input for the Producer Continued

    # use a discrete distribution model for test
    baseline = ET.SubElement(test, "baseline")
    baseline.set("dist", "discrete")
    baseline.set("file", str(inFile))
    baseline.set("type", "UniTable")
    # create the segmentation declarations for the two fields at this level
    '''
    Taken out for the example, other Use Cases will focus on Segments
    segmentation = ET.SubElement(test, "segmentation")
    makeSegment(inf, segmentation, "Color")
    '''
    # output the configuration file
    tree = ET.ElementTree(root)
    tree.write(outFile)
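For readers who want to try the configuration step without Augustus installed, the fragments above can be assembled into a self-contained sketch using only the standard library's xml.etree.ElementTree. This is an approximation, not the code in produce.py: the function name `make_config` is hypothetical, the data file is not opened with UniTable, and the `model` root element with its `input` and `output` attributes is assumed from the generated training XML shown later in these slides.

```python
import xml.etree.ElementTree as ET


def make_config(in_file, in_pmml, out_pmml):
    """Build a producer configuration tree (illustrative sketch only)."""
    # Assumed root element, modeled on the generated training XML.
    root = ET.Element("model")
    root.set("input", str(in_pmml))
    root.set("output", str(out_pmml))

    # Declare how the Automaker field is to be tested.
    test = ET.SubElement(root, "test")
    test.set("field", "Automaker")
    test.set("weightField", "Count")
    test.set("testStatistic", "dDist")
    test.set("testType", "threshold")
    test.set("threshold", "0.475")

    # Use a discrete distribution built from the training file as the baseline.
    baseline = ET.SubElement(test, "baseline")
    baseline.set("dist", "discrete")
    baseline.set("file", str(in_file))
    baseline.set("type", "UniTable")

    return ET.ElementTree(root)


tree = make_config(
    "../data/wtraining.nab",
    "../producer/wtraining.nab.pmml",
    "../consumer/wtraining.nab.pmml",
)
config_xml = ET.tostring(tree.getroot(), encoding="unicode")
print(config_xml)
```

Calling `tree.write(path)` instead of `ET.tostring` would save the configuration to disk, as the original function does.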



Running the Producer (Training)

$ cd scripts
$ python2.5 produce.py -f wtraining.nab -t20
(0.000 secs) Beginning timing
(0.000 secs) Creating configuration file
(0.001 secs) Creating input PMML file
(0.001 secs) Starting producer
(0.000 secs) Inputting configurations
(0.001 secs) Inputting model
(0.008 secs) Collecting stats for baseline distribution
(0.011 secs) Events 20.067% processed

(0.000 secs) Making test distributions from statistics
(0.002 secs) Outputting PMML
(0.062 secs) Lifetime of timer



Model generated by the Producer

<PMML version="3.1">
  <Header copyright=" " />
  <DataDictionary>
    <DataField dataType="string" name="Automaker" optype="categorical" />
    <DataField dataType="string" name="Color" optype="categorical" />
    <DataField dataType="float" name="Count" optype="continuous" />
  </DataDictionary>
  <BaselineModel functionName="baseline">
    <MiningSchema>
      <MiningField name="Automaker" />
      <MiningField name="Color" />
      <MiningField name="Count" />
    </MiningSchema>
  </BaselineModel>
</PMML>



Model generated by the Producer (Cont)

The structure is determined by the code in Producer.py:

def makePMML(outFile):
    # create the pmml
    root = ET.Element("PMML")
    root.set("version", "3.1")
    header = ET.SubElement(root, "Header")
    header.set("copyright", " ")
    dataDict = ET.SubElement(root, "DataDictionary")

It then continues in the same way for each DataField and MiningField:

    dataField = ET.SubElement(dataDict, "DataField")
    dataField.set("name", "Automaker")
    dataField.set("optype", "categorical")
    dataField.set("dataType", "string")
    . . .
    miningSchema = ET.SubElement(baselineModel, "MiningSchema")
    miningField = ET.SubElement(miningSchema, "MiningField")
    miningField.set("name", "Automaker")
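The per-field repetition in the excerpt can be expressed as a loop. The sketch below is a hypothetical rewrite, not the shipped code: the names `FIELDS` and `make_pmml` are invented here, and the field list is copied from the generated model shown on the previous slide. It uses only the standard library and builds the same skeleton (Header, DataDictionary, and a BaselineModel with a MiningSchema).

```python
import xml.etree.ElementTree as ET

# (name, optype, dataType) for each field, as listed in the generated model.
FIELDS = [
    ("Automaker", "categorical", "string"),
    ("Color", "categorical", "string"),
    ("Count", "continuous", "float"),
]


def make_pmml(fields):
    """Build the PMML skeleton for a baseline model (illustrative sketch)."""
    root = ET.Element("PMML", version="3.1")
    ET.SubElement(root, "Header", copyright=" ")

    # One DataField per input column.
    data_dict = ET.SubElement(root, "DataDictionary")
    for name, optype, data_type in fields:
        ET.SubElement(
            data_dict, "DataField", name=name, optype=optype, dataType=data_type
        )

    # One MiningField per column, nested under the model's MiningSchema.
    model = ET.SubElement(root, "BaselineModel", functionName="baseline")
    schema = ET.SubElement(model, "MiningSchema")
    for name, _optype, _data_type in fields:
        ET.SubElement(schema, "MiningField", name=name)

    return root


pmml = make_pmml(FIELDS)
print(ET.tostring(pmml, encoding="unicode"))
```

Driving the structure from a field list keeps the DataDictionary and MiningSchema in step when a column is added or renamed.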



Producer Output

The training step used the code in producer.py to generate a model and get expected results.

Training generated the following files:

.
|-- consumer
|   `-- wtraining.nab.pmml    MODEL WITH EXPECTED VALUES BASED ON THE TRAINING DATA
`-- producer
    |-- wtraining.nab.pmml    BASELINE DATA, DATA DICTIONARY, MINING SCHEMA
    `-- wtraining.nab.xml     MODEL FILE USED FOR TRAINING



Training XML

This provides:

● The model with expected values from training, which is used when we score

● The test distribution

● The baseline data and how it is to be handled

$ cat producer/wtraining.nab.xml
<model input="../producer/wtraining.nab.pmml"
       output="../consumer/wtraining.nab.pmml">
  <test field="Automaker" testStatistic="dDist" testType="threshold"
        threshold="0.475" weightField="Count">
    <baseline dist="discrete" file="../data/wtraining.nab"
              type="UniTable" />
  </test>
</model>



Unitable

● Unitable is used to hold the data that is read in.

● It encapsulates the data in a way that lets us manipulate it efficiently.

● It can be thought of, in part, as a data structure holding a spreadsheet of data with columns, types, etc., together with the relevant operations that can be performed on the data and the data structure.

● More to follow.



Running the Consumer

$ cd scripts
$ python2.5 consume.py -b wtraining.nab -f wscoring.nab
Ready to score

.
|-- consumer
|   |-- wscoring.nab.wtraining.nab.xml
|   `-- wtraining.nab.pmml
|-- postprocess
|   `-- wscoring.nab.wtraining.nab.xml
`-- producer
    |-- wtraining.nab.pmml
    `-- wtraining.nab.xml

This example generates a report in the postprocess directory.



Consumer (Scoring) Output

$ cat consumer/wscoring.nab.wtraining.nab.xml
<pmmlDeployment>
  <inputData>
    <readOnce />
    <batchScoring />
    <fromFile name="../data/wscoring.nab" type="UniTable" />
  </inputData>
  <inputModel>
    <fromFile name="../consumer/wtraining.nab.pmml" />
  </inputModel>
  <output>
    <report name="report">
      <toFile name="../postprocess/wscoring.nab.wtraining.nab.xml" />
      <outputRow name="event">
        <score name="score" />
        <alert name="alert" />
        <segments name="segments" />
      </outputRow>
    </report>
  </output>
</pmmlDeployment>



Scoring Report

$ cat postprocess/wscoring.nab.wtraining.nab.xml
<report>
  <event>
    <score>0.471458430077</score>
    <alert>True</alert>
    <Segments></Segments>
  </event>
</report>
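A report in this format is easy to load into Python for further processing. The sketch below uses the standard library's xml.etree.ElementTree; the function name `read_events` and the choice to return plain dicts are assumptions for illustration, and the report text is embedded as a string so the example runs without any files.

```python
import xml.etree.ElementTree as ET

# The report shown above, embedded so the example is self-contained.
REPORT = """<report>
  <event>
    <score>0.471458430077</score>
    <alert>True</alert>
    <Segments></Segments>
  </event>
</report>"""


def read_events(xml_text):
    """Return one dict per <event>, with score as float and alert as bool."""
    events = []
    for event in ET.fromstring(xml_text).iter("event"):
        events.append({
            "score": float(event.findtext("score")),
            "alert": event.findtext("alert") == "True",
            # An empty <Segments> element yields an empty string.
            "segments": event.findtext("Segments") or "",
        })
    return events


events = read_events(REPORT)
print(events)
```

To read the generated file instead of the embedded string, `ET.parse(path).getroot()` can replace `ET.fromstring(xml_text)`.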



Unitable

● The Unitable is one of the main components of the Augustus system.

▼ Data read into Augustus is stored in a Unitable.

▼ This results in a very fast, efficient object for data shaping, model building, and scoring, in both batch and real-time contexts.

● It is designed to hold data in a way that allows it to be acted upon by numpy.

▼ It takes advantage of new features and improvements that the scientific Python community puts into numpy.

● Unitable can be used outside of the Augustus scoring flow.

▼ A standalone example is available on the wiki.



Key Features of Unitable

● File format that matches the native machine memory storage of the data, allowing for memory-mapped access to the data.

▼ No parsing or sequential reading

● Fast vector operations using any number of data columns.

● Support for demand-driven, rule-based calculations.

▼ Derived columns are defined in terms of operations on other columns, including other derived columns, and made available when referenced.
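The idea of demand-driven derived columns can be shown with a toy class in plain Python. This is not UniTable's API and it does not use numpy: `LazyTable`, `derive`, and the sample columns are all invented for the illustration. A rule is registered up front, but its column is only computed the first time it is referenced, and rules may depend on other derived columns.

```python
class LazyTable:
    """Toy column store with demand-driven derived columns (not UniTable)."""

    def __init__(self, **columns):
        self._columns = {name: list(values) for name, values in columns.items()}
        self._rules = {}

    def derive(self, name, func, *inputs):
        """Register a rule; nothing is computed until the column is referenced."""
        self._rules[name] = (func, inputs)

    def __getitem__(self, name):
        if name not in self._columns:
            func, inputs = self._rules[name]
            # Inputs may themselves be derived; they are resolved recursively.
            args = [self[column] for column in inputs]
            self._columns[name] = [func(*row) for row in zip(*args)]
        return self._columns[name]


# Made-up data and rules.
table = LazyTable(price=[10.0, 20.0], qty=[3, 4])
table.derive("total", lambda p, q: p * q, "price", "qty")
table.derive("half", lambda t: t / 2, "total")  # derived from a derived column

print(table["half"])  # referencing "half" computes "total" first
```

Computed columns are cached in the same dictionary as the raw ones, so a second reference costs only a lookup.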



Key Features of Unitable (cont)

● Can handle huge real-time data rates by automatically switching to vector mode when behind, and scalar mode when keeping up with individual input events.

● Ability to invoke calculations in scalar or vector mode transparently.

▼ One set of rule definitions can be applied to an entire data set in batch mode, or to individual rows of real-time events.
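A rough sketch of the scalar/vector switch, in plain Python, is given below. It is only a model of the behaviour described on the slide: the function names, the `Count` field, and the rule are hypothetical, and where this example loops over a batch, a numpy-based implementation would presumably operate on whole columns at once.

```python
from collections import deque


def is_large(count, limit=100):
    """One rule definition, usable in either mode."""
    return count > limit


def apply_scalar(rule, row, field):
    """Scalar mode: evaluate the rule for a single incoming event."""
    return rule(row[field])


def apply_vector(rule, rows, field):
    """Vector mode: evaluate the same rule over a whole batch of events."""
    return [rule(row[field]) for row in rows]


def process(queue, rule, field):
    """Drain the queue, switching mode based on how far behind we are."""
    results = []
    while queue:
        if len(queue) == 1:
            # Keeping up: handle the single pending event on its own.
            results.append(("scalar", [apply_scalar(rule, queue.popleft(), field)]))
        else:
            # Behind: take the whole backlog in one batch.
            batch = list(queue)
            queue.clear()
            results.append(("vector", apply_vector(rule, batch, field)))
    return results


backlog = deque([{"Count": 5}, {"Count": 500}, {"Count": 50}])
print(process(backlog, is_large, "Count"))
print(process(deque([{"Count": 500}]), is_large, "Count"))
```

The rule itself is written once and never needs to know which mode it is running in, which is the property the slide highlights.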



For more information

Open Data Group
400 Lathrop Avenue
River Forest IL 60305


http://code.google.com/p/augustus/
