Augustus Overview: Open Source Analytics


Upload: jtrussell

Post on 29-Jun-2015


DESCRIPTION

An introduction to Augustus, an open source scoring engine for statistical and data mining models based on the Predictive Model Markup Language (PMML). Augustus is able to produce and consume models with 10,000s of segments. Developed by Open Data Group, written in Python, PMML 4.0 compliant and freely available.

TRANSCRIPT

Page 1: Augustus Overview  Open Source Analytics

Training in Analytics


Website and Community

Augustus is an open source scoring engine for statistical and data mining models based on the Predictive Model Markup Language (PMML).

It is written in Python and is freely available.

http://augustus.googlecode.com


Getting Augustus

● Releases can be downloaded from the website under the Download tab.

● The current release is also listed in the main page's Featured sidebar.

● Augustus can be directly checked out from source control. We use Subversion.

● Project members can be granted commit access.


Source

● All of the source files are viewable online with markup and revision history.

● The raw version of each file is also available.

http://augustus.googlecode.com/source/browse


Documentation and Community

WIKI

▼ The wiki is intended for people who want to install Augustus for use and possibly develop new features.

FORUM

▼ The forum is open for any general discussion regarding Augustus.


Using Augustus

● Model Development

● Use Cycle

● Work Flow


Development and Use Cycle

The typical model development and use cycle with Augustus is as follows:

1. Identify suitable data with which to construct a new model.

2. Provide a model schema which prescribes the requirements for the model.

3. Run the Augustus producer to obtain a new model.

4. Run the Augustus consumer on new data to effect scoring.
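
The four steps above can be sketched in plain Python. This is a toy illustration only and does not use the Augustus API: the functions `produce` and `consume`, the sample data, the total variation distance used as the score, and the direction of the alert comparison are all assumptions made for this sketch.

```python
from collections import Counter


def produce(training_values):
    """Toy 'producer': build a baseline distribution (the model) from training data."""
    counts = Counter(training_values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def consume(model, new_values, threshold):
    """Toy 'consumer': score new data against the model and set an alert flag."""
    counts = Counter(new_values)
    total = sum(counts.values())
    observed = {value: n / total for value, n in counts.items()}
    # Total variation distance between the baseline and observed distributions.
    # This statistic is a stand-in chosen for the sketch, not Augustus's own test.
    keys = set(model) | set(observed)
    score = 0.5 * sum(abs(model.get(k, 0.0) - observed.get(k, 0.0)) for k in keys)
    return {"score": score, "alert": score > threshold}


# Step 1: identify data. Step 2: the "schema" here is just the threshold.
training = ["Ford", "Ford", "Toyota", "Honda"]
# Step 3: produce the model.
model = produce(training)
# Step 4: consume new data to score it.
result = consume(model, ["Ford", "Toyota", "Toyota", "Toyota"], threshold=0.4)
print(result)
```

The point of the split is that the model is produced once from training data and can then be consumed repeatedly on new data.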


Development and Use Cycle

1. Data Inputs

2. Model schema


Running Augustus

3. Obtain new model with Producer

4. Score with Consumer


Work Flows

● Augustus is typically used to construct models and score data with models.

● Augustus includes a dedicated application for creating, or producing, predictive models rendered as PMML-compliant files. Scoring is accomplished by consuming PMML-compliant files describing an appropriate model.


Components

● Pre-processing

● Producers

● Consumers

● Post-processing


Producers and Consumers

● The Producers and Consumers require configuration with XML-formatted files.

● Supplying the schema, configuration and training data to the Producer yields a completely specified model.

● The Consumers allow some configuration of the output, but post-processing can be used to render the output according to the user's needs.


Post Processing

● Augustus can accommodate a post-processing step. While not necessary, this is often useful to:

▼ Re-normalize the scoring results or perform an additional transformation.

▼ Supplement the results with global metadata such as timestamps.

▼ Format the results.

▼ Select certain interesting values from the results.

▼ Restructure the data for use with other applications.
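
A post-processing step of this kind might look like the following sketch. It assumes a report shaped like the scoring report shown later in this deck; the second event, the function name `postprocess`, and the choice to keep only alerting events are illustrative assumptions, not part of Augustus.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Sample report. The first event mirrors the scoring report in this deck;
# the second event is made-up sample data for illustration.
REPORT = """<report>
  <event><score>0.471458430077</score><alert>True</alert><Segments></Segments></event>
  <event><score>0.512</score><alert>False</alert><Segments></Segments></event>
</report>"""


def postprocess(report_xml, generated_at):
    """Select alerting events, format their scores, and attach a timestamp."""
    root = ET.fromstring(report_xml)
    rows = []
    for event in root.findall("event"):
        if event.findtext("alert") != "True":
            continue  # select only the interesting (alerting) events
        rows.append({
            "score": round(float(event.findtext("score")), 3),  # format the result
            "generated_at": generated_at,  # supplement with global metadata
        })
    return rows


rows = postprocess(REPORT, datetime.now(timezone.utc).isoformat())
for row in rows:
    print(row)
```

Returning plain dictionaries keeps the restructured data easy to hand to other applications, for example as CSV or JSON.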


Segments

Segments are covered elsewhere. Augustus supports segments, and they can be declared at the Producer level.

● Augustus was originally written to an Open Data draft RFC for segmented models. Augustus 0.3.x conforms to the RFC.

● PMML 4 formalized the specification for segments, and it deviates somewhat from the RFC. Augustus 0.4.x conforms to this standard.

● Augustus 0.3.x and 0.4.x both support segments, but they differ in how they handle them.


Result of Scoring


Case Study: Auto

● Auto is an example distributed with Augustus, found in the examples directory.

▼ It consists of four simple examples of applying vector channel analysis to a single field of a stream of input records.

▼ The examples use two types of data files.

▼ The data consists of records with three entries: Date, Color, and Automaker.

▼ The Weighted examples have an additional 'weight' column, named Count. Identical tuples in the non-weighted data are collapsed into one record, and the Count field records how many times that tuple occurred.
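
The relationship between the non-weighted and weighted data can be sketched as follows. The sample records and the helper name `collapse` are hypothetical; only the column layout (Date, Color, Automaker, plus Count) comes from the slide.

```python
from collections import Counter

# Hypothetical non-weighted records: (Date, Color, Automaker)
records = [
    ("2009-01-01", "Red", "Ford"),
    ("2009-01-01", "Red", "Ford"),
    ("2009-01-01", "Blue", "Toyota"),
    ("2009-01-02", "Red", "Ford"),
]


def collapse(rows):
    """Collapse identical tuples into one record each, appending a Count weight."""
    counts = Counter(rows)
    return [row + (count,) for row, count in counts.items()]


weighted = collapse(records)
for row in weighted:
    print(row)
```

The weights sum to the number of original records, so a weighted analysis sees the same total mass as the non-weighted one.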


Work Flow Overview


Auto: Weighted Batch

Using the Baseline for Training:

$ cd WeightedBatch

`-- scripts
    |-- consume.py
    |-- postprocess.py
    `-- produce.py

http://code.google.com/p/augustus/source/browse/#svn/trunk/examples/auto/WeightedBatch


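Input for the Producer

The Producer takes the training data set. In the code, we declare how we want to test the data:

import xml.etree.ElementTree as ET  # assumed import; not shown on the slide
import augustus.modellib.baseline.producer.Producer as Producer

def makeConfigs(inFile, outFile, inPMML, outPMML):
    # open the data file ("uni" is the Augustus UniTable module; its import is not shown on the slide)
    inf = uni.UniTable().fromfile(inFile)
    # start the configuration file; the creation of root is not shown on the slide,
    # so the element below is inferred from the generated wtraining.nab.xml
    root = ET.Element("model")
    root.set("input", str(inPMML))
    root.set("output", str(outPMML))
    # test the Automaker field, weighted by Count, against a threshold
    test = ET.SubElement(root, "test")
    test.set("field", "Automaker")
    test.set("weightField", "Count")
    test.set("testStatistic", "dDist")
    test.set("testType", "threshold")
    test.set("threshold", "0.475")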


Input for the Producer Continued

    # use a discrete distribution model for the test
    baseline = ET.SubElement(test, "baseline")
    baseline.set("dist", "discrete")
    baseline.set("file", str(inFile))
    baseline.set("type", "UniTable")
    # create the segmentation declarations for the two fields at this level
    '''
    Taken out for the example; other use cases will focus on segments.
    segmentation = ET.SubElement(test, "segmentation")
    makeSegment(inf, segmentation, "Color")
    '''
    # output the configuration file
    tree = ET.ElementTree(root)
    tree.write(outFile)


Running the Producer (Training)

$ cd scripts

$ python2.5 produce.py -f wtraining.nab -t20

(0.000 secs) Beginning timing

(0.000 secs) Creating configuration file

(0.001 secs) Creating input PMML file

(0.001 secs) Starting producer

(0.000 secs) Inputting configurations

(0.001 secs) Inputting model

(0.008 secs) Collecting stats for baseline distribution

(0.011 secs) Events 20.067% processed

(0.009 secs) Events 40.134% processed

(0.009 secs) Events 60.201% processed

(0.009 secs) Events 80.268% processed

(0.009 secs) Events 100.000% processed

(0.000 secs) Making test distributions from statistics

(0.002 secs) Outputting PMML

(0.062 secs) Lifetime of timer


Model generated by the Producer

<PMML version="3.1">
  <Header copyright=" " />
  <DataDictionary>
    <DataField dataType="string" name="Automaker" optype="categorical" />
    <DataField dataType="string" name="Color" optype="categorical" />
    <DataField dataType="float" name="Count" optype="continuous" />
  </DataDictionary>
  <BaselineModel functionName="baseline">
    <MiningSchema>
      <MiningField name="Automaker" />
      <MiningField name="Color" />
      <MiningField name="Count" />
    </MiningSchema>
  </BaselineModel>
</PMML>


Model generated by the Producer (Cont)

The structure is determined by code in the Producer.py:

def makePMML(outFile):
    # create the PMML
    root = ET.Element("PMML")
    root.set("version", "3.1")
    header = ET.SubElement(root, "Header")
    header.set("copyright", " ")
    dataDict = ET.SubElement(root, "DataDictionary")

It then goes on for each Data and Mining Field:

    dataField = ET.SubElement(dataDict, "DataField")
    dataField.set("name", "Automaker")
    dataField.set("optype", "categorical")
    dataField.set("dataType", "string")
    . . .
    miningSchema = ET.SubElement(baselineModel, "MiningSchema")
    miningField = ET.SubElement(miningSchema, "MiningField")
    miningField.set("name", "Automaker")
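
The repeated per-field calls can also be written as a loop over a field list. The following is a standalone sketch using only the standard library, not the code shipped with the example: the names `FIELDS` and `make_pmml` are hypothetical, and the element layout is copied from the generated model shown earlier.

```python
import xml.etree.ElementTree as ET

# (name, optype, dataType) for each field, matching the generated model
FIELDS = [
    ("Automaker", "categorical", "string"),
    ("Color", "categorical", "string"),
    ("Count", "continuous", "float"),
]


def make_pmml(fields):
    """Build the PMML skeleton: header, data dictionary, and mining schema."""
    root = ET.Element("PMML")
    root.set("version", "3.1")
    header = ET.SubElement(root, "Header")
    header.set("copyright", " ")
    data_dict = ET.SubElement(root, "DataDictionary")
    model = ET.SubElement(root, "BaselineModel")
    model.set("functionName", "baseline")
    schema = ET.SubElement(model, "MiningSchema")
    for name, optype, data_type in fields:
        # one DataField and one MiningField per input field
        data_field = ET.SubElement(data_dict, "DataField")
        data_field.set("name", name)
        data_field.set("optype", optype)
        data_field.set("dataType", data_type)
        ET.SubElement(schema, "MiningField").set("name", name)
    return root


pmml = make_pmml(FIELDS)
print(ET.tostring(pmml, encoding="unicode"))
```

Driving the structure from a list keeps the data dictionary and the mining schema in step when a field is added or renamed.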


Producer Output

The training step used the code in producer.py to generate a model and get expected results.

Training generated the following files:

.
|-- consumer
|   `-- wtraining.nab.pmml    MODEL WITH EXPECTED VALUES BASED ON THE TRAINING DATA
`-- producer
    |-- wtraining.nab.pmml    BASELINE DATA, DATA DICTIONARY, MINING SCHEMA
    `-- wtraining.nab.xml     MODEL FILE USED FOR TRAINING


Training XML

This provides:

● Model with expected values from Training that is used when we score

● Test Distribution

● Baseline data and how it is to be handled

$ cat producer/wtraining.nab.xml

<model input="../producer/wtraining.nab.pmml"
       output="../consumer/wtraining.nab.pmml">
  <test field="Automaker" testStatistic="dDist" testType="threshold"
        threshold="0.475" weightField="Count">
    <baseline dist="discrete" file="../data/wtraining.nab"
              type="UniTable" />
  </test>
</model>
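
To check what a training configuration declares, the file can be read back with the standard library. This is a minimal sketch that parses a copy of the XML above from a string; the dictionary keys in `settings` are names chosen for this illustration.

```python
import xml.etree.ElementTree as ET

# Copy of the training configuration shown above
CONFIG = """<model input="../producer/wtraining.nab.pmml"
       output="../consumer/wtraining.nab.pmml">
  <test field="Automaker" testStatistic="dDist" testType="threshold"
        threshold="0.475" weightField="Count">
    <baseline dist="discrete" file="../data/wtraining.nab" type="UniTable" />
  </test>
</model>"""

root = ET.fromstring(CONFIG)
test = root.find("test")
baseline = test.find("baseline")

# Collect the settings a reader usually wants to verify before training
settings = {
    "input_pmml": root.get("input"),
    "output_pmml": root.get("output"),
    "field": test.get("field"),
    "weight_field": test.get("weightField"),
    "threshold": float(test.get("threshold")),
    "baseline_file": baseline.get("file"),
}
for key, value in settings.items():
    print(key, "=", value)
```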


Unitable

● Unitable is used to hold the data that is read in.

● It encapsulates the data in a way that allows us to manipulate it efficiently.

● It can be thought of, in part, as a data structure holding a spreadsheet of data with columns, types, etc., together with the relevant operations which can be performed on the data and the data structure.

● More to follow.


Running the Consumer

$ cd scripts

$ python2.5 consume.py -b wtraining.nab -f wscoring.nab

Ready to score

.
|-- consumer
|   |-- wscoring.nab.wtraining.nab.xml
|   `-- wtraining.nab.pmml
|-- postprocess
|   `-- wscoring.nab.wtraining.nab.xml
`-- producer
    |-- wtraining.nab.pmml
    `-- wtraining.nab.xml

This example generates a report in the postprocess directory.


Consumer (Scoring) output

$ cat consumer/wscoring.nab.wtraining.nab.xml

<pmmlDeployment>
  <inputData>
    <readOnce />
    <batchScoring />
    <fromFile name="../data/wscoring.nab" type="UniTable" />
  </inputData>
  <inputModel>
    <fromFile name="../consumer/wtraining.nab.pmml" />
  </inputModel>
  <output>
    <report name="report">
      <toFile name="../postprocess/wscoring.nab.wtraining.nab.xml" />
      <outputRow name="event">
        <score name="score" />
        <alert name="alert" />
        <segments name="segments" />
      </outputRow>
    </report>
  </output>
</pmmlDeployment>


Scoring Report

$ cat postprocess/wscoring.nab.wtraining.nab.xml

<report>
  <event>
    <score>0.471458430077</score>
    <alert>True</alert>
    <Segments></Segments>
  </event>
</report>


Unitable

● The Unitable is one of the main components of the Augustus system.

▼ Data read into Augustus is stored in a Unitable.

▼ Results in a very fast, efficient object for data shaping, model building, and scoring, both in a batch and real-time context.

● Designed to hold data in a way which allows it to be acted upon by numpy.

▼ Takes advantage of new features and improvements which are put into numpy by the scientific Python community.

● Unitable can be used outside of the Augustus scoring flow.

▼ Find a standalone example on the wiki.


Key Features of Unitable

● File format that matches the native machine memory storage of the data, allowing for memory-mapped access to the data.

▼ No parsing or sequential reading

● Fast vector operations using any number of data columns.

● Support for demand driven, rule based calculations.

▼ Derived columns defined in terms of operations on other columns, including other derived columns, and made available when referenced.
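
The idea of demand-driven derived columns can be illustrated with a small plain-Python sketch. This is not the UniTable API: the class `LazyTable`, its `derive` method, and the sample columns are hypothetical, and plain lists stand in for the numpy arrays UniTable builds on.

```python
class LazyTable:
    """Toy column store with demand-driven derived columns."""

    def __init__(self, columns):
        self._columns = dict(columns)  # stored columns: name -> list of values
        self._rules = {}               # derived columns: name -> (function, input names)
        self.evaluations = 0           # number of derived columns computed so far

    def derive(self, name, func, *inputs):
        """Register a rule; nothing is computed until the column is referenced."""
        self._rules[name] = (func, inputs)

    def __getitem__(self, name):
        if name not in self._columns:
            func, inputs = self._rules[name]
            # Inputs may themselves be derived columns; they resolve recursively.
            args = [self[column] for column in inputs]
            self._columns[name] = [func(*row) for row in zip(*args)]
            self.evaluations += 1
        return self._columns[name]


table = LazyTable({"count": [2, 3, 5], "unit_price": [10, 20, 4]})
table.derive("revenue", lambda c, p: c * p, "count", "unit_price")
table.derive("doubled", lambda r: 2 * r, "revenue")

evaluations_before = table.evaluations  # no derived column has been computed yet
doubled = table["doubled"]              # referencing it computes revenue, then doubled
print(evaluations_before, doubled, table.evaluations)
```

Computed columns are cached on first reference, so a rule that is never referenced costs nothing.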


Key Features of Unitable (cont)

● Can handle huge real-time data rates by automatically switching to vector mode when behind, and scalar mode when keeping up with individual input events.

● Ability to invoke calculations in scalar or vector mode transparently.

▼ One set of rule definitions can be applied to an entire data set in batch mode, or to individual rows of real-time events.
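
Applying one rule definition in either mode can be sketched as below. This is a plain-Python illustration under stated assumptions, not UniTable code: `rule` and `apply_rule` are hypothetical names, and a list comprehension stands in for a true vectorized operation.

```python
def rule(count, threshold=3):
    """One rule definition: flag a record whose count exceeds the threshold."""
    return count > threshold


def apply_rule(func, data):
    """Apply the same rule to a whole column (vector mode) or one value (scalar mode)."""
    if isinstance(data, (list, tuple)):
        return [func(value) for value in data]
    return func(data)


batch_flags = apply_rule(rule, [1, 4, 7])  # batch mode: an entire column at once
event_flag = apply_rule(rule, 5)           # real-time mode: a single incoming event
print(batch_flags, event_flag)
```

Because the caller never branches on the mode, the same rule set can serve both a batch job and a stream of individual events.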


For more information

Open Data Group
400 Lathrop Avenue
River Forest IL 60305


http://code.google.com/p/augustus/