reu summer 2009 project

REU SUMMER 2009 PROJECT

Association Rule Preprocessing

By: Walter Garcia

University of Houston - Downtown

PROJECT GOALS

• Convert the Heartfelt Study data set into MAFIA format.

• Run the converted data set through the MAFIA program to find maximal frequent item sets.

• Convert the MAFIA output into SemAna format.

• Run the converted data set through the SemAna program to find unknown and correct relations.

• Use our method to validate other studies that have been performed on the Heartfelt Study.

• Find a relation that is interesting, useful, and correct that has not yet been discovered.

The Heartfelt Study examined 383 children aged 11-16 years: 140 African-American, 117 Hispanic, and 126 Non-Hispanic White.

The original Heartfelt itemset contains 16911 unique transactions, and each transaction contains 101 different attributes (items) such as heart rate, age, posture, BMI, and obesity. Here is a screenshot of the file.

MAFIA is an acronym for MAximal Frequent Itemset Algorithm. It finds the maximal frequent itemsets in a transactional dataset. MAFIA accepts input in the format below. As you can see, every transaction is a set of integers. However, the original dataset includes integers, real numbers, and “?” characters that represent missing data.

MAFIA Format Original Format

I used WEKA, a program from the University of Waikato, to analyze and discretize each attribute into 10 or fewer unique items. This assigned a unique integer to each item, as required by the MAFIA program.
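As a rough illustration of what this discretization step does, the sketch below bins one numeric attribute into at most 10 equal-width intervals and gives each interval its own integer item ID. This is not WEKA's code. Equal-width binning, the function names, the `first_item_id` offset, and the treatment of “?” as a missing item are all assumptions made for the example; the slides do not say which WEKA settings were used.

```python
def equal_width_bin(value, lo, hi, bins=10):
    """Return the 0-based index of the equal-width bin that holds value."""
    if hi == lo:
        return 0  # a constant attribute collapses into a single bin
    width = (hi - lo) / bins
    index = int((value - lo) / width)
    # Clamp so that value == hi falls into the last bin instead of bin `bins`.
    return min(max(index, 0), bins - 1)


def discretize_column(raw_values, first_item_id, bins=10):
    """Map the raw strings of one attribute to integer item IDs.

    Assumption: "?" marks a missing value and is mapped to None.
    """
    numbers = [float(v) for v in raw_values if v != "?"]
    if not numbers:
        return [None] * len(raw_values)
    lo, hi = min(numbers), max(numbers)
    items = []
    for v in raw_values:
        if v == "?":
            items.append(None)
        else:
            items.append(first_item_id + equal_width_bin(float(v), lo, hi, bins))
    return items


if __name__ == "__main__":
    # Hypothetical heart-rate readings, with one missing entry.
    print(discretize_column(["60", "85", "?", "110"], first_item_id=100))
```

Giving each attribute its own block of IDs (here starting at a hypothetical 100) keeps items from different attributes distinct, which is what a transactional miner needs.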

After discretizing the items in each attribute, I saved the results in an Excel file for cross-referencing later. Here is a screenshot:

My program converts the original itemset file into MAFIA format by performing the following actions:

• Reads a transaction as a STRING

• Converts the STRING into a character array

• Tokenizes the char array into multiple character arrays and, as it tokenizes, outputs a matching integer value to an outputset.ascii file

• Repeats until the End of File
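The conversion loop described above can be sketched roughly as follows. This is not the original program: the comma delimiter, the `item_ids` lookup keyed by (column, value), the function names, and the choice to skip “?” entries are all assumptions made for illustration.

```python
def convert_transaction(line, item_ids, delimiter=","):
    """Turn one raw transaction line into a space-separated line of item IDs.

    item_ids is a hypothetical lookup: (column index, discretized value) -> integer.
    """
    tokens = [t.strip() for t in line.strip().split(delimiter)]
    ids = []
    for column, token in enumerate(tokens):
        if token == "?":
            continue  # assumption: missing values contribute no item
        ids.append(item_ids[(column, token)])
    return " ".join(str(i) for i in ids)


def convert_file(src_path, dst_path, item_ids, delimiter=","):
    """Convert every non-empty line of src_path and write the result to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if line.strip():
                dst.write(convert_transaction(line, item_ids, delimiter) + "\n")


if __name__ == "__main__":
    # Tiny made-up lookup table with three attributes.
    lookup = {(0, "a"): 1, (1, "b"): 12, (2, "c"): 23}
    print(convert_transaction("a, ?, c", lookup))
```

Each output line then holds one transaction as a list of integers, which is the shape of input the slides describe for MAFIA.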

Once the program completes the conversion, the outputset file looks like this. It is ready for MAFIA!

When the outputset file is loaded into the MAFIA program, we get the following output:

WHAT DOES THE OUTPUT MEAN?

In the previous example we ran MAFIA with the following parameters:

mafia -mfi .7 -ascii outputset.ascii mfi.txt

This means that MAFIA will accept the input file in ASCII format and find the maximal frequent subsets of the item dataset with a minimum support of 70%, that is, subsets found in at least 11838 of the 16911 transactions.
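The 11838 figure follows from the 70% threshold and the 16911 transactions. A minimal sketch of that arithmetic is shown here; rounding up to the next whole transaction is an assumption about how the threshold is applied, and the function name is made up for the example.

```python
import math


def min_support_count(num_transactions, min_support):
    """Smallest number of transactions that meets the support fraction."""
    # round() guards against tiny floating-point overshoot before taking the ceiling.
    return math.ceil(round(num_transactions * min_support, 9))


if __name__ == "__main__":
    # 16911 * 0.7 = 11837.7, so a subset must appear in at least 11838 transactions.
    print(min_support_count(16911, 0.7))
```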

If we take one line from the MAFIA output MFI file, we can find out what it means by cross-referencing it with the Excel file. For example, we examine the line below. The number in parentheses means that the subset {351, 314, 239, 136} was found 11874 times in the dataset.

351 314 239 136 (11874)

By looking at the Excel file:

• 351 means a RELAX1 selection of 1 (Child was relaxed)

• 314 means a TAXHYN selection of 0 (Anger Traits were High)

• 239 means an AGE2 selection of <= 14 (Age less than or equal to 14 years)

• 136 means a RAW.S.AN selection of <= 113 (Raw Trait Anger score less than or equal to 113)
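The cross-referencing step can also be sketched in code. The example below parses one MFI line and looks each item up in a small table. The table holds only the four entries quoted above from the Excel file; the function names and the dictionary layout are assumptions made for illustration, not part of MAFIA.

```python
import re

# The four item IDs described above: ID -> (attribute, selection).
LOOKUP = {
    351: ("RELAX1", "1"),
    314: ("TAXHYN", "0"),
    239: ("AGE2", "<= 14"),
    136: ("RAW.S.AN", "<= 113"),
}


def parse_mfi_line(line):
    """Split a line such as '351 314 239 136 (11874)' into (items, count)."""
    match = re.fullmatch(r"\s*((?:\d+\s+)+)\((\d+)\)\s*", line)
    if match is None:
        raise ValueError("unrecognized MFI line: %r" % line)
    items = [int(token) for token in match.group(1).split()]
    return items, int(match.group(2))


def describe(line, lookup, num_transactions):
    """Return readable 'ATTRIBUTE=selection' labels and the support fraction."""
    items, count = parse_mfi_line(line)
    labels = ["%s=%s" % lookup[item] for item in items]
    return labels, count / num_transactions


if __name__ == "__main__":
    labels, support = describe("351 314 239 136 (11874)", LOOKUP, 16911)
    print(labels, support)
```

For this line the support fraction is 11874 / 16911, which is just above the 70% threshold used in the MAFIA run.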

One problem with the MAFIA output that we saw in the previous slide is that MAFIA will find every single frequent subset, including subsets that are trivial or incorrect. What we need now is a way to filter the MAFIA output to find subsets that are interesting, useful, unknown, and correct. For this we use a program called SemAna (Semantic Analyzer). When the MAFIA output set is converted and processed through the SemAna program, it places all of the trivial subsets in a file called trivial.rule and all of the unknown and correct subsets in a file called UnKnownCorrect.rule. Here is a screenshot of the unknown/correct file.

In order to validate our findings (frequent subsets) I am comparing our results to studies that have already been performed by other scientists on the Heartfelt Study to see if they match.

Validating our Method


The first study I analyzed was “Blood Pressure and Sexual Maturity in Adolescents” found in the American Journal of Human Biology (2001). This study found that Systolic Blood Pressure in adolescents increases as their Sexual Maturity increases.


Using our method I found the subsets below. This shows that as the TANNER (Sexual Maturity Measurement) increases Systolic Blood Pressure also increases.

TANNER='(1.8-2.6]' MATURE=0 SBP='(102.75-119]' ZHTCM='(-1.96-.7792]'

TANNER='(1.8-2.6]' MATURE=0 SBP='(102.75-119]' OBESITY=0

TANNER='(2.6-3.4]' MATURE=0 SBP='(102.75-119]' WHRATIO='(.775-.85]'

TANNER='(2.6-3.4]' SBP='(119-135.25]' MATURE=0

TANNER='(3.4-4.2]' SBP='(119-135.25]' APWAIST='(64.88-78.06]' MATURE=1 OBESITY=0

TANNER='(3.4-4.2]' SBP='(119-135.25]' MAP='(80-94]' MATURE=1
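One way to make the trend in these subsets easier to see is to group the SBP intervals by TANNER interval. The sketch below does this for the six subsets listed above. It is only an illustration: the parsing regex and the function names are assumptions, and it is not part of MAFIA or SemAna.

```python
import re
from collections import defaultdict

# Matches NAME='quoted interval' or NAME=bare_value.
ITEM = re.compile(r"([\w.]+)=('[^']*'|\S+)")

SUBSETS = [
    "TANNER='(1.8-2.6]' MATURE=0 SBP='(102.75-119]' ZHTCM='(-1.96-.7792]'",
    "TANNER='(1.8-2.6]' MATURE=0 SBP='(102.75-119]' OBESITY=0",
    "TANNER='(2.6-3.4]' MATURE=0 SBP='(102.75-119]' WHRATIO='(.775-.85]'",
    "TANNER='(2.6-3.4]' SBP='(119-135.25]' MATURE=0",
    "TANNER='(3.4-4.2]' SBP='(119-135.25]' APWAIST='(64.88-78.06]' MATURE=1 OBESITY=0",
    "TANNER='(3.4-4.2]' SBP='(119-135.25]' MAP='(80-94]' MATURE=1",
]


def parse_itemset(text):
    """Return {attribute: value} for one subset, with the quotes removed."""
    return {name: value.strip("'") for name, value in ITEM.findall(text)}


def sbp_by_tanner(itemsets):
    """Collect the SBP intervals that co-occur with each TANNER interval."""
    groups = defaultdict(set)
    for text in itemsets:
        items = parse_itemset(text)
        if "TANNER" in items and "SBP" in items:
            groups[items["TANNER"]].add(items["SBP"])
    return dict(groups)


if __name__ == "__main__":
    for tanner, sbp in sorted(sbp_by_tanner(SUBSETS).items()):
        print(tanner, sorted(sbp))
```

Grouped this way, the lowest TANNER interval appears only with the lower SBP interval, the middle one with both, and the highest only with the higher SBP interval.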

Future Work

• I will continue to validate more studies using our method.

• Find a relation that is interesting, useful, and correct that has not yet been discovered.
