Data Analysis using DataFlux
From: Sunil Pai
Posted on 16-Apr-2017


TRANSCRIPT

Page 1: Data Analysis using Data Flux

DATA ANALYSIS USING DATAFLUX
FROM: SUNIL PAI

Page 2: Data Analysis using Data Flux

TYPICAL USAGE - CUSTOMER DATA OPERATIONS
• Data De-Duping
• Data Standardization
• Data Analysis and Data Profiling
• Data Consolidation from various sources
• Comparing multiple data sets per predefined parameters
• Inserting data into target databases
• Match-at-a-glance reports for various new acquisitions

Page 3: Data Analysis using Data Flux

DF NODE - DATA INPUTS

DF can use various input sources, such as relational databases (via queries), Excel files, Access files, and text files.

These sources are connected via ODBC.

Examples: A query is entered in the SQL Query node by selecting a database/Access file in the node properties. For Excel, the area to read must be defined using Name Manager under the Formulas tab of the worksheet; the Data Source Input node is then used for Excel sheets.
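In DataFlux the SQL Query node reads from an ODBC data source; as a rough self-contained analogy, the sketch below uses Python's built-in SQLite driver in place of an ODBC connection (the table, column names, and data are invented for the example):

```python
import sqlite3

# Stand-in for an ODBC source: an in-memory SQLite database with sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, city TEXT)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("Delta Inc", "Atlanta"), ("Sigma Corp", "Austin")])

# The "SQL Query node": run the query and pull rows into the job flow
# as a list of dictionaries keyed by column name.
cursor = conn.execute("SELECT name, city FROM accounts ORDER BY name")
rows = [dict(zip([c[0] for c in cursor.description], r)) for r in cursor]
```

With a real ODBC DSN, only the connection line would change; the row-extraction pattern stays the same.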

Page 4: Data Analysis using Data Flux

DF NODE - DATA OUTPUTS

Using DF, we can insert a job/result output into Excel, Access, text files, or a relational database such as Oracle or SQL Server.

DF uses Insert/Update/Target/Output utilities for the data output stage.

Examples: The output result can be inserted directly into a database table using the Data Target (Insert) node.

Output can also be written to a text file via the Text File Output node.
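The two output paths above, a database insert and a delimited text file, can be sketched as follows. This is an illustrative analogy only (SQLite stands in for the target database, and the table and rows are invented):

```python
import sqlite3, io

# Job output produced earlier in the flow (sample data).
rows = [("Delta Incorporated", "Atlanta"), ("Sigma Corporation", "Austin")]

# Data Target (Insert) analogue: bulk-insert the job output into a table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (name TEXT, city TEXT)")
conn.executemany("INSERT INTO target VALUES (?, ?)", rows)
inserted = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]

# Text File Output analogue: tab-delimited text (StringIO stands in for a file).
out = io.StringIO()
for name, city in rows:
    out.write(f"{name}\t{city}\n")
text = out.getvalue()
```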

Page 5: Data Analysis using Data Flux

DF NODE – QUALITY
• Standardization

dfPower Architect's Standardization node is used to make similar items the same. The available standardization definitions include Name, Address, Organization, ZIP, Phone, Email Address, Country, State, Non-Alphanumeric Remover, Numeric Remover, Alphanumeric Remover, Space Remover, Quotation Remover, etc.

Various schemes can also be selected; these are defined in the DataFlux QKB (Quality Knowledge Base).

For example, using full company names instead of initials ("International Business Machines" vs. "IBM"):

Delta Inc → Delta Incorporated
Sigma Corp → Sigma Corporation

Page 6: Data Analysis using Data Flux

DF NODE – QUALITY

• Standardization (more examples): Addresses

1 Comcast Center → 1 Comcast Ctr
10 Glenlake Pkwy north east → 10 Glenlake Pkwy NE
"North Dakota" → "ND"
United States → USA
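The scheme-driven replacements above boil down to a token lookup. The sketch below is a minimal Python analogy, not DataFlux's actual QKB machinery; the SCHEME mapping is invented for the example:

```python
import re

# A tiny standardization "scheme" in the spirit of a QKB scheme entry:
# token (lowercased, punctuation stripped) -> standard form. Illustrative only.
SCHEME = {"inc": "Incorporated", "corp": "Corporation",
          "center": "Ctr", "ctr": "Ctr"}

def standardize(value: str) -> str:
    """Replace each token by its scheme standard, leaving others unchanged."""
    out = []
    for tok in re.split(r"\s+", value.strip()):
        key = tok.rstrip(".,").lower()
        out.append(SCHEME.get(key, tok.rstrip(".,")))
    return " ".join(out)
```

Real schemes also handle multi-word phrases and casing rules; a per-token map is the simplest case.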

Page 7: Data Analysis using Data Flux

DF NODE – QUALITY

• Parsing

dfPower Architect's Parsing node is a simple but intelligent tool for separating multi-part field values into multiple single-part fields. For example, if you have a Name field that includes the value "Mr. Igor Bela Bonski III, Esq.", you can use parsing to create six separate fields:

Name Prefix: "Mr."
Given Name: "Igor"
Middle Name: "Bela"
Family Name: "Bonski"
Name Suffix: "III"
Name Appendage: "Esq."
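A name split like the one above can be sketched with a regular expression. DataFlux parsing is definition-driven and far more flexible; this regex is an illustrative analogy that handles only the slide's one name layout:

```python
import re

# One fixed layout: prefix, given, middle, family, suffix, ", appendage".
PATTERN = re.compile(
    r"^(?P<prefix>(?:Mr|Mrs|Ms|Dr)\.)\s+"
    r"(?P<given>\w+)\s+(?P<middle>\w+)\s+(?P<family>\w+)\s+"
    r"(?P<suffix>\w+),\s*(?P<appendage>.+)$"
)

def parse_name(name: str) -> dict:
    """Return the six name parts, or an empty dict if the layout differs."""
    m = PATTERN.match(name)
    return m.groupdict() if m else {}
```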

Page 8: Data Analysis using Data Flux

DF NODE – INTEGRATION

• Match Codes

dfPower Architect's Match Codes node is used to identify duplicate records in your data. This step creates match codes, which are compared across records to find likely duplicates.

Match-code sensitivity can be set from 50% (loosest) to 100% (exact), and various schemes can be selected.

Field Name      Definition            Sensitivity
Account Name    Business Title        85%
Address_Line1   Address/Address Long  85%
City            City                  Exact-All, Exact-10 characters
Country         Country               Exact-All, Exact-10 characters
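The idea behind a match code, a normalized key that gets fuzzier as sensitivity drops, can be sketched as below. Real DataFlux match codes come from QKB definitions; this toy version only keeps the concept (normalize, drop noise words, truncate by sensitivity):

```python
import re

NOISE = {"INC", "INCORPORATED", "CORP", "CORPORATION", "LLC", "LLP"}

def match_code(value: str, sensitivity: int = 85) -> str:
    """Toy match code: uppercase, strip punctuation and legal-form noise
    words, then keep a sensitivity-dependent prefix of the result."""
    s = re.sub(r"[^A-Z0-9 ]", "", value.upper())
    s = "".join(t for t in s.split() if t not in NOISE)
    keep = max(1, round(len(s) * sensitivity / 100))
    return s[:keep]
```

Two records duplicate each other when their codes are equal, so "Textron Inc." and "Textron Incorporated" collapse to the same code.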

Page 9: Data Analysis using Data Flux

DF NODE – INTEGRATION
• Clustering

dfPower Architect's Data Clustering node employs clustering functionality to group matched duplicates (or sets of unique records) according to the conditions defined. See the cluster numbers in the example below.

Cluster  Account Name                                 Account Address1    Match Criteria
7231     New Jersey Manufacturers Insurance Company   301 Sullivan Way    Exact Company Name + Address-1
7231     New Jersey Manufacturers Insurance Company   301 Sullivan Way    Exact Company Name + Address-1
7231     New Jersey Manufacturers Insurance Company   301 Sullivan Way    Exact Company Name + Address-1
7663     Metlife, Incorporated                        27-01 Queens Plz N  Exact Company Name + Address-1
7663     Metlife, Incorporated                        27-01 Queens Plz N  Exact Company Name + Address-1
7791     Eaton Corporation                            34899 Curtis Blvd   Exact Company Name + Address-1
7791     Eaton Corporation                            34899 Curtis Blvd   Exact Company Name + Address-1
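Exact-condition clustering like the table above amounts to assigning one ID per distinct key. A minimal Python sketch (cluster numbers are arbitrary; the slide's 7231/7663/7791 are just example IDs):

```python
from itertools import count

def cluster(records, keys):
    """Give records sharing the same values on `keys` (e.g. an
    'Exact Company Name + Address' condition) the same cluster number."""
    ids = {}          # key tuple -> cluster number
    nxt = count(1)
    out = []
    for rec in records:
        k = tuple(rec[f] for f in keys)
        if k not in ids:
            ids[k] = next(nxt)
        out.append({**rec, "Cluster": ids[k]})
    return out
```

For fuzzy clustering one would cluster on match codes rather than raw field values.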

Page 10: Data Analysis using Data Flux

DF NODE – INTEGRATION

• Surviving Record Identification

dfPower Architect's Surviving Record Identification (SRI) node examines clustered data and determines a surviving record for each cluster. This process lets you eliminate duplicate information in a data source. The surviving record is identified using one or more user-configurable record rules. The user may also enter field rules to perform automated field-level edits of the surviving record's data during SRI processing. The SRI step can be configured either to keep all existing data, marking the surviving records with a flag or primary key value, or to remove all data except that associated with the surviving records.

Examples: Suppose you have a set of duplicate accounts and addresses in the system and need to keep one distinct record out of those duplicates, where the surviving record should have a proper phone number. You can use the SRI node and define the selection rule in the node's properties. See the example on the next slide.

Page 11: Data Analysis using Data Flux

DF NODE – INTEGRATION
• Surviving Record Identification: Examples (continued)

See the Cluster and Surviving Record columns below; each cluster has exactly one surviving record.

Cluster  Account Name                                 Account Address1    Phone           Surviving Record
7231     New Jersey Manufacturers Insurance Company   301 Sullivan Way    (609) 883-1300  TRUE
7231     New Jersey Manufacturers Insurance Company   301 Sullivan Way    Null            FALSE
7231     New Jersey Manufacturers Insurance Company   301 Sullivan Way    987             FALSE
7663     Metlife, Incorporated                        27-01 Queens Plz N  1-800-638-5000  TRUE
7663     Metlife, Incorporated                        27-01 Queens Plz N  Null            FALSE
7791     Eaton Corporation                            34899 Curtis Blvd   1-900-735-5674  TRUE
7791     Eaton Corporation                            34899 Curtis Blvd   Null            FALSE
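The "keep the record with a proper phone number" rule from the example can be sketched as below. This is an illustrative analogy to one SRI record rule, not the node's actual rule engine; "7+ digits" is an invented stand-in for "proper phone number":

```python
import re

def pick_survivors(records):
    """Within each cluster, the first record whose Phone contains at least
    7 digits survives; every record gets a Surviving flag, one TRUE per
    cluster (clusters with no valid phone get no survivor in this sketch)."""
    best = {}  # cluster -> index of surviving record
    for i, rec in enumerate(records):
        digits = re.sub(r"\D", "", rec.get("Phone") or "")
        if rec["Cluster"] not in best and len(digits) >= 7:
            best[rec["Cluster"]] = i
    return [dict(rec, Surviving=(best.get(rec["Cluster"]) == i))
            for i, rec in enumerate(records)]
```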

Page 12: Data Analysis using Data Flux

DF MATCH EXAMPLES
• Standardization and match codes combined in a job flow give remarkable results, as shown below.

Exact (100%) match results:

Company Name (Input) | Company Name (Matched) | Address 1 (Input) | Address (Matched)
Netscape Communications Corporation | Netscape Communications Corporation | 501 E Middlefield Rd | 501 E Middlefield Rd
Alston & Bird L L P | Alston & Bird LLP | 1201 W Peachtree St | 1201 W Peachtree St
Georgia Perimeter College | Georgia Perimeter College | 3251 Panthersville Rd | 3251 Panthersville Rd
County of Oneida | County of Oneida | 800 Park Ave | 800 Park Ave
Eli Lilly and Company | Eli Lilly and Company | PO Box 6034 | PO Box 6034
Actuate Corporation | Actuate Corporation | 2207 Bridgepointe Pkwy. Ste. 500 | 2207 Bridgepointe Pkwy Ste 500
Shriners Hospitals For Children | Shriners Hospitals For Children | 3551 N Broad St | 3551 N Broad St
Catholic Health Initiatives | Catholic Health Initiatives | 440 Creamery Way | 440 Creamery Way
El Paso Electric Company | El Paso Electric Company | 123 W Mills Ave | 123 W Mills Ave

Page 13: Data Analysis using Data Flux

DF MATCH EXAMPLES
• 75% match results:

Name (Input) | Name (Matched) | Address (Input) | Address (Matched)
Arizona State University | Arizona State University | University Dr and also Mill Ave | University Drive & Mill Avenue
Cybernet Software Systems, Inc. | Cybernet Software Systems Incorporated | 3031 Tisch Way Ste. 1002 | 3031 Tisch Way
Vertrue Inc. | Vertrue Incorporated | 20 Glover Ave. | 20 Glover Ave
Dollar Bank, FSB | Dollar Bank | 3 Gateway Center | 3 Gateway Center 8 East
Textron Inc. | Textron Incorporated | 40 Westminster Street | 40 Westminster St
Archer Technologies | Archer Technologies LLC | 13200 Metcalf, Suite 300 | 13200 Metcalf Ave
BMW Financial Services NA | BMW Financial Services NA Incorporated | 5515 Park Center Circle | 5515 Parkcenter Cir
Great American Financial Resources, Inc. | Great American Financial Resources Incorporated | 250 E. 5th St. | 250 E 5th St
Cec Entertainment, Inc. | CEC Entertainment Incorporated | 4441 W Airport Freeway | 4441 W Airport Fwy

Page 14: Data Analysis using Data Flux

DF MATCH EXAMPLES
• Loose and tight contact matches (see email addresses)

100% matches:

EMAILADDRESS (Input) | EMAIL_ADDRESS (Matched) | NAME (Input) | FIRST_NAME (Matched)
[email protected] | [email protected] | Adam Fenech | Adam Fenech
[email protected] | [email protected] | Bradd Piontek | Bradd Piontek

EMAILADDRESS (Input) | CONTACT_EMAIL_ADDRESS (Matched) | NAME (Input) | Matched Name
[email protected] | [email protected] | Brent Alexander | Brent Alexander
[email protected] | [email protected] | Chris Sims | Chris Sims

Page 15: Data Analysis using Data Flux

DF NODE – UTILITIES

• Data Joining Node

This node is used to join data from various sources, such as two different databases, Excel files, Access files, etc.

dfPower Architect's Data Joining job flow step is based on the SQL concept of JOIN. You can use Data Joining to combine two data sets in an intelligent way, so that the records of one, the other, or both data sets are used as the basis for the resulting data set.
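The JOIN semantics described above can be sketched as a hash join over two in-memory record sets. This is a generic illustration of the concept, not DataFlux's implementation:

```python
def inner_join(left, right, key):
    """Hash-join analogue of the Data Joining node: index the right-hand
    data set on the join key, then stream the left-hand rows past it,
    emitting one merged record per key match."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**match, **row})  # left-hand fields win on clash
    return joined
```

Left/right/full outer variants differ only in also emitting the unmatched rows of one or both sides.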

Page 16: Data Analysis using Data Flux

DF NODE – UTILITIES

• SQL Lookup

SQL Lookup lets the user find rows in a database table that have one or more fields matching those in the job flow. It provides a clear performance advantage, especially with large databases, since the large table is not copied locally to the hard drive in order to perform the operation (as is the case with joins).

Page 17: Data Analysis using Data Flux

DF NODE – UTILITIES

• SQL Execute

This is a stand-alone node (no parents or children) that lets you construct and execute any valid SQL statement (or series of statements). It performs database-specific tasks either before, after, or in between Architect job flows.

Examples: SQL statements such as UPDATE, DELETE, or COMMIT for a particular table can be used in this node.

Page 18: Data Analysis using Data Flux

DF NODE – UTILITIES

• Data Union

dfPower Architect's Data Union node is based on the SQL concept of UNION. As with Data Joining, use the Data Union node to combine data from two data sets. Unlike Data Joining, however, Data Union does not perform an intelligent combination. Rather, Data Union simply adds the two data sets together; the resulting data set contains one record for each record in each of the original data sets.

Examples: Data from two or more sheets, databases, or DF job flows needs to be clubbed together; this node performs that task.
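The UNION-ALL behavior described above (one output record per input record, no matching) can be sketched as below; the field-alignment step is an assumption about how sources with different layouts would be reconciled, not documented DataFlux behavior:

```python
def data_union(*datasets):
    """Data Union analogue: append record sets without matching or
    de-duplication. Fields missing from one source are filled with None
    so the combined output has a uniform layout."""
    fields = []
    for ds in datasets:
        for rec in ds:
            for f in rec:
                if f not in fields:
                    fields.append(f)
    return [{f: rec.get(f) for f in fields} for ds in datasets for rec in ds]
```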

Page 19: Data Analysis using Data Flux

DF NODE – UTILITIES

• Branch

This step lets multiple children (up to 32) simultaneously access data from a single source. Depending on the step's configuration and the children's access patterns, data can be passed from the parent directly to each child, or it may be stored temporarily in memory and/or disk caches before being passed to the children.

In other words, it can be one input with multiple outputs (max 32).

Page 20: Data Analysis using Data Flux

DF NODE – UTILITIES

• Concatenate

dfPower Architect's Concatenate node performs the opposite function of the Parsing node. Rather than separating a single field into multiple fields, Concatenate combines one or more fields into a single field.

Example:
Prefix: Mr, First Name: Rahul, Last Name: Jain
Concatenate output: Mr Rahul Jain

Page 21: Data Analysis using Data Flux

DF NODE – UTILITIES

• Expression

Use dfPower Architect's Expression node to run a Visual Basic-like language to process your data sets in ways that are not built into dfPower Studio. The Expression language provides many statements, functions, and variables for manipulating data.

Example: creating a Match_Criteria column in the middle of a job flow. The syntax would be:

Expression:                Match_Criteria = ""
Pre-processing expression: string Match_Criteria

Page 22: Data Analysis using Data Flux

DF NODE – UTILITIES
• Data Sorting

Use dfPower Architect's Data Sorting node to re-order (ascending or descending) your data set at any point in a job flow.

Page 23: Data Analysis using Data Flux

DF NODE – PROFILING
• Basic Statistics

dfPower Architect's Basic Statistics node is used to calculate statistics about your data, such as value ranges, counts, or sums for any given field.

The Basic Statistics node is typically used on numeric rather than text fields; however, statistics such as Count, Missing, Max, and Min can be useful on any field type.

It can also be used in the middle of a job for fault finding, by checking the record counts at each step.

Example: basic statistics for a Siebel table

Field         Records  Count   Null Count  Distinct  Min                Max
Row_Id        267413   267413  0           yes       1 0-5200           1 O-2
Created       267413   267413  0           yes       1/1/1980 0:00      9/9/2010 21:55
Created_By    267413   267413  0           yes       0-1                1-XVOET
Account Name  267413   267413  0           yes                          ültje GmbH
Partner Flag  267413   267413  0           no        N                  Y
Email Addr    267413   5       267408      yes       [email protected]  [email protected]
Phone         267413   72552   194861      yes       ###iswrong         xxxxxxxxx
CSN           267413   181643  85770       yes       1
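The profile measures in the table, records, non-null count, null count, distinctness, min and max, can be sketched for one field as below (a generic illustration, not the node itself):

```python
def basic_stats(rows, field):
    """Compute the slide's profile measures for one field of a record set.
    Treats None and empty string as null."""
    values = [r.get(field) for r in rows]
    present = [v for v in values if v not in (None, "")]
    return {
        "Records": len(values),
        "Count": len(present),
        "Null Count": len(values) - len(present),
        "Distinct": "yes" if len(set(present)) == len(present) else "no",
        "Min": min(present) if present else None,
        "Max": max(present) if present else None,
    }
```

Running it on each column of a table reproduces one row of the profile above; odd Min/Max values like "###iswrong" are exactly how such a profile surfaces bad data.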

Page 24: Data Analysis using Data Flux

DF NODE – PROFILING

• Pattern Analysis

dfPower Architect's Pattern Analysis node is used to generate a new field containing alphanumeric patterns that represent each value in a selected field. You can specify whether these patterns represent each character or each word (as separated by spaces) in a field.
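Both pattern modes can be sketched as below; the symbol choices (letter → "A", digit → "9", mixed word → "*") are illustrative assumptions, not DataFlux's exact pattern alphabet:

```python
def pattern(value: str, by_word: bool = False) -> str:
    """Character-level pattern: letters -> 'A', digits -> '9', other
    characters kept as-is. With by_word=True, each space-separated word
    collapses to a single symbol ('*' when the word mixes symbol types)."""
    def sym(ch):
        return "A" if ch.isalpha() else "9" if ch.isdigit() else ch
    if not by_word:
        return "".join(sym(c) for c in value)
    def word_sym(word):
        syms = {sym(c) for c in word}
        return syms.pop() if len(syms) == 1 else "*"
    return " ".join(word_sym(w) for w in value.split())
```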

Page 25: Data Analysis using Data Flux

DF NODE – PROFILING
• Frequency Distribution

dfPower Architect's Frequency Distribution node is used to calculate the number of occurrences of each unique value in a field. For example, Frequency Distribution can determine how many customers in your customer database are in each of the 50 US states, the District of Columbia, and the 13 Canadian provinces and territories.

State  Count of Customers  %Total
CA     19593               12
CO     4041                2
CT     2807                1
DC     2555                1
DE     746                 0
FL     7105                4
GA     5198                3
GE     1                   0

GEO           GEO_count  GEO %
Americas      187235     57
Asia Pacific  30642      9
EMEA          107412     33
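The State/Count/%Total layout above is a straightforward value count; a minimal sketch (generic illustration, not the node):

```python
from collections import Counter

def frequency_distribution(rows, field):
    """Count occurrences of each unique value of `field` and its share of
    the total, most frequent first: value -> (count, percent)."""
    counts = Counter(r[field] for r in rows)
    total = sum(counts.values())
    return {v: (n, round(100 * n / total)) for v, n in counts.most_common()}
```

A single-occurrence value like the "GE" row above is exactly the kind of outlier (an invalid state code) this profile is meant to expose.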

Page 26: Data Analysis using Data Flux

DF NODE – PROFILING

• Data Validation

dfPower Architect's Data Validation node is used to analyze the content of data by setting validation conditions. These conditions build validation expressions that you can use to filter data for a more accurate view of that data.
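Condition-based filtering of this kind can be sketched as predicates over each row. The rule names and predicates below are invented for the example:

```python
def validate(rows, conditions):
    """Split rows into valid and invalid sets. `conditions` maps a rule
    name to a predicate over the row; each row is returned together with
    the list of rule names it failed (empty for valid rows)."""
    valid, invalid = [], []
    for row in rows:
        failed = [name for name, pred in conditions.items() if not pred(row)]
        (invalid if failed else valid).append((row, failed))
    return valid, invalid
```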

Page 27: Data Analysis using Data Flux

DF NODE – ENRICHMENT

• Address Verification

The dfPower Architect Address Verification (US/Canada/World) node is used to verify, correct, and enhance the addresses in your existing data. Address Verification (US/Canada/World) uses geographic information from various reference databases to match and standardize addresses. You can also use Address Verification (US/Canada) for proper casing and CASS/SERP compliance. Addresses are classified according to the result codes listed on the next slides, which give the status of each address, i.e., how valid it is.

Page 28: Data Analysis using Data Flux

DF NODE – ENRICHMENT
• For US Addresses

Text Result Code  Numeric Result Code  Description
OK                0                    Address was verified successfully.
PARSE             11                   Error parsing address. Components of the address may be missing.
CITY              12                   Could not locate city/state or ZIP in the USPS database. At least (city and state) or ZIP must be present in the input.
MULTI             13                   Ambiguous address. There were two or more possible matches for this address with differing data.
NOMATCH           14                   No matching address found in the USPS data.
OVER              15                   One or more input strings is too long (maximum 100 characters).

Page 29: Data Analysis using Data Flux

• For Canada Addresses

Result Code  Description
0            No error occurred
1            Internal error
2            Cannot load database
3            Invalid - unspecified reason
4            Invalid civic number
5            Invalid street
6            Invalid unit
7            Invalid delivery mode
8            Invalid delivery installation
9            Invalid city
10           Invalid province
11           Invalid postal code
12           Address is not Canadian

Page 30: Data Analysis using Data Flux

• Rest of World (excluding US and Canada)

Result Code  Description
0            Address correct as entered.
1            Address corrected automatically.
2            Address needs to be corrected, but could not …
3            Address needs to be corrected, but could not be determined automatically. There is a fair …
4            Address needs to be corrected, but could not be determined automatically. There is a small …

Page 31: Data Analysis using Data Flux

DF NODE – MONITORING
• Data Monitoring

The Data Monitoring node enables you to analyze data according to business rules you create using the Business Rule Manager. The rules you create in Rule Manager can analyze the structure of the data and trigger an event, such as logging a message or sending an email alert, when a condition is detected. By using the Data Monitoring node, you can insert these business rules into your job flow to analyze data at various points in the flow.