qualitystage_ investigate stage intro • pr3 systems blog


Upload: muralikrishna

Post on 02-Oct-2015



    QualityStage: Investigate Stage Introduction

    September 24, 2014, by Artur Kopec, in Information Management

    Introduction

    How confident are you in your data quality? Many organizations would reluctantly admit that their data quality is not at a level they are happy with. Data quality is an issue that plagues many IT departments across the world. Managers and executives can be more confident in their decisions if there is an initiative to improve data quality. QualityStage is the data quality application belonging to the InfoSphere Information Server platform. One basic concept in QualityStage is data investigation, and this post will show you how to use the Investigate stage.

    The Investigate Stage

    The Investigate stage allows you to assess some of your current data quality issues. Investigating the data falls under one of two broad categories:

    Single Domain Investigation: Used to investigate individual single-domain columns. Examples of single-domain columns include First Name, Zip Code, and SSN.

    Free Form Investigation: Used to investigate free-form fields, or columns which contain more than one domain. For example: 1444 N. Farnsworth Ave. Aurora IL

    Single domain investigations produce reports which let you view frequency distributions of values as well as formats of the data. Free form investigations additionally require a standardization rule to be applied to the data. Once the data has been standardized, the stage produces two reports: a token report and a pattern report.

    Types of Investigations

    The Investigate stage can be found under the Data Quality category in the palette of the DataStage Designer. The stage is capable of running three types of investigations:

    Character Discrete: Useful for investigating single domains. Requires choosing a mask (C, T, or X) for the investigated columns.

    Character Concatenate: Useful for investigating relationships between single-domain fields. This investigation concatenates the values of the selected columns before running the investigation. Requires choosing a mask (C, T, or X) for the investigated columns.

    Word: Useful for investigating free-form fields. Requires a standardization rule to be applied to the data.
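    To make the Character Concatenate idea more concrete, here is a minimal Python sketch of the concept, not of the stage itself. The rows, the Title and Gender column names, and the "|" separator are all hypothetical and chosen only for readability. The sketch joins the values of two columns and counts how often each combination occurs, which is the kind of relationship this investigation type is meant to surface.

```python
from collections import Counter

# Hypothetical sample rows; the column names Title and Gender are
# illustrative and do not come from the post's source data.
rows = [
    {"Title": "MR", "Gender": "M"},
    {"Title": "MS", "Gender": "F"},
    {"Title": "MR", "Gender": "M"},
    {"Title": "MR", "Gender": "F"},
]


def concatenate_columns(rows, columns, separator="|"):
    """Join the chosen column values of each row into one string.

    The separator is an assumption made for readability only.
    """
    return [separator.join(row[col] for col in columns) for row in rows]


combined = concatenate_columns(rows, ["Title", "Gender"])
frequencies = Counter(combined)

# Print each combination with its count, most frequent first.
for value, count in frequencies.most_common():
    print(value, count)
```

    A rare combination such as MR with F stands out in the counts and would be a candidate for a closer look.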

    More on Masks

    Masks influence how the stage will investigate the data. A mask has to be specified for every character position in the column. You can use any combination of masks within each column being investigated. There are three types of masks: C, T, and X.


    Character Mask    Usage
    C                 View the actual character values of the data
    T                 View the pattern of the data
    X                 Ignore the character of the data

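    Here are some examples of using different masks in the Investigate stage. When using a T mask, letters are represented with an a and digits are represented with an n. As the examples show, a blank is represented with a b, and special characters such as hyphens are left unchanged.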

    Data              Mask              Result
    111-11-1111       CCCCCCCCCCC       111-11-1111
    111-11-1111       TTTTTTTTTTT       nnn-nn-nnnn
    (123) 456-7890    CCCCCCCCCCCCCC    (123) 456-7890
    (123) 456-7890    CCCCCXXXXXXXXX    (123)
    X234 CX21         TTTTTTTTT         annnbaann
    X234 CX21         TTTTXXXXX         annn
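    The mask rules in the table above can be imitated with a few lines of Python. This is only a sketch of the behavior the examples imply, not QualityStage's own implementation. The function name apply_mask is hypothetical, and the handling of values that are shorter or longer than the mask (the extra positions are simply ignored) is an assumption.

```python
def apply_mask(value: str, mask: str) -> str:
    """Apply a per-character C/T/X mask to a value.

    C keeps the character, T replaces it with its pattern symbol,
    and X drops it. Positions beyond the shorter of value and mask
    are ignored (an assumption of this sketch).
    """
    result = []
    for char, mask_char in zip(value, mask):
        if mask_char == "C":
            result.append(char)
        elif mask_char == "T":
            if char.isalpha():
                result.append("a")   # letter
            elif char.isdigit():
                result.append("n")   # digit
            elif char == " ":
                result.append("b")   # blank
            else:
                result.append(char)  # special characters stay as-is
        elif mask_char == "X":
            continue                 # ignore this position
        else:
            raise ValueError(f"unknown mask character: {mask_char!r}")
    return "".join(result)


# Reproduce three rows of the table above.
print(apply_mask("111-11-1111", "TTTTTTTTTTT"))
print(apply_mask("(123) 456-7890", "CCCCCXXXXXXXXX"))
print(apply_mask("X234 CX21", "TTTTTTTTT"))
```

    Mixing masks in one column, as in the CCCCCXXXXXXXXX row, lets you focus on part of a value, here the area code of a phone number.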

    QualityStage Job

    We'll take a look at two different examples of using the Investigate stage. We will run a Character Discrete investigation on two columns from our source data. Let's create a very simple job using a source Sequential File stage, the Investigate stage, and a target Sequential File stage.

    The source data is customer data with some quality issues which have to be investigated. Here is a sample of the data:

    Character Discrete with a C Mask

    First we will run a Character Discrete investigation on the SourceSystem column using a C mask. Let's open up the Investigate stage. The stage defaults to a Character Discrete investigation. You can choose which type of investigation to run by clicking on the associated button at the top of the window. You should see all of the available input columns.

    We'll select the SourceSystem column and click on the Add To Selected Columns button. You'll be presented with a new window which lets you choose which mask you want to use. We'll select the C mask and then click on the OK button.

    You will see that the SourceSystem column was moved from the Available Data Columns section to the

    Scheduled Process section. You can also see which mask was chosen for the column. You can add additional

    columns to investigate as well if you would like.

    Click on the Stage Properties button in order to open a new window representing the stage properties. Map the

    columns to the output in the Output > Mapping tab. Now we will close out of the Investigate stage. The target

    Sequential File stage has already been configured so we can save, compile, and run the job.

    Once the job completes, you can view the investigation report in the target stage. Since we were using a C mask, the qsInvPattern column will always have the same value as the qsInvSample column. The last two columns represent a frequency distribution of the 3 different values which the stage found for the SourceSystem column. We can see that about 53% of our data came from the A source system, 13% from the H source system, and 34% from the L source system.

    Character Discrete Investigate with a T Mask

    Now let's run a different investigation. We'll open up the Investigate stage, select the SourceSystem column, and click on the Delete button. We will now run a Character Discrete investigation for the PolicyNumber column with a T mask. We can select the PolicyNumber column, click on the Add To Selected Columns button, and use a T mask for every character.

    Click on the OK button and notice that the PolicyNumber column has moved from the Available Data Columns section to the Scheduled Process section. It is defined with the T mask.

    Let's close the Investigate stage and run our job. Once the job completes, you can view the investigation report in the target stage. You can see that the stage found 5 different formats for the PolicyNumber column, represented by the qsInvPattern column. The qsInvSample column holds a sample value taken from the incoming data so that you can easily view it in the investigation report. The qsInvCount and qsInvPercent columns represent our frequency distribution of the 5 different formats. You can see that about 3% of the data was in the aanaannnnnn format, 17% was in the aanannnnnnn format, 63% was in the aannannnnnn format, less than 1% was in the nnnnnnnn format, and 16% was in the nnnnnnnnn format. Remember that an a in the pattern represents a letter and an n represents a digit.
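    For readers who want to see how such a report comes together, here is a rough Python sketch that builds a similar frequency distribution. It is an approximation under stated assumptions: the policy numbers are made up, the helper names t_pattern and investigate are hypothetical, and only the four column names mentioned above (qsInvPattern, qsInvSample, qsInvCount, qsInvPercent) are taken from the report. Keeping the first value seen as the sample is also an assumption.

```python
from collections import Counter


def t_pattern(value: str) -> str:
    """Pattern of a value under an all-T mask: a = letter, n = digit, b = blank."""
    def classify(char: str) -> str:
        if char.isalpha():
            return "a"
        if char.isdigit():
            return "n"
        if char == " ":
            return "b"
        return char  # special characters stay as-is
    return "".join(classify(char) for char in value)


def investigate(values):
    """Return one report row per distinct pattern, most frequent first."""
    counts = Counter()
    samples = {}
    for value in values:
        pattern = t_pattern(value)
        counts[pattern] += 1
        samples.setdefault(pattern, value)  # keep the first value seen as the sample
    total = len(values)
    return [
        {
            "qsInvPattern": pattern,
            "qsInvSample": samples[pattern],
            "qsInvCount": count,
            "qsInvPercent": round(100 * count / total, 2),
        }
        for pattern, count in counts.most_common()
    ]


# Made-up policy numbers, for illustration only.
policy_numbers = ["AB12C345678", "XY34Z901234", "123456789", "CD56E123456"]
report = investigate(policy_numbers)
for row in report:
    print(row)
```

    Running the same function with the raw value in place of t_pattern gives the C mask behavior from the first example, where the pattern and the sample are identical.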


    About the Author

    Artur Kopec is a software instructor for PR3 Systems training services. He instructs courses covering IBM's InfoSphere Information Server suite of applications, with a primary focus on the technical aspects of the ETL (extract, transform, and load) tool, DataStage.


    The Investigate stage is useful for assessing the frequency distributions of values and formats of your data. Once you have a better understanding of your data, you can more efficiently cleanse it during a downstream process. The stage can also be used to see how well a standardization rule handles incoming data. We'll take a look at that in another post.


    IBM Premier Business Partner

    Contact Info

    ILLINOIS, USA 1444 N. Farnsworth Ave, Suite 200, Aurora, IL 60505

    Phone: 630-364-1469

    Fax: 630-364-5719


    Web: http://pr3systems.com

    Copyright 2015 PR3 Systems Blog