dqs mds-matching 15042015

DQS/MDS INTROS & DQS MATCHING Microsoft SQL Server 2012 SQL Server 2014 Neil Hambly SQL Server Evangelist / Practice Lead PASS London Chapter Leader Melissa Data MVP PASS Virtual Chapter “Professional Development” Leader Contributing Author

Upload: neil-hambly

Post on 09-Aug-2015


TRANSCRIPT

Page 1: Dqs mds-matching 15042015

DQS/MDS INTROS & DQS MATCHING

Microsoft SQL Server 2012 / SQL Server 2014

Neil Hambly

SQL Server Evangelist / Practice Lead

PASS London Chapter Leader

Melissa Data MVP

PASS Virtual Chapter “Professional Development” Leader

Contributing Author

Page 2: Dqs mds-matching 15042015

Agenda

DQS Intro

MDS Intro

What is record matching?

Data Issues

DQS Matching Process

Matching Policy

DQS Data Matching Principles

Matching Project

Page 3: Dqs mds-matching 15042015

Data Cleansing:

Modification, removal, or correction of data that is incorrect or incomplete, either computer-assisted or interactively.

Matching:

Identification of duplicates in a rules-based process so that you can perform de-duplication. Data quality can also be verified using a reference data provider, for example reference data services from Azure Marketplace providers.

Profiling:

Analysis of data to provide insight into its quality during the domain management, matching, and data cleansing processes. Profiling is a powerful tool in a DQS data quality solution.

Monitoring:

Determine the state of data quality activities. Validate that the data quality solution is doing what it was designed to do.

Knowledge Base:

DQS is a knowledge-driven solution that analyzes data using knowledge built with DQS. Create data quality processes that enhance the knowledge about your data, continuously improving data quality.

Page 4: Dqs mds-matching 15042015

• Create a Matching Policy

• Data Quality Matching

• Match Similar Data

Page 5: Dqs mds-matching 15042015

Master Data Services Configuration Manager

Tool used to create and configure Master Data Services databases and web applications.

Master Data Manager

Web application used to perform administrative tasks (such as creating a model or business rule) and that users access to modify master data.

MDSModelDeploy.exe

Tool used to create packages of your model objects and data, for deploying to other environments.

Master Data Services web service

Web service that developers can use to build custom solutions for Master Data Services.

Master Data Services Add-in for Excel

Used to manage data and create new entities and attributes.

Page 6: Dqs mds-matching 15042015
Page 7: Dqs mds-matching 15042015
Page 8: Dqs mds-matching 15042015
Page 9: Dqs mds-matching 15042015

Import Example

Page 10: Dqs mds-matching 15042015

Record matching is the task of identifying records that refer to the same real-world entity.

Page 11: Dqs mds-matching 15042015

The Cost of Duplicate Data

…a few examples…

Direct marketing communications are doubled up unnecessarily.

Product shipments and customer-site services could be sent to the wrong address because an incorrect duplicate record was used.

Your sales reporting may be inaccurate due to an over-inflated number of customers.

Sales analysis may be inaccurate because sales are split between multiple records that represent the same customer, resulting in an undervaluing of some key customers.

Page 12: Dqs mds-matching 15042015

Where do Duplicate Records come from?

Poorly designed software: no verification of existing records upon entry.

Formatting & abbreviations: "Doctor Robert Smith" vs. "Dr. Bob Smith".

Data validation: human errors can creep into the system when field input is not validated.

Company mergers and acquisitions: merging systems may result in duplicates in the merged data.

Change of attributes: the same person may appear not to exist in the database if some of the attributes were changed (e.g., address, name).

Page 13: Dqs mds-matching 15042015

…Data Issues…

There are different ways to represent the same person or address in a database:

Data is ‘fuzzy’ in nature (spelling mistakes, abbreviations etc.).

Page 14: Dqs mds-matching 15042015

How Do Data Issues Affect Matching?

Matching Results

Matching Results Reasoning

The Data

Page 15: Dqs mds-matching 15042015
Page 16: Dqs mds-matching 15042015

Diagram labels: Build, Use, Knowledge Management, DQ Projects, Knowledge Base, Sample Data, Integrated Profiling, Notifications, Progress, Status

Page 17: Dqs mds-matching 15042015

Identifies exact and approximate matches, enabling removal of duplicate data.

Enables creating a matching policy interactively using a computer-assisted process.

Ensures that values that are equivalent, but were entered in a different format or style, are in fact rendered uniform.

Page 18: Dqs mds-matching 15042015
Page 19: Dqs mds-matching 15042015

A matching policy is prepared in the knowledge base.

A matching policy consists of matching rules that assess how well one record matches another.

In each rule, specify whether the records’ values have to be an exact match, can be similar, or are a prerequisite.

Train your policy by running and tuning each rule separately.

Page 20: Dqs mds-matching 15042015

Identify the attributes in your data that are most significant for matching.

Create domains/composite domains based on your data structure.

Define matching rules.

Example domains:

Single domains: Birth Date, Gender, Email, Phone

Composite Domain Full Name: F. Name, M. Name, L. Name

Composite Domain Full Address: Street, City, State, Country

Page 21: Dqs mds-matching 15042015

Similarity: select Similar if field values can be similar; select Exact if field values must be identical.

Weight: determines the contribution of each domain in the rule to the overall matching score for two records.

Prerequisite: requires that field values return a 100% match; otherwise the records are not considered a match.

Minimum matching score: the threshold at or above which two records are considered to be a match.
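
As a rough illustration of how these four parameters interact, the sketch below scores two records against one rule. It is not the DQS algorithm: `RuleDomain`, `score_pair`, the sample field names, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for this example.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class RuleDomain:
    """One domain entry in a hypothetical matching rule."""
    name: str
    similarity: str = "similar"   # "similar" or "exact"
    weight: float = 0.0           # share of the overall score (weights sum to 1.0)
    prerequisite: bool = False    # must match 100%; carries no weight here


def domain_similarity(a: str, b: str, mode: str) -> float:
    """Return a value in 0.0-1.0. Stand-in measure, not the one DQS uses."""
    a, b = a.strip().lower(), b.strip().lower()
    if mode == "exact":
        return 1.0 if a == b else 0.0
    return SequenceMatcher(None, a, b).ratio()


def score_pair(rec1: dict, rec2: dict, rule: list) -> float:
    """Weighted score for two records; 0.0 if any prerequisite fails."""
    total = 0.0
    for d in rule:
        if d.prerequisite:
            if domain_similarity(rec1[d.name], rec2[d.name], "exact") < 1.0:
                return 0.0
            continue
        total += d.weight * domain_similarity(rec1[d.name], rec2[d.name], d.similarity)
    return total


rule = [
    RuleDomain("gender", prerequisite=True),
    RuleDomain("last_name", "exact", weight=0.4),
    RuleDomain("first_name", "similar", weight=0.6),
]
MIN_MATCHING_SCORE = 0.8  # threshold at or above which a pair is a match

a = {"gender": "M", "first_name": "Robert", "last_name": "Smith"}
b = {"gender": "M", "first_name": "Robbert", "last_name": "Smith"}
c = {"gender": "F", "first_name": "Robert", "last_name": "Smith"}

print(score_pair(a, b, rule) >= MIN_MATCHING_SCORE)  # small typo, same gender
print(score_pair(a, c, rule) >= MIN_MATCHING_SCORE)  # prerequisite fails
```

Treating the prerequisite as a gate rather than a weighted term mirrors the description above: a failed prerequisite rules the pair out regardless of how well the other domains agree.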

Page 22: Dqs mds-matching 15042015

Domains of type ‘Date’, ‘Integer’ or ‘Decimal’ can be matched using the ‘Similar’ property by assigning a tolerance, expressed either as a percentage or as an integer.

Field values that fall within the defined tolerance are considered a match.
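
The tolerance idea can be sketched as follows. The function names are hypothetical, and the exact way DQS computes a percentage tolerance or a date tolerance is not stated in the slides, so the definitions below (percentage of the larger magnitude, date tolerance in days) are assumptions.

```python
from datetime import date


def within_percent(a: float, b: float, tolerance_pct: float) -> bool:
    """True if a and b differ by at most tolerance_pct percent of the larger magnitude."""
    largest = max(abs(a), abs(b))
    if largest == 0:
        return True  # both values are zero
    return abs(a - b) / largest * 100 <= tolerance_pct


def within_integer(a: int, b: int, tolerance: int) -> bool:
    """True if the absolute difference is at most the integer tolerance."""
    return abs(a - b) <= tolerance


def within_days(a: date, b: date, tolerance_days: int) -> bool:
    """True if two dates are at most tolerance_days apart (assumed unit: days)."""
    return abs((a - b).days) <= tolerance_days


print(within_percent(100.0, 97.0, 5))                     # 3% apart, 5% tolerance
print(within_integer(10, 13, 2))                          # 3 apart, tolerance 2
print(within_days(date(1980, 1, 1), date(1980, 1, 3), 2)) # 2 days apart
```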

Page 23: Dqs mds-matching 15042015

Uniqueness | Usage | Description | Domains

Low | Define as Prerequisite; define with lower weights | Provides discriminatory information | Gender, City, State

High | Define as Similar or Exact; define with higher weights | Provides highly identifiable information and is highly discriminatory | Names (First, Last, Company), Address Line 1

Completeness | Usage | Description

Low | Do not use or define with low weight | High level of missing values

High | Include for matching if the column provides highly identifiable information | Low level of missing values
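
To decide where a column falls in these tables, you could profile it before writing rules. The sketch below is an assumption-laden stand-in for DQS profiling: `profile_column`, the sample rows, and the two ratio definitions (completeness as the share of non-missing values, uniqueness as distinct values over non-missing values) are illustrative only.

```python
def profile_column(values: list) -> dict:
    """Completeness = share of non-missing values; uniqueness = distinct / non-missing."""
    present = [v for v in values if v not in (None, "")]
    completeness = len(present) / len(values) if values else 0.0
    uniqueness = len(set(present)) / len(present) if present else 0.0
    return {"completeness": completeness, "uniqueness": uniqueness}


# Hypothetical sample data.
rows = [
    {"last_name": "Smith", "gender": "M", "fax": None},
    {"last_name": "Jones", "gender": "F", "fax": None},
    {"last_name": "Brown", "gender": "M", "fax": "555-0100"},
    {"last_name": "Patel", "gender": "F", "fax": None},
]

profile = {col: profile_column([r[col] for r in rows]) for col in rows[0]}
for col, stats in profile.items():
    print(col, stats)
```

In this sample, `last_name` is complete and highly unique (a candidate for Similar or Exact with a higher weight), `gender` is complete but has low uniqueness (a candidate for Prerequisite or a lower weight), and `fax` is mostly missing (a candidate to leave out).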

Page 24: Dqs mds-matching 15042015

• The Matching Results tab displays statistics for the current and previous run of a matching rule.

• Restore the previous rule.

Page 25: Dqs mds-matching 15042015

Home Team Song Artist

Page 26: Dqs mds-matching 15042015
Page 27: Dqs mds-matching 15042015

The DQS matching system uses the knowledge accumulated in the knowledge base to propose matching candidates. This knowledge includes:

Synonyms, Syntax Errors and their Leading Value (by domain)

Domain values and their synonyms and syntax errors are used by the matching system to find identical or similar records.

Term-Based Relations (TBR)

TBR improves the consistency of data attribute values by transforming data values to a single form using user-defined term relations. In matching, TBRs are only applied in-memory to boost matching accuracy.

Nulls and Equivalents (“Unknown”, “99999”…)

Manage values that represent missing data by linking them to the ‘DQS_Null’ value to ensure that they are considered a match.
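
A minimal sketch of how this kind of knowledge could be applied before comparing values. The dictionaries, the function names, and the use of Python's `None` to stand in for ‘DQS_Null’ are all hypothetical; a real knowledge base is built and stored in DQS, not in code like this.

```python
from typing import Optional

# Hypothetical knowledge for illustration only.
LEADING_VALUES = {          # synonym or syntax error -> leading value
    "nyc": "New York",
    "new york city": "New York",
    "new yrok": "New York",
}
TERM_RELATIONS = {          # term-based relations, applied per token
    "st.": "Street",
    "st": "Street",
    "rd.": "Road",
    "rd": "Road",
}
NULL_EQUIVALENTS = {"", "unknown", "n/a", "99999"}


def to_leading_value(value: str) -> Optional[str]:
    """Map a value to its leading value; null equivalents become None."""
    key = value.strip().lower()
    if key in NULL_EQUIVALENTS:
        return None  # stands in for the DQS_Null value
    return LEADING_VALUES.get(key, value.strip())


def apply_term_relations(value: str) -> str:
    """Rewrite each token to its single agreed form, in memory only."""
    return " ".join(TERM_RELATIONS.get(tok.lower(), tok) for tok in value.split())


def values_match(a: str, b: str) -> bool:
    """Compare two values after mapping both to their leading values."""
    return to_leading_value(a) == to_leading_value(b)


print(values_match("NYC", "New Yrok"))
print(values_match("Unknown", "99999"))
print(apply_term_relations("1834 E. 42nd St."))
```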

Page 28: Dqs mds-matching 15042015

Example:

String 1 | String 2 | Similarity Score (Before) | Similarity Score (After) | Character

175 CLEARBROOK ROAD P.O. BOX 535 | 175 CLEARBROOK ROAD P.O.BOX 535 | 0.92 | 1.00 | .

1834 E. 42ND STREET | 1834 E. 42ND. ST. | 0.695 | 0.857 | .

1721 DE KALB AVE, NE | 1721 DE KALB AVE NE | 0.88 | 1.00 | ,

14538 S. GARFIELD AVE., BLDG. 1-B | 14538 S GARFIELD AVE BLDG 1B | 0.676 | 0.944 | , . -

#704, SJ Technoville BD, 60-19 | 704 SJ Technoville BD 60 19 | 0.65 | 1.00 | # , -
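
The effect in the table above can be approximated with a short sketch, assuming that "Before" and "After" refer to the score computed with and without the listed characters. `strip_chars` and `similarity` are hypothetical helpers, and `difflib.SequenceMatcher` is a stand-in measure, so the numbers it produces will not equal the DQS scores shown.

```python
from difflib import SequenceMatcher


def strip_chars(text: str, chars: str) -> str:
    """Replace each listed character with a space, then collapse whitespace."""
    table = str.maketrans({c: " " for c in chars})
    return " ".join(text.translate(table).split())


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in 0.0-1.0 (not the DQS measure)."""
    return SequenceMatcher(None, a, b).ratio()


s1 = "1721 DE KALB AVE, NE"
s2 = "1721 DE KALB AVE NE"

before = similarity(s1, s2)
after = similarity(strip_chars(s1, ","), strip_chars(s2, ","))
print(round(before, 3), round(after, 3))
```

Replacing the characters with a space, rather than deleting them, keeps tokens such as "60-19" and "60 19" aligned while still letting "P.O. BOX" and "P.O.BOX" collapse to the same form.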

Page 29: Dqs mds-matching 15042015
Page 30: Dqs mds-matching 15042015

A Matching project is performed in three steps:

Mapping - map source columns to domains.

Matching - run matching and view the results; it includes additional functionality such as:

• Reject records

• Filter results by ‘Matched’ & ‘Unmatched’ and by matching score.

• Display clusters in two different methods (overlapping and non-overlapping)

Export - export both matching results (clusters) and survivors (unique records).

Page 31: Dqs mds-matching 15042015

In Overlapping clusters, a record may appear more than once across the clustered results. This structure may be harder to read since the same record exists in multiple clusters.

In Non-Overlapping clusters, the system unifies clusters containing the same record. This structure is easier to read because no record appears more than once.

Overlapping Clusters: (A~B), (B~C)

Non-Overlapping Cluster: (A~B~C)
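
One way to picture the unification of overlapping clusters is a union-find pass over the matched pairs. This is a sketch of the general technique, not a description of how DQS implements it; `merge_clusters` is a hypothetical name.

```python
def merge_clusters(pairs):
    """Union-find over matched pairs: clusters sharing a record are unified."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for record in parent:
        clusters.setdefault(find(record), set()).add(record)
    return sorted(sorted(c) for c in clusters.values())


overlapping = [("A", "B"), ("B", "C"), ("D", "E")]
print(merge_clusters(overlapping))
```

With the pairs (A~B) and (B~C) as input, B links the two clusters, so A, B and C end up in a single non-overlapping cluster while D and E form their own.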

Page 32: Dqs mds-matching 15042015

Overlapping Clusters

Non-Overlapping Clusters

Page 33: Dqs mds-matching 15042015

Select the Rejected check box to move a record out of its proposed cluster; the change takes effect when you move to the next page in the activity. Unlike a cleansing project, where records move between tabs instantly, rejected records are not removed from the clusters shown in the user interface.

DQS Client User Interface

Exported Matching Results

Page 34: Dqs mds-matching 15042015

Matching and survivorship results can be exported to a SQL table, Excel or CSV file for further analysis or consumption.
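
For a sense of what the two exported outputs contain, here is a small sketch that writes clusters and survivors as CSV. Everything in it is hypothetical: the `export_results` function, the column names, and in particular the "most complete record" survivorship rule, which is assumed here only to have something to select a survivor with.

```python
import csv
import io


def most_complete(records):
    """Assumed survivorship rule: keep the record with the most filled-in fields."""
    return max(records, key=lambda r: sum(1 for v in r.values() if v))


def export_results(clusters, results_file, survivors_file):
    """Write every clustered record to one file and one survivor per cluster to another."""
    fields = ["cluster_id"] + list(clusters[0][0].keys())
    results = csv.DictWriter(results_file, fieldnames=fields)
    survivors = csv.DictWriter(survivors_file, fieldnames=fields)
    results.writeheader()
    survivors.writeheader()
    for cluster_id, records in enumerate(clusters, start=1):
        for record in records:
            results.writerow({"cluster_id": cluster_id, **record})
        survivors.writerow({"cluster_id": cluster_id, **most_complete(records)})


clusters = [
    [
        {"name": "Dr. Bob Smith", "email": ""},
        {"name": "Doctor Robert Smith", "email": "rsmith@example.com"},
    ],
    [{"name": "Ann Lee", "email": "ann@example.com"}],
]

results_buf, survivors_buf = io.StringIO(), io.StringIO()
export_results(clusters, results_buf, survivors_buf)
print(survivors_buf.getvalue())
```

The in-memory buffers keep the example self-contained; passing files opened with `open(path, "w", newline="")` would write the same rows to disk.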

Page 35: Dqs mds-matching 15042015

Minimum matching score parameter in a Matching Rule