insight beyond beyond measure measure larger than life ? presented by adrian hodgson

21
How big ? Jobcentre & client activity data - 11 million people 80 Gbytes of data per month on DLT Processing 20 tables & 250m records per month Largest tables have 80 million records, 5 Gbytes Key extract retrievals take 6 - 20 hours to run Full SPSS extracts for clients - 5 Gbytes Twenty five SIR databases - over 100 Gbytes from 1 Mbyte to 15 Gbytes

Post on 18-Dec-2015

218 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

How big ? Jobcentre & client activity data - 11 million

people

80 Gbytes of data per month on DLT

Processing 20 tables & 250m records per month

Largest tables have 80 million records, 5 Gbytes

Key extract retrievals take 6 - 20 hours to run

Full SPSS extracts for clients - 5 Gbytes

Twenty five SIR databases - over 100 Gbytes

from 1 Mbyte to 15 Gbytes

Page 2: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

Evaluation Database - Overview Background & project requirements

Setting the ‘environment’

Visual PQL

The ‘Data Dictionary’

Program generators

Fuzzy matching

Use of the data

New developments

Questions

Page 3: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

Background Government program started in January 1998

New Deal for Young People was set up to encourage and assist unemployed groups into full time sustainable employment. It gives unemployed people aged 18-24 the opportunity to develop their potential, gain skills and experience and find work.

Employment Service needed an evaluation database

March 98 - ORC issues first extract with 18000 clients

June 98 - program expanded to over 25’s

June 99 - expanded to cover all Jobcentre clients

October 00 - Source database migrates from six regional Ingres databases to single Oracle database

Page 4: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

Contract extension -press notice The new contract will run to May 2004, with an option to run a further

two years to May 2006.

ORC International’s database tools are designed to help the ES evaluate the service it provides to all clients registered on the Government’s JobCentre computer system.They allow the ES to regularly monitor clients with sustainable jobs, the effectiveness of equal opportunities measures, and the relationship between job vacancies and the labour market skills base.

“It is our intention to create a project web site, which will allow multiple level access to different categories of users. This will include project documentation and progress reports, access to tabulations and small data extracts, and customer feedback areas, as well as links to other related sites.”

ORC International is part of Opinion Research Corporation, which was founded in 1938 with offices in the United States, Europe, Asia, Latin America and Africa, the Company provides integrated marketing services to both businesses and governments in more than 100 countries. http://www.orc.co.uk

Page 5: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

Project Requirements LMS (Labour Market System) is a multi-user transaction system used in Jobcentres

Needed new database with evaluative functionality

Linked to additional data-sets including clerical

Flexibility to change structure periodically

Combine cross regional records for the same client

Extracts provided to ES for statistical purposes

Page 6: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

Setting the environmentPROGRAM

PQL CONNECT DATABASE 'CLMIDS’ PREFIX '<CLI_MIDS>'

END PROGRAM

SET DATABASE CLMIDS

SET PROCFILE '<LMSPROC>’ | reset to main procedure file

( main procedure file held in separate .SR4 file )

Midlands database prefix

retval = globals ( ’CLI_MIDS ’ , '\\urmston\d\50413\sirdb\client\')

and so on for the procedure file & other databases

Page 7: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

Visual PQL (1)

execute dbms ‘CALL ddict.cprog ($’ + tablenam + ‘$,$’ + region + ‘$,$’ + extract + ‘$,$’ + newdata + ‘$,$’ + editdata + ‘$,$’ + dbnum + ‘$)’

Page 8: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

Visual PQL (2)

Page 9: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

Linking SIR to other software - Winzip

open inf /dsnvar=fnamein /iostat=ierr1 /write /lrecl=300

ifthen ( ierr1 ne 0)

. write 'File not found ’

write // 'Unzipping file using winzip

pql escape "C:\Program Files\WinZip\WINZIP32.EXE -e zipfn fnamein "

write 'Back to sir !!!!!! '

c Build in a loop to check that input file now unzipped and readyc wait and repeat if not

else. write 'File opened OK ' fnamein /endif

Page 10: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

The ‘Data Dictionary’

Data Dictionar

y

Tablename

Field names

Data types

Start / End Columns

Sort ids

Include / exclude

& date flags

Generate Schemas

Edit flags

Page 11: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

Automatic schema generation Source field names, labels data types read from

HTML into data dictionary database

Program to create SIR variable names

strip out underscore characters ‘_’

trims field length to eight

if duplicated replaces eighth character with a number

Program sets column positions based on field types

Data types converted from Ingres / Oracle to SIR

Procedure for Date Integers and true date fields

Page 12: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

Load programs - the report files (1)

Client_action record

Read 78224996

\ fixed_width_data \ 2001_05 \north: 19156652

east: 6949927

west: 11515552

south: 11774525

northwest: 11352025

midlands: 12147512

oracle: 5328791

dodgy: 12

Page 13: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

Load program generation (2)

Develop standard file naming & directory location

Read pre -processing report files to pick up numbers of records to be loaded

Detect any missing report files using iostat values

Write loading program code to pql files

Read these pql files back into the procedure file

Page 14: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

Load generator -finishing touches (3)

open inf6 / dsn = 'Q:\ps\50413\ProgGen\ldmidlands1.txt' / read / lrecl = 250open ouf6 / dsn = 'Q:\ps\50413\ProgGen\ldmidlands2.txt' / write / lrecl = 250

write ( ouf6 ) 'PROCEDURE INITLOAD.<6>:T’ | add the procedure header line / 'call initload.dropall' | add call to drop all databases // 'call initload.connmids' / | add call to connect midlands

loop. read (inf6, iostat = ierr6) textline(a250) | copy the rest of the program. if (ierr6 ne 0) exit loop . write (ouf6) textlineend loop

write (ouf6) 'call initload.dropmids' | add module call to drop database / 'END PROCEDURE’ | add the END PROCEDURE

close (inf6) | close the input and output filesclose (ouf6)

pread 'Q:\ps\50413\ProgGen\ldmidlands2.txt’ | Pread the SIR pql

Page 15: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

Load program - Midlands database (4)

call initload.dropall | drop all connected databases

call initload.connmids | connect the midlands database

call initload.update (\\Urmston\d\50413\, sir_input_files\lms\2001_04\Midlands\edited\,

d:\50413\sirdb\client\Midlands\log\2001_04\, Midlands_client2001_04ORCIDsort, edt, 0.75, 1, 1, 181450, 0)

call initload.update (\\Urmston\d\50413\, sir_input_files\lms\2001_04\Midlands\edited\, d:\50413\sirdb\client\Midlands\log\2001_04\, Midlands_client_action2001_04ORCIDsort, edt, 0.75, 2, 1, 236835, 0)

……. | more call to load data

call initload.dropmids | disconnect the midlands database

Page 16: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

Fuzzy matching Linking data by best combinations of Nino, name,dob

Stripping non significant text

blanks, apostrophes, hyphens

Methods grown organically on case by case basis

Variety of scoring methods eg surname matches

HODGSON 4 points direct matches

HODGESON 3points * 0.9 for misaligned -> 6.7/8 =84%

Flexible generic modules applicable to all match types

Reporting of false positive and negative matches

Manual review of near /doubtful matches

Page 17: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

Generic fuzzy matching - issues Key fields different sizes and names across

applications

Some key fields absent or with high % missing values

Quality of key fields varies widely

Matching varies from a handful to millions of records

Bringing Access and SIR together - Visual PQL ?

How to assess ‘false’ positives when no other common fields in data sets being matched

-> set of core procedures with options to bypass ?

Page 18: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

General Issues Continual growth in database & extract size

Data Irregularities

Embedded carriage returns in text fields

Date formats (American /English)

The team

Keeping the routine /repeated processing ‘interesting’

Mushrooming similar code - keeping it generic /’black box’

SIR

several large tabfiles required - largest currently 11Gb

Some retrievals crash with sir.exe error 1 time in 5 -why?

P4 /Windows 2000 mixed performance

Page 19: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

New Developments (1) - Project web site Management tool for display of project statistics

Focus for collecting project documents ( Word )

FAQ’s

Glossary of abbreviations & acronyms ?

which variable should I use for this ?

How is leaving date derived ?

Where can I get the latest data dictionary for NDYP ?

What’s the ORC variable name for expct_start_date ?

Small sampling /extraction tools

Links to other related government sites

Page 20: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

New Developments (2) - Processing Full benefits data set arrives on Monday

Increasingly complex extracts in SPSS and SAS

Moving more of the data processing to Linux

Generic fuzzy matching tools

Adding other data , deprivation index , surveys

Sir 2002 - reading an 800 byte record into a single string

Secondary indexing /lookups - replace Case 0

Better linking of the processing zipping /spss /excel /HTML

Questions

Page 21: INSIGHT BEYOND BEYOND MEASURE MEASURE Larger than Life ? Presented By Adrian Hodgson

INSIGHTINSIGHT BEYONDBEYOND MEASUREMEASURE

Larger than the Evaluation Database ?Presented By

Adrian Hodgson