sugi 28, paper 4 sugi 25 paper 129: paul m. dorfman private detectives in a data warehouse:...
Post on 16-Dec-2015
234 Views
Preview:
TRANSCRIPT
SUGI 28, PAPER 4
SUGI 25
Paper 129: Paul M. DorfmanPrivate Detectives in a Data Warehouse: Key-Indexing, Bitmapping, and Hashing
SUGI 26Paper 8: Paul M. DorfmanTable Look-Up by Direct Addressing: Key-Indexing — Bitmapping — Hashing
Paper 128: Paul M. DorfmanQuick Disk Table Look-up via Hybrid Indexing into a Directly Addressed SAS Data Set
SUGI 27Paper 12: Paul M. Dorfman and Gregg P. SnellHashing Rehashed
GENERATION I
Hand-Coded Direct-Addressing Routines Presented At:
Associative Array
SUGI 27Presentation: Jason Secosky
The DATA step in Version 9: What’s New?
SAS® Version 9 DATA Step Object: Associative Array
Hash
GENERATION II
==
1
2
Main Entry: pro·pae·deu·tic Pronunciation: "prO-pi-'dü-tik, -'dyü-Function: nounEtymology: Greek propaideuein to teach beforehand, from pro- before + paideuein to teach, from paid-, pais childDate: 1798
1: preparatory study or instruction
Merriam-Webster online http://www.m-w.com
PROPAEDEUTICS
1
3
Main Entry: di·rect ad·dress·ing Pronunciation: "d&-'rekt, &-'dres-ingFunction: speedy table lookup
1: accessing key values “directly” by their location (address, node) in a table, as opposed to searching for them by comparing the search key to all or some table values
2: key-indexed search
Hashing Rehashed. SUGI 27, Orlando, FL, 2002.
DIRECT ADDRESSING
21
WORK.SMALL
KEY S_SAT
185 00
971 01
400 02
260 03
922 04
970 05
543 06
532 07
050 08
067 09
data match ; array hkey (0:999) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; if missing (hkey(key)) then hkey(key) = s_sat ; end ; do until ( eof2 ) ; set large end = eof2 ; s_sat = hkey(key) ; if s_sat > . then output ; end ; stop ;run ;
ARRAY HKEY(0:999)
HKEY(000)=.
…
HKEY(185)=.
HKEY(400)=.
HKEY(971)=.
…
HKEY(999)=.
DEFINE
KEY-INDEXING
43
WORK.SMALL
KEY S_SAT
185 00
971 01
400 02
260 03
922 04
970 05
543 06
532 07
050 08
067 09
data match ; array hkey (0:999) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; if missing (hkey(key)) then hkey(key) = s_sat ; end ; do until ( eof2 ) ; set large end = eof2 ; s_sat = hkey(key) ; if s_sat > . then output ; end ; stop ;run ;
ARRAY HKEY(0:999)
HKEY(000)=.
…
HKEY(185)=.
HKEY(400)=.
HKEY(971)=.
…
HKEY(999)=.
00
LOAD
DEFINE
KEY-INDEXING
2
WORK.SMALL
KEY S_SAT
185 00
971 01
400 02
260 03
922 04
970 05
543 06
532 07
050 08
067 09
data match ; array hkey (0:999) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; if missing (hkey(key)) then hkey(key) = s_sat ; end ; do until ( eof2 ) ; set large end = eof2 ; s_sat = hkey(key) ; if s_sat > . then output ; end ; stop ;run ;
ARRAY HKEY(0:999)
HKEY(000)=.
…
HKEY(185)=.
HKEY(400)=.
HKEY(971)=.
…
HKEY(999)=.
00
01
DEFINE
LOAD
KEY-INDEXING
2
WORK.SMALL
KEY S_SAT
185 00
971 01
400 02
260 03
922 04
970 05
543 06
532 07
050 08
067 09
data match ; array hkey (0:999) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; if missing (hkey(key)) then hkey(key) = s_sat ; end ; do until ( eof2 ) ; set large end = eof2 ; s_sat = hkey(key) ; if s_sat > . then output ; end ; stop ;run ;
ARRAY HKEY(0:999)
HKEY(000)=.
…
HKEY(185)=.
HKEY(400)=.
HKEY(971)=.
…
HKEY(999)=.
00
02
01
DEFINE
LOAD
KEY-INDEXING
2
WORK.SMALL
KEY S_SAT
185 00
971 01
400 02
260 03
922 04
970 05
543 06
532 07
050 08
067 09
data match ; array hkey (0:999) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; if missing (hkey(key)) then hkey(key) = s_sat ; end ; do until ( eof2 ) ; set large end = eof2 ; s_sat = hkey(key) ; if s_sat > . then output ; end ; stop ;run ;
ARRAY HKEY(0:999)
HKEY(000)=.
…
HKEY(185)=.
HKEY(400)=.
HKEY(971)=.
…
HKEY(999)=.
00
02
01
DEFINE
LOAD
KEY-INDEXING
2
WORK.SMALL
KEY S_SAT
185 00
971 01
400 02
260 03
922 04
970 05
543 06
532 07
050 08
067 09
data match ; array hkey (0:999) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; if missing (hkey(key)) then hkey(key) = s_sat ; end ; do until ( eof2 ) ; set large end = eof2 ; s_sat = hkey(key) ; if s_sat > . then output ; end ; stop ;run ;
ARRAY HKEY(0:999)
HKEY(000)=.
…
HKEY(185)=.
HKEY(400)=.
HKEY(971)=.
…
HKEY(999)=.
00
01
02
SEARCH
DEFINE
LOAD
KEY-INDEXING
1
How can one possibly use key-indexing when SSN (or any other large or non-integer value) is the key?
LIMITATIONS
Key Must Be Integer
Key Range Limited By Memory
(9-Digit SSN Would Require 8 GB)
So you may be asking yourself...
KEY-INDEXING
321
Main Entry: 1hash Pronunciation: 'hashFunction: transitive verbEtymology: French hacher, from Old French hachier, from hache battle-ax, of Germanic origin; akin to Old High German hAppa sickle; akin to Greek koptein to cutDate: 1590
1 a : to chop (as meat and potatoes) into small pieces
Merriam-Webster online http://www.m-w.com
HASHING
21
Main Entry: 1hash·ing Pronunciation: 'hash-ingFunction: make BIG keys smaller
1: converting a long-range key (numeric or character) to a smaller-range integer number with a mathematical algorithm or function that must be:
• Rapidly computable
• Distribute the resulting keys uniformly
• Produce an integer in [0:HSIZE-1] range
2: H = MOD(KEY,HSIZE)
Private Detectives in a Data Warehouse. SUGI 25, Indianapolis, IN, 2000.
HASHING
21
GENERATIONS
Which generation of hashing is the most efficient for subsetting LARGE based on the values of KEY in SMALL to produce a file MATCH?
data small ; input key s_sat ; cards ; 185 00 971 11 400 22 260 33 922 44 970 55 543 66 532 77 050 88 067 99 ; run ;
Our Mission:Match unsorted SMALL
(satellite variable S_SAT)To unsorted LARGE
LARGE cannot be sorted
Assume:
Memory will hold SMALL
(integer variable KEY)
7654321
h = mod(key,&hsize);
%let load = 0.625;
data _null_;
do p=ceil(p/&load) by 1 until(j=up+1);
up = ceil(sqrt(p));
do j=2 to up until(not mod(p,j)); end;
end;
call symput('hsize',left(put(p,best.)));
stop;
set small nobs=p;
run;
%put hsize=&hsize;
hash_size=17
GENERATION I
Choose a proper function
Determine Load Factor
Calculate Optimal Array Size
KEY HASH_ADDR
185
971
400
260
922
970
543
532
050
067
Hash15
Hash
Hash
Hash
Hash
Hash
HashHash
Hash
Hash
02
09
05
04
01
16
05
16
16
Collision Resolution(Linear Probing)
54321
%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...
GENERATION I
KEY S_SAT
185 00
971 11
400 22
260 33
922 44
970 55
543 66
532 77
050 88
067 99
HKEY HSAT
(00)=. (00)=.
(01)=. (01)=.
(02)=. (02)=.
(03)=. (03)=.
(04)=. (04)=.
(05)=. (05)=.
(06)=. (06)=.
(07)=. (07)=.
(08)=. (08)=.
(09)=. (09)=.
(10)=. (10)=.
(11)=. (11)=.
(12)=. (12)=.
(13)=. (13)=.
(14)=. (14)=.
(15)=. (15)=.
(16)=. (16)=.
(17)=. (17)=.
DEFINELOADSEARCH
DEFINE• Load Factor• Array Size• Define
13121110
KEY S_SAT
185 00
971 11
400 22
260 33
922 44
970 55
543 66
532 77
050 88
067 99
KEY S_SAT H(KEY)
185 00
971 11
400 22
260 33
922 44
970 55
543 66
532 77
050 88
067 99
%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...
GENERATION I
DEFINE
HKEY HSAT
(00)=. (00)=.
(01)=. (01)=.
(02)=. (02)=.
(03)=. (03)=.
(04)=. (04)=.
(05)=. (05)=.
(06)=. (06)=.
(07)=. (07)=.
(08)=. (08)=.
(09)=. (09)=.
(10)=. (10)=.
(11)=. (11)=.
(12)=. (12)=.
(13)=. (13)=.
(14)=. (14)=.
(15)=. (15)=.
(16)=. (16)=.
(17)=. (17)=.
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes
then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...
LOAD
15
185 00
9
%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...
GENERATION I
KEY S_SAT H(KEY)
185 00 15
971 11
400 22
260 33
922 44
970 55
543 66
532 77
050 88
067 99
DEFINE
HKEY HSAT
(00)=. (00)=.
(01)=. (01)=.
(02)=. (02)=.
(03)=. (03)=.
(04)=. (04)=.
(05)=. (05)=.
(06)=. (06)=.
(07)=. (07)=.
(08)=. (08)=.
(09)=. (09)=.
(10)=. (10)=.
(11)=. (11)=.
(12)=. (12)=.
(13)=. (13)=.
(14)=. (14)=.
(15)=185 (15)=00
(16)=. (16)=.
(17)=. (17)=.
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes
then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...
LOAD
02
11971
9
%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...
GENERATION I
KEY S_SAT H(KEY)
185 00 15
971 11 02
400 22
260 33
922 44
970 55
543 66
532 77
050 88
067 99
DEFINE
HKEY HSAT
(00)=. (00)=.
(01)=. (01)=.
(02)=971 (02)=11
(03)=. (03)=.
(04)=. (04)=.
(05)=. (05)=.
(06)=. (06)=.
(07)=. (07)=.
(08)=. (08)=.
(09)=. (09)=.
(10)=. (10)=.
(11)=. (11)=.
(12)=. (12)=.
(13)=. (13)=.
(14)=. (14)=.
(15)=. (15)=.
(16)=. (16)=.
(17)=. (17)=.
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes
then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...
LOAD
185 00
09
22400
9
%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...
GENERATION I
KEY S_SAT H(KEY)
185 00 15
971 11 02
400 22 09
260 33
922 44
970 55
543 66
532 77
050 88
067 99
DEFINE
HKEY HSAT
(00)=. (00)=.
(01)=. (01)=.
(02)=971 (02)=11
(03)=. (03)=.
(04)=. (04)=.
(05)=. (05)=.
(06)=. (06)=.
(07)=. (07)=.
(08)=. (08)=.
(09)=400 (09)=22
(10)=. (10)=.
(11)=. (11)=.
(12)=. (12)=.
(13)=. (13)=.
(14)=. (14)=.
(15)=. (15)=.
(16)=. (16)=.
(17)=. (17)=.
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes
then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...
LOAD
185 00
05
33260
9
%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...
GENERATION I
KEY S_SAT H(KEY)
185 00 15
971 11 02
400 22 09
260 33 05
922 44
970 55
543 66
532 77
050 88
067 99
DEFINE
HKEY HSAT
(00)=. (00)=.
(01)=. (01)=.
(02)=971 (02)=11
(03)=. (03)=.
(04)=. (04)=.
(05)=260 (05)=33
(06)=. (06)=.
(07)=. (07)=.
(08)=. (08)=.
(09)=400 (09)=22
(10)=. (10)=.
(11)=. (11)=.
(12)=. (12)=.
(13)=. (13)=.
(14)=. (14)=.
(15)=. (15)=.
(16)=. (16)=.
(17)=. (17)=.
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes
then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...
LOAD
185 00
04 44922
9
%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...
GENERATION I
KEY S_SAT H(KEY)
185 00 15
971 11 02
400 22 09
260 33 05
922 44 04
970 55
543 66
532 77
050 88
067 99
DEFINE
HKEY HSAT
(00)=. (00)=.
(01)=. (01)=.
(02)=971 (02)=11
(03)=. (03)=.
(04)=922 (04)=44
(05)=260 (05)=33
(06)=. (06)=.
(07)=. (07)=.
(08)=. (08)=.
(09)=400 (09)=22
(10)=. (10)=.
(11)=. (11)=.
(12)=. (12)=.
(13)=. (13)=.
(14)=. (14)=.
(15)=. (15)=.
(16)=. (16)=.
(17)=. (17)=.
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes
then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...
LOAD
185 00
01
55970
9
%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...
GENERATION I
KEY S_SAT H(KEY)
185 00 15
971 11 02
400 22 09
260 33 05
922 44 04
970 55 01
543 66
532 77
050 88
067 99
DEFINE
HKEY HSAT
(00)=. (00)=.
(01)=970 (01)=55
(02)=971 (02)=11
(03)=. (03)=.
(04)=922 (04)=44
(05)=260 (05)=33
(06)=. (06)=.
(07)=. (07)=.
(08)=. (08)=.
(09)=400 (09)=22
(10)=. (10)=.
(11)=. (11)=.
(12)=. (12)=.
(13)=. (13)=.
(14)=. (14)=.
(15)=. (15)=.
(16)=. (16)=.
(17)=. (17)=.
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes
then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...
LOAD
185 00
16
66543
9
%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...
GENERATION I
KEY S_SAT H(KEY)
185 00 15
971 11 02
400 22 09
260 33 05
922 44 04
970 55 01
543 66 16
532 77
050 88
067 99
DEFINE
HKEY HSAT
(00)=. (00)=.
(01)=970 (01)=55
(02)=971 (02)=11
(03)=. (03)=.
(04)=922 (04)=44
(05)=260 (05)=33
(06)=. (06)=.
(07)=. (07)=.
(08)=. (08)=.
(09)=400 (09)=22
(10)=. (10)=.
(11)=. (11)=.
(12)=. (12)=.
(13)=. (13)=.
(14)=. (14)=.
(15)=. (15)=.
(16)=543 (16)=66
(17)=. (17)=.
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes
then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...
LOAD
185 00
77532
05
98
%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...
GENERATION I
KEY S_SAT H(KEY)
185 00 15
971 11 02
400 22 09
260 33 05
922 44 04
970 55 01
543 66 16
532 77 05
050 88
067 99
DEFINE
HKEY HSAT
(00)=. (00)=.
(01)=970 (01)=55
(02)=971 (02)=11
(03)=. (03)=.
(04)=922 (04)=44
(05)=260 (05)=33
(06)=. (06)=.
(07)=. (07)=.
(08)=. (08)=.
(09)=400 (09)=22
(10)=. (10)=.
(11)=. (11)=.
(12)=. (12)=.
(13)=. (13)=.
(14)=. (14)=.
(15)=. (15)=.
(16)=543 (16)=66
(17)=. (17)=.
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes
then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...
LOAD
185 00
16
88050
765
%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...
GENERATION I
KEY S_SAT H(KEY)
185 00 15
971 11 02
400 22 09
260 33 05
922 44 04
970 55 01
543 66 16
532 77 05
050 88 16
067 99
DEFINE
HKEY HSAT
(00)=050 (00)=88
(01)=970 (01)=55
(02)=971 (02)=11
(03)=. (03)=.
(04)=922 (04)=44
(05)=260 (05)=33
(06)=. (06)=.
(07)=. (07)=.
(08)=. (08)=.
(09)=400 (09)=22
(10)=. (10)=.
(11)=. (11)=.
(12)=. (12)=.
(13)=. (13)=.
(14)=. (14)=.
(15)=. (15)=.
(16)=543 (16)=66
(17)=. (17)=.
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes
then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...
LOAD
185 00
16
99067
4321
%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat); retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_; array hsat (0:&hsize) _temporary_; ...
GENERATION I
KEY S_SAT H(KEY)
185 00 15
971 11 02
400 22 09
260 33 05
922 44 04
970 55 01
543 66 16
532 77 05
050 88 16
067 99 16
DEFINE
HKEY HSAT
(00)=050 (00)=88
(01)=970 (01)=55
(02)=971 (02)=11
(03)=067 (03)=99
(04)=922 (04)=44
(05)=260 (05)=33
(06)=. (06)=.
(07)=. (07)=.
(08)=. (08)=.
(09)=400 (09)=22
(10)=. (10)=.
(11)=. (11)=.
(12)=. (12)=.
(13)=. (13)=.
(14)=. (14)=.
(15)=. (15)=.
(16)=543 (16)=66
(17)=. (17)=.
*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep=key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey(0:&hsize) _temporary_; array hsat(0:&hsize) _temporary_; do until ( eof1 ) ; set small end = eof1 ; do h=mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h)=key and nodupes
then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; ...
LOAD
185 00
SEARCH
...
do until ( eof2 ) ; set large end = eof2 ; do h=mod (key, &hsize) by +1 until ( hkey(h) = . ) ; if h = &hsize then h = 0 ; if hkey(h) = key then do ; s_sat = hsat(h) ; output ; if nodupes then leave ; end ; end ; end ; stop ; run ;
h = mod(key,&hsize);
%let load = 0.625;
data _null_;
do p=ceil(p/&load) by 1 until(j=up+1);
up = ceil(sqrt(p));
do j=2 to up until(not mod(p,j)); end;
end;
call symput('hsize',left(put(p,best.)));
stop;
set small nobs=p;
run;
%put hsize=&hsize;
hash_size=17
Choose a proper function
Determine Load Factor
Calculate Optimal Array Size
KEY HASH_ADDR
185
971
400
260
922
970
543
532
050
067
Hash15
Hash
Hash
Hash
Hash
Hash
HashHash
Hash
Hash
02
09
05
04
01
16
05
16
16
Collision Resolution(Linear Probing)
GENERATION II
54321
GENERATION II
DEFINE
LOAD
SEARCH
data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); rc = hh.DefineKey ( 'key' ); rc = hh.DefineData ( 's_sat' ); rc = hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; rc = hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; rc = hh.find () ; if rc = 0 then output ; end ; stop ; run ;
LOAD
SEARCH
DEFINE
321
return code = object.method;
GENERATION II
data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); rc = hh.DefineKey ( 'key' ); rc = hh.DefineData ( 's_sat' ); rc = hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; rc = hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; rc = hh.find () ; if rc = 0 then output ; end ; stop ; run ;
data match ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;
data match ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;
dcl hash hh ();
Parameter Type MatchingVerbose Syntax
DEFINE
7654321
set small(keep=key s_sat) point = _n_ ;
GENERATION II
data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); rc = hh.DefineKey ( 'key' ); rc = hh.DefineData ( 's_sat' ); rc = hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; rc = hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; rc = hh.find () ; if rc = 0 then output ; end ; stop ; run ;
LOAD
data match ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;
DEFINE set small(keep=key s_sat) point = _n_ ;dcl hash hh ();(dataset:'small');
1
GENERATION II
data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); rc = hh.DefineKey ( 'key' ); rc = hh.DefineData ( 's_sat' ); rc = hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; rc = hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; rc = hh.find () ; if rc = 0 then output ; end ; stop ; run ;
LOAD
SEARCH
data match ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof1 ) ; set small end = eof1 ; hh.add () ; end ; do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;
DEFINE set small(keep=key s_sat) point = _n_ ;dcl hash hh ();(dataset:'small');
DEFINE
GENERATIONS
%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep = key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_ ; array hsat (0:&hsize) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; do h = mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h) = key and nodupes
then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; do until ( eof2 ) ; set large end = eof2 ; do h = mod (key, &hsize) by +1 until ( hkey(h) = . ) ; if h = &hsize then h = 0 ; if hkey(h) = key then do ; s_sat = hsat(h) ; output ; if nodupes then leave ; end ; end ; end ; stop ; run ;
data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;
set small(keep=key s_sat) point = _n_ ;dcl hash hh ();(dataset:'small');
32
data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;
set small(keep=key s_sat) point = _n_ ;dcl hash hh ();(dataset:'small');
%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep = key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_ ; array hsat (0:&hsize) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; do h = mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h) = key and nodupes
then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; do until ( eof2 ) ; set large end = eof2 ; do h = mod (key, &hsize) by +1 until ( hkey(h) = . ) ; if h = &hsize then h = 0 ; if hkey(h) = key then do ; s_sat = hsat(h) ; output ; if nodupes then leave ; end ; end ; end ; stop ; run ;
LOAD
GENERATIONS
1
data match ( drop = rc ) ; length key $9 s_sat 8 ; declare AssociativeArray hh (); hh.DefineKey ( 'key' ); hh.DefineData ( 's_sat' ); hh.DefineDone (); do until ( eof2 ) ; set large end = eof2 ; if hh.find ()=0 then output ; end ; stop ; run ;
set small(keep=key s_sat) point = _n_ ;dcl hash hh ();(dataset:'small');
%let load = 0.625 ; data _null_ ; do p=ceil(p/&load) by 1 until(j=up+1); up = ceil(sqrt(p)); do j=2 to up until(not mod(p,j)); end; end; call symput('hsize', left(put(p,best.))); stop; set small nobs=p;run;*0 if dupes to be pulled ; %let nodupes = 1 ; data match (keep = key s_sat l_sat) ; retain nodupes &nodupes.. ; array hkey (0:&hsize) _temporary_ ; array hsat (0:&hsize) _temporary_ ; do until ( eof1 ) ; set small end = eof1 ; do h = mod (key, &hsize) by +1 ; if h = &hsize then h = 0 ; if hkey(h) = key and nodupes
then leave ; if hkey(h) = . then do ; hkey(h) = key ; hsat(h) = s_sat ; leave ; end ; end ; end ; do until ( eof2 ) ; set large end = eof2 ; do h = mod (key, &hsize) by +1 until ( hkey(h) = . ) ; if h = &hsize then h = 0 ; if hkey(h) = key then do ; s_sat = hsat(h) ; output ; if nodupes then leave ; end ; end ; end ; stop ; run ;
SEARCH
GENERATIONS
Modify your hash code
• %let nodupes = 1 ; • if nodupes then leave ;
Three options !
• hh.add (); Keeps 1st occurrence
• hh.replace (); Keeps last
• create additional keys to discriminate further
DUPLICATE KEYS
CHALLENGES
4321
Change your hash function
• rescale the key h = mod (key*1000, &hsize)
• use a numeric informat h=mod (input(key,16.), &hsize)
Change your array declaration
• array hkey(0:&hsize) $9 _temporary_
Already taken care of !
• set small(keep=key s_sat) point = _n_ ;
• hh.DefineKey ( 'key' );
CHALLENGES
21
NON-INTEGER KEYS• Fractional signed SAS numbers• Digit strings (char var of digits)• Arbitrary characters
DYNAMIC TABLE PROCESSING
Simple coding
• DO LOOP through the array
do h=0 to &hsize – 1; if missing(hkey(h)) then continue; key = hkey(h); s_sat = hsat(h); < … further processing … >end;
GENERATION I GENERATION II
Use the HASH ITERATOR !
Dcl hash hh (dataset: ‘’sample’, ordered: 1);dcl hiter hi (‘hh’);< … hh.define stuff … >do rc=hi.first() by 0 while (rc=0); (key is automatically populated) rc=hi.next();end;
CHALLENGES
654321
COMPOSITE KEYS
Change your hash function
• Possible but VERY difficult
• Complexity grows as the number of components to the key grows
GENERATION I GENERATION II
Just add the key(s) !
• set small(keep=k1 k2 k3 s_sat) point = _n_ ;
• hh.DefineKey ( 'k1', 'k2', 'k3' );
CHALLENGES
321
PEEK UNDER THE HOOD
STARFLEET COMMAND ENGINEERINGACCESSING VERSION 9 SCHEMATICSACCESSING VERSION 9 SCHEMATICSACCESSING VERSION 9 SCHEMATICSACCESSING VERSION 9 SCHEMATICSACCESSING VERSION 9 SCHEMATICSACCESSING VERSION 9 SCHEMATICS
PEEK UNDER THE HOOD
DATA STEP COMPONENT INTERFACEHASH ITERATORADELSON-VELSKII & LANDIS TREES
PEEK UNDER THE HOOD
DATA STEP COMPONENT INTERFACE
declare AssociativeArray hh
• instantiates or creates the object
( dcl hash myhash )
PEEK UNDER THE HOOD
DATA STEP COMPONENT INTERFACE
hh.DefineKey
• define a set of hash keys
PEEK UNDER THE HOOD
DATA STEP COMPONENT INTERFACE
hh.DefineData
• define a set of hash table satellites
PEEK UNDER THE HOOD
DATA STEP COMPONENT INTERFACE
hh.DefineDone
• tell SAS the definitions are done
PEEK UNDER THE HOOD
DATA STEP COMPONENT INTERFACE
hh.Add
• insert the key and satellites
(if the key is not yet in the table)
PEEK UNDER THE HOOD
DATA STEP COMPONENT INTERFACE
hh.Replace
• insert the key and satellites
(overwrites any existing data)
PEEK UNDER THE HOOD
DATA STEP COMPONENT INTERFACE
hh.Find
• search for the key
• if found, extract the satellite
and update the host DATA STEP variables
PEEK UNDER THE HOOD
DATA STEP COMPONENT INTERFACE
hh.Check
• search for the key
• if found, just return rc=0
PEEK UNDER THE HOOD
DATA STEP COMPONENT INTERFACE
hh.Delete
• delete the hash table from memory
PEEK UNDER THE HOOD
HASH ITERATOR
declare hiter hi
dcl hiter myhashiterator
• instansiates or creates the object
(don’t forget to add “ordered: 1”
to the hash declaration)
PEEK UNDER THE HOOD
HASH ITERATOR
hi.First
• fetch the smallest key into the host variable
PEEK UNDER THE HOOD
HASH ITERATOR
hi.Next
• fetch the next key in ascending order
PEEK UNDER THE HOOD
HASH ITERATOR
hi.Last
• fetch the largest key into the host variable
PEEK UNDER THE HOOD
HASH ITERATOR
hi.Prev
• fetch the previous key in descending order
PEEK UNDER THE HOOD
ADELSON-VELSKII & LANDIS TREES
http://www.seanet.com/users/arsen/avltree.html
Binary Trees populated such that they average O(log(N)) search behavior regardless of the distribution.
For example:
• Insert the value: 05• Insert the value: 02
AVL maintains balance by rotating the values and preserving the search structure
HASHEXP
• controls the number of trees to create
• 2**
• EXP=16 HSIZE=65,536
HASHEXP
54321
CONCLUSION
Simple numeric key falling in a limited range
• SAS date and time values are good examples
• This is the area where Generation I key-indexed search complete dominates the competition both in computer and programming efficiency
A short comparison of I vs. II
4321
CONCLUSION
Simple numeric key with the range up to 9 digits; no satellites needed
• Bitmapping is king
1
CONCLUSION
Simple numeric key or short (up to 10 bytes) character key
• Both generations do well
• If ultimate speed is the issue, Generation I (barely)
• Generation II has the advantage of coding simplicity
321
CONCLUSION
Composite keys
• Generation I is better if the keys can be rapidly combined in a short integer
• Otherwise, Generation II dominates
21
CONCLUSION
Retrieving data by key from a hash table in order
• Generation I can provide such functionality only through array sorting
• Generation II hash iterator object is designed for this purpose, works very fast
21
CONCLUSION
Storing and handeling duplicate key entries in a hash table
• Generation I is more flexible
• Generation II only lets you control which duplicate takes over
2121
CONCLUSION
Dynamic Data Step Processing
• Generation II is ideal
• Table grows at run-time as new entries are added• No need to allocate giant memories beforehand
321
CONCLUSION
Dynamic DATA Step Structure
is the first ever…
Finally…
Although still experimental in V9…
The Generation II Hash Object
(AssociativeArray)
4321
HASHING:
GENERATIONS
SUGI 28
GENERATIONS
top related