Upload: herbert-higgins

Post on 29-Dec-2015

1

Chapter 4: Introduction to Lookup Techniques

4.1 Introduction to Lookup Techniques

4.2 In-Memory Lookup Techniques

4.3 Disk Storage Techniques

2

4.1 Introduction to Lookup Techniques

3

Objectives

Define table lookup.

List table lookup techniques.

4

Table Lookups

[Figure: data values are looked up against a table of lookup values.]

5

6

4.01 Multiple Choice Poll

Which of these is an example of a table lookup?

a. You have the data for January sales in one data set, February sales in a second data set, and March sales in a third. You need to create a report for the entire first quarter.

b. You want to send birthday cards to employees. The employees’ names and addresses are in one data set and their birthdates are in another.

c. You need to calculate the amount each customer owes for his purchases. The price per item and the number of items purchased are stored in the same data set.

7

4.01 Multiple Choice Poll – Correct Answer

Which of these is an example of a table lookup?

Correct answer: b. You want to send birthday cards to employees. The employees’ names and addresses are in one data set and their birthdates are in another.

8

Overview of Table Lookup Techniques

Arrays, hash objects, and formats provide an in-memory lookup table.

The DATA step MERGE statement, multiple SET statements in the DATA step, and SQL procedure joins use lookup values that are stored on disk.

9

4.2 In-Memory Lookup Techniques

10

Objectives

Describe arrays as a lookup technique.

Describe hash objects as a lookup technique.

Describe formats as a lookup technique.

11

12

4.02 Multiple Answer Poll

Which techniques do you currently use when you perform table lookups with a single data set?

a. Arrays

b. Hash object

c. Formats

d. None of the above

13


Overview of Arrays

An array is similar to a numbered row of buckets.

SAS puts a value in a bucket based on the bucket number.

A value is retrieved from a bucket based on the bucket number.

[Figure: a row of four buckets numbered 1, 2, 3, 4.]

16

Overview of Arrays

General form of the ARRAY statement:

   DATA data-set-name;
      ARRAY array-name { subscript } <$><length>
            <array-elements> <(initial-value-list)>;
      < READ statement(s) >
      new-variable=array-name{subscript-value};
   RUN;

The READ statement can be the SET, MERGE, or INFILE/INPUT statement.

The ARRAY statement associates variables or initial values to be retrieved using the array name and a subscript value.

The assignment statement retrieves values from the array based on the value of the subscript.

17

Overview of Arrays

   data country_info;
      array Cont_Name{91:96} $ 30 _temporary_
         ('North America', ' ', 'Europe', 'Africa', 'Asia', 'Australia/Pacific');
      set orion.country;
      Continent=Cont_Name{Continent_ID};
   run;

The ARRAY statement associates variables or initial values to be retrieved using the array name and a subscript value.

The assignment statement retrieves values from the array based on the value of the subscript.

p304d01
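A self-contained variant of p304d01 may be easier to experiment with. The sketch below assumes a small hypothetical data set, work.country_demo, in place of orion.country, and it adds a subscript check that is not part of the original program. The check is there because a Continent_ID outside 91 through 96 would otherwise stop the step with an array-subscript-out-of-range error.

```sas
/* Hypothetical stand-in for orion.country: only the columns used here. */
data work.country_demo;
   input Country $ Continent_ID;
   datalines;
AU 96
CA 91
DE 93
ZZ 99
;
run;

data work.country_info;
   /* Six temporary elements, subscripts 91 through 96; 92 is unused. */
   array Cont_Name{91:96} $ 30 _temporary_
      ('North America', ' ', 'Europe', 'Africa', 'Asia', 'Australia/Pacific');
   set work.country_demo;
   length Continent $ 30;
   /* Only index the array when the subscript is inside its bounds. */
   if lbound(Cont_Name) <= Continent_ID <= hbound(Cont_Name) then
      Continent=Cont_Name{Continent_ID};
   else Continent='Unknown';
run;
```

Because the array is _TEMPORARY_, its elements are not written to work.country_info; only Country, Continent_ID, and Continent appear in the output.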

18

19


20

4.03 Multiple Choice Poll

In p304d01, how many elements are in the array Cont_Name?

a. 0

b. 5

c. 6

d. unknown

21

4.03 Multiple Choice Poll – Correct Answer

In p304d01, how many elements are in the array Cont_Name?

Correct answer: c. 6 (the subscript range 91:96 defines six elements)

22


Overview of a Hash Object

A hash object is similar to rows of buckets that are identified by the value of a key.

SAS puts value(s) in the data bucket(s) based on the value(s) in the key bucket.

Value(s) are retrieved from the data bucket(s) based on the value(s) in the key bucket.

[Figure: rows of buckets with the columns Key, Data, Data.]

25

Overview of Hash Objects

General form of the hash object:

   DATA data-set-name;
      < READ statement(s) >
      IF _N_=1 THEN DO;
         DECLARE HASH object-name(<attribute:value>);
         object-name.DEFINEKEY('key-name');
         object-name.DEFINEDATA('data-name');
         object-name.DEFINEDONE();
      END;
      return-code=object-name.FIND(<key: value>);
   RUN;

The READ statement can be the SET, MERGE, or INFILE/INPUT statement.

The syntax within the DO group defines and can populate the hash object.

The FIND method retrieves the data value based on the key value.

26

Overview of Hash Objects

   data country_info;
      length Continent_Name $ 30;
      if _N_=1 then do;
         declare hash Cont_Name(dataset:'orion.continent');
         Cont_Name.definekey('Continent_ID');
         Cont_Name.definedata('Continent_Name');
         Cont_Name.definedone();
      end;
      set orion.country;
      rc=Cont_Name.find(key:Continent_ID);
      if rc=0;
   run;

The syntax within the DO group defines and populates the hash object.

The FIND method retrieves the data value based on the key value.

p304d02
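p304d02 keeps only the observations for which FIND succeeds (rc=0). As a sketch of an alternative, the program below keeps every observation and labels the misses instead. The data sets work.continent_demo and work.country_demo are hypothetical stand-ins for orion.continent and orion.country, and the 'Unknown' label is an assumption made for illustration.

```sas
/* Hypothetical lookup table standing in for orion.continent. */
data work.continent_demo;
   infile datalines dlm=',';
   length Continent_Name $ 30;
   input Continent_ID Continent_Name $;
   datalines;
91,North America
93,Europe
96,Australia/Pacific
;
run;

/* Hypothetical base table standing in for orion.country. */
data work.country_demo;
   input Country $ Continent_ID;
   datalines;
AU 96
CA 91
DE 93
ZZ 99
;
run;

data work.country_info;
   length Continent_Name $ 30;
   /* Load the lookup table into memory once, on the first iteration. */
   if _N_=1 then do;
      declare hash Cont_Name(dataset:'work.continent_demo');
      Cont_Name.definekey('Continent_ID');
      Cont_Name.definedata('Continent_Name');
      Cont_Name.definedone();
   end;
   set work.country_demo;
   /* FIND returns 0 on a match and fills Continent_Name;
      on a miss the variable is not filled in, so label it. */
   if Cont_Name.find() ne 0 then Continent_Name='Unknown';
run;
```

Calling FIND without the key: argument uses the current value of the key variable Continent_ID in the program data vector, which is equivalent here to find(key:Continent_ID).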

27

28


29

4.04 Multiple Choice Poll

In p304d02, how many times do the statements in the DO group execute?

a. only once

b. once for every observation in the data set orion.country

c. once for every observation in the data set orion.continent

30

4.04 Multiple Choice Poll – Correct Answer

In p304d02, how many times do the statements in the DO group execute?

Correct answer: a. only once (the condition _N_=1 is true only on the first iteration of the DATA step)

31


Overview of a Format

A format is similar to rows of buckets that are identified by the data value.

SAS puts data values and label values in the buckets when the format is used in a FORMAT statement, PUT function, or PUT statement.

SAS uses a binary search on the data value bucket in order to return the value in the label bucket.

[Figure: rows of buckets with the columns Data Value, Label.]

34

Overview of a Format

General form of the user-defined format:

   PROC FORMAT;
      VALUE <$>fmtname range-1=label-1
                       . . .
                       range-n=label-n;
   RUN;

   DATA data-set-name;
      < READ statement(s) >;
      new-variable=PUT(variable,fmtname.);
   RUN;

The READ statement can be the SET, MERGE, or INFILE/INPUT statement.

The FORMAT step compiles the format and stores it on disk.

When the PUT function executes, the format is loaded into memory, and a binary search is used to retrieve the format value.

35

Overview of a Format

   proc format;
      value Cont_Name
         91='North America'
         93='Europe'
         94='Africa'
         95='Asia'
         96='Australia/Pacific';
   run;

   data country_info;
      set orion.country;
      Continent=put(Continent_ID,Cont_Name.);
   run;

The FORMAT step compiles the format and stores it on disk.

When the PUT function executes, the format is loaded into memory, and a binary search is used to retrieve the format value.

p304d03
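The format in p304d03 lists no label for values such as 92, so an unlisted Continent_ID would come back as the formatted number rather than as a continent name. One possible way to make such cases visible, sketched here under the assumption of a hypothetical work.country_demo data set and an illustrative 'Unknown' label, is the OTHER keyword of the VALUE statement.

```sas
proc format;
   value Cont_Name
      91='North America'
      93='Europe'
      94='Africa'
      95='Asia'
      96='Australia/Pacific'
      other='Unknown';   /* catches every value not listed above */
run;

/* Hypothetical stand-in for orion.country. */
data work.country_demo;
   input Country $ Continent_ID;
   datalines;
AU 96
CA 91
DE 93
ZZ 99
;
run;

data work.country_info;
   set work.country_demo;
   length Continent $ 30;
   /* PUT applies the format and returns the label as a character value. */
   Continent=put(Continent_ID,Cont_Name.);
run;
```

The LENGTH statement is optional; it is included so that Continent has the same length as in the array and hash object examples.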

36

4.3 Disk Storage Techniques

37

Objectives

List methods for combining data horizontally.

Use multiple SET statements to combine data horizontally.

Compare methods for combining SAS data sets.

38

Combining Data Horizontally

DATA step techniques for combining data horizontally include using the following:

MERGE statement

multiple SET statements

UPDATE statement

MODIFY statement

In addition, you can use the SQL procedure with an inner or outer join.

39

40

4.05 Multiple Answer Poll

Which techniques do you currently use when you perform table lookups with multiple data sets?

a. MERGE statement

b. Joins

c. Multiple SET statements

d. UPDATE statement

e. MODIFY statement

f. None of the above

41

Overview of Merges and Joins

The DATA step MERGE and the SQL join operators are similar to multiple stacks of buckets that are referred to by the value of one or more common variables.

[Figure: two stacks of buckets, each with the columns By Value(s), Data, Data.]

42

DATA Step MERGE Statement

General form of the DATA step merge:

   DATA data-set-name;
      MERGE SAS-data-sets;
      BY variables;
   RUN;

Matches on equal values for like-named variables:

[Figure: Continent_ID in each input data set is matched to a single Continent_ID in the output.]

43

DATA Step MERGE Statement

   proc sort data=orion.country out=country;
      by Continent_ID;
   run;

   data country_info;
      merge country orion.continent;
      by Continent_ID;
   run;

Matches on equal values for like-named variables

p304d04
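As written, p304d04 keeps every BY group, including continents with no country and countries with no continent. A common refinement, sketched below with hypothetical work.* data sets in place of the orion tables, uses the IN= data set option to keep only the observations that found a match. The variable names InCountry and InContinent are chosen for illustration.

```sas
/* Hypothetical stand-ins for orion.continent and orion.country. */
data work.continent_demo;
   infile datalines dlm=',';
   length Continent_Name $ 30;
   input Continent_ID Continent_Name $;
   datalines;
91,North America
93,Europe
94,Africa
96,Australia/Pacific
;
run;

data work.country_demo;
   input Country $ Continent_ID;
   datalines;
AU 96
CA 91
DE 93
ZZ 99
;
run;

/* Both inputs must be sorted by the BY variable before the merge. */
proc sort data=work.country_demo out=work.country_sorted;
   by Continent_ID;
run;

proc sort data=work.continent_demo out=work.continent_sorted;
   by Continent_ID;
run;

data work.country_info;
   merge work.country_sorted(in=InCountry)
         work.continent_sorted(in=InContinent);
   by Continent_ID;
   /* Keep only BY groups that are present in both data sets. */
   if InCountry and InContinent;
run;
```

With these inputs, the observation with Continent_ID 99 and the continent 94 (which has no country) are both dropped. Changing the subsetting IF to "if InCountry;" would keep unmatched countries with a blank Continent_Name.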

44

45


46

4.06 Multiple Choice Poll

In p304d04, if the data set country has seven observations and the data set orion.continent has five observations, what stops the execution of the DATA step?

a. end of file for work.country, the data set with the most observations

b. end of file for orion.continent, the last data set listed in the MERGE statement

c. end of file for the data set that contains the final value of the BY variable Continent_ID

47

4.06 Multiple Choice Poll – Correct Answer

In p304d04, if the data set country has seven observations and the data set orion.continent has five observations, what stops the execution of the DATA step?

Correct answer: c. end of file for the data set that contains the final value of the BY variable Continent_ID

48

The SQL Procedure

You can use an SQL procedure inner or outer join to create a SAS data set.

General form of the SQL procedure CREATE TABLE statement with an inner join:

   PROC SQL;
      CREATE TABLE SAS-data-set AS
         SELECT column-1, column-2,… ,column-n
            FROM table-1, table-2,…,table-n
            WHERE joining criteria
            ORDER BY sorting criteria;
   QUIT;

Performs an inner join based on the WHERE criteria

49

The SQL Procedure

   proc sql;
      create table country_info as
         select country.*, Continent_Name
            from orion.country, orion.continent
            where country.Continent_ID=continent.Continent_ID
            order by country.Continent_ID;
   quit;

Performs an inner join where the Continent_ID values from both data sets are equal

p304d05
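The inner join in p304d05 drops countries whose Continent_ID has no match. For comparison, the sketch below shows an outer (left) join that keeps every country. It assumes the hypothetical data sets work.country_demo and work.continent_demo rather than the orion tables, and the table aliases c and n are illustrative.

```sas
/* Hypothetical stand-ins for orion.continent and orion.country. */
data work.continent_demo;
   infile datalines dlm=',';
   length Continent_Name $ 30;
   input Continent_ID Continent_Name $;
   datalines;
91,North America
93,Europe
96,Australia/Pacific
;
run;

data work.country_demo;
   input Country $ Continent_ID;
   datalines;
AU 96
CA 91
DE 93
ZZ 99
;
run;

proc sql;
   create table work.country_info as
      select c.Country, c.Continent_ID, n.Continent_Name
         from work.country_demo as c
              left join work.continent_demo as n
              on c.Continent_ID=n.Continent_ID
         order by c.Continent_ID;
quit;
```

Unlike the DATA step merge, the join does not require the inputs to be sorted beforehand. The country with Continent_ID 99 is kept, with a missing Continent_Name.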

50

51

4.07 Multiple Choice Poll

Which of the following is true of the SQL inner join?

a. The resulting data set contains only the observations with matching key values.

b. The resulting data set contains both the observations with matching key values and those observations where the key values do not match.

52

4.07 Multiple Choice Poll – Correct Answer

Which of the following is true of the SQL inner join?

Correct answer: a. The resulting data set contains only the observations with matching key values.

53

Multiple SET Statements

The DATA step with multiple SET statements combines data sets by performing one-to-one reading.

[Figure: the observations of two data sets are paired side by side, in order.]

54

Multiple SET Statements

You can use multiple SET statements to combine observations from several SAS data sets.

When you use multiple SET statements, the following occurs:

Processing stops when SAS encounters the end-of-file marker on either data set.

The variables in the PDV are not reinitialized when a second SET statement is executed.

55

Multiple SET Statements

General form of the DATA step with multiple SET statements:

   DATA data-set-name;
      SET SAS-data-set;
      SET SAS-data-set;
   RUN;

56

Multiple SET Statements

   data country_info;
      set orion.country;
      set orion.continent;
   run;

p304d06

Listing of country_info

   Obs  Country  Country_Name  Population  Country_ID  Continent_ID  Country_FormerName  Continent_Name
   1    AU       Australia     20,000,000  160         91                                North America
   2    CA       Canada                 .  260         93                                Europe
   3    DE       Germany       80,000,000  394         94            East/West Germany   Africa
   4    IL       Israel         5,000,000  475         95                                Asia
   5    TR       Turkey        70,000,000  905         96                                Australia/Pacific

57

Execution

Input data sets:

   one          two
   X  Y         Z
   1  2         A
   2  3         B
   3  4

Program:

   data three;
      set one;
      set two;
      Total=X+Y;
   run;

Contents of the PDV after each step:

   _N_  Step                                    X  Y  Z  Total
   1    set one reads observation 1             1  2     .
   1    set two reads observation 1             1  2  A  .
   1    Total=X+Y                               1  2  A  3
   1    Implicit OUTPUT; implicit RETURN        1  2  A  3
   2    Initialize PDV                          1  2  A  .
   2    set one reads observation 2             2  3  A  .
   2    set two reads observation 2             2  3  B  .
   2    Total=X+Y                               2  3  B  5
   2    Implicit OUTPUT; implicit RETURN        2  3  B  5
   3    Initialize PDV                          2  3  B  .
   3    set one reads observation 3             3  4  B  .
   3    set two encounters EOF                  3  4  B  .

Processing stops.

Output data set:

   three
   X  Y  Z  Total
   1  2  A  3
   2  3  B  5
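The walkthrough above can be reproduced by creating the two input data sets inline. The sketch below is one way to do that. The DATALINES steps and the PUT statements are additions for illustration and are not part of the original program.

```sas
data one;
   input X Y;
   datalines;
1 2
2 3
3 4
;
run;

data two;
   input Z $;
   datalines;
A
B
;
run;

data three;
   set one;
   /* Written on every iteration that reads an observation from one. */
   put 'After SET one: ' _N_= X= Y=;
   set two;
   /* Not reached on the iteration where two hits end of file. */
   put 'After SET two: ' _N_= Z=;
   Total=X+Y;
run;
```

The log should show three "After SET one" lines but only two "After SET two" lines, because the third execution of "set two" encounters the end of file and stops the step before the second PUT statement runs.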

69

70

Setup for the Poll

The previous example created a data set named three with two observations.

   data three;
      set one;
      set two;
      Total=X+Y;
   run;

Using the same one and two data sets, if the SET statements were reversed, how many observations would be in the data set three?

   one          two
   X  Y         Z
   1  2         A
   2  3         B
   3  4

   data three;
      set two;
      set one;
      Total=X+Y;
   run;

71

4.08 Multiple Choice Poll

Using the same one and two data sets, if the SET statements were reversed, how many observations would be in the data set three?

a. 5

b. 2

c. 3

d. 6

72

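4.08 Multiple Choice Poll – Correct Answer

Using the same one and two data sets, if the SET statements were reversed, how many observations would be in the data set three?

Correct answer: b. 2 (the third execution of "set two" encounters the end of file, so only two observations are written)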

73

DATA Step Methods for Reading SAS Data

For each program: which variables are reinitialized to missing at the top of the DATA step, and what stops the DATA step?

   data two;
      set one;
      New_Var=Value;
   run;

   Reinitialized: variables created in the DATA step
   Stops: end of the file for data set one

   data three;
      merge one two;
      by Var;
      New_Var=Value;
   run;

   Reinitialized: variables created in the DATA step; all variables when the BY value changes
   Stops: the last end of file that is encountered

   data three;
      set one two;
      New_Var=Value;
   run;

   Reinitialized: variables created in the DATA step; all variables when SAS finishes reading data set one and starts reading data set two
   Stops: end of the file for data set two

   data three;
      set one;
      set two;
      New_Var=Value;
   run;

   Reinitialized: variables created in the DATA step
   Stops: the first end of file that is encountered

74

Chapter Review

1. What are the three types of in-memory table lookups?

2. What are three types of disk storage table lookups?

3. When multiple SET statements are executed, when does execution stop?

75

Chapter Review – Correct Answers

1. What are the three types of in-memory table lookups?

arrays, hash objects, and formats

2. What are three types of disk storage table lookups?

PROC SQL, the DATA step with a MERGE statement, or the DATA step with multiple SET statements

3. When multiple SET statements are executed, when does execution stop?

Execution stops when the first end of file is encountered.