build your data skills: introduction to sql · s-q-l vs. sequel? 10 terminology. dataset....

80
Build your Data Skills: Introduction to SQL Kaeli Samson, MA, MPH Department of Biostatistics College of Public Health University of Nebraska Medical Center June 25 th , 2019

Upload: others

Post on 23-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

Build your Data Skills: Introduction to SQLKaeli Samson, MA, MPHDepartment of BiostatisticsCollege of Public HealthUniversity of Nebraska Medical Center

June 25th, 2019

Page 2: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

2

Top 3 Skills:1. SQL2. Github3. Marketing

Hadley Wickham, PhDChief Scientist, RStudio

Page 3: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

3

• Brief Intro to SQL• Terminology• General syntax/structure• Description of Dataset• Basic Queries• Creating New Variables• Joins• Helpful SQL Code

Overview

Page 4: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

4

SQL = Structured Query Language

Typically associated with use in database management, but also great for data management, generally!

Brief Introduction to SQL

Page 5: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

5

Brief Background in Database DesignStudent Student

ContactCourse Course

DescriptionInstructor Instructor

Contact

Josie 555-1234 Calculus II Integration Julie 555-8888

Ken 555-9845 Calculus II Integration Joe 555-2222

Brooke 555-7878 Calculus II Integration Julie 555-8888

Addison 555-1111 Calculus II Integration Julie 555-8888

Cole 555-6127 Calculus II Integration Julie 555-8888

Samantha 555-1534 Calculus II Integration Joe 555-2222

Josh 555-5463 Calculus II Integration Joe 555-2222

Josie 555-1234 GIS I Mapping Paul 555-3333

Ken 555-9845 GIS I Mapping Paul 555-3333

Page 6: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

6

Brief Background in Database Design

InstructorID

InstructorName

Instructor Contact

1 Julie 555-8888

2 Joe 555-2222

3 Paul 555-3333

Course ID Course Name

Course Description

A Calculus II Integration

B GIS I Mapping

StudentID

Student Name

StudentContact

1 Josie 555-1234

2 Ken 555-9845

3 Brooke 555-7878

4 Addison 555-1111

5 Cole 555-6127

6 Samantha 555-1534

7 Josh 555-5463

Enrollment ID Student ID Course ID Instructor ID

1 1 A 1

2 2 A 2

3 1 B 3

Page 7: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

7

SQL can be used to “query” data, but can do more, such as:• Create new variables• Join tables together• Insert observations• Edit observations • Delete observations

Brief Introduction to SQL

Page 8: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

Terminology

Page 9: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

9

S-Q-L vs. Sequel?

Page 10: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

10

Terminology

Dataset

Observation

Variable

Table

Row

Column

SAS SQL

Note: Since this presentation uses SAS to run SQL, SAS terms will be used interchangeably with SQL terms, although I acknowledge in some fields of study these terms are not considered synonymous.

Page 11: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

General Structure and Syntax of SQL

Page 12: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

12

SELECT Choose variables/columns for your table

FROM Indicate source(s) of data (i.e. datasets)

WHERE Subsetting criteria for rows

GROUP BY Grouping desired for summary variables

ORDER BY Sort order for rows

Common SQL Clauses

Clauses must be in this specific order!

Page 13: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

13

Common SQL ClausesCREATE TABLESELECTFROMWHEREGROUP BYORDER BY

SELECTFROMWHEREGROUP BYORDER BY

Prints query result (i.e. table) to output

Saves query result (i.e. table) as SAS dataset

Page 14: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

14

SQL in SASPROC SQL ;

SELECTFROMWHEREGROUP BYORDER BY ;

QUIT ;

Specific to SAS

Page 15: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

15

SQL in SASNote: While SQL is an ANSIstandard language, each software that runs it, including SAS, may have their own options that are specific to that software. As such, some of the code in this presentation may not work outside SAS, but the general principles will still apply.

Page 16: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

Structure of Example Dataset

Page 17: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

17

• State Identifiers (x3)• Name• Abbreviation• FIPS Code

• State Information• Division• Population• Number of representatives• Change in number of seats

Data: sashelp.us_data

Page 18: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

18

sashelp.us_data (abbr.)

Page 19: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

19

data us_pop;set sashelp.us_data

(rename=(population_2010=pop_2010));keep statename state division pop_2010;

run;

New Dataset

us_pop

Page 20: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

Basic Queries

Page 21: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

21

Printing all variables and observations in a dataset

proc sql;select var1, var2, var3, var4from dataset;

quit;

Basic Structure of SQL Code

SQL

Page 22: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

22

Basic Structure of SQL Code

dataset

variables

dataset

variables

Traditional SAS Code:

Printing all variables and observations in a dataset

proc print data=us_pop;var statename pop_2010 state division;

run;

proc sql;select statename, pop_2010, state, divisionfrom us_pop;

quit;

Page 23: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

23

Printing all variables and observations in a dataset

Basic Structure of SQL Code

Proc Print Proc SQL

Page 24: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

24

Printing all variables and observations in a dataset

proc sql;select *from dataset;

quit;

Basic Structure of SQL Code

select all variables

Page 25: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

25

Printing all observations in a dataset

proc print data=us_pop;run;

proc sql;select *from us_pop;

quit;

Basic Structure of SQL Code

Traditional SAS Code:

Page 26: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

26

Printing all observations in a dataset

Basic Structure of SQL Code

Page 27: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

27

Printing unique observations in a dataset

proc sql;select divisionfrom us_pop;

quit;

proc sql;select distinct divisionfrom us_pop;

quit;

Basic Structure of SQL Code

only select unique observations

Page 28: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

28

Printing unique observations in a dataset

Basic Structure of SQL Code

proc sql;select distinct divisionfrom us_pop;

quit;

proc sql;select divisionfrom us_pop;

quit;

Page 29: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

29

Printing a subset of observations

proc sql;select *from datasetwhere var in (“A”, “B”, “C”);

quit;

Basic Structure of SQL Code

restrict observations with where clause

Page 30: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

30

Printing a subset of observations

proc print data=us_pop;where division in ("West North Central", "Mountain");

run;

proc sql;select *from us_popwhere division in ("West North Central", "Mountain");

quit;

Basic Structure of SQL Code

Traditional SAS Code:

Page 31: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

31

Printing a subset of observations

Basic Structure of SQL Code

Page 32: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

32

Other examples of where clause in SQLproc sql;

select *from us_popwhere pop_2010 between 0 and 1000000;

quit;

proc sql;select *from us_popwhere 0 le pop_2010 le 1000000 and division = "Mountain";

quit;

Basic Structure of SQL Code

Page 33: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

33

Other examples of where clause in SQL

Basic Structure of SQL Code

where pop_2010 between 0 and 1000000;

where 0 le pop_2010 le 1000000 and division = "Mountain";

Page 34: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

34

Sort observations in a dataset

proc sql;select *from datasetorder by var;

quit;

proc sql;select *from datasetorder by var1, var2 desc;

quit;

Basic Structure of SQL Code

sort output using ‘order by’ clause

Page 35: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

35

Sort observations in a dataset

proc sort data=us_pop;by division descending population_2010;

run;

proc print data=us_pop;run;

proc sql;select *from us_poporder by division, population_2010 desc;

quit;

Basic Structure of SQL Code

Traditional SAS Code:

Page 36: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

36

Sort observations in a dataset

Basic Structure of SQL Code

Page 37: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

Creating New Variables

Page 38: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

38

Creating new variables

proc sql;select statename, pop_2010, pop_2010/1000000 as new_popfrom us_pop;

quit;

Creating New Variables

new variable

name

new variable

definition

(not optional)

Page 39: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

39

Creating new variables

proc sql;select statename, pop_2010, pop_2010/1000000 as new_popfrom us_pop;

quit;

Creating New Variables

Page 40: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

40

Cleaning up new variables

proc sql;select

statename, pop_2010, pop_2010/1000000 as new_pop format=8.1 label="Pop in Millions"

from us_pop;quit;

new variable

label

new variable format

Creating New Variables

Page 41: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

41

Cleaning up new variablesproc sql;

selectstatename, pop_2010, pop_2010/1000000 as new_pop format=8.1 label="Pop in Millions"

from us_pop;quit;

Creating New Variables

Page 42: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

42

Summary Variables

proc sql;select mean(pop_2010) as mean_popfrom us_pop;

quit;

Creating New Variables

new variable

name

new variable

definition

Page 43: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

43

Summary Variables

proc sql;select mean(pop_2010) as mean_popfrom us_pop;

quit;

Creating New Variables

The summary function is applied to the entire data set (when there is

no group by clause)

Page 44: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

44

Using the count functionproc sql;

select count(*)from us_pop;

quit;

proc sql;select count(division)from us_pop;

run;

proc sql;select count(distinct division)from us_pop;

run;

Creating New Variables

Page 45: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

45

Summary Variables by Group

proc sql;select division, mean(pop_2010) as div_mean_popfrom us_popgroup by division;

quit;

Creating New Variables

new variable

name

new variable

definition

Grouping variable: will calculate summary statistics for each unique value of this variable

Page 46: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

46

Summary Variables by Groupproc sql;

select division, mean(pop_2010) as div_mean_popfrom us_popgroup by division;

quit;

Creating New Variables

Page 47: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

47

Summary Variables by GroupNote: It’s important to have your grouping variable in both your select and your group by clauses!

proc sql;select division, mean(pop_2010) as div_mean_popfrom us_popgroup by division;

quit;

Creating New Variables

Page 48: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

48

Summary Variables by GroupThis is what happens if you leave the grouping variable out of the select clause…

proc sql;select mean(pop_2010) as div_mean_popfrom us_popgroup by division;

quit;

Creating New Variables

Page 49: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

49

Summary Variables by GroupThis is what happens if you forget to includes the group by clause:

proc sql;select division, mean(pop_2010) as div_mean_popfrom us_pop;

quit;

Creating New Variables

Page 50: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

50

proc sql;create table div_stats asselect division, mean(pop_2010) as div_mean_popfrom us_popgroup by division;

quit;

Creating New SAS Datasetsnew SAS

dataset namenot

optional

Page 51: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

Joins

Page 52: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

52

sashelp.us_data (abbr.)

Page 53: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

53

data us_pop;set sashelp.us_data

(rename=(population_2010=pop_2010));keep statename state division pop_2010;

run;

data us_rep;set sashelp.us_data (keep=

state statecodereps_2010);

run;

data us_seatch;set sashelp.us_data (keep=

statecodestatenameseat_change_2010);

run;

New Datasets

Page 54: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

54

New Datasets

us_rep

us_pop

us_seatch

Page 55: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

55

Basic set up of a joinproc sql;

select dataset1.*, dataset2.*

from dataset1, dataset2

where dataset1.common_var = dataset2.common_var;quit;

Joins

Common variable structure:

data_source.variable_nameWhen joining tables, it is important to let SQL know which variables are coming from each dataset, especially when variables in different datasets have the same name.

Page 56: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

56

Basic set up of a joinproc sql;

select dataset1.*, dataset2.*

from dataset1, dataset2

where dataset1.common_var = dataset2.common_var;quit;

Joins

Without the where statement, the output table would be a Cartesian product join, where every observation in the first dataset would be joined to the first observation in the second table, and that would be repeated for each subsequent observation in the second table, until you had a table with 52*52 = 2,704 observations!

Page 57: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

57

Use abbreviations for dataset namesproc sql;

select dataset1.*, dataset2.*

from dataset1, dataset2

where dataset1.common_var = dataset2.common_var;quit;

proc sql;select abbr1.*,

abbr2.*from dataset1 as abbr1,

dataset2 as abbr2where abbr1.common_var = abbr2.common_var;

quit;

Joins

the ‘as’ is optional in the from clause

equivalent

Page 58: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

58

Desired Join

us_rep

us_pop

Page 59: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

59

Example of a join – keep all variables

proc sql;select pop.*,

rep.*from us_pop as pop,

us_rep as repwhere pop.state = rep.state;

quit;

Joins

Note that unlike merges in the data step, there’s no need to sort the input datasets prior to a join!

Page 60: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

60

Example of a join – keep all variablesproc sql;

select pop.*, rep.*from us_pop as pop, us_rep as repwhere pop.state = rep.state;

quit;

Joins

Page 61: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

61

Save joined tables as SAS dataset

proc sql;create table pop_rep asselect pop.*,

rep.*from us_pop as pop,

us_rep as repwhere pop.state = rep.state;

select *from pop_rep;

quit;

Joins

prints the new dataset

Page 62: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

62

Desired Joinus_pop

us_seatch

But we only want the following

variables in the final joined table:• Statecode• Population• Seat Change

Page 63: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

63

Joins

us_pop us_seatch

We only want the following variables in the final joined table:

• Statecode• Population• Seat Change

proc sql;select statecode,

pop_2010, seat_change_2010

from us_pop as pop, us_seatch as seat

where pop.statename = seat.statename;quit;

Page 64: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

64

Joinsproc sql;

select statecode, pop_2010, seat_change_2010

from us_pop as pop, us_seatch as seat

where pop.statename = seat.statename;quit;

Page 65: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

65

Desired Join

us_rep

us_pop

us_seatch

Keep:• Statename• Population• Representatives• Seat Change

Page 66: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

66

Joinsproc sql;

select pop.statename,pop_2010, reps_2010,seat_change_2010

from us_pop as pop, us_rep as rep,us_seatch as seat

where pop.state = rep.state ANDrep.statecode = seat.statecode;

quit;

Keep:• Statename• Population• Representatives• Seat Change

us_repus_pop us_seatch

Page 67: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

67

Joinsproc sql;

select pop.statename,pop_2010, reps_2010,seat_change_2010

from us_pop as pop, us_rep as rep,us_seatch as seat

where pop.state = rep.state ANDrep.statecode = seat.statecode;

quit;

Page 68: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

Some Good Uses for SQL

Page 69: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

69

Joins related to dates

seizure blood

The goal is to match up blood data that was collected before the day the seizure occurred,

but no more than one week before

Page 70: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

70

Joins related to dates

seizure blood

proc sql;select s.*,

b.*,(bld_date - sz_date) as days_diff

from seizure as s,blood as b

where s.pt_ID = b.pt_ID AND-7 le CALCULATED days_diff le -1;

quit;

Page 71: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

71

Joins related to datesproc sql;

select s.*, b.*,(bld_date - sz_date) as days_diff

from seizure as s,blood as b

where s.pt_ID = b.pt_ID AND-7 le CALCULATED days_diff le -1;

quit;

Page 72: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

72

Joins related to datesproc sql;create table sz_bld asselect

s.*, b.*,(bld_date - sz_date) as days_diff

from seizure as s,blood as b

where s.pt_ID = b.pt_ID AND-7 le CALCULATED days_diff le -1

order by pt_id, days_diff;

select *from sz_bld;

quit;

*If you only want one blood draw per patient, and prefer the draw that was closest to the seizure date;data sz_bld_single;

set sz_bld;by pt_id days_diff;

if last.pt_id then output;run;

Page 73: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

73

Confirming User-Defined Formatsproc format;

value age_group low - 30 = "Age Group 1"35 - 45 = "Age Group 2"46 - 55 = "Age Group 3"56 - 65 = "Age Group 4"66 - high = "Age Group 5";

run;

proc sql;select distinct ageatstart,

ageatstart format=age_group.from sashelp.heart;

quit;

Page 74: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

74

Confirming User-Defined Formats

proc format;value age_grouplow - 30 = "Age Group 1"35 - 45 = "Age Group 2“46 - 55 = "Age Group 3“56 - 65 = "Age Group 4"66 - high = "Age Group 5";

run;

proc sql;select distinct ageatstart,

ageatstart format=age_group.from sashelp.heart;

quit;

Page 75: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

75

Confirming User-Defined Variables

The goal is to create a new variable ‘risk’, which takes the value ‘At risk’ if any of the following variables have values with a star:

Page 76: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

76

Confirming User-Defined Variables

data heart;set sashelp.heart;

length risk $8.;

*Create overall risk variable;if bp_status = "High" OR

weight_status ne "Normal" then risk = "At risk";else risk = "Ok";

run;

proc sql;select distinct bp_status,

weight_status, risk, count(*) as total

from heartgroup by bp_status, weight_status, risk;

quit;

Page 77: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will

77

Confirming User-Defined Variables

proc sql;select distinctbp_status, weight_status, risk, count(*) as totalfrom heartgroup bybp_status, weight_status, risk;

quit;

Page 80: Build your Data Skills: Introduction to SQL · S-Q-L vs. Sequel? 10 Terminology. Dataset. Observation; Variable. ... Note: Since this presentation uses SAS torun SQL, SAS terms will