Efficient SAS programming with Large Data
Aidan McDermott
Computing Group, March 2007
Axes of Efficiency
• processing speed:
  – CPU time
  – real (elapsed) time
• storage:
  – disk
  – memory
  – …
• user:
  – functionality
  – interface to other systems
  – ease of use
  – learning
• user development:
  – methodologies
  – reusable code
  – facilitate extension, rewriting
  – maintenance
Dataset / Table
• Datasets consist of three parts
General (and obvious) principles
• Avoid doing the job if possible
• Keep only the data you need to perform a particular task (use DROP, KEEP, WHERE, and subsetting IF)
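As a minimal sketch of the second principle, the step below filters rows and columns while the data is being read, so unneeded values never enter the DATA step. It uses `sashelp.class`, a small sample table shipped with SAS, as a stand-in for a large dataset; the output name `work.teens` is hypothetical.

```sas
/* Sketch only: sashelp.class stands in for a large table.          */
/* KEEP= limits the columns read; WHERE= limits the rows read.      */
/* Variables used in WHERE= are listed in KEEP= as well.            */
data work.teens;
  set sashelp.class (keep  = name sex age
                     where = (age >= 13));
  drop sex;   /* carried through the step but not written out */
run;
```

Filtering with dataset options on the input table is generally preferable to a subsetting IF later in the step, because rejected rows are never loaded in the first place.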
Combining datasets -- concatenation
General (and obvious) principles
• Often efficient methods were written to perform the required task – use them.
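One example of a purpose-built method, offered here as an assumption about what the slide has in mind, is PROC APPEND for concatenation. The dataset names `master` and `newrows` are hypothetical, and `sashelp.class` is only a stand-in for real data.

```sas
/* Stand-in data: a base table and a batch of new rows. */
data master;  set sashelp.class; run;
data newrows; set sashelp.class (where = (age > 14)); run;

/* DATA step concatenation reads and rewrites every row of both inputs. */
data master_copy;
  set master newrows;
run;

/* PROC APPEND reads only newrows and adds them to the end of master. */
proc append base = master data = newrows;
run;
```

When the base table is large and the new batch is small, the second form avoids rereading the base table.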
General (and obvious) principles
• Often efficient methods were written to perform other tasks – use them with caution.
• Write data-driven code
  – it’s easier to maintain data than to update code
• Use LENGTH statements to limit the size of variables in a dataset to no more than is needed.
  – you don’t always know what size this should be, and you don’t always produce your own data.
• Use formatted data rather than the data itself
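The last two bullets can be sketched together, assuming a small hypothetical table: a LENGTH statement keeps the stored variables narrow, and a user-defined format supplies the readable label at display time instead of storing it in every row. The names `$sexfmt`, `people`, and `visits` are invented for illustration.

```sas
/* Hypothetical format: store the 1-byte code, display the label. */
proc format;
  value $sexfmt 'M' = 'Male'
                'F' = 'Female';
run;

data people;
  /* Declare lengths before the variables are first used.          */
  /* A numeric length of 3 is safe only for small integers.        */
  length id $ 6 sex $ 1 visits 3;
  input id $ sex $ visits;
  format sex $sexfmt.;
  datalines;
A00001 M 2
A00002 F 5
;
run;

/* Procedures group and print by the formatted value. */
proc freq data = people;
  tables sex;
run;
```

As the slide warns, shortening lengths is only safe when you know the range of the values you will receive.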
Memory resident datasets
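The slide body was not captured. One mechanism SAS provides for this, shown here as an assumption rather than as the author's example, is the SASFILE statement, which loads a dataset into memory and keeps it there across steps. The name `work.lookup` is hypothetical.

```sas
/* Stand-in table that several later steps will read repeatedly. */
data work.lookup;
  set sashelp.class;
run;

sasfile work.lookup load;     /* read the whole table into memory */

proc means data = work.lookup;
  var height;
run;

proc freq data = work.lookup;
  tables sex;
run;

sasfile work.lookup close;    /* release the memory */
```

This only pays off when the table fits in available memory and is read more than once.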
Compressing Datasets
• Compress datasets with a compression utility such as compress, gzip, WinZip, or PKZIP, and decompress before running each SAS job
  – delays execution, and you need to keep track of data and program dependencies.
• Use a general-purpose compression utility and decompress within SAS for sequential access.
  – system dependent (needs a named pipe); sequential dataset storage.
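A rough sketch of the second approach follows. The slide refers to a named pipe, which is what streaming a SAS dataset would require; this example uses the simpler unnamed PIPE device on a gzip-compressed raw text file instead. The path, file layout, and variable names are all hypothetical, and the step assumes a Unix-like host with `gzip` installed and shell commands enabled for the SAS session.

```sas
/* Hypothetical path and layout; nothing is decompressed to disk. */
filename rawgz pipe 'gzip -dc /data/ozone.txt.gz';

data ozone;
  infile rawgz truncover;
  /* Assumed layout: space-delimited region, site, ISO date, ozone value */
  input region $ siteid $ date :yymmdd10. o3;
  format date date9.;
run;
```

The data can only be read front to back this way, which is the sequential-access limitation the slide mentions.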
SAS internal Compression
• Allows random access to the data and is very effective under the right circumstances. In some cases it does not reduce the size of the data by much.
• “There is a trade-off between data size and CPU time”.
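A minimal sketch of switching on internal compression for a single output dataset, using the COMPRESS= dataset option. `sashelp.class` is a stand-in and `work.class_c` is a hypothetical name.

```sas
/* COMPRESS=YES (same as CHAR) suits repeated characters and blanks; */
/* COMPRESS=BINARY is intended for wide, mostly numeric observations. */
data work.class_c (compress = yes);
  set sashelp.class;
run;

/* To make compression the default for every dataset created in the session: */
/* options compress = yes; */
```

The log reports how much the size changed, which is the quickest way to see whether a given table falls under "the right circumstances". Every later read pays the CPU cost of decompression.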
• indata is a large dataset and you want to produce a version of indata without any observations
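Two common ways to do this without reading any rows are sketched below. The first line builds a stand-in `indata` from `sashelp.class` so the example runs; the output names `empty1` and `empty2` are hypothetical.

```sas
data indata; set sashelp.class; run;   /* stand-in for the large dataset */

/* Option 1: OBS=0 stops reading before the first observation. */
data empty1;
  set indata (obs = 0);
run;

/* Option 2: the SET statement is compiled, so the variables are defined, */
/* but it never executes because the condition is always false.           */
data empty2;
  if 0 then set indata;
  stop;
run;
```

Both rely on the compile phase reading only the descriptor of `indata`, so the cost does not depend on how many observations it holds.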
The data step is a two-stage process
• compile phase
• execute phase
Data step logic
data step
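PDV: compile phase

data admits; set admits; discharge = admit + length; format discharge date8.; run;

| Name | Type | Size | Drop | Retain | Format | Value |
|---|---|---|---|---|---|---|
| patientID | C | 6 | n | y | | |
| gender | C | 1 | n | y | | |
| admit | N | 8 | n | y | date8. | |
| length | N | 8 | n | y | | |
| discharge | N | 8 | n | n | date8. | |
| `_N_` | | | | | | |
| `_ERROR_` | | | | | | 0 |
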
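
PDV: execute phase (after SET reads the first observation)

data admits; set admits; discharge = admit + length; format discharge date8.; run;

| Name | Type | Size | Drop | Retain | Format | Value |
|---|---|---|---|---|---|---|
| patientID | C | 6 | n | y | | 321C-4 |
| gender | C | 1 | n | y | | M |
| admit | N | 8 | n | y | date8. | 15736 |
| length | N | 8 | n | y | | 21 |
| discharge | N | 8 | n | n | date8. | |
| `_N_` | | | | | | 1 |
| `_ERROR_` | | | | | | 0 |
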
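
PDV: execute phase (after the assignment to discharge)

data admits; set admits; discharge = admit + length; format discharge date8.; run;

| Name | Type | Size | Drop | Retain | Format | Value |
|---|---|---|---|---|---|---|
| patientID | C | 6 | n | y | | 321C-4 |
| gender | C | 1 | n | y | | M |
| admit | N | 8 | n | y | date8. | 15736 |
| length | N | 8 | n | y | | 21 |
| discharge | N | 8 | n | n | date8. | 15757 |
| `_N_` | | | | | | 1 |
| `_ERROR_` | | | | | | 0 |
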
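
PDV: execute phase (implicit output at the end of the step)

data admits; set admits; discharge = admit + length; format discharge date8.; run; /* implicit output */

| Name | Type | Size | Drop | Retain | Format | Value |
|---|---|---|---|---|---|---|
| patientID | C | 6 | n | y | | 321C-4 |
| gender | C | 1 | n | y | | M |
| admit | N | 8 | n | y | date8. | 15736 |
| length | N | 8 | n | y | | 21 |
| discharge | N | 8 | n | n | date8. | 15757 |
| `_N_` | | | | | | 1 |
| `_ERROR_` | | | | | | 0 |
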
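
PDV: execute phase (start of the second iteration: discharge is reset to missing, variables read by SET are retained)

data admits; set admits; discharge = admit + length; format discharge date8.; run;

| Name | Type | Size | Drop | Retain | Format | Value |
|---|---|---|---|---|---|---|
| patientID | C | 6 | n | y | | 321C-4 |
| gender | C | 1 | n | y | | M |
| admit | N | 8 | n | y | date8. | 15736 |
| length | N | 8 | n | y | | 21 |
| discharge | N | 8 | n | n | date8. | |
| `_N_` | | | | | | 2 |
| `_ERROR_` | | | | | | 0 |
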
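The walk-through above can be reproduced in the log with `put _all_`, which writes the current contents of the PDV. This is a sketch: the first step builds a one-row `admits` table from the values shown on the slides, and the two PUT statements are additions for illustration, not part of the original program.

```sas
/* Build the single observation used on the slides. */
data admits;
  length patientID $ 6 gender $ 1;
  input patientID $ gender $ admit length;
  format admit date8.;
  datalines;
321C-4 M 15736 21
;
run;

data admits;
  put 'Top of iteration: ' _all_;   /* discharge is missing here every time */
  set admits;
  discharge = admit + length;
  format discharge date8.;
  put 'After assignment: ' _all_;
run;                                /* implicit output happens here */
```

On the second pass the first PUT still runs, showing `_N_` at 2 with the retained values from the SET statement and `discharge` reset to missing. The SET statement then finds no more observations and the step ends.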
Efficiency: suspend the PDV activities
General principles
• Use BY processing whenever you can
• Given the data below, for each region, siteid, and date, calculate the mean and maximum ozone value.
General principles
• Easy:
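The slide's own code was not captured. A sketch of the kind of BY-group solution the heading suggests is shown here, with invented sample rows. The dataset name `ozone`, the variable `o3`, and the output names are hypothetical.

```sas
/* Hypothetical sample data in long form: one reading per row. */
data ozone;
  input region $ siteid $ date :yymmdd10. o3;
  format date date9.;
  datalines;
east A01 2003-01-31 41
east A01 2003-01-31 47
east A01 2003-02-01 39
west B07 2003-01-31 52
west B07 2003-01-31 58
;
run;

/* BY processing expects the data sorted (or indexed) by the BY variables. */
proc sort data = ozone;
  by region siteid date;
run;

proc means data = ozone noprint;
  by region siteid date;
  var o3;
  output out = daily (drop = _type_ _freq_)
         mean = mean_o3
         max  = max_o3;
run;
```

The result has one row per region, site, and date, and the same code works unchanged however many readings each group contains.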
General principles
• Suppose there are multiple monitors at each site and you still need to calculate the daily mean?
  – Combine multiple observations onto one line and then compute the statistics?
• Suppose you want the 10% trimmed mean?
• Suppose you want the second maximum?
  – Use arrays to sort the data?
  – Write your own function?
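One way to answer the second-maximum question while staying with BY processing is sketched below; it is not taken from the slides. The data stays in long form, is sorted descending within each group, and the second row of each group is kept. The sample rows and the names `ozone`, `o3`, `seq`, and `second_max` are hypothetical.

```sas
/* Hypothetical sample data: several monitors per site and day. */
data ozone;
  input region $ siteid $ date :yymmdd10. o3;
  format date date9.;
  datalines;
east A01 2003-01-31 41
east A01 2003-01-31 47
east A01 2003-01-31 44
west B07 2003-01-31 52
west B07 2003-01-31 58
;
run;

/* Largest value first within each region/site/date group. */
proc sort data = ozone out = sorted;
  by region siteid date descending o3;
run;

data second_max;
  set sorted;
  by region siteid date;
  if first.date then seq = 0;   /* restart the counter for each group */
  seq + 1;                      /* sum statement: seq is retained     */
  if seq = 2 then output;       /* second row = second-largest value  */
  keep region siteid date o3;   /* KEEP uses the original name        */
  rename o3 = second_max_o3;
run;
```

Groups with a single reading produce no row, and tied maxima are returned as the second maximum. For the trimmed mean, the TRIMMED= option of PROC UNIVARIATE combined with a BY statement is worth a look before writing custom code.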