SORTING WITH SAS LONG, VERY LONG AND LARGE, VERY LARGE DATA

Aldi Kraja
Division of Statistical Genomics
SAS seminar series
June 02, 2008
SORT AND MERGE EXAMPLE

    data a;
      input id m1 $ m2 $ m3 $ DNAreserve;
      datalines;
    1 1/1 1/2 1/1 12
    2 1/2 1/1 2/2 14
    3 2/2 1/1 1/1 15
    4 1/2 1/2 1/2 16
    5 1/1 2/2 1/1 15
    ;
    run;

    proc sort data=a;
      by id;
    run;
SORT AND MERGE EXAMPLE (CONT.)

    data b;
      input id age sex SBP DBP;
      datalines;
    1 23 1 128 95
    2 25 2 115 84
    3 30 1 120 85
    4 27 1 130 90
    5 35 2 122 82
    ;
    run;

    proc sort data=b;
      by id;
    run;
SORT AND MERGE EXAMPLE (CONT.)

    data ab;
      merge a (in=in1) b (in=in2);
      by id;
      if in1 and in2;
    run;

    proc print data=ab;
      title "A and B merged";
    run;

Output:

    A and B merged                           Monday, June 2, 2008

    Obs  id  m1   m2   m3   DNAreserve  age  sex  SBP  DBP
      1   1  1/1  1/2  1/1          12   23    1  128   95
      2   2  1/2  1/1  2/2          14   25    2  115   84
      3   3  2/2  1/1  1/1          15   30    1  120   85
      4   4  1/2  1/2  1/2          16   27    1  130   90
      5   5  1/1  2/2  1/1          15   35    2  122   82
EXAMPLE 2: JOIN TABLES WITH SQL

    proc sql;
      create table sqlab as
      select *
      from a, b
      where a.id=b.id;
    quit;

    proc print data=sqlab;
      title "SQL joined tables";
    run;
TIME:

Merge:

    sorting a:  real time 0.01 seconds   cpu time 0.01 seconds
    sorting b:  real time 0.01 seconds   cpu time 0.01 seconds
    merge:      real time 0.01 seconds   cpu time 0.01 seconds

SQL:

    NOTE: PROCEDURE SQL used (Total process time):
          real time 0.01 seconds
          cpu time  0.01 seconds

With data this small the two approaches cannot be told apart. Test with large and long data to see whether PROC SQL offers any advantage.
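One way to run such a test is sketched below. The data set names (big_a, big_b, sa, sb, merged, joined), the size in &n, and the random values are all hypothetical and are not from the original slides; the point is only to put both approaches on the same larger input and compare the timing notes in the log.

```sas
/* Hypothetical test size; increase it until the timings become informative */
%let n = 1000000;

/* big_a is written in descending id order, so it really needs sorting */
data big_a;
  do id = &n to 1 by -1;
    DNAreserve = ceil(20 * ranuni(2008));
    output;
  end;
run;

data big_b;
  do id = 1 to &n;
    age = 20 + floor(40 * ranuni(602));
    output;
  end;
run;

/* Ask SAS to print more detailed resource usage for each step in the log */
options fullstimer;

/* Approach 1: sort both sets, then merge by id */
proc sort data=big_a out=sa; by id; run;
proc sort data=big_b out=sb; by id; run;

data merged;
  merge sa (in=in1) sb (in=in2);
  by id;
  if in1 and in2;
run;

/* Approach 2: join the unsorted sets with PROC SQL */
proc sql;
  create table joined as
  select a.id, a.DNAreserve, b.age
  from big_a as a, big_b as b
  where a.id = b.id;
quit;
```

For a fair comparison, add the real and CPU times of the two sorts and the merge, and set that total against the single PROC SQL step.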
EXAMPLE 3: SORT FLAGS (IN THE DESCRIPTOR PORTION OF A DATASET)

    The CONTENTS Procedure

    Data Set Name   WORK.A     Observations   5
    Member Type     DATA       Variables      5

    Sort Information

    Sortedby    id
    Validated   YES
EXAMPLE 3: SORT FLAGS (CONT.)

    data one (sortedby=id);
      input id;
      datalines;
    1
    4
    3
    5
    2
    ;
    run;

    proc contents data=one;
      title " data one with option sortedby=id ";
    run;
EXAMPLE 3: SORT FLAGS (CONT.)

    proc sort data=one;
      by id;
    run;

    data two;
      set one;
      by id;
    run;

    proc sql;
      create index id on one(id);
    quit;

    proc datasets nolist;
      modify one;
      index create id;
    run;
SORTING LARGE DATA ON MANY KEYS

Problems:
- Disk space or temporary space may be inadequate
- The time needed may be quite long
- The software or the operating system may not work correctly while sorting large data

The WORK directory is normally located under /tmp on a server. If the data to be sorted are 3 GB and /tmp is limited to 1 GB, can SAS do the sort?

What if 8 jobs run in parallel on the same server with 8 processors, each trying to sort a different very large and long data set for a different purpose?
EXAMPLE 4: TAGSORT OPTION

    data a;
      input pedid id m1 $ m2 $ m3 $ DNAreserve;
      datalines;
    1 1 1/1 1/2 1/1 12
    1 2 1/2 1/1 2/2 14
    1 3 2/2 1/1 1/1 15
    2 6 1/2 1/2 1/2 16
    2 5 1/1 2/2 1/1 15
    2 4 2/2 2/2 1/2 12
    ;
    run;

    proc sort tagsort data=a nodupkey out=sorted_a;
      by pedid id;
    run;
TAGSORT

- Introduced in version 6.07
- Can produce important improvements in clock time, but increases the CPU time
- Internally, the sort stores only the sort keys and the observation numbers in the temporary files
- These sort keys and observation numbers are the “tags” of TAGSORT
- At the end of the sort, the tags are used to retrieve the complete records from the original data set, now in sorted order
- Potential gains when the data set is very large
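The tag idea can be imitated by hand in a few DATA steps. The sketch below is only a conceptual illustration and is not how PROC SORT implements TAGSORT internally; the names tags, obsnum and sorted_by_tags are hypothetical, and the data are those of Example 4.

```sas
data a;
  input pedid id m1 $ m2 $ m3 $ DNAreserve;
  datalines;
1 1 1/1 1/2 1/1 12
1 2 1/2 1/1 2/2 14
1 3 2/2 1/1 1/1 15
2 6 1/2 1/2 1/2 16
2 5 1/1 2/2 1/1 15
2 4 2/2 2/2 1/2 12
;
run;

/* Step 1: build the tags, i.e. the sort keys plus the observation number */
data tags;
  set a (keep=pedid id);
  obsnum = _n_;
run;

/* Step 2: sort only the narrow tag set, not the full records */
proc sort data=tags;
  by pedid id;
run;

/* Step 3: walk the sorted tags and fetch each full record by direct access. */
/* The step ends when the first SET reaches the end of tags.                 */
data sorted_by_tags;
  set tags (keep=obsnum);
  set a point=obsnum;
run;
```

Only the keys and observation numbers go through the sort, which is why the temporary space shrinks. The random reads in Step 3 are the extra work that TAGSORT trades for that saving.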
EXAMPLE 5: GENESTAR PROJECT PROBLEM

- 8 large text files, read into SAS as 8 SAS datasets
- The data are very large

[Figure: dataset dimensions]

    S1-S400 by 1,044,977
    S1-S400 by 1,044,977
    S1-S149 by 1,044,977
    S1-S687 by 1,044,977
GENESTAR PROJECT PROBLEM

A. Split the data for each subject into a new dataset: d1-d3236
B. Split the data for each subject into 25 chromosomes: d1c1-d1c25 … d3236c1-d3236c25
- Transpose markers in batches of 200 markers at a time and place the data together for a chromosome
- Finally, with proc append, place together the subjects of the same chromosome

Started:

    Subject  marker  geno  genocall
    1        m1      1/1   0.7560
    1        m2      1/2   0.76899
    …

Ended:

    Subject  m1      m2       …
    1        0.7560  0.76899
    2        0.9999  0.98999
    …

    Subject  m1   m2   …
    1        1/1  1/3
    2        1/2  3/3
    …
SORT IN THE GENESTAR PROJECT

    sas -memsize 16G pgm.sas &

    MPRINT(SORTIT):   proc sort data=in1.rawdataf8 nodupkey out=a (keep=barcoden) ;
    SYMBOLGEN:  Macro variable BYL resolves to barcoden
    MPRINT(SORTIT):   by barcoden ;
    MPRINT(SORTIT):   run;

    NOTE: There were 718126154 observations read from the data set IN1.MYDATA.
    ERROR: Insufficient memory.
    NOTE: The SAS System stopped processing this step because of errors.
    NOTE: SAS set option OBS=0 and will continue to check
SORT ON LARGE DATA: IS IT NECESSARY?

I resolved the problem in the following way:
a) I removed every other variable from the data and kept only the BY variable in the set.
b) Only after a) did the sort with NODUPKEY work.

In addition, where I had another similar sort, I removed it and used steps that do the same thing without sorting.

Only now does the program no longer run out of memory. This suggests that SAS had no limit on the number of observations, and that the limit was the memory available on our server (the sort needed more than 16 GB of memory?). Related points: 32/64-bit issues and -memsize 0.
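The slides do not say which steps replaced the sort. As one possible illustration only, and not necessarily what was done in the GeneSTAR programs, the distinct values of a key can be collected with a DATA step hash object instead of PROC SORT NODUPKEY. The sketch assumes a SAS release that supports the hash object and reuses the data of Example 4; the names seen, rc and distinct_pedid are hypothetical.

```sas
data a;
  input pedid id m1 $ m2 $ m3 $ DNAreserve;
  datalines;
1 1 1/1 1/2 1/1 12
1 2 1/2 1/1 2/2 14
1 3 2/2 1/1 1/1 15
2 6 1/2 1/2 1/2 16
2 5 1/1 2/2 1/1 15
2 4 2/2 2/2 1/2 12
;
run;

data _null_;
  /* In-memory table holding one entry per distinct key, kept in ascending order */
  declare hash seen (ordered: 'a');
  seen.defineKey('pedid');
  seen.defineData('pedid');
  seen.defineDone();

  /* Single pass over the data, reading only the key variable */
  do until (eof);
    set a (keep=pedid) end=eof;
    rc = seen.replace();   /* adds the key if new, overwrites it if already present */
  end;

  /* Write the distinct keys to a data set */
  rc = seen.output(dataset: 'distinct_pedid');
  stop;
run;
```

The hash table lives in memory, so this only helps when the number of distinct keys is small compared with the number of observations.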
EXAMPLE 6: SORT WITH SQL

    proc sql;
      create table sql_a as
      select *
      from a
      order by pedid, id;
    quit;
EXAMPLE 7: MERGE WITH INDEX WITHOUT SORTING DATA

    proc contents data=a;
      title "a is not sorted";
    run;

    proc contents data=b;
      title "b is not sorted";
    run;

    data a_index (index=(id));
      set a;
    run;

    data b_index (index=(id));
      set b;
    run;

    data final;
      set b_index;
      set a_index key=id;
    run;

    proc print data=final;
      title "Merged data based on index= id";
    run;
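When some ids in the driving data set have no match in the indexed one, a KEY= lookup sets the automatic variable _IORC_ to a nonzero value and raises _ERROR_. The sketch below shows one common way to handle that case; the data sets a_small, b_small and final_checked and their values are hypothetical, made up for illustration. Only the looked-up data set needs the index.

```sas
/* Indexed lookup table, deliberately not in id order */
data a_small (index=(id));
  input id m1 $;
  datalines;
2 1/2
1 1/1
3 2/2
;
run;

/* Driving table; id 4 has no match in a_small */
data b_small;
  input id age;
  datalines;
3 30
4 27
1 23
;
run;

data final_checked;
  set b_small;
  set a_small key=id / unique;   /* look up the current id through the index */
  if _iorc_ = 0 then output;     /* match found: keep the combined record */
  else _error_ = 0;              /* no match: clear the error flag and skip */
run;

proc print data=final_checked;
  title "Index lookup keeping only matched ids";
run;
```

This keeps only matched ids, like the `if in1 and in2;` filter in the merge example. Without the _IORC_ check, an unmatched id would be written out with values left over from the previous successful lookup.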
PROBLEMS WITH INDEXING

- Indexing can be faster than sorting
- The difference can be significant in large data
- SAS will create an extra file for the index, and this will be a large file. For example, for a 1.2 GB dataset SAS may create an index file of ~340 MB
- Advantage: a data set indexed on several variables can be used as if it were sorted on any one of them
- Both PROC DATASETS and PROC SQL can create indexes. For example:

    proc datasets library=work;
      modify a;
      index create idlist=(pedid id);
    run;
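For comparison, the same kind of composite index can be created with PROC SQL. This is a minimal sketch on a hypothetical copy of the Example 4 data (a_sql), so that it does not clash with an index already defined on a; the index name idlist follows the slide, and by_index is a hypothetical output name.

```sas
data a_sql;
  input pedid id m1 $ m2 $ m3 $ DNAreserve;
  datalines;
1 1 1/1 1/2 1/1 12
1 2 1/2 1/1 2/2 14
1 3 2/2 1/1 1/1 15
2 6 1/2 1/2 1/2 16
2 5 1/1 2/2 1/1 15
2 4 2/2 2/2 1/2 12
;
run;

/* Composite index on (pedid, id); its name must differ from the column names */
proc sql;
  create index idlist on a_sql (pedid, id);
quit;

/* BY processing on the unsorted but indexed set: SAS reads it through the index */
data by_index;
  set a_sql;
  by pedid id;
run;

proc print data=by_index;
  title "Read in pedid id order through the index, without PROC SORT";
run;
```

The BY step works without a prior sort because the index supplies the order, which is the advantage noted in the list above.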
READINGS:

- Paul M. Dorfman. QuickSorting an array. Paper 96-26.
- Paul M. Dorfman. Table look-up by direct addressing: key indexing – Bitmapping – Hashing. Paper 8-26.
- Robert Patten (Lands’ End, Dodgeville, WI). Randomly Selecting Observations. Paper 075-29. http://www2.sas.com/proceedings/sugi29/075-29.pdf