Linux command line tools
24/10/17
Prepared for the Centre of System Genomics
Created by Bobbie Shaban
Centre for System Genomics
This document outlines the material for the tutorial. This tutorial will also be recorded and uploaded to the Genomic Databases Resource Hub (COGENT) as a webinar for download: https://blogs.unimelb.edu.au/system-genomics/
This tutorial assumes that you have an account. If you don’t have an account please contact your group leader to give you access.
1. Subject
Linux command line tools
Bobbie Shaban: [email protected]
2. Glossary
cut: is a Unix command line utility which is used to extract sections from each line of input.
uniq: is a Unix utility which, when fed a text file, outputs the file with adjacent identical lines collapsed to one.
wc: (short for word count) is a command in Unix-like operating systems. The program reads either standard input or a list of files and generates one or more of the following statistics: newline count, word count, and byte count.
du: is a command line utility for reporting file system disk space usage. It can be used to find out disk usage for files and folders and to show what is taking up space. It supports showing just directories or all files, showing a grand total, outputting in human readable format and can be combined with other UNIX tools to output a sorted list of the largest files or folders on a system.
3. Tutorial
DU
How to view a disk usage summary of a directory
To view a disk usage summary of a directory pass the directory to the du command, or run du with no argument to summarise the current directory. This prints the disk usage of the directory and of each folder beneath it.
[bshaban@snowy033 tutorials]$ echo $PWD
/vlsci/SG0009/bshaban/tutorials
[bshaban@snowy033 tutorials]$ du
26400 ./tute5
32 ./tute2
2144 ./tute4/start/new_backup/new_backup/new_backup
730816 ./tute4/start/new_backup/new_backup
1459552 ./tute4/start/new_backup
3647840 ./tute4/start
362208 ./tute4/backup
64 ./tute4/vimdiff
0 ./tute4/transfer
4288 ./tute4/bunch_of_text_files
3872 ./tute4/unzip/bunch_of_text_files
3872 ./tute4/unzip
362208 ./tute4/new_backup/backup
0 ./tute4/new_backup/transfer
1086624 ./tute4/new_backup
5829312 ./tute4
5855744 .
The output shows the disk usage in kilobytes in the first column followed by the path to the folder, relative to where du was run. Each folder's figure is a summary: it includes the files and folders within it.
How to view a grand total for a directory
To view a grand total for a directory pass the -c option. This will show the full output and append a total line.
[bshaban@snowy033 tutorials]$ du -c
26400 ./tute5
32 ./tute2
2144 ./tute4/start/new_backup/new_backup/new_backup
730816 ./tute4/start/new_backup/new_backup
1459552 ./tute4/start/new_backup
3647840 ./tute4/start
362208 ./tute4/backup
64 ./tute4/vimdiff
0 ./tute4/transfer
4288 ./tute4/bunch_of_text_files
3872 ./tute4/unzip/bunch_of_text_files
3872 ./tute4/unzip
362208 ./tute4/new_backup/backup
0 ./tute4/new_backup/transfer
1086624 ./tute4/new_backup
5829312 ./tute4
5855744 .
5855744 total
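When only the grand total matters, the per-directory lines can be dropped by piping the output to tail. The following is a minimal sketch, assuming a scratch directory created with mktemp; the directory and file names are made up for illustration.

```shell
# Build a small scratch tree (hypothetical names) to run du against.
demo=$(mktemp -d)
mkdir "$demo/sub"
printf 'some text\n' > "$demo/sub/notes.txt"

# -c appends a "total" line; tail keeps only that last line.
total_line=$(du -c "$demo" | tail -n 1)
echo "$total_line"

# du separates its two columns with a tab, which is cut's default delimiter,
# so the second field of the last line is the word "total".
total_label=$(printf '%s\n' "$total_line" | cut -f 2)

# Remove the scratch tree.
rm -rf "$demo"
```

The size printed depends on the file system, so only the label is checked here.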
How to view disk usage in human readable format
To view disk usage in human readable format pass the -h option. Instead of showing the size in kilobytes for all files and folders, the output uses units such as K, M and G.
[bshaban@snowy033 tutorials]$ du -h
26M ./tute5
32K ./tute2
2.1M ./tute4/start/new_backup/new_backup/new_backup
714M ./tute4/start/new_backup/new_backup
1.4G ./tute4/start/new_backup
3.5G ./tute4/start
354M ./tute4/backup
64K ./tute4/vimdiff
0 ./tute4/transfer
4.2M ./tute4/bunch_of_text_files
3.8M ./tute4/unzip/bunch_of_text_files
3.8M ./tute4/unzip
354M ./tute4/new_backup/backup
0 ./tute4/new_backup/transfer
1.1G ./tute4/new_backup
5.6G ./tute4
5.6G .
How to view the file size of a directory
To view the file size of a directory pass the -s option to the du command followed by the folder. This will print a grand total size for the folder to standard output.
[bshaban@snowy033 tutorials]$ du -s tute4
5829312 tute4
Combined with the -h option, the total is shown in human readable format.
[bshaban@snowy033 tutorials]$ du -sh tute4/
5.6G tute4/
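Because du prints the size in the first column, its output can be passed to sort to see what is using the most space. The following is a sketch, assuming a scratch tree built with mktemp and made-up file names. The -k option forces kilobyte units so that the numeric sort compares like with like.

```shell
# Build a scratch tree (hypothetical names) with one large and one small folder.
demo=$(mktemp -d)
mkdir "$demo/big" "$demo/small"
seq 1 50000 > "$demo/big/data.txt"
seq 1 5000 > "$demo/small/data.txt"

# sort -rn sorts numerically on the leading size column, largest first.
ranked=$(du -k "$demo" | sort -rn)
echo "$ranked"

# The path of the entry that sorted first (second tab-separated column).
largest=$(printf '%s\n' "$ranked" | head -n 1 | cut -f 2)

# du prints one line per folder: big, small and the top-level folder.
line_count=$(printf '%s\n' "$ranked" | wc -l | tr -d ' ')

# Remove the scratch tree.
rm -rf "$demo"
```

Adding head -n 10 to the end of the pipeline limits the list to the ten largest folders.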
WC
What is the wc command in UNIX?
The wc command in UNIX is a command line utility for printing newline, word and byte counts for files. It can return the number of lines in a file, the number of characters in a file and the number of words in a file. It can also be combined with pipes for general counting operations.
How to get count information on a file
To get count information on a file use the wc command with no options.
wc hg.bed
197782 2373384 23385651 hg.bed
The output is number of lines, number of words, number of bytes, filename.
How to print the number of lines in a file
To print the number of lines in a file (or more specifically newline counts) use the -l option.
wc -l hg.bed
197782 hg.bed
How to print the number of characters in a file
To print the number of characters in a file use the -m option.
wc -m hg.bed
23385651 hg.bed
How to print the number of bytes in a file
To print the number of bytes in a file use the -c option.
wc -c hg.bed
23385651 hg.bed
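The byte and character counts for hg.bed are identical because the file contains only single-byte characters. The following small sketch shows how the two can differ, using the three bytes that encode '♣' in UTF-8. The character count depends on the locale, so no value is claimed for it here.

```shell
# The octal escapes are the three UTF-8 bytes of the '♣' character.
bytes=$(printf '\342\231\243' | wc -c | tr -d ' ')

# In a UTF-8 locale wc -m counts this as one character;
# in other locales it may count the bytes instead.
chars=$(printf '\342\231\243' | wc -m | tr -d ' ')

echo "bytes=$bytes chars=$chars"
```

The tr -d ' ' step removes the padding that some versions of wc put in front of the number.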
How to print the number of words in a file
To print the number of words in a file use the -w option.
wc -w hg.bed
2373384 hg.bed
How to count records in a number of files
To count the number of records (or rows) in several files wc can be given all of the files at once, for example with a shell wildcard. In the following example there are three files. The requirement is to find out the sum of records in all three files.
wc -l *
10000 a.txt
20000 b.txt
197782 hg.bed
227782 total
Done. There are 227782 records across the 3 files.
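If only the combined figure is wanted, the files can be concatenated and piped to wc, which then prints a single count with no file names. The following is a sketch, assuming two made-up files generated with seq in a scratch directory.

```shell
# Create two small files (hypothetical names) with 10 and 20 lines.
demo=$(mktemp -d)
seq 1 10 > "$demo/a.txt"
seq 1 20 > "$demo/b.txt"

# wc reads the concatenated lines from the pipe, so it prints one number only.
total=$(cat "$demo/a.txt" "$demo/b.txt" | wc -l | tr -d ' ')
echo "$total"

# Remove the scratch directory.
rm -rf "$demo"
```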
How to count the number of files in a directory
To count the number of folders and files in a directory wc can be combined with the ls command. By passing the -1 option to ls it will print each file or folder on a new line. This can be piped to wc to give a count.
ls -1 | wc
3 3 19
The first number is the line count, which here is the number of files and folders.
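To get just the number of entries, wc can be restricted to the line count with -l. The following is a sketch, assuming a scratch directory that holds three empty, made-up files.

```shell
# Create a scratch directory containing three files (hypothetical names).
demo=$(mktemp -d)
touch "$demo/a.txt" "$demo/b.txt" "$demo/hg.bed"

# One name per line from ls, counted by wc -l.
count=$(ls -1 "$demo" | wc -l | tr -d ' ')
echo "$count"

# Remove the scratch directory.
rm -rf "$demo"
```

Note that ls leaves out names beginning with a dot unless the -A option is given, so hidden files are not included in this count.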
CUT
The cut command in UNIX is a command line utility for cutting sections from each line of files and writing the result to standard output. It can be used to cut parts of a line by byte position, character and delimiter. It can also be used to cut data from file formats like CSV.
How to cut by byte position
To cut out a section of a line by specifying a byte position use the -b option.
echo 'baz' | cut -b 2
a
echo 'baz' | cut -b 1-2
ba
echo 'baz' | cut -b 1,3
bz
How to cut by character
To cut by character use the -c option. This selects the characters given to the -c option. This can be a list of comma separated numbers, a range of numbers or a single number.
Where your input stream is character based -c can be a better option than selecting by bytes, as characters are often more than one byte. Note that some builds of GNU cut treat -c the same as -b, in which case multi-byte characters are not selected correctly.
In the following example the character ‘♣’ is three bytes. By using the -c option the character can be correctly selected along with any other characters that are of interest.
echo '♣foobar' | cut -c 1,6
♣a
echo '♣foobar' | cut -c 1-3
♣fo
How to cut based on a delimiter
To cut using a delimiter use the -d option. This is normally used in conjunction with the -f option to specify the field that should be cut.
In the following example a CSV file exists and is saved as names.csv.
John,Smith,34,London
Arthur,Evans,21,Newport
George,Jones,32,Truro
The delimiter can be set to a comma with -d ','. cut can then pull out the fields of interest with the -f flag. In the following example the first field is cut.
cut -d ',' -f 1 names.csv
John
Arthur
George
Multiple fields can be cut by passing a comma separated list.
cut -d ',' -f 1,4 names.csv
John,London
Arthur,Newport
George,Truro
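The -f option also accepts ranges, in the same way as -b and -c. The following is a short sketch using the first record of names.csv held in a shell variable.

```shell
# First record from names.csv.
record='John,Smith,34,London'

# Fields 2 to 3 inclusive.
middle=$(printf '%s\n' "$record" | cut -d ',' -f 2-3)
echo "$middle"        # Smith,34

# Field 3 through to the end of the line.
rest=$(printf '%s\n' "$record" | cut -d ',' -f 3-)
echo "$rest"          # 34,London
```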
How to modify the output delimiter
To modify the output delimiter use the --output-delimiter option. Note that this option is not available on the BSD version of cut. In the following example a semi-colon is converted to a space and the first, third and fourth fields are selected.
echo 'how;now;brown;cow' | cut -d ';' -f 1,3,4 --output-delimiter=' '
how brown cow
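Where --output-delimiter is not available, a similar result can be had by piping the output of cut through tr. This is a workaround sketch rather than a feature of cut itself.

```shell
# cut keeps ';' between the selected fields; tr then swaps each ';' for a space.
result=$(echo 'how;now;brown;cow' | cut -d ';' -f 1,3,4 | tr ';' ' ')
echo "$result"        # how brown cow
```

tr only translates single characters, so this approach suits one-character delimiters.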
UNIQ
The uniq command in UNIX is a command line utility for reporting or filtering repeated lines in a file. It can remove duplicates, show a count of occurrences, show only repeated lines, ignore certain characters and compare on specific fields. The command expects adjacent comparison lines so it is often combined with the sort command.
Uniq expects adjacent lines
The uniq command expects duplicate lines to be adjacent in its input. To find unique occurrences where the lines are not adjacent a file needs to be sorted before passing to uniq. uniq will operate as expected on the following file that is named authors.txt.
Chaucer
Chaucer
Orwell
Larkin
Larkin
As duplicates are adjacent uniq will return unique occurrences and send the result to standard output.
uniq authors.txt
Chaucer
Orwell
Larkin
Suppose that a file named authors2.txt exists where the duplicates in the file are not adjacent.
Chaucer
Larkin
Orwell
Chaucer
Larkin
Passing this file to uniq will simply return the contents of the file. Where files are not already sorted the sort command can be used to sort the file first before piping to uniq.
sort authors2.txt | uniq
Chaucer
Larkin
Orwell
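sort can also drop the duplicates itself with the -u option, which gives the same result as piping to uniq when none of the other uniq options are needed. The following is a sketch that recreates authors2.txt in a scratch directory.

```shell
# Recreate authors2.txt, where the duplicates are not adjacent.
demo=$(mktemp -d)
printf '%s\n' Chaucer Larkin Orwell Chaucer Larkin > "$demo/authors2.txt"

# Sort, then collapse adjacent duplicates with uniq.
piped=$(sort "$demo/authors2.txt" | uniq)

# Let sort remove the duplicates directly.
direct=$(sort -u "$demo/authors2.txt")
echo "$direct"

# Remove the scratch directory.
rm -rf "$demo"
```

Both commands print the three names in sorted order: Chaucer, Larkin, Orwell.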
How to show a count of the number of times a line occurred
To output the number of occurrences of a line use the -c option in conjunction with uniq. This prepends a number value to the output of each line.
uniq -c authors.txt
2 Chaucer
1 Orwell
2 Larkin
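A common extension is to sort the counted output so that the most frequent lines come first. The following is a sketch, assuming a made-up file words.txt. The input is sorted first so that identical lines are adjacent before uniq counts them.

```shell
# Create a file (hypothetical name and contents) with unsorted, repeated lines.
demo=$(mktemp -d)
printf '%s\n' Larkin Orwell Larkin Chaucer Orwell Larkin > "$demo/words.txt"

# Count each distinct line, then sort numerically on the count, largest first.
ranked=$(sort "$demo/words.txt" | uniq -c | sort -rn)
echo "$ranked"

# Strip the padding uniq adds before the count and keep the name on the top line.
top=$(printf '%s\n' "$ranked" | head -n 1 | sed 's/^ *//' | cut -d ' ' -f 2)
echo "$top"           # Larkin

# Remove the scratch directory.
rm -rf "$demo"
```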
How to only show repeated lines
To only show repeated lines pass the -d option to uniq. This will output only lines that occur more than once and write the result to standard output.
uniq -d authors.txt
Chaucer
Larkin
How to only show lines that are not repeated
To only show lines that are not repeated pass the -u option to uniq. This will output only lines that are not repeated and write the result to standard output.
uniq -u authors.txt
Orwell
How to ignore characters in comparison
To ignore characters in a comparison pass the -s option to uniq. This skips the given number of leading characters on each line before comparing and writes the result to standard output.
Suppose a list of authors exists in a file that is saved as authors.txt. The file has some numbers in front of the names of the authors.
1Chaucer
2Chaucer
3Larkin
4Larkin
5Orwell
To return a list of the authors the numbers can be ignored by using the -s option. This will skip the number of characters it is given before doing the comparison.
uniq -s 1 authors.txt
1Chaucer
3Larkin
5Orwell
How to ignore fields in comparison
To ignore fields in a comparison pass the -f option to uniq. This skips the given number of leading fields and runs the comparison on the rest of the line, writing the result to standard output.
Suppose a file exists with a list of cricketers and the clubs that they play for. This is saved as cricketers.txt.
Tom Westley Essex
Ravi Bopara Essex
Marcus Trescothick Somerset
Joe Root Yorkshire
Jonny Bairstow Yorkshire
A field is considered as a string of non-blank characters separated from adjacent fields by blanks. The uniq utility may be used to group by the county that these cricketers play for.
uniq -f 2 cricketers.txt
Tom Westley Essex
Marcus Trescothick Somerset
Joe Root Yorkshire
As with the -s option uniq outputs the first occurrence it finds. It can be combined with the -c option to output a count.
uniq -f 2 -c cricketers.txt
2 Tom Westley Essex
1 Marcus Trescothick Somerset
2 Joe Root Yorkshire
To just see the list of counties sed and cut may be used to clean this up.
uniq -f 2 -c cricketers.txt | sed 's/^\s*//' | cut -d ' ' -f 1,4
2 Essex
1 Somerset
2 Yorkshire
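An alternative is to extract the county column with cut before counting, so that uniq only ever sees the county names. The following is a sketch that recreates cricketers.txt in a scratch directory. The sort step is an addition so that the result does not depend on the order of the file.

```shell
# Recreate cricketers.txt with one space between fields.
demo=$(mktemp -d)
printf '%s\n' \
  'Tom Westley Essex' \
  'Ravi Bopara Essex' \
  'Marcus Trescothick Somerset' \
  'Joe Root Yorkshire' \
  'Jonny Bairstow Yorkshire' > "$demo/cricketers.txt"

# Keep the third field, group identical counties together, count them,
# then strip the padding uniq puts in front of each count.
counties=$(cut -d ' ' -f 3 "$demo/cricketers.txt" | sort | uniq -c | sed 's/^ *//')
echo "$counties"

# Remove the scratch directory.
rm -rf "$demo"
```

This relies on the fields being separated by exactly one space, since cut treats every space as a delimiter.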