
Linux command line tools

24/10/17

Prepared for the Centre of System Genomics

Created by Bobbie Shaban

Centre for System Genomics

This document outlines the material for the tutorial. This tutorial will also be recorded and uploaded to the Genomic Databases Resource Hub (COGENT) as a webinar for download: https://blogs.unimelb.edu.au/system-genomics/

This tutorial assumes that you have an account. If you don’t have an account please contact your group leader to give you access.

1. Subject

Linux command line tools

Bobbie Shaban: [email protected]

2. Glossary

cut: is a Unix command line utility which is used to extract sections from each line of input.

uniq: is a Unix utility which, when fed a text file, outputs the file with adjacent identical lines collapsed to one.

wc: (short for word count) is a command in Unix-like operating systems. The program reads either standard input or a list of files and generates one or more of the following statistics: newline count, word count, and byte count.

du: is a command line utility for reporting file system disk space usage. It can be used to find out disk usage for files and folders and to show what is taking up space. It supports showing just directories or all files, showing a grand total, and outputting in human readable format, and it can be combined with other UNIX tools to output a sorted list of the largest files or folders on a system.


3. Tutorial

DU

How to view a disk usage summary of a directory

To view a disk usage summary of a directory pass the directory to the du command, or run du with no argument to report on the current directory. This will print a summary for the directory and each folder within it.

[bshaban@snowy033 tutorials]$ echo $PWD

/vlsci/SG0009/bshaban/tutorials

[bshaban@snowy033 tutorials]$ du

26400 ./tute5

32 ./tute2

2144 ./tute4/start/new_backup/new_backup/new_backup

730816 ./tute4/start/new_backup/new_backup

1459552 ./tute4/start/new_backup

3647840 ./tute4/start

362208 ./tute4/backup

64 ./tute4/vimdiff

0 ./tute4/transfer

4288 ./tute4/bunch_of_text_files

3872 ./tute4/unzip/bunch_of_text_files

3872 ./tute4/unzip

362208 ./tute4/new_backup/backup


0 ./tute4/new_backup/transfer

1086624 ./tute4/new_backup

5829312 ./tute4

5855744 .

The output shows the disk usage in kilobytes in the first column, followed by the path to each folder. The figure for a folder is a summary, so it includes the sum of the files and folders within it.

How to view a grand total for a directory

To view a grand total for a directory pass the -c option. This will show the full output and append a total line.

[bshaban@snowy033 tutorials]$ du -c

26400 ./tute5

32 ./tute2

2144 ./tute4/start/new_backup/new_backup/new_backup

730816 ./tute4/start/new_backup/new_backup

1459552 ./tute4/start/new_backup

3647840 ./tute4/start

362208 ./tute4/backup

64 ./tute4/vimdiff

0 ./tute4/transfer

4288 ./tute4/bunch_of_text_files

3872 ./tute4/unzip/bunch_of_text_files


3872 ./tute4/unzip

362208 ./tute4/new_backup/backup


1086624 ./tute4/new_backup

5829312 ./tute4

5855744 .

5855744 total

How to view disk usage in human readable format

To view disk usage in human readable format pass the -h option. Instead of showing the size in kilobytes for all files and folders, the output is converted into a human readable format with unit suffixes such as K, M and G.

[bshaban@snowy033 tutorials]$ du -h

26M ./tute5

32K ./tute2

2.1M ./tute4/start/new_backup/new_backup/new_backup

714M ./tute4/start/new_backup/new_backup

1.4G ./tute4/start/new_backup

3.5G ./tute4/start

354M ./tute4/backup

64K ./tute4/vimdiff

0 ./tute4/transfer


4.2M ./tute4/bunch_of_text_files

3.8M ./tute4/unzip/bunch_of_text_files

3.8M ./tute4/unzip

354M ./tute4/new_backup/backup


1.1G ./tute4/new_backup

5.6G ./tute4

5.6G .

How to view the file size of a directory

To view the file size of a directory pass the -s option to the du command followed by the folder. This will print a grand total size for the folder to standard output.

[bshaban@snowy033 tutorials]$ du -s tute4

5829312 tute4

Combined with the -h option, the total is shown in human readable format.

[bshaban@snowy033 tutorials]$ du -sh tute4/

5.6G tute4/
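The glossary notes that du can be combined with other tools to list the largest folders. The following is a minimal sketch of that idea, assuming a throwaway directory created with mktemp; the folder names small and large are invented for illustration.

```shell
# Build a throwaway directory tree (folder names are hypothetical).
demo=$(mktemp -d)
mkdir "$demo/small" "$demo/large"
printf 'x\n' > "$demo/small/a.txt"
# Write roughly 1 MB of random bytes into the "large" folder.
dd if=/dev/urandom of="$demo/large/b.bin" bs=1024 count=1024 2>/dev/null

# -k forces kilobyte units; sort -rn puts the biggest numbers first;
# head keeps the top two lines (the directory itself, then its largest folder).
largest=$(du -k "$demo" | sort -rn | head -n 2)
echo "$largest"

# Clean up the scratch directory.
rm -rf "$demo"
```

Sorting on the plain kilobyte figures is deliberate here: the suffixed values printed by -h do not sort correctly with a simple numeric sort.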

WC

What is the wc command in UNIX?

The wc command in UNIX is a command line utility for printing newline, word and byte counts for files. It can return the number of lines in a file, the number of characters in a file and the number of words in a file. It can also be combined with pipes for general counting operations.

How to get count information on a file

To get count information on a file use the wc command with no options.

wc hg.bed


197782 2373384 23385651 hg.bed

The output is number of lines, number of words, number of bytes, filename.

How to print the number of lines in a file

To print the number of lines in a file (or more specifically newline counts) use the -l option.

wc -l hg.bed

197782 hg.bed
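When the count is needed on its own, for example inside a script, the file can be fed to wc on standard input so that no file name is printed. This is a small sketch using an invented three-line file rather than hg.bed.

```shell
# Create a small stand-in file (contents are made up).
sample=$(mktemp)
printf 'one\ntwo\nthree\n' > "$sample"

# Reading from standard input means wc has no file name to print.
# tr removes the padding that some versions of wc put before the number.
lines=$(wc -l < "$sample" | tr -d ' ')
echo "$lines"

rm -f "$sample"
```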

How to print the number of characters in a file

To print the number of characters in a file use the -m option.

wc -m hg.bed

23385651 hg.bed

How to print the number of bytes in a file

To print the number of bytes in a file use the -c option.

wc -c hg.bed

23385651 hg.bed
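For hg.bed the -m and -c counts are identical, presumably because the file holds only single-byte characters. The sketch below shows where the two can differ. The sample symbol is chosen for illustration, and the character count assumes the shell is running in a UTF-8 locale.

```shell
# The symbol below is encoded as three bytes in UTF-8.
bytes=$(printf '♣' | wc -c | tr -d ' ')

# The character count depends on the current locale; in a UTF-8 locale
# the symbol is counted as one character.
chars=$(printf '♣' | wc -m | tr -d ' ')

echo "bytes=$bytes chars=$chars"
```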

How to print the number of words in a file

To print the number of words in a file use the -w option.

wc -w hg.bed

2373384 hg.bed

How to count records in a number of files

To count the number of records (or rows) in several files, wc can be given all of the files at once or used in conjunction with pipes. In the following example there are three files, which are matched with a shell wildcard. The requirement is to find out the sum of records in all three files.


wc -l *

10000 a.txt

20000 b.txt

197782 hg.bed

227782 total

Done. There are 227782 records across the three files.
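When only the combined figure is wanted, the files can be concatenated and piped to wc, which then prints a single number with no file names. This is a sketch using small stand-in files; the names and contents are made up.

```shell
# Two small stand-in files in a scratch directory.
work=$(mktemp -d)
printf 'r1\nr2\n' > "$work/a.txt"
printf 'r3\nr4\nr5\n' > "$work/b.txt"

# cat joins the files into one stream, so wc reports one total (2 + 3 lines).
total=$(cat "$work"/*.txt | wc -l | tr -d ' ')
echo "$total"

rm -rf "$work"
```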

How to count the number of files in a directory

To count the number of folders and files in a directory wc can be combined with the ls command. With the -1 option ls prints each file or folder on a new line; it does the same automatically when its output is piped. This can be piped to wc to give a count.

ls | wc

3 3 19
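Adding -l to wc reduces the output to the line count alone. This is a sketch assuming a scratch directory that holds three invented files.

```shell
# Scratch directory with three empty files (names are illustrative).
work=$(mktemp -d)
touch "$work/a.txt" "$work/b.txt" "$work/hg.bed"

# ls -1 prints one entry per line; wc -l counts those lines.
count=$(ls -1 "$work" | wc -l | tr -d ' ')
echo "$count"

rm -rf "$work"
```

A file name that itself contains a newline would be counted more than once, so this approach is best kept for directories with ordinary names.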

CUT

The cut command in UNIX is a command line utility for cutting sections from each line of files and writing the result to standard output. It can be used to cut parts of a line by byte position, character and delimiter. It can also be used to cut data from file formats like CSV.

How to cut by byte position

To cut out a section of a line by specifying a byte position use the -b option.

echo 'baz' | cut -b 2

a

echo 'baz' | cut -b 1-2

ba


echo 'baz' | cut -b 1,3

bz
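Ranges may also be left open at either end. This is a brief sketch with an arbitrary sample string.

```shell
# "3-" selects from byte 3 to the end of the line.
tail_part=$(echo 'foobar' | cut -b 3-)

# "-3" selects from the start of the line up to and including byte 3.
head_part=$(echo 'foobar' | cut -b -3)

echo "$tail_part $head_part"
```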

How to cut by character

To cut by character use the -c option. This selects the characters given to the -c option. This can be a list of comma separated numbers, a range of numbers or a single number.

Where your input stream is character based, -c can be a better option than selecting by bytes, as characters are often more than one byte. Note that some versions of GNU cut treat -c the same as -b, so multi-byte characters may not be handled as shown here.

In the following example the character ‘♣’ is three bytes. By using the -c option the character can be correctly selected along with any other characters that are of interest.

echo '♣foobar' | cut -c 1,6

♣a

echo '♣foobar' | cut -c 1-3

♣fo

How to cut based on a delimiter

To cut using a delimiter use the -d option. This is normally used in conjunction with the -f option to specify the field that should be cut.

In the following example a CSV file exists and is saved as names.csv.

John,Smith,34,London

Arthur,Evans,21,Newport

George,Jones,32,Truro

The delimiter can be set to a comma with -d ','. cut can then pull out the fields of interest with the -f flag. In the following example the first field is cut.

cut -d ',' -f 1 names.csv

John

Arthur

George

Multiple fields can be cut by passing a comma separated list.


cut -d ',' -f 1,4 names.csv

John,London

Arthur,Newport

George,Truro
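When no -d option is given, cut splits on tab characters, which suits tab-delimited formats such as BED. The sketch below uses three invented BED-style records rather than the real hg.bed file.

```shell
# Three tab-separated records: chromosome, start, end (values are made up).
bed=$(printf 'chr1\t100\t200\nchr1\t300\t400\nchr2\t50\t150\n')

# No -d is needed because tab is the default delimiter.
# Fields 1 and 3 are the chromosome and the end position.
cols=$(printf '%s\n' "$bed" | cut -f 1,3)
printf '%s\n' "$cols"
```

The selected fields are written out joined by the same tab delimiter.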

How to modify the output delimiter

To modify the output delimiter use the --output-delimiter option. Note that this option is not available on the BSD version of cut. In the following example a semi-colon is converted to a space and the first, third and fourth fields are selected.

echo 'how;now;brown;cow' | cut -d ';' -f 1,3,4 --output-delimiter=' '

how brown cow
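Where --output-delimiter is unavailable, one possible workaround is to pipe the result through tr. This is a sketch of that approach; it works because cut leaves the original delimiter between the selected fields, and no field can contain that delimiter.

```shell
# cut keeps ';' between the selected fields; tr then swaps each ';' for a space.
result=$(echo 'how;now;brown;cow' | cut -d ';' -f 1,3,4 | tr ';' ' ')
echo "$result"
```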

UNIQ

The uniq command in UNIX is a command line utility for reporting or filtering repeated lines in a file. It can remove duplicates, show a count of occurrences, show only repeated lines, ignore certain characters and compare on specific fields. The command only compares adjacent lines, so it is often combined with the sort command.

Uniq expects adjacent lines

The uniq command expects duplicate lines in its input to be adjacent. To find unique occurrences where the lines are not adjacent a file needs to be sorted before passing to uniq. uniq will operate as expected on the following file that is named authors.txt.

Chaucer

Chaucer

Orwell

Larkin


Larkin

As duplicates are adjacent uniq will return unique occurrences and send the result to standard output.

uniq authors.txt

Chaucer

Orwell

Larkin

Suppose that a file exists where the duplicates in the file are not adjacent.

Chaucer

Larkin

Orwell

Chaucer

Larkin

Passing this file to uniq will simply return the contents of the file. Where files are not already sorted, the sort command can be used to sort the file first before piping to uniq. An article outlining the usage of sort is available here: https://shapeshed.com/unix-sort/

sort authors2.txt | uniq

Chaucer

Larkin

Orwell

How to show a count of the number of times a line occurred

To output the number of occurrences of a line use the -c option in conjunction with uniq. This prepends a number value to the output of each line.

uniq -c authors.txt

2 Chaucer

1 Orwell

2 Larkin
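The same sorting step applies when counting: -c only counts adjacent repeats, so unsorted input should be sorted first. The sketch below recreates the unsorted author list from earlier in a temporary file.

```shell
# Recreate the unsorted list in a temporary file.
authors=$(mktemp)
printf 'Chaucer\nLarkin\nOrwell\nChaucer\nLarkin\n' > "$authors"

# Sort first so duplicates become adjacent, then count them.
# sed strips the leading padding that uniq -c puts before each count.
counts=$(sort "$authors" | uniq -c | sed 's/^ *//')
echo "$counts"

rm -f "$authors"
```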

How to only show repeated lines

To only show repeated lines pass the -d option to uniq. This will output only lines that occur more than once and write the result to standard output.

uniq -d authors.txt

Chaucer

Larkin

How to only show lines that are not repeated

To only show lines that are not repeated pass the -u option to uniq. This will output only lines that are not repeated and write the result to standard output.

uniq -u authors.txt

Orwell

How to ignore characters in comparison

To ignore characters in a comparison pass the -s option to uniq. This will skip the specified number of leading characters in the comparison and output the result to standard output.

Suppose a list of authors exists in a file that is saved as authors.txt. The file has some numbers in front of the names of the authors.

1Chaucer

2Chaucer

3Larkin

4Larkin

5Orwell

To return a list of the authors, the numbers can be ignored by using the -s option. This will skip the number of characters it is given before doing the comparison.

uniq -s 1 authors.txt

1Chaucer

3Larkin

5Orwell
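uniq -s keeps the numeric prefix of the first line in each group. If only the names are wanted, one option is to strip the leading character with cut before calling uniq. The sketch below recreates the numbered list in a temporary file.

```shell
# Recreate the numbered author list in a temporary file.
authors=$(mktemp)
printf '1Chaucer\n2Chaucer\n3Larkin\n4Larkin\n5Orwell\n' > "$authors"

# Drop the first character of every line, then collapse adjacent repeats.
names=$(cut -c 2- "$authors" | uniq)
echo "$names"

rm -f "$authors"
```

This assumes every prefix is exactly one character wide, as in the example file.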

How to ignore fields in comparison

To ignore fields in a comparison pass the -f option to uniq. This skips the given number of leading fields before running the comparison and outputs the result to standard output.

Suppose a file exists with a list of cricketers and the clubs that they play for. This is saved as cricketers.txt.

Tom Westley Essex

Ravi Bopara Essex

Marcus Trescothick Somerset

Joe Root Yorkshire

Jonny Bairstow Yorkshire

A field is considered as a string of non-blank characters separated from adjacent fields by blanks. The uniq utility may be used to group by the county that these cricketers play for.

uniq -f 2 cricketers.txt

Tom Westley Essex

Marcus Trescothick Somerset

Joe Root Yorkshire

As with the -s option uniq outputs the first occurrence it finds. It is possible to combine with the -c option to output a count.


uniq -f 2 -c cricketers.txt

2 Tom Westley Essex

1 Marcus Trescothick Somerset

2 Joe Root Yorkshire

To see just the counts and the counties, sed and cut may be used to clean this up.

uniq -f 2 -c cricketers.txt | sed 's/^\s*//' | cut -d ' ' -f 1,4

2 Essex

1 Somerset

2 Yorkshire
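The \s escape in the sed expression is a GNU extension. A possible alternative, sketched here with awk, avoids trimming the padding at all, because awk splits on runs of whitespace by default. The file is recreated in a temporary location for the example.

```shell
# Recreate cricketers.txt in a temporary file.
cricketers=$(mktemp)
printf 'Tom Westley Essex\nRavi Bopara Essex\nMarcus Trescothick Somerset\nJoe Root Yorkshire\nJonny Bairstow Yorkshire\n' > "$cricketers"

# After uniq -c, field 1 is the count and field 4 is the county.
counties=$(uniq -f 2 -c "$cricketers" | awk '{print $1, $4}')
echo "$counties"

rm -f "$cricketers"
```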

