Data Manipulation with AWK
Evangelos Pournaras, Izabela Moise, Dirk Helbing
AWK
A "Swiss Army knife" for data manipulation, retrieval, formatting, processing, transformation, prototyping and more...
About AWK
Check www.awk.info.
• A pattern scanning and processing language.
• AWK name: Alfred V. Aho, Peter J. Weinberger and Brian W. Kernighan (creators)
• An evolving yet stable, cross-platform language.
• Written in 1977 at AT&T Bell Laboratories.
• Data-driven language.
– A POSIX standard exists for AWK
– Various implementations: gawk, nawk, mawk, spawk, etc.
"AWK is a convenient and expressive programming language that can be applied to a wide variety of computing and data-manipulation tasks."
What you can do with AWK
• Manage small databases
• Validate data
• Produce indexes & perform document preparation tasks
• Experiment with algorithms you can adapt later to other programming languages
Implementations
• GAWK
– Extract bits and pieces of data for processing
– Sort data
– Perform simple network communications
• MAWK
– Efficiency, byte code interpreter
• JAWK
– Java support
• NAWK, XGAWK, SPAWK, QTAWK, RunAWK, etc.
AWK Advantages
• Very simple
• Easy learning curve
• Standardized
• On-the-fly calculations
• No need to open/close files
• Interpreted, not compiled
– Avoiding the edit-compile-test-debug lifecycle
Programming Philosophy
• Programming in AWK: building a list of rules
• Rules consist of a pattern and an action
– (pattern-1){action}
– (pattern-2){action}
– ...
• Linear scans, handling one data element at a time
– Resembling the Hadoop philosophy
– Random access seek times vs. hard drive sizes
• Manipulating delimited text files in a single pass
• By design, division of a file into records & fields
– Each line is a record
– Fields are delimited by a special character
Every clause is a potential action performed on the current record!
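The rule model on this slide can be sketched with a two-rule program. The input records (a name and a score per line) and the variable name rules_out are hypothetical, invented only for illustration.

```sh
# Hypothetical input: name and score, one record per line.
rules_out=$(printf 'ann 72\nbob 45\ncy 90\n' | awk '
    $2 >= 50 { print $1, "pass" }   # rule 1: the pattern selects records with a score of 50 or more
    $2 <  50 { print $1, "fail" }   # rule 2: every rule is tested against every record
')
echo "$rules_out"
```

Each record is read once and checked against both rules in order, which is the single-pass behaviour the slide describes.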
Comparison with other Languages
A case study with converting triplets to sparse matrices:
Source: https://github.com/brendano/awkspeed
Running an AWK program
Three ways to run an AWK program from the command line:
1. >awk 'program' input-file1 input-file2 ...
2. >awk -f program-file input-file1 input-file2 ...
3. Unix script: my-awk-script.sh
#!/usr/bin/awk -f
# awk rules go here
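A minimal sketch of the first two invocation styles, run against a throwaway data file. The file names data.txt and sum.awk and the sample records are assumptions made for this example; the third style works the same way once the script file is made executable and the awk path in its first line matches the local installation.

```sh
# Hypothetical sample data and file names, created in a temporary directory.
dir=$(mktemp -d)
printf 'alice 3\nbob 5\n' > "$dir/data.txt"

# 1. Program given inline on the command line.
out1=$(awk '{ sum += $2 } END { print sum }' "$dir/data.txt")

# 2. The same program stored in a file and loaded with -f.
printf '{ sum += $2 }\nEND { print sum }\n' > "$dir/sum.awk"
out2=$(awk -f "$dir/sum.awk" "$dir/data.txt")

echo "$out1 $out2"
rm -r "$dir"
```

Both invocations run the same rules, so they print the same sum.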
Program Structure
# Initialization body
BEGIN {
    # initialization actions
}
# Main execution body
{
    # main program actions
}
# Finalization body
END {
    # final actions
}
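The three bodies can be seen at work in a short sketch. The colon-separated input (item:quantity) and the variable names total and structure_out are hypothetical.

```sh
# Hypothetical colon-separated input: item:quantity
structure_out=$(printf 'apple:3\npear:5\nplum:4\n' | awk '
    BEGIN { FS = ":"; total = 0 }                      # runs once, before any input is read
          { total += $2 }                              # runs for every input record
    END   { print "records:", NR, "total:", total }    # runs once, after the last record
')
echo "$structure_out"
```

Setting FS inside BEGIN matters here: the separator must be in place before the first record is split into fields.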
AWK Demonstration
example-01.awk, example-02.awk
AWK Regular Expressions
A regular expression enclosed in slashes ('/') is a pattern that is checked against each input record.
• Can consist of letters, numbers, or both.
• /foo/
• ~ matches
• !~ does not match
• | alternation expression
• ^ matches the beginning of a string
• $ matches the end of a string
• . matches any single character
AWK Demonstration
Scripts
>awk '/\.edu/ {print $0}' mail-list.txt
>awk '$1 ~ /J/' inventory-shipped.txt
>awk '$3 ~ /edu$|be$/' mail-list.txt
>awk '{if (length($0)>max) max=length($0)} END{print max}' mail-list.txt
>awk 'NF>0' inventory-shipped.txt
>awk 'END{print NR}'
>awk 'NR%2==0' mail-list.txt
>awk '$1=="Jan" {sum+=$5} END{print sum}' inventory-shipped.txt
Variables
• No variable declaration is needed.
• No type declaration is needed.
• Built-in variables:
– NF: number of fields in the current record
– NR: current record number
– FS: field separator
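A small sketch of the three built-in variables, together with an undeclared user variable. The comma-separated input and the names count and vars_out are hypothetical.

```sh
vars_out=$(printf 'a,b,c\nd,e\n' | awk '
    BEGIN { FS = "," }                          # split fields on commas instead of whitespace
    { count += NF; print NR ": " NF " fields, last = " $NF }
    END { print "total fields: " count }        # count was never declared; it starts out empty (0)
')
echo "$vars_out"
```

$NF is the last field of the current record, since NF holds the number of fields.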
Functions
Specified as follows:
function awkFunction(a, b, c, d) {
    return a + b + c + d
}
Built-in functions:
• Numeric:
– sqrt, log, sin, cos, rand, etc.
• String:
– index, length, match, split, substr, etc.
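As an illustration, a user-defined function can be combined with a few of the built-ins listed above. The function name hypotenuse, the variable label, and the input pairs are hypothetical.

```sh
func_out=$(printf '3 4\n6 8\n' | awk '
    # hypotenuse is a hypothetical user-defined helper built on the sqrt built-in
    function hypotenuse(a, b) { return sqrt(a * a + b * b) }
    {
        label = "len=" hypotenuse($1, $2)
        # length, substr and index are built-in string functions
        print label, length(label), substr(label, 1, 3), index(label, "=")
    }
')
echo "$func_out"
```

Function definitions sit at the top level of the program, alongside the rules, not inside an action.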
Arrays
Associative arrays:
• Indices are strings rather than numbers
• arrayname[string]=value
• Multi-dimensional arrays:
– Supported by concatenation of indices into one string (joined by SUBSEP)
– foo[5,12]="value"
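The concatenation of indices can be made visible with a short sketch. The array name grid and the variables key, parts, and array_out are hypothetical; SUBSEP is the built-in separator AWK places between the indices.

```sh
array_out=$(awk 'BEGIN {
    grid[5, 12] = "value"                  # stored under the single string key: 5 SUBSEP 12
    key = 5 SUBSEP 12
    if (key in grid)     print "same element:", grid[key]
    if ((5, 12) in grid) print "tuple test works"
    split(key, parts, SUBSEP)              # recover the individual indices from the key
    print parts[1], parts[2]
}')
echo "$array_out"
```

Because the key is one string, a "two-dimensional" array is still a single associative array, and for (k in grid) iterates over the combined keys.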
AWK Demonstration
example-03.awk, example-04.awk
AWK Example - Arrays
BEGIN {}
{
    letters[$4]++;    # count how often each value of field 4 occurs
}
END {
    for (var in letters)    # iteration order is unspecified
        print var, "exists", letters[var], "times."
    if ("A" in letters)
        print "A exists"
    else
        print "A does not exist"
}
Proposed Literature
AWK scripts: https://github.com/data-science-course/lectures/tree/master/awk
A. D. Robbins. Gawk: Effective AWK Programming. Free Software Foundation, Inc., 4.1 edition, April 2014.
How to read the user guide:
• Fast reading: Chapters 1-10
• Practical examples: Chapter 11
What is next?
• SQL and relational databases
• Plotting and visualizing data