awk
awk:
awk is a widely used utility for text manipulation in text-based files. It can do a lot of complicated pattern matching, stripping, etc. that would require a bunch of code in any other scripting language. awk and sed are very commonly used together and between them handle almost all (99.9%) of the text manipulation encountered in real life.
awk is a full scripting language, though it's most suited for text manipulation. awk was named after the initials of the people who wrote it (Aho, Weinberger and Kernighan). Now we have the GNU implementation of awk called gawk. gawk is the one installed on most Linux systems, and awk is actually a soft link to gawk.
$ awk --version => shows "GNU Awk 4.0.2"
Good tutorial on awk: https://www.howtogeek.com/562941/how-to-use-the-awk-command-on-linux/
Syntax:
awk syntax is extremely simple. The entire awk program is enclosed in single quotes ('). It has rules, each comprised of a pattern and an action.
- Patterns: A pattern selects the lines of text the rule applies to. If no pattern is provided, the rule works on every line of text since all lines match.
- Action: The action is enclosed in curly braces { } and is executed on the text that matches the pattern.
- Rule: Together, a pattern and an action form a rule.
- Fields: Fields are separate regions within a line. We define fields since that's the only way we can extract certain data within a piece of text. By default, awk considers a field to be a string of characters surrounded by whitespace, the start of a line, or the end of a line. Fields are identified by a dollar sign ($) and a number. So, $1 represents the first field, $2 the 2nd field, and $NF the last field (NF stands for number of fields). $0 represents the full line from start to end. If we want to change the input field separator to something else, we use the -F (separator string) option to tell awk to use that (i.e. -F: uses colon (:) as the separator to figure out fields in the input file). NOTE: no space b/w -F and the separator. Also, this -F option goes before we write the pgm in quotes. ex: awk -F: '{print $2}'. We can also set the built-in variable FS (input field separator) inside the pgm, but it's safest to do that in a BEGIN block, i.e. awk 'BEGIN {FS=":"} {print $2}'. Setting it in a regular rule, as in awk 'FS=":" {print $2}', only takes effect from the 2nd line onward, since the first line has already been split with the default FS by the time the assignment runs.
- awk works on an input text file and can show output on screen using the "print" function, or redirect the output to another file. Just as the input file has fields, the output also has a field separator, which is a space by default. To change the separator to something else, we use OFS=":" (output field separator is a :). This is set within the quotes (i.e. as part of the awk pgm), and it only applies to print statements that come after the assignment and that join fields with commas. ex: awk 'OFS=":" {print $1,$2}' input.txt (see the small sketch below).
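A minimal sketch tying fields, -F/FS and OFS together (it uses /etc/passwd, which is colon-separated; the choice of fields is just for illustration):
- awk -F: '{print $1,$7}' /etc/passwd => prints the user name and login shell of each account, separated by a space (the default OFS)
- awk 'BEGIN {FS=":"; OFS="-"} {print $1,$7}' /etc/passwd => same two fields, but FS is set in a BEGIN block and the output fields are joined with a "-"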
A few more awk cmds:
- print: the print function prints strings by enclosing them in double quotes, i.e. print "my name is". To print a variable, we don't use quotes (and note that awk variables, unlike fields, have no $ prefix). ex: {print "my name is"} {print name}, or in one statement as {print "my name is", name}. If name is put within the quotes, then it's treated as a literal and is printed as is => "my name is name". The comma is optional, however we put it as it automatically inserts a space (the OFS). See the ex below (and the small sketch after them):
- print specific field of certain cmd:
- ls -al | awk '{print "3rd field" $3}' => This takes the o/p of the ls -al cmd and prints "3rd field" followed by the third field of each line, which is the owner of the file. With no comma, there is no space b/w the text and $3. To get a space, either add a space within the text as "3rd field " or add a comma.
- awk '{print $1 $4}' input.txt => This prints field 1 and field 4 of each line in file input.txt. However, the printed fields will have no space b/w them, so they appear as one word. The output field separator is only inserted when the fields are separated by a comma, i.e. {print $1,$4}. To change it to something else, use OFS=":", i.e. awk 'OFS=":" {print $1,$4}' input.txt
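A minimal sketch of the literal vs variable point from the "print" bullet above (the echoed input and the var name are made up for illustration):
- echo "john 25" | awk '{name=$1; print "my name is", name}' => prints "my name is john"
- echo "john 25" | awk '{name=$1; print "my name is name"}' => prints "my name is name", since inside the double quotes name is just a literal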
- convert field to number: When we get a certain field from a file, let's say $3, it's still grabbed as a string. So, if we try to use that value as a number outside of awk, it can error out, complaining it's not numeric. In such cases, add 0 to the field to automatically convert it to a number.
- set num1 = `zgrep "Number" *.report.rpt.gz | awk -F" " '{print $5+0}'`; set sum = `echo "$num1 + $num2" | bc -l` => In this, if we just use "print $5" without adding a 0, then that field is returned as a string. Adding it to num2 gives an error. But if we add 0, then num1 is automatically cast to an integer/float and then the summing works. See the small sketch below.
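A minimal sketch of the +0 coercion (the echoed input is made up for illustration; note how awk drops the trailing non-numeric characters when converting):
- echo "size 12kB" | awk '{print $2}' => prints "12kB" (a string, which bc would choke on)
- echo "size 12kB" | awk '{print $2+0}' => prints "12" (a number, safe to feed to bc)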
- BEGIN/END rules: A BEGIN rule is executed once before any text processing starts. In fact, it's executed before awk even reads any text. An END rule is executed after all processing has completed. You can have multiple BEGIN and END rules, and they'll execute in order. We can have as many lines of cmd as we want in BEGIN and END by enclosing them in {}, and they will execute only once at the beginning and end of the awk program.
- awk as script: We can also write a full script in awk, especially when the pgm gets big. # are comments. The first line needs to be #!/usr/bin/awk -f. See the ex on the link above, and the small sketch below.
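A minimal sketch of a standalone awk script, assuming a hypothetical file named count_users.awk (the file name and the UID >= 1000 cutoff are made up for illustration):
#!/usr/bin/awk -f
# count accounts in /etc/passwd whose 3rd field (UID) is >= 1000
BEGIN { FS = ":"; count = 0 }           # BEGIN rule: runs once, before any input is read
$3 >= 1000 { count++ }                  # rule = pattern ($3 >= 1000) + action ({count++})
END { print "regular users:", count }   # END rule: runs once, after all input is processed
Make it executable with chmod +x count_users.awk, then run it as ./count_users.awk /etc/passwd.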
A few examples of awk:
- pattern matching and conditions: We can add specific conditions:
- awk -F: 'BEGIN {print "START"; OFS=":"} $3 >= 1000 {print $1,$6} END {print "DONE"}' /etc/passwd => This prints "START" once, then the 1st and 6th fields (user name and home dir) of only those lines whose 3rd field (the UID) is >= 1000, and "DONE" at the end.
- awk '/^UUID/ {print $1}' /etc/fstab => This searches for pattern UUID at start of line, and all lines that have that, have their first field printed.
- To extract all lines between given markers => This is a very common use case, where we want to use scripts to grep for text b/w markers (see the illustration after the two cmds):
- awk '/## START OF MARKER ##/{a=1} a {print} /## END OF MARKER ##/{a=0}' /home/my.rpt > ~/extracted_marker.txt => extracted_marker.txt gets all text b/w "START" and "END" (the marker lines included). "a" is just a var (can be any var name). The {print} action is optional, since a bare pattern like "a" prints the matching line by default.
- awk '/## START OF MARKER ##/{flag=1} flag; /## END OF MARKER ##/{flag=0}' /home/my.rpt > ~/extracted_marker.txt => alternative way of writing the above cmd. The bare pattern "flag" prints lines while flag is set; it's a shortcut for {if (flag==1) print}.
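For illustration, suppose /home/my.rpt contains these made-up lines:
line A
## START OF MARKER ##
line B
## END OF MARKER ##
line C
Then either cmd above writes the START marker line, line B and the END marker line to extracted_marker.txt, and skips line A and line C.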
Substitute text between specific markers with contents from another file: This is not easy. We saw how to do this in sed, but that sed cmd doesn't work in any shell besides bash. So, we have an alternative awk cmd that achieves the same functionality.
awk 'BEGIN {flag=0;} /START of pattern/,/END of pattern/ { if (flag == 0) { system("cat replacement.txt"); flag=1} next } 1' original_file.txt > modified_file.txt => Here we grep for markers in original_file.txt and replace all the text b/w those markers (including the markers themselves) with text in replacement.txt. We save the resulting modified file as modified_file.txt (instead of saving as original_file.txt)
However, if our patterns are stored in vars, or the file names are in vars, then we have to get those vars outside of the single quotes (as anything inside single quotes is left for awk to evaluate and is not expanded by the shell). When we put them in double quotes, the shell expands them, and awk just sees the concatenated result when running the cmd.
ex: awk 'BEGIN {flag=0;} /## START OF '"$pat1"' ##/,/## END OF '"$pat2"' ##/ { if (flag == 0) { system("cat /home/'"$name"'_replacement.txt"); flag=1} next } 1' original_file.txt > modified_file.txt
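An alternative sketch, not from the notes above, is to pass the shell vars into awk with -v and match them with the ~ operator instead of splicing quotes (this assumes the patterns contain no regex metacharacters that would need escaping):
awk -v start="## START OF $pat1 ##" -v end="## END OF $pat2 ##" -v repl="/home/${name}_replacement.txt" 'BEGIN {flag=0} $0 ~ start, $0 ~ end { if (flag == 0) { system("cat " repl); flag=1 } next } 1' original_file.txt > modified_file.txt
Here the -v values are in double quotes, so the shell expands $pat1, $pat2 and $name before awk ever sees them, and the awk pgm itself stays inside plain single quotes.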