Book Image

Learning AWK Programming

By : Shiwang Kalkhanda
5 (1)
Book Image

Learning AWK Programming

5 (1)
By: Shiwang Kalkhanda

Overview of this book

AWK is one of the most primitive and powerful utilities which exists in all Unix and Unix-like distributions. It is used as a command-line utility when performing a basic text-processing operation, and as programming language when dealing with complex text-processing and mining tasks. With this book, you will have the required expertise to practice advanced AWK programming in real-life examples. The book starts off with an introduction to AWK essentials. You will then be introduced to regular expressions, AWK variables and constants, arrays and AWK functions and more. The book then delves deeper into more complex tasks, such as printing formatted output in AWK, control flow statements, GNU's implementation of AWK covering the advanced features of GNU AWK, such as network communication, debugging, and inter-process communication in the GAWK programming language which is not easily possible with AWK. By the end of this book, the reader will have worked on the practical implementation of text processing and pattern matching using AWK to perform routine tasks.
Table of Contents (11 chapters)

AWK standard options

In this section, we discuss the three standard options that are available with all versions of AWK and other GAWK-supported options or GNU extensions. Each option in AWK begins with a dash and consists of a single character. GNU-style long options are also supported, which consist of two dashes (- -) followed by a keyword, which is the full form of an abbreviated option to uniquely identify it. If the option takes an argument, it is either immediately followed by the = sign and an argument value or a keyword, and the argument's value separated by whitespace. If an option with a value is given more than once, its last value is used.

Standard command-line options

AWK supports the following standard options, which can be provided in a long or short form interchangeably from the command line.

The -F option – field separator

By default, fields in the input record are separated by any number of spaces or tabs. This behavior can be altered by using the -F option or the FS variable. The value of the field separator can be either a single character or a regular expression. It controls the way AWK splits an input record into fields:

-Ffs

--field-separator

In the next example, we illustrate the use of the -F option. Here we used the -F to print the list of usernames that has been assigned to the bash shell from the /etc/passwd file. This file contains userdetails separated by a colon (:):

$ awk -F: '/bash/ { print $1 }' /etc/passwd 

This can also be performed as follows:


$ awk --field-separator=: '/bash/ { print $1 }' /etc/passwd

The output on execution of this code is as follows:

at
bin
daemon
ftp
games
lp
man
news
nobody
...................
...................

The -f option (read source file)

This option is used to read the program source from the source file instead of in the first non-option argument. If this option is given multiple times, the program consists of the concatenation of the contents of each specified source file:

-f source-file

--file=source-file

First, we will create 2 programs to print line number 2 and line number 4, respectively. Then, we will use the -f option to source those files for execution with the interpreter, as follows:

$ vi f1.awk
NR==2 { print NR, $0 }
$ vi f2.
NR==4 { print NR, $0 }

Now, first use only f1. for sourcing:

$ awk -f f1.awk  cars.dat 

This can also be performed as follows:

awk --file=f1.awk cars.dat

The output on execution of this code is as follows:

2 honda           city        2005        60000       3

Now, we will source both the files together. AWK will concatenate the contents of the two sourced files and execute them on the cars.dat filename, as follows:

$ awk -f f1.awk -f f2.awk cars.dat

This can also be performed as follows:

$ awk --file=f1.awk --file=f2.awk  cars.dat

The output on execution of this code is as follows:

2 honda           city        2005        60000       3
4 chevy beat 2005 33000 2

The -v option (assigning variables)

This option assigns a value to a variable before the program executes. Such variable values are available inside the BEGIN block. The -v option can only set one variable at a time, but it can be used more than once, setting another variable each time:

-v var=val

--assign var=val

The following example describes the usage of the -v option:

$ awk -v name=Jagmohon 'BEGIN{ printf "Name = %s\n", name }'

This can also be performed as follows:

$ awk --assign=Jagmohan 'BEGIN{ printf "Name = %s\n", name }'

The output on execution of this code is as follows:

Name = Jagmohan

Here is a multiple-value assignment example:

$ awk -v name=Jagmohon -v age=42 'BEGIN{ printf "Name = %s\nAge = %s\n", name, age }'

The output on execution of this code is:

Name = Jagmohan
Age = 42

GAWK-only options

Till now, we have discussed standard POSIX options. In the following section, we will discuss some important GNU extension options of GAWK.

The --dump-variables option (AWK global variables)

This option is used to print a sorted list of global variables, their types, and final values to file. By default, it prints this list to a file named awkvars.out in the current directory. It is good to have a list of all global variables to avoid errors that are created by using the same name function in your programs. The following is the command to print the list in the default file:

-d[file]
--dump-variables[=file]
$ awk --dump-variables  ' ' 

This can also be performed as follows:

 $ awk   -d   ' '

On execution of this command, we will have a file with the name awkvars.out in our current working directory, which has the following contents:

$ cat awkvars.out
ARGC: 1
ARGIND: 0
ARGV: array, 1 elements
BINMODE: 0
CONVFMT: "%.6g"
ENVIRON: array, 99 elements
ERRNO: ""
FIELDWIDTHS: ""
FILENAME: ""
FNR: 0
FPAT: "[^[:space:]]+"
FS: " "
IGNORECASE: 0
LINT: 0
NF: 0
NR: 0
OFMT: "%.6g"
OFS: " "
ORS: "\n"
PREC: 53
PROCINFO: array, 15 elements
RLENGTH: 0
ROUNDMODE: "N"
RS: "\n"
RSTART: 0
RT: ""
SUBSEP: "\034"
TEXTDOMAIN: "messages"

The --profile option (profiling)

This option enables the profiling of AWK programs, that is, it generates a pretty-printed version of the program in a file. By default, the profile is created in a file named awkprof.out. The optional file argument allows you to specify a different filename for the profile file. No space is allowed between -p and the filename, if a filename is supplied:

-p[file]
--profile[=file]

The profile file created contains execution counts for each statement in the program in the left margin, and function call counts for each function. In the next example, we will create a file with a name sample and redirect the output of the AWK command to /dev/null:

$ awk --profile=sample \
'BEGIN { print "**header**" }
{ print }
END{ print "**footer**" }' cars.dat > /dev/null

This same action can also be performed as follows:

$ awk -psample \
'BEGIN { print "**header**" }
{ print }
END{ print "**footer**" }' cars.dat > /dev/null

To view the content of profile, we execute the cat command, as follows:

$ cat sample
# gawk profile, created Thu Sep 14 17:20:27 2017

# BEGIN rule(s)

BEGIN {
1 print "**header**"
}

# Rule(s)

12 {
12 print $0
}

# END rule(s)

END {
1 print "**footer**"
}

The –pretty-print option: It is the same profiling option discussed in the preceding section:

-o[file]
--pretty-print[=file]

The --sandbox option

This option disables the execution of the system() function, which can execute shell commands supplied as an expression to AWK. It also disables the input redirections with getline, output redirections with print and printf, and dynamic extensions. This is very useful when you want to run AWK scripts from questionable/untrusted sources and need to make sure the scripts can't access your system (other than the specified input data file):

-S

--sandbox

In the following example, we first execute the echo command within the system function without the --sandbox option, and then again with the --sandbox option to see the difference:

$ awk 'BEGIN { system("echo hello") }'

The preceding AWK command executes the echo hello command using the system function and returns a 0 value to the system upon successful execution. The output on execution of the preceding command is:

hello

Now, we use the --sandbox option with the AWK command to disable the execution of the echo hello shell command using the system function of AWK. In the next example, the system function will not execute as we have used the --sandbox option while executing it:

$ awk --sandbox 'BEGIN{ system("echo hello")}' 

The output on execution of the preceding command is:

awk: cmd. line:1: fatal: 'system' function not allowed in sandbox mode

The -i option (including other files in your program)

This option is equivalent to the @include directive, which is used to source a file in the current AWK program. However, it is different from the -f option in two aspects. First, when we use the -i option, the program sourced is not loaded if it has been previously loaded, whereas -f always loads a file. The second difference is this after processing an -i argument, GAWK still expects to find the main source code via the -f option or on the command line:

-i source-file
--include source-file

In the next example, we will use the f1.awk and f2.awk files we created earlier to describe how the -i option works:

$ awk  -i  f1.awk  'NR==5 { print NR,  $0 }'  cars.dat

The output on execution of the given code is:

2 honda           city        2005        60000       3
5 honda city 2010 33000 6

Now, we are using the -i option to include the f1.awk file inside the -f option to execute f2.awk, as follows:

$ awk  -i  f1.awk   -f   f2.awk    cars.dat

The output on execution of the preceding code is:

2 honda           city        2005        60000       3
4 chevy beat 2005 33000 2

The next example shows it is mandatory to specify the AWK command or main source file using the -f option for executing a program with the -i option:

$ awk  -i  f1.awk cars.dat

The output on execution of the code is:

awk: cmd. line:1: cars.dat
awk: cmd. line:1: ^ syntax error

Include other files in the GAWK program (using @include)

This is a feature that is specific to GAWK. The @include keyword can be used to read external AWK source files and load in your running program before execution. Using this feature, we can split large AWK source files into smaller programs and also reuse common AWK code from various AWK scripts. It is useful for grouping together various AWK functions in a file and creating function libraries, using the @include keyword and including it in your program. It is similar to the -i option, which we discussed earlier.

It is important to note that the filename needs to be a literal string constant in double quotes.

The following example illustrates this. We'll create two AWK scripts, inc1 and inc2. Here is the inc1 script:

$ vi inc1
BEGIN { print "This is inc1." }

And now we create inc2, which includes inc1, using the @include keyword:

$ vi inc2
@include "inc1"
BEGIN { print "This is inc2." }

$ gawk -f inc2

On executing GAWK with inc2, we get the following output:

This is inc1.
This is inc2.

So, to include external AWK source files, we just use @include followed by the name of the file to be included, enclosed in double quotes. The files to be included may be nested. For example, create a third script, inc3, that will include inc2:

$ vi inc3
@include "inc2"
BEGIN { print "This is inc3." }

$ gawk -f inc3

On execution of GAWK with the inc3 script, we get the following results:

This is inc1.
This is inc2.
This is inc3.

The filename can be a relative path or an absolute path. For example:

@include "/home/usr/scripts/inc1"

This can also be performed as follows:

@include "../inc1"

Since AWK has the ability to specify multiple -f options on the command line, the @include mechanism is not strictly necessary. However, the @include keyword can help you in constructing self-contained GAWK programs, and eliminates the need to write complex command lines repetitively.

The -V option

This option displays the version information for a running copy of GAWK as well as license info. This allows you to determine whether your copy of GAWK is up to date with respect to whatever the Free Software Foundation (FSF) is currently distributing. This can be done as shown in the following code block:

$ awk -V

It can also be performed as follows:

$ awk --version

The output on execution of the preceding code is: