Tuesday, November 3, 2015

AWK command

Introduction :

awk is a very powerful Unix command that makes it easy to read through a file and manipulate its contents. awk takes its input either from the console (standard input) or from a file named on the command line.

Syntax :

The syntax to run awk on the command line is as follows

awk ' BEGIN { } { } END { } ' <FILENAME>

The BEGIN and END blocks are optional. The BEGIN block runs once before awk starts reading the file, the middle block runs once for every record (line), and the END block runs once after all the records have been processed.

We can simply print the first column of a file using the below command

awk ' { print $1 } ' sample.txt

The file sample.txt is read and the first column of every line is printed. By default, awk treats any run of spaces or tabs as the field separator.

Let us assume that we have a file sample.txt with the following records

name address phone salary
abc xcheufhe 12121212 10000
xyz fmmrkfkkr 2323254 1000000
cns dfffggggggg 123454545 3999
sdsds dkdkwdjwej 16767676 5000


Now let's run some basic awk commands on it and compare the outputs

1. awk ' { print $1 } ' sample.txt

The output will be as follows

name
abc
xyz
cns
sdsds






2.awk ' BEGIN { print "start" } { print $1 } END { print "done" } ' sample.txt

Output will be

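start
name
abc
xyz
cns
sdsds
done

This shows what BEGIN and END add to the command: "start" is printed once before the first record is read, and "done" is printed once after the last record has been processed.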
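
As a slightly more practical sketch of the same pattern, BEGIN can print a heading, the main block can accumulate a value, and END can report it. The temporary file and the variable names (sample, total, report) are only illustrative; the data is the sample table from above.

```shell
# Recreate the sample table in a temporary file (the path is illustrative).
sample=$(mktemp)
cat > "$sample" <<'EOF'
name address phone salary
abc xcheufhe 12121212 10000
xyz fmmrkfkkr 2323254 1000000
cns dfffggggggg 123454545 3999
sdsds dkdkwdjwej 16767676 5000
EOF

# BEGIN runs once before any input, the middle block once per record,
# and END once after the last record. NR > 1 skips the header row.
report=$(awk 'BEGIN { print "start" }
              { if (NR > 1) total += $4 }
              END { print "total salary:", total; print "done" }' "$sample")
printf '%s\n' "$report"
rm -f "$sample"
```

The middle block prints nothing here, so the output is just the BEGIN line, the total, and the END line.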

3. awk ' { print $1"\t" $2 } ' sample.txt

Output will be

name    address   
abc    xcheufhe   
xyz    fmmrkfkkr   
cns    dfffggggggg   
sdsds    dkdkwdjwej

The \t here separates the two fields, name and address, with a tab. Without it the two fields are simply concatenated and the result is nameaddress (without a space).


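awk ' { $2=""; print $0 } ' sample.txt

Output will be

name  phone salary
abc  12121212 10000
xyz  2323254 1000000
cns  123454545 3999
sdsds  16767676 5000


We can see that all the columns except the 2nd column, i.e. address, are printed here. $0 holds the whole record, and assigning an empty string to $2 makes awk rebuild $0 with that field blank. The assignment has to be a statement of its own: writing print $2="" would print the assigned (empty) value on a separate line before print $0 runs. The separators on both sides of the blank field are kept, which is why there are two spaces after the first column.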
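
To make the effect of a field assignment easier to see, here is a small sketch on a single record fed through a pipe. It compares blanking a field with printing the wanted fields explicitly; the variable names blanked and picked are just for illustration.

```shell
# Blanking field 2 keeps its separators, so two spaces remain.
blanked=$(printf 'abc xcheufhe 12121212 10000\n' | awk '{ $2 = ""; print $0 }')
printf '%s\n' "$blanked"

# Listing the wanted fields avoids the doubled separator.
picked=$(printf 'abc xcheufhe 12121212 10000\n' | awk '{ print $1, $3, $4 }')
printf '%s\n' "$picked"
```

The second form is usually the simpler choice when the set of columns to keep is small and known in advance.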

4. awk ' { $2=$3=""; print $0 } ' sample.txt

Output will be

name   salary
abc   10000
xyz   1000000
cns   3999
sdsds   5000


Here columns 2 and 3, i.e. address and phone number, are excluded from the output.

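5. Suppose we have a huge file and we need to print a range of columns, say columns 2 to 6, from that file. We can use the below command

awk -v a=2 -v b=6 ' { for (i=a;i<=b;i++) print $i } ' sample.txt

The -v option assigns a variable before the program starts, so the variable can be used in the BEGIN block, the main block and the END block. Here we assign two variables, a and b, holding the first and last column of the range, and then use a for loop to print every column from 2 to 6, each on its own line. There must be no semicolon between the closing parenthesis of the for and the print; otherwise the loop body is empty and print runs only once, after the loop has finished.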
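
A loop that calls print once per field puts every field on its own line. To keep the selected fields of a record together on one line, one option is printf with OFS between fields and ORS after the last one. This is a sketch on a made-up seven-column record; the names c1 to c7 and the variable cols are placeholders.

```shell
# Print columns a..b of each record on a single line.
cols=$(printf 'c1 c2 c3 c4 c5 c6 c7\n' |
  awk -v a=2 -v b=6 '{
    for (i = a; i <= b; i++)
      printf "%s%s", $i, (i < b ? OFS : ORS)   # separator, or newline after the last field
  }')
printf '%s\n' "$cols"
```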


Built-in variables available with the awk command :

awk comes with several handy built-in variables. Let's go through the most commonly used ones one by one.

1. FS or input field separator :

By default awk splits each record into fields on runs of spaces and tabs.
If the file uses another delimiter, say a comma, we have to specify it explicitly with the -F option (or by setting FS) in the awk command

awk -F "," ' { print $1 } ' sample.txt
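
The -F option and an assignment to FS in a BEGIN block should behave the same way. The sketch below runs both on two made-up comma-separated lines; the variable names with_flag and with_fs are illustrative.

```shell
# Set the input field separator with the -F option.
with_flag=$(printf 'name,place,salary\naparna,cochin,1000\n' | awk -F "," '{ print $1 }')

# Set it by assigning FS before the first record is read.
with_fs=$(printf 'name,place,salary\naparna,cochin,1000\n' | awk 'BEGIN { FS = "," } { print $1 }')

printf '%s\n' "$with_flag"
printf '%s\n' "$with_fs"
```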


2. OFS or output field separator :

awk -F "," ' BEGIN { OFS="=";} { print $1,$2,$3; } ' sample.txt

This command reads the file as comma-separated values and prints columns 1, 2 and 3 separated by =. Note that OFS has no dedicated option like -F. It is usually set in a BEGIN block, although it can also be assigned on the command line with -v OFS="=". OFS is only inserted between the comma-separated arguments of print.
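
The following sketch sets OFS from the command line with -v instead of a BEGIN block, and also shows that OFS is only used when the print arguments are separated by commas. The input line is made up and the variable names joined and glued are illustrative.

```shell
# Commas in print insert OFS between the fields.
joined=$(printf 'aparna,cochin,trinityworld\n' | awk -F "," -v OFS="=" '{ print $1, $2, $3 }')
printf '%s\n' "$joined"

# Without commas the fields are concatenated and OFS is not used.
glued=$(printf 'aparna,cochin,trinityworld\n' | awk -F "," -v OFS="=" '{ print $1 $2 }')
printf '%s\n' "$glued"
```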

3. RS or record separator :

awk -F "," ' BEGIN { RS="\n"; OFS=":"; } { print $1,$2,$3} ' sample.txt

This command sets the record separator to a newline, which is the default, and the output field separator to a colon, while -F tells awk that the file is comma-delimited. So awk treats every line as one record and prints its first three fields joined by : . RS should be set in a BEGIN block (or with -v) so that it is in effect before the first record is read.
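
Since a newline is already the default, the effect of RS is easier to see with a different separator. This sketch assumes records separated by semicolons on a single line of made-up data, with no trailing newline so that the last record stays clean.

```shell
# Three records separated by ";", fields separated by ",".
records=$(printf 'a,b,c;d,e,f;g,h,i' |
  awk -F "," 'BEGIN { RS = ";"; OFS = ":" } { print $1, $2, $3 }')
printf '%s\n' "$records"
```

Each semicolon-separated chunk is handled as one record, so three output lines are produced from one input line.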

4. NR or number of records read so far :

NR holds the number of the record currently being processed. In the END block it holds the total number of records in the file.

awk ' BEGIN { print "stats" } { print "processing record -", NR } END { print NR, "records processed" } ' sample.txt

If there are 10 records in the file the output will be something like this

stats
processing record - 1
processing record - 2
processing record - 3
processing record - 4
processing record - 5
processing record - 6
processing record - 7
processing record - 8
processing record - 9
processing record - 10
10 records processed
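
A compact way to check this behaviour is to number the lines of a small input and print the final count. The three-line input and the variable name counts are made up for the sketch.

```shell
# NR grows by one per record; in END it holds the total.
counts=$(printf 'a\nb\nc\n' | awk '{ print NR ": " $0 } END { print NR, "records processed" }')
printf '%s\n' "$counts"
```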


5. NF or number of fields in a record :

NF holds the number of fields in the current record, so the below command prints the field count of every record in the file

awk -F "," ' { print NR , "=" ,NF } ' sample.txt

This reads the file as comma-delimited and prints, for each record, the record number followed by = and the number of fields in that record.

The output will be something like

1 = 5
2 = 5
3 = 5
4 = 5
5 = 0
6 = 0
7 = 0
8 = 0
9 = 0
10 = 0


This means that rows 1 to 4 have 5 fields each and the rest of the rows are empty.
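
Because an empty line has NF equal to 0, NF can also be used to count only the non-empty records. A small sketch on made-up input with one blank line in the middle; the variable names fields and nonempty are illustrative.

```shell
# Field count per record; the blank second line reports 0.
fields=$(printf 'a,b,c\n\nd,e\n' | awk -F "," '{ print NR, "=", NF }')
printf '%s\n' "$fields"

# Count only the records that contain at least one field.
nonempty=$(printf 'a,b,c\n\nd,e\n' | awk -F "," '{ if (NF > 0) n++ } END { print n }')
printf '%s\n' "$nonempty"
```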

6. FILENAME :

FILENAME holds the name of the file that is currently being read. Printing it in the main block prints the name once for every record

awk ' { print FILENAME } ' sample.txt

will print sample.txt 10 times since the file has 10 records.

awk -F "," ' BEGIN { OFS=":";} {print $0,FILENAME} ' sample.txt

This will print something like this 

name,place,address,phonenumber,salary:sample.txt
aparna,cochin,trinityworld,9037289898,1000:sample.txt
anjali,palakkad,infopark,9090909090,100000:sample.txt
anusha,banglore,electroncity,903456565,40000:sample.txt
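
FILENAME is most useful when awk is given more than one file, since it changes as awk moves from one file to the next. The sketch below uses two small throwaway files; the names one.txt and two.txt and the temporary directory are assumptions made for the example.

```shell
# Create two small files in a temporary directory.
dir=$(mktemp -d)
printf 'a\nb\n' > "$dir/one.txt"
printf 'c\n' > "$dir/two.txt"

# Prefix every line with the name of the file it came from.
names=$(cd "$dir" && awk '{ print FILENAME ": " $0 }' one.txt two.txt)
printf '%s\n' "$names"
rm -rf "$dir"
```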


Some simple AWK commands

1. Return the number of lines in a file :
awk ' END { print NR } ' <filename>

2. Print the odd lines in a file

awk ' { if (NR % 2 != 0) print $0 } ' <filename>

3. Print the even lines in a file

awk ' { if (NR % 2 == 0) print $0 } ' <filename>
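
Both line filters rely on the remainder of NR divided by 2. They can be tried on a five-line made-up input as follows; the variable names odd and even are illustrative.

```shell
# Lines 1, 3, 5: the line number leaves a remainder when divided by 2.
odd=$(printf 'l1\nl2\nl3\nl4\nl5\n' | awk '{ if (NR % 2 != 0) print $0 }')
printf '%s\n' "$odd"

# Lines 2, 4: the line number divides evenly by 2.
even=$(printf 'l1\nl2\nl3\nl4\nl5\n' | awk '{ if (NR % 2 == 0) print $0 }')
printf '%s\n' "$even"
```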

4. Print the length of the longest line in the file

awk ' { if (length($0) > max) max = length($0) } END { print max } ' <filename>

5. Print the longest line in the file

awk ' { if (length($0) > max) { max = length($0); line = $0 } } END { print line } ' <filename>
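
The length and the line itself can be tracked together by storing both whenever a longer line is seen. A sketch on three made-up lines, where max and line are names chosen for the example:

```shell
# Remember the longest length seen so far and the line that had it.
longest=$(printf 'ab\nabcdef\nabc\n' |
  awk '{ if (length($0) > max) { max = length($0); line = $0 } }
       END { print max ": " line }')
printf '%s\n' "$longest"
```

When several lines share the maximum length, this version keeps the first one, because the comparison uses > rather than >=.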

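Exit status :

If an awk command runs successfully the exit status will be 0. If an error occurs it will be greater than 0; the exact value depends on the awk implementation.
We can also set the exit code ourselves with the exit statement, for example exit 3. In that case the awk command exits with that code.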
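
The exit code can be inspected from the shell through $? right after awk returns. A minimal sketch, where the code 3 is an arbitrary value picked for the example:

```shell
# An explicit exit code set inside the awk program.
status=0
awk 'BEGIN { exit 3 }' || status=$?
echo "exit status: $status"

# A program that finishes normally returns 0.
ok=0
awk 'BEGIN { print "fine" }' > /dev/null || ok=$?
echo "exit status: $ok"
```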