Sunday 15 January 2017

Using AWK to match columns from multiple files

I came across an interesting requirement at a facebook forum today. The requirement was to match columns two & three of one file with columns one & two of another file & print the entities from the first file which do not have nay matches.

Here are the files:

[root@alive ~]# cat file1
d1,40,gold
d2,30,silver
d3,20,bronze
d4,10,iron
d5,5,wood
d6,20,gold
d7,10,wood
d8,5,gold
d9,10,silver
[root@alive ~]# cat file2
gold,40
silver,30
bronze,20
iron,10
wood,5

The AWK one liner that works is:

awk -F',' 'NR==FNR{c[$1$2]++;next};!c[$3$2]' file2 file1

The following will be the output of he above AWK statement:

d6,20,gold
d7,10,wood
d8,5,gold
d9,10,silver


Now, let's do a step by step breakdown of what just happened.

-F ',' 
(The file columns are comma separated. So we changed the field separator)

NR==FNR 
(When you have two input files to awk, FNR will reset back to 1on the first line of the next file whereas NR will continuing increment from where it left off. By checking FNR==NR we are essentially checking to see if we are currently parsing the first file.)
 
c[$1$2]++;next
(Here we are parsing the first file file2 creating an associative array with columns one & two & post increment the value by one. The next command tells AWK not to process any further commands and proceed to next record.)

!c[$3$2] 
(This line only executes when FNR==NR is false, i.e. we are not parsing file2 and thus must be parsing file1. We then use the first fields $2 and $3 of file1 as the key to index into our 'seen' list created earlier.  If the value returned is 0 it means we didn't see it in file1 and therefore we should print this line. Conversely, if the value is non-zero then we did see it in file1 and thus we should not print its value. Note that !c[$2$3] is equivalent to !a[$2$3]{print} because the default action when one is not given is to print the entire line.)

In case the requirement changes and we needed to print the matching lines instead of those that didn't match, we'd modify our AWK expression as follows:


awk -F',' 'NR==FNR{c[$1$2]++;next};c[$3$2]' file2 file1

And the resulting output will be:

d1,40,gold
d2,30,silver
d3,20,bronze
d4,10,iron
d5,5,wood

I hope this has been an interesting AWK read.

No comments:

Post a Comment

Using capture groups in grep in Linux

Introduction Let me start by saying that this article isn't about capture groups in grep per se. What we are going to do here with gr...