Sunday 11 December 2016

Regular expressions in Perl

In this article I discuss working with regular expressions in Perl. Less theory & more code, as usual.
A regular expression is a pattern that describes the text we want to find in a string, for example in the lines of a file.
In Perl, a pattern enclosed within forward slashes (//) is treated as a regular expression match, & the =~ operator binds the string on its left to the pattern on its right. The !~ operator is the negated form: it is true when the pattern does not match. More on that towards the end of the article.
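
To make the two operators concrete, here is a minimal sketch. The sample string and the variable names are my own assumptions, chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample string, used only for this illustration.
my $line = "a regex test";

# =~ is true when the pattern matches somewhere in the string.
my $has_test = ($line =~ /test/) ? 1 : 0;

# !~ is true when the pattern does NOT match anywhere in the string.
my $lacks_colon = ($line !~ /:/) ? 1 : 0;

print "contains 'test': $has_test\n";    # prints 1
print "has no colon:    $lacks_colon\n"; # prints 1
```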


Pattern matching:
In the first example, we open a text file via a file handle, read its lines into an array, then iterate over the array in a foreach loop & print every line that contains the word test.

This is the test file:

root@buntu:~# cat testfile
this is a test
still a test

a regex test

using regex test in arrays

c:\documnets and settings
c:\downloads

c:\program files

d:\tmpfiles

SahiLSuri708


This is the script:

root@buntu:~# cat reg1.pl
#!/usr/bin/perl -w

open (FH1, "testfile") || die "can't open file" ;

my @art = <FH1> ;

foreach (@art) {
        if (/test/) { print }
        }

This is the output from the script:

root@buntu:~# ./reg1.pl
this is a test
still a test
a regex test
using regex test in arrays
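
As a side note, the same loop can also be written with a lexical file handle and the three-argument form of open, reading one line at a time instead of loading the whole file into an array. This is only a sketch under my own assumptions: to keep it self-contained it reads from an in-memory string that stands in for testfile, and the variable names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the contents of testfile (assumption: a shortened copy).
my $data = "this is a test\nstill a test\n\nc:\\downloads\n";

# Three-argument open on a reference to a scalar reads from memory.
# For a real file, replace \$data with the file name.
open(my $fh, '<', \$data) or die "can't open in-memory file: $!";

my @matches;
while (my $line = <$fh>) {
    # Keep only the lines that contain the word test.
    push @matches, $line if $line =~ /test/;
}
close $fh;

print @matches;   # prints the two lines containing "test"
```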


Search & replace:
Building on the basic pattern match above, the next example searches for a pattern & replaces it with something else.

root@buntu:~# cat regex.pl
#!/usr/bin/perl -w

my $my_name = "sahil suri" ;

$my_name =~ s/^sahil/Sahil/ig ;
$my_name =~ s/suri$/Suri/ig ;

print "$my_name \n" ;


open (FH1, "testfile") || die "can't open file" ;

my @art = <FH1> ;

foreach (@art) {
        s/test\s/Perl Test /g ;
        print "$_ \n" ;
        }

In the first part of the script, I perform a search and replace on the value of the scalar variable $my_name. Here, s is the substitution operator, i makes the search case insensitive and g requests a global replacement, i.e. every occurrence of the match is replaced. If we do not mention g then only the first occurrence is replaced. The ^ and $ are called anchors. The ^ symbol matches at the beginning of the string and the $ matches at the end of the string (or just before a trailing newline).
In the second part of the script, I do a search & replace on each element of an array via a foreach loop. Each line of the file testfile is checked for the string test followed by a whitespace character, indicated by \s. Since \s also matches a newline, a line that ends in test matches as well. If a match is found, it is replaced by the string Perl Test followed by a space. The print statement that follows prints every line, whether or not it was changed.
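
The effect of the g modifier and of the ^ anchor is easier to see side by side. Here is a small sketch; the sample strings and variable names are my own assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample strings, used only for this illustration.
my $once  = "test one test two";
my $every = "test one test two";
my $name  = "sahil suri sahil";

# Without g, only the first occurrence is replaced.
$once =~ s/test/Perl/;

# With g, every occurrence is replaced; s///g returns the number of replacements.
my $count = ($every =~ s/test/Perl/g);

# ^ anchors the match to the start of the string, so even with g
# the second "sahil" is left alone.
$name =~ s/^sahil/Sahil/g;

print "$once\n";    # Perl one test two
print "$every\n";   # Perl one Perl two
print "$count\n";   # 2
print "$name\n";    # Sahil suri sahil
```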


Using grouping:
Grouping allows us to capture parts of a match & have them assigned to the variables $1, $2 and so on. Here are a couple of examples:

my $os = "Solaris 1 2 3 4" ;
$os =~ /(\s\d)(\s\d)(\s\d)(\s\d)/ ;

print "$1 $2 $3 $4 \n" ;

my $time = `date +%r` ;

$time =~ /(\d\d):(\d\d):(\d\d)\s(.*)/ ;

print "$1 $2 $4 \n" ;

In order to group selections we enclose them in parentheses, i.e. (). The \d matches a single digit, \s matches a single whitespace character & .* matches any number of characters other than a newline. In the first example, each group captures a space followed by a digit, so $1, $2, $3 & $4 hold " 1", " 2", " 3" & " 4" respectively. After that we print them.
In the second example, we capture the hours, minutes, seconds & AM/PM from the output of the date command & print the hours, minutes & AM/PM.
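
One thing worth keeping in mind is that $1, $2 and so on are only meaningful after a successful match. A match in list context returns the captured groups directly, which makes it easy to check for success first. The sketch below assumes a fixed sample string in the format printed by date +%r so that the result is reproducible; the variable names are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed sample in the 12-hour format produced by `date +%r`.
my $time = "09:41:07 PM";

# In list context the match returns the captured groups,
# or an empty list when the pattern does not match.
my ($hh, $mm, $ss, $ampm) = $time =~ /^(\d\d):(\d\d):(\d\d)\s(.*)$/;

if (defined $hh) {
    print "$hh $mm $ampm\n";   # 09 41 PM
} else {
    print "time string did not match\n";
}
```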

We can also use grouping to match alphanumeric strings like Sahil777. Given below is an example of such a match; it prints every line of testfile that contains at least one digit, which here is only SahiLSuri708.

open (FH1, "testfile") || die " couldn't open file" ;

my @abc = <FH1> ;

foreach (@abc) {
        if (/(.*)(\d)/) { print ; }
        }
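
It helps to look at what the two groups in that pattern actually capture. Because .* is greedy, it takes as much as it can and leaves only the last digit for \d. The sketch below shows this, along with one possible alternative that splits the string into its letters and its digits; \D (any non-digit character) and the variable names are my additions, not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "SahiLSuri708";

# Greedy .* grabs everything up to the last digit.
my ($greedy_rest, $last_digit) = $line =~ /(.*)(\d)/;

# \D+ matches one or more non-digits, \d+ one or more digits.
my ($letters, $digits) = $line =~ /(\D+)(\d+)/;

print "$greedy_rest $last_digit\n";   # SahiLSuri70 8
print "$letters $digits\n";           # SahiLSuri 708
```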


Character classes:
This type of regex matching is helpful when we want to match one character out of a set of characters. Here's an example:

my @ays = ("may", "say", "hay", "yay", "hurray", "whey" ) ;

foreach (@ays) {
        if (/[mshyr]ay/i)
                {
                  print "$_ \n" ;
                }
        }

The regex /[mshyr]ay/i matches any word that contains one of the characters m, s, h, y or r followed by the characters ay, ignoring case. Every word in the list except whey is printed.
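
A character class matches exactly one character, and the pattern above is not anchored, which is why hurray matches on its trailing ray. The following sketch contrasts the unanchored pattern with an anchored one on the same list; the use of grep and the variable names are my own choices for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @ays = ("may", "say", "hay", "yay", "hurray", "whey");

# Unanchored: the class plus "ay" may appear anywhere in the word.
my @anywhere = grep { /[mshyr]ay/i } @ays;

# Anchored: the whole word must be one class character followed by "ay".
my @whole = grep { /^[mshyr]ay$/i } @ays;

print "@anywhere\n";   # may say hay yay hurray
print "@whole\n";      # may say hay yay
```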


Alternates:
These are loosely equivalent to an OR match, i.e. if any of the alternatives matches, the script performs the intended action. Here's an example:

my @ors = ("solaris", "linux", "hp-ux", "aix", "debian", "ubuntu") ;

foreach (@ors) { if (/solaris|linux|aix/) { print "$_ \n" } }

The above example looks for the strings solaris, linux or aix in the array @ors and prints each element in which any of them is found.

In addition to the match shown above, we can also group alternates as shown in the following example:

foreach (@ors) { if (/(solaris|linux|aix)/) { print "$1 \n" } }
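
With the alternates grouped, $1 holds whichever alternative matched, which is not necessarily the whole element. Grouping also lets us anchor all the alternatives at once. Here is a sketch with a hypothetical list of names that I made up to show the difference.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical names, chosen so that some only contain an alternative.
my @names = ("solaris", "archlinux", "aix-7", "debian");

# Unanchored: $1 is the alternative found inside the element.
my @loose;
foreach my $name (@names) {
    push @loose, "$name matched $1" if $name =~ /(solaris|linux|aix)/;
}

# Anchored: the whole element must be exactly one of the alternatives.
my @exact = grep { /^(solaris|linux|aix)$/ } @names;

print "$_\n" for @loose;   # solaris, archlinux and aix-7 match
print "@exact\n";          # solaris
```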


Escaping special characters and omitting blank lines:
To discard the special meaning of a character we use the backslash to escape it. Here's an example:

open (FH2, "testfile") || die "can't open file" ;
my @art2 = <FH2> ;

foreach (@art2) {
        if (/(.:)(.*\\)(.*)/) {
                print "$1$2$3 \n" ;
        }
}

The file testfile, shown at the beginning of the article, has a few Windows-style path names like c:\documnets and settings. In the pattern above, \\ matches a literal backslash, so each path is captured in three pieces (the drive, the part up to the last backslash & the rest) & printed back out.
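
Escaping applies to every character that has a special meaning in a pattern, not just the backslash. The sketch below shows the backslash case again and then the difference between an unescaped and an escaped dot; the sample strings and variable names are assumptions of mine.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Single quotes keep the backslash in the path literal.
my $path = 'c:\program files';

# \\ in the pattern matches one literal backslash.
my ($drive, $rest) = $path =~ /^(.):\\(.*)$/;

# An unescaped dot matches any character; an escaped dot matches only a dot.
my $loose  = ("file-txt" =~ /file.txt/)  ? 1 : 0;
my $strict = ("file-txt" =~ /file\.txt/) ? 1 : 0;

print "$drive | $rest\n";   # c | program files
print "$loose $strict\n";   # 1 0
```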
To leave out empty lines when printing the file, use the . character, which matches any line having at least one character & so skips empty lines:

open (FH1, "testfile") || die " couldn't open file" ;

my @art2 = <FH1> ;

foreach (@art2) { if (/./) { print $_ ; } }
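
Note that /./ still matches a line that contains only spaces or tabs, because a space is a character too. If such lines should be skipped as well, one option is /\S/, which requires at least one non-whitespace character. A small sketch with made-up lines:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical lines: one empty, one containing only spaces.
my @lines = ("first\n", "\n", "   \n", "last\n");

# /./ needs any character other than a newline, so spaces count.
my @non_empty = grep { /./ } @lines;

# /\S/ needs at least one non-whitespace character.
my @non_blank = grep { /\S/ } @lines;

print scalar(@non_empty), "\n";   # 3
print scalar(@non_blank), "\n";   # 2
```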


Negative matches:
We may often find ourselves in a situation wherein we want to exclude the lines matching a certain pattern & print the rest of the file. We can accomplish this via a couple of methods.
The first one is using something called a negative lookahead, which is written (?!...). Given below is an example:

open (FH1, "testfile") || die " couldn't open file" ;

my @abc = <FH1> ;

foreach (@abc) {
        if (/^(?!c:)/ and /./) {
                print ;
        }
}

This code prints every line from the file that does not begin with c: and is not empty.
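
The two tests in that if statement can also be folded into a single pattern by placing a . right after the lookahead. This is just one possible variation, sketched here on a few made-up lines rather than on testfile itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical lines resembling the ones in testfile.
my @lines = ("c:\\downloads\n", "d:\\tmpfiles\n", "\n", "a regex test\n");

# (?!c:) asserts that the line does not start with c: without consuming
# anything; the following . then requires at least one character.
my @kept = grep { /^(?!c:)./ } @lines;

print @kept;   # prints the d:\tmpfiles line and the "a regex test" line
```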

The second method to do a negative regex match is using !~, as illustrated in the example below:

open (FH1, "testfile") || die " couldn't open file" ;

my @abc = <FH1> ;

foreach (@abc) {
        #if ($_ !~ /[:]/ && /./) { print }
        if ($_ !~ /(test)/ && /./) { print }
        }

The above code skips any lines that contain the word test and any lines that are empty. The commented out if statement skips any lines that contain a colon and any empty lines.
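
When the string being tested is $_, writing $_ !~ /pattern/ is equivalent to negating a plain match with !. The sketch below checks that both forms select the same lines; the sample lines and variable names are my own assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical lines resembling the ones in testfile.
my @lines = ("this is a test\n", "\n", "c:\\downloads\n", "SahiLSuri708\n");

# Explicit negative match against $_.
my @with_not_tilde = grep { $_ !~ /test/ && /./ } @lines;

# Same condition, written as a negated match on the implicit $_.
my @with_bang = grep { !/test/ && /./ } @lines;

print @with_not_tilde;   # prints the c:\downloads and SahiLSuri708 lines
```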
