Regular Expressions

Regular expressions are too big a topic to introduce here, but make sure that you understand these concepts. For tutorials, see perlrequick or perlretut. For the definitive documentation, see perlre.

Matches and replacements return a quantity.

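In scalar context, the m// operator returns true if the pattern matched and false if it did not, and the s/// operator returns the number of replacements it made (a false value if there were none). You can either use the value directly, or check it for truth.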

    if ( $str =~ /Diggle|Shelley/ ) {
        print "We found Pete or Steve!\n";
    }
 
    if ( my $n = ($str =~ s/this/that/g) ) {
        print qq{Replaced $n occurrence(s) of "this"\n};
    }
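
To make the return values concrete, here is a small sketch. The string and variable names are only illustrative. It shows a match used as a true/false test and s///g used as a count.

```perl
use strict;
use warnings;

my $str = 'this and this and this';

# A match in scalar context is simply true or false.
my $matched = ( $str =~ /this/ ) ? 'yes' : 'no';

# s///g returns how many substitutions were made.
my $replaced = ( $str =~ s/this/that/g );

# A substitution that changes nothing returns a false value.
my $none = ( $str =~ s/missing/found/g ) ? 'true' : 'false';

print "matched=$matched replaced=$replaced none=$none\n";
# prints: matched=yes replaced=3 none=false
```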

Don't use capture variables without checking that the match succeeded.

The capture variables ($1, $2, etc.) are only valid if the match succeeded. A failed match does not clear them, so they still hold the values from the last successful match.

    # BAD: Not checked, but at least it "works".
    my $str = 'Perl 101 rocks.';
    $str =~ /(\d+)/;
    print "Number: $1"; # Prints "Number: 101";
 
    # WORSE: Not checked, and the result is not what you'd expect
    $str =~ /(Python|Ruby)/;
    print "Language: $1"; # Prints "Language: 101";

Instead, you must check the return value from the match:

    # GOOD: Check the results
    my $str = 'Perl 101 rocks.';
    if ( $str =~ /(\d+)/ ) {
        print "Number: $1"; # Prints "Number: 101";
    }
 
    if ( $str =~ /(Python|Ruby)/ ) {
        print "Language: $1"; # Never gets here
    }

In list context, m// returns a list of the captured substrings, or, with /g and no capturing parentheses, a list of all the matches.
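
A minimal sketch of matching in list context follows. The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $date = '2024-05-17';

# With capture groups, list context returns the captured substrings.
my ( $year, $month, $day ) = $date =~ /(\d+)-(\d+)-(\d+)/;
print "$year/$month/$day\n";    # prints "2024/05/17"

# With /g and no capture groups, it returns every match.
my @numbers = 'a1 b22 c333' =~ /\d+/g;
print "@numbers\n";             # prints "1 22 333"

# A failed match returns an empty list, so the assignment tests false.
if ( my ($lang) = $date =~ /(Python|Ruby)/ ) {
    print "Language: $lang\n";
}
else {
    print "No language found\n";    # this branch runs
}
```

Assigning the match to a list of lexicals also avoids the stale $1 problem, because a failed match leaves those variables undefined.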

Common match flags

/i - case-insensitive match
/g - match multiple times

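    my $var = "match match match";

    # Use a named counter rather than $a, which sort() treats specially.
    my $count = 0;
    while ($var =~ /match/g) { $count++; }
    print "$count\n"; # prints 3

    $count = 0;
    $count++ foreach ($var =~ /match/g);
    print "$count\n"; # prints 3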

/m - ^ and $ change meaning
- Ordinarily, ^ means "start of string" and $, "end of string"
- /m makes them mean start and end of line, respectively

    $str = "one\ntwo\nthree";
    @a = $str =~ /^\w+/g;  # @a = ("one");
    @b = $str =~ /^\w+/gm; # @b = ("one","two","three")

- Use \A and \z for start and end of string regardless of /m
- \Z is the same as \z, except that it also matches just before a final newline

/s - . also matches newline

    $str = "one\ntwo\nthree\n";
    $str =~ /^(.{8})/s;
    print $1; # prints "one\ntwo\n"
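
The anchors can be compared side by side with a short sketch. The test string and the printed labels are arbitrary.

```perl
use strict;
use warnings;

my $str = "one\ntwo\n";

# Under /m, $ can match before any newline, so the end of line 1 counts.
print "m-dollar\n" if $str =~ /one$/m;                # prints

# \z matches only at the very end of the string, after the final newline.
print "z-after-newline\n"  if $str =~ /two\n\z/;      # prints
print "z-before-newline\n" if $str =~ /two\z/;        # does not print

# \Z also matches just before a trailing newline.
print "Z-before-newline\n" if $str =~ /two\Z/;        # prints

# \A matches only at the start of the string, even under /m.
print "A-two\n"     if $str =~ /\Atwo/m;              # does not print
print "caret-two\n" if $str =~ /^two/m;               # prints
```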

Capture variables $1 and friends

Each set of capturing parentheses is stored in a numbered variable.
Parentheses are numbered left to right, by the position of the opening parenthesis:

    my $str = "abc";
    $str =~ /(((a)(b))(c))/;
    print "1: $1 2: $2 3: $3 4: $4 5: $5\n";
    # prints: 1: abc 2: ab 3: a 4: b 5: c

There is no upper limit on the number of capturing parentheses and variables.

Avoid capture with ?:

If an opening parenthesis is followed by ?:, the group will not be captured.
This is useful when you need grouping but don't want the match to be saved.

    my $str = "abc";
    $str =~ /(?:a(b)c)/;
    print "$1\n"; # prints "b"
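
A typical use is grouping an alternation without shifting the capture numbers. In this sketch the input string is invented for the example.

```perl
use strict;
use warnings;

my $str = 'color: red';

# The alternation needs grouping, but its text is not wanted,
# so (?:...) keeps the value we care about in $1.
if ( $str =~ /(?:color|colour):\s*(\w+)/ ) {
    print "Value: $1\n";                    # prints "Value: red"
}

# With a capturing group instead, the value moves to $2.
if ( $str =~ /(color|colour):\s*(\w+)/ ) {
    print "Spelling: $1, value: $2\n";      # prints "Spelling: color, value: red"
}
```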

Allow easier reading with the /x switch

If you're doing something tricky with a regex, comment it.
You can do this with the /x flag.
This ugly behemoth

    my ($num) = $ARGV[0] =~ m/^\+?((?:(?<!\+)-)?(?:\d*\.)?\d+)$/x;

is more readable with whitespace and comments, as allowed by the /x flag.

    my ($num) =
        $ARGV[0] =~ m/^ \+?           # An optional plus sign, to be discarded
                    (                 # Capture...
                    (?:(?<!\+)-)?     # a negative sign, if there's no plus behind it,
                    (?:\d*\.)?        # optional digits followed by a decimal point,
                    \d+               # then one or more digits.
                    )$/x;

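Under /x, whitespace in the pattern is ignored and # starts a comment. To match a literal space or #, escape it with a backslash or put it in a character class.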
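
The following sketch shows a literal space inside a /x pattern. The name being matched is just sample data.

```perl
use strict;
use warnings;

my $name = 'Ada Lovelace';

if ( $name =~ / \A (\w+)   # first name
                \          # an escaped, literal space
                (\w+) \z   # last name
              /x ) {
    print "First: $1, last: $2\n";    # prints "First: Ada, last: Lovelace"
}

# Unescaped, the space in the pattern is ignored, so this matches "AdaLovelace".
print "unescaped space ignored\n" if 'AdaLovelace' =~ / \A Ada Lovelace \z /x;
```

Keep the pattern delimiter out of the comments: a / inside a comment would end an m/.../x pattern early.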

Automatically quote your regexes with \Q and \E

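Everything between \Q and \E has its regex metacharacters escaped automatically.
Variables are still interpolated first, so a literal $ or @ written in the pattern is not escaped.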

    my $num = '3.1415';
    print "ok 1\n" if $num =~ /\Q3.14\E/;
    $num = '3X1415';
    print "ok 2\n" if $num =~ /\Q3.14\E/;
    print "ok 3\n" if $num =~ /3.14/;

prints

    ok 1
    ok 3
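
In practice \Q is most often wrapped around an interpolated variable. The sketch below assumes a hypothetical search string, $needle, that happens to contain metacharacters.

```perl
use strict;
use warnings;

my $needle = '1+1=2';    # hypothetical user-supplied text full of metacharacters
my $text   = 'Everyone knows that 1+1=2.';

# Without \Q, "+" is a quantifier: "one or more 1s, then 1=2".
print "raw: ",    ( $text =~ /$needle/     ? "match" : "no match" ), "\n";
# prints "raw: no match"

# With \Q...\E, the interpolated value is matched literally.
print "quoted: ", ( $text =~ /\Q$needle\E/ ? "match" : "no match" ), "\n";
# prints "quoted: match"
```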

Execute code with /e flag to s///

The /e flag evaluates the replacement side of s/// as Perl code and substitutes the result.

    my $str = "AbCdE\n";
    $str =~ s/(\w)/lc $1/eg;
    print $str; # prints "abcde"

The replacement code can use $1 and friends if necessary.
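
As a sketch of using a capture inside the evaluated replacement, this example doubles every number in a string. The sentence and the $doubled variable are illustrative.

```perl
use strict;
use warnings;

my $str = 'Widgets cost 10 dollars, gadgets cost 25.';

# Copy the string, then modify the copy. The replacement is evaluated
# as code for each match, with $1 holding the digits just matched.
( my $doubled = $str ) =~ s/(\d+)/$1 * 2/eg;

print "$doubled\n";    # prints "Widgets cost 20 dollars, gadgets cost 50."
```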

Know when to use study

study is not helpful in the vast majority of cases, and since Perl 5.16 it does nothing at all. In older versions, all it did was make a table of where the first occurrence of each of 256 bytes is in the string. This meant that if you had a 1,000-character string, and you searched for lots of strings that begin with a constant character, the matcher could jump right to it. For example:

"This is a very long [… 900 characters skipped…] string that I have here, ending at position 1000"

Now, if you match this against the regex /Icky/, the matcher will try to find the first letter "I". That may mean scanning through the first 900+ characters until it gets to one. What study did was build a table of the 256 possible bytes and where they first appear, so that in this case the scanner could jump right to that position and start matching.