Monday, June 25, 2007

Perl 6: Round 4

Pugs revision: r16657

(To start off: I'm sorry this took so long; life caught up.)

I'm pretty sure we're at the act you've probably been waiting to see: Rules and Grammars.

What you may know as perl 5 regular expressions, you now need to know as Perl 6 rules. The change in jargon is not substantial with regard to rules themselves: the new name was simply chosen because 'perl [5] regular expressions' were nowhere near the formal definition of a regular expression.

Grammars are a new addition. They're basically just "Classes for Rules" (a grammar inherits from the base class Rule) and simply act as a namespace to organize your rules [in general.]

Rules can be tricky. There are a lot of pitfalls and whatnot that you can fall into. It's therefore important that you be patient with them; you can build a lot of really useful things with primitive rules, but things have changed. As we go on, I'll try to address these things. For everything else, you will probably want to refer here, here, and here as well.

Important: A lot of the things I'll describe here are *not* fully implemented (or even partially implemented) in the way they should be according to the synopses and apocalypses. This is merely an introduction; full implementation of the rule engine is a milestone for Pugs, but it is not yet complete.
If I show anything that does work (as far as I know,) it'll be in the pugs prompt.

Ready, set, begin.


Rules
As I said earlier, Rules are simply regular expressions in Perl 6. They merit their own keyword, rule, and can be used in one of several ways (both constructing a rule and using one give you back an object.)

The first and simplest way to use a Rule is matching. Matching is simple enough; match a string against a rule and give me the result. Matching is done in the form of:
if $str ~~ m/.../ { ... }
(Note: ~~ is the 'smart match operator.' It is analogous to perl 5's =~. See S03 for more.)


Rules in the form of m/.../ immediately match. You can also use the substitution form, s/.../.../, which also immediately matches (and, yes, substitutes.) Finally, the simple /.../ form will immediately match given the right context (i.e. when used with the smart match operator.) This form can also define a deferred match.
Here're a few examples of using these forms, up to this point:
pugs> my $str = "hello";
pugs> if $str ~~ m/hello/ {
....> say "affirmative";
....> } else {
....> say "incorrect";
....> }
affirmative
Bool::True
pugs> if $str ~~ s/ll/l/ {
....> say "substitution worked";
....> } else {
....> say "substitution failed";
....> }
substitution worked
Bool::True
pugs> $str
"helo"
pugs>
You may want your rules to be a little more flexible than that, however, by using deferred matches. Using the form rx// you can define a deferred match that can be stuck in a variable, ex:
pugs> my $r = rx/^abcd$/;
The same could be expressed without the rx prefix (in relation to what I said earlier about /.../ and context.)
When you define a rule in this manner without the rx prefix, you can also prefix the rule with any unary operator (S03) to force that rule to immediately match in a context where it normally wouldn't (it will match against $_.) For example:
my $one   = ?/\d*/;            # boolean context
my $two   = ~/^(fun|unfun).*/; # string context
my $three = +/(\d|\s)*/;       # numeric context
Another way to define a rule is using the rule block. A rule block can have a name follow it, or it can be anonymous and simply stuck inside a variable like a deferred match. For example:
pugs> rule x { \d+ };
pugs> my $y = rule { (one|two) };
After defining a named/anonymous rule (via any of the methods described above) or a deferred match (without a unary operator prefixing it), you can match them following these examples:
pugs> rule x { \d* };
pugs> my $y = rule { (one|two) };
pugs> my $z = rx/^success: (.*)/;

pugs> 434 ~~ /<x>/ # matches
pugs> "one" ~~ $y # matches
pugs> "success: asdf" ~~ $z # matches
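
If you want to poke at the immediate-versus-deferred distinction in something that runs today, here is a minimal sketch using Python's re module. It is only an analogy (Python is not Perl 6, and it says nothing about how Pugs does it); the variable names are made up for illustration.

```python
import re

# Deferred match: compile now, apply later (roughly the role of rx/.../).
deferred = re.compile(r"^success: (.*)")

# Immediate match: the pattern is applied on the spot (roughly m/.../).
immediate = re.search(r"hello", "hello")

# The deferred pattern can be stored, passed around, and applied whenever.
result = deferred.search("success: asdf")

print(bool(immediate))   # True
print(result.group(1))   # asdf
```

The point is just that a deferred pattern is a value you hold on to, while an immediate match hands you a result straight away.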

Grammars
Grammars are simply classes for rules. Their declaration is analogous to that of a class; if the grammar keyword is followed by a block:
grammar Dog { ... }
The namespace of the grammar is confined to that block. If that block is absent:
grammar Dog;
It continues until the end of the source file.


The main difference in calling a rule that's defined within a grammar is that you simply have to give the fully qualified name, ex:
grammar Dog {
rule bark { ^ < bark woof ruff > $ }
}

"woof" ~~ /<Dog.bark>/; # True
Like classes, grammars can do things like inherit from one another and the like.
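
To make the "namespace plus inheritance" idea concrete, here is a rough Python analogue: a class whose attributes are compiled patterns. This is a sketch of the concept only, not Perl 6 grammar semantics; the Puppy subclass and its yip pattern are hypothetical names I made up for the example.

```python
import re

class Dog:
    # Analogue of: rule bark { ^ < bark woof ruff > $ }
    bark = re.compile(r"^(?:bark|woof|ruff)$")

class Puppy(Dog):
    # Hypothetical subclass: inherits bark and adds a pattern of its own.
    yip = re.compile(r"^(?:yip|yap)$")

print(bool(Dog.bark.search("woof")))    # True
print(bool(Puppy.bark.search("ruff")))  # True, bark is inherited from Dog
print(bool(Puppy.yip.search("woof")))   # False
```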


Important Detail #1:
I'm going to at this point take some time to tell you something important that is, well, very important: interpolation doesn't exist.

I'll give you a moment to let it sink in.

Whereas in perl 5 you could freely embed scalars and the like into your regexes, you cannot do this any longer in a rule. Rather, they are passed raw to the rule engine, which decides how to deal with them from there. This is because, now, regexes are not strings; they're programs. Wall described this in A05. To quote him:

"The problem with \Q$string\E arises because of the fundamental mistake of using interpolation to build regexes instead of letting the regex control how it treats the variables it references. Regexes aren't strings, they're programs. Or, rather, they're strings only in the sense that any piece of program is a string."

The common misconception, then, is to think of your rule as a string, and so to expect interpolation to come as naturally as it would with any other string. Rather, let the rule engine figure out how to deal with your variables. You now use the general syntax of an assertion with your variable to help the engine determine how it should be treated.

I figured I'd take this time to point something like this out, as it's pretty important. If you really really really need interpolation that badly, you can use the P5 rule modifier to achieve it (see below.)
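
The \Q$string\E problem Wall mentions is easy to reproduce in any language that builds regexes out of strings. Here is a minimal sketch in Python (chosen only because it runs today; it is not Perl 6, and user_input and text are hypothetical values):

```python
import re

user_input = "1+1"     # hypothetical value containing a metacharacter
text = "1+1 is 2"

# Interpolating the raw string: "+" turns into a quantifier, so the pattern
# means "one or more 1s, then another 1" and never sees the literal text.
naive = re.search(f"^{user_input}", text)

# Escaping first (the job \Q...\E did in Perl 5) matches the literal text.
escaped = re.search(f"^{re.escape(user_input)}", text)

print(naive)              # None
print(escaped.group(0))   # 1+1
```

When the pattern is a string, every variable needs this kind of manual babysitting; when the pattern is a program, the engine can decide for itself whether a variable is literal text or a subrule.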
Continuing...

Special characters & Co.
In perl 6 rules, like perl 5 regexes, you have a lot of special characters you can use inside your rules for specific purposes. Here I am merely going to list some of them and provide examples; this isn't a definitive reference to them.

Metacharacters
The general metacharacters you can have inside a rule itself are as follows:
.          Match any single character, including a newline.
^          Match the beginning of a string.
$          Match the end of a string.
^^         Match the beginning of a line.
$$         Match the end of a line.
|          Match alternate patterns (OR).
&          Match multiple patterns (AND).
\          Escape a metacharacter to get a literal character, or escape a literal character to get a metacharacter.
#          Mark a comment (to the end of the line).
:=         Bind the result of a match to a hypothetical variable.
( . . . )  Group patterns and capture the result.
[ . . . ]  Group patterns without capturing.
{ . . . }  Execute a closure (Perl 6 code) within a rule.
< . . . >  Match an assertion.

Most of these are new; however, their meaning should not be hard to work out if you've used Perl 5 regular expressions before. They don't really need explaining, so just play around with them (we'll cover more on assertions in a moment, however.)


Escape sequences
There are also plenty of escape sequences you can use inside a rule to specify things such as whitespace or group an entire word together. Here's a quick list:
\0[ . . . ]  Match a character given in octal (brackets optional).
\b           Match a word boundary.
\B           Match when not on a word boundary.
\c[ . . . ]  Match a named character or control character.
\C[ . . . ]  Match any character except the bracketed named or control character.
\d           Match a digit.
\D           Match a nondigit.
\e           Match an escape character.
\E           Match anything but an escape character.
\f           Match the form feed character.
\F           Match anything but a form feed.
\n           Match a (logical) newline.
\N           Match anything but a (logical) newline.
\h           Match horizontal whitespace.
\H           Match anything but horizontal whitespace.
\L[ . . . ]  Everything within the brackets is lowercase.
\Q[ . . . ]  All metacharacters within the brackets match as literal characters.
\r           Match a return.
\R           Match anything but a return.
\s           Match any whitespace character.
\S           Match anything but whitespace.
\t           Match a tab.
\T           Match anything but a tab.
\U[ . . . ]  Everything within the brackets is uppercase.
\v           Match vertical whitespace.
\V           Match anything but vertical whitespace.
\w           Match a word character (Unicode alphanumeric plus "_").
\W           Match anything but a word character.
\x[ . . . ]  Match a character given in hexadecimal (brackets optional).
\X[ . . . ]  Match anything but the character given in hexadecimal (brackets optional).
Most of these should be fairly self-explanatory.

"Extensible metasyntax"
From S05: "Both < and > are metacharacters, and are usually (but not always) used in matched pairs. (Some combinations of metacharacters function as standalone tokens, and these may include angles. These are described below.) Most assertions are considered declarative; procedural assertions will be marked as exceptions."

In general, the first character after the opening angle bracket determines an assertion's semantics. Here are a few of them:

  • If there is whitespace after the opening bracket, and whitespace before the ending one, the characters inside are treated 'quote style' and used in a non-capturing group. Ex:

    rx/< hello there how are you >/
    Is equivalent to:
    rx/[hello|there|how|are|you]/

  • A leading ? means the assertion causes no capture when it matches.
  • A leading $ causes an indirect subrule to be invoked. I'm pretty sure you've seen this before.
  • A leading :: causes an indirect subrule to be invoked, yet symbolically. What this means is you use this syntax:
    <::$var>
    And the contents of $var will be taken out, and what's inside will be treated as a rule name. If you've ever done php, this is analogous to the double dollar sign convention, ex:
    rule z { (\d+) };
    my $name = "z";

    "123" ~~ /<z>/;
    # above is the same as:
    "123" ~~ /<::$name>/;
  • A leading @ makes things act 'array-like.' This:
    "..." ~~ /<@arr>/;

    Is semantically the same as:
    "..." ~~ /[@arr[0] | @arr[1] | @arr[2] | ... ]/;


    However, rather than matching as literal, each element of the array will be treated as a subrule. This can be pretty useful, as you can match your text against an array of different rules.
  • A leading { (also followed by a closing } right before the ending angle bracket) basically allows you to define an in situ closure that is expected to return a rule, which at that point is matched.
  • A leading & treats a subroutine as if it will return a rule. This:
    <&foo()>

    Is the same as:
    <{ foo() }>

    It's pretty much just a shorthand.
  • A leading [ (like the curly bracket, also ending with a ]) indicates a character class. This class can be negated by instead prefixing your opening bracket with a -. Examples:

    pugs> "a" ~~ /<[a..z]>/  # true
    pugs> "a" ~~ /<-[a..z]>/ # false
    pugs> "1" ~~ /<-[a..z]>/ # true


    There is additional flexibility in that you may 'add' and 'subtract' character classes. For example, to match a lowercase letter that is not a vowel:
    $str ~~ /<[a..z] - [aeiou]>/

  • Leading ! indicates negation (naturally.)
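
Two of the assertions in the list above, character-class subtraction and the leading @, boil down to ordinary set and list operations. Here is a rough Python sketch of both ideas (an analogy only, not how the Perl 6 rule engine implements them; the patterns list is a hypothetical example):

```python
import re
import string

# <[a..z] - [aeiou]>: start from a..z and subtract the vowels.
consonants = sorted(set(string.ascii_lowercase) - set("aeiou"))
consonant_class = re.compile("[" + "".join(consonants) + "]")

# <@arr>: each element is treated as a pattern, tried as alternatives.
patterns = [r"\d+", r"[a-f]+"]   # hypothetical list of sub-patterns
any_of = re.compile("|".join(f"(?:{p})" for p in patterns))

print(bool(consonant_class.search("a")))     # False
print(bool(consonant_class.search("b")))     # True
print(any_of.fullmatch("123") is not None)   # True
```

Note that the consonant class matches a single character, so a string passes as soon as it contains one lowercase consonant anywhere.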

Rule modifiers

Aside from the above, you still have your handy dandy rule modifiers. Their usage is essentially the same, but now they are passed at the front of the rule rather than at the end, as it makes life easier for both you and the parser. Here are a few of the modifiers you can use:
:i    Ignore case
:g    Match as many times as possible
:s    Treat whitespace as 'significant,' i.e. it must occur verbatim
:P5   Use Perl 5's regular expression syntax, rather than Perl 6.
:Nx   Works like :g, however, the N specifies exactly how many times it must match. The general form is :x(N)
:Nth  Find the Nth occurrence. Useful for substitutions, i.e. s:5th/lbrary/library/ if you said something wrong in your sentence. The general form is :nth(N)

In the case of just declaring a deferred rule (rx/.../) or a match (m/.../) these modifiers are placed after the rx/m token and delimited by a colon, i.e. rx:i/.../, m:g/.../, et cetera et cetera. There are plenty more, however, I'm leaving them out as I assume if you need them, you'll find them (sue me.)
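
To get a feel for what :i, :g and :Nth do, here is a small Python sketch that imitates them with the re module. It is an analogy, not Perl 6: replace_nth is a hypothetical helper I wrote for this example, not a library function.

```python
import re

def replace_nth(pattern, replacement, text, n):
    """Replace only the n-th (1-based) match of pattern in text."""
    seen = 0

    def swap(match):
        nonlocal seen
        seen += 1
        # Leave every match alone except the n-th one.
        return replacement if seen == n else match.group(0)

    return re.sub(pattern, swap, text)

# Like :i, ignore case.
print(bool(re.search("HELLO", "hello", re.IGNORECASE)))   # True
# Like :g, collect every match.
print(re.findall(r"\d", "a1b2c3"))                         # ['1', '2', '3']
# Like s:2nd/lbrary/library/, fix only the second occurrence.
print(replace_nth("lbrary", "library", "lbrary lbrary lbrary", 2))
```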


Built-in rules
Aside from your rules, there're naturally plenty of built in ones you can use. Here are a few:
<alpha>  Match a Unicode alphabetic character.
<digit>  Match a Unicode digit.
<sp>     Match a single-space character (the same as \s).
<ws>     Match any whitespace (the same as \s+).
<null>   Match the null string.
<prior>  Match the same thing as the previous match.
Like other rules, you can change their meanings with assertion semantics.


Hypothetical variables
Hypothetical variables are a new feature of Perl 6 rules. In a perl 6 rule, a hypothetical variable allows you to bind a variable within a rule. If your match fails, your hypothetical variables are automatically unbound from what they were (in the case that your match failed after the fact.) However, the variable must be in lexical scope before you may bind to it via the := operator. This is less complicated than it sounds; here's an example:
my $z;
"I am a person" ~~ m/^$z := (\w+)/;
$z.say; # should print "I"
Fairly simple.
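
The closest everyday relative of this is a named capture group. Here is a minimal Python sketch of the same example, plus a failing match to show that nothing stays bound when the overall match fails. This is only an analogy: Python has no hypothetical variables, and the names z and z2 are mine.

```python
import re

# Successful match: the name z ends up holding the first word.
m = re.match(r"^(?P<z>\w+)", "I am a person")
z = m.group("z") if m else None
print(z)    # I

# Failed match: \w+ tentatively grabs "I", but " robot" never matches,
# so the whole match fails and there is nothing to read back.
m2 = re.match(r"^(?P<z>\w+) robot", "I am a person")
z2 = m2.group("z") if m2 else None
print(z2)   # None
```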


Conclusion
This has been a nice post. Hopefully, your rule-fu has increased. The changes may need a little time to get used to, however, in time all should be good. :) Like I said, a -lot- of this is not implemented, and I have not even breathed upon the technical surface of rules; I'm not exactly the definitive reference on them anyway. This should give you a taste, however.

Perhaps an unofficial 'Round 4b' is in order. We'll see...
Until next time...


Next round: ??
(that means I'm open to recommendations. If none arise, macros seem like a good topic)