Monday, June 25, 2007

Perl 6: Round 4

Pugs revision: r16657

(To start off: I'm sorry this took so long; life caught up.)

I'm pretty sure we're at the act you've probably been waiting to see: Rules and Grammars.

What you may know as perl 5 regular expressions, you now need to know as Perl 6 rules. The change in jargon is not substantial with regard to rules themselves: the new name was simply chosen because 'perl [5] regular expressions' were nowhere near the formal definition of a regular expression.

Grammars are a new addition. They're basically just "Classes for Rules" (a grammar inherits from the base class Rule) and simply act as a namespace to organize your rules [in general.]

Rules can be tricky. There're a lot of pitfalls and whatnot that you can fall into. It's therefore important that you be patient with them; you can build a lot of really useful things with primitive rules, but things have changed. As we go on, I'll try to address these things. For everything else, you will probably want to refer here, here, and here as well.

Important: It's good to note that a lot of the things I'll describe here are *not* fully implemented (or even partially implemented) in the way they should be according to the Synopses, Apocalypses, etc. This is merely an introduction; full implementation of the rule engine is a milestone for Pugs, but it is not yet completed.
If I show anything that does work (as far as I know,) it'll be in the pugs prompt.

Ready, set, begin.

As I said earlier, Rules are simply regular expressions in Perl 6. They merit their own keyword, rule, and can be used in one of several ways (whether you are constructing a rule or using one, you are given back an object.)

The first and simplest way to use a Rule is matching. Matching is simple enough; match a string against a rule and give me the result. Matching is done in the form of:
if($str ~~ m/.../)
(Note: ~~ is the 'smart match operator.' It is analogous to perl 5's =~. See S03 for more.)

Rules in the form of m/.../ immediately match. You can also use the substitution form, s/.../.../ which also immediately matches (and, yes, substitutes.) Finally, using the simple /.../ form will immediately match given that context (i.e. used with the smart match operator. This form can also define a deferred match.)
Here're a few examples of using these forms, up to this point:
pugs> my $str = "hello";
pugs> if $str ~~ m/hello/ {
....> say "affirmative";
....> } else {
....> say "incorrect";
....> }
pugs> if $str ~~ s/ll/l/ {
....> say "substitution worked";
....> } else {
....> say "substitution failed";
....> }
substitution worked
pugs> $str
You may want your rules to be a little more flexible than that, however, by using deferred matches. Using the form rx// you can define a deferred match that can be stuck in a variable, ex:
pugs> my $r = rx/^abcd$/;
The same could be expressed without the rx prefix (in relation to what I said earlier about /.../ and context.)
Without the rx prefix and defining a rule in this manner, you can also prefix the rule with any unary operator (S03) to force that rule to immediately match in a context where it normally wouldn't (it will match against $_.) For example:
my $one = ?/\d*/;              # boolean context
my $two = ~/^(fun|unfun).*/;   # string context
my $three = +/(\d|\s)*/;       # numeric context
Another way to define a rule is using the rule block. A rule block can have a name follow it, or it can be anonymous and simply stuck inside a variable like a deferred match. For example:
pugs> rule x { \d+ };
pugs> my $y = rule { (one|two) };
After defining a named/anonymous rule (via any of the methods described above) or a deferred match (without a unary operator prefixing it), you can match them following these examples:
pugs> rule x { \d* };
pugs> my $y = rule { (one|two) };
pugs> my $z = rx/^success: (.*)/;

pugs> 434 ~~ /<x>/ # matches
pugs> "one" ~~ $y # matches
pugs> "success: asdf" ~~ $z # matches
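
To tie those forms together, here's a small sketch. A caveat: I've written it in the syntax of later Perl 6 implementations (Rakudo, nowadays called Raku) and have not checked it against Pugs, so treat it as an illustration rather than a transcript. The names $r and x and the sample strings are just made up for the example, and I'm assuming a lexical "my regex" in place of a bare rule block to keep whitespace handling out of the picture.

```raku
my $str = "hello";

# m/.../ matches immediately against the left-hand side of ~~
say "affirmative" if $str ~~ m/hello/;

# s/.../.../ matches immediately too, and rewrites $str in place
say "substitution worked" if $str ~~ s/ll/l/;
say $str;    # helo

# rx/.../ builds a deferred match that can live in a variable
my $r = rx/ ^ 'success:' \s* (.*) $ /;
say ~$0 if "success: asdf" ~~ $r;    # asdf

# a named regex (hypothetical name x) can be called as a subrule
my regex x { \d+ }
say so "434" ~~ /<x>/;    # True
```

The deferred form is the one to reach for when the same pattern gets reused, since $r can be passed around like any other value.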

Grammars are simply classes for rules. Their declaration is analogous to that of a class; if the grammar keyword is followed by a block:
grammar Dog { ... }
The namespace of the grammar is confined to that block. If that block is absent:
grammar Dog;
It continues until the end of the source file.

The main difference in calling a rule that's defined within a grammar is that you simply have to give the fully qualified name, ex:
grammar Dog {
    rule bark { ^ < bark woof ruff > $ }
}

"woof" ~~ /<Dog.bark>/; # True
Like classes, grammars can do things like inherit and the like.
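
Here's a minimal sketch of what inheritance between grammars could look like. Same caveat as elsewhere: this is in later Perl 6 (Rakudo/Raku) syntax and untested under Pugs. Puppy and yip are made-up names, and calling .parse with :rule is how Rakudo lets you run one named rule of a grammar; I'm assuming nothing about what Pugs offers here.

```raku
grammar Dog {
    # regex rather than rule, so whitespace inside is not significant
    regex bark { ^ [ bark | woof | ruff ] $ }
}

# Puppy (hypothetical) inherits bark from Dog and adds one of its own
grammar Puppy is Dog {
    regex yip { ^ yip $ }
}

say so Dog.parse("woof", :rule<bark>);      # True
say so Puppy.parse("ruff", :rule<bark>);    # True, bark is inherited
say so Puppy.parse("yip", :rule<yip>);      # True
say so Dog.parse("meow", :rule<bark>);      # False
```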

Important Detail #1:
I'm going to at this point take some time to tell you something important that is, well, very important: interpolation doesn't exist.

I'll give you a moment to let it sink in.

Whereas in perl 5 you could freely embed scalars and the like into your regexes, you cannot do this any longer in a rule. Rather, they are passed raw to the rule engine, which decides how to deal with them from there. This is because, now, regexes are not strings; they're programs. Wall described this in A05. To quote him:

"The problem with \Q$string\E arises because of the fundamental mistake of using interpolation to build regexes instead of letting the regex control how it treats the variables it references. Regexes aren't strings, they're programs. Or, rather, they're strings only in the sense that any piece of program is a string."

In this fashion, the common misconception is to think of your rule as a string, and therefore letting interpolation come as naturally as it would with any other string. Rather, let the rule engine figure out how to deal with your variables. Now, you will use the general syntax of an assertion with your variable to help the engine determine how things should be treated.

I figured I'd take this time to point something like this out, as it's pretty important. If you really really really need interpolation that badly, you can use the P5 rule modifier to achieve it (see below.)
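
To make that a bit more concrete, here's a rough sketch of the difference between a bare variable and a variable inside an assertion. It's written in later Perl 6 (Rakudo/Raku) syntax and is unverified under Pugs; $pat is a made-up name.

```raku
my $pat = "a.c";

# A bare variable is handed to the engine and matched as literal text,
# so the dot is just a dot.
say so "abc" ~~ / ^ $pat $ /;      # False
say so "a.c" ~~ / ^ $pat $ /;      # True

# Inside an assertion, the contents are treated as a rule,
# so the dot matches any character.
say so "abc" ~~ / ^ <$pat> $ /;    # True
```

Notice there's no \Q...\E anywhere: the rule decides how $pat is treated, which is exactly the point of the quote above.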

Special characters & Co.
In perl 6 rules, like perl 5 regexes, you have a lot of special characters you can use inside your rules for specific purposes. Here I am merely going to list some of them and provide examples; this isn't a definitive reference to them.

The general metacharacters you can have inside a rule itself are as follows:
.         Match any single character, including a newline.
^         Match the beginning of a string.
$         Match the end of a string.
^^        Match the beginning of a line.
$$        Match the end of a line.
|         Match alternate patterns (OR).
&         Match multiple patterns (AND).
\         Escape a metacharacter to get a literal character, or escape a literal character to get a metacharacter.
#         Mark a comment (to the end of the line).
:=        Bind the result of a match to a hypothetical variable.
( . . . ) Group patterns and capture the result.
[ . . . ] Group patterns without capturing.
{ . . . } Execute a closure (Perl 6 code) within a rule.
< . . . > Match an assertion.

These are mostly new; however, interpreting their meaning should not be hard if you've used Perl 5 regular expressions before. Explanation of these is not really needed, just play around with them (we'll cover more on assertions in a moment, however.)

Escape sequences
There are also plenty of escape sequences you can use inside a rule to specify things such as whitespace or group an entire word together. Here's a quick list:
\0[ . . . ]     Match a character given in octal (brackets optional).
\b              Match a word boundary.
\B              Match when not on a word boundary.
\c[ . . . ]     Match a named character or control character.
\C[ . . . ]     Match any character except the bracketed named or control character.
\d              Match a digit.
\D              Match a nondigit.
\e              Match an escape character.
\E              Match anything but an escape character.
\f              Match the form feed character.
\F              Match anything but a form feed.
\n              Match a (logical) newline.
\N              Match anything but a (logical) newline.
\h              Match horizontal whitespace.
\H              Match anything but horizontal whitespace.
\L[ . . . ]     Everything within the brackets is lowercase.
\Q[ . . . ]     All metacharacters within the brackets match as literal characters.
\r              Match a return.
\R              Match anything but a return.
\s              Match any whitespace character.
\S              Match anything but whitespace.
\t              Match a tab.
\T              Match anything but a tab.
\U[ . . . ]     Everything within the brackets is uppercase.
\v              Match vertical whitespace.
\V              Match anything but vertical whitespace.
\w              Match a word character (Unicode alphanumeric plus "_").
\W              Match anything but a word character.
\x[ . . . ]     Match a character given in hexadecimal (brackets optional).
\X[ . . . ]     Match anything but the character given in hexadecimal (brackets optional).
Most of these should be fairly self explanatory.

"Extensible metasyntax"
From S05: "Both < and > are metacharacters, and are usually (but not always) used in matched pairs. (Some combinations of metacharacters function as standalone tokens, and these may include angles. These are described below.) Most assertions are considered declarative; procedural assertions will be marked as exceptions."

In general, the first leading character after the angle bracket determines the assertion's semantics. Here're a few of them:

  • If there is whitespace after the opening bracket, and whitespace before the ending one, the characters inside are treated 'quote style' and used in a non-capturing group. Ex:

    rx/< hello there how are you >/
    Is equivalent to:
    rx/[ hello | there | how | are | you ]/

  • A leading ? means the assertion causes no capture when it matches.
  • A leading $ causes an indirect subrule to be invoked. I'm pretty sure you've seen this before.
  • A leading :: causes an indirect subrule to be invoked, yet symbolically. What this means is you use this syntax:
    <::$var>
    And the contents of $var will be taken out, and what's inside will be treated as a rule name. If you've ever done php, this is analogous to the double dollar sign convention, ex:
    rule z { (\d+) };
    my $name = "z";

    "123" ~~ /<z>/;
    # above is the same as:
    "123" ~~ /<::$name>/;
  • A leading @ makes things act 'array-like.' This:
    "..." ~~ /<@arr>/;

    Is semantically the same as:
    "..." ~~ /[@arr[0] | @arr[1] | @arr[2] | ... ]/;

    However, rather than matching as literal, each element of the array will be treated as a subrule. This can be pretty useful, as you can match your text against an array of different rules.
  • A leading { (also followed by a closing } right before the ending angle bracket) basically allows you to define an in situ closure that is expected to return a rule, which at that point is matched.
  • A leading & treats a subroutine as if it will return a rule. This:
    <&foo()>
    Is the same as:
    <{ foo() }>

    It's pretty much just a shorthand.
  • A leading [ (like the curly bracket, also ending with a ]) indicates a character class. This class can be negated by instead prefixing your opening bracket with a -. Examples:

    pugs> "a" ~~ /<[a..z]>/  # true
    pugs> "a" ~~ /<-[a..z]>/ # false
    pugs> "1" ~~ /<-[a..z]>/ # true

    There is additional flexibility in that you may 'add' and 'subtract' character classes. For example, to match a lowercase letter that is not a vowel:
    $str ~~ /<[a..z] - [aeiou]>/

  • Leading ! indicates negation (naturally.)
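
A few of the assertions above, sketched out so you can see them side by side. As with the other sketches this is later Perl 6 (Rakudo/Raku) syntax, not something I've run under Pugs, and $consonants and @sounds are made-up names.

```raku
# negated class: anything but a lowercase letter
say so "1" ~~ / ^ <-[a..z]> $ /;                 # True

# class subtraction: lowercase letters minus the vowels,
# anchored and repeated so the whole string has to be vowel-free
my $consonants = rx/ ^ <[a..z] - [aeiou]>+ $ /;
say so "rhythm" ~~ $consonants;                  # True
say so "hello"  ~~ $consonants;                  # False

# <@arr>: each element is treated as a rule and tried as an alternative
my @sounds = 'bark', 'wo+f', 'ruff';
say so "wooof" ~~ / ^ <@sounds> $ /;             # True
```

The 'wo+f' element is there to show that the array elements really are treated as rules: the + is a quantifier, not a literal plus sign.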

Rule modifiers

Aside from the above, you still have your handy dandy rule modifiers. Their usage is essentially the same, but now they are passed at the front of the rule rather than at the end, as it makes life easier for both you and the parser. Here're a few of the modifiers you can use:
:i      Ignore case
:g      Match as many times as possible
:s      Treat whitespace as 'significant,' i.e. it must occur verbatim
:P5     Use Perl 5's regular expression syntax, rather than Perl 6.
:Nx     Works like :g, however, the N specifies exactly how many times it
        must match. The general form is :x(N)
:Nth    Find the Nth occurrence. Useful for substitutions, i.e.
        s:5th/lbrary/library/ if you said something wrong in your
        sentence. The general form is :nth(N)

In the case of just declaring a deferred rule (rx/.../) or a match (m/.../) these modifiers are placed after the rx/m token and delimited by a colon, i.e. rx:i/.../, m:g/.../, et cetera et cetera. There are plenty more; however, I'm leaving them out as I assume if you need them, you'll find them (sue me.)
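
Here's a quick sketch of three of those modifiers in action. Once more, this is later Perl 6 (Rakudo/Raku) syntax that I have not checked against Pugs, and the variable names and sample strings are invented for the example.

```raku
# :i ignores case
say so "HELLO" ~~ m:i/ hello /;              # True

# :g collects every match instead of stopping at the first
my @nums = "a1b22c333" ~~ m:g/ \d+ /;
say @nums.map(~*).join(",");                 # 1,22,333

# :2nd (general form :nth(2)) touches only the second occurrence
my $s = "lbrary lbrary lbrary";
$s ~~ s:2nd/lbrary/library/;
say $s;                                      # lbrary library lbrary
```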

Built-in rules
Aside from your rules, there're naturally plenty of built in ones you can use. Here are a few:
<alpha>       Match a Unicode alphabetic character.
<digit>       Match a Unicode digit.
<sp>          Match a single-space character (the same as \s).
<ws>          Match any whitespace (the same as \s+).
<null>        Match the null string.
<prior>       Match the same thing as the previous match.
Like other rules, you can change their meanings with assertion semantics.

Hypothetical variables
Hypothetical variables are a new feature of Perl 6 rules. In a perl 6 rule, a hypothetical variable allows you to bind a variable within a rule. If your match fails, your hypothetical variables are automatically unbound from what they were (in the case that your match failed after the fact.) However, the variable must be in lexical scope before you may bind to it via the := operator. This is less complicated than it sounds; here's the example:
my $z;
"I am a person" ~~ m/^$z := (\w+)/;
$z.say; # should print "I"
Fairly simple.

This has been a nice post. Hopefully, your rule-fu has increased. The changes may need a little time to get used to; however, in time all should be good. :) Like I said, a -lot- of this is not implemented, and I have not even breathed upon the technical surface of rules; I'm not exactly the definitive reference on them anyway. This should give you a taste, however.

Perhaps an unofficial 'Round 4b' is in order. We'll see...
Until next time...

Next round: ??
(that means I'm open to recommendations. If none arise, macros seem like a good topic)


Anonymous said...

Thanks for the great articles! Much easier to read than the Synopses. :)

Although I have been reading Planet Perl 6 for a while, I somehow missed this blog. Great work!

Thomas Wittek said...

Interesting post.
But it reads a bit like a reference.

It would be nice to see some examples/applications of Perl6 rules.

austin said...

@anonymous: thanks! The blog was only recently added to planet six, I'd say about two weeks ago; not totally unsurprising you haven't seen anything on it yet.

@thomas: The problem is like I said, a lot of this stuff is not implemented fully or even at all. At least in Pugs; PGE is a fairly complete implementation of Perl 6 rules, however. This makes it worth looking into for further experimentation.
Thanks for the feedback though. :)

Anonymous said...

Wow, I got here via a little yak shaving.

The "journals" link on used to point to then it switched to Planet Perl 6 and then it broke entirely

After whining to the right people it began pointing back to Planet Perl 6 just in time to see one of these posts.

Thank you and very cool - please continue.


Wolf said...

Perl is the best scripting language for Text processing and handle regex. I have posted few articles related to those at my blog
Also Perl's Cpan has lots of support that I don't even need to think extra while developing project. I didn't find such help on other programming language except Java and .NET

