
Remember ASCII? Well, you might as well forget about it. We're all plugged-in and interconnected in one big writhing ball of global communication now. Luckily Perl has kept up with the times, but you've gotta know the secret codeword if you want to get in.

Let's say we're writing a script to break some arbitrary text up into a list of its words (which we are). In the olden days something like this would do...

my $input = q("Hello," world.);

# gather up strings of wordish characters
my @words = $input =~ /\w+/g;
print join('|', @words);

Hello|world

The trusty \w "word character" sequence does the job. But what happens if we expand the range of characters we consider wordish beyond the [A-Za-z0-9_] range?

my $input = q("mirë dita, bună ziua, dobrý deň," world.);

my @words = $input =~ /\w+/g;
print join('|', @words);

mir|dita|bun|ziua|dobr|de|world

Even if you don't speak Albanian, Romanian or Slovak you can see that didn't do what it should have. Clearly our regular expression is tripping over those characters that are outside the ASCII set. So how do you tell perl to include UTF-8 characters in its definition of a "word character?"

Firstly, make sure you are using perl 5.8.1 or later. Before that, UTF-8 support was not yet stable. I'll wait now if you need to upgrade... Done? Okay, we can proceed.

There are two lines we need to add to our script to get things back on track. The first is a pragma telling perl we are using UTF-8 characters in our script's source code, and the second sets the output mode of STDOUT, and so of our terminal window, to UTF-8 (it would still work without the second addition, but you wouldn't be able to see that it worked).

use utf8;
binmode(STDOUT, ":utf8");

my $input = q("mirë dita, bună ziua, dobrý deň," world.);
my @words = $input =~ /\w+/g;
print join('|', @words);

mirë|dita|bună|ziua|dobrý|deň|world

Much better. But bear in mind you'll need a text-editor that is smart enough to deal with UTF-8 characters to even write that script, and you'll probably want to tick the UTF-8 box when you save it.
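To see what perl was tripping over earlier, it can help to look at the same text both as raw bytes and as decoded characters. The following is only a sketch: the words() helper is a made-up name for illustration, and the bytes are spelled out as escapes so the result doesn't depend on how your editor saves the file. It uses the Encode module's decode function rather than use utf8, since the text here is built from bytes instead of being typed into the source.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return every run of word characters in the given text.
# (words() is a hypothetical helper, not part of the original script.)
sub words {
    my ($text) = @_;
    return $text =~ /\w+/g;
}

# The UTF-8 bytes of "mirë dita", written as escapes so the example
# does not depend on how this file is saved.
my $bytes = "mir\xC3\xAB dita";

# Decoding turns the two bytes C3 AB into the single character U+00EB.
my $chars = decode('UTF-8', $bytes);

binmode(STDOUT, ':encoding(UTF-8)');
printf "%d bytes, %d characters\n", length($bytes), length($chars);
print join('|', words($bytes)), "\n";   # the undecoded bytes split at the "ë"
print join('|', words($chars)), "\n";   # the decoded text keeps "mirë" whole
```

The undecoded string is ten bytes long and the decoded one is nine characters long; that one-byte difference is the "ë" that \w kept falling over.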

If for some reason you want to create strings with UTF-8 characters in them, but want your source code to remain ASCII (you Luddite), or maybe you just can't find the "white smiling face" key on your particular keyboard, you can use ASCII-compatible character names instead.

use charnames ':full';
binmode(STDOUT, ":utf8"); # to see UTF-8 in your console

my $greeting = "Hello \N{WHITE SMILING FACE}";
print $greeting;

Hello ☺
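If you know a character's code point rather than its name, there are a couple of other ASCII-only spellings that should give you the same string. Here is a small sketch comparing them; describe_char() is a hypothetical helper invented for this example, and it leans on charnames::viacode to look the name back up.

```perl
use strict;
use warnings;
use charnames ':full';

# Format a single character as "U+XXXX NAME".
# (describe_char() is a made-up helper for illustration.)
sub describe_char {
    my ($char) = @_;
    my $code = ord($char);
    return sprintf('U+%04X %s', $code, charnames::viacode($code));
}

my $by_name = "\N{WHITE SMILING FACE}";   # by Unicode name
my $by_hex  = "\x{263A}";                 # by hex escape
my $by_chr  = chr(0x263A);                # by code point at run time

binmode(STDOUT, ':encoding(UTF-8)');
print "all three spellings match\n"
    if $by_name eq $by_hex && $by_hex eq $by_chr;
print describe_char($by_name), "\n";      # prints "U+263A WHITE SMILING FACE"
```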

On the other hand, if you're getting your UTF-8 input from a separate text file, instead of having it embedded in your source code, you don't need the use utf8 pragma, but you do need to tell perl to treat your filehandle in a UTF-8 way (which presumes it actually is encoded as UTF-8; if it's in some non-UTF-8 encoding then just treating it as if it were UTF-8 will only make matters worse).

use IO::File;

my $source = IO::File->new('greetings.txt', 'r');
binmode($source, ':utf8' );

binmode(STDOUT, ":utf8");
while (my $hello = <$source>) {
        chomp $hello;
        print "$hello world\n";
}
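The same idea can be written with the built-in open, naming the encoding layer right in the mode string. This is a sketch under a few assumptions: read_greetings() is a hypothetical helper, the temporary file stands in for greetings.txt so the example runs on its own, and the stricter :encoding(UTF-8) layer is used in place of :utf8 so that badly formed input gets noticed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small UTF-8 file to read back (a stand-in for greetings.txt).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp, ':raw');
print {$tmp} "dobr\xC3\xBD de\xC5\x88\n";   # the UTF-8 bytes of "dobrý deň"
close $tmp or die "Cannot close $path: $!";

# Read a UTF-8 file and return its lines as decoded character strings.
# (read_greetings() is a made-up helper for illustration.)
sub read_greetings {
    my ($file) = @_;
    open(my $in, '<:encoding(UTF-8)', $file)
        or die "Cannot open $file: $!";
    chomp(my @lines = <$in>);
    close $in;
    return @lines;
}

binmode(STDOUT, ':encoding(UTF-8)');
print "$_ world\n" for read_greetings($path);
```

Because the layer decodes as it reads, each line arrives as characters, so length and \w behave the way you'd hope without any further decoding.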

Finally, what if your input is coming from a web form submission? If you use CGI.pm you're well ahead of the problem already. You will, however, need to worry about the web browser, which, if it was released in the last year or two, should be able to handle UTF-8 -- as long as you tell it to. To be safe you must do so twice: when the form is submitted and when you display the results.

And even so, perl will still need a little extra nudging to get it to treat your input string as UTF-8 when applying the \w regular expression to it. The problem is that perl doesn't "know" that the bytes in $input are to be treated as Unicode characters; we haven't told it so, and there's no reason to leave it guessing. We can settle the issue by using the decode_utf8 function of the Encode module. Once we've done that, perl will treat our bytes as real live Unicode characters, even if those characters happen to be several bytes wide.

#!/usr/bin/perl -T

use CGI qw(:standard);
use Encode;
use strict;
use warnings;

my $input = param('input');

my $html;

if (!$input) {
        $html = <<"EOS";
<html>
<head><title>CGI Input</title></head>
<body>
Please enter some words:
<form action="/cgi-bin/utf.cgi" method="get" accept-charset="utf-8">
<input type="text" name="input" />
<input type="submit" />
</form>
</body>
</html>
EOS
}
else {
    $input = decode_utf8($input);
    my @words = $input =~ /\w+/g;
    my $output = '<ol>'.join('', map{"<li>$_</li>"}@words).'</ol>';
    $html = <<"EOS";
<html>
<head><title>CGI Result</title></head>
<body>
$output
</body>
</html>
EOS
}

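# Encode the page to UTF-8 bytes so Content-Length counts bytes, not characters
my $body = encode_utf8($html);

print header(
    -type           => 'text/html',
    -charset        => 'utf-8',
    -Content_length => length $body
);
print $body;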
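One detail worth checking in a script like this: once the input has been decoded, length counts characters, while the Content-Length header is supposed to count bytes. The sketch below shows the gap for a single list item; byte_length() is a hypothetical helper, and the raw string is just an assumption about what a browser would submit for "bună ziua".

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Number of bytes the text occupies once encoded as UTF-8.
# (byte_length() is a made-up helper for illustration.)
sub byte_length {
    my ($text) = @_;
    return length(encode_utf8($text));
}

my $raw  = "bun\xC4\x83 ziua";     # the UTF-8 bytes of "bună ziua"
my $word = decode_utf8($raw);      # now nine characters
my $item = "<li>$word</li>";

printf "characters: %d, bytes: %d\n", length($item), byte_length($item);
# prints "characters: 18, bytes: 19"
```

Every non-ASCII character in the page widens that gap, so it seems safest to measure the encoded bytes rather than the decoded string when filling in the header.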

This sort of thing is never as simple as most people first think it is. Probably because most people have been protected from the gory details of it all by legions of over-caffeinated programmers who have hidden it away from them. If you're still reading this it looks like you're keen to join those programmers. Grab a cup of coffee; welcome to the legion.

Saying Hello to UTF-8
