Hi,
we have run into a strange problem, and it took quite a while to narrow it down to the code snippet below.
A Perl script parses a set of source files, looking for various copyright strings.
The files are read by a routine that can handle Unicode files.
The regular expression actually in use is much more complex than the one in the snippet below.
We found that the expression fails whenever the copyright sign (\xA9) is contained in the string.
Further investigation showed that the regular expression never matches when the /i modifier is used.
/i is vital for the regex, because writing every static pattern as a sequence of [<UPPERCASE char><lowercase char>] classes is tedious and error-prone.
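For reference, the character-class form mentioned above could be generated rather than typed by hand. This is only a minimal sketch: ci_pattern is a hypothetical helper that is not part of our script, it handles ASCII letters only, and we have not verified whether a pattern built this way avoids the failure described here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): expand every ASCII
# letter of a literal string into a two-character class such as [Cc] and
# escape everything else, so the result can be matched without /i.
sub ci_pattern {
    my ($literal) = @_;
    my $pattern = '';
    foreach my $char (split //, $literal) {
        if ($char =~ /^[A-Za-z]$/) {
            $pattern .= '[' . uc($char) . lc($char) . ']';
        }
        else {
            $pattern .= quotemeta($char);
        }
    }
    return $pattern;
}

my $word = ci_pattern('Copyright');
print "$word\n";    # prints [Cc][Oo][Pp][Yy][Rr][Ii][Gg][Hh][Tt]

# The generated fragment is interpolated into a regex without /i.
my $rgx = qr/\xA9\s*$word/;
```

The fragment is plain regex text, so it can be interpolated into a larger expression like any other variable.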
We are using ActiveState Perl 5.8.4 build 810.
We have also checked on Linux with Perl 5.8.5; the result is the same.
Run the little script below and check the output. Here we always get a + followed by a -.
Is this a bug in Perl's regex engine, or are we doing something wrong?
For the experts:
Uncomment the use re 'debug' line, redirect the output (> debug.out 2>&1), and compare the two matches.
With /i, I can see that some strange character appears in front of the copyright character when the comparison is done.
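To see what the decoded string really contains, the code points and the internal UTF-8 flag can be printed directly. This is a rough sketch, assuming a sample string decoded from ISO-8859-1; code_points is a hypothetical helper, not part of our script. If Perl stores the string internally as UTF-8, U+00A9 occupies the two bytes C2 A9, so the strange character in the debug output may simply be the lead byte 0xC2, but that is only a guess.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: list each character of a string as a hex code point.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf '%04X', ord($_) } split //, $string;
}

# Assumed sample input: the copyright sign as a single ISO-8859-1 byte.
my $sample = decode('iso-8859-1', "\xA9 Copyright");

# utf8::is_utf8 reports whether the string is flagged as UTF-8 internally.
print 'UTF-8 flag: ', (utf8::is_utf8($sample) ? 'on' : 'off'), "\n";
print 'Code points: ', code_points($sample), "\n";
```

Running the same two print statements on $content in the script below would show whether the copyright sign arrives as the single character 00A9.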
use Encode::Guess;
use Encode::Encoder;
#use re 'debug';
# $octets holds a copyright message containing COPYRIGHT SIGN (\xA9) and REGISTERED SIGN (\xAE);
# both signs are written as escapes so the bytes do not depend on how this file is saved
my $octets = "#define VER_LEGALCOPYRIGHT_STR \"\xA9 Copyright 2002-2005 xxxxx\xAE. All rights reserved.\"";
my ($decoder);
# extracted from the module which reads in files in ASCII|ISO-8859-1|Unicode...
Encode::Guess->add_suspects(qw/latin1/);
$decoder = Encode::Guess->guess ($octets);
my @parts = split (/\s*or/, $decoder);
if (scalar (@parts) > 0) {
    foreach my $dec (@parts) {
        if ($dec =~ /^iso/) {
            print "$dec\n";
            $decoder = Encode::Encoder->new($dec);
            last;
        }
    }
}
else {
    print "$decoder\n";
}
die ($decoder) unless (ref ($decoder));
# here $content is converted to Perl's internal UTF-8 representation
my $content = $decoder->decode($octets);
# now let's have a regular expression check the content
if ($content =~ /\xA9\s*Copyright/) {
    print "+\n";
}
else {
    print ("-\n");
}
# check again with the same regular expression but modifier /i appended
if ($content =~ /\xA9\s*Copyright/i) {
    print "+\n";
}
else {
    print ("-\n");
}
__END__
----------------------------------------------------------------------
S y s K o n n e c t G m b H
A Marvell Company
Siemensstr. 23
D-76275 Ettlingen
----------------------------------------------------------------------
Axel Mock
Software Engineer
phone: +49 7243 502 319
fax: +49 7243 502 931
email:
amock@sysk...
http://www.syskonnect.de
-----------------------------------------------------------------------