devel - Re: [sympa-developpers] Sympatic unicode ?

Subject: Developers of Sympa

From: IKEDA Soji <address@concealed>
To: "Stefan Hornburg (Racke)" <address@concealed>
Cc: address@concealed
Subject: Re: [sympa-developpers] Sympatic unicode ?
Date: Mon, 19 Mar 2018 13:28:46 +0900

On Wed, 14 Mar 2018 18:08:01 +0900
IKEDA Soji <address@concealed> wrote:

> However, I think care must be taken when reading these files. Using
> only the :utf8 (or :encoding) layer does not seem to be a perfect
> solution to me. I'll post about this another day.

Here are the details. Impatient readers can skip to "Suggestions [*]".

First, create a test file:

$ rm -f test
$ printf '\xED\xA0\x80\n' >> test
$ printf '\xF4\x8F\xBF\xBE\n' >> test
$ printf '\xF4\x90\x80\x80\n' >> test
$ printf '\xF8\x88\x80\x80\x80\n' >> test

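This file "test" contains sequences that the Unicode standard
prohibits. In order, they encode a surrogate (U+D800), a
noncharacter (U+10FFFE), and two values beyond U+10FFFF (U+110000
and U+200000).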
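
To make concrete what each of these byte sequences stands for, here is
a minimal sketch that decodes them by plain bit arithmetic, deliberately
without any validation, and labels the result. The helper names
raw_code_point() and classify() are hypothetical and exist only for this
illustration; they are not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: decode one UTF-8-style sequence by bit arithmetic
# only, with no validation at all.
sub raw_code_point {
    my @bytes = unpack 'C*', shift;
    my $lead  = shift @bytes;
    return $lead if $lead < 0x80;    # single byte (ASCII)

    # The number of payload bits in the lead byte depends on how many
    # trailing bytes follow it.
    my $n  = scalar @bytes;
    my $cp = $lead & (0x7F >> ($n + 1));
    $cp = ($cp << 6) | ($_ & 0x3F) for @bytes;
    return $cp;
}

# Hypothetical helper: say why a code point is or is not acceptable.
sub classify {
    my $cp = shift;
    return 'beyond U+10FFFF' if $cp > 0x10FFFF;
    return 'surrogate'       if $cp >= 0xD800 && $cp <= 0xDFFF;
    return 'noncharacter'
        if ($cp & 0xFFFE) == 0xFFFE || ($cp >= 0xFDD0 && $cp <= 0xFDEF);
    return 'ordinary code point';
}

for my $octets (
    "\xED\xA0\x80",     "\xF4\x8F\xBF\xBE",
    "\xF4\x90\x80\x80", "\xF8\x88\x80\x80\x80"
) {
    my $cp = raw_code_point($octets);
    printf "U+%04X: %s\n", $cp, classify($cp);
}
```

The four lines of the test file come out as a surrogate, a
noncharacter, and two values beyond U+10FFFF respectively.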

Running the attached validation_test.pl gives:

-------- 8< ------------ 8< ------------ 8< ------------ 8< --------
$ perl validation_test.pl test
=> :utf8
<= \x{d800}
<= \x{10fffe}
<= \x{110000}
<= \x{200000}
=> :encoding(UTF8)
<= \x{d800}
<= \x{10fffe}
<= \x{110000}
<= \x{200000}
=> :encoding(UTF-8)
utf8 "\xD800" does not map to Unicode at valtest.pl line 19.
utf8 "\x10FFFE" does not map to Unicode at valtest.pl line 19.
utf8 "\x110000" does not map to Unicode at valtest.pl line 19.
utf8 "\x200000" does not map to Unicode at valtest.pl line 19.
<= \x{D800}
<= \x{10FFFE}
<= \x{110000}
<= \x{200000}
=> :encoding(UTF-8-STRICT)
utf8 "\xD800" does not map to Unicode at valtest.pl line 19.
utf8 "\x10FFFE" does not map to Unicode at valtest.pl line 19.
utf8 "\x110000" does not map to Unicode at valtest.pl line 19.
utf8 "\x200000" does not map to Unicode at valtest.pl line 19.
<= \x{D800}
<= \x{10FFFE}
<= \x{110000}
<= \x{200000}
=> :utf8_strict
Can't decode ill-formed UTF-8 octet sequence <ED> at valtest.pl line 19.
-------- >8 ------------ >8 ------------ >8 ------------ >8 --------

The ":utf8" and ":encoding" layers accept ill-formed sequences
(":encoding(UTF-8)" and the like emit warnings, but accept them in
the end). Worse, although these are not proper Unicode characters,
Perl's internals treat them as legitimate characters.

Even if Perl allows them, they can cause serious failures, such as
buffer errors, in external modules or subsystems. That is why input
validation is required.
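
As a rough illustration of such validation outside of any PerlIO layer,
the sketch below uses only the core Encode module with the FB_CROAK
check, so that decoding dies instead of warning. The helper name
decode_strict() is hypothetical. How noncharacters such as U+10FFFE are
treated may depend on the Encode version, so the example does not rely
on that case.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return the decoded text, or undef if the octets
# are not acceptable to Encode's strict "UTF-8" encoding.
sub decode_strict {
    my $octets = shift;
    my $text = eval {
        # FB_CROAK makes decode() die on ill-formed input;
        # LEAVE_SRC keeps decode() from modifying $octets in place.
        Encode::decode('UTF-8', $octets,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;
}

for my $octets ("abc\n", "\xE3\x81\x82\n", "\xED\xA0\x80\n",
                "\xF4\x90\x80\x80\n") {
    printf "%s\n",
        defined(decode_strict($octets)) ? 'accepted' : 'rejected';
}
```

The first two inputs are well-formed and are accepted; the encoded
surrogate and the value beyond U+10FFFF are rejected.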

Suggestions [*]
---------------

Taking validation together with normalization into account, a
function that reads an entire UTF-8 text file (a config file, etc.)
might look like this:

-------- 8< ------------ 8< ------------ 8< ------------ 8< --------
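use strict;
use warnings;
use English qw(-no_match_vars);
use PerlIO::utf8_strict;
use Unicode::Normalize qw();

# Read a whole UTF-8 text file, validate it strictly and return its
# content normalized to NFC. Returns undef on any failure.
sub read_text_file {
    my $path = shift;

    open my $fh, '<:utf8_strict', $path or return undef;

    # Slurp mode; the layer die()s on ill-formed input, hence the eval.
    my $text = eval { local $RS; <$fh> };
    close $fh;
    return undef unless defined $text;

    return Unicode::Normalize::NFC($text);
}

1;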
-------- >8 ------------ >8 ------------ >8 ------------ >8 --------

FYI, the same thing using Path::Tiny:

-------- 8< ------------ 8< ------------ 8< ------------ 8< --------
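use strict;
use warnings;
use Path::Tiny qw();
use Unicode::Normalize qw();

# Same contract as read_text_file() above: undef on failure, NFC text
# on success. Path::Tiny does the opening and slurping.
sub read_text_file {
    my $path = shift;

    # slurp() die()s on I/O errors and on ill-formed input.
    my $text = eval {
        Path::Tiny::path($path)->slurp({binmode => ':unix:utf8_strict'})
        #Path::Tiny::path($path)->slurp_utf8 # with Unicode::UTF8
    };
    return undef unless defined $text;

    return Unicode::Normalize::NFC($text);
}

1;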
-------- >8 ------------ >8 ------------ >8 ------------ >8 --------

Unicode::UTF8 provides the same validation as the :utf8_strict layer,
but the former replaces ill-formed sequences with REPLACEMENT CHARACTER
\x{FFFD} instead of die()-ing. (I feel either is OK.)

Some files are better read line by line (e.g. the dump of list
members created when a list is closed: it can contain millions of
records). In this case, with Path::Tiny, openr() or openr_utf8() may
be used.
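
A minimal sketch of line-by-line reading follows. To stay with core
modules only, it swaps in a raw open() plus Encode's strict decoder
in place of Path::Tiny and the :utf8_strict layer; that substitution is
an assumption of this example. The helper name each_text_line() and the
callback interface are hypothetical.

```perl
use strict;
use warnings;
use Encode ();
use Unicode::Normalize ();

# Hypothetical helper: call $callback->($line) for each line of $path,
# decoded strictly and normalized to NFC. Dies on ill-formed input.
sub each_text_line {
    my ($path, $callback) = @_;

    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    while (my $octets = <$fh>) {
        # Splitting raw octets at "\n" is safe: the byte 0x0A never
        # occurs inside a multi-byte UTF-8 sequence.
        my $line = Encode::decode('UTF-8', $octets, Encode::FB_CROAK());
        $callback->(Unicode::Normalize::NFC($line));
    }
    close $fh;
    return 1;
}
```

Only one line is held in memory at a time. Normalizing each line
separately should give the same result as normalizing the whole file,
because the line terminator does not combine with any neighbouring
character.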

----
N.B.:

- The reason I didn't use slurp_utf8() above is that, if both
Unicode::UTF8 and PerlIO::utf8_strict are available, Path::Tiny
prefers the former. This is an example of the clumsiness of
all-in-one packages.

- I couldn't find a PerlIO layer that performs normalization.
In theory, Unicode normalization requires a buffer of unbounded
length.
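
The buffering problem can be seen with a small sketch using the core
Unicode::Normalize module. The helper names nfc_per_chunk() and
nfc_whole() are hypothetical. A base character and its combining mark
are split across two chunks, as could happen at a read-buffer boundary.

```perl
use strict;
use warnings;
use Unicode::Normalize ();

# Normalize each chunk on its own and concatenate (naive streaming).
sub nfc_per_chunk {
    return join '', map { Unicode::Normalize::NFC($_) } @_;
}

# Normalize the concatenation of all chunks (everything is buffered).
sub nfc_whole {
    return Unicode::Normalize::NFC(join '', @_);
}

# "e" followed by COMBINING ACUTE ACCENT, split across two chunks.
my @chunks = ("cafe", "\x{301}\n");

printf "per chunk: %d characters\n", length(nfc_per_chunk(@chunks));
printf "whole:     %d characters\n", length(nfc_whole(@chunks));
```

The two results differ: only the whole-text version composes "e" and
the accent into a single character. Since a base character may be
followed by an arbitrarily long run of combining marks, a streaming
layer would have to hold back output until the next base character
arrives.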

Regards,
-- Soji

--
Conversion Co., Ltd.
IT Solution Department, System Solution Group 1, IKEDA Soji
Access Oimachi Bldg. 4F, 1-49-15 Oi, Shinagawa-ku, Tokyo 140-0014, Japan
e-mail address@concealed TEL 03-6429-2880
https://www.conversion.co.jp/

Attachment: validation_test.pl

use strict;
use warnings;
use Encode qw(encode FB_PERLQQ);
use feature qw(say);

foreach my $layer (
    qw(:utf8
       :encoding(UTF8)
       :encoding(UTF-8)
       :encoding(UTF-8-STRICT)
       :utf8_strict)
) {
    say '=> ', $layer;

    my $fh;
    open $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    binmode $fh, $layer;

    while (<$fh>) {
        print '<= ', encode("ascii", $_, FB_PERLQQ);
    }

    close $fh;
}
