devel - Re: [sympa-developpers] Sympatic unicode ?

Subject: Developers of Sympa

  • From: IKEDA Soji <address@concealed>
  • To: "Stefan Hornburg (Racke)" <address@concealed>
  • Cc: address@concealed
  • Subject: Re: [sympa-developpers] Sympatic unicode ?
  • Date: Fri, 9 Mar 2018 15:20:56 +0900

On Thu, 8 Mar 2018 15:54:25 +0100
"Stefan Hornburg (Racke)" <address@concealed> wrote:

> On 03/08/2018 10:35 AM, IKEDA Soji wrote:
> > On Fri, 2 Mar 2018 18:56:24 +0900
> > IKEDA Soji <address@concealed> wrote:
> >
> >> A second important point is that **text data is not unique**.
> >>
> >> Text data should be normalized and (if necessary) case-folded
> >> as soon as we receive it.
> >>
> >> - Unicode defines more than one normalization form (e.g. NFC and
> >> NFD), so we should normalize text data to a single form.
> >
> > Example: when we run attached testutf.pl,
> >
> > On xfs, ext4, NFS4, CIFS etc.:
> >
> > $ perl testutf.pl
> > => B\x{00e2}le
> > <= B\x{00e2}le
> > => \x{0130}stanbul
> > <= \x{0130}stanbul
> > => Ph\x{00fa} Qu\x{1ed1}c
> > <= Ph\x{00fa} Qu\x{1ed1}c
> >
> > On HFS+:
> >
> > $ perl testutf.pl
> > => B\x{00e2}le
> > <= Ba\x{0302}le
> > => \x{0130}stanbul
> > <= I\x{0307}stanbul
> > => Ph\x{00fa} Qu\x{1ed1}c
> > <= Phu\x{0301} Quo\x{0302}\x{0301}c
> >
> > HFS+ (macOS) allows UTF-8 pathnames, but stores them in a kind of
> > decomposed normalization form. Thus, even if the filesystem supports
> > Unicode, comparing a pathname held in memory with the one returned
> > by the filesystem may not always succeed.
> >
> >
> > There are probably similar cases with databases.
>
> There is certainly a difference with databases, because with databases you
> can specify the encoding you want (e.g. xxx_enable_utf8 flags in DBI/DBD).

Encoding does not matter in the example above.

With Unicode, validation and/or normalization of text data can be
performed by various subsystems. I presented one example that I know of.

> And I'm not sure whether your script does the right thing.

That script uses utf8::all, creates a path and reads the directory entry.
readdir() certainly returns a Unicode string, but since it returns
what the filesystem holds, the results are affected by normalization.
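
A minimal sketch of the mismatch, written in Python rather than Perl purely for illustration. The strings are copied from the test output quoted above; the helper name same_name is hypothetical and not part of Sympa. It assumes that normalizing both sides to NFC before comparing is acceptable for the use case.

```python
import unicodedata

# Names as the script created them (precomposed) and as HFS+ returned
# them (decomposed), copied from the test output quoted above.
created = ["B\u00e2le", "\u0130stanbul", "Ph\u00fa Qu\u1ed1c"]
returned = ["Ba\u0302le", "I\u0307stanbul", "Phu\u0301 Quo\u0302\u0301c"]


def same_name(a: str, b: str) -> bool:
    """Hypothetical helper: compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Plain comparison fails for every pair, although the names look identical.
raw_equal = [a == b for a, b in zip(created, returned)]

# Comparison after normalization succeeds for every pair.
normalized_equal = [same_name(a, b) for a, b in zip(created, returned)]
```

Both strings in each pair are valid Unicode in the same encoding, which is why the choice of encoding alone cannot resolve the mismatch.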

If there is a right thing to do, it is to prevent the filesystem from
affecting the result. For example, we can "escape" non-ASCII characters
in path names, as the current code does.
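
As a rough illustration of the escaping idea (again in Python, and not the scheme Sympa actually uses), the sketch below percent-encodes the UTF-8 bytes of a name so that only ASCII reaches the filesystem. The helpers escape_name and unescape_name are hypothetical names introduced here.

```python
from urllib.parse import quote, unquote


def escape_name(name: str) -> str:
    """Hypothetical helper: percent-encode everything outside unreserved ASCII."""
    return quote(name, safe="")


def unescape_name(escaped: str) -> str:
    """Hypothetical helper: reverse escape_name()."""
    return unquote(escaped)


escaped = escape_name("B\u00e2le")      # ASCII-only name stored on disk
restored = unescape_name(escaped)       # original name recovered on read
```

Since an ASCII-only name is unchanged under any Unicode normalization form, the filesystem cannot rewrite it. The precomposed and decomposed spellings of a name still escape to different strings, so the text would need to be normalized before escaping.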


Regards,
-- Soji

> Regards
> Racke
>
> >
> >
> > Regards,
> > -- Soji
> >
> >
> >> Regards,
> >> -- Soji
> >>
> >>
> >> On 2018/02/27 18:12, Marc Chantreux <address@concealed> wrote:
> >>
> >>> hello people,
> >>>
> >>> i really think Sympatic should use
> >>>
> >>> use utf8::all;
> >>>
> >>> or at least
> >>>
> >>> use utf8;
> >>> use open qw< :encoding(UTF-8) :std >;
> >>>
> >>> what is your opinion about it ?
> >>>
> >>> regards,
> >>> marc
> >>>
> >
> >
>
>
> --
> Ecommerce and Linux consulting + Perl and web application programming.
> Debian and Sympa administration. Provisioning with Ansible.


--
Conversion Co., Ltd.
IT Solutions Department, System Solutions Group 1, IKEDA Soji
Access Oimachi Bldg. 4F, 1-49-15 Oi, Shinagawa-ku, Tokyo 140-0014
e-mail address@concealed TEL 03-6429-2880
https://www.conversion.co.jp/


