devel - [sympa-developpers] Plan to support RFC 6783

From: IKEDA Soji <address@concealed>
To: address@concealed
Subject: [sympa-developpers] Plan to support RFC 6783
Date: Wed, 7 Oct 2015 17:32:29 +0900

Hi developers,

Though I will focus on bug fixes and additional refactoring in the
6.2 branch for a while longer, I wish to share what I have
investigated for the next major release. Comments and suggestions
are appreciated.

Below is a response to the 7.0 roadmap:
<http://www.sympa.org/dev/project_direction>

-------- 8< ------------ 8< ------------ 8< ------------ 8< --------

========================
Plan to support RFC 6783
========================

Table of Contents:

Extending Unicode repertoire used in identifiers
    Restrictions to listname
    Normalizing users' e-mails
    Domain labels
    Filesystem encoding
    Internal encoding
Extending format of messages
    Internationalized message types
    Detecting internationalized messages
Downgrading internationalized messages

Extending Unicode repertoire used in identifiers
==============================================

Restrictions to listname
------------------------

The listname is used in the local part of list addresses. EAI does
not seem to put strict restrictions on the repertoire used for the
local part, except that it must form valid UTF-8 sequences. However,
it is reasonable to start with a repertoire as small as possible.

In addition, since the listname may be used in the "List-Id:" field
(see RFC 2919), it should be suitable for the domain name system
(see section 2).

All things considered, it makes sense that listnames including
characters beyond US-ASCII should be restricted by the measures
adopted by IDNA (see RFC 5892 and RFC 5893). Consequently,

- Uppercase letters are not allowed (even if they are non-Latin
letters). Listnames must be normalized to lowercase.

Problem: Unicode case folding is in general not a round-trip
conversion and is locale-dependent. For example, the lowercase of
Turkish "I" is dotless "ı", while the uppercase of "i" is dotted
"İ"; the lowercase of "SS" in German can be either "ss" or Eszett
"ß".

To solve this, the method of case normalization should be
customized by language context.

- Listnames should be normalized by Normalization Form KC (NFKC;
see UAX #15).

Problem: This means that, for example, the Angstrom sign and the
Ohm sign will be altered (once lowercased as well) to "a" with ring
accent and the Greek letter omega "ω", respectively; many
compatibility variants of Chinese ideographs may be altered to
their singleton decompositions; conversely, Greek final sigma "ς"
will _not_ be altered to medial sigma "σ".

There seems to be no solution to this.

- Listnames should conform to some "contextual rules". For example,
RTL letters and most LTR letters must not co-exist in a single
listname.

In addition, it would be better to reject listnames that mix
confusable characters (e.g. Latin "a" and Cyrillic "а"). Some
measures such as Unicode Security Mechanisms (see UTS #39) can be
taken.
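
The normalization problems listed above can be reproduced with a
short sketch. It is written in Python purely for illustration (Sympa
itself is Perl); the helper name normalize_listname is hypothetical,
and str.lower() stands in for whatever language-aware case mapping
would eventually be chosen.

```python
import unicodedata

def normalize_listname(name: str) -> str:
    """Hypothetical helper: NFKC first, then locale-independent lowercase."""
    return unicodedata.normalize("NFKC", name).lower()

# Angstrom sign (U+212B) and Ohm sign (U+2126) lose their identity.
assert normalize_listname("\u212b") == "\u00e5"   # "a" with ring above
assert normalize_listname("\u2126") == "\u03c9"   # Greek small omega

# Final sigma is left alone: it does not become medial sigma.
assert normalize_listname("\u03c2") == "\u03c2"

# Case mapping is not a round trip: Eszett uppercases to "SS",
# which then lowercases to "ss".
assert "\u00df".upper().lower() == "ss"

# Without locale knowledge, Turkish dotted capital I (U+0130) becomes
# "i" followed by COMBINING DOT ABOVE, and plain "I" becomes dotted "i".
assert "\u0130".lower() == "i\u0307"
assert "I".lower() == "i"
```

A Turkish-aware mapping would instead have to turn "I" into dotless
"ı", which is why the language context matters.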

Notes:

* The restrictions and normalizations described above may be
enforced when a new listname is submitted. Existing listnames
(e.g. those migrated from other systems) should be checked by
looser rules.

* Legacy (ASCII-only) listnames should follow the restrictions
used by the current version of Sympa.
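
As a rough illustration of a check applied only when a new listname
is submitted, the Python sketch below rejects names that mix text
directions or scripts. All function names are hypothetical. The
script guess (first word of the Unicode character name) is far
cruder than UTS #39 and would, for instance, wrongly reject a
Japanese name mixing kanji and kana; the direction test is a
simplification of the Bidi rule in RFC 5893.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}   # right-to-left letters (Hebrew, Arabic, ...)
LTR_CLASSES = {"L"}         # left-to-right letters

def mixes_directions(name: str) -> bool:
    """True if both RTL and LTR letters occur in the name."""
    classes = {unicodedata.bidirectional(ch) for ch in name}
    return bool(classes & RTL_CLASSES) and bool(classes & LTR_CLASSES)

def scripts_of(name: str) -> set:
    """Crude script guess: first word of each letter's Unicode name."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in name if ch.isalpha()}

def acceptable_new_listname(name: str) -> bool:
    """Strict check for newly submitted names (assumed policy)."""
    if name.isascii():
        # Legacy names: left to the rules of the current Sympa release.
        return True
    return not mixes_directions(name) and len(scripts_of(name)) <= 1
```

With this sketch, "p" + Cyrillic "а" + "ypal" is rejected because two
scripts are present, while an all-Hebrew name is accepted.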

Normalizing users' e-mails
--------------------------

Policies for normalizing e-mail addresses vary by site. So Sympa
herself should not normalize users' e-mails, except for the
traditional case-folding of ASCII letters (tr/A-Z/a-z/). Use of lc()
or fc() in Unicode context is discouraged.
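
The effect of tr/A-Z/a-z/ can be shown with a Python equivalent (a
sketch only; the name fold_ascii is hypothetical). Only the 26 ASCII
capitals are mapped, so no locale-dependent mapping is ever applied.

```python
import string

# Translation table covering A-Z only, like Perl's tr/A-Z/a-z/.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)

def fold_ascii(address: str) -> str:
    """Lowercase A-Z only; leave every other character untouched."""
    return address.translate(_ASCII_LOWER)

# ASCII letters are folded as before.
assert fold_ascii("User@Example.ORG") == "user@example.org"
# Turkish dotted capital I (U+0130) is preserved, not guessed at.
assert fold_ascii("\u0130KEDA@example.org") == "\u0130keda@example.org"
```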

Domain labels
-------------

Since SMTPUTF8 uses raw UTF-8 strings for local parts and domain
labels, Sympa should also read them as raw UTF-8 strings, and should
store them in the database and elsewhere as raw UTF-8 strings.

There are a few exceptions:

- Since "List-*:" fields must have US-ASCII string values (see
RFC 6783, section 3.1), UTF-8 listnames and domains would be
encoded by %XX-encoding and/or Punycode (see RFC 3492).

N.B. "List-Id:" field value would be entirely encoded using
Punycode. See also the first section.

- Service URLs in the body of auto-generated messages would be shown
in both raw UTF-8 and Punycode-encoded forms.

This is because the encoded form is hard for humans to read, while
not all MUAs can detect URLs in raw UTF-8 form.

- Names of files and directories: See the next section.
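
The two ASCII encodings mentioned in the exceptions above can be
sketched in Python as follows. The function names are hypothetical,
and Python's built-in "idna" codec follows the older IDNA 2003
rules, so it only approximates what an IDNA 2008 implementation
would produce.

```python
from urllib.parse import quote

def ascii_domain(domain: str) -> str:
    """Encode each non-ASCII label as a Punycode A-label ("xn--...")."""
    return domain.encode("idna").decode("ascii")

def ascii_local_part(local: str) -> str:
    """%XX-encode the UTF-8 octets of a local part."""
    return quote(local, safe="")

assert ascii_domain("b\u00fccher.example") == "xn--bcher-kva.example"
assert ascii_local_part("b\u00fccher") == "b%C3%BCcher"
```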

Filesystem encoding
-------------------

File paths may contain listnames, domain names and e-mails.

On *nix in general, any bytes except "\0" and "/" are allowed in
path names. However, NFSv4 requires valid UTF-8 sequences.
Moreover, HFS+ requires a modified Normalization Form D (NFD), and
this requirement conflicts with the NFKC requirement described in
the first section.

Therefore, such strings in file paths should be encoded using an
appropriate method.

Since QUOTED-PRINTABLE is used by the current release of Sympa to
encode filenames of shared documents, listnames, domain names
and e-mails may also be encoded using it.

N.B. In the current release of Sympa, a function escape_chars() is
used to escape e-mails in file paths. Such escaped paths would be
replaced with QP-encoded path names.

A possible problem is that QP can generate very long path names:
Logically, the maximum expansion rate is 12 octets per character
(four UTF-8 octets, each encoded as "=XX").
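
The expansion rate can be checked with a small Python sketch of
QP-style escaping for one path component. The function name and the
set of octets left unescaped are assumptions made for illustration,
not the rules Sympa uses.

```python
# Octets kept as-is (an assumed safe set, not Sympa's actual one).
SAFE = frozenset(
    b"abcdefghijklmnopqrstuvwxyz"
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    b"0123456789-_.@"
)

def qp_path_component(text: str) -> str:
    """Escape every octet outside SAFE as "=XX" (uppercase hex)."""
    out = []
    for octet in text.encode("utf-8"):
        if octet in SAFE:
            out.append(chr(octet))
        else:
            out.append("=%02X" % octet)
    return "".join(out)

# A character outside the BMP takes four UTF-8 octets, hence 12 octets.
assert qp_path_component("\U0001F600") == "=F0=9F=98=80"
assert len(qp_path_component("\U0001F600")) == 12
# "/" and "=" are escaped too, so the result is one safe component.
assert qp_path_component("a/b=c") == "a=2Fb=3Dc"
```

With a typical limit of 255 octets per file name, a name made only
of such characters would be limited to about 21 characters.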

Internal encoding
-----------------

Several paths in processing are not Unicode-aware, for example the
handling of MIME entities.

At the source level, only byte strings ("utf8 flag" off) should be
used, to avoid mixing byte strings and Unicode strings ("utf8 flag"
on). Processing specific to Unicode strings should be localised as
much as possible.
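
A loose analogue in Python, where bytes and str play the roles of
byte strings and Unicode strings, may illustrate the idea of
localising Unicode processing. The helper nfkc_bytes is
hypothetical: callers pass and receive bytes only, and the decoded
string never leaves the function.

```python
import unicodedata

def nfkc_bytes(raw: bytes) -> bytes:
    """Decode, apply the Unicode-specific step, and encode again."""
    text = raw.decode("utf-8")            # Unicode string exists only here
    return unicodedata.normalize("NFKC", text).encode("utf-8")

# Angstrom sign (E2 84 AB) comes back as "A" with ring above (C3 85).
assert nfkc_bytes(b"\xe2\x84\xab") == b"\xc3\x85"
```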

Extending format of messages
============================

Internationalized message types
-------------------------------

Some legacy MIME types have "internationalized" counterparts:

message/rfc822                    - message/global
message/delivery-status           - message/global-delivery-status
message/disposition-notification  - message/global-disposition-notification
text/rfc822-headers               - message/global-headers

Sympa should handle these counterparts along with legacy ones.

For example, MIME-tools currently doesn't extract a message/global
part, even if extract_nested_messages() is true (see also CPAN RT
#106911). Sympa has to extract it by herself, if necessary.
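
One way to treat the counterparts alike is to map each
internationalized type back to its legacy type before dispatching.
The Python sketch below only illustrates that lookup; the names are
hypothetical and nothing here reflects MIME-tools' API.

```python
# Legacy type -> internationalized counterpart, as listed above.
GLOBAL_COUNTERPART = {
    "message/rfc822": "message/global",
    "message/delivery-status": "message/global-delivery-status",
    "message/disposition-notification":
        "message/global-disposition-notification",
    "text/rfc822-headers": "message/global-headers",
}
LEGACY_COUNTERPART = {v: k for k, v in GLOBAL_COUNTERPART.items()}

def legacy_type(content_type: str) -> str:
    """Map an internationalized type to its legacy one; pass others through."""
    ct = content_type.strip().lower()
    return LEGACY_COUNTERPART.get(ct, ct)

def is_nested_message(content_type: str) -> bool:
    """True for both message/rfc822 and message/global."""
    return legacy_type(content_type) == "message/rfc822"
```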

Detecting internationalized messages
------------------------------------

There is no standard method to know whether an incoming message
is an internationalized version (message/global) or a legacy version
(message/rfc822). Sympa has to detect internationalized messages
by examining the messages themselves.

As an expedient, the following criterion may be used to detect
message types:

- If the header of the top level of MIME structure, and/or envelope
sender ("Return-Path:" field), contain non-ASCII UTF-8 sequences,
the message is message/global. Otherwise it is message/rfc822.

Note that non-ASCII bytes that do not form valid UTF-8 sequences
must not be taken as a sign of an internationalized message (such
messages do not conform to MIME anyway).
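
The criterion above might be sketched in Python as follows. The
function names are hypothetical, and the top-level header and the
envelope sender are assumed to be available as raw bytes.

```python
def has_utf8_non_ascii(data: bytes) -> bool:
    """True only if data has non-ASCII octets AND is valid UTF-8."""
    if data.isascii():
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False      # stray 8-bit octets: not a sign of EAI
    return True

def guess_message_type(top_header: bytes, envelope_sender: bytes) -> str:
    """Apply the expedient criterion to one incoming message."""
    if has_utf8_non_ascii(top_header) or has_utf8_non_ascii(envelope_sender):
        return "message/global"
    return "message/rfc822"
```

Note that in this sketch a header containing even one invalid
sequence is treated as legacy as a whole.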

Downgrading internationalized messages
======================================

For the time being, lists will have a mixture of SMTPUTF8-capable
and incapable subscribers (see also RFC 6783, section 2.2).
Internationalized messages should be downgraded for incapable
members (and incapable MTAs).

When _one or both of_ the following conditions are met, outgoing
internationalized messages (see previous section) should be
downgraded:

- Sympa does _not_ enable outbound SMTPUTF8 support (because
outbound MTA does not support it).

- The subscriber's e-mail does _not_ contain non-ASCII characters
_and_ the subscriber has _not_ enabled her SMTPUTF8 preference.
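
The two conditions combine into a single predicate, sketched here in
Python. The parameter names are hypothetical stand-ins for the
site-wide outbound setting and the per-subscriber preference.

```python
def needs_downgrade(outbound_smtputf8: bool,
                    subscriber_email: str,
                    subscriber_wants_smtputf8: bool) -> bool:
    """True if an internationalized message must be downgraded."""
    if not outbound_smtputf8:
        # Condition 1: the outbound MTA cannot carry SMTPUTF8 at all.
        return True
    # Condition 2: ASCII-only address and no explicit opt-in.
    return subscriber_email.isascii() and not subscriber_wants_smtputf8
```

Under this reading, a subscriber whose own address contains
non-ASCII characters is assumed to be SMTPUTF8-capable.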

The downgrading process may modify only the header fields at the top
level of the MIME structure. The body (including possible subparts)
should not be modified. The following modifications are possible.

- Non-ASCII e-mail addresses in the top-level header should be
replaced by some kind of filler.

Note: This method is information-lossy.

- Other non-ASCII strings in the top-level header should be encoded
using the MIME header encoding scheme.

- If the "Content-Transfer-Encoding:" field is not found, one with
the "8bit" value should be added.
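
The three modifications might look like the following Python sketch
built on the standard email package. The filler address and all
function names are hypothetical, and addresses are assumed to be
already parsed into (display name, address) pairs.

```python
from email.header import Header
from email.utils import formataddr

FILLER = "invalid@invalid.invalid"   # hypothetical placeholder address

def downgrade_address(display_name: str, address: str) -> str:
    """Replace a non-ASCII address by the filler (information-lossy)."""
    if not address.isascii():
        address = FILLER
    # A non-ASCII display name is turned into an encoded-word.
    return formataddr((display_name, address), charset="utf-8")

def downgrade_text(value: str) -> str:
    """Encode a non-ASCII unstructured value as encoded-word(s)."""
    if value.isascii():
        return value
    return Header(value, "utf-8").encode()

def ensure_cte(fields: list) -> list:
    """Add "Content-Transfer-Encoding: 8bit" if the field is absent."""
    names = {name.lower() for name, _ in fields}
    if "content-transfer-encoding" not in names:
        return fields + [("Content-Transfer-Encoding", "8bit")]
    return fields
```

Only the top-level header fields would be passed through these
helpers; the body and its subparts stay untouched.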

-------- >8 ------------ >8 ------------ >8 ------------ >8 --------

Regards,

-- Soji

--
Conversion Co., Ltd., Security & OSS Solutions Dept., IKEDA Soji
Access Oimachi Bldg. 4F, 1-49-15 Oi, Shinagawa-ku, Tokyo 140-0014
e-mail address@concealed TEL 03-6429-2880
http://www.conversion.co.jp/
