Working with grapheme clusters

Bjoern Hoehrmann derhoermi at gmx.net
Sat Oct 26 08:09:00 PDT 2013


* Claude Pache wrote:
>You might know that the following ES expressions are broken:
>
>	text.charAt(0) // get the first character of the text
>	text.length > 100 ? text.substring(0,100) + '...' : text // cut the text after 100 characters
>
>The reason is *not* because ES works with UTF-16 code units instead of 
>Unicode code points (it's just a red herring!), but because _graphemes_ 
>(that is, what a human perceives as a "character") may span multiple 
>code units and/or code points.

The example is deceptively simple. Truncating a string is a hard problem,
and a high-quality implementation would probably be language-specific to
avoid problematic truncations, such as when a suffix changes the meaning
of a prefix. It would also take special characters into account: you do
not want the last character before the "..." to be an open quote mark,
and if the string is 101 characters ending in "...", turning that into a
string of 103 characters ending in "....." would also be silly.
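
To make the last two points concrete, here is a minimal sketch of a
truncation helper that at least avoids those two problems. The function
name, the limit handling and the set of "opening quote" characters are
my assumptions for illustration. The quote set is tailored to English:
U+201C, for instance, is a closing quote in German, which is part of
why this is language-specific. The sketch still counts 16-bit units and
does nothing about the meaning of the remaining text.

```javascript
// Hypothetical helper, not a recommended API. All names are assumptions.
const ELLIPSIS = "...";

// Assumed set of opening quotes for English text only; other languages
// use some of these characters as closing quotes.
const OPENING_QUOTES = new Set(["\u201C", "\u2018", "\u00AB"]);

function truncateWithEllipsis(text, limit) {
  // Short enough already: nothing to do.
  if (text.length <= limit) return text;

  let cut = text.substring(0, limit);

  // Drop trailing dots, whitespace and opening quotes so the result
  // ends in neither "....." nor a dangling open quote mark.
  while (cut.length > 0) {
    const last = cut[cut.length - 1];
    if (last === "." || /\s/.test(last) || OPENING_QUOTES.has(last)) {
      cut = cut.slice(0, -1);
    } else {
      break;
    }
  }

  const result = cut + ELLIPSIS;

  // If truncating does not make the string shorter, keep the original.
  return result.length < text.length ? result : text;
}
```

With a limit of 100, a 101-unit string that already ends in "..." comes
back unchanged instead of growing to 103 units.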

Another issue that is often ignored is that you might want to use the
truncated text in combination with other text, say in an HTML document
with a "more" or "permalink" or some such link after it. Something like

  <p>ABC &#x202E; DEF &#x202C; GHI <a href='...'>more</a></p>
  <p>ABC &#x202E; DEF ...          <a href='...'>more</a></p>

The second paragraph will render "ABC erom ... FED" because the control
character that restores the bidirectional text state got lost when the
string was truncated. These are all issues that counting graphemes
instead of 16-bit units does not address, and it is not clear to me that
it would actually be an improvement.
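
For the bidirectional case, a rough sketch of a repair step might look
like the following. It scans a truncated fragment for explicit
directional formatting characters and appends whatever closing
characters are still outstanding. The function name is hypothetical,
and the sketch assumes the fragment is a single paragraph and that
appending control characters is acceptable in the target context. It
only restores the bidirectional state and fixes none of the other
truncation problems.

```javascript
// Explicit directional formatting characters from the Unicode
// Bidirectional Algorithm (UAX #9).
const EMBEDDING_OPENERS = new Set(["\u202A", "\u202B", "\u202D", "\u202E"]); // LRE, RLE, LRO, RLO
const PDF = "\u202C"; // POP DIRECTIONAL FORMATTING
const ISOLATE_OPENERS = new Set(["\u2066", "\u2067", "\u2068"]); // LRI, RLI, FSI
const PDI = "\u2069"; // POP DIRECTIONAL ISOLATE

// Hypothetical helper: append the closers that a truncated fragment lacks.
function closeBidiControls(fragment) {
  const pending = []; // closers still owed, innermost last

  for (const ch of fragment) {
    if (EMBEDDING_OPENERS.has(ch)) {
      pending.push(PDF);
    } else if (ISOLATE_OPENERS.has(ch)) {
      pending.push(PDI);
    } else if (ch === PDF) {
      // PDF only closes an embedding or override, never an isolate.
      if (pending[pending.length - 1] === PDF) pending.pop();
    } else if (ch === PDI) {
      // PDI closes the nearest isolate and any embeddings opened inside it.
      const index = pending.lastIndexOf(PDI);
      if (index !== -1) pending.length = index;
    }
  }

  // Close the innermost scope first.
  return fragment + pending.reverse().join("");
}
```

Applied to "ABC \u202E DEF ..." this appends a single U+202C, so text
that follows the fragment is no longer caught by the override.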

"User-perceived character" is not an intuitive notion, especially once
you leave the realm of letters from a familiar script. In a string that
contains 1 user-perceived character, what is the maximum number of zero-
width spaces in that string? The maximum number of scalar values? What
is the maximum width and maximum height of such a string when rendered,
the maximum number of UTF-8 bytes needed to encode such a string? Should
one perceive a horizontal ellipsis as three characters, or is it just
one? How many are two thin spaces?
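
Some of these questions can be poked at mechanically. The sketch below
assumes a runtime that provides Intl.Segmenter, a part of the ECMAScript
Internationalization API that postdates this message, and uses its
default (untailored) grapheme segmentation. The helper name is
hypothetical. It shows that one cluster can hold as many scalar values
and UTF-8 bytes as you care to stack onto it.

```javascript
// Assumes Intl.Segmenter and TextEncoder are available in the runtime.
const graphemeSegmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });

// Hypothetical helper: count extended grapheme clusters in a string.
function countGraphemes(text) {
  return [...graphemeSegmenter.segment(text)].length;
}

// One base letter followed by 50 combining acute accents (U+0301).
const stacked = "a" + "\u0301".repeat(50);

console.log(countGraphemes(stacked));                   // 1 cluster
console.log([...stacked].length);                       // 51 scalar values
console.log(new TextEncoder().encode(stacked).length);  // 101 UTF-8 bytes

console.log(countGraphemes("\u2026"));        // horizontal ellipsis: 1
console.log(countGraphemes("..."));           // three full stops: 3
console.log(countGraphemes("\u2009\u2009"));  // two thin spaces: 2
```

Replacing 50 with a larger number keeps the cluster count at 1, so the
segmentation rules give no upper bound on scalar values or bytes.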

My smartphone comes with a "News" application that displays the latest
headlines from various news sources and links to corresponding articles.
If you use it for a day or two you will notice that it is not of German
design, but designed for a language that uses fewer or narrower
"grapheme clusters per unit of information", if you will. Many of the
headlines do not convey what the article might be about. A current
example is 'Codename "Lustre" - Frankreich liefert', which is roughly
'code name "lustre" - France supplies' ... what? What does France
supply? Or "Dortmund droht historische Pleite im", roughly "Dortmund
faces historic ... in", where "Pleite" could be "bankruptcy", "defeat",
"failure", ... could be sport, could be finance, can't tell.

That makes the application rather frustrating to use with German news. I
imagine it works better with English headlines, which tend to use fewer
grapheme clusters. So truncating news headlines after a certain number
of grapheme clusters, untailored to the specific script and language, is
not the "right" design choice. Actually, it might be truncated by pixel
measures because there is a visual space to fit, but English and German
are very similar in their pixels per grapheme cluster metrics...

So it seems rather unlikely for someone to say "so we need the first 100
extended grapheme clusters as defined in UAX #29 of the string" and then
someone responding "yes, that is clearly the right solution".
-- 
Björn Höhrmann · mailto:bjoern at hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 