Internationalization: Comments on Text Segmentation straw man
ecmascript at lindenbergsoftware.com
Thu Apr 18 17:03:35 PDT 2013
In preparation for tomorrow's internationalization ad-hoc meeting, I reviewed the
Text Segmentation strawman:
Some issues go beyond internationalization into general API design questions; I'd appreciate input from the es-discuss crowd on those.
1) Text segmentation often works on small portions of potentially large documents. For example, when the user clicks or taps on a word, an application may need to find the word being selected. Only the text under the click/tap location is of interest, but that text may be part of a large DOM tree (not necessarily all within one node!). Using String values as input means that the application has to create a string including enough context so that the complete desired segment and some text outside the segment is guaranteed to be included - typically a paragraph. Some libraries, including ICU, accept alternative input types: iterators that can move forward and (unlike ES6 iterators) backward over strings, or generic interfaces that provide access to text. Should our text segmenters allow such alternative input types, or is it reasonable to expect callers to construct String values?
2) The strawman includes an extension to String.prototype.split, allowing it to accept a TextSegmenter as the separator argument. The only way to detect whether this argument will be understood would be indirect, by checking whether Intl.TextSegmenter exists. Is that acceptable?
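To make the detection concern in item 2 concrete, here is a minimal sketch. Intl.TextSegmenter is the strawman's proposed name, and the constructor arguments shown are my assumption, not something any engine is known to ship. The helper names are hypothetical. The second function shows why the check has to be indirect: an engine that does not recognize the separator object simply converts it to a string and returns the input unsplit, without any error.

```javascript
// Indirect feature test: we can only ask whether the constructor exists.
// Intl.TextSegmenter is the strawman's name; it is hypothetical here.
function hasTextSegmenter() {
  return typeof Intl === "object" && Intl !== null &&
         typeof Intl.TextSegmenter === "function";
}

// What an engine without segmenter support does with an unrecognized
// separator object: it is converted to the string "[object Object]",
// which does not occur in the text, so the whole text comes back as a
// single element and no exception is thrown.
function splitWithUnknownSeparator(text) {
  const notASegmenter = {}; // stands in for a segmenter object
  return text.split(notASegmenter);
}

// Caller-side pattern: use the segmenter if present, else fall back.
function splitWords(text, locale) {
  if (hasTextSegmenter()) {
    // Assumed constructor signature (locale, options); not confirmed by the strawman.
    return text.split(new Intl.TextSegmenter(locale, { type: "word" }));
  }
  // Crude fallback that only understands whitespace.
  return text.split(/\s+/).filter((piece) => piece.length > 0);
}
```

Calling splitWithUnknownSeparator("a b") returns a one-element array containing "a b", which is indistinguishable from a legitimate result for a text with a single segment.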
Other comments are mostly for the ad-hoc team:
4) I think we should drop paragraph breaking. Paragraphs are usually defined by the document type - in plain text possibly any CR/LF combination, or two consecutive such combinations, or U+2029; in HTML the text within a <p> element and various other entities; etc. ES text segmenters shouldn't have to know document types.
5) Do we expect tailored grapheme clusters to be supported and commonly used? It seems to me that default grapheme clusters should be handled in regular expressions, not in this special-purpose API.
6) Line breaks, on the other hand, should be provided.
7) Is numSegments() really needed? If so, it should be countSegments() or such to indicate that it's a rather expensive operation.
8) We need to define more precisely what is meant by the different segment types. For example, the notes on segmentType indicate that whitespace and punctuation are returned as separate words - is that generally accepted? Should sequences of punctuation or whitespace be treated as one word or multiple? Will line breaking report U+00AD as a breaking opportunity? Is it safe to assume that in the absence of U+00AD the segments reported are appropriate as input to a separate hyphenation engine, or do such engines need more context? Normative references to Unicode Standard Annexes 14 (line breaking) and 29 (text segmentation) might help.
9) Why isn't segmentType just a property of the object returned by segmentContaining?
10) Should positions be based on UTF-16 code units or Unicode code points? Code points seem more logical, but make it more difficult to map to underlying strings.
11) Should the end property of the object returned by segmentContaining be the index of the last character/code unit included in the segment, or the index after that character/code unit? The latter would be more compatible with String.prototype.substring.
12) If the second edition of ECMA-402 is based on ES6, this API should provide iterators:
13) "Anything else defaults to “word”." should be "Anything else results in a RangeError exception."
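To illustrate the trade-offs in items 10 and 11, here is a small sketch. The function names and the { start, end } record shape are hypothetical and only stand in for whatever segmentContaining would return. It assumes ES6-style string iteration, which walks a string by code points. The first function shows the extra mapping step callers need if positions are reported in code points; the second shows that an exclusive end index can be passed to String.prototype.substring unchanged.

```javascript
// Map a code point index to the corresponding UTF-16 code unit index.
// This is the conversion callers would need if the API reported code points.
function codePointIndexToCodeUnitIndex(text, codePointIndex) {
  let units = 0;
  let points = 0;
  for (const ch of text) { // iterates by code point, not by code unit
    if (points === codePointIndex) return units;
    units += ch.length;    // 1 for BMP characters, 2 for supplementary ones
    points += 1;
  }
  if (points === codePointIndex) return units; // index just past the end
  throw new RangeError("code point index out of range");
}

// With code unit positions and an exclusive end, the segment record maps
// directly onto substring. The record shape is an assumption.
function segmentText(text, segment) {
  return text.substring(segment.start, segment.end);
}

// U+1F600 occupies two code units, so the word "ok" starts at
// code point index 2 but at code unit index 3.
const sample = "\u{1F600} ok";
const start = codePointIndexToCodeUnitIndex(sample, 2); // 3
const end = codePointIndexToCodeUnitIndex(sample, 4);   // 5
const word = segmentText(sample, { start, end });       // "ok"
```

With an inclusive end, the same call would need end + 1, and an empty segment could not be represented without a special case.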