Internationalization: Comments on Text Segmentation straw man

Gillam, Richard gillam at
Mon Apr 29 18:19:18 PDT 2013


Finally had a chance to read this in detail and respond to it.  Sorry it took so long, and sorry I couldn't make it to the last ad-hoc meeting; let's just say things have been stressful here at work recently.  I still haven't had a chance to look at the minutes from the ad-hoc meeting; I hope we haven't progressed so far that it's not worth responding to this email anymore.

> 1) Text segmentation often works on small portions of potentially large documents. For example, when the user clicks or taps on a word, an application may need to find the word being selected. Only the text under the click/tap location is of interest, but that text may be part of a large DOM tree (not necessarily all within one node!). Using String values as input means that the application has to create a string including enough context so that the complete desired segment and some text outside the segment is guaranteed to be included - typically a paragraph. Some libraries, including ICU, accept alternative input types: iterators that can move forward and (unlike ES6 iterators) backward over strings, or generic interfaces that provide access to text. Should our text segmenters allow such alternative input types, or is it reasonable to expect callers to construct String values?

I see your point here, and I think I agree with it.  Do we have any kind of precedent in JS for what an abstract interface providing access to a potentially large body of text might look like?  That might be a bigger design job than the actual segmenter interface.
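
For illustration only, here is a minimal sketch of what such an abstract text-access interface might look like. The ChunkedText class and its codeUnitAt() method are hypothetical names, not taken from the strawman or from any existing library; the idea is just that a segmenter could read code units from several chunks (say, the contents of adjacent DOM text nodes) without the caller first concatenating them into one string.

```javascript
// Hypothetical read-only view over several text chunks.
class ChunkedText {
  constructor(chunks) {
    this.chunks = chunks;
    this.offsets = []; // starting index of each chunk in the virtual text
    let total = 0;
    for (const chunk of chunks) {
      this.offsets.push(total);
      total += chunk.length;
    }
    this.length = total; // total length in UTF-16 code units
  }

  // Returns the UTF-16 code unit at a virtual index, or NaN if out of range
  // (mirroring String.prototype.charCodeAt).
  codeUnitAt(index) {
    if (index < 0 || index >= this.length) return NaN;
    let i = this.chunks.length - 1;
    while (this.offsets[i] > index) i--; // walk back to the owning chunk
    return this.chunks[i].charCodeAt(index - this.offsets[i]);
  }
}

const view = new ChunkedText(["Hello, ", "wor", "ld"]);
const unitAtSeven = view.codeUnitAt(7); // code unit for "w"
```

Random access by index lets a consumer move both forward and backward, which a forward-only ES6 iterator would not.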

> 2) The strawman includes an extension to String.prototype.split, allowing it to accept a TextSegmenter as the separator argument. The only way to detect whether this argument will be understood would be indirect, by checking whether Intl.TextSegmenter exists. Is that acceptable?

I think I'd be cool with that.  Is there a consensus on this issue?
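
The indirect check described in the question might be sketched as follows. Intl.TextSegmenter is the name used in the strawman and is not assumed to exist in any engine; supportsSegmenterSplit is a hypothetical helper name.

```javascript
// Indirect feature detection: split() itself gives no way to ask whether it
// understands a segmenter argument, so test for the constructor instead.
function supportsSegmenterSplit() {
  return typeof Intl === "object" &&
         Intl !== null &&
         typeof Intl.TextSegmenter === "function";
}

const canUseSegmenterSplit = supportsSegmenterSplit();
```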

> 3) It would be useful to have more detail on the use cases, including how the text to be segmented might be represented, and whether segments would typically be accessed sequentially or randomly. Some indication of how common each use case is expected to be in JavaScript applications would also be useful.

I concur, and I don't think I'm qualified to provide this.

> 4) I think we should drop paragraph breaking. Paragraphs are usually defined by the document type - in plain text possibly any CR/LF combination, or two consecutive such combinations, or U+2029; in HTML the text within a <p> element and various other entities; etc. ES text segmenters shouldn't have to know document types.

That's fine with me.

> 5) Do we expect tailored grapheme clusters to be supported and commonly used? It seems to me that default grapheme clusters should be handled in regular expressions, not in this special-purpose API.

Yeah, maybe you're right.

> 6) Line breaks on the other hand should be provided.

They're in there: one of the segmentType values is lineBreak (or did someone after me add that?).

> 7) Is numSegments() really needed? If so, it should be countSegments() or such to indicate that it's a rather expensive operation.

The name change is reasonable.  You may be right that we don't need it, but it was kind of a pain in the butt to do with segmentContaining().
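
To make the cost concrete, here is a rough sketch of counting segments using only a segmentContaining()-style call. The signature segmentContaining(text, index), the returned { start, end } shape, and the toy whitespace rule are all assumptions for illustration; they are not the strawman's actual definitions.

```javascript
// Toy stand-in for a segmenter: a segment is a maximal run of whitespace
// or a maximal run of non-whitespace. Returns { start, end } with end exclusive.
function segmentContaining(text, index) {
  const isSpace = (i) => /\s/.test(text[i]);
  const kind = isSpace(index);
  let start = index;
  while (start > 0 && isSpace(start - 1) === kind) start--;
  let end = index + 1;
  while (end < text.length && isSpace(end) === kind) end++;
  return { start, end };
}

// Counting requires walking the whole text, one segment at a time.
function countSegments(text) {
  let count = 0;
  let index = 0;
  while (index < text.length) {
    index = segmentContaining(text, index).end;
    count++;
  }
  return count;
}

const total = countSegments("one two  three"); // 3 words + 2 whitespace runs
```

Every caller who wants a count has to write this loop, and it touches the entire text, which is why a name like countSegments() signals the expense better than numSegments().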

> 8) We need to define more precisely what is meant by the different segment types. E.g., The notes on segmentType indicate that whitespace and punctuation are returned as separate words - is that generally accepted?

I was basing this on my experience with ICU.  The ICU BreakIterator class didn't give you a way to have stuff between the segments, so the things between the boundaries the word-break iterator returned might or might not be "words."  I think it treated individual punctuation marks as "words" unto themselves and runs of whitespace as single "words."

> Should sequences of punctuation or whitespace be treated as one word or multiple?

See above; I think ICU treats runs of whitespace as "words," but single punctuation marks and symbols as "words."
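
The behavior described above can be imitated with a toy regular expression. This is only a sketch of the described output shape, not ICU's algorithm and not the UAX #29 word-boundary rules; toySegments is a hypothetical name.

```javascript
// Toy approximation: a run of whitespace is one segment, a run of letters or
// digits is one segment, and any other single character (punctuation, symbol)
// is a segment on its own.
function toySegments(text) {
  return text.match(/\s+|[\p{L}\p{N}]+|[^\s\p{L}\p{N}]/gu) || [];
}

const parts = toySegments("Hi,  there!!");
// parts is ["Hi", ",", "  ", "there", "!", "!"]
```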

> Will line breaking report U+00AD as a breaking opportunity?

I would assume so; yes.

> Is it safe to assume that in the absence of U+00AD the segments reported are appropriate as input to a separate hyphenation engine, or do such engines need more context?

I don't know; I would assume this depends on the sophistication of the engine, and that more sophisticated engines would need more context, possibly the whole paragraph.

> Normative references to UTRs 14 and 29 might help.


> 9) Why isn't segmentType just a property of the object returned by segmentContaining?

Duh.  That's definitely better than what I have.

> 10) Should positions be based on UTF-16 code units or Unicode code points? Code points seem more logical, but make it more difficult to map to underlying strings.

Aren't we still indexing strings based on UTF-16 code units?  If so, I'd think that's what you'd have to use here.  Regardless, I think the numbers have to be whatever we're normally indexing strings with.
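
A short example of why the two choices differ: JavaScript string indices and lengths count UTF-16 code units, so a supplementary character occupies two index positions.

```javascript
const text = "a\u{1F600}b"; // "a", U+1F600 (a surrogate pair), "b"

const codeUnits = text.length;              // 4 UTF-16 code units
const codePoints = Array.from(text).length; // 3 Unicode code points
const indexOfB = text.indexOf("b");         // 3, a code unit index
const tail = text.substring(indexOfB);      // "b"
```

A position of 2 counted in code points would refer to "b", but text[2] is the trailing half of the surrogate pair, so code point positions cannot be passed directly to the existing String methods.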

> 11) Should the end property of the object returned by segmentContaining be the index of the last character/code unit included in the segment, or the index after that character/code unit? The latter would be more compatible with String.prototype.substring.

It should be the first index after the end of the segment so that it works with substring().
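
For example, with an exclusive end index the segment can be handed straight to substring(). The { start, end } object here is written by hand to stand in for whatever segmentContaining() would return; its exact shape is an assumption.

```javascript
const sentence = "Hello world";
const segment = { start: 6, end: 11 }; // end is the first index after the segment

const word = sentence.substring(segment.start, segment.end); // "world"
const wordLength = segment.end - segment.start;              // 5
```

With an inclusive end, both lines would need a "+ 1" adjustment.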

> 12) If the second edition of ECMA-402 is based on ES6, this API should provide iterators:


> 13) "Anything else defaults to “word”." should be "Anything else results in a RangeError exception."


