Collation API not complete for search

Mark Davis ☕ mark at macchiato.com
Mon Mar 28 14:58:06 PDT 2011


Searching is discussed in UTS#10. It does need to be correlated with users'
expectations for matching, as you observe.

Mark

*— Il meglio è l’inimico del bene —*


On Mon, Mar 28, 2011 at 14:13, Shawn Steele <Shawn.Steele at microsoft.com> wrote:

>  Searching gets tricky.  Is the result greedy or not (matches as much as
> possible or as little as possible), etc.  There are lots of variations,
> which is why it was skipped from the initial v0.5.
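
To make the greedy question concrete, here is a minimal sketch. It assumes today's Intl.Collator with sensitivity 'base' as a stand-in for the search collator, and matchEnds is a hypothetical helper, not part of any proposed API. It only compares slices of the source, so it ignores the context issues raised elsewhere in this thread.

```javascript
// Hypothetical helper for illustration only.
// For a match beginning at `start`, returns the smallest and largest end index
// whose slice compares equal to `pattern`, or undefined if no slice matches.
function matchEnds(collator, source, pattern, start) {
  let shortest;
  let longest;
  for (let end = start + 1; end <= source.length; end++) {
    if (collator.compare(source.slice(start, end), pattern) === 0) {
      if (shortest === undefined) shortest = end;
      longest = end;
    }
  }
  return shortest === undefined ? undefined : { shortest, longest };
}

// Assumption: accents and case are ignored (sensitivity 'base').
const baseCollator = new Intl.Collator('en', { sensitivity: 'base' });

// "re\u0301sume\u0301" is "résumé" written with combining acute accents.
const text = 'a re\u0301sume\u0301 here';
const ends = matchEnds(baseCollator, text, 'resume', 2);
// The lazy match stops before the final combining accent;
// the greedy match includes it.
console.log(ends);
```

Both answers are defensible, so an API has to pick one or let the caller choose.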
>
>
>
> Comparison, Search and Casing are all dependent on each other.  If search
> finds a substring, we’d expect comparison to match that substring.
> Similarly, if one is using Turkish I, we expect all of them to do so.
>
>
>
> - Shawn
>
>
>
> *From:* Nebojša Ćirić [mailto:cira at google.com]
> *Sent:* Monday, March 28, 2011 1:36 PM
> *To:* Mark Davis ☕
> *Cc:* es-discuss at mozilla.org; Shawn Steele; Phillips, Addison
> *Subject:* Re: Collation API not complete for search
>
>
>
> Shawn, would you be ok with adding this new API to the list for 0.5 so we
> can support collation search?
>
>
>
> I'll edit the strawman in case nobody objects to this addition.
>
> On March 25, 2011 at 16:34, Nebojša Ćirić <cira at google.com> wrote:
>
> In that case I wouldn't put this new functionality in the Collator object.
> A new StringSearch or StringIterator object would make more sense:
>
>
>
> options = {
>   collator[optional - default, collatorType=search],
>   source[required],
>   pattern[required]
> }
>
> LocaleInfo.StringIterator = function(options) {}
>
> LocaleInfo.StringIterator.prototype.first = function() {
>   find first occurrence
> }
>
> LocaleInfo.StringIterator.prototype.next = function() {
>   get me next occurrence of pattern in source
> }
>
> LocaleInfo.StringIterator.prototype.matchLength = function() {
>   length of the match
> }
>
> ... (reset, setPosition...)
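
A rough sketch of how such an iterator might behave, under stated assumptions: the class reuses the option names from the proposal above (collator, source, pattern), but Intl.Collator stands in for the LocaleInfo collator, and the slice-and-compare strategy is my own simplification. It returns the shortest match at each position and does not handle context before the match.

```javascript
// Sketch only: not the proposed implementation.
class StringIterator {
  constructor({ collator, source, pattern }) {
    // Assumed default: a search-oriented collator for the default locale.
    this.collator = collator || new Intl.Collator(undefined, { usage: 'search' });
    this.source = source;
    this.pattern = pattern;
    this.cursor = 0;  // where the next scan starts
    this.length = 0;  // length of the most recent match
  }

  // Restart from the beginning and return the index of the first match, or -1.
  first() {
    this.cursor = 0;
    this.length = 0;
    return this.next();
  }

  // Return the index of the next match at or after the cursor, or -1.
  next() {
    const { collator, source, pattern } = this;
    for (let start = this.cursor; start < source.length; start++) {
      for (let end = start + 1; end <= source.length; end++) {
        if (collator.compare(source.slice(start, end), pattern) === 0) {
          this.length = end - start;
          this.cursor = end; // resume after this match
          return start;
        }
      }
    }
    this.length = 0;
    this.cursor = source.length;
    return -1;
  }

  // Length, in code units, of the most recent match (0 if none).
  matchLength() {
    return this.length;
  }
}

// Usage: the second occurrence is spelled with combining accents, so its
// match length differs from the pattern length.
const searchCollator = new Intl.Collator('en', { sensitivity: 'base' });
const iterator = new StringIterator({
  collator: searchCollator,
  source: 'Resume, re\u0301sume\u0301',
  pattern: 'resume',
});
const found = [];
for (let at = iterator.first(); at !== -1; at = iterator.next()) {
  found.push([at, iterator.matchLength()]);
}
console.log(found);
```

Because the caller never computes the next start index itself, differing match lengths cannot lead to an off-by-one loop.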
>
> On March 25, 2011 at 15:14, Mark Davis ☕ <mark at macchiato.com> wrote:
>
>
>
> I think an iterator is a cleaner interface; we were just trying to minimize
> new API.
>
>
>
> In general, collation is context sensitive, so searching on substrings
> isn't a good idea. You want to search from a location, but have the rest of
> the text available to you.
>
>
>
> For the iterator, you would need to be able to reset to a location, but the
> context beforehand could affect what happens.
>
>
> Mark
>
> *— Il meglio è l’inimico del bene —*
>
>
>
>  On Fri, Mar 25, 2011 at 14:22, Mike Samuel <mikesamuel at gmail.com> wrote:
>
> 2011/3/25 Mike Samuel <mikesamuel at gmail.com>:
>
> > 2011/3/25 Nebojša Ćirić <cira at google.com>:
> >> find method wouldn't return boolean but an array of two values:
> >
> > Sorry if I wasn't clear.  The !! at the beginning of the call to find
> > is important.
> > The undefined value you mentioned below as possible no match result is
> > falsey because !!undefined === false.
> >
> >> myCollator.find('gaard', 'ard', 2) -> [2, 5]  // 4 or 5 as a bound
> >> myCollator.find('ard', 'ard', 0) -> [0, 3]  // 2 or 3 as a bound
> >> I guess [2, 5] !== [0, 3]
> >
> > True, but also [2, 5] !== [2, 5].
> >
> >> We could return [-1, undefined] for not found state, or just undefined.
> >
> >> I agree that returning a boolean makes for easier tests in loops.
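
The identity and truthiness points in this exchange can be checked directly in plain JavaScript. This is only an illustration; sameBounds is a hypothetical helper and no collator is involved.

```javascript
// Two arrays with the same contents are still distinct objects.
const hit = [2, 5];
const sameValues = [2, 5];
const identical = hit === sameValues; // false

// Hypothetical helper: compare [first, last] results by value instead.
const sameBounds = (a, b) =>
  a.length === b.length && a.every((value, i) => value === b[i]);
const equalByValue = sameBounds(hit, sameValues); // true

// Any array is truthy, including a "not found" result of [-1, undefined],
// so only a falsey return such as undefined works with the !! test.
const sentinelIsTruthy = !![-1, undefined]; // true
const undefinedIsFalsey = !!undefined === false; // true

console.log({ identical, equalByValue, sentinelIsTruthy, undefinedIsFalsey });
```

So a loop that tests the result with !! needs undefined (or another falsey value) for the no-match case, not [-1, undefined].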
> >
> >
> >> On March 25, 2011 at 14:00, Mike Samuel <mikesamuel at gmail.com> wrote:
> >>>
> >>> 2011/3/25 Nebojša Ćirić <cira at google.com>:
> >>> > Looking through the notes from the meeting I also found some problems
> >>> > with the collator. We did specify the collatorType: search, but we
> >>> > didn't offer a function that would make use of it. Mark and I are
> >>> > thinking about:
> >>> > /**
> >>> >  * string - string to search over.
> >>> >  * substring - string to look for in "string"
> >>> >  * index - start search from index
> >>> >  * @return {Array} [first, last] - first is index of the match or -1,
> >>> >  * last is end of the match or undefined.
> >>> >  */
> >>> > LocaleInfo.Collator.prototype.find(string, substring, index)
> >>> > We could also opt for iterator solution where we keep the state.
> >>>
> >>> Assuming find returns a falsey value when nothing is found, is it the
> >>> case that for all (string, index) pairs,
> >>>
> >>> !!myCollator.find(string, substring, index) ===
> >>> !!myCollator.find(string.substring(index), substring, 0)
>
> Maybe a better way to phrase this relation is
>
> will any collator ever look at a code-unit to the left of index when
> trying to determine whether there is a match at or after index?
>
> E.g. if the code-unit at index might be a strict suffix of a substring
> that could be represented as a one codepoint ligature.
>
>
>
> >>> This would be false if the substring 'ard' should be found in 'gard',
> >>> but not 'gaard' because then
> >>>
> >>>     !!myCollator.find('gaard', 'ard', 2) !==
> >>>     !!myCollator.find('ard', 'ard', 0)
> >>>
> >>>
> >>> If that relation does not hold, then exposing find as an iterator
> >>> might help prevent a profusion of subtly wrong loops.
> >>>
> >>>
> >>> > The reason we need to return both begin and end part of the found
> >>> > string is:
> >>> > Look for gaard and we find gård - which may be equivalent in Danish,
> >>> > but substring lengths don't match (5 vs. 4) so we need to tell user
> >>> > the next index position.
> >>> > The other problem Jungshik found is that there is a combinatorial
> >>> > explosion with all ignoreXXX options we defined. My proposal is to
> >>> > define only N that make sense (and can be supported by all
> >>> > implementors) and fall back the rest to some predefined default.
> >>>
> >>>
> >>>
> >>> > --
> >>> > Nebojša Ćirić
> >>> >
> >>> > _______________________________________________
> >>> > es-discuss mailing list
> >>> > es-discuss at mozilla.org
> >>> > https://mail.mozilla.org/listinfo/es-discuss
> >>> >
> >>> >
> >>
> >>
> >>
> >> --
> >> Nebojša Ćirić
> >>
> >