Collation API not complete for search

Nebojša Ćirić cira at google.com
Fri Mar 25 16:34:53 PDT 2011


In that case I wouldn't put this new functionality in the Collator object. A
new StringSearch or StringIterator object would make more sense:

// options = {
//   collator: optional; defaults to a collator with collatorType=search
//   source:   required
//   pattern:  required
// }
LocaleInfo.StringIterator = function(options) {};
// Find the first occurrence of pattern in source.
LocaleInfo.StringIterator.prototype.first = function() {};
// Get the next occurrence of pattern in source.
LocaleInfo.StringIterator.prototype.next = function() {};
// Length of the current match.
LocaleInfo.StringIterator.prototype.matchLength = function() {};
// ... (reset, setPosition...)
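
To make the shape above concrete, here is a minimal sketch of such an iterator. It is only an illustration under stated assumptions: toyCollator is a hypothetical stand-in (it ignores case and treats "aa" and "å" as equal), the return values of first() and next() (start index, or -1) are my guess at a reasonable contract, and the brute-force window scan ignores the context sensitivity Mark describes below.

```javascript
// Hypothetical stand-in for a search collator: ignores case and treats
// "aa" and "\u00e5" (å) as equal. A real collator would be supplied by
// the implementation; this one exists only to make the sketch runnable.
var toyCollator = {
  key: function (s) { return s.toLowerCase().replace(/\u00e5/g, 'aa'); },
  compare: function (a, b) {
    var ka = this.key(a), kb = this.key(b);
    return ka < kb ? -1 : ka > kb ? 1 : 0;
  }
};

function StringIterator(options) {
  this.collator = options.collator || toyCollator; // optional
  this.source = options.source;                    // required
  this.pattern = options.pattern;                  // required
  this.position = 0; // index where the next scan starts
  this.length = 0;   // length of the most recent match
}

// Restart from the beginning; returns the start index of the match or -1.
StringIterator.prototype.first = function () {
  this.position = 0;
  this.length = 0;
  return this.next();
};

// Returns the start index of the next match at or after the current
// position, or -1. Naive scan: tries every window and keeps the shortest
// one that the collator considers equal to the pattern.
StringIterator.prototype.next = function () {
  var src = this.source;
  for (var start = this.position; start < src.length; start++) {
    for (var end = start + 1; end <= src.length; end++) {
      if (this.collator.compare(src.slice(start, end), this.pattern) === 0) {
        this.length = end - start;
        this.position = end; // continue after this match
        return start;
      }
    }
  }
  this.length = 0;
  this.position = src.length;
  return -1;
};

// Length of the most recent match, in code units of the source.
StringIterator.prototype.matchLength = function () {
  return this.length;
};
```

With source 'g\u00e5rd og gaard' and pattern 'gaard', first() finds a match of length 4 and next() a match of length 5, which is why the caller needs matchLength() rather than pattern.length to advance.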

On 25 March 2011 at 15:14, Mark Davis ☕ <mark at macchiato.com> wrote:

> I think an iterator is a cleaner interface; we were just trying to minimize
> new API.
>
> In general, collation is context sensitive, so searching on substrings
> isn't a good idea. You want to search from a location, but have the rest of
> the text available to you.
>
> For the iterator, you would need to be able to reset to a location, but the
> context beforehand could affect what happens.
>
> Mark
>
> *— Il meglio è l’inimico del bene —*
>
>
>
> On Fri, Mar 25, 2011 at 14:22, Mike Samuel <mikesamuel at gmail.com> wrote:
>
>> 2011/3/25 Mike Samuel <mikesamuel at gmail.com>:
>> > 2011/3/25 Nebojša Ćirić <cira at google.com>:
>> >> find method wouldn't return boolean but an array of two values:
>> >
>> > Sorry if I wasn't clear.  The !! at the beginning of the call to find
>> > is important.
>> > The undefined value you mentioned below as possible no match result is
>> > falsey because !!undefined === false.
>> >
>> >> myCollator.find('gaard', 'ard', 2) -> [2, 5]  // 4 or 5 as a bound
>> >> myCollator.find('ard', 'ard', 0) -> [0, 3]  // 2 or 3 as a bound
>> >> I guess [2, 5] !== [0, 3]
>> >
>> > True, but also [2, 5] !== [2, 5].
>> >
>> >> We could return [-1, undefined] for not found state, or just undefined.
>> >
>> >> I agree that returning a boolean makes for easier tests in loops.
>> >
>> >
>> >> On 25 March 2011 at 14:00, Mike Samuel <mikesamuel at gmail.com> wrote:
>> >>>
>> >>> 2011/3/25 Nebojša Ćirić <cira at google.com>:
>> >>> > Looking through the notes from the meeting I also found some problems
>> >>> > with the collator. We did specify the collatorType: search, but we
>> >>> > didn't offer a function that would make use of it. Mark and I are
>> >>> > thinking about:
>> >>> > /**
>> >>> >  * string - string to search over.
>> >>> >  * substring - string to look for in "string"
>> >>> >  * index - start search from index
>> >>> >  * @return {Array} [first, last] - first is index of the match or -1,
>> >>> >  * last is end of the match or undefined.
>> >>> >  */
>> >>> > LocaleInfo.Collator.prototype.find(string, substring, index)
>> >>> > We could also opt for iterator solution where we keep the state.
>> >>>
>> >>> Assuming find returns a falsey value when nothing is found, is it the
>> >>> case that for all (string, index) pairs,
>> >>>
>> >>> !!myCollator.find(string, substring, index) ===
>> >>> !!myCollator.find(string.substring(index), substring, 0)
>>
>> Maybe a better way to phrase this relation is
>>
>> will any collator ever look at a code-unit to the left of index when
>> trying to determine whether there is a match at or after index?
>>
>> E.g. if the code-unit at index might be a strict suffix of a substring
>> that could be represented as a one codepoint ligature.
>>
>>
>> >>> This would be false if the substring 'ard' should be found in 'gard',
>> >>> but not 'gaard' because then
>> >>>
>> >>>     !!myCollator.find('gaard', 'ard', 2) !==
>> >>>     !!myCollator.find('ard', 'ard', 0)
>> >>>
>> >>>
>> >>> If that relation does not hold, then exposing find as an iterator
>> >>> might help prevent a profusion of subtly wrong loops.
>> >>>
>> >>>
>> >>> > The reason we need to return both begin and end part of the found
>> >>> > string is:
>> >>> > Look for gaard and we find gård - which may be equivalent in Danish,
>> >>> > but substring lengths don't match (5 vs. 4) so we need to tell user
>> >>> > the next index position.
>> >>> > The other problem Jungshik found is that there is a combinatorial
>> >>> > explosion with all ignoreXXX options we defined. My proposal is to
>> >>> > define only N that make sense (and can be supported by all
>> >>> > implementors) and fall back the rest to some predefined default.
>> >>>
>> >>>
>> >>>
>> >>> > --
>> >>> > Nebojša Ćirić
>> >>> >
>> >>> > _______________________________________________
>> >>> > es-discuss mailing list
>> >>> > es-discuss at mozilla.org
>> >>> > https://mail.mozilla.org/listinfo/es-discuss
>> >>> >
>> >>> >
>> >>
>> >>
>> >>
>> >> --
>> >> Nebojša Ćirić
>> >>
>> >
>>
>
>
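
For reference, the relation Mike asks about in the quoted thread can be checked with a toy model. The sketch below is an assumption-laden illustration, not a proposed implementation: elements() is a hypothetical tokenizer that treats "aa" and "å" as a single collation element, and find() follows the [first, last] shape from the quoted proposal, returning undefined when nothing matches. Under those assumptions a match may only begin on an element boundary, so the "a" at index 2 of 'gaard' is not a valid starting point.

```javascript
// Hypothetical collation elements: "aa" and "\u00e5" (å) form one element,
// every other code unit is its own element. Each element records the
// [start, end) range it covers in the original string.
function elements(s) {
  var out = [], re = /aa|\u00e5|[\s\S]/g, m;
  while ((m = re.exec(s)) !== null) {
    out.push({
      key: m[0] === 'aa' ? '\u00e5' : m[0],
      start: m.index,
      end: m.index + m[0].length
    });
  }
  return out;
}

// Returns [first, last] for the first match starting at or after index,
// or undefined if there is none. The whole string is tokenized, so text
// to the left of index still decides where element boundaries fall.
function find(string, substring, index) {
  var src = elements(string), pat = elements(substring);
  if (pat.length === 0) return undefined;
  for (var i = 0; i + pat.length <= src.length; i++) {
    if (src[i].start < index) continue; // not a boundary at or after index
    var j = 0;
    while (j < pat.length && src[i + j].key === pat[j].key) j++;
    if (j === pat.length) return [src[i].start, src[i + j - 1].end];
  }
  return undefined;
}

// The substring relation fails in this model:
var inContext = !!find('gaard', 'ard', 2);               // context kept
var outOfContext = !!find('gaard'.substring(2), 'ard', 0); // context lost

// Match length differs from pattern length, hence [first, last]:
var danish = find('g\u00e5rd', 'gaard', 0);
```

Here inContext is false and outOfContext is true, and danish covers 4 code units although the pattern has 5. Note also that results have to be compared element by element, since two distinct arrays are never === to each other.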


-- 
Nebojša Ćirić