Collation API not complete for search

Nebojša Ćirić cira at google.com
Mon Mar 28 13:36:17 PDT 2011


Shawn, would you be ok with adding this new API to the list for 0.5 so we
can support collation search?

I'll edit the strawman in case nobody objects to this addition.

On March 25, 2011 at 16:34, Nebojša Ćirić <cira at google.com> wrote:

> In that case I wouldn't put this new functionality in the Collator object.
> A new StringSearch or StringIterator object would make more sense:
>
> options = {
>   collator[optional - default, collatorType=search],
>   source[required],
>   pattern[required]
> }
> LocaleInfo.StringIterator = function(options) {}
> LocaleInfo.StringIterator.prototype.first = function() { /* find first occurrence */ }
> LocaleInfo.StringIterator.prototype.next = function() { /* get next occurrence of pattern in source */ }
> LocaleInfo.StringIterator.prototype.matchLength = function() { /* length of the match */ }
> ... (reset, setPosition...)
>
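To make the shape of that proposal concrete, here is a minimal sketch of how such an iterator might behave. Everything beyond the names first, next and matchLength is an assumption for illustration: LocaleInfo is a stand-in object, the equivalent option is hypothetical, and the toy equivalence ("aa" equals "å") is not real Danish collation data. It also ignores the context problem Mark raises, since it compares isolated slices.

```javascript
// Stand-in namespace; the real LocaleInfo object is not defined in this thread.
const LocaleInfo = {};

// Toy equivalence used as the default "collator" (an assumption):
// "aa" and "\u00e5" (å) compare equal, everything else by code unit.
function toyEquivalent(a, b) {
  const fold = (s) => s.replace(/aa/g, '\u00e5');
  return fold(a) === fold(b);
}

LocaleInfo.StringIterator = function (options) {
  this.source = options.source;
  this.pattern = options.pattern;
  this.equivalent = options.equivalent || toyEquivalent; // hypothetical option
  this.position = 0; // where the next search starts
  this.length = 0;   // length of the most recent match
};

// Internal helper: index of the first match starting at or after `from`, or -1.
LocaleInfo.StringIterator.prototype.searchFrom = function (from) {
  for (let start = from; start < this.source.length; start++) {
    for (let end = start + 1; end <= this.source.length; end++) {
      if (this.equivalent(this.source.slice(start, end), this.pattern)) {
        this.position = end;
        this.length = end - start;
        return start;
      }
    }
  }
  this.position = this.source.length;
  this.length = 0;
  return -1;
};

LocaleInfo.StringIterator.prototype.first = function () {
  return this.searchFrom(0);
};
LocaleInfo.StringIterator.prototype.next = function () {
  return this.searchFrom(this.position);
};
LocaleInfo.StringIterator.prototype.matchLength = function () {
  return this.length;
};

const it = new LocaleInfo.StringIterator({
  source: 'g\u00e5rd og gaard',
  pattern: 'gaard'
});
const firstIndex = it.first();         // matches "gård"
const firstLength = it.matchLength();
const secondIndex = it.next();         // matches "gaard"
const secondLength = it.matchLength();
const thirdIndex = it.next();          // no further match
```

The two matches have different lengths for the same pattern, which is why the iterator needs matchLength rather than assuming pattern.length.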
> On March 25, 2011 at 15:14, Mark Davis ☕ <mark at macchiato.com> wrote:
>
>> I think an iterator is a cleaner interface; we were just trying to minimize
>> new API.
>>
>> In general, collation is context sensitive, so searching on substrings
>> isn't a good idea. You want to search from a location, but have the rest of
>> the text available to you.
>>
>> For the iterator, you would need to be able to reset to a location, but
>> the context beforehand could affect what happens.
>>
>> Mark
>>
>> *— Il meglio è l’inimico del bene —*
>>
>>
>>
>> On Fri, Mar 25, 2011 at 14:22, Mike Samuel <mikesamuel at gmail.com> wrote:
>>
>>> 2011/3/25 Mike Samuel <mikesamuel at gmail.com>:
>>> > 2011/3/25 Nebojša Ćirić <cira at google.com>:
>>> >> find method wouldn't return boolean but an array of two values:
>>> >
>>> > Sorry if I wasn't clear.  The !! at the beginning of the call to find
>>> > is important.
>>> > The undefined value you mention below as a possible no-match result is
>>> > falsey because !!undefined === false.
>>> >
>>> >> myCollator.find('gaard', 'ard', 2) -> [2, 5]  // 4 or 5 as a bound
>>> >> myCollator.find('ard', 'ard', 0) -> [0, 3]  // 2 or 3 as a bound
>>> >> I guess [2, 5] !== [0, 3]
>>> >
>>> > True, but also [2, 5] !== [2, 5].
>>> >
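A small illustration of the two plain JavaScript facts in play here, with variable names chosen only for this example: array literals compare by identity, and the choice of "not found" value decides whether the !! test works.

```javascript
// Two array literals with equal contents are still distinct objects.
const a = [2, 5];
const b = [2, 5];
const sameObject = a === b; // identity comparison
const sameContents =
  a.length === b.length && a.every((value, i) => value === b[i]);

// undefined is falsy, but any array is truthy, even [-1, undefined].
const notFoundAsUndefined = !!undefined;
const notFoundAsArray = !![-1, undefined];
```

So a find that signals "no match" with [-1, undefined] cannot be tested with !!, while one that returns undefined can.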
>>> >> We could return [-1, undefined] for not found state, or just
>>> >> undefined.
>>> >
>>> >> I agree that returning a boolean makes for easier tests in loops.
>>> >
>>> >
>>> >> On March 25, 2011 at 14:00, Mike Samuel <mikesamuel at gmail.com> wrote:
>>> >>>
>>> >>> 2011/3/25 Nebojša Ćirić <cira at google.com>:
>>> >>> > Looking through the notes from the meeting I also found some problems
>>> >>> > with the collator. We did specify the collatorType: search, but we didn't
>>> >>> > offer a function that would make use of it. Mark and I are thinking about:
>>> >>> > /**
>>> >>> >  * string - string to search over.
>>> >>> >  * substring - string to look for in "string"
>>> >>> >  * index - start search from index
>>> >>> >  * @return {Array} [first, last] - first is index of the match or -1,
>>> >>> >  *     last is end of the match or undefined.
>>> >>> >  */
>>> >>> > LocaleInfo.Collator.prototype.find(string, substring, index)
>>> >>> > We could also opt for iterator solution where we keep the state.
>>> >>>
>>> >>> Assuming find returns a falsey value when nothing is found, is it the
>>> >>> case that for all (string, index) pairs,
>>> >>>
>>> >>> !!myCollator.find(string, substring, index) ===
>>> >>> !!myCollator.find(string.substring(index), substring, 0)
>>>
>>> Maybe a better way to phrase this relation is
>>>
>>> will any collator ever look at a code-unit to the left of index when
>>> trying to determine whether there is a match at or after index?
>>>
>>> E.g., this could happen if the code unit at index is a strict suffix of a
>>> substring that could be represented as a one-code-point ligature.
>>>
>>>
>>> >>> This would be false if the substring 'ard' should be found in 'gard',
>>> >>> but not 'gaard' because then
>>> >>>
>>> >>>     !!myCollator.find('gaard', 'ard', 2) !==
>>> >>>     !!myCollator.find('ard', 'ard', 0)
>>> >>>
>>> >>>
>>> >>> If that relation does not hold, then exposing find as an iterator
>>> >>> might help prevent a profusion of subtly wrong loops.
>>> >>>
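To make the case where that relation fails concrete, here is a toy sketch. It assumes a tailoring in which "aa" is a single collation element equivalent to "å" (an illustration only, not real Danish collation data), and a hypothetical find that segments the whole string before matching, so text to the left of index affects the result.

```javascript
// Toy tailoring (an assumption): "aa" is one element with the same key
// as "\u00e5" (å); every other code unit is its own element.
function toElements(text) {
  const elements = [];
  let i = 0;
  while (i < text.length) {
    if (text.startsWith('aa', i)) {
      elements.push({ key: '\u00e5', start: i, end: i + 2 });
      i += 2;
    } else {
      elements.push({ key: text[i], start: i, end: i + 1 });
      i += 1;
    }
  }
  return elements;
}

// Hypothetical find(): returns [start, end] code-unit offsets (end exclusive)
// or undefined. A match may only begin on an element boundary at or after index.
function find(string, substring, index) {
  const source = toElements(string); // segmented from the start, so context counts
  const pattern = toElements(substring);
  if (pattern.length === 0) return undefined;
  for (let s = 0; s + pattern.length <= source.length; s++) {
    if (source[s].start < index) continue;
    if (pattern.every((p, k) => p.key === source[s + k].key)) {
      return [source[s].start, source[s + pattern.length - 1].end];
    }
  }
  return undefined;
}

// Index 2 falls inside the "aa" element, so nothing matches with full context...
const withContext = find('gaard', 'ard', 2);
// ...but cutting the string first throws the context away and a match appears.
const withoutContext = find('gaard'.substring(2), 'ard', 0);
```

Under these assumptions !!withContext and !!withoutContext differ, which is exactly the kind of loop bug an iterator that keeps the whole source would avoid.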
>>> >>>
>>> >>> > The reason we need to return both the beginning and the end of the found
>>> >>> > string is this: look for gaard and we find gård, which may be equivalent in
>>> >>> > Danish, but the substring lengths don't match (5 vs. 4), so we need to tell
>>> >>> > the user the next index position.
>>> >>> > The other problem Jungshik found is that there is a combinatorial explosion
>>> >>> > with all the ignoreXXX options we defined. My proposal is to define only the N
>>> >>> > that make sense (and can be supported by all implementors) and fall back to
>>> >>> > some predefined default for the rest.
>>> >>>
>>> >>>
>>> >>>
>>> >>> > --
>>> >>> > Nebojša Ćirić
>>> >>> >
>>> >>> > _______________________________________________
>>> >>> > es-discuss mailing list
>>> >>> > es-discuss at mozilla.org
>>> >>> > https://mail.mozilla.org/listinfo/es-discuss
>>> >>> >
>>> >>> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Nebojša Ćirić
>>> >>
>>> >
>>>
>>
>>
>
>
> --
> Nebojša Ćirić
>



-- 
Nebojša Ćirić