Unicode normalization problem

monolithed monolithed at gmail.com
Wed Apr 1 20:30:38 UTC 2015


> What you’re seeing there is not normalization, but rather the string
> iterator that automatically accounts for surrogate pairs (treating them as
> a single unit).

```js
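var foo = '𝐀'; // U+1D400: one code point, stored as a surrogate pair
var bar = '\u0418\u0306'; // decomposed form: И (U+0418) + combining breve (U+0306)
foo.length; // 2
Array.from(foo).length; // 1

bar.length; // 2
Array.from(bar).length; // 2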
```

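I think this is strange: the iterator joins a surrogate pair into one element, but leaves a base letter and its combining mark as two.
How can I work with strings safely?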
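
To make the difference concrete, here is a minimal sketch that counts the same string three ways. The helper name `describe` and its property names are made up for illustration and are not part of any API; only `length`, `Array.from` and `String.prototype.normalize` are standard.

```js
// Hypothetical helper (illustration only): three ways of counting a string.
function describe(str) {
  return {
    // UTF-16 code units, which is what `length` reports
    codeUnits: str.length,
    // code points: the string iterator joins surrogate pairs
    codePoints: Array.from(str).length,
    // code points after canonical composition (NFC)
    nfcCodePoints: Array.from(str.normalize('NFC')).length
  };
}

var astral = '\uD835\uDC00';     // U+1D400 written as a surrogate pair
var decomposed = '\u0418\u0306'; // И + combining breve

describe(astral);     // { codeUnits: 2, codePoints: 1, nfcCodePoints: 1 }
describe(decomposed); // { codeUnits: 2, codePoints: 2, nfcCodePoints: 1 }
```

Note that NFC only helps when a precomposed character exists; a base letter with a combining mark that has no precomposed form still comes out as two code points after `normalize()`.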


2015-04-01 22:17 GMT+03:00 Alexander Guinness <monolithed at gmail.com>:

> My reasoning is based on the following example:
>
> ```js
> var text = '𝐀';
>
> text.length; // 2
>
> Array.from(text).length // 1
> ```
>
> 2015-04-01 22:05 GMT+03:00 Rick Waldron <waldron.rick at gmail.com>:
>
>>
>>
>> On Wed, Apr 1, 2015 at 2:59 PM monolithed <monolithed at gmail.com> wrote:
>>
>>> ```js
>>> var text = 'ЙйЁё';
>>>
>>> text.split(''); // ["И", "̆", "и", "̆", "Е", "̈", "е", "̈"]
>>> ```
>>>
>>> Possible solutions:
>>>
>>> 1.
>>>
>>> ```js
>>> text.normalize().split('') // ["Й", "й", "Ё", "ё"]
>>> ```
>>>
>>> I like it, but it is not so convenient.
>>>
>>> 2.
>>>
>>> ```js
>>> Array.from(text) // ["И", "̆", "и", "̆", "Е", "̈", "е", "̈"]
>>> ```
>>>
>>> 3.
>>>
>>> ```js
>>> [...text] // ["И", "̆", "и", "̆", "Е", "̈", "е", "̈"]
>>> ```
>>>
>>>
>>> Should the `Array.from` and `...text` work as the first example and why?
>>>
>>
>> Why would they imply calling `normalize()`? What if that wasn't desired?
>>
>> Since #1 calls normalize before split(), the actual equivalents would
>> look like this:
>>
>>   Array.from(text.normalize()) // [ "Й", "й", "Ё", "ё" ]
>>   [...text.normalize()] // [ "Й", "й", "Ё", "ё" ]
>>
>> Rick
>>
>
>