Full Unicode strings strawman
mikesamuel at gmail.com
Mon May 16 12:28:26 PDT 2011
2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:
> On May 16, 2011, at 11:30 AM, Mike Samuel wrote:
>> 2011/5/16 Allen Wirfs-Brock <allen at wirfs-brock.com>:
>>> I tried to post a pointer to this strawman on this list a few weeks ago, but
>>> apparently it didn't reach the list for some reason.
>>> Feedback would be appreciated:
>> Will this change the behavior of character groups in regular
>> expressions? Would myString.match(/^.$/)[0].length ever be
>> 2? Would it ever match a supplemental codepoint?
> No, supplemental codepoints are single string characters and RegExp matching operates on such characters. A string could, of course, contain character sequences that correspond to UTF-8, UTF-16, or other multi-unit encodings. However, from the perspective of Strings and RegExp those encodings would be multi-character sequences just like they are today. The only ES functions currently proposed that would deal with multi-character encodings of supplemental codepoints are the URI handling functions. However, it may be a good idea to add string-to-string UTF-8 and UTF-16 encode/decode functions that simply do the encode/decode and don't have all the other processing involved in encodeURI/decodeURI.
DOMString is defined at
Type Definition DOMString
A DOMString is a sequence of 16-bit units.
So how would round-tripping a JS string through a DOMString work?
var oneSupplemental = "\U00010000";
alert(oneSupplemental.length); // alerts 1
var utf16Encoded = encodeUTF16(oneSupplemental);
alert(utf16Encoded.length); // alerts 2
var textNode = document.createTextNode(utf16Encoded);
alert(textNode.nodeValue.length); // alerts ?
Does the DOM need to represent utf16Encoded internally so that it can
report 2 as the length on fetch of nodeValue? If so, how can it
represent that for systems that use a UTF-16 internal representation?
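
To make the question concrete, here is a minimal sketch of what an encodeUTF16 like the one above might do. It rests on assumptions: a full-Unicode string is modeled as an array of code points (current engines have no such string type), a DOMString is modeled as an array of 16-bit units, and decodeUTF16 is a hypothetical inverse added only for illustration.

```javascript
// Assumption: a full-Unicode string is modeled as an array of code points,
// and a DOMString as an array of 16-bit units. Both helpers are hypothetical.
function encodeUTF16(codePoints) {
  const units = [];
  for (const cp of codePoints) {
    if (!Number.isInteger(cp) || cp < 0 || cp > 0x10FFFF) {
      throw new RangeError("not a code point: " + cp);
    }
    if (cp <= 0xFFFF) {
      units.push(cp); // BMP code points map to a single unit
    } else {
      const v = cp - 0x10000; // 20-bit offset into the supplementary planes
      units.push(0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF));
    }
  }
  return units;
}

function decodeUTF16(units) {
  const codePoints = [];
  for (let i = 0; i < units.length; i++) {
    const hi = units[i];
    const lo = units[i + 1];
    if (hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF) {
      // A well-formed pair collapses to one supplemental code point.
      codePoints.push(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
      i++;
    } else {
      codePoints.push(hi); // BMP unit or orphaned surrogate, passed through
    }
  }
  return codePoints;
}

const oneSupplemental = [0x10000];
const utf16Encoded = encodeUTF16(oneSupplemental);
console.log(oneSupplemental.length);           // 1
console.log(utf16Encoded.length);              // 2
console.log(decodeUTF16(utf16Encoded).length); // 1
```

Under this model the 16-bit sequence [0xD800, 0xDC00] looks the same whether it came from one supplemental codepoint or from two explicitly stored surrogate characters, which is the ambiguity the nodeValue question is about.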
>> How would the below, which replaces orphaned surrogates with U+FFFD
>> when strings are viewed as sequences of UTF-16 code units behave?
>> myString.replace( /[\ud800-\udbff](?![\udc00-\udfff])/g, "\ufffd")
>> .replace( /(^|[^\ud800-\udbff])([\udc00-\udfff])/g, "$1\ufffd")
> Exactly as it currently does, assuming it was applied to a string that didn't contain any codepoints greater than \uffff. If the string contained any codepoints > \uffff, those characters would not match the pattern and so would not be replaced.
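
For reference, the two-step replacement can be tried in a current engine, where strings are sequences of 16-bit units. The sketch below assumes the low-surrogate range \udc00-\udfff in both patterns and restores the captured prefix with $1 so the character before an orphan is kept; the function name is hypothetical.

```javascript
// Hypothetical helper: replaces orphaned surrogates with U+FFFD when the
// string is viewed as a sequence of UTF-16 code units (no "u" flag).
function replaceOrphanedSurrogates(s) {
  return s
    // A high surrogate not followed by a low surrogate is orphaned.
    .replace(/[\ud800-\udbff](?![\udc00-\udfff])/g, "\ufffd")
    // A low surrogate at the start, or after a non-high-surrogate, is orphaned.
    // The preceding character is captured and put back with $1.
    .replace(/(^|[^\ud800-\udbff])[\udc00-\udfff]/g, "$1\ufffd");
}

console.log(replaceOrphanedSurrogates("a\ud800b") === "a\ufffdb");         // true
console.log(replaceOrphanedSurrogates("x\udc00") === "x\ufffd");           // true
console.log(replaceOrphanedSurrogates("\ud877\udefc") === "\ud877\udefc"); // true
```

Because the second pattern consumes the preceding character, a run of several consecutive orphaned low surrogates may not be fully replaced in one pass; this is a sketch of the idea rather than a complete sanitizer.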
> The important thing to keep in mind here is that under this proposal, a supplemental codepoint is a single logical character. For example, using a random character that isn't in the BMP:
> "\u+02defc" === "\ud877\udefc"; // this is false
> "\u+02defc".length === 1; // this is true
> "\ud877\udefc".length === 2; // this is true
> Existing code that manipulates surrogate pairs continues to work unmodified because such code is explicitly manipulating pairs of characters. However, such code might produce unexpected results if handed a string containing a codepoint > \uffff. But that takes an explicit action by someone to introduce such an enhanced character into a string.
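
For comparison, here is how the same character behaves in a current engine where strings are sequences of 16-bit units. This is a sketch that assumes ECMAScript 2015 or later syntax (the \u{...} escape, codePointAt, Array.from) and uses U+2DEFC, whose UTF-16 surrogate pair is \ud877\udefc; the measure helper is hypothetical.

```javascript
// Hypothetical helper: counts a string both ways.
function measure(s) {
  return {
    codeUnits: s.length,             // 16-bit units, as String length reports
    codePoints: Array.from(s).length // string iteration walks code points
  };
}

const escaped = "\u{2DEFC}";    // code point escape
const pair = "\ud877\udefc";    // the same character as a surrogate pair

console.log(escaped === pair);                 // true: both are two 16-bit units
console.log(measure(pair).codeUnits);          // 2
console.log(measure(pair).codePoints);         // 1
console.log(pair.codePointAt(0).toString(16)); // 2defc
```

Here the escape and the surrogate pair compare equal and have length 2, whereas under the proposal discussed above they would be different strings of length 1 and 2.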