Expectations around line ending behavior for U+2028 and U+2029

Logan Smyth loganfsmyth at gmail.com
Thu Oct 25 23:49:48 UTC 2018

> Tools that do not consider U+2028/29 to be line breaks are not behaving
> as they should according to the latest Unicode standard.

That's part of what I'm attempting to understand. What specifically does
Unicode require for these code points? What are the expectations for
languages that have differing definitions of line separators? The HTML spec
defines newlines in https://html.spec.whatwg.org/#newlines as CR and LF
only. Is that technically in violation of the Unicode spec then? If code
editors were to adopt U+2028 and U+2029 as line separators, is the
expectation that they would apply that to HTML files too, even though that
would put the editor's concept of a line in conflict with the
language's specification?

It seems unrealistic to expect that all tooling that processes source code
would adopt a new type of line separator. Given that, JS is the outlier.
Similarly, does Unicode make any guarantees about what counts as a line
terminator? If it changes in the future, would JS be forced to add that as
a type of LineTerminator as well? If it did, that could break existing
code, and if it doesn't, then JS would end up right back in the same place
with a concept of line numbers that differs from other tooling. CR and LF
are already the de facto standard. Is it really realistic to expect tooling
to _ever_ change? It is much more likely that JS will simply have specified
itself as a special case forever, one that tooling will never handle.
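
To make the mismatch concrete, here is a rough sketch of how the same source
offset maps to different line numbers under the two definitions. The helper
name lineNumberAt and the sample source are made up for illustration, and the
"tool" pattern is only my assumption about what a typical editor treats as a
line break, not a statement about any particular tool.

```javascript
// Line breaks per the ECMAScript LineTerminatorSequence grammar:
// LF, CR (optionally followed by LF), U+2028, and U+2029.
const JS_LINE_BREAK = /\r\n|[\n\r\u2028\u2029]/g;

// Assumed definition used by most other tooling: CRLF, CR, or LF only.
const TOOL_LINE_BREAK = /\r\n|[\n\r]/g;

// Hypothetical helper: 1-based line number of `offset` within `source`,
// counting line breaks according to the given pattern.
function lineNumberAt(source, offset, lineBreak) {
  const before = source.slice(0, offset);
  const matches = before.match(lineBreak);
  return (matches ? matches.length : 0) + 1;
}

// A U+2028 separates the first two statements; an LF precedes the throw.
const source = "let a = 1;\u2028let b = 2;\nthrow new Error('x');";
const offset = source.indexOf("throw");

console.log(lineNumberAt(source, offset, JS_LINE_BREAK));   // 3
console.log(lineNumberAt(source, offset, TOOL_LINE_BREAK)); // 2
```

A stack trace computed with the first definition would point at line 3, while
an editor using the second would show the throw on line 2.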

On Thu, Oct 25, 2018 at 3:10 PM Waldemar Horwat <waldemar at google.com> wrote:

> On 10/25/2018 09:24 AM, Logan Smyth wrote:
> > Yeah, /LineTerminatorSequence/ is definitely the canonical definition of
> > line numbers in JS at the moment. As we explore
> > https://github.com/tc39/proposal-error-stacks, it would be good to
> > clearly specify how a line number is computed from the original source. As
> > currently specified, a line number in a stack trace takes U+2028/29 into
> > account, and thus requires any consumer of this source code and line number
> > value to have a special case for JS code. It seems unrealistic to
> > expect that every piece of tooling that works with source code would have a
> > special case for JS code to take these 2 characters into account. Given
> > that, the choices are
> >
> > 1. Every tool that manipulates source code needs to know what type of
> > code it is handling so it can special-case JS in order to process
> > line-related information
> > 2. Every tool should consider U+2028/29 newlines, causing line numbers
> > to be off in other programming languages
> > 3. Accept that tooling and the spec will never correspond and the use of
> > these two characters in source code will continue to cause issues
> > 4. Diverge the definition of current source-code line from the current
> > /LineTerminatorSequence/ lexical grammar such that source line number is
> > always /\r?\n/, which is what the user is realistically going to see in
> > their editor
>
> The Unicode standard is the more relevant one here.  Choice 2 is the
> correct one per the Unicode standard.  Tools that do not consider U+2028/29
> to be line breaks are not behaving as they should according to the latest
> Unicode standard.
>      Waldemar