Expectations around line ending behavior for U+2028 and U+2029

Logan Smyth loganfsmyth at gmail.com
Mon Oct 29 18:04:00 UTC 2018


Sounds good. This means that the expectation, from the standpoint of the
Unicode spec, is that all existing parsers and tooling for all languages
would also be updated to have line numbering that includes U+2028/29, or
else that their line numbers would indefinitely be out of sync with the
line numbers rendered in an editor when a file contains these characters. I
assume it would be a breaking change for most languages to add U+2028/29 as
line terminators for line comments and potentially string content at a
minimum, meaning that even if editors render them as line separators, a
comment or string containing one would still be a single-line token
bridging multiple rendered lines in those languages.
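
To make the "single-line token bridging multiple lines" case concrete, here
is a minimal sketch in JavaScript. The scanLineComment helper is hypothetical
and stands in for a language whose spec only ends a line comment at \n; it is
not taken from any real parser. The new Function call shows the ECMAScript
side, where U+2028 does end the comment.

```javascript
// In ECMAScript, U+2028 is a line terminator, so it ends the line comment
// and the code after it runs.
const jsResult = new Function("// comment\u2028return 42;")();

// Hypothetical scanner (an assumption for illustration) for a language
// whose spec only treats \n as ending a line comment.
function scanLineComment(source, start) {
  const end = source.indexOf("\n", start);
  return source.slice(start, end === -1 ? source.length : end);
}

const source = "// first\u2028still comment\ncode();";
const comment = scanLineComment(source, 0);

// An editor that renders U+2028/29 as line breaks would draw this single
// comment token across this many visual lines.
const renderedLines = comment.split(/\u2028|\u2029/).length;

console.log(jsResult, JSON.stringify(comment), renderedLines);
```

Under these assumptions the comment is one token according to the language,
but an editor following Unicode would show it on two lines.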

I understand the desire for all this, but I hope it's understandable why
this situation is a little frustrating. There is no clean way to update any
given tool to treat U+2028/29 as newlines without either special-casing JS
or accepting that the tool's line numbers will not correspond with line
numbers from any tool that does not treat them as newlines, which is
realistically the vast majority of tooling and parsers. It is difficult to
make the argument to migrate any given tool when you're migrating away from
the current de facto behavior, even if you're migrating to the
Unicode-defined behavior, especially when it's not clear that the community
as a whole is even aware that they should be treating these as newlines in
the first place.
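
As a rough illustration of the mismatch, the sketch below reports the line
number of the same offset under two terminator sets. The lineOf function and
the LEGACY and UNICODE names are made up for this example; real tools track
positions in their own ways.

```javascript
// Illustrative helper (an assumption, not from any real tool): returns the
// 1-based line number of `offset`, given a set of line terminators.
function lineOf(text, offset, terminators) {
  let line = 1;
  for (let i = 0; i < offset; i++) {
    const ch = text[i];
    // Count a CRLF pair once, when the \n is reached.
    if (ch === "\r" && text[i + 1] === "\n") continue;
    if (terminators.includes(ch)) line++;
  }
  return line;
}

const LEGACY = ["\n", "\r"];                       // what most tools count today
const UNICODE = ["\n", "\r", "\u2028", "\u2029"];  // also counts U+2028/29

const text = "a\u2028b\u2029c\nerror here";
const offset = text.indexOf("error");

console.log(lineOf(text, offset, LEGACY));  // 2
console.log(lineOf(text, offset, UNICODE)); // 4
```

A tool that switches from the first set to the second would start reporting
line 4 for a location that every other tool still calls line 2.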




On Fri, Oct 26, 2018 at 4:52 PM Isiah Meadows <isiahmeadows at gmail.com>
wrote:

> So in other words, all these IDEs are broken and in violation of the
> Unicode spec. BTW, VSCode depends on Chrome, so it'll likely have most
> of the same behavior if it doesn't correctly account for them.
>
> -----
>
> Isiah Meadows
> contact at isiahmeadows.com
> www.isiahmeadows.com
>
> On Fri, Oct 26, 2018 at 5:49 PM Logan Smyth <loganfsmyth at gmail.com> wrote:
> >
> > Great, thank you for that resource Allen, it's helpful to have something
> concrete to consider.
> >
> > What you'd prefer is that other languages should also be rendered
> with U+2028/29 as creating new lines, even though their specifications do
> not define them as line terminators? That means that any parser for these
> languages that follows the language spec would then be outputting line
> numbers that would potentially not correspond with the code as rendered
> inside of the developer's editor, if the editor renders U+2028/29 as line
> separators? That would for instance mean that Rust's single-line comments
> could actually be rendered as multiple lines, even though they are a
> single line according to the spec.
> >
> > My frustration here isn't that the characters exist, it's just that in
> a world of explicitly defined syntactic grammars that depend on line
> numbers for errors and similar diagnostics, their behavior seems poorly
> defined, even if their behavior in text documents may have more meaning. For
> instance, here is XCode's rendering of 2028/2029
> >
> >
> > 2028 does seem to render as a "line separator" in that visually the code
> is on a new line, but it is rendered within the same line number marker as
> the start of that snippet of text. That seems to satisfy the behavior
> defined by Unicode, but it's not helpful from the standpoint of code
> looking to process source code. Should a parser follow that definition of
> line separator, where 2028 suggests rendering a new line but, since it
> does not start a new paragraph, the text stays part of the same paragraph? What is a
> paragraph in source code? Unicode has no sense of line numbers as far as I
> know, which means it seems up to an individual language to define what line
> number a given token is on.
> >
> >
> > > All of them recognise both characters as newlines (and increment the
> line number for those that display it).
> >
> > Revisiting my tests on my OSX machine, it seems like there is a
> difference in treatment of 2028 and 2029 that threw off at least some of my
> tests.
> > * VSCode: 2028 is a Unicode placeholder and 2029 seems to be rendered
> zero-width, no new lines
> > * Sublime 3: 2028/29 rendered zero-width, no new lines
> > * TextEdit: 2028 is a newline, 2029 is zero-width, no new lines
> > * XCode: Per above screenshot, 2028 creates a line but renders within
> the same line number, 2029 creates a new line number
> > * Firefox, Chrome, and Safari, with text in a <pre> or <textarea>,
> render them all on one line, zero-width, no new lines (though how HTML
> renders may just be a whole separate question)
> >
> >
> > On Fri, Oct 26, 2018 at 7:42 AM Claude Pache <claude.pache at gmail.com>
> wrote:
> >>
> >>
> >>
> >> >
> >> > Would it be worth exploring a definition of U+2028/29 in the spec
> such that they behave as line terminators for ASI, but otherwise do not
> increment things like line number counts and behave as whitespace characters?
> >>
> >> Diverging the definition of line terminator for the purpose of line
> counting on one side, and ASI and single-line comment on the other side, is
> adding yet another complication in a matter that is already messy. And I
> suspect that most tools that have issues with the former case, have issues
> as well with the latter case, so that half-fixing is equivalent to not
> fixing.
> >>
> >> If we want to "fix" the definition of line terminator somewhere, we
> should "fix" it everywhere.
> >>
> >> (Note that the recent addition of U+2028 and U+2029 inside string
> literals does not constitute a modification of the definition of line
> terminator in that context; it is rather allowing string literals to span
> multiple lines in some specific cases.)
> >>
> >> —Claude
> >