ASC parsing bug?

Michael Daumling mdaeumli at adobe.com
Mon Jun 16 14:33:23 PDT 2008


Actually, this was an acceptance test file.

If possible, asc should IMHO assume UTF-8 and, if UTF-8 decoding fails, retry with the default system encoding (CP-1252 on English Windows).
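
A minimal sketch of that fallback, written in Python for brevity rather than in asc's own code. The name `decode_source` and its `fallback` parameter are hypothetical and not part of asc; the platform's preferred encoding is assumed to stand in for the "default system encoding".

```python
import locale
from typing import Optional, Tuple


def decode_source(data: bytes, fallback: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, encoding_used). Hypothetical helper, not an asc API."""
    try:
        # Strict decoding: malformed UTF-8 raises instead of silently
        # turning into U+FFFD replacement characters.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: the platform's preferred encoding plays the role of
        # the "default system encoding" (e.g. cp1252 on English Windows).
        encoding = fallback or locale.getpreferredencoding(False)
        return data.decode(encoding), encoding
```

For example, `decode_source("Sören".encode("cp1252"), "cp1252")` falls through to the second branch. If the system default is itself UTF-8, the retry fails the same way, so the fallback only helps when the two encodings differ.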

For the Python rebuildtests scripts, someone should decide which encoding to use for the acceptance and performance tests, and add an argument to the asc command line that forces that encoding.

Which command-line argument would that be, BTW?

Michael
 

-----Original Message-----
From: Steven Johnson 
Sent: Monday, June 16, 2008 10:18 AM
To: Lars Hansen; Edwin Smith; Michael Daumling; tamarin-devel at mozilla.org
Subject: Re: ASC parsing bug?

Or, if we encounter a non-ASCII sequence, and there's no BOM and no explicit encoding specified, simply fail compilation with an explicit error.
Draconian but effective.
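
A rough sketch of that strict check, in Python for illustration only. `require_known_encoding` and `AmbiguousEncodingError` are hypothetical names, and only UTF-8 and UTF-16 BOMs are recognized here; how asc would actually report the error is not something this thread specifies.

```python
import codecs
from typing import Optional


class AmbiguousEncodingError(Exception):
    """Raised when non-ASCII bytes appear without a BOM or explicit encoding."""


def require_known_encoding(data: bytes, explicit: Optional[str] = None) -> str:
    """Return the encoding to decode with, or fail loudly. Hypothetical helper."""
    if explicit:
        # The caller told us the encoding; trust it.
        return explicit
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM and no explicit encoding: only pure ASCII is unambiguous.
    for offset, byte in enumerate(data):
        if byte >= 0x80:
            raise AmbiguousEncodingError(
                f"non-ASCII byte 0x{byte:02X} at offset {offset}; "
                "add a BOM or specify the encoding explicitly"
            )
    return "ascii"
```

The trade-off is that valid UTF-8 files without a BOM are rejected as soon as they contain a non-ASCII character, which is what makes the rule draconian.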


On 6/16/08 7:08 AM, "Lars Hansen" <lhansen at adobe.com> wrote:

> The problem is how we can know that we should /not/ be using UTF8 (so 
> that we can choose the default encoding).  Already ASC allows an 
> encoding to be specified explicitly, and UTF8 is the fallback from 
> that case.  (Not clear to me yet which of the clients of the compiler 
> actually pass an encoding and where they obtain it from.)
> 
> The only viable strategy I can think of is, if we encounter garbage in
> a file we thought was UTF8, to back up to the beginning and retry
> with the default encoding (if different from UTF8).  Probably
> works.  May not be worth the bother.
> 
> --lars
> 
>> -----Original Message-----
>> From: Edwin Smith
>> Sent: 16 June 2008 15:46
>> To: Lars Hansen; Michael Daumling; tamarin-devel at mozilla.org
>> Subject: RE: ASC parsing bug?
>> 
>> Maybe the best guess for asc is java's default system encoding in 
>> that case?
>> 
>>> -----Original Message-----
>>> From: tamarin-devel-bounces at mozilla.org
>>> [mailto:tamarin-devel-bounces at mozilla.org] On Behalf Of Lars Hansen
>>> Sent: Monday, June 16, 2008 9:07 AM
>>> To: Michael Daumling; tamarin-devel at mozilla.org
>>> Subject: RE: ASC parsing bug?
>>> 
>>> It appears to be the case that the ASC parser, if presented with an 
>>> input file that does not start with a BOM, will assume the file is 
>>> UTF8.  If the file is not actually UTF8 encoded but rather ascii 
>>> with some extended-ascii characters (as I assume your test case is) 
>>> then the parser (probably the Java input buffer layer actually) will 
>>> interpret those extended characters according to its own notions.  
>>> I'm not sure whether that should be considered a compiler error or 
>>> user error.  After all, there are no data available that allow the 
>>> parser to figure out the encoding of the file.
>>> 
>>> --lars
>>> 
>>>> -----Original Message-----
>>>> From: tamarin-devel-bounces at mozilla.org 
>>>> [mailto:tamarin-devel-bounces at mozilla.org] On Behalf Of Michael 
>>>> Daumling
>>>> Sent: 16 June 2008 08:41
>>>> To: tamarin-devel at mozilla.org
>>>> Subject: ASC parsing bug?
>>>> 
>>>> Hi,
>>>> 
>>>> During testing my String implementation, I found that ASC seems to
>>>> parse the string "Sören Lehmenkühler", which seems to be a fine 
>>>> German name, badly. Instead of the "ö" and "ü"
>>>> characters, the UTF-8 string in the ABC image contains 
>>>> REPLACEMENT_CHAR (0xFFFD).
>>>> 
>>>> Michael
>>>> 
>>>> _______________________________________________
>>>> Tamarin-devel mailing list
>>>> Tamarin-devel at mozilla.org
>>>> https://mail.mozilla.org/listinfo/tamarin-devel
>>>> 
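
The symptom in the original report at the bottom of the thread can be reproduced outside asc. The sketch below assumes the test file was saved as CP-1252, which the thread suspects but does not confirm, and uses Python's lenient UTF-8 decoding as a stand-in for whatever the Java input layer does with malformed bytes.

```python
NAME = "Sören Lehmenkühler"

# Assumption: the acceptance test file was stored as CP-1252, so "ö" and "ü"
# are the single bytes 0xF6 and 0xFC.
raw = NAME.encode("cp1252")

# Decoding those bytes as UTF-8 with replacement on error: 0xF6 and 0xFC are
# not valid UTF-8 lead bytes, so each becomes U+FFFD (REPLACEMENT CHARACTER).
decoded = raw.decode("utf-8", errors="replace")

print(ascii(decoded))  # 'S\ufffdren Lehmenk\ufffdhler'
```

The same bytes decode cleanly with `raw.decode("cp1252")`, which is why the choice of encoding, rather than the parser proper, looks like the culprit.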


