Binary Data - possible topic for joint session
Kris Kowal
kris.kowal at cixar.com
Sat Nov 14 13:34:34 PST 2009
[+ commonjs]
On Sat, Nov 7, 2009 at 6:21 PM, Maciej Stachowiak <mjs at apple.com> wrote:
> If nothing else there's quite a bit of prior art collected which should
> inform the conversation. I know the Binary/B proposal has the implementation
> momentum, but I don't know exactly what the status is. I haven't been
> following the evolution of these binary specs too closely, but since
> it seems that nearly everyone else from the group is off to jsconf.eu I
> figured I ought to toss this out there.
Thanks, we're back, and convergence on binary data APIs is our next
big thrust. I spoke with Brian Mitchell at jsconf.eu who has a
significant interest in our binary proposals. Particularly there was
a lot of interest in bit quantized data types, which in my opinion
would complement but not replace byte quantized data types. At this
point, I imagine that we will eventually have ByteString, ByteArray,
ByteStream, BitString, BitArray, and BitStream types, between our
"binary" and "io" module specifications.
> Binary/B feels largely right, but it has a few too many methods from Array
> simply because Array had them for my taste, specifically things like sort,
> reduce, shift, unshift etc.
In retrospect, I agree. I think our ByteArray could survive with a
very small subset of the Array API. Would anyone miss any of: push,
pop, shift, unshift, sort, reverse, splice, indexOf, lastIndexOf,
split, filter, forEach, every, some, map, reduce, reduceRight,
displace, extendLeft, extendRight? I imagine that the primary use
cases for ByteArray would be pipes and buffers: fixed-width, but
explicitly growable with length assignment. For these, the most common
operations would be copy(target, start, stop, targetStart) and
conversion to other types.
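To make the trimmed-down shape concrete, here is a minimal sketch of a fixed-width buffer that grows only through explicit length assignment and offers copy(target, start, stop, targetStart). It is not taken from any Binary/B draft: the class name ByteArraySketch, the bytes field, the default argument values, and the choice of Uint8Array as backing store are all assumptions made for illustration.

```javascript
// Hypothetical sketch; names and defaults are illustrative, not from a spec.
class ByteArraySketch {
  constructor(length) {
    this.bytes = new Uint8Array(length); // zero-filled backing store
  }

  get length() {
    return this.bytes.length;
  }

  // Assigning to length reallocates explicitly: the common prefix is kept
  // and any newly added bytes are zero.
  set length(newLength) {
    const resized = new Uint8Array(newLength);
    const keep = Math.min(newLength, this.bytes.length);
    resized.set(this.bytes.subarray(0, keep));
    this.bytes = resized;
  }

  // Copy this[start, stop) into target, beginning at targetStart.
  // Throws a RangeError if the span does not fit in target.
  copy(target, start = 0, stop = this.length, targetStart = 0) {
    target.bytes.set(this.bytes.subarray(start, stop), targetStart);
    return target;
  }
}

const source = new ByteArraySketch(4);
source.bytes.set([1, 2, 3, 4]);

const target = new ByteArraySketch(2);
target.length = 6;            // grow explicitly by length assignment
source.copy(target, 1, 3, 4); // copy bytes 2 and 3 to offsets 4 and 5
```

Nothing here grows implicitly: a copy that would overrun the target fails instead of extending it, which matches the pipe and buffer use cases.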
> (1) Binary/B does not have a cheap way to convert from the immutable
> representation (ByteString) to the mutable representation (ByteArray)
Apart from .toByteArray()? I imagine that implementations would be
able to track whether underlying buffer blocks are shared by multiple
ByteString or ByteArray data instances and support copy-on-write for
ByteArrays. I'm probably missing something. Perhaps you envision
something lower-level?
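One way to picture the shared-block idea is the following copy-on-write sketch. It assumes a simple owner count per underlying block; SharedBlock, CowBytes, share, get, and set are hypothetical names, and a real engine would presumably do this bookkeeping natively rather than in script.

```javascript
// Hypothetical copy-on-write sketch; all names are illustrative.
class SharedBlock {
  constructor(bytes) {
    this.bytes = bytes; // Uint8Array holding the actual data
    this.owners = 1;    // how many views currently share this block
  }
}

class CowBytes {
  constructor(block) {
    this.block = block;
  }

  static from(values) {
    return new CowBytes(new SharedBlock(Uint8Array.from(values)));
  }

  // Cheap "conversion": the new view shares the block instead of copying it.
  share() {
    this.block.owners += 1;
    return new CowBytes(this.block);
  }

  get(index) {
    return this.block.bytes[index];
  }

  // The first write to a shared block detaches this view onto a private copy.
  set(index, value) {
    if (this.block.owners > 1) {
      this.block.owners -= 1;
      this.block = new SharedBlock(this.block.bytes.slice());
    }
    this.block.bytes[index] = value;
  }
}

const original = CowBytes.from([10, 20, 30]);
const alias = original.share();
const sharedBeforeWrite = alias.block === original.block;
alias.set(0, 99); // triggers the copy; original is untouched
```

Under this scheme a conversion costs one counter increment, and the copy is paid only if one side actually mutates.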
> (2) In Binary/B, Array-like index access to ByteString gives back one-byte
> ByteStrings instead of bytes, likely an over-literal copying of String
This has been mentioned, but there is some value in over-literal
copying: algorithms that were written for Strings, but that really
operate on byte strings and struggle to do so with Strings, should
continue to function with ByteString. To that end, it may be
desirable for certain idioms to continue to function properly:
string[0].concat(string[1])
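The two indexing behaviors under discussion can be put side by side in a small sketch. It uses method calls rather than bracket indexing to stay short, and ByteStringSketch, charAt, and byteAt are hypothetical names chosen for illustration, not the Binary/B API.

```javascript
// Hypothetical sketch contrasting the two indexing results.
class ByteStringSketch {
  constructor(values) {
    this.bytes = Uint8Array.from(values);
  }

  get length() {
    return this.bytes.length;
  }

  // String-like indexing: yields a one-byte ByteStringSketch.
  charAt(index) {
    return new ByteStringSketch(this.bytes.subarray(index, index + 1));
  }

  // Array-like indexing: yields the byte as a Number.
  byteAt(index) {
    return this.bytes[index];
  }

  concat(other) {
    return new ByteStringSketch([...this.bytes, ...other.bytes]);
  }
}

const sample = new ByteStringSketch([0x4a, 0x53]);

// The String idiom keeps working when indexing yields one-byte strings.
const rejoined = sample.charAt(0).concat(sample.charAt(1));

// Numeric indexing is what arithmetic on bytes wants instead.
const firstByte = sample.byteAt(0);
```

With numeric indexing, the concat idiom would fail because Numbers have no concat method; with one-byte strings, arithmetic needs an extra conversion step.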
> (3) There are some seemingly needless differences in the interfaces to
> ByteString and ByteArray that follow from modeling on String and Array
I am not sure.
> (4) Binary/B has many more operations available in the base proposal
> (including charset transcoding and a generous selection of String and Array
> methods)
I think it will be desirable to trim down the ByteArray proposal.
I don't recall where, but there's also some hint that it would be good
to support conversion to various radix string representations,
certainly 16 and 64, but possibly also 2, 8, and 32 (either to the RFC
or Doug Crockford's proposal for human-error-resistant license keys).
I think that these ought to be folded into .toString(radix:Number) in
a future draft.
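As a rough sketch of what folding radix conversion into a single toString(radix)-style entry point might look like, the function below handles only radix 2 and 16, where every byte maps to a fixed number of digits. Radix 8, 32, and 64 group bits across byte boundaries and need padding rules, so they are deliberately left out. The name bytesToRadixString is an assumption for illustration.

```javascript
// Hypothetical helper; covers only the radixes with a whole number of
// digits per byte (2 and 16).
function bytesToRadixString(bytes, radix) {
  const digitsPerByte = { 2: 8, 16: 2 }[radix];
  if (digitsPerByte === undefined) {
    throw new RangeError("unsupported radix: " + radix);
  }
  // Zero-pad each byte so that every byte occupies the same width.
  return Array.from(bytes, (byte) =>
    byte.toString(radix).padStart(digitsPerByte, "0")
  ).join("");
}

const hexText = bytesToRadixString(Uint8Array.of(0, 255, 16), 16);
const bitText = bytesToRadixString(Uint8Array.of(5), 2);
```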
> (5) Different names - Data/DataBuilder vs. ByteString/ByteArray
I like ByteString. ByteArray is tending toward not being as strictly
Array-like, but I think it's also apt, from the perspective of users
implicitly understanding what kinds of operations are permitted on
ByteArrays based on their understanding of Arrays, like mutability and
resizability. I definitely don't like Data and DataBuilder for the
reasons Brendan outlined, but I definitely could see cases for Buffer
and Blob.
> On (1): cheap conversion from mutable to immutable
> (DataBuilder.prototype.release() in my proposal) lets binary data objects be
> built up with a convenient mutation-based idiom, but then passed around as
> immutable objects thereafter.
Ah, sure. That makes sense. My instinct is that under the hood, the
original byte array would not actually disappear but switch to
copy-on-write and transfer ownership of its underlying buffer to the
new ByteString. However, could this behavior not be folded up
transparently by toByteString()?
> On (2): I don't think a one-byte ByteString is
> ever useful, indexing to get the byte value would be much more helpful.
I agree this is debatable. I'm not ready to embark on a case study of
existing uses of Strings for binary data in JavaScript to explore what
methods are used, but there certainly is a corpus. The works of Jacob
Seidelin and Ama Chang come to mind; I've seen and massaged code for
most radix encodings, charset encodings, hashing algorithms, EXIF,
ID3, binary AJAX, ZIP archives, and the itinerant compression
algorithms like LZ77. They all use a combination of Array and String
operations, all operating on the octet invariant. It might be worth
looking into how easily these projects can be ported to these APIs.
> On (3), I think it's good for the mutable interface to be a strict superset of
> the immutable interface.
Also, not sure. I'm certain that there should be a body of common
methods so they can be used generically, but I'm not sure that it
should be exhaustive one way or the other. Perhaps in the course of
pruning ByteArray we'll converge on something a step away from
ByteString.
> My initial impression is that (1), (2) and (3) are all points on which my
> proposal is better.
> (4) and (5) are all points where perhaps neither proposal is at the optimum
> yet.
I think we can address (1) under the hood. I'm not sure about (2) and
(3); I've hitherto assumed that String/Array genericity would be
valuable. (4) is also contentious; Binary/B does "entrain" a lot of
necessary specification for charsets and radix encodings, although it
rather deliberately avoids specifying APIs for structure packing and
unpacking.
> On (4), I suspect the sweet spot is somewhere between my spartan set of
> built-in operations and the very generous set in Binary/B.
Agreed.
> On (5), I'm not
> sure either set of names is the best possible, and I'm certainly not stuck
> on my own proposed names.
Yes. It might be best to revisit nomenclature after the APIs settle.
Thanks,
Kris Kowal