encapsulated hash codes
mjs at apple.com
Wed Nov 4 20:52:08 PST 2009
On Nov 4, 2009, at 6:43 PM, Allen Wirfs-Brock wrote:
> See below
> From: Maciej Stachowiak [mailto:mjs at apple.com]
> Sent: Wednesday, November 04, 2009 4:50 PM
> To: Allen Wirfs-Brock
> Cc: es-discuss
> Subject: Re: encapsulated hash codes
> On Nov 4, 2009, at 2:49 PM, Allen Wirfs-Brock wrote:
> A straw man proposal for Harmony encapsulated hash codes has been
> posted to the Wiki at.
> 1) Wouldn't it be simpler to have a single Object.hash()
> function that returns a uniformly distributed number in the 0..2^32
> range, and leave the mod operation up to code using the hash code?
> Making a new function that (essentially) encapsulates the hash
> operation and the mod does not seem to pull its weight in API
> Yes, it would and that is the classic approach (perhaps excluding
> the uniformity of distribution) used by most languages. As an
> implementer I would be fairly happy with that approach. However,
> some of our members have significant concerns about the possibility
> of hash codes being used as a covert communications canner. This
> proposal tries to avoid that problem by giving each client of
> identity hashes (most likely each hash table) their own hash
> function which yields a distinct sequences of hash values (for the
> same sequence of objects). Since the functions aren’t shared the
> sequence of values they generate can be used as a channel.
How can you use it as a covert channel? If you can only produce a hash
code given a value, you need an overt communication channel to
transfer the value to use a hash function for covert communication. Is
there a more complicated scenario that I'm not thinking of?
> Requiring uniform distribution over a 32-bit range may not be
> practical. If a client only wants hashes in the 0.255 range the
> obvious function of a 32-bit value (mod 256) ignores most of the
> bits that actually distinguish the values.
Thats ok. For a good quality hash function, this will be distributed
just as uniformly as a hash that produces values 0..255. Uniform
distribution over a 32-bit range is quite practical, there are many
> The individual functions probably do a bit more than a simple mod.
> Typically, the seed that is used to generate per object hash values
> (for example, the allocation address of an object) has fewer than 32
> variable bits (for example, only the size range of the allocation
> nursery). If the hash function know range of hash values it is
> expected to generate it can more effectively use the actual variable
> bits to generate a good distribution.
I don't believe that is the case. With a good hash function, changing
any one bit in the input will on average change half the bits in the
output (and similar properties for correlations of multiple bits).
Thus, the low 28 bits of 32-bit hash code on values with only (say)
28 bits of entropy will be just as good as the best 28-bit hash code
you could make.
See here for some references: http://burtleburtle.net/bob/hash/#lookup
> 2) What about non-objects? It would be nice if strings, numbers, and
> perhaps other primitive types could hash by value rather than by
> identity, so you could reasonably mix them as hash keys with objects.
> Sure this can be added, but isn’t an essential primitive. Hash
> values can be easily generated today for all the built-in value
> types. But object identity hash requires some sort of primitive
True, but that way is much less convenient and likely to be less
> 2) Security considerations: it should be recommended that the
> hash code, if computed from the address of the object, is not
> reversible. (Of course, with a copying collector you probably can't
> compute the hash code from the address).
> Yes, that would be a good requirement. In practice, it’s usually
> the case anyway.
Actually, many popular non-cryptographic hash functions are fully or
partially reversible. It's a desirable property that certain passes of
computing the hash function are reversible. To guarantee what I
suggested you'd have to combine an unpredictable salt with the object
address or use a cryptographic hash (which is more expensive) or do
something more tricky.
> Actually, the address is a good seed hash value even for copying
> collectors. There are tricks that are known by lisp/smalltalk/etc.
> implementers for efficiently doing this. Happy to share… Ideally,
> identity hashes cost nothing to compute and require no additional
> per object storage. The ideal isn’t obtainable, but you can get
> pretty close.
I would be interested to hear bout these tricks.
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the es-discuss