encapsulated hash codes

Maciej Stachowiak mjs at apple.com
Wed Nov 4 20:52:08 PST 2009


On Nov 4, 2009, at 6:43 PM, Allen Wirfs-Brock wrote:

>  See below
>
> From: Maciej Stachowiak [mailto:mjs at apple.com]
> Sent: Wednesday, November 04, 2009 4:50 PM
> To: Allen Wirfs-Brock
> Cc: es-discuss
> Subject: Re: encapsulated hash codes
>
>
> On Nov 4, 2009, at 2:49 PM, Allen Wirfs-Brock wrote:
>
>
> A straw man proposal for  Harmony encapsulated hash codes  has been  
> posted to the Wiki at.
>
> 1)     Wouldn't it be simpler to have a single Object.hash()  
> function that returns a uniformly distributed number in the 0..2^32  
> range, and leave the mod operation up to code using the hash code?  
> Making a new function that (essentially) encapsulates the hash  
> operation and the mod does not seem to pull its weight in API  
> complexity.
>
> Yes, it would and that is the classic approach (perhaps excluding  
> the uniformity of distribution)  used by most languages.  As an  
> implementer I would be fairly happy with that approach.  However,  
> some of our members have significant concerns about the possibility  
> of hash codes being used as a covert communications canner.  This  
> proposal tries to avoid that problem by giving each client  of  
> identity hashes (most likely each hash table) their own hash  
> function which yields a distinct sequences of hash values (for the  
> same sequence of objects).  Since  the functions aren’t shared the  
> sequence of values they generate can be used as a channel.

How can you use it as a covert channel? If you can only produce a hash  
code given a value, you need an overt communication channel to  
transfer the value to use a hash function for covert communication. Is  
there a more complicated scenario that I'm not thinking of?

>
> Requiring uniform distribution over a 32-bit range may not be  
> practical.  If a client only wants hashes in the 0.255 range the  
> obvious function of a 32-bit value (mod 256) ignores most of the  
> bits that actually distinguish the values.

Thats ok. For a good quality hash function, this will be distributed  
just as uniformly as a hash that produces values 0..255. Uniform  
distribution over a 32-bit range is quite practical, there are many  
well-lnown algorithms.

>  The individual functions probably do a bit more than a simple mod.   
> Typically, the seed that is used to generate per object hash values   
> (for example, the allocation address of an object) has fewer than 32  
> variable bits (for example, only the size range of the allocation  
> nursery).  If the hash function know range of hash values it is  
> expected to generate it can more effectively use the actual variable  
> bits to generate a good distribution.

I don't believe that is the case. With a good hash function, changing  
any one bit in the input will on average change half the bits in the  
output (and similar properties for correlations of multiple bits).  
Thus, the low 28 bits of  32-bit hash code on values with only (say)  
28 bits of entropy will be just as good as the best 28-bit hash code  
you could make.

See here for some references: http://burtleburtle.net/bob/hash/#lookup

>
> 2) What about non-objects? It would be nice if strings, numbers, and  
> perhaps other primitive types could hash by value rather than by  
> identity, so you could reasonably mix them as hash keys with objects.
>
> Sure this can be added, but isn’t an essential primitive. Hash  
> values can be easily  generated today for all the built-in value  
> types.  But object identity hash requires some sort of primitive  
> support.

True, but that way is much less convenient and likely to be less  
performant.

>
> 2)     Security considerations: it should be recommended that the  
> hash code, if computed from the address of the object, is not  
> reversible. (Of course, with a copying collector you probably can't  
> compute the hash code from the address).
> Yes, that would be a good requirement.   In practice, it’s usually  
> the case anyway.

Actually, many popular non-cryptographic hash functions are fully or  
partially reversible. It's a desirable property that certain passes of  
computing the hash function are reversible.  To guarantee what I  
suggested you'd have to combine an unpredictable salt with the object  
address or use a cryptographic hash (which is more expensive) or do  
something more tricky.

> Actually, the address is a good seed hash value even for copying  
> collectors.  There are tricks that are known by lisp/smalltalk/etc.  
> implementers for efficiently doing this.  Happy to share…  Ideally,  
> identity hashes cost nothing to compute and require no additional  
> per object storage.  The ideal isn’t obtainable, but you can get  
> pretty close.

I would be interested to hear bout these tricks.

Regards,
Maciej

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20091104/b3e8e74e/attachment.html>


More information about the es-discuss mailing list