[rust-dev] Addressing "the hashtable problem" with type classes

Niko Matsakis niko at alum.mit.edu
Thu Dec 1 20:25:42 PST 2011


Here is a proposal to work around a potentially serious problem with the 
modular type classes we have been discussing.  The problem, as 
explained, is that collections (and possibly other data structures) can 
lose coherence if they are used with incompatible implementations of 
type classes like `hashable`.

 From <https://gist.github.com/1421744>:

# The "Hashtable Problem"

There is a danger with typeclasses only on functions because they
permit a lack of coherence.  This danger is particularly acute with
collections, where the definition of a particular interface---such as
hashable or ord---may be used in structuring the data structure. This
renders the data structure corrupt when used with a different
implementation.

To see the problem, consider a hashtable defined something like this:

     mod ht {
         iface hash {
             fn hash() -> uint;
             fn eq(other: self) -> bool;
         }

         type t<K,V> = { buckets: [bucket<K,V>] }

         fn create<K:hash,V>() -> @t<K,V> {
             ...
         }

         fn put<K:hash,V>(ht: @t<K,V>, k: K, v: V) {
             ...
         }

         fn get<K:hash,V>(ht: @t<K,V>, k: K) -> V {
             ... // fails if not present
         }
     }

This looks reasonable.  Now we create two modules, `Module1` and
`Module2`, which both share a reference to the same hashtable, but
which have different implementations of the `hash` interface in scope:

     mod Module1 {
         impl hash for uint {
             fn hash() -> uint { ret self; }
             fn eq(other: int) -> bool { ret self == other; }
         }
         ...
         fn foo() -> str {
             let h = ht::create();
             ht::put(ht, 3u, "hi"); // 3u.hash() == 3u here
             ret Module2::bar(h);
         }
     }

     mod Module2 {
         impl hash for uint {
             fn hash() -> uint { ret self / 2; }
             fn eq(other: int) -> bool { ret self == other; }
         }
         ...
         fn bar(h: @ht::t<uint, str>) -> str {
             ret ht_get(ht, 3u); // fails because 3u.hash() == 1u here
         }
     }

Here, `Module2::bar()` will fail to find any value for `3u` because it
is using an incompatible hash implementation.  The annoying problem
here is that both modules are correct in and of themselves, they are
just using mutually inconsistent definitions of `hash for uint`.

# A solution of sorts

We cannot completely rule out the possibility of writing code like
this.  I believe it is inherent in any solution that involves both
scoped implementations (i.e., not global, as in Haskell) and open
types (i.e., methods can be declared even after the type has been
defined, unlike Java or C#).

However, we *can* make it possible to write code that detects such
mismatches.  The idea has two parts.  First, we allow type variables
in types to have interfaces associated with them:

     type HT<K : hash,V> = { ... }

Next, we say that the value for a type variable is not just a type,
like `uint`, but rather a tuple `type : impl*` of a type and zero or
more implementations, one for each interface bound.  Therefore, the
full type of the hashtable `h` in our example would not be
`HT<uint,str>` but rather `HT<uint : Module1::hash, str>`.  However,
the user does not (usually) have to write the implementations by hand;
they are supplied by the compiler based on what is in scope for the
given type.

Now, armed with this new idea, the fully specified form of the
function `bar()` from `Module2` would be:

     fn bar(h: @ht::t<uint: Module2::hash, str>) { ... }

As a result, the compiler will specify a type error at the call
`Module2::bar(h)` in `Module1::foo()`, because the type of the
parameter `h` in `bar()` does not match the type of the value being
provided (the implementation modules do not match).

## Default implementations

When the value of a type variable is written explicitly, it can be
written either `ty:impl1,impl2` (explicit implementations for each
interface) or just `ty`.  If only a type is specified, the compiler
will look for imported implementations and insert the default value.

One important case is when the type is itself a bounded type variable,
as here:

     fn put<K:hash,V>(ht: @t<K,V>, k: K, v: V);

The type `@t<K,V>` does not provide an explicit implementation for
`K`.  The compiler defaults it to whichever implementation is provided
as part of the type variable `K` by the caller.  This is because, for
a type variable, the implementations in scope are those that are
supplied by the caller.

# What if I don't care which hash function is being used?

Now, you might argue that it is still weird that the imports in
`Module2::bar()` are affecting what hash function is used for a
hashtable that is being passed in as a parameter, for which the hash
function has already (presumably) been specified when the hashtable
was created.  There are pluses and minuses to this.

The plus is that this full specification would allow for a very high
degree of optimization if we did monomorphizing: there would be no
dynamic calls at all involved in using the hashtable.  Also, so long
as you use a consistent set of implementations across most of your
methods, there is no annotation overhead since the detail
implementations that the compiler supplies will all match.

However, it would still be nice to be able to write a function that
operates over any hashtable that uses `uint` as its key type,
regardless of what hash function is in use.  This is basically
polymorphism.  Thankfully, we have an answer to that: interfaces.  So,
we define an interface where the `K` type variant is not necessarily
hashtable (after all, this interface could also be fulfilled by a
binary tree or some other kind of map):

     iface map<K,V> {
         fn put(k: K, v: V);
         fn get(k: K) -> V;
     }

And now we have to implement the interface for the type `ht::t`:

     impl<K:hash,V> map<K,V> for ht::t<K,V> {
         fn put(k: K, v: V) { ret ht::put(self, k, v); }
         fn get(k: K) { ret ht::get(self, k); }
     }

This interface hides the precise hash implementation, encapsulating it
so to speak.  There are some niggling implementation details to be
worked out, as the compiler will (in a dynamic setting) have to stash
the hash implementation pointer somewhere so it is available for use
in the `put()` and `get()` functions.  In a monomorphic setting,
customized variants of `put()` and `get()` would be generated in any
case.



More information about the Rust-dev mailing list