CachedOpEnvironment, non-unique cache keys?

@hinerm? others? I’m running into a situation where a few different Ops are returning outputs of the wrong type. For example, the DefaultHuMoment2 op is returning a Histogram1d instead of a RealType. Since a Histogram1d output type is never specified for these types of ops, I suspected the cache of the CachedOpEnvironment that is associated with the op was returning a result for the op that was inappropriate. When I look into the CachedOpEnvironment, it looks a Hash object (a simple utility class internal to CachedOpEnvironment that can create a combined hash for a combination of op-input-args) is used as the cache key for previously computed ops (e.g., cache.get(Hash)). Within the Hash class, the equals method is as follows.

@Override
public boolean equals(final Object obj) {
    if (obj == this) return true;
    if (obj instanceof Hash) return hash == ((Hash) obj).hash;
    return false;
}

Given that hashcodes aren’t necessarily unique, am I right in thinking that the equals method for one Hash object could return true (however unlucky) even if the compared object utilizes a different op-input-args combination? It would seem to explain the symptoms I’m seeing. Sorry I don’t have a small/simple reproducible example of this as it seems to only happen reliably with A LOT of ops computations and thus a large cache.

If we do want unique keys, I suppose one option would be to just use if (obj == this) return true; in Hash.equals(object) but it seems like we are creating a new instance of a Hash object each time to access items in the cache. So it would always return false and thus doesn’t seem viable. (I suppose maybe each CachedOp could hold a single instance of a Hash object for reuse?)

Alternatively, I suppose, the Hash could potentially keep (weak?) references to items used to instantiate it (i.e., op, input, args) to allow testing after testing of hash equality such as this.delegate == ((Hash) obj).delegate or this.delegate.equals(((Hash) obj).delegate) etc. Potentially this would retain some performance but address the uniqueness issue.

Thoughts? Alternatives? Suggestions? Am I off-base here?

Thanks!!

Jay

@dietzc is the author of this class so he would know best.

At a glance, it seems there is significant opportunity for clashes due to the construction of the hash.

First of all, by using the hash of the name of the second object’s class, we are discarding information. The hash is modified by the arguments but if there are no distinct arguments this seems dangerous, especially given XOR is commutative.

Second, XOR doesn’t seem like the right choice here. There’s some discussion as to why and when XOR is useful. I wonder if simply updating the hash generation to use prime multiplication would be sufficient?

1 Like

@jaywarrick @dietzc - I put a test Hash implementation on a branch… could you give it a try?

1 Like

Hi,

I just wrote some stuff here, but as @hinerm already worked on the prime multiplication, we can first check this. Just two comments:

  • I don’t want to enforce object identity on keys.
  • We shouldn’t use WeakReferences as keys, as I sometimes really want to use the hash for comparison and some caches (e.g. guava) will only use the system hash instead of the hash implementation of the class itself if weak keys are used.

Jay, can you test @hinerm changes?

1 Like

I will test the changes.

Are we still working under the assumption though that when you call cache.get(Hash) that the equals method of Hash will not erroneously return true? In other words, what if by chance the hashes are equal for two objects that we think should be considered different? I know the chance is very remote but evidently not impossible. It seems by relying on the comparison of the hash, this can still happen, no?

Cheers,

Jay

For now let’s assume that’s true. You’re right that there is always risk of collision, but it is highly suspicious if you encounter a collision more than like… once… ever… :smile:

1 Like

Got the same issue, but I’ll look into it a little more to definitively say it is a Hash class issue.

1 Like

@dietzc, I modified Hash to save a string version of the information regarding what was “hashed” and found what I suspected… The Hash test code is in [1]. Here is the result during the image feature extraction run with the debug point set to the line “System.out.println(“What the heck!”)”

Hash#1 - hashcode = -513759485 and Info String = “net.imglib2.roi.util.SamplingIterableInterval@6056ac7b;net.imagej.ops.stats.IterableMin@399704d5;”

Hash#2 - hashcode = -513759485 and Info String = “net.imglib2.roi.util.SamplingIterableInterval@4a9e84e5;net.imagej.ops.imagemoments.normalizedcentralmoments.DefaultNormalizedCentralMoment03@2d22934f;”

It seems the hash generation is not adequately unique or random enough, or more likely that [2] is happening in my case. This reference suggests that when the number of objects in the cache reaches ~77k objects (for int-based hashcodes), the probability of hashcode collision is ~50%. The cache I am using is up to ~198k objects when this debug point is reached first. This is in contrast to SHA1 hashcodes which are 160bit [3].

What’s scary is that I wouldn’t have noticed this had the ops not returned different object types. For cases where the return types are the same, it just produces the wrong value without you ever knowing, which is bad for a lot of reasons.

If this is correct, would you agree that a different hashing scheme or Hash.equals() method should be implemented? I can obviously do it but would like to have direction from you all on the best general approach.

Thanks!

Jay

[1]

 /**
 * Simple utility class to wrap two objects and an array of objects in a
 * single object which combines their hashes.
 */
private class Hash {

	private final int hash;
	private final String info;

	public Hash(final Object o1, final Object o2, final Object[] args) {
		
		StringBuilder b = new StringBuilder();
		b.append(o1.toString());
		b.append(";");
		b.append(o2.toString());
		b.append(";");
		// Implement hash joining algorithm from Jon Skeet on SO:
		// http://stackoverflow.com/questions/263400/what-is-the-best-algorithm-for-an-overridden-system-object-gethashcode
		long hash = 17;

		hash = hash * 23 + o1.hashCode();
		hash = hash * 23 + o2.hashCode();

		for (final Object o : args) {
			b.append(o.toString());
			b.append(";");
			hash = hash * 23 + o.hashCode();
		}

		this.hash = (int) hash;
		this.info = b.toString();
	}

	@Override
	public int hashCode() {
		return hash;
	}

	@Override
	public boolean equals(final Object obj) {
		if (obj == this) return true;
		if (obj instanceof Hash)
		{
			boolean ret = hash == ((Hash) obj).hash;
			if(ret && !info.equals(((Hash) obj).info))
			{
				System.out.println("What the heck!");
			}
			return ret;
		}
		return false;
	}
}

[2]

From http://eclipsesource.com/blogs/2012/09/04/the-3-things-you-should-know-about-hashcode/
"You may object that, unlike the printer’s type case, in Java there are 4,294,967,296 compartments (232 possible int values). With 4 billion slots, collisions seem to be extremely unlikely, right?
Turns out that it’s not so unlikely. Here’s the surprising math of collisions: Please imagine 23 random people in a room. How would you estimate the odds of finding two fellows with the same birthday among them? Pretty low, because there are 365 days in a year? In fact, the odds are about 50%! And with 50 people it’s a save bet. This phenomenon is called the Birthday paradox. Transferred to hash codes, this means that with 77,163 different objects, you have a 50/50 chance for a collision – given that you have an ideal hashCode function, that evenly distributes objects over all available buckets.

[3] from the same article as [2]
“You may know that cryptographic hash codes such as SHA1 are sometimes used to identify objects (Git does this, for example). Is this also unsafe? No. SHA1 uses 160-bit keys, which makes collisions virtually impossible. Even with a gigantic number of objects, the odds of a collision in this space are far below the odds of a meteor crashing the computer that runs your program. This article (http://preshing.com/20110504/hash-collision-probabilities/) has a great overview of collision probabilities.”

1 Like

I agree with @jaywarrick that the equals method of Hash should be changed. And maybe the class name should change: it is really just a Tuple of objects, which should only be equal if all the constituent elements are equal. Collisions with 31-bit hash codes are a fact of life, but as long as two keys are not actually equal as defined by equals, the Java collections classes deal with it OK.

I updated @hinerm branch with two commits. One fixing an issue introduced by the previous refactoring of the hash-codes and another one suggesting an equal methods which is a bit safer. We could save all hashes of the objects, however, keep in mind that these objects serve as keys for a cache and therefore shouldn’t be too memory consuming (especially, we shouldn’t keep references on the input arguments itself).

I agree with you that the memory burden gets fairly large if we try to keep references to inputs. Last night I tried essentially the Tuple approach and ran out of memory consistently. However, I’m also reluctant to continue to use hashes in the test of equals in your proposed hybrid approach. Although now equals will test for an identical op class, I can see using an op 10’s of thousands of times (e.g., cells in an image), which would put us into the 1% chance for collision realm per op. Certainly better per op but maybe not a whole lot better overall given there could be many ops used like this, so the chance of seeing the issue for at least one op still seems significant. It just makes me uneasy to run a computation and have a realistic chance of not getting the right result.

I’m trying to abstract this problem in my head to potentially suggest an alternative strategy but it hasn’t quite solidified yet.

In the meantime, could I suggest an adjustable parameter in CachedOpEnvironment for turning caching on an off (default on). Looking at it out right now, at least at first glance, it doesn’t seem too hard to implement or too fraught with peril. Agree? I might opt for less optimized computation but lower memory overhead and certainty in the result for now. Thoughts?

Why would you ever turn off caching in the CachedOpEnvironment? Just use the default OpEnvironment.

About hashing: OK. So what do you suggest? Should we check the hashes of each entry of input, delegate and args? I doubt that you will ever face the situation where all of these guys have the same hashes but are different objects.

Re: toggling caching - This is probably a result of my poor understanding of how to use OpEnvironments. As I look at it now, is it potentially as simple as … Instantiate a feature set I want to use, which in my case by default sets the OpEnvironment to be a CachedOpEnvironment, then just set the environment to be a new CustomOpEnvironment, thereby replacing the CachedOpEnvironment? Therefore, the feature set will no longer have the CachedOpEnvironment typically associated with a, AbstractCachedFeatureSet it already extends, but that is ok? I was largely just hoping to not have to implement a parallel set of feature sets.

Re: An alternative approach for Hash.equals - not sure if I have a solution. Maybe just some thoughts for now…

  • There might be a way to calculate the probabilities associate with your suggestion. You might be able to make an assumption of independence and thus just multiply the probabilities.

  • It seems like for feature extraction, most calculations are unique from an external / high-level perspective. In other words, I’m typically not asking to recalculate the same features for a given label region, but internally there are calculations that could be reused a lot. I wonder if there could be many little caches controlled by individual ops. Thus, this cache could be garbage collected when the op is done being used? So, through initialization of ops, you can control what ops stick around for how long, and thus the granularity of caching. This thought isn’t fully fleshed out but I’m just trying to avoid caching at the highest level where most calculations are unique and thus essentially dead weight to the cache. I suppose unless, for example, you are hoping to speed up rerunning of a feature set on a set of images (I suppose where only a subset of the input images have changed). My assumption was that the primary purpose was to reuse the a lot of the lower level calcs internal to the ops.

  • Potentially we could try to pull in a dependency for creating something like SHA1 hashes but probably it is limited by the hashcode implementations of the existing classes which will only produce integer length hashcodes. ?

Ok what about the following: Hash calculation stays as is. However, equals just uses object identity if the delegate classes are the same? Then we have the best of all worlds: In the very rare cases where we have the situation that the delegate classes are the same AND the hash is the same but the arguments are different, we are safe and just compute the result again. However, in all other cases we can use the cache, also safe.

Does this make sense?

I think I get it. If the delegate is different, then recalculate. If it is the same, and the hash is different, then recalculate. If the delegate is the same and the hash is the same, then check arg (not input) object identity.

My concern is that a given op with a particular set of args might be run 10’s of thousands of times (e.g., mean over each each cell label region in a large image of cells). You might even run this function with those args on many images to get into 100’s of thousands or millions. The only defining characteristic between these cached results will be the hash given the delegate and args are the same (often no args). So we still run into a ~1-100% chance of experiencing collision sometime during that run for the 10,000 cell and 1,000,000 cell case.

My suspicion is that using hashes as a surrogate for testing input inequality will be very tough given how many inputs we’ll often have.

Just trying to think of something else but not sure if my thoughts are heading in the right direction… If we would be forced to use object identity, then I suppose memory (and speed?) becomes a big deal. If individual ops could maintain a small cache (maybe “parent ops” share with “children ops” or some organization like that could be useful?), when they are done being used, the op and it’s associated cache could be garbage collected. This way, object equality could be tested for but potentially with less of a balloon in memory? But, it seems like sharing of calculations across ops (e.g., mean is probably used in a lot of ops) is one of the places to obtain an appreciable gain in performance. Just rambling now…

Thoughts?

Thanks

That’s the idea about the CachedOpEnvironment: Avoid recalculation of features. For example in the Haralick features we heavily make use of this. If you don’t want any caching, just use the DefaultOpEnvironment. If you are forced to use the CachedOpEnvironment because I used it in my obsolete branch, then you have to implement your own FeatureSet hierarchy. Feel free to copy the code.

About hashes: If you run the very same op on millions of cells, collisions might happen, true. However, it is even more unlikely that the hash ofthe simplename of a class is the same and the complete hash of the Hash object is the same. We can add another constraint on the hash of the input if may not be the same (see https://github.com/imagej/imagej-ops/tree/test-hash).

I don’t see a way around it. Do you @hinerm / @ctrueden? We really can’t enforce object equality at all.

If I understand correctly, I think the “best” we can do with input and arg hashes is to test delegate object equality and individually test hash equality of the input and each arg. This minimizes the chance of inadvertent collision upon combining hash keys (essentially expanding the overall bits for hashing) while additionally testing for number of args. However, even going all the way to this scenario can’t fix the particular situation I’m worried about where delegate and args are equal but input is not and you have a LARGE number of inputs.

I think we are on the same page but just to document this particular problem scenario with “math” for discussion…

Example

  • Object 1: input=A, delegate=B, arg=C
  • Object 2: input=D, delegate=B, arg=C

If

  • hash(A)=x=inputHash for object 1
  • hash(B)=y
  • hash©=z
  • hash(D)=x=inputHash for object 2 == hash(A) (which will occur with relative certainty for large numbers of inputs.)

Then

  • totalHash(A,B,C)=((17*23 + x)*23 + y)*23 + z = alpha
  • totalHash(D,B,C)=((17*23 + x)*23 + y)*23 + z = alpha

And, Hash.equals…

  • (Hash) obj).delegate == delegate && ((Hash) obj).inputHash == inputHash && (Hash) obj).totalHash && totalHash
  • which evaluates to B == B && x == x && alpha == alpha

Thus, regardless of how we combine things, the problem is that we have limited hashing for a large number of inputs. I.e, whether you use hashes, combined hashes, or object equality for the delegate and args, if inputHashes are inadvertently equal, we have a high chance of failure when input numbers are large with identical delegate and args.

Unless we can create multiple hashes/characteristics that can discriminate inputs, we can’t really expand our hashing capability sufficiently for the number of inputs that can realistically run into. For example, maybe and SamplingIterableInterval could produce a typical hash and an additional hash for the source of sampling and an additional number for the size / location of the interval. We can’t combine the hashes into a single 32-bit int, otherwise we are back to where we started but, if we keep the hashes separate we might have a chance. Maybe we could check to see if the input implements “SuperHash” which can return an arbitrary number of hashes or numbers. If so, test for equality on each individual hash of the SuperHash? I suppose we can technically do this for anything (e.g., args) that we think might be potentially numerous. We are limited to applying the concept of SuperHashing to classes we maintain, but, maybe this at least provides a way to fix issues in the future, especially with automatic converters that could wrap troublesome classes for SuperHashing.

Re: test-Hash branch - I think technically, the equals method you linked to [1] currently leaves out a condition for testing “equality” of the totalHash, which is something I thought you were suggesting in the text. If testing for object equality on the delegate somehow tests the args as well (e.g., the delegate is final and the internal args / params can’t be adjusted), then I guess you are covered and would be why the totalHash is not used in the equals method. In that case, I suppose then totalHash is just used to diversify locations of data in the HashMap for perfomance? If this wasn’t your intention, then we probably need a condition for totalHash equality as well to avoid arg equality. Right?

[1]

if (obj instanceof Hash) return ((Hash) obj).delegate == delegate && ((Hash) obj).inputHash == inputHash;

Shouldn’t it be

  • totalHash(A,B,C)=((17*23 + x)23 + x)23 + y = alpha
  • totalHash(D,B,C)=((17*23 + z)23 + x)23 + y = beta

in your example? I think you wanted to say that if hash(A) == hash(D) we have a problem, right? As we reuse the other parameters of the Op I agree, this only depends how likely it is that two inputs have the same hash (because everything is way more unlikely: different class, different args, different input each individual object has the same hash as another op with different delegate, different args (but same number of args) and different input).

So the situation that two totalHashes + all individual hashes are the same is very, very unlikely, unless for the cases where we reuse an Op with different inputs. We could either rely on the hash implementations of the inputs OR you somehow reset your cache after each computed object? You can do that by writing your own CacheService with higher priority and access this service in your code.

Re: Math - Man! I think I have dyslexia sometimes. Actually, based on the definition of Object 1 and 2, my initial error that propagated was in the If statement. We are on the same page though. I’ve updated the example. The “Then” statement is as it was supposed to be.

Re:

So the situation that two totalHashes + all individual hashes are the same is very, very unlikely, unless for the cases where we reuse an Op with different inputs.

I still think the best case scenario for the current Hash class is to test for delegate object identity and identity of individual hashes for the input and args. This way at least the output type is guaranteed (I think) to be the same. If you might not have the same delegate, then your output type can be different and you get a type conversion error after you retrieve your item from the cache (i.e., what I had happen to me).

Re:

Yeah. I was wondering about the latter. What do you mean by “computed object”.

Thanks

next iteration on branch. does this look good now? By “computed object” I mean: If you processed an object, i.e. extracted features for it, you actually can empty your cached because the caching only makes sense per object.