Friday, October 13, 2017

Notes on Java (hash) maps

Maps

Maps are one of the most common data structures, used every day by many programmers. In Java, the most popular implementation is HashMap, which uses open hashing (chaining) over an array to achieve constant-time (O(1)) performance of get/put operations (barring degenerate situations - for example when every hashCode() call returns the same constant).
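To illustrate that degenerate case, here is a minimal sketch (the class names are mine, purely for illustration) of a key whose hashCode() always returns the same constant - every entry lands in the same bucket, so lookups degrade to a linear scan:

    import java.util.HashMap;
    import java.util.Map;

    // A deliberately broken key: constant hashCode() means
    // all instances collide into a single bucket.
    final class BadKey {
        private final int id;
        BadKey(int id) { this.id = id; }

        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }

        @Override public int hashCode() { return 42; } // constant - worst case
    }

    public class BadKeyDemo {
        public static void main(String[] args) {
            Map<BadKey, String> map = new HashMap<>();
            for (int i = 0; i < 10_000; i++) {
                map.put(new BadKey(i), "value" + i);
            }
            // This get() has to scan the single overcrowded bucket.
            System.out.println(map.get(new BadKey(9_999)));
        }
    }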

Java HashMap

There are a few things concerning this topic that should be interesting even if you're not that into algorithms / data structures ;-)

First of all, the underlying implementation and possible performance benefits. As mentioned, Java HashMap uses open hashing, which means that entries falling into the same bucket (the array index calculated from the hash) are put on a list. Therefore, in order to find the value for a particular key, the list needs to be iterated. This is of course an O(n) operation (where n is the number of entries in the list).
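To make the mechanics concrete, here is a minimal sketch of open hashing (all names are mine, not taken from the JDK - the real HashMap is considerably more involved): each bucket is a linked list that is scanned on lookup.

    import java.util.LinkedList;

    // Simplified chaining hash map - fixed size, no resizing.
    final class ChainedMap<K, V> {
        private static final int BUCKETS = 64;

        private static final class Entry<K, V> {
            final K key;
            V value;
            Entry(K key, V value) { this.key = key; this.value = value; }
        }

        @SuppressWarnings("unchecked")
        private final LinkedList<Entry<K, V>>[] table = new LinkedList[BUCKETS];

        void put(K key, V value) {
            int i = (key.hashCode() & 0x7fffffff) % BUCKETS;
            if (table[i] == null) table[i] = new LinkedList<>();
            for (Entry<K, V> e : table[i]) {
                if (e.key.equals(key)) { e.value = value; return; } // update
            }
            table[i].add(new Entry<>(key, value)); // append on collision
        }

        V get(K key) {
            int i = (key.hashCode() & 0x7fffffff) % BUCKETS;
            if (table[i] == null) return null;
            for (Entry<K, V> e : table[i]) { // O(n) scan of the bucket
                if (e.key.equals(key)) return e.value;
            }
            return null;
        }
    }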

Java 8, however, implements a nice solution to this problem. According to this blog post: http://www.nurkiewicz.com/2014/04/hashmap-performance-improvements-in.html, Java switches the bucket implementation from a list to a balanced tree after a specific threshold is exceeded (TREEIFY_THRESHOLD = 8 entries at the time of writing). Thus, worst-case performance improves from O(n) to O(log n). What is particularly interesting is the balancing of the tree. For entries with different hashes, the hash value is used to compare entries. What if the hash values are equal? Well, in that case the implementation hopes that the keys implement the Comparable interface. If that doesn't hold, the tree effectively degenerates into a list, so in case of heavy collisions no performance benefit should be expected.
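A minimal sketch of how to cooperate with this mechanism: the same colliding key as before (hypothetical names again), but now implementing Comparable, so the treeified bins can order entries by compareTo() and lookups stay O(log n) even though every hashCode() collides:

    // Same constant hashCode() as BadKey above, but Comparable -
    // Java 8 treeified bins can use compareTo() to order entries.
    final class ComparableBadKey implements Comparable<ComparableBadKey> {
        private final int id;
        ComparableBadKey(int id) { this.id = id; }

        @Override public boolean equals(Object o) {
            return o instanceof ComparableBadKey && ((ComparableBadKey) o).id == id;
        }

        @Override public int hashCode() { return 42; } // still the worst case

        @Override public int compareTo(ComparableBadKey other) {
            return Integer.compare(id, other.id);
        }
    }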

Robin Hood (RH) hashing 

As already mentioned, the Java stdlib implementation uses the open hashing approach. There is, however, a competing closed hashing (open addressing) approach. In case of a collision, a closed hashing algorithm probes for another array index (for example the next slot, in linear probing) until a free spot is found. This obviously means a performance penalty in high-collision scenarios compared to open hashing, where the entry would simply be put at the end of the list. For get it's similar, as the algorithm needs to go slot by slot comparing keys. Performance of course gets worse as the load factor of the underlying table grows.
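A minimal sketch of closed hashing with linear probing (names are mine; real implementations add resizing, deletion with tombstones, etc.): on a collision the algorithm simply tries the next slot until a free one is found.

    // Simplified open-addressing map with linear probing.
    // No resizing - assumes the table never fills up.
    final class LinearProbingMap<K, V> {
        private static final int CAPACITY = 64;
        private final Object[] keys = new Object[CAPACITY];
        private final Object[] values = new Object[CAPACITY];

        void put(K key, V value) {
            int i = (key.hashCode() & 0x7fffffff) % CAPACITY;
            // Probe until a free slot or the same key is found.
            while (keys[i] != null && !keys[i].equals(key)) {
                i = (i + 1) % CAPACITY; // linear probing step
            }
            keys[i] = key;
            values[i] = value;
        }

        @SuppressWarnings("unchecked")
        V get(K key) {
            int i = (key.hashCode() & 0x7fffffff) % CAPACITY;
            while (keys[i] != null) {
                if (keys[i].equals(key)) return (V) values[i];
                i = (i + 1) % CAPACITY; // keep probing past other keys
            }
            return null; // hit an empty slot - key not present
        }
    }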

I'd like to show here a neat idea for improving over the typical linear search, called Robin Hood hashing. The idea is quite old - it was first published in 1986 (paper here: https://cs.uwaterloo.ca/research/tr/1986/CS-86-14.pdf). Why such a name? Well, the idea is basically that some entries will get a free (insertion) slot sooner (with a lower number of tries) than others, thanks to their "better" hash code value. Therefore, the algorithm can put entries with "worse" hash codes closer (and move the "better" ones to further spots). Thus, it's like taking from the "rich" and giving to the "poor" :-)

The implementation itself is really straightforward: for each entry, a probe length (the distance between its originally calculated position and its current position) is kept. In case of a collision, if the probe length of the new entry is bigger than the probe length of the current one, the new one is inserted in its place (as the "poorer" of the two) and the insert continues with the previously stored entry. If the probe length of the new one is smaller, the next index is tried (and the probe length of the new entry increases). Probe lengths will therefore gradually even out, as the sketch below shows.
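A minimal sketch of the Robin Hood insert itself (again my own names, not a library API): each slot remembers its probe length, and a "richer" resident entry (shorter probe) is swapped out in favour of the "poorer" incoming one.

    // Simplified Robin Hood insert - fixed size, no resizing or deletion.
    final class RobinHoodMap<K, V> {
        private static final int CAPACITY = 64;
        private final Object[] keys = new Object[CAPACITY];
        private final Object[] values = new Object[CAPACITY];
        private final int[] probeLengths = new int[CAPACITY];

        void put(K key, V value) {
            Object k = key;
            Object v = value;
            int probe = 0; // distance from the originally calculated slot
            int i = (k.hashCode() & 0x7fffffff) % CAPACITY;
            while (keys[i] != null) {
                if (keys[i].equals(k)) { // same key - just update the value
                    values[i] = v;
                    return;
                }
                if (probeLengths[i] < probe) {
                    // Resident entry is "richer" (shorter probe): take its slot
                    // and continue inserting the evicted entry further on.
                    Object tk = keys[i];      keys[i] = k;              k = tk;
                    Object tv = values[i];    values[i] = v;            v = tv;
                    int tp = probeLengths[i]; probeLengths[i] = probe;  probe = tp;
                }
                i = (i + 1) % CAPACITY;
                probe++;
            }
            keys[i] = k;
            values[i] = v;
            probeLengths[i] = probe;
        }
    }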
According to this article: https://www.sebastiansylvan.com/post/robin-hood-hashing-should-be-your-default-hash-table-implementation/, Robin Hood hashing is faster on current architectures because it incurs fewer cache misses (due to the lower variance of probe counts). There have been some attempts to measure the performance benefits of this approach; one of the best can be found here: http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion. To date, Rust is the only major language whose standard hash map uses Robin Hood hashing.

Key takeaways


  • the Java hash map implementation is very fast, and in Java 8 even faster (balanced trees in buckets)
  • it's worthwhile to implement Comparable for keys used in a hash map
  • Robin Hood hashing is an interesting idea that's worth trying, especially under memory constraints (high load factors - 0.9 or even 0.95 - without a big performance hit) or other non-functional requirements (it seems reasonable for disk storage due to its linear access pattern)
