piątek, 13 października 2017

Notes on Java (hash) maps


Maps are one the most common data structures, used everyday by many programmers. In Java, the most popular implementation is HashMap, which uses open hashing (chaining) over an array to achieve constant-time (o(1)) performance of get/put operations (barring degenerated situation - for example when hashCode() call returns constant number).

Java HashMap

There are few things concerning this topic that should be interesting even if you're not that into algorithms / data structures ;-)

First of all, the underlying implementation and possible performance benefits. As said, Java HashMap is an open hashing one, which means that entries with the same bucket (array index calculated from hash) are put on a list. Therefore, in order to find a value for a particular key, the list needs to be iterated. This is of course o(n) operation (where n means number of entries in the list).

Java 8, however, implements a nice solution for this problem. According to this blog: http://www.nurkiewicz.com/2014/04/hashmap-performance-improvements-in.html Java switches bucket implementation from list to a balanced tree after a specific threshold is met (TREEIFY_THRESHOLD = 8 entries as of current). Thus, worst case performance improves from o(n) to o(log n). What is particularly interesting is balancing of the tree. For entries with different hashes, the hash value is used to compare entries. What if hash values are equal? Well, in this case the implementation hopes that keys implement Comparable interface. If this doesn't hold true - well, tree will be linearised, so in case of heavy collisions no performance benefits should be expected.

Robin Hood (RH) hashing 

As already mentioned, Java stdlib implementation uses open hashing approach. There is however a competing closed hashing (open addressing) approach. Closed hashing algorithm in case of collision calculate new hash (and array index) till free spot is found. This obviously means a performance penalty in high-collision scenarios as compared to open hashing in which the entry would be simply put at the end of the list. For get it's similar as the algorithm needs to go one-by-one comparing keys. Performance gets of course worse with higher loads of the underlying table.

I'd like to show here a neat idea how to improve over typical linear search, called Robin Hood hashing. The idea is quite old - published for the first time in 1986 (paper here: https://cs.uwaterloo.ca/research/tr/1986/CS-86-14.pdf). Why such name? Well, the idea is basically that some entries will get a free (insertion) slot sooner (with lower number of tries) than others, due to their "better" hash code value. Therefore, the algoritthm can put entries with "worse" hash codes closer (and move "better" to further spots). Thus, it's like taking from "richer" and giving to "poorer" :-)

Implementation itself is really straightforward - for each entry probe length (distance from originally calculated and current position) is kept. In case of collision, if probe length of the new entry is bigger than probe length of the current one, the new one is inserted (as poorer) and insert continues with the previously inserted one. If probe length of the new one is smaller, new index is calculated (and probe length for that entry increases). Therefore prob lengths will gradually even out.
According to this article https://www.sebastiansylvan.com/post/robin-hood-hashing-should-be-your-default-hash-table-implementation/ Robin Hood is faster as for current architectures it will has less cache misses (due to lower probe count variance). There were some attempts to measure the performance benefits of such approach. One of the best can be found here - http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion. Up to date, only Rust language implements Robin Hood as the standard hash map.

Key takeaways

  • Java hash map implementation is very fast, in Java 8 even faster (balanced tree)
  • it's worthwhile to implement Comparable for keys used in hash map 
  • Robin Hood hashing is an interesting idea that's worth trying, especially in case of memory constraints (high load factors - 0,9 or even 0,95 without big performance hit) or other non-functional requirements (seems reasonable for disc storage due to linear access pattern) 

Further reading

piątek, 15 września 2017

Java 8 continued - cheat sheet ;)

Continuing the topic of Java 8. Folks at Zeroturnaround (authors of JRebel) wrote a nice article regarding:

  • default methods in interfaces
  • best practices with lambdas usage
  • proper Optional usage
download/print the cheat sheet in PDF: 

Enjoy! :-) 

czwartek, 14 września 2017

Java 8 for late-adopters - Venkat's 5 cents

Some time ago I wrote a short post regarding some Java 8 features that may be interesting for current pre-8 developers and may motivate them to switch to the latest version. It was really basic, more like teaser than tutorial.

As one of the projects I'm in is currently moving to Java 8, I recently did a presentation on the topic. During research I found out a number of great tutorials - ranging from simple introductions to deep-in-details encyclopedias. One, however, caught my eye.

Venkat's "Java 8 programming idioms" series, published here: http://blog.agiledeveloper.com/2017/09/java-8-programming-idioms-series-at-ibm.html Touches mainly functional side of the Java 8. It can particularly handy, as I see that some developers that up to date lived in an object-oriented programming world have a hard time switching to functional style of thinking. Enjoy!

wtorek, 25 lipca 2017

[memo] hibernate's import.sql

There is a neat, heavily undocumented feature oh Hibernate - import.sql.

If Hibernate creates DB schema from scratch - "hibernate.hbm2ddl.auto" is set to either "create" or "create-drop", it will load "import.sql" file located in the root classpath. If Spring Boot is used, the property controlling the behaviour is "spring.jpa.hibernate.ddl-auto".

But why would anyone like to import some static data in a webapp? Well, for prototyping purposes for example. Or for "static" webapp mocking some external resources for integration testing. 

poniedziałek, 24 lipca 2017

Continuous Integration for (non)hipsters

Continuous Integration? Soooo ooold, so ninetees. Not hot anymore, not trendy. Currently even Continuous Delivery sounds like something regular, typical, borig. Everybody does CI... or do they?

In order to answer this question, let's refer to the excellent article, published by Martin Fowler here. Headline states:
Continuous Integration is a software development practice where members of a team integrate their work frequently (...). Each integration is verified by an automated build (including test) to detect integration errors as quickly as possible.
That's it.
Code (yes, this includes tests!)
Pull/push (integrate).
Let something (a CI server for instance) run tests and confirm that everything is in order

Four easy steps. However, the devil is in the detail.

First of all, the steps should be executed at least once a day, preferably multiple times a day by each team member. This way, any integration issues will be discovered on the spot. If your team has a habit of keeping private branches for a long time, without source integration and pushes (triggering a full CI build), well, you're not doing CI! If you use feature branches, which are a good thing, keep them public, so that CI server can watch them during development time.

Second of all, tests are the cornerstone of CI. Usually not all tests are run during build - in order to make it faster, only quick, small unit ones. If project build isn't fast, developers won't build before pushing changes (we're all always super busy, aren't we?). Therefore all expensive tests - especially integration ones - should be left for CI server. It's a general rule for a bigger builds. This way if an issue fits into this 1% not covered by unit tests, it should be found by the end of the day. By the way it's really a good habit to have a blame mails enabled in the CI server. The name is a bit misleading - it's not about the blame, it's all about fixing broken build (or bad tests!) as soon as possible.

All of what was above can be found in the aforementioned Martin Fowler's article - in a more elaborate way. Yes, it's a quite old one, but still very, very relevant.