Over the last few years, with most large scale data processors such as Google, Yahoo, Amazon, Twitter and Facebook open-sourcing their internal data crunching algorithms/software, "BigData" projects have taken off exponentially. We are now in the BigData era of computing, where a large number of companies and individuals are contributing to or adopting software stacks such as Hadoop, Cassandra, MongoDB, Riak, Redis etc. for large scale distributed data processing. Several sub industries have been spawned to provide technical support, hardware and SaaS for BigData as well. What is most appealing and revolutionary about this industry is the fact that open source software is at the heart of it. Any company or individual can dip into the BigData pot without spending large amounts of money and without being held hostage by vendors.
For a recent experiment, I had to install and configure a simple two node Hadoop cluster. In this post, I will highlight a few gotchas that I encountered, the solutions to which were not properly documented anywhere. Hopefully this will save someone else a little bit of time.
I will not go through the process of installing Hadoop, as it is well documented in numerous places. (Michael G. Noll's Hadoop tutorial is an excellent starting point for newbies.)
Gotcha: Server at x not available yet, Zzzzz...
I encountered this error message after I setup my second node and started the whole cluster up. The datanode logs were full of the above message and the dfs health check showed the available space as 0 bytes.
This seems to be a Java bug in resolving names from /etc/hosts. Java networking libraries cache host name resolutions forever. (see: http://myhowto.org/java/42-understanding-host-name-resolution-and-dns-behavior-in-java/) Setting the networkaddress.cache.ttl JVM property to a reasonable number such as 10 should solve this problem, but I had no luck with it. In the end, I managed to get around the issue by using IP addresses instead of hostnames in Hadoop *-site.xml configuration files. Since I was setting up a development cluster, this was acceptable. In a production environment, it is very unlikely that you will be relying on /etc/hosts to resolve hosts anyway. Therefore, it's a minor annoyance that you will only encounter during a trial installation like mine.
Gotcha: Error reading task output , Too many fetch-failures
I started seeing a bunch of these messages while a job was running in the cluster. Although the job completed successfully, these failures delayed the job quite a lot.
This is a problem with /etc/hosts configuration. If you have multiple aliases for the localhost, make sure the hostname alias that is in the Hadoop configuration file, is the first in the list.
For example, if the current /etc/hosts file looks like:
127.0.0.1 localhost.localdomain localhost hadoop-slave1
Change it to:
127.0.0.1 hadoop-slave1 localhost.localdomain localhost
Gotcha: java.lang.NoClassDefFoundError for classes defined in external libraries
Any non trivial MapReduce job will need to reference external libraries. However, even if the HADOOP_CLASSPATH variable is correctly set, when the job is run by Hadoop, you will get NoClassDef errors for any classes defined in these external library jars.
This took a bit of searching to find out. Hadoop expects all external dependencies to be in a directory named lib inside the job jar. Use the following Maven assembly descriptor to create a job jar that conforms to this convention. (Make sure the core Hadoop dependencies in the POM are set to the provided scope.)
I will conttinue to update this post as I continue to experiment with Hadoop. If you spot any errors or know of a more elgant solution to some of these problems, please leave a comment.