Building a Hadoop Release
How to build a Hadoop release so it can be installed at Nebraska
Note: You DO NOT need this page if you are only deploying Hadoop; it is only for building a new release for the Nebraska system.
At Nebraska, we usually build on SL4 on node001 in the directory /opt/osg/osg-100/hadoop; on SL5, build on dcache06 in /opt/hadoop-build.
- Download and unpack the Hadoop source code
- Source an existing VDT install which includes Ant and a JDK (run "ls $VDT_LOCATION" and make sure it lists "ant" and "jdk1.5" directories).
- Set the following variables:
- HADOOP_HOME=freshly unpacked source
- CLASSPATH variables:
export CLASSPATH=$HADOOP_HOME/hadoop-0.19.0-core.jar:$HADOOP_HOME/lib/commons-logging-1.0.4.jar:$HADOOP_HOME/lib/commons-logging-api-1.0.4.jar:$HADOOP_HOME/lib/log4j-1.2.15.jar:$CLASSPATH
- Library variables:
export LD_LIBRARY_PATH=$HADOOP_HOME/build/libhdfs:$VDT_LOCATION/jdk1.5/jre/lib/amd64/server:$LD_LIBRARY_PATH
- Path variables:
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/src/contrib/fuse-dfs/src:$PATH
- Patch Hadoop as necessary. The patches we use are listed below.
- (Only on 64-bit nodes.) Edit $HADOOP_HOME/src/c++/libhdfs/Makefile and replace all occurrences of -m32 with -m64.
- Export misc. build variables:
export PERMS=1
export FUSE_HOME=$VDT_LOCATION/fuse
Otherwise, fuse-dfs will not build.
- Build Hadoop:
ant compile-contrib -Dlibhdfs=1 -Dfusedfs=1 jar
This requires automake >= 1.9.5, which IS NOT AVAILABLE on RHEL4. (Nebraska builders: a suitable automake is located in /usr on node001, so build there.) I had to download and install it from source, then add /usr/local/bin to PATH and /usr/local/lib to LD_LIBRARY_PATH. You can check your automake version with "automake --version".
- Fix the link build/libhdfs/libhdfs.so so that it is not absolute: cd to $HADOOP_HOME/build/libhdfs, rm the existing libhdfs.so, then run 'ln -s libhdfs.so.1 libhdfs.so'.
- cd to $HADOOP_HOME/.. and make a copy of the entire hadoop-0.x.x/ directory named hadoop/. Finally, issue the tar command:
tar zcf hadoop-0.x.x-RHELy-zzz.tar.gz hadoop/
Replace x.x with the Hadoop version number, y with the RHEL release (4 or 5), and zzz with the platform (i686 or x86_64).
- Copy the resulting tarball into t2.unl.edu:/var/www/html/cache.
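The link fix and the tarball naming in the last steps above are easy to get wrong by hand, so they can be wrapped in small shell functions. This is only a sketch: the function names tarball_name and fix_libhdfs_link are hypothetical and are not part of any existing Nebraska tooling, and it assumes a POSIX shell.

```shell
#!/bin/sh
# Sketch only: both helper names are made up for illustration.

# tarball_name VERSION RHEL_RELEASE PLATFORM
# Prints the tarball name in the hadoop-0.x.x-RHELy-zzz.tar.gz scheme.
tarball_name() {
    printf 'hadoop-%s-RHEL%s-%s.tar.gz\n' "$1" "$2" "$3"
}

# fix_libhdfs_link HADOOP_HOME
# Replaces build/libhdfs/libhdfs.so with a relative link to libhdfs.so.1.
# Runs in a subshell so the caller's working directory is untouched.
fix_libhdfs_link() {
    (
        cd "$1/build/libhdfs" || exit 1
        rm -f libhdfs.so
        ln -s libhdfs.so.1 libhdfs.so
    )
}
```

With these defined, the packaging step could look like "fix_libhdfs_link $HADOOP_HOME" followed by "tar zcf $(tarball_name 0.19.0 5 x86_64) hadoop/", run from the directory that contains the hadoop/ copy.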
Patches we apply to Hadoop
- Patch to add offset logging support in datanodes: https://issues.apache.org/jira/secure/attachment/12400028/clienttrace.patch.
- Patch to fix a cache corruption in fuse-dfs, as referenced here: https://issues.apache.org/jira/browse/HADOOP-4298. FIXED IN 0.19.0
- Mutex lock on read patch for fuse-dfs. No JIRA reference yet. FIXED IN 0.19.0
- Patch for a Java error during fsck under certain conditions: https://issues.apache.org/jira/browse/HADOOP-4351. FIXED IN 0.19.0
- Patch for averages in Ganglia metrics: https://issues.apache.org/jira/browse/HADOOP-4369. Wrapped into the patch below
- Patch for Ganglia NPE: https://issues.apache.org/jira/browse/HADOOP-3422. Wrapped into the patch below
- Patch for Ganglia 3.1 support: https://issues.apache.org/jira/browse/HADOOP-4675.
- Quickest way to patch:
cd $HADOOP_HOME
cp src/core/org/apache/hadoop/metrics/ganglia/GangliaContext.java src/core/org/apache/hadoop/metrics/ganglia/GangliaContext31.java
curl -k https://issues.apache.org/jira/secure/attachment/12394647/hadoop-4675-3.patch | patch -p 0
Contact Brian if patch does not succeed.
- Patch for FUSE-DFS "df": https://issues.apache.org/jira/browse/HADOOP-4368. Patch available for Hadoop 0.19.0 and 0.20.0
- Quickest way to patch 0.19.0 (will fail on 0.20.0):
cd $HADOOP_HOME
curl -k https://issues.apache.org/jira/secure/attachment/12395292/fuse_statfs.patch | patch -p 0
- Patch for FUSE-DFS to prevent infinite loop on read error (Patch available for Hadoop 0.19.0 and 0.20.0). Fixed in HADOOP-4616
- Quickest way to patch 0.19.0 (will fail on 0.20.0):
cd $HADOOP_HOME
curl -k http://issues.apache.org/jira/secure/attachment/12394123/HADOOP-4616_0.19.txt | patch -p 0
- Patch for FUSE-DFS groups: https://issues.apache.org/jira/browse/HADOOP-4727. Hadoop 0.19.0 only
- Quickest way to patch:
cd $HADOOP_HOME
curl -k https://issues.apache.org/jira/secure/attachment/12394700/hadoop-4727.patch | patch -F5 -p 0
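Several of the commands above pipe curl straight into patch, so a hunk that fails can leave the source tree partly patched. One way to guard against that is to save the patch to a file and test it first. The sketch below assumes GNU patch (for the --dry-run option) and a patch that has already been downloaded; apply_patch is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Sketch only: apply_patch is a made-up helper, and --dry-run assumes GNU patch.

# apply_patch FILE [extra patch options, e.g. -F5]
# Applies FILE with -p0 from the current directory, but only if a dry run
# shows that every hunk applies; otherwise the tree is left untouched.
apply_patch() {
    patch_file=$1
    shift
    if patch --dry-run -p0 "$@" < "$patch_file" > /dev/null; then
        patch -p0 "$@" < "$patch_file"
    else
        echo "$patch_file does not apply cleanly; nothing was changed" >&2
        return 1
    fi
}
```

For example, after "cd $HADOOP_HOME" and saving the HADOOP-4727 attachment as hadoop-4727.patch with "curl -k -o", the call would be "apply_patch hadoop-4727.patch -F5".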
Building GridFTP-HDFS
The build machines are similar to those used for the Hadoop build above. On dcache07, the build directory is /opt/gridftp-hdfs-build.
- Pre-requisites:
- Valid Hadoop, preferably installed via the UNL pacman cache.
- The subversion RPM package, which provides the standard svn client.
- Use pacman to pull in the Globus GridFTP SDK:
pacman -get http://vdt.cs.wisc.edu/vdt_1101_cache:Globus-Base-Data-Server
pacman -get http://vdt.cs.wisc.edu/vdt_1101_cache:Globus-Base-SDK
- Source the VDT's setup.sh.
- Make sure $VDT_LOCATION exists in the following steps!
- Check out the GridFTP-HDFS sources:
svn co svn://t2.unl.edu/brian/gridftp_hdfs
- Make a backup copy of the makefiles:
cp makefile_header makefile_header.bkp
cp Makefile Makefile.bkp
This is so the original makefiles can be preserved during the next step.
- Replace MAGIC_VDT_LOCATION with the actual contents of $VDT_LOCATION.
sed -i "s:MAGIC_VDT_LOCATION:$VDT_LOCATION:g" Makefile
sed -i "s:MAGIC_VDT_LOCATION:$VDT_LOCATION:g" makefile_header
- Run make to build the GridFTP module.
make
- Copy the original makefiles back:
cp Makefile.bkp Makefile
cp makefile_header.bkp makefile_header
- Create a tarball, and place it in the pacman cache.
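The backup, substitute, build, and restore sequence above can be collected into one function so that the original makefiles are always put back, even when the build fails. This is a sketch under assumptions: build_with_vdt_location is a hypothetical name, it must be run from the checked-out source directory, it relies on GNU sed for -i, and it assumes $VDT_LOCATION contains no ":" characters (the sed delimiter used in the steps above).

```shell
#!/bin/sh
# Sketch only: build_with_vdt_location is a made-up helper name.

# build_with_vdt_location CMD [ARGS...]
# Backs up the makefiles, substitutes MAGIC_VDT_LOCATION, runs CMD,
# restores the originals, and returns CMD's exit status.
build_with_vdt_location() {
    if [ -z "$VDT_LOCATION" ]; then
        echo "VDT_LOCATION is not set; source the VDT's setup.sh first" >&2
        return 1
    fi
    cp Makefile Makefile.bkp &&
        cp makefile_header makefile_header.bkp || return 1
    sed -i "s:MAGIC_VDT_LOCATION:$VDT_LOCATION:g" Makefile makefile_header
    "$@"
    status=$?
    # Put the pristine makefiles back regardless of how the build went.
    cp Makefile.bkp Makefile
    cp makefile_header.bkp makefile_header
    return "$status"
}
```

The build itself would then be "build_with_vdt_location make".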