Hadoop Installation

by admin last modified 2009-08-22 10:16

How to install Hadoop using the Nebraska Pacman packaging

Warning


These instructions are out of date.  Follow the OSG ones here:

https://twiki.grid.iu.edu/bin/view/Storage/Hadoop

These are kept for historical purposes only.

Install Prerequisites

This install process is based on Pacman.  To install Pacman, do the following:
mkdir /opt/pacman
cd /opt/pacman
wget http://physics.bu.edu/pacman/sample_cache/tarballs/pacman-3.26.tar.gz
tar zxf pacman-3.26.tar.gz
cd pacman-3.26
source setup.sh
Next, decide which node will be your namenode and which nodes will be datanodes.  Finally, decide where the Hadoop software will be installed; we recommend either a pre-existing Pacman install (such as the OSG) or an NFS mount.

Hadoop Install

The Hadoop install is relatively automated:
cd $VDT_LOCATION
echo "http://t2.unl.edu/store/cache" > trusted.caches
echo "http://vdt.cs.wisc.edu/vdt_1101_cache" >> trusted.caches
export VDTSETUP_AGREE_TO_LICENSES=y
pacman -get http://t2.unl.edu/store/cache:Hadoop
pacman -get http://t2.unl.edu/store/cache:Hadoop-Config
The difference between the two packages is that Hadoop-Config adds a set of handy aliases for interacting with HDFS and a set of init scripts.  Hadoop-Config is optional, although the FUSE init script described under FUSE Configuration relies on it.

We recommend also installing FUSE on all servers, and GridFTP-HDFS on external servers.

Hadoop Configuration

We have included sane default values for the Hadoop config.  Put any changes in hadoop-site.xml, not hadoop-default.xml.  The settings to review, both in hadoop-site.xml and in the other config files, are:
  • fs.default.name (in conf/hadoop-site.xml, defaults to hdfs://hadoop-name:9000).  Change to the hostname of your namenode.
  • hadoop.tmp.dir (in conf/hadoop-site.xml, defaults to /scratch/hadoop).  Change to the location of the disk storage on this node; may be a comma-separated list of mount points (note: do not put spaces after the commas).
  • slaves (one node per line, in conf/slaves).  List the names of all the nodes attached to the cluster.  Only needed if you want to use the provided scripts to start/stop the datanodes.
  • Ganglia location (*.server lines in conf/hadoop-metrics.properties).  Only needed if you want to use Ganglia to monitor Hadoop.
  • Log location (in conf/hadoop-env.sh, defaults to /var/log).  This is only for the stdout of each daemon; the log4j logs go to ${hadoop.tmp.dir}/logs.
  • hosts_exclude file.  Make sure that this file exists in ${hadoop.tmp.dir}, otherwise the namenode will not start up.
  • Directory locations.  Make sure that ${hadoop.tmp.dir} exists, otherwise none of the daemons will start up.
  • Open file limits.  Sites have reported that they can exhaust the open file limits of the user running Hadoop pretty quickly, especially with small files.  Increase the open file limits on all the datanodes to 8192 or more.
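The checklist above can be sketched as a few shell helpers. This is a minimal sketch, not part of the Nebraska packaging: the function names print_site_overrides, prepare_hadoop_dir, and check_open_files are hypothetical, the port 9000 is simply the default quoted above, and prepare_hadoop_dir assumes a single storage directory rather than a comma-separated list.

```shell
# Hypothetical helpers; none of these ship with the Hadoop or Hadoop-Config packages.

# Print the two hadoop-site.xml properties discussed above so they can be
# merged by hand into the <configuration> element of conf/hadoop-site.xml.
# $1 = namenode hostname, $2 = value for hadoop.tmp.dir
print_site_overrides() {
    cat <<EOF
<property>
  <name>fs.default.name</name>
  <value>hdfs://$1:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>$2</value>
</property>
EOF
}

# Create the storage directory and an empty hosts_exclude file inside it.
# $1 = a single directory (assumption: not a comma-separated list)
prepare_hadoop_dir() {
    mkdir -p "$1" || return 1
    # touch leaves an existing hosts_exclude file untouched.
    touch "$1/hosts_exclude"
}

# Succeed only if this shell's open file limit is at least $1 (default 8192).
check_open_files() {
    local wanted="${1:-8192}"
    local current
    current=$(ulimit -n)
    [ "$current" = "unlimited" ] && return 0
    [ "$current" -ge "$wanted" ]
}
```

For example, run `prepare_hadoop_dir /scratch/hadoop` and then `check_open_files 8192 || echo "raise the open file limit"` as the user that will run Hadoop. The value reported by `ulimit -n` applies to the current shell only, so run the check in the same environment that launches the daemons.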
At this point, we recommend starting the namenode and datanode daemons and reading the log files to make sure they start up successfully.
. $VDT_LOCATION/setup.sh
start-dfs.sh
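One way to read the log files is to search them for log4j ERROR and FATAL entries. The sketch below assumes the log4j logs are *.log files under ${hadoop.tmp.dir}/logs, as described in the configuration notes; the function name scan_hadoop_logs is hypothetical, and a clean result does not by itself prove the daemons are healthy.

```shell
# Hypothetical helper: print ERROR and FATAL lines from the Hadoop log4j logs.
# $1 = log directory, e.g. /scratch/hadoop/logs (assumed *.log file naming)
# Returns 0 if no such lines exist, 1 if any were found, 2 if there are no logs.
scan_hadoop_logs() {
    local log_dir="$1"
    if ! ls "$log_dir"/*.log >/dev/null 2>&1; then
        echo "no log files found in $log_dir"
        return 2
    fi
    # grep exits 0 when at least one line matches; -H prefixes the file name.
    if grep -H -E 'ERROR|FATAL' "$log_dir"/*.log; then
        return 1
    fi
    echo "no ERROR or FATAL lines in $log_dir"
    return 0
}
```

With the default hadoop.tmp.dir, this would be called as `scan_hadoop_logs /scratch/hadoop/logs` shortly after the daemons have started.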
If hosts are listed in the slaves file, the startup script attempts to launch the daemons on them through SSH.  This method requires passwordless SSH to be set up for the user running Hadoop.  Passwordless SSH is used only for launching the daemons and is not required to run Hadoop; many sites skip it and use the VDT-provided init.d script instead.
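Before relying on the SSH-based startup, it can help to confirm that every host in the slaves file accepts a login without a password prompt. A minimal sketch, assuming OpenSSH and one hostname per line; the function name check_slaves_ssh is hypothetical. The BatchMode=yes option makes ssh fail instead of prompting for a password.

```shell
# Hypothetical helper: test passwordless SSH to every host in a slaves file.
# $1 = path to the slaves file (one hostname per line)
# Returns 0 if every host accepted the login, 1 otherwise.
check_slaves_ssh() {
    local slaves_file="$1"
    local failed=0
    local host
    while read -r host || [ -n "$host" ]; do
        # Skip blank lines and comment lines.
        case "$host" in
            ''|'#'*) continue ;;
        esac
        # -n keeps ssh from consuming the rest of the slaves file on stdin.
        if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" true 2>/dev/null; then
            echo "ok: $host"
        else
            echo "FAILED: $host"
            failed=1
        fi
    done < "$slaves_file"
    return "$failed"
}
```

Run it as the user that runs Hadoop, pointing it at your conf/slaves file. Any host reported as FAILED would also fail when the startup script tries to reach it.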

FUSE Install

To mount HDFS through FUSE, you need to install the FUSE package (the FUSE-DFS filesystem itself comes with the Hadoop install).

Prior to installing FUSE, you must have the kernel sources installed which match EXACTLY the current kernel version as reported by `uname -r`.  It's probably either the kernel-devel or kernel-smp-devel RPM from yum.
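A quick way to see whether build files for the running kernel are present is to look for the build directory under /lib/modules. This is only a sketch: it assumes the conventional /lib/modules/<release>/build symlink that kernel development packages populate, and it is not necessarily the same check the FUSE package performs. The function name has_kernel_build_dir and its optional arguments are hypothetical.

```shell
# Hypothetical helper: succeed if kernel build files exist for a release.
# $1 = kernel release (default: the running kernel, from `uname -r`)
# $2 = modules root (default: /lib/modules; overridable for testing)
has_kernel_build_dir() {
    local release="${1:-$(uname -r)}"
    local root="${2:-/lib/modules}"
    # -d follows symlinks, so a dangling build symlink counts as missing.
    [ -d "$root/$release/build" ]
}
```

For example, `has_kernel_build_dir || echo "no build files for $(uname -r)"` prints a message when the directory for the running kernel is missing.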

If you get stuck below in installing FUSE due to the kernel sources -- and you are sure that the kernel sources are there -- you can override the kernel source installation check with the following variable:
export VDT_HAS_KERNEL_SOURCES=1
Do the following:
cd $VDT_LOCATION
pacman -get http://t2.unl.edu/store/cache:FUSE
modprobe fuse
At Nebraska, we require that Hadoop be mounted at /mnt/hadoop.  Make sure you have already created that directory.  Finally, mount the file system.  As root,
source $VDT_LOCATION/setup.sh
export VDT_HDFS_FUSE_MOUNT=/mnt/hadoop
fuse_dfs -oserver=hadoop-name -oport=9000 $VDT_HDFS_FUSE_MOUNT -oallow_other -ordbuffer=131072
Modify the VDT_HDFS_FUSE_MOUNT variable as required for your site.  The mount must be repeated at every boot.  Sourcing the VDT setup script is required because it provides the correct environment for FUSE.

When you run the mount command, it prints out some scary warnings:
fuse-dfs didn't recognize /hadoop,-2
fuse-dfs ignoring option allow_other
Ignore them.
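To confirm that the mount succeeded despite the warnings, one option is to look for the mount point in /proc/mounts. A minimal sketch, assuming a Linux system where FUSE mounts are listed with a filesystem type beginning with "fuse"; the function name is_fuse_mounted is hypothetical.

```shell
# Hypothetical helper: succeed if a FUSE filesystem is mounted at a directory.
# $1 = mount point, e.g. /mnt/hadoop
# $2 = mounts table (default: /proc/mounts; overridable for testing)
is_fuse_mounted() {
    local mount_point="$1"
    local mounts_file="${2:-/proc/mounts}"
    # Field 2 is the mount point and field 3 is the filesystem type.
    awk -v m="$mount_point" \
        '$2 == m && $3 ~ /^fuse/ { found = 1 } END { exit !found }' \
        "$mounts_file"
}
```

For example, `is_fuse_mounted "$VDT_HDFS_FUSE_MOUNT" && ls "$VDT_HDFS_FUSE_MOUNT"` lists the top level of HDFS only when the mount is in place.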

FUSE Configuration

If you installed the Hadoop-Config package above, you can easily start FUSE on boot.
First, tell the VDT what namenode and port you use.  Add the following to $VDT_LOCATION/vdt/etc/vdt-local-setup.sh:
export VDT_GRIDFTP_HDFS_NAMENODE="hadoop-name"
export VDT_GRIDFTP_HDFS_PORT="9000"
export VDT_HDFS_FUSE_MOUNT="/mnt/hadoop"
Edit these variables as you see fit.

Enable the VDT packaging of the init script:
. $VDT_LOCATION/setup.sh
vdt-control --enable hadoop_fuse
vdt-control --on


