GridFTP HDFS
How to install a GridFTP server for the HDFS file system.
Install Prerequisites
The GridFTP server for HDFS is based upon the stock Globus server with a new loadable module to interact with Hadoop.
Make sure that the xinetd RPM is installed on your system.
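A quick way to confirm the xinetd prerequisite is to query the RPM database. The sketch below is only an illustration: `check_pkg` is a hypothetical helper name, and it assumes the standard `rpm -q` query, which exits non-zero when a package is not installed.

```shell
# Sketch: confirm a prerequisite RPM is installed before continuing.
# check_pkg is a hypothetical helper, not part of the VDT or pacman.
check_pkg() {
    # $1: package name; rpm -q exits non-zero when the package is missing.
    if rpm -q "$1" >/dev/null 2>&1; then
        echo "$1 is installed"
    else
        echo "$1 is missing; install it before continuing" >&2
        return 1
    fi
}

# Usage:
#   check_pkg xinetd
```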
We have created a pacman-based install using the VDT's packaging of Globus. To install pacman, do the following:
mkdir /opt/pacman
cd /opt/pacman
wget http://physics.bu.edu/pacman/sample_cache/tarballs/pacman-3.26.tar.gz
tar zxf pacman-3.26.tar.gz
cd pacman-3.26
source setup.sh
Now, create your install directory; we will refer to this as $VDT_LOCATION in this documentation.
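As a minimal sketch of this step, the helper below creates the directory and exports VDT_LOCATION so that later commands can refer to it. The function name `setup_vdt_location` and the path `/opt/gridftp-hdfs` are assumptions made for illustration; use whatever location suits your site.

```shell
# Sketch: create the install directory and export VDT_LOCATION.
# setup_vdt_location is a hypothetical helper; /opt/gridftp-hdfs in the
# usage note is a hypothetical path, so substitute your own.
setup_vdt_location() {
    # $1: directory to use as the install location
    dir="$1"
    mkdir -p "$dir" || return 1
    VDT_LOCATION="$dir"
    export VDT_LOCATION
}

# Usage (as root):
#   setup_vdt_location /opt/gridftp-hdfs
#   cd "$VDT_LOCATION"
```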
Installation
The installation is relatively automatic. If your site uses GUMS, run the following:
cd $VDT_LOCATION
echo "http://t2.unl.edu/store/cache" > trusted.caches
echo "http://vdt.cs.wisc.edu/vdt_1101_cache" >> trusted.caches
export VDTSETUP_AGREE_TO_LICENSES=y
export VDTSETUP_INSTALL_CERTS=l
export VDTSETUP_CA_CERT_UPDATER=n
export VDTSETUP_ENABLE_BESTMAN=n
export VDTSETUP_ENABLE_ROTATE=n
export VDTSETUP_ENABLE_GRIDFTP=n
export VDTSETUP_EDG_CRL_UPDATE=n
export VDT_GUMS_HOST=<host for your GUMS install>
pacman -get http://t2.unl.edu/store/cache:GridFTP_HDFS
ln -s /etc/grid-security/certificates $VDT_LOCATION/globus/TRUSTED_CA
If your site does not use GUMS, replace the VDT_GUMS_HOST line with:
export VDT_NO_PRIMA=1
Currently supported platforms include RHEL-4 and RHEL-5, both in 32-bit and 64-bit.
Currently, the installer places the GridFTP server configuration into /etc/xinetd.d without asking the admin (hopefully this will change in the future). This means that GridFTP_HDFS must be installed using the root account. The GridFTP server listens on port 5000, as opposed to the default GridFTP port, 2811.
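Because the install needs root and the server listens on port 5000, a small pre-flight check can catch problems early. This is a sketch under stated assumptions: `require_root` and `port_in_use` are hypothetical helpers, and `port_in_use` expects `netstat -tln` style output on standard input, with the local address in the fourth column.

```shell
# Sketch: pre-flight checks before running the pacman install.
# Both helpers are hypothetical and not part of the VDT.
require_root() {
    # $1: numeric user id, normally "$(id -u)"
    [ "$1" -eq 0 ]
}

port_in_use() {
    # $1: port number; reads `netstat -tln` style output on stdin and
    # succeeds if any local address (column 4) ends in ":<port>".
    awk -v port="$1" '$4 ~ (":" port "$") { found = 1 } END { exit !found }'
}

if require_root "$(id -u)"; then
    echo "running as root"
else
    echo "not root: the GridFTP_HDFS install needs the root account" >&2
fi

# Usage, before installing (port 5000 should be free) or after enabling
# the server (port 5000 should be listening):
#   netstat -tln | port_in_use 5000 && echo "port 5000 is in use"
```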
Configure CA Certificates
The following directions are taken from the OSG's CE post-install document. To pull the OSG-recommended CA distribution, edit cacerts_url in the configuration file at
$VDT_LOCATION/vdt/etc/vdt-update-certs.conf
This file contains URLs to CA Certificate distributions including the OSG GOC distribution with certificates recommended by the OSG Security Team, as well as the VDT convenience distribution. You must uncomment one of these (or create your own), and then run the commands below to activate the certificate updates.
source $VDT_LOCATION/vdt-questions.sh; $VDT_LOCATION/vdt/sbin/vdt-setup-ca-certificates
vdt-control --enable vdt-update-certs
vdt-control --on vdt-update-certs
Configure Authentication
There are two ways to configure authentication:
- PRIMA/GUMS (will integrate into larger site setups):
cp $VDT_LOCATION/post-install/*-authz.conf /etc/grid-security
Edit the line starting with imsContact in /etc/grid-security/prima-authz.conf to point to your GUMS installation
- grid-mapfile (not recommended):
- Create a file called /etc/grid-security/grid-mapfile.
- For each user of the site (with DN <DN> mapping to unix user name <user>), add a line of the following format:
"<DN>" <user>
Configure HDFS
Unfortunately, GridFTP-HDFS does not yet integrate naturally into the Hadoop config system. You must specify the following environment variables for the GridFTP server:
- VDT_GRIDFTP_HDFS_REPLICAS: Integer number of replicas for each saved file (defaults to 3).
- VDT_GRIDFTP_HDFS_NAMENODE: Hostname of the Hadoop namenode to connect to (defaults to hadoop-name).
- VDT_GRIDFTP_HDFS_PORT: Port number of the Hadoop namenode (defaults to 9000).
- VDT_GRIDFTP_HDFS_MOUNT_POINT: The FUSE mount point on the SRM node. This allows the GridFTP server to convert between SRM filenames (which include the FUSE mount path) and native Hadoop filenames (which do not include the FUSE mount path). Defaults to /mnt/hadoop.
- VDT_GRIDFTP_LOAD_LIMIT: If the system load is above this integer value, the GridFTP server will still accept new transfers but will not allow them to start moving data. Defaults to 20.
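The variables above can be set as in the following sketch, which falls back to the documented defaults when a value is not already present in the environment. Where these exports need to live so that the xinetd-launched server sees them is not covered here and depends on your install. The `to_hdfs_path` helper is hypothetical and not part of GridFTP-HDFS; it only illustrates the kind of path translation that the mount point setting makes possible.

```shell
# Sketch: export the GridFTP-HDFS settings, using the documented defaults
# unless a value is already set in the environment.
export VDT_GRIDFTP_HDFS_REPLICAS="${VDT_GRIDFTP_HDFS_REPLICAS:-3}"
export VDT_GRIDFTP_HDFS_NAMENODE="${VDT_GRIDFTP_HDFS_NAMENODE:-hadoop-name}"
export VDT_GRIDFTP_HDFS_PORT="${VDT_GRIDFTP_HDFS_PORT:-9000}"
export VDT_GRIDFTP_HDFS_MOUNT_POINT="${VDT_GRIDFTP_HDFS_MOUNT_POINT:-/mnt/hadoop}"
export VDT_GRIDFTP_LOAD_LIMIT="${VDT_GRIDFTP_LOAD_LIMIT:-20}"

# Illustration only (hypothetical helper): strip the FUSE mount point from
# an SRM-style filename to get the native Hadoop filename.
to_hdfs_path() {
    case "$1" in
        "$VDT_GRIDFTP_HDFS_MOUNT_POINT"/*)
            printf '%s\n' "${1#"$VDT_GRIDFTP_HDFS_MOUNT_POINT"}" ;;
        *)
            printf '%s\n' "$1" ;;
    esac
}

# The file name here is a made-up example.
to_hdfs_path /mnt/hadoop/user/test/file.root
```

With the default mount point of /mnt/hadoop, the last line prints /user/test/file.root.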
Turn on GridFTP-HDFS server
Now, simply turn on the server:
vdt-control --enable gridftp-hdfs
Make sure you turn on log rotation: nothing like a logfile filling up a partition to ruin a good night's sleep.
vdt-control --enable vdt-rotate-logs # DO NOT FORGET THIS ONE!
vdt-control --on
Test GridFTP server
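One possible test is a round-trip copy with globus-url-copy: write a small file into Hadoop, read it back, and compare the two. The sketch below assumes a valid grid proxy (for example from grid-proxy-init), the port 5000 used by this install, and a remote path under the FUSE mount point. The helper names, the host gridftp.example.edu, and the remote path are all hypothetical.

```shell
# Sketch: round-trip copy through the GridFTP-HDFS server.
# gsiftp_url and roundtrip_test are hypothetical helpers.
gsiftp_url() {
    # $1: host, $2: port, $3: absolute path on the server
    printf 'gsiftp://%s:%s%s\n' "$1" "$2" "$3"
}

roundtrip_test() {
    # $1: GridFTP host, $2: remote path under the FUSE mount point
    remote="$(gsiftp_url "$1" 5000 "$2")"
    src="$(mktemp)" || return 1
    back="$(mktemp)" || return 1
    date > "$src"
    # Copy in, copy back out, then compare the two local files.
    globus-url-copy "file://$src" "$remote" &&
        globus-url-copy "$remote" "file://$back" &&
        cmp -s "$src" "$back"
}

# Usage, after obtaining a proxy:
#   roundtrip_test gridftp.example.edu /mnt/hadoop/user/test/gridftp-test.txt \
#       && echo "round trip OK"
```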
The details are left as an exercise for the installer. Make sure that you can copy files into and out of Hadoop using globus-url-copy.
Add GridFTP server to BestMan config
Any BestMan SRM server must be told the location of the new GridFTP server before it can use it. Add the new server to the bestman.rc of the SRM server and restart that server. (Note for Nebraska admins: on dcache07, this file is at /opt/bestman/bestman/conf/bestman.rc; on srm, this file is at /opt/bestman/bestman/conf/bestman.rc.)
Enable Gratia GridFTP probe
Now, go enable the Gratia GridFTP probe, especially if you're at Nebraska. Additionally, if you are at Nebraska, you'll want to replicate the data from Nebraska's collector to the FNAL collector.