Nebraska Operations
Operational note for the Hadoop install at Nebraska
Daily Operations
All of the admin operations must be done as root on hadoop-name, unless otherwise noted.
DFS filesystem check
hadoop fsck / -blocks
A successful check will end with these words:
The filesystem under path '/' is HEALTHY
An unsuccessful check will end with the following:
The filesystem under path '/' is CORRUPT
Follow up with more diagnostics in this case.
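The check above can be scripted, for example from a morning cron job. This is only a sketch: the function name fsck_status is hypothetical (it is not part of Hadoop), and it assumes the status line is the last line of output containing "The filesystem under path".

```shell
#!/bin/sh
# fsck_status: read `hadoop fsck` output on stdin and print
# HEALTHY, CORRUPT, or UNKNOWN based on the final status line.
# (Hypothetical helper; not part of the Hadoop distribution.)
fsck_status() {
    status_line=$(grep "The filesystem under path" | tail -n 1)
    case "$status_line" in
        *"is HEALTHY"*) echo HEALTHY ;;
        *"is CORRUPT"*) echo CORRUPT ;;   # also matches "CORRUPTED"
        *)              echo UNKNOWN ;;   # no status line was found
    esac
}

# Example use, as root on hadoop-name:
#   hadoop fsck / -blocks | fsck_status
```

Anything other than HEALTHY should be followed up by hand.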
FUSE on worker nodes check
If you see the message "Transport endpoint is not connected" on worker nodes, this means that the FUSE mount has died. Simply log into the worker nodes, unmount the file system, and remount it:
umount /mnt/hadoop
fuse_dfs -oserver=hadoop-name -oport=9000 /mnt/hadoop -oallow_other -ordbuffer=131072
ls /mnt/hadoop
You may want to check for this condition every morning with "part 'ls /mnt/hadoop > /dev/null'" and look for the "Transport endpoint..." message.
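A per-node version of this check might look like the sketch below. The function names are hypothetical, and the script only recognizes the "Transport endpoint is not connected" error; any other failure (for example a missing directory) is not treated as a dead mount.

```shell
#!/bin/sh
# fuse_mount_dead: succeed (exit 0) if the text on stdin contains the
# error that indicates a dead FUSE mount. (Hypothetical helper name.)
fuse_mount_dead() {
    grep -q "Transport endpoint is not connected"
}

# check_fuse_mount [MOUNTPOINT]: list the mount point, discard the
# listing, and inspect only the error output.
# The mount point defaults to /mnt/hadoop, as used in this note.
check_fuse_mount() {
    mnt=${1:-/mnt/hadoop}
    if ls "$mnt" 2>&1 >/dev/null | fuse_mount_dead; then
        echo "DEAD $mnt"
        return 1
    fi
    echo "OK $mnt"
}
```

When check_fuse_mount reports DEAD, unmount and remount by hand as described above.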
DFS overall report
hadoop dfsadmin -report
A similar amount of information can be found from
Get safemode status
hadoop dfsadmin -safemode get
Leave or enter safemode
hadoop dfsadmin -safemode leave
hadoop dfsadmin -safemode enter
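For scripts that must not run while the namenode is in safemode, the status output can be reduced to a single word. This sketch assumes the output of "hadoop dfsadmin -safemode get" contains "Safe mode is ON" or "Safe mode is OFF"; the function name safemode_state is hypothetical.

```shell
#!/bin/sh
# safemode_state: read `hadoop dfsadmin -safemode get` output on stdin
# and print ON, OFF, or UNKNOWN. (Hypothetical helper name.)
safemode_state() {
    out=$(cat)
    case "$out" in
        *"Safe mode is ON"*)  echo ON ;;
        *"Safe mode is OFF"*) echo OFF ;;
        *)                    echo UNKNOWN ;;   # unexpected output
    esac
}

# Example use, as root on hadoop-name:
#   hadoop dfsadmin -safemode get | safemode_state
```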
Decommissioning a data node
- Add the data node entry to the file /scratch/hadoop-root/hosts_exclude (one host per line).
- Perform "hadoop dfsadmin -refreshNodes"
- Watch the logs of the namenode to make sure all the appropriate actions are taken!
- The decommissioning will end in the logfiles with this message: "Decommission complete for node 172.16.1.55:50010"
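The first step can be made safe to repeat with a small helper such as the sketch below. The function name add_to_exclude and the example hostname are hypothetical; the default path is the exclude file named above.

```shell
#!/bin/sh
# add_to_exclude HOST [FILE]: append HOST to the exclude file unless
# it is already listed (one host per line). Hypothetical helper.
add_to_exclude() {
    host=$1
    file=${2:-/scratch/hadoop-root/hosts_exclude}
    touch "$file"
    # -x matches whole lines only, -F treats the host as a fixed string
    if grep -qxF "$host" "$file"; then
        echo "already excluded: $host"
    else
        echo "$host" >> "$file"
        echo "excluded: $host"
    fi
}

# Example, as root on hadoop-name (the hostname is a placeholder):
#   add_to_exclude node055 && hadoop dfsadmin -refreshNodes
```

Watching the namenode logs for the "Decommission complete" message is still a manual step.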
Cleaning up a CORRUPT filesystem (as reported by fsck)
When the namenode is in safemode, one cannot make any edits to the file system. First, run fsck and determine the extent of the damage. If it is acceptable to delete or otherwise move aside the damaged files, turn off safemode, and move the files using the following command:
hadoop fsck -move
This moves any files with problematic blocks into /lost+found.
Restoring from a checkpoint
Shut down the namenode. There are two checkpoint images being kept at any time:
/scratch/hadoop/dfs/namesecondary/current/
/scratch/hadoop/dfs/namesecondary/previous.checkpoint/
Copy all the files in one of these directories into the following directory:
/scratch/hadoop/dfs/name/current/
Start the namenode again, and watch the logs for activities.
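The copy step could be wrapped as in the sketch below. The function name restore_checkpoint is hypothetical, and the backup of the old namenode directory is an added precaution that is not part of the procedure above. The namenode must already be shut down before this is run.

```shell
#!/bin/sh
# restore_checkpoint SRC DST: copy every file from checkpoint directory
# SRC into namenode directory DST, after saving the old contents of DST
# to a sibling backup directory. (Hypothetical helper; the backup step
# is an assumption added for safety.)
restore_checkpoint() {
    src=${1%/}
    dst=${2%/}
    [ -d "$src" ] || { echo "no such checkpoint: $src" >&2; return 1; }
    mkdir -p "$dst" || return 1
    backup="$dst.bak.$(date +%Y%m%d%H%M%S)"
    cp -rp "$dst" "$backup" || return 1      # keep the old image
    cp -p "$src"/* "$dst"/ || return 1       # install the checkpoint
    echo "restored $src -> $dst (backup in $backup)"
}

# Example, as root on hadoop-name with the namenode stopped:
#   restore_checkpoint /scratch/hadoop/dfs/namesecondary/current \
#                      /scratch/hadoop/dfs/name/current
```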
Miscellaneous Activities
Fixing "stuck" under-replicated files
We have observed a few occasions where a few blocks stay under-replicated; this seems to correspond with blocks written during or around the time of namenode crashes (last time, it was due to a kernel panic unrelated to Hadoop). Assume file X has a block with only Y replicas, but the target number is Z. After an appropriate amount of time (block replication usually occurs within a few seconds; wait no more than 10 minutes), do the following:
- Confirm the file is having the problem using "hadoop fsck X"
- Set the replicas of file X to the current number of replicas, Y: "hadoop fs -setrep Y X"
- Use fsck to make sure the file now shows up fine; "hadoop fsck X"
- Set the number of replicas back up to Z: "hadoop fs -setrep Z X"
- Watch the file using fsck and make sure the replicas get created: "hadoop fsck X"
Here's the script using xargs:
# Collect the paths fsck reports that contain "user", one per line
hadoop fsck / | awk '{print $1}' | grep user | tr -d ':' | sort -u > /tmp/stuck_replicas
# Lower the target replication to the current replica count (Y = 2 here)
xargs -t -I {} hadoop fs -setrep 2 {} < /tmp/stuck_replicas
hadoop fsck / # Make sure everything is happy
# Raise the target replication back to Z = 3
xargs -t -I {} hadoop fs -setrep 3 {} < /tmp/stuck_replicas
hadoop fsck / # Watch and see if everything becomes happy
Port forwarding for the hadoop web interface
This only needs to be done once:
/sbin/iptables -t nat -A PREROUTING -p tcp --dport 8088 -i eth0 -j DNAT --to-destination 172.16.100.8:50070
/sbin/iptables -t nat -A PREROUTING -p tcp --dport 8089 -i eth0 -j DNAT --to-destination 172.16.1.9:50030
IMPORTANT: make sure to not kill the normal nat behavior on dcache-head. In fact, just use the /root/restart_NAT.sh script which includes the above lines.
Starting and Stopping
Controlling via init scripts
Every node which is installed via VDT should have init scripts; if they aren't there, copy them from a node which has them.
- (If installed through VDT): vdt-control --on
- If not installed via VDT, use chkconfig to add the scripts
- To start/stop datanode: /etc/init.d/hadoop_datanode [start|stop]
- To start/stop namenode: /etc/init.d/hadoop_namenode [start|stop]
- To start/stop FUSE mount: /etc/init.d/hadoop_fuse [start|stop]
- To start/stop the GridFTP xinetd install:
  - vdt-control --enable gridftp-hdfs
  - vdt-control [--on|--off]
Manually starting a Hadoop daemon
We recommend using the init scripts to start and stop daemons. However, if you must do this manually:
cd $HADOOP_HOME/bin
source ./hadoop-config.sh
./hadoop-daemon.sh --config $HADOOP_CONF_DIR start datanode
The valid actions are start or stop; the valid daemons are datanode or namenode.
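If the manual route is used often, the argument check can be wrapped as in the sketch below. The function names are hypothetical, and it assumes HADOOP_HOME and HADOOP_CONF_DIR are already set in the environment.

```shell
#!/bin/sh
# valid_daemon_request ACTION DAEMON: succeed only for the combinations
# this note lists as valid (start|stop, datanode|namenode).
valid_daemon_request() {
    case "$1" in start|stop) ;; *) return 1 ;; esac
    case "$2" in datanode|namenode) ;; *) return 1 ;; esac
}

# run_daemon ACTION DAEMON: validate the request, then run
# hadoop-daemon.sh from $HADOOP_HOME/bin. (Hypothetical wrapper.)
run_daemon() {
    action=$1
    daemon=$2
    if ! valid_daemon_request "$action" "$daemon"; then
        echo "usage: run_daemon start|stop datanode|namenode" >&2
        return 2
    fi
    (
        # Subshell so the caller's working directory is unchanged
        cd "$HADOOP_HOME/bin" || exit 1
        . ./hadoop-config.sh
        ./hadoop-daemon.sh --config "$HADOOP_CONF_DIR" "$action" "$daemon"
    )
}

# Example:
#   run_daemon start datanode
```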
Manually remounting FUSE
To unmount,
umount /mnt/hadoop
To mount, after sourcing the Hadoop environment,
fuse_dfs -oserver=hadoop-name -oport=9000 /mnt/hadoop -oallow_other -ordbuffer=131072