GPN Cluster Install Notes
Notes from the software installation of gpnjayhawk.
Install Notes
Author: John Eslick, jceslick@ku.edu
Disclaimer: I'm not really sure my site works properly, and I may have made mistakes typing up the notes, so you should at least read through them before you start and see if everything seems reasonable. If you find mistakes, please email me, and I'll fix them. If we get the bugs out of this, I may finally get my own site working.
The Condor configuration is still not quite right but I think the problems are minor.
Original Documentation:
This just covers the steps used to set up the software for the GPN on gpnjayhawk.cpe.engr.ku.edu. For the details, see the documentation at http://t2.unl.edu/documentation/gpn/ and http://osg.ivdgl.org/twiki/bin/view/Integration/, and the Condor manual.
Software:
Rocks 4.1 for i386
CentOS 4.3 for i386
Condor 6.7.20
OSG 0.4.1
Hardware:
The cluster frontend is a Sun x2100 with a DVD drive, an 80 GB hard drive, and a 250 GB hard drive. It has a single-core processor. The compute nodes have one 80 GB hard drive and a dual-core processor.
Set up the hardware so it works. Your hardware may be different, but hopefully that won't matter much. I think the important thing as far as these notes go is that you can PXE boot.
OS Install:
There are 5 discs. The first is a boot roll for Rocks 4.1. The other four are CentOS 4.3 discs.
Turn off all cluster machines
Turn on head node
Insert boot roll and boot from CD
A Rocks screen will come up. Type “frontend” and press enter. This will run the frontend install. If you don't do it fast enough, the compute element install will start. If you are too slow, just reboot.
You will be asked which rolls to install. Install: kernel, base, ganglia, webserver, java, area51, and hpc. Do not install SGE. We will try to use condor for all job scheduling. Press ok after package selection.
When asked if you have additional roll CDs answer yes.
Insert the 4 OS discs one at a time in order
After the last OS disc there are no more rolls.
You will be asked for some cluster information. Use the information below.
Fully Qualified Domain Name: gpnjayhawk.cpe.engr.ku.edu
Country: US
Cluster Name: gpnjayhawk
Organization: University of Kansas
Location: Lawrence
State: Kansas
LatLong: N38.96 W95.25
Disk partitioning
On sda make / partition 50,000 MB
On sda make swap partition 6,000 MB
On sda make /state/partition1 with the rest of the disk
On sdb make /export partition 100,000 MB
On sdb make /osg_data with the rest of the disk
You will be asked for network information.
Private Network Interface
IP Address: 10.1.1.1
Netmask: 255.0.0.0
Public Network Interface
IP Address: 129.237.115.18
Netmask: 255.255.248.0
Gateway: 129.237.119.254
DNS 1: 129.237.112.1
DNS 2: 129.237.32.1
DNS 3: 129.237.23.2
You will be asked for time information
Check system clock uses UTC
For time zone select America/Chicago
Network time server pool.ntp.org
Root password is ********
The installer will ask you to insert discs. They will not be in order.
After all the discs have been read the software will install
The computer will reboot
The frontend node is now installed.
Log in as root. The first time you log in you will be prompted for a location of ssh keys and a pass phrase. Use the default directory. The pass phrase *************************** was used.
Export the /osg_data directory to /home/osg_data
Add this line to the /etc/exports file:
/osg_data 10.0.0.0/255.0.0.0(rw,async)
Add this line to the /etc/auto.home file:
osg_data gpnjayhawk.local:/osg_data
Restart NFS:
/etc/init.d/nfs restart
Do this command:
make -C /var/411
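The two file edits above can be made repeatable with a small helper that appends a line only if it is not already present. This is a minimal sketch; the function name add_line_once is made up for illustration and is not part of Rocks or NFS.

```shell
#!/bin/bash
# Append a line to a file only if that exact line is not already there.
# add_line_once is a hypothetical helper, not a Rocks or NFS tool.
add_line_once() {
    local file="$1" line="$2"
    touch "$file"
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Intended use on the frontend, as root (left commented out on purpose):
# add_line_once /etc/exports   '/osg_data 10.0.0.0/255.0.0.0(rw,async)'
# add_line_once /etc/auto.home 'osg_data gpnjayhawk.local:/osg_data'
# /etc/init.d/nfs restart
# make -C /var/411
```

Running the helper twice leaves a single copy of the line, so it is safe to rerun if you are not sure whether a step was already done.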
Install required compat-libstdc++-33 rpm
Get the rpm:
wget ftp://mirrors.kernel.org/centos/4.3/os/i386/CentOS/RPMS/compat-libstdc++-33-3.2.3-47.3.i386.rpm
Install on frontend node:
rpm -i compat-li*rpm
Copy the compat-libstdc++-33 rpm to /home/install/contrib/4.1/i386/RPMS/
Copy /home/install/site-profiles/4.1/nodes/skeleton.xml to /home/install/site-profiles/4.1/nodes/extend-compute.xml
Edit the new extend-compute.xml file. Find the <package>...</package> lines and add:
<package> compat-libstdc++-33 </package>
Rebuild the rocks distribution:
cd /home/install
rocks-dist dist
Run insert-ethers on the front end.
Select compute for appliance type.
Setup software on compute nodes. For each compute node:
Move monitor and keyboard to node
Turn on node
Press F8 while booting to get to the boot menu. (depends on your computer)
Select MBA... for PXE boot from broadcom card. (depends on your computer)
Installation should start
Move monitor and keyboard back to frontend after installation has started on all compute nodes.
Exit insert-ethers by pressing F10
You may now want to add some users with “useradd” and “passwd”
Check that /home/osg_data mounts on the nodes and that the file libstdc++.so.5 exists on the nodes.
Note: If there is no directory named osg_data in /home on a node, restart NFS on that node with the following commands:
ssh compute-0-0
sudo /etc/init.d/nfs restart
The default location for the libstdc++.so.5 file is /usr/lib.
You can check for the file with the following commands:
sudo updatedb
locate libstdc++.so.5
Change the permissions for /home/osg_data so it's readable by everyone
chmod a+rx /home/osg_data
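The per-node check described above can be looped over all nodes from the frontend. This is a sketch only: node_names and check_node are made-up helper names, the node count of 4 matches the hosts used in these notes, and it assumes root can ssh to the nodes without a password prompt.

```shell
#!/bin/bash
# Print the compute node names compute-0-0 .. compute-0-(N-1).
node_names() {
    local count="$1" i=0
    while [ "$i" -lt "$count" ]; do
        echo "compute-0-$i"
        i=$((i + 1))
    done
}

# Check one node over ssh: is /home/osg_data reachable (the automounter
# mounts it on access) and is libstdc++.so.5 in /usr/lib?
check_node() {
    ssh "$1" 'ls -d /home/osg_data >/dev/null && ls /usr/lib/libstdc++.so.5 >/dev/null' \
        && echo "$1 ok" || echo "$1 FAILED"
}

# Intended use from the frontend:
# for n in $(node_names 4); do check_node "$n"; done
```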
Condor Install:
Condor 6.7.20 for RHEL 4 was used for this installation.
Download condor tar file and manual from condor website.
Add condor user
useradd condor
Extract Condor files in /root
tar -zxvf condor*.tar.gz
Run the Condor install script.
cd ~/condor-6.7.*
perl condor_install
The install script will ask many questions. Answer as below.
-STEP 1- Would you like to do a full installation of Condor? - yes
-STEP 2- Are you planning to setup Condor on multiple machines? - yes
Will all the machines share files via a file server? - yes
What are the hostnames of the machines you wish to setup? (1 per line)
gpnjayhawk
compute-0-0
compute-0-1
compute-0-2
compute-0-3
-STEP 3- Have you installed a release directory already? - no
Where would you like to install the Condor release directory? - /share/apps/condor
That directory doesn't exist, should I create it now? - yes
-STEP 4- If something goes wrong with Condor, who should get email about it? - root@gpnjayhawk.local
What is the full path to a mail program that understands "-s" means you want to specify a subject? - /bin/mail
-STEP 5- Do all of the machines in your pool from your domain ("local") share a common filesystem? - yes
Do all of the users across all the machines in your domain have a unique UID (in other words, do they all share a common passwd file)? - yes
In some cases, even if you have unique UIDs, you might not have all users listed in the password file on each machine. Is this the case at your site? - no
-STEP 6- Enable Java Universe support? - yes
Please enter the full path to the JVM, or "none" to leave unconfigured: - /usr/java/jdk1.5.0_05/bin/java
You entered: /usr/java/jdk1.5.0_05/bin/java. Is that right? - yes
-STEP 7- Shall I create links in some other directory? - yes
Where should I install these files? - /usr/local/bin
-STEP 8- What is the full hostname of the central manager? - gpnjayhawk.cpe.engr.ku.edu
You have a "condor" user on this machine. Is the home directory for this account (/home/condor) shared among all machines in your pool? - yes
Do you want to put all the Condor directories for each machine in subdirectories of /home/condor/hosts? - yes
Do you want to specify a local partition for file locking? - yes
Where should I put the lock files? - /tmp
Shall I create it now? - yes (you probably will not get this question now)
-STEP 10- Do you want all the machine-specific config files for each host in one directory? - yes
What directory should I use? - /share/apps/condor/etc
Should I put in a soft link from /home/condor/condor_config to /share/apps/condor/etc/condor_config - yes
Further Condor Setup
Run condor_init on the frontend.
perl /share/apps/condor/sbin/condor_init
Edit the file /share/apps/condor/etc/condor_config.
Comment out the line:
LOCAL_CONFIG_FILE = $(LOCAL_DIR)/condor_config.local
Uncomment the line:
LOCAL_CONFIG_FILE = $(RELEASE_DIR)/etc/$(HOSTNAME).local
Change UID_DOMAIN to local
Change FILESYSTEM_DOMAIN to local
Change USE_NFS to true
Set TRUST_UID_DOMAIN to true
Edit the file /share/apps/condor/etc/gpnjayhawk.local
Add the lines: (mine was empty)
# NETWORK_INTERFACE = 10.1.1.1
START = False
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
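After editing the config files, it may help to confirm that Condor actually sees the values you set. The sketch below wraps condor_config_val, the stock Condor tool that prints the value of a configuration macro. The wrapper name check_setting is made up, and it assumes the Condor tools are on your PATH and can find the config file.

```shell
#!/bin/bash
# Compare one Condor setting against the value these notes expect.
# check_setting is a hypothetical helper around condor_config_val.
check_setting() {
    local name="$1" want="$2" got
    got=$(condor_config_val "$name" 2>/dev/null) || got=""
    # Compare case-insensitively so that True and true both pass.
    if [ "$(echo "$got" | tr '[:upper:]' '[:lower:]')" = \
         "$(echo "$want" | tr '[:upper:]' '[:lower:]')" ]; then
        echo "$name ok"
    else
        echo "$name is '$got', expected '$want'"
    fi
}

# Intended use on the frontend:
# check_setting UID_DOMAIN local
# check_setting FILESYSTEM_DOMAIN local
# check_setting USE_NFS true
# check_setting TRUST_UID_DOMAIN true
```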
On front end make condor start on boot
Copy /share/apps/condor/etc/examples/condor.boot to /etc/init.d/condor
Make symbolic links in the runlevel directories:
ln -s /etc/init.d/condor /etc/rc.d/rc1.d/K99condor
ln -s /etc/init.d/condor /etc/rc.d/rc2.d/K99condor
ln -s /etc/init.d/condor /etc/rc.d/rc3.d/S99condor
ln -s /etc/init.d/condor /etc/rc.d/rc4.d/S99condor
ln -s /etc/init.d/condor /etc/rc.d/rc5.d/S99condor
ln -s /etc/init.d/condor /etc/rc.d/rc6.d/K99condor
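The six links above follow a pattern: kill links in runlevels 1, 2 and 6, start links in 3, 4 and 5. A sketch of the same thing as a loop follows. The function name, the optional second argument (an alternate rc root, handy for a dry run), and the use of -f so that rerunning does not fail are my additions, not part of the original steps.

```shell
#!/bin/bash
# Create the Condor runlevel links: K99condor in runlevels 1, 2 and 6,
# S99condor in runlevels 3, 4 and 5.
# The second argument is optional and defaults to /etc/rc.d.
make_condor_rc_links() {
    local script="$1" root="${2:-/etc/rc.d}" level
    for level in 1 2 6; do
        ln -sf "$script" "$root/rc$level.d/K99condor"
    done
    for level in 3 4 5; do
        ln -sf "$script" "$root/rc$level.d/S99condor"
    done
}

# Intended use on the frontend, as root:
# make_condor_rc_links /etc/init.d/condor
```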
Make the file /etc/profile.d/condor.sh to setup the condor environment with the lines:
export CONDOR_ROOT=/share/apps/condor
export CONDOR_CONFIG=$CONDOR_ROOT/etc/condor_config
export PATH=$PATH:$CONDOR_ROOT/bin:$CONDOR_ROOT/sbin
Change the permissions of the file so that it is executable:
chmod a+x /etc/profile.d/condor.sh
source /etc/profile.d/condor.sh
Change the permissions of /home/condor so that everyone can read it.
chmod a+rx /home/condor
The cluster frontend should now be setup to be the condor central manager. Start Condor or reboot the frontend and make sure condor starts. On the central manager the following processes should be running: condor_master, condor_collector, condor_negotiator, condor_startd, condor_schedd.
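One way to check the process list without reading ps output by eye is sketched below. missing_daemons is a made-up helper that prints each expected name not found in a process listing. It matches on full command lines rather than short command names, because Linux truncates the short name to 15 characters and condor_negotiator is longer than that.

```shell
#!/bin/bash
# Print every expected process name that does not appear in a process list.
# First argument: the process listing, one command line per line.
# Remaining arguments: the names to look for (substring match).
missing_daemons() {
    local running="$1" name
    shift
    for name in "$@"; do
        printf '%s\n' "$running" | grep -qF -- "$name" || echo "$name"
    done
}

# Intended use on the central manager (no output means all were found):
# missing_daemons "$(ps -e -o args=)" condor_master condor_collector \
#     condor_negotiator condor_startd condor_schedd
```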
Finish condor installation on the compute nodes. Edit the /home/install/site-profiles/4.1/nodes/extend-compute.xml file. Make the post section look like below. It includes a little bit in the last line for the OSG setup later.
<post>
<file name="/etc/init.d/condor">
<eval>cat /etc/init.d/condor</eval>
</file>
<file name="/etc/profile.d/condor.sh">
<eval>cat /etc/profile.d/condor.sh</eval>
</file>
chmod u+x /etc/init.d/condor
chmod u+x /etc/profile.d/condor.sh
ln -s /etc/init.d/condor /etc/rc.d/rc1.d/K99condor
ln -s /etc/init.d/condor /etc/rc.d/rc2.d/K99condor
ln -s /etc/init.d/condor /etc/rc.d/rc3.d/S99condor
ln -s /etc/init.d/condor /etc/rc.d/rc4.d/S99condor
ln -s /etc/init.d/condor /etc/rc.d/rc5.d/S99condor
ln -s /etc/init.d/condor /etc/rc.d/rc6.d/K99condor
mkdir /state/partition1/osg_wn_tmp
</post>
Rebuild the rocks distribution:
cd /home/install
rocks-dist dist
Try it out on a node. Reinstall a compute node.
Check the new compute node setup. The processes: condor_master, condor_startd, condor_schedd should be running.
Reinstall the rest of the compute nodes
The basic condor setup should be complete. Try condor_status as a regular user, then try some simple jobs.
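For a first simple job, something like the following may be enough. It is a sketch, not a tested part of this install: the function name and the output file names are placeholders, and the job only runs /bin/hostname so you can see which node picked it up.

```shell
#!/bin/bash
# Write a minimal vanilla-universe submit description to the given file.
# Four copies of /bin/hostname are queued; file names are placeholders.
write_test_submit() {
    cat > "$1" <<'EOF'
universe   = vanilla
executable = /bin/hostname
output     = hostname.$(Cluster).$(Process).out
error      = hostname.$(Cluster).$(Process).err
log        = hostname.log
queue 4
EOF
}

# Intended use as a regular user, from a directory under /home:
# write_test_submit hostname.sub
# condor_submit hostname.sub
# condor_q
```

When the jobs finish, each .out file should contain the name of the compute node that ran that copy.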
More condor setup (dedicated resources for parallel jobs):
Put the following lines in the local config files for all the compute nodes. See the /share/apps/condor/etc/examples/condor_config.local.dedicated.resource file.
DedicatedScheduler = "DedicatedScheduler@gpnjayhawk.cpe.engr.ku.edu"
#Start all jobs but prefer dedicated ones
START = True
SUSPEND = False
CONTINUE = True
PREEMPT = False
KILL = False
WANT_SUSPEND = False
WANT_VACATE = False
RANK = Scheduler =?= $(DedicatedScheduler)
MPI_CONDOR_RSH_PATH = $(LIBEXEC)
CONDOR_SSHD = /usr/sbin/sshd
CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler
You probably need to restart condor on the machines. Run “condor_reconfig -all” on the central manager, or reboot the computers and see if it still works.
Test with some MPI jobs in the parallel universe.
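Before trying a real MPI program, a plain parallel universe job can show whether the dedicated scheduler is matching nodes at all. The sketch below is not an MPI job; it just runs /bin/hostname on each allocated machine. The function name and file names are placeholders, and it assumes your Condor version expands $(NODE) in parallel universe submit files.

```shell
#!/bin/bash
# Write a parallel-universe smoke-test submit description.
# Arguments: output file, number of machines (defaults to 2).
write_parallel_submit() {
    cat > "$1" <<EOF
universe      = parallel
executable    = /bin/hostname
machine_count = ${2:-2}
output        = parallel.\$(NODE).out
error         = parallel.\$(NODE).err
log           = parallel.log
queue
EOF
}

# Intended use as a regular user on the frontend:
# write_parallel_submit parallel.sub 4
# condor_submit parallel.sub
```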
Install pacman:
mkdir /opt/pacman
cd /opt/pacman
wget http://physics.bu.edu/pacman/sample_cache/tarballs/pacman-latest.tar.gz
tar -zxvf pac*.tar.gz
You will have to run “source setup.sh” in the pacman directory before running pacman
Install OSG stuff for GPN:
Make some directories for OSG.
mkdir -p /opt/osg/osg-0.4.1
mkdir -p /share/apps/osg/osg-wn-client-0.4.1
mkdir /home/osg_data/apps
mkdir /home/osg_data/data
mkdir /state/partition1/osg_wn_tmp
The /state/partition1/osg_wn_tmp directory should be made automatically on the worker nodes.
cd /opt/pacman/pacm*/
source setup.sh
cd /opt/osg/osg-0.4.1
export VDTSETUP_CONDOR_LOCATION=$CONDOR_ROOT (make sure that worked)
pacman -get OSG:ce
Answer yes to all questions and wait for install
To be safe logout and login.
cd /opt/pacman/pacm*/
source setup.sh
cd /share/apps/osg/osg-wn-client-0.4.1
export VDTSETUP_CONDOR_LOCATION=$CONDOR_ROOT
pacman -get OSG:wn-client
Answer yes until you get to the question about the location of the CA files. Answer “local” to that question.
Continue to answer yes.
You need to add symbolic links for globus on all the worker nodes. Add these lines to the post section of the /home/install/site-profiles/4.1/nodes/extend-compute.xml file:
mkdir -p /opt/osg/osg-0.4.1
ln -s /share/apps/osg/globus /opt/osg/osg-0.4.1/globus
Rebuild the rocks distribution. If you sourced the OSG setup.sh, you will have to log out and log in first.
cd /home/install
rocks-dist dist
Download jdk-1_5*i586.bin from Sun. The Java 1.4 that comes with OSG is not good enough.
cp ./jdk-1_5*bin /opt/osg/osg-0.4.1/
cp ./jdk-1_5*bin /share/apps/osg/osg-wn-client-0.4.1/
cd /opt/osg/osg-0.4.1
sh jdk*.bin
cd /share/apps/osg/osg-wn-client-0.4.1
sh jdk*.bin
mv /opt/osg/osg-0.4.1/jdk1.4 /opt/osg/osg-0.4.1/jdk1.4.old
mv /share/apps/osg/osg-wn-client-0.4.1/jdk1.4 /share/apps/osg/osg-wn-client-0.4.1/jdk1.4.old
To make it easy, make symbolic links to the new Java:
ln -s /opt/osg/osg-0.4.1/jdk1.5.0_07 /opt/osg/osg-0.4.1/jdk1.4
ln -s /share/apps/osg/osg-wn-client-0.4.1/jdk1.5.0_07 /share/apps/osg/osg-wn-client-0.4.1/jdk1.4
Reinstall the compute nodes. (for the globus links)
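The move-and-link steps for the JDK are the same in both install directories, so they can be wrapped in one function. This is a sketch: swap_jdk is a made-up name, and the check that the new JDK directory exists and the use of -f and -n on ln (so a rerun replaces the link instead of failing) are my additions.

```shell
#!/bin/bash
# Move the bundled JDK directory aside and point its old name at the new JDK.
# Arguments: install root, old directory name, new directory name.
swap_jdk() {
    local root="$1" old="$2" new="$3"
    [ -d "$root/$new" ] || { echo "missing $root/$new" >&2; return 1; }
    # Only move the old directory if it is a real directory, not already a link.
    if [ -d "$root/$old" ] && [ ! -L "$root/$old" ]; then
        mv "$root/$old" "$root/$old.old"
    fi
    ln -sfn "$root/$new" "$root/$old"
}

# Intended use, as root, after running the jdk*.bin installers:
# swap_jdk /opt/osg/osg-0.4.1 jdk1.4 jdk1.5.0_07
# swap_jdk /share/apps/osg/osg-wn-client-0.4.1 jdk1.4 jdk1.5.0_07
```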
Configure the OSG stuff:
I already have personal, host, and http certificates. If you don't, see the GPN documentation to find out how to request them.
source /opt/osg/osg-0.4.1/setup.sh
export VDTSETUP_CONDOR_LOCATION=/share/apps/condor
cd /opt/osg/osg-0.4.1/
pacman -get OSG:Globus-Condor-Setup
Run this (select whatever default stuff)
$VDT_LOCATION/vdt/setup/setup-cert-request
In the /root directory run
wget http://t2.unl.edu/cms/grid_user/UNL-CA-bundle.tgz
tar -zxvf UNL*bund*.tgz
cp UNL*dle/* $VDT_LOCATION/globus/TRUSTED_CA/
This is where you would request the certificates that I already have.
mkdir /opt/osg/osg-0.4.1/globus/etc/http (if it's not there)
copy the host cert and key to /opt/osg/osg-0.4.1/globus/etc/
copy the http cert and key to /opt/osg/osg-0.4.1/globus/etc/http/
/etc/rc.d/init.d/xinetd restart
cd $VDT_LOCATION/monitoring
Run “./configure-osg.sh”. The answers are below.
Specify your OSG GROUP [OSG]: OSG
Specify your OSG SITE NAME [UNAVAILABLE]: gpnjayhawk
Specify your VO sponsors [UNAVAILABLE]: gpn
Specify your policy url [UNAVAILABLE]: gpnjayhawk.cpe.engr.ku.edu/gpn/policy
Specify a contact for your server (full name) [UNAVAILABLE]: John Eslick
Specify the contact's email address [UNAVAILABLE]: jceslick@ku.edu
Specify your server's city [UNAVAILABLE]: Lawrence
Specify your server's country [UNAVAILABLE]: US
Specify your server's longitude [UNAVAILABLE]: -95.25
Specify your server's latitude [UNAVAILABLE]: 38.96
Specify your OSG GRID path [UNAVAILABLE]: /share/apps/osg/osg-wn-client-0.4.1
Specify your OSG APP path [UNAVAILABLE]: /home/osg_data/apps
Specify your OSG DATA path [UNAVAILABLE]: /home/osg_data/data
Specify your OSG WN_TMP path [UNAVAILABLE]: /state/partition1/osg_wn_tmp
Specify your OSG SITE_READ path [UNAVAILABLE]:
Specify your OSG SITE_WRITE path [UNAVAILABLE]:
Is a storage element (SE) available [n] (y/n): n
Are you running the MonALISA monitoring services [n] (y/n): n
Specify your batch queue manager OSG_JOB_MANAGER [UNAVAILABLE]: condor
Specify installation directory for condor [UNAVAILABLE]: /share/apps/condor
Specify the Condor config location []: /share/apps/condor/etc/condor_config
Is the above information correct (y/n)?: y (if it is correct)
Then some stuff happens
That's it for this part
Logout
Setup VOMS Client:
mkdir -p /opt/glite/etc/
make the file /opt/glite/etc/vomses with the line:
"gpn" "t2.unl.edu" "15002" "/DC=org/DC=doegrids/OU=Services/CN=voms/t2.unl.edu" "gpn" "32"
Personal Certificate:
Make sure you have a personal certificate for yourself. See the gpn documentation to find out how to request one.
Login as you (not root). In your home directory do “mkdir ./.globus”
Copy your usercert.pem and userkey.pem files to ~/.globus
Change the permissions on your *.pem files with the commands:
chmod 600 ~/.globus/userkey.pem
chmod 644 ~/.globus/usercert.pem
source /opt/osg/osg-0.4.1/setup.sh
voms-proxy-init --voms gpn:/gpn (NOTE: you must request to be in the VOMS group see the GPN docs)
globusrun -a -r gpn-husker.unl.edu
If that's successful, you can probably run jobs on other GPN clusters now. GUMS is still not set up on gpnjayhawk, though, so do that now.
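If voms-proxy-init complains about the key or certificate, the file permissions are a common thing to recheck. The sketch below applies the two chmod commands from above and prints the resulting modes. The function name is made up, and it assumes GNU stat (for the -c option), which is what CentOS ships.

```shell
#!/bin/bash
# Set and report the permissions on the user key and certificate.
# Argument: the directory holding the .pem files (defaults to ~/.globus).
fix_cert_perms() {
    local dir="${1:-$HOME/.globus}"
    chmod 600 "$dir/userkey.pem" || return 1
    chmod 644 "$dir/usercert.pem" || return 1
    # Print the octal mode and path of each file.
    stat -c '%a %n' "$dir/userkey.pem" "$dir/usercert.pem"
}

# Intended use as yourself (not root):
# fix_cert_perms
```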
GUMS setup:
Login as root again
source /opt/osg/osg-0.4.1/setup.sh
cd $VDT_LOCATION
pacman -get OSG:gums (answer yes to questions)
ln -s /opt/osg/osg-0.4.1/globus/etc /etc/grid-security
$VDT_LOCATION/post-install/apache start
$VDT_LOCATION/post-install/tomcat-5 restart
Logout. Login to your account.
source /opt/osg/osg-0.4.1/setup.sh
run grid-proxy-init and note your identity. This is your DN (distinguished name).
cd /opt/osg/osg-0.4.1/gums-service/sbin/
./addAdmin 'your DN' (Do include the quotation marks around your DN)
In a browser where you have installed your user cert (see GPN docs), visit https://gpnjayhawk.cpe.engr.ku.edu:8443/gums/.
Click “Generate Grid Mapfile” and make sure you do not get access denied
Log out of your account and log back in as root
run “source /opt/osg/osg-0.4.1/setup.sh”
cd /opt/osg/osg-0.4.1/globus/etc/http
chown daemon:daemon ./httpkey.pem
ln -s /opt/osg/osg-0.4.1/globus/share/certificates /etc/grid-security/certificates
Edit the gums configuration file at $VDT_LOCATION/vdt-app-data/gums/gums.config. See this page http://t2.unl.edu/documentation/gpn/gpn-gums-configuration/ .
Cut off everything after the </persistencefactories> section and paste in the corresponding part from the above reference.
Set the wild card in the host groups section at the end to be “*.cpe.engr.ku.edu”
For now, remove all the groups except for gpn.
Back in a web browser where your personal certificate is loaded, go to https://gpnjayhawk.cpe.engr.ku.edu:8443/gums/, then click on “Update Members”, then click “Update VO Member Database”
If that was successful go on. If not see the log files. You can find information in the GPN documentation.
Logout and back in as root. You need to do this to get OSG out of your environment to add users
Add user gpn
PRIMA setup:
Login as root
run “source /opt/osg/osg-0.4.1/setup.sh”
cp $VDT_LOCATION/post-install/*.conf $GRID_SECURITY_DIR
emacs $GRID_SECURITY_DIR/prima-authz.conf
look at imsContact line. Looks ok already.
/etc/rc.d/init.d/xinetd restart
$VDT_LOCATION/post-install/apache restart
$VDT_LOCATION/post-install/tomcat-5 restart
Test it out:
Login as you
source /opt/osg/osg-0.4.1/setup.sh
voms-proxy-init --voms gpn:/gpn
globusrun -a -r gpnjayhawk.cpe.engr.ku.edu
If that's ok, try some stuff with condor. If it's not ok, check the globus log files.
More testing...
More stuff:
I haven't set up monitoring things.
Notes:
For CentOS 4.4, use the following command to get the compat-libstdc++-33 rpm:
wget ftp://mirrors.kernel.org/centos/4.4/os/i386/CentOS/RPMS/compat-libstdc++-33-3.2.3-47.3.i386.rpm