
GPN Cluster Install Notes

by John Eslick last modified 2007-05-24 15:11

Notes from the software installation of gpnjayhawk.

Install Notes


Author: John Eslick, jceslick@ku.edu

Disclaimer: I'm not really sure my site works properly, and I may have made mistakes typing up these notes, so you should at least read through them before you start and see if everything seems reasonable. If you find mistakes, please email me and I'll fix them. If we get the bugs out of this, I may finally get my own site working.


The Condor configuration is still not quite right, but I think the problems are minor.


Original Documentation:

This just covers the steps used to set up the software for the GPN on gpnjayhawk.cpe.engr.ku.edu. For the details, see the documentation at http://t2.unl.edu/documentation/gpn/ and http://osg.ivdgl.org/twiki/bin/view/Integration/, and the Condor manual.


Software:

Rocks 4.1 for i386

CentOS 4.3 for i386

Condor 6.7.20

OSG 0.4.1


Hardware:

The cluster frontend is a Sun x2100 with a DVD drive, an 80 GB hard drive, and a 250 GB hard drive. It has a single core processor. The compute nodes each have one 80 GB hard drive and a dual core processor.


Set up the hardware first. Your hardware may be different, but hopefully that won't matter much. I think the important thing as far as these notes go is that the compute nodes can PXE boot.


OS Install:

There are 5 discs. The first is a boot roll for Rocks 4.1. The other four are CentOS 4.3 discs.

  1. Turn off all cluster machines

  2. Turn on head node

  3. Insert boot roll and boot from CD

  4. A Rocks screen will come up. Type “frontend” and press Enter to run the frontend install. If you don't do it fast enough, the compute node install will start instead; if that happens, just reboot and try again.

  5. You will be asked which rolls to install. Install: kernel, base, ganglia, webserver, java, area51, and hpc. Do not install SGE; we will try to use Condor for all job scheduling. Press OK after package selection.

  6. When asked if you have additional roll CDs answer yes.

  7. Insert the 4 OS discs one at a time in order

  8. After the last OS disc there are no more rolls.

  9. You will be asked for some cluster information. Use the information below.

    1. Fully Qualified Domain Name: gpnjayhawk.cpe.engr.ku.edu

    2. Country: US

    3. Cluster Name: gpnjayhawk

    4. Organization: University of Kansas

    5. Location: Lawrence

    6. State: Kansas

    7. LatLong: N38.96 W95.25

  10. Disk partitioning

    1. On sda make / partition 50,000 MB

    2. On sda make swap partition 6,000 MB

    3. On sda make /state/partition1 with the rest of the disk

    4. On sdb make /export partition 100,000 MB

    5. On sdb make /osg_data with the rest of the disk

  11. You will be asked for network information.

    1. Private Network Interface

      1. IP Address: 10.1.1.1

      2. Netmask: 255.0.0.0

    2. Public Network Interface

      1. IP Address: 129.237.115.18

      2. Netmask: 255.255.248.0

      3. Gateway: 129.237.119.254

      4. DNS 1: 129.237.112.1

      5. DNS 2: 129.237.32.1

      6. DNS 3: 129.237.23.2

  12. You will be asked for time information

    1. Check system clock uses UTC

    2. For time zone select America/Chicago

    3. Network time server pool.ntp.org

  13. Root password is ********

  14. The installer will ask you to insert discs. They will not be in order.

  15. After all the discs have been read the software will install

  16. The computer will reboot

  17. The frontend node is now installed.

  18. Log in as root. The first time you log in you will be prompted for a location for the ssh keys and a pass phrase. Use the default directory. The pass phrase *************************** was used.

  19. Export the /osg_data directory to /home/osg_data

    1. Add this line to the /etc/exports file:

/osg_data  10.0.0.0/255.0.0.0(rw,async)

    2. Add this line to the /etc/auto.home file:

osg_data  gpnjayhawk.local:/osg_data

    3. Restart NFS:

/etc/init.d/nfs restart

    4. Rebuild the 411 files so the compute nodes pick up the auto.home change:

make -C /var/411

  1. Install required compat-libstdc++-33 rpm

    1. Get the rpm:

wget ftp://mirrors.kernel.org/centos/4.3/os/i386/CentOS/RPMS/compat-libstdc++-33-3.2.3-47.3.i386.rpm

    2. Install on frontend node:

rpm -i compat-li*rpm

    3. Copy the compat-libstdc++-33 rpm to /home/install/contrib/4.1/i386/RPMS/

    4. Copy /home/install/site-profiles/4.1/nodes/skeleton.xml to /home/install/site-profiles/4.1/nodes/extend-compute.xml

    5. Edit the new extend-compute.xml file. Find the <package>...</package> lines and add:

<package> compat-libstdc++-33 </package>

    6. Rebuild the rocks distribution:

cd /home/install
rocks-dist dist
  1. Run insert-ethers on the front end.

  2. Select compute for appliance type.

  3. Set up software on compute nodes. For each compute node:

    1. Move monitor and keyboard to node

    2. Turn on node

    3. Press F8 while booting to get to the boot menu. (depends on your computer)

    4. Select MBA... for PXE boot from broadcom card. (depends on your computer)

    5. Installation should start

  4. Move monitor and keyboard back to frontend after installation has started on all compute nodes.

  5. Exit insert-ethers by pressing F10

  6. You may now want to add some users with “useradd” and “passwd”

  7. Check that /home/osg_data mounts on the nodes and that the file libstdc++.so.5 exists on the nodes.
    Note: If there is no directory named osg_data in /home on a node, log in to that node and restart NFS:

                  ssh compute-0-0
                  sudo /etc/init.d/nfs restart

    The default location for the libstdc++.so.5 file is /usr/lib.
    You can check for the file with the following commands:

                  sudo updatedb
                  locate libstdc++.so.5

  8. Change the permissions for /home/osg_data so it's readable by everyone

chmod a+rx /home/osg_data
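
If you end up repeating the /etc/exports and /etc/auto.home edits from step 19 (I reinstalled more than once), a small helper can keep the files from collecting duplicate lines. This is only a sketch; append_once is a made-up name and not part of Rocks or NFS.

```shell
# append_once FILE LINE
# Append LINE to FILE only if an identical line is not already there,
# so re-running the setup does not create duplicate entries.
# (append_once is a hypothetical helper, not a Rocks or NFS tool.)
append_once() {
    file=$1
    line=$2
    # -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}

# Example (run as root on the frontend), using the two lines from step 19:
#   append_once /etc/exports   '/osg_data  10.0.0.0/255.0.0.0(rw,async)'
#   append_once /etc/auto.home 'osg_data  gpnjayhawk.local:/osg_data'
```

You still need to restart NFS and run the make in /var/411 afterwards, as in the steps above.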


Condor Install:

Condor 6.7.20 for RHEL 4 was used for this installation.

  1. Download condor tar file and manual from condor website.

  2. Add condor user

useradd condor

  1. Extract Condor files in /root

tar -zxvf condor*.tar.gz

  1. Run the Condor install script.

cd ~/condor-6.7.*

perl condor_install

  1. The install script will ask many questions. Answer as below.

    1. -STEP 1- Would you like to do a full installation of Condor? - yes

    2. -STEP 2- Are you planning to setup Condor on multiple machines? - yes

    3. Will all the machines share files via a file server? - yes

    4. What are the hostnames of the machines you wish to setup? (1 per line)

      1. gpnjayhawk

      2. compute-0-0

      3. compute-0-1

      4. compute-0-2

      5. compute-0-3

    5. -STEP 3- Have you installed a release directory already? - no

    6. Where would you like to install the Condor release directory? - /share/apps/condor

    7. That directory doesn't exist, should I create it now? - yes

    8. -STEP 4- If something goes wrong with Condor, who should get email about it? - root@gpnjayhawk.local

    9. What is the full path to a mail program that understands "-s" means you want to specify a subject? - /bin/mail

    10. -STEP 5- Do all of the machines in your pool from your domain ("local") share a common filesystem? - yes

    11. Do all of the users across all the machines in your domain have a unique UID (in other words, do they all share a common passwd file)? - yes

    12. In some cases, even if you have unique UIDs, you might not have all users listed in the password file on each machine. Is this the case at your site? - no

    13. -STEP 6- Enable Java Universe support? - yes

    14. Please enter the full path to the JVM, or "none" to leave unconfigured: - /usr/java/jdk1.5.0_05/bin/java

    15. You entered: /usr/java/jdk1.5.0_05/bin/java. Is that right? - yes

    16. -STEP 7- Shall I create links in some other directory? - yes

    17. Where should I install these files? - /usr/local/bin

    18. -STEP 8- What is the full hostname of the central manager? - gpnjayhawk.cpe.engr.ku.edu

    19. You have a "condor" user on this machine. Is the home directory for this account (/home/condor) shared among all machines in your pool? - yes

    20. Do you want to put all the Condor directories for each machine in subdirectories of /home/condor/hosts? - yes

    21. Do you want to specify a local partition for file locking? - yes

    22. Where should I put the lock files? - /tmp

    23. Shall I create it now? - yes (you probably will not get this question now)

    24. -STEP 10- Do you want all the machine-specific config files for each host in one directory? - yes

    25. What directory should I use? - /share/apps/condor/etc

    26. Should I put in a soft link from /home/condor/condor_config to /share/apps/condor/etc/condor_config - yes


Further Condor Setup:

  1. Run condor_init on the frontend.

        perl /share/apps/condor/sbin/condor_init

  1. Edit the file /share/apps/condor/etc/condor_config.

    1. Comment out the line:

        LOCAL_CONFIG_FILE = $(LOCAL_DIR)/condor_config.local

    1. Uncomment the line:

        LOCAL_CONFIG_FILE = $(RELEASE_DIR)/etc/$(HOSTNAME).local

    1. Change UID_DOMAIN to local

    2. Change FILESYSTEM_DOMAIN to local

    3. Change USE_NFS to true

    4. Set TRUST_UID_DOMAIN to true

  1. Edit the file /share/apps/condor/etc/gpnjayhawk.local

    1. Add the lines: (mine was empty)

# NETWORK_INTERFACE = 10.1.1.1

START = False

DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD

  1. On front end make condor start on boot

    1. Copy /share/apps/condor/etc/examples/condor.boot to /etc/init.d/condor

    2. Make symbolic links in the rc directories:

ln -s /etc/init.d/condor /etc/rc.d/rc1.d/K99condor

ln -s /etc/init.d/condor /etc/rc.d/rc2.d/K99condor

ln -s /etc/init.d/condor /etc/rc.d/rc3.d/S99condor

ln -s /etc/init.d/condor /etc/rc.d/rc4.d/S99condor

ln -s /etc/init.d/condor /etc/rc.d/rc5.d/S99condor

ln -s /etc/init.d/condor /etc/rc.d/rc6.d/K99condor

  1. Create the file /etc/profile.d/condor.sh to set up the Condor environment, with the lines:

export CONDOR_ROOT=/share/apps/condor

export CONDOR_CONFIG=$CONDOR_ROOT/etc/condor_config

export PATH=$PATH:$CONDOR_ROOT/bin:$CONDOR_ROOT/sbin

  1. Make the file executable and load it into your current shell:

chmod a+x /etc/profile.d/condor.sh

source /etc/profile.d/condor.sh

  1. Change the permissions of /home/condor so that everyone can read it.

chmod a+rx /home/condor

  1. The cluster frontend should now be set up as the Condor central manager. Start Condor or reboot the frontend and make sure Condor starts. On the central manager the following processes should be running: condor_master, condor_collector, condor_negotiator, condor_startd, condor_schedd.

  2. Finish condor installation on the compute nodes. Edit the /home/install/site-profiles/4.1/nodes/extend-compute.xml file. Make the post section look like below. It includes a little bit in the last line for the OSG setup later.

<post>

<file name="/etc/init.d/condor">

<eval>cat /etc/init.d/condor</eval>

</file>

<file name="/etc/profile.d/condor.sh">

<eval>cat /etc/profile.d/condor.sh</eval>

</file>

chmod u+x /etc/init.d/condor

chmod u+x /etc/profile.d/condor.sh

ln -s /etc/init.d/condor /etc/rc.d/rc1.d/K99condor

ln -s /etc/init.d/condor /etc/rc.d/rc2.d/K99condor

ln -s /etc/init.d/condor /etc/rc.d/rc3.d/S99condor

ln -s /etc/init.d/condor /etc/rc.d/rc4.d/S99condor

ln -s /etc/init.d/condor /etc/rc.d/rc5.d/S99condor

ln -s /etc/init.d/condor /etc/rc.d/rc6.d/K99condor

mkdir /state/partition1/osg_wn_tmp

</post>

  1. Rebuild the rocks distribution:

cd /home/install

rocks-dist dist

  1. Try it out on a node. Reinstall a compute node.

  2. Check the new compute node setup. The processes: condor_master, condor_startd, condor_schedd should be running.

  3. Reinstall the rest of the compute nodes
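
The six ln commands used above to make Condor start on boot can also be written as a loop. This is a minimal sketch of the same links (start in runlevels 3 to 5, stop in 1, 2, and 6). The function name and the optional ROOT prefix are my own additions so it can be tried in a scratch directory first.

```shell
# link_condor_init [ROOT]
# Create the runlevel links for /etc/init.d/condor: start (S99) in
# runlevels 3-5, stop (K99) in runlevels 1, 2 and 6, as in the notes.
# ROOT is a hypothetical prefix for dry runs; leave it empty on a real system.
link_condor_init() {
    root=${1:-}
    for level in 1 2 3 4 5 6; do
        case $level in
            3|4|5) prefix=S99 ;;
            *)     prefix=K99 ;;
        esac
        mkdir -p "$root/etc/rc.d/rc$level.d"
        # -f replaces an existing link, so the function can be re-run
        ln -sf /etc/init.d/condor "$root/etc/rc.d/rc$level.d/${prefix}condor"
    done
}

# Dry run in a scratch directory:
#   link_condor_init /tmp/rctest && ls -l /tmp/rctest/etc/rc.d/rc3.d
# Real run on the frontend (as root):
#   link_condor_init
```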


The basic Condor setup should now be complete. Try condor_status as a regular user, then try some simple jobs.
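
To check the daemon lists given above without reading through ps output by hand, something like the following may help. It is a sketch only: missing_daemons is a made-up helper that reads any process listing on stdin and prints the expected names it cannot find. It matches on substrings, so use the full command line (ps -o args=), since the short process name can be truncated.

```shell
# missing_daemons "NAME NAME ..." < process-listing
# Read a process listing on stdin and print every expected daemon name
# that does not appear anywhere in it. (Hypothetical helper.)
missing_daemons() {
    expected=$1
    running=$(cat)
    for d in $expected; do
        printf '%s\n' "$running" | grep -qF -- "$d" || printf '%s\n' "$d"
    done
}

# Daemon lists taken from the notes above.
FRONTEND_DAEMONS="condor_master condor_collector condor_negotiator condor_startd condor_schedd"
COMPUTE_DAEMONS="condor_master condor_startd condor_schedd"

# On the frontend:
#   ps -e -o args= | missing_daemons "$FRONTEND_DAEMONS"
# For a compute node (assumes you can ssh to it from the frontend):
#   ssh compute-0-0 ps -e -o args= | missing_daemons "$COMPUTE_DAEMONS"
```

No output means everything expected was found.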


More condor setup (dedicated resources for parallel jobs):

  1. Put the following lines in the local config files for all the compute nodes. See the /share/apps/condor/etc/examples/condor_config.local.dedicated.resource file.

DedicatedScheduler = "DedicatedScheduler@gpnjayhawk.cpe.engr.ku.edu"

#Start all jobs but prefer dedicated ones

START = True

SUSPEND = False

CONTINUE = True

PREEMPT = False

KILL = False

WANT_SUSPEND = False

WANT_VACATE = False

RANK = Scheduler =?= $(DedicatedScheduler)

MPI_CONDOR_RSH_PATH = $(LIBEXEC)

CONDOR_SSHD = /usr/sbin/sshd

CONDOR_SSH_KEYGEN = /usr/bin/ssh-keygen

STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

  1. You probably need to restart condor on the machines. Run “condor_reconfig -all” on the central manager, or reboot the computers and see if it still works.

  2. Test with some MPI jobs in the parallel universe.
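
Since the same lines go into every compute node's local config file, it may be easier to keep them in one snippet file and append that to each node's file. The sketch below assumes the per-host files are named HOST.local in /share/apps/condor/etc (which is what the LOCAL_CONFIG_FILE setting earlier implies). The function name and the snippet path are made up for illustration.

```shell
# add_dedicated_config CONFIG_DIR SNIPPET HOST...
# Append the contents of SNIPPET to CONFIG_DIR/HOST.local for each HOST,
# skipping files that already define DedicatedScheduler.
# (Hypothetical helper; not part of Condor.)
add_dedicated_config() {
    config_dir=$1
    snippet=$2
    shift 2
    for host in "$@"; do
        target="$config_dir/$host.local"
        if grep -q '^DedicatedScheduler' "$target" 2>/dev/null; then
            echo "skipping $target (already configured)"
        else
            cat "$snippet" >> "$target"
            echo "updated $target"
        fi
    done
}

# Example, assuming the config lines were saved to the hypothetical file
# /root/dedicated.snippet:
#   add_dedicated_config /share/apps/condor/etc /root/dedicated.snippet \
#       compute-0-0 compute-0-1 compute-0-2 compute-0-3
#   condor_reconfig -all
```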


Install pacman:

  1. mkdir /opt/pacman

  2. cd /opt/pacman

  3. wget http://physics.bu.edu/pacman/sample_cache/tarballs/pacman-latest.tar.gz

  4. tar -zxvf pac*.tar.gz

You will have to run “source setup.sh” in the pacman directory before running pacman.


Install OSG stuff for GPN:

  1. Make some directories for OSG.

    1. mkdir -p /opt/osg/osg-0.4.1

    2. mkdir -p /share/apps/osg/osg-wn-client-0.4.1

    3. mkdir /home/osg_data/apps

    4. mkdir /home/osg_data/data

    5. mkdir /state/partition1/osg_wn_tmp

  2. The /state/partition1/osg_wn_tmp directory should be made automatically on the worker nodes.

  3. cd /opt/pacman/pacm*/

  4. source setup.sh

  5. cd /opt/osg/osg-0.4.1

  6. export VDTSETUP_CONDOR_LOCATION=$CONDOR_ROOT (make sure that worked, for example with “echo $VDTSETUP_CONDOR_LOCATION”)

  7. pacman -get OSG:ce

  8. Answer yes to all questions and wait for install

  9. To be safe, log out and log back in.

  10. cd /opt/pacman/pacm*/

  11. source setup.sh

  12. cd /share/apps/osg/osg-wn-client-0.4.1

  13. export VDTSETUP_CONDOR_LOCATION=$CONDOR_ROOT

  14. pacman -get OSG:wn-client

    1. Answer yes until you get to the question about the location of the CA files. Answer “local” to that one.

    2. Continue to answer yes.

  1. You need to add symbolic links for globus on all the worker nodes. Add these lines to the post section of the /home/install/site-profiles/4.1/nodes/extend-compute.xml file:

mkdir -p /opt/osg/osg-0.4.1

ln -s /share/apps/osg/globus /opt/osg/osg-0.4.1/globus


  1. Rebuild the rocks distribution. If you sourced the OSG setup.sh, you will have to log out and log in first.

    1. cd /home/install

    2. rocks-dist dist

  2. Download jdk-1_5*i586.bin from Sun. The Java 1.4 that comes with OSG is not good enough.

  3. cp ./jdk-1_5*bin /opt/osg/osg-0.4.1/

  4. cp ./jdk-1_5*bin /share/apps/osg/osg-wn-client-0.4.1/

  5. cd /opt/osg/osg-0.4.1

  6. sh jdk*.bin

  7. cd /share/apps/osg/osg-wn-client-0.4.1

  8. sh jdk*.bin

  9. mv /opt/osg/osg-0.4.1/jdk1.4 /opt/osg/osg-0.4.1/jdk1.4.old

  10. mv /share/apps/osg/osg-wn-client-0.4.1/jdk1.4 /share/apps/osg/osg-wn-client-0.4.1/jdk1.4.old

  11. To make it easy, make symbolic links to the new Java:

    1. ln -s /opt/osg/osg-0.4.1/jdk1.5.0_07 /opt/osg/osg-0.4.1/jdk1.4

    2. ln -s /share/apps/osg/osg-wn-client-0.4.1/jdk1.5.0_07 /share/apps/osg/osg-wn-client-0.4.1/jdk1.4

  12. Reinstall the compute nodes. (for the globus links)
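
The move-and-link steps for the JDK are done twice (once per install directory), so they can be wrapped in one function. This is a sketch under the assumption that the Sun installer unpacked into jdk1.5.0_07 as it did for me; swap_jdk is a made-up name.

```shell
# swap_jdk INSTALL_DIR NEW_JDK
# Move INSTALL_DIR/jdk1.4 aside to jdk1.4.old and point a jdk1.4 symlink
# at INSTALL_DIR/NEW_JDK. Refuses to run if the new JDK is missing, and
# does nothing if jdk1.4 is already a symlink. (Hypothetical helper.)
swap_jdk() {
    dir=$1
    new=$2
    if [ ! -d "$dir/$new" ]; then
        echo "missing $dir/$new" >&2
        return 1
    fi
    if [ -L "$dir/jdk1.4" ]; then
        echo "$dir/jdk1.4 is already a symlink" >&2
        return 0
    fi
    if [ -d "$dir/jdk1.4" ]; then
        mv "$dir/jdk1.4" "$dir/jdk1.4.old" || return 1
    fi
    ln -s "$dir/$new" "$dir/jdk1.4"
}

# Example, with the directory names used in these notes:
#   swap_jdk /opt/osg/osg-0.4.1 jdk1.5.0_07
#   swap_jdk /share/apps/osg/osg-wn-client-0.4.1 jdk1.5.0_07
```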


Configure the OSG stuff:

I already have personal, host, and http certificates. If you don't, see the GPN documentation to find out how to request them.

  1. source /opt/osg/osg-0.4.1/setup.sh

  2. export VDTSETUP_CONDOR_LOCATION=/share/apps/condor

  3. cd /opt/osg/osg-0.4.1/

  4. pacman -get OSG:Globus-Condor-Setup

  5. Run this and accept the defaults:

$VDT_LOCATION/vdt/setup/setup-cert-request

  1. In the /root directory run

wget http://t2.unl.edu/cms/grid_user/UNL-CA-bundle.tgz

tar -zxvf UNL*bund*.tgz

cp UNL*dle/* $VDT_LOCATION/globus/TRUSTED_CA/

  1. This is where you would request the certificates that I already have.

  2. mkdir /opt/osg/osg-0.4.1/globus/etc/http (if it's not there)

  3. Copy the host cert and key to /opt/osg/osg-0.4.1/globus/etc/

  4. Copy the http cert and key to /opt/osg/osg-0.4.1/globus/etc/http/

  5. /etc/rc.d/init.d/xinetd restart

  6. cd $VDT_LOCATION/monitoring

  7. Run “./configure-osg.sh”. The answers are below.

    1. Specify your OSG GROUP [OSG]: OSG

    2. Specify your OSG SITE NAME [UNAVAILABLE]: gpnjayhawk

    3. Specify your VO sponsors [UNAVAILABLE]: gpn

    4. Specify your policy url [UNAVAILABLE]: gpnjayhawk.cpe.engr.ku.edu/gpn/policy

    5. Specify a contact for your server (full name) [UNAVAILABLE]: John Eslick

    6. Specify the contact's email address [UNAVAILABLE]: jceslick@ku.edu

    7. Specify your server's city [UNAVAILABLE]: Lawrence

    8. Specify your server's country [UNAVAILABLE]: US

    9. Specify your server's longitude [UNAVAILABLE]: -95.25

    10. Specify your server's latitude [UNAVAILABLE]: 38.96

    11. Specify your OSG GRID path [UNAVAILABLE]: /share/apps/osg/osg-wn-client-0.4.1

    12. Specify your OSG APP path [UNAVAILABLE]: /home/osg_data/apps

    13. Specify your OSG DATA path [UNAVAILABLE]: /home/osg_data/data

    14. Specify your OSG WN_TMP path [UNAVAILABLE]: /state/partition1/osg_wn_tmp

    15. Specify your OSG SITE_READ path [UNAVAILABLE]:

    16. Specify your OSG SITE_WRITE path [UNAVAILABLE]:

    17. Is a storage element (SE) available [n] (y/n): n

    18. Are you running the MonALISA monitoring services [n] (y/n): n

    19. Specify your batch queue manager OSG_JOB_MANAGER [UNAVAILABLE]: condor

    20. Specify installation directory for condor [UNAVAILABLE]: /share/apps/condor

    21. Specify the Condor config location []: /share/apps/condor/etc/condor_config

    22. Is the above information correct (y/n)?: y (if it is correct)

    23. Then some stuff happens

  8. That's it for this part

  9. Logout
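
Several of the configure-osg.sh answers are directory paths, and a typo there is easy to miss. Before running the script it may be worth checking that they all exist. A minimal sketch follows; missing_dirs is a made-up helper, and the path list is just the one from the answers above.

```shell
# missing_dirs DIR...
# Print every argument that is not an existing directory; return non-zero
# if any were missing. (Hypothetical helper.)
missing_dirs() {
    rc=0
    for d in "$@"; do
        if [ ! -d "$d" ]; then
            printf '%s\n' "$d"
            rc=1
        fi
    done
    return $rc
}

# Paths given to configure-osg.sh in the answers above:
#   missing_dirs /share/apps/osg/osg-wn-client-0.4.1 \
#                /home/osg_data/apps /home/osg_data/data \
#                /state/partition1/osg_wn_tmp /share/apps/condor
```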


Setup VOMS Client:

  1. mkdir -p /opt/glite/etc/

  2. make the file /opt/glite/etc/vomses with the line:

"gpn" "t2.unl.edu" "15002" "/DC=org/DC=doegrids/OU=Services/CN=voms/t2.unl.edu" "gpn" "32"


Personal Certificate:

  1. Make sure you have a personal certificate for yourself. See the gpn documentation to find out how to request one.

  2. Login as you (not root). In your home directory do “mkdir ./.globus”

  3. Copy your usercert.pem and userkey.pem files to ~/.globus

  4. Change the permissions on your *.pem files with the commands:

    1. chmod 600 ~/.globus/userkey.pem

    2. chmod 644 ~/.globus/usercert.pem

  5. source /opt/osg/osg-0.4.1/setup.sh

  6. voms-proxy-init --voms gpn:/gpn (NOTE: you must first request membership in the VOMS group; see the GPN docs)

  7. globusrun -a -r gpn-husker.unl.edu

  8. If that's successful, you can probably run jobs on other GPN clusters now. GUMS is still not set up on gpnjayhawk, though, so do that next.
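
If voms-proxy-init complains about your key or certificate, wrong file permissions are a likely cause. The sketch below checks for the two modes set in step 4. check_cert_perms is a made-up name, and it relies on GNU stat (the -c option), which is what CentOS ships.

```shell
# check_cert_perms [DIR]
# Verify that userkey.pem is mode 600 and usercert.pem is mode 644 in DIR
# (default ~/.globus). Prints a message for each problem and returns
# non-zero if anything is wrong. Uses GNU stat (-c). (Hypothetical helper.)
check_cert_perms() {
    dir=${1:-$HOME/.globus}
    rc=0
    for spec in userkey.pem:600 usercert.pem:644; do
        name=${spec%%:*}
        want=${spec##*:}
        if [ ! -f "$dir/$name" ]; then
            echo "$dir/$name is missing"
            rc=1
            continue
        fi
        have=$(stat -c %a "$dir/$name")
        if [ "$have" != "$want" ]; then
            echo "$dir/$name has mode $have, expected $want"
            rc=1
        fi
    done
    return $rc
}

# Example:
#   check_cert_perms && echo "certificate permissions look fine"
```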


GUMS setup:

  1. Login as root again

  2. source /opt/osg/osg-0.4.1/setup.sh

  3. cd $VDT_LOCATION

  4. pacman -get OSG:gums (answer yes to questions)

  5. ln -s /opt/osg/osg-0.4.1/globus/etc /etc/grid-security

  6. $VDT_LOCATION/post-install/apache start

  7. $VDT_LOCATION/post-install/tomcat-5 restart

  8. Logout. Login to your account.

  9. source /opt/osg/osg-0.4.1/setup.sh

  10. run grid-proxy-init and note your identity. This is your DN (distinguished name).

  11. cd /opt/osg/osg-0.4.1/gums-service/sbin/

  12. ./addAdmin 'your DN' (Do include the quotation marks around your DN)

  13. In a browser where you have installed your user cert (see GPN docs), visit https://gpnjayhawk.cpe.engr.ku.edu:8443/gums/.

  14. Click “Generate Grid Mapfile” and make sure you do not get access denied

  15. Log out of your account and log back in as root

  16. run “source /opt/osg/osg-0.4.1/setup.sh”

  17. cd /opt/osg/osg-0.4.1/globus/etc/http

  18. chown daemon:daemon ./httpkey.pem

  19. ln -s /opt/osg/osg-0.4.1/globus/share/certificates /etc/grid-security/certificates

  20. Edit the gums configuration file at $VDT_LOCATION/vdt-app-data/gums/gums.config. See this page http://t2.unl.edu/documentation/gpn/gpn-gums-configuration/ .

    1. Cut off everything after the </persistencefactories> section and paste in the corresponding part from the above reference.

    2. Set the wild card in the host groups section at the end to be “*.cpe.engr.ku.edu”

    3. For now, remove all the groups except for gpn.

  21. Back in a web browser where your personal certificate is loaded, go to https://gpnjayhawk.cpe.engr.ku.edu:8443/gums/, then click on “Update Members”, then click “Update VO Member Database”

  22. If that was successful, go on. If not, see the log files. You can find information in the GPN documentation.

  23. Log out and log back in as root. You need to do this to get OSG out of your environment before adding users

  24. Add user gpn


PRIMA setup:

  1. Login as root

  2. run “source /opt/osg/osg-0.4.1/setup.sh”

  3. cp $VDT_LOCATION/post-install/*.conf $GRID_SECURITY_DIR

  4. emacs $GRID_SECURITY_DIR/prima-authz.conf

  5. Look at the imsContact line. Mine looked OK already.

  6. /etc/rc.d/init.d/xinetd restart

  7. $VDT_LOCATION/post-install/apache restart

  8. $VDT_LOCATION/post-install/tomcat-5 restart


Test it out:

  1. Login as you

  2. source /opt/osg/osg-0.4.1/setup.sh

  3. voms-proxy-init --voms gpn:/gpn

  4. globusrun -a -r gpnjayhawk.cpe.engr.ku.edu

  5. If that's OK, try some stuff with Condor. If it's not OK, check the globus log files.

More testing...


More stuff:

I haven't set up monitoring things.



Notes:

For CentOS 4.4, get the compat-libstdc++-33 rpm with the following command:

wget ftp://mirrors.kernel.org/centos/4.4/os/i386/CentOS/RPMS/compat-libstdc++-33-3.2.3-47.3.i386.rpm

