Condor-G: Submitting Simple Jobs
A brief tutorial on submitting simple grid jobs using Condor-G
Now we are ready to submit our first job with Condor-G. The basic procedure is to create a Condor job submit description file. This file tells Condor what executable to run, what resources to use, how to handle failures, where to store the job's output, and many other characteristics of the job submission. This file is then given to condor_submit.
First, move to your submission location:
$ cd ~
$ mkdir submit
$ cd submit
Create a Condor submit file. As you can see from the condor_submit manual page, there are many options that can be specified in a Condor-G submit description file. We will start out with just a few. We'll be sending the job to the computer "red.unl.edu". The intended job manager there is "jobmanager-pbs", but for now we will run under "jobmanager-fork". We set notification to never to avoid getting email messages about the completion of our job, and we redirect the job's stdout and stderr back to the submission computer.
(Feel free to use your favorite editor, but we will demonstrate with 'cat' in the example below. When using cat to create files, press Ctrl-D to close the file -- don't actually type "Ctrl-D" into the file. Whenever you create a file using cat, we suggest you use cat to display the file and confirm that it contains the expected text.)
Create the submit file, then verify that it was entered correctly. (If you copy and paste, be sure you don't get ^M characters instead of "returns".)
$ cat > myjob.submit
Universe = globus
GlobusScheduler = red.unl.edu:/jobmanager-fork
Executable = myscript.sh
Arguments = TestJob 10
Output = job.output
Error = job.error
Log = job.log
Notification = never
queue
Ctrl-D
$ cat myjob.submit
Universe = globus
GlobusScheduler = red.unl.edu:/jobmanager-fork
Executable = myscript.sh
Arguments = TestJob 10
Output = job.output
Error = job.error
Log = job.log
Notification = never
queue
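As an alternative to typing into cat interactively, the same file can be written with a shell here-document, which avoids the Ctrl-D step. The sketch below is only an illustration: it works in a scratch directory created with mktemp (an assumption, so that your real ~/submit directory is not touched), and the variable names workdir, cr and cr_status are made up for this example. After writing the file, it uses grep to look for the ^M (carriage return) characters mentioned above and strips them with tr if any are found.

```shell
#!/bin/sh
# Sketch: write the submit file with a here-document, then check it for ^M.
# workdir is a scratch directory used only for this illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The quoted 'EOF' keeps the shell from expanding anything inside the file.
cat > myjob.submit <<'EOF'
Universe = globus
GlobusScheduler = red.unl.edu:/jobmanager-fork
Executable = myscript.sh
Arguments = TestJob 10
Output = job.output
Error = job.error
Log = job.log
Notification = never
queue
EOF

# A carriage return is the character that editors display as ^M.
cr=$(printf '\r')
if grep -q "$cr" myjob.submit; then
    cr_status="found"
    # Write a cleaned copy without carriage returns, then replace the original.
    tr -d '\r' < myjob.submit > myjob.submit.clean &&
        mv myjob.submit.clean myjob.submit
else
    cr_status="none"
fi
echo "Carriage returns in myjob.submit: $cr_status"
```

The same tr command can be used on myscript.sh if it was pasted from a source that uses Windows-style line endings.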
Create a little program to run on the grid. (The first argument is the job name, and the second argument is the number of seconds to sleep.)
$ cat > myscript.sh
#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
echo "RESULT: 0 SUCCESS"
Ctrl-D
$ cat myscript.sh
#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
echo "RESULT: 0 SUCCESS"
Make the program executable and test it.
$ chmod a+x myscript.sh
$ ./myscript.sh TEST 1
I'm process id 15676 on osg-test2.unl.edu
This is sent to standard error
Wed Mar 8 14:54:50 CST 2006
Running as binary ./myscript.sh TEST 1
My name (argument 1) is TEST
My sleep duration (argument 2) is 1
Sleep of 1 seconds finished. Exiting
RESULT: 0 SUCCESS
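Because the submit file sends stdout and stderr to separate files, it can help to confirm locally which stream each line goes to. The following is a minimal sketch under a few assumptions: it recreates the script in a scratch directory made with mktemp, and the file names local.output and local.error are hypothetical names chosen for this example (they are not used elsewhere in this tutorial).

```shell
#!/bin/sh
# Sketch: run the script locally with stdout and stderr captured separately,
# mirroring the Output and Error settings in the submit file.
workdir=$(mktemp -d)      # scratch directory, for illustration only
cd "$workdir" || exit 1

# Same script as above; the quoted 'EOF' prevents $$, $1, etc. from being
# expanded while the file is being written.
cat > myscript.sh <<'EOF'
#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
echo "RESULT: 0 SUCCESS"
EOF
chmod a+x myscript.sh

# Redirect stdout and stderr to different files and keep the exit status.
./myscript.sh TEST 1 > local.output 2> local.error
status=$?

echo "exit status: $status"
echo "--- stdout ---"
cat local.output
echo "--- stderr ---"
cat local.error
```

If the "This is sent to standard error" line shows up under stderr only, the job.output and job.error files returned by the grid job should be split the same way.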
Submit your test job to Condor-G.
$ condor_submit myjob.submit
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 77540.
Occasionally run condor_q to watch the progress of your job. You may also want to occasionally run "condor_q -globus", which presents Globus-specific status information. Additional documentation is available in the condor_q manual page.
$ condor_q mfurukaw
-- Submitter: osg-test2.unl.edu : <172.16.149.233:58263> : osg-test2.unl.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
77531.0 mfurukaw 3/8 14:55 0+00:00:00 I 0 0.0 myscript.sh TestJo
1 jobs; 1 idle, 0 running, 0 held
$ condor_q -globus mfurukaw
-- Submitter: osg-test2.unl.edu : <172.16.149.233:58263> : osg-test2.unl.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
77540.0 mfurukaw UNSUBMITTED fork red.unl.edu /home/mfurukaw/sub
$ condor_q mfurukaw
-- Submitter: osg-test2.unl.edu : <172.16.149.233:58263> : osg-test2.unl.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
77540.0 mfurukaw 3/8 15:21 0+00:00:08 R 0 0.0 myscript.sh TestJo
1 jobs; 0 idle, 1 running, 0 held
$ condor_q -globus mfurukaw
-- Submitter: osg-test2.unl.edu : <172.16.149.233:58263> : osg-test2.unl.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
77540.0 mfurukaw ACTIVE fork red.unl.edu /home/mfurukaw/sub
$ condor_q mfurukaw
-- Submitter: osg-test2.unl.edu : <172.16.149.233:58263> : osg-test2.unl.edu
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
77540.0 mfurukaw 3/8 15:21 0+00:00:34 C 0 0.0 myscript.sh TestJo
0 jobs; 0 idle, 0 running, 0 held
$ condor_q -globus mfurukaw
-- Submitter: osg-test2.unl.edu : <172.16.149.233:58263> : osg-test2.unl.edu
ID OWNER STATUS MANAGER HOST EXECUTABLE
77540.0 mfurukaw DONE fork red.unl.edu /home/mfurukaw/sub
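The ST column in the plain condor_q output uses one-letter codes. The helper below spells them out; describe_status is a hypothetical name invented for this sketch and is not part of Condor. It covers the codes seen in this tutorial (I, R, C) plus H (held) and X (removed), and reports anything else as unknown. The example line is copied from the condor_q output above, and the field position is an assumption that holds for that layout only.

```shell
#!/bin/sh
# Sketch: translate the one-letter ST code from condor_q into a word.
# describe_status is a made-up helper name, not a Condor command.
describe_status() {
    case "$1" in
        I) echo "idle" ;;
        R) echo "running" ;;
        H) echo "held" ;;
        C) echo "completed" ;;
        X) echo "removed" ;;
        *) echo "unknown" ;;
    esac
}

# A job line copied from the condor_q output shown above.
line='77540.0 mfurukaw 3/8 15:21 0+00:00:08 R 0 0.0 myscript.sh TestJo'

# In this layout the fields are: ID OWNER date time RUN_TIME ST ...
# so ST is the sixth whitespace-separated field.
st=$(echo "$line" | awk '{print $6}')
job_id=${line%% *}

echo "Job $job_id is $(describe_status "$st")"
```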
In another window you can run "tail -f" on your job's log file to monitor its progress. For the remainder of this tutorial, we suggest you re-run this command whenever you submit one or more jobs. This will allow you to see how typical Condor-G jobs progress. Use "Ctrl-C" to stop watching the file.
In a second window:
$ cd ~/submit
$ tail -f --lines=500 job.log
000 (77540.000.000) 03/08 15:21:40 Job submitted from host: <172.16.149.233:58263>
...
017 (77540.000.000) 03/08 15:22:09 Job submitted to Globus
RM-Contact: red.unl.edu:/jobmanager-fork
JM-Contact: https://red.unl.edu:34048/13383/1141852771/
Can-Restart-JM: 1
...
027 (77540.000.000) 03/08 15:22:09 Job submitted to grid resource
GridResource: gt2 red.unl.edu:/jobmanager-fork
GridJobId: gt2 red.unl.edu:/jobmanager-fork https://red.unl.edu:34048/13383/1141852771/
...
001 (77540.000.000) 03/08 15:23:44 Job executing on host: gt2 red.unl.edu:/jobmanager-fork
...
005 (77540.000.000) 03/08 15:24:23 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
0 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
0 - Total Bytes Received By Job
...
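Each event in the log starts with a header line made of a three-digit event code, the job ID in parentheses, a date, a time, and a description. The sketch below condenses a log into one line per event with awk and then checks for code 005, which is the "Job terminated" event in the log above. It runs on a shortened copy of that log written to a scratch directory, so it does not need a real job; workdir, summary and finished are illustrative names.

```shell
#!/bin/sh
# Sketch: condense a Condor job log into one line per event.
workdir=$(mktemp -d)      # scratch directory, for illustration only

# A shortened copy of the log shown above.
cat > "$workdir/job.log" <<'EOF'
000 (77540.000.000) 03/08 15:21:40 Job submitted from host: <172.16.149.233:58263>
...
017 (77540.000.000) 03/08 15:22:09 Job submitted to Globus
RM-Contact: red.unl.edu:/jobmanager-fork
...
001 (77540.000.000) 03/08 15:23:44 Job executing on host: gt2 red.unl.edu:/jobmanager-fork
...
005 (77540.000.000) 03/08 15:24:23 Job terminated.
(1) Normal termination (return value 0)
...
EOF

# Header lines start with three digits, a space, and "(".
# Print the event code, the time, and the description (fields 5 onward).
summary=$(awk '/^[0-9][0-9][0-9] \(/ {
    desc = $5
    for (i = 6; i <= NF; i++) desc = desc " " $i
    print $1, $4, desc
}' "$workdir/job.log")
echo "$summary"

# Event code 005 is the termination event in the log above.
if echo "$summary" | grep -q '^005 '; then
    finished=yes
else
    finished=no
fi
echo "Job finished: $finished"
```

To use this on a real job, point awk at ~/submit/job.log instead of the sample file. Condor also provides a condor_wait command that waits on a job log until the job finishes; see its manual page for the available options.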
When the job is no longer listed in condor_q, or when the log file reports "Job terminated," you can see the results in condor_history. Please note that this command can take a long time to return results, because more than 77000 jobs have been run from this node.
$ condor_history mfurukaw
ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD
77540.0 mfurukaw 3/8 15:21 0+00:00:34 C 3/8 15:24
When the job completes, verify that the output is as expected. (The binary name is different from what you created because of how Globus and Condor-G cooperate to stage your file to the execute computer.)
$ ls
job.error job.log job.output myjob.submit myscript.sh
$ cat job.error
This is sent to standard error
$ cat job.output
I'm process id 18054 on node007
Wed Mar 8 15:19:38 CST 2006
Running as binary /home/localGridUser/.globus/.gass_cache/local/md5/61/d9fc39a96f6a00a0fcf3070a658cf3/md5/23/365d3d7e24dff4086e318c509f2ab4/data TestJob 10
My name (argument 1) is TestJob
My sleep duration (argument 2) is 10
Sleep of 10 seconds finished. Exiting
RESULT: 0 SUCCESS
If you didn't watch the job.log file with tail -f above, you will want to examine the information logged now:
$ cat job.log
Clean up the results:
$ rm job.*