Earlier today we learned about some grid technologies including Condor-G and DAGMan. In this session, we will install and submit some simple jobs using these components.
These directions were specifically written for a tutorial given on Monday, July 14, 2003, on the pc-##.gs.unina.it computers. The tutorial will work elsewhere, with two important caveats. First, use the "Full VDT Install" directions, not the "Using the existing VDT Install" directions. Second, jobs are sent to the pc-##.gs.unina.it computers; you will likely need to change the computer names to match a working Globus Gatekeeper that you have access to.
The VDT (Virtual Data Toolkit) distributes a variety of grid middleware, making it easy to install Condor-G and DAGMan and get going. (Official Virtual Data Toolkit web site)
First, create a scratch directory to do your work in. Be sure to create a unique name for your directory in /tmp as the directory may be shared with other users. Where the example specifies "username", use your username, your real name, or something else unique.
Let's get going! First make the subdirectories:
$ cd /tmp $ mkdir username-condor-g-dagman-tutorial $ cd username-condor-g-dagman-tutorial $ mkdir submit
Link your scratch space to the existing VDT installation:
$ ln -s /vdt . $ pushd vdt
Now "source setup.sh", the setup script created by the VDT. This will configure your environment so Globus, Condor-G, and related tools are ready for use. (If "echo $GLOBUS_LOCATION" returns nothing, rerun "source setup.sh" and try again.)
(This tutorial assumes you're using a Bourne Shell derived shell, typically /bin/sh or /bin/bash. This is the system default on the pc-##.gs.unina.it computers. If you've changed to a csh derived shell, you'll need to slightly modify the examples. Whenever you are asked to source setup.sh, there will also be a setup.csh for csh users. For simplicity you may want to use /bin/sh or /bin/bash for this tutorial.)
$ pwd /tmp/username-condor-g-dagman-tutorial/vdt $ source setup.sh $ echo $GLOBUS_LOCATION /vdt/globus $ popd /tmp/username-condor-g-dagman-tutorial
Create a short-lived proxy for this tutorial. (The default proxy lifetime is 12 hours. For a long-lived task, you might create proxies with lifetimes of 24 hours, several days, or even several months.) (The "-verify" option is not required, but is useful for debugging: -verify will warn you if an expected Certificate Authority certificate is missing.)
$ grid-proxy-info -all ERROR: Couldn't find a valid proxy. Use -debug for further information. $ grid-proxy-init -hours 4 -verify Your identity: /O=Grid/O=Globus/OU=gs.unina.it/CN=Test User Enter GRID pass phrase for this identity: Your pass phrase Creating proxy ........................................... Done Proxy Verify OK Your proxy is valid until Thu Jul 10 16:06:13 2004 $ grid-proxy-info -all subject : /O=Grid/O=Globus/OU=gs.unina.it/CN=Test User/CN=proxy issuer : /O=Grid/O=Globus/OU=gs.unina.it/CN=Test User type : full strength : 512 bits timeleft : 3:59:57
Do a quick test with globus-job-run to ensure that you can submit a job via Globus before we proceed to using Condor-G:
$ globus-job-run server1.gs.unina.it /bin/date Wed Jul 9 17:57:49 CDT 2004
Now we are ready to submit our first job with Condor-G. The basic procedure is to create a Condor job submit description file. This file tells Condor what executable to run, what resources to use, how to handle failures, where to store the job's output, and many other characteristics of the job submission. This file is then given to condor_submit.
First, move to our scratch submission location:
$ pwd /tmp/username-condor-g-dagman-tutorial $ cd submit
Create a Condor submit file. As you can see from the condor_submit manual page, there are many options that can be specified in a Condor-G submit description file. We will start out with just a few. We'll be sending the job to the computer "server1.gs.unina.it" and running under the "jobmanager-fork" job manager. We're setting notification to never to avoid getting email messages about the completion of our job, and redirecting the stdout/err of the job back to the submission computer.
(Feel free to use your favorite editor, but we will demonstrate with 'cat' in the example below. When using cat to create files, press Ctrl-D to close the file -- don't actually type "Ctrl-D" into the file. Whenever you create a file using cat, we suggest you use cat to display the file and confirm that it contains the expected text.)
Create the submit file, then verify that it was entered correctly:
$ cat > myjob.submit executable=myscript.sh arguments=TestJob 10 output=results.output error=results.error log=results.log notification=never universe=globus globusscheduler=server1.gs.unina.it:/jobmanager-fork queue Ctrl-D $ cat myjob.submit executable=myscript.sh arguments=TestJob 10 output=results.output error=results.error log=results.log notification=never universe=globus globusscheduler=server1.gs.unina.it:/jobmanager-fork queue
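The condor_submit manual page documents many more entries than the few used here. For reference, a couple of other commonly used Condor-G entries (a sketch only; the values are illustrative, and exact syntax can vary by Condor version, so check the manual before relying on them):

```
# Pass environment variables to the remote job
environment = MYVAR=somevalue
# Append extra Globus RSL attributes, e.g. a wall-time limit in minutes
globusrsl = (maxWallTime=10)
```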
Create a little program to run on the grid.
$ cat > myscript.sh #! /bin/sh echo "I'm process id $$ on" `hostname` echo "This is sent to standard error" 1>&2 date echo "Running as binary $0" "$@" echo "My name (argument 1) is $1" echo "My sleep duration (argument 2) is $2" sleep $2 echo "Sleep of $2 seconds finished. Exiting" echo "RESULT: 0 SUCCESS" Ctrl-D $ cat myscript.sh #! /bin/sh echo "I'm process id $$ on" `hostname` echo "This is sent to standard error" 1>&2 date echo "Running as binary $0" "$@" echo "My name (argument 1) is $1" echo "My sleep duration (argument 2) is $2" sleep $2 echo "Sleep of $2 seconds finished. Exiting" echo "RESULT: 0 SUCCESS"
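Note that myscript.sh trusts its arguments: if you forget one, the bare "sleep $2" fails in a confusing way on the remote machine. A more defensive script could validate its arguments first. The sketch below shows one way (the check_args function is our addition for illustration, not part of the tutorial):

```shell
#!/bin/sh
# check_args: validate the "name seconds" arguments before using them,
# as a more defensive myscript.sh might.
check_args() {
    if [ $# -ne 2 ]; then
        echo "usage: name seconds" 1>&2
        return 1
    fi
    # Reject an empty or non-numeric sleep duration.
    case "$2" in
        ''|*[!0-9]*) echo "error: '$2' is not a number" 1>&2; return 1 ;;
    esac
    return 0
}

check_args TestJob 10 && echo "arguments OK"
```

Run with valid arguments, this prints "arguments OK"; with a missing or non-numeric second argument it prints a usage or error message to standard error and returns failure.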
Make the program executable and test it.
$ chmod a+x myscript.sh $ ./myscript.sh TEST 1 I'm process id 3428 on ws01.gs.unina.it This is sent to standard error Thu Jul 10 12:21:11 CDT 2003 Running as binary ./myscript.sh TEST 1 My name (argument 1) is TEST My sleep duration (argument 2) is 1 Sleep of 1 seconds finished. Exiting RESULT: 0 SUCCESS
Submit your test job to Condor-G.
$ condor_submit myjob.submit Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 1.
Occasionally run condor_q to watch the progress of your job. You may also want to occasionally run "condor_q -globus", which presents Globus-specific status information. (Additional documentation on condor_q)
$ condor_q -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 1.0 adesmet 7/10 17:28 0+00:00:00 I 0 0.0 myscript.sh TestJo 1 jobs; 1 idle, 0 running, 0 held $ condor_q -globus -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 1.0 adesmet UNSUBMITTED fork server1.gs.unina.it /tmp/username-cond $ condor_q -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 1.0 adesmet 7/10 17:28 0+00:00:27 R 0 0.0 myscript.sh TestJo 1 jobs; 0 idle, 1 running, 0 held $ condor_q -globus -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 1.0 adesmet ACTIVE fork server1.gs.unina.it /tmp/username-cond $ condor_q -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 1.0 adesmet 7/10 17:28 0+00:00:40 C 0 0.0 myscript.sh 0 jobs; 0 idle, 0 running, 0 held $ condor_q -globus -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 1.0 adesmet DONE fork server1.gs.unina.it /tmp/username-cond $ condor_q -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 0 jobs; 0 idle, 0 running, 0 held
In another window you can run "tail -f" on the log file for your job to monitor its progress. For the remainder of this tutorial, we suggest you re-run this command whenever you submit one or more jobs. This will allow you to see how typical Condor-G jobs progress. Use "Ctrl-C" to stop watching the file.
In a second window:
$ cd /tmp/username-condor-g-dagman-tutorial/submit $ tail -f --lines=500 results.log 000 (001.000.000) 07/10 17:28:48 Job submitted from host: <192.167.1.100:35688> ... 017 (001.000.000) 07/10 17:29:01 Job submitted to Globus RM-Contact: server1.gs.unina.it:/jobmanager-fork JM-Contact: https://server1.gs.unina.it:2321/696/1057876132/ Can-Restart-JM: 1 ... 001 (001.000.000) 07/10 17:29:01 Job executing on host: server1.gs.unina.it ... 005 (001.000.000) 07/10 17:30:08 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job 0 - Total Bytes Sent By Job 0 - Total Bytes Received By Job ...
When the job is no longer listed in condor_q or when the log file reports "Job terminated," you can see the results in condor_history.
$ condor_history ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD 1.0 adesmet 7/10 10:28 0+00:00:00 C ??? /tmp/username-co
When the job completes, verify that the output is as expected. (The binary name is different from what you created because of how Globus and Condor-G cooperate to stage your file to the execute machine.)
$ ls myjob.submit myscript.sh* results.error results.log results.output $ cat results.error This is sent to standard error $ cat results.output I'm process id 733 on server1.gs.unina.it Thu Jul 10 17:28:57 CDT 2003 Running as binary /n/uscms_share/home/adesmet2/.globus/.gass_cache/local/md5/28/fcae5001dbcd99cc476984b4151284/md5/af/355c4959dc83a74b18b7c03eb27201/data TestJob 10 My name (argument 1) is TestJob My sleep duration (argument 2) is 10 Sleep of 10 seconds finished. Exiting RESULT: 0 SUCCESS
If you didn't watch the results.log file with tail -f above, you will want to examine the information logged now:
$ cat results.log
Clean up the results:
$ rm results.*
When a problem occurs in the middleware, Condor-G will place your job on "Hold". Held jobs remain in the queue, but wait for user intervention. When you resolve the problem, you can use condor_release to free the job to continue.
You can also place jobs on hold yourself with condor_hold, perhaps if you want to delay a run.
For this example, we'll make the output file non-writable. The job will be unable to copy the results back and will be placed on hold.
Submit the job again, but this time immediately after submitting it, mark the output file as read-only:
$ condor_submit myjob.submit ; chmod a-w results.output Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 2.
Watch the job with tail. When the job goes on hold, use Ctrl-C to exit tail. Note that condor_q reports that the job is in the "H" or Held state.
$ tail -f --lines=500 results.log 000 (003.000.000) 07/12 22:35:44 Job submitted from host: <192.167.1.23:32864> ... 017 (003.000.000) 07/12 22:35:57 Job submitted to Globus RM-Contact: server1.gs.unina.it:/jobmanager-fork JM-Contact: https://server1.gs.unina.it:33178/12497/1058042148/ Can-Restart-JM: 1 ... 001 (003.000.000) 07/12 22:35:57 Job executing on host: server1.gs.unina.it ... 012 (003.000.000) 07/12 22:36:52 Job was held. Globus error 129: the standard output/error size is different Code 2 Subcode 129 ... Ctrl-C $ condor_q -- Submitter: pc-23.gs.unina.it : <192.167.1.23:32864> : pc-23.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 2.0 adesmet 7/12 22:35 0+00:00:55 H 0 0.0 myscript.sh TestJo 1 jobs; 0 idle, 0 running, 1 held
Fix the problem (make the file writable again), then release the job. You can specify the job's ID, or just use "-all" to release all held jobs.
$ chmod u+w results.output $ condor_release -all All jobs released.
Again, watch the log until the job finishes:
$ tail -f --lines=500 results.log 000 (003.000.000) 07/12 22:35:44 Job submitted from host: <192.167.1.23:32864> ... 017 (003.000.000) 07/12 22:35:57 Job submitted to Globus RM-Contact: server1.gs.unina.it:/jobmanager-fork JM-Contact: https://server1.gs.unina.it:33178/12497/1058042148/ Can-Restart-JM: 1 ... 001 (003.000.000) 07/12 22:35:57 Job executing on host: server1.gs.unina.it ... 012 (003.000.000) 07/12 22:36:52 Job was held. Globus error 129: the standard output/error size is different ... 013 (003.000.000) 07/12 22:44:33 Job was released. via condor_release (by user Todd) ... 001 (003.000.000) 07/12 22:44:46 Job executing on host: server1.gs.unina.it ... 005 (003.000.000) 07/12 22:44:51 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job 0 - Total Bytes Sent By Job 0 - Total Bytes Received By Job ... Ctrl-C
Your job finished, and the results have been retrieved successfully:
$ cat results.output I'm process id 12528 on server1.gs.unina.it Sat Jul 12 22:35:53 CEST 2003 Running as binary /home/home45/Aland/.globus/.gass_cache/local/md5/6d/217f3f7926c06a529143f6129bf269/md5/a7/2af94ba728c69c588e523a99baaefd/data TestJob 10 My name (argument 1) is TestJob My sleep duration (argument 2) is 10 Sleep of 10 seconds finished. Exiting RESULT: 0 SUCCESS
Before continuing, clean up the results:
$ rm results.*
Since it will be handy for the rest of the tutorial, create a little shell script to monitor the Condor-G queue:
$ cat > watch_condor_q #! /bin/sh while true; do condor_q condor_q -globus sleep 10 done Ctrl-D $ cat watch_condor_q #! /bin/sh while true; do condor_q condor_q -globus sleep 10 done $ chmod a+x watch_condor_q
Create a minimal DAG for DAGMan. This DAG will have a single node.
$ cat > mydag.dag Job HelloWorld myjob.submit Ctrl-D $ cat mydag.dag Job HelloWorld myjob.submit
Submit it with condor_submit_dag, then watch the run. Notice that condor_dagman is running as a job and that condor_dagman submits your real job without your direct intervention. You might happen to catch the "C" (completed) state as your job finishes, but that often goes by too quickly to notice.
Again, you may want to run "tail -f --lines=500 results.log" in a second window to watch the job log file as your job runs. You might also want to watch DAGMan's log file (mydag.dag.dagman.out) in the same way in a third window, with "tail -f --lines=500 mydag.dag.dagman.out". For the remainder of this tutorial, we suggest you re-run these commands when you submit a DAG. This will allow you to see how typical DAGs progress. Use "Ctrl-C" to stop watching a file.
First window:
$ condor_submit_dag mydag.dag Checking your DAG input file and all submit files it references. This might take a while... Done. ----------------------------------------------------------------------- File for submitting this DAG to Condor : mydag.dag.condor.sub Log of DAGMan debugging messages : mydag.dag.dagman.out Log of Condor library debug messages : mydag.dag.lib.out Log of the life of condor_dagman itself : mydag.dag.dagman.log Condor Log file for all jobs of this DAG : results.log Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 2. ----------------------------------------------------------------------- $ ./watch_condor_q -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 2.0 adesmet 7/10 17:33 0+00:00:03 R 0 2.6 condor_dagman -f - 3.0 adesmet 7/10 17:33 0+00:00:00 I 0 0.0 myscript.sh TestJo 2 jobs; 1 idle, 1 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 3.0 adesmet UNSUBMITTED fork server1.gs.unina.it /tmp/username-cond -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 2.0 adesmet 7/10 17:33 0+00:00:33 R 0 2.6 condor_dagman -f - 3.0 adesmet 7/10 17:33 0+00:00:15 R 0 0.0 myscript.sh TestJo 2 jobs; 0 idle, 2 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 3.0 adesmet ACTIVE fork server1.gs.unina.it /tmp/username-cond -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 2.0 adesmet 7/10 17:33 0+00:01:03 R 0 2.6 condor_dagman -f - 3.0 adesmet 7/10 17:33 0+00:00:45 R 0 0.0 myscript.sh TestJo 2 jobs; 0 idle, 2 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 3.0 adesmet ACTIVE fork server1.gs.unina.it 
/tmp/username-cond -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 0 jobs; 0 idle, 0 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE Ctrl-C
Third window:
$ cd /tmp/username-condor-g-dagman-tutorial/submit $ touch mydag.dag.dagman.out $ tail -f --lines=500 mydag.dag.dagman.out 7/14 16:29:10 ****************************************************** 7/14 16:29:10 ** condor_scheduniv_exec.3.0 (CONDOR_DAGMAN) STARTING UP 7/14 16:29:10 ** $CondorVersion: 6.6.5 May 3 2004 $ 7/14 16:29:10 ** $CondorPlatform: I386-LINUX-RH9 $ 7/14 16:29:10 ** PID = 5463 7/14 16:29:10 ****************************************************** 7/14 16:29:10 Using config file: /scratch/adesmet/vdttest/condor/etc/condor_config 7/14 16:29:10 Using local config files: /scratch/adesmet/vdttest/condor/local.puffin/condor_config.local 7/14 16:29:10 DaemonCore: Command Socket at <192.167.1.100:56294> 7/14 16:29:10 argv[0] == "condor_scheduniv_exec.3.0" 7/14 16:29:10 argv[1] == "-Debug" 7/14 16:29:10 argv[2] == "3" 7/14 16:29:10 argv[3] == "-Lockfile" 7/14 16:29:10 argv[4] == "mydag.dag.lock" 7/14 16:29:10 argv[5] == "-Dag" 7/14 16:29:10 argv[6] == "mydag.dag" 7/14 16:29:10 argv[7] == "-Rescue" 7/14 16:29:10 argv[8] == "mydag.dag.rescue" 7/14 16:29:10 argv[9] == "-Condorlog" 7/14 16:29:10 argv[10] == "results.log" 7/14 16:29:10 DAG Lockfile will be written to mydag.dag.lock 7/14 16:29:10 DAG Input file is mydag.dag 7/14 16:29:10 Rescue DAG will be written to mydag.dag.rescue 7/14 16:29:10 Condor log will be written to results.log, etc. 7/14 16:29:10 Parsing mydag.dag ... 7/14 16:29:10 Dag contains 1 total jobs 7/14 16:29:10 Deleting any older versions of log files... 7/14 16:29:10 Bootstrapping... 7/14 16:29:10 Number of pre-completed jobs: 0 7/14 16:29:10 Registering condor_event_timer... 7/14 16:29:11 Submitting Condor Job HelloWorld ... 7/14 16:29:11 submitting: condor_submit -a 'dag_node_name = HelloWorld' -a '+DAGManJobID = 3.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' myjob.submit 2>&1 7/14 16:29:11 assigned Condor ID (4.0.0) 7/14 16:29:11 Just submitted 1 job this cycle...
7/14 16:29:11 Event: ULOG_SUBMIT for Condor Job HelloWorld (4.0.0) 7/14 16:29:11 Of 1 nodes total: 7/14 16:29:11 Done Pre Queued Post Ready Un-Ready Failed 7/14 16:29:11 === === === === === === === 7/14 16:29:11 0 0 1 0 0 0 0 7/14 16:29:51 Event: ULOG_GLOBUS_SUBMIT for Condor Job HelloWorld (4.0.0) 7/14 16:29:56 Event: ULOG_EXECUTE for Condor Job HelloWorld (4.0.0) 7/14 16:30:26 Event: ULOG_JOB_TERMINATED for Condor Job HelloWorld (4.0.0) 7/14 16:30:26 Job HelloWorld completed successfully. 7/14 16:30:26 Of 1 nodes total: 7/14 16:30:26 Done Pre Queued Post Ready Un-Ready Failed 7/14 16:30:26 === === === === === === === 7/14 16:30:26 1 0 0 0 0 0 0 7/14 16:30:26 All jobs Completed! 7/14 16:30:26 **** condor_scheduniv_exec.3.0 (condor_DAGMAN) EXITING WITH STATUS 0
Verify your results:
$ ls -l total 12 -rw-r--r-- 1 adesmet adesmet 28 Jul 10 10:35 mydag.dag -rw-r--r-- 1 adesmet adesmet 523 Jul 10 10:36 mydag.dag.condor.sub -rw-r--r-- 1 adesmet adesmet 608 Jul 10 10:38 mydag.dag.dagman.log -rw-r--r-- 1 adesmet adesmet 1860 Jul 10 10:38 mydag.dag.dagman.out -rw-r--r-- 1 adesmet adesmet 29 Jul 10 10:38 mydag.dag.lib.out -rw------- 1 adesmet adesmet 0 Jul 10 10:36 mydag.dag.lock -rw-r--r-- 1 adesmet adesmet 175 Jul 9 18:13 myjob.submit -rwxr-xr-x 1 adesmet adesmet 194 Jul 10 10:36 myscript.sh -rw-r--r-- 1 adesmet adesmet 31 Jul 10 10:37 results.error -rw------- 1 adesmet adesmet 833 Jul 10 10:38 results.log -rw-r--r-- 1 adesmet adesmet 261 Jul 10 10:37 results.output -rwxr-xr-x 1 adesmet adesmet 81 Jul 10 10:35 watch_condor_q $ cat results.error This is sent to standard error $ cat results.output I'm process id 29149 on server1.gs.unina.it Thu Jul 10 10:38:44 CDT 2003 Running as binary /n/uscms_share/home/adesmet2/.globus/.gass_cache/local/md5/aa/ceb9e04077256aaa2acf4dff670897/md5/27/2f50da149fc049d07b1c27f30b67df/data TEST 1 My name (argument 1) is TEST My sleep duration (argument 2) is 1 Sleep of 10 seconds finished. Exiting RESULT: 0 SUCCESS
Looking at DAGMan's various files, we see that DAGMan itself ran as a Condor job (specifically, a scheduler universe job):
$ ls mydag.dag mydag.dag.dagman.log mydag.dag.lib.out myjob.submit results.error results.output mydag.dag.condor.sub mydag.dag.dagman.out mydag.dag.lock myscript.sh results.log watch_condor_q $ cat mydag.dag.condor.sub # Filename: mydag.dag.condor.sub # Generated by condor_submit_dag mydag.dag universe = scheduler executable = /afs/cs.wisc.edu/u/a/d/adesmet/miron-condor-g-dagman-talk/vdt/condor/bin/condor_dagman getenv = True output = mydag.dag.lib.out error = mydag.dag.lib.out log = mydag.dag.dagman.log remove_kill_sig = SIGUSR1 arguments = -f -l . -Debug 3 -Lockfile mydag.dag.lock -Condorlog results.log -Dag mydag.dag -Rescue mydag.dag.rescue environment = _CONDOR_DAGMAN_LOG=mydag.dag.dagman.out;_CONDOR_MAX_DAGMAN_LOG=0 queue $ cat mydag.dag.dagman.log 000 (006.000.000) 07/10 10:36:43 Job submitted from host: <192.167.1.100:33785> ... 001 (006.000.000) 07/10 10:36:44 Job executing on host: <192.167.1.100:33785> ... 005 (006.000.000) 07/10 10:38:10 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job 0 - Total Bytes Sent By Job 0 - Total Bytes Received By Job ...
If you weren't watching the DAGMan output file with tail -f, you can examine the file with the following command:
$ cat mydag.dag.dagman.out
Clean up your results. Be careful when deleting the mydag.dag.* files: you want to delete only the mydag.dag.* files, not the mydag.dag file itself.
$ rm mydag.dag.* results.*
Typically each node in a DAG will have its own Condor submit file. Create some more submit files by copying our existing file. For simplicity during this tutorial, we'll keep the submit files very similar, notably using the same executable, but your submit files and executables can differ in real-world use.
$ cp myjob.submit job.setup.submit $ cp myjob.submit job.work1.submit $ cp myjob.submit job.work2.submit $ cp myjob.submit job.workfinal.submit $ cp myjob.submit job.finalize.submit
Edit the various submit files. Change the output and error entries to point to results.NODE.output and results.NODE.error files, where NODE is the middle word in the submit file's name (job.NODE.submit). So job.finalize.submit would include:
output=results.finalize.output error=results.finalize.error
Here is one possible set of settings for the output entries:
$ grep '^output=' job.*.submit job.finalize.submit:output=results.finalize.output job.setup.submit:output=results.setup.output job.work1.submit:output=results.work1.output job.work2.submit:output=results.work2.output job.workfinal.submit:output=results.workfinal.output $ grep '^error=' job.*.submit job.finalize.submit:error=results.finalize.error job.setup.submit:error=results.setup.error job.work1.submit:error=results.work1.error job.work2.submit:error=results.work2.error job.workfinal.submit:error=results.workfinal.error
This is important so that the various nodes don't overwrite each other's output.
Leave the log entries alone. DAGMan and Condor can handle multiple jobs interleaving log messages into the same file. Condor will ensure that the different jobs will not overwrite each other's entries in the log. (Each job can use its own log file, but you may want to use one common log file anyway because it's convenient to have all of your job status information in a single place.)
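Because all of the jobs interleave their events into one shared log, you sometimes want to see only a single job's events. Each event line carries the job's (cluster.process.subprocess) ID, so a grep on the cluster number works. A sketch against a fabricated sample log (the event lines below are illustrative, not from a real run):

```shell
#!/bin/sh
# Build a tiny sample of a shared log with events from two clusters,
# then extract only cluster 001's events.
cat > sample.log <<'EOF'
000 (001.000.000) 07/10 17:28:48 Job submitted from host: <192.167.1.100:35688>
000 (002.000.000) 07/10 17:28:50 Job submitted from host: <192.167.1.100:35688>
001 (001.000.000) 07/10 17:29:01 Job executing on host: server1.gs.unina.it
EOF
grep '(001\.' sample.log
```

This prints the first and third lines: the submit and execute events for cluster 001.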
log=results.log
Also change the arguments entries so that the first argument is something unique to each node (perhaps the NODE name).
For node work2, change the second argument to 120 so that it looks something like:
arguments=MyWorkerNode2 120
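All of the copying and editing above can also be scripted in one loop. Here is a sketch using sed; it recreates the myjob.submit template inline so it is self-contained, and it uses the node name itself as the first argument (the tutorial only asks for "something unique", so adjust to taste):

```shell
#!/bin/sh
# Recreate the template from earlier in the tutorial.
cat > myjob.submit <<'EOF'
executable=myscript.sh
arguments=TestJob 10
output=results.output
error=results.error
log=results.log
notification=never
universe=globus
globusscheduler=server1.gs.unina.it:/jobmanager-fork
queue
EOF

# Derive one submit file per node, rewriting output, error, and arguments.
# work2 sleeps 120 seconds; every other node sleeps 10.
for node in setup work1 work2 workfinal finalize; do
    secs=10
    [ "$node" = "work2" ] && secs=120
    sed -e "s/^output=.*/output=results.$node.output/" \
        -e "s/^error=.*/error=results.$node.error/" \
        -e "s/^arguments=.*/arguments=$node $secs/" \
        myjob.submit > "job.$node.submit"
done

# Verify the per-node entries, as in the grep checks above.
grep '^output=' job.*.submit
```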
Here is the more complex DAG we'll be creating. Flow moves from top to bottom in this graph.
    HelloWorld          Setup
                       /     \
           WorkerNode_1       WorkerNode_Two
                       \     /
                   CollectResults
                         |
                      LastNode
Add the new nodes to your DAG:
$ cat mydag.dag Job HelloWorld myjob.submit $ cat >> mydag.dag Job Setup job.setup.submit Job WorkerNode_1 job.work1.submit Job WorkerNode_Two job.work2.submit Job CollectResults job.workfinal.submit Job LastNode job.finalize.submit PARENT Setup CHILD WorkerNode_1 WorkerNode_Two PARENT WorkerNode_1 WorkerNode_Two CHILD CollectResults PARENT CollectResults CHILD LastNode Ctrl-D $ cat mydag.dag Job HelloWorld myjob.submit Job Setup job.setup.submit Job WorkerNode_1 job.work1.submit Job WorkerNode_Two job.work2.submit Job CollectResults job.workfinal.submit Job LastNode job.finalize.submit PARENT Setup CHILD WorkerNode_1 WorkerNode_Two PARENT WorkerNode_1 WorkerNode_Two CHILD CollectResults PARENT CollectResults CHILD LastNode
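DAG files can express more than Job and PARENT/CHILD lines. For example, DAGMan supports per-node retries and pre/post scripts that run on the submit machine. A hypothetical fragment you could append (check_setup.sh is an assumed helper script, not created in this tutorial):

```
# Retry the Setup node up to 3 times if it fails
RETRY Setup 3
# Run a script on the submit machine after the Setup node completes
SCRIPT POST Setup check_setup.sh
```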
condor_q -dag will organize jobs into their associated DAGs. Change watch_condor_q to use this:
$ rm watch_condor_q $ cat > watch_condor_q #! /bin/sh while true; do echo .... echo .... Output from condor_q echo .... condor_q echo .... echo .... Output from condor_q -globus echo .... condor_q -globus echo .... echo .... Output from condor_q -dag echo .... condor_q -dag sleep 10 done Ctrl-D $ cat watch_condor_q #! /bin/sh while true; do echo .... echo .... Output from condor_q echo .... condor_q echo .... echo .... Output from condor_q -globus echo .... condor_q -globus echo .... echo .... Output from condor_q -dag echo .... condor_q -dag sleep 10 done $ chmod a+x watch_condor_q
Submit your new DAG and monitor it.
Again, in separate windows you may want to run "tail -f --lines=500 results.log" and "tail -f --lines=500 mydag.dag.dagman.out" to monitor the job's progress.
$ condor_submit_dag mydag.dag Checking your DAG input file and all submit files it references. This might take a while... Done. ----------------------------------------------------------------------- File for submitting this DAG to Condor : mydag.dag.condor.sub Log of DAGMan debugging messages : mydag.dag.dagman.out Log of Condor library debug messages : mydag.dag.lib.out Log of the life of condor_dagman itself : mydag.dag.dagman.log Condor Log file for all jobs of this DAG : results.log Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 8. ----------------------------------------------------------------------- $ ./watch_condor_q -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:00:08 R 0 2.6 condor_dagman -f - 5.0 adesmet 7/10 17:45 0+00:00:00 I 0 0.0 myscript.sh TestJo 6.0 adesmet 7/10 17:45 0+00:00:00 I 0 0.0 myscript.sh Setup 3 jobs; 2 idle, 1 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 5.0 adesmet UNSUBMITTED fork server1.gs.unina.it /tmp/username-cond 6.0 adesmet UNSUBMITTED fork server1.gs.unina.it /tmp/username-cond -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:00:08 R 0 2.6 condor_dagman -f - 5.0 |-HelloWorld 7/10 17:45 0+00:00:00 I 0 0.0 myscript.sh TestJo 6.0 |-Setup 7/10 17:45 0+00:00:00 I 0 0.0 myscript.sh Setup 3 jobs; 2 idle, 1 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:00:12 R 0 2.6 condor_dagman -f - 5.0 adesmet 7/10 17:45 0+00:00:00 I 0 0.0 myscript.sh TestJo 6.0 adesmet 7/10 17:45 0+00:00:00 I 0 0.0 myscript.sh Setup 3 jobs; 2 idle, 1 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : 
ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 5.0 adesmet UNSUBMITTED fork server1.gs.unina.it /tmp/username-cond 6.0 adesmet UNSUBMITTED fork server1.gs.unina.it /tmp/username-cond -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:00:12 R 0 2.6 condor_dagman -f - 5.0 |-HelloWorld 7/10 17:45 0+00:00:00 I 0 0.0 myscript.sh TestJo 6.0 |-Setup 7/10 17:45 0+00:00:00 I 0 0.0 myscript.sh Setup 3 jobs; 2 idle, 1 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:00:42 R 0 2.6 condor_dagman -f - 5.0 adesmet 7/10 17:45 0+00:00:24 R 0 0.0 myscript.sh TestJo 6.0 adesmet 7/10 17:45 0+00:00:24 R 0 0.0 myscript.sh Setup 3 jobs; 0 idle, 3 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 5.0 adesmet ACTIVE fork server1.gs.unina.it /tmp/username-cond 6.0 adesmet ACTIVE fork server1.gs.unina.it /tmp/username-cond -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:00:42 R 0 2.6 condor_dagman -f - 5.0 |-HelloWorld 7/10 17:45 0+00:00:24 R 0 0.0 myscript.sh TestJo 6.0 |-Setup 7/10 17:45 0+00:00:24 R 0 0.0 myscript.sh Setup 3 jobs; 0 idle, 3 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:01:12 R 0 2.6 condor_dagman -f - 5.0 adesmet 7/10 17:45 0+00:00:54 R 0 0.0 myscript.sh TestJo 6.0 adesmet 7/10 17:45 0+00:00:54 R 0 0.0 myscript.sh Setup 3 jobs; 0 idle, 3 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 5.0 adesmet ACTIVE fork server1.gs.unina.it /tmp/username-cond 6.0 adesmet ACTIVE fork 
server1.gs.unina.it /tmp/username-cond -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:01:12 R 0 2.6 condor_dagman -f - 5.0 |-HelloWorld 7/10 17:45 0+00:00:54 R 0 0.0 myscript.sh TestJo 6.0 |-Setup 7/10 17:45 0+00:00:54 R 0 0.0 myscript.sh Setup 3 jobs; 0 idle, 3 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:01:42 R 0 2.6 condor_dagman -f - 7.0 adesmet 7/10 17:46 0+00:00:00 I 0 0.0 myscript.sh work1 8.0 adesmet 7/10 17:46 0+00:00:00 I 0 0.0 myscript.sh Worker 3 jobs; 2 idle, 1 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 7.0 adesmet UNSUBMITTED fork server1.gs.unina.it /tmp/username-cond 8.0 adesmet UNSUBMITTED fork server1.gs.unina.it /tmp/username-cond -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:01:42 R 0 2.6 condor_dagman -f - 7.0 |-WorkerNode_ 7/10 17:46 0+00:00:00 I 0 0.0 myscript.sh work1 8.0 |-WorkerNode_ 7/10 17:46 0+00:00:00 I 0 0.0 myscript.sh Worker 3 jobs; 2 idle, 1 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:02:12 R 0 2.6 condor_dagman -f - 7.0 adesmet 7/10 17:46 0+00:00:27 R 0 0.0 myscript.sh work1 8.0 adesmet 7/10 17:46 0+00:00:27 R 0 0.0 myscript.sh Worker 3 jobs; 0 idle, 3 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 7.0 adesmet ACTIVE fork server1.gs.unina.it /tmp/username-cond 8.0 adesmet ACTIVE fork server1.gs.unina.it /tmp/username-cond -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER/NODENAME 
SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:02:12 R 0 2.6 condor_dagman -f - 7.0 |-WorkerNode_ 7/10 17:46 0+00:00:27 R 0 0.0 myscript.sh work1 8.0 |-WorkerNode_ 7/10 17:46 0+00:00:27 R 0 0.0 myscript.sh Worker 3 jobs; 0 idle, 3 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:02:42 R 0 2.6 condor_dagman -f - 7.0 adesmet 7/10 17:46 0+00:00:57 R 0 0.0 myscript.sh work1 8.0 adesmet 7/10 17:46 0+00:00:57 R 0 0.0 myscript.sh Worker 3 jobs; 0 idle, 3 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 7.0 adesmet ACTIVE fork server1.gs.unina.it /tmp/username-cond 8.0 adesmet ACTIVE fork server1.gs.unina.it /tmp/username-cond -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:02:43 R 0 2.6 condor_dagman -f - 7.0 |-WorkerNode_ 7/10 17:46 0+00:00:58 R 0 0.0 myscript.sh work1 8.0 |-WorkerNode_ 7/10 17:46 0+00:00:58 R 0 0.0 myscript.sh Worker 3 jobs; 0 idle, 3 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:03:13 R 0 2.6 condor_dagman -f - 8.0 adesmet 7/10 17:46 0+00:01:28 R 0 0.0 myscript.sh Worker 2 jobs; 0 idle, 2 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 8.0 adesmet ACTIVE fork server1.gs.unina.it /tmp/username-cond -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:03:13 R 0 2.6 condor_dagman -f - 8.0 |-WorkerNode_ 7/10 17:46 0+00:01:28 R 0 0.0 myscript.sh Worker 2 jobs; 0 idle, 2 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : 
ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:03:43 R 0 2.6 condor_dagman -f - 8.0 adesmet 7/10 17:46 0+00:01:58 R 0 0.0 myscript.sh Worker 2 jobs; 0 idle, 2 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 8.0 adesmet ACTIVE fork server1.gs.unina.it /tmp/username-cond -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:03:43 R 0 2.6 condor_dagman -f - 8.0 |-WorkerNode_ 7/10 17:46 0+00:01:58 R 0 0.0 myscript.sh Worker 2 jobs; 0 idle, 2 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:04:13 R 0 2.6 condor_dagman -f - 9.0 adesmet 7/10 17:49 0+00:00:02 R 0 0.0 myscript.sh workfi 2 jobs; 0 idle, 2 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 9.0 adesmet ACTIVE fork server1.gs.unina.it /tmp/username-cond -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:04:13 R 0 2.6 condor_dagman -f - 9.0 |-CollectResu 7/10 17:49 0+00:00:02 R 0 0.0 myscript.sh workfi 2 jobs; 0 idle, 2 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:04:43 R 0 2.6 condor_dagman -f - 9.0 adesmet 7/10 17:49 0+00:00:32 R 0 0.0 myscript.sh workfi 2 jobs; 0 idle, 2 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 9.0 adesmet ACTIVE fork server1.gs.unina.it /tmp/username-cond -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE 
CMD 4.0 adesmet 7/10 17:45 0+00:04:43 R 0 2.6 condor_dagman -f - 9.0 |-CollectResu 7/10 17:49 0+00:00:32 R 0 0.0 myscript.sh workfi 2 jobs; 0 idle, 2 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:05:13 R 0 2.6 condor_dagman -f - 9.0 adesmet 7/10 17:49 0+00:01:02 R 0 0.0 myscript.sh workfi 2 jobs; 0 idle, 2 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 9.0 adesmet DONE fork server1.gs.unina.it /tmp/username-cond -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:05:13 R 0 2.6 condor_dagman -f - 9.0 |-CollectResu 7/10 17:49 0+00:01:02 C 0 0.0 myscript.sh workfi 1 jobs; 0 idle, 1 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:05:43 R 0 2.6 condor_dagman -f - 10.0 adesmet 7/10 17:50 0+00:00:13 R 0 0.0 myscript.sh Final 2 jobs; 0 idle, 2 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 10.0 adesmet ACTIVE fork server1.gs.unina.it /tmp/username-cond -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:05:44 R 0 2.6 condor_dagman -f - 10.0 |-LastNode 7/10 17:50 0+00:00:13 R 0 0.0 myscript.sh Final 2 jobs; 0 idle, 2 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:06:14 R 0 2.6 condor_dagman -f - 10.0 adesmet 7/10 17:50 0+00:00:43 R 0 0.0 myscript.sh Final 2 jobs; 0 idle, 2 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER 
STATUS MANAGER HOST EXECUTABLE 10.0 adesmet ACTIVE fork server1.gs.unina.it /tmp/username-cond -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 4.0 adesmet 7/10 17:45 0+00:06:14 R 0 2.6 condor_dagman -f - 10.0 |-LastNode 7/10 17:50 0+00:00:43 R 0 0.0 myscript.sh Final 2 jobs; 0 idle, 2 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 0 jobs; 0 idle, 0 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE -- Submitter: ws01.gs.unina.it : <192.167.1.100:35688> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 0 jobs; 0 idle, 0 running, 0 held Ctrl-C
Watching the logs or the condor_q output, you'll note that the CollectResults node ("workfinal") wasn't run until both of the WorkerNode nodes ("work1" and "work2") finished.
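This ordering is enforced by the PARENT/CHILD declarations in the mydag.dag file created earlier in this tutorial; the line responsible is:

```
PARENT WorkerNode_1 WorkerNode_Two CHILD CollectResults
```

DAGMan will not consider CollectResults ready to run until every node listed as its parent has completed.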
Examine your results.
$ ls
job.finalize.submit   mydag.dag.condor.sub  myscript.sh              results.setup.error   results.workfinal.error
job.setup.submit      mydag.dag.dagman.log  results.error            results.setup.output  results.workfinal.output
job.work1.submit      mydag.dag.dagman.out  results.finalize.error   results.work1.error   watch_condor_q
job.work2.submit      mydag.dag.lib.out     results.finalize.output  results.work1.output
job.workfinal.submit  mydag.dag.lock        results.log              results.work2.error
mydag.dag             myjob.submit          results.output           results.work2.output
$ tail --lines=500 results.*.error
==> results.finalize.error <==
This is sent to standard error

==> results.setup.error <==
This is sent to standard error

==> results.work1.error <==
This is sent to standard error

==> results.work2.error <==
This is sent to standard error

==> results.workfinal.error <==
This is sent to standard error
$ tail --lines=500 results.*.output
==> results.finalize.output <==
I'm process id 29614 on server1.gs.unina.it
Thu Jul 10 10:53:58 CDT 2003
Running as binary /n/uscms_share/home/adesmet2/.globus/.gass_cache/local/md5/0d/7c60aa10b34817d3ffe467dd116816/md5/de/03c3eb8a20852948a2af53438bbce1/data Finalize 1
My name (argument 1) is Finalize
My sleep duration (argument 2) is 1
Sleep of 10 seconds finished. Exiting
RESULT: 0 SUCCESS

==> results.setup.output <==
I'm process id 29337 on server1.gs.unina.it
Thu Jul 10 10:50:31 CDT 2003
Running as binary /n/uscms_share/home/adesmet2/.globus/.gass_cache/local/md5/a5/fab7b658db65dbfec3ecf0a5414e1c/md5/f4/e9a04ae03bff43f00a10c78ebd60fd/data Setup 1
My name (argument 1) is Setup
My sleep duration (argument 2) is 1
Sleep of 10 seconds finished. Exiting
RESULT: 0 SUCCESS

==> results.work1.output <==
I'm process id 29444 on server1.gs.unina.it
Thu Jul 10 10:51:04 CDT 2003
Running as binary /n/uscms_share/home/adesmet2/.globus/.gass_cache/local/md5/2e/17db42df4e113f813cea7add42e03e/md5/f6/f1bd82a2fec9a3a372a44c009a63ca/data WorkerNode1 1
My name (argument 1) is WorkerNode1
My sleep duration (argument 2) is 1
Sleep of 10 seconds finished. Exiting
RESULT: 0 SUCCESS

==> results.work2.output <==
I'm process id 29432 on server1.gs.unina.it
Thu Jul 10 10:51:03 CDT 2003
Running as binary /n/uscms_share/home/adesmet2/.globus/.gass_cache/local/md5/ea/9a3c8d16346b2fea808cda4b5969fa/md5/f6/f1bd82a2fec9a3a372a44c009a63ca/data WorkerNode2 120
My name (argument 1) is WorkerNode2
My sleep duration (argument 2) is 120
Sleep of 120 seconds finished. Exiting
RESULT: 0 SUCCESS

==> results.workfinal.output <==
I'm process id 29554 on server1.gs.unina.it
Thu Jul 10 10:53:27 CDT 2003
Running as binary /n/uscms_share/home/adesmet2/.globus/.gass_cache/local/md5/c9/7ba5d43acad3d9ebdfa633839e75c3/md5/11/cd84efa75305d54100f0f451b46b35/data WorkFinal 1
My name (argument 1) is WorkFinal
My sleep duration (argument 2) is 1
Sleep of 10 seconds finished. Exiting
RESULT: 0 SUCCESS
Examine your log
$ cat results.log 000 (005.000.000) 07/10 17:45:24 Job submitted from host: <192.167.1.100:35688> DAG Node: HelloWorld ... 000 (006.000.000) 07/10 17:45:24 Job submitted from host: <192.167.1.100:35688> DAG Node: Setup ... 017 (006.000.000) 07/10 17:45:42 Job submitted to Globus RM-Contact: server1.gs.unina.it:/jobmanager-fork JM-Contact: https://server1.gs.unina.it:2349/914/1057877133/ Can-Restart-JM: 1 ... 001 (006.000.000) 07/10 17:45:42 Job executing on host: server1.gs.unina.it ... 017 (005.000.000) 07/10 17:45:42 Job submitted to Globus RM-Contact: server1.gs.unina.it:/jobmanager-fork JM-Contact: https://server1.gs.unina.it:2348/915/1057877133/ Can-Restart-JM: 1 ... 001 (005.000.000) 07/10 17:45:42 Job executing on host: server1.gs.unina.it ... 005 (005.000.000) 07/10 17:46:50 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job 0 - Total Bytes Sent By Job 0 - Total Bytes Received By Job ... 005 (006.000.000) 07/10 17:46:50 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job 0 - Total Bytes Sent By Job 0 - Total Bytes Received By Job ... 000 (007.000.000) 07/10 17:46:55 Job submitted from host: <192.167.1.100:35688> DAG Node: WorkerNode_1 ... 000 (008.000.000) 07/10 17:46:56 Job submitted from host: <192.167.1.100:35688> DAG Node: WorkerNode_Two ... 017 (008.000.000) 07/10 17:47:09 Job submitted to Globus RM-Contact: server1.gs.unina.it:/jobmanager-fork JM-Contact: https://server1.gs.unina.it:2364/1037/1057877219/ Can-Restart-JM: 1 ... 
001 (008.000.000) 07/10 17:47:09 Job executing on host: server1.gs.unina.it ... 017 (007.000.000) 07/10 17:47:09 Job submitted to Globus RM-Contact: server1.gs.unina.it:/jobmanager-fork JM-Contact: https://server1.gs.unina.it:2367/1040/1057877220/ Can-Restart-JM: 1 ... 001 (007.000.000) 07/10 17:47:09 Job executing on host: server1.gs.unina.it ... 005 (007.000.000) 07/10 17:48:17 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job 0 - Total Bytes Sent By Job 0 - Total Bytes Received By Job ... 005 (008.000.000) 07/10 17:49:18 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job 0 - Total Bytes Sent By Job 0 - Total Bytes Received By Job ... 000 (009.000.000) 07/10 17:49:22 Job submitted from host: <192.167.1.100:35688> DAG Node: CollectResults ... 017 (009.000.000) 07/10 17:49:35 Job submitted to Globus RM-Contact: server1.gs.unina.it:/jobmanager-fork JM-Contact: https://server1.gs.unina.it:2383/1185/1057877366/ Can-Restart-JM: 1 ... 001 (009.000.000) 07/10 17:49:35 Job executing on host: server1.gs.unina.it ... 005 (009.000.000) 07/10 17:50:42 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job 0 - Total Bytes Sent By Job 0 - Total Bytes Received By Job ... 
000 (010.000.000) 07/10 17:50:42 Job submitted from host: <192.167.1.100:35688> DAG Node: LastNode ... 017 (010.000.000) 07/10 17:50:55 Job submitted to Globus RM-Contact: server1.gs.unina.it:/jobmanager-fork JM-Contact: https://server1.gs.unina.it:2392/1247/1057877446/ Can-Restart-JM: 1 ... 001 (010.000.000) 07/10 17:50:55 Job executing on host: server1.gs.unina.it ... 005 (010.000.000) 07/10 17:52:02 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 0 - Run Bytes Sent By Job 0 - Run Bytes Received By Job 0 - Total Bytes Sent By Job 0 - Total Bytes Received By Job ...
Examine the DAGMan log
$ cat mydag.dag.dagman.out 7/14 17:09:20 ****************************************************** 7/14 17:09:20 ** condor_scheduniv_exec.17.0 (CONDOR_DAGMAN) STARTING UP 7/14 17:09:20 ** $CondorVersion: 6.6.5 May 3 2004 $ 7/14 17:09:20 ** $CondorPlatform: I386-LINUX-RH9 $ 7/14 17:09:20 ** PID = 7080 7/14 17:09:20 ****************************************************** 7/14 17:09:20 Using config file: /scratch/adesmet/vdttest/condor/etc/condor_config 7/14 17:09:20 Using local config files: /scratch/adesmet/vdttest/condor/local.puffin/condor_config.local 7/14 17:09:20 DaemonCore: Command Socket at <192.167.1.100:60785> 7/14 17:09:20 argv[0] == "condor_scheduniv_exec.17.0" 7/14 17:09:20 argv[1] == "-Debug" 7/14 17:09:20 argv[2] == "3" 7/14 17:09:20 argv[3] == "-Lockfile" 7/14 17:09:20 argv[4] == "mydag.dag.lock" 7/14 17:09:20 argv[5] == "-Dag" 7/14 17:09:20 argv[6] == "mydag.dag" 7/14 17:09:20 argv[7] == "-Rescue" 7/14 17:09:20 argv[8] == "mydag.dag.rescue" 7/14 17:09:20 argv[9] == "-Condorlog" 7/14 17:09:20 argv[10] == "results.log" 7/14 17:09:20 DAG Lockfile will be written to mydag.dag.lock 7/14 17:09:20 DAG Input file is mydag.dag 7/14 17:09:20 Rescue DAG will be written to mydag.dag.rescue 7/14 17:09:20 Condor log will be written to results.log, etc. 7/14 17:09:20 Parsing mydag.dag ... 7/14 17:09:20 Dag contains 6 total jobs 7/14 17:09:20 Deleting any older versions of log files... 7/14 17:09:20 Deleting older version of results.log 7/14 17:09:20 Bootstrapping... 7/14 17:09:20 Number of pre-completed jobs: 0 7/14 17:09:20 Registering condor_event_timer... 7/14 17:09:21 Submitting Condor Job HelloWorld ... 7/14 17:09:21 submitting: condor_submit -a 'dag_node_name = HelloWorld' -a '+DAGManJobID = 17.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' myjob.submit 2>&1 7/14 17:09:21 assigned Condor ID (18.0.0) 7/14 17:09:21 Submitting Condor Job Setup ... 
7/14 17:09:21 submitting: condor_submit -a 'dag_node_name = Setup' -a '+DAGManJobID = 17.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.setup.submit 2>&1 7/14 17:09:21 assigned Condor ID (19.0.0) 7/14 17:09:21 Just submitted 2 jobs this cycle... 7/14 17:09:21 Event: ULOG_SUBMIT for Condor Job HelloWorld (18.0.0) 7/14 17:09:21 Event: ULOG_SUBMIT for Condor Job Setup (19.0.0) 7/14 17:09:21 Of 6 nodes total: 7/14 17:09:21 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:09:21 === === === === === === === 7/14 17:09:21 0 0 2 0 0 4 0 7/14 17:10:16 Event: ULOG_GLOBUS_SUBMIT for Condor Job Setup (19.0.0) 7/14 17:10:21 Event: ULOG_GLOBUS_SUBMIT for Condor Job HelloWorld (18.0.0) 7/14 17:10:36 Event: ULOG_EXECUTE for Condor Job Setup (19.0.0) 7/14 17:10:46 Event: ULOG_EXECUTE for Condor Job HelloWorld (18.0.0) 7/14 17:10:56 Event: ULOG_JOB_TERMINATED for Condor Job Setup (19.0.0) 7/14 17:10:56 Job Setup completed successfully. 7/14 17:10:56 Of 6 nodes total: 7/14 17:10:56 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:10:56 === === === === === === === 7/14 17:10:56 1 0 1 0 2 2 0 7/14 17:11:02 Submitting Condor Job WorkerNode_1 ... 7/14 17:11:02 submitting: condor_submit -a 'dag_node_name = WorkerNode_1' -a '+DAGManJobID = 17.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.work1.submit 2>&1 7/14 17:11:02 assigned Condor ID (20.0.0) 7/14 17:11:03 Submitting Condor Job WorkerNode_Two ... 7/14 17:11:03 submitting: condor_submit -a 'dag_node_name = WorkerNode_Two' -a '+DAGManJobID = 17.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.work2.submit 2>&1 7/14 17:11:04 assigned Condor ID (21.0.0) 7/14 17:11:04 Just submitted 2 jobs this cycle... 7/14 17:11:04 Event: ULOG_SUBMIT for Condor Job WorkerNode_1 (20.0.0) 7/14 17:11:04 Event: ULOG_JOB_TERMINATED for Condor Job HelloWorld (18.0.0) 7/14 17:11:04 Job HelloWorld completed successfully. 
7/14 17:11:04 Event: ULOG_SUBMIT for Condor Job WorkerNode_Two (21.0.0) 7/14 17:11:04 Of 6 nodes total: 7/14 17:11:04 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:11:04 === === === === === === === 7/14 17:11:04 2 0 2 0 0 2 0 7/14 17:11:54 Event: ULOG_GLOBUS_SUBMIT for Condor Job WorkerNode_1 (20.0.0) 7/14 17:11:59 Event: ULOG_GLOBUS_SUBMIT for Condor Job WorkerNode_Two (21.0.0) 7/14 17:12:09 Event: ULOG_EXECUTE for Condor Job WorkerNode_1 (20.0.0) 7/14 17:12:14 Event: ULOG_EXECUTE for Condor Job WorkerNode_Two (21.0.0) 7/14 17:12:29 Event: ULOG_JOB_TERMINATED for Condor Job WorkerNode_1 (20.0.0) 7/14 17:12:29 Job WorkerNode_1 completed successfully. 7/14 17:12:29 Of 6 nodes total: 7/14 17:12:29 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:12:29 === === === === === === === 7/14 17:12:29 3 0 1 0 0 2 0 7/14 17:13:24 Event: ULOG_JOB_TERMINATED for Condor Job WorkerNode_Two (21.0.0) 7/14 17:13:24 Job WorkerNode_Two completed successfully. 7/14 17:13:24 Of 6 nodes total: 7/14 17:13:24 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:13:24 === === === === === === === 7/14 17:13:24 4 0 0 0 1 1 0 7/14 17:13:30 Submitting Condor Job CollectResults ... 7/14 17:13:30 submitting: condor_submit -a 'dag_node_name = CollectResults' -a '+DAGManJobID = 17.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.workfinal.submit 2>&1 7/14 17:13:30 assigned Condor ID (22.0.0) 7/14 17:13:30 Just submitted 1 job this cycle... 7/14 17:13:30 Event: ULOG_SUBMIT for Condor Job CollectResults (22.0.0) 7/14 17:13:30 Of 6 nodes total: 7/14 17:13:30 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:13:30 === === === === === === === 7/14 17:13:30 4 0 1 0 0 1 0 7/14 17:14:20 Event: ULOG_GLOBUS_SUBMIT for Condor Job CollectResults (22.0.0) 7/14 17:14:40 Event: ULOG_EXECUTE for Condor Job CollectResults (22.0.0) 7/14 17:15:00 Event: ULOG_JOB_TERMINATED for Condor Job CollectResults (22.0.0) 7/14 17:15:00 Job CollectResults completed successfully. 
7/14 17:15:00 Of 6 nodes total: 7/14 17:15:00 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:15:00 === === === === === === === 7/14 17:15:00 5 0 0 0 1 0 0 7/14 17:15:06 Submitting Condor Job LastNode ... 7/14 17:15:06 submitting: condor_submit -a 'dag_node_name = LastNode' -a '+DAGManJobID = 17.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.finalize.submit 2>&1 7/14 17:15:06 assigned Condor ID (23.0.0) 7/14 17:15:06 Just submitted 1 job this cycle... 7/14 17:15:06 Event: ULOG_SUBMIT for Condor Job LastNode (23.0.0) 7/14 17:15:06 Of 6 nodes total: 7/14 17:15:06 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:15:06 === === === === === === === 7/14 17:15:06 5 0 1 0 0 0 0 7/14 17:16:01 Event: ULOG_GLOBUS_SUBMIT for Condor Job LastNode (23.0.0) 7/14 17:16:26 Event: ULOG_EXECUTE for Condor Job LastNode (23.0.0) 7/14 17:16:46 Event: ULOG_JOB_TERMINATED for Condor Job LastNode (23.0.0) 7/14 17:16:46 Job LastNode completed successfully. 7/14 17:16:46 Of 6 nodes total: 7/14 17:16:46 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:16:46 === === === === === === === 7/14 17:16:46 6 0 0 0 0 0 0 7/14 17:16:46 All jobs Completed! 7/14 17:16:46 **** condor_scheduniv_exec.17.0 (condor_DAGMAN) EXITING WITH STATUS 0
Clean up your results. Be careful with this command: you want to delete only the generated mydag.dag.* files (such as mydag.dag.lock), not the mydag.dag file itself.
$ rm mydag.dag.* results.*
DAGMan can handle the situation where some of the nodes in a DAG fail. DAGMan will run as many nodes as possible, then create a rescue DAG, making it easy to continue once the problem is fixed.
Let's create a script that will fail so we can see this:
$ cat > myscript2.sh
#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
echo "RESULT: 1 FAILURE"
exit 1
Ctrl-D
$ cat myscript2.sh
#! /bin/sh
echo "I'm process id $$ on" `hostname`
echo "This is sent to standard error" 1>&2
date
echo "Running as binary $0" "$@"
echo "My name (argument 1) is $1"
echo "My sleep duration (argument 2) is $2"
sleep $2
echo "Sleep of $2 seconds finished. Exiting"
echo "RESULT: 1 FAILURE"
exit 1
$ chmod a+x myscript2.sh
Modify job.work2.submit to run myscript2.sh instead of myscript.sh:
$ rm job.work2.submit
$ cat > job.work2.submit
executable=myscript2.sh
output=results.work2.output
error=results.work2.error
log=results.log
notification=never
universe=globus
globusscheduler=server1.gs.unina.it:/jobmanager-fork
arguments=WorkerNode2 60
queue
Ctrl-D
$ cat job.work2.submit
executable=myscript2.sh
output=results.work2.output
error=results.work2.error
log=results.log
notification=never
universe=globus
globusscheduler=server1.gs.unina.it:/jobmanager-fork
arguments=WorkerNode2 60
queue
Submit the DAG again.
$ condor_submit_dag mydag.dag
Checking your DAG input file and all submit files it references.
This might take a while...
Done.
-----------------------------------------------------------------------
File for submitting this DAG to Condor   : mydag.dag.condor.sub
Log of DAGMan debugging messages         : mydag.dag.dagman.out
Log of Condor library debug messages     : mydag.dag.lib.out
Log of the life of condor_dagman itself  : mydag.dag.dagman.log
Condor Log file for all jobs of this DAG : results.log
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 15.
-----------------------------------------------------------------------
Use watch_condor_q to watch the jobs until they finish.
In separate windows run "tail -f --lines=500 results.log" and "tail -f --lines=500 mydag.dag.dagman.out" to monitor the jobs' progress.
$ ./watch_condor_q -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 15.0 adesmet 7/10 11:11 0+00:00:04 R 0 2.6 condor_dagman -f - 16.0 adesmet 7/10 11:11 0+00:00:00 I 0 0.0 myscript.sh 17.0 adesmet 7/10 11:11 0+00:00:00 I 0 0.0 myscript.sh Setup 3 jobs; 2 idle, 1 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 16.0 adesmet UNSUBMITTED fork server1.gs.unina.it /afs/cs.wisc.edu/u 17.0 adesmet UNSUBMITTED fork server1.gs.unina.it /afs/cs.wisc.edu/u -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 15.0 adesmet 7/10 11:11 0+00:00:04 R 0 2.6 condor_dagman -f - 16.0 |-HelloWorld 7/10 11:11 0+00:00:00 I 0 0.0 myscript.sh 17.0 |-Setup 7/10 11:11 0+00:00:00 I 0 0.0 myscript.sh Setup 3 jobs; 2 idle, 1 running, 0 held Output of watch_condor_q truncated -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 0 jobs; 0 idle, 0 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 0 jobs; 0 idle, 0 running, 0 held Ctrl-C
Check your results:
$ ls job.finalize.submit mydag.dag.condor.sub myscript.sh results.output results.work2.output job.setup.submit mydag.dag.dagman.log myscript2.sh results.setup.error results.workfinal.error job.work1.submit mydag.dag.dagman.out results.error results.setup.output results.workfinal.output job.work2.submit mydag.dag.lib.out results.finalize.error results.work1.error watch_condor_q job.workfinal.submit mydag.dag.lock results.finalize.output results.work1.output mydag.dag myjob.submit results.log results.work2.error $ cat results.work2.output I'm process id 29921 on server1.gs.unina.it Thu Jul 10 11:12:42 CDT 2003 Running as binary /n/uscms_share/home/adesmet2/.globus/.gass_cache/local/md5/87/459c159766cefb36f0d75023de0e35/md5/70/5d82b930ec61460d9c9ca65cbe5a8a/data WorkerNode2 60 My name (argument 1) is WorkerNode2 My sleep duration (argument 2) is 60 Sleep of 60 seconds finished. Exiting RESULT: 1 FAILURE $ cat mydag.dag.dagman.out 7/14 17:09:20 ****************************************************** 7/14 17:09:20 ** condor_scheduniv_exec.17.0 (CONDOR_DAGMAN) STARTING UP 7/14 17:09:20 ** $CondorVersion: 6.6.5 May 3 2004 $ 7/14 17:09:20 ** $CondorPlatform: I386-LINUX-RH9 $ 7/14 17:09:20 ** PID = 7080 7/14 17:09:20 ****************************************************** 7/14 17:09:20 Using config file: /scratch/adesmet/vdttest/condor/etc/condor_config 7/14 17:09:20 Using local config files: /scratch/adesmet/vdttest/condor/local.puffin/condor_config.local 7/14 17:09:20 DaemonCore: Command Socket at <192.167.1.100:60785> 7/14 17:09:20 argv[0] == "condor_scheduniv_exec.17.0" 7/14 17:09:20 argv[1] == "-Debug" 7/14 17:09:20 argv[2] == "3" 7/14 17:09:20 argv[3] == "-Lockfile" 7/14 17:09:20 argv[4] == "mydag.dag.lock" 7/14 17:09:20 argv[5] == "-Dag" 7/14 17:09:20 argv[6] == "mydag.dag" 7/14 17:09:20 argv[7] == "-Rescue" 7/14 17:09:20 argv[8] == "mydag.dag.rescue" 7/14 17:09:20 argv[9] == "-Condorlog" 7/14 17:09:20 argv[10] == "results.log" 7/14 17:09:20 DAG Lockfile will be 
written to mydag.dag.lock 7/14 17:09:20 DAG Input file is mydag.dag 7/14 17:09:20 Rescue DAG will be written to mydag.dag.rescue 7/14 17:09:20 Condor log will be written to results.log, etc. 7/14 17:09:20 Parsing mydag.dag ... 7/14 17:09:20 Dag contains 6 total jobs 7/14 17:09:20 Deleting any older versions of log files... 7/14 17:09:20 Deleting older version of results.log 7/14 17:09:20 Bootstrapping... 7/14 17:09:20 Number of pre-completed jobs: 0 7/14 17:09:20 Registering condor_event_timer... 7/14 17:09:21 Submitting Condor Job HelloWorld ... 7/14 17:09:21 submitting: condor_submit -a 'dag_node_name = HelloWorld' -a '+DAGManJobID = 17.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' myjob.submit 2>&1 7/14 17:09:21 assigned Condor ID (18.0.0) 7/14 17:09:21 Submitting Condor Job Setup ... 7/14 17:09:21 submitting: condor_submit -a 'dag_node_name = Setup' -a '+DAGManJobID = 17.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.setup.submit 2>&1 7/14 17:09:21 assigned Condor ID (19.0.0) 7/14 17:09:21 Just submitted 2 jobs this cycle... 7/14 17:09:21 Event: ULOG_SUBMIT for Condor Job HelloWorld (18.0.0) 7/14 17:09:21 Event: ULOG_SUBMIT for Condor Job Setup (19.0.0) 7/14 17:09:21 Of 6 nodes total: 7/14 17:09:21 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:09:21 === === === === === === === 7/14 17:09:21 0 0 2 0 0 4 0 7/14 17:10:16 Event: ULOG_GLOBUS_SUBMIT for Condor Job Setup (19.0.0) 7/14 17:10:21 Event: ULOG_GLOBUS_SUBMIT for Condor Job HelloWorld (18.0.0) 7/14 17:10:36 Event: ULOG_EXECUTE for Condor Job Setup (19.0.0) 7/14 17:10:46 Event: ULOG_EXECUTE for Condor Job HelloWorld (18.0.0) 7/14 17:10:56 Event: ULOG_JOB_TERMINATED for Condor Job Setup (19.0.0) 7/14 17:10:56 Job Setup completed successfully. 7/14 17:10:56 Of 6 nodes total: 7/14 17:10:56 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:10:56 === === === === === === === 7/14 17:10:56 1 0 1 0 2 2 0 7/14 17:11:02 Submitting Condor Job WorkerNode_1 ... 
7/14 17:11:02 submitting: condor_submit -a 'dag_node_name = WorkerNode_1' -a '+DAGManJobID = 17.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.work1.submit 2>&1 7/14 17:11:02 assigned Condor ID (20.0.0) 7/14 17:11:03 Submitting Condor Job WorkerNode_Two ... 7/14 17:11:03 submitting: condor_submit -a 'dag_node_name = WorkerNode_Two' -a '+DAGManJobID = 17.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.work2.submit 2>&1 7/14 17:11:04 assigned Condor ID (21.0.0) 7/14 17:11:04 Just submitted 2 jobs this cycle... 7/14 17:11:04 Event: ULOG_SUBMIT for Condor Job WorkerNode_1 (20.0.0) 7/14 17:11:04 Event: ULOG_JOB_TERMINATED for Condor Job HelloWorld (18.0.0) 7/14 17:11:04 Job HelloWorld completed successfully. 7/14 17:11:04 Event: ULOG_SUBMIT for Condor Job WorkerNode_Two (21.0.0) 7/14 17:11:04 Of 6 nodes total: 7/14 17:11:04 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:11:04 === === === === === === === 7/14 17:11:04 2 0 2 0 0 2 0 7/14 17:11:54 Event: ULOG_GLOBUS_SUBMIT for Condor Job WorkerNode_1 (20.0.0) 7/14 17:11:59 Event: ULOG_GLOBUS_SUBMIT for Condor Job WorkerNode_Two (21.0.0) 7/14 17:12:09 Event: ULOG_EXECUTE for Condor Job WorkerNode_1 (20.0.0) 7/14 17:12:14 Event: ULOG_EXECUTE for Condor Job WorkerNode_Two (21.0.0) 7/14 17:12:29 Event: ULOG_JOB_TERMINATED for Condor Job WorkerNode_1 (20.0.0) 7/14 17:12:29 Job WorkerNode_1 completed successfully. 7/14 17:12:29 Of 6 nodes total: 7/14 17:12:29 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:12:29 === === === === === === === 7/14 17:12:29 3 0 1 0 0 2 0 7/14 17:13:24 Event: ULOG_JOB_TERMINATED for Condor Job WorkerNode_Two (21.0.0) 7/14 17:13:24 Job WorkerNode_Two completed successfully. 7/14 17:13:24 Of 6 nodes total: 7/14 17:13:24 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:13:24 === === === === === === === 7/14 17:13:24 4 0 0 0 1 1 0 7/14 17:13:30 Submitting Condor Job CollectResults ... 
7/14 17:13:30 submitting: condor_submit -a 'dag_node_name = CollectResults' -a '+DAGManJobID = 17.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.workfinal.submit 2>&1 7/14 17:13:30 assigned Condor ID (22.0.0) 7/14 17:13:30 Just submitted 1 job this cycle... 7/14 17:13:30 Event: ULOG_SUBMIT for Condor Job CollectResults (22.0.0) 7/14 17:13:30 Of 6 nodes total: 7/14 17:13:30 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:13:30 === === === === === === === 7/14 17:13:30 4 0 1 0 0 1 0 7/14 17:14:20 Event: ULOG_GLOBUS_SUBMIT for Condor Job CollectResults (22.0.0) 7/14 17:14:40 Event: ULOG_EXECUTE for Condor Job CollectResults (22.0.0) 7/14 17:15:00 Event: ULOG_JOB_TERMINATED for Condor Job CollectResults (22.0.0) 7/14 17:15:00 Job CollectResults completed successfully. 7/14 17:15:00 Of 6 nodes total: 7/14 17:15:00 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:15:00 === === === === === === === 7/14 17:15:00 5 0 0 0 1 0 0 7/14 17:15:06 Submitting Condor Job LastNode ... 7/14 17:15:06 submitting: condor_submit -a 'dag_node_name = LastNode' -a '+DAGManJobID = 17.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.finalize.submit 2>&1 7/14 17:15:06 assigned Condor ID (23.0.0) 7/14 17:15:06 Just submitted 1 job this cycle... 7/14 17:15:06 Event: ULOG_SUBMIT for Condor Job LastNode (23.0.0) 7/14 17:15:06 Of 6 nodes total: 7/14 17:15:06 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:15:06 === === === === === === === 7/14 17:15:06 5 0 1 0 0 0 0 7/14 17:16:01 Event: ULOG_GLOBUS_SUBMIT for Condor Job LastNode (23.0.0) 7/14 17:16:26 Event: ULOG_EXECUTE for Condor Job LastNode (23.0.0) 7/14 17:16:46 Event: ULOG_JOB_TERMINATED for Condor Job LastNode (23.0.0) 7/14 17:16:46 Job LastNode completed successfully. 7/14 17:16:46 Of 6 nodes total: 7/14 17:16:46 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:16:46 === === === === === === === 7/14 17:16:46 6 0 0 0 0 0 0 7/14 17:16:46 All jobs Completed! 
7/14 17:16:46 **** condor_scheduniv_exec.17.0 (condor_DAGMAN) EXITING WITH STATUS 0
Uh oh: DAGMan ran the remaining nodes based on bad data from node work2. Normally DAGMan checks the return code of each job and considers non-zero a failure, and we did modify myscript2.sh to return non-zero. That would normally work, but we're using Condor-G, not plain Condor. Condor-G relies on Globus, and Globus does not propagate job exit codes.
If you want DAGMan to notice a failed job and stop the DAG at that point, you'll need to use a POST script to detect the problem. One solution is to wrap your executable in a script that writes the executable's return code to stdout, and have the POST script scan the stdout for the status. Or perhaps your executable's normal output already contains enough information to make the decision.
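One way to implement the wrapper approach is sketched below. The file name wrapper.sh is an assumption (it is not one of this tutorial's files), and the RESULT line simply mirrors the message the tutorial's scripts already print; adapt both to your own job.

```shell
# Create a wrapper that runs the real command, reports its exit code
# on stdout for a POST script to find, and always exits 0 so that
# Condor-G/Globus consider the job itself "successful".
cat > wrapper.sh <<'WRAP'
#!/bin/sh
"$@"
status=$?
if [ "$status" -eq 0 ]; then
    echo "RESULT: 0 SUCCESS"
else
    echo "RESULT: $status FAILURE"
fi
exit 0
WRAP
chmod a+x wrapper.sh

# A failing command still "succeeds" as far as Globus is concerned,
# but the RESULT line in stdout records the real status.
./wrapper.sh false
```

In the submit file you would then make the wrapper the executable and pass the real program as an argument, letting the POST script grep the job's output file for the RESULT line.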
In this case, our executable already emits a well-known message on success. Let's add a POST script.
First, clean up your results. Be careful when deleting the mydag.dag.* files: you do not want to delete the mydag.dag file itself, just the files matching mydag.dag.* .
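If you're unsure what a glob will match, preview it before deleting. Note that mydag.dag itself does not match mydag.dag.*, because the pattern requires a dot after the name. The /tmp/glob-demo directory below is just a throwaway example, not part of the tutorial's working directory:

```shell
# Throwaway demo: mydag.dag.* matches the generated files
# but not the DAG file itself.
rm -rf /tmp/glob-demo && mkdir /tmp/glob-demo && cd /tmp/glob-demo
touch mydag.dag mydag.dag.lock mydag.dag.dagman.out results.log
ls -d mydag.dag.*
# lists mydag.dag.dagman.out and mydag.dag.lock, but not mydag.dag
```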
$ rm mydag.dag.* results.*
Now create a script to check the output.
$ cat > postscript_checker #! /bin/sh grep 'RESULT: 0 SUCCESS' $1 > /dev/null 2>/dev/null Ctrl-D $ cat postscript_checker #! /bin/sh grep 'RESULT: 0 SUCCESS' $1 > /dev/null 2>/dev/null $ chmod a+x postscript_checker
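DAGMan treats the POST script's exit status the same way it treats a job's: zero means success, non-zero means failure. Since grep is the last command in postscript_checker, grep's exit status becomes the script's. You can check this behavior by hand (good.output and bad.output are throwaway files for illustration):

```shell
# grep exits 0 when the pattern is found and 1 when it is not;
# that exit status is what DAGMan sees from the POST script.
echo 'RESULT: 0 SUCCESS' > good.output
echo 'RESULT: 1 FAILURE' > bad.output

grep 'RESULT: 0 SUCCESS' good.output > /dev/null 2>/dev/null
echo "good.output: exit $?"    # exit 0 -> node marked done

grep 'RESULT: 0 SUCCESS' bad.output > /dev/null 2>/dev/null
echo "bad.output: exit $?"     # exit 1 -> node marked failed
```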
Modify your mydag.dag to use the new script for the nodes.
$ cat >>mydag.dag Script POST Setup postscript_checker results.setup.output Script POST WorkerNode_1 postscript_checker results.work1.output Script POST WorkerNode_Two postscript_checker results.work2.output Script POST CollectResults postscript_checker results.workfinal.output Script POST LastNode postscript_checker results.finalize.output Ctrl-D $ cat mydag.dag Job HelloWorld myjob.submit Job Setup job.setup.submit Job WorkerNode_1 job.work1.submit Job WorkerNode_Two job.work2.submit Job CollectResults job.workfinal.submit Job LastNode job.finalize.submit PARENT Setup CHILD WorkerNode_1 WorkerNode_Two PARENT WorkerNode_1 WorkerNode_Two CHILD CollectResults PARENT CollectResults CHILD LastNode Script POST Setup postscript_checker results.setup.output Script POST WorkerNode_1 postscript_checker results.work1.output Script POST WorkerNode_Two postscript_checker results.work2.output Script POST CollectResults postscript_checker results.workfinal.output Script POST LastNode postscript_checker results.finalize.output $ ls job.finalize.submit job.work1.submit job.workfinal.submit myjob.submit myscript2.sh watch_condor_q job.setup.submit job.work2.submit mydag.dag myscript.sh postscript_checker
Submit the dag again with the new POST scripts in place.
$ condor_submit_dag mydag.dag Checking your DAG input file and all submit files it references. This might take a while... Done. ----------------------------------------------------------------------- File for submitting this DAG to Condor : mydag.dag.condor.sub Log of DAGMan debugging messages : mydag.dag.dagman.out Log of Condor library debug messages : mydag.dag.lib.out Log of the life of condor_dagman itself : mydag.dag.dagman.log Condor Log file for all jobs of this DAG : results.log Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 22. -----------------------------------------------------------------------
Again, watch the job with watch_condor_q. In separate windows run "tail -f --lines=500 results.log" and "tail -f --lines=500 mydag.dag.dagman.out" to monitor the job's progress.
$ ./watch_condor_q -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 22.0 adesmet 7/10 11:25 0+00:00:03 R 0 2.6 condor_dagman -f - 23.0 adesmet 7/10 11:25 0+00:00:00 I 0 0.0 myscript.sh 24.0 adesmet 7/10 11:25 0+00:00:00 I 0 0.0 myscript.sh Setup 3 jobs; 2 idle, 1 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 23.0 adesmet UNSUBMITTED fork server1.gs.unina.it /afs/cs.wisc.edu/u 24.0 adesmet UNSUBMITTED fork server1.gs.unina.it /afs/cs.wisc.edu/u -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 22.0 adesmet 7/10 11:25 0+00:00:03 R 0 2.6 condor_dagman -f - 23.0 |-HelloWorld 7/10 11:25 0+00:00:00 I 0 0.0 myscript.sh 24.0 |-Setup 7/10 11:25 0+00:00:00 I 0 0.0 myscript.sh Setup 3 jobs; 2 idle, 1 running, 0 held Output of watch_condor_q truncated -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 0 jobs; 0 idle, 0 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 0 jobs; 0 idle, 0 running, 0 held Ctrl-C
Check your results:
$ ls job.finalize.submit mydag.dag mydag.dag.rescue results.error results.work1.error job.setup.submit mydag.dag.condor.sub myjob.submit results.log results.work1.output job.work1.submit mydag.dag.dagman.log myscript.sh results.output results.work2.error job.work2.submit mydag.dag.dagman.out myscript2.sh results.setup.error results.work2.output job.workfinal.submit mydag.dag.lib.out postscript_checker results.setup.output watch_condor_q $ cat mydag.dag.dagman.out 7/14 17:25:24 ****************************************************** 7/14 17:25:24 ** condor_scheduniv_exec.24.0 (CONDOR_DAGMAN) STARTING UP 7/14 17:25:24 ** $CondorVersion: 6.6.5 May 3 2004 $ 7/14 17:25:24 ** $CondorPlatform: I386-LINUX-RH9 $ 7/14 17:25:24 ** PID = 7809 7/14 17:25:24 ****************************************************** 7/14 17:25:24 Using config file: /scratch/adesmet/vdttest/condor/etc/condor_config 7/14 17:25:24 Using local config files: /scratch/adesmet/vdttest/condor/local.puffin/condor_config.local 7/14 17:25:24 DaemonCore: Command Socket at <192.167.1.100:34511> 7/14 17:25:24 argv[0] == "condor_scheduniv_exec.24.0" 7/14 17:25:24 argv[1] == "-Debug" 7/14 17:25:24 argv[2] == "3" 7/14 17:25:24 argv[3] == "-Lockfile" 7/14 17:25:24 argv[4] == "mydag.dag.lock" 7/14 17:25:24 argv[5] == "-Dag" 7/14 17:25:24 argv[6] == "mydag.dag" 7/14 17:25:24 argv[7] == "-Rescue" 7/14 17:25:24 argv[8] == "mydag.dag.rescue" 7/14 17:25:24 argv[9] == "-Condorlog" 7/14 17:25:24 argv[10] == "results.log" 7/14 17:25:24 DAG Lockfile will be written to mydag.dag.lock 7/14 17:25:24 DAG Input file is mydag.dag 7/14 17:25:24 Rescue DAG will be written to mydag.dag.rescue 7/14 17:25:24 Condor log will be written to results.log, etc. 7/14 17:25:24 Parsing mydag.dag ... 
7/14 17:25:24 jobName: Setup 7/14 17:25:24 jobName: WorkerNode_1 7/14 17:25:24 jobName: WorkerNode_Two 7/14 17:25:24 jobName: CollectResults 7/14 17:25:24 jobName: LastNode 7/14 17:25:24 Dag contains 6 total jobs 7/14 17:25:24 Deleting any older versions of log files... 7/14 17:25:24 Bootstrapping... 7/14 17:25:24 Number of pre-completed jobs: 0 7/14 17:25:24 Registering condor_event_timer... 7/14 17:25:25 Submitting Condor Job HelloWorld ... 7/14 17:25:25 submitting: condor_submit -a 'dag_node_name = HelloWorld' -a '+DAGManJobID = 24.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' myjob.submit 2>&1 7/14 17:25:25 assigned Condor ID (25.0.0) 7/14 17:25:25 Submitting Condor Job Setup ... 7/14 17:25:25 submitting: condor_submit -a 'dag_node_name = Setup' -a '+DAGManJobID = 24.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.setup.submit 2>&1 7/14 17:25:26 assigned Condor ID (26.0.0) 7/14 17:25:26 Just submitted 2 jobs this cycle... 7/14 17:25:26 Event: ULOG_SUBMIT for Condor Job HelloWorld (25.0.0) 7/14 17:25:26 Event: ULOG_SUBMIT for Condor Job Setup (26.0.0) 7/14 17:25:26 Of 6 nodes total: 7/14 17:25:26 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:25:26 === === === === === === === 7/14 17:25:26 0 0 2 0 0 4 0 7/14 17:26:26 Event: ULOG_GLOBUS_SUBMIT for Condor Job HelloWorld (25.0.0) 7/14 17:26:31 Event: ULOG_GLOBUS_SUBMIT for Condor Job Setup (26.0.0) 7/14 17:26:51 Event: ULOG_EXECUTE for Condor Job HelloWorld (25.0.0) 7/14 17:27:01 Event: ULOG_EXECUTE for Condor Job Setup (26.0.0) 7/14 17:27:11 Event: ULOG_JOB_TERMINATED for Condor Job HelloWorld (25.0.0) 7/14 17:27:11 Job HelloWorld completed successfully. 7/14 17:27:11 Of 6 nodes total: 7/14 17:27:11 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:27:11 === === === === === === === 7/14 17:27:11 1 0 1 0 0 4 0 7/14 17:27:21 Event: ULOG_JOB_TERMINATED for Condor Job Setup (26.0.0) 7/14 17:27:21 Job Setup completed successfully. 7/14 17:27:21 Running POST script of Job Setup... 
7/14 17:27:21 Of 6 nodes total: 7/14 17:27:21 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:27:21 === === === === === === === 7/14 17:27:21 1 0 0 1 0 4 0 7/14 17:27:26 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Job Setup (26.0.0) 7/14 17:27:26 POST Script of Job Setup completed successfully. 7/14 17:27:26 Of 6 nodes total: 7/14 17:27:26 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:27:26 === === === === === === === 7/14 17:27:26 2 0 0 0 2 2 0 7/14 17:27:32 Submitting Condor Job WorkerNode_1 ... 7/14 17:27:32 submitting: condor_submit -a 'dag_node_name = WorkerNode_1' -a '+DAGManJobID = 24.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.work1.submit 2>&1 7/14 17:27:32 assigned Condor ID (27.0.0) 7/14 17:27:33 Submitting Condor Job WorkerNode_Two ... 7/14 17:27:33 submitting: condor_submit -a 'dag_node_name = WorkerNode_Two' -a '+DAGManJobID = 24.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.work2.submit 2>&1 7/14 17:27:33 assigned Condor ID (28.0.0) 7/14 17:27:33 Just submitted 2 jobs this cycle... 7/14 17:27:33 Event: ULOG_SUBMIT for Condor Job WorkerNode_1 (27.0.0) 7/14 17:27:33 Event: ULOG_SUBMIT for Condor Job WorkerNode_Two (28.0.0) 7/14 17:27:33 Of 6 nodes total: 7/14 17:27:33 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:27:33 === === === === === === === 7/14 17:27:33 2 0 2 0 0 2 0 7/14 17:28:33 Event: ULOG_GLOBUS_SUBMIT for Condor Job WorkerNode_1 (27.0.0) 7/14 17:28:38 Event: ULOG_GLOBUS_SUBMIT for Condor Job WorkerNode_Two (28.0.0) 7/14 17:28:58 Event: ULOG_EXECUTE for Condor Job WorkerNode_1 (27.0.0) 7/14 17:29:03 Event: ULOG_EXECUTE for Condor Job WorkerNode_Two (28.0.0) 7/14 17:29:18 Event: ULOG_JOB_TERMINATED for Condor Job WorkerNode_1 (27.0.0) 7/14 17:29:18 Job WorkerNode_1 completed successfully. 7/14 17:29:18 Running POST script of Job WorkerNode_1... 
7/14 17:29:18 Of 6 nodes total: 7/14 17:29:18 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:29:18 === === === === === === === 7/14 17:29:18 2 0 1 1 0 2 0 7/14 17:29:23 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Job WorkerNode_1 (27.0.0) 7/14 17:29:23 POST Script of Job WorkerNode_1 completed successfully. 7/14 17:29:23 Of 6 nodes total: 7/14 17:29:23 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:29:23 === === === === === === === 7/14 17:29:23 3 0 1 0 0 2 0 7/14 17:30:08 Event: ULOG_JOB_TERMINATED for Condor Job WorkerNode_Two (28.0.0) 7/14 17:30:08 Job WorkerNode_Two completed successfully. 7/14 17:30:08 Running POST script of Job WorkerNode_Two... 7/14 17:30:08 Of 6 nodes total: 7/14 17:30:08 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:30:08 === === === === === === === 7/14 17:30:08 3 0 0 1 0 2 0 7/14 17:30:13 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Job WorkerNode_Two (28.0.0) 7/14 17:30:13 POST Script of Job WorkerNode_Two failed with status 1 7/14 17:30:13 Of 6 nodes total: 7/14 17:30:13 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:30:13 === === === === === === === 7/14 17:30:13 3 0 0 0 0 2 1 7/14 17:30:13 ERROR: the following job(s) failed: 7/14 17:30:13 ---------------------- Job ---------------------- 7/14 17:30:13 Node Name: WorkerNode_Two 7/14 17:30:13 NodeID: 3 7/14 17:30:13 Node Status: STATUS_ERROR 7/14 17:30:13 Error: POST Script failed with status 1 7/14 17:30:13 Job Submit File: job.work2.submit 7/14 17:30:13 POST Script: postscript_checker results.work2.output 7/14 17:30:13 Condor Job ID: (28.0.0) 7/14 17:30:13 Q_PARENTS: 1, <END> 7/14 17:30:13 Q_WAITING: <END> 7/14 17:30:13 Q_CHILDREN: 4, <END> 7/14 17:30:13 --------------------------------------- <END> 7/14 17:30:13 Aborting DAG... 7/14 17:30:13 Writing Rescue DAG to mydag.dag.rescue... 7/14 17:30:13 **** condor_scheduniv_exec.24.0 (condor_DAGMAN) EXITING WITH STATUS 1
DAGMan noticed that one of the jobs failed. It ran as much of the DAG as possible and logged enough information to continue the run once the problem is resolved.
Look at the rescue DAG. It is structurally the same as your original DAG, but nodes that finished are marked DONE. (DAGMan also reorganized the file.) When you submit the rescue DAG, the DONE nodes will be skipped.
$ cat mydag.dag.rescue # Rescue DAG file, created after running # the mydag.dag DAG file # # Total number of Nodes: 6 # Nodes premarked DONE: 3 # Nodes that failed: 1 # WorkerNode_Two,<ENDLIST> JOB HelloWorld myjob.submit DONE JOB Setup job.setup.submit DONE SCRIPT POST Setup postscript_checker results.setup.output JOB WorkerNode_1 job.work1.submit DONE SCRIPT POST WorkerNode_1 postscript_checker results.work1.output JOB WorkerNode_Two job.work2.submit SCRIPT POST WorkerNode_Two postscript_checker results.work2.output JOB CollectResults job.workfinal.submit SCRIPT POST CollectResults postscript_checker results.workfinal.output JOB LastNode job.finalize.submit SCRIPT POST LastNode postscript_checker results.finalize.output PARENT Setup CHILD WorkerNode_1 WorkerNode_Two PARENT WorkerNode_1 CHILD CollectResults PARENT WorkerNode_Two CHILD CollectResults PARENT CollectResults CHILD LastNode
So we know there is a problem with the work2 step. Let's "fix" it.
$ rm myscript2.sh $ cp myscript.sh myscript2.sh
Now we can submit our rescue DAG. (If you hadn't fixed the problem, DAGMan would generate yet another rescue DAG, this time "mydag.dag.rescue.rescue".) In separate windows, run "tail -f --lines=500 results.log" and "tail -f --lines=500 mydag.dag.rescue.dagman.out" to monitor the job's progress.
$ condor_submit_dag mydag.dag.rescue Checking your DAG input file and all submit files it references. This might take a while... Done. ----------------------------------------------------------------------- File for submitting this DAG to Condor : mydag.dag.rescue.condor.sub Log of DAGMan debugging messages : mydag.dag.rescue.dagman.out Log of Condor library debug messages : mydag.dag.rescue.lib.out Log of the life of condor_dagman itself : mydag.dag.rescue.dagman.log Condor Log file for all jobs of this DAG : results.log Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 27. ----------------------------------------------------------------------- $ ./watch_condor_q -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 27.0 adesmet 7/10 11:34 0+00:00:01 R 0 2.6 condor_dagman -f - 28.0 adesmet 7/10 11:34 0+00:00:00 I 0 0.0 myscript2.sh Worke 2 jobs; 1 idle, 1 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 28.0 adesmet UNSUBMITTED fork server1.gs.unina.it /afs/cs.wisc.edu/u -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 27.0 adesmet 7/10 11:34 0+00:00:01 R 0 2.6 condor_dagman -f - 28.0 |-WorkerNode_ 7/10 11:34 0+00:00:00 I 0 0.0 myscript2.sh Worke 2 jobs; 1 idle, 1 running, 0 held Output of watch_condor_q truncated -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 0 jobs; 0 idle, 0 running, 0 held -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE -- Submitter: ws01.gs.unina.it : <192.167.1.100:33785> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 0 jobs; 0 idle, 0 running, 0 held Ctrl-C
Check your results.
$ ls job.finalize.submit mydag.dag.lib.out myscript2.sh results.work1.error job.setup.submit mydag.dag.rescue postscript_checker results.work1.output job.work1.submit mydag.dag.rescue.condor.sub results.error results.work2.error job.work2.submit mydag.dag.rescue.dagman.log results.finalize.error results.work2.output job.workfinal.submit mydag.dag.rescue.dagman.out results.finalize.output results.workfinal.error mydag.dag mydag.dag.rescue.lib.out results.log results.workfinal.output mydag.dag.condor.sub mydag.dag.rescue.lock results.output watch_condor_q mydag.dag.dagman.log myjob.submit results.setup.error mydag.dag.dagman.out myscript.sh results.setup.output $ cat mydag.dag.rescue.dagman.out 7/14 17:34:18 ****************************************************** 7/14 17:34:18 ** condor_scheduniv_exec.29.0 (CONDOR_DAGMAN) STARTING UP 7/14 17:34:18 ** $CondorVersion: 6.6.5 May 3 2004 $ 7/14 17:34:18 ** $CondorPlatform: I386-LINUX-RH9 $ 7/14 17:34:18 ** PID = 8293 7/14 17:34:18 ****************************************************** 7/14 17:34:18 Using config file: /scratch/adesmet/vdttest/condor/etc/condor_config 7/14 17:34:18 Using local config files: /scratch/adesmet/vdttest/condor/local.puffin/condor_config.local 7/14 17:34:18 DaemonCore: Command Socket at <192.167.1.100:35624> 7/14 17:34:18 argv[0] == "condor_scheduniv_exec.29.0" 7/14 17:34:18 argv[1] == "-Debug" 7/14 17:34:18 argv[2] == "3" 7/14 17:34:18 argv[3] == "-Lockfile" 7/14 17:34:18 argv[4] == "mydag.dag.rescue.lock" 7/14 17:34:18 argv[5] == "-Dag" 7/14 17:34:18 argv[6] == "mydag.dag.rescue" 7/14 17:34:18 argv[7] == "-Rescue" 7/14 17:34:18 argv[8] == "mydag.dag.rescue.rescue" 7/14 17:34:18 argv[9] == "-Condorlog" 7/14 17:34:18 argv[10] == "results.log" 7/14 17:34:18 DAG Lockfile will be written to mydag.dag.rescue.lock 7/14 17:34:18 DAG Input file is mydag.dag.rescue 7/14 17:34:18 Rescue DAG will be written to mydag.dag.rescue.rescue 7/14 17:34:18 Condor log will be written to results.log, etc. 
7/14 17:34:18 Parsing mydag.dag.rescue ... 7/14 17:34:18 jobName: Setup 7/14 17:34:18 jobName: WorkerNode_1 7/14 17:34:18 jobName: WorkerNode_Two 7/14 17:34:18 jobName: CollectResults 7/14 17:34:18 jobName: LastNode 7/14 17:34:18 Dag contains 6 total jobs 7/14 17:34:18 Deleting any older versions of log files... 7/14 17:34:18 Deleting older version of results.log 7/14 17:34:18 Bootstrapping... 7/14 17:34:18 Number of pre-completed jobs: 3 7/14 17:34:18 Registering condor_event_timer... 7/14 17:34:20 Submitting Condor Job WorkerNode_Two ... 7/14 17:34:20 submitting: condor_submit -a 'dag_node_name = WorkerNode_Two' -a '+DAGManJobID = 29.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.work2.submit 2>&1 7/14 17:34:20 assigned Condor ID (30.0.0) 7/14 17:34:20 Just submitted 1 job this cycle... 7/14 17:34:20 Event: ULOG_SUBMIT for Condor Job WorkerNode_Two (30.0.0) 7/14 17:34:20 Of 6 nodes total: 7/14 17:34:20 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:34:20 === === === === === === === 7/14 17:34:20 3 0 1 0 0 2 0 7/14 17:35:20 Event: ULOG_GLOBUS_SUBMIT for Condor Job WorkerNode_Two (30.0.0) 7/14 17:35:35 Event: ULOG_EXECUTE for Condor Job WorkerNode_Two (30.0.0) 7/14 17:36:50 Event: ULOG_JOB_TERMINATED for Condor Job WorkerNode_Two (30.0.0) 7/14 17:36:50 Job WorkerNode_Two completed successfully. 7/14 17:36:50 Running POST script of Job WorkerNode_Two... 7/14 17:36:50 Of 6 nodes total: 7/14 17:36:50 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:36:50 === === === === === === === 7/14 17:36:50 3 0 0 1 0 2 0 7/14 17:36:55 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Job WorkerNode_Two (30.0.0) 7/14 17:36:55 POST Script of Job WorkerNode_Two completed successfully. 7/14 17:36:55 Of 6 nodes total: 7/14 17:36:55 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:36:55 === === === === === === === 7/14 17:36:55 4 0 0 0 1 1 0 7/14 17:37:01 Submitting Condor Job CollectResults ... 
7/14 17:37:01 submitting: condor_submit -a 'dag_node_name = CollectResults' -a '+DAGManJobID = 29.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.workfinal.submit 2>&1 7/14 17:37:01 assigned Condor ID (31.0.0) 7/14 17:37:01 Just submitted 1 job this cycle... 7/14 17:37:01 Event: ULOG_SUBMIT for Condor Job CollectResults (31.0.0) 7/14 17:37:01 Of 6 nodes total: 7/14 17:37:01 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:37:01 === === === === === === === 7/14 17:37:01 4 0 1 0 0 1 0 7/14 17:37:56 Event: ULOG_GLOBUS_SUBMIT for Condor Job CollectResults (31.0.0) 7/14 17:38:21 Event: ULOG_EXECUTE for Condor Job CollectResults (31.0.0) 7/14 17:38:31 Event: ULOG_JOB_TERMINATED for Condor Job CollectResults (31.0.0) 7/14 17:38:31 Job CollectResults completed successfully. 7/14 17:38:31 Running POST script of Job CollectResults... 7/14 17:38:31 Of 6 nodes total: 7/14 17:38:31 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:38:31 === === === === === === === 7/14 17:38:31 4 0 0 1 0 1 0 7/14 17:38:36 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Job CollectResults (31.0.0) 7/14 17:38:36 POST Script of Job CollectResults completed successfully. 7/14 17:38:36 Of 6 nodes total: 7/14 17:38:36 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:38:36 === === === === === === === 7/14 17:38:36 5 0 0 0 1 0 0 7/14 17:38:42 Submitting Condor Job LastNode ... 7/14 17:38:42 submitting: condor_submit -a 'dag_node_name = LastNode' -a '+DAGManJobID = 29.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' job.finalize.submit 2>&1 7/14 17:38:42 assigned Condor ID (32.0.0) 7/14 17:38:42 Just submitted 1 job this cycle... 
7/14 17:38:42 Event: ULOG_SUBMIT for Condor Job LastNode (32.0.0) 7/14 17:38:42 Of 6 nodes total: 7/14 17:38:42 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:38:42 === === === === === === === 7/14 17:38:42 5 0 1 0 0 0 0 7/14 17:39:42 Event: ULOG_GLOBUS_SUBMIT for Condor Job LastNode (32.0.0) 7/14 17:40:07 Event: ULOG_EXECUTE for Condor Job LastNode (32.0.0) 7/14 17:40:27 Event: ULOG_JOB_TERMINATED for Condor Job LastNode (32.0.0) 7/14 17:40:27 Job LastNode completed successfully. 7/14 17:40:27 Running POST script of Job LastNode... 7/14 17:40:27 Of 6 nodes total: 7/14 17:40:27 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:40:27 === === === === === === === 7/14 17:40:27 5 0 0 1 0 0 0 7/14 17:40:32 Event: ULOG_POST_SCRIPT_TERMINATED for Condor Job LastNode (32.0.0) 7/14 17:40:32 POST Script of Job LastNode completed successfully. 7/14 17:40:32 Of 6 nodes total: 7/14 17:40:32 Done Pre Queued Post Ready Un-Ready Failed 7/14 17:40:32 === === === === === === === 7/14 17:40:32 6 0 0 0 0 0 0 7/14 17:40:32 All jobs Completed! 7/14 17:40:32 **** condor_scheduniv_exec.29.0 (condor_DAGMAN) EXITING WITH STATUS 0 $ cat results.work2.output I'm process id 30478 on server1.gs.unina.it Thu Jul 10 11:34:46 CDT 2003 Running as binary /n/uscms_share/home/adesmet2/.globus/.gass_cache/local/md5/23/61b50cd9b278330cac68107dd390d6/md5/5e/004f7216b8b846d548357da00985f4/data WorkerNode2 60 My name (argument 1) is WorkerNode2 My sleep duration (argument 2) is 60 Sleep of 60 seconds finished. Exiting RESULT: 0 SUCCESS
We've fixed our artificial problem and completed our run.
Besides CPU jobs, DAGMan can also manage data placement (DaP) jobs via Stork. Stork is a new member of the Condor toolbox that specializes in data placement: scheduling data movement using a variety of available transfer protocols. First, start a local Stork server, per Appendix I: Starting a Stork Server.
The next example creates a simple but realistic DAG that retrieves a data set via a Stork job, then filters the data set via a Condor-G job. Like Condor-G, Stork requires a submit file. Create a simple DaP submit file that copies a remote data set to a local file:
$ cat >> stork-transfer-http-file.dap [ dap_type = transfer; src_url = "http://www.gs.unina.it/"; dest_url = "file:/tmp/stork_data.html"; ] Ctrl-D $ cat stork-transfer-http-file.dap [ dap_type = transfer; src_url = "http://www.gs.unina.it/"; dest_url = "file:/tmp/stork_data.html"; ]
Now create a Condor-G job to filter this data set. For this example, we will use a simple Perl script that converts all text in the input data file to lower case:
$ cat >> filter.submit # translate all characters to lowercase executable=/usr/bin/perl arguments=-p -e tr/A-Z/a-z/ input=/tmp/stork_data.html output=results.filter.output error=results.filter.error log=results.log notification=never universe=globus globusscheduler=server1.gs.unina.it/jobmanager-fork queue Ctrl-D $ cat filter.submit # translate all characters to lowercase executable=/usr/bin/perl arguments=-p -e tr/A-Z/a-z/ input=/tmp/stork_data.html output=results.filter.output error=results.filter.error log=results.log notification=never universe=globus globusscheduler=server1.gs.unina.it/jobmanager-fork queue
Now that we have a Stork DaP submit file and a Condor-G submit file, let's construct the DAG that submits both. The PARENT/CHILD statement instructs DAGMan to run the Stork data placement job before the Condor-G filter job:
$ cat >> stork_data_filter.dag DaP Data stork-transfer-http-file.dap Job Filter filter.submit PARENT Data CHILD Filter Ctrl-D $ cat stork_data_filter.dag DaP Data stork-transfer-http-file.dap Job Filter filter.submit PARENT Data CHILD Filter
Submit your new DAG and monitor it using the watch_condor_q script developed above. Note the new command-line options that specify the Stork server and Stork user log to DAGMan. These options will be moved to static configuration files in future releases of Condor. Again, in separate windows you may want to run "tail -f --lines=500 stork_data_filter.dag.dagman.out" to monitor the job's progress.
$ condor_submit_dag -storklog /tmp/Stork.userlog.user_log \ -storkserver `hostname` stork_data_filter.dag Checking your DAG input file and all submit files it references. This might take a while... Done. ----------------------------------------------------------------------- File for submitting this DAG to Condor : stork_data_filter.dag.condor.sub Log of DAGMan debugging messages : stork_data_filter.dag.dagman.out Log of Condor library debug messages : stork_data_filter.dag.lib.outLog of the life of condor_dagman itself : stork_data_filter.dag.dagman.log Condor Log file for all Condor jobs of this DAG: results.log Stork Log file for all DaP jobs of this DAG : /tmp/Stork.userlog.user_log Stork server to which DaP jobs will be submitted : diet Submitting job(s). Logging submit event(s). 1 job(s) submitted to cluster 86. ----------------------------------------------------------------------- $ ./watch_condor_q .... .... Output from condor_q .... -- Submitter: ws01.gs.unina.it : <198.51.254.11:51902> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 86.0 adesmet 7/18 20:55 0+00:00:06 R 0 2.3 condor_dagman -f - 1 jobs; 0 idle, 1 running, 0 held .... .... Output from condor_q -globus .... -- Submitter: ws01.gs.unina.it : <198.51.254.11:51902> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE .... .... Output from condor_q -dag .... -- Submitter: ws01.gs.unina.it : <198.51.254.11:51902> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 86.0 adesmet 7/18 20:55 0+00:00:06 R 0 2.3 condor_dagman -f - 1 jobs; 0 idle, 1 running, 0 held .... .... Output from condor_q .... -- Submitter: ws01.gs.unina.it : <198.51.254.11:51902> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 86.0 adesmet 7/18 20:55 0+00:00:16 R 0 2.3 condor_dagman -f - 87.0 adesmet 7/18 20:55 0+00:00:00 I 0 0.8 perl -p -e tr/A-Z/ 2 jobs; 1 idle, 1 running, 0 held .... .... Output from condor_q -globus .... 
-- Submitter: ws01.gs.unina.it : <198.51.254.11:51902> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 87.0 adesmet UNSUBMITTED fork server1.gs.unina.it /usr/bin/perl .... .... Output from condor_q -dag .... -- Submitter: ws01.gs.unina.it : <198.51.254.11:51902> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 86.0 adesmet 7/18 20:55 0+00:00:16 R 0 2.3 condor_dagman -f - 87.0 |-Filter 7/18 20:55 0+00:00:00 I 0 0.8 perl -p -e tr/A-Z/ 2 jobs; 1 idle, 1 running, 0 held .... .... Output from condor_q .... -- Submitter: ws01.gs.unina.it : <198.51.254.11:51902> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 86.0 adesmet 7/18 20:55 0+00:00:27 R 0 2.3 condor_dagman -f - 87.0 adesmet 7/18 20:55 0+00:00:00 C 0 0.8 perl -p -e tr/A-Z/ 1 jobs; 0 idle, 1 running, 0 held .... .... Output from condor_q -globus .... -- Submitter: ws01.gs.unina.it : <198.51.254.11:51902> : ws01.gs.unina.it ID OWNER STATUS MANAGER HOST EXECUTABLE 87.0 adesmet DONE fork server1.gs.unina.it /usr/bin/perl .... .... Output from condor_q -dag .... -- Submitter: ws01.gs.unina.it : <198.51.254.11:51902> : ws01.gs.unina.it ID OWNER/NODENAME SUBMITTED RUN_TIME ST PRI SIZE CMD 86.0 adesmet 7/18 20:55 0+00:00:27 R 0 2.3 condor_dagman -f - 87.0 |-Filter 7/18 20:55 0+00:00:00 C 0 0.8 perl -p -e tr/A-Z/ 1 jobs; 0 idle, 1 running, 0 held .... .... Output from condor_q .... -- Submitter: ws01.gs.unina.it : <198.51.254.11:51902> : ws01.gs.unina.it ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD 0 jobs; 0 idle, 0 running, 0 held .... .... Output from condor_q -globus ....
Examine the DAGMan log:
$ cat stork_data_filter.dag.dagman.out 7/18 20:55:43 ****************************************************** 7/18 20:55:43 ** condor_scheduniv_exec.86.0 (CONDOR_DAGMAN) STARTING UP 7/18 20:55:43 ** $CondorVersion: 6.6.5 May 3 2004 $ 7/18 20:55:43 ** $CondorPlatform: I386-LINUX-RH72 $ 7/18 20:55:43 ** PID = 17010 7/18 20:55:43 ****************************************************** 7/18 20:55:43 Using config file: /opt/vdt/condor/etc/condor_config 7/18 20:55:43 Using local config files: /opt/vdt/condor/local.diet/condor_config.local 7/18 20:55:43 DaemonCore: Command Socket at <198.51.254.11:57319> 7/18 20:55:43 argv[0] == "condor_scheduniv_exec.86.0" 7/18 20:55:43 argv[1] == "-Debug" 7/18 20:55:43 argv[2] == "3" 7/18 20:55:43 argv[3] == "-Lockfile" 7/18 20:55:43 argv[4] == "stork_data_filter.dag.lock" 7/18 20:55:43 argv[5] == "-Dag" 7/18 20:55:43 argv[6] == "stork_data_filter.dag" 7/18 20:55:43 argv[7] == "-Rescue" 7/18 20:55:43 argv[8] == "stork_data_filter.dag.rescue" 7/18 20:55:43 argv[9] == "-Condorlog" 7/18 20:55:43 argv[10] == "results.log" 7/18 20:55:43 argv[11] == "-Storklog" 7/18 20:55:43 argv[12] == "/tmp/Stork.userlog.user_log" 7/18 20:55:43 argv[13] == "-Storkserver" 7/18 20:55:43 argv[14] == "diet" 7/18 20:55:43 DAG Lockfile will be written to stork_data_filter.dag.lock 7/18 20:55:43 DAG Input file is stork_data_filter.dag 7/18 20:55:43 Rescue DAG will be written to stork_data_filter.dag.rescue 7/18 20:55:43 Condor log will be written to results.log, etc. 7/18 20:55:43 DaP log will be written to /tmp/Stork.userlog.user_log 7/18 20:55:43 Parsing stork_data_filter.dag ... 7/18 20:55:43 Dag contains 2 total jobs 7/18 20:55:43 Deleting any older versions of log files... 7/18 20:55:43 Deleting older version of /tmp/Stork.userlog.user_log 7/18 20:55:43 Bootstrapping... 7/18 20:55:43 Number of pre-completed jobs: 0 7/18 20:55:43 Registering condor_event_timer... 7/18 20:55:43 Registering dap_event_timer... 
7/18 20:55:44 Of 2 nodes total: 7/18 20:55:44 Done Pre Queued Post Ready Un-Ready Failed 7/18 20:55:44 === === === === === === === 7/18 20:55:44 0 0 0 0 1 1 0 7/18 20:55:44 Submitting DaP Job Data ... 7/18 20:55:44 assigned DaP ID (0) 7/18 20:55:44 Just submitted 1 job this cycle... 7/18 20:55:44 Of 2 nodes total: 7/18 20:55:44 Done Pre Queued Post Ready Un-Ready Failed 7/18 20:55:44 === === === === === === === 7/18 20:55:44 0 0 1 0 0 1 0 7/18 20:55:49 Event: ULOG_SUBMIT for DaP Job Data (7) 7/18 20:55:49 Event: ULOG_EXECUTE for DaP Job Data (7) 7/18 20:55:49 Event: ULOG_JOB_TERMINATED for DaP Job Data (7) 7/18 20:55:49 Job Data completed successfully. 7/18 20:55:49 Of 2 nodes total: 7/18 20:55:49 Done Pre Queued Post Ready Un-Ready Failed 7/18 20:55:49 === === === === === === === 7/18 20:55:49 1 0 0 0 1 0 0 7/18 20:55:55 Submitting Condor Job Filter ... 7/18 20:55:55 submitting: condor_submit -a 'dag_node_name = Filter' -a '+DAGManJobID = 86.0' -a 'submit_event_notes = DAG Node: $(dag_node_name)' filter.submit 2>&1 7/18 20:55:56 assigned Condor ID (87.0.0) 7/18 20:55:56 Just submitted 1 job this cycle... 7/18 20:55:56 Event: ULOG_SUBMIT for Condor Job Filter (87.0.0) 7/18 20:55:56 Of 2 nodes total: 7/18 20:55:56 Done Pre Queued Post Ready Un-Ready Failed 7/18 20:55:56 === === === === === === === 7/18 20:55:56 1 0 1 0 0 0 0 7/18 20:55:59 Of 2 nodes total: 7/18 20:55:59 Done Pre Queued Post Ready Un-Ready Failed 7/18 20:55:59 === === === === === === === 7/18 20:55:59 1 0 1 0 0 0 0 7/18 20:56:11 Event: ULOG_GLOBUS_SUBMIT for Condor Job Filter (87.0.0) 7/18 20:56:11 Event: ULOG_EXECUTE for Condor Job Filter (87.0.0) 7/18 20:56:16 Event: ULOG_JOB_TERMINATED for Condor Job Filter (87.0.0) 7/18 20:56:16 Job Filter completed successfully. 7/18 20:56:16 Of 2 nodes total: 7/18 20:56:16 Done Pre Queued Post Ready Un-Ready Failed 7/18 20:56:16 === === === === === === === 7/18 20:56:16 2 0 0 0 0 0 0 7/18 20:56:16 All jobs Completed! 
7/18 20:56:16 **** condor_scheduniv_exec.86.0 (condor_DAGMAN) EXITING WITH STATUS 0

Examine the input data set. Notice that the text is a mix of upper and lower case characters.
$ cat /tmp/stork_data.html
<html>
<head>
<title>The 2nd International Summer School on Grid Computing 2004</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="sschool.css" type="text/css">
<SCRIPT LANGUAGE="JavaScript"> var javascript_version = 1.0;</SCRIPT>
<SCRIPT LANGUAGE="JavaScript1.1"> javascript_version = 1.1;</SCRIPT>
<SCRIPT LANGUAGE="JavaScript">
...

Examine the filtered data set. Notice that the text is now all lower case characters, produced by the Condor-G Filter job.
$ cat results.filter.output
<html>
<head>
<title>the 2nd international summer school on grid computing 2004</title>
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="sschool.css" type="text/css">
<script language="javascript"> var javascript_version = 1.0;</script>
<script language="javascript1.1"> javascript_version = 1.1;</script>
<script language="javascript">
...

To clean up, kill your user level stork server (see Appendix II: Stopping a Stork Server, below) and remove any leftover output files:
$ rm -f /tmp/stork_data.html results.* *.error *.log *.sub *.out
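For reference, the lower-casing that the Filter job performed can be reproduced locally with standard tools. This is only a sketch, under the assumption that the filter is a plain case-folding pass; the tutorial's actual filter executable is not shown here:

```shell
#!/bin/sh
# Sketch of a case-folding filter like the one the Filter node runs.
# Reads standard input and writes a lower-cased copy to standard output.
# (Assumption: the real filter job does no more than case folding.)
tr '[:upper:]' '[:lower:]'
```

For example, `tr '[:upper:]' '[:lower:]' < /tmp/stork_data.html` prints a lower-cased copy of the input page.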
That's it. There is a lot more you can do with Condor-G and DAGMan, but this basic introduction is all you need to know to get started. Good luck!
End of Tutorial
Check whether a stork_server is already running:

$ ps aux | grep stork_server | grep -v grep
test      6946  0.0  0.9  5232 2116 ?        S    20:00   0:00 stork_server -p 34048 -Config /opt/stork/etc/stork.config -Serverlog /tmp/Stork.userlog

If a stork_server is already running, stop it via Appendix II, below. Otherwise, start a local, user level stork server. Note: many of the stork_server command line options will be moved to static configuration files in future releases of Stork.
$ stork_server -p 34048 -Config /opt/stork/etc/stork.config \
    -Serverlog /tmp/Stork.userlog

Verify the stork_server is running:
$ ps aux | grep stork_server | grep -v grep
test      6946  0.0  0.9  5232 2116 ?        S    20:00   0:00 stork_server -p 34048 -Config /opt/stork/etc/stork.config -Serverlog /tmp/Stork.userlog

Your user level stork_server is now running.
$ ps aux | grep stork_server | grep -v grep
test      6946  0.0  0.9  5232 2116 ?        S    20:00   0:00 stork_server -p 34048 -Config /opt/stork/etc/stork.config -Serverlog /tmp/Stork.userlog

The stork_server process id, 6946, appears in the second column of the output above. Kill this process and verify it is gone:
$ kill 6946
$ ps aux | grep stork_server | grep -v grep

The final ps command should produce no output; you now have no stork_server processes running. If a stork_server process is still present at this point, contact your local system administrator for assistance.
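The check-and-kill steps above can be collected into a small shell helper. A minimal sketch, assuming only the standard ps, grep, awk, and kill tools; the function name stop_by_name is invented here for illustration:

```shell
#!/bin/sh
# stop_by_name: kill the first process whose command line matches $1.
# Hypothetical helper that mirrors the manual ps/kill steps above.
stop_by_name() {
    pid=$(ps aux | grep "$1" | grep -v grep | awk '{print $2}' | head -n 1)
    if [ -z "$pid" ]; then
        echo "no $1 process found"
        return 0
    fi
    kill "$pid" && echo "killed $1 (pid $pid)"
}

stop_by_name stork_server
```

Running the script when no stork_server is up simply reports that nothing was found, so it is safe to rerun.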