4.0 Submitting a standard universe job

4.1 Preliminary: Switching Condor

Before you continue, let's switch to the real Condor installation, which will let you use the entire Condor pool we have set up here. First, let's shut down your personal Condor pool by telling the Condor master to quit. It will shut down the rest of your personal Condor pool.

    % ps -x
      PID TTY      STAT   TIME COMMAND
    15652 ?        S      0:00 sshd: roy@pts/0
    15654 pts/0    S      0:00 -bash
    15824 ?        S      0:00 condor_master
    15825 ?        S      0:00 condor_collector -f
    15826 ?        S      0:00 condor_negotiator -f
    15827 ?        S      0:00 condor_schedd -f
    15828 ?        S      0:05 condor_startd -f
    15951 pts/0    R      0:00 ps -x

    % condor_off -master
    Sent "Kill-Daemon" command for "master" to local master

    % ps -x
      PID TTY      STAT   TIME COMMAND
    15652 ?        S      0:00 sshd: roy@pts/0
    15654 pts/0    S      0:00 -bash
    15953 pts/0    R      0:00 ps -x

Now switch to the other Condor:

    % export CONDOR_CONFIG=/opt/condor-6.7.19/etc/condor_config
    % export PATH=/opt/condor-6.7.19/bin:${PATH}
    % export PATH=/opt/condor-6.7.19/sbin:${PATH}
    % which condor_q
    /opt/condor-6.7.19/bin/condor_q

Make sure that Condor is running on this computer:

    > ps -auwx | grep condor
    condor    2595  0.0  0.2  7036 2952 ?        S    May29   0:57 /opt/condor-6.7.19/sbin/condor_master
    condor    2603  0.0  0.3  8100 3564 ?        S    May29   2:19 condor_startd -f
    condor    2608  0.0  0.3  8456 3612 ?        S    May29   0:01 condor_schedd -f
    roy      16010  0.0  0.0  3688  668 pts/0    S    22:45   0:00 grep condor

What is different about this Condor setup from your personal Condor? Why is it different?
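Now we are ready to move forward!

4.2 What is the standard universe?

Your first job was considered a vanilla universe job. This meant that it was a plain old job. Condor also supports standard universe jobs. If you have the source code for your program and if it meets certain requirements, you can re-link your program and Condor will provide two major features for you: checkpointing, so that a job that is interrupted can later resume from where it left off instead of starting over, and remote system calls, so that the job's system calls (such as file I/O) are carried out on the computer you submitted from.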
Your tutorial leader will give you more details about standard universe, or you can read about them online.
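To make the idea of checkpointing concrete, here is a minimal sketch of what a program would have to do by hand to survive an interruption. It is only an illustration under stated assumptions: the function names (load_progress, save_progress, run_steps) and the checkpoint file are made up for this example and are not part of Condor. In the standard universe, Condor saves the state of the whole process for you, so you do not write code like this.

```c
#include <stdio.h>

/* Hypothetical helper: read the last saved step from `path`.
 * Returns 0 when there is no usable checkpoint file. */
long load_progress(const char *path)
{
    long step = 0;
    FILE *f = fopen(path, "r");

    if (f != NULL) {
        if (fscanf(f, "%ld", &step) != 1) {
            step = 0;
        }
        fclose(f);
    }
    return step;
}

/* Hypothetical helper: write the current step to `path`.
 * Returns 0 on success and -1 on failure. */
int save_progress(const char *path, long step)
{
    FILE *f = fopen(path, "w");

    if (f == NULL) {
        return -1;
    }
    fprintf(f, "%ld\n", step);
    return (fclose(f) == 0) ? 0 : -1;
}

/* Resume from the saved step and work towards `total`, but do at most
 * `max_steps_this_run` steps, which simulates being interrupted.
 * Returns the number of steps executed in this call. */
long run_steps(const char *path, long total, long max_steps_this_run)
{
    long step = load_progress(path);
    long executed = 0;

    while (step < total && executed < max_steps_this_run) {
        step++;                     /* stand-in for one unit of real work */
        executed++;
        save_progress(path, step);  /* manual "checkpoint" after each step */
    }
    return executed;
}
```

If a first call stops after 4 of 10 steps, a second call with the same file only has to do the remaining 6. That is the behaviour checkpointing buys you, without the bookkeeping.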
4.3 Linking a program for standard universe

First, you need a job to run. We'll use the same job as before. In case you don't have it, here it is. Save it in simple.c:

    #include <stdio.h>
    #include <stdlib.h>   /* atoi */
    #include <unistd.h>   /* sleep */

    int main(int argc, char **argv)
    {
        int sleep_time;
        int input;
        int failure;

        if (argc != 3) {
            printf("Usage: simple <sleep-time> <integer>\n");
            failure = 1;
        } else {
            sleep_time = atoi(argv[1]);
            input      = atoi(argv[2]);

            printf("Thinking really hard for %d seconds...\n", sleep_time);
            sleep(sleep_time);
            printf("We calculated: %d\n", input * 2);
            failure = 0;
        }
        return failure;
    }

Now compile the program using condor_compile. This doesn't change how the program is compiled, just how it is linked. Take note that the executable is named differently. On these computers, condor_compile seems to be particularly slow. I think it's because your home directory is on NFS: I haven't seen it be this slow before.

    % condor_compile gcc -o simple.std simple.c
    condor_compile gcc -o simple.std simple.c
    LINKING FOR CONDOR : /usr/bin/ld -L/opt/condor-6.7.19/lib -Bstatic --eh-frame-hdr -m elf_i386 -dynamic-linker /lib/ld-linux.so.2 -o simple.std /opt/condor-6.7.19/lib/condor_rt0.o /usr/lib/gcc-lib/i386-redhat-linux/3.2.3/../../../crti.o /usr/lib/gcc-lib/i386-redhat-linux/3.2.3/crtbeginT.o -L/opt/condor-6.7.19/lib -L/usr/lib/gcc-lib/i386-redhat-linux/3.2.3 -L/usr/lib/gcc-lib/i386-redhat-linux/3.2.3/../../..
    /tmp/ccq64oUU.o /opt/condor-6.7.19/lib/libcondorzsyscall.a /opt/condor-6.7.19/lib/libcondor_z.a /opt/condor-6.7.19/lib/libcomp_libstdc++.a /opt/condor-6.7.19/lib/libcomp_libgcc.a /opt/condor-6.7.19/lib/libcomp_libgcc_eh.a /opt/condor-6.7.19/lib/libcomp_libgcc_eh.a -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c -lcondor_nss_files -lcondor_nss_dns -lcondor_resolv -lcondor_c /opt/condor-6.7.19/lib/libcomp_libgcc.a /opt/condor-6.7.19/lib/libcomp_libgcc_eh.a /opt/condor-6.7.19/lib/libcomp_libgcc_eh.a /usr/lib/gcc-lib/i386-redhat-linux/3.2.3/crtend.o /usr/lib/gcc-lib/i386-redhat-linux/3.2.3/../../../crtn.o
    /opt/condor-6.7.19/lib/libcondorzsyscall.a(condor_file_agent.o)(.text+0x250): In function `CondorFileAgent::open(char const*, int, int)':
    : the use of `tmpnam' is dangerous, better use `mkstemp'

    % ls -lh simple.std
    -rwxr-xr-x    1 roy      users        1.8M Jun  2 22:49 simple.std*

There are a lot of warnings there; you can safely ignore them. You can also see just how many libraries we link the program against. It's a lot! And yes, the executable is much bigger now. Partly that's the price of having checkpointing and partly it is because the program is now statically linked, but you can make it slightly smaller if you want by getting rid of debugging symbols:

    % strip simple.std
    % ls -lh simple.std
    -rwxr-xr-x    1 roy      users        1.4M Jun  2  2006 simple.std*

Note the extra output when you run the program by hand now:

    % ./simple.std 4 10
    Condor: Notice: Will checkpoint to ./simple.std.ckpt
    Condor: Notice: Remote system calls disabled.
    Thinking really hard for 4 seconds...
    We calculated: 20

4.4 Submitting a standard universe program

Submitting a standard universe job is almost the same as submitting a vanilla universe job. Just change the universe to standard. I suggest making the job run for a longer time, so we can experiment with the checkpointing while it runs. Also, get rid of the multiple queue commands that we had.
Here is the complete submit file; I suggest naming it submit.std.

    Universe   = standard
    Executable = simple.std
    Arguments  = 120 10
    Log        = simple.log
    Output     = simple.out
    Error      = simple.error
    Queue

Then submit it as you did before, with condor_submit:

    % rm simple.log
    % condor_submit submit.std
    Submitting job(s).
    Logging submit event(s).
    1 job(s) submitted to cluster 24.

    % condor_q

    -- Submitter: ws-01.gs.unina.it : <192.167.1.21:32783> : ws-01.gs.unina.it
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
      24.0   roy             6/2  22:55   0+00:00:00 I  0   9.8  simple.std 120 10

    1 jobs; 1 idle, 0 running, 0 held

    % condor_q

    -- Submitter: ws-01.gs.unina.it : <192.167.1.21:32783> : ws-01.gs.unina.it
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
      24.0   roy             6/2  22:55   0+00:00:00 R  0   9.8  simple.std 120 10

    1 jobs; 0 idle, 1 running, 0 held

Two minutes pass...

    % condor_q

    -- Submitter: ws-01.gs.unina.it : <192.167.1.21:32783> : ws-01.gs.unina.it
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD

    0 jobs; 0 idle, 0 running, 0 held

    % cat simple.log
    000 (024.000.000) 06/02 22:55:47 Job submitted from host: <192.167.1.21:32783>
    ...
    001 (024.000.000) 06/02 22:55:52 Job executing on host: <192.167.1.21:32782>
    ...
    005 (024.000.000) 06/02 22:57:52 Job terminated.
            (1) Normal termination (return value 0)
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
            1035  -  Run Bytes Sent By Job
            1518455  -  Run Bytes Received By Job
            1035  -  Total Bytes Sent By Job
            1518455  -  Total Bytes Received By Job
Tip
You can use condor_q -run to see where your job is running. It
will only show running jobs though, not idle jobs.
Notice that the log file has a bit more information this time: because the job is in the standard universe, we can see how much data was transferred to and from it. The remote usage was not very interesting because the job just slept, but a real job would have some interesting numbers there.
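If you would like to see non-zero usage numbers, the job has to burn some CPU time instead of sleeping. Here is a small sketch of a function that does that. It is my own illustration, not part of the tutorial's simple.c: count_primes is a made-up name, and trial division is chosen only because it is easy to read and slow enough to be noticed.

```c
/* Hypothetical CPU-bound stand-in for sleep(): count the primes below
 * `limit` by trial division. Bigger limits mean more user CPU time. */
unsigned long count_primes(unsigned long limit)
{
    unsigned long count = 0;
    unsigned long n;
    unsigned long d;
    int is_prime;

    for (n = 2; n < limit; n++) {
        is_prime = 1;
        for (d = 2; d * d <= n; d++) {
            if (n % d == 0) {
                is_prime = 0;   /* found a divisor, so n is composite */
                break;
            }
        }
        count += (unsigned long)is_prime;
    }
    return count;
}
```

To try it, call count_primes() from main() in place of sleep() and print the result, then re-link with condor_compile. How large the limit needs to be depends on the computer, so experiment.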
4.5 Advanced tricks in the standard universe

At this point in the tutorial, I will demonstrate how you can force your job to be checkpointed and what it will look like. For the demonstration we will use a command called condor_checkpoint that you normally never need to use. One reason that it normally isn't used is that it checkpoints all jobs running on a computer, not just the job you want to checkpoint. Be warned.

Begin by submitting your job, and figuring out where it is running:

    % rm simple.log
    % condor_submit submit.std
    Submitting job(s).
    Logging submit event(s).
    1 job(s) submitted to cluster 25.

    % condor_q

    -- Submitter: ws-01.gs.unina.it : <192.167.1.21:32783> : ws-01.gs.unina.it
     ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD
      25.0   roy             6/2  23:01   0+00:00:00 R  0   9.8  simple.std 120 10

    1 jobs; 0 idle, 1 running, 0 held

    % condor_q -run

    -- Submitter: ws-01.gs.unina.it : <192.167.1.21:32783> : ws-01.gs.unina.it
     ID      OWNER            SUBMITTED     RUN_TIME HOST(S)
      25.0   roy             6/2  23:01   0+00:00:06 vm1@ws-01.gs.unina.it

Now let's tell Condor to checkpoint and see what happens. Update the host name to be appropriate for your job!

    % condor_checkpoint ws-01.gs.unina.it

    % cat simple.log
    000 (025.000.000) 06/02 23:01:19 Job submitted from host: <192.167.1.21:32783>
    ...
    001 (025.000.000) 06/02 23:01:23 Job executing on host: <192.167.1.21:32782>
    ...
    006 (025.000.000) 06/02 23:01:41 Image size of job updated: 10733
    ...
    003 (025.000.000) 06/02 23:01:41 Job was checkpointed.
            Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
    ...
    005 (025.000.000) 06/02 23:01:41 Job terminated.
            (1) Normal termination (return value 0)
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
                    Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
            752314  -  Run Bytes Sent By Job
            1518806  -  Run Bytes Received By Job
            752314  -  Total Bytes Sent By Job
            1518806  -  Total Bytes Received By Job
    ...

Voila! We checkpointed our job correctly.
Advanced note
You might notice that the job finished right after it was
checkpointed. Why? The job was checkpointed while executing sleep(),
then essentially restarted from the checkpoint (though Condor doesn't
consider this to be a restart since the job didn't leave the
computer). Condor didn't keep track of how much time had elapsed in
the sleep call, so the job finished right away. Don't worry--Condor
handles other system calls just fine. It's not clear how to handle
checkpointing sleep()--if your job is interrupted during the sleep
and restarted sometime later, how much time should Condor force the
job to sleep for? Do we rely on wall clock time? Run time?
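If your own program really needs to wait, one way to sidestep the question is to decide on the answer yourself. The sketch below takes the wall clock approach: it remembers a deadline and naps in one-second pieces until the clock says the deadline has passed. The names sleep_until and sleep_wall_clock are made up for this example. I am assuming that the deadline variable is saved and restored with the rest of the program's memory at a checkpoint and that a checkpoint cuts short at most the nap in progress; I have not verified either assumption on this pool.

```c
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: return once the wall clock has reached `deadline`.
 * If the deadline is already in the past, it returns immediately. */
void sleep_until(time_t deadline)
{
    while (time(NULL) < deadline) {
        sleep(1);   /* short naps: an interrupted nap costs at most ~1 second */
    }
}

/* Hypothetical helper: wait `seconds` of wall clock time.
 * Returns how many wall clock seconds actually passed. */
long sleep_wall_clock(unsigned int seconds)
{
    time_t start = time(NULL);

    sleep_until(start + (time_t)seconds);
    return (long)(time(NULL) - start);
}
```

With this approach, a job that is restarted long after its deadline does not wait at all. If you would rather count run time, you would instead count completed naps. Neither answer is right for every program, which is the point of the note above.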
Normally, you never need to use condor_checkpoint: we used it only as a demonstration. Condor will checkpoint your jobs periodically (the default is every three hours) or when your job is forced to leave a computer to give time to another user.
Extra Credit
You can customize the behavior of the standard universe quite a
bit. For instance, you can force some files to be accessed locally
instead of via remote I/O. You can change the buffering of remote I/O
to get better performance. You can disable checkpointing. You can kill
a job that has been restarted from its checkpoint more than three
times. How do you do these things? Hint: look at the
condor_submit manual page.