Title: Condor Practical
Subtitle: Basics of Condor
Tutor: Alain Roy
Authors: Alain Roy

Master Worker

Please Note

At this point, you should have finished the Condor exercises and the Search for Knowledge exercises. If everything went smoothly for you, if you have extra time, and if you feel comfortable with C++ and Linux, you can continue on to learn about MW. Because C++ is not a prerequisite for this class, we do not expect you to go through this, but we want to provide it as an option for those of you that are interested.

Getting Ready

Master Worker (MW for short) is an addition to Condor: it is not provided with Condor but is an extra download. It has no hidden knowledge of Condor, but is built on top of Condor using public interfaces. The first thing to do is to download MW into your home directory:

% cd ~

% wget http://www.cs.wisc.edu/condor/mw/mw-0.2.2.tgz

We apologize in advance that the documentation for MW is rather light. But you can read what there is online.


Compiling MW

In your home directory, first extract the MW source code and rename the resulting directory to mw-src, then make a directory to install MW into:

% tar xzf mw-0.2.2.tgz

% mv mw mw-src

% mkdir ~/mw

Now configure and build MW. Make sure that CONDOR_CONFIG is set properly and that you can find the Condor binaries in /opt/condor-6.7.19. We're going to build MW without PVM and use the socket implementation instead. From a high-level perspective, it doesn't make a difference which you use. The socket implementation is slightly less capable, but it is much easier to use and debug if there are problems. The entire configure/make process should take just a couple of minutes. Make sure you edit the prefix that you give to configure so that it points to your own install directory. Note that because your home directory is on NFS, it may build slowly.

% which condor_version
/opt/condor-6.7.19/bin/condor_version

% echo $CONDOR_CONFIG
/opt/condor-6.7.19/etc/condor_config
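
The two checks above can also be wrapped in a small shell function and run before configure. This is only a sketch: check_condor_env is a hypothetical helper name, not something shipped with Condor or MW, and it checks only that condor_version is on your PATH and that CONDOR_CONFIG names a readable file.

```shell
# Pre-flight check before running ./configure for MW.
# check_condor_env is a hypothetical helper, not part of Condor or MW.
check_condor_env() {
    rc=0
    # Is condor_version reachable through PATH?
    if command -v condor_version >/dev/null 2>&1; then
        echo "condor_version found at: $(command -v condor_version)"
    else
        echo "condor_version is not on your PATH" >&2
        rc=1
    fi
    # Is CONDOR_CONFIG set, and does it name a readable file?
    if [ -z "${CONDOR_CONFIG:-}" ]; then
        echo "CONDOR_CONFIG is not set" >&2
        rc=1
    elif [ ! -r "$CONDOR_CONFIG" ]; then
        echo "CONDOR_CONFIG names an unreadable file: $CONDOR_CONFIG" >&2
        rc=1
    else
        echo "CONDOR_CONFIG is $CONDOR_CONFIG"
    fi
    return $rc
}
```

If you define the function in your shell, running check_condor_env prints what it found and returns a non-zero status when either check fails, so you can chain it in front of the configure command with &&.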

% cd mw-src

% ./configure --with-condor=/opt/condor-6.7.19 \
              --prefix ~/mw                    \
              --without-pvm
checking for g++... g++
checking for C++ compiler default output... a.out
checking whether the C++ compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables... 
checking for suffix of object files... o
[output trimmed...]

% make
[ "__src examples" = "__" ] || for subdir in `echo "src examples"`; do (cd $subdir && make all) ; done
make[1]: Entering directory `/home/users/roy/mw-src/src'
/usr/bin/g++ -DPACKAGE_NAME=\"\" -DPACKAGE_TARNAME=\"\"
-DPACKAGE_VERSION=\"\" -DPACKAGE_STRING=\"\" -DPACKAGE_BUGREPORT=\"\"
-DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1
-DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1
-DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1
-DHAVE_UNISTD_H=1 -DHAVE_FCNTL_H=1 -DHAVE_LIMITS_H=1
-DHAVE_SYS_TIME_H=1 -DHAVE_UNISTD_H=1 -DTIME_WITH_SYS_TIME=1
-DHAVE_VPRINTF=1 -DHAVE_GETCWD=1 -DHAVE_GETHOSTNAME=1
-DHAVE_GETTIMEOFDAY=1 -DHAVE_MKDIR=1 -DHAVE_STRSTR=1
-DHAVE_DYNAMIC_CAST= -DCONDOR_DIR=\"/opt/condor-6.7.19\" -DUSE_POLL=1
-I. -I. -IRMComm -IMW-File -IMW-CondorPVM -IMW-Socket -IMWControlTasks
-g -O2 -Wall -c MW.C
[output trimmed...]

% make install
[ "__src examples" = "__" ] || for subdir in `echo "src examples"`; do (cd $subdir && make install) ; done
make[1]: Entering directory `/home/users/roy/mw-src/src'
/bin/sh ../mkinstalldirs /home/users/roy/mw/lib
mkdir /home/users/roy/mw/lib
[output trimmed...]

Assuming you don't see any errors, you're set to go!


The examples

MW provides several examples. They are all in the mw-src/examples directory.

% cd examples

% ls 
blackbox/  fib/  knapsack/  Makefile  Makefile.in  matmul/  newmatmul/  newskel/  n-queens/  SC05/  skel/
  • blackbox: run any program as an MW worker
  • fib: calculate Fibonacci numbers
  • knapsack: incomplete example. Solve the knapsack problem with branch and bound.
  • matmul: multiply two matrices
  • newmatmul: Ignore this bad example of matrix multiplication.
  • newskel: An empty shell for an MW application
  • n-queens: Find a chessboard with N queens on it so that no two queens attack each other.
  • skel: An empty shell for an MW application


Trying an example in independent mode

MW can run applications within Condor, but it can also run them without Condor, just on your computer. This will only create a single worker, which will execute the tasks serially. This can be easier to try out and easier to debug. Let's run matmul in independent mode. The matrices are randomly generated, with parameters taken from the file in_master.

% cd matmul

% cat in_master
1
workermatmul_indp TRUE
150
200
150

% ./mastermatmul_indp < in_master
0:21:17.443 MWDriver is pid 22333.
0:21:17.444 Starting from the beginning.
0:21:17.444 argc=1, argv[0]=./mastermatmul_indp
0:21:17.444 Adding executable workermatmul_indp for TRUE
0:21:17.444 ERROR: add_executable with workermatmul_indp, which doesn't exist or isn't readable
0:21:17.444 MWRMComm::set_num_arch_classes to 1
0:21:17.444 MWRMComm::set_arch_class_attributes for arch class 0 to TRUE
0:21:17.446 Good to go.
0:21:17.446 num_TODO = 100, num_run = 0, num_done = 0
0:21:17.637 Got a HOSTADD message.
0:21:17.637 About to call setup
0:21:17.637 Worker 2 started.
0:21:17.637 Worker started on machine ws-01.gs.unina.it.

[output trimmed...]

0:21:26.419 Killing workers:
0:21:26.419 MWList::Can't remove any element from empty list.
0:21:26.419 MWList::Can't remove any element from empty list.
0:21:26.419 MWList::Can't remove any element from empty list.
0:21:26.420 MWList::Can't remove any element from empty list.
Ignore those error messages at the end (Can't remove any element...).

Congratulations! You've successfully run your first MW job, albeit a simple one.


Trying an example as a Condor job

The submit file for the matmul example is submit_socket. Theoretically you could use submit_pvm, but we don't have PVM installed. You could also use submit_file, which uses Condor's standard universe, but there is no particular advantage for our short-running job.

Look at submit_socket:

# Now we're in the scheduler universe

universe = Scheduler

# The name of our executable

Executable     = mastermatmul_socket

# Assume a max image size of 16 Megabytes.

Image_Size     = 4 Meg
+MemoryRequirements = 4

# This goes into stdin for the master.

Input   = in_master.socket

# Set the output of this job to go to out_master

Output  = out_master.socket

# Set the stderr of this job to go to out_worker.  It is named 
# out_worker because the output of the workers is directed to stderr

Error   = out_worker.socket

# Keep a log in case of problems.

Log = work.log

Queue

Edit this file to add one extra line, placed before the Queue line so that it applies to the job:

Getenv = True
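
If you would rather not edit the file by hand, the same change can be scripted. The following is a minimal sketch, assuming the submit file has a line whose first word is Queue; add_getenv is a hypothetical helper name, and it does nothing if the file already sets Getenv.

```shell
# Insert "Getenv = True" before the first Queue line of a submit file.
# add_getenv is a hypothetical helper, not part of Condor or MW.
add_getenv() {
    file=$1
    # Leave the file alone if it already has a Getenv setting.
    if grep -qi '^[[:space:]]*getenv[[:space:]]*=' "$file"; then
        return 0
    fi
    # Print the new line just before the first line starting with "Queue".
    awk 'tolower($1) == "queue" && !done {
             print "Getenv = True"
             print ""
             done = 1
         }
         { print }' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}
```

For example, running add_getenv submit_socket in the matmul directory should leave the file with the new line just above Queue. Running it a second time changes nothing.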

Notice that this is a scheduler universe job. We haven't talked about those very much. It's a job that runs on the submit computer as soon as you submit it. You get all the benefits of Condor (reliability, logging, etc.) with a job that executes locally. We use it for DAGMan and MW: it is a job that submits other jobs and watches over them. In this case, it will be the master, which will spawn the workers (as jobs) and send them their tasks.

Now submit the job and watch it run:

% rm -f checkpoint

% condor_submit submit_socket
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 45.

The MW master ran and immediately submitted the worker jobs.
% condor_q
-- Submitter: ws-01.gs.unina.it : <192.167.1.21:32783> : ws-01.gs.unina.it
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  45.0   roy             6/3  00:31   0+00:00:02 R  0   9.8  mastermatmul_socke
  46.0   roy             6/3  00:31   0+00:00:00 I  0   9.8  mw_exec0.$$(Opsys)
  46.1   roy             6/3  00:31   0+00:00:00 I  0   9.8  mw_exec0.$$(Opsys)
  46.2   roy             6/3  00:31   0+00:00:00 I  0   9.8  mw_exec0.$$(Opsys)
  46.3   roy             6/3  00:31   0+00:00:00 I  0   9.8  mw_exec0.$$(Opsys)
  46.4   roy             6/3  00:31   0+00:00:00 I  0   9.8  mw_exec0.$$(Opsys)
  46.5   roy             6/3  00:31   0+00:00:00 I  0   9.8  mw_exec0.$$(Opsys)

Most of them found computers to run on:
% condor_q

-- Submitter: ws-01.gs.unina.it : <192.167.1.21:32783> : ws-01.gs.unina.it
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  45.0   roy             6/3  00:31   0+00:03:07 R  0   9.8  mastermatmul_socke
  46.0   roy             6/3  00:31   0+00:00:04 R  0   9.8  mw_exec0.$$(Opsys)
  46.1   roy             6/3  00:31   0+00:00:02 R  0   9.8  mw_exec0.$$(Opsys)
  46.2   roy             6/3  00:31   0+00:00:00 R  0   9.8  mw_exec0.$$(Opsys)
  46.3   roy             6/3  00:31   0+00:00:00 R  0   9.8  mw_exec0.$$(Opsys)
  46.4   roy             6/3  00:31   0+00:00:00 I  0   9.8  mw_exec0.$$(Opsys)
  46.5   roy             6/3  00:31   0+00:00:00 R  0   9.8  mw_exec0.$$(Opsys)
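
To tally how many jobs are running or idle in a listing like the one above, you can pipe condor_q through a short awk filter. This is a sketch that assumes the column layout shown here, where the job state (ST) is the sixth whitespace-separated field; other Condor versions may print different columns. count_states is a hypothetical helper name.

```shell
# Count running (R) and idle (I) jobs in condor_q output read from stdin.
# count_states is a hypothetical helper; it assumes ST is the sixth field.
count_states() {
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ { n[$6]++ }
         END { printf "running=%d idle=%d\n", n["R"], n["I"] }'
}
```

You would use it as condor_q | count_states. Only lines that begin with a job ID such as 46.0 are counted, so the header and summary lines are skipped. Remember that the master itself shows up as one of the running jobs.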

Eventually they finished up:
% condor_q

-- Submitter: ws-01.gs.unina.it : <192.167.1.21:32783> : ws-01.gs.unina.it
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               

0 jobs; 0 idle, 0 running, 0 held

We saw the master submit six workers. Five of them started to run, and they did all of the work. Look at out_master.socket to see the result of the run:
% cat out_master.socket 
0:31:46.108 MWDriver is pid 22424.
0:31:46.109 Socket bound to port: 8997
0:31:46.118 Starting from the beginning.
0:31:46.118 argc=1, argv[0]=condor_scheduniv_exec.45.0
0:31:46.118 Adding executable workermatmul_socket for ((OPSYS=="LINUX")&&(ARCH=="INTEL"))
0:31:46.119 MWRMComm::set_num_arch_classes to 1
0:31:46.119 MWRMComm::set_arch_class_attributes for arch class 0 to ((OPSYS=="LINUX")&&(ARCH=="INTEL"))
0:31:46.119 MWSocketRC::process_executable_name workermatmul_socket 0 0
0:31:46.119 Making a link from workermatmul_socket to mw_exec0.LINUX.INTEL.exe
0:31:46.127 In MWSocketRC::init_beginning_workers()

[output trimmed...]


It's your turn

Now that you've tried out the basics, we'll let you explore by yourself. Here are some ideas:

  • Make a task take a longer time (perhaps a sleep() to artificially inflate the time). Does MW use multiple workers properly?
  • Write an MW program (or modify the matmul example) to have a task that does nothing.
    • How long does it take to run 10,000 tasks?
    • Estimate how long it would take to run 10,000 Condor jobs that do /bin/sleep 0.
    Note: If you want to adjust the number of workers, you can do that in MWDriver (the master) with:
    RMC->set_target_num_workers( target_num_workers );
    
  • Solve your favorite problem using MW. (Integrating practical next week?)
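
For the Condor side of the comparison in the list above, one option is to generate a submit file that queues many trivial jobs, run a small batch, and scale the elapsed time up to 10,000. The sketch below only writes the submit file. The function name make_sleep_submit and the log file name sleep_test.log are hypothetical choices, and the vanilla universe is an assumption about what suits a plain /bin/sleep job in your pool.

```shell
# Write a submit file that queues N jobs, each running /bin/sleep 0.
# make_sleep_submit and sleep_test.log are hypothetical names.
make_sleep_submit() {
    njobs=$1
    outfile=$2
    cat > "$outfile" <<EOF
# $njobs trivial jobs that each run /bin/sleep 0
Universe   = vanilla
Executable = /bin/sleep
Arguments  = 0
Log        = sleep_test.log
Queue $njobs
EOF
}
```

For example, make_sleep_submit 100 sleep_test.submit followed by condor_submit sleep_test.submit queues 100 jobs. Comparing the first and last timestamps in sleep_test.log gives a rough time per job, which you can multiply out to estimate 10,000 jobs without loading the pool with all of them.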
