Master Worker
Please Note
At this point, you should have finished the Condor exercises and the Search for Knowledge exercises. If everything went smoothly, you have extra time, and you feel comfortable with C++ and Linux, you can continue on to learn about MW. Because C++ is not a prerequisite for this class, we do not expect you to go through this, but we provide it as an option for those of you who are interested.

Getting Ready
Master Worker (MW for short) is an add-on to Condor: it is not provided with Condor but is a separate download. It has no hidden knowledge of Condor; it is built on top of Condor using public interfaces.

The first thing to do is to download MW into your home directory:

% cd ~
% wget http://www.cs.wisc.edu/condor/mw/mw-0.2.2.tgz

We apologize in advance that the documentation for MW is rather light, but you can read what there is online.

Compiling MW
In your home directory, first extract the MW source code and rename the extracted directory to mw-src, then make a directory to install MW into:

% tar xzf mw-0.2.2.tgz
% mv mw mw-src
% mkdir ~/mw
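The unpacking steps above (extract, rename to mw-src, create the install directory) can be collected into one small shell function. This is only a sketch: `unpack_mw` is a hypothetical helper name, not something shipped with MW or Condor, and it assumes the archive unpacks into a directory called `mw`, as the commands above imply.

```shell
# Sketch only: unpack_mw is a hypothetical helper, not part of MW or Condor.
# Usage: unpack_mw TARBALL [BASE_DIR]
# Extracts TARBALL into BASE_DIR (default: $HOME), renames the extracted
# "mw" directory to "mw-src", and creates an empty "mw" install directory.
unpack_mw() {
    tarball=$1
    base=${2:-$HOME}

    [ -f "$tarball" ] || { echo "no such tarball: $tarball" >&2; return 1; }

    # Refuse to touch an existing source tree or install directory.
    [ ! -e "$base/mw-src" ] || { echo "$base/mw-src already exists" >&2; return 1; }
    [ ! -e "$base/mw" ] || { echo "$base/mw already exists" >&2; return 1; }

    # Extract into the base directory; assumes the archive unpacks to "mw".
    tar xzf "$tarball" -C "$base" || return 1
    [ -d "$base/mw" ] || { echo "archive did not create $base/mw" >&2; return 1; }

    mv "$base/mw" "$base/mw-src" || return 1   # source tree
    mkdir "$base/mw"                           # empty install prefix
}
```

With the function defined, `unpack_mw ~/mw-0.2.2.tgz` would leave the sources in ~/mw-src and an empty ~/mw. It stops rather than overwriting if either directory already exists.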
Now configure and build MW. Make sure that CONDOR_CONFIG is properly set and that the Condor binaries are in your path:

% which condor_version
/opt/condor-6.7.19/bin/condor_version
% echo $CONDOR_CONFIG
/opt/condor-6.7.19/etc/condor_config
% cd mw-src
% ./configure --with-condor=/opt/condor-6.7.19 \
              --prefix ~/mw \
              --without-pvm
checking for g++... g++
checking for C++ compiler default output... a.out
checking whether the C++ compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
[output trimmed...]
% make
[ "__src examples" = "__" ] || for subdir in `echo "src examples"`; do (cd $subdir && make all) ; done
make[1]: Entering directory `/home/users/roy/mw-src/src'
/usr/bin/g++ -DPACKAGE_NAME=\"\" -DPACKAGE_TARNAME=\"\" -DPACKAGE_VERSION=\"\" -DPACKAGE_STRING=\"\" -DPACKAGE_BUGREPORT=\"\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_FCNTL_H=1 -DHAVE_LIMITS_H=1 -DHAVE_SYS_TIME_H=1 -DHAVE_UNISTD_H=1 -DTIME_WITH_SYS_TIME=1 -DHAVE_VPRINTF=1 -DHAVE_GETCWD=1 -DHAVE_GETHOSTNAME=1 -DHAVE_GETTIMEOFDAY=1 -DHAVE_MKDIR=1 -DHAVE_STRSTR=1 -DHAVE_DYNAMIC_CAST= -DCONDOR_DIR=\"/opt/condor-6.7.19\" -DUSE_POLL=1 -I. -I. -IRMComm -IMW-File -IMW-CondorPVM -IMW-Socket -IMWControlTasks -g -O2 -Wall -c MW.C
[output trimmed...]
% make install
[ "__src examples" = "__" ] || for subdir in `echo "src examples"`; do (cd $subdir && make install) ; done
make[1]: Entering directory `/home/users/roy/mw-src/src'
/bin/sh ../mkinstalldirs /home/users/roy/mw/lib
mkdir /home/users/roy/mw/lib
[output trimmed...]

Assuming you don't see any errors, you're set to go!

The examples
MW provides several examples. They are all in the mw-src/examples directory.

% cd examples
% ls
blackbox/  fib/  knapsack/  Makefile  Makefile.in  matmul/  newmatmul/  newskel/  n-queens/  SC05/  skel/
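Looking back at the configure step: the value given to --with-condor is the directory that contains the bin/ directory holding condor_version, and CONDOR_CONFIG should name a readable file. The sketch below checks both. `condor_root` and `condor_config_ok` are hypothetical helper names, not Condor commands, and the bin/ layout is assumed from the paths shown in the transcript.

```shell
# Sketch only: both functions are hypothetical helpers, not Condor commands.

# Print the installation root for a Condor binary path such as
# /opt/condor-6.7.19/bin/condor_version (assumes a <root>/bin/<program> layout).
condor_root() {
    bin_dir=$(dirname "$1")
    if [ "$(basename "$bin_dir")" != "bin" ]; then
        echo "expected a path ending in bin/<program>: $1" >&2
        return 1
    fi
    dirname "$bin_dir"
}

# Succeed only if CONDOR_CONFIG is set and names a readable file.
condor_config_ok() {
    [ -n "${CONDOR_CONFIG:-}" ] && [ -r "$CONDOR_CONFIG" ]
}
```

One possible use is `./configure --with-condor="$(condor_root "$(which condor_version)")" --prefix ~/mw --without-pvm`, run only after `condor_config_ok` succeeds.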
Trying an example in independent mode
MW can run applications within Condor, but it can also run them
without Condor, just on your computer. This will only create a single
worker, which will execute the tasks serially. This can be easier to try out
and easier to debug. Let's run matmul in independent mode. The matrices are randomly generated, with parameters taken from the in_master file:

% cd matmul
% cat in_master
1 workermatmul_indp TRUE 150 200 150
% ./mastermatmul_indp < in_master
0:21:17.443 MWDriver is pid 22333.
0:21:17.444 Starting from the beginning.
0:21:17.444 argc=1, argv[0]=./mastermatmul_indp
0:21:17.444 Adding executable workermatmul_indp for TRUE
0:21:17.444 ERROR: add_executable with workermatmul_indp, which doesn't exist or isn't readable
0:21:17.444 MWRMComm::set_num_arch_classes to 1
0:21:17.444 MWRMComm::set_arch_class_attributes for arch class 0 to TRUE
0:21:17.446 Good to go.
0:21:17.446 num_TODO = 100, num_run = 0, num_done = 0
0:21:17.637 Got a HOSTADD message.
0:21:17.637 About to call setup
0:21:17.637 Worker 2 started.
0:21:17.637 Worker started on machine ws-01.gs.unina.it.
[output trimmed...]
0:21:26.419 Killing workers:
0:21:26.419 MWList::Can't remove any element from empty list.
0:21:26.419 MWList::Can't remove any element from empty list.
0:21:26.419 MWList::Can't remove any element from empty list.
0:21:26.420 MWList::Can't remove any element from empty list.

Ignore the error messages at the end (Can't remove any element...). Congratulations! You've successfully run your first MW job, albeit a simple one.

Trying an example as a Condor job
The submit file for the matmul example is submit_socket. Take a look at it:

# Now we're in the scheduler universe
universe = Scheduler

# The name of our executable
Executable = mastermatmul_socket

# Assume a max image size of 16 Megabytes.
Image_Size = 4 Meg
+MemoryRequirements = 4

# This goes into stdin for the master.
Input = in_master.socket

# Set the output of this job to go to out_master
Output = out_master.socket

# Set the stderr of this job to go to out_worker. It is named
# out_worker because the output of the workers is directed to stderr
Error = out_worker.socket

# Keep a log in case of problems.
Log = work.log

Queue

Edit this file to add one extra line, placed before the Queue line:

Getenv = True

Notice that this is a scheduler universe job. We haven't talked about those very much. It's a job that runs on the submit computer as soon as you submit it. You get all the benefits of Condor (reliability, logging, etc.) with a job that executes locally. We use it for DAGMan and MW: it is a job that submits other jobs and watches over them. In this case, it will be the master, which will spawn the workers (as jobs) and send them their tasks.

Now submit the job and watch it run:

% rm -f checkpoint
% condor_submit submit_socket
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 45.

The MW master ran and immediately submitted the worker jobs.
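The condor_q listings that follow show the workers moving from idle (I) to running (R). Instead of reading each row, the ST column can be tallied with awk. This is a minimal sketch: `tally_job_states` is a hypothetical helper name, and it assumes the column layout shown in this section, where job rows start with a CLUSTER.PROC id and ST is the sixth whitespace-separated field. Other Condor versions may lay out the columns differently.

```shell
# Sketch only: tally_job_states is a hypothetical helper, not a Condor command.
# Reads condor_q output on stdin and prints one "STATE COUNT" line per state.
# Assumes job rows start with CLUSTER.PROC and that ST is the sixth field
# (ID, OWNER, date, time, RUN_TIME, ST, ...), as in the listings in this section.
tally_job_states() {
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ { count[$6]++ }
         END { for (s in count) print s, count[s] }' | sort
}
```

It would be used as `condor_q | tally_job_states`. Header and summary lines are skipped because their first field is not a CLUSTER.PROC id.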
% condor_q

-- Submitter: ws-01.gs.unina.it : <192.167.1.21:32783> : ws-01.gs.unina.it
 ID      OWNER   SUBMITTED    RUN_TIME    ST PRI SIZE CMD
 45.0    roy     6/3 00:31    0+00:00:02  R  0   9.8  mastermatmul_socke
 46.0    roy     6/3 00:31    0+00:00:00  I  0   9.8  mw_exec0.$$(Opsys)
 46.1    roy     6/3 00:31    0+00:00:00  I  0   9.8  mw_exec0.$$(Opsys)
 46.2    roy     6/3 00:31    0+00:00:00  I  0   9.8  mw_exec0.$$(Opsys)
 46.3    roy     6/3 00:31    0+00:00:00  I  0   9.8  mw_exec0.$$(Opsys)
 46.4    roy     6/3 00:31    0+00:00:00  I  0   9.8  mw_exec0.$$(Opsys)
 46.5    roy     6/3 00:31    0+00:00:00  I  0   9.8  mw_exec0.$$(Opsys)

Most of them found computers to run on:

% condor_q

-- Submitter: ws-01.gs.unina.it : <192.167.1.21:32783> : ws-01.gs.unina.it
 ID      OWNER   SUBMITTED    RUN_TIME    ST PRI SIZE CMD
 45.0    roy     6/3 00:31    0+00:03:07  R  0   9.8  mastermatmul_socke
 46.0    roy     6/3 00:31    0+00:00:04  R  0   9.8  mw_exec0.$$(Opsys)
 46.1    roy     6/3 00:31    0+00:00:02  R  0   9.8  mw_exec0.$$(Opsys)
 46.2    roy     6/3 00:31    0+00:00:00  R  0   9.8  mw_exec0.$$(Opsys)
 46.3    roy     6/3 00:31    0+00:00:00  R  0   9.8  mw_exec0.$$(Opsys)
 46.4    roy     6/3 00:31    0+00:00:00  I  0   9.8  mw_exec0.$$(Opsys)
 46.5    roy     6/3 00:31    0+00:00:00  R  0   9.8  mw_exec0.$$(Opsys)

Eventually they finished up:

% condor_q

-- Submitter: ws-01.gs.unina.it : <192.167.1.21:32783> : ws-01.gs.unina.it
 ID      OWNER   SUBMITTED    RUN_TIME    ST PRI SIZE CMD

0 jobs; 0 idle, 0 running, 0 held

We saw the master submit six workers. Five of them started to run, and they did all of the work. Look at out_master.socket to see the result of the run:

% cat out_master.socket
0:31:46.108 MWDriver is pid 22424.
0:31:46.109 Socket bound to port: 8997
0:31:46.118 Starting from the beginning.
0:31:46.118 argc=1, argv[0]=condor_scheduniv_exec.45.0
0:31:46.118 Adding executable workermatmul_socket for ((OPSYS=="LINUX")&&(ARCH=="INTEL"))
0:31:46.119 MWRMComm::set_num_arch_classes to 1
0:31:46.119 MWRMComm::set_arch_class_attributes for arch class 0 to ((OPSYS=="LINUX")&&(ARCH=="INTEL"))
0:31:46.119 MWSocketRC::process_executable_name workermatmul_socket 0 0
0:31:46.119 Making a link from workermatmul_socket to mw_exec0.LINUX.INTEL.exe
0:31:46.127 In MWSocketRC::init_beginning_workers()
[output trimmed...]

It's your turn
Now that you've tried out the basics, we'll let you explore by yourself. Here are some ideas: