EMAN2/Parallel/Distributed

THIS METHOD IS NO LONGER WORKING IN EMAN2.1 - we are considering retiring it. It was quite powerful and useful many years ago, but rarely makes sense in the current linux cluster ecosystem. If you are actively using it, we'd very much like to discuss the issue, and understand what motivates this in your setup.

Distributed Computing

This is by far the most flexible parallelism mechanism in EMAN2. It can permit you to:

use multiple lab workstations as a sort of 'mini cluster'
use a cluster or multiple clusters from your desktop machine without copying data back and forth
vary the number of processors that you are using (up or down) without restarting your job
...

However, it is also the most difficult method to use, requires a bit of effort to set up, and may not work in all environments. If you are running on a cluster at a shared supercomputing center, you will probably want to use MPI instead. If you are trying to use multiple workstations in your lab as a sort of ad-hoc cluster, this method works well (it can be used on clusters too, if you have issues with MPI).

Quickstart

For those not wanting to read or understand the parallelism method, here are the basic required steps:

on the machine with the data, make a scratch directory on a local hard drive, cd to it, and run e2parallel.py dcserver --port=9990 --verbose=2
make another scratch directory on a local hard drive, cd to it, and run e2parallel.py dcclient --host=<server hostname>
repeat #2 for each core or machine you want to run tasks on
run your parallel job, like 'e2refine.py' with the --parallel=dc:localhost:9990

Notes

If you need to restart the server for some reason, that's fine. As long as it is restarted within about 5 minutes, it should be harmless to stop it with <ctrl>c and restart it
Make sure the same version of EMAN2 on all machines, if multiple machines are being used as clients
If you need to stop the 'e2refine' program, you should run 'e2parallel.py killall' to cancel any pending jobs on the server after stopping e2refine.
You can add or remove clients at any time during a run
When you are done running jobs, exit the server (^c), then run 'e2parallel.py dckillclients' from the server directory, and let it run for a minute or two. This will tell the clients to shut down. If you plan to do another run relatively soon, you can just leave the server and clients running.

You should really consider reading the detailed instructions below :^)

Introduction

This is the sort of parallelism made famous by projects like SETI-at-home and Folding-at-Home. The general idea is that you have a list of small jobs to do, and a bunch of computers with spare cycles willing to help out with the computation. The number of computers willing to do computations may vary with time, and possibly may agree to do a computation, but then fail to complete it. This is a very flexible parallelism model, which can be adapted to both individual computers with multiple cores as well as linux clusters or sets of workstations laying around the lab.

There are 3 components to this system:

User Application (customer) <==> Server <==> Compute Nodes (client)

The user application (e2refine.py for example) builds a list of computational tasks that it needs to have completed, then sends the list to the server. Compute nodes with nothing to do then contact the server and request tasks to compute. The server sends the tasks out to the clients. When the client finishes the requested computation, results are sent back to the server. The user application then requests the results from the server and completes processing. As long as the number of tasks to complete is larger than the number of clients servicing requests, this is an extremely efficient infrastructure.

Internally things are somewhat more complicated and tackle issues such as data caching on the clients, how to handle clients that die in the middle of processing, etc., but the basic concept is quite straightforward.

With any of the e2parallel.py commands below, you may consider adding the --verbose=1 (or 2) option to see more of what it's doing.

How to use Distributed Computing in EMAN2

To use distributed computing, there are three basic steps:

Run a server on a machine that the clients can communicate with
Run some number of clients pointing at the server
run an EMAN2 program with the --parallel=dc:host:port option

What follows are specific instructions for doing this under 2 different scenarios.

Using DC on a linux cluster

This can be a bit tricky, as there are several possible configurations, depending on the configuration of your cluster:

If the individual compute nodes can communicate directly (through the head node) to your workstation, you may consider running the server and the e2refine.py command directly on your workstation, and launch only clients on the cluster. The clients will communicate data among themselves using the high-performance internal network on the cluster, so this approach doesn't require much more network bandwidth than copying the data to the cluster, and copying the results back when you're done, and has the convenience that all data and results remain on your computer where you can monitor them.
If the individual compute nodes cannot communicate outside the cluster, then you will need to use e2scp.py to copy your project data to the disk on the cluster. If you are permitted to run small single-CPU commands directly on the storage/head node (attached to the physical storage), then running the server and e2refine command on that node is the best option.
If that isn't allowed on your cluster either, then things become a bit more difficult. You will need to launch the server, e2refine and the clients all from the queuing system script. Given the diversity of different cluster configurations, it is difficult to give specific details on this process, but the general comments below should give you something to start with.

General method of using DC computing:

The server is run with the e2parallel.py dcserver --port=9990 command.
The clients are run with the e2parallel.py dcclient --port=9990 --server=<server hostname> command.
The actual refinement is run with the 'e2refine.py --parallel=dc:<server hostname>:9990' command.

Notes:

The server MUST be run from a directory on a hard drive physically attached to the computer (not a network mounted drive). This directory should not require large amounts of disk space. This need not be the same drive that stores the data.
The clients MUST similarly be run from a directory on a physically attached drive. If you are running multiple clients on a single cluster node with multiple cores, all of the clients should be run from the SAME directory so they can share a data cache. This directory may get quite large, as it will be used to cache data during processing to reduce network load.
If you need to stop the server, do so nicely with '^c' or 'kill <pid>'. Do NOT 'kill -9 <pid>'. You may stop and restart the server without disturbing the running refinement job, so long as it isn't down for more than 5-10 minutes.
Clients should also be killed 'nicely'. Clients may be started or stopped at any time without disturbing the refinement run.
If you decide to kill the refinement in the middle, you may also wish to run the 'e2parallel.py killall' command from the server directory to remove any incomplete tasks from the server.
If you are forced to run the server on a compute-node with the data stored on a network mounted drive, then additional precautions MUST be taken:
- When you finish the job, nicely kill the server, then immediately run 'e2bdb.py -c' on the same node. After this, it will be safe to access the files from the head-node again.
- While the job is running, you must not access any of the project files from the head-node, or database corruption may result. On a shared filesystem, only one node may have read/write access to databases at one time. This means if you need to check the progress of the running job, you must be very careful not to do anything that causes data to be written to the project. A safer alternative which may be possible on your cluster is to log in to the node running the server, and check the files from there. see the warning about the database for more info on this topic.
You may wish to consider copying the data from the shared filesystem onto a local scratch drive on the same node running the server, then copying the results back to the shared filesystem after running 'e2bdb.py -c' at the end of the job. This will nicely avoid database corruption issues...

Using DC on a set of workstations

The server should run on a computer with a direct physical connection to the storage
All of the clients must be able to make a network connection to the server machine
Run a server on the desired machine e2parallel.py dcserver in an empty directory on the local hard drive
The server will print a message saying what port it's running on. This will usually be 9990. If it is something else, make a note of it.
Run one client for each core you want to use for processing on each computer : e2parallel.py dcclient --server=<server> --port=9990 (replace the server hostname and port with the correct values)
Run your EMAN2 programs with the option --parallel=dc:<server>:9990 (again, use the right port number and server hostname)

For all of the above, once you have finished running your jobs, kill the server, then run 'e2parallel.py dckillclients' from the same directory. When it stops spewing out 'client killed' messages, you can kill this server.

IF THIS IS NOT WORKING FOR YOU, PLEASE FOLLOW THESE DEBUGGING INSTRUCTIONS