Parallel Processing in EMAN2
EMAN2 uses a modular strategy for running commands in parallel. That is, you can choose different ways to run EMAN2 programs in parallel, depending on your environment. We now support 3 distinct methods for parallelism, and each has its own page of documentation. Please follow the appropriate link:
- Threaded - This is for use on a single computer with multiple processors (cores). For example, the Core2Duo processors of a few years ago had 2 cores. In 2010, individual computers often have single or dual processors with 2, 4 or 6 cores each, for a total of up to 12 cores. EMAN2 can make very efficient use of all of these cores, but this mode will ONLY work if you want to run on a single computer.
- MPI - This is the standard parallelism method used on virtually all large clusters nowadays. It will require a small amount of custom installation for your specific cluster, even if you are using a binary distribution of EMAN2. Follow this link for more details.
- Distributed - This was the original parallelism method developed for EMAN2. It can be used on anything from sets of workstations to multiple clusters, and it can dynamically change how many processors it is using during a single run, allowing you, for example, to make use of idle cycles at night on lab workstations but reduce the load during the day for normal use. It is very flexible, but it requires a bit of effort and a knowledgeable user to configure and use.
Programs with parallelism support will take the --parallel command line option as follows:
--parallel=<type>:<option>=<value>:<option>=<value>:...
for example, for the distributed parallelism model: --parallel=dc:localhost:9990
for the local multicore threaded model: --parallel=thread:4 (where 4 is the number of cores to use)
Note that not all programs will run in parallel. If a program does not accept the --parallel option, then it is not parallelized.
Local Machine (multiple cores)
Now working (As of 7/15/2010)
Most modern computers have 2, 4 or even 6 compute 'cores' on a single machine. These cores can perform computations simultaneously and independently. It is very easy to use: put 'thread:<ncpu>' in the 'Parallel' box in e2workflow, or specify the '--parallel=thread:<ncpu>' option on the command line. <ncpu> should, of course, be replaced with the number of cores you wish to use.
Note: This option only allows you to use multiple cores/processors on a single computer. If you want to use multiple computers at the same time, this will not work, see other options below.
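For example, a minimal sketch of a threaded run (the other e2refine.py options are elided here; nproc is a standard Linux coreutils command, not part of EMAN2, that prints the number of available cores):

  # use 4 cores explicitly
  e2refine.py [your usual options] --parallel=thread:4

  # or, on recent Linux systems, let the shell fill in the core count
  e2refine.py [your usual options] --parallel=thread:$(nproc)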
MPI
Sorry, we haven't had a chance to finish this yet. For the moment you will have to use the Distributed Computing mode on clusters, which may or may not be possible depending on your cluster's network configuration. Direct MPI support is planned by fall 2010.
Distributed Computing
Quickstart
For those not wanting to read or understand the parallelism method, here are the basic required steps:
1. On the machine with the data, make a scratch directory on a local hard drive, cd to it, and run e2parallel.py dcserver --port=9990 --verbose=2
2. Make another scratch directory on a local hard drive, cd to it, and run e2parallel.py dcclient --host=<server hostname>
3. Repeat step 2 for each core or machine you want to run tasks on
4. Run your parallel job, e.g. 'e2refine.py' with the --parallel=dc:localhost:9990 option
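As a concrete illustration of the steps above (the scratch directory names are arbitrary placeholders; pick any local, non-network-mounted path):

  # step 1: on the machine with the data
  mkdir -p /scratch/dcserver ; cd /scratch/dcserver
  e2parallel.py dcserver --port=9990 --verbose=2

  # steps 2 and 3: on each machine, once per core you want to use
  mkdir -p /scratch/dcclient ; cd /scratch/dcclient
  e2parallel.py dcclient --host=<server hostname>

  # step 4: from your project directory on the machine running the server (hence localhost)
  e2refine.py [your usual options] --parallel=dc:localhost:9990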
Notes
- If you need to restart the server for some reason, that's fine. Stopping it with ^c and restarting it should be harmless, as long as it is restarted within about 5 minutes
- Make sure the same version of EMAN2 is installed on all machines, if multiple machines are being used as clients
- If you need to stop the 'e2refine' program, you can run 'e2parallel.py killall' to cancel any pending jobs on the server after stopping e2refine.
- You can add or remove clients at any time during a run
- When you are done running jobs, exit the server (^c), then run 'e2parallel.py dckillclients' from the server directory, and let it run for a minute or two. This will tell the clients to shut down. If you plan to do another run relatively soon, you can just leave the server and clients running.
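A sketch of that shutdown sequence, run from the directory where dcserver was started:

  # first stop the server itself with ^c (or kill <pid>, never kill -9), then:
  e2parallel.py dckillclients    # let this run for a minute or two, then stop it with ^c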
You should really consider reading the detailed instructions below :^)
Introduction
This is the sort of parallelism made famous by projects like SETI@home and Folding@home. The general idea is that you have a list of small jobs to do and a bunch of computers with spare cycles willing to help out with the computation. The number of computers willing to do computations may vary with time, and a computer may agree to do a computation but then fail to complete it. This is a very flexible parallelism model, which can be adapted to individual computers with multiple cores as well as to Linux clusters or sets of workstations lying around the lab.
There are 3 components to this system:
User Application (customer) <==> Server <==> Compute Nodes (client)
The user application (e2refine.py for example) builds a list of computational tasks that it needs to have completed, then sends the list to the server. Compute nodes with nothing to do then contact the server and request tasks to compute. The server sends the tasks out to the clients. When the client finishes the requested computation, results are sent back to the server. The user application then requests the results from the server and completes processing. As long as the number of tasks to complete is larger than the number of clients servicing requests, this is an extremely efficient infrastructure.
Internally things are somewhat more complicated and tackle issues such as data caching on the clients, how to handle clients that die in the middle of processing, etc., but the basic concept is quite straightforward.
With any of the e2parallel.py commands below, you may consider adding the --verbose=1 (or 2) option to see more of what it's doing.
How to use Distributed Computing in EMAN2
To use distributed computing, there are three basic steps:
- Run a server on a machine that the clients can communicate with
- Run some number of clients pointing at the server
- Run an EMAN2 program with the --parallel=dc:<host>:<port> option
What follows are specific instructions for doing this under 2 different scenarios.
Using DC on a linux cluster
This can be a bit tricky, and there are several possible approaches, depending on the configuration of your cluster:
- If the individual compute nodes can communicate directly (through the head node) with your workstation, you may consider running the server and the e2refine.py command directly on your workstation and launching only clients on the cluster. The clients will communicate data among themselves using the high-performance internal network on the cluster, so this approach doesn't require much more network bandwidth than copying the data to the cluster and copying the results back when you're done, and it has the convenience that all data and results remain on your computer, where you can monitor them.
- If the individual compute nodes cannot communicate outside the cluster, then you will need to use e2scp.py to copy your project data to the disk on the cluster. If you are permitted to run small single-CPU commands directly on the storage/head node (attached to the physical storage), then running the server and e2refine command on that node is the best option.
- If that isn't allowed on your cluster either, then things become a bit more difficult. You will need to launch the server, e2refine and the clients all from the queuing system script. Given the diversity of cluster configurations, it is difficult to give specific details on this process, but the general commands and the example submission script below should give you something to start with.
General method of using DC computing:
- The server is run with the e2parallel.py dcserver --port=9990 command.
- The clients are run with the e2parallel.py dcclient --port=9990 --server=<server hostname> command.
- The actual refinement is run with the e2refine.py --parallel=dc:<server hostname>:9990 command.
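For the third scenario above (everything launched from the queuing system), a rough sketch of a submission script might look like the following. This assumes a PBS/Torque-style scheduler with 4 nodes of 8 cores each and passwordless ssh between nodes; every directive, path and core count here is a placeholder that will need to be adapted to your cluster, and the notes below about network-mounted project directories still apply:

  #!/bin/bash
  #PBS -l nodes=4:ppn=8
  cd $PBS_O_WORKDIR                        # the project directory

  # start the server from a local scratch directory on this node
  mkdir -p /scratch/$USER/dcserver
  cd /scratch/$USER/dcserver
  e2parallel.py dcserver --port=9990 --verbose=2 &
  SERVER=$(hostname)
  cd $PBS_O_WORKDIR

  # start one client per core on every allocated node, each from a local scratch directory
  for node in $(sort -u $PBS_NODEFILE); do
      ssh $node "mkdir -p /scratch/$USER/dcclient && cd /scratch/$USER/dcclient && \
        for i in \$(seq 8); do nohup e2parallel.py dcclient --server=$SERVER --port=9990 >/dev/null 2>&1 & done"
  done

  # run the refinement itself (other options elided)
  e2refine.py [your usual options] --parallel=dc:$SERVER:9990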
Notes:
- The server MUST be run from a directory on a hard drive physically attached to the computer (not a network mounted drive). This directory should not require large amounts of disk space. This need not be the same drive that stores the data.
- The clients MUST similarly be run from a directory on a physically attached drive. If you are running multiple clients on a single cluster node with multiple cores, all of the clients should be run from the SAME directory so they can share a data cache. This directory may get quite large, as it will be used to cache data during processing to reduce network load.
- If you need to stop the server, do so nicely with '^c' or 'kill <pid>'. Do NOT 'kill -9 <pid>'. You may stop and restart the server without disturbing the running refinement job, so long as it isn't down for more than 5-10 minutes.
- Clients should also be killed 'nicely'. Clients may be started or stopped at any time without disturbing the refinement run.
- If you decide to kill the refinement in the middle, you may also wish to run the 'e2parallel.py killall' command from the server directory to remove any incomplete tasks from the server.
- If you are forced to run the server on a compute-node with the data stored on a network mounted drive, then additional precautions MUST be taken:
- When you finish the job, nicely kill the server, then immediately run 'e2bdb.py -c' on the same node. After this, it will be safe to access the files from the head-node again.
- While the job is running, you must not access any of the project files from the head-node, or database corruption may result. On a shared filesystem, only one node may have read/write access to the databases at one time. This means that if you need to check the progress of the running job, you must be very careful not to do anything that causes data to be written to the project. A safer alternative, which may be possible on your cluster, is to log in to the node running the server and check the files from there. See the warning about the database for more info on this topic.
- You may wish to consider copying the data from the shared filesystem onto a local scratch drive on the same node running the server, then copying the results back to the shared filesystem after running 'e2bdb.py -c' at the end of the job. This will nicely avoid database corruption issues...
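A minimal sketch of that copy-to-local-scratch approach (all paths are hypothetical, plain cp is used purely for illustration, and the dcserver/dcclient/e2refine steps are as described above):

  # copy the project from the shared filesystem to local scratch on the server node
  cp -r /shared/myproject /scratch/$USER/myproject
  cd /scratch/$USER/myproject

  # ... start dcserver, start the clients, and run e2refine.py as usual ...

  # when finished: stop the server nicely (^c), flush the database cache,
  # then copy the results back to the shared filesystem
  e2bdb.py -c
  cp -r /scratch/$USER/myproject /shared/myproject.done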
Using DC on a set of workstations
- The server should run on a computer with a direct physical connection to the storage
- All of the clients must be able to make a network connection to the server machine
- Run a server on the desired machine with e2parallel.py dcserver, launched from an empty directory on the local hard drive
- The server will print a message saying what port it's running on. This will usually be 9990. If it is something else, make a note of it.
- Run one client for each core you want to use for processing on each computer: e2parallel.py dcclient --server=<server> --port=9990 (replace the server hostname and port with the correct values)
- Run your EMAN2 programs with the option --parallel=dc:<server>:9990 (again, use the right port number and server hostname)
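As an illustration, a small shell loop like the following could start the clients on a handful of lab workstations (the hostnames, core count and scratch path are all placeholders, and passwordless ssh is assumed):

  for host in ws01 ws02 ws03; do                 # your workstation hostnames
      ssh $host "mkdir -p /scratch/dcclient && cd /scratch/dcclient && \
        for i in 1 2 3 4; do nohup e2parallel.py dcclient --server=myserver.mylab.edu --port=9990 >/dev/null 2>&1 & done"
  done
  # 4 clients per machine = 4 cores per machine in this sketch;
  # replace myserver.mylab.edu with your server's hostname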
For all of the above, once you have finished running your jobs, kill the server, then run 'e2parallel.py dckillclients' from the same directory. When it stops spewing out 'client killed' messages, you can stop the dckillclients process as well.
IF THIS IS NOT WORKING FOR YOU, PLEASE FOLLOW THESE DEBUGGING INSTRUCTIONS