Parallel Processing in EMAN

Important note: If you plan to run on a linux cluster, make SURE you download the cluster binary, or compile from source with the cluster option. If you do not, you will have numerous problems.

EMAN uses a very portable and very coarse-grained type of parallelism. It can function on virtually any multiple-processor supercomputer (shared or distributed memory), and can even run on clusters of connected workstations (with certain restrictions). Many of the EMAN command-line programs accept the 'proc=' argument. You can discover whether an individual command accepts this argument by typing its name followed by 'help', e.g. 'refine help'. Any command which takes this argument implements parallelism in the same way (with 2 specific experimental exceptions).

The form for specifying the number of processors to use is [proc=<num>]. That is, 'proc=5' will run the job on 5 processors. Never specify more processors than you have configured; doing so will cause multiple processes to run on a single processor and slow down the entire job.
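For illustration only, a run on 5 processors would look something like the following ('somecommand' and its other arguments are placeholders rather than a real EMAN command line; check the individual command's 'help' output for its actual arguments):

somecommand <other arguments> proc=5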

Configuration for clusters

On a multiple-processor computer (like an SGI Origin 2000), when specifying a fixed number of processors to use, no configuration is necessary. Simply specify the number of processors, and the job will run. However, if you wish to use a variable number of processors or run on workstation clusters, you must create a configuration file. This file may live in one of two places: '$HOME/.eman/mparm', or a file called '.mparm' located in the directory where the EMAN commands are to be run. The format of the file is the same in both cases, and the '.mparm' file in the local directory takes precedence over the global file in $HOME/.eman.
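As a quick sketch of setting up these files (the use of $EDITOR here is purely illustrative; create the files however you like):

mkdir -p $HOME/.eman
$EDITOR $HOME/.eman/mparm      # global configuration, used by default
$EDITOR .mparm                 # optional per-run file in the run directory; overrides the global file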

The format of this file is quite simple. Each line contains the specification for 1 computer/node. Here is an example for a cluster of 4 computers, 2 single processor and 2 dual processor:

ssh	1	1	node1	/
ssh	1	1	node2	/
ssh	2	1	node3	/
ssh	2	1	node4	/

This file would be placed in $HOME/.eman/mparm. The first column is either 'ssh' or 'rsh', depending on which program is configured for logging in from one machine to another. Note that for this to work, the individual machines MUST be configured to allow the same user to log in from one machine to another with NO password entry. Read the man pages for 'ssh' or 'rsh' to learn how to do this. You can test whether this is configured properly by setting up the configuration file and then typing 'runpar test'. This will run a simple command on each processor. If it fails, an error message should be reported. If you see a 'password:' prompt, then the machines are not configured properly.
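For example, a quick manual check (node names match the sample file above; exact output will vary):

ssh node1 hostname     # should print the remote hostname without asking for a password
runpar test            # runs a simple test command on each configured processor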

The second column is the number of processors on that particular node. The third column is always 1. The fourth column is the network name of the computer. The fifth column is not used in the '$HOME/.eman/mparm' file, but IS used if the file is placed in the run directory.
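To summarize the columns (the comment line below is annotation only; an actual mparm file contains just the five whitespace-separated columns):

# login-cmd   nprocs   (always 1)   hostname   run-directory (used only in a local '.mparm')
ssh           2        1            node3      /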

To use EMAN on clusters of computers, the directory where processing is being done MUST be cross-mounted on all of the computers being used. However, on some machines, the directories aren't always mounted in the same place. In most workstation setups, the 5th column will be the same on all lines, but if your local configuration has the directory mounted in different places on different machines, you can use this column to specify the path to the run directory on each node.
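For instance, a local '.mparm' might look like this when the run directory is mounted at different paths on different nodes (node names and paths here are purely hypothetical):

ssh	2	1	node1	/data/run1
ssh	2	1	node2	/mnt/shared/run1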

One other note: it is not possible to run a single job across mixed platforms. If you have 2 SGIs and 2 PCs, you cannot run a single job on all 4 machines.

Current distributions of linux have some known problems with their NFS implementations. Specifically, if two nodes append to the same file within a few seconds of each other, the file will be corrupted. Lacking a better solution, the rather extreme measure of writing a custom fileserver for EMAN was undertaken. When the cluster version of EMAN is run, all file write operations are passed through a single-threaded fileserver running within runpar on the host machine. Not only does this avoid the NFS bugs, but it seems to run much faster than NFS did. File reads are still performed through NFS, thus taking advantage of local file caching. This feature is disabled in the single-processor and shared-memory versions of the EMAN binaries.

Note: As of March 2002, Athlon 2000XP processors are the most cost effective. While the P4 has potential, with current linux compilers it will generally perform slower than a fast PIII, and the fastest P4s are quite expensive. The Athlon also has a much better floating point unit than the PIII/P4, and performs math-intensive work better. For comparison, an Athlon 1.2/133 benchmarks twice as fast as a PIII/800 running EMAN code.


EMAN 1.2 beta, last modified 5/16/02