Differences between revisions 20 and 64 (spanning 44 versions)
Revision 20 as of 2008-11-12 04:23:23
Size: 11674
Editor: SteveLudtke
Comment:
Revision 64 as of 2013-08-27 04:07:07
Size: 10956
Editor: SteveLudtke
Comment:
Line 1: Line 1:
== Dealing with image data on disk in EMAN2 ==
EMAN2 supports a variety of mechanisms for dealing with your data on disk. Virtually all cryo-EM file formats are supported as well as some good generic formats. In addition, EMAN2 has a local embedded database storage scheme used heavily during processing. This mechanism is faster than typical direct file access, and permits easy logging of tasks and book-keeping. Finally, we support communications with the EMEN2 OODB, permitting things like directly reading image data from a centralized database for processing.
== Image Data files in EMAN2 ==

Virtually all cryo-EM file formats are supported as well as some good generic formats. The default format used in EMAN2 processing is HDF5, which supports stacks of 2-D and 3-D images as well as arbitrary header information for each image in the file. If you convert an image to a format like MRC, you will lose any metadata not compatible with that format.

 * '''Any''' program in EMAN2 should directly read '''any''' supported file format without conversion. (Specific programs may require header information not available in all formats)
 * '''Most''' programs can write images to '''any''' output file format, determined by the [[EMAN2ImageFormats|filename you use]]. However, we strongly suggest using HDF unless you are transferring data to other software, as any other format will lose header information.
 * ''e2proc2d.py'' and ''e2proc3d.py'' can be used to explicitly convert files among specified file formats with specific data types, and are also used for general-purpose image processing.
 * ''e2display.py'' and ''e2projectmanager.py'' (via the file browser) can be used with the ''Save as'' button to convert to arbitrary output formats using a graphical interface.
 * '''MRC/CCP4''' have a number of special issues; please see the appropriate section on the [[EMAN2ImageFormats|image formats]] page.
Line 5: Line 12:
EMAN2 supports the following file formats: This is a quick summary of the most used file formats.

'''''For a complete list of supported formats and capabilities see the [[EMAN2ImageFormats|image formats]] page.'''''
Line 12: Line 22:
||Gatan DM4 ||R || FEI SER ||R ||
Line 17: Line 28:
To convert from one format to another, the ''e2proc2d.py'' and ''e2proc3d.py'' programs can be used for 2-D and 3-D images respectively. The basic usage ''e2proc2d.py <infile> <outfile>'' will simply convert from one file format to another. By default, the image type for the output file is determined by the file extension. Both programs also have options for specifying the file type when it would otherwise be ambiguous.

Any program in EMAN should be able to read/write any of the above file formats seamlessly, though each format may have its own limitations. We attempt to preserve as much metadata as possible, but some formats simply aren't very flexible in this regard. The only format supporting EMAN2's full model for associating attributes with individual images is HDF5, which is the format we encourage for general use and file interchange moving forward. This would be considered the default format for EMAN2. Unfortunately, while HDF5 is exceptionally flexible and portable, its performance on large image stacks is substantially worse than the simpler flat-file formats. For this reason, the primary storage mechanism in EMAN2 for internal processing is a BerkeleyDB-based embedded database system.
----
----
----
= obsolete below this point =
The documentation below this point is preserved mostly for historical reasons. As we transition to EMAN2.1, the BDB mechanism is being completely retired, so the information below no longer applies.
Line 22: Line 35:
In EMAN2, we have converted to a model where most of the image data being processed is stored in and 'embedded database' in the local directory rather than in the traditional MRC/IMAGIC/SPIDER files. When files are added to this local database for the first time, you will see an EMAN2DB directory appear in the local directory. This subdirectory contains all of the image data and header information for an unlimited number of 'virtual files'. Files may still be copied into and out of this database into conventional files, but by storing data internally, we gain a (sometimes substantial) performance benefit, have much more flexibility in how metadata (known as 'header information') is stored, and permit much better tracking of what tasks have been completed on each data item. This idea might take some getting used to, and we hope you will appreciate its elegance once you do. In EMAN2, we have converted to a model where most of the image data being processed is stored in an 'embedded database' in the local directory rather than in the traditional MRC/IMAGIC/SPIDER files. When files are added to this local database for the first time, you will see an EMAN2DB directory appear in the local directory. This subdirectory contains all of the image data and header information for an unlimited number of 'virtual files'. Files may still be copied into and out of this database into conventional files, but by storing data internally, we gain a (sometimes substantial) performance benefit, have much more flexibility in how metadata (known as 'header information') is stored, and permit much better tracking of what tasks have been completed on each data item. This idea might take some getting used to, and we hope you will appreciate its elegance once you do.
Line 24: Line 37:
The files contained in the database directory (EMAN2DB) should never be manipulated by hand. Don't rename or copy files. This could cause errors or data loss. The entire directory can be safely moved from one directory to another, but its contents must not be altered. If you need to extract data for use with another program or somesuch, you MUST use an EMAN2 program to copy the data into a standard disk file. If you want to ensure there is no loss of metadata, use the HDF format for this purpose. The files contained in the database directory (EMAN2DB) should never be manipulated by hand. Don't rename or copy files. This could cause errors or data loss. The entire directory can be safely moved from one directory to another, but its contents must not be altered. If you need to extract data for use with another program or somesuch, you MUST use an EMAN2 program to copy the data into a standard disk file. Note that the only standard disk file that supports the full metadata model used in the database system at present is the HDF format. If you want to ensure there is no loss of metadata, use the HDF format for this purpose. If you copy a database image to, for example, a SPIDER file, you will lose any metadata which cannot be represented in the SPIDER header.
Line 28: Line 41:

You can browse the database using the workflow interface, or by running 'e2display.py' with no arguments, and browsing to the 'bdb' item in any directory. This will show the content of any database, which in some cases may contain only metadata and not images (in most cases they contain image stacks just like a spider or imagic format file).
Line 45: Line 60:
With a selection list: bdb:dbname?select.selectname
Line 46: Line 63:

'''Important note:''' If using any URL containing a '?', you should put the entire specifier in double or single quotes, e.g. "bdb:dbname?select.selectname". The '?' character is interpreted by most unix shells as a wildcard character for the filesystem, and not using the quotes could result in errors like "zsh: no matches found: bdb:dbname?select.abc".

The 'select.selectname' mechanism allows you to have a local database named 'select', in which each key contains a list of integers to be treated as image numbers in the file. For example, bdb:db?select.abc would refer to a database called 'select' with key 'abc' holding a list of image numbers, which would then be dereferenced from 'db'.
Line 53: Line 74:
==== Using the database from Python (for programmers or advanced users) ====
The normal method for accessing image data on disk is using the read_image, read_images and write_image methods, for example:
The FAQ has a few questions answered which may shed more light on this.

===== e2bdb.py =====
''e2bdb.py'' is a utility for examining and interacting with databases from the command-line. ''e2bdb.py'' issued with no arguments will show a list of the databases in the current directory (much like ''ls'' for regular files). ''e2bdb.py -l'' will give details for each database. ''e2bdb.py -s'' will return each database name in 'bdb:database' format for use in other commands such as:
Line 57: Line 80:
e2.py
img=EMData()
img.read_image("test.hdf",5) # reads the 6th image from test.hdf (first image is 0)
img.write_image("test2.hdf",-1) # appends (-1) the image to the end of test2.hdf
img_list=EMData.read_images("test.hdf",range(50)) # reads the first 50 images from test.hdf into a list of EMData objects
n=EMUtil.get_image_count("test.hdf") # counts the number of images in test.hdf
foreach i (`e2bdb.py -s`)
e2proc2d.py $i output.spi
end
Line 64: Line 84:
Similar operations can be performed with databases, such as : The --filt and --match options allow you to filter the results of the search, by either doing a substring match (--filt), or full python regular expression matching (--match).
Line 66: Line 86:
{{{
img.read_image("bdb:test",5)
img.write_image("bdb:test2",-1)
}}}
However, this is not the preferred mechanism for using the database interface, since there are many more powerful operations which can be performed. Such as:

{{{
e2.py # This implicitly performs a 'from EMAN2db import *', which opens the local environment: DB=EMAN2DB.open_db()
DB.open_dict("test") # this opens a specific database in the local directory called "test"
DB.test[0]=test_image() # stores an EMData object in the 'test' database
DB["test"][0]=test_image() # equivalent to above, you can access a database as DB.name or DB["name"]
img=DB.test[0] # This reads the EMData object back from the database
DB.test.set_attr(3,"mykey",5.5) # This sets an attribute "mykey" on EMData keyed 3 in database 'test'
                                  # This operation is MUCH faster than doing the same thing with any
                                  # flat file
DB.test.get_attr(0,"mykey") # This retrieves an attribute of image 0 from database test without
                                  # loading the image data
DB.test["testimg"]=test_image() # Keys in the database need not be integers, though the
                                  # read_image, etc. methods can only access integer keys
DB.test["alist"]=[1,2,3,4,5] # You can also use the 'test' database to store arbitrary other
                                  # metadata, not just images. This assigns a list to key 'alist'
DB.close_dict("test") # While database will be cleanly closed automatically, except for
                                  # cases where python is forcibly terminated (^c is ok), it isn't
                                  # a bad idea to close them if you know you won't use them again
}}}
Basically, each database object can be treated as a python dictionary. Any Python object that can be pickled (almost any python object) can be stored as a value in these dictionaries. It is even possible to mix images of different sizes within a single object.

The attribute mechanism (set_attr, get_attr) is tied into the EMData object attribute dictionary. That is, the following operations are functionally equivalent, but the second version is MUCH faster.

{{{
img=DB.test[3]
img.set_attr("mykey",5.5)
DB.test[3]=img
# OR
DB.test.set_attr(3,"mykey",5.5)
}}}
Unlike python dictionaries, if a value in the database is an object, changing the object does not result in writing the change back to the database, unless you explicitly write it again. For example:

{{{
# With a dictionary
test={1:["a","b","c"],2:3}
test[1][1]="c"
print(test[1])
# prints ['a', 'c', 'c']
# With a database
DB.open_dict("test")
DB.test[1]=["a","b","c"]
DB.test[2]=3
DB.test[1][1]="c" # This effectively does nothing: the change is never written back
print(DB.test[1])
# prints ['a', 'b', 'c']
# To make the above actually work
d=DB.test[1]
d[1]="c"
DB.test[1]=d
}}}
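The same write-back behaviour can be reproduced without EMAN2 using Python's standard ''shelve'' module, which also stores pickled values. This is only an analogy, offered as a minimal sketch: ''demo_shelf_mutation'' is a hypothetical helper name, and ''shelve'' is not the EMAN2 database. Note that ''shelve'' requires string keys, unlike the integer keys used above.

```python
import os
import shelve
import tempfile


def demo_shelf_mutation():
    """Show that mutating a fetched value does not update the stored value."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "test")
        with shelve.open(path) as db:
            db["1"] = ["a", "b", "c"]

            # db["1"] unpickles a fresh copy; the assignment changes only that copy.
            db["1"][1] = "c"
            unchanged = db["1"]

            # Read, modify, then explicitly write the object back.
            d = db["1"]
            d[1] = "c"
            db["1"] = d
            updated = db["1"]
    return unchanged, updated


if __name__ == "__main__":
    before, after = demo_shelf_mutation()
    print(before)
    print(after)
```

The first value returned is the list as stored before the explicit write-back, the second is the list after it.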
You can write/read the full header for an EMData object inexpensively with:

{{{
DB.test[2]=test_image()
hdr=DB.test.get_header(2) # returns the equivalent of get_attr_dict on an EMData object
                          # for a disk database, get_header requires the image number as its argument
hdr["apix_x"]=2.0
DB.test.set_header(2,hdr) # hdr can be either a dictionary or an EMData object
}}}
There is a small cost associated with opening each database, so it is generally a good idea for performance purposes to open the database and only close it if you aren't expecting to use it again for some time.
Finally, --makevstack can be used to make a 'virtual' image stack from one or more other stacks. For example ''e2bdb.py bdb:.#db1 bdb:.#db2 --makevstack=bdb:.#db3'' will combine the images in db1 and db2 into db3. However, unlike doing this same task with ''e2proc2d.py bdb:.#db1 bdb:.#db3; e2proc2d.py bdb:.#db2 bdb:.#db3'', the --makevstack option will not actually copy the image data. Instead it creates 'db3' which references the data already stored in 'db1' and 'db2'. If the image data in 'db1' or 'db2' is changed, the corresponding images in 'db3' will also appear to change. However, 'db3' has its own copy of the metadata associated with the image data. That is, if you added an attribute to image 5 in 'db1' : 'fred=25', this attribute would appear only in image 5 in 'db1'. It would not appear in 'db3'. However if you inverted the image contrast in 'db1' : ''e2proc2d.py bdb:db1 bdb:db1 --inplace --mult=-1'', that change WOULD be reflected in 'db3'. Note that if image data is WRITTEN to 'db3', it will NOT overwrite the image data in the original databases ('db1' or 'db2'), but will store the image data in 'db3'. For example, if image 7 were written to 'db3', image 7 in 'db1' would remain unchanged. Future reads of image 7 from 'db3' would read this new data. The reference to 'db1' for that specific image would be broken. However, reading image 6 from 'db3' would still reference image 6 from 'db1'.
Line 134: Line 89:
Multiple processes/threads ''on a single machine'' can safely have the same database open at the same time (reading and writing). The databases (based on BerkeleyDB) support record-level locking. If one process is writing to a record and another process simultaneously tries to read the record, the read operation will block until the write completes. On a single machine the databases coordinate with each other using the database cache in /tmp. Multiple processes/threads ''on a single machine'' can safely have the same database open at the same time (reading and writing). The databases (based on BerkeleyDB) support record-level locking. If one process is writing to a record and another process simultaneously tries to read the record, the read operation will block until the write completes. On a single machine the databases coordinate with each other using the database cache in /tmp which '''MUST''' be on a locally mounted filesystem (not NFS).
Line 136: Line 91:
'''Multiple processes accessing (reading and writing) to a single file from multiple machines on a network-mounted filesystem IS NOT SAFE, and may result in unpredictable errors. Files opened for reading only should be safe.''' '''Multiple processes accessing (reading and writing) to a single file from multiple machines on a network-mounted filesystem IS NOT SAFE, and may result in unpredictable errors.'''
Line 138: Line 93:
When opening a database read-only, the caching mechanism is disabled, so changes made by a single other node opened for writing should be reflected in the read-only databases as soon as the write is flushed to disk.

The 'standard' parallelism mechanism in EMAN2 will be an encapsulation and distribution approach where reads/writes are synchronized through a single 'master' node. Finer grained MPI processing will also be supported, but less generally. SPARX is using a different approach.
If you are using one of the EMAN2 standard parallelism mechanisms (MPI or distributed processing), all writes are coordinated through a single node. If you are trying to look at a project read-only while, for example, an MPI job is running on another node, you may do so (just be careful not to write anything to the database). Also, unless you run '''e2bdb.py -c''' first, you may observe some strange inconsistencies in the files.

Image Data files in EMAN2

Virtually all cryo-EM file formats are supported as well as some good generic formats. The default format used in EMAN2 processing is HDF5, which supports stacks of 2-D and 3-D images as well as arbitrary header information for each image in the file. If you convert an image to a format like MRC, you will lose any metadata not compatible with that format.

  • Any program in EMAN2 should directly read any supported file format without conversion. (Specific programs may require header information not available in all formats)

  • Most programs can write images to any output file format, determined by the filename you use. However, we strongly suggest using HDF unless you are transferring data to other software, as any other format will lose header information.

  • e2proc2d.py and e2proc3d.py can be used to explicitly convert files among specified file formats with specific data types, and are also used for general-purpose image processing.

  • e2display.py and e2projectmanager.py (via the file browser) can be used with the Save as button to convert to arbitrary output formats using a graphical interface.

  • MRC/CCP4 have a number of special issues; please see the appropriate section on the image formats page.

File Formats

This is a quick summary of the most used file formats.

For a complete list of supported formats and capabilities see the image formats page.

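||HDF5 ||R/W ||MRC/CCP4 ||R/W ||
||IMAGIC ||R/W ||SPIDER ||R/W ||
||PIF ||R/W ||ICOS ||R/W ||
||VTK ||R/W ||PGM ||R/W ||
||Amira ||R/W ||Xplor ||W ||
||Gatan DM2 ||R ||Gatan DM3 ||R ||
||Gatan DM4 ||R ||FEI SER ||R ||
||TIFF ||R/W ||Scans-a-lot ||R ||
||LST ||R/W ||PNG ||R/W ||
||Video-4-Linux ||R ||JPEG ||W ||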




obsolete below this point

The documentation below this point is preserved mostly for historical reasons. As we transition to EMAN2.1, the BDB mechanism is being completely retired, so the information below no longer applies.

EMAN2 Embedded Database

In EMAN2, we have converted to a model where most of the image data being processed is stored in an 'embedded database' in the local directory rather than in the traditional MRC/IMAGIC/SPIDER files. When files are added to this local database for the first time, you will see an EMAN2DB directory appear in the local directory. This subdirectory contains all of the image data and header information for an unlimited number of 'virtual files'. Files may still be copied into and out of this database into conventional files, but by storing data internally, we gain a (sometimes substantial) performance benefit, have much more flexibility in how metadata (known as 'header information') is stored, and permit much better tracking of what tasks have been completed on each data item. This idea might take some getting used to, and we hope you will appreciate its elegance once you do.

The files contained in the database directory (EMAN2DB) should never be manipulated by hand. Don't rename or copy files. This could cause errors or data loss. The entire directory can be safely moved from one directory to another, but its contents must not be altered. If you need to extract data for use with another program or somesuch, you MUST use an EMAN2 program to copy the data into a standard disk file. Note that the only standard disk file format that supports the full metadata model used in the database system at present is HDF. If you want to ensure there is no loss of metadata, use the HDF format for this purpose. If you copy a database image to, for example, a SPIDER file, you will lose any metadata which cannot be represented in the SPIDER header.

Using the database with normal EMAN2 programs

Note: It is very important that you not manually rename or edit the files in the EMAN2DB directory. Doing so could corrupt the entire database in such a way that EMAN2 programs will no longer be able to access it properly. You can safely move the directory as a whole to a different location, but otherwise it should not be modified.

You can browse the database using the workflow interface, or by running 'e2display.py' with no arguments, and browsing to the 'bdb' item in any directory. This will show the content of any database, which in some cases may contain only metadata and not images (in most cases they contain image stacks just like a spider or imagic format file).

The database can be accessed by any of the EMAN2 programs. Normally you would specify a file as 'test.hdf' or '/home/stevel/test.hdf'. To access the database, simply specify 'bdb:test' or 'bdb:/home/stevel/data/test'. In the first instance (bdb:test), the named database 'test' will be accessed in the EMAN2DB database in the local directory. Specifying 'bdb:/home/stevel/data/test' will access the database named test in /home/stevel/data/EMAN2DB. Each of the EMAN2DB directories can contain an unlimited number of individual named databases. The EMAN2 GUI interface will provide tools for browsing these databases interactively, and you can find all of their names by listing the 'EMAN2DB/*.bdb' files. Please note that when specifying database names, you don't use the '.bdb' extension. There is more to the database than just the '.bdb' file you see.

For example, say you have a database called 'averages' containing 200 class-averages, and you want to get them out for processing in Spider. Just:

e2proc2d.py bdb:averages averages.spi

and you will end up with a Spider format stack file containing all of the images.

The way you specify an image inside one of these databases is any of:

For a database in the local directory: bdb:dbname

For a database in another directory referenced to the current one: bdb:../local/path#dbname

For a database at an absolute path: bdb:/absolute/path/to/directory#dbname

With a selection list: bdb:dbname?select.selectname

To access keys as a virtual database: bdb:/absolute/path/to/directory#dbname?key,key,key

Important note: If using any URL containing a '?', you should put the entire specifier in double or single quotes, e.g. "bdb:dbname?select.selectname". The '?' character is interpreted by most unix shells as a wildcard character for the filesystem, and not using the quotes could result in errors like "zsh: no matches found: bdb:dbname?select.abc".
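When command lines are assembled from a script, the quoting can be done automatically. The following is a minimal sketch using Python's standard shlex module; shell_safe_command is a hypothetical helper name, not part of EMAN2, and out.hdf is just an illustrative output filename.

```python
import shlex


def shell_safe_command(program, *args):
    """Join a program and its arguments, quoting any that the shell would alter."""
    return " ".join(shlex.quote(a) for a in (program,) + args)


if __name__ == "__main__":
    # The '?' forces quoting of the specifier; the other arguments are left bare.
    print(shell_safe_command("e2proc2d.py", "bdb:dbname?select.abc", "out.hdf"))
    # prints: e2proc2d.py 'bdb:dbname?select.abc' out.hdf
```

Passing the arguments as a list to subprocess.run, without shell=True, avoids the issue altogether, since no shell ever sees the '?'.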

The 'select.selectname' mechanism allows you to have a local database named 'select', in which each key contains a list of integers to be treated as image numbers in the file. For example, bdb:db?select.abc would refer to a database called 'select' with key 'abc' holding a list of image numbers, which would then be dereferenced from 'db'.

The final access method is not very commonly used, but can be quite powerful for specialized purposes. In a typical image stack file, such as SPIDER or IMAGIC format, the individual images are numbered from 0 to n-1. Say you have a database with 50 images in it, and you want to extract image numbers 0, 3, 6, 10 and 12 from the database. You could do this several ways, including running 5 separate e2proc2d.py commands or putting the numbers in a text file and having e2proc2d.py use the text file. An alternative would be:

e2proc2d.py bdb:averages?0,3,6,10,12 selected.hed

EMAN2 programs will treat averages?0,3,6,10,12 as if it were actually a database with only 5 images in it, numbered from 0-4: 0=0, 1=3, 2=6, 3=10, 4=12.
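To make the specifier forms above concrete, here is a minimal sketch of how such a string could be split into its parts. It is not EMAN2's own parser: BdbSpec, parse_bdb, bdb_file and virtual_index_map are hypothetical names, and the rule that a specifier without '#' takes its last path component as the database name is an assumption based on the 'bdb:/home/stevel/data/test' example.

```python
import posixpath
from typing import NamedTuple, Optional, Tuple, Union


class BdbSpec(NamedTuple):
    directory: str                      # directory holding the EMAN2DB folder
    name: str                           # database name, without the .bdb extension
    select: Optional[str]               # key in the 'select' database, if given
    keys: Tuple[Union[int, str], ...]   # explicit keys for a virtual database


def parse_bdb(spec: str) -> BdbSpec:
    """Split a 'bdb:' specifier into directory, name, selection and keys."""
    if not spec.startswith("bdb:"):
        raise ValueError("not a bdb specifier: %r" % spec)
    body, _, query = spec[4:].partition("?")
    if "#" in body:
        directory, name = body.rsplit("#", 1)
    elif "/" in body:
        # Assumption: without '#', the last path component is the name.
        directory, name = body.rsplit("/", 1)
        directory = directory or "/"
    else:
        directory, name = ".", body
    select = None
    keys: Tuple[Union[int, str], ...] = ()
    if query.startswith("select."):
        select = query[len("select."):]
    elif query:
        keys = tuple(int(k) if k.isdigit() else k for k in query.split(","))
    return BdbSpec(directory, name, select, keys)


def bdb_file(spec: BdbSpec) -> str:
    """Path of the .bdb file that would appear in a directory listing."""
    return posixpath.join(spec.directory, "EMAN2DB", spec.name + ".bdb")


def virtual_index_map(spec: BdbSpec) -> dict:
    """Map virtual image numbers (0..n-1) to the keys named in the specifier."""
    return dict(enumerate(spec.keys))


if __name__ == "__main__":
    spec = parse_bdb("bdb:averages?0,3,6,10,12")
    print(bdb_file(spec))
    print(virtual_index_map(spec))
```

For the averages example, virtual_index_map reproduces the numbering described above: virtual image 1 refers to key 3, virtual image 4 to key 12, and so on.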

The FAQ has a few questions answered which may shed more light on this.

e2bdb.py

e2bdb.py is a utility for examining and interacting with databases from the command-line. e2bdb.py issued with no arguments will show a list of the databases in the current directory (much like ls for regular files). e2bdb.py -l will give details for each database. e2bdb.py -s will return each database name in 'bdb:database' format for use in other commands such as:

foreach i (`e2bdb.py -s`)
e2proc2d.py $i output.spi
end

The --filt and --match options allow you to filter the results of the search, by either doing a substring match (--filt), or full python regular expression matching (--match).
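The difference between the two kinds of filtering can be sketched in a few lines of Python. This is an illustration only: filter_names is a hypothetical function, and whether e2bdb.py anchors the regular expression at the start of the name (as re.match does here) is an assumption.

```python
import re


def filter_names(names, filt=None, match=None):
    """Keep names containing the substring `filt` and/or matching the regex `match`."""
    out = list(names)
    if filt is not None:
        # Plain substring test, no special characters.
        out = [n for n in out if filt in n]
    if match is not None:
        # Full regular expression, tested from the start of the name (assumed).
        pattern = re.compile(match)
        out = [n for n in out if pattern.match(n)]
    return out


if __name__ == "__main__":
    names = ["averages", "classes", "particles_ctf"]
    print(filter_names(names, filt="ctf"))
    print(filter_names(names, match=r"[ac].*s$"))
```

With the three example names, the substring filter keeps only particles_ctf, while the regular expression keeps averages and classes.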

Finally, --makevstack can be used to make a 'virtual' image stack from one or more other stacks. For example e2bdb.py bdb:.#db1 bdb:.#db2 --makevstack=bdb:.#db3 will combine the images in db1 and db2 into db3. However, unlike doing this same task with e2proc2d.py bdb:.#db1 bdb:.#db3; e2proc2d.py bdb:.#db2 bdb:.#db3, the --makevstack option will not actually copy the image data. Instead it creates 'db3' which references the data already stored in 'db1' and 'db2'. If the image data in 'db1' or 'db2' is changed, the corresponding images in 'db3' will also appear to change. However, 'db3' has its own copy of the metadata associated with the image data. That is, if you added an attribute to image 5 in 'db1' : 'fred=25', this attribute would appear only in image 5 in 'db1'. It would not appear in 'db3'. However if you inverted the image contrast in 'db1' : e2proc2d.py bdb:db1 bdb:db1 --inplace --mult=-1, that change WOULD be reflected in 'db3'. Note that if image data is WRITTEN to 'db3', it will NOT overwrite the image data in the original databases ('db1' or 'db2'), but will store the image data in 'db3'. For example, if image 7 were written to 'db3', image 7 in 'db1' would remain unchanged. Future reads of image 7 from 'db3' would read this new data. The reference to 'db1' for that specific image would be broken. However, reading image 6 from 'db3' would still reference image 6 from 'db1'.
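The semantics described above (shared image data, private metadata, and references that break on write) can be modelled with a small toy class. This is a sketch of the behaviour only, under the assumption that metadata is copied once at creation time; Stack and VirtualStack are hypothetical names, plain numbers stand in for images, and none of this is EMAN2 code.

```python
class Stack:
    """Toy stand-in for a database: a list of images plus per-image attributes."""

    def __init__(self, images):
        self.images = list(images)
        self.attrs = [dict() for _ in self.images]


class VirtualStack:
    """References image data in other stacks but keeps its own metadata."""

    def __init__(self, *sources):
        # One (source, index) reference per image, in source order.
        self._refs = [(s, i) for s in sources for i in range(len(s.images))]
        # Metadata is copied at creation, so later changes do not propagate.
        self.attrs = [dict(s.attrs[i]) for s, i in self._refs]
        # Images written directly to the virtual stack live here.
        self._own = {}

    def __len__(self):
        return len(self._refs)

    def read(self, n):
        if n in self._own:
            return self._own[n]
        source, i = self._refs[n]
        return source.images[i]

    def write(self, n, image):
        # Breaks the reference for image n only; the source is untouched.
        self._own[n] = image


if __name__ == "__main__":
    db1, db2 = Stack([1, 2, 3]), Stack([4, 5])
    db3 = VirtualStack(db1, db2)
    db1.images[0] = -db1.images[0]   # change image data in the source
    db1.attrs[1]["fred"] = 25        # change metadata in the source
    db3.write(2, 99)                 # write into the virtual stack
    print(db3.read(0), db3.attrs[1], db3.read(2), db1.images[2])
```

Here the contrast inversion in db1 is visible through db3, the 'fred' attribute is not, and writing image 2 to db3 leaves image 2 in db1 unchanged.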

Clusters/MPI

Multiple processes/threads on a single machine can safely have the same database open at the same time (reading and writing). The databases (based on BerkeleyDB) support record-level locking. If one process is writing to a record and another process simultaneously tries to read the record, the read operation will block until the write completes. On a single machine the databases coordinate with each other using the database cache in /tmp which MUST be on a locally mounted filesystem (not NFS).
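The blocking behaviour of record-level locking can be illustrated with a toy in-memory store. This sketch uses threads within one Python process rather than BerkeleyDB, and RecordLockedStore is a hypothetical class; the started and delay parameters exist only to make the demonstration deterministic.

```python
import threading
import time
from collections import defaultdict


class RecordLockedStore:
    """Toy key/value store with one lock per record."""

    def __init__(self):
        self._data = {}
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, key):
        with self._guard:
            return self._locks[key]

    def write(self, key, value, started=None, delay=0.0):
        with self._lock_for(key):
            if started is not None:
                started.set()          # signal that the record is now locked
            time.sleep(delay)          # simulate a slow write
            self._data[key] = value

    def read(self, key):
        # Blocks while another thread holds the lock for this record.
        with self._lock_for(key):
            return self._data.get(key)


if __name__ == "__main__":
    store = RecordLockedStore()
    store.write(0, "old")
    started = threading.Event()
    writer = threading.Thread(target=store.write, args=(0, "new", started, 0.05))
    writer.start()
    started.wait()                     # the writer now holds the lock on record 0
    print(store.read(0))               # waits for the write, then sees the new value
    writer.join()
```

Reads of other records are not delayed, since each record has its own lock.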

Multiple processes accessing (reading and writing) to a single file from multiple machines on a network-mounted filesystem IS NOT SAFE, and may result in unpredictable errors.

If you are using one of the EMAN2 standard parallelism mechanisms (MPI or distributed processing), all writes are coordinated through a single node. If you are trying to look at a project read-only while, for example, an MPI job is running on another node, you may do so (just be careful not to write anything to the database). Also, unless you run e2bdb.py -c first, you may observe some strange inconsistencies in the files.

Eman2DataStorage (last edited 2022-03-08 23:55:01 by SteveLudtke)