Diff for "Eman2DataStorage"

Differences between revisions 6 and 7

Dealing with image data on disk in EMAN2

EMAN2 supports a variety of mechanisms for dealing with your data on disk. Virtually all cryo-EM file formats are supported as well as some good generic formats. In addition, EMAN2 has a local embedded database storage scheme used heavily during processing. This mechanism is faster than typical direct file access, and permits easy logging of tasks and book-keeping. Finally, we support communications with the EMEN2 OODB, permitting things like directly reading image data from a centralized database for processing.

File Formats

EMAN2 supports the following file formats:

HDF5	R/W	MRC/CCP4	R/W
IMAGIC	R/W	SPIDER	R/W
PIF	R/W	ICOS	R/W
VTK	R/W	PGM	R/W
Amira	R/W	Xplor	W
Gatan DM2	R	Gatan DM3	R
TIFF	R/W	Scans-a-lot	R
LST	R/W	PNG	R/W
Video-4-Linux	R	JPEG	W

To convert from one format to another, the e2proc2d.py and e2proc3d.py programs can be used for 2-D and 3-D images respectively. The basic usage e2proc2d.py <infile> <outfile> will simply convert from one file format to another. By default, the output file type is determined by the file extension. Both programs also have options for specifying the file type when it would otherwise be ambiguous.
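As a minimal sketch of how such a conversion might be scripted, the Python helper below assembles the basic '<program> <infile> <outfile>' command line and guesses the output format from the file extension. The helper names (build_convert_command, guess_format) and the small extension table are illustrative assumptions, not part of EMAN2, and the table covers only a few common extensions.

```python
import os

# Illustrative extension table (an assumption for this sketch, not EMAN2's own list).
EXTENSION_TO_FORMAT = {
    ".hdf": "HDF5",
    ".mrc": "MRC/CCP4",
    ".spi": "SPIDER",
    ".img": "IMAGIC",
    ".hed": "IMAGIC",
    ".png": "PNG",
    ".tif": "TIFF",
}


def guess_format(filename):
    """Return the format name implied by the extension, or None if unknown."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_TO_FORMAT.get(ext)


def build_convert_command(infile, outfile, three_d=False):
    """Build the argument list for a plain format conversion (hypothetical helper)."""
    program = "e2proc3d.py" if three_d else "e2proc2d.py"
    return [program, infile, outfile]


if __name__ == "__main__":
    cmd = build_convert_command("particles.hdf", "particles.spi")
    print(" ".join(cmd), "->", guess_format(cmd[-1]))
```

On a machine with EMAN2 installed, the returned list could be handed to subprocess.run to perform the conversion; here it is only constructed, not executed.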

Any program in EMAN2 should be able to read/write any of the above file formats seamlessly, though each format may have its own limitations. We attempt to preserve as much metadata as possible, but some formats simply aren't very flexible in this regard. The only format supporting EMAN2's full model for associating attributes with individual images is HDF5, which is the format we encourage for general use and file interchange moving forward, and which should be considered the default format for EMAN2. Unfortunately, while HDF5 is exceptionally flexible and portable, its performance on large image stacks is substantially worse than that of the simpler flat-file formats. For this reason, the primary storage mechanism in EMAN2 for internal processing is a BerkeleyDB-based embedded database system.

EMAN2 Embedded Database

You'll note that whenever you run an EMAN2 program in a new directory, a subdirectory called EMAN2DB is also created. In EMAN1, a hidden file '.emanlog' was created, and this simple file contained a history of all of the EMAN1 commands run in that directory. In EMAN2, we have converted to a model where most of the image data being processed is stored in an 'embedded database' in the local directory rather than in the traditional MRC/IMAGIC/SPIDER files. Data may still be copied between this database and conventional files, but by storing data internally, we gain a (sometimes substantial) performance benefit, have much more flexibility in how metadata (known as 'header information') is stored, and permit much better tracking of what tasks have been completed on each data item. This idea might take some getting used to, but we hope you will appreciate its elegance once you do.

The files contained in the database directory (EMAN2DB) should never be manipulated by hand. Don't rename or copy individual files, as this could cause errors or data loss. The entire directory can be safely moved from one location to another, but its contents must not be altered. If you need to extract data for use with another program, you MUST use an EMAN2 program to copy the data into a standard disk file. If you want to ensure there is no loss of metadata, use the HDF5 format for this purpose.

Using the database with normal EMAN2 programs

Note: It is very important that you not manually rename or edit the files in the EMAN2DB directory. Doing so could corrupt the entire database in such a way that EMAN2 programs will no longer be able to access it properly. You can safely move the directory as a whole to a different location, but otherwise it should not be modified.

The database can be accessed by any of the EMAN2 programs. Normally you would specify a file as 'test.hdf' or '/home/stevel/test.hdf'. To access the database, simply specify 'bdb:test' or 'bdb:/home/stevel/data/test'. In the first instance (bdb:test), the database named 'test' will be accessed in the EMAN2DB directory in the local directory. Specifying 'bdb:/home/stevel/data/test' will access the database named 'test' in /home/stevel/data/EMAN2DB.

Each EMAN2DB directory can contain an unlimited number of individual named databases. The EMAN2 GUI will provide tools for browsing these databases interactively, and you can find all of their names by listing the 'EMAN2DB/*.bdb' files. Note that you do not include the '.bdb' extension when specifying a database name, and that there is more to each database than just the '.bdb' file you see.
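The naming rule for 'bdb:' specifiers can be illustrated with a short Python sketch. The function resolve_bdb below is a hypothetical helper written for this page, not an EMAN2 call, and it handles only the two forms described here (a bare name, and a path ending in a name).

```python
import posixpath  # the examples here use Unix-style paths


def resolve_bdb(spec, cwd="."):
    """Split a 'bdb:' specifier into (EMAN2DB directory, database name).

    Hypothetical helper: mirrors only the two forms described in the text.
    """
    if not spec.startswith("bdb:"):
        raise ValueError("not a bdb specifier: %r" % spec)
    path = spec[len("bdb:"):]
    directory, name = posixpath.split(path)
    if not name:
        raise ValueError("no database name in %r" % spec)
    if name.endswith(".bdb"):
        raise ValueError("omit the '.bdb' extension: %r" % spec)
    # A bare name refers to the EMAN2DB directory in the current directory.
    base = directory if directory else cwd
    return posixpath.join(base, "EMAN2DB"), name


if __name__ == "__main__":
    print(resolve_bdb("bdb:test"))
    print(resolve_bdb("bdb:/home/stevel/data/test"))
```

The sketch only computes where a database would live; it never touches the EMAN2DB directory itself, in keeping with the warning above.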

For example, say you have a database called 'averages' containing 200 class-averages, and you want to get them out for processing in Spider. Just:

e2proc2d.py bdb:averages averages.spi

and you will end up with a Spider format stack file containing all of the images.

Using the database from Python (for programmers or advanced users)

The normal method for accessing image data on disk is using the read_image, read_images and write_image methods, for example:

e2.py
img=EMData()
img.read_image("test.hdf",5)  # reads the 6th image from test.hdf (first image is 0)
img.write_image("test2.hdf",-1)   # appends (-1) the image to the end of test2.hdf
img_list=EMData.read_images("test.hdf",(0,50))   # reads the first 50 images from test.hdf into a list of EMData objects
n=EMUtil.get_image_count("test.hdf")   # counts the number of images in test.hdf

Similar operations can be performed with databases, such as:

img.read_image("bdb:test",5)
img.write_image("bdb:test2",-1)

However, this is not the preferred mechanism for using the database interface, since many more powerful operations can be performed by working with the database directly, such as:

e2.py    # This implicitly performs a 'from EMAN2db import *', which opens the local environment: db=EMAN2DB.open_db()

db.open_dict("test")     # this opens a specific database in the local directory called "test"

db.test[0]=test_image()  # stores an EMData object in the 'test' database

img=db.test[0]           # This reads the EMData object back from the database

db.test.set_attr(3,"mykey",5.5)   # This sets an attribute "mykey" on EMData keyed 3 in database 'test'
                                  # This operation is MUCH faster than doing the same thing with any
                                  # flat file

db.test.get_attr(0,"mykey")       # This retrieves an attribute of image 0 from database test without
                                  # loading the image data

db.test["testimg"]=test_image()   # Keys in the database need not be integers, though the
                                  # read_image, etc. methods can only access integer keys

db.test["alist"]=[1,2,3,4,5]      # You can also use the 'test' database to store arbitrary other
                                  # metadata, not just images. This assigns a list to key 'alist'

db.close_dict("test")             # While databases will be closed cleanly and automatically, except in
                                  # cases where Python is forcibly terminated (^C is ok), it isn't
                                  # a bad idea to close them if you know you won't use them again

Basically, each database object can be treated as a Python dictionary. Any Python object that can be pickled (almost any Python object) can be stored as a value in these dictionaries. It is even possible to mix images of different sizes within a single database.
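To make the dictionary model concrete without requiring EMAN2, here is a minimal sketch using Python's standard shelve module as a stand-in. This is not how EMAN2 implements its databases; it only illustrates the same idea of a persistent, dictionary-like store whose values are pickled Python objects. Unlike the EMAN2 database, shelve requires string keys, and the plain dict used in place of an image is an assumption for the example.

```python
import os
import shelve
import tempfile


def demo(directory):
    """Store and retrieve mixed values in a dictionary-like on-disk store."""
    path = os.path.join(directory, "test")
    with shelve.open(path) as db:
        # shelve requires string keys; values may be any picklable object.
        db["0"] = {"nx": 64, "ny": 64, "mykey": 5.5}  # stand-in for an image record
        db["alist"] = [1, 2, 3, 4, 5]                 # arbitrary non-image metadata
    with shelve.open(path) as db:                     # reopen to show persistence
        return db["0"]["mykey"], db["alist"], sorted(db.keys())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(demo(tmp))
```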

There is a small cost associated with opening each database, so for performance purposes it is generally a good idea to open a database once and close it only if you don't expect to use it again for some time.

Multiple processes on a single machine can safely have the same database open at the same time. The databases (based on BerkeleyDB) support record-level locking. If one process is writing to a record and another process simultaneously tries to read the record, the read operation will block until the write completes. Multiple processes reading from and writing to a single database from multiple machines on a network-mounted filesystem IS NOT SAFE, and may result in unpredictable errors. We are working on a solution for this issue...
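The blocking behaviour described above can be illustrated with a small analogy in Python. The sketch uses threads and an in-memory dictionary with one lock per record; it is not BerkeleyDB and says nothing about how EMAN2 implements locking across processes. The class LockedStore and the 'started'/'proceed' hooks are hypothetical names that exist only to make the demonstration deterministic.

```python
import threading


class LockedStore:
    """In-memory store with one lock per record (illustration only)."""

    def __init__(self):
        self._data = {}
        self._locks = {}
        self._guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def write(self, key, value, started=None, proceed=None):
        with self._lock_for(key):
            if started is not None:
                started.set()   # signal that the write is in progress
            if proceed is not None:
                proceed.wait()  # simulate a slow write
            self._data[key] = value

    def read(self, key):
        with self._lock_for(key):  # blocks while a write holds this record
            return self._data.get(key)


def demo():
    store = LockedStore()
    store.write(0, "old")
    started, proceed = threading.Event(), threading.Event()
    result = []
    writer = threading.Thread(target=store.write, args=(0, "new", started, proceed))
    writer.start()
    started.wait()                   # the writer now holds the lock on record 0
    reader = threading.Thread(target=lambda: result.append(store.read(0)))
    reader.start()
    reader.join(timeout=0.2)
    blocked = reader.is_alive()      # the reader is still waiting for the write
    proceed.set()                    # let the write finish
    writer.join()
    reader.join()
    return blocked, result[0]


if __name__ == "__main__":
    print(demo())
```

In this sketch a read of a different record would not block, since each record has its own lock.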

-  ⇤ ← Revision 6 as of 2008-07-13 14:09:44 → 
  Size: 8534
  Editor: SteveLudtke
  Comment:
+   ← Revision 7 as of 2008-07-13 14:21:20 → ⇥
  Size: 8576
  Editor: SteveLudtke
  Comment:
 Line 86:
-Multiple processes on a single machine can safely have the same database open at the same time. The databases (based on BerkeleyDB) support record-level locking. If one process is writing to a record and another process simultaneously tries to read the record, the read operation will block until the write completes. Multiple processes accessing (reading and writing) to a single file from multiple machines on an network-mounted filesystem may not be safe (still investigating how to handle this).
+Multiple processes on a single machine can safely have the same database open at the same time. The databases (based on BerkeleyDB) support record-level locking. If one process is writing to a record and another process simultaneously tries to read the record, the read operation will block until the write completes. Multiple processes accessing (reading and writing) to a single file from multiple machines on an network-mounted filesystem IS NOT SAFE, and may result in unpredictable errors. We are working on a solution for this issue...