Parallel Tandem

Parallel Tandem is a parallel driver for X!Tandem. This open source software is an ongoing project in Andrew Link's laboratory. It consists of a PVM or MPI parallel program and several utility programs that together reduce, by several fold, the time it takes to compare tandem mass spectra to amino acid sequences in a protein database.

Tandem_PVM runs within the PVM environment and launches X!Tandem on as many processors as there are in the PVM configuration; Tandem_MPI does the same in the MPI environment. With 20 dual-processor nodes, for example, 40 instances of X!Tandem run in parallel, each with its own set of MS/MS spectra. The program "autotandem_pvm" or "autotandem_mpi" automates the process of running the parallel program and the utility programs, which collate the results into one XML file. The programs may also be run independently, step by step, for more flexibility. The resulting XML output file may be viewed using the GPM suite of programs.
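The idea of giving each X!Tandem instance its own set of spectra can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used by Tandem_PVM or Tandem_MPI: the function name split_spectra, the in-memory list of placeholder spectrum labels, and the choice of contiguous chunks are all hypothetical.

```python
def split_spectra(spectra, n_workers):
    """Split a list of spectra into n_workers contiguous chunks.

    Chunk sizes differ by at most one, so every worker gets a
    nearly equal share. (Hypothetical helper, not part of Parallel Tandem.)
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    base, extra = divmod(len(spectra), n_workers)
    chunks = []
    start = 0
    for i in range(n_workers):
        # The first `extra` chunks each take one additional spectrum.
        size = base + (1 if i < extra else 0)
        chunks.append(spectra[start:start + size])
        start += size
    return chunks


# Example from the text: 20 dual-processor nodes give 40 workers.
nodes = 20
cpus_per_node = 2
n_workers = nodes * cpus_per_node

# Placeholder labels stand in for real MS/MS spectra.
spectra = [f"spectrum_{i}" for i in range(25000)]
chunks = split_spectra(spectra, n_workers)
print(len(chunks), len(chunks[0]), len(chunks[-1]))
```

With 25,000 placeholder spectra and 40 workers, each chunk holds 625 spectra. In a real run each chunk would be written to its own spectrum file and handed to one X!Tandem instance.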

Reference: Duncan, D. T., Craig, R., Link, A. J. (2005) Parallel Tandem: a program for parallel processing of tandem mass spectra using PVM or MPI and X!Tandem. J. Proteome Research. in press.

Some Statistics

We routinely divide a file of several thousand MS/MS spectra into several smaller files, each with an equal number of spectra, and search them with X!Tandem in parallel. Searches without refinement are at least 90% scalable, as tested on 40 processors. For example, on 40 processors we searched 25,000 MS/MS spectra against a 50,000-entry protein database file with 6 modifications, without refinement, in 30 minutes. The same search took 20 hours on one processor. Autotandem, parallel tandem with refinement, is about 70% scalable, depending on the input parameters. For example, a fully automated search of 25,000 acquired spectra against a 50,000-entry protein database with 3 modifications, with refinement, took 3 hours on 30 processors, compared to 70 hours on a single processor.
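The timings above can be turned into speedup and parallel efficiency with simple arithmetic. The sketch below assumes that "scalable" refers to parallel efficiency (speedup divided by processor count), which the text does not state explicitly; the function names are illustrative.

```python
def speedup(t_serial, t_parallel):
    """Ratio of single-processor time to parallel time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_procs):
    """Speedup divided by the number of processors (1.0 is ideal)."""
    return speedup(t_serial, t_parallel) / n_procs


# Without refinement: 20 hours on one processor, 0.5 hours on 40.
s_plain = speedup(20.0, 0.5)
e_plain = efficiency(20.0, 0.5, 40)

# With refinement: 70 hours on one processor, 3 hours on 30.
s_refine = speedup(70.0, 3.0)
e_refine = efficiency(70.0, 3.0, 30)

print(f"without refinement: {s_plain:.1f}x speedup, {e_plain:.0%} efficiency")
print(f"with refinement:    {s_refine:.1f}x speedup, {e_refine:.0%} efficiency")
```

Using the rounded times quoted above, this gives a 40.0x speedup (100% efficiency) without refinement and a 23.3x speedup (78% efficiency) with refinement, in line with the "at least 90%" and "about 70%" figures.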

Applications

The premier application of these parallel methods is searching an entire database for several potential modifications in order to build a sub-database of candidate proteins for subsequent searches. Once assembled, this smaller sub-database can be searched again with refinement (that is, searched for missed cleavages and partial cleavages with potential modifications), either sequentially or again in parallel. A search without refinement against the entire database gives results in very close agreement whether it runs on one machine or in parallel on many machines. A subsequent search with refinement against the sub-database also agrees closely between one machine and several machines in parallel, but the number of proteins identified has differed by a very small percentage (~2%). This difference is due to peptide expectation values that are borderline with respect to the confidence level set in the parallel input.xml parameters.
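To make the borderline effect concrete, the sketch below shows how a small shift in one expectation value around a cutoff changes which proteins are accepted, and how the two accepted sets can be compared. All values here are invented for illustration: the protein labels, the expectation values, and the cutoff name max_expect are hypothetical and are not taken from an actual input.xml or from real search results.

```python
def accepted(expectations, max_expect):
    """Return the proteins whose expectation value passes the cutoff."""
    return {name for name, e in expectations.items() if e <= max_expect}


def percent_difference(ids_a, ids_b):
    """Percent of proteins found in only one of the two result sets."""
    ids_a, ids_b = set(ids_a), set(ids_b)
    union = ids_a | ids_b
    if not union:
        return 0.0
    return 100.0 * len(ids_a ^ ids_b) / len(union)


# Invented example data: protein P2 sits right at the cutoff.
max_expect = 0.01  # hypothetical confidence cutoff
serial_run = {"P1": 1e-5, "P2": 0.0098, "P3": 0.2}
parallel_run = {"P1": 1e-5, "P2": 0.0103, "P3": 0.2}

serial_hits = accepted(serial_run, max_expect)
parallel_hits = accepted(parallel_run, max_expect)

print(sorted(serial_hits), sorted(parallel_hits))
print(percent_difference(serial_hits, parallel_hits))
```

In this toy case P2 passes in one run and narrowly fails in the other, so the two result sets differ by one protein. With thousands of proteins and only a few borderline cases, the same mechanism yields a small percentage difference of the kind described above.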

Authors and Contact

Authors: Dexter Duncan and Andrew Link, Vanderbilt University School of Medicine
Contact: