Parallelism#

Pyrodigal is entirely thread-safe, which allows processing different contigs in parallel with the same GeneFinder object.

Command Line#

The command line supports processing input sequences in parallel, but this needs to be enabled explicitly. Use the -j flag to specify any number of jobs to run in parallel, or -j0 to run as many jobs as there are available cores on the machine (as reported by os.cpu_count):

$ pyrodigal -j0 ...

Reentrancy#

The GeneFinder.find_genes method is re-entrant, so calling it across different threads doesn’t cause any issue. The easiest way to call the find_genes method in parallel is with a multiprocessing.pool.ThreadPool, which can easily split the work into chunks to be processed across different threads:

from multiprocess.pool import ThreadPool
from pyrodigal import GeneFinder

sequences = [ ... ]                  # a list of sequences to process
gene_finder = GeneFinder(meta=True)  # a single gene finder object

with ThreadPool() as pool:
    genes = pool.map(gene_finder.find_genes, sequences)

This is internally what the Pyrodigal CLI does when called with the -j flag set to any number of jobs but 1.

Processes#

On some setups, such as workstations with virtualized CPUs, threads may not be as efficient because of the requirement for a shared memory space accessible by all threads which may cross physical boundaries. In that case, using processes will be faster, despite the initial requirement to copy data in each worker process.

To use processes instead of threads in the command line, use the following flag:

$ pyrodigal --pool=process ...

In the API example from above, simply import Pool from multiprocessing.pool instead of ThreadPool.

Parallelism#

Command Line#

Reentrancy#

Processes#

This Page