Parallelism
Pyrodigal is entirely thread-safe, which allows processing different contigs in parallel with the same GeneFinder object.
Command Line
The command line supports processing input sequences in parallel, but this needs to be enabled explicitly. Use the -j flag to specify any number of jobs to run in parallel, or -j0 to run as many jobs as there are available cores on the machine (as reported by os.cpu_count):
$ pyrodigal -j0 ...
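As an illustration of how a job count such as the one given to -j could be mapped to a number of workers, here is a minimal sketch. The resolve_jobs helper is hypothetical and not part of Pyrodigal; it only assumes the convention described above, where 0 means one job per core reported by os.cpu_count.

```python
import os


def resolve_jobs(jobs: int) -> int:
    """Map a -j style value to a worker count (hypothetical helper)."""
    if jobs < 0:
        raise ValueError("number of jobs must be non-negative")
    if jobs == 0:
        # os.cpu_count() returns None when the core count is unknown,
        # so fall back to a single job in that case
        return os.cpu_count() or 1
    return jobs
```

os.cpu_count can return None when the number of cores cannot be determined, hence the fallback to a single job.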
Reentrancy
The GeneFinder.find_genes method is re-entrant, so calling it across different threads doesn’t cause any issue. The easiest way to call the find_genes method in parallel is with a multiprocessing.pool.ThreadPool, which can easily split the work into chunks to be processed across different threads:
from multiprocessing.pool import ThreadPool
from pyrodigal import GeneFinder

sequences = [ ... ] # a list of sequences to process
gene_finder = GeneFinder(meta=True) # a single gene finder object

with ThreadPool() as pool:
    # map keeps input order: genes[i] holds the predictions for sequences[i]
    genes = pool.map(gene_finder.find_genes, sequences)
Internally, this is what the Pyrodigal CLI does when the -j flag is set to any number of jobs other than 1.
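The dispatch described above can be sketched as follows. This is an assumption about the general pattern, not the actual CLI source: map_with_threads is a hypothetical helper, jobs is assumed to be a positive worker count, and len stands in for gene_finder.find_genes so that the snippet runs without Pyrodigal or any sequence data.

```python
from multiprocessing.pool import ThreadPool


def map_with_threads(func, items, jobs):
    """Apply func to every item, using a thread pool unless jobs is 1.

    Hypothetical helper: jobs is assumed to be a positive integer.
    """
    if jobs == 1:
        # a single job needs no pool: process items one after the other
        return [func(item) for item in items]
    with ThreadPool(jobs) as pool:
        # ThreadPool.map blocks until all items are done and keeps input order
        return pool.map(func, items)


# stand-in workload: with Pyrodigal, func would be gene_finder.find_genes
sequences = ["ATGAAATAG", "ATGCCC", "TTGACGTAA"]
lengths = map_with_threads(len, sequences, jobs=2)
```

Because both branches return results in input order, the caller does not need to know whether a pool was used.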
Processes
On some setups, such as workstations with virtualized CPUs, threads may be less efficient because they need a shared memory space accessible by all threads, which may cross physical boundaries. In that case, processes can be faster, despite the initial cost of copying data into each worker process.
To use processes instead of threads in the command line, use the following flag:
$ pyrodigal --pool=process ...
In the API example from above, simply import Pool from multiprocessing.pool instead of ThreadPool.
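To make the choice between threads and processes configurable from the API, the pool class can be selected by name. The sketch below is only illustrative: make_pool and the POOL_TYPES mapping are hypothetical names that are not part of Pyrodigal, and len again stands in for gene_finder.find_genes.

```python
from multiprocessing.pool import Pool, ThreadPool

# hypothetical mapping from a label to the pool class to instantiate
POOL_TYPES = {"thread": ThreadPool, "process": Pool}


def make_pool(kind="thread", jobs=None):
    """Create a thread or process pool (hypothetical helper).

    jobs=None lets the pool choose a default based on the available CPUs.
    """
    try:
        pool_class = POOL_TYPES[kind]
    except KeyError:
        raise ValueError(f"unknown pool type: {kind!r}") from None
    return pool_class(jobs)


# both pool classes expose the same map method, so calling code is unchanged
with make_pool("thread", jobs=2) as pool:
    lengths = pool.map(len, ["ATG", "ATGAAATAG"])
```

With a process pool, the mapped callable and its arguments must be picklable, and on platforms that start workers by spawning a new interpreter the pool should be created under an if __name__ == "__main__": guard.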