GeneFinder#
- class pyrodigal.GeneFinder#
A configurable gene finder for genomes and metagenomes.
- meta#
Whether or not this object is configured to find genes using the metagenomic bins or manually created training infos.
- Type: bool
- training_info#
The object storing the training information, or None if the gene finder either is in metagenomic mode or hasn’t been trained yet.
- Type: TrainingInfo
- max_overlap#
The maximum number of nucleotides that can overlap between two genes on the same strand.
- Type: int
- __init__(training_info=None, *, meta=False, metagenomic_bins=None, closed=False, mask=False, min_mask=50, min_gene=90, min_edge_gene=60, max_overlap=60, backend='detect')#
Instantiate and configure a new gene finder.
- Parameters:
training_info (TrainingInfo, optional) – A training info instance to use in single mode without having to train first.
- Keyword Arguments:
meta (bool) – Set to True to run in metagenomic mode, using pre-trained profiles for better results with metagenomic or progenomic inputs. Defaults to False.
metagenomic_bins (MetagenomicBins, optional) – The metagenomic bins to use while in meta mode. When None is given, use all models from Prodigal.
closed (bool) – Set to True to consider sequence ends closed, which prevents proteins from running off edges. Defaults to False.
mask (bool) – Prevent genes from running across regions containing unknown nucleotides. Defaults to False.
min_mask (int) – The minimum mask length, when region masking is enabled. Regions shorter than the given length will not be masked, which may be helpful to prevent masking of single unknown nucleotides. Defaults to 50.
min_gene (int) – The minimum gene length. Defaults to the value used in Prodigal (90).
min_edge_gene (int) – The minimum edge gene length. Defaults to the value used in Prodigal (60).
max_overlap (int) – The maximum number of nucleotides that can overlap between two genes on the same strand. This must be lower than or equal to the minimum gene length. Defaults to 60.
backend (str) – The backend implementation to use for computing the connection scoring pre-filter. Leave as "detect" to select the fastest available implementation at runtime. Mostly useful for testing.
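The effect of min_mask can be illustrated with a short sketch. It only shows which regions would qualify for masking under the rule stated above (runs shorter than min_mask are left alone); it is not the library's implementation. The helper name maskable_regions is hypothetical, and treating "N" as the unknown nucleotide is an assumption.

```python
import re


def maskable_regions(sequence, min_mask=50):
    """Return half-open (start, end) spans of unknown-nucleotide runs.

    Hypothetical helper: a run qualifies only if it is at least
    `min_mask` characters long. "N"/"n" is assumed to be the
    unknown-nucleotide symbol.
    """
    if min_mask < 1:
        raise ValueError("min_mask must be a positive integer")
    pattern = re.compile(r"[Nn]{%d,}" % min_mask)
    return [match.span() for match in pattern.finditer(sequence)]


# A 3-nucleotide run and a 60-nucleotide run of unknown bases.
example = "ACGT" + "N" * 3 + "ACGT" + "N" * 60 + "AC"
long_runs_only = maskable_regions(example)            # default min_mask=50
all_runs = maskable_regions(example, min_mask=3)
```

With the default threshold only the 60-nucleotide run is reported, which matches the stated purpose of not masking isolated unknown nucleotides.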
Added in version 0.6.4: The training_info argument.
Added in version 0.7.0: The min_gene, min_edge_gene and max_overlap arguments.
Added in version 2.0.0: The backend argument.
Added in version 3.0.0: The metagenomic_bins argument.
Added in version 3.1.0: The min_mask argument.
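As a minimal sketch of how these options fit together, the hypothetical helper below collects the keyword arguments with the defaults from the signature above and checks the one documented constraint (max_overlap must not exceed min_gene) before anything is passed to the constructor. The helper names gene_finder_options and make_gene_finder are assumptions for illustration; how the library itself reports a violated constraint is not stated in this section.

```python
def gene_finder_options(*, meta=False, closed=False, mask=False,
                        min_mask=50, min_gene=90, min_edge_gene=60,
                        max_overlap=60):
    """Validate and collect GeneFinder keyword arguments (hypothetical helper)."""
    # Documented constraint: max_overlap must be <= the minimum gene length.
    if max_overlap > min_gene:
        raise ValueError(
            f"max_overlap ({max_overlap}) must be lower than or equal to "
            f"min_gene ({min_gene})"
        )
    return {
        "meta": meta,
        "closed": closed,
        "mask": mask,
        "min_mask": min_mask,
        "min_gene": min_gene,
        "min_edge_gene": min_edge_gene,
        "max_overlap": max_overlap,
    }


def make_gene_finder(training_info=None, **options):
    """Build a GeneFinder from validated options."""
    # Imported lazily so the validation above can be used without pyrodigal.
    import pyrodigal

    return pyrodigal.GeneFinder(training_info, **gene_finder_options(**options))
```

For example, make_gene_finder(meta=True, mask=True) would configure a metagenomic gene finder with region masking enabled, while make_gene_finder(min_gene=30) fails early because the default max_overlap of 60 exceeds the requested minimum gene length.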
- find_genes(sequence)#
Find all the genes in the input DNA sequence.
- Parameters:
sequence (str or buffer) – The nucleotide sequence to use, either as a string of nucleotides, or as an object implementing the buffer protocol. Letters not corresponding to a usual nucleotide (not any of “ATGC”) will be ignored.
- Returns:
Genes – A list of all the genes found in the input.
- Raises:
MemoryError – When allocation of an internal buffer fails.
RuntimeError – On calling this method without having called train before while in single mode.
TypeError – When sequence does not implement the buffer protocol.
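A short usage sketch for find_genes follows. The helper summarize_genes is hypothetical and works with any object exposing find_genes; the attribute names begin, end and strand are assumed from pyrodigal's Gene class, which is not described in this section. The second function shows the metagenomic case, where no call to train is needed.

```python
def summarize_genes(gene_finder, sequence):
    """Return (begin, end, strand) for every gene predicted in `sequence`.

    `sequence` may be a str or any object implementing the buffer
    protocol (for instance bytes), as documented for find_genes.
    """
    genes = gene_finder.find_genes(sequence)
    # begin/end/strand are assumed attribute names of each predicted gene.
    return [(gene.begin, gene.end, gene.strand) for gene in genes]


def find_genes_meta(sequence):
    """Predict genes in metagenomic mode, which requires no training."""
    # Imported lazily so summarize_genes stays usable on its own.
    import pyrodigal

    gene_finder = pyrodigal.GeneFinder(meta=True)
    return summarize_genes(gene_finder, sequence)
```

In single mode (meta=False) without a training_info, the same call is documented to raise RuntimeError until train has been called.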
- train(sequence, *sequences, force_nonsd=False, start_weight=4.35, translation_table=11)#
Search parameters for the ORF finder using a training sequence.
If more than one sequence is provided, it is assumed that they are different contigs of the same genome. Like in the original Prodigal implementation, they will be merged together in a single sequence joined by TTAATTAATTAA linkers.
- Parameters:
sequence (str or buffer) – The nucleotide sequence to use, either as a string of nucleotides, or as an object implementing the buffer protocol.
- Keyword Arguments:
force_nonsd (bool, optional) – Set to True to bypass the heuristic algorithm that tries to determine if the organism the training sequence belongs to uses a Shine-Dalgarno motif or not.
start_weight (float, optional) – The start score weight to use. The default value has been manually selected by the Prodigal authors as an appropriate value for 99% of genomes.
translation_table (int, optional) – The translation table to use. See the List of genetic codes page for the available values.
- Returns:
TrainingInfo – The resulting training info, which can be saved to disk and used later on to create a new GeneFinder instance.
- Raises:
RuntimeError – When calling this method while in metagenomic mode.
TypeError – When sequence does not implement the buffer protocol.
ValueError – When translation_table is not a valid genetic code number, or when sequence is too short to train.
MemoryError – When allocation of an internal buffer with the system allocator fails.
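The contig merging described above can be sketched in plain Python. This only illustrates the linker, not the library's internal code, and the helper names are hypothetical. One property that can be checked directly is that TTAATTAATTAA equals its own reverse complement and contains a stop codon (TAA) in every reading frame; the stop codon set below is the one from translation table 11.

```python
LINKER = "TTAATTAATTAA"
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of translation table 11


def merge_contigs(contigs):
    """Join contigs with the linker, as described for multi-sequence training."""
    return LINKER.join(contigs)


def reverse_complement(sequence):
    """Return the reverse complement of an upper-case ACGT string."""
    return sequence.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def frames_with_stop(sequence):
    """Return the frame offsets (0, 1, 2) in which a stop codon occurs."""
    hits = []
    for offset in range(3):
        codons = (sequence[i:i + 3] for i in range(offset, len(sequence) - 2, 3))
        if any(codon in STOP_CODONS for codon in codons):
            hits.append(offset)
    return hits


def train_on_contigs(contigs, translation_table=11):
    """Train a single-mode gene finder on all contigs of one genome."""
    # Imported lazily so the helpers above work without pyrodigal installed.
    import pyrodigal

    first, *rest = contigs
    gene_finder = pyrodigal.GeneFinder()
    return gene_finder.train(first, *rest, translation_table=translation_table)


merged = merge_contigs(["ATGC", "GGCC"])
```

Because the linker reads the same on both strands and interrupts all three frames, frames_with_stop(LINKER) and frames_with_stop(reverse_complement(LINKER)) both report every frame. Note that train_on_contigs is subject to the errors documented above, including ValueError when the combined input is too short to train.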