GeneFinder#

class pyrodigal.GeneFinder#

A configurable gene finder for genomes and metagenomes.

meta#

Whether this object is configured to find genes using the pre-trained metagenomic bins (True) or a manually created training info (False).

Type:

bool

closed#

Whether sequence ends are considered closed, which prevents proteins from running off edges when finding genes in a sequence.

Type:

bool

mask#

Prevent genes from running across regions containing unknown nucleotides.

Type:

bool

training_info#

The object storing the training information, or None if the gene finder is either in metagenomic mode or has not been trained yet.

Type:

TrainingInfo

min_gene#

The minimum length for genes to be reported by Prodigal.

Type:

int

min_edge_gene#

The minimum length for genes located on contig edges.

Type:

int

max_overlap#

The maximum number of nucleotides that can overlap between two genes on the same strand.

Type:

int

__init__(training_info=None, *, meta=False, metagenomic_bins=None, closed=False, mask=False, min_mask=50, min_gene=90, min_edge_gene=60, max_overlap=60, backend='detect')#

Instantiate and configure a new gene finder.

Parameters:

training_info (TrainingInfo, optional) – A training info instance to use in single mode without having to train first.

Keyword Arguments:
  • meta (bool) – Set to True to run in metagenomic mode, using pre-trained profiles for better results with metagenomic or progenomic inputs. Defaults to False.

  • metagenomic_bins (MetagenomicBins, optional) – The metagenomic bins to use while in meta mode. When None is given, use all models from Prodigal.

  • closed (bool) – Set to True to consider sequence ends closed, which prevents proteins from running off edges. Defaults to False.

  • mask (bool) – Prevent genes from running across regions containing unknown nucleotides. Defaults to False.

  • min_mask (int) – The minimum mask length, when region masking is enabled. Regions shorter than the given length will not be masked, which may be helpful to prevent masking of single unknown nucleotides. Defaults to 50.

  • min_gene (int) – The minimum gene length. Defaults to the value used in Prodigal (90).

  • min_edge_gene (int) – The minimum edge gene length. Defaults to the value used in Prodigal (60).

  • max_overlap (int) – The maximum number of nucleotides that can overlap between two genes on the same strand. This must be lower than or equal to the minimum gene length. Defaults to 60.

  • backend (str) – The backend implementation to use for computing the connection scoring pre-filter. Leave as "detect" to select the fastest available implementation at runtime. Mostly useful for testing.
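The min_mask rule described above can be illustrated with a small sketch in plain Python. The helper name `maskable_regions` is hypothetical, and the code assumes that unknown nucleotides are written as "N"; it only shows which runs are long enough to be masked, not how pyrodigal implements masking internally.

```python
import re
from typing import List, Tuple


def maskable_regions(sequence: str, min_mask: int = 50) -> List[Tuple[int, int]]:
    """Return (start, end) spans of runs of "N" that are at least `min_mask` long.

    Hypothetical helper illustrating the documented rule only: runs shorter
    than `min_mask` are left alone, so isolated unknown nucleotides are kept.
    """
    return [
        (match.start(), match.end())
        for match in re.finditer(r"[Nn]+", sequence)
        if match.end() - match.start() >= min_mask
    ]


# A short run of 3 unknown nucleotides followed by a long run of 60.
seq = "ATG" + "N" * 3 + "GCA" + "N" * 60 + "TAA"
regions = maskable_regions(seq, min_mask=50)
```

With the default threshold only the 60-nucleotide run qualifies; lowering `min_mask` to 3 would also select the short run.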

Added in version 0.6.4: The training_info argument.

Added in version 0.7.0: The min_gene, min_edge_gene and max_overlap arguments.

Added in version 2.0.0: The backend argument.

Added in version 3.0.0: The metagenomic_bins argument.

Added in version 3.1.0: The min_mask argument.
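As a minimal sketch of how the constructor arguments fit together, the helper below checks the documented constraint between max_overlap and min_gene before building a gene finder. Both `gene_finder_options` and `build_gene_finder` are hypothetical names introduced for illustration; the defaults are copied from the __init__ signature, and pyrodigal is imported lazily so the validation part runs without the package.

```python
def gene_finder_options(*, min_gene=90, min_edge_gene=60, max_overlap=60, **extra):
    """Collect keyword arguments for GeneFinder after checking the documented constraint.

    Hypothetical helper: max_overlap must be lower than or equal to min_gene.
    """
    if max_overlap > min_gene:
        raise ValueError(
            f"max_overlap ({max_overlap}) must be lower than or equal to min_gene ({min_gene})"
        )
    return {
        "min_gene": min_gene,
        "min_edge_gene": min_edge_gene,
        "max_overlap": max_overlap,
        **extra,
    }


def build_gene_finder(training_info=None, **options):
    """Create a configured gene finder; requires pyrodigal to be installed."""
    import pyrodigal  # imported lazily so the helper above works without the package

    return pyrodigal.GeneFinder(training_info, **gene_finder_options(**options))


# Options for a metagenomic gene finder with closed ends and a tighter overlap limit.
options = gene_finder_options(meta=True, closed=True, max_overlap=30)
```

Calling `build_gene_finder(meta=True, closed=True)` would then return a gene finder in metagenomic mode, while passing a TrainingInfo as the first argument would configure single mode without training first.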

find_genes(sequence)#

Find all the genes in the input DNA sequence.

Parameters:

sequence (str or buffer) – The nucleotide sequence to use, either as a string of nucleotides, or as an object implementing the buffer protocol. Letters that are not one of the usual nucleotides (“A”, “T”, “G”, “C”) will be ignored.

Returns:

Genes – A list of all the genes found in the input.

Raises:
  • MemoryError – When allocation of an internal buffer fails.

  • RuntimeError – When calling this method in single mode without having called train first or provided a training_info.

  • TypeError – When sequence does not implement the buffer protocol.
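A possible way to call find_genes and summarize its result is sketched below. The function names are hypothetical, and the sketch assumes that each item of the returned Genes collection exposes `begin`, `end` and `strand` attributes, which this section does not document. Metagenomic mode is used because it needs no prior training.

```python
def gene_coordinates(genes):
    """Return (begin, end, strand) for each predicted gene.

    Assumes each gene exposes `begin`, `end` and `strand` attributes.
    """
    return [(gene.begin, gene.end, gene.strand) for gene in genes]


def find_gene_coordinates(sequence):
    """Run metagenomic-mode gene finding on one sequence; requires pyrodigal."""
    import pyrodigal  # imported lazily so the summary helper works on its own

    finder = pyrodigal.GeneFinder(meta=True)  # meta mode needs no call to train
    return gene_coordinates(finder.find_genes(sequence))
```

The sequence can be passed as a str or as any object implementing the buffer protocol, such as bytes.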

train(sequence, *sequences, force_nonsd=False, start_weight=4.35, translation_table=11)#

Search parameters for the ORF finder using a training sequence.

If more than one sequence is provided, they are assumed to be different contigs of the same genome. As in the original Prodigal implementation, they are merged into a single sequence, joined by TTAATTAATTAA linkers.
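Conceptually, the merge amounts to joining the contigs with the linker, as in the sketch below. The `merge_contigs` helper is hypothetical and works on plain strings; it illustrates the described behavior and is not the code pyrodigal runs internally.

```python
LINKER = "TTAATTAATTAA"  # linker sequence quoted in the description above


def merge_contigs(*contigs: str) -> str:
    """Join contigs of the same genome into one training sequence (illustration only)."""
    return LINKER.join(contigs)


merged = merge_contigs("ATGAAA", "ATGCCC")
```

The linker contains a TAA stop codon in every reading frame, so an open reading frame cannot span two contigs in the merged sequence.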

Parameters:

sequence (str or buffer) – The nucleotide sequence to use, either as a string of nucleotides, or as an object implementing the buffer protocol.

Keyword Arguments:
  • force_nonsd (bool, optional) – Set to True to bypass the heuristic algorithm that tries to determine whether the organism the training sequence belongs to uses a Shine-Dalgarno motif.

  • start_weight (float, optional) – The start score weight to use. The default value has been manually selected by the Prodigal authors as an appropriate value for 99% of genomes.

  • translation_table (int, optional) – The translation table to use. See the List of genetic codes page for the available values.

Returns:

TrainingInfo – The resulting training info, which can be saved to disk and used later on to create a new GeneFinder instance.

Raises:
  • RuntimeError – When calling this method while in metagenomic mode.

  • TypeError – When sequence does not implement the buffer protocol.

  • ValueError – When translation_table is not a valid genetic code number, or when sequence is too short to train.

  • MemoryError – When allocation of an internal buffer with the system allocator fails.
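One way to combine train and find_genes is sketched below: train on the sequence in single mode, and fall back to metagenomic mode when train raises ValueError. The function name and the fallback policy are assumptions made for illustration; note that the ValueError is also raised for an invalid translation_table, which this sketch does not distinguish from a sequence that is too short.

```python
def find_genes_with_fallback(sequence, translation_table=11):
    """Train on `sequence` when possible, else use metagenomic mode; requires pyrodigal."""
    import pyrodigal  # imported lazily; the package must be installed to call this

    finder = pyrodigal.GeneFinder()  # single mode, not trained yet
    try:
        # train returns a TrainingInfo and leaves the finder ready for find_genes.
        finder.train(sequence, translation_table=translation_table)
    except ValueError:
        # Hypothetical policy: fall back to the pre-trained metagenomic profiles.
        finder = pyrodigal.GeneFinder(meta=True)
    return finder.find_genes(sequence)
```

The TrainingInfo returned by train could instead be kept and passed to a new GeneFinder later, which avoids retraining for each run on the same genome.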