GeneFinder

class pyrodigal.GeneFinder

A configurable gene finder for genomes and metagenomes.

meta

Whether this object is configured to find genes using the pre-trained metagenomic bins (True) or manually created training information (False).

Type:

bool

closed

Whether sequence ends are considered closed, which prevents proteins from running off edges when finding genes in a sequence.

Type:

bool

mask

Prevent genes from running across regions containing unknown nucleotides.

Type:

bool

training_info

The object storing the training information, or None if the object is in metagenomic mode or hasn’t been trained yet.

Type:

TrainingInfo

min_gene

The minimum gene length.

Type:

int

min_edge_gene

The minimum edge gene length.

Type:

int

max_overlap

The maximum number of nucleotides that can overlap between two genes on the same strand.

Type:

int

__init__()

Instantiate and configure a new ORF finder.

Parameters:

training_info (TrainingInfo, optional) – A training info instance to use in single mode without having to train first.

Keyword Arguments:
  • meta (bool) – Set to True to run in metagenomic mode, using pre-trained profiles for better results with metagenomic or progenomic inputs. Defaults to False.

  • metagenomic_bins (MetagenomicBins, optional) – The metagenomic bins to use while in meta mode. When None is given, use all models from Prodigal.

  • closed (bool) – Set to True to consider sequence ends closed, which prevents proteins from running off edges. Defaults to False.

  • mask (bool) – Prevent genes from running across regions containing unknown nucleotides. Defaults to False.

  • min_mask (int) – The minimum mask length, when region masking is enabled. Regions shorter than the given length will not be masked, which may be helpful to prevent masking of single unknown nucleotides.

  • min_gene (int) – The minimum gene length. Defaults to the value used in Prodigal.

  • min_edge_gene (int) – The minimum edge gene length. Defaults to the value used in Prodigal.

  • max_overlap (int) – The maximum number of nucleotides that can overlap between two genes on the same strand. This must be lower than or equal to the minimum gene length.

  • backend (str) – The backend implementation to use for computing the connection scoring pre-filter. Leave as "detect" to select the fastest available implementation at runtime. Mostly useful for testing.

New in version 0.6.4: The training_info argument.

New in version 0.7.0: The min_gene, min_edge_gene and max_overlap arguments.

New in version 2.0.0: The backend argument.

New in version 3.0.0: The metagenomic_bins argument.

New in version 3.0.2: The min_mask argument.
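As a minimal sketch of the options above, the following helper builds a finder with explicit settings. The names check_overlap and build_finder are hypothetical, and the numeric values are illustrative rather than Prodigal's defaults; the check only mirrors the documented rule that max_overlap must not exceed min_gene.

```python
def check_overlap(min_gene: int, max_overlap: int) -> None:
    """Mirror the documented rule: max_overlap must be <= min_gene."""
    if max_overlap > min_gene:
        raise ValueError(
            f"max_overlap ({max_overlap}) must not exceed min_gene ({min_gene})"
        )


def build_finder(meta=True, closed=False, mask=False, min_gene=120, max_overlap=30):
    """Hypothetical helper: validate the settings, then build a GeneFinder."""
    # Imported lazily so that check_overlap stays usable without pyrodigal.
    import pyrodigal

    check_overlap(min_gene, max_overlap)
    return pyrodigal.GeneFinder(
        meta=meta,
        closed=closed,
        mask=mask,
        min_gene=min_gene,
        max_overlap=max_overlap,
    )
```

With pyrodigal installed, build_finder(closed=True) would be expected to return a metagenomic-mode finder whose closed attribute is True.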

find_genes(sequence)

Find all the genes in the input DNA sequence.

Parameters:

sequence (str or buffer) – The nucleotide sequence to use, either as a string of nucleotides, or as an object implementing the buffer protocol. Letters that do not correspond to a usual nucleotide (any letter other than A, T, G or C) will be ignored.

Returns:

Genes – A list of all the genes found in the input.

Raises:
  • MemoryError – When allocation of an internal buffer fails.

  • RuntimeError – When this method is called in single mode before the finder has been trained (train has not been called and no training_info was given).

  • TypeError – When sequence does not implement the buffer protocol.
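A short sketch of calling find_genes in metagenomic mode, which needs no prior training. The begin, end and strand attributes read from each gene are assumptions about the items of the returned Genes (they are not described in this section), and gene_coordinates and find_in_meta_mode are hypothetical helper names.

```python
def gene_coordinates(genes):
    """Collect (begin, end, strand) from each gene-like object.

    Assumes each item exposes `begin`, `end` and `strand` attributes.
    """
    return [(gene.begin, gene.end, gene.strand) for gene in genes]


def find_in_meta_mode(sequence):
    """Hypothetical helper: find genes without training, using meta mode."""
    import pyrodigal  # lazy import keeps gene_coordinates usable on its own

    finder = pyrodigal.GeneFinder(meta=True)
    return gene_coordinates(finder.find_genes(sequence))
```

Since find_genes accepts either a string or a buffer, the same helper should work with a str or a bytes object holding the nucleotides.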

train(sequence, *sequences, force_nonsd=False, start_weight=4.35, translation_table=11)

Search parameters for the ORF finder using a training sequence.

If more than one sequence is provided, they are assumed to be different contigs of the same genome. As in the original Prodigal implementation, they will be merged into a single sequence joined by TTAATTAATTAA linkers.
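The merging of contigs can be pictured with plain strings. This is only an illustration of the idea, not the library's internal code, which may differ in detail; merge_contigs and reverse_complement are hypothetical names. The linker contains a TAA stop codon in each of the three forward reading frames and is its own reverse complement, which is presumably why it is used to keep genes from spanning two contigs.

```python
LINKER = "TTAATTAATTAA"  # linker named in the documentation above


def merge_contigs(*contigs: str) -> str:
    """Join contigs into one training sequence separated by the linker."""
    return LINKER.join(contigs)


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Two toy contigs, far too short for real training, joined for illustration.
merged = merge_contigs("ATGAAA", "CCCGGG")
```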

Parameters:

sequence (str or buffer) – The nucleotide sequence to use, either as a string of nucleotides, or as an object implementing the buffer protocol.

Keyword Arguments:
  • force_nonsd (bool, optional) – Set to True to bypass the heuristic algorithm that tries to determine if the organism the training sequence belongs to uses a Shine-Dalgarno motif or not.

  • start_weight (float, optional) – The start score weight to use. The default value has been manually selected by the Prodigal authors as an appropriate value for 99% of genomes.

  • translation_table (int, optional) – The translation table to use. Check the Wikipedia page listing all genetic codes for the available values.

Returns:

TrainingInfo – The resulting training info, which can be saved to disk and used later on to create a new GeneFinder instance.

Raises:
  • MemoryError – When allocation of an internal buffer fails.

  • RuntimeError – When calling this method while in metagenomic mode.

  • TypeError – When sequence does not implement the buffer protocol.

  • ValueError – When translation_table is not a valid genetic code number, or when sequence is too short to train.
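One possible way to combine train with the training_info argument of the constructor is sketched below. The helper names train_if_possible and finder_for are hypothetical, and falling back to metagenomic mode when training raises ValueError is an assumption made for this example, not behaviour of the library.

```python
def train_if_possible(finder, sequence, *sequences, translation_table=11):
    """Return the training info, or None if train() raises ValueError.

    Per the documentation, ValueError signals an invalid translation table
    or a sequence too short to train on.
    """
    try:
        return finder.train(
            sequence, *sequences, translation_table=translation_table
        )
    except ValueError:
        return None


def finder_for(sequence, *sequences):
    """Hypothetical helper: single mode if training works, else meta mode."""
    import pyrodigal  # lazy import keeps train_if_possible usable on its own

    info = train_if_possible(pyrodigal.GeneFinder(), sequence, *sequences)
    if info is None:
        # Assumed fallback, not library behaviour: use pre-trained profiles.
        return pyrodigal.GeneFinder(meta=True)
    return pyrodigal.GeneFinder(training_info=info)
```

Because train_if_possible catches every ValueError, an invalid translation_table would also lead to the fallback; callers who want to tell the two cases apart would need to validate the table number themselves.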