OrfFinder

class pyrodigal.OrfFinder

A configurable ORF finder for genomes and metagenomes.

meta

Whether this object is configured to find genes using the pre-trained metagenomic bins (True) or manually created training information (False).

Type

bool

closed

Whether sequence ends are considered closed when finding genes in a sequence, which prevents proteins from running off edges.

Type

bool

mask

Prevent genes from running across regions containing unknown nucleotides.

Type

bool

training_info

The object storing the training information, or None if the object is in metagenomic mode or hasn’t been trained yet.

Type

TrainingInfo

min_gene

The minimum gene length.

Type

int

min_edge_gene

The minimum edge gene length.

Type

int

max_overlap

The maximum number of nucleotides that can overlap between two genes on the same strand.

Type

int

__init__(training_info=None, *, meta=False, closed=False, mask=False, min_gene=90, min_edge_gene=60, max_overlap=60)

Instantiate and configure a new ORF finder.

Parameters

training_info (TrainingInfo, optional) – A training info instance to use in single mode without having to train first.

Keyword Arguments
  • meta (bool) – Set to True to run in metagenomic mode, using pre-trained profiles for better results with metagenomic or progenomic inputs. Defaults to False.

  • closed (bool) – Set to True to consider sequence ends closed, which prevents proteins from running off edges. Defaults to False.

  • mask (bool) – Prevent genes from running across regions containing unknown nucleotides. Defaults to False.

  • min_gene (int) – The minimum gene length. Defaults to 90, the value used in Prodigal.

  • min_edge_gene (int) – The minimum edge gene length. Defaults to 60, the value used in Prodigal.

  • max_overlap (int) – The maximum number of nucleotides that can overlap between two genes on the same strand. This must be lower than or equal to the minimum gene length. Defaults to 60.

Changed in version 0.6.4: Added the training_info argument.

Changed in version 0.7.0: Added min_gene, min_edge_gene and max_overlap.
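The keyword arguments above can be collected and checked before a finder is built. The following is a minimal sketch and not part of Pyrodigal: `orf_finder_options` is a hypothetical helper, its defaults are copied from the signature above, and the `ValueError` it raises is an assumption of this sketch rather than documented library behavior.

```python
def orf_finder_options(*, meta=False, closed=False, mask=False,
                       min_gene=90, min_edge_gene=60, max_overlap=60):
    """Hypothetical helper: gather OrfFinder keyword arguments.

    Enforces the one constraint stated in the documentation, namely that
    max_overlap must be lower than or equal to min_gene.
    """
    if max_overlap > min_gene:
        raise ValueError(
            f"max_overlap ({max_overlap}) must be <= min_gene ({min_gene})"
        )
    return {
        "meta": meta,
        "closed": closed,
        "mask": mask,
        "min_gene": min_gene,
        "min_edge_gene": min_edge_gene,
        "max_overlap": max_overlap,
    }


# Metagenomic mode with closed sequence ends; other options keep their defaults.
options = orf_finder_options(meta=True, closed=True)

# With Pyrodigal installed, the options can be forwarded to the constructor:
#     import pyrodigal
#     orf_finder = pyrodigal.OrfFinder(**options)
```

Keeping the check in a separate helper lets a configuration be validated without importing the library.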

find_genes(sequence)

Find all the genes in the input DNA sequence.

Parameters

sequence (str or buffer) – The nucleotide sequence to use, either as a string of nucleotides, or as an object implementing the buffer protocol. Letters that do not correspond to a usual nucleotide (not any of “ATGC”) will be ignored.

Returns

Genes – A list of all the genes found in the input.

Raises
  • MemoryError – When allocation of an internal buffer fails.

  • RuntimeError – When calling this method in single mode without training information, that is, without having called train first or passed training_info to the constructor.

  • TypeError – When sequence does not implement the buffer protocol.
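The RuntimeError above can be avoided by checking the finder's state before calling find_genes. The following is a minimal sketch that relies only on the attributes and methods documented on this page (`meta`, `training_info`, `train`, `find_genes`); the `predict_genes` wrapper itself is hypothetical. It assumes that `train` populates `training_info`, as the attribute description suggests.

```python
def predict_genes(orf_finder, sequence):
    """Hypothetical wrapper: find genes, training first when required.

    In single mode (meta is False) an untrained finder has
    training_info set to None, so it is trained on the input
    sequence before genes are searched.
    """
    if not orf_finder.meta and orf_finder.training_info is None:
        # train() is documented to raise ValueError when the
        # sequence is too short; that error is left to propagate.
        orf_finder.train(sequence)
    return orf_finder.find_genes(sequence)
```

With Pyrodigal installed, `predict_genes(pyrodigal.OrfFinder(meta=True), sequence)` would skip the training step, since metagenomic mode uses pre-trained profiles.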

train(sequence, *sequences, force_nonsd=False, start_weight=4.35, translation_table=11)

Search parameters for the ORF finder using a training sequence.

If more than one sequence is provided, they are assumed to be different contigs that are part of the same genome. As in the original Prodigal implementation, they will be merged into a single sequence joined by TTAATTAATTAA linkers.
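The merging of contigs can be illustrated in plain Python. This is only a sketch of the joining described above, for string inputs: `merge_contigs` is a hypothetical function, and the library's internal handling (for instance of buffer inputs or of sequence ends) may differ.

```python
# Linker used to join contigs, as stated in the description above.
LINKER = "TTAATTAATTAA"


def merge_contigs(sequence, *sequences):
    """Hypothetical illustration: join contigs into one training sequence."""
    return LINKER.join((sequence, *sequences))


merged = merge_contigs("ATGAAA", "CCCTAA")
print(merged)  # ATGAAATTAATTAATTAACCCTAA
```

The linker contains a stop codon in every reading frame on both strands, so no open reading frame can span two contigs in the merged sequence.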

Parameters

sequence (str or buffer) – The nucleotide sequence to use, either as a string of nucleotides, or as an object implementing the buffer protocol.

Keyword Arguments
  • force_nonsd (bool, optional) – Set to True to bypass the heuristic algorithm that tries to determine if the organism the training sequence belongs to uses a Shine-Dalgarno motif or not.

  • start_weight (float, optional) – The start score weight to use. The default value has been manually selected by the Prodigal authors as an appropriate value for 99% of genomes.

  • translation_table (int, optional) – The translation table to use. Check the Wikipedia page listing all genetic codes for the available values.

Returns

TrainingInfo – The resulting training info, which can be saved to disk and used later on to create a new OrfFinder instance.

Raises
  • MemoryError – When allocation of an internal buffer fails.

  • RuntimeError – When calling this method while in metagenomic mode.

  • TypeError – When sequence does not implement the buffer protocol.

  • ValueError – When translation_table is not a valid genetic code number, or when sequence is too short to train.
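A typical single-mode workflow trains once on all contigs of a genome, then searches each contig separately. The following is a minimal sketch using only the `train` and `find_genes` signatures documented on this page; `train_and_predict` is a hypothetical helper, and `orf_finder` is assumed to be an `OrfFinder` created with `meta=False`.

```python
def train_and_predict(orf_finder, contigs, translation_table=11):
    """Hypothetical helper: train on all contigs, then find genes in each.

    Returns the training info produced by train() together with
    one find_genes() result per contig, in input order.
    """
    if not contigs:
        raise ValueError("at least one contig is required")
    first, *rest = contigs
    # Extra contigs are passed as additional positional arguments,
    # matching the train(sequence, *sequences, ...) signature.
    training_info = orf_finder.train(
        first, *rest, translation_table=translation_table
    )
    genes_per_contig = [orf_finder.find_genes(contig) for contig in contigs]
    return training_info, genes_per_contig
```

The returned training info could then be passed as `training_info` to a new `OrfFinder`, as described for the constructor, so that later runs skip the training step.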