Genes

class pyrodigal.Genes

A list of raw genes found by Prodigal in a single sequence.

sequence

The compressed input sequence for which the gene predictions were made.

Type:

pyrodigal.Sequence

training_info

A reference to the training info these predictions were obtained with.

Type:

pyrodigal.TrainingInfo

nodes

A collection of raw nodes found in the input sequence.

Type:

pyrodigal.Nodes

meta

Whether these genes have been found after a run in metagenomic mode, or in single mode.

Type:

bool

metagenomic_bin

The metagenomic model with which these genes have been found.

Type:

pyrodigal.MetagenomicBin

Added in version 0.5.4.

Added in version 2.0.0: The meta attribute.

Added in version 3.0.0: The metagenomic_bin attribute.

write_genbank(file, sequence_id, division='BCT', date=None, translation_table=None, strict_translation=True)

Write predicted genes and sequence to file in GenBank format.

Parameters:
  • file (io.TextIOBase) – A file opened in text mode, to which the GenBank record is written.

  • sequence_id (str) – The identifier of the sequence these genes were extracted from.

  • division (str) – The GenBank division to write in the GenBank header. Should often be BCT (for bacterial sequences) given the scope of Prodigal.

  • date (datetime.date, optional) – The date to write in the GenBank header, or None to use the current date.

  • translation_table (int or None) – A translation table to pass to Gene.translate, or None to use the translation table from the TrainingInfo these genes were obtained with.

  • strict_translation (bool) – Whether to handle ambiguous codons in strict mode when translating. See the strict parameter of Gene.translate for more information.

Returns:

int – The number of bytes written to the file.

Note

The original Prodigal outputs incomplete GenBank files containing only the coordinates of the predicted genes inside CDS features, without including the translation or the original sequence. Since this is not the most useful output, and often requires additional post-processing, Pyrodigal outputs a complete GenBank record instead.

Added in version 3.0.0.

Added in version 3.4.0: The translation_table and strict_translation parameters.
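The call described above can be sketched as follows. This is a minimal example, assuming genes is a pyrodigal.Genes object obtained elsewhere (for instance from a GeneFinder); the helper name genbank_text is hypothetical and only the parameters documented above are used.

```python
import io


def genbank_text(genes, sequence_id, date=None):
    """Render `genes` as a GenBank record held in memory.

    `genes` is assumed to expose the write_genbank method documented above.
    Returns the record text and the count reported by write_genbank.
    """
    buffer = io.StringIO()  # any file opened in text mode works here
    written = genes.write_genbank(
        buffer,
        sequence_id=sequence_id,
        division="BCT",  # bacterial division, the documented default
        date=date,       # None lets the method use the current date
    )
    return buffer.getvalue(), written
```

Passing an explicit datetime.date is useful when the output must be reproducible between runs.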

write_genes(file, sequence_id, width=70, full_id=False)

Write nucleotide sequences of genes to file in FASTA format.

Parameters:
  • file (io.TextIOBase) – A file opened in text mode, to which the nucleotide sequences are written.

  • sequence_id (str) – The identifier of the sequence these genes were extracted from.

  • width (int) – The width to use to wrap sequence lines. Prodigal uses 70 for nucleotide sequences.

  • full_id (bool) – Pass True to use the full sequence identifier in the header of each record, or False to use the sequence numbering such as the one used in Prodigal.

Returns:

int – The number of bytes written to the file.

Changed in version 2.0.0: Replaced optional prefix argument with sequence_id.

Added in version 3.4.0: The full_id parameter.
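Because the sequences are wrapped at width characters, downstream code usually has to join the lines of each record again. The sketch below assumes genes is a pyrodigal.Genes object; gene_fasta_records is a hypothetical helper, and the parsing is plain FASTA handling rather than anything provided by Pyrodigal.

```python
import io


def gene_fasta_records(genes, sequence_id, width=70):
    """Write gene sequences to memory and return (header, sequence) pairs."""
    buffer = io.StringIO()
    genes.write_genes(buffer, sequence_id=sequence_id, width=width)

    records = []
    for line in buffer.getvalue().splitlines():
        if line.startswith(">"):
            # A new record starts: keep the header without the leading ">".
            records.append([line[1:], ""])
        elif records:
            # A wrapped sequence line: append it to the current record.
            records[-1][1] += line
    return [(header, sequence) for header, sequence in records]
```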

write_gff(file, sequence_id, header=True, include_translation_table=False, full_id=True)

Write the genes to file in General Feature Format.

Parameters:
  • file (io.TextIOBase) – A file opened in text mode, to which the features are written.

  • sequence_id (str) – The identifier of the sequence these genes were extracted from. Used in the first column of the GFF-formatted output.

  • header (bool) – True to write a GFF header line, False otherwise.

  • include_translation_table (bool) – True to write the translation table used to predict the genes in the GFF attributes, False otherwise. Useful for genes that were predicted from meta mode, since the different metagenomic models have different translation tables.

  • full_id (bool) – Pass True to use the full sequence identifier in the header of each record, or False to use the sequence numbering such as the one used in Prodigal.

Returns:

int – The number of bytes written to the file.

Changed in version 2.0.0: Replaced optional prefix argument with sequence_id.

Added in version 3.0.0: The include_translation_table argument.

Added in version 3.4.0: The full_id parameter.
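Since the translation table is mostly informative for metagenomic predictions, one option is to enable include_translation_table only when the meta attribute is set. A minimal sketch, assuming genes is a pyrodigal.Genes object; the helper name gff_text is hypothetical.

```python
import io


def gff_text(genes, sequence_id):
    """Return the GFF output for `genes` as a string."""
    buffer = io.StringIO()
    genes.write_gff(
        buffer,
        sequence_id=sequence_id,
        header=True,
        # Record the translation table only for metagenomic runs, where
        # different models may use different tables.
        include_translation_table=bool(genes.meta),
    )
    return buffer.getvalue()
```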

write_scores(file, sequence_id, header=True)

Write the start scores to file in tabular format.

Parameters:
  • file (io.TextIOBase) – A file opened in text mode, to which the scores are written.

  • sequence_id (str) – The identifier of the sequence these genes were extracted from.

  • header (bool) – True to write a header line, False otherwise.

Returns:

int – The number of bytes written to the file.

Added in version 0.7.0.

Added in version 2.0.0: The sequence_id argument.

write_translations(file, sequence_id, width=60, translation_table=None, include_stop=True, strict_translation=True, full_id=False)

Write protein sequences of genes to file in FASTA format.

Parameters:
  • file (io.TextIOBase) – A file opened in text mode, to which the protein sequences are written.

  • sequence_id (str) – The identifier of the sequence these genes were extracted from.

  • width (int) – The width to use to wrap sequence lines. Prodigal uses 60 for protein sequences.

  • translation_table (int, optional) – A different translation table to use to translate the genes. If None is given, use the one from the training info.

  • include_stop (bool) – Pass False to disable translating the STOP codon into a star character (*) for complete genes. True keeps the default behaviour of Prodigal, but the trailing star often causes problems for other programs or libraries that use the FASTA file for downstream processing.

  • strict_translation (bool) – Whether to handle ambiguous codons in strict mode when translating. See the strict parameter of Gene.translate for more information.

  • full_id (bool) – Pass True to use the full sequence identifier in the header of each record, or False to use the sequence numbering such as the one used in Prodigal.

Returns:

int – The number of bytes written to the file.

Changed in version 2.0.0: Replaced optional prefix argument with sequence_id.

Added in version 3.0.0: The include_stop argument.

Added in version 3.4.0: The strict_translation and full_id parameters.
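When the protein FASTA is meant for other tools, disabling the stop character is a common choice. The following sketch assumes genes is a pyrodigal.Genes object; protein_fasta is a hypothetical helper name.

```python
import io


def protein_fasta(genes, sequence_id, width=60):
    """Return protein sequences in FASTA format, without the trailing '*'."""
    buffer = io.StringIO()
    genes.write_translations(
        buffer,
        sequence_id=sequence_id,
        width=width,
        include_stop=False,  # do not emit '*' for the STOP codon
    )
    return buffer.getvalue()
```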

score

The total score of the gene path in the sequence.

This value can be used to compare the genes obtained on the same sequence with different TrainingInfo parameters, and find the best set of parameters for a sequence.

Added in version 3.4.0.

Type:

float
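The comparison described for score can be sketched as below. It assumes each candidate is a pyrodigal.Genes object predicted on the same sequence with a different TrainingInfo; best_prediction and the labels are hypothetical.

```python
def best_prediction(candidates):
    """Pick the highest-scoring prediction.

    `candidates` maps a label (for example the name of a parameter set)
    to a Genes object obtained on the same input sequence.
    Returns the (label, genes) pair with the largest total score.
    """
    if not candidates:
        raise ValueError("no candidate predictions given")
    return max(candidates.items(), key=lambda item: item[1].score)
```

The scores are only comparable when every candidate was predicted on the same sequence.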

class pyrodigal.Gene

A single raw gene found by Prodigal within a DNA sequence.

Caution

The gene coordinates follow the conventions of Prodigal, not Python: they are 1-based and end-inclusive. To index the original sequence with a gene object, convert back to zero-based coordinates: sequence[gene.begin-1:gene.end].

Added in version 0.5.4.
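The coordinate conversion from the caution above, worked through on a toy sequence. The function gene_slice and the example sequence are illustrative only.

```python
def gene_slice(sequence, begin, end):
    """Extract a region given 1-based, end-inclusive coordinates.

    Python slices are 0-based and end-exclusive, so only the start
    needs shifting: [begin - 1:end].
    """
    return sequence[begin - 1:end]


sequence = "AAATGCCCTAAGG"
# A gene reported with begin=3 and end=11 covers 11 - 3 + 1 = 9 bases.
region = gene_slice(sequence, 3, 11)
print(region)  # ATGCCCTAA
```

For a gene on the reverse strand, the slice still has to be reverse-complemented.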

confidence()

Estimate the confidence of the prediction.

Returns:

float – A confidence percentage (between 0 and 100).
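A typical use of confidence() is to discard low-confidence predictions. A minimal sketch, assuming genes is an iterable of pyrodigal.Gene objects; the helper name and the 90 percent threshold are arbitrary choices, not recommendations from Pyrodigal.

```python
def confident_genes(genes, min_confidence=90.0):
    """Keep the genes whose confidence is at least `min_confidence` (0 to 100)."""
    return [gene for gene in genes if gene.confidence() >= min_confidence]
```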

sequence()

Build the nucleotide sequence of this predicted gene.

This function takes care of reverse-complementing the sequence if it is on the reverse strand.

Note

Since Pyrodigal uses a generic symbol for unknown nucleotides, any unknown characters in the original sequence will be rendered with an N.

Added in version 0.5.4.
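To illustrate what sequence() takes care of, here is a pure-Python sketch of the same idea: slice with Prodigal coordinates, render unknown symbols as N, and reverse-complement on the reverse strand. It is not Pyrodigal's implementation, and gene_nucleotides is a hypothetical name.

```python
# Complement table for the four known bases and the generic unknown symbol.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def gene_nucleotides(sequence, begin, end, strand):
    """Return the coding-strand nucleotides for 1-based inclusive coordinates."""
    fragment = sequence[begin - 1:end].upper()
    # Any symbol other than A, C, G or T is rendered as N.
    fragment = "".join(base if base in "ACGT" else "N" for base in fragment)
    if strand == -1:
        # Reverse strand: complement every base, then reverse the order.
        fragment = fragment.translate(_COMPLEMENT)[::-1]
    return fragment
```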

translate(translation_table=None, unknown_residue=88, include_stop=True, strict=True)

Translate the predicted gene into a protein sequence.

Parameters:
  • translation_table (int, optional) – An alternative translation table to use to translate the gene. Use None (the default) to translate using the translation table this gene was found with.

  • unknown_residue (str) – A single character to use for residues translated from codons with unknown nucleotides. The default shown in the signature, 88, is the ASCII code of X.

  • include_stop (bool) – Pass False to disable translating the STOP codon into a star character (*) for complete genes. True keeps the default behaviour of Prodigal, but the trailing star often causes problems for other programs or libraries that use the protein sequence for downstream processing.

  • strict (bool) – If True (the default), translate codons containing any unknown nucleotide as unknown_residue. If False, attempt to translate some incomplete codons when there is no ambiguity, taking into account the translation table (e.g. CCN, which always translates to Pro).

Returns:

str – The protein sequence as a string using the right translation table and the standard single letter alphabet for proteins.

Raises:

ValueError – when translation_table is not a valid genetic code number.

Added in version 3.0.0: The include_stop keyword argument.

Added in version 3.4.0: The strict keyword argument.

Changed in version 3.4.0: Added support for additional translation tables 26 to 33.
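The difference between strict and non-strict handling of ambiguous codons can be illustrated in pure Python. This sketch uses the amino acid assignments of the standard genetic code and resolves a codon only when every possible completion gives the same residue. It illustrates the idea described for the strict parameter and is not Pyrodigal's implementation; all names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_codon(codon, strict=True, unknown_residue="X"):
    """Translate one codon, optionally resolving unambiguous unknowns."""
    if codon in CODON_TABLE:
        return CODON_TABLE[codon]
    if strict:
        return unknown_residue
    # Expand every unknown symbol to the four bases and translate each
    # possible codon; accept the result only if they all agree.
    options = [base if base in BASES else BASES for base in codon]
    residues = {CODON_TABLE["".join(c)] for c in product(*options)}
    return residues.pop() if len(residues) == 1 else unknown_residue


def translate(sequence, strict=True, unknown_residue="X"):
    """Translate a nucleotide string codon by codon, ignoring a partial tail."""
    usable = len(sequence) - len(sequence) % 3
    return "".join(
        translate_codon(sequence[i:i + 3], strict, unknown_residue)
        for i in range(0, usable, 3)
    )


print(translate("ATGCCNTAA", strict=True))   # MX*
print(translate("ATGCCNTAA", strict=False))  # MP*
```

CCN resolves to Pro because CCT, CCC, CCA and CCG all encode it, whereas ATN stays unknown because ATG encodes Met and the other three encode Ile.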

begin

The coordinate at which the gene begins.

Type:

int

cscore

The coding score for the start node, based on 6-mer usage.

Added in version 0.5.1.

Type:

float

end

The coordinate at which the gene ends.

Type:

int

gc_cont

The GC content of the gene (between 0 and 1).

Type:

float

partial_begin

Whether the gene overlaps with the start of the sequence.

Type:

bool

partial_end

Whether the gene overlaps with the end of the sequence.

Type:

bool

rbs_motif

The motif of the Ribosome Binding Site.

Possible non-None values are GGA/GAG/AGG, 3Base/5BMM, 4Base/6BMM, AGxAG, GGxGG, AGGAG(G)/GGAGG, AGGA, AGGA/GGAG/GAGG, GGAG/GAGG, AGGAG/GGAGG, AGGAG, GGAGG or AGGAGG.

Type:

str, optional

rbs_spacer

The number of bases between the RBS and the CDS.

Possible non-None values are 3-4bp, 5-10bp, 11-12bp or 13-15bp.

Type:

str, optional
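Since rbs_spacer is a string describing a range, numeric comparisons require parsing it first. A minimal sketch based only on the values listed above; spacer_range is a hypothetical helper.

```python
import re


def spacer_range(rbs_spacer):
    """Convert a spacer string such as '5-10bp' to a (min, max) tuple.

    Returns None when no spacer is reported.
    """
    if rbs_spacer is None:
        return None
    match = re.fullmatch(r"(\d+)-(\d+)bp", rbs_spacer)
    if match is None:
        raise ValueError("unexpected spacer value: {!r}".format(rbs_spacer))
    return int(match.group(1)), int(match.group(2))
```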

rscore

The score for the RBS motif.

Added in version 0.5.1.

Type:

float

score

The gene score, sum of the coding and start codon scores.

Added in version 0.7.3.

Type:

float

sscore

The score for the strength of the start codon.

Added in version 0.5.1.

Type:

float

start_node

The start node at the beginning of this gene.

Type:

Node

start_type

The start codon of this gene.

Can be one of ATG, GTG or TTG, or Edge if the GeneFinder has been initialized in open ends mode and the gene starts right at the beginning of the input sequence.

Type:

str
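A quick way to inspect start codon usage across a set of predictions is to tally start_type. The sketch assumes genes is an iterable of pyrodigal.Gene objects; start_type_counts is a hypothetical helper.

```python
from collections import Counter


def start_type_counts(genes):
    """Count how many genes use each start type (ATG, GTG, TTG or Edge)."""
    return Counter(gene.start_type for gene in genes)
```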

stop_node

The stop node at the end of this gene.

Type:

Node

strand

-1 if the gene is on the reverse strand, +1 otherwise.

Type:

int

translation_table

The translation table used to find the gene.

Type:

int

tscore

The score for the codon kind (ATG/GTG/TTG).

Added in version 0.5.1.

Type:

float

uscore

The score for the upstream regions.

Added in version 0.5.1.

Type:

float
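The attributes above can be gathered into a plain dictionary, for instance to build a table of predictions. A minimal sketch, assuming gene is a pyrodigal.Gene object; gene_record and the chosen keys are hypothetical.

```python
def gene_record(gene):
    """Summarize a gene as a dictionary of plain Python values."""
    return {
        "begin": gene.begin,
        "end": gene.end,
        # Coordinates are 1-based and end-inclusive, hence the + 1.
        "length": gene.end - gene.begin + 1,
        "strand": gene.strand,
        "start_type": gene.start_type,
        # A gene is partial if it runs over either edge of the sequence.
        "partial": bool(gene.partial_begin or gene.partial_end),
        "gc_cont": gene.gc_cont,
        "score": gene.score,
    }
```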