Genes

class pyrodigal.Genes

A list of raw genes found by Prodigal in a single sequence.

sequence

The compressed input sequence for which the gene predictions were made.

Type:

pyrodigal.Sequence

training_info

A reference to the training info these predictions were obtained with.

Type:

pyrodigal.TrainingInfo

nodes

A collection of raw nodes found in the input sequence.

Type:

pyrodigal.Nodes

meta

Whether these genes have been found after a run in metagenomic mode, or in single mode.

Type:

bool

metagenomic_bin

The metagenomic model with which these genes have been found.

Type:

pyrodigal.MetagenomicBin

New in version 0.5.4.

New in version 2.0.0: The meta attribute.

New in version 3.0.0: The metagenomic_bin attribute.

write_genbank(file, sequence_id, division='BCT', date=None)

Write predicted genes and sequence to file in GenBank format.

Parameters:
  • file (io.TextIOBase) – A file opened in text mode, to which the GenBank record is written.

  • sequence_id (str) – The identifier of the sequence these genes were extracted from.

  • division (str) – The GenBank division to write in the GenBank header. Usually BCT (for bacterial sequences), given the scope of Prodigal.

  • date (datetime.date, optional) – The date to write in the GenBank header, or None to use the current date.

Returns:

int – The number of bytes written to the file.

Note

The original Prodigal outputs incomplete GenBank files containing only the coordinates of the predicted genes inside CDS features, without including the translation or the original sequence. Since this is not the most useful output, and often requires additional post-processing, Pyrodigal outputs a complete GenBank record instead.

New in version 3.0.0.
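
As a minimal sketch of how this method might be called, the helper below renders a GenBank record into memory with an explicit date, so that the header does not change between runs. The name genbank_record is illustrative and not part of Pyrodigal, and genes is assumed to be a pyrodigal.Genes object (or any object exposing the same write_genbank method).

```python
import datetime
import io


def genbank_record(genes, sequence_id, date=None):
    """Return (text, written) for the GenBank record of `genes`.

    `genes` is assumed to expose the write_genbank() method documented above.
    """
    buffer = io.StringIO()  # an in-memory file opened in text mode
    # An explicit date keeps the header reproducible; None uses the current date.
    written = genes.write_genbank(buffer, sequence_id, division="BCT", date=date)
    return buffer.getvalue(), written
```

Passing date=datetime.date(2024, 1, 31), for instance, fixes the header date; leaving it as None falls back to the documented default.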

write_genes(file, sequence_id, width=70)

Write nucleotide sequences of genes to file in FASTA format.

Parameters:
  • file (io.TextIOBase) – A file opened in text mode, to which the nucleotide sequences are written.

  • sequence_id (str) – The identifier of the sequence these genes were extracted from.

  • width (int) – The width to use to wrap sequence lines. Prodigal uses 70 for nucleotide sequences.

Returns:

int – The number of bytes written to the file.

Changed in version 2.0.0: Replaced optional prefix argument with sequence_id.
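
To make the effect of width concrete, here is a simplified, pure-Python illustration of wrapping FASTA sequence lines. It is not Pyrodigal's implementation, and the function names are hypothetical.

```python
def wrap_sequence(sequence, width=70):
    """Split a sequence string into chunks of at most `width` characters."""
    return [sequence[i:i + width] for i in range(0, len(sequence), width)]


def format_fasta_record(header, sequence, width=70):
    """Format one FASTA record whose sequence lines are wrapped at `width`."""
    lines = [">" + header]
    lines.extend(wrap_sequence(sequence, width))
    return "\n".join(lines) + "\n"
```

With the default width of 70, a gene of 150 nucleotides is written as two lines of 70 characters followed by one line of 10.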

write_gff(file, sequence_id, header=True, include_translation_table=False)

Write the genes to file in General Feature Format.

Parameters:
  • file (io.TextIOBase) – A file opened in text mode, to which the features are written.

  • sequence_id (str) – The identifier of the sequence these genes were extracted from. Used in the first column of the GFF-formatted output.

  • header (bool) – True to write a GFF header line, False otherwise.

  • include_translation_table (bool) – True to write the translation table used to predict the genes in the GFF attributes, False otherwise. Useful for genes that were predicted from meta mode, since the different metagenomic models have different translation tables.

Returns:

int – The number of bytes written to the file.

Changed in version 2.0.0: Replaced optional prefix argument with sequence_id.

New in version 3.0.0: The include_translation_table argument.
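
When the genes of several sequences go into the same GFF file, the header line is typically wanted only once. The sketch below shows one way to do this with the header argument; write_combined_gff and predictions are hypothetical names, and each genes value is assumed to expose the write_gff method documented above.

```python
def write_combined_gff(predictions, file, include_translation_table=False):
    """Write the genes of several sequences to a single GFF file.

    `predictions` is an iterable of (sequence_id, genes) pairs.
    Returns the sum of the byte counts reported by each write_gff() call.
    """
    total = 0
    for index, (sequence_id, genes) in enumerate(predictions):
        total += genes.write_gff(
            file,
            sequence_id,
            header=(index == 0),  # emit the GFF header line for the first sequence only
            include_translation_table=include_translation_table,
        )
    return total
```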

write_scores(file, sequence_id, header=True)

Write the start scores to file in tabular format.

Parameters:
  • file (io.TextIOBase) – A file opened in text mode, to which the scores are written.

  • sequence_id (str) – The identifier of the sequence these genes were extracted from.

  • header (bool) – True to write a header line, False otherwise.

Returns:

int – The number of bytes written to the file.

New in version 0.7.0.

New in version 2.0.0: The sequence_id argument.

write_translations(file, sequence_id, width=60, translation_table=None, include_stop=True)

Write protein sequences of genes to file in FASTA format.

Parameters:
  • file (io.TextIOBase) – A file opened in text mode, to which the protein sequences are written.

  • sequence_id (str) – The identifier of the sequence these genes were extracted from.

  • width (int) – The width to use to wrap sequence lines. Prodigal uses 60 for protein sequences.

  • translation_table (int, optional) – A different translation table to use to translate the genes. If None is given, use the one from the training info.

  • include_stop (bool) – Pass False to disable translating the STOP codon into a star character (*) for complete genes. True keeps the default behaviour of Prodigal, but the trailing star often causes problems with other programs or libraries that use the FASTA file for downstream processing.

Returns:

int – The number of bytes written to the file.

Changed in version 2.0.0: Replaced optional prefix argument with sequence_id.

New in version 3.0.0: The include_stop argument.
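
A minimal sketch of producing protein FASTA text without the trailing star, for tools that do not accept it. The helper name proteins_without_stop is hypothetical, and genes is assumed to expose the write_translations method documented above (with the include_stop argument, available from version 3.0.0).

```python
import io


def proteins_without_stop(genes, sequence_id, width=60):
    """Return the predicted proteins as FASTA text, omitting the final '*'."""
    buffer = io.StringIO()  # an in-memory file opened in text mode
    genes.write_translations(buffer, sequence_id, width=width, include_stop=False)
    return buffer.getvalue()
```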

class pyrodigal.Gene

A single raw gene found by Prodigal within a DNA sequence.

Caution

The gene coordinates follow the conventions from Prodigal, not Python, so coordinates are 1-based, end-inclusive. To index the original sequence with a gene object, remember to switch back to zero-based coordinates: sequence[gene.begin-1:gene.end].
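
The conversion can be sketched in plain Python. The sequence and coordinates below are made up for illustration, and gene_slice is a hypothetical helper, not a Pyrodigal function.

```python
def gene_slice(begin, end):
    """Convert 1-based, end-inclusive coordinates into a Python slice."""
    return slice(begin - 1, end)


sequence = "TTATGAAATAGCC"
# A hypothetical gene covering bases 3 to 11 (1-based, both ends included).
begin, end = 3, 11
subsequence = sequence[gene_slice(begin, end)]
```

Here subsequence is "ATGAAATAG", whose length is end - begin + 1. For a gene on the reverse strand, the slice yields the forward-strand bases, which still need to be reverse-complemented.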

New in version 0.5.4.

confidence()

Estimate the confidence of the prediction.

Returns:

float – A confidence percentage (between 0 and 100).
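
As a usage sketch, predictions could be filtered on this value. The helper name and the 90.0 threshold are arbitrary choices for illustration; genes is assumed to be an iterable of objects exposing the confidence() method.

```python
def filter_by_confidence(genes, minimum=90.0):
    """Keep the genes whose confidence() is at least `minimum` percent."""
    return [gene for gene in genes if gene.confidence() >= minimum]
```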

sequence()

Build the nucleotide sequence of this predicted gene.

This function takes care of reverse-complementing the sequence if it is on the reverse strand.

Note

Since Pyrodigal uses a generic symbol for unknown nucleotides, any unknown characters in the original sequence will be rendered with an N.

New in version 0.5.4.
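
The behaviour described above can be approximated in plain Python as follows. This is a simplified sketch, not Pyrodigal's implementation: it assumes an uppercase sequence made of A, C, G, T and N, and the function names are hypothetical.

```python
# Complement table for the assumed alphabet; N stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(sequence):
    """Return the reverse complement of an uppercase ACGTN string."""
    return sequence.translate(_COMPLEMENT)[::-1]


def gene_nucleotides(sequence, begin, end, strand):
    """Extract a gene given 1-based inclusive coordinates and a strand."""
    window = sequence[begin - 1:end]
    return reverse_complement(window) if strand == -1 else window
```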

translate(translation_table=None, unknown_residue=88, include_stop=True)

Translate the predicted gene into a protein sequence.

Parameters:
  • translation_table (int, optional) – An alternative translation table to use to translate the gene. Use None (the default) to translate using the translation table this gene was found with.

  • unknown_residue (str) – A single character to use for residues translated from codons with unknown nucleotides. The default shown in the signature, 88, is the ASCII code of X.

  • include_stop (bool) – Pass False to disable translating the STOP codon into a star character (*) for complete genes. True keeps the default behaviour of Prodigal, but the trailing star often causes problems with other programs or libraries that use the protein sequence for downstream processing.

Returns:

str – The protein sequence as a string using the right translation table and the standard single letter alphabet for proteins.

Raises:

ValueError – when translation_table is not a valid genetic code number.

New in version 3.0.0: The include_stop keyword argument.
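
A small sketch of calling this method for downstream tools that reject the star character. The wrapper name is hypothetical, and gene is assumed to expose the translate method documented above.

```python
def protein_for_downstream(gene, translation_table=None):
    """Translate a gene without the trailing '*' of complete genes.

    Passing translation_table=None keeps the table the gene was found with.
    """
    return gene.translate(translation_table=translation_table, include_stop=False)
```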

begin

The coordinate at which the gene begins.

Type:

int

cscore

The coding score for the start node, based on 6-mer usage.

New in version 0.5.1.

Type:

float

end

The coordinate at which the gene ends.

Type:

int

gc_cont

The GC content of the gene (between 0 and 1).

Type:

float

partial_begin

Whether the gene overlaps with the start of the sequence.

Type:

bool

partial_end

Whether the gene overlaps with the end of the sequence.

Type:

bool
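
Taken together, partial_begin and partial_end can be used to keep only the genes that lie entirely within the sequence. The helper below is a hypothetical sketch that assumes an iterable of objects with these two attributes.

```python
def complete_genes(genes):
    """Keep the genes that overlap neither edge of the sequence."""
    return [
        gene for gene in genes
        if not gene.partial_begin and not gene.partial_end
    ]
```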

rbs_motif

The motif of the Ribosome Binding Site.

Possible non-None values are GGA/GAG/AGG, 3Base/5BMM, 4Base/6BMM, AGxAG, GGxGG, AGGAG(G)/GGAGG, AGGA, AGGA/GGAG/GAGG, GGAG/GAGG, AGGAG/GGAGG, AGGAG, GGAGG or AGGAGG.

Type:

str, optional
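
Because rbs_motif may be None, code that aggregates over it should account for that value. A minimal sketch, with a hypothetical helper name and assuming an iterable of objects with an rbs_motif attribute:

```python
from collections import Counter


def tally_rbs_motifs(genes):
    """Count genes per RBS motif; genes without a motif are counted under None."""
    return Counter(gene.rbs_motif for gene in genes)
```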

rbs_spacer

The number of bases between the RBS and the CDS.

Possible non-None values are 3-4bp, 5-10bp, 11-12bp or 13-15bp.

Type:

str, optional

rscore

The score for the RBS motif.

New in version 0.5.1.

Type:

float

score

The gene score, sum of the coding and start codon scores.

New in version 0.7.3.

Type:

float

sscore

The score for the strength of the start codon.

New in version 0.5.1.

Type:

float

start_node

The start node at the beginning of this gene.

Type:

Node

start_type

The start codon of this gene.

Can be one of ATG, GTG or TTG, or Edge if the GeneFinder has been initialized in open ends mode and the gene starts right at the beginning of the input sequence.

Type:

str

stop_node

The stop node at the end of this gene.

Type:

Node

strand

-1 if the gene is on the reverse strand, +1 otherwise.

Type:

int

translation_table

The translation table used to find the gene.

Type:

int

tscore

The score for the codon kind (ATG/GTG/TTG).

New in version 0.5.1.

Type:

float

uscore

The score for the upstream regions.

New in version 0.5.1.

Type:

float