augur translate

Translate gene regions from nucleotides to amino acids.

Translates nucleotide sequences of nodes in a tree to amino acids for gene regions of the annotated features of the provided reference sequence. Each node then gets assigned a list of amino acid mutations for any position that has a mismatch between its own amino acid sequence and its parent’s sequence. The reference amino acid sequences, genome annotations, and node amino acid mutations are output to a node-data JSON file.

Note

The mutation positions in the node-data JSON are one-based.
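
The parent-versus-node comparison described above can be sketched in Python. This is a minimal illustration, not augur's implementation: the helper name `aa_mutations` and the toy sequences are assumptions, and it assumes the two amino acid sequences are already aligned to the same length.

```python
def aa_mutations(parent_seq, node_seq):
    """Return mutations such as 'S3N' for every mismatching position.

    Hypothetical helper for illustration only. Positions are one-based,
    matching the convention used in the node-data JSON.
    """
    if len(parent_seq) != len(node_seq):
        raise ValueError("sequences must be aligned to the same length")
    return [
        f"{parent_aa}{position}{node_aa}"
        for position, (parent_aa, node_aa) in enumerate(
            zip(parent_seq, node_seq), start=1
        )
        if parent_aa != node_aa
    ]


# Toy sequences (made up): the third and fifth residues differ.
print(aa_mutations("MASLK", "MANLR"))  # ['S3N', 'K5R']
```

Each mutation string reads as parent residue, one-based position, node residue.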

usage: augur translate [-h] --tree TREE --ancestral-sequences
                       ANCESTRAL_SEQUENCES --reference-sequence
                       REFERENCE_SEQUENCE [--genes GENES [GENES ...]]
                       [--output-node-data OUTPUT_NODE_DATA]
                       [--alignment-output ALIGNMENT_OUTPUT]
                       [--vcf-reference VCF_REFERENCE]
                       [--vcf-reference-output VCF_REFERENCE_OUTPUT]
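
As a rough sketch of how the required and optional arguments fit together, the following Python snippet assembles a command line for FASTA-derived input. The function name `translate_command` and all file names are placeholders chosen for illustration; only the flags come from the usage above.

```python
import shlex


def translate_command(tree, ancestral_sequences, reference_sequence,
                      output_node_data, genes=None):
    """Build an argument list for `augur translate` (illustrative helper)."""
    cmd = [
        "augur", "translate",
        "--tree", tree,
        "--ancestral-sequences", ancestral_sequences,
        "--reference-sequence", reference_sequence,
        "--output-node-data", output_node_data,
    ]
    if genes:
        # --genes accepts one or more gene names.
        cmd += ["--genes", *genes]
    return cmd


# Placeholder file names, not taken from any real dataset.
cmd = translate_command(
    "tree.nwk", "nt_muts.json", "reference.gb", "aa_muts.json",
    genes=["GENE_1", "GENE_2"],
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a system where augur is installed.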

Named Arguments 

--tree: prebuilt Newick – no tree will be built if provided
--ancestral-sequences: JSON (fasta input) or VCF (VCF input) containing ancestral and tip sequences
--reference-sequence: GenBank or GFF file containing the annotation
--genes: genes to translate (list or file containing list)
--output-node-data: name of JSON file to save aa-mutations to
--alignment-output: write out translated gene alignments. For VCF input, a .vcf or .vcf.gz file is written here (depending on the file ending). For FASTA input, specify the file name like so: ‘my_alignment_%GENE.fasta’, where ‘%GENE’ will be replaced by the name of the gene
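
To illustrate the ‘%GENE’ placeholder used by --alignment-output for FASTA input, here is a small Python sketch of the substitution. The helper name `alignment_paths` is hypothetical, and the sketch only mirrors the documented behaviour (one output file name per gene); it is not augur's own code.

```python
def alignment_paths(template, genes):
    """Expand the '%GENE' placeholder once per gene (illustrative helper)."""
    return {gene: template.replace("%GENE", gene) for gene in genes}


paths = alignment_paths("my_alignment_%GENE.fasta", ["GENE_1", "GENE_2"])
for gene, path in paths.items():
    print(gene, "->", path)
# GENE_1 -> my_alignment_GENE_1.fasta
# GENE_2 -> my_alignment_GENE_2.fasta
```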

VCF specific 

These arguments are only applicable if the input (--ancestral-sequences) is in VCF format.

--vcf-reference: fasta file of the sequence the VCF was mapped to
--vcf-reference-output: fasta file where reference sequence translations for VCF input will be written

Example Node Data JSON 

Here’s an example of the output node-data JSON where NODE_1 has no mutations compared to its parent and NODE_2 has multiple mutations in multiple genes.

{
    "annotations": {
        "GENE_1": {
            "end": 1685,
            "seqid": "reference.gb",
            "start": 108,
            "strand": "+",
            "type": "CDS"
        },
        "GENE_2": {
            "end": 2705,
            "seqid": "reference.gb",
            "start": 1807,
            "strand": "+",
            "type": "CDS"
        }
    },
    "nodes": {
        "NODE_1": {
            "aa_muts": {}
        },
        "NODE_2": {
            "aa_muts": {
                "GENE_1": [
                    "S139N",
                    "R213K",
                    "R439G",
                    "V440A",
                    "D474N",
                    "S479W",
                    "S481T",
                    "P485L",
                    "R521K"
                ],
                "GENE_2": [
                    "P43S",
                    "D46N",
                    "C64R",
                    "R98K",
                    "D136G",
                    "M175V"
                ]
            }
        }
    },
    "reference": {
        "GENE_1": "MATLLRSLAL...",
        "GENE_2": "MAEEQARHVK..."
    }
}
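
One way to consume a node-data file like this is sketched below in Python. The embedded JSON is a trimmed copy of the example with only a few of its mutations, and the helper names `parse_mutation` and `list_mutations` are hypothetical. The sketch assumes every mutation string has the form parent residue, one-based position, node residue.

```python
import json
import re

# Trimmed sample based on the example; a real file would be read from disk.
NODE_DATA = """
{
    "nodes": {
        "NODE_1": {"aa_muts": {}},
        "NODE_2": {
            "aa_muts": {
                "GENE_1": ["S139N", "R213K"],
                "GENE_2": ["P43S"]
            }
        }
    }
}
"""

# One non-digit character, a run of digits, one non-digit character.
MUTATION = re.compile(r"^(\D)(\d+)(\D)$")


def parse_mutation(mutation):
    """Split e.g. 'S139N' into ('S', 139, 'N'); the position is one-based."""
    match = MUTATION.match(mutation)
    if match is None:
        raise ValueError(f"unexpected mutation format: {mutation!r}")
    parent_aa, position, node_aa = match.groups()
    return parent_aa, int(position), node_aa


def list_mutations(node_data):
    """Flatten the per-node, per-gene mutations into a list of tuples."""
    rows = []
    for node, info in node_data["nodes"].items():
        # An empty or missing value means the node has no mutations.
        aa_muts = info.get("aa_muts") or {}
        for gene, mutations in aa_muts.items():
            for mutation in mutations:
                rows.append((node, gene, *parse_mutation(mutation)))
    return rows


rows = list_mutations(json.loads(NODE_DATA))
for row in rows:
    print(row)
```

NODE_1 contributes no rows, while NODE_2 yields one row per mutation in each gene.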
