augur translate
Translate gene regions from nucleotides to amino acids.
Translates nucleotide sequences of nodes in a tree to amino acids for gene regions of the annotated features of the provided reference sequence. Each node then gets assigned a list of amino acid mutations for any position that has a mismatch between its own amino acid sequence and its parent’s sequence. The reference amino acid sequences, genome annotations, and node amino acid mutations are output to a node-data JSON file.
Note
The mutation positions in the node-data JSON are one-based.
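The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not augur's implementation: the function name `aa_mutations` is hypothetical, and it assumes both sequences are already translated and have equal length. It shows how a zero-based string index becomes the one-based position reported in the JSON.

```python
def aa_mutations(parent_seq: str, child_seq: str) -> list[str]:
    """Return mutations such as 'S139N' for every mismatching position."""
    if len(parent_seq) != len(child_seq):
        raise ValueError("sequences must have the same length")
    mutations = []
    for index, (parent_aa, child_aa) in enumerate(zip(parent_seq, child_seq)):
        if parent_aa != child_aa:
            # index is zero-based; the reported position is one-based
            mutations.append(f"{parent_aa}{index + 1}{child_aa}")
    return mutations


# Hypothetical parent and child amino acid sequences
parent = "MATLLRSLAL"
child = "MATILRSLAV"
print(aa_mutations(parent, child))  # ['L4I', 'L10V']
```

To index back into a Python string from a reported mutation, subtract one from the position.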
usage: augur translate [-h] --tree TREE --ancestral-sequences
                       ANCESTRAL_SEQUENCES --reference-sequence
                       REFERENCE_SEQUENCE [--genes GENES [GENES ...]]
                       [--output-node-data OUTPUT_NODE_DATA]
                       [--alignment-output ALIGNMENT_OUTPUT]
                       [--vcf-reference VCF_REFERENCE]
                       [--vcf-reference-output VCF_REFERENCE_OUTPUT]
Named Arguments
- --tree
prebuilt Newick – no tree will be built if provided
- --ancestral-sequences
JSON (fasta input) or VCF (VCF input) containing ancestral and tip sequences
- --reference-sequence
GenBank or GFF file containing the annotation
- --genes
genes to translate (list or file containing list)
- --output-node-data
name of JSON file to save aa-mutations to
- --alignment-output
write out translated gene alignments. For VCF input, a .vcf or .vcf.gz file is written here (depending on the file ending). For FASTA input, specify the file name like so: ‘my_alignment_%GENE.fasta’, where ‘%GENE’ will be replaced by the name of the gene
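As a rough illustration of the ‘%GENE’ placeholder in --alignment-output, the sketch below expands a template into one file name per gene. It assumes plain string substitution, which this page does not confirm as augur's internal behavior; the helper name `alignment_paths` is hypothetical.

```python
def alignment_paths(template: str, genes: list[str]) -> dict[str, str]:
    """Map each gene name to the file name produced by filling in %GENE."""
    return {gene: template.replace("%GENE", gene) for gene in genes}


paths = alignment_paths("my_alignment_%GENE.fasta", ["GENE_1", "GENE_2"])
for gene, path in paths.items():
    print(gene, "->", path)
```

This can be handy for predicting which output files to expect from a run on FASTA input.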
VCF specific
These arguments are only applicable if the input (--ancestral-sequences) is in VCF format.
- --vcf-reference
fasta file of the sequence the VCF was mapped to
- --vcf-reference-output
fasta file where reference sequence translations for VCF input will be written
Example Node Data JSON
Here’s an example of the output node-data JSON where NODE_1 has no mutations compared to its parent and NODE_2 has multiple mutations in multiple genes.
{
  "annotations": {
    "GENE_1": {
      "end": 1685,
      "seqid": "reference.gb",
      "start": 108,
      "strand": "+",
      "type": "CDS"
    },
    "GENE_2": {
      "end": 2705,
      "seqid": "reference.gb",
      "start": 1807,
      "strand": "+",
      "type": "CDS"
    }
  },
  "nodes": {
    "NODE_1": {
      "aa_muts": {}
    },
    "NODE_2": {
      "aa_muts": {
        "GENE_1": [
          "S139N",
          "R213K",
          "R439G",
          "V440A",
          "D474N",
          "S479W",
          "S481T",
          "P485L",
          "R521K"
        ],
        "GENE_2": [
          "P43S",
          "D46N",
          "C64R",
          "R98K",
          "D136G",
          "M175V"
        ]
      }
    }
  },
  "reference": {
    "GENE_1": "MATLLRSLAL...",
    "GENE_2": "MAEEQARHVK..."
  }
}
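A node-data file shaped like this example can be read with Python's standard library. The sketch below is an assumption-laden illustration rather than part of augur: it embeds a shortened copy of the example, assumes "aa_muts" maps gene names to lists of mutation strings, and uses hypothetical helpers `parse_mutation` and `mutations_by_node`.

```python
import json
import re

# Shortened, hypothetical copy of the example node-data JSON
NODE_DATA = """
{
  "nodes": {
    "NODE_1": {"aa_muts": {}},
    "NODE_2": {"aa_muts": {"GENE_1": ["S139N"], "GENE_2": ["P43S"]}}
  }
}
"""

# Ancestral residue, one-based position, derived residue
MUTATION = re.compile(r"^([A-Z*-])(\d+)([A-Z*-])$")


def parse_mutation(mutation: str) -> tuple[str, int, str]:
    """Split a string like 'S139N' into ('S', 139, 'N')."""
    match = MUTATION.match(mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation: {mutation!r}")
    ancestral, position, derived = match.groups()
    return ancestral, int(position), derived


def mutations_by_node(node_data: dict) -> dict:
    """Collect (gene, ancestral, position, derived) tuples per node."""
    result = {}
    for name, node in node_data["nodes"].items():
        # Nodes without mutations contribute no entry
        for gene, muts in (node.get("aa_muts") or {}).items():
            for mutation in muts:
                result.setdefault(name, []).append(
                    (gene, *parse_mutation(mutation))
                )
    return result


print(mutations_by_node(json.loads(NODE_DATA)))
```

For a real run, replace the embedded string with the contents of the file passed to --output-node-data.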