augur refine
Refine an initial tree using sequence metadata.
usage: augur refine [-h] [--alignment ALIGNMENT] --tree TREE [--metadata FILE]
[--metadata-delimiters METADATA_DELIMITERS [METADATA_DELIMITERS ...]]
[--metadata-id-columns METADATA_ID_COLUMNS [METADATA_ID_COLUMNS ...]]
[--output-tree OUTPUT_TREE]
[--output-node-data OUTPUT_NODE_DATA] [--use-fft]
[--max-iter MAX_ITER] [--timetree]
[--coalescent COALESCENT] [--gen-per-year GEN_PER_YEAR]
[--clock-rate CLOCK_RATE] [--clock-std-dev CLOCK_STD_DEV]
[--root ROOT [ROOT ...]] [--keep-root] [--covariance]
[--no-covariance]
[--keep-polytomies | --stochastic-resolve | --greedy-resolve]
[--precision {0,1,2,3}] [--date-format DATE_FORMAT]
[--date-confidence] [--date-inference {joint,marginal}]
[--branch-length-inference {auto,joint,marginal,input}]
[--clock-filter-iqd CLOCK_FILTER_IQD]
[--vcf-reference VCF_REFERENCE]
[--year-bounds YEAR_BOUNDS [YEAR_BOUNDS ...]]
[--divergence-units {mutations,mutations-per-site}]
[--seed SEED] [--verbosity VERBOSITY]
Named Arguments
- --alignment, -a
alignment in fasta or VCF format
- --tree, -t
prebuilt Newick
- --metadata
sequence metadata
- --metadata-delimiters
delimiters to accept when reading a metadata file. Only one delimiter will be inferred.
Default: (",", "\t")
- --metadata-id-columns
names of possible metadata columns containing identifier information, ordered by priority. Only one ID column will be inferred.
Default: ("strain", "name")
- --output-tree
file name to write tree to
- --output-node-data
file name to write branch lengths as node data
- --use-fft
produce timetree using FFT for convolutions
Default: False
- --max-iter
maximal number of iterations TreeTime uses for timetree inference
Default: 2
- --timetree
produce timetree using treetime, requires tree where branch length is in units of average number of nucleotide or protein substitutions per site (and branch lengths do not exceed 4)
Default: False
- --coalescent
coalescent time scale in units of inverse clock rate (float), optimize as scalar ("opt"), or skyline ("skyline")
- --gen-per-year
number of generations per year, relevant for skyline output ("skyline")
Default: 50
- --clock-rate
fixed clock rate
- --clock-std-dev
standard deviation of the fixed clock_rate estimate
- --root
rooting mechanism ("best", "least-squares", "min_dev", "oldest", "mid_point") OR node to root by OR two nodes indicating a monophyletic group to root by. Run treetime -h for definitions of rooting methods.
Default: "best"
- --keep-root
do not reroot the tree; use it as-is. Overrides anything specified by --root.
Default: False
- --covariance
Account for covariation when estimating rates and/or rerooting. Use --no-covariance to turn off.
Default: True
- --no-covariance
Default: True
- --keep-polytomies
Do not attempt to resolve polytomies
Default: False
- --stochastic-resolve
Resolve polytomies via stochastic subtree building rather than greedy optimization
Default: False
- --greedy-resolve
Default: True
- --precision
Possible choices: 0, 1, 2, 3
precision used by TreeTime to determine the number of grid points that are used for the evaluation of the branch length interpolation objects. Values range from 0 (rough) to 3 (ultra fine) and default to "auto".
- --date-format
date format
Default: "%Y-%m-%d"
- --date-confidence
calculate confidence intervals for node dates
Default: False
- --date-inference
Possible choices: joint, marginal
assign internal nodes to their marginally most likely dates, not jointly most likely
Default: "joint"
- --branch-length-inference
Possible choices: auto, joint, marginal, input
branch length mode of treetime to use
Default: "auto"
- --clock-filter-iqd
clock-filter: remove tips that deviate more than n_iqd interquartile ranges from the root-to-tip vs time regression
- --vcf-reference
fasta file of the sequence the VCF was mapped to
- --year-bounds
specify min or max & min prediction bounds for samples with XX in year
- --divergence-units
Possible choices: mutations, mutations-per-site
Units in which sequence divergence is exported.
Default: "mutations-per-site"
- --seed
seed for random number generation
- --verbosity
treetime verbosity, between 0 and 6 (higher values more output)
Default: 1
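To make the --clock-filter-iqd criterion above more concrete, here is a minimal Python sketch of one plausible reading of it: a tip is flagged when the absolute residual of its root-to-tip regression exceeds n_iqd times the interquartile range of all residuals. The function name flag_clock_outliers, the strain names, and the residual values are hypothetical, and TreeTime's own implementation may differ in detail.

```python
from statistics import quantiles


def flag_clock_outliers(residuals, n_iqd):
    """Return the names of tips whose |residual| exceeds n_iqd interquartile ranges."""
    # Quartiles of all residuals; "inclusive" interpolates within the observed range.
    q1, _median, q3 = quantiles(residuals.values(), n=4, method="inclusive")
    threshold = n_iqd * (q3 - q1)
    return sorted(name for name, r in residuals.items() if abs(r) > threshold)


# Hypothetical residuals: observed minus predicted divergence (substitutions per site).
example_residuals = {
    "strain_A": -0.001,
    "strain_B": 0.0005,
    "strain_C": 0.0,
    "strain_D": 0.001,
    "strain_E": 0.02,  # far off the regression line
}

flagged = flag_clock_outliers(example_residuals, n_iqd=4)
print(flagged)
```

In this toy data set only strain_E lies far enough from the regression line to be flagged.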
How we use refine in the Zika tutorial
In the Zika tutorial we used the following basic rule to run the refine command:
rule refine:
    input:
        tree = rules.tree.output.tree,
        alignment = rules.align.output,
        metadata = "data/metadata.tsv"
    output:
        tree = "results/tree.nwk",
        node_data = "results/branch_lengths.json"
    shell:
        """
        augur refine \
            --tree {input.tree} \
            --alignment {input.alignment} \
            --metadata {input.metadata} \
            --timetree \
            --output-tree {output.tree} \
            --output-node-data {output.node_data}
        """
This rule will estimate the rate of the molecular clock, reroot the tree, and estimate a time tree. The paragraphs below detail how to exert more control over each of these steps through additional options of the refine command.
Specify the evolutionary rate
By default, augur (through treetime) will estimate the rate of evolution from the data by regressing divergence against sampling date.
In some scenarios, however, there is insufficient temporal signal to reliably estimate the rate, and the analysis will be more robust and reproducible if one fixes this rate explicitly.
This can be done via the flag --clock-rate <value>, where the implied units are substitutions per site per year.
In our Zika example, this would look like this:
  rule refine:
      input:
          tree = rules.tree.output.tree,
          alignment = rules.align.output,
          metadata = "data/metadata.tsv"
      output:
          tree = "results/tree.nwk",
          node_data = "results/branch_lengths.json"
+     params:
+         clock_rate = 0.0008
      shell:
          """
          augur refine \
              --tree {input.tree} \
              --alignment {input.alignment} \
              --metadata {input.metadata} \
              --timetree \
+             --clock-rate {params.clock_rate} \
              --output-tree {output.tree} \
              --output-node-data {output.node_data}
          """
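The default rate estimate described in this section, a regression of divergence on sampling date, can be illustrated with a short Python sketch. It fits an ordinary least-squares line to root-to-tip divergences: the slope is the clock rate in substitutions per site per year, and the date at which the line crosses zero divergence is the implied date of the root. The function name estimate_clock_rate and the data points are hypothetical. Unlike TreeTime's default covariance-aware regression, this sketch treats all tips as independent.

```python
def estimate_clock_rate(dates, divergences):
    """Ordinary least-squares fit of divergence against sampling date.

    Returns (rate, root_date): the slope of the fitted line and the date
    at which the fitted divergence is zero.
    """
    n = len(dates)
    mean_t = sum(dates) / n
    mean_d = sum(divergences) / n
    covariance = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, divergences))
    variance = sum((t - mean_t) ** 2 for t in dates)
    rate = covariance / variance
    intercept = mean_d - rate * mean_t
    return rate, -intercept / rate


# Hypothetical tips: decimal sampling dates and root-to-tip divergences
# (substitutions per site), constructed to lie exactly on a line.
sampling_dates = [2013.0, 2014.0, 2015.0, 2016.0]
root_to_tip = [0.0004, 0.0012, 0.0020, 0.0028]

rate, root_date = estimate_clock_rate(sampling_dates, root_to_tip)
print(f"rate = {rate:.4g} subs/site/year, root date = {root_date:.2f}")
```

When the points scatter widely around the line, or the sampling dates span only a short interval, the slope is poorly constrained. That is the situation in which fixing the rate with --clock-rate is preferable.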
Confidence intervals for divergence times
Divergence time estimates are probabilistic and uncertain for multiple reasons, primarily because the accumulation of mutations is a probabilistic process and the rate estimate itself is not precise.
Augur/TreeTime will account for this uncertainty if the refine command is run with the flag --date-confidence and the standard deviation of the rate estimate is specified via --clock-std-dev:
  rule refine:
      input:
          tree = rules.tree.output.tree,
          alignment = rules.align.output,
          metadata = "data/metadata.tsv"
      output:
          tree = "results/tree.nwk",
          node_data = "results/branch_lengths.json"
      params:
          clock_rate = 0.0008,
+         clock_std_dev = 0.0002
      shell:
          """
          augur refine \
              --tree {input.tree} \
              --alignment {input.alignment} \
              --metadata {input.metadata} \
              --timetree \
              --date-confidence \
+             --clock-rate {params.clock_rate} \
+             --clock-std-dev {params.clock_std_dev} \
              --output-tree {output.tree} \
              --output-node-data {output.node_data}
          """
If run with these parameters, augur will save a confidence interval (e.g. [2014.5, 2014.7]) for each node in the tree.
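As a rough illustration of how such intervals might be read back, the following Python sketch parses a small node-data document and converts the decimal-year bounds to calendar dates. The JSON layout (a top-level "nodes" object with "num_date" and "num_date_confidence" entries) and the node name are assumptions made for this example, so check them against the file augur actually writes. The decimal-year conversion, which takes a fraction of the calendar year's length, is likewise a simple convention chosen here, not necessarily the one TreeTime uses.

```python
import json
from datetime import datetime, timedelta


def decimal_year_to_date(value):
    """Convert a decimal year such as 2014.5 to a calendar date."""
    year = int(value)
    start = datetime(year, 1, 1)
    days_in_year = (datetime(year + 1, 1, 1) - start).days
    return (start + timedelta(days=(value - year) * days_in_year)).date()


def date_intervals(node_data):
    """Map node name -> (lower date, upper date) for nodes that carry an interval."""
    intervals = {}
    for name, attrs in node_data.get("nodes", {}).items():
        if "num_date_confidence" in attrs:
            lower, upper = attrs["num_date_confidence"]
            intervals[name] = (decimal_year_to_date(lower), decimal_year_to_date(upper))
    return intervals


# Hypothetical excerpt standing in for results/branch_lengths.json.
SAMPLE = """
{"nodes": {"NODE_0000001": {"num_date": 2014.6,
                            "num_date_confidence": [2014.5, 2014.7]}}}
"""

intervals = date_intervals(json.loads(SAMPLE))
for name, (lower, upper) in intervals.items():
    print(f"{name}: {lower} to {upper}")
```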
By default, augur runs TreeTime in a "covariance-aware" mode where the root-to-tip regression accounts for shared ancestry and covariance between terminal nodes.
This, however, is sometimes unstable when the temporal signal is low and can be switched off with the flag --no-covariance.
Specifying the root of the tree
By default, augur/TreeTime reroots your input tree to optimize the temporal signal in the data. This works well when the temporal signal is strong.
In other situations, you might want to specify the root explicitly, specify a rerooting mechanism, or keep the root of the input tree.
The latter can be achieved by passing the argument --keep-root.
To specify a particular strain (or the common ancestor of a group of strains), pass the name(s) of the(se) strain(s) like so:
--root strain1 [strain2 strain3 ...]
Other available rooting mechanisms are:
- least-squares (default): minimize squared deviation of the root-to-tip regression
- min_dev: essentially midpoint rooting minimizing the variance in root-to-tip distance
- oldest: use the oldest strain as outgroup
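The rooting choices described in this section map onto command-line arguments roughly as in the Python sketch below, which assembles an augur refine invocation as an argument list. The helper root_arguments and the two input paths are hypothetical. Only flags listed in the reference above are used.

```python
import shlex


def root_arguments(root=None, keep_root=False):
    """Build the rooting part of an `augur refine` command line.

    root may be a mechanism name (e.g. "oldest"), a single strain name,
    or a list of strain names whose common ancestor becomes the root.
    """
    if keep_root:
        # --keep-root overrides anything given via --root.
        return ["--keep-root"]
    if root is None:
        return []  # fall back to augur's default rooting
    names = [root] if isinstance(root, str) else list(root)
    return ["--root", *names]


base_command = [
    "augur", "refine",
    "--tree", "results/tree_raw.nwk",        # hypothetical input path
    "--alignment", "results/aligned.fasta",  # hypothetical input path
    "--metadata", "data/metadata.tsv",
    "--timetree",
    "--output-tree", "results/tree.nwk",
    "--output-node-data", "results/branch_lengths.json",
]

command = base_command + root_arguments(root=["strain1", "strain2"])
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run, or the printed string can be pasted into a shell.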
Polytomy resolution
If the data set contains many very similar sequences, their evolutionary relationships sometimes remain ambiguous, resulting in zero-length branches or polytomies (that is, internal nodes with more than 2 children).
Augur partially resolves those polytomies if such resolution helps make the tree fit the temporal structure in the data.
If this is undesired, it can be switched off using --keep-polytomies.
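To check whether a tree contains polytomies in the sense used above, a small Python sketch can count internal nodes with more than two children directly from a Newick string. The function name count_polytomies and the example trees are hypothetical. The parser is deliberately simplified and assumes labels contain no quoted parentheses or commas and no bracketed comments.

```python
def count_polytomies(newick):
    """Count internal nodes with more than two children in a Newick string."""
    child_counts = []  # one entry per currently open internal node
    polytomies = 0
    for char in newick:
        if char == "(":
            child_counts.append(1)  # an opened node has at least one child
        elif char == ",":
            child_counts[-1] += 1   # each comma adds a sibling
        elif char == ")":
            if child_counts.pop() > 2:
                polytomies += 1
    return polytomies


# Hypothetical trees: the first contains one three-way split, the second is fully bifurcating.
with_polytomy = "((A:0.1,B:0.1,C:0.1):0.2,D:0.3);"
bifurcating = "((A:0.1,B:0.1):0.2,(C:0.1,D:0.1):0.3);"

print(count_polytomies(with_polytomy), count_polytomies(bifurcating))
```

A tree whose root has three children is also counted, which is common for unrooted input trees.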