augur filter
Filter and subsample a sequence set.
usage: augur filter [-h] --sequences SEQUENCES --metadata FILE
[--sequence-index SEQUENCE_INDEX] [--min-date MIN_DATE]
[--max-date MAX_DATE] [--min-length MIN_LENGTH]
[--non-nucleotide] [--exclude EXCLUDE] [--include INCLUDE]
[--priority PRIORITY]
[--sequences-per-group SEQUENCES_PER_GROUP | --subsample-max-sequences SUBSAMPLE_MAX_SEQUENCES]
[--group-by GROUP_BY [GROUP_BY ...]]
[--probabilistic-sampling | --no-probabilistic-sampling]
[--subsample-seed SUBSAMPLE_SEED]
[--exclude-where EXCLUDE_WHERE [EXCLUDE_WHERE ...]]
[--include-where INCLUDE_WHERE [INCLUDE_WHERE ...]]
[--exclude-ambiguous-dates-by {any,day,month,year}]
[--query QUERY] --output OUTPUT
Named Arguments
- --sequences, -s
sequences in fasta or VCF format
- --metadata
sequence metadata, as CSV or TSV
- --sequence-index
sequence composition report generated by augur index. If not provided, an index will be created on the fly.
- --min-date
minimal cutoff for date; may be specified as an Augur-style numeric date (with the year as the integer part) or YYYY-MM-DD
- --max-date
maximal cutoff for date; may be specified as an Augur-style numeric date (with the year as the integer part) or YYYY-MM-DD
- --min-length
minimal length of the sequences
- --non-nucleotide
exclude sequences that contain illegal characters
Default: False
- --exclude
file with list of strains that are to be excluded
- --include
file with list of strains that are to be included regardless of priorities or subsampling
- --priority
file with list of priority scores for sequences (strain priority)
- --sequences-per-group
subsample to no more than this number of sequences per category
- --subsample-max-sequences
subsample to no more than this number of sequences
- --group-by
categories with respect to subsample; two virtual fields, “month” and “year”, are supported if they don’t already exist as real fields but a “date” field does exist
- --probabilistic-sampling
Enable probabilistic sampling during subsampling. This is useful when there are more groups than requested sequences. This option only applies when --subsample-max-sequences is provided.
Default: True
- --no-probabilistic-sampling
Default: True
- --subsample-seed
random number generator seed to allow reproducible sub-sampling (with same input data). Can be number or string.
- --exclude-where
Exclude samples matching these conditions. Ex: “host=rat” or “host!=rat”. Multiple values are processed as OR (matching any of those specified will be excluded), not AND
- --include-where
Include samples with these values. ex: host=rat. Multiple values are processed as OR (having any of those specified will be included), not AND. This rule is applied last and ensures any sequences matching these rules will be included.
- --exclude-ambiguous-dates-by
Possible choices: any, day, month, year
Exclude ambiguous dates by day (e.g., 2020-09-XX), month (e.g., 2020-XX-XX), year (e.g., 200X-10-01), or any date fields. An ambiguous year makes the corresponding month and day ambiguous, too, even if those fields have unambiguous values (e.g., “201X-10-01”). Similarly, an ambiguous month makes the corresponding day ambiguous (e.g., “2010-XX-01”).
- --query
Filter samples by attribute. Uses Pandas Dataframe querying, see https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-query for syntax.
- --output, -o
output file
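The cascading rule described for --exclude-ambiguous-dates-by can be easier to follow as code. The following Python sketch is only an illustration of the rule as worded above, not Augur's implementation. The function name is hypothetical, and it assumes dates are always written as three dash-separated fields with "X" marking unknown digits, and that "any" means "at least one field is ambiguous".

```python
def is_date_ambiguous(date_string, ambiguous_by="any"):
    """Return True if the date is ambiguous at the requested resolution.

    Hypothetical helper: assumes a YYYY-MM-DD layout with "X" for unknown digits.
    """
    year, month, day = date_string.split("-")

    # An ambiguous year makes month and day ambiguous as well;
    # an ambiguous month makes the day ambiguous.
    year_ambiguous = "X" in year
    month_ambiguous = year_ambiguous or "X" in month
    day_ambiguous = month_ambiguous or "X" in day

    if ambiguous_by == "year":
        return year_ambiguous
    if ambiguous_by == "month":
        return month_ambiguous
    if ambiguous_by in ("day", "any"):
        # Because ambiguity cascades downwards, "any field is ambiguous"
        # is equivalent to "the day is ambiguous" in this sketch.
        return day_ambiguous
    raise ValueError(f"unknown resolution: {ambiguous_by}")


for example in ["2020-09-XX", "2020-XX-XX", "201X-10-01", "2010-XX-01", "2020-09-15"]:
    flags = {level: is_date_ambiguous(example, level) for level in ("day", "month", "year")}
    print(example, flags)
```

Under these assumptions, "2020-09-XX" is ambiguous by day only, while "201X-10-01" is ambiguous at every resolution even though its month and day are fully specified.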
How we subsample sequences in the Zika tutorial
As an example, we’ll look at the filter command in greater detail using material from the Zika tutorial.
The filter command allows you to select various subsets of your input data for different types of analysis.
A simple example use of this command would be:
augur filter --sequences data/sequences.fasta --metadata data/metadata.tsv --min-date 2012 --output filtered.fasta
This command selects all sequences with a collection date in 2012 or later. The filter command has a large number of options that allow flexible filtering for many common situations. One such use case is the exclusion of sequences that are known to be outliers (e.g. because of sequencing errors, cell-culture adaptation, …). These can be listed in a separate file, one strain name per line:
BRA/2016/FC_DQ75D1
COL/FLR_00034/2015
...
To drop such strains, you can pass the name of this file to the augur filter command:
augur filter --sequences data/sequences.fasta \
--metadata data/metadata.tsv \
--min-date 2012 \
--exclude config/dropped_strains.txt \
--output filtered.fasta
(To improve legibility, we have wrapped the command across multiple lines.)
If you run this command (you should be able to copy-paste it into your terminal) on the data provided in the Zika tutorial, you should see that one of the sequences in the data set was dropped since its name was in the dropped_strains.txt file.
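To make the combined effect of --min-date and --exclude concrete, here is a minimal Python sketch of the same two checks applied to a small in-memory table. It is an illustration under stated assumptions, not Augur's code: the helper names, the two "EXAMPLE/..." strains, and all collection dates are made up, and the fractional-year formula is only one plausible way to compute a numeric date with the year as the integer part.

```python
from datetime import date


def to_numeric_date(iso_date):
    """Approximate a YYYY-MM-DD date as a fractional year (assumed formula)."""
    d = date.fromisoformat(iso_date)
    start_of_year = date(d.year, 1, 1)
    days_in_year = (date(d.year + 1, 1, 1) - start_of_year).days
    return d.year + (d - start_of_year).days / days_in_year


def filter_strains(metadata, min_date, excluded):
    """Keep strains that are not excluded and were collected on or after min_date."""
    kept = []
    for strain, collected in metadata.items():
        if strain in excluded:
            continue  # listed in the exclusion file
        if to_numeric_date(collected) < min_date:
            continue  # collected before the cutoff
        kept.append(strain)
    return kept


# Hypothetical metadata: strain name -> collection date (dates are invented).
metadata = {
    "BRA/2016/FC_DQ75D1": "2016-01-01",
    "COL/FLR_00034/2015": "2015-12-01",
    "EXAMPLE/old/2010": "2010-06-15",
    "EXAMPLE/new/2016": "2016-03-20",
}
# Stand-in for the strain names read from an exclusion file.
excluded = {"BRA/2016/FC_DQ75D1", "COL/FLR_00034/2015"}

kept = filter_strains(metadata, min_date=2012, excluded=excluded)
print(kept)
```

In this toy data set, two strains are removed by the exclusion list and one by the date cutoff, leaving a single strain.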
Another common filtering operation is subsetting data to achieve a more even spatio-temporal distribution or to cut the data set down to a more manageable size. The filter command allows you to select a specific number of sequences from specific groups, for example one sequence per month from each country:
augur filter \
--sequences data/sequences.fasta \
--metadata data/metadata.tsv \
--min-date 2012 \
--exclude config/dropped_strains.txt \
--group-by country year month \
--sequences-per-group 1 \
--output filtered.fasta
This subsampling and filtering will reduce the number of sequences in the tutorial data set from 34 to 24.
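The grouping step behind --group-by and --sequences-per-group can be sketched in a few lines of Python. This is a simplified illustration, not Augur's algorithm: it ignores priorities and forced includes, picks sequences at random within each group, and uses invented records and a hypothetical function name. The "year" and "month" fields are derived from "date" only when no real field of that name exists, following the description of --group-by above.

```python
import random
from collections import defaultdict


def subsample(records, group_by, per_group, seed=None):
    """Keep at most per_group records from each group (illustrative sketch)."""
    rng = random.Random(seed)  # a fixed seed makes the selection reproducible
    groups = defaultdict(list)
    for record in records:
        year, month, _day = record["date"].split("-")
        # Virtual fields come first so that real fields of the same name win.
        fields = {"year": year, "month": month, **record}
        key = tuple(fields[name] for name in group_by)
        groups[key].append(record)

    kept = []
    for key in sorted(groups):
        members = groups[key]
        rng.shuffle(members)
        kept.extend(members[:per_group])
    return kept


# Invented records for illustration only.
records = [
    {"strain": "A", "country": "Brazil", "date": "2016-01-05"},
    {"strain": "B", "country": "Brazil", "date": "2016-01-20"},
    {"strain": "C", "country": "Brazil", "date": "2016-02-02"},
    {"strain": "D", "country": "Colombia", "date": "2016-01-11"},
]

kept = subsample(records, ["country", "year", "month"], per_group=1, seed=0)
print([record["strain"] for record in kept])
```

Strains A and B share the same country, year, and month, so only one of them survives; C and D are each alone in their group and are always kept.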