normalize-strings
Normalize strings to a Unicode normalization form and strip leading and trailing whitespace.
Normalization makes string comparisons predictable, especially when strings contain diacritics (see https://unicode.org/faq/normalization.html).
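To illustrate why this matters, the following minimal sketch uses Python's standard `unicodedata` module to compare two visually identical strings. The `normalize` helper is hypothetical and only mirrors the behavior described above (normalize, then strip). It is not augur's own code.

```python
import unicodedata

# "é" written two ways: one precomposed code point vs. "e" + combining acute accent.
precomposed = "Jos\u00e9"
decomposed = "Jose\u0301"

# The two strings render identically but compare unequal before normalization.
raw_equal = precomposed == decomposed


def normalize(value: str, form: str = "NFC") -> str:
    """Hypothetical helper: normalize to a Unicode form, then strip whitespace."""
    return unicodedata.normalize(form, value).strip()


# After normalization (and stripping stray whitespace) the strings match.
normalized_equal = normalize(precomposed) == normalize(f"  {decomposed} ")

print(raw_equal, normalized_equal)  # prints: False True
```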
usage: augur curate normalize-strings [-h] [--metadata METADATA]
                                      [--id-column ID_COLUMN] [--fasta FASTA]
                                      [--seq-id-column SEQ_ID_COLUMN]
                                      [--seq-field SEQ_FIELD]
                                      [--unmatched-reporting {error_first,error_all,warn,silent}]
                                      [--duplicate-reporting {error_first,error_all,warn,silent}]
                                      [--output-metadata OUTPUT_METADATA]
                                      [--output-fasta OUTPUT_FASTA]
                                      [--output-id-field OUTPUT_ID_FIELD]
                                      [--output-seq-field OUTPUT_SEQ_FIELD]
                                      [--form {NFC,NFKC,NFD,NFKD}]
INPUTS
Input options shared by all augur curate commands. If no input options are provided, commands will try to read NDJSON records from stdin.
- --metadata
Input metadata file, as CSV or TSV. Accepts ‘-’ to read metadata from stdin.
- --id-column
Name of the metadata column that contains the record identifier for reporting duplicate records. Uses the first column of the metadata file if not provided. Ignored if also providing a FASTA file input.
- --fasta
Plain or gzipped FASTA file. Headers may contain only the sequence id used to match a metadata record. Note that an index file will be generated for the FASTA file as <filename>.fasta.fxi.
- --seq-id-column
Name of metadata column that contains the sequence id to match sequences in the FASTA file.
- --seq-field
The name to use for the sequence field when joining sequences from a FASTA file.
- --unmatched-reporting
Possible choices: error_first, error_all, warn, silent
How unmatched records from combined metadata/FASTA input should be reported.
Default: “error_first”
- --duplicate-reporting
Possible choices: error_first, error_all, warn, silent
How duplicate records should be reported.
Default: “error_first”
OUTPUTS
Output options shared by all augur curate commands. If no output options are provided, commands will output NDJSON records to stdout.
- --output-metadata
Output metadata TSV file. Accepts ‘-’ to output TSV to stdout.
- --output-fasta
Output FASTA file.
- --output-id-field
The record field to use as the sequence identifier in the FASTA output.
- --output-seq-field
The record field that contains the sequence for the FASTA output. This field will be deleted from the metadata output.
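When no input or output options are given, records flow from stdin to stdout as NDJSON. The sketch below shows roughly what that flow looks like for string normalization. It is an illustration under stated assumptions, not augur's implementation: the function names `normalize_records` and `write_ndjson` are hypothetical, only string values (not field names) are normalized, and in-memory streams stand in for stdin and stdout.

```python
import io
import json
import unicodedata
from typing import Iterable, Iterator, TextIO


def normalize_records(lines: Iterable[str], form: str = "NFC") -> Iterator[dict]:
    """Hypothetical helper: parse NDJSON lines and normalize every string value."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        yield {
            key: unicodedata.normalize(form, value).strip()
            if isinstance(value, str)
            else value  # non-string values pass through unchanged (assumption)
            for key, value in record.items()
        }


def write_ndjson(records: Iterable[dict], out: TextIO) -> None:
    """Hypothetical helper: write one JSON object per line."""
    for record in records:
        out.write(json.dumps(record, ensure_ascii=False) + "\n")


# In-memory stand-ins for stdin and stdout; the field names are made up.
source = io.StringIO('{"strain": " Jose\\u0301 ", "length": 29903}\n')
sink = io.StringIO()
write_ndjson(normalize_records(source), sink)
```

In a real pipeline, `sys.stdin` and `sys.stdout` would take the place of `source` and `sink`.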
OPTIONAL
- --form
Possible choices: NFC, NFKC, NFD, NFKD
Unicode normalization form to use for normalization.
Default: “NFC”
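The four forms differ in whether characters are composed or decomposed (NFC vs. NFD) and whether compatibility characters such as ligatures are replaced (the "K" forms). As a small illustration, the following sketch applies each form to a sample string with Python's standard `unicodedata.normalize`. The sample string is chosen for demonstration only.

```python
import unicodedata

FORMS = ("NFC", "NFKC", "NFD", "NFKD")

# Precomposed "é" (U+00E9) followed by the "ﬁ" ligature (U+FB01).
sample = "\u00e9\ufb01"

# Apply each normalization form to the same input.
results = {form: unicodedata.normalize(form, sample) for form in FORMS}

for form, text in results.items():
    # Show the code points so invisible differences become visible.
    print(form, [f"U+{ord(ch):04X}" for ch in text])

# prints:
# NFC ['U+00E9', 'U+FB01']
# NFKC ['U+00E9', 'U+0066', 'U+0069']
# NFD ['U+0065', 'U+0301', 'U+FB01']
# NFKD ['U+0065', 'U+0301', 'U+0066', 'U+0069']
```

NFC, the default, keeps the ligature intact and only composes the accent, so it changes the fewest visible characters. The compatibility forms (NFKC, NFKD) additionally expand the ligature into "fi".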