normalize-strings

Normalize strings to a Unicode normalization form and strip leading and trailing whitespace.

Strings need to be normalized for predictable string comparisons, especially in cases where strings contain diacritics (see https://unicode.org/faq/normalization.html).
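As a minimal sketch of why this matters, the following uses Python's standard `unicodedata` module to compare two spellings of the same name. The `normalize` helper is hypothetical and written only for this illustration; it is not augur's implementation, and the order of normalizing and stripping is an assumption.

```python
import unicodedata

composed = "Jos\u00e9"      # "é" stored as one code point (U+00E9)
decomposed = "Jose\u0301"   # "e" followed by a combining acute accent (U+0301)

# The two strings render identically but do not compare equal as-is.
raw_equal = composed == decomposed


def normalize(value: str, form: str = "NFC") -> str:
    """Hypothetical helper: normalize to `form`, then strip outer whitespace."""
    return unicodedata.normalize(form, value).strip()


# After normalization (and stripping), the comparison is predictable.
normalized_equal = normalize(composed) == normalize("  " + decomposed + " ")

print(raw_equal, normalized_equal)  # prints: False True
```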

usage: augur curate normalize-strings [-h] [--metadata METADATA]
                                      [--id-column ID_COLUMN] [--fasta FASTA]
                                      [--seq-id-column SEQ_ID_COLUMN]
                                      [--seq-field SEQ_FIELD]
                                      [--unmatched-reporting {error_first,error_all,warn,silent}]
                                      [--duplicate-reporting {error_first,error_all,warn,silent}]
                                      [--output-metadata OUTPUT_METADATA]
                                      [--output-fasta OUTPUT_FASTA]
                                      [--output-id-field OUTPUT_ID_FIELD]
                                      [--output-seq-field OUTPUT_SEQ_FIELD]
                                      [--form {NFC,NFKC,NFD,NFKD}]

INPUTS

Input options shared by all augur curate commands. If no input options are provided, commands will try to read NDJSON records from stdin.

--metadata

Input metadata file, as CSV or TSV. Accepts ‘-’ to read metadata from stdin.

--id-column

Name of the metadata column that contains the record identifier for reporting duplicate records. Uses the first column of the metadata file if not provided. Ignored if also providing a FASTA file input.

--fasta

Plain or gzipped FASTA file. Headers can only contain the sequence id used to match a metadata record. Note that an index file will be generated for the FASTA file as <filename>.fasta.fxi.

--seq-id-column

Name of metadata column that contains the sequence id to match sequences in the FASTA file.

--seq-field

The name to use for the sequence field when joining sequences from a FASTA file.

--unmatched-reporting

Possible choices: error_first, error_all, warn, silent

How unmatched records from combined metadata/FASTA input should be reported.

Default: “error_first”

--duplicate-reporting

Possible choices: error_first, error_all, warn, silent

How duplicate records should be reported.

Default: “error_first”

OUTPUTS

Output options shared by all augur curate commands. If no output options are provided, commands will output NDJSON records to stdout.

--output-metadata

Output metadata TSV file. Accepts ‘-’ to output TSV to stdout.

--output-fasta

Output FASTA file.

--output-id-field

The record field to use as the sequence identifier in the FASTA output.

--output-seq-field

The record field that contains the sequence for the FASTA output. This field will be deleted from the metadata output.
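When no input or output options are given, records flow as NDJSON, one JSON object per line. The sketch below illustrates that record-by-record flow under stated assumptions: `normalize_records` is a hypothetical function, the sample record is invented, and an in-memory stream stands in for stdin. It is not how augur itself is implemented.

```python
import io
import json
import unicodedata


def normalize_records(lines, form="NFC"):
    """Yield one NDJSON line per input record, with string values normalized."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        cleaned = {
            key: unicodedata.normalize(form, value).strip()
            if isinstance(value, str)
            else value  # leave non-string values untouched
            for key, value in record.items()
        }
        yield json.dumps(cleaned, ensure_ascii=False)


# Invented sample: padded strain name and a decomposed "ô" (o + U+0302).
sample = io.StringIO('{"strain": " A/1 ", "country": "Co\\u0302te", "age": 3}\n')
result = list(normalize_records(sample))
```

Passing `sys.stdin` instead of the `io.StringIO` object and printing each yielded line would mimic the stdin-to-stdout behaviour in a script.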

OPTIONAL

--form

Possible choices: NFC, NFKC, NFD, NFKD

Unicode normalization form to use for normalization.

Default: “NFC”
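To give a feel for how the four forms differ, here is a small sketch using Python's standard `unicodedata.normalize`. The sample string is invented for illustration. NFC and NFD differ only in whether accents are composed or decomposed, while NFKC and NFKD additionally replace compatibility characters such as ligatures.

```python
import unicodedata

# "ﬁancé": the ligature U+FB01, then "anc", then a precomposed "é" (U+00E9).
text = "\ufb01anc\u00e9"

# Count code points after normalizing to each form.
lengths = {
    form: len(unicodedata.normalize(form, text))
    for form in ("NFC", "NFKC", "NFD", "NFKD")
}

print(lengths)  # prints: {'NFC': 5, 'NFKC': 6, 'NFD': 6, 'NFKD': 7}
```

NFC leaves this string unchanged, NFD splits "é" into two code points, and the two compatibility forms also expand the ligature into "f" and "i".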