Methods for building the Influenza A database

To assemble the sequence database, Influenza A virus (IAV) nucleotide sequences were retrieved from the NCBI Entrez databases by querying the Influenza A virus taxonomic ID through the E-utilities API with an in-house Python tool (last accessed on September 1, 2024). A BLAST search against a curated IAV reference set covering the eight viral segments was performed to (i) identify the closest reference for each sequence and (ii) validate the segment number. Alignments for each segment were obtained in three steps. First, Nextalign was used to build sub-alignments, each containing all the sequences that shared the same closest reference in the IAV reference set. Second, the alignment of the IAV reference set was used to guide the insertion of gaps into the individual sub-alignments, ensuring consistency with the corresponding reference sequence. Finally, the sub-alignments were concatenated to obtain the full alignment for each segment. Nucleotide alignments were subsequently trimmed to the coding sequences (CDS) of the IAV proteins and translated into amino acid alignments.
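The gap-insertion step can be illustrated with a minimal sketch. It assumes that every sequence in a sub-alignment is already in the coordinates of its ungapped closest reference (one column per reference base), and that the same reference appears with gaps in the alignment of the reference set. The function name and the toy sequences are hypothetical and are not taken from the in-house tool.

```python
from typing import Dict


def project_onto_reference(gapped_reference: str,
                           sub_alignment: Dict[str, str]) -> Dict[str, str]:
    """Insert a gap column into every sequence wherever the gapped reference has a gap."""
    reference_length = sum(1 for base in gapped_reference if base != "-")
    for name, sequence in sub_alignment.items():
        if len(sequence) != reference_length:
            raise ValueError(f"{name} is not in reference coordinates")

    projected = {name: [] for name in sub_alignment}
    position = 0  # index in ungapped-reference coordinates
    for column in gapped_reference:
        if column == "-":
            # Gap in the reference alignment: pad every sequence.
            for characters in projected.values():
                characters.append("-")
        else:
            # Reference base: copy the matching column of the sub-alignment.
            for name, sequence in sub_alignment.items():
                projected[name].append(sequence[position])
            position += 1
    return {name: "".join(characters) for name, characters in projected.items()}


# Toy example: the reference has one gap in the reference-set alignment.
example = project_onto_reference("AC-GT", {"s1": "ACGT", "s2": "A-GT"})
```

After this projection, all sub-alignments of a segment share the column layout of the reference-set alignment, so they can be stacked into a single alignment.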

Several curation steps were applied during this process: removing non-IAV genomic sequences, eliminating sequences without significant BLAST hits, removing sequences shorter than a pre-established, segment-specific length threshold, and removing sequences that could not be aligned to the reference with Nextalign. The final IAV sequence database contains the following numbers of sequences: Segment 1 = 106947, Segment 2 = 105251, Segment 3 = 107209, Segment 4 = 150984, Segment 5 = 71851, Segment 6 = 126274, Segment 7 = 119895, Segment 8 = 77505.
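The four curation filters could be expressed as a single check per sequence, as in the sketch below. The length thresholds and the E-value cut-off are placeholders chosen for illustration only, since the actual values are not stated here. The `Candidate` record, its fields, and the accession in the example are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder values for illustration; the real per-segment thresholds
# and the BLAST significance cut-off are not given in the text.
MIN_LENGTH = {1: 2000, 2: 2000, 3: 1900, 4: 1500, 5: 1300, 6: 1200, 7: 800, 8: 700}
MAX_EVALUE = 1e-10


@dataclass
class Candidate:
    accession: str
    is_iav: bool                  # how non-IAV records are detected is not specified
    segment: int                  # segment number validated by BLAST
    length: int                   # nucleotide length
    best_evalue: Optional[float]  # None if BLAST returned no hit
    aligned: bool                 # True if Nextalign produced an alignment


def rejection_reason(candidate: Candidate) -> Optional[str]:
    """Return why a sequence is discarded, or None if it passes all filters."""
    if not candidate.is_iav:
        return "non-IAV sequence"
    if candidate.best_evalue is None or candidate.best_evalue > MAX_EVALUE:
        return "no significant BLAST hit"
    if candidate.length < MIN_LENGTH[candidate.segment]:
        return "shorter than segment threshold"
    if not candidate.aligned:
        return "could not be aligned with Nextalign"
    return None


# Hypothetical record: a segment 4 sequence that is too short under the placeholder threshold.
short_ha = Candidate("EXAMPLE_1", True, 4, 900, 1e-50, True)
reason = rejection_reason(short_ha)
```

Returning the reason rather than a boolean makes it straightforward to tally how many sequences each filter removed.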

For visualization, the IAV sequence database was clustered and phylogenetic trees were constructed. Clustering was performed on the nucleotide sequences of each segment using MMseqs2 (v15.6f452) with a 95% sequence identity threshold (--cluster-mode 2 --cov-mode 1 --min-seq-id 0.95). The representative sequences of the clusters in each segment were aligned using MAFFT (v7.453) with default parameters. Maximum likelihood trees were inferred with IQ-TREE (v2.1.2) using the best-fit model selected according to the Bayesian Information Criterion (BIC) and were midpoint-rooted for visualization purposes.
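As a rough sketch, the three tools could be invoked from Python as shown below. Only the three MMseqs2 options are taken from the text. The choice of the `easy-cluster` workflow, the `-m MFP` model-selection option for IQ-TREE, and all file names are assumptions made for illustration. The functions only build the argument lists and do not run anything.

```python
from typing import List


def clustering_command(fasta: str, out_prefix: str, tmp_dir: str) -> List[str]:
    """MMseqs2 clustering at 95% identity (easy-cluster workflow is an assumption)."""
    return ["mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
            "--cluster-mode", "2", "--cov-mode", "1", "--min-seq-id", "0.95"]


def alignment_command(representatives: str) -> List[str]:
    """MAFFT with default parameters; the alignment is written to standard output."""
    return ["mafft", representatives]


def tree_command(alignment: str) -> List[str]:
    """IQ-TREE 2 with ModelFinder (-m MFP), which ranks models by BIC by default."""
    return ["iqtree2", "-s", alignment, "-m", "MFP"]


# Hypothetical file names for segment 4.
commands = [
    clustering_command("segment4.fasta", "segment4_clusters", "tmp"),
    alignment_command("segment4_clusters_rep_seq.fasta"),
    tree_command("segment4_representatives.aln"),
]
# Each list can be passed to subprocess.run(..., check=True); for MAFFT,
# redirect stdout to the alignment file. Midpoint rooting is a separate
# post-processing step that is not shown here.
```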

Caption for trees

Maximum likelihood phylogeny of the clustered IAV dataset. The curated database for segment N (XXXX sequences) was clustered using MMseqs2 with a 0.95 identity threshold. Representative sequences (YYYY sequences) were aligned using MAFFT. Maximum likelihood trees were inferred with IQ-TREE and midpoint-rooted for visualization.

Number of sequences per segment, to substitute for XXXX (full alignment) and YYYY (clustered) in the respective captions

Segment      Full alignment   Clustered
Segment 1    106947           550
Segment 2    105251           519
Segment 3    107209           615
Segment 4    150984           143
Segment 5    71851            287
Segment 6    126274           101
Segment 7    119895           149
Segment 8    77505            262
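
To avoid copying the counts by hand, the eight captions could be generated from the table. The sketch below assumes a caption template with named placeholders for the segment number and the two counts. The template wording, the dictionary, and the function name are illustrative only.

```python
# Segment number -> (sequences in full alignment, cluster representatives),
# copied from the table above.
COUNTS = {
    1: (106947, 550),
    2: (105251, 519),
    3: (107209, 615),
    4: (150984, 143),
    5: (71851, 287),
    6: (126274, 101),
    7: (119895, 149),
    8: (77505, 262),
}

CAPTION_TEMPLATE = (
    "Maximum likelihood phylogeny of the clustered IAV dataset. "
    "The curated database for segment {segment} ({full} sequences) was clustered "
    "using MMseqs2 with a 0.95 identity threshold. "
    "Representative sequences ({clustered} sequences) were aligned using MAFFT. "
    "Maximum likelihood trees were inferred with IQ-TREE and midpoint-rooted "
    "for visualization."
)


def caption_for(segment: int) -> str:
    """Fill the caption template with the counts for one segment."""
    full, clustered = COUNTS[segment]
    return CAPTION_TEMPLATE.format(segment=segment, full=full, clustered=clustered)


captions = {segment: caption_for(segment) for segment in COUNTS}
```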

Sources