Overview of the filtering process within the vaccines.watch workflow, applied to public genomes of priority bacterial pathogens available in the International Nucleotide Sequence Database Collaboration (INSDC) databases. The genome data are filtered in a series of steps, shown from left to right in the table; the entries counted in each column are a subset of those counted in the previous column.

Pathogen	ENA entries (run accessions)	Illumina paired-end entries	Filtered entries¹	Entries with geotemporal data²	Entries available for download in SRA	Unique entries per sample³	Assembled genomes	Genomes with correct species	Genomes that passed QC	Genomes with collection date post-2010	Genomes typeable by Kaptive	Last updated
All Pathogens	393,698	357,249	339,179	176,396	175,621	166,124	158,105	155,271	151,450	133,794
A. baumannii	53,268	43,515	40,173	27,690	27,644	26,792	26,296	25,910	25,478	23,942	23,572	5/18/2026
K. pneumoniae SC	149,036	130,947	120,228	90,262	89,780	83,533	79,838	77,766	75,534	73,100	71,070	5/19/2026
S. pneumoniae	191,394	182,787	178,778	58,444	58,197	55,799	51,971	51,595	50,438	36,752		5/14/2026

¹ Entries (runs) are retained only if they have two FASTQ files, ≥20x mean coverage (assessed via the "base_count" field) and a single associated sample accession.
² Entries (runs) are retained only if the collection date is decodable to at least the year and the sampling location is decodable to at least the country level.
³ Only one entry (run) is retained per sample accession: the run with the highest number of bases, as assessed via the "base_count" field.
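
The three footnoted filters can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the vaccines.watch implementation. The Run record, its field layout (semicolon-separated FASTQ paths and sample accessions), the genome length used to convert "base_count" into coverage, and the set of placeholder values treated as missing are all hypothetical. The SRA availability check that sits between the second and third filters in the table is omitted.

```python
import re
from dataclasses import dataclass

# Assumption: coverage is estimated as base_count / genome length.
# The genome length below is an illustrative value, not taken from the workflow.
ASSUMED_GENOME_LENGTH = 5_500_000
MIN_COVERAGE = 20

# Assumption: placeholder strings that mean "no usable value".
MISSING = {"", "missing", "not provided", "not collected", "unknown"}


@dataclass
class Run:
    """Hypothetical record for one run accession."""
    run_accession: str
    sample_accession: str  # semicolon-separated if more than one (assumption)
    fastq_files: str       # semicolon-separated file paths (assumption)
    base_count: int
    collection_date: str
    country: str


def passes_read_filter(run, genome_length=ASSUMED_GENOME_LENGTH):
    """Filter 1: two FASTQ files, >=20x coverage, one sample accession."""
    n_fastq = len([f for f in run.fastq_files.split(";") if f])
    n_samples = len([s for s in run.sample_accession.split(";") if s])
    coverage = run.base_count / genome_length
    return n_fastq == 2 and n_samples == 1 and coverage >= MIN_COVERAGE


def has_geotemporal_data(run):
    """Filter 2: the date yields a year and a country is given."""
    year = re.match(r"^(19|20)\d{2}", run.collection_date.strip())
    # Keep only the part before any ":" (e.g. "Kenya:Nairobi" -> "kenya").
    country = run.country.split(":")[0].strip().lower()
    return year is not None and country not in MISSING


def one_run_per_sample(runs):
    """Filter 3: for each sample, keep the run with the most bases."""
    best = {}
    for run in runs:
        current = best.get(run.sample_accession)
        if current is None or run.base_count > current.base_count:
            best[run.sample_accession] = run
    return list(best.values())


def filter_runs(runs):
    """Apply filters 1 and 2 to each run, then deduplicate per sample."""
    kept = [r for r in runs if passes_read_filter(r) and has_geotemporal_data(r)]
    return one_run_per_sample(kept)
```

Under the assumed genome length of 5.5 Mb, a run needs at least 110,000,000 bases to reach 20x coverage. A per-species genome length would be a natural refinement, since the three pathogens in the table differ considerably in genome size.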