Getting Started
Database Access
SpacerDB is accessible as a DuckDB database via a dedicated S3 bucket: s3://spacers-data.jgi.doe.gov/spacers. DuckDB can query the database directly from this location, as demonstrated in the notebooks. If you plan to use the database routinely, or intend to analyse the complete database, we strongly recommend copying it to your local machine or cloud storage and querying that local copy instead.
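As an illustration of direct access, the following is a minimal sketch (not the notebooks' exact code) of attaching a database file from the bucket in read-only mode with the duckdb Python package. It assumes that the database files sit directly under the prefix given above, that DuckDB's httpfs extension can read the bucket without extra credentials or endpoint settings, and that the installed DuckDB version can read the file format. The helper names and the `spacerdb` alias are hypothetical.

```python
# Prefix taken from the text above; the exact object layout below it is an assumption.
S3_PREFIX = "s3://spacers-data.jgi.doe.gov/spacers"


def build_attach_sql(file_name: str, alias: str = "spacerdb") -> str:
    """Build a read-only ATTACH statement for one database file under the prefix."""
    return f"ATTACH '{S3_PREFIX}/{file_name}' AS {alias} (READ_ONLY);"


def connect_remote(file_name: str, alias: str = "spacerdb"):
    """Open an in-memory DuckDB session and attach the remote database read-only."""
    import duckdb  # imported here so the SQL helper above works without duckdb installed

    con = duckdb.connect()       # in-memory session; nothing is written locally
    con.execute("INSTALL httpfs;")  # extension that provides S3/HTTP(S) access
    con.execute("LOAD httpfs;")
    con.execute(build_attach_sql(file_name, alias))
    con.execute(f"USE {alias};")    # make the attached database the default
    return con


# Example usage (requires network access; the path is an assumption):
# con = connect_remote("global_crispr_db_spacertaxa_2025-05-02.duckdb")
# print(con.execute("SHOW TABLES").fetchall())
```

Every query run this way reads data over the network, which is one reason a local copy is preferable for routine use.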
A fair warning: the spacer database is large, needing between ~80 GB and ~650 GB of disk space depending on the version (see below). Queries on these databases typically require more than 10 GB of RAM. We highly recommend downloading and operating on these databases only on large systems such as high-performance computing clusters or appropriately resourced cloud instances.
Database Versions
Because the full SpacerDB database is relatively large and may include information you do not need, the database is available in several versions, each designed for a specific type of analysis.
- Full spacer database: This file (global_crispr_db_full_2025-05-02.duckdb) includes all spacers, as well as their hits to IMG/VR v4 and IMG/PR v1. This is the largest database, with a total size of 651 GB. It is available in DuckDB format v1.0.0 - v1.1.3, and can be accessed and/or copied from the S3 endpoint (see above) or downloaded directly (warning: 651 GB file).
- Selected spacers database: This file (global_crispr_db_spacertaxa_2025-05-02.duckdb) includes repeat and sample information for all spacers associated with a taxonomically assigned repeat. This is the database most relevant for host prediction, i.e. when using spacer hits to connect new viruses/MGEs to potential host taxa. At 80 GB, it spares users who are only interested in host prediction from working with the 651 GB full database. It is available in DuckDB format v1.0.0 - v1.1.3, and can be accessed and/or copied from the S3 endpoint (see above) or downloaded directly (warning: 80 GB file). A version of this database with indexes (global_crispr_db_spacertaxa_2025-05-02_indexed.duckdb) is also available, mostly for use in the notebooks.
- Legacy spacer database: This file (global_crispr_db_legacy.duckdb) corresponds to the original spacer database used in the Global Spacer manuscript. It should only be used to retrace the steps of the analysis described in the manuscript; use one of the databases above for any new analysis. This database has a total size of 523 GB, and can be downloaded as a DuckDB v0.1.0 file at this link.
See the database overview page for more information on exactly which tables and data are included in each database version.
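Given the file sizes listed above, it can help to confirm that the target directory has enough free space before starting a download. The following is a minimal sketch using only the Python standard library. The sizes are copied from the descriptions above and interpreted as decimal gigabytes, which is an assumption; the helper names and the 10% headroom are illustrative choices, not part of SpacerDB.

```python
import shutil

# Approximate sizes in GB, copied from the version descriptions above.
DB_SIZES_GB = {
    "global_crispr_db_full_2025-05-02.duckdb": 651,
    "global_crispr_db_spacertaxa_2025-05-02.duckdb": 80,
    "global_crispr_db_legacy.duckdb": 523,
}


def required_bytes(file_name: str, headroom: float = 1.1) -> int:
    """Bytes needed for the file, with a safety margin (headroom is an arbitrary choice)."""
    return int(DB_SIZES_GB[file_name] * 10**9 * headroom)


def has_room_for(file_name: str, target_dir: str = ".", headroom: float = 1.1) -> bool:
    """Return True if target_dir has enough free space for the given database file."""
    free = shutil.disk_usage(target_dir).free
    return free >= required_bytes(file_name, headroom)


if __name__ == "__main__":
    for name in DB_SIZES_GB:
        status = "OK" if has_room_for(name) else "not enough free space"
        print(f"{name}: {status}")
```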
Example Notebooks
Some common analyses of the spacer database, such as schema exploration, spacer identification for individual taxa, samples, or ecosystems, and spacer information extraction for host prediction, are available as Jupyter (.ipynb) notebooks. See the Example Notebooks section for more details.
Additional files available for download
Several FASTA files of spacers have been pre-exported and can be downloaded directly. These include:
- nr_spacers_hq-all_25-05-10.fna.gz: all high-quality spacers, non-redundant
- nr_spacers_hq-taxoselected_25-05-10.fna.gz: all high-quality spacers connected to a repeat with a taxonomic assignment (most relevant for host prediction purposes)
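These gzip-compressed FASTA files can be read without decompressing them to disk first. The following is a minimal sketch of a streaming reader using only the Python standard library. The function name `read_fasta` is hypothetical, and nothing is assumed about the header format beyond the standard leading ">".

```python
import gzip


def read_fasta(path):
    """Yield (header, sequence) tuples from a plain or gzip-compressed FASTA file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    header, chunks = None, []
    with opener(path, "rt") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(">"):
                # A new record starts: emit the previous one, if any
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)  # sequences may span several lines
    # Emit the final record
    if header is not None:
        yield header, "".join(chunks)


# Example usage with one of the files listed above (path is illustrative):
# n_spacers = sum(1 for _ in read_fasta("nr_spacers_hq-taxoselected_25-05-10.fna.gz"))
```

Because the reader is a generator, it keeps only one record in memory at a time, which matters for files of this size.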
Local copy
Database files can be downloaded directly using the links above. Because the files are relatively large, we recommend downloading with a utility such as aria2 that can resume the download if interrupted. After downloading a DuckDB file, you can verify that it is complete by comparing its md5sum against the following hashes:
| File | md5sum |
|---|---|
| global_crispr_db_full_2025-05-02.duckdb | 0078da21aa2f991cdf71f94a3f2b07c6 |
| global_crispr_db_spacertaxa_2025-05-02.duckdb | fd638d0338e98d4fba16dbf046571d7b |
| global_crispr_db_spacertaxa_2025-05-02_indexed.duckdb | fd638d0338e98d4fba16dbf046571d7b |
| global_crispr_db_legacy.duckdb | eb164b3e6987f6bcee2de14c8ddbdf21 |
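The checksum comparison can also be scripted. The following is a minimal sketch using Python's hashlib, reading the file in chunks so that a file of several hundred GB never has to fit in memory. The expected hashes are copied from the table above; the function names and the chunk size are illustrative choices.

```python
import hashlib
from pathlib import Path

# Expected hashes, copied from the table above.
EXPECTED_MD5 = {
    "global_crispr_db_full_2025-05-02.duckdb": "0078da21aa2f991cdf71f94a3f2b07c6",
    "global_crispr_db_spacertaxa_2025-05-02.duckdb": "fd638d0338e98d4fba16dbf046571d7b",
    "global_crispr_db_spacertaxa_2025-05-02_indexed.duckdb": "fd638d0338e98d4fba16dbf046571d7b",
    "global_crispr_db_legacy.duckdb": "eb164b3e6987f6bcee2de14c8ddbdf21",
}


def md5sum(path, chunk_size=8 * 1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the expected hash for its file name."""
    expected = EXPECTED_MD5[Path(path).name]
    return md5sum(path) == expected


# Example usage (path is illustrative):
# print(verify_download("path/to/global_crispr_db_spacertaxa_2025-05-02.duckdb"))
```

Hashing the largest file takes a while, since the whole file must be read once.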
Then, you can use these local files as follows:
- In Python, using the duckdb package:

```python
from pathlib import Path

import duckdb

# Connect (read-only) to the locally downloaded copy of the database
DB_PATH = "path/to/global_crispr_db_full_2025-05-02.duckdb"
con = duckdb.connect(DB_PATH, read_only=True)

# Check the database size on disk
db_stats = Path(DB_PATH).stat()
print(f"Database size: {db_stats.st_size / 1024**3:.2f} GiB")

# Test a simple query; .pl() returns the result as a Polars DataFrame
# (requires the polars package to be installed)
query = "SELECT COUNT(*) AS count FROM spacer_tbl"
result = con.execute(query).pl()
print(f"\nTotal spacers: {result['count'][0]:,}")
```
- Using the DuckDB CLI: