Database Overview#
SpacerDB is built on DuckDB, a high-performance analytical database system. The full database contains CRISPR spacer sequences, their hits against viral and plasmid databases, and comprehensive metadata about arrays, samples, and analysis results.
Database Structure#
The database is organized into several interconnected tables:
1. Core Tables#
- repeat_tbl: Contains CRISPR repeat information:
  - Repeat sequence and unique repeat ID
  - Predicted CRISPR array type (e.g., I-E)
  - Taxonomic classification (LCA-based taxonomic assignment)
- spacer_tbl: Stores CRISPR spacer sequences:
  - Spacer sequence and unique spacer IDs
  - Corresponding CRISPR repeat (stored under "crispr_array") and sample ID (stored under "library")
  - Length and coverage of the spacer
  - Quality flags
- sample_tbl: Includes sample metadata:
  - Sample ID, including SRA run identifier
  - Ecosystem classification
  - BioProject/BioSample IDs
  - Sequencing platform and run statistics
- spacer_clusters: Contains information about spacer clusters:
  - Unique spacer ID
  - Unique spacer cluster ID (resulting from a 100% identity clustering, i.e. clustering identical spacers identified in different samples and/or with different repeats)
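
As a rough illustration of how the core tables link together, the sketch below joins spacer_tbl to repeat_tbl and sample_tbl through the "crispr_array" and "library" columns described above. All other column names (spacer_id, repeat_id, crispr_type, sample_id, ecosystem) are hypothetical placeholders, and a small in-memory SQLite database with toy rows stands in for the real DuckDB file so that the example runs on its own. Check the actual schema before adapting the query.

```python
import sqlite3

# Toy stand-in for the real database. SQLite is used only to keep the sketch
# self-contained; column names other than crispr_array and library are assumed.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE repeat_tbl (repeat_id TEXT, crispr_type TEXT);
CREATE TABLE sample_tbl (sample_id TEXT, ecosystem TEXT);
CREATE TABLE spacer_tbl (spacer_id TEXT, crispr_array TEXT, library TEXT);
INSERT INTO repeat_tbl VALUES ('rep1', 'I-E'), ('rep2', 'II-A');
INSERT INTO sample_tbl VALUES ('runA', 'soil'), ('runB', 'marine');
INSERT INTO spacer_tbl VALUES
  ('sp1', 'rep1', 'runA'), ('sp2', 'rep1', 'runB'), ('sp3', 'rep2', 'runB');
""")

# Each spacer points to its repeat (crispr_array) and its sample (library).
JOIN_SQL = """
SELECT s.spacer_id, r.crispr_type, m.ecosystem
FROM spacer_tbl AS s
JOIN repeat_tbl AS r ON s.crispr_array = r.repeat_id
JOIN sample_tbl AS m ON s.library = m.sample_id
ORDER BY s.spacer_id
"""

rows = con.execute(JOIN_SQL).fetchall()
for row in rows:
    print(row)
```

With the real database, the same SQL string could presumably be passed to a DuckDB connection (for example one opened with `duckdb.connect(path, read_only=True)`) once the placeholder column names are replaced with the real ones.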
2. Spacer Hit Tables#
- imgvr_hits / imgpr_hits: Spacer hits against the IMG/VR and IMG/PR databases, containing:
  - Virus/Plasmid unique identifier
  - Hit coordinates and strand
  - Mismatches between spacer and predicted protospacer
  - Protospacer and protospacer-flanking sequences
- imgvr_info / imgpr_info: Information about IMG/VR and IMG/PR sequences:
  - Virus/Plasmid unique identifier
  - Length and completeness prediction
  - Taxonomic classification (for viruses)
- imgvr_hits_extra / imgpr_hits_extra: Additional hit information (legacy database only)
- spacer_hits_imgvr / spacer_hits_imgpr: Summarized hit information per spacer (legacy database only)
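
To make the relationship between the hit and info tables more concrete, here is a minimal sketch that counts distinct spacers per target sequence, optionally restricted by mismatch count. The column names (spacer_id, target_id, n_mismatches, length) are assumptions made for illustration, not the documented schema, and the toy rows live in an in-memory SQLite database that stands in for the DuckDB file.

```python
import sqlite3

# Toy stand-in tables; all column names here are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE imgvr_hits (spacer_id TEXT, target_id TEXT, n_mismatches INTEGER);
CREATE TABLE imgvr_info (target_id TEXT, length INTEGER);
INSERT INTO imgvr_hits VALUES
  ('sp1', 'virusX', 0), ('sp2', 'virusX', 1),
  ('sp3', 'virusY', 0), ('sp1', 'virusY', 1);
INSERT INTO imgvr_info VALUES ('virusX', 45000), ('virusY', 8000);
""")


def spacers_per_target(con, max_mismatches):
    """Count distinct spacers hitting each target with <= max_mismatches."""
    sql = """
    SELECT h.target_id, i.length, COUNT(DISTINCT h.spacer_id) AS n_spacers
    FROM imgvr_hits AS h
    JOIN imgvr_info AS i ON h.target_id = i.target_id
    WHERE h.n_mismatches <= ?
    GROUP BY h.target_id, i.length
    ORDER BY h.target_id
    """
    return con.execute(sql, [max_mismatches]).fetchall()


exact = spacers_per_target(con, 0)    # perfect matches only
relaxed = spacers_per_target(con, 1)  # up to one mismatch
print(exact)
print(relaxed)
```

The same pattern would apply to imgpr_hits and imgpr_info by swapping the table names.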
3. Additional Tables#
- spacer_hq_tbl: Table containing spacer information for high-quality spacers only
- spacer_hq_clusters: Table linking spacer cluster ID to spacer ID, for high-quality spacers only
- Other tables are found exclusively in the legacy database and are specific to the original analysis of SpacerDB.
Database versions#
Several versions of the spacer database are available; they are explained in more detail on the Quick Start page. Briefly, the "full" database includes all spacer and spacer hit information, the "tax" database includes only spacers associated with taxonomically assigned repeats and no spacer hit information, and the "legacy" database is only meant to document the original database structure and content. The potential use of each version is illustrated in the Example Notebooks. The table below summarizes which tables (rows) are included in each version of the database (columns).
Main tables#
Table | Content | In full | In tax | In legacy | Notes |
---|---|---|---|---|---|
spacer_tbl | Complete spacer table, including high-quality and non-high-quality spacers | yes | filtered | yes | |
spacer_hq_tbl | Table of high-quality spacers only | yes | filtered | yes | named “spacer_filt_tbl” in legacy |
spacer_clusters | Table showing links between spacer and spacer clusters, from All_spacers_info_filtered_clusters-Jul19-24.tsv | yes | filtered | yes | |
spacer_hq_clusters | Table showing the links between high-quality spacers and spacer clusters | yes | filtered | yes | named “spacer_filt_clusters” in legacy |
repeat_tbl | Table of repeats with CRISPR type and taxonomic information, from Array_info_filtered_for_db-Nov1-24.tsv | yes | filtered | yes | named “array_tbl” in legacy |
sample_tbl | Table of samples from which spacers were extracted, with dataset and ecosystem information, from Runs_to_ecosystem_and_sequencing_and_study_for_db-Jul28-24.tsv | yes | filtered | yes | |
imgvr_info | Table with information about IMG/VR sequences, from IMGVR_sequence_information_Oct17.tsv | yes | no | yes | |
imgvr_hits | Table of hits (0 or 1 mismatch) to IMG/VR, restricted to spacers also in spacer_hq_clusters | yes | no | yes | named “imgvr_hits_filt” in legacy |
imgpr_info | Table with information about IMG/PR sequences, from IMGPR_sequence_information_Aug26.tsv | yes | no | yes | |
imgpr_hits | Table of hits (0 or 1 mismatch) to IMG/PR, restricted to spacers also in spacer_hq_clusters | yes | no | yes | named “imgpr_hits_filt” in legacy |
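
The table above can be turned into a small helper for guessing which version of the database a file corresponds to, based only on the table names it contains. This is a sketch under the assumption that the presence or absence of the tables listed above is enough to tell the versions apart; the function name `guess_version` and its return labels are hypothetical, and the legacy name mapping is taken from the Notes column.

```python
# Legacy table names and their current equivalents, from the Notes column.
LEGACY_TO_CURRENT = {
    "spacer_filt_tbl": "spacer_hq_tbl",
    "spacer_filt_clusters": "spacer_hq_clusters",
    "array_tbl": "repeat_tbl",
    "imgvr_hits_filt": "imgvr_hits",
    "imgpr_hits_filt": "imgpr_hits",
}

# Tables listed as present in "full" but absent from "tax".
HIT_TABLES = {"imgvr_info", "imgvr_hits", "imgpr_info", "imgpr_hits"}


def guess_version(table_names):
    """Guess the database version from the set of table names (heuristic)."""
    names = set(table_names)
    if names & set(LEGACY_TO_CURRENT):
        return "legacy"  # any legacy-only table name is present
    if HIT_TABLES <= names:
        return "full"    # all hit and info tables are present
    if not (HIT_TABLES & names) and "spacer_tbl" in names:
        return "tax"     # spacer tables only, no hit information
    return "unknown"


print(guess_version(["spacer_tbl", "repeat_tbl", "sample_tbl"]))
```

In DuckDB, the list of table names could be obtained with a statement such as `SHOW TABLES` before calling the helper.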
Tables or views exclusive to the legacy database#
Table | Content |
---|---|
imgvr_hits_extra_filt | Table of hits (2 or 3 mismatches) to IMG/VR, restricted to spacers also in spacer_hq_clusters |
imgpr_hits_extra_filt | Table of hits (2 or 3 mismatches) to IMG/PR, restricted to spacers also in spacer_hq_clusters |
multitaxa_uvig_list | Table of UViGs identified as potentially targeted by CRISPR repeats assigned to multiple classes |
selected_sets_for_alphadiv | View of repeat-sample combinations for which at least 10 spacers were extracted |
sets_alphadiv | Table with alpha diversity information for each set, computed outside of the database |
spacer_cover | View showing the total coverage of each unique spacer id in each individual sample |
spacer_hits_imgpr | Table of high-quality spacers with at least one hit in IMG/PR |
spacer_hits_imgvr | Table of high-quality spacers with at least one hit in IMG/VR |
spacer_len_stat | View showing statistics on spacer length |
array_highcov | Table including all repeats with at least 1 spacer with coverage >=20 |
array_sample | View linking repeat and sample |
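
For readers working with the non-legacy versions, a view similar in spirit to selected_sets_for_alphadiv could be rebuilt from spacer_tbl. The sketch below is not the legacy view's actual definition: it assumes that "at least 10 spacers" means at least 10 distinct spacer IDs per repeat-sample combination, uses a hypothetical spacer_id column and view name, and runs on toy rows in an in-memory SQLite database standing in for DuckDB.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE spacer_tbl (spacer_id TEXT, crispr_array TEXT, library TEXT)"
)

# Toy data: rep1/runA has 12 spacers, rep2/runA has only 3.
toy_rows = [(f"sp{i}", "rep1", "runA") for i in range(12)]
toy_rows += [(f"sp{i}", "rep2", "runA") for i in range(12, 15)]
con.executemany("INSERT INTO spacer_tbl VALUES (?, ?, ?)", toy_rows)

# Hypothetical view: keep repeat-sample combinations with >= 10 spacers.
con.execute("""
CREATE VIEW sets_with_10_spacers AS
SELECT crispr_array, library, COUNT(DISTINCT spacer_id) AS n_spacers
FROM spacer_tbl
GROUP BY crispr_array, library
HAVING COUNT(DISTINCT spacer_id) >= 10
""")

selected = con.execute("SELECT * FROM sets_with_10_spacers").fetchall()
print(selected)
```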
Database Statistics#
The database contains well over a billion records across its tables:
- Over 1.6 billion spacer sequences in spacer_tbl
- Over 790 million filtered spacers in spacer_hq_tbl
- Over 200 million hits against IMG/VR and IMG/PR databases
- Comprehensive coverage of diverse ecosystems and taxonomic groups
For detailed statistics and database metrics, see the Database Overview Notebook.
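
Row counts like the ones above can be checked with a simple `COUNT(*)` per table. The helper below is a minimal sketch: `count_rows` is a hypothetical name, and the toy tables in an in-memory SQLite database only stand in for the real DuckDB file. It relies on the connection exposing `execute(...).fetchone()`, which both sqlite3 and DuckDB's Python client provide.

```python
import sqlite3


def count_rows(con, table_names):
    """Return {table_name: row_count} for the given tables."""
    counts = {}
    for name in table_names:
        # Table names cannot be bound as parameters, so validate them first.
        if not name.isidentifier():
            raise ValueError(f"unexpected table name: {name!r}")
        counts[name] = con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
    return counts


# Toy stand-in tables with a handful of rows each.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE spacer_tbl (spacer_id TEXT);
CREATE TABLE spacer_hq_tbl (spacer_id TEXT);
INSERT INTO spacer_tbl VALUES ('sp1'), ('sp2'), ('sp3');
INSERT INTO spacer_hq_tbl VALUES ('sp1'), ('sp2');
""")

counts = count_rows(con, ["spacer_tbl", "spacer_hq_tbl"])
print(counts)
```

On the real database, counting rows in the largest tables may take noticeably longer than on this toy example.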