Database Overview#

SpacerDB is built on DuckDB, a high-performance analytical database system. The full database contains CRISPR spacer sequences, their hits against viral and plasmid databases, and comprehensive metadata about arrays, samples, and analysis results.

Database Structure#

The database is organized into several interconnected tables:

1. Core Tables#

  • repeat_tbl: Contains CRISPR repeat information:

    • Repeat sequence and unique repeat ID
    • Predicted CRISPR array type (e.g., I-E)
    • Taxonomic classification (LCA-based taxonomic assignment)
  • spacer_tbl: Stores CRISPR spacer sequences:

    • Spacer sequence and unique spacer IDs
    • Corresponding CRISPR repeat (stored under "crispr_array") and sample ID (stored under "library")
    • Length and coverage of the spacer
    • Quality flags
  • sample_tbl: Includes sample metadata:

    • Sample ID, including SRA run identifier
    • Ecosystem classification
    • BioProject/BioSample IDs
    • Sequencing platform and run statistics
  • spacer_clusters: Contains information about spacer clusters:

    • Unique spacer ID
    • Unique spacer cluster ID (resulting from a 100% identity clustering, i.e. clustering identical spacers identified in different samples and/or with different repeats)
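
As a minimal sketch of how the core tables relate, the example below joins spacers to their repeat (through "crispr_array") and to their sample (through "library"). It runs against a toy in-memory SQLite database from the Python standard library, used here only as a stand-in so that the sketch works without downloading SpacerDB; a connection opened with `duckdb.connect(path, read_only=True)` exposes the same `execute(...).fetchall()` pattern. Apart from `crispr_array` and `library`, every column name (`spacer_id`, `repeat_id`, `crispr_type`, `sample_id`, `ecosystem`) is an assumption and should be checked against the actual schema.

```python
import sqlite3

# Toy stand-in for the real database. Only "crispr_array" and "library"
# are column names documented above; the others are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE repeat_tbl (repeat_id TEXT, crispr_type TEXT);
CREATE TABLE sample_tbl (sample_id TEXT, ecosystem TEXT);
CREATE TABLE spacer_tbl (spacer_id TEXT, crispr_array TEXT, library TEXT);
INSERT INTO repeat_tbl VALUES ('R1', 'I-E'), ('R2', 'II-A');
INSERT INTO sample_tbl VALUES ('S1', 'Marine'), ('S2', 'Soil');
INSERT INTO spacer_tbl VALUES
  ('sp1', 'R1', 'S1'), ('sp2', 'R1', 'S2'), ('sp3', 'R2', 'S2');
""")


def spacers_per_type_and_ecosystem(con):
    """Count spacers for each (CRISPR type, ecosystem) pair."""
    query = """
        SELECT r.crispr_type, s.ecosystem, COUNT(*) AS n_spacers
        FROM spacer_tbl AS sp
        JOIN repeat_tbl AS r ON sp.crispr_array = r.repeat_id
        JOIN sample_tbl AS s ON sp.library = s.sample_id
        GROUP BY r.crispr_type, s.ecosystem
        ORDER BY r.crispr_type, s.ecosystem
    """
    return con.execute(query).fetchall()


counts = spacers_per_type_and_ecosystem(con)
for crispr_type, ecosystem, n_spacers in counts:
    print(crispr_type, ecosystem, n_spacers)
```

On the full database, spacer_tbl is very large, so an aggregation like this one is best restricted with a WHERE clause (for example to a single ecosystem) before it is run.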

2. Spacer Hit Tables#

  • imgvr_hits/imgpr_hits: Spacer hits against IMG/VR and IMG/PR databases containing:

    • Virus/Plasmid unique identifier
    • Hit coordinates and strand
    • Number of mismatches between spacer and predicted protospacer
    • Protospacer and protospacer-flanking sequences
  • imgvr_info/imgpr_info: Information about IMG/VR and IMG/PR sequences:

    • Virus/Plasmid unique identifier
    • Length and completeness prediction
    • Taxonomic classification (for viruses)
  • imgvr_hits_extra/imgpr_hits_extra: Additional hits information (legacy database only)

  • spacer_hits_imgvr/spacer_hits_imgpr: Summarized hit information per spacer (legacy database only)
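
The hit and info tables are meant to be used together. The sketch below counts, for each virus, how many distinct spacers hit it at or below a chosen number of mismatches. It again uses a toy in-memory SQLite database as a stand-in for the DuckDB file, and all column names (`spacer_id`, `uvig`, `n_mismatches`, `taxonomy`, and so on) are hypothetical placeholders for whatever the real imgvr_hits and imgvr_info tables use.

```python
import sqlite3

# Toy stand-in tables; every column name here is an assumption.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE imgvr_info (uvig TEXT, length INTEGER, taxonomy TEXT);
CREATE TABLE imgvr_hits (spacer_id TEXT, uvig TEXT, hit_start INTEGER,
                         hit_end INTEGER, strand TEXT, n_mismatches INTEGER);
INSERT INTO imgvr_info VALUES ('V1', 45000, 'Caudoviricetes'),
                              ('V2', 6000, 'Microviridae');
INSERT INTO imgvr_hits VALUES ('sp1', 'V1', 100, 132, '+', 0),
                              ('sp2', 'V1', 900, 931, '-', 1),
                              ('sp1', 'V2', 50, 82, '+', 1);
""")


def spacers_per_virus(con, max_mismatches=0):
    """Count distinct spacers hitting each virus with <= max_mismatches."""
    query = """
        SELECT i.uvig, i.taxonomy, COUNT(DISTINCT h.spacer_id) AS n_spacers
        FROM imgvr_hits AS h
        JOIN imgvr_info AS i ON h.uvig = i.uvig
        WHERE h.n_mismatches <= ?
        GROUP BY i.uvig, i.taxonomy
        ORDER BY i.uvig
    """
    return con.execute(query, (max_mismatches,)).fetchall()


exact = spacers_per_virus(con)                       # perfect matches only
relaxed = spacers_per_virus(con, max_mismatches=1)   # 0 or 1 mismatch
print(exact)
print(relaxed)
```

The same function should apply to plasmid hits by swapping in imgpr_hits and imgpr_info, assuming the two pairs of tables share a layout.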

3. Additional Tables#

  • spacer_hq_tbl: Table containing spacer information for high-quality spacers only
  • spacer_hq_clusters: Table linking spacer cluster ID to spacer ID, for high-quality spacers only
  • Other tables are found exclusively in the legacy database and are specific to the original analysis of SpacerDB.
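
Because a spacer cluster groups identical spacers found in different samples and/or with different repeats, a common first question is how widely each cluster is distributed. The following sketch joins spacer_hq_clusters to spacer_hq_tbl to count samples and repeats per cluster. It runs on a toy SQLite stand-in, and the `spacer_id` and `cluster_id` column names are assumptions; only `crispr_array` and `library` come from the table description above.

```python
import sqlite3

# Toy stand-in; "spacer_id" and "cluster_id" are hypothetical column names.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE spacer_hq_tbl (spacer_id TEXT, crispr_array TEXT, library TEXT);
CREATE TABLE spacer_hq_clusters (spacer_id TEXT, cluster_id TEXT);
INSERT INTO spacer_hq_tbl VALUES
  ('sp1', 'R1', 'S1'), ('sp2', 'R1', 'S2'), ('sp3', 'R2', 'S2');
INSERT INTO spacer_hq_clusters VALUES
  ('sp1', 'C1'), ('sp2', 'C1'), ('sp3', 'C2');
""")


def cluster_distribution(con):
    """Return (cluster, number of samples, number of repeats) per cluster."""
    query = """
        SELECT c.cluster_id,
               COUNT(DISTINCT t.library) AS n_samples,
               COUNT(DISTINCT t.crispr_array) AS n_repeats
        FROM spacer_hq_clusters AS c
        JOIN spacer_hq_tbl AS t ON c.spacer_id = t.spacer_id
        GROUP BY c.cluster_id
        ORDER BY c.cluster_id
    """
    return con.execute(query).fetchall()


distribution = cluster_distribution(con)
print(distribution)
```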

Database versions#

Different versions of the spacer database are available and are explained in more detail on the Quick Start page. Briefly, the "full" database includes all spacer and spacer hit information, the "tax" database includes only spacers associated with taxonomically assigned repeats and no spacer hit information, and the "legacy" database is only meant to document the original database structure and content. The potential uses of each version are illustrated in the Example Notebooks. The table below summarizes which tables (rows) are included in each version of the database (columns).

Main tables#

| Table | Content | In full | In tax | In legacy | Notes |
|---|---|---|---|---|---|
| spacer_tbl | Complete spacer table, including high-quality and non-high-quality spacers | yes | filtered | yes | |
| spacer_hq_tbl | Table of high-quality spacers only | yes | filtered | yes | named “spacer_filt_tbl” in legacy |
| spacer_clusters | Table showing links between spacer and spacer clusters, from All_spacers_info_filtered_clusters-Jul19-24.tsv | yes | filtered | yes | |
| spacer_hq_clusters | Table showing the links between high-quality spacers and spacer clusters | yes | filtered | yes | named “spacer_filt_clusters” in legacy |
| repeat_tbl | Table of repeats with CRISPR type and taxo info, from Array_info_filtered_for_db-Nov1-24.tsv | yes | filtered | yes | named “array_tbl” in legacy |
| sample_tbl | Table of samples from which spacers were extracted, with dataset and ecosystem information, from Runs_to_ecosystem_and_sequencing_and_study_for_db-Jul28-24.tsv | yes | filtered | yes | |
| imgvr_info | Table including information about IMG_VR, from IMGVR_sequence_information_Oct17.tsv | yes | no | yes | |
| imgvr_hits | Table of hits (0 or 1 mismatch) to img_vr that only include spacers also in spacer_hq_clusters | yes | no | yes | named “imgvr_hits_filt” in legacy |
| imgpr_info | Table with information about IMG_PR, from IMGPR_sequence_information_Aug26.tsv | yes | no | yes | |
| imgpr_hits | Table of hits (0 or 1 mismatch) to img_pr that only include spacers also in spacer_hq_clusters | yes | no | yes | named “imgpr_hits_filt” in legacy |
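
Code that must work across database versions has to account for the renamed legacy tables and for the tables absent from the "tax" version. One possible way to handle this is a small lookup helper, sketched below. The mapping is transcribed from the table above, while the helper itself (`resolve_table`) is a hypothetical illustration, not part of SpacerDB.

```python
# Main tables listed in the overview table above.
MAIN_TABLES = {
    "spacer_tbl", "spacer_hq_tbl", "spacer_clusters", "spacer_hq_clusters",
    "repeat_tbl", "sample_tbl", "imgvr_info", "imgvr_hits",
    "imgpr_info", "imgpr_hits",
}

# Current table name -> name used in the legacy database.
LEGACY_NAMES = {
    "spacer_hq_tbl": "spacer_filt_tbl",
    "spacer_hq_clusters": "spacer_filt_clusters",
    "repeat_tbl": "array_tbl",
    "imgvr_hits": "imgvr_hits_filt",
    "imgpr_hits": "imgpr_hits_filt",
}

# Tables that are not included in the "tax" database.
NOT_IN_TAX = {"imgvr_info", "imgvr_hits", "imgpr_info", "imgpr_hits"}


def resolve_table(name, version):
    """Return the table name to query in `version`, or None if absent."""
    if version not in ("full", "tax", "legacy"):
        raise ValueError(f"unknown database version: {version!r}")
    if name not in MAIN_TABLES:
        raise KeyError(f"not a main table: {name!r}")
    if version == "tax" and name in NOT_IN_TAX:
        return None
    if version == "legacy":
        return LEGACY_NAMES.get(name, name)
    return name


print(resolve_table("repeat_tbl", "legacy"))
print(resolve_table("imgvr_hits", "tax"))
```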

Tables or views exclusive to the legacy database#

| Table | Content |
|---|---|
| imgvr_hits_extra_filt | Table of hits (2 or 3 mismatches) to img_vr that only include spacers also in spacer_hq_clusters |
| imgpr_hits_extra_filt | Table of hits (2 or 3 mismatches) to img_pr that only include spacers also in spacer_hq_clusters |
| multitaxa_uvig_list | Table of uvigs identified as potentially targeted by CRISPR repeats assigned to multiple classes |
| selected_sets_for_alphadiv | View of repeat-sample combinations for which at least 10 spacers were extracted |
| sets_alphadiv | Table showing information about sets alpha diversity, computed outside of the database |
| spacer_cover | View showing the total coverage of each unique spacer ID in each individual sample |
| spacer_hits_imgpr | Table of high-quality spacers with at least one hit in IMG/PR |
| spacer_hits_imgvr | Table of high-quality spacers with at least one hit in IMG/VR |
| spacer_len_stat | View showing statistics on spacer length |
| array_highcov | Table including all repeats with at least 1 spacer with coverage >=20 |
| array_sample | View linking repeat and sample |
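
Several of the legacy views are simple aggregations that can be approximated in the other versions if needed. As an illustration, the sketch below rebuilds something similar to selected_sets_for_alphadiv (repeat-sample combinations with at least 10 spacers) using GROUP BY and HAVING. This is an approximation under stated assumptions, not the definition used in the legacy database: the view name `selected_sets_sketch` and the `spacer_id` column are hypothetical, each row of the table is counted as one spacer, and a toy SQLite database stands in for the DuckDB file.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE spacer_tbl (spacer_id TEXT, crispr_array TEXT, library TEXT)"
)

# Toy data: repeat R1 has 12 spacers in sample S1 but only 3 in sample S2.
rows = [(f"sp{i}", "R1", "S1") for i in range(12)]
rows += [(f"sp{i}", "R1", "S2") for i in range(12, 15)]
con.executemany("INSERT INTO spacer_tbl VALUES (?, ?, ?)", rows)

# Keep only repeat-sample combinations with at least 10 spacers.
con.execute("""
    CREATE VIEW selected_sets_sketch AS
    SELECT crispr_array, library, COUNT(*) AS n_spacers
    FROM spacer_tbl
    GROUP BY crispr_array, library
    HAVING COUNT(*) >= 10
""")

selected = con.execute("SELECT * FROM selected_sets_sketch").fetchall()
print(selected)
```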

Database Statistics#

The database contains more than a billion records across its tables:

  • Over 1.6 billion spacer sequences in spacer_tbl
  • Over 790 million filtered spacers in spacer_hq_tbl
  • Over 200 million hits against IMG/VR and IMG/PR databases
  • Comprehensive coverage of diverse ecosystems and taxonomic groups

For detailed statistics and database metrics, see the Database Overview Notebook.