Skip to content

Database Schema#

This page documents the structure of the SpacerDB database tables and their relationships. The database is stored in DuckDB format, which is a column-store database that is optimized for analytical queries. See DuckDB Python conversion for more information on what datatypes/variable types are called in duckdb and how to interpret them when using Python.

Relationships#

This simplified schematic shows the relationships between the main tables in the SpacerDb databases.

erDiagram
    **spacer_tbl** }|--|| **repeat_tbl**: "connected to"
    **spacer_tbl** {
        string spacer_id
        string crispr_array
        string library
    }
    **repeat_tbl** {
        string repeat_cluster
    }
    **spacer_tbl** }|--|| **sample_tbl** : "sourced from"
    **sample_tbl** {
        string library
    }
    **spacer_tbl** }|--|| **spacer_clusters** : "grouped in"
    **spacer_clusters** {
        string spacer_id
        string spacer_cluster_id
    }
    **spacer_clusters** ||--o{ **imgvr_hits** : "matches in"
    **imgvr_hits** {
        string spacer_cluster_id
        string target_id
    }
    **spacer_clusters** ||--o{ **imgpr_hits** : "matches in"
    **imgpr_hits** {
        string spacer_cluster_id
        string target_id
    }
    **imgvr_hits** }o--|| **imgvr_info** : "has information in"
    **imgvr_info** {
        string uvig
    }
    **imgpr_hits** }o--|| **imgpr_info** : "has information in"
    **imgpr_info** {
        string plasmid_id
    }

Detailed Table Content#

spacer_tbl#

Information about CRISPR spacers and their properties.

Column Type Description
spacer_id VARCHAR Spacer identifier
spacer_sequence VARCHAR The spacer DNA sequence
spacer_length BIGINT Length of the spacer
spacer_coverage BIGINT Coverage depth
warning VARCHAR Quality warning flags
crispr_array VARCHAR Associated repeat ID
hq TINYINT High-quality flag (0 or 1)
library VARCHAR Source library ID

repeat_tbl#

Information about CRISPR repeat, including their predicted CRISPR types and taxonomic assignment.

Column Type Description
repeat_cluster VARCHAR Unique repeat identifier (e.g., Ac_00008)
type VARCHAR CRISPR array type (e.g., I-E)
lca_origin VARCHAR Taxonomic origin confidence
lca_full VARCHAR Full taxonomic lineage
lca_class VARCHAR Class-level taxonomy
lca_family VARCHAR Family-level taxonomy
lca_genus VARCHAR Genus-level taxonomy

sample_tbl#

Sample metadata, including ecosystem information extracted from BioSample.

Column Type Description
library VARCHAR Library identifier
sra_run VARCHAR SRA run accession
ecosystem_base VARCHAR Raw ecosystem classification
ecosystem_sum VARCHAR Summarized ecosystem information
sequencing_platform VARCHAR Sequencing technology used
bioproject VARCHAR BioProject accession
biosample VARCHAR BioSample accession
n_spot BIGINT Number of spots sequenced
n_base BIGINT Number of bases sequenced
title VARCHAR Sample title/description

spacer_clusters#

Table listing the connection between individual spacers and spacer clusters (100% identity of 100% spacer length).

Column Type Description
cluster_id VARCHAR Spacer cluster identifier (e.g. Sp_cl_000000188774)
spacer_id VARCHAR Spacer identifier

Hit Analysis Tables#

imgvr_hits#

Spacer hits against viruses from the IMG/VR v4 database, with 0 or 1 mismatches.

Column Type Description
cluster_id VARCHAR Spacer cluster ID
target_id VARCHAR Target sequence ID
hit_start BIGINT Hit start position
hit_end BIGINT Hit end position
hit_strand VARCHAR Strand orientation
n_mismatches BIGINT Number of mismatches
CIGAR VARCHAR CIGAR string
MD VARCHAR MD tag
spacer VARCHAR Spacer sequence
protospacer VARCHAR Protospacer sequence
upstream VARCHAR Upstream flanking sequence
downstream VARCHAR Downstream flanking sequence

imgvr_info#

Information about the IMG/VR v4 sequences, see the IMG/VR v4 paper for more explanation.

Column Type Description
uvig VARCHAR Unique identifier for the virus genome
ecosystem VARCHAR Ecosystem of origin for the virus genome (if obtained from a metagenome)
votu VARCHAR vOTU of the virus genome (see this link for more information about vOTU clustering)
length BIGINT Length of the virus genome
topology VARCHAR Topology of the virus genome
quality VARCHAR Quality of the virus genome based on MIUViG criteria
taxonomy_full VARCHAR Taxonomic assignment of the virus genome
class VARCHAR Taxonomic assignment at the class rank of the virus genome
host_prediction VARCHAR Predicted host taxonomy, if available
host_method VARCHAR Method used to predict the host taxonomy
host_from_taxo VARCHAR Identification of phages and/or archaea viruses based on virus taxonomic assignment
source VARCHAR Type of dataset the virus genome was obtained from
lifestyle VARCHAR Predicted lifestyle for the virus (if infecting bacteria/archaea)
fibers BOOLEAN Are there multiple tail fibers identified in the virus genome (true=yes)
acr BOOLEAN Was there one or several anti-crispr genes identified in the virus genome (true=yes)
dgr BOOLEAN Was there one or several DGR loci identified in the virus genome (true=yes)

imgpr_hits#

Spacer hits against viruses from the IMG/PR v1 database, with 0 or 1 mismatches.

Column Type Description
cluster_id VARCHAR Spacer cluster ID
target_id VARCHAR Target sequence ID
hit_start BIGINT Hit start position
hit_end BIGINT Hit end position
hit_strand VARCHAR Strand orientation
n_mismatches BIGINT Number of mismatches
CIGAR VARCHAR CIGAR string
MD VARCHAR MD tag
spacer VARCHAR Spacer sequence
protospacer VARCHAR Protospacer sequence
upstream VARCHAR Upstream flanking sequence
downstream VARCHAR Downstream flanking sequence

imgpr_info#

Information about the IMG/PR v1 sequences, see the IMG/PR v1 paper for more explanation.

Column Type Description
full_plasmid_id VARCHAR Unique identifier for the plasmid
plasmid_id VARCHAR Simplified identifier for the plasmid
ptu VARCHAR Plasmid cluster
taxon_oid BIGINT Unique identifier of the original IMG dataset from which the plasmid sequence was obtained
scaffold_oid VARCHAR Unique identifier of the original scaffold of the plasmid on IMG
source_type VARCHAR Type of dataset the plasmid was obtained from
ecosystem VARCHAR Ecosystem of origin for the plasmid (if obtained from a metagenome)
ecosystem_summary VARCHAR Simplified ecosystem type (see "ecosystem" above)
length BIGINT Length of the plasmid
gene_count BIGINT Number of genes predicted on the plasmid
genomad_score DOUBLE GeNomad score for the plasmid prediction
putatively_complete BOOLEAN Is the plasmid considered as (near-)complete (true=yes)
topology VARCHAR Topology of the plasmid
has_mob_genes BOOLEAN Does the plasmid encode mobilization gene(s) (true=yes)
has_tfss BOOLEAN Does the plasmid encode type IV secretion system gene(s) (true=yes)
any_conjugation_gene BOOLEAN Does the plasmid encode conjugation gene(s) (true=yes)
has_mpf BOOLEAN Does the plasmid encode mating pair formation gene(s) (true=yes)
has_arg_genes BOOLEAN Does the plasmid encode antibiotic resistance gene(s) (true=yes)
putative_phage_plasmid BOOLEAN Was the plasmid identified as a possible phage-plasmid (true=yes)
host_prediction_method VARCHAR Method used to predict the host taxonomy
host_taxonomy VARCHAR Predicted host taxonomy, if available
closest_reference VARCHAR Identifier of the closest reference sequence for the plasmid (if available)