Database Schema#

This page documents the structure of the SpacerDB database tables and their relationships. The database is stored in DuckDB format, which is a column-store database that is optimized for analytical queries. See DuckDB Python conversion for more information on what datatypes/variable types are called in duckdb and how to interpret them when using Python.

Relationships#

This simplified schematic shows the relationships between the main tables in the SpacerDb databases.

erDiagram
    **spacer_tbl** }|--|| **repeat_tbl**: "connected to"
    **spacer_tbl** {
        string spacer_id
        string crispr_array
        string library
    }
    **repeat_tbl** {
        string repeat_cluster
    }
    **spacer_tbl** }|--|| **sample_tbl** : "sourced from"
    **sample_tbl** {
        string library
    }
    **spacer_tbl** }|--|| **spacer_clusters** : "grouped in"
    **spacer_clusters** {
        string spacer_id
        string spacer_cluster_id
    }
    **spacer_clusters** ||--o{ **imgvr_hits** : "matches in"
    **imgvr_hits** {
        string spacer_cluster_id
        string target_id
    }
    **spacer_clusters** ||--o{ **imgpr_hits** : "matches in"
    **imgpr_hits** {
        string spacer_cluster_id
        string target_id
    }
    **imgvr_hits** }o--|| **imgvr_info** : "has information in"
    **imgvr_info** {
        string uvig
    }
    **imgpr_hits** }o--|| **imgpr_info** : "has information in"
    **imgpr_info** {
        string plasmid_id
    }

Detailed Table Content#

spacer_tbl#

Information about CRISPR spacers and their properties.

Column	Type	Description
spacer_id	VARCHAR	Spacer identifier
spacer_sequence	VARCHAR	The spacer DNA sequence
spacer_length	BIGINT	Length of the spacer
spacer_coverage	BIGINT	Coverage depth
warning	VARCHAR	Quality warning flags
crispr_array	VARCHAR	Associated repeat ID
hq	TINYINT	High-quality flag (0 or 1)
library	VARCHAR	Source library ID

repeat_tbl#

Information about CRISPR repeat, including their predicted CRISPR types and taxonomic assignment.

Column	Type	Description
repeat_cluster	VARCHAR	Unique repeat identifier (e.g., Ac_00008)
type	VARCHAR	CRISPR array type (e.g., I-E)
lca_origin	VARCHAR	Taxonomic origin confidence
lca_full	VARCHAR	Full taxonomic lineage
lca_class	VARCHAR	Class-level taxonomy
lca_family	VARCHAR	Family-level taxonomy
lca_genus	VARCHAR	Genus-level taxonomy

sample_tbl#

Sample metadata, including ecosystem information extracted from BioSample.

Column	Type	Description
library	VARCHAR	Library identifier
sra_run	VARCHAR	SRA run accession
ecosystem_base	VARCHAR	Raw ecosystem classification
ecosystem_sum	VARCHAR	Summarized ecosystem information
sequencing_platform	VARCHAR	Sequencing technology used
bioproject	VARCHAR	BioProject accession
biosample	VARCHAR	BioSample accession
n_spot	BIGINT	Number of spots sequenced
n_base	BIGINT	Number of bases sequenced
title	VARCHAR	Sample title/description

spacer_clusters#

Table listing the connection between individual spacers and spacer clusters (100% identity of 100% spacer length).

Column	Type	Description
cluster_id	VARCHAR	Spacer cluster identifier (e.g. Sp_cl_000000188774)
spacer_id	VARCHAR	Spacer identifier

Hit Analysis Tables#

imgvr_hits#

Spacer hits against viruses from the IMG/VR v4 database, with 0 or 1 mismatches.

Column	Type	Description
cluster_id	VARCHAR	Spacer cluster ID
target_id	VARCHAR	Target sequence ID
hit_start	BIGINT	Hit start position
hit_end	BIGINT	Hit end position
hit_strand	VARCHAR	Strand orientation
n_mismatches	BIGINT	Number of mismatches
CIGAR	VARCHAR	CIGAR string
MD	VARCHAR	MD tag
spacer	VARCHAR	Spacer sequence
protospacer	VARCHAR	Protospacer sequence
upstream	VARCHAR	Upstream flanking sequence
downstream	VARCHAR	Downstream flanking sequence

imgvr_info#

Information about the IMG/VR v4 sequences, see the IMG/VR v4 paper for more explanation.

Column	Type	Description
uvig	VARCHAR	Unique identifier for the virus genome
ecosystem	VARCHAR	Ecosystem of origin for the virus genome (if obtained from a metagenome)
votu	VARCHAR	vOTU of the virus genome (see this link for more information about vOTU clustering)
length	BIGINT	Length of the virus genome
topology	VARCHAR	Topology of the virus genome
quality	VARCHAR	Quality of the virus genome based on MIUViG criteria
taxonomy_full	VARCHAR	Taxonomic assignment of the virus genome
class	VARCHAR	Taxonomic assignment at the class rank of the virus genome
host_prediction	VARCHAR	Predicted host taxonomy, if available
host_method	VARCHAR	Method used to predict the host taxonomy
host_from_taxo	VARCHAR	Identification of phages and/or archaea viruses based on virus taxonomic assignment
source	VARCHAR	Type of dataset the virus genome was obtained from
lifestyle	VARCHAR	Predicted lifestyle for the virus (if infecting bacteria/archaea)
fibers	BOOLEAN	Are there multiple tail fibers identified in the virus genome (true=yes)
acr	BOOLEAN	Was there one or several anti-crispr genes identified in the virus genome (true=yes)
dgr	BOOLEAN	Was there one or several DGR loci identified in the virus genome (true=yes)

imgpr_hits#

Spacer hits against viruses from the IMG/PR v1 database, with 0 or 1 mismatches.

Column	Type	Description
cluster_id	VARCHAR	Spacer cluster ID
target_id	VARCHAR	Target sequence ID
hit_start	BIGINT	Hit start position
hit_end	BIGINT	Hit end position
hit_strand	VARCHAR	Strand orientation
n_mismatches	BIGINT	Number of mismatches
CIGAR	VARCHAR	CIGAR string
MD	VARCHAR	MD tag
spacer	VARCHAR	Spacer sequence
protospacer	VARCHAR	Protospacer sequence
upstream	VARCHAR	Upstream flanking sequence
downstream	VARCHAR	Downstream flanking sequence

imgpr_info#

Information about the IMG/PR v1 sequences, see the IMG/PR v1 paper for more explanation.

Column	Type	Description
full_plasmid_id	VARCHAR	Unique identifier for the plasmid
plasmid_id	VARCHAR	Simplified identifier for the plasmid
ptu	VARCHAR	Plasmid cluster
taxon_oid	BIGINT	Unique identifier of the original IMG dataset from which the plasmid sequence was obtained
scaffold_oid	VARCHAR	Unique identifier of the original scaffold of the plasmid on IMG
source_type	VARCHAR	Type of dataset the plasmid was obtained from
ecosystem	VARCHAR	Ecosystem of origin for the plasmid (if obtained from a metagenome)
ecosystem_summary	VARCHAR	Simplified ecosystem type (see "ecosystem" above)
length	BIGINT	Length of the plasmid
gene_count	BIGINT	Number of genes predicted on the plasmid
genomad_score	DOUBLE	GeNomad score for the plasmid prediction
putatively_complete	BOOLEAN	Is the plasmid considered as (near-)complete (true=yes)
topology	VARCHAR	Topology of the plasmid
has_mob_genes	BOOLEAN	Does the plasmid encode mobilization gene(s) (true=yes)
has_tfss	BOOLEAN	Does the plasmid encode type IV secretion system gene(s) (true=yes)
any_conjugation_gene	BOOLEAN	Does the plasmid encode conjugation gene(s) (true=yes)
has_mpf	BOOLEAN	Does the plasmid encode mating pair formation gene(s) (true=yes)
has_arg_genes	BOOLEAN	Does the plasmid encode antibiotic resistance gene(s) (true=yes)
putative_phage_plasmid	BOOLEAN	Was the plasmid identified as a possible phage-plasmid (true=yes)
host_prediction_method	VARCHAR	Method used to predict the host taxonomy
host_taxonomy	VARCHAR	Predicted host taxonomy, if available
closest_reference	VARCHAR	Identifier of the closest reference sequence for the plasmid (if available)