Database Schema#
This page documents the structure of the SpacerDB database tables and their relationships. The database is stored in DuckDB format, which is a column-store database that is optimized for analytical queries. See DuckDB Python conversion for more information on what datatypes/variable types are called in duckdb and how to interpret them when using Python.
Relationships#
This simplified schematic shows the relationships between the main tables in the SpacerDb databases.
erDiagram
**spacer_tbl** }|--|| **repeat_tbl**: "connected to"
**spacer_tbl** {
string spacer_id
string crispr_array
string library
}
**repeat_tbl** {
string repeat_cluster
}
**spacer_tbl** }|--|| **sample_tbl** : "sourced from"
**sample_tbl** {
string library
}
**spacer_tbl** }|--|| **spacer_clusters** : "grouped in"
**spacer_clusters** {
string spacer_id
string spacer_cluster_id
}
**spacer_clusters** ||--o{ **imgvr_hits** : "matches in"
**imgvr_hits** {
string spacer_cluster_id
string target_id
}
**spacer_clusters** ||--o{ **imgpr_hits** : "matches in"
**imgpr_hits** {
string spacer_cluster_id
string target_id
}
**imgvr_hits** }o--|| **imgvr_info** : "has information in"
**imgvr_info** {
string uvig
}
**imgpr_hits** }o--|| **imgpr_info** : "has information in"
**imgpr_info** {
string plasmid_id
}
Detailed Table Content#
spacer_tbl#
Information about CRISPR spacers and their properties.
Column | Type | Description |
---|---|---|
spacer_id | VARCHAR | Spacer identifier |
spacer_sequence | VARCHAR | The spacer DNA sequence |
spacer_length | BIGINT | Length of the spacer |
spacer_coverage | BIGINT | Coverage depth |
warning | VARCHAR | Quality warning flags |
crispr_array | VARCHAR | Associated repeat ID |
hq | TINYINT | High-quality flag (0 or 1) |
library | VARCHAR | Source library ID |
repeat_tbl#
Information about CRISPR repeat, including their predicted CRISPR types and taxonomic assignment.
Column | Type | Description |
---|---|---|
repeat_cluster | VARCHAR | Unique repeat identifier (e.g., Ac_00008) |
type | VARCHAR | CRISPR array type (e.g., I-E) |
lca_origin | VARCHAR | Taxonomic origin confidence |
lca_full | VARCHAR | Full taxonomic lineage |
lca_class | VARCHAR | Class-level taxonomy |
lca_family | VARCHAR | Family-level taxonomy |
lca_genus | VARCHAR | Genus-level taxonomy |
sample_tbl#
Sample metadata, including ecosystem information extracted from BioSample.
Column | Type | Description |
---|---|---|
library | VARCHAR | Library identifier |
sra_run | VARCHAR | SRA run accession |
ecosystem_base | VARCHAR | Raw ecosystem classification |
ecosystem_sum | VARCHAR | Summarized ecosystem information |
sequencing_platform | VARCHAR | Sequencing technology used |
bioproject | VARCHAR | BioProject accession |
biosample | VARCHAR | BioSample accession |
n_spot | BIGINT | Number of spots sequenced |
n_base | BIGINT | Number of bases sequenced |
title | VARCHAR | Sample title/description |
spacer_clusters#
Table listing the connection between individual spacers and spacer clusters (100% identity of 100% spacer length).
Column | Type | Description |
---|---|---|
cluster_id | VARCHAR | Spacer cluster identifier (e.g. Sp_cl_000000188774) |
spacer_id | VARCHAR | Spacer identifier |
Hit Analysis Tables#
imgvr_hits#
Spacer hits against viruses from the IMG/VR v4 database, with 0 or 1 mismatches.
Column | Type | Description |
---|---|---|
cluster_id | VARCHAR | Spacer cluster ID |
target_id | VARCHAR | Target sequence ID |
hit_start | BIGINT | Hit start position |
hit_end | BIGINT | Hit end position |
hit_strand | VARCHAR | Strand orientation |
n_mismatches | BIGINT | Number of mismatches |
CIGAR | VARCHAR | CIGAR string |
MD | VARCHAR | MD tag |
spacer | VARCHAR | Spacer sequence |
protospacer | VARCHAR | Protospacer sequence |
upstream | VARCHAR | Upstream flanking sequence |
downstream | VARCHAR | Downstream flanking sequence |
imgvr_info#
Information about the IMG/VR v4 sequences, see the IMG/VR v4 paper for more explanation.
Column | Type | Description |
---|---|---|
uvig | VARCHAR | Unique identifier for the virus genome |
ecosystem | VARCHAR | Ecosystem of origin for the virus genome (if obtained from a metagenome) |
votu | VARCHAR | vOTU of the virus genome (see this link for more information about vOTU clustering) |
length | BIGINT | Length of the virus genome |
topology | VARCHAR | Topology of the virus genome |
quality | VARCHAR | Quality of the virus genome based on MIUViG criteria |
taxonomy_full | VARCHAR | Taxonomic assignment of the virus genome |
class | VARCHAR | Taxonomic assignment at the class rank of the virus genome |
host_prediction | VARCHAR | Predicted host taxonomy, if available |
host_method | VARCHAR | Method used to predict the host taxonomy |
host_from_taxo | VARCHAR | Identification of phages and/or archaea viruses based on virus taxonomic assignment |
source | VARCHAR | Type of dataset the virus genome was obtained from |
lifestyle | VARCHAR | Predicted lifestyle for the virus (if infecting bacteria/archaea) |
fibers | BOOLEAN | Are there multiple tail fibers identified in the virus genome (true=yes) |
acr | BOOLEAN | Was there one or several anti-crispr genes identified in the virus genome (true=yes) |
dgr | BOOLEAN | Was there one or several DGR loci identified in the virus genome (true=yes) |
imgpr_hits#
Spacer hits against viruses from the IMG/PR v1 database, with 0 or 1 mismatches.
Column | Type | Description |
---|---|---|
cluster_id | VARCHAR | Spacer cluster ID |
target_id | VARCHAR | Target sequence ID |
hit_start | BIGINT | Hit start position |
hit_end | BIGINT | Hit end position |
hit_strand | VARCHAR | Strand orientation |
n_mismatches | BIGINT | Number of mismatches |
CIGAR | VARCHAR | CIGAR string |
MD | VARCHAR | MD tag |
spacer | VARCHAR | Spacer sequence |
protospacer | VARCHAR | Protospacer sequence |
upstream | VARCHAR | Upstream flanking sequence |
downstream | VARCHAR | Downstream flanking sequence |
imgpr_info#
Information about the IMG/PR v1 sequences, see the IMG/PR v1 paper for more explanation.
Column | Type | Description |
---|---|---|
full_plasmid_id | VARCHAR | Unique identifier for the plasmid |
plasmid_id | VARCHAR | Simplified identifier for the plasmid |
ptu | VARCHAR | Plasmid cluster |
taxon_oid | BIGINT | Unique identifier of the original IMG dataset from which the plasmid sequence was obtained |
scaffold_oid | VARCHAR | Unique identifier of the original scaffold of the plasmid on IMG |
source_type | VARCHAR | Type of dataset the plasmid was obtained from |
ecosystem | VARCHAR | Ecosystem of origin for the plasmid (if obtained from a metagenome) |
ecosystem_summary | VARCHAR | Simplified ecosystem type (see "ecosystem" above) |
length | BIGINT | Length of the plasmid |
gene_count | BIGINT | Number of genes predicted on the plasmid |
genomad_score | DOUBLE | GeNomad score for the plasmid prediction |
putatively_complete | BOOLEAN | Is the plasmid considered as (near-)complete (true=yes) |
topology | VARCHAR | Topology of the plasmid |
has_mob_genes | BOOLEAN | Does the plasmid encode mobilization gene(s) (true=yes) |
has_tfss | BOOLEAN | Does the plasmid encode type IV secretion system gene(s) (true=yes) |
any_conjugation_gene | BOOLEAN | Does the plasmid encode conjugation gene(s) (true=yes) |
has_mpf | BOOLEAN | Does the plasmid encode mating pair formation gene(s) (true=yes) |
has_arg_genes | BOOLEAN | Does the plasmid encode antibiotic resistance gene(s) (true=yes) |
putative_phage_plasmid | BOOLEAN | Was the plasmid identified as a possible phage-plasmid (true=yes) |
host_prediction_method | VARCHAR | Method used to predict the host taxonomy |
host_taxonomy | VARCHAR | Predicted host taxonomy, if available |
closest_reference | VARCHAR | Identifier of the closest reference sequence for the plasmid (if available) |