SpacerDB Sequence Access Examples

This notebook demonstrates how to access and analyze specific spacer sequences based on various attributes like coverage, array type, taxonomy, etc.

0.1 What we will demonstrat here:

Get all unique spacers for a given taxon
Get all unique spacers for a given ecosystem
Connect taxon to ecosystem for a given CRISPR type

Before doing anything, we install the required python dependencies (polars, duckdb, and itables), if they are not already installed (if you run the notebooks locally, this can be done via a virtual environment manager like conda or pixi, in which case you don’t need to run this cell).

Code

%%capture
%pip install polars duckdb itables pyarrow;

Code

import polars as pl
import duckdb
from pathlib import Path
from datetime import datetime
import itables
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)

# Configure Polars display
pl.Config(tbl_cols=-1, tbl_rows=250)

init_notebook_modetrusted

<polars.config.Config at 0x7a9923bd56a0>

1 Database Connection

Here we will access the spacerDB directly from the s3 bucket. Accessing via s3 is fine for small queries like the ones we have in this notebook, however for any extensive work with the database, including any large queries, we strongly recommend downloading the database and loading it from a local path.
Depending on your network connection, this cell may take ~ 30s to 1 min to load.

Code

# Loading from S3 - first we create a connection to a local db in memory
con = duckdb.connect(":memory:")
# Next we enter the necessary key and secret to be able to read the s3 bucket
con.sql("""
CREATE SECRET spacereader (
  TYPE S3,
  PROVIDER config,
  KEY_ID 'BBPSY391G6MGG864QL7K',
  SECRET 'mAg0Bg2HzSrdjpsDRP6YPFK2DTyfVQCD_HzxJR0P',
  ENDPOINT 'objects-1.jgi.lbl.gov',
  URL_STYLE 'path');
""")
# And then we connect to the database
con.sql("""
ATTACH 's3://spacers/global_crispr_db_full_2025-05-02.duckdb' AS spacer_db (READ_ONLY);
use spacer_db;
""")
# To confirm that everything worked as expected, we check the list of tables in the database
con.sql("""
show tables;
""")

┌────────────────────┐
│        name        │
│      varchar       │
├────────────────────┤
│ imgpr_hits         │
│ imgpr_info         │
│ imgvr_hits         │
│ imgvr_info         │
│ repeat_tbl         │
│ sample_tbl         │
│ spacer_clusters    │
│ spacer_hq_clusters │
│ spacer_hq_tbl      │
│ spacer_tbl         │
├────────────────────┤
│      10 rows       │
└────────────────────┘

If working locally, after downloading the database, the connection would simply be:

Code

# con = duckdb.connect("/path/to/local/directory/global_crispr_db_full_2025-05-02.duckdb")

2 Extracting spacers from specific taxa

Here we will extract high-quality spacers associated with repeat clusters assigned to a given taxon, in this case Gammaproteobacteria. To keep the query from running too long, we limit to the first 10 results.

Code

gspacers_query = """
SELECT s.spacer_id, s.spacer_sequence, a.lca_family,
    FROM spacer_hq_tbl s
    JOIN repeat_tbl a ON s.crispr_array = a.repeat_cluster
    WHERE a.lca_class = 'd__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria'
    LIMIT 10
""" 
gspacers = (con.execute(gspacers_query).pl())
print("First 10 spacers from Gammaproteobacteria repeats:")
gspacers

First 10 spacers from Gammaproteobacteria repeats:

spacer_id	spacer_sequence	lca_family
Loading ITables v2.3.0 from the `init_notebook_mode` cell... (need help?)

Spacers can also be filtered e.g. by coverage and type, here we ask for high-quality spacers linked to repeats assigned to Gammaproteobacteria, of type I-F, and non-singleton (i.e. spacer coverage > 1).

Code

high_cov_query = """
SELECT s.spacer_id, s.spacer_sequence, a.lca_family, a.type, s.library, s.spacer_coverage
  FROM spacer_hq_tbl s
  JOIN repeat_tbl a ON s.crispr_array = a.repeat_cluster
  WHERE a.lca_class = 'd__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria' AND a.type='I-F' AND s.spacer_coverage>1
  LIMIT 10;
"""
high_cov_spacers = (con.execute(high_cov_query).pl())
print("First 10 non-singleton spacers from Gammaproteobacteria I-F systems:")
high_cov_spacers

First 10 non-singleton spacers from Gammaproteobacteria I-F systems:

spacer_id	spacer_sequence	lca_family	type	library	spacer_coverage
Loading ITables v2.3.0 from the `init_notebook_mode` cell... (need help?)

Finally, we can extract these spacers as a fasta file, after making these non-redundant.

Code

nr_query = """
SELECT first(spacer_id) as spacer_id, spacer_sequence, 
  FROM high_cov_spacers
  GROUP BY spacer_sequence
"""
nr_spacers = (con.execute(nr_query).pl())
print("Non-redundant version of the 10 spacers")
nr_spacers

Non-redundant version of the 10 spacers

spacer_id	spacer_sequence
Loading ITables v2.3.0 from the `init_notebook_mode` cell... (need help?)

Code

## This is just one of (many) possibilities to extract a fasta text from this type of polars dataframe
def process_row(row: pl.Series) -> str:
    return ">" + str(row[0]) + "\n" + str(row[1])


fasta = "\n".join(nr_spacers.map_rows(lambda x: process_row(x))["map"].to_list())

print(fasta)

>DRR012575.10017584/2|Ac_760540|28
TTTTCGGCCAGGTCCACTTCGCCCGCGAGCTT
>DRR012575.10166867/1|Ac_755949|102
ACGTATTTTTCGAAAACGTGGAAGCGCCCGAC
>DRR012575.10166335/1|Ac_755949|98
GTGAGACGGAAAATCTGTGCAGTAGACATAAT
>DRR012575.10157593/2|Ac_755949|38
GACAGCCGCGTCGAGACATTCCAGCTGAAAAT
>DRR012575.10166335/1|Ac_755949|38
TTAGATTATCGCCGGCGTCGAATGGCAAGGTC
>DRR012575.10024248/1|Ac_755949|110
GCAACGTGCCGGCAGTGGCGCCGTAGCGGTGC
>DRR012575.10166867/1|Ac_755949|42
CCGAAGCGAAGGCGCGCCTGTCGATCGCGCAG
>DRR012575.10024248/1|Ac_755949|50
GAGGATCGGCGACTTGGGGTTATCCGCCATCA
>DRR012575.10015455/1|Ac_755949|34
ATGGCGGCGTTTTCGGCGCCGGCACACTGAAC
>DRR012575.1005512/2|Ac_755949|34
AGGCCGTGGGCATGAACATCACTGAGATGGAT

3 Identify and extract spacers by ecosystem

First we can check the different ecosystem categories available in the SpacerDb database.

Code

ecosystem_check_query = """
SELECT DISTINCT ecosystem_base, COUNT(*) as count
FROM sample_tbl
GROUP BY ecosystem_base
ORDER BY count DESC;
"""
ecosystem_values = (con.execute(ecosystem_check_query).pl())
print("Available ecosystem types:")
ecosystem_values

Available ecosystem types:

ecosystem_base	count
Loading ITables v2.3.0 from the `init_notebook_mode` cell... (need help?)

Next, we will look for high-quality spacers obtained from engineered samples, and non-singleton. To keep the query from running too long, we limit to the first 100 results.

Code

eco_query = """
SELECT s.spacer_id, s.spacer_sequence, l.title, l.ecosystem_base, s.spacer_coverage, s.crispr_array
  FROM spacer_hq_tbl s
  JOIN sample_tbl l ON s.library=l.library
  WHERE l.ecosystem_sum = 'Engineered' AND s.spacer_coverage>1
  LIMIT 100;
"""
eng_spacers = (con.execute(eco_query).pl())
print("First 100 non-singleton spacers from Engineered ecosystems:")
eng_spacers

First 100 non-singleton spacers from Engineered ecosystems:

spacer_id	spacer_sequence	title	ecosystem_base	spacer_coverage	crispr_array
Loading ITables v2.3.0 from the `init_notebook_mode` cell... (need help?)

4 Identify taxa-ecosystem combinations

Finally, we can combine the taxonomy, CRISPR Type, and ecosystem information, and for instance count the number of spacers per ecosystem for a given taxon.

Code

combined_query = """
WITH selected_spacers AS (
    SELECT s.spacer_id, s.spacer_sequence, r.lca_family, r.type, s.library
    FROM spacer_hq_tbl s
    JOIN repeat_tbl r ON s.crispr_array = r.repeat_cluster
    WHERE r.lca_class = 'd__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria'
    LIMIT 5000)
SELECT lca_family, ecosystem_sum, type, count(*) as counts
FROM selected_spacers s, sample_tbl t
WHERE s.library=t.library
GROUP BY lca_family, ecosystem_sum, type;
"""
count_combinations = (con.execute(combined_query).pl())
print("Combination of taxa, CRISPR type, and ecosystems, based on the first 5000 spacers linked to a Gammaproteobacteria repeat:")
count_combinations

Combination of taxa, CRISPR type, and ecosystems, based on the first 5000 spacers linked to a Gammaproteobacteria repeat:

lca_family	ecosystem_sum	type	counts
Loading ITables v2.3.0 from the `init_notebook_mode` cell... (need help?)

Code

# Close connection
con.close()