SpacerDB Sequence Access Examples

This notebook demonstrates how to access and analyze specific spacer sequences based on various attributes like coverage, array type, taxonomy, etc.

0.1 What we will demonstrat here:

  • Get all unique spacers for a given taxon
  • Get all unique spacers for a given ecosystem
  • Connect taxon to ecosystem for a given CRISPR type

Before doing anything, we install the required python dependencies (polars, duckdb, and itables), if they are not already installed (if you run the notebooks locally, this can be done via a virtual environment manager like conda or pixi, in which case you don’t need to run this cell).

Code
%%capture
%pip install polars duckdb itables pyarrow;
Code
import polars as pl
import duckdb
from pathlib import Path
from datetime import datetime
import itables
from itables import init_notebook_mode
init_notebook_mode(all_interactive=True)

# Configure Polars display
pl.Config(tbl_cols=-1, tbl_rows=250)
<polars.config.Config at 0x7a9923bd56a0>

1 Database Connection

Here we will access the spacerDB directly from the s3 bucket. Accessing via s3 is fine for small queries like the ones we have in this notebook, however for any extensive work with the database, including any large queries, we strongly recommend downloading the database and loading it from a local path.
Depending on your network connection, this cell may take ~ 30s to 1 min to load.

Code
# Loading from S3 - first we create a connection to a local db in memory
con = duckdb.connect(":memory:")
# Next we enter the necessary key and secret to be able to read the s3 bucket
con.sql("""
CREATE SECRET spacereader (
  TYPE S3,
  PROVIDER config,
  KEY_ID 'BBPSY391G6MGG864QL7K',
  SECRET 'mAg0Bg2HzSrdjpsDRP6YPFK2DTyfVQCD_HzxJR0P',
  ENDPOINT 'objects-1.jgi.lbl.gov',
  URL_STYLE 'path');
""")
# And then we connect to the database
con.sql("""
ATTACH 's3://spacers/global_crispr_db_full_2025-05-02.duckdb' AS spacer_db (READ_ONLY);
use spacer_db;
""")
# To confirm that everything worked as expected, we check the list of tables in the database
con.sql("""
show tables;
""")
┌────────────────────┐
│        name        │
│      varchar       │
├────────────────────┤
│ imgpr_hits         │
│ imgpr_info         │
│ imgvr_hits         │
│ imgvr_info         │
│ repeat_tbl         │
│ sample_tbl         │
│ spacer_clusters    │
│ spacer_hq_clusters │
│ spacer_hq_tbl      │
│ spacer_tbl         │
├────────────────────┤
│      10 rows       │
└────────────────────┘

If working locally, after downloading the database, the connection would simply be:

Code
# con = duckdb.connect("/path/to/local/directory/global_crispr_db_full_2025-05-02.duckdb")

2 Extracting spacers from specific taxa

Here we will extract high-quality spacers associated with repeat clusters assigned to a given taxon, in this case Gammaproteobacteria. To keep the query from running too long, we limit to the first 10 results.

Code
gspacers_query = """
SELECT s.spacer_id, s.spacer_sequence, a.lca_family,
    FROM spacer_hq_tbl s
    JOIN repeat_tbl a ON s.crispr_array = a.repeat_cluster
    WHERE a.lca_class = 'd__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria'
    LIMIT 10
""" 
gspacers = (con.execute(gspacers_query).pl())
print("First 10 spacers from Gammaproteobacteria repeats:")
gspacers
First 10 spacers from Gammaproteobacteria repeats:
spacer_id spacer_sequence lca_family
Loading ITables v2.3.0 from the init_notebook_mode cell... (need help?)

Spacers can also be filtered e.g. by coverage and type, here we ask for high-quality spacers linked to repeats assigned to Gammaproteobacteria, of type I-F, and non-singleton (i.e. spacer coverage > 1).

Code
high_cov_query = """
SELECT s.spacer_id, s.spacer_sequence, a.lca_family, a.type, s.library, s.spacer_coverage
  FROM spacer_hq_tbl s
  JOIN repeat_tbl a ON s.crispr_array = a.repeat_cluster
  WHERE a.lca_class = 'd__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria' AND a.type='I-F' AND s.spacer_coverage>1
  LIMIT 10;
"""
high_cov_spacers = (con.execute(high_cov_query).pl())
print("First 10 non-singleton spacers from Gammaproteobacteria I-F systems:")
high_cov_spacers
First 10 non-singleton spacers from Gammaproteobacteria I-F systems:
spacer_id spacer_sequence lca_family type library spacer_coverage
Loading ITables v2.3.0 from the init_notebook_mode cell... (need help?)

Finally, we can extract these spacers as a fasta file, after making these non-redundant.

Code
nr_query = """
SELECT first(spacer_id) as spacer_id, spacer_sequence, 
  FROM high_cov_spacers
  GROUP BY spacer_sequence
"""
nr_spacers = (con.execute(nr_query).pl())
print("Non-redundant version of the 10 spacers")
nr_spacers
Non-redundant version of the 10 spacers
spacer_id spacer_sequence
Loading ITables v2.3.0 from the init_notebook_mode cell... (need help?)
Code
## This is just one of (many) possibilities to extract a fasta text from this type of polars dataframe
def process_row(row: pl.Series) -> str:
    return ">" + str(row[0]) + "\n" + str(row[1])


fasta = "\n".join(nr_spacers.map_rows(lambda x: process_row(x))["map"].to_list())

print(fasta)
>DRR012575.10017584/2|Ac_760540|28
TTTTCGGCCAGGTCCACTTCGCCCGCGAGCTT
>DRR012575.10166867/1|Ac_755949|102
ACGTATTTTTCGAAAACGTGGAAGCGCCCGAC
>DRR012575.10166335/1|Ac_755949|98
GTGAGACGGAAAATCTGTGCAGTAGACATAAT
>DRR012575.10157593/2|Ac_755949|38
GACAGCCGCGTCGAGACATTCCAGCTGAAAAT
>DRR012575.10166335/1|Ac_755949|38
TTAGATTATCGCCGGCGTCGAATGGCAAGGTC
>DRR012575.10024248/1|Ac_755949|110
GCAACGTGCCGGCAGTGGCGCCGTAGCGGTGC
>DRR012575.10166867/1|Ac_755949|42
CCGAAGCGAAGGCGCGCCTGTCGATCGCGCAG
>DRR012575.10024248/1|Ac_755949|50
GAGGATCGGCGACTTGGGGTTATCCGCCATCA
>DRR012575.10015455/1|Ac_755949|34
ATGGCGGCGTTTTCGGCGCCGGCACACTGAAC
>DRR012575.1005512/2|Ac_755949|34
AGGCCGTGGGCATGAACATCACTGAGATGGAT

3 Identify and extract spacers by ecosystem

First we can check the different ecosystem categories available in the SpacerDb database.

Code
ecosystem_check_query = """
SELECT DISTINCT ecosystem_base, COUNT(*) as count
FROM sample_tbl
GROUP BY ecosystem_base
ORDER BY count DESC;
"""
ecosystem_values = (con.execute(ecosystem_check_query).pl())
print("Available ecosystem types:")
ecosystem_values
Available ecosystem types:
ecosystem_base count
Loading ITables v2.3.0 from the init_notebook_mode cell... (need help?)

Next, we will look for high-quality spacers obtained from engineered samples, and non-singleton. To keep the query from running too long, we limit to the first 100 results.

Code
eco_query = """
SELECT s.spacer_id, s.spacer_sequence, l.title, l.ecosystem_base, s.spacer_coverage, s.crispr_array
  FROM spacer_hq_tbl s
  JOIN sample_tbl l ON s.library=l.library
  WHERE l.ecosystem_sum = 'Engineered' AND s.spacer_coverage>1
  LIMIT 100;
"""
eng_spacers = (con.execute(eco_query).pl())
print("First 100 non-singleton spacers from Engineered ecosystems:")
eng_spacers
First 100 non-singleton spacers from Engineered ecosystems:
spacer_id spacer_sequence title ecosystem_base spacer_coverage crispr_array
Loading ITables v2.3.0 from the init_notebook_mode cell... (need help?)

4 Identify taxa-ecosystem combinations

Finally, we can combine the taxonomy, CRISPR Type, and ecosystem information, and for instance count the number of spacers per ecosystem for a given taxon.

Code
combined_query = """
WITH selected_spacers AS (
    SELECT s.spacer_id, s.spacer_sequence, r.lca_family, r.type, s.library
    FROM spacer_hq_tbl s
    JOIN repeat_tbl r ON s.crispr_array = r.repeat_cluster
    WHERE r.lca_class = 'd__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria'
    LIMIT 5000)
SELECT lca_family, ecosystem_sum, type, count(*) as counts
FROM selected_spacers s, sample_tbl t
WHERE s.library=t.library
GROUP BY lca_family, ecosystem_sum, type;
"""
count_combinations = (con.execute(combined_query).pl())
print("Combination of taxa, CRISPR type, and ecosystems, based on the first 5000 spacers linked to a Gammaproteobacteria repeat:")
count_combinations
Combination of taxa, CRISPR type, and ecosystems, based on the first 5000 spacers linked to a Gammaproteobacteria repeat:
lca_family ecosystem_sum type counts
Loading ITables v2.3.0 from the init_notebook_mode cell... (need help?)
Code
# Close connection
con.close()