Code
%%capture
%pip install polars duckdb itables pyarrow;
This notebook demonstrates how to access and analyze specific spacer sequences based on various attributes like coverage, array type, taxonomy, etc.
Before doing anything, we install the required python dependencies (polars, duckdb, and itables), if they are not already installed (if you run the notebooks locally, this can be done via a virtual environment manager like conda or pixi, in which case you don’t need to run this cell).
Here we will access the spacerDB directly from the s3 bucket. Accessing via s3 is fine for small queries like the ones we have in this notebook, however for any extensive work with the database, including any large queries, we strongly recommend downloading the database and loading it from a local path.
Depending on your network connection, this cell may take ~ 30s to 1 min to load.
# Loading from S3 - first we create a connection to a local db in memory
con = duckdb.connect(":memory:")
# Next we enter the necessary key and secret to be able to read the s3 bucket
con.sql("""
CREATE SECRET spacereader (
TYPE S3,
PROVIDER config,
KEY_ID 'BBPSY391G6MGG864QL7K',
SECRET 'mAg0Bg2HzSrdjpsDRP6YPFK2DTyfVQCD_HzxJR0P',
ENDPOINT 'objects-1.jgi.lbl.gov',
URL_STYLE 'path');
""")
# And then we connect to the database
con.sql("""
ATTACH 's3://spacers/global_crispr_db_full_2025-05-02.duckdb' AS spacer_db (READ_ONLY);
use spacer_db;
""")
# To confirm that everything worked as expected, we check the list of tables in the database
con.sql("""
show tables;
""")
┌────────────────────┐
│ name │
│ varchar │
├────────────────────┤
│ imgpr_hits │
│ imgpr_info │
│ imgvr_hits │
│ imgvr_info │
│ repeat_tbl │
│ sample_tbl │
│ spacer_clusters │
│ spacer_hq_clusters │
│ spacer_hq_tbl │
│ spacer_tbl │
├────────────────────┤
│ 10 rows │
└────────────────────┘
If working locally, after downloading the database, the connection would simply be:
Here we will extract high-quality spacers associated with repeat clusters assigned to a given taxon, in this case Gammaproteobacteria. To keep the query from running too long, we limit to the first 10 results.
gspacers_query = """
SELECT s.spacer_id, s.spacer_sequence, a.lca_family,
FROM spacer_hq_tbl s
JOIN repeat_tbl a ON s.crispr_array = a.repeat_cluster
WHERE a.lca_class = 'd__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria'
LIMIT 10
"""
gspacers = (con.execute(gspacers_query).pl())
print("First 10 spacers from Gammaproteobacteria repeats:")
gspacers
First 10 spacers from Gammaproteobacteria repeats:
spacer_id | spacer_sequence | lca_family |
---|---|---|
Loading ITables v2.3.0 from the init_notebook_mode cell...
(need help?) |
Spacers can also be filtered e.g. by coverage and type, here we ask for high-quality spacers linked to repeats assigned to Gammaproteobacteria, of type I-F, and non-singleton (i.e. spacer coverage > 1).
high_cov_query = """
SELECT s.spacer_id, s.spacer_sequence, a.lca_family, a.type, s.library, s.spacer_coverage
FROM spacer_hq_tbl s
JOIN repeat_tbl a ON s.crispr_array = a.repeat_cluster
WHERE a.lca_class = 'd__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria' AND a.type='I-F' AND s.spacer_coverage>1
LIMIT 10;
"""
high_cov_spacers = (con.execute(high_cov_query).pl())
print("First 10 non-singleton spacers from Gammaproteobacteria I-F systems:")
high_cov_spacers
First 10 non-singleton spacers from Gammaproteobacteria I-F systems:
spacer_id | spacer_sequence | lca_family | type | library | spacer_coverage |
---|---|---|---|---|---|
Loading ITables v2.3.0 from the init_notebook_mode cell...
(need help?) |
Finally, we can extract these spacers as a fasta file, after making these non-redundant.
Non-redundant version of the 10 spacers
spacer_id | spacer_sequence |
---|---|
Loading ITables v2.3.0 from the init_notebook_mode cell...
(need help?) |
>DRR012575.10017584/2|Ac_760540|28
TTTTCGGCCAGGTCCACTTCGCCCGCGAGCTT
>DRR012575.10166867/1|Ac_755949|102
ACGTATTTTTCGAAAACGTGGAAGCGCCCGAC
>DRR012575.10166335/1|Ac_755949|98
GTGAGACGGAAAATCTGTGCAGTAGACATAAT
>DRR012575.10157593/2|Ac_755949|38
GACAGCCGCGTCGAGACATTCCAGCTGAAAAT
>DRR012575.10166335/1|Ac_755949|38
TTAGATTATCGCCGGCGTCGAATGGCAAGGTC
>DRR012575.10024248/1|Ac_755949|110
GCAACGTGCCGGCAGTGGCGCCGTAGCGGTGC
>DRR012575.10166867/1|Ac_755949|42
CCGAAGCGAAGGCGCGCCTGTCGATCGCGCAG
>DRR012575.10024248/1|Ac_755949|50
GAGGATCGGCGACTTGGGGTTATCCGCCATCA
>DRR012575.10015455/1|Ac_755949|34
ATGGCGGCGTTTTCGGCGCCGGCACACTGAAC
>DRR012575.1005512/2|Ac_755949|34
AGGCCGTGGGCATGAACATCACTGAGATGGAT
First we can check the different ecosystem categories available in the SpacerDb database.
Available ecosystem types:
ecosystem_base | count |
---|---|
Loading ITables v2.3.0 from the init_notebook_mode cell...
(need help?) |
Next, we will look for high-quality spacers obtained from engineered samples, and non-singleton. To keep the query from running too long, we limit to the first 100 results.
eco_query = """
SELECT s.spacer_id, s.spacer_sequence, l.title, l.ecosystem_base, s.spacer_coverage, s.crispr_array
FROM spacer_hq_tbl s
JOIN sample_tbl l ON s.library=l.library
WHERE l.ecosystem_sum = 'Engineered' AND s.spacer_coverage>1
LIMIT 100;
"""
eng_spacers = (con.execute(eco_query).pl())
print("First 100 non-singleton spacers from Engineered ecosystems:")
eng_spacers
First 100 non-singleton spacers from Engineered ecosystems:
spacer_id | spacer_sequence | title | ecosystem_base | spacer_coverage | crispr_array |
---|---|---|---|---|---|
Loading ITables v2.3.0 from the init_notebook_mode cell...
(need help?) |
Finally, we can combine the taxonomy, CRISPR Type, and ecosystem information, and for instance count the number of spacers per ecosystem for a given taxon.
combined_query = """
WITH selected_spacers AS (
SELECT s.spacer_id, s.spacer_sequence, r.lca_family, r.type, s.library
FROM spacer_hq_tbl s
JOIN repeat_tbl r ON s.crispr_array = r.repeat_cluster
WHERE r.lca_class = 'd__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria'
LIMIT 5000)
SELECT lca_family, ecosystem_sum, type, count(*) as counts
FROM selected_spacers s, sample_tbl t
WHERE s.library=t.library
GROUP BY lca_family, ecosystem_sum, type;
"""
count_combinations = (con.execute(combined_query).pl())
print("Combination of taxa, CRISPR type, and ecosystems, based on the first 5000 spacers linked to a Gammaproteobacteria repeat:")
count_combinations
Combination of taxa, CRISPR type, and ecosystems, based on the first 5000 spacers linked to a Gammaproteobacteria repeat:
lca_family | ecosystem_sum | type | counts |
---|---|---|---|
Loading ITables v2.3.0 from the init_notebook_mode cell...
(need help?) |