Code
%%capture
%pip install polars duckdb itables pyarrow;
This notebook provides an overview of the SpacerDB DuckDB database schema, tables, and metadata. This includes some examples of SQL queries performed on the database. For more information about duckdb database and SQL queries, please see https://duckdb.org/docs/. For more information on the python API for duckdb used in this notebook, please see https://duckdb.org/docs/stable/clients/python/overview.html.
Before doing anything, we install the required python dependencies (polars, duckdb, and itables), if they are not already installed (if you run the notebooks locally, this can be done via a virtual environment manager like conda or pixi, in which case you don’t need to run this cell).
The SpacerDB database is built around an extensive collection of CRISPR spacers, each connected to a CRISPR repeat (associated with a given CRISPR type and taxonomic assignment), a given sample it was extracted from (associated with a sample size and ecosystem type), as well as a CRISPR spacer cluster (i.e. a unique id for identical spacers detected across multiple samples and/or repeats). Spacer clusters are in turn connected to their hits against viral and plasmid databases (IMG/VR and IMG/PR), each target database coming with its own set of metadata.
Here we will access the spacerDB directly from the s3 bucket. This is a database restricted to only spacers matching at least one plasmid in IMG/PR. Accessing via s3 is fine for small queries like the ones we have in this notebook, however for any extensive work with the database, including any large queries, we strongly recommend downloading the database and loading it from a local path.
Depending on your network connection, this cell may take ~ 30s to 1 min to load.
# Loading from S3 - first we create a connection to a local db in memory
con = duckdb.connect(":memory:")
# Next we enter the necessary key and secret to be able to read the s3 bucket
con.sql("""
CREATE SECRET spacereader (
TYPE S3,
PROVIDER config,
KEY_ID 'BBPSY391G6MGG864QL7K',
SECRET 'mAg0Bg2HzSrdjpsDRP6YPFK2DTyfVQCD_HzxJR0P',
ENDPOINT 'objects-1.jgi.lbl.gov',
URL_STYLE 'path');
""")
# And then we connect to the database
con.sql("""
ATTACH 's3://spacers/global_crispr_db_full_2025-05-02.duckdb' AS spacer_db (READ_ONLY);
use spacer_db;
""")
# To confirm that everything worked as expected, we check the list of tables in the database
con.sql("""
show tables;
""")
┌────────────────────┐
│ name │
│ varchar │
├────────────────────┤
│ imgpr_hits │
│ imgpr_info │
│ imgvr_hits │
│ imgvr_info │
│ repeat_tbl │
│ sample_tbl │
│ spacer_clusters │
│ spacer_hq_clusters │
│ spacer_hq_tbl │
│ spacer_tbl │
├────────────────────┤
│ 10 rows │
└────────────────────┘
If working locally, after downloading the database, the connection would simply be:
All tables
Tables in database:
table_name | table_type | table_schema |
---|---|---|
Loading ITables v2.3.0 from the init_notebook_mode cell...
(need help?) |
Here we get a lit of all columns across all tables with number of rows, and we compile it in one giant table.
def get_table_stats(table_name: str) -> pl.DataFrame:
# Get column information
columns_query = f"""
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_name = '{table_name}'
"""
columns_df = (con.execute(columns_query).pl())
# Get row count
count_query = f"SELECT COUNT(*) as row_count FROM {table_name}"
row_count = con.execute(count_query).fetchone()[0]
return columns_df.with_columns([
pl.lit(table_name).alias('table_name'),
pl.lit(row_count).alias('row_count')
])
# Get statistics for each table
all_stats = []
for table_name in tables_df.filter(pl.col('table_type') == 'BASE TABLE')['table_name']:
try:
stats = get_table_stats(table_name)
all_stats.append(stats)
except Exception as e:
print(f"Error analyzing table {table_name}: {e}")
combined_stats = pl.concat(all_stats)
print("Detailed table statistics:")
combined_stats
Detailed table statistics:
column_name | data_type | is_nullable | table_name | row_count |
---|---|---|---|---|
Loading ITables v2.3.0 from the init_notebook_mode cell...
(need help?) |
We can also preview some data from each table
# Function to preview table data
def preview_table(table_name: str, limit: int = 5) -> pl.DataFrame:
query = f"SELECT * FROM {table_name} LIMIT {limit}"
return (con.execute(query).pl())
print("Data previews:")
for table_name in tables_df.filter(pl.col('table_type') == 'BASE TABLE')['table_name']:
try:
print(f"\nPreview of table '{table_name}':")
preview = preview_table(table_name)
print(preview)
except Exception as e:
print(f"Error previewing table {table_name}: {e}")
Data previews:
Preview of table 'imgpr_hits':
shape: (5, 12)
┌────────┬────────┬────────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┬───────┐
│ cluste ┆ target ┆ hit_st ┆ hit_e ┆ hit_s ┆ n_mis ┆ CIGAR ┆ MD ┆ space ┆ proto ┆ upstr ┆ downs │
│ r_id ┆ _id ┆ art ┆ nd ┆ trand ┆ match ┆ --- ┆ --- ┆ r ┆ space ┆ eam ┆ tream │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ es ┆ str ┆ str ┆ --- ┆ r ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ str ┆ --- ┆ ┆ ┆ str ┆ --- ┆ str ┆ str │
│ ┆ ┆ ┆ ┆ ┆ i64 ┆ ┆ ┆ ┆ str ┆ ┆ │
╞════════╪════════╪════════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╪═══════╡
│ Sp_cl_ ┆ IMGPR_ ┆ 56909 ┆ 56937 ┆ plus ┆ 1 ┆ 29M ┆ 11C17 ┆ CCTAC ┆ ..... ┆ ACCCG ┆ GGGAC │
│ 000640 ┆ plasmi ┆ ┆ ┆ ┆ ┆ ┆ ┆ GCGAT ┆ ..... ┆ GCCGA ┆ GATGT │
│ 316139 ┆ d_2006 ┆ ┆ ┆ ┆ ┆ ┆ ┆ TTACG ┆ .C... ┆ ┆ │
│ ┆ 543007 ┆ ┆ ┆ ┆ ┆ ┆ ┆ GCGAA ┆ ..... ┆ ┆ │
│ ┆ _00000 ┆ ┆ ┆ ┆ ┆ ┆ ┆ GCACT ┆ ..... ┆ ┆ │
│ ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ CAGC ┆ .... ┆ ┆ │
│ Sp_cl_ ┆ IMGPR_ ┆ 6319 ┆ 6355 ┆ minus ┆ 0 ┆ 37M ┆ 37 ┆ GGCCT ┆ ..... ┆ TGGCC ┆ GAAGC │
│ 000178 ┆ plasmi ┆ ┆ ┆ ┆ ┆ ┆ ┆ GCGGG ┆ ..... ┆ GGAAG ┆ CCACC │
│ 198896 ┆ d_2010 ┆ ┆ ┆ ┆ ┆ ┆ ┆ CCCGG ┆ ..... ┆ ┆ │
│ ┆ 170001 ┆ ┆ ┆ ┆ ┆ ┆ ┆ AAGGA ┆ ..... ┆ ┆ │
│ ┆ _00000 ┆ ┆ ┆ ┆ ┆ ┆ ┆ CACGA ┆ ..... ┆ ┆ │
│ ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ ATCTC ┆ ..... ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ … ┆ … ┆ ┆ │
│ Sp_cl_ ┆ IMGPR_ ┆ 729 ┆ 760 ┆ plus ┆ 1 ┆ 32M ┆ 2A29 ┆ TCGCG ┆ ..A.. ┆ AGCCA ┆ CGAGA │
│ 000240 ┆ plasmi ┆ ┆ ┆ ┆ ┆ ┆ ┆ GATAC ┆ ..... ┆ GCTGG ┆ TATTG │
│ 680873 ┆ d_2010 ┆ ┆ ┆ ┆ ┆ ┆ ┆ GTTCC ┆ ..... ┆ ┆ │
│ ┆ 549000 ┆ ┆ ┆ ┆ ┆ ┆ ┆ CCGGA ┆ ..... ┆ ┆ │
│ ┆ _00000 ┆ ┆ ┆ ┆ ┆ ┆ ┆ CTCAT ┆ ..... ┆ ┆ │
│ ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ TGATA ┆ ..... ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ … ┆ … ┆ ┆ │
│ Sp_cl_ ┆ IMGPR_ ┆ 3733 ┆ 3763 ┆ plus ┆ 0 ┆ 31M ┆ 31 ┆ CTCGG ┆ ..... ┆ TCCTC ┆ GGCGG │
│ 000128 ┆ plasmi ┆ ┆ ┆ ┆ ┆ ┆ ┆ TCGCT ┆ ..... ┆ GCTCA ┆ GCGTT │
│ 033166 ┆ d_2010 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ACGCT ┆ ..... ┆ ┆ │
│ ┆ 549000 ┆ ┆ ┆ ┆ ┆ ┆ ┆ CCTGC ┆ ..... ┆ ┆ │
│ ┆ _00000 ┆ ┆ ┆ ┆ ┆ ┆ ┆ CGTGA ┆ ..... ┆ ┆ │
│ ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ GACTG ┆ ..... ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ … ┆ … ┆ ┆ │
│ Sp_cl_ ┆ IMGPR_ ┆ 3523 ┆ 3554 ┆ plus ┆ 0 ┆ 32M ┆ 32 ┆ TTGGA ┆ ..... ┆ CGGCG ┆ AGCGG │
│ 000582 ┆ plasmi ┆ ┆ ┆ ┆ ┆ ┆ ┆ CAAGC ┆ ..... ┆ CGGCG ┆ GTTTC │
│ 069336 ┆ d_2010 ┆ ┆ ┆ ┆ ┆ ┆ ┆ CGCAG ┆ ..... ┆ ┆ │
│ ┆ 549000 ┆ ┆ ┆ ┆ ┆ ┆ ┆ GCGCG ┆ ..... ┆ ┆ │
│ ┆ _00000 ┆ ┆ ┆ ┆ ┆ ┆ ┆ TCAGT ┆ ..... ┆ ┆ │
│ ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ GGCTT ┆ ..... ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ … ┆ … ┆ ┆ │
└────────┴────────┴────────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
Preview of table 'imgpr_info':
shape: (5, 22)
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ ful ┆ pla ┆ ptu ┆ tax ┆ sca ┆ sou ┆ eco ┆ eco ┆ len ┆ gen ┆ gen ┆ put ┆ top ┆ has ┆ has ┆ any ┆ has ┆ has ┆ put ┆ hos ┆ hos ┆ clo │
│ l_p ┆ smi ┆ --- ┆ on_ ┆ ffo ┆ rce ┆ sys ┆ sys ┆ gth ┆ e_c ┆ oma ┆ ati ┆ olo ┆ _mo ┆ _tf ┆ _co ┆ _mp ┆ _ar ┆ ati ┆ t_p ┆ t_t ┆ ses │
│ las ┆ d_i ┆ str ┆ oid ┆ ld_ ┆ _ty ┆ tem ┆ tem ┆ --- ┆ oun ┆ d_s ┆ vel ┆ gy ┆ b_g ┆ ss ┆ nju ┆ f ┆ g_g ┆ ve_ ┆ red ┆ axo ┆ t_r │
│ mid ┆ d ┆ ┆ --- ┆ oid ┆ pe ┆ --- ┆ _su ┆ i64 ┆ t ┆ cor ┆ y_c ┆ --- ┆ ene ┆ --- ┆ gat ┆ --- ┆ ene ┆ pha ┆ ict ┆ nom ┆ efe │
│ _id ┆ --- ┆ ┆ i64 ┆ --- ┆ --- ┆ str ┆ mma ┆ ┆ --- ┆ e ┆ omp ┆ str ┆ s ┆ boo ┆ ion ┆ boo ┆ s ┆ ge_ ┆ ion ┆ y ┆ ren │
│ --- ┆ str ┆ ┆ ┆ str ┆ str ┆ ┆ ry ┆ ┆ i64 ┆ --- ┆ let ┆ ┆ --- ┆ l ┆ _ge ┆ l ┆ --- ┆ pla ┆ _me ┆ --- ┆ ce │
│ str ┆ ┆ ┆ ┆ ┆ ┆ ┆ --- ┆ ┆ ┆ f64 ┆ e ┆ ┆ boo ┆ ┆ ne ┆ ┆ boo ┆ smi ┆ tho ┆ str ┆ --- │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ str ┆ ┆ ┆ ┆ --- ┆ ┆ l ┆ ┆ --- ┆ ┆ l ┆ d ┆ d ┆ ┆ str │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ boo ┆ ┆ ┆ ┆ boo ┆ ┆ ┆ --- ┆ --- ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ l ┆ ┆ ┆ ┆ l ┆ ┆ ┆ boo ┆ str ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ l ┆ ┆ ┆ │
╞═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╡
│ IMG ┆ IMG ┆ PTU ┆ 253 ┆ 253 ┆ Iso ┆ nul ┆ Unk ┆ 165 ┆ 14 ┆ 0.9 ┆ fal ┆ Lin ┆ tru ┆ tru ┆ tru ┆ fal ┆ fal ┆ fal ┆ Iso ┆ d__ ┆ NZ_ │
│ PR_ ┆ PR_ ┆ _00 ┆ 183 ┆ 184 ┆ lat ┆ l ┆ now ┆ 61 ┆ ┆ 967 ┆ se ┆ ear ┆ e ┆ e ┆ e ┆ se ┆ se ┆ se ┆ lat ┆ Bac ┆ CP0 │
│ pla ┆ pla ┆ 000 ┆ 900 ┆ 161 ┆ e ┆ ┆ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ e ┆ ter ┆ 766 │
│ smi ┆ smi ┆ 016 ┆ 6 ┆ 4 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ tax ┆ ia; ┆ 90. │
│ d_2 ┆ d_2 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ono ┆ p__ ┆ 1 │
│ 531 ┆ 531 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ my ┆ Pro ┆ │
│ 839 ┆ 839 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ teo ┆ │
│ 006 ┆ 006 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ bac ┆ │
│ _00 ┆ _00 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ter ┆ │
│ 000 ┆ 000 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ia; ┆ │
│ … ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ … ┆ │
│ IMG ┆ IMG ┆ PTU ┆ 269 ┆ 269 ┆ Iso ┆ nul ┆ Unk ┆ 109 ┆ 9 ┆ 0.9 ┆ fal ┆ Lin ┆ tru ┆ tru ┆ tru ┆ fal ┆ fal ┆ fal ┆ Iso ┆ d__ ┆ NZ_ │
│ PR_ ┆ PR_ ┆ _00 ┆ 853 ┆ 857 ┆ lat ┆ l ┆ now ┆ 05 ┆ ┆ 967 ┆ se ┆ ear ┆ e ┆ e ┆ e ┆ se ┆ se ┆ se ┆ lat ┆ Bac ┆ LT9 │
│ pla ┆ pla ┆ 000 ┆ 657 ┆ 091 ┆ e ┆ ┆ n ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ e ┆ ter ┆ 852 │
│ smi ┆ smi ┆ 114 ┆ 4 ┆ 7 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ tax ┆ ia; ┆ 27. │
│ d_2 ┆ d_2 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ono ┆ p__ ┆ 1 │
│ 698 ┆ 698 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ my ┆ Pro ┆ │
│ 536 ┆ 536 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ teo ┆ │
│ 574 ┆ 574 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ bac ┆ │
│ _00 ┆ _00 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ter ┆ │
│ 000 ┆ 000 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ia; ┆ │
│ … ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ … ┆ │
│ IMG ┆ IMG ┆ PTU ┆ 272 ┆ 272 ┆ Iso ┆ nul ┆ Unk ┆ 541 ┆ 8 ┆ 0.9 ┆ fal ┆ Con ┆ tru ┆ fal ┆ tru ┆ fal ┆ fal ┆ fal ┆ Iso ┆ d__ ┆ NZ_ │
│ PR_ ┆ PR_ ┆ _00 ┆ 836 ┆ 838 ┆ lat ┆ l ┆ now ┆ 9 ┆ ┆ 967 ┆ se ┆ cat ┆ e ┆ se ┆ e ┆ se ┆ se ┆ se ┆ lat ┆ Bac ┆ LR1 │
│ pla ┆ pla ┆ 015 ┆ 902 ┆ 871 ┆ e ┆ ┆ n ┆ ┆ ┆ ┆ ┆ eme ┆ ┆ ┆ ┆ ┆ ┆ ┆ e ┆ ter ┆ 352 │
│ smi ┆ smi ┆ 212 ┆ 6 ┆ 6 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ r ┆ ┆ ┆ ┆ ┆ ┆ ┆ tax ┆ ia; ┆ 33. │
│ d_2 ┆ d_2 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ono ┆ p__ ┆ 1 │
│ 728 ┆ 728 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ my ┆ Fir ┆ │
│ 369 ┆ 369 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ mic ┆ │
│ 026 ┆ 026 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ute ┆ │
│ _00 ┆ _00 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ s;c ┆ │
│ 000 ┆ 000 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ __B ┆ │
│ … ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ … ┆ │
│ IMG ┆ IMG ┆ PTU ┆ 330 ┆ Ga0 ┆ Met ┆ Hos ┆ Ani ┆ 756 ┆ 8 ┆ 0.9 ┆ fal ┆ Con ┆ tru ┆ fal ┆ tru ┆ fal ┆ fal ┆ fal ┆ CRI ┆ d__ ┆ NC_ │
│ PR_ ┆ PR_ ┆ _00 ┆ 000 ┆ 105 ┆ age ┆ t-a ┆ mal ┆ 7 ┆ ┆ 967 ┆ se ┆ cat ┆ e ┆ se ┆ e ┆ se ┆ se ┆ se ┆ SPR ┆ Bac ┆ 016 │
│ pla ┆ pla ┆ 000 ┆ 776 ┆ 553 ┆ nom ┆ sso ┆ -as ┆ ┆ ┆ ┆ ┆ eme ┆ ┆ ┆ ┆ ┆ ┆ ┆ spa ┆ ter ┆ 900 │
│ smi ┆ smi ┆ 642 ┆ 7 ┆ _10 ┆ e ┆ cia ┆ soc ┆ ┆ ┆ ┆ ┆ r ┆ ┆ ┆ ┆ ┆ ┆ ┆ cer ┆ ia; ┆ .1 │
│ d_3 ┆ d_3 ┆ ┆ ┆ 414 ┆ ┆ ted ┆ iat ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ mat ┆ p__ ┆ │
│ 300 ┆ 300 ┆ ┆ ┆ 16 ┆ ┆ ;Ar ┆ ed ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ch ┆ Pro ┆ │
│ 007 ┆ 007 ┆ ┆ ┆ ┆ ┆ thr ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ teo ┆ │
│ 767 ┆ 767 ┆ ┆ ┆ ┆ ┆ opo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ bac ┆ │
│ _00 ┆ _00 ┆ ┆ ┆ ┆ ┆ da: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ter ┆ │
│ 000 ┆ 000 ┆ ┆ ┆ ┆ ┆ In… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ia; ┆ │
│ … ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ … ┆ │
│ IMG ┆ IMG ┆ PTU ┆ 330 ┆ Ga0 ┆ Met ┆ Hos ┆ Ani ┆ 639 ┆ 9 ┆ 0.9 ┆ fal ┆ Lin ┆ tru ┆ fal ┆ tru ┆ fal ┆ fal ┆ fal ┆ CRI ┆ d__ ┆ NC_ │
│ PR_ ┆ PR_ ┆ _00 ┆ 000 ┆ 104 ┆ age ┆ t-a ┆ mal ┆ 7 ┆ ┆ 967 ┆ se ┆ ear ┆ e ┆ se ┆ e ┆ se ┆ se ┆ se ┆ SPR ┆ Bac ┆ 016 │
│ pla ┆ pla ┆ 024 ┆ 725 ┆ 149 ┆ nom ┆ sso ┆ -as ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ spa ┆ ter ┆ 900 │
│ smi ┆ smi ┆ 993 ┆ 7 ┆ _10 ┆ e ┆ cia ┆ soc ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ cer ┆ ia; ┆ .1 │
│ d_3 ┆ d_3 ┆ ┆ ┆ 273 ┆ ┆ ted ┆ iat ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ mat ┆ p__ ┆ │
│ 300 ┆ 300 ┆ ┆ ┆ 6 ┆ ┆ ;Ar ┆ ed ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ch ┆ Pro ┆ │
│ 007 ┆ 007 ┆ ┆ ┆ ┆ ┆ thr ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ teo ┆ │
│ 257 ┆ 257 ┆ ┆ ┆ ┆ ┆ opo ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ bac ┆ │
│ _00 ┆ _00 ┆ ┆ ┆ ┆ ┆ da: ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ter ┆ │
│ 000 ┆ 000 ┆ ┆ ┆ ┆ ┆ In… ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ia; ┆ │
│ … ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ … ┆ │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
Preview of table 'imgvr_hits':
shape: (5, 12)
┌────────┬────────┬────────┬────────┬────────┬───────┬───────┬─────┬───────┬───────┬───────┬───────┐
│ cluste ┆ target ┆ hit_st ┆ hit_en ┆ hit_st ┆ n_mis ┆ CIGAR ┆ MD ┆ space ┆ proto ┆ upstr ┆ downs │
│ r_id ┆ _id ┆ art ┆ d ┆ rand ┆ match ┆ --- ┆ --- ┆ r ┆ space ┆ eam ┆ tream │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ es ┆ str ┆ str ┆ --- ┆ r ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ str ┆ --- ┆ ┆ ┆ str ┆ --- ┆ str ┆ str │
│ ┆ ┆ ┆ ┆ ┆ i64 ┆ ┆ ┆ ┆ str ┆ ┆ │
╞════════╪════════╪════════╪════════╪════════╪═══════╪═══════╪═════╪═══════╪═══════╪═══════╪═══════╡
│ Sp_cl_ ┆ IMGVR_ ┆ 43965 ┆ 43996 ┆ minus ┆ 0 ┆ 32M ┆ 32 ┆ CTGGT ┆ ..... ┆ CTTCC ┆ CTTCG │
│ 000391 ┆ UViG_2 ┆ ┆ ┆ ┆ ┆ ┆ ┆ AGCCA ┆ ..... ┆ ATCCG ┆ TCAAG │
│ 717596 ┆ 013843 ┆ ┆ ┆ ┆ ┆ ┆ ┆ GCATC ┆ ..... ┆ ┆ │
│ ┆ 001_00 ┆ ┆ ┆ ┆ ┆ ┆ ┆ AATAA ┆ ..... ┆ ┆ │
│ ┆ 0007|2 ┆ ┆ ┆ ┆ ┆ ┆ ┆ GGAAA ┆ ..... ┆ ┆ │
│ ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ CGCCC ┆ ..... ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ … ┆ … ┆ ┆ │
│ Sp_cl_ ┆ IMGVR_ ┆ 24536 ┆ 24570 ┆ plus ┆ 0 ┆ 35M ┆ 35 ┆ TTACA ┆ ..... ┆ AATTA ┆ TAAAT │
│ 000256 ┆ UViG_2 ┆ ┆ ┆ ┆ ┆ ┆ ┆ AATTC ┆ ..... ┆ GTTCA ┆ TATCT │
│ 103330 ┆ 019105 ┆ ┆ ┆ ┆ ┆ ┆ ┆ TTACC ┆ ..... ┆ ┆ │
│ ┆ 003_00 ┆ ┆ ┆ ┆ ┆ ┆ ┆ AATCT ┆ ..... ┆ ┆ │
│ ┆ 0001|2 ┆ ┆ ┆ ┆ ┆ ┆ ┆ TTCAT ┆ ..... ┆ ┆ │
│ ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ CTACA ┆ ..... ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ … ┆ … ┆ ┆ │
│ Sp_cl_ ┆ IMGVR_ ┆ 2537 ┆ 2572 ┆ minus ┆ 0 ┆ 36M ┆ 36 ┆ CTTCA ┆ ..... ┆ TATTC ┆ GTCCA │
│ 000111 ┆ UViG_2 ┆ ┆ ┆ ┆ ┆ ┆ ┆ TCAAT ┆ ..... ┆ CTGTG ┆ CCCTT │
│ 470241 ┆ 019105 ┆ ┆ ┆ ┆ ┆ ┆ ┆ TACGA ┆ ..... ┆ ┆ │
│ ┆ 003_00 ┆ ┆ ┆ ┆ ┆ ┆ ┆ GTAGG ┆ ..... ┆ ┆ │
│ ┆ 0005|2 ┆ ┆ ┆ ┆ ┆ ┆ ┆ TCATA ┆ ..... ┆ ┆ │
│ ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ TCCTT ┆ ..... ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ … ┆ … ┆ ┆ │
│ Sp_cl_ ┆ IMGVR_ ┆ 1882 ┆ 1915 ┆ minus ┆ 0 ┆ 34M ┆ 34 ┆ CGATG ┆ ..... ┆ CTCCG ┆ TGACG │
│ 000214 ┆ UViG_2 ┆ ┆ ┆ ┆ ┆ ┆ ┆ TCCCG ┆ ..... ┆ GTTAC ┆ AATGT │
│ 114542 ┆ 020350 ┆ ┆ ┆ ┆ ┆ ┆ ┆ ATGGT ┆ ..... ┆ ┆ │
│ ┆ 001_00 ┆ ┆ ┆ ┆ ┆ ┆ ┆ CATAG ┆ ..... ┆ ┆ │
│ ┆ 0036|2 ┆ ┆ ┆ ┆ ┆ ┆ ┆ TAGCA ┆ ..... ┆ ┆ │
│ ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ GCAGA ┆ ..... ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ … ┆ … ┆ ┆ │
│ Sp_cl_ ┆ IMGVR_ ┆ 1870 ┆ 1905 ┆ minus ┆ 0 ┆ 36M ┆ 36 ┆ ATGGT ┆ ..... ┆ CGATG ┆ CCGCC │
│ 000346 ┆ UViG_2 ┆ ┆ ┆ ┆ ┆ ┆ ┆ CATAG ┆ ..... ┆ TCCCG ┆ TACCT │
│ 077667 ┆ 020350 ┆ ┆ ┆ ┆ ┆ ┆ ┆ TAGCA ┆ ..... ┆ ┆ │
│ ┆ 001_00 ┆ ┆ ┆ ┆ ┆ ┆ ┆ GCAGA ┆ ..... ┆ ┆ │
│ ┆ 0036|2 ┆ ┆ ┆ ┆ ┆ ┆ ┆ TTTCT ┆ ..... ┆ ┆ │
│ ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ GACGA ┆ ..... ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ ┆ ┆ … ┆ … ┆ ┆ │
└────────┴────────┴────────┴────────┴────────┴───────┴───────┴─────┴───────┴───────┴───────┴───────┘
Preview of table 'imgvr_info':
shape: (5, 16)
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬──────┐
│ uvi ┆ eco ┆ vot ┆ len ┆ top ┆ qua ┆ tax ┆ cla ┆ hos ┆ hos ┆ hos ┆ sou ┆ lif ┆ fib ┆ acr ┆ dgr │
│ g ┆ sys ┆ u ┆ gth ┆ olo ┆ lit ┆ ono ┆ ss ┆ t_p ┆ t_m ┆ t_f ┆ rce ┆ est ┆ ers ┆ --- ┆ --- │
│ --- ┆ tem ┆ --- ┆ --- ┆ gy ┆ y ┆ my_ ┆ --- ┆ red ┆ eth ┆ rom ┆ --- ┆ yle ┆ --- ┆ boo ┆ bool │
│ str ┆ --- ┆ str ┆ i64 ┆ --- ┆ --- ┆ ful ┆ str ┆ ict ┆ od ┆ _ta ┆ str ┆ --- ┆ boo ┆ l ┆ │
│ ┆ str ┆ ┆ ┆ str ┆ str ┆ l ┆ ┆ ion ┆ --- ┆ xo ┆ ┆ str ┆ l ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ --- ┆ ┆ --- ┆ str ┆ --- ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ ┆ str ┆ ┆ str ┆ ┆ str ┆ ┆ ┆ ┆ ┆ │
╞═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪═════╪══════╡
│ IMG ┆ Unk ┆ vOT ┆ 559 ┆ Pro ┆ Hig ┆ r__ ┆ r__ ┆ d__ ┆ Iso ┆ pha ┆ gen ┆ chr ┆ nul ┆ nul ┆ null │
│ VR_ ┆ now ┆ U_0 ┆ 4 ┆ vir ┆ h-q ┆ Mon ┆ Mon ┆ Bac ┆ lat ┆ ge ┆ ome ┆ oni ┆ l ┆ l ┆ │
│ UVi ┆ n ┆ 000 ┆ ┆ us ┆ ual ┆ odn ┆ odn ┆ ter ┆ e ┆ ┆ ┆ c ┆ ┆ ┆ │
│ G_2 ┆ ┆ 000 ┆ ┆ ┆ ity ┆ avi ┆ avi ┆ ia; ┆ tax ┆ ┆ ┆ ┆ ┆ ┆ │
│ 504 ┆ ┆ 1 ┆ ┆ ┆ ┆ ria ┆ ria ┆ p__ ┆ ono ┆ ┆ ┆ ┆ ┆ ┆ │
│ 643 ┆ ┆ ┆ ┆ ┆ ┆ ;k_ ┆ ;k_ ┆ Pro ┆ my ┆ ┆ ┆ ┆ ┆ ┆ │
│ 025 ┆ ┆ ┆ ┆ ┆ ┆ _Lo ┆ _Lo ┆ teo ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ _00 ┆ ┆ ┆ ┆ ┆ ┆ ebv ┆ ebv ┆ bac ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 000 ┆ ┆ ┆ ┆ ┆ ┆ ira ┆ ira ┆ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 1|2 ┆ ┆ ┆ ┆ ┆ ┆ e;p ┆ e;p ┆ ia; ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ … ┆ ┆ ┆ ┆ ┆ ┆ … ┆ … ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ IMG ┆ Unk ┆ vOT ┆ 939 ┆ Pro ┆ Hig ┆ r__ ┆ r__ ┆ d__ ┆ Iso ┆ pha ┆ gen ┆ chr ┆ nul ┆ nul ┆ null │
│ VR_ ┆ now ┆ U_0 ┆ 9 ┆ vir ┆ h-q ┆ Mon ┆ Mon ┆ Bac ┆ lat ┆ ge ┆ ome ┆ oni ┆ l ┆ l ┆ │
│ UVi ┆ n ┆ 000 ┆ ┆ us ┆ ual ┆ odn ┆ odn ┆ ter ┆ e ┆ ┆ ┆ c ┆ ┆ ┆ │
│ G_2 ┆ ┆ 000 ┆ ┆ ┆ ity ┆ avi ┆ avi ┆ ia; ┆ tax ┆ ┆ ┆ ┆ ┆ ┆ │
│ 504 ┆ ┆ 2 ┆ ┆ ┆ ┆ ria ┆ ria ┆ p__ ┆ ono ┆ ┆ ┆ ┆ ┆ ┆ │
│ 756 ┆ ┆ ┆ ┆ ┆ ┆ ;k_ ┆ ;k_ ┆ Pro ┆ my ┆ ┆ ┆ ┆ ┆ ┆ │
│ 089 ┆ ┆ ┆ ┆ ┆ ┆ _Lo ┆ _Lo ┆ teo ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ _00 ┆ ┆ ┆ ┆ ┆ ┆ ebv ┆ ebv ┆ bac ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 000 ┆ ┆ ┆ ┆ ┆ ┆ ira ┆ ira ┆ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 1|2 ┆ ┆ ┆ ┆ ┆ ┆ e;p ┆ e;p ┆ ia; ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ … ┆ ┆ ┆ ┆ ┆ ┆ … ┆ … ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ IMG ┆ Ter ┆ vOT ┆ 764 ┆ Pro ┆ Hig ┆ r__ ┆ r__ ┆ d__ ┆ Iso ┆ pha ┆ gen ┆ chr ┆ nul ┆ nul ┆ null │
│ VR_ ┆ res ┆ U_0 ┆ 5 ┆ vir ┆ h-q ┆ Mon ┆ Mon ┆ Bac ┆ lat ┆ ge ┆ ome ┆ oni ┆ l ┆ l ┆ │
│ UVi ┆ tri ┆ 000 ┆ ┆ us ┆ ual ┆ odn ┆ odn ┆ ter ┆ e ┆ ┆ ┆ c ┆ ┆ ┆ │
│ G_2 ┆ al_ ┆ 000 ┆ ┆ ┆ ity ┆ avi ┆ avi ┆ ia; ┆ tax ┆ ┆ ┆ ┆ ┆ ┆ │
│ 505 ┆ Soi ┆ 3 ┆ ┆ ┆ ┆ ria ┆ ria ┆ p__ ┆ ono ┆ ┆ ┆ ┆ ┆ ┆ │
│ 313 ┆ l_a ┆ ┆ ┆ ┆ ┆ ;k_ ┆ ;k_ ┆ Pro ┆ my ┆ ┆ ┆ ┆ ┆ ┆ │
│ 053 ┆ nd_ ┆ ┆ ┆ ┆ ┆ _Lo ┆ _Lo ┆ teo ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ _00 ┆ pla ┆ ┆ ┆ ┆ ┆ ebv ┆ ebv ┆ bac ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 000 ┆ nts ┆ ┆ ┆ ┆ ┆ ira ┆ ira ┆ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 1|2 ┆ ┆ ┆ ┆ ┆ ┆ e;p ┆ e;p ┆ ia; ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ … ┆ ┆ ┆ ┆ ┆ ┆ … ┆ … ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ IMG ┆ Oth ┆ vOT ┆ 102 ┆ Lin ┆ Hig ┆ r__ ┆ r__ ┆ d__ ┆ Iso ┆ pha ┆ gen ┆ chr ┆ nul ┆ nul ┆ null │
│ VR_ ┆ er ┆ U_0 ┆ 05 ┆ ear ┆ h-q ┆ Mon ┆ Mon ┆ Bac ┆ lat ┆ ge ┆ ome ┆ oni ┆ l ┆ l ┆ │
│ UVi ┆ ┆ 000 ┆ ┆ ┆ ual ┆ odn ┆ odn ┆ ter ┆ e ┆ ┆ ┆ c ┆ ┆ ┆ │
│ G_2 ┆ ┆ 000 ┆ ┆ ┆ ity ┆ avi ┆ avi ┆ ia; ┆ tax ┆ ┆ ┆ ┆ ┆ ┆ │
│ 506 ┆ ┆ 4 ┆ ┆ ┆ ┆ ria ┆ ria ┆ p__ ┆ ono ┆ ┆ ┆ ┆ ┆ ┆ │
│ 210 ┆ ┆ ┆ ┆ ┆ ┆ ;k_ ┆ ;k_ ┆ Pro ┆ my ┆ ┆ ┆ ┆ ┆ ┆ │
│ 030 ┆ ┆ ┆ ┆ ┆ ┆ _Lo ┆ _Lo ┆ teo ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ _00 ┆ ┆ ┆ ┆ ┆ ┆ ebv ┆ ebv ┆ bac ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 000 ┆ ┆ ┆ ┆ ┆ ┆ ira ┆ ira ┆ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 1|2 ┆ ┆ ┆ ┆ ┆ ┆ e;p ┆ e;p ┆ ia; ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ … ┆ ┆ ┆ ┆ ┆ ┆ … ┆ … ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ IMG ┆ Hum ┆ vOT ┆ 739 ┆ Lin ┆ Hig ┆ r__ ┆ r__ ┆ d__ ┆ Iso ┆ pha ┆ gen ┆ chr ┆ nul ┆ nul ┆ null │
│ VR_ ┆ an- ┆ U_0 ┆ 2 ┆ ear ┆ h-q ┆ Mon ┆ Mon ┆ Bac ┆ lat ┆ ge ┆ ome ┆ oni ┆ l ┆ l ┆ │
│ UVi ┆ ass ┆ 000 ┆ ┆ ┆ ual ┆ odn ┆ odn ┆ ter ┆ e ┆ ┆ ┆ c ┆ ┆ ┆ │
│ G_2 ┆ oci ┆ 000 ┆ ┆ ┆ ity ┆ avi ┆ avi ┆ ia; ┆ tax ┆ ┆ ┆ ┆ ┆ ┆ │
│ 507 ┆ ate ┆ 5 ┆ ┆ ┆ ┆ ria ┆ ria ┆ p__ ┆ ono ┆ ┆ ┆ ┆ ┆ ┆ │
│ 262 ┆ d_O ┆ ┆ ┆ ┆ ┆ ;k_ ┆ ;k_ ┆ Pro ┆ my ┆ ┆ ┆ ┆ ┆ ┆ │
│ 023 ┆ the ┆ ┆ ┆ ┆ ┆ _Lo ┆ _Lo ┆ teo ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ _00 ┆ r ┆ ┆ ┆ ┆ ┆ ebv ┆ ebv ┆ bac ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 000 ┆ ┆ ┆ ┆ ┆ ┆ ira ┆ ira ┆ ter ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ 1|2 ┆ ┆ ┆ ┆ ┆ ┆ e;p ┆ e;p ┆ ia; ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ … ┆ ┆ ┆ ┆ ┆ ┆ … ┆ … ┆ … ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴──────┘
Preview of table 'repeat_tbl':
shape: (5, 7)
┌──────────────┬─────────┬──────────────┬──────────────┬──────────────┬──────────────┬─────────────┐
│ repeat_clust ┆ type ┆ lca_origin ┆ lca_full ┆ lca_class ┆ lca_family ┆ lca_genus │
│ er ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ --- ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │
│ str ┆ ┆ ┆ ┆ ┆ ┆ │
╞══════════════╪═════════╪══════════════╪══════════════╪══════════════╪══════════════╪═════════════╡
│ Ac_00008 ┆ Unknown ┆ No-informati ┆ NA ┆ NA ┆ NA ┆ NA │
│ ┆ ┆ on_short ┆ ┆ ┆ ┆ │
│ Ac_00016 ┆ I-E ┆ Genome_mediu ┆ d__Bacteria; ┆ d__Bacteria; ┆ d__Bacteria; ┆ d__Bacteria │
│ ┆ ┆ m-confidence ┆ p__Pseudomon ┆ p__Pseudomon ┆ p__Pseudomon ┆ ;p__Pseudom │
│ ┆ ┆ ┆ adota;… ┆ adota;… ┆ adota;… ┆ onadota;… │
│ Ac_00017 ┆ Unknown ┆ No-informati ┆ NA ┆ NA ┆ NA ┆ NA │
│ ┆ ┆ on_short ┆ ┆ ┆ ┆ │
│ Ac_00019 ┆ Unknown ┆ No-informati ┆ NA ┆ NA ┆ NA ┆ NA │
│ ┆ ┆ on_short ┆ ┆ ┆ ┆ │
│ Ac_00021 ┆ I-E ┆ Genome_mediu ┆ d__Bacteria; ┆ d__Bacteria; ┆ d__Bacteria; ┆ d__Bacteria │
│ ┆ ┆ m-confidence ┆ p__Desulfoba ┆ p__Desulfoba ┆ p__Desulfoba ┆ ;p__Desulfo │
│ ┆ ┆ ┆ cterot… ┆ cterot… ┆ cterot… ┆ bacterot… │
└──────────────┴─────────┴──────────────┴──────────────┴──────────────┴──────────────┴─────────────┘
Preview of table 'sample_tbl':
shape: (5, 11)
┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐
│ librar ┆ sra_ru ┆ ecosys ┆ ecosys ┆ biofil ┆ sequen ┆ biopro ┆ biosam ┆ n_spot ┆ n_base ┆ title │
│ y ┆ n ┆ tem_ba ┆ tem_su ┆ m ┆ cing_p ┆ ject ┆ ple ┆ --- ┆ --- ┆ --- │
│ --- ┆ --- ┆ se ┆ m ┆ --- ┆ latfor ┆ --- ┆ --- ┆ i64 ┆ i64 ┆ str │
│ str ┆ str ┆ --- ┆ --- ┆ str ┆ m ┆ str ┆ str ┆ ┆ ┆ │
│ ┆ ┆ str ┆ str ┆ ┆ --- ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ┆ ┆ ┆ str ┆ ┆ ┆ ┆ ┆ │
╞════════╪════════╪════════╪════════╪════════╪════════╪════════╪════════╪════════╪════════╪════════╡
│ DRR001 ┆ DRR001 ┆ Human- ┆ Human- ┆ Unknow ┆ ILLUMI ┆ PRJDA7 ┆ SAMD00 ┆ 104058 ┆ 832467 ┆ P1-026 │
│ 355 ┆ 355 ┆ associ ┆ associ ┆ n ┆ NA ┆ 2415 ┆ 008984 ┆ 4 ┆ 20 ┆ D │
│ ┆ ┆ ated;O ┆ ated_O ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ther ┆ ther ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ DRR001 ┆ DRR001 ┆ Human- ┆ Human- ┆ Unknow ┆ ILLUMI ┆ PRJDA7 ┆ SAMD00 ┆ 156150 ┆ 124920 ┆ P1-026 │
│ 356 ┆ 356 ┆ associ ┆ associ ┆ n ┆ NA ┆ 2415 ┆ 008985 ┆ 1 ┆ 080 ┆ E │
│ ┆ ┆ ated;O ┆ ated_O ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ ┆ ┆ ther ┆ ther ┆ ┆ ┆ ┆ ┆ ┆ ┆ │
│ DRR001 ┆ DRR001 ┆ Human- ┆ Human- ┆ Unknow ┆ ILLUMI ┆ PRJDA7 ┆ SAMD00 ┆ 113484 ┆ 851136 ┆ P2 │
│ 359 ┆ 359 ┆ associ ┆ associ ┆ n ┆ NA ┆ 2415 ┆ 008986 ┆ 8 ┆ 00 ┆ lymph │
│ ┆ ┆ ated;O ┆ ated_O ┆ ┆ ┆ ┆ ┆ ┆ ┆ node │
│ ┆ ┆ ther ┆ ther ┆ ┆ ┆ ┆ ┆ ┆ ┆ DNA │
│ DRR001 ┆ DRR001 ┆ Human- ┆ Human- ┆ Unknow ┆ ILLUMI ┆ PRJDA7 ┆ SAMD00 ┆ 216333 ┆ 162250 ┆ P2 │
│ 360 ┆ 360 ┆ associ ┆ associ ┆ n ┆ NA ┆ 2415 ┆ 008988 ┆ 7 ┆ 275 ┆ lymph │
│ ┆ ┆ ated;O ┆ ated_O ┆ ┆ ┆ ┆ ┆ ┆ ┆ node │
│ ┆ ┆ ther ┆ ther ┆ ┆ ┆ ┆ ┆ ┆ ┆ RNA │
│ DRR001 ┆ DRR001 ┆ Terres ┆ Terres ┆ Unknow ┆ ILLUMI ┆ PRJDB2 ┆ SAMD00 ┆ 214520 ┆ 321781 ┆ Pollut │
│ 455 ┆ 455 ┆ trial; ┆ trial_ ┆ n ┆ NA ┆ 729 ┆ 015677 ┆ 87 ┆ 3050 ┆ ed │
│ ┆ ┆ Soil ┆ Soil_a ┆ ┆ ┆ ┆ ┆ ┆ ┆ soil │
│ ┆ ┆ ┆ nd_pla ┆ ┆ ┆ ┆ ┆ ┆ ┆ sample │
│ ┆ ┆ ┆ nts ┆ ┆ ┆ ┆ ┆ ┆ ┆ 0 │
└────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘
Preview of table 'spacer_clusters':
shape: (5, 2)
┌────────────────────┬─────────────────────────────────┐
│ cluster_id ┆ spacer_id │
│ --- ┆ --- │
│ str ┆ str │
╞════════════════════╪═════════════════════════════════╡
│ Sp_cl_000000000001 ┆ DRR001355.505894/2|Ac_318340|1… │
│ Sp_cl_000000000002 ┆ DRR001356.1528018/2|Ac_749704|… │
│ Sp_cl_000000000003 ┆ DRR001360.1357947/1|Ac_268812|… │
│ Sp_cl_000000000004 ┆ DRR001455.6540945/2|Ac_72607|2… │
│ Sp_cl_000000000005 ┆ DRR001456.1389607/2|Ac_757060|… │
└────────────────────┴─────────────────────────────────┘
Preview of table 'spacer_hq_clusters':
shape: (5, 2)
┌────────────────────┬─────────────────────────────────┐
│ cluster_id ┆ spacer_id │
│ --- ┆ --- │
│ str ┆ str │
╞════════════════════╪═════════════════════════════════╡
│ Sp_cl_000000300846 ┆ DRR171573.10097189/1|Ac_250175… │
│ Sp_cl_000000314449 ┆ DRR171573.10105139/1|Ac_16972|… │
│ Sp_cl_000001848525 ┆ DRR171573.10105139/1|Ac_16972|… │
│ Sp_cl_000001848526 ┆ DRR171573.10105978/1|Ac_33126|… │
│ Sp_cl_000001848527 ┆ DRR171573.10105978/1|Ac_33126|… │
└────────────────────┴─────────────────────────────────┘
Preview of table 'spacer_hq_tbl':
shape: (5, 8)
┌─────────────┬─────────────┬─────────────┬─────────────┬──────────┬─────────────┬─────┬───────────┐
│ spacer_id ┆ spacer_sequ ┆ spacer_leng ┆ spacer_cove ┆ warning ┆ crispr_arra ┆ hq ┆ library │
│ --- ┆ ence ┆ th ┆ rage ┆ --- ┆ y ┆ --- ┆ --- │
│ str ┆ --- ┆ --- ┆ --- ┆ str ┆ --- ┆ i8 ┆ str │
│ ┆ str ┆ i64 ┆ i64 ┆ ┆ str ┆ ┆ │
╞═════════════╪═════════════╪═════════════╪═════════════╪══════════╪═════════════╪═════╪═══════════╡
│ DRR002467.1 ┆ AAAAAGCTCAA ┆ 30 ┆ 7 ┆ null ┆ Ac_31564 ┆ 1 ┆ DRR002467 │
│ 003712/1|Ac ┆ GCACTAGTCGA ┆ ┆ ┆ ┆ ┆ ┆ │
│ _31564|6… ┆ TATTGTTA ┆ ┆ ┆ ┆ ┆ ┆ │
│ DRR002467.1 ┆ GCGAAGAGCCA ┆ 30 ┆ 1 ┆ null ┆ Ac_16930 ┆ 1 ┆ DRR002467 │
│ 0103/1|Ac_1 ┆ TAGCAGACTTA ┆ ┆ ┆ ┆ ┆ ┆ │
│ 6930|74 ┆ GCAACATG ┆ ┆ ┆ ┆ ┆ ┆ │
│ DRR002467.1 ┆ GCTGAAGGTTA ┆ 29 ┆ 2 ┆ Partial ┆ Ac_16930 ┆ 1 ┆ DRR002467 │
│ 0103/1|Ac_1 ┆ TTGGTAACGAC ┆ ┆ ┆ ┆ ┆ ┆ │
│ 6930|9 ┆ TGCAGGA ┆ ┆ ┆ ┆ ┆ ┆ │
│ DRR002467.1 ┆ TCACGCTGTAA ┆ 30 ┆ 1 ┆ Mismatch ┆ Ac_16930 ┆ 1 ┆ DRR002467 │
│ 011278/1|Ac ┆ ATGGCTATTCT ┆ ┆ ┆ ┆ ┆ ┆ │
│ _16930|6… ┆ CGCTACGA ┆ ┆ ┆ ┆ ┆ ┆ │
│ DRR002467.1 ┆ ACAGCAACAAC ┆ 29 ┆ 1 ┆ null ┆ Ac_16930 ┆ 1 ┆ DRR002467 │
│ 011278/2|Ac ┆ GGGGTAGCCGT ┆ ┆ ┆ ┆ ┆ ┆ │
│ _16930|7… ┆ CGCCATT ┆ ┆ ┆ ┆ ┆ ┆ │
└─────────────┴─────────────┴─────────────┴─────────────┴──────────┴─────────────┴─────┴───────────┘
Preview of table 'spacer_tbl':
shape: (5, 8)
┌─────────────┬─────────────┬─────────────┬────────────┬────────────┬────────────┬─────┬───────────┐
│ spacer_id ┆ spacer_sequ ┆ spacer_leng ┆ spacer_cov ┆ warning ┆ crispr_arr ┆ hq ┆ library │
│ --- ┆ ence ┆ th ┆ erage ┆ --- ┆ ay ┆ --- ┆ --- │
│ str ┆ --- ┆ --- ┆ --- ┆ str ┆ --- ┆ i8 ┆ str │
│ ┆ str ┆ i64 ┆ i64 ┆ ┆ str ┆ ┆ │
╞═════════════╪═════════════╪═════════════╪════════════╪════════════╪════════════╪═════╪═══════════╡
│ DRR001355.5 ┆ GTGCAGGCCAC ┆ 32 ┆ 1 ┆ Partial ┆ Ac_318340 ┆ 0 ┆ DRR001355 │
│ 05894/2|Ac_ ┆ CCAGTGTATTT ┆ ┆ ┆ ┆ ┆ ┆ │
│ 318340|1… ┆ CGCAGTCG… ┆ ┆ ┆ ┆ ┆ ┆ │
│ DRR001356.1 ┆ TCGATAAGGGA ┆ 31 ┆ 1 ┆ Partial;lo ┆ Ac_749704 ┆ 0 ┆ DRR001356 │
│ 528018/2|Ac ┆ GATTTTTTTAA ┆ ┆ ┆ w_complexi ┆ ┆ ┆ │
│ _749704|… ┆ TGCATGAT… ┆ ┆ ┆ ty_spacer ┆ ┆ ┆ │
│ DRR001360.1 ┆ TGATAGCGAAT ┆ 28 ┆ 2 ┆ Partial ┆ Ac_268812 ┆ 0 ┆ DRR001360 │
│ 357947/1|Ac ┆ ACTTGGAAGCT ┆ ┆ ┆ ┆ ┆ ┆ │
│ _268812|… ┆ GCAAAA ┆ ┆ ┆ ┆ ┆ ┆ │
│ DRR001455.6 ┆ GAGAAACCCAT ┆ 26 ┆ 1 ┆ Partial ┆ Ac_72607 ┆ 0 ┆ DRR001455 │
│ 540945/2|Ac ┆ CTCTTGGCAGC ┆ ┆ ┆ ┆ ┆ ┆ │
│ _72607|2… ┆ TATA ┆ ┆ ┆ ┆ ┆ ┆ │
│ DRR001456.1 ┆ TGCGATGCCGC ┆ 32 ┆ 1 ┆ Partial ┆ Ac_757060 ┆ 0 ┆ DRR001456 │
│ 389607/2|Ac ┆ GCAGCACGTTG ┆ ┆ ┆ ┆ ┆ ┆ │
│ _757060|… ┆ GCGGCAAC… ┆ ┆ ┆ ┆ ┆ ┆ │
└─────────────┴─────────────┴─────────────┴────────────┴────────────┴────────────┴─────┴───────────┘
Here are some example of SQL queries we can run to get a better sense of the content of the database. First, looking at Array type distribution
Array Type Distribution:
type | repeat_count | percentage |
---|---|---|
Loading ITables v2.3.0 from the init_notebook_mode cell...
(need help?) |
We can also look at the distribution of ecosystems in the samples
Ecosystem Distribution:
ecosystem_sum | sample_count | total_spots | total_tbp |
---|---|---|---|
Loading ITables v2.3.0 from the init_notebook_mode cell...
(need help?) |
And we can look at the number of hits to IMG/PR sequences, at different levels of mismatches (0 or 1). To avoid scanning the entire img_pr table, we will instead subsample the first 100,000 hits.
Hits to IMG/PR:
n_mismatches | total_hits |
---|---|
Loading ITables v2.3.0 from the init_notebook_mode cell...
(need help?) |
Often, you will need to connect information across tables. Below are two examples of this type of query.
Because both queries have to traverse multiple tables, they can take a few minutes to complete. To speed things up, and because they are more reliable, we only look at high-quality spacers, and thus use the (smaller) tables “spacer_hq_tbl” and “spacer_hq_clusters” instead of “spacer_tbl” and “spacer_cluster”, and we only look at the first 100,000 results (or later, the first 10 hits).
First, we will look for all CRISPR spacers that were found in engineered samples and linked to a CRISPR system of Type I-E. The first cell provides an example of a query that connects spacer_tbl to repeat_tbl (using crispr_repeat/repeat_cluster) and would look for spacers associated with repeats of type I-E (limiting results to the first 1,000,000 instances). The table also reports the spacer coverage in the corresponding sample, i.e. the number of reads in which this spacer was found associated with this repeat.
ctype_query = """SELECT spacer_id, spacer_sequence, spacer_coverage, library, repeat_cluster
FROM spacer_hq_tbl s, repeat_tbl r
WHERE s.crispr_array=r.repeat_cluster AND type='I-E'
LIMIT 100000
"""
ctype_result = (con.execute(ctype_query).pl())
print("First 100,000 CRISPR spacers linked to repeats of (predicted) type I-E:")
ctype_result
First 100,000 CRISPR spacers linked to repeats of (predicted) type I-E:
spacer_id | spacer_sequence | spacer_coverage | library | repeat_cluster |
---|---|---|---|---|
Loading ITables v2.3.0 from the init_notebook_mode cell...
(need help?) |
Next, the full query becomes a bit more complex than the previous ones: first, we connect spacer_tbl to repeat_tbl (using crispr_array/repeat_cluster), and only select high-quality spacers connected to a repeat of type III-E. Then, we cross-reference this list of spacers with the sample_tbl (using library) to only select spacers from samples categorized as “Engineered”. We also include the taxonomic assignment of the corresponding repeat (if available), as well as the original ecosystem category of the sample, and the title of the corresponding BioSample.
ctype_eco_query = """
SELECT tmp.*, l.ecosystem_base, l.title
FROM (
SELECT spacer_id, spacer_sequence, spacer_coverage, library, repeat_cluster, r.lca_class
FROM spacer_hq_tbl s, repeat_tbl r
WHERE s.crispr_array=r.repeat_cluster AND type='I-E'
LIMIT 100000
) AS tmp, sample_tbl l
WHERE ecosystem_sum='Engineered' AND tmp.library=l.library
"""
ctype_eco_result = (con.execute(ctype_eco_query).pl())
print("CRISPR spacers connected to repeats of (predicted) type I-E and identified in engineered samples:")
ctype_eco_result
CRISPR spacers connected to repeats of (predicted) type I-E and identified in engineered samples:
spacer_id | spacer_sequence | spacer_coverage | library | repeat_cluster | lca_class | ecosystem_base | title |
---|---|---|---|---|---|---|---|
Loading ITables v2.3.0 from the init_notebook_mode cell...
(need help?) |
Next, we will look for spacers with a hit to a specific viral taxon (in that case, the Monodnaviria realm, i.e. viruses with ssDNA genomes, which taxonomy string in IMG/VR v4 is “r__Monodnaviria;“).
The query starts by extracting all the hits to genomes assigned to this viral taxon. In the resulting table, because spacer hits are based on a non-redundant set of spacers, the identifier for spacers is a spacer cluster id (cluster_id) and not directly a spacer id. Hence, we next have to connect these hits to the spacer_hq_clusters table (via cluster_id), and then we can connect the resulting spacers to spacer_tbl (via spacer_id) and get the repeat cluster and library identifiers. To avoid cells running too long, we will store temporary results in polars df, but the same search can be run as a single SQL query (see at the end in the last commented cell).
Note: because hits to ssDNA viruses are comparatively rare, this query can take ~ 1 to 2 minutes depending on your connection
First 10 hits to ssDNA viruses:
cluster_id | target_id | hit_start | hit_end | hit_strand | n_mismatches | CIGAR | MD | spacer | protospacer | upstream | downstream |
---|---|---|---|---|---|---|---|---|---|---|---|
Loading ITables v2.3.0 from the init_notebook_mode cell...
(need help?) |
Note: the second and third queries can take several minutes when executed remotely
To limit the waiting time (but it’s still a bit long), we’ve set “LIMIT 10” in this query. You would not want to do that in a real analysis.
First 10 spacers with hit(s) to ssDNA viruses:
spacer_id | cluster_id | target_id | hit_start | hit_end | hit_strand | n_mismatches | CIGAR | MD | spacer | protospacer | upstream | downstream |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Loading ITables v2.3.0 from the init_notebook_mode cell...
(need help?) |
info_spacer_query = """
SELECT repeat_cluster, lca_class, sample_tbl.library, ecosystem_sum, selected_spacers.*
FROM selected_spacers, spacer_hq_tbl, repeat_tbl, sample_tbl
WHERE selected_spacers.spacer_id=spacer_hq_tbl.spacer_id AND repeat_tbl.repeat_cluster=spacer_hq_tbl.crispr_array AND sample_tbl.library=spacer_hq_tbl.library;
"""
spacer_info = (con.execute(info_spacer_query).pl())
print("Information about the first 10 spacers with hit(s) to ssDNA viruses:")
spacer_info
Information about the first 10 spacers with hit(s) to ssDNA viruses:
repeat_cluster | lca_class | library | ecosystem_sum | spacer_id | cluster_id | target_id | hit_start | hit_end | hit_strand | n_mismatches | CIGAR | MD | spacer | protospacer | upstream | downstream |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Loading ITables v2.3.0 from the init_notebook_mode cell...
(need help?) |
### THIS IS AN EXAMPLE OF THE SAME QUERY RUN IN ONE STEP AND WITHOUT LIMIT, WE RECOMMEND ONLY RUNNING THIS CELL WITH A LOCAL COPY OF THE DATABASE
# ssdna_query = """
# SELECT repeat_cluster, lca_class, sample_tbl.library, ecosystem_sum, selected_spacers.*
# FROM (
# SELECT spacer_id, tmp.* FROM (
# SELECT imgvr_hits.*
# FROM imgvr_hits, imgvr_info
# WHERE taxonomy_full='r__Monodnaviria;%' AND target_id=uvig)
# as tmp, spacer_hq_clusters
# WHERE spacer_hq_clusters.cluster_id=tmp.cluster_id)
# AS tmp2, spacer_hq_tbl
# WHERE tmp2.spacer_id=spacer_hq_tbl.spacer_id AND repeat_tbl.repeat_cluster=spacer_hq_tbl.crispr_array AND sample_tbl.library=spacer_hq_tbl.library;
# """
# ssdna_info = (con.execute(ssdna_query).pl())
# print("Information about hits to ssDNA viruses:")
# ssdna_info