0.5.0 • Published 5d ago

@codonsplice/wasm

Licence

MIT

Version

0.5.0

Deps

Size

940 kB

Vulns

Weekly

Summary Dependency Versions

CodonSplice / SpliceQL

A SQL-like query language over genomic files, on a self-contained engine. Write SpliceQL, point it at a BAM/VCF, and get variants, coverage, somatic calls, or annotated records back — natively, embedded, or in the browser via WebAssembly, with no external runtime on the query path.

FROM bam "tumor.bam"
WHERE chr = "7" AND pos >= 55000000 AND pos <= 55300000 AND depth > 30
CALL variants
WITH min_af = 0.05, min_base_quality = 20, reference = "chr7.fa"
INTO vcf "egfr.vcf"

$ splice query "$(cat egfr.spq)"
wrote 274 record(s) to egfr.vcf (vcf)

CodonSplice is the engine; SpliceQL is the language. They are developed as separate crates with a hard boundary:

spliceql  (language)        →   codonsplice  (engine)
Lexer → Parser → AST        →   Compiler → Bytecode → VM → cnvlens-core

spliceql turns source into an AST. codonsplice-core compiles that AST to a compact stack-machine bytecode and executes it against real genomic data via cnvlens-core. The splice binary wraps both in a CLI + TUI, and the same engine compiles to WebAssembly for the browser.

What it is — and what it is not

SpliceQL is not a bcftools replacement. It is a common-80% engine with verified parity on the operations it covers, plus reach that bcftools structurally lacks:

Verified parity on the common path — variant calling, VCF set operations, multi-allelic normalization, tumor/normal somatic pairing, and annotation — each checked against a named oracle (see Verified parity).
Reach bcftools lacks — the whole engine runs in the browser and embedded, with no external runtime on the query path, and adds parallel CNV calling.

Use SpliceQL for those. Reach for bcftools/samtools for the long tail of the full VCF/BCF surface. The project's premise is verified honesty: every parity claim names its oracle, and every scope limit is stated plainly below.

Headline: one query vs. a bcftools pipeline

A common task: take a tumor and a normal VCF, keep the tumor-private (somatic) variants, annotate them with gene, HGVS, and ClinVar significance, then filter to the pathogenic ones. In SpliceQL that is a single declarative query:

FROM vcf "tumor.vcf.gz" PAIRED WITH vcf "normal.vcf.gz"
CALL variants WITH reference = "chr7.fa"
ANNOTATE WITH genes = "refGene.gff3", clinvar = "clinvar.vcf.gz"
WHERE clinvar_significance = "Pathogenic"
SELECT chrom, pos, gene, aa_change, clinvar_significance
INTO vcf "somatic_pathogenic.vcf"

The equivalent bcftools route is roughly nine steps of bgzip / tabix / isec / csq / annotate, with intermediate files at each stage:

# normalize + index both inputs
bgzip -c tumor.vcf  > tumor.vcf.gz   && tabix -p vcf tumor.vcf.gz
bgzip -c normal.vcf > normal.vcf.gz  && tabix -p vcf normal.vcf.gz
# tumor-private (somatic) set
bcftools isec -p isec_out tumor.vcf.gz normal.vcf.gz
bgzip -c isec_out/0000.vcf > somatic.vcf.gz && tabix -p vcf somatic.vcf.gz
# HGVS / consequence, then ClinVar significance
bcftools csq -f chr7.fa -g refGene.gff3 somatic.vcf.gz -Oz -o csq.vcf.gz && tabix -p vcf csq.vcf.gz
bcftools annotate -a clinvar.vcf.gz -c INFO/CLNSIG csq.vcf.gz -Oz -o ann.vcf.gz
# filter to pathogenic
bcftools view -i 'INFO/CLNSIG="Pathogenic"' ann.vcf.gz -o somatic_pathogenic.vcf

Same answer. The SpliceQL somatic set is verified byte-identical to the bcftools isec private-A partition.

Verified parity

Each claim is backed by a test that compares SpliceQL's output to a named oracle.

Operation	Oracle & result
variant calling	Differential vs the GIAB truth set and samtools/bcftools on NA12878 — concordance is measured, not assumed by construction.
ISEC / set ops	Byte-identical to bcftools `isec` partitions (`0000`–`0003`) on an exact `(chrom, pos, ref, alt)` key; live-bcftools differential test.
PAIRED WITH (somatic)	Somatic set byte-identical to bcftools `isec` private-A; germline to the shared partition.
SPLIT (multi-allelic)	Record-set identical to bcftools `norm -m -`, with per-allele AF apportioned and indels left-trimmed.
ANNOTATE (HGVS)	EGFR L858R → p.Leu858Arg / `c.2573T>G`, derived from the genetic code and verified against the real chr7 reference (forward strand).
parallel CNV	Byte-identical to the serial caller across 2–8 shards, with a positive control: one amplification spanning a shard boundary, emitted as a single call.

Honest scope limits

Stated plainly, not buried:

A common-80% engine, not the full bcftools/BCF surface.
CNV amplification sensitivity is unvalidated on real tumors. Correctness is proven (no false calls on flat diploid); a sensitivity number awaits an amplified-tumor truth set.
HGVS is verified on the forward strand (EGFR) only. Reverse-strand output is implemented but not yet verified end-to-end against a reference.
BAQ (base alignment quality) is not implemented — this accounts for the residual precision-margin gap vs bcftools.
INTO bam / cram are unsupported sinks; CRAM input is planned.

Install

spliceql and cnvlens are git submodules, so build from source with a recursive clone:

git clone --recursive https://github.com/Pogo-Bash/codonsplice
cd codonsplice
# if you forgot --recursive:
git submodule update --init --recursive

cargo build --release            # binary at target/release/splice
# …or install it onto your PATH:
cargo install --path crates/splice-cli

Prebuilt-binary installers (a curl | sh script, an @codonsplice/cli npm package, and a Windows PowerShell one-liner) pull the matching platform binary when published; see the project's releases page. splice update self-updates; splice uninstall removes it.

The language

A query is a FROM clause followed by any of the optional clauses below, in any order (FROM must be first):

Clause	Purpose	Example
`FROM <fmt> <path>`	the input source (required)	`FROM bam "x.bam"`
`SELECT <expr> [AS name], …`	project columns (omit for whole records)	`SELECT chr, pos, depth`
`WHERE <expr>`	per-record predicate	`WHERE chr = "7" AND depth > 30`
`CALL <op>`	the genomic operation to run	`CALL variants`
`WITH <key> = <val>, …`	tune the `CALL`	`WITH min_af = 0.05`
`ORDER BY <expr> [ASC\|DESC], …`	sort the results	`ORDER BY depth DESC`
`LIMIT <n>`	cap the row count	`LIMIT 100`
`INTO <fmt> <path>`	write results to a file	`INTO vcf "out.vcf"`
`ISEC <fmt> <path> [MODE …]`	two-input VCF set operation	`FROM vcf "a" ISEC vcf "b" MODE shared`
`PAIRED WITH <fmt> <path> [MODE …]`	tumor/normal somatic pairing	`FROM vcf "t" PAIRED WITH vcf "n"`
`SPLIT`	normalize multi-allelic records	`FROM vcf "x" SPLIT`
`ANNOTATE WITH <key> = <path>, …`	join local gene / ClinVar / HGVS	`ANNOTATE WITH genes = "g.gff3"`

Sources & sinks

Format	`FROM` (input)	`INTO` (output)
`bam`	yes (with `.bai` region seek)	—
`vcf`	yes	yes
`bed`	yes	yes
`fasta`	yes	yes (JSON array, legacy)
`json`	—	yes (NDJSON, one object per record)
`tsv`	—	yes (header row + tab-separated values)
`cram`	planned	—

FROM bam with a chr/pos range in WHERE is recognized at compile time and turned into a BAI-indexed region seek instead of a full scan.

Operations (`CALL`) and their `WITH` parameters

Operation	Parameters
`variants`	`min_depth`, `min_base_quality`, `min_mapping_quality`, `min_variant_reads`, `min_allele_freq` (alias `min_af`), `min_strand_bias`, `reference`
`cnv` / `coverage`	`window_size`, `amp_threshold`, `del_threshold`, `min_windows`, `segmentation_method`
`reads`	(none)
`header`	(none)

An unknown parameter is a compile error with a "did you mean" hint (Levenshtein + shared-token ranking over the known names).

Functions

Functions are implemented and run at execution time (they are no longer no-ops). Usable in WHERE / SELECT / ORDER BY; unknown names, wrong arity, and string-arg type mismatches are caught at splice check.

scalar / math — abs, round(x[, n]), floor, ceil, sqrt, pow(b, e), log(x[, base]), min(…), max(…), coalesce(…)
string — len, upper, lower, concat(…), contains(s, sub), starts_with / ends_with, substr(s, start[, len])
genomic — gc(seq), revcomp(seq), translate(seq[, frame]) (NCBI table 1, * = stop), codon_at(seq, i)

FROM bam "tumor.bam"
WHERE chr = "7" AND abs(af - 0.5) < 0.1          -- functions in WHERE
CALL variants
SELECT chrom, pos, ref, alt,
       gc(ref) AS gc,                            -- genomic fns as columns
       revcomp(alt) AS alt_rc

Fields (usable in `WHERE` / `SELECT` / `ORDER BY`)

variants: chr/chrom, pos, ref, alt, qual, depth, ref_count, alt_count, af/allele_freq, strand_bias, kind, filter, id
reads: chr/chrom, pos, mapq, flag, depth, strand, is_reverse, is_duplicate, is_secondary
coverage windows: chrom, start, end, coverage, normalized
annotation (added by ANNOTATE WITH): gene, transcript, exon, exon_id, region, consequence, aa_change, hgvs_c, clinvar_significance, clinvar_oncogenic, clinvar_id, rsid

Somatic, set operations & annotation

PAIRED WITH — tumor / normal somatic

FROM vcf "tumor" PAIRED WITH vcf "normal" keeps variants present in the tumor but not the normal (default MODE somatic); MODE germline keeps the shared set. The match key is the exact (chrom, pos, ref, alt) tuple — the same engine as ISEC.

FROM vcf "tumor.vcf.gz" PAIRED WITH vcf "normal.vcf.gz" MODE somatic
INTO vcf "somatic.vcf"

ISEC — VCF set operations

FROM vcf "a" ISEC vcf "b" MODE … computes a two-input set operation with bcftools isec semantics: private_a / private_b (records unique to one input), shared / shared_b (intersection, taking A's or B's record), or union.

FROM vcf "a.vcf.gz" ISEC vcf "b.vcf.gz" MODE private_a
INTO vcf "only_in_a.vcf"

SPLIT — multi-allelic normalization

SPLIT decomposes multi-allelic records (comma-separated ALTs) into one biallelic record per ALT, with per-allele AF apportioned and indels left-trimmed — the semantics of bcftools norm -m -.

FROM vcf "multiallelic.vcf" SPLIT
CALL variants
INTO vcf "biallelic.vcf"

ANNOTATE WITH — gene, ClinVar, HGVS (all local files)

ANNOTATE WITH genes = "…", clinvar = "…" joins each variant against local annotation databases — a GFF3 gene model and a ClinVar VCF. Every source is a local file; there are no live API calls, so annotation works offline and in the browser (privacy, and a requirement for WASM).

FROM vcf "egfr.vcf"
CALL variants WITH reference = "chr7.fa"
ANNOTATE WITH genes = "refGene.gff3", clinvar = "clinvar.vcf.gz"
SELECT chrom, pos, gene, exon, aa_change, hgvs_c, clinvar_significance
INTO vcf "annotated.vcf"

Note. ANNOTATE WITH accepts only genes and clinvar. The reference for HGVS is supplied on the CALL clause — CALL variants WITH reference = "…" — and the annotator picks it up from there.

Looked-up vs. computed. ClinVar significance is a lookup — a variant gets a clinical interpretation only if it is already in the ClinVar file. HGVS is the opposite: aa_change (e.g. p.Leu858Arg) and hgvs_c (e.g. c.2573T>G) are derived from the genetic code and the reference codon, so they are produced even for novel variants never seen before.

Referencing — `WITH reference`, and why it matters

Pass a reference FASTA with CALL variants WITH reference = "chr7.fa". It is what makes REF the actual reference base at each position. Without it, REF is inferred as the pileup-majority base — a coin-flip at balanced heterozygous sites, and invisible for homozygous variants (where nearly every read differs from the reference). A reference is therefore required to call indels and homozygous variants at all, for valid VCF, for truth-set concordance, for indel normalization, and for HGVS codon translation. FASTA contig names must match the input (e.g. >7 7).

`.spq` scripts

A .spq file is a reusable, parameterized query with a typed CLI interface declared in -- directives:

#!/usr/bin/env splice
-- @name: egfr-variant-caller
-- @input: bam required "Input BAM file"
-- @input: min_af optional float 0.05 "Minimum allele frequency"
-- @output: vcf "Variant calls"

FROM bam $bam
WHERE chr = "7" AND pos >= 55000000 AND pos <= 55300000 AND depth > 30
CALL variants
WITH min_af = $min_af, min_depth = 10
INTO vcf $output

$name template variables bind from --flag value arguments at run time:

splice new caller                                 # scaffold caller.spq
splice run caller.spq --bam tumor.bam --output out.vcf --min-af 0.03
splice build caller.spq --release                 # → self-contained ./caller binary
./caller --bam tumor.bam --output out.vcf         # same flags as `run`

splice build produces a ~22 MB self-contained native binary (or --wasm for a .wasm module). Flags: -o <name>, --release, --target <triple>, --wasm.

The CLI

splice                          launch the interactive TUI
splice query   "FROM bam …"     compile + run a one-liner
splice compile "FROM bam …"     compile + print disassembled bytecode
splice check   "FROM bam …"     parse + type-check only, no execution
splice new     <name>           scaffold <name>.spq
splice run     <file.spq> …     run a script, binding $vars from --flag value
splice build   <file.spq> …     compile a script to a native binary or .wasm
splice create  [framework] …    scaffold a web app wired to the WASM engine
splice update | uninstall       self-update / remove the binary

Scripts run via splice run <file.spq> (or one-liners via splice query "…"); there is no inline -e flag.

splice compile shows exactly what the VM runs — pipeline opcodes inline, with per-record sub-programs (the WHERE predicate, SELECT items, ORDER BY keys) appended after HALT.

The TUI

Launching splice with no subcommand opens a three-pane educational editor (query on the left; bytecode / results / errors on the right).

Key	Action
`Ctrl+Enter` / `F5`	compile + run the current query
`Ctrl+D`	disassemble bytecode
`Ctrl+A`	pretty-print the parsed AST
`Tab`	switch focus between editor and output
`F1`	toggle the keybindings help
`Ctrl+Q`	quit

Browser / npm

CodonSplice compiles to WebAssembly and runs entirely client-side — no server, no genomic data leaving the browser. The fastest start is the scaffolder:

splice create                  # interactive menu — react / vue / svelte / astro
splice create react my-app     # …or non-interactively

It generates a Vite/Astro app pre-wired to @codonsplice/wasm with a live SpliceQL playground (type-checks and compiles to bytecode as you type) plus a BAM upload to run a real query. To wire it up by hand, the ergonomic helpers live at @codonsplice/wasm/helpers:

import { execute, compile, check } from "@codonsplice/wasm/helpers";

const result = await execute({
  query: 'FROM bam "sample.bam" WHERE chr = "7" CALL variants',
  files: { "sample.bam": bamBytes },   // name → File | ArrayBuffer | Uint8Array
});

Framework wrappers add idiomatic state (useSpliceQL for react/vue, createSpliceQL for svelte) and re-export the core tooling, so an app imports everything from one package — @codonsplice/{react,vue,svelte,astro}. The worker needs cross-origin isolation (Cross-Origin-Opener-Policy: same-origin, Cross-Origin-Embedder-Policy: require-corp). See docs/NPM_PACKAGE.md.

A live, in-browser playground for all of the above is at swapdoesbioandis-a.dev/splice.

Workspace layout

codonsplice/
├── crates/
│   ├── spliceql/           the language: lexer + parser + AST  (submodule, read-only path dep)
│   ├── codonsplice-core/   compiler + bytecode + VM + execution + annotation + HGVS + materialization
│   ├── splice-cli/         the `splice` binary: CLI + ratatui TUI + .spq + installer
│   ├── codonsplice-wasm/   wasm32 bindings (wasm-bindgen cdylib)
│   └── spliceql-grammar/   TextMate grammar + Linguist assets + VS Code manifest
├── cnvlens/                cnvlens-core: BAM/VCF readers, pileup, variant/coverage/CNV callers (submodule)
├── pkg/                    built WASM + npm packages (@codonsplice/*)
├── scripts/                install.sh, build-wasm.sh, build-cli-packages.sh
├── templates/              `splice build` Cargo project template
└── docs/                   per-phase API + design docs

spliceql and cnvlens are git submodules — clone with git clone --recursive, or run git submodule update --init --recursive.

Build & test

cargo test  --workspace          # compiler / VM / disassembler / execution / parity tests
cargo clippy --workspace
cargo run   -p splice-cli        # launch the TUI
bash scripts/build-wasm.sh       # build the WASM npm package

How it runs (engine internals)

codonsplice-core is three layers — see docs/PHASE4_API.md and docs/PHASE5_API.md for the full as-built API.

Compile — AST → Program { consts, code, debug, region }. extract_region statically lifts a chr/pos WHERE into a Region for BAI seeking.
Execute (Vm::run) — a stack machine walks the bytecode: OPEN_SOURCE builds a Dataset, SCAN wraps a Cursor, FILTER/SET_PARAM/CALL_* configure it, and materialize applies the WHERE predicate, SELECT projection, ORDER BY sort, and LIMIT to produce the record stream.
Serialize (WRITE_INTO) — records → VCF / BED / JSON / TSV bytes via the Io trait (filesystem natively; an in-memory map in WASM).

A Program serializes to a compact .spq.bc (Program::to_bytes / from_bytes) — the format compiled binaries embed.

Editor support

crates/spliceql-grammar ships a TextMate grammar (scope source.spq), a VS Code manifest, and GitHub Linguist assets so .spq files highlight on GitHub and in editors.

License

MIT.