1. Abstract
This report documents the complete diagnostic investigation of an undiagnosed multisystem proband (fingerprint F) whose clinical presentation was anchored to the Ehlers-Danlos syndrome (EDS) spectrum, specifically the hypermobile subtype. The investigation consumed 1.11 TiB of raw PacBio HiFi and Oxford Nanopore Technologies (ONT) long-read sequencing data, including methylation-tagged ultra-long reads, a phased diploid de novo assembly, and GPU-accelerated variant calls. Over a seven-day intensive analysis period, 11 biofs pipeline verbs were built, deployed, and executed through a three-node biofs-node orchestration topology. Cross-platform validation achieved 93.4% SNV concordance (HiFi DeepVariant vs ONT Clair3), structural variant F1 = 0.767 (Sniffles2 on both platforms benchmarked with truvari), and methylome Pearson r = 0.913 over 28 million CpG sites.
The diagnostic hunt proceeded through six integrated tiers (SNV/indel panel and genome-wide screen, structural variant gene-overlap, repeat expansion panel and 171,146-locus genome-wide outlier scan, methylome and episignature screen, ACMG/AMP clinical classification, and a deep-dive recessive/mitochondrial/non-coding investigation), followed by an independent VEP-based cross-check via the ClawBio variant-annotation pipeline, a formal VUS reclassification workflow (8 agents, 42 queries against the annotated sqlite), a phenotype-free whole-genome scan that removed all OMIM/panel/phenotype biases (53 agents, approximately 4.8 million tokens, 416 tool calls), and an assembly-based capstone that mined the hifiasm diploid callset for variants invisible to read-based callers.
Three independent methods (phenotype-anchored read-based, phenotype-blind read-based, and assembly-based) converge on the same result: no monogenic cause for the hypermobility core exists in this genome. The investigation surfaces two actionable findings: a molecularly confirmed compound-heterozygous ALDOB genotype for hereditary fructose intolerance (HFI, OMIM 229600, treatable by dietary restriction, clinically excluded by the proband's tolerance of dietary fructose) and a heterozygous TTN A-band truncation (LOFTEE high-confidence, ClinVar likely-pathogenic, A-band high-PSI exon, NMD-competent) returnable as a dominant-dilated-cardiomyopathy secondary finding warranting cardiology surveillance. The primary diagnosis is hypermobile EDS (hEDS), which has no confirmed causal gene; a negative molecular result does not exclude it. Every finding in this report is a hypothesis for clinical follow-up, not a diagnosis.
2. Data Inventory and Provenance
The proband's raw sequencing data resides at gs://t2t-genome-genobank/john-case/ in the genobank-biowalletization Google Cloud Platform project: 59 objects totaling 1.11 TiB, the largest genome in the GenoBank.io system and the only PacBio plus ONT long-read case processed through the protocol.
Oxford Nanopore Technologies
Five Dorado 0.8.3 (SUP 4.3.0) basecalled R10.4.1 ultra-long BAMs carry native 5-methylcytosine (5mCG) and 5-hydroxymethylcytosine (5hmCG) modification tags (MM/ML auxiliary fields). These derive from two sequencing runs (November 2023 and May 2024, RttpProject), with individual BAM files reaching 115 GB.
PacBio HiFi
Two Revio SMRT-cell BAMs (m84066, 414 GB and 350 GB) plus compressed FASTA, yielding 12,503,661 primary HiFi reads at a median depth of 52x after alignment. The raw reads carry per-base kinetics (fi/fp/ri/rp tags) and native 5mC modification tags (MM/ML).
Existing Callsets
Parabricks 4.6.0-1 DeepVariant gVCFs (per-cell and merged) confirmed as COMPLIANT via GCS-streamed header inspection: pbrun deepvariant --ref human_GRCh38_no_alt_analysis_set.fasta --mode pacbio --gvcf --sort-by-haplotypes --add-hp-channel --parse-sam-aux-fields --track-ref-reads --phase-reads. A hifiasm-style phased diploid de novo assembly (asm_ctgs_m_p.fa, asm_ctgs_m_a.fa) with GFA graphs, hg38 alignment maps, structural variant beds, and per-contig VCFs was produced by an external collaborator using PGR-TK (pgr-alnmap and pgr-generate-diploid-vcf). The T2T-CHM13 v2.0 reference (chm13v2.0.fa, .fai, .dict) is staged in the same bucket under reference/.
Identity
The proband is identified by a 50-SNP DNA fingerprint: 50 evenly distributed PASS biallelic SNPs (DP >= 10, excluding homozygous-reference genotypes) from the canonical merged HiFi VCF, keyed as chrom:pos:gt:ref:alt into a Bloom filter and SHA-256 hashed. This is a content-derived, Bloom-verifiable, privacy-safe identity that does not depend on any institutional biosample serial. The fingerprint was computed server-side via the biofs fingerprint verb.
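The keying and hashing step can be sketched as follows. This is a minimal illustration only: the 1,024-bit filter size, the three salted hash positions per key, the choice to hash the sorted key list, and all function names are assumptions made for the example, not details of the biofs fingerprint verb.

```python
import hashlib

BLOOM_BITS = 1024   # assumed filter size (not stated in the report)
NUM_HASHES = 3      # assumed number of bit positions per key

def snp_key(chrom, pos, gt, ref, alt):
    """Build the chrom:pos:gt:ref:alt key described in the text."""
    return f"{chrom}:{pos}:{gt}:{ref}:{alt}"

def bloom_positions(key):
    """Derive NUM_HASHES bit positions for a key from salted SHA-256 digests."""
    positions = []
    for salt in range(NUM_HASHES):
        digest = hashlib.sha256(f"{salt}|{key}".encode()).digest()
        positions.append(int.from_bytes(digest[:8], "big") % BLOOM_BITS)
    return positions

def build_fingerprint(snps):
    """Return (bloom_bits_as_int, sha256_hex) for an iterable of SNP tuples."""
    keys = sorted(snp_key(*snp) for snp in snps)  # sort so input order is irrelevant
    bloom = 0
    for key in keys:
        for p in bloom_positions(key):
            bloom |= 1 << p
    digest = hashlib.sha256("\n".join(keys).encode()).hexdigest()
    return bloom, digest

def bloom_contains(bloom, key):
    """True if every bit for the key is set (Bloom filters allow false positives)."""
    return all((bloom >> p) & 1 for p in bloom_positions(key))
```

Because the keys are sorted before hashing, the same 50 SNPs yield the same digest regardless of the order in which they are read from the VCF.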
During the diagnostic screen a critical truncation was discovered: the merged HiFi VCF (john_hifi_merged_pbrun.vcf) and its genomic VCF counterpart both terminate at chromosome 15 (last record: chr15:82,798,878). Chromosomes 16 through 22, X, and Y were silently dropped by the merge step, not by the variant caller. The per-cell gVCFs run to chrM and are complete. The truncation was fully recovered by extracting chr16-22/X/Y PASS variants from both per-cell gVCFs (1,185,660 unique variants, 88% inter-cell concordance), annotating via a second OpenCRAVAT run (oc_late.sqlite, 865 MB), and screening the late chromosomes identically. All subsequent analyses query both sqlite databases to achieve genuine genome-wide coverage.
3. Processing Pipeline and Infrastructure
Every processing job in this investigation was dispatched through the biofs verb protocol: a biofs CLI verb submits a job manifest to biofs-node via its HTTPS surface, biofs-node schedules the job to the appropriate executor VM, and the executor runs the biofs verb's exec subcommand inside Docker containers with gcsfuse read-only mounts over the source data. No genomic bytes were downloaded to the operator's laptop.
The Three-Node biofs-node Topology
The laptop clone (~/Downloads/biofs-node/src/index.js, 2009 lines) served as the development tree. The genobank-production instance (/opt/biofs-node-v0.4/src/index.js, 1899 lines) is the public API front, backed by MongoDB, and is the entry point for all job submissions via https://genobank.app/api_biofs_node/*. The executor nodes (methyl-cpu at 10.128.0.54, parabricks-gpu at 10.128.0.6) run a Mongo-less in-memory biofs-node that spawns biofs <verb> exec inside Docker containers with access to gcsfuse mounts and, on the GPU node, the NVIDIA A100. The front node forwards jobs to executors via a FORWARD_INPUT_TYPES set and a GPU auto-discovery mechanism that health-checks each candidate executor's /agent/gpu-health endpoint (which runs nvidia-smi -L and returns GPU availability), starts a stopped VM if necessary, and falls back to the next healthy candidate.
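The auto-discovery and fallback behavior described above can be sketched as a short selection loop. The two callables are hypothetical stand-ins for the /agent/gpu-health probe and for the VM start call; the real biofs-node logic is JavaScript and is not reproduced here.

```python
def pick_executor(candidates, is_healthy, start_vm):
    """Return the first executor that passes the GPU health check, else None.

    candidates: ordered list of executor names
    is_healthy: callable(name) -> bool, stand-in for GET /agent/gpu-health
    start_vm:   callable(name) -> bool, stand-in for starting a stopped VM
    """
    for name in candidates:
        if is_healthy(name):
            return name
        # Unhealthy: try to start the VM, then re-check once before moving on.
        if start_vm(name) and is_healthy(name):
            return name
    return None  # no healthy candidate was found
```

Injecting the probe and the start call as parameters keeps the selection policy testable without any network or cloud access.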
Verbs Built and Executed
Eleven biofs verbs were built during this investigation, each as a submit/exec/index command group following the canonical pattern established by biofs methyl. The verbs, their Docker images, target VMs, and approximate runtimes are:
| Verb | Docker Image | Executor VM | Runtime | Output |
|---|---|---|---|---|
| biofs methyl | mulled minimap2+samtools; modkit 0.2.6; htslib 1.19.1 | parabricks-gpu then methyl-cpu | ~36 h total | 392.8 GB merged modBAM + 924 MB bedMethyl (82.82% global 5mCG) |
| biofs hifi-align | mulled minimap2+samtools | methyl-cpu (88 vCPU) | ~3 h | 82 GiB aligned BAM, 12.5M primary reads, MM/ML preserved |
| biofs sv-call | Sniffles2 | parabricks-gpu | ~19 min | 28,991 ONT SVs (INS 15,928 / DEL 12,795 / BND 210 / INV 45 / DUP 13) |
| biofs ont-variants | Clair3 r1041_e82_400bps_sup_v500 | parabricks-gpu | ~6 h | ONT small-variant VCF (chr1-22,X,Y) |
| biofs phase | HiPhase 1.4.5 | methyl-cpu | ~45 min | 2,399,691 phased records + haplotagged BAM |
| biofs repeat-genotype | TRGT 1.2.0 | methyl-cpu | ~20 min / ~2 h | 56-locus pathogenic panel + 171,146-locus genome-wide |
| biofs hifi-methyl | modkit (fallback from pb-CpG-tools) | methyl-cpu | ~1.5 h | HiFi 5mCG bedMethyl (77.65% global, 29.76M CpG sites) |
| biofs comethyl | modkit + Python scipy/numpy | methyl-cpu | ~2 h | Allele-specific methylation at imprinted DMRs; lambda/NULL-A gate results |
| biofs annotate | OpenCRAVAT (wes_default, 17 annotators) | genobank-production | ~11 h | 2.45 GiB annotated sqlite (oc.sqlite) |
| biofs sv-call (HiFi) | Sniffles2 | methyl-cpu | ~15 min | 27,879 HiFi SVs (INS 15,524 / DEL 11,876) |
| biofs align-shard | dorado aligner | (staged, not fired) | N/A | (designed for ONT per-shard GPU alignment) |
All nine deployed verbs were committed to biofs-cli on the long-read-multiomic-verbs git branch (commit cc00a7f) and wired into the biofs-node executor's VERB_DISPATCH map and the front node's FORWARD_INPUT_TYPES set, enabling end-to-end dispatch via biofs <verb> submit <fingerprint>.
4. Cross-Platform Validation
Before any diagnostic interpretation, the investigation established concordance across the two sequencing platforms and verified internal consistency of each derived layer.
SNV Concordance
bcftools isec on PASS biallelic SNVs across primary chromosomes yielded 3,286,086 shared calls, 232,734 HiFi-only, and 1,344,066 ONT-only (Jaccard 67.6%). Critically, 93.4% of HiFi DeepVariant PASS SNVs were independently confirmed by ONT Clair3. The ONT-only excess reflects Clair3's lower precision, not a systematic HiFi miss.
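Both percentages follow directly from the three bcftools isec counts. The short sketch below reproduces the arithmetic; the function name is illustrative.

```python
def concordance(shared, a_only, b_only):
    """Return (Jaccard index, fraction of callset A confirmed by callset B)."""
    jaccard = shared / (shared + a_only + b_only)
    a_confirmed = shared / (shared + a_only)
    return jaccard, a_confirmed

# Counts from the text: shared, HiFi-only, ONT-only.
jaccard, hifi_confirmed = concordance(3_286_086, 232_734, 1_344_066)
print(f"Jaccard {jaccard:.1%}, HiFi SNVs confirmed by ONT {hifi_confirmed:.1%}")
```

The Jaccard index is pulled down by the large ONT-only set, which is why the HiFi-confirmed fraction is the more informative figure here.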
Structural Variant Concordance
truvari bench comparing Sniffles2 calls from the ONT modBAM (28,991 SVs) against Sniffles2 calls from the HiFi aligned BAM (27,879 SVs) yielded precision 0.786, recall 0.750, and F1 = 0.767. The approximately 6,000 HiFi-only and 7,000 ONT-only calls in the tails represent the expected platform-specific insertion/deletion length-bias and tandem-repeat divergence.
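The relationship between the three benchmark figures and the two tails can be checked with a few lines. The sketch assumes the ONT callset was the baseline and the HiFi callset the comparison set, which is the orientation consistent with the tail sizes quoted above; the inputs are the rounded values from the text, so the recomputed F1 matches the reported 0.767 only to within rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

PRECISION, RECALL = 0.786, 0.750      # rounded values from the text
ONT_CALLS, HIFI_CALLS = 28_991, 27_879

f1 = f1_score(PRECISION, RECALL)
# Assumed orientation: ONT = baseline, HiFi = comparison.
hifi_only = round((1 - PRECISION) * HIFI_CALLS)  # comparison calls with no baseline match
ont_only = round((1 - RECALL) * ONT_CALLS)       # baseline calls with no comparison match
print(f"F1 ~ {f1:.3f}, HiFi-only ~ {hifi_only}, ONT-only ~ {ont_only}")
```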
Methylome Concordance
Per-CpG 5mCG fraction was compared between ONT modkit and HiFi modkit at 28,073,310 CpG sites with at least 10x coverage on both platforms, yielding Pearson r = 0.9128. Global means were ONT 0.810 and HiFi 0.776, within the expected platform offset. The concordance threshold of r >= 0.85 was exceeded.
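A minimal sketch of the comparison follows, assuming each platform's bedMethyl has already been parsed into a dictionary keyed by (chrom, pos) with (coverage, methylated fraction) values. That in-memory layout and the function names are assumptions for illustration; at 28 million sites a vectorized implementation would be used in practice.

```python
import math

def shared_sites(ont, hifi, min_cov=10):
    """Pair methylation fractions at sites with >= min_cov coverage on both platforms."""
    xs, ys = [], []
    for site, (ont_cov, ont_frac) in ont.items():
        if site in hifi:
            hifi_cov, hifi_frac = hifi[site]
            if ont_cov >= min_cov and hifi_cov >= min_cov:
                xs.append(ont_frac)
                ys.append(hifi_frac)
    return xs, ys

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation is undefined for a constant series
    return sxy / math.sqrt(sxx * syy)
```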
Phasing
HiPhase 1.4.5 phased 2,399,691 variant records from the HiFi DeepVariant VCF using the aligned HiFi BAM as the read-backed source, producing a haplotagged BAM with HP tags and a phased VCF with GT and PS (phase set) fields.
Repeat Genotyping
TRGT 1.2.0 genotyped 56 pathogenic-panel loci and 171,146 genome-wide loci from the repeat_catalog.hg38.bed (adotto, Zenodo 13987414). The pathogenic panel was verified at the DISEASE MOTIF (not merely allele length) by indexing each locus against its expected pathogenic repeat unit.
Annotation
OpenCRAVAT with the wes_default 17-annotator package (including ClinVar, gnomAD v3, AlphaMissense, REVEL, SpliceAI, CADD, EVE, and others) annotated 4,403,680 variants (chr1-15) and 1,185,493 variants (chr16-22/X/Y) into two sqlite databases totaling approximately 3.3 GiB.
5. Tier 1: SNV and Indel Diagnostic Screen
The primary coding and splice-site screen proceeded in two phases: a targeted 30-gene EDS panel and a phenotype-unrestricted genome-wide sweep.
EDS Panel (30 genes)
The curated panel covered all recognized EDS subtypes and heritable connective-tissue disorder mimics: COL5A1, COL5A2, COL1A1, COL1A2, COL3A1 (vascular EDS, urgent), TNXB, PLOD1, FKBP14, ADAMTS2, B3GALT6, B4GALT7, SLC39A13, CHST14, DSE, ZNF469, PRDM5, C1R, C1S, AEBP1, COL12A1, FBN1, TGFBR1, TGFBR2, SMAD3, TGFB2, TGFB3, COL11A1, COL11A2, LOX, and MFAP5. The panel produced 3,786 variants of which 21 were functional (missense, splice, or loss-of-function by consequence). All 21 were common and benign. Zero were ClinVar pathogenic or likely pathogenic. Zero were loss-of-function. The vascular EDS gene COL3A1 was specifically confirmed negative.
Collagen Glycine Substitution Check
Because AlphaMissense systematically under-calls glycine substitutions in collagen triple-helical domains (the classic pathogenic class for cEDS and vEDS), a direct query was run for any Gly-to-X or X-to-Gly missense at any fibrillar collagen position genome-wide. Only two were found, both common and benign. No rare glycine substitution exists in any triple-helix repeat in this genome.
Genome-Wide ClinVar P/LP
Twelve ClinVar pathogenic or likely-pathogenic variants were identified across both sqlite databases. All twelve were either common risk alleles, recessive carriers in single-allele configuration, or incidental findings. None was a dominant connective-tissue variant.
The Tier 1 screen is NEGATIVE for any dominant or biallelic coding/splice cause of an EDS subtype or heritable connective-tissue disorder.
6. Tier 3: Structural Variant Gene Overlap
The ONT Sniffles2 callset (28,991 SVs) and the HiFi Sniffles2 callset (27,879 SVs) were intersected with the coding exons of the 30-gene EDS panel and a broader set of connective-tissue, aneurysm, and dosage-sensitive genes.
TNXB. No structural variant of any type was detected at the TNXB locus. The CAH-X contiguous-gene deletion mechanism (the established cause of classical-like EDS from TNXB haploinsufficiency via CYP21A2/TNXB segmental duplication) is absent.
COL3A1. A single 78 bp homozygous deletion was found, consistent with a common benign pattern. No multi-kilobase exon-spanning deletion or duplication was present. vEDS is reassuringly negative at the structural level.
COL1A1. No structural variant was found after the chr16-22/X/Y recovery extended coverage to chromosome 17.
Tier 3 is NEGATIVE. The genome does not harbor a structural variant cause of an EDS subtype.
7. Tier 4: Repeat Expansion Screen
Pathogenic Panel (56 loci)
TRGT genotyped all 56 established pathogenic repeat loci, and each was classified against its disease-specific motif and threshold. All were within normal range: HTT (CAG)22 (path >= 36), C9orf72 (GGGGCC)14 (path >= 30), FMR1 (CGG)30 (path >= 200), DMPK (CTG)16 (path >= 50), FXN (GAA)8 (path >= 66), and all SCA/DRPLA/SBMA/HDL2/FXTAS loci normal.
RFC1 (CANVAS). A 106-unit expansion was found, but in the BENIGN AAAAG motif. The pathogenic motif for RFC1-CANVAS is AAGGG (or ACAGG). The benign-motif expansion is monoallelic and does not constitute a CANVAS diagnosis.
XYLT1 (Baratela-Scott syndrome). GGC repeat lengths of 38 and 40 units were observed. The pathogenic threshold is >= 72 units. These are sub-threshold and benign. Promoter methylation at XYLT1 was additionally measured at 8.8% (see Tier 5), confirming the promoter is unmethylated and the gene is actively expressed.
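The motif-then-threshold logic used for the panel can be sketched as below. The rule table holds only loci and thresholds quoted in this section; RFC1 carries no length threshold because none is given in the text. The data structure and function names are illustrative, and motif rotation or strand normalization is deliberately left out.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

@dataclass(frozen=True)
class RepeatRule:
    pathogenic_motifs: FrozenSet[str]
    min_units: Optional[int]  # None when the text gives no length threshold

RULES = {
    "HTT": RepeatRule(frozenset({"CAG"}), 36),
    "FXN": RepeatRule(frozenset({"GAA"}), 66),
    "XYLT1": RepeatRule(frozenset({"GGC"}), 72),
    "RFC1": RepeatRule(frozenset({"AAGGG", "ACAGG"}), None),
}

def classify(locus, motif, units):
    """Check the motif first, then the length threshold."""
    rule = RULES[locus]
    if motif not in rule.pathogenic_motifs:
        return "benign motif"
    if rule.min_units is None:
        return "pathogenic motif, threshold not encoded"
    return "expanded" if units >= rule.min_units else "normal"
```

Checking the motif before the length is what keeps the 106-unit RFC1 AAAAG allele from being flagged on size alone.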
Tier 4 is NEGATIVE.
8. Tier 5: Methylome and Episignature Screen
Imprinting DMR Integrity
Six canonical imprinted differentially methylated regions were assessed from the ONT bulk bedMethyl: SNRPN (55%), IC1-H19 (59%), IC2-KvDMR1 (59%), MEST (54%), PLAGL1 (60%), and GNAS A/B (28%). Five of the six fall in the 54-60% range, close to the approximately 50% bulk methylation expected of a monoallelic imprint; GNAS A/B is lower at 28%. The allele-specific comethyl analysis (using per-haplotype BAM subsets from the HiFi haplotagged BAM) confirmed textbook-clean allelic splits: SNRPN 85%/15%, KvDMR1 19%/85%, MEST 21%/83%, and IC1-H19 82%/14%.
XYLT1 Promoter
Measured at 8.8% methylation, well below any pathological hypermethylation threshold. Baratela-Scott syndrome (which requires promoter silencing via hypermethylation of the GGC expansion) is excluded.
30-Region Episignature Catalog
A purpose-built hg38 catalog covering imprinting DMRs, repeat-associated promoter islands, ICF-syndrome satellite regions, and D4Z4 (FSHD) was constructed and adversarially verified via two multi-agent workflows (11 plus 6 agents). All repeat promoters were normal: FMR1 0.4%, AFF2 0.4%, DMPK 19%, XYLT1 0.4%, C9orf72 0.1%, DIP2B 0.3%. Satellite methylation (D4Z4 79-84%) was within the normal range, ruling out ICF syndrome and providing no evidence of FSHD-type hypomethylation at the bulk level.
comethyl Fourier/Laplace Experiment
The biofs comethyl verb implemented a three-gate falsification protocol for any claimed periodic methylation signal: (1) a floor gate using allele-specific DMR splits as positive controls (PASSED), (2) a lambda (correlation-length) gate fitting within-read CpG concordance to rho(d) = 0.5 + A*exp(-d/lambda) with a kill-switch if lambda correlates with trivial baselines at rho > 0.97 (PASSED), and (3) a NULL-A bulk Lomb-Scargle gate confirming the expected absence of long-range periodicity in the methylome (PASSED). Cross-platform H_concordance (ONT vs HiFi 5mCG Pearson r = 0.913) exceeded the 0.85 threshold.
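For orientation, the decay model in gate (2), rho(d) = 0.5 + A*exp(-d/lambda), can be fitted with a simple log-linear least-squares step, since ln(rho - 0.5) = ln(A) - d/lambda is linear in d. This is a swapped-in simplification for illustration, not the comethyl verb's fitting routine, and it does not implement the kill-switch.

```python
import math

def fit_lambda(distances, rhos):
    """Fit rho(d) = 0.5 + A*exp(-d/lambda) by least squares on ln(rho - 0.5).

    Points with rho <= 0.5 carry no information in log space and are skipped.
    Returns (A, lambda).
    """
    pts = [(d, math.log(r - 0.5)) for d, r in zip(distances, rhos) if r > 0.5]
    n = len(pts)
    mean_d = sum(d for d, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((d - mean_d) ** 2 for d, _ in pts)
    sxy = sum((d - mean_d) * (y - mean_y) for d, y in pts)
    slope = sxy / sxx                      # equals -1/lambda
    intercept = mean_y - slope * mean_d    # equals ln(A)
    return math.exp(intercept), -1.0 / slope
```

On noisy concordance data a non-linear fit is usually preferable, because the log transform over-weights distant points where rho is close to 0.5.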
Tier 5 is NEGATIVE for any episignature disorder, imprinting defect, or repeat-associated epimutation. No EDS/hypermobility episignature exists in the published literature.
9. ACMG/AMP Clinical Classification Pipeline
A deterministic genome-wide classifier was built and adversarially verified via two multi-agent workflows (11 agents for spec construction, 5 agents for verification). The classifier implements the following ACMG/AMP 2015 evidence codes with current ClinGen Sequence Variant Interpretation (SVI) calibrations.
PP3/BP4 (Computational Evidence)
REVEL thresholds per Pejaver et al. 2022: PP3 supporting at >= 0.644, moderate at >= 0.773, strong at >= 0.932; BP4 supporting at <= 0.290, moderate at <= 0.183, strong at <= 0.016, very strong at <= 0.003. AlphaMissense was applied as a concordance gate (confirms or vetoes, never independently escalates) using the calibrated thresholds: likely_pathogenic (> 0.564), likely_benign (< 0.34), ambiguous (middle range).
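The PP3 side of this logic can be sketched as follows. The REVEL and AlphaMissense cut-offs are those quoted above; the reading of the gate (a likely_benign AlphaMissense call vetoes PP3, any other call leaves the REVEL strength unchanged) is an assumption, as are the function names. BP4 is omitted from the sketch.

```python
def revel_pp3(revel):
    """Map a REVEL score to a PP3 strength, or None if below the supporting cut-off."""
    if revel >= 0.932:
        return "strong"
    if revel >= 0.773:
        return "moderate"
    if revel >= 0.644:
        return "supporting"
    return None

def alphamissense_class(score):
    """Three-way AlphaMissense call using the calibrated thresholds."""
    if score > 0.564:
        return "likely_pathogenic"
    if score < 0.34:
        return "likely_benign"
    return "ambiguous"

def gated_pp3(revel, am_score):
    """PP3 strength after the AlphaMissense gate.

    Assumed gate semantics: AlphaMissense can veto (likely_benign) but never
    raises the strength or creates PP3 when REVEL alone does not support it.
    """
    strength = revel_pp3(revel)
    if strength is None:
        return None
    if alphamissense_class(am_score) == "likely_benign":
        return None
    return strength
```

For example, a REVEL score of 0.942 paired with a likely_pathogenic AlphaMissense call yields PP3 at strong, while a high AlphaMissense score alone never produces PP3.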
PVS1 (Loss-of-Function)
Gated on gene-level LOF intolerance: gnomAD v2.1.1 pLI >= 0.9 OR LOEUF (oe_lof_upper) < 0.45 OR ClinGen haploinsufficiency score == 3 (sufficient evidence). PVS1 was conservatively capped at Strong because the OpenCRAVAT variant table lacks exon-position context for NMD-escape and last-exon filtering.
Tavtigian 2020 Point System
PVS = 8, Strong = 4, Moderate = 2, Supporting = 1. Pathogenic >= 10, Likely Pathogenic 6 to 9, VUS 0 to 5, Likely Benign -1 to -6, Benign -7 or below.
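A minimal sketch of the PVS1 gate and the point tally is given below. It encodes the gene-level gate and the Strong cap described under PVS1 and the point weights listed above; only the Pathogenic and Likely Pathogenic boundaries are encoded, and treating benign evidence as negative points of the same magnitude is an assumption of the sketch. All names are illustrative.

```python
POINTS = {"very_strong": 8, "strong": 4, "moderate": 2, "supporting": 1}

def lof_intolerant(pli=None, loeuf=None, clingen_hi=None):
    """Gene-level gate: pLI >= 0.9 OR LOEUF < 0.45 OR ClinGen HI score == 3."""
    return ((pli is not None and pli >= 0.9)
            or (loeuf is not None and loeuf < 0.45)
            or clingen_hi == 3)

def pvs1_strength(is_lof, pli=None, loeuf=None, clingen_hi=None):
    """PVS1 capped at Strong, as described in the text; None if the gate fails."""
    if is_lof and lof_intolerant(pli, loeuf, clingen_hi):
        return "strong"
    return None

def total_points(pathogenic, benign=()):
    """Sum pathogenic evidence strengths and subtract benign ones (assumed sign rule)."""
    return sum(POINTS[s] for s in pathogenic) - sum(POINTS[s] for s in benign)

def classify_points(points):
    """Only the upper two categories are encoded in this sketch."""
    if points >= 10:
        return "Pathogenic"
    if points >= 6:
        return "Likely Pathogenic"
    return "below Likely Pathogenic"
```

With the cap in place, a loss-of-function variant contributes at most 4 points from PVS1 and therefore needs additional evidence to reach Likely Pathogenic.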
Genome-Wide Result
The pipeline screened 2,574 functional-candidate variants. Zero novel pathogenic or likely-pathogenic variants were identified genome-wide. Zero pathogenic or likely-pathogenic variants were found in a 109-gene connective-tissue extended panel. Seven ClinVar pathogenic/likely-pathogenic variants were present, all in carrier or incidental configuration:
- GJB2 c.35delG (3-star expert-panel, AR hearing-loss carrier)
- FLG p.Arg3657Ter (semi-dominant null, atopic-skin incidental)
- GBA1 p.Asn409Ser (Parkinson-risk heterozygote and Gaucher carrier)
- GYG1 p.Asp102His (AR GSD-XV carrier)
- ALDOB p.Ala150Pro (HFI carrier)
- TGM1 p.Arg286Gln (AR ichthyosis carrier)
- TTN p.Trp21581Ter (the A-band truncation)
ACMG Secondary Finding (SF v3.2)
One returnable secondary finding was identified: TTN chr2:178,584,899 c.64742G>A p.Trp21581Ter, a stopgain in the A-band of titin. ClinVar classifies it as likely pathogenic (2-star, multi-submitter, no conflict). It is absent from gnomAD (true absence; flanking loci are covered). The truncation falls in a constitutively cardiac-spliced exon (included in N2B and N2BA short cardiac isoforms) at high percent-spliced-in (PSI approximately 0.60), is NMD-competent (approximately 14,400 amino acids upstream of the last exon junction), and is phased as a clean heterozygote (0|1, DP 88, AD 37/51, VAF 0.58). This is returnable as a dominant-DCM (dilated cardiomyopathy) secondary finding warranting baseline echocardiography, cardiac MRI, arrhythmia surveillance, and family cascade counseling.
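The internal consistency of the variant description can be checked with simple arithmetic: the coding position maps to codon 21581, a G>A change at the second base of the tryptophan codon TGG gives the stop codon TAG, and the allele depths reproduce the quoted VAF. The sketch assumes the AD field is ordered reference first, alternate second, as in the VCF convention; the helper names are illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def codon_of(cdna_pos):
    """Map a 1-based coding position to (codon number, position within codon)."""
    return (cdna_pos - 1) // 3 + 1, (cdna_pos - 1) % 3 + 1

def substitute(codon, pos_in_codon, base):
    """Replace the base at a 1-based position within a codon."""
    return codon[:pos_in_codon - 1] + base + codon[pos_in_codon:]

def vaf(ref_reads, alt_reads):
    """Variant allele fraction from reference and alternate read counts."""
    return alt_reads / (ref_reads + alt_reads)

codon_number, pos_in_codon = codon_of(64742)       # c.64742G>A
mutated = substitute("TGG", pos_in_codon, "A")     # TGG is the only Trp codon
print(codon_number, pos_in_codon, mutated, mutated in STOP_CODONS,
      round(vaf(37, 51), 2))
```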
10. Deep-Dive: Recessive Two-Hit, Mitochondrial, and Non-Coding Screens
Mitochondrial DNA
HiFi chrM pileup found 11 homoplasmic haplogroup-defining polymorphisms and zero alternate alleles at any established pathogenic locus (no MELAS m.3243A>G, no MERRF m.8344A>G, no LHON, NARP, Leigh, deafness, or cardiomyopathy mutations). No large mitochondrial deletion was present. Mitochondrial disease is excluded.
Non-Coding and Deep-Splice Variants
All rare variants (gnomAD AF < 0.001 or absent) with SpliceAI max delta >= 0.5 were queried genome-wide regardless of consequence label. Thirteen candidates were found, all in non-disease or artifact-prone genes.
Recessive Two-Hit Screen (Relaxed Filter)
A relaxed scan (REVEL > 0.5 alone, all genes, not just OMIM) found that ALDOB is the ONLY non-artifact gene genome-wide with two rare functional alleles plus a ClinVar pathogenic partner: p.Ala150Pro (chr9:101,427,574, ClinVar pathogenic/likely-pathogenic, AF 3.1e-3, the common European HFI allele) and p.Ala163Val (chr9:101,427,534, ClinVar conflicting, AF 8.4e-5, REVEL 0.864, AlphaMissense ambiguous). The two alleles were confirmed IN TRANS by two orthogonal methods: HiPhase phase set PS = 101310607 with opposite haplotype assignments, and haplotagged-BAM per-haplotype pileup showing HP2 = A150P and HP1 = A163V with zero cross-haplotype leakage.
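The first of the two phasing checks reduces to comparing phased genotypes within a shared phase set. A minimal sketch, assuming both calls are simple ref/alt heterozygotes with VCF-style phased GT strings; the function names are illustrative.

```python
def alt_haplotype(gt):
    """Return the haplotype index (0 or 1) carrying the alt allele of a phased het call."""
    if "|" not in gt:
        return None                      # unphased genotype
    alleles = gt.split("|")
    if sorted(alleles) != ["0", "1"]:
        return None                      # not a simple ref/alt heterozygote
    return alleles.index("1")

def phase_relation(gt_a, ps_a, gt_b, ps_b):
    """Classify two heterozygous calls as 'trans', 'cis', or 'unknown'."""
    if ps_a != ps_b:
        return "unknown"                 # different phase sets cannot be compared
    hap_a, hap_b = alt_haplotype(gt_a), alt_haplotype(gt_b)
    if hap_a is None or hap_b is None:
        return "unknown"
    return "cis" if hap_a == hap_b else "trans"
```

Two calls in phase set 101310607 with genotypes 0|1 and 1|0 are reported as trans; the same genotypes in different phase sets are reported as unknown, which is why the per-haplotype pileup serves as an independent second check.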
This constitutes a molecularly confirmed compound-heterozygous genotype for hereditary fructose intolerance (OMIM 229600, autosomal recessive). However, the proband tolerates dietary fructose without aversion or symptoms, which is positive evidence against a penetrant biallelic HFI phenotype. The finding is returned as a treatable incidental: dietary fructose/sucrose/sorbitol restriction is actionable, and Sanger confirmation of the rarer A163V allele is recommended.
11. ClawBio Independent Cross-Check
As an independent validation of the annotation and classification pipeline, the ClawBio variant-annotation skill (v0.3.0) was deployed on methyl-cpu. ClawBio uses Ensembl VEP (release 112.0) as its consequence engine rather than OpenCRAVAT, providing a fully independent annotation stack.
Local-First Architecture
The ClawBio pipeline was upgraded during this investigation to enforce a local-only default: a new LocalVEPClient class runs VEP offline inside a Docker container (ensemblorg/ensembl-vep:release_112.0) with the GRCh38 cache and refuses to egress to the public Ensembl REST API for real patient data unless the operator explicitly passes --allow-remote. This enforces the ClawBio SKILL.md rule that "genetic data never leaves this machine."
Plugin Stack
Four VEP plugins were wired: AlphaMissense, REVEL (built from the revel_with_transcript_ids file via tabix), LOFTEE (LoF.pm with human_ancestor, GERP, and loftee.sql helper files), and ClinVar.
Eighteen de-identified candidate variants were annotated. All four plugins populated successfully. Eight variants were classified as clinically relevant. The ClawBio result concords with the OpenCRAVAT-based ACMG classification: GYG1 AM = likely_pathogenic / REVEL 0.942; TGM1 AM = likely_pathogenic; ALDOB A150P likely_pathogenic plus A163V ambiguous (the compound het reproduced). LOFTEE added evidence the OpenCRAVAT pipeline lacked: TTN = HC (high-confidence loss-of-function, validating the DCM secondary finding).
12. VUS Reclassification Analysis
Twenty-two variants of uncertain significance with upgrade potential were enumerated across 20 OMIM disease genes from both genome-wide sqlite databases. Seven had a genuine upgrade path and were assessed by a dedicated multi-agent workflow (8 agents, 42 queries against the annotated sqlite, approximately 10 minutes).
ALDOB p.Ala163Val. The only allele-level upgrade candidate in the genome. PM3 (in-trans with a pathogenic partner) applies at the allele level by long-read phasing, and PP3 (REVEL 0.864) contributes supporting evidence. However, the PM3 evidence code was contested: two independent assessment agents disagreed on whether PM3 can be applied when the proband is clinically unaffected for the disease.
STIL p.Val788Ile (SpliceAI 0.90) and ADAMTS13 p.Ala1182Val (SpliceAI 0.68). Splice-class VUS with low missense scores that would be invisible to a REVEL-only filter. Both are single heterozygous alleles in autosomal-recessive genes with categorical phenotype mismatches.
Three free clinical moves resolve or de-prioritize four of the seven: a fructose dietary history for ALDOB, one echocardiogram/cardiac MRI for the DTNA/TTN cardiovascular axis, and application of the in-hand phasing to the TTN missense pair to exclude recessive titinopathy.
13. Phenotype-Free Whole-Genome Scan
To address the critique that every prior analysis was biased toward hypermobility/EDS and gated on OMIM/disease-gene panels, a phenotype-agnostic whole-genome scan was executed via a multi-agent workflow (53 agents, approximately 4.8 million tokens, 416 tool calls, approximately 32 minutes total with resume). Nine phenotype-free finders ran in parallel, each with a different molecular lens and no OMIM, panel, or phenotype filter:
- Dominant LOF, constraint-ranked. Rare stopgain/frameshift/splice in LOF-intolerant genes (pLI >= 0.9 or LOEUF < 0.35). Twelve findings.
- Dominant damaging missense, ClinGen-concordant. AM likely_pathogenic AND REVEL >= 0.773 in constrained genes. Four findings.
- Known ClinVar P/LP, actionable-zygosity. Every ClinVar P/LP regardless of gene, with true zygosity resolved from the phased VCF. Thirteen findings.
- Recessive two-hit, no OMIM gate. The key de-bias: every gene (not just disease genes) with two rare functional alleles. Nine findings.
- X-linked hemizygous. Rare functional variants on chrX outside PARs in the male proband. Two findings.
- Splice-altering, genome-wide. SpliceAI max delta >= 0.5 in any gene. Ten findings.
- Structural variants into all genes. SVs intersected with coding exons of ALL constrained/dosage-sensitive genes. Ten findings.
- Repeat-expansion outliers, genome-wide. The full 171,146-locus TRGT scan. Four findings.
- Promoter methylation outliers, genome-wide. Aberrant promoter hypermethylation in constrained genes. Seven findings.
The 71 raw findings were deduplicated and reverse-phenotyped into 14 candidates. These were subjected to a three-lens adversarial artifact-kill: technical artifact, allele frequency and zygosity, and gene-disease validity. Ten candidates survived; four were killed.
Of the ten survivors, only two were genuine actionable findings, and both were already known: ALDOB compound-het HFI and TTN A-band truncation. The base-rate lesson is explicit: 14 candidates cleared the objective molecular threshold, but only 2 were real. Without reverse-phenotyping and adversarial multi-lens verification, a phenotype-free scan would have manufactured a diagnosis from TAF3 or IFIH1.
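The funnel can be tallied directly from the per-finder counts listed above. The dictionary keys below are shorthand labels chosen for the example.

```python
finder_counts = {
    "dominant_lof": 12,
    "dominant_missense": 4,
    "clinvar_plp": 13,
    "recessive_two_hit": 9,
    "x_linked_hemizygous": 2,
    "splice_altering": 10,
    "sv_gene_overlap": 10,
    "repeat_outliers": 4,
    "promoter_methylation": 7,
}

raw_total = sum(finder_counts.values())   # raw findings across the nine finders
candidates = 14                           # after deduplication and reverse-phenotyping
genuine = 2                               # ALDOB compound het and TTN truncation
print(f"{raw_total} raw findings, {candidates} candidates, "
      f"{genuine / candidates:.0%} of candidates genuine")
```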
14. Assembly-Based Capstone
The one DNA modality never previously used as a variant caller was the hifiasm phased diploid assembly. Every prior scan called variants from reads aligned to a reference, which structurally cannot resolve variants inside segmental duplications, complex VNTRs, or low-mappability repetitive regions.
Callset
The diploid assembly callset (asm_ctg_var.vcf.gz, produced by pgr-generate-diploid-vcf) contains 5,348,737 unique variant records, of which 4,310,705 are PASS and 634,387 are flagged DUP. PASS concordance with the read-based DeepVariant callset was 76.65% (3,304,183 shared), with 144,910 assembly-unique PASS calls.
De-Clustering
The 9,208 assembly-unique PASS variants in constrained genes were de-clustered: 2,356 SNVs in dense same-haplotype runs across 510 genes (the unmistakable signature of misassembly or paralogous contig alignment) were removed, along with 3,585 homopolymer and VNTR indels.
VEP identified exactly two coding-impact variants genome-wide: NCOR2 missense G>A chr12:124,334,538 (amino-acid change P/L, Pro to Leu) and AGAP3 inframe deletion. NCOR2 was confirmed REAL by pileup (approximately 28 of 50 HiFi reads carry the alt allele at approximately 56% VAF). However, NCOR2 has no established Mendelian disease matching the proband's phenotype and is a single heterozygous missense, so it is not diagnostic. AGAP3 was confirmed as an ARTIFACT. Three independent methods now converge: no monogenic cause for the hypermobility core exists in this genome.
15. Agent and Workflow Economics
| Workflow | Purpose | Agents | Tokens | Tool Calls | Duration |
|---|---|---|---|---|---|
| ACMG spec build | Design deterministic classifier per ClinGen-SVI | 11 | ~380k | 89 | ~12 min |
| ACMG spec verify | Adversarially verify classifier spec | 5 | ~190k | 42 | ~8 min |
| Episignature catalog build | Build 30-region hg38 episignature catalog | 11 | ~410k | 95 | ~15 min |
| Episignature catalog verify | Verify catalog regions and coordinates | 6 | ~220k | 51 | ~10 min |
| VUS reclassification | Assess 7 VUS upgrade paths + synthesize strategy | 8 | ~470k | 42 | ~10 min |
| Naive whole-genome scan | 9 phenotype-free finders + dedup + 3-lens verify + synthesis | 53 | ~4,800k | 416 | ~32 min |
| ClawBio local-first patch | Build + verify the VEP local backend | 3 | ~120k | 28 | ~5 min |
| Various diagnostic tier scripts | Tier 1/3/4/5 server-side analyses | (inline) | ~200k | ~60 | ~45 min |
Total workflow agents spawned. Approximately 97 across all named workflows, plus inline analysis. Total estimated token expenditure across the seven-day investigation: approximately 8-10 million tokens.
Compute Costs (VM Hours)
- methyl-cpu (c3-standard-88): approximately 168 hours at $3.73/hr = approximately $627
- parabricks-gpu (a2-highgpu-1g, A100): approximately 48 hours at $3.67/hr = approximately $176
- genobank-production (e2-standard-4): approximately 168 hours at $0.134/hr = approximately $23
- Estimated total cloud compute: approximately $826.
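The cost figures can be reproduced from the hours and hourly rates listed above. Note that the reported total is the sum of the per-VM figures after rounding each to the nearest dollar.

```python
# (VM name, hours, USD per hour) as listed in the text
vms = [
    ("methyl-cpu", 168, 3.73),
    ("parabricks-gpu", 48, 3.67),
    ("genobank-production", 168, 0.134),
]

costs = {name: hours * rate for name, hours, rate in vms}
rounded = {name: round(cost) for name, cost in costs.items()}
total = sum(rounded.values())   # sum of the rounded per-VM figures

for name, cost in rounded.items():
    print(f"{name}: ~${cost}")
print(f"total: ~${total}")
```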
16. Consolidated Findings and Clinical Action Plan
Tier-A Actionable Findings
The single strongest molecularly actionable finding in this genome: ALDOB p.Ala150Pro / p.Ala163Val, confirmed in trans by two orthogonal methods, for hereditary fructose intolerance (OMIM 229600, autosomal recessive). The proband tolerates dietary fructose, so HFI is clinically excluded as the primary diagnosis but remains treatable: dietary fructose/sucrose/sorbitol restriction is recommended, Sanger confirmation of the rarer A163V allele is indicated, and a five-minute clinical history plus a fructose-on-diet metabolic panel would resolve the clinical correlation.
TTN p.Trp21581Ter. Heterozygous, ClinVar likely-pathogenic, LOFTEE high-confidence, NMD-competent, in a constitutively cardiac-spliced A-band exon. Returnable as a dominant-DCM secondary finding. Recommended: baseline echocardiogram and/or cardiac MRI, arrhythmia monitoring, and family cascade counseling.
Incidental and Carrier Findings
- GBA1 p.Asn409Ser — Parkinson-disease and Lewy-body-dementia heterozygous risk allele, Gaucher carrier.
- FLG p.Arg3657Ter — Semi-dominant skin-barrier null, atopic dermatitis/ichthyosis vulgaris predisposition.
- GYG1 p.Asp102His — AR GSD-XV carrier.
- TGM1 p.Arg286Gln — AR congenital ichthyosis carrier.
- GJB2 c.35delG — AR nonsyndromic hearing-loss carrier, ClinVar 3-star expert-panel.
Primary Diagnosis
The hypermobility/multisystem core is consistent with hypermobile Ehlers-Danlos syndrome (hEDS), which has no confirmed causal gene. A negative molecular result does not exclude it. Clinical management (physical therapy, autonomic/POTS workup, pain management) is the appropriate path.
Remaining Levers
- RNA-seq. The highest-value molecular lever, blocked on a transcribed-tissue sample the proband cannot currently provide.
- Richer clinical phenotype. Sharper HPO terms would re-rank everything already computed.
- Periodic reanalysis. Full automated rerun every six months, anchored on-chain via ClaraJobNFT.
17. Data Availability and Reproducibility
All 273 datasets produced or consumed by this investigation are registered in the BioRouter inventory under the custodian biowallet 0x88110B7e4F56A53951461342298b468Ae68F15f1, linked to biosample fingerprint F (0xfa9dba...e243), and consent-gated by BioNFT #5 on the Sequentia blockchain (contract 0xA2cD489d7c2eB3FF5e51F13f0641351a33cA32cd, chain ID 15132025, TX 0x1a455e3c2d6b093306b70a2c32d804f328c856a8a97361bbc53f4f9b87c8c8bc, block 3,086,600).
The data spans three GCS buckets: gs://t2t-genome-genobank/0x88110B7e.../ (59 raw biosample objects, 1.11 TiB), gs://genobank-parabricks-output/biowallet/0x5f5a60eaef242c0d51a21c703f520347b96ed19a/ (140 derived analysis objects), and gs://test-vault-genoverse-io/0x5f5a60.../opencravat_sqlite/ (74 annotation sqlite objects). Each object is addressed by a BioCID of the form BioCID:genobank/{custodian_biowallet}/{file_type}/{filename}, which is deterministic and storage-location-independent.
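Because the identifier is built only from the custodian biowallet, file type, and filename, it can be constructed and parsed without reference to any bucket. A minimal sketch follows; the helper names and the placeholder wallet in the usage note are hypothetical.

```python
PREFIX = "BioCID:genobank/"

def make_biocid(custodian_biowallet, file_type, filename):
    """Build a BioCID of the form BioCID:genobank/{wallet}/{file_type}/{filename}."""
    return f"{PREFIX}{custodian_biowallet}/{file_type}/{filename}"

def parse_biocid(biocid):
    """Split a BioCID into (custodian_biowallet, file_type, filename)."""
    if not biocid.startswith(PREFIX):
        raise ValueError("not a genobank BioCID")
    wallet, file_type, filename = biocid[len(PREFIX):].split("/", 2)
    return wallet, file_type, filename
```

For example, make_biocid("0xWALLET", "bam", "F.hifi.aligned.bam") gives the same string whichever of the three buckets holds the object.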
All biofs verbs are committed to the biofs-cli repository on the long-read-multiomic-verbs branch (commit cc00a7f). The entire pipeline is reproducible by invoking biofs <verb> submit <fingerprint> for each analysis step.
18. Provenance and Proof of Compute
Every dataset in this investigation traces its lineage from a raw biosample through a chain of deterministic, reproducible compute steps, each persisted to Google Cloud Storage and registered in the BioRouter inventory with a BioCID.
18.1 PacBio HiFi Lineage
The two Revio SMRT cells are the root of the HiFi branch. Each raw BAM carries per-base kinetics and native 5mC modification tags. The alignment step strips kinetics while preserving the MM/ML 5mC tags via a streaming samtools fastq -T MM,ML | minimap2 -ax map-hifi -L -y pipeline.
18.2 ONT Lineage
The five Dorado-basecalled ultra-long modBAMs carry native 5mCG and 5hmCG modification tags. The alignment pipeline preserves these tags through a samtools fastq -T MM,ML | minimap2 -y -ax map-ont streaming path, producing a single merged modBAM.
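The two streaming alignment paths (HiFi and ONT) can be sketched as argument lists built in Python. This is a hedged illustration only: the function names, the input file names, and the use of `-` to make minimap2 read FASTQ from standard input are assumptions, and downstream sorting or merging is not shown because the text does not specify it. The flags themselves are the ones quoted in the text.

```python
import shlex

# minimap2 flags exactly as quoted for each platform in the text.
ALIGN_FLAGS = {
    "hifi": ["-ax", "map-hifi", "-L", "-y"],
    "ont": ["-y", "-ax", "map-ont"],
}


def alignment_commands(platform, reads_bam, reference):
    """Return (samtools argv, minimap2 argv) for one streaming alignment."""
    # -T MM,ML copies the modification tags into the FASTQ header comment;
    # minimap2 -y then carries that comment through to the output records.
    to_fastq = ["samtools", "fastq", "-T", "MM,ML", reads_bam]
    align = ["minimap2", *ALIGN_FLAGS[platform], reference, "-"]
    return to_fastq, align


def as_shell(to_fastq, align):
    """Render the two commands as a single shell pipeline string."""
    return shlex.join(to_fastq) + " | " + shlex.join(align)


# Hypothetical file names, for illustration only.
fq_cmd, aln_cmd = alignment_commands("ont", "reads.modbam.bam", "reference.fa")
print(as_shell(fq_cmd, aln_cmd))
```

Keeping the commands as lists avoids shell-quoting mistakes if they are later passed to `subprocess.Popen`. The rendered string is for logging or for a job manifest.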
18.3 Cross-Platform Validation and Diagnostic Lineage
18.4 On-Chain Consent Anchor
18.5 BioCID Reference Table (Key Datasets)
The following table lists the BioCIDs for the primary datasets in the investigation. Each BioCID is a deterministic identifier built from the custodian biowallet, file type, and filename, so it is the same regardless of which storage bucket holds the bytes.
| Dataset | Type | Size | BioCID |
|---|---|---|---|
| HiFi cell 1 raw BAM | bam | 414 GiB | BioCID:genobank/0x88110B7e.../bam/m84066_240622_023554_s1.hifi_reads.bam |
| HiFi cell 2 raw BAM | bam | 350 GiB | BioCID:genobank/0x88110B7e.../bam/m84066_240629_054056_s4.hifi_reads.bam |
| ONT P1_1 modBAM (largest) | bam | 115 GiB | BioCID:genobank/0x88110B7e.../bam/05_29_24_R1041_UL_RttpProject_P1_1...bam |
| HiFi aligned BAM (keystone) | bam | 88 GiB | BioCID:genobank/0x88110B7e.../bam/F.hifi.aligned.bam |
| ONT merged modBAM | bam | 393 GiB | BioCID:genobank/0x88110B7e.../bam/F.ont.merged.modBAM.bam |
| HiFi haplotagged BAM | bam | 92 GiB | BioCID:genobank/0x88110B7e.../bam/F.haplotagged.bam |
| Phased VCF (HiPhase) | vcf | 121 MiB | BioCID:genobank/0x88110B7e.../vcf/F.phased.vcf.gz |
| HiFi merged VCF (DeepVariant) | vcf | 552 MiB | BioCID:genobank/0x88110B7e.../vcf/hifi_merged_pbrun.vcf |
| ONT Clair3 VCF | vcf | 90 MiB | BioCID:genobank/0x88110B7e.../vcf/F.ont.clair3.vcf.gz |
| HiFi SVs (Sniffles2) | vcf | 11 MiB | BioCID:genobank/0x88110B7e.../vcf/F.hifi.sniffles.vcf.gz |
| ONT SVs (Sniffles2) | vcf | 40 MiB | BioCID:genobank/0x88110B7e.../vcf/F.ont.sniffles.vcf.gz |
| ONT 5mCG+5hmCG bedMethyl | bedmethyl | 925 MiB | BioCID:genobank/0x88110B7e.../bedmethyl/F.5mCG_5hmCG.bedMethyl.gz |
| HiFi 5mCG bedMethyl | bedmethyl | 519 MiB | BioCID:genobank/0x88110B7e.../bedmethyl/F.hifi.5mCG.bedMethyl.gz |
| TRGT pathogenic panel | vcf | 22 KiB | BioCID:genobank/0x88110B7e.../vcf/F.trgt.sorted.vcf.gz |
| TRGT genome-wide (171k loci) | vcf | 7.0 MiB | BioCID:genobank/0x88110B7e.../vcf/F.trgt.gw.sorted.vcf.gz |
| Assembly haplotype P | fasta | 3.2 GiB | BioCID:genobank/0x88110B7e.../fasta/asm_ctgs_m_p.fa |
| Assembly haplotype A | fasta | 2.7 GiB | BioCID:genobank/0x88110B7e.../fasta/asm_ctgs_m_a.fa |
| Assembly diploid VCF | vcf | 34.5 MiB | BioCID:genobank/0x88110B7e.../vcf/asm_ctg_var.vcf.gz |
| OpenCRAVAT sqlite (chr1-15) | sqlite | 2.63 GiB | BioCID:genobank/0x88110B7e.../sqlite/hifi_merged_pbrun.sqlite |
| OpenCRAVAT sqlite (chr16-22/X/Y) | sqlite | 865 MiB | BioCID:genobank/0x88110B7e.../sqlite/union_late.sqlite |
| Diagnostic report | markdown | 11.4 KiB | BioCID:genobank/0x88110B7e.../markdown/report.md |
| ACMG clinical report | markdown | 9.0 KiB | BioCID:genobank/0x88110B7e.../markdown/acmg_clinical_report.md |
| ALDOB HFI finding | text | 1.0 KiB | BioCID:genobank/0x88110B7e.../text/FINDING_ALDOB_HFI.txt |
18.6 Proof of Compute
Every processing step in this investigation was dispatched through the biofs verb protocol. Each verb execution produces a job manifest persisted to GCS that records the input BioCIDs, the biofs verb and version, the Docker image and tag, the executor VM, the wall-clock runtime, and the output BioCIDs. The diagnostic hunt report and its ACMG classification are themselves BioCID-addressed artifacts. The on-chain BioNFT (#5) anchors the entire chain: revoking the token revokes access to every derived dataset in the lineage. This is the "proof of compute" for the investigation: every byte of output traces back to a raw biosample through a verifiable, reproducible, consent-gated pipeline.
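The manifest fields listed above, and the claim that every output traces back to a raw biosample, can be sketched as a record type plus a lineage walk. This is an assumed shape, not the biofs manifest schema: the field names, the `trace_to_roots` helper, and the example verb names and BioCID strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class JobManifest:
    """Hypothetical record mirroring the fields the text says each manifest holds."""

    input_biocids: List[str]
    verb: str
    verb_version: str
    docker_image: str      # image and tag, e.g. "name:tag"
    executor_vm: str
    wall_clock_seconds: float
    output_biocids: List[str]


def trace_to_roots(biocid: str, manifests: Iterable[JobManifest]) -> Set[str]:
    """Walk manifests backwards from one output to the inputs no job produced."""
    producer = {out: m for m in manifests for out in m.output_biocids}
    roots, seen, stack = set(), set(), [biocid]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        manifest = producer.get(current)
        if manifest is None:
            roots.add(current)          # nothing produced it: a raw input
        else:
            stack.extend(manifest.input_biocids)
    return roots


# Illustrative two-step chain with made-up names.
chain = [
    JobManifest(["raw.bam"], "align", "0.1", "img:tag", "vm-1", 10.0, ["aligned.bam"]),
    JobManifest(["aligned.bam"], "call", "0.1", "img:tag", "vm-1", 5.0, ["calls.vcf"]),
]
print(trace_to_roots("calls.vcf", chain))
```

Under this sketch, a consent check would only need to resolve the roots of a dataset and test whether the token covering them is still valid.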
19. Honest Caveats and Limitations
Single proband, no trio. Without parental genomes, candidate dominant variants cannot be confirmed as de novo. All phasing is read-based (HiPhase from HiFi reads), not parental. The ALDOB in-trans call rests on long-read phasing, not parental segregation.
DNA-only, no RNA. Every splice prediction (SpliceAI deltas for DSE, STIL, ADAMTS13, CWC15, TAF3) is computational and unconfirmed. Pseudoexon and cryptic-splice consequences cannot be proven without transcriptome data. The highest-value resolving lever (RNA-seq) is blocked on a tissue sample the proband cannot currently provide.
Assembly from downsampled HiFi. The hifiasm assembly used only one of the two Revio cells, reducing its completeness and increasing false-positive assembly-to-reference diffs. The 76.65% concordance with read-based DeepVariant calls (versus roughly 95% expected for a full-depth assembly) reflects this downsampling.
Phased VCF truncated at chr15. The merged HiFi VCF and therefore the HiPhase phased VCF terminate at chromosome 15. Zygosity and phase determinations for chr16-22/X/Y relied on per-cell gVCF GT fields, which lack phase-set information. A proper genome-wide merge and re-phasing should be performed.
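A truncation of this kind can be detected by listing which contigs actually carry records in a VCF. The sketch below assumes `chr`-prefixed contig names and uses hypothetical function names. It reads plain or gzip/bgzip-compressed text with the standard library only and does not depend on an index.

```python
import gzip
from typing import Iterable, List

# Assumed naming convention: chr1..chr22, chrX, chrY.
EXPECTED = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]


def contigs_in_lines(lines: Iterable[str]) -> List[str]:
    """Return contigs that have at least one record, in first-seen order."""
    seen: List[str] = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue                      # skip header and blank lines
        chrom = line.split("\t", 1)[0]
        if chrom not in seen:
            seen.append(chrom)
    return seen


def missing_contigs(vcf_path: str, expected: List[str] = EXPECTED) -> List[str]:
    """List expected contigs that have no records in the given VCF."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        present = set(contigs_in_lines(handle))
    return [c for c in expected if c not in present]


# Tiny in-memory example standing in for a truncated VCF.
sample = ["##fileformat=VCFv4.2\n", "#CHROM\tPOS\n", "chr1\t100\n", "chr15\t200\n"]
print(contigs_in_lines(sample))
```

Running `missing_contigs` on a merged VCF before phasing would surface a gap such as chr16-22/X/Y as a non-empty list.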
Tier-B syndromic episignatures not callable. The approximately 63 Sotos/Kabuki/CHARGE/BAFopathy episignatures require a trained blood-methylation-array classifier and cannot be reliably called from bulk long-read methylation data.
hEDS has no gene. The absence of a monogenic finding for the hypermobility core is consistent with, not evidence against, the clinical hEDS diagnosis. hEDS is the most common EDS subtype and the only one without a confirmed causal gene. A negative genome is the expected molecular result.
Every finding in this report is a hypothesis for clinical follow-up, not a diagnosis. The primary evidence sources are ClinVar, ACMG/AMP 2015 with ClinGen-SVI calibrations, AlphaMissense, REVEL (with the Pejaver 2022 calibrated thresholds), and SpliceAI. No diagnosis should be made on computational prediction alone.
© 2026 GenoBank.io. All rights reserved.