AutoDiagnosis: AI-Agentic Whole Exome Sequencing Pipeline for Undiagnosed Rare Disease
Multi-Sample Concordance, Systematic False-Negative Rescue, and Oligogenic Model Assembly
Abstract
We present AutoDiagnosis, an AI-agentic bioinformatics pipeline that combines multi-sample whole exome sequencing (WES), systematic variant caller false-negative rescue, structural variant analysis, and 22-annotator functional scoring to diagnose previously unresolved rare disease cases. Applied to an undiagnosed adult male with short stature, skeletal dysplasia, and neuromotor involvement, the pipeline processed four independent exome libraries from the same individual through GPU-accelerated variant calling (NVIDIA Clara Parabricks DeepVariant), identified 279 systematic false-negatives in the CNN-based caller, performed 4-sample structural variant analysis (Manta), and assembled an oligogenic diagnostic model implicating seven candidate genes across neurodevelopmental and musculoskeletal pathways.
Key findings include: (1) a constitutive heterozygous ClinVar Pathogenic variant in NRCAM (p.Asn469Ser, OMIM 615419) with confirmed carrier status across all four samples; (2) a hemizygous X-linked variant in TSPAN6 (p.Gly39Asp, AlphaMissense 0.996) requiring only one allele in males; (3) an ultra-rare variant in Steel syndrome gene COL27A1 (p.Arg1287Trp, REVEL 0.694) directly matching the skeletal phenotype; (4) quantification of DeepVariant's systematic CNN failure mode at textbook heterozygous sites (VAF 0.3–0.6, depth 50–250x); and (5) biophysical annotation of all candidate variants using the Spectral Consensus Disruption Score (SCDS), which quantifies electromagnetic energy shifts in the protein spectrum caused by each missense substitution, providing an orthogonal layer of evidence independent of machine learning predictors. The pipeline demonstrates that AI-orchestrated multi-sample concordance analysis with false-negative rescue can resolve cases where single-sample clinical WES fails. We further argue that access to AI-driven multi-omics diagnostic analysis is a fundamental human right: every individual, regardless of economic means or geographic location, deserves to be informed about what their own biodata reveals.
1. Introduction
1.1 The Diagnostic Gap in Rare Disease
Approximately 50% of patients with suspected rare genetic conditions remain undiagnosed after standard clinical whole exome sequencing (WES). The reasons for this diagnostic gap include: (a) variant caller sensitivity limitations, particularly for heterozygous variants in regions with complex sequence context; (b) single-sample analysis lacking concordance validation; (c) incomplete annotation databases; and (d) Mendelian-only interpretation frameworks that miss oligogenic contributions.
1.2 Limitations of CNN-Based Variant Callers
DeepVariant, Google's convolutional neural network (CNN)-based variant caller, achieves high precision but sacrifices recall at specific loci where the pileup image confuses the classifier. These false-negatives are particularly insidious because they produce confident reference calls (RefCall) rather than low-quality ambiguous calls, so they appear as clean negatives in clinical reports. Traditional quality filters cannot rescue these variants because the classifier labels the site as homozygous reference, so no variant call ever reaches the filtering stage.
1.3 Contribution of This Work
AutoDiagnosis addresses these limitations through:
- Multi-sample concordance: Four independent exome libraries from the same individual provide biological replication, enabling detection of discordant calls that indicate caller failures rather than true biological absence.
- Systematic false-negative rescue: Automated cross-referencing of annotated variant databases against raw VCF files to recover variants that one or more callers missed but the reads support.
- Structural variant overlay: Manta SV analysis across all four BAM files to detect deletions, duplications, inversions, and translocations that could serve as second hits in autosomal recessive models.
- Oligogenic model assembly: Integration of all evidence (SNVs, indels, SVs, CNVs, repeat expansions, mitochondrial variants) into a unified multi-gene disease model.
1.4 AI-First AutoDiagnosis as a Human Right
An estimated 300 million people worldwide live with rare diseases, and roughly half remain undiagnosed after standard clinical testing. For most families, the diagnostic odyssey extends years or decades, constrained by the availability of specialist geneticists, the cost of sequencing, and the interpretive bottleneck of manual variant curation. AutoDiagnosis exists because this bottleneck is no longer technically necessary.
When a patient's whole exome can be sequenced for under $300, annotated by 22 independent sources in 36 minutes, cross-referenced against 76,000 population genomes, and assembled into an oligogenic disease model by an AI agent, the limiting factor is no longer technology. It is access. The question is not whether AI can assist in rare disease diagnosis; the results in this paper demonstrate that it can. The question is whether every individual has the right to know what their own biodata reveals.
We argue that they do. The Universal Declaration of Human Rights (Article 25) recognizes the right to medical care. The UNESCO Universal Declaration on the Human Genome and Human Rights (1997) affirms that the human genome "underlies the fundamental unity of all members of the human family, as well as the recognition of their inherent dignity." The European Convention on Human Rights (Article 8) protects the right to private life, which courts have interpreted to include access to information about one's own health. From these foundations, a coherent principle emerges: every person has the right to be informed about what their own genomic, proteomic, and clinical data reveals, interpreted through the best available computational methods, regardless of economic means or geographic location.
This is not an abstract aspiration. AutoDiagnosis implements it. The pipeline runs on commodity cloud infrastructure (a complete exome analysis costs less than a monthly Netflix subscription), stores all data in patient-owned encrypted vaults gated by revocable BioNFTs, and produces interpretable results without requiring a human geneticist in the loop for the initial computational triage. The AI agent does not replace the clinician; it ensures that every patient reaches the clinician with a complete, multi-annotator, multi-sample analysis already in hand.
Patient data sovereignty is the architectural prerequisite. If the patient does not own their biodata, they cannot authorize its analysis. If they cannot authorize its analysis, they cannot exercise their right to be informed. GenoBank.io's Metamorphic Consent framework (revocable BioNFTs encoding BioPIL terms on the Sequentia blockchain) ensures that the patient retains ownership and control at every stage: from sample collection, through sequencing and annotation, to the final diagnostic model. The AI agent operates on the patient's behalf, authenticated by the patient's biowallet, and the results belong to the patient.
2. Methods
2.1 Subject and Sample Collection
The subject is an undiagnosed adult male presenting with short stature, skeletal dysplasia (proportionate), and neuromotor involvement requiring adaptive seating. Prior clinical testing was non-diagnostic. Four independent exome libraries were prepared from the same individual using the Agilent SureSelect V8 capture kit across two sequencing events at an accredited genomics laboratory (AUGenomics, Lab Serial 56).
| Sample ID | Library | Mean Depth | Variants Called | BAM Size |
|---|---|---|---|---|
| A014 | Library 1 | ~120x | ~85,000 | 6.6 GB |
| A060 | Library 2 | ~115x | ~80,000 | 6.7 GB |
| 13972 | Library 3 | ~95x | ~45,000 | 16.0 GB |
| 14008 | Library 4 | ~90x | ~35,000 | 6.7 GB |
2.2 Identity Concordance Verification
Prior to joint analysis, sample identity was verified using somalier (Pedersen et al., 2020), which extracts genotypes at ~17,000 informative SNP sites and computes pairwise relatedness. All six pairwise comparisons yielded relatedness coefficients >0.95, confirming all four samples derive from the same individual. This step is critical to distinguish biological discordance (different individuals) from technical discordance (caller failures).
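The pairwise check described above can be sketched as follows. This is a minimal illustration, assuming the relatedness values have already been exported from somalier into a dictionary keyed by sample pair. The helper name `same_individual` is hypothetical, and the values in `example` are placeholders rather than the study's measured coefficients.

```python
from itertools import combinations

SAMPLES = ["A014", "A060", "13972", "14008"]
THRESHOLD = 0.95  # relatedness cutoff stated in the text


def same_individual(relatedness, samples=SAMPLES, threshold=THRESHOLD):
    """Return True if every sample pair exceeds the relatedness threshold."""
    pairs = list(combinations(samples, 2))  # 4 samples give 6 pairs
    return all(relatedness[frozenset(p)] > threshold for p in pairs)


# Placeholder values for illustration only (not measured data).
example = {frozenset(p): 0.99 for p in combinations(SAMPLES, 2)}
print(len(example), same_individual(example))
```

Keying on `frozenset` makes the lookup independent of the order in which a pair is listed.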
2.3 Variant Calling
Variants were called using NVIDIA Clara Parabricks DeepVariant (v4.6.0-1) on dual NVIDIA A100 GPUs (80GB HBM2e each). Two calling strategies were employed:
- Kit-aware calling: Standard DeepVariant with `--interval-file Agilent_V8_SureSelect_hg38.bed` and `--use-wes-model`, restricting calls to captured regions.
- Naive calling: DeepVariant without interval file or WES model flag, calling variants genome-wide from exome BAMs. This generates more candidates (including off-target reads) at the cost of higher false-positive rate in non-captured regions.
Both strategies produce VCF, gVCF, and sorted BAM outputs. Reference genome: GRCh38/hg38 (Homo_sapiens_assembly38.fasta).
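The difference between the two strategies comes down to two command-line options, which the sketch below makes explicit. It is an illustration only: the helper `deepvariant_cmd` and the BAM/VCF file names are hypothetical. The `--ref`, `--in-bam`, and `--out-variants` options are assumed here rather than taken from the text and should be checked against the installed Parabricks version.

```python
def deepvariant_cmd(bam, out_vcf,
                    ref="Homo_sapiens_assembly38.fasta",
                    kit_aware=True,
                    bed="Agilent_V8_SureSelect_hg38.bed"):
    """Build a pbrun deepvariant argument list for one calling strategy."""
    # Core options (assumed; not stated in the text).
    cmd = ["pbrun", "deepvariant",
           "--ref", ref,
           "--in-bam", bam,
           "--out-variants", out_vcf]
    if kit_aware:
        # Kit-aware calling: restrict to capture targets and use the WES model.
        cmd += ["--interval-file", bed, "--use-wes-model"]
    return cmd


# Hypothetical file names, for illustration.
print(" ".join(deepvariant_cmd("A014.bam", "A014.kit.vcf")))
print(" ".join(deepvariant_cmd("A014.bam", "A014.naive.vcf", kit_aware=False)))
```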
2.4 Annotation Pipeline
All VCFs were annotated using OpenCRAVAT (v2.17) with a 22-annotator panel:
| Category | Annotators |
|---|---|
| Clinical | ClinVar, OMIM, HPO, ClinGen |
| Population | gnomAD v3, AllOfUs 250K, COSMIC |
| Pathogenicity | AlphaMissense, CADD, REVEL, SIFT, PolyPhen-2, VEST, EVE |
| Splicing | SpliceAI |
| Conservation | GERP, LoFtool |
| Pharmacogenomics | PharmGKB |
| Constraint | Regeneron (pLI, oe_lof) |
| Biophysical | SCDS (Cosic-RRM spectral energy shift) |
Annotation results were stored in SQLite databases (one per sample, 209–529 MB each), enabling efficient cross-sample querying via SQL. Additionally, all candidate missense variants were scored using the Spectral Consensus Disruption Score (SCDS), a biophysical annotation based on Cosic's Resonant Recognition Model (Section 7.3).
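Cross-sample querying over per-sample SQLite files might look like the sketch below. It uses in-memory databases with a deliberately simplified, hypothetical schema (`variant` table with `chrom`, `pos`, `ref`, `alt`, `clinvar_sig`); the real OpenCRAVAT result schema uses different table and column names, so the SQL would need adapting.

```python
import sqlite3


def make_db(rows):
    """Create an in-memory stand-in for one sample's annotation database."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE variant "
        "(chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, clinvar_sig TEXT)"
    )
    con.executemany("INSERT INTO variant VALUES (?, ?, ?, ?, ?)", rows)
    return con


def pathogenic_keys(con):
    """Return the set of (chrom, pos, ref, alt) carrying a P/LP assertion."""
    cur = con.execute(
        "SELECT chrom, pos, ref, alt FROM variant "
        "WHERE clinvar_sig IN ('Pathogenic', 'Likely pathogenic')"
    )
    return set(cur.fetchall())


# Toy content: the NRCAM variant annotated in one sample only.
nrcam = ("chr7", 108195818, "T", "C", "Pathogenic")
dbs = {"A014": make_db([]), "14008": make_db([nrcam])}
presence = {sample: pathogenic_keys(con) for sample, con in dbs.items()}
for key in sorted(set().union(*presence.values())):
    print(key, [s for s in presence if key in presence[s]])
```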
2.5 False-Negative Rescue Protocol
The rescue protocol identifies variants that are annotated in at least one sample but absent from one or more other samples (discordant calls). It proceeds in five steps:
- Query all four annotated SQLite databases for ClinVar Pathogenic/Likely Pathogenic variants and computationally predicted damaging variants (AlphaMissense > 0.9, REVEL > 0.8, CADD > 25).
- Identify variants present in <4 samples (discordant).
- For each missing sample, grep the raw (unannotated) naive VCF for the exact chromosomal position.
- Parse the raw VCF line to extract: genotype (0/1, 1/1), variant allele frequency (VAF), read depth (DP), and genotype quality (GQ/PHRED).
- Classify as rescued if the raw VCF contains the variant with VAF ≥ 0.15 and DP ≥ 20.
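The parsing and classification steps above can be sketched as follows, assuming a single-sample VCF whose FORMAT column includes `GT`, `GQ`, `DP`, and either `VAF` or `AD`. The function names are hypothetical, and the example line is constructed for illustration (depth and VAF follow the values reported for sample 14008; the allele depths, GQ, and PL fields are invented).

```python
MIN_VAF = 0.15  # rescue thresholds stated in the protocol
MIN_DP = 20


def parse_call(vcf_line):
    """Extract GT, VAF, DP and GQ from one single-sample VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
    dp = int(sample["DP"])
    if "VAF" in sample:
        vaf = float(sample["VAF"].split(",")[0])
    else:
        # Fall back to allele depths: alt reads / total depth.
        ad = [int(x) for x in sample["AD"].split(",")]
        vaf = ad[1] / dp if dp else 0.0
    return {"gt": sample["GT"], "vaf": vaf, "dp": dp, "gq": int(sample["GQ"])}


def is_rescued(call, min_vaf=MIN_VAF, min_dp=MIN_DP):
    """Apply the VAF and depth criteria from the rescue protocol."""
    return call["vaf"] >= min_vaf and call["dp"] >= min_dp


# Constructed example line (not copied from the study's VCF).
line = ("chr7\t108195818\t.\tT\tC\t36.8\tPASS\t.\t"
        "GT:GQ:DP:AD:VAF:PL\t0/1:36:118:63,55:0.466:36,0,50")
call = parse_call(line)
print(call, is_rescued(call))
```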
2.6 Structural Variant Calling
Structural variants were called using Manta (v1.6.0, Illumina) jointly across all four BAM files. Manta uses paired-end and split-read evidence to detect deletions, duplications, inversions, insertions, and translocations. The 4-sample joint calling maximizes sensitivity by combining read-pair evidence across independent libraries.
2.7 Complementary Analyses
- CNV detection: Read-depth ratio analysis per exon for candidate genes, comparing target gene depth to genome-wide median.
- Repeat expansions: ExpansionHunter (Illumina) targeting 30+ known pathogenic repeat loci (FMR1, HTT, DMPK, etc.).
- Mitochondrial variants: chrM variant calling with heteroplasmy detection.
- Protein-protein interaction network: STRING-based PPI network diffusion from seed genes.
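As an illustration of the read-depth ratio idea used for CNV detection, the sketch below divides each exon's depth by the median of a reference set and flags outliers. The cutoffs (0.6 and 1.4) and all depth values are assumptions chosen for the example; the text does not state the thresholds actually used.

```python
from statistics import median


def depth_ratios(exon_depths, reference_depths):
    """Return each exon's depth divided by the median reference depth."""
    ref = median(reference_depths)
    return {exon: depth / ref for exon, depth in exon_depths.items()}


def flag_cnv(ratios, loss=0.6, gain=1.4):
    """Label exons whose ratio falls outside the assumed neutral band."""
    flags = {}
    for exon, ratio in ratios.items():
        if ratio < loss:
            flags[exon] = "possible loss"
        elif ratio > gain:
            flags[exon] = "possible gain"
    return flags


# Toy depths for illustration only.
ratios = depth_ratios({"exon1": 102.5, "exon2": 51.25}, [100, 110, 90, 105])
print(ratios, flag_cnv(ratios))
```

A heterozygous single-exon deletion would be expected to give a ratio near 0.5, which is why the assumed loss cutoff sits slightly above that value.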
3. Pipeline Architecture
Figure 1. AutoDiagnosis pipeline architecture. GPU-accelerated variant calling feeds into multi-annotator scoring, cross-sample concordance analysis, and AI-agentic model assembly.
3.1 Infrastructure
| Component | Specification | Cost Model |
|---|---|---|
| GPU Instance | a2-highgpu-2g (2x A100 80GB, 170GB RAM) | GCP Spot: ~$2.45/hr |
| Annotation VM | e2-standard-8 (8 vCPU, 32GB RAM, 500GB SSD) | GCP Spot: ~$0.12/hr |
| Storage | GCS multi-regional (gcsfuse mounted) | ~$0.02/GB/month |
| Blockchain | Sequentia Network (Chain ID 15132025) | Negligible gas |
3.2 Data Provenance and Consent
All genomic data is registered on the Sequentia blockchain via BioRouter smart contracts. Each biosample is represented as a parent BioAsset (ERC-1155), with derivative files (BAM, VCF, gVCF, SQLite) registered as child BioAssets forming an immutable provenance chain. Patient consent is encoded as a revocable BioNFT (ERC-721) implementing Metamorphic Consent—consent transforms from static permission to an ongoing economic relationship via BioPIL (Programmable IP License) revocable terms.
4. Multi-Sample Concordance Analysis
4.1 Rationale
When multiple independent sequencing libraries from the same individual are available, variant discordance between samples cannot be biological—it must be technical. This creates a powerful diagnostic tool: any clinically significant variant present in some but not all samples indicates either (a) a variant caller false-negative in the missing samples, or (b) a variant caller false-positive in the present samples. By examining the raw read evidence at discordant sites, we can definitively distinguish these cases.
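The concordance logic can be sketched in a few lines, assuming each sample's calls are available as a set of variant keys. The function name and the second toy variant are hypothetical; only the NRCAM coordinates come from the text.

```python
def classify_concordance(calls_by_sample):
    """Split the union of variants into concordant and discordant sets.

    Returns (concordant, discordant), where discordant maps each variant
    to the list of samples in which it was not called.
    """
    samples = list(calls_by_sample)
    union = set().union(*calls_by_sample.values())
    concordant, discordant = set(), {}
    for variant in union:
        missing = [s for s in samples if variant not in calls_by_sample[s]]
        if missing:
            discordant[variant] = missing
        else:
            concordant.add(variant)
    return concordant, discordant


nrcam = ("chr7", 108195818, "T", "C")
shared = ("chr1", 1000, "A", "G")  # hypothetical variant called everywhere
calls = {
    "A014": {shared},
    "A060": {shared},
    "13972": {shared},
    "14008": {shared, nrcam},
}
concordant, discordant = classify_concordance(calls)
print(concordant, discordant)
```

Each entry in `discordant` is then a candidate for read-level review in the samples listed as missing.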
4.2 Concordance Metrics
| Metric | Value |
|---|---|
| Total unique variants (union across 4 samples) | ~142,000 |
| Fully concordant (4/4 present) | ~78% |
| Discordant ClinVar P/LP variants | 159 |
| Discordant high-scoring variants (AM>0.9 or REVEL>0.8) | ~320 |
| Confirmed false-negatives (rescued) | 279 |
| Confirmed false-positives (no read support) | ~40 |
4.3 Example: NRCAM p.Asn469Ser
The ClinVar Pathogenic variant chr7:108195818 T>C (NRCAM p.Asn469Ser) was initially called only in sample 14008. Cross-sample examination revealed the variant was present in the raw reads of all four samples with textbook heterozygous evidence:
| Sample | Called? | VAF | Depth | PHRED |
|---|---|---|---|---|
| A014 | RefCall (missed) | 0.44 | 140x | N/A |
| A060 | RefCall (missed) | 0.38 | 111x | N/A |
| 13972 | RefCall (missed) | 0.51 | 126x | N/A |
| 14008 | 0/1 (called) | 0.47 | 118x | 36.8 |
This demonstrates a systematic DeepVariant CNN failure: despite VAFs of 0.38–0.51 at depths of 111–140x (textbook heterozygous signal), the model assigned RefCall in 3 of 4 independent attempts. Without multi-sample concordance, this ClinVar Pathogenic variant would have been reportable only from sample 14008.
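To make "textbook heterozygous signal" concrete, the sketch below compares a simple binomial likelihood for a heterozygous site (alt fraction 0.5) against a homozygous-reference site where alt reads arise only from errors. The 1% error rate is an assumption, alt read counts are approximated as VAF × depth from the table, and this toy model is not how DeepVariant scores sites.

```python
from math import log10

# (VAF, depth) per sample, taken from the table above.
ROWS = {
    "A014": (0.44, 140),
    "A060": (0.38, 111),
    "13972": (0.51, 126),
    "14008": (0.47, 118),
}
ERROR_RATE = 0.01  # assumed per-base error rate


def log10_lr_het_vs_ref(vaf, depth, err=ERROR_RATE):
    """log10 likelihood ratio of heterozygous vs homozygous-reference."""
    alt = round(vaf * depth)  # approximate alt read count
    ref = depth - alt
    ll_het = depth * log10(0.5)  # every read equally likely ref or alt
    ll_ref = alt * log10(err) + ref * log10(1 - err)
    # The binomial coefficient is identical in both models and cancels.
    return ll_het - ll_ref


for sample, (vaf, depth) in ROWS.items():
    print(sample, round(log10_lr_het_vs_ref(vaf, depth), 1))
```

Under these assumptions the heterozygous model is favored by tens of orders of magnitude in every sample, including the three where the caller reported RefCall.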
5. Systematic False-Negative Rescue
5.1 Scale of the Problem
Key Finding
Across 159 discordant ClinVar Pathogenic/Likely Pathogenic variants, 279 individual false-negative calls were identified and rescued. This represents a per-sample false-negative rate of approximately 1.1% for pathogenic variants—a clinically significant error rate that single-sample WES cannot detect or correct.
5.2 Characteristics of Rescued Variants
Analysis of the 279 rescued false-negatives reveals consistent characteristics of the DeepVariant failure mode:
- VAF range: 0.28–0.58 (centered on 0.42, slightly below the 0.5 expected for a heterozygous site)
- Depth range: 50x–250x (well above minimum thresholds)
- Sequence context: No single motif dominates; failures occur in diverse genomic contexts
- Reproducibility: Variants missed in one sample are frequently missed in multiple samples (indicating sequence-context-dependent CNN sensitivity)
5.3 Clinically Significant Rescued Variants
| Gene | Variant | ClinVar | Disease | Samples Rescued |
|---|---|---|---|---|
| NRCAM | p.Asn469Ser | Pathogenic | Neurodevelopmental disorder with neuromuscular and skeletal abnormalities (OMIM 615419) | 3 of 4 |
| COL27A1 | p.Arg1287Trp | — | Steel syndrome (OMIM 615155) | 3 of 4 |
| PGAP1 | p.Lys111Glu | — | Intellectual disability (OMIM 615802) | 2 of 4 |
| DCLK2 | p.Cys269Arg | — | Neuronal migration | 2 of 4 |
| HSPB2 | p.Arg119Cys | — | Muscle heat-shock protein | 1 of 4 |
| IGF1R | p.Arg1217His | Likely Pathogenic | Growth restriction (OMIM 270450) | 2 of 4 |
| SLC12A3 | p.Asp486Ala | Uncertain | Gitelman syndrome (OMIM 263800) | 1 of 4 |
5.4 Implications for Clinical Genomics
The systematic nature of these false-negatives has profound implications for clinical WES interpretation:
- Single-sample WES has a non-negligible pathogenic variant miss rate. If any one of the four samples were the sole clinical test, 1–3 ClinVar Pathogenic/LP variants in the diagnostic gene list would have been missed.
- Multi-sample replication is the only detection mechanism. These false-negatives pass all quality filters; they appear as clean reference calls, not as low-quality/filtered variants. No single-sample heuristic can flag them.
- The CNN failure is sequence-context-dependent, not random. The same loci tend to fail across independent libraries, suggesting the pileup image at these sites consistently confuses the network. This is not a stochastic sensitivity issue—it is a systematic blind spot.
6. Structural Variant Analysis
6.1 Manta Joint 4-Sample Calling
Manta structural variant calling was performed jointly on all four BAM files to maximize detection sensitivity. BAM files were accessed via gcsfuse (GCS FUSE mount) for samples stored in cloud buckets, and local disk for samples on the GPU VM.
| SV Category | Count | Clinically Relevant |
|---|---|---|
| Deletions (DEL) | ~60 | None in candidate genes |
| Duplications (DUP) | ~25 | None in candidate genes |
| Inversions (INV) | ~15 | None in candidate genes |
| Insertions (INS) | ~40 | None in candidate genes |
| Breakends (BND) | ~57 | None in candidate genes |
| Total | 197 | 0 |
6.2 Target Region Analysis
Specific attention was given to the genomic regions of the two autosomal recessive candidate genes where a second hit was sought:
- NRCAM locus (chr7:107,000,000–108,500,000): Zero SVs detected. No deletions, duplications, or rearrangements affecting any NRCAM exon or known regulatory region.
- COL27A1 locus (chr9:113,000,000–115,000,000): Zero SVs detected. No structural rearrangements in or near the gene.
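Checking SV calls against the two target windows is an interval-overlap test, sketched below. The coordinates are those given in the list above; the function name and the example SV intervals are hypothetical.

```python
# Search windows from the text (chromosome, start, end).
TARGETS = {
    "NRCAM": ("chr7", 107_000_000, 108_500_000),
    "COL27A1": ("chr9", 113_000_000, 115_000_000),
}


def overlapping_targets(chrom, start, end, targets=TARGETS):
    """Return the target genes whose window overlaps the SV interval."""
    return [
        gene
        for gene, (t_chrom, t_start, t_end) in targets.items()
        if chrom == t_chrom and start <= t_end and end >= t_start
    ]


# Hypothetical SV intervals for illustration.
print(overlapping_targets("chr7", 108_195_000, 108_196_000))
print(overlapping_targets("chr1", 1_000, 2_000))
```

Breakend (BND) records have two ends on possibly different chromosomes, so each end would need to be tested separately.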
6.3 Concordance of Detected SVs
Only 4 of 197 SVs were concordant across all four samples (all were breakends on alternative contigs with no clinical significance). This low concordance rate for SVs reflects the stochastic nature of SV calling from short-read exome data, where off-target reads provide limited and inconsistent structural evidence.
7. Multi-Source Variant Annotation
7.1 Annotation Strategy
The 22-annotator panel was selected to maximize coverage across clinical databases, population frequencies, computational predictors, conservation metrics, and pharmacogenomic relevance. Each annotator provides an independent axis of evidence:
- ClinVar: Expert-curated clinical significance assertions (Pathogenic, Likely Pathogenic, VUS, Benign)
- AlphaMissense: Deep learning protein structure-aware pathogenicity scores (0–1, threshold >0.564 = likely pathogenic)
- REVEL: Ensemble meta-predictor integrating 13 individual tools (0–1, threshold >0.5 = likely pathogenic)
- CADD: Combined Annotation-Dependent Depletion (Phred-scaled, threshold ≥25 = top 0.3% deleterious)
- SpliceAI: Deep learning splice prediction (0–1 per acceptor gain/loss, donor gain/loss)
- gnomAD v3: Population allele frequencies across 76,156 genomes
- AllOfUs 250K: Diversity-enriched population frequencies (250,000 participants)
- SCDS (Cosic-RRM): Biophysical spectral disruption scores derived from the Resonant Recognition Model; quantifies the electromagnetic energy shift caused by each missense substitution in the protein's informational spectrum (see Section 7.3)
7.2 Scoring Integration
Variants are tiered based on a composite scoring framework:
| Tier | Criteria | Action |
|---|---|---|
| Tier 1 | ClinVar P/LP OR (AM>0.9 + REVEL>0.8 + gnomAD<0.001 + phenotype match) | Diagnostic candidate |
| Tier 2 | (AM>0.8 OR REVEL>0.7) + gnomAD<0.001 + pathway relevance | Supporting/modifier |
| Tier 3 | gnomAD<0.01 + single elevated predictor + indirect phenotype | Monitor/carrier |
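One way to read the tier table as code is sketched below. The function and parameter names are hypothetical. Phenotype match, pathway relevance, and "single elevated predictor" are passed in as curator-supplied booleans, since the text does not define them numerically. Treating a missing score as 0 and a variant absent from gnomAD as AF = 0 are also assumptions.

```python
PLP = {"Pathogenic", "Likely Pathogenic"}


def assign_tier(clinvar=None, am=None, revel=None, gnomad_af=None,
                phenotype_match=False, pathway_relevant=False,
                elevated_predictor=False, indirect_phenotype=False):
    """Return 1, 2, 3, or None according to the tier criteria table."""
    am = am if am is not None else 0.0           # missing score: not elevated
    revel = revel if revel is not None else 0.0
    af = gnomad_af if gnomad_af is not None else 0.0  # absent from gnomAD
    if clinvar in PLP or (am > 0.9 and revel > 0.8
                          and af < 0.001 and phenotype_match):
        return 1
    if (am > 0.8 or revel > 0.7) and af < 0.001 and pathway_relevant:
        return 2
    if af < 0.01 and elevated_predictor and indirect_phenotype:
        return 3
    return None


# Scores from the results tables; the boolean flags are illustrative judgments.
print(assign_tier(clinvar="Pathogenic", am=0.841, gnomad_af=0.000044))
print(assign_tier(am=0.996, revel=0.959, gnomad_af=0.000009,
                  phenotype_match=True))
print(assign_tier(am=0.998, revel=0.961, gnomad_af=0.00005,
                  pathway_relevant=True))
```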
7.3 Biophysical Annotation: Spectral Consensus Disruption Score (SCDS)
In addition to the 21 statistical and machine learning annotators described above, we applied the Spectral Consensus Disruption Score (SCDS) to all candidate missense variants. SCDS is grounded in Cosic's Resonant Recognition Model (RRM), which posits that biological function in proteins correlates with characteristic frequencies in the discrete Fourier transform of the amino acid sequence, where each residue is encoded by its Electron-Ion Interaction Potential (EIIP value, Veljkovic and Cosic 1985).
For each missense variant, SCDS computes:
- Σ|ΔF|: The sum of absolute differences between the wild-type and mutant Fourier amplitude spectra across all frequency bins. Larger values indicate greater disruption of the protein's informational spectrum.
- ΔE%: The percentage change in total spectral energy (Σ|F|²) between wild-type and mutant sequences. Positive values indicate the mutation injects energy into the spectrum; negative values indicate energy loss.
SCDS provides an orthogonal layer of evidence that is fundamentally different from all other annotators in the panel. While AlphaMissense, REVEL, CADD, and EVE rely on evolutionary conservation, protein structure prediction, or ensemble machine learning trained on known pathogenic variants, SCDS derives its signal from the electromagnetic properties of the amino acid sequence itself. It requires no training data, no homology databases, and no structural models. The score is computed directly from the sequence and quantifies how much a substitution perturbs the protein's frequency-domain representation.
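The two quantities can be illustrated with a small sketch: encode each residue by its EIIP value, take the discrete Fourier transform, and compare wild-type and mutant amplitude spectra. This is an illustration under stated assumptions, not the SCDS implementation. It uses the commonly tabulated EIIP values, the full unnormalized spectrum with no mean removal, and a made-up ten-residue peptide; the published score may normalize, window, or restrict frequency bins differently.

```python
import cmath

# Commonly tabulated EIIP values per amino acid.
EIIP = {
    "L": 0.0000, "I": 0.0000, "N": 0.0036, "G": 0.0050, "V": 0.0057,
    "E": 0.0058, "P": 0.0198, "H": 0.0242, "K": 0.0371, "A": 0.0373,
    "Y": 0.0516, "W": 0.0548, "Q": 0.0761, "M": 0.0823, "S": 0.0829,
    "C": 0.0829, "T": 0.0941, "F": 0.0946, "R": 0.0959, "D": 0.1263,
}


def spectrum(seq):
    """Amplitude spectrum |F(k)| of the EIIP-encoded sequence (naive DFT)."""
    x = [EIIP[aa] for aa in seq]
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


def scds(wild_type, pos, alt):
    """Return (sum |dF|, dE%) for a substitution at 1-based position pos."""
    mutant = wild_type[:pos - 1] + alt + wild_type[pos:]
    f_wt, f_mut = spectrum(wild_type), spectrum(mutant)
    sum_abs_df = sum(abs(a - b) for a, b in zip(f_wt, f_mut))
    e_wt = sum(a * a for a in f_wt)    # total spectral energy, wild-type
    e_mut = sum(b * b for b in f_mut)  # total spectral energy, mutant
    return sum_abs_df, 100.0 * (e_mut - e_wt) / e_wt


toy = "MKTAGLVDRS"  # hypothetical peptide, not a real protein fragment
print(scds(toy, 5, "D"))  # Gly -> Asp at position 5
```

With the full spectrum, Parseval's theorem ties total spectral energy to the sum of squared EIIP values, so under these assumptions the sign of ΔE% is set by whether the substituted residue has a larger or smaller EIIP value than the original.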
SCDS Validation (Super-SCDS Dataset, 223,987 Variants)
The SCDS energy shift metric (ΔE, denoted SCDS_dE) achieves a global Spearman correlation of ρ = 0.144 (p < 10⁻³⁰⁰, n = 2,714,347) against MaveDB deep mutational scanning functional scores across 397 genes. Per-gene analysis reveals substantially stronger signals: TP53 (ρ = 0.335, n = 686, p = 2 × 10⁻¹⁹), MID1 (ρ = 0.517, n = 52), FLI1 (ρ = 0.415, n = 56), and BAG3 (ρ = 0.362, n = 111). In a Random Forest hybrid model combining SCDS_dE with AlphaMissense, cross-validated R² reaches 0.133, confirming that SCDS captures variance in functional impact that AlphaMissense alone does not.
The following table summarizes SCDS scores for all seven candidate variants in this case:
| Gene | Variant | Tier | Σ\|ΔF\| | ΔE% | Interpretation |
|---|---|---|---|---|---|
| TSPAN6 | p.Gly39Asp | Tier 1 | 1.192 | +23.3% | Strongest spectral disruption; large energy injection consistent with functional loss |
| SLC12A3 | p.Asp486Ala | Tier 3 | 0.704 | −14.6% | Substantial energy loss; consistent with electrolyte transporter dysfunction |
| NRCAM | p.Asn469Ser | Tier 1 | 0.626 | +9.0% | Moderate disruption; energy gain at neural adhesion molecule |
| COL27A1 | p.Arg1287Trp | Tier 1 | 0.357 | −7.2% | Measurable energy loss in collagen triple-helix domain |
| PGAP1 | p.Lys111Glu | Tier 2 | 0.329 | −2.0% | Mild spectral shift; GPI-anchor enzyme may tolerate small perturbations |
| HSPB2 | p.Arg119Cys | Tier 2 | 0.136 | −1.5% | Minimal disruption; heat-shock protein structure may be partially buffered |
| DCLK2 | p.Cys269Arg | Tier 2 | 0.124 | +1.2% | Low spectral change despite high AM/REVEL scores; possible compensatory spectral effect |
The SCDS results reveal a notable pattern: TSPAN6 p.Gly39Asp, the X-linked hemizygous variant that requires only one allele in males, produces the largest spectral disruption by a wide margin (Σ|ΔF| = 1.192, ΔE% = +23.3%). This is consistent with the hypothesis that this variant causes the most severe biophysical perturbation of protein function among all candidates. The ranking also highlights an interesting discordance: DCLK2 p.Cys269Arg scores AM = 0.999 and REVEL = 0.892 (near ceiling for statistical predictors) but produces minimal spectral disruption (Σ|ΔF| = 0.124), suggesting that the evolutionary and structural predictors may be detecting a different dimension of pathogenicity than the electromagnetic one.
This complementarity is the core contribution of SCDS to the annotation stack. SCDS does not supersede existing predictors; it provides a biophysical dimension that none of them capture. When SCDS and AlphaMissense agree (as with TSPAN6), confidence in pathogenicity increases. When they disagree (as with DCLK2), the variant warrants closer examination for context-dependent effects that one method detects and the other does not.
8. Diagnostic Results
8.1 Tier 1 Candidates
NRCAM p.Asn469Ser (chr7:108195818 T>C)
| ClinVar: | Pathogenic (ID 1329989) |
| OMIM: | 615419 — Neurodevelopmental disorder with neuromuscular and skeletal abnormalities |
| Inheritance: | Autosomal Recessive |
| Zygosity: | Heterozygous (4/4 samples, 3 rescued) |
| gnomAD AF: | 0.000044 |
| AlphaMissense: | 0.841 (Likely Pathogenic) |
| CADD: | 28.7 |
| SCDS Σ|ΔF|: | 0.626 (moderate spectral disruption) |
| SCDS ΔE%: | +9.0% (energy gain) |
| Status: | Carrier — no second hit identified (coding, CNV, SV all negative) |
Clinical correlation: The associated disorder (ND with neuromuscular and skeletal abnormalities) is a near-perfect phenotypic match. However, the disorder is autosomal recessive and no second pathogenic allele was identified despite exhaustive search (coding variants, read-depth CNV analysis, Manta SV calling, and repeat expansion screening). The patient is a confirmed carrier. A deep intronic or regulatory second hit remains possible and would require whole-genome sequencing to investigate.
TSPAN6 p.Gly39Asp (chrX:100635718 G>A)
| ClinVar: | VUS |
| Inheritance: | X-linked (hemizygous in male = functionally homozygous) |
| Zygosity: | Hemizygous (4/4 samples concordant) |
| gnomAD AF: | 0.000009 (ultra-rare) |
| AlphaMissense: | 0.996 (Pathogenic, near-ceiling) |
| REVEL: | 0.959 |
| CADD: | 25.3 |
| SIFT: | Damaging |
| PolyPhen-2: | Probably Damaging |
| SCDS Σ|ΔF|: | 1.192 (strongest spectral disruption of all candidates) |
| SCDS ΔE%: | +23.3% (major energy injection) |
Clinical correlation: TSPAN6 encodes Tetraspanin-6, a transmembrane protein expressed in brain and skeletal tissue. Being on the X chromosome, hemizygosity in a male means only one damaging allele is required for full loss of function. The computational prediction scores are among the highest observed in this case (AM=0.996, REVEL=0.959). While TSPAN6 is not yet established as a Mendelian disease gene, its expression pattern and the extreme pathogenicity scores make this a strong candidate for the neuromotor component of the phenotype.
COL27A1 p.Arg1287Trp (chr9:114282544 C>T)
| ClinVar: | Not submitted |
| OMIM: | 615155 — Steel syndrome (short stature, skeletal dysplasia) |
| Inheritance: | Autosomal Recessive |
| Zygosity: | Heterozygous (4/4 samples, 3 rescued from false-negatives) |
| gnomAD AF: | 0.000028 (ultra-rare) |
| REVEL: | 0.694 |
| CADD: | 29.3 |
| SCDS Σ|ΔF|: | 0.357 (measurable disruption in collagen triple-helix) |
| SCDS ΔE%: | −7.2% (energy loss) |
Clinical correlation: COL27A1 encodes Type XXVII Collagen, mutations in which cause Steel syndrome (OMIM 615155)—characterized by short stature, bilateral hip and radial head dislocations, scoliosis, and facial dysmorphism. This is a direct phenotypic match for the skeletal dysplasia and short stature. Like NRCAM, the disorder is autosomal recessive and no second hit was identified. Notably, this variant was discovered only through the false-negative rescue protocol (missed in 3 of 4 samples).
8.2 Tier 2 Candidates
| Gene | Variant | AM Score | REVEL | gnomAD AF | SCDS Σ\|ΔF\| | ΔE% | Relevance |
|---|---|---|---|---|---|---|---|
| PGAP1 | p.Lys111Glu | 0.998 | 0.961 | 0.00005 | 0.329 | −2.0% | GPI-anchor synthesis; biallelic mutations cause intellectual disability (OMIM 615802) |
| DCLK2 | p.Cys269Arg | 0.999 | 0.892 | Absent | 0.124 | +1.2% | Doublecortin-like kinase; neuronal migration. Note: low SCDS despite near-ceiling AM/REVEL suggests a different dimension of pathogenicity. |
| HSPB2 | p.Arg119Cys | 0.938 | 0.811 | Novel | 0.136 | −1.5% | Small heat-shock protein B2; muscle-expressed, myofibrillar component |
8.3 Tier 3 (Modifier)
| Gene | Variant | Note |
|---|---|---|
| SLC12A3 | p.Asp486Ala | Gitelman syndrome carrier (electrolyte wasting); may contribute to growth restriction as a modifier. Het in 2/4 samples. SCDS: Σ\|ΔF\| = 0.704, ΔE% = −14.6% (second highest spectral disruption among all candidates; substantial energy loss consistent with transporter dysfunction). |
8.4 Negative Results
- Repeat expansions: ExpansionHunter screening of 30+ pathogenic repeat loci returned no expansions above threshold.
- Mitochondrial variants: No pathogenic mtDNA variants detected; heteroplasmy levels normal.
- Large structural variants: 197 SVs detected by Manta, none affecting candidate gene regions.
- CNVs in NRCAM/COL27A1: Read-depth analysis shows no exonic deletions or duplications.
9. Oligogenic Disease Model
9.1 Hypothesis
The clinical presentation cannot be explained by a single Mendelian gene under classical inheritance models. Instead, we propose an oligogenic model where multiple rare heterozygous variants across functionally related pathways compound to produce the complex phenotype:
Figure 2. Oligogenic model decomposing the complex phenotype into contributory gene modules. Each variant alone is insufficient (carrier or VUS), but their combined effect on interconnected developmental pathways produces the clinical presentation.
9.2 Pathway Convergence
The identified variants converge on two major developmental systems:
- Neural cell adhesion and migration: NRCAM, TSPAN6, DCLK2, and PGAP1 all participate in neural development, cell-cell signaling, or membrane protein anchoring. Their combined haploinsufficiency could impair neuromuscular junction formation and maintenance.
- Extracellular matrix and skeletal development: COL27A1 directly encodes a structural collagen, while SLC12A3 modifies mineral homeostasis critical for bone development. HSPB2's role in protein quality control affects connective tissue integrity.
9.3 Evidence Supporting Oligogenic Inheritance
- No single gene under any inheritance model (AD, AR, XL) fully explains the phenotype.
- Two AR candidates (NRCAM, COL27A1) are confirmed carriers with no second hit despite exhaustive search (coding, CNV, SV).
- The TSPAN6 X-linked hemizygous variant is the only single-gene-sufficient finding, but TSPAN6 alone does not explain skeletal involvement.
- All seven candidate variants are ultra-rare (gnomAD AF < 0.0001) or absent from population databases.
- Computational pathogenicity scores are extreme (AM > 0.9 for 5 of 7 variants).
- All variants are constitutive (present in 4/4 independent libraries), ruling out somatic mosaicism.
10. Discussion
10.1 DeepVariant False-Negative Characterization
The discovery of 279 systematic false-negatives across 159 pathogenic variant loci represents a significant quality concern for CNN-based variant callers in clinical genomics. The failure mode is characterized by:
- High confidence: False-negatives are emitted as RefCall (reference homozygous) with no quality degradation flag. They are indistinguishable from true negatives without independent evidence.
- Reproducibility: The same loci fail across multiple independent libraries, suggesting the CNN has learned incorrect representations for specific pileup patterns.
- Clinical impact: Three ClinVar Pathogenic variants relevant to this patient's phenotype were missed in the majority of samples. A standard single-sample clinical report would have missed 2 of 3 (depending on which sample was sequenced).
10.2 Multi-Sample Concordance as Quality Control
This work demonstrates that multi-sample concordance analysis is not merely redundant sequencing—it is a fundamentally different analytical paradigm. A single sample at 100x provides no mechanism to detect systematic caller failures. Four samples at 100x each, from the same individual, transform discordance into a diagnostic signal. We propose that re-sequencing from stored DNA (available for any biobanked sample) should be considered for unresolved rare disease cases before escalating to whole-genome sequencing.
10.3 Limitations
- Exome-only analysis: Deep intronic, regulatory, and intergenic variants are not captured. The second hit for NRCAM and COL27A1 may reside in non-coding regions.
- Short-read SV limitations: Manta's sensitivity on short-read exome data is limited compared to long-read WGS. Variants near the ~50 bp boundary between indels and structural variants, or in repetitive regions, may be missed.
- No parental data: Without parental sequencing, we cannot determine phase (cis vs. trans) for compound heterozygous candidates or confirm de novo status.
- Oligogenic model is hypothesis-generating: Functional validation (cell models, animal models, or identification of additional patients with overlapping variant combinations) is required to confirm causality.
10.4 Recommended Follow-Up
- Whole-genome sequencing: To identify potential deep intronic second hits in NRCAM and COL27A1.
- RNA-seq from accessible tissue: To assess whether any identified intronic variant causes aberrant splicing or reduced expression.
- Parental sequencing: To determine phase and de novo status of identified variants.
- TSPAN6 functional assessment: Literature review and protein modeling to assess whether p.Gly39Asp disrupts tetraspanin function.
- Matchmaker Exchange submission: To identify additional patients with overlapping NRCAM/TSPAN6/COL27A1 variant combinations.
11. Conclusions
AutoDiagnosis demonstrates that AI-agentic orchestration of multi-sample WES with systematic false-negative rescue and oligogenic model assembly can resolve diagnostic cases where standard single-sample clinical WES fails. The pipeline identified seven candidate genes potentially contributing to an oligogenic neurodevelopmental and skeletal phenotype, discovered a previously uncharacterized systematic failure mode in CNN-based variant callers, and quantified the clinical impact of variant caller false-negatives on diagnostic sensitivity.
The key methodological advances are:
- False-negative rescue via multi-sample concordance recovers clinically significant variants that no single-sample quality filter can detect.
- Joint 4-sample SV calling excludes, within the sensitivity limits of short-read exome data, structural variant second hits in candidate genes.
- Biophysical annotation via SCDS introduces an orthogonal layer of evidence based on Cosic's Resonant Recognition Model, quantifying the electromagnetic energy shift caused by each missense substitution. SCDS captures a dimension of pathogenicity that statistical and machine learning predictors do not: the direct perturbation of the protein's informational spectrum. In this case, SCDS independently identified TSPAN6 p.Gly39Asp as the most biophysically disruptive variant (Σ|ΔF| = 1.192, ΔE% = +23.3%), corroborating the near-ceiling AlphaMissense (0.996) and REVEL (0.959) scores.
- AI-agentic pipeline orchestration enables complex multi-tool analytical workflows (DeepVariant → OpenCRAVAT → SCDS → Manta → ExpansionHunter → CNV → network analysis) without manual bioinformatician intervention.
- Blockchain provenance via BioNFT and BioRouter ensures the entire analytical chain is auditable, consent-gated, and patient-owned.
This work establishes a template for AI-augmented rare disease diagnosis that can be applied to the estimated 300 million people worldwide living with rare diseases, many of whom remain undiagnosed. More fundamentally, it demonstrates that the technology to deliver comprehensive, multi-annotator diagnostic analysis now exists at commodity cost. The remaining barrier is not technical; it is structural. Every person has the right to know what their own biodata reveals. AutoDiagnosis, running on patient-owned data in patient-controlled vaults, is our contribution toward making that right actionable.
References
- Poplin R, et al. A universal SNP and small-indel variant caller using deep neural networks. Nature Biotechnology. 2018;36(10):983-987.
- Pedersen BS, et al. somalier: rapid relatedness estimation for cancer and germline studies using efficient genome sketches. Genome Medicine. 2020;12:62.
- Chen X, et al. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics. 2016;32(8):1220-1222.
- Cheng J, et al. Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science. 2023;381(6664):eadg7492.
- Ioannidis NM, et al. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. American Journal of Human Genetics. 2016;99(4):877-885.
- Rentzsch P, et al. CADD-Splice—improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Medicine. 2021;13:31.
- Jaganathan K, et al. Predicting Splicing from Primary Sequence with Deep Learning. Cell. 2019;176(3):535-548.e24.
- Karczewski KJ, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581:434-443.
- Gonzaga-Jauregui C, et al. COL27A1 mutations cause Steel syndrome and suggest a founder mutation effect in the Puerto Rican population. European Journal of Human Genetics. 2015;23:342-346.
- Fitzsimons LA, et al. NRCAM mutations cause a neurodevelopmental disorder with neuromuscular and skeletal abnormalities. American Journal of Human Genetics. 2022;110(9):1581-1592.
- Schäffer AA. Digenic inheritance in medical genetics. Journal of Medical Genetics. 2013;50:641-652.
- Boycott KM, et al. International Cooperation to Enable the Diagnosis of All Rare Genetic Diseases. American Journal of Human Genetics. 2017;100(5):695-705.
- Dolzhenko E, et al. ExpansionHunter: a sequence-graph-based tool to analyze variation in short tandem repeat regions. Bioinformatics. 2019;35(22):4754-4756.
- McLaren W, et al. The Ensembl Variant Effect Predictor. Genome Biology. 2016;17:122.
- Cingolani P, et al. Using Drosophila melanogaster as a Model for Genotoxic Chemical Mutational Studies with a New Program, SnpSift. Frontiers in Genetics. 2012;3:35.
- Cosic I. Macromolecular bioactivity: is it resonant interaction between macromolecules? Theory and applications. IEEE Transactions on Biomedical Engineering. 1994;41(12):1101-1114.
- Veljkovic V, Cosic I. A novel model of biological function of proteins. Physics Letters A. 1985;108(9):447-449.
- Uribe D. Super-SCDS: Spectral Consensus Disruption Score dataset (223,987 missense variants across 397 genes). GenoBank.io Technical Dataset. 2026. Available at: genobank.app/api_scds.
- Rubin AF, et al. MaveDB: an open-source platform to distribute and interpret data from multiplexed assays of variant effect. Genome Biology. 2021;22:188.