A Systematic Approach to Complex Variant Interpretation in Rare Disease Genomics
Decoding clustered variants, one step at a time
You’re analyzing whole genome sequencing data for a patient with differences of sex development (DSD). Your variant caller flags something unusual in OSBPL6: a frameshift deletion and two nearby substitutions, all appearing on the same chromosome. Your standard annotation pipeline treats them as three separate variants. But are they? What’s the combined effect? Is this pathogenic, benign, or somewhere frustratingly in between?
This scenario—multiple variants in cis (meaning on the same physical chromosome copy, inherited together) forming a complex haplotype (a combination of variants that travel together as a unit)—represents one of the most challenging situations in clinical genomics. Standard tools aren’t designed for this. You need a systematic approach that combines computational analysis, manual sequence reconstruction, and biological reasoning.
This guide provides that framework.
The Challenge: When Variants Don’t Act Alone
Most variant interpretation workflows assume independence: each variant is annotated, classified, and interpreted separately. This works well for single nucleotide variants (SNVs) spread across a gene. But when multiple variants cluster together on the same chromosome copy, they interact:
- A frameshift deletion changes the reading frame for all downstream sequence
- Substitutions in the altered reading frame may produce different amino acids than predicted
- The combined effect might be more severe, less severe, or qualitatively different than any single variant alone
Your annotation tools don’t account for this. VEP, ANNOVAR, and similar pipelines evaluate each variant in isolation, assuming wild-type sequence context. This gives you three separate predictions that may not reflect biological reality.
The solution requires manual analysis—but manual doesn’t mean unstructured. You need a systematic workflow that ensures reproducibility, completeness, and defensibility for clinical reporting.
The Complete Workflow: Seven Essential Steps
Here’s the structured approach for analyzing complex variant clusters in singleton WGS data, particularly relevant for rare disease cases where gene-disease relationships may be uncertain:
Step 1: Combined Variant Annotation
Create a unified variant representation and annotate together
Instead of working with three separate VCF entries, combine the variants into a single representation that captures their clustered nature. Generate a unified notation that describes all three changes together, then annotate this combined variant. This forces annotation tools to consider the variants as a functional unit rather than independent events, though you’ll still need manual verification of the predicted consequences.
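One way to build such a unified representation is to splice the individual alleles into a single REF/ALT pair spanning the whole cluster. The sketch below uses made-up coordinates and alleles, not the actual OSBPL6 variants, and ignores VCF normalization details (left-alignment, anchor bases) for clarity:

```python
# Sketch: merge clustered in-cis variants into one combined REF/ALT pair.
# Positions and alleles are toy examples, not real OSBPL6 data.

def merge_cis_variants(ref_seq, seq_start, variants):
    """Build a single REF/ALT spanning all variants in the cluster.

    ref_seq   -- reference bases covering the cluster (string)
    seq_start -- 1-based genomic position of ref_seq[0]
    variants  -- list of (pos, ref, alt) tuples, 1-based, non-overlapping
    """
    variants = sorted(variants)
    first = variants[0][0]
    last = max(pos + len(ref) - 1 for pos, ref, _ in variants)
    combined_ref = ref_seq[first - seq_start : last - seq_start + 1]

    # Apply each variant right-to-left so earlier coordinates stay valid.
    alt = list(combined_ref)
    for pos, ref, var in reversed(variants):
        i = pos - first
        assert combined_ref[i:i + len(ref)] == ref, "REF allele mismatch"
        alt[i:i + len(ref)] = list(var)
    return first, combined_ref, "".join(alt)

# Example (toy data): a 1 bp deletion plus two nearby substitutions.
print(merge_cis_variants("ACGTACGTACGT", 100,
                         [(101, "CG", "C"), (105, "C", "T"), (107, "T", "A")]))
# → (101, 'CGTACGT', 'CTATGA')
```

The output is a single delins-style record that downstream annotation can treat as one event rather than three.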
Step 2: Manual Sequence Reconstruction
Predict the actual biological consequence
When variants are close together—especially when a frameshift precedes substitutions—you must manually reconstruct the sequence. Start with wild-type codons, apply the deletion and frameshift, then incorporate how the substitutions alter the new reading frame. This reveals the true predicted amino acid sequence, early stop codons, and domain disruptions that automated tools miss. This is the critical step that reveals what’s actually happening at the protein level.
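The reconstruction itself can be scripted. The mini-CDS below is a toy sequence, not OSBPL6; it illustrates exactly the interaction described above: the deletion alone predicts an immediate premature stop, while a substitution carried in cis changes that outcome:

```python
# Standard genetic code, built compactly (codons in TCAG order).
BASES = "TCAG"
AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {b1 + b2 + b3: AAS[16 * i + 4 * j + k]
               for i, b1 in enumerate(BASES)
               for j, b2 in enumerate(BASES)
               for k, b3 in enumerate(BASES)}

def translate(cds):
    """Translate a CDS, stopping at the first stop codon ('*')."""
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE[cds[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)

def apply_complex_allele(cds, start, combined_ref, combined_alt):
    """Swap the combined REF for ALT at 1-based CDS position `start`."""
    i = start - 1
    assert cds[i:i + len(combined_ref)] == combined_ref, "REF mismatch"
    return cds[:i] + combined_alt + cds[i + len(combined_ref):]

wt = "ATGAAAGTGACCTGGTAA"                                  # M K V T W *
mut_del  = apply_complex_allele(wt, 7, "GTGACC", "TGACC")  # deletion only
mut_both = apply_complex_allele(wt, 7, "GTGACC", "TGTCC")  # deletion + SNV
print(translate(wt), translate(mut_del), translate(mut_both))
# → MKVTW* MK* MKCPG
```

Note how the substitution, evaluated only in the shifted frame, abolishes the premature stop that the deletion alone would create; annotating the two variants separately would never reveal this.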
Step 3: Transcript and Functional Context Verification
Confirm the canonical transcript and functional relevance
Verify which transcript these variants affect, whether it’s canonical (the main, clinically-relevant transcript—usually the longest or most extensively studied version of the gene), and where they fall relative to functional domains. Not all transcripts are created equal—a variant cluster in a minor isoform has different implications than the same cluster in the primary disease-associated transcript. Check exon/intron boundaries, functional domains, and protein structure context.
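The transcript-selection logic can be made explicit and reproducible. The field names below are hypothetical placeholders for whatever your annotation source emits (VEP JSON, for example, exposes MANE and canonical flags under its own keys):

```python
# Sketch: deterministic transcript choice. Field names ("mane_select",
# "canonical", "cds_length") are assumptions; map them to your annotator.

def pick_reporting_transcript(transcripts):
    """Prefer MANE Select, then the annotator's canonical flag, then longest CDS."""
    for t in transcripts:
        if t.get("mane_select"):
            return t
    for t in transcripts:
        if t.get("canonical"):
            return t
    return max(transcripts, key=lambda t: t.get("cds_length", 0))
```

Documenting this rule in code, rather than choosing ad hoc per case, is what makes the transcript choice defensible in a clinical report.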
Step 4: In-Silico Prediction Tools
Leverage computational predictions appropriately
Use prediction tools, but understand their limitations with complex variants. Assess splice-site impacts (especially critical if variants lie near exon-intron junctions), predict the likelihood of nonsense-mediated decay, and map effects onto protein domains. But remember: these tools assume single variants in a wild-type context, so interpret predictions cautiously and always validate them against your manual reconstruction.
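One prediction you can sanity-check by hand is nonsense-mediated decay. A widely used heuristic (the "50 nt rule") holds that a premature stop more than roughly 50 nt upstream of the last exon-exon junction is predicted to trigger NMD, while stops in the last exon or just before the final junction typically escape it. The coordinates below are CDS-relative and illustrative only; real transcripts need careful exon mapping:

```python
# Sketch of the classic 50 nt NMD rule. The cutoff is a heuristic, not a
# guarantee; treat the result as one line of evidence among several.

def predicts_nmd(stop_cds_pos, last_junction_cds_pos, rule_nt=50):
    """True if a premature stop at stop_cds_pos likely triggers NMD."""
    return last_junction_cds_pos - stop_cds_pos > rule_nt
```

A stop at CDS position 100 with the last junction at 400 predicts NMD; a stop at 380, or anywhere downstream of the last junction, predicts escape and a possible truncated protein, which changes the functional interpretation.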
Step 5: Population Frequency Analysis
Context from population databases
Document allele frequencies for individual variants AND check whether the complete haplotype pattern appears in gnomAD or other population databases. If the exact variant combination appears at appreciable frequency in healthy populations, this strongly suggests a benign interpretation regardless of predicted molecular impact. Some complex haplotypes are common polymorphisms that look scary in isolation.
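A minimal frequency sanity check looks like the sketch below. The counts are hypothetical; in practice you would pull AC/AN per variant from gnomAD, and remember that the haplotype frequency cannot be derived from per-variant frequencies alone, since the variants may or may not co-occur in the database samples:

```python
# Sketch: flag variants too common to be a rare, penetrant disease allele.
# The 1e-4 threshold is an illustrative assumption; pick yours per disease
# model (prevalence, inheritance, penetrance).

def allele_frequency(ac, an):
    """Allele frequency from allele count (AC) and allele number (AN)."""
    return ac / an if an else 0.0

def exceeds_benign_threshold(ac, an, threshold=1e-4):
    """True if the variant is more common than the disease model allows."""
    return allele_frequency(ac, an) > threshold
```

If any single member of the cluster exceeds the threshold, the complex allele as a whole cannot be rarer than that variant, which is a quick way to de-prioritize a scary-looking haplotype.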
Step 6: Biological Plausibility Assessment
Evaluate gene-disease fit
Consider whether the candidate gene makes biological sense for the phenotype. Is there expression in relevant tissues? Functional connection to the biological pathway? Literature support? For DSD cases, this means evaluating gonadal expression, steroid/lipid metabolism roles, and known sex development pathways. Genes without clear connections require stronger variant-level evidence.
Step 7: Differential Candidate Analysis and Validation Planning
Maintain diagnostic breadth and plan next steps
Review the entire WGS data for alternative explanations. Are there variants in established disease genes? Structural variants? Complex variants demand extra scrutiny, but they shouldn’t prevent you from identifying more likely candidates. Also consider available resources for functional follow-up: RNA samples for transcript stability testing, long-read sequencing to confirm the variant structure, or cell-based assays. Plan what additional evidence would strengthen or refute the interpretation.
A Note on Phasing: Why It May Not Be Critical Here
Important consideration: Without trio data, determining phase (whether variants are on the same chromosome copy or different copies) from short-read WGS is challenging and often unreliable. The good news? For this specific scenario, it may not matter as much as you’d think.
Why phasing matters less here:
If the variants are very close together (within a few base pairs to ~100 bp), they are almost certainly in cis (on the same chromosome). Independent mutations landing that close together on opposite homologs is improbable, and variants spanned by the same short reads can usually be phased directly from read evidence: if they consistently co-occur on the same reads, they are in cis. Physical proximity makes the cis arrangement the only plausible explanation.
When you DO need to consider phasing:
- If variants are spread over larger distances (>100-200 bp apart)
- If determining compound heterozygosity matters for recessive inheritance
- If you need to distinguish between two risk alleles vs. one complex allele
For singleton short-read data: Focus your effort on the manual reconstruction and functional assessment rather than spending significant time on phasing methods that may not be reliable without parental samples. Document the assumption that closely spaced variants are in cis based on proximity, and proceed with the interpretation framework.
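That documented assumption can be encoded as a simple, auditable check. The 100 bp cutoff below mirrors the heuristic discussed above and is an assumption to state in your report, not a community standard:

```python
# Sketch: decide whether a proximity-based cis assumption is defensible
# for a variant cluster in singleton short-read data.

def assume_cis(positions, max_span=100):
    """True if all variant positions fall within max_span bp of each other."""
    return max(positions) - min(positions) <= max_span
```

Clusters that fail the check fall into the "when you DO need to consider phasing" category above and deserve read-level inspection or long-read follow-up.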
The Interpretation Framework
After completing the seven steps, you’re ready to interpret. Here’s the decision framework:
For the Variant Itself
Primary driver: In most cases involving frameshift + substitutions, the frameshift drives the functional consequence. The substitutions are unlikely to rescue function unless they specifically restore the reading frame or create a fortuitous stop codon that limits damage.
Loss-of-function presumption: Treat complex alleles containing frameshifts as putative loss-of-function variants unless evidence suggests otherwise. The bar for “mitigating” substitutions is high—they must demonstrably alter the consequence, not just theoretically.
For the Gene-Disease Relationship
Evidence strength matters: A strong predicted molecular impact doesn’t equal pathogenicity if the gene-disease relationship is weak. For novel or uncertain gene-disease associations, complex variants remain VUS (Variants of Uncertain Significance) until additional evidence emerges.
Evidence that strengthens interpretation:
- Second hit on the other allele (for recessive conditions)
- Functional data showing disrupted protein activity
- Additional cases reported in literature
- Strong expression in relevant tissue
- Known pathogenic variants in the same gene/region
Classification Guidelines
Likely Loss-of-Function (LoF): High confidence for the molecular consequence (frameshift predicts truncation/NMD)
Pathogenicity Evidence: Moderate at best if gene-disease association is uncertain, even with clear LoF prediction
Final Classification: VUS for novel gene-disease associations, even with convincing molecular predictions. Upgrade only with supporting evidence.
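The classification logic above can be written down as a coarse decision function, which helps keep calls consistent across cases. This is an illustrative sketch of this post's framework, not an implementation of the ACMG/AMP criteria:

```python
# Sketch: the post's conservative decision logic as code. Inputs and the
# returned labels are simplifications; a real pipeline scores individual
# ACMG evidence codes.

def classify(predicted_lof, gene_disease_established, supporting_evidence=0):
    """Map LoF prediction plus gene-disease evidence to a coarse call."""
    if not predicted_lof:
        return "VUS"
    if gene_disease_established and supporting_evidence >= 1:
        return "Likely pathogenic"
    if gene_disease_established:
        return "VUS (leaning pathogenic)"
    return "VUS"  # novel gene-disease association: stay conservative
```

Note that a convincing frameshift in a gene with no established disease link still returns VUS, which is exactly the guardrail this framework argues for.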
Why This Workflow Matters
Standard annotation pipelines will fail you on complex variants. They’ll give you three separate predictions that don’t reflect reality. Without systematic manual analysis, you risk:
- Misinterpretation: Thinking variants are independent when they interact
- Incomplete analysis: Missing the true predicted consequence
- Over-calling: Reporting variants as pathogenic without sufficient gene-disease evidence
- Under-calling: Dismissing complex variants because automated tools flag uncertainty
This seven-step framework ensures you:
- Analyze complex variants systematically and reproducibly
- Document your reasoning for clinical reports
- Maintain appropriate confidence levels in interpretation
- Identify when additional evidence is needed
Common Pitfalls to Avoid
Pitfall 1: Trusting automated annotations blindly
Annotation tools don’t handle complex haplotypes well. Always manually verify predicted consequences for variant clusters.
Pitfall 2: Over-interpreting molecular predictions
A predicted LoF variant isn’t pathogenic if the gene isn’t established for the disease. Molecular severity ≠ clinical pathogenicity.
Pitfall 3: Ignoring population data
Complex haplotypes that look alarming variant-by-variant may nonetheless be common polymorphisms. Always check population databases.
Pitfall 4: Tunnel vision on one candidate
Complex variants are interesting, but don’t let them distract from better-supported candidates elsewhere in the genome.
Pitfall 5: Obsessing over phasing in singleton data
Without trio data, precise phasing from short reads is often unreliable. Focus effort where it matters: manual reconstruction and functional assessment.
What This Series Will Cover
In upcoming posts, we’ll deep-dive into each step of this workflow:
Part 1: Combined Variant Annotation - Creating unified variant representations, tools for combined annotation, and documenting complex alleles
Part 2: Manual Sequence Reconstruction - Step-by-step guide to predicting combined consequences of clustered variants
Part 3: Transcript and Context Verification - Identifying canonical transcripts, mapping to functional domains, and assessing structural context
Part 4: In-Silico Prediction Tools - Which tools to use, how to interpret results for complex variants, and understanding limitations
Part 5: Population Frequency Analysis - Using gnomAD and internal databases effectively for complex haplotypes
Part 6: Biological Plausibility Assessment - Evaluating gene-disease fit, expression data, and pathway relevance
Part 7: Differential Analysis and Validation Planning - Maintaining diagnostic breadth and designing follow-up experiments
Each post will include practical examples, real data scenarios, and reproducible workflows you can implement in your own analyses.
Getting Started
Before the next post, you’ll need:
- VCF files from your WGS analysis pipeline
- Reference genome (typically GRCh37/hg19 or GRCh38/hg38)
- Basic bioinformatics environment with alignment and variant calling tools
The Goal: Defensible Clinical Interpretation
Complex variant interpretation isn’t about having perfect answers—it’s about having defensible, systematic reasoning. When you report a variant (or don’t report it), you need to articulate:
- How you determined the combined molecular consequence
- What evidence supports (or doesn’t support) pathogenicity
- What uncertainties remain
- What additional data would change the interpretation
This framework gives you that structure. You’ll move from “these variants look complicated” to “here’s my systematic analysis showing why this is/isn’t likely pathogenic.”
Real-World Impact
In rare disease diagnostics, complex variants represent both opportunity and risk:
Opportunity: They might explain previously unsolved cases, identify novel disease genes, or reveal new disease mechanisms.
Risk: They’re easy to misinterpret, over-call, or miss entirely without proper analysis.
This systematic approach ensures you maximize the opportunity while minimizing the risk. You’ll identify true pathogenic complex variants while avoiding false positive reports that undermine diagnostic confidence.
Key Takeaways
- Complex variants require manual analysis - Automated pipelines aren’t sufficient
- Systematic workflows ensure reproducibility - Follow the seven-step framework
- Combined annotation is your starting point - Merge variants before analyzing impact
- Manual reconstruction reveals truth - This is where you discover the real predicted consequence
- Molecular predictions ≠ pathogenicity - Gene-disease evidence matters as much as variant severity
- Phasing may not be critical - For closely-spaced variants in singleton data, proximity suggests cis arrangement
- Population data provides context - Some complex variants are benign polymorphisms
What’s Next
In the next post, we’ll tackle Step 1: Combined Variant Annotation. You’ll learn:
- How to create unified representations of complex variants
- Tools and approaches for annotating combined variants
- Working with VCF files to merge variant entries
- Nomenclature standards for describing complex alleles
- What annotation tools can and cannot tell you about combined variants
- Preparing your data for manual sequence reconstruction
We’ll work through the OSBPL6 example step-by-step, starting with three separate VCF entries and creating a single combined variant representation that captures their interaction. By the end, you’ll have a properly formatted complex variant ready for the critical manual reconstruction step that follows.
Complex variants don’t have to be overwhelming. With systematic analysis and clear reasoning, you can confidently interpret even the most challenging variant clusters. Join me in the next post as we start with the foundation: combining your variants into a unified representation that reflects biological reality.