Chapter 13

Computational Genomics & Bioinformatics

Sequence analysis algorithms, phylogenetic analysis, comparative genomics, population genetics analysis, genome assembly and annotation, functional genomics data analysis.

Computational Genomics & Bioinformatics

Computational genomics and bioinformatics represent the intersection of biology, computer science, and mathematics. These fields provide the computational tools to analyze, interpret, and understand the vast amounts of data generated by modern genomic technologies.

Sequence Analysis Fundamentals

Sequence Representation and Storage

String Algorithms in Genomics

Sequence:S=s1,s2,...,snwheresiΣ\text{Sequence}: S = s_1, s_2, ..., s_n \quad \text{where} \quad s_i \in \Sigma

Where Σ={A,T,G,C}\Sigma = \{A,T,G,C\} for DNA sequences.

Alphabet and k-mer Analysis

Number of possible k-mers=Σk=4k(for DNA)\text{Number of possible k-mers} = |\Sigma|^k = 4^k \quad \text{(for DNA)} Observed k-mers4k(due to sequence composition)\text{Observed k-mers} \leq 4^k \quad \text{(due to sequence composition)}

Sequence Alignment Algorithms

Dynamic Programming Approach

Needleman-Wunsch (Global Alignment)
F(i,j)=max{F(i1,j1)+s(xi,yj)(match/mismatch)F(i1,j)d(gap in sequence 2)F(i,j1)d(gap in sequence 1)F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) & \text{(match/mismatch)} \\ F(i-1,j) - d & \text{(gap in sequence 2)} \\ F(i,j-1) - d & \text{(gap in sequence 1)} \end{cases}

Where s(xi,yj)s(x_i,y_j) is the substitution score and dd is the gap penalty.

Smith-Waterman (Local Alignment)
F(i,j)=max{0(start new alignment)F(i1,j1)+s(xi,yj)(match/mismatch)F(i1,j)d(gap in sequence 2)F(i,j1)d(gap in sequence 1)F(i,j) = \max \begin{cases} 0 & \text{(start new alignment)} \\ F(i-1,j-1) + s(x_i,y_j) & \text{(match/mismatch)} \\ F(i-1,j) - d & \text{(gap in sequence 2)} \\ F(i,j-1) - d & \text{(gap in sequence 1)} \end{cases}

Time and Space Complexity

Time complexity (NW)=O(mn)Space complexity=O(mn)\text{Time complexity (NW)} = O(mn) \quad \text{Space complexity} = O(mn) Space-optimized:O(min(m,n))(using Hirschberg’s algorithm)\text{Space-optimized}: O(\min(m,n)) \quad \text{(using Hirschberg's algorithm)}

Multiple Sequence Alignment

Progressive Alignment Algorithm

Score(A,B,C)=Sum-of-pairs=i<jscore(Ai,Bj,Ck)\text{Score}(A,B,C) = \text{Sum-of-pairs} = \sum_{i<j} \text{score}(A_i, B_j, C_k) Star alignment:Center star=argmaxciscore(c,si)\text{Star alignment}: \text{Center star} = \arg\max_{c} \sum_{i} \text{score}(c, s_i)

Consistency-Based Approaches

Consistency:Cij(k)=lCil(k1)Clj(k1)\text{Consistency}: C_{ij}^{(k)} = \sum_{l} C_{il}^{(k-1)} \cdot C_{lj}^{(k-1)}

Phylogenetic Analysis

Distance-Based Methods

Neighbor-Joining Algorithm
Qij=(n2)dijk=1ndikk=1ndjkQ_{ij} = (n-2)d_{ij} - \sum_{k=1}^{n} d_{ik} - \sum_{k=1}^{n} d_{jk} du,k=12(di,k+dj,kdi,j)d_{u,k} = \frac{1}{2}(d_{i,k} + d_{j,k} - d_{i,j})

Character-Based Methods

Maximum Parsimony
Parsimony score(T)=e=(u,v)E(T)Hamming(Su,Sv)\text{Parsimony score}(T) = \sum_{e=(u,v) \in E(T)} \text{Hamming}(S_u, S_v)
Maximum Likelihood
P(DT,M,θ)=i=1nP(xiT,M,θ)P(D|T,M,\theta) = \prod_{i=1}^{n} P(x_i|T,M,\theta) Felsenstein pruning algorithm:P(DT,M,θ)=iP(root=i)jP(data at subtree jroot state=i)\text{Felsenstein pruning algorithm}: P(D|T,M,\theta) = \sum_{i} P(\text{root}=i) \prod_{j} P(\text{data at subtree } j|\text{root state}=i)

Hidden Markov Models in Genomics

Profile HMM for Sequence Analysis

P(sequenceprofile HMM)=pathsP(sequence, pathHMM)P(\text{sequence}|\text{profile HMM}) = \sum_{\text{paths}} P(\text{sequence, path}|\text{HMM})

Forward Algorithm

fk(i)=P(prefix x1...xi,state πi=k)f_{k}(i) = P(\text{prefix } x_1...x_i, \text{state } \pi_i = k) fk(i)=ek(xi)lfl(i1)alkf_{k}(i) = e_k(x_i) \sum_{l} f_{l}(i-1)a_{lk}

Where ek(xi)e_k(x_i) is emission probability and alka_{lk} is transition probability.

Genome Assembly

Short Read Assembly

De Bruijn Graph Approach

Vertices:k-mers in readsEdges:(k1)-mer overlaps\text{Vertices}: k\text{-mers in reads} \quad \text{Edges}: (k-1)\text{-mer overlaps} Path in graphReconstructed sequence\text{Path in graph} \xleftrightarrow{} \text{Reconstructed sequence}

Assembly Metrics

N50:Length such that contigs of this length or longer cover 50% of assembly\text{N50}: \text{Length such that contigs of this length or longer cover 50\% of assembly} NG50:N50 calculated with respect to reference genome size\text{NG50}: \text{N50 calculated with respect to reference genome size}

Long Read Assembly

Overlap-Layout-Consensus (OLC)

Step 1 (Overlap):Find overlaps between reads\text{Step 1 (Overlap)}: \text{Find overlaps between reads} Step 2 (Layout):Construct layout of overlaps\text{Step 2 (Layout)}: \text{Construct layout of overlaps} Step 3 (Consensus):Generate consensus sequence\text{Step 3 (Consensus)}: \text{Generate consensus sequence}

Hybrid Assembly

Short reads:accuracy+Long reads:contiguityOptimal assembly\text{Short reads}: \text{accuracy} + \text{Long reads}: \text{contiguity} \rightarrow \text{Optimal assembly}

Functional Genomics Analysis

Gene Expression Analysis

RNA-Seq Data Processing

Raw countsQCAligned readsQuantificationNormalized expression\text{Raw counts} \xrightarrow{\text{QC}} \text{Aligned reads} \xrightarrow{\text{Quantification}} \text{Normalized expression}

Differential Expression

log2(fold change)=log2(Condition ACondition B)\log_2(\text{fold change}) = \log_2\left(\frac{\text{Condition A}}{\text{Condition B}}\right) Statistical model:g(μi)=β0+β1x1++βpxp\text{Statistical model}: g(\mu_i) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p

Where gg is link function and μi\mu_i is expected count for gene ii.

Normalization Methods

Library Size Normalization

CPM=countslibrary size×106\text{CPM} = \frac{\text{counts}}{\text{library size}} \times 10^6

TPM (Transcripts Per Million)

TPMi=countsi/lengthij(countsj/lengthj)×106\text{TPM}_i = \frac{\text{counts}_i / \text{length}_i}{\sum_j (\text{counts}_j / \text{length}_j)} \times 10^6

DESeq2 Normalization

Size factorj=medianicountsijkcountsik\text{Size factor}_j = \text{median}_{i} \frac{\text{counts}_{ij}}{\sqrt{\prod_k \text{counts}_{ik}}}

Co-expression Networks

Weighted Gene Co-expression Network Analysis (WGCNA)

aij=adjacency(i,j)=correlation(Xi,Xj)βa_{ij} = \text{adjacency}(i,j) = \text{correlation}(X_i, X_j)^{\beta} Module identification:hierarchical clustering of topological overlap\text{Module identification}: \text{hierarchical clustering of topological overlap} Module eigengene:first principal component of module expression\text{Module eigengene}: \text{first principal component of module expression}

Population Genomics

Variant Detection and Genotyping

SNV Detection Probabilities

P(genotypereads)P(readsgenotype)×P(genotype)P(\text{genotype}| \text{reads}) \propto P(\text{reads}|\text{genotype}) \times P(\text{genotype}) Genotype likelihood:GL(g)=P(readsgenotype=g)=iP(readigenotype=g)\text{Genotype likelihood}: GL(g) = P(\text{reads}|\text{genotype}=g) = \prod_{i} P(\text{read}_i|\text{genotype}=g)

Hardy-Weinberg Equilibrium Testing

χ2=(OE)2EwhereE=expected under HWE\chi^2 = \sum \frac{(O-E)^2}{E} \quad \text{where} \quad E = \text{expected under HWE}

Population Structure Analysis

Principal Component Analysis (PCA)

SNP matrix:XRn×m(n samples, m SNPs)\text{SNP matrix}: \mathbf{X} \in \mathbb{R}^{n \times m} \quad \text{(n samples, m SNPs)} X=UΣVTSingular value decomposition\mathbf{X} = \mathbf{U\Sigma V}^T \quad \text{Singular value decomposition}

Admixture Analysis

Q-matrix:Qik=proportion of individual i in population k\text{Q-matrix}: Q_{ik} = \text{proportion of individual } i \text{ in population } k k=1KQik=1i\sum_{k=1}^{K} Q_{ik} = 1 \quad \forall i

Selection Scans

FST Calculation

FST=σB2σB2+σW2=between-population variancetotal varianceF_{ST} = \frac{\sigma^2_B}{\sigma^2_B + \sigma^2_W} = \frac{\text{between-population variance}}{\text{total variance}} FST=HTHSHTwhere HT,HS are total and subpopulation heterozygositiesF_{ST} = \frac{H_T - H_S}{H_T} \quad \text{where } H_T, H_S \text{ are total and subpopulation heterozygosities}

Tajima's D Statistic

D=θ^πθ^WS[Var(θ^πθ^W)]1/2D = \frac{\hat{\theta}_\pi - \hat{\theta}_W}{S[\text{Var}(\hat{\theta}_\pi - \hat{\theta}_W)]^{1/2}}

Where θ^π\hat{\theta}_\pi is pairwise diversity and θ^W\hat{\theta}_W is Watterson's estimator.

Comparative Genomics

Ortholog Detection

Reciprocal Best Hits (RBH)

Orthologs:(A,B) such that Abest hitB and Bbest hitA\text{Orthologs}: (A,B) \text{ such that } A \xrightarrow{\text{best hit}} B \text{ and } B \xrightarrow{\text{best hit}} A

Phylogenetic Approaches

Gene treereconciliationSpecies treeDuplication/Speciation events\text{Gene tree} \xrightarrow{\text{reconciliation}} \text{Species tree} \rightarrow \text{Duplication/Speciation events}

Synteny Analysis

Synteny:Conserved gene order between species\text{Synteny}: \text{Conserved gene order between species} Breakpoint:Position where synteny is disrupted\text{Breakpoint}: \text{Position where synteny is disrupted} Genomic distance:d(A,B)=minimum operations to transform genome A to B\text{Genomic distance}: d(A,B) = \text{minimum operations to transform genome A to B}

Metagenomics Analysis

Taxonomic Classification

k-mer Based Classification

Signature:Staxon={k-mers associated with taxon}\text{Signature}: S_{\text{taxon}} = \{k\text{-mers associated with taxon}\} Classification:assign reads to taxa with most matching k-mers\text{Classification}: \text{assign reads to taxa with most matching k-mers}

Reference-Based Approaches

Database:{ref sequences,taxonomy}\text{Database}: \{\text{ref sequences}, \text{taxonomy}\} Alignment:readsalignmentreferencesLCAtaxonomic assignment\text{Alignment}: \text{reads} \xrightarrow{\text{alignment}} \text{references} \xrightarrow{\text{LCA}} \text{taxonomic assignment}

Functional Profiling

Gene catalogs:ORFsannotationFunctional categories\text{Gene catalogs}: \text{ORFs} \xrightarrow{\text{annotation}} \text{Functional categories} Pathway reconstruction:gene abundancespathway databasepathway completeness\text{Pathway reconstruction}: \text{gene abundances} \xrightarrow{\text{pathway database}} \text{pathway completeness}

Machine Learning in Genomics

Supervised Learning Approaches

Genomic Prediction Models

yi=f(xi;θ)+ϵiwhere xi is genomic featuresy_i = f(\mathbf{x}_i; \theta) + \epsilon_i \quad \text{where } \mathbf{x}_i \text{ is genomic features} Accuracy:r2=correlation2(observed,predicted)\text{Accuracy}: r^2 = \text{correlation}^2(\text{observed}, \text{predicted})

Regularized Regression

Lasso:minβ12nyXβ22+λβ1\text{Lasso}: \min_{\beta} \frac{1}{2n} ||\mathbf{y} - \mathbf{X\beta}||_2^2 + \lambda ||\mathbf{\beta}||_1 Ridge:minβ12nyXβ22+λβ22\text{Ridge}: \min_{\beta} \frac{1}{2n} ||\mathbf{y} - \mathbf{X\beta}||_2^2 + \lambda ||\mathbf{\beta}||_2^2 Elastic net:minβ12nyXβ22+λ[(1α)β1+αβ22]\text{Elastic net}: \min_{\beta} \frac{1}{2n} ||\mathbf{y} - \mathbf{X\beta}||_2^2 + \lambda[(1-\alpha)||\mathbf{\beta}||_1 + \alpha||\mathbf{\beta}||_2^2]

Deep Learning in Genomics

Convolutional Neural Networks for Sequence Analysis

Input:SRL×4ConvFiltersRW×4Feature maps\text{Input}: S \in \mathbb{R}^{L \times 4} \xrightarrow{\text{Conv}} \text{Filters} \in \mathbb{R}^{W \times 4} \rightarrow \text{Feature maps} Output:h=σ(Wcx+b)\text{Output}: \mathbf{h} = \sigma(\mathbf{W}_c * \mathbf{x} + \mathbf{b})

Recurrent Neural Networks for Genomic Sequences

ht=tanh(Whhht1+Wxhxt+bh)\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h) yt=Whyht+by\mathbf{y}_t = \mathbf{W}_{hy}\mathbf{h}_t + \mathbf{b}_y

Attention Mechanisms

et=tanh(Wa[ht,st1])\mathbf{e}_t = \tanh(\mathbf{W}_a[\mathbf{h}_t, \mathbf{s}_{t-1}]) αt=exp(waTet)i=1Texp(waTei)\alpha_t = \frac{\exp(\mathbf{w}_a^T\mathbf{e}_t)}{\sum_{i=1}^{T} \exp(\mathbf{w}_a^T\mathbf{e}_i)} ct=i=1Tαihi\mathbf{c}_t = \sum_{i=1}^{T} \alpha_i \mathbf{h}_i

Quality Control and Filtering

Sequencing Quality Assessment

Phred quality score:Q=10log10(Perror)\text{Phred quality score}: Q = -10 \log_{10}(P_{\text{error}}) Base call accuracy:Pcorrect=110Q/10\text{Base call accuracy}: P_{\text{correct}} = 1 - 10^{-Q/10}

Coverage Analysis

Coverage:C=Total bases mappedReference length\text{Coverage}: C = \frac{\text{Total bases mapped}}{\text{Reference length}} Uniformity:CV=σ(coverage)μ(coverage)\text{Uniformity}: \text{CV} = \frac{\sigma(\text{coverage})}{\mu(\text{coverage})}

Statistical Methods in Genomics

Multiple Testing Correction

Bonferroni Correction

αcorrected=αmwhere m is number of tests\alpha_{\text{corrected}} = \frac{\alpha}{m} \quad \text{where } m \text{ is number of tests}

False Discovery Rate (FDR)

FDR=E[false discoveriestotal discoveries]FDR = E\left[\frac{\text{false discoveries}}{\text{total discoveries}}\right] Benjamini-Hochberg procedure:Rank p-values, reject if p(i)imα\text{Benjamini-Hochberg procedure}: \text{Rank p-values, reject if } p_{(i)} \leq \frac{i}{m}\alpha

Power Analysis

1β=Power=P(reject H0H1 is true)1 - \beta = \text{Power} = P(\text{reject } H_0 | H_1 \text{ is true}) n=(Zα/2+Zβ)2(p1(1p1)+p2(1p2))(p1p2)2n = \frac{(Z_{\alpha/2} + Z_{\beta})^2 \cdot (p_1(1-p_1) + p_2(1-p_2))}{(p_1 - p_2)^2}

For comparing proportions between groups.

Visualization in Genomics

Genomic Data Visualization

Circos Plots

Circular representation:Genomeradial layoutChromosome interactions\text{Circular representation}: \text{Genome} \xrightarrow{\text{radial layout}} \text{Chromosome interactions}

Manhattan Plots

log10(p-value) vs. Genomic position-\log_{10}(p\text{-value}) \text{ vs. } \text{Genomic position}

Heatmaps

Data matrix:XRn×mcolor scaleVisual pattern\text{Data matrix}: \mathbf{X} \in \mathbb{R}^{n \times m} \xrightarrow{\text{color scale}} \text{Visual pattern}

Real-World Application: Population Genomics Analysis

Population genomics analysis helps us understand evolutionary history and genetic diversity patterns.

Population Structure Analysis

# Population genomics analysis for human diversity study
pop_gen_params = {
    'populations': ['YRI', 'CEU', 'ASN'],  # Yoruba, European, Asian
    'sample_size': {'YRI': 108, 'CEU': 99, 'ASN': 97},  # Sample sizes
    'snps': 500000,  # Number of SNPs analyzed
    'linkage_disequilibrium_decay': 0.1,  # LD decay parameter
    'migration_rates': [[0, 0.002, 0.001], [0.002, 0, 0.003], [0.001, 0.003, 0]],  # Migration matrix
    'ancestral_populations': 3,  # K value for admixture analysis
    'recombination_rate': 1e-8  # per bp per generation
}

# Calculate genetic differentiation statistics
# FST between populations
fst_matrix = [[0 for _ in range(len(pop_gen_params['populations']))] for _ in range(len(pop_gen_params['populations']))]

# Using Weir & Cockerham method for FST estimation
for i, pop1 in enumerate(pop_gen_params['populations']):
    for j, pop2 in enumerate(pop_gen_params['populations']):
        if i != j:
            # Calculate FST based on sample sizes and migration rates
            avg_sample_size = (pop_gen_params['sample_size'][pop1] + pop_gen_params['sample_size'][pop2]) / 2
            # Simplified FST based on migration and sample size
            fst_value = 1 / (1 + 4 * pop_gen_params['migration_rates'][i][j] * avg_sample_size)
            fst_matrix[i][j] = fst_value

# Calculate nucleotide diversity (pi)
theta_pi_expected = 0.001  # per bp (human average)
genome_size = 3.2e9  # bp
expected_nucleotide_diversity = theta_pi_expected * genome_size * pop_gen_params['snps'] / (genome_size / 30000)  # Adjust for SNP density

# Calculate Tajima's D expectation for each population
# Under neutral evolution, D ~ N(0, sqrt(variance))
tajima_d_variance = 1  # Simplified variance
for i, pop in enumerate(pop_gen_params['populations']):
    sample_size = pop_gen_params['sample_size'][pop]
    # Expected Tajima's D values under neutrality
    expected_tajima_d = 0  # neutral evolution expectation
    # Variance of Tajima's D
    var_tajima_d = math.sqrt(1.0)  # Scaled by sample size and other factors

# Estimate effective population sizes from diversity
# Theta = 4*Ne*mu, so Ne = theta/(4*mu)
mutation_rate = 1.2e-8  # per bp per generation (human average)
for i, pop in enumerate(pop_gen_params['populations']):
    # Estimate Ne from observed diversity
    sample_size = pop_gen_params['sample_size'][pop]
    estimated_diversity = theta_pi_expected * (1 - 0.1 * i)  # Simulated slight variation
    estimated_ne = estimated_diversity / (4 * mutation_rate)
    
    print(f"Population {pop}:")
    print(f"  Estimated effective population size: {estimated_ne:,.0f}")
    print(f"  Sample size: {sample_size}")
    print(f"  Expected nucleotide diversity: {estimated_diversity:.5f}")

# Population differentiation matrix
print(f"\nFST differentiation matrix:")
pop_names = pop_gen_params['populations']
for i, name1 in enumerate(pop_names):
    for j, name2 in enumerate(pop_names):
        if i != j:
            print(f"  {name1} vs {name2}: {fst_matrix[i][j]:.4f}")

# Estimate admixture proportions
# Using simplified model for demonstration
admixture_results = {}
for pop in pop_gen_params['populations']:
    # Calculate expected admixture proportions based on migration matrix
    total_migration = sum(pop_gen_params['migration_rates'][pop_names.index(pop)])
    ancestral_contributions = []
    for anc in range(pop_gen_params['ancestral_populations']):
        # Simulated ancestral contribution
        contrib = (1 - total_migration * 0.1) if anc == pop_names.index(pop) else total_migration * 0.1 / (len(pop_gen_params['populations']) - 1)
        ancestral_contributions.append(contrib)
    admixture_results[pop] = ancestral_contributions

print(f"\nSimulated admixture proportions (by population):")
for pop, props in admixture_results.items():
    print(f"  {pop}: {[f'{p:.3f}' for p in props]}")

# Time to most recent common ancestor (TMRCA) estimation
# For a diploid population: T_MRCA ~ 4*Ne generations
for i, pop in enumerate(pop_gen_params['populations']):
    sample_size = pop_gen_params['sample_size'][pop]
    estimated_ne = estimated_diversity / (4 * mutation_rate)  # from calculation above
    tmrca_generations = 4 * estimated_ne
    tmrca_years = tmrca_generations * 25  # 25 years per generation
    print(f"  {pop} TMRCA: {tmrca_years:,.0f} years ago ({tmrca_generations:,.0f} generations)")

# Selection scan analysis
# Calculate expected distribution of allele frequencies under neutrality
num_snps = pop_gen_params['snps']
expected_distribution = {
    'rare': num_snps * 0.1,  # Alleles < 0.01 frequency
    'low_freq': num_snps * 0.15,  # Alleles 0.01-0.05
    'intermediate': num_snps * 0.5,  # Alleles 0.05-0.95
    'high_freq': num_snps * 0.15,  # Alleles 0.95-0.99
    'fixed': num_snps * 0.1  # Alleles >= 0.99
}

print(f"\nExpected allele frequency distribution (under neutrality):")
for category, count in expected_distribution.items():
    print(f"  {category}: {count:.0f} SNPs ({count/num_snps*100:.1f}%)")

# Detect departures from neutrality (potential selection)
observed_dist = {
    'rare': expected_distribution['rare'] * 0.95,  # Slight decrease due to selection
    'low_freq': expected_distribution['low_freq'] * 1.1,
    'intermediate': expected_distribution['intermediate'] * 1.05,
    'high_freq': expected_distribution['high_freq'] * 0.9,
    'fixed': expected_distribution['fixed'] * 0.85
}

print(f"Simulated observed distribution:")
for category, count in observed_dist.items():
    print(f"  {category}: {count:.0f} SNPs ({count/num_snps*100:.1f}%)")

# Calculate neutrality test statistics
differences = {cat: obs - exp for cat, obs, exp in 
               zip(expected_distribution.keys(), observed_dist.values(), expected_distribution.values())}
excess_categories = [cat for cat, diff in differences.items() if abs(diff) > 0.05 * num_snps]

print(f"\nCategories showing departure from neutrality (|diff| > {0.05 * num_snps:.0f} SNPs): {excess_categories}")
if excess_categories:
    print("  These may indicate regions under selection or demographic effects")
else:
    print("  Allele frequency distribution consistent with neutral expectations")

Demographic Inference

Understanding population history through genomic data analysis.


Your Challenge: Variant Calling and Filtering Pipeline

Design a computational pipeline for detecting and filtering genetic variants from sequencing data.

Goal: Create a variant calling workflow with appropriate quality filtering.

Sequencing Data Analysis

import math

# Variant analysis for whole exome sequencing data
seq_params = {
    'coverage_depth': 30,      # Average coverage per base
    'error_rate': 0.001,       # Per base error rate (0.1%)
    'reference_genome': 'GRCh38',
    'aligner': 'BWA-MEM',
    'variant_caller': 'GATK HaplotypeCaller',
    'target_regions': 50000000,  # bp (50 Mb exome)
    'false_discovery_rate': 0.05,  # 5% FDR target
    'quality_threshold': 30,     # Minimum variant quality score
    'strand_bias_limit': 0.3,    # Maximum allowed strand bias
    'allele_frequency_threshold': 0.05  # Minimum frequency to consider variant
}

# Calculate raw variant statistics
expected_errors_per_sample = seq_params['target_regions'] * seq_params['coverage_depth'] * seq_params['error_rate']
number_of_variants_detected = int(expected_errors_per_sample * 2)  # 2x to account for real variants

# Apply quality filtering
# Based on GATK best practices filters
dp_filter_threshold = 10  # Minimum depth
qual_filter_threshold = seq_params['quality_threshold']
sb_filter_threshold = seq_params['strand_bias_limit']

# Calculate probability of true variant vs. sequencing error
prior_error_prob = seq_params['error_rate']
posterior_variant_prob = 0.1  # After quality filtering

# Calculate false discovery rate after filtering
# If we expect 5% FDR, then 5% of called variants should be false
expected_false_calls = number_of_variants_detected * seq_params['false_discovery_rate']
expected_true_calls = number_of_variants_detected - expected_false_calls

# Apply strand bias filtering
# Calculate strand bias for each variant
# Using Fisher's exact test on strand counts
def calculate_strand_bias(ref_forward, ref_reverse, alt_forward, alt_reverse):
    # Simplified strand bias calculation
    total_ref = ref_forward + ref_reverse
    total_alt = alt_forward + alt_reverse
    if total_alt == 0:
        return 0.0
    
    # Calculate strand bias as difference from expectation
    expected_forward_fraction = (ref_forward + alt_forward) / (total_ref + total_alt)
    if expected_forward_fraction > 0.5:
        # More forward reads than reverse - potential bias
        strand_bias = abs(expected_forward_fraction - 0.5) * 2
    else:
        strand_bias = abs((ref_reverse + alt_reverse) / (total_ref + total_alt) - 0.5) * 2
    
    return min(strand_bias, 1.0)  # Cap at 1.0

# Simulated variant data
sample_variants = [
    {'position': 100000, 'ref_depth': 25, 'alt_depth': 12, 'quality': 45, 
     'ref_forward': 15, 'ref_reverse': 10, 'alt_forward': 7, 'alt_reverse': 5},
    {'position': 150000, 'ref_depth': 30, 'alt_depth': 8, 'quality': 25,
     'ref_forward': 25, 'ref_reverse': 5, 'alt_forward': 8, 'alt_reverse': 0},
    {'position': 200000, 'ref_depth': 20, 'alt_depth': 3, 'quality': 60,
     'ref_forward': 12, 'ref_reverse': 8, 'alt_forward': 2, 'alt_reverse': 1},
    {'position': 250000, 'ref_depth': 40, 'alt_depth': 15, 'quality': 50,
     'ref_forward': 20, 'ref_reverse': 20, 'alt_forward': 7, 'alt_reverse': 8}
]

# Apply filtering criteria
passing_variants = []
filtered_reasons = {'low_quality': 0, 'strand_bias': 0, 'low_coverage': 0, 'low_frequency': 0}

for variant in sample_variants:
    depth = variant['ref_depth'] + variant['alt_depth']
    frequency = variant['alt_depth'] / depth if depth > 0 else 0
    
    # Calculate strand bias
    sb_score = calculate_strand_bias(
        variant['ref_forward'], variant['ref_reverse'],
        variant['alt_forward'], variant['alt_reverse']
    )
    
    # Apply filters
    passes_filters = True
    reason = ""
    
    if variant['quality'] < qual_filter_threshold:
        passes_filters = False
        reason = "low_quality"
    elif sb_score > sb_filter_threshold:
        passes_filters = False
        reason = "strand_bias"
    elif depth < dp_filter_threshold:
        passes_filters = False
        reason = "low_coverage"
    elif frequency < seq_params['allele_frequency_threshold']:
        passes_filters = False
        reason = "low_frequency"
    
    if passes_filters:
        passing_variants.append(variant)
    else:
        filtered_reasons[reason] += 1

# Calculate sensitivity and specificity estimates
# Based on expected error rates and filtering
theoretical_precision = 1 - seq_params['false_discovery_rate']  # After filtering
theoretical_sensitivity = 0.90  # Estimated sensitivity after quality filtering

# Calculate positive predictive value (PPV) after filtering
prior_prob_real = 0.01  # 1% of positions have true variants
posterior_prob_real = (prior_prob_real * theoretical_sensitivity) / \
                     ((prior_prob_real * theoretical_sensitivity) + 
                      ((1 - prior_prob_real) * seq_params['false_discovery_rate']))

# Calculate coverage uniformity
# Standard deviation of coverage depth across regions
coverage_uniformity_cv = 0.25  # Typical for well-designed exome capture

# Calculate callable regions (regions with sufficient coverage)
callable_fraction = 0.90  # 90% of target regions have >10x coverage
callable_bases = seq_params['target_regions'] * callable_fraction

# Estimated variant load per sample
expected_rare_variants = callable_bases * 0.001  # 1 per 1000 bases
expected_common_variants = callable_bases * 0.0001  # 1 per 10000 bases

Create a variant calling and filtering pipeline to identify high-confidence genetic variants.

Hint:

  • Consider multiple quality metrics for variant filtering
  • Account for coverage depth and uniformity
  • Evaluate strand bias and other systematic artifacts
  • Estimate precision and recall after filtering
# TODO: Calculate variant filtering results
high_quality_variants = 0        # Number of variants passing all filters
filtering_efficiency = 0         # Fraction of variants that pass filters
call_rate = 0                    # Fraction of target sites with sufficient coverage
precision_after_filtering = 0    # Fraction of called variants that are true positives
strand_bias_score = 0            # Average strand bias across filtered variants

# Calculate results from analysis above
high_quality_variants = len(passing_variants)
filtering_efficiency = high_quality_variants / len(sample_variants) if sample_variants else 0
call_rate = callable_fraction
precision_after_filtering = posterior_prob_real
strand_bias_score = sum(calculate_strand_bias(
    v['ref_forward'], v['ref_reverse'], 
    v['alt_forward'], v['alt_reverse']) for v in passing_variants) / len(passing_variants) if passing_variants else 0

# Calculate additional metrics
coverage_uniformity = 1 - coverage_uniformity_cv  # Higher is better uniformity
expected_variants_passing_filters = expected_rare_variants * theoretical_sensitivity  # After sensitivity loss

# Print results
print(f"Variant calling results:")
print(f"  Raw variants detected: {len(sample_variants)}")
print(f"  High-quality variants: {high_quality_variants}")
print(f"  Filtering efficiency: {filtering_efficiency:.2f}")
print(f"  Callable genome fraction: {call_rate:.3f}")
print(f"  Estimated precision: {precision_after_filtering:.3f}")
print(f"  Strand bias score: {strand_bias_score:.3f}")
print(f"  Coverage uniformity: {coverage_uniformity:.3f}")

# Quality assessment
if precision_after_filtering > 0.9 and filtering_efficiency > 0.7:
    quality_assessment = "High quality - good balance of precision and sensitivity"
elif precision_after_filtering > 0.8:
    quality_assessment = "Good quality - high precision with some sensitivity loss"
else:
    quality_assessment = "Low quality - significant filtering artifact concerns"

print(f"  Quality assessment: {quality_assessment}")

# Reporting metrics
for category, count in filtered_reasons.items():
    if count > 0:
        print(f"  Filtered for {category}: {count} variants")

# Clinical interpretation
if high_quality_variants > 0:
    clinical_interpretation = f"Identified {high_quality_variants} high-confidence variants for further analysis"
else:
    clinical_interpretation = "No high-confidence variants identified - consider additional samples or quality improvements"
    
print(f"  Clinical interpretation: {clinical_interpretation}")

How would you modify the filtering criteria if you were specifically looking for rare pathogenic variants versus common population variants?

ELI10 Explanation

Simple analogy for better understanding

Think of computational genomics like being a digital librarian for the world's most complex instruction manuals (genomes). Instead of reading one book, you're managing a library with billions of pages of genetic text, and you need special computer programs to search, compare, and understand the patterns in this massive collection. Just like how a search engine can find specific information in millions of webpages in seconds, bioinformatics tools can scan through entire genomes to find genes, compare them between different organisms, and identify patterns that tell us about evolution, disease, and adaptation. It's like having a super-powered magnifying glass that can not only read the 'words' (DNA sequences) but also understand the 'meaning' (gene functions and regulatory networks) across entire biological systems. These tools help us understand how genetic variations contribute to traits, diseases, and evolutionary relationships by comparing and analyzing vast amounts of genetic data computationally.

Self-Examination

Q1.

How do sequence alignment algorithms efficiently compare genomic sequences?

Q2.

What are the challenges in assembling and annotating large eukaryotic genomes?

Q3.

How can population genomic analyses reveal evolutionary history and demographic patterns?