Computational Genomics & Bioinformatics
Sequence analysis algorithms, phylogenetic analysis, comparative genomics, population genetics analysis, genome assembly and annotation, functional genomics data analysis.
Computational genomics and bioinformatics represent the intersection of biology, computer science, and mathematics. These fields provide the computational tools to analyze, interpret, and understand the vast amounts of data generated by modern genomic technologies.
Sequence Analysis Fundamentals
Sequence Representation and Storage
String Algorithms in Genomics
Genomic sequences are strings over a finite alphabet Σ, where Σ = {A, C, G, T} for DNA sequences.
Alphabet and k-mer Analysis
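As a concrete illustration of k-mer analysis, the sliding-window count below is a minimal Python sketch (the `count_kmers` helper and the example sequence are ours, not from any particular toolkit):

```python
from collections import Counter

def count_kmers(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in a DNA sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# A sequence of length L yields L - k + 1 overlapping k-mers
counts = count_kmers("ATATAT", 2)  # {'AT': 3, 'TA': 2}
```

k-mer spectra built this way underlie genome-size estimation, read error correction, and alignment-free sequence comparison.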
Sequence Alignment Algorithms
Dynamic Programming Approach
Needleman-Wunsch (Global Alignment)
F(i, j) = max( F(i-1, j-1) + s(a_i, b_j), F(i-1, j) - d, F(i, j-1) - d ), where s(a_i, b_j) is the substitution score and d is the gap penalty.
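The recurrence translates directly into a short score-only implementation. This is a minimal sketch with illustrative match/mismatch/gap values rather than a real substitution matrix such as BLOSUM:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score via the Needleman-Wunsch recurrence."""
    n, m = len(a), len(b)
    # F[i][j] = best score aligning a[:i] with b[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap          # a[:i] aligned against gaps
    for j in range(1, m + 1):
        F[0][j] = j * gap          # gaps aligned against b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # substitution
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    return F[n][m]

score = needleman_wunsch("GATTACA", "GACTACA")  # 6 matches, 1 mismatch -> 5
```

Smith-Waterman differs only in clamping each cell at zero and taking the maximum over the whole matrix, which yields the best local alignment instead.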
Smith-Waterman (Local Alignment)
Time and Space Complexity
Multiple Sequence Alignment
Progressive Alignment Algorithm
Consistency-Based Approaches
Phylogenetic Analysis
Distance-Based Methods
Neighbor-Joining Algorithm
Character-Based Methods
Maximum Parsimony
Maximum Likelihood
Hidden Markov Models in Genomics
Profile HMM for Sequence Analysis
Forward Algorithm
α_t(j) = [ Σ_i α_{t-1}(i) a_{ij} ] b_j(o_t), where b_j(o_t) is the emission probability and a_{ij} is the transition probability.
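The forward recursion can be transcribed almost literally; dictionaries stand in for the transition and emission matrices. The two-state CpG-island model below is a toy example with made-up probabilities:

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Forward algorithm: total probability of an observed sequence under an HMM."""
    # alpha[t][j] = P(o_1..o_t, state at time t = j)
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for t in range(1, len(obs)):
        alpha.append({
            j: sum(alpha[t - 1][i] * trans_p[i][j] for i in states) * emit_p[j][obs[t]]
            for j in states
        })
    return sum(alpha[-1][s] for s in states)

# Toy 2-state CpG-island model (all numbers illustrative)
states = ("island", "background")
start = {"island": 0.5, "background": 0.5}
trans = {"island": {"island": 0.9, "background": 0.1},
         "background": {"island": 0.1, "background": 0.9}}
emit = {"island": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "background": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}
prob = forward("CGCG", states, start, trans, emit)
```

Real profile-HMM tools work in log space to avoid underflow on long sequences; the recursion itself is unchanged.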
Genome Assembly
Short Read Assembly
De Bruijn Graph Approach
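A de Bruijn assembler indexes each read's k-mers as edges between (k-1)-mer nodes; contigs then correspond to unambiguous paths through the graph. A minimal construction sketch (function name and input format are our choices):

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, edges are observed k-mers."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return graph

g = de_bruijn(["ATGGC", "TGGCA"], 3)
```

Repeated edges record coverage support; production assemblers additionally fold in reverse complements and prune error-induced tips and bubbles before extracting contigs.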
Assembly Metrics
Long Read Assembly
Overlap-Layout-Consensus (OLC)
Hybrid Assembly
Functional Genomics Analysis
Gene Expression Analysis
RNA-Seq Data Processing
Differential Expression
In the generalized linear model g(μ_i) = x_iᵀβ, where g is the link function (the log link for negative binomial counts) and μ_i is the expected count for gene i.
Normalization Methods
Library Size Normalization
TPM (Transcripts Per Million)
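TPM first normalizes each gene by its length (reads per kilobase), then rescales so the values sum to one million. A minimal sketch with illustrative counts and lengths:

```python
def tpm(counts, lengths_kb):
    """Transcripts Per Million from raw counts and transcript lengths in kb."""
    rates = [c / l for c, l in zip(counts, lengths_kb)]  # length-normalize first
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

# A 2 kb gene with twice the reads of a 1 kb gene has the same expression rate
vals = tpm([100, 200], [1.0, 2.0])  # both genes get 500000.0
```

Because TPM values sum to the same constant in every sample, relative abundances line up across libraries in a way that raw counts and RPKM do not.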
DESeq2 Normalization
Co-expression Networks
Weighted Gene Co-expression Network Analysis (WGCNA)
Population Genomics
Variant Detection and Genotyping
SNV Detection Probabilities
Hardy-Weinberg Equilibrium Testing
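The standard test compares observed genotype counts against the p², 2pq, q² expectations with a chi-square statistic on one degree of freedom (three genotype classes, minus one, minus one estimated allele frequency). A minimal sketch; for 1 df the chi-square survival function reduces to erfc(√(x/2)):

```python
import math

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square test of Hardy-Weinberg equilibrium from genotype counts."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    observed = [n_AA, n_Aa, n_aa]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # survival function, 1 df
    return chi2, p_value

chi2, p_value = hwe_chi_square(25, 50, 25)  # perfect equilibrium at p = 0.5
```

Sites that fail this test in variant-calling pipelines are often genotyping artifacts rather than biology, which is why HWE filtering is routine QC.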
Population Structure Analysis
Principal Component Analysis (PCA)
Admixture Analysis
Selection Scans
FST Calculation
Tajima's D Statistic
D = (π − θ_W) / √Var(π − θ_W), where π is the pairwise nucleotide diversity and θ_W is Watterson's estimator.
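Both estimators, together with Tajima's normalizing constants, fit in one short function. In this sketch `site_counts` holds the derived-allele count at each segregating site (an input format we chose for illustration):

```python
import math

def tajimas_d(n, site_counts):
    """Tajima's D from sample size n and per-site derived-allele counts."""
    S = len(site_counts)                                  # segregating sites
    pairs = n * (n - 1) / 2
    pi = sum(j * (n - j) for j in site_counts) / pairs    # pairwise diversity
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_w = S / a1                                      # Watterson's estimator
    # Tajima (1989) variance constants
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    var = e1 * S + e2 * S * (S - 1)
    return (pi - theta_w) / math.sqrt(var)
```

An excess of rare variants drives D negative (population expansion or a selective sweep); an excess of intermediate-frequency variants drives it positive (balancing selection or population structure).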
Comparative Genomics
Ortholog Detection
Reciprocal Best Hits (RBH)
Phylogenetic Approaches
Synteny Analysis
Metagenomics Analysis
Taxonomic Classification
k-mer Based Classification
Reference-Based Approaches
Functional Profiling
Machine Learning in Genomics
Supervised Learning Approaches
Genomic Prediction Models
Regularized Regression
Deep Learning in Genomics
Convolutional Neural Networks for Sequence Analysis
Recurrent Neural Networks for Genomic Sequences
Attention Mechanisms
Quality Control and Filtering
Sequencing Quality Assessment
Coverage Analysis
Statistical Methods in Genomics
Multiple Testing Correction
Bonferroni Correction
False Discovery Rate (FDR)
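The Benjamini-Hochberg step-up procedure compares the i-th smallest p-value against (i/m)·α and rejects every hypothesis up to the largest rank that passes. A minimal sketch:

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """BH step-up: indices of hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank with p_(k) <= (k/m) * alpha
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k = rank
    return sorted(order[:k])

rejected = benjamini_hochberg([0.001, 0.02, 0.04, 0.3, 0.5])  # [0, 1]
```

Unlike Bonferroni, which controls the chance of any false positive, BH controls the expected fraction of false positives among the rejections, which is why it dominates in genome-wide screens with thousands of tests.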
Power Analysis
n = (z_{1−α/2} + z_{1−β})² [p₁(1−p₁) + p₂(1−p₂)] / (p₁ − p₂)² per group, for comparing proportions between groups.
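Plugging standard normal quantiles into the formula gives a per-group sample size. The defaults below assume two-sided α = 0.05 (z ≈ 1.96) and 80% power (z ≈ 0.8416); both values and the function name are ours:

```python
import math

def sample_size_two_proportions(p1, p2, z_alpha=1.96, z_beta=0.8416):
    """Per-group n to detect a p1 vs p2 difference (unpooled-variance formula)."""
    numerator = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))
    return math.ceil(numerator / (p1 - p2) ** 2)

n = sample_size_two_proportions(0.5, 0.4)  # 385 per group with the default quantiles
```

Halving the detectable effect size roughly quadruples the required n, which is why GWAS of small-effect variants need very large cohorts.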
Visualization in Genomics
Genomic Data Visualization
Circos Plots
Manhattan Plots
Heatmaps
Real-World Application: Population Genomics Analysis
Population genomics analysis helps us understand evolutionary history and genetic diversity patterns.
Population Structure Analysis
# Population genomics analysis for human diversity study
import math  # needed for math.sqrt below
pop_gen_params = {
'populations': ['YRI', 'CEU', 'ASN'], # Yoruba, European, Asian
'sample_size': {'YRI': 108, 'CEU': 99, 'ASN': 97}, # Sample sizes
'snps': 500000, # Number of SNPs analyzed
'linkage_disequilibrium_decay': 0.1, # LD decay parameter
'migration_rates': [[0, 0.002, 0.001], [0.002, 0, 0.003], [0.001, 0.003, 0]], # Migration matrix
'ancestral_populations': 3, # K value for admixture analysis
'recombination_rate': 1e-8 # per bp per generation
}
# Calculate genetic differentiation statistics
# FST between populations
fst_matrix = [[0 for _ in range(len(pop_gen_params['populations']))] for _ in range(len(pop_gen_params['populations']))]
# Island-model approximation for FST (real analyses use e.g. Weir & Cockerham's estimator)
for i, pop1 in enumerate(pop_gen_params['populations']):
    for j, pop2 in enumerate(pop_gen_params['populations']):
        if i != j:
            avg_sample_size = (pop_gen_params['sample_size'][pop1] + pop_gen_params['sample_size'][pop2]) / 2
            # Simplified FST = 1 / (1 + 4*N*m) under the island model
            fst_value = 1 / (1 + 4 * pop_gen_params['migration_rates'][i][j] * avg_sample_size)
            fst_matrix[i][j] = fst_value
# Calculate nucleotide diversity (pi)
theta_pi_expected = 0.001 # per bp (human average)
genome_size = 3.2e9 # bp
# Expected number of pairwise differences genome-wide (rough illustration)
expected_pairwise_differences = theta_pi_expected * genome_size
# Calculate Tajima's D expectation for each population
# Under neutral evolution, D ~ N(0, sqrt(variance))
tajima_d_variance = 1 # Simplified variance
for pop in pop_gen_params['populations']:
    sample_size = pop_gen_params['sample_size'][pop]
    # Expected Tajima's D under neutrality is zero
    expected_tajima_d = 0
    # Placeholder standard deviation; the true variance depends on sample size and S
    sd_tajima_d = math.sqrt(tajima_d_variance)
# Estimate effective population sizes from diversity
# Theta = 4*Ne*mu, so Ne = theta/(4*mu)
mutation_rate = 1.2e-8 # per bp per generation (human average)
for i, pop in enumerate(pop_gen_params['populations']):
    # Estimate Ne from observed diversity
    sample_size = pop_gen_params['sample_size'][pop]
    estimated_diversity = theta_pi_expected * (1 - 0.1 * i)  # Simulated slight variation
    estimated_ne = estimated_diversity / (4 * mutation_rate)
    print(f"Population {pop}:")
    print(f" Estimated effective population size: {estimated_ne:,.0f}")
    print(f" Sample size: {sample_size}")
    print(f" Expected nucleotide diversity: {estimated_diversity:.5f}")
# Population differentiation matrix
print(f"\nFST differentiation matrix:")
pop_names = pop_gen_params['populations']
for i, name1 in enumerate(pop_names):
    for j, name2 in enumerate(pop_names):
        if i != j:
            print(f" {name1} vs {name2}: {fst_matrix[i][j]:.4f}")
# Estimate admixture proportions
# Using simplified model for demonstration
admixture_results = {}
for pop in pop_gen_params['populations']:
    # Expected admixture proportions based on the migration matrix
    total_migration = sum(pop_gen_params['migration_rates'][pop_names.index(pop)])
    ancestral_contributions = []
    for anc in range(pop_gen_params['ancestral_populations']):
        # Simulated ancestral contribution: mostly own ancestry, small migrant share
        if anc == pop_names.index(pop):
            contrib = 1 - total_migration * 0.1
        else:
            contrib = total_migration * 0.1 / (len(pop_gen_params['populations']) - 1)
        ancestral_contributions.append(contrib)
    admixture_results[pop] = ancestral_contributions
print(f"\nSimulated admixture proportions (by population):")
for pop, props in admixture_results.items():
    print(f" {pop}: {[f'{p:.3f}' for p in props]}")
# Time to most recent common ancestor (TMRCA) estimation
# For a diploid population: T_MRCA ~ 4*Ne generations
for i, pop in enumerate(pop_gen_params['populations']):
    sample_size = pop_gen_params['sample_size'][pop]
    # Recompute this population's diversity (a stale loop variable would otherwise
    # carry over only the last population's value)
    estimated_diversity = theta_pi_expected * (1 - 0.1 * i)
    estimated_ne = estimated_diversity / (4 * mutation_rate)
    tmrca_generations = 4 * estimated_ne
    tmrca_years = tmrca_generations * 25  # 25 years per generation
    print(f" {pop} TMRCA: {tmrca_years:,.0f} years ago ({tmrca_generations:,.0f} generations)")
# Selection scan analysis
# Calculate expected distribution of allele frequencies under neutrality
num_snps = pop_gen_params['snps']
expected_distribution = {
'rare': num_snps * 0.1, # Alleles < 0.01 frequency
'low_freq': num_snps * 0.15, # Alleles 0.01-0.05
'intermediate': num_snps * 0.5, # Alleles 0.05-0.95
'high_freq': num_snps * 0.15, # Alleles 0.95-0.99
'fixed': num_snps * 0.1 # Alleles >= 0.99
}
print(f"\nExpected allele frequency distribution (under neutrality):")
for category, count in expected_distribution.items():
    print(f" {category}: {count:.0f} SNPs ({count/num_snps*100:.1f}%)")
# Detect departures from neutrality (potential selection)
observed_dist = {
'rare': expected_distribution['rare'] * 0.95, # Slight decrease due to selection
'low_freq': expected_distribution['low_freq'] * 1.1,
'intermediate': expected_distribution['intermediate'] * 1.05,
'high_freq': expected_distribution['high_freq'] * 0.9,
'fixed': expected_distribution['fixed'] * 0.85
}
print(f"Simulated observed distribution:")
for category, count in observed_dist.items():
    print(f" {category}: {count:.0f} SNPs ({count/num_snps*100:.1f}%)")
# Calculate neutrality test statistics
differences = {cat: observed_dist[cat] - expected_distribution[cat]
               for cat in expected_distribution}
excess_categories = [cat for cat, diff in differences.items() if abs(diff) > 0.05 * num_snps]
print(f"\nCategories showing departure from neutrality (|diff| > {0.05 * num_snps:.0f} SNPs): {excess_categories}")
if excess_categories:
    print(" These may indicate regions under selection or demographic effects")
else:
    print(" Allele frequency distribution consistent with neutral expectations")
Demographic Inference
Understanding population history through genomic data analysis.
Your Challenge: Variant Calling and Filtering Pipeline
Design a computational pipeline for detecting and filtering genetic variants from sequencing data.
Goal: Create a variant calling workflow with appropriate quality filtering.
Sequencing Data Analysis
import math
# Variant analysis for whole exome sequencing data
seq_params = {
'coverage_depth': 30, # Average coverage per base
'error_rate': 0.001, # Per base error rate (0.1%)
'reference_genome': 'GRCh38',
'aligner': 'BWA-MEM',
'variant_caller': 'GATK HaplotypeCaller',
'target_regions': 50000000, # bp (50 Mb exome)
'false_discovery_rate': 0.05, # 5% FDR target
'quality_threshold': 30, # Minimum variant quality score
'strand_bias_limit': 0.3, # Maximum allowed strand bias
'allele_frequency_threshold': 0.05 # Minimum frequency to consider variant
}
# Calculate raw variant statistics
expected_errors_per_sample = seq_params['target_regions'] * seq_params['coverage_depth'] * seq_params['error_rate']
number_of_variants_detected = int(expected_errors_per_sample * 2) # 2x to account for real variants
# Apply quality filtering
# Based on GATK best practices filters
dp_filter_threshold = 10 # Minimum depth
qual_filter_threshold = seq_params['quality_threshold']
sb_filter_threshold = seq_params['strand_bias_limit']
# Calculate probability of true variant vs. sequencing error
prior_error_prob = seq_params['error_rate']
posterior_variant_prob = 0.1 # After quality filtering
# Calculate false discovery rate after filtering
# If we expect 5% FDR, then 5% of called variants should be false
expected_false_calls = number_of_variants_detected * seq_params['false_discovery_rate']
expected_true_calls = number_of_variants_detected - expected_false_calls
# Apply strand bias filtering
# Calculate strand bias for each variant (production callers typically use
# Fisher's exact test on the 2x2 strand-count table; this is a simpler proxy)
def calculate_strand_bias(ref_forward, ref_reverse, alt_forward, alt_reverse):
    # Simplified strand bias: deviation of the forward-read fraction from 0.5
    total_ref = ref_forward + ref_reverse
    total_alt = alt_forward + alt_reverse
    if total_alt == 0:
        return 0.0
    forward_fraction = (ref_forward + alt_forward) / (total_ref + total_alt)
    strand_bias = abs(forward_fraction - 0.5) * 2
    return min(strand_bias, 1.0)  # Cap at 1.0
# Simulated variant data
sample_variants = [
{'position': 100000, 'ref_depth': 25, 'alt_depth': 12, 'quality': 45,
'ref_forward': 15, 'ref_reverse': 10, 'alt_forward': 7, 'alt_reverse': 5},
{'position': 150000, 'ref_depth': 30, 'alt_depth': 8, 'quality': 25,
'ref_forward': 25, 'ref_reverse': 5, 'alt_forward': 8, 'alt_reverse': 0},
{'position': 200000, 'ref_depth': 20, 'alt_depth': 3, 'quality': 60,
'ref_forward': 12, 'ref_reverse': 8, 'alt_forward': 2, 'alt_reverse': 1},
{'position': 250000, 'ref_depth': 40, 'alt_depth': 15, 'quality': 50,
'ref_forward': 20, 'ref_reverse': 20, 'alt_forward': 7, 'alt_reverse': 8}
]
# Apply filtering criteria
passing_variants = []
filtered_reasons = {'low_quality': 0, 'strand_bias': 0, 'low_coverage': 0, 'low_frequency': 0}
for variant in sample_variants:
    depth = variant['ref_depth'] + variant['alt_depth']
    frequency = variant['alt_depth'] / depth if depth > 0 else 0
    # Calculate strand bias
    sb_score = calculate_strand_bias(
        variant['ref_forward'], variant['ref_reverse'],
        variant['alt_forward'], variant['alt_reverse']
    )
    # Apply filters in priority order
    passes_filters = True
    reason = ""
    if variant['quality'] < qual_filter_threshold:
        passes_filters = False
        reason = "low_quality"
    elif sb_score > sb_filter_threshold:
        passes_filters = False
        reason = "strand_bias"
    elif depth < dp_filter_threshold:
        passes_filters = False
        reason = "low_coverage"
    elif frequency < seq_params['allele_frequency_threshold']:
        passes_filters = False
        reason = "low_frequency"
    if passes_filters:
        passing_variants.append(variant)
    else:
        filtered_reasons[reason] += 1
# Calculate sensitivity and specificity estimates
# Based on expected error rates and filtering
theoretical_precision = 1 - seq_params['false_discovery_rate'] # After filtering
theoretical_sensitivity = 0.90 # Estimated sensitivity after quality filtering
# Calculate positive predictive value (PPV) after filtering
prior_prob_real = 0.01 # 1% of positions have true variants
posterior_prob_real = (prior_prob_real * theoretical_sensitivity) / \
((prior_prob_real * theoretical_sensitivity) +
((1 - prior_prob_real) * seq_params['false_discovery_rate']))
# Calculate coverage uniformity
# Standard deviation of coverage depth across regions
coverage_uniformity_cv = 0.25 # Typical for well-designed exome capture
# Calculate callable regions (regions with sufficient coverage)
callable_fraction = 0.90 # 90% of target regions have >10x coverage
callable_bases = seq_params['target_regions'] * callable_fraction
# Estimated variant load per sample
expected_rare_variants = callable_bases * 0.001 # 1 per 1000 bases
expected_common_variants = callable_bases * 0.0001 # 1 per 10000 bases
Create a variant calling and filtering pipeline to identify high-confidence genetic variants.
Hint:
- Consider multiple quality metrics for variant filtering
- Account for coverage depth and uniformity
- Evaluate strand bias and other systematic artifacts
- Estimate precision and recall after filtering
# TODO: Calculate variant filtering results
high_quality_variants = 0 # Number of variants passing all filters
filtering_efficiency = 0 # Fraction of variants that pass filters
call_rate = 0 # Fraction of target sites with sufficient coverage
precision_after_filtering = 0 # Fraction of called variants that are true positives
strand_bias_score = 0 # Average strand bias across filtered variants
# Calculate results from analysis above
high_quality_variants = len(passing_variants)
filtering_efficiency = high_quality_variants / len(sample_variants) if sample_variants else 0
call_rate = callable_fraction
precision_after_filtering = posterior_prob_real
if passing_variants:
    strand_bias_score = sum(
        calculate_strand_bias(v['ref_forward'], v['ref_reverse'],
                              v['alt_forward'], v['alt_reverse'])
        for v in passing_variants
    ) / len(passing_variants)
else:
    strand_bias_score = 0
# Calculate additional metrics
coverage_uniformity = 1 - coverage_uniformity_cv # Higher is better uniformity
expected_variants_passing_filters = expected_rare_variants * theoretical_sensitivity # After sensitivity loss
# Print results
print(f"Variant calling results:")
print(f" Raw variants detected: {len(sample_variants)}")
print(f" High-quality variants: {high_quality_variants}")
print(f" Filtering efficiency: {filtering_efficiency:.2f}")
print(f" Callable genome fraction: {call_rate:.3f}")
print(f" Estimated precision: {precision_after_filtering:.3f}")
print(f" Strand bias score: {strand_bias_score:.3f}")
print(f" Coverage uniformity: {coverage_uniformity:.3f}")
# Quality assessment
if precision_after_filtering > 0.9 and filtering_efficiency > 0.7:
    quality_assessment = "High quality - good balance of precision and sensitivity"
elif precision_after_filtering > 0.8:
    quality_assessment = "Good quality - high precision with some sensitivity loss"
else:
    quality_assessment = "Low quality - significant filtering artifact concerns"
print(f" Quality assessment: {quality_assessment}")
# Reporting metrics
for category, count in filtered_reasons.items():
    if count > 0:
        print(f" Filtered for {category}: {count} variants")
# Clinical interpretation
if high_quality_variants > 0:
    clinical_interpretation = f"Identified {high_quality_variants} high-confidence variants for further analysis"
else:
    clinical_interpretation = "No high-confidence variants identified - consider additional samples or quality improvements"
print(f" Clinical interpretation: {clinical_interpretation}")
How would you modify the filtering criteria if you were specifically looking for rare pathogenic variants versus common population variants?
Self-Examination
How do sequence alignment algorithms efficiently compare genomic sequences?
What are the challenges in assembling and annotating large eukaryotic genomes?
How can population genomic analyses reveal evolutionary history and demographic patterns?