Bioinformatics
Sequence alignment, structural prediction, and genomic data analysis.
Bioinformatics
Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data, particularly when applied to molecular biology. It combines biology, computer science, mathematics, and statistics to analyze and interpret biological data.
Sequence Analysis and Alignment
Sequence Representation
DNA/RNA Sequences
Protein Sequences
Sequence Alignment Algorithms
Dynamic Programming Approach
Needleman-Wunsch Algorithm (Global Alignment)
Where:
- is the score matrix
- is the similarity score between residues
- is the gap penalty
Smith-Waterman Algorithm (Local Alignment)
Scoring Matrices
DNA Scoring
Protein Scoring: PAM and BLOSUM Matrices
Where is the observed probability of substitution and are background frequencies.
Multiple Sequence Alignment (MSA)
Progressive Alignment (ClustalW Algorithm)
Where are guide tree branch weights.
Hidden Markov Models (HMMs)
Where is the observation sequence, is the state sequence, and represents model parameters.
Database Searching
BLAST Algorithm (Basic Local Alignment Search Tool)
Statistical Significance
Or equivalently:
Where:
- is the expectation value
- , are query and database lengths
- , , are statistical parameters
BLAST Score Calculation
FASTA Algorithm
Phylogenetic Analysis
Evolutionary Distance Metrics
p-distance
Jukes-Cantor Model
Kimura 2-parameter Model
Where is proportion of transitions and is proportion of transversions.
Phylogenetic Tree Construction
Neighbor-Joining Algorithm
Where is the row sum.
Maximum Parsimony
Maximum Likelihood
Where are model parameters and is the tree topology.
Genomic Analysis
Sequence Assembly
Overlap-Layout-Consensus Method
De Bruijn Graphs
Gene Prediction and Annotation
Ab Initio Gene Prediction
Hidden Markov Models for Gene Finding
Whole Genome Alignment
Synteny Analysis
Protein Structure Prediction
Primary Structure Analysis
Secondary Structure Prediction
Homology Modeling
Threading/Template-based Modeling
Where are weights and are different scoring terms.
De Novo Structure Prediction
Functional Genomics Analysis
RNA-Seq Analysis
Quality Control and Mapping
Differential Expression Analysis
Normalization
ChIP-Seq Analysis
Population Genomics
Variant Calling
GWAS (Genome-Wide Association Studies)
Population Structure Analysis
Statistical Methods in Bioinformatics
Multiple Testing Correction
Bonferroni Correction
Where is the number of tests.
Benjamini-Hochberg Procedure (FDR)
Machine Learning in Bioinformatics
Supervised Learning
Feature Selection
Metagenomics
Sequence Classification
Network Biology
Protein-Protein Interaction Networks
Gene Regulatory Networks
Big Data Approaches in Bioinformatics
Cloud Computing Platforms
- Amazon Web Services (AWS): EC2, S3, EMR
- Google Cloud Platform (GCP): Google Genomics API
- Microsoft Azure: HDInsight
Scalable Computing Frameworks
- Apache Spark: Distributed data processing
- Hadoop: MapReduce for large datasets
- Kubernetes: Container orchestration
Real-World Application: Phylogenetic Analysis of Viral Evolution
Phylogenetic analysis is critical for understanding viral evolution and tracking disease spread.
Viral Sequence Analysis
# Phylogenetic analysis for viral evolution tracking
viral_data = {
'sequences': 100, # Number of sequences
'alignment_length': 30000, # Length of viral genome alignment
'mutation_rate': 1e-4, # Mutations per site per year
'divergence_rates': [0.001, 0.005, 0.01, 0.05], # Proportion different from reference
'bootstrap_support': [70, 85, 95, 60], # Confidence values for tree nodes
'collection_dates': ['2020-01', '2020-03', '2020-06', '2021-01'] # Time stamps
}
# Calculate evolutionary rate and divergence
evolutionary_distance = {} # Store distance calculations
for i, div_rate in enumerate(viral_data['divergence_rates']):
# Using Jukes-Cantor correction for multiple hits
corrected_distance = -(3/4) * math.log(1 - (4/3) * div_rate)
evolutionary_distance[f'strain_{i}'] = corrected_distance
# Estimate time to most recent common ancestor (TMRCA)
# Molecular clock assumption: distance = rate * time
mean_divergence = sum(viral_data['divergence_rates']) / len(viral_data['divergence_rates'])
tmrca_years = mean_divergence / viral_data['mutation_rate'] / 2 # Divide by 2 for pairwise comparison
# Calculate phylogenetic tree confidence
mean_bootstrap = sum(viral_data['bootstrap_support']) / len(viral_data['bootstrap_support'])
tree_confidence = "High" if mean_bootstrap > 80 else "Medium" if mean_bootstrap > 60 else "Low"
print(f"Viral evolution analysis:")
print(f" Number of sequences: {viral_data['sequences']}")
print(f" Alignment length: {viral_data['alignment_length']:,} bp")
print(f" Estimated mutation rate: {viral_data['mutation_rate']*1e6:.0f} x 10⁻⁶ per site per year")
print(f" Mean divergence: {mean_divergence*100:.2f}%")
print(f" Estimated TMRCA: {tmrca_years:.2f} years ago")
print(f" Mean bootstrap support: {mean_bootstrap:.1f}%")
print(f" Tree confidence: {tree_confidence}")
# Assess epidemic growth rate
if len(viral_data['collection_dates']) >= 2:
# Calculate growth rate from sequence accumulation
import datetime
from dateutil.parser import parse
# Convert dates to numeric values for growth calculation
date_values = [i for i in range(len(viral_data['collection_dates']))]
# Simple exponential growth model: N(t) = N₀ * e^(rt)
# Calculate growth rate r
time_span = len(viral_data['collection_dates']) - 1 # intervals
if time_span > 0:
initial_sequences = 1 # Assume started with 1 sequence
final_sequences = viral_data['sequences']
growth_rate = math.log(final_sequences / initial_sequences) / time_span
doubling_time = math.log(2) / growth_rate if growth_rate > 0 else float('inf')
print(f" Growth rate estimate: {growth_rate:.3f} per sampling interval")
print(f" Doubling time: {doubling_time:.1f} intervals")
else:
print(" Insufficient temporal data for growth rate calculation")
# Transmission cluster identification
cluster_cutoff = 0.001 # 0.1% divergence threshold
clusters_identified = sum(1 for d in viral_data['divergence_rates'] if d < cluster_cutoff)
print(f" Putative transmission clusters: {clusters_identified}")
Evolutionary Insights
Understanding viral evolution patterns from sequence analysis.
Your Challenge: Multiple Sequence Alignment Analysis
Analyze a set of protein sequences to identify conserved regions and functional domains using bioinformatics tools.
Goal: Perform multiple sequence alignment and identify functionally important regions.
Sequence Analysis Data
import math
# Protein sequence dataset
sequences = {
'seq1': {
'name': 'ProteinA_human',
'sequence': 'MKTIIIACTSLVGVIRLAKFIDKQLKTLKESNQVVRNLIAHQQPLLKAIQDDKIPSAIMRLGLDNVDLVVTDSSKDVETLLRIMQAEKDSVGLLNVQVQKGDYVKLGSVFTGLQSLIKELKVDNFGVDFKQRIKVVSVFQKDDLKELKTNVQEKVQELAQELQLAEKQKLDKIEKEQDKIQQVAKYFKDLDLVVQLKKLVQKDLNNLLKDKLQKLDKVAKQLEDKISLLKKILDKKLSVLNKLKDLQKDLKDLKDLLQDKLQKLDQLLEKLDKLQKDLKDLKDILQDKLQKLDQLLEKLDKLQKDLKDLKDLLQDKLQKLDQLLEKLDKLQK',
'length': 420,
'organism': 'Homo sapiens',
'function': 'kinase'
},
'seq2': {
'name': 'ProteinA_mouse',
'sequence': 'MKTVIACTSLVGVIRLAKFIDKQLKTLKESNQVVRNLIAHQQPLLKAIQDDKIPSAIMRLGLDNVDLVVTDSSKDVETLLRIMQAEKDSVGLLNVQVQKGDYVKLGSVFTGLQSLIKELKVDNFGVDFKQRIKVVSVFQKDDLKELKTNVQEKVQELAQELQLAEKQKLDKIEKEQDKIQQVAKYFKDLDLVVQLKKLVQKDLNNLLKDKLQKLDKVAKQLEDKISLLKKILDKKLSVLNKLKDLQKDLKDLKDLLQDKLQKLDQLLEKLDKLQKDLKDLKDLLQDKLQKLDQLLEKLDKLQKDLKDLKDLLQDKLQKLDQLLEKLDKLQK',
'length': 420,
'organism': 'Mus musculus',
'function': 'kinase'
},
'seq3': {
'name': 'ProteinA_rat',
'sequence': 'MKTVIACTSLVGVIRLAKFIDKQLKTLKESNQVVRNLIAHQQPLLKAIQDDKIPSAIMRLGLDNVDLVVTDSSKDVETLLRIMQAEKDSVGLLNVQVQKGDYVKLGSVFTGLQSLIKELKVDNFGVDFKQRIKVVSVFQKDDLKELKTNVQEKVQELAQELQLAEKQKLDKIEKEQDKIQQVAKYFKDLDLVVQLKKLVQKDLNNLLKDKLQKLDKVAKQLEDKISLLKKILDKKLSVLNKLKDLQKDLKDLKDLLQDKLQKLDQLLEKLDKLQKDLKDLKDLLQDKLQKLDQLLEKLDKLQKDLKDLKDLLQDKLQKLDQLLEKLDKLQK',
'length': 420,
'organism': 'Rattus norvegicus',
'function': 'kinase'
}
}
# Calculate pairwise identities
def calculate_identity(seq1, seq2):
matches = 0
min_len = min(len(seq1), len(seq2))
for i in range(min_len):
if seq1[i] == seq2[i]:
matches += 1
return (matches / min_len) * 100
# Perform basic multiple sequence alignment analysis
identities = []
for i, (name1, data1) in enumerate(sequences.items()):
for j, (name2, data2) in enumerate(sequences.items()):
if i < j: # Avoid comparing sequence with itself
identity = calculate_identity(data1['sequence'], data2['sequence'])
identities.append({'seq1': name1, 'seq2': name2, 'identity': identity})
# Calculate average sequence identity
avg_identity = sum(id['identity'] for id in identities) / len(identities) if identities else 0
# Identify conserved regions (regions with 100% identity across all sequences)
conserved_regions = []
min_seq_len = min(len(data['sequence']) for data in sequences.values())
for pos in range(min_seq_len):
residues_at_pos = [data['sequence'][pos] for data in sequences.values()]
if all(r == residues_at_pos[0] for r in residues_at_pos):
conserved_regions.append(pos)
# Calculate conservation score along sequence (sliding window approach)
window_size = 10
conservation_profile = []
for pos in range(min_seq_len - window_size + 1):
window_conserved = 0
for i in range(window_size):
window_pos = pos + i
residues_at_pos = [data['sequence'][window_pos] for data in sequences.values()]
# Count how many positions are identical
if all(r == residues_at_pos[0] for r in residues_at_pos):
window_conserved += 1
conservation_score = window_conserved / window_size
conservation_profile.append(conservation_score)
# Find peaks in conservation profile (highly conserved regions)
highly_conserved_regions = []
for i, score in enumerate(conservation_profile):
if score > 0.8: # 80% conservation in window
highly_conserved_regions.append((i, i + window_size, score))
# Functional domain prediction based on conservation
if highly_conserved_regions:
domain_predictions = []
for start, end, score in highly_conserved_regions:
# Assume that highly conserved regions correspond to functional domains
domain_predictions.append({
'start': start,
'end': end,
'conservation_score': score,
'predicted_function': 'active_site_motif' if score > 0.9 else 'structural_domain'
})
# Calculate entropy as a measure of conservation (lower entropy = higher conservation)
def calculate_entropy(position):
residues = [data['sequence'][position] for data in sequences.values()]
unique_residues = set(residues)
entropy = 0
for residue in unique_residues:
freq = residues.count(residue) / len(residues)
if freq > 0:
entropy -= freq * math.log2(freq)
return entropy
# Calculate average entropy across aligned positions
entropies = [calculate_entropy(pos) for pos in range(min_seq_len)]
avg_entropy = sum(entropies) / len(entropies) if entropies else float('inf')
Perform detailed sequence analysis to identify conserved regions and predict functional domains.
Hint:
- Calculate sequence identity and similarity
- Identify conserved sequence motifs
- Predict functional domains based on conservation
- Consider the relationship between conservation and function
# TODO: Calculate sequence analysis parameters
percent_identity = 0 # Average pairwise sequence identity
conserved_positions = 0 # Number of positions conserved across all sequences
functional_domains = [] # List of predicted functional domains
entropy_measure = 0 # Average Shannon entropy across alignable positions
motif_predictions = [] # Predicted functional motifs
# Calculate average percent identity
if identities:
percent_identity = sum(id['identity'] for id in identities) / len(identities)
# Calculate conserved positions
conserved_positions = len(conserved_regions)
# Calculate average entropy
if entropies:
entropy_measure = sum(entropies) / len(entropies)
# Predict functional domains based on conservation profile
for start, end, score in highly_conserved_regions:
if score > 0.9:
domain_type = "Active site"
elif score > 0.7:
domain_type = "Binding site"
else:
domain_type = "Structural motif"
functional_domains.append({
'start': start,
'end': end,
'type': domain_type,
'score': score
})
# Identify motifs based on conserved residues
motif_length = 5
motif_occurrences = {}
for pos in conserved_regions:
if pos + motif_length <= min_seq_len:
motif = sequences[list(sequences.keys())[0]]['sequence'][pos:pos + motif_length]
if motif.isalpha() and len(motif) == motif_length: # Valid amino acid motif
motif_occurrences[motif] = motif_occurrences.get(motif, 0) + 1
# Store high-frequency conserved motifs
for motif, freq in motif_occurrences.items():
if freq >= len(sequences) * 0.8: # Present in at least 80% of sequences
motif_predictions.append({
'motif': motif,
'frequency': freq,
'type': 'conserved_motif'
})
# Print results
print(f"Average sequence identity: {percent_identity:.2f}%")
print(f"Conserved positions: {conserved_positions} out of {min_seq_len}")
print(f"Average entropy: {entropy_measure:.3f}")
print(f"Number of functional domains predicted: {len(functional_domains)}")
print(f"Number of conserved motifs predicted: {len(motif_predictions)}")
# Domain characterization
domain_stats = {}
for domain in functional_domains:
domain_type = domain['type']
domain_stats[domain_type] = domain_stats.get(domain_type, 0) + 1
for dom_type, count in domain_stats.items():
print(f" {dom_type} domains: {count}")
# Functional implications
if avg_identity > 90:
functional_implication = "Highly conserved - likely essential function"
elif avg_identity > 70:
functional_implication = "Moderately conserved - conserved function with species-specific adaptations"
else:
functional_implication = "Divergent sequences - possible functional differences"
print(f"Functional implication: {functional_implication}")
How might the identified conserved domains be related to the protein's catalytic mechanism and what experimental validation would you propose?
ELI10 Explanation
Simple analogy for better understanding
Self-Examination
How do sequence alignment algorithms work and what are their applications?
What methods are used for protein structure prediction and validation?
How is genomic data analyzed and interpreted in modern research?