Chapter 12

Computational Biology & AI in Biotech

Machine learning algorithms for biological data analysis, deep learning in genomics and proteomics, systems modeling and simulation, network analysis and pathway reconstruction, drug discovery and design using AI.

Computational Biology & AI in Biotechnology

Computational biology combines computer science, mathematics, and statistics to understand and model biological systems. With the exponential growth of biological data, AI and machine learning have become essential tools for extracting meaningful insights from complex datasets.

Introduction to Computational Biology

Data-Driven Biology Revolution

The scale of biological data has grown exponentially:

Data volume growth: $V(t) = V_0 \cdot e^{\lambda t}$

where biological datasets have been doubling roughly every seven months, far outpacing Moore's law.

Key Applications

  • Genomics: Sequence analysis, variant calling, annotation
  • Proteomics: Structure prediction, function prediction
  • Systems biology: Network inference, pathway modeling
  • Drug discovery: Virtual screening, ADMET prediction
  • Precision medicine: Biomarker discovery, treatment stratification

Machine Learning in Biology

Supervised Learning

Classification Problems

$$f: \mathbf{X} \rightarrow Y, \quad Y \in \{\text{classes}\}$$

Where $\mathbf{X}$ represents biological features (expression, sequence, structure).

Regression Problems

$$f: \mathbf{X} \rightarrow \mathbb{R}, \quad \text{e.g., binding affinity, stability}$$

Common ML Algorithms in Biology

Support Vector Machines (SVMs)

$$\min_{\mathbf{w}, b, \xi} \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n} \xi_i$$

Subject to:

$$y_i(\mathbf{w}^T\phi(\mathbf{x}_i) + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$

Random Forests

Ensemble classifier: $F(\mathbf{x}) = \frac{1}{T} \sum_{t=1}^{T} f_t(\mathbf{x})$

Where $f_t$ are the individual decision trees.

Neural Networks

Feedforward Networks
$$\mathbf{a}^{(l+1)} = \sigma(\mathbf{W}^{(l)}\mathbf{a}^{(l)} + \mathbf{b}^{(l)})$$

Where $\sigma$ is the activation function.

Backpropagation
$$\frac{\partial L}{\partial w_{ij}^{(l)}} = \delta_j^{(l+1)} a_i^{(l)}$$

Where $\delta_j^{(l+1)}$ is the error term for neuron $j$ in layer $l+1$.

Unsupervised Learning

Clustering

$$\min \sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} \|\mathbf{x} - \boldsymbol{\mu}_i\|^2$$

For k-means clustering with centroids $\boldsymbol{\mu}_i$.
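The k-means objective above can be run directly with scikit-learn; a minimal sketch on simulated expression-like data (the three-cluster layout and all values are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Simulate three "cell populations" in a 2-gene expression space (illustrative)
data = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
print(km.cluster_centers_)   # learned centroids mu_i
print(km.inertia_)           # the within-cluster sum of squares being minimized
```

`inertia_` is exactly the objective in the equation above, evaluated at the fitted centroids.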

Dimensionality Reduction

Principal Component Analysis (PCA)
Principal Component Analysis (PCA)

$$\mathbf{X} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$$

$$\text{PC}_i = \sum_{j=1}^{p} \mathbf{X}_j \cdot \text{loadings}_{ij}$$

t-Distributed Stochastic Neighbor Embedding (t-SNE)

$$P_{j|i} = \frac{\exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|\mathbf{x}_i - \mathbf{x}_k\|^2 / 2\sigma_i^2)}$$

$$Q_{j|i} = \frac{(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2)^{-1}}{\sum_{k \neq i} (1 + \|\mathbf{y}_i - \mathbf{y}_k\|^2)^{-1}}$$

Optimize: $\text{KL}(P \,\|\, Q)$

Deep Learning in Biology

Convolutional Neural Networks (CNNs)

Architecture for Biological Sequences

Input: sequences $s_1, \ldots, s_n$

$$(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m] \cdot g[n-m]$$

Where filters $f$ detect motifs in biological sequences.

Protein Sequence Analysis

$$\mathbf{S} \in \mathbb{R}^{L \times 20} \rightarrow \text{features} \rightarrow \text{prediction}$$

Where $L$ is the sequence length and 20 is the size of the amino acid alphabet.

Recurrent Neural Networks (RNNs)

Long Short-Term Memory (LSTM)

$$\mathbf{f}_t = \sigma(\mathbf{W}_f \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)$$
$$\mathbf{i}_t = \sigma(\mathbf{W}_i \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i)$$
$$\mathbf{C}_t = \mathbf{f}_t \odot \mathbf{C}_{t-1} + \mathbf{i}_t \odot \tanh(\mathbf{W}_C \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_C)$$
$$\mathbf{o}_t = \sigma(\mathbf{W}_o \cdot [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o)$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{C}_t)$$
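A single LSTM time step can be written out in NumPy following the gate equations above; the input and hidden sizes and the weight initialization are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: forget, input, cell, output gates."""
    z = np.concatenate([h_prev, x_t])                              # [h_{t-1}, x_t]
    f = sigmoid(params['Wf'] @ z + params['bf'])                   # forget gate
    i = sigmoid(params['Wi'] @ z + params['bi'])                   # input gate
    c = f * c_prev + i * np.tanh(params['Wc'] @ z + params['bc'])  # new cell state
    o = sigmoid(params['Wo'] @ z + params['bo'])                   # output gate
    h = o * np.tanh(c)                                             # new hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 4, 3   # input and hidden sizes (illustrative)
params = {f'W{g}': rng.normal(scale=0.1, size=(d_h, d_h + d_in)) for g in 'fico'}
params.update({f'b{g}': np.zeros(d_h) for g in 'fico'})

h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), params)
print(h.shape, c.shape)   # (3,) (3,)
```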

Transformer Models

Attention Mechanism

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

Self-Attention in Biology

Biological sequence: $s_1, s_2, \ldots, s_n$

Positional encoding: $PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$

Attention weight: $\alpha_{ij} = \frac{\exp(\text{score}(\mathbf{q}_i, \mathbf{k}_j))}{\sum_{k=1}^{n} \exp(\text{score}(\mathbf{q}_i, \mathbf{k}_k))}$
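Scaled dot-product attention is a few lines of NumPy; the sequence length and dimensions below are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # alpha_ij; each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_k, d_v = 6, 8, 4          # sequence length, key dim, value dim (illustrative)
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

out, alpha = attention(Q, K, V)
print(out.shape)           # (6, 4)
print(alpha.sum(axis=1))   # rows of attention weights sum to 1
```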

Graph Neural Networks (GNNs)

Protein Structure Analysis

$$\text{Protein as graph}: G = (V, E)$$

Where $V$ represents amino acid residues and $E$ represents spatial contacts.

Message Passing

$$\mathbf{m}_{u}^{(t+1)} = \text{AGGREGATE}^{(t)}\left(\{\mathbf{h}_v^{(t)} : v \in N(u)\}\right)$$
$$\mathbf{h}_u^{(t+1)} = \text{UPDATE}^{(t)}\left(\mathbf{h}_u^{(t)}, \mathbf{m}_{u}^{(t+1)}\right)$$

Sequence Analysis with ML

Sequence Classification

K-mer Based Representations

$$\mathbf{X} \in \mathbb{R}^{N \times |A|^k}$$

Where $N$ is the number of sequences, $A$ is the alphabet, and $k$ is the k-mer length.
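Building one row of this k-mer count matrix for a DNA sequence takes only the standard library; the example sequence is illustrative:

```python
from itertools import product

def kmer_counts(seq, k=3, alphabet="ACGT"):
    """Map a DNA sequence to a |A|^k-dimensional k-mer count vector."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    index = {km: i for i, km in enumerate(kmers)}
    vec = [0] * len(kmers)
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in index:          # skip k-mers containing ambiguous bases (e.g. N)
            vec[index[km]] += 1
    return vec

v = kmer_counts("ACGTACGT", k=3)
print(len(v))    # 4^3 = 64 dimensions
print(sum(v))    # 6 overlapping 3-mers in an 8-base sequence
```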

Convolutional Filters for Motif Detection

$$\text{Filter response}: r_{i,j} = \sum_{m,n} \mathbf{W}_{m,n} \cdot \mathbf{X}_{i+m,j+n}$$

Multiple Sequence Alignment (MSA)

$$\text{MSA}: \{\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_N\} \rightarrow \mathbf{S}_{\text{aligned}} \in \mathbb{R}^{N \times L}$$

$$\text{Evolutionary coupling}: EC_{ij} = |P_{ij}^{ab} - P_i^a \cdot P_j^b|$$

Where $P_{ij}^{ab}$ is the joint probability of amino acids $a$ and $b$ at positions $i$ and $j$, and $P_i^a$ is the marginal probability.

Structure Prediction

AlphaFold Architecture

Evoformer

Pair representations: $\mathbf{z}_{i,j} \in \mathbb{R}^{c_z}$
Single representations: $\mathbf{s}_i \in \mathbb{R}^{c_s}$
MSA representations: $\mathbf{a}_{i,r} \in \mathbb{R}^{c_m}$

Structure Module

$$\text{Invariant point attention}: \mathbf{v}' = \text{MLP}(\mathbf{v}) + \text{attention}(\mathbf{v})$$

Training Objective

$$\mathcal{L} = \mathcal{L}_{\text{dist}} + \mathcal{L}_{\text{ang}} + \mathcal{L}_{\text{conf}}$$

Where:

  • $\mathcal{L}_{\text{dist}}$ = distance prediction loss
  • $\mathcal{L}_{\text{ang}}$ = angle prediction loss
  • $\mathcal{L}_{\text{conf}}$ = confidence estimation loss

Protein Folding with Deep Learning

$$\text{Sequence} \xrightarrow{\text{Neural network}} \text{Distance matrix} \xrightarrow{\text{Geometry optimization}} \text{3D coordinates}$$

$$\mathbf{X}_{\text{input}} \xrightarrow{\text{Attention}} \mathbf{Z}_{\text{pair}} \xrightarrow{\text{Structure module}} \mathbf{X}_{\text{3D}}$$

Genomic Applications

Variant Effect Prediction

SIFT Score

$$\text{SIFT score} = \frac{\sum_{a \in A_{\text{allowed}}} P(a|x)}{\sum_{a \in A} P(a|x)}$$

PolyPhen-2 Score

Naive Bayes classifier: $P(\text{deleterious}|x) = \frac{P(x|\text{deleterious}) \cdot P(\text{deleterious})}{P(x)}$

Gene Expression Analysis

Differential Expression

Log fold change: $LFC = \log_2\left(\frac{\text{condition A}}{\text{condition B}}\right)$

Statistical test (Welch's $t$): $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
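A minimal differential-expression sketch for a single gene, using SciPy's Welch's t-test; the simulated expression values and sample sizes are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Simulated expression of one gene in two conditions (illustrative values)
cond_a = rng.normal(loc=10.0, scale=1.0, size=12)
cond_b = rng.normal(loc=8.0, scale=1.5, size=12)

lfc = np.log2(cond_a.mean() / cond_b.mean())   # log2 fold change
# equal_var=False gives Welch's t-test, matching the unequal-variance formula above
t_stat, p_value = stats.ttest_ind(cond_a, cond_b, equal_var=False)

print(f"LFC = {lfc:.2f}, t = {t_stat:.2f}, p = {p_value:.2e}")
```

In a real RNA-seq analysis this per-gene test would be repeated across thousands of genes, with multiple-testing correction applied to the resulting p-values.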

Single-Cell RNA Sequencing Analysis

$$\text{Expression matrix}: \mathbf{X} \in \mathbb{R}^{n \times p}$$

Where $n$ is the number of cells and $p$ the number of genes.

$$\text{Normalization}: x_{ij}^{(\text{norm})} = \log\left(c_{ij} \cdot \frac{10^6}{\text{library}_i} + 1\right)$$
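This counts-per-million log normalization is a one-liner over a cell-by-gene count matrix; the toy counts below are illustrative:

```python
import numpy as np

def log_cpm(counts):
    """Library-size normalization: counts-per-million, then log(x + 1)."""
    lib = counts.sum(axis=1, keepdims=True)     # per-cell library size
    return np.log((counts / lib) * 1e6 + 1)

counts = np.array([[100, 300, 600],   # cell 1: library size 1000
                   [10,  30,  60]])   # cell 2: same proportions, library size 100
norm = log_cpm(counts)
print(np.allclose(norm[0], norm[1]))  # True: normalization removes depth differences
```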

Genomic Sequence Analysis

CNN for Regulatory Element Prediction

$$\text{Filter}: \mathbf{W}_k \in \mathbb{R}^{l \times 4}$$

Where $l$ is the filter length and 4 represents the nucleotides (A, C, G, T).

$$\text{Convolution}: h_{k,i} = \max(0, \mathbf{W}_k \cdot \mathbf{x}_{i:i+l-1} + b_k)$$

Regulatory Motif Discovery

$$\text{PWM score}: S = \sum_{i=1}^{l} \log \frac{P(b_i|\text{motif})}{P(b_i|\text{background})}$$
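Scoring a candidate site against a PWM is a direct translation of the log-likelihood ratio above; the 4-position motif probabilities here are hypothetical:

```python
import math

# Hypothetical 4-position motif: P(base | motif) per position (each column sums to 1)
pwm = [
    {'A': 0.7, 'C': 0.1, 'G': 0.1, 'T': 0.1},
    {'A': 0.1, 'C': 0.7, 'G': 0.1, 'T': 0.1},
    {'A': 0.1, 'C': 0.1, 'G': 0.7, 'T': 0.1},
    {'A': 0.1, 'C': 0.1, 'G': 0.1, 'T': 0.7},
]
background = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}

def pwm_score(site):
    """Log-likelihood ratio of a site under the motif vs. the background."""
    return sum(math.log2(pwm[i][b] / background[b]) for i, b in enumerate(site))

print(f"{pwm_score('ACGT'):.2f}")   # consensus site: strongly positive score
print(f"{pwm_score('TTTT'):.2f}")   # non-consensus site: negative score
```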

Drug Discovery Applications

Virtual Screening

Docking score: $DS = f(\text{shape complementarity}, \text{chemical interactions})$

QSAR model: $\text{Activity} = f(\text{molecular descriptors})$

ADMET Prediction

$$\text{ADMET property} = f(\text{molecular structure}, \text{QSAR model})$$

Solubility Prediction

$$\log S = \sum_{i=1}^{n} c_i \cdot f_i$$

Where $f_i$ are molecular descriptors and $c_i$ are fitted coefficients.

CYP450 Inhibition

$$P(\text{inhibitor}) = \sigma\left(\sum_i w_i x_i + b\right)$$

Where $\sigma$ is the sigmoid function, $w_i$ the weights, and $x_i$ the features.

Network Biology & Systems Modeling

Gene Regulatory Networks

$$\frac{d\mathbf{x}}{dt} = f(\mathbf{x}, \boldsymbol{\theta})$$

Where $\mathbf{x}$ is the vector of gene expression levels and $\boldsymbol{\theta}$ the model parameters.

GRN Inference

$$\mathbf{X} \sim \mathcal{N}(\mathbf{0}, \mathbf{C})$$

Where $\mathbf{C}$ is the covariance matrix; a nonzero entry of the precision matrix, $[\mathbf{C}^{-1}]_{ij} \neq 0$, indicates a candidate regulatory connection.

Metabolic Modeling

Flux Balance Analysis

$$\frac{d\mathbf{x}}{dt} = \mathbf{S}\mathbf{v} = \mathbf{0}$$

$$\max \mathbf{c}^T \mathbf{v} \quad \text{s.t.} \quad \mathbf{S}\mathbf{v} = \mathbf{0}, \quad \mathbf{v}_{\min} \leq \mathbf{v} \leq \mathbf{v}_{\max}$$

Where $\mathbf{S}$ is the stoichiometric matrix and $\mathbf{v}$ the flux vector.
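This optimization is a linear program, so a toy flux balance analysis can be run with SciPy's `linprog`; the three-reaction network and the uptake bound are illustrative:

```python
import numpy as np
from scipy.optimize import linprog

# Toy network: v1 (uptake -> A), v2 (A -> B), v3 (B -> biomass)
S = np.array([[1, -1,  0],    # mass balance for metabolite A
              [0,  1, -1]])   # mass balance for metabolite B
c = np.array([0, 0, -1])      # linprog minimizes, so maximize v3 via -v3
bounds = [(0, 10), (0, None), (0, None)]   # uptake flux capped at 10 units

res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds, method="highs")
print(res.x)       # optimal flux vector: the uptake bound forces v1 = v2 = v3 = 10
print(-res.fun)    # maximal biomass flux = 10.0
```

Genome-scale models work the same way, just with thousands of reactions; dedicated packages (e.g. COBRApy) wrap this LP with model I/O and analysis utilities.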

AI in Structural Biology

Cryo-EM Analysis

$$\mathbf{Y} = \mathbf{P}_{R,\mathbf{t}} \cdot \mathbf{V} + \mathbf{N}$$

Where $\mathbf{Y}$ is the observed image, $\mathbf{P}_{R,\mathbf{t}}$ the projection operator for rotation $R$ and translation $\mathbf{t}$, $\mathbf{V}$ the 3D volume, and $\mathbf{N}$ noise.

Deep Learning for Denoising

$$\mathbf{V}_{\text{denoised}} = f_{\theta}(\mathbf{Y})$$

Where $f_{\theta}$ is a neural network with parameters $\theta$.

Molecular Dynamics Enhancement

ML potential: $U_{ML} = f_{\theta}(\{\mathbf{r}_i\})$

$$\frac{d^2\mathbf{r}_i}{dt^2} = -\frac{1}{m_i} \frac{\partial U_{ML}}{\partial \mathbf{r}_i}$$

Machine Learning Pipelines

Cross-Validation

$$\text{CV error} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|D_k|} \sum_{i \in D_k} L(y_i, f_{-k}(x_i))$$

Where $D_k$ is the $k$-th fold and $f_{-k}$ is the model trained without fold $k$.
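In scikit-learn this whole procedure is one call; the synthetic dataset dimensions below are illustrative stand-ins for an omics feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic "omics-like" classification data (dimensions are illustrative)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)   # 5-fold CV: one accuracy per held-out fold

print(scores)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

For genomic data, take care that folds respect biological grouping (e.g. keep variants from the same gene, or samples from the same patient, in the same fold) to avoid leakage.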

Feature Engineering

One-hot Encoding

$$\text{DNA}: A \rightarrow [1,0,0,0], \; C \rightarrow [0,1,0,0], \; G \rightarrow [0,0,1,0], \; T \rightarrow [0,0,0,1]$$
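The encoding above maps a length-$L$ DNA sequence to an $L \times 4$ matrix; a minimal sketch:

```python
import numpy as np

def one_hot_dna(seq):
    """Encode a DNA sequence as an L x 4 matrix (columns: A, C, G, T)."""
    mapping = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq):
        if base in mapping:        # ambiguous bases (e.g. N) stay all-zero
            mat[i, mapping[base]] = 1.0
    return mat

enc = one_hot_dna("ACGT")
print(enc)
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 1. 0.]
#  [0. 0. 0. 1.]]
```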

Embedding Representations

$$\text{Protein}: \mathbf{E} \in \mathbb{R}^{L \times d_{\text{embed}}}$$

Where $d_{\text{embed}}$ is the embedding dimension.

Model Evaluation Metrics

Classification Metrics

$$\text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN}, \quad \text{F1} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

$$\text{AUC} = \int_0^1 \text{TPR}(\text{FPR}^{-1}(x)) \, dx$$

Regression Metrics

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

Interpretability in AI Models

SHAP Values

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}\left[f(S \cup \{i\}) - f(S)\right]$$

Where $\phi_i$ is the SHAP value for feature $i$, $N$ the set of all $n$ features, and $f(S)$ the model output using only the features in $S$.
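For small feature sets the Shapley sum can be evaluated exactly. A toy sketch with a hypothetical three-feature additive model (feature names and contribution values are made up for illustration; real workflows use the `shap` library with approximations):

```python
from itertools import combinations
from math import factorial

features = ['conservation', 'frequency', 'impact']   # hypothetical features
n = len(features)

# Toy additive model: f(S) is the model output using only the features in S
contributions = {'conservation': 2.0, 'frequency': -1.0, 'impact': 3.0}
def f(S):
    return sum(contributions[x] for x in S)

def shapley(i):
    """Exact Shapley value of feature i: weighted marginal contributions."""
    others = [x for x in features if x != i]
    phi = 0.0
    for r in range(len(others) + 1):
        for S in combinations(others, r):
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += weight * (f(set(S) | {i}) - f(S))
    return phi

for feat in features:
    print(f"{feat}: {shapley(feat):+.2f}")
# For an additive model each Shapley value equals the feature's own contribution.
```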

Attention Visualization

$$\text{Attention weights}: A_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}$$

Deep Learning Frameworks in Biology

Specialized Architectures

Graph Convolutional Networks for Proteins

$$\mathbf{H}^{(l+1)} = \sigma\left(\tilde{\mathbf{D}}^{-1/2}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-1/2}\mathbf{H}^{(l)}\mathbf{W}^{(l)}\right)$$

Where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ is the adjacency matrix with self-loops and $\tilde{\mathbf{D}}$ is its degree matrix.
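One graph-convolution layer over a tiny residue-contact graph, written out in NumPy (the 4-node graph, feature sizes, and random weights are illustrative):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN layer: ReLU(D~^{-1/2} A~ D~^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])             # add self-loops
    d = A_tilde.sum(axis=1)                      # degrees of A~
    D_inv_sqrt = np.diag(d ** -0.5)
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt    # symmetric normalization
    return np.maximum(0, A_hat @ H @ W)          # ReLU activation

# Tiny 4-residue contact graph (a chain), 3-dim node features, 2-dim output
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))

H_next = gcn_layer(A, H, W)
print(H_next.shape)   # (4, 2): one updated embedding per residue
```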

Variational Autoencoders for Molecular Generation

$$\mathcal{L}_{\text{VAE}} = \mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}[\log p_{\theta}(\mathbf{x}|\mathbf{z})] - \text{KL}[q_{\phi}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})]$$

Challenges and Opportunities

Data Quality Issues

  • Batch effects: Technical variation between datasets
  • Class imbalance: Rare diseases, underrepresented populations
  • Missing data: Incomplete omics measurements
  • Noise: Technical and biological variability

Interpretability Challenges

  • Black box models: Difficulty understanding predictions
  • Causal inference: Distinguishing correlation from causation
  • Generalizability: Model performance across populations

Computational Requirements

  • Scalability: Handling massive biological datasets
  • GPU utilization: Accelerating deep learning computations
  • Cloud computing: Distributed learning and inference

Real-World Application: AI-Powered Drug Discovery Pipeline

AI is revolutionizing drug discovery by predicting molecular properties and identifying novel therapeutic candidates.

AI Drug Discovery Analysis

# Machine learning pipeline for drug discovery
ml_pipeline = {
    'target_identification': {
        'omics_integration': 0.85,   # Integration score for multi-omics data
        'literature_mining': 0.72,   # Literature-based target prioritization
        'genetic_validation': 0.91,  # Validation through genetic studies
        'druggability_score': 0.68   # Potential for small molecule intervention
    },
    'lead_discovery': {
        'virtual_screening_rate': 1e6,  # Compounds screened per day
        'hit_rate': 0.005,            # 0.5% hit rate in HTS
        'binding_affinity_pred': 0.88, # Correlation with experimental values
        'synthetic_accessibility': 0.75 # Estimated ease of synthesis
    },
    'lead_optimization': {
        'admet_prediction_accuracy': 0.82,  # ADMET property prediction accuracy
        'activity_cliff_detection': 0.91,   # Ability to identify SAR discontinuities
        'selectivity_prediction': 0.85,     # Off-target effect prediction
        'pk_simulation': 0.79              # Pharmacokinetic modeling accuracy
    }
}

# Calculate overall pipeline efficiency
# Traditional drug discovery takes ~10-15 years and $2-3B
traditional_cost = 2.5e9  # $2.5 billion
traditional_time = 13     # years
success_rate_traditional = 0.1  # 10% success rate to market

# AI-assisted pipeline improvements
time_reduction = 0.4      # 40% reduction in discovery phase
cost_reduction = 0.3      # 30% reduction in discovery costs
success_rate_improvement = 0.15  # 15% improvement in success rate

# Calculate new parameters (simplified: reductions applied to the overall timeline and cost)
new_time = traditional_time * (1 - time_reduction)
new_cost = traditional_cost * (1 - cost_reduction)
new_success_rate = success_rate_traditional + success_rate_improvement

# Calculate time savings per phase
discovery_phase_years = 4  # Traditional discovery phase years
development_phase_years = 9  # Traditional development phase years

ai_discovery_time = discovery_phase_years * (1 - time_reduction)
time_saved = discovery_phase_years - ai_discovery_time

# Calculate cost savings
discovery_cost_fraction = 0.3  # Discovery represents ~30% of total cost
traditional_discovery_cost = traditional_cost * discovery_cost_fraction
ai_discovery_cost = traditional_discovery_cost * (1 - cost_reduction)
cost_saved = traditional_discovery_cost - ai_discovery_cost

# Evaluate compound optimization efficiency
compound_library_size = 10000  # Number of compounds to evaluate
screening_throughput = ml_pipeline['lead_discovery']['virtual_screening_rate']
total_screening_time = compound_library_size / screening_throughput  # days

# Calculate hit-to-lead efficiency
initial_hits = compound_library_size * ml_pipeline['lead_discovery']['hit_rate']
expected_leads = initial_hits * 0.1  # 10% of hits become leads in optimization
optimization_success_rate = 0.25  # 25% of leads become candidates
expected_candidates = expected_leads * optimization_success_rate

print(f"AI-powered drug discovery pipeline analysis:")
print(f"  Traditional discovery time: {traditional_time} years")
print(f"  AI-assisted discovery time: {new_time:.1f} years")
print(f"  Time saved: {time_saved:.1f} years")
print(f"  Traditional discovery cost: ${traditional_discovery_cost/1e9:.1f}B")
print(f"  AI-assisted discovery cost: ${ai_discovery_cost/1e9:.1f}B")
print(f"  Cost saved: ${cost_saved/1e9:.2f}B")
print(f"  Success rate improvement: +{success_rate_improvement*100:.1f}% points")
print(f"  New success rate: {new_success_rate*100:.1f}%")
print(f"\n  Compound optimization:")
print(f"    Library size: {compound_library_size:,} compounds")
print(f"    Virtual screening rate: {screening_throughput:,.0f} compounds/day")
print(f"    Screening time required: {total_screening_time:.1f} days")
print(f"    Estimated hits: {initial_hits:.0f}")
print(f"    Expected leads: {expected_leads:.0f}")
print(f"    Expected candidates: {expected_candidates:.0f}")

# Risk assessment
if success_rate_improvement > 0.12:
    risk_level = "Low - substantial improvement over traditional methods"
elif success_rate_improvement > 0.05:
    risk_level = "Moderate - meaningful but limited improvement"
else:
    risk_level = "High - minimal advantage over traditional approaches"

print(f"  Risk assessment: {risk_level}")

# Technology readiness assessment
if ml_pipeline['target_identification']['genetic_validation'] > 0.8 and \
   ml_pipeline['lead_optimization']['admet_prediction_accuracy'] > 0.8:
    tech_readiness = "Advanced - ready for industrial deployment"
elif ml_pipeline['target_identification']['druggability_score'] > 0.6 and \
     ml_pipeline['lead_discovery']['binding_affinity_pred'] > 0.8:
    tech_readiness = "Developing - needs validation in clinical settings"
else:
    tech_readiness = "Early - requires foundational improvements"

print(f"  Technology readiness: {tech_readiness}")

Pipeline Validation

Evaluating AI predictions against experimental results.


Your Challenge: Genomic Variant Classification Model

Develop a machine learning model to classify the pathogenicity of genetic variants.

Goal: Train and evaluate a classifier to predict whether genetic variants are pathogenic or benign.

Genomic Data

import math
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

# Simulated genomic variant dataset
variant_data = {
    'variants': 5000,  # Number of variants to classify
    'features': [
        'conservation_score',     # PhyloP score
        'allele_frequency',       # Population frequency 
        'functional_impact',      # SIFT/PolyPhen score
        'gene_expression',        # Expression level in relevant tissues
        'protein_domain',         # Located in functional domain
        'repeat_masking',         # Located in repetitive region
        'splice_site_disruption', # Affects splice sites
        'known_db_variants'       # Previously annotated in ClinVar
    ],
    'labels': [],  # Pathogenic (1) or Benign (0) - to be generated
    'data_matrix': []  # Feature matrix - to be generated
}

# Generate simulated genomic features with realistic distributions
import numpy as np
np.random.seed(42)  # For reproducibility

n_variants = variant_data['variants']
n_features = len(variant_data['features'])

# Create feature matrix with realistic biological correlations
# (only the first three features are simulated below; the rest stay zero for simplicity)
feature_matrix = np.zeros((n_variants, n_features))

# Conservation scores (higher for pathogenic)
feature_matrix[:, 0] = np.random.beta(2, 5, n_variants) * 10  # PhyloP score

# Allele frequencies (lower for pathogenic)
feature_matrix[:, 1] = np.random.exponential(0.001, n_variants)  # MAF

# Functional impact scores (higher for pathogenic)
feature_matrix[:, 2] = np.random.beta(2, 3, n_variants)  # Combined deleteriousness in [0, 1]

# Generate labels based on features (pathogenic if high impact + low frequency)
labels = []
for i in range(n_variants):
    # Pathogenic if high impact score AND low frequency AND high conservation
    pathogenic_score = (feature_matrix[i, 2] * 5) * (1 / (feature_matrix[i, 1] + 0.0001)) * (feature_matrix[i, 0] / 10)
    is_pathogenic = 1 if pathogenic_score > 8 else 0
    labels.append(is_pathogenic)

# Add some noise to make it more realistic
for i in range(len(labels)):
    if np.random.rand() < 0.1:  # 10% noise
        labels[i] = 1 - labels[i]

# Split data for training and testing
X_train, X_test, y_train, y_test = train_test_split(feature_matrix, labels, test_size=0.2, random_state=42)

# Train multiple models
models = {
    'random_forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'logistic_regression': LogisticRegression(random_state=42),
}

results = {}

for model_name, model in models.items():
    # Train model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]  # Probability of positive class
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred, zero_division=0)
    auc_score = roc_auc_score(y_test, y_prob)
    
    results[model_name] = {
        'accuracy': accuracy,
        'precision': precision, 
        'recall': recall,
        'auc': auc_score
    }

# Select best model
best_model_name = max(results, key=lambda x: results[x]['auc'])
best_metrics = results[best_model_name]

# Calculate feature importance for best model
if best_model_name == 'random_forest':
    feature_importance = models[best_model_name].feature_importances_
else:
    # For logistic regression, use absolute coefficients
    feature_importance = np.abs(models[best_model_name].coef_[0])

# Identify key predictive features
feature_importance_dict = dict(zip(variant_data['features'], feature_importance))
sorted_features = sorted(feature_importance_dict.items(), key=lambda x: x[1], reverse=True)

# Calculate clinical utility
# Number needed to screen (NNS) to find one pathogenic variant
overall_pathogenic_rate = sum(labels) / len(labels)
nns = 1 / overall_pathogenic_rate

# Calculate positive predictive value improvement
baseline_ppv = overall_pathogenic_rate  # Without the model, PPV equals the prevalence

# Specificity is not stored in the metrics above, so recompute it for the best model
from sklearn.metrics import confusion_matrix
y_pred_best = models[best_model_name].predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_best).ravel()
specificity = tn / (tn + fp)

model_ppv = (best_metrics['recall'] * overall_pathogenic_rate /
             (best_metrics['recall'] * overall_pathogenic_rate
              + (1 - specificity) * (1 - overall_pathogenic_rate)))

improvement_factor = model_ppv / baseline_ppv if baseline_ppv > 0 else float('inf')

Build and evaluate a machine learning model to classify genetic variants by pathogenicity.

Hint:

  • Consider biological significance of different features
  • Evaluate model performance using appropriate metrics for imbalanced data
  • Assess feature importance to understand biological drivers
  • Consider the clinical utility of the predictions
# TODO: Calculate model performance parameters
best_model_accuracy = 0      # Overall accuracy of best model
best_model_precision = 0     # Precision of best model (positive predictions correct)
best_model_recall = 0        # Recall/sensitivity of best model (true positive rate)
best_model_auc = 0           # AUC-ROC of best model
feature_importance_ranking = []  # List of (feature, importance) tuples, ranked
clinical_improvement_factor = 0  # Factor improvement over baseline in clinical utility

# Calculate results from the analysis above
best_model_accuracy = best_metrics['accuracy']
best_model_precision = best_metrics['precision']
best_model_recall = best_metrics['recall']
best_model_auc = best_metrics['auc']

# Sort features by importance
feature_importance_ranking = sorted_features

# Calculate clinical improvement factor from analysis above
clinical_improvement_factor = improvement_factor

# Print results
print(f"Variant classification model results:")
print(f"  Best model: {best_model_name}")
print(f"  Accuracy: {best_model_accuracy:.3f}")
print(f"  Precision: {best_model_precision:.3f}")
print(f"  Recall: {best_model_recall:.3f}")
print(f"  AUC-ROC: {best_model_auc:.3f}")
print(f"  Clinical utility improvement: {clinical_improvement_factor:.2f}x over baseline")

print(f"\nTop predictive features:")
for i, (feature, importance) in enumerate(feature_importance_ranking[:5]):
    print(f"  {i+1}. {feature}: {importance:.3f}")

# Clinical assessment
if best_model_auc > 0.9:
    clinical_utility = "High - suitable for clinical decision support"
elif best_model_auc > 0.8:
    clinical_utility = "Medium - requires expert review"
else:
    clinical_utility = "Low - needs significant improvement"

print(f"\nClinical utility: {clinical_utility}")

# Limitations assessment
high_impact_variants = sum(
    1 for i, label in enumerate(labels)
    if label == 1 and feature_matrix[i, 2] > 0.8 and feature_matrix[i, 0] > 5  # High impact + high conservation
)

print(f"  High-confidence pathogenic variants identified: {high_impact_variants}")
print(f"  Overall pathogenic rate in dataset: {overall_pathogenic_rate:.3f}")

# Recommendations
if best_model_recall < 0.7:
    recommendations = ["Focus on increasing sensitivity to reduce false negatives"]
elif best_model_precision < 0.7:
    recommendations = ["Focus on increasing precision to reduce false positives"]
else:
    recommendations = ["Balanced model suitable for clinical use"]

recommendations.append("Validate on an independent dataset with clinical follow-up")
print(f"  Recommendations: {recommendations}")

How would you modify your model if you needed to account for genetic variants with uncertain significance (VUS) in addition to clearly pathogenic and benign variants?

ELI10 Explanation

Simple analogy for better understanding

Think of computational biology like having a super-intelligent assistant that can read the entire library of the life sciences in seconds and spot patterns that would take humans years to notice. Instead of a researcher spending months looking through microscope slides or analyzing protein structures by hand, we have a digital scientist that can examine millions of molecular structures in the time it takes to drink a cup of coffee.

AI in biotech is like a team of expert mathematicians, statisticians, and biologists working together inside a supercomputer, looking for hidden patterns in huge amounts of biological data: finding needles in haystacks that are themselves buried inside mountains of needles.

These digital assistants can predict how proteins fold into their 3D shapes, identify which genes might cause diseases, design new medicines by virtually testing millions of molecular combinations, and model how entire biological systems behave under different conditions. It's like having a crystal ball for biology that can suggest how a new drug might work in the body, which patients might respond best to a treatment, or how a virus might mutate before it happens.

Self-Examination

Q1.

How do machine learning algorithms improve biological data analysis compared to traditional methods?

Q2.

What are the applications of deep learning in genomics and structural biology?

Q3.

How can AI accelerate drug discovery and development processes?