Bioinformatics with Python
Learn Python by decoding life: read DNA, transcribe and translate genes, align sequences, call mutations, and analyze whole genomes.
10 projects, 250 hands-on levels, run in your browser.
Syllabus
- DNA Basics: A DNA sequence is just a string over the alphabet A, C, G, T. Read it, count its bases, take its complement and reverse complement, measure GC content, and find the motifs that matter. These operations are the foundation of all sequence analysis.
- Transcription & Translation: The central dogma in code: transcribe DNA into messenger RNA, then translate RNA into protein using the genetic code. Read codons, look them up in the codon table, handle stop codons, and scan reading frames for open reading frames (ORFs).
- Sequence Composition & Motifs: Characterize a sequence statistically and find the patterns in it: nucleotide and dinucleotide frequencies, CpG sites, melting temperature, k-mer spectra, exact and approximate motif matching, and building a consensus from a set of related sequences.
- Pairwise Alignment: One of the most important algorithms in bioinformatics: aligning two sequences with dynamic programming. Start with edit distance, then build Needleman-Wunsch global alignment and Smith-Waterman local alignment with numpy matrices, trace back through the matrix to recover the alignment, and measure percent identity.
- Mutations & Variants: Compare a reference sequence to a sample to find and interpret mutations. Locate SNPs; classify substitutions as transitions or transversions, and as silent, missense, or nonsense; handle insertions and deletions and the frameshifts they cause; and produce a variant report. This is the core of genetic-variant analysis.
- FASTA & FASTQ Parsing: Real sequence data arrives in FASTA and FASTQ files. Parse FASTA records (header + sequence, possibly multi-line), handle many records at once, parse FASTQ reads with their quality strings, decode Phred quality scores, and filter low-quality reads. These tasks are the daily bread of working with sequencing data.
- Phylogenetics: Reconstruct evolutionary relationships from sequences. Measure pairwise distances (and correct them with Jukes-Cantor), build a distance matrix, cluster taxa with UPGMA by repeatedly merging the closest pair, and express the resulting tree in Newick format. Together these steps form the pipeline from sequences to a phylogenetic tree.
- Population Genetics: Study variation within and between populations. Compute allele and genotype frequencies, test for Hardy-Weinberg equilibrium, measure genetic diversity with heterozygosity, and quantify how differentiated two populations are with Fst. These measures are the quantitative core of evolutionary and conservation genetics.
- Genome Search & Assembly: Search and reconstruct genomes. Build a k-mer index for fast lookup, map short reads to a reference by seed-and-verify, compute overlaps between reads, and assemble overlapping reads into a contig. These are the algorithms behind genome search engines and sequence assemblers.
- Capstone: Characterize an Unknown Sample: The grand finale. An unknown DNA sample arrives in your lab. Using everything from the track (composition, gene finding, translation, motifs, and variant analysis against a reference), characterize it completely and assemble a full dossier: what it is, what it encodes, and how it differs from the known reference.
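As a taste of the kind of code the first two projects describe, here is a minimal sketch of GC content, reverse complement, and transcription on plain Python strings. The function names are illustrative assumptions, not taken from the course material, and the sketch assumes input containing only A, C, G, and T.

```python
# Illustrative helpers; names are assumptions, not the course's own API.
# Assumes sequences contain only the letters A, C, G, T (any case).

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(seq: str) -> float:
    """Return the fraction of bases that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def transcribe(seq: str) -> str:
    """Transcribe a coding-strand DNA sequence into mRNA by replacing T with U."""
    return seq.upper().replace("T", "U")


if __name__ == "__main__":
    dna = "ATGC"
    print(gc_content(dna))          # 0.5
    print(reverse_complement(dna))  # GCAT
    print(transcribe(dna))          # AUGC
```

Treating the sequence as an ordinary string keeps each operation to a line or two, which matches the syllabus description of DNA as a string over a four-letter alphabet.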
Key concepts
- Codon: A triplet of nucleotides that codes for one amino acid or a stop signal; reading a coding sequence codon by codon translates it into protein.
- Distance matrix: A symmetric table of pairwise distances between all sequences, the input to tree-building.
- GC content: The fraction of a sequence that is G or C, a basic descriptor that correlates with duplex stability and varies between organisms.
- Hamming distance: The number of positions at which two equal-length sequences differ, the simplest measure of sequence dissimilarity.
- Motif: A short recurring sequence pattern with biological meaning (e.g., a binding site), found by scanning for matches.
- Nucleotide: A building block of DNA/RNA, represented by a letter: A, C, G, T (or U in RNA). A sequence is a string of these.
- Phylogenetic tree: A branching diagram of evolutionary relationships, built in this track from a matrix of pairwise distances between sequences.
- Reading frame: Where you start grouping a sequence into codons; the three possible offsets give three frames (six with the reverse strand).
- Scoring matrix: A table of match/mismatch scores used to evaluate an alignment, rewarding biologically likely substitutions.
- Sequence alignment: Arranging two sequences to maximize matching positions (allowing gaps), revealing shared ancestry or function.
- UPGMA: A simple clustering method that builds a tree by repeatedly merging the two closest clusters and averaging distances.
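Two of the concepts above, Hamming distance and reading frames, can be sketched in a few lines of Python. The function names here are hypothetical and chosen only for illustration. The sketch covers the three forward frames; the reverse-strand frames would apply the same splitting to the reverse complement.

```python
# Illustrative sketch; function names are assumptions, not the course's API.

def hamming_distance(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def codons(seq: str, frame: int = 0) -> list[str]:
    """Split seq into complete codons, starting at offset `frame` (0, 1, or 2).

    Trailing bases that do not fill a whole codon are dropped.
    """
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1, or 2")
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


if __name__ == "__main__":
    print(hamming_distance("GATTACA", "GACTATA"))  # 2
    for frame in range(3):
        print(frame, codons("ATGGCCTAA", frame))
    # 0 ['ATG', 'GCC', 'TAA']
    # 1 ['TGG', 'CCT']
    # 2 ['GGC', 'CTA']
```

Only frame 0 of the example sequence starts with ATG and ends with the stop codon TAA, which is why the choice of frame matters when scanning for genes.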