
Machine Learning in R

Build the predictive models yourself instead of just calling a library. Feature engineering and the model matrix, k-nearest neighbors, naive Bayes, decision trees, random forests and boosting, k-means and hierarchical clustering, PCA, and the metrics that judge a model, all written from scratch in base R and checked against R's own implementations. The predictive-modeling complement to statistical inference.

10 projects, 250 hands-on levels, run in your browser.

Syllabus

  • Features & the Model Matrix: Before any model can learn, the data has to be turned into a clean numeric matrix: rows are observations, columns are features. This project builds that pipeline, splitting data into honest train and test sets, scaling features so no one variable dominates by its units, encoding categories into numbers, transforming and binning raw columns, and filling the missing values real data is full of. The golden rule throughout: fit every transformation on the training data alone, then apply it to the test data, so no future information leaks back into the past.
  • k-Nearest Neighbors: k-nearest neighbors is the most intuitive model in machine learning: to label a new point, find the k training points closest to it and let them vote. There is no training step, just a distance and a tally. This project builds it from the ground up: the distance metrics that define closeness, the search that ranks neighbors, the majority vote and the distance-weighted vote, the averaging that makes it a regressor, and the choice of k that trades noise against blur. Along the way you will see why kNN demands scaled features and why it struggles in high dimensions.
  • Naive Bayes: Naive Bayes turns Bayes' theorem into a classifier with one simplifying leap: it assumes the features are independent given the class. That assumption is almost never true, yet the model is fast, needs little data, and is a champion at text classification. This project builds it from first principles: class priors from the label counts, the likelihood of a feature under each class (Gaussian for continuous data, multinomial for counts), the Laplace smoothing that rescues unseen values, and the shift to log-space that stops a product of tiny probabilities from underflowing to zero. By the end you will have a working spam-style classifier you wrote yourself.
  • Decision Trees: A decision tree is the most interpretable model in the toolkit: a cascade of simple questions, each splitting the data into purer subsets until a confident prediction falls out. This project builds the CART algorithm from scratch: the impurity measures (entropy and Gini) that score how mixed a group is, the information gain that ranks candidate splits, the search for the best threshold on a numeric feature, the recursion that grows the tree, and the traversal that turns a new row into a prediction. You will also meet the stopping rules and pruning that keep a tree from memorizing its training data.
  • Ensembles: One of the deepest ideas in applied machine learning: a crowd of mediocre models, combined well, can beat a single finely tuned one. This project builds both families of ensemble. Bagging trains many models on bootstrap resamples and averages them, taming variance; random forests add feature subsampling for extra diversity and get an honest error estimate for free from the out-of-bag rows. Boosting works the other way, fitting models in sequence where each one focuses on what the last got wrong, AdaBoost by reweighting examples and gradient boosting by fitting residuals. You will write the resampling, the aggregation, the OOB error, the feature importance, and the boosting updates by hand, all seeded so every run reproduces.
  • k-Means & Partitional Clustering: Clustering finds groups in data that carries no labels, and k-means is its workhorse. The algorithm is a beautiful two-step loop: assign each point to its nearest center, then move each center to the mean of its points, and repeat until nothing moves. This project builds Lloyd's algorithm from scratch: the assignment step, the update step, the inertia that measures how tight the clusters are, and the convergence check. You will add k-means++ seeding to escape bad starts, and the elbow and silhouette heuristics for the eternal question of how many clusters there are. Throughout, your results are checked against R's own stats::kmeans.
  • Hierarchical & Density Clustering: k-means needs you to name the number of clusters and only finds round blobs. Two other families lift those limits. Agglomerative hierarchical clustering builds a whole tree of nested groupings by repeatedly merging the two closest clusters; you cut the tree wherever you like, with no k chosen in advance. DBSCAN takes a different view entirely: clusters are dense regions separated by sparse ones, so it discovers clusters of any shape and quietly labels the leftover points as noise. This project builds the distance matrix, the linkage rules, the merge sequence and the dendrogram cut, then the core/border/noise machinery of DBSCAN, checked where possible against R's own dist and hclust.
  • PCA & Dimensionality Reduction: Principal component analysis finds the directions along which the data varies most, and re-expresses each point in those few directions instead of all the original features. This project builds it from the linear algebra up: center and scale the data, form the covariance matrix, and take its eigen-decomposition. The eigenvectors are the principal components and the eigenvalues are the variance each captures. You will project the data down, reconstruct it back, and measure what was lost, then use the explained-variance ratio to decide how many components to keep. Every result is checked against R's prcomp, with one subtlety respected throughout: a principal component is defined only up to sign, so it describes the same axis whether or not you flip it, and the tests compare magnitudes and reconstruction, never the raw signed loadings.
  • Model Evaluation & Selection: Training a model is half the job; knowing whether it is any good is the other half, and accuracy alone often lies. This project builds the evaluation toolkit a working practitioner reaches for every day. From the confusion matrix come precision and recall, and the F1 score that balances the two kinds of error a classifier can make. The ROC curve and its area summarize performance across every threshold at once. Cross-validation replaces a single lucky split with an honest average, and grid search uses that honest estimate to tune the knobs. Get these right and you can tell a model that generalizes from one that merely memorized.
  • Capstone: A Predictive Modeling Pipeline: Nine projects of tools, one workflow. This capstone assembles a complete supervised-learning pipeline the way a practitioner actually runs it: a preparation stage that imputes, scales, and splits the data without leakage; a modeling stage that trains several competing classifiers from the earlier projects; a tuning stage that uses cross-validation to set each model's hyperparameters; an evaluation stage that scores the finalists on a held-out test set with the full metric suite; and a selection stage that picks the winner and reports it. The end-to-end discipline that turns a pile of algorithms into a model you can trust.
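
As a taste of the golden rule from the first project, here is a minimal base R sketch of leakage-free standardization: the column means and standard deviations are estimated on the training rows only, then reused unchanged on the test rows. The helper names fit_scaler() and apply_scaler(), the 80/20 split, and the use of the built-in iris data are illustrative assumptions, not the course's own code.

```r
# Illustrative sketch: fit_scaler() and apply_scaler() are hypothetical names.
set.seed(42)                                  # seeded so the split reproduces

X <- as.matrix(iris[, 1:4])                   # numeric feature matrix
n <- nrow(X)
train_idx <- sample(n, size = floor(0.8 * n)) # assumed 80/20 split
X_train <- X[train_idx, , drop = FALSE]
X_test  <- X[-train_idx, , drop = FALSE]

# Learn the scaling parameters from the training rows only.
fit_scaler <- function(X) {
  list(center = colMeans(X), scale = apply(X, 2, sd))
}

# Apply previously learned parameters to any matrix with the same columns.
apply_scaler <- function(X, scaler) {
  centered <- sweep(X, 2, scaler$center, "-")
  sweep(centered, 2, scaler$scale, "/")
}

scaler  <- fit_scaler(X_train)
Z_train <- apply_scaler(X_train, scaler)
Z_test  <- apply_scaler(X_test, scaler)       # test rows never influence the fit

# Check the from-scratch result against base R's scale().
ref_train <- scale(X_train)
ref_test  <- scale(X_test, center = scaler$center, scale = scaler$scale)
stopifnot(isTRUE(all.equal(Z_train, ref_train, check.attributes = FALSE)))
stopifnot(isTRUE(all.equal(Z_test,  ref_test,  check.attributes = FALSE)))
```

The training columns come out with mean 0 and sd 1 by construction. The test columns generally do not, and that is expected: they are measured on the training data's scale.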

Key concepts

  • Accuracy: The fraction of predictions that are correct. Simple, but misleading on imbalanced data, where always guessing the majority class can score high while being us…
  • AdaBoost: Adaptive boosting: after each weak learner, raise the weight of the misclassified examples so the next learner concentrates on them. Each learner's vote is…
  • AUC: The area under the ROC curve, equal to the probability a random positive scores above a random negative. 1 is perfect, 0.5 is random guessing.
  • Bagging: Bootstrap aggregation: train each model on a bootstrap resample and average (or vote) their predictions. Cuts variance without raising bias, the basis of rando…
  • Baseline: A trivial reference model, like always predicting the majority class, that any real model must beat to be worth deploying. The honest yardstick for a result.
  • Bayes' theorem: The rule that updates belief with evidence: the posterior is proportional to the prior times the likelihood. The engine of naive Bayes.
  • Bias-variance tradeoff: Expected error splits into bias squared plus variance plus irreducible noise. Too-simple models have high bias (underfit); too-complex ones have high variance…
  • Binning: Discretizing a continuous feature into buckets (equal-width or equal-frequency). Can capture non-linear effects and tame outliers at the cost of resolution.
  • Boosting: Building models in sequence, each one focused on what its predecessors got wrong. AdaBoost reweights examples; gradient boosting fits the residuals. Turns weak…
  • Bootstrap sample: A resample of n rows drawn WITH replacement, so some repeat and roughly a third are left out (the out-of-bag rows). Each ensemble member trains on its own boot…
  • CART: Classification and Regression Trees: the recursive algorithm that grows a tree by greedily choosing the best binary split at each node until a stopping rule fi…
  • Class imbalance: When one class is far rarer than the other, so accuracy flatters a lazy model. Precision, recall, F1, and AUC give a truer picture, and resampling can help.
  • Classification: Supervised learning where the output is a category (spam/ham, digit, species). The model predicts a discrete label, scored by accuracy, precision, recall, and…
  • Clustering: Unsupervised grouping of similar points. k-means makes round, equal-ish blobs; hierarchical builds a tree of nested groups; DBSCAN finds dense regions of any s…
  • Confusion matrix: For a binary classifier, the 2x2 table of true/false positives and negatives. Threshold-based classification metrics such as accuracy, precision, and recall all come from its four cells.
  • Core, border, noise: DBSCAN's three roles: a core point has at least minPts neighbors within eps; a border point is near a core but not itself dense; a noise point is neither.
  • Cosine similarity: A measure of the angle between two vectors, ignoring their magnitude. Cosine distance is 1 minus it; common for text and high-dimensional data.
  • Covariance matrix: The square, symmetric matrix of feature variances (diagonal) and covariances (off-diagonal). PCA decomposes it; its symmetry guarantees real eigenvalues.
  • Cross-validation: Splitting the data into k folds and averaging the score across k train/test rounds, a steadier estimate of generalization than a single split. Leave-one-out is…
  • Curse of dimensionality: In high dimensions, points spread out and the nearest neighbor is barely closer than the farthest, so distance-based methods lose meaning. A reason to reduce d…
  • Data leakage: When information from the test set (or the future) sneaks into training, inflating the score. Avoided by fitting every transformation on the training data alon…
  • DBSCAN: Density-based clustering: a point is core if it has minPts neighbors within eps. Cores grow clusters of arbitrary shape, and the sparse leftovers are labeled n…
  • Decision threshold: The score cutoff that turns a model's probability into a 0/1 label. Moving it trades precision against recall and traces out the ROC curve.
  • Decision tree: A model of nested yes/no questions that split the data into purer groups, predicting by the majority label at the leaf a point reaches. The most interpretable…
  • Dendrogram: The tree of merges hierarchical clustering produces, with merge heights showing how far apart clusters were. Cutting it at a height or for k clusters yields a…
  • Distance metric: A measure of closeness between points. Euclidean is the straight line, Manhattan the grid path, Minkowski generalizes both, cosine compares direction not magni…
  • Dummy encoding: One-hot encoding with one level dropped as a reference, leaving k-1 columns. Avoids the redundancy (perfect collinearity) that breaks linear models.
  • Eigenvectors & eigenvalues: For the covariance matrix, the eigenvectors are the principal-component directions and the eigenvalues are the variance along each. An eigenvector is defined o…
  • Encoding: Turning categories into numbers a model can use: label encoding assigns integer codes, one-hot makes a 0/1 column per category, frequency encoding uses counts.…
  • Ensemble: Combining many models into one stronger predictor. Bagging averages models trained on resamples; boosting builds them in sequence to fix each other's mista…
  • Entropy: Disorder measured in bits: -sum(p * log2 p). 0 for a pure set, 1 for an even binary split. Paired with information gain to choose splits.
  • Euclidean distance: The straight-line distance, the square root of the summed squared differences. The default metric for kNN and k-means.
  • Explained variance: The share of total variance a principal component captures, its eigenvalue over the sum of eigenvalues. The cumulative curve guides how many components to keep.
  • F1 score: The harmonic mean of precision and recall, 2pr/(p+r), which punishes sacrificing one for the other. F-beta tilts the balance toward recall (beta>1) or preci…
  • Feature: A measured input variable, one column of the model matrix. Good features make a problem easy; feature engineering is the craft of building them.
  • Feature engineering: Transforming raw columns into informative model inputs: scaling, encoding, binning, interactions, and missing-value handling. Often it matters more than the ch…
  • Feature importance: How much each feature matters to a model. Permutation importance shuffles a feature and measures the drop in accuracy; a useless feature registers near zero.
  • Feature scaling: Putting features on a common scale so none dominates by its units. Standardizing (z-score) gives mean 0 and sd 1; min-max maps to [0, 1]; robust scaling uses t…
  • Gaussian naive Bayes: Naive Bayes for continuous features, modeling each feature per class as a normal distribution with a fitted mean and variance.
  • Generalization: How well a model performs on new, unseen data, the only thing that ultimately matters. The gap between training and test accuracy measures how much it overfit.
  • Gini impurity: The chance two random draws from a set disagree: 1 - sum(p^2). 0 for a pure set, 0.5 for an even binary split. CART's default splitting criterion.
  • Gradient boosting: Boosting as gradient descent in function space: each new model fits the residual error of the running ensemble, scaled by a small learning rate. Behind XGBoost…
  • Grid search: Tuning hyperparameters by trying every combination on a grid and keeping the one with the best cross-validated score, never the test score.
  • Hierarchical clustering: Agglomerative clustering builds a tree by repeatedly merging the two closest clusters, needing no k chosen up front; cut the dendrogram wherever you like to ge…
  • Hyperparameter: A setting not learned from the data but chosen by you (k in kNN, depth in a tree, the learning rate). Tuned with cross-validation on validation data, never the…
  • Impurity: How mixed a set of labels is. Gini impurity (1 - sum p^2) and entropy both peak at a 50/50 split and vanish for a pure set. A split tries to lower it.
  • Imputation: Filling missing values rather than dropping rows: mean or median imputation are simplest; a missingness indicator flags where data was absent. Fit the fill val…
  • Inertia (within-cluster SS): The total squared distance from each point to its cluster center, the quantity k-means minimizes. Falls as k rises, which is why choosing k needs more than min…
  • Information gain: The drop in entropy from a split: parent entropy minus the size-weighted child entropy. The greedy criterion that picks each split in a tree.
  • k-Means: A clustering algorithm that alternates assigning points to the nearest center and moving each center to its points' mean (Lloyd's loop) until nothing c…
  • k-means++: A seeding scheme that spreads the initial centers far apart, so k-means is far less likely to converge to a poor local optimum than with random starts.
  • k-Nearest Neighbors: A classifier (or regressor) that labels a point by the majority vote (or average) of its k closest training points. No training step, just a distance and a tal…
  • Laplace smoothing: Adding a pseudocount to every category so an unseen feature value gets a small nonzero probability instead of zeroing out the whole product. Add-one smoothing…
  • Learning rate: The step size that scales each boosting round's contribution. Small rates improve in cautious steps and generalize better, at the cost of needing more roun…
  • Likelihood: The probability of the observed features under a given class. For continuous features it comes from a Gaussian; for counts, a multinomial.
  • Linkage: How the distance between two clusters is defined: single linkage uses the closest members (and can chain), complete uses the farthest (compact clusters), avera…
  • Lloyd's algorithm: The two-step loop behind k-means: assign every point to its nearest center, then update each center to the mean of its assigned points. Repeat until the assign…
  • Log-probabilities: Working with the logarithm of probabilities so a product of many tiny numbers becomes a sum, dodging numerical underflow. The argmax is unchanged because log i…
  • Manhattan distance: The sum of absolute coordinate differences, the path along a grid. Less sensitive to large single-feature gaps than Euclidean.
  • Min-max scaling: Rescaling a feature into [0, 1] via (x - min) / (max - min). Keeps the shape of the distribution but bounds the range.
  • Model matrix: The numeric table a model actually trains on: one row per observation, one column per feature. Turning raw data into this matrix is the first job of any pipeli…
  • mtry: The number of features a random forest considers at each split. A common default is floor(sqrt(p)); smaller values mean more diverse, less correlated trees.
  • Naive Bayes: A probabilistic classifier from Bayes' theorem assuming features are conditionally independent given the class. Fast, data-light, and a champion at text cl…
  • Noise points: Points DBSCAN leaves unclustered because they sit in sparse regions. Labeling outliers as noise instead of forcing them into a cluster is a DBSCAN signature.
  • One-hot encoding: Representing a category with one indicator (0/1) column per level, so no false ordering is implied. The standard encoding for nominal categories.
  • Ordinal encoding: Mapping genuinely ordered categories (low/medium/high) to ranks by a supplied order, preserving the order that label encoding would scramble alphabetically.
  • Out-of-bag (OOB) error: The rows a tree did not train on form a free validation set. Voting each row using only the trees that missed it gives an honest error estimate without a separ…
  • Overfitting: When a model fits the training data's noise, scoring high on training but poorly on new data. The tell is a large train-minus-test accuracy gap.
  • Pipeline: The end-to-end sequence from raw data to evaluated model: clean, engineer features, split, train, tune, evaluate, select. Encapsulating it prevents leakage and…
  • Posterior: The probability of a class after seeing the features, prior times likelihood, normalized. The classifier predicts the class with the largest posterior.
  • Precision: Of the cases the model called positive, the fraction that really were: TP / (TP + FP). High precision means few false alarms.
  • Principal component analysis: Finds the orthogonal directions of greatest variance (the principal components) and re-expresses data in a few of them. Built from the covariance matrix's…
  • Prior: How likely each class is before seeing any features, the fraction of training rows in that class. Multiplied by the likelihood to form the posterior.
  • Projection & reconstruction: Projecting data onto the top components gives its scores (compressed coordinates); reconstructing from them rebuilds an approximation. The reconstruction error…
  • Pruning: Cutting a tree back to fight overfitting. Pre-pruning stops early (depth or minimum-leaf limits); cost-complexity pruning collapses splits that do not earn the…
  • Rand index: A measure of agreement between two clusterings: the fraction of point pairs they treat the same way (both together or both apart). 1 means identical partitions…
  • Random forest: Bagging of decision trees with an extra twist: each split considers only a random subset of features (mtry), decorrelating the trees. Comes with OOB error and…
  • Recall: Of the actual positives, the fraction the model caught: TP / (TP + FN). High recall means few misses. Trades off against precision.
  • Regression (ML): Supervised learning where the output is a continuous number. Here it appears as kNN regression and gradient boosting; the dedicated theory lives in the statist…
  • Regularization: Penalizing model complexity (error + lambda * complexity) to trade a little bias for less variance. A bigger lambda favors simpler, more generalizable models.
  • Residual: What the current model still gets wrong, y minus the prediction. Gradient boosting fits a new model to the residual each round.
  • ROC curve: A plot of true-positive rate against false-positive rate as the decision threshold sweeps, showing a classifier's performance at every cutoff at once.
  • Scree plot: Eigenvalues plotted in decreasing order; the elbow where they level off suggests how many components matter. The Kaiser rule (keep eigenvalues above 1) is a re…
  • Silhouette score: How well a point fits its cluster: (b - a) / max(a, b), where a is its average distance within the cluster and b to the nearest other cluster. Near 1 is a clea…
  • Stacking: An ensemble that trains a meta-model on the predictions of the base models, learning how best to blend them. More flexible than a fixed average.
  • Standardize (z-score): Rescaling a feature to mean 0 and sd 1 via (x - mean) / sd. Essential for distance- and gradient-based models so large-unit features do not drown out small one…
  • Supervised learning: Learning a mapping from inputs to a known output (label or number) using labeled examples. Classification and regression are its two forms.
  • The elbow method: A heuristic for choosing k: plot inertia against k and pick the bend where adding clusters stops helping much. Subjective but quick.
  • Train/test split: Holding out part of the data so a model's score is measured on examples it never trained on. The single most important habit for honest evaluation, and the…
  • Underfitting: When a model is too simple to capture the real pattern, scoring poorly on both training and test data. The high-bias failure mode.
  • Unsupervised learning: Finding structure in data with no labels: clustering groups similar points, dimensionality reduction compresses features. The model is judged by the structure…
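
Several of the formulas above are short enough to try directly. The following is a rough base R sketch of entropy (-sum(p * log2 p)), Gini impurity (1 - sum(p^2)), and precision, recall, and F1 from the confusion-matrix counts. The function names, the `positive` argument, and the toy label vectors are assumptions made for illustration only.

```r
# Illustrative sketch: function names and toy data are hypothetical.

# Entropy in bits of a vector of class labels.
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                     # drop empty factor levels to avoid 0 * log2(0)
  -sum(p * log2(p))
}

# Gini impurity of a vector of class labels.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Precision, recall, and F1 for a binary problem.
classification_metrics <- function(actual, predicted, positive = 1) {
  tp <- sum(predicted == positive & actual == positive)
  fp <- sum(predicted == positive & actual != positive)
  fn <- sum(predicted != positive & actual == positive)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}

even <- c("a", "a", "b", "b")       # an even binary split
pure <- c("a", "a", "a")            # a pure set
entropy(even); gini(even)           # the glossary's maxima for two classes
entropy(pure); gini(pure)           # both vanish for a pure set

actual    <- c(1, 1, 1, 0, 0, 0, 0, 1)
predicted <- c(1, 1, 0, 0, 0, 1, 0, 1)
m <- classification_metrics(actual, predicted)
m
```

With these toy vectors there are 3 true positives, 1 false positive, and 1 false negative. The sketch does not guard against a zero denominator (for example, a model that never predicts the positive class); a fuller version would need to decide how to report that case.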