Statistical Computing in R
Learn R the way statisticians use it, and learn statistics by building it: vectors and the apply family, data frames rebuilt from a list of columns, probability distributions and hypothesis tests written from scratch, least-squares regression by hand, the bootstrap and permutation tests, time-series smoothing, and a full statistical-analysis engine. The language of data analysis, from first principles.
10 projects, 250 hands-on levels, run in your browser.
Syllabus
- R Foundations: R is built on one idea: the vector. Almost every value is a vector, arithmetic is vectorized, and loops are the exception, not the rule. This project builds fluency in that worldview: creating vectors, operating on them whole, subsetting them by position, logic, and name, handling the NA values real data is full of, and working with the character and factor types that label it. Get this right and the rest of R follows.
- Vectors, Lists & the apply Family: A vector holds one type; a LIST holds anything, including other lists, which makes it R's general-purpose container. Paired with the apply family (lapply, sapply, vapply, and Map) and the functionals Reduce and Filter, lists let you express in one line what other languages write as loops. This project builds that functional toolkit, then matrices and the split-apply-combine pattern that is the soul of data analysis in R.
- Data Frames from Scratch: A data frame is the workhorse of R, and underneath it is just a list of equal-length columns with a class attribute. This project builds that intuition, then reimplements in plain base R the verbs every analyst reaches for: select columns, filter rows, mutate new ones, arrange by a key, and summarize. Group-by aggregation and a join follow. By the end the dplyr 'magic' is just indexing and the split-apply-combine you already know.
- Probability & Distributions: Probability is the engine under every statistical method, and R was built to compute with it. This project builds distributions from scratch (a binomial PMF, a normal density, a CDF as a running sum), then connects them to R's d/p/q/r family (density, cumulative, quantile, random). You will compute expectation and variance from first principles and, most importantly, watch the Law of Large Numbers and the Central Limit Theorem appear out of simulation, R's superpower.
- Descriptive Statistics & EDA: Before any model, you describe the data: where it centers, how it spreads, what shape it takes, and how its variables relate. This project builds those tools (mean/median/mode, variance and the IQR, quantiles and the five-number summary, covariance and correlation, and the rules that flag outliers) while showing why robust statistics survive the messy values that wreck the mean.
- Hypothesis Testing: A hypothesis test asks whether an observed effect could plausibly be chance. This project builds the machinery from the test statistic up: the z- and t-tests, the confidence interval, the chi-square test for tables, and the ANOVA F-test. It then checks each against R's t.test, chisq.test, and aov. You will compute p-values yourself, build intervals from quantiles, and reason about type I and II error and the power to detect a real effect.
- Linear Regression from Scratch: Regression is the workhorse of applied statistics, and R was built around its formula syntax. This project builds it from the ground up: the least-squares slope and intercept by hand, then the matrix normal equations that generalize to many predictors, residuals and R-squared to measure fit, prediction and the diagnostics that reveal a bad model, and finally logistic regression trained by gradient descent. Every result is checked against R's own lm() and glm().
- Resampling & Simulation: When a formula for the standard error is hard or unknown, you can simulate one. This project builds the resampling toolkit R is famous for: the bootstrap that estimates uncertainty by sampling the data with replacement, the permutation test that builds a null distribution by shuffling labels, cross-validation that estimates honest prediction error, and the jackknife. The unifying idea: let the computer do the statistics the math cannot.
- Time Series & Smoothing: Time series break the usual assumption that observations are independent: each point leans on the ones before it. This project builds the core toolkit: moving averages and exponential smoothing to see through noise, autocorrelation to measure memory, differencing to remove a trend and reach stationarity, decomposition into trend and season, and a simple autoregressive model you can forecast with. The groundwork under every forecasting method.
- Capstone: A Statistical Analysis Engine: Nine projects of tools, one analysis. This capstone assembles a small but complete statistical pipeline: a cleaning stage that handles missing values and outliers, an EDA stage that summarizes and correlates, a modeling stage that fits and evaluates a regression, an inference stage that tests a hypothesis and bootstraps a confidence interval, and a reporting stage that turns it all into a verdict. The kind of analysis a working statistician runs every day, built from the pieces you wrote yourself.
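As a taste of the "build it by hand, then check it against R" approach the syllabus describes, here is a minimal sketch of least-squares regression. The simulated data, the seed, and the variable names (x, y, slope, intercept, beta, fit) are illustrative assumptions, not material from the course itself.

```r
# Simulated data: the true line (2 + 3x) and the noise level are arbitrary.
set.seed(42)
x <- runif(50, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(50, sd = 1)

# Simple regression by hand: slope is cov/var, and the line
# passes through the point of means.
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# The same fit from the normal equations, using a design matrix
# with a leading column of 1s for the intercept.
X <- cbind(1, x)
beta <- solve(t(X) %*% X) %*% t(X) %*% y

# Check both hand-built answers against lm().
fit <- lm(y ~ x)
print(all.equal(unname(coef(fit)), c(intercept, slope)))
print(all.equal(unname(coef(fit)), as.vector(beta)))
```

Inverting t(X) %*% X directly mirrors the textbook formula; it is fine for a small, well-conditioned example like this one, whereas lm() uses a QR decomposition for numerical stability.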
Key concepts
- aggregate: Base R's group-and-summarize in formula form: aggregate(value ~ group, data = df, FUN = mean) returns a data frame of per-group summaries.
- ANOVA: Analysis of variance: tests whether several group means differ, using the F-statistic (between-group variance over within-group). R's aov fits it from a fo…
- apply over margins: apply(m, 1, f) applies f to each ROW, apply(m, 2, f) to each COLUMN. Margin 1 is rows, margin 2 is columns.
- Atomic types: The basic vector types: logical, integer, double (numeric), character, and complex. A vector holds one type; mixing them triggers coercion to the most general.
- Autocorrelation: The correlation of a series with a lagged copy of itself, measuring how much it remembers its past. White noise has autocorrelation near zero at every lag; acf…
- Autoregressive model: A model where a value depends on its own past: AR(1) is x[t] = phi*x[t-1] + noise, with phi the lag-1 regression slope. Forecasting one step ahead is phi * x[…
- Bootstrap: Resample the data WITH replacement many times, recompute the statistic, and use the spread of those values as its standard error or the percentiles as a confid…
- CDF: The cumulative distribution function P(X <= x), the running total of the PMF or the area under the PDF to the left of x. pnorm, pbinom, and pexp compute it.
- Center: Three measures of the typical value: the mean (balance point), the median (middle value, robust to outliers), and the mode (most frequent). They diverge under…
- Central Limit Theorem: The mean of many independent draws is approximately NORMAL, whatever the underlying distribution (provided its variance is finite), with standard error sigma/sqrt(n). The reason the normal s…
- Chi-square test: A test for COUNTS: it compares observed cell counts to those expected under a null, summing (O - E)^2 / E. Used for goodness-of-fit and for independence in a…
- Coercion: Automatic type conversion. Logicals become numbers (TRUE is 1, FALSE is 0), so sum(x > 3) counts how many elements exceed 3. Combining types promotes to the…
- Complete cases: Rows or values with no missing data. x[!is.na(x)] keeps them; imputation (filling NA with a value like the mean) is the alternative to dropping.
- Confidence interval: A range of plausible values for a parameter, e.g. mean +/- t_crit * SE. A 95% interval excludes exactly the null values a 5% test would reject, the duality of…
- Contingency table: A cross-tabulation of two categorical variables. chisq.test on it tests whether the variables are independent; Cramer's V scales the association to [0, 1].
- Correlation: Covariance scaled to [-1, 1]: cor(x, y) = cov(x, y) / (sd(x)*sd(y)). Plus or minus 1 is a perfect line. Spearman correlation is Pearson on the ranks, catching…
- Covariance: Measures whether two variables rise and fall together (positive) or oppose (negative), computed by cov(x, y). Its scale depends on the units, which is why correlation nor…
- Cross-validation: Estimating honest prediction error by holding out folds: train on the rest, test on the held-out fold, and average. k-fold splits into k folds; leave-one-out u…
- Data frame: R's table: a LIST of equal-length columns with a class attribute, so is.list(df) is TRUE. Columns can differ in type; df$col or df[['col']] reads o…
- Decomposition: Splitting a time series into trend, seasonal, and remainder components. Once separated, each piece can be modeled or removed; the parts sum back to the origina…
- Degrees of freedom: The number of independent pieces of information left after estimating parameters, e.g. n-1 for a sample variance. It sets the shape of the t and chi-square dis…
- Design matrix: The matrix X of predictors used in regression, with a leading column of 1s for the intercept. cbind(1, x) builds the simple-regression version.
- Differencing: Replacing a series with its period-to-period changes, diff(x). One difference flattens a linear trend; a second handles a quadratic. The route to stationarity.
- Distribution: The rule assigning probabilities to outcomes. R names them with the d/p/q/r prefixes: dnorm (density), pnorm (cumulative), qnorm (quantile), rnorm (random draw…
- Expectation: The probability-weighted average of outcomes, sum(values * probs), the distribution's center of mass. For a binomial it is n*p.
- F-statistic: The ratio of between-group to within-group mean squares in ANOVA, or of two variances. Large F means the groups differ more than chance would explain.
- Factor: R's type for categorical data: values stored as integer codes over a fixed set of LEVELS (alphabetical by default). The right type for groups, and what mod…
- Forecasting: Predicting future values of a series. Baselines include the naive forecast (carry the last value) and the drift method (extend the average trend); accuracy is…
- Formula: R's y ~ x syntax describing a relationship: response on the left, predictors on the right. lm, aov, glm, and aggregate all read formulas.
- Functionals: Functions that take functions: Reduce folds a vector to one value, Filter keeps matching elements, Map applies in parallel, do.call calls a function with a lis…
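- Gradient descent: An iterative optimizer that nudges parameters in the direction that lowers a loss: w <- w - lr * gradient (equivalently, w <- w + lr * gradient when the gradient is of an objective being maximized, such as a log-likelihood). How logistic regression and most ML mod…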
- Hat matrix & leverage: H = X(X'X)^-1 X' projects y onto the fitted values; its diagonal gives each point's LEVERAGE, how much its predictors can pull the fit. Its trace e…
- Hypothesis test: A procedure asking whether an observed effect could plausibly be chance. Compute a test statistic, get its p-value under the null, and reject the null if p is…
- Imputation: Filling missing values rather than dropping them, e.g. replacing each NA with the column mean. Keeps the sample size at the cost of some added assumption.
- IQR: The interquartile range Q3 - Q1, the spread of the middle half of the data. Robust to outliers, and the basis of the boxplot and the 1.5-IQR outlier fence.
- Jackknife: A resampling method that leaves out one observation at a time to estimate a statistic's bias and variance. The simpler ancestor of the bootstrap.
- lapply / sapply: lapply(x, f) applies f to each element and returns a LIST; sapply is the same but simplifies the result to a vector or matrix when it can.
- Law of Large Numbers: As the sample size grows, the sample mean converges to the true mean. Why more data gives more reliable estimates, and why simulation works.
- Least squares: The criterion that picks the line minimizing the sum of squared residuals. In matrix form the solution is the normal equations beta = (X'X)^-1 X'y, wh…
- Levels: The distinct categories of a factor, returned by levels(f). Counting them (length(levels(...))) gives the number of groups.
- Linear regression: Fitting a line (or plane) by LEAST SQUARES, minimizing the sum of squared residuals. lm(y ~ x) fits it; the slope is cov(x,y)/var(x), the intercept passes thr…
- List: R's heterogeneous container: it can hold values of different types, and other lists, at once. [[ extracts a single element (the bare value); [ returns a su…
- lm: R's linear-model function: lm(y ~ x, data = df). coef extracts coefficients, fitted / resid the fitted values and residuals, predict makes new predictions…
- Logical indexing: Using a TRUE/FALSE vector to keep the TRUE positions: x[x > 0] keeps the positives. The R idiom for filtering.
- Logistic regression: Regression for a yes/no outcome: it models the probability through the sigmoid 1/(1+exp(-z)), fit by maximizing the log-likelihood (R's glm(y ~ x, family…
- Matrix: A vector with two dimensions, filled COLUMN by column. Indexed m[i, j] (1-based), with fast whole-matrix summaries rowSums / colMeans and margin-wise apply(m,…
- merge (join): merge(a, b, by = 'id') joins two data frames on a shared key (an inner join by default; all.x = TRUE makes it a left join). The relational join in base…
- Monte Carlo: Estimating a quantity by simulating it: a probability becomes the fraction of trials where an event occurs, an integral becomes the average of a function at ra…
- Moving average & smoothing: Averaging a sliding window to see through noise. Exponential smoothing weights recent points more: s[t] = alpha*x[t] + (1-alpha)*s[t-1].
- Multiple comparisons: Run many tests and false positives accumulate: for m independent tests the family-wise error rate is 1 - (1 - alpha)^m. The Bonferroni fix tests each at alpha/m.
- Mutate: Adding or transforming a column by assignment: df$total <- df$price * df$qty. The verb for derived variables.
- NA (missing value): R's marker for missing data. NA is contagious: any arithmetic with it yields NA, so summaries need na.rm = TRUE. is.na detects it; anyNA is a fast any-mis…
- na.rm: The argument that tells summary functions to drop missing values first: mean(x, na.rm = TRUE) averages the non-missing. Without it, one NA poisons the whole re…
- Named vector: A vector whose elements carry names, looked up with v['key'] (a sub-vector) or v[['key']] (the bare value). The lightweight key-value store of…
- Normal distribution: The bell curve defined by a mean and sd. About 68/95/99.7% of values fall within 1/2/3 sd; pnorm / qnorm give areas and cut points. The 97.5th percentile is th…
- Normal equations: The closed-form least-squares solution beta = solve(t(X) %*% X) %*% t(X) %*% y, where X is the design matrix (a column of 1s plus the predictors). The algebra…
- Null hypothesis: The default 'no effect' claim a test tries to disprove. Rejecting it means the data are unlikely under no effect; failing to reject is not the same as…
- One-based indexing: R indexes from 1, not 0: x[1] is the first element. A negative index DROPS elements (x[-1] is everything but the first), unlike most languages.
- Outlier: A value far from the rest. Tukey's rule flags points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR; the z-score rule flags points more than k sd from the mean. Wins…
- Overfitting: When a model fits its training data better than new data, capturing noise as if it were signal. The tell is test error exceeding train error, revealed by cross…
- p-value: The probability, IF the null is true, of a result at least as extreme as the one observed. Small p casts doubt on the null. It is not the probability the null…
- Permutation test: Build the null distribution by SHUFFLING group labels: if the groups did not matter, the labels are exchangeable. The p-value is the fraction of shuffles at le…
- PMF and PDF: A probability MASS function gives the probability of each discrete outcome (and sums to 1); a probability DENSITY function describes a continuous distribution…
- Power: The probability of correctly rejecting a false null, the chance of DETECTING a real effect. It rises with effect size, sample size, and alpha, and is often est…
- predict: predict(model, newdata) applies a fitted model to new data. The honest test of a model is its error on data it did not train on.
- Probability: A number in [0, 1] measuring how likely an event is. Independent events multiply; complements subtract from 1. The foundation under every statistical method.
- Quantile: The value below which a given fraction of the data lies: quantile(x, 0.5) is the median. Percentiles are quantiles times 100; fivenum gives the five-number sum…
- R-squared: The fraction of the response's variance the model explains: 1 - SSE/SST. For simple regression it equals the squared correlation. Adjusted R-squared penal…
- Recycling: When two vectors differ in length, R repeats the shorter one to match the longer. c(1,2,3,4) + c(10,20) becomes c(11,22,13,24). A scalar recycles to any lengt…
- Reduce (fold): Reduce(f, x) combines elements left to right into a single value, with an optional init seed that also handles the empty case. Sum, product, and running aggreg…
- Resampling: Estimating uncertainty by re-drawing from the data: WITH replacement (the bootstrap) or by shuffling labels (permutation). When a formula is hard, let the comp…
- Residual: The gap between an actual value and the model's fitted value, y - fitted. Residuals sum to zero for a least-squares fit that includes an intercept, and they reveal where the model misses.
- RMSE / MAE: Prediction-error summaries: RMSE is the root mean squared residual (in the original units, outlier-sensitive); MAE is the mean absolute residual (more robust).
- Robust statistics: Summaries that resist outliers: the median, IQR, and MAD barely move when an extreme value is added, while the mean and sd can be wrecked by a single one.
- Select & filter: The two core data-frame verbs, done with indexing: df[, cols] picks columns, df[df$age >= 18, ] keeps rows matching a condition. The dplyr verbs reimplement…
- set.seed: Fixes R's random-number generator so a simulation produces the SAME draws every run. Essential for reproducible (and testable) Monte Carlo and resampling.
- Sigmoid & logit: The sigmoid 1/(1+exp(-z)) squashes any real number to a probability in (0,1); its inverse, the logit log(p/(1-p)), is the log-odds. The link between linear pr…
- Significance level (alpha): The threshold for rejecting the null, conventionally 0.05. It is also the type I error rate: the chance of a false positive when the null is actually true.
- Simulation: Generating random data from a model to study its behavior. replicate(n, expr) runs an experiment n times; sample, runif, and rnorm provide the randomness.
- Skewness: A measure of asymmetry: positive means a long right tail (and the mean exceeds the median). The quick check is mean(x) > median(x).
- Split-apply-combine: The pattern at the heart of data analysis: split data into groups, apply a summary to each, combine the results. tapply, split, and aggregate do it in one li…
- Standard error: The standard deviation of a STATISTIC across samples, e.g. sd(x)/sqrt(n) for the mean. It shrinks as n grows and is the denominator of a test statistic.
- Standardize (z-score): Rescaling to mean 0 and sd 1 via (x - mean(x)) / sd(x). The z-score says how many standard deviations a value sits from the mean.
- Stationarity: A series whose statistical behavior (mean, variance) does not drift over time. Differencing (diff) removes a trend to reach it; a random walk is the classic…
- Subsetting: Extracting parts of a vector by position (x[c(1,3)]), by a logical mask (x[x > 0]), or by name (x['a']). The most-used operation in R.
- t-test: Tests a mean (or difference of means) when the sd is estimated from the data, using the t-distribution with n-1 degrees of freedom. R's t.test does one-sam…
- table: table(x) tallies how many times each distinct value appears, returning a named count vector; table(a, b) builds a contingency table for two variables.
- tapply: tapply(x, group, f) applies f to each group of x defined by group, returning a named result, the canonical split-apply-combine call.
- The apply family: Functions that apply another function over a structure without a loop: lapply returns a list, sapply simplifies to a vector, vapply adds a type check, mapply w…
- The d/p/q/r family: R's four functions per distribution: d for the density or mass, p for the cumulative probability (left tail), q for the quantile (inverse of p), r for rand…
- Time series: Data ordered in time, where each point depends on those before it, breaking the usual independence assumption. Analyzed with smoothing, autocorrelation, and fo…
- Type I and II error: A type I error rejects a true null (false positive, rate alpha); a type II error fails to reject a false null (false negative). Power is 1 minus the type II ra…
- unlist: Flattens a list of values into a plain vector, so sum(unlist(l)) totals a list of numbers. The bridge from list-land back to vectors.
- vapply: Like sapply but you declare the expected result type with a template (e.g. numeric(1)), so a surprise type fails loudly instead of silently returning a list.
- Variance & sd: Variance is the expected squared deviation from the mean; standard deviation is its square root, back in the original units. R's var / sd use the unbiased…
- Vector: R's atomic data structure: an ordered sequence of values of ONE type. Even a single number is a length-1 vector, which is why so much of R is vectorized.
- Vectorization: Applying an operation to a whole vector at once instead of looping element by element. x * 2 doubles every element; vectorized code is both shorter and faster…
- which: which(condition) returns the POSITIONS where a logical vector is TRUE, while %in% tests membership. which.max and which.min return the position of the extreme…
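
Several of the concepts above (set.seed, replicate, sample, standard error, and the bootstrap) fit together in a few lines. The sketch below is illustrative only: the exponential sample, the seed, the 2000 resamples, and the names x, boot_means, boot_se, formula_se, and ci are assumptions chosen for the example, not values taken from the course.

```r
# A hypothetical skewed sample; set.seed makes the draws reproducible.
set.seed(1)
x <- rexp(40, rate = 0.5)

# Bootstrap: resample WITH replacement and recompute the mean each time.
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

# The spread of the resampled means estimates the standard error...
boot_se <- sd(boot_means)

# ...which can be compared with the textbook formula sd(x)/sqrt(n).
formula_se <- sd(x) / sqrt(length(x))

# Percentile confidence interval from the bootstrap distribution.
ci <- quantile(boot_means, c(0.025, 0.975))

print(c(bootstrap = boot_se, formula = formula_se))
print(ci)
```

For the mean, the two standard errors should land close together; the bootstrap earns its keep for statistics such as the median, where no simple formula is available.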