Data Science with Python
Learn data science and machine learning by doing: pandas, analysis, visualization, statistics, and ML algorithms built from scratch.
10 projects, 250 hands-on levels, run in your browser.
Syllabus
- Data with pandas: pandas is the foundation of data work in Python. Master the two core structures: the Series (a labeled 1D array) and the DataFrame (a labeled table). Load data, inspect it, select rows and columns, filter, and compute summary statistics. These operations are the everyday vocabulary of every data scientist.
- Data Cleaning & Wrangling: Real data is messy. Detect and handle missing values, find and drop duplicates, fix wrong data types, standardize inconsistent text, and transform columns with map, apply, and binning. Cleaning is where data scientists spend most of their time, and getting it right is what makes every later analysis trustworthy.
- Exploratory Data Analysis: Before modeling, you explore. Compute summary statistics (center and spread), aggregate by group, examine distributions with value counts, and measure relationships between variables with correlation. EDA is how you build intuition for a dataset and discover the patterns worth investigating.
- Data Visualization: A chart often reveals what a table of numbers hides. Build the four workhorse plots with matplotlib: line charts for trends, bar charts for comparing categories, histograms for distributions, and scatter plots for relationships. Label them clearly and choose the right chart for the question.
- Statistics & Inference: Move from describing data to reasoning about it. Compute descriptive statistics with numpy, understand sampling and standard error, build confidence intervals, and run hypothesis tests (t-tests) to judge whether an observed effect is likely to be more than noise. This is the statistical rigor that turns observations into conclusions.
- Linear Regression: Your first machine-learning model, built from scratch with numpy. Fit a straight line to data: define the linear model and its error, solve for the best-fit line with least squares, and measure how well it fits with R-squared. Then fit the same line iteratively with gradient descent, the optimization algorithm behind nearly all modern ML.
- Classification: Predict categories, not numbers. Build two classifiers from scratch: k-nearest neighbors (classify by the majority vote of the closest points) and logistic regression (a sigmoid model trained with gradient descent). Then evaluate them properly with the confusion matrix, accuracy, precision, recall, and F1.
- Model Evaluation: The central question of machine learning: will the model work on data it has never seen? Hold out a test set, validate robustly with k-fold cross-validation, diagnose overfitting and underfitting from the train-test gap, and understand the bias-variance tradeoff. This discipline separates a model that generalizes from one that merely memorizes.
- Unsupervised Learning: Find structure in data that has no labels. Build k-means clustering from scratch (group points by similarity), learn to choose the number of clusters, and implement PCA (principal component analysis) to reduce dimensions while keeping the most variance. Clustering and dimensionality reduction are two core techniques of unsupervised learning.
- Capstone: An End-to-End ML Project: The grand finale. A raw, messy dataset arrives with a prediction goal. Run the complete machine-learning pipeline you have built across the whole track: load and clean the data, explore it, split and train a model, evaluate it honestly, and make predictions. It covers every stage of real data science, end to end.
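As a taste of the Linear Regression module described above, here is a minimal sketch of its three steps: a closed-form least-squares fit, the R-squared score, and the same fit learned by gradient descent. It assumes plain Python lists instead of numpy so that it runs with no dependencies. The function names, the toy data, and the learning rate and step count are illustrative choices, not the course's own code.

```python
def fit_least_squares(xs, ys):
    """Closed-form slope and intercept for the line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def fit_gradient_descent(xs, ys, learning_rate=0.01, steps=5000):
    """Minimize mean squared error by stepping along the negative gradient."""
    n = len(xs)
    slope, intercept = 0.0, 0.0
    for _ in range(steps):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = (2 / n) * sum(errors)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Hypothetical toy data: roughly y = 2x + 1 with a little noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

ls_slope, ls_intercept = fit_least_squares(xs, ys)
gd_slope, gd_intercept = fit_gradient_descent(xs, ys)
score = r_squared(xs, ys, ls_slope, ls_intercept)

print(f"least squares:    slope={ls_slope:.3f}, intercept={ls_intercept:.3f}")
print(f"gradient descent: slope={gd_slope:.3f}, intercept={gd_intercept:.3f}")
print(f"R-squared: {score:.4f}")
```

On this small dataset both methods should land on nearly the same line. The learning rate matters: too large a step makes gradient descent diverge, and too small a step makes it need many more iterations.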
Key concepts
- Bias-variance tradeoff: Too simple a model underfits (high bias); too complex overfits (high variance). Good models balance the two.
- Confusion matrix: A table counting true positives, false positives, true negatives, and false negatives. It summarizes a classifier's correct and incorrect predictions and is the basis of precision and recall.
- Feature and label: A feature is an input variable describing an example; the label is the target you predict. Supervised learning maps features to labels.
- Gradient descent: Minimizing a loss by repeatedly stepping the parameters downhill along the negative gradient, the workhorse of model fitting.
- Hypothesis test: A procedure that weighs evidence against a null hypothesis, deciding whether an effect is statistically significant.
- k-means clustering: An unsupervised method that partitions points into k groups by alternately assigning points to the nearest center and moving centers to the mean.
- Loss function: A number measuring how wrong a model's predictions are; training minimizes it.
- Overfitting: When a model memorizes training noise and fails on new data; the gap between train and test performance reveals it.
- p-value: The probability of seeing data at least this extreme if the null hypothesis were true; small p casts doubt on the null.
- Precision and recall: Precision is the fraction of positive predictions that are correct; recall is the fraction of actual positives found. Raising one often lowers the other; F1, their harmonic mean, combines them into a single score.
- R-squared: The fraction of variance in the target explained by a regression model; 1 is perfect, 0 is no better than the mean.
- Train/test split: Holding out part of the data to evaluate a model on examples it never trained on, the basic guard against fooling yourself.
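The confusion matrix, precision, recall, and F1 entries above reduce to a few lines of arithmetic. The sketch below assumes binary labels where 1 is the positive class. The function names and the example labels are made up for illustration, and returning 0.0 when a denominator is zero is one common convention rather than the only one.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif actual == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from the confusion counts."""
    # Assumption: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / len(y_true)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} F1={f1:.2f}")
```

Here the classifier misses one positive and raises one false alarm, so precision and recall both come out at 0.75 while accuracy is 0.80. On imbalanced data, accuracy can stay high even when precision or recall is poor, which is why all three are worth reporting.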