Road to Stanford CS336: A Complete Free Self-Study Guide

Built for Sarang | Goal: Understand Stanford CS336 — Language Modeling from Scratch



Overview: What You’re Working Toward

Stanford CS336 — Language Modeling from Scratch (cs336.stanford.edu) is a graduate-level course taught by Percy Liang and Tatsunori Hashimoto. The goal of the course is to build a large language model entirely from scratch — including the tokenizer, the transformer architecture, the training loop, distributed training across GPUs, and alignment techniques like RLHF.

The course officially requires:

You currently have MYP5 math, basic Python, and JavaScript experience. This guide maps out every step between where you are now and where CS336 starts, using only 100% free resources — no credit cards, no free trials, no paywalls.


Before You Start: Mindset and Setup

Code everything. Reading about neural networks without coding them is like reading about swimming without getting in the water. Every phase in this guide requires you to run code, not just watch videos. If you skip this, you will not be ready for CS336.

Use Google Colab (colab.research.google.com) as your coding environment throughout this entire journey. It gives you a free GPU in the browser, requires only a Google account (which you already have), and runs Jupyter notebooks — the same format used by CS336, Karpathy, fast.ai, and Hugging Face.

Use GitHub (github.com) to save your work. You already use VS Code, so install the GitHub extension and commit every notebook you write. This builds a portfolio that proves your work is real.

Do not rush phases. The most common mistake is moving forward before the previous phase is solid. If something in Phase 4 doesn’t make sense, the problem is almost always Phase 0 — a math gap you skipped.


Phase 0 — Close the Math Gaps

This is the most important phase. Everything in deep learning — every weight update, every attention score, every loss function — is applied linear algebra and calculus. Do not skip this phase or rush through it. You will regret it later.

Estimated time: 6–8 weeks


0.1 Linear Algebra (3Blue1Brown)

Link: youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab

This is the Essence of Linear Algebra playlist by Grant Sanderson (3Blue1Brown). It has 16 videos, each roughly 5 to 20 minutes long. The visual approach is unusually good: you will come away with intuition that people who learned through pure symbol manipulation often never build.

What to watch and what to pay attention to:

| Video | Key Concept | What to Look Out For |
| --- | --- | --- |
| Ch. 1: Vectors | Vectors as arrows in space | Understand both the geometric view (arrow) and the numerical view (list of numbers); AI uses both simultaneously |
| Ch. 2: Linear combinations, span, basis | What a basis is | The idea of a “basis” is everywhere in NLP; word embeddings are vectors in a basis |
| Ch. 3: Linear transformations and matrices | Matrices as transformations | This is the single most important video: a matrix multiplication is not just arithmetic, it is a transformation of space |
| Ch. 4: Matrix multiplication as composition | Chaining transformations | Every layer in a neural network is a matrix multiplication; stacking layers = composing transformations |
| Ch. 6: The determinant | Scaling of space | Used in understanding probability distributions |
| Ch. 9: Dot products | Inner products | Attention in a transformer is literally dot products; understand this deeply |
| Ch. 14: Eigenvectors and eigenvalues | Directions that don’t change under transformation | Used in PCA, optimization, understanding training dynamics |

After watching: Go to MIT OpenCourseWare 18.06 (ocw.mit.edu/courses/18-06sc-linear-algebra-fall-2011) and do the first 5 problem sets. These are actual MIT assignments, free to access. They will show you whether you actually understood or just felt like you understood.

Red flag: If you can’t answer “what does matrix multiplication mean geometrically?” without looking it up, you are not ready to move on yet. Rewatch Chapters 3 and 4 first.
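
A few lines of NumPy can make the geometric reading concrete. The two matrices below (a 90-degree rotation and a horizontal stretch) are illustrative choices, not examples taken from the videos. The sketch shows that applying two transformations one after the other gives the same result as applying their single product matrix, and that the dot product of perpendicular vectors is zero.

```python
import numpy as np

# Illustrative 2x2 transformations (chosen for this example).
rotate_90 = np.array([[0.0, -1.0],
                      [1.0,  0.0]])   # rotates vectors 90 degrees counterclockwise
stretch_x = np.array([[2.0, 0.0],
                      [0.0, 1.0]])    # doubles the x-component

v = np.array([1.0, 1.0])

# Apply the rotation first, then the stretch.
step_by_step = stretch_x @ (rotate_90 @ v)

# Composition: one matrix that does both (the right-most matrix acts first).
combined = stretch_x @ rotate_90
all_at_once = combined @ v

# The dot product measures alignment: it is 0 for perpendicular vectors.
alignment = np.dot(v, rotate_90 @ v)

print(step_by_step, all_at_once, alignment)
```

Swapping the order to `rotate_90 @ stretch_x` gives a different matrix, which is a quick way to see that matrix multiplication is not commutative.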


0.2 Calculus (3Blue1Brown)

Link: youtube.com/playlist?list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr

This is the Essence of Calculus playlist. It is 12 videos. You likely know basic derivatives from MYP5, but this series builds the intuition underneath the symbol-pushing — which is what you need for understanding backpropagation.

What you must deeply understand from this series:

What you need beyond this series that it doesn’t fully cover:

Red flag: If you cannot explain why the chain rule works in your own words, stop and rewatch Chapter 4 before moving on.
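
One way to test your understanding of the chain rule is to check it numerically. The sketch below uses an arbitrary composite function, sin(x²), and compares the derivative the chain rule predicts against a finite-difference estimate. The function and the evaluation point are assumptions picked for illustration.

```python
import math

def f(x):
    # Composite function: outer sin applied to inner x**2.
    return math.sin(x ** 2)

def chain_rule_derivative(x):
    # d/dx sin(u) = cos(u) * du/dx, with u = x**2 and du/dx = 2x.
    return math.cos(x ** 2) * 2 * x

def numerical_derivative(func, x, h=1e-6):
    # Central difference: slope of a tiny secant line around x.
    return (func(x + h) - func(x - h)) / (2 * h)

x = 1.3
analytic = chain_rule_derivative(x)
numeric = numerical_derivative(f, x)
print(f"chain rule: {analytic:.6f}, finite difference: {numeric:.6f}")
```

Backpropagation is this same check run in reverse across many nested functions, which is why the chain rule matters so much later on.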


0.3 Probability & Statistics (Khan Academy)

Link: khanacademy.org/math/statistics-probability

Link (probability section specifically): khanacademy.org/math/statistics-probability/probability-library

On the surface this is the most “boring” part of Phase 0, but it is critically important. Language models are probabilistic systems: a transformer outputs a probability distribution over vocabulary tokens. If you don’t understand probability, you cannot understand what a model is actually doing.

Topics you must cover:

What to skip for now: The inferential statistics section (hypothesis testing, confidence intervals) is less relevant at this stage. Come back to it later if needed.

Red flag: You must be able to answer, without looking it up, “what is a conditional probability and why is P(next word | all previous words) the right way to think about language modeling?”
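
Conditional probability becomes less abstract once you estimate one from counts. The sketch below uses a tiny made-up corpus and conditions on only the single previous word, which is a deliberate simplification: a real language model conditions on all previous words.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, for illustration only.
corpus = "the cat sat on the mat the cat ran".split()

# Count how often each word follows each other word.
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def p_next_given_prev(nxt, prev):
    # P(next | prev) = count(prev followed by next) / count(prev followed by anything)
    total = sum(follow_counts[prev].values())
    return follow_counts[prev][nxt] / total

p_cat = p_next_given_prev("cat", "the")
p_mat = p_next_given_prev("mat", "the")
print(p_cat, p_mat)
```

The probabilities of every word that can follow “the” sum to 1, which is exactly the property a model's softmax output has.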

Phase 1 — Python Depth (CS50P)

Link: cs50.harvard.edu/python

You already know Python from CS50x. CS50P goes deeper. It is 100% free with no account required to watch lectures, though you will want to create a free edX account to submit problem sets and get feedback.

Why you still need this: CS336 is explicit that its code volume is “an order of magnitude greater than a typical ML course.” The assignments involve writing optimized, production-style Python. CS50P will significantly strengthen your ability to write clean, modular, well-tested Python code.

Estimated time: 3–4 weeks

Topics in CS50P and what to pay extra attention to:

| Week | Topic | Why It Matters for AI |
| --- | --- | --- |
| Week 2: Loops | Generator functions and lazy evaluation | PyTorch DataLoaders use generators; understanding lazy evaluation is essential for handling large datasets |
| Week 4: Libraries | NumPy and working with external packages | NumPy is the backbone of all numerical computation in Python; everything in AI uses it |
| Week 6: File I/O | Reading and writing large files | Training data is massive; you need to handle file I/O efficiently |
| Week 8: Object-Oriented Programming | Classes, inheritance, dunder methods | PyTorch’s nn.Module is a class; you will subclass it for every neural network you build |
| Week 9: Et Cetera | *args, **kwargs, type hints, decorators | Used extensively in PyTorch and modern AI codebases |

After CS50P: Spend one week getting comfortable with NumPy specifically. Go to numpy.org/learn and work through the “NumPy fundamentals” section. Focus on array creation, broadcasting, and vectorized operations. The reason: PyTorch tensors follow the same array model as NumPy, so understanding NumPy arrays makes it much clearer what PyTorch tensors are actually doing.
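
Broadcasting and vectorization are easiest to grasp side by side with the loop they replace. The sketch below centers each column of a small made-up data array, once with broadcasting and once with explicit loops. The data values are hypothetical.

```python
import numpy as np

# Hypothetical data: 4 samples, 3 features each.
data = np.array([[1.0, 10.0, 100.0],
                 [2.0, 20.0, 200.0],
                 [3.0, 30.0, 300.0],
                 [4.0, 40.0, 400.0]])

# Column means have shape (3,); broadcasting stretches them across all 4 rows.
col_means = data.mean(axis=0)
centered = data - col_means

# The equivalent explicit loops, shown only for comparison.
centered_loop = np.empty_like(data)
for i in range(data.shape[0]):
    for j in range(data.shape[1]):
        centered_loop[i, j] = data[i, j] - col_means[j]

print(centered)
```

Both versions produce the same array. The vectorized one is shorter and, on large arrays, far faster because the loop runs in compiled code.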

What to look out for:


Phase 2 — First Look at ML (Google Machine Learning Crash Course)

Link: developers.google.com/machine-learning/crash-course

This is Google’s free, fully reimagined (2024 version) Machine Learning Crash Course. It is approximately 15 hours, completely free, and requires no login. It covers fundamental ML concepts with interactive visualizations and code exercises using Python.

Estimated time: 2–3 weeks

What this course covers that you need:

What to look out for:

Checkpoint: After this course, you should be able to answer: “What is a loss function, why do we compute gradients with respect to it, and what does gradient descent do with those gradients?” If you can explain this in plain language, you’re ready for Phase 3.
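
The checkpoint question can be answered in code as well as in words. The following is a minimal sketch of gradient descent fitting a single weight to toy data. The data, learning rate, and step count are assumptions chosen so the example converges quickly.

```python
# Toy data generated from y = 3x (the "true" weight is 3; chosen for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def loss(w):
    # Mean squared error between predictions w*x and targets y.
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w):
    # Derivative of the loss with respect to w.
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0                 # initial guess
learning_rate = 0.05
for step in range(100):
    w -= learning_rate * grad(w)   # step downhill, against the gradient

print(w, loss(w))
```

The loss measures how wrong the current weight is, the gradient says which direction makes it more wrong, and each update moves the weight a small step the opposite way.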


Phase 3 — Practical Deep Learning (fast.ai)

Link: course.fast.ai

This is fast.ai’s “Practical Deep Learning for Coders” — approximately 60 hours of material, completely free, no login required to access lectures. All exercises run in Kaggle notebooks (free, no GPU cost). This is a legendary course in the AI community and is the best top-down introduction to deep learning available anywhere.

Estimated time: 8–10 weeks

What makes this course unusual (and why it works): Most courses teach you theory first, then application. fast.ai does the opposite — you build a working image classifier in Lesson 1, and the theory is revealed gradually. This approach means you have working code from the start, which makes the theory much easier to understand when it appears.

Lesson-by-lesson guide:

What to look out for:


Phase 4 — The Core Pipeline (Karpathy Zero to Hero)

Link: karpathy.ai/zero-to-hero.html
YouTube playlist: youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ
GitHub (code): github.com/karpathy/nn-zero-to-hero

This is the single most important resource in this entire guide for your goal. Andrej Karpathy (co-founder of OpenAI, former Director of AI at Tesla) built a series of videos that starts with calculus and ends with a working GPT. Every video involves building something from scratch in Python. This is as close as you can get to CS336’s philosophy without being in the course itself.

Estimated time: 12–16 weeks (do not rush this)


Lecture-by-lecture breakdown:

Lecture 1 — Micrograd: Backpropagation from Scratch

Video: youtube.com/watch?v=VMj-3S1tku0
Duration: ~2.5 hours

Karpathy builds micrograd — a tiny autograd engine in ~150 lines of Python. Autograd is the system that PyTorch (and all deep learning frameworks) use to automatically compute gradients. You will implement:

What to look out for:

Checkpoint: After this lecture, you should be able to implement a two-layer neural network that trains on a simple dataset using only your micrograd engine and plain Python. No PyTorch.
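
To give a sense of what such an engine looks like, here is a heavily reduced sketch written in the spirit of micrograd. It is not Karpathy's code: it supports only addition, multiplication, and tanh, and the class and method names are this example's own choices.

```python
import math

class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None   # replaced by the operation that creates this node

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(a+b)/da = 1 and d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a*b)/da = b and d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():
            # d tanh(x)/dx = 1 - tanh(x)^2
            self.grad += (1 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule from output to inputs.
        order, seen = [], set()
        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)
        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()

# One neuron: y = tanh(w*x + b), with made-up input values.
x, w, b = Value(2.0), Value(-0.5), Value(0.3)
y = (w * x + b).tanh()
y.backward()
print(y.data, w.grad, x.grad, b.grad)
```

After `y.backward()`, each input's `grad` holds how much `y` would change if that input were nudged, which is all a training loop needs to update weights.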


Lecture 2 — Makemore Part 1: Bigram Language Model

Video: youtube.com/watch?v=PaCmpygFfXo
Duration: ~1.5 hours

Karpathy introduces language modeling using character-level bigrams. A bigram model predicts the next character based only on the current character. This is the simplest possible language model — but the concepts introduced (token probabilities, loss functions for language models, the idea of “next token prediction”) are identical to what GPT-4 does at a much larger scale.
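
A counting-based version of the idea fits in a few lines. The sketch below is a simplified stand-in rather than the lecture's code: it uses three made-up names, plain Python instead of PyTorch, and add-one smoothing. It builds bigram counts and then scores words by average negative log-likelihood, the loss that language models minimize.

```python
import math
from collections import Counter

# Made-up training words; "." marks the start and end of each word.
words = ["emma", "ava", "mia"]

counts = Counter()
for word in words:
    chars = ["."] + list(word) + ["."]
    for c1, c2 in zip(chars, chars[1:]):
        counts[(c1, c2)] += 1

vocab = sorted({c for pair in counts for c in pair})

def prob(c2, c1):
    # P(c2 | c1) with add-one smoothing so unseen pairs never get probability 0.
    row_total = sum(counts[(c1, c)] for c in vocab)
    return (counts[(c1, c2)] + 1) / (row_total + len(vocab))

def average_nll(word):
    # Average negative log-likelihood over the word's character transitions.
    chars = ["."] + list(word) + ["."]
    pairs = list(zip(chars, chars[1:]))
    return -sum(math.log(prob(c2, c1)) for c1, c2 in pairs) / len(pairs)

seen_loss = average_nll("ava")     # a word from the training set
unseen_loss = average_nll("vvv")   # an unlikely sequence
print(seen_loss, unseen_loss)
```

A lower loss means the model found the word more plausible, so the training word scores better than the unlikely one.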

What to look out for:


Lecture 3 — Makemore Part 2: MLP

Video: youtube.com/watch?v=TCH_1BHY58I
Duration: ~1.5 hours

Karpathy implements the Bengio 2003 MLP language model — the paper that launched neural language modeling. You build a multi-layer perceptron that uses learned character embeddings.
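
The forward pass of such a model can be sketched in NumPy. The sizes below and the random weights are assumptions for illustration, not the lecture's exact values, and the lecture itself uses PyTorch. The point is the shape of the computation: look up an embedding per character, concatenate, apply a hidden layer, and produce next-character probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions for this sketch).
vocab_size, context_len, embed_dim, hidden_dim = 27, 3, 2, 16

C = rng.normal(size=(vocab_size, embed_dim))   # embedding table, one row per character
W1 = rng.normal(size=(context_len * embed_dim, hidden_dim)) * 0.1
b1 = np.zeros(hidden_dim)
W2 = rng.normal(size=(hidden_dim, vocab_size)) * 0.1
b2 = np.zeros(vocab_size)

# A batch of 4 contexts, each holding 3 character indices.
X = rng.integers(0, vocab_size, size=(4, context_len))

emb = C[X]                              # (4, 3, 2): one embedding per context character
flat = emb.reshape(X.shape[0], -1)      # (4, 6): concatenate the context embeddings
hidden = np.tanh(flat @ W1 + b1)        # (4, 16)
logits = hidden @ W2 + b2               # (4, 27)

# Softmax turns logits into a probability distribution over the next character.
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
probs = exp / exp.sum(axis=1, keepdims=True)
print(probs.shape)
```

Training would adjust `C`, `W1`, `b1`, `W2`, and `b2` so that the probability assigned to the true next character goes up.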

What to look out for:


Lecture 4 — Makemore Part 3: Activations, Gradients, BatchNorm

Video: youtube.com/watch?v=P6sfmUTpUmc
Duration: ~2 hours

This lecture goes deep into the practical difficulties of training neural networks — dead neurons, vanishing/exploding gradients, and BatchNorm.
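
The effect of BatchNorm on saturated activations can be seen with a small NumPy experiment. This is a sketch of the forward pass in training mode only, with deliberately badly scaled made-up inputs; running statistics and the backward pass are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-activations for a batch of 32 examples and 5 neurons,
# deliberately badly scaled (large mean and spread).
pre_act = rng.normal(loc=4.0, scale=10.0, size=(32, 5))

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # Normalize each neuron (column) across the batch, then rescale and shift.
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

gamma = np.ones(5)    # learnable scale, initialized to 1
beta = np.zeros(5)    # learnable shift, initialized to 0
normalized = batchnorm_forward(pre_act, gamma, beta)

# Fraction of tanh units pushed into the flat region (|tanh| > 0.99), before and after.
saturated_before = np.mean(np.abs(np.tanh(pre_act)) > 0.99)
saturated_after = np.mean(np.abs(np.tanh(normalized)) > 0.99)
print(saturated_before, saturated_after)
```

A saturated tanh unit has a gradient near zero, so fewer saturated units means more of the network can actually learn.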

What to look out for:


Lecture 5 — Makemore Part 4: Backprop From Scratch (Manual)

Video: youtube.com/watch?v=q8SA3rM6ckI
Duration: ~2 hours

Karpathy manually implements the backward pass through the whole network, including the cross-entropy loss and BatchNorm, without using PyTorch’s autograd. This is the hardest lecture in the series.
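
A good warm-up for manual backpropagation is to derive one gradient by hand and verify it numerically. The sketch below does this for softmax followed by cross-entropy, where the hand-derived gradient with respect to the logits is the predicted probabilities minus the one-hot target. The logit values are made up.

```python
import math

def softmax(logits):
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Loss = -log(probability assigned to the correct class).
    return -math.log(softmax(logits)[target])

def manual_grad(logits, target):
    # Hand-derived result: dLoss/dlogit_i = p_i - 1 if i is the target, else p_i.
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

def numerical_grad(logits, target, h=1e-6):
    # Nudge each logit up and down and measure how the loss changes.
    grads = []
    for i in range(len(logits)):
        up = logits[:i] + [logits[i] + h] + logits[i + 1:]
        down = logits[:i] + [logits[i] - h] + logits[i + 1:]
        grads.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * h))
    return grads

logits, target = [2.0, -1.0, 0.5], 0   # made-up example values
by_hand = manual_grad(logits, target)
by_nudging = numerical_grad(logits, target)
print(by_hand, by_nudging)
```

The same compare-against-a-reference habit is what makes the rest of the manual backward pass tractable: derive one layer, check it, then move to the next.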

What to look out for:


Lecture 6 — Makemore Part 5: WaveNet

Video: youtube.com/watch?v=t3YJ5hKiMQ0
Duration: ~1 hour

Karpathy rebuilds the architecture as a deeper, tree-like structure modeled on WaveNet, in which neighbouring characters are fused progressively rather than all at once. He also dives into PyTorch internals such as nn.Module and parameters.
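
The tree-like structure is mostly a matter of reshaping. The sketch below shows only that reshaping step in NumPy, with illustrative sizes. In the actual model a linear layer and a nonlinearity would sit between the fusion steps, and the function name here is this example's own.

```python
import numpy as np

# Illustrative shapes: batch of 4 sequences, 8 characters each, 10-dim embeddings.
batch, seq_len, channels = 4, 8, 10
x = np.arange(batch * seq_len * channels, dtype=float).reshape(batch, seq_len, channels)

def fuse_pairs(t):
    # Merge each pair of neighbouring positions into one position with twice the channels.
    b, length, c = t.shape
    return t.reshape(b, length // 2, c * 2)

level1 = fuse_pairs(x)        # (4, 4, 20): 8 characters become 4 pairs
level2 = fuse_pairs(level1)   # (4, 2, 40): 4 pairs become 2 groups of four
level3 = fuse_pairs(level2)   # (4, 1, 80): one summary of the whole context

print(level1.shape, level2.shape, level3.shape)
```

Each level halves the sequence length, so information from distant characters is combined gradually instead of being squashed together in a single layer.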

What to look out for:


Lecture 7 — Build GPT from Scratch

Video: youtube.com/watch?v=kCc8FmEb1nY
Duration: ~2 hours

This is the most important lecture in the series. Karpathy implements a character-level GPT (transformer) from scratch. Every component of the modern transformer is built and explained:
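
Causal self-attention, the core of the model built in this lecture, can be sketched in a few lines of NumPy. This is a single-head version with random matrices standing in for learned weights. The sizes and variable names are assumptions for illustration, and the lecture itself uses PyTorch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 5 tokens, 8-dim embeddings, one 4-dim attention head.
T, d_model, d_head = 5, 8, 4
x = rng.normal(size=(T, d_model))

# Random projection matrices stand in for learned weights.
Wq = rng.normal(size=(d_model, d_head))
Wk = rng.normal(size=(d_model, d_head))
Wv = rng.normal(size=(d_model, d_head))

q, k, v = x @ Wq, x @ Wk, x @ Wv

# Scores are scaled dot products between every query and every key.
scores = (q @ k.T) / np.sqrt(d_head)

# Causal mask: token i may only attend to tokens 0..i, never to the future.
future = np.triu(np.ones((T, T), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)

# Row-wise softmax turns scores into attention weights that sum to 1.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights = weights / weights.sum(axis=1, keepdims=True)

out = weights @ v   # each output row is a weighted average of the visible value vectors
print(out.shape)
```

The first token can only see itself, so its output is exactly its own value vector. Later tokens blend information from everything before them.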

What to look out for:


Lecture 8 — GPT Tokenizer from Scratch

Video: youtube.com/watch?v=zduSFxRajkE
Duration: ~2 hours

Karpathy implements Byte Pair Encoding (BPE), the tokenization algorithm used by GPT-2, GPT-4, and many other modern LLMs. CS336’s first lecture covers tokenization.
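
The core BPE training loop is short enough to sketch in plain Python. The version below is a minimal illustration, not the lecture's code or a production tokenizer: it starts from raw UTF-8 bytes, repeatedly finds the most frequent adjacent pair, and replaces it with a new token id. The input string and the number of merges are arbitrary.

```python
from collections import Counter

def most_common_pair(ids):
    # Count every adjacent pair of token ids and return the most frequent one.
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]

def merge(ids, pair, new_id):
    # Replace each occurrence of `pair` with the single token `new_id`.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"                 # toy input, chosen for illustration
ids = list(text.encode("utf-8"))     # start from raw bytes (ids 0..255)
original_length = len(ids)

merges = {}
for step in range(3):                # three merges is an arbitrary choice here
    pair = most_common_pair(ids)
    new_id = 256 + step              # new tokens get ids above the byte range
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id

print(original_length, len(ids), merges)
```

Each merge shortens the sequence, and the `merges` table is the learned vocabulary: replaying it in order on new text is how encoding works.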

What to look out for:


Phase 5 — PyTorch Fundamentals

Link (official tutorials): pytorch.org/tutorials
Specific starting point: pytorch.org/tutorials/beginner/basics/intro.html

After Karpathy, you know how PyTorch works conceptually. Now you need to be fluent in its API. CS336 uses raw PyTorch with no wrappers, and the assignments require using advanced PyTorch features.

Estimated time: 3–4 weeks (run in parallel with late Phase 4)

What to cover:

What to look out for:


Phase 6 — Transformers and Tokenizers (Hugging Face NLP Course)

Link: huggingface.co/learn/nlp-course/chapter1/1

This course bridges from “I built a GPT from scratch” to “I understand the state-of-the-art transformer ecosystem.” It is completely free, no account required for reading, and runs in Google Colab.

Estimated time: 4–5 weeks

Chapter-by-chapter guide:

What to look out for:


Phase 7 — Stanford CS336 Lectures

Course website: cs336.stanford.edu
YouTube playlist (Spring 2025): youtube.com/playlist?list=PLoROMvodv4rOY23Y0BoGoBGgQ1zmU_MT_
Lecture code/slides (Spring 2025 GitHub): github.com/stanford-cs336/spring2025-lectures

If you have completed Phases 0–6, you are ready to follow these lectures. They are posted publicly on YouTube and are completely free to watch.

Estimated time: 12+ weeks to follow seriously

Lecture overview:

| Lecture | Topic | What You Need from Prior Phases |
| --- | --- | --- |
| Lecture 1 | Overview + Tokenization | Karpathy Lecture 8, HF Chapter 6 |
| Lecture 2 | PyTorch, Resource Accounting, FLOPs | Phase 5 (PyTorch), Phase 4 (Karpathy) |
| Lecture 3 | Transformer Architectures, Hyperparameters | Karpathy Lecture 7 |
| Lecture 4 | Attention Variants, MoE | Karpathy Lecture 7 + research paper reading |
| Lectures 5–6 | GPU Hardware, Triton Kernels, CUDA | Phase 5 + new material |
| Lectures 7–8 | Distributed Training (DDP, FSDP, pipeline) | Phase 5 + CS336 Lecture 2 |
| Lecture 9 | Scaling Laws (Chinchilla) | Phase 2 + probability intuition |
| Lecture 10 | Inference, KV-Cache, Flash Attention | Karpathy Lecture 7 deep understanding |
| Lecture 11 | Evaluation | Phase 2 intuition |
| Lecture 12 | Data: Sources, Filtering, Deduplication | HF Chapter 5 |
| Lecture 13 | Supervised Fine-Tuning (SFT) | All prior phases |
| Lectures 14–15 | RLHF, RLVR (how ChatGPT is aligned) | Hugging Face RL course bonus material |
| Lectures 16–17 | Multimodality, Mixture of Experts | Lecture 4 + CS336 Lecture 3 |
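
Resource accounting, the Lecture 2 topic, comes down to arithmetic of the following kind. The sketch uses the widely quoted rule of thumb that training a dense transformer costs about 6 FLOPs per parameter per token. That rule is an approximation brought in for illustration rather than something stated in this guide, and the model size, token count, hardware throughput, and utilization are all made-up inputs.

```python
def training_flops(num_params, num_tokens):
    # Rule of thumb: about 2 FLOPs per parameter per token forward, about 4 backward.
    return 6 * num_params * num_tokens

def training_days(total_flops, flops_per_second, utilization):
    # Wall-clock estimate given peak hardware throughput and achieved utilization.
    seconds = total_flops / (flops_per_second * utilization)
    return seconds / 86_400

# Hypothetical inputs, chosen only to make the arithmetic concrete.
params = 1_000_000_000          # a 1B-parameter model
tokens = 20 * params            # roughly 20 tokens per parameter (a Chinchilla-style ratio)
peak = 300e12                   # assumed 300 TFLOP/s peak for one accelerator
mfu = 0.4                       # assumed 40% utilization

flops = training_flops(params, tokens)
days = training_days(flops, peak, mfu)
print(f"{flops:.2e} FLOPs, about {days:.1f} days on one accelerator")
```

Being able to do this estimate before launching a job is the difference between a planned training run and an expensive surprise.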

What to look out for in CS336:


Realistic Timeline

| Phase | Content | Duration |
| --- | --- | --- |
| 0: Math (Linear Algebra, Calculus, Probability) | 3Blue1Brown + Khan Academy + MIT problem sets | 6–8 weeks |
| 1: Python Depth | CS50P + NumPy fundamentals | 3–4 weeks |
| 2: ML Foundations | Google MLCC | 2–3 weeks |
| 3: Practical Deep Learning | fast.ai | 8–10 weeks |
| 4: Karpathy Zero to Hero | All 8 lectures, fully coded | 12–16 weeks |
| 5: PyTorch API | Official PyTorch tutorials | 3–4 weeks (parallel with late Phase 4) |
| 6: Hugging Face NLP Course | Chapters 1–7 | 4–5 weeks |
| 7: CS336 Lectures | Full Spring 2025 YouTube playlist | 12+ weeks |

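Total: approximately 14–18 months of consistent work. The phase estimates add up to roughly 47–58 weeks, with Phase 5 overlapping Phase 4 and Phase 7 open-ended; the remaining months are slack for review, school terms, and weeks when little gets done.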

This is not a pessimistic estimate. It is what it takes to go from MYP5 math to genuinely understanding a graduate-level LLM course. Starting now at 16 means you could realistically reach CS336 before your 18th birthday. That is an extraordinary achievement for someone your age — treat it as a long-term project, not a sprint.


Common Pitfalls to Avoid

  1. Watching without coding. Watching Karpathy without writing the code yourself is almost worthless. You will feel like you understand and then be unable to implement anything. Code every lecture.

  2. Moving forward with unresolved confusion. If you don’t understand the chain rule in Phase 0, Lecture 1 of Karpathy will be mysterious. If Lecture 1 is mysterious, Lecture 7 will be incomprehensible. Every phase builds on the last. Stop and resolve confusion before advancing.

  3. Treating fast.ai as a library course. fast.ai is teaching you concepts. The fastai library itself is not what you are learning — you are learning deep learning. The library is just the vehicle.

  4. Skipping the probability section. This is the most commonly skipped part of Phase 0 and the one that causes the most confusion in CS336. The loss function, the softmax output, and RLHF all rest on probability. Don’t skip it.

  5. Expecting to do CS336’s graded assignments for free. The compute required for the actual assignments costs real money. The lectures, lecture code, and assignment specifications are all free. You can learn everything CS336 teaches without completing the assignments — but you cannot shortcut the compute cost if you want to actually run the training jobs.

  6. Comparing your pace to others online. You are 16 and doing this on top of IB coursework. Someone who posts “I completed Karpathy in 2 weeks” is either experienced already, doing it poorly, or not in school full-time. Set your own pace.