Built for Sarang | Goal: Understand Stanford CS336 — Language Modeling from Scratch
Stanford CS336 — Language Modeling from Scratch (cs336.stanford.edu) is a graduate-level course taught by Percy Liang and Tatsunori Hashimoto. The goal of the course is to build a large language model entirely from scratch — including the tokenizer, the transformer architecture, the training loop, distributed training across GPUs, and alignment techniques like RLHF.
The course officially requires:
You currently have MYP5 math, basic Python, and JavaScript experience. This guide maps out every step between where you are now and where CS336 starts, using only 100% free resources — no credit cards, no free trials, no paywalls.
Code everything. Reading about neural networks without coding them is like reading about swimming without getting in the water. Every phase in this guide requires you to run code, not just watch videos. If you skip this, you will not be ready for CS336.
Use Google Colab (colab.research.google.com) as your coding environment throughout this entire journey. It gives you a free GPU in the browser, requires only a Google account (which you already have), and runs Jupyter notebooks — the same format used by CS336, Karpathy, fast.ai, and Hugging Face.
Use GitHub (github.com) to save your work. You already use VS Code, so install the GitHub extension and commit every notebook you write. This builds a portfolio that proves your work is real.
Do not rush phases. The most common mistake is moving forward before the previous phase is solid. If something in Phase 4 doesn’t make sense, the problem is almost always Phase 0 — a math gap you skipped.
This is the most important phase. Everything in deep learning — every weight update, every attention score, every loss function — is applied linear algebra and calculus. Do not skip this phase or rush through it. You will regret it later.
Estimated time: 6–8 weeks
Link: youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab
This is the Essence of Linear Algebra playlist by Grant Sanderson (3Blue1Brown). It is 16 videos, each roughly 3 to 20 minutes long. The visual approach is its main strength: you will come away with intuition that people who learned through pure symbol manipulation rarely build.
What to watch and what to pay attention to:
| Video | Key Concept | What to Look Out For |
|---|---|---|
| Ch. 1 — Vectors | Vectors as arrows in space | Understand both the geometric view (arrow) and the numerical view (list of numbers) — AI uses both simultaneously |
| Ch. 2 — Linear combinations, span, basis | What a basis is | The idea of a “basis” is everywhere in NLP — word embeddings are vectors in a basis |
| Ch. 3 — Linear transformations and matrices | Matrices as transformations | This is the single most important video — a matrix multiplication is not just arithmetic, it is a transformation of space |
| Ch. 4 — Matrix multiplication as composition | Chaining transformations | Every layer in a neural network is a matrix multiplication; stacking layers = composing transformations |
| Ch. 6 — The determinant | Scaling of space | Used in understanding probability distributions |
| Ch. 9 — Dot products | Inner products | Attention in a transformer is literally dot products — understand this deeply |
| Ch. 14 — Eigenvectors and eigenvalues | Directions that don’t change under transformation | Used in PCA, optimization, understanding training dynamics |
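To make the dot-product row above concrete, here is a minimal sketch of how one query vector can be scored against several key vectors and the scores turned into weights. The vectors and the function names (`dot`, `softmax`, `attention_weights`) are made up for illustration; a real transformer does the same arithmetic with learned, much larger matrices.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product score of one query against every key."""
    scale = math.sqrt(len(query))
    scores = [dot(query, k) / scale for k in keys]
    return softmax(scores)

# Hypothetical 2-D vectors chosen only for illustration.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = attention_weights(query, keys)
print(weights)
```

The key pointing in the same direction as the query gets the largest weight, the perpendicular key gets less, and the opposite key gets the least. That is the geometric meaning of the dot product at work.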
After watching: Go to MIT OpenCourseWare 18.06 (ocw.mit.edu/courses/18-06sc-linear-algebra-fall-2011) and do the first 5 problem sets. These are actual MIT assignments, free to access. They will show you whether you actually understood or just felt like you understood.
Red flag: If you can’t answer “what does matrix multiplication mean geometrically?” without looking it up, you are not ready for the next video yet.
Link: youtube.com/playlist?list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr
This is the Essence of Calculus playlist. It is 12 videos. You likely know basic derivatives from MYP5, but this series builds the intuition underneath the symbol-pushing — which is what you need for understanding backpropagation.
What you must deeply understand from this series:
What you need beyond this series that it doesn’t fully cover:
Red flag: If you cannot explain why the chain rule works in your own words, stop and rewatch Chapter 4 before moving on.
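One way to test your own explanation of the chain rule is to check it numerically. The sketch below compares the chain-rule formula against a finite-difference slope for one example; the choice of functions (`sin` and squaring) and the step size `h` are arbitrary assumptions.

```python
import math

def f(u):
    """Outer function."""
    return math.sin(u)

def g(x):
    """Inner function."""
    return x ** 2

def numerical_derivative(fn, x, h=1e-6):
    """Central finite difference: slope of fn over a tiny interval around x."""
    return (fn(x + h) - fn(x - h)) / (2 * h)

x = 1.5
# Chain rule: d/dx f(g(x)) = f'(g(x)) * g'(x) = cos(x**2) * 2x
analytic = math.cos(g(x)) * 2 * x
numeric = numerical_derivative(lambda t: f(g(t)), x)
print(analytic, numeric)
```

If the two numbers agree to several decimal places, the formula is doing what the video claims: the rate of change of the outer function multiplied by the rate of change of the inner one.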
Link: khanacademy.org/math/statistics-probability
Link (probability section specifically): khanacademy.org/math/statistics-probability/probability-library
This is the most “boring” phase on the surface but is critically important. Language models are probabilistic systems — a transformer outputs a probability distribution over vocabulary tokens. If you don’t understand probability, you cannot understand what a model is actually doing.
Topics you must cover:
Probability basics: Events, probability of an event, conditional probability (P(A | B), the probability of A given B). This is used constantly in language models, where you compute the probability of the next word given all previous words.
What to skip for now: The inferential statistics section (hypothesis testing, confidence intervals) is less relevant at this stage. Come back to it later if needed.
Red flag: You must be able to answer, without looking it up, “what is a conditional probability and why is P(next word | all previous words) the right way to think about language modeling?”
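A small sketch can make the conditional-probability question less abstract. It estimates P(next word | previous word) from counts in a tiny made-up corpus, then multiplies conditionals together to score a short sequence. Conditioning on only one previous word is a simplifying assumption; a real language model conditions on all previous tokens.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, used only for illustration.
corpus = "the cat sat on the mat the cat ate".split()

# Count how often each word follows each previous word.
pair_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    pair_counts[prev][nxt] += 1

def cond_prob(nxt, prev):
    """Estimate P(nxt | prev) = count(prev, nxt) / count(prev, anything)."""
    total = sum(pair_counts[prev].values())
    return pair_counts[prev][nxt] / total if total else 0.0

def sequence_prob(words):
    """Chain of conditionals: P(w2 | w1) * P(w3 | w2) * ... (start word taken as given)."""
    p = 1.0
    for prev, nxt in zip(words, words[1:]):
        p *= cond_prob(nxt, prev)
    return p

p_cat_given_the = cond_prob("cat", "the")
p_sequence = sequence_prob(["the", "cat", "sat"])
print(p_cat_given_the, p_sequence)
```

Here "the" is followed by "cat" twice and "mat" once, so P(cat | the) comes out as 2/3. Scoring a whole sequence is just repeated multiplication of such conditionals.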
Link: cs50.harvard.edu/python
You already know Python from CS50x. CS50P goes deeper. It is 100% free with no account required to watch lectures, though you will want to create a free edX account to submit problem sets and get feedback.
Why you still need this: CS336 is explicit that its code volume is “an order of magnitude greater than a typical ML course.” The assignments involve writing optimized, production-style Python. CS50P will significantly strengthen your ability to write clean, modular, well-tested Python code.
Estimated time: 3–4 weeks
Topics in CS50P and what to pay extra attention to:
| Week | Topic | Why It Matters for AI |
|---|---|---|
| Week 2 — Loops | while and for loops, iterating over lists and dicts | Every training loop is a for loop over batches; generators and lazy evaluation (covered in Week 9) build on this and matter for handling large datasets |
| Week 4 — Libraries | Importing modules and installing external packages | NumPy and PyTorch are external packages; CS50P does not teach NumPy itself, so plan separate NumPy practice |
| Week 6 — File I/O | Reading and writing large files | Training data is massive; you need to handle file I/O efficiently |
| Week 8 — Object-Oriented Programming | Classes, inheritance, dunder methods | PyTorch’s nn.Module is a class; you will subclass it for every neural network you build |
| Week 9 — Et Cetera | *args, **kwargs, type hints, decorators | Used extensively in PyTorch and modern AI codebases |
After CS50P: Spend one week getting comfortable with NumPy specifically. Go to numpy.org/learn and work through the “NumPy fundamentals” section. Focus on array creation, broadcasting, and vectorized operations. The reason: without PyTorch, numerical ML code in Python is essentially just NumPy. And even with PyTorch, understanding NumPy helps you understand what PyTorch tensors are actually doing.
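As a starting point for that NumPy week, the sketch below shows array creation, broadcasting, and a vectorized reduction in a few lines. The array values are arbitrary; the point is the shapes, which are noted in the comments.

```python
import numpy as np

matrix = np.arange(12).reshape(3, 4)   # shape (3, 4)
row = np.array([1, 10, 100, 1000])     # shape (4,)
col = np.array([1, 2, 3])              # shape (3,)

# Shapes are compared right to left: (4,) lines up with the last axis of
# (3, 4), so `row` is reused for each of the 3 rows.
scaled_cols = matrix * row             # shape (3, 4)

# (3,) does NOT line up with the last axis (3 vs 4), so this raises.
try:
    matrix * col
    broadcast_failed = False
except ValueError:
    broadcast_failed = True

# Adding an axis makes the intent explicit: (3, 1) stretches across 4 columns.
scaled_rows = matrix * col[:, None]    # shape (3, 4)

# Vectorized operations replace Python loops over elements.
row_sums = matrix.sum(axis=1)          # shape (3,)
print(scaled_cols.shape, broadcast_failed, row_sums)
```

A useful habit is to write the expected shape next to every line, as in the comments here, and check it with `.shape` whenever a result looks wrong.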
What to look out for:
A (4,) array times a (3, 4) matrix doesn’t fail; it broadcasts across the rows, whereas a (3,) array raises a shape error. You need to understand why, because the same thing happens with tensors in PyTorch, and silent broadcasting bugs are a common source of incorrect model behavior.
Link: developers.google.com/machine-learning/crash-course
This is Google’s free, fully reimagined (2024 version) Machine Learning Crash Course. It is approximately 15 hours, completely free, and requires no login. It covers fundamental ML concepts with interactive visualizations and code exercises using Python.
Estimated time: 2–3 weeks
What this course covers that you need:
What to look out for:
Checkpoint: After this course, you should be able to answer: “What is a loss function, why do we compute gradients with respect to it, and what does gradient descent do with those gradients?” If you can explain this in plain language, you’re ready for Phase 3.
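The checkpoint question can also be answered in code. This is a minimal sketch of gradient descent on a one-parameter model, with made-up data and a hypothetical learning rate: the loss measures how wrong the model is, the gradient says which direction makes it worse, and each step moves the parameter the opposite way.

```python
# Made-up data that follows y = 2x exactly, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def loss(w):
    """Mean squared error of the one-parameter model y_hat = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad(w):
    """Derivative of the loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0                 # initial guess
learning_rate = 0.05    # hypothetical step size
history = [loss(w)]
for step in range(100):
    w -= learning_rate * grad(w)   # step against the gradient
    history.append(loss(w))

print(w, history[0], history[-1])
```

Try changing `learning_rate` to a much larger value and watch the loss history: seeing the steps overshoot and diverge is a quick way to build intuition for why step size matters.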
Link: course.fast.ai
This is fast.ai’s “Practical Deep Learning for Coders” — approximately 60 hours of material, completely free, no login required to access lectures. All exercises run in Kaggle notebooks (free, no GPU cost). This is a legendary course in the AI community and is the best top-down introduction to deep learning available anywhere.
Estimated time: 8–10 weeks
What makes this course unusual (and why it works): Most courses teach you theory first, then application. fast.ai does the opposite — you build a working image classifier in Lesson 1, and the theory is revealed gradually. This approach means you have working code from the start, which makes the theory much easier to understand when it appears.
Lesson-by-lesson guide:
learn.fit_one_cycle() is doing. It is running gradient descent on a neural network. At this point you don’t need to know the implementation details; just get comfortable with the workflow.
What to look out for:
fastai is a high-level wrapper. In CS336, you will use raw PyTorch with no wrappers whatsoever. Use fast.ai to understand what these operations do, but get comfortable with the idea that later you will do the same things manually.
Link: karpathy.ai/zero-to-hero.html
YouTube playlist: youtube.com/playlist?list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ
GitHub (code): github.com/karpathy/nn-zero-to-hero
This is the single most important resource in this entire guide for your goal. Andrej Karpathy (co-founder of OpenAI, former Director of AI at Tesla) built a series of videos that starts with calculus and ends with a working GPT. Every video involves building something from scratch in Python. This is as close as you can get to CS336’s philosophy without being in the course itself.
Estimated time: 12–16 weeks (do not rush this)
Lecture-by-lecture breakdown:
Video: youtube.com/watch?v=VMj-3S1tku0
Duration: ~2.5 hours
Karpathy builds micrograd — a tiny autograd engine in ~150 lines of Python. Autograd is the system that PyTorch (and all deep learning frameworks) use to automatically compute gradients. You will implement:
Value class that wraps a number and tracks its gradient
What to look out for:
The backward() function in each operation node is just the chain rule applied to that specific operation. Addition: d_loss/d_a = d_loss/d_output * 1. Multiplication: d_loss/d_a = d_loss/d_output * b. If you don’t understand why, go back to the calculus part of Phase 0 and rewatch the chain rule video.
Checkpoint: After this lecture, you should be able to implement a two-layer neural network that trains on a simple dataset using only your micrograd engine and plain Python. No PyTorch.
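The addition and multiplication rules above can be sketched in a few dozen lines. This is a simplified illustration in the spirit of micrograd, not Karpathy’s actual code: it supports only `+` and `*` between `Value` objects, and the attribute names `_parents` and `_backward` are assumptions made for this sketch.

```python
class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # leaf nodes have nothing to propagate

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # d(a + b)/da = 1 and d(a + b)/db = 1
            self.grad += out.grad * 1.0
            other.grad += out.grad * 1.0

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # d(a * b)/da = b and d(a * b)/db = a
            self.grad += out.grad * other.data
            other.grad += out.grad * self.data

        out._backward = _backward
        return out

    def backward(self):
        # Visit nodes in reverse topological order so each node's grad is
        # complete before it is passed on to its parents.
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            node._backward()


a, b, c = Value(2.0), Value(-3.0), Value(10.0)
loss = a * b + c      # -6 + 10 = 4
loss.backward()
print(a.grad, b.grad, c.grad)
```

Note the `+=` when writing gradients: a value that is used in more than one place receives a contribution from each use, and those contributions must add up.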
Video: youtube.com/watch?v=PaCmpygFfXo
Duration: ~1.5 hours
Karpathy introduces language modeling using character-level bigrams. A bigram model predicts the next character based only on the current character. This is the simplest possible language model — but the concepts introduced (token probabilities, loss functions for language models, the idea of “next token prediction”) are identical to what GPT-4 does at a much larger scale.
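A count-based version of this idea fits in a short script. The sketch below uses a three-word made-up dataset and add-one smoothing (an assumption made here so that unseen character pairs never get probability zero), and scores words by average negative log-likelihood, the same kind of loss the lecture builds toward.

```python
import math
from collections import Counter

# Tiny made-up word list; "." marks the start and end of each word.
words = ["emma", "ava", "mia"]

counts = Counter()
for w in words:
    chars = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chars, chars[1:]):
        counts[(ch1, ch2)] += 1

vocab = sorted({ch for pair in counts for ch in pair})

def prob(ch2, ch1):
    """P(ch2 | ch1) with add-one smoothing so unseen pairs are never zero."""
    row_total = sum(counts[(ch1, c)] for c in vocab)
    return (counts[(ch1, ch2)] + 1) / (row_total + len(vocab))

def avg_nll(word):
    """Average negative log-likelihood of a word under the bigram model."""
    chars = ["."] + list(word) + ["."]
    pairs = list(zip(chars, chars[1:]))
    return -sum(math.log(prob(c2, c1)) for c1, c2 in pairs) / len(pairs)

seen_loss = avg_nll("emma")     # a word from the training list
unseen_loss = avg_nll("eee")    # same characters, pairs never observed
print(seen_loss, unseen_loss)
```

Lower loss means the model found the word less surprising. The training word scores better than the unseen one, which is all "learning" means for a model this simple.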
What to look out for:
Video: youtube.com/watch?v=TCH_1BHY58I
Duration: ~1.5 hours
Karpathy implements the Bengio 2003 MLP language model — the paper that launched neural language modeling. You build a multi-layer perceptron that uses learned character embeddings.
What to look out for:
Video: youtube.com/watch?v=P6sfmUTpUmc
Duration: ~2 hours
This lecture goes deep into the practical difficulties of training neural networks — dead neurons, vanishing/exploding gradients, and BatchNorm.
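The vanishing-gradient problem for tanh can be seen directly from its derivative. The sketch below evaluates the local gradient at a few hypothetical pre-activation values and computes a simple "fraction saturated" statistic; the 0.99 threshold is an assumed cutoff for illustration, not a standard constant.

```python
import math

def tanh_grad(x):
    """Local derivative of tanh: d/dx tanh(x) = 1 - tanh(x)**2."""
    return 1.0 - math.tanh(x) ** 2

# Hypothetical pre-activation values, from well-behaved to saturated.
for x in [0.0, 1.0, 3.0, 10.0]:
    print(f"x={x:5.1f}  tanh={math.tanh(x):.6f}  local grad={tanh_grad(x):.2e}")

def saturated_fraction(pre_activations, threshold=0.99):
    """Share of units whose |tanh| exceeds the threshold (an assumed cutoff)."""
    outputs = [math.tanh(x) for x in pre_activations]
    return sum(abs(o) > threshold for o in outputs) / len(outputs)

frac = saturated_fraction([-8.0, -0.5, 0.1, 0.7, 6.0, 12.0])
print(frac)
```

Because backpropagation multiplies by this local gradient, a unit sitting at the flat ends of tanh passes almost nothing back, whatever the rest of the network is doing.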
What to look out for:
If the input to a tanh neuron is too large in magnitude, the gradient is near zero and the neuron stops learning. This is called a “dead neuron.” Karpathy shows how to diagnose this by looking at activation histograms.
Video: youtube.com/watch?v=q8SA3rM6ckI
Duration: ~2 hours
Karpathy manually implements the backward pass through BatchNorm without using PyTorch’s autograd. This is the hardest lecture in the series.
What to look out for:
Video: youtube.com/watch?v=t3YJ5hKiMQ0
Duration: ~1 hour
Karpathy rebuilds the architecture using a tree-like CNN structure (WaveNet). He also dives into PyTorch internals — nn.Module, parameters, etc.
What to look out for:
nn.Module. Understand how __init__, forward, and parameters() work; these are the three methods you will use in every model you build in CS336.
Video: youtube.com/watch?v=kCc8FmEb1nY
Duration: ~2 hours
This is the most important lecture in the series. Karpathy implements a character-level GPT (transformer) from scratch. Every component of the modern transformer is built and explained:
What to look out for:
Video: youtube.com/watch?v=zduSFxRajkE
Duration: ~2 hours
Karpathy implements Byte Pair Encoding (BPE), the tokenization algorithm used by GPT-2 and GPT-4, with variants used by most modern LLMs. CS336’s first lecture is specifically on tokenization.
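The core BPE training loop is short enough to sketch before watching. This is a minimal byte-level version under simplifying assumptions: no regex pre-splitting, no special tokens, and new token ids simply start at 256. The function names are chosen for this sketch and are not taken from any library.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the adjacent pair of token ids that occurs most often."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Learn merge rules starting from raw UTF-8 bytes (ids 0 to 255)."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        if len(ids) < 2:
            break
        pair = most_frequent_pair(ids)
        new_id = 256 + step
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges

def decode(ids, merges):
    """Rebuild the text by expanding merged ids back into bytes."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():  # insertion order = training order
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")

ids, merges = train_bpe("aaabdaaabac", num_merges=3)
print(ids, decode(ids, merges))
```

Eleven bytes shrink to five tokens after three merges, and decoding recovers the original string exactly. Compression without losing information is the whole point of the algorithm.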
What to look out for:
Link (official tutorials): pytorch.org/tutorials
Specific starting point: pytorch.org/tutorials/beginner/basics/intro.html
After Karpathy, you know how PyTorch works conceptually. Now you need to be fluent in its API. CS336 uses raw PyTorch with no wrappers, and the assignments require using advanced PyTorch features.
Estimated time: 3–4 weeks (run parallel with late Phase 4)
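To give a sense of what “raw PyTorch with no wrappers” looks like, here is a minimal sketch of a complete training loop. The model (`TinyMLP`), the random data, and the hyperparameters are all made up for illustration; only the PyTorch calls themselves are real API.

```python
import torch
from torch import nn

torch.manual_seed(0)

class TinyMLP(nn.Module):
    """Hypothetical two-layer network, used only to show the nn.Module pattern."""

    def __init__(self, in_dim=2, hidden=8, out_dim=1):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)

    def forward(self, x):
        return self.fc2(torch.tanh(self.fc1(x)))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyMLP().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Made-up regression data: the target is the sum of the two inputs.
x = torch.randn(64, 2, device=device)
y = x.sum(dim=1, keepdim=True)          # shape (64, 1), matches model output

losses = []
for step in range(200):
    optimizer.zero_grad()               # clear gradients from the previous step
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()                     # populate .grad on every parameter
    optimizer.step()                    # update the parameters
    losses.append(loss.item())

with torch.no_grad():                   # no graph is built during evaluation
    prediction = model(x[:1])

print(losses[0], losses[-1], tuple(prediction.shape))
```

Every line in the loop is something a wrapper library would normally hide. Being able to write these four steps from memory (zero the gradients, compute the loss, call backward, step the optimizer) is a reasonable fluency target for this phase.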
What to cover:
view, reshape, permute, squeeze, unsqueeze. Shape bugs are the most common bugs in ML code. You need to be able to reason about tensor shapes before writing a single line of a model.
requires_grad, .backward(), .grad, torch.no_grad(). Understand when gradients are computed and when they aren’t (inference mode vs. training mode).
nn.Module: Writing custom modules, the forward() method, parameters(), state_dict().
torch.optim.Adam, torch.optim.SGD. Know the difference between them. Adam is used in virtually all LLM training.
.to(device). CS336 requires GPU awareness.
What to look out for:
In-place operations (tensor += 1) on tensors that require gradients. Know when to avoid them.
If you call .backward() twice without zeroing gradients with optimizer.zero_grad(), gradients accumulate. This is a common bug and also an intentional technique for large-batch training.
Link: huggingface.co/learn/nlp-course/chapter1/1
This course bridges from “I built a GPT from scratch” to “I understand the state-of-the-art transformer ecosystem.” It is completely free, no account required for reading, and runs in Google Colab.
Estimated time: 4–5 weeks
Chapter-by-chapter guide:
What to look out for:
The difference between using AutoTokenizer and writing your own: Hugging Face makes it easy to use a pre-trained tokenizer in two lines. CS336 makes you write one yourself. Use this course to understand what the pre-built one is doing under the hood.
Course website: cs336.stanford.edu
YouTube playlist (Spring 2025): youtube.com/playlist?list=PLoROMvodv4rOY23Y0BoGoBGgQ1zmU_MT_
Lecture code/slides (Spring 2025 GitHub): github.com/stanford-cs336/spring2025-lectures
If you have completed Phases 0–6, you are ready to follow these lectures. They are posted publicly on YouTube and are completely free to watch.
Estimated time: 12+ weeks to follow seriously
Lecture overview:
| Lecture | Topic | What You Need from Prior Phases |
|---|---|---|
| Lecture 1 | Overview + Tokenization | Karpathy Lecture 8, HF Chapter 6 |
| Lecture 2 | PyTorch, Resource Accounting, FLOPs | Phase 5 (PyTorch), Phase 4 (Karpathy) |
| Lecture 3 | Transformer Architectures, Hyperparameters | Karpathy Lecture 7 |
| Lecture 4 | Attention Variants, MoE | Karpathy Lecture 7 + research paper reading |
| Lectures 5–6 | GPU Hardware, Triton Kernels, CUDA | Phase 5 + new material |
| Lectures 7–8 | Distributed Training (DDP, FSDP, pipeline) | Phase 5 + CS336 Lecture 2 |
| Lecture 9 | Scaling Laws (Chinchilla) | Phase 2 + probability intuition |
| Lecture 10 | Inference, KV-Cache, Flash Attention | Karpathy Lecture 7 deep understanding |
| Lecture 11 | Evaluation | Phase 2 intuition |
| Lecture 12 | Data — Sources, Filtering, Deduplication | HF Chapter 5 |
| Lecture 13 | Supervised Fine-Tuning (SFT) | All prior phases |
| Lectures 14–15 | RLHF, RLVR (how ChatGPT is aligned) | Hugging Face RL course bonus material |
| Lectures 16–17 | Multimodality, Mixture of Experts | Lecture 4 + CS336 lecture 3 |
What to look out for in CS336:
| Phase | Content | Duration |
|---|---|---|
| 0 — Math (Linear Algebra, Calculus, Probability) | 3Blue1Brown + Khan Academy + MIT problem sets | 6–8 weeks |
| 1 — Python Depth | CS50P + NumPy fundamentals | 3–4 weeks |
| 2 — ML Foundations | Google MLCC | 2–3 weeks |
| 3 — Practical Deep Learning | fast.ai | 8–10 weeks |
| 4 — Karpathy Zero to Hero | All 8 lectures, fully coded | 12–16 weeks |
| 5 — PyTorch API | Official PyTorch tutorials | 3–4 weeks (parallel with late Phase 4) |
| 6 — Hugging Face NLP Course | Chapters 1–7 | 4–5 weeks |
| 7 — CS336 Lectures | Full Spring 2025 YouTube playlist | 12+ weeks |
Total: approximately 14–18 months of consistent work.
This is not a pessimistic estimate. It is what it takes to go from MYP5 math to genuinely understanding a graduate-level LLM course. Starting now at 16 means you could realistically reach CS336 before your 18th birthday. That is an extraordinary achievement for someone your age — treat it as a long-term project, not a sprint.
Watching without coding. Watching Karpathy without writing the code yourself is almost worthless. You will feel like you understand and then be unable to implement anything. Code every lecture.
Moving forward with unresolved confusion. If you don’t understand the chain rule in Phase 0, Lecture 1 of Karpathy will be mysterious. If Lecture 1 is mysterious, Lecture 7 will be incomprehensible. Every phase builds on the last. Stop and resolve confusion before advancing.
Treating fast.ai as a library course. fast.ai is teaching you concepts. The fastai library itself is not what you are learning — you are learning deep learning. The library is just the vehicle.
Skipping the probability phase. This is the most commonly skipped phase and the one that causes the most confusion in CS336. The loss function, the softmax output, RLHF — everything is probability. Don’t skip it.
Expecting to do CS336’s graded assignments for free. The compute required for the actual assignments costs real money. The lectures, lecture code, and assignment specifications are all free. You can learn everything CS336 teaches without completing the assignments — but you cannot shortcut the compute cost if you want to actually run the training jobs.
Comparing your pace to others online. You are 16 and doing this on top of IB coursework. Someone who posts “I completed Karpathy in 2 weeks” is either experienced already, doing it poorly, or not in school full-time. Set your own pace.