Derivatives — Introduction
1. What is this and why do we care?
Every time a neural network trains, it asks one question thousands of times per second:
“If I change this weight by a tiny amount, does my error go up or down?”
The derivative is the mathematical tool that answers this question. It tells you the rate of change — how fast something is changing at any given moment.
Without derivatives, backpropagation cannot work. Without backpropagation, neural networks cannot learn. Without learning, there is no deep learning, no transformers, no ChatGPT.
The derivative is the engine that drives learning in modern AI.
2. Prerequisites
You need to know what a function is — a rule that takes one number in and gives one number out. For example: f(x) = x² takes the number 3 and gives back 9. If you are comfortable with that, you are ready.
3. The intuition — before any symbols
Imagine you are on a bus travelling from Delhi to Agra — a journey of about 200 km.
The bus does not travel at a constant speed. Sometimes it is stuck in traffic and moving at 10 km/h. Sometimes it is on an open highway doing 90 km/h.
At any given moment, your speedometer shows one number: your instantaneous speed right now. Not your average speed for the whole journey — your speed at this exact moment.
That number on the speedometer is the derivative of your position with respect to time.
More precisely:
- Your position is a function of time: where are you at time t?
- Your speed is how fast your position is changing: how much does your position change if time increases by a tiny amount?
- Speed = derivative of position with respect to time.
The derivative is always about: “if this input increases by a tiny amount, how much does the output change?”
In neural networks:
- The input is a weight
- The output is the error (how wrong the model is)
- The derivative tells us: if we nudge this weight, does the error go up or down, and by how much?
4. A tiny worked example with real numbers
Consider a simple function: f(x) = x²
| x | f(x) = x² |
|---|---|
| 1 | 1 |
| 2 | 4 |
| 3 | 9 |
| 4 | 16 |
How fast is f(x) growing at x = 3?
Rough approach — look at the change from x = 3 to x = 4:
Change in output = f(4) - f(3) = 16 - 9 = 7
Change in input = 4 - 3 = 1
Rate of change = 7 / 1 = 7
But that’s the average rate over the whole step from 3 to 4. Let’s use a smaller step — from x = 3 to x = 3.1:
f(3.1) = 3.1² = 9.61
Change in output = 9.61 - 9 = 0.61
Change in input = 3.1 - 3 = 0.1
Rate of change = 0.61 / 0.1 = 6.1
Even smaller step — from x = 3 to x = 3.01:
f(3.01) = 3.01² = 9.0601
Change in output = 9.0601 - 9 = 0.0601
Change in input = 3.01 - 3 = 0.01
Rate of change = 0.0601 / 0.01 = 6.01
Notice: as the step gets smaller, the rate of change approaches 6.
The derivative of f(x) = x² at x = 3 is 6. This is written:
f'(3) = 6, or equivalently df/dx = 6 at x = 3
This means: at x = 3, if x increases by a tiny amount, f(x) increases by about 6 times that amount.
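The shrinking-step calculation above can be repeated in a few lines of Python. This is a minimal sketch, not part of the original tutorial: the helper name difference_quotient and the list of step sizes are chosen here for illustration.

```python
def f(x):
    return x ** 2


def difference_quotient(func, x, h):
    """Average rate of change of func between x and x + h."""
    return (func(x + h) - func(x)) / h


# Shrink the step and watch the rate of change at x = 3 settle towards 6.
for h in [1, 0.1, 0.01, 0.001, 0.0001]:
    rate = difference_quotient(f, 3, h)
    print(f"step = {h:<7} rate of change = {rate:.4f}")
```

The first three rows reproduce the hand calculations (7, 6.1, 6.01), and the smaller steps move even closer to 6.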
5. The general rule
For the function f(x) = xⁿ (x raised to any power n), the derivative is:
f'(x) = n × x^(n-1)
This is called the power rule. Let us verify with our example:
For f(x) = x² (n = 2):
f'(x) = 2 × x^(2-1) = 2 × x¹ = 2x
f'(3) = 2 × 3 = 6 ✓
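As a quick sanity check, the power rule can be compared against a small-step estimate in Python. This is only a sketch: the function names power_rule and numeric_slope and the step size 1e-6 are assumptions made for this example.

```python
def power_rule(n, x):
    """Derivative of x**n at x, using the power rule n * x**(n - 1)."""
    return n * x ** (n - 1)


def numeric_slope(n, x, h=1e-6):
    """Small-step estimate of the same derivative (h is an assumed step size)."""
    return ((x + h) ** n - x ** n) / h


# Compare the rule with the estimate for a few (n, x) pairs.
for n, x in [(2, 3), (3, 2), (4, 1.5)]:
    print(f"n = {n}, x = {x}: rule = {power_rule(n, x)}, "
          f"estimate = {numeric_slope(n, x):.4f}")
```

For n = 2 and x = 3 the rule gives 6, matching the worked example, and the estimate agrees to several decimal places.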
More rules you will need:
| Function | Derivative | Plain meaning |
|---|---|---|
| f(x) = c (constant) | f’(x) = 0 | A flat line has zero slope |
| f(x) = cx | f’(x) = c | A straight line has constant slope |
| f(x) = x² | f’(x) = 2x | Grows faster as x grows |
| f(x) = x³ | f’(x) = 3x² | Grows even faster |
| f(x) = eˣ | f’(x) = eˣ | Special: the derivative of eˣ is itself |
| f(x) = ln(x) | f’(x) = 1/x | The log function |
The last two, eˣ and ln(x), appear constantly in AI: cross-entropy loss is built from logarithms, and the sigmoid and softmax functions are built from eˣ.
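Each row of the table can be checked numerically. The sketch below compares every rule with a small-step estimate at x = 2. The constant c = 5, the test point, and the step size are arbitrary choices made for illustration, not values from the tutorial.

```python
import math


def numeric_slope(func, x, h=1e-6):
    """Small-step estimate of the derivative of func at x."""
    return (func(x + h) - func(x)) / h


x = 2.0
c = 5.0  # arbitrary constant, chosen only for this example

# (label, function, derivative predicted by the table at x)
checks = [
    ("c", lambda t: c, 0.0),
    ("c*x", lambda t: c * t, c),
    ("x^2", lambda t: t ** 2, 2 * x),
    ("x^3", lambda t: t ** 3, 3 * x ** 2),
    ("e^x", math.exp, math.exp(x)),
    ("ln(x)", math.log, 1 / x),
]

for name, func, exact in checks:
    estimate = numeric_slope(func, x)
    print(f"{name:<6} rule: {exact:.5f}   estimate: {estimate:.5f}")
```

The two columns should agree to several decimal places for every row, including eˣ (whose slope equals its own value) and ln(x).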
6. A slightly bigger example — the loss function
In neural network training, we measure error using a loss function. A simple one is the squared error:
L(w) = (y - wx)²
Where:
- w = the weight (what we are trying to learn)
- x = the input
- y = the correct output
- L = the loss (how wrong we are)
Let us say x = 2, y = 4 (the correct answer), and our current weight is w = 1.
L(1) = (4 - 1×2)² = (4-2)² = 4
Our error is 4. How does the error change if we increase w slightly?
The derivative of L with respect to w tells us. Working it out (using the chain rule, which is covered in the next tutorial):
dL/dw = -2x(y - wx) = -2 × 2 × (4 - 1×2) = -2 × 2 × 2 = -8
The derivative is -8. Negative means: if w increases, L decreases. We are moving in the right direction by increasing w. This is how gradient descent knows which way to move each weight.
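The same numbers can be reproduced in Python, followed by one gradient descent step. This is a minimal sketch: the learning rate of 0.1 is an assumed value that does not appear in the example above, and the names loss and dloss_dw are chosen here.

```python
x, y = 2.0, 4.0  # input and correct output from the example


def loss(w):
    """Squared error L(w) = (y - w*x)^2."""
    return (y - w * x) ** 2


def dloss_dw(w):
    """Derivative of the loss with respect to w: -2x(y - wx)."""
    return -2 * x * (y - w * x)


w = 1.0
learning_rate = 0.1  # assumed step size, not from the text

print("loss at w = 1:      ", loss(w))
print("derivative at w = 1:", dloss_dw(w))

# One gradient descent step: move w against the sign of the derivative.
w_new = w - learning_rate * dloss_dw(w)
print("new w:", w_new, " new loss:", loss(w_new))
```

Because the derivative is negative, subtracting it makes w larger, and the loss at the new weight is smaller than the original 4.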
7. Where does this appear in AI?
Paper 03 — Backpropagation: Backpropagation computes the derivative of the loss function with respect to every weight in the network. This is done using the chain rule, applied layer by layer from output back to input. Without derivatives, there is no backprop.
Paper 04 — LSTM: The vanishing gradient problem is a statement about derivatives: in a deep network, derivatives can shrink exponentially as they travel backward through layers. LSTM mitigates this by designing a path through which gradients can flow with far less shrinking.
All training papers: Nearly every modern neural network is trained with gradient descent or a variant of it, which means repeatedly computing derivatives and moving weights in the direction that reduces loss.
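The shrinking described under Paper 04 can be illustrated with a toy calculation. The chain rule multiplies one local derivative per layer, so if each factor is below 1 the product collapses quickly. The sketch below assumes every layer contributes a factor of 0.25 (the largest slope a sigmoid activation can have); real networks have varying factors, so treat this as an illustration rather than a model of any particular architecture.

```python
local_derivative = 0.25  # assumed per-layer factor (maximum slope of a sigmoid)

gradient = 1.0
for layer in range(1, 21):
    # Chain rule: multiply in one more local derivative per layer.
    gradient *= local_derivative
    if layer in (1, 5, 10, 20):
        print(f"after {layer:>2} layers: gradient factor = {gradient:.3e}")
```

After 20 layers the factor is smaller than one in a hundred billion, which is why early layers can receive almost no learning signal.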
8. Common mistakes
- Confusing the derivative with the value. f(3) = 9 is the value of the function at x=3. f’(3) = 6 is the derivative (the slope) at x=3. These are completely different numbers.
- Thinking the derivative is always positive. The derivative can be negative (the function is decreasing), zero (the function has a flat point: a minimum, maximum, or saddle point), or positive (increasing). In gradient descent, the sign tells us which way to move: a negative derivative means increasing the weight lowers the loss, and a positive derivative means decreasing the weight lowers it.
- Applying the power rule to the wrong functions. f(x) = 2ˣ is not the same as f(x) = x². The power rule n × x^(n-1) only applies when x is the base and the power is a fixed number. For 2ˣ, the derivative is different and involves ln(2).
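The third mistake is easy to see numerically. The sketch below estimates the slope of 2ˣ at x = 3 and compares it with the correct rule, ln(2) × 2ˣ, and with the result of wrongly applying the power rule. The test point and step size are arbitrary choices for illustration.

```python
import math


def numeric_slope(func, x, h=1e-6):
    """Small-step estimate of the derivative of func at x."""
    return (func(x + h) - func(x)) / h


x = 3.0
estimate = numeric_slope(lambda t: 2 ** t, x)

exponential_rule = math.log(2) * 2 ** x    # correct: d/dx 2^x = ln(2) * 2^x
misapplied_power_rule = x * 2 ** (x - 1)   # wrong: treats 2^x as if it were x^n

print(f"numerical estimate:    {estimate:.4f}")
print(f"ln(2) * 2^x:           {exponential_rule:.4f}")
print(f"misapplied power rule: {misapplied_power_rule:.4f}")
```

The estimate matches ln(2) × 2ˣ and is nowhere near the misapplied power rule value.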
9. Try it yourself
Exercise 1: Find the derivative of f(x) = x³ using the power rule. What is f’(2)?
Answer:
Power rule: f’(x) = 3 × x^(3-1) = 3x²
f’(2) = 3 × 2² = 3 × 4 = 12
This means: at x=2, if x increases by 0.01, f(x) increases by approximately 12 × 0.01 = 0.12. Check: f(2.01) = 2.01³ = 8.120601. f(2) = 8. Difference = 0.120601 ≈ 0.12 ✓
Exercise 2: A simple loss function is L(w) = (3 - w)². What is dL/dw? At w = 1, is the derivative positive or negative? What does the sign tell you about which direction to move w to reduce L?
Answer:
Let u = (3 - w). Then L = u².
dL/du = 2u = 2(3 - w)
du/dw = -1
By chain rule: dL/dw = dL/du × du/dw = 2(3-w) × (-1) = -2(3-w)
At w = 1: dL/dw = -2(3-1) = -2 × 2 = -4
The derivative is negative. This means: increasing w decreases L. So to reduce loss, we should increase w (move in the direction opposite to the gradient). Gradient descent does exactly this: it subtracts the gradient, which means moving w upward when the gradient is negative.
The minimum loss is at w = 3, where dL/dw = 0. At that point, (3-w) = 0, so the loss is 0 — perfect prediction.
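To watch gradient descent find that minimum, the update can be repeated in a loop. This is a sketch under assumed settings: the learning rate of 0.1 and the 50 steps are choices made here, not values from the exercise.

```python
def loss(w):
    """L(w) = (3 - w)^2 from Exercise 2."""
    return (3 - w) ** 2


def dloss_dw(w):
    """dL/dw = -2(3 - w), as derived in the answer."""
    return -2 * (3 - w)


w = 1.0
learning_rate = 0.1  # assumed step size

for step in range(50):
    # Subtract the gradient: w moves up while the derivative is negative.
    w = w - learning_rate * dloss_dw(w)

print(f"w after 50 steps: {w:.5f}, loss: {loss(w):.2e}")
```

With these settings w climbs from 1 towards 3 and the loss falls towards 0, as the answer predicts.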
Exercise 3: Using finite differences (like we did in Section 4), estimate the derivative of f(x) = x³ at x = 2 by computing [f(2.001) - f(2)] / 0.001. Compare to the exact answer from Exercise 1.
Answer:
f(2.001) = 2.001³ = 8.012006001
f(2) = 8
[f(2.001) - f(2)] / 0.001 = 0.012006001 / 0.001 ≈ 12.006
Exact answer: 12. The finite difference estimate is 12.006 — very close, with a tiny error because we used a small but non-zero step. As the step approaches zero, the estimate approaches the exact derivative.
10. Interactive widget
Coming soon: Derivative Explorer →
Draw any curve. Click any point. See the tangent line — the line whose slope equals the derivative at that point. Drag along the curve and watch the slope change.