Gradient Descent Notes (Parameters)

Machine learning models depend on parameters, and these parameters control how the model fits the data. Training a model means finding the best parameter values.

  • In a linear model, we need the best values of m and c in
    $$y = mx + c$$
  • In a quadratic model, we need the best values of a, b, and c in
    $$y = ax^2 + bx + c$$
  • In logistic regression, we need the best parameter values in
    $$z = a + bx$$
    and then the value of \(z\) is passed through the sigmoid function to get the final probability.

In all these cases, the model form is fixed, but the parameter values must be chosen correctly. If the parameters are not correct, the model gives poor predictions even if the formula is correct.
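As a concrete illustration, the three model forms above can be written as plain functions of their parameters. This is a minimal sketch; the function names and any parameter values passed in are placeholders, not part of the notes.

```python
import math

def linear(x, m, c):
    # y = m*x + c
    return m * x + c

def quadratic(x, a, b, c):
    # y = a*x^2 + b*x + c
    return a * x**2 + b * x + c

def logistic(x, a, b):
    # z = a + b*x, squashed by the sigmoid into a probability in (0, 1)
    z = a + b * x
    return 1.0 / (1.0 + math.exp(-z))
```

Changing the parameters changes the predictions; the formulas themselves stay fixed.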

So the main task in training a machine learning model is to find the best parameter values that give the least error on the data. This is where gradient descent is used.

Why not directly set the derivative equal to zero

In basic calculus, we often find the minimum or maximum by solving

$$ \frac{dJ}{d\theta} = 0 $$

This works when the function is simple and the number of unknowns is small. Machine learning is different because the cost function usually depends on many parameters.

Large number of parameters

A real model may have thousands or millions of parameters:

$$ J(w_1, w_2, w_3, \ldots, w_n) $$

To use the derivative method, we would need to set every partial derivative equal to zero and solve a huge system of equations. That is not practical for large models.

No neat formula solution

Many modern cost functions do not give a neat closed form solution. Even if the derivatives are known, solving them exactly can be impossible.

Why gradient descent is used

Gradient descent improves the parameters step by step. At each step, it checks the slope and moves in the direction that reduces the cost. This makes it suitable for large datasets and complex models.
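The step-by-step idea can be sketched as a short loop. This is a minimal one-variable version; the function names and the example cost are illustrative, not from the notes.

```python
def gradient_descent(grad, x0, alpha, steps):
    """Repeatedly step against the slope: x <- x - alpha * grad(x)."""
    x = x0
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x

# Illustration: J(x) = x^2 has derivative 2x and its minimum at x = 0.
x_min = gradient_descent(lambda x: 2 * x, x0=5.0, alpha=0.1, steps=100)
```

Each pass through the loop moves the parameter a little downhill, which is exactly the update rule derived in the next section.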

Gradient descent update rule (quadratic example, alpha = 0.25)

We will use a quadratic cost function and run gradient descent by hand. This is a clean, exam-style example.

Step 1 choose a quadratic cost function

Let the cost function be

$$ J(x) = x^{2} - 6x + 13 $$

This is a quadratic function with a positive \(x^2\) coefficient, so the parabola opens upward and has exactly one minimum point.

Step 2 find the derivative
$$ \frac{dJ}{dx} = 2x - 6 $$
Step 3 write the update rule
$$ x_{\text{new}} = x_{\text{old}} - \alpha \frac{dJ}{dx} $$

Here we choose

$$ \alpha = 0.25 $$
Step 4 substitute and simplify
$$ x_{\text{new}} = x_{\text{old}} - 0.25(2x_{\text{old}} - 6) $$
$$ x_{\text{new}} = x_{\text{old}} - (0.5x_{\text{old}} - 1.5) $$
$$ x_{\text{new}} = 0.5x_{\text{old}} + 1.5 $$
Step 5 start from a value and iterate

Let the starting value be

$$ x_{0} = 8 $$

Use \(x_{n+1} = 0.5x_n + 1.5\):

$$ x_{1} = 0.5(8) + 1.5 = 5.5 $$
$$ x_{2} = 0.5(5.5) + 1.5 = 4.25 $$
$$ x_{3} = 0.5(4.25) + 1.5 = 3.625 $$
$$ x_{4} = 0.5(3.625) + 1.5 = 3.3125 $$
$$ x_{5} = 0.5(3.3125) + 1.5 = 3.15625 $$
$$ x_{6} = 0.5(3.15625) + 1.5 = 3.078125 $$
$$ x_{7} = 0.5(3.078125) + 1.5 = 3.0390625 $$
$$ x_{8} = 0.5(3.0390625) + 1.5 = 3.01953125 $$
$$ x_{9} = 0.5(3.01953125) + 1.5 = 3.009765625 $$
$$ x_{10} = 0.5(3.009765625) + 1.5 = 3.0048828125 $$
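The hand iteration can be checked with a few lines of code, using the same cost, the same \(\alpha = 0.25\), and the same start \(x_0 = 8\):

```python
def step(x, alpha=0.25):
    # dJ/dx = 2x - 6, so x_new = x - alpha*(2x - 6) = 0.5x + 1.5 for alpha = 0.25
    return x - alpha * (2 * x - 6)

x = 8.0
for n in range(10):
    x = step(x)
    print(f"x_{n + 1} = {x}")
# The final value is 3.0048828125, matching the hand calculation.
```

Every step halves the distance to the minimum at \(x = 3\), which is why the values shrink so quickly.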
Check the true minimum point

The minimum occurs where the derivative is zero:

$$ 2x - 6 = 0 \Rightarrow x = 3 $$

The iteration values move toward \(3\). That is the minimum point.

What to remember
  • Differentiate the cost function
  • Apply \(x_{\text{new}} = x_{\text{old}} - \alpha \frac{dJ}{dx}\)
  • Repeat until values settle near the minimum
Cubic example using gradient descent first

A cubic function can have two turning points. One is a local maximum and the other is a local minimum. Gradient descent moves toward a minimum, so it will not give the maximum point.

Step 1 choose a cubic function

Let the function be

$$ J(x) = x^{3} - 3x $$
Step 2 write the gradient descent update rule

We use the update rule

$$ x_{\text{new}} = x_{\text{old}} - \alpha \frac{dJ}{dx} $$

For this example, take \(\alpha = 0.10\) and start at \(x_{0} = 2\).

Step 3 use the derivative only for the update

To run the update, we need the slope expression:

$$ \frac{dJ}{dx} = 3x^{2} - 3 $$
Step 4 calculate 10 iterations

Substitute into the update:

$$ x_{n+1} = x_{n} - 0.10(3x_{n}^{2} - 3) $$
n     x_n         3x_n^2 - 3    x_{n+1} = x_n - 0.10(3x_n^2 - 3)
0     2.000000    9.000000      1.100000
1     1.100000    0.630000      1.037000
2     1.037000    0.226107      1.014389
3     1.014389    0.086957      1.005694
4     1.005694    0.034259      1.002268
5     1.002268    0.013622      1.000906
6     1.000906    0.005435      1.000362
7     1.000362    0.002172      1.000145
8     1.000145    0.000870      1.000058
9     1.000058    0.000347      1.000023
10    1.000023    0.000139      1.000009 (keeps moving toward 1)
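The same iteration can be generated programmatically. This is a direct transcription of the update, with the same \(\alpha = 0.10\) and start \(x_0 = 2\):

```python
x = 2.0
alpha = 0.10
for n in range(10):
    grad = 3 * x**2 - 3        # dJ/dx for J(x) = x^3 - 3x
    x = x - alpha * grad
    print(f"x_{n + 1} = {x:.6f}")
# x settles just above 1, the local minimum of the cubic
```

Printing with six decimal places reproduces the table row by row.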
What happened

Starting from \(x_{0} = 2\), the values moved toward \(x = 1\). That point is the local minimum of this cubic.

Now confirm using calculus (turning points)

Now we do the usual calculus method to see all turning points.

$$ \frac{dJ}{dx} = 3x^{2} - 3 $$
$$ 3x^{2} - 3 = 0 $$
$$ x^{2} = 1 $$
$$ x = -1,\ 1 $$

So the cubic has two stationary points: one at \(x = -1\) and one at \(x = 1\). Gradient descent found only \(x = 1\) because the starting point \(x_0 = 2\) lies on the slope that runs down into that minimum.
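The second derivative test tells the two stationary points apart:

$$ \frac{d^{2}J}{dx^{2}} = 6x $$

At \(x = 1\), \(6(1) = 6 > 0\), so \(x = 1\) is a local minimum. At \(x = -1\), \(6(-1) = -6 < 0\), so \(x = -1\) is a local maximum.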

Important note

Gradient descent moves downhill, so it converges only to local minima. That is why the iterations did not reach the other critical value \(x = -1\), which is a local maximum.

If you want the maximum

If you want the maximum of \(J(x)\), convert it into a minimum problem by defining

$$ g(x) = -J(x) $$

Minimizing \(g(x)\) is the same as maximizing \(J(x)\).
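As a sketch of this trick, running the same update on \(g(x) = -J(x)\) walks to the maximum of \(J(x) = x^{3} - 3x\). The starting point \(x_0 = -2\) is an assumption chosen here to mirror the earlier \(x_0 = 2\):

```python
alpha = 0.10
x = -2.0                       # illustrative starting point
for _ in range(50):
    grad_g = -(3 * x**2 - 3)   # g(x) = -J(x), so dg/dx = -dJ/dx
    x = x - alpha * grad_g
# x approaches -1, the local maximum of J(x) = x^3 - 3x
```

The iterates mirror the earlier table step for step, converging to \(x = -1\) instead of \(x = 1\).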
