In this post, we will explain how Gradient Descent (GD) works and why it can converge very slowly. Gradient Descent is the simplest first-order optimization algorithm: it minimizes a convex differentiable function over $latex {{\mathbb R}^d}&fg=000000$. The update rule is simply $latex {{\boldsymbol x}_{t+1} = {\boldsymbol x}_t - \eta_t \nabla f({\boldsymbol x}_t)}&fg=000000$, …
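
To make the update rule concrete, here is a minimal sketch of Gradient Descent in Python. The objective $latex {f({\boldsymbol x}) = \tfrac{1}{2}\|{\boldsymbol x}\|^2}&fg=000000$, the starting point, the constant stepsize, and the function names are all illustrative choices, not part of the post:

```python
import numpy as np

def gradient_descent(grad, x0, eta, steps):
    """Iterate x_{t+1} = x_t - eta * grad(x_t) for a fixed number of steps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - eta * grad(x)
    return x

# Illustrative example: f(x) = 0.5 * ||x||^2, whose gradient is simply x.
# Each step contracts x by a factor (1 - eta), so the iterates approach
# the minimizer at the origin.
x_final = gradient_descent(lambda x: x, x0=[3.0, -2.0], eta=0.1, steps=200)
```

Here a constant stepsize $latex {\eta_t = \eta}&fg=000000$ is used for simplicity; the general rule allows it to vary with $latex {t}&fg=000000$.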

Continue reading "The Negative Gradient Does Not Point Towards the Minimum"