Gradient descent – why subtract gradient to update m and b

These are the gradient descent formulas: $\frac{\partial J}{\partial m} = \frac{2}{n}\sum -x_i\,(y_i - (mx_i + b))$ and $\frac{\partial J}{\partial b} = \frac{2}{n}\sum -(y_i - (mx_i + b))$. My understanding is that they come from first taking the positive gradient, that is, the partial derivatives of the function $(y - (mx + b))^2$. This leads to $\frac{\partial J}{\partial m} = 2x(y - (mx + b))$ and $\frac{\partial J}{\partial b} = 2(y - (mx + b)) \times 1$. Then, to get the descent, we just add negatives to each partial derivative. So we are already descending. But translating gradient descent into code, … Read more
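To see why the code still subtracts, here is a minimal sketch of the usual linear-regression update under this cost, assuming $J(m,b)=\frac{1}{n}\sum(y_i-(mx_i+b))^2$; the data, learning rate, and epoch count are illustrative, not from the post:

import numpy as np

def gradient_descent(x, y, lr=0.05, epochs=2000):
    m, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        residual = y - (m * x + b)
        # The partials already carry the minus sign from the chain rule,
        # so subtracting them moves m and b downhill.
        dm = (2 / n) * np.sum(-x * residual)
        db = (2 / n) * np.sum(-residual)
        m -= lr * dm
        b -= lr * db
    return m, b

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
print(gradient_descent(x, y))  # approaches (2.0, 1.0)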

Lipschitz Number in Gradient Descent

During gradient descent, if the objective function’s value is greater than at the previous iteration, would it be advantageous to use a vector orthogonal to the update vector? Regarding trust regions, the Lipschitz theorem specifies a step size of $1/L$, where $L$ is the greatest eigenvalue of the Hessian matrix $\nabla^2 f(x^*)$, but this follows the same approach used … Read more
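As a point of reference for the $1/L$ rule, here is a minimal sketch on a quadratic objective, where the Hessian is constant and $L$ is its largest eigenvalue; the matrix H below is a placeholder of my own:

import numpy as np

H = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite Hessian
L = np.max(np.linalg.eigvalsh(H))        # greatest eigenvalue of the Hessian

def grad(x):
    return H @ x                         # gradient of f(x) = 0.5 * x @ H @ x

x = np.array([1.0, -2.0])
for _ in range(100):
    x = x - (1.0 / L) * grad(x)          # step size 1/L guarantees descent here
print(x)                                 # tends to the minimizer 0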

Gradient descent with box constraints and a possibly non-convex function

Hope you are well. I am working on a quadratic optimization problem (see below). Of the 4 variables, only 2 have a negativity constraint. Am I correct to say that gradient descent is my best option? (My suggestion for solving it is shown below the problem at hand.) EDIT: Change in the … Read more
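For what it’s worth, a standard way to handle box constraints directly is projected gradient descent: take a gradient step, then clip back into the box. A minimal sketch, assuming a convex quadratic $f(x) = \frac12 x^TQx + c^Tx$ and reading the constraint as non-negativity on two of the four variables (flip the bounds if the post means $x \le 0$); Q and c are placeholders:

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
Q = B.T @ B + np.eye(4)                  # positive definite placeholder
c = rng.standard_normal(4)

lower = np.array([-np.inf, -np.inf, 0.0, 0.0])  # only two variables constrained
upper = np.full(4, np.inf)

step = 1.0 / np.max(np.linalg.eigvalsh(Q))
x = np.zeros(4)
for _ in range(500):
    x = np.clip(x - step * (Q @ x + c), lower, upper)  # gradient step, then project
print(x)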

Gradient Descent: transforming error surface from elliptical to circular

I am watching the Neural Network videos by Prof. Geoff Hinton. In them he talks about the problem with elliptical error surfaces and how they can be transformed into circular surfaces. Slide: [image from the lecture]. Link to the timestamped video: https://youtu.be/Xjtu1L7RwVM?list=PLoRl3Ht4JOcdU872GhiYWf6jwrk_SNhz9&t=125 Questions: 1) Why is the first case, with 101 and 99, so elongated? Is it because the inputs … Read more
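Regarding question 1), here is a small numeric check of the slide’s point, assuming the setup is a linear neuron with two weights and the two training cases $x=(101,101)$ and $x=(101,99)$, which is what I believe the slide uses; the Hessian of the squared error in the weights is proportional to $\sum_i x_i x_i^T$, and centering the inputs is exactly what makes the surface circular:

import numpy as np

def hessian(cases):
    # Hessian of the squared error for a linear neuron, up to a constant factor
    return sum(np.outer(x, x) for x in cases)

raw = hessian([np.array([101.0, 101.0]), np.array([101.0, 99.0])])
centered = hessian([np.array([1.0, 1.0]), np.array([1.0, -1.0])])  # mean 100 subtracted

print(np.linalg.cond(raw))       # ~4e4: a long, thin ellipse
print(np.linalg.cond(centered))  # 1.0: perfectly circular contours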

Projected gradient descent on a semidefinite program with multiple constraints

I have the following semidefinite program: $\operatorname*{arg\,min}_X \operatorname{trace}(CX)$ subject to $\operatorname{diag}(X) = \mathbf{1}$, $X \succeq 0$, which is a semidefinite relaxation of a MIMO communication detection problem with binary phase-shift keying (PSK). My objective is to implement a first-order algorithm and compare the results to an off-the-shelf solver such as CVX. I have tried to implement a crude projected gradient descent … Read more
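A crude sketch of what such an iteration might look like: the gradient of $\operatorname{trace}(CX)$ with respect to $X$ is simply $C$, so each iteration takes a gradient step, projects onto the PSD cone by clipping negative eigenvalues, and then resets the diagonal. Projecting onto each constraint set in turn is only a heuristic (it is not the exact projection onto their intersection, and resetting the diagonal can leave the PSD cone); C below is a placeholder:

import numpy as np

rng = np.random.default_rng(0)
n = 8
B = rng.standard_normal((n, n))
C = B + B.T                              # symmetric cost matrix (placeholder)

def project_psd(X):
    w, V = np.linalg.eigh(X)
    return V @ np.diag(np.maximum(w, 0)) @ V.T   # clip negative eigenvalues

X = np.eye(n)
step = 0.01
for _ in range(1000):
    X = X - step * C                     # gradient of trace(CX) w.r.t. X is C
    X = project_psd(X)
    np.fill_diagonal(X, 1.0)             # enforce diag(X) = 1
print(np.trace(C @ X))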

Bound on steepest descent

I’m learning about the method of steepest descent for approximating the solution of $Ax = b$, where $A$ is an invertible matrix. Here is the part I understand: we do this by minimizing the function $f(x) = \frac{1}{2}\|Ax - b\|^2$, whose gradient is $\nabla f(x) = A^TAx - A^Tb$. Let $M = A^TA$ and $c = A^Tb$. I came across this bound: $f(x^{(i)}) \leq f(x^{(0)})\left(1 - \frac{1}{\kappa(A)^2}\right)^i$, where $x^{(i)}$ is the $i$-th gradient … Read more
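The bound is easy to probe numerically. A sketch assuming exact line search (step $\alpha = g^Tg / g^TMg$), and using the fact that $f^* = 0$ here because $A$ is invertible; the matrix A below is a placeholder:

import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n)) + 3 * np.eye(n)  # invertible placeholder
b = rng.standard_normal(n)
M, c = A.T @ A, A.T @ b
kappa = np.linalg.cond(A)

def f(x):
    return 0.5 * np.linalg.norm(A @ x - b) ** 2

x = np.zeros(n)
f0, rate = f(x), 1 - 1 / kappa**2
for i in range(1, 21):
    g = M @ x - c
    if g @ g < 1e-30:
        break
    x = x - ((g @ g) / (g @ (M @ g))) * g        # exact line search step
    assert f(x) <= f0 * rate**i + 1e-12          # the bound from the question
print(f(x))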

Unstable manifold of an unstable fixed point of a steepest descent iteration

Let $f:\mathbb R\to\mathbb R$ be differentiable, let $t:\mathbb R\to(0,\infty)$ be a “step size” function, and let $$\tau:\mathbb R\to\mathbb R\;,\;\;\;x\mapsto x-t(x)f'(x).$$ In the steepest descent method, we iterate $\tau$ to find a local minimum of $f$. Let’s consider$^1$ $f(x)=x^2(x-2)^2$, so that $f'(x)=4x(x-1)(x-2)$. Note that the fixed points of $\tau$ are precisely the critical points of $f$. In our … Read more
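For a constant step size the instability at $x=1$ is easy to see numerically: $f''(1) = -4$, so $\tau'(1) = 1 - t\,f''(1) = 1 + 4t > 1$. A quick sketch, assuming $t(x) \equiv 0.05$ (the question’s general step size function is more subtle):

def fprime(x):
    return 4 * x * (x - 1) * (x - 2)

def tau(x, t=0.05):
    return x - t * fprime(x)

for x0 in (1.0, 1.0 + 1e-8):
    x = x0
    for _ in range(200):
        x = tau(x)
    print(x0, "->", x)   # x = 1 stays put; the perturbed start escapes to the minimum at 2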

The stability of a gradient flow (discrete scheme, JKO, proximal point, reference request)

Define a free energy functional on the space of probability densities on $\mathbb{R}^d$, denoted $\mathcal{P}(\mathbb{R}^d)$: $$E(\rho) := \int_{\mathbb{R}^d} f(x)\rho(x)\,dx + \int_{\mathbb{R}^d} \rho(x)\log\rho(x)\,dx,$$ for some uniformly convex, Lipschitz, non-negative $f:\mathbb{R}^d\to\mathbb{R}$. Consider the following discrete scheme of a Wasserstein gradient flow (coined the JKO scheme): fix a time step $h$; given a density with finite second moment $\rho_0$ such that … Read more
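For context, since the excerpt cuts off before stating the scheme: the standard JKO step of Jordan, Kinderlehrer, and Otto is, as I understand it,
$$\rho_{k+1} \in \operatorname*{arg\,min}_{\rho \in \mathcal{P}(\mathbb{R}^d)} \left\{ \frac{1}{2h} W_2^2(\rho_k, \rho) + E(\rho) \right\},$$
where $W_2$ is the 2-Wasserstein distance; iterating this and letting $h \to 0$ recovers the gradient flow of $E$ in the Wasserstein metric.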

Gradient descent via polynomial approximation

It seems that most proofs of convergence for gradient descent algorithms rely on strong conditions on the first and second derivatives of the function, for instance that $|f''(x)| \leq K$ over the whole domain of the function. My question is: are there results for gradient descent type algorithms when we can only say something like … Read more
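One way to see why such global bounds show up in proofs: for $f(x) = x^4$ the second derivative $12x^2$ is unbounded, and a fixed step size that works near the minimum blows up from a start that is too far out. The step size and starting points below are illustrative:

def step(x, eta=0.1):
    return x - eta * 4 * x**3        # gradient descent on f(x) = x^4

for x0 in (1.0, 3.0):
    x = x0
    for _ in range(6):
        x = step(x)
    print(x0, "->", x)               # 1.0 heads toward 0; 3.0 blows up (~1e167)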

Decrease in the size of gradient in gradient descent

Gradient descent reduces the value of the objective function in each iteration. This is repeated until convergence. The question is whether the norm of the gradient also has to decrease in every iteration of gradient descent. Edit: How about when the objective is a convex function? Answer: Suppose we have the following objective function … Read more
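Even for a convex objective the answer turns out to be no, and a one-step numerical counterexample is easy to build. The function below (my own, not necessarily the one from the answer) is convex and differentiable: $f(x) = x^2$ for $x < 0$ and $f(x) = 10x^2$ for $x \ge 0$:

def f(x):
    return x**2 if x < 0 else 10 * x**2

def fprime(x):
    return 2 * x if x < 0 else 20 * x

x0 = -1.0
x1 = x0 - 0.575 * fprime(x0)                 # x1 = 0.15
print(f(x0), f(x1))                          # 1.0 -> 0.225: objective decreased
print(abs(fprime(x0)), abs(fprime(x1)))      # 2.0 -> 3.0: gradient norm increased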