On Tuesday, November 15, the “Optimization and neural networks” workshop of the DSAIDIS chair was held. Permanent members and PhD students presented their research.

Olivier Fercoq

On the convergence of the ADAM algorithm

[EN] I will present the ADAM algorithm, a well-known stochastic gradient method with an adaptive learning rate. It is based on exponential moving averages of the stochastic gradients and of their squares, used to estimate their first and second moments.
Then I will explain the main ideas of its convergence proof in the case of a convex objective function. The challenges are the following: 1) the estimate of the first moment is biased; 2) the learning rate is a random variable. They are addressed by finding terms that telescope almost surely and by using the fact that the learning rate is small when the gradient estimate is noisy.
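For reference, here is a minimal sketch of the standard ADAM update (exponential moving averages with bias correction, as in Kingma & Ba); the exact variant and assumptions analyzed in the talk may differ.

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """One ADAM step; t is the 1-based iteration counter."""
        m = beta1 * m + (1 - beta1) * grad        # EMA of stochastic gradients (1st moment)
        v = beta2 * v + (1 - beta2) * grad ** 2   # EMA of squared gradients (2nd moment)
        m_hat = m / (1 - beta1 ** t)              # bias correction: the EMAs start at zero
        v_hat = v / (1 - beta2 ** t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-coordinate adaptive step
        return theta, m, v

Note that the effective step size lr / (sqrt(v_hat) + eps) is itself a random variable, which is precisely one of the difficulties mentioned in the abstract.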

See the slides

Maxime Lieber

Differentiable STFT with respect to the window length: optimizing STFT window length by gradient descent

[EN] In this talk, we revisit the tuning of the spectrogram window length, making the window length a continuous parameter optimizable by gradient descent instead of an empirically tuned integer-valued hyperparameter.

We first define two versions of the STFT that are differentiable with respect to the window length: one for the case where local bin centers are fixed and independent of the window length parameter, and one for the more difficult case where the window length affects the position and number of bins.
We then present the smooth optimization of the window length with any standard loss function. We show that this optimization can be of interest not only for neural network-based inference systems, but also for any STFT-based signal processing algorithm. We also show that the window length need not be fixed and learned offline; it can also be made adaptive and optimized on the fly.
The contribution is mainly theoretical for the moment, but the approach is very general and lends itself to large-scale application in several fields.
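To illustrate the general idea, here is a minimal sketch of a spectrogram whose window length enters as a continuous, differentiable parameter. The Gaussian parameterization and the fixed frame grid are assumptions made for the illustration only; they are not the construction proposed in the talk.

    import torch

    def soft_stft(x, win_length, n_fft=1024, hop=256):
        # Window support is fixed to n_fft samples; only the Gaussian spread
        # depends (differentiably) on the continuous win_length parameter.
        t = torch.arange(n_fft, dtype=x.dtype) - (n_fft - 1) / 2
        window = torch.exp(-0.5 * (t / (win_length / 6)) ** 2)  # roughly win_length wide
        frames = x.unfold(0, n_fft, hop) * window               # (n_frames, n_fft)
        return torch.fft.rfft(frames, dim=-1).abs()             # magnitude spectrogram

    # The window length is now an ordinary parameter, optimizable by gradient descent:
    win_length = torch.tensor(300.0, requires_grad=True)
    x = torch.randn(16000)
    loss = soft_stft(x, win_length).pow(2).mean()  # any loss built on the spectrogram
    loss.backward()                                # d(loss)/d(win_length) is available

Because the frame positions and the number of frequency bins are fixed here, this corresponds in spirit to the first, easier case described in the abstract.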

See the slides

Enzo Tartaglione

To the lottery ticket hypothesis and beyond: can we really make training efficient?

[EN] Recent advances in deep learning optimization have shown that, with some a posteriori information about fully-trained models, it is possible to match their performance by training only a subset of their parameters, the ones said to have “won at the lottery of initialization”.

Such a discovery has a potentially high impact on deep learning, from theory to applications, especially from a power consumption perspective. However, none of the “efficient” methods proposed so far matches state-of-the-art performance when high sparsity is enforced, and they rely on unstructured, sparsely connected models, which notoriously introduce overheads in the most common deep learning libraries.
In this talk, background on the lottery ticket hypothesis will be provided, together with some of the approaches that attempt to efficiently identify the parameters winning the lottery of initialization, and two recent works will be presented.
The first, presented as an oral at ICIP 2022, investigates why an efficient algorithm is hard to design in this context and suggests possible research directions towards efficient training. The second, to be presented at NeurIPS 2022, implements an automatic method to assess, at training time, which sub-graphs of the neural network need no further training (hence no back-propagation or gradient computation for them, saving computation).
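As a point of reference, a minimal sketch of the classic lottery-ticket recipe (one-shot magnitude pruning followed by rewinding to the initialization) is given below; it only illustrates the hypothesis itself, not the methods presented in the talk.

    import torch

    def winning_ticket_mask(trained_params, sparsity=0.9):
        # Keep the largest-magnitude weights of the fully-trained model
        # (the a posteriori information mentioned above).
        flat = torch.cat([p.detach().abs().flatten() for p in trained_params])
        threshold = flat.sort().values[int(sparsity * flat.numel())]
        return [(p.detach().abs() > threshold).float() for p in trained_params]

    def rewind(init_params, masks):
        # The "winning ticket": go back to the original initialization, zero out
        # everything outside the mask, then retrain only this sparse subnetwork.
        return [w0 * m for w0, m in zip(init_params, masks)]

The resulting mask is unstructured, which is exactly the source of the overheads mentioned in the abstract.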

See the slides

Hicham Janati

Optimal alignments in machine learning: The case of spatio-temporal data

See the slides

The day ended with a discussion on big data and frugal AI.

Big data or frugal AI: where can optimization techniques help?

See the slides