Conditional Gradient Methods
The purpose of our survey is to serve both as a gentle introduction to and a coherent overview of state-of-the-art Frank–Wolfe algorithms, also called conditional gradient algorithms, for function minimization. These are first-order algorithms that access the feasible region only through linear minimization, and they are especially useful in convex optimization, in particular when linear optimization is cheaper than projection.
The selection of the material has been guided by the principle of highlighting crucial ideas as well as presenting new approaches that we believe might become important in the future, with ample citations, including older works that were instrumental in the development of newer methods. Still, our selection is somewhat biased (for example, all applications are from machine learning) and need not reflect the consensus of the research community. We have certainly missed recent important contributions; after all, the research area of Frank–Wolfe methods is very active, making it a moving target. We sincerely apologize in advance for any such distortions, and we fully acknowledge: we stand on the shoulders of giants.
Authors
- Gábor Braun, Department for AI in Society, Science, and Technology, Zuse Institute Berlin, Germany; Institute of Mathematics, Technische Universität Berlin, Germany
- Alejandro Carderera, School of Industrial and Systems Engineering, Georgia Institute of Technology, USA
- Cyrille W. Combettes, Toulouse School of Economics, Université Toulouse 1 Capitole, France
- Hamed Hassani, Department of Electrical and Systems Engineering, University of Pennsylvania, USA
- Amin Karbasi, Department of Electrical Engineering, Computer Science, Statistics & Data Science, Yale University, USA
- Aryan Mokhtari, Department of Electrical and Computer Engineering, University of Texas at Austin, USA
- Sebastian Pokutta, Department for AI in Society, Science, and Technology, Zuse Institute Berlin, Germany; Institute of Mathematics, Technische Universität Berlin, Germany
Citation
Cite as
Gábor Braun, Alejandro Carderera, Cyrille W. Combettes, Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and Sebastian Pokutta, Conditional Gradient Methods, arXiv:2211.14103 [math.OC]
or in BibTeX format
@Misc{CGFW2022,
  author        = {G{\'a}bor Braun and Alejandro Carderera and Cyrille W. Combettes
                   and Hamed Hassani and Amin Karbasi and Aryan Mokhtari
                   and Sebastian Pokutta},
  title         = {Conditional Gradient Methods},
  month         = nov,
  year          = 2022,
  archivePrefix = {arXiv},
  primaryClass  = {math.OC},
  eprint        = {2211.14103},
  url           = {https://conditional-gradients.org/}
}
Outline of Conditional Gradient Methods
Conditional gradient algorithms, also called Frank–Wolfe algorithms, minimize a convex objective function by computing the function value and gradient at selected points and minimizing linear functions over the feasible region. The algorithms successively compute solutions closer to the optimum.
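To make this template concrete, here is a minimal Python sketch (our own illustration, not code from the survey); the function names, the step-size rule shown, and the simplex example are only assumptions for illustration.

import numpy as np

def frank_wolfe(grad, lmo, x0, num_steps):
    # Vanilla Frank-Wolfe: each iteration evaluates the gradient once, calls the
    # linear minimization oracle (LMO) once, and moves toward the returned vertex.
    x = x0.astype(float)
    for t in range(num_steps):
        v = lmo(grad(x))           # v minimizes <grad(x), v> over the feasible region
        gamma = 2.0 / (t + 2)      # classic agnostic step-size rule
        x = x + gamma * (v - x)    # convex combination, so x stays feasible
    return x

# Illustration: minimize ||x - b||^2 over the probability simplex.
b = np.array([0.6, 0.3, 0.1])
grad = lambda x: 2.0 * (x - b)
lmo = lambda c: np.eye(len(c))[np.argmin(c)]   # best simplex vertex for a linear objective
print(frank_wolfe(grad, lmo, np.array([1.0, 0.0, 0.0]), 500))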
The first chapter is introductory. It motivates the use of conditional gradient-based methods and provides a short summary of the history of this class of algorithms. It also broadly discusses extensions and variants of the original Frank–Wolfe algorithm that have been studied in the literature. The chapter ends with the notation and basic definitions used in this monograph.
The second chapter focuses on the base Frank–Wolfe algorithm. It summarizes the algorithm's convergence properties, giving nearly matching lower and upper bounds for smooth (roughly, locally well-behaved) convex functions. The chapter continues with special cases admitting faster convergence, notably when the objective function is smooth and strongly convex (roughly, it increases rapidly away from the optimum) and either (i) the feasible set is uniformly or strongly convex (roughly, round); or (ii) the optimum lies in the interior of the feasible region. Finally, the chapter ends by presenting Frank–Wolfe variants with fast convergence for smooth strongly convex functions when the feasible region is a polytope.
The third chapter presents several advanced variants of Frank–Wolfe algorithms, offering significant performance improvements in specific situations. The improvements target various elements of the base algorithm: exploiting sharpness (a generalization of strong convexity), replacing or augmenting linear minimization in various ways, and eliminating or tuning internally used parameters.
Adaptivity is a notable example of parameter tuning, replacing the global smoothness constant with local estimates. Lazification is a notable example of replacing linear minimization with a cheaper operation, e.g., terminating a typical implementation of linear minimization early once a sufficiently good solution has been found. This goes in the direction of solving subproblems only to the accuracy at which progress on the original problem justifies the computational cost. Boosting and blending aim to better utilize or eliminate linear minimization. Conditional Gradient Sliding is essentially Nesterov's accelerated projected gradient descent algorithm using the Frank–Wolfe algorithm for approximate projections back into the feasible region, minimizing the number of gradient computations without increasing the number of linear minimizations.
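To illustrate the lazification idea, here is a rough Python sketch (our own, with hypothetical names, not the survey's pseudocode): previously found vertices are scanned first, and a vertex is accepted as soon as it guarantees enough progress, so the true linear minimization oracle is called only as a fallback.

def lazy_vertex_search(gradient, x, cache, lmo, phi, K=2.0):
    # gradient, x, and the cached vertices are assumed to be numpy arrays.
    # Accept any cached vertex that promises at least phi / K progress in the
    # Frank-Wolfe gap, avoiding a call to the (expensive) LMO.
    for v in cache:
        if gradient @ (x - v) >= phi / K:
            return v, phi
    # Fall back to the true linear minimization oracle.
    v = lmo(gradient)
    if gradient @ (x - v) >= phi / K:
        cache.append(v)
        return v, phi
    # No vertex gives enough progress: halve the progress estimate instead.
    return None, phi / 2.0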
The fourth chapter covers conditional gradient methods in large-scale settings, where access to the exact value of the objective function or its gradient is computationally prohibitive. In particular, three scenarios are considered: stochastic, online, and distributed.
In the stochastic setting, the function values and gradients are noisy (random) but unbiased. We study a variety of techniques that have been developed to adapt conditional gradient methods to this setting. The common aim of these techniques is to provide more accurate estimates of function values and gradients. For all the resulting algorithms we provide a tight convergence analysis and show in which situations they are optimal. An extreme special case is zeroth-order algorithms, which use only (noisy) function values and no gradients.
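One representative technique, sketched below with our own (hypothetical) names and step-size schedules, averages the noisy gradients across iterations so that the estimation error shrinks over time, and then performs an ordinary Frank–Wolfe step with the averaged estimate.

import numpy as np

def stochastic_frank_wolfe(stoch_grad, lmo, x0, num_steps):
    # d is a running average of noisy gradients; its variance decays over time,
    # so plain Frank-Wolfe steps driven by d still converge without large batches.
    # The precise weight schedules vary across the literature; these are assumptions.
    x = x0.astype(float)
    d = np.zeros_like(x)
    for t in range(1, num_steps + 1):
        rho = 2.0 / (t + 1) ** (2.0 / 3.0)        # averaging weight
        d = (1.0 - rho) * d + rho * stoch_grad(x)
        v = lmo(d)                                # linear minimization with the estimate
        gamma = 1.0 / t                           # Frank-Wolfe step size
        x = x + gamma * (v - x)
    return x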
Next, we move to the online setting, in which components of the objective function become available sequentially and it is desirable to start optimizing before all information is disclosed. We study Frank–Wolfe algorithms for two common online settings: the full-information setting, where the whole objective function is gradually revealed, and the bandit setting, where only very limited information about the objective function is revealed.
Finally, we study the distributed setting, in which the objective function decomposes as the sum of individual objectives, each belonging to a "node" in a network. The nodes know only their own function and have no information about the functions belonging to the other nodes. Consequently, to optimize the global objective, the nodes have to communicate with each other. After recalling various distributed scenarios, we study in detail a Frank–Wolfe method for the decentralized setting, which, while quite general, is mostly aimed at networks where all nodes are equal and each node communicates with only a few other nodes. The algorithm is one of the main conditional gradient algorithms for the distributed setting.
The fifth and last chapter consists of auxiliary material of two kinds. The first part presents variants of conditional gradient methods that differ from the core Frank–Wolfe algorithms to a larger extent than the ones in the preceding chapters, but still share ideas and structure worth highlighting. The first example is the matching pursuit algorithm, which minimizes over a linear subspace, an unbounded feasible region, whereas Frank–Wolfe algorithms require a bounded one. Then we discuss second-order Conditional Gradient Sliding methods, a version of the previously mentioned Conditional Gradient Sliding algorithm that minimizes second-order approximations of the objective function instead of performing projections. The final variant is a direct acceleration of conditional gradient methods, different from Conditional Gradient Sliding but using some of its elements.
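For the classical least-squares instance, matching pursuit can be sketched in a few lines of Python (our own illustration, assuming unit-norm atoms): instead of a vertex of a bounded feasible region, each step selects the dictionary atom most aligned with the current residual and moves along it.

import numpy as np

def matching_pursuit(D, b, num_steps):
    # D: matrix whose columns are unit-norm atoms; b: target signal.
    # Greedily builds an approximation of b in the linear span of the atoms.
    x = np.zeros_like(b, dtype=float)
    for _ in range(num_steps):
        r = b - x                           # current residual
        scores = D.T @ r                    # alignment of each atom with the residual
        i = int(np.argmax(np.abs(scores)))  # most aligned atom
        x = x + scores[i] * D[:, i]         # step along that atom
    return x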
The second part of the chapter focuses on applications of Frank–Wolfe methods, such as maximizing submodular functions, but mostly to machine learning problems: classification, video colocalization (identifying objects in video), adversarial attacks (probing the robustness of learning algorithms), optimal experimental design, and others. The coreset problem asks for an approximation of a set by a few points. The approximate Carathéodory problem aims to approximate a point by one with a sparse representation, such sparse representations being practically desirable. By design, the Frank–Wolfe algorithm is particularly well suited to efficiently finding sparse approximations.
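To illustrate the last point, here is a sketch (our own, not from the survey) of vanilla Frank–Wolfe applied to the approximate Carathéodory problem: minimizing the distance to the target point over the probability simplex adds at most one vertex per iteration, so the iterate is automatically a sparse convex combination.

import numpy as np

def approx_caratheodory(V, p, num_steps):
    # V: matrix whose columns are the vertices; p: target point in their convex hull.
    # Run Frank-Wolfe on f(lam) = 0.5 * ||V @ lam - p||^2 over the probability simplex.
    n = V.shape[1]
    lam = np.zeros(n)
    lam[0] = 1.0                          # start at a single vertex
    for t in range(num_steps):
        gradient = V.T @ (V @ lam - p)    # gradient of f at lam
        i = int(np.argmin(gradient))      # LMO over the simplex: best single coordinate
        gamma = 2.0 / (t + 2)
        lam = (1.0 - gamma) * lam
        lam[i] += gamma                   # still a convex combination
    return lam                            # at most num_steps + 1 nonzero weights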
Code
The Python code used in the survey to generate all comparisons and graphics is publicly available on GitHub.
Many of the more general variants are available as reference implementations in the FrankWolfe.jl Julia package.
Feedback
Feedback, suggestions, and comments are very much welcome at frank.wolfe.book@gmail.com.