Funny how things I never expect to see on here pop up occasionally. I had read some of the early papers by Boumal on optimization over Riemannian manifolds in 2016 for a similar problem and even wrote some Julia code for it (before manifolds.jl existed).
In my case, I was trying to perform synchronization over a set of noisy point clouds to extract the "ground truth" point cloud. I.e., take the set of coordinates corresponding to a point cloud, randomly permute the order of the coordinates, randomly rotate/reflect the point cloud, and then apply a Gaussian perturbation to each coordinate. Repeat this process multiple times. The goal then is to recover the original point cloud from the noisy rotated/reflected/permuted copies of the original.
Boumal and Singer had done some work to solve this problem for just rotations/reflections by essentially "stacking" the unknown orthogonal matrices (i.e., elements of O(3)) into a Stiefel manifold and then performing Riemannian optimization over this manifold. It worked fantastically well and was much faster than prior techniques involving semidefinite programming, so I decided to try the same approach for synchronization over permuted copies of the point cloud, and it also worked quite well. However, one can achieve better convergence properties by performing a discrete Fourier transform over the symmetric group on the sets of point clouds so that instead of optimizing over a Stiefel manifold that is supposed to represent stacked permutation matrices (well, stacked orthogonal matrices with constraints to make them doubly stochastic), you optimize over a manifold of stacked unitary matrices intended to represent irreps of S_n.
Ultimately I didn't know enough group theory at the time to figure out how to generate irreps for the "combined" group of O(3) and S_n, so I couldn't solve the problem simultaneously for both rotations/reflections and permutations, but ultimately Singer and Boumal developed an approach that bypassed the synchronization problem altogether by utilizing invariant polynomials to extract the ground truth signal in a more direct way.
I believe Absil, Mahony, and Sepulchre also have a book on optimization over manifolds:
https://press.princeton.edu/absil
I was unaware of the Bournal work, so thanks for that. Do you have any idea how Bournal's approach differs from Absil's?
For others, it looks like Bournal also has a book on the topic from 2023:
Boumal was advised by Absil IIRC :) And in fact you can see this in his more modern presentation of the material.
That's actually interesting. Do you have some suggestions of where I could read more? Or others have suggestions?
(My group theory background is only from abstract algebra, quantum, and a bit of poking around)
Interestingly, the asymptotically fastest known algorithm for minimum weight bipartite matching [A] uses an interior point method, which means it's also doing Riemannian optimization in some sense.
[A] https://www.quantamagazine.org/researchers-achieve-absurdly-...
Interesting, I’ve always wondered if the assignment problem could be solved as continuous optimization. I’m familiar with manifold optimization but didn’t realize the permutation matrix could be expressed as a manifold.
Has this been applied to networks like DETR for object detection? I’ve never been clear on how they made the Hungarian algorithm differentiable (and it seems like their approach has issues).
Of course it can, but likely you will also need some search as well.
In math programming, we create a relaxation of the assignment problem where all of the integer decisions are converted to continuous. Then you solve this super simple Linear Program. This is a theoretical lower bound on what the optimal assignment is. You can now start branch and bound search and find the optimal integer assignment.
Now the trick is that we have identified over the decades cuts (inequalities) that you can apply to this initial relaxed problem, that they chop off irrelevant continuous space but let the integer solutions are untouched. If your cuts are good enough and the integer optimal solution is at the vertex of the chopped polytope, congratulations! You don't have to search anything!
No cuts are necessary for minimum-weight bipartite matching as the constraint matrix is totally unimodular. Total unimodularity guarantees that any optimal basic solution is an integer solution.
Moreover, well-known algorithms for linear programming like simplex and primal-dual methods have straightforward (and illuminating!) combinatorial interpretations when restricted to minimum-weight bipartite matching.
> Moreover, well-known algorithms for linear programming like simplex and primal-dual methods have straightforward (and illuminating!) combinatorial interpretations when restricted to minimum-weight bipartite matching.
I you have any resource illustrating this point, please share. I would be very interested.
I liked this video [0] about how the Hungarian Algorithm is a particular case of the primal-dual method and I am eager to dig further.
This was a popular topic in '80s linear programming books. Chvatal's "Linear programming," for instance, is carefully-written and devotes about 100 pages devoted to network simplex. Papadimitriou and Stieglitz's "Combinatorial optimization: algorithms and complexity" explicitly goes through primal-dual derivations of algorithms for shortest path, max flow, and min-cost flow (including bipartite matching). I haven't read it in any detail, but https://math.mit.edu/~goemans/PAPERS/book-ch4.pdf is online and might be to your liking.
Many thanks!
Spot on! I was just thinking about the case of having also complicating constraints. In the vanilla case is exactly as you say.
It's incredible how many algorithms can be made differentiable end-to-end.
E.g. in https://arxiv.org/abs/1905.11885 "Differentiable ranks and sorting using optimal transport" they show how a relaxation of an array sorting procedure leads to differentiable rank statistics (quantiles etc.). The underlying theory is quite close to what I show in the blog post.
In DETR they don't actually use a differentiable Hungarian algorithm, it's only used to score their neural predictions outside of the training loop IIUC.
I wonder if the author is aware of CVX / cvxpy / "disciplined convex programming"?
I think this would be easy to express in that, and I wonder how the performance would compare. Basically, there can be DSL-level facilities to track that functions and constraints are convex by construction, and remove the obligation to implement various bits of optimization logic (and allow one to switch between optimization details at a config level).
Good pointer! I did know about CVXPY but haven't used it yet. Though in requiring everything to be (log-)convex it constrains the user programs quite a bit, but I imagine it would have worked in this case because the objective is linear and the feasible set convex. I won't deny the role accident and force of habit had in picking this particular implementation path; I used Pytorch because it's my daily driver :D and because I had found 'mctorch' which provided the Riemann SGD logic.
Actually, you can solve the assignment problem as a linear program (LP). Just do
and you automagically get an integer solution in the x_{ij} due to the structure of the constraints.min sum_i sum_j c_{ij}*x_{ij} s.t. sum_i x_{ij} == 1, for all j sum_j x_{ij} == 1, for all i
CVXPY would be a good way to implement this these days.
- [deleted]
Author here! Thank you f1shy for sharing.
This is a fun technique that's surprisingly general-purpose (describing your error function differentiably and throwing your favorite optimizer at it).
E.g., graph layouts are notoriously hard to glean any good information from, especially for larger graphs. Tossing in a few constraints to enforce a topological sort, take up the whole page, keep nodes from overlapping, reduce crossings, have more or fewer clusters, ..., you can easily adjust a few sliders and get the perfect layout for a given problem.
I had to deal with minimum bipartite matching when implementing FedMA (a federated learning algorithm). I was lucky it could be solved with a scikit learn method because i wouldn't have been be able to write a solution in a short time, let alone an efficient one
When my supervisor read about fedma using matching he went "oh god, i have terrible memories about it". he was right
Yes, the problem can now be solved efficiently with a scikit one-liner, but the aim of this post was more speculative rather than beating the benchmark :)
Oh yeah that's sure, i wasn't trying to diminish the content, the opposite. as in, it was so complex and out of what i expected that i was lucky to have a library that could solve it for me
It's a little funny to think of a polyhedron (like the set of doubly stochastic matrices) as a manifold. The point here is that we're equipping the interior of this set with a certain "Riemannian metric" and using it to turn gradients into tangent vectors.
I thought the formulas for updates in this case were pretty opaque, so let's think of a simpler example. Say we want to optimize a function f(x_1, ..., x_n) over the simplex
\Delta = \{ (x_1, ..., x_n) : x_1 + ... + x_n = 1, x_i >= 0. \}
We can imagine that we're allocating a limited resource into n different bins. At each optimization step, say we have numbers g_i telling us partials of f with respect to the x_i. (Somewhat more generally, these numbers could tell us much we "regret" having allocated our resource into the ith bin.) How can we use this information to change our choices of x_i?
One option would be to take vanilla gradient descent steps. That is, we'd choose some small eta, set x_i := x_i - eta g_i, and somehow adjust these assignments to be a valid allocation. Another method, which tends to be more natural and do better in practice, is to set x_i := x_i exp(-eta g_i) and then divide each x_i by (x_1 + ... + x_n). (This is called the Hedge method, and "mysteriously" also looks just like a Bayesian update for a distribution of belief over n possibilities.) This is the flavor of method being used in the post to optimize over doubly stochastic matrices.
I've studied differential geometry but still have the impression that the proximal operator/mirror descent point of view is a little more natural in optimization/learning. Gesticulating wildly about Levi-Citiva connections is fun, but e.g. to optimize over the simplex it makes more sense to just ask: how should we regularize our updates to not be too "imprudent"? (Of course this depends on what problem you're solving!) One common story is that you want to avoid neglecting bins until you're very sure they're worthless. To do this, you can use a metric that makes your manifold look larger when you get near a vertex of the simplex (and then make sure you can do the necessary computations and find a good retraction map!), or you can think about what proximal operator to use/what Bregman divergence to use to regularize your updates. The paper used by this post uses a Fisher metric, which is the infinitesimal version of the KL divergence, and regularizing your updates by KL gives you things like the Hedge method/Bayesian updates.
Jesus, this would have been nice for my computability and optimisation class. A shame I had to upload it a few days ago. This would have come in handy