Online quantization algorithm for gudhi#536
Co-authored-by: Vincent Rouvreau <10407034+VincentRouvreau@users.noreply.github.com>
…evel into quantization_v2
This comment was marked as resolved.
Corrected.
mglisse left a comment
I guess keeping it in wasserstein/ is ok.
| different tori with some additional noise. | ||
| Starting from an initial codebook ``c0``, centroids are iteratively updated as new diagrams are provided. | ||
| As we use the standard metrics between persistence diagrams (denoted here by :math:`\mathrm{OT}_2`), points in the | ||
| diagrams that are close to the diagonal do not interfere in the codebook update process. |
So it is the same as having an implicit point on the diagonal in the codebook?
More precisely, having a point in the codebook that represents "all the points on the diagonal" (or, formally, looking at the quotient space where you identify the points on the diagonal).
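To make the "implicit diagonal point" reading concrete, here is a minimal sketch, not the PR's code: `dist_matrix_with_diagonal` is a hypothetical helper that appends one extra column holding each point's distance to the diagonal, so that a point assigned to that last column is simply ignored in the centroid update.

```python
import numpy as np

def dist_matrix_with_diagonal(points, codebook, p=2.0):
    """Hypothetical helper: returns an (n, k+1) matrix whose last column
    is the distance of each point to the diagonal."""
    points = np.asarray(points, dtype=float).reshape(-1, 2)
    codebook = np.asarray(codebook, dtype=float).reshape(-1, 2)
    # Pairwise p-norm distances between diagram points and centroids.
    diff = points[:, None, :] - codebook[None, :, :]
    d_centroids = np.linalg.norm(diff, ord=p, axis=2)
    # Distance from (b, d) to its orthogonal projection ((b+d)/2, (b+d)/2).
    half_persistence = np.abs(points[:, 1] - points[:, 0]) / 2.0
    d_diag = half_persistence * 2.0 ** (1.0 / p)
    return np.hstack([d_centroids, d_diag[:, None]])

points = [[0.0, 1.0], [0.0, 0.1]]   # one salient point, one near the diagonal
codebook = [[0.0, 1.1]]             # k = 1 centroid
M = dist_matrix_with_diagonal(points, codebook)
labels = np.argmin(M, axis=1)       # label == k means "assigned to the diagonal"
# labels is [0, 1]: the second point goes to the diagonal and moves no centroid.
```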
Co-authored-by: Marc Glisse <marc.glisse@inria.fr>
I just realized that I never managed to do the last requested modifications (my local build was broken for some reason at that time). PS: and one day later I realize that I forgot to post this comment... 😴
mglisse left a comment
The algorithm is presented as an online algorithm. So it should be normal to give it some data, look at the codebook at that point, pass it more data, look at the updated codebook, etc. The init parameter could be used towards that goal, but the number of diagrams (or batches) already processed is forgotten, and indeed t (which sets the learning rate) is reset to 0 at every call.
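One possible shape for this, as a sketch only: a small stateful object that keeps both the codebook and the step counter between calls. The class name `OnlineCodebook`, the method `partial_fit`, and the exact gradient (centroid minus mean of assigned points) are assumptions for illustration, not the PR's API; only the `grad / (t + 1)` step mirrors the quoted code.

```python
import numpy as np

class OnlineCodebook:
    """Hypothetical stateful wrapper: the counter t survives between calls."""

    def __init__(self, c0):
        self.codebook = np.array(c0, dtype=float).reshape(-1, 2)
        self.t = 0  # number of batches processed so far

    def partial_fit(self, batch):
        batch = np.asarray(batch, dtype=float).reshape(-1, 2)
        k = len(self.codebook)
        # Distances to the k centroids, plus one column for the diagonal.
        d_c = np.linalg.norm(batch[:, None, :] - self.codebook[None, :, :], axis=2)
        d_diag = np.abs(batch[:, 1] - batch[:, 0]) / np.sqrt(2.0)
        labels = np.argmin(np.hstack([d_c, d_diag[:, None]]), axis=1)
        for j in range(k):
            assigned = batch[labels == j]
            if len(assigned):
                # Assumed gradient: centroid minus mean of its assigned points.
                grad = self.codebook[j] - assigned.mean(axis=0)
                self.codebook[j] -= grad / (self.t + 1)
        self.t += 1  # the learning rate keeps decreasing across calls
        return self

q = OnlineCodebook([[0.0, 1.0]])
q.partial_fit([[0.0, 2.0]])   # inspect q.codebook here if desired
q.partial_fit([[0.0, 4.0]])   # continues with step 1/2 instead of restarting at 1
```

With this layout the user can look at `q.codebook` after any call and feed more diagrams later without losing the learning-rate schedule.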
| (the two loops generating the tori). | ||
| .. figure:: | ||
| ./img/quantiz.gif |
On the one hand, the GIF is cool. On the other hand, I have trouble reading the doc with that thing moving on my screen...
| if withdiag: | ||
| a = np.argmin(M[:-1, :], axis=1) | ||
| else: | ||
| a = np.argmin(M[:-1, :-1], axis=1) |
It feels a bit strange to call _build_dist_matrix, whose main difference with cdist is that it adds the diagonal, just to drop the diagonal immediately... But I don't think it really matters.
| X_batch = np.concatenate(list_of_non_empty_diags) | ||
| return X_batch | ||
| else: | ||
| return np.array([]) |
It is sometimes useful to force the shape of empty arrays, to (0,2) for instance. I don't know if that's the case here.
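For illustration, a short sketch of what forcing the shape buys. `concat_diagrams` is a hypothetical stand-in for the batching helper quoted above, not the PR's function; whether the PR actually needs this is left open, as in the comment.

```python
import numpy as np

def concat_diagrams(diags):
    """Hypothetical batching helper that always returns an (n, 2) array."""
    arrays = [np.asarray(d, dtype=float).reshape(-1, 2) for d in diags]
    if not arrays:
        return np.empty((0, 2))  # explicit shape instead of np.array([])
    return np.concatenate(arrays)

empty_batch = concat_diagrams([])
births = empty_batch[:, 0]  # works and is simply empty

# By contrast, np.array([]) has shape (0,), so column indexing fails.
try:
    np.array([])[:, 0]
    column_indexing_failed = False
except IndexError:
    column_indexing_failed = True
```

Downstream code that slices columns or computes pairwise distances can then treat the empty batch like any other batch.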
| :param internal_p: Ground metric to assess centroid affectation. Default is ``2.``. | ||
| :type internal_p: ``float`` | ||
| :returns: The final codebook obtained after going through the whole ``pdiagset``.
| def _init_c(pdiagset, k, internal_p=2): | ||
| """ | ||
| A naive heuristic to initialize a codebook: we take the k points with largest distances to the diagonal |
What if the first diagram has fewer than k points?
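One way to handle that case, sketched under assumptions: pool points from successive diagrams until at least k are available, and fail loudly otherwise. `init_codebook` is a hypothetical replacement, and ranking by persistence (death minus birth) is used here as a proxy for distance to the diagonal, which gives the same ordering.

```python
import numpy as np

def init_codebook(pdiagset, k):
    """Hypothetical initializer: the k most persistent points among the
    first diagrams that together contain at least k points."""
    pooled = np.empty((0, 2))
    for diag in pdiagset:
        diag = np.asarray(diag, dtype=float).reshape(-1, 2)
        pooled = np.concatenate([pooled, diag])
        if len(pooled) >= k:
            break
    if len(pooled) < k:
        raise ValueError(
            "pdiagset contains only %d points, cannot initialize %d centroids"
            % (len(pooled), k)
        )
    persistence = pooled[:, 1] - pooled[:, 0]
    top = np.argsort(persistence)[::-1][:k]  # indices of the k largest
    return pooled[top]

# The first diagram has a single point, so the second one is pooled as well.
c0 = init_codebook([[[0, 1]], [[0, 3], [0, 2]]], k=2)
```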
| :param batch_size: Size of batches used during the online exploration of the ``pdiagset``. | ||
| Default is ``1``. |
As a user, should I stick to the default value of 1? If I already have all the diagrams, I may think that I don't need an online algorithm, which is for when data appears progressively, and consider using one huge batch under the impression that it disables the "online" stuff and gets the best result.
| # stochastic-gradient-descent like approach (decreasing learning rate). | ||
| c_current[j] = c_current[j] - grad / (t + 1) | ||
| else: | ||
| raise NotImplemented('Order = %s is not available yet. Only order=2. is valid' %order) |
I think you could error out earlier (or not provide this option at all and just say that it is W2).
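A sketch of erroring out earlier, assuming the option is kept: validate `order` once at the top of the public function, before any diagram is processed. `check_order` is a hypothetical helper name. Note also that `NotImplemented` is a constant, not an exception class, so calling it as in the quoted line fails with a `TypeError` instead of the intended message; `NotImplementedError` is the exception to raise.

```python
def check_order(order):
    """Hypothetical early check, to be called before the main loop starts."""
    if order != 2:
        raise NotImplementedError(
            "Order = %s is not available yet. Only order=2. is valid" % order
        )

check_order(2.0)  # passes silently

try:
    check_order(1)
    rejected = False
except NotImplementedError:
    rejected = True
```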
Provide a quantization algorithm to "summarize" a collection of persistence diagrams.
(At least) one thing that may be discussed:
The code currently sits in the python/gudhi/wasserstein/ repo, because it is of a "Wasserstein metric" flavor (we minimize something in terms of the Wasserstein distance between persistence diagrams). However, it does not rely on POT as other functions in this repo do; we never actually need to compute a Wasserstein distance/matching explicitly. Perhaps it would belong directly in the gudhi/ repo?
Also TODO:
quantization.py: is the copyright correct?