[GitHub Request] Create new jVector KNN plugin repo #291
Hi @sam-herman, could you help complete the questions below? We are still in the transition to the Linux Foundation. Thanks.
Hi @peterzhuamazon, those are all documented in the repo to be moved (in the MAINTAINERS.md, LICENSE.md, README.md files, etc.). For convenience I will repeat them here:
KNN-jVector
ALV2
Please see details above in this ticket under the
Provide a better alternative for customers who wish to perform KNN/ANN search.
We are planning to move the general KNN/ANN functionality, with Lucene as the default engine, to core; see opensearch-project/OpenSearch#17338.
A plugin separate from the existing KNN plugin would provide the benefits described earlier; we can't achieve that if both live in the same plugin.
Source code is here; it is reviewed and maintained by company maintainers: https://github.com/sam-herman/opensearch-jvector-plugin/blob/main/MAINTAINERS.md
jVector library - ALV2 license
NA
GitHub issues, contributors, and maintainers
The plugin will be directly supported by those maintainers: MAINTAINERS.md
To be performed by the maintainers mentioned above.
The first release is targeted for somewhere in 3.x, after the KNN changes land in core.
https://github.com/sam-herman/opensearch-jvector-plugin/blob/main/MAINTAINERS.md
@sam-herman, a question please: the OpenSearch KNN plugin (https://github.com/Opensearch-project/k-nn) supports FAISS and NMSLIB as specialized vector backends, so I would foresee JVector becoming another option there. Why are we introducing a plugin that is intended (in principle) to do the same thing? Or am I missing something? Thank you!
@reta see the response to the questions above:
and this one:
Thanks @sam-herman,
Are there technical or organizational blockers? I have nothing against having as many plugins as we possibly can, but it looks to me like that would disperse the precious maintainers, not unite them. I believe that, as an open source community, organizational constraints should cease to exist; the goal of building the best plugin offering should be the motivating factor. Again, this is just my opinion. @navneet1v @VijayanB @vamshin @jmazanec15, do you folks want to chime in?
👍
@reta, @peterzhuamazon, sharing some thoughts here to provide a bit of background.

The initial integration of JVector that was suggested was into the k-NN plugin as a separate module, vended as an optional engine. This was to ensure that opensearch-project does not just become a ground for multiple engines, and that as a community we vet any new engine before we start to maintain it. All of this discussion happened here: opensearch-project/k-NN#2386.

The other idea, which @sam-herman suggested, was to move the core Vector interfaces to OpenSearch core (GH issue: opensearch-project/OpenSearch#17338), which was also discussed earlier in opensearch-project/k-NN#1467 (comment). We agreed that moving the interfaces to core is a good idea, as mentioned and acknowledged here: opensearch-project/OpenSearch#17338 (comment).

The concern with moving the interfaces to core was to make sure that BWC is maintained, since vectors have become a big use case in OpenSearch. Apart from BWC, plugins like neural-search, flow-framework, skills, etc. take a dependency on the k-NN plugin, so if we just flip the switch the distribution will start to break. This is why I suggested an alternative path of first doing the refactoring in the k-NN plugin (opensearch-project/k-NN#2386 (comment)), just to ensure that the interfaces and BWC are maintained and no distribution breaks. I understand this takes some extra effort, but to me it is the safest path. I am open to suggestions here.

Once the interfaces move to core, as a community we can see how we want to break the engines into different repos, plugins, or core.

Option 1:
Option 2:
Option 3:
The short answer is that it's both things: opensearch-project/k-NN#1467 (comment). Also, with the decision to move KNN to core, the idea of one bloated plugin as the gateway is no longer applicable, and it creates a lot of difficulties from a maintainability perspective. There is more context in the links above and in the summary provided earlier in this issue description.
I don't see any other way than this one at the moment. This plugin is at least 100,000 lines of code smaller than the KNN plugin, without native dependencies we don't need, and with none of the cross-platform build issues we encountered while trying to merge into the existing plugin. Moreover, adding our engine as an optional one means it can break without any build failures, which would break our customers; that is out of the question for us and for them. This also raises the question of a level playing field, when the maintainers of one plugin have the final say over which KNN extensions are allowed for an entire project that is supposed to welcome contributions with added value for different customers in the community.
Ok, cool, thanks folks. It looks to me like you are more or less aligned on the vision; definitely fine with me. Thank you!
Hi @sam-herman, thanks for initiating the request for the JVector plugin. I want to double-check: will JVector be an optional plugin (not part of the standard bundle)?
@Pallavi-AWS, I don't see this as a viable option at the moment. I summarized it in the various threads earlier; I will reiterate the core issues mentioned in those threads:
The attempt to fully integrate jVector as another first-class engine (not a breakable optional one) into the KNN plugin, as originally submitted (and rejected by the KNN plugin maintainers) in this PR, would have made more sense, since it would have helped alleviate concern 1. It might still have left us with concerns 2 and 3, though perhaps at a lesser level. Once the decision was made to move the KNN facade and field mappings to core (opensearch-project/OpenSearch#17338), creating a new jVector plugin seems to me the most reasonable and lowest-resistance option, for the following reasons:
EDIT:
Thanks @sam-herman for the clarification. We do need some discussion on the best model of extensibility for KNN (via the KNN plugin vs. a standalone plugin), as this will set the precedent for future integrations with other engines. KNN maintainers @navneet1v and @luyuncheng will open an issue for the pros/cons discussion. Separately, we need to close on the process for creating new repos after the move to LF, to make sure we are covered on legal reviews, security, etc. @peterzhuamazon and @reta will open a separate issue on the new repo creation process.
@Pallavi-AWS, this was already discussed in the core issue and in the last TSC meeting, with a clear vote to move basic KNN functionality to core (with your vote as well).
Thanks @Pallavi-AWS, I think @peterzhuamazon already did that in #236?
@sam-herman, I think @Pallavi-AWS has a point here. We have a number of plugins that took exactly the same route ("hey, here is a new plugin, please add it, we will look after it" ... abandoned and unmaintained), and she wants reassurance that it is different this time (me too, btw, but my involvement is minuscule as of today).
@reta, we have plans to move KNN functionality to core, which means all new engines will eventually extend core as separate plugins. We mentioned that this plugin will not be included in the release until the work to extend core is complete, which means none of the existing plugins in the release will be impacted.
I am 💯 fine with that, but it does not look like these plugins will disappear - they will be thinner, yes, but still around?
My concerns are addressed :-)
@sam-herman, we have consensus that we want to move the KNN interfaces to core; the key question is whether separate engines extend core as separate plugins or get encapsulated in one plugin. Multiple plugins have become an overhead from a release and maintenance perspective. Can we play out what will happen when you decide to make this new plugin part of the standard release? That will help us decide whether a separate plugin is viable.
@sam-herman, adding some thoughts on top of what @Pallavi-AWS has mentioned. As long as the new jVector plugin is not part of the release, disruption to the release will be kept to a minimum. But the migration path should be discussed before changes in core start to flow through; I added some options here (a few weeks back): opensearch-project/k-NN#2386 (comment). When we want to make this part of the main distribution is when we at least need to think it through.
As mentioned here, I, along with @luyuncheng and the other maintainers @jmazanec15 and @vamshin, who are pretty seasoned with k-NN and vector search, can start putting up a proposal on what the route should be for new engines to become part of the default distribution. The reason for this is:
But if we never want to make the jVector plugin part of the main distribution, then all of the above points are moot. Hence, if you agree to work with the maintainers of k-NN to help decide the best path for adding new engines to OpenSearch, that would be truly awesome. Below are some of the options I laid out to the best of my knowledge and ability. I can expand them in a GH issue so we can start the conversation from there, and you can add other options too. Along with this we can add criteria for adding engines; I see Lucene does something similar.
Hi @reta, the completion of #236 set up a dashboard and SOP, acting as an open communication channel between the community, GitHub admins, and the Linux Foundation. Some requests, such as maintainer additions, still need to go through the nomination/voting processes. What we are referring to here is similar: we have not yet established a repository creation process, not only for GitHub repos but also for repos on platforms like DockerHub, npm, and PyPI. While the request issue can serve as the initial application for a new repo, there needs to be a structured review and feedback process before final creation. I will be sharing a draft proposal of the repo creation process soon, and we plan to get it reviewed with the community so that the process is clear, transparent, and easy to follow. Please let me know your thoughts. Thanks!
@Pallavi-AWS, the whole point of adding the KNN interfaces and the Lucene engine to core is to avoid having to depend on a plugin with native dependencies in order to extend KNN functionality with new engines. Otherwise there is no point in doing it.
@navneet1v, I'll reiterate that option 2 is the one I am targeting. As I mentioned, in my view the other options are not going to work well.
There is consensus to move KNN functionality to core; hence a separate plugin is being submitted here.
@navneet1v, I'm OK with not being included in the default release for now, but after Lucene KNN is moved to core there should be a discussion about which engines should be included in the default release and which should not. I wouldn't assume that FAISS and NMSLIB should be in the default release either. We can work together on establishing criteria and decide on that.
A couple comments:
@jmazanec15, the difference is that when a change is pushed to the jVector plugin, the plugin itself won't break.
@jmazanec15, if you can create and publish a lightweight library containing only the shared interfaces while we take the time to think through the transition to core, that would be great (I removed a lot of code and will be very glad to remove even more from the jVector plugin ;) ).
Hi All, I have proposed a standardized repo creation process here: Please take a look and let me know what you think. Thanks.
I see. I'm not sure what optional would mean in this case. My assumption would be that it'd be similar to an optional plugin, where it's compiled and just needs to be enabled via a command or an API. Regardless, I just wanted to make sure that with the new-plugin approach it's understood (which I think it is) that both won't be able to be installed at the same time, so users will need to use the min distribution or uninstall the knn plugin from the distribution in order to install the jvector plugin, until the interfaces are moved to core or some extensibility model is created.
A library is interesting, but I think we'd first need to make the engine extensible before we can do this. For instance, the KNNVectorFieldMapper class is tied directly to engines, and it defines the user-provided parameters via ParametrizedFieldMapper. For the query it is a bit easier, because parsing is already decoupled (see neural-search, which does not depend on native code and takes a dependency on the query parser). Anyway, I don't think the interfaces will change drastically, but until they are moved to core, the shared API interfaces (not the code, just the spec) need to be defined in one place (the knn plugin) so that the future migration to core does not run into issues. If the interfaces branch, it will create a very confusing experience for users.
I have absolutely no desire to branch the interfaces. But at the same time we can't be left with no reasonable path for extensibility. So the only option I see is to include some of the interface code in jVector until it can be inherited from somewhere else (core, or a library with minimal dependencies).
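(Editorial aside.) To make the shape of such a shared contract a bit more concrete, below is a purely illustrative sketch of what a minimal engine-facing interface could look like if it lived in core or in a lightweight shared library. None of these types exist today; the names (VectorIndexEngine, VectorIndexWriter, VectorIndexReader) are hypothetical and deliberately ignore the field mapping, query parsing, and serialization details that the real k-NN interfaces would have to cover.

```java
// Hypothetical sketch only: the kind of minimal, dependency-free contract an
// engine plugin (jVector, FAISS, Lucene, ...) could implement if the shared
// interfaces lived in core or in a small library. All names are invented.
import java.io.Closeable;
import java.io.IOException;
import java.util.Map;

/** Describes one ANN engine implementation to the vector field infrastructure. */
public interface VectorIndexEngine {

    /** Unique engine name as it would appear in the index mapping (e.g. "jvector"). */
    String name();

    /** Validates engine-specific mapping parameters (m, ef_construction, ...). */
    void validateParameters(Map<String, Object> parameters);

    /** Creates a writer that builds an index for one field/segment. */
    VectorIndexWriter createWriter(int dimension, Map<String, Object> parameters) throws IOException;
}

/** Incrementally receives vectors and produces a searchable index. */
interface VectorIndexWriter extends Closeable {

    void addVector(int docId, float[] vector) throws IOException;

    /** Flushes the index and returns a handle that can serve k-NN queries. */
    VectorIndexReader finish() throws IOException;
}

/** Read side: answers approximate top-k queries against a built index. */
interface VectorIndexReader extends Closeable {

    /** Returns the doc IDs of the (approximate) k nearest neighbors of the query vector. */
    int[] search(float[] query, int k) throws IOException;
}
```

The point is only that the spec can stay in one place, as @jmazanec15 suggests, while each engine plugin supplies its own implementation.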
Sounds good. In terms of extensibility we might need a separate discussion; it seems there are a couple of different ways to go. But as long as the interfaces don't branch, we should be able to move the code to core and extend it in the future with minimal issues (hopefully, haha).
What is the type of request?
Repository Management
Details of the request
Overview
I would like to move DataStax's jVector KNN plugin from this repo over to the OpenSearch project repo.
Please call the repo: KNN-jVector
Maintainers: the repo maintainers (currently 4) are listed here; the jVector dependency has about 28 contributors listed here.
Security response: In the event of a security issue, the maintainers of the repo will be responsible for addressing it, whether it concerns the repo itself or one of its dependencies, such as the jVector library.
See below for the properties and features that make the jVector plugin appealing and of added value to the OpenSearch community, for ANN/KNN use cases in particular.
High Level
Unique Features
Additional information to support your request
Sample JMH Benchmarks for engine outputs
Important note: JMH numbers are qualitative and relative and should not be treated as "globally consistent".
In other words, the numbers below only illustrate the relative ratio of the performance difference; while the absolute values may vary across systems, the ratios should remain consistent.
RandomVectors:
sift-128-euclidean:
The numbers above were collected in an environment where all data was cached in either JVM or operating system cache memory.
And we can already see a significant difference in performance!
When moving to a RAM-constrained environment, we expect to see a difference of several orders of magnitude.
For example, if Lucene does 100x the number of disk reads compared to jVector, and disk is 100x slower than RAM, then we can expect jVector to be 10,000x faster than Lucene in this scenario.
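To spell out the arithmetic behind that estimate (a back-of-the-envelope sketch, assuming query latency is dominated by page reads and that jVector's smaller working set is served from RAM while Lucene's reads go to disk):

```math
\frac{t_{\mathrm{Lucene}}}{t_{\mathrm{jVector}}} \approx \frac{N_{\mathrm{Lucene}}\, t_{\mathrm{disk}}}{N_{\mathrm{jVector}}\, t_{\mathrm{RAM}}} = \frac{N_{\mathrm{Lucene}}}{N_{\mathrm{jVector}}} \times \frac{t_{\mathrm{disk}}}{t_{\mathrm{RAM}}} = 100 \times 100 = 10{,}000
```

where N is the number of page reads per query and t is the latency per read. If both engines' reads hit the same storage tier, the advantage reduces to the read-count ratio alone (about 100x under these assumptions).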
When does this request need to be completed?
ASAP
Notes
Track the progress of your request here: https://github.com/orgs/opensearch-project/projects/208/views/33.
A member of @opensearch-project/admin will take a look at the request soon.
Thanks!