
[GitHub Request] Create new jVector KNN plugin repo #291

Open
sam-herman opened this issue Feb 25, 2025 · 27 comments

Comments

@sam-herman

sam-herman commented Feb 25, 2025

What is the type of request?

Repository Management

Details of the request

Overview

I would like to move DataStax's jVector KNN plugin from this repo over to the OpenSearch project repo.
Please call the repo: KNN-jVector

Maintainers: the repo's maintainers (currently 4) are listed here; the jVector dependency has about 28 contributors listed here.
Security response: in the event of a security issue, the repo maintainers will be responsible for addressing it, whether it affects the repo itself or one of its dependencies such as the jVector library.

Below are the properties and features that make the jVector plugin appealing, with added value for the OpenSearch community for ANN/KNN use cases in particular.

High Level

  • Scalable: run similarity search on billions of documents across thousands of dimensions without exceeding memory or choking on disk, thanks to DiskANN
  • Fast: blazing-fast pure-Java implementation with minimal overhead (see benchmarks)
  • Lightweight: pure Java, self-contained, builds in seconds; no native dependencies, complex flaky builds, or hundreds of thousands of additional lines of code you didn't ask for.

Unique Features

  • DiskANN - JVector is a pure Java implementation capable of performing ANN vector search in a way that is optimized for RAM-bound environments with minimal additional overhead. No native dependencies (FAISS) or cumbersome JNI mechanisms are involved.
  • Thread safety - JVector is a thread-safe index that supports concurrent modifications and inserts with near-perfect scalability as you add cores; Lucene is not thread-safe. This allows ingesting a much higher volume of vectors much faster, without relying on unnecessary merge operations to parallelize ingestion.
  • Quantized index construction - JVector can build the index with quantized vectors, saving memory; saved memory = larger segments = fewer segments = faster searches.
  • Quantized DiskANN - JVector supports DiskANN-style quantization with rerank. It is easy (in principle) to demonstrate that this makes a massive performance difference for larger-than-memory indexes (in practice it takes days or weeks to insert enough vectors into Lucene to show this, because of the single-threaded problem; that is the only hard part).
  • PQ and BQ support - As part of the quantization support above, JVector supports PQ as well as the BQ that Lucene offers. This is fairly rare (pgvector doesn't do PQ either) because (1) the code required for high-performance ADC with SIMD is somewhat involved, and (2) it requires a separate codebook, which Lucene isn't set up to easily accommodate. PQ at 64x compression gives you higher relevance than BQ at 32x.
  • Fused ADC - Features nobody else has, such as Fused ADC, NVQ, and anisotropic PQ.
  • Compatibility - JVector is compatible with Cassandra, which makes it easier to transfer vector-encoded data between Cassandra and OpenSearch.
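To make the PQ bullet above concrete, here is a minimal toy product quantizer in Java. This is an illustrative sketch only, not jVector's actual implementation or API; the class name, codebook shapes, and values are invented for the example. A real PQ implementation trains the codebooks with k-means and computes distances with SIMD-friendly ADC lookups.

```java
import java.util.Arrays;

/** Toy product quantizer: splits a vector into M subvectors and stores one code per subspace. */
public class ToyPQ {
    final int M, K, subDim;
    final float[][][] codebooks; // [subspace][centroid][component]

    ToyPQ(float[][][] codebooks) {
        this.M = codebooks.length;
        this.K = codebooks[0].length;
        this.subDim = codebooks[0][0].length;
        this.codebooks = codebooks;
    }

    /** Encode a vector of length M*subDim into M byte codes (nearest centroid per subspace). */
    byte[] encode(float[] v) {
        byte[] codes = new byte[M];
        for (int m = 0; m < M; m++) {
            int best = 0;
            float bestDist = Float.MAX_VALUE;
            for (int k = 0; k < K; k++) {
                float d = 0;
                for (int j = 0; j < subDim; j++) {
                    float diff = v[m * subDim + j] - codebooks[m][k][j];
                    d += diff * diff;
                }
                if (d < bestDist) { bestDist = d; best = k; }
            }
            codes[m] = (byte) best;
        }
        return codes;
    }

    /** Reconstruct an approximate vector by concatenating the coded centroids. */
    float[] decode(byte[] codes) {
        float[] out = new float[M * subDim];
        for (int m = 0; m < M; m++)
            System.arraycopy(codebooks[m][codes[m] & 0xFF], 0, out, m * subDim, subDim);
        return out;
    }

    public static void main(String[] args) {
        // 2 subspaces x 2 centroids x 2 components: a 4-dim float vector (16 bytes) stored in 2 bytes.
        float[][][] cb = {{{0f, 0f}, {1f, 1f}}, {{0f, 1f}, {1f, 0f}}};
        ToyPQ pq = new ToyPQ(cb);
        byte[] codes = pq.encode(new float[]{0.9f, 1.1f, 0.1f, 0.8f});
        System.out.println(Arrays.toString(codes));            // [1, 0]
        System.out.println(Arrays.toString(pq.decode(codes))); // [1.0, 1.0, 0.0, 1.0]
    }
}
```

The compression ratio scales with subvector length and codebook size, which is how PQ reaches the 64x figure mentioned above; the separate codebook this requires is exactly the piece Lucene isn't set up to accommodate.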

Additional information to support your request

Sample JMH Benchmarks for engine outputs
Important note: JMH numbers are qualitative and relative, and should not be treated as "globally consistent".
In other words, the numbers below only illustrate relative performance ratios; while the absolute values may vary across systems, the ratios should remain roughly constant.

RandomVectors:

Benchmark                                               (codecType)  (dimension)  (numDocs)  Mode  Cnt  Score   Error  Units
FormatBenchmarkRandomVectors.benchmarkSearch  jvector_not_quantized          128       1000  avgt    5  0.146 ± 0.002  ms/op
FormatBenchmarkRandomVectors.benchmarkSearch  jvector_not_quantized          128      10000  avgt    5  0.332 ± 0.003  ms/op
FormatBenchmarkRandomVectors.benchmarkSearch  jvector_not_quantized          128     100000  avgt    5  0.451 ± 0.004  ms/op
FormatBenchmarkRandomVectors.benchmarkSearch      jvector_quantized          128       1000  avgt    5  0.147 ± 0.001  ms/op
FormatBenchmarkRandomVectors.benchmarkSearch      jvector_quantized          128      10000  avgt    5  0.181 ± 0.002  ms/op
FormatBenchmarkRandomVectors.benchmarkSearch      jvector_quantized          128     100000  avgt    5  0.194 ± 0.002  ms/op
FormatBenchmarkRandomVectors.benchmarkSearch              Lucene101          128       1000  avgt    5  0.707 ± 0.016  ms/op
FormatBenchmarkRandomVectors.benchmarkSearch              Lucene101          128      10000  avgt    5  1.578 ± 0.022  ms/op
FormatBenchmarkRandomVectors.benchmarkSearch              Lucene101          128     100000  avgt    5  2.156 ± 0.080  ms/op
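
As a sanity check on the "ratios" note above, the relative speedups can be computed directly from the scores in this table (a small sketch; the values are simply copied from the JMH output above):

```java
public class BenchmarkRatios {
    public static void main(String[] args) {
        // avg search latencies in ms/op from the JMH table, for numDocs = 1000, 10000, 100000
        double[] lucene = {0.707, 1.578, 2.156};
        double[] jvectorQuantized = {0.147, 0.181, 0.194};
        int[] numDocs = {1000, 10000, 100000};
        for (int i = 0; i < lucene.length; i++) {
            double speedup = lucene[i] / jvectorQuantized[i];
            // prints roughly 4.8x, 8.7x, 11.1x
            System.out.printf("numDocs=%d: quantized jVector is %.1fx faster%n", numDocs[i], speedup);
        }
    }
}
```

Note that the gap widens as numDocs grows, which is the pattern the visualization below makes visible.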

Visualization:
Benchmark: FormatBenchmarkRandomVectors.benchmarkSearch (dimension: 128)
Grouped by Number of Documents (numDocs)
Scaling: max bar width corresponds to 2.156 ms/op → 50 characters
--------------------------------------------------------------------------------
NumDocs: 1000
------------------------------------------------------------
Codec Type             Score (ms/op)    Visualization
------------------------------------------------------------
jvector_not_quantized  0.146 ms/op      |███
jvector_quantized      0.147 ms/op      |███
Lucene101              0.707 ms/op      |████████████████

NumDocs: 10000
------------------------------------------------------------
Codec Type             Score (ms/op)    Visualization
------------------------------------------------------------
jvector_not_quantized  0.332 ms/op      |████████
jvector_quantized      0.181 ms/op      |████
Lucene101              1.578 ms/op      |█████████████████████████████████████

NumDocs: 100000
------------------------------------------------------------
Codec Type             Score (ms/op)    Visualization
------------------------------------------------------------
jvector_not_quantized  0.451 ms/op      |██████████
jvector_quantized      0.194 ms/op      |█████
Lucene101              2.156 ms/op      |██████████████████████████████████████████████████
--------------------------------------------------------------------------------

sift-128-euclidean:

Benchmark                                                   (codecType)            (datasetName)  Mode  Cnt  Score   Error  Units
FormatBenchmarkWithKnownDatasets.benchmarkSearch  jvector_not_quantized  sift-128-euclidean.hdf5  avgt    4  0.292 ± 0.002  ms/op
FormatBenchmarkWithKnownDatasets.benchmarkSearch              Lucene101  sift-128-euclidean.hdf5  avgt    4  1.160 ± 0.015  ms/op

JMH Benchmark Results
Benchmark: FormatBenchmarkWithKnownDatasets.benchmarkSearch
Dataset: sift-128-euclidean.hdf5

Visualization:
Codec Type            Avg Time (ms/op)   Visualization
---------------------------------------------------------------
jvector_not_quantized  0.292 ms/op       |█████████████                        
Lucene101              1.160 ms/op       |██████████████████████████████████████████████████
---------------------------------------------------------------

The numbers above were collected in an environment where all data was cached in either JVM or operating-system cache memory, and we can already see a significant difference in performance!

When moving to a RAM-constrained environment, we expect to see a difference of several orders of magnitude.
For example, if Lucene does 100x the number of disk reads compared to JVector, and the disk is 100x slower than RAM, then we can expect JVector to be up to 10,000x faster than Lucene in this scenario.
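
The back-of-envelope model behind that 10,000x figure is just the product of the two ratios. A sketch, where both inputs are the assumptions stated above rather than measured values:

```java
public class DiskBoundSpeedupModel {
    public static void main(String[] args) {
        double extraDiskReads = 100.0;    // assumption: Lucene issues 100x more disk reads than JVector
        double diskVsRamSlowdown = 100.0; // assumption: a disk read is ~100x slower than a RAM access
        // If Lucene's query time is dominated by disk reads and JVector's by RAM access,
        // the expected speedup is roughly the product of the two factors.
        double expectedSpeedup = extraDiskReads * diskVsRamSlowdown;
        System.out.println(expectedSpeedup); // prints 10000.0
    }
}
```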

When does this request need to be completed?

ASAP

Notes

Track the progress of your request here: https://github.com/orgs/opensearch-project/projects/208/views/33.
A member of @opensearch-project/admin will take a look at the request soon.
Thanks!

@peterzhuamazon
Member

peterzhuamazon commented Feb 26, 2025

Hi @sam-herman ,

Could you help complete the questions below?

We are still transitioning to the Linux Foundation,
and we are currently planning to bring up a new process for opening new repos.

Thanks.

Repo Name/Project Name:  

 
Project license:  

 
Project description and business value:
 
What customer problem are we trying to solve with this repo? 

 
Should this project be in OpenSearch Core/OpenSearch Dashboard Core?  If no, why not?   

 
Why should we create a new repo at this time?  

 
Please also link to the feature review, the RFC/Feature brief in github and any PR/FAQs or other working backwards documents.


 
Source code and code review

 
Third-party code

 
Similar projects

 
Support Expectations
 
Who will be supporting this repo going forward?  

 
What is your plan (including staffing) to be responsive to the community (at a minimum, this should include reviewing PRs, responding to issues, and answering forum questions)?

 
Targeted OpenSearch Release

 
Maintainer access

@sam-herman
Author

sam-herman commented Feb 26, 2025

Hi @sam-herman ,

Could you help complete the questions below?

Hi @peterzhuamazon, those should all be documented in the repo to be moved (in the MAINTAINERS.md, LICENSE.md, README.md files, etc.). For convenience, I will repeat them here:

We are still transitioning to the Linux Foundation, and we are currently planning to bring up a new process for opening new repos.

Thanks.

Repo Name/Project Name:

KNN-jVector

Project license:

ALv2 (Apache License 2.0)

Project description and business value:

Please see the details above in this ticket, primarily under the Overview section, with this summary:

High Level
Scalable: run similarity search on billions of documents across thousands of dimensions without exceeding memory or choking on disk, thanks to DiskANN
Fast: blazing-fast pure-Java implementation with minimal overhead (see [benchmarks](https://github.com/opensearch-project/.github/issues/291#benchmarks))
Lightweight: pure Java, self-contained, builds in seconds; no native dependencies, complex flaky builds, or hundreds of thousands of additional lines of code you didn't ask for.

What customer problem are we trying to solve with this repo?

Provide a better alternative for customers who wish to perform KNN/ANN search.
See above for details.

Should this project be in OpenSearch Core/OpenSearch Dashboard Core? If no, why not?

We are planning to move the general KNN/ANN functionality, with Lucene as the default, to core; see opensearch-project/OpenSearch#17338.
Other libraries, such as FAISS in the existing KNN plugin or jVector in this plugin, would remain in plugins outside of core.

Why should we create a new repo at this time?

A plugin separate from the existing KNN plugin would provide the benefits described earlier; we can't achieve that if both live in the same plugin.
Moreover, the maintainers of the existing KNN plugin are bound by organizational constraints to restrict support to FAISS only (outside Lucene), which is not aligned with general community interest, for the reasons described earlier here.
A separate plugin would allow different organizations to focus on improving the extensibility of OpenSearch KNN, either by improving native support for FAISS or by focusing on a lightweight, leaner version that is self-contained.

Please also link to the feature review, the RFC/Feature brief in github and any PR/FAQs or other working backwards documents.

Source code and code review

Source code is here:
https://github.com/sam-herman/opensearch-jvector-plugin

Reviewed and maintained by company maintainers: https://github.com/sam-herman/opensearch-jvector-plugin/blob/main/MAINTAINERS.md

Third-party code

jVector library - ALv2 (Apache License 2.0)

Similar projects

NA

Support Expectations

GitHub issues, contributors, and maintainers

Who will be supporting this repo going forward?

The plugin will be directly supported by these maintainers: MAINTAINERS.md
The plugin's jVector dependency will be supported by the jVector team, which has about 28 contributors listed here

What is your plan (including staffing) to be responsive to the community (at a minimum, this should include reviewing PRs, responding to issues, answering forum questions?)

To be performed by the maintainers mentioned above.

Targeted OpenSearch Release

Somewhere in 3.x will be the targeted first release, after the KNN changes go into core.

Maintainer access

https://github.com/sam-herman/opensearch-jvector-plugin/blob/main/MAINTAINERS.md

@reta

reta commented Feb 26, 2025

@sam-herman a question, please: the OpenSearch KNN plugin (https://github.com/Opensearch-project/k-nn) supports FAISS and NMSLIB as specialized vector backends. I would foresee JVector becoming another option there? Why are we introducing a plugin that is intended (in principle) to do the same thing? Or am I missing something? Thank you!

@sam-herman
Author

@reta see the response to the questions above:

Why should we create a new repo at this time?

A plugin separate from the existing KNN plugin would provide the benefits described earlier; we can't achieve that if both live in the same plugin.
Moreover, the maintainers of the existing KNN plugin are bound by organizational constraints to restrict support to FAISS only (outside Lucene), which is not aligned with general community interest, for the reasons described earlier here.
A separate plugin would allow different organizations to focus on improving the extensibility of OpenSearch KNN, either by improving native support for FAISS or by focusing on a lightweight, leaner version that is self-contained.

and this one:

Should this project be in OpenSearch Core/OpenSearch Dashboard Core? If no, why not?

We are planning to move general functionality of KNN/ANN with Lucene as default to core, see opensearch-project/OpenSearch#17338
And have other libraries such as FAISS in the existing knn plugin or jVector as in this plugin remain in plugins outside of core.

@reta

reta commented Feb 27, 2025

Thanks @sam-herman ,

A plugin separate from the existing KNN plugin would provide the benefits described earlier; we can't achieve that if both live in the same plugin.

Are there technical or organizational blockers? I have nothing against having as many plugins as we possibly can, but it looks to me like that would disperse the precious maintainers, not unite them. I believe that as we are an open-source community, organizational constraints should cease to exist; the goal of building the best plugin offering should be the motivating factor. Again, this is just my opinion. @navneet1v @VijayanB @vamshin @jmazanec15 do you folks want to chime in?

We are planning to move general functionality of KNN/ANN with Lucene as default to core, see opensearch-project/OpenSearch#17338

👍

@navneet1v

@reta, @peterzhuamazon Sharing some thoughts here to provide a bit of background

The initial suggested integration of JVector was into the k-NN plugin as a separate module, vended out as an optional engine. This was to ensure that opensearch-project does not just become a dumping ground for multiple engines, and that as a community we vet any new engine before we start to maintain it. All this discussion happened here: opensearch-project/k-NN#2386

The other idea, which @sam-herman suggested, was to move the core Vector interfaces to OpenSearch core (here is the GH issue: opensearch-project/OpenSearch#17338), which was discussed earlier too: opensearch-project/k-NN#1467 (comment). We agreed that moving interfaces to core is a good idea, as mentioned and acknowledged here: opensearch-project/OpenSearch#17338 (comment)

The concern with moving the interface to core was to make sure that backward compatibility (BWC) is maintained, since vectors have become a big use case in OpenSearch. Apart from BWC, plugins like neural-search, flow-framework, skills, etc. take a dependency on the k-NN plugin, so if we just flip the switch the distribution will start to break.

This is where I suggested an alternative path of first doing the refactoring in the k-NN plugin: opensearch-project/k-NN#2386 (comment), just to ensure that interfaces and BWC are maintained and no distribution breaks. I understand this requires some extra effort, but to me it is the safest path. I am open to suggestions here.

Once the interfaces move to core, as a community we can decide how we want to split the engines across different repos, plugins, or core.

Option 1:
All Java-related engines stay in core, and non-Java engines live in another repo like k-NN.

Option 2:
OpenSearch core keeps only the Lucene engine and doesn't add any others. We keep separate plugins for different engines (k-NN keeps using Faiss, and jVector can be part of another plugin).

Option 3:
OpenSearch core keeps only the Lucene engine, and the k-NN repo starts hosting all other engines as modules in the k-NN plugin.

@sam-herman
Author

Thanks @sam-herman ,

A plugin separate from the existing KNN plugin would provide the benefits described earlier; we can't achieve that if both live in the same plugin.

Are there technical or organizational blockers? I have nothing against having as many plugins as we possibly can, but it looks to me like that would disperse the precious maintainers, not unite them. I believe that as we are an open-source community, organizational constraints should cease to exist; the goal of building the best plugin offering should be the motivating factor. Again, this is just my opinion. @navneet1v @VijayanB @vamshin @jmazanec15 do you folks want to chime in?

We are planning to move general functionality of KNN/ANN with Lucene as default to core, see opensearch-project/OpenSearch#17338

👍

The short answer is that it's both.
The long answer, with the full discussion threads, is here:
opensearch-project/OpenSearch#17338

opensearch-project/k-NN#1467 (comment)

opensearch-project/k-NN#2386

Also, with the decision to move KNN to core, the idea of one bloated plugin as the gateway is no longer applicable and creates a lot of difficulties from a maintainability perspective. More context is in the links above and in the summary provided earlier in this issue's description.

@sam-herman
Author

sam-herman commented Feb 27, 2025

@reta, @peterzhuamazon Sharing some thoughts here to provide a bit of background

The initial suggested integration of JVector was into the k-NN plugin as a separate module, vended out as an optional engine. This was to ensure that opensearch-project does not just become a dumping ground for multiple engines, and that as a community we vet any new engine before we start to maintain it. All this discussion happened here: opensearch-project/k-NN#2386

The other idea, which @sam-herman suggested, was to move the core Vector interfaces to OpenSearch core (here is the GH issue: opensearch-project/OpenSearch#17338), which was discussed earlier too: opensearch-project/k-NN#1467 (comment). We agreed that moving interfaces to core is a good idea, as mentioned and acknowledged here: opensearch-project/OpenSearch#17338 (comment)

The concern with moving the interface to core was to make sure that backward compatibility (BWC) is maintained, since vectors have become a big use case in OpenSearch. Apart from BWC, plugins like neural-search, flow-framework, skills, etc. take a dependency on the k-NN plugin, so if we just flip the switch the distribution will start to break.

This is where I suggested an alternative path of first doing the refactoring in the k-NN plugin: opensearch-project/k-NN#2386 (comment), just to ensure that interfaces and BWC are maintained and no distribution breaks. I understand this requires some extra effort, but to me it is the safest path. I am open to suggestions here.

Once the interfaces move to core, as a community we can decide how we want to split the engines across different repos, plugins, or core.

Option 1: All Java-related engines stay in core, and non-Java engines live in another repo like k-NN.

Option 2: OpenSearch core keeps only the Lucene engine and doesn't add any others. We keep separate plugins for different engines (k-NN keeps using Faiss, and jVector can be part of another plugin).

Option 3: OpenSearch core keeps only the Lucene engine, and the k-NN repo starts hosting all other engines as modules in the k-NN plugin.

I don’t see any other way than this one at the moment.
We are not going to inherit the technical debt of the existing KNN plugin into our service. That path is not feasible, and as a community we are better off making core better rather than inflating new engines with unnecessary dependencies on the existing KNN plugin.

This plugin is at least 100,000 lines of code smaller than the KNN plugin, without native dependencies we don't need, and with none of the cross-platform build issues we encountered while trying to merge with the existing plugin.

Moreover, adding our engine as optional means it will break without any build failures, and that will break our customers. This is out of the question for us and our customers.

This also raises the question of a level playing field when the maintainers of one plugin have the final say over which KNN extensions are allowed in a project that is supposed to welcome contributions with added value for different customers in the community.

@reta

reta commented Feb 27, 2025

Ok, cool, thanks folks, it looks to me you are more or less aligned on the vision, definitely fine with me, thank you

@Pallavi-AWS
Member

Hi @sam-herman - thanks for initiating the request for the JVector plugin. I want to double-check: will JVector be an optional plugin (not part of the standard bundle)?

@sam-herman
Author

sam-herman commented Feb 27, 2025

Hi @sam-herman - thanks for initiating the request for the JVector plugin. I want to double-check: will JVector be an optional plugin (not part of the standard bundle)?

@Pallavi-AWS I don't see this as a viable option at the moment. I summarized it in the various threads earlier:
opensearch-project/k-NN#2386 (comment)
opensearch-project/OpenSearch#17338 (comment)

I will reiterate the core issues mentioned in the above threads:

  1. jVector as optional - The KNN plugin already has many build failures that trickle into main, but at least some guardrails exist before it is released. Adding an optional jVector plugin that is not part of the release means it will break silently, without any guardrails, whenever changes are made to the KNN plugin. This creates additional overhead for our team and risks our brand, as things can easily break in each release without us ever making a single change.
  2. Maintainability - The jVector plugin is super lightweight and self-contained, with a minimal set of dependencies. This provides many operational and maintainability advantages. The jVector plugin alone is roughly 100,000 fewer lines of code than the KNN plugin. This allows our team to focus on core innovation rather than committing resources to convoluted dependencies we do not need.
  3. Build and portability - The KNN plugin as a dependency is not a viable option either.
    • It would bloat our build with many native dependencies we do not want or need.
    • It is not easily portable and can easily break on various platforms and architectures due to minor issues. Within 2 weeks we found at least 3 different build failures.
    • Building the native dependencies requires a lot of manual prep work and specially crafted environments. Besides being very different from the Lucene and OpenSearch (and jVector/C*) philosophy of portability and pure-Java implementation, it is another hurdle for the maintainers of the jVector KNN plugin, for dependencies we don't have.

The attempt to fully integrate into the KNN plugin as another equal engine (not a breakable optional one), as originally submitted (and rejected by the KNN plugin maintainers) in this PR, would have made more sense, as it would have helped alleviate concern 1. It might still have left us with concerns 2 and 3, but perhaps at a lesser level.

I think that once the decision was made to move the KNN facade and field mappings to core (opensearch-project/OpenSearch#17338), creating a new jVector plugin became the most reasonable, least-resistance option, for the following reasons:

  1. It solves all the major issues mentioned above and allows DataStax/IBM to be a more active contributor to innovation in OpenSearch, as opposed to wasting valuable resources on unnecessary legacy code and dependencies.
  2. It sets the stage for a long-term fix to the status of KNN in core and undoes the historic weirdness of it being left out of core despite being a first-class Lucene primitive (that was done deliberately by the original project owner to monetize it as part of X-Pack).
  3. It relieves the KNN plugin maintainers from being gatekeepers for new codecs and allows them to focus on core innovation for accelerated native libraries.

EDIT:
@Pallavi-AWS I am sorry, I just re-read your question and realized I missed answering it!
Yes, I think the jVector plugin should not go into the public release until core is fixed to include KNN functionality.
Hope that clarifies things, and sorry for providing a recap you didn't ask for :)

@Pallavi-AWS
Member

Pallavi-AWS commented Feb 27, 2025

Thanks @sam-herman for the clarification. We do need some discussion on the best model of extensibility for KNN (via the KNN plugin vs. a standalone plugin), as this will set the precedent for future integrations with other engines. KNN maintainers @navneet1v @luyuncheng will open an issue for the pros/cons discussion.

Separately, we do need to close on the process for creating new repos after the move to the LF, to make sure we are covered for legal reviews, security, etc. @peterzhuamazon @reta will open a separate issue on the new repo creation process.

@sam-herman
Author

sam-herman commented Feb 27, 2025

Thanks @sam-herman for the clarification. We do need some discussion on the best model of extensibility for KNN (via the KNN plugin vs. a standalone plugin), as this will set the precedent for future integrations with other engines. KNN maintainers @navneet1v @luyuncheng will open an issue for the pros/cons discussion.

Separately, we do need to close on the process for creating new repos after the move to the LF, to make sure we are covered for legal reviews, security, etc. @peterzhuamazon @reta will open a separate issue on the new repo creation process.

@Pallavi-AWS this was already discussed in the core issue and in the last TSC meeting, with a clear vote to move basic KNN functionality to core (with your vote as well).
Given that DataStax is going to maintain the jVector plugin and it is not part of the release, what concerns remain at the moment?
Let's wrap this up without exploding the number of issues already open; if there are any concerns with creating a new plugin that doesn't go into the release, let's articulate them here.
If there are concerns about future integrations or the timeline of work for changes in core, an orthogonal, separate issue with a work plan would be reasonable.

@reta

reta commented Feb 27, 2025

Separately, we do need to close the process for creation of new repos after the move to LF to make sure we are covered with legal reviews, security etc. @peterzhuamazon @reta will open a separate issue on the new repo creation process

Thanks @Pallavi-AWS , I think @peterzhuamazon already did that in #236 ?

@reta

reta commented Feb 27, 2025

Given the fact that Datastax is going to be maintaining jvector plugin and is not part of the release what are the concerns remaining at the moment?

@sam-herman I think @Pallavi-AWS has a point here: we have a number of plugins that took exactly the same route ("hey, here is a new plugin, please add it, we will watch after it" ... abandoned and unmaintained), and she wants reassurance that it is different this time (me too, btw, but my involvement is minuscule as of today).

@sam-herman
Author

Given the fact that Datastax is going to be maintaining jvector plugin and is not part of the release what are the concerns remaining at the moment?

@sam-herman I think @Pallavi-AWS has a point here, we have a number of the plugins that took exactly the same route "hey, here is new plugin, please add it, we will watch after ........ abandoned and unmaintained", and she wants to have reassurance it is different this time (me too btw but my involvement is minuscule as of today)

@reta we have plans to move KNN functionality to core, which means all new engines will eventually extend core as separate plugins. We mentioned that this will not be included in the release until the work to extend core is complete, which means none of the existing plugins in the release will be impacted.
Are there any specific things you are looking for?

@reta

reta commented Feb 28, 2025

@reta we have plans to move KNN functionality to core, which means all new engines will eventually extend core as separate plugins.

I am 💯 fine with that, but it does not look like these plugins will disappear - they will be thinner, yes, but still around?

Are there any specific things you are looking for?

My concerns are addressed :-)

@Pallavi-AWS
Member

Pallavi-AWS commented Feb 28, 2025

@sam-herman we have consensus that we want to move the KNN interfaces to core; the key question is whether separate engines extend core as separate plugins or get encapsulated in one plugin. Multiple plugins have become an overhead from a release and maintenance perspective. Can we play out what will happen if you decide to make this new plugin part of the standard release? This will help decide whether a separate plugin is viable.

@navneet1v

navneet1v commented Feb 28, 2025

@sam-herman adding some thoughts on top of what @Pallavi-AWS has mentioned. As long as the new jVector plugin is not part of the release, disruption to the release will be kept to a minimum. But the migration path should be discussed before changes in core start to flow through. I added some options here opensearch-project/k-NN#2386 (comment) (a few weeks back).

But when we want to make this part of the main distribution, that's when we should at least think about it.

Thanks @sam-herman for the clarification. We do need some discussion on the best model of extensibility for KNN (via the KNN plugin vs. a standalone plugin), as this will set the precedent for future integrations with other engines. KNN maintainers @navneet1v @luyuncheng will open an issue for the pros/cons discussion.

As mentioned here, I, along with @luyuncheng and the other maintainers @jmazanec15 and @vamshin, who are pretty seasoned with k-NN and vector search, can start putting up a proposal on what the route should be for new engines to become part of the default distribution. The reasons for this are:

  1. Having multiple engines and multiple k-NN algorithms (as part of the default distribution) creates confusion in the community and makes OpenSearch hard to use.
  2. Some of the core features (codec compatibility with zstd, etc.) and interfaces (query-level hyperparameters, filtering, directory support, memory management [both heap and off-heap], quantization techniques, vector datatypes like float, byte, binary, half-float, etc.) become hard to maintain and keep consistent. Just for context, we faced this issue with Nmslib, and after some 3-4 long years it is being marked as deprecated.
  3. The same problem arises for the opensearch-clients too.

But if you never want to make the jVector plugin part of the main distribution, then all of the above points are moot.

Hence, I think if you agree to work with the maintainers of k-NN to help decide the best path for adding new engines in OpenSearch, it will be truly awesome.

Below are some of the options I laid out to the best of my knowledge and ability. I can extend them in a GH issue and we can start the conversation from there, and you can add other options too. Along with this, we can add criteria for engine additions. I see Lucene also does something similar.

Once interfaces move to core, as a community we can see how we want to break the engines out into different repos, different plugins, or core itself.

Option 1: All Java-based engines stay in core, while non-Java engines live in another repo like k-NN.

Option 2: OpenSearch core keeps only the Lucene engine and doesn't add any other engine. We keep separate plugins for different engines (k-NN keeps using Faiss, and jVector can be part of another plugin).

Option 3: OpenSearch core keeps only the Lucene engine, and the k-NN repo starts hosting all other engines as modules in the k-NN plugin.

@peterzhuamazon
Member

Separately, we do need to close the process for creation of new repos after the move to LF to make sure we are covered with legal reviews, security, etc. @peterzhuamazon @reta will open a separate issue on the new repo creation process

Thanks @Pallavi-AWS, I think @peterzhuamazon already did that in #236?

Hi @reta ,

The completion of #236 set up a dashboard and SOP, acting as an open communication channel between the community, GitHub admins, and the Linux Foundation. Some requests, such as maintainer additions, still need to go through the nomination/voting processes.

What we are referring to here is similar: we have not yet established a repository creation process. Not only for GitHub repos, but also for repos on platforms like DockerHub, npm, and PyPI. While the request issue can serve as the initial application for a new repo, there needs to be a structured review and feedback process before final creation.

I will be sharing a draft proposal of the repo creation process soon. We plan to have it reviewed by the community so that the process is clear, transparent, and easy to follow.

Please let me know your thoughts.

Thanks!

@sam-herman
Author

sam-herman commented Feb 28, 2025

@sam-herman we have consensus that we want to move KNN interfaces to core; the key question is whether separate engines extend core as separate plugins or get encapsulated in one plugin.

@Pallavi-AWS The whole point of adding KNN interfaces and the Lucene engine to core is to avoid depending on a plugin with native dependencies in order to extend KNN functionality with new engines. Otherwise there is no point in doing it.

Option 2: OpenSearch core keeps only the Lucene engine and doesn't add any other engine. We keep separate plugins for different engines (k-NN keeps using Faiss, and jVector can be part of another plugin).

@navneet1v I'll reiterate that option 2 is the option I am targeting. As I mentioned, in my view the other options are not going to work well.

Hence, I think if you agree to work with the maintainers of k-NN to help decide the best path for adding new engines in OpenSearch, it will be truly awesome.

There is consensus to move KNN functionality to core; hence a separate plugin is submitted here.
I am very much willing, and think it's critical, to work together on the core changes and the migration plan so it won't break the release or other dependencies. But those, in my opinion, should be orthogonal efforts.

Having multiple engines and multiple k-NN algorithms (as part of the default distribution) creates confusion in the community and makes OpenSearch hard to use.

@navneet1v I'm OK with not including it in the default release for now, but after Lucene KNN is moved to core there should be a discussion about which engines should be included in the default release and which should not. I wouldn't assume that FAISS and NMSLIB should be in the default release either. We can work together on establishing criteria and decide on that.

@jmazanec15
Member

jmazanec15 commented Feb 28, 2025

A couple comments:

  1. Just want to confirm: both plugins cannot be installed at the same time in their present form due to duplicate naming of resources. I think this is understood here, but just want to confirm. The reason I'm bringing this up is because @sam-herman says "More so adding our engine as optional means it will break without any build failures and will break our customers.", but in its present state, and until the engine is made extendible, it will have to be optional and further cannot be installed with the distribution.
  2. Right now, the API interfaces (field types, query, APIs, etc.) are defined in the k-NN plugin. These are subject to change in the next couple of releases. For instance, in Make engine top level field mapping parameter k-NN#2534, we are thinking of moving this outside of method and into a top-level field. That being said, the interfaces cannot diverge, or else moving interfaces to core will be impossible. So, until we move those interfaces to core, API interfaces need to be defined in one place, which should be the k-NN plugin. I think the plan for APIs should be agreed upon before releasing, to ensure a compatible path forward.

@sam-herman
Author

A couple comments:

  1. Just want to confirm: both plugins cannot be installed at the same time in their present form due to duplicate naming of resources. I think this is understood here, but just want to confirm. The reason I'm bringing this up is because @sam-herman says "More so adding our engine as optional means it will break without any build failures and will break our customers.", but in its present state, and until the engine is made extendible, it will have to be optional and further cannot be installed with the distribution.

@jmazanec15 the difference is that when a change is pushed to the jVector plugin, the plugin itself won't break.
If it's an "optional" part of the k-NN plugin, it might not even compile when someone wants to try it out. It's a huge difference!
With jVector as a separate plugin, customers can opt in to try it, or try it on their standalone distribution.

  1. Right now, the API interfaces (field types, query, APIs, etc.) are defined in the k-NN plugin. These are subject to change in the next couple of releases. For instance, in Make engine top level field mapping parameter k-NN#2534, we are thinking of moving this outside of method and into a top-level field. That being said, the interfaces cannot diverge, or else moving interfaces to core will be impossible. So, until we move those interfaces to core, API interfaces need to be defined in one place, which should be the k-NN plugin. I think the plan for APIs should be agreed upon before releasing, to ensure a compatible path forward.

@jmazanec15 I think if you can create and publish a lightweight library that shares only the interfaces while we take time to think through the transition to core, that would be great (I removed a lot of code and will be very glad to remove even more from the jVector plugin ;) ).
But (and this is important) no other dependencies can be there, especially not native ones. If we can do that, I'm all in to take those into the jVector plugin right away.

@peterzhuamazon
Member

peterzhuamazon commented Feb 28, 2025

Hi All,

I have proposed a standardized repo creation process here: opensearch-project/technical-steering#21 (comment)

Please take a look and let me know what you think. Would love to see your feedback.

Thanks.

@jmazanec15
Member

@jmazanec15 the difference is that when a change is pushed to the jVector plugin, the plugin itself won't break.
If it's an "optional" part of the k-NN plugin, it might not even compile when someone wants to try it out. It's a huge difference!
With jVector as a separate plugin, customers can opt in to try it, or try it on their standalone distribution.

I see. I'm not sure what optional would mean in this case. My assumption would be that it'd be similar to an optional plugin, where it's compiled and just needs to be enabled via a command or an API. Regardless, just wanted to make sure that with the new plugin approach it's understood (which I think it is) that both won't be able to be installed at the same time, so users will need to use the min distribution, or uninstall the k-NN plugin from the distribution, in order to install the jVector plugin, until interfaces are moved to core or some extensibility model is created.
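To make the conflict concrete, swapping plugins on a standard distribution would look roughly like this. This is an illustrative sketch: the `opensearch-plugin` CLI and the `opensearch-knn` plugin id are real, but the jVector plugin artifact name and path are assumptions.

```shell
# From the OpenSearch installation directory.
# Remove the bundled k-NN plugin first; both plugins cannot coexist
# because they register duplicate resource names.
bin/opensearch-plugin remove opensearch-knn

# Install the jVector plugin from a local zip
# (hypothetical artifact name/path, for illustration only).
bin/opensearch-plugin install file:///path/to/knn-jvector.zip
```

Users on the min distribution would skip the `remove` step, since no k-NN plugin ships with it.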

@jmazanec15 I think if you can create and publish a lightweight library that shares only the interfaces while we take time to think through the transition to core, that would be great (I removed a lot of code and will be very glad to remove even more from the jVector plugin ;) ).
But (and this is important) no other dependencies can be there, especially not native ones. If we can do that, I'm all in to take those into the jVector plugin right away.

A library is interesting, but I think we'd first need to make the engine extendible before we can do this. For instance, the KNNVectorFieldMapper class is tied directly to engines, and it defines the user-provided parameters via ParametrizedFieldMapper. For the query, it is a bit easier because parsing is already decoupled (see neural, which does not depend on native code and takes a dependency on the query parser). Anyway, I don't think the interfaces will change drastically, but I think until they are moved to core, the shared API interfaces (not the code, just the spec) need to be defined in one place (the k-NN plugin) so that future migration to core does not have issues. If the interfaces branch, this will create a very confusing experience for users.
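For illustration, the kind of decoupling being discussed could look something like the following. This is a minimal hypothetical sketch, not the actual k-NN plugin API: `KnnEngine`, `EngineRegistry`, and all method names here are invented to show the registration pattern that would let a mapper stay engine-agnostic instead of hard-coding engine classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical engine-neutral extension point: each engine plugin
// (Faiss-backed, jVector, Lucene) would implement and register this
// instead of the field mapper depending on concrete engine classes.
interface KnnEngine {
    String name();
    // Build an index-time representation from user-supplied method parameters.
    String buildIndex(Map<String, Object> methodParams);
}

// A registry the mapper could consult by engine name, so adding a new
// engine requires no change to the mapper itself.
final class EngineRegistry {
    private static final Map<String, KnnEngine> ENGINES = new HashMap<>();

    static void register(KnnEngine engine) {
        ENGINES.put(engine.name(), engine);
    }

    static KnnEngine get(String name) {
        KnnEngine engine = ENGINES.get(name);
        if (engine == null) {
            throw new IllegalArgumentException("unknown engine: " + name);
        }
        return engine;
    }
}

public class EngineRegistrySketch {
    public static void main(String[] args) {
        // A jVector-style engine registers itself at plugin load time.
        EngineRegistry.register(new KnnEngine() {
            public String name() { return "jvector"; }
            public String buildIndex(Map<String, Object> params) {
                return "jvector index with " + params;
            }
        });

        Map<String, Object> params = new HashMap<>();
        params.put("m", 16);
        // The "mapper" only knows the engine name from the field mapping.
        System.out.println(EngineRegistry.get("jvector").buildIndex(params));
        // prints: jvector index with {m=16}
    }
}
```

The point of the sketch is that the shared library would only need the interface and registry, with no native dependencies, while concrete engines live in their own plugins.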

@sam-herman
Author

A library is interesting, but I think we'd first need to make the engine extendible before we can do this. For instance, the KNNVectorFieldMapper class is tied directly to engines, and it defines the user-provided parameters via ParametrizedFieldMapper. For the query, it is a bit easier because parsing is already decoupled (see neural, which does not depend on native code and takes a dependency on the query parser). Anyway, I don't think the interfaces will change drastically, but I think until they are moved to core, the shared API interfaces (not the code, just the spec) need to be defined in one place (the k-NN plugin) so that future migration to core does not have issues. If the interfaces branch, this will create a very confusing experience for users.

I have absolutely no desire to branch interfaces. But at the same time, we can't be left with no reasonable path to extensibility. So the only option I see is to include some of the interface code in jVector until it can be inherited from somewhere else (core, or a library with minimal dependencies).
With that being said, as I mentioned earlier, I am open to any solution that won't involve inheriting native dependencies into the extension.
If you prefer to just have a minimal build of the k-NN plugin that can be extended without any of the native dependency code, that's perfectly fine with me until we move those to core.

@jmazanec15
Member

Sounds good - in terms of extensibility, we might need a separate discussion - it seems there are a couple of different ways to go. But as long as the interfaces don't branch, we should be able to move code to core and extend it in the future with minimal issues (hopefully haha).
