[Bug]: [json-inverted] Incorrect filter results are returned for queries on indexed json data #38879

Open
1 task done
ThreadDao opened this issue Dec 31, 2024 · 2 comments
Assignees
Labels
kind/bug Issues or changes related a bug severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@ThreadDao
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: JsDove-optimization_json-0b74598-20241230
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

test

  1. create a collection with 3 fields: a pk int64 field + a vector field + a json field
  2. create HNSW index for vector field
  3. insert 10m entities; the json column of each batch is generated as follows:
values = [{"id": i, "values": {"float": float(i), "varchar": str(i)}} for i in pks]
# example:
row_0: {"id": 0, "values": {"float": 0.0, "varchar": '0'}}
row_1: {"id": 1, "values": {"float": 1.0, "varchar": '1'}}
  4. flush and create index again -> load
  5. query with Strong consistency level -> wrong query results:
c.query('json_1["values"]["float"] < 20.0', limit=5, output_fields=["json_1"], consistency_level="Strong")
data: ["{'json_1': {'id': 8192, 'values': {'float': 8192.0, 'varchar': '8192'}}, 'id': 8192}", "{'json_1': {'id': 8193, 'values': {'float': 8193.0, 'varchar': '8193'}}, 'id': 8193}", "{'json_1': {'id': 8194, 'values': {'float': 8194.0, 'varchar': '8194'}}, 'id': 8194}", "{'json_1': {'id': 8195, 'values': {'float': 8195.0, 'varchar': '8195'}}, 'id': 8195}", "{'json_1': {'id': 8196, 'values': {'float': 8196.0, 'varchar': '8196'}}, 'id': 8196}"] 

c.query('json_1["values"]["float"] < 20', output_fields=["count(*)"])
data: ["{'count(*)': 565653}"]  #expected 20
  6. You can refer to the instance master-20241225-c7313575-amd64 for the correct results of the same query with the same data:
c.query('json_1["values"]["float"] < 20', output_fields=["count(*)"])
data: ["{'count(*)': 20}"] , extra_info: {'cost': 0}
c.query('json_1["values"]["float"] < 20.0', limit=5, output_fields=["json_1"], consistency_level="Strong")
data: ["{'json_1': {'id': 0, 'values': {'float': 0.0, 'varchar': '0'}}, 'id': 0}", "{'json_1': {'id': 1, 'values': {'float': 1.0, 'varchar': '1'}}, 'id': 1}", "{'json_1': {'id': 2, 'values': {'float': 2.0, 'varchar': '2'}}, 'id': 2}", "{'json_1': {'id': 3, 'values': {'float': 3.0, 'varchar': '3'}}, 'id': 3}", "{'json_1': {'id': 4, 'values': {'float': 4.0, 'varchar': '4'}}, 'id': 4}"] , extra_info: {'cost': 0}
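
For reference, the expected results can be derived without a Milvus instance. The snippet below is only a sketch: it rebuilds the json column with the generation rule from step 3 and evaluates the filter in plain Python. It assumes the primary keys are consecutive integers starting at 0 (consistent with row_0 and row_1 above) and uses a smaller row count than the 10m in the test; `make_json_rows` and `matches` are illustrative names, not part of any SDK.

```python
def make_json_rows(pks):
    """Build json_1 values with the generation rule from step 3."""
    return [{"id": i, "values": {"float": float(i), "varchar": str(i)}} for i in pks]


def matches(row, bound=20.0):
    """Plain-Python stand-in for the filter json_1["values"]["float"] < 20."""
    return row["values"]["float"] < bound


# Assumption: pks are 0..N-1; N is reduced here to keep the check fast.
rows = make_json_rows(range(100_000))
hits = [row for row in rows if matches(row)]

expected_count = len(hits)
first_five_ids = [row["id"] for row in hits[:5]]
print(expected_count, first_five_ids)
```

Under that assumption only ids 0 through 19 satisfy the filter, so rows such as id 8192 should never appear in the result and the count should not depend on the total number of rows.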

Expected Behavior

No response

Steps To Reproduce

argo workflow name: zong-json-index-6-10m
'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_concurrent_locust_custom_parameters',
            'test_case_params': {'dataset_params': {'metric_type': 'L2', 'dim': 128, 'dataset_name': 'sift', 'dataset_size': '10m', 'ni_per': 5000},
                                 'collection_params': {'other_fields': ['json_1'], 'shards_num': 1, 'collection_name': 'json_10m_coll'},
                                 'release_params': {'release_of_reload': True},
                                 'query_params': {},
                                 'search_params': {'output_fields': ['json_1'], 'timeout': 1200},
                                 'index_params': {'index_type': 'HNSW', 'index_param': {'M': 30, 'efConstruction': 200}},
                                 'concurrent_params': {'concurrent_number': 10, 'during_time': '10m', 'interval': 30, 'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'search',
                                                       'weight': 25,
                                                       'params': {'nq': 10,
                                                                  'top_k': 10,
                                                                  'output_fields': ['json_1'],
                                                                  'random_data': True,
                                                                  'search_param': {'ef': 96},
                                                                  'timeout': 60}}]},

Milvus Log

pods:

json-inverted-op-97-1369-milvus-datanode-55654cc544-wwf9d         1/1     Running                  0                20m     10.104.30.173   4am-node38   <none>           <none>
json-inverted-op-97-1369-milvus-indexnode-76d99f574b-b2np2        1/1     Running                  0                25m     10.104.23.12    4am-node27   <none>           <none>
json-inverted-op-97-1369-milvus-mixcoord-6677945b4d-9b66x         1/1     Running                  0                24m     10.104.30.169   4am-node38   <none>           <none>
json-inverted-op-97-1369-milvus-proxy-9d9558c86-r2t7t             1/1     Running                  0                20m     10.104.30.175   4am-node38   <none>           <none>
json-inverted-op-97-1369-milvus-querynode-0-6745868d58-5xthg      1/1     Running                  0                22m     10.104.21.149   4am-node24   <none>           <none>
json-inverted-op-97-1369-milvus-querynode-0-6745868d58-cwph7      1/1     Running                  0                23m     10.104.23.13    4am-node27   <none>           <none>

Anything else?

No response

@ThreadDao ThreadDao added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 31, 2024
@ThreadDao ThreadDao added the severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. label Dec 31, 2024
@ThreadDao ThreadDao added this to the 2.5.2 milestone Dec 31, 2024
@ThreadDao
Contributor Author

@JsDove There is another strange problem: there are too many small segments. Could you check why compaction is not being performed?

show segment --collection 454964804311692572
--- Growing: 0, Sealed: 0, Flushed: 39, Dropped: 0
--- Small Segments: 27, row count: 3320000	 Other Segments: 12, row count: 6680000
--- Total Segments: 39, row count: 10000000
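
To put numbers on the summary above, here is a rough sketch that parses the three summary lines and derives the share of small segments and the average rows per segment. The regular expression only targets the exact wording pasted above and `parse_group` is a hypothetical helper, so treat it as an assumption rather than a stable output format.

```python
import re

# Summary lines copied from the output above.
SUMMARY = (
    "--- Growing: 0, Sealed: 0, Flushed: 39, Dropped: 0\n"
    "--- Small Segments: 27, row count: 3320000\t Other Segments: 12, row count: 6680000\n"
    "--- Total Segments: 39, row count: 10000000\n"
)


def parse_group(text, label):
    """Return (segment_count, row_count) for the 'Small', 'Other' or 'Total' group."""
    m = re.search(rf"{label} Segments: (\d+), row count: (\d+)", text)
    if m is None:
        raise ValueError(f"no '{label} Segments' entry found")
    return int(m.group(1)), int(m.group(2))


small = parse_group(SUMMARY, "Small")
other = parse_group(SUMMARY, "Other")
total = parse_group(SUMMARY, "Total")

# The two groups should add up to the reported totals.
assert small[0] + other[0] == total[0]
assert small[1] + other[1] == total[1]

small_share = small[0] / total[0]
avg_small_rows = small[1] / small[0]
avg_other_rows = other[1] / other[0]
print(f"small segments: {small_share:.0%} of all segments, "
      f"avg rows per segment: small={avg_small_rows:.0f}, other={avg_other_rows:.0f}")
```

The totals are consistent, so the data itself is complete; the open question is only why more than two thirds of the segments stayed small.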

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 3, 2025
@yanliang567 yanliang567 removed their assignment Jan 3, 2025
@yanliang567 yanliang567 modified the milestones: 2.5.2, 2.5.3 Jan 6, 2025
@JsDove
Contributor

JsDove commented Jan 6, 2025

It has been fixed already.
The problem was caused by an incorrect bitmap being returned after filtering.
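
As an illustration of this class of bug (not the actual fix, which is not shown in this issue), a filter can be viewed as producing one bit per row, and an index-backed bitmap can be cross-checked against a brute-force scan. Everything below is a hypothetical sketch: `scan_bitmap` and `diff_bitmaps` are made-up helpers, and the "faulty" bitmap is deliberately inverted just to give the comparison something to report.

```python
def scan_bitmap(values, bound):
    """Brute-force reference: one bool per row, True where value < bound."""
    return [v < bound for v in values]


def diff_bitmaps(expected, actual):
    """Return (false_positives, false_negatives) as lists of row offsets."""
    if len(expected) != len(actual):
        raise ValueError("bitmaps must cover the same number of rows")
    false_pos = [i for i, (e, a) in enumerate(zip(expected, actual)) if a and not e]
    false_neg = [i for i, (e, a) in enumerate(zip(expected, actual)) if e and not a]
    return false_pos, false_neg


# 64 rows whose float value equals the row offset, as in the test data.
values = [float(i) for i in range(64)]
expected = scan_bitmap(values, 20.0)

# Hypothetical faulty result for illustration only: every bit flipped.
faulty = [not bit for bit in expected]

false_pos, false_neg = diff_bitmaps(expected, faulty)
print(len(false_pos), "rows wrongly included,", len(false_neg), "rows wrongly excluded")
```

A check like this on a small segment would flag both symptoms reported above: rows that should not match being returned, and a count far larger than the expected 20.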


3 participants