chunk: relax table segregation during chunking (#3812)
**Summary**
Relax table-segregation rule applied during chunking such that a `Table`
and `Text`-subtype elements can be combined into a single chunk when the
chunking window allows.
**Additional Context**
Until now, `Table` elements have always been segregated during chunking,
i.e. a chunk that contained a table would never contain any other
element. In certain scenarios, especially when a large chunking window
of say 2000 characters is used, this behavior can reduce retrieval
effectiveness by isolating the table from surrounding context.
---------
Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: scanny <[email protected]>
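The difference between the old and the relaxed rule described in the commit message can be illustrated with a small, self-contained sketch. This is not the library's implementation: the `Element` class, the `pre_chunk` function, the `segregate_tables` flag, and the two-character separator are all hypothetical names and assumptions chosen for illustration, and splitting of oversized elements is not modeled.

```python
from dataclasses import dataclass
from typing import List

SEPARATOR_LEN = 2  # assumption: joined elements are separated by two characters


@dataclass
class Element:
    """Hypothetical stand-in for a document element (not the library's class)."""
    kind: str  # e.g. "Table", "NarrativeText"
    text: str


def pre_chunk(
    elements: List[Element], max_characters: int, segregate_tables: bool
) -> List[List[Element]]:
    """Greedily group elements so each group's joined text fits the window.

    segregate_tables=True mimics the old rule (a Table is always alone);
    segregate_tables=False mimics the relaxed rule (a Table may share a group).
    """
    groups: List[List[Element]] = []
    current: List[Element] = []
    current_len = 0

    def flush() -> None:
        nonlocal current, current_len
        if current:
            groups.append(current)
        current, current_len = [], 0

    for el in elements:
        if segregate_tables and el.kind == "Table":
            flush()               # close whatever came before the table
            groups.append([el])   # the table gets a group to itself
            continue
        added = len(el.text) + (SEPARATOR_LEN if current else 0)
        if current and current_len + added > max_characters:
            flush()               # window is full, start a new group
            added = len(el.text)
        current.append(el)
        current_len += added
    flush()
    return groups


elements = [
    Element("NarrativeText", "Intro text."),
    Element("Table", "a b c d"),
    Element("NarrativeText", "Closing text."),
]
old_groups = pre_chunk(elements, max_characters=2000, segregate_tables=True)
new_groups = pre_chunk(elements, max_characters=2000, segregate_tables=False)
```

With a 2000-character window, the old rule yields three groups (text, table, text), while the relaxed rule keeps the table together with its surrounding text in a single group. With a small window the two rules converge, because the table no longer fits alongside its neighbors.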
CHANGELOG.md: 3 additions & 1 deletion
@@ -1,8 +1,10 @@
-## 0.16.11-dev0
+## 0.16.11-dev1

 ### Enhancements

 - **Enhance quote standardization tests** with additional Unicode scenarios
+- **Relax table segregation rule in chunking.** Previously a `Table` element was always segregated into its own pre-chunk such that the `Table` appeared alone in a chunk or was split into multiple `TableChunk` elements, but never combined with `Text`-subtype elements. Allow table elements to be combined with other elements in the same chunk when space allows.
+- **Compute chunk length based solely on `element.text`.** Previously `.metadata.text_as_html` was also considered and since it is always longer than the text (due to HTML tag overhead) it was the effective length criterion. Remove text-as-html from the length calculation such that text-length is the sole criterion for sizing a chunk.
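The effect of the second changelog entry on chunk sizing can be sketched as follows. The function names are hypothetical, and modeling the previous criterion as the longer of the two strings is an assumption based only on the changelog wording ("it was the effective length criterion"), not on the library's code.

```python
from typing import Optional


def chunk_length_old(text: str, text_as_html: Optional[str]) -> int:
    """Assumed previous criterion: the longer of text and its HTML rendering."""
    return max(len(text), len(text_as_html or ""))


def chunk_length_new(text: str) -> int:
    """Criterion after the change: the plain text length only."""
    return len(text)


# Illustrative values, abbreviated from the kind of table shown in the fixtures.
text = "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP"
html = (
    "<table><tbody><tr><td>RFP Number: 2024-PMO-01</td>"
    "<td>RFP Title: PMO Services RFP</td></tr></tbody></table>"
)

old_len = chunk_length_old(text, html)
new_len = chunk_length_new(text)
```

Because the HTML form carries tag overhead, `old_len` exceeds `new_len` here, so under the assumed old criterion a table would use up more of the chunking window than its visible text does.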
+    "text": "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP RFP Due Date and Time: Number of Pages: #189 05/30/2024 by 5:00pm Central Time",
+    "metadata": {
+      "category_depth": 1,
+      "page_number": 1,
+      "parent_id": "747587de72444235a68c768d544ff5f3",
+      "text_as_html": "<table class=\"Table\" id=\"ca96108263324e9d865a98f19cf7c940\"> <tbody> <tr> <td>RFP Number: 2024-PMO-01</td><td>RFP Title: PMO Services RFP</td></tr><tr> <td>RFP Due Date and Time:</td><td>Number of Pages: #189</td></tr><tr> <td>05/30/2024 by 5:00pm Central Time</td><td></td></tr></tbody></table>",
+      "languages": [
+        "eng"
+      ],
+      "filetype": "text/html"
+    }
+  },
+  {
+    "type": "NarrativeText",
+    "element_id": "5bc93ad5828445f98cac824c750cacfd",
+    "text": "Format: CSV file for Export and Download Contact: Charles Stringham [email protected] to arrange secure data transfer OR with technical questions [email protected] for other questions",
+    "metadata": {
+      "category_depth": 2,
+      "page_number": 1,
+      "parent_id": "d8fa364bbfdf42d7b37c7a1dcb90ecf5",
+      "text_as_html": "<p class=\"NarrativeText\" id=\"5bc93ad5828445f98cac824c750cacfd\">Format: CSV file for Export and Download </p> <p class=\"NarrativeText\" id=\"875c1820b6cd4736a7e699571896b568\">Contact: Charles Stringham [email protected] to arrange secure data transfer OR with technical questions </p> <p class=\"NarrativeText\" id=\"ac41c15812e64e918cbb07c2bc68b5d2\">[email protected] for other questions </p>",
+    "text": "Format: CSV file for Export and Download Contact: Charles Stringham [email protected] to arrange secure data transfer OR with technical questions [email protected] for other questions",
+    "metadata": {
+      "category_depth": 2,
+      "page_number": 1,
+      "parent_id": "d8fa364bbfdf42d7b37c7a1dcb90ecf5",
+      "text_as_html": "<p class=\"NarrativeText\" id=\"5bc93ad5828445f98cac824c750cacfd\">Format: CSV file for Export and Download </p> <p class=\"NarrativeText\" id=\"875c1820b6cd4736a7e699571896b568\">Contact: Charles Stringham [email protected] to arrange secure data transfer OR with technical questions </p> <p class=\"NarrativeText\" id=\"ac41c15812e64e918cbb07c2bc68b5d2\">[email protected] for other questions </p>",
+      "languages": [
+        "eng"
+      ],
+      "filetype": "text/html"
+    }
+  },
+  {
+    "type": "Table",
+    "element_id": "ca96108263324e9d865a98f19cf7c940",
+    "text": "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP RFP Due Date and Time: Number of Pages: #189 05/30/2024 by 5:00pm Central Time",
+    "metadata": {
+      "category_depth": 1,
+      "page_number": 1,
+      "parent_id": "747587de72444235a68c768d544ff5f3",
+      "text_as_html": "<table class=\"Table\" id=\"ca96108263324e9d865a98f19cf7c940\"> <tbody> <tr> <td>RFP Number: 2024-PMO-01</td><td>RFP Title: PMO Services RFP</td></tr><tr> <td>RFP Due Date and Time:</td><td>Number of Pages: #189</td></tr><tr> <td>05/30/2024 by 5:00pm Central Time</td><td></td></tr></tbody></table>",