Commit 4379d88

chunk: relax table segregation during chunking (#3812)
**Summary**

Relax table-segregation rule applied during chunking such that a `Table` and `Text`-subtype elements can be combined into a single chunk when the chunking window allows.

**Additional Context**

Until now, `Table` elements have always been segregated during chunking, i.e. a chunk that contained a table would never contain any other element. In certain scenarios, especially when a large chunking window of say 2000 characters is used, this behavior can reduce retrieval effectiveness by isolating the table from surrounding context.

---------

Co-authored-by: ryannikolaidis <[email protected]>
Co-authored-by: scanny <[email protected]>
1 parent 18d6c81 commit 4379d88
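
The rule change described in the commit message can be sketched with a toy packer. This is an illustration only, not the library's implementation: the `Element` dataclass, `pre_chunk()`, and the `segregate_tables` flag are hypothetical names, and the sketch assumes chunk length is the length of the element texts joined by a blank line (the separator visible in the expected chunk text in the tests).

```python
from dataclasses import dataclass

SEPARATOR = "\n\n"  # assumed separator between element texts within a chunk


@dataclass
class Element:
    """Toy stand-in for a document element (hypothetical, not the library class)."""

    kind: str  # e.g. "Title", "Text", "Table"
    text: str


def joined_len(items):
    """Length of the text these elements would occupy in one chunk."""
    return len(SEPARATOR.join(e.text for e in items))


def pre_chunk(elements, max_characters, segregate_tables):
    """Greedily pack elements into groups that fit the chunking window.

    A single element longer than the window still gets its own group;
    splitting oversized elements is outside the scope of this sketch.
    """
    groups, current = [], []
    for element in elements:
        is_table = element.kind == "Table"
        holds_table = any(e.kind == "Table" for e in current)
        # Old rule: a table never shares a group with anything else.
        must_break = segregate_tables and (is_table or holds_table)
        too_long = joined_len(current + [element]) > max_characters
        if current and (must_break or too_long):
            groups.append(current)
            current = []
        current.append(element)
    if current:
        groups.append(current)
    return groups


elements = [
    Element("Title", "A Great Day"),
    Element("Text", "Today is a great day."),
    Element("Text", "It is sunny outside."),
    Element("Table", "Heading Cell text"),
]

old = pre_chunk(elements, max_characters=500, segregate_tables=True)
new = pre_chunk(elements, max_characters=500, segregate_tables=False)

print([[e.kind for e in g] for g in old])
# [['Title', 'Text', 'Text'], ['Table']]
print([[e.kind for e in g] for g in new])
# [['Title', 'Text', 'Text', 'Table']]
```

With the relaxed rule the only remaining reason to start a new group in this sketch is that the next element would overflow the window, which is why a small window still separates the table from earlier text.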

15 files changed

+1049
-907
lines changed

CHANGELOG.md

Lines changed: 3 additions & 1 deletion
@@ -1,8 +1,10 @@
-## 0.16.11-dev0
+## 0.16.11-dev1

 ### Enhancements

 - **Enhance quote standardization tests** with additional Unicode scenarios
+- **Relax table segregation rule in chunking.** Previously a `Table` element was always segregated into its own pre-chunk such that the `Table` appeared alone in a chunk or was split into multiple `TableChunk` elements, but never combined with `Text`-subtype elements. Allow table elements to be combined with other elements in the same chunk when space allows.
+- **Compute chunk length based solely on `element.text`.** Previously `.metadata.text_as_html` was also considered and since it is always longer than the text (due to HTML tag overhead) it was the effective length criterion. Remove text-as-html from the length calculation such that text-length is the sole criterion for sizing a chunk.

 ### Features
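
The second changelog entry can be illustrated with a small sizing sketch. It is a simplified model, not the library's code: `length_with_html()` and `length_text_only()` are hypothetical helpers, the "longer string wins" rule is an assumption drawn from the entry's wording, and the HTML markup is made up for the example.

```python
def length_with_html(text, text_as_html=None):
    """Assumed model of the earlier sizing: the longer of text and HTML wins."""
    return max(len(text), len(text_as_html or ""))


def length_text_only(text, text_as_html=None):
    """Sizing after the change: only the element text counts."""
    return len(text)


table_text = "Heading Cell text"
# Hypothetical markup for the same table; tag overhead makes it much longer.
table_html = "<table><tr><td>Heading</td></tr><tr><td>Cell text</td></tr></table>"

window = 40  # illustrative chunking window, in characters
fits_old = length_with_html(table_text, table_html) <= window
fits_new = length_text_only(table_text, table_html) <= window

print(length_with_html(table_text, table_html), length_text_only(table_text, table_html))
print(fits_old, fits_new)
```

Under this model the same table overflows the window when HTML is counted and fits when only the text is counted, which is the sizing difference the entry describes.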

test_unstructured/chunking/test_base.py

Lines changed: 468 additions & 559 deletions
Large diffs are not rendered by default.

test_unstructured/chunking/test_basic.py

Lines changed: 20 additions & 20 deletions
@@ -25,69 +25,69 @@ def test_it_chunks_a_document_when_basic_chunking_strategy_is_specified_on_parti
     assert chunks == [
         CompositeElement(
             "US Trustee Handbook\n\nCHAPTER 1\n\nINTRODUCTION\n\nCHAPTER 1 – INTRODUCTION"
-            "\n\nA.\tPURPOSE"
+            "\n\nA. PURPOSE"
         ),
         CompositeElement(
             "The United States Trustee appoints and supervises standing trustees and monitors and"
-            " supervises cases under chapter 13 of title 11 of the United States Code. 28 U.S.C."
-            " § 586(b). The Handbook, issued as part of our duties under 28 U.S.C. § 586,"
+            " supervises cases under chapter 13 of title 11 of the United States Code. 28 U.S.C."
+            " § 586(b). The Handbook, issued as part of our duties under 28 U.S.C. § 586,"
             " establishes or clarifies the position of the United States Trustee Program (Program)"
             " on the duties owed by a standing trustee to the debtors, creditors, other parties in"
-            " interest, and the United States Trustee. The Handbook does not present a full and"
+            " interest, and the United States Trustee. The Handbook does not present a full and"
         ),
         CompositeElement(
             "complete statement of the law; it should not be used as a substitute for legal"
-            " research and analysis. The standing trustee must be familiar with relevant"
+            " research and analysis. The standing trustee must be familiar with relevant"
             " provisions of the Bankruptcy Code, Federal Rules of Bankruptcy Procedure (Rules),"
-            " any local bankruptcy rules, and case law. 11 U.S.C. § 321, 28 U.S.C. § 586,"
-            " 28 C.F.R. § 58.6(a)(3). Standing trustees are encouraged to follow Practice Tips"
+            " any local bankruptcy rules, and case law. 11 U.S.C. § 321, 28 U.S.C. § 586,"
+            " 28 C.F.R. § 58.6(a)(3). Standing trustees are encouraged to follow Practice Tips"
             " identified in this Handbook but these are not considered mandatory."
         ),
         CompositeElement(
             "Nothing in this Handbook should be construed to excuse the standing trustee from"
             " complying with all duties imposed by the Bankruptcy Code and Rules, local rules, and"
-            " orders of the court. The standing trustee should notify the United States Trustee"
+            " orders of the court. The standing trustee should notify the United States Trustee"
             " whenever the provision of the Handbook conflicts with the local rules or orders of"
-            " the court. The standing trustee is accountable for all duties set forth in this"
-            " Handbook, but need not personally perform any duty unless otherwise indicated. All"
+            " the court. The standing trustee is accountable for all duties set forth in this"
+            " Handbook, but need not personally perform any duty unless otherwise indicated. All"
         ),
         CompositeElement(
             "statutory references in this Handbook refer to the Bankruptcy Code, 11 U.S.C. § 101"
             " et seq., unless otherwise indicated."
         ),
         CompositeElement(
             "This Handbook does not create additional rights against the standing trustee or"
-            " United States Trustee in favor of other parties.\n\nB.\tROLE OF THE UNITED STATES"
+            " United States Trustee in favor of other parties.\n\nB. ROLE OF THE UNITED STATES"
             " TRUSTEE"
         ),
         CompositeElement(
             "The Bankruptcy Reform Act of 1978 removed the bankruptcy judge from the"
-            " responsibilities for daytoday administration of cases. Debtors, creditors, and"
+            " responsibilities for daytoday administration of cases. Debtors, creditors, and"
             " third parties with adverse interests to the trustee were concerned that the court,"
             " which previously appointed and supervised the trustee, would not impartially"
             " adjudicate their rights as adversaries of that trustee. To address these concerns,"
             " judicial and administrative functions within the bankruptcy system were bifurcated."
         ),
         CompositeElement(
             "Many administrative functions formerly performed by the court were placed within the"
-            " Department of Justice through the creation of the Program. Among the administrative"
+            " Department of Justice through the creation of the Program. Among the administrative"
             " functions assigned to the United States Trustee were the appointment and supervision"
-            " of chapter 13 trustees./ This Handbook is issued under the authority of the"
-            " Program’s enabling statutes. \n\nC.\tSTATUTORY DUTIES OF A STANDING TRUSTEE\t"
+            " of chapter 13 trustees./ This Handbook is issued under the authority of the"
+            " Program’s enabling statutes.\n\nC. STATUTORY DUTIES OF A STANDING TRUSTEE"
         ),
         CompositeElement(
-            "The standing trustee has a fiduciary responsibility to the bankruptcy estate. The"
-            " standing trustee is more than a mere disbursing agent. The standing trustee must"
-            " be personally involved in the trustee operation. If the standing trustee is or"
+            "The standing trustee has a fiduciary responsibility to the bankruptcy estate. The"
+            " standing trustee is more than a mere disbursing agent. The standing trustee must"
+            " be personally involved in the trustee operation. If the standing trustee is or"
             " becomes unable to perform the duties and responsibilities of a standing trustee,"
             " the standing trustee must immediately advise the United States Trustee."
-            " 28 U.S.C. § 586(b), 28 C.F.R. § 58.4(b) referencing 28 C.F.R. § 58.3(b)."
+            " 28 U.S.C. § 586(b), 28 C.F.R. § 58.4(b) referencing 28 C.F.R. § 58.3(b)."
         ),
         CompositeElement(
             "Although this Handbook is not intended to be a complete statutory reference, the"
             " standing trustee’s primary statutory duties are set forth in 11 U.S.C. § 1302, which"
             " incorporates by reference some of the duties of chapter 7 trustees found in"
-            " 11 U.S.C. § 704. These duties include, but are not limited to, the"
+            " 11 U.S.C. § 704. These duties include, but are not limited to, the"
             " following:\n\nCopyright"
         ),
     ]

test_unstructured/chunking/test_title.py

Lines changed: 57 additions & 18 deletions
@@ -8,7 +8,7 @@

 import pytest

-from test_unstructured.unit_utils import FixtureRequest, Mock, function_mock
+from test_unstructured.unit_utils import FixtureRequest, Mock, function_mock, input_path
 from unstructured.chunking.base import CHUNK_MULTI_PAGE_DEFAULT
 from unstructured.chunking.title import _ByTitleChunkingOptions, chunk_by_title
 from unstructured.documents.coordinates import CoordinateSystem
@@ -20,10 +20,12 @@
     ElementMetadata,
     ListItem,
     Table,
+    TableChunk,
     Text,
     Title,
 )
 from unstructured.partition.html import partition_html
+from unstructured.staging.base import elements_from_json

 # ================================================================================================
 # INTEGRATION-TESTS
@@ -33,7 +35,53 @@
 # ================================================================================================


-def test_it_splits_a_large_element_into_multiple_chunks():
+def test_it_chunks_text_followed_by_table_together_when_both_fit():
+    elements = elements_from_json(input_path("chunking/title_table_200.json"))
+
+    chunks = chunk_by_title(elements, combine_text_under_n_chars=0)
+
+    assert len(chunks) == 1
+    assert isinstance(chunks[0], CompositeElement)
+
+
+def test_it_chunks_table_followed_by_text_together_when_both_fit():
+    elements = elements_from_json(input_path("chunking/table_text_200.json"))
+
+    # -- disable chunk combining so we test pre-chunking behavior, not chunk-combining --
+    chunks = chunk_by_title(elements, combine_text_under_n_chars=0)
+
+    assert len(chunks) == 1
+    assert isinstance(chunks[0], CompositeElement)
+
+
+def test_it_splits_oversized_table():
+    elements = elements_from_json(input_path("chunking/table_2000.json"))
+
+    chunks = chunk_by_title(elements)
+
+    assert len(chunks) == 5
+    assert all(isinstance(chunk, TableChunk) for chunk in chunks)
+
+
+def test_it_starts_new_chunk_for_table_after_full_text_chunk():
+    elements = elements_from_json(input_path("chunking/long_text_table_200.json"))
+
+    chunks = chunk_by_title(elements, max_characters=250)
+
+    assert len(chunks) == 2
+    assert [type(chunk) for chunk in chunks] == [CompositeElement, Table]
+
+
+def test_it_starts_new_chunk_for_text_after_full_table_chunk():
+    elements = elements_from_json(input_path("chunking/full_table_long_text_250.json"))
+
+    chunks = chunk_by_title(elements, max_characters=250)
+
+    assert len(chunks) == 2
+    assert [type(chunk) for chunk in chunks] == [Table, CompositeElement]
+
+
+def test_it_splits_a_large_text_element_into_multiple_chunks():
     elements: list[Element] = [
         Title("Introduction"),
         Text(
@@ -68,29 +116,26 @@ def test_it_splits_elements_by_title_and_table():

     chunks = chunk_by_title(elements, combine_text_under_n_chars=0, include_orig_elements=True)

-    assert len(chunks) == 4
+    assert len(chunks) == 3
     # --
     chunk = chunks[0]
     assert isinstance(chunk, CompositeElement)
     assert chunk.metadata.orig_elements == [
         Title("A Great Day"),
         Text("Today is a great day."),
         Text("It is sunny outside."),
+        Table("Heading\nCell text"),
     ]
     # --
     chunk = chunks[1]
-    assert isinstance(chunk, Table)
-    assert chunk.metadata.orig_elements == [Table("Heading\nCell text")]
-    # ==
-    chunk = chunks[2]
     assert isinstance(chunk, CompositeElement)
     assert chunk.metadata.orig_elements == [
         Title("An Okay Day"),
         Text("Today is an okay day."),
         Text("It is rainy outside."),
     ]
     # --
-    chunk = chunks[3]
+    chunk = chunks[2]
     assert isinstance(chunk, CompositeElement)
     assert chunk.metadata.orig_elements == [
         Title("A Bad Day"),
@@ -119,9 +164,8 @@ def test_chunk_by_title():

     assert chunks == [
         CompositeElement(
-            "A Great Day\n\nToday is a great day.\n\nIt is sunny outside.",
+            "A Great Day\n\nToday is a great day.\n\nIt is sunny outside.\n\nHeading Cell text"
         ),
-        Table("Heading\nCell text"),
         CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
         CompositeElement(
             "A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
@@ -150,10 +194,7 @@ def test_chunk_by_title_separates_by_page_number():
         CompositeElement(
             "A Great Day",
         ),
-        CompositeElement(
-            "Today is a great day.\n\nIt is sunny outside.",
-        ),
-        Table("Heading\nCell text"),
+        CompositeElement("Today is a great day.\n\nIt is sunny outside.\n\nHeading Cell text"),
         CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
         CompositeElement(
             "A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
@@ -178,9 +219,8 @@ def test_chuck_by_title_respects_multipage():
     chunks = chunk_by_title(elements, multipage_sections=True, combine_text_under_n_chars=0)
     assert chunks == [
         CompositeElement(
-            "A Great Day\n\nToday is a great day.\n\nIt is sunny outside.",
+            "A Great Day\n\nToday is a great day.\n\nIt is sunny outside.\n\nHeading Cell text"
         ),
-        Table("Heading\nCell text"),
         CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
         CompositeElement(
             "A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",
@@ -206,9 +246,8 @@ def test_chunk_by_title_groups_across_pages():

     assert chunks == [
         CompositeElement(
-            "A Great Day\n\nToday is a great day.\n\nIt is sunny outside.",
+            "A Great Day\n\nToday is a great day.\n\nIt is sunny outside.\n\nHeading Cell text"
         ),
-        Table("Heading\nCell text"),
         CompositeElement("An Okay Day\n\nToday is an okay day.\n\nIt is rainy outside."),
         CompositeElement(
             "A Bad Day\n\nToday is a bad day.\n\nIt is storming outside.",

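The chunk types asserted in the tests above (`CompositeElement`, `Table`, `TableChunk`) follow a pattern that can be sketched as a small classifier. This is a rough model inferred from the assertions, not the library's logic: `chunk_kinds()` is a hypothetical helper, groups are plain `(kind, text)` pairs, and the fixed-width slicing of an oversized table is a placeholder, so the real splitter may produce a different number of pieces.

```python
def chunk_kinds(group, max_characters):
    """Name the chunk type(s) a toy chunker would emit for one packed group.

    `group` is a list of (kind, text) pairs. Groups other than a lone table
    are assumed to already fit the window.
    """
    if len(group) == 1 and group[0][0] == "Table":
        text = group[0][1]
        if len(text) <= max_characters:
            # A lone table that fits is emitted unchanged.
            return ["Table"]
        # Placeholder split into fixed-width slices (assumption for this sketch).
        pieces = [text[i : i + max_characters] for i in range(0, len(text), max_characters)]
        return ["TableChunk"] * len(pieces)
    # Text alone, or text combined with a table, becomes one composite chunk.
    return ["CompositeElement"]


mixed = chunk_kinds([("Text", "Intro."), ("Table", "Heading Cell text")], 250)
lone_table = chunk_kinds([("Table", "Heading Cell text")], 250)
big_table = chunk_kinds([("Table", "x" * 600)], 250)

print(mixed)       # ['CompositeElement']
print(lone_table)  # ['Table']
print(big_table)   # ['TableChunk', 'TableChunk', 'TableChunk']
```

The point of the sketch is the three-way outcome: a table keeps its `Table` identity only when it ends up alone in a chunk and fits the window.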
test_unstructured/partition/test_json.py

Lines changed: 1 addition & 1 deletion
@@ -37,7 +37,7 @@ def test_it_chunks_elements_when_a_chunking_strategy_is_specified():
         "example-docs/spring-weather.html.json", chunking_strategy="basic", max_characters=1500
     )

-    assert len(chunks) == 10
+    assert len(chunks) == 9
     assert all(isinstance(ch, CompositeElement) for ch in chunks)


Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+[
+    {
+        "type": "Table",
+        "element_id": "ca96108263324e9d865a98f19cf7c940",
+        "text": "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP RFP Due Date and Time: Number of Pages: #189 05/30/2024 by 5:00pm Central Time",
+        "metadata": {
+            "category_depth": 1,
+            "page_number": 1,
+            "parent_id": "747587de72444235a68c768d544ff5f3",
+            "text_as_html": "<table class=\"Table\" id=\"ca96108263324e9d865a98f19cf7c940\"> <tbody> <tr> <td>RFP Number: 2024-PMO-01</td><td>RFP Title: PMO Services RFP</td></tr><tr> <td>RFP Due Date and Time:</td><td>Number of Pages: #189</td></tr><tr> <td>05/30/2024 by 5:00pm Central Time</td><td></td></tr></tbody></table>",
+            "languages": [
+                "eng"
+            ],
+            "filetype": "text/html"
+        }
+    },
+    {
+        "type": "NarrativeText",
+        "element_id": "5bc93ad5828445f98cac824c750cacfd",
+        "text": "Format: CSV file for Export and Download Contact: Charles Stringham [email protected] to arrange secure data transfer OR with technical questions [email protected] for other questions",
+        "metadata": {
+            "category_depth": 2,
+            "page_number": 1,
+            "parent_id": "d8fa364bbfdf42d7b37c7a1dcb90ecf5",
+            "text_as_html": "<p class=\"NarrativeText\" id=\"5bc93ad5828445f98cac824c750cacfd\">Format: CSV file for Export and Download </p> <p class=\"NarrativeText\" id=\"875c1820b6cd4736a7e699571896b568\">Contact: Charles Stringham [email protected] to arrange secure data transfer OR with technical questions </p> <p class=\"NarrativeText\" id=\"ac41c15812e64e918cbb07c2bc68b5d2\">[email protected] for other questions </p>",
+            "languages": [
+                "eng"
+            ],
+            "filetype": "text/html"
+        }
+    }
+]
Lines changed: 32 additions & 0 deletions
@@ -0,0 +1,32 @@
+[
+    {
+        "type": "NarrativeText",
+        "element_id": "5bc93ad5828445f98cac824c750cacfd",
+        "text": "Format: CSV file for Export and Download Contact: Charles Stringham [email protected] to arrange secure data transfer OR with technical questions [email protected] for other questions",
+        "metadata": {
+            "category_depth": 2,
+            "page_number": 1,
+            "parent_id": "d8fa364bbfdf42d7b37c7a1dcb90ecf5",
+            "text_as_html": "<p class=\"NarrativeText\" id=\"5bc93ad5828445f98cac824c750cacfd\">Format: CSV file for Export and Download </p> <p class=\"NarrativeText\" id=\"875c1820b6cd4736a7e699571896b568\">Contact: Charles Stringham [email protected] to arrange secure data transfer OR with technical questions </p> <p class=\"NarrativeText\" id=\"ac41c15812e64e918cbb07c2bc68b5d2\">[email protected] for other questions </p>",
+            "languages": [
+                "eng"
+            ],
+            "filetype": "text/html"
+        }
+    },
+    {
+        "type": "Table",
+        "element_id": "ca96108263324e9d865a98f19cf7c940",
+        "text": "RFP Number: 2024-PMO-01 RFP Title: PMO Services RFP RFP Due Date and Time: Number of Pages: #189 05/30/2024 by 5:00pm Central Time",
+        "metadata": {
+            "category_depth": 1,
+            "page_number": 1,
+            "parent_id": "747587de72444235a68c768d544ff5f3",
+            "text_as_html": "<table class=\"Table\" id=\"ca96108263324e9d865a98f19cf7c940\"> <tbody> <tr> <td>RFP Number: 2024-PMO-01</td><td>RFP Title: PMO Services RFP</td></tr><tr> <td>RFP Due Date and Time:</td><td>Number of Pages: #189</td></tr><tr> <td>05/30/2024 by 5:00pm Central Time</td><td></td></tr></tbody></table>",
+            "languages": [
+                "eng"
+            ],
+            "filetype": "text/html"
+        }
+    }
+]
