Skip to content

Bnewm0609/layer slice#44

Open
bnewm0609 wants to merge 2 commits into
mainfrom
bnewm0609/layer_slice
Open

Bnewm0609/layer slice#44
bnewm0609 wants to merge 2 commits into
mainfrom
bnewm0609/layer_slice

Conversation

@bnewm0609

@bnewm0609 bnewm0609 commented Aug 16, 2023

Copy link
Copy Markdown
Contributor

Begins an implementation of Layer to wrap the List[Entities] and allow for more intuitive slicing (e.g. doc.sentences[:3].text instead of [sent.text for sent in doc.sentences]. This addresses #24 .

The new data structure is implemented in papermage/magelib/layer.py and inherits from python's UserList. The changes into integrate Layer were mainly made in the Document data structure.

One design decision to consider is what to do with chained access: e.g. doc.pages.paragraphs.sentences.tokens. Currently, each access creates a new layer, so doing the above would create a four-dimensional list. Two consequences of this decision:

  1. To get the first token, you would have to write doc.pages.paragraphs.sentences.tokens[0][0][0][0], which is a bit ugly.
  2. doc.pages.paragraphs.pages does not return the original Layer of pages., which is a bit uninutitive

The main question is: "Should chained accessing return the union of all of the entities in a single layer or should it return the entities in the shape of the chained accessing?"

As another example, if the doc is

Paragraph 1: "I am. I was."
Paragraph 2: "You are. You were."

Sentence 1: "I am."
Sentence 2: "I was."
Sentence 3: "You are."
Sentence 4: "You were."

Which should doc.paragraphs.sentences.text return?

# Option 1 - currently implemented
[[["I", "am", "."], ["I", "was", "."]], [["You", "are", "."], ["You", "were", "."]]]

# Option 2
["I", "am", ".", "I", "was", ".", "You", "are", ".", "You", "were", "."]

self.assertSequenceEqual(doc.chunks[0].tokens, tokens[0:3])
self.assertSequenceEqual(doc.chunks[1].tokens, tokens[3:5])
self.assertSequenceEqual(doc.chunks[2].tokens, [tokens[5]])

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assertListEqual enforces that both arguments are list, while assertSequenceEqual does not. Because the first argument is a Layer, we use assertSequenceEqual.

@bnewm0609 bnewm0609 linked an issue Aug 16, 2023 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Layer definition with Slice compatibility

1 participant