Add a way to detect Numbering indicator/ Bullet point #45

Open
tungduonghgg123 opened this issue Dec 31, 2021 · 12 comments

@tungduonghgg123

Thanks for making such a great library. Is there a way to tell that a piece of text is prefixed with a numbering indicator or bullet point?
For example, I have part of a Word file like this:
image

The text extracted:
Question 20: What is done to detect whether a person is infected with HIV?
Blood test
Respiratory tract test
Digestive tract test
Skin test

As you can see, the A, B, C, D indicators that prefix the last 4 lines are missing.

@morungos
Owner

morungos commented Jan 1, 2022

I don't have this on my current plan, sorry. These files store the text in one place, plus a set of pointers into complex structures that hold the styling. So working out the styling is not something that happens along the way to getting the text out. (That's for .doc; with .docx, something like this is likely to be easier.)

I won't close the issue, so at least it remains open for now, because it would be nice to have this.

@thegoatherder

thegoatherder commented Apr 13, 2022

@morungos we have a use case for list numbering and bullet point extraction too, mostly in .docx. Even if it stubbed in an asterisk, that would be really helpful. Do you have any appetite to look into this?

We are seeing some mixed results in tests - numbered lists and bullet points seem to parse correctly in the majority of .doc files. In .docx files they don't seem to parse. Also, if we take a docx file and save it as .doc, the numbers don't get parsed out. We haven't been able to spot a pattern in the .doc files which indicates whether the list is likely to parse properly or not. We would be happy to share some test documents with you if it helps to understand the problem.
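
To make the asterisk stub idea concrete, here is a minimal sketch in Python. It is not word-extractor's code or API. It relies on the fact that a .docx is a zip archive whose `word/document.xml` marks directly numbered or bulleted paragraphs with a `w:numPr` element inside `w:pPr`. The names `extract_with_list_stubs` and `make_docx` are hypothetical. The sketch assumes the list formatting is applied directly to the paragraph. Lists applied through a paragraph style, and the real labels (A, B, 1., and so on) defined in `word/numbering.xml`, are not handled.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def extract_with_list_stubs(docx_bytes, stub="* "):
    """Return paragraph texts, prefixing list paragraphs with a stub marker.

    Assumption: only paragraphs carrying a direct w:numPr are treated as list items.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for p in root.iter(f"{{{W}}}p"):
        # Concatenate all text runs in the paragraph
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        is_list_item = p.find("w:pPr/w:numPr", NS) is not None
        lines.append((stub if is_list_item else "") + text)
    return lines


def make_docx(body_xml):
    """Hypothetical helper: build a stripped-down in-memory archive for the demo.

    It contains only word/document.xml, so it is not a complete .docx file.
    """
    doc = f'<w:document xmlns:w="{W}"><w:body>{body_xml}</w:body></w:document>'
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("word/document.xml", doc)
    return buf.getvalue()


sample = make_docx(
    "<w:p><w:r><w:t>Question</w:t></w:r></w:p>"
    '<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr>'
    "<w:r><w:t>Blood test</w:t></w:r></w:p>"
)
print(extract_with_list_stubs(sample))  # ['Question', '* Blood test']
```

A stub like this only says "this line was a list item". It does not say which letter or number Word would have displayed.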

@morungos
Owner

Please send through your test files, I'd be happy to take a look. It doesn't sound like too hard an issue if I can replicate it easily. (.doc files are much, much worse, and I'd be crying if they were the ones you needed.)

@yoy0lol

yoy0lol commented Oct 5, 2022

Would love to have an update on this if possible.

@thegoatherder

@yoy0lol this is my fault as I never sent across the sample files. I’ll try to sort something this week. Although any docx with bullets and numbering within it would probably make a basic test case. I imagine things could get more difficult with nested levels…
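
On nested levels: in WordprocessingML, each list paragraph's `w:numPr` can carry a `w:ilvl` element whose `w:val` attribute gives the zero-based nesting level. A rough Python sketch of using that to indent stubbed bullets follows. The function names `list_level` and `render` are hypothetical, and treating a missing `w:ilvl` as level 0 is an assumption. It works on the raw `document.xml` text and does not resolve numbering formats from `word/numbering.xml`.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def list_level(p):
    """Return the zero-based list level of a w:p element, or None if it is not a list item."""
    num_pr = p.find("w:pPr/w:numPr", NS)
    if num_pr is None:
        return None
    ilvl = num_pr.find("w:ilvl", NS)
    # Assumption: a missing w:ilvl is treated as the top level
    return int(ilvl.get(f"{{{W}}}val", "0")) if ilvl is not None else 0


def render(document_xml, indent="  ", stub="* "):
    """Return paragraph texts, with list items stubbed and indented by nesting level."""
    root = ET.fromstring(document_xml)
    out = []
    for p in root.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        level = list_level(p)
        out.append(text if level is None else indent * level + stub + text)
    return out


def item(level, text):
    """Hypothetical helper: build one list paragraph for the demo."""
    return (
        f'<w:p><w:pPr><w:numPr><w:ilvl w:val="{level}"/><w:numId w:val="1"/></w:numPr></w:pPr>'
        f"<w:r><w:t>{text}</w:t></w:r></w:p>"
    )


sample_xml = (
    f'<w:document xmlns:w="{W}"><w:body>'
    + item(0, "Top")
    + item(1, "Nested")
    + "<w:p><w:r><w:t>Plain</w:t></w:r></w:p>"
    + "</w:body></w:document>"
)
print(render(sample_xml))  # ['* Top', '  * Nested', 'Plain']
```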

@morungos
Owner

morungos commented Oct 5, 2022

Files would be great! I've been distracted by other things, but I will likely have time in the next couple of weeks. So if you can drop me some test files by early next week, that would be great!!

@Fdawgs

Fdawgs commented Oct 6, 2023

@morungos Were you ever provided these documents? Happy to send over a few examples if not!

@morungos
Owner

morungos commented Oct 6, 2023

@Fdawgs Please do. I have some available time at the moment.

@Fdawgs

Fdawgs commented Oct 6, 2023

> @Fdawgs Please do. I have some available time at the moment.

Brilliant, what's the best way to get them to you?

@morungos
Owner

morungos commented Oct 6, 2023

Ideally, just drop them into this issue. It would help if they're small and public, so I can make them part of the test suite during development.

@Fdawgs

Fdawgs commented Oct 9, 2023

Ah, didn't realise you could drop files into issue comments now!

Find below a handful:
test_file_1.docx
test_file_2.docx
test_file_3.docx

@Fdawgs

Fdawgs commented Nov 30, 2023

Did you manage to find some time to look at this @morungos?
