Add a way to detect Numbering indicator/ Bullet point #45
Comments
I don't have this immediately on my current plan, sorry. The way these files store the text is that there is the text, and then there is a bunch of pointers into complex structures where the style is held. So working out the styling is not something that happens along the way to getting the text out. (That's for .doc; with .docx, something like this is likely to be easier.) I won't close the issue, so at least it remains open for now, because it would be nice to have this.
@morungos we have a use case for list numbering and bullet point extraction too, mostly in docx. Even if it just stubbed in an asterisk, that would be really helpful. Do you have any appetite to look into this? We are seeing mixed results in tests: numbered lists and bullet points seem to parse correctly in the majority of .doc files, but in .docx files they don't seem to parse. Also, if we take a docx file and save it as .doc, the numbers don't get parsed out. We haven't been able to spot a pattern in the .doc files that indicates whether a list is likely to parse properly or not. We would be happy to share some test documents with you if it helps to understand the problem.
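To make the "stub in an asterisk" idea concrete for .docx: in WordprocessingML, a paragraph that belongs to a list carries a `<w:numPr>` element in its paragraph properties, with `<w:ilvl>` giving the nesting level. The sketch below is only an illustration under several assumptions: `documentXml` is the already-unzipped contents of `word/document.xml`, `stubListMarkers` is a hypothetical helper (not part of this library's API), and regular expressions stand in for a real XML parser. It does not resolve the actual marker (A, B, C, 1., a bullet glyph), which is defined in `word/numbering.xml`, and it misses numbering inherited from a paragraph style.

```javascript
// Sketch only: prefix list paragraphs from a .docx body with a placeholder marker.
// `stubListMarkers` is a hypothetical name, not an API of this library.
function stubListMarkers(documentXml, marker = "*") {
  // Match <w:p> or <w:p attrs> (but not <w:pPr> or self-closing <w:p/>) up to </w:p>.
  const paragraphRe = /<w:p(?:\s[^>]*)?(?<!\/)>([\s\S]*?)<\/w:p>/g;
  // Match text runs: <w:t> or <w:t xml:space="preserve"> (but not <w:tab/>).
  const textRe = /<w:t(?:\s[^>]*)?>([^<]*)<\/w:t>/g;

  const lines = [];
  for (const [, body] of documentXml.matchAll(paragraphRe)) {
    // Concatenate all text runs in the paragraph (XML entities are left undecoded).
    const text = Array.from(body.matchAll(textRe), (m) => m[1]).join("");

    // No direct numbering properties: treat as an ordinary paragraph.
    if (!/<w:numPr>/.test(body)) {
      lines.push(text);
      continue;
    }

    // <w:ilvl w:val="N"/> is the zero-based nesting level; default to 0 if absent.
    const level = /<w:ilvl\s+w:val="(\d+)"/.exec(body);
    const indent = "  ".repeat(level ? Number(level[1]) : 0);
    lines.push(`${indent}${marker} ${text}`);
  }
  return lines.join("\n");
}

// Minimal hand-written sample: one plain paragraph followed by one list item.
const sample =
  "<w:body>" +
  "<w:p><w:r><w:t>Question</w:t></w:r></w:p>" +
  '<w:p><w:pPr><w:numPr><w:ilvl w:val="0"/><w:numId w:val="1"/></w:numPr></w:pPr>' +
  "<w:r><w:t>First answer</w:t></w:r></w:p>" +
  "</w:body>";

console.log(stubListMarkers(sample));
```

A placeholder keeps the list structure visible in the extracted text without having to walk the numbering definitions; producing the real A/B/C or 1/2/3 labels would additionally need the `w:numId` lookup and per-level counters.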
Please send through your test files, I'd be happy to take a look. It doesn't sound like it's too hard an issue if I can replicate it easily. (.doc files are much, much worse, and I'd be crying if they were the ones you needed.)
Would love to have an update on this if possible.
@yoy0lol this is my fault as I never sent across the sample files. I’ll try to sort something this week. Although any docx with bullets and numbering within it would probably make a basic test case. I imagine things could get more difficult with nested levels…
Files would be great! I've been distracted by other things, but I will likely have time in the next couple of weeks. So if you can drop me some test files by early next week, that would be great!!
@morungos Were you ever provided these documents? Happy to send over a few examples if not!
@Fdawgs Please do. I have some available time at the moment.
Brilliant, what's the best way to get them to you?
Ideally, just drop them into this issue. Ideally they'd be small and public, so I can make them part of the test suite during development.
Ah, didn't realise you could drop files into issue comments now! Find below a handful:
Did you manage to find some time to look at this @morungos?
Thanks for making such a great lib. I just wonder, is there a way to know that some text is prefixed with a numbering indicator or bullet point?
For example, I have part of a Word file like this:
The text extracted:
Câu 20: Để phát hiện một người có nhiễm HIV hay không người ta làm gì?
Xét nghiệm máu
Xét nghiệm đường hô hấp
Xét nghiệm đường tiêu hoá
Xét nghiệm da
As you can see, the A, B, C, D indicators that prefix the last 4 lines are missing.