- FAQ
- Why set up parsers and converters instead of converting to Markdown directly?
- Why does the `PdfParser` produce poor Markdown text, and how to achieve optimal conversion?
- Unable to connect to 'https://huggingface.co'
- Resource xxx not found. Please use the NLTK Downloader to obtain the resource:
- Resource 'wordnet' not found.
- Steps to resolve PyTorch `fbgemm.dll` loading error
Why set up parsers and converters instead of converting to Markdown directly?
- The core purpose of a parser is to extract data such as text and images without much processing. In some projects, such as knowledge bases, not all files need to be converted to Markdown. The extracted text or image content might already suffice for basic RAG (Retrieval-Augmented Generation) needs, so the extra formatting step can be skipped.
- A converter takes the extracted images and text and further refines and formats them, making the data better suited to RAG pipelines and to model training or fine-tuning.
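To illustrate the first point, here is a minimal sketch of using parser output for retrieval without any Markdown conversion. It assumes the extracted text is already available as a plain string; `chunk_text` is a hypothetical helper for illustration and is not part of `wisup_e2m`.

```python
def chunk_text(text, max_chars=500, overlap=50):
    """Split extracted text into overlapping chunks suitable for retrieval."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("require max_chars > 0 and 0 <= overlap < max_chars")
    if not text:
        return []
    step = max_chars - overlap
    # Stop early enough that the last chunk is not fully contained in the previous one.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + max_chars] for i in range(0, last_start, step)]
```

Each chunk can then be embedded and indexed as-is; a converter only becomes necessary when the layout or formatting of the source matters.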
Why does the `PdfParser` produce poor Markdown text, and how to achieve optimal conversion?
- The core function of `PdfParser` is parsing, not directly converting to Markdown.
- `PdfParser` supports three engines:
  - `marker`, inspired by the well-known `marker` project. It can convert directly to Markdown but performs poorly with complex texts, so it serves as part of the parser.
  - `unstructured`, which outputs raw text with minimal formatting and is recommended for PDFs with a clean layout.
  - `surya_layout`, which outputs images marked with layout information. These need to be converted using `ImageConverter`. If `ImageConverter` uses a multimodal model such as `gpt-4o`, the Markdown conversion quality is optimal, matching some commercial conversion software.
- Below is an example of the best conversion code:

      import os

      from wisup_e2m import PdfParser, ImageConverter

      work_dir = os.getcwd()  # Set the current path as the working directory
      image_dir = os.path.join(work_dir, "figure")
      pdf = "./test.pdf"

      # Load the parser
      pdf_parser = PdfParser(engine="surya_layout")

      # Load the converter
      image_converter = ImageConverter(
          engine="litellm",
          api_key="<your API key>",  # Replace with your API key
          model="gpt-4o",
          base_url="<your base url>",  # Fill in the base URL if using a model proxy
          caching=True,
          cache_type="disk-cache",
      )

      # Parse the PDF into images
      pdf_data = pdf_parser.parse(
          pdf,
          start_page=0,  # Starting page number
          end_page=20,  # Ending page number
          work_dir=work_dir,
          image_dir=image_dir,  # Location to save extracted images
          relative_path=True,  # Whether the image path is relative to work_dir
      )

      # Convert images to text using ImageConverter
      md_text = image_converter.convert(
          images=pdf_data.images,
          attached_images_map=pdf_data.attached_images_map,
          work_dir=work_dir,  # Image addresses in Markdown will be relative to work_dir; absolute path by default
      )

      # Save the test markdown
      with open("test.md", "w") as f:
          f.write(md_text)

Unable to connect to 'https://huggingface.co'
- Method 1: Try accessing via a VPN or proxy.
- Method 2: Use a mirror in the code:

      import os

      os.environ['CURL_CA_BUNDLE'] = ''
      os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'

- Method 3: Set environment variables in the terminal:

      export CURL_CA_BUNDLE=''
      export HF_ENDPOINT='https://hf-mirror.com'

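When using Method 2, the order of operations likely matters. The sketch below assumes that `huggingface_hub` reads `HF_ENDPOINT` once, when it is first imported, so the variable has to be set earlier than that. `use_hf_mirror` is a hypothetical helper name, and the `env` and `loaded_modules` parameters exist only to make the function easy to test.

```python
import os
import sys


def use_hf_mirror(endpoint="https://hf-mirror.com", env=None, loaded_modules=None):
    """Point Hugging Face downloads at a mirror unless an endpoint is already set.

    Returns the endpoint that ends up in effect.
    """
    env = os.environ if env is None else env
    loaded_modules = sys.modules if loaded_modules is None else loaded_modules

    # Assumption: the library reads HF_ENDPOINT at import time, so changing it
    # afterwards would be silently ignored.
    if "huggingface_hub" in loaded_modules:
        raise RuntimeError("Set HF_ENDPOINT before importing huggingface_hub")

    # Keep a value that was already exported in the shell (Method 3).
    return env.setdefault("HF_ENDPOINT", endpoint)
```

Call `use_hf_mirror()` at the very top of the script, before importing `wisup_e2m` or any other package that pulls in Hugging Face libraries.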
Resource xxx not found. Please use the NLTK Downloader to obtain the resource:

    import nltk
    nltk.download('all')  # Best to download all resources, around 3.57GB

Resource 'wordnet' not found.
- Completely uninstall `nltk`: `pip uninstall nltk`
- Reinstall `nltk` with the command: `pip install nltk`
- Manually download corpora/wordnet.zip and extract it to the directory specified in the error message. Alternatively, use the following commands to download:
  - Windows: `wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/wordnet.zip -O ~\AppData\Roaming\nltk_data\corpora\wordnet.zip` and `unzip ~\AppData\Roaming\nltk_data\corpora\wordnet.zip -d ~\AppData\Roaming\nltk_data\corpora\`
  - Unix: `wget https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/wordnet.zip -O ~/nltk_data/corpora/wordnet.zip` and `unzip ~/nltk_data/corpora/wordnet.zip -d ~/nltk_data/corpora/`

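If `wget` or `unzip` is not available, the same manual download can be sketched with the Python standard library alone. The URL is the one from the commands above; `extract_corpus` and `download_wordnet` are hypothetical helper names, and the default target of `~/nltk_data/corpora` is an assumption, so pass the directory named in your error message if it differs.

```python
import urllib.request
import zipfile
from pathlib import Path

WORDNET_URL = (
    "https://raw.githubusercontent.com/nltk/nltk_data/"
    "gh-pages/packages/corpora/wordnet.zip"
)


def extract_corpus(zip_path, corpora_dir):
    """Unpack a downloaded corpus archive into the corpora directory."""
    corpora_dir = Path(corpora_dir)
    corpora_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(corpora_dir)
    return corpora_dir


def download_wordnet(corpora_dir=None):
    """Download wordnet.zip into the corpora directory and unpack it there."""
    if corpora_dir is None:
        # Assumed default; use the path from the NLTK error message if it differs.
        corpora_dir = Path.home() / "nltk_data" / "corpora"
    corpora_dir = Path(corpora_dir)
    corpora_dir.mkdir(parents=True, exist_ok=True)
    zip_path = corpora_dir / "wordnet.zip"
    urllib.request.urlretrieve(WORDNET_URL, str(zip_path))
    return extract_corpus(zip_path, corpora_dir)
```

Calling `download_wordnet()` performs the download; `extract_corpus` can also be used on its own if the archive was fetched through a browser.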
Steps to resolve PyTorch `fbgemm.dll` loading error

When running PyTorch code on Windows, you might encounter an `fbgemm.dll` loading error (`OSError: [WinError 126] The specified module could not be found`). The problem can persist even when the `fbgemm.dll` file is in the correct path. The detailed troubleshooting and resolution steps follow.
- Install PyTorch and use a Conda virtual environment

  Sometimes, installing directly via `pip` may cause dependency issues. Creating a new environment using `conda` and installing PyTorch there might help resolve this. Here is an example:

      conda create -n pytorch_env python=3.10
      conda activate pytorch_env
      conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

  This helps ensure that PyTorch and its dependencies are installed together and that path configuration is handled by the environment.
- Install or update the VS Redistributable

  PyTorch depends on the Microsoft Visual C++ Redistributable runtime libraries. If these libraries are missing or the wrong version is installed, DLL loading can fail.
  - Download and install the latest version of the Microsoft Visual C++ Redistributable (both the x64 and x86 versions) from the Microsoft website.
- Check system path and dependencies

  Even if the DLL file is present, the system may fail to load it due to missing dependencies or path issues. Here are some steps to check:
  - Use `Dependency Walker` or `Dependencies`
    - Download and run the Dependency Walker tool, or use the more recent Dependencies tool.
    - Load the `fbgemm.dll` file to check whether all its dependencies are present.
  - Add the path to the system environment variables
    - Add the path containing `fbgemm.dll` (e.g., `D:\Python\lib\site-packages\torch\lib\`) to the system's `PATH` environment variable to ensure all related DLL files can be found.
- Use SFC and DISM to repair system files

  If system files or DLL links are corrupted, you can use Windows' System File Checker (SFC) and Deployment Image Servicing and Management (DISM) tools to repair them:
  - Open Command Prompt as an administrator.
  - Run the following command to repair system files:

        sfc /scannow

  - Use DISM to repair the Windows image:

        DISM /Online /Cleanup-Image /RestoreHealth

  - Restart your computer and try running your program again.
- Install missing DLL files

  In this case, checking with the `Dependencies` tool showed that the `libomp140.x86_64.dll` file was missing. Follow these steps to resolve:
  - Download the missing `libomp140.x86_64.dll`
    - Visit a DLL file download site, then search for and download the `libomp140.x86_64.dll` file.
  - Copy the DLL file to the `system32` directory
    - Copy the downloaded `libomp140.x86_64.dll` file to the `C:\Windows\System32\` directory.
- Final Resolution

  After completing the above steps, restart your computer or rerun your Python code, and the issue should be resolved. If the problem persists, further inspection of other dependencies or alternative solutions may be necessary.
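The environment, runtime, and dependency checks described above can be partly scripted. The following is a rough diagnostic sketch, not an official PyTorch or Microsoft tool. All three function names are hypothetical, and the registry key used for the Visual C++ runtime is an assumption about where the installer records itself, so treat a `None` result as a hint rather than proof.

```python
import ctypes
import importlib
import os
import sys
from pathlib import Path


def check_import(module_name="torch"):
    """Import a module and report the outcome instead of raising."""
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        return {"ok": False, "error": f"not installed: {exc}"}
    except (ImportError, OSError) as exc:
        # DLL loading failures such as WinError 126 end up here.
        return {"ok": False, "error": f"native library failed to load: {exc}"}
    return {"ok": True, "version": getattr(module, "__version__", "unknown")}


def vc_redist_version(arch="x64"):
    """Return the VC++ runtime version string from the registry, or None."""
    if sys.platform != "win32":
        return None
    import winreg  # Windows-only standard library module

    # Assumed location; both registry views are tried because it can vary.
    key_path = "SOFTWARE\\Microsoft\\VisualStudio\\14.0\\VC\\Runtimes\\" + arch
    for view in (winreg.KEY_WOW64_64KEY, winreg.KEY_WOW64_32KEY):
        try:
            access = winreg.KEY_READ | view
            with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path, 0, access) as key:
                installed, _ = winreg.QueryValueEx(key, "Installed")
                version, _ = winreg.QueryValueEx(key, "Version")
        except OSError:
            continue  # Key or value missing in this view.
        if installed == 1:
            return version
    return None


def try_load_dll(dll_path):
    """Return None if the DLL loads, otherwise the loader's error message."""
    if sys.platform != "win32":
        return "DLL loading can only be checked on Windows"
    dll_path = Path(dll_path).resolve()
    try:
        # Let sibling DLLs in the same folder resolve as dependencies.
        with os.add_dll_directory(str(dll_path.parent)):
            ctypes.WinDLL(str(dll_path))
    except OSError as exc:
        return str(exc)
    return None
```

For example, `try_load_dll(r"D:\Python\lib\site-packages\torch\lib\fbgemm.dll")` reports whether the file loads outside of PyTorch. The Windows error message does not name the missing dependency, so a tool such as Dependencies is still needed to identify it.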
