TextLoader is restricted to UTF-8 file encoding format and doesn't support dynamic encoding similar to Python Version #7923
Labels
auto:bug
Example Code
To reproduce the bug, use the code below:
Error Message and Stack Trace (if applicable)
[
'<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00=\x00"\x001\x00.\x000\x00"\x00 \x00e\x00n\x00c\x00o\x00d\x00i\x00n\x00g\x00=\x00"\x00U\x00T\x00F\x00-\x008\x00"\x00?\x00>\x00\n' +
'\x00<\x00x\x00s\x00:\x00s\x00c\x00h\x00e\x00m\x00a\x00 \x00x\x00m\x00l\x00n\x00s\x00:\x00x\x00s\x00=\x00"\x00h\x00t\x00t\x00p\x00:\x00/\x00/\x00w\x00w\x00w\x00.\x00w\x003\x00.\x00o\x00r\x00g\x00/\x002\x000\x000\x001\x00/\x00X\x00M\x00L\x00S\x00c\x00h\x00e\x00m\x00a\x00"\x00>\x00\n' +
'\x00\n' +
'\x00 \x00 \x00<\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00E\x00m\x00p\x00l\x00o\x00y\x00e\x00e\x00"\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00c\x00o\x00m\x00p\x00l\x00e\x00x\x00T\x00y\x00p\x00e\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00s\x00e\x00q\x00u\x00e\x00n\x00c\x00e\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00N\x00a\x00m\x00e\x00"\x00 \x00t\x00y\x00p\x00e\x00=\x00"\x00x\x00s\x00:\x00s\x00t\x00r\x00i\x00n\x00g\x00"\x00/\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00D\x00e\x00p\x00a\x00r\x00t\x00m\x00e\x00n\x00t\x00"\x00 \x00t\x00y\x00p\x00e\x00=\x00"\x00x\x00s\x00:\x00s\x00t\x00r\x00i\x00n\x00g\x00"\x00/\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00S\x00a\x00l\x00a\x00r\x00y\x00"\x00 \x00t\x00y\x00p\x00e\x00=\x00"\x00x\x00s\x00:\x00d\x00e\x00c\x00i\x00m\x00a\x00l\x00"\x00/\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00/\x00x\x00s\x00:\x00s\x00e\x00q\x00u\x00e\x00n\x00c\x00e\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00a\x00t\x00t\x00r\x00i\x00b\x00u\x00t\x00e\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00i\x00d\x00"\x00 \x00t\x00y\x00p\x00e\x00=\x00"\x00x\x00s\x00:\x00i\x00n\x00t\x00e\x00g\x00e\x00r\x00"\x00 \x00u\x00s\x00e\x00=\x00"\x00r\x00e\x00q\x00u\x00i\x00r\x00e\x00d\x00"\x00/\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00<\x00/\x00x\x00s\x00:\x00c\x00o\x00m\x00p\x00l\x00e\x00x\x00T\x00y\x00p\x00e\x00>\x00\n' +
'\x00 \x00 \x00<\x00/\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00>\x00\n' +
'\x00\n' +
'\x00<\x00/\x00x\x00s\x00:\x00s\x00c\x00h\x00e\x00m\x00a\x00>\x00\n' +
'\x00'
]
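The interleaved \x00 characters in the output above are consistent with UTF-16LE bytes being decoded as UTF-8. The following is a minimal sketch of that effect. It is not the reporter's code and not LangChain's implementation: it uses only the standard TextDecoder, and the helper encodeUtf16le and the sample string are hypothetical names introduced for illustration.

```typescript
// Hypothetical helper: encode a string as UTF-16LE bytes without a BOM.
function encodeUtf16le(text: string): Uint8Array {
  const bytes = new Uint8Array(text.length * 2);
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i); // one UTF-16 code unit
    bytes[2 * i] = unit & 0xff; // low byte first (little endian)
    bytes[2 * i + 1] = unit >> 8; // high byte second
  }
  return bytes;
}

// Made-up stand-in for the first line of the XSD file in the report.
const sample = '<?xml version="1.0"?>\n';
const utf16Bytes = encodeUtf16le(sample);

// Decoding as UTF-8 turns every 0x00 high byte into a NUL character.
const misdecoded = new TextDecoder("utf-8").decode(utf16Bytes);

// Decoding with the matching encoding recovers the text.
const decoded = new TextDecoder("utf-16le").decode(utf16Bytes);

console.log(JSON.stringify(misdecoded.slice(0, 6))); // "<\u0000?\u0000x\u0000"
console.log(decoded === sample); // true
```

The mis-decoded string is twice as long as the original, which matches the shape of the reported output.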
LangChainbugReport.zip
Description
When a file is written in the utf16le encoding, there is a problem with TextLoader loading the document in the same format. The current implementation of TextLoader only supports utf8 and doesn't provide a way to override or automatically detect the encoding, unlike the Python version.
Python version of TextLoader: https://github.com/langchain-ai/langchain/blob/b075eab3e0af9a578af80c6e38f869419e770b5c/libs/community/langchain_community/document_loaders/text.py#L46
JavaScript version of TextLoader: langchainjs/langchain/src/document_loaders/fs/text.ts, line 51 in e7a839c
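As a possible workaround until the loader accepts an encoding, the two capabilities mentioned above (an explicit override and automatic detection) could be sketched roughly as follows. This is an assumption-laden illustration, not LangChain's API: detectEncoding and decodeText are hypothetical names, and the BOM-less check is a crude heuristic that only recognizes ASCII-heavy UTF-16LE text.

```typescript
// Hypothetical helper, not part of LangChain: guess an encoding from raw bytes.
function detectEncoding(bytes: Uint8Array): string {
  // UTF-8 byte order mark: EF BB BF.
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return "utf-8";
  }
  // UTF-16LE byte order mark: FF FE.
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) {
    return "utf-16le";
  }
  // Heuristic for BOM-less input: in ASCII-heavy UTF-16LE text,
  // nearly every odd-indexed byte is 0x00.
  const head = bytes.subarray(0, 64);
  const pairs = Math.floor(head.length / 2);
  let zeroHighBytes = 0;
  for (let i = 1; i < head.length; i += 2) {
    if (head[i] === 0) zeroHighBytes++;
  }
  if (pairs > 0 && zeroHighBytes / pairs > 0.9) {
    return "utf-16le";
  }
  return "utf-8"; // fall back to the loader's current behavior
}

// Hypothetical helper: decode with an explicit encoding, or detect one.
function decodeText(bytes: Uint8Array, encoding?: string): string {
  return new TextDecoder(encoding ?? detectEncoding(bytes)).decode(bytes);
}

// "<?" as UTF-16LE without a BOM, then plain UTF-8 "hi".
console.log(decodeText(new Uint8Array([0x3c, 0x00, 0x3f, 0x00]))); // <?
console.log(decodeText(new Uint8Array([0x68, 0x69]))); // hi
```

In Node, the bytes could come from readFile(path) called without an encoding argument, which returns a Buffer (a Uint8Array subclass). A production version would likely rely on a dedicated detection library rather than this heuristic.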
System Info
Node Version: v20.18.2
Platform: macOS Sonoma 14.0
Language: TypeScript