TextLoader is restricted to UTF-8 file encoding format and doesn't support dynamic encoding similar to Python Version #7923

Open
5 tasks done
gangadharrr opened this issue Mar 29, 2025 · 1 comment · May be fixed by #7933
Labels
auto:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@gangadharrr

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain.js documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain.js rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

To reproduce the bug, use the code below:

import { TextLoader } from 'langchain/document_loaders/fs/text';
import { writeFileSync } from 'fs';

async function main() {
    const filePath = "src/sample.xsd"
    const sampleText = `<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:element name="Employee">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Name" type="xs:string"/>
        <xs:element name="Department" type="xs:string"/>
        <xs:element name="Salary" type="xs:decimal"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:integer" use="required"/>
    </xs:complexType>
  </xs:element>

</xs:schema>
`
    writeFileSync(filePath, sampleText, { encoding: "utf16le" })

    const textLoader = new TextLoader(filePath)
    const documents = await textLoader.load();
    console.log([documents[0].pageContent])
}
main()

Error Message and Stack Trace (if applicable)

[
'<\x00?\x00x\x00m\x00l\x00 \x00v\x00e\x00r\x00s\x00i\x00o\x00n\x00=\x00"\x001\x00.\x000\x00"\x00 \x00e\x00n\x00c\x00o\x00d\x00i\x00n\x00g\x00=\x00"\x00U\x00T\x00F\x00-\x008\x00"\x00?\x00>\x00\n' +
'\x00<\x00x\x00s\x00:\x00s\x00c\x00h\x00e\x00m\x00a\x00 \x00x\x00m\x00l\x00n\x00s\x00:\x00x\x00s\x00=\x00"\x00h\x00t\x00t\x00p\x00:\x00/\x00/\x00w\x00w\x00w\x00.\x00w\x003\x00.\x00o\x00r\x00g\x00/\x002\x000\x000\x001\x00/\x00X\x00M\x00L\x00S\x00c\x00h\x00e\x00m\x00a\x00"\x00>\x00\n' +
'\x00\n' +
'\x00 \x00 \x00<\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00E\x00m\x00p\x00l\x00o\x00y\x00e\x00e\x00"\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00c\x00o\x00m\x00p\x00l\x00e\x00x\x00T\x00y\x00p\x00e\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00s\x00e\x00q\x00u\x00e\x00n\x00c\x00e\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00N\x00a\x00m\x00e\x00"\x00 \x00t\x00y\x00p\x00e\x00=\x00"\x00x\x00s\x00:\x00s\x00t\x00r\x00i\x00n\x00g\x00"\x00/\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00D\x00e\x00p\x00a\x00r\x00t\x00m\x00e\x00n\x00t\x00"\x00 \x00t\x00y\x00p\x00e\x00=\x00"\x00x\x00s\x00:\x00s\x00t\x00r\x00i\x00n\x00g\x00"\x00/\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00S\x00a\x00l\x00a\x00r\x00y\x00"\x00 \x00t\x00y\x00p\x00e\x00=\x00"\x00x\x00s\x00:\x00d\x00e\x00c\x00i\x00m\x00a\x00l\x00"\x00/\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00/\x00x\x00s\x00:\x00s\x00e\x00q\x00u\x00e\x00n\x00c\x00e\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00 \x00 \x00<\x00x\x00s\x00:\x00a\x00t\x00t\x00r\x00i\x00b\x00u\x00t\x00e\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00i\x00d\x00"\x00 \x00t\x00y\x00p\x00e\x00=\x00"\x00x\x00s\x00:\x00i\x00n\x00t\x00e\x00g\x00e\x00r\x00"\x00 \x00u\x00s\x00e\x00=\x00"\x00r\x00e\x00q\x00u\x00i\x00r\x00e\x00d\x00"\x00/\x00>\x00\n' +
'\x00 \x00 \x00 \x00 \x00<\x00/\x00x\x00s\x00:\x00c\x00o\x00m\x00p\x00l\x00e\x00x\x00T\x00y\x00p\x00e\x00>\x00\n' +
'\x00 \x00 \x00<\x00/\x00x\x00s\x00:\x00e\x00l\x00e\x00m\x00e\x00n\x00t\x00>\x00\n' +
'\x00\n' +
'\x00<\x00/\x00x\x00s\x00:\x00s\x00c\x00h\x00e\x00m\x00a\x00>\x00\n' +
'\x00'
]

LangChainbugReport.zip

Description

When a file is written with the utf16le encoding, TextLoader cannot load it correctly: the bytes are decoded as UTF-8, so the resulting pageContent contains a NUL character (\x00) after every character, as shown in the output above. The current implementation of TextLoader only supports utf8 and, unlike the Python version, provides no way to override or automatically detect the encoding.

Python version of TextLoader: https://github.com/langchain-ai/langchain/blob/b075eab3e0af9a578af80c6e38f869419e770b5c/libs/community/langchain_community/document_loaders/text.py#L46

JavaScript version of TextLoader:

text = await readFile(this.filePathOrBlob, "utf8");
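
The hard-coded "utf8" argument is enough to explain the \x00 characters in the output above. The following is a minimal sketch that reproduces the effect in memory, without TextLoader or a file on disk; the variable names are illustrative only and are not part of LangChain.js.

```typescript
// Encode a short string as UTF-16LE, the same byte layout that
// writeFileSync(path, text, { encoding: "utf16le" }) produces.
const utf16Bytes: Buffer = Buffer.from("<?xml", "utf16le");

// Decoding those bytes as UTF-8 turns the zero high byte of every
// ASCII character into a separate NUL character.
const decodedAsUtf8: string = utf16Bytes.toString("utf8");

// Decoding with the encoding the bytes were written in restores the text.
const decodedAsUtf16: string = utf16Bytes.toString("utf16le");

console.log(JSON.stringify(decodedAsUtf8));  // "<\u0000?\u0000x\u0000m\u0000l\u0000"
console.log(JSON.stringify(decodedAsUtf16)); // "<?xml"
```

The bytes themselves are intact; only the decoding step is wrong, so a loader that lets the caller choose the encoding would be sufficient for this case.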

System Info

Node Version: v20.18.2
Platform: macOS Sonoma 14.0
Language: TypeScript


dosubot bot commented Mar 29, 2025

Hey @gangadharrr! I'm here to help you with any bugs, questions, or contributions you have for LangChain.js. Let's tackle this together!

The issue occurs because TextLoader in LangChain.js always reads files as UTF-8, so files encoded in UTF-16LE are decoded incorrectly. The current implementation passes "utf8" to readFile and offers no way to override it [1].

To handle different encodings, you can modify the TextLoader to accept an encoding parameter and use it when reading the file. Here's a potential workaround:

  1. Modify the TextLoader class to accept an encoding parameter.
  2. Use this parameter in the readFile function to specify the desired encoding.

Here's a basic example of how you might implement this:

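import { readFile } from 'fs/promises';

// Simplified stand-in for the library's TextLoader, with a configurable encoding.
class TextLoader {
  // BufferEncoding is Node's type for supported encoding names ('utf8', 'utf16le', ...).
  constructor(private filePath: string, private encoding: BufferEncoding = 'utf8') {}

  async load() {
    // Passing an encoding makes readFile resolve to a string rather than a Buffer.
    const text = await readFile(this.filePath, this.encoding);
    return [{ pageContent: text }];
  }
}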

// Usage
async function main() {
  const filePath = "src/sample.xsd";
  const textLoader = new TextLoader(filePath, 'utf16le');
  const documents = await textLoader.load();
  console.log([documents[0].pageContent]);
}

main();

This modification allows you to specify the encoding when creating a TextLoader instance, enabling support for UTF-16LE or other encodings.
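
The issue also asks about automatic detection. One simple option, sketched below, is to look for a byte order mark (BOM) and fall back to a caller-supplied encoding when none is present. This is only an illustration: `detectBomEncoding`, `decodeText`, and `loadText` are hypothetical helpers, not part of LangChain.js, and this heuristic is not claimed to match what the Python loader does.

```typescript
import { readFile } from "node:fs/promises";

type SupportedEncoding = "utf8" | "utf16le";

// Hypothetical helper: returns the encoding indicated by a leading
// byte order mark, or null when the data does not start with one.
function detectBomEncoding(bytes: Uint8Array): SupportedEncoding | null {
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return "utf8";
  }
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) {
    return "utf16le";
  }
  return null;
}

// Hypothetical helper: decodes with the BOM-indicated encoding if there is
// one, otherwise with the fallback, and drops a leading U+FEFF character.
function decodeText(bytes: Buffer, fallback: SupportedEncoding = "utf8"): string {
  const encoding = detectBomEncoding(bytes) ?? fallback;
  const text = bytes.toString(encoding);
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

// Hypothetical helper: reads the raw bytes first so that the encoding
// can be decided after inspecting them.
async function loadText(filePath: string, fallback: SupportedEncoding = "utf8"): Promise<string> {
  return decodeText(await readFile(filePath), fallback);
}
```

Note that the file in the reproduction has no BOM (the output starts directly with `<\x00`), so BOM detection alone would not fix it; calling `loadText("src/sample.xsd", "utf16le")` with an explicit fallback would still be needed there.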


@dosubot dosubot bot added the auto:bug Related to a bug, vulnerability, unexpected error with an existing feature label Mar 29, 2025