Translator: Improve translation evaluation in Co-op Translator using cosine similarity #37

Open
1 of 2 tasks
skytin1004 opened this issue Oct 10, 2024 · 0 comments
Labels
enhancement New feature or request translator Related to any changes in the translation-related source files

@skytin1004 (Collaborator)

Describe the feature you'd like

Currently, Co-op Translator identifies translation issues by comparing the number of line breaks in the original and translated content. While this helps flag significant differences, it is not always accurate, especially for longer documents, where the OpenAI model may intentionally add line breaks to improve readability.
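
For illustration, a line-break comparison of this kind might look roughly like the sketch below. This is not the project's actual code: the function name `line_break_mismatch` and the `tolerance` parameter are hypothetical, chosen only to show why an added line break in the translation trips the check even when the meaning is intact.

```python
def line_break_mismatch(original: str, translated: str, tolerance: int = 0) -> bool:
    """Return True when the line-break counts differ by more than `tolerance`.

    Hypothetical helper: the real check in Co-op Translator may differ.
    """
    diff = abs(original.count("\n") - translated.count("\n"))
    return diff > tolerance


original = "Title\n\nFirst paragraph.\n"
# Same content, but the translated paragraph was wrapped onto an extra line.
translated = "Título\n\nPrimer párrafo,\ncontinuación.\n"

# The translation is flagged purely because of the extra line break.
print(line_break_mismatch(original, translated))
```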

I would like to introduce a more sophisticated evaluation method based on cosine similarity. By converting both the original and translated documents into vectors with techniques such as TF-IDF or Doc2Vec, we could measure how semantically similar the two are. If the cosine similarity score is above a chosen threshold (e.g., 0.7), we can treat the translation as likely accurate. If it falls below the threshold, the document can be flagged for further review.
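
A minimal sketch of the scoring and threshold step, assuming the two documents have already been turned into equal-length numeric vectors. How those vectors are produced is left open here; the sketch also assumes the vectorizer places the source and target languages in a comparable space. The names `cosine_similarity` and `needs_review` are illustrative, and treating a score exactly equal to the threshold as passing is an assumption.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # An all-zero vector has no direction; report no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


def needs_review(score: float, threshold: float = 0.7) -> bool:
    """Flag a translation whose score falls below the threshold."""
    return score < threshold


# Placeholder vectors standing in for real document representations.
original_vec = [0.2, 0.8, 0.1]
translated_vec = [0.25, 0.7, 0.05]

score = cosine_similarity(original_vec, translated_vec)
print(f"score={score:.3f} needs_review={needs_review(score)}")
```

The 0.7 default mirrors the example threshold above; in practice it would need tuning against translations that have already been reviewed.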

Problem this feature solves

This feature would provide a more reliable way to assess the quality of translations by comparing the meaning rather than the formatting. It would help in cases where line breaks are not a definitive measure of translation accuracy, ensuring that meaningful translations are not mistakenly flagged as errors due to formatting differences.

Alternatives considered

We initially considered using Azure OpenAI to verify translation quality by sending both the original and translated documents for comparison. However, this approach was discarded because it would be too time-consuming and costly.

Additional context

Embedding techniques such as TF-IDF or Doc2Vec could be integrated into the translation process to generate vector representations of the documents. By calculating cosine similarity between the original and translated content, we can evaluate how closely the translated document retains the original meaning. A similarity score could be displayed along with the translation result, helping reviewers focus on documents with lower scores.
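
One way the score could help reviewers prioritize is sketched below, assuming similarity scores have already been computed per file. The function name `review_queue`, the file names, and the score values are all made up for illustration.

```python
from typing import Dict, List, Tuple


def review_queue(scores: Dict[str, float], threshold: float = 0.7) -> List[Tuple[str, float]]:
    """Return files scoring below the threshold, lowest score first."""
    flagged = [(path, score) for path, score in scores.items() if score < threshold]
    return sorted(flagged, key=lambda item: item[1])


# Illustrative scores only; not real measurements.
scores = {"README.md": 0.91, "docs/setup.md": 0.42, "docs/faq.md": 0.65}

for path, score in review_queue(scores):
    print(f"{path}: similarity {score:.2f} (review suggested)")
```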

Are you willing to submit a pull request to implement this feature?

  • I am willing to submit a pull request

Code of Conduct

  • I agree to follow this project's Code of Conduct

skytin1004 added the "enhancement" (New feature or request) label on Oct 10, 2024
skytin1004 changed the title from "Improving translation evaluation in Co-op Translator using cosine similarity" to "Improve translation evaluation in Co-op Translator using cosine similarity" on Oct 10, 2024
skytin1004 added the "translator" (Related to any changes in the translation-related source files) label on Oct 11, 2024
github-actions bot changed the title from "Improve translation evaluation in Co-op Translator using cosine similarity" to "Translator: Improve translation evaluation in Co-op Translator using cosine similarity" on Oct 11, 2024