
Add Support for Audio File Transcription #745

Open · wants to merge 8 commits into main
Conversation

AryanK1511

This PR fixes #722

This is done by introducing support for uploading audio files. Once an audio file (e.g., MP3) is uploaded, the chatbot transcribes it into text using OpenAI's Whisper API and displays the transcription in the chat. Users can then interact with the chatbot to discuss the content of the audio file, just like the existing feature for converting PDF files to Markdown.

Example Usage

For instance, if you upload an MP3 file of a podcast where someone says "Hello, tell me about the universe," the bot will transcribe the audio into text (e.g., "Hello, tell me about the universe") and add it to the chat. Users can then ask follow-up questions or have a conversation about the transcribed content.

Here's an example video demonstrating the feature. I uploaded an MP3 file where I recorded myself saying "Hello, tell me about the universe." The bot successfully transcribed the content, displayed it in the chat, and allowed further interaction.

Screen.Recording.2024-11-18.at.9.53.31.PM.mov

Implementation Details

As discussed earlier with @humphd in this issue comment, I modularized the implementation to ensure clarity and reusability:

  • A new audioToText() function was created in ai.ts to handle transcription using OpenAI's Whisper API (API Reference).
  • The code structure is now consistent with the way PDF files are processed, ensuring uniformity across the project.

Tasks Completed

  • Add return type to transcribe() function
    Ensured type safety and better maintainability (commit).

  • Add support for audio file uploads
    Updated input handling to accept MP3 and other audio file types (commit).

  • Update file import hooks
    Modified the file import hook to handle audio files in the same way as PDF files, as per discussions with @humphd (commit).

  • Add audioToText() function
    Implemented the transcription logic using OpenAI Whisper (commit).

  • Test multiple file uploads
    Verified that the system works correctly with both audio and PDF files.

  • Ensure backward compatibility
    Tested the application to confirm that existing features remain unaffected by the changes.

src/lib/ai.ts Outdated

const response = await openai.audio.transcriptions.create({
  file,
  model: "whisper-1",
});
Owner

@tarasglek tarasglek Nov 19, 2024

I know @humphd said to get this working with OpenAI. Don't do it this way; just follow the same logic as the other code that uses getSpeechToTextModel.

We should not be hardcoding model names in code; instead, use utility functions to determine model capabilities.

Owner

@tarasglek tarasglek Nov 19, 2024

The way @Amnish04 restructured the code, even if you transcribe an audio file for some OpenRouter model like Claude, it will go and find the free Groq or paid OpenAI Whisper model and use that. This code should reuse the same logic.
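That preference (find a free Groq-hosted Whisper model first, fall back to a paid OpenAI one) can be sketched roughly as follows; the model shape and helper names here are illustrative, not the project's actual API.

```typescript
// Illustrative sketch of the selection logic described above; the model
// shape and helper names are assumptions, not the project's actual API.
type SttModel = { id: string; provider: "groq" | "openai" | string };

const isSpeechToTextModel = (m: SttModel) => m.id.includes("whisper");

// Prefer the free Groq-hosted Whisper model; fall back to any other match.
export function pickSpeechToTextModel(models: SttModel[]): SttModel | undefined {
  const candidates = models.filter(isSpeechToTextModel);
  return candidates.find((m) => m.provider === "groq") ?? candidates[0];
}
```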

Author

@tarasglek So basically, the only change required is that we select the model dynamically instead of hardcoding it, right?

Owner

yes

Owner

and try to reuse the same code path that already handles transcription for voice input

Author

@humphd I'm sorry, I didn't get any time to work on this, but I'll get back to it this Monday. So you can expect the PR for this issue by this upcoming Tuesday.

Author

Hi @tarasglek, @humphd, and @Amnish04,

I’ve started working on this issue again today, but I could really use some clarification on the overall goal here. It seems like there are some differing viewpoints on the approach.

@Amnish04 suggested putting everything inside a hook, which makes the model selection dynamic, but I'm also hearing from @humphd that we should refactor and remove logic from the hooks. This leaves me in a bit of a tough spot, since putting the logic inside hooks would make it difficult to reuse outside of hooks.

Additionally, @tarasglek mentioned avoiding hardcoded model names and following @Amnish04's approach. However, the example @Amnish04 provided earlier in the thread does use hardcoded models:

type TextToSpeechModel = "tts-1" | "tts-1-hd";

I’m finding this a bit confusing, as it appears to contradict some of the guidance I’ve received.

Could you please help clarify the direction we should be taking? I’d appreciate any guidance so I can move forward with the implementation.

Thanks so much!

Collaborator

@humphd humphd Dec 3, 2024

I think this can be done in stages. The difference in what you're hearing from people reflects the fact that we have code in one state, want to get to another ideal state, and in the mean time want to extend it with new features.

If you can see a path forward that builds on what we have, I'd start with that, and the evolution of this code to work outside of hooks (which is ideally what we want to get to), can happen in follow-ups, by you or someone else.

Author

ok this sounds good! Thank you so much for clarifying

Author

I'm going to look into this further and report back with some suggestions and code changes shortly.

@AryanK1511
Author

AryanK1511 commented Dec 6, 2024

Hi @tarasglek and @humphd,

I’ve just pushed some new changes - could you take a look and let me know your thoughts?

In essence, I’ve created a service that dynamically fetches the model, aligning with what @tarasglek initially suggested. The logic has been decoupled from hooks, making it reusable across the application.

As shown in the attached video, the functionality works flawlessly, and there’s no hardcoding of the Whisper model anymore.

Screen.Recording.2024-12-05.at.7.mp4

I think this is the start of the decoupling, and we can slowly build on top of it. Looking forward to your feedback!
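As a rough, self-contained illustration of the service approach described above (getSettings and isSpeechToTextModel are stand-ins for the project's real helpers in settings.ts and ai.ts):

```typescript
// Self-contained sketch of a ModelService that fetches the speech-to-text
// model dynamically. getSettings and isSpeechToTextModel are stand-ins for
// the project's real helpers; only the shape of the idea is shown.
type Model = { id: string };

const isSpeechToTextModel = (m: Model) => /whisper/i.test(m.id);

const getSettings = () => ({
  models: [{ id: "gpt-4o" }, { id: "whisper-1" }] as Model[],
});

export class ModelService {
  // Plain static method: usable anywhere in the app, not tied to a React hook.
  static getSpeechToTextModel(): Model | undefined {
    return getSettings().models.find(isSpeechToTextModel);
  }
}
```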

  if (typeof contents === "string") {
    if (!contents.trim().length) {
      throw new Error("Empty contents", { cause: { code: "EmptyFile" } });
    }
- } else {
+ } else if ("data" in contents && "content" in contents.data) {
Owner

const content = contents?.data?.content

Then using if (content) will take care of both this check and the length check, here and below.

Author

@tarasglek I have changed the file now according to what you suggested

Collaborator

Where did you do it, the code here is still showing the old way?

Author

Maybe I didn't understand the comment properly @humphd. What do you guys want me to do here?

Owner

@tarasglek tarasglek left a comment

This is much better, thank you. Good by me, but it would still be good to get @Amnish04's or @humphd's approval.

Collaborator

@humphd humphd left a comment

Nice how small this change is, great job. I left a few questions and comments to consider.

@@ -218,7 +218,7 @@ function OptionsButton({
  ref={fileInputRef}
  hidden
  onChange={handleFileChange}
- accept="image/*,text/*,.pdf,application/pdf,*.docx,application/vnd.openxmlformats-officedocument.wordprocessingml.document,.json,application/json,application/markdown"
+ accept="image/*,text/*,.pdf,application/pdf,*.docx,application/vnd.openxmlformats-officedocument.wordprocessingml.document,.json,application/json,application/markdown, audio/*"
Collaborator

Do we support all audio types? Let's narrow this so we only accept the ones we can process.
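One way to narrow it, as a sketch: whitelist only the formats OpenAI's transcription endpoint documents accepting (mp3, mp4, mpeg, mpga, m4a, wav, webm). The exact list the PR settled on may differ.

```typescript
// Sketch of a narrowed accept list instead of audio/*; the exact set of
// types the PR settled on may differ. Based on the audio formats OpenAI
// documents for transcription.
const SUPPORTED_AUDIO_TYPES = [
  "audio/mpeg", // .mp3, .mpga
  "audio/mp4",  // .m4a
  "audio/wav",
  "audio/webm",
];

// Value for the file input's `accept` attribute.
export const audioAccept = SUPPORTED_AUDIO_TYPES.join(",");

export const isSupportedAudioFile = (file: { type: string }) =>
  SUPPORTED_AUDIO_TYPES.includes(file.type);
```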

Author

Okay I have done that now @humphd

  if (typeof contents === "string") {
    if (!contents.trim().length) {
      throw new Error("Empty contents", { cause: { code: "EmptyFile" } });
    }
- } else {
+ } else if ("data" in contents && "content" in contents.data) {
Collaborator

Where did you do it, the code here is still showing the old way?

import { getSettings } from "./settings";
import { isSpeechToTextModel } from "./ai";

export class ModelService {
Collaborator

This is an interesting idea. We should add other methods later.

Author

Agreed!

const settings = getSettings();
const provider = settings.currentProvider;

if (!provider.apiKey) {
Collaborator

Not all providers require an API key
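A sketch of the guard this review point suggests; `requiresApiKey` is an assumed field, and the project's real provider type may express this differently.

```typescript
// Sketch of the review point above: only treat a missing API key as an
// error for providers that actually require one. `requiresApiKey` is an
// assumed field; the project's real provider type may differ.
type Provider = { name: string; apiKey?: string; requiresApiKey: boolean };

export function assertProviderUsable(provider: Provider): void {
  if (provider.requiresApiKey && !provider.apiKey) {
    throw new Error(`${provider.name} requires an API key`);
  }
}
```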

Author

@AryanK1511 AryanK1511 left a comment

Can you review this once, @humphd?

@humphd
Collaborator

humphd commented Dec 12, 2024

@AryanK1511 I tested this locally, and it's working well. Can you please rebase on main and push so I can land it? Excited to ship this.

Successfully merging this pull request may close these issues.

Support importing/pasting/dropping audio files
4 participants