Add Support for Audio File Transcription #745
base: main
Conversation
src/lib/ai.ts (outdated)
```ts
const response = await openai.audio.transcriptions.create({
  file,
  model: "whisper-1",
```
I know @humphd said to get this working with OpenAI, but don't do it this way; follow the same logic as the other code that uses `getSpeechToTextModel`. We should not be hardcoding model names in code; instead, we should use utility functions to determine model capabilities.
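For illustration, a rough sketch of the direction being asked for here, assuming the `getSpeechToTextModel` utility referenced above resolves the model to use (its signature and return shape are assumptions, not the repo's actual API):

```ts
import OpenAI from "openai";

// The existing utility the review references; its signature is assumed here.
declare function getSpeechToTextModel(): Promise<{ id: string }>;

// Sketch: resolve the speech-to-text model dynamically instead of
// hardcoding "whisper-1".
async function transcribeFile(openai: OpenAI, file: File): Promise<string> {
  const sttModel = await getSpeechToTextModel();
  const response = await openai.audio.transcriptions.create({
    file,
    model: sttModel.id,
  });
  return response.text;
}
```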
The way @Amnish04 restructured the code, even if you transcribe an audio file while using some OpenRouter model like Claude, it will go and find the free Groq or paid OpenAI Whisper model and use that. This code should reuse the same logic.
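A sketch of that fallback idea (the provider/model shapes and the free-vs-paid flag are assumptions for illustration):

```ts
// Sketch: pick a provider that can do speech-to-text, preferring a free
// one (e.g. Groq) over a paid one (e.g. OpenAI), regardless of which
// chat model (e.g. Claude via OpenRouter) is currently selected.
type Model = { id: string };
type Provider = { name: string; isFree: boolean; models: Model[] };

const isSpeechToTextModel = (m: Model) => /whisper/i.test(m.id); // stub

function pickTranscriptionProvider(providers: Provider[]): Provider | undefined {
  return (
    providers.find((p) => p.isFree && p.models.some(isSpeechToTextModel)) ??
    providers.find((p) => p.models.some(isSpeechToTextModel))
  );
}
```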
@tarasglek So basically, the only change required is that we somehow select the model dynamically instead of hardcoding it, right?
Yes.
And try to reuse the same codepath that already handles transcription for voice input.
@humphd I'm sorry, I didn't get any time to work on this, but I'll get back to it this Monday, so you can expect the PR for this issue by this upcoming Tuesday.
Hi @tarasglek, @humphd, and @Amnish04,

I've started working on this issue again today, but I could really use some clarification on the overall goal here. It seems like there are some differing viewpoints on the approach.

@Amnish04 suggested putting everything inside a hook, which makes the model selection dynamic, but I'm also hearing from @humphd that we should refactor and move logic out of the hooks. This leaves me in a bit of a tough spot, since logic that lives inside hooks is difficult to reuse outside of them.

Additionally, @tarasglek mentioned avoiding hardcoded models and following @Amnish04's approach. However, the example @Amnish04 provided earlier in the thread does use hardcoded models:

```ts
type TextToSpeechModel = "tts-1" | "tts-1-hd";
```

I'm finding this a bit confusing, as it appears to contradict some of the guidance I've received.
Could you please help clarify the direction we should be taking? I’d appreciate any guidance so I can move forward with the implementation.
Thanks so much!
I think this can be done in stages. The difference in what you're hearing from people reflects the fact that we have code in one state, we want to get to another, ideal state, and in the meantime we want to extend it with new features.

If you can see a path forward that builds on what we have, I'd start with that, and the evolution of this code to work outside of hooks (which is ideally what we want to get to) can happen in follow-ups, by you or someone else.
OK, this sounds good! Thank you so much for clarifying.
I'm going to look into this further and report back with some suggestions and code changes shortly.

Hi @tarasglek and @humphd, I've just pushed some new changes - could you take a look and let me know your thoughts? In essence, I've created a service that dynamically fetches the model, aligning with what @tarasglek initially suggested. The logic has been decoupled from hooks, making it reusable across the application. As shown in the attached video, the functionality works flawlessly, and there's no hardcoding of the Whisper model anymore.

Screen.Recording.2024-12-05.at.7.mp4

This is the start of the decoupling, I think, and we can slowly build on top of it. Looking forward to your feedback!
```diff
 if (typeof contents === "string") {
   if (!contents.trim().length) {
     throw new Error("Empty contents", { cause: { code: "EmptyFile" } });
   }
-} else {
+} else if ("data" in contents && "content" in contents.data) {
```
`const content = contents?.data?.content` and then `if (content)` will take care of this check and the length checks here and below.
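Something like this, as a sketch (the separate `typeof contents === "string"` branch would still keep its trim check):

```ts
// Sketch of the suggested simplification: one optional chain plus a
// single truthiness check replaces the "in" checks and length checks.
const content = contents?.data?.content;
if (!content) {
  throw new Error("Empty contents", { cause: { code: "EmptyFile" } });
}
```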
@tarasglek I have changed the file now according to what you suggested.
Where did you do it? The code here is still showing the old way.
Maybe I didn't understand the comment properly, @humphd. What do you guys want me to do here?
Nice how small this change is, great job. I left a few questions and comments to consider.
src/components/OptionsButton.tsx (outdated)
```diff
@@ -218,7 +218,7 @@ function OptionsButton({
   ref={fileInputRef}
   hidden
   onChange={handleFileChange}
-  accept="image/*,text/*,.pdf,application/pdf,*.docx,application/vnd.openxmlformats-officedocument.wordprocessingml.document,.json,application/json,application/markdown"
+  accept="image/*,text/*,.pdf,application/pdf,*.docx,application/vnd.openxmlformats-officedocument.wordprocessingml.document,.json,application/json,application/markdown, audio/*"
```
Do we support all audio types? Let's narrow this so we only accept the ones we can process.
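For example, something along these lines; the exact set is an assumption and should match whatever formats the transcription backend accepts:

```ts
// Sketch: accept only specific audio formats instead of the catch-all
// audio/*. OpenAI's Whisper endpoint documents support for mp3, mp4,
// mpeg, mpga, m4a, wav, and webm.
const ACCEPTED_AUDIO_TYPES = [
  "audio/mpeg", // .mp3
  "audio/mp4",  // .m4a / .mp4
  "audio/wav",
  "audio/webm",
].join(",");

// e.g. <input accept={`image/*,text/*,.pdf,${ACCEPTED_AUDIO_TYPES}`} ... />
```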
Okay, I have done that now, @humphd.
```ts
import { getSettings } from "./settings";
import { isSpeechToTextModel } from "./ai";

export class ModelService {
```
This is an interesting idea. We should add other methods later.
Agreed!
```ts
const settings = getSettings();
const provider = settings.currentProvider;

if (!provider.apiKey) {
```
Not all providers require an API key.
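One way to express that, assuming the provider object can say whether it needs a key (the `requiresApiKey` flag is hypothetical):

```ts
// Sketch: only treat a missing API key as an error when the current
// provider actually requires one (e.g. a local provider may not).
if (provider.requiresApiKey && !provider.apiKey) {
  throw new Error(`Missing API key for provider: ${provider.name}`);
}
```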
Can you review this once, @humphd?
@AryanK1511 tested this locally, working well. Can you please rebase on `main`?
This PR fixes #722
This is done by introducing support for uploading audio files. Once an audio file (e.g., MP3) is uploaded, the chatbot transcribes it into text using OpenAI's Whisper API and displays the transcription in the chat. Users can then interact with the chatbot to discuss the content of the audio file, just like the existing feature for converting PDF files to Markdown.
Example Usage
For instance, if you upload an MP3 file of a podcast where someone says "Hello, tell me about the universe," the bot will transcribe the audio into text (e.g., "Hello, tell me about the universe") and add it to the chat. Users can then ask follow-up questions or have a conversation about the transcribed content.
Here's an example video demonstrating the feature. I uploaded an MP3 file where I recorded myself saying "Hello, tell me about the universe." The bot successfully transcribed the content, displayed it in the chat, and allowed further interaction.
Screen.Recording.2024-11-18.at.9.53.31.PM.mov
Implementation Details
As discussed earlier with @humphd in this issue comment, I modularized the implementation to ensure clarity and reusability: an `audioToText()` function was created in `ai.ts` to handle transcription using OpenAI's Whisper API (API Reference).
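For reference, a minimal sketch of what `audioToText()` might look like, assuming the official `openai` SDK; the client setup and model handling shown here are simplified assumptions, not the PR's exact code:

```ts
import OpenAI from "openai";

// Sketch only: in the PR, the client configuration and the model id
// come from the app's settings/service layer.
export async function audioToText(audio: File, model = "whisper-1"): Promise<string> {
  const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  const transcription = await openai.audio.transcriptions.create({
    file: audio,
    model, // resolved dynamically in later revisions of this PR
  });
  return transcription.text;
}
```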
Tasks Completed

- Add return type to `transcribe()` function: ensured type safety and better maintainability (commit).
- Add support for audio file uploads: updated input handling to accept MP3 and other audio file types (commit).
- Update file import hooks: modified the file import hook to handle audio files in the same way as PDF files, as per discussions with @humphd (commit).
- Add `audioToText()` function: implemented the transcription logic using OpenAI Whisper (commit).
- Test multiple file uploads: verified that the system works correctly with both audio and PDF files.
- Ensure backward compatibility: tested the application to confirm that existing features remain unaffected by the changes.