Support importing/pasting/dropping audio files #722
Comments
I can pick this one up. Are you able to assign it to me so I can start working on it? |
@AryanK1511 it's all yours |
@humphd, just to confirm my understanding of this issue: Are we aiming to add the ability to attach, paste, or drop audio files into the chatbot, have their content transcribed into text, and then include that text in a new message? For example, if I attach a pre-recorded audio file saying, "Tell me about the universe," the chatbot would transcribe it and include the text as part of the conversation. Is that correct? |
Correct. It should work for the audio file types that we can natively send to the LLMs. Later we could add a step to transcode if necessary, but let's not start with that. So if I have a podcast MP3, I can attach this file and chat with it (the transcript will be injected into the chat for me). |
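To make "audio file types that we can natively send" concrete, here is a minimal sketch of an up-front check. The helper name `isNativelySupportedAudio` and the `NamedFile` type are hypothetical, not from the codebase. The extension list is the one given in OpenAI's speech-to-text guide at the time of writing; the API may accept more formats, so treat the list as an assumption to verify.

```typescript
// Extensions listed as accepted uploads in OpenAI's speech-to-text guide
// (assumption: verify against the current docs before relying on this list).
const WHISPER_EXTENSIONS: string[] = ["mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"];

// Structural stand-in for the browser File type, so the sketch runs anywhere.
interface NamedFile {
  name: string;
  type: string; // MIME type; may be empty for some drops/pastes
}

// Hypothetical helper: true when the file's extension is one we could send
// for transcription without transcoding first. Only the extension is checked.
function isNativelySupportedAudio(file: NamedFile): boolean {
  const dot = file.name.lastIndexOf(".");
  if (dot < 0) {
    return false; // no extension, so we cannot tell
  }
  const ext = file.name.slice(dot + 1).toLowerCase();
  return WHISPER_EXTENSIONS.indexOf(ext) !== -1;
}
```

Files that fail this check would be the candidates for the later transcoding step mentioned above.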
Adding the file will automatically process it into the chat. Try adding a .js or .pdf. Same as those.
On Mon, Nov 18, 2024 at 10:10 AM Aryan Khurana wrote:
> @humphd I see! So for the user, they will attach a file, chat with it, and then when they hit send, they will see the transcript and their prompt in the chat, right?
|
Hi @humphd,

After spending a few hours digging into this codebase, I managed to fix the issue. However, the solution I’ve come up with feels quite hacky. I’d like to explain my approach, why I believe the current structure makes a clean solution difficult, and seek your guidance on how to improve it.

Background

The issue involves handling audio files similarly to how PDFs are processed in the

My Approach for Audio Files

To implement audio file handling, the ideal approach would be to mimic the PDF workflow by creating an

    if (file.type.startsWith("audio/")) {
      const contents = await audioToText(file);
      assertContents(contents);
      return contents;
    }

However, since the project already includes functionality for transcribing audio in

    async transcribe(audio: File): Promise<string> {
      const transcriptions = new OpenAI.Audio.Transcriptions(this._openai);
      const transcription = await transcriptions.create({
        file: audio,
        model: this._sttModel,
      });
      return transcription.text;
    }

The Problem

Using

    constructor(sttModel: string, openai: OpenAI) {
      this._sttModel = sttModel;
      this._openai = openai;
    }

This initialization is done in

    const { getSpeechToTextClient, isSpeechToTextSupported, allProvidersWithModels } = useModels();
    const sttClient = await getSpeechToTextClient();
    const sttProvider = allProvidersWithModels.find((p) => p.apiUrl === sttClient.baseURL);
    const sttModel = sttProvider?.models.find((model) => isSpeechToTextModel(model.name))?.name;
    speechRecognitionRef.current = new SpeechRecognition(sttModel, sttClient);

The issue is that hooks like

Current Solution

Currently, I bypass the clean integration by doing the following:
This approach works, as shown in the attached video (Screen.Recording.2024-11-18.at.8.18.38.PM.mov).

However, it introduces two significant problems:

Request for Guidance

I believe the current structure of the code makes it difficult to implement audio file handling in a clean way. How would you suggest proceeding? Would you like me to open a PR with my current changes so we can work collaboratively on refining the solution? Note that I’ve made a few other changes to ensure the input works as expected. Thank you for your guidance! |
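One way to picture a cleaner shape for the problem described above is to have the import logic receive its transcriber as an argument instead of constructing it, so nothing in the import path needs a hook. This is only a sketch under assumptions: `ImportableFile`, `ImportDeps`, and `importFileAsText` are hypothetical names, and the fallback branch stands in for the project's real PDF and text handling.

```typescript
// Minimal structural stand-in for the browser File type (hypothetical).
interface ImportableFile {
  name: string;
  type: string; // MIME type
  text(): Promise<string>;
}

// Services the importer needs but should not build itself (hypothetical).
interface ImportDeps {
  transcribe(audio: ImportableFile): Promise<string>;
}

// Hypothetical importer: audio goes to the injected transcriber,
// everything else falls back to reading the file as plain text.
async function importFileAsText(file: ImportableFile, deps: ImportDeps): Promise<string> {
  if (file.type.startsWith("audio/")) {
    const contents = await deps.transcribe(file);
    if (!contents.trim()) {
      throw new Error(`No transcript produced for ${file.name}`);
    }
    return contents;
  }
  return file.text();
}
```

A component that already has hook access could then pass something like `{ transcribe: (f) => speechRecognition.transcribe(f) }` when it calls the importer. That wiring is an illustration, not the project's actual code.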
Yeah, that's not ideal. We are going to need to refactor the logic for dealing with AI out of hooks, so background processes like this can do it as well. For now, let's make this work with OpenAI and whisper, per https://platform.openai.com/docs/guides/speech-to-text. You can use I'd make this work at all, then we can figure out how to make it work in general. To be honest, the TTS stuff is kind of separate from our other Chat model handling, so we should probably extract it out similar to what we're doing with Jina.ai. cc'ing @Amnish04, who might have other thoughts. |
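As a rough sketch of the "OpenAI and whisper only" starting point, the helper below hard-codes the model instead of looking it up through hooks. `WhisperClient` and `transcribeWithWhisper` are hypothetical names. The interface is meant to mirror the `create({ file, model })` call and `.text` result used by the existing `transcribe` method quoted earlier in this thread, and `"whisper-1"` is the model name from the linked speech-to-text guide. Whether the real OpenAI client satisfies this exact type is an assumption to check.

```typescript
// Structural slice of an OpenAI-style client that this sketch relies on
// (hypothetical type; assumed to match the SDK's transcription call).
interface WhisperClient {
  audio: {
    transcriptions: {
      create(params: { file: unknown; model: string }): Promise<{ text: string }>;
    };
  };
}

// Model name taken from the OpenAI speech-to-text guide.
const DEFAULT_STT_MODEL = "whisper-1";

// Hypothetical helper: transcribe one audio file with a fixed default model,
// so callers outside React components do not need provider/model lookups.
async function transcribeWithWhisper(
  client: WhisperClient,
  audio: unknown,
  model: string = DEFAULT_STT_MODEL
): Promise<string> {
  const transcription = await client.audio.transcriptions.create({ file: audio, model });
  return transcription.text;
}
```

Hard-coding the model keeps this first version simple; the general, multi-provider version can replace the default once the AI logic is moved out of hooks.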
Sounds good to me 🫡 Lemme make those changes and send in a PR |
@humphd Everything works perfectly now and the code is very consistent too. You can have a look at the PR that I just sent in and lemme know if everything looks good |
We just added support for more file types when you attach/paste/drop them. We also have support for turning audio into text, see (src/lib/speech-recognition.ts). Let's add support for importing audio files, which converts the audio to text and includes it in a new message.