
Add Support for Audio File Transcription #745

Open · wants to merge 8 commits into main
Conversation

AryanK1511

This PR fixes #722

This is done by introducing support for uploading audio files. Once an audio file (e.g., MP3) is uploaded, the chatbot transcribes it into text using OpenAI's Whisper API and displays the transcription in the chat. Users can then interact with the chatbot to discuss the content of the audio file, just like the existing feature for converting PDF files to Markdown.

Example Usage

For instance, if you upload an MP3 file of a podcast where someone says "Hello, tell me about the universe," the bot will transcribe the audio into text (e.g., "Hello, tell me about the universe") and add it to the chat. Users can then ask follow-up questions or have a conversation about the transcribed content.

Here's an example video demonstrating the feature. I uploaded an MP3 file where I recorded myself saying "Hello, tell me about the universe." The bot successfully transcribed the content, displayed it in the chat, and allowed further interaction.

Screen.Recording.2024-11-18.at.9.53.31.PM.mov

Implementation Details

As discussed earlier with @humphd in this issue comment, I modularized the implementation to ensure clarity and reusability:

  • A new audioToText() function was created in ai.ts to handle transcription using OpenAI's Whisper API (API Reference).
  • The code structure is now consistent with the way PDF files are processed, ensuring uniformity across the project.

Tasks Completed

  • Add return type to transcribe() function
    Ensured type safety and better maintainability (commit).

  • Add support for audio file uploads
    Updated input handling to accept MP3 and other audio file types (commit).

  • Update file import hooks
    Modified the file import hook to handle audio files in the same way as PDF files, as per discussions with @humphd (commit).

  • Add audioToText() function
    Implemented the transcription logic using OpenAI Whisper (commit).

  • Test multiple file uploads
    Verified that the system works correctly with both audio and PDF files.

  • Ensure backward compatibility
    Tested the application to confirm that existing features remain unaffected by the changes.

src/lib/ai.ts Outdated

const response = await openai.audio.transcriptions.create({
  file,
  model: "whisper-1",
});
Owner

@tarasglek tarasglek Nov 19, 2024

I know @humphd said to get this working with OpenAI. Don't do it this way; just follow the same logic as the other code that uses getSpeechToTextModel.

We should not be hardcoding model names in code; instead, use utility functions to determine model capabilities.

Owner

@tarasglek tarasglek Nov 19, 2024

The way @Amnish04 restructured the code, even if you transcribe an audio file for some OpenRouter model like Claude, it will go and find the free Groq or paid OpenAI Whisper model and use that. This code should reuse the same logic.
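That preference (find a free Groq-hosted Whisper model first, fall back to a paid OpenAI one) can be sketched roughly as follows; the model shape and helper names here are illustrative, not the project's actual API.

```typescript
// Illustrative sketch of the selection logic described above; the model
// shape and helper names are assumptions, not the project's actual API.
type SttModel = { id: string; provider: "groq" | "openai" | string };

const isSpeechToTextModel = (m: SttModel) => m.id.includes("whisper");

// Prefer the free Groq-hosted Whisper model; fall back to any other match.
export function pickSpeechToTextModel(models: SttModel[]): SttModel | undefined {
  const candidates = models.filter(isSpeechToTextModel);
  return candidates.find((m) => m.provider === "groq") ?? candidates[0];
}
```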

Author

@tarasglek So basically, the only change required is that we select the model dynamically instead of hardcoding it, right?

Owner

yes

Owner

and try to reuse the same code path that already handles transcription for voice input

Author

@humphd I'm sorry, I didn't get any time to work on this, but I'll get back to it this Monday. So you can expect the PR for this issue by this upcoming Tuesday.

Author

Hi @tarasglek, @humphd, and @Amnish04,

I’ve started working on this issue again today, but I could really use some clarification on the overall goal here. It seems like there are some differing viewpoints on the approach.

@Amnish04 suggested putting everything inside a hook, which makes the model selection dynamic, but I'm also hearing from @humphd that we should refactor and remove logic from the hooks. This leaves me in a bit of a tough spot, since putting the logic inside hooks would make it difficult to reuse outside of hooks.

Additionally, @tarasglek mentioned avoiding hardcoded model names and following @Amnish04's approach. However, the example @Amnish04 provided earlier in the thread does use hardcoded models:

type TextToSpeechModel = "tts-1" | "tts-1-hd";

I’m finding this a bit confusing, as it appears to contradict some of the guidance I’ve received.

Could you please help clarify the direction we should be taking? I’d appreciate any guidance so I can move forward with the implementation.

Thanks so much!

Collaborator

@humphd humphd Dec 3, 2024

I think this can be done in stages. The difference in what you're hearing from people reflects the fact that we have code in one state, want to get to another ideal state, and in the mean time want to extend it with new features.

If you can see a path forward that builds on what we have, I'd start with that, and the evolution of this code to work outside of hooks (which is ideally what we want to get to), can happen in follow-ups, by you or someone else.

Author

ok this sounds good! Thank you so much for clarifying

Author

I'm going to look into this further and report back with some suggestions and code changes shortly.

@AryanK1511
Author

AryanK1511 commented Dec 6, 2024

Hi @tarasglek and @humphd,

I’ve just pushed some new changes - could you take a look and let me know your thoughts?

In essence, I’ve created a service that dynamically fetches the model, aligning with what @tarasglek initially suggested. The logic has been decoupled from hooks, making it reusable across the application.

As shown in the attached video, the functionality works flawlessly, and there’s no hardcoding of the Whisper model anymore.

Screen.Recording.2024-12-05.at.7.mp4

I think this is the start of the decoupling, and we can slowly build on top of it. Looking forward to your feedback!
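As a rough, self-contained illustration of the service approach described above (getSettings and isSpeechToTextModel are stand-ins for the project's real helpers in settings.ts and ai.ts):

```typescript
// Self-contained sketch of a ModelService that fetches the speech-to-text
// model dynamically. getSettings and isSpeechToTextModel are stand-ins for
// the project's real helpers; only the shape of the idea is shown.
type Model = { id: string };

const isSpeechToTextModel = (m: Model) => /whisper/i.test(m.id);

const getSettings = () => ({
  models: [{ id: "gpt-4o" }, { id: "whisper-1" }] as Model[],
});

export class ModelService {
  // Plain static method: usable anywhere in the app, not tied to a React hook.
  static getSpeechToTextModel(): Model | undefined {
    return getSettings().models.find(isSpeechToTextModel);
  }
}
```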

  if (typeof contents === "string") {
    if (!contents.trim().length) {
      throw new Error("Empty contents", { cause: { code: "EmptyFile" } });
    }
- } else {
+ } else if ("data" in contents && "content" in contents.data) {
Owner

const content = contents?.data?.content

Then using if (content) will take care of both this check and the length check, here and below.

Author

@tarasglek I have changed the file now according to what you suggested

Collaborator

Where did you do it, the code here is still showing the old way?

Author

Maybe I didn't understand the comment properly @humphd. What do you guys want me to do here?

Owner

@tarasglek tarasglek left a comment

This is much better, thank you. Good by me, but it would still be good to get @Amnish04's or @humphd's approval.

Collaborator

@humphd humphd left a comment

Nice how small this change is, great job. I left a few questions and comments to consider.

@@ -218,7 +218,7 @@ function OptionsButton({
  ref={fileInputRef}
  hidden
  onChange={handleFileChange}
- accept="image/*,text/*,.pdf,application/pdf,*.docx,application/vnd.openxmlformats-officedocument.wordprocessingml.document,.json,application/json,application/markdown"
+ accept="image/*,text/*,.pdf,application/pdf,*.docx,application/vnd.openxmlformats-officedocument.wordprocessingml.document,.json,application/json,application/markdown, audio/*"
Collaborator

Do we support all audio types? Let's narrow this so we only accept the ones we can process.
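One way to narrow it, as a sketch: whitelist only the formats OpenAI's transcription endpoint documents accepting (mp3, mp4, mpeg, mpga, m4a, wav, webm). The exact list the PR settled on may differ.

```typescript
// Sketch of a narrowed accept list instead of audio/*; the exact set of
// types the PR settled on may differ. Based on the audio formats OpenAI
// documents for transcription.
const SUPPORTED_AUDIO_TYPES = [
  "audio/mpeg", // .mp3, .mpga
  "audio/mp4",  // .m4a
  "audio/wav",
  "audio/webm",
];

// Value for the file input's `accept` attribute.
export const audioAccept = SUPPORTED_AUDIO_TYPES.join(",");

export const isSupportedAudioFile = (file: { type: string }) =>
  SUPPORTED_AUDIO_TYPES.includes(file.type);
```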

Author

Okay I have done that now @humphd

  if (typeof contents === "string") {
    if (!contents.trim().length) {
      throw new Error("Empty contents", { cause: { code: "EmptyFile" } });
    }
- } else {
+ } else if ("data" in contents && "content" in contents.data) {
Collaborator

Where did you do it, the code here is still showing the old way?

import { getSettings } from "./settings";
import { isSpeechToTextModel } from "./ai";

export class ModelService {
Collaborator

This is an interesting idea. We should add other methods later.

Author

Agreed!

const settings = getSettings();
const provider = settings.currentProvider;

if (!provider.apiKey) {
Collaborator

Not all providers require an API key
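A sketch of the guard this review point suggests; `requiresApiKey` is an assumed field, and the project's real provider type may express this differently.

```typescript
// Sketch of the review point above: only treat a missing API key as an
// error for providers that actually require one. `requiresApiKey` is an
// assumed field; the project's real provider type may differ.
type Provider = { name: string; apiKey?: string; requiresApiKey: boolean };

export function assertProviderUsable(provider: Provider): void {
  if (provider.requiresApiKey && !provider.apiKey) {
    throw new Error(`${provider.name} requires an API key`);
  }
}
```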

Author

@AryanK1511 AryanK1511 left a comment

Can you review this once, @humphd?

@humphd
Collaborator

humphd commented Dec 12, 2024

@AryanK1511 I tested this locally, and it's working well. Can you please rebase on main and push so I can land it? Excited to ship this.

Successfully merging this pull request may close these issues.

Support importing/pasting/dropping audio files
4 participants