Add Support for Audio File Transcription #745

Open · wants to merge 8 commits into main
2 changes: 1 addition & 1 deletion src/components/OptionsButton.tsx
@@ -218,7 +218,7 @@ function OptionsButton({
ref={fileInputRef}
hidden
onChange={handleFileChange}
accept="image/*,text/*,.pdf,application/pdf,*.docx,application/vnd.openxmlformats-officedocument.wordprocessingml.document,.json,application/json,application/markdown"
accept="image/*,text/*,.pdf,application/pdf,*.docx,application/vnd.openxmlformats-officedocument.wordprocessingml.document,.json,application/json,application/markdown, audio/*"
Collaborator:

Do we support all audio types? Let's narrow this so we only accept the ones we can process.

Author:

Okay I have done that now @humphd

/>
<MenuItem icon={<BsPaperclip />} onClick={handleAttachFiles}>
Attach Files...
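On the reviewer's request above to narrow the accepted audio types: if the backend is OpenAI's transcription endpoint, the documented input formats are mp3, mp4, mpeg, mpga, m4a, wav, and webm. An illustrative narrowed list follows; the exact set should match whatever backend ChatCraft actually transcribes with:

```typescript
// Illustrative only: restrict the file picker to audio formats the
// OpenAI transcription endpoint documents as supported, instead of
// the catch-all "audio/*".
const AUDIO_ACCEPT = [
  "audio/mpeg", // .mp3, .mpga
  "audio/mp4",  // .m4a
  "audio/wav",
  "audio/webm",
].join(",");

// Usage sketch: accept={`image/*,text/*,.pdf,${AUDIO_ACCEPT}`}
```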
45 changes: 34 additions & 11 deletions src/hooks/use-file-import.tsx
@@ -2,7 +2,12 @@ import { useCallback } from "react";
import { useAlert } from "./use-alert";
import { ChatCraftChat } from "../lib/ChatCraftChat";
import { ChatCraftHumanMessage } from "../lib/ChatCraftMessage";
import { type JinaAiReaderResponse, pdfToMarkdown } from "../lib/ai";
import {
audioToText,
type JinaAiReaderResponse,
type OpenAISpeechToTextResponse,
pdfToMarkdown,
} from "../lib/ai";
import { compressImageToBase64, formatAsCodeBlock } from "../lib/utils";
import { getSettings } from "../lib/settings";

@@ -118,22 +123,31 @@ function formatTextContent(filename: string, type: string, content: string): str
}

// Makes sure that the contents are non-empty
function assertContents(contents: string | JinaAiReaderResponse) {
function assertContents(contents: string | JinaAiReaderResponse | OpenAISpeechToTextResponse) {
let content: string | undefined;

if (typeof contents === "string") {
if (!contents.trim().length) {
throw new Error("Empty contents", { cause: { code: "EmptyFile" } });
}
} else {
if (!contents.data.content.trim().length) {
throw new Error("Empty contents", { cause: { code: "EmptyFile" } });
}
content = contents;
} else if ("data" in contents && "content" in contents.data) {
Owner:

const content = contents?.data?.content

then use if (content) will take care of this and length check here and below

Author:

@tarasglek I have changed the file now according to what you suggested

Collaborator:

Where did you do it, the code here is still showing the old way?

Author:

Maybe I didn't understand the comment properly @humphd. What do you guys want me to do here?

content = contents.data.content;
} else if ("text" in contents) {
content = contents.text;
}

if (!content?.trim().length) {
throw new Error("Empty contents", { cause: { code: "EmptyFile" } });
}

if (!content) {
throw new Error("Unknown content type", { cause: { code: "InvalidContentType" } });
}
}

async function processFile(
file: File,
settings: ReturnType<typeof getSettings>
): Promise<string | JinaAiReaderResponse> {
): Promise<string | JinaAiReaderResponse | OpenAISpeechToTextResponse> {
console.log(file.type);
if (file.type.startsWith("image/")) {
return await compressImageToBase64(file, {
compressionFactor: settings.compressionFactor,
@@ -142,6 +156,12 @@ async function processFile(
});
}

if (file.type.startsWith("audio/")) {
const contents = await audioToText(file);
assertContents(contents);
return contents;
}

if (file.type === "application/pdf") {
const contents = await pdfToMarkdown(file);
assertContents(contents);
@@ -180,13 +200,16 @@ export function useFileImport({ chat, onImageImport }: UseFileImportOptions) {
const settings = getSettings();

const importFile = useCallback(
(file: File, contents: string | JinaAiReaderResponse) => {
async (file: File, contents: string | JinaAiReaderResponse | OpenAISpeechToTextResponse) => {
if (file.type.startsWith("image/")) {
const base64 = contents as string;
onImageImport(base64);
} else if (file.type === "application/pdf") {
const document = (contents as JinaAiReaderResponse).data;
chat.addMessage(new ChatCraftHumanMessage({ text: `${document.content}\n` }));
} else if (file.type.startsWith("audio/")) {
const document = (contents as OpenAISpeechToTextResponse).text;
chat.addMessage(new ChatCraftHumanMessage({ text: `${document}\n` }));
} else if (
file.type === "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
) {
41 changes: 41 additions & 0 deletions src/lib/ai.ts
@@ -11,6 +11,8 @@ import {
import { ChatCraftModel } from "./ChatCraftModel";
import { getSettings } from "./settings";
import { usingOfficialOpenAI } from "./providers";
import { ModelService } from "./model-service";
import { SpeechRecognition } from "./speech-recognition";

export type ChatOptions = {
model?: ChatCraftModel;
@@ -464,6 +466,45 @@ export function isChatModel(model: string): boolean {
);
}

export type OpenAISpeechToTextResponse = {
text: string;
};

/**
* Convert an audio file to text
 */
export async function audioToText(file: File): Promise<OpenAISpeechToTextResponse> {
const settings = getSettings();
const currentProvider = settings.currentProvider;

if (!currentProvider.apiKey) {
throw new Error("Missing API Key");
}

const sttClient = await ModelService.getSpeechToTextClient();

if (!sttClient) {
throw new Error("No STT client available");
}

const sttModel = await ModelService.getSpeechToTextModel(currentProvider);

if (!sttModel) {
throw new Error(`No speech-to-text model found for provider ${currentProvider.name}`);
}

const recognition = new SpeechRecognition(sttModel, sttClient);

try {
const text = await recognition.transcribe(file);
return { text };
} catch (error) {
console.error("Error transcribing audio:", error);
throw error;
}
}

export type JinaAiReaderResponse = {
code: number;
status: number;
33 changes: 33 additions & 0 deletions src/lib/model-service.ts
@@ -0,0 +1,33 @@
import { ChatCraftProvider } from "./ChatCraftProvider";
import { getSettings } from "./settings";
import { isSpeechToTextModel } from "./ai";

export class ModelService {
Collaborator:

This is an interesting idea. We should add other methods later.

Author:

Agreed!

static async getSpeechToTextClient() {
const settings = getSettings();
const provider = settings.currentProvider;

if (!provider.apiKey) {
Collaborator:

Not all providers require an API key

return null;
}

return provider.createClient(provider.apiKey).openai;
}
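On the note above that not all providers require an API key: the guard could key off a per-provider capability flag instead of the key itself. A hypothetical sketch follows; `requiresApiKey` is an invented field, not part of the real ChatCraftProvider:

```typescript
// Hypothetical provider shape; the real ChatCraftProvider differs.
interface ProviderLike {
  name: string;
  apiKey?: string;
  requiresApiKey: boolean; // invented flag for this sketch
  createClient(apiKey?: string): { openai: unknown };
}

function getSpeechToTextClient(provider: ProviderLike): unknown {
  // Only bail out when the provider actually needs a key and has none.
  if (provider.requiresApiKey && !provider.apiKey) {
    return null;
  }
  return provider.createClient(provider.apiKey).openai;
}
```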

static async getSpeechToTextModel(provider: ChatCraftProvider): Promise<string | null> {
if (!provider.apiKey) {
return null;
}
const models: string[] = await provider.queryModels(provider.apiKey);
const sttModel = models.find((model) => isSpeechToTextModel(model));
return sttModel || null;
}

static async isSpeechToTextSupported(provider: ChatCraftProvider): Promise<boolean> {
if (!provider.apiKey) {
return false;
}
const models: string[] = await provider.queryModels(provider.apiKey);
return models.some((model) => isSpeechToTextModel(model));
}
}
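A usage sketch tying ModelService back to the accept list in OptionsButton.tsx: only advertise audio uploads when the current provider actually exposes a speech-to-text model. The provider interface and the `isSpeechToTextModel` predicate below are simplified stand-ins for the real versions:

```typescript
// Simplified stand-ins; the real versions live in src/lib/ai.ts and
// src/lib/ChatCraftProvider.ts.
interface SttProviderLike {
  apiKey?: string;
  queryModels(apiKey: string): Promise<string[]>;
}

function isSpeechToTextModel(model: string): boolean {
  return model.includes("whisper"); // stand-in predicate
}

// Build the file input's accept attribute based on provider support.
async function buildAccept(provider: SttProviderLike): Promise<string> {
  const base = "image/*,text/*,.pdf,application/pdf";
  if (!provider.apiKey) return base;
  const models = await provider.queryModels(provider.apiKey);
  return models.some(isSpeechToTextModel) ? `${base},audio/*` : base;
}
```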
2 changes: 1 addition & 1 deletion src/lib/speech-recognition.ts
@@ -132,7 +132,7 @@ export class SpeechRecognition {
}
}

async transcribe(audio: File) {
async transcribe(audio: File): Promise<string> {
const transcriptions = new OpenAI.Audio.Transcriptions(this._openai);
const transcription = await transcriptions.create({
file: audio,