Conversation

@devdairy699 commented Oct 13, 2025

This PR is intended to resolve #23. Since we do not support WebSocket, HTTP/2, or gRPC, I made some changes to the WIT and simplified the TTS interface.
All the test output is here https://github.com/devdairy699/golem-ai/tree/main/test/tts/test-audio-files

Feature Support Matrix

Providers: AWS Polly, Google TTS, ElevenLabs, Deepgram

Basic Synthesis: Synthesize, Synthesize Batch, SSML Support
Voice Management: List Voices, Get Voice, List Languages
Validation & Analysis: Validate Input, Get Timing Marks
Advanced Features: Voice Cloning, Voice Design, Voice Conversion, Sound Effects, Pronunciation Lexicons, Long-Form Synthesis

Demo Video

compressed_deepgram.-.Made.with.Clipchamp.mp4
compressed_Elevenlabs.-.Made.with.Clipchamp.mp4
compressed_Google.mp4
compressed_Polly.-.Made.with.Clipchamp.mp4
compressed_Polly-durability.mp4

Here is the current WIT:

package golem:tts;

/// Core types and error handling for universal text-to-speech
interface types {
    /// Comprehensive error types covering all TTS operations
    variant tts-error {
        /// General errors
        request-error(string),

        /// Input validation errors
        invalid-text(string),
        text-too-long(u32),
        invalid-ssml(string),
        unsupported-language(string),
        
        /// Voice and model errors
        voice-not-found(string),
        model-not-found(string),
        voice-unavailable(string),
        
        /// Authentication and authorization
        unauthorized(string),
        access-denied(string),
        
        /// Resource and quota limits
        quota-exceeded(quota-info),
        rate-limited(u32),
        insufficient-credits,
        
        /// Operation errors
        synthesis-failed(string),
        unsupported-operation(string),
        invalid-configuration(string),
        
        /// Service errors
        service-unavailable(string),
        network-error(string),
        internal-error(string),
        
        /// Storage errors (for async operations)
        invalid-storage-location(string),
        storage-access-denied(string),
    }

    record quota-info {
        used: u32,
        limit: u32,
        reset-time: u64,
        unit: quota-unit,
    }

    enum quota-unit {
        characters,
        requests,
        seconds,
        credits,
    }

    /// Language identification using BCP 47 codes
    type language-code = string;

    /// Voice gender classification
    enum voice-gender {
        male,
        female,
        neutral,
    }


    /// Text input types
    enum text-type {
        plain,
        ssml,
    }

    
    /// Audio quality settings
    record audio-config {
        /// Provider specific audio encoding format 
        format: string, 
        sample-rate: option<u32>,
        bit-rate: option<u32>,
        channels: option<u8>,
    }

    /// Voice synthesis parameters
    record voice-settings {
        /// Speaking rate (0.25 to 4.0, default 1.0)
        speed: option<f32>,
        /// Pitch adjustment in semitones (-20.0 to 20.0, default 0.0)
        pitch: option<f32>,
        /// Volume gain in dB (-96.0 to 16.0, default 0.0)
        volume: option<f32>,
        /// Voice stability (0.0 to 1.0, provider-specific)
        stability: option<f32>,
        /// Similarity to original (0.0 to 1.0, provider-specific)
        similarity: option<f32>,
        /// Style exaggeration (0.0 to 1.0, provider-specific)
        style: option<f32>,
    }

    /// Audio effects and device optimization
    enum audio-effects {
        telephone-quality,
        headphone-optimized,
        speaker-optimized,
        car-audio-optimized,
        noise-reduction,
        bass-boost,
        treble-boost,
    }

    /// Input text with metadata
    record text-input {
        content: string,
        text-type: text-type,
        language: option<language-code>,
    }

    /// Complete synthesis result
    record synthesis-result {
        audio-data: list<u8>,
        metadata: synthesis-metadata,
    }

    /// Metadata about synthesized audio
    record synthesis-metadata {
        duration-seconds: f32,
        character-count: u32,
        word-count: u32,
        audio-size-bytes: u32,
        request-id: string,
        provider-info: option<string>,
    }

    /// Streaming audio chunk
    record audio-chunk {
        data: list<u8>,
        sequence-number: u32,
        is-final: bool,
        timing-info: option<timing-info>,
    }

    /// Timing and synchronization information
    record timing-info {
        start-time-seconds: f32,
        end-time-seconds: option<f32>,
        text-offset: option<u32>,
        mark-type: option<timing-mark-type>,
    }

    enum timing-mark-type {
        word,
        sentence,
        paragraph,
        ssml-mark,
        viseme,
    }


}

/// Voice discovery and management
interface voices {
    use types.{tts-error, language-code, voice-settings, voice-gender };

    /// Voice search and filtering
    record voice-filter {
        language: option<language-code>,
        gender: option<voice-gender>,
        quality: option<string>,
        supports-ssml: option<bool>,
        provider: option<string>,
        search-query: option<string>,
    }

    /// Detailed voice information
    record voice {
        id: string,
        name: string,
        language: language-code,
        additional-languages: list<language-code>,
        gender: voice-gender,
        quality: string,
        description: option<string>,
        provider: string,
        sample-rate: list<u32>,
        supports-ssml: bool,
        is-custom: bool,
        is-cloned: bool,
        preview-url: option<string>,
        use-cases: list<string>,
        supported-formats: list<string>,
    }

    record language-info {
        code: language-code,
        name: string,
        native-name: string,
        voice-count: u32,
    }

    /// List available voices with filtering and pagination
    list-voices: func(
        filter: option<voice-filter>
    ) -> result<list<voice>, tts-error>;

    /// Get specific voice by ID
    get-voice: func(voice-id: string) -> result<voice, tts-error>;

    /// Get supported languages
    list-languages: func() -> result<list<language-info>, tts-error>;

}

/// Core text-to-speech synthesis operations
interface synthesis {
    use types.{
        text-input, audio-config, voice-settings, audio-effects,
        synthesis-result, tts-error, timing-info
    };
    use voices.{voice};

    /// Synthesis configuration options
    record synthesis-options {
        audio-config: option<audio-config>,
        voice-settings: option<voice-settings>,
        audio-effects: option<list<audio-effects>>,
        enable-timing: option<bool>,
        enable-word-timing: option<bool>,
        seed: option<u32>,
        model-id: option<string>,
        context: option<synthesis-context>,
    }

    /// Context for better synthesis quality
    record synthesis-context {
        previous-text: option<string>,
        next-text: option<string>,
        topic: option<string>,
        emotion: option<string>,
        speaking-style: option<string>,
    }

    /// Convert text to speech (removed async)
    synthesize: func(
        input: text-input,
        voice: voice,
        options: option<synthesis-options>
    ) -> result<synthesis-result, tts-error>;

    /// Batch synthesis for multiple inputs (removed async)
    synthesize-batch: func(
        inputs: list<text-input>,
        voice: voice,
        options: option<synthesis-options>
    ) -> result<list<synthesis-result>, tts-error>;

    /// Get timing information without audio synthesis
    get-timing-marks: func(
        input: text-input,
        voice: voice
    ) -> result<list<timing-info>, tts-error>;

    /// Validate text before synthesis
    validate-input: func(
        input: text-input,
        voice: voice
    ) -> result<validation-result, tts-error>;

    record validation-result {
        is-valid: bool,
        character-count: u32,
        estimated-duration: option<f32>,
        warnings: list<string>,
        errors: list<string>,
    }
}


/// Advanced TTS features and voice manipulation
interface advanced {
    use types.{tts-error, audio-config, language-code, synthesis-metadata, voice-gender};
    use voices.{voice};

    /// Voice cloning and creation (removed async)
    create-voice-clone: func(
        name: string,
        audio-samples: list<audio-sample>,
        description: option<string>
    ) -> result<voice, tts-error>;

    record audio-sample {
        data: list<u8>,
        transcript: option<string>,
        quality-rating: option<u8>,
    }

    /// Design synthetic voice (removed async)
    design-voice: func(
        name: string,
        characteristics: voice-design-params
    ) -> result<voice, tts-error>;

    record voice-design-params {
        gender: voice-gender,
        age-category: age-category,
        accent: string,
        personality-traits: list<string>,
        reference-voice: option<string>,
    }

    enum age-category {
        child,
        young-adult,
        middle-aged,
        elderly,
    }

    /// Voice-to-voice conversion (removed async)
    convert-voice: func(
        input-audio: list<u8>,
        target-voice: voice,
        preserve-timing: option<bool>
    ) -> result<list<u8>, tts-error>;

    /// Generate sound effects from text description (removed async)
    generate-sound-effect: func(
        description: string,
        duration-seconds: option<f32>,
        style-influence: option<f32>
    ) -> result<list<u8>, tts-error>;

    /// Custom pronunciation management
    resource pronunciation-lexicon {
        get-name: func() -> string;
        get-language: func() -> language-code;
        get-entry-count: func() -> u32;
        
        /// Add pronunciation rule
        add-entry: func(word: string, pronunciation: string) -> result<_, tts-error>;
        
        /// Remove pronunciation rule
        remove-entry: func(word: string) -> result<_, tts-error>;
        
        /// Export lexicon content
        export-content: func() -> result<string, tts-error>;
    }

    /// Create custom pronunciation lexicon
    create-lexicon: func(
        name: string,
        language: language-code,
        entries: option<list<pronunciation-entry>>
    ) -> result<pronunciation-lexicon, tts-error>;

    record pronunciation-entry {
        word: string,
        pronunciation: string,
        part-of-speech: option<string>,
    }

    /// Long-form content synthesis with optimization (removed async)
    synthesize-long-form: func(
        content: string,
        voice: voice,
        chapter-breaks: option<list<u32>>
    ) -> result<long-form-operation, tts-error>;

    resource long-form-operation {
        get-task-id: func() -> result<string, tts-error>;
        get-status: func() -> result<operation-status, tts-error>;
        get-progress: func() -> result<f32, tts-error>;
        cancel: func() -> result<_, tts-error>;
        get-result: func() -> result<long-form-result, tts-error>;
    }

    enum operation-status {
        pending,
        processing,
        completed,
        failed,
        cancelled,
    }

    record long-form-result {
        output-location: string,
        total-duration: f32,
        chapter-durations: option<list<f32>>,
        metadata: synthesis-metadata,
    }
}

world tts-library {
    export types;
    export voices;
    export synthesis;
    export advanced;
}
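As an illustration of how a provider implementation might enforce the documented ranges on the `voice-settings` record, here is a small sketch in Rust. The struct and the clamping helper are illustrative assumptions (real code would use the wit-bindgen-generated types), but the ranges come straight from the WIT doc comments above:

```rust
/// Mirrors a subset of the WIT `voice-settings` record (illustrative;
/// a real provider would use the wit-bindgen-generated bindings).
#[derive(Debug, Default, PartialEq)]
struct VoiceSettings {
    speed: Option<f32>,
    pitch: Option<f32>,
    volume: Option<f32>,
}

/// Clamp each optional field to the range documented in the WIT:
/// speed 0.25..=4.0, pitch -20.0..=20.0 semitones, volume -96.0..=16.0 dB.
fn clamp_settings(s: VoiceSettings) -> VoiceSettings {
    VoiceSettings {
        speed: s.speed.map(|v| v.clamp(0.25, 4.0)),
        pitch: s.pitch.map(|v| v.clamp(-20.0, 20.0)),
        volume: s.volume.map(|v| v.clamp(-96.0, 16.0)),
    }
}

fn main() {
    // Out-of-range values are pulled back to the documented bounds;
    // unset (None) fields are left for the provider's own defaults.
    let out = clamp_settings(VoiceSettings {
        speed: Some(9.0),
        pitch: None,
        volume: Some(-120.0),
    });
    println!("{:?}", out); // speed -> Some(4.0), volume -> Some(-96.0)
}
```

Whether a provider should clamp silently or return `invalid-configuration` is a design choice; clamping keeps cross-provider behavior uniform, while erroring surfaces caller mistakes earlier.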

/claim #23

@devdairy699 devdairy699 marked this pull request as ready for review November 10, 2025 13:43
@devdairy699 (Author):
@vigoo PR’s ready for review! Tweaked the WIT and simplified things a bit. Hope it looks good 🙌

@devdairy699 devdairy699 changed the title feat: tts provider (wip) feat: tts provider Nov 21, 2025
Reviewed snippet from the Deepgram provider:

let base_url = get_env("TTS_PROVIDER_ENDPOINT")
    .ok()
    .unwrap_or("https://api.deepgram.com".to_string());
trace!("Using base URL: {base_url}");
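The endpoint-override pattern in the snippet above can be sketched with only the standard library (the real code uses golem's `get_env` helper and a `trace!` log macro; the function name here is illustrative):

```rust
use std::env;

/// Resolve a provider base URL: prefer the environment variable,
/// fall back to the provider default when unset or blank.
fn base_url(var: &str, default: &str) -> String {
    env::var(var)
        .ok()
        .filter(|v| !v.trim().is_empty())
        .unwrap_or_else(|| default.to_string())
}

fn main() {
    // With the variable unset, the Deepgram default is used.
    println!("{}", base_url("TTS_PROVIDER_ENDPOINT", "https://api.deepgram.com"));
}
```

Treating an empty string the same as "unset" avoids requests to a blank URL when the variable is exported but left empty, and centralizing this helper is one way to get the consistent per-provider logging the reviewer asks for below.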
Contributor:
It would be wonderful if we could have consistent logs across all providers!

Successfully merging this pull request may close these issues.

Implement Durable Text-to-Speech Provider Components for golem:tts WIT Interface