Conversation

@devdairy699 commented Oct 13, 2025

This PR is intended to resolve #23. Since we do not support WebSocket, HTTP/2, or gRPC, I made some changes to the WIT and simplified the TTS interface.
All the test output is here https://github.com/devdairy699/golem-ai/tree/main/test/tts/test-audio-files

Feature Support Matrix

Providers: AWS Polly, Google TTS, ElevenLabs, Deepgram

Basic Synthesis: Synthesize, Synthesize Batch, SSML Support
Voice Management: List Voices, Get Voice, List Languages
Validation & Analysis: Validate Input, Get Timing Marks
Advanced Features: Voice Cloning, Voice Design, Voice Conversion, Sound Effects, Pronunciation Lexicons, Long-Form Synthesis

Demo Video

compressed_deepgram.-.Made.with.Clipchamp.mp4
compressed_Elevenlabs.-.Made.with.Clipchamp.mp4
compressed_Google.mp4
compressed_Polly.-.Made.with.Clipchamp.mp4
compressed_Polly-durability.mp4

Here is the current WIT:

package golem:tts;

/// Core types and error handling for universal text-to-speech
interface types {
    /// Comprehensive error types covering all TTS operations
    variant tts-error {
        /// General errors
        request-error(string),

        /// Input validation errors
        invalid-text(string),
        text-too-long(u32),
        invalid-ssml(string),
        unsupported-language(string),
        
        /// Voice and model errors
        voice-not-found(string),
        model-not-found(string),
        voice-unavailable(string),
        
        /// Authentication and authorization
        unauthorized(string),
        access-denied(string),
        
        /// Resource and quota limits
        quota-exceeded(quota-info),
        rate-limited(u32),
        insufficient-credits,
        
        /// Operation errors
        synthesis-failed(string),
        unsupported-operation(string),
        invalid-configuration(string),
        
        /// Service errors
        service-unavailable(string),
        network-error(string),
        internal-error(string),
        
        /// Storage errors (for async operations)
        invalid-storage-location(string),
        storage-access-denied(string),
    }

    record quota-info {
        used: u32,
        limit: u32,
        reset-time: u64,
        unit: quota-unit,
    }

    enum quota-unit {
        characters,
        requests,
        seconds,
        credits,
    }

    /// Language identification using BCP 47 codes
    type language-code = string;

    /// Voice gender classification
    enum voice-gender {
        male,
        female,
        neutral,
    }


    /// Text input types
    enum text-type {
        plain,
        ssml,
    }

    
    /// Audio quality settings
    record audio-config {
        /// Provider specific audio encoding format 
        format: string, 
        sample-rate: option<u32>,
        bit-rate: option<u32>,
        channels: option<u8>,
    }

    /// Voice synthesis parameters
    record voice-settings {
        /// Speaking rate (0.25 to 4.0, default 1.0)
        speed: option<f32>,
        /// Pitch adjustment in semitones (-20.0 to 20.0, default 0.0)
        pitch: option<f32>,
        /// Volume gain in dB (-96.0 to 16.0, default 0.0)
        volume: option<f32>,
        /// Voice stability (0.0 to 1.0, provider-specific)
        stability: option<f32>,
        /// Similarity to original (0.0 to 1.0, provider-specific)
        similarity: option<f32>,
        /// Style exaggeration (0.0 to 1.0, provider-specific)
        style: option<f32>,
    }

    /// Audio effects and device optimization
    enum audio-effects {
        telephone-quality,
        headphone-optimized,
        speaker-optimized,
        car-audio-optimized,
        noise-reduction,
        bass-boost,
        treble-boost,
    }

    /// Input text with metadata
    record text-input {
        content: string,
        text-type: text-type,
        language: option<language-code>,
    }

    /// Complete synthesis result
    record synthesis-result {
        audio-data: list<u8>,
        metadata: synthesis-metadata,
    }

    /// Metadata about synthesized audio
    record synthesis-metadata {
        duration-seconds: f32,
        character-count: u32,
        word-count: u32,
        audio-size-bytes: u32,
        request-id: string,
        provider-info: option<string>,
    }

    /// Streaming audio chunk
    record audio-chunk {
        data: list<u8>,
        sequence-number: u32,
        is-final: bool,
        timing-info: option<timing-info>,
    }

    /// Timing and synchronization information
    record timing-info {
        start-time-seconds: f32,
        end-time-seconds: option<f32>,
        text-offset: option<u32>,
        mark-type: option<timing-mark-type>,
    }

    enum timing-mark-type {
        word,
        sentence,
        paragraph,
        ssml-mark,
        viseme,
    }


}

/// Voice discovery and management
interface voices {
    use types.{tts-error, language-code, voice-settings, voice-gender };

    /// Voice search and filtering
    record voice-filter {
        language: option<language-code>,
        gender: option<voice-gender>,
        quality: option<string>,
        supports-ssml: option<bool>,
        provider: option<string>,
        search-query: option<string>,
    }

    /// Detailed voice information
    record voice {
        id: string,
        name: string,
        language: language-code,
        additional-languages: list<language-code>,
        gender: voice-gender,
        quality: string,
        description: option<string>,
        provider: string,
        sample-rate: list<u32>,
        supports-ssml: bool,
        is-custom: bool,
        is-cloned: bool,
        preview-url: option<string>,
        use-cases: list<string>,
        supported-formats: list<string>,
    }

    record language-info {
        code: language-code,
        name: string,
        native-name: string,
        voice-count: u32,
    }

    /// List available voices with filtering and pagination
    list-voices: func(
        filter: option<voice-filter>
    ) -> result<list<voice>, tts-error>;

    /// Get specific voice by ID
    get-voice: func(voice-id: string) -> result<voice, tts-error>;

    /// Get supported languages
    list-languages: func() -> result<list<language-info>, tts-error>;

}

/// Core text-to-speech synthesis operations
interface synthesis {
    use types.{
        text-input, audio-config, voice-settings, audio-effects,
        synthesis-result, tts-error, timing-info
    };
    use voices.{voice};

    /// Synthesis configuration options
    record synthesis-options {
        audio-config: option<audio-config>,
        voice-settings: option<voice-settings>,
        audio-effects: option<list<audio-effects>>,
        enable-timing: option<bool>,
        enable-word-timing: option<bool>,
        seed: option<u32>,
        model-id: option<string>,
        context: option<synthesis-context>,
    }

    /// Context for better synthesis quality
    record synthesis-context {
        previous-text: option<string>,
        next-text: option<string>,
        topic: option<string>,
        emotion: option<string>,
        speaking-style: option<string>,
    }

    /// Convert text to speech (removed async)
    synthesize: func(
        input: text-input,
        voice: voice,
        options: option<synthesis-options>
    ) -> result<synthesis-result, tts-error>;

    /// Batch synthesis for multiple inputs (removed async)
    synthesize-batch: func(
        inputs: list<text-input>,
        voice: voice,
        options: option<synthesis-options>
    ) -> result<list<synthesis-result>, tts-error>;

    /// Get timing information without audio synthesis
    get-timing-marks: func(
        input: text-input,
        voice: voice
    ) -> result<list<timing-info>, tts-error>;

    /// Validate text before synthesis
    validate-input: func(
        input: text-input,
        voice: voice
    ) -> result<validation-result, tts-error>;

    record validation-result {
        is-valid: bool,
        character-count: u32,
        estimated-duration: option<f32>,
        warnings: list<string>,
        errors: list<string>,
    }
}


/// Advanced TTS features and voice manipulation
interface advanced {
    use types.{tts-error, audio-config, language-code, synthesis-metadata, voice-gender};
    use voices.{voice};

    /// Voice cloning and creation (removed async)
    create-voice-clone: func(
        name: string,
        audio-samples: list<audio-sample>,
        description: option<string>
    ) -> result<voice, tts-error>;

    record audio-sample {
        data: list<u8>,
        transcript: option<string>,
        quality-rating: option<u8>,
    }

    /// Design synthetic voice (removed async)
    design-voice: func(
        name: string,
        characteristics: voice-design-params
    ) -> result<voice, tts-error>;

    record voice-design-params {
        gender: voice-gender,
        age-category: age-category,
        accent: string,
        personality-traits: list<string>,
        reference-voice: option<string>,
    }

    enum age-category {
        child,
        young-adult,
        middle-aged,
        elderly,
    }

    /// Voice-to-voice conversion (removed async)
    convert-voice: func(
        input-audio: list<u8>,
        target-voice: voice,
        preserve-timing: option<bool>
    ) -> result<list<u8>, tts-error>;

    /// Generate sound effects from text description (removed async)
    generate-sound-effect: func(
        description: string,
        duration-seconds: option<f32>,
        style-influence: option<f32>
    ) -> result<list<u8>, tts-error>;

    /// Custom pronunciation management
    resource pronunciation-lexicon {
        get-name: func() -> string;
        get-language: func() -> language-code;
        get-entry-count: func() -> u32;
        
        /// Add pronunciation rule
        add-entry: func(word: string, pronunciation: string) -> result<_, tts-error>;
        
        /// Remove pronunciation rule
        remove-entry: func(word: string) -> result<_, tts-error>;
        
        /// Export lexicon content
        export-content: func() -> result<string, tts-error>;
    }

    /// Create custom pronunciation lexicon
    create-lexicon: func(
        name: string,
        language: language-code,
        entries: option<list<pronunciation-entry>>
    ) -> result<pronunciation-lexicon, tts-error>;

    record pronunciation-entry {
        word: string,
        pronunciation: string,
        part-of-speech: option<string>,
    }

    /// Long-form content synthesis with optimization (removed async)
    synthesize-long-form: func(
        content: string,
        voice: voice,
        chapter-breaks: option<list<u32>>
    ) -> result<long-form-operation, tts-error>;

    resource long-form-operation {
        get-task-id: func() -> result<string, tts-error>;
        get-status: func() -> result<operation-status, tts-error>;
        get-progress: func() -> result<f32, tts-error>;
        cancel: func() -> result<_, tts-error>;
        get-result: func() -> result<long-form-result, tts-error>;
    }

    enum operation-status {
        pending,
        processing,
        completed,
        failed,
        cancelled,
    }

    record long-form-result {
        output-location: string,
        total-duration: f32,
        chapter-durations: option<list<f32>>,
        metadata: synthesis-metadata,
    }
}

world tts-library {
    export types;
    export voices;
    export synthesis;
    export advanced;
}
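As an illustration of how a provider implementation might enforce the documented ranges on the `voice-settings` record, here is a small sketch in Rust. The struct and the clamping helper are illustrative assumptions (real code would use the wit-bindgen-generated types), but the ranges come straight from the WIT doc comments above:

```rust
/// Mirrors a subset of the WIT `voice-settings` record (illustrative;
/// a real provider would use the wit-bindgen-generated bindings).
#[derive(Debug, Default, PartialEq)]
struct VoiceSettings {
    speed: Option<f32>,
    pitch: Option<f32>,
    volume: Option<f32>,
}

/// Clamp each optional field to the range documented in the WIT:
/// speed 0.25..=4.0, pitch -20.0..=20.0 semitones, volume -96.0..=16.0 dB.
fn clamp_settings(s: VoiceSettings) -> VoiceSettings {
    VoiceSettings {
        speed: s.speed.map(|v| v.clamp(0.25, 4.0)),
        pitch: s.pitch.map(|v| v.clamp(-20.0, 20.0)),
        volume: s.volume.map(|v| v.clamp(-96.0, 16.0)),
    }
}

fn main() {
    // Out-of-range values are pulled back to the documented bounds;
    // unset (None) fields are left for the provider's own defaults.
    let out = clamp_settings(VoiceSettings {
        speed: Some(9.0),
        pitch: None,
        volume: Some(-120.0),
    });
    println!("{:?}", out); // speed -> Some(4.0), volume -> Some(-96.0)
}
```

Whether a provider should clamp silently or return `invalid-configuration` is a design choice; clamping keeps cross-provider behavior uniform, while erroring surfaces caller mistakes earlier.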

/claim #23

@devdairy699 devdairy699 marked this pull request as ready for review November 10, 2025 13:43
@devdairy699 (Author):
@vigoo PR’s ready for review! Tweaked the WIT and simplified things a bit. Hope it looks good 🙌

@devdairy699 devdairy699 changed the title feat: tts provider (wip) feat: tts provider Nov 21, 2025
Reviewed snippet from the Deepgram provider:

let base_url = get_env("TTS_PROVIDER_ENDPOINT")
    .ok()
    .unwrap_or("https://api.deepgram.com".to_string());
trace!("Using base URL: {base_url}");
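The endpoint-override pattern in the snippet above can be sketched with only the standard library (the real code uses golem's `get_env` helper and a `trace!` log macro; the function name here is illustrative):

```rust
use std::env;

/// Resolve a provider base URL: prefer the environment variable,
/// fall back to the provider default when unset or blank.
fn base_url(var: &str, default: &str) -> String {
    env::var(var)
        .ok()
        .filter(|v| !v.trim().is_empty())
        .unwrap_or_else(|| default.to_string())
}

fn main() {
    // With the variable unset, the Deepgram default is used.
    println!("{}", base_url("TTS_PROVIDER_ENDPOINT", "https://api.deepgram.com"));
}
```

Treating an empty string the same as "unset" avoids requests to a blank URL when the variable is exported but left empty, and centralizing this helper is one way to get the consistent per-provider logging the reviewer asks for below.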
Contributor:
It would be wonderful if we could have consistent logs across all providers!

Successfully merging this pull request may close these issues.

Implement Durable Text-to-Speech Provider Components for golem:tts WIT Interface