This implementation adds AI-powered video breakdown functionality to the VortexHub application using Google Cloud services. The system automatically extracts audio from uploaded videos, transcribes the content using Google Speech-to-Text, and analyzes the transcript using Google Gemini AI to generate structured step-by-step instructions.
- Automatic Audio Extraction: Uses FFmpeg to extract high-quality audio from video files
- Speech-to-Text Transcription: Leverages Google Speech-to-Text API for accurate transcription
- AI-Powered Analysis: Uses Google Gemini Pro to extract:
  - Step-by-step instructions
  - Required materials and quantities
  - Tools and equipment needed
  - Safety cautions and warnings
  - Validation questions for learning
- Progress Tracking: Real-time progress updates during processing
- Cost Optimization: Built around lower-cost Google Cloud pricing tiers (see the cost estimates below)
[User Uploads Video] → [Frontend] → [Backend API] → [Google Cloud Services]
↓
[Audio Extraction (FFmpeg)] → [Speech-to-Text] → [Gemini AI Analysis] → [Structured Output]
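End to end, the pipeline above amounts to three sequential HTTP calls against the backend API. A minimal client-side sketch — the response field names (`audio_url`, `transcript`) are assumptions about the payload shape, not the shipped contract:

```python
import requests  # pip install requests

API_URL = "http://localhost:8000"

def breakdown_video(video_path: str) -> dict:
    """Run extract-audio -> transcribe -> analyze-transcript for one video."""
    # 1. Upload the video; the backend extracts audio with FFmpeg
    #    and stores it in Google Cloud Storage.
    with open(video_path, "rb") as f:
        resp = requests.post(f"{API_URL}/extract-audio", files={"file": f})
    resp.raise_for_status()
    audio_url = resp.json()["audio_url"]

    # 2. Transcribe the stored audio with Speech-to-Text.
    resp = requests.post(f"{API_URL}/transcribe", json={"audio_url": audio_url})
    resp.raise_for_status()
    transcript = resp.json()["transcript"]

    # 3. Turn the transcript into structured step data with Gemini.
    resp = requests.post(
        f"{API_URL}/analyze-transcript",
        json={"transcript": transcript, "context": "how-to video"},
    )
    resp.raise_for_status()
    return resp.json()
```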
- AI Video Analysis Section (`CreateProject.js`)
  - New button: "Analyze with AI"
  - Progress indicator during processing
  - Results display showing detected steps, materials, and tools
  - Integration with existing project creation flow
- Google Cloud API Service (`src/services/googleCloudApi.js`)
  - Centralized service for all Google Cloud API calls
  - Error handling and retry logic
  - Progress callback support
- Conditional Submission: Form automatically uses AI data when available
- Progress Feedback: Real-time updates during AI processing
- Error Handling: Comprehensive error messages and recovery
- Data Integration: AI results seamlessly integrated into project creation
- `POST /extract-audio`
  - Accepts a video file upload
  - Extracts audio using FFmpeg
  - Uploads the audio to Google Cloud Storage
  - Returns the GCS audio URL
- `POST /transcribe`
  - Accepts a GCS audio URL
  - Uses the Google Speech-to-Text API
  - Returns a transcript with timestamps
  - Supports multiple languages
- `POST /analyze-transcript`
  - Accepts a transcript and context
  - Uses Google Gemini Pro for analysis
  - Returns structured step data
  - Includes materials, tools, cautions, and questions
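The `/analyze-transcript` step hinges on prompting Gemini for machine-readable output. A hedged sketch of what that prompt might look like — the JSON schema mirrors the fields listed above (steps, materials, tools, cautions, questions), but the exact wording is an assumption, not the shipped prompt:

```python
import json

def build_analysis_prompt(transcript: str, context: str = "") -> str:
    """Compose a Gemini prompt asking for structured step data as JSON."""
    # The schema keys mirror the structured output described in this
    # document; the field descriptions are illustrative placeholders.
    schema = {
        "steps": ["ordered step-by-step instructions"],
        "materials": [{"name": "string", "quantity": "string"}],
        "tools": ["tools and equipment needed"],
        "cautions": ["safety cautions and warnings"],
        "questions": ["validation questions for learning"],
    }
    return (
        "You are analyzing a how-to video transcript.\n"
        f"Context: {context}\n"
        "Return ONLY valid JSON matching this schema:\n"
        f"{json.dumps(schema, indent=2)}\n\n"
        f"Transcript:\n{transcript}"
    )
```

Asking for "ONLY valid JSON" and echoing the schema keeps the response parseable; the backend can then `json.loads` the model output and validate the keys before storing them as project steps.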
- Framework: FastAPI (Python)
- Audio Processing: FFmpeg
- Cloud Services: Google Cloud Platform
- AI Model: Google Gemini Pro (via Vertex AI)
- Storage: Google Cloud Storage
- Create a new Google Cloud project or use an existing one
- Enable the required APIs:

  ```bash
  gcloud services enable speech.googleapis.com
  gcloud services enable aiplatform.googleapis.com
  gcloud services enable storage.googleapis.com
  ```

- Create a service account:

  ```bash
  gcloud iam service-accounts create video-breakdown-sa \
    --display-name="Video Breakdown Service Account"
  ```

- Grant the necessary roles (note: the backend uploads audio files to Cloud Storage, so it needs object write access rather than a viewer-only role):

  ```bash
  gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:video-breakdown-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/speech.client"
  gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:video-breakdown-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/aiplatform.user"
  gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
    --member="serviceAccount:video-breakdown-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
    --role="roles/storage.objectAdmin"
  ```

- Download the service account key:

  ```bash
  gcloud iam service-accounts keys create service-account-key.json \
    --iam-account=video-breakdown-sa@YOUR_PROJECT_ID.iam.gserviceaccount.com
  ```
Create a `.env` file in your backend directory:

```bash
# Google Cloud Configuration
GOOGLE_CLOUD_PROJECT_ID=your-project-id
GOOGLE_CLOUD_STORAGE_BUCKET=your-storage-bucket
GOOGLE_VERTEX_AI_LOCATION=us-central1
GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json

# Backend Configuration
API_HOST=0.0.0.0
API_PORT=8000
CORS_ORIGINS=http://localhost:3000,https://your-frontend-domain.com
```
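The backend can fail fast at startup if any required variable is missing rather than erroring mid-request. A small sketch, assuming the variables are exported (or loaded with a tool such as python-dotenv):

```python
import os

def load_settings() -> dict:
    """Read the Google Cloud settings, failing fast if any is missing."""
    required = [
        "GOOGLE_CLOUD_PROJECT_ID",
        "GOOGLE_CLOUD_STORAGE_BUCKET",
        "GOOGLE_APPLICATION_CREDENTIALS",
    ]
    missing = [name for name in required if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {missing}")
    return {
        "project_id": os.environ["GOOGLE_CLOUD_PROJECT_ID"],
        "bucket": os.environ["GOOGLE_CLOUD_STORAGE_BUCKET"],
        # Defaults mirror the sample .env above.
        "location": os.getenv("GOOGLE_VERTEX_AI_LOCATION", "us-central1"),
        "cors_origins": os.getenv("CORS_ORIGINS", "").split(","),
    }
```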
- Install Python dependencies:

  ```bash
  pip install -r backend_requirements.txt
  ```

- Install FFmpeg (required for audio extraction):

  ```bash
  # Ubuntu/Debian
  sudo apt update && sudo apt install ffmpeg

  # macOS
  brew install ffmpeg

  # Windows: download from https://ffmpeg.org/download.html
  ```

- Run the backend:

  ```bash
  python backend_implementation_example.py
  ```
Update your frontend `.env` file:

```bash
# Existing Firebase config...
REACT_APP_FIREBASE_API_KEY=your_firebase_api_key
# ... other Firebase variables

# Backend API URL
REACT_APP_API_URL=http://localhost:8000
```

- Navigate to the Create Project page
- Upload one or more video files
- Fill in project details (name, description, tags)
- Click "Analyze with AI" button
- Wait for processing to complete (progress shown in real-time)
- Review the generated results
- Click "Save with AI Data & Continue"
- Project is created with AI-generated steps
- Continue to annotation page for manual refinement
Speech-to-Text API:
- Standard Model: $0.006 per 15 seconds
- Enhanced Model: $0.009 per 15 seconds
- Video Model: $0.006 per 15 seconds (recommended)
Gemini Pro (Vertex AI):
- Input: $0.0005 per 1K characters
- Output: $0.0015 per 1K characters
Storage:
- Standard: $0.020 per GB per month
- Audio files are typically small (< 10MB per video)
For a 10-minute video:
- Transcription: ~$0.24 (Speech-to-Text video model: 40 fifteen-second increments × $0.006)
- AI analysis: ~$0.05-0.15 (Gemini Pro)
- Storage: ~$0.0002 (audio file)
- Total per video: ~$0.30-0.40

Audio extraction itself runs locally via FFmpeg, so it costs backend compute time but no per-call API fee.
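As a sanity check on the arithmetic, the transcription part of the estimate follows directly from the rates above; the Gemini part depends on transcript length, for which a rough assumption (~150 words per minute, ~6 characters per word) is used here:

```python
def estimate_cost(video_minutes: float) -> float:
    """Approximate USD cost to process one video through the pipeline."""
    # Speech-to-Text video model: $0.006 per 15-second increment.
    stt = (video_minutes * 60 / 15) * 0.006
    # Rough transcript size: ~150 words/min, ~6 chars/word (assumption).
    transcript_chars = video_minutes * 150 * 6
    # Gemini Pro: $0.0005 per 1K input chars, $0.0015 per 1K output chars
    # (treating the output as roughly transcript-sized, an assumption).
    gemini = (transcript_chars / 1000) * (0.0005 + 0.0015)
    return stt + gemini
```

For a 10-minute video this gives $0.24 for transcription plus roughly $0.02 for analysis under these assumptions; denser transcripts or longer Gemini outputs push the analysis cost toward the $0.05-0.15 range quoted above.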
- Use clear audio for better transcription
- Avoid background noise
- Speak clearly and at normal pace
- Use the "video" model for Speech-to-Text (optimized for video content)
- Implement caching for repeated transcriptions
- Clean up temporary files regularly
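The caching tip above can be implemented by keying transcripts on a content hash of the audio, so re-uploading the same video never pays for a second Speech-to-Text call. A sketch — the cache directory is an arbitrary choice, and `transcribe_fn` stands in for the real API call:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("/tmp/transcript-cache")

def cached_transcribe(audio_path: Path, transcribe_fn) -> dict:
    """Return a cached transcript if this exact audio was seen before."""
    # Identical audio bytes always hash to the same cache key.
    digest = hashlib.sha256(audio_path.read_bytes()).hexdigest()
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file = CACHE_DIR / f"{digest}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    result = transcribe_fn(audio_path)  # the real Speech-to-Text call
    cache_file.write_text(json.dumps(result))
    return result
```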
- Implement retry logic for API failures
- Provide clear error messages to users
- Log errors for debugging
- Validate uploaded files
- Implement rate limiting
- Use service accounts with minimal permissions
- Secure API keys and credentials
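The retry tip above usually means exponential backoff around transient API failures. A minimal sketch — in production you would catch the Google client library's transient errors (e.g. service-unavailable or deadline-exceeded exceptions) rather than bare `Exception`:

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn(), retrying on failure with 1s, 2s, 4s... back-off."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))
```

Wrapping each Google Cloud call, e.g. `with_retries(lambda: client.recognize(request))`, keeps one flaky response from failing the whole video breakdown.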
- FFmpeg not found
  - Ensure FFmpeg is installed and on your PATH
  - Check the installation with `ffmpeg -version`
- Google Cloud authentication errors
  - Verify the service account key is correct
  - Check API permissions
  - Ensure the required APIs are enabled
- CORS errors
  - Update the CORS configuration in the backend
  - Check that the frontend URL is in the allowed origins
- Audio extraction failures
  - Verify the video file format is supported
  - Check file size limits
  - Ensure sufficient disk space
- Check backend logs for detailed error messages
- Verify environment variables are loaded correctly
- Test API endpoints individually using tools like Postman
- Monitor Google Cloud Console for API usage and errors
- Batch Processing: Process multiple videos simultaneously
- Custom Models: Train custom models for specific domains
- Video Analysis: Add computer vision for visual step detection
- Multi-language Support: Extend the AI analysis and generated instructions to additional languages
- Real-time Processing: Stream processing for live videos
- Advanced Analytics: Detailed usage analytics and insights
For issues and questions:
- Check the troubleshooting section above
- Review Google Cloud documentation
- Check application logs for detailed error messages
- Monitor Google Cloud Console for API usage and quotas