Multi-channel audio is automatically cast to mono, num_channels is ignored #8005

@ZackHodari

Describe the bug

The num_channels parameter of datasets.Audio() is documented to keep the source file's channels when set to None (preserve original) and to produce stereo when set to 2 (explicit stereo), but decoding currently downmixes all audio to mono regardless of the setting.

Steps to reproduce the bug

import numpy as np
import soundfile as sf
import tempfile
from datasets import Dataset, Audio

# Create a stereo audio file
sample_rate = 16000
duration = 1.0
num_samples = int(sample_rate * duration)

left_channel = np.sin(2 * np.pi * 440 * np.linspace(0, duration, num_samples))
right_channel = np.sin(2 * np.pi * 880 * np.linspace(0, duration, num_samples))
stereo_audio = np.stack([left_channel, right_channel], axis=1).astype(np.float32)

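# Reserve a path for the WAV file and close the handle before soundfile writes to it
temp_file = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
temp_file.close()
sf.write(temp_file.name, stereo_audio, sample_rate)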

# Create HuggingFace dataset
dataset_dict = {"audio": [temp_file.name]}
ds = Dataset.from_dict(dataset_dict)

# Test with num_channels=2
ds_stereo = ds.cast_column("audio", Audio(num_channels=2))
audio_data = ds_stereo[0]["audio"]

print(f"Original file shape (via soundfile): {sf.read(temp_file.name)[0].shape}")
# Output: (16000, 2) ✓ Stereo

print(f"HF datasets shape with num_channels=2: {audio_data['array'].shape}")
# Output: (16000,) ✗ Mono (should be (2, 16000))

Result:

  • Original file: (16000, 2) - stereo ✓
  • Audio(num_channels=None): (16000,) - mono ✗
  • Audio(num_channels=2): (16000,) - mono ✗
  • Audio(num_channels=1): (16000,) - mono ✓

Expected behavior

According to the documentation, Audio decoding should return samples with shape (num_channels, num_samples):

  • num_channels=None should preserve the original number of channels from the source file
  • num_channels=2 should preserve/convert to stereo output with shape (2, num_samples)
  • num_channels=1 should downmix to mono with shape (num_samples,)
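
The expected shapes listed above can be sketched with plain NumPy. The helper below, expected_decode, is a hypothetical name used only for illustration and is not part of datasets. It takes samples in soundfile's (num_samples, num_channels) layout and returns the shape this report expects for each num_channels value. Downmixing by averaging channels and upmixing by copying the mono channel are assumptions; the library may implement either step differently.

```python
import numpy as np


def expected_decode(samples, num_channels=None):
    """Hypothetical sketch of the output expected for each num_channels value.

    `samples` uses soundfile's layout: (num_samples,) for mono files or
    (num_samples, num_channels) for multi-channel files.
    """
    samples = np.asarray(samples)
    # Normalize to a channels-first layout: (channels, num_samples).
    channels_first = samples[np.newaxis, :] if samples.ndim == 1 else samples.T

    if num_channels is None:
        # Keep whatever channel count the source file has.
        num_channels = channels_first.shape[0]
    if num_channels == 1:
        # Assumed downmix: average all channels into a 1-D mono signal.
        return channels_first.mean(axis=0)
    if channels_first.shape[0] == num_channels:
        return channels_first
    if channels_first.shape[0] == 1:
        # Assumed upmix: copy the mono channel into every output channel.
        return np.repeat(channels_first, num_channels, axis=0)
    raise ValueError(
        f"cannot convert {channels_first.shape[0]} channels to {num_channels}"
    )


stereo = np.zeros((16000, 2), dtype=np.float32)
print(expected_decode(stereo, None).shape)  # (2, 16000)
print(expected_decode(stereo, 2).shape)     # (2, 16000)
print(expected_decode(stereo, 1).shape)     # (16000,)
```

Comparing these shapes with audio_data['array'].shape from the reproduction script shows the mismatch: the decoded array has the num_channels=1 shape for every setting.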

Actual behavior

All num_channels settings produce mono output with shape (num_samples,), even when the source audio file is stereo.

Environment info

OS: macOS / Linux
Python 3.10.19

datasets             4.4.2
torchcodec           0.10.0
