Describe the bug
The num_channels parameter in datasets.Audio() is documented to preserve stereo channels when set to None (preserve original) or 2 (explicit stereo), but it currently downmixes all audio to mono regardless of this setting.
Steps to reproduce the bug
import numpy as np
import soundfile as sf
import tempfile
from datasets import Dataset, Audio
# Create a stereo audio file
sample_rate = 16000
duration = 1.0
num_samples = int(sample_rate * duration)
left_channel = np.sin(2 * np.pi * 440 * np.linspace(0, duration, num_samples))
right_channel = np.sin(2 * np.pi * 880 * np.linspace(0, duration, num_samples))
stereo_audio = np.stack([left_channel, right_channel], axis=1).astype(np.float32)
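# Reserve a temporary .wav path and close the handle before writing to it by name
temp_file = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
temp_file.close()
sf.write(temp_file.name, stereo_audio, sample_rate)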
# Create HuggingFace dataset
dataset_dict = {"audio": [temp_file.name]}
ds = Dataset.from_dict(dataset_dict)
# Test with num_channels=2
ds_stereo = ds.cast_column("audio", Audio(num_channels=2))
audio_data = ds_stereo[0]["audio"]
print(f"Original file shape (via soundfile): {sf.read(temp_file.name)[0].shape}")
# Output: (16000, 2) ✓ Stereo
print(f"HF datasets shape with num_channels=2: {audio_data['array'].shape}")
# Output: (16000,) ✗ Mono (should be (2, 16000))
Result:
- Original file: (16000, 2) - stereo ✓
- Audio(num_channels=None): (16000,) - mono ✗
- Audio(num_channels=2): (16000,) - mono ✗
- Audio(num_channels=1): (16000,) - mono ✓
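Note that the two libraries use different axis orders, which matters when comparing the shapes above: soundfile returns (num_samples, num_channels), while the shape expected from datasets is (num_channels, num_samples). Below is a minimal sketch, using only NumPy, of putting a soundfile-style array into channels-first order before comparing. The helper name to_channels_first is hypothetical and not part of either library.

```python
import numpy as np


def to_channels_first(samples_first: np.ndarray) -> np.ndarray:
    """Hypothetical helper: (num_samples, num_channels) -> (num_channels, num_samples).

    One-dimensional (mono) input is returned unchanged.
    """
    if samples_first.ndim == 1:
        return samples_first
    return samples_first.T


# Stand-in for the array sf.read() returns for the stereo file in the reproduction
soundfile_style = np.zeros((16000, 2), dtype=np.float32)
channels_first = to_channels_first(soundfile_style)
print(channels_first.shape)  # (2, 16000)
```

With this conversion, a correctly decoded stereo sample from datasets should match channels_first in shape.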
Expected behavior
According to the documentation, Audio decoding should return samples with shape (num_channels, num_samples):
- num_channels=None should preserve the original number of channels from the source file
- num_channels=2 should preserve/convert to stereo output with shape (2, num_samples)
- num_channels=1 should downmix to mono with shape (num_samples,)
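The expectations above can be written down as a small check. This is only a sketch of what this report expects, not the library's implementation. The function expected_shape is a hypothetical helper, and it assumes (as stated above) that mono output is one-dimensional.

```python
from typing import Optional, Tuple


def expected_shape(
    num_channels: Optional[int], source_channels: int, num_samples: int
) -> Tuple[int, ...]:
    """Hypothetical helper encoding the shapes this report expects."""
    # None means "keep whatever the source file has"
    channels = source_channels if num_channels is None else num_channels
    # Assumption taken from this report: mono output is one-dimensional
    if channels == 1:
        return (num_samples,)
    return (channels, num_samples)


# Expected shapes for the 1-second, 16 kHz stereo file from the reproduction
for setting in (None, 2, 1):
    print(setting, expected_shape(setting, source_channels=2, num_samples=16000))
```

Comparing audio_data['array'].shape from the reproduction against expected_shape(2, 2, 16000) would flag the mismatch described in this report.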
Actual behavior
All num_channels settings produce mono output with shape (num_samples,), even when the source audio file is stereo.
Environment info
OS: macOS / Linux
Python 3.10.19
datasets 4.4.2
torchcodec 0.10.0