Hello, I’m exploring the audio-to-video script in this repository and would like to understand how the audio features are extracted. The STFT (Short-Time Fourier Transform) features in the stft_pickle data have a shape of (90, 45, 17), and the corresponding video has 90 frames, so the first dimension appears to hold one entry per video frame. Could you explain how the STFT features are computed?