This script identifies and manages duplicate audio files across two directories: a reference folder and a source folder.
- Scans specified folders for audio files with configurable extensions.
- Computes MD5 hashes of audio files for comparison (hashing only the first 10MB for performance).
- Identifies duplicates based on file content, not just names.
- Moves duplicate files to a designated output folder, ensuring unique filenames.
- Handles cross-filesystem moves and retries failed moves with exponential backoff.
- Supports dry run mode to simulate actions without making changes.
- Provides detailed logging of the process with Unicode character support.
- Tracks progress for scanning, hashing, and moving operations.
-
Configuration:
- Open the script and edit the
USER CONFIGURATIONsection at the top:REFERENCE_FOLDER: Path to the folder containing original files (protected).SOURCE_FOLDER: Path to the folder to scan for duplicates.OUTPUT_FOLDER: Path to the folder where duplicate files will be moved.LOG_FILE: Path to the log file where detailed logs will be saved.LOGGING_LEVEL: Logging verbosity (DEBUG,INFO,WARNING, orERROR).AUDIO_EXTENSIONS: List of audio file extensions to process (e.g.,.mp3,.flac).THREADS: Number of concurrent workers for processing (default: 8).BLOCK_SIZE: Size of chunks (in bytes) to read when hashing files (default: 64 KB).HASH_ALGORITHM: Hashing algorithm to use (default: MD5).MAX_HASH_BYTES: Maximum number of bytes to hash from each file (default: 10 MB).DRY_RUN: Set toTruefor a dry run mode that simulates actions without making changes.RETRIES: Number of retry attempts for failed file moves (default: 3).RETRY_DELAY: Initial delay (in seconds) between retries, with exponential backoff.
- Open the script and edit the
-
Run the Script: Execute the script using Python: python audio_deduplication.py
-
Check Logs: The log file specified in
LOG_FILEwill contain detailed information about the process, including errors, warnings, and actions performed.
-
Scanning Files: The script scans both the reference folder and source folder recursively for audio files matching the specified extensions.
-
Hashing Files: Files are hashed based on their content using MD5. For performance, only the first 10MB of each file is hashed by default. This can be adjusted using the
MAX_HASH_BYTESvariable. -
Identifying Duplicates: Duplicates are identified by comparing hashes. Files with matching hashes are considered duplicates regardless of their filenames.
-
Moving Duplicates: Duplicate files are moved from the source folder to the output folder. If a file with the same name already exists in the output folder, a unique name is generated by appending a counter (e.g.,
_1,_2). -
Dry Run Mode: When enabled (
DRY_RUN = True), no actual file operations are performed. Instead, actions are logged as if they were executed. -
Cross-Filesystem Moves: If source and destination folders are on different filesystems, the script falls back to using
shutil.move()instead ofos.rename(). -
Retry Logic: If a file move fails, the script retries up to a configurable number of times (
RETRIES) with exponential backoff (RETRY_DELAY). -
Progress Tracking: Progress bars are displayed for scanning, hashing, and moving operations using the
tqdmlibrary. -
Logging: Logs are written to both the console and a log file in UTF-8 encoding. The log level can be adjusted using the
LOGGING_LEVELvariable.
- Python 3.6 or higher
- tqdm library (install with:
pip install tqdm)
This script moves files from the source folder to the output folder. Always ensure you have backups before running it.