sakej/Audio-File-Deduplication-Script

Audio-File-Deduplication-Script

This script identifies and manages duplicate audio files across two directories: a reference folder and a source folder.

Features:

  • Scans specified folders for audio files with configurable extensions.
  • Computes MD5 hashes of audio files for comparison; by default only the first 10 MB of each file is hashed, for performance.
  • Identifies duplicates by file content rather than by filename.
  • Moves duplicate files to a designated output folder, ensuring unique filenames.
  • Handles cross-filesystem moves and retries failed moves with exponential backoff.
  • Supports dry run mode to simulate actions without making changes.
  • Provides detailed logging of the process with Unicode character support.
  • Tracks progress for scanning, hashing, and moving operations.
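
The scanning, hashing, and comparison features listed above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names (scan_audio_files, partial_hash, find_duplicates) are hypothetical, the constants simply mirror the defaults described in this README, and it assumes a duplicate means a source file whose hash matches some reference file.

```python
import hashlib
import os
from pathlib import Path

# Assumed values mirroring the defaults described in this README.
AUDIO_EXTENSIONS = {".mp3", ".flac"}
BLOCK_SIZE = 64 * 1024             # read files in 64 KB chunks
MAX_HASH_BYTES = 10 * 1024 * 1024  # hash only the first 10 MB


def scan_audio_files(folder):
    """Recursively collect files whose extension is in AUDIO_EXTENSIONS."""
    found = []
    for root, _dirs, names in os.walk(folder):
        for name in names:
            if Path(name).suffix.lower() in AUDIO_EXTENSIONS:
                found.append(Path(root) / name)
    return found


def partial_hash(path, algorithm="md5", max_bytes=MAX_HASH_BYTES):
    """Hash at most max_bytes from the start of the file, chunk by chunk."""
    digest = hashlib.new(algorithm)
    remaining = max_bytes
    with open(path, "rb") as handle:
        while remaining > 0:
            chunk = handle.read(min(BLOCK_SIZE, remaining))
            if not chunk:  # end of file reached before the limit
                break
            digest.update(chunk)
            remaining -= len(chunk)
    return digest.hexdigest()


def find_duplicates(reference_folder, source_folder):
    """Return source files whose partial hash matches any reference file."""
    reference_hashes = {partial_hash(p) for p in scan_audio_files(reference_folder)}
    return [p for p in scan_audio_files(source_folder)
            if partial_hash(p) in reference_hashes]
```

Because the comparison uses hashes only, a renamed copy in the source folder is still detected, while a file with a non-audio extension is never considered.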

Usage:

  1. Configuration:

    • Open the script and edit the USER CONFIGURATION section at the top:
      • REFERENCE_FOLDER: Path to the folder containing original files (protected).
      • SOURCE_FOLDER: Path to the folder to scan for duplicates.
      • OUTPUT_FOLDER: Path to the folder where duplicate files will be moved.
      • LOG_FILE: Path to the log file where detailed logs will be saved.
      • LOGGING_LEVEL: Logging verbosity (DEBUG, INFO, WARNING, or ERROR).
      • AUDIO_EXTENSIONS: List of audio file extensions to process (e.g., .mp3, .flac).
      • THREADS: Number of concurrent workers for processing (default: 8).
      • BLOCK_SIZE: Size of chunks (in bytes) to read when hashing files (default: 64 KB).
      • HASH_ALGORITHM: Hashing algorithm to use (default: MD5).
      • MAX_HASH_BYTES: Maximum number of bytes to hash from each file (default: 10 MB).
      • DRY_RUN: Set to True for a dry run mode that simulates actions without making changes.
      • RETRIES: Number of retry attempts for failed file moves (default: 3).
      • RETRY_DELAY: Initial delay (in seconds) between retries, with exponential backoff.
  2. Run the Script: From a terminal, run: python audio_deduplication.py

  3. Check Logs: The log file specified in LOG_FILE will contain detailed information about the process, including errors, warnings, and actions performed.
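
For step 1, a filled-in USER CONFIGURATION section might look roughly like the sketch below. The variable names come from the list above, but every value is illustrative: the paths are placeholders, the value types (for example a string for LOGGING_LEVEL) are assumptions, and the RETRY_DELAY value is a guess because this README does not state its default.

```python
# Illustrative values only; the real script's exact syntax may differ.
REFERENCE_FOLDER = "/music/library"     # placeholder path: originals (protected)
SOURCE_FOLDER = "/music/incoming"       # placeholder path: scanned for duplicates
OUTPUT_FOLDER = "/music/duplicates"     # placeholder path: duplicates are moved here
LOG_FILE = "/music/dedup.log"           # placeholder path
LOGGING_LEVEL = "INFO"                  # DEBUG, INFO, WARNING, or ERROR
AUDIO_EXTENSIONS = [".mp3", ".flac"]
THREADS = 8
BLOCK_SIZE = 64 * 1024                  # 64 KB, expressed in bytes
HASH_ALGORITHM = "md5"
MAX_HASH_BYTES = 10 * 1024 * 1024       # 10 MB, expressed in bytes
DRY_RUN = True                          # simulate first, then set to False
RETRIES = 3
RETRY_DELAY = 1                         # seconds; assumed value, no default documented
```

Starting with DRY_RUN = True lets you review the log before any file is actually moved.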

Features in Detail:

  1. Scanning Files: The script scans both the reference folder and source folder recursively for audio files matching the specified extensions.

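  2. Hashing Files: Files are hashed based on their content, using MD5 by default (see HASH_ALGORITHM). For performance, only the first 10 MB of each file is hashed by default; this limit can be adjusted using the MAX_HASH_BYTES variable. As a consequence, two files that are identical within the hashed portion but differ later will be treated as duplicates.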

  3. Identifying Duplicates: Duplicates are identified by comparing hashes. Files with matching hashes are considered duplicates regardless of their filenames.

  4. Moving Duplicates: Duplicate files are moved from the source folder to the output folder. If a file with the same name already exists in the output folder, a unique name is generated by appending a counter (e.g., _1, _2).

  5. Dry Run Mode: When enabled (DRY_RUN = True), no actual file operations are performed. Instead, actions are logged as if they were executed.

  6. Cross-Filesystem Moves: If source and destination folders are on different filesystems, the script falls back to using shutil.move() instead of os.rename().

  7. Retry Logic: If a file move fails, the script retries up to a configurable number of times (RETRIES) with exponential backoff, starting from an initial delay of RETRY_DELAY seconds.

  8. Progress Tracking: Progress bars are displayed for scanning, hashing, and moving operations using the tqdm library.

  9. Logging: Logs are written to both the console and a log file in UTF-8 encoding. The log level can be adjusted using the LOGGING_LEVEL variable.
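
The moving behaviour described in items 4 to 7 (unique names, dry run, cross-filesystem fallback, and retries) could be combined along the lines of the sketch below. It is an assumption-laden illustration rather than the script's real implementation: unique_destination and move_with_retry are hypothetical names, the delay is assumed to double after each failed attempt, and RETRIES is interpreted as the number of extra attempts after the first one.

```python
import logging
import os
import shutil
import time
from pathlib import Path

log = logging.getLogger("dedup_sketch")  # hypothetical logger name


def unique_destination(output_folder, filename):
    """Return a path in output_folder that does not exist yet, adding _1, _2, ..."""
    candidate = Path(output_folder) / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = Path(output_folder) / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate


def move_with_retry(source, output_folder, retries=3, retry_delay=1.0, dry_run=False):
    """Move source into output_folder; return the destination, or None on failure."""
    destination = unique_destination(output_folder, Path(source).name)
    if dry_run:
        # Dry run: only report what would happen.
        log.info("[DRY RUN] Would move %s -> %s", source, destination)
        return destination
    for attempt in range(retries + 1):
        try:
            try:
                os.rename(source, destination)  # fast path on the same filesystem
            except OSError:
                # Fallback that can copy across filesystems.
                shutil.move(str(source), str(destination))
            return destination
        except OSError as error:
            if attempt == retries:
                log.error("Giving up on %s: %s", source, error)
                return None
            delay = retry_delay * (2 ** attempt)  # assumed doubling: 1x, 2x, 4x, ...
            log.warning("Move failed (%s); retrying in %.1fs", error, delay)
            time.sleep(delay)
```

Note that checking for an existing name and then moving is not atomic, so this sketch assumes no other process writes to the output folder at the same time.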

Requirements:

  • Python 3.6 or higher
  • tqdm library (install with: pip install tqdm)

Caution:

This script moves files from the source folder to the output folder. Always ensure you have backups before running it.
