Audio-File-Deduplication-Script

This script identifies and manages duplicate audio files across two directories: a reference folder and a source folder.

Features:

Scans specified folders for audio files with configurable extensions.
Computes MD5 hashes of audio files for comparison (by default only the first 10 MB of each file is hashed, for performance).
Identifies duplicates by file content (the hashed portion), not by filename.
Moves duplicate files to a designated output folder, ensuring unique filenames.
Handles cross-filesystem moves and retries failed moves with exponential backoff.
Supports dry run mode to simulate actions without making changes.
Provides detailed logging of the process with Unicode character support.
Tracks progress for scanning, hashing, and moving operations.

Usage:

Configuration:
- Open the script and edit the USER CONFIGURATION section at the top:
  - REFERENCE_FOLDER: Path to the folder containing original files (protected).
  - SOURCE_FOLDER: Path to the folder to scan for duplicates.
  - OUTPUT_FOLDER: Path to the folder where duplicate files will be moved.
  - LOG_FILE: Path to the log file where detailed logs will be saved.
  - LOGGING_LEVEL: Logging verbosity (DEBUG, INFO, WARNING, or ERROR).
  - AUDIO_EXTENSIONS: List of audio file extensions to process (e.g., .mp3, .flac).
  - THREADS: Number of concurrent workers for processing (default: 8).
  - BLOCK_SIZE: Size of chunks (in bytes) to read when hashing files (default: 64 KB).
  - HASH_ALGORITHM: Hashing algorithm to use (default: MD5).
  - MAX_HASH_BYTES: Maximum number of bytes to hash from each file (default: 10 MB).
  - DRY_RUN: Set to True for a dry run mode that simulates actions without making changes.
  - RETRIES: Number of retry attempts for failed file moves (default: 3).
  - RETRY_DELAY: Initial delay (in seconds) between retries, with exponential backoff.
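
For orientation, a USER CONFIGURATION section using the variables listed above might look roughly like the sketch below. The paths, the log file name, and the values for LOGGING_LEVEL, DRY_RUN and RETRY_DELAY are placeholders assumed for illustration; only the defaults stated in the list come from this README.

```python
# Hypothetical USER CONFIGURATION block. Paths and the values marked
# "assumed" are placeholders, not the script's documented defaults.
REFERENCE_FOLDER = "/path/to/reference"    # placeholder: originals (protected)
SOURCE_FOLDER = "/path/to/source"          # placeholder: folder scanned for duplicates
OUTPUT_FOLDER = "/path/to/duplicates"      # placeholder: where duplicates are moved
LOG_FILE = "audio_deduplication.log"       # placeholder log file name
LOGGING_LEVEL = "INFO"                     # assumed; one of DEBUG, INFO, WARNING, ERROR
AUDIO_EXTENSIONS = [".mp3", ".flac"]       # example extensions from the list above
THREADS = 8                                # documented default
BLOCK_SIZE = 64 * 1024                     # documented default: 64 KB, in bytes
HASH_ALGORITHM = "md5"                     # documented default (spelling of the value assumed)
MAX_HASH_BYTES = 10 * 1024 * 1024          # documented default: 10 MB, in bytes
DRY_RUN = True                             # assumed; simulate first, then set to False
RETRIES = 3                                # documented default
RETRY_DELAY = 1                            # assumed initial delay in seconds
```

Starting with DRY_RUN = True lets you review the log before any file is moved.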
Run the Script: Execute the script with Python, for example: python audio_deduplication.py
Check Logs: The log file specified in LOG_FILE will contain detailed information about the process, including errors, warnings, and actions performed.

Features in Detail:

Scanning Files: The script scans both the reference folder and source folder recursively for audio files matching the specified extensions.
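
A minimal sketch of such a recursive scan, assuming os.walk() is used and that extensions are matched case-insensitively (neither detail is confirmed by this README). The function name scan_audio_files is hypothetical.

```python
import os


def scan_audio_files(folder, extensions):
    """Recursively collect files under `folder` whose extension is in `extensions`.

    Hypothetical helper: case-insensitive matching is an assumption.
    """
    wanted = {ext.lower() for ext in extensions}
    found = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            # splitext returns ("name", ".ext"); compare the lowercased extension.
            if os.path.splitext(name)[1].lower() in wanted:
                found.append(os.path.join(root, name))
    return sorted(found)
```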
Hashing Files: Files are hashed based on their content using MD5. For performance, only the first 10 MB of each file is hashed by default, so files that differ only beyond that point produce the same hash. The limit can be adjusted using the MAX_HASH_BYTES variable.
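
The partial hashing described here can be sketched as follows. This is an illustration built on hashlib, not the script's actual code; the function name partial_hash and its parameter names are hypothetical, with defaults mirroring BLOCK_SIZE, HASH_ALGORITHM and MAX_HASH_BYTES.

```python
import hashlib


def partial_hash(path, algorithm="md5", block_size=64 * 1024,
                 max_bytes=10 * 1024 * 1024):
    """Hash at most `max_bytes` from the start of the file, reading in chunks."""
    hasher = hashlib.new(algorithm)
    remaining = max_bytes
    with open(path, "rb") as handle:
        while remaining > 0:
            # Never read past the configured limit.
            chunk = handle.read(min(block_size, remaining))
            if not chunk:  # end of file reached before the limit
                break
            hasher.update(chunk)
            remaining -= len(chunk)
    return hasher.hexdigest()
```

Reading in fixed-size chunks keeps memory use bounded regardless of file size.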
Identifying Duplicates: Duplicates are identified by comparing hashes. Files with matching hashes are considered duplicates regardless of their filenames.
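
One way to express this comparison, assuming a source file counts as a duplicate when its hash also occurs in the reference folder (whether the script also flags duplicates within the source folder itself is not stated here). The function name find_duplicates and the path-to-hash dictionaries are hypothetical.

```python
def find_duplicates(reference_hashes, source_hashes):
    """Return source paths whose hash also appears among the reference files.

    Both arguments map file path -> hex digest (hypothetical data layout).
    """
    known = set(reference_hashes.values())
    return [path for path, digest in source_hashes.items() if digest in known]
```

Because only digests are compared, a renamed copy is still detected.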
Moving Duplicates: Duplicate files are moved from the source folder to the output folder. If a file with the same name already exists in the output folder, a unique name is generated by appending a counter (e.g., _1, _2).
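
The counter-based naming could look like the sketch below. The function name unique_destination is hypothetical, and placing the counter before the extension (song_1.mp3 rather than song.mp3_1) is an assumption.

```python
import os


def unique_destination(output_folder, filename):
    """Return a path in `output_folder` that does not exist yet.

    Appends _1, _2, ... before the extension until the name is free.
    """
    base, ext = os.path.splitext(filename)
    candidate = os.path.join(output_folder, filename)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(output_folder, f"{base}_{counter}{ext}")
        counter += 1
    return candidate
```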
Dry Run Mode: When enabled (DRY_RUN = True), no actual file operations are performed. Instead, actions are logged as if they were executed.
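
A dry run guard of this kind is typically a single branch around the file operation. The sketch below is an assumption about its shape; the function name move_duplicate, the logger name and the log wording are all hypothetical.

```python
import logging
import shutil

logger = logging.getLogger("dedup_sketch")  # hypothetical logger name


def move_duplicate(src, dst, dry_run=True):
    """Move `src` to `dst`, or only log the intended move when `dry_run` is set.

    Returns True if the file was actually moved.
    """
    if dry_run:
        logger.info("[DRY RUN] Would move %s -> %s", src, dst)
        return False
    shutil.move(src, dst)
    logger.info("Moved %s -> %s", src, dst)
    return True
```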
Cross-Filesystem Moves: If source and destination folders are on different filesystems, the script falls back to using shutil.move() instead of os.rename().
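
A sketch of that fallback, assuming the script detects the cross-filesystem case through the error raised by os.rename(). On POSIX systems that error is an OSError with errno EXDEV. The function name move_file is hypothetical.

```python
import errno
import os
import shutil


def move_file(src, dst):
    """Rename when possible; fall back to shutil.move() across filesystems."""
    try:
        os.rename(src, dst)  # fast path: same filesystem
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise  # unrelated failure (missing file, permissions, ...)
        # Different filesystem: shutil.move() copies the file, then removes the source.
        shutil.move(src, dst)
```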
Retry Logic: If a file move fails, the script retries up to RETRIES times. RETRY_DELAY sets the initial wait in seconds, and the wait grows exponentially with each further attempt.
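
The retry loop might be structured as below. Two details are assumptions, not taken from the script: the delay doubles after each failure, and RETRIES counts the retries made after the initial attempt. The function name retry_with_backoff and the injectable sleep parameter are hypothetical.

```python
import time


def retry_with_backoff(action, retries=3, initial_delay=1.0, sleep=time.sleep):
    """Call `action()`; on OSError wait and retry, doubling the delay each time.

    Makes one initial attempt plus up to `retries` retries, then re-raises.
    """
    delay = initial_delay
    for attempt in range(retries + 1):
        try:
            return action()
        except OSError:
            if attempt == retries:
                raise  # retries exhausted: surface the last error
            sleep(delay)
            delay *= 2  # assumed backoff factor
```

With initial_delay=1.0 and retries=3, the waits between attempts are 1, 2 and 4 seconds.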
Progress Tracking: Progress bars are displayed for scanning, hashing, and moving operations using the tqdm library.
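
As an illustration of how tqdm can wrap concurrent work, the sketch below combines it with a thread pool sized like THREADS. Whether the script uses ThreadPoolExecutor is an assumption, the function name process_with_progress is hypothetical, and the import fallback exists only so the sketch runs without tqdm installed.

```python
from concurrent.futures import ThreadPoolExecutor

try:
    from tqdm import tqdm
except ImportError:
    # Fallback for this sketch only: no progress bar, same iteration.
    def tqdm(iterable, **kwargs):
        return iterable


def process_with_progress(items, worker, threads=8, desc="Hashing"):
    """Apply `worker` to each item concurrently and show a progress bar.

    `items` must be a list so its length can be used as the bar's total.
    Results are returned in the same order as `items`.
    """
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = pool.map(worker, items)
        return list(tqdm(results, total=len(items), desc=desc, unit="file"))
```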
Logging: Logs are written to both the console and a log file in UTF-8 encoding. The log level can be adjusted using the LOGGING_LEVEL variable.
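
A logger writing to both destinations can be set up with the standard logging module as sketched here. The function name setup_logging, the logger name and the message format are hypothetical. Passing encoding="utf-8" to FileHandler is what keeps non-ASCII filenames intact in the log file.

```python
import logging
import sys


def setup_logging(log_file, level="INFO"):
    """Create a logger that writes to the console and to a UTF-8 log file."""
    logger = logging.getLogger("audio_dedup_sketch")  # hypothetical name
    logger.setLevel(level.upper())  # accepts "DEBUG", "INFO", "WARNING", "ERROR"
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    file_handler = logging.FileHandler(log_file, encoding="utf-8")
    console_handler = logging.StreamHandler(sys.stdout)
    for handler in (file_handler, console_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```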

Requirements:

Python 3.6 or higher
tqdm library (install with: pip install tqdm)

Caution:

This script moves files from the source folder to the output folder. Always ensure you have backups before running it.

sakej/Audio-File-Deduplication-Script
