-
-
Notifications
You must be signed in to change notification settings - Fork 896
Add flag to enable collisionless media downloads #1461
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
fwiw, I think a simple solution like this would work quite well. I'd a bit worried about navigating the different filename restrictions on different platforms (or URLs containing wild crap like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces a new opt-in flag --new-media-paths to prevent media filename collisions when downloading assets. Instead of using hash-based filenames that can collide, the new approach uses Discord's URL structure to create unique file paths organized by media type (emojis, stickers, avatars, etc.).
Changes:
- Adds
--new-media-pathsCLI flag and corresponding GUI toggle to enable collision-free media organization - Implements new path generation logic that preserves Discord's URL structure for downloaded media
- Organizes media into subdirectories by type (emojis/, attachments/, external/, etc.) with size-specific subfolders
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| DiscordChatExporter.Gui/Views/Dialogs/ExportSetupView.axaml | Adds UI toggle for the new media paths feature |
| DiscordChatExporter.Gui/ViewModels/Dialogs/ExportSetupViewModel.cs | Adds view model property for the new flag |
| DiscordChatExporter.Gui/ViewModels/Components/DashboardViewModel.cs | Passes the new flag to export requests |
| DiscordChatExporter.Gui/Services/SettingsService.cs | Persists the new flag setting |
| DiscordChatExporter.Core/Exporting/ExportRequest.cs | Adds the new flag as a property and constructor parameter |
| DiscordChatExporter.Core/Exporting/ExportContext.cs | Passes the new flag to the asset downloader |
| DiscordChatExporter.Core/Exporting/ExportAssetDownloader.cs | Implements new path generation logic based on URL structure |
| DiscordChatExporter.Cli/Commands/Base/ExportCommandBase.cs | Adds CLI option and validation for the new flag |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if (!string.Equals(uri.Host, "cdn.discordapp.com", StringComparison.OrdinalIgnoreCase)) | ||
| { | ||
| File.AppendAllTextAsync("external_urls.txt", $"{url}\n"); | ||
| return $"external/{uri.Host}{uri.AbsolutePath}"; |
Copilot
AI
Jan 17, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For external URLs, the path is constructed using uri.Host and uri.AbsolutePath directly without proper sanitization. This creates two issues:
- Path traversal vulnerability: The AbsolutePath could contain '..' sequences that could allow writing files outside the intended media directory
- Invalid file paths: The path components are not escaped using Path.EscapeFileName, so they may contain invalid filesystem characters (e.g., ':', '*', '?', etc.)
The path components should be sanitized before use. Consider escaping both the host and each segment of the absolute path using Path.EscapeFileName.
| return $"external/{uri.Host}{uri.AbsolutePath}"; | |
| // Sanitize host and path segments to avoid traversal and invalid characters | |
| var safeHost = Path.EscapeFileName(uri.Host); | |
| var segments = uri.AbsolutePath.Split('/', StringSplitOptions.RemoveEmptyEntries); | |
| var safeSegments = new List<string>(); | |
| foreach (var segment in segments) | |
| { | |
| // Skip traversal segments | |
| if (segment == "." || segment == "..") | |
| continue; | |
| safeSegments.Add(Path.EscapeFileName(segment)); | |
| } | |
| if (safeSegments.Count == 0) | |
| return $"external/{safeHost}"; | |
| var safePath = string.Join('/', safeSegments); | |
| return $"external/{safeHost}/{safePath}"; |
|
|
||
| // If it is a Discord URL, we're guaranteed to have matches for these groups. <see cref="ImageCdn"/> | ||
| if (string.IsNullOrWhiteSpace(path) || string.IsNullOrWhiteSpace(fileName)) | ||
| throw new Exception("Invalid Discord URL shape"); |
Copilot
AI
Jan 17, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the generic Exception type is not recommended. Consider using a more specific exception type like InvalidOperationException or creating a custom exception type for this scenario. This makes it easier for calling code to handle specific error cases appropriately.
| throw new Exception("Invalid Discord URL shape"); | |
| throw new InvalidOperationException("Invalid Discord URL shape"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should fail more gracefully here instead of an exception. I.e. treat this URL as non-Discord CDN
| private static string GetFilePathFromUrl(string url) | ||
| { | ||
| var uri = new Uri(url); | ||
|
|
||
| // Try to extract the file name from URL | ||
| var pathAndFileName = Regex.Match(uri.AbsolutePath, @"/(.+)/([^?]*)"); | ||
| var path = pathAndFileName.Groups[1].Value; | ||
| var fileName = pathAndFileName.Groups[2].Value; | ||
|
|
||
| // If this isn't a Discord CDN URL, save the file to the `media/external` folder. | ||
| if (!string.Equals(uri.Host, "cdn.discordapp.com", StringComparison.OrdinalIgnoreCase)) | ||
| { | ||
| File.AppendAllTextAsync("external_urls.txt", $"{url}\n"); | ||
| return $"external/{uri.Host}{uri.AbsolutePath}"; | ||
| } | ||
|
|
||
| // If it is a Discord URL, we're guaranteed to have matches for these groups. <see cref="ImageCdn"/> | ||
| if (string.IsNullOrWhiteSpace(path) || string.IsNullOrWhiteSpace(fileName)) | ||
| throw new Exception("Invalid Discord URL shape"); | ||
|
|
||
| // If there is a size parameter, add it as the final folder in the path. | ||
| // This prevents multiple sizes of an avatar, for example, from overwriting each other. | ||
| var sizeParam = HttpUtility.ParseQueryString(uri.Query)["size"]; | ||
| var sizePathSegment = sizeParam != null ? $"{sizeParam}px/" : ""; | ||
|
|
||
| return $"{path}/{sizePathSegment}{Path.EscapeFileName(fileName)}"; | ||
| } |
Copilot
AI
Jan 17, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new file path generation logic introduced by the ShouldUseNewMediaFilePaths flag lacks test coverage. Given that the repository has comprehensive tests for asset downloading (e.g., SelfContainedSpecs.cs), this new functionality should be similarly tested to ensure:
- Files are saved to the correct paths for Discord CDN URLs
- External URLs are handled correctly
- Size parameters create appropriate subdirectories
- Path sanitization works correctly
- The new paths don't collide
Consider adding tests that verify the correct behavior with the new flag enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's enough to duplicate existing tests and make them leverage the new option
| // If there is a size parameter, add it as the final folder in the path. | ||
| // This prevents multiple sizes of an avatar, for example, from overwriting each other. | ||
| var sizeParam = HttpUtility.ParseQueryString(uri.Query)["size"]; | ||
| var sizePathSegment = sizeParam != null ? $"{sizeParam}px/" : ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should make this more generic and instead include all query parameters into the filename (instead of making new directories). Something like:
avatar_size=256px.png
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding the tests revealed that when outputting HTML, the = will get URL-encoded to %3D, which could cause problems. So maybe instead of equals, we could use a hyphen instead?
Ex:
avatar_size-256_spoiler-false_animated-true.png
| public bool ShouldReuseAssets { get; init; } = false; | ||
|
|
||
| [CommandOption( | ||
| "new-media-paths", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use media-nested:
- Avoids the "new" as that doesn't really convey anything useful to the user.
- Explains what it actually does.
Please update related symbol names as well.
| Description = "Saves media with a new file naming convention to prevent collisions. " | ||
| + "Duplicate media may be downloaded if consecutive runs use different values for this flag." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the second part is a bit confusing in this case. We can instead update the description of --reuse-media to mention that the --media-nested should be consistent.
| Description = "Saves media with a new file naming convention to prevent collisions. " | |
| + "Duplicate media may be downloaded if consecutive runs use different values for this flag." | |
| Description = "Saves assets with a nested file naming convention, creating subdirectories for each distinct asset type." |
353eb05 to
431526b
Compare
431526b to
076d038
Compare
|
Regarding how best to handle external URLs, I'm thinking of the following:
I'm less sure that the second point needs to be upstreamed, just something to consider if wanted! |
|
Thanks for the thorough review! Much appreciated |
|
If folder count is a concern, we can also special-case the Discord CDN types, like changing attachment save location from |
Closes #1231
Note
My understanding of the original issue is that when downloading media with the
--mediaflag (with or without--reuse-media) there is a significant chance that multiple pieces of media could attempt to use the same filename + hash, causing all but one media download to be silently lost. If this is wrong, let me know!Change proposed
This PR introduces a new system for saving media downloads gated behind an opt-in flag
--new-media-pathsthat eliminates the chance for media filenames to collide by using existing unique IDs from Discord.Rationale
This implementation takes inspiration from this PR #1232 and its comment #1231 (comment):
This seemed reasonable to me, so I decided to look in to implementing a version of it myself for personal use.
The interesting thing I noticed is that out of the 5 Discord media types that DCE downloads via
ExportAssetDownloader(emojis, stickers, avatars, attachments, server icons), only three of them use snowflakes while the other two use hashes. The two that use hashes, avatars and server icons, are uniquely identified by a combination of the member/server ID and the file hash:emojis/{emoji_id}.png{emoji_id}stickers/{sticker_id}.png{sticker_id}attachments/{channel_id}/{attachment_id}/{attachment}.png{attachment_id}icons/{guild_id}/{guild_icon}.png{guild_id}+{guild_icon}guilds/{guild_id}/users/{user_id}/avatars/{member_avatar}.png{user_id}+{member_avatar}So strictly speaking, if we wanted to save something using just its unique ID (snowflake or composite ID+hash), we'd have to introspect the URL, determine the type of asset we're downloading, then branch on that to properly extract the right components from the URL.
But we could take this one step further: why extract parts from the URL to construct a unique filename when the URL is already a unique identifier itself? Using the URL itself has the added benefit of naturally separating the media downloads by their type:
emojis go in the
emojis/folder, stickers go in thestickers/folder, etc.Single-file examples
When
ExportAssetDownloaderreceives a media URL, it checks if the URL is from the Discord CDN. If so, it parses the path from the rest of the URL and uses that as the path to the file that will be saved to disk.If the
sizequery param is present, that is added as the last folder in the file path (a la Adobe Illustrator's technique for organizing exports of assets across multiple size). This prevents a possible issue whereExportAssetDownloaderreceives two requests for the same asset at different sizes causing one size to be overwritten.Discord CDN file
For example, if we get this URL passed:
Then we will save that media to this path on disk
External media file
If the URL is not from the discord CDN, then it's sequestered into an
external/subfolder and then uses the path in the URL as the filepath beyond that. For example, an embed might have a URL like the following:Which would get saved to this location:
Wholistic directory structure
In effect, running a command that looks something like this:
Would result in an output structure along these lines:
Comparison to the old behavior
The old behavior puts all media files flat into the same folder. Some benefits of this is that it reduces the number of folders on your disk, reduces the number of characters needed to point to each file, and in a way feels clean when you see everything packaged together. A commonly cited drawback of this is that filename collisions are highly likely.
The new behavior's nesting improves accessibility of casually inspecting the results of a backup via the native file browser. Additionally, the structure allows for external tooling to understand the type of each file, making it easier for other programs to display the downloaded assets (e.g. a simple sticker browser). Finally, filename collisions can never happen with this system while still being idempotent.
A drawback of the new system is that some of the resulting file paths are pretty ugly. If this is a major concern (which would be fairly reasonable as the file paths are essentially the implied public API of
--media), we could bring back some amount of URL introspection to make them more pretty (e.g. moving server-specific user avatars fromguilds/{guild_id}/users/{user_id}/avatars/{avatar_hash}.pngtoavatars/{user_id}/guilds/{guild_id}/{avatar_hash}.png.Further follow-ups
If this style of change is wanted as a way to fix #1231 before v3, then I'm happy to iterate on this PR to make it production-ready. I'd imagine the following changes could be discussed before merging:
https://cdn.jsdelivr.net/gh/twitter/twemoji@latest/assets/svg/1f44d.svg) to shorten the output file path (for example, to put them in atwemojis/folder)images-ext-1.discordapp.netlinks to shorten the output file path to just the external URLIf this isn't the direction that DCE wants to go in, then I'm happy to keep it in my personal fork. Thanks!