Skip to content

Conversation

@benrucker
Copy link

@benrucker benrucker commented Jan 3, 2026

Closes #1231

Note

My understanding of the original issue is that when downloading media with the --media flag (with or without --reuse-media) there is a significant chance that multiple pieces of media could attempt to use the same filename + hash, causing all but one media download to be silently lost. If this is wrong, let me know!

Change proposed

This PR introduces a new system for saving media downloads gated behind an opt-in flag --new-media-paths that eliminates the chance for media filenames to collide by using existing unique IDs from Discord.

Rationale

This implementation takes inspiration from this PR #1232 and its comment #1231 (comment):

An alternative suggestion to lengthening the hash would be to just put the attachment ID in the filename. I feel like at a certain point, a bunch of random base-32 characters doesn't really provide much utility over 19 digits (which would be guaranteed to never collide).

This seemed reasonable to me, so I decided to look in to implementing a version of it myself for personal use.

The interesting thing I noticed is that out of the 5 Discord media types that DCE downloads via ExportAssetDownloader (emojis, stickers, avatars, attachments, server icons), only three of them use snowflakes while the other two use hashes. The two that use hashes, avatars and server icons, are uniquely identified by a combination of the member/server ID and the file hash:

Asset URL Unique identifier
Custom Emoji emojis/{emoji_id}.png {emoji_id}
Sticker stickers/{sticker_id}.png {sticker_id}
Attachment attachments/{channel_id}/{attachment_id}/{attachment}.png {attachment_id}
Guild Icon icons/{guild_id}/{guild_icon}.png {guild_id}+{guild_icon}
Guild Member Avatar guilds/{guild_id}/users/{user_id}/avatars/{member_avatar}.png {user_id}+{member_avatar}

So strictly speaking, if we wanted to save something using just its unique ID (snowflake or composite ID+hash), we'd have to introspect the URL, determine the type of asset we're downloading, then branch on that to properly extract the right components from the URL.

But we could take this one step further: why extract parts from the URL to construct a unique filename when the URL is already a unique identifier itself? Using the URL itself has the added benefit of naturally separating the media downloads by their type:
emojis go in the emojis/ folder, stickers go in the stickers/ folder, etc.

Single-file examples

When ExportAssetDownloader receives a media URL, it checks if the URL is from the Discord CDN. If so, it parses the path from the rest of the URL and uses that as the path to the file that will be saved to disk.

If the size query param is present, that is added as the last folder in the file path (a la Adobe Illustrator's technique for organizing exports of assets across multiple size). This prevents a possible issue where ExportAssetDownloader receives two requests for the same asset at different sizes causing one size to be overwritten.

Discord CDN file

For example, if we get this URL passed:

https://cdn.discordapp.com/emojis/754441880066064584.webp?size=128

Then we will save that media to this path on disk

{media_path}/emojis/128px/754441880066064584.webp

External media file

If the URL is not from the discord CDN, then it's sequestered into an external/ subfolder and then uses the path in the URL as the filepath beyond that. For example, an embed might have a URL like the following:

https://images-ext-1.discordapp.net/external/kJEb3-XApHsU0ov41y9Nk8BuseDcmeUSHQEtN0swR2M/https/i.ytimg.com/vi/6RghYTEwbpM/maxresdefault.jpg

Which would get saved to this location:

{media_path}/external/images-ext-1.discordapp.net/external/kJEb3-XApHsU0ov41y9Nk8BuseDcmeUSHQEtN0swR2M/https/i.ytimg.com/vi/6RghYTEwbpM/maxresdefault.jpg

Wholistic directory structure

In effect, running a command that looks something like this:

./DiscordChatExporter.Cli export
  -t "wHid.spr.bNa" 
  --media 
  --reuse-media 
  --new-media-paths 
  -o "output/data/"
  --media-dir "output/media/"

Would result in an output structure along these lines:

output/
  data/
    guild name - category - channel1 [id].json
    guild name - category - channel2 [id].json
    guild name - category - channel3 [id].json
    ...
  media/
    attachments/
      173123.../ (server_id)
        145699.../ (attachment_id)
          IMG_1550.png
        ...
      ...
    avatars/
      17294.../ (user id)
        cadcc293.png (avatar hash)
      ...
    emojis/
      1234219.png (emoji id)
      ...
    external/
      cdn.jsdelivr.net/
      images-ext-1.discordapp.net/
      www.youtube.com/
      ...
    guilds/
      170123.../ (guild_id)
        users/
          17290.../ (user id)
            avatars/
              512px/
                50e477.png (avatar hash)
              ...
          ...
      ...
    icons/
      170123.../ (guild_id)
        512px/
          abde23.gif (file hash)
        ...

Comparison to the old behavior

The old behavior puts all media files flat into the same folder. Some benefits of this is that it reduces the number of folders on your disk, reduces the number of characters needed to point to each file, and in a way feels clean when you see everything packaged together. A commonly cited drawback of this is that filename collisions are highly likely.

The new behavior's nesting improves accessibility of casually inspecting the results of a backup via the native file browser. Additionally, the structure allows for external tooling to understand the type of each file, making it easier for other programs to display the downloaded assets (e.g. a simple sticker browser). Finally, filename collisions can never happen with this system while still being idempotent.

A drawback of the new system is that some of the resulting file paths are pretty ugly. If this is a major concern (which would be fairly reasonable as the file paths are essentially the implied public API of --media), we could bring back some amount of URL introspection to make them more pretty (e.g. moving server-specific user avatars from guilds/{guild_id}/users/{user_id}/avatars/{avatar_hash}.png to avatars/{user_id}/guilds/{guild_id}/{avatar_hash}.png.

Further follow-ups

If this style of change is wanted as a way to fix #1231 before v3, then I'm happy to iterate on this PR to make it production-ready. I'd imagine the following changes could be discussed before merging:

  • Custom handling of twemoji links (e.g. https://cdn.jsdelivr.net/gh/twitter/twemoji@latest/assets/svg/1f44d.svg) to shorten the output file path (for example, to put them in a twemojis/ folder)
  • Custom handling of images-ext-1.discordapp.net links to shorten the output file path to just the external URL
  • Better sanitation of external media URLs
    • I didn't see any invalid file paths in my testing, but I wonder if it could happen

If this isn't the direction that DCE wants to go in, then I'm happy to keep it in my personal fork. Thanks!

@Tyrrrz Tyrrrz added the bug label Jan 3, 2026
@Lymia
Copy link
Contributor

Lymia commented Jan 5, 2026

fwiw, I think a simple solution like this would work quite well. I'd a bit worried about navigating the different filename restrictions on different platforms (or URLs containing wild crap like /../), but those are solvable problems.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new opt-in flag --new-media-paths to prevent media filename collisions when downloading assets. Instead of using hash-based filenames that can collide, the new approach uses Discord's URL structure to create unique file paths organized by media type (emojis, stickers, avatars, etc.).

Changes:

  • Adds --new-media-paths CLI flag and corresponding GUI toggle to enable collision-free media organization
  • Implements new path generation logic that preserves Discord's URL structure for downloaded media
  • Organizes media into subdirectories by type (emojis/, attachments/, external/, etc.) with size-specific subfolders

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
DiscordChatExporter.Gui/Views/Dialogs/ExportSetupView.axaml Adds UI toggle for the new media paths feature
DiscordChatExporter.Gui/ViewModels/Dialogs/ExportSetupViewModel.cs Adds view model property for the new flag
DiscordChatExporter.Gui/ViewModels/Components/DashboardViewModel.cs Passes the new flag to export requests
DiscordChatExporter.Gui/Services/SettingsService.cs Persists the new flag setting
DiscordChatExporter.Core/Exporting/ExportRequest.cs Adds the new flag as a property and constructor parameter
DiscordChatExporter.Core/Exporting/ExportContext.cs Passes the new flag to the asset downloader
DiscordChatExporter.Core/Exporting/ExportAssetDownloader.cs Implements new path generation logic based on URL structure
DiscordChatExporter.Cli/Commands/Base/ExportCommandBase.cs Adds CLI option and validation for the new flag

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

if (!string.Equals(uri.Host, "cdn.discordapp.com", StringComparison.OrdinalIgnoreCase))
{
File.AppendAllTextAsync("external_urls.txt", $"{url}\n");
return $"external/{uri.Host}{uri.AbsolutePath}";
Copy link

Copilot AI Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For external URLs, the path is constructed using uri.Host and uri.AbsolutePath directly without proper sanitization. This creates two issues:

  1. Path traversal vulnerability: The AbsolutePath could contain '..' sequences that could allow writing files outside the intended media directory
  2. Invalid file paths: The path components are not escaped using Path.EscapeFileName, so they may contain invalid filesystem characters (e.g., ':', '*', '?', etc.)

The path components should be sanitized before use. Consider escaping both the host and each segment of the absolute path using Path.EscapeFileName.

Suggested change
return $"external/{uri.Host}{uri.AbsolutePath}";
// Sanitize host and path segments to avoid traversal and invalid characters
var safeHost = Path.EscapeFileName(uri.Host);
var segments = uri.AbsolutePath.Split('/', StringSplitOptions.RemoveEmptyEntries);
var safeSegments = new List<string>();
foreach (var segment in segments)
{
// Skip traversal segments
if (segment == "." || segment == "..")
continue;
safeSegments.Add(Path.EscapeFileName(segment));
}
if (safeSegments.Count == 0)
return $"external/{safeHost}";
var safePath = string.Join('/', safeSegments);
return $"external/{safeHost}/{safePath}";

Copilot uses AI. Check for mistakes.

// If it is a Discord URL, we're guaranteed to have matches for these groups. <see cref="ImageCdn"/>
if (string.IsNullOrWhiteSpace(path) || string.IsNullOrWhiteSpace(fileName))
throw new Exception("Invalid Discord URL shape");
Copy link

Copilot AI Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the generic Exception type is not recommended. Consider using a more specific exception type like InvalidOperationException or creating a custom exception type for this scenario. This makes it easier for calling code to handle specific error cases appropriately.

Suggested change
throw new Exception("Invalid Discord URL shape");
throw new InvalidOperationException("Invalid Discord URL shape");

Copilot uses AI. Check for mistakes.
Copy link
Owner

@Tyrrrz Tyrrrz Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should fail more gracefully here instead of an exception. I.e. treat this URL as non-Discord CDN

Comment on lines 68 to 94
private static string GetFilePathFromUrl(string url)
{
var uri = new Uri(url);

// Try to extract the file name from URL
var pathAndFileName = Regex.Match(uri.AbsolutePath, @"/(.+)/([^?]*)");
var path = pathAndFileName.Groups[1].Value;
var fileName = pathAndFileName.Groups[2].Value;

// If this isn't a Discord CDN URL, save the file to the `media/external` folder.
if (!string.Equals(uri.Host, "cdn.discordapp.com", StringComparison.OrdinalIgnoreCase))
{
File.AppendAllTextAsync("external_urls.txt", $"{url}\n");
return $"external/{uri.Host}{uri.AbsolutePath}";
}

// If it is a Discord URL, we're guaranteed to have matches for these groups. <see cref="ImageCdn"/>
if (string.IsNullOrWhiteSpace(path) || string.IsNullOrWhiteSpace(fileName))
throw new Exception("Invalid Discord URL shape");

// If there is a size parameter, add it as the final folder in the path.
// This prevents multiple sizes of an avatar, for example, from overwriting each other.
var sizeParam = HttpUtility.ParseQueryString(uri.Query)["size"];
var sizePathSegment = sizeParam != null ? $"{sizeParam}px/" : "";

return $"{path}/{sizePathSegment}{Path.EscapeFileName(fileName)}";
}
Copy link

Copilot AI Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new file path generation logic introduced by the ShouldUseNewMediaFilePaths flag lacks test coverage. Given that the repository has comprehensive tests for asset downloading (e.g., SelfContainedSpecs.cs), this new functionality should be similarly tested to ensure:

  1. Files are saved to the correct paths for Discord CDN URLs
  2. External URLs are handled correctly
  3. Size parameters create appropriate subdirectories
  4. Path sanitization works correctly
  5. The new paths don't collide

Consider adding tests that verify the correct behavior with the new flag enabled.

Copilot uses AI. Check for mistakes.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's enough to duplicate existing tests and make them leverage the new option

Comment on lines 88 to 91
// If there is a size parameter, add it as the final folder in the path.
// This prevents multiple sizes of an avatar, for example, from overwriting each other.
var sizeParam = HttpUtility.ParseQueryString(uri.Query)["size"];
var sizePathSegment = sizeParam != null ? $"{sizeParam}px/" : "";
Copy link
Owner

@Tyrrrz Tyrrrz Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make this more generic and instead include all query parameters into the filename (instead of making new directories). Something like:

avatar_size=256px.png

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding the tests revealed that when outputting HTML, the = will get URL-encoded to %3D, which could cause problems. So maybe instead of equals, we could use a hyphen instead?

Ex:
avatar_size-256_spoiler-false_animated-true.png

public bool ShouldReuseAssets { get; init; } = false;

[CommandOption(
"new-media-paths",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use media-nested:

  1. Avoids the "new" as that doesn't really convey anything useful to the user.
  2. Explains what it actually does.

Please update related symbol names as well.

Comment on lines 106 to 107
Description = "Saves media with a new file naming convention to prevent collisions. "
+ "Duplicate media may be downloaded if consecutive runs use different values for this flag."
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the second part is a bit confusing in this case. We can instead update the description of --reuse-media to mention that the --media-nested should be consistent.

Suggested change
Description = "Saves media with a new file naming convention to prevent collisions. "
+ "Duplicate media may be downloaded if consecutive runs use different values for this flag."
Description = "Saves assets with a nested file naming convention, creating subdirectories for each distinct asset type."

@benrucker benrucker force-pushed the br/newMediaFilePaths branch from 353eb05 to 431526b Compare January 17, 2026 04:09
@benrucker benrucker force-pushed the br/newMediaFilePaths branch from 431526b to 076d038 Compare January 17, 2026 04:26
@benrucker
Copy link
Author

benrucker commented Jan 17, 2026

Regarding how best to handle external URLs, I'm thinking of the following:

I'm less sure that the second point needs to be upstreamed, just something to consider if wanted!

@benrucker
Copy link
Author

Thanks for the thorough review! Much appreciated

@benrucker
Copy link
Author

If folder count is a concern, we can also special-case the Discord CDN types, like changing attachment save location from attachments/{channel_id}/{attachment_id}/{filename}.{ext} to just attachments/{attachment_id}.{ext} or similar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

--reuse-media creates an unacceptably high chance of a collision

3 participants