
'check if file already in db by URLs' does not work #1667

Open
444man opened this issue Jan 24, 2025 · 4 comments
Labels
feature-request system:user-interface Looks and actions of the user interface

Comments

@444man

444man commented Jan 24, 2025

Hydrus version

v606

Qt major version

Qt 6

Operating system

Windows 11

Install method

Extract

Install and OS comments

No response

Bug description and reproduction

In my use case, I need to use URLs to check for duplicates,
but an image is downloaded again even though another image already has the same URL.

A simple way to reproduce the bug:

  1. download "https://danbooru.donmai.us/posts/1" with the url downloader
  2. modify the file in an external program (so its hash changes) and import it to hydrus
  3. copy all URLs of the old file to the modified file
  4. choose the old file, then [delete], then [delete physically now], then [clear deletion record]
  5. download "https://danbooru.donmai.us/posts/1" again with the url downloader

The file is downloaded again although the modified file has "https://danbooru.donmai.us/posts/1" in its URLs.
Step 4 is important; without it, the bug does not occur.

Log output


@444man 444man added the bug label Jan 24, 2025
@hydrusnetwork
Owner

Thank you for this report. I am not sure if this is a bug, but I think I can say there is bad user feedback on what is going on here.

The 'have we seen this file before' logic in the downloader can get pretty tricky. There's a bunch of situations where we cannot be confident in the result, and here the system generally falls back to 'I do not know for sure, so we'll let the download go ahead'. An example of this is when a URL that has an ostensible match to a file also has matches to other files. These duplicate mappings can be added by various means, either merging in the duplication system, or a booru that suggests an incorrect 'source URL', or, as in your case, a manual copy. In this case, hydrus has good evidence that the URL mapping it matched is not definite, and in your case it was indeed correct--the URL would result in downloading a new file for which it has no delete record, and so it imports it. Had the deletion record not been cleared, I think the download would have either fetched a hash and matched it to 'previously deleted', or it would have downloaded the file again, calculated the hash locally, and then come to the same result.

There is a similar logical exception for a file that has two refer-capable URLs within the same domain, e.g. somebooru.net/123 and somebooru.net/567--hydrus is not certain about which URL or file mapping is correct here, so it discards that URL as something to rely on.
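
To put those two exceptions into rough pseudocode (just an illustration of the behaviour described above, not the actual hydrus source; all names here are invented for clarity):

```python
# Simplified illustration of the two 'do not trust the URL match' exceptions
# described above. This is not hydrus code; the names are made up for clarity.

def can_trust_url_match(url, matched_file, files_for_url, domain_urls_for_file):
    # Exception 1: the URL maps to more than one file (duplicate merging, a bad
    # 'source URL' from a booru, a manual copy, ...), so the mapping is not
    # definite and the download is allowed to go ahead.
    if len(files_for_url[url]) > 1:
        return False

    # Exception 2: the matched file holds two refer-capable URLs within the same
    # domain (e.g. somebooru.net/123 and somebooru.net/567), so hydrus cannot
    # tell which mapping is correct and discards this URL as something to rely on.
    if len(domain_urls_for_file[matched_file]) > 1:
        return False

    return True
```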

I will remove the bug label, but this is a good place to say that I should have some UI, let's say somewhere in the file log, that can record or otherwise better explain the logic behind the downloader engine's decisions here. Maybe I can set a 'note' for odd situations like this, despite them coming up as 'success' in the end.

@hydrusnetwork hydrusnetwork added feature-request system:user-interface Looks and actions of the user interface and removed bug labels Jan 25, 2025
@hydrusnetwork
Owner

Oh--I should say, if you want to build a workflow out of this sort of conversion, I recommend you move your URLs rather than copying them. I think that'll preserve the 1-to-1 nature of your file-URL mapping store, even if it doesn't reflect the reality of which files are actually at those URL endpoints.

I did look at some advanced logic that would navigate this situation of n-mapped-urls when the n files are actually duplicates, but it wasn't a trivial problem to solve, and this stuff can get complicated, so I stepped back for KISS reasons. I will likely revisit this seriously when we get to more automated duplicate merging and en masse file conversion tech.

@444man
Author

444man commented Jan 26, 2025

I have read the API documentation in detail again and done some tests.
GET /add_urls/get_url_files returns the status of both the old file and the new file, so I think that is the reason.
POST /add_urls/associate_url saved me. After POST /add_files/clear_file_deletion_record, I send all of the old file's URLs for deletion via POST /add_urls/associate_url, and then the workflow runs as I wish.
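
For anyone wanting to script the same thing, a minimal sketch of that API workflow could look like this (my assumptions: a local client on the default port 45869, a placeholder access key, and the parameter names as I read them in the Client API docs; adjust to your own setup):

```python
# Minimal sketch of the workflow described above. Assumptions: local client on the
# default port 45869, a placeholder access key, and Client API parameter names as
# documented.
import requests

API = 'http://127.0.0.1:45869'
HEADERS = {'Hydrus-Client-API-Access-Key': 'YOUR_ACCESS_KEY'}  # placeholder key

def url_statuses(url):
    # GET /add_urls/get_url_files reports every file the client knows for this URL,
    # which is how both the old and the new file show up.
    r = requests.get(f'{API}/add_urls/get_url_files', params={'url': url}, headers=HEADERS)
    r.raise_for_status()
    return r.json().get('url_file_statuses', [])

def forget_old_file(old_hash, urls):
    # Clear the old file's deletion record...
    requests.post(f'{API}/add_files/clear_file_deletion_record', headers=HEADERS,
                  json={'hashes': [old_hash]}).raise_for_status()
    # ...then strip its URL associations so only the modified file keeps those URLs.
    requests.post(f'{API}/add_urls/associate_url', headers=HEADERS,
                  json={'hashes': [old_hash], 'urls_to_delete': urls}).raise_for_status()
```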

Anyway, thanks for your work. I already manage over 1.2 TB of files with hydrus; the detailed API documentation makes my work easier, and this is definitely the best file manager I have ever used.

@hydrusnetwork
Owner

Great--that sounds good. I am really glad you like my program! Let me know if you run into any more trouble.
