Add Hashing & Release Info Providers [WIP] #1228

Draft · revam wants to merge 126 commits into master from add-release-info-providers

Conversation

@revam (Member) commented Feb 24, 2025

PoC adding hashing & release providers. There are still things left to test, but it's overall in a working state, aside from the still-missing MySQL/MariaDB & MS SQL Server support.

Changes internally:

  • Added a new IVideoHashingService, IHashProvider, IHashDigest, HashDigest to the plugin abstraction. The new IVideoHashingService operates on raw System.IO.FileInfos and is responsible for providing hashes to the HashFileJob before an IVideo & IVideoFile is necessarily assigned to a file location. So far there are runtime checks in place to make sure at least one "ED2K" hasher is enabled at all times, since we still rely on it as our absolute ID (in combination with the video file size) internally, but the hasher doesn't necessarily need to be provided by the new "Core" hasher. It contains events for when an IVideo & IVideoFile has been hashed (and added to the system), and for when providers have been updated. The service can be used to switch between sequential mode and parallel mode (which controls how plugin providers are called), view all available and/or enabled hash types, enable or disable hash types per provider, and re-order the run priority of providers in sequential mode. A rough sketch of the provider abstractions follows this list.

    Note: The priority doesn't affect the parallel mode because every provider is… run in parallel.

  • Added a new "Core" hasher (CoreHashProvider) implementing the "ED2K", "MD5", "CRC32", "SHA1", "SHA256", & "SHA512" hash types, with the "ED2K" enabled by default.

  • Added a new IVideoReleaseService, IReleaseInfoProvider, IReleaseInfo, IReleaseVideoCrossReference, IReleaseMediaInfo, VideoReleaseEventArgs to the plugin abstraction. The new IVideoReleaseService is responsible for everything related to release info, be it managing providers, doing the auto-search across multiple providers, showing provider info, saving release info to the database, and clearing saved release info from the database. It also contains events for when a release has been saved or cleared, when an auto-search has been started/completed, and when providers have been updated. The service can be used to switch between sequential mode (running each provider in a loop in priority order until we find a match or exhaust the list) and parallel mode (running all providers in parallel and selecting the highest-priority valid release), view all providers, enable or disable providers, and re-order the priority of the providers.

  • Added a new "AniDB" release info provider (AnidbReleaseProvider), hooking into the existing AniDB UDP lookup logic. As a side-effect of the change in the lookup process have the MyList support in the existing UDP lookup logic has been stripped out, and we now rely entirely on the IVideoReleaseService and MyList sync job to add new files and/or or pull watched state from the MyList.

  • Added IReadOnlyList<IHashDigest> Hashes, string? SHA256, string? SHA512 to IHashes, to list all hashes stored for an IVideo that may not necessarily be strongly typed, and to expose all hash types supported by the CoreHashProvider ("Core" in the UI) as strongly typed hashes. The existing strongly typed hash properties have been converted to helpers that retrieve the first stored hash of the given type from the list.

  • AniDB_File, AniDB_FileUpdate, AniDB_ReleaseGroup, CrossRef_Languages_AniDB_File, & CrossRef_Subtitles_AniDB_File models/tables/repos are gone, and their functionality has been replaced by the new StoredReleaseInfo & StoredReleaseInfo_MatchAttempt. The AniDB file has also been removed from the abstraction.

  • Video file hashes (except the "ED2K" hash) have been moved to only being stored in the new VideoLocal_HashDigest table; the "ED2K" hash is still stored on the video itself in addition to the new table.

  • Currently I've assigned every existing link as a "manual link", because the user is now able to edit every link we store if they so desire, and this was the simplest way to show all the links in the current Web UI.

  • Added a new plugin to simply export/import release info (Shoko.Plugin.ReleaseExporter). This is both my test case for the plugin system and a small handy provider if you ever need to re-index your collection from scratch and don't want to do the AniDB UDP dance, or if you want to transcode your collection to a newer format at some point and want to preserve the release info in the process.
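
The exact interface shapes aren't spelled out in this conversation; the following is a rough, hypothetical sketch of what the two provider abstractions described above might look like. All member names are assumptions, not the real API (IVideo and IReleaseInfo are the existing abstraction types referenced above):

    using System.Collections.Generic;
    using System.IO;
    using System.Threading;
    using System.Threading.Tasks;

    // Hypothetical sketch only; the real interfaces live in
    // Shoko.Plugin.Abstractions and may differ in names and shape.
    public interface IHashDigest
    {
        string Type { get; }   // hash type, e.g. "ED2K" or "SHA256"
        string Value { get; }  // hex-encoded digest value
    }

    public interface IHashProvider
    {
        string Name { get; }                              // e.g. "Core"
        IReadOnlySet<string> AvailableHashTypes { get; }  // all types this provider can compute

        // Operates on the raw file, before an IVideo/IVideoFile exists.
        Task<IReadOnlyList<IHashDigest>> GetHashesForVideo(
            FileInfo file, ISet<string> enabledTypes, CancellationToken token = default);
    }

    public interface IReleaseInfoProvider
    {
        string Name { get; }  // e.g. "AniDB"

        // Returns a release for the video, or null when the provider has no match.
        Task<IReleaseInfo?> GetReleaseInfoForVideo(IVideo video, CancellationToken token = default);
    }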

Changes in APIv1:

  • All file linking/unlinking in APIv1 has been soft deprecated. Use APIv3 instead. By soft deprecated I mean the client can still make the requests, but will only get an error message back from the server.

  • Release info has been migrated to use the new format, but only for releases provided by the "AniDB" provider.

  • Release groups have been migrated to use the new format, but only for release groups with "AniDB" as a source.

Changes in APIv2:

  • Release info has been migrated to use the new format, but only for releases provided by the "AniDB" provider.

  • Release groups have been migrated to use the new format, but only for release groups with "AniDB" as a source.

Changes in APIv3:

  • Release info has been migrated to a new API model.

  • File.Hashes has been changed from a dict of well-known, nullable hashes to a list of hash digests, where only the ED2K hash is guaranteed to be included in the list. (A sketch of the model change follows this list.)

  • File.AniDB has been replaced with File.Release, which now uses the new release info model, and the includeDataFrom=AniDB query parameter for file endpoints has been adjusted accordingly.

  • Added a new hashing controller (mounted at /api/v3/Hashing for now), to view and edit hashing provider settings, enable or disable hash types per provider, and re-order the run order of providers in sequential mode (note: the order doesn't affect the parallel mode because every provider is… run in parallel).

  • Added a new ReleaseInfoController (mounted at /api/v3/ReleaseInfo), allowing RESTful clients to also interact with the newly added IVideoReleaseService. You can do anything through it that you can do through the service abstraction.

  • File linking in APIv3 has been converted to use the new service, and as a result the artificial limit of not allowing the user to remove AniDB releases is gone. A release is simply a release now.
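
For illustration, the File.Hashes change might look roughly like this on the API-model side. The type and property names here are hypothetical; the "before" shape mirrors the old well-known hash dict seen elsewhere in this PR:

    // Before: a fixed set of well-known, nullable hashes.
    public class HashesModel
    {
        public string? ED2K { get; set; }
        public string? MD5 { get; set; }
        public string? CRC32 { get; set; }
        public string? SHA1 { get; set; }
    }

    // After: an open-ended list of digests; only "ED2K" is guaranteed to be present.
    public record HashDigestModel(string Type, string Value);

    public class FileModel
    {
        public List<HashDigestModel> Hashes { get; set; } = new();
    }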

To-do:

  • Add the missing MySQL/MariaDB & MS SQL Server database migrations.
  • Test that the AniDB provider works as it should.
  • Test that MyList still works as it should.
  • Test out adding a workflow to edit the providers in the Web UI.
  • Fix breakage in the Web UI caused by the removal of the AniDB property on the file model.
  • Fix breakage in Shokofin caused by the removal of the AniDB property on the file model.

@da3dsoul (Member) commented:

This is...a lot, so I'll need to look at it more later. One thing I see first off is the complication and kind of hacky handling of the scheduling and ProcessFile. I would split the jobs if possible, and make it so that it has a flow like so:

Discover
Hash
ProcessFile
    Check the state of the data and what providers can/need to update, prolly via an interface/abstraction
    Schedule the relevant jobs for each provider. AniDB will have one, which can be handled in scheduler via the exclusion types. Ashen might have one. Maybe an NFO one? It's extensible, after all.
Get Provider File Info
    The job mentioned before. It can do the job that Process File did and orchestrate other things. We can make helpers or a base "Get Provider File Info" in the abstractions for providers to extend.
...

The plugin abstractions might need to provide a hook to add Acquisition Filters, jobs, etc
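
A rough sketch of that fan-out, with hypothetical job and type names (not actual Shoko code):

    // Hypothetical: after Discover and Hash have run, ProcessFile only inspects
    // state and fans out one job per provider; the scheduler can then throttle
    // each provider independently (e.g. AniDB UDP via exclusion types).
    public interface IJobQueue
    {
        void Enqueue(object job);
    }

    public record GetProviderFileInfoJob(IReleaseInfoProvider Provider, IVideo Video);

    public class ProcessFileJob
    {
        private readonly IReadOnlyList<IReleaseInfoProvider> _providers;
        private readonly IJobQueue _queue;

        public ProcessFileJob(IReadOnlyList<IReleaseInfoProvider> providers, IJobQueue queue)
            => (_providers, _queue) = (providers, queue);

        public void Run(IVideo video)
        {
            foreach (var provider in _providers)
                _queue.Enqueue(new GetProviderFileInfoJob(provider, video));
        }
    }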

@da3dsoul (Member) commented Feb 24, 2025

@Cazzar can you comment on some of the design? We aren't nitpicking code quality yet.

though if we were, stop making constructors for models. It'll mess up Entity Framework, and I'm going to get rid of them anyway. Use object initializers.
i.e.

public StoredReleaseInfo(IVideo video, IReleaseInfo releaseInfo)
...
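
For contrast, a minimal sketch of the object-initializer style being asked for (property names hypothetical):

    // Keep the model a plain data holder with a parameterless constructor so
    // Entity Framework can materialize it; a service/factory maps the values:
    var release = new StoredReleaseInfo
    {
        // hypothetical property names
        VideoID = video.ID,
        ProviderName = releaseInfo.ProviderName,
    };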

Models should be models. If processing needs to be done, it should be in a service/factory. I don't know if your "embedded" models will work. We will see. I'm not sure how Entity Framework will handle loading of relationships through them.

Stuff like this is fine imo, though:

public IReadOnlyList<IReleaseVideoCrossReference> CrossReferences
{
    get => EmbeddedCrossReferences
        .Split(',')
        .Select(EmbeddedCrossReference.FromString)
        .WhereNotNull()
        .ToList();
    set => EmbeddedCrossReferences = value
        .Select(x => x.ToEmbeddedString())
        .Join(',');
}

{
    get
    {
        var (ed2k, md5, sha1, crc32) = Hashes?.Split('|', StringSplitOptions.TrimEntries) ?? [];
Member:

I'm not sure this will work. We can't guarantee that a plugin provided hash won't contain a pipe. I would store these in a separate table with a string Type, string or blob Hash, and any IDs necessary to link

Member:

Maybe a few columns for "extra data". Kind of like how some hashes require a salt or whatever to reproduce.

Member:

Additionally why don't we provide a FileHash table which is a combination of:

FileID   | int FK | PK
HashType | string | PK
Hash     | String |
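
Under that proposal, a minimal EF Core sketch of such a table might look like the following (hypothetical, assuming the Entity Framework move mentioned above):

    using Microsoft.EntityFrameworkCore;

    // Hypothetical mapping of the proposed table.
    public class FileHash
    {
        public int FileID { get; set; }                       // FK to the file
        public string HashType { get; set; } = string.Empty;  // e.g. "ED2K"
        public string Hash { get; set; } = string.Empty;
    }

    public class ShokoContext : DbContext
    {
        public DbSet<FileHash> FileHashes => Set<FileHash>();

        protected override void OnModelCreating(ModelBuilder modelBuilder)
            // Composite primary key (FileID, HashType), per the proposal above.
            => modelBuilder.Entity<FileHash>().HasKey(x => new { x.FileID, x.HashType });
    }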

Member Author:

I'm not sure this will work. We can't guarantee that a plugin provided hash won't contain a pipe. I would store these in a separate table with a string Type, string or blob Hash, and any IDs necessary to link

I can add some validation logic on the provided hashes in the video release service, e.g. making sure the hashes conform to the type they're supposed to be, making sure they're all upper-cased, etc.
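
For example, a minimal sketch of that validation (the expected digest lengths for the well-known types are assumptions):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical validation: well-known types must be hex of the expected
    // length; unknown (plugin-provided) types just need to be non-empty hex.
    public static class HashValidation
    {
        private static readonly Dictionary<string, int> KnownHexLengths = new()
        {
            ["ED2K"] = 32, ["MD5"] = 32, ["CRC32"] = 8,
            ["SHA1"] = 40, ["SHA256"] = 64, ["SHA512"] = 128,
        };

        public static bool IsValidDigest(string type, string value)
        {
            if (string.IsNullOrEmpty(value) || !value.All(Uri.IsHexDigit))
                return false;
            // Upper-case normalization would happen alongside this check.
            return !KnownHexLengths.TryGetValue(type, out var length) || value.Length == length;
        }
    }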

We can also split the hashes into 4 fields instead of a single shared field. This was just a "hacky" way of storing them all as a single field since they were initially just meant to be extra information a release provider can add, and not meant to be index-able fields.

Maybe a few columns for "extra data". Kind of like how some hashes require a salt or whatever to reproduce.

You mean in the stored release info model, or in the release info interface? In the former, then sure. In the latter, then we can add it, but I strongly suggest that anything we add to the interface is strongly typed and not a Dictionary<string, object> or similar for extra data.

Additionally why don't we provide a FileHash table which is a combination of:

FileID   | int FK | PK
HashType | string | PK
Hash     | String |

For the videos or the release infos? Or both?

Member:

Presumably, the point of abstracting the hashes is so that new hash types could be added, such as perceptual hashes. In those cases, the extra data would be timestamps.
The extra data would be on the hash object. The hash object would replace the hashes everywhere a hash is relevant
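
So, roughly, something like this shape (hypothetical):

    // A classic hash carries no extra data:
    var ed2k = new HashDigest("ED2K", "79ABF5B2F4C2E3A1D0B9C8E7F6A5B4C3");
    // A perceptual hash might record the frame timestamps it sampled:
    var phash = new HashDigest("PHASH", "D1B2A59F", Metadata: "00:01:00,00:02:30");

    // Hypothetical shape: type + value + optional extra data.
    public record HashDigest(string Type, string Value, string? Metadata = null);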

@revam (Member, Author) commented Feb 24, 2025:

So if I'm understanding it right, we add a new hash object (hash type, hash value, extra data) and convert all occurrences of the existing properties exposed via the IHashes interface to instead be a list of these new hash objects (so on the IVideo and on the IReleaseInfo)?

or would it be adding a new list property on IHashes to contain those extra hashes?

@Cazzar (Member) commented Feb 24, 2025

Overall I like the design idea; I haven't looked through it extensively, as the large amount of changes does make things complex.

@@ -119,7 +119,7 @@ public class Episode : BaseModel
 [JsonProperty(NullValueHandling = NullValueHandling.Ignore)]
 public IEnumerable<FileCrossReference.EpisodeCrossReferenceIDs>? CrossReferences { get; set; }

-public Episode(HttpContext context, SVR_AnimeEpisode episode, HashSet<DataSource>? includeDataFrom = null, bool includeFiles = false, bool includeMediaInfo = false, bool includeAbsolutePaths = false, bool withXRefs = false)
+public Episode(HttpContext context, SVR_AnimeEpisode episode, HashSet<DataSource>? includeDataFrom = null, bool includeFiles = false, bool includeMediaInfo = false, bool includeAbsolutePaths = false, bool withXRefs = false, bool includeReleaseInfo = false)
Member:

I think we need to rethink this API model a bit. Something like

new Episode(_httpContext, _episode)
  .IncludeDataFrom(...)
  .WithFiles()
  .WithMediaInfo()
  .AbsolutePaths()
  .XRefs()
  .WithReleaseInfo()

might reduce some of the constructor bloat.
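
The fluent members would just flip a flag and return the instance, e.g. (sketch, with a hypothetical backing field):

    // Hypothetical fluent member on the API model; each With* method
    // mutates a private flag and returns the instance for chaining.
    public Episode WithFiles()
    {
        _includeFiles = true;
        return this;
    }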

Member Author:

It's not a bad idea to clean up the constructor bloat in the API layer, but can we work on it after this PR? I can look into it unless you or somebody else wants to, but I don't want to deal with converting it just now.

{
    var mediaInfo = file.MediaInfo as IMediaInfo;
    ID = file.VideoLocalID;
    Size = file.FileSize;
    IsVariation = file.IsVariation;
-   Hashes = new() { ED2K = file.Hash, MD5 = file.MD5, CRC32 = file.CRC32, SHA1 = file.SHA1 };
+   Hashes = new(file);
Member:

I would just leave this as an instance of the interface, not wrapping it with a POCO.

@revam (Member, Author) commented Feb 24, 2025:

The interface would need to be moved or duplicated to the API layer to have proper model names in Swagger, I think. Currently it's wrapped in the hash dict model used in APIv3, which I also reused in the release info APIv3 model in this PR.

@@ -723,7 +723,7 @@ public class MySQL : BaseDatabase<MySqlConnection>
 new(109, 1, "UPDATE VideoLocal v INNER JOIN CrossRef_File_Episode CRFE on v.Hash = CRFE.Hash SET DateTimeImported = DateTimeCreated;"),
 new(110, 1, "CREATE TABLE `AniDB_FileUpdate` ( `AniDB_FileUpdateID` INT NOT NULL AUTO_INCREMENT, `FileSize` BIGINT NOT NULL, `Hash` varchar(50) NOT NULL, `HasResponse` BIT NOT NULL, `UpdatedAt` datetime NOT NULL, PRIMARY KEY (`AniDB_FileUpdateID`) );"),
 new(110, 2, "ALTER TABLE `AniDB_FileUpdate` ADD INDEX `IX_AniDB_FileUpdate` (`FileSize` ASC, `Hash` ASC) ;"),
-new(110, 3, DatabaseFixes.MigrateAniDB_FileUpdates),
+new(110, 3, DatabaseFixes.NoOperation),
Member:

What's the reason for killing an existing migration?

Member Author:

It is no longer relevant to run, since the data it migrated is later removed and the models the migration used are no longer available.

@revam (Member, Author) commented Feb 24, 2025

This is...a lot, so I'll need to look at it more later. One thing I see first off is the complication and kind of hacky handling of the scheduling and ProcessFile. I would split the jobs if possible, and make it so that it has a flow like so:

Discover
Hash
ProcessFile
    Check the state of the data and what providers can/need to update, prolly via an interface/abstraction
    Schedule the relevant jobs for each provider. AniDB will have one, which can be handled in scheduler via the exclusion types. Ashen might have one. Maybe an NFO one? It's extensible, after all.
Get Provider File Info
    The job mentioned before. It can do the job that Process File did and orchestrate other things. We can make helpers or a base "Get Provider File Info" in the abstractions for providers to extend.
...

The plugin abstractions might need to provide a hook to add Acquisition Filters, jobs, etc

The current service can be run inside or outside the queue/job system, and in the current PoC the release providers are processed in a user-configurable order until a release is found. Only a single release can be assigned to the same video at any given time, so scheduling "relevant jobs for each provider" won't ever happen in parallel. In short, the logic to select a release happens strictly inside the service, and the process-file job is now just asking the service to do its thing while running in the queue/job system in the background. There are also other ways to interact with the service, be it from other plugins through the abstraction, or from RESTful clients through the new endpoints.

I do admit that the way I modified the AniDB banned acquisition filter is kind of hacky, but also correct: it now blocks the process-file jobs when AniDB-banned only if the AniDB release provider is enabled, since that provider needs to be able to use the AniDB UDP API. The flip side is that as long as the AniDB provider isn't enabled, we don't need to block the process-file jobs at all, since nothing is using the AniDB UDP API to find releases.

@da3dsoul (Member) commented:

That is a reasonable argument, but I would allow multiple so that you can cross reference them. AniDB is more likely correct than perceptual hashing, even though perceptual hashing is pretty accurate. We could even have a filename plugin with very low accuracy. Maybe add an enum for how trustworthy we expect a provider to be in those cases.
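
That enum could be as simple as (hypothetical):

    // Hypothetical trust levels for weighing conflicting provider results.
    public enum ProviderTrust
    {
        Low,     // e.g. filename-based matching
        Medium,  // e.g. perceptual hashing
        High,    // e.g. an exact AniDB ED2K + file size match
    }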

@revam (Member, Author) commented Feb 24, 2025

We kinda already have a user-configurable "priority" to use. Can we add two modes:

  • one mode to run in a sync. loop in the priority order until a release is found (AKA the current way), and
  • one mode to run multiple providers in parallel, await all the responses, and then pick the highest priority release of the available candidates?

I can add the new setting to the service and modify the description of the auto-finding method and endpoints to reflect the new behavior. The reason I would opt to have both modes is to let the user choose how they want to do it. By default we will only have one provider included (…unless…), so the default mode can be whatever.

My particular flow would require the finding to happen in sequential order; it would first check the "nfo" (quotes intentional) file before asking any remote services (or a potential fallback local/offline provider). But I know some would maybe want to do it in a parallel fashion as you described with the p-hash and AniDB providers, so I'll say "let them pick their own poison to swallow."
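
A minimal sketch of the two modes being agreed on here (names hypothetical):

    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;

    public static class ReleaseAutoMatch
    {
        // Hypothetical sketch; providers are assumed sorted by user-configured
        // priority (index 0 = highest).
        public static async Task<IReleaseInfo?> FindRelease(
            IVideo video, IReadOnlyList<IReleaseInfoProvider> byPriority, bool parallel)
        {
            if (!parallel)
            {
                // Sequential mode: first match wins, in priority order.
                foreach (var provider in byPriority)
                    if (await provider.GetReleaseInfoForVideo(video) is { } release)
                        return release;
                return null;
            }

            // Parallel mode: run all providers, then pick the match from the
            // highest-priority provider (Task.WhenAll preserves input order).
            var results = await Task.WhenAll(byPriority.Select(p => p.GetReleaseInfoForVideo(video)));
            return results.FirstOrDefault(r => r is not null);
        }
    }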

@revam revam force-pushed the add-release-info-providers branch from 78f8f2d to c698ee9 on February 24, 2025 22:01
@revam revam changed the title from "Add Release Info Providers [WIP]" to "Add Hashing & Release Info Providers [WIP]" on Feb 27, 2025
@revam revam force-pushed the add-release-info-providers branch 2 times, most recently from fd77c1d to 06b0dc6 on March 8, 2025 21:47
revam added 6 commits March 11, 2025 00:24
- Renamed the `IImportFolder` to `IManagedFolder` in the plugin abstraction.

- Renamed all fields in the plugin abstraction referring to an import folder to instead refer to a managed folder (e.g. `IVideoFile.ImportFolderID` → `IVideoFile.ManagedFolderID`, etc.). Since this changed _everything_ in the abstraction, renamers will have to be rebuilt against the new abstraction using the new fields.

- Renamed `ImportFolder` to `ManagedFolder` in the RESTful APIv3, including the model itself, all endpoints referring to import folders, and all fields referring to import folders, to now refer to managed folders instead.

- Updated the `IVideoService` to have added/updated/removed events for managed folders, and added a new method to get a single managed folder by ID if needed.

- Updated the `IVideoService` to fix up the methods to get one or more video files by hash. Removed the now-defunct `HashAlgorithmName`, replaced by a `string`, since we now support more than the previously well-known hash algorithms.

- And more.
revam added 2 commits March 11, 2025 09:30
If you really need to know when the server settings were changed, use the configuration service to check.
@revam revam force-pushed the add-release-info-providers branch from 3621d08 to 2f09814 on March 11, 2025 08:30
revam added 2 commits March 11, 2025 09:32
- Added new functionality to the `IAniDBService`. You can now subscribe to the banned and AVDump events in the service, schedule or run a refresh on AniDB (to update existing or add new anime), check the AVDump installed status, and run AVDump in the foreground or background on videos/files.

- The flow for getting anime through the HTTP API has changed slightly as a result of the above; we can now command a refresh to ignore the banned status or the time check, among other things, like more cleanly disabling/enabling remote or cache lookups.
@revam revam force-pushed the add-release-info-providers branch from 2f09814 to b0b1338 on March 11, 2025 08:32