Add Hashing & Release Info Providers [WIP] #1228

Draft · revam wants to merge 126 commits into master from add-release-info-providers

Conversation

@revam (Member) commented Feb 24, 2025

PoC adding hashing & release providers. There are still things left to test, but it's overall in a working state, aside from the still-missing MySQL/MariaDB & MS SQL Server support.

Changes internally:

  • Added a new IVideoHashingService, IHashProvider, IHashDigest, HashDigest to the plugin abstraction. The new IVideoHashingService operates on raw System.IO.FileInfos and is responsible for providing hashes to the HashFileJob before an IVideo & IVideoFile is necessarily assigned to a file location. So far there are runtime checks in place to make sure at least one "ED2K" hasher is enabled at all times, since we still rely on it as our absolute ID (in combination with the video file size) internally, but the hasher doesn't necessarily need to be provided by the new "Core" hasher. It contains events for when an IVideo & IVideoFile has been hashed (and added to the system), and for when providers have been updated. The service can be used to switch between sequential mode and parallel mode (which controls how plugin providers are called), view all available and/or enabled hash types, enable or disable hash types per provider, and re-order the run priority of providers in sequential mode. A rough sketch of the provider abstractions follows this list.

    Note: The priority doesn't affect the parallel mode because every provider is… run in parallel.

  • Added a new "Core" hasher (CoreHashProvider) implementing the "ED2K", "MD5", "CRC32", "SHA1", "SHA256", & "SHA512" hash types, with the "ED2K" enabled by default.

  • Added a new IVideoReleaseService, IReleaseInfoProvider, IReleaseInfo, IReleaseVideoCrossReference, IReleaseMediaInfo, VideoReleaseEventArgs to the plugin abstraction. The new IVideoReleaseService is responsible for everything related to release info, be it managing providers, doing the auto-search across multiple providers, showing provider info, saving release info to the database, and clearing saved release info from the database. It also contains events for when a release has been saved or cleared, when an auto-search has been started/completed, and when providers have been updated. The service can be used to switch between sequential mode (running each provider in a loop in priority order until we find a match or exhaust the list) and parallel mode (running all providers in parallel and selecting the highest-priority valid release), view all providers, enable or disable providers, and re-order the priority of the providers.

  • Added a new "AniDB" release info provider (AnidbReleaseProvider), hooking into the existing AniDB UDP lookup logic. As a side-effect of the change in the lookup process have the MyList support in the existing UDP lookup logic has been stripped out, and we now rely entirely on the IVideoReleaseService and MyList sync job to add new files and/or or pull watched state from the MyList.

  • Added IReadOnlyList<IHashDigest> Hashes, string? SHA256, string? SHA512 to IHashes, to list all hashes stored for an IVideo that may not necessarily be strongly typed, and to expose all hash types supported by the CoreHashProvider ("Core" in the UI) as strongly typed hashes. The existing strongly typed hash properties have been converted to helpers that retrieve the first stored hash of the given type from the list.

  • AniDB_File, AniDB_FileUpdate, AniDB_ReleaseGroup, CrossRef_Languages_AniDB_File, & CrossRef_Subtitles_AniDB_File models/tables/repos are gone, and their functionality has been replaced by the new StoredReleaseInfo & StoredReleaseInfo_MatchAttempt. The AniDB file has also been removed from the abstraction.

  • Video file hashes (except the "ED2K" hash) have been moved to only being stored in the new VideoLocal_HashDigest table; the "ED2K" hash is still stored on the video itself in addition to the new table.

  • Currently I've assigned every existing link as a "manual link", because the user is now able to edit every link we store if they so desire, and this was the simplest way to show all the links in the current Web UI.

  • Added a new plugin to simply export/import release info (Shoko.Plugin.ReleaseExporter). This is both my test case for the plugin system and a small handy provider if you ever need to re-index your collection from scratch and don't want to do the AniDB UDP dance, or if you want to transcode your collection to a newer format at some point and want to preserve the release info in the process.
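
The exact interface shapes aren't spelled out in this conversation; the following is a rough, hypothetical sketch of what the two provider abstractions described above might look like. All member names are assumptions, not the real API (IVideo and IReleaseInfo are the existing abstraction types referenced above):

    using System.Collections.Generic;
    using System.IO;
    using System.Threading;
    using System.Threading.Tasks;

    // Hypothetical sketch only; the real interfaces live in
    // Shoko.Plugin.Abstractions and may differ in names and shape.
    public interface IHashDigest
    {
        string Type { get; }   // hash type, e.g. "ED2K" or "SHA256"
        string Value { get; }  // hex-encoded digest value
    }

    public interface IHashProvider
    {
        string Name { get; }                              // e.g. "Core"
        IReadOnlySet<string> AvailableHashTypes { get; }  // all types this provider can compute

        // Operates on the raw file, before an IVideo/IVideoFile exists.
        Task<IReadOnlyList<IHashDigest>> GetHashesForVideo(
            FileInfo file, ISet<string> enabledTypes, CancellationToken token = default);
    }

    public interface IReleaseInfoProvider
    {
        string Name { get; }  // e.g. "AniDB"

        // Returns a release for the video, or null when the provider has no match.
        Task<IReleaseInfo?> GetReleaseInfoForVideo(IVideo video, CancellationToken token = default);
    }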

Changes in APIv1:

  • All file linking/unlinking in APIv1 has been soft deprecated. Use APIv3 instead. By soft deprecated I mean the client can still make the requests, but will only get an error message back from the server.

  • Release info has been migrated to use the new format, but only for releases provided by the "AniDB" provider.

  • Release groups have been migrated to use the new format, but only for release groups with "AniDB" as a source.

Changes in APIv2:

  • Release info has been migrated to use the new format, but only for releases provided by the "AniDB" provider.

  • Release groups have been migrated to use the new format, but only for release groups with "AniDB" as a source.

Changes in APIv3:

  • Release info has been migrated to a new API model.

  • File.Hashes has been changed from a dict of well-known, nullable hashes to a list of hash digests, where only the ED2K hash is guaranteed to be included in the list. (A sketch of the model change follows this list.)

  • File.AniDB has been replaced with File.Release, which now uses the new release info model, and the includeDataFrom=AniDB query parameter for file endpoints has been adjusted accordingly.

  • Added a new hashing controller (mounted at /api/v3/Hashing for now), to view and edit hashing provider settings, enable or disable hash types per provider, and re-order the run order of providers in sequential mode (note: the order doesn't affect the parallel mode because every provider is… run in parallel).

  • Added a new ReleaseInfoController (mounted at /api/v3/ReleaseInfo), allowing RESTful clients to also interact with the newly added IVideoReleaseService. You can do anything through it that you can do through the service abstraction.

  • File linking in APIv3 has been converted to use the new service, and as a result the artificial limit of not allowing the user to remove AniDB releases is gone. A release is simply a release now.
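
For illustration, the File.Hashes change might look roughly like this on the API-model side. The type and property names here are hypothetical; the "before" shape mirrors the old well-known hash dict seen elsewhere in this PR:

    // Before: a fixed set of well-known, nullable hashes.
    public class HashesModel
    {
        public string? ED2K { get; set; }
        public string? MD5 { get; set; }
        public string? CRC32 { get; set; }
        public string? SHA1 { get; set; }
    }

    // After: an open-ended list of digests; only "ED2K" is guaranteed to be present.
    public record HashDigestModel(string Type, string Value);

    public class FileModel
    {
        public List<HashDigestModel> Hashes { get; set; } = new();
    }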

To-do:

  • Add the missing MySQL/MariaDB & MS SQL Server database migrations.
  • Test that the AniDB provider works as it should.
  • Test that MyList still works as it should.
  • Test out adding a workflow to edit the providers in the Web UI.
  • Fix breakage in the Web UI caused by the removal of the AniDB property on the file model.
  • Fix breakage in Shokofin caused by the removal of the AniDB property on the file model.

@da3dsoul (Member) commented:

This is...a lot, so I'll need to look at it more later. One thing I see first off is the complication and kind of hacky handling of the scheduling and ProcessFile. I would split the jobs if possible, and make it so that it has a flow like so:

Discover
Hash
ProcessFile
    Check the state of the data and what providers can/need to update, prolly via an interface/abstraction
    Schedule the relevant jobs for each provider. AniDB will have one, which can be handled in scheduler via the exclusion types. Ashen might have one. Maybe an NFO one? It's extensible, after all.
Get Provider File Info
    The job mentioned before. It can do the job that Process File did and orchestrate other things. We can make helpers or a base "Get Provider File Info" in the abstractions for providers to extend.
...

The plugin abstractions might need to provide a hook to add Acquisition Filters, jobs, etc
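
A rough sketch of that fan-out, with hypothetical job and type names (not actual Shoko code):

    // Hypothetical: after Discover and Hash have run, ProcessFile only inspects
    // state and fans out one job per provider; the scheduler can then throttle
    // each provider independently (e.g. AniDB UDP via exclusion types).
    public interface IJobQueue
    {
        void Enqueue(object job);
    }

    public record GetProviderFileInfoJob(IReleaseInfoProvider Provider, IVideo Video);

    public class ProcessFileJob
    {
        private readonly IReadOnlyList<IReleaseInfoProvider> _providers;
        private readonly IJobQueue _queue;

        public ProcessFileJob(IReadOnlyList<IReleaseInfoProvider> providers, IJobQueue queue)
            => (_providers, _queue) = (providers, queue);

        public void Run(IVideo video)
        {
            foreach (var provider in _providers)
                _queue.Enqueue(new GetProviderFileInfoJob(provider, video));
        }
    }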

@da3dsoul (Member) commented Feb 24, 2025

@Cazzar can you comment on some of the design? We aren't nitpicking code quality yet.

though if we were, stop making constructors for models. It'll mess up Entity Framework, and I'm going to get rid of them anyway. Use object initializers.
i.e.

public StoredReleaseInfo(IVideo video, IReleaseInfo releaseInfo)
...
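
For contrast, a minimal sketch of the object-initializer style being asked for (property names hypothetical):

    // Keep the model a plain data holder with a parameterless constructor so
    // Entity Framework can materialize it; a service/factory maps the values:
    var release = new StoredReleaseInfo
    {
        // hypothetical property names
        VideoID = video.ID,
        ProviderName = releaseInfo.ProviderName,
    };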

Models should be models. If processing needs to be done, it should be in a service/factory. I don't know if your "embedded" models will work. We will see. I'm not sure how Entity Framework will handle loading of relationships through them.

Stuff like this is fine imo, though:

public IReadOnlyList<IReleaseVideoCrossReference> CrossReferences
{
    get => EmbeddedCrossReferences
        .Split(',')
        .Select(EmbeddedCrossReference.FromString)
        .WhereNotNull()
        .ToList();
    set => EmbeddedCrossReferences = value
        .Select(x => x.ToEmbeddedString())
        .Join(',');
}

{
    get
    {
        var (ed2k, md5, sha1, crc32) = Hashes?.Split('|', StringSplitOptions.TrimEntries) ?? [];
Member:

I'm not sure this will work. We can't guarantee that a plugin provided hash won't contain a pipe. I would store these in a separate table with a string Type, string or blob Hash, and any IDs necessary to link

Member:

Maybe a few columns for "extra data". Kind of like how some hashes require a salt or whatever to reproduce.

Member:

Additionally why don't we provide a FileHash table which is a combination of:

FileID   | int FK | PK
HashType | string | PK
Hash     | String |
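
Under that proposal, a minimal EF Core sketch of such a table might look like the following (hypothetical, assuming the Entity Framework move mentioned above):

    using Microsoft.EntityFrameworkCore;

    // Hypothetical mapping of the proposed table.
    public class FileHash
    {
        public int FileID { get; set; }                       // FK to the file
        public string HashType { get; set; } = string.Empty;  // e.g. "ED2K"
        public string Hash { get; set; } = string.Empty;
    }

    public class ShokoContext : DbContext
    {
        public DbSet<FileHash> FileHashes => Set<FileHash>();

        protected override void OnModelCreating(ModelBuilder modelBuilder)
            // Composite primary key (FileID, HashType), per the proposal above.
            => modelBuilder.Entity<FileHash>().HasKey(x => new { x.FileID, x.HashType });
    }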

Member Author:

I'm not sure this will work. We can't guarantee that a plugin provided hash won't contain a pipe. I would store these in a separate table with a string Type, string or blob Hash, and any IDs necessary to link

I can add some validation logic on the provided hashes in the video release service, e.g. making sure the hashes conform to the type they're supposed to be, making sure they're all upper-cased, etc.
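
For example, a minimal sketch of that validation (the expected digest lengths for the well-known types are assumptions):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical validation: well-known types must be hex of the expected
    // length; unknown (plugin-provided) types just need to be non-empty hex.
    public static class HashValidation
    {
        private static readonly Dictionary<string, int> KnownHexLengths = new()
        {
            ["ED2K"] = 32, ["MD5"] = 32, ["CRC32"] = 8,
            ["SHA1"] = 40, ["SHA256"] = 64, ["SHA512"] = 128,
        };

        public static bool IsValidDigest(string type, string value)
        {
            if (string.IsNullOrEmpty(value) || !value.All(Uri.IsHexDigit))
                return false;
            // Upper-case normalization would happen alongside this check.
            return !KnownHexLengths.TryGetValue(type, out var length) || value.Length == length;
        }
    }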

We can also split the hashes into 4 fields instead of a single shared field. This was just a "hacky" way of storing them all as a single field since they were initially just meant to be extra information a release provider can add, and not meant to be index-able fields.

Maybe a few columns for "extra data". Kind of like how some hashes require a salt or whatever to reproduce.

You mean in the stored release info model, or in the release info interface? In the former, then sure. In the latter, then we can add it, but I strongly suggest that anything we add to the interface is strongly typed and not a Dictionary<string, object> or similar for extra data.

Additionally why don't we provide a FileHash table which is a combination of:

FileID   | int FK | PK
HashType | string | PK
Hash     | String |

For the videos or the release infos? Or both?

Member:

Presumably, the point of abstracting the hashes is so that new hash types could be added, such as perceptual hashes. In those cases, the extra data would be timestamps.
The extra data would be on the hash object. The hash object would replace the hashes everywhere a hash is relevant
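
So, roughly, something like this shape (hypothetical):

    // A classic hash carries no extra data:
    var ed2k = new HashDigest("ED2K", "79ABF5B2F4C2E3A1D0B9C8E7F6A5B4C3");
    // A perceptual hash might record the frame timestamps it sampled:
    var phash = new HashDigest("PHASH", "D1B2A59F", Metadata: "00:01:00,00:02:30");

    // Hypothetical shape: type + value + optional extra data.
    public record HashDigest(string Type, string Value, string? Metadata = null);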

@revam (Member, Author) commented Feb 24, 2025:

So if I'm understanding it right, we add a new hash object (hash type, hash value, extra data) and convert all occurrences of the existing properties exposed via the IHashes interface to instead be a list of these new hash objects (so on the IVideo and on the IReleaseInfo)?

or would it be adding a new list property on IHashes to contain those extra hashes?

@Cazzar (Member) commented Feb 24, 2025

Overall I like the design idea; I haven't looked through it extensively, as the large amount of changes does make things complex.

@@ -119,7 +119,7 @@ public class Episode : BaseModel
 [JsonProperty(NullValueHandling = NullValueHandling.Ignore)]
 public IEnumerable<FileCrossReference.EpisodeCrossReferenceIDs>? CrossReferences { get; set; }

-public Episode(HttpContext context, SVR_AnimeEpisode episode, HashSet<DataSource>? includeDataFrom = null, bool includeFiles = false, bool includeMediaInfo = false, bool includeAbsolutePaths = false, bool withXRefs = false)
+public Episode(HttpContext context, SVR_AnimeEpisode episode, HashSet<DataSource>? includeDataFrom = null, bool includeFiles = false, bool includeMediaInfo = false, bool includeAbsolutePaths = false, bool withXRefs = false, bool includeReleaseInfo = false)
Member:

I think we need to rethink this API model a bit. Something like

new Episode(_httpContext, _episode)
  .IncludeDataFrom(...)
  .WithFiles()
  .WithMediaInfo()
  .AbsolutePaths()
  .XRefs()
  .WithReleaseInfo()

might reduce some of the constructor bloat.
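
The fluent members would just flip a flag and return the instance, e.g. (sketch, with a hypothetical backing field):

    // Hypothetical fluent member on the API model; each With* method
    // mutates a private flag and returns the instance for chaining.
    public Episode WithFiles()
    {
        _includeFiles = true;
        return this;
    }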

Member Author:

It's not a bad idea to clean up the constructor bloat in the API layer, but can we work on it after this PR? I can look into it unless you or somebody else wants to, but I don't want to deal with converting it just now.

{
    var mediaInfo = file.MediaInfo as IMediaInfo;
    ID = file.VideoLocalID;
    Size = file.FileSize;
    IsVariation = file.IsVariation;
-   Hashes = new() { ED2K = file.Hash, MD5 = file.MD5, CRC32 = file.CRC32, SHA1 = file.SHA1 };
+   Hashes = new(file);
Member:

I would just leave this as an instance of the interface, not wrapping it with a POCO.

@revam (Member, Author) commented Feb 24, 2025:

The interface would need to be moved or duplicated to the API layer to have proper model names in Swagger, I think. Currently it's wrapped in the hash dict model used in APIv3, which I also reused in the release info APIv3 model in this PR.

@@ -723,7 +723,7 @@ public class MySQL : BaseDatabase<MySqlConnection>
 new(109, 1, "UPDATE VideoLocal v INNER JOIN CrossRef_File_Episode CRFE on v.Hash = CRFE.Hash SET DateTimeImported = DateTimeCreated;"),
 new(110, 1, "CREATE TABLE `AniDB_FileUpdate` ( `AniDB_FileUpdateID` INT NOT NULL AUTO_INCREMENT, `FileSize` BIGINT NOT NULL, `Hash` varchar(50) NOT NULL, `HasResponse` BIT NOT NULL, `UpdatedAt` datetime NOT NULL, PRIMARY KEY (`AniDB_FileUpdateID`) );"),
 new(110, 2, "ALTER TABLE `AniDB_FileUpdate` ADD INDEX `IX_AniDB_FileUpdate` (`FileSize` ASC, `Hash` ASC) ;"),
-new(110, 3, DatabaseFixes.MigrateAniDB_FileUpdates),
+new(110, 3, DatabaseFixes.NoOperation),
Member:

What's the reason for killing an existing migration?

Member Author:

It is no longer relevant to run, since the data it migrated is later removed and the models the migration used are no longer available.

@revam (Member, Author) commented Feb 24, 2025

This is...a lot, so I'll need to look at it more later. One thing I see first off is the complication and kind of hacky handling of the scheduling and ProcessFile. I would split the jobs if possible, and make it so that it has a flow like so:

Discover
Hash
ProcessFile
    Check the state of the data and what providers can/need to update, prolly via an interface/abstraction
    Schedule the relevant jobs for each provider. AniDB will have one, which can be handled in scheduler via the exclusion types. Ashen might have one. Maybe an NFO one? It's extensible, after all.
Get Provider File Info
    The job mentioned before. It can do the job that Process File did and orchestrate other things. We can make helpers or a base "Get Provider File Info" in the abstractions for providers to extend.
...

The plugin abstractions might need to provide a hook to add Acquisition Filters, jobs, etc

The current service can be run inside or outside the queue/job system, and in the current PoC the release providers are processed in a user-configurable order until a release is found. Only a single release can be assigned to the same video at any given time, so scheduling "relevant jobs for each provider" won't ever happen in parallel. In short, the logic to select a release happens strictly inside the service, and the process-file job is now just asking the service to do its thing while running in the queue/job system in the background. There are also other ways to interact with the service, be it from other plugins through the abstraction, or from RESTful clients through the new endpoints.

I do admit that the way I modified the AniDB banned acquisition filter is kind of hacky, but also correct: it now blocks the process-file jobs when AniDB-banned only if the AniDB release provider is enabled, since that provider needs to be able to use the AniDB UDP API. The flip side is that as long as the AniDB provider isn't enabled, we don't need to block the process-file jobs at all, since nothing is using the AniDB UDP API to find releases.

@da3dsoul (Member) commented:

That is a reasonable argument, but I would allow multiple so that you can cross reference them. AniDB is more likely correct than perceptual hashing, even though perceptual hashing is pretty accurate. We could even have a filename plugin with very low accuracy. Maybe add an enum for how trustworthy we expect a provider to be in those cases.
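
That enum could be as simple as (hypothetical):

    // Hypothetical trust levels for weighing conflicting provider results.
    public enum ProviderTrust
    {
        Low,     // e.g. filename-based matching
        Medium,  // e.g. perceptual hashing
        High,    // e.g. an exact AniDB ED2K + file size match
    }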

@revam (Member, Author) commented Feb 24, 2025

We kinda already have a user-configurable "priority" to use. Can we add two modes:

  • one mode to run in a sync. loop in the priority order until a release is found (AKA the current way), and
  • one mode to run multiple providers in parallel, await all the responses, and then pick the highest priority release of the available candidates?

I can add the new setting to the service and modify the description of the auto-finding method and endpoints to reflect the new behavior. The reason I would opt to have both modes is to let the user choose how they want to do it. By default we will only have one provider included (…unless…), so the default mode can be whatever.

My particular flow would require the finding to happen in sequential order; it would first check the "nfo" (quotes intentional) file before asking any remote services (or a potential fallback local/offline provider). But I know some would maybe want to do it in a parallel fashion as you described with the p-hash and AniDB providers, so I'll say "let them pick their own poison to swallow."
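
A minimal sketch of the two modes being agreed on here (names hypothetical):

    using System.Collections.Generic;
    using System.Linq;
    using System.Threading.Tasks;

    public static class ReleaseAutoMatch
    {
        // Hypothetical sketch; providers are assumed sorted by user-configured
        // priority (index 0 = highest).
        public static async Task<IReleaseInfo?> FindRelease(
            IVideo video, IReadOnlyList<IReleaseInfoProvider> byPriority, bool parallel)
        {
            if (!parallel)
            {
                // Sequential mode: first match wins, in priority order.
                foreach (var provider in byPriority)
                    if (await provider.GetReleaseInfoForVideo(video) is { } release)
                        return release;
                return null;
            }

            // Parallel mode: run all providers, then pick the match from the
            // highest-priority provider (Task.WhenAll preserves input order).
            var results = await Task.WhenAll(byPriority.Select(p => p.GetReleaseInfoForVideo(video)));
            return results.FirstOrDefault(r => r is not null);
        }
    }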

@revam revam force-pushed the add-release-info-providers branch from 78f8f2d to c698ee9 on February 24, 2025 22:01
@revam revam changed the title from "Add Release Info Providers [WIP]" to "Add Hashing & Release Info Providers [WIP]" on Feb 27, 2025
@revam revam force-pushed the add-release-info-providers branch 2 times, most recently from fd77c1d to 06b0dc6 on March 8, 2025 21:47
revam added 6 commits March 11, 2025 00:24
- Renamed the `IImportFolder` to `IManagedFolder` in the plugin abstraction.

- Renamed all fields in the plugin abstraction referring to an import folder to instead refer to a managed folder (e.g. `IVideoFile.ImportFolderID` → `IVideoFile.ManagedFolderID`, etc.). Since this changed _everything_ in the abstraction, renamers will have to be rebuilt against the new abstraction using the new fields.

- Renamed `ImportFolder` to `ManagedFolder` in the RESTful APIv3, including the model itself, all endpoints referring to import folders, and all fields referring to import folders, to now refer to managed folders instead.

- Updated the `IVideoService` to have added/updated/removed events for managed folders, and added a new method to get a single managed folder by ID if needed.

- Updated the `IVideoService` to fix up the methods to get one or more video files by hash. Removed the now-defunct `HashAlgorithmName`, replaced by a `string`, since we now support more than the previously well-known hash algorithms.

- And more.
revam added 2 commits March 11, 2025 09:30
If you really need to know when the server settings were changed, use the configuration service to check.
@revam revam force-pushed the add-release-info-providers branch from 3621d08 to 2f09814 on March 11, 2025 08:30
revam added 2 commits March 11, 2025 09:32
- Added new functionality to the `IAniDBService`. You can now subscribe to the banned and AVDump events in the service, schedule or run a refresh on AniDB (to update existing or add new anime), check the AVDump installed status, and run AVDump in the foreground or background on videos/files.

- The flow for getting anime through the HTTP API has changed slightly as a result of the above; we can now command a refresh to ignore the banned status or the time check, among other things, like more cleanly disabling/enabling remote or cache lookups.
@revam revam force-pushed the add-release-info-providers branch from 2f09814 to b0b1338 on March 11, 2025 08:32