Add Hashing & Release Info Providers [WIP] #1228
base: master
Conversation
This is... a lot, so I'll need to look at it more later. One thing I see first off is the complicated and somewhat hacky handling of the scheduling and ProcessFile. I would split the jobs if possible, and make it so that it has a flow like so:

The plugin abstractions might need to provide a hook to add Acquisition Filters, jobs, etc.
@Cazzar can you comment on some of the design? We aren't nitpicking code quality yet. Though if we were: stop making constructors for models. It'll mess up Entity Framework, and I'm going to get rid of them anyway. Use object initializers.

```csharp
public StoredReleaseInfo(IVideo video, IReleaseInfo releaseInfo)
...
```

Models should be models. If processing needs to be done, it should be in a service/factory. I don't know if your "embedded" models will work. We will see. I'm not sure how Entity Framework will handle loading of relationships through them. Stuff like this is fine imo, though:

```csharp
public IReadOnlyList<IReleaseVideoCrossReference> CrossReferences
{
    get => EmbeddedCrossReferences
        .Split(',')
        .Select(EmbeddedCrossReference.FromString)
        .WhereNotNull()
        .ToList();
    set => EmbeddedCrossReferences = value
        .Select(x => x.ToEmbeddedString())
        .Join(',');
}
```
```csharp
{
    get
    {
        var (ed2k, md5, sha1, crc32) = Hashes?.Split('|', StringSplitOptions.TrimEntries) ?? [];
```
I'm not sure this will work. We can't guarantee that a plugin-provided hash won't contain a pipe. I would store these in a separate table with a string Type, a string or blob Hash, and any IDs necessary to link.
Maybe a few columns for "extra data". Kind of like how some hashes require a salt or whatever to reproduce.
Additionally, why don't we provide a FileHash table which is a combination of:

| Column   | Type   | Key |
|----------|--------|-----|
| FileID   | int FK | PK  |
| HashType | string | PK  |
| Hash     | string |     |
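A minimal sketch of how the suggested table could look as an Entity Framework entity. The column names follow the suggestion above; everything else (the class shape, the key configuration) is illustrative, since Shoko's actual mapping may differ:

```csharp
// Illustrative sketch of the suggested FileHash table. The composite primary
// key would be configured in the EF model builder, e.g.:
// modelBuilder.Entity<FileHash>().HasKey(e => new { e.FileID, e.HashType });
public class FileHash
{
    public int FileID { get; set; }       // FK to the file/video row; part of the PK
    public string HashType { get; set; }  // e.g. "ED2K", "SHA1", or a plugin-defined type
    public string Hash { get; set; }      // the digest itself; no delimiter constraints needed
}
```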
> I'm not sure this will work. We can't guarantee that a plugin-provided hash won't contain a pipe. I would store these in a separate table with a string Type, a string or blob Hash, and any IDs necessary to link.
I can add some validation logic for the provided hashes in the video release service, e.g. making sure the hashes conform to the type they're supposed to be, making sure they're all upper-cased, etc.

We can also split the hashes into 4 fields instead of a single shared field. This was just a "hacky" way of storing them all as a single field, since they were initially just meant to be extra information a release provider can add, and not meant to be indexable fields.
> Maybe a few columns for "extra data". Kind of like how some hashes require a salt or whatever to reproduce.
You mean in the stored release info model, or in the release info interface? If the former, then sure. If the latter, then we can add it, but I strongly suggest that anything we add to the interface is strongly typed and not a `Dictionary<string, object>` or similar for extra data.
> Additionally, why don't we provide a FileHash table which is a combination of:
>
> | Column   | Type   | Key |
> |----------|--------|-----|
> | FileID   | int FK | PK  |
> | HashType | string | PK  |
> | Hash     | string |     |
For the videos or the release infos? Or both?
Presumably, the point of abstracting the hashes is so that new hash types could be added, such as perceptual hashes. In those cases, the extra data would be timestamps.

The extra data would be on the hash object. The hash object would replace the hashes everywhere a hash is relevant.
So if I'm understanding it right, we add a new hash object (hash type, hash value, extra data) and convert all occurrences of the existing properties exposed through the `IHashes` interface to instead be a list of these new hash objects (so on the `IVideo` and on the `IReleaseInfo`)? Or would it be adding a new list property on `IHashes` to contain those extra hashes? A sketch of such a hash object follows below.
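A minimal sketch of the hash object being discussed. `IHashDigest` is the name this PR introduces, but the member names here are assumptions for illustration:

```csharp
// Illustrative sketch only; member names are assumptions, not the PR's final API.
public interface IHashDigest
{
    // The hash algorithm, e.g. "ED2K", "SHA256", or a plugin-defined perceptual hash type.
    string Type { get; }

    // The digest value, e.g. hex-encoded.
    string Value { get; }

    // Optional extra data needed to reproduce or interpret the hash,
    // e.g. a salt, or timestamps for a perceptual hash.
    string? Metadata { get; }
}
```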
Overall I like the idea of the design; I haven't looked through it extensively, as the large amount of changes does make things complex.
```diff
@@ -119,7 +119,7 @@ public class Episode : BaseModel
     [JsonProperty(NullValueHandling = NullValueHandling.Ignore)]
     public IEnumerable<FileCrossReference.EpisodeCrossReferenceIDs>? CrossReferences { get; set; }

-    public Episode(HttpContext context, SVR_AnimeEpisode episode, HashSet<DataSource>? includeDataFrom = null, bool includeFiles = false, bool includeMediaInfo = false, bool includeAbsolutePaths = false, bool withXRefs = false)
+    public Episode(HttpContext context, SVR_AnimeEpisode episode, HashSet<DataSource>? includeDataFrom = null, bool includeFiles = false, bool includeMediaInfo = false, bool includeAbsolutePaths = false, bool withXRefs = false, bool includeReleaseInfo = false)
```
I think we need to rethink this API model a bit better. Something like

```csharp
new Episode(_httpContext, _episode)
    .IncludeDataFrom(...)
    .WithFiles()
    .WithMediaInfo()
    .AbsolutePaths()
    .XRefs()
    .WithReleaseInfo()
```

might reduce some of the constructor bloat.
It's not a bad idea to clean up the constructor bloat in the API layer, but can we work on this after this PR? I can look into it after the PR, unless you or somebody else wants to, but I don't want to deal with converting it just now.
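For illustration, a minimal sketch of how the suggested fluent style could be implemented, assuming hypothetical backing fields on the API model (this is not how the PR implements it):

```csharp
// Illustrative only: fluent setters returning `this`, so the opt-in flags
// replace the ever-growing constructor parameter list.
public class Episode
{
    private bool _withFiles;
    private bool _withReleaseInfo;

    public Episode(HttpContext context, SVR_AnimeEpisode episode)
    {
        // Store the references; defer the heavy population work until the
        // model is actually built/serialized.
    }

    public Episode WithFiles() { _withFiles = true; return this; }
    public Episode WithReleaseInfo() { _withReleaseInfo = true; return this; }
    // ...same pattern for MediaInfo, absolute paths, cross-references, etc.
}
```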
```diff
 {
     var mediaInfo = file.MediaInfo as IMediaInfo;
     ID = file.VideoLocalID;
     Size = file.FileSize;
     IsVariation = file.IsVariation;
-    Hashes = new() { ED2K = file.Hash, MD5 = file.MD5, CRC32 = file.CRC32, SHA1 = file.SHA1 };
+    Hashes = new(file);
```
I would just leave this as an instance of the interface, not wrapping it with a POCO.
The interface would need to be moved or duplicated to the API layer to have proper model names in Swagger, I think. Currently it's wrapping it in the hash dict model used in the APIv3, which I also reused in the release info APIv3 model in this PR.
```diff
@@ -723,7 +723,7 @@ public class MySQL : BaseDatabase<MySqlConnection>
     new(109, 1, "UPDATE VideoLocal v INNER JOIN CrossRef_File_Episode CRFE on v.Hash = CRFE.Hash SET DateTimeImported = DateTimeCreated;"),
     new(110, 1, "CREATE TABLE `AniDB_FileUpdate` ( `AniDB_FileUpdateID` INT NOT NULL AUTO_INCREMENT, `FileSize` BIGINT NOT NULL, `Hash` varchar(50) NOT NULL, `HasResponse` BIT NOT NULL, `UpdatedAt` datetime NOT NULL, PRIMARY KEY (`AniDB_FileUpdateID`) );"),
     new(110, 2, "ALTER TABLE `AniDB_FileUpdate` ADD INDEX `IX_AniDB_FileUpdate` (`FileSize` ASC, `Hash` ASC) ;"),
-    new(110, 3, DatabaseFixes.MigrateAniDB_FileUpdates),
+    new(110, 3, DatabaseFixes.NoOperation),
```
What's the reason for killing an existing migration?
It is no longer relevant to run, since the data it migrated is later removed, and the models the migration used are no longer available.
The current service can be run inside or outside the queue/job system, and in the current PoC the release providers are processed in a user-configurable order until a release is found. Only a single release can be assigned to the same video at any given time, so scheduling "relevant jobs for each provider" won't ever happen in parallel. In short, the logic to select a release happens strictly inside the service, and the process-file job is now just asking the service to do its thing while running in the queue/job system in the background. There are also other ways to interact with the service, be it from other plugins through the abstraction, or from RESTful clients through the new endpoints. A sketch of this flow follows below.

I do admit that the way I modified the AniDB banned acquisition filter is kind of hacky, but also correct: it was modified to only block the process-file jobs when AniDB-banned if the AniDB release provider is enabled, since that provider needs the AniDB UDP API. Flipping that around, as long as the AniDB provider isn't enabled, we don't need to block the process-file jobs at all, since nothing is using the AniDB UDP API to find releases.
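A minimal sketch of the sequential selection flow described above. The method and type names (`GetEnabledProvidersInPriorityOrder`, `GetReleaseInfoForVideo`) are hypothetical; the real `IVideoReleaseService` API may differ:

```csharp
// Illustrative only: providers are tried in user-configured priority order
// until one returns a release; at most one release is kept per video.
public async Task<IReleaseInfo?> FindReleaseForVideo(IVideo video, CancellationToken ct)
{
    foreach (var provider in GetEnabledProvidersInPriorityOrder())
    {
        var release = await provider.GetReleaseInfoForVideo(video, ct);
        if (release is not null)
            return release; // first match wins; remaining providers are never called
    }
    return null; // exhausted the list without a match
}
```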
That is a reasonable argument, but I would allow multiple releases so that you can cross-reference them. AniDB is more likely correct than perceptual hashing, even though perceptual hashing is pretty accurate. We could even have a filename plugin with very low accuracy. Maybe add an enum for how trustworthy we expect a provider to be in those cases.
We kinda already have a user-configurable "priority" to use. Can we add two modes — sequential and parallel?
I can add the new setting to the service and modify the description of the auto-finding method and endpoints to reflect the new behavior. The reason I would opt to have both modes is to let the user choose how they want to do it. By default we will only have one provider included (…unless…), so the default mode can be whatever. My particular flow would require the finding to happen in sequential order; it would first check the "nfo" (quotes intentional) file before asking any remote services (or a potential fallback local/offline provider). But I know some would maybe want to do it in a parallel fashion, as you described with the p-hash and AniDB provider, so I'll say "let them pick their own poison to swallow." A sketch of the parallel variant follows below.
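A minimal sketch of the parallel variant under discussion, again with hypothetical names; the trust/priority-based selection is the reviewer's suggestion, not necessarily what the PR implements:

```csharp
// Illustrative only: run every enabled provider concurrently, then keep the
// release from the most trusted (highest-priority) provider that found one.
public async Task<IReleaseInfo?> FindReleaseInParallel(IVideo video, CancellationToken ct)
{
    var providers = GetEnabledProvidersInPriorityOrder();
    var tasks = providers
        .Select(p => p.GetReleaseInfoForVideo(video, ct))
        .ToList();
    var results = await Task.WhenAll(tasks);

    // Providers were queried in configured priority/trust order, so the
    // first non-null result is the winner.
    return results.FirstOrDefault(r => r is not null);
}
```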
- Added the initial plugin abstraction definitions for the release providers. Implementation of the service and the AniDB release provider will come in a following commit.
…e release info, in case they mapped the initial file to some other file
- Renamed the `IImportFolder` to `IManagedFolder` in the plugin abstraction.
- Renamed all fields in the plugin abstraction referring to an import folder to instead refer to a managed folder (e.g. `IVideoFile.ImportFolderID` → `IVideoFile.ManagedFolderID`, etc.). Since this changed _everything_ in the abstraction, renamers will have to be rebuilt against the new abstraction using the new fields.
- Renamed `ImportFolder` to `ManagedFolder` in the RESTful APIv3, including the model itself, all endpoints referring to import folders, and all fields referring to import folders, to now refer to managed folders instead.
- Updated the `IVideoService` to have added/updated/removed events for managed folders, and added a new method to get a single managed folder by ID if needed.
- Updated the `IVideoService` to fix up the methods to get one or more video files by hash. Removed the now-defunct `HashAlgorithmName`, replaced by a `string`, since we now support more than the previously well-known hash algorithms.
- And more.
If you really need to know when the server settings were changed, then use the configuration service to check.
- Added new functionality to the `IAniDBService`. You can now subscribe to the banned and AVDump events in the service, schedule or run a refresh on AniDB (to update existing or add new), and check the AVDump installed status and run AVDump in the foreground or background on videos/files.
- The flow for getting anime through the HTTP API has changed slightly as a result of the above; we can now command a refresh to ignore the banned status or the time check, among other things, like more cleanly disabling/enabling remote or cache lookups.
…an 24h since last fetch
…d scheduling methods to service
…efinitions to load
Set the static service provider/container _before_ loading plugins, in case we end up loading some core service that depends on it being available.
PoC adding hashing & release providers. Still things left to test, but it's overall in a working state, minus the still-missing MySQL/MariaDB & MS SQL Server support.

Changes internally:

- Added a new `IVideoHashingService`, `IHashProvider`, `IHashDigest`, & `HashDigest` to the plugin abstraction. The new `IVideoHashingService` operates on raw `System.IO.FileInfo`s and is responsible for providing hashes to the `HashFileJob` before a `IVideo` & `IVideoFile` is necessarily assigned to a file location. So far there are runtime checks in place to make sure at least one `"ED2K"` hasher is enabled at all times, since we still rely on it as our absolute ID (in combination with the video file size) internally, but the hasher doesn't necessarily need to be provided by the new `"Core"` hasher. It contains events for when a `IVideo` & `IVideoFile` has been hashed (and added to the system), and for when providers have been updated. The service can be used to switch between sequential mode and parallel mode — which controls how plugin providers are called — view all available and/or enabled hash types, enable or disable hash types per provider, and re-order the run priority of providers in sequential mode. (A hash provider sketch follows after this list.) Note: the priority doesn't affect the parallel mode because every provider is… run in parallel.
- Added a new `"Core"` hasher (`CoreHashProvider`) implementing the `"ED2K"`, `"MD5"`, `"CRC32"`, `"SHA1"`, `"SHA256"`, & `"SHA512"` hash types, with the `"ED2K"` enabled by default.
- Added a new `IVideoReleaseService`, `IReleaseInfoProvider`, `IReleaseInfo`, `IReleaseVideoCrossReference`, `IReleaseMediaInfo`, & `VideoReleaseEventArgs` to the plugin abstraction. The new `IVideoReleaseService` is responsible for everything related to release info, be it managing providers, doing the auto-search across multiple providers, showing provider info, saving release info to the database, and clearing saved release info from the database. It also contains events for when a release has been saved or cleared, when an auto-search has been started/completed, and when providers have been updated. The service can be used to switch between sequential mode — running each provider in a loop in priority order until we find a match or exhaust the list — or parallel mode — running all providers in parallel and selecting the highest valid priority release — as well as to view all providers, enable or disable providers, and re-order the priority of the providers. (A release provider sketch follows after this list.)
- Added a new `"AniDB"` release info provider (`AnidbReleaseProvider`), hooking into the existing AniDB UDP lookup logic. As a side-effect of the change in the lookup process, the MyList support in the existing UDP lookup logic has been stripped out, and we now rely entirely on the `IVideoReleaseService` and MyList sync job to add new files and/or pull watched state from the MyList.
- Added `IReadOnlyList<IHashDigest> Hashes`, `string? SHA256`, & `string? SHA512` to `IHashes`, to list all hashes stored for a `IVideo` that may not necessarily be strongly typed, and to expose all hash types supported by the `CoreHashProvider` (`"Core"` in the UI) as strongly typed hashes. The existing strongly typed hash properties have been converted to helpers retrieving the first stored hash of the given type from the list.
- The `AniDB_File`, `AniDB_FileUpdate`, `AniDB_ReleaseGroup`, `CrossRef_Languages_AniDB_File`, & `CrossRef_Subtitles_AniDB_File` models/tables/repos are gone, and their functionality is replaced by the new `StoredReleaseInfo` & `StoredReleaseInfo_MatchAttempt`. The AniDB file has also been removed from the abstraction.
- Video file hashes — except the `"ED2K"` hash — have been moved to only being stored in the new `VideoLocal_HashDigest` table; the `"ED2K"` is still stored on the video itself in addition to the new table.
- Currently I've assigned every existing link as a "manual link", because the user is now able to edit every link we store if they so desire, and this was the simplest way to show all the links in the current Web UI.
- Added a new plugin to simply export/import release info (`Shoko.Plugin.ReleaseExporter`). This is both my test case for the plugin system and a small handy provider if you ever need to re-index your collection from scratch and don't want to do the AniDB UDP dance, or if you want to transcode your collection to a newer format at some point and want to preserve the release info in the process.
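As referenced above, a minimal sketch of what a hash provider plugin might look like. `IHashProvider` and `HashDigest` are names from this PR, but the member signatures here are assumptions for illustration:

```csharp
// Illustrative only: the real abstraction's signatures may differ.
public class ExampleHashProvider : IHashProvider
{
    public string Name => "Example";

    // Hash types this provider can produce; the service lets users
    // enable or disable each type per provider.
    public IReadOnlyList<string> AvailableHashTypes => ["SHA1", "CRC32"];

    public async Task<IReadOnlyList<HashDigest>> GetHashesForFile(FileInfo file, IReadOnlyList<string> enabledTypes)
    {
        var digests = new List<HashDigest>();
        if (enabledTypes.Contains("SHA1"))
        {
            await using var stream = file.OpenRead();
            var hash = await System.Security.Cryptography.SHA1.HashDataAsync(stream);
            // Assumed shape of HashDigest: a type name plus a hex-encoded value.
            digests.Add(new HashDigest { Type = "SHA1", Value = Convert.ToHexString(hash) });
        }
        // ...CRC32 etc. omitted for brevity.
        return digests;
    }
}
```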
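Likewise, a minimal sketch of a release info provider; only the type names come from this PR, and the member signatures plus the lookup helper are hypothetical:

```csharp
// Illustrative only: looks up a release for a video by its ED2K hash.
public class ExampleReleaseProvider : IReleaseInfoProvider
{
    public string Name => "Example";

    public async Task<IReleaseInfo?> GetReleaseInfoForVideo(IVideo video, CancellationToken ct)
    {
        // Assumes the video exposes its digest list directly; the real
        // abstraction may nest this differently.
        var ed2k = video.Hashes.FirstOrDefault(h => h.Type == "ED2K");
        if (ed2k is null)
            return null;

        // Return null when nothing matches, so the service can fall through
        // to the next provider in sequential mode.
        return await LookupByEd2kAndSize(ed2k.Value, video.Size, ct);
    }

    // Hypothetical stub: replace with a real lookup against your source.
    private Task<IReleaseInfo?> LookupByEd2kAndSize(string ed2k, long fileSize, CancellationToken ct)
        => Task.FromResult<IReleaseInfo?>(null);
}
```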
Changes in APIv1:

- All file linking/unlinking in APIv1 has been soft-deprecated. Use APIv3 instead. By soft-deprecated I mean the client can still make the requests, but will only get an error message back from the server.
- Release info has been migrated to use the new format, but only for releases provided by the `"AniDB"` provider.
- Release groups have been migrated to use the new format, but only for release groups with `"AniDB"` as a source.

Changes in APIv2:

- Release info has been migrated to use the new format, but only for releases provided by the `"AniDB"` provider.
- Release groups have been migrated to use the new format, but only for release groups with `"AniDB"` as a source.

Changes in APIv3:

- Release info has been migrated to a new API model.
- `File.Hashes` has been changed from a dict of well-known, nullable hashes to a list of hash digests, where only the ED2K hash is guaranteed to be included in the list.
- `File.AniDB` has been replaced with `File.Release`, which now uses the new release info model. The `includeDataFrom=AniDB` query parameter for file endpoints
- Added a new hashing controller (mounted at `/api/v3/Hashing` for now), to view and edit hashing provider settings, enable or disable hash types per provider, and re-order the run order of providers in sequential mode (note: the order doesn't affect the parallel mode because every provider is… run in parallel).
- Added a new `ReleaseInfoController` (mounted at `/api/v3/ReleaseInfo`), allowing RESTful clients to also interact with the newly added `IVideoReleaseService`. You can do anything you can do through the service.
- File linking in APIv3 has been converted to use the new service, and as a result the artificial limit of not allowing the user to remove AniDB releases is gone. A release is simply a release now.
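For illustration, how a RESTful client might poke at the new controllers. The mount points `/api/v3/Hashing` and `/api/v3/ReleaseInfo` come from this PR, but the sub-routes below (and the default port) are guesses, not the PR's actual endpoints:

```csharp
// Illustrative only: routes are hypothetical sketches of the controller shape.
using var client = new HttpClient { BaseAddress = new Uri("http://localhost:8111") };

// Hypothetical: list hashing providers and their enabled hash types.
var hashing = await client.GetStringAsync("/api/v3/Hashing/Provider");

// Hypothetical: list release info providers in priority order.
var releases = await client.GetStringAsync("/api/v3/ReleaseInfo/Provider");
```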
To-do: