Karton Filetype Engine

A Different Approach to File Classification for MWDB Karton

The Karton Filetype Engine is a file type identification service for the MWDB Karton system. It is inspired by karton-classifier, but it follows an entirely different approach. The classifier attaches every possible label to a sample, hoping that at least one of them is correct and gets picked up by the right consumer. This repository instead tries its best to assign a SINGLE file type that is as correct as possible.

Utilized third party tools

To achieve the best accuracy, the Filetype Engine uses all of the following tools:

It also utilizes some external databases/lists to improve its MIME type knowledge:

Input/Output

Consumes

{
    "type": "sample",
    "kind": "raw",
    "payload": {
        "magic":  "output from 'file' command",
        "sample": <Resource>
    }
}
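
As a rough illustration of which tasks this shape describes, the sketch below checks a task's headers and payload using plain dictionaries. It does not use the Karton library. `CONSUMED_FILTER`, `matches_filter` and `has_required_payload` are hypothetical names, and the matching rule (every filter key must equal the corresponding header value) is a simplification, not Karton's actual routing logic.

```python
# Hypothetical filter mirroring the "Consumes" structure above.
CONSUMED_FILTER = {"type": "sample", "kind": "raw"}


def matches_filter(headers, task_filter=CONSUMED_FILTER):
    """Return True when every key in the filter has the same value in headers."""
    return all(headers.get(key) == value for key, value in task_filter.items())


def has_required_payload(payload):
    """Check that the payload carries the two fields listed above."""
    return "magic" in payload and "sample" in payload
```

A task with headers `{"type": "sample", "kind": "raw"}` and a payload containing both `magic` and `sample` would pass both checks under these assumptions.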

Produces

The output structure resembles the one produced by karton-classifier, but it is in no way compatible with it.

{
    'type': 'sample',
    'stage': 'recognized',
    'extension': '',    # The extension used by the file format.
                        # In some cases it is a placeholder rather than a real extension;
                        # for example, PEs use "pe", which does not exist as an extension.
                        # Defaults to "bin".
    'mime': '',         # The identified MIME type. In most cases it is provided by Magika and Tika,
                        # so it should be stable to use.
                        # Defaults to "application/octet-stream" when nothing matches.
    'kind': '',         # A hybrid of the top-level items from:
                        # https://www.digipres.org/formats/mime-types/
                        # plus one custom element for archives.
                        # Every MIME type therefore gets either its top-level MIME element or "archive".
    ... (other fields are derived from the incoming task)
}
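
The defaults and the `kind` rule described in the comments above can be sketched as follows. This is a minimal illustration, not the engine's implementation: `build_headers`, `derive_kind` and the contents of `ARCHIVE_MIMES` are assumptions, since the README does not show which MIME types are treated as archives.

```python
DEFAULT_EXTENSION = "bin"
DEFAULT_MIME = "application/octet-stream"

# Hypothetical examples only; the engine's real archive list is not shown here.
ARCHIVE_MIMES = {"application/zip", "application/x-tar"}


def derive_kind(mime, archive_mimes=ARCHIVE_MIMES):
    """Return "archive" for archive MIME types, else the top-level MIME element."""
    if mime in archive_mimes:
        return "archive"
    return mime.split("/", 1)[0]


def build_headers(extension=None, mime=None):
    """Assemble the produced headers, falling back to the documented defaults."""
    mime = mime or DEFAULT_MIME
    return {
        "type": "sample",
        "stage": "recognized",
        "extension": extension or DEFAULT_EXTENSION,
        "mime": mime,
        "kind": derive_kind(mime),
    }
```

Under these assumptions, a downstream consumer can route on a single `kind` value (for example "image" or "archive") instead of a set of overlapping labels.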

I know, Filetype is more complicated to check. TODO

Getting Started

TODO

Repository: NtWriteCode/karton-filetype

Folders and files

Latest commit

History

Repository files navigation

Karton Filetype Engine

A Different Approach to File Classification for MWDB Karton

Utilized third party tools

Input/Output

Consumes

Produces

Getting Started

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages