Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Full review #3

Open
wants to merge 59 commits into
base: empty
Choose a base branch
from
Open

Full review #3

wants to merge 59 commits into from

Conversation

peternowee
Copy link
Collaborator

No description provided.

And describe in README how to let `clamd` run as `root`.
To ensure a stable setup for now, configure clamscan JS library to only
use Unix socket to connect to clamd daemon directly, not falling back
to TCP, /usr/bin/clamdscan or /usr/bin/clamscan.

Test page:
    http://localhost/virus-scanner/test
The test page tests with one of clamav's own test files. Successfully
tested with EICAR test file as well, but not including that here for
now to prevent false positive detection on this service itself.
## Getting started

Prerequisites:
- [file-service](https://github.com/mu-semtech/file-service)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: not per se. Other files may ingest and store files too.

> VIRUS_SCANNER_CLAMD_USER: "root"


Add rules to `dispatcher.ex` to dispatch requests to this service. E.g.
Copy link
Member

@cecemel cecemel Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe mention if you need to trigger a scan through the API directly

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the current scope, I am very much inclined to keep the API in the documentation for 'debugging' purposes only. Although it could be potentially useful, it needs thorough consideration before making it available to production clients (mainly the frontend). I do not consider this as part of the scope for this iteration."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initial problems I see:

  • what about long running jobs
  • inconsistency how the virus scanner needs to be triggered vs the malware analysis data model
  • json:api compatible spec
  • ...
    It's very close; but i would mention in the documentation that this is not meant for public exposure YET

| sample-ref | `stix:sample_ref` | `nfo:FileDataObject` | The file that was scanned |
|**TODO**result-name| `stix:result_name` | `xsd:string` | Details of the result, e.g. names of detected malware |

**TODO**: We could add `result-name`, but clamscan returns an array of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not now

|-------------------|-------------------------|----------------------|-------------------------------------------------------------------|
| analysis-started | `stix:analysis_started` | `xsd:dateTime` | Datetime of scan start |
| analysis-ended | `stix:analysis_ended` | `xsd:dateTime` | Datetime of scan end |
| result | `stix:result` | `xsd:string` | The result: `malicious`, `suspicious`, `benign` or `unknown` |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we'd go for skos:Concepts

"analysis-started": "2023-11-30T14:10:33.855Z",
"analysis-ended": "2023-11-30T14:10:33.930Z",
"result": "malicious",
"sample-ref": "http://mu.semte.ch/services/file-service/files/65684a368d76fe0010000000"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be compatible with json:api, i think it should be releationships. (https://jsonapi.org//)

const fileResults = [];

for (const file of filesToScan) {
const scanFileResult = await scanFile(file);
Copy link
Member

@cecemel cecemel Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, it would be better to have a try-catch block for each iteration of the loop. Currently, if something fails in the first file, it could potentially cause the remaining files to not be scanned properly

console.log('\nDetailed results per file:');
console.dir(fileResults, { depth: null });

console.log('\nFiles per STIX Malware Analysis result:');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok; this is too elaborate for me. But anyway; if you want to keep such a report; maybe abstract it away in a 'printReport' function that you call here, so the 'controller' logic is less cluttered and easier to follow.

Comment on lines +113 to +132
// TODO: Let storeMalwareAnalysis() run a SELECT query after the insert
// to check in which (if any) graphs the resource was inserted?
// Would be more reliable and easier to parse.
const filesNoDatabaseUpdate = fileResults.filter(
(fileResult) =>
!(
fileResult.databaseResponse?.results?.bindings[0]?.['callret-0']
?.value &&
fileResult.databaseResponse?.results?.bindings[0]?.[
'callret-0'
]?.value.match(' -- done')
),
);
if (filesNoDatabaseUpdate.length) {
console.log(
'\nFiles for which the database response indicates that the ' +
'malware analysis resource object was not added to any graph:',
);
console.dir(filesNoDatabaseUpdate, { depth: null });
}
Copy link
Member

@cecemel cecemel Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is really too much. At some point we should trust the system and the corresponding error codes if stuff fails.

} catch (e) {
console.log(e);
res.status(500).send('Uncaught error in /delta: ' + e);
// TODO: Re-throw error? Because not sure if the response 500
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here I think that's the good response, for now.
Currently this is a silent fail; meaning; no one will get notified if we have a general exception. And here is a bit a problem of convention; there is no general way of dealing with these kind of situations.
In loket apps for instance; it's gonna write an error to the database and that will be picked up by another service who will send a mail. But unsure what others do ATM.
Anyway; i would keep for this iteration the handling as is. Maybe use console.error instead

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*
* @param {Object} body Request body should be in JSON-format with
* `file` containing a logical file IRI as a single String.
* E.g. { "file": "http://mu.semte.ch/services/file-service/files/6543bc046ea4f3000e00000c" }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before making this API public we should see how json:api-compatible it can be made. Again not for this iteration. Very good it's there

* @param {Object} body Request body should be in JSON-format with
* `file` containing a logical file IRI as a single String.
* E.g. { "file": "http://mu.semte.ch/services/file-service/files/6543bc046ea4f3000e00000c" }
* @return [201] if file was found in database, a malware analysis ran and the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also here again what with long runnig requests. (not for now)

Comment on lines +179 to +187
const logicalFileIRI = body.file;
if (
!(
typeof logicalFileIRI === 'string' || logicalFileIRI instanceof String
) ||
!logicalFileIRI.length
) {
return res.status(400).send('`file` not a non-empty String');
}
Copy link
Member

@cecemel cecemel Dec 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for me a bit overkill. but ok

}

if (logicalFileIRI.slice(0, 8) === 'share://') {
// TODO: Be flexible and lookup the logical file IRI? Can we assume
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no that's too much :-)

Comment on lines +193 to +195
return res
.status(422)
.send('`file` is a physical file IRI, should be a logical file IRI');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're extremely kind to your users :-)

Comment on lines +258 to +259
fileIRI.slice(0, 8) === 'share://'
? fileIRI
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I didn't see it being used on physical files, i would scrap this to keep code a bit simpler

analysisStarted: new Date(),
analysisEnded: undefined,
result: 'unknown',
resultName: undefined,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick; maybe consider it calling 'viruses' or something. cause was not immediately clear to me what it was.

Copy link
Member

@cecemel cecemel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi,
Very good start. Very well documented too. That's very nice.

If testing proves success (still have to do this); then the only changes i would make for release v0.1.0

All the rest can be picked up in next iterations. Some stuff might be breaking (like e.g. the concept scheme status) but nothing we cannot explain in a migration guid

SELECT ?physicalFile
WHERE {
GRAPH ?g {
?physicalFile nie:dataSource ${sparqlEscapeUri(logicalFileIRI)} .
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make the query a tad more precise. Saying you expect your physical file to be a nfo:Filedataobject

async function getPhysicalFileIRI(logicalFileIRI) {
const result = await query(`
PREFIX nie: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#>
SELECT ?physicalFile
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use a distinct here

}
`);
if (result.results.bindings.length)
// `[0]` is based on the assumption that, even if there are triples
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using distinct won't need this comment. I can also image a lot of other egde cases (e.g. derived files) but let's not dwell upon that.

GRAPH ?graph {
${sparqlEscapeUri(fileIRI)} a nfo:FileDataObject .
}
BIND(?graph AS ?g)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can just use ?graph directly


let databaseResponse;
try {
databaseResponse = await update(`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note: extra reason why i am reluctant to expose as API: it uses a sudo query. so everyone with access could write to the db.

type: 'malware-analyses',
id: malwareAnalysisId,
attributes: {
// TODO: Ok to include uri? Not a property in database, but
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the question

} catch (e) {
console.log(e);
res.status(500).send('Uncaught error in /delta: ' + e);
// TODO: Re-throw error? Because not sure if the response 500
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK: i don't think this is picked up during build.

// TODO: Not http://data.gift/id/virus-scanner/analysis/1 ?
// or: http://data.gift/services/id/virus-scanner/analysis/1 ?
const malwareAnalysisIri =
'http://data.gift/virus-scanner/analysis/id/'.concat(malwareAnalysisId);
Copy link
Collaborator Author

@peternowee peternowee Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cecemel Could you still answer this question about the IRI format http://data.gift/virus-scanner/analysis/id/{id} ?

  • It seems uncommon to let the service name be the first path segment (/virus-scanner/). Often, the first path segment seems to be more generic, like /id/ (or /concepts/, /concept-schemes/, /services/, /vocabularies/, etc.)
  • The hostname often has a prefix, like redpencil.data.gift, lblod.data.gift, kanselarij.vo.data.gift, (or own domain, like data.lblod.info).

Some examples from other services:

  • http://lblod.data.gift/id/subsidie/bicycle-infrastructure/{id}
  • http://redpencil.data.gift/id/dataContainers/{id}
  • http://data.lblod.info/id/dataContainers/${id}
  • http://kanselarij.vo.data.gift/id/indieningsactiviteiten/${id}
  • http://kanselarij.vo.data.gift/id/personen/${id}

But there are many different variations, perhaps there is no hard rule. I could look into this myself or ask around more later.

VIRUS_SCANNER_CLAMD_USER: # "root"
volumes:
- ./data/files:/share
- type: volume
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two questions here:

  1. do we need to persist these signatures?
  2. if we do, can't this not be in a simple mounted volume. (It's one of these implicit assumptions of the stack we DON'T work with docker-volumes)

Copy link
Collaborator Author

@peternowee peternowee Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. do we need to persist these signatures?

If you don't, you risk getting rate-limited for downloading the signature database too often:

WARNING: Tue Oct 24 18:08:37 2023 -> Can't download daily.cvd from https://database.clamav.net/daily.cvd
WARNING: Tue Oct 24 18:08:37 2023 -> FreshClam received error code 429 from the ClamAV Content Delivery Network (CDN).
WARNING: Tue Oct 24 18:08:37 2023 -> You are on cool-down until after: 2023-10-25 14:48:57
WARNING: Tue Oct 24 18:08:38 2023 -> Can't download main.cvd from https://database.clamav.net/main.cvd
WARNING: Tue Oct 24 18:08:38 2023 -> FreshClam received error code 429 from the ClamAV Content Delivery Network (CDN).
WARNING: Tue Oct 24 18:08:38 2023 -> You are on cool-down until after: 2023-10-25 14:48:57
Tue Oct 24 18:08:38 2023 -> This means that you have been rate limited by the CDN.
Tue Oct 24 18:08:38 2023 ->  1. Run FreshClam no more than once an hour to check for updates.
Tue Oct 24 18:08:38 2023 ->     FreshClam should check DNS first to see if an update is needed.
Tue Oct 24 18:08:38 2023 ->  2. If you have more than 10 hosts on your network attempting to download,
Tue Oct 24 18:08:38 2023 ->     it is recommended that you set up a private mirror on your network using
Tue Oct 24 18:08:38 2023 ->     cvdupdate (https://pypi.org/project/cvdupdate/) to save bandwidth on the
Tue Oct 24 18:08:38 2023 ->     CDN and your own network.
Tue Oct 24 18:08:38 2023 ->  3. Please do not open a ticket asking for an exemption from the rate limit,
Tue Oct 24 18:08:38 2023 ->     it will not be granted.

Copy link
Collaborator Author

@peternowee peternowee Jan 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. if we do, can't this not be in a simple mounted volume. (It's one of these implicit assumptions of the stack we DON'T work with docker-volumes)

From my notes of that time:

Problem: When using a bind mount of a host directory to /var/lib/clamav, it is owned by user peter on the host and thus owned by root in the container (unless chowned, which is difficult because of uid and gid mapping between host and container). As a result, freshclam cannot download the database to /var/lib/clamav.

Finally got mounting volume with non-root owner to work.
Seems the following is required:

  • In Dockerfile, in this order:
    1. Make sure the directory has desired owner and file mode.
      E.g. RUN mkdir && chown && chmod.
      In our case, the RUN apt-get install of clamav already created
      /var/lib/clamav owned by clamav:clamav.
    2. After that: VOLUME /var/lib/clamav
      See https://devops.stackexchange.com/questions/4540/how-to-change-the-owner-of-volume-directory-in-dockerfile
  • Then in docker-compose.yml:
    • Seems that bind mount does not work, probably because
      "Copy modes are not supported for bind-mounted volumes.".
      -- https://stackoverflow.com/questions/68317294/docker-compose-volumes-mode-options
      • Actually, that could probably be solved by adding chown in
        boot-virus-scanner.sh, but given the discussion of 20231023 and
        20231024, I am trying to prevent that. Also, it would result in
        strange ownership uid/gid on host. See similar remark on ClamAV's
        own Docker image:

        Disclaimer: When using a Bind Mount, the container's
        entrypoint script will change ownership of this directory to
        its "clamav" user. This enables FreshClam and ClamD with the
        required permissions to read and write to the directory,
        though these changes will also affect those files on the
        host.
        -- https://github.com/Cisco-Talos/clamav-documentation/blob/main/src/manual/Installing/Docker.md

        Added todo-later item:

        • 20231024 15:30 Add chown in boot-virus-scanner.sh to have
          bind mount on /var/lib/clamav work as well?
    • Not specifying a volume for /var/lib/clamav at all: Works. It will
      create an anonymous volume.
      peter@potlood:~/rpio/mini-app/app-hardware
      $ docker-compose exec -- virus-scanner runuser -u clamav touch /var/lib/clamav/test
      $ docker-compose exec -- virus-scanner ls -al /var/lib/clamav
      drwxr-xr-x 2 clamav clamav 4096 Oct 24 15:32 .
      drwxr-xr-x 1 root   root   4096 Oct 24 14:53 ..
      -rw-r--r-- 1 clamav clamav    0 Oct 24 15:32 test
      root@potlood:~
      # ls -al /var/lib/docker/1000.1000/volumes/c41e4f3299bd29aa894241e18217783a3e12ecced4ea86ed77b95bc36caf316b/_data/
      drwxr-xr-x 2 1100 1103 4096 Oct 24 17:32 .
      drwx-----x 3 root root 4096 Oct 24 17:28 ..
      -rw-r--r-- 1 1100 1103    0 Oct 24 17:32 test
      
      However, anonymous volume is not very useful for persisting data,
      because a new anonymous volume is created on every down+up (not on
      restart).
    • Specify a named volume, e.g.:
          services:
            virus-scanner:
              volumes:
                - type: volume
                  source: signature-database
                  target: /var/lib/clamav
                  volume:
                    nocopy: false
          volumes:
            signature-database:
      
      Note about nocopy:

service clamav-daemon restart

echo "(Re)starting freshclam daemon"
service clamav-freshclam restart
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed for large files the scan fails because it can't allocate sufficient memory. Would it be possible to add this through a parameter? (So we decide per app how large our files can be for scanning)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants