-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Full review #3
base: empty
Are you sure you want to change the base?
Full review #3
Conversation
h/t cecemel
Based on examples from: https://github.com/lblod/prepare-submissions-for-export-service/blob/v0.6.0-rc.1/app.js https://github.com/lblod/lpdc-publish-service/blob/v0.9.1/deltaPostProcess.js https://github.com/lblod/toezicht-flattened-form-data-generator/blob/v1.3.0/app.js https://github.com/lblod/harvesting-execute-diff-deletes-service/blob/v1.0.1/app.js
And describe in README how to let `clamd` run as `root`.
To ensure a stable setup for now, configure clamscan JS library to only use Unix socket to connect to clamd daemon directly, not falling back to TCP, /usr/bin/clamdscan or /usr/bin/clamscan. Test page: http://localhost/virus-scanner/test The test page tests with one of clamav's own test files. Successfully tested with EICAR test file as well, but not including that here for now to prevent false positive detection on this service itself.
- Let /post 201 response body contain only the resource object. This is more JSON:API-like. - Log the rest.
## Getting started | ||
|
||
Prerequisites: | ||
- [file-service](https://github.com/mu-semtech/file-service) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: not per se. Other files may ingest and store files too.
> VIRUS_SCANNER_CLAMD_USER: "root" | ||
|
||
|
||
Add rules to `dispatcher.ex` to dispatch requests to this service. E.g. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe mention if you need to trigger a scan through the API directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the current scope, I am very much inclined to keep the API in the documentation for 'debugging' purposes only. Although it could be potentially useful, it needs thorough consideration before making it available to production clients (mainly the frontend). I do not consider this as part of the scope for this iteration."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The initial problems I see:
- what about long running jobs
- inconsistency how the virus scanner needs to be triggered vs the malware analysis data model
- json:api compatible spec
- ...
It's very close; but i would mention in the documentation that this is not meant for public exposure YET
| sample-ref | `stix:sample_ref` | `nfo:FileDataObject` | The file that was scanned | | ||
|**TODO**result-name| `stix:result_name` | `xsd:string` | Details of the result, e.g. names of detected malware | | ||
|
||
**TODO**: We could add `result-name`, but clamscan returns an array of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not now
|-------------------|-------------------------|----------------------|-------------------------------------------------------------------| | ||
| analysis-started | `stix:analysis_started` | `xsd:dateTime` | Datetime of scan start | | ||
| analysis-ended | `stix:analysis_ended` | `xsd:dateTime` | Datetime of scan end | | ||
| result | `stix:result` | `xsd:string` | The result: `malicious`, `suspicious`, `benign` or `unknown` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we'd go for skos:Concepts
"analysis-started": "2023-11-30T14:10:33.855Z", | ||
"analysis-ended": "2023-11-30T14:10:33.930Z", | ||
"result": "malicious", | ||
"sample-ref": "http://mu.semte.ch/services/file-service/files/65684a368d76fe0010000000" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be compatible with json:api, i think it should be releationships. (https://jsonapi.org//)
const fileResults = []; | ||
|
||
for (const file of filesToScan) { | ||
const scanFileResult = await scanFile(file); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, it would be better to have a try-catch block for each iteration of the loop. Currently, if something fails in the first file, it could potentially cause the remaining files to not be scanned properly
console.log('\nDetailed results per file:'); | ||
console.dir(fileResults, { depth: null }); | ||
|
||
console.log('\nFiles per STIX Malware Analysis result:'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok; this is too elaborate for me. But anyway; if you want to keep such a report; maybe abstract it away in a 'printReport' function that you call here, so the 'controller' logic is less cluttered and easier to follow.
// TODO: Let storeMalwareAnalysis() run a SELECT query after the insert | ||
// to check in which (if any) graphs the resource was inserted? | ||
// Would be more reliable and easier to parse. | ||
const filesNoDatabaseUpdate = fileResults.filter( | ||
(fileResult) => | ||
!( | ||
fileResult.databaseResponse?.results?.bindings[0]?.['callret-0'] | ||
?.value && | ||
fileResult.databaseResponse?.results?.bindings[0]?.[ | ||
'callret-0' | ||
]?.value.match(' -- done') | ||
), | ||
); | ||
if (filesNoDatabaseUpdate.length) { | ||
console.log( | ||
'\nFiles for which the database response indicates that the ' + | ||
'malware analysis resource object was not added to any graph:', | ||
); | ||
console.dir(filesNoDatabaseUpdate, { depth: null }); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is really too much. At some point we should trust the system and the corresponding error codes if stuff fails.
} catch (e) { | ||
console.log(e); | ||
res.status(500).send('Uncaught error in /delta: ' + e); | ||
// TODO: Re-throw error? Because not sure if the response 500 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I think that's the good response, for now.
Currently this is a silent fail; meaning; no one will get notified if we have a general exception. And here is a bit a problem of convention; there is no general way of dealing with these kind of situations.
In loket apps for instance; it's gonna write an error to the database and that will be picked up by another service who will send a mail. But unsure what others do ATM.
Anyway; i would keep for this iteration the handling as is. Maybe use console.error
instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* | ||
* @param {Object} body Request body should be in JSON-format with | ||
* `file` containing a logical file IRI as a single String. | ||
* E.g. { "file": "http://mu.semte.ch/services/file-service/files/6543bc046ea4f3000e00000c" } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before making this API public we should see how json:api-compatible it can be made. Again not for this iteration. Very good it's there
* @param {Object} body Request body should be in JSON-format with | ||
* `file` containing a logical file IRI as a single String. | ||
* E.g. { "file": "http://mu.semte.ch/services/file-service/files/6543bc046ea4f3000e00000c" } | ||
* @return [201] if file was found in database, a malware analysis ran and the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also here again what with long runnig requests. (not for now)
const logicalFileIRI = body.file; | ||
if ( | ||
!( | ||
typeof logicalFileIRI === 'string' || logicalFileIRI instanceof String | ||
) || | ||
!logicalFileIRI.length | ||
) { | ||
return res.status(400).send('`file` not a non-empty String'); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for me a bit overkill. but ok
} | ||
|
||
if (logicalFileIRI.slice(0, 8) === 'share://') { | ||
// TODO: Be flexible and lookup the logical file IRI? Can we assume |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no that's too much :-)
return res | ||
.status(422) | ||
.send('`file` is a physical file IRI, should be a logical file IRI'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're extremely kind to your users :-)
fileIRI.slice(0, 8) === 'share://' | ||
? fileIRI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I didn't see it being used on physical files, i would scrap this to keep code a bit simpler
analysisStarted: new Date(), | ||
analysisEnded: undefined, | ||
result: 'unknown', | ||
resultName: undefined, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick; maybe consider it calling 'viruses' or something. cause was not immediately clear to me what it was.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi,
Very good start. Very well documented too. That's very nice.
If testing proves success (still have to do this); then the only changes i would make for release v0.1.0
- Update Readme: and mention the API is not ready for public consumption yet.
- https://github.com/peternowee/virus-scanner-service/pull/3/files#r1417514505
All the rest can be picked up in next iterations. Some stuff might be breaking (like e.g. the concept scheme status) but nothing we cannot explain in a migration guid
SELECT ?physicalFile | ||
WHERE { | ||
GRAPH ?g { | ||
?physicalFile nie:dataSource ${sparqlEscapeUri(logicalFileIRI)} . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make the query a tad more precise. Saying you expect your physical file to be a nfo:Filedataobject
async function getPhysicalFileIRI(logicalFileIRI) { | ||
const result = await query(` | ||
PREFIX nie: <http://www.semanticdesktop.org/ontologies/2007/01/19/nie#> | ||
SELECT ?physicalFile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use a distinct
here
} | ||
`); | ||
if (result.results.bindings.length) | ||
// `[0]` is based on the assumption that, even if there are triples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using distinct
won't need this comment. I can also image a lot of other egde cases (e.g. derived files) but let's not dwell upon that.
GRAPH ?graph { | ||
${sparqlEscapeUri(fileIRI)} a nfo:FileDataObject . | ||
} | ||
BIND(?graph AS ?g) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just use ?graph
directly
|
||
let databaseResponse; | ||
try { | ||
databaseResponse = await update(` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Side note: extra reason why i am reluctant to expose as API: it uses a sudo query. so everyone with access could write to the db.
type: 'malware-analyses', | ||
id: malwareAnalysisId, | ||
attributes: { | ||
// TODO: Ok to include uri? Not a property in database, but |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the question
} catch (e) { | ||
console.log(e); | ||
res.status(500).send('Uncaught error in /delta: ' + e); | ||
// TODO: Re-throw error? Because not sure if the response 500 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK: i don't think this is picked up during build.
// TODO: Not http://data.gift/id/virus-scanner/analysis/1 ? | ||
// or: http://data.gift/services/id/virus-scanner/analysis/1 ? | ||
const malwareAnalysisIri = | ||
'http://data.gift/virus-scanner/analysis/id/'.concat(malwareAnalysisId); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cecemel Could you still answer this question about the IRI format http://data.gift/virus-scanner/analysis/id/{id}
?
- It seems uncommon to let the service name be the first path segment (
/virus-scanner/
). Often, the first path segment seems to be more generic, like/id/
(or/concepts/
,/concept-schemes/
,/services/
,/vocabularies/
, etc.) - The hostname often has a prefix, like
redpencil.data.gift
,lblod.data.gift
,kanselarij.vo.data.gift
, (or own domain, likedata.lblod.info
).
Some examples from other services:
http://lblod.data.gift/id/subsidie/bicycle-infrastructure/{id}
http://redpencil.data.gift/id/dataContainers/{id}
http://data.lblod.info/id/dataContainers/${id}
http://kanselarij.vo.data.gift/id/indieningsactiviteiten/${id}
http://kanselarij.vo.data.gift/id/personen/${id}
But there are many different variations, perhaps there is no hard rule. I could look into this myself or ask around more later.
VIRUS_SCANNER_CLAMD_USER: # "root" | ||
volumes: | ||
- ./data/files:/share | ||
- type: volume |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two questions here:
- do we need to persist these signatures?
- if we do, can't this not be in a simple mounted volume. (It's one of these implicit assumptions of the stack we DON'T work with docker-volumes)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- do we need to persist these signatures?
If you don't, you risk getting rate-limited for downloading the signature database too often:
WARNING: Tue Oct 24 18:08:37 2023 -> Can't download daily.cvd from https://database.clamav.net/daily.cvd
WARNING: Tue Oct 24 18:08:37 2023 -> FreshClam received error code 429 from the ClamAV Content Delivery Network (CDN).
WARNING: Tue Oct 24 18:08:37 2023 -> You are on cool-down until after: 2023-10-25 14:48:57
WARNING: Tue Oct 24 18:08:38 2023 -> Can't download main.cvd from https://database.clamav.net/main.cvd
WARNING: Tue Oct 24 18:08:38 2023 -> FreshClam received error code 429 from the ClamAV Content Delivery Network (CDN).
WARNING: Tue Oct 24 18:08:38 2023 -> You are on cool-down until after: 2023-10-25 14:48:57
Tue Oct 24 18:08:38 2023 -> This means that you have been rate limited by the CDN.
Tue Oct 24 18:08:38 2023 -> 1. Run FreshClam no more than once an hour to check for updates.
Tue Oct 24 18:08:38 2023 -> FreshClam should check DNS first to see if an update is needed.
Tue Oct 24 18:08:38 2023 -> 2. If you have more than 10 hosts on your network attempting to download,
Tue Oct 24 18:08:38 2023 -> it is recommended that you set up a private mirror on your network using
Tue Oct 24 18:08:38 2023 -> cvdupdate (https://pypi.org/project/cvdupdate/) to save bandwidth on the
Tue Oct 24 18:08:38 2023 -> CDN and your own network.
Tue Oct 24 18:08:38 2023 -> 3. Please do not open a ticket asking for an exemption from the rate limit,
Tue Oct 24 18:08:38 2023 -> it will not be granted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- if we do, can't this not be in a simple mounted volume. (It's one of these implicit assumptions of the stack we DON'T work with docker-volumes)
From my notes of that time:
Problem: When using a bind mount of a host directory to /var/lib/clamav, it is owned by user peter
on the host and thus owned by root
in the container (unless chowned, which is difficult because of uid and gid mapping between host and container). As a result, freshclam cannot download the database to /var/lib/clamav.
Finally got mounting volume with non-root owner to work.
Seems the following is required:
- In Dockerfile, in this order:
- Make sure the directory has desired owner and file mode.
E.g. RUN mkdir && chown && chmod.
In our case, theRUN apt-get install
of clamav already created
/var/lib/clamav owned by clamav:clamav. - After that:
VOLUME /var/lib/clamav
See https://devops.stackexchange.com/questions/4540/how-to-change-the-owner-of-volume-directory-in-dockerfile
- Make sure the directory has desired owner and file mode.
- Then in docker-compose.yml:
- Seems that bind mount does not work, probably because
"Copy modes are not supported for bind-mounted volumes.".
-- https://stackoverflow.com/questions/68317294/docker-compose-volumes-mode-options-
Actually, that could probably be solved by adding
chown
in
boot-virus-scanner.sh, but given the discussion of 20231023 and
20231024, I am trying to prevent that. Also, it would result in
strange ownership uid/gid on host. See similar remark on ClamAV's
own Docker image:Disclaimer: When using a Bind Mount, the container's
entrypoint script will change ownership of this directory to
its "clamav" user. This enables FreshClam and ClamD with the
required permissions to read and write to the directory,
though these changes will also affect those files on the
host.
-- https://github.com/Cisco-Talos/clamav-documentation/blob/main/src/manual/Installing/Docker.mdAdded todo-later item:
- 20231024 15:30 Add chown in boot-virus-scanner.sh to have
bind mount on /var/lib/clamav work as well?
- 20231024 15:30 Add chown in boot-virus-scanner.sh to have
-
- Not specifying a volume for /var/lib/clamav at all: Works. It will
create an anonymous volume.However, anonymous volume is not very useful for persisting data,peter@potlood:~/rpio/mini-app/app-hardware $ docker-compose exec -- virus-scanner runuser -u clamav touch /var/lib/clamav/test $ docker-compose exec -- virus-scanner ls -al /var/lib/clamav drwxr-xr-x 2 clamav clamav 4096 Oct 24 15:32 . drwxr-xr-x 1 root root 4096 Oct 24 14:53 .. -rw-r--r-- 1 clamav clamav 0 Oct 24 15:32 test root@potlood:~ # ls -al /var/lib/docker/1000.1000/volumes/c41e4f3299bd29aa894241e18217783a3e12ecced4ea86ed77b95bc36caf316b/_data/ drwxr-xr-x 2 1100 1103 4096 Oct 24 17:32 . drwx-----x 3 root root 4096 Oct 24 17:28 .. -rw-r--r-- 1 1100 1103 0 Oct 24 17:32 test
because a new anonymous volume is created on every down+up (not on
restart). - Specify a named volume, e.g.:
Note about
services: virus-scanner: volumes: - type: volume source: signature-database target: /var/lib/clamav volume: nocopy: false volumes: signature-database:
nocopy
:nocopy: true
does NOT work! (I tested it again.)nocopy: false
is not required, because
"For named volumes,copy
is the default mode."
-- https://stackoverflow.com/questions/68317294/docker-compose-volumes-mode-options
- Seems that bind mount does not work, probably because
service clamav-daemon restart | ||
|
||
echo "(Re)starting freshclam daemon" | ||
service clamav-freshclam restart |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed for large files the scan fails because it can't allocate sufficient memory. Would it be possible to add this through a parameter? (So we decide per app how large our files can be for scanning)
No description provided.