Migrate JobFilesAPIController to FastAPI (excluding TUS uploads)#20235
Migrate JobFilesAPIController to FastAPI (excluding TUS uploads)#20235kysrpex wants to merge 11 commits intogalaxyproject:devfrom
JobFilesAPIController to FastAPI (excluding TUS uploads)#20235Conversation
|
Pulsar uses these endpoints, so before merging this, it is critical that it is passes all tests from test/integration/test_job_files.py and test_job_files_tus.py ( |
|
Locally, @jmchilton I could use some help from you (I see that you wrote test/integration/test_job_files_tus.py). Do you think the failures are related to the changes from this PR? Do you have any clue of what's failing before I look further into it? I would be quite grateful if you could have a look after the CI finishes running the tests. |
|
|
||
| @router.post( | ||
| "/api/jobs/{job_id}/files", | ||
| summary="Populate an output file.", |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
| responses={ | ||
| 200: {"description": "An okay message.", "content": {"application/json": {"example": {"message": "ok"}}}}, | ||
| }, | ||
| ) |
Check failure
Code scanning / CodeQL
Polynomial regular expression used on uncontrolled data High
| if os.path.exists(path) and (path.endswith("tool_stdout") or path.endswith("tool_stderr")): | ||
| with open(path, "ab") as destination: | ||
| shutil.copyfileobj(open(input_file.name, "rb"), destination) | ||
| if os.path.exists(path) and (path.endswith("tool_stdout") or path.endswith("tool_stderr")): |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
| with open(path, "ab") as destination: | ||
| shutil.copyfileobj(open(input_file.name, "rb"), destination) | ||
| if os.path.exists(path) and (path.endswith("tool_stdout") or path.endswith("tool_stderr")): | ||
| with open(path, "ab") as destination: |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
| if os.path.exists(path) and (path.endswith("tool_stdout") or path.endswith("tool_stderr")): | ||
| with open(path, "ab") as destination: | ||
| if input_file_path: | ||
| with open(input_file_path, "rb") as input_file_handle: |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
| # (https://docs.python.org/3/library/tempfile.html#tempfile.SpooledTemporaryFile), so now there is not even | ||
| # a path where uploaded files can be accessed on disk | ||
| if input_file_path: | ||
| shutil.move(input_file_path, path) |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
| # (https://docs.python.org/3/library/tempfile.html#tempfile.SpooledTemporaryFile), so now there is not even | ||
| # a path where uploaded files can be accessed on disk | ||
| if input_file_path: | ||
| shutil.move(input_file_path, path) |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
| # tempfile has moved and Python wants to delete it. | ||
| pass | ||
| return {"message": "ok"} | ||
| with open(path, "wb") as destination: |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression High
|
About the CodeQL issues, although there is One solution is to add an exception. Another to delay the merge to see if there is some time remaining to fix them after the ARC integration has been worked through. |
1a8b10b to
d17002d
Compare
|
|
||
| job = self.__authorize_job_access(trans, job_id, path=path, job_key=job_key) | ||
|
|
||
| if not os.path.exists(path): |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression
| if os.path.exists(path) and (path.endswith("tool_stdout") or path.endswith("tool_stderr")): | ||
| with open(path, "ab") as destination: | ||
| shutil.copyfileobj(open(input_file.name, "rb"), destination) | ||
| if os.path.exists(path) and (path.endswith("tool_stdout") or path.endswith("tool_stderr")): |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression
| with open(path, "ab") as destination: | ||
| shutil.copyfileobj(open(input_file.name, "rb"), destination) | ||
| if os.path.exists(path) and (path.endswith("tool_stdout") or path.endswith("tool_stderr")): | ||
| with open(path, "ab") as destination: |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression
| if os.path.exists(path) and (path.endswith("tool_stdout") or path.endswith("tool_stderr")): | ||
| with open(path, "ab") as destination: | ||
| if input_file_path: | ||
| with open(input_file_path, "rb") as input_file_handle: |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression
| # (https://docs.python.org/3/library/tempfile.html#tempfile.SpooledTemporaryFile), so now there is not even | ||
| # a path where uploaded files can be accessed on disk | ||
| if input_file_path: | ||
| shutil.move(input_file_path, path) |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression
| # (https://docs.python.org/3/library/tempfile.html#tempfile.SpooledTemporaryFile), so now there is not even | ||
| # a path where uploaded files can be accessed on disk | ||
| if input_file_path: | ||
| shutil.move(input_file_path, path) |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression
|
|
||
| See more discussion of checking upload access, but we shouldn't need the | ||
| API key and session stuff the user upload tusd server should be configured with. | ||
| with open(path, "wb") as destination: |
Check failure
Code scanning / CodeQL
Uncontrolled data used in path expression
It was because the endpoint |
|
@maikenp FYI |
`FastAPIJobFiles` is the new, FastAPI version of `JobFilesAPIController`. The endpoints that have been migrated should exhibit exactly the same behavior as the old ones from `FastAPIJobFiles`. Something to keep in mind is that while FastAPI has some extra built-in features that the legacy WSGI system did not have, such as answering HEAD requests, those do not work because of the way legacy WSGI endpoints are injected into the FastAPI app (using `app.mount("/", wsgi_handler)`), meaning that for example, HEAD requests are passed to the `wsgi_handler` sub-application.
Endpoints dedicated to TUS uploads work in tandem with the WSGI middleware `TusMiddleware` from the `tuswsgi` package. As explained above, WSGI middlewares and endpoints are injected into the FastAPI app after FastAPI routes as a single sub-application `wsgi_handler` using `app.mount("/", wsgi_handler)`, meaning that requests are passed to the `wsgi_handler` sub-application (and thus to `TusMiddleware`) only if there was no FastAPI endpoint defined to handle them. Therefore, they cannot be migrated to FastAPI unless `TusMiddleware` is also migrated to ASGI.
Work around a bug in FastAPI (fastapi/fastapi#13175) that assigns the same operation id to both request methods GET and HEAD of the endpoint `/api/jobs/{job_id}/files` when using the `@router.api_route()` decorator with `methods=["GET", "HEAD"]` as keyword argument.
…T requests to `/api/jobs/{job_id}/files`
Pulsar formats the `path` and `job_key` parameters as query parameters when submitting POST requests to `/api/jobs/{job_id}/files`. However, many Galaxy tests format them as form parameters. The only way to keep the endpoint working as it should (as it worked before the migration to FastAPI) is to accept both query and form parameters.
… requests to `/api/jobs/{job_id}/files`
FastAPI will not use the parameter aliases of form parameters in the OpenAPI docs, but the name of their Python variables. Therefore, the API docs show `path_form` and `job_key_form`. Rename them so that the API docs show the correct parameter names.
071d2d5 to
7a11519
Compare
Change docstrings of `job_key_query_or_form()` and `path_query_or_form()` to reference the correct HTTP error code (400).
There was a small merge conflict with 7f89e0e, make sure changes are compliant with said commit.
mvdbeek
left a comment
There was a problem hiding this comment.
here's the (incomplete) review i never submitted, maybe you can take a look at this and I'll come back to this next week
|
|
||
| :rtype: binary | ||
| :returns: contents of file | ||
| # FastAPI answers HEAD requests automatically for GET endpoints. However, because of the way legacy WSGI endpoints |
There was a problem hiding this comment.
Where's that info coming from ? This has been discussed but was never implemented (last attempt fastapi/fastapi#13509).
There was a problem hiding this comment.
If you return FileResponse, then it's true fastapi/fastapi#1773 (comment). I incorrectly assumed it was generally true, because I just tried it with this endpoint.
I changed the comments so that they are precise 8def3a7. Thanks for letting me know about this.
| @router.get( | ||
| # simplify me (remove `_args` and `_kwargs` defined using the walrus operator) when ALL endpoints have been | ||
| # migrated to FastAPI, this is a workaround for FastAPI bug https://github.com/fastapi/fastapi/issues/13175 | ||
| *(_args := ["/api/jobs/{job_id}/files"]), | ||
| **( | ||
| _kwargs := dict( | ||
| summary="Get a file required to staging a job.", | ||
| responses={ | ||
| 200: { | ||
| "description": "Contents of file.", | ||
| "content": {"application/json": None, "application/octet-stream": {"example": None}}, | ||
| }, | ||
| 400: { | ||
| "description": ( | ||
| "File not found, path does not refer to a file, or input dataset(s) for job have been purged." | ||
| ) | ||
| }, | ||
| }, | ||
| ) | ||
| ), | ||
| ) | ||
| @router.head(*_args, **_kwargs) # type: ignore[name-defined] | ||
| # remove `@router.head(...)` when ALL endpoints have been migrated to FastAPI |
There was a problem hiding this comment.
Can you follow the ordinary way of setting up fastapi endpoints, i.e. no argument unpacking and I don't think the explicit responses and summary keyword are needed either.
There was a problem hiding this comment.
This is the result of adding the keywords.
...
I don't think there harm in neatly documenting the endpoint, especially because the cause of HTTP 400 errors is hard to guess.
Regarding argument unpacking, refer to this answer.
| router = Router( | ||
| # keep the endpoint in the undocumented section of the API docs `/api/docs`, as all endpoints from `FastAPIJobFiles` | ||
| # are certainly not meant to represent part of Galaxy's stable, user facing API | ||
| tags=["undocumented"] |
There was a problem hiding this comment.
That's meant for the stuff that hasn't been ported. There's no harm documenting these and making them available for client libraries as long as these are marked as internal.
| # | ||
| @router.get( | ||
| # simplify me (remove `_args` and `_kwargs` defined using the walrus operator) when ALL endpoints have been | ||
| # migrated to FastAPI, this is a workaround for FastAPI bug https://github.com/fastapi/fastapi/issues/13175 |
There was a problem hiding this comment.
That's maybe a little excessive, you can also pass in an explicit operation_id if you did this to avoid the warning.
There was a problem hiding this comment.
The point of having both decorators @router.get() and @router.head() is indeed avoiding the warning.
If I have only one decorator @router.api_route(), then passing an explicit operation_id does not solve anything (both GET and HEAD get the same operation id).
If I use both decorators, then I never have to pass an explicit operation_id, because FastAPI will generate two different ones for me.
Regarding argument unpacking, the idea is that the same method index() has to be reused both for GET and HEAD. I am forced to use the two-decorator approach (to avoid the warning) and by definition, both the GET and HEAD endpoints must be defined in exactly the same way. I use argument unpacking to avoid repeating the arguments of the first decorator in the second decorator.
I realize however, that including both GET and HEAD in the API schema is redundant. Thus, I added include_in_schema=False to the @router.head() decorator (bed78be).
There was a problem hiding this comment.
Sorry, I don't think my comment is addressed. Please prefer a bit of repetitive code over disabling type checks.
There was a problem hiding this comment.
Since I set include_in_schema=False, @router.head() can now skip the kwargs and there would then be little repetition (only the path). Seems ok if it's just that what's repeated.
|
|
||
| This method does not force the client to provide the parameter, it could simply not submit the parameter in either | ||
| format. To force the client to provide the parameter, coerce the output of the method to a string, i.e. | ||
| `job_key: str = Depends(job_key_query_or_form)` so that FastAPI responds with status code 500 when the parameter is |
There was a problem hiding this comment.
That should be a validation error, not an internal server error.
Thank you, complete or not, that's some steam to get the engine running. |
Including both GET and HEAD in the API schema for the same endpoint is redundant.
…ally It is only true for endpoints returning `FileResponse`, see fastapi/fastapi#1773 (comment).
49ac9c3 to
8def3a7
Compare
Group job files API endpoints together with jobs API endpoints.
I addressed the comments, let's continue next week. Thanks again for moving things forward. |
Thanks for coming back to this again.
Regarding test coverage I think we are safe. I added a couple tests to cover the code paths that were originally not covered. Luckily, besides the tests from test_job_files.py, there also exist tests that caught a nasty problem that would've been overlooked (see this comment).
You're right this is important. It can be the the next step to get this merged 👍. There's also this potential performance issue that's not simple to solve at all. |
| with open(path, "wb") as destination: | ||
| shutil.copyfileobj(cast(IO[bytes], input_file), destination) |
There was a problem hiding this comment.
This is the potential performance mentioned in this comment. For readability I just copied the new code below (the comment above the code explains the problem).
# prior to migrating to FastAPI, this operation was done more efficiently for all cases using
# `shutil.move(input_file_path, path)`, but FastAPI stores the uploaded file as
# `tempfile.SpooledTemporaryFile`
# (https://docs.python.org/3/library/tempfile.html#tempfile.SpooledTemporaryFile), so now there is not even
# a path where uploaded files can be accessed on disk
if input_file_path:
shutil.move(input_file_path, path)
else:
with open(path, "wb") as destination:
shutil.copyfileobj(cast(IO[bytes], input_file), destination)It's tricky because the limitation is in FastAPI's design.
There was a problem hiding this comment.
you should be able to call rollover on the file and move it then ?
There was a problem hiding this comment.
Thanksss, that's just what I was looking for!
There was a problem hiding this comment.
Hmm there is a problem with this. Besides storing the data in memory until the file is rolled over to disk, tempfile.SpooledTemporaryFile operates exactly as tempfile.TemporaryFile. So even if rollover() is called, there is no path for that file (and there is nothing to call shutil.move() on).
Under Unix, the directory entry for the file is either not created at all or is removed immediately after the file is created. Other platforms do not support this; your code should not rely on a temporary file created using this function having or not having a visible name in the file system.
Definitely writing the file to disk is twice suboptiomal... 🤔
The point of having both decorators `@router.get()` and `@router.head()` is working around FastAPI issue fastapi/fastapi#13175. If there is only one decorator `@router.api_route()`, then both GET and HEAD get the same operation id. If both decorators are used, then FastAPI generates two different operation ids. The idea behind argument unpacking is that the same method `index()` has to be reused both for GET and HEAD. The two-decorator approach is needed to avoid the duplicate operation id warning and by definition, both the GET and HEAD endpoints must be defined in exactly the same way. Argument unpacking allows to avoid repeating the arguments of the first decorator in the second decorator. However, including both GET and HEAD in the API schema is redundant. Commit bed78be adds `include_in_schema=False` to the `@router.head()`. That implies that there is now not so much benefit in the argument unpacking trick, because the kwargs are no longer needed and there would then be little repetition (only the path). Thus, declare `@router.head()` explicitly.
As part of the development of an integration of Galaxy with ARC (Advanced Resource Connector) as a Pulsar job runner, tweaks to the
JobFilesAPIControllerare needed. None of such tweaks are included in this PR, but it makes sense to implement them building upon a FastAPI endpoint rather than a legacy WSGI one; and the first step to do that is migrating the controller to FastAPI.FastAPIJobFilesis the new, FastAPI version ofJobFilesAPIController. The endpoints that have been migrated should exhibit exactly the same behavior as the old ones fromFastAPIJobFiles.Endpoints dedicated to TUS uploads work in tandem with the WSGI middleware
TusMiddlewarefrom thetuswsgipackage. WSGI middlewares and endpoints are injected into the FastAPI app after FastAPI routes as a single sub-applicationwsgi_handlerusingapp.mount("/", wsgi_handler), meaning that requests are passed to thewsgi_handlersub-application (and thus toTusMiddleware) only if there was no FastAPI endpoint defined to handle them. Therefore, they cannot be migrated to FastAPI unlessTusMiddlewareis also migrated to ASGI. I am postponing that migration, because the ARC integration needs to be delivered soonish.I also included three new tests for existing functionality: writing from uploads done with the nginx_upload_module, from TUS uploads and using the parameter
__file.How to test the changes?
(Select all options that apply)
License