Painfully slow responses with large dataset and relationships (?) #558
Thank you for your thorough issue report. The problem is entirely with Flask-Restless, not your database configuration.
The performance is atrocious because I had no idea what I was doing when I started this project, and before long it had become used by way more people than I had ever anticipated :).
This is a long-standing issue with Flask-Restless. The includes/excludes are applied at serialization time, not at query time. If you see an easy way to fix this, it would be a welcome pull request.
This is sort of the same as the problem in the first bullet point above: you want to specify includes/excludes at the database level. Is that right?
This is so that we meet the requirements of the JSON API specification.
This is certainly a tradeoff that comes with the structure provided by the JSON API specification. Maybe what you're looking for is for Flask-Restless to treat the many-to-many relationship differently. I would like to be able to work on applying includes/excludes at the database query level instead of at the serialization level, but it would take a fair bit of work and I don't have the time at the moment. But this is definitely a high-priority issue. Let me know how difficult you think it would be to solve this.
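Applying excludes at the query level, as discussed above, could look something like the following sketch using SQLAlchemy's deferred column loading. The model and column names are illustrative, not flask-restless internals:

```python
from sqlalchemy import Column, Integer, String, Text, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker, defer

Base = declarative_base()

class Post(Base):
    __tablename__ = "posts"
    id = Column(Integer, primary_key=True)
    title = Column(String(200))
    body = Column(Text)  # large column we might want to exclude

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add(Post(title="hello", body="x" * 10000))
session.commit()

def query_with_excludes(session, model, excludes):
    # Excluded columns are deferred: they are left out of the SELECT
    # and only fetched if explicitly accessed later.
    query = session.query(model)
    for name in excludes:
        query = query.options(defer(getattr(model, name)))
    return query

posts = query_with_excludes(session, Post, ["body"]).all()
print(posts[0].title)
```

With this approach the large `body` column never crosses the wire unless a client asks for it, instead of being fetched and then thrown away at serialization time.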
@jfinkels, thank you for the in-depth response. I would have responded sooner, but I wanted my response to be equally useful.
Kudos to you for starting something that has become a favorite amongst people like me! :)
Scalar is invoking .one() and raising an exception if more than one result is found, so maybe the name of the test is wrong? I see what is being attempted there, and from what I can see in helpers.py it will fail; it seems that every relationship is treated the same way regardless of its type. I think some configurable logic should be added here to:
And that shouldn't be too difficult to do, with or without using SQLAlchemy's inspector (I'm trying to think of any overhead reduction).
I think performance would be a higher priority, but that's just my humble opinion. Using includes and excludes to modify a query should be relatively simple; for example, I wrote this POC that takes a list of strings, as column names, to exclude from my query:
(NOTE: I was planning to incorporate "includes" as well but I haven't gotten around to that yet) The query for test_normal() was:
And the query for test_excludes() was:
While the results, with excludes, won't necessarily be the objects themselves, I did try to preserve the primary keys so rows could be referenced by key instead of requiring any other kind of table scan (or an index scan on a non-pkey index). As far as the overall performance goes, this is really the first time I've looked under the hood of flask-restless, but a weekend and a copious supply of Red Bull (blueberry++, mmm) could yield some benefit. :) Feel free to reach out to me via email or Google chat/hangouts at amFtaWUuaXZhbm92QGdtYWlsLmNvbQo= (base64) or amFtaWUuaXZhbm92QGljbG91ZC5jb20K (base64) (both work with iMessage); my GitHub handle is also where you can find me on Freenode.
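The POC snippet itself isn't shown above, but a sketch of the approach described (select everything except the excluded columns, always keeping the primary key) might look like this; the model and column names are made up for illustration:

```python
from sqlalchemy import Column, Integer, String, create_engine, inspect
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class User(Base):
    __tablename__ = "users"
    id = Column(Integer, primary_key=True)
    name = Column(String(50))
    secret = Column(String(50))

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add(User(name="alice", secret="hunter2"))
session.commit()

def query_excluding(session, model, excludes):
    # Select only the columns NOT in `excludes`, but always keep the
    # primary key so each row stays addressable by key.
    mapper = inspect(model)
    pk_names = {c.name for c in mapper.primary_key}
    cols = [getattr(model, attr.key) for attr in mapper.column_attrs
            if attr.key in pk_names or attr.key not in excludes]
    return session.query(*cols)

row = query_excluding(session, User, ["secret"]).one()
print(row.id, row.name)  # the excluded column is simply never selected
```

Unlike deferred loading, this returns plain row tuples rather than mapped objects, which matches the "won't necessarily be the objects themselves" caveat above.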
What I'm trying to figure out is why I'm getting 683,672 calls to views/base.py->get_all_inclusions() because line 1465 is where my problem is:
Which happens to be the number of rows in the table:
When there is default pagination, why is flask-restless querying ALL objects in my table when, by default, it returns only 10 results? What should happen is: query the first 10 results, then, when the next page is requested, apply both an offset and a limit. I'm still rooting through the code to map everything out, so I figured I would bring this to @jfinkels's attention in case someone who knows the code layout has a better idea of what's going on before I finish figuring out the layout myself.
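The behaviour being asked for can be sketched as follows: apply LIMIT/OFFSET in SQL so only one page of rows ever leaves the database. The `paginate()` helper and 1-based page numbering below are illustrative, not flask-restless code:

```python
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Item(Base):
    __tablename__ = "items"
    id = Column(Integer, primary_key=True)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add_all(Item() for _ in range(95))
session.commit()

def paginate(query, page=1, page_size=10):
    # LIMIT/OFFSET are applied in SQL, so the database never
    # materialises the rows outside the requested page.
    return query.limit(page_size).offset((page - 1) * page_size).all()

first = paginate(session.query(Item).order_by(Item.id))
second = paginate(session.query(Item).order_by(Item.id), page=2)
print(len(first), first[0].id, second[0].id)
```

Note the `order_by`: LIMIT/OFFSET paging is only deterministic over a stable ordering.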
👍 Yes, kudos indeed; I've looked at a few flask-api packages and this one looked like the easiest to configure and use. I do agree with @AlmightyOatmeal that performance is a high priority. I have a few tables of about 27,000 records and it takes around 5-8 seconds for the 10-item paginated result to be retrieved to a web browser, meanwhile the server CPU is consumed at 100% processing (single core) :( I really don't know enough about Python to contribute, but I am learning, so perhaps I will dive in one of these weekends. :)
@roemhildtg, welcome to Python :) If you're interested in profiling your Flask application like I did, take a look at: http://werkzeug.pocoo.org/docs/0.11/contrib/profiler/ There are limitations to that, and it's not as detailed as something like line_profiler. I'm working on a module that can hook into requests in a similar fashion so that one can use any profiler they want (pprofile looks promising). Feel free to use my contact information from above if you want to reach me out-of-band (a bug report is not the best place for casual chat).
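Hooking up the werkzeug profiler mentioned above can be sketched like this. The import path shown is for modern werkzeug (on 0.11 the middleware lived at werkzeug.contrib.profiler), and the plain WSGI callable is a stand-in for a real Flask app:

```python
import io
from werkzeug.middleware.profiler import ProfilerMiddleware
from werkzeug.test import Client

def app(environ, start_response):
    # Stand-in for the real Flask app: any WSGI callable works here.
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]

# Capture the profile text in a buffer instead of spamming stdout;
# restrictions=[10] keeps only the ten most expensive entries.
stats = io.StringIO()
profiled = ProfilerMiddleware(app, stream=stats, restrictions=[10])

client = Client(profiled)
client.get("/")  # each request appends a cProfile summary to `stats`
print("function calls" in stats.getvalue())
```

For an actual Flask app the usual pattern is `app.wsgi_app = ProfilerMiddleware(app.wsgi_app, ...)` so every request gets profiled.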
Sorry to derail this issue, but have you seen flask-debugtoolbar? It's quite nice: profile in your browser and you can see all SQL queries. On a slightly more on-topic note, +1 @jfinkels for starting this project, it's great. I may have some time to help out with this issue.
@uSpike, I've heard of it, but the information on using it is sparse and doesn't seem to apply if I'm not using a Jinja template at any endpoint (it's purely a RESTful API).
@jfinkels @roemhildtg I created an add-on for werkzeug that uses pprofile to profile requests line-by-line. I'm also making a large number of changes to pprofile itself to better support integration and to mimic cProfile's methods for added support of the stats module. I attached the full output (~9MB) of my pprofile add-on, and I should have the pull requests for the other module's updates done this weekend.
@jfinkels, I'm thinking the problem could lie within _get_collection_helper() in base.py. I noticed that when I hit this line:
That 'instances' is a base query to the model so when
(by the way, it's kind of self-defeating to chain() a map() inside a set()) runs against a query that has no .limit() applied to it, ergo all items in the database are queried even though only 10 (by default) are supposed to be returned. There are a number of places where objects are renamed, so it's a little difficult to follow, but I think a lot of those methods could/should be broken down into separate functions or methods, where applicable, to help clean up the code, make it easier to test, and make it far easier to read. Due to the complexity of _get_collection_helper() in base.py, I'm not sure when I'll hit the nail on the head. Oh, and the driver has no default limit for the get() function.
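The effect of the missing .limit() can be sketched in isolation; the set(chain(map(...))) pattern below mirrors the one described, with made-up model names:

```python
from itertools import chain
from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class Post(Base):
    __tablename__ = "posts"
    id = Column(Integer, primary_key=True)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()
session.add_all(Post() for _ in range(1000))
session.commit()

instances = session.query(Post)  # base query, no limit applied

# What the code effectively does: iterating the un-limited query
# inside set(chain(map(...))) pulls every row from the database.
everything = set(chain(map(lambda p: p.id, instances)))

# What it should do: limit first, then serialize only the page.
page = set(chain(map(lambda p: p.id, instances.limit(10))))

print(len(everything), len(page))
```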
In base.py/get_collection_helper(), I commented out the following lines:
I get a response that takes approximately 536 ms, which is certainly better than 1-5 minutes. My question is: what is the purpose of this, and why is it trying to map my entire table?
See http://flask-restless.readthedocs.io/en/latest/includes.html. Basically, if a user requests
@jfinkels, but with no includes specified, there should be no reason for that logic to run. Also, it should not be scanning the entire table, especially when there is nothing more to include than what was already in the model. This is where the performance hit is coming from. I would rather lose that functionality than suffer unnecessary delays. I think there is a problem in how the logic is being applied.
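The guard being suggested might look like this sketch: skip the relationship walk entirely when the client requested no inclusions. The function body, `resources`, and the attribute lookup are stand-ins, not the real flask-restless code:

```python
def get_all_inclusions(resources, requested_includes):
    # Early exit: nothing requested means no relationship scan at all.
    if not requested_includes:
        return []
    included = []
    for resource in resources:
        for name in requested_includes:
            related = getattr(resource, name, None)
            if related is not None:
                included.append(related)
    return included

class Tag:  # minimal stand-ins for a post with one relationship
    pass

class Post:
    def __init__(self, tag):
        self.tags = tag

posts = [Post(Tag()) for _ in range(3)]
print(len(get_all_inclusions(posts, [])))        # no includes: no work
print(len(get_all_inclusions(posts, ["tags"])))  # explicit include
```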
I tested out the code change, and while it may not be the correct solution to the problem, since it removes some of the JSON API specification's functionality, it does appear that therein lies the issue. My queries on big tables went from 5 seconds to milliseconds when I commented out the section that @AlmightyOatmeal did in the pull request.
I'm having an issue similar to @AlmightyOatmeal's. I'm querying a large table (call it "posts") for a very tiny subset of records based on a many-to-many relationship with another table (call it "tags"). To be exact, I'm using this filter:

```json
{
  "name": "tags",
  "op": "any",
  "val": {
    "name": "name",
    "op": "eq",
    "val": "tag_name"
  }
}
```

From the database perspective, this restless query results in the following db query to count the results:

```sql
SELECT count(*) AS count_1
FROM posts
WHERE EXISTS (SELECT 1
    FROM posts_tag_mapping, posts_tags
    WHERE posts.id = posts_tag_mapping.post_id
      AND posts_tags.id = posts_tag_mapping.tag_id
      AND posts_tags.name = 'tag_name')
```

I haven't dug into the code to figure out what is generating this query, but changing the subquery to a join speeds things up significantly.
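The join-based rewrite described above could be sketched with SQLAlchemy Core like this; table and column names mirror the query in the comment, but the schema details are assumptions:

```python
from sqlalchemy import (Column, ForeignKey, Integer, MetaData, String,
                        Table, create_engine, func, select)

metadata = MetaData()
posts = Table("posts", metadata, Column("id", Integer, primary_key=True))
posts_tags = Table("posts_tags", metadata,
                   Column("id", Integer, primary_key=True),
                   Column("name", String(50)))
posts_tag_mapping = Table("posts_tag_mapping", metadata,
                          Column("post_id", ForeignKey("posts.id")),
                          Column("tag_id", ForeignKey("posts_tags.id")))

engine = create_engine("sqlite://")
metadata.create_all(engine)
with engine.begin() as conn:
    conn.execute(posts.insert(), [{"id": 1}, {"id": 2}])
    conn.execute(posts_tags.insert(), [{"id": 1, "name": "tag_name"}])
    conn.execute(posts_tag_mapping.insert(), [{"post_id": 1, "tag_id": 1}])

# JOIN instead of the correlated EXISTS: the planner can drive the scan
# from the small tag side; DISTINCT guards against duplicate mappings.
stmt = (select(func.count(posts.c.id.distinct()))
        .select_from(posts
                     .join(posts_tag_mapping,
                           posts.c.id == posts_tag_mapping.c.post_id)
                     .join(posts_tags,
                           posts_tags.c.id == posts_tag_mapping.c.tag_id))
        .where(posts_tags.c.name == "tag_name"))

with engine.connect() as conn:
    count = conn.execute(stmt).scalar()
print(count)
```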
Hi @spight! Thanks for the report, but that seems to be a separate issue, unrelated to the one being discussed here. Please open a new bug with the info above along with the SQL query you expect to see when filtering on a relation. As for the original issue (i.e. scanning the whole database for includes), I believe there should be a relatively straightforward fix for getting the included resources more efficiently, and I'd like to include a fix before releasing version 1.0.0b2. Is it as simple as disabling a check for included resources when none are requested?
Anything going on here? We are hitting the same issue…
same here
We migrated all codebases to Flask-REST-JSONAPI by now. It's worth the effort.
@Natureshadow, what about performance? Can you describe the issue with Flask-Restless and how it's going now with Flask-REST-JSONAPI? I had a look at Flask-REST-JSONAPI; it seems very nice. Although flexible, I think JSON API introduces a lot of overhead for programming the UI.
I'm running into a serious performance problem with flask-restless that is making me look at alternatives and that breaks my heart. The posts table contains almost 700k "posts" for a blog with one-to-one mappings for raw JSON blobs and a many-to-many for tags; with getting only 10 results for each page, the request is taking 1-2 minutes!
I've already been through the database thinking that was the issue, but it most certainly isn't; MySQL returns results in a second or two. Then I thought maybe it was SQLAlchemy, so I went through that with a fine-toothed comb, but that wasn't the problem either; I can use flask-restful, flask-potion, or SQLAlchemy by itself to generate the exact same result set in roughly 5 seconds or less.
The blog posts model:
The endpoint (all model endpoints are essentially the same for the time being due to testing):
The request:
And the logs:
I've tried dozens of permutations of model relationship adjustments, which have had little impact on the issue at hand. I'm really curious as to where "{map}" is being called from; map() is used in a number of locations within flask-restless, so I may need to do a little more performance profiling there.
But:
If someone has some ideas then I am most certainly all ears! I'm hoping to get this knocked out relatively soon! Cheers!