Optimise evaluation strategy in circumstances where segmentation is local to each MeasureProvider #2
Yup, @matthewwardrop this describes the problem, but in addition is the larger question of whether Mensor wants to support use cases where there are no primary keys or subjects. e.g. you could imagine having only 2 data sources, an impression log and a click log. Neither of these have a primary key but the query below is still a valid query.
@danfrankj I personally think the null unit type case is not really in the purview of mensor: it would be straightforward to associate an impression log or click log with an artificial id. I'm willing to be convinced on this point, but I'd need a super clear example of its utility that is not better and more intuitively served by some other pathway.
Forgive me if I'm missing the point of this issue, but I'm wondering: if the metric evaluation above is allowed without joining on the primary/foreign key, why does the solution need to remain a single query? Why not compute the two measures as separate queries and then merge the results in a subsequent step? In other words, when a join on the key is necessary, any query optimisation should be handled by the underlying provider/platform; and when the join on the key is not needed, why not simply issue separate queries?
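To make the "separate queries, then merge" idea concrete, here is a minimal sketch of what the merge step could look like. This is purely illustrative, not mensor's actual API: the function name `merge_results`, the row shapes, and the column names are all hypothetical; only the `ds` segment key and the `age`/`lead_time` measures come from the issue.

```python
def merge_results(per_provider_results, key="ds"):
    """Merge independently-evaluated per-segment aggregates on a shared
    segmentation key (an outer-join-like merge on `key`).

    Hypothetical helper; not part of mensor.
    """
    merged = {}
    for result in per_provider_results:
        for row in result:
            # Accumulate each provider's measure columns under the segment value.
            merged.setdefault(row[key], {key: row[key]}).update(
                {k: v for k, v in row.items() if k != key}
            )
    return sorted(merged.values(), key=lambda r: r[key])


# Each provider aggregates locally on `ds`, then the results are merged.
ages = [{"ds": "2018-01-01", "age_sum": 310.0}]
leads = [{"ds": "2018-01-01", "lead_time_sum": 42.0}]
print(merge_results([ages, leads]))
```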
@agorajek Hey Olek! Thanks for chiming in! The main reason this might be desirable, instead of running the evaluations separately, is when you want to push higher-level calculations down into (e.g.) SQL. If a Metric depends on these two measures and you split the evaluation up, you could not push the result into SQL, and would have to compute it locally in Python. That's not always prohibitive (since it would be quite cheap), but if you wanted the results cached in a database table, not being able to squash it down would be annoying. Perhaps I'm missing something?
@matthewwardrop that makes sense. But then my question would be: why are you performing a join between users and signups at all (as shown in the first query)? Is the answer simply "because that part of the code hasn't been optimised yet"? IMO both queries may be valid depending on what question you are asking of the data. Just to clarify, I don't like the first query either.
@agorajek The provided SQL was meant to demonstrate the form of the query rather than represent it verbatim. Regarding the necessity of the join: the main objective of the original join strategy was to prevent statistical unit counts from being under-estimated during joins that do not involve a primary table.

Currently, the evaluation strategy bundles each set of query statements for a given statistical unit type into a single root query (drawn from a primary provider of that statistical unit type) and joins all other provisions into it. All nested statistical units follow the same pattern, and are then joined into the parent query on their statistical unit type key. This is nice because it is a relatively simple recursive algorithm and is guaranteed to work in all circumstances. For the more interesting use cases it will be (close to) optimal, since you will be using a broader selection of measures and dimensions, and joins will be essential. It is only suboptimal in simpler cases, such as the one noted above.

It was always my plan to improve the optimality of the algorithm; I just believe gunning for completeness and correctness takes priority :). I cannot remember where I read the mantra, but I think it is a good one: make it work, make it correct, and then make it fast.
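The recursive strategy described above can be sketched roughly as follows. This is an editorial illustration of the shape of the algorithm, not mensor's internals; `build_query`, its arguments, and the generated SQL are all assumptions.

```python
def build_query(unit_key, provider_sqls, nested=()):
    """Sketch of the recursive evaluation strategy described above:
    the first provider for a unit type becomes the root query, every
    other provision is joined into it on the unit key, and nested unit
    types are built the same way and joined into the parent.

    Hypothetical helper; names and SQL shape are illustrative only.
    """
    primary, *others = provider_sqls
    sql = f"SELECT * FROM ({primary}) AS root"
    for i, other in enumerate(others):
        # Join every other provision into the root on the unit key.
        sql += f" JOIN ({other}) AS j{i} USING ({unit_key})"
    for child_key, child_providers in nested:
        # Recurse for nested statistical unit types, then join the
        # resulting subquery into the parent on the child's unit key.
        child_sql = build_query(child_key, child_providers)
        sql += f" JOIN ({child_sql}) AS n_{child_key} USING ({child_key})"
    return sql


print(build_query("user_id",
                  ["SELECT * FROM users", "SELECT * FROM signups"]))
```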
@matthewwardrop thanks for the last explanation; it clarified this issue for me. And by the way, I'd never heard this mantra before, but I fully adhere to it.
Currently the evaluation strategy always joins measures in by joining MeasureProviders on the nominal statistical unit type. In cases where segmentation can be done entirely locally to each MeasureProvider, this is inefficient.
For example, if you have two measure providers:
And you run:
`m.evaluate('user', measures=['age', 'lead_time'], segment_by=['ds'])`
The current SQL generated will be something like:
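The generated SQL did not survive this capture, so the following is a rough, runnable reconstruction of the join-based form the issue describes, using an in-memory SQLite database. The table and column names (`users`, `signups`, `user_id`, `age`, `lead_time`, `ds`) are assumptions drawn from the discussion, not mensor's verbatim output.

```python
import sqlite3

# Toy schema assumed from the discussion: a primary `users` table keyed
# by user id, and a `signups` provision joined in on that key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, ds TEXT, age REAL);
    CREATE TABLE signups (user_id INTEGER, ds TEXT, lead_time REAL);
    INSERT INTO users VALUES (1, '2018-01-01', 30), (2, '2018-01-01', 40);
    INSERT INTO signups VALUES (1, '2018-01-01', 2.0), (2, '2018-01-01', 4.0);
""")

# Join-based form: materialise one row per user before aggregating, even
# though the segmentation (`ds`) is available locally in both tables.
current_form = """
    SELECT ds, SUM(age) AS age_sum, SUM(lead_time) AS lead_time_sum
    FROM (
        SELECT u.id, u.ds, u.age, s.lead_time
        FROM users u
        LEFT JOIN signups s ON s.user_id = u.id AND s.ds = u.ds
    )
    GROUP BY ds
"""
print(conn.execute(current_form).fetchall())
```

The per-user join is what makes this costly when the cardinality of users is large: the join runs over every user row before anything is aggregated.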
Especially if the cardinality of users is large, this can be needlessly computationally taxing.
Instead, we should add an optimisation when all requested segmentation can be performed locally by all measure providers such that the generated SQL looks something like:
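Again as a rough illustration only (the real generated SQL is not shown here, and the schema is assumed), the optimised form would let each provider aggregate locally on the segment key and then join the small per-segment results:

```python
import sqlite3

# Same toy schema as assumed above, reproduced so this sketch is
# self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, ds TEXT, age REAL);
    CREATE TABLE signups (user_id INTEGER, ds TEXT, lead_time REAL);
    INSERT INTO users VALUES (1, '2018-01-01', 30), (2, '2018-01-01', 40);
    INSERT INTO signups VALUES (1, '2018-01-01', 2.0), (2, '2018-01-01', 4.0);
""")

# Optimised form: each provider aggregates locally on `ds`, so the join
# is over one row per segment instead of one row per user.
optimised_form = """
    SELECT a.ds, a.age_sum, b.lead_time_sum
    FROM (SELECT ds, SUM(age) AS age_sum
          FROM users GROUP BY ds) a
    JOIN (SELECT ds, SUM(lead_time) AS lead_time_sum
          FROM signups GROUP BY ds) b
      ON a.ds = b.ds
"""
print(conn.execute(optimised_form).fetchall())
```

On this toy data both forms return the same per-`ds` totals, but here the join operates on pre-aggregated rows, whose count is bounded by the number of segments rather than the number of users.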
Which potentially reduces the computational complexity of the join substantially in these simple evaluation scenarios.
Thanks for suggesting/flagging this @danfrankj