Optimise evaluation strategy in circumstances where segmentation is local to each MeasureProvider #2
Yup, @matthewwardrop this describes the problem, but in addition is the larger question of whether Mensor wants to support use cases where there are no primary keys or subjects. e.g. you could imagine having only 2 data sources, an impression log and a click log. Neither of these have a primary key but the query below is still a valid query.
@danfrankj I personally think the null unit type case is not really in the purview of mensor: it would be straightforward to associate an impression log or click log with an artificial id. I'm willing to be convinced on this point, but I'd need a super clear example of its utility that is not better and more intuitively served by some other pathway.
Forgive me if I'm missing the point of this issue, but I'm wondering: if the metric evaluation above is allowed without joining on the primary/foreign key, why does the solution need to remain a single query? Why not compute the two measures as separate queries and then merge the results in a subsequent step? In other words, when a join on the key is necessary, any query optimisation should be handled by the underlying provider/platform; and when the join on the key is not needed, why not simply issue separate queries?
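To make the "separate queries, then merge" idea concrete, here is a minimal sketch of what the merge step could look like. This is purely illustrative, not mensor's actual API: the function name `merge_results`, the row shapes, and the column names are all hypothetical; only the `ds` segment key and the `age`/`lead_time` measures come from the issue.

```python
def merge_results(per_provider_results, key="ds"):
    """Merge independently-evaluated per-segment aggregates on a shared
    segmentation key (an outer-join-like merge on `key`).

    Hypothetical helper; not part of mensor.
    """
    merged = {}
    for result in per_provider_results:
        for row in result:
            # Accumulate each provider's measure columns under the segment value.
            merged.setdefault(row[key], {key: row[key]}).update(
                {k: v for k, v in row.items() if k != key}
            )
    return sorted(merged.values(), key=lambda r: r[key])


# Each provider aggregates locally on `ds`, then the results are merged.
ages = [{"ds": "2018-01-01", "age_sum": 310.0}]
leads = [{"ds": "2018-01-01", "lead_time_sum": 42.0}]
print(merge_results([ages, leads]))
```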
@agorajek Hey Olek! Thanks for chiming in! The main reason this might be desirable, instead of running the evaluations separately, is when you want to push higher-level calculations down into (e.g.) SQL. If a Metric depends on these two measures and you split the evaluation up, you could not push the result into SQL, and would have to compute it locally in Python. That's not always prohibitive (since it would be quite cheap), but if you wanted the results cached in a database table, not being able to squash it down would be annoying. Perhaps I'm missing something?
@matthewwardrop that makes sense. But then my question would be: why are you performing a join between users and signups at all (as shown in the first query)? Is the answer simply "because that part of the code hasn't been optimised yet"? IMO both queries may be valid depending on what question you are asking of the data. Just to clarify, I don't like the first query either.
@agorajek The provided SQL was meant to demonstrate the form of the query rather than represent it verbatim. Regarding the necessity of the join: the main objective of the original join strategy was to prevent statistical unit counts from being under-estimated during joins that do not involve a primary table.

Currently, the evaluation strategy bundles each set of query statements for a given statistical unit type into a single root query (drawn from a primary provider of that statistical unit type) and joins all other provisions into it. All nested statistical units follow the same pattern, and are then joined into the parent query on their statistical unit type key. This is nice because it is a relatively simple recursive algorithm and is guaranteed to work in all circumstances. For the more interesting use cases it will be (close to) optimal, since you will be using a broader selection of measures and dimensions, and joins will be essential. It is only suboptimal in simpler cases, such as the one noted above.

It was always my plan to improve the optimality of the algorithm; I just believe gunning for completeness and correctness takes priority :). I cannot remember where I read the mantra, but I think it is a good one: make it work, make it correct, and then make it fast.
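The recursive strategy described above can be sketched roughly as follows. This is an editorial illustration of the shape of the algorithm, not mensor's internals; `build_query`, its arguments, and the generated SQL are all assumptions.

```python
def build_query(unit_key, provider_sqls, nested=()):
    """Sketch of the recursive evaluation strategy described above:
    the first provider for a unit type becomes the root query, every
    other provision is joined into it on the unit key, and nested unit
    types are built the same way and joined into the parent.

    Hypothetical helper; names and SQL shape are illustrative only.
    """
    primary, *others = provider_sqls
    sql = f"SELECT * FROM ({primary}) AS root"
    for i, other in enumerate(others):
        # Join every other provision into the root on the unit key.
        sql += f" JOIN ({other}) AS j{i} USING ({unit_key})"
    for child_key, child_providers in nested:
        # Recurse for nested statistical unit types, then join the
        # resulting subquery into the parent on the child's unit key.
        child_sql = build_query(child_key, child_providers)
        sql += f" JOIN ({child_sql}) AS n_{child_key} USING ({child_key})"
    return sql


print(build_query("user_id",
                  ["SELECT * FROM users", "SELECT * FROM signups"]))
```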
@matthewwardrop thanks for the last explanation; it clarified this issue for me. And by the way, I'd never heard this mantra before, but I fully adhere to it.
Currently the evaluation strategy always joins measures in by joining MeasureProviders on the nominal statistical unit type. In cases where segmentation can be done entirely locally to each MeasureProvider, this is inefficient.
For example, if you have two measure providers:
And you run:
`m.evaluate('user', measures=['age', 'lead_time'], segment_by=['ds'])`
The current SQL generated will be something like:
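The generated SQL did not survive this capture, so the following is a rough, runnable reconstruction of the join-based form the issue describes, using an in-memory SQLite database. The table and column names (`users`, `signups`, `user_id`, `age`, `lead_time`, `ds`) are assumptions drawn from the discussion, not mensor's verbatim output.

```python
import sqlite3

# Toy schema assumed from the discussion: a primary `users` table keyed
# by user id, and a `signups` provision joined in on that key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, ds TEXT, age REAL);
    CREATE TABLE signups (user_id INTEGER, ds TEXT, lead_time REAL);
    INSERT INTO users VALUES (1, '2018-01-01', 30), (2, '2018-01-01', 40);
    INSERT INTO signups VALUES (1, '2018-01-01', 2.0), (2, '2018-01-01', 4.0);
""")

# Join-based form: materialise one row per user before aggregating, even
# though the segmentation (`ds`) is available locally in both tables.
current_form = """
    SELECT ds, SUM(age) AS age_sum, SUM(lead_time) AS lead_time_sum
    FROM (
        SELECT u.id, u.ds, u.age, s.lead_time
        FROM users u
        LEFT JOIN signups s ON s.user_id = u.id AND s.ds = u.ds
    )
    GROUP BY ds
"""
print(conn.execute(current_form).fetchall())
```

The per-user join is what makes this costly when the cardinality of users is large: the join runs over every user row before anything is aggregated.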
Especially if the cardinality of users is large, this can be needlessly computationally taxing.
Instead, we should add an optimisation when all requested segmentation can be performed locally by all measure providers such that the generated SQL looks something like:
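Again as a rough illustration only (the real generated SQL is not shown here, and the schema is assumed), the optimised form would let each provider aggregate locally on the segment key and then join the small per-segment results:

```python
import sqlite3

# Same toy schema as assumed above, reproduced so this sketch is
# self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, ds TEXT, age REAL);
    CREATE TABLE signups (user_id INTEGER, ds TEXT, lead_time REAL);
    INSERT INTO users VALUES (1, '2018-01-01', 30), (2, '2018-01-01', 40);
    INSERT INTO signups VALUES (1, '2018-01-01', 2.0), (2, '2018-01-01', 4.0);
""")

# Optimised form: each provider aggregates locally on `ds`, so the join
# is over one row per segment instead of one row per user.
optimised_form = """
    SELECT a.ds, a.age_sum, b.lead_time_sum
    FROM (SELECT ds, SUM(age) AS age_sum
          FROM users GROUP BY ds) a
    JOIN (SELECT ds, SUM(lead_time) AS lead_time_sum
          FROM signups GROUP BY ds) b
      ON a.ds = b.ds
"""
print(conn.execute(optimised_form).fetchall())
```

On this toy data both forms return the same per-`ds` totals, but here the join operates on pre-aggregated rows, whose count is bounded by the number of segments rather than the number of users.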
Which potentially reduces the computational complexity of the join substantially in these simple evaluation scenarios.
Thanks for suggesting/flagging this @danfrankj