Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

attempt to ensure a better query execution plan for test run stats by forbidding hash joins (and some others) #809

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Pascal-Delange
Copy link
Contributor

@Pascal-Delange Pascal-Delange commented Jan 29, 2025

TL:DR

The query optimizer was in some cases (at least: in the case where there are few phantom decisions) making a really bad decision on the execution plan for getting rule execution statistics on a test run, namely by planning a full table scan of decision rules.

In detail, I forbid hash joins, merge joins, and the reordering of joins for the query.

Details

Disclaimer

Sorry for the big empty space below the text, it seems this is an unexpected side effect of using the details tag in github markdown

Now to the real details

Considering the query for getting rule execution stats on (phantom_)decisions, as below

select scir.stable_rule_id, scir.name, dr.outcome, scit.version, COUNT(*) as total
from  phantom_decisions as d (respectively decisions as d)
   join scenario_iterations as scit ON scit.id = d.scenario_iteration_id
   join scenario_iteration_rules as scir ON scir.scenario_iteration_id = scit.id
   join decision_rules as dr ON dr.rule_id = scir.id and dr.decision_id = d.id
where d.created_at>'2025-01-20 14:02:05.0185+00'
   and d.created_at<'2025-01-20 23:00:00+00'
   and d.org_id='515974e4-5b4d-4091-8fb0-0473d3b0da25'
   and scir.scenario_iteration_id='515974e4-f41f-4b3d-ba26-f540885b17c0'
group by scir.stable_rule_id, scir.name, dr.outcome, scit.version;
Click to toggle the query plan in the case of starting with the decisions table
GroupAggregate  (cost=40104.52..40104.55 rows=1 width=67) (actual time=32.587..35.042 rows=9 loops=1)
  Group Key: scir.stable_rule_id, scir.name, dr.outcome, scit.version
  Buffers: shared hit=18718
  ->  Sort  (cost=40104.52..40104.53 rows=1 width=59) (actual time=32.415..33.066 rows=10000 loops=1)
        Sort Key: scir.stable_rule_id, scir.name, dr.outcome, scit.version
        Sort Method: quicksort  Memory: 1228kB
        Buffers: shared hit=18718
        ->  Nested Loop  (cost=39.67..40104.51 rows=1 width=59) (actual time=1.898..24.923 rows=10000 loops=1)
              Join Filter: (dr.rule_id = scir.id)
              Rows Removed by Join Filter: 20000
              Buffers: shared hit=18718
              ->  Nested Loop  (cost=39.67..39874.57 rows=41 width=40) (actual time=1.763..18.080 rows=10000 loops=1)
                    Buffers: shared hit=18653
                    ->  Nested Loop  (cost=39.09..51.14 rows=1 width=34) (actual time=1.745..8.362 rows=2000 loops=1)
                          Buffers: shared hit=8058
                          ->  Bitmap Heap Scan on decisions d  (cost=38.82..42.84 rows=1 width=32) (actual time=1.730..5.406 rows=2000 loops=1)
                                Recheck Cond: ((org_id = '515974e4-5b4d-4091-8fb0-0473d3b0da25'::uuid) AND (created_at > '2025-01-20 14:02:05.0185+00'::timestamp with time zone) AND (created_at < '2025-01-20 23:00:00+00'::timestamp with time zone) AND (scenario_iteration_id = '515974e4-c25f-4536-a8fa-b1aac0c1e676'::uuid))
                                Heap Blocks: exact=1999
                                Buffers: shared hit=2058
                                ->  BitmapAnd  (cost=38.82..38.82 rows=1 width=0) (actual time=1.437..1.439 rows=0 loops=1)
                                      Buffers: shared hit=59
                                      ->  Bitmap Index Scan on decisions_by_org_id_index  (cost=0.00..4.83 rows=22 width=0) (actual time=0.522..0.523 rows=2000 loops=1)
                                            Index Cond: ((org_id = '515974e4-5b4d-4091-8fb0-0473d3b0da25'::uuid) AND (created_at > '2025-01-20 14:02:05.0185+00'::timestamp with time zone) AND (created_at < '2025-01-20 23:00:00+00'::timestamp with time zone))
                                            Buffers: shared hit=52
                                      ->  Bitmap Index Scan on decisions_scenario_iteration_id_idx  (cost=0.00..33.74 rows=2307 width=0) (actual time=0.835..0.835 rows=4000 loops=1)
                                            Index Cond: (scenario_iteration_id = '515974e4-c25f-4536-a8fa-b1aac0c1e676'::uuid)
                                            Buffers: shared hit=7
                          ->  Index Scan using scenario_iterations_pkey on scenario_iterations scit  (cost=0.27..8.29 rows=1 width=18) (actual time=0.001..0.001 rows=1 loops=2000)
                                Index Cond: (id = '515974e4-c25f-4536-a8fa-b1aac0c1e676'::uuid)
                                Buffers: shared hit=6000
                    ->  Index Scan using decision_rules_decisionid_idx on decision_rules dr  (cost=0.57..39671.40 rows=15203 width=38) (actual time=0.002..0.004 rows=5 loops=2000)
                          Index Cond: (decision_id = d.id)
                          Buffers: shared hit=10595
              ->  Materialize  (cost=0.00..228.10 rows=3 width=83) (actual time=0.000..0.000 rows=3 loops=10000)
                    Buffers: shared hit=65
                    ->  Seq Scan on scenario_iteration_rules scir  (cost=0.00..228.09 rows=3 width=83) (actual time=0.129..0.133 rows=5 loops=1)
                          Filter: (scenario_iteration_id = '515974e4-c25f-4536-a8fa-b1aac0c1e676'::uuid)
                          Rows Removed by Filter: 454
                          Buffers: shared hit=65
Planning:
  Buffers: shared hit=63
Planning Time: 0.756 ms
Execution Time: 35.168 ms
=> nothing super fancy, but notice the loop join with decision_rules
Click to toggle the query plan in the case of starting with the phantom_decisions table
Finalize GroupAggregate  (cost=25704916.21..25705246.52 rows=12 width=67)
  Group Key: scir.stable_rule_id, scir.name, dr.outcome, scit.version
  ->  Gather Merge  (cost=25704916.21..25705246.10 rows=24 width=67)
        Workers Planned: 2
        ->  Partial GroupAggregate  (cost=25703916.18..25704243.30 rows=12 width=67)
              Group Key: scir.stable_rule_id, scir.name, dr.outcome, scit.version
              ->  Sort  (cost=25703916.18..25703970.68 rows=21800 width=59)
                    Sort Key: scir.stable_rule_id, scir.name, dr.outcome, scit.version
                    ->  Hash Join  (cost=369.72..25702345.27 rows=21800 width=59)
                          Hash Cond: (dr.decision_id = d.id)
                          ->  Hash Join  (cost=228.12..25700330.42 rows=441403 width=89)
                                Hash Cond: (dr.rule_id = scir.id)
                                ->  Parallel Seq Scan on decision_rules dr  (cost=0.00..25139594.42 rows=212903242 width=38)
                                ->  Hash  (cost=228.09..228.09 rows=3 width=83)
                                      ->  Seq Scan on scenario_iteration_rules scir  (cost=0.00..228.09 rows=3 width=83)
                                            Filter: (scenario_iteration_id = '515974e4-f41f-4b3d-ba26-f540885b17c0'::uuid)
                          ->  Hash  (cost=119.37..119.37 rows=1778 width=34)
                                ->  Nested Loop  (cost=0.27..119.37 rows=1778 width=34)
                                      ->  Index Scan using scenario_iterations_pkey on scenario_iterations scit  (cost=0.27..8.29 rows=1 width=18)
                                            Index Cond: (id = '515974e4-f41f-4b3d-ba26-f540885b17c0'::uuid)
                                      ->  Seq Scan on phantom_decisions d  (cost=0.00..93.30 rows=1778 width=32)
                                            Filter: ((created_at > '2025-01-20 14:02:05.0185+00'::timestamp with time zone) AND (created_at < '2025-01-20 23:00:00+00'::timestamp with time zone) AND (scenario_iteration_id = '515974e4-f41f-4b3d-ba26-f540885b17c0'::uuid) AND (org_id = '515974e4-5b4d-4091-8fb0-0473d3b0da25'::uuid))

=> notice the hash join with full table scan on decision_rules, which is never going to complete

Click to toggle the query plan in the case of starting with the phantom_decisions table, after the PR
GroupAggregate  (cost=195623148.56..195623933.47 rows=12 width=67) (actual time=60.289..62.203 rows=6 loops=1)
  Group Key: scir.stable_rule_id, scir.name, dr.outcome, scit.version
  Buffers: shared hit=41249
  ->  Sort  (cost=195623148.56..195623279.36 rows=52319 width=59) (actual time=60.065..60.605 rows=8000 loops=1)
        Sort Key: scir.stable_rule_id, scir.name, dr.outcome, scit.version
        Sort Method: quicksort  Memory: 880kB
        Buffers: shared hit=41249
        ->  Nested Loop  (cost=0.84..195617077.55 rows=52319 width=59) (actual time=0.257..53.464 rows=8000 loops=1)
              Join Filter: (scir.id = dr.rule_id)
              Rows Removed by Join Filter: 24000
              Buffers: shared hit=41240
              ->  Nested Loop  (cost=0.27..414.14 rows=5334 width=85) (actual time=0.124..4.770 rows=8000 loops=1)
                    Buffers: shared hit=264
                    ->  Nested Loop  (cost=0.27..119.37 rows=1778 width=34) (actual time=0.041..1.150 rows=2000 loops=1)
                          Buffers: shared hit=54
                          ->  Index Scan using scenario_iterations_pkey on scenario_iterations scit  (cost=0.27..8.29 rows=1 width=18) (actual time=0.021..0.023 rows=1 loops=1)
                                Index Cond: (id = '515974e4-f41f-4b3d-ba26-f540885b17c0'::uuid)
                                Buffers: shared hit=3
                          ->  Seq Scan on phantom_decisions d  (cost=0.00..93.30 rows=1778 width=32) (actual time=0.019..0.826 rows=2000 loops=1)
                                Filter: ((created_at > '2025-01-20 14:02:05.0185+00'::timestamp with time zone) AND (created_at < '2025-01-20 23:00:00+00'::timestamp with time zone) AND (scenario_iteration_id = '515974e4-f41f-4b3d-ba26-f540885b17c0'::uuid) AND (org_id = '515974e4-5b4d-4091-8fb0-0473d3b0da25'::uuid))
                                Rows Removed by Filter: 115
                                Buffers: shared hit=51
                    ->  Materialize  (cost=0.00..228.10 rows=3 width=83) (actual time=0.000..0.001 rows=4 loops=2000)
                          Buffers: shared hit=210
                          ->  Seq Scan on scenario_iteration_rules scir  (cost=0.00..228.09 rows=3 width=83) (actual time=0.071..0.400 rows=4 loops=1)
                                Filter: (scenario_iteration_id = '515974e4-f41f-4b3d-ba26-f540885b17c0'::uuid)
                                Rows Removed by Filter: 1453
                                Buffers: shared hit=210
              ->  Index Scan using decision_rules_decisionid_idx on decision_rules dr  (cost=0.57..36496.13 rows=14193 width=38) (actual time=0.004..0.005 rows=4 loops=8000)
                    Index Cond: (decision_id = d.id)
                    Buffers: shared hit=40976
Planning:
  Buffers: shared hit=265
Planning Time: 2.733 ms
Execution Time: 62.560 ms

=> the hash join is gone, the query can execute kind of quickly (it may still be slowish if there are a lot of decisions, but now we have a chance, and we can reuse this query to precompute the stats as we go)

@Pascal-Delange Pascal-Delange force-pushed the pascal/tentative-fix-testrun-stats-timeout branch from cbb2374 to 1b6376a Compare January 29, 2025 20:32
@Pascal-Delange Pascal-Delange changed the title attempt to ensure a better query execution plan for test run stats by… attempt to ensure a better query execution plan for test run stats by forbidding hash joins (and some others) Jan 29, 2025
@Pascal-Delange Pascal-Delange requested a review from a team January 29, 2025 20:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants