
Impact of number of partitions on cstore tables #249

Open
rameshkaranth opened this issue Apr 13, 2021 · 0 comments
rameshkaranth commented Apr 13, 2021

Hey team,

I have a metrics table that our application uses to aggregate items such as impressions and clicks by day, month, etc.

The table DDL is as follows:

CREATE TABLE public.metrics (
	region varchar NOT NULL,
	account_id bigserial NOT NULL,
	report_id varchar NOT NULL,
	report_date date NOT NULL,
	entity_id int8 NOT NULL,
	impressions int8 NULL,
	clicks int8 NULL,
	"sales14d" numeric(12,2) NULL,
	"conversions14d" int8 NULL,
	"cost" numeric(12,2) NULL
);

This table has grown very large and has become difficult to manage. As part of an optimization exercise we want to try partitioning, and to use cstore_fdw to speed up the calculation of aggregated metrics. I am using RANGE partitioning on the report_date column, and I am creating the foreign tables as below:
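(For the partitions below to attach, the parent `metrics` table has to be declared as a range-partitioned table; a minimal sketch, assuming the same column list as the DDL above:)

```sql
-- Assumed parent declaration: same columns as the DDL above,
-- plus the partitioning clause that the PARTITION OF children require.
CREATE TABLE public.metrics (
	-- ... columns as in the DDL above ...
	report_date date NOT NULL
) PARTITION BY RANGE (report_date);
```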

CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

CREATE FOREIGN TABLE tbl_metrics_fdw_2019_q1 PARTITION OF metrics FOR VALUES FROM ('2019-01-01') TO ('2019-04-01') SERVER cstore_server OPTIONS(compression 'pglz');

-- Partitions for the other quarters are created in a similar way

For data in recent months, the size per quarter was too big (~15-17 GB), so I decided to partition that data by month instead. After the partitioning was complete, I ran a simple select on the partitioned table, without any joins, for a date range covered by the monthly partitions, and the cost came out to be expensive.
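The monthly partitions were created analogously to the quarterly ones; a sketch, using the partition names that appear in the plan below (the exact date bounds are assumed):

```sql
-- Monthly foreign partitions, created the same way as the quarterly ones.
CREATE FOREIGN TABLE tbl_metrics_partition_fdw_2021_m1 PARTITION OF metrics
    FOR VALUES FROM ('2021-01-01') TO ('2021-02-01')
    SERVER cstore_server OPTIONS(compression 'pglz');

CREATE FOREIGN TABLE tbl_metrics_partition_fdw_2021_m2 PARTITION OF metrics
    FOR VALUES FROM ('2021-02-01') TO ('2021-03-01')
    SERVER cstore_server OPTIONS(compression 'pglz');
```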

Select query:

EXPLAIN ANALYSE SELECT date_trunc('day',report_date), sum(cost) FROM tbl_amzn_seller_report_partition_master tasrpm WHERE profile_id = 495164040385661 AND report_date 
BETWEEN '2021-01-01' AND '2021-03-30' GROUP BY 1 ORDER BY 1 desc

Below is the explain plan; it took ~4.3 s. When I ran the same query against another period where the partitioning was by quarter, the cost was (cost=311568.50..311745.74 rows=1266 width=40) (actual time=2388.600..2388.669 rows=3 loops=1), which is significantly less.

Explain analyse plan:

Finalize GroupAggregate  (cost=844043.72..844380.29 rows=1266 width=40) (actual time=4368.766..4368.852 rows=3 loops=1)
  Group Key: (date_trunc('month'::text, (tasrpm_2.report_date)::timestamp with time zone))
  ->  Gather Merge  (cost=844043.72..844339.14 rows=2532 width=40) (actual time=4368.760..4368.844 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Sort  (cost=843043.70..843046.86 rows=1266 width=40) (actual time=4284.954..4284.956 rows=1 loops=3)
              Sort Key: (date_trunc('month'::text, (tasrpm_2.report_date)::timestamp with time zone)) DESC
              Sort Method: quicksort  Memory: 25kB
              Worker 0:  Sort Method: quicksort  Memory: 25kB
              Worker 1:  Sort Method: quicksort  Memory: 25kB
              ->  Partial HashAggregate  (cost=842956.31..842978.46 rows=1266 width=40) (actual time=4284.923..4284.932 rows=1 loops=3)
                    Group Key: (date_trunc('month'::text, (tasrpm_2.report_date)::timestamp with time zone))
                    ->  Parallel Append  (cost=0.00..840944.78 rows=402306 width=11) (actual time=2.466..4150.730 rows=323148 loops=3)
                          ->  Foreign Scan on tbl_metrics_partition_fdw_2021_m3 tasrpm_2  (cost=0.00..461043.79 rows=359574 width=11) (actual time=2.571..4060.444 rows=363021 loops=1)
                                Filter: ((report_date >= '2021-01-01'::date) AND (report_date <= '2021-03-30'::date) AND (profile_id = '495164040385661'::bigint))
                                Rows Removed by Filter: 14146979
                                CStore File: /postgresql_data/data/cstore_fdw/16393/17708
                                CStore File Size: 475126005
                          ->  Foreign Scan on tbl_metrics_partition_fdw_2021_m1 tasrpm  (cost=0.00..437256.57 rows=275958 width=11) (actual time=2.453..4036.797 rows=287911 loops=1)
                                Filter: ((report_date >= '2021-01-01'::date) AND (report_date <= '2021-03-30'::date) AND (profile_id = '495164040385661'::bigint))
                                Rows Removed by Filter: 14882089
                                CStore File: /postgresql_data/data/cstore_fdw/16393/17697
                                CStore File Size: 453330868
                          ->  Foreign Scan on tbl_metrics_partition_fdw_2021_m2 tasrpm_1  (cost=0.00..401676.68 rows=330005 width=11) (actual time=2.373..4223.470 rows=318513 loops=1)
                                Filter: ((report_date >= '2021-01-01'::date) AND (report_date <= '2021-03-30'::date) AND (profile_id = '495164040385661'::bigint))
                                Rows Removed by Filter: 13891487
                                CStore File: /postgresql_data/data/cstore_fdw/16393/17703
                                CStore File Size: 414883929
Planning Time: 1.633 ms
Execution Time: 4370.064 ms

Based on the above context, I have the following questions:

  • Is the creation of more partitions the cause of the increased cost? Our use case requires looking at these metrics by day, week, month(s), even up to a year. Are there any guidelines for partitioning tables with cstore_fdw, i.e. should the split be based on size or on other metrics?
  • How many partitions are considered bad for performance?
  • Given the large number of rows in the metrics table, are there suggested values for stripe_row_count & block_row_count?
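For reference, I understand these are per-table options set in the OPTIONS clause; a sketch showing them spelled out explicitly at cstore_fdw's documented defaults (stripe_row_count 150000, block_row_count 10000), which is what I would be tuning (the partition name and bounds here are hypothetical):

```sql
-- Sketch: stripe_row_count/block_row_count made explicit at their
-- documented defaults; larger values can improve compression and
-- scan throughput at the cost of coarser skipping granularity.
CREATE FOREIGN TABLE tbl_metrics_fdw_2021_m4 PARTITION OF metrics
    FOR VALUES FROM ('2021-04-01') TO ('2021-05-01')
    SERVER cstore_server
    OPTIONS(compression 'pglz',
            stripe_row_count '150000',
            block_row_count '10000');
```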