unable to create_distributed_table() #7798

Open
34code opened this issue Dec 20, 2024 · 8 comments

@34code

34code commented Dec 20, 2024

Here is the error I see (more than 5 times now, each time after running for about 2 hours):

NOTICE:  Copying data from local table...
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

And here are the logs from the server at the same time:

2024-12-20 17:43:07.787 UTC [129221] LOG:  invalid length of startup packet
2024-12-20 17:43:09.224 UTC [129222] LOG:  invalid length of startup packet
2024-12-20 17:43:10.845 UTC [129223] LOG:  invalid length of startup packet
2024-12-20 17:43:12.078 UTC [129224] LOG:  invalid length of startup packet
2024-12-20 18:09:24.342 UTC [130501] FATAL:  unsupported frontend protocol 65363.19778: server supports 3.0 to 3.0
2024-12-20 18:09:25.015 UTC [130504] LOG:  invalid length of startup packet
2024-12-20 18:09:25.196 UTC [130505] LOG:  invalid length of startup packet
2024-12-20 18:09:25.440 UTC [130514] LOG:  invalid length of startup packet
2024-12-20 18:09:25.522 UTC [130515] LOG:  invalid length of startup packet
2024-12-20 18:09:25.627 UTC [130516] LOG:  invalid length of startup packet
2024-12-20 18:09:25.753 UTC [130517] LOG:  invalid length of startup packet
2024-12-20 18:09:25.831 UTC [130518] LOG:  invalid length of startup packet
2024-12-20 18:09:25.907 UTC [130519] LOG:  invalid length of startup packet
2024-12-20 18:09:26.031 UTC [130520] LOG:  invalid length of startup packet
2024-12-20 18:09:26.133 UTC [130521] LOG:  invalid length of startup packet
2024-12-20 18:09:26.422 UTC [130522] LOG:  invalid length of startup packet
2024-12-20 18:09:26.677 UTC [130523] LOG:  invalid length of startup packet
2024-12-20 18:09:28.188 UTC [130503] LOG:  could not receive data from client: Connection reset by peer

About 70 GB of data gets transferred to the worker nodes (I saw disk usage grow by that much with "df -h" on the worker nodes), but the operation never completes successfully.
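
One line in the server log above can be decoded further. PostgreSQL's startup packet carries the protocol version as a 32-bit integer, with the major version in the upper 16 bits and the minor version in the lower 16, so the "65363.19778" in the FATAL message corresponds to four raw bytes. The minimal Python sketch below repacks them (the function name is made up for illustration). The result spells the SMB1 header magic, which suggests, as an inference rather than anything confirmed in this thread, that some non-PostgreSQL client such as a port scanner was probing the port, and that the "invalid length of startup packet" lines may be unrelated to the failing COPY.

```python
import struct


def protocol_version_bytes(major: int, minor: int) -> bytes:
    """Repack a logged 'major.minor' protocol version into its 4 wire bytes.

    The startup packet stores the version as two big-endian 16-bit halves.
    """
    return struct.pack("!HH", major, minor)


# Values taken from the FATAL line in the log above.
raw = protocol_version_bytes(65363, 19778)
print(raw)  # b'\xffSMB'
```

For comparison, a genuine PostgreSQL 3.0 client would produce the bytes 00 03 00 00.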

@34code
Author

34code commented Dec 20, 2024

Is it possible that queries have a default timeout of 2 hours?

@onurctirtir
Member

I'm not aware of Citus setting such a default statement timeout for distributed queries / operations, so I'll check the code to see whether we do this somewhere for internal COPY connections.

Also, are you seeing "canceling statement due to statement timeout", "canceling statement due to lock timeout", or similar messages in your PG logs during COPY, before distributed table creation breaks?

@onurctirtir
Member

But all in all, error messages like this are not really good:

server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

@onurctirtir
Member

Would you mind sharing the CREATE TABLE command as well as the exact SELECT .. create_distributed_table(..) .. call that you're making?

If sharing the exact commands is not possible, some obfuscated version that reveals the column types, column default expressions and indexes etc. would also help a lot.

@34code
Author

34code commented Dec 24, 2024

I think I've isolated it to a node/OS-related timeout. I was able to use tmux and docker exec to get into the coordinator node, and it ran fine there without the timeout when using psql. However, when I connect from my laptop or a remote server, it always times out.

Here is my create table:

CREATE TABLE com_walmart_prices (
    id uuid NOT NULL DEFAULT uuid_generate_v4(),
    source character varying DEFAULT 'Walmart.com'::character varying,
    condition character varying,
    amount character varying,
    buy_link text,
    barcode character varying NOT NULL,
    created_at timestamp(6) without time zone NOT NULL,
    amount_override character varying
);

And this is the create distributed table command:

SELECT create_distributed_table('com_walmart_prices', 'barcode');

Re: the logs, do you mean just copying the docker logs from the coordinator node while the command is running, to assess why it's timing out?
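
On the observation that the command only fails from a laptop or remote server: one possible explanation, not confirmed in this thread, is that a NAT gateway or firewall between the client and the coordinator drops the connection while it sits idle waiting for the result. On Linux the default idle time before the first TCP keepalive probe is 7200 seconds, which may be why the failures show up around the 2-hour mark. libpq accepts the keepalives, keepalives_idle, keepalives_interval and keepalives_count connection parameters for client-side TCP keepalives. The sketch below only builds such a connection string; the helper name, host, database, user and the chosen values are placeholders.

```python
def build_conninfo(host: str, dbname: str, user: str, port: int = 5432,
                   idle: int = 60, interval: int = 10, count: int = 5) -> str:
    """Build a libpq keyword/value connection string with TCP keepalives on.

    Assumes no value contains spaces or quotes (those would need quoting).
    """
    params = {
        "host": host,
        "port": port,
        "dbname": dbname,
        "user": user,
        "keepalives": 1,                  # enable client-side TCP keepalives
        "keepalives_idle": idle,          # idle seconds before the first probe
        "keepalives_interval": interval,  # seconds between unanswered probes
        "keepalives_count": count,        # unanswered probes before giving up
    }
    return " ".join(f"{key}={value}" for key, value in params.items())


# Placeholder host, database and user; substitute real values.
print(build_conninfo("coordinator.example.com", "postgres", "postgres"))
```

psql accepts a connection string like this in place of the database name, so the printed value can be passed to it in quotes. Whether this avoids the disconnect depends on what is actually dropping the connection.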

@34code
Author

34code commented Dec 24, 2024

I had all the data on the coordinator node, which wasn't ideal, but it all fit somehow. In the future I would prefer not to have to write the data to the coordinator node first, as the tables will grow larger than the coordinator node alone can hold. Maybe it's an issue with my pg_backup script, which comes from vanilla Postgres.

@34code
Author

34code commented Dec 24, 2024

These are the relevant logs I found on the coordinator node:

2024-12-20 06:53:06.739 UTC [95332] ERROR:  canceling statement due to user request
2024-12-20 06:53:06.739 UTC [95332] STATEMENT:  SELECT create_distributed_table('com_walmart_prices', 'barcode');

2024-12-20 06:53:06.739 UTC [95332] LOG:  could not send data to client: Connection reset by peer
2024-12-20 06:53:06.739 UTC [95332] FATAL:  connection to client lost
2024-12-20 06:53:06.871 UTC [97222] LOG:  PID 95332 in cancel request did not match any process
2024-12-20 06:53:28.364 UTC [95330] ERROR:  canceling statement due to user request
2024-12-20 06:53:28.364 UTC [95330] STATEMENT:  SELECT create_distributed_table('com_walmart_prices', 'barcode');

2024-12-20 06:53:38.307 UTC [97232] ERROR:  canceling statement due to user request
2024-12-20 06:53:38.307 UTC [97232] STATEMENT:  SELECT create_distributed_table('com_walmart_prices', 'barcode');
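
To tell the timeout messages asked about earlier in the thread apart from the cancellations logged here, the log lines can be bucketed by message text. The following is a rough Python sketch, assuming the log line format shown above (timestamp, "UTC", PID in brackets, severity); the category names and helper functions are invented for illustration, and "cancel_request" only means the backend received a cancel signal, not who sent it.

```python
import re
from collections import Counter

# Matches lines like:
# 2024-12-20 06:53:06.739 UTC [95332] ERROR:  canceling statement due to user request
LOG_LINE = re.compile(r"^(\S+ \S+) UTC \[(\d+)\] ([A-Z]+):\s+(.*)$")


def classify(message: str) -> str:
    """Map a log message to a coarse category (names are illustrative)."""
    if "due to statement timeout" in message:
        return "statement_timeout"
    if "due to lock timeout" in message:
        return "lock_timeout"
    if "due to user request" in message:
        return "cancel_request"
    if "connection to client lost" in message or "Connection reset by peer" in message:
        return "client_disconnect"
    return "other"


def summarize(lines):
    """Count categories, skipping STATEMENT lines that only echo the query."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group(3) != "STATEMENT":
            counts[classify(match.group(4))] += 1
    return counts


SAMPLE = [
    "2024-12-20 06:53:06.739 UTC [95332] ERROR:  canceling statement due to user request",
    "2024-12-20 06:53:06.739 UTC [95332] STATEMENT:  SELECT create_distributed_table('com_walmart_prices', 'barcode');",
    "2024-12-20 06:53:06.739 UTC [95332] LOG:  could not send data to client: Connection reset by peer",
    "2024-12-20 06:53:06.739 UTC [95332] FATAL:  connection to client lost",
    "2024-12-20 06:53:06.871 UTC [97222] LOG:  PID 95332 in cancel request did not match any process",
]

print(summarize(SAMPLE))
```

On this sample no line falls into the statement_timeout or lock_timeout buckets; the entries are a cancel request and client disconnects.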

@onurctirtir
Member

I think I've isolated it to a node/OS-related timeout. I was able to use tmux and docker exec to get into the coordinator node, and it ran fine there without the timeout when using psql. However, when I connect from my laptop or a remote server, it always times out.

Yes, those logs made me think that one of the issues here is related to a statement timeout.

However, the following errors still don't look good and seem to be a separate problem:

server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.

I'll look into this a bit more and will try to reproduce the issue.
