Skip to content
This repository was archived by the owner on Nov 6, 2020. It is now read-only.

Openethereum node stopped importing blocks #11737

Closed
trailtacos opened this issue May 30, 2020 · 20 comments
Closed

Openethereum node stopped importing blocks #11737

trailtacos opened this issue May 30, 2020 · 20 comments
Labels
F2-bug 🐞 The client fails to follow expected behavior.

Comments

@trailtacos
Copy link

  • OpenEthereum version: 3.0.0
  • Operating system: Linux, Ubuntu 18.04
  • Installation: Built from source
  • Fully synchronized: Yes
  • Network: ethereum
  • Restarted: N/A (issue is that it stopped importing blocks) Restart brought it back to life

Last night, our openethereum node stopped importing blocks. A restart brought it back to life, but this is an issue for us if it can't keep up to date.

Some observations:

  1. It was using ~60% CPU (according to TOP)
  2. JSON RPC was still operating
  3. Logs showed nothing at all (they just stopped)
  4. I captured an strace of the openethereum process if that is helpful.

We have been running the same configuration on parity (I believe we were running 2.5.13 prior to upgrading). Our openethereum node has been alive for about 10 days - it was created from a snapshot of the parity DB.

2020-05-30 05:53:37 UTC Verifier #0 INFO import  Imported #10165361 0xe39e…1949 (276 txs, 9.99 Mgas, 1286 ms, 39.56 KiB)
2020-05-30 05:53:39 UTC IO Worker #0 INFO snapshot::service  Snapshot: 6978582 accounts, 5000 blocks, 1756802596 bytes
2020-05-30 05:53:40 UTC IO Worker #1 INFO import    30/50 peers    128 MiB chain  306 MiB db  0 bytes queue   25 MiB sync  RPC:  0 conn,    1 req/s,  237 µs
2020-05-30 05:53:49 UTC IO Worker #0 INFO snapshot::service  Snapshot: 6978582 accounts, 5000 blocks, 1756802596 bytes
2020-05-30 05:53:59 UTC IO Worker #0 INFO snapshot::service  Snapshot: 6978582 accounts, 5000 blocks, 1756802596 bytes
2020-05-30 05:54:09 UTC IO Worker #1 INFO snapshot::service  Snapshot: 6978582 accounts, 5000 blocks, 1756802596 bytes
2020-05-30 05:54:10 UTC IO Worker #1 INFO import    30/50 peers    128 MiB chain  306 MiB db  0 bytes queue   25 MiB sync  RPC:  0 conn,   42 req/s,  206 µs
2020-05-30 14:47:30 UTC main INFO openethereum::run  Starting OpenEthereum/v3.0.0-stable-b079d1746-20200511/x86_64-unknown-linux-gnu/rustc1.43.1
2020-05-30 14:47:30 UTC main INFO openethereum::run  Keys path /root/.local/share/io.parity.ethereum/keys/ethereum
2020-05-30 14:47:30 UTC main INFO openethereum::run  DB path /root/.local/share/io.parity.ethereum/chains/ethereum/db/906a34e69aec8c0d

Its configuration is:

[parity]
mode = "last"
auto_update = "none"

[footprint]
tracing = "on"
pruning = "archive"
cache_size = 2048
db_compaction = "ssd"
scale_verifiers = true
num_verifiers = 8

[misc]
logging = "own_tx=trace"
log_file = "/var/log/openethereum.log"
color = true

[rpc]
disable = false
port = 8545
interface = "all"
server_threads = 8

[network]
min_peers = 25
max_peers = 50
@trailtacos
Copy link
Author

Node has hung again:

2020-06-01 17:32:25 UTC Verifier #0 INFO import  Imported #10181340 0x17b7…bfa3 (129 txs, 9.98 Mgas, 656 ms, 29.47 KiB)
2020-06-01 17:32:25 UTC IO Worker #0 INFO import    29/50 peers    131 MiB chain  306 MiB db  0 bytes queue   11 MiB sync  RPC:  0 conn,   18 req/s,  248 µs
2020-06-01 17:32:29 UTC Verifier #0 INFO import  Imported #10181341 0x4a4f…3a68 (110 txs, 9.97 Mgas, 1064 ms, 29.13 KiB)
2020-06-01 17:32:33 UTC IO Worker #3 INFO snapshot::service  Snapshot: 25701656 accounts, 5000 blocks, 5446623304 bytes
2020-06-01 17:32:37 UTC Verifier #0 INFO import  Imported #10181342 0x7efd…b3eb (149 txs, 9.98 Mgas, 1909 ms, 30.75 KiB)
2020-06-01 17:32:43 UTC IO Worker #0 INFO snapshot::service  Snapshot: 25701656 accounts, 5000 blocks, 5446623304 bytes
2020-06-01 17:32:53 UTC IO Worker #3 INFO snapshot::service  Snapshot: 25701656 accounts, 5000 blocks, 5446623304 bytes

@trailtacos
Copy link
Author

I now have an strace of it hanging and a gdb thread dump when it has hung...

@vogelito
Copy link

vogelito commented Jun 1, 2020

To complement @CrispinFlowerday:

$ date && tail -1 /var/log/openethereum.log && curl --data '{"method":"eth_blockNumber","params":[],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8545
Mon Jun  1 17:52:01 UTC 2020
2020-06-01 17:32:53 UTC IO Worker #3 INFO snapshot::service  Snapshot: 25701656 accounts, 5000 blocks, 5446623304 bytes
{"jsonrpc":"2.0","result":"0x9b5ade","id":1}

@dvdplm
Copy link
Collaborator

dvdplm commented Jun 1, 2020

I now have an strace of it hanging and a gdb thread dump when it has hung...

Oh, that sounds like something that'd be plenty useful. Can you share them?

@trailtacos
Copy link
Author

Here they are.

openethereum_threads.2020-06-01.txt.gz
openethereum-2020-05-30.strace.gz

@pstroinski
Copy link

pstroinski commented Jun 2, 2020

This issue is here since 2.7.2 https://github.com/openethereum/openethereum/issues/11539 and https://github.com/openethereum/openethereum/issues/11494
I managed to get it working after downgrading back to 2.5.
Now after upgrading nodes to 3.0 it's back again. I tasted it for about 5 days, looked stable so upgraded the whole network. Few days later it started to get stuck.

Is there a way to downgrade again?

@vkconsult
Copy link

I'm experiencing similar issues on the classic chain after upgrading from 2.7.2 to 3.0.1 to tackle #11744 . Now its stuck on a block and won't sync past it.
2020-06-02 09:29:02 UTC IO Worker #3 INFO import Syncing #10500973 0xa98f…4491 0.00 blk/s 0.0 tx/s 0.0 Mgas/s 0+ 0 Qed #10501210 40/50 peers 3 MiB chain 514 MiB db 0 bytes queue 7 MiB sync RPC: 0 conn, 0 req/s, 15 µs

@phiresky
Copy link

phiresky commented Jun 5, 2020

(Copied from #11494 since this seems more relevant)

For me, 2.7.2 completely stops syncing every few days. The server still runs and RPC calls are still answered, but it just doesn't get any new blocks. Even though it shows 10+ connections. I have full debug logs of a few instances, but they don't seem to say anything interesting.

I think it got better when I enabled --no-periodic-snapshot (now only once per week), but it still happens. I wrote a script that just kills and restarts the parity node when it's out of date by more than 10 minutes, that works.. but it's obviously pretty dumb.

@kcanty
Copy link

kcanty commented Jun 5, 2020

Agreed we have this same issue - full historical archive node with 8TB SSD etc etc.

Hangs and has to be killed and restarted. we are running alternate nodes on 2.5.13 that do not have this issue.

@phiresky
Copy link

phiresky commented Jun 5, 2020

Here's my kill script to work around this by request from @kcanty:

https://gist.github.com/phiresky/153411abb794ad61eb4ba949236bc6ce

it's written in JS since it has to connect to the node and does some rudimentary checks.

@kcanty
Copy link

kcanty commented Jun 5, 2020

Thanks @phiresky - appreciate it - looks good.

@nysxah
Copy link

nysxah commented Jun 8, 2020

There is definitely something going on since 2.7.2 (database format changes?).

2.7.2 and 3.0.0 either sync too slow to ever catch up to tip, or freeze (stop importing blocks) periodically.

Happens on full/fast and archive nodes, both with tracing.

2.5.13 has no issues.

@tjayrush
Copy link

tjayrush commented Jun 9, 2020

I've been watching this issue and other issues like it on OpenEthereum, and I'm just wondering if this has any effect on the pending Berlin hard fork. I guess if OpenEthereum nodes can't sync, it doesn't really have a deleterious effect on the chain, but if everyone is forced to upgrade because of the Berlin fork, and a good portion of those newly upgraded nodes can't sync, doesn't that lower the overall security of the system as a whole. Is there any sense of how many OpenEthereum nodes are experiencing this problem? I've not yet upgraded to 2.7.2 because of this issue, but with Berlin approaching, I will have to.

@trailtacos
Copy link
Author

So... My openethereum node is currently hung - I'm happy to do any debugging on it over the next 48 hours that would be helpful to allow devs to track this issue down:

2020-06-10 20:40:10 UTC IO Worker #1 INFO import    30/50 peers    147 MiB chain  306 MiB db  0 bytes queue   13 MiB sync  RPC:  0 conn,    0 req/s,    0 µs
2020-06-10 20:40:12 UTC IO Worker #1 INFO snapshot::service  Snapshot: 59522653 accounts, 5000 blocks, 11935564982 bytes
2020-06-10 20:40:22 UTC IO Worker #3 INFO snapshot::service  Snapshot: 59522653 accounts, 5000 blocks, 11935564982 bytes
2020-06-10 20:40:29 UTC Verifier #0 INFO import  Imported #10240161 0x4314…b2db (88 txs, 9.93 Mgas, 2141 ms, 13.82 KiB)
2020-06-10 20:40:31 UTC Verifier #0 INFO import  Imported #10240162 0x82ae…6abd (149 txs, 9.95 Mgas, 1859 ms, 39.22 KiB)
<no more logs>

@mr-older
Copy link

After first and single restart (since synced and consumed 21 gb of virtual memory / 5 res) OpenEthereum/v3.0.1-stable-8ca8089-20200601/ works fine for me for 5 days on ETC (currently 12.5 gb of virtual memory).

Parity-Ethereum/v2.6.8-beta-9bf6ed8-20191231/ works fine on ETH, but it can stop syncing, if resident memory ends up. It is very old issue, came from eairlier versions.

May be last versions have a memory leak, that leads to consuming all available RAM and thus, stopping syncing.

@vogelito
Copy link

This is currently a blocker for us to support the upcoming Berlin fork.

@trailtacos
Copy link
Author

One data point - our (I work with Vogelito) stuck openethereum node did, eventually emit more logs:

2020-06-10 20:40:22 UTC IO Worker #3 INFO snapshot::service  Snapshot: 59522653 accounts, 5000 blocks, 11935564982 bytes
2020-06-10 20:40:29 UTC Verifier #0 INFO import  Imported #10240161 0x4314…b2db (88 txs, 9.93 Mgas, 2141 ms, 13.82 KiB)
2020-06-10 20:40:31 UTC Verifier #0 INFO import  Imported #10240162 0x82ae…6abd (149 txs, 9.95 Mgas, 1859 ms, 39.22 KiB)
2020-06-13 09:41:40 UTC Periodic Snapshot INFO snapshot  Took a snapshot at #10205000 of 88364089 accounts
2020-06-13 09:41:40 UTC Periodic Snapshot INFO snapshot  produced 4878 state chunks and 112 block chunks.
2020-06-13 09:41:40 UTC Periodic Snapshot INFO snapshot::service  Finished taking snapshot at #10205000, in 689772s

I've now restarted it with logging set to debug to see if that emits anything useful

@trailtacos
Copy link
Author

Our node hung again today, and I have more detailed logging - for reference the block it is reporting is 0x9db235 aka 10334773 - which has a block time of Jun-25-2020 11:55:17 AM +UTC. Attached is the last 100k of rows from our node that has debug logging enabled across the board.

My observations is just that logging apparently stops from the IO Workers - there is a large gap in logging before the logs from my queries for its block number start. No idea if this is helpful at all.

openethereum.log.gz

@kael-shipman
Copy link

@vorot93
Copy link

vorot93 commented Jun 29, 2020

Looks like #11758.

@vorot93 vorot93 closed this as completed Jun 29, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
F2-bug 🐞 The client fails to follow expected behavior.
Projects
None yet
Development

No branches or pull requests