
Database Not Being Checkpointed? #31

Open
cwarden opened this issue Feb 2, 2023 · 14 comments

@cwarden
Contributor

cwarden commented Feb 2, 2023

It looks like the SQLite database isn't getting checkpointed during a run of gb backup.

The WAL is growing to many times the ultimate size of the database, and the database isn't getting checkpointed until the backup completes. During a long backup, the WAL for my database grew to 45GB; the database ended up being about 5GB.
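
As a reference point, a checkpoint can also be forced by hand against the live database. A minimal sketch, assuming the mattn/go-sqlite3 driver (which may not be what gb actually uses):

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumed driver, may differ from gb's
)

func main() {
	db, err := sql.Open("sqlite3", ".gb.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// TRUNCATE copies the WAL frames into the main database file and resets
	// the -wal file to zero bytes, but only if no open transaction is still
	// reading from the WAL.
	var busy, walFrames, checkpointed int
	row := db.QueryRow("PRAGMA wal_checkpoint(TRUNCATE);")
	if err := row.Scan(&busy, &walFrames, &checkpointed); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("busy=%d wal_frames=%d checkpointed=%d\n", busy, walFrames, checkpointed)
}

If busy comes back as 1, another connection or open transaction blocked the checkpoint from completing, which would match the symptom described above.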

@leijurv
Owner

leijurv commented Feb 3, 2023

might help: 2c8df3e

I ran a simple test that repeatedly inserted rows into sizes with no limit, and I saw the .gb.db growing while the -wal stayed at a constant size.
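
The test was roughly along these lines (a sketch, not the exact script; the table, file names, and driver here are stand-ins, not gb's real sizes schema):

package main

import (
	"database/sql"
	"fmt"
	"log"
	"os"

	_ "github.com/mattn/go-sqlite3" // assumed driver
)

// fileSize returns a file's size in bytes, or 0 if it doesn't exist yet
// (e.g. the -wal file before the first write).
func fileSize(path string) int64 {
	info, err := os.Stat(path)
	if err != nil {
		return 0
	}
	return info.Size()
}

func main() {
	db, err := sql.Open("sqlite3", "test.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var mode string
	if err := db.QueryRow("PRAGMA journal_mode=WAL;").Scan(&mode); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec("CREATE TABLE IF NOT EXISTS sizes_test (n INTEGER)"); err != nil {
		log.Fatal(err)
	}

	// Insert forever (interrupt with Ctrl-C) and watch the two file sizes:
	// with each insert in its own implicit transaction, auto-checkpointing
	// keeps the -wal bounded while test.db grows.
	for i := 0; ; i++ {
		if _, err := db.Exec("INSERT INTO sizes_test (n) VALUES (?)", i); err != nil {
			log.Fatal(err)
		}
		if i%10000 == 0 {
			fmt.Printf("db=%d bytes, wal=%d bytes\n", fileSize("test.db"), fileSize("test.db-wal"))
		}
	}
}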

@leijurv
Owner

leijurv commented Feb 4, 2023

Per 2c8df3e#commitcomment-99243585

https://stackoverflow.com/questions/16547263/clarification-regarding-journal-size-limit-in-sqlite

I now believe that the issue is that a transaction is very long-lived. I think it's almost certainly scannerThread in backup/scanner.go, specifically in how utils.WalkFiles only uses one large transaction. I think starting a new transaction for every scanned file would be too slow (this is a vague recollection from 2019), so I guess I need to recreate the transaction periodically? (even though it's read-only)
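
Roughly this shape, as a sketch (illustrative names only, not the actual implementation):

package sketch

import (
	"database/sql"
	"time"
)

// periodicTx is an illustrative sketch of "recreate the transaction
// periodically": it hands out a read transaction, but commits and reopens it
// once per tick so that no single transaction pins the WAL for the whole
// duration of a backup.
type periodicTx struct {
	db     *sql.DB
	tx     *sql.Tx
	ticker *time.Ticker
}

func newPeriodicTx(db *sql.DB) (*periodicTx, error) {
	tx, err := db.Begin()
	if err != nil {
		return nil, err
	}
	return &periodicTx{db: db, tx: tx, ticker: time.NewTicker(time.Second)}, nil
}

// Tx returns the current transaction, cycling it if at least a second has
// passed since the last cycle.
func (p *periodicTx) Tx() (*sql.Tx, error) {
	select {
	case <-p.ticker.C:
		if err := p.tx.Commit(); err != nil {
			return nil, err
		}
		tx, err := p.db.Begin()
		if err != nil {
			return nil, err
		}
		p.tx = tx
	default:
		// no tick pending, keep using the current transaction
	}
	return p.tx, nil
}

// Close commits whatever transaction is still open and stops the ticker.
func (p *periodicTx) Close() error {
	p.ticker.Stop()
	return p.tx.Commit()
}

The scanner loop would then fetch the transaction via Tx() for each file instead of holding one *sql.Tx across the entire walk.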

@leijurv
Owner

leijurv commented Feb 4, 2023

@cwarden please try with 7e72ab5

@leijurv
Owner

leijurv commented Feb 4, 2023

Now that the scanner transaction is closed every second, there should be no more SQL transactions that remain open for the entire duration of a backup. This should hopefully allow the WAL to be truncated, as the PRAGMA directs it to be.
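
For reference, assuming the pragma in question is journal_size_limit (per the Stack Overflow link above), it only shrinks the -wal file when a checkpoint actually runs to completion, which can't happen while any transaction is still holding the WAL open. Setting it would look roughly like this (the 32 MiB value is just an example, not necessarily what gb uses):

package sketch

import (
	"database/sql"
	"fmt"
)

// applyWALLimits is an illustrative helper, not gb's actual setup code: it
// enables WAL mode and caps how large the -wal file is allowed to stay after
// a successful checkpoint.
func applyWALLimits(db *sql.DB) error {
	var mode string
	if err := db.QueryRow("PRAGMA journal_mode=WAL;").Scan(&mode); err != nil {
		return fmt.Errorf("enabling WAL: %w", err)
	}
	// journal_size_limit only takes effect when a checkpoint resets the WAL;
	// a transaction held open for the whole backup prevents that.
	var limit int64
	if err := db.QueryRow("PRAGMA journal_size_limit = 33554432;").Scan(&limit); err != nil { // 32 MiB, example value
		return fmt.Errorf("setting journal_size_limit: %w", err)
	}
	return nil
}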

@cwarden
Contributor Author

cwarden commented Feb 5, 2023

I'm not sure if it's related, but I got the "somehow something isn't being synchronized" panic on the first run with 305f560.

@leijurv
Owner

leijurv commented Feb 5, 2023

oof, sorry again, try with 490a805

The bug seems to have been introduced accidentally here: ca6b0b0#diff-7b8b15093cd777455c713b59054b2fa528b1879a817e1e3a927a89f241dba45aL264-R207

This should fix the race condition that I vaguely remember trying to fix in that commit, by computing late[0] outside of the goroutine (which could supposedly race). And it should also fix the actual correctness issue by enqueueing late[0] rather than planned.File (which just failed).
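
The general shape of that kind of fix, as a sketch (the File type and names here are placeholders, not the actual planner code):

package sketch

// File is a placeholder for whatever type gb's planner passes around.
type File struct {
	Path string
}

// enqueueFirstLate shows the pattern: evaluate late[0] before the goroutine
// starts, so the goroutine closes over its own copy instead of re-reading a
// slice that other code may mutate concurrently, and enqueue that same value.
func enqueueFirstLate(late []File, queue chan<- File) {
	first := late[0] // computed outside the goroutine: no racy read later
	go func() {
		queue <- first
	}()
}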

@cwarden
Contributor Author

cwarden commented Feb 6, 2023

Looking good now

leijurv closed this as completed Feb 6, 2023
@cwarden
Contributor Author

cwarden commented Feb 15, 2023

This might have introduced a race condition. Just got the following panic with 67825d5.

2023/02/15 17:20:09 Committing and recreating scanner transaction to prevent WAL from growing too large
2023/02/15 17:20:10 Bytes written: 0
panic: sql: database is closed

goroutine 28 [running]:
github.com/leijurv/gb/backup.(*ScannerTransactionContext).Tx(0xac8e070)
        /home/cwarden/git/gb/backup/scanner.go:169 +0x169
github.com/leijurv/gb/backup.(*ScannerTransactionContext).Tx(0xac8e070)
        /home/cwarden/git/gb/backup/scanner.go:180 +0x13c
github.com/leijurv/gb/backup.scannerThread.func1({0xe08b440, 0x8e}, {0x8dbcdec, 0xe08d4d0})
        /home/cwarden/git/gb/backup/scanner.go:46 +0x7a
github.com/leijurv/gb/utils.WalkFiles.func1()
        /home/cwarden/git/gb/utils/utils.go:51 +0x97
created by github.com/leijurv/gb/utils.WalkFiles
        /home/cwarden/git/gb/utils/utils.go:49 +0xc0

@leijurv
Owner

leijurv commented Feb 16, 2023

Hm, strange! I've just taken a quick look and I have no idea how that could happen, fascinating. Do you have the rest of the log perhaps? For example did it print Closing database before that? I think that's the only place where the database is closed. But in main.go, backup.BackupDB is only called after backup.Backup completes? Did it print Scanner committing beforehand?

I guess this could be caused if utils.WalkFiles returned asynchronously, meaning it returned before it was done scanning all the files. But I don't see how that can happen, since it explicitly calls <-done before returning? Did you get Scan processor signaling done in your log?
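
For reference, the shape being described, as a simplified sketch (not the real utils.WalkFiles): the walk feeds a channel, a processor goroutine drains it, and the function only returns after receiving on done, so it shouldn't be able to return before every file has been handled.

package sketch

import (
	"os"
	"path/filepath"
)

// walkFiles is a simplified sketch of the pattern under discussion: the
// processor goroutine closes done only after it has drained every path, and
// walkFiles blocks on <-done before returning.
func walkFiles(root string, process func(path string)) error {
	paths := make(chan string)
	done := make(chan struct{})

	go func() {
		for p := range paths {
			process(p)
		}
		close(done) // processor finished draining
	}()

	err := filepath.Walk(root, func(path string, info os.FileInfo, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		if !info.IsDir() {
			paths <- path
		}
		return nil
	})

	close(paths)
	<-done // do not return until the processor has handled every path
	return err
}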

leijurv reopened this Feb 16, 2023
@cwarden
Contributor Author

cwarden commented Feb 16, 2023

Sorry, I don't have the rest of the log. I'll keep an eye out for it again.

While looking into this, I think I found a bug in ScannerTransactionContext.Tx. It calls itself after calling ScannerTransactionContext.Close. Under load, the select could keep reading from the ticker, so it would keep committing and beginning a transaction and never actually return a sql.Tx.

@leijurv
Owner

leijurv commented Feb 16, 2023

I think that is unlikely, because the ticker only sends a tick on the channel once a second. Committing and then reopening an empty transaction that hasn't written anything will surely take less than that, I think?
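
For what it's worth, Go's time.Ticker also bounds the damage here: its channel buffers at most one tick and drops further ticks while the receiver is busy, so at most one tick can be queued at any moment. A quick standalone check:

package main

import (
	"fmt"
	"time"
)

// Shows that a time.Ticker's channel holds at most one pending tick: after
// sleeping through roughly ten tick intervals, only a single tick is
// immediately receivable; the rest were dropped.
func main() {
	t := time.NewTicker(100 * time.Millisecond)
	defer t.Stop()

	time.Sleep(time.Second) // miss roughly ten ticks

	pending := 0
	for {
		select {
		case <-t.C:
			pending++
		default:
			fmt.Println("ticks immediately available:", pending) // typically prints 1
			return
		}
	}
}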

@leijurv
Owner

leijurv commented Feb 16, 2023

nevertheless here you go 65c4006

@cwarden
Contributor Author

cwarden commented Feb 16, 2023

Ah, so maybe just one extra time through the cycle. The first commit could be slow, so that the next time through ScannerTransactionContext.Tx it gets a tick, but the next commit/begin should be fast. I think you're right.

@leijurv
Owner

leijurv commented Feb 16, 2023

Yeah. And on top of that, the scanner that uses this tx does so read-only, so even the first commit should be fast since there isn't any data to commit. Although maybe the idea is that it's slow because it's the last tx standing that's pinning the WAL journal, and when it commits, SQLite takes the time to integrate the WAL into the main db file. Better safe than sorry, though; I'll leave that last commit in.
