
Database Not Being Checkpointed? #31

Open
cwarden opened this issue Feb 2, 2023 · 14 comments

@cwarden
Contributor

cwarden commented Feb 2, 2023

It looks like the SQLite database isn't getting checkpointed during a run of gb backup.

The WAL is growing to many times the ultimate size of the database, and the database isn't getting checkpointed until the backup completes. During a long backup, the WAL for my database grew to 45GB; the database ended up being about 5GB.
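
As a reference point, a checkpoint can also be forced by hand against the live database. A minimal sketch, assuming the mattn/go-sqlite3 driver (which may not be what gb actually uses):

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3" // assumed driver, may differ from gb's
)

func main() {
	db, err := sql.Open("sqlite3", ".gb.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// TRUNCATE copies the WAL frames into the main database file and resets
	// the -wal file to zero bytes, but only if no open transaction is still
	// reading from the WAL.
	var busy, walFrames, checkpointed int
	row := db.QueryRow("PRAGMA wal_checkpoint(TRUNCATE);")
	if err := row.Scan(&busy, &walFrames, &checkpointed); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("busy=%d wal_frames=%d checkpointed=%d\n", busy, walFrames, checkpointed)
}

If busy comes back as 1, another connection or open transaction blocked the checkpoint from completing, which would match the symptom described above.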

@leijurv
Owner

leijurv commented Feb 3, 2023

might help: 2c8df3e

I ran a simple test that repeatedly inserted rows into sizes with no limit, and I saw the .gb.db growing while the -wal stayed at a constant size.
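
The test was roughly along these lines (a sketch, not the exact script; the table, file names, and driver here are stand-ins, not gb's real sizes schema):

package main

import (
	"database/sql"
	"fmt"
	"log"
	"os"

	_ "github.com/mattn/go-sqlite3" // assumed driver
)

// fileSize returns a file's size in bytes, or 0 if it doesn't exist yet
// (e.g. the -wal file before the first write).
func fileSize(path string) int64 {
	info, err := os.Stat(path)
	if err != nil {
		return 0
	}
	return info.Size()
}

func main() {
	db, err := sql.Open("sqlite3", "test.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	var mode string
	if err := db.QueryRow("PRAGMA journal_mode=WAL;").Scan(&mode); err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec("CREATE TABLE IF NOT EXISTS sizes_test (n INTEGER)"); err != nil {
		log.Fatal(err)
	}

	// Insert forever (interrupt with Ctrl-C) and watch the two file sizes:
	// with each insert in its own implicit transaction, auto-checkpointing
	// keeps the -wal bounded while test.db grows.
	for i := 0; ; i++ {
		if _, err := db.Exec("INSERT INTO sizes_test (n) VALUES (?)", i); err != nil {
			log.Fatal(err)
		}
		if i%10000 == 0 {
			fmt.Printf("db=%d bytes, wal=%d bytes\n", fileSize("test.db"), fileSize("test.db-wal"))
		}
	}
}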

@leijurv
Owner

leijurv commented Feb 4, 2023

Per 2c8df3e#commitcomment-99243585

https://stackoverflow.com/questions/16547263/clarification-regarding-journal-size-limit-in-sqlite

I now believe that the issue is that a transaction is very long-lived. I think it's almost certainly scannerThread in backup/scanner.go, specifically in how utils.WalkFiles only uses one large transaction. I think starting a new transaction for every scanned file would be too slow (this is a vague recollection from 2019), so I guess I need to recreate the transaction periodically? (even though it's read-only)
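
Roughly this shape, as a sketch (illustrative names only, not the actual implementation):

package sketch

import (
	"database/sql"
	"time"
)

// periodicTx is an illustrative sketch of "recreate the transaction
// periodically": it hands out a read transaction, but commits and reopens it
// once per tick so that no single transaction pins the WAL for the whole
// duration of a backup.
type periodicTx struct {
	db     *sql.DB
	tx     *sql.Tx
	ticker *time.Ticker
}

func newPeriodicTx(db *sql.DB) (*periodicTx, error) {
	tx, err := db.Begin()
	if err != nil {
		return nil, err
	}
	return &periodicTx{db: db, tx: tx, ticker: time.NewTicker(time.Second)}, nil
}

// Tx returns the current transaction, cycling it if at least a second has
// passed since the last cycle.
func (p *periodicTx) Tx() (*sql.Tx, error) {
	select {
	case <-p.ticker.C:
		if err := p.tx.Commit(); err != nil {
			return nil, err
		}
		tx, err := p.db.Begin()
		if err != nil {
			return nil, err
		}
		p.tx = tx
	default:
		// no tick pending, keep using the current transaction
	}
	return p.tx, nil
}

// Close commits whatever transaction is still open and stops the ticker.
func (p *periodicTx) Close() error {
	p.ticker.Stop()
	return p.tx.Commit()
}

The scanner loop would then fetch the transaction via Tx() for each file instead of holding one *sql.Tx across the entire walk.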

@leijurv
Owner

leijurv commented Feb 4, 2023

@cwarden please try with 7e72ab5

@leijurv
Owner

leijurv commented Feb 4, 2023

Now that the scanner transaction is closed every second, there should be no more SQL transactions that remain open for the entire duration of a backup. This should hopefully allow the WAL to be truncated, as the PRAGMA directs it to be.
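
For reference, assuming the pragma in question is journal_size_limit (per the Stack Overflow link above), it only shrinks the -wal file when a checkpoint actually runs to completion, which can't happen while any transaction is still holding the WAL open. Setting it would look roughly like this (the 32 MiB value is just an example, not necessarily what gb uses):

package sketch

import (
	"database/sql"
	"fmt"
)

// applyWALLimits is an illustrative helper, not gb's actual setup code: it
// enables WAL mode and caps how large the -wal file is allowed to stay after
// a successful checkpoint.
func applyWALLimits(db *sql.DB) error {
	var mode string
	if err := db.QueryRow("PRAGMA journal_mode=WAL;").Scan(&mode); err != nil {
		return fmt.Errorf("enabling WAL: %w", err)
	}
	// journal_size_limit only takes effect when a checkpoint resets the WAL;
	// a transaction held open for the whole backup prevents that.
	var limit int64
	if err := db.QueryRow("PRAGMA journal_size_limit = 33554432;").Scan(&limit); err != nil { // 32 MiB, example value
		return fmt.Errorf("setting journal_size_limit: %w", err)
	}
	return nil
}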

@cwarden
Contributor Author

cwarden commented Feb 5, 2023

I'm not sure if it's related, but I got the "somehow something isn't being synchronized" panic on the first run with 305f560.

@leijurv
Owner

leijurv commented Feb 5, 2023

oof, sorry again, try with 490a805

The bug seems to have been introduced accidentally here: ca6b0b0#diff-7b8b15093cd777455c713b59054b2fa528b1879a817e1e3a927a89f241dba45aL264-R207

This should fix the race condition that I vaguely remember trying to fix in that commit, by computing late[0] outside of the goroutine (which could supposedly race). And it should also fix the actual correctness issue by enqueueing late[0] rather than planned.File (which just failed).
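
The general shape of that kind of fix, as a sketch (the File type and names here are placeholders, not the actual planner code):

package sketch

// File is a placeholder for whatever type gb's planner passes around.
type File struct {
	Path string
}

// enqueueFirstLate shows the pattern: evaluate late[0] before the goroutine
// starts, so the goroutine closes over its own copy instead of re-reading a
// slice that other code may mutate concurrently, and enqueue that same value.
func enqueueFirstLate(late []File, queue chan<- File) {
	first := late[0] // computed outside the goroutine: no racy read later
	go func() {
		queue <- first
	}()
}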

@cwarden
Contributor Author

cwarden commented Feb 6, 2023

Looking good now

leijurv closed this as completed Feb 6, 2023
@cwarden
Contributor Author

cwarden commented Feb 15, 2023

This might have introduced a race condition. Just got the following panic with 67825d5.

2023/02/15 17:20:09 Committing and recreating scanner transaction to prevent WAL from growing too large
2023/02/15 17:20:10 Bytes written: 0
panic: sql: database is closed

goroutine 28 [running]:
github.com/leijurv/gb/backup.(*ScannerTransactionContext).Tx(0xac8e070)
        /home/cwarden/git/gb/backup/scanner.go:169 +0x169
github.com/leijurv/gb/backup.(*ScannerTransactionContext).Tx(0xac8e070)
        /home/cwarden/git/gb/backup/scanner.go:180 +0x13c
github.com/leijurv/gb/backup.scannerThread.func1({0xe08b440, 0x8e}, {0x8dbcdec, 0xe08d4d0})
        /home/cwarden/git/gb/backup/scanner.go:46 +0x7a
github.com/leijurv/gb/utils.WalkFiles.func1()
        /home/cwarden/git/gb/utils/utils.go:51 +0x97
created by github.com/leijurv/gb/utils.WalkFiles
        /home/cwarden/git/gb/utils/utils.go:49 +0xc0

@leijurv
Owner

leijurv commented Feb 16, 2023

Hm, strange! I've just taken a quick look and I have no idea how that could happen, fascinating. Do you have the rest of the log perhaps? For example did it print Closing database before that? I think that's the only place where the database is closed. But in main.go, backup.BackupDB is only called after backup.Backup completes? Did it print Scanner committing beforehand?

I guess this could be caused if utils.WalkFiles returned asynchronously, meaning it returned before it was done scanning all the files. But I don't see how that can happen, since it explicitly calls <-done before returning? Did you get Scan processor signaling done in your log?
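
For reference, the shape being described, as a simplified sketch (not the real utils.WalkFiles): the walk feeds a channel, a processor goroutine drains it, and the function only returns after receiving on done, so it shouldn't be able to return before every file has been handled.

package sketch

import (
	"os"
	"path/filepath"
)

// walkFiles is a simplified sketch of the pattern under discussion: the
// processor goroutine closes done only after it has drained every path, and
// walkFiles blocks on <-done before returning.
func walkFiles(root string, process func(path string)) error {
	paths := make(chan string)
	done := make(chan struct{})

	go func() {
		for p := range paths {
			process(p)
		}
		close(done) // processor finished draining
	}()

	err := filepath.Walk(root, func(path string, info os.FileInfo, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		if !info.IsDir() {
			paths <- path
		}
		return nil
	})

	close(paths)
	<-done // do not return until the processor has handled every path
	return err
}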

leijurv reopened this Feb 16, 2023
@cwarden
Contributor Author

cwarden commented Feb 16, 2023

Sorry, I don't have the rest of the log. I'll keep an eye out for it again.

While looking into this, I think I found a bug in ScannerTransactionContext.Tx. It calls itself after calling ScannerTransactionContext.Close. Under load, the select could keep reading from the ticker, so it would keep committing and beginning a transaction and never actually return a sql.Tx.

@leijurv
Owner

leijurv commented Feb 16, 2023

I think that is unlikely, because the ticker only sends a tick on the channel once a second. Committing and then reopening an empty transaction that hasn't written anything will surely take less than that, I think?
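
For what it's worth, Go's time.Ticker also bounds the damage here: its channel buffers at most one tick and drops further ticks while the receiver is busy, so at most one tick can be queued at any moment. A quick standalone check:

package main

import (
	"fmt"
	"time"
)

// Shows that a time.Ticker's channel holds at most one pending tick: after
// sleeping through roughly ten tick intervals, only a single tick is
// immediately receivable; the rest were dropped.
func main() {
	t := time.NewTicker(100 * time.Millisecond)
	defer t.Stop()

	time.Sleep(time.Second) // miss roughly ten ticks

	pending := 0
	for {
		select {
		case <-t.C:
			pending++
		default:
			fmt.Println("ticks immediately available:", pending) // typically prints 1
			return
		}
	}
}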

@leijurv
Owner

leijurv commented Feb 16, 2023

nevertheless here you go 65c4006

@cwarden
Contributor Author

cwarden commented Feb 16, 2023

Ah, so maybe just one extra time through the cycle. The first commit could be slow, so that the next time through ScannerTransactionContext.Tx it gets a tick, but the next commit/begin should be fast. I think you're right.

@leijurv
Owner

leijurv commented Feb 16, 2023

Yeah. And on top of that, the scanner that uses this tx does so read-only, so even the first commit should be fast since there isn't any data to commit. Although maybe the idea is that it's slow because it's the last tx standing that's pinning the WAL journal, and when it commits, SQLite takes the time to integrate the WAL into the main db file. Better safe than sorry, though; I'll leave that last commit in.
