Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a local index and SHA-256 digests to track Kindle books #6

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

gfitzp
Copy link

@gfitzp gfitzp commented Feb 19, 2025

Made some updates to make my previous Kindle logic a bit more generic so the ETag comparison works with both Kindle and ePub formats, and saves a local cache file for both file types.

@gfitzp gfitzp mentioned this pull request Feb 19, 2025
@pbryan
Copy link
Owner

pbryan commented Feb 23, 2025

Apologies, Glenn, I've been slammed this week. I'll be reviewing this today.

@gfitzp
Copy link
Author

gfitzp commented Feb 23, 2025

No worries, no rush!

Copy link
Owner

@pbryan pbryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most significant change I'm suggesting here is to drop the idea of tracking ETags, and focusing on using file hashes, as they allow for more durable file tracking. It's not foolproof though (a file could be rewritten by something else, changing its hash, but it will be far superior to tracking by file path.)

@gfitzp
Copy link
Author

gfitzp commented Feb 27, 2025

I think I've got everything taken care of, if you'd like to take another look when you get a chance! I tested it with locally renaming and deleting both .azw3 and .epub files and it seems to work as I expected it to.

@gfitzp gfitzp changed the title Added logic for Kindle ebooks and using ETag comparison when determining if local and remote files are the same Use a local index and SHA-256 digests to track Kindle books Feb 27, 2025
Copy link
Owner

@pbryan pbryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for continuing to work on this. There are some more issues I've raised to consider. I hope you find this helpful.

sebsync.py Outdated
import xml.etree.ElementTree as ElementTree
import zipfile

from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from platformdirs import *
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need to be added as a dependency in pyproject.toml.

sebsync.py Outdated
if not path.is_file():
continue
try:
if path.name in local_cache:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the file were moved or renamed, how would it be located in the index? (See my comment below on line 171.) I suggest that the hash itself become the dictionary key (or alternatively, if you prefer a more structured approach, just a list of dicts with key-value pairs for each book, hexdigest being one of the keys as you're doing below.)

sebsync.py Outdated
path=path,
modified=fromisoformat(modified.text),
modified=fromisoformat(local_cache[path.name].get("modified")),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest that the modification timestamp should be from the filesystem metadata. The other side of that coin is we should set the file modification time when we download it so that it can be compared in books_are_different, which already performs such a comparison.

sebsync.py Outdated
# first create a list of all the local titles
local_titles = {}

if options.type == "kindle":
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it guaranteed to be kindle already from line 543 above?

sebsync.py Outdated
if options.type == "kindle":
stored_files = options.books.glob("**/*.azw3")
downloaded_files = options.downloads.glob("**/*.azw3")
else:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code will never execute.

sebsync.py Outdated
local_ebooks.append(local_ebook)
local_cache[path.name] = local_cache[entry]
local_cache.pop(entry, None)
except Exception as e:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why to check for this exception in the case of EPUBs (corrupt zip file, unexpected internal files, unexpected XML structure). Unclear to me when an exception would be raised for a kinde book, especially as we never actually open it to process its content.

sebsync.py Outdated
)
local_ebooks.append(local_ebook)
except Exception:
echo_status(path, Status.UNKNOWN)
else:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so I see here now the attempt to use the hash. It seems simpler to just do this and not even try to index by path name, which I would expect in a majority of cases, will never match due to being moved or renamed.

@@ -9,3 +9,4 @@ dist/
.vscode
.venv
prof/

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An empty line?

sebsync.py Outdated
import xml.etree.ElementTree as ElementTree
import zipfile

from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from platformdirs import *
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, no wildcard imports.

sebsync.py Outdated
@@ -184,6 +239,16 @@ def download_ebook(url: str, path: Path, status: str) -> None:
for chunk in response.iter_content(chunk_size=1 * 1024 * 1024):
file.write(chunk)
download.replace(path)
return response.headers
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why return headers?

@gfitzp
Copy link
Author

gfitzp commented Feb 28, 2025

Thanks; yes, the feedback is helpful. I'm also kicking myself for not catching some basic errors. I apologize if there was more hand-holding than you might have anticipated as programming is more of a hobby for me than a profession. There's also a lot more functionality within sebsync than I anticipated, simply because I wasn't using the application the same way: I was just downloading from SE and keeping my local directory up to date with the current SE library, and used that directory to import into Calibre, and wasn't trying to commingle the sebsync library with my working one. I'll be busy for the next few days, but I'll take a look at these issues when I get a chance.

@gfitzp
Copy link
Author

gfitzp commented Mar 8, 2025

Thanks again for your help and patience! I realized I was overthinking things a bit so I started anew rather than play whack-a-mole, and looked more into uv, and hopefully this attempt might be a bit better. Let me know if further changes might be necessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants