-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use a local index and SHA-256 digests to track Kindle books #6
base: main
Are you sure you want to change the base?
Conversation
Apologies, Glenn, I've been slammed this week. I'll be reviewing this today. |
No worries, no rush! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The most significant change I'm suggesting here is to drop the idea of tracking ETags, and focusing on using file hashes, as they allow for more durable file tracking. It's not foolproof though (a file could be rewritten by something else, changing its hash, but it will be far superior to tracking by file path.)
I think I've got everything taken care of, if you'd like to take another look when you get a chance! I tested it with locally renaming and deleting both .azw3 and .epub files and it seems to work as I expected it to. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for continuing to work on this. There are some more issues I've raised to consider. I hope you find this helpful.
sebsync.py
Outdated
import xml.etree.ElementTree as ElementTree | ||
import zipfile | ||
|
||
from dataclasses import dataclass | ||
from datetime import datetime, timezone | ||
from pathlib import Path | ||
from platformdirs import * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will need to be added as a dependency in pyproject.toml
.
sebsync.py
Outdated
if not path.is_file(): | ||
continue | ||
try: | ||
if path.name in local_cache: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the file were moved or renamed, how would it be located in the index? (See my comment below on line 171.) I suggest that the hash itself become the dictionary key (or alternatively, if you prefer a more structured approach, just a list of dicts with key-value pairs for each book, hexdigest
being one of the keys as you're doing below.)
sebsync.py
Outdated
path=path, | ||
modified=fromisoformat(modified.text), | ||
modified=fromisoformat(local_cache[path.name].get("modified")), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest that the modification timestamp should be from the filesystem metadata. The other side of that coin is we should set the file modification time when we download it so that it can be compared in books_are_different
, which already performs such a comparison.
sebsync.py
Outdated
# first create a list of all the local titles | ||
local_titles = {} | ||
|
||
if options.type == "kindle": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't it guaranteed to be kindle already from line 543 above?
sebsync.py
Outdated
if options.type == "kindle": | ||
stored_files = options.books.glob("**/*.azw3") | ||
downloaded_files = options.downloads.glob("**/*.azw3") | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code will never execute.
sebsync.py
Outdated
local_ebooks.append(local_ebook) | ||
local_cache[path.name] = local_cache[entry] | ||
local_cache.pop(entry, None) | ||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand why to check for this exception in the case of EPUBs (corrupt zip file, unexpected internal files, unexpected XML structure). Unclear to me when an exception would be raised for a kinde book, especially as we never actually open it to process its content.
sebsync.py
Outdated
) | ||
local_ebooks.append(local_ebook) | ||
except Exception: | ||
echo_status(path, Status.UNKNOWN) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so I see here now the attempt to use the hash. It seems simpler to just do this and not even try to index by path name, which I would expect in a majority of cases, will never match due to being moved or renamed.
@@ -9,3 +9,4 @@ dist/ | |||
.vscode | |||
.venv | |||
prof/ | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An empty line?
sebsync.py
Outdated
import xml.etree.ElementTree as ElementTree | ||
import zipfile | ||
|
||
from dataclasses import dataclass | ||
from datetime import datetime, timezone | ||
from pathlib import Path | ||
from platformdirs import * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, no wildcard imports.
sebsync.py
Outdated
@@ -184,6 +239,16 @@ def download_ebook(url: str, path: Path, status: str) -> None: | |||
for chunk in response.iter_content(chunk_size=1 * 1024 * 1024): | |||
file.write(chunk) | |||
download.replace(path) | |||
return response.headers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why return headers?
Thanks; yes, the feedback is helpful. I'm also kicking myself for not catching some basic errors. I apologize if there was more hand-holding than you might have anticipated as programming is more of a hobby for me than a profession. There's also a lot more functionality within sebsync than I anticipated, simply because I wasn't using the application the same way: I was just downloading from SE and keeping my local directory up to date with the current SE library, and used that directory to import into Calibre, and wasn't trying to commingle the sebsync library with my working one. I'll be busy for the next few days, but I'll take a look at these issues when I get a chance. |
# Conflicts: # sebsync.py
Thanks again for your help and patience! I realized I was overthinking things a bit so I started anew rather than play whack-a-mole, and looked more into |
Made some updates to make my previous Kindle logic a bit more generic so the ETag comparison works with both Kindle and ePub formats, and saves a local cache file for both file types.