Skip to content

Commit 5595756

Browse files
committed
Implement a flag.
1 parent c74d896 commit 5595756

File tree

6 files changed

+149
-77
lines changed

6 files changed

+149
-77
lines changed

docs/source/how_to_guides/portability.md

Lines changed: 16 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,7 @@ This guide explains how to keep pytask state portable across machines.
1212

1313
1. **Portable State Values**
1414

15-
- `state.value` is opaque and comes from `PNode.state()` / `PTask.state()`.
15+
- `state` is opaque and comes from `PNode.state()` / `PTask.state()`.
1616
- Content hashes are portable; timestamps or absolute paths are not.
1717
- Custom nodes should avoid machine‑specific paths in `state()`.
1818

@@ -26,6 +26,21 @@ This guide explains how to keep pytask state portable across machines.
2626
- If inputs live outside the project root, IDs will include `..` segments to remain
2727
relative; this is expected.
2828

29+
## Cleaning Up the Lockfile
30+
31+
`pytask.lock` is updated incrementally. Entries are only replaced when the corresponding
32+
tasks run. If tasks are removed or renamed, their old entries remain as stale data and
33+
are ignored.
34+
35+
To clean up stale entries without deleting the file, run:
36+
37+
```
38+
pytask build --clean-lockfile
39+
```
40+
41+
This rewrites the lockfile after a successful build with only the currently collected
42+
tasks and their current state values.
43+
2944
## Legacy SQLite
3045

3146
SQLite is the old state format. It is used only when no lockfile exists, and the

docs/source/reference_guides/lockfile.md

Lines changed: 32 additions & 36 deletions
Original file line numberDiff line numberDiff line change
@@ -14,33 +14,25 @@ is written during that first run. Subsequent runs use the lockfile only.
1414
# This file is automatically @generated by pytask.
1515
# It is not intended for manual editing.
1616

17-
lock-version = "1.0"
17+
lock-version = "1"
1818

1919
[[task]]
2020
id = "src/tasks/data.py::task_clean_data"
21+
state = "f9e8d7c6..."
2122

22-
[task.state]
23-
value = "f9e8d7c6..."
23+
[task.depends_on]
24+
"data/raw/input.csv" = "e5f6g7h8..."
2425

25-
[[task.depends_on]]
26-
id = "data/raw/input.csv"
27-
28-
[task.depends_on.state]
29-
value = "e5f6g7h8..."
30-
31-
[[task.produces]]
32-
id = "data/processed/clean.parquet"
33-
34-
[task.produces.state]
35-
value = "m3n4o5p6..."
26+
[task.produces]
27+
"data/processed/clean.parquet" = "m3n4o5p6..."
3628
```
3729

3830
## Behavior
3931

4032
On each run, pytask:
4133

4234
1. Reads `pytask.lock` (if present).
43-
1. Compares current dependency/product/task `state()` to stored `state.value`.
35+
1. Compares current dependency/product/task `state()` to stored `state`.
4436
1. Skips tasks whose states match; runs the rest.
4537
1. Updates `pytask.lock` after each completed task (atomic write).
4638

@@ -52,39 +44,43 @@ serialized even when tasks execute in parallel.
5244
There are two portability concerns:
5345

5446
1. **IDs**: Lockfile IDs must be project‑relative and stable across machines.
55-
1. **State values**: `state.value` is opaque; portability depends on each node’s
56-
`state()` implementation. Content hashes are portable; timestamps are not.
47+
1. **State values**: `state` is opaque; portability depends on each node’s `state()`
48+
implementation. Content hashes are portable; timestamps are not.
49+
50+
## Maintenance
51+
52+
Use `pytask build --clean-lockfile` to rewrite `pytask.lock` with only currently
53+
collected tasks. The rewrite happens after a successful build and recomputes current
54+
state values without executing tasks again.
5755

5856
## File Format Reference
5957

6058
### Top-Level
6159

62-
| Field | Required | Description |
63-
| -------------- | -------- | ---------------------------------- |
64-
| `lock-version` | Yes | Schema version (currently `"1.0"`) |
60+
| Field | Required | Description |
61+
| -------------- | -------- | -------------------------------- |
62+
| `lock-version` | Yes | Schema version (currently `"1"`) |
6563

6664
### Task Entry
6765

68-
| Field | Required | Description |
69-
| ------- | -------- | -------------------------------------------- |
70-
| `id` | Yes | Portable task identifier |
71-
| `state` | Yes | State dictionary with a single `value` field |
66+
| Field | Required | Description |
67+
| ------------ | -------- | ----------------------------- |
68+
| `id` | Yes | Portable task identifier |
69+
| `state` | Yes | Opaque state string |
70+
| `depends_on` | No | Mapping from node id to state |
71+
| `produces` | No | Mapping from node id to state |
7272

7373
### Dependency/Product Entry
7474

75-
| Field | Required | Description |
76-
| ------- | -------- | -------------------------------------------- |
77-
| `id` | Yes | Node identifier |
78-
| `state` | Yes | State dictionary with a single `value` field |
75+
Node entries are stored as key-value pairs inside `depends_on` and `produces`, where the
76+
key is the node id and the value is the node state string.
7977

80-
### State Dictionary
78+
## Version Compatibility
8179

82-
| Field | Required | Description |
83-
| ------- | -------- | ------------------- |
84-
| `value` | Yes | Opaque state string |
80+
Only lock-version `"1"` is supported. Older or newer versions error with a clear upgrade
81+
message.
8582

86-
## Version Compatibility
83+
## Implementation Notes
8784

88-
- **Upgrade**: newer pytask upgrades old lock files in memory and writes the new format
89-
on the next update.
90-
- **Downgrade**: older pytask errors with a clear upgrade message.
85+
- The lockfile is encoded/decoded with `msgspec`’s TOML support.
86+
- Writes are atomic: pytask writes a temporary file and replaces `pytask.lock`.

src/_pytask/build.py

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -75,6 +75,7 @@ def build( # noqa: C901, PLR0912, PLR0913, PLR0915
7575
debug_pytask: bool = False,
7676
disable_warnings: bool = False,
7777
dry_run: bool = False,
78+
clean_lockfile: bool = False,
7879
editor_url_scheme: Literal["no_link", "file", "vscode", "pycharm"] # noqa: PYI051
7980
| str = "file",
8081
explain: bool = False,
@@ -124,6 +125,8 @@ def build( # noqa: C901, PLR0912, PLR0913, PLR0915
124125
Whether warnings should be disabled and not displayed.
125126
dry_run
126127
Whether a dry-run should be performed that shows which tasks need to be rerun.
128+
clean_lockfile
129+
Whether the lockfile should be rewritten to only include collected tasks.
127130
editor_url_scheme
128131
An url scheme that allows to click on task names, node names and filenames and
129132
jump right into you preferred editor to the right line.
@@ -192,6 +195,7 @@ def build( # noqa: C901, PLR0912, PLR0913, PLR0915
192195
"debug_pytask": debug_pytask,
193196
"disable_warnings": disable_warnings,
194197
"dry_run": dry_run,
198+
"clean_lockfile": clean_lockfile,
195199
"editor_url_scheme": editor_url_scheme,
196200
"explain": explain,
197201
"expression": expression,
@@ -341,6 +345,12 @@ def build( # noqa: C901, PLR0912, PLR0913, PLR0915
341345
default=False,
342346
help="Execute a task even if it succeeded successfully before.",
343347
)
348+
@click.option(
349+
"--clean-lockfile",
350+
is_flag=True,
351+
default=False,
352+
help="Rewrite the lockfile with only currently collected tasks.",
353+
)
344354
@click.option(
345355
"--explain",
346356
is_flag=True,

src/_pytask/lockfile.py

Lines changed: 51 additions & 32 deletions
Original file line numberDiff line numberDiff line change
@@ -18,12 +18,13 @@
1818
from _pytask.node_protocols import PTask
1919
from _pytask.node_protocols import PTaskWithPath
2020
from _pytask.nodes import PythonNode
21+
from _pytask.outcomes import ExitCode
2122
from _pytask.pluginmanager import hookimpl
2223

2324
if TYPE_CHECKING:
2425
from _pytask.session import Session
2526

26-
CURRENT_LOCKFILE_VERSION = "1.0"
27+
CURRENT_LOCKFILE_VERSION = "1"
2728

2829

2930
class LockfileError(Exception):
@@ -34,20 +35,11 @@ class LockfileVersionError(LockfileError):
3435
"""Raised when a lockfile version is not supported."""
3536

3637

37-
class _State(msgspec.Struct):
38-
value: str
39-
40-
41-
class _NodeEntry(msgspec.Struct):
42-
id: str
43-
state: _State
44-
45-
4638
class _TaskEntry(msgspec.Struct):
4739
id: str
48-
state: _State
49-
depends_on: list[_NodeEntry] = msgspec.field(default_factory=list)
50-
produces: list[_NodeEntry] = msgspec.field(default_factory=list)
40+
state: str
41+
depends_on: dict[str, str] = msgspec.field(default_factory=dict)
42+
produces: dict[str, str] = msgspec.field(default_factory=dict)
5143

5244

5345
class _Lockfile(msgspec.Struct, forbid_unknown_fields=False):
@@ -109,28 +101,25 @@ def read_lockfile(path: Path) -> _Lockfile | None:
109101
msg = "Lockfile is missing 'lock-version'."
110102
raise LockfileError(msg)
111103

112-
if Version(version) > Version(CURRENT_LOCKFILE_VERSION):
104+
if Version(version) != Version(CURRENT_LOCKFILE_VERSION):
113105
msg = (
114106
f"Unsupported lock-version {version!r}. "
115107
f"Current version is {CURRENT_LOCKFILE_VERSION}."
116108
)
117109
raise LockfileVersionError(msg)
118110

119-
lockfile = msgspec.toml.decode(path.read_bytes(), type=_Lockfile)
120-
121-
if Version(version) < Version(CURRENT_LOCKFILE_VERSION):
122-
lockfile = _Lockfile(
123-
lock_version=CURRENT_LOCKFILE_VERSION,
124-
task=lockfile.task,
125-
)
126-
return lockfile
111+
try:
112+
return msgspec.toml.decode(path.read_bytes(), type=_Lockfile)
113+
except msgspec.DecodeError:
114+
msg = "Lockfile has invalid format."
115+
raise LockfileError(msg) from None
127116

128117

129118
def _normalize_lockfile(lockfile: _Lockfile) -> _Lockfile:
130119
tasks = []
131120
for task in sorted(lockfile.task, key=lambda entry: entry.id):
132-
depends_on = sorted(task.depends_on, key=lambda entry: entry.id)
133-
produces = sorted(task.produces, key=lambda entry: entry.id)
121+
depends_on = {key: task.depends_on[key] for key in sorted(task.depends_on)}
122+
produces = {key: task.produces[key] for key in sorted(task.produces)}
134123
tasks.append(
135124
_TaskEntry(
136125
id=task.id,
@@ -159,7 +148,7 @@ def _build_task_entry(session: Session, task: PTask, root: Path) -> _TaskEntry |
159148
predecessors = set(dag.predecessors(task.signature))
160149
successors = set(dag.successors(task.signature))
161150

162-
depends_on = []
151+
depends_on: dict[str, str] = {}
163152
for node_signature in predecessors:
164153
node = (
165154
dag.nodes[node_signature].get("task") or dag.nodes[node_signature]["node"]
@@ -174,9 +163,9 @@ def _build_task_entry(session: Session, task: PTask, root: Path) -> _TaskEntry |
174163
if isinstance(node, PTask)
175164
else build_portable_node_id(node, root)
176165
)
177-
depends_on.append(_NodeEntry(id=node_id, state=_State(state)))
166+
depends_on[node_id] = state
178167

179-
produces = []
168+
produces: dict[str, str] = {}
180169
for node_signature in successors:
181170
node = (
182171
dag.nodes[node_signature].get("task") or dag.nodes[node_signature]["node"]
@@ -191,11 +180,11 @@ def _build_task_entry(session: Session, task: PTask, root: Path) -> _TaskEntry |
191180
if isinstance(node, PTask)
192181
else build_portable_node_id(node, root)
193182
)
194-
produces.append(_NodeEntry(id=node_id, state=_State(state)))
183+
produces[node_id] = state
195184

196185
return _TaskEntry(
197186
id=build_portable_task_id(task, root),
198-
state=_State(task_state),
187+
state=task_state,
199188
depends_on=depends_on,
200189
produces=produces,
201190
)
@@ -238,9 +227,7 @@ def _rebuild_indexes(self) -> None:
238227
self._task_index = {task.id: task for task in self.lockfile.task}
239228
self._node_index = {}
240229
for task in self.lockfile.task:
241-
nodes = {}
242-
for entry in task.depends_on + task.produces:
243-
nodes[entry.id] = entry.state.value
230+
nodes = {**task.depends_on, **task.produces}
244231
self._node_index[task.id] = nodes
245232

246233
def get_task_entry(self, task_id: str) -> _TaskEntry | None:
@@ -261,10 +248,42 @@ def update_task(self, session: Session, task: PTask) -> None:
261248
self._rebuild_indexes()
262249
write_lockfile(self.path, self.lockfile)
263250

251+
def rebuild_from_session(self, session: Session) -> None:
252+
if session.dag is None:
253+
return
254+
tasks = []
255+
for task in session.tasks:
256+
entry = _build_task_entry(session, task, self.root)
257+
if entry is not None:
258+
tasks.append(entry)
259+
self.lockfile = _Lockfile(
260+
lock_version=CURRENT_LOCKFILE_VERSION,
261+
task=tasks,
262+
)
263+
self._rebuild_indexes()
264+
write_lockfile(self.path, self.lockfile)
265+
264266

265267
@hookimpl
266268
def pytask_post_parse(config: dict[str, Any]) -> None:
267269
"""Initialize the lockfile state."""
268270
path = config["root"] / "pytask.lock"
269271
config["lockfile_path"] = path
270272
config["lockfile_state"] = LockfileState.from_path(path, config["root"])
273+
274+
275+
@hookimpl
276+
def pytask_unconfigure(session: Session) -> None:
277+
"""Optionally rewrite the lockfile to drop stale entries."""
278+
if session.config.get("command") != "build":
279+
return
280+
if not session.config.get("clean_lockfile"):
281+
return
282+
if session.config.get("dry_run"):
283+
return
284+
if session.exit_code != ExitCode.OK:
285+
return
286+
lockfile_state = session.config.get("lockfile_state")
287+
if lockfile_state is None:
288+
return
289+
lockfile_state.rebuild_from_session(session)

src/_pytask/state.py

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -35,7 +35,7 @@ def has_node_changed(
3535
entry = lockfile_state.get_task_entry(task_id)
3636
if entry is None:
3737
return True
38-
return state != entry.state.value
38+
return state != entry.state
3939
node_id = (
4040
build_portable_task_id(node, lockfile_state.root)
4141
if isinstance(node, PTask)
@@ -67,7 +67,7 @@ def get_node_change_info(
6767
entry = lockfile_state.get_task_entry(task_id)
6868
if entry is None:
6969
return True, "not_in_db", details
70-
stored_state = entry.state.value
70+
stored_state = entry.state
7171
else:
7272
node_id = (
7373
build_portable_task_id(node, lockfile_state.root)

0 commit comments

Comments
 (0)