Last updated: 2024-12-28
Archived legacy code (2024-12-28):
We have archived both source and tests for several defunct/legacy namespaces in _archive/ (at project root):
- `adjust_frequency_test.clj` - old bucketing/resampling API
- `converters_test.clj` - old converter functions (now in `column.api`)
- `rolling_window_test.clj` - old rolling window API
- `slice_test.clj` - old slice/index-by API
- `time_components_test.clj` - old time field extractors (now in `column.api`)
- `validatable_test.clj` - dataset integrity checking utility
These are preserved as reference material for when we reimplement similar functionality in the new column-based architecture. They provide valuable test cases, edge cases, expected behavior semantics, and implementation patterns.
Both source and test files have been moved outside of src/ and test/ directories so they are not loaded or run. The main tablecloth.time.api namespace has been cleaned up to remove defunct symbol exports.
Active work:
- ✅ Column-level field extractors implemented in `tablecloth.column.api`
- ✅ Column-level `convert-time` for representation changes
- ✅ Dataset-level `slice` operation implemented in `tablecloth.time.api.slice`
- 🚧 Dataset-level operations (bucket, resample) to be reimplemented per the architecture below
Primary audience: Scicloj / tablecloth users working with tech.ml.dataset datasets.
Goal: Make common time manipulations (slicing, bucketing, rolling, etc.) easy and expressive in the dataset context, while using dtype-next for performance.
- Time functionality should live in `tablecloth.time`, not in a separate `gnomon` lib, because:
  - The main use cases are tablecloth + `tech.ml.dataset`, not general JVM apps.
  - `tablecloth.time` already has the right context: datasets, columns, and the `tablecloth.column.api`.
  - We can "lift" dtype-next datetime operations into the column API directly, similar to how functional ops were lifted from `tech.v3.datatype.functional`.
- The earlier `gnomon` work remains valuable as a source of semantics and tests (especially around the millis pivot and bucketing), but it does not need to be a public library right now.
  - If a clear, non-tablecloth use case appears later, we can reconsider a tiny scalar library.
We keep the millis pivot idea but apply it at the column level.
Conceptually (gnomon, scalar version):
- There is a single numeric axis: epoch milliseconds.
- Conversions:
  - `to-millis`: `Instant`/`ZonedDateTime`/`OffsetDateTime`/`LocalDateTime`/`LocalDate`/`java.util.Date`/number → epoch millis.
  - `millis->anytime`: millis → a specific `java.time` type (or keyword designator, e.g. `:instant`, `:local-date`).
  - `convert-to`: combines the two with explicit `:zone` semantics.
- Bucketing / rounding:
  - `milliseconds-in unit` (e.g. seconds, minutes, hours, days, weeks) gives a scalar factor.
  - `down-to-nearest interval unit` / `->every`:
    - For metric units (`:milliseconds`, `:seconds`, `:minutes`, `:hours`, `:days`, `:weeks`): floor using integer math in millis (`x - (mod x step)`).
    - For calendar units (`:months`, `:quarters`, `:years`): convert to `LocalDate`, apply calendar-aware floor functions (`floor-month`, `floor-quarter`, `floor-year`), then convert back.
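For reference, the scalar metric-unit floor described above can be sketched as follows (constants and names follow the gnomon description; this is an illustration, not the library source):

```clojure
;; Sketch of the metric-unit floor over scalar epoch millis.
(def milliseconds-in
  {:milliseconds 1
   :seconds      1000
   :minutes      60000
   :hours        3600000
   :days         86400000
   :weeks        604800000})

(defn down-to-nearest
  "Floor `millis` to the start of the nearest-lower `interval`-`unit` bucket."
  [millis interval unit]
  (let [step (* interval (milliseconds-in unit))]
    (- millis (mod millis step))))

;; e.g. flooring to 5-minute buckets always yields a multiple of 300000:
;; (down-to-nearest 1700000123456 5 :minutes) ;=> 1700000100000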
In tablecloth.time, we adapt this to columns:
- Normalize time columns to epoch millis where needed (using dtype-next datetime operations).
- Do bucketing and other math in millis (or another numeric epoch space as appropriate).
- Convert back to logical datetime dtypes for result columns.
This keeps user-facing code thinking in terms of time and columns, not low-level numeric representations.
We do not reintroduce a pandas-style Index into tech.ml.dataset. Instead, we:
- Treat an index as just a column whose values we regard as the coordinate system (usually time).
  - Example: `:received-date` is the time axis.
- No metadata-based axis marking: time operations explicitly take column name arguments, consistent with tablecloth's patterns:
  - `group-by`, `order-by`, `aggregate`, etc. all take explicit column selectors.
  - Time operations follow the same pattern: `(slice-by ds :time-col start end)`.
  - This keeps behavior clear and avoids "spooky action at a distance" from metadata.
- No `index-by` function: deferred unless a compelling use case emerges. Explicit column arguments are simpler and more flexible.
- Sortedness handling:
  - Time slicing operations require sorted data for efficient binary search.
  - Default behavior: check sortedness (O(n)), use binary search if sorted, error with a helpful message if not.
  - Optimization: a `{:sorted? true}` option skips the sortedness check (for when the user knows the data is sorted).
  - We never silently reorder data; users must explicitly `(tc/order-by ds :time-col)` first.
From the Zulip tech.ml.dataset.dev discussion (summarized):
- For categorical indexing, Harold uses:

```clojure
(defn obtain-index [ds colname]
  (let [m (ds/group-by-column->indexes ds colname)]
    (fn [v]
      (map (partial ds/row-at ds) (get m v)))))

(def lookup (obtain-index ds :id))
(lookup "a") ; rows where :id == "a"
```
- This yields an index-as-function:
  - Built once (`group-by-column->indexes` is O(n)).
  - Lookup is O(1) using a hash map.
  - Rows are realized lazily.
  - No special Index type, just a map + closure.
For time / ordered data, the analogous concept is:
- Treat the sorted time column itself as the index.
- Use binary search over the sorted time column (always; no conditional logic).
- Binary search is fast even on small data (100 rows = ~7 comparisons) and provides consistent, predictable performance.
- For more complex windowing operations, use dtype-next's `variable-rolling-window-ranges`.
Why binary search instead of tree structures? (from Zulip discussion with Chris Nuernberger)
Chris Nuernberger (dtype-next author) explained:
"You only need tree structures if you are adding values ad-hoc or removing them - usually with datasets we aren't adding/removing rows but rebuilding the index all at once. Just sorting the dataset and using binary search will outperform most/all tree structures in this scenario as it is faster to sort than to construct trees. Binary search performs as well as any tree search for queries and range queries."
Harold validated this with real-world performance: >1M rows/s using java.util.Collections/binarySearch on a 1M row time series dataset.
Key insight: Datasets are typically rebuilt wholesale (loaded/reloaded), not incrementally modified. In this scenario:
- Sorting is faster than constructing trees
- Binary search performs as well as tree search for queries
- Simpler implementation with predictable behavior
See doc/zulip-indexing-discussion-summary.md for full context.
In practice for tablecloth:
- No special Index type, no metadata tracking, no tree structures.
- Time operations take explicit column arguments and assume/verify sortedness.
- Binary search provides efficient O(log n) slicing over sorted time columns.
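As a concrete illustration of the approach (not the project's actual `tablecloth.time.utils.binary-search` implementation), a minimal range-slicing sketch over a sorted long-array of epoch millis might look like:

```clojure
;; Minimal sketch only: java.util.Arrays/binarySearch returns an arbitrary
;; match when timestamps are duplicated, so a real implementation needs
;; dedicated lower/upper-bound searches.
(defn slice-range-indices
  "Return [start-idx end-idx] (inclusive) covering values in [start-ms end-ms],
  or nil when the range is empty. `sorted-millis` must be ascending."
  [^longs sorted-millis ^long start-ms ^long end-ms]
  (let [lo (let [i (java.util.Arrays/binarySearch sorted-millis start-ms)]
             (if (neg? i) (- (inc i)) i))        ; insertion point when missing
        hi (let [i (java.util.Arrays/binarySearch sorted-millis end-ms)]
             (if (neg? i) (dec (- (inc i))) i))] ; last index <= end-ms
    (when (<= lo hi) [lo hi])))

;; (slice-range-indices (long-array [10 20 30 40]) 15 35) ;=> [1 2]
```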
We want a small, coherent set of column primitives that:
- Take a column-like thing (vector, dtype-next column/reader).
- Return a new column-like thing.
- Encode the millis-pivot and calendar semantics.
Sketch of desired primitives:
```clojure
;; Normalize a datetime-ish column to epoch millis
(->millis-col [col opts])

;; Inverse: millis column -> datetime column of a target dtype
(millis->datetime-col [millis-col target-dtype opts])
```

- Internally, these should use:
  - `tech.v3.datatype.datetime/datetime->milliseconds`
  - `tech.v3.datatype.datetime/milliseconds->datetime`
  - and related dtype-next datetime operations.
### 4.2. Bucketing / rounding
```clojure
; Vectorized gnomon-style bucketing
(bucket-every-col [time-col interval unit opts])
```

Semantics:

- Metric units (`:milliseconds`, `:seconds`, `:minutes`, `:hours`, `:days`, `:weeks`):
  - Normalize to millis if necessary.
  - Compute `divisor = interval * (milliseconds-in unit)`.
  - For each element: `rounded = millis - (mod millis divisor)`.
- Calendar units (`:months`, `:quarters`, `:years`):
  - Normalize to a `LocalDate` column (via dtype-next conversion from millis).
  - Apply calendar-aware floors (`floor-month`, `floor-quarter`, `floor-year`) per element.
  - Convert back to the desired type (or to millis) based on the calling context.
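The calendar-unit floors can be sketched on plain `java.time.LocalDate` (names taken from the notes above; a scalar illustration, not the vectorized implementation):

```clojure
(import '(java.time LocalDate))

(defn floor-month [^LocalDate d]
  (.withDayOfMonth d 1))

(defn floor-quarter [^LocalDate d]
  ;; Quarters start in months 1, 4, 7, 10.
  (let [q-start-month (inc (* 3 (quot (dec (.getMonthValue d)) 3)))]
    (-> d (.withMonth q-start-month) (.withDayOfMonth 1))))

(defn floor-year [^LocalDate d]
  (.withDayOfYear d 1))

;; (floor-quarter (LocalDate/of 2024 5 17)) ;=> 2024-04-01
```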
These operations are index-agnostic in the sense that they are per-row transforms; they do not need ordering. They will be composed with index-aware dataset-level operations when needed.
We have implemented field extractor functions in `tablecloth.time.column.api`:
- Basic calendar fields: `year`, `month`, `day`, `hour`, `minute`, `get-second`
- Derived calendar fields: `day-of-week`, `day-of-year`, `week-of-year`, `quarter`

All functions:

- Wrap `dtdt-ops/long-temporal-field` for efficient vectorized extraction
- Handle `Instant` columns automatically (converting to `LocalDateTime` in UTC for extraction)
- Work with `LocalDate`, `LocalDateTime`, `ZonedDateTime`, and `Instant` columns
- Return wrapped `tcc/column` objects
Comprehensive tests added covering all datetime types and edge cases.
Based on pandas dt accessor and R lubridate, here are additional column-level operations for future implementation:
We have `down-to-nearest` (floor) but are missing:
```clojure
(defn ceil-to-nearest  [col interval unit opts] ...) ; Round up to next boundary
(defn round-to-nearest [col interval unit opts] ...) ; Round to nearest boundary
```

Rationale: Pandas provides `.round`, `.floor`, and `.ceil` methods. Lubridate provides `ceiling_date()`, `floor_date()`, and `round_date()`. `ceil` always returns the upper bound of the period; `round` returns the nearest bound.

Implementation notes:

- `ceil-to-nearest`: floor, then add one interval if not already aligned
- `round-to-nearest`: floor, then check whether the distance to the next boundary is less than half the interval
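A scalar sketch of the trilogy over epoch millis (metric units only, with `step = interval * (milliseconds-in unit)`; illustrative, not the column API):

```clojure
(defn floor-to [millis step]
  (- millis (mod millis step)))

(defn ceil-to [millis step]
  (let [f (floor-to millis step)]
    (if (= f millis) f (+ f step))))      ; already aligned? keep it

(defn round-to [millis step]
  (let [f (floor-to millis step)
        r (- millis f)]                   ; distance past the lower boundary
    (if (< r (quot step 2)) f (+ f step))))

;; (floor-to 1234 1000) ;=> 1000
;; (ceil-to  1234 1000) ;=> 2000
;; (round-to 1234 1000) ;=> 1000
```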
```clojure
(defn plus-time  [col amount unit opts] ...) ; Add time amounts
(defn minus-time [col amount unit opts] ...) ; Subtract time amounts
(defn between    [col1 col2 unit opts] ...)  ; Difference between two datetime columns
```

Rationale: dtype-next provides `plus-temporal-amount`, `minus-temporal-amount`, and `between`. Lubridate offers efficient arithmetic operations on dates. Essential for "shift by N days" or "time since previous event" operations.

Implementation notes:

- Wrap `dtdt-ops/plus-temporal-amount` and `dtdt-ops/minus-temporal-amount`
- Support the same units as `down-to-nearest`: `:seconds`, `:minutes`, `:hours`, `:days`, `:weeks`, `:months`, `:years`
- `between` wraps `dtdt-ops/between` for column-wise time differences
- Handle zone semantics consistently with `convert-time`
```clojure
(defn is-month-start?   [col] ...) ; True if first day of month
(defn is-month-end?     [col] ...) ; True if last day of month
(defn is-quarter-start? [col] ...) ; True if first day of quarter
(defn is-quarter-end?   [col] ...) ; True if last day of quarter
(defn is-year-start?    [col] ...) ; True if first day of year
(defn is-year-end?      [col] ...) ; True if last day of year
(defn is-leap-year?     [col] ...) ; True if year is leap year
(defn is-weekend?       [col] ...) ; True if Saturday or Sunday
```

Rationale: the pandas dt accessor provides `is_month_start`, `is_month_end`, etc. Lubridate's `leap_year()` tests for leap years. Useful for filtering and conditional logic in pipelines.

Implementation notes:

- Compare field extractors with boundary values
- `is-month-start?`: day == 1
- `is-month-end?`: compare with days-in-month (account for leap years)
- `is-leap-year?`: `(and (= 0 (mod year 4)) (or (not= 0 (mod year 100)) (= 0 (mod year 400))))`
- `is-weekend?`: `(or (= day-of-week 6) (= day-of-week 7))`
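Scalar sketches of two of these predicates, directly from the formulas above (illustrative only; the real versions operate on columns via the field extractors):

```clojure
(defn is-leap-year? [year]
  ;; Divisible by 4, except century years unless divisible by 400.
  (and (zero? (mod year 4))
       (or (not (zero? (mod year 100)))
           (zero? (mod year 400)))))

(defn is-weekend? [day-of-week]
  ;; ISO numbering: Monday = 1 ... Sunday = 7.
  (or (= day-of-week 6) (= day-of-week 7)))

;; (is-leap-year? 2000) ;=> true
;; (is-leap-year? 1900) ;=> false
```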
```clojure
(defn with-tz  [col new-zone] ...) ; Change display timezone (same instant)
(defn force-tz [col new-zone] ...) ; Change zone metadata (different instant)
```

Rationale: Lubridate provides `with_tz()` (changes time zone display, same moment) and `force_tz()` (changes only zone metadata, describes a new moment). Critical for multi-timezone datasets.

Implementation notes:

- `with-tz`: convert to `ZonedDateTime` with the new zone (instant unchanged)
- `force-tz`: reinterpret the clock time in the new zone (instant changes)
- Only works with temporal types that have zone information
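A scalar sketch of the two zone semantics using `java.time` (names match the proposal above; the column versions would vectorize this):

```clojure
(import '(java.time ZonedDateTime ZoneId))

(defn with-tz
  "Same instant, different wall-clock representation."
  [^ZonedDateTime zdt zone]
  (.withZoneSameInstant zdt (ZoneId/of zone)))

(defn force-tz
  "Same wall-clock fields, reinterpreted in a new zone (different instant)."
  [^ZonedDateTime zdt zone]
  (.withZoneSameLocal zdt (ZoneId/of zone)))
```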
```clojure
(defn datetime-min   [col] ...) ; Minimum datetime value
(defn datetime-max   [col] ...) ; Maximum datetime value
(defn datetime-mean  [col] ...) ; Mean datetime value
(defn datetime-range [col] ...) ; max - min as a duration
```

Rationale: dtype-next provides `millisecond-descriptive-statistics`. While these might be better as dataset-level aggregations, having column-level helpers is convenient.

Implementation notes:

- Wrap `dtdt-ops/millisecond-descriptive-statistics`
- Convert results back to appropriate datetime types
- Consider whether these belong here or in the aggregation layer
```clojure
(defn normalize-date [col opts] ...) ; Set time component to 00:00:00
```

Rationale: Pandas `dt.normalize` always rounds time to midnight (00:00:00). Common operation for comparing dates while ignoring time components.

Implementation notes:

- Convert to `LocalDate` and back to the original temporal type
- Equivalent to `floor-to-day` but with more explicit intent
```clojure
(defn strftime   [col format-string opts] ...) ; Format as string
(defn day-name   [col opts] ...) ; "Monday", "Tuesday", etc.
(defn month-name [col opts] ...) ; "January", "February", etc.
```

Rationale: Pandas `Series.dt.strftime` converts to string using a specified format. Useful for labels and display.

Implementation notes:

- Use Java's `DateTimeFormatter` for `strftime`
- `day-name` and `month-name` can use built-in formatters
- Consider locale support for internationalization
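A scalar sketch using `java.time.format.DateTimeFormatter` (illustrative; the column versions would vectorize this and take locale options):

```clojure
(import '(java.time LocalDate)
        '(java.time.format DateTimeFormatter TextStyle)
        '(java.util Locale))

(defn strftime
  "Format a LocalDate with a DateTimeFormatter pattern string."
  [^LocalDate d pattern]
  (.format d (DateTimeFormatter/ofPattern pattern)))

(defn day-name
  "Full English day name, e.g. \"Monday\"."
  [^LocalDate d]
  (.getDisplayName (.getDayOfWeek d) TextStyle/FULL Locale/ENGLISH))

;; (strftime (LocalDate/of 2024 1 15) "yyyy-MM-dd") ;=> "2024-01-15"
;; (day-name (LocalDate/of 2024 1 15)) ;=> "Monday"
```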
Implement immediately:

1. Rounding operations (`ceil-to-nearest`, `round-to-nearest`) - completes the rounding trilogy
2. Temporal arithmetic (`plus-time`, `minus-time`, `between`) - enables time-delta operations

Implement soon:

3. Timezone operations (`with-tz`, `force-tz`) - critical for multi-timezone use cases
4. Normalization (`normalize-date`) - common operation, easy to implement

Nice-to-have:

5. Boolean predicates - useful for filtering
6. String formatting - display and export
7. Descriptive statistics - may fit better in the aggregation layer
We now have a first version of a column-level `convert-time` in `tablecloth.time.column.api` with the following semantics:
- Purpose: convert between time representations at the column level:
  - temporal ↔ epoch (e.g. `:local-date` ↔ `:epoch-milliseconds`).
  - temporal ↔ temporal (via a millis pivot).
  - epoch ↔ epoch (via numeric scaling only).
- Source/target classification:
  - Use `dtype/elemwise-datatype` + `dtdt-base/classify-datatype` to classify the source and target dtypes into `:temporal`, `:epoch`, `:duration`, `:relative`, etc.
  - Only `[:temporal …]` and `[:epoch …]` combinations are supported; anything involving `:duration` or `:relative` throws a clear `::unsupported-time-conversion` ExceptionInfo.
- Targets:
  - Accept both keywords and Java time classes.
  - Normalize via a private `normalize-target` helper that:
    - handles a small synonym map (e.g. `:zdt` → `:zoned-date-time`, `:ldt` → `:local-date-time`).
    - uses `casting/object-class->datatype` for class targets.
- Packed vs unpacked dtypes:
  - We use `tech.v3.datatype.packing/unpack-datatype` in `calendar-local-type?` so that both logical and packed `LocalDate`/`LocalDateTime`/`LocalTime` types are treated as "calendar local" when deciding whether a zone is needed.
- Zone semantics:
  - The public `convert-time` default `:zone` is UTC, implemented via `coerce-zone-id` with `{:default (dtdt/utc-zone-id)}`.
  - Zone is only passed to dtype-next conversions when the temporal side is calendar-local (no inherent zone), or when doing temporal ↔ temporal via the millis pivot.
  - Epoch ↔ epoch conversions ignore zone entirely (they are pure numeric rescalings of an absolute UTC instant representation).
- Epoch ↔ epoch workaround:
  - We avoid going through datetime for epoch ↔ epoch conversions (to dodge a dtype-next bug around `:epoch-seconds` + zone) and instead scale via `epoch->microseconds` and `tech.v3.datatype.functional`.
We should keep `convert-time` focused on representation changes only; all operations that conceptually involve lengths (durations/relatives) will have separate APIs (e.g. `between`/`time-diff`, `convert-duration`).
On top of column primitives, we can define dataset-level helpers that feel natural in tablecloth. All time operations take explicit column arguments, consistent with tablecloth's existing API patterns.
Implemented in `tablecloth.time.api.slice/slice`:

```clojure
(slice [ds time-col start end opts])
```

Example:

```clojure
(slice ds :received-date "2022-01-01" "2022-12-31")
(slice ds :timestamp #time/date "2022-01-01" #time/date "2022-12-31")
(slice ds :timestamp 1704067200000 1704153599999 {:result-type :as-indices})
```

Implementation details:

- Parse `start`/`end` using `tablecloth.time.parse/parse` (ISO-8601 strings) or accept temporal types/epoch millis directly.
- Normalize the chosen time column to epoch milliseconds using `convert-time`.
- Check sortedness (O(n)) and auto-sort if needed (ascending order).
- Support both ascending and descending sorted data.
- Use custom binary search (`tablecloth.time.utils.binary-search`) to find the `[start-idx end-idx]` range.
- Return a dataset by default, or indices with `{:result-type :as-indices}`.
Sortedness semantics:
- Default: check sortedness using `is-sorted?`, auto-sorting in ascending order if needed.
- Supports both ascending and descending sorted data (direction detected automatically).
- Never errors on unsorted data; the data is sorted transparently.
Binary search strategy:
- Always use binary search (no conditional logic or thresholds).
- Custom implementation in `tablecloth.time.utils.binary-search` for lower/upper bound finding.
- Fast even on small data: 100 rows is ~7 comparisons.
- Provides consistent, predictable performance.
Comprehensive test coverage:
- Ascending and descending sorted data
- String dates, time literals, and epoch milliseconds
- Multi-month ranges, single-row datasets, duplicate timestamps
- Edge cases: empty results, out-of-range queries, boundary matches
- Error handling: invalid date ranges, missing columns
Design decision: No dedicated rollup-every or add-bucket-column functions.
Following the R/dplyr philosophy of composable primitives, users can easily compose bucketing workflows using our column-level time functions with standard tablecloth operations:
```clojure
;; Bucket and aggregate
(-> ds
    (tc/add-column :bucket #(down-to-nearest (% :timestamp) 5 :minutes))
    (tc/group-by :bucket)
    (tc/aggregate {:count tc/row-count
                   :avg-value #(dfn/mean (% :value))}))

;; Or use field extractors for natural calendar boundaries
(-> ds
    (tc/add-column :year  #(year (% :timestamp)))
    (tc/add-column :month #(month (% :timestamp)))
    (tc/group-by [:year :month])
    (tc/aggregate {:total #(dfn/sum (% :sales))}))
```

This is:

- Transparent: you see exactly what's happening at each step
- Flexible: easy to customize (keep/drop the bucket column, add filters, etc.)
- Consistent: uses standard tablecloth patterns users already know
- Simple: just a few lines of straightforward code
A dedicated rollup-every function would be too thin to justify—unlike slice, which handles significant complexity (parsing, sorting, binary search), bucketing workflows are simple compositions of existing tools.
Possible future consideration: A rollup-every helper might be justified if user feedback shows this is a very common pattern and convenience is valued. But we'll start with composable primitives and add convenience functions only if needed.
- Column API: (✅ mostly complete)
  - ✅ Field extractors implemented in `tablecloth.time.column.api` (year, month, day, hour, minute, second, day-of-week, day-of-year, week-of-year, quarter)
  - ✅ `convert-time` for representation changes (temporal ↔ epoch)
  - 🚧 Still needed: `bucket-every-col` / `align-time` / `floor-time` for rounding/bucketing operations
  - 🚧 Additional rounding operations (`ceil-to-nearest`, `round-to-nearest`)
  - 🚧 Temporal arithmetic (`plus-time`, `minus-time`, `between`)
- Binary search helper: (✅ complete)
  - ✅ Custom implementation in `tablecloth.time.utils.binary-search`
  - ✅ Supports both lower-bound and upper-bound finding
  - ✅ Always uses binary search (no conditional logic)
- Dataset API: (✅ slice complete; bucketing/resampling/interpolation todo)
  - ✅ `slice` implemented in `tablecloth.time.api.slice` with:
    - Explicit time column argument
    - Auto-sorting if unsorted (no error-on-unsorted)
    - Binary search for range finding
    - Comprehensive test coverage
  - Downsampling/aggregation: `rollup-every` deprioritized; users can compose `down-to-nearest` + `tc/add-column` + `tc/group-by` + `tc/aggregate`
    - No dedicated dataset-level bucketing functions planned (composition is sufficient)
  - 🚧 Still needed (upsampling/interpolation):
    - `resample-to-regular-grid` for irregular → regular time series
    - Column primitives: `generate-time-range`, `interpolate-values`
    - Interpolation methods: `:ffill`, `:bfill`, `:linear`, `:nearest`, `:zero`
- Reuse from gnomon: (🚧 in progress)
  - Port tests and semantics from the gnomon repo (especially for `down-to-nearest`, `->every`, and conversion behaviors) into `tablecloth.time` tests.
  - Use those semantics to guide the column-level and dataset-level implementations.
- De-prioritize gnomon as a standalone library until/unless a clear general JVM use case emerges.
We clarified terminology and design layers around bucketing and resampling:
- Aligning / flooring (scalar or column):
  - The old scalar `down-to-nearest` from gnomon is conceptually a time-floor/align operation: given `n` and a unit (`:seconds`, `:minutes`, `:days`, etc.), map each timestamp to the start of the nearest-lower interval of that size.
  - At the column level this becomes a vectorized op (implemented with dtype-next) that is index-agnostic:

    ```clojure
    (down-to-nearest col 10 :seconds ...) ; or align/floor naming variant
    ```

- Bucketing:
  - Use the aligned value as a bucket key and group by it, usually followed by aggregation.
  - This is conceptually:

    ```clojure
    (let [aligned (down-to-nearest time-col 10 :seconds)]
      (-> ds
          (tc/add-column :bucket aligned)
          (tc/group-by :bucket)
          (tc/aggregate agg-spec)))
    ```

- Resampling / adjust-frequency:
  - A higher-level dataset operation that:
    - chooses a time coordinate (index or explicit column),
    - aligns it using `down-to-nearest` / `align-time`,
    - and then groups/aggregates and/or constructs a new regular time index.
  - This is where a future `adjust-frequency` or `resample-time` API will live, built on top of column primitives.
  - Important: this covers downsampling/aggregation (many → fewer points).
We will:
- Keep `down-to-nearest` (or a renamed `align-time` / `floor-time`) as the primitive alignment op (scalar + column), completely independent of any dataset index concept.
- Build dataset-level resampling (`adjust-frequency`, `resample-time`, etc.) in terms of:
  - a chosen time column (optionally marked via `index-by`),
  - column-level alignment (`down-to-nearest`), and
  - standard group/aggregate operations.
Use case (from Zulip conversation with Daniel Slutsky):
Just to make it concrete with an example: the beats of the heart (or our estimates of when they happen) are irregular in time. Often, when we wish to conduct some analysis regarding heart rate or heart rate variability, we will interpolate and resample, so we have a regular time series, and then we can use methods like Fourier transform for frequency analysis.
This highlights a second type of resampling distinct from bucketing/aggregation:
- Downsampling/aggregation (many → fewer points)
  - Input: high-frequency irregular data (e.g., 1000 transactions/day, irregular heartbeats)
  - Output: lower-frequency aggregated data (e.g., daily totals, minute-by-minute averages)
  - Operation: aggregate multiple values into buckets (sum, mean, count, etc.)
  - Implementation: compose `down-to-nearest` + `tc/group-by` + `tc/aggregate` (per the bucketing decision above; no dedicated `rollup-every`)
- Upsampling/interpolation (fewer → more points, or irregular → regular)
  - Input: irregular time series (e.g., heartbeats at 0ms, 850ms, 1720ms, 2540ms, ...)
  - Output: regular time grid (e.g., every 100ms: 0ms, 100ms, 200ms, 300ms, ...)
  - Operation: interpolate to estimate values at new time points
  - Use cases: signal processing (FFT), regular time grids for ML, filling gaps
```clojure
(resample-to-regular-grid [ds time-col value-col interval unit opts])
```

Example:

```clojure
(resample-to-regular-grid ds :beat-time :rr-interval 100 :milliseconds
                          {:method :linear       ; or :ffill, :bfill, :nearest, :zero
                           :start "2024-01-01"   ; optional: explicit start time
                           :end   "2024-01-02"}) ; optional: explicit end time
```

Common interpolation methods:

- `:ffill` - forward fill (last observation carried forward)
- `:bfill` - backward fill
- `:linear` - linear interpolation between points
- `:nearest` - nearest neighbor
- `:zero` - fill missing with zeros
This would build on column-level primitives (to be added):
- Time grid generation:

  ```clojure
  (generate-time-range [start end interval unit opts])
  ;; Returns a column of evenly-spaced time points
  ```

- Interpolation:

  ```clojure
  (interpolate-values [time-col value-col new-times method opts])
  ;; For each time in new-times, interpolate a value from the surrounding
  ;; points in time-col/value-col
  ```

- Dataset-level operation:
  - Generate a regular time grid (or accept an explicit grid)
  - For each output time point, interpolate the value from surrounding input points
  - Return a new dataset with a regular time index
Implementation notes:
- More complex than bucketing/aggregation because it requires:
- Looking at neighboring points (not just within a bucket)
- Mathematical interpolation logic (especially for
:linear) - Handling edge cases (before first point, after last point)
- Could leverage dtype-next operations where applicable
- Consider whether to support multi-column interpolation (interpolate all numeric columns)
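As an illustration of the `:linear` method on plain Clojure vectors (the edge-case policy here is an assumption: points outside the observed range clamp to the nearest endpoint, and a real implementation would use binary search rather than `take-while`):

```clojure
(defn interpolate-linear
  "For each t in new-times, linearly interpolate from (times, values).
  `times` must be sorted ascending; all arguments are Clojure vectors."
  [times values new-times]
  (mapv (fn [t]
          (cond
            (<= t (first times)) (first values)   ; clamp before first point
            (>= t (last times))  (last values)    ; clamp after last point
            :else
            (let [i    (dec (count (take-while #(<= % t) times)))
                  t0   (times i)  t1 (times (inc i))
                  v0   (values i) v1 (values (inc i))
                  frac (/ (- t t0) (double (- t1 t0)))]
              (+ v0 (* frac (- v1 v0))))))
        new-times))

;; (interpolate-linear [0 1000] [0.0 10.0] [0 500 1000]) ;=> [0.0 5.0 10.0]
```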
Priority: Medium-to-high, as it addresses a distinct and important use case (signal processing, ML pipelines) that cannot be satisfied by bucketing/aggregation alone.
Unit handling:
- We'll keep a small `normalize-unit` function at the `tablecloth.time` layer to normalize user-facing units (e.g. `:second` → `:seconds`, `:minute` → `:minutes`, `:week` → `:weeks`, etc.).
- Internally we will:
  - Map duration-like units (`:seconds`, `:minutes`, `:hours`, `:days`, `:weeks`, etc.) onto dtype-next's relative dtypes and constants.
  - Treat calendar-like units (`:months`, `:quarters`, `:years`) via `LocalDate`/`LocalDateTime` logic (month/quarter/year boundaries), not as simple fixed-length relative durations.
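A sketch of `normalize-unit` as a simple synonym map with pass-through (the exact synonym set is an assumption):

```clojure
(def ^:private unit-synonyms
  {:second :seconds, :minute :minutes, :hour :hours
   :day :days, :week :weeks, :month :months
   :quarter :quarters, :year :years})

(defn normalize-unit
  "Map singular user-facing unit keywords to their plural canonical form;
  already-canonical units pass through unchanged."
  [unit]
  (get unit-synonyms unit unit))

;; (normalize-unit :minute)  ;=> :minutes
;; (normalize-unit :minutes) ;=> :minutes
```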
This aligns with how other ecosystems behave conceptually:
- pandas: `Timestamp.floor` / `Series.dt.floor` for alignment, `.resample()` for resampling.
- R (lubridate/dplyr): `floor_date()` for alignment, `group_by(floor_date(...)) %>% summarise(...)` for bucketing/resampling.
- SQL (Postgres): `date_trunc()` for alignment, `GROUP BY date_trunc(...)` for bucketing.
We have two different rolling window approaches to evaluate:
Type: Fixed-size row-count-based rolling windows
Key characteristics:
- Window defined by number of rows (e.g., "previous 3 rows")
- Works on any column (time or not)—just uses positional indices
- Returns a grouped dataset where each group contains the window rows
- Implementation uses row indices and leverages `tc/group-by` with a custom grouping map
Use case example:
```clojure
;; Window of previous 3 rows
(rolling-window ds 3)
;; → Each row gets a dataset containing [row-2, row-1, current-row]
```

Characteristics:

- Always the same number of rows per window (except at the start, where windows are truncated)
- Not time-aware; just counts rows
- General-purpose, could work on any ordered data
Type: Time-based duration windows
Key characteristics:
- Window defined by time duration (e.g., "previous 5 minutes")
- Requires sorted, monotonically increasing datetime column
- Works with irregular time series (variable number of rows per window)
- Returns index ranges (just `[start-idx end-idx]` pairs, not datasets)
- Units in microseconds (or other time units via constants)
Use case example:
```clojure
;; Window of previous 5 minutes
(variable-rolling-window-ranges time-col 5 :minutes)
;; → For each time point, returns [start-idx end-idx] covering the previous 5 minutes
;; → Variable number of rows per window depending on data density
```

Characteristics:

- Window size varies: sparse data = fewer rows, dense data = more rows
- Time-aware; uses actual temporal values
- Specifically designed for time-series analysis
| Feature | Archived `rolling-window` | dtype-next `variable-rolling-window-ranges` |
|---|---|---|
| Window definition | Fixed row count | Time duration |
| Result | Grouped dataset | Index ranges |
| Rows per window | Always the same (3 rows, 10 rows, etc.) | Variable (depends on time density) |
| Time-aware? | No (just counts rows) | Yes (uses actual time values) |
| Use case | "Last N observations" | "Last N minutes/hours/days" |
| Data requirement | Any ordered data | Sorted datetime column |
For time-series work, dtype-next's approach seems more appropriate because:
- Time-based windows are more meaningful: "5-minute rolling average" vs "3-row rolling average"
- Handles irregular data properly: If heartbeats are irregular, you want "last 10 seconds of beats" not "last 3 beats"
- More flexible: can build row-count windows on top of time windows, but not vice versa
However, the archived version has value as a general-purpose utility:
- Simpler for non-time data
- Returns grouped datasets (higher-level abstraction)
- Could potentially live in tablecloth itself (not time-specific)
Open questions:
- Should we provide both? Row-count for simplicity, time-based for time-series?
- What API should we expose? Grouped datasets vs index ranges vs something else?
- How does this relate to aggregation? Do we want `(rolling-mean ds :value 5 :minutes)` or lower-level primitives?
- Should row-count rolling windows live in `tablecloth.api` rather than `tablecloth.time`?
Priority: Medium—useful for time-series analysis but not as fundamental as slice/bucket/interpolate.
For details on the underlying dtype-next datetime namespace (tech.v3.datatype.datetime) that we are lifting into tablecloth.time, see:
`doc/dtype-next-datetime-api-notes.md`