[new tool] @agent-tools/dataframe — In-memory tabular data manipulation and analysis #204

@burner-agent

Tool Name

@agent-tools/dataframe

Description

In-memory tabular data manipulation with column-oriented operations — filter, derive, group, aggregate, join, pivot, and reshape data tables without external databases.

Why It's Useful for Agents

Agents frequently need to process, transform, and analyze structured data: CSV files, query results, API response arrays, log entries, metrics. Today they must either write imperative loops or spawn a database process. A DataFrame tool provides declarative, composable table operations directly in memory.
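To make the "imperative loops" point concrete, here is a minimal sketch of what an agent has to write by hand today to get a mean salary per department. The `Row` type and the `meanSalaryByDept` helper are hypothetical names chosen for illustration; they are not part of the proposed tool.

```typescript
// Hypothetical row shape, mirroring the sample data used in the proposal.
type Row = { name: string; dept: string; salary: number };

// Hand-written aggregation: accumulate a running sum and count per department,
// then divide. This is the bookkeeping a groupby + rollup would replace.
function meanSalaryByDept(rows: Row[]): Map<string, number> {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const row of rows) {
    const t = totals.get(row.dept) ?? { sum: 0, count: 0 };
    t.sum += row.salary;
    t.count += 1;
    totals.set(row.dept, t);
  }
  const means = new Map<string, number>();
  totals.forEach((t, dept) => means.set(dept, t.sum / t.count));
  return means;
}
```

Each additional aggregate (max, median, and so on) needs its own accumulator and loop logic, which is the repetition a verb-based table API is meant to remove.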

Built on Arquero (1.5k stars, BSD-3-Clause, v8.0.3, ~44k weekly npm downloads, UW Interactive Data Lab), a dplyr/pandas-inspired verb-based table library with lazy evaluation and Apache Arrow interop. Arquero is lighter and more focused than Danfo.js (5k stars, MIT, ~4.7k weekly downloads; pandas-like, but with a heavier TensorFlow.js dependency) and complements apache-arrow (3M weekly downloads, Apache-2.0; provides a columnar memory format and IPC, but no query verbs).

Distinct from #150 (@agent-tools/parquet — file format I/O, not in-memory manipulation), #72 (@agent-tools/tabular — CSV/spreadsheet read/write, not query operations), #66 (@agent-tools/sqlite — SQL database, not in-memory functional API), #105 (@agent-tools/math — numerical computation, not table operations).

Proposed API

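// op (aggregate helpers) and desc (descending sort) are used below, so they are imported too
import { dataframe, op, desc } from "@agent-tools/dataframe";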

// Create from arrays, objects, CSV, or Arrow tables
const df = dataframe.from([
  { name: "Alice", dept: "eng", salary: 120000 },
  { name: "Bob", dept: "eng", salary: 110000 },
  { name: "Carol", dept: "sales", salary: 95000 },
]);

// Verb-based transformations (chainable, lazy)
const result = df
  .filter((d) => d.salary > 100000)
  .derive({ bonus: (d) => d.salary * 0.1 })
  .select("name", "dept", "bonus");

// Grouping and aggregation
const summary = df
  .groupby("dept")
  .rollup({
    count: (d) => op.count(),
    avg_salary: (d) => op.mean(d.salary),
    max_salary: (d) => op.max(d.salary),
  });

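// Joins (otherDf is an illustrative second table that shares the dept column)
const otherDf = dataframe.from([
  { dept: "eng", floor: 3 },
  { dept: "sales", floor: 1 },
]);
const merged = df.join(otherDf, ["dept", "dept"]);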

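// Pivot / reshape (assuming Arquero's pivot(keys, values) signature:
// group by the row identifier, then spread dept values into columns)
const wide = df.groupby("name").pivot("dept", "salary");
const long = wide.fold(["eng", "sales"], { as: ["dept", "salary"] });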

// I/O
const csv = result.toCSV();
const objects = result.objects();
const arrow = result.toArrow(); // Apache Arrow IPC interop

// Summary statistics
const stats = df.describe(); // count, mean, std, min, max per numeric column

// Sorting and sampling
const top5 = df.orderby(desc("salary")).slice(0, 5);
const sample = df.sample(100);
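The pivot and fold calls are the least self-explanatory part of the API above, so the following dependency-free sketch shows the reshaping they are expected to perform on plain objects. The helpers `pivotWide` and `foldLong` are hypothetical names for illustration only, not the tool's or Arquero's implementation. Skipping empty cells when folding is a simplification made by this sketch.

```typescript
type Row = Record<string, string | number | undefined>;

// Long -> wide: one output row per distinct `id` value,
// one output column per distinct value found in the `key` column.
function pivotWide(rows: Row[], id: string, key: string, value: string): Row[] {
  const out: Row[] = [];
  const index = new Map<string, Row>();
  for (const row of rows) {
    const idValue = String(row[id]);
    let wide = index.get(idValue);
    if (wide === undefined) {
      wide = { [id]: row[id] };
      index.set(idValue, wide);
      out.push(wide);
    }
    wide[String(row[key])] = row[value];
  }
  return out;
}

// Wide -> long: turn the listed columns back into (key, value) pairs.
function foldLong(rows: Row[], columns: string[], names: [string, string]): Row[] {
  const [keyName, valueName] = names;
  const out: Row[] = [];
  for (const row of rows) {
    // Carry over every column that is not being folded.
    const rest: Row = {};
    for (const col of Object.keys(row)) {
      if (columns.indexOf(col) === -1) rest[col] = row[col];
    }
    for (const col of columns) {
      if (row[col] === undefined) continue; // sketch-only choice: skip empty cells
      out.push({ ...rest, [keyName]: col, [valueName]: row[col] });
    }
  }
  return out;
}
```

With the three sample employees, `pivotWide(rows, "name", "dept", "salary")` yields one row per name with an `eng` or `sales` column, and `foldLong(wide, ["eng", "sales"], ["dept", "salary"])` restores the original long shape.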

Scope

In scope:

  • Table creation from arrays, objects, CSV strings, Apache Arrow tables
  • Column selection, renaming, reordering
  • Row filtering with expression functions
  • Derived/computed columns
  • Groupby + rollup aggregation (count, sum, mean, median, min, max, stdev, variance)
  • Joins (inner, left, right, full, cross, semi, anti)
  • Pivot (wide↔long), fold, spread
  • Sorting, slicing, sampling, deduplication
  • Descriptive statistics (describe)
  • Export to objects, CSV, Arrow IPC
  • Expression language with arithmetic, string, and date operations
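The join kinds listed above differ mainly in which rows survive and which columns are attached. The sketch below illustrates four of them (inner, left, semi, anti) on plain arrays, joining on a shared `dept` field. The `Emp` and `Dept` types and the function names are assumptions made for this illustration and do not describe the tool's internals.

```typescript
type Emp = { name: string; dept: string };
type Dept = { dept: string; floor: number };

// Inner: keep only matching pairs, with columns from both sides.
function innerJoin(left: Emp[], right: Dept[]): (Emp & Dept)[] {
  const out: (Emp & Dept)[] = [];
  for (const l of left) {
    for (const r of right) {
      if (l.dept === r.dept) out.push({ ...l, ...r });
    }
  }
  return out;
}

// Left: keep every left row; right-side columns stay undefined without a match.
function leftJoin(left: Emp[], right: Dept[]): (Emp & { floor?: number })[] {
  const out: (Emp & { floor?: number })[] = [];
  for (const l of left) {
    const matches = right.filter((r) => r.dept === l.dept);
    if (matches.length === 0) out.push({ ...l });
    for (const r of matches) out.push({ ...l, floor: r.floor });
  }
  return out;
}

// Semi: keep left rows that have a match; no right-side columns are added.
function semiJoin(left: Emp[], right: Dept[]): Emp[] {
  return left.filter((l) => right.some((r) => r.dept === l.dept));
}

// Anti: keep left rows that have no match.
function antiJoin(left: Emp[], right: Dept[]): Emp[] {
  return left.filter((l) => !right.some((r) => r.dept === l.dept));
}
```

Right and full joins follow the same pattern with the roles of the tables swapped or combined, and a cross join simply emits every pair without a match condition.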

Out of scope:

  • Persistent storage (use @agent-tools/sqlite)
  • File format I/O beyond CSV (use @agent-tools/parquet, @agent-tools/tabular)
  • Visualization / charting (use @agent-tools/chart)
  • Machine learning / statistical modeling
  • Distributed / out-of-core processing

Labels: help wanted, infrastructure, new-tool