Assistant: Provide a way to "see" data objects #7114

Open
jmcphers opened this issue Apr 3, 2025 · 4 comments
Labels: area: assistant (Issues related to Positron Assistant)

Comments

@jmcphers
Collaborator

jmcphers commented Apr 3, 2025

If you ask Assistant about some data in your environment, especially if it is large, it will typically try to execute R or Python functions that return plain-text versions of the information.

While this kind of works, these tools (summary and str) do not format information in a way that is intended for, or often even legible to, an LLM. For example, here's the summary the model asked for. It relies on whitespace formatting and has multiple columns, so it's difficult or impossible for the model to parse.

summary(diamonds)
     carat               cut        color        clarity     
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
 Max.   :5.0100                     I: 5422   VVS1   : 3655  
                                    J: 2808   (Other): 2531  
     depth           table           price             x         
 Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
 1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
 Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
 Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
 3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
 Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
                                                                 
       y                z         
 Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 4.720   1st Qu.: 2.910  
 Median : 5.710   Median : 3.530  
 Mean   : 5.735   Mean   : 3.539  
 3rd Qu.: 6.540   3rd Qu.: 4.040  
 Max.   :58.900   Max.   :31.800  

This problem was also observed by @jcheng5 when working with DataBot, which is why DataBot converts data to JSON before sending it to the model.

To provide Assistant with better tools for working with data, we should implement a tool that can give it well-structured information about a data set. Specifically:

  • Unlike "execute code", the tool need not require confirmation since it is only reading information. This will allow the model to repeatedly look at data without pausing to ask the user to run code to see what the result looks like.
  • The tool should return structured data in JSON. Existing models do really well with this format.
  • Ideally, the tool should be usable to get a structured representation of any data type. (We might be able to use the existing variables comm?)
  • Ideally, the execute code tool could also emit structured JSON when the result of execution is a data frame/table, for the model to consume easily.
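
As a rough illustration of the structured output described in the list above, here is a minimal sketch in Python that summarizes a table as JSON instead of whitespace-aligned text. It is not Positron code: the function name `describe_table`, the column-oriented dict input, and the output field names are all assumptions made for this example, and the sample values are made up.

```python
import json
import statistics
from collections import Counter


def describe_table(columns, max_levels=5):
    """Return a JSON string describing each column of a column-oriented table.

    `columns` maps column names to lists of values (None marks a missing value).
    The output field names are hypothetical, chosen only for this sketch.
    """
    n_rows = len(next(iter(columns.values()), []))
    summary = {"n_rows": n_rows, "columns": []}
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        info = {"name": name, "n_missing": len(values) - len(present)}
        is_numeric = bool(present) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )
        if is_numeric:
            info["type"] = "numeric"
            info["min"] = min(present)
            info["max"] = max(present)
            info["mean"] = round(statistics.fmean(present), 4)
            info["median"] = statistics.median(present)
        else:
            info["type"] = "categorical"
            # Keep only the most frequent levels so the payload stays small.
            counts = Counter(str(v) for v in present)
            info["top_levels"] = dict(counts.most_common(max_levels))
        summary["columns"].append(info)
    return json.dumps(summary, indent=2)


# A tiny stand-in table; the values are made up for illustration.
sample = {
    "carat": [0.23, 0.21, 0.29, None],
    "cut": ["Ideal", "Premium", "Premium", "Good"],
}
print(describe_table(sample))
```

Each column becomes one self-describing object, so the model does not have to infer column boundaries from spacing. Because the function only reads its input, it is the kind of call that could plausibly run without a confirmation prompt.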

jmcphers added the area: assistant label Apr 3, 2025

@seeM
Contributor

seeM commented Apr 4, 2025

Some thoughts:

  1. We could implement this in the main thread following the example of EditTool if needed.
  2. The mechanism for getting structured variable data could also be useful if/when we decouple language servers from kernels.
  3. Many dataframe types include a text/html mime bundle in execute results that we could use for the execute code part.
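
To make point 3 concrete, here is a hedged sketch of turning the `text/html` entry of an execute result's MIME bundle into JSON records using only the Python standard library. The bundle is modeled as a plain dict keyed by MIME type, as in Jupyter execute results; the names `TableParser` and `bundle_to_records` are hypothetical, and the sketch assumes a simple table whose first row is the header.

```python
import json
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every table row in the document."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def bundle_to_records(bundle):
    """Convert the text/html entry of a MIME bundle to a list of dicts, or None."""
    html = bundle.get("text/html")
    if html is None:
        return None
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return None
    # Assumption: the first row holds the column names.
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


bundle = {
    "text/plain": "     cut  price\n0  Ideal    326",
    "text/html": "<table><thead><tr><th>cut</th><th>price</th></tr></thead>"
                 "<tbody><tr><td>Ideal</td><td>326</td></tr>"
                 "<tr><td>Premium</td><td>327</td></tr></tbody></table>",
}
print(json.dumps(bundle_to_records(bundle)))
```

All cell values come back as strings, and real dataframe HTML often carries an unnamed index column and truncation markers, so a production version would need type recovery and extra handling that this sketch leaves out.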

@wesm
Contributor

wesm commented Apr 15, 2025

For tabular data (data frames), we could take advantage of the RPCs (schemas, raw values, summary stats, histograms, frequency tables, and so on) provided by the data explorer comm (but without opening a data explorer UI tab). I was just thinking today separately that it would make sense for the assistant to be able to access all of the statistical summaries displayed in the data explorer and compute statistics with filters applied, etc.
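
The idea of computing statistics with filters applied can be sketched as a single stateless call. This is only an illustration in plain Python over a list of row dicts: the function `summarize`, the `(column, op, target)` filter tuples, and the operator names are assumptions for this example and are not taken from the data explorer comm.

```python
import json
import statistics

# Hypothetical filter operators; these names are not taken from the data explorer comm.
OPERATORS = {
    "eq": lambda value, target: value == target,
    "gt": lambda value, target: value > target,
    "lt": lambda value, target: value < target,
}


def summarize(rows, column, filters=()):
    """Apply (column, op, target) filters, then summarize one numeric column."""
    kept = [
        row for row in rows
        if all(OPERATORS[op](row[col], target) for col, op, target in filters)
    ]
    values = [row[column] for row in kept]
    result = {
        "column": column,
        "n_rows": len(values),
        "filters": [list(f) for f in filters],
    }
    if values:
        result.update(min=min(values), max=max(values), mean=statistics.fmean(values))
    return result


# Made-up rows for illustration.
rows = [
    {"cut": "Ideal", "price": 326},
    {"cut": "Premium", "price": 400},
    {"cut": "Ideal", "price": 500},
]
print(json.dumps(summarize(rows, "price", filters=[("cut", "eq", "Ideal")])))
```

Echoing the filters back in the result lets the model see exactly which subset a statistic describes, which matters when it issues several differently filtered requests in a row.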

@wesm
Contributor

wesm commented Apr 17, 2025

I just opened #7299 as one thing to think about. My initial thought was to add a stateless data querying API to the variables comm that provides an initial subset of the capabilities that already exist within the data explorer comm, but this might come with drawbacks. The challenge I see is that it is currently difficult to use the data explorer comm directly without also opening the UI/editor tab. I'll keep thinking about this. The assistant will need to be made aware of changes to datasets, since those invalidate previous tool calls, so this is probably an argument to refactor things so that the assistant can open a data explorer comm without needing to open the UI tab. If the user tries to view a dataset that's already being examined by the assistant, we can simply open another data explorer comm for the same variable.
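
The invalidation concern can be illustrated with a small sketch: each tool result is stored together with a fingerprint of the data it was computed from, and a later check reports whether the data has changed since. Everything here (`fingerprint`, `ToolResultCache`, the call ids) is hypothetical. Hashing the whole table is used only to keep the example self-contained; a real implementation would presumably rely on change events from the backend rather than rehashing the data.

```python
import hashlib
import json


def fingerprint(columns):
    """Hash a column-oriented table so that any change to its contents changes the hash."""
    payload = json.dumps(columns, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class ToolResultCache:
    """Remember tool results along with the data fingerprint they were computed from."""

    def __init__(self):
        self._entries = {}

    def store(self, call_id, columns, result):
        self._entries[call_id] = (fingerprint(columns), result)

    def is_stale(self, call_id, columns):
        stored_fingerprint, _ = self._entries[call_id]
        return stored_fingerprint != fingerprint(columns)


table = {"price": [326, 400, 500]}
cache = ToolResultCache()
cache.store("call-1", table, {"max": 500})
print(cache.is_stale("call-1", table))  # prints False: data unchanged
table["price"].append(18823)
print(cache.is_stale("call-1", table))  # prints True: data changed after the call
```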

@seeM
Contributor

seeM commented Apr 22, 2025

[...] this is probably an argument to refactor things so that the assistant can open a data explorer comm without needing to open the UI tab

This is my thinking atm. I'm not sure if we need any changes to the OpenRPC schema, or if this can be done in the IDE where we can directly reuse types. Maybe by decoupling DataExplorerClientInstance from the UI?

Also worth keeping in mind that we may more generally want custom UIs in the assistant chat pane for tools that interact with comms, and headless comm clients could be useful for that too.
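
One way to picture a headless comm client is a class that only knows how to send a request and read a reply, with the transport injected from outside so that neither a UI tab nor an assistant tool is baked in. The sketch below is in Python purely for illustration and is not how Positron is structured. The class `HeadlessClient`, the `fake_backend` function, and the `describe_columns` method name are all hypothetical; the message shape is JSON-RPC 2.0 style.

```python
import json
from typing import Callable


class HeadlessClient:
    """Issue JSON-RPC-style requests through an injected transport; no UI dependency."""

    def __init__(self, send: Callable[[str], str]):
        self._send = send
        self._next_id = 0

    def request(self, method: str, params: dict) -> dict:
        self._next_id += 1
        message = {
            "jsonrpc": "2.0",
            "id": self._next_id,
            "method": method,
            "params": params,
        }
        reply = json.loads(self._send(json.dumps(message)))
        if "error" in reply:
            raise RuntimeError(reply["error"]["message"])
        return reply["result"]


def fake_backend(raw: str) -> str:
    """Stand-in for a kernel-side comm handler; answers one hypothetical method."""
    message = json.loads(raw)
    if message["method"] == "describe_columns":  # hypothetical method name
        result = {"columns": [{"name": "carat", "type": "numeric"}]}
        return json.dumps({"jsonrpc": "2.0", "id": message["id"], "result": result})
    error = {"code": -32601, "message": "Method not found"}
    return json.dumps({"jsonrpc": "2.0", "id": message["id"], "error": error})


client = HeadlessClient(fake_backend)
print(client.request("describe_columns", {"variable": "diamonds"}))
```

With this split, a UI tab and an assistant tool could each hold their own client for the same variable, and a custom chat-pane UI could reuse the same client without any editor being open.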


5 participants