Use Folders as input to Lumentis #11

HebeHH · 2024-03-28T00:59:28Z

Folder Input

This PR allows users to provide a folder path as input data, instead of a single file. It gets a tree of usable text files from that, and then asks users to select which of them to include. It doesn't allow users to proceed until their selections are small enough to fit into context. Once it has the selected list of files, it adds them all into the input space.

Using it

The folder import flow begins automatically if the user provides a folder instead of a file when prompted.

The wizard will display a file tree, including only LLM-usable files, and only folders which contain LLM-usable files. The user can then select or deselect here.

Deselecting a folder will mean all children will also be removed from consideration. Unfortunately this doens't change the green wizard circle, because I couldn't find functionality for this, but it does mean they won't be included in the LLM context.

The wizard will keep prompting the user to deselect files until the total file token count is below the constants.ts MAX_TOKEN_LIMIT, which I've currently set to 200,000. Every time it re-prompts, it will first update the file tree to remove anything deselected by the user (including all children of deselected files).

Once the user has trimmed the context sufficiently, this Wizard action ends.

All selected files are then loaded into wizardState.loadedPrimarySource.

Selecting usable files

This is important to make sure all info fed to the LLM is usable. I've got two methods here:

Exclude when fetching directory tree
Check individual files are readable

Excluding when fetching the directory tree

When the user provides a folder, we get a tree representation of that folder using directory-tree, which allows for regex path exclusion.

I'm giving two types of exclusion here.
Folders: Folders which we can assume aren't containing documentation-important files, such as node_modules, cahces, build folders, etc.
Extensions: Extensions which don't typically contain text data, like images, tars, compiled binaries, etc.

GitHub Copilot contributed to extending both of these lists. There'll be some missed anyway, please do build on top of them. They're the folderTreeExclusions and excludeExtensions respectively in the folder-importing-utils.ts file.

I've tested both types of regexes (at different folder levels, similar-but-different naming styles, etc) and they're working for me. However, I'm not sure about cross platform support.

Checking individual file readability

Once we've got the tree of files with the previous exclusions in place, I go through each individual file to check if it looks AI-readable. This is a more complicated business; exclusions can be guaranteed, but there's a vast swathe of "I have no idea what this is"-type files. The goal here is to be pretty confident that, if the file is what the extension says it was, an LLM will be able to get useful info from it.

Again, two methods here: mimetype and extension.

File mimetype readability

I use the mime-type package as a generalized way to sort unknown file extensions into readable or not. Media types are what the internet uses to interpret what kind of media a file is: the main type and the subtype.

I accept all files with mime.type = text, and also message (used for encoding email .eml files). These are pretty much guaranteed to be LLM-readable, and if they're not, that's on the user.

I also accept a limited subset of mime.type = application, based on the mime.subtype. This covers JSON, YAML, XML, and rich text format.

Code extensions

Unfortunately mime types doesn't seem to cover most programming language extensions; the library returns false (aka doesn't know).

This is handled by grabbing ppisarczyk's gist of programming and data languages and the associated extensions. I pull all extensions associated with programming languages from here, and let them all through on the basis that most LLMs are trained on a lot of code.

Possibly we should also include approval for additional file extensions not covered by either of the above, but I haven't personally run into any and it's tricky to find an actual list.

Problems to resolve before committing

Result printout?

My terminal displays all the file names pretty weirdly after the Checkbox finishes:

I'm not sure why this is; would appreciate someone else checking it out.

Remaining testing

I've tested all the different functions pretty thoroughly to make sure they're working. However:

I got rate limited before I could actually test the full suite. All the intermediates look good (outline generation, .lumentis/messages), but I couldn't actually generate the full docs.
I also haven't checked that it bundles properly as a npm package that can run on all systems.
Also haven't successfully tested on a large folder. I gave it my entire /Documents/ and it hung, but I'm not sure if that counts as user error or an actual problem. See 'Other Questions' for some ideas on how to resolve.

Improve file/folder filtering

I've done what I can here, but please feel free to improve.

Additional changes

If the user provides a file name (not a folder), check that it's a usable file type (see: Checking individual file readability).

…prompt tokens.

hrishioa · 2024-03-28T08:26:28Z

package.json

    "cross-spawn": "^7.0.3",
+    "directory-tree": "^3.5.1",
+    "mime-types": "^2.1.35",
+    "next": "^14.1.4",


Haha it happened again

What happened again? Do I need to remove next?

hrishioa · 2024-03-28T08:27:43Z

src/app.ts

+      let promptTokens = getAdditionalPromptTokens(file_choices);
+      while (first_time || folderTokenTotal + promptTokens > CLAUDE_PRIMARYSOURCE_BUDGET) {
+        if (!first_time) {
+          console.log("=You've selected too many tokens. Please deselect files to exclude.");


= here intentional?

nope, pushed fix.

hrishioa · 2024-03-28T08:27:58Z

src/app.ts

+        attributes: ["size", "type", "extension"]
+      });
+      resetFolderTokenTotal();
+      recursivelyRemoveExcludedFilesAndAddTokenCount(fileTree);


Yup need timeout

Promise.race is a good way

HebeHH · 2024-04-01T13:42:53Z

So the issue where it seems to print the Checkbox result to console is persistent:

I'm pretty sure this is the await checkbox() call itself that's doing this - I put console.log() on either side and the printout happened in between.

The inquirer docs seem to indicate there's an optional output parameter which redirects output, but I tried this with both undefined and process.stderror and there wasn't any change.

Would appreciate thoughts from someone who knows inquirer better.

HebeHH · 2024-04-01T13:53:52Z

Resolved previous issue: use renderSelectedChoices in the Theme

…nput.validate. Changed biome line length to 150 so it wouldn't break things up as much, which had a lot of knock on impact.

HebeHH · 2024-04-01T15:01:48Z

Timeouts and Workers

Importing folders is all well and good until the user wants to add all of Documents/ and the entire program hangs silently. In general, this area has the potential for overlong calls that take too much time to execute.

The clear solution is to add timeouts to all the overlong functions. Unfortunately, timeouts are a bit complicated here, given:

Single threaded environments.
Need to cancel calls after timeout.

I've resolved this with worker threads.

Getting the folder tree structure

This is the first long call we make. I was using the directory-tree library, but also compared this to glob and fast-glob when refactoring.

directory-tree: Doesn't require refactoring, and is the fastest in the few tests I did. However, it's also single-threaded, doesn't give up execution, and can't be cancelled. This means timeouts won't work at all if it's called in the same execution thread.
fast-glob: second fastest, about 4x slower than directory-tree. Gives up execution (so can be Promise.race()'d with a timeout), but has no easy way to cancel other than calling process.exit(), which would kill the entire program.
glob: slowest, about 4x slower than fast-glob. However it both gives up execution and can be cancelled with an AbortSignal.

Overall: glob is the only one which is feasible without worker threads, and I just didn't feel like the slowness was worth it. I was surprised by how fast directory-tree was - although that may need to be checked cross-platform.

I went with directory-tree because 1) speed 2) less refactoring of other functions 3) worker threads are probably a better mechanism anyway.

Mechanisms

I've now moved all code for this into src/folder-importing. This has 4 worker scripts (1 for getting the dirtree and 3 for different requirements on recursing over it) and one utils.ts script to pull it together.

For all of the time-heavy worker threads, I start them running and then Promise.race them against a timer.

app.ts needed a bunch of code around communicating to the user if one of the workers timed out, which means a bunch of if/else blocks.

To be double-checked

Cross-platform ability of workers and directory-tree library
Whether it bundles for npm properly
That the merge with master looks okay - I went through it several times but it was a little complex with the introduction of youtube transcripts.

HebeHH · 2024-04-01T15:02:03Z

Anyway @hrishioa I think this is ready for checking and merging now

hrishioa

This is awesome - some comments, on to testing locally!

hrishioa · 2024-05-11T03:50:44Z

.vscode/settings.json

  "editor.defaultFormatter": "biomejs.biome",
-  "editor.formatOnSave": false
+  "editor.formatOnSave": true,
+  "github.gitAuthentication": false,


What's happening here?

changed the VS Code settings so it would format on save and wouldn't try to to github auth on it's own (VS Code github auth can be super buggy).

assets/programming_language_extensions.json

package.json

biome.json

src/types.ts

src/app.ts

hrishioa · 2024-05-11T03:56:13Z

src/app.ts

+
+  if (wizardState.primarySourceFileOrFolderName && fs.lstatSync(wizardState.primarySourceFileOrFolderName).isDirectory()) {
+    const parsed_filename = parsePlatformIndependentPath(wizardState.primarySourceFileOrFolderName);
+    let fileTree: dirTree.DirectoryTree | "timeoutFailed" = await getFileTree(parsed_filename);


Does this have a potential to error out? Should we add a try catch?

I don't think so @hrishioa ? If there's an error it will just return timeoutFailed, and I handled that. If you know of another error, can you add the handling? I haven't run into any

hrishioa · 2024-05-11T03:56:45Z

src/app.ts

+        return; // Return if the timeout is reached
+      }
+      promptTokens = getAdditionalPromptTokens(file_choices);
+    }


Should we move this whole section out to a new function to keep the wizard processing section simpler?

The wizard processing section was already long enough that I didn't think this added too much complexity - and there's also plenty references to wizard state in this section. I'd thought the generat structure was that all references to wizard state were to be in the app.ts.

If we want to tidy up app.ts maybe that should be a separate PR? I tried a bit here by adding code comments to separate out the different sections.

hrishioa · 2024-05-11T04:12:04Z

Package size hasn't gone up by that much ☝️

Works well if you run with bun - especially getting lumentis to review its own code! Here's your upgrade to lumentis (with sonnet for the outline and haiku for the writing) reviewing itself: https://lumentis-self-review.vercel.app/
(Not sure why the _meta.json is a little screwed up, you can see the headings on subsections don't match the actual names of the files)

We do have an import issue with programming_language_extensions.json though - follow these steps to reproduce:

Compile with bun package.
Try executing the compiled package (which is what npx will be using when you publish it) by running node <path_to_lumentis_folder>/dist/app.cjs from any folder other than the project folder.

Once that's fixed we can do some cross-platform testing and release.

HebeHH · 2024-05-15T07:57:22Z

@hrishioa adjusted so that the worker threads would be able to be found. However running the package does give me an error:

I'm not entirely sure how to resolve this one. Was working fine unpackaged.

The worker threads are a requirement in order to timeout the various file globbing etc functions which may need it. I believe there's a previous commit without them, but removing the worker threads gives up the timeout quitting functionality.

Would appreciate your thoughts/help.

HebeHH added 8 commits March 28, 2024 07:25

Allow ability to import folders, main work done

bde3e4a

Fixed file reference, constants, etc

3ff2fcb

tidied imports

6dd6752

prompt user for folder as well

088d587

smol

eba8c94

fixed biome errors

bf081be

remove folders on last time, use CLAUDE_PRIMARYSOURCE_BUDGET

914574f

refactored to minimize app.ts complexity. Added check for additional …

8dca470

…prompt tokens.

hrishioa reviewed Mar 28, 2024

View reviewed changes

HebeHH added 4 commits March 28, 2024 16:49

console log typo

c54b424

Hebe's version

644e88b

Got worker threads and timeouts working

2f3541a

biome check

fd41290

HebeHH added 3 commits April 1, 2024 21:54

actually update filetree, fix rendering issue

5be9ad7

pretty sure I merged these properly

88b6be7

updated if/else folder/file logic in app.ts to work with changes to I…

0277150

…nput.validate. Changed biome line length to 150 so it wouldn't break things up as much, which had a lot of knock on impact.

hrishioa reviewed May 11, 2024

View reviewed changes

HebeHH added 3 commits May 15, 2024 15:07

Move (shorter) extension list into code

f567db3

PR requested fixes

be60d13

reworked worker threads links

06b8660

HebeHH and others added 2 commits May 15, 2024 16:19

field names forgot to save

fa120bb

Working cjs imports, not perfect but works

6cde440

Use Folders as input to Lumentis #11

Are you sure you want to change the base?

Use Folders as input to Lumentis #11

Uh oh!

Conversation

HebeHH commented Mar 28, 2024

Folder Input

Using it

Selecting usable files

Excluding when fetching the directory tree

Checking individual file readability

File mimetype readability

Code extensions

Problems to resolve before committing

Result printout?

Remaining testing

Improve file/folder filtering

Other questions

Additional changes

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

HebeHH commented Apr 1, 2024

Uh oh!

HebeHH commented Apr 1, 2024

Uh oh!

HebeHH commented Apr 1, 2024

Timeouts and Workers

Getting the folder tree structure

Mechanisms

To be double-checked

Uh oh!

HebeHH commented Apr 1, 2024

Uh oh!

hrishioa left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

hrishioa commented May 11, 2024

Uh oh!

HebeHH commented May 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

HebeHH commented May 15, 2024 •

edited

Loading