Skip to content

Conversation

@HebeHH
Copy link
Collaborator

@HebeHH HebeHH commented Mar 28, 2024

Folder Input

This PR allows users to provide a folder path as input data, instead of a single file. It gets a tree of usable text files from that, and then asks users to select which of them to include. It doesn't allow users to proceed until their selections are small enough to fit into context. Once it has the selected list of files, it adds them all into the input space.

Using it

image

The folder import flow begins automatically if the user provides a folder instead of a file when prompted.

The wizard will display a file tree, including only LLM-usable files, and only folders which contain LLM-usable files. The user can then select or deselect here.

Deselecting a folder will mean all children will also be removed from consideration. Unfortunately this doens't change the green wizard circle, because I couldn't find functionality for this, but it does mean they won't be included in the LLM context.

The wizard will keep prompting the user to deselect files until the total file token count is below the constants.ts MAX_TOKEN_LIMIT, which I've currently set to 200,000. Every time it re-prompts, it will first update the file tree to remove anything deselected by the user (including all children of deselected files).

Once the user has trimmed the context sufficiently, this Wizard action ends.

All selected files are then loaded into wizardState.loadedPrimarySource.

Selecting usable files

This is important to make sure all info fed to the LLM is usable. I've got two methods here:

  1. Exclude when fetching directory tree
  2. Check individual files are readable

Excluding when fetching the directory tree

When the user provides a folder, we get a tree representation of that folder using directory-tree, which allows for regex path exclusion.

I'm giving two types of exclusion here.
Folders: Folders which we can assume aren't containing documentation-important files, such as node_modules, cahces, build folders, etc.
Extensions: Extensions which don't typically contain text data, like images, tars, compiled binaries, etc.

GitHub Copilot contributed to extending both of these lists. There'll be some missed anyway, please do build on top of them. They're the folderTreeExclusions and excludeExtensions respectively in the folder-importing-utils.ts file.

I've tested both types of regexes (at different folder levels, similar-but-different naming styles, etc) and they're working for me. However, I'm not sure about cross platform support.

Checking individual file readability

Once we've got the tree of files with the previous exclusions in place, I go through each individual file to check if it looks AI-readable. This is a more complicated business; exclusions can be guaranteed, but there's a vast swathe of "I have no idea what this is"-type files. The goal here is to be pretty confident that, if the file is what the extension says it was, an LLM will be able to get useful info from it.

Again, two methods here: mimetype and extension.

File mimetype readability

I use the mime-type package as a generalized way to sort unknown file extensions into readable or not. Media types are what the internet uses to interpret what kind of media a file is: the main type and the subtype.

I accept all files with mime.type = text, and also message (used for encoding email .eml files). These are pretty much guaranteed to be LLM-readable, and if they're not, that's on the user.

I also accept a limited subset of mime.type = application, based on the mime.subtype. This covers JSON, YAML, XML, and rich text format.

Code extensions

Unfortunately mime types doesn't seem to cover most programming language extensions; the library returns false (aka doesn't know).

This is handled by grabbing ppisarczyk's gist of programming and data languages and the associated extensions. I pull all extensions associated with programming languages from here, and let them all through on the basis that most LLMs are trained on a lot of code.

Possibly we should also include approval for additional file extensions not covered by either of the above, but I haven't personally run into any and it's tricky to find an actual list.

Problems to resolve before committing

Result printout?

My terminal displays all the file names pretty weirdly after the Checkbox finishes:
image
I'm not sure why this is; would appreciate someone else checking it out.

Remaining testing

I've tested all the different functions pretty thoroughly to make sure they're working. However:

  • I got rate limited before I could actually test the full suite. All the intermediates look good (outline generation, .lumentis/messages), but I couldn't actually generate the full docs.
  • I also haven't checked that it bundles properly as a npm package that can run on all systems.
  • Also haven't successfully tested on a large folder. I gave it my entire /Documents/ and it hung, but I'm not sure if that counts as user error or an actual problem. See 'Other Questions' for some ideas on how to resolve.

Improve file/folder filtering

I've done what I can here, but please feel free to improve.

Other questions

  • I set the max tokens (that can be supplied in the input files) to 200,000. Is this correct? Should we lower this to account for output tokens and additional prompt tokens? By how much?
  • I just shoved all the files together with a separator in between, feel free to review and adjust this structure.
  • Should we add a depth limit to the folder structure? We'd do this by adding a depth parameter to the const fileTree = dirTree() call in app.ts. Might be worthwhile; when I tried this with a very high-level folder it hung for quite awhile, but I haven't tested how changing the depth parameter would impact that. Possibly just add a warning instead, or a timeout -> return?
  • Is there a way to estimate tokens from file size? dirTree already gives us the file size, which is a hell of a lot cheaper than countTokens, which requires reading in each file in order to work. Not at all a problem for baby folders like lumentis, very much a problem for big ones. Possibly solved by adding a depth limit.

Additional changes

  • If the user provides a file name (not a folder), check that it's a usable file type (see: Checking individual file readability).

package.json Outdated
"cross-spawn": "^7.0.3",
"directory-tree": "^3.5.1",
"mime-types": "^2.1.35",
"next": "^14.1.4",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha it happened again

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened again? Do I need to remove next?

src/app.ts Outdated
let promptTokens = getAdditionalPromptTokens(file_choices);
while (first_time || folderTokenTotal + promptTokens > CLAUDE_PRIMARYSOURCE_BUDGET) {
if (!first_time) {
console.log("=You've selected too many tokens. Please deselect files to exclude.");
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

= here intentional?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, pushed fix.

src/app.ts Outdated
attributes: ["size", "type", "extension"]
});
resetFolderTokenTotal();
recursivelyRemoveExcludedFilesAndAddTokenCount(fileTree);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup need timeout

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Promise.race is a good way

@HebeHH
Copy link
Collaborator Author

HebeHH commented Apr 1, 2024

So the issue where it seems to print the Checkbox result to console is persistent:
image

I'm pretty sure this is the await checkbox() call itself that's doing this - I put console.log() on either side and the printout happened in between.

The inquirer docs seem to indicate there's an optional output parameter which redirects output, but I tried this with both undefined and process.stderror and there wasn't any change.

Would appreciate thoughts from someone who knows inquirer better.

@HebeHH
Copy link
Collaborator Author

HebeHH commented Apr 1, 2024

Resolved previous issue: use renderSelectedChoices in the Theme

HebeHH added 3 commits April 1, 2024 21:54
…nput.validate. Changed biome line length to 150 so it wouldn't break things up as much, which had a lot of knock on impact.
@HebeHH
Copy link
Collaborator Author

HebeHH commented Apr 1, 2024

Timeouts and Workers

Importing folders is all well and good until the user wants to add all of Documents/ and the entire program hangs silently. In general, this area has the potential for overlong calls that take too much time to execute.

The clear solution is to add timeouts to all the overlong functions. Unfortunately, timeouts are a bit complicated here, given:

  1. Single threaded environments.
  2. Need to cancel calls after timeout.

I've resolved this with worker threads.

Getting the folder tree structure

This is the first long call we make. I was using the directory-tree library, but also compared this to glob and fast-glob when refactoring.

  • directory-tree: Doesn't require refactoring, and is the fastest in the few tests I did. However, it's also single-threaded, doesn't give up execution, and can't be cancelled. This means timeouts won't work at all if it's called in the same execution thread.
  • fast-glob: second fastest, about 4x slower than directory-tree. Gives up execution (so can be Promise.race()'d with a timeout), but has no easy way to cancel other than calling process.exit(), which would kill the entire program.
  • glob: slowest, about 4x slower than fast-glob. However it both gives up execution and can be cancelled with an AbortSignal.

Overall: glob is the only one which is feasible without worker threads, and I just didn't feel like the slowness was worth it. I was surprised by how fast directory-tree was - although that may need to be checked cross-platform.

I went with directory-tree because 1) speed 2) less refactoring of other functions 3) worker threads are probably a better mechanism anyway.

Mechanisms

I've now moved all code for this into src/folder-importing. This has 4 worker scripts (1 for getting the dirtree and 3 for different requirements on recursing over it) and one utils.ts script to pull it together.

For all of the time-heavy worker threads, I start them running and then Promise.race them against a timer.

app.ts needed a bunch of code around communicating to the user if one of the workers timed out, which means a bunch of if/else blocks.

To be double-checked

  • Cross-platform ability of workers and directory-tree library
  • Whether it bundles for npm properly
  • That the merge with master looks okay - I went through it several times but it was a little complex with the introduction of youtube transcripts.

@HebeHH
Copy link
Collaborator Author

HebeHH commented Apr 1, 2024

Anyway @hrishioa I think this is ready for checking and merging now

Copy link
Owner

@hrishioa hrishioa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome - some comments, on to testing locally!

"editor.defaultFormatter": "biomejs.biome",
"editor.formatOnSave": false
"editor.formatOnSave": true,
"github.gitAuthentication": false,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's happening here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed the VS Code settings so it would format on save and wouldn't try to to github auth on it's own (VS Code github auth can be super buggy).


if (wizardState.primarySourceFileOrFolderName && fs.lstatSync(wizardState.primarySourceFileOrFolderName).isDirectory()) {
const parsed_filename = parsePlatformIndependentPath(wizardState.primarySourceFileOrFolderName);
let fileTree: dirTree.DirectoryTree | "timeoutFailed" = await getFileTree(parsed_filename);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have a potential to error out? Should we add a try catch?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so @hrishioa ? If there's an error it will just return timeoutFailed, and I handled that. If you know of another error, can you add the handling? I haven't run into any

return; // Return if the timeout is reached
}
promptTokens = getAdditionalPromptTokens(file_choices);
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move this whole section out to a new function to keep the wizard processing section simpler?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The wizard processing section was already long enough that I didn't think this added too much complexity - and there's also plenty references to wizard state in this section. I'd thought the generat structure was that all references to wizard state were to be in the app.ts.

If we want to tidy up app.ts maybe that should be a separate PR? I tried a bit here by adding code comments to separate out the different sections.

@hrishioa
Copy link
Owner

CleanShot 2024-05-11 at 12 03 17@2x
CleanShot 2024-05-11 at 12 03 26@2x
Package size hasn't gone up by that much ☝️

Works well if you run with bun - especially getting lumentis to review its own code! Here's your upgrade to lumentis (with sonnet for the outline and haiku for the writing) reviewing itself: https://lumentis-self-review.vercel.app/
(Not sure why the _meta.json is a little screwed up, you can see the headings on subsections don't match the actual names of the files)

We do have an import issue with programming_language_extensions.json though - follow these steps to reproduce:

  1. Compile with bun package.
  2. Try executing the compiled package (which is what npx will be using when you publish it) by running node <path_to_lumentis_folder>/dist/app.cjs from any folder other than the project folder.

Once that's fixed we can do some cross-platform testing and release.

@HebeHH
Copy link
Collaborator Author

HebeHH commented May 15, 2024

@hrishioa adjusted so that the worker threads would be able to be found. However running the package does give me an error:
image
I'm not entirely sure how to resolve this one. Was working fine unpackaged.

The worker threads are a requirement in order to timeout the various file globbing etc functions which may need it. I believe there's a previous commit without them, but removing the worker threads gives up the timeout quitting functionality.

Would appreciate your thoughts/help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants