-
Notifications
You must be signed in to change notification settings - Fork 96
Use Folders as input to Lumentis #11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
package.json
Outdated
| "cross-spawn": "^7.0.3", | ||
| "directory-tree": "^3.5.1", | ||
| "mime-types": "^2.1.35", | ||
| "next": "^14.1.4", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haha it happened again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happened again? Do I need to remove next?
src/app.ts
Outdated
| let promptTokens = getAdditionalPromptTokens(file_choices); | ||
| while (first_time || folderTokenTotal + promptTokens > CLAUDE_PRIMARYSOURCE_BUDGET) { | ||
| if (!first_time) { | ||
| console.log("=You've selected too many tokens. Please deselect files to exclude."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
= here intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, pushed fix.
src/app.ts
Outdated
| attributes: ["size", "type", "extension"] | ||
| }); | ||
| resetFolderTokenTotal(); | ||
| recursivelyRemoveExcludedFilesAndAddTokenCount(fileTree); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup need timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Promise.race is a good way
|
So the issue where it seems to print the Checkbox result to console is persistent: I'm pretty sure this is the The inquirer docs seem to indicate there's an optional Would appreciate thoughts from someone who knows inquirer better. |
|
Resolved previous issue: use |
…nput.validate. Changed biome line length to 150 so it wouldn't break things up as much, which had a lot of knock on impact.
Timeouts and WorkersImporting folders is all well and good until the user wants to add all of The clear solution is to add timeouts to all the overlong functions. Unfortunately, timeouts are a bit complicated here, given:
I've resolved this with worker threads. Getting the folder tree structureThis is the first long call we make. I was using the
Overall: I went with MechanismsI've now moved all code for this into For all of the time-heavy worker threads, I start them running and then
To be double-checked
|
|
Anyway @hrishioa I think this is ready for checking and merging now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome - some comments, on to testing locally!
| "editor.defaultFormatter": "biomejs.biome", | ||
| "editor.formatOnSave": false | ||
| "editor.formatOnSave": true, | ||
| "github.gitAuthentication": false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's happening here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed the VS Code settings so it would format on save and wouldn't try to to github auth on it's own (VS Code github auth can be super buggy).
|
|
||
| if (wizardState.primarySourceFileOrFolderName && fs.lstatSync(wizardState.primarySourceFileOrFolderName).isDirectory()) { | ||
| const parsed_filename = parsePlatformIndependentPath(wizardState.primarySourceFileOrFolderName); | ||
| let fileTree: dirTree.DirectoryTree | "timeoutFailed" = await getFileTree(parsed_filename); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this have a potential to error out? Should we add a try catch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so @hrishioa ? If there's an error it will just return timeoutFailed, and I handled that. If you know of another error, can you add the handling? I haven't run into any
| return; // Return if the timeout is reached | ||
| } | ||
| promptTokens = getAdditionalPromptTokens(file_choices); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we move this whole section out to a new function to keep the wizard processing section simpler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The wizard processing section was already long enough that I didn't think this added too much complexity - and there's also plenty references to wizard state in this section. I'd thought the generat structure was that all references to wizard state were to be in the app.ts.
If we want to tidy up app.ts maybe that should be a separate PR? I tried a bit here by adding code comments to separate out the different sections.
|
Works well if you run with bun - especially getting lumentis to review its own code! Here's your upgrade to lumentis (with sonnet for the outline and haiku for the writing) reviewing itself: https://lumentis-self-review.vercel.app/ We do have an import issue with
Once that's fixed we can do some cross-platform testing and release. |
|
@hrishioa adjusted so that the worker threads would be able to be found. However running the package does give me an error: The worker threads are a requirement in order to timeout the various file globbing etc functions which may need it. I believe there's a previous commit without them, but removing the worker threads gives up the timeout quitting functionality. Would appreciate your thoughts/help. |




Folder Input
This PR allows users to provide a folder path as input data, instead of a single file. It gets a tree of usable text files from that, and then asks users to select which of them to include. It doesn't allow users to proceed until their selections are small enough to fit into context. Once it has the selected list of files, it adds them all into the input space.
Using it
The folder import flow begins automatically if the user provides a folder instead of a file when prompted.
The wizard will display a file tree, including only LLM-usable files, and only folders which contain LLM-usable files. The user can then select or deselect here.
Deselecting a folder will mean all children will also be removed from consideration. Unfortunately this doens't change the green wizard circle, because I couldn't find functionality for this, but it does mean they won't be included in the LLM context.
The wizard will keep prompting the user to deselect files until the total file token count is below the
constants.tsMAX_TOKEN_LIMIT, which I've currently set to 200,000. Every time it re-prompts, it will first update the file tree to remove anything deselected by the user (including all children of deselected files).Once the user has trimmed the context sufficiently, this Wizard action ends.
All selected files are then loaded into
wizardState.loadedPrimarySource.Selecting usable files
This is important to make sure all info fed to the LLM is usable. I've got two methods here:
Excluding when fetching the directory tree
When the user provides a folder, we get a tree representation of that folder using directory-tree, which allows for regex path exclusion.
I'm giving two types of exclusion here.
Folders: Folders which we can assume aren't containing documentation-important files, such as
node_modules, cahces, build folders, etc.Extensions: Extensions which don't typically contain text data, like images, tars, compiled binaries, etc.
GitHub Copilot contributed to extending both of these lists. There'll be some missed anyway, please do build on top of them. They're the
folderTreeExclusionsandexcludeExtensionsrespectively in thefolder-importing-utils.tsfile.I've tested both types of regexes (at different folder levels, similar-but-different naming styles, etc) and they're working for me. However, I'm not sure about cross platform support.
Checking individual file readability
Once we've got the tree of files with the previous exclusions in place, I go through each individual file to check if it looks AI-readable. This is a more complicated business; exclusions can be guaranteed, but there's a vast swathe of "I have no idea what this is"-type files. The goal here is to be pretty confident that, if the file is what the extension says it was, an LLM will be able to get useful info from it.
Again, two methods here: mimetype and extension.
File mimetype readability
I use the mime-type package as a generalized way to sort unknown file extensions into readable or not. Media types are what the internet uses to interpret what kind of media a file is: the main type and the subtype.
I accept all files with
mime.type = text, and alsomessage(used for encoding email.emlfiles). These are pretty much guaranteed to be LLM-readable, and if they're not, that's on the user.I also accept a limited subset of
mime.type = application, based on themime.subtype. This covers JSON, YAML, XML, and rich text format.Code extensions
Unfortunately mime types doesn't seem to cover most programming language extensions; the library returns
false(aka doesn't know).This is handled by grabbing ppisarczyk's gist of programming and data languages and the associated extensions. I pull all extensions associated with programming languages from here, and let them all through on the basis that most LLMs are trained on a lot of code.
Possibly we should also include approval for additional file extensions not covered by either of the above, but I haven't personally run into any and it's tricky to find an actual list.
Problems to resolve before committing
Result printout?
My terminal displays all the file names pretty weirdly after the Checkbox finishes:

I'm not sure why this is; would appreciate someone else checking it out.
Remaining testing
I've tested all the different functions pretty thoroughly to make sure they're working. However:
/Documents/and it hung, but I'm not sure if that counts as user error or an actual problem. See 'Other Questions' for some ideas on how to resolve.Improve file/folder filtering
I've done what I can here, but please feel free to improve.
Other questions
depthparameter to theconst fileTree = dirTree()call inapp.ts. Might be worthwhile; when I tried this with a very high-level folder it hung for quite awhile, but I haven't tested how changing thedepthparameter would impact that. Possibly just add a warning instead, or a timeout -> return?dirTreealready gives us the file size, which is a hell of a lot cheaper thancountTokens, which requires reading in each file in order to work. Not at all a problem for baby folders like lumentis, very much a problem for big ones. Possibly solved by adding adepthlimit.Additional changes