py-tree-sitter-languages is unmaintained #7

Open
jvmncs opened this issue Jul 3, 2024 · 19 comments · May be fixed by #8

Comments

@jvmncs

jvmncs commented Jul 3, 2024

Hi @paul-gauthier , thanks for your work on aider. I've been having a blast using it.

This project uses https://github.com/grantjenks/py-tree-sitter-languages, but that project is unmaintained and has been for several months. This forces grep-ast to be stuck on an old tree-sitter version (0.21) and also limits the number of parsers that can be used by upstream projects (including aider). There is a hacky way to install new language parsers, but that dependency will seemingly be stuck on tree-sitter 0.21 indefinitely, which seems bad.

Another project has sprung up called tree-sitter-language-pack, however it's got a slightly different intention (large collection of grammar binaries, as opposed to small/focused one for the most popular languages only). That project is mainly an integration of this unmerged tree-sitter-languages PR with a bunch of new grammar binaries added. There's probably space for a minimal version that bundles just the top N languages and natively allows users to install their own binaries at will (so, essentially, just a version of tree-sitter-languages with that PR merged, and some different grammar binaries).

If you want to replace tree-sitter-languages with tree-sitter-language-pack, I'd be happy to open a PR. Note that the source binary size is quite a bit larger:

  • tree-sitter-language-pack: 35.7 MB, no platform-specific builds
  • tree-sitter-languages: ~9.0MB, depending on the platform
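
For anyone weighing the swap, here is a minimal sketch of how a caller might detect which of the two packages is installed before deciding what to import. The import names `tree_sitter_language_pack` and `tree_sitter_languages` are assumptions, and `pick_backend` is a hypothetical helper, not part of grep-ast.

```python
from importlib.util import find_spec

# Assumed import names for the two packages discussed in this issue.
CANDIDATES = ("tree_sitter_language_pack", "tree_sitter_languages")


def pick_backend(candidates=CANDIDATES):
    """Return the first importable module name, or None if none is installed."""
    for name in candidates:
        # find_spec returns None for a top-level module that is not installed.
        if find_spec(name) is not None:
            return name
    return None
```

Listing the maintained package first means it wins when both are installed; the old one only acts as a fallback.
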
@AdjectiveAllison

AdjectiveAllison commented Jul 18, 2024

I don't have a big investment in this decision, so my opinion might not be worth much but I'm sharing it anyway:

I agree that there is likely space for a best of both worlds option. Maintained, smaller binary, adaptable, the dream!
I am personally on board with the large language pack route over the tree sitter version being locked until that gap is filled.

Also ditto what @jvmncs said on aider, both it and this repo are solid tools.

@Goldziher

Hi, author of tree-sitter-language-pack here. PRs are welcome. Also, it's fine by me to have multiple packages built in the same repo - we can have a minimal build and a more comprehensive build.

@greg-hellings

This does make packaging aider, which I am working on, something of a sticky bit. It is possible currently to get around the build issues for py-tree-sitter-languages by pinning tree-sitter to version 0.21.x and explicitly including distutils in the package. But having this whole tree depend on a maintained package would be better, overall, rather than requiring introduction of an abandoned package into a new tool. That workaround probably will not last forever, so it would be fantastic to have a version of grep-ast that did not require the older/abandoned dependencies.

@greg-hellings

Ultimately this is resulting in aider not really being able to package for Python 3.12 easily, as tree-sitter 0.21 doesn't like newer versions of Python.

@jmehnle

jmehnle commented Aug 15, 2024

@greg-hellings, I don't quite understand the Python 3.12 concern. According to tree-sitter/py-tree-sitter@ce1af66, tree-sitter 0.21.1 and above do build on Python 3.12. Is that somehow not the case?

@greg-hellings

@jmehnle Indeed tree-sitter did come out with a 0.21.2 that supported Python 3.12. But most people who are consuming this outside of a pip install are going to use their system libraries. py-tree-sitter-languages is incompatible with tree-sitter 0.22+ which most Linux distributions have moved to because of its improved support for Python 3.12 and because it's the latest. And, since tree-sitter-languages is abandonware, it will not likely ever be updated. It would be better for anyone consuming this if an updated dependency was leveraged, instead. The issue isn't the transitive dep on tree-sitter itself. It's the dependency on tree-sitter-languages which has been abandoned and therefore doesn't support a pip install nor have good support in packaging distributions.

@jmehnle

jmehnle commented Aug 16, 2024

Ok, so there's not a specific major problem with Python 3.12. I understand that recent versions of the tree-sitter package >=0.22 have improved support for 3.12, but >=0.21.1,<0.22 should still run on 3.12. I'm also aware of the other issues you mentioned, and I'm very much interested in us building a successor package that is both future-compatible and smaller than tree-sitter-language-pack. The author of the latter seems to be open to building a version of the package that's limited to a subset of languages.
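
To make the `>=0.21.1,<0.22` range concrete, here is a rough sketch of the check it implies. The helper names are made up for illustration, and the parsing only handles plain dotted numeric versions (no pre-release or local tags), so it is not a substitute for a real version-specifier library.

```python
def parse_version(text):
    """Turn a plain dotted version like '0.21.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def in_pinned_range(version, lower="0.21.1", upper="0.22"):
    """True if lower <= version < upper, comparing numeric components."""
    v = parse_version(version)
    return parse_version(lower) <= v < parse_version(upper)
```

Under this sketch `0.21.3` falls inside the pin, while `0.21.0` and `0.22.0` fall outside it.
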

@greg-hellings

Correct, the issue is that the dep makes moving forward with Python versions and grep-ast more tedious. Not that there is directly a problem with 3.12 but that the issue is with a stale dep.

gohanlon added a commit to gohanlon/grep-ast that referenced this issue Aug 23, 2024
…ress maintenance

Resolves Aider-AI#7.

This commit replaces the tree-sitter language pack from
grantjenks/py-tree-sitter-languages with
Goldziher/tree-sitter-language-pack, significantly expanding language
support and addressing maintenance issues. Key changes include:

1. Greatly increases the number of supported languages, including Swift
   and Svelte.
2. Resolves dependency on an unmaintained package that was forcing
   grep-ast to use an old tree-sitter version (0.21).
3. Unlocks the ability to use more recent tree-sitter versions.
4. Updates requirements.txt to use tree-sitter-language-pack>=0.2.0.
5. Increments the version number to 0.3.4-dev in setup.py.
6. Adds extensive test cases for parsing various languages in
   test_parsers.py.

Notable changes:
- Removed support for DOT, OCaml, ql (GitHub CodeQL), and tsq (Tree
  Sitter Query) due to their absence in the new pack.
- Removed potentially incorrect mappings for .gomod, .sqlite, and .regex
  extensions.
- Replaced the uncommon ".et" mapping for "embeddedtemplate" with
  mappings for ERB and EJS, which are common uses of embedded templates.
- Re-enabled markdown as the new pack uses a different markdown
  grammar that likely doesn't suffer from previous bugs.
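
As a rough illustration of the extension-mapping changes listed in the commit message, the lookup could be sketched like this. The table is a small illustrative subset, and both the `embeddedtemplate` language name for ERB/EJS and the `filename_to_lang` signature are assumptions rather than a copy of the actual grep-ast code.

```python
import os

# Illustrative subset only; the real table in grep-ast may differ.
EXTENSION_TO_LANG = {
    ".py": "python",
    ".swift": "swift",
    ".svelte": "svelte",
    ".erb": "embeddedtemplate",  # assumed grammar name for ERB templates
    ".ejs": "embeddedtemplate",  # assumed grammar name for EJS templates
}


def filename_to_lang(filename):
    """Return the language name for a filename, or None if the extension is unmapped."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_TO_LANG.get(ext.lower())
```

With this table, `views/index.html.erb` resolves to the embedded-template grammar, while the dropped `.et` extension resolves to nothing.
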
@gohanlon

I've just opened PR #8 (Draft) to migrate grep-ast to Goldziher/tree-sitter-language-pack, significantly expanding language support and (hopefully) resolving the maintenance concerns discussed here.

Please take a look and let me know your thoughts!

@paul-gauthier
Collaborator

The PR looks great, thanks for preparing it.

Any thoughts on how the pip install of -language-pack compares to -languages? On my mac, -pack took 4 minutes and ~130MB whereas -languages takes <2 seconds and 80MB.

It seems like -languages had pre-built wheels and -pack is building it on my local?

But more than the time difference, will -pack install cleanly in roughly the same set of environments that -languages did?

The main reason I adopted -languages as a dependency was because it reliably installed in a wide range of environments.

@greg-hellings

It sounded like @Goldziher was open to improving the experience with -pack, up above. It's generally considered bad form to have a derived file, like a wheel, included in your source tree but it sounds like for -languages it was a huge performance boost to ship them in that manner.

@gohanlon

I do think size and build time are issues needing careful consideration. I too built tree-sitter-language-pack locally. For me, the size and build time, despite being substantial, aren't all that significant compared to the benefits. But, I'm sure we can (and should) do better:

@Goldziher I see that the published files on pypi.org/tree-sitter-language-pack don't include any wheels, but that you've worked on some infrastructure to build and publish wheels. This seems like it'd be a non-trivial undertaking. Can you comment on the status/challenges of that work?

@paul-gauthier If tree-sitter-language-pack builds and publishes wheels with broad enough compatibility, how would that impact your evaluation of (the draft) PR #8?

Besides adding wheels, we could implement a modular system for language support. However, that'd be a much larger undertaking and it's probably better to focus first on the immediate benefits of migrating to a maintained package with expanded language coverage, despite the increased size and build time.

(For anyone curious, here are the files for grantjenks's pack on PyPI, including wheels, and here're its GitHub Actions workflow and build script.)

@paul-gauthier
Collaborator

The PR looks great. I would love to support all those languages.

My only hesitation is the end user pip install experience:

  1. If it takes 4 minutes to pip install the tree-sitter-language-pack dependency, that's a lot to ask of users.
  2. How often will the pip install fail when the wheel is being built on demand on the user's machine? I'm not sure what's involved in that step, but it feels like there is potential for problems given the diversity of end user build environments.

@gohanlon

  1. If it takes 4 minutes to pip install the tree-sitter-language-pack dependency, that's a lot to ask of users.

@paul-gauthier Does adding pre-built wheels to tree-sitter-language-pack address your install time concerns?

  2. How often will the pip install fail when the wheel is being built on demand on the user's machine? I'm not sure what's involved in that step, but it feels like there is potential for problems given the diversity of end user build environments.

The user build failure rate would likely increase somewhat due to the larger number of grammar projects, but the extent is hard to predict. I suspect the increase would be small, as build environments are often generally broken rather than failing on specific projects. (Importantly, if the new pack's pre-built wheels cover the same targets as the old pack's, the fallback rate to source builds should be identical.)

With the unmaintained language pack's lack of ongoing support for newer systems, we should expect increases in both fallbacks to user builds and user build failures over time.

A system for modular language packs could be ideal, e.g.:

pip install tree-sitter-language-pack[core,gleam,zig]

This would install the expected "core" languages plus Gleam and Zig. Other language grammars could be added without concern for bloat or risking breaking user builds. However, I'm less sure that the time and effort required for this modular approach is the best way forward now.
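
A minimal sketch of how such extras might resolve to per-language dependencies, purely to illustrate the idea. The `EXTRAS` table, the per-language requirement names, and both helper functions are hypothetical; nothing like this exists in tree-sitter-language-pack today.

```python
# Hypothetical extras table; the requirement names are illustrative only.
EXTRAS = {
    "core": ["tree-sitter-python", "tree-sitter-javascript"],
    "gleam": ["tree-sitter-gleam"],
    "zig": ["tree-sitter-zig"],
}


def split_extras(requirement):
    """Split 'name[a,b]' into ('name', ['a', 'b']); no extras gives an empty list."""
    name, bracket, rest = requirement.partition("[")
    if not bracket:
        return name.strip(), []
    extras = rest.rstrip().rstrip("]")
    return name.strip(), [e.strip() for e in extras.split(",") if e.strip()]


def resolve(requirement, table=EXTRAS):
    """Return the de-duplicated dependencies implied by the requested extras."""
    _, extras = split_extras(requirement)
    resolved = []
    for extra in extras:
        for dep in table[extra]:  # raises KeyError for an unknown extra
            if dep not in resolved:
                resolved.append(dep)
    return resolved
```

Calling `resolve("tree-sitter-language-pack[core,gleam,zig]")` would pull in only the grammars named by those three extras, which is the bloat-avoidance property described above.
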

@paul-gauthier
Collaborator

Does adding pre-built wheels to tree-sitter-language-pack address your install time concerns?

Yes, almost certainly.

@paul-gauthier
Collaborator

@Goldziher how are things going with ts-lang-pack? I've been experimenting with it, and it looks like I could swap it in for py-ts-langs.

My install of ts-lang-pack today was quick, without a long build process. So that was nice to see. The README mentions that you are building wheels now, which is great.

I see you have some open issues about build problems on different environments. Any sense of how reliably users are able to install ts-lang-pack? Aider has users on a wide range of platforms, so reliable and hassle free install is a key priority for me.

@gohanlon

gohanlon commented Dec 5, 2024

@paul-gauthier Unfortunately, Goldziher/tree-sitter-language-pack does not have published wheels. Perhaps your fast build used a locally cached previous build?

Goldziher clearly did work to include pre-built wheels, as noted in the README and evidenced in a GitHub Actions workflow (5 months ago). I'm sure someone could dig in and finish what Goldziher started. Even if that work is trivial, there'd still be the matter of actually getting them published to PyPI, and maybe needing a separate fork.

For reference, Goldziher/tree-sitter-language-pack has a single published file on PyPI. Compare to grantjenks/py-tree-sitter-languages having many wheels on PyPI.
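
To illustrate the difference, here is a small sketch that classifies a release's file list and reports whether pip would have to build locally. It relies only on the standard wheel filename layout, where the platform tag is the last dash-separated field. The helper names and the example filenames in the usage note are made up, not taken from either project's PyPI page.

```python
def classify(filename):
    """Label a release file as 'wheel', 'sdist', or 'other' by its suffix."""
    if filename.endswith(".whl"):
        return "wheel"
    if filename.endswith((".tar.gz", ".zip")):
        return "sdist"
    return "other"


def wheel_platforms(filenames):
    """Collect the platform tags of all wheels in a release file list."""
    tags = set()
    for name in filenames:
        if classify(name) == "wheel":
            stem = name[: -len(".whl")]
            # The platform tag is the final dash-separated field of a wheel name.
            tags.add(stem.rsplit("-", 1)[1])
    return tags


def needs_local_build(filenames):
    """True when a release ships no wheels, so every install builds from source."""
    return not any(classify(name) == "wheel" for name in filenames)
```

A release consisting of only `example_pack-1.0.0.tar.gz` would report `needs_local_build(...)` as true. Adding `example_pack-1.0.0-cp312-cp312-manylinux_2_17_x86_64.whl` flips it to false for the platforms listed by `wheel_platforms`.
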

To avoid an uncontrolled dependency, I’d lean towards Aider owning a modular tree-sitter language pack with prebuilt wheels under the Aider-AI org.

Or, maybe explore bridging to another ecosystem in order to depend on something widely used and maintained, if such a thing exists at all. For example:

  • universal-ctags is very old, very stable, and actively maintained with daily commit activity. It also has tons of languages but is missing many; e.g., I noted that it lacks Swift, Zig, Mojo, and Gleam. To maintain a good pip install experience, we'd still need to find or create Python bindings, though.
  • github-linguist/linguist has great language coverage. It's what GitHub uses to detect languages and render them on the GitHub website itself. They have strict but reasonable rules about which languages are eligible for inclusion: they can't support every upstart language (nor should Aider). I remember Chris Lattner being rejected when asking them to add Mojo (which did eventually meet the criteria and has since been included). While Linguist is written in Ruby, go-enry is a Go port of Linguist that's automatically kept in sync with the upstream. They have worked on Python bindings, but they are incomplete and not usable as is.

@Goldziher

Feel free to open a PR with updates as you see fit.

@paul-gauthier
Collaborator

I am also watching this fork:

https://github.com/Textualize/py-tree-sitter-languages

7 participants