Is a hash-based license detector the right choice of backend? #43
Comments
I also have some problems with the license detection (see #41). One approach could be to let |
I just took a quick look at the old license detection package (github.com/ryanuber/go-license), and I agree that the approach taken there is pretty rough and possibly error-prone. Basically, for each license that is recognized, there is a single sequence of words that needs to be present in the license text (https://github.com/ryanuber/go-license/blob/master/license.go#L162-L221). I would be interested to know how the other packages in https://github.com/go-enry/go-license-detector#quality actually work, and whether there is a way to either combine them or give the user a choice. |
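To make the "single sequence of words per license" idea concrete, here is a minimal sketch in Go. It is not go-license's actual code: the `marker` type, the `detectByPhrase` function, and the phrases in the table are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// marker pairs a license identifier with one phrase that must appear in the text.
// Hypothetical table: the phrases are illustrative, not go-license's real data.
type marker struct {
	id     string
	phrase string
}

var markers = []marker{
	{"MIT", "permission is hereby granted, free of charge, to any person obtaining a copy"},
	{"Apache-2.0", "apache license version 2.0"},
	{"ISC", "permission to use, copy, modify, and/or distribute this software"},
}

// normalize lowercases the text and collapses every whitespace run to one space.
func normalize(s string) string {
	return strings.Join(strings.Fields(strings.ToLower(s)), " ")
}

// detectByPhrase returns the first license whose marker phrase occurs in text,
// or "UNKNOWN" when no phrase matches.
func detectByPhrase(text string) string {
	n := normalize(text)
	for _, m := range markers {
		if strings.Contains(n, m.phrase) {
			return m.id
		}
	}
	return "UNKNOWN"
}

func main() {
	sample := "Permission is hereby granted,\n  free of charge, to any person obtaining a copy of this software"
	fmt.Println(detectByPhrase(sample))
}
```

A scheme like this is cheap, but it depends entirely on each phrase being unique to one license and on table order, which is where the roughness comes from.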
In that comparison, a notable absence is the (regex-based) one used by pkg.go.dev, https://github.com/google/licensecheck. A different (hash-based) license checker from Google, https://github.com/google/licenseclassifier is part of the comparison. |
@TBBle and @breml, thanks for the comments and the insights. The reason to move away from the original library was to improve license recognition; at the time of the decision, the best option was hash-based detection, which proved much more accurate in empirical tests. I admittedly didn't pay much attention to this space, but I really like the approach google/licensecheck is using. I'm interested in providing the most accurate reporting possible, so I'll give google/licensecheck a try and maybe implement an interface for multiple backends. Thanks again for helping through this. |
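As a rough idea of what "an interface for multiple backends" could look like, here is a hypothetical sketch. None of these names (`Detector`, `stubDetector`, `detectWith`, `ErrUnknown`) exist in wwhrd; the stub backends stand in for real libraries such as go-license-detector or licensecheck, whose APIs are deliberately not called here.

```go
package main

import (
	"errors"
	"fmt"
)

// Detector is a hypothetical backend interface; wwhrd does not define it.
type Detector interface {
	Name() string
	Detect(text []byte) (string, error)
}

// ErrUnknown is returned when no backend recognizes the text.
var ErrUnknown = errors.New("license not recognized")

// stubDetector is a stand-in backend that returns a fixed answer,
// or ErrUnknown when its license field is empty.
type stubDetector struct {
	name, license string
}

func (s stubDetector) Name() string { return s.name }

func (s stubDetector) Detect(text []byte) (string, error) {
	if s.license == "" {
		return "", ErrUnknown
	}
	return s.license, nil
}

// detectWith tries each backend in order and returns the first successful
// result together with the name of the backend that produced it.
func detectWith(backends []Detector, text []byte) (license, backend string, err error) {
	for _, b := range backends {
		if id, derr := b.Detect(text); derr == nil {
			return id, b.Name(), nil
		}
	}
	return "", "", ErrUnknown
}

func main() {
	backends := []Detector{
		stubDetector{name: "first"},
		stubDetector{name: "second", license: "MIT"},
	}
	id, by, err := detectWith(backends, []byte("license text"))
	fmt.Println(id, by, err)
}
```

Trying backends in order is only one possible policy; letting the user pick a single backend in the config would fit the same interface.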
One big advantage of Since Go packages is what wwhrd cares about, that seems like a good thing. Of course, the flip-side is that if |
I did some checks on our mono repo and the results do look promising. I was able to remove all exceptions from the From what I can tell, at least the following cases have changed:
The case for Additionally, I opened #49. |
Sorry, it took me a while to get back to this and try it against https://github.com/docker/compose-on-kubernetes/. Compared to docker/compose-on-kubernetes#170 (as at docker/compose-on-kubernetes@5f3cab1), I was able to remove almost all the exceptions, except:
So overall, a major win.
Edit: Nope, this is just me being bad at licenses. `` does recognise the

Also, one interesting note, is that in this codebase, the vendored github.com/prometheus/common includes its own vendored goautoneg module, but as bitbucket.org/ww/goautoneg under internal instead of under vendor. wwhrd used to detect that as

Since the https://github.com/docker/compose-on-kubernetes/ CI pipeline always uses the latest wwhrd, I plan to rebase that PR to remove the commits which were only needed before the current changes. But for now, I've just fixed the .wwhrd.yml to be what it needs to look like now: docker/compose-on-kubernetes@521ba3d

The full .wwhrd.yml I ended up with:
---
blacklist:
- GPL-2.0
whitelist:
- Apache-2.0
- BSD-2-Clause
- ISC
- MIT
- MPL-2.0
- BSD-3-Clause
exceptions:
- bitbucket.org/ww/goautoneg # BSD-3-Clause, license is in the source, misdetected as Apache-2.0
- github.com/munnerz/goautoneg # BSD-3-Clause, license is in the source
- github.com/opencontainers/go-digest # Apache-2.0, misdetected as CC-BY-SA-4.0
|
Cool! Looks like this is a significant step in the right direction. I'll continue fixing bugs over the holiday and cut a proper v0.4.0 release in the new year. |
Just cut Please give it a try and thanks again for the help troubleshooting and fixing issues. |
I'll enqueue an issue to make the license library pluggable in Thanks a lot for your help in making |
(Generalising from #40)
Looking at the https://github.com/go-enry/go-license-detector README, I noticed
That suggests to me that wwhrd is using https://github.com/go-enry/go-license-detector outside of its intended scope, in its role as a license checker.
I've outlined some specific concerns/observations with the hash-based approach below.
I suspect the hash-based approach of https://github.com/go-enry/go-license-detector is going to be hard to adjust to fix #40 when two licenses have overlapping text (as ISC and 0BSD do). My suspicion is that the hashing/weighting algorithm is giving more weight to the exact-match 0BSD with interposed extra text than to the ISC exact-match-with-optional-and-alts.
There are a few other inexplicable 'UNKNOWN' results in the relevant build-log, e.g.
https://github.com/spf13/cobra is detected as UNKNOWN despite a LICENSE.txt which is a 1:1 match to https://github.com/spdx/license-list-data/blob/master/text/Apache-2.0.txt (data source for go-license-detector) once whitespace is normalised and the appendix is dropped.
I'm somewhat suspicious that go-license-detector might need to clean its dataset, as I suspect the problem here is that it has included the appendix in its hash, even though it is not part of the license itself, and the source dataset marks it as optional.
But I haven't pulled go-license-detector apart thoroughly enough to know whether I'm right about these issues, or even whether I'm assigning them to the right place. At this point, I consider them symptomatic and likely to recur across licenses due to the nature and design intention of the library, i.e. working-as-intended as a fast and rough datamining support library.
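For reference, the kind of comparison I mean for the cobra case (whitespace normalised, appendix dropped) can be sketched as below. This is only an illustration of the check, not how go-license-detector works; `canonical` and `sameLicenseText` are made-up helper names, and the appendix heading is the one used in the Apache-2.0 text.

```go
package main

import (
	"fmt"
	"strings"
)

// appendixMarker is the heading that opens the optional appendix
// at the end of the Apache-2.0 license text.
const appendixMarker = "APPENDIX: How to apply the Apache License to your work."

// canonical collapses every whitespace run to a single space, then drops
// everything from the appendix heading onwards.
func canonical(text string) string {
	n := strings.Join(strings.Fields(text), " ")
	if i := strings.Index(n, appendixMarker); i >= 0 {
		n = n[:i]
	}
	return strings.TrimSpace(n)
}

// sameLicenseText reports whether two texts are identical once whitespace
// is normalised and the optional appendix is ignored.
func sameLicenseText(a, b string) bool {
	return canonical(a) == canonical(b)
}

func main() {
	withAppendix := "Apache License\n   Version 2.0\n\nEND OF TERMS AND CONDITIONS\n\n" +
		appendixMarker + "\n   Copyright [yyyy] [name of copyright owner]"
	withoutAppendix := "Apache License Version 2.0 END OF TERMS AND CONDITIONS"
	fmt.Println(sameLicenseText(withAppendix, withoutAppendix))
}
```

In practice the two inputs would be the project's LICENSE.txt and the SPDX Apache-2.0.txt; the short strings in `main` are placeholders.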