Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Is a hash-based license detector the right choice of backend? #43

Closed
TBBle opened this issue Nov 24, 2020 · 11 comments
Closed

Is a hash-based license detector the right choice of backend? #43

TBBle opened this issue Nov 24, 2020 · 11 comments
Assignees

Comments

@TBBle
Copy link

TBBle commented Nov 24, 2020

(Generalising from #40)

Looking at the https://github.com/go-enry/go-license-detector README, I noticed

The detection algorithm is not template matching; this directly implies that go-license-detector does not provide any legal guarantees. The intended area of it's usage is data mining.

That suggests to me that wwhrd is using https://github.com/go-enry/go-license-detector outside of its intended scope, in its role as a license checker.


I've outlined some specific concerns/observations with the hash-based approach below.

I suspect the has-based approach of https://github.com/go-enry/go-license-detector is going to be hard to manipulate to fix for #40 when two licenses have overlapping text (as ISC and 0BSD do). My suspicion is the hashing/weighting algorithm is giving more weight to the exact-match 0BSD with interposed extra text, compared to the ISC exact-match-with-optional-and-alts.

There's a few other inexplicable 'UNKNOWN' results in the relevant build-log, e.g.
https://github.com/spf13/cobra is detected as UNKNOWN despite a LICENSE.txt which is a 1:1 match to https://github.com/spdx/license-list-data/blob/master/text/Apache-2.0.txt (data source for go-license-detector) once whitespace is normalised and the appendix is dropped.

I'm somewhat suspicious that go-license-detector might need to clean its dataset, as I suspect the problem here is that it has included the appendix in its hash, even though it is not part of the license itself, and the source dataset marks it as optional.

But I haven't tried to pull go-license-detector apart hard enough to know if I'm right about these issues, or even assigning the issue in the right place. At this point, I consider them symptomatic and likely to reoccur across licenses due to the nature and design intention of the library, i.e. working-as-intended as a fast and rough datamining support library.

@breml
Copy link
Contributor

breml commented Nov 25, 2020

I also have some problems with the license detection (see #41).

One approach could be to let wwhrd support multiple license detection packages.

@breml
Copy link
Contributor

breml commented Nov 25, 2020

Just did a quick look at the old license detection package (github.com/ryanuber/go-license) and I agree, that the approach taken there is pretty rough and maybe error prone. Basically for each license, that is recognized, there is a single sequence of word, that needs to be present in the license (https://github.com/ryanuber/go-license/blob/master/license.go#L162-L221).

I would be interested to know how the other packages in https://github.com/go-enry/go-license-detector#quality actually work and if there is a way to either combine them or give the user a choice.

@TBBle
Copy link
Author

TBBle commented Nov 25, 2020

In that comparison, a notable absence is the (regex-based) one used by pkg.go.dev, https://github.com/google/licensecheck. A different (hash-based) license checker from Google, https://github.com/google/licenseclassifier is part of the comparison.

@frapposelli
Copy link
Owner

@TBBle and @breml, thanks for the comments and the insights.

The reason to move away from the original library was to improve license recognition; at the time of the decision, the best option was to use a hash-based detection. Empirical tests proved to be much more accurate.

I admittedly didn't pay much attention to this space, but I really like the approach google/licensecheck is using.

I'm interested in providing the most accurate reporting possible, so I'll give google/licensecheck a try and maybe implement an interface for multiple backends.

Thanks again for helping through this.

@TBBle
Copy link
Author

TBBle commented Nov 27, 2020

One big advantage of github.com/google/licensecheck specifically is that because it's used by https://pkg.go.dev/, there's positive pressure to either account for more random variations seen in Go packages in the wild, or for the Go packages with random variations to change to conform to what github.com/google/licensecheck can recognise.

Since Go packages is what wwhrd cares about, that seems like a good thing.

Of course, the flip-side is that if github.com/google/licensecheck wrong, then wwhrd is as wrong as pkg.go.dev. github.com/google/licensecheck specifically calls out "no false positives" as a goal, so an incorrect detection should result in "UNKNOWN" and manual follow-up rather than misidentified licenses and accidental violations.

@frapposelli
Copy link
Owner

@TBBle @breml I found the time to implement google/licensecheck as the backend in a new release and cut an RC tag for 0.4.0.

I plan to make the backend pluggable and add an interface for it before cutting a GA tag, but I would appreciate some feedback on the new detection engine 😄

@breml
Copy link
Contributor

breml commented Dec 9, 2020

I did some checks on our mono repo and the results do look promising. I was able to remove all exceptions from the .wwhrd.yml config file and github.com/davecgh/go-spew/spew is now correctly recognized as ISC (this has not been the case before, because ISC was not yet on the whitelist but wwhrd did never complain).
I also could remove some of the licenses from my config, where I suspect, that these have been cases of non deterministic results in the past, especially: 0BSD, MPL-2.0-no-copyleft-exception and Mup.

From what I can tell, at least the following cases have changed:

package old new
github.com/davecgh/go-spew/spew 0BSD ISC
github.com/go-sql-driver/mysql MPL-2.0-no-copyleft-exception MPL

The case for Mup I am no longer able to reproduce, I assume, that this case already vanished in one of the previous updates of wwhrd.

Additionally, I opened #49.

@TBBle
Copy link
Author

TBBle commented Dec 19, 2020

Sorry, I took a while to get back to this, and try it against https://github.com/docker/compose-on-kubernetes/. Compared to docker/compose-on-kubernetes#170 (as at docker/compose-on-kubernetes@5f3cab1), I was able to remove almost all the exceptions, except:

So overall, a major win.

I have a concern that the license for github.com/hashicorp/golang-lru is detected as MPL-2.0, not MPL-2.0-no-copyleft-exception. I have re-added those modules as exceptions so that this distinction is not lost, as it's the reason for blacklisting GPL and related licenses. This is the same thing @breml noted for github.com/go-sql-driver/mysql earlier. I believe this distinction is simply not recognised by github.com/google/licensecheck.

Edit: Nope, this is just me being bad at licenses. `` does recognise the MPL-2.0-no-copyleft-exception, but it's (correctly) looking for the notice somewhere other than Exhibit B of the license. Which is correct. For some reason I was under the impression that the presence of Exhibit B had the effect of blocking the copyleft exception. So in fact, wwhrd is right, and I was wrong. (Unhelpfully, the golang-lru source includes neither of the notices in its source, and just includes a copy of the MPL-2.0 license as LICENSE, implying the whole thing is `MPL-2.0` but not `MPL-2.0-no-copyleft-exception`.)

Also, one interesting note, is that in this codebase, the vendored github.com/prometheus/common includes its own vendored goautoneg module, but as bitbucket.org/ww/goautoneg under internal instead of under vendor. wwhrd used to detect that as UNKNOWN (I think... I had an exception for it, anyway), and now sees it as Apache-2.0, which is the license for github.com/prometheus/common itself. I guess it's possible this is a valid relicensing action, but I haven't checked.

Since the https://github.com/docker/compose-on-kubernetes/ CI pipeline always uses the latest wwhrd, I plan to rebase that PR to remove the commits which were only needed before the current changes. But for now, I've just fixed the .wwhrd.yml to be what it needs to look like now: docker/compose-on-kubernetes@521ba3d

The full .wwhrd.yml I ended up with
---
blacklist:
  - GPL-2.0

whitelist:
  - Apache-2.0
  - BSD-2-Clause
  - ISC
  - MIT
  - MPL-2.0
  - BSD-3-Clause

exceptions:
  - bitbucket.org/ww/goautoneg # BSD-3-Clause, license is in the source, misdetected as Apache-2.0
  - github.com/munnerz/goautoneg # BSD-3-Clause, license is in the source
  - github.com/opencontainers/go-digest # Apache-2.0, misdetected as CC-BY-SA-4.0

@frapposelli
Copy link
Owner

Cool! Looks like this is a significant step in the right direction. I'll continue fixing bugs over the holiday and cut a proper v0.4.0 release in the new year.

@frapposelli
Copy link
Owner

Just cut v0.4.0-rc.2 which is going to be the last RC before cutting the release, probably tomorrow or next week.

Please give it a try and thanks again for the help troubleshooting and fixing issues.

@frapposelli
Copy link
Owner

v0.4.0 has been tagged and wwhrd now uses the github.com/google/licensecheck library for license detection.

I'll enqueue an issue to make the license library pluggable in v0.5.0, for now, I will close this issue.

Thanks a lot for your help in making wwhrd better! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants