Simple Instrumentation of Enry predictions in dev mode #188

Open
bzz opened this issue Jan 9, 2019 · 1 comment


bzz commented Jan 9, 2019

To assist debugging in dev mode, it would be nice to have some visibility into the decision-making logic that Enry uses at runtime.

Problem: after getting a final prediction, e.g. through enry.GetLanguage(), it is very hard to tell:

  • what strategies were used
  • what suggestions each strategy made
  • what was the winning strategy

Such introspection would simplify maintenance and reduce the time needed to debug mispredictions, e.g. after syncing with Linguist.

Linguist does have a simple protocol for Linguist.instrumenter that serves this need. It is very generic and can be deployed and enabled in production, etc.

Something simpler, similar to the LocalInstrumenter (details below) that is propagated to every Strategy, would work for Enry in Go development mode and is the subject of this issue.

class LocalInstrumenter
  Event = Struct.new(:name, :args)

  attr_reader :events

  def initialize
    @events = []
  end

  def instrument(name, *args)
    @events << Event.new(name, args)
    yield if block_given?
  end
end
Linguist.instrumenter = LocalInstrumenter.new

would produce

   #<struct LocalInstrumenter::Event
    name="linguist.strategy",
    args=
     [{:blob=>
           ...full file content...
         @detect_encoding=
          {:type=>:text,
           :encoding=>"UTF-8",
           :ruby_encoding=>"UTF-8",
           :confidence=>80},
         @encoded_newlines_re=/\r\n|\r|\n/,
         @fullpath=".linguist/samples/C++/Types.h",
         @path=".linguist/samples/C++/Types.h",
         @size=1484,
         @symlink=false>,
       :strategy=>Linguist::Heuristics,
       :candidates=>
        [#<Linguist::Language name=C>,
         #<Linguist::Language name=C++>,
         #<Linguist::Language name=Objective-C>]}]>,

  #<struct LocalInstrumenter::Event
    name="linguist.detected",
    args=
     [{:blob=>
           ...full file content...
         @detect_encoding=
          {:type=>:text,
           :encoding=>"UTF-8",
           :ruby_encoding=>"UTF-8",
           :confidence=>80},
         @encoded_newlines_re=/\r\n|\r|\n/,
         @fullpath=".linguist/samples/C++/Types.h",
         @path=".linguist/samples/C++/Types.h",
         @size=1484,
         @symlink=false>,
       :strategy=>Linguist::Heuristics,
       :language=>#<Linguist::Language name=C++>}]>]>

bzz commented Jan 10, 2019

In the current state, a simplistic version of this is possible by hard-coding log statements:

diff --git a/common.go b/common.go
index 949db71..d4a6c57 100644
--- a/common.go
+++ b/common.go
@@ -3,11 +3,14 @@ package enry
 import (
 	"bufio"
 	"bytes"
+	"log"
 	"path/filepath"
 	"strings"
 
 	"gopkg.in/src-d/enry.v1/data"
 	"gopkg.in/src-d/enry.v1/regex"
+
+	"github.com/sanity-io/litter"
 )
 
 // OtherLanguage is used as a zero value when a function can not return a specific language.
@@ -118,6 +121,7 @@ func GetLanguageBySpecificClassifier(content []byte, candidates []string, classi
 // At least one of arguments should be set. If content is missing, language detection will be based on the filename.
 // The function won't read the file, given an empty content.
 func GetLanguages(filename string, content []byte) []string {
+	log.Printf("file:%s\n", filename)
 	if IsBinary(content) {
 		return nil
 	}
@@ -126,6 +130,8 @@ func GetLanguages(filename string, content []byte) []string {
 	candidates := []string{}
 	for _, strategy := range DefaultStrategies {
 		languages = strategy(filename, content, candidates)
+		log.Printf("\tstrategy:%s, langs:%q\n", litter.Sdump(strategy), languages)
+
 		if len(languages) == 1 {
 			return languages
 		}
diff --git a/data/heuristics.go b/data/heuristics.go
index dc3663d..c894985 100644
--- a/data/heuristics.go
+++ b/data/heuristics.go
@@ -1,6 +1,11 @@
 package data
 
-import "regexp"
+import (
+	"log"
+	"regexp"
+
+	"github.com/sanity-io/litter"
+)
 
 type (
 	Heuristics []Matcher
@@ -20,7 +25,10 @@ type (
 
 func (h *Heuristics) Match(data []byte) []string {
 	var matchedLangs []string
+	litter.Config.Compact = true
+
 	for _, matcher := range *h {
+		log.Printf("matcher:%s\n", litter.Sdump(matcher))
 		if matcher.Match(data) {
 			for _, langOrAlias := range matcher.(Rule).GetLanguages() {
 				lang, ok := LanguagesByAlias(langOrAlias)
@@ -31,6 +39,7 @@ func (h *Heuristics) Match(data []byte) []string {
 				}
 				matchedLangs = append(matchedLangs, lang)
 			}
+			log.Printf("\t\tlangs:%q\n", matchedLangs)
 			break
 		}
 	}

but the idea is to provide an API with simple instrumentation for all strategies instead, which can be used in tests to achieve similar results.
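
For illustration only, a minimal sketch of what such an API could look like. Everything here is an assumption and not existing Enry code: the names Event, Instrumenter, LocalInstrumenter, NamedStrategy and GetLanguagesInstrumented are hypothetical, the two toy strategies stand in for the real ones, and the loop is a simplified stand-in for GetLanguages. Only the Strategy function shape is taken from the diff above.

```go
package main

import (
	"bytes"
	"fmt"
	"path/filepath"
)

// Strategy has the same shape as the strategy calls in the diff above.
type Strategy func(filename string, content []byte, candidates []string) []string

// Event is a hypothetical record of one instrumented step.
type Event struct {
	Name      string // "strategy" or "detected"
	Strategy  string // label of the strategy that produced Languages
	Filename  string
	Languages []string
}

// Instrumenter is a hypothetical hook passed down to the detection loop.
type Instrumenter interface {
	Instrument(e Event)
}

// LocalInstrumenter collects events in memory, for use in tests or dev mode.
type LocalInstrumenter struct {
	Events []Event
}

// Instrument appends the event to the in-memory list.
func (l *LocalInstrumenter) Instrument(e Event) { l.Events = append(l.Events, e) }

// NamedStrategy pairs a strategy with a readable label (an assumption:
// plain Go funcs do not carry a name that is convenient to log).
type NamedStrategy struct {
	Name string
	Run  Strategy
}

// GetLanguagesInstrumented is a simplified stand-in for GetLanguages that
// reports every strategy result, and the winning one, to ins (if non-nil).
func GetLanguagesInstrumented(ins Instrumenter, strategies []NamedStrategy, filename string, content []byte) []string {
	var languages []string
	candidates := []string{}
	for _, s := range strategies {
		languages = s.Run(filename, content, candidates)
		if ins != nil {
			ins.Instrument(Event{Name: "strategy", Strategy: s.Name, Filename: filename, Languages: languages})
		}
		if len(languages) == 1 {
			if ins != nil {
				ins.Instrument(Event{Name: "detected", Strategy: s.Name, Filename: filename, Languages: languages})
			}
			return languages
		}
		if len(languages) > 0 {
			candidates = append(candidates, languages...)
		}
	}
	return languages
}

// byExtension is a toy strategy: ".h" is ambiguous between three languages.
func byExtension(filename string, _ []byte, _ []string) []string {
	if filepath.Ext(filename) == ".h" {
		return []string{"C", "C++", "Objective-C"}
	}
	return nil
}

// byContent is a toy heuristic: "class " picks C++ if it is a candidate.
func byContent(_ string, content []byte, candidates []string) []string {
	if !bytes.Contains(content, []byte("class ")) {
		return nil
	}
	for _, c := range candidates {
		if c == "C++" {
			return []string{"C++"}
		}
	}
	return nil
}

func main() {
	ins := &LocalInstrumenter{}
	strategies := []NamedStrategy{
		{Name: "extension", Run: byExtension},
		{Name: "content", Run: byContent},
	}
	GetLanguagesInstrumented(ins, strategies, "Types.h", []byte("class Foo {};"))
	for _, e := range ins.Events {
		fmt.Printf("%s\t%s\t%q\n", e.Name, e.Strategy, e.Languages)
	}
}
```

A test could then assert on ins.Events directly (which strategies ran, what each suggested, which one won) instead of parsing log output. Passing a nil Instrumenter keeps the loop free of any reporting.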
