Simple Instrumentation of Enry predictions in dev mode #188

bzz · 2019-01-09T17:02:07Z

To assist debugging in dev mode, it would be nice to have some visibility into the decision-making logic that Enry uses at runtime.

Problem: after getting a final prediction, e.g. through enry.GetLanguage(), it's very hard to tell:

what strategies were used
what suggestions each strategy made
what was the winning strategy

Such introspection would simplify maintenance and reduce the time needed to debug mispredictions, e.g. after sync-ups with Linguist.

Linguist has a simple protocol, Linguist.instrumenter, that serves this need and is very generic, with the ability to be deployed and enabled in production, etc.

Something simpler, similar to the LocalInstrumenter below and propagated to every Strategy, would work for Enry in Go development mode and is the subject of this issue.

class LocalInstrumenter
  Event = Struct.new(:name, :args)

  attr_reader :events

  def initialize
    @events = []
  end

  def instrument(name, *args)
    @events << Event.new(name, args)
    yield if block_given?
  end
end

Linguist.instrumenter = LocalInstrumenter.new

would produce

   #<struct LocalInstrumenter::Event
    name="linguist.strategy",
    args=
     [{:blob=>
           ...full file content...
         @detect_encoding=
          {:type=>:text,
           :encoding=>"UTF-8",
           :ruby_encoding=>"UTF-8",
           :confidence=>80},
         @encoded_newlines_re=/\r\n|\r|\n/,
         @fullpath=".linguist/samples/C++/Types.h",
         @path=".linguist/samples/C++/Types.h",
         @size=1484,
         @symlink=false>,
       :strategy=>Linguist::Heuristics,
       :candidates=>
        [#<Linguist::Language name=C>,
         #<Linguist::Language name=C++>,
         #<Linguist::Language name=Objective-C>]}]>,

  #<struct LocalInstrumenter::Event
    name="linguist.detected",
    args=
     [{:blob=>
           ...full file content...
         @detect_encoding=
          {:type=>:text,
           :encoding=>"UTF-8",
           :ruby_encoding=>"UTF-8",
           :confidence=>80},
         @encoded_newlines_re=/\r\n|\r|\n/,
         @fullpath=".linguist/samples/C++/Types.h",
         @path=".linguist/samples/C++/Types.h",
         @size=1484,
         @symlink=false>,
       :strategy=>Linguist::Heuristics,
       :language=>#<Linguist::Language name=C++>}]>]>


bzz · 2019-01-10T10:27:17Z

In the current state, a simplistic version of this is possible by hard-coding log statements:

diff --git a/common.go b/common.go
index 949db71..d4a6c57 100644
--- a/common.go
+++ b/common.go
@@ -3,11 +3,14 @@ package enry
 import (
 	"bufio"
 	"bytes"
+	"log"
 	"path/filepath"
 	"strings"
 
 	"gopkg.in/src-d/enry.v1/data"
 	"gopkg.in/src-d/enry.v1/regex"
+
+	"github.com/sanity-io/litter"
 )
 
 // OtherLanguage is used as a zero value when a function can not return a specific language.
@@ -118,6 +121,7 @@ func GetLanguageBySpecificClassifier(content []byte, candidates []string, classi
 // At least one of arguments should be set. If content is missing, language detection will be based on the filename.
 // The function won't read the file, given an empty content.
 func GetLanguages(filename string, content []byte) []string {
+	log.Printf("file:%s\n", filename)
 	if IsBinary(content) {
 		return nil
 	}
@@ -126,6 +130,8 @@ func GetLanguages(filename string, content []byte) []string {
 	candidates := []string{}
 	for _, strategy := range DefaultStrategies {
 		languages = strategy(filename, content, candidates)
+		log.Printf("\tstrategy:%s, langs:%q\n", litter.Sdump(strategy), languages)
+
 		if len(languages) == 1 {
 			return languages
 		}
diff --git a/data/heuristics.go b/data/heuristics.go
index dc3663d..c894985 100644
--- a/data/heuristics.go
+++ b/data/heuristics.go
@@ -1,6 +1,11 @@
 package data
 
-import "regexp"
+import (
+	"log"
+	"regexp"
+
+	"github.com/sanity-io/litter"
+)
 
 type (
 	Heuristics []Matcher
@@ -20,7 +25,10 @@ type (
 
 func (h *Heuristics) Match(data []byte) []string {
 	var matchedLangs []string
+	litter.Config.Compact = true
+
 	for _, matcher := range *h {
+		log.Printf("matcher:%s\n", litter.Sdump(matcher))
 		if matcher.Match(data) {
 			for _, langOrAlias := range matcher.(Rule).GetLanguages() {
 				lang, ok := LanguagesByAlias(langOrAlias)
@@ -31,6 +39,7 @@ func (h *Heuristics) Match(data []byte) []string {
 				}
 				matchedLangs = append(matchedLangs, lang)
 			}
+			log.Printf("\t\tlangs:%q\n", matchedLangs)
 			break
 		}
 	}

but the idea is instead to provide an API with simple instrumentation for all strategies, which can be used in tests to achieve similar results.
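As a rough illustration of what such an API might look like, here is a minimal sketch in Go. It is not part of enry: `Event`, `LocalInstrumenter`, `NamedStrategy`, `GetLanguagesInstrumented` and the event names `enry.strategy` / `enry.detected` are all hypothetical names, and the `Strategy` type is redeclared locally only to mirror the `strategy(filename, content, candidates)` call shape from the diff above. The way candidates are narrowed between strategies is also a simplifying assumption.

```go
package main

import "fmt"

// Strategy mirrors the call shape seen in the diff: strategy(filename, content, candidates).
// It is redeclared here so the sketch compiles without importing enry.
type Strategy func(filename string, content []byte, candidates []string) []string

// Event records one step of the decision process (hypothetical type).
type Event struct {
	Name      string   // "enry.strategy" or "enry.detected" (assumed names)
	Strategy  string   // label of the strategy that produced the event
	Filename  string   // file being classified
	Languages []string // suggestions made by the strategy
}

// LocalInstrumenter collects events in memory, like the Ruby example (hypothetical type).
type LocalInstrumenter struct {
	Events []Event
}

// Instrument appends an event to the in-memory log.
func (i *LocalInstrumenter) Instrument(e Event) {
	i.Events = append(i.Events, e)
}

// NamedStrategy pairs a strategy with a readable label, since a Go func
// value does not carry a convenient name of its own.
type NamedStrategy struct {
	Name string
	Run  Strategy
}

// GetLanguagesInstrumented runs the strategies in order, recording what each
// one suggested and which one produced the final single-language answer.
func GetLanguagesInstrumented(filename string, content []byte, strategies []NamedStrategy, in *LocalInstrumenter) []string {
	var languages []string
	candidates := []string{}
	for _, s := range strategies {
		languages = s.Run(filename, content, candidates)
		in.Instrument(Event{Name: "enry.strategy", Strategy: s.Name, Filename: filename, Languages: languages})
		if len(languages) == 1 {
			in.Instrument(Event{Name: "enry.detected", Strategy: s.Name, Filename: filename, Languages: languages})
			return languages
		}
		if len(languages) > 0 {
			candidates = languages // assumption: later strategies only narrow earlier suggestions
		}
	}
	return languages
}

func main() {
	// Two fake strategies standing in for real ones.
	strategies := []NamedStrategy{
		{Name: "extension", Run: func(string, []byte, []string) []string {
			return []string{"C", "C++", "Objective-C"}
		}},
		{Name: "heuristics", Run: func(string, []byte, []string) []string {
			return []string{"C++"}
		}},
	}
	in := &LocalInstrumenter{}
	GetLanguagesInstrumented("Types.h", nil, strategies, in)
	for _, e := range in.Events {
		fmt.Printf("%s strategy=%s langs=%q\n", e.Name, e.Strategy, e.Languages)
	}
}
```

In a test, the collected `Events` slice could be asserted on directly, which answers the three questions from the issue description (strategies used, suggestions made, winning strategy) without any log parsing.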

bzz added the good first issue label Feb 5, 2019

bzz mentioned this issue Mar 15, 2019

Breakdown of django/django is different from Linguist #204

Closed
