Poor search results not replicable; something wrong on server? #721
Comments
I agree, this isn't desirable behavior. I don't know much about Lucene, but if someone who knows more about it wants to look at solving this, the search code lives in https://github.com/clojars/clojars-web/blob/master/src/clojars/search.clj
Maybe someone can coordinate with the cljdoc people? They're planning to update their Lucene as well. Since cljdoc currently uses clojars for search, anything not on the front page is unsearchable on cljdoc. I might be able to take a look later.
I'm beginning to suspect something else is wrong, maybe in the data/server, and not the code, because I can't replicate the problem locally. I was originally considering using I forked clojars, set it up, imported the entire clojars mvn with rsync, and made sure to overwrite the random
May be related to #719. This is starting to feel more like an error somewhere, and less like a suboptimal scoring algorithm.
I'm going to update the PR title to reflect recent investigation.
Interesting. I started a full re-index of all of the jars - we'll see if that fixes things. It will take a while to complete, so I'll report back when it is done.
The index rebuild has finished, but I don't see that the results are any better.
Interesting. It's actually gotten worse, if anything; re-frame is now on the third page of search results. What else could it be? The

You know what it might be? I think Lucene is splitting words on the "-", either when indexing, parsing the query, or both. E.g., searches for "re-frame" and "re frame" produce identical results. A search for just "re" returns "re-frame-re-play" as the number one hit. So re-frame-re-play gets the top score because it contains "re" twice, and so many packages score higher than re-frame because they mention re-frame in their descriptions, which boosts their score relative to re-frame itself.

Unfortunately, this doesn't explain why I can't replicate the issue locally. rsync failed on downloading some packages, but I thought I had enough to reasonably reproduce the problem.
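To make the hyphen-splitting hypothesis concrete, here is a minimal Python sketch. It is not Lucene's analyzer and not the clojars search code; the `tokenize` function is a hypothetical stand-in that assumes terms are split on any non-alphanumeric character, which is roughly the behavior suspected above.

```python
import re
from collections import Counter


def tokenize(text):
    """Hypothetical analyzer: lowercase, then split on non-alphanumeric runs."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


# Under this assumption, the hyphenated and space-separated queries
# collapse to exactly the same list of terms.
hyphenated = tokenize("re-frame")
spaced = tokenize("re frame")
print(hyphenated == spaced)

# A name that repeats a term gets a higher raw count for that term,
# which is what a term-frequency-based score would reward.
replay_counts = Counter(tokenize("re-frame-re-play"))
reframe_counts = Counter(tokenize("re-frame"))
print(replay_counts["re"], reframe_counts["re"])
```

If the real analyzer behaves like this, a query for "re" would see two matching terms in re-frame-re-play and only one in re-frame, consistent with the ranking observed on the site.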
@KingMob How did you test the search locally? Through the web UI, i.e. in the same way we search http://clojars.org/? Perhaps a maintainer could share the generated Lucene index from the server so we could try searching against it locally and compare it with a manually generated one? If the search method is the same and the index doesn't make a difference, what could it be? Could clojars.org be running different versions of its dependencies?
BTW I see that clucy is six-year-old abandonware, using Lucene 4 (the latest is 8). Perhaps we should consider switching to a different library or just use Lucene directly?
FYI, this is the cljdoc issue regarding search: cljdoc/cljdoc#85. And this is my work-in-progress Clojure -> Lucene 8 artifact search: https://github.com/holyjak/tmp-clj-artifact-search
@holyjak For local usage, I set it up as I described above in #721 (comment). I tried to pull in the publicly hosted all.edn file to get the stats right, but my rsync call didn't pull the whole maven repository. I think I got maybe 2/3 of all the packages available. Then, I ran it with

I noticed clucy was old. I couldn't locate javadocs for that version of Lucene, and unfortunately, the archived .zip file has some errors that prevented building. That alone suggests it's worth upgrading, though I'm not sure clucy is the issue here.
I've been going through this for the past few days, and here is what I found. The people involved in this issue probably know by now what's happening, but I guess it's good to document things here.

Why do re-frame results have re-frame-re-play first?
What seems to be happening is that Lucene handles things like

Why can't we find clj-concord for clj-concordion (#719)?
This search is mapped to the terms

Why are less popular jars being prioritized over more popular ones?
You can increase the number of matches for a particular term by repeating a bunch of terms several times. I managed to overtake

I couldn't understand exactly how TF-IDF is interfering with this, but I guess it's probably what we could both leverage and tweak here to improve search results.
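The term-repetition effect can be illustrated with a simplified, textbook-style TF-IDF score in Python. This is only a sketch under stated assumptions: it is not Lucene's actual similarity formula (it has no length normalization or field boosts), and the jar names and descriptions in `docs` are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Hypothetical analyzer: lowercase, then split on non-alphanumeric runs."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def tfidf_scores(query, docs):
    """Score each document as the sum of (term count * idf) over query terms."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            # Document frequency: how many documents contain the term at all.
            df = sum(1 for toks in tokenized.values() if term in toks)
            if df == 0:
                continue
            idf = math.log(1 + n_docs / df)
            score += counts[term] * idf
        scores[name] = score
    return scores


# Invented example data: name plus description, flattened into one text field.
docs = {
    "re-frame": "re-frame a reagent framework",
    "stuffed-lib": "helpers for re-frame re-frame re-frame apps",
    "unrelated": "a json parser",
}

scores = tfidf_scores("re-frame", docs)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 3))
```

With raw term counts and no normalization, the description that repeats the query terms outranks the jar that is actually named after them. Real scoring functions dampen this effect to varying degrees, so how closely this matches the behavior on the server is an open question.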
I've rewritten search, and the search results look much better now (see #806 (comment)). I'm closing this issue.
E.g., a search for "re-frame" has re-frame appearing on the second page of search results. Ditto for searching for "component"; com.stuartsierra/component isn't shown until the second page. This seems pretty undesirable to me.