Skip to content

Commit

Permalink
Add rank-aggreg article
Browse files Browse the repository at this point in the history
  • Loading branch information
gr-im committed Aug 10, 2024
1 parent 8348ded commit e04b514
Show file tree
Hide file tree
Showing 8 changed files with 360 additions and 4 deletions.
3 changes: 2 additions & 1 deletion articles/dune
Original file line number Diff line number Diff line change
@@ -1,2 +1,3 @@
(mdx
(files *.md))
(files *.md)
(libraries site.article_lib))
1 change: 1 addition & 0 deletions articles/fold-gof.md
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,7 @@ plea against the usefulness of pattern matching. Indeed, I am instinctively
convinced that, akin to the explicit ability to delineate algebraic types,
pattern matching objectively enhances a language.


## The main idea behind the `fold` function

We can offer an exceedingly broad definition of the `fold` function, which
Expand Down
3 changes: 3 additions & 0 deletions articles/lib/dune
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
(library
(name article_lib)
(public_name site.article_lib))
49 changes: 49 additions & 0 deletions articles/lib/rank_aggregation.ml
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
type data = { product_name : string; amount_votes : int; result : float }
type dataset = data list

let up_down { amount_votes; result; _ } =
let voters = Float.of_int amount_votes in
let up = result /. 100.0 *. voters in
let down = voters -. up in
(Float.to_int up, Float.to_int down)

let make ?(prefix = "Product ") product_name amount_votes result =
let product_name = prefix ^ product_name in
{ product_name; amount_votes; result }

let make_ud ?prefix product_name up down =
let amount_votes = up + down in
let result = Float.of_int up /. Float.of_int amount_votes *. 100. in
make ?prefix product_name amount_votes result

let pp_set ppf set =
Format.fprintf ppf "\nName\t\tVoters\tResult\tUpvotes\tDownvotes\n\n%a"
(Format.pp_print_list ~pp_sep:Format.pp_print_newline
(fun ppf ({ product_name; amount_votes; result } as data) ->
let up, down = up_down data in
Format.fprintf ppf "%s\t% 5d\t%.01f\t% 5d\t% 5d" product_name
amount_votes result up down))
set

let result { result; _ } = result

let dataset =
[
make_ud "A" 32 68
; make_ud "B" 890 110
; make_ud "C" 2 0
; make_ud "D" 1530 1470
; make_ud "E" 524 235
; make_ud "F" 118 472
; make_ud "G" 25 75
]

let update ?(dataset = dataset) name ~up ~down =
List.map
(fun data ->
if String.equal name data.product_name then
make_ud ~prefix:"" name up down
else data)
dataset

let sort ?(dataset = dataset) compare = List.sort compare dataset
293 changes: 293 additions & 0 deletions articles/rank-aggregation.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,293 @@
---
title: Sorting things, rank-aggregation (beginner's approach)
date: 2024-08-08
description:
Summary of a response about how to order products by their votes/reviews using
rank aggregation (using Shopify approach).
referenced_humans:
- [xvw, https://xvw.lol]
bib:
- ident: bmfh
title: "Bayesian Methods for Hackers: Probabilistic Programming and Bayesian Inference"
year: 2015
authors: [Cameron Davidson-Pilon]
url: https://www.oreilly.com/library/view/bayesian-methods-for/9780133902914/
- ident: wilson
title: "Binomial proportion confidence interval"
url: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
authors: [Wikipedia]
- ident: reddit-comment
title: "Reddit's new comment sorting system"
url: http://web.archive.org/web/20140307091023/http://www.redditblog.com/2009/10/reddits-new-comment-sorting-system.html
authors: [Randall Munroe]
- ident: reddit-formula
title: Deriving the Reddit Formula
url: https://www.evanmiller.org/deriving-the-reddit-formula.html
authors: [Evan Miller]
- ident: not-sort-by-avg
title: How Not To Sort By Average Rating
url: https://www.evanmiller.org/how-not-to-sort-by-average-rating.html
authors: [Evan Miller]
- ident: ranking-star
title: Ranking Items With Star Ratings
url: https://www.evanmiller.org/ranking-items-with-star-ratings.html
authors: [Evan Miller]
---

Last year, a friend of [Xavier Van de Woestyne][xvw] posed a question to him—one
for which he had some inklings but lacked precise answers. Given his
professional commitments, he requested that I provide a succinct response on his
behalf. This article seeks to outline the issue concisely and offer a considered
solution. As with many problems, knowing the name of the problem makes it much
easier to find a solution. The goal of the response I provided (and, by
extension, of this article) was to offer a _simple solution_ that could be
easily integrated into an existing database. For the sake of clarity (and
conciseness) in the article, **performance considerations are not taken into
account**.

<div class="hidden-block">

```ocaml
# open Article_lib.Rank_aggregation ;;
# #install_printer pp_set ;;
```

</div>


## Context

Let’s imagine a **list of products** where users can vote using `up` and `down`
options. How might we go about sorting these results in an intelligent manner?
Here is an example of our data source: we have the **name of a product**, an
**amount of voters** and a **result**. The result is the percentage calculated
between the number of positive votes and the number of negative votes :


```ocaml
# dataset ;;
- : data list =
Name Voters Result Upvotes Downvotes
Product A 100 32.0 32 68
Product B 1000 89.0 890 110
Product C 2 100.0 2 0
Product D 3000 51.0 1530 1470
Product E 759 69.0 524 235
Product F 590 20.0 118 472
Product G 100 25.0 25 75
```

This kind of modelling of data by expressing the popularity of an entry is
fairly common; it's the kind of representation on which the sorting of
[Reddit](https://www.reddit.com/) messages is based, for example.

## Problem sketch

Our exercise is relative. We want to provide a function capable of sorting
products _in a relevant way_. Obviously, it is difficult to give a **systematic
definition of relevance** for any context, but let's look at this example. Let's
say I want to sort products "from most popular to least popular" (in descending
order of popularity). I could very naively sort my products using the `result`,
which is the percentage calculated on the basis of `up/down` votes.


```ocaml
# sort (fun a b -> Float.compare b.result a.result) ;;
- : data list =
Name Voters Result Upvotes Downvotes
Product C 2 100.0 2 0
Product B 1000 89.0 890 110
Product E 759 69.0 524 235
Product D 3000 51.0 1530 1470
Product A 100 32.0 32 68
Product G 100 25.0 25 75
Product F 590 20.0 118 472
```

We can argue that our result is **objectively good** because the product `C`
which received 100% positive votes is in first position. There are many cases in
which this answer would be acceptable. However, at a time when recommendation
engines are building streams of consumption (like
[Netflix](https://netflix.com), [Spotify](http://spotify.com/), etc.), the
organisation of this kind of vote raises some interesting questions, for
example:

- is `product C` really considered more positively than `product B`
- is `product D` potentially considerer more positively than `product B`

We can arbitrarily consider that in the case of the first question, `product C`
ranks lower than `product B`, which has many more positive votes. On the other
hand, in the second question, even though `product D` has **far more positive
votes**, the large number of negative votes mitigates against its ranking. This
kind of question raises a set of more general questions about the relationship
between upvotes and downvotes and is broadly referred to as **rank aggregation
problems** which is widely documented in the statistical and Bayesian
interpretation literature.

The problem is also extremely well summarized in the introduction of the article
"[_How Not To Sort By Average Rating_][not-sort-by-avg]" by [Evan
Miller](https://www.evanmiller.org/index.html).

### Some bizarre solutions

In drafting his question, [Xavier][xvw] suggested a few avenues, a little
strange, but which illustrate the versatility of the problem (the first proposal
being that using only the percentage). All the solutions relied on changes in
weighting and didn’t lead to results that one might _instinctively_ consider
valid. However, I would like to share with you his latest proposal, which I
found quite _original_.

Firstly, weighting _slices_ are defined. The idea is to allow you to use the
percentage only if you are in the same slice.

```ocaml
let slice_of data =
let p = result data |> Float.to_int in
if p = 100 then 8 else p / 9
```

We assume that if we have a perfect score (`100`), we project the result into
the last slice, `9`, so that the score can be compared with the results between
`90` and 100 (_inclusive_). Now we can refine our heuristic so that it only
produces a fine comparison when scores are present in the same slice:

```ocaml
# sort (fun a b ->
let slice_a = slice_of a and slice_b = slice_of b in
if Int.equal slice_a slice_b then
let a = a.result *.
(Float.of_int a.amount_votes *. (a.result /. 100.0))
and b = b.result *.
(Float.of_int b.amount_votes *. (b.result /. 100.0)) in
Float.compare a b
else Int.compare slice_b slice_a) ;;
- : data list =
Name Voters Result Upvotes Downvotes
Product B 1000 89.0 890 110
Product C 2 100.0 2 0
Product E 759 69.0 524 235
Product D 3000 51.0 1530 1470
Product A 100 32.0 32 68
Product G 100 25.0 25 75
Product F 590 20.0 118 472
```

The result seems correct, however, the calculation of the _pivot_ to define the
slices seems a little arbitrary (mainly because this approach is problematic
where there are **strong proximities with the boundaries of a slice**) and it
would be interesting to see what the big ad sites offer.

### Some preliminary conclusions

In the light of our few small experiments, we can quickly draw a number of
interesting pre-conclusions. To avoid products with too many (potentially
negative) votes, i.e. 10,000 positive votes against 20,000 negative votes, we
probably need to weight the positive votes more heavily (in other words,
penalise the negative votes) by means of exponential curves.

In our example of slices, the indexing **gave some exponential
differentiation**, but in a very _ad hoc way_. To break these ad hoc encodings,
I reread certain chapters of the book [Bayesian Methods for Hackers:
Probabilistic Programming and Bayesian Inference][bmfh].

## The Shopify approach

The [book][bmfh] provides an extremely precise answer to a sorting problem based
on reviews used at [Shopify](https://www.shopify.com) (a toolbox offering
advanced functionalities for implementing e-commerce platforms), and here is a
fairly free implementation. Firstly, we define a function to assign a score to
an input (a product). We will then use the comparison of these scores to sort
our dataset.

```ocaml
let shopify_score data =
let (total_up, total_down) = up_down data in
let up = total_up |> succ |> Float.of_int
and down = total_down |> succ |> Float.of_int in
let total = up +. down in
let base = up /. total
and coef = 1.65 *. sqrt (up *. down /. (Float.pow total 2.0 *. (total +. 1.0))) in
base -. coef
```

Now we can use our score to order our dataset. As you can see, the result seems
reasonable:

```ocaml
# sort (fun a b ->
let a = shopify_score a
and b = shopify_score b in
Float.compare b a) ;;
- : data list =
Name Voters Result Upvotes Downvotes
Product B 1000 89.0 890 110
Product E 759 69.0 524 235
Product D 3000 51.0 1530 1470
Product C 2 100.0 2 0
Product A 100 32.0 32 68
Product G 100 25.0 25 75
Product F 590 20.0 118 472
```

To (_try to_) confirm the relevance of our result, we can slightly modify the number of
positive votes for our `product C', to see how it grows in the list of reviews:

```ocaml
# sort ~dataset:(update "Product C" ~up:20 ~down:0)
(fun a b ->
let a = shopify_score a
and b = shopify_score b in
Float.compare b a) ;;
- : data list =
Name Voters Result Upvotes Downvotes
Product C 20 100.0 20 0
Product B 1000 89.0 890 110
Product E 759 69.0 524 235
Product D 3000 51.0 1530 1470
Product A 100 32.0 32 68
Product G 100 25.0 25 75
Product F 590 20.0 118 472
```

From my point of view, which is clearly not that of an expert in "_presenting
ordered data_", I find the results rather convincing, and they provide a good
basis for sorting, which could be altered by taking into account temporal data
(for example) or additional criteria. It's possible to go much further, but if
we take the origin of this response in context, providing a function capable of
ordering product lists according to reviews (up/down vote) taking into account
the _number of votes_, I found this solution to be very useful. One can quickly
fall down the rabbit hole (for example, through the [Binomial proportion
confidence interval][wilson]), discovering increasingly sophisticated sorting
solutions, and one of the challenges of this exercise was to remain concise
while seamlessly integrating with an existing sorting function.

## Conclusion

Sorting data based on reviews is a complex task that likely requires advanced
domain knowledge. For instance, the heuristic would probably differ
significantly if you were ordering products versus posts on a social network
(like [Reddit](https://reddit.com)). However, it's an enjoyable challenge (that
lends itself quite well to _literate programming_).

In fact, before settling on the Shopify approach, I was very intrigued by the
methods used by Reddit to calculate message scores. It’s a fascinating topic
that led me to several interesting resources (for example, [this
article][reddit-comment] describing how comments are sorted, written by the
author of [XKCD](https://xkcd.com/), [Randall
Munroe](https://en.wikipedia.org/wiki/Randall_Munroe)), which I've included in
the bibliography (even though they didn't serve as the basis for this article).

This is the end of this very brief article, which sketches out a somewhat naive
solution to what is potentially a much more complex problem. If you have any
suggestions for _plug and play_ approaches similar to the one I’ve presented,
I’d be delighted to hear them and possibly update this article.
9 changes: 7 additions & 2 deletions css/style.css
Original file line number Diff line number Diff line change
Expand Up @@ -22,15 +22,15 @@ body > header,
body > main,
body > footer {
max-width: 960px;
margin: 0 64px;
margin: 0 32px;
}

body > header {
margin-top: 32px;
}

body > main {
margin: 32px 64px;
margin: 32px 32px;
flex: 1;
}

Expand Down Expand Up @@ -115,6 +115,11 @@ ol.bibliography > li .authors > li:first-child::before {
font-style: italic;
}

.hidden-block {
display: none;
margin: 0;
}

.atom-btn {
margin-top: 12px;
display: inline-block;
Expand Down
1 change: 1 addition & 0 deletions dune-project
Original file line number Diff line number Diff line change
Expand Up @@ -25,6 +25,7 @@
yocaml_jingoo
yocaml_syndication
yocaml_omd

(ocamlformat (and :with-dev-setup (= 0.26.2)))
(ocp-indent :with-dev-setup)
(merlin :with-dev-setup)
Expand Down
Loading

0 comments on commit e04b514

Please sign in to comment.