\subsection{Re-ranking}
The re-ranking stage of our baseline system consists of two steps. First, the top 1000 documents retrieved by the \texttt{BM25} retrieval method are re-ranked using the \texttt{monoT5} reranker. Afterwards, the top 50 documents of this first re-ranking step are re-ordered using the \texttt{duoT5} reranker, see Section \ref{sec:rerankers}. The number of documents re-ranked at each step is a hyperparameter of our system, which allows us to balance computational cost against result quality. Both rerankers are implemented in the \texttt{pyterrier\_t5} library.\footnote{URL: \url{https://github.com/terrierteam/pyterrier_t5}} Again, since low latency of our retrieval pipeline is crucial to us, we used smaller \texttt{T5} models: \texttt{castorini\-/monot5\--base\--msmarco}\footnote{URL: \url{https://huggingface.co/castorini/monot5-base-msmarco}} for \texttt{monoT5} and \texttt{castorini\-/duot5\--base\--msmarco}\footnote{URL: \url{https://huggingface.co/castorini/duot5-base-msmarco}} for \texttt{duoT5}.
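
The two-step cascade described above can be illustrated with a minimal, library-independent sketch. It is not the \texttt{pyterrier\_t5} implementation: the function \texttt{cascade\_rerank} and the two scorer callbacks are hypothetical stand-ins for \texttt{monoT5} (one score per query-document pair) and \texttt{duoT5} (a preference between two documents), and summing pairwise preferences is only one possible aggregation. Keeping the documents below each cutoff in their previous order is likewise a choice of this sketch.

```python
from itertools import permutations
from typing import Callable, List, Sequence


def cascade_rerank(
    query: str,
    ranked_docs: Sequence[str],
    pointwise: Callable[[str, str], float],
    pairwise: Callable[[str, str, str], float],
    k_pointwise: int = 1000,
    k_pairwise: int = 50,
) -> List[str]:
    """Re-rank the head of `ranked_docs` in two steps and return the new order.

    `pointwise(query, doc)` and `pairwise(query, doc_a, doc_b)` are
    hypothetical scorers; higher values mean "more relevant" and
    "doc_a preferred over doc_b", respectively.
    """
    # Step 1: score the top k_pointwise first-stage results independently.
    head = list(ranked_docs[:k_pointwise])
    tail = list(ranked_docs[k_pointwise:])
    head.sort(key=lambda d: pointwise(query, d), reverse=True)

    # Step 2: compare every ordered pair among the top k_pairwise documents
    # and sum, for each document, the preference it receives over the others.
    top = head[:k_pairwise]
    rest = head[k_pairwise:]
    wins = {i: 0.0 for i in range(len(top))}
    for i, j in permutations(range(len(top)), 2):
        wins[i] += pairwise(query, top[i], top[j])
    order = sorted(wins, key=wins.get, reverse=True)

    # Documents below each cutoff keep their previous relative order.
    return [top[i] for i in order] + rest + tail
```

The pairwise step needs a number of comparisons that grows quadratically with \texttt{k\_pairwise}, which is why the second cutoff is chosen much smaller than the first.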
\section{Incorporating Pseudo-Relevance Feedback into Our Baseline}\label{sec:baseline+rm3}
Given the substantial performance gains commonly associated with pseudo-relevance feedback, we integrated a query expansion mechanism into our baseline retrieval method, see Section \ref{sec:baseline}. We chose the \texttt{RM3} query expansion technique, which is well established and widely used in the information retrieval community. Its mechanics and principles are described in Section \ref{sec:prf}.

In the \texttt{PyTerrier} framework, any query expansion must follow an initial retrieval phase. This initial retrieval fetches the top $p$ documents, which form the basis for expanding the query by $n$ words using \texttt{RM3}. The expanded query is then passed into a second retrieval phase, which retrieves the final document set. Finally, as in the baseline, we re-rank this set using both \texttt{monoT5} and \texttt{duoT5}.

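
To make the expansion step more concrete, the following sketch shows a simplified \texttt{RM3}-style computation on tokenized feedback documents. It is an illustration under stated assumptions, not the \texttt{PyTerrier} implementation: the function \texttt{rm3\_expand} and its parameters are hypothetical, normalized retrieval scores stand in for the query likelihood of each feedback document, and no smoothing is applied. Here the length of \texttt{feedback\_docs} plays the role of $p$ and \texttt{n\_terms} the role of $n$.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def rm3_expand(
    query_terms: Sequence[str],
    feedback_docs: Sequence[Tuple[Sequence[str], float]],
    n_terms: int = 10,
    orig_weight: float = 0.5,
) -> Dict[str, float]:
    """Return a weighted expanded query from (tokens, retrieval score) pairs.

    Simplified sketch: normalized retrieval scores are used as document
    weights, and term probabilities are unsmoothed relative frequencies.
    """
    # Normalize the retrieval scores so they act as document weights.
    total_score = sum(score for _, score in feedback_docs)
    relevance: Counter = Counter()
    for tokens, score in feedback_docs:
        if not tokens or total_score <= 0:
            continue
        doc_weight = score / total_score
        for term, tf in Counter(tokens).items():
            # Relevance model: weighted average of P(term | doc).
            relevance[term] += doc_weight * tf / len(tokens)

    # Keep the n_terms strongest feedback terms and renormalize them.
    top = relevance.most_common(n_terms)
    top_mass = sum(weight for _, weight in top)
    expanded: Dict[str, float] = {}
    for term, weight in top:
        expanded[term] = (1 - orig_weight) * weight / top_mass

    # Interpolate with the maximum-likelihood model of the original query.
    for term, tf in Counter(query_terms).items():
        expanded[term] = expanded.get(term, 0.0) + orig_weight * tf / len(query_terms)
    return expanded
```

The returned term weights would then form the weighted query for the second retrieval phase; \texttt{orig\_weight} controls how strongly the original query terms dominate the expansion terms.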
The core idea behind the \texttt{doc2query-T5} model is to generate questions or queries that are closely related to the content of a given document. These generated queries are then appended to the document. The goal of this process is to expand the document's content with additional terms that can improve the effectiveness of our information retrieval system. By generating relevant queries based on the document's content, we widen the range of search terms that match the document, enabling our system to better capture the user's intent and find more relevant documents.

The \texttt{T5} model allows us to transform a document into queries tailored to its content. This is achieved by using a \texttt{T5} model that has been fine-tuned to capture the contextual relationships within a document and to generate queries that summarize its key points. In particular, we use the \texttt{castorini\-/doc2query\--t5\--large\--msmarco}\footnote{URL: \url{https://huggingface.co/castorini/doc2query-t5-large-msmarco}} model.
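
The document expansion itself can be sketched independently of any particular model. In the snippet below, \texttt{expand\_documents}, the callback \texttt{generate\_queries}, and the parameter \texttt{num\_queries} are hypothetical names introduced for illustration; the callback stands in for the fine-tuned \texttt{T5} model, and representing documents as dictionaries with \texttt{docno} and \texttt{text} fields is an assumption of this sketch.

```python
from typing import Callable, Dict, Iterable, Iterator, List


def expand_documents(
    docs: Iterable[Dict[str, str]],
    generate_queries: Callable[[str, int], List[str]],
    num_queries: int = 3,
) -> Iterator[Dict[str, str]]:
    """Yield copies of `docs` whose text has generated queries appended.

    `generate_queries(text, k)` is a hypothetical stand-in for the query
    generation model and is expected to return up to k query strings.
    """
    for doc in docs:
        queries = generate_queries(doc["text"], num_queries)[:num_queries]
        # Append the generated queries to the original text so that the
        # indexer sees both the document and its predicted queries.
        expanded_text = " ".join([doc["text"], *queries])
        # Copy the record so that the input documents stay unchanged.
        yield {**doc, "text": expanded_text}
```

Because the expansion happens before indexing, its cost is paid once per document rather than once per query, so it adds no model inference at query time.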
We add \texttt{doc2query-T5} to our baseline, which otherwise remains unchanged. In particular, \texttt{doc2query-T5} can be seen as a preprocessing step before indexing: first, $m$ queries are generated for each document in the collection; these are then appended to the original document to form the input for the indexing stage. The system architecture for this pipeline, which we will refer to as ``\texttt{doc2query-T5}'', therefore takes the following form: