
The search engine - Documentation

This file contains the documentation for this project. More technical documents can be found in the doc folder; they were written throughout the development process and served as a guide to the development.

Framework

We implemented this project entirely in Python 3 and used a number of popular Python libraries to make our work more effective and professional. Here is a list of these libraries:

Testing

  • nose
  • mock

HTTP/HTML

  • requests
  • lxml
  • validators

Data processing

  • pandas
  • numpy
  • scipy

Web Server for the App

  • flask

Testing

Our code is comprehensively tested, and the tests also demonstrate the behaviour of the different modules and functions.

Run

To install and run the code you need a Python 3 environment; simply run

`python setup.py install`

Furthermore, a Dockerfile is provided and the run_docker.sh script helps to run the code in a Docker environment. All scripts to run the modules and the server are contained in bin/.

As crawling, parsing, building the intermediate files, computing the PageRank and indexing the target website take a very long time, we provide a running instance for you to test at the following URL:

`http://136.243.16.223:8910`

The service is weakly secured. Login as:

Username: admin Password: brianmoritzmiguel

Architecture/Pipeline

This crawler/search engine is partitioned into independent modules which together form a pipeline from crawling to serving the app that provides the search engine. In this section we introduce the modules and go into detail on their features and implementation.

Crawler

The crawler module contains the logic to download a single web page as well as to traverse the target website (http://dblp.uni-trier.de/) so that every page we care about is hit. To respect the website's crawling policies, and because the crawler got blocked from time to time due to too many subsequent requests, the robots.txt is parsed and its rules are enforced. A throttle time between two consecutive downloads is also ensured. The crawler paginates through the listings of each category (journals, authors and conferences) and downloads the HTML of each of these pages to a directory. It is ensured that the detail pages of the journals and conferences are found and downloaded as well. As there were many authors and the throttle had to be around one second to prevent the server from blocking us, crawling all the pages took more than two weeks, and saving all the HTML files took more than 50 gigabytes of storage.
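
A minimal sketch of the throttled, robots.txt-aware download loop described above, using requests and the standard-library robotparser (the function name, throttle value and example URL are illustrative, not the actual crawler API):

```python
import time
import urllib.robotparser

import requests

SITE = "http://dblp.uni-trier.de"
THROTTLE_SECONDS = 1.0  # pause between consecutive downloads to avoid getting blocked

# Parse robots.txt once and enforce its rules for every URL we want to fetch.
robots = urllib.robotparser.RobotFileParser(SITE + "/robots.txt")
robots.read()


def crawl(urls):
    """Download the allowed pages one by one, throttling between requests."""
    pages = {}
    for url in urls:
        if not robots.can_fetch("*", url):
            continue  # respect the site's crawling policy
        response = requests.get(url)
        if response.ok:
            pages[url] = response.text
        time.sleep(THROTTLE_SECONDS)
    return pages
```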

Parser

The parser parses every HTML file that was found and tokenizes the words. The data extracted from the HTML files are the title, content, url, isbn, type and listingscount. Providing a type field containing one of 'conference', 'journal' and 'author' allows searching explicitly for an author or a conference just by typing the keyword into the search bar. The parsed data is then saved to a JSON file, one per HTML file. The listingscount field results from an analysis of the HTML and reflects how many listings appear in a given document; for an author, for example, it reflects the number of his publications. This number is scaled and used as the quality score in the indexing.
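
A rough sketch of what the per-document parsing step could look like with lxml; the field names follow the description above, while the tokenizer, the listings heuristic and the hard-coded type value are only illustrative:

```python
import json
import re

from lxml import html as lxml_html


def parse_file(html_path, json_path, url):
    """Parse one downloaded HTML file and dump the extracted fields to JSON."""
    with open(html_path, encoding="utf-8") as f:
        tree = lxml_html.fromstring(f.read())

    title = tree.findtext(".//title") or ""
    tokens = re.findall(r"[a-z0-9]+", tree.text_content().lower())  # simple tokenizer

    record = {
        "title": title,
        "url": url,
        "type": "author",                          # 'conference', 'journal' or 'author'
        "listingscount": len(tree.xpath("//li")),  # rough count of listings on the page
        "tokens": tokens,
    }
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(record, f)
```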

Intermediate

To make indexing and ranking easier, we added an intermediate module which transforms the data from a per-document basis (every document is represented by one file containing its terms) to a per-term basis, where every term is represented by one file containing the number of occurrences of that term in each document.
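
The inversion can be pictured as follows: every per-document JSON file is read and one file per term is written (the file layout and field names are assumptions based on the description above):

```python
import glob
import json
import os
from collections import Counter, defaultdict


def build_intermediate(parsed_dir, out_dir):
    """Turn per-document token lists into per-term files of {doc_id: occurrences}."""
    postings = defaultdict(dict)  # term -> {doc_id: number of occurrences}

    for path in glob.glob(os.path.join(parsed_dir, "*.json")):
        doc_id = os.path.splitext(os.path.basename(path))[0]
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
        for term, count in Counter(record["tokens"]).items():
            postings[term][doc_id] = count

    os.makedirs(out_dir, exist_ok=True)
    for term, docs in postings.items():
        with open(os.path.join(out_dir, term + ".json"), "w", encoding="utf-8") as f:
            json.dump(docs, f)
```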

Page rank

The documents are interconnected: each document has links to other documents, usually related by topic, conference, authors, journals, etc. An 'important' document is then one that appears often in other documents' links.

The PageRank score is a value between 0 and 1 which represents how authoritative a document is within the whole collection. The set of documents is modeled as a directed graph, and using the concept of Markov chains we obtain the probability of a random walk reaching a document from another. Starting from a uniformly distributed probability matrix, the probability of each page is calculated iteratively. After a number of iterations the probabilities converge to stable values, which we use as the static score of each document.

Indexer

The indexer builds the final dictionary using the data from the intermediate module: for each term it reads the corresponding intermediate file, calculates the weights for each of the documents, ranks them, and adds a line to the dictionary consisting of the processed term and a concatenated list of documents with their weights. This custom file format allows for a fast search.
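
The dictionary format can be imagined as one line per term with the precomputed document weights concatenated behind it; the exact on-disk format of the real dictionary may differ from this sketch:

```python
def write_dictionary_line(out_file, term, doc_weights):
    """Append one dictionary line: the term followed by doc_id:weight pairs,
    sorted by descending weight so the best documents come first."""
    postings = " ".join(
        "%s:%.4f" % (doc_id, weight)
        for doc_id, weight in sorted(doc_weights.items(), key=lambda kv: -kv[1])
    )
    out_file.write("%s %s\n" % (term, postings))
```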

Weights calculation

Each term in the dictionary has a score for every document in which it appears. That score is calculated with the logarithmic version of tf-idf, a weight based on the product of term frequency (tf) and inverse document frequency (idf) in their logarithmic form. The tf counts how often a word appears in a document, while the idf gives a higher score to rare words. The formula is (1 + log tf) * log(N / df).
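
In code, the logarithmic tf-idf weight is simply the following (the base of the logarithm is not specified above; natural logarithms are used here):

```python
import math


def tf_idf(tf, df, n_docs):
    """Logarithmic tf-idf: (1 + log tf) * log(N / df); 0 if the term is absent."""
    if tf == 0 or df == 0:
        return 0.0
    return (1.0 + math.log(tf)) * math.log(n_docs / df)


# Example: a term occurring 3 times in a document and appearing in 10 of 1000 documents.
print(tf_idf(3, 10, 1000))  # ~9.66 with natural logarithms
```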

The quality score is a static score applied to each document to introduce the concept of authoritative documents; it is based on the number of listings in a document, its title words and its category. The quality is then derived from the number of listings within a document, normalized over the collection.

Additionally, we multiply the score by a fixed factor (around 3) if the term appears in certain fields other than the main content of a document (e.g. title, isbn).

The PageRank of a document A is obtained by PR(A) = (1 - d) + d * (PR(A_1)/C(A_1) + ... + PR(A_n)/C(A_n)), where A_1 ... A_n are the documents that link to A, PR(A_i) is their PageRank and C(A_i) is the number of links contained in A_i. Each document starts with a uniformly distributed PageRank, which is updated iteratively using this formula until the final values are stored in a PageRank score dictionary.
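
A small sketch of the iteration described by this formula; the damping factor and iteration count are illustrative choices, not necessarily the values used in the project:

```python
def page_rank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(A_i) / C(A_i)) until the scores settle.

    `links` maps each document id to the list of document ids it links to.
    """
    docs = list(links)
    pr = {doc: 1.0 / len(docs) for doc in docs}  # start from a uniform distribution

    for _ in range(iterations):
        pr = {
            doc: (1 - d) + d * sum(pr[src] / len(links[src])
                                   for src in docs if doc in links[src])
            for doc in docs
        }
    return pr


print(page_rank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))
```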

The idea is that each document has a "vector" with a score for each term, which the ranker uses to rank the document against the query. This document score combines all previous scores. In the ranker we calculate the list of documents to be retrieved using cosine similarity: we sum the scores of each query term per document and normalize all scores by dividing by the largest document score. The query vector itself has no effect here, since all query terms carry the same weight.

Ranker

The ranker, paired with the indexed documents (the dictionary file), is the heart of the search engine. It is passed a search query, which is first tokenized in the same way as in the parser module. For each of the resulting tokens a document list is retrieved from the dictionary, along with a weight for each document. These document lists are then combined into a final result list, which is ordered by the combined weight (score) and contains, for each result, the document id and the calculated score for that document.
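
The core of the ranking step can be sketched as follows; `dictionary` stands for an in-memory view of the dictionary file (term → {doc_id: weight}), and normalizing by the largest score plays the role of the cosine normalization described above:

```python
import re
from collections import defaultdict


def rank(query, dictionary, top_k=10):
    """Score documents by summing the precomputed weights of the query terms."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())  # same tokenization as the parser
    scores = defaultdict(float)
    for term in tokens:
        for doc_id, weight in dictionary.get(term, {}).items():
            scores[doc_id] += weight

    if not scores:
        return []
    best = max(scores.values())
    ranked = sorted(((doc, score / best) for doc, score in scores.items()),
                    key=lambda item: -item[1])
    return ranked[:top_k]
```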

Server

The web frontend simply calls the ranker module, forwards it the search query entered by the user in the web form, paginates the returned ranked documents and presents them, together with a preview and details for each result.
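
A minimal Flask sketch of this glue code; the route, the tiny in-memory dictionary and the inlined ranker are only stand-ins for the real modules:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Tiny stand-ins for the real dictionary file and ranker module.
DICTIONARY = {"web": {"doc1": 3.2, "doc2": 1.1}}


def rank(query, dictionary):
    scores = {}
    for term in query.lower().split():
        for doc_id, weight in dictionary.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    return sorted(scores.items(), key=lambda kv: -kv[1])


@app.route("/search")
def search():
    """Forward the query from the web form to the ranker and return the results."""
    query = request.args.get("q", "")
    return jsonify(rank(query, DICTIONARY))


if __name__ == "__main__":
    app.run()
```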

Issues and Solutions

  • Respecting robots.txt: a list of allowed and disallowed paths is built from robots.txt, and logic in the crawler decides whether a page may be crawled.
  • Hashing for faster dictionary reading in ranker.
  • Title weighting to improve results based on title match.
  • Calculations of documents weights in index building phase for faster search queries.
  • Intermediate files of terms to ease the indexing procedure.
  • Sometimes a search target document is not ranked at the top of the results, even though the search query matches its title perfectly. This happens because other documents have higher quality and PageRank scores. To address this issue we boosted the weight of documents for terms that appear in the title, though for some queries the target document is still ranked a bit too low. This is somewhat justified, as the higher-ranked documents then have very high quality and PageRank scores.
  • Search queries with very frequent terms like "is", "web" and so on took a long time, especially when combined (e.g. "is web good the a"). To speed these queries up, we restricted the document list for each term to the 100,000 documents with the highest weights.