DP_Pavel_Kachalouski_2012.tex

% options:
% thesis=B bachelor's thesis
% thesis=M master's thesis
% czech thesis in Czech language
% english thesis in English language

\documentclass[thesis=M,english]{FITthesis}[2011/07/15]

\usepackage[utf8]{inputenc} % LaTeX source encoded as UTF-8
% \usepackage[latin2]{inputenc} % LaTeX source encoded as ISO-8859-2
% \usepackage[cp1250]{inputenc} % LaTeX source encoded as Windows-1250

\usepackage{graphicx} %graphics files inclusion
\usepackage{algorithmic} %subfigures
% \usepackage{amsmath} %advanced maths
% \usepackage{amssymb} %additional math symbols

% % list of acronyms
% \usepackage[acronym,nonumberlist,toc,numberedsection=autolabel]{glossaries}
% \iflanguage{czech}{\renewcommand*{\acronymname}{Seznam pou{\v z}it{\' y}ch zkratek}}{}
% \makeglossaries

\usepackage{listings}
\usepackage{color}

\definecolor{dkgreen}{rgb}{0,0.6,0}
\definecolor{gray}{rgb}{0.5,0.5,0.5}
\definecolor{mauve}{rgb}{0.58,0,0.82}
\definecolor{lightgray}{rgb}{0.93,0.93,0.93}

\lstset{frame=none,
  language=C,
  aboveskip=3mm,
  belowskip=3mm,
  showstringspaces=false,
  columns=flexible,
  basicstyle={\small\ttfamily},
  numbers=left,
  numberstyle=\tiny\color{gray},
  keywordstyle=\color{blue},
  commentstyle=\color{dkgreen},
  stringstyle=\color{mauve},
	backgroundcolor=\color{lightgray},
  breaklines=true,
  breakatwhitespace=true
  tabsize=3
}

% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % 
% EDIT THIS
% % % % % % % % % % % % % % % % % % % % % % % % % % % % % % 

\department{Department of Computer Systems}
\title{Network intrusion detection system accelerated with GPU.}
\author{Pavel Kachalouski} %author's name without academic degrees
\authorWithDegrees{Bc. Pavel Kachalouski} %author's name with academic degrees
\supervisor{Šimeček Ivan Ing., Ph.D.}
\acknowledgements{}
\abstractEN{The goal of this thesis is to develop and test an application for  the network intrusion detection accelerated with the GPU. A deep packet inspection is a heavy process and consumes a lot of CPU calculation resources, thereby the main application objective is to offload the heaviest operations during analysing network traffic to the GPU. The project is based on Nvidia CUDA architecture and uses standard features provided by CUDA devices with 1.1 version of compute capabilities.}
\abstractCS{Cílem této práce je vývoj a testování aplikaci pro rychlou a včasnou detekci průniků sítě s pomoci GPU. Hloubkové prověřování paketů je složíty proces který konzumuje velké množství výpočetních prostředků procesoru, a proto hlavním cílem aplikace je delegováni nejtěžších operaci v analýze síťového provozu na GPU. Projekt je založen na architektuře Nvidia CUDA a používá standardní funkce poskytované CUDA zařízení s verzí 1.1 výpočetních schopností.}
\placeForDeclarationOfAuthenticity{Prague}
\keywordsCS{Síťová analýza, Detekce průniku, GPGPU, CUDA, Paralelní výpočty.}
\keywordsEN{Network analysis, Intrusion detection, GPGPU, CUDA, Parallel computing.}

\parskip 2mm
\setcounter{tocdepth}{5}
\begin{document}

% \newacronym{CVUT}{{\v C}VUT}{{\v C}esk{\' e} vysok{\' e} u{\v c}en{\' i} technick{\' e} v Praze}
% \newacronym{FIT}{FIT}{Fakulta informa{\v c}n{\' i}ch technologi{\' i}}

\chapter{Introduction}
\label{chap:introduction}
\section{Motivation}
First major malicious internet-based attack was introduced in 1988 when one of the first Internet worms was released. Referred to as "The Morris Worm" \cite{morris_worm}, it was written by Robert Tappan Morris and caused major interruptions across large parts of the Internet. Since 1988 Internet has greatly evolved and as a result increased number of internet-based attacks and network security became a really important issue. As an attempt to overcome this problem Intrusion Detection systems were developed. Such systems allow to monitor and possibly prevent attempts to intrude into or compromise the network resources.

In short it works as follows: you have a computer system. Your computer system is attached to the network and may be even to the Internet. You are willing to allow access from the network, by authorized people, but you are not however willing to allow unauthorized access to the system by anyone else. Typically, a firewall or authentication system is used to prevent unauthorised access. However, sometimes authentication systems and firewalling can be broken. Intrusion Detection is a set of mechanisms that are helps to detect unauthorised access to the system. Also Intrusion Detection systems may take some steps to deny access for intruders. Such systems are placed between firewall and system being secured and provide an extra layer of protection to that system.
\begin{figure}[h]
\centering
\includegraphics[scale=0.5]{images/Typical_NIDS.png}
\caption{Typical NIDS scenario. Based on \cite{iatac_tools_report}}
\label{fig:typical_nids}
\end{figure}
Intrusion detection systems fall into two categories:
\begin{itemize}
\item Network based system. 
Such types of systems are placed in the network near the system that should be monitored. They monitor network traffic and examine whether it is within acceptable boundaries.
\item Host based systems. 
Such types of detection systems are run on the system being monitored and determine if activity on the system is acceptable.
\end{itemize}
There are also more recent type of intrusion detection system which reside in the kernel of operation system and monitor on low-level system activity.

Most modern network intrusion detection and prevention systems use a set of rules that are compared to incoming and outcoming packets \cite{symantec_nids_intro}. Usually, a rule consists of a filter based on packet header information, a data that should be contained in the packet payload and an associated action that is taken in case filter and payload matching conditions are met.
Matching payload signatures is a resource-intensive process, accounting for about 75\% of the total CPU processing time of modern Network Intrusion Detection systems. The reason why is it so intensive is in the fact that most of the time, every byte of every network packet should be processed as a part of data searching algorithm that searches for matches along a large set of rules that apply for particular packet. As an example, most widely used open-source NIDS Snort has rule set with about 10000 strings. Searching every packet for all of this strings needs a lot of resources. 
\begin{figure}[h]
\centering
\includegraphics[scale=0.3]{images/web-attacks.png}
\caption{Average Web-based attacks per day, by month, 2009-2010. Source: Symantec Corporation.}
\label{fig:symantec_report}
\end{figure}
With increasing number of internet-based attacks rules datasets grow every year creating additional load on NIDS systems. Figure \ref{fig:symantec_report} shows a general trend of the number of web-based attacks on web servers taken from Symantec Corporation report.

The other source of high load on NIDS is constant grow of networks speeds. In the past 6 years, computer networks were characterized by a steady increase in the transfer rate, complexity and number of networked users. In 2009 the Internet traffic growth all over the world amounted to 40-50 \%, and according to Cisco company forecast \cite{cisco_firecast} global IP traffic will increase in the next 5 years, reaching by 2015 about 1 zettabyte of data transferred during one year.

Problem with high load on NIDS can be solved by using specialized network processing devices such as FPGA, ASIC, Network Processing Unit (NPU), or the TCP Offoading Engine (TOE). Nonetheless, the solutions may not be cost effective and easy to manage because they require special hardware and configurations.

\section{Problem definition}
A quick packet processing in a NIDS over huge amount of data obtained from the network \textbf{is a must}. Perfomance of network data processing algorithms should be fast and should not interfere with the overall network perfomance (or interfere as little as possible).
The current factors related to the networks and traffic analysis systems listed below:
\begin{enumerate}
\item The number of network nodes and users are increasing. 
\item Network speed (bit rate) is gradually increasing.
\item The amount of network traffic is increasing heavily.
\item Analysis algorithms are getting more complex.
\item Databases of known threats grow resulting in high number of rules for matching.
\end{enumerate}

This thesis approach is to use heterogeneus computing, and more specifically use \textbf{graphic processing units} to perform network data analysis.

Heterogeneous computing is a way of computing some tasks using systems made up by different types of computational units. Computational unit types could be: a general-purpose processors (GPP), also known as central processor units (CPUs), which are usually the main processor of the most of the computing systems, and special-purpose processors (SPP). SPP are, for example, digital signal processors (DSP) or graphics processor units (GPU).

Graphics Processing Unit or GPU is essentially a dedicated hardware device that is responsible for translating data into a 2D image formed by pixels. GPU evolution came from the growing popularity and complexity of rendering programs like CAD (Computer Aided Design) programs and 3D video games. These software required high computing capabilities and GPUs gave such a possibility to them. GPUs have not only a highgly parallel processor structure and are able to execute many threads of the code concurrently, but also a high speed memory.

The interest to use GPUs to perform calculation tasks that are not suitable for GPUs began to grow in 2005. Developers and scientists were trying to use parallel computers and high memory bandwidth to improve complex algorithms perfomance. GPUs are used for computations in molecular dynamics, quantum chemistry, bioinformatics, computational fluid dynamics, weather and climate forecasting and more other fields.

GPGPU (general-purpose computing on graphics processing units) at the beginning used GPUs as they were computing graphics tasks. Input data were encoded into image, then by using graphics libraries some operations were perfomed with an image and result data were decoded from image to its original form. Later companies realised that GPGPU could be an advantage for their business and invested in developing tools to make easier to use their products. NVIDIA developed CUDA (an acronym for Compute Unified Device Architecture) version 1.0 library in 2006, which enabled to run CUDA code written in special manner on some of their GPUs, while AMD developed ATI Stream technology that uses OpenCL language to write code executed on GPUs.

CUDA is the computing engine in Nvidia GPUs that is accessible to software developers through variants of industry standard programming languages. Programmers use 'C for CUDA' (C with Nvidia extensions and certain restrictions) \cite{cuda_guide}, compiled through a PathScale Open64 C compiler to code algorithms for execution on the GPU.
The main objective of this thesis is to develop open-source NIDS application which will inspect network traffic using classic CPU mode and heterogeneous CPU+GPU mode and compare perfomance of the two modes. 

In additional, the software should fulfill the following requirements:
\begin{itemize}
\item Open source. The application should be developed under the terms of open source software.
\item Application should be developed in C/C++ and C for CUDA languages.
\item Easy to use. Application should be easy to use for the users. NIDS rules should have easy format.
\item Application will analyze only TCP/IP traffic due to simplifications on the packet decomposition level.
\item Logging and offline analysis. Application should be able to store traffic for the future analysis in offline mode.
\end{itemize}

\section{Project overview}

The general application workflow is shown in the figure \ref{fig:app_workflow}.
\begin{figure}[h]
\centering
\includegraphics[scale=0.4]{images/workflow.png}
\caption{Application workflow.}
\label{fig:app_workflow}
\end{figure}

The main stages of the workflow:
\begin{itemize}
\item \textbf{Packet capture}. Application captures traffic on one of the network interfaces for later analysis. Raw packet should be obtained from network for full analysis of route information on the next stage.
\item \textbf{Route information analysis}. Every packet is analyzed by means of type of the protocols used on network and transport layers and also destination and source IP address. If packet route information matches to any rule from the dataset it transported to the next stage of the workflow.
\item \textbf{Payload analysis}. Application looks for a string or regex in the packet payload.
\item \textbf{Writing result to a database}. On this stage all matched packets from the previous stage are written to a database with a message obtained from rule with which packet was matched.
\end{itemize}

\chapter{Background}
\section{Traffic analyzing: sniffers.}
Packet sniffer is a program or library which allows to intercept data packets transmitted over a certain network to which the system is connected throught a network card. Most popular known sniffing software is tcpdump and Wireshark. These programs are used for obtaining raw network packets and performing analyzing over obtained packets in order to represent them in a some visual form or dump to a file. There are also pure sniffing libraries, such as Libpcap (for UNIX like OS) and Winpcap (for Windows OS) \cite{cap_app}.

\subsection{How they work.}
Most of the network cards, support what is known as promiscuous mode or monitor mode. Normal operation of the network card is to receive packet from the network and compare the destination layer 2 (link layer) address (also known as MAC address) to the one used by the network card. If packet destination address and network card address match or packet destination address is a broadcast address, packet is transferred to the operating system. Not matched packets are dropped.

In case promiscuous or monitor mode is enabled the behavior is slightly different. In this mode network card transfers all packets received from the network to the operating system, even if their address doesn't match to the network card's one. Operating system later uses packet filtering engine to pass packets to the applications. Enabling promiscuous mode requires root privileges under Linux operating systems.

There are also differences in capturing of the network packets in a different types of networks and different network topologies. The following subsection explain some details about sniffing in the 802.3 types of the networks and 802.11 type of the networks.

\subsection{Packet capture in IEEE 802.3 networks.}
In a typical IEEE 802.3 LAN network \cite{ieee_802_3}, a star topology is used, so all the nodes in the network are connected (through their own cable) to either a hub or a switch.
\begin{figure}[h]
\centering
\includegraphics[scale=0.45]{images/IEEE802_3_LAN.png}
\caption{IEEE 802.3 networks usually use star topology.}
\label{fig:ieee802.3star_topology}
\end{figure}

Hubs are basically repeaters: signal introduced at the input of any port appears at the output of every port except the original incoming. Switches work in a different way: they identify all the hosts connected to the port. When packet is received on some port switch passes that packet only to the port where destination host is connected. Switched networks has better perfomance, because they do not multiply packets when they are passed to the output port. Switches also may filter traffic on different protocol levels (link, transport, network, etc.).

For the packet sniffer it means that only packets transferred to or from particular host running the sniffer and broadcast packets will be captured. To overcome this problem several techniques may be used\cite{gpu_framework}:

\begin{itemize}
\item Using a hub: a simple but not good solution is to replace switches by hubs. This solution is not good because network perfomance built with hubs is very reduced compared to the network built with switches. Also hubs production is practically discontinued.
\item Put the sniffer in a gateway link as a router: this technique has some advantages and disadvantages at the same time. The advantage is that such a sniffer is able to capture packets from a lot of subnetworks. The disadvantage is that the sniffer captures traffic going only through the link where it was placed. Internal traffic (between subnetworks or between nodes of the same subnetwork) is not captured, which is important in some cases, like data centers \cite{act_defense_model}. 

\begin{figure}[h]
\centering
\includegraphics[scale=0.6]{images/gateway_sniffering.png}
\caption{Sniffering traffic in the gateway links.}
\label{fig:gateway_sniffering}
\end{figure}

In case sniffer in a gateway link doesn't satisfy requirements port mirroring or distributed sniffer could be used. Figure \ref{fig:gateway_sniffering} illustrates this technique with an example.
\item Switch port mirroring: some switches have port mirroring feature or monitoring port 2. When port mirroring is enabled, all packets passed over the switch are transmitted to the mirroring port. If the network contain several switches, obtaining packets on a single host could be more complex and require advanced switch features like Cisco RSPAN \cite{cisco_site}. Also distributed sniffer in such case could be used.

\begin{figure}[h]
\centering
\includegraphics[scale=0.6]{images/mirror_sniffering.png}
\caption{Sniffering traffic using switch port mirroring.}
\label{fig:mirror_sniffering}
\end{figure}

\item Distributed sniffer: distributed sniffers gather traffic from several hosts using a specialized combination of hardware and software and send it to the unique host. Such type of the sniffer is very flexible and scalable. But there are some drawbacks in this approach. The major drawback is less perfomance comparing to the previous type of the sniffers, caused by increased network traffic due to using software architecture. Distributed sniffer architecture is shown in the figure \ref{fig:distributed_sniffering}.

\begin{figure}[h]
\centering
\includegraphics[scale=0.6]{images/distributed_sniffering.png}
\caption{Distributed sniffer.}
\label{fig:distributed_sniffering}
\end{figure}
\end{itemize}
\subsection{Packet capture in IEEE 802.11 networks.}
The main advantage of the packet sniffering in the IEEE 802.11 network compared to the IEEE 802.3 is that 802.11 networks all share the same carrier for transferring packets \cite{ieee_802_11}. So the only hardware required to capture packets is a network card in promiscuous mode. However, some considerations have to be kept in mind. IEEE 802.11 network could be made up by a number of access points which could be a problem for capturing packets. The coverage area becomes bigger therefore higher reception antenna is needed if one sniffer point is used. Also some environment conditions and sniffer physical position could have big influence on the quality of sniffering. Because of shadowing packet sent by some host could be lost at all.

Some approaches to solve these problems are\cite{gpu_framework}:
\begin{itemize}
\item Move the capture point to the wired network section. If it's possible, sometimes is better to capture packets in the wired section of the network instead of capturing them in the wireless section. Such approach is similar to the one listed before: placing sniffer in the gateway link, so the disatvantage is the same. Also the drawback of this approach is that internal wireless traffic could not be captured.

\item Distributed sniffer: the same method as listed above with the same advantages and disatvantages.
\end{itemize}

\subsection{Libpcap.}
Libpcap is a sniffer library for packet capturing in a Unix-like systems. For a Windows systems a port of Libpcap named Winpcap \cite{winpcap_site} exist. The library provides an API for using kernel resources for packet capturing in kernel architectures that are based on Berkeley Packet Filter \cite{bsd_filter}. Libpcap was originally developed by the tcpdump developers in the Network Research Group at Lawrence Berkeley Laboratory. The following features are supported by Libpcap: packet capture, saving captured packets to a file, reading files containing saved packets. A capture file saved in the format that Libpcap can be read by applications such as tcpdump, Wireshark, CA NetMaster.

\subsection{Network traffic analysis techniques.}
This section gives a brief information about network traffic monitoring and analysis process. In the network analysis process a network data is intercepted and examined in order to retrieve information from patterns in communication. Packets collected from the network are called a dataset and contain the raw network data including header information and high level user information. By analysing datasets a useful pieces of information are retrieved and perfomed some manipulations over it.

Collected datasets are often divided into a smaller parts, data subsets, to analyse later separately. It is done because of perfomance reasons: algorithms used for analysis such a data could have non-linear compute time and processing a large amount of data by such algorithms may increase time exponentially. Such technique of dividing large dataset into smaller parts called windowing. Depending on the requrements two types of windowing are used:

\begin{itemize}
\item Size based windowed datasets. Large set of the data divided into a subset of smaller subset with the same size.
\item Time windowed datasets. Data is constantly collected and in some time periods collected data is processed. In such a scheme the size of the data subset is unknown and based on the network speed.
\end{itemize}

Using technique of dividing captured data into a subsets of data may affect the results of the analysis performed on it. So windowing parameters should be taken into account during interpreting results of the analysis.

\subsection{Network data inspection techniques.}
Network data inspection is a form of packet filtering based on inspecting each packet header information, compute some calculations on them and produce results.

\textbf{Packet decoding (packet analyzing)}
Packet decoding is the simplest task that could be performed over the gathered data. During this operation all header parameters are analysed and displayed in a human readable way. Examples of such type of the applications are tcpdump and Wireshark.

The fields where packet decoding is usable includes network security (intrusion detection systems, bandwidth abuse, etc.), network management and network failures detection. Network forensics analysis also is very interested in such type of inspection.

The following list presents the different tasks which could be performed over the collected packets:
\begin{itemize}
\item Graphical representation of raw data.
\item Statistical information and pattern extraction.
\item Rule based (signature based) analysis, anomaly detection and policies.
\item Flow based analysis.
\end{itemize}

Network monitoring, network management and security, all are interested in a data presented in a graphical form. It could be a 2D or 3D scatter plots, time based graphs, histograms or different types of diagrams.

A major position in the network analysis has pattern retrival task and calculation of the statistical parameters. Different time distributions, statistical moments of the first and second order, mean values and probability distributions could be calculated over the gathered network data. Also statistics can be used for other kind of the purpose, like marketing. Gathering information about average number of connections, average inbound and outboud speeds, time distributions of connections to servers could give a good picture for some product on the market.

Rule based or signature based analysis is a type of traffic inspection where every packet is matched agains a certain rule or signature \cite{rule_based_analysis}. Rules or signatures define specific fields in the packet header or combine several field values in a certain packet. Values may also be defined as an intervals or thresholds. Threshold rules are used all over security, for example, to identify DoS or bandwith abuse attacks. In this case rules may be treated as policies, which allow or disallow certain amount of traffic in the network.
The synonym for the rule based analysis is a signature pattern matching and both two terms are used in a literature about network analysis. But the signature pattern matching is also used for the statistical part of the packets inspection. In this thesis rule based analysis is used to name the packet analysis for matching against rules as it is done in network intrusion detection systems and described later. 

\section{Intrusion detection.}
Intrusion detection is a process of detection undesired traffic on a network or device \cite{nids_evolution}. An Intrusion Detection System is a combination of a physical device with a software which monitors network traffic in order to detect undesired activity and traffic that violates security policies. Different IDS tools are able also to store event information into log for later review or combine events with other information to make decisions about policies and losses. More functionalities provide IPS that can prevent or stop undesired traffic.

\textbf{Network-based Intrusion Detection (NID)}

A Network Intrusion Detection System (NIDS) is a type of IDS that monitors network traffic and analyzes packets deep on all levels of OSI model. Based on result of analyzis it makes decisions about the traffic type and suspicious activity. However, historically NIDS has been unable to operate in the following environment:
\begin{itemize}
\item Switched networks.
\item Encrypted networks.
\end{itemize}

\textbf{Host-based Intrusion Detection (HID)}

Host-based intrusion detection systems monitor and detect system and user activities on a given host. More complicated tools also offer policy management, data forensics statistical analysis and some access control. The difference between network-based intrusion detection and host-based intrusion detection is that NID deals with a traffic transmitted from one host to another while HID controls what is happening on the hosts themselfs.

Host-based intrusion detection is better fits to fight with internal threats, because it monitor and control specific user actions and file access. A lot of computer threats comes from the inside of organization, for example disgruntled employees or corporate spies.

\subsection*{Detection types}

\subsection*{Signature-Based Detection}

Intrusion Detection may use signature-based detection to analyze traffic. Using such type of detection means that system is relying on known traffic data to analyze running traffic. An attacker can modify a little an attack to make it undetectable for such type of IDS. But regular expression-based rules give more power against such slightly changed attacks in the same time remaining very accurate.

\subsection*{Anomaly-Based Detection}

This type of IDS looks at network traffic and detects data that is incorrect, not valid, or generally abnormal. This method is useful for detecting undesired traffic that is not specifically known. For instance, an anomaly-based IDS will detect that an Internet protocol (IP) packet is malformed. It does not detect that it is malformed in a specific way, but indicates that it is anomalous.

\subsection*{Stateful Protocol Inspection}

This type of detection is similar to anomaly-based detection, but it additionally inspects traffic on the transport layer and vendor-specific application layer, while anomaly-based detection is not doing it.

\section{Intrusion Detection Systems: A Brief History.}
Beginning in 1980, with James Anderson's paper, Computer Security Threat Monitoring and Surveillance, the notion of intrusion detection was born. Since then, several pivotal events in IDS technology have advanced intrusion detection to its current state \cite{nids_evolution}.

James Anderson's seminal paper, written for a government organization, introduced the notion that audit trails contained vital information that could be valuable in tracking misuse and understanding user behavior. With the release of this paper, the concept of "detecting" misuse and specific user events emerged. His insight into audit data and its importance led to tremendous improvements in the auditing subsystems of virtually every operating system. Anderson's conjecture also provided the foundation for future intrusion detection system design and development. His work was the start of host-based intrusion detection and IDS in general.

In 1983, SRI International, and specifically Dr. Dorothy Denning, began working on a government project that launched a new effort into intrusion detection development. Their goal was to analyze audit trails from government mainframe computers and create profiles of users based upon their activities. One year later, Dr. Denning helped to develop the first model for intrusion detection, the Intrusion Detection Expert System (IDES), which provided the foundation for the IDS technology development that was soon to follow.

In 1984, SRI also developed a means of tracking and analyzing audit data containing authentication information of users on ARPANET, the original Internet. Soon after, SRI completed a Navy SPAWAR contract with the realization of the first functional intrusion detection system, IDES. Using her research and development work at SRI, Dr. Denning published the decisive work, An Intrusion Detection Model, which revealed the necessary information for commercial intrusion detection system development. Her paper is the basis for most of the work in IDS that followed.

Meanwhile, there were other significant advances occurring at University of California Davis' Lawrence Livermore Laboratories. In 1988, the Haystack project at Lawrence Livermore Labs released another version of intrusion detection for the US Air Force. This project produced an IDS that analyzed audit data by comparing it with defined patterns. In a telephone interview with the author, Crosby Marks, a former Haystack Project team member and Haystack Labs employee said that, "searching through this large amount of data for one specific misuse was equivalent to looking for a needle in a haystack."

The subsequent iteration of this tool was called the Distributed Intrusion Detection System (DIDS). DIDS augmented the existing solution by tracking client machines as well as the servers it originally monitored. Finally in 1989, the developers from the Haystack project formed the commercial company, Haystack Labs, and released the last generation of the technology, Stalker. Crosby Marks says that "Stalker was a host-based, pattern matching system that included robust search capabilities to manually and automatically query the audit data." The Haystack advances, coupled with the work of SRI and Denning, greatly advanced the development of host-based intrusion detection technologies.

To kick off the '90s, UC Davis's Todd Heberlein introduced the idea of network intrusion detection. In 1990, Heberlein was the primary author and developer of Network Security Monitor (NSM), the first network intrusion detection system ( see Heberlein, L. et al. "A Network Security Monitor." Proceedings of the IEEE Computer Society Symposium, Research in Security and Privacy, May 1990, pp. 296-303.) NSM was deployed at major government installations where network traffic analysis provided massive amounts of information. This new awareness generated more interest in the field of intrusion detection and investments in that market increased significantly. Heberlein's contributions also extended to the DIDS project where, along with the Haystack team, he introduced the first idea of hybrid intrusion detection. The work of the Haystack project and the introduction of the Network Security Monitor revolutionized the IDS field and brought it into the commercial world.

Commercial development of intrusion detection technologies began in the early 1990s. Haystack Labs was the first commercial vendor of IDS tools, with its Stalker line of host-based products. SAIC was also developing a form of host-based intrusion detection, called Computer Misuse Detection System (CMDS). Simultaneously, the Air Force's Cryptologic Support Center developed the Automated Security Measurement System (ASIM) to monitor network traffic on the US Air Force's network. ASIM made considerable progress in overcoming scalability and portability issues that previously plagued NID products. Additionally, ASIM was the first solution to incorporate both a hardware and software solution to network intrusion detection. ASIM is still currently in use and managed by the Air Force's Computer Emergency Response Team (AFCERT) at locations all over the world. As often happened, the development group on the ASIM project formed a commercial company in 1994, the Wheel Group. Their product, NetRanger, was the first commercially viable network intrusion detection device. Nonetheless, commercial intrusion detection systems developed slowly during these years and only truly blossomed towards the latter half of the decade.
The intrusion detection market began to gain in popularity and truly generate revenues around 1997. In that year, the security market leader, ISS, developed a network intrusion detection system called RealSecure. A year later, Cisco recognized the importance of network intrusion detection and purchased the Wheel Group, attaining a security solution they could provide to their customers. Similarly, the first visible host-based intrusion detection company, Centrax Corporation, emerged as a result of a merger of the development staff from Haystack Labs and the departure of the CMDS team from SAIC. From there, the commercial IDS world expanded its market-base and a roller coaster ride of start-up companies, mergers, and acquisitions ensued.

Currently, market statistics show that IDS is amidst the top selling security vendor technologies and should continue to rise. Furthermore, government initiatives, such as the Federal Intrusion Detection Network, (FIDNet) created under Presidential Decision Directive 63, are also adding impetus to the evolution of IDS. Advancements in IDS will ultimately push security technology into a whole new arena of automated security intelligence.

\section{GPUs}
GPU is a graphical processor unit sometimes also called visual processing unit or VPU, is a specialized type of processor that is responsible to render 3D graphics and offload this procedure from the micriprocessor or CPU. GPU history starts in 70s, when CTIA and ANTIC chips were developed. These chips allowed mixed graphics and text mode on Atari computers be hardware controlled. The ANTIC chip was a special purpose processor, that was responsible for mapping graphics and text data to the video output.

In the 1984 appeared one of the first accelerators for the 2D and 3D graphics for the IBM PC architecture compatible systems: IBM's Professional Graphics Controller. Chip didn't succeed because of high cost and poor compatibility with already existing software. Commondore Amiga was the first mass-market computer which included a dedicated graphics processor launched in 1985. Amiga's graphic processor was the first full graphics accelerator that offloaded practically all video operations from the CPU.

The first graphics system that implemented 2D primirives in hardware was the IBM's 8514 PC video card. In 1991, S3 manufacturer introduced the S3 86C911 to the market, which claimed to be the first single-chip graphics card to implement 2D acceleration functions in hardware. Other manufacturers followed the 86C911 model, and in 1995 all graphics card vendors has added 2D hardware acceleration support to their chips. In the first half of 1990s real-time 3D graphics became very significant, because of developing CAD (Computer Aided Design) systems and especially due to video games. Video games became more and more popular and as a result of growing demand on 3D hardware acceleration graphics processor vendors started to develop 2D and 3D graphics accelerators. Launching in 1996 the Verite V1000 chip by Rendition a new level of 3D acceleration was reached. This chip was the first consumer 3D accelerator cards truly capable of delivering playable performance with significantly improved graphics quality.

In the second half of 1990s only several manufacturers appeared to compete over the GPU market and by the end of 1990s, manufacturers leaders were NVIDIA, ATI and 3dfx. In 1999 NVIDIA launched Geforce 256 which was the first card on the market supported hardware transormations and lighting capabilities. 

In the early 2000s, the OpenGL API appeared. It is a multiplatform and multilanguage API that was created in 1992 by Silicon Graphics Inc. to help programmers draw 3D images. Due to the new API and new hardware architectures that allowed to process each pixel by a short program which is able to include additional textures as inputs and geometric vertex be processed in a similar way, 3D software got a huge capability improvement. NVIDIA's Geforce 3 was the first device that supported vertex shaders programming. In 2000 NVIDIA bought 3dfx and from that point to present, the market of high perfomance GPU is dominated by NVIDIA on one side, with an estimated market share of 59.4\% in June 2011 according to \cite{gpu_marketshare}, and ATI, with an estimated 40.6\% of market share.

The latest chips of NVIDIA are the GF119 and GF110 chip family (Geforce 500 and 600 series) generation. Recently NVIDIA has published a new architecture for the CUDA enabled chips with the code name Kepler, which has 1536 CUDA cores in the entire GPU. For his part ATI has developed Radeon 7000 family, with the Tahiti graphic chipsets.

\subsection{GPGPU: general-purpose computing on graphics processing units.}
GPGPU stands for General-Purpose Computing on Graphics Processing Units. Since 2003, several researchers like Harris, Mark J., William V. among others \cite{gpgpu}, outlined that current architecture of high performance GPUs in terms of FLOPS (FLoating-point Operations Per Second), with programmable fragment and vertex shaders that enabled the programmers to create more realistic and complex graphics, could be used for other purposes rather than graphic calculations.

The major motivation for the progress of GPGPU was improving perfomance of the algorithms which easily scales for a parallel computations and overcome limitations of traditionally CPU based computing: the instruction-level parallelism wall and the memory wall. On the one side GPUs have a small number of instructions to be performed on the data, however they are able to execute many of this instructions simultaneously over the large amount of data. This became available because of the programmable shaders that were added to the GPU processor's pipelines. GPUs can compute a lot of vertices or fragments of graphics in the same way. Such a computation called stream. The streams is an array of elements required the same calculation performed on it, and kernels are the functions applied to all elements in the stream.

\begin{figure}[h]
\centering
\includegraphics[scale=0.45]{images/cpuvsgpu.png}
\caption{GPU vs. CPU processor FLOPS performance comparison \cite{cuda_guide}.}
\label{fig:cpuvsgpu}
\end{figure}

Also, the usage of graphical processor units have another important advantage over traditionally computing CPU based model: the memory bandwidth. In the last decade the gap between CPU and memory speed have kept growing, and thus memory latency has become a major bottleneck in CPU computing, specially in applications with an intensive usage of memory. The evolution of theoretical single precision floating point operations (FLOPS) for both Intel based CPU processors and NVIDIA based GPU processors is shown in the figure \ref{fig:cpuvsgpu}. 

First attempts of using GPUs with other purposes rather than graphics, required to transform or convert complex algorithms and data into a graphics, to be able to use the GPU through graphics libraries (like OpenGL) to solve them and later revert the transformation. 

NVIDIA, conscious that GPGPU could be an important boost for the GPU market and also knowing that the current approaches for using GPUs for general purpose programming required a high level of knowledge and was a tedious job, started developing an SDK with the purpose of simplifying the task of GPGPU programming. The result of this development was CUDA (Compute Unified Device Architecture), that was launched in November 2006. 

CUDA is a parallel computing architecture that enables programmers to use both CPU and GPU processors to cooperate in a single program, using a computing paradigm known as heterogeneous computing. Software developers are able to program general purpose functions or routines to be run on the GPU by simply use “C for CUDA” (C with NVIDIA extensions) while the rest of the program is still executed in the CPU. 

CUDA has become widely used in many areas such as physics simulations, scientific and medical simulations, signal processing, cryptography or audio and video processing among others. ATI also launched his own GPGPU SDK called Stream SDK, but at the time Stream SDK has not been as successful as CUDA.

\subsection{CUDA architecture and programming model for GPGPU.}

CUDA is the parallel computing architecture developed by NVIDIA which allows developers to use 'C for CUDA' language to code parts or functions of a general purpose program. The main three key abstractions that are available to the programmer as the C extensions are: a hierarchy of thread groups, shared memories and thread barrier synchronization. CUDA programmers divide the complex task that is going to be accelerated using GPU into a smaller several tasks that can be solved independently in parallel. Functions executed by GPUs are called kernels. The rest of the code remains the same and is executed on the CPU.

\begin{figure}[h]
\centering
\includegraphics[scale=0.4]{images/cuda_threads.png}
\caption{CUDA thread hierarchy\cite{cuda_guide}.}
\label{fig:cuda_threads}
\end{figure}

Kernels are functions with \_global\_ attribute. When they are called, kernels throw a total number of N threads. To achieve a good performance in general, thousand of kernels should be executed simultaneously.

Figure \ref{fig:cuda_threads} shows kernel’s thread organization of model. All CUDA threads are organized into a 2D array of blocks called grid. Each of this block contain a logical 3D array of threads. However at this moment hardware support only 2D array. The dimensions of grid and blocks should be choosed before running a kernel, because they cannot be cnahged during execution of the CUDA code. If the programmer doesn't want to use multidimensional array of threads or blocks one dimension could be defined as 0. This way threads will be organized in a simple array of 1 dimension.

During kernel execution code could read current thread ID and current block ID. It is done by using built-in variables blockIdx.x, blockIdx.y for blocks and threadIdx.x, threadIdx.y and threadIdx.z for threads. Dimensions of the grid and block can also be retrieved using built-in variables like blockDim and gridDim. The combination of the blockIdx and threadIdx identify each thread, are used to perform coalesced memory access and execute code for a different input data based in the thread and block IDs. The current restriction for the maximum threads is 65536 with limit over the one dimension of 512 threads/blocks.

\lstset{language=C++, label=lst:cuda_kernel, captionpos=b, caption={Simplified CUDA kernel and associated main() function.}}
\begin{lstlisting}
__global__ void vecAdd (float *A , float *B , float *C)
{
	int i = threadIdx.x;
	C[i] = A[i] + B[i];
}

int main (int argc, char *argv[])
{
	vecAdd<<<NUM_BLOCKS, NUM_THREADS>>>(A, B, C);
	return 0;
}

\end{lstlisting}

The code contained in the listing \ref{lst:cuda_kernel} shows a simplified example of a kernel call, executing vecAdd kernel with a 1D grid organization and 1D thread block organization, throwing NUM\_BLOCKS blocks, with NUM\_THREADS threads per block.

\setsecnumdepth{all}
\chapter{Design}
\label{chap:design}

\section{Developing tools and methodology.}
The tools and programming languages selected for implementing application are:

Languages:

\begin{itemize}
\item \textbf{C++}\cite{bjarn_str}: due to perfomance requirements and supporting of object oriented features (classes, inheritance), C++ is used for developing the whole application, except of the CUDA device code part, where C dialect for CUDA is used.
\item \textbf{CUDA}\cite{cuda_zone}: CUDA uses C language with additional features to perform general purpose calculations in the GPUs.
\end{itemize}

Libraries:

\begin{itemize}
\item \textbf{Libpcap}\cite{cap_app}: is packet-capture library for retrieving packets from the Packet Filter.
\item \textbf{QT}\cite{qt_project}: is a cross-platform application framework for developing application with a rich graphical interface. It is used for building GUI part of the application.
\item \textbf{Boost}\cite{boost}: is a set of libraries that extend the functionality of the C++ programming language. Classes for processes synchronization are used (semaphores and mutexes) and also a Boost.Spirit parser to parse rules dataset.
\item \textbf{Sqlite}\cite{sqlite}: is a software library that implements a SQL database engine and is used to store results.
\end{itemize}

The methodology used has been the spiral model.
\begin{figure}[h]
\centering
\includegraphics[scale=0.7]{images/spiral-model.png}
\caption{Spiral methodology used in the developing process of the application.}
\end{figure}

In this methodology, the process of determining requirements, analyzing possible risks, implementing, testing and planning (results evaluation), are performed during the whole project. The number of rotating through the spiral depend on the found implementation issues, the precision of objective definition in the early stages of the development and the requirements accomplishment of the current implementation.

\section{NIDS rules.}
NIDS rules dataset has a simple text format. Each line of the text file contain rule in the following format:


\begin{figure}[h]
\centering
\begin{verbatim}
action protocol src_ip src_port direction dst_ip dst_port {options}
\end{verbatim}
\caption{Rule format.}
\label{fig:rule_format}
\end{figure}

\textbf{Action} could be an \textbf{alert} or \textbf{log}, where 'alert' means that if case rule is matched against some packet it should be alerted to user, for example highlighted with a red color in a GUI or appeared as a MessageBox dialog. If rule starts with a 'log' action it means that packet matched against rule will be logged into a database without any specific actions for gaining user attention.

\textbf{Protocol} is a packet's protocol type. Because of simplifications only 'tcp' is supported by developing application. Also 'tcp' will take into consideration only IP packets from the network layer of the OSI model.

\textbf{Src\_ip} is a source IP address of the packet. IP address is written in a standard form '192.168.0.1' where each position could be a range, for example '192.168.0-20.1'. Also instead of any position in the IP address '*' symbol can be used to accept all values on that position. For example '192.168.*.*'.

\textbf{Src\_port} is a source TCP port. It is a value from 0 to 65535, or range, for example 5-1234, or '*' symbol which means any port.

\textbf{Direction} is a string "\textendash\textgreater". At this moment only this direction string is supported.

\textbf{Dst\_ip} is a destination IP address written in the same format as source IP address.

\textbf{Dst\_port} is a destination TCP port written in the same format as a source TCP port.

\textbf{Options} are parameters that specifies payload analysis. Within options should be a \textbf{content} or \textbf{pcre} parameters. Content specifies a string which is searched within a packet payload. Also content string could contain raw bytes which should be written within '\textbar' symbols. Example of the content option with a string and raw bytes: \textbf{content:"stringtosearch\textbar00 01 02\textbar"}.\newline Pcre parameter allows to specify the searched string as a regular expression. Example of pcre option: \textbf{pcre:"\textasciicircum(some)*regex"}. The reason of dividing searched string on content and pcre is that all content strings are compiled into one DFA while each pcre is compiled in its own DFA and could consume much more memory. Therefore, if it's possible, it is preffered to use more content parameters against pcre parameters, because content parameters are using fast searched Aho-Corasick algorithm to search all contents simultaneously, while every pcre expression is searched individually for each packet. Options may contain parameters \textbf{offset} and \textbf{length} which means position in a payload from which searching should be started and number of bytes from the start position to observe. And finally \textbf{msg} parameter can appear within options. It is a simple string that should be logged or alerted to user when some packet matches against the rule.

\section{Application design overview.}
The main objective of this thesis is to design and implement simple NIDS application capable to analyze network traffic using GPUs using CUDA. 
\subsection*{Pattern matching.}
Pattern matching is the most heavy operation that affects the NIDS perfomance. Pattern matching algorithms can be classified into single-pattern and multi-pattern algorithms.

Single-pattern algorithms work in a way that each pattern is searched in a given text individually. Therefore for k searched patterns algorithm should be repeated k times. Some of the widely used algorithms are Knuth-Morris-Pratt\cite{kmp} and Boyer-Moore\cite{boyermoore}. Knuth-Morris-Pratt is able to skip characters when a mismatch occurs in the comparison phase using a partial match table for each pattern. Each table is built by preprocessing every pattern separately. Boyer-Moore is the most widely used single-pattern algorithm. Its execution time can be sublinear if the suffix of the string to be searched for appears infrequently in the input stream, due to the skipping heuristics that it uses.

Multi-pattern matching algorithms search a set of patterns in a text simultaneously. It is done by preprocessing all patterns that could be searched into an automation which is used during searching phase. The automation may be represented as a trie, a table or combination of the two. Such algorithms searches each byte from the text only once. Also such algorithms scales much better that single-pattern. Example of multi-pattern algorithms: Aho-Corasick\cite{aho_corasick}, Wu-Manber and Commentz-Walter\cite{commwalter}.

Previous section introduced NIDS rule which could contain \textbf{content} or \textbf{pcre} patterns that should be searched within payload. For each regular expression a single DFA will be built and packet payload will be matched against certain DFA. To search content strings it was decided to use Aho-Corasick algorithm, this way all content string are searched simultaneously within the packet payload.

The Aho-Corasick algorithm consists of two parts. In the first part we construct from the set of patterns a finiste state machine; in the second part we apply the packet payload as input to the pattern matching machine. The machine signals whenever it has found a match for a keyword. The pattern matching machine consists of a set of states. Each state is represented by a number. The machine processes the text by successively reading symbols from input string, making state transitions and occasionally emitting output. The behaviour of the pattern matching machine is dictated by three functions: a goto function \emph{g}, a failure function \emph{f} and an output function \emph{output}. Figures \ref{fig:goto_function}, \ref{fig:failure_function} and \ref{fig:output_function} shows the functions used by a pattern matching machine for a set of keywords (he, she, his, hers).

\begin{figure}[h]
\centering
\includegraphics[scale=0.5]{images/goto_function.png}
\caption{Aho-Corasick goto function.}
\label{fig:goto_function}
\end{figure}

\begin{figure}[h]
\centering
\begin{tabular}{l l l l l l l l l l}
\emph{i}       & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 \\
\emph{f(i)}    & 0 & 0 & 0 & 1 & 2 & 0 & 3 & 0 & 3 \\
\end{tabular}
\caption{Aho-Corasick failure function.}
\label{fig:failure_function}
\end{figure}

\begin{figure}[h]
\centering
\begin{tabular}{l l}
\emph{i} & \emph{output(i)} \\
2 & {he} \\
5 & {she, he} \\
7 & {his} \\
9 & {hers} \\
\end{tabular}
\caption{Aho-Corasick output function.}
\label{fig:output_function}
\end{figure}

One state (usually 0) is designated as a start state. In figure \ref{fig:goto_function} the states are 0, 1,...,9. The goto function \emph{g} maps a pair consisting of state and an input symbol into a state or the message fail. The directed graph in figure \ref{fig:goto_function} represents the goto function. The absence of an arrow indicates \emph{fail}. Thus, g(1, \emph{s}) = \emph{fail} for all input symbols \emph{s} that are not 'e' or 'i'.

The failure function \emph{f} maps a state into a state. The failure function is consulted whenever the goto function reports \emph{fail}. Certain states are designated as output states which indicate that a set of keywords has been found.

\begin{figure}[h]
\textbf{Input.} A text string $x = a_1 a_2 ... a_n$ where each $a_i$ is an input symbol and a pattern matching machine M with goto function \emph{g}, failure function \emph{f}, and output function \emph{output}.\\
\textbf{Output.} Locations at which keywords occur in x.\\
\textbf{Method.}
\begin{algorithmic}
\FOR{$i=1$ to $n$}
\WHILE{$g(state, a_i) = fail$}
\STATE $state \leftarrow f(state)$
\ENDWHILE
\STATE $state \leftarrow g(state, a_i)$
\IF{$output(state) \neq empty$}
\PRINT $i$
\PRINT $output(state)$
\ENDIF
\ENDFOR
\end{algorithmic}
\caption{Aho-Corasick pattern matching machine algorithm.}
\label{fig:ac_search_pseudocode}
\end{figure}

\begin{figure}[h]
\textbf{Input.} Set of keywords K={$y_1$, $y_2$,..., $y_k$} \\
\textbf{Output.} Goto function \emph{g} and a partially computed output function \emph{output}.\\
\textbf{Method.} It is assumed that \emph{output(s)} is when empty state s is first created, and \emph{g{s, a} = fail} if \emph{a} is undefined or if \emph{g(s, a)} has not yet been defined. The procedure \emph{enter(y)} inserts into goto graph a path that spells out \emph{y}.
\begin{algorithmic}
\STATE $newstate \leftarrow$0
\FOR{$i=1$ to $k$}
\STATE $state \leftarrow$0
\STATE $j \leftarrow$1
\WHILE{$g(state, a_j) \neq fail$}
\STATE $state \leftarrow g(state, a_j)$
\STATE $j \leftarrow j+1$
\ENDWHILE
\FOR{$p \leftarrow j$ to $m$}
\STATE $newstate \leftarrow newstate+1$
\STATE $g(state, a_p) \leftarrow newstate$
\STATE $state \leftarrow newstate$
\ENDFOR
\STATE $output(state) \leftarrow {a_1a_2 ... a_m}$
\ENDFOR
\FORALL{$a$ such that $g(0, a)=fail$}
\STATE $g(0, a) \leftarrow$0
\ENDFOR
\end{algorithmic}
\caption{Aho-Corasick goto function construction algorithm.}
\label{fig:ac_goto_pseudocode}
\end{figure}

\begin{figure}[h]
\textbf{Input.} Goto function \emph{g} and failure function \emph{f}. \\
\textbf{Output.} Next move function \emph{d}.\\
\begin{algorithmic}
\STATE $queue \leftarrow empty$
\FOR{each symbol $a$}
\STATE $d(0, a) \leftarrow g(0, a)$
\IF{$g(0, a) \neq 0$}
\STATE $queue \leftarrow queue \cup g(0, a)$
\ENDIF
\ENDFOR
\WHILE{$queue \neq empty$}
\STATE let $r$ be the next state in $queue$
\STATE $queue \leftarrow queue - {r}$
\FOR{each symbol $a$}
\IF{$g(r, a) = s \neq fail$}
\STATE $queue \leftarrow queue \cup {s}$
\STATE $d{r, a} \leftarrow s$
\ELSE
\STATE $d(r, a) \leftarrow d(f(r), a)$
\ENDIF
\ENDFOR
\ENDWHILE
\end{algorithmic}
\caption{Aho-Corasick DFA construction algorithm.}
\label{fig:ac_dfa_pseudocode}
\end{figure}

An operating cycle of a pattern matching machine is defined as follows. Let s be the current state of the machine and a the current symbol of the input string x.
\begin{enumerate}
\item If \emph{g(s, a) = s'}, the machine makes a goto transition. It enters state s' and the next symbol of x becomes the current input symbol.
In addition, if  \emph{output(s')  empty}, then the machine emits the set \emph{output(s')} along with the position of the current input symbol. The operating cycle is now complete.
\item If \emph{g(s, a) = fail}, the machine consults the failure function \emph{f} and is said to make a \emph{failure} transition. If \emph{f(s)=s'} the machine repeats the cycle with s' as the current state and a as the current input symbol.
\end{enumerate}
Initially, the current state of the machine is the start and the first symbol of the text string is the current input symbol. 

Figure \ref{fig:ac_search_pseudocode} shows Aho-Corasick matching machine algorithm pseudocode. Pseudocode for \textbf{Goto} function construction is shown in the figure \ref{fig:ac_goto_pseudocode}.

To be able to use Aho-Corasick algorithm with a GPU a DFA should be constructed from the previously constructed \textbf{Goto}, \textbf{Fail} and \textbf{Output} functions. The algorithm for constructing DFA is shown in a figure \ref{fig:ac_dfa_pseudocode}. In practice, algorithm shown in figure \ref{fig:ac_dfa_pseudocode} should be evaluated in conjunction with algorithm for constructing failure function. The resulting DFA is a two dimensional array. The dimensions of the array are equal to the number of states and the size of the alphabet (256 in our case) plus one position to store flag if state is \emph{output}.

The algorithm for searching through DFA is similar to the regular expression's one and described in the next section.

\subsection*{Regular expression pattern matching.}
Regular expressions are compiled into a DFA by using external library. As a result a two dimensional array is obtained. Array structure is similar to Aho-Corasick DFA. The dimensions of the array are equal to the number of states and the size of the alphabet (256 in our case) plus one position to store index of the rule if current state signs that regular expression is matched. Therefore the final algorithm for searching pattern within packet payload is shown in figure \ref{fig:dfa_search}.

\begin{figure}[h]
\textbf{Input.} A text string $x = a_1 a_2 ... a_n$ where each $a_i$ is an input symbol and a next move function \emph{d} \\
\textbf{Output.} Locations at which keywords occur in x.\\
\begin{algorithmic}
\STATE $curstate \leftarrow$0
\FOR{$i=1$ to $n$}
\STATE $state \leftarrow d(state, a_i)$
\IF{$d(state, 256) \neq $0}
\PRINT $i$
\PRINT $output(i)$
\ENDIF
\ENDFOR
\end{algorithmic}
\caption{Pattern matching DFA search.}
\label{fig:dfa_search}
\end{figure}

DFA search algorithm just traverse over DFA starting from zero state and for every input symbol it calculates the new state. Current state corresponds to the row in the DFA two-dimensional array. And current input symbol corresponds to the offset in the current row. This way new state is read from the DFA. For every state it is also checked if it has output values. Output value means that match to some string occured.

\subsection*{Packet filtering based on route information.}
Information obtained from packet header is used to select packet for analysing its payload or drop it without any analysis. Information from loaded rules should match the packet's header information to continue processing of received packet. It was decided to store route information from the loaded rule as a sequence of bytes in a trie structure. Figure \ref{fig:trie_example} shows example of a trie structure with next IP addresses stored: 192.168.0.1, 192.168.1.38, 192.168.5-100.1 and 232.222.1.79. 

\begin{figure}[h]
\centering
\includegraphics[scale=0.5]{images/trie.png}
\caption{Example of trie structure.}
\label{fig:trie_example}
\end{figure}

The full sequence stored in trie is shown in a figure \ref{fig:route_sequence}. Therefore to compare any received packet to some rule, route information from packet's header should be represented as a sequence of bytes shown in a figure \ref{fig:route_sequence}, and search procedure on a trie for the sequence of bytes is perfomed. This approach is similar to Aho-Corasick algorithm of searching with data saved in trie. The only difference is that trie may contain IP range (trie node may be not just a simple node with value, but contain range of values). If searching in trie found any rule for this packet, this packet is put in a queue for a payload analysis. 

\begin{figure}[h]
\centering
\includegraphics[scale=0.6]{images/route_sequence.png}
\caption{Full route sequence extracted from the packet header.}
\label{fig:route_sequence}
\end{figure}

\subsection*{Transfering packets to the GPU.}
The simplest approach is to transfer each packet to the GPU separately. However, there is an overhead associated with a data transfer operation to the GPU and many small transfers to the GPU result into a perfomance decreasing. Therefore the following approach is designed to achieve better perfomance for packets transfering. Packets obtained by CPU are collected in a buffer for some time. When CPU decides that it's time to process packets it transfers a batch of packets to the GPU for processing. 

CPU program decision on transferring packets to the GPU could be based on two policies:
\begin{itemize}
\item Decision based on the time. Buffer with packets is transfered to the GPU on timer event.
\item Decision based on buffer size. Buffer with packets is transfered to the GPU when size of the buffer exceeds some size.
\end{itemize}

\subsection*{Database structure.}

\begin{figure}[h]
\centering
\includegraphics[scale=0.55]{images/database.png}
\caption{Database structure.}
\label{fig:database_structure}
\end{figure}

Database with results is very simple and contain only one table which have all information about matched packet inside. Figure \ref{fig:database_structure} shows table fields that are filled for a matched packets.
\begin{itemize}
\item \textbf{Time} field contains timestamp when NIDS detected suspicious packet.
\item \textbf{Action} field contains action from the matched rule.
\item \textbf{Message} field contains message from the matched rule.
\item \textbf{Rule} field contains the matched rule itself.
\end{itemize}
In the current application version it was decided not to save payload from the matched packet because of perfomance reasons.

\subsection*{Application components.}
Application design is divided into a several components. The diagram contained in a figure \ref{fig:app_architecture} shows the main application components which should be implemented and used.

\begin{figure}[h]
\centering
\includegraphics[scale=0.5]{images/app_architecture.png}
\caption{The main application components.}
\label{fig:app_architecture}
\end{figure}

\begin{itemize}
\item \textbf{Rules dataset parser}. This component is used to parse input file with Nids rules written in a text format. It is better for parser to be easy extendable for easy adding new options to the rule.
\item \textbf{Database}. This component allows in an easy way adding results of analysing packets into a database.
\item \textbf{GUI}. This component implements user interactions with a program. Application has simple window-based interface which allows to perform all major manipulations with the application like starting capturing traffic, stop capturing, switching analyzing mode, saving traffic into the file, etc.
\item \textbf{Regular expressions implementation}. This module is used to compile regular expression obtained from rules. Regular expressions are compiled to DFA for loading into the CUDA device later.
\item \textbf{Aho-Corasick implementation}. This module is used to create DFA for all simple content strings that are read from rules. Application uses Aho-Corasick algorithm for fast string searching within network packets.
\item \textbf{Network packets capture}. Application itself doesn't capture packet from the kernel, because it is not a trivial task. Instead, program uses external library for packet capturing.
\item \textbf{CUDA module}. Code that runs on a GPU and performs analysis of packets against rules.
\item \textbf{NIDS}. The main component of the system. Uses all others modules to read rules, start capturing and analysing of packets. Stores results of analyzing into a database and updates counters which are used within NIDS to do perfomance calculations.
\end{itemize}

\chapter{Implementation and evaluation.}
\section{General considerations.}
Application implementation has been developed using the following version of the programming tools and libraries:
\begin{itemize}
\item GCC 4.7.0.
\item Cuda compilation tools, version 4.2
\item Libpcap 1.2.1
\item Boost 1.49.0
\item QT 4.8.0
\item SQLite 3.7.11
\item RE2 (without version, source codes snapshot revision '2d252384c5e8' was used)
\end{itemize}

\lstset{language=Python, caption={Part of QT project file with extra compiler configuration.}, label=lst:extra_compiler_config, captionpos=b}
\begin{lstlisting}
# CUDA settings (for Arch Linux)
CUDA_SOURCES += my_nids_cuda.cu
# Path to cuda toolkit install
CUDA_DIR = /opt/cuda-toolkit
# GPU architecture
CUDA_ARCH = sm_11
# nvcc flags (ptxas option verbose is always useful)
NVCCFLAGS = --compiler-options -fno-strict-aliasing -use_fast_math --ptxas-options=-v
# include path
INCLUDEPATH += $$CUDA_DIR/include
# lib dir
QMAKE_LIBDIR += $$CUDA_DIR/lib64
# libs
LIBS += -lcudart
# join the includes in a line
CUDA_INC = $$join(INCLUDEPATH,' -I', '-I', ' ')

# Prepare the extra compiler configuration (taken from the nvidia forum)
cuda.input = CUDA_SOURCES
cuda.output = ${OBJECTS_DIR}${QMAKE_FILE_BASE}.o
cuda.commands = $$CUDA_DIR/bin/nvcc -m64 -g -G -arch=$$CUDA_ARCH -c $$NVCCFLAGS $$CUDA_INC $$LIBS ${QMAKE_FILE_NAME} -o ${QMAKE_FILE_OUT}

cuda.dependcy_type = TYPE_C
cuda.depend_command = $$CUDA_DIR/bin/nvcc -g -G -M $$CUDA_INC $$NVCCFLAGS ${QMAKE_FILE_NAME}
# Tell Qt that we want add more stuff to the Makefile
QMAKE_EXTRA_COMPILERS += cuda
\end{lstlisting}

The NIDS application has been developed based on the design presented in the chapter \ref{chap:design}. The application uses workflow presented in \ref{chap:introduction}. Project sources are organized into QT project file. To compile project first \textbf{qmake} utility from QT is used to generate Makefile. Makefile is required by \textbf{make} utility to compile and link sources into final executable. The whole project structure is placed in one folder, but logically source codes are divided into "Headers", "Sources", "Forms" and "Other files" if IDE such as QT creator is used to manage the project. All source codes are compiled with \textbf{gcc} compiler except CUDA code. Files with CUDA source codes have \textbf{*.cu} extension and are compiled with \textbf{nvcc} compiler. To add possibility to compile \textbf{*.cu} files to QT project an extra compiler configuration is created in QT project file (shown in listing \ref{lst:extra_compiler_config}).

\section{Application threading model.}
The NDIS application has been developed using Boost.Thread library for Nids itself and QThread for GUI part. Boost.Thread is used in the main Nids class for processing packets on CPU or GPU. The QThread is used for executing Main window message processing and running Nids packet capture task from GUI event callback. The reason why two different threads classes were used is that developing of the core Nids application was started using only Boost library and QT part is added a little bit later.

When NIDS application process in CPU mode N threads are created for packet processing and 1 thread is created for processing analysis results. Packet capture thread writes packet payload into a buffer and puts pointer to the buffer into a queue. Other threads are sleep until some packet arrive into the queue. In the moment when packet arrived into the queue only one of the threads awakes and pulls packet from the queue for processing. After finishing working on packet it goes to sleep state again until new packet arrive into the queue.

For NIDS working in CPU+GPU mode packets processing is slightly different. There is only 1 thread which puts \textbf{tasks} (packet payload, payload length and number of DFA to search) into the queue and 1 thread which process packets when size of the packets in the queue exceeds some limit or time for processing elapsed. A double-buffered mechanism of processing packets were implemented to speed up the process of collecting packets: before processing packets on the GPU, buffers are switched and CPU is able to collect arriving packets into the second buffer.

To keep multithreaded code integrity, synchronization mechanisms from Boost library were used: Boost.Semaphore to notify threads that new packet arrived to the queue and Boost.Mutex for synchronizing access to queue and other shared objects.

\begin{figure}[h]
\centering
\includegraphics[scale=0.3]{images/cpu_threads.png}
\caption{Threading model for CPU processing.}
\label{fig:cpu_threads}
\end{figure}

\begin{figure}[h]
\centering
\includegraphics[scale=0.3]{images/gpu_threads.png}
\caption{Threading model for CPU+GPU processing.}
\label{fig:gpu_threads}
\end{figure}

Figures \ref{fig:cpu_threads} and \ref{fig:gpu_threads} shows the threading model for CPU and CPU+GPU modes graphically.

\section{Main application classes.}
In this section the main application classes are described, their functionality and purpose.

\subsection{Packet class.}
Packet class is used to store all information about network packet needed to analyze packet header and payload. It is used by Nids class to search within trie structure and analyze payload for regex or ac patterns. Listing \ref{lst:packet_class} shows part of the implementation of the class header.

\lstset{language=C++, caption={Extract of Packet.h}, label=lst:packet_class, captionpos=b}
\begin{lstlisting}
class Packet
{
public:
    Packet(in_addr ip_from, in_addr ip_to, int port_from, int port_to);

    // create packet from raw data
    static Packet* create_packet(const u_char *raw_packet);

    // for test purposes
    static Packet* create_fake_packet();

    // getters and setters for packet components
    in_addr get_ip_from() const;
    // ...

    sequence32 &get_route();
    const sequence32 &get_route() const;
    int get_payload_offset() const;
    int get_payload_size() const;
    const u_char *get_raw_packet() const;

    // payload from raw_packet is saved into internal vector
    // it means that packet will be analyzed by CPU
    vector<u_char> &save_payload();
    const vector<u_char> &get_payload() const;

private:
    //...
    int payload_size;
    int payload_offset;

    // info about route for analyzing in the trie
    sequence32 route;
    const u_char *raw_packet;
    vector<u_char> payload;
};
\end{lstlisting}

When new packet is arrived a create\_packet static method is used to analyze packet. This method is a packet analyzer and it used for every received packet to create Packet class with properly filled fields. Method create\_packet analyzes raw packet data and makes protocol identification. If received packet is TCP/IP packet it return newly created Packet object, otherwise it returns null object which means that it's not need to analyze received packet.

Right after returning from create\_packet method raw\_packet field contains the pointer to the raw packet payload. Method create\_packet doesn't copy payload data into any other place because of efficienty. This moment is described later in Nids class description.

Field route is used to save packet route information in the format described in figure \ref{fig:route_sequence}. The route information is stored in sequence32 type array which is defined as a std::vector class that contain 32 bit values. There is also sequence8 type in the application which is std::vector itself containing 8 bit values.

\subsection{AC class.}
AC class is the implementation of the Aho-Corasick algorithm. The main methods are \textbf{insert} and \textbf{search}. Method \textbf{insert} inserts some string into the internal trie while \textbf{search} tries to find given string within saved strings. When all strings are inserted \textbf{finish} method should be called to translate all strings into a DFA which later can be extracted for loading to the GPU. 

\lstset{language=C++, caption={Extract of AC.h}, label=lst:ac_class, captionpos=b}
\begin{lstlisting}
#define FAIL -1
#define STATE_SIZE UCHAR_MAX + 1

class AC
{
public:
    AC();

    void insert(const sequence8 &sequence, int value);
    void search(const sequence8 &sequence, vector<int> &result);
    void search(const u_char *sequence, int length, vector<int> &result);
    int next(int state, sequence_val8 value);

    void finish();
    int count_states();
    int get_transition_value(int state, int c);
    const vector<int> &get_transition_row(int state);
    const vector<int> &get_out(int state);

private:
    void enter(const sequence8 &sequence, int value);
    void compute_fail_transitions();

    vector< vector<int> > transition_table;
    vector< vector<int> > out;
};
\end{lstlisting}

Internally DFA is stored in a \textbf{transition\_table} field with unbounded maximum number of states and number of transitions for the state equals to UCHAR\_MAX + 1 (256). Field \textbf{out} is an out function as described in Aho-Corasick algorithm description. For every state it contains all matched strings.

\begin{figure}[h]
\centering
\includegraphics[scale=0.5]{images/ac_grid.png}
\caption{Aho-Corasick DFA traversing.}
\label{fig:ac_grid}
\end{figure}

The search algorithm is simple traversing internal DFA for every next input symbol is shown on figure \ref{fig:ac_grid}.

\subsection{Regex class.}
Regex class implements compilation of regex patterns into DFA and searching in the input strings for previously compiled patterns. The main methods are \textbf{compile} and \textbf{search}.

\lstset{language=C++, label=lst:regex_class, captionpos=b, caption={Extract of Regex.h}}
\begin{lstlisting}
#define STATE_SIZE (UCHAR_MAX + 1)
#define STATE_DEAD -1
#define STATE_WHOLE_STR -2

class Regex
{
public:
    Regex();

    int compile(const sequence8 &regex);
    int count_states();
    int get_transition_value(int dfa_offset, int state, int c);
    const vector<int> get_transition_state(int dfa_offset, int state);
    const vector<int> get_transition_state(int row);
    bool is_match(int dfa_offset, int state);
    bool is_match(int row);
    const vector<int> get_dfa_offsets() const;
    bool search(const sequence8 &data, int dfa_offset);
    bool search(const u_char *data, int length, int dfa_offset);

private:
    vector< vector<int> > transition_table;
    vector<int> dfa_offsets;
    int states_total;
    int states_cur_dfa;
    map<void *, int> states_map;
    const re2::uint8 *bytemap;

    int analyze_state(void *state);
};
\end{lstlisting}

Method \textbf{compile} uses RE2 regular expression library to compile input pattern into DFA. Default memory budget for compiling one DFA is 8388608 bytes. If compiling pattern exceed memory budget such regex is declined and won't be searched during packet analyzing.

\begin{figure}[h]
\centering
\includegraphics[scale=0.5]{images/regex_grid.png}
\caption{Regex DFA internal representation.}
\label{fig:regex_grid}
\end{figure}

The internal representation of DFA is similar to AC's one. The only difference is that AC has one DFA and all input string falls into it, while for every regular expression separate DFA is created. Thus, Regex class stores all DFAs into one vector and to be able to find specific DFA start offsets are stored into other vector. Other difference is that regex DFA contains states defined as \textbf{STATE\_DEAD} and \textbf{STATE\_WHOLE\_STR}. This states finish the whole searching procedure: value \textbf{STATE\_DEAD} means that for any input character the next state is always the current one, value \textbf{STATE\_WHOLE\_STR} means that no matter what the next input symbols could be, current searching procedure in any case is not able to finish successfully (e.g. searched pattern should match from the start of the string, if start is not matched, not matter what are the other symbols, searching will not succeed).
Regex DFAs representation stored in Regex class is shown in figure \ref{fig:regex_grid}.

\subsection{Trie class.}
Trie class is used to store the sequence of symbols and quick search over the input string. Internal structure of the trie was described in \ref{chap:design}. Trie class itself contain only three methods: \textbf{insert}, \textbf{search} and \textbf{clear}. The name of the methods are pretty clear and speaks for themselfs.

\lstset{language=C++, label=lst:trie_node_class, captionpos=b, caption={Extract of Trie_node.h}}
\begin{lstlisting}
typedef vector<trie_range> trie_range_t;
typedef vector<Trie_node *> trie_range_value_t;
typedef map<sequence_val32, Trie_node *> trie_transition_t;

class Trie_node
{
public:
    Trie_node();
    vector<int>& get_values();
    void set_values(vector<int> &values);
    void append_values(vector<int> &values);
    void append_value(int value);
    Trie_node *next(sequence_val32 val);
    void set_next(sequence_val32 val, Trie_node *next);
    void set_next_any(Trie_node *next);
    Trie_node *get_next_any();

    trie_transition_t &get_transitions();
    trie_range_t &get_ranges();
    trie_range_value_t &get_range_values();

private:
    trie_transition_t next_;
    trie_range_t range_next_;
    trie_range_value_t range_next_value_;
    Trie_node *any_next_;
    vector<int> values;
};
\end{lstlisting}

More complex is the Trie\_node class which stores information for one trie node (listing \ref{lst:trie_node_class}). The type trie\_transition\_t maps input symbol value to the next trie node. Thus the search algorithm is similar to the searching over DFA, but trie node additionally contains range\_next\_ field which is used to store range for IP or port values. If, for example, 192.0-4.*.* is used, than range\_next\_ will contain 0-4 part of the IP address in current trie node. Also any\_next\_ is the pointer to the next trie node for * input symbol. Therefore searching over such kind of the trie is getting next trie node from the next\_ field for each input symbol from the input string. Additionally check should be done if range or any symbol exist in the current node.

\subsection{Rule class.}
For each rule in the input \textbf{rules.txt} file one object of Rule class type is created. Rule class stores a text representation of the rule and also all parameters of the rule which are parsed by using Boost.Spirit parser.

\lstset{language=C++, label=lst:rule_class, captionpos=b, caption={Extract of Rule.h}}
\begin{lstlisting}
class Rule
{
public:
    static Rule *create_rule(const string &rule);

    Action get_action();
    const sequence32 &get_route();
    // ...

private:
    Rule();
    // ...
    // rule as string
    string rule_str;
    // action that should be taken if rule is matched
    Action action;
    // message that should be inserted into database if this rule matched
    string message;
    // route info includes all information about protocol and IP adresses and ports.
    // it will be inserted in trie and matched against route info of every received packet.
    sequence32 route;
    vector<sequence8> content;
    sequence8 pcre;
    int offset;
    int length;
};

\end{lstlisting}

The Rule object is created using static \textbf{create\_rule} method. This method simply parses passed string value and returns pointer to newly created Rule object with filled parameters. Parsing is done by using Boost.Spirit library which allows to write grammar for parsing directly in the C++ code. For parsing rules a \textbf{RuleRouteGrammar} was created and used in a Boost's \textbf{phrase\_parse} function. In case of malformed input rule \textbf{create\_rule} function returns null pointer. Listing \ref{lst:ip_grammar} contains an example of a simple grammar for parsing IP address. This grammar returns std::vector object with four ip parts represented as unsigned integers. The whole grammar for parsing rules is more complex. Parsing is done into a \textbf{RuleSpirit} structure and later data from this structure is copied into the Rule class.

\lstset{language=C++, label=lst:ip_grammar, captionpos=b, caption={Boost.Spirit grammar for parsing IP address.}}
\begin{lstlisting}
template<typename Iterator>
struct IPGrammar : qi::grammar<Iterator, std::vector<unsigned>(), space_type>
{
    IPGrammar() : IPGrammar::base_type(ip_address)
    {
        ip_address = uint_parser<unsigned, 10, 1, 3>() >> '.' >> 
                     uint_parser<unsigned, 10, 1, 3>() >> '.' >> 
                     uint_parser<unsigned, 10, 1, 3>() >> '.' >> 
                     uint_parser<unsigned, 10, 1, 3>();
    }

    qi::rule<Iterator, std::vector<unsigned>()> ip_address;
}
\end{lstlisting}

Grammar could be easily extended to include new parameters by adding qi::rule as a field and include new rules definitions into IPGrammar. In the same way the RuleRouteGrammar grammar could be extended if new options should be added into the Nids rule. Boost.Spirit allows to parse strings directly into C++ structures, so it is possible to create structures layout for parsing very complex expressions.

\subsection{Nids class.}

Nids class is the main class in the application. It manages all activities in the application, starts and stops network monitoring, all previously described classes are used by Nids class as helpers. 

\lstset{language=C++, label=lst:nids_class, captionpos=b, caption={Extract of Nids.h}}
\begin{lstlisting}
class Nids
{
public:
    Nids();
    ~Nids();

    int read_rules(const char *filename);
    bool start_monitor(const char *interface, int threads_num, const char *db_filename = NULL);
    bool start_monitor_gpu(const char *interface,
                           int device_num,
                           WINDOW_TYPE window_type = BUF_SIZE,
                           int buf_threshold_size = DEF_PACKET_BUFFER_SIZE,
                           int buf_flush_time = DEF_TIME_FLUSH_BUFFER,
                           const char *db_filename = NULL);
    bool start_monitor_offline(const char *interface, 
                           const char *cap_filename, 
                           int threads_num, 
                           const char *db_filename = NULL);
    bool start_monitor_offline_gpu(const char *interface, 
                           const char *cap_filename, 
                           int device_num,
                           WINDOW_TYPE window_type = BUF_SIZE,
                           int buf_threshold_size = DEF_PACKET_BUFFER_SIZE,
                           int buf_flush_time = DEF_TIME_FLUSH_BUFFER,
                           const char *db_filename = NULL);
    bool start_monitor_dump(const char *interface, const char *cap_filename);
    // ...
    void add_alert_callback(AlertCallbackFunc callback, void *user);
    void add_log_callback(LogCallbackFunc callback, void *user);

private:
    // ...

    // loaded rules
    vector<Rule *> rules;

    // route info saved in trie
    Trie rules_trie;

    // ...

    // CPU processing
    // ...
    boost::interprocess::interprocess_semaphore packets_queue_sem;
    boost::interprocess::interprocess_semaphore threads_finished_sem;
    boost::interprocess::interprocess_semaphore process_result_sem;
    boost::interprocess::interprocess_mutex packets_queue_mutex;
    boost::interprocess::interprocess_mutex analyzing_result_mutex;

    // GPU processing
    // ...
    // pointers to current buffer that should be processed on gpu
    vector<int> *gpu_ac_rules;
    vector<u_char> *gpu_packet_buffer;
    vector<int> *gpu_tasks;

    // double buffered packets, tasks and ac_rules
    vector<int> ac_rules1;
    vector<int> ac_rules2;
    vector<u_char> packet_buffer1;
    vector<u_char> packet_buffer2;
    vector<int> tasks1;
    vector<int> tasks2;

    vector<int> analyze_rules;
    vector<int> matched_rules;

    // CUDA synchronization semaphores
    boost::interprocess::interprocess_semaphore gpu_task_ready;
    boost::interprocess::interprocess_semaphore gpu_task_finished;
    boost::interprocess::interprocess_semaphore gpu_init_finished;
    boost::interprocess::interprocess_mutex gpu_switch_buffer_mutex;

		void switch_buffers();
    // ...
    friend void packet_received(u_char *user, const struct pcap_pkthdr *h, const u_char *bytes);
    friend struct process_on_gpu_callable;
    friend struct process_on_cpu_callable;
    friend struct process_results_callable;
};\end{lstlisting}

When application is started it should read Nids rules first. For this process \textbf{read\_rules} method is used: it creates array of rules in a \textbf{rules} variable and later uses these rules for packet analysis.

Different start\_* methods are used to start monitoring network in different modes:
\begin{itemize}
\item \textbf{start\_monitor} is used to start network monitoring in a CPU mode. Number of threads used to process packets is passed as \textbf{threads\_num} parameter.
\item \textbf{start\_monitor\_gpu} is used to start network monitoring in a CPU+GPU mode. Pararmeter \textbf{window\_type} specifies what kind of mechanism is used for transferring packets to GPU: \textbf{BUF\_SIZE}(size of the buffer with payloads) or \textbf{TIME}(time based transfer of payloads). Parameter \textbf{device\_num} specifies which GPU device is used for packets analyzing.
\item \textbf{start\_monitor\_offline} is used to process packets from saved *.cap file in CPU mode. It operates in the same manner as start\_monitor, the only difference that new parameter \textbf{cap\_filename} is used as a name of the file with packets.
\item \textbf{start\_monitor\_offline\_gpu} is used to process packets from saved *.cap file in CPU+GPU mode. It operates in the same manner as start\_monitor\_gpu, the only difference that new parameter \textbf{cap\_filename} is used as a name of the file with packets.
\item \textbf{start\_monitor\_dump} is used to start monitoring network and save packets into the *.cap log file. Later this file is used by previously defined methods for analyzing.
\end{itemize}

Just after starting monitoring network Nids class creates threads for processing packets on CPU or GPU. For this purpose three process\_* structures were created:
\begin{itemize}
\item \textbf{process\_on\_gpu\_callable} is a structure with overrided operator (). This way Boost.Thread executes code in operator() in a separate thread. The process\_on\_gpu\_callable thread performs tasks execution on GPU. Because CUDA device should be initialized in the thread where it will be used, process\_on\_gpu\_callable also initializes CUDA device. For synchronization purposes during initialization and processing on GPU Boost.Semaphore and Boost.Mutex are used. For example, after Nids started the process\_on\_gpu\_callable thread, it waits on gpu\_init\_finished semaphore until thread finishes initialization of the CUDA device.
\item \textbf{process\_on\_cpu\_callable} has similar structure with overrided operator (). Parameter threads\_num defines how many process\_on\_cpu\_callable threads will be created. Every process\_on\_cpu\_callable thread waits on packets\_queue\_sem until new packet appears in the queue. For every packet exactly one thread will seize the packets\_queue\_sem and get packet from the queue.
\item \textbf{process\_results\_callable} is a thread which process results of the analysis. The thread main task is to insert matched rules into a database. There are also mechanism of the callbacks in Nids class. Callbacks functions can be added to Nids using add\_alert\_callback or add\_log\_callback methods and for every matched rule process\_results\_callable thread will call it.
\end{itemize}

The \textbf{packet\_received} function is a callback for a Libpcap library. Libpcap call this function every time the new packet arrived into the system. The main responsibility of this function in CPU mode is to check whether new packet is TCP/IP (using Packet::create\_packet() method), create the Packet class object and put it into the queue. For CPU+GPU mode processing of the received packet is slightly different. The Packet class object is created as it is done in CPU mode, but just after creation all parsed parameters are saved into internal vectors (including payload) and Packet object is deleted. This way Nids class gathers all payloads into the one linear buffer and if the buffer size exceeds limit or time for processing packets expired, it increases gpu\_task\_ready semaphore to awake process\_on\_gpu\_callable thread to process packets on the GPU. Also CPU+GPU mode manages \textbf{tasks} array. This array is directly copied to the CUDA device and contains tasks that CUDA threads will process. The structure of task is described later in the CUDA code description.

The Nids class has two vectors for CUDA tasks and two vectors for packet buffer. This way double-buffered mechanism is implemented for packet capturing and CUDA processing. When process\_on\_gpu\_callable thread copies data to the GPU and executes kernel on \textbf{tasks1} vector of tasks, packet\_received adds newly received packets to the \textbf{tasks2} vector of tasks.

Nids class logs a lot of information during its work using Boost.Log library. To reduce logged information \textbf{set\_log\_level} method is used. It defines six different log levels, but the major ones for this application are: TRACE, DEBUG and INFO.

\subsubsection*{Future work.}
The future work for the Nids class is to implement packet categorization based on the payload size and process different categories of packets separately. There could be several categories of packets: small, medium, large, etc. This way CUDA resources could be used more effectively.

Also a problem with TCP stream is not solved in the current version of the application. Packets from the network layer of the network stack are split into several TCP packets. It could be a problem if, for example, packet data is compressed and should be uncompressed before analyzing payload. In such a case TCP reassembly of the several packets from TCP stream should be performed before analyzing packet payload\cite{tcp_ref}.

\subsection{GanidsMainWindow class.}
GanidsMainWindow class implements a GUI part of the application. It inherits QT's QMainWindow class and creates GUI components when application is started. QT allows to describe all GUI components in the '*.ui' class using QT designer and during compilation QT framework automatically compiles this file into the proper '*.cpp' file to create the whole user interface. GanidsMainWindow just call \textbf{setupUi} method in it constructor to initialize all parts of the user interface.

\lstset{language=C++, label=lst:main_window, captionpos=b, caption={GanidsMainWindow class.}}
\begin{lstlisting}
class GanidsMainWindow : public QMainWindow
{
// ...    
private slots:
    void on_button_start_clicked();
    void on_check_gpu_toggled(bool checked);
    void on_check_save_toggled(bool checked);
    void on_check_load_toggled(bool checked);
    void on_button_stop_clicked();
    void on_capture_finished();
    void on_capture_terminated();
    void on_nids_alert(const char *message, const char *rule);
    void on_nids_log(const char *message, const char *rule);
    void on_button_clear_matched_clicked();
signals:
    void nids_alert(const char *message, const char *rule);
    void nids_log(const char *message, const char *rule);

private:
    Ui::GanidsMainWindow *ui;

    void update_network_interfaces();
    void update_cuda_devices();
    void start_gpu_capture_thread(QString interface_name,
                                  int cuda_device_num,
                                  WINDOW_TYPE window_type,
                                  int flush_buffer_size,
                                  int flush_time);
    void start_cpu_capture_thread(QString interface, int threads_num);
    void stop_capture_thread();
// ...
\end{lstlisting}

All slots 'on\_*' are used as a callbacks for user actions. For example \textbf{on\_button\_start\_clicked} starts Nids application to monitor network, \textbf{on\_button\_stop\_clicked} stops monitoring and so on. Signal \textbf{nids\_alert} and \textbf{nids\_log} are emitted when Nids matched some packet to some rule. Methods 'update\_*' are used to update GUI with the currently available CUDA devices and network interfaces.

\begin{figure}[h]
\centering
\includegraphics[scale=0.3]{images/user_interface.png}
\caption{User interface.}
\label{fig:user_interface}
\end{figure}

The whole user interface in design mode is shown in the figure \ref{fig:user_interface}.

\subsection{CUDA code.}
The code on the CUDA device is quite simple. It receives array of tasks, pointer to packets buffer and the offset in the DFAs buffer, because all DFAs (including Aho-Corasick one) are copied to the GPU as one linear buffer. Every CUDA thread gets task from the tasks buffer which contain information about payload for processing. The packet processing is just traversing over DFA table for every input byte from the packet payload. If new state is the final one and has \textbf{found} flag  set, CUDA thread sets \textbf{found} flag for the current task. The task structure is described in the figure \ref{fig:cuda_task}. 

\begin{figure}[h]
\centering
\includegraphics[scale=0.5]{images/cuda_task.png}
\caption{Cuda task.}
\label{fig:cuda_task}
\end{figure}

When CUDA kernel is started, every CUDA thread calculates its task number using formula: \textbf{task\_num = blockIdx.x * blockDim.x + threadIdx.x}. This way every thread process takes task in sequential order. CUDA has limits on the number of threads per block (maximum 512 threads), but the number of tasks could be greather than the maximum number of threads, therefore CUDA code reuses threads to process all tasks. If we are using 4 block with 256 threads in each block, than maximum number of threads is 1024. This way only 1024 threads are logically executed in the same moment and should be reused for tasks that are not processed yet. The mechanism of thread resuing is described in the figure \ref{fig:cuda_thread_reuse}.

\lstset{language=C++, label=lst:nids_class, captionpos=b, caption={Extract of my\_nids\_cuda.cu}}
\begin{lstlisting}
    while (task_num < tasks) {
        int task_offset = task_num * 4;
        out_buffer[task_num] = -1;

        unsigned char* payload = packet_buffer + task_buffer[task_offset];
        unsigned payload_length = task_buffer[task_offset + 1];
        int *dfa_offset = dfa + pitch_int * task_buffer[task_offset + 2];
        unsigned result_rule = task_buffer[task_offset + 3];
        unsigned cur_state = 0;
        unsigned cur_idx = 0;
        int ret;

        while (cur_idx < payload_length) {
            int4 payload_part = *((int4*)(payload + cur_idx));
            if ((ret = check_payload_part(payload_part.x, cur_state, result_rule, task_num, dfa, dfa_offset, pitch_int, out_buffer)) < 0)
                break;
            cur_state = ret;

            if ((ret = check_payload_part(payload_part.y, cur_state, result_rule, task_num, dfa, dfa_offset, pitch_int, out_buffer)) < 0)
                break;
            cur_state = ret;

            // ...

            cur_idx += 16;
        }

        task_num += gridDim.x * blockDim.x;
    }
\end{lstlisting}

To speed up the process of analyzing packet on the GPU packet data is transferred from the global memory using 16 byte structure (int4). This way 16 bytes are transferred to the register memory in the one transaction and processed using register memory which has very fast access time.

Function \textbf{check\_payload\_part} takes cur\_state and 4 bytes of the payload as integer parameter and traverse over regex DFA modifying current state. It returnes new state in case no matches were found or -1 if current payload part was matched against some regex.

\begin{figure}[h]
\centering
\includegraphics[scale=0.5]{images/cuda_thread_reuse.png}
\caption{Cuda threads reuse.}
\label{fig:cuda_thread_reuse}
\end{figure}

Thread reusing is implemented with expression \textbf{task\_num += gridDim.x * blockDim.x} which increases the number of task for the current thread by the size of grid dimension multiplied on block dimension.

\subsubsection*{Future work.}
The future work for the CUDA code is to use texture memory to save DFA or packets payload. Texture memory has the same access time as a global memory: 400-600 cycles. The only difference is that texture memory is cached and in case of the cache hit access time is about 1 cycle which is very fast. On the other side access pattern for the packet analyzing has random structure, therefore using structures may not give desired results.

\section{Evaluation}
In the experiments NVIDIA GeForce GT 9600M with 512MB of memory videocard and Intel Core 2 T9400 2.53 GHz was used. First the time needed for the transfer batch of pakets to the GPU was measured.

\begin{table}[h]
\centering
\begin{tabular}{| l | l |}
\hline
Buffer size (in bytes) & Transfer time for 1 packet (ms) \\ \hline
3200 & 0.01688 \\ \hline
6400 & 0.006219 \\ \hline
12800 & 0.004908 \\ \hline
25600 & 0.004835 \\ \hline
51200 & 0.004628 \\ \hline
102400 & 0.004592 \\ \hline
\end{tabular}
\caption{Packet transfer time.}
\label{fig:packet_transfer_times}
\end{table}

From the figure \ref{fig:packet_transfer_times} it is visible that packet transfer time is decreasing while size of packets buffer increased. And transfer times for big buffers of packets remains somewhat constant. The buffer size about 25000 bytes is enough to get acceptable perfomance for packet transferring to the GPU.

\begin{figure}[h]
\centering
\includegraphics[scale=0.3]{images/graph1.png}
\caption{Number of served rules in worst case scenario.}
\label{fig:graph1}
\end{figure}

Testing of the online processing mode was done on the 2, 4 and 10 Mb/s network. For all network speeds were created the worst case scenario when all incoming packets should be analyzed by many regular expressions. Figure \ref{fig:graph1} shows the number of rules when Libpcap starts to drop packets in CPU and CPU+GPU modes. Processing packet on the GPU gives about 20\% more rules which may be under expectations. But it should be noticed that CPU load during processing packets on GPU is quite small. Current implementation doesn't load CPU much, because CPU performs only simple operations of recognizing TCP/IP packets and matching packet header which has small length. In case of TCP stream reassembly and additional protocols for recognizing CPU will have more work to do and more packets will be dropped in CPU mode. Also CUDA code is not optimized because of lack of experience during implementing it, which may give significant reduction in speed.

Figure \ref{fig:graph2} shows dependency between packets buffer size and the number of served rules in CPU+GPU mode. As it was expected when buffer is small more overheads in the copying of data to the GPU and application is able to serve less rules in the same time. Larger buffers gave more perfomance.

\begin{figure}[h]
\centering
\includegraphics[scale=0.3]{images/graph2.png}
\caption{Number of served rules with different packets buffer size.}
\label{fig:graph2}
\end{figure}

It is also interesting to test database where all rules have \textbf{content} option only. In a such situation only one DFA is created from the Aho-Corasick alrorithm. The result is shown in the figure \ref{fig:graph3}. As it was expected only content rules work much faster than combined rules which have both content and regular expression options.

\begin{figure}[t]
\centering
\includegraphics[scale=0.3]{images/graph3.png}
\caption{Number of served rules with only content option.}
\label{fig:graph3}
\end{figure}

The normal scenario is not possible to test in the current version of the application, because a real database of the rules and real traffic with known IP addresses should be developed. Creating a real database of rules is a question of time and experience in finding vulnerabilities.

\setsecnumdepth{part}
\chapter{Conclusion}
The Thesis result is an application which is capable of achieving all objectives defined in chapter \ref{chap:introduction}. Application allows monitoring network traffic in two modes: CPU and CPU+GPU. Using CPU+GPU gave about 20-30\% of speed increase comparing to CPU mode. The next step for the application is creating or porting large database of the rules. Also introducing more network protocols and more options for rules will be a good idea. Such an options could be, for example, TCP stream reassembly and decompressing packet payload, because almost all servers and browsers compress data before sending today. GUI part of the application doesn't accomplish it functionality and was not a good idea. In case PC has only one videocard, CUDA code could take major part of the GPU resources and GUI may have lags in response. Command line based application is better choise for such kind of application.

Writing CUDA code is a simple and hard at the same time. Simple, because C language used for CUDA applications is well known and NVIDIA extensions for it didn't broke it's conciseness. And hard, because writing programs for parallel architectures is something different compared to writing regular applications. Code should be optimized as much as possible, every memory transaction should be taken into account. This thesis was a good point to learn different technologies such as C++ programming language including some templates in pair with GPGPU based on CUDA architecture, Boost synchronization mechanisms and Spirit parser, Libpcap library API, SQLite database, QT and as an extra \LaTeX{} program was used for the first time to create this document.

\bibliographystyle{plain}
\bibliography{mybibliographyfile}

\setsecnumdepth{all}
\appendix

\chapter{Acronyms}
% \printglossaries
\begin{description}
	\item[GUI] Graphical user interface
	\item[NIDS] Network Intrusion Detection System
	\item[CPU] Central processing unit
	\item[IP] Internet Protocol
	\item[FPGA] Field-programmable gate array
	\item[ASIC] Application-specific integrated circuit
	\item[NPU] Network processing unit
	\item[TCP] Transmission Control Protocol
	\item[GPP] General-purpose processor
	\item[SPP] Special-purpose processor
	\item[DSP] Digital signal processors
	\item[CAD] Computer aided design
	\item[GPGPU] General-purpose computing on graphics processing units
	\item[CUDA] Compute Unified Device Architecture
	\item[IEEE] Institute of Electrical and Electronics Engineers
	\item[LAN] Local area network
	\item[RSPAN] Remote Switched Port Analyzer
	\item[API] Application programming interface
	\item[ARPANET] Advanced Research Projects Agency Network
	\item[IDES] Intrusion Detection Expert System
	\item[DIDS] Distributed Intrusion Detection System
	\item[CMDS] Computer Misuse Detection System
	\item[ASIM] Automated Security Measurement System
	\item[AFCERT] Air Force’s Computer Emergency Response Team
	\item[NSM] Network Security Monitor
	\item[IBM] International Business Machines
	\item[FLOPS] Floating-point operations per second
	\item[SQL] Structured Query Language
	\item[OSI] Open Systems Interconnection
	\item[DFA] Deterministic finite automaton
	\item[IDE] Integrated development environment
\end{description}


\chapter{Contents of enclosed CD}

\texttt{\\
\noindent.\\
|\\
|--\ text\\
|\ \ \ |-- pdf\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ This document content in PDF format.\\
|\ \ \ \textbackslash--\ tex\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ This document content in LaTeX format.\\
|\ \ \ \ \ \ \ \ \textbackslash--\ figures\ \ \ \ \ \ \ \ \ \ \ \ \ Figures used in this document.\\
\textbackslash-- src\\
.\ \ \ |-- ganids\_gui\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ Application source codes.\\
.\ \ \ \textbackslash--\ test\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ Rules used for testing application.\\


}
\end{document}