The dataset, which you can download from Kaggle, is provided in CSV format and includes both benign DNS traffic and malicious DNS exfiltration traffic. Each row represents a query-response pair and consists of 45 attributes. Below is an example illustrating the structure of the dataset:
| flow_id | timestamp | src_ip | src_port | dst_ip | dst_port | duration | total_bytes | receiving_bytes | sending_bytes | packets_rate | packets_len_rate | min_packets_len | max_packets_len | mean_packets_len | standard_deviation_packets_len | variance_packets_len | coefficient_of_variation_packets_len | dns_domain_name | dns_top_level_domain | dns_second_level_domain | dns_domain_name_length | dns_subdomain_name_length | uni_gram_domain_name | bi_gram_domain_name | tri_gram_domain_name | numerical_percentage | character_distribution | character_entropy | max_continuous_numeric_len | max_continuous_alphabet_len | max_continuous_consonants_len | max_continuous_same_alphabet_len | vowels_consonant_ratio | conv_freq_vowels_consonants | distinct_ttl_values | ttl_values_min | ttl_values_max | ttl_values_mean | ttl_values_mode | ttl_values_median | distinct_A_records | ans_resource_record_type | ans_resource_record_class | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2023-08-05 09:03:40.862988_192.168.68.62_49517_192.168.68.1_53_11349 | 2023-08-05 09:03:40.862988 | 192.168.68.62 | 49517 | 192.168.68.1 | 53 | 0.38218 | 178 | 105 | 73 | 5.23314 | 465.74915 | 73 | 105 | 89.0 | 16.0 | 256.0 | 0.17978 | athomenet.com. | com | athomenet.com | 14 | 0 | ['a', 't', 'h', 'o', 'm', 'e', 'n', 'e', 't', ... | ['at', 'th', 'ho', 'om', 'me', 'en', 'ne', 'et... | ['ath', 'tho', 'hom', 'ome', 'men', 'ene', 'ne... | 0.0 | {'o': 2, 'm': 2, 'c': 1, 'a': 1, 'n': 1, 't': ... | 3.093069 | 0 | 9 | 2 | 1 | 0.714286 | 0.642857 | 1 | 300 | 300 | 300.0 | 300.0 | 300.0 | 2 | [1, 1] | [1, 1] | Benign |
| 2019-04-29 07:57:43.810610_198.41.0.130_36304_198.41.0.10_53_18725 | 2019-04-29 07:57:43.810610 | 198.41.0.130 | 36304 | 198.41.0.10 | 53 | 0.00547 | 665 | 373 | 292 | 365.89933 | 121661.52665 | 292 | 373 | 332.5 | 40.5 | 1640.25 | 0.1218 | bba9012dfe6eb0362db5ae43cc103986d2d96b5c1d937c... | com | dnscat-txt.com | 233 | 60 | ['b', 'b', 'a', '9', '0', '1', '2', 'd', 'f', ... | ['bb', 'ba', 'a9', '90', '01', '12', '2d', 'df... | ['bba', 'ba9', 'a90', '901', '012', '12d', '2d... | 0.549356 | {'9': 11, '6': 19, 'd': 14, 'm': 1, 'n': 1, '-... | 4.201555 | 7 | 6 | 4 | 2 | 0.462687 | 0.077253 | 1 | 60 | 60 | 60.0 | 60.0 | 60.0 | 0 | [16] | [1] | Malicious |
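
As a minimal sketch of how rows like these might be loaded using only the Python standard library, the snippet below parses one CSV record and converts a few columns to native types. The inline sample, the column subset, and the helper name `parse_row` are assumptions made for illustration; list-like columns are assumed to be serialized as Python literals, as the example above suggests.

```python
import ast
import csv
import io

# Hypothetical excerpt with a subset of columns; the real file has 45.
SAMPLE_CSV = '''flow_id,duration,total_bytes,dns_domain_name,ans_resource_record_type,label
2023-08-05 09:03:40.862988_192.168.68.62_49517_192.168.68.1_53_11349,0.38218,178,athomenet.com.,"[1, 1]",Benign
'''

def parse_row(row):
    """Convert selected string fields of one CSV row to native Python types."""
    parsed = dict(row)
    parsed["duration"] = float(row["duration"])
    parsed["total_bytes"] = int(row["total_bytes"])
    # Assumption: list-like columns are stored as Python literals.
    parsed["ans_resource_record_type"] = ast.literal_eval(
        row["ans_resource_record_type"]
    )
    return parsed

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE_CSV))]
print(rows[0]["dns_domain_name"], rows[0]["label"])
```

For the full file, `io.StringIO(SAMPLE_CSV)` would be replaced by an open file handle, for example one created with `open(path, newline="")`.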
The dataset comprises 797,626 benign samples and 785,537 malicious samples. The malicious samples were generated using nine different DNS exfiltration tools, including iodine, dnsexfiltrator, and cobaltstrike.
Malware using DNS exfiltration typically attempts to mimic legitimate traffic to avoid detection. With this context, two assumptions were made when generating the dataset:
- Like benign DNS queries, DNS exfiltration queries use randomly selected Transaction IDs. Consequently, traffic generated by DNS-Shell was excluded from this dataset, as this tool reuses the same Transaction ID across consecutive queries.
- DNS exfiltration domains follow the LDH rule as defined in RFC 1035, meaning domain names are limited to letters, digits, and hyphens. Therefore, iodine traffic that uses Base128 encoding was also excluded, as it does not conform to the LDH rule.
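
The second assumption can be illustrated with a small check. This is only a sketch of the character-set part of the LDH rule; the function name `follows_ldh` is hypothetical, and other RFC 1035 constraints (label length, first and last character of a label) are deliberately not enforced.

```python
import re

# One label made up only of letters, digits, and hyphens.
_LDH_LABEL = re.compile(r"[A-Za-z0-9-]+")

def follows_ldh(domain_name):
    """Return True if every label of the name uses only LDH characters."""
    # Drop a single trailing dot (the root label) before splitting.
    if domain_name.endswith("."):
        domain_name = domain_name[:-1]
    labels = domain_name.split(".")
    return all(_LDH_LABEL.fullmatch(label) is not None for label in labels)

print(follows_ldh("athomenet.com."))        # LDH-only name
print(follows_ldh("d\xe4ta.example.com"))   # contains a non-ASCII character
```

A name containing bytes outside this character set, such as those produced by a Base128 encoding, would fail the check.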
There is a lack of publicly available DNS exfiltration datasets that are both high-quality and provided in an analyzable format. Most existing CSV datasets exhibit one or more of the following issues:
- The DNS exfiltration traffic was generated using only one or a few exfiltration tools, resulting in malicious samples that do not represent a diverse range of attack patterns.
- Each sample has only a few attributes, which limits in-depth analysis of how DNS exfiltration tools behave.
- The datasets contain inaccurate data, including erroneous labeling (e.g., benign samples labeled as malicious).
- The dataset's creation process is insufficiently explained, including which exfiltration tools were used.
This dataset aims to address these challenges.
This dataset is built upon PCAP files obtained from the following sources:
- PCAP files by Singh et al., used in their work "Detecting bot-infected machines using DNS fingerprinting":
  - Benign university traffic captured on Monday, Wednesday, Thursday, and Friday between 8 a.m. and 10 p.m. (DGA domains identified by the authors have been removed).
- PCAP files by Gao et al., used in their work "GraphTunnel: Robust DNS Tunnel Detection Based on DNS Recursive Resolution Graph":
  - Benign traffic for the top 1 million domains, restricted to domains that contain a subdomain.
  - Malicious traffic generated by the tools tcp-over-dns and dnspot.
- PCAP files by Chen et al., used in their work "DNS Covert Channel Detection Method Using the LSTM Model":
  - Malicious traffic generated by the tools cobaltstrike, dns2tcp, dnscat2, iodine, and ozymandns.
- My own network environment:
  - Malicious traffic generated by the tools dnsexfiltrator and a modified version of dnsexfiltrator.
The original PCAP files were then converted to CSV format using ALFlowLyzer, an open-source Python project that extracts application-layer features from network traffic. It was developed by Shafi et al. to generate the BCCC-CIC-Bell-DNS-2024 dataset. Further details can be found in their work "Unveiling malicious DNS behavior profiling and generating benchmark dataset through application layer traffic analysis".
Each row in the CSV files generated by ALFlowLyzer represents a bidirectional DNS flow, with the first packet determining the forward (source-to-destination) and backward (destination-to-source) directions. The primary identifier for a DNS flow is the transaction ID in the DNS header. Consequently, if the same transaction ID is reused within a short timeframe, a single row may correspond to multiple query-response pairs.
This scenario poses an issue, as multiple query-response pairs with the same transaction ID within a short timeframe typically occur by chance and are not logically related. This behavior also applies to DNS exfiltration malware, which, as previously noted, typically mimics legitimate traffic by generating separate queries with randomly chosen transaction IDs. To address this, all rows containing DNS flows with more than one request and response were removed to ensure each row accurately represents a unique query-response interaction.
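
The filtering step described above can be sketched roughly as follows. The per-packet `(flow_id, is_response)` tuples are a hypothetical stand-in and not actual ALFlowLyzer output; the sketch also assumes that flows lacking a response are dropped, which the text does not state explicitly.

```python
from collections import Counter

def single_pair_flow_ids(packets):
    """Return the ids of flows with exactly one request and one response.

    `packets` is an iterable of (flow_id, is_response) tuples; this layout
    is hypothetical and used only for illustration.
    """
    requests, responses = Counter(), Counter()
    for flow_id, is_response in packets:
        (responses if is_response else requests)[flow_id] += 1
    return {f for f in requests if requests[f] == 1 and responses[f] == 1}

packets = [
    ("flow-a", False), ("flow-a", True),   # one query-response pair: kept
    ("flow-b", False), ("flow-b", True),
    ("flow-b", False), ("flow-b", True),   # transaction ID reused: dropped
]
kept = single_pair_flow_ids(packets)
print(sorted(kept))
```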
To further enhance the quality of the dataset, additional data cleaning procedures were performed. These include:
- Removing duplicate query-response pairs.
- Removing incorrectly labeled query-response pairs.
- Removing noisy heartbeat packets that have no exfiltration data in requests and no commands in responses.
- Removing query-response pairs where data was incorrectly extracted due to parsing errors by ALFlowLyzer (the tool appears to have limitations with SRV and MX record types).
- Removing query-response pairs that exhibited erroneous or incomplete responses due to network or nameserver issues.
| No. | Attribute | Description |
|---|---|---|
| 1 | flow_id | The identifier of the DNS flow, structured as follows: Timestamp_SrcIP_SrcPort_DstIP_DstPort_TransactionID. |
| 2 | timestamp | When the DNS request was sent. |
| 3 | src_ip | The IP address from which the DNS request was sent. |
| 4 | src_port | The port from which the DNS request was sent. |
| 5 | dst_ip | The IP address to which the DNS request was sent. |
| 6 | dst_port | The port to which the DNS request was sent. |
| 7 | duration | The number of seconds between the DNS request and the DNS response. |
| 8 | total_bytes | The total size of the DNS request and its corresponding DNS response, measured in bytes. |
| 9 | receiving_bytes | The size of the DNS response in bytes. |
| 10 | sending_bytes | The size of the DNS request in bytes. |
| 11 | packets_rate | The rate at which DNS packets are sent or received during the DNS flow. A higher packets_rate means that the DNS request and response were exchanged in a shorter timeframe, suggesting fast communication. This metric is calculated by dividing the number of packets by the duration of the DNS flow: 2 / duration. |
| 12 | packets_len_rate | The rate at which the total size of DNS packets is sent or received during the DNS flow. A higher packets_len_rate means that a larger amount of data was transmitted relative to the duration. This metric is calculated by dividing the total size of the DNS packets by the duration of the DNS flow: total_bytes / duration. |
| 13 | min_packets_len | The size in bytes of the smallest packet within the DNS flow. |
| 14 | max_packets_len | The size in bytes of the largest packet within the DNS flow. |
| 15 | mean_packets_len | The average size in bytes of the packets within the DNS flow. |
| 16 | standard_deviation_packets_len | The standard deviation of packet sizes in the DNS flow. |
| 17 | variance_packets_len | The variance of packet sizes in the DNS flow. |
| 18 | coefficient_of_variation_packets_len | The coefficient of variation of packet sizes in the DNS flow. This metric is calculated by dividing the standard deviation of packet sizes by the average packet size: standard_deviation_packets_len / mean_packets_len. |
| 19 | dns_domain_name | The domain name for which information is being queried. |
| 20 | dns_top_level_domain | The top-level domain of the queried domain name. |
| 21 | dns_second_level_domain | The second-level domain (including the top-level domain) of the queried domain name. |
| 22 | dns_domain_name_length | The length of the domain name (including the period at the end, which signifies the root level), measured in characters. |
| 23 | dns_subdomain_name_length | The length of the first subdomain when read from left to right. |
| 24 | uni_gram_domain_name | The unigram of the domain name. |
| 25 | bi_gram_domain_name | The bigram of the domain name. |
| 26 | tri_gram_domain_name | The trigram of the domain name. |
| 27 | numerical_percentage | The fraction of characters in the domain name that are digits, expressed as a value between 0 and 1: number of digits in the domain name / dns_domain_name_length. |
| 28 | character_distribution | The frequency distribution of individual characters in the domain name. This attribute outputs the number of occurrences of each character in the domain name as a dictionary, where the keys are the characters and the values are their frequencies. |
| 29 | character_entropy | The entropy of the characters in the domain name. |
| 30 | max_continuous_numeric_len | The length of the longest continuous sequence of digits in the domain name. |
| 31 | max_continuous_alphabet_len | The length of the longest continuous sequence of alphabetic characters in the domain name. |
| 32 | max_continuous_consonants_len | The length of the longest continuous sequence of consonants in the domain name. |
| 33 | max_continuous_same_alphabet_len | The length of the longest continuous sequence of identical letters in the domain name. |
| 34 | vowels_consonant_ratio | The ratio of vowels to consonants in the domain name. |
| 35 | conv_freq_vowels_consonants | The relative frequency of alternating vowels and consonants in the domain name. This metric is calculated by dividing the number of occurrences of alternating letters (vowel followed by consonant or vice versa) by the total length of the domain name. |
| 36 | distinct_ttl_values | The number of distinct TTL values present in the resource records of the DNS flow. |
| 37 | ttl_values_min | The smallest TTL value among the resource records in the DNS flow. |
| 38 | ttl_values_max | The largest TTL value among the resource records in the DNS flow. |
| 39 | ttl_values_mean | The average TTL value among the resource records in the DNS flow. |
| 40 | ttl_values_mode | The TTL value that occurs most frequently among the resource records in the DNS flow. |
| 41 | ttl_values_median | The median of the TTL values among the resource records in the DNS flow. |
| 42 | distinct_A_records | The number of distinct resource records of type A present in the DNS flow. |
| 43 | ans_resource_record_type | The types of resource records in the answer section. |
| 44 | ans_resource_record_class | The classes of resource records in the answer section. |
| 45 | label | Indicates whether the DNS flow is classified as "Malicious" or "Benign". |
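
To make some of the derived attributes concrete, the sketch below recomputes a few of them from raw inputs. It is not ALFlowLyzer's implementation: the function names are hypothetical, and the choices of population standard deviation, base-2 Shannon entropy, counting the dots in the name, and returning 0.0 when a name has no consonants are assumptions, chosen because they are consistent with the first example row.

```python
import math
from collections import Counter
from statistics import mean, pstdev

VOWELS = set("aeiou")

def flow_features(duration, packet_sizes):
    """Rate and packet-size statistics for one DNS flow."""
    avg = mean(packet_sizes)
    std = pstdev(packet_sizes)  # population standard deviation (assumption)
    return {
        "packets_rate": len(packet_sizes) / duration,
        "packets_len_rate": sum(packet_sizes) / duration,
        "mean_packets_len": avg,
        "standard_deviation_packets_len": std,
        "variance_packets_len": std ** 2,
        "coefficient_of_variation_packets_len": std / avg,
    }

def lexical_features(name):
    """Character-level features over the full name, dots included."""
    n = len(name)
    lowered = name.lower()
    counts = Counter(name)
    # Shannon entropy in bits over the character distribution (assumption).
    entropy = -sum(c / n * math.log2(c / n) for c in counts.values())
    letters = [ch for ch in lowered if ch.isalpha()]
    vowels = sum(ch in VOWELS for ch in letters)
    consonants = len(letters) - vowels
    # Adjacent letter pairs that switch between vowel and consonant.
    switches = sum(
        a.isalpha() and b.isalpha() and ((a in VOWELS) != (b in VOWELS))
        for a, b in zip(lowered, lowered[1:])
    )
    return {
        "numerical_percentage": sum(ch.isdigit() for ch in name) / n,
        "character_entropy": entropy,
        "vowels_consonant_ratio": vowels / consonants if consonants else 0.0,
        "conv_freq_vowels_consonants": switches / n,
    }

flow = flow_features(0.38218, [73, 105])
lexical = lexical_features("athomenet.com.")
print(flow)
print(lexical)
```

With the inputs taken from the first example row (duration 0.38218, packet sizes 73 and 105, domain name `athomenet.com.`), the results agree with the values listed for that row up to rounding.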