Commit 5465bfc

committed
feat(post): breaking repeated XOR ciphertext
1 parent 4e1eebf commit 5465bfc

2 files changed

+187
-2
lines changed

src/posts/breakrepeatingxor.mdx

Lines changed: 187 additions & 0 deletions
@@ -0,0 +1,187 @@
1+
---
2+
3+
title: "Cryptography basics: breaking repeated-key XOR ciphertext"
4+
description: "a (kind of) nice introduction to cryptography"
5+
date: 05-07-2022
6+
7+
---
8+
9+
In this post, we are going to learn what the XOR cipher is and how to break it with a Friedman-style key-length test based on Hamming distance, followed by frequency analysis.
10+
11+
### First of all, what exactly is an XOR cipher?
12+
If you have ever studied bitwise operators, you have probably heard of _exclusive or_, or simply XOR.
13+
It takes two input bits and returns 1 if they are different, and 0 if they are equal.
14+
![xor truth table](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j6iqp9pb4z712uc51ksc.png)
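As a quick check, the truth table can be reproduced with Python's `^` operator. This is just a minimal sketch, and the name `truth_table` is only illustrative:

```python
# Build the XOR truth table with Python's bitwise ^ operator.
truth_table = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]

for a, b, result in truth_table:
    print(f"{a} XOR {b} = {result}")
```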
15+
16+
But the interesting part is that this simple operation, which happens at the bit level, is very useful for **composing cryptographic keys.** That's what we'll see in this post, using a bit of Python and the problem presented in challenge 6 of Cryptopals (set 1).
17+
18+
### How can we use XOR as a method of encryption? In fact, what is a cryptographic cipher?
19+
To answer this question, let's think in terms of functions. Encrypting a message means taking its plaintext (or, more precisely, its _bytes_) and generating a seemingly random output with the help of an _encryption algorithm_. **This algorithm defines the pattern we'll follow when replacing the original content with the encrypted one.**
20+
For example, with a shift of one, the Caesar cipher replaces each letter with the one that follows it in the alphabet, so "ABC" becomes "BCD". The same pattern is applied to the whole message.
21+
But the Caesar cipher can skip more than one letter - what matters here is the logic of substitution. In this way, the **XOR cipher is very similar.**
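To make the comparison concrete, here is a minimal sketch of a Caesar shift. The function name `caesar_shift` and its `shift` parameter are illustrative choices, and it assumes we only shift ASCII letters and leave everything else untouched:

```python
def caesar_shift(message: str, shift: int = 1) -> str:
    """Replace each ASCII letter with the one `shift` positions later, wrapping around."""
    shifted = []
    for ch in message:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            shifted.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            shifted.append(ch)  # spaces, digits and punctuation are kept as-is
    return "".join(shifted)

print(caesar_shift("ABC"))     # BCD
print(caesar_shift("xyz", 3))  # abc
```

The shift plays the same role that the key will play in the XOR cipher: whoever knows it can undo the substitution.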
22+
23+
### Bytes, ASCII and single-byte XOR
24+
Before introducing encryption with a repeating key, let's first understand how single-byte encryption works.
25+
A single-byte XOR cipher uses the XOR operation to change the value of each byte. We apply this operation to the whole text using a key, which is the constant value every byte is XORed against.
26+
```python
binary_string = b"hello"
for byte in binary_string:
    print(byte ^ 100)
```
32+
The outputs will be `12, 1, 8, 8` and `11`.
33+
This happens because each character in a byte string is stored as a number that, XORed against 100 (the key here), gives a different byte. The key could be any value within the range [0, 255].
34+
Therefore, here `100` acts as our key - we would need to know this value to decrypt the message. **Using an XOR cipher is a symmetric encryption method, which means that we need the same key both to encrypt and decrypt a message**. This works because XOR with a fixed key is an involutory function - it's its own inverse, so to decrypt a message we simply XOR its bytes against the key again.
35+
![xor explained](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rgl2ug8cbr2uzzsag85c.png)
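The round trip is easy to check in Python. A minimal sketch, reusing the key `100` from the example above (the names `KEY`, `ciphertext` and `recovered` are just illustrative):

```python
KEY = 100  # same illustrative key as in the example above

plaintext = b"hello"
ciphertext = bytes(byte ^ KEY for byte in plaintext)   # encrypt
recovered = bytes(byte ^ KEY for byte in ciphertext)   # decrypt: the very same operation

print(list(ciphertext))  # [12, 1, 8, 8, 11]
print(recovered)         # b'hello'
```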
36+
So, we already have a substitution cipher similar to the Caesar cipher, _but a bit more sophisticated._
37+
38+
_**Side note 1:** not all XORed bytes will be printable, since they can fall outside the printable ASCII range. In that case, we can convert them to base64, for example, to print them. See [this article (in Portuguese)](https://dev.to/wrongbyte/como-funciona-a-codificacao-em-base64-2njd)_.
39+
_**Side note 2:** the article above can also help you understand how ASCII characters work at the byte level_.
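As a small illustration of side note 1, the sketch below encodes the XORed bytes with Python's standard `base64` module so they can be printed safely. The variable names are illustrative:

```python
import base64

# XORing "hello" with 100 yields control characters that are not printable.
ciphertext = bytes(byte ^ 100 for byte in b"hello")
print(ciphertext)  # shown with escape sequences

# base64 maps arbitrary bytes onto a printable alphabet, and it is reversible.
encoded = base64.b64encode(ciphertext)
print(encoded.decode("ascii"))

decoded = base64.b64decode(encoded)
print(decoded == ciphertext)  # True
```

Keep in mind that base64 is only an encoding for display or transport. It adds no secrecy on top of the XOR step.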
40+
41+
### Repeating XOR cipher
42+
It turns out that encrypting something with a single-byte key is a very weak encryption method. To break it, we only need to find which key was used, which can be done by brute-forcing all 256 possible values. Then we look at the outputs of these operations and choose the one that is most "English-like", by assigning a score to each output based on the most frequent letters in the English language.
43+
44+
**PS: remember this function, we are going to see it later again!**
```python
# Breaking a single-byte XOR cipher: we perform a XOR operation
# in the string using all possible values for the key.
# The key used to generate the output closer to English is what we are searching for.
def assign_score(output_string):
    # count how many characters are among the most frequent ones in English text
    string_score = 0
    freq = [' ', 'e', 't', 'a', 'o', 'i', 'n', 's', 'h', 'r', 'd', 'l', 'u']
    for letter in output_string:
        if letter in freq:
            string_score += 1
    return string_score

def XOR_decode_bytes(encoded_array):
    key = 0  # fallback so `key` is always defined, even if no candidate scores above 0
    greatest_score = 0
    for n in range(256):  # checks for every possible value for XOR key
        xord_str = [byte ^ n for byte in encoded_array]
        xord_ascii = ''.join(chr(b) for b in xord_str)
        last_score = assign_score(xord_ascii)
        if last_score > greatest_score:
            greatest_score = last_score
            key = n
    return key
```
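To see the brute-force idea in action, here is a self-contained sketch that encrypts a short sentence with the key `100` and then recovers it. The helper names `score_english` and `break_single_byte_xor` are hypothetical compact restatements of the two functions above, not the post's original code:

```python
def score_english(candidate: bytes) -> int:
    """Count bytes that are among the most frequent characters in English text."""
    common = b" etaoinshrdlu"
    return sum(byte in common for byte in candidate)

def break_single_byte_xor(ciphertext: bytes) -> int:
    """Try all 256 keys and keep the one whose output scores highest."""
    best_key, best_score = 0, -1
    for key in range(256):
        score = score_english(bytes(b ^ key for b in ciphertext))
        if score > best_score:
            best_key, best_score = key, score
    return best_key

secret_key = 100
ciphertext = bytes(b ^ secret_key for b in b"the earth is round")

recovered_key = break_single_byte_xor(ciphertext)
print(recovered_key)                                  # 100
print(bytes(b ^ recovered_key for b in ciphertext))   # b'the earth is round'
```

A scoring function this simple can be fooled by very short or unusual texts, so treat the recovered key as a best guess and check the decrypted output by eye.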
69+
70+
So, we can make it harder to break by simply using a longer key, with more bytes, that repeats itself across the text.
71+
Breaking it requires two steps: first we need to find the length of the key, and then the key itself, which means testing each possible value for each of the key's bytes. A bit more complicated, right?
72+
73+
So, let's understand how to encrypt a text using a XOR repeating key first.
74+
75+
### Repeating key: encryption
```python
input_text1 = b"Burning 'em, if you ain't quick and nimble\nI go crazy when I hear a cymbal"
XOR_key = b"ICE"

def XOR_repeating_encode(input_string: bytes, key: bytes) -> bytes:
    xord_output = []
    print(input_string)
    for i in range(len(input_string)):
        # i % len(key) cycles through the key bytes: I, C, E, I, C, E, ...
        xord_output.append(input_string[i] ^ key[i % len(key)])

    return bytes(xord_output)
```
88+
The logic here is pretty much the same as the one we used for the single-byte key. In short, we XOR each byte of the text against the corresponding character of the key, which is `ICE` here, cycling through the key. So, "B" is XORed against "I", "u" against "C", "r" against "E", "n" against "I" again, and so forth until we reach the end of the text. Getting the plaintext back is achieved through the same process.
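The round trip can be checked with a short sketch. The helper `xor_repeating` below is an illustrative, more compact variant of the function above, not the post's original code:

```python
def xor_repeating(data: bytes, key: bytes) -> bytes:
    """XOR every byte of `data` against the key, repeating the key as needed."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

plaintext = b"Burning 'em, if you ain't quick and nimble\nI go crazy when I hear a cymbal"
key = b"ICE"

ciphertext = xor_repeating(plaintext, key)
print(ciphertext.hex())  # hex is a convenient way to display the raw bytes

# Applying the same function with the same key gives the plaintext back.
print(xor_repeating(ciphertext, key) == plaintext)  # True
```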
89+
90+
### But what if we wanted to recover the plaintext without knowing the key?
91+
Here things start to get interesting. Let's see how to break a repeated-key XOR ciphertext!
92+
93+
### 1 - The key's length: Hamming distance
94+
How far is "a" from "d"?
95+
You may say that they are a few letters apart in the alphabet. But there's another interesting way to measure their "distance": how many different bits they have - which is their **Hamming distance**.
96+
So, lowercase a is 97 in the ASCII table, and lowercase d is 100. Their binary representations are `1100001` and `1100100`. They differ in 2 bits, so **their Hamming distance is 2**.
97+
You can measure this distance across phrases, too - the process is the same, you only sum the result from each pair of characters.
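Here is a small sketch of both cases, single characters and whole phrases. The helper name `bit_distance` is illustrative, and the two test phrases are the ones Cryptopals suggests for checking a Hamming distance implementation:

```python
def bit_distance(first: bytes, second: bytes) -> int:
    """Sum, over each pair of bytes, the number of bit positions where they differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(first, second))

a, d = ord("a"), ord("d")
print(format(a, "07b"))      # 1100001
print(format(d, "07b"))      # 1100100
print(format(a ^ d, "07b"))  # 0000101 -> the 1s mark the differing bits
print(bit_distance(b"a", b"d"))  # 2

# For phrases, the per-character distances are simply added up.
print(bit_distance(b"this is a test", b"wokka wokka!!!"))  # 37
```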
98+
99+
### What does this measure have to do with the repeating XOR cipher?
100+
The average Hamming distance between two bytes picked at random is `4.0`, while for two lowercase ASCII letters - whose values range from 97 to 122 - it is about `2.5`.
101+
This means that the Hamming distance between the characters of a plaintext will be much lower than that of a bunch of random bytes - and this information is very useful when we test the possible outputs for a variety of possible key lengths.
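These two averages can be checked by brute force. The sketch below assumes every byte value (or every letter) is equally likely, which is a simplification: real English text has uneven letter frequencies, spaces and punctuation. The names are illustrative:

```python
from itertools import combinations, product

def bit_difference(x: int, y: int) -> int:
    """Number of bit positions in which two byte values differ."""
    return bin(x ^ y).count("1")

# Every ordered pair of byte values: each of the 8 bits differs half of the time.
all_pairs = list(product(range(256), repeat=2))
random_avg = sum(bit_difference(x, y) for x, y in all_pairs) / len(all_pairs)
print(random_avg)  # 4.0

# Every pair of distinct lowercase letters (values 97 to 122).
lowercase = range(ord("a"), ord("z") + 1)
letter_pairs = list(combinations(lowercase, 2))
letter_avg = sum(bit_difference(x, y) for x, y in letter_pairs) / len(letter_pairs)
print(round(letter_avg, 2))  # roughly 2.5
```

Lowercase letters all share the same three high bits, so only the five low bits can differ, which is why their average distance is so much smaller.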
102+
103+
Let's understand it better.
104+
```python
def hamming_distance(string1: bytes, string2: bytes) -> int:
    distance = 0
    for (byte1, byte2) in zip(string1, string2):
        # XOR leaves a 1 in every bit position where the two bytes differ
        distance += bin(byte1 ^ byte2).count('1')
    return distance
```
112+
Checking the different bits of two strings is basically the same as performing an XOR operation on them and counting the 1's, so the function above does exactly this.
113+
Alright. We now have a way to score a string to know the distance between its bytes. How can we use it now?
114+
115+
In this challenge, the possible key sizes lie within the interval [2, 40]. Therefore, we will follow these steps:
116+
**1) Split our text into chunks, trying every chunk size from 2 to 40.**
117+
**2) On each iteration - for each chunk size chosen - measure the Hamming distance between the chunks.**
118+
**3) The chunk size with the lowest average Hamming distance most likely corresponds to the key's length.**
119+
120+
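This technique works because, when the chunk size matches the key length, bytes at the same position in two chunks were XORed with the same key byte. XORing two such ciphertext bytes cancels the key, since `(p1 ^ k) ^ (p2 ^ k) = p1 ^ p2`, so the Hamming distance between ciphertext chunks equals the distance between the underlying plaintext chunks. Therefore, it will be way lower than if they were random bytes.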
```python
from itertools import combinations

def find_key_length():
    # we are searching for the length that produces an output with the lowest hamming score
    # `text` is the ciphertext, assumed to be defined at module level as bytes
    min_score = len(text)

    for keysize in range(2, 41):  # key sizes from 2 to 40, inclusive
        chunks = [text[start:start + keysize] for start in range(0, len(text), keysize)]
        subgroup = chunks[:4]
        # getting the average hamming distance per byte
        average_score = (sum(hamming_distance(a, b) for a, b in combinations(subgroup, 2)) / 6) / keysize
        if average_score < min_score:
            min_score = average_score
            key_length = keysize

    return key_length
```
137+
In the code above, the logic is as follows:
138+
Let's say that we are guessing that the key is 4 characters long. If we had the following text:
```python
text = "YWJjZGVmZ2hpamtsbW4="
```
142+
We would divide it into several chunks of four characters each:
```python
chunks = ['YWJj', 'ZGVm', 'Z2hp', 'amts', 'bW4=']
```
146+
Now, if we take the first four chunks, we are able to measure the average hamming distance between them:
```python
# keysize here is equal to 4 and subgroup is the first four chunks
# four chunks form 6 distinct pairs, so dividing the summed distance by 6 gives the average distance per pair of chunks;
# dividing that by keysize gives the average distance per pair of bytes
average_score = (sum(hamming_distance(a, b) for a, b in combinations(subgroup, 2)) / 6) / keysize
```
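The chunking and pairing from this walkthrough can be reproduced with a few lines. A minimal sketch using the same sample string (the variable `pairs` is illustrative):

```python
from itertools import combinations

text = "YWJjZGVmZ2hpamtsbW4="
keysize = 4

# Slice the text into consecutive pieces of `keysize` characters.
chunks = [text[start:start + keysize] for start in range(0, len(text), keysize)]
print(chunks)  # ['YWJj', 'ZGVm', 'Z2hp', 'amts', 'bW4=']

# The first four chunks can be paired in 6 different ways, hence the division by 6.
subgroup = chunks[:4]
pairs = list(combinations(subgroup, 2))
print(len(pairs))  # 6
```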
152+
153+
154+
### 2 - The key itself
155+
156+
Once we get the key's length, things get easier.
157+
In this particular challenge, it turns out that the key is 29 characters long. Even though we know the length, we still have 29 characters to guess, each one having 256 possibilities.
158+
159+
How to do that? Well, the answer lies in matrices.
160+
```python
from itertools import zip_longest

def find_key(key_length=None):
    # compute the key length at call time, not when the function is defined
    if key_length is None:
        key_length = find_key_length()
    key_blocks = [text[start:start + key_length] for start in range(0, len(text), key_length)]
    # transpose the 2D matrix; zip_longest pads the shorter last block with None,
    # so we drop only that padding (filter(None, ...) would also drop 0x00 bytes)
    key = []
    single_XOR_blocks = [[b for b in column if b is not None] for column in zip_longest(*key_blocks)]
    for block in single_XOR_blocks:
        key_n = XOR_decode_bytes(block)
        key.append(key_n)

    # build the key directly from the byte values
    return bytes(key)
```
174+
175+
If our key had exactly three characters - say the key was "RED" - then the first character of every chunk would be XORed against "R", the second against "E", and the third against "D".
176+
So, if we joined the nth character of every chunk, the result would be a list of characters encrypted with a single-byte XOR cipher, whose key is the nth character of our repeating key!
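A tiny sketch makes the regrouping visible. It uses the placeholder bytes 1 to 8 instead of a real ciphertext and assumes a key length of 3, as in the "RED" example:

```python
from itertools import zip_longest

ciphertext = bytes(range(1, 9))  # placeholder bytes 1..8 standing in for real ciphertext
key_length = 3                   # as if the key were "RED"

chunks = [ciphertext[i:i + key_length] for i in range(0, len(ciphertext), key_length)]

# Group the nth byte of every chunk; zip_longest pads the short last chunk with None.
columns = [[b for b in column if b is not None] for column in zip_longest(*chunks)]
print(columns)  # [[1, 4, 7], [2, 5, 8], [3, 6]]
```

Every byte in the first group was XORed against "R", every byte in the second against "E", and every byte in the third against "D".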
177+
178+
Oof, that's too much. Let's see it in detail:
179+
![cute](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/putpgy2fh4t59m8syaa7.png)
180+
181+
The operation of creating an array joining every nth element of each chunk is basically transposing a matrix.
182+
To finally discover the key, then, we just need to apply the function that finds the key for a single-byte XOR ciphertext to each row of our newly generated matrix.
183+
With the key in hand, XORing the ciphertext against it once more gives us our deciphered plaintext!
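To tie the last steps together, here is a self-contained sketch on a toy message. It assumes the key length (3) is already known from the Hamming distance step, and it uses the slice `ciphertext[i::key_length]` as a shortcut for "row i of the transposed matrix". All helper names are hypothetical, and the scoring is the same simple character count used earlier:

```python
COMMON = set(" etaoinshrdlu")  # frequent characters, as in the scoring function above

def xor_repeating(data: bytes, key: bytes) -> bytes:
    """XOR every byte of `data` against the key, repeating the key as needed."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

def best_single_byte_key(column: bytes) -> int:
    """Return the key byte whose decryption of `column` contains the most common characters."""
    def score(candidate: int) -> int:
        return sum(chr(b ^ candidate) in COMMON for b in column)
    return max(range(256), key=score)

plaintext = b"the sun is on the land and the rain is on the sea"
ciphertext = xor_repeating(plaintext, b"RED")

key_length = 3  # assumed to be known already
recovered_key = bytes(
    best_single_byte_key(ciphertext[i::key_length]) for i in range(key_length)
)
print(recovered_key)                             # b'RED'
print(xor_repeating(ciphertext, recovered_key))  # the original sentence
```

On a message this short the result depends heavily on the text chosen. With a real ciphertext, each group has many more bytes, which makes the frequency score far more reliable.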
184+
![result](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ekoposymtd36shxw14z6.png)
185+
186+
You can see the full code for this post [here](https://github.com/wrongbyte/study-stuff/blob/main/cryptopals/set1/6_break_repeatingXOR.py).
187+

src/posts/hashcat.mdx

Lines changed: 0 additions & 2 deletions
@@ -6,8 +6,6 @@ date: 16-06-2022
6 6

7 7
---
8 8

9-
10-
11 9
Given a hashed password `$2y$12$Dwt1BZj6pcyc3Dy1FWZ5ieeUznr71EeNkJkUlypTsgbX1H68wsRom`, we have only one hint: **the password has four letters, all lowercase.**
12 10

13 11
### Let's start: finding the hash type
