Commit 38a2432

Update regex doc for upcoming LOOT changes
The next release of LOOT (probably v0.28.0) will change what regex engines it uses and support Unicode matching.
1 parent bd6db05 commit 38a2432

File tree

1 file changed

+36
-32
lines changed

docs/contributing/Regular-Expression-Regex.html

Lines changed: 36 additions & 32 deletions
Original file line number | Diff line number | Diff line change
@@ -55,15 +55,19 @@ <h4>Advantages of Regex</h4>
5555

5656
<h4>Regex within LOOT</h4>
5757

58-
<p>Now that we taken the first step to understand what regular expressions are, it should be noted that LOOT uses three different regex implementations in different places.</p>
58+
<p>Now that we taken the first step to understand what regular expressions are, it should be noted that LOOT currently uses three different regex engines in different places.</p>
5959

6060
<ul>
61-
<li>Masterlist entry matching and the contains filter uses <code>C++</code>'s standard regex.</li>
61+
<li>Masterlist entry matching uses the Rust <code>regress</code> library.</li>
6262
<li>Condition evaluation uses the Rust <code>regex</code> library.</li>
63-
<li>The search dialog uses <code>Qt</code>'s regex implementation.</li>
63+
<li>The contains filter and search dialog uses <code>Qt</code>'s regex implementation.</li>
6464
</ul>
6565

66-
<p>This means, that depending on where you want to use a regular expression within LOOT, <a href="https://en.wikipedia.org/wiki/Unicode">unicode characters</a> for character matching might be supported or not.</p>
66+
<p>These different regex engines support different flavours of regex syntax, but they behave similarly. The most significant differences are:</p>
67+
<ul>
68+
<li>Regexes in conditions do not support lookaround.</li>
69+
<li>The three engines define <code>\d</code>, <code>\D</code>, <code>\w</code>, <code>\W</code>, <code>\s</code> and <code>\S</code> differently for non-<a href="https://en.wikipedia.org/wiki/ASCII#Character_set">ASCII</a> characters.</li>
70+
</ul>
6771

6872
<pre><code> - name: 'Example\.es(m|p)'
6973
msg:
@@ -72,17 +76,17 @@ <h4>Regex within LOOT</h4>
7276
condition: 'many("Example\.es(m|p)")'
7377
</code></pre>
7478

75-
<p>The regex in the <code>name</code> field uses <code>C++</code>'s standard regex, which doesn't support unicode.</p>
79+
<p>The regex in the <code>name</code> field uses the Rust <code>regress</code>library.</p>
7680

77-
<p>The regex in the <code>condition</code> field uses the Rust <code>regex</code> library, which supports unicode by default.</p>
81+
<p>The regex in the <code>condition</code> field uses the Rust <code>regex</code> library.</p>
7882

7983
<img src="../../images/Regex_Search_Cards.png" alt="Regex_Search_Cards" width="602" height="409">
8084

81-
<p>The search dialog uses <code>Qt</code>'s regex implementation, which doesn't support unicode characters by default (though support may be enabled in a future LOOT release).</p>
85+
<p>The search dialog uses <code>Qt</code>'s regex implementation.</p>
8286

8387
<img src="../../images/Regex_Filter_Contain.png" alt="Regex_Filter_Contain" width="890" height="570">
8488

85-
<p>The contains filter uses <code>C++</code>'s standard regex, which doesn't support unicode.</p>
89+
<p>The contains filter uses <code>Qt</code>'s regex implementation.</p>
8690

8791
<p>The following section will concentrate on the use of regex in the name field of plugin objects in the masterlist, so that we get a better understanding of what is possible while adding and updating plugin entries.</p>
8892

@@ -92,42 +96,42 @@ <h4>Characters and Rules</h4>
9296
<p>The following regular expressions can be used to match characters:</p>
9397
<p><code>\d \D \w \W \s \S</code></p>
9498

99+
<p>Different regex engines define <code>\d</code>, <code>\w</code> and <code>\s</code> differently for non-<a href="https://en.wikipedia.org/wiki/ASCII#Character_set">ASCII</a> characters:</p>
100+
<ul>
101+
<li><code>\d</code> matches any digit character. The <code>regress</code> engine defines <code>\d</code> as matching the digits <code>0</code> to <code>9</code>, but the <code>regex</code> and <code>Qt</code> engines will also match non-ASCII decimal digits such as <code>੩</code>.</li>
102+
<li><code>\w</code> matches any word character. The three engines all define what counts as a word character differently, with too many differences to detail here.</li>
103+
<li><code>\s</code> matches any whitespace character.
104+
<ul>
105+
<li><code>regex</code> matches all Unicode Whitespace characters.</li>
106+
<li><code>Qt</code> matches all Unicode Whitespace characters, plus <code><a href="https://unicode-explorer.com/c/180E">U+180E</a></code>.</li>
107+
<li><code>regress</code> matches all Unicode Whitespace characters except <code><a href="https://unicode-explorer.com/c/0085">U+0085</a></code>, and also matches <code><a href="https://unicode-explorer.com/c/FEFF">U+FEFF</a></code>.</li>
108+
</ul>
109+
</li>
110+
</ul>
111+
112+
<p>The uppercase expressions are easy to define: <code>\D</code> matches everything that doesn't match <code>\d</code>, <code>\W</code> matches everything that doesn't match <code>\w</code>, and <code>\S</code> matches everything that doesn't match <code>\s</code>.</p>
113+
95114
<p>In order to get a feeling what sorts of characters these expressions are capable of matching, let's take the following set of characters as reference:</p>
96115

97-
<p>Digits:</p>
98-
<p><code>0 1 2 3 4 5 6 7 8 9</code></p>
116+
<p>Digits: <code>0 1 2 3 4 5 6 7 8 9</code></p>
99117

100-
<p>Latin Alphabet:</p>
101-
<p><code>A B C D E F G H I J K L M N O P Q R S T U V W X Y Z</code></p>
118+
<p>Latin Alphabet: <code>A B C D E F G H I J K L M N O P Q R S T U V W X Y Z</code></p>
102119

103-
<p>Additional Latin Characters:</p>
104-
<p><code>Ä Æ Ö Ü</code></p>
120+
<p>Additional Latin Characters: <code>Ä Æ Ö Ü</code></p>
105121

106-
<p>Special Characters:</p>
107-
<p><code>! # $ % & ( ) , . ' ` - ; [ ] ^ _ { } ~ € + = Œ ੩</code></p>
122+
<p>Special Characters: <code>! # $ % & ( ) , . ' ` - ; [ ] ^ _ { } ~ € + = Œ ੩</code></p>
108123

109-
<p>Greek Alphabet:</p>
110-
<p><code>α β γ δ ε ζ η θ ι κ λ μ ν ξ ο π ρ σ ς τ υ φ χ ψ ω</code></p>
124+
<p>Greek Alphabet: <code>α β γ δ ε ζ η θ ι κ λ μ ν ξ ο π ρ σ ς τ υ φ χ ψ ω</code></p>
111125

112-
<p>Japanese Hiragana:</p>
113-
<p><code>あ い う え お か き く け こ さ し す せ そ た ち つ て と な に ぬ ね の は ひ ふ へ ほ ま み む め も や ゆ よ ら り る れ ろ わ を ん</code></p>
126+
<p>Japanese Hiragana: <code>あ い う え お か き く け こ さ し す せ そ た ち つ て と な に ぬ ね の は ひ ふ へ ほ ま み む め も や ゆ よ ら り る れ ろ わ を ん</code></p>
114127

115-
<p>Japanese Katakana:</p>
116-
<p><code>ア イ ウ エ オ カ キ ク ケ コ サ シ ス セ ソ タ チ ツ テ ト ナ ニ ヌ ネ ノ ハ ヒ フ ヘ ホ マ ミ ム メ モ ヤ ユ ヨ ラ リ ル レ ロ ワ ヲ ン</code></p>
128+
<p>Japanese Katakana: <code>ア イ ウ エ オ カ キ ク ケ コ サ シ ス セ ソ タ チ ツ テ ト ナ ニ ヌ ネ ノ ハ ヒ フ ヘ ホ マ ミ ム メ モ ヤ ユ ヨ ラ リ ル レ ロ ワ ヲ ン</code></p>
117129

118130
<p>Through testing the following rules could be derived:</p>
119131

120132
<ul>
121-
<li><code>\d</code> can be used exclusively for digits 0 to 9</li>
122-
<li><code>\D</code> can be used to identify (non-digit) characters from the Latin Alphabet (except <code>Ä Æ Ö Ü</code>) and all the Special Characters</li>
123-
124-
<li><code>\w</code> can be used for the digits 0 to 9, for any of the normal letters from the Latin Alphabet and for the underscore _</li>
125-
<li><code>\W</code> can detect exclusively all the special characters except the underscore _ (and no digits or Latin characters)</li>
126-
127-
<li><code>\s</code> detects whitespace</li>
128-
<li><code>\S</code> can detect non-whitespace characters: digits, Latin characters (except additional ones) and the special characters</li>
129-
130-
<li>None of the expressions <code>\d \D \w \W \S</code> apply for the Greek and Japanese characters</li>
133+
<li><code>\d</code> can be used exclusively for digits 0 to 9 when using <code>regress</code>, but will also match <code>੩</code> when using <code>regex</code> or <code>Qt</code>.</li>
134+
<li><code>\w</code> can be used for the digits 0 to 9, for any of the normal letters from the Latin Alphabet and for the underscore <code>_</code> when using <code>regress</code>. It will match all of the reference characters above except the symbols <code>! # $ % & ( ) , . ' ` - ; [ ] ^ { } ~ € + =</code> when using <code>Qt</code> or <code>regex</code>.</li>
131135
</ul>
132136

133137
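
Reviewer note: the ASCII versus Unicode distinction that the updated doc describes for `\d` and `\w` can be sketched with a small script. This is only an analogy. It uses Python's standard `re` module, which is not one of the three engines LOOT uses, and it assumes that Python's `re.ASCII` mode roughly resembles the behaviour described for `regress`, while Python's default Unicode mode roughly resembles the behaviour described for `regex` and `Qt`. The `SAMPLES` list and the `classify` helper are hypothetical names introduced for this illustration, and the sample characters are taken from the reference set in the doc.

```python
import re

# A few of the reference characters listed in the documentation.
SAMPLES = ["7", "੩", "A", "Ä", "あ", "_", "€"]


def classify(pattern: str, flags: int = 0) -> list[str]:
    """Return the sample characters that fully match the pattern."""
    compiled = re.compile(pattern, flags)
    return [ch for ch in SAMPLES if compiled.fullmatch(ch)]


# Python's default (Unicode-aware) mode: an assumed stand-in for the
# behaviour the doc attributes to the regex and Qt engines.
unicode_digits = classify(r"\d")
unicode_word = classify(r"\w")

# Python's ASCII-only mode: an assumed stand-in for the behaviour the
# doc attributes to the regress engine.
ascii_digits = classify(r"\d", re.ASCII)
ascii_word = classify(r"\w", re.ASCII)

print("Unicode \\d:", unicode_digits)
print("ASCII   \\d:", ascii_digits)
print("Unicode \\w:", unicode_word)
print("ASCII   \\w:", ascii_word)
```

In this sketch the ASCII-only run accepts just `7` for `\d` and `7`, `A`, `_` for `\w`, while the Unicode-aware run additionally accepts `੩` for `\d` and `੩`, `Ä`, `あ` for `\w`. Neither mode treats `€` as a word character. Whitespace (`\s`) is left out on purpose, because the per-engine exceptions listed in the doc (`U+180E`, `U+0085`, `U+FEFF`) do not map cleanly onto Python's definition.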
