docs/contributing/Regular-Expression-Regex.html
Lines changed: 36 additions & 32 deletions
@@ -55,15 +55,19 @@ <h4>Advantages of Regex</h4>
55
55
56
56
<h4>Regex within LOOT</h4>
57
57
58
-
<p>Now that we taken the first step to understand what regular expressions are, it should be noted that LOOT uses three different regex implementations in different places.</p>
58
+
<p>Now that we have taken the first step to understand what regular expressions are, it should be noted that LOOT currently uses three different regex engines in different places.</p>
59
59
60
60
<ul>
61
-
<li>Masterlist entry matching and the contains filter uses <code>C++</code>'s standard regex.</li>
61
+
<li>Masterlist entry matching uses the Rust <code>regress</code> library.</li>
62
62
<li>Condition evaluation uses the Rust <code>regex</code> library.</li>
<li>The contains filter and search dialog use <code>Qt</code>'s regex implementation.</li>
64
64
</ul>
65
65
66
-
<p>This means, that depending on where you want to use a regular expression within LOOT, <a href="https://en.wikipedia.org/wiki/Unicode">unicode characters</a> for character matching might be supported or not.</p>
66
+
<p>These different regex engines support different flavours of regex syntax, but they behave similarly. The most significant differences are:</p>
67
+
<ul>
68
+
<li>Regexes in conditions do not support lookaround.</li>
69
+
<li>The three engines define <code>\d</code>, <code>\D</code>, <code>\w</code>, <code>\W</code>, <code>\s</code> and <code>\S</code> differently for non-<a href="https://en.wikipedia.org/wiki/ASCII#Character_set">ASCII</a> characters.</li>
70
+
</ul>
67
71
68
72
<pre><code> - name: 'Example\.es(m|p)'
69
73
msg:
@@ -72,17 +76,17 @@ <h4>Regex within LOOT</h4>
72
76
condition: 'many("Example\.es(m|p)")'
73
77
</code></pre>
74
78
75
-
<p>The regex in the <code>name</code> field uses <code>C++</code>'s standard regex, which doesn't support unicode.</p>
79
+
<p>The regex in the <code>name</code> field uses the Rust <code>regress</code> library.</p>
76
80
77
-
<p>The regex in the <code>condition</code> field uses the Rust <code>regex</code> library, which supports unicode by default.</p>
81
+
<p>The regex in the <code>condition</code> field uses the Rust <code>regex</code> library.</p>
<p>The search dialog uses <code>Qt</code>'s regex implementation, which doesn't support unicode characters by default (though support may be enabled in a future LOOT release).</p>
<p>The following section will concentrate on the use of regex in the name field of plugin objects in the masterlist, so that we get a better understanding of what is possible while adding and updating plugin entries.</p>
88
92
@@ -92,42 +96,42 @@ <h4>Characters and Rules</h4>
92
96
<p>The following regular expressions can be used to match characters:</p>
93
97
<p><code>\d \D \w \W \s \S</code></p>
94
98
99
+
<p>Different regex engines define <code>\d</code>, <code>\w</code> and <code>\s</code> differently for non-<a href="https://en.wikipedia.org/wiki/ASCII#Character_set">ASCII</a> characters:</p>
100
+
<ul>
101
+
<li><code>\d</code> matches any digit character. The <code>regress</code> engine defines <code>\d</code> as matching the digits <code>0</code> to <code>9</code>, but the <code>regex</code> and <code>Qt</code> engines will also match non-ASCII decimal digits such as <code>੩</code>.</li>
102
+
<li><code>\w</code> matches any word character. The three engines all define what counts as a word character differently, with too many differences to detail here.</li>
103
+
<li><code>\s</code> matches any whitespace character.
104
+
<ul>
105
+
<li><code>regex</code> matches all Unicode Whitespace characters.</li>
106
+
<li><code>Qt</code> matches all Unicode Whitespace characters, plus <code><a href="https://unicode-explorer.com/c/180E">U+180E</a></code>.</li>
107
+
<li><code>regress</code> matches all Unicode Whitespace characters except <code><a href="https://unicode-explorer.com/c/0085">U+0085</a></code>, and also matches <code><a href="https://unicode-explorer.com/c/FEFF">U+FEFF</a></code>.</li>
108
+
</ul>
109
+
</li>
110
+
</ul>
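The whitespace differences listed above can be sketched as plain sets of code points. This is only a model built from the three bullets, not code taken from any of the engines: the names `UNICODE_WHITE_SPACE`, `WHITESPACE_SETS` and `engines_matching` are hypothetical, and the base set is assumed to be the Unicode `White_Space` property.

```python
# Code points with the Unicode White_Space property (assumed to be what the
# bullets above mean by "Unicode Whitespace characters").
UNICODE_WHITE_SPACE = frozenset(
    [*range(0x0009, 0x000E), 0x0020, 0x0085, 0x00A0, 0x1680,
     *range(0x2000, 0x200B), 0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
)

# Hypothetical model of what \s matches per engine, derived only from the
# list above; it does not call regex, regress or Qt.
WHITESPACE_SETS = {
    "regex": UNICODE_WHITE_SPACE,
    "Qt": UNICODE_WHITE_SPACE | {0x180E},
    "regress": (UNICODE_WHITE_SPACE - {0x0085}) | {0xFEFF},
}


def engines_matching(code_point):
    """Return the engine names whose modelled \\s set contains code_point."""
    return sorted(
        name for name, chars in WHITESPACE_SETS.items() if code_point in chars
    )


for cp in (0x0020, 0x0085, 0x180E, 0xFEFF):
    print(f"U+{cp:04X}: {engines_matching(cp)}")
```

Under this model an ordinary space is matched by all three engines, while `U+0085`, `U+180E` and `U+FEFF` are each matched by only some of them, which is why a pattern relying on `\s` may behave differently depending on where in LOOT it is used.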
111
+
112
+
<p>The uppercase expressions are easy to define: <code>\D</code> matches everything that doesn't match <code>\d</code>, <code>\W</code> matches everything that doesn't match <code>\w</code>, and <code>\S</code> matches everything that doesn't match <code>\s</code>.</p>
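As a rough illustration of the ASCII versus Unicode split for `\d` and its complement `\D`, the sketch below uses Python's built-in `re` module as a stand-in. Python is not one of the engines LOOT uses; its `re.ASCII` flag is assumed here to approximate the ASCII-only behaviour described for `regress`, and its default mode the Unicode-aware behaviour described for `regex` and `Qt`. The helper name `matches` is hypothetical.

```python
import re

# Stand-ins only: Python's `re` is not used by LOOT.
ASCII_DIGIT = re.compile(r"\d", re.ASCII)      # digits 0 to 9 only
UNICODE_DIGIT = re.compile(r"\d")              # any Unicode decimal digit
ASCII_NON_DIGIT = re.compile(r"\D", re.ASCII)  # complement of ASCII_DIGIT
UNICODE_NON_DIGIT = re.compile(r"\D")          # complement of UNICODE_DIGIT


def matches(pattern, ch):
    """True if the single character ch matches the compiled pattern."""
    return pattern.fullmatch(ch) is not None


for ch in ["7", "੩", "A"]:
    print(
        repr(ch),
        "ascii \\d:", matches(ASCII_DIGIT, ch),
        "unicode \\d:", matches(UNICODE_DIGIT, ch),
        "ascii \\D:", matches(ASCII_NON_DIGIT, ch),
        "unicode \\D:", matches(UNICODE_NON_DIGIT, ch),
    )
```

For every character, exactly one of `\d` and `\D` matches within a given mode, so the uppercase class always flips when the lowercase definition changes.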
113
+
95
114
<p>In order to get a feeling for what sorts of characters these expressions are capable of matching, let's take the following set of characters as reference:</p>
96
115
97
-
<p>Digits:</p>
98
-
<p><code>0 1 2 3 4 5 6 7 8 9</code></p>
116
+
<p>Digits: <code>0 1 2 3 4 5 6 7 8 9</code></p>
99
117
100
-
<p>Latin Alphabet:</p>
101
-
<p><code>A B C D E F G H I J K L M N O P Q R S T U V W X Y Z</code></p>
118
+
<p>Latin Alphabet: <code>A B C D E F G H I J K L M N O P Q R S T U V W X Y Z</code></p>
102
119
103
-
<p>Additional Latin Characters:</p>
104
-
<p><code>Ä Æ Ö Ü</code></p>
120
+
<p>Additional Latin Characters: <code>Ä Æ Ö Ü</code></p>
<p>Through testing, the following rules could be derived:</p>
119
131
120
132
<ul>
121
-
<li><code>\d</code> can be used exclusively for digits 0 to 9</li>
122
-
<li><code>\D</code> can be used to identify (non-digit) characters from the Latin Alphabet (except <code>Ä Æ Ö Ü</code>) and all the Special Characters</li>
123
-
124
-
<li><code>\w</code> can be used for the digits 0 to 9, for any of the normal letters from the Latin Alphabet and for the underscore _</li>
125
-
<li><code>\W</code> can detect exclusively all the special characters except the underscore _ (and no digits or Latin characters)</li>
126
-
127
-
<li><code>\s</code> detects whitespace</li>
128
-
<li><code>\S</code> can detect non-whitespace characters: digits, Latin characters (except additional ones) and the special characters</li>
129
-
130
-
<li>None of the expressions <code>\d \D \w \W \S</code> apply for the Greek and Japanese characters</li>
133
+
<li><code>\d</code> can be used exclusively for digits 0 to 9 when using <code>regress</code>, but will also match <code>੩</code> when using <code>regex</code> or <code>Qt</code>.</li>
134
+
<li><code>\w</code> can be used for the digits 0 to 9, for any of the normal letters from the Latin Alphabet and for the underscore <code>_</code> when using <code>regress</code>. It will match all of the reference characters above except the symbols <code>! # $ % & ( ) , . ' ` - ; [ ] ^ { } ~ € + =</code> when using <code>Qt</code> or <code>regex</code>.</li>