Names list indexer #1304
Open: eggrobin wants to merge 64 commits into unicode-org:main from eggrobin:indexer
Changes from 44 commits
fc428f3 What have I done (eggrobin)
91c41cf meow (eggrobin)
3eb0363 count leaves (eggrobin)
575e2ed meow (eggrobin)
29a2fa9 kwoc comparator, subentries (eggrobin)
dad5eb1 lemmatization tweaks (eggrobin)
4a1a15f JS (eggrobin)
c653d2f html, no kwoc (eggrobin)
1acc3f2 spotless (eggrobin)
c7add69 dead code (eggrobin)
dbc8ae7 meow (eggrobin)
81abfc5 meow (eggrobin)
bf50f64 fffe (eggrobin)
9568d54 autocomplete (eggrobin)
ee0b722 spotless (eggrobin)
8a4fd22 CSS (eggrobin)
a00a2d8 Merge remote-tracking branch 'la-vache/main' into indexer (eggrobin)
0d25640 Show latest & α Δ charts as appropriate; use the right kind of word s… (eggrobin)
002b363 Rename (eggrobin)
d662381 suggestions (eggrobin)
d81c7a3 fix chart links for ranges too, this needs to be factored (eggrobin)
64e1a96 showDevProperties on new characters (eggrobin)
b3660cd title (eggrobin)
b0836fd comments (eggrobin)
beea7bd Search by code point (eggrobin)
bd57592 BOOP (eggrobin)
1444fd7 Pretty block (eggrobin)
37925d5 ungleichmäßige unzugewiesene (eggrobin)
3570a36 Lemmatize the corpus but not the query as suggested by Markus (eggrobin)
60798ae Merge branch 'walking-down-the-plane' into indexer (eggrobin)
8c85c8b Pretty block (eggrobin)
550046c Merge remote-tracking branch 'la-vache/main' into indexer (eggrobin)
b4a5b43 terminology (eggrobin)
9cf091d Drop the CLI search (eggrobin)
63dfd03 meow (eggrobin)
0fb75ed Sentence segmentation (eggrobin)
3b52f93 Strip sentences, search by literal, name ranges (eggrobin)
9f1089f nfkc (eggrobin)
d5627e5 No words for code point search (eggrobin)
cd2b92c split informal aliases (eggrobin)
0cb5028 Some work towards radicals (eggrobin)
20c0185 Limit subentries not entries (eggrobin)
5cc5caf Seems usable (eggrobin)
90210de Don’t include . in most words (eggrobin)
01ee291 More selectively override segmentation (eggrobin)
fd7797b Fix BOOP, show radical/stroke entry for code point search (eggrobin)
fa6bc8d less hacky seal (eggrobin)
29d4008 Prettier presentation and usable rsindex (eggrobin)
ac948ac rename counter (eggrobin)
00764b1 deduplicate locations (eggrobin)
b7f3af9 format (eggrobin)
c10792e In the night, no control (eggrobin)
c0763bc ic (eggrobin)
162f374 CSS shenanigans (eggrobin)
21183cd g (eggrobin)
cb9c040 M (eggrobin)
c63ceaa q (eggrobin)
cba8f50 looong (eggrobin)
d7a9a99 template (eggrobin)
b603535 bad link (eggrobin)
d16c5a2 eggsamples (eggrobin)
b3c88d2 Fix High Private Use Surrogates chart links, more specific noncharact… (eggrobin)
01651cd Renamings (eggrobin)
37fb3bb appease a linter (eggrobin)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,28 @@ | ||
| input { | ||
| width:100%; | ||
| max-width:40em; | ||
| } | ||
| | ||
| ul#results { | ||
| max-width: 40em; | ||
| list-style: none; | ||
| padding: 0; | ||
| overflow-x: hidden; | ||
| } | ||
| | ||
| .tail { | ||
| display: inline-block; | ||
| padding-left: 2em; | ||
| text-indent: -2em; | ||
| box-sizing: border-box; | ||
| } | ||
| .head { | ||
| display: inline-block; | ||
| padding-left: 1em; | ||
| width: max-content; | ||
| max-width: 100%; | ||
| box-sizing: border-box; | ||
| } | ||
| span.ranges { | ||
| float: right; | ||
| } |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,210 @@ | ||
| // Lemma to snippet to position of the word in the snippet. | ||
| /**@type {Map<string, Map<string, number>>}*/ | ||
| let wordIndex/*= GENERATED LINE*/; | ||
| // Property name to snippet to index entry. | ||
| /**@type {Map<string, Map<string, {html: string, characters: [number, number][]}>>}*/ | ||
| let indexEntries/*= GENERATED LINE*/; | ||
| | ||
| /**@type {Map<number, string>}*/ | ||
| let characterNames = new Map(); | ||
| /**@type {Map<[number, number], string>}*/ | ||
| let characterNameRanges = new Map(); | ||
| | ||
| let maxResults = 100; | ||
| | ||
| for (let [name, entry] of indexEntries.get("Name")) { | ||
| if (entry.characters[0][0] == entry.characters[0][1]) { | ||
| characterNames.set(entry.characters[0][0], name); | ||
| } else { | ||
| for (let range of entry.characters) { | ||
| characterNameRanges.set(range, name); | ||
| } | ||
| } | ||
| } | ||
| for (let [name, entry] of indexEntries.get("Name_Alias")) { | ||
| if (!characterNames.has(entry.characters[0][0])) { | ||
| characterNames.set(entry.characters[0][0], name); | ||
| } | ||
| } | ||
| | ||
| function updateResults(event) { | ||
| /**@type {string}*/ | ||
| let query = event.target.value; | ||
| let {entries, rangeCount} = search(query); | ||
| if (rangeCount >= maxResults) { | ||
| document.getElementById("info").innerHTML = `Showing first ${maxResults} results`; | ||
| } else { | ||
| document.getElementById("info").innerHTML = rangeCount + " results"; | ||
| } | ||
| document.getElementById("results").innerHTML = "<tr><td>" + entries.join("</td></tr><tr><td>") + "</td></tr>"; | ||
| } | ||
| | ||
| function search(/**@type {string}*/ query) { | ||
| let wordBreak = new Intl.Segmenter("en", { granularity: "word" }); | ||
| let queryWords = Array.from(wordBreak.segment(query.replace(/\.-/g, "pm").replace(/['.]/g, "p"))) | ||
| .filter(s => s.isWordLike) | ||
| .map(s => query.substring(s.index, s.index + s.segment.length)); | ||
| let foldedQuery = queryWords.map(fold); | ||
| var rangeCount = 0; | ||
| var covered = []; | ||
| /**@type {string[]}*/ | ||
| var result = []; | ||
| /**@type {Set<string>}*/ | ||
| var resultSnippets = new Set(wordIndex.get(foldedQuery[0])?.keys() ?? []); | ||
| let firstLemmata = [foldedQuery[0]]; | ||
| if (resultSnippets.size === 0 && foldedQuery.length == 1) { | ||
| let prefix = fold(queryWords.at(-1)); | ||
| for (let [completion, leaves] of wordIndex) { | ||
| if (completion.startsWith(prefix)) { | ||
| firstLemmata.push(completion); | ||
| resultSnippets = resultSnippets.union(leaves); | ||
| } | ||
| } | ||
| } | ||
| for (var i = 1; i < foldedQuery.length; ++i) { | ||
| var rhs = new Set(wordIndex.get(foldedQuery[i])?.keys() ?? []); | ||
| let intersection = resultSnippets.intersection(rhs); | ||
| if (intersection.size === 0 && i == foldedQuery.length - 1) { | ||
| let prefix = fold(queryWords.at(-1)); | ||
| for (let [completion, leaves] of wordIndex) { | ||
| if (completion.startsWith(prefix)) { | ||
| rhs = rhs.union(leaves); | ||
| } | ||
| } | ||
| resultSnippets = resultSnippets.intersection(rhs); | ||
| } else { | ||
| resultSnippets = intersection; | ||
| } | ||
| } | ||
| let pivots = firstLemmata.map(l => wordIndex.get(l)).filter(x => !!x); | ||
| let getPivot = (/**@type {string}*/s) => pivots.map(p => p.get(s)).filter(x => x !== undefined)[0]; | ||
| let collator = new Intl.Collator("en"); | ||
| resultSnippets = Array.from(resultSnippets).sort( | ||
| (left, right) => collator.compare( | ||
| left.substring(getPivot(left)) + | ||
| ' \uFFFE ' + | ||
| left.substring(0, getPivot(left)), | ||
| right.substring(getPivot(right)) + | ||
| ' \uFFFE ' + | ||
| right.substring(0, getPivot(right)))); | ||
| for (let [property, propertyIndex] of indexEntries) { | ||
| /**@type {[number, number][]}*/ | ||
| for (let snippet of resultSnippets) { | ||
| let entry = propertyIndex.get(snippet); | ||
| if (!entry) { | ||
| continue; | ||
| } | ||
| let entrySet = entry.characters; | ||
| if (superset(covered, entrySet)) { | ||
| continue; | ||
| } | ||
| rangeCount += entrySet.length; | ||
| covered = covered.concat(entrySet); | ||
| let pivot = getPivot(snippet); | ||
| let tail = snippet.substring(pivot); | ||
| result.push(entry.html.replace( | ||
| "[RESULT TEXT]", | ||
| "<span class=tail" + | ||
| (snippet.includes(",") ? " style=width:100%" : "") + ">" + | ||
| toHTML(tail) + | ||
| (pivot > 0 && !tail.endsWith(".") ? "," : "") + | ||
| "</span> " + | ||
| (pivot > 0 ? "<span class=head>" + | ||
| toHTML(snippet.substring(0, pivot)) + | ||
| "</span>" | ||
| : ""))); | ||
| if (rangeCount >= maxResults) { | ||
| return {entries: result, rangeCount}; | ||
| } | ||
| } | ||
| } | ||
| if (queryWords.length <= 1 && query.length > 0) { | ||
| let codePoints = []; | ||
| if (/^[0-9A-F]+$/ui.test(query)) { | ||
| codePoints.push(parseInt(query, 16)); | ||
| } | ||
| if (/^.$/ui.test(query)) { | ||
| codePoints.push(query.codePointAt(0)); | ||
| } | ||
| for (let cp of codePoints) { | ||
| var name = characterNames.get(cp); | ||
| if (!name) { | ||
| for (let [[first, last], n] of characterNameRanges) { | ||
| if (first <= cp && cp <= last) { | ||
| name = n; | ||
| break; | ||
| } | ||
| } | ||
| } | ||
| if (name) { | ||
| rangeCount += 1; | ||
| result.push( | ||
| (indexEntries.get("Name").get(name) ?? | ||
| indexEntries.get("Name_Alias").get(name)).html.replace( | ||
| "[RESULT TEXT]", toHTML(name))); | ||
| } | ||
| } | ||
| } else if (queryWords.length == 1 && /^boop$/i.test(queryWords[0])) { | ||
| rangeCount += 1; | ||
| result.push( | ||
| indexEntries.get("Block").get("Betty").html.replace( | ||
| "[RESULT TEXT]", toHTML("Betty"))); | ||
| } else if (queryWords.length == 1 && /^dood$/i.test(queryWords[0])) { | ||
| rangeCount += 1; | ||
| result.push( | ||
| indexEntries.get("Block").get("the").html.replace( | ||
| "[RESULT TEXT]", toHTML("the"))); | ||
| } | ||
| return {entries: result, rangeCount}; | ||
| } | ||
| | ||
| function toHTML(/**@type {string}*/ plain) { | ||
| return plain.replaceAll("&", "&amp;") | ||
| .replaceAll("<", "&lt;") | ||
| .replaceAll(">", "&gt;"); | ||
| } | ||
| | ||
| function superset(/**@type {[number, number][]}*/left, /**@type {[number, number][]}*/right) { | ||
| var remaining = right.slice(); | ||
| for (const containingRange of left) { | ||
| remaining = remaining.flatMap(r => rangeMinus(r, containingRange)); | ||
| } | ||
| if (remaining.length > 0) { | ||
| return false; | ||
| } | ||
| return true; | ||
| } | ||
| | ||
| function rangeMinus(/**@type {[number, number]}*/left, /**@type {[number, number]}*/right) { | ||
| let intersection = rangeIntersection(left, right); | ||
| if (intersection === left || intersection === right) { | ||
| return []; | ||
| } else if (intersection === null) { | ||
| return [left]; | ||
| } else { | ||
| /**@type {[number, number][]}*/ | ||
| let result = []; | ||
| if (left[0] < intersection[0]) { | ||
| result.push([left[0], intersection[0] - 1]); | ||
| } | ||
| if (left[1] > intersection[1]) { | ||
| result.push([intersection[1] + 1, left[1]]); | ||
| } | ||
| return result; | ||
| } | ||
| } | ||
| | ||
| function rangeIntersection(/**@type {[number, number]}*/left, /**@type {[number, number]}*/right) { | ||
| let [leftStart, leftEnd] = left; | ||
| let [rightStart, rightEnd] = right; | ||
| if (leftEnd < rightStart || rightEnd < leftStart) { | ||
| return null; | ||
| } else { | ||
| return [Math.max(leftStart, rightStart), Math.min(leftEnd, rightEnd)]; | ||
| } | ||
| } | ||
| | ||
| function fold(/**@type {string}*/ word) { | ||
| var folding = word.normalize("NFKC").toLowerCase(); | ||
| return folding.replaceAll("š", "sh"); | ||
| } | ||
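The range helpers in this file (`superset`, `rangeMinus`, `rangeIntersection`) decide whether a result's code points are already covered by earlier results. The sketch below restates that idea in standalone form, assuming ranges are closed `[first, last]` pairs, as the `first <= cp && cp <= last` check in `search` suggests. The names `intersectRanges`, `subtractRange`, and `isCovered` are hypothetical and chosen for illustration; this is not the PR's code.

```javascript
// Illustrative sketch; every range is a closed interval [first, last].

// Returns the overlap of two ranges, or null when they are disjoint.
function intersectRanges(a, b) {
  const start = Math.max(a[0], b[0]);
  const end = Math.min(a[1], b[1]);
  return start <= end ? [start, end] : null;
}

// Returns the parts of `a` not covered by `b`: zero, one, or two ranges.
function subtractRange(a, b) {
  const common = intersectRanges(a, b);
  if (common === null) {
    return [a];
  }
  const rest = [];
  if (a[0] < common[0]) {
    rest.push([a[0], common[0] - 1]);
  }
  if (a[1] > common[1]) {
    rest.push([common[1] + 1, a[1]]);
  }
  return rest;
}

// True when every code point in `wanted` lies in some range of `covering`.
function isCovered(covering, wanted) {
  let remaining = wanted.slice();
  for (const c of covering) {
    remaining = remaining.flatMap(r => subtractRange(r, c));
  }
  return remaining.length === 0;
}

console.log(subtractRange([0, 10], [3, 5]));
console.log(isCovered([[0, 5], [6, 9]], [[0, 10]]));
console.log(isCovered([[0, 5], [6, 10]], [[0, 10]]));
```

Because both bounds are inclusive, the upper remainder ends at `a[1]` itself. In the second call above, code point 10 is left uncovered, so the result is `false`.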