Unredacter is a collaborative forensic tool specialized in finding plausible candidates for text that has been obscured or removed from a typeset layout. We assume the text was removed after typesetting. Hence, the words around the removed text constrain the set of possible words. The constraints arise from rules of language and typography and can often narrow the set down to a single word.
Much of this is actually instructions for copilot.
This is a proof of concept and scientific research project, not a production system. Our development approach:
- Simple and straightforward code style without unnecessary complexity
- Fix problems at the source rather than implementing workarounds
- Clean and understandable code that can be easily modified
- Sound and correct implementations even if not perfect
- Test coverage for new features and bug fixes
- Rapid iteration with willingness to remove obsolete code
We follow KISS and DRY principles pragmatically, focusing on research goals over production polish.
- Typographical: The typeset length of the hidden word can be determined rather precisely by setting the text before and after it identically to the visible text. The discriminatory power of this length constraint depends on the font originally used. A fixed-width font has the least discriminatory power, as all words with the same number of letters are viable candidates. A font with more complex letter or part-of-word typesetting will considerably reduce the set of viable candidates.
- Partly visible clues: A word might be hidden, but some clues about its composition might be deduced from visible details. This often happens at the end of the word, where visible pixels might reveal that the word ends in a specific letter, or in one of a small set of letters. It also applies to the top and bottom of a mask, where parts of letters with tall stems or low extensions might remain visible.
- Lack of visible extending elements: Letters with ascenders or descenders can be expected to extend beyond the erased area of the word. If the erased area and the type properties can both be determined, we may exclude certain groups of letters at positions where they would have protruded had they been present.
- Semantic and grammatical: Semantic and grammatical rules may rule out certain words or types of words that would result in nonsensical sentences.
- Statistical analysis: Analyzing the frequencies of words or letters may favor words that are more probable in a sentence in the current document.
- Inter-mask relations: If the same hidden word appears in more than one place, and we are able to determine that this is the case, the constraints found at each occurrence can be applied to all of them.
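As an illustration of the "lack of visible extending elements" constraint, here is a minimal sketch of a candidate filter. It is a whole-word simplification of the positional rule described above, not the project's implementation. The letter groups, the function name `filter_by_extent`, and its parameters are assumptions made for this example; real ascender and descender sets depend on the font, and capital letters are ignored here.

```python
# Assumed letter groups for a typical Latin text font (lowercase only).
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")


def filter_by_extent(words, mask_covers_ascenders, mask_covers_descenders):
    """Drop words whose letters should have protruded beyond the mask.

    If the mask does not cover the ascender (or descender) zone and nothing
    is visible there, words containing such letters are excluded.
    """
    kept = []
    for word in words:
        letters = set(word)
        if not mask_covers_ascenders and letters & ASCENDERS:
            continue  # an ascender would have been visible above the mask
        if not mask_covers_descenders and letters & DESCENDERS:
            continue  # a descender would have been visible below the mask
        kept.append(word)
    return kept


candidates = ["manner", "happen", "season", "source"]
# The mask is tall enough to hide ascenders but leaves the descender zone clear.
print(filter_by_extent(candidates, mask_covers_ascenders=True, mask_covers_descenders=False))
# -> ['manner', 'season', 'source']
```

A positional version would apply the same test per letter slot, using the typeset position of each letter relative to the mask outline.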
There are also factors that make the constraints above less discriminating:
- Spelling errors: Spelling errors will almost always change the length of the word and hence greatly reduce the probability of selecting the true hidden word.
- Document quality: Low-quality documents, e.g. with skewed or rotated text, non-linear deformations, or blurry or low-resolution text.
The web frontend is a Vite React app. It displays the PDF, the labels, and various features of the text layout.
The backend is implemented using Python, venv, and FastAPI. It does all the heavy lifting.
When a word in a label is highlighted (via mouse gestures or shift and arrow keys), this is interpreted as a request for suggestions of other words with the same typeset length. The words are selected from preselected word lists. The frontend first requests a length measurement of the selected string, given the font it is set in. The backend uses Pillow to render the word on a virtual canvas and then reads off the rendered length. Once the frontend receives the measurement, it sends a second request to the backend to retrieve all words with the same rendered length in pixels, within an additional error margin.
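The measurement step could look roughly like the sketch below. It is not the backend's actual code: the function names, parameters, and the symmetric margin are assumptions for illustration. It uses Pillow's `ImageDraw.textlength` on a throwaway canvas, which returns the advance width of the text rather than rasterizing it.

```python
def rendered_length(word: str, font_path: str, size_px: float) -> float:
    """Measure the typeset length of `word` in pixels with Pillow."""
    # Imported lazily so length_window() below can be used without Pillow.
    from PIL import Image, ImageDraw, ImageFont

    # Fractional sizes need a Pillow version that accepts float sizes.
    font = ImageFont.truetype(font_path, size_px)
    canvas = Image.new("L", (1, 1))  # tiny virtual canvas; nothing is saved
    return ImageDraw.Draw(canvas).textlength(word, font=font)


def length_window(length: float, margin: float) -> tuple[float, float]:
    """Return the (low, high) range of lengths accepted as 'the same'."""
    return (length - margin, length + margin)


# Hypothetical usage; the font file name is an example only:
#   target = rendered_length("secret", "/app/static/fonts/Example.ttf", 12)
#   low, high = length_window(target, margin=0.5)
print(length_window(100.0, 1.5))  # -> (98.5, 101.5)
```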
These candidate words are sorted by rendered length and displayed in the candidate word list. The user may then select a word from that list by clicking or using hotkeys, and the label changes to show what it would look like if that word were used instead of the highlighted word. Behind the scenes, the original label is hidden and a new one is rendered on top of it. The user may then commit the selected word, which replaces the highlighted word in the original label.
To quickly return words based on length, all candidate words have a precalculated length. As rendered length depends on the font used, each word length is calculated for each selectable font. The lengths are stored in server-side SQLite databases, one for each font/wordlist pair, to improve performance and reduce contention. After the word length databases are built during deployment, they are de facto read-only.
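A minimal sketch of such a widths database, using Python's built-in `sqlite3` module, is shown below. The table name `words`, the column names, and the helper functions are assumptions for this example and may differ from the real schema.

```python
import sqlite3


def build_widths_db(path, widths):
    """Create a widths database from (word, width) pairs."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE words (word TEXT PRIMARY KEY, width REAL NOT NULL)")
    # The index lets range queries on width avoid a full table scan.
    conn.execute("CREATE INDEX idx_words_width ON words (width)")
    conn.executemany("INSERT INTO words (word, width) VALUES (?, ?)", widths)
    conn.commit()
    return conn


def candidates(conn, target, margin):
    """Return (word, width) rows within `margin` of `target`, sorted by width."""
    rows = conn.execute(
        "SELECT word, width FROM words WHERE width BETWEEN ? AND ? ORDER BY width, word",
        (target - margin, target + margin),
    )
    return rows.fetchall()


# An in-memory database with made-up widths, for demonstration only.
conn = build_widths_db(
    ":memory:", [("cat", 21.5), ("dog", 22.0), ("horse", 38.25), ("cow", 23.0)]
)
print(candidates(conn, target=22.0, margin=0.6))
# -> [('cat', 21.5), ('dog', 22.0)]
```

Since the databases are de facto read-only after deployment, a file-based database could also be opened with `sqlite3.connect("file:<path>?mode=ro", uri=True)` so that accidental writes fail.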
There are no users as such, but there are session IDs. Access is checked by Basic Auth. Any change made by any user is broadcast to all others and saved as JSON to local storage.
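The broadcast behavior can be sketched with a small in-process hub keyed by session ID. This is an illustration only: the `Hub` class and its methods are hypothetical, and the sockets are assumed to expose an async `send_text` method, as FastAPI/Starlette WebSocket objects do. A fake socket stands in for a real connection so the example runs on its own.

```python
import asyncio
import json


class Hub:
    """Keeps track of connected sessions and relays changes between them."""

    def __init__(self):
        self.clients = {}  # session_id -> object with an async send_text(str)

    def connect(self, session_id, socket):
        self.clients[session_id] = socket

    def disconnect(self, session_id):
        self.clients.pop(session_id, None)

    async def broadcast(self, sender_id, change):
        """Send `change` as JSON to every session except the sender."""
        payload = json.dumps(change)
        others = [s for sid, s in self.clients.items() if sid != sender_id]
        await asyncio.gather(*(s.send_text(payload) for s in others))


class FakeSocket:
    """Stand-in for a WebSocket; records what it was sent."""

    def __init__(self):
        self.received = []

    async def send_text(self, text):
        self.received.append(text)


async def demo():
    hub = Hub()
    a, b = FakeSocket(), FakeSocket()
    hub.connect("a", a)
    hub.connect("b", b)
    await hub.broadcast("a", {"label": 1, "text": "word"})
    return a.received, b.received


print(asyncio.run(demo()))
# -> ([], ['{"label": 1, "text": "word"}'])
```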
PDF export has several modes. The standard mode exports the PDF with the labels added as is on top. This gives a rather busy document.
A more coherent document can be generated by first removing the black masks used to redact words. This is done with OpenCV through dilation and erosion, which first yields black and white masks of the redaction markings. These masks are then subtracted from the original document. Any labels placed within the redaction masks are written to the final document as black on white.
- Precise Typography: Fractional point sizes, standard fonts, weight/style control
- Advanced Export: Server-side PDF generation with progress tracking
- OpenCV Processing: Intelligent redaction mask workflow for clean document processing
- Real-time Collaboration: WebSocket-based multi-user editing
- Font Management: Comprehensive TTF font support with conflict resolution
- Backend: FastAPI (Python) with OpenCV image processing
- Frontend: Vite + React + TypeScript
- PDF rendering: pdfjs-dist
- Image processing: OpenCV, NumPy, PIL/Pillow
- Containerization: Docker with optimized multi-stage builds
- Node 18+
- Python 3.10+
- Docker/Podman (for containerized deployment)
The staging environment automatically deploys when changes are pushed to the main branch on GitHub.
/app/assets/words
Word lists with candidate words. One word per line. Currently, these lists are static and cannot be changed at runtime. The only actual use at runtime is to generate the dropdown list in the UI. At deploy time, the lists are used to build widths databases, and these are the wordlists that are actually used at runtime.
/app/static/
General home for all static files that are not processed but just read or served.
/app/static/assets/
App's JS files like pdfjs, pdf.worker*.js and vendor*.js
/app/static/fonts
Folder with TTF fonts. The filenames are used as the unique identifier throughout the lifetime of the app
/app/server/
Backend server files
/app/server/app
Python FastAPI backend files
/app/server/scripts
Various deploy-time or runtime bootstrapping scripts
/app/storage
Persistent storage for the app. This should be backed by e.g. a PVC on the host
/app/storage/pdfs
The uploaded PDF files, renamed to hashes. Metadata goes in JSON files named as the corresponding PDF file but with the added extension .meta.json
/app/storage/layouts
The saved state, i.e. mostly the created labels, is saved here in JSON files. Each file corresponds to a PDF. The layout files are named as their corresponding PDF, but with the extension .json
/app/storage/widths
This is where the widths databases are hosted. For every wordlist/fontfile pair a widths database is created. Each database contains the normalized typographical width in pixels of every word in the word list. This makes it possible to quickly serve candidate words given a font file name and a word. Widths are calculated at deploy time. If running in a k8s system, each pair generates a job item that runs in an indexed job. This is done using the same Docker image as the app itself runs in, but in separate containers. Hence the app folder layout is used during generation. Fonts are read from /app/static/fonts, word lists from /app/assets/words, and the resulting databases are generated in /app/storage/widths
/tmp
Temporary and not so persistent storage.
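For the deploy-time widths generation described under /app/storage/widths, each indexed job needs to pick its own wordlist/font pair. The sketch below shows one way to do that, assuming Kubernetes Indexed Jobs, which expose the index in the `JOB_COMPLETION_INDEX` environment variable. The function names, the example file names, and the database file naming scheme are hypothetical.

```python
import itertools
import os
from pathlib import Path


def pair_for_index(wordlists, fonts, index):
    """Map a job index to one (wordlist, font) pair in a stable order."""
    # Sorting makes every job compute the same enumeration of pairs.
    pairs = list(itertools.product(sorted(wordlists), sorted(fonts)))
    return pairs[index]


def widths_db_path(wordlist, font, root="/app/storage/widths"):
    """Build an output path for one pair (naming scheme is an assumption)."""
    return Path(root) / f"{Path(wordlist).stem}__{Path(font).stem}.sqlite"


# Example file names; a real job would list /app/assets/words and /app/static/fonts.
wordlists = ["english.txt", "names.txt"]
fonts = ["Courier.ttf", "Times.ttf", "Arial.ttf"]

# Falls back to index 0 when not running inside an indexed job.
index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
wordlist, font = pair_for_index(wordlists, fonts, index)
print(wordlist, font, widths_db_path(wordlist, font))
```

With two word lists and three fonts, the job would be created with six completions, one per pair.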
