Unredacter

Unredacter is a collaborative forensic tool for finding plausible candidates for text that has been obscured or removed from a typeset document. We assume the text was removed after typesetting, so the words around the gap constrain the set of possible words. The constraints follow from rules of language and typography and can often narrow the set down to a single word.

Much of this is actually instructions for copilot.

Project Principles

This is a proof of concept and scientific research project, not a production system. Our development approach:

Simple and straightforward code style without unnecessary complexity
Fix problems at the source rather than implementing workarounds
Clean and understandable code that can be easily modified
Sound and correct implementations even if not perfect
Test coverage for new features and bug fixes
Rapid iteration with willingness to remove obsolete code

We follow KISS and DRY principles pragmatically, focusing on research goals over production polish.

Constraints

Typographical: The typeset length of the hidden word may be rather exactly determined by setting the text before and after it in a way identical to the visible text. The discriminatory power of the typeset word length constraint depends on the font originally used. A fixed-width font has the least discriminatory power as all words with the same number of letters are viable candidates. A font with more complex letter or part-of-word typesetting will considerably reduce the set of viable candidates.
Partly visible clues: A word might be hidden, but some clues about the word's composition might be deduced from visible details. This often happens at the end of the word, where visible pixels might reveal that the word ends in a specific letter, or a letter from a small set of letters. This also applies to the top and bottom of a mask that might reveal clues where letters with tall stems or low extensions might hide.
Lack of visible extending elements: Letters with vertical descenders or ascenders may be predicted to extend beyond the erased area of the word. If the erased area and type properties can both be determined, we may exclude certain groups of letters at positions where they should have protruded if they were present.
Semantic and grammatical: Semantic and grammatical rules may rule out certain words or types of words that may result in nonsensical sentences.
Statistical analysis: Analyzing the frequencies of words or letters may result in selecting words that are more probable to be present in a sentence in the current document.
Inter-mask relations: If an unknown hidden word appears in more than one place, and we are able to determine that this is the case, the constraints applied to each of the words can be applied to both of them.
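
The "lack of visible extending elements" constraint can be illustrated with a small filter. This is a minimal sketch under simplifying assumptions, not the project's implementation: the letter sets are a rough approximation for a typical Latin text font, the function names are hypothetical, and the check is applied to the whole word rather than to individual positions.

```python
# Rough letter classes for a typical Latin text font (an assumption;
# real fonts differ, e.g. in how far "t" or "j" extend).
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gjpqy")


def has_ascender(word: str) -> bool:
    """True if the word contains a letter that rises above the x-height."""
    return any(ch in ASCENDERS or ch.isupper() for ch in word)


def has_descender(word: str) -> bool:
    """True if the word contains a letter that drops below the baseline."""
    return any(ch in DESCENDERS for ch in word)


def filter_by_protrusion(candidates, ascender_would_show, descender_would_show):
    """Drop candidates that would have protruded beyond the mask.

    ascender_would_show: the mask is low enough that an ascender would
        have been visible above it, yet none is visible.
    descender_would_show: the same for descenders below the mask.
    """
    kept = []
    for word in candidates:
        if ascender_would_show and has_ascender(word):
            continue
        if descender_would_show and has_descender(word):
            continue
        kept.append(word)
    return kept


if __name__ == "__main__":
    words = ["money", "manner", "table", "camera"]
    print(filter_by_protrusion(words, True, True))
```

A positional variant would need the per-letter advance widths of the font to know where in the mask each letter falls.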

There are also factors that make the constraints above less discriminating:

Spelling errors: A misspelled word will almost always have a different length, which greatly reduces the probability of selecting the true hidden word.
Document quality: Low-quality documents, e.g. with skewed or rotated text, non-linear deformations, or blurry or low-resolution text, make the measurements above less precise.

Architecture

The web frontend is a Vite React app. It displays the PDF, the labels, and various features of the text layout.

The backend is implemented using Python, venv, and FastAPI. It does all the heavy lifting.

When a word in a label is highlighted (via mouse gestures or Shift and arrow keys), this is interpreted as a request for suggestions of other words with the same typeset length. The words are drawn from preselected word lists. The frontend first requests a length measurement of the selected string in the font it is set with. The backend uses Pillow to render the word on a virtual canvas and returns the rendered length. On receiving it, the frontend sends a second request to the backend to retrieve all words whose rendered length in pixels matches, within an additional error margin.
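
The measure-then-match flow described above might look roughly like this. It is a sketch, not the backend's actual code: the function names and the default margin are hypothetical, and it measures with Pillow's `font.getlength` instead of rendering to a canvas. The font is passed in as any object with a `getlength(text)` method, so the matching logic can be tried without Pillow installed.

```python
from typing import Iterable, List, Protocol


class SupportsGetLength(Protocol):
    """Anything that can report the advance width of a string in pixels."""

    def getlength(self, text: str) -> float: ...


def load_font(font_path: str, size_px: int):
    """Load a TTF font with Pillow (imported lazily; needs Pillow >= 8.0)."""
    from PIL import ImageFont

    return ImageFont.truetype(font_path, size=size_px)


def rendered_width(text: str, font: SupportsGetLength) -> float:
    """Width of `text` in pixels when set in `font`."""
    return float(font.getlength(text))


def candidates_by_width(
    selected: str,
    words: Iterable[str],
    font: SupportsGetLength,
    margin_px: float = 0.5,  # hypothetical default error margin
) -> List[str]:
    """Return words whose width is within `margin_px` of the selected word,
    sorted by rendered width."""
    target = rendered_width(selected, font)
    measured = [(rendered_width(word, font), word) for word in words]
    matches = [(w, word) for w, word in measured if abs(w - target) <= margin_px]
    return [word for _, word in sorted(matches)]
```

With a real font this would be called as `candidates_by_width("mint", words, load_font(path, 24))`, where `path` points at one of the TTF files. Measuring every word per request is slow, which is why the widths are precalculated.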

These candidate words are sorted by rendered length and displayed in the candidate word list. The user may then select a word from that list by clicking or using hotkeys, and the label changes to show what it would look like if that word were used instead of the selected word. Behind the scenes, the original label is hidden and a new one is rendered on top of it. The user may then commit the selected word, which replaces the highlighted word in the original label.

To quickly return words based on length, all candidate words have a precalculated length. As rendered length depends on the font used, each word length is calculated for each selectable font. The lengths are stored in server-side SQLite databases, one per font/wordlist pair, which improves performance and reduces contention. Once the word length databases have been built during deployment, they are effectively read-only.
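
A precalculated width lookup of this kind can be sketched with Python's built-in `sqlite3` module. The table name, column names, and function names below are assumptions for illustration; the real schema may differ.

```python
import sqlite3
from typing import Dict, List


def build_width_db(conn: sqlite3.Connection, widths: Dict[str, float]) -> None:
    """Create and fill a widths table for one font/wordlist pair."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS widths ("
        "word TEXT PRIMARY KEY, width REAL NOT NULL)"
    )
    # The index lets range queries on width avoid a full table scan.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_widths_width ON widths (width)")
    conn.executemany(
        "INSERT OR REPLACE INTO widths (word, width) VALUES (?, ?)",
        widths.items(),
    )
    conn.commit()


def words_near_width(
    conn: sqlite3.Connection, target: float, margin: float
) -> List[str]:
    """Return words whose stored width lies within target +/- margin."""
    rows = conn.execute(
        "SELECT word FROM widths WHERE width BETWEEN ? AND ? "
        "ORDER BY width, word",
        (target - margin, target + margin),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_width_db(conn, {"cent": 21.0, "main": 24.0, "dome": 27.0})
    print(words_near_width(conn, 22.0, 2.0))
```

Since the databases are not written to after deployment, they could also be opened read-only with `sqlite3.connect("file:<path>?mode=ro", uri=True)`.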

Multiplayer

There are no users as such, but there are session IDs. Access is checked by Basic Auth. Any change made by any user is broadcast to all others and saved as JSON to local storage.
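
The broadcast-and-save behaviour can be sketched without any web framework. In this illustration each connected session is represented by an `asyncio.Queue` standing in for its WebSocket; the class name, message shape, and file layout are all hypothetical.

```python
import asyncio
import json
from pathlib import Path
from typing import Dict


class SessionHub:
    """Keeps the label state, persists it as JSON, and fans out changes."""

    def __init__(self, layout_path: str) -> None:
        self.layout_path = Path(layout_path)
        self.peers: Dict[str, asyncio.Queue] = {}  # session ID -> outbox
        self.labels: Dict[str, dict] = {}

    def join(self, session_id: str) -> asyncio.Queue:
        """Register a session and return the queue its messages arrive on."""
        queue: asyncio.Queue = asyncio.Queue()
        self.peers[session_id] = queue
        return queue

    def leave(self, session_id: str) -> None:
        self.peers.pop(session_id, None)

    async def apply_change(self, sender_id: str, label_id: str, label: dict) -> None:
        """Store a change, save the whole layout, and notify everyone else."""
        self.labels[label_id] = label
        self.layout_path.write_text(
            json.dumps(self.labels, indent=2), encoding="utf-8"
        )
        message = {"type": "label_changed", "label_id": label_id, "label": label}
        for session_id, queue in self.peers.items():
            if session_id != sender_id:  # the sender already has the change
                await queue.put(message)
```

In a FastAPI backend, the queue would typically be replaced by sending the message over each session's WebSocket connection.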

PDF Export

PDF Export has several modes. The standard mode exports the PDF with the labels added on top as is, which gives a rather busy document.

A more coherent document can be generated by first removing the black masks used to redact words. OpenCV dilation and erosion are used to obtain a black-and-white mask of the redaction markings, which is then subtracted from the original document. Any labels placed in the redaction masks are written to the final document as black on white.
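
To show why dilation followed by erosion isolates redaction bars, here is a plain-Python sketch on a tiny grayscale grid (0 is black, 255 is white). It does not use OpenCV; with OpenCV the corresponding calls would be `cv2.dilate` and `cv2.erode`. The 3x3 neighbourhood, the order of operations, and the function names are assumptions, not the project's actual settings.

```python
WHITE, BLACK = 255, 0


def _filter3x3(img, pick):
    """Apply `pick` (max or min) over each pixel's 3x3 neighbourhood."""
    height, width = len(img), len(img[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                img[yy][xx]
                for yy in range(max(0, y - 1), min(height, y + 2))
                for xx in range(max(0, x - 1), min(width, x + 2))
            ]
            out[y][x] = pick(window)
    return out


def dilate(img):
    """Grow bright areas: thin dark strokes (text) disappear."""
    return _filter3x3(img, max)


def erode(img):
    """Grow dark areas: surviving bars regain their original extent."""
    return _filter3x3(img, min)


def redaction_mask(img):
    """True where a thick black region survives dilation then erosion."""
    closed = erode(dilate(img))
    return [[px == BLACK for px in row] for row in closed]


def remove_redactions(img):
    """Paint masked pixels white, leaving thin strokes such as text intact."""
    mask = redaction_mask(img)
    return [
        [WHITE if masked else px for px, masked in zip(row, mask_row)]
        for row, mask_row in zip(img, mask)
    ]
```

On a page containing a 3x4 black bar and a one-pixel-wide stroke, the mask covers only the bar, so subtracting it removes the bar and keeps the stroke.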

Features

Precise Typography: Fractional point sizes, standard fonts, weight/style control
Advanced Export: Server-side PDF generation with progress tracking
OpenCV Processing: Intelligent redaction mask workflow for clean document processing
Real-time Collaboration: WebSocket-based multi-user editing
Font Management: Comprehensive TTF font support with conflict resolution

Stack

Backend: FastAPI (Python) with OpenCV image processing
Frontend: Vite + React + TypeScript
PDF rendering: pdfjs-dist
Image processing: OpenCV, NumPy, PIL/Pillow
Containerization: Docker with optimized multi-stage builds

Prerequisites

Node 18+
Python 3.10+
Docker/Podman (for containerized deployment)

Deployment

Staging Environment

The staging environment automatically deploys when changes are pushed to the main branch on GitHub.

Development Setup

Directory Structure

When deployed in the bundled container

/app/assets/words Word lists with candidate words. One word per line. Currently, these lists are static and cannot be changed at runtime. The only actual use at runtime is to generate the dropdown list in the UI. At deploy time, the lists are used to build widths databases, and these are the wordlists that are actually used at runtime.

/app/static/ General home for all static files that are not processed but just read or served.

/app/static/assets/ App's JS files like pdfjs, pdf.worker*.js and vendor*.js

/app/static/fonts Folder with TTF fonts. The filenames are used as the unique identifier throughout the lifetime of the app

/app/server/ Backend server files

/app/server/app Python FastAPI backend files

/app/server/scripts Various deploy-time or runtime bootstrapping scripts

/app/storage Persistent storage for the app. This should be backed by e.g. a PVC on the host

/app/storage/pdfs The uploaded PDF files, renamed to hashes. Metadata goes in JSON files named as the corresponding PDF file but with the added extension .meta.json

/app/storage/layouts The saved state, i.e. mostly the created labels, is saved here in JSON files. Each file corresponds to a PDF. The layout files are named as their corresponding PDF, but with the extension .json

/app/storage/widths This is where the widths databases are hosted. For every wordlist/fontfile pair a widths database is created. Each database contains the normalized typographical width in pixels of every word in the word list. This is to quickly be able to serve candidate words given a font file name and a word. Widths are calculated at deploy time. If running in a k8s system, each pair will generate a job item running on an indexed job. This is done using the same docker image as the app itself runs in, but as separate containers. Hence the app folder layout will be used in this generation. Fonts will be found and read from /app/static/fonts, word lists from /app/assets/words and the resulting databases will be generated in /app/storage/widths

/tmp Temporary and not so persistent storage.
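
The per-pair generation of widths databases described under /app/storage/widths could select its work item as sketched below. Kubernetes Indexed Jobs expose the pod's index in the `JOB_COMPLETION_INDEX` environment variable; everything else here (function names, pair ordering, and the database file naming) is a hypothetical illustration.

```python
import os
from itertools import product
from pathlib import PurePosixPath
from typing import List, Mapping, Sequence, Tuple


def list_pairs(wordlists: Sequence[str], fonts: Sequence[str]) -> List[Tuple[str, str]]:
    """All wordlist/font pairs in a stable order shared by every job pod."""
    return list(product(sorted(wordlists), sorted(fonts)))


def pair_for_index(
    index: int, wordlists: Sequence[str], fonts: Sequence[str]
) -> Tuple[str, str]:
    """Pick the pair this job index is responsible for."""
    pairs = list_pairs(wordlists, fonts)
    if not 0 <= index < len(pairs):
        raise IndexError(f"job index {index} out of range for {len(pairs)} pairs")
    return pairs[index]


def current_job_index(environ: Mapping[str, str] = os.environ) -> int:
    """Read the index set by a Kubernetes Indexed Job (defaults to 0)."""
    return int(environ.get("JOB_COMPLETION_INDEX", "0"))


def database_path(
    wordlist: str, font: str, root: str = "/app/storage/widths"
) -> PurePosixPath:
    """Hypothetical naming scheme: <wordlist stem>__<font stem>.sqlite."""
    name = f"{PurePosixPath(wordlist).stem}__{PurePosixPath(font).stem}.sqlite"
    return PurePosixPath(root) / name
```

Sorting both lists before pairing matters: every pod must compute the same ordering so that each index maps to exactly one pair.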
