The goal of the challenge was to reassemble five progressively more complex shredded documents. The first was shredded into about 250 chads, and the last into 6200. Each document was hand-written, and all but one were written on lined paper. DARPA shredded each document, and scanned in the resulting chads.
Here's how I went about it. Everything was built using C# / .NET 4.0 / MSSQL.
Part 1: Preparing chads for assembly
The images provided by DARPA are scanned in paper chads against a magenta background, and need to be split into separate image files.
A flood fill was used to identify the chad borders, and each chad was saved as a separate image file.
The scanned in chads were placed on the scanner by hand, so they need to be rotated to the angle they entered the shredder. If the shredder had perfectly sharp blades, the chads would have a uniform width and smooth edges.
Using this assumption, the rotation angle can be determined. A chad is straight at the angle where the highest number of edge pixels are filled. Chads which are below a certain threshold of filled edge pixels were flagged for human review, and rotated manually. About 1% had to be manually rotated.
The shredder blades leave one end of each chad concave, and one end convex. The top and bottom edges are compared, and the upside down chads are rotated by 180 degrees.
Each chad is now trimmed to the "clean" width, and both rough and clean copies are saved.
Calculate Chad Data
Information about each chad is calculated and stored in a database.
- Rough dimensions
- "Clean" dimensions
- Coordinates of lines (if lined paper)
- Shape of each edge
- Pen connection points on each edge
- Color of each point on each edge
Manually Record Characters
OCR sucks on handwritten text, especially on half or quarter of a letter, but it's not too annoying to do manually. For each chad, the whole and partial contained characters were recorded. I recommend a libation rate of one beer per 1000 chads.
Find Neighbors by Position
The paper shredders shred documents in a predictable way, so each chad always has the same offsets to its neighbors. Knowing this, it's possible to determine all of the ways in which each chad can connect with the rest of the chads in the pool. This is way less painful with documents written on lined paper
All of the possible neighbor matches are stored in a database.
Calculate Neighbor Match Values
Matching chads is pretty inexact. The edges are rough, the paper is ripped in unpredictable ways and the chads vary a bit in size. There's no magic bullet when determining neighbors, but making a bunch of approximations works pretty well. Match percentages are calculated using the following criteria.
- Pen connecting points on trimmed chads
- Line connection points (filtered during the previous step)
- Color similarity (Delta-E) of the edge pixels
- Rough edge shape similarity
- Connected pen area (the total area covered by pen between two neighbor chads)
Part 2: Assembly
I assembled the first document using GIMP and Paint.NET. For a couple hundred chads, this isn't too bad. For many thousand of chads, a purpose built assembly tool is needed.
Assembly Tool Interface
Document Overview View
The document overview is a surface on which chads can be positioned on and manipulated.
- Chads snap-fit to a grid sized to the "clean" chad size
- Chads can be grouped together, or manipulated individually
- Matches can be viewed by right clicking on the desired edge
Potential Neighbors View
Possible matches for a chad or chads. The matches can be filtered by match status, position, contained text and or any chad or edge match criteria.
All Chads View
List of all chads, which can be filtered by contained text, current match status, or other criteria (pen color, coffee stains, edge pieces, etc).
When the potential neighbors view is populated for a specific chad, the document layout is first checked for other chads bordering the same space, and the potential neighbors are filtered by all edges each would touch. The result is that chad suggestion is much more accurate when two or more edges are used.
With the potential neighbors calculated, and the assembly tool built, assembling the document is just a matter of iterating through each chad and its neighbors, and building progressively larger groups of chads until the entire document is reconstructed.
The fourth document was a special case. The colored chads were first grouped automatically, then assembled manually. After assembling the color text, it was apparent that the collaborators names were written in different handwriting than the main text. I simply picked out a few chads that appeared to be written in different handwriting, or at an angle, and then used the edge matching functionality to build out the surrounding chads until I had all of the names.
This project has been a ton of fun, and I'm planning on purchasing a couple paper shredders so I can continue development. I'm not sure to what end exactly, but I can't wait to see how much this process can be optimized.
It turns out that a single programmer working in his spare time can effectively reconstruct shredded documents. The next time you're in the market for a paper shredder, you might want to invest in a fire pit instead.