Friday, December 2, 2011

You should probably start burning your mail: What I learned from the DARPA Shredder Challenge.

I spent most of my free time the last few weeks hacking away at the DARPA Shredder Challenge. I reassembled four of the five documents, finishing in third place.

The goal of the challenge was to reassemble five progressively more complex shredded documents. The first was shredded into about 250 chads, and the last into 6200. Each document was hand-written, and all but one were written on lined paper. DARPA shredded each document, and scanned in the resulting chads.

Here's how I went about it. Everything was built using C# / .NET 4.0 / MSSQL. 

Part 1: Preparing chads for assembly


Separate

The images provided by DARPA are scanned in paper chads against a magenta background, and need to be split into separate image files.

A flood fill was used to identify the chad borders, and each chad was saved as a separate image file.

Rotate

The scanned in chads were placed on the scanner by hand, so they need to be rotated to the angle they entered the shredder. If the shredder had perfectly sharp blades, the chads would have a uniform width and smooth edges.

Using this assumption, the rotation angle  can be determined. A chad is straight at the angle where the highest number of edge pixels are filled. Chads which are below a certain threshold of filled edge pixels were flagged for human review, and rotated manually. About 1% had to be manually rotated.

Flip

The shredder blades leave one end of each chad concave, and one end convex. The top and bottom edges are compared, and the upside down chads are rotated by 180 degrees.

Trim

Each chad is now trimmed to the "clean" width, and both rough and clean copies are saved.

Calculate Chad Data

Information about each chad is calculated and stored in a database.

  • Rough dimensions
  • "Clean" dimensions
  • Coordinates of lines (if lined paper)
  • Shape of each edge
  • Pen connection points on each edge
  • Color of each point on each edge

Manually Record Characters

OCR sucks on handwritten text, especially on half or quarter of a letter, but it's not too annoying to do manually. For each chad, the whole and partial contained characters were recorded. I recommend a libation rate of one beer per 1000 chads.

Find Neighbors by Position

The paper shredders shred documents in a predictable way, so each chad always has the same offsets to its neighbors. Knowing this, it's possible to determine all of the ways in which each chad can connect with the rest of the chads in the pool. This is way less painful with documents written on lined paper

All of the possible neighbor matches are stored in a database.

Calculate Neighbor Match Values

Matching chads is pretty inexact. The edges are rough, the paper is ripped in unpredictable ways and the chads vary a bit in size. There's no magic bullet when determining neighbors, but making a bunch of approximations works pretty well. Match percentages are calculated using the following criteria.

  • Pen connecting points on trimmed chads
  • Line connection points (filtered during the previous step)
  • Color similarity (Delta-E) of the edge pixels
  • Rough edge shape similarity
  • Connected pen area (the total area covered by pen between two neighbor chads)

Part 2: Assembly

I assembled the first document using GIMP and Paint.NET. For a couple hundred chads, this isn't too bad. For many thousand of chads, a purpose built assembly tool is needed. 

Assembly Tool Interface

Document Overview View

The document overview is a surface on which chads can be positioned on and manipulated.

  • Chads snap-fit to a grid sized to the "clean" chad size
  • Chads can be grouped together, or manipulated individually
  • Matches can be viewed by right clicking on the desired edge

Potential Neighbors View

Possible matches for a chad or chads. The matches can be filtered by match status, position,  contained text and or any chad or edge match criteria.


All Chads View

List of all chads, which can be filtered by contained text, current match status, or other criteria (pen color, coffee stains, edge pieces, etc).




Multi-Chad Matching

When the potential neighbors view is populated for a specific chad, the document layout is first checked for other chads bordering the same space, and the potential neighbors are filtered by all edges each would touch. The result is that chad suggestion is much more accurate when two or more edges are used.

Assembling

With the potential neighbors calculated, and the assembly tool built, assembling the document is just a matter of iterating through each chad and its neighbors, and building progressively larger groups of chads until the entire document is reconstructed.

The fourth document was a special case. The colored chads were first grouped automatically, then assembled manually. After assembling the color text, it was apparent that the collaborators names were written in different handwriting than the main text. I simply picked out a few chads that appeared to be written in different handwriting, or at an angle, and then used the edge matching functionality to build out the surrounding chads until I had all of the names.

Future Improvements

This project has been a ton of fun, and I'm planning on purchasing a couple paper shredders so I can continue development. I'm not sure to what end exactly, but I can't wait to see how much this process can be optimized.

It turns out that a single programmer working in his spare time can effectively reconstruct shredded documents. The next time you're in the market for a paper shredder, you might want to invest in a fire pit instead. 

7 comments:

  1. Nice explanation. Lots of similarities to what we did (All your shreds are belong to U.S.). We are gonna try to do a write-up of some of our algorithms too when we have some time.
    -Otavio Good

    ReplyDelete
  2. Otavio -- Congrats on the win :) I'm looking forward to learning what approaches worked well for your team, and the other successful competitors.

    ReplyDelete
  3. I received an e-mail from d.w. who was unable to post his comment for some reason:



    Congrats on placing third in the Challenge! Also, it was gracious of you to post links to separated chad.
    Notes:
    1. DARPA has made public the fact that shredders for home, business, and government use are inadequate. One outcome of this challenge is that we will now have better shredder specifications. Also, DARPA has created a community of researchers (like you) who are keenly interested in the problem and can improve the approaches. Unshredder.com is making money and so are unshredding services. Like virus and anti-virus, the white-hat guys and the black-hat guys both want each other’s documents and are playing leap frog with technology.

    2. Taking a shredder apart is an educational activity and will show how engineers make office machines quieter and reduce MTBF and cost. The Wikipedia article on “Paper Shredder” is worth reading closely. Manufacturers’ web sites give insightful information (e.g., see http://www.shredexonline.com/news.htm). Know what a planimeter is? or do you know about UN-SCAN-IT? Child-like analysis is invaluable (e.g., “put the lines together before you consider damaged edges or cursive writing”). As you mentioned, with offset crosscut and lines (printed or virtual), there are a limited number of places that a chad could fit on a page (circumference of cutter modulo line spacing). Note: Each new constraint further divides the problem so it is easier to conquer.

    3. Excellent computer programs are adaptive. They monitor the clock, adjust priority of searches of the data base and operate on probabilities of outcomes (humans live that way). Why exhaustively iterate through every possibility of a match of chads when the obvious candidates (you separated colors first) have already been found?

    4. Some interesting research is needed on (a) 3-D of disfigured edges, (b) an UNSHRED iPhone app, (c) recognizing fake document pages (reconstruction merely reconstructs, it does not validate content, so an occasional fake page disrupts the effort), the value of normal and blank chads as “stepping stones” to other chads (like playing Soduku).
    --d.w.

    p.s, I'm considering writing a book on unshredding. Also, listen to: http://www.npr.org/2011/03/09/134384792/Egypt-Documents

    ReplyDelete
  4. Here (puzzles.vanisoft.pl) we also tried similar things. We didn't notice the fact, that y-offsets are predictible, but restricted them anyway to multiples of distance between lines, which is still great optimization. We also didn't took shades of paper color into account, thinking that this is just the artefact of scaning/compression, rather than helpfull information.
    We had a software that computed score for each pair similar to yours, but we managed human verification differently: by crowd-sourcing. People just have to tell if the match is correct or not. This way only administrators could view the whole picture, while the users seens just two pices at the time

    ReplyDelete
  5. Isn't the problem made hugely more complex by the fact that the chads of multiple documents are apt to be merged in the same pile? Couldn't shredder output be made much more complex by the simple acts of stirring the output to mix it more thoroughly and/or by running the shredder output through a second-pass shredding?

    ReplyDelete
  6. tedb -

    The real world problem of reconstructing shredded documents is definitely very different than the DARPA challenge.

    If you have a bin of shreds that haven't been manipulated after shredding, you can make assumptions about which chads should be grouped together depending on their proximity within the bin.

    Just mixing up the pile, though, makes it a much more complex problem.

    ReplyDelete
  7. hello, guys
    I did't took part in DARPA shredder challenge, but this theme became interesting for me. So, what I achieved for now is posted to http://shredderleaks.com

    ReplyDelete