Annoying Anti-Spam Web Tool Is Helping Digitize History

December 10th, 2008  |  Published in All, Featured, Technology



Anyone who’s spent much time online has encountered websites that require you to solve distorted word puzzles to "prove you’re human." You may find them annoying but now that effort may not be going to waste. Turns out you and millions of others could be transcribing old books and newspapers little by little, every day.


Interviewee: Luis von Ahn, Carnegie Mellon University
Produced by Sunita Reed. Edited by Sunita Reed and James Eagan
Copyright © ScienCentral, Inc.

Proving Our Humanity

Computer scientist Luis von Ahn was a PhD candidate at Carnegie Mellon University when the chief scientist at Yahoo! asked von Ahn’s advisor, Manuel Blum, for help against spammers. Spammers were writing computer programs that registered for millions of free Yahoo! email accounts every day, and then sent hundreds of millions of spam emails from them. Carnegie Mellon’s team put their heads together.

“We came up with the idea of trying to have a test that can distinguish humans from computers so that only humans can obtain free email accounts,” explains von Ahn, now an associate professor at the university.

Blum and von Ahn, along with colleagues Nicholas Hopper and John Langford, took a year to perfect the computer program they named CAPTCHA (an acronym that stands for Completely Automated Public Turing test to tell Computers and Humans Apart). The program slightly distorts the shape of words and requires people to type them into a form. It works because computers cannot read distorted text very well.

“During those ten seconds while you’re typing a CAPTCHA your brain is doing something amazing,” says von Ahn. “You’re doing something that computers cannot yet do. Despite 50 years of research in computer science, computers cannot yet read those squiggly letters.”

Clever, right? So clever that once Yahoo! started using it, the program spread to other websites and became a staple of the internet. Von Ahn recalls how proud he felt of its effectiveness and its widespread use. But pride gave way to guilt when he did a back-of-the-envelope calculation of how long people were spending solving CAPTCHAs. He was shocked by the cost of proving our humanity.

“Each time you type a CAPTCHA essentially you waste ten seconds of your time,” he says. “And if you multiply that by 200 million you get that humanity, as a whole, is wasting like 500,000 hours a day typing these annoying CAPTCHAs.”
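
Von Ahn's estimate is easy to check. The short Python sketch below uses only the two numbers he quotes (ten seconds per CAPTCHA, 200 million CAPTCHAs a day); the function and variable names are illustrative, not from any reCAPTCHA code.

```python
# Rough check of the estimate quoted above.
SECONDS_PER_CAPTCHA = 10          # von Ahn's figure
CAPTCHAS_PER_DAY = 200_000_000    # von Ahn's figure


def wasted_hours_per_day(seconds_each, solves_per_day):
    """Total human time spent per day, converted from seconds to hours."""
    return seconds_each * solves_per_day / 3600


hours = wasted_hours_per_day(SECONDS_PER_CAPTCHA, CAPTCHAS_PER_DAY)
print(f"{hours:,.0f} hours per day")  # prints "555,556 hours per day"
```

That is about 555,000 hours a day, which matches his rounded figure of "like 500,000."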

Spam and Shakespeare

Von Ahn wanted to find a way to make use of that effort. He was aware that some libraries were taking books that were published before the digital age and converting them into digital archives.

This two-step process starts with a scan or digital photograph of the pages of the old book. But since the result is merely an image (text within it cannot be searched or copied), a computer program that performs optical character recognition (OCR) then transforms the images into actual text.

The problem is that the text from old books and newspapers is often faded or distorted, resulting in an OCR transcription error rate of about twenty percent.

“The reason the computer cannot decipher many of the words,” explains von Ahn, “is precisely the same reason why computers cannot read CAPTCHAs—because they cannot read distorted text.”

It occurred to von Ahn that people were decoding distorted text every time they solved a CAPTCHA puzzle. So he revised CAPTCHA: instead of using random characters as the puzzle, he substituted words from books or newspapers that automated OCR programs could not recognize. He called the new program "reCAPTCHA."

Crowdsourcing Transcription

In order to maintain the first goal of CAPTCHA (differentiating between human and computer) two word puzzles are displayed: one is a "control" word for which the answer is known and the other is a word from old text. If the human solves the control correctly, “the system assumes they are human and gains confidence that they also typed the other word correctly.”
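
The two-word check described above can be sketched in a few lines of Python. This is a minimal illustration, not reCAPTCHA's actual code: the class name, the case-insensitive comparison, and the rule that a word is accepted once three humans agree (`min_votes`) are all assumptions made for the example. The article says only that the system "gains confidence" in the unknown word.

```python
from collections import Counter


class TwoWordChallenge:
    """Hypothetical model of a two-word puzzle: one known word, one unknown."""

    def __init__(self, control_answer, min_votes=3):
        self.control_answer = control_answer  # the answer is already known
        self.min_votes = min_votes            # assumed threshold, not from the article
        self.votes = Counter()                # guesses collected for the unknown word

    def submit(self, control_guess, unknown_guess):
        """Return True if the user passes as human; record their guess if so."""
        if control_guess.strip().lower() != self.control_answer.lower():
            return False  # failed the control word, so the other guess is discarded
        self.votes[unknown_guess.strip().lower()] += 1
        return True

    def transcription(self):
        """Return the agreed spelling once enough humans agree, else None."""
        if not self.votes:
            return None
        word, count = self.votes.most_common(1)[0]
        return word if count >= self.min_votes else None


# Usage: "morning" is the control word; the scanned word is unknown to the system.
challenge = TwoWordChallenge(control_answer="morning")
challenge.submit("morning", "upon")
challenge.submit("mourning", "open")  # fails the control word; guess ignored
challenge.submit("morning", "upon")
challenge.submit("Morning", "upon")
print(challenge.transcription())      # prints "upon"
```

The point of the design is that the system never needs to know the answer to the second word in advance. It needs only enough people who passed the control word to agree on it.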

Von Ahn and colleagues recently published a study in the journal Science evaluating the effectiveness of reCAPTCHA over a one-year period. They found that its word accuracy exceeded 99 percent. He says that matches the accuracy of professional human transcribers, a very expensive alternative. Currently von Ahn works with the nonprofit Internet Archive and The New York Times to digitize old books and newspapers.

How Can You Tell If It’s reCAPTCHA?

Since reCAPTCHA requires a control word before the "target" to be transcribed, you will know it if you see two separate words instead of one word or random characters. (And before you complain about having to do more work, von Ahn says it turns out that it takes the same amount of time to solve two words as it does to solve one random series of characters, like the original CAPTCHA.)

Today, more than 40,000 Web sites, from Facebook to Ticketmaster to Craigslist to Twitter, use reCAPTCHA. Von Ahn estimates that 20 million new words are transcribed each day, some of them possibly by you.

And von Ahn says that once people find out what it’s for, they seem to be less resentful of the time they spend solving reCAPTCHAs.

“People are usually very happy about this. They say that well at least my time—you know these things are annoying but at least my time is not wasted anymore,” he says.

Why Are They Sometimes So %*$@ Difficult?

“That’s because you’re not human,” quips von Ahn with a straight face.

But he goes on to explain that, over the years, spammers’ computer programs have gotten better and better at reading distorted text. So he and his colleagues had to make the text much harder to read for computers. But at the same time, it’s harder for humans. He warns that there are also CAPTCHAs made by other people that haven’t been intensely tested like his and could be more difficult to decipher.

So, blame the spammers for having to do the puzzles in the first place. And thank von Ahn for finding a way to put your amazing mental powers to good use.

Special thanks to Mary Kay Johnsen of Carnegie Mellon’s Posner Collection for giving us access to film rare books by Shakespeare and Dickens taken from the vault.

PUBLICATION: Science, September 12, 2008
AUTHORS: Luis von Ahn, Benjamin Maurer, Colin McMillen, David Abraham, Manuel Blum
RESEARCH FUNDED BY: This work was partially supported by gifts from the Heinz Endowment and the Fine Foundation, by an equipment grant from Intel Corporation and a research grant from the New York Times Company, and by the Army Research Office through grant number DAAD19-02-1-0389 to CyLab at Carnegie Mellon University. Luis von Ahn was partially supported by a Microsoft Research New Faculty Fellowship and a MacArthur Fellowship.




Responses

  1. sciencesensei says:

    December 18th, 2008 at 3:58 pm (#)

    I have always thought that I may have been transcribing something. I’d like to see some of the results.

  2. nocare says:

    February 14th, 2010 at 3:29 am (#)

    This is a bit stupid.
    Think about it, the way the captcha verifies you entered the correct text is it already knows the answer.

    So none of us users are doing anything. The work was already done by the captcha company.

  3. Ann Noyed says:

    October 16th, 2012 at 10:55 am (#)

    Hmm, no mention of the captchas that are not legible. I spent at least 20 mins one time trying to subscribe to a site and never did get past the captcha. It got harder and harder and more annoying as I kept getting it wrong. I just gave up.
