Luis von Ahn is making a real impact. He’s improving the world, and you’ve already helped him, likely without even realizing it.
The story begins with the process of registering for a Hotmail account in 1998. Imagine you’re a spammer, surfing the net on your Bondi Blue iMac G3, and you realize that Hotmail (and every other free email service) has a limit of 100 e-mails you can send in a day. That just won’t do. You decide to write a crude program that registers 250,000 e-mail addresses before the new hit “…Baby One More Time” plays all the way through. Time goes on, and all is well in your spammer universe. Y2K scares come and go, everyone starts to forget about the Furby, and then, suddenly, your program breaks, and you can’t register e-mail accounts without this new verification called a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart).
CAPTCHAs and reCAPTCHAs
A CAPTCHA is a program that protects websites against spambots by generating and grading tests that humans can pass but current computer programs cannot.
We’ve all squinted and guessed, regenerated, and occasionally screamed at this 10-second transcription chore during account registration or verification. Luis von Ahn helped create CAPTCHA in 2000 while pursuing his PhD at Carnegie Mellon University, but he didn’t stop there. With 200 million CAPTCHAs being typed every day, at 10 seconds of human time per CAPTCHA, von Ahn was concerned about the roughly 500,000 hours he was causing humanity to waste every day. Enter reCAPTCHA, a company von Ahn founded in 2007 that found a way to repurpose those CAPTCHAs for book digitization.
How? Optical Character Recognition (OCR) was developed to convert scanned images of handwritten, typed, or printed text into machine-encoded text. However, OCR isn’t perfect, especially when applied to older, worn-out books (about 30% of the words in older books are unrecognizable by the system). reCAPTCHA, which was acquired by Google in 2009, improves the process of digitizing books by sending the words that OCR cannot read to the web as images inside a CAPTCHA for humans to decipher. Each new, unknown word is shown to the user along with a second word whose answer is already known. The user is then asked to type both words. If the user gets the known word right, the system assumes the answer for the unknown word is correct as well. The system then cross-checks that answer against the answers of a number of other users to determine, with higher confidence, whether the original answer was correct.
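The two-word check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not reCAPTCHA's actual implementation: the function names, the case-insensitive comparison, and the threshold of three agreeing answers are all made up for the example.

```python
from collections import Counter

# Assumption: how many matching human answers we want before trusting
# a transcription of the unknown word. The real threshold is not given.
MIN_AGREEING_VOTES = 3


def grade_submission(control_answer, typed_control, typed_unknown, votes):
    """Count a vote for the unknown word only if the known word was solved.

    Returns True when the user passes the challenge.
    """
    if typed_control.strip().lower() != control_answer.strip().lower():
        return False  # failed the known word, so the other answer is ignored
    votes[typed_unknown.strip().lower()] += 1
    return True


def accepted_transcription(votes):
    """Return the leading answer once enough users agree, else None."""
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= MIN_AGREEING_VOTES else None


# Four hypothetical users see the known word "morning" plus one unknown word.
votes = Counter()
submissions = [
    ("morning", "upon"),
    ("morning", "upon"),
    ("mourning", "open"),  # wrong known word: this vote is discarded
    ("morning", "upon"),
]
results = [grade_submission("morning", c, u, votes) for c, u in submissions]
print(results)
print(accepted_transcription(votes))
```

The point of the design is that the known word does the human-versus-bot grading, while the unknown word only ever collects votes.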
reCAPTCHAs are currently being used on 350,000 sites worldwide, and 100 million unknown words are being transcribed and archived every day, which works out to approximately 2.5 million books a year! There’s even a subreddit devoted to the quirky reCAPTCHA word combinations that often arise.
As a side note, while Gravitate supports the reCAPTCHA effort, we also know that not every form needs a tedious verification process. We often use the “honeypot” CAPTCHA method, which places an additional field on the form that is hidden from users. Spambots process and interact with the raw HTML rather than rendering the page, and therefore cannot detect that the field is hidden. If data is inserted into this “honeypot,” the form submission fails.
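A honeypot check along these lines might look like the sketch below. The field name `website`, the inline CSS used to hide it, and the `is_probable_bot` helper are assumptions chosen for illustration; this is not Gravitate's actual code.

```python
# Assumption: the hidden field is called "website", a name a bot is
# likely to fill in. Any plausible-looking name would do.
HONEYPOT_FIELD = "website"

# The form as a bot would see it in the raw HTML. A person never sees
# the extra input because it is hidden with CSS.
FORM_HTML = """
<form method="post" action="/contact">
  <input type="text" name="email">
  <input type="text" name="website" style="display:none"
         tabindex="-1" autocomplete="off">
  <button type="submit">Send</button>
</form>
"""


def is_probable_bot(form_data):
    """Reject the submission if the hidden field came back non-empty."""
    return bool(form_data.get(HONEYPOT_FIELD, "").strip())


human = {"email": "reader@example.com", "website": ""}
bot = {"email": "spam@example.com", "website": "http://spam.example"}
print(is_probable_bot(human))
print(is_probable_bot(bot))
```

The check runs on the server after the form is posted, so it costs legitimate users nothing.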
Von Ahn didn’t stop there. Striving to harvest every idle moment in our lives and put it to productive use, he developed the ESP Game as part of his PhD thesis. The ESP Game harnessed humans’ ability to recognize images far better than computers do by pairing two users who see the same image. The objective is to assign labels to the image that are most likely to be echoed by the other user. Once both users have entered the same label, that word becomes a label for the image. Eventually, “taboo” words were added to each image to force users to become more descriptive with their labels and to further refine the image metadata. The game was eventually licensed by Google in the form of the Google Image Labeler and led to the expansive amount of metadata currently within Google Images.
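The matching rule at the heart of the game can be sketched as follows. This is a simplified illustration: the `agreed_label` function is hypothetical, and it ignores the timing and scoring a real game would have.

```python
def agreed_label(guesses_a, guesses_b, taboo):
    """Return the first of player A's guesses that player B also entered.

    Taboo words are skipped, which pushes both players toward more
    specific labels. Returns None if the players never agree.
    """
    taboo = {word.lower() for word in taboo}
    seen_b = {guess.strip().lower() for guess in guesses_b} - taboo
    for guess in guesses_a:
        guess = guess.strip().lower()
        if guess not in taboo and guess in seen_b:
            return guess
    return None


# Both players see the same photo. "dog" is taboo for this image.
label = agreed_label(
    ["dog", "puppy", "grass"],
    ["animal", "Puppy", "dog"],
    taboo={"dog"},
)
print(label)
```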
Using what he’d learned through reCAPTCHA and the ESP Game, Luis von Ahn started his next crowdsourcing venture: Duolingo. Language translation across the web is poor, to say the least. Duolingo was born from the question, “How can we get 100 million people accurately translating the web to every major language for free?” The answer: Make it a game—and make it free! Duolingo is designed so that as users progress through lessons, they simultaneously help translate websites and other documents. It’s the same approach taken in reCAPTCHA and ESP—a user attempts a translation, and other users crosscheck his or her “answer.” This is accomplished by supplying phrases from actual webpages that need to be translated. The phrases are given to a number of users, and duplicate translations are put into a pool to be verified by advanced users. Once verified, the webpage supplying the content is translated to the designated language.
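The pooling step can be pictured with a small sketch. Everything here is an assumption made for illustration, including the `review_pool` name, the whitespace and case normalization, and the rule that two matching submissions are enough; Duolingo's real pipeline is not described in that detail.

```python
from collections import Counter


def review_pool(translations, min_duplicates=2):
    """Return translations submitted by at least `min_duplicates` users.

    Submissions are compared after lowercasing and collapsing
    whitespace. The result is what advanced users would then verify.
    """
    normalized = [" ".join(text.lower().split()) for text in translations]
    counts = Counter(normalized)
    return [text for text, n in counts.most_common() if n >= min_duplicates]


# Hypothetical learner translations of one phrase from a webpage.
submissions = [
    "The cat sleeps",
    "the cat  sleeps",
    "A cat is sleeping",
    "The cat sleeps.",
]
print(review_pool(submissions))
```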
That’s about it. We realize that we didn’t offer any resource or meaty takeaway from this article—except the expansion of knowledge, of course. We simply believe Luis von Ahn is brilliant and wanted to let you know. And hey, don’t be annoyed by the next reCAPTCHA you come across—just remember, you’re helping improve the literary world.
UPDATE (9/24/13): Luis von Ahn just released a Duolingo explanatory video on Reddit.
Here’s some further reading in case you’re interested: