

reCAPTCHA: digitizing the printed word, one spam filter at a time

In 2009, Google went from having around 20,000 employees to having millions of people all over the world working for them. Well… sort of.

You might already be familiar with what is pictured below. If not, let me tell you a little about something I only recently learned of: this magical thing called “reCAPTCHA”.

In the year 2000, I was worried about passing the 4th grade. I was anticipating that all the computers in the world were going to explode due to Y2K. I was hoping that I could fend for my family and not die of dysentery on The Oregon Trail. Needless to say, life was rough.

But Yahoo! (and hundreds of other web companies, for that matter) was dealing with a much larger epidemic than dysentery: spam. No, not that gross, canned mystery meat, and definitely not George Michael’s Wham! This kind of spam is something (debatably) worse than both… combined!

We’ve all encountered spam in our email inboxes, but now, thanks to Luis von Ahn, we have also all run into the thing that stops a lot of it.

Luis von Ahn grew up in Guatemala and worked in his family’s candy shop as a kid. Later on in his life, along with his college advisor, he was hired by Yahoo! to create a program that could tell the difference between a human and a form bot. They came up with “CAPTCHA”, which—and I’m serious here—stands for “Completely Automated Public Turing test to tell Computers and Humans Apart.”

While the name isn’t exactly a work of genius, these brilliant guys created a challenge-response test that humans can usually pass and bots usually cannot, blocking those evil form bots and eliminating as much spam as possible. These computer-generated, squiggly words are made so that humans can read and type them but computers cannot.

After becoming extraordinarily successful off this creation, Mr. von Ahn still found a weakness in his own program. The flaw? The 10 annoying seconds wasted every time someone types in a CAPTCHA. After turning down a personal offer from Bill Gates to work for Microsoft and winning a MacArthur Fellowship in 2006, von Ahn re-created CAPTCHA and titled it… erm… reCAPTCHA! Luis von Ahn believed this new idea would be good for humanity, and as far as some other types of crowdsourcing go, I agree.

In 2009, Google bought reCAPTCHA, which had launched two years earlier, and put it in front of even more of the masses. Now, what reCAPTCHA does is take the words we type in and use them to digitize old books and newspapers. These books and newspapers are scanned, and the scanned images are turned into text using “Optical Character Recognition” (OCR). The problem is that computers still cannot read text as well as humans can. A simple word like “of” could be interpreted as “at,” since old books and newspapers may have words that are damaged or hard to scan.

Here is where the superpower of humans comes in! We can read the word “of” and correctly submit “of,” instead of “at”, along with a computer-generated CAPTCHA word. So a reCAPTCHA word (the scanned one the computer could not read) is paired with a CAPTCHA word (one whose answer is already known), and the pair is placed at the login of something like an email account. If we get the CAPTCHA word correct, we are in-there-like-swimwear. Even if we get the reCAPTCHA word wrong and cannot decipher it ourselves, but get the CAPTCHA word right, we are still allowed access. The reCAPTCHA word will be shown to many other humans to increase the likelihood of it being deciphered correctly.
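For the code-minded, the two-word check described above might look something like this. It is only a minimal sketch, not Google's actual implementation: the function name `check_challenge`, the case-insensitive comparison, and the choice to keep the human's guess only when the known word is typed correctly are all my own assumptions.

```python
def check_challenge(control_answer, typed_control, typed_unknown):
    """Hypothetical two-word check: returns (access_granted, recorded_guess).

    control_answer -- the known answer for the computer-generated CAPTCHA word
    typed_control  -- what the person typed for that known word
    typed_unknown  -- what the person typed for the scanned word OCR could not read
    """
    # Access depends only on the known CAPTCHA word (assumed case-insensitive).
    passed = typed_control.strip().lower() == control_answer.strip().lower()

    # Assumption: the guess for the scanned word is kept only when the person
    # passed, since a passing answer suggests a human typed it.
    guess = typed_unknown.strip().lower()
    recorded = guess if (passed and guess) else None
    return passed, recorded
```

Note that a wrong guess for the scanned word never blocks anyone here. It simply becomes one more vote for that word.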

Using the aforementioned example of the word “of” being read as “at”, if enough people keep typing “of”, the word “of” will replace the word that the OCR program guessed. Over time, millions of people decipher these scanned reCAPTCHA words and create digitized versions of old New York Times newspapers and classic books for Google!
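The "if people keep typing it, it wins" idea is basically a vote. Here is a rough sketch of how that could work, under my own assumptions: the function name `resolve_word` and the `min_votes` threshold of 3 are made up for illustration, and the real service may well weigh answers differently.

```python
from collections import Counter


def resolve_word(ocr_guess, human_guesses, min_votes=3):
    """Hypothetical vote: pick the most-typed word if enough people agree.

    Falls back to the OCR program's guess when no answer reaches min_votes.
    """
    # Count the typed answers, ignoring case and blank submissions.
    votes = Counter(g.strip().lower() for g in human_guesses if g.strip())
    if not votes:
        return ocr_guess

    top_word, top_count = votes.most_common(1)[0]
    # Only replace the OCR guess once enough humans have agreed on one word.
    return top_word if top_count >= min_votes else ocr_guess
```

So `resolve_word("at", ["of", "of", "Of", "at"])` settles on “of”, while a single lonely vote leaves the OCR guess in place.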

In a matter of months, with the power of reCAPTCHA and humans’ ability to read damaged words, 20 years’ worth of material is digitized and transcribed thanks to… well… you… me… Alan Rosenberg… maybe Luis von Ahn and Bill Gates… your mom? Everyone! In time, thanks to Luis von Ahn and his team, we will all be a part of digitizing millions of old texts to be distributed online. Now, where are our paychecks, Google?


Written by Dylan on March 14th, 2013
