
reCAPTCHA: digitizing the printed word, one spam filter at a time

In 2009, Google went from having around 20,000 employees to having millions of people all over the world working for them. Well… sort of.

You might already be familiar with what is pictured below. If not, let me tell you a little about something I only recently learned about: this magical thing called “reCAPTCHA”.

In the year 2000, I was worried about passing the 4th grade. I was anticipating that all the computers in the world were going to explode due to Y2K. I was hoping that I could fend for my family and not die of dysentery on The Oregon Trail. Needless to say, life was rough.

But Yahoo! (and hundreds of other web companies, for that matter) was dealing with a much larger epidemic than dysentery: spam. No, not that gross, canned mystery meat, and definitely not George Michael’s Wham! This kind of spam is something (debatably) worse than both… combined!

We’ve all encountered spam in our email inboxes, but now, thanks to Luis von Ahn, we have also all run into the thing that stops a lot of it.

Luis von Ahn grew up in Guatemala and worked in his family’s candy shop as a kid. Later on, as a graduate student at Carnegie Mellon, he and his advisor, Manuel Blum, set out to help Yahoo! with a program that could tell the difference between a human and a form bot. They came up with “CAPTCHA”, which (and I’m serious here) stands for “Completely Automated Public Turing test to tell Computers and Humans Apart.”

While the name isn’t exactly a work of genius, these brilliant guys created a challenge-response test that most humans can pass and those evil form bots cannot, which eliminates as much spam as possible. These computer-generated, squiggly words are made so that humans can read and type them in, but computers cannot.

After becoming extraordinarily successful off this creation, Mr. von Ahn still found a weakness in his own program. The flaw? The 10 annoying seconds wasted every time someone types in a CAPTCHA. After turning down a personal offer from Bill Gates to work for Microsoft and winning a MacArthur Fellowship in 2006, von Ahn reworked CAPTCHA and titled it…erm… reCAPTCHA! Luis von Ahn believed this new idea would be good for humanity, and as far as some other types of crowdsourcing go, I agree.

In 2009, Google bought reCAPTCHA and released it upon the masses. Now, what reCAPTCHA does is take the words we type in and use them to digitize old books and newspapers. These books and newspapers are scanned into images, and “Optical Character Recognition” (OCR) software then tries to turn those images into text. The problem is that computers still cannot read text as well as humans can. A simple word like “of” could be interpreted as “at,” since old books and newspapers may have words that are damaged or hard to scan.

Here is where the superpower of humans comes in! We can read the word “of” and correctly submit “of” instead of “at”. So a scanned word that the OCR could not read (the reCAPTCHA word) is paired with a computer-generated CAPTCHA word whose answer is already known, and the pair is placed at the login of something like an email account. If we get the CAPTCHA word correct, we are in-there-like-swimwear. Even if we get the reCAPTCHA word wrong or cannot decipher it ourselves, as long as we get the CAPTCHA word, we are still allowed access. The reCAPTCHA word will be shown to many other humans to increase the likelihood of it being deciphered correctly.
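For the curious, here is a minimal Python sketch of that check. It is not Google’s actual code: the function name `check_submission`, its parameters, and the choice to only record guesses from people who got the CAPTCHA word right are all my own assumptions for illustration.

```python
def check_submission(captcha_answer, typed_captcha, typed_recaptcha, guesses):
    """Decide access using only the CAPTCHA word, whose answer is known.

    `guesses` is a plain list that collects what people typed for the
    unknown reCAPTCHA word (hypothetical storage, just for this sketch).
    """
    # Access depends only on the word we already know the answer to.
    passed = typed_captcha.strip().lower() == captcha_answer.strip().lower()

    # Assumption: only keep a guess from someone who just proved to be human,
    # and ignore empty guesses.
    guess = typed_recaptcha.strip().lower()
    if passed and guess:
        guesses.append(guess)

    return passed


guesses = []
print(check_submission("morning", "Morning", "of", guesses))   # CAPTCHA word right
print(check_submission("morning", "mourning", "at", guesses))  # CAPTCHA word wrong
print(guesses)
```

Notice that the reCAPTCHA guess never affects whether you get in. It is simply saved for later, which is why a wrong guess costs you nothing.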

Using the aforementioned example of the word “of” being read as “at”, if enough people keep typing “of”, the word “of” will replace the word that the OCR program guessed in the digitized text. Over time, millions of people decipher these scanned reCAPTCHA words and create digitized versions of old New York Times newspapers and classic books for Google!
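That “keep typing it until it sticks” idea is basically a vote. Here is a rough Python sketch, assuming a simple most-common-answer rule; the function `resolve_word` and the `min_votes` threshold of 3 are made up for this example, and the real system’s rules may well differ.

```python
from collections import Counter


def resolve_word(ocr_guess, human_guesses, min_votes=3):
    """Return the most common human answer once it has enough votes.

    Falls back to the OCR program's guess if nobody has answered yet or
    the leading answer has fewer than `min_votes` votes (a hypothetical
    threshold chosen only for this sketch).
    """
    if not human_guesses:
        return ocr_guess

    # most_common(1) gives a list holding the single (word, count) leader.
    word, votes = Counter(human_guesses).most_common(1)[0]
    return word if votes >= min_votes else ocr_guess


# Three people typed "of" and one typed "at", so "of" wins.
print(resolve_word("at", ["of", "of", "at", "of"]))
# Only one answer so far, so we stick with the OCR guess for now.
print(resolve_word("at", ["of"]))
```

The threshold is there so that one person’s typo cannot overwrite the text on its own.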

In months, with the power of reCAPTCHA and humans’ ability to read damaged words, 20 years’ worth of material is digitized and transcribed thanks to… well… you…me… Alan Rosenberg… maybe Luis von Ahn and Bill Gates… your mom? Everyone! In time, thanks to Luis von Ahn and his team, we will all be a part of digitizing millions of old texts to be distributed online. Now, where are our paychecks, Google?


Written by Dylan on March 14th, 2013


7 Responses to 'reCAPTCHA: digitizing the printed word, one spam filter at a time'


  1. So interesting! I don’t believe I’ve read a single thing like this before. Seriously.. This site is something that is required on the web, a blog with a little originality!

  2. I read this post because of its topic, the difference between recent and earlier technologies. It’s a great article.

  3. Very energetic blog, I loved this bit. Will there be a part 2?

  4. Hiya, I am really glad I have found this info. Nowadays bloggers publish only about gossips and internet and this is really annoying. A good site with exciting content, this is what I need. Thanks for keeping this website, I’ll be visiting it. Do you do newsletters? Can not find it.

  5. Great goods from you, man. Really like what you are saying and the way in which you are saying it. You are making it enjoyable and you still take care to keep it wise.

    I can’t wait to learn far more from you. This is really a great blog.

    Weldon

    April 6th, 2013

  6. Hello, it’s a very good post, thx.

  7. Hello, I am so glad I found your website. I really found you by mistake while I was searching on Yahoo for something else. Regardless, I am here now and would just like to say many thanks for an incredible post and an all-round exciting blog (I also love the theme/design). I don’t have time to go through it all at the minute, but I have bookmarked it and also added your RSS feeds, so when I have time I will be back to read a great deal more. Please do keep up the great job.
