Scraping OkCupid: Will Bot For Love
Recently I've found myself with a bit of extra time on my hands (being unemployed can do that to a person) and not having any software projects to work on (which for me is almost unheard of). So I sat down and racked my brain to think of something to write, and I came up with a project to get my feet wet in the wonderful world of botting. Not long ago, I wrote an article for 2600 in which I had to spend a rather long time gathering information. I want to write more pieces for 2600, and I have a few ideas rattling around my head that would make good articles... the only problem is they are pretty data intensive, and data gathering by hand is slow and boring! So, let's kill two birds with one stone, and start working smarter, not harder (with bots!).
I've thought about writing bots before, but I never really got around to it. Back in the days of the original Starcraft, I wanted to write a bot that would act as a trivia master in the various chat channels. It would ask questions, keep track of scores, and have cool functions like leaderboards and such. Well, while the idea was sound the execution was lacking, and I never really got it up and running.
With that in mind, I set out to find something to bot. I needed to find a site that was easy to strip data from, as well as having data worth mining. I eventually landed on OkCupid.com, a free dating site with millions of users (how many I'm not sure of, but Google has indexed around 2 million profiles at the time of this writing). The data on the site is all wrapped nice and neat into different <div> tags, each with a unique ID, and the information might be fun to work with and see if we can't find any cool trends out there. So, now that I have a place to deploy my bot, it was time to plan it out and write some code!
Logistics and Design
My initial design for the bot was to search Google to discover URLs to user profiles, and then visit each of those profiles and strip the data out. It turns out there were a few problems with that. First, Google will only let you page through the first 1000 results before cutting you off. I could always apply for an API key, but even then I don't think I could make as many searches per day as I'd like. Second, even if I had a way to keep it within the first 1000 results, Google knows I'm a bot and will test me every so often (which is lame! Or good, depending on how you look at it...). So, how do we go about it?
Well, before we get too much further, I'll have to make an account. There are certain profiles on OkCupid that only allow for users to view them, and we want access to ALL the data, not just the public stuff. So I made i_am_cupidbot, a quirky computer who loves robot things, and smokes on occasion. And since he's set up to be looking for anyone, of any age, gender, location, or preference, he'll find everyone! Check him out!
With the profile in place, I noticed that on the home page I get a suggested profile to visit, which is random on any given page load. That initial URL was my seed, which the bot would start on to begin scraping information from the site. And if you follow that link, each profile page has around 12 links to other people that might be similar. In that way, we can use those links to spread out from our seed URL, and act like a real web spider when scraping the site. The Google problem? Yeah, solved.
First things first: Here's the code on Github if you want a copy for yourself to play with! It's a Visual Studio solution called SpiderScrape, written in C# and built on top of the .NET 3.5 runtime. Have fun!
Now that all the logistical issues were taken care of, next came the fun part: the code! I picked C# since I am working in a Windows environment at the moment, and I already have Visual Studio installed on this machine. That, and I've been writing in C# for the past few years, so the code will go fast. One of the cool things about using C# is having access to all the UI elements for Windows Forms applications, such as the WebBrowser control! Using a WebBrowser control, I can navigate around the web and do basic content manipulations with InnerHTML functions to strip the data I want out. Well, that was easy!
So, what does the bot do? Well, it maintains a list (functionally it's a queue, but using a System.Collections.Generic.List) of pending URLs to visit. Each cycle, it will pop a URL off the pending queue, and add it to the visited queue. Then we make an HTTP request to the URL, and wait for it to either succeed or time out. When a URL times out, it is tried again until it finally works (this was needed because the internet I'm on is so unreliable and slow). Once a request succeeds, the bot will scrape the page for the data we want to save, and for URLs that we haven't seen yet (so, not in either the pending or visited lists) that match good URL parameters (see the paragraph below about the configuration). This continues until there are no more URLs in the pending list, at which time the scrape of the site is considered complete (well, as complete as it's going to get!).
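For the curious, here's roughly what that loop looks like, boiled down. This is a minimal sketch in Python, not the actual C# from SpiderScrape: the function and parameter names (crawl, fetch, extract_links, is_good_url, max_attempts) are all made up for illustration, and unlike my bot it gives up on a URL after a fixed number of failed attempts instead of retrying forever.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional


def crawl(seed: str,
          fetch: Callable[[str], Optional[str]],
          extract_links: Callable[[str], Iterable[str]],
          is_good_url: Callable[[str], bool],
          max_attempts: int = 5) -> List[str]:
    """Breadth-first crawl from `seed`; returns URLs in the order visited.

    `fetch` returns the page body, or None on a timeout/failure.
    All names here are hypothetical, not taken from SpiderScrape.
    """
    pending = deque([seed])   # URLs waiting to be visited
    seen = {seed}             # everything ever queued (pending + visited)
    visited = []

    while pending:
        url = pending.popleft()
        visited.append(url)

        # Retry on failure, but only up to max_attempts times.
        html = None
        for _ in range(max_attempts):
            html = fetch(url)
            if html is not None:
                break
        if html is None:
            continue  # give up on this URL and move on

        # Queue only links we have never seen that pass the URL filter.
        for link in extract_links(html):
            if link not in seen and is_good_url(link):
                seen.add(link)
                pending.append(link)

    return visited
```

Keeping a single "seen" set is just a shortcut for checking both the pending and visited lists at once. Passing fetch in as a function also means the loop can be tried out against a fake site held in a dictionary, no internet required.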
Data is stored in a few different ways. The scraped data is written to a .csv file for manipulation later (using tools I wrote in C# a while back for my 2600 article), while the pending/visited lists are written out to files so they persist between runs of the spider. Each of the files is written out every so often based on configuration data, though during these tests I used an interval of 50 URLs between data dumps (disk I/O is too slow to do it every time!).
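Here's a small sketch of that buffer-then-dump idea, again in Python rather than the bot's real C#. The class name, method names, and file names (data.csv, pending.txt, visited.txt) are all invented for the example; the real file names come from the config.

```python
import csv
from pathlib import Path


class ScrapeStore:
    """Buffers scraped rows in memory and syncs to disk every N pages."""

    def __init__(self, out_dir, pages_per_dump=50):
        self.out_dir = Path(out_dir)
        self.pages_per_dump = pages_per_dump
        self.rows = []              # scraped rows not yet written to disk
        self.pages_since_dump = 0

    def record(self, row, pending, visited):
        """Buffer one page's row; dump everything once the interval is hit."""
        self.rows.append(row)
        self.pages_since_dump += 1
        if self.pages_since_dump >= self.pages_per_dump:
            self.dump(pending, visited)

    def dump(self, pending, visited):
        # Append the buffered rows to the CSV (hypothetical file name).
        with open(self.out_dir / "data.csv", "a", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(self.rows)
        # Overwrite the URL lists so a later run can pick up where this one stopped.
        (self.out_dir / "pending.txt").write_text("\n".join(pending), encoding="utf-8")
        (self.out_dir / "visited.txt").write_text("\n".join(visited), encoding="utf-8")
        self.rows.clear()
        self.pages_since_dump = 0


def load_urls(path):
    """Read a saved URL list back in; a missing file means a fresh start."""
    p = Path(path)
    if not p.exists():
        return []
    return [line for line in p.read_text(encoding="utf-8").splitlines() if line]
```

The tradeoff is that a crash can lose up to one interval's worth of pages, which is the price of not hitting the disk on every single profile.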
The initial version of the application took a few hours to write, followed by a few small tweaks/changes here and there. So, in less than a day, the bot took shape and was out in the wild, scraping OkCupid of its data, and stealing the hearts of everyone it came into contact with.
A quick aside concerning the configuration files, for inquisitive folks who want to download the bot and play with it themselves. The config.txt file will tell the bot everything it needs to know to get started scraping a site. The browser_start_url property is the "seed" URL that I talked about earlier, which you'll probably have to gather by hand. The http_request_timeout is how long until the browser gives up on a request and attempts it again (since I'm on slow internet that is also unreliable, these restarts are priceless), while the pages_per_dump property is how many pages to scrape before syncing the data to disk. All the output files are specified in the config, and are pretty easy to spot. The most useful settings are in the Scraping section of the config. The bad_url_regex property is a regular expression; any URL it matches is skipped over instead of added to the pending queue. The good_url_regex is just the opposite: any URL it matches IS added to the pending queue. And finally, the ids_to_scrape property is a list of IDs of tags to pull out of each page visited. Changing these three properties along with the browser_start_url will get you up and scraping another site in no time!
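To make the two regex settings a little more concrete, here's a Python sketch of how a config like that could be loaded and used to filter URLs. Big caveat: I'm assuming an INI-style layout and a comma-separated ids_to_scrape purely for the example, so check the real config.txt in the repo for the actual format. Only the property names and the "Scraping" section name come from the description above; the "General" section, the URLs, the regexes, and the IDs are invented.

```python
import configparser
import re

# Hypothetical config text: only the property names and the "Scraping"
# section name come from the article; every value here is invented.
EXAMPLE_CONFIG = """
[General]
browser_start_url = http://www.example.com/profile/seed_user
http_request_timeout = 30
pages_per_dump = 50

[Scraping]
good_url_regex = /profile/[^/?]+
bad_url_regex = /(photos|journal)$
ids_to_scrape = basic_info, essay_text
"""


def load_settings(text):
    """Parse the (assumed INI-style) config into a plain dict."""
    # interpolation=None keeps '%' in regexes or URLs from being expanded.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    general, scraping = parser["General"], parser["Scraping"]
    return {
        "start_url": general["browser_start_url"],
        "timeout": general.getint("http_request_timeout"),
        "pages_per_dump": general.getint("pages_per_dump"),
        "good": re.compile(scraping["good_url_regex"]),
        "bad": re.compile(scraping["bad_url_regex"]),
        "ids": [s.strip() for s in scraping["ids_to_scrape"].split(",") if s.strip()],
    }


def should_queue(url, settings):
    """A bad match always skips the URL; otherwise it must match the good regex."""
    if settings["bad"].search(url):
        return False
    return settings["good"].search(url) is not None
```

Checking the bad regex first means it acts as a veto, so a URL matching both patterns gets skipped. That ordering is my own choice for the sketch.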
Let's start with the successes: the bot itself works fairly well, and data is getting scraped as I write this. I'm at around 60,000 profiles scraped in the past week, and that's without running it 24/7, and on a terrible internet connection at my parents' place (I'm lucky to see 1 Mbps, and the latency is killer!). But, regardless, the bot is stripping data from the site and performing exactly as it should. The bot has also been generalized a bit, so that other configuration files can be dropped in and other sites can be scraped without changing any code.
And I ended up with a lot of data to play with, too! I still have a ton of pending URLs to visit, if I decide to keep botting (at the rate I'm going, it'll take a few hundred days to visit each profile, which doesn't make for a very nice data set). But I have a week's worth here and now, at least.
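Here's the back-of-the-envelope math behind that "few hundred days", as a tiny Python snippet. It uses the two numbers from this post (about 60,000 profiles in a week, and roughly 2 million profiles indexed by Google). Treat the 2 million as an assumption, since the real user count is almost certainly higher. The function name is made up.

```python
def days_to_finish(total_profiles, scraped_so_far, days_so_far):
    """Estimate remaining days, assuming the scrape rate stays constant."""
    rate_per_day = scraped_so_far / days_so_far
    return (total_profiles - scraped_so_far) / rate_per_day


# Numbers from the post: ~60,000 profiles in 7 days, ~2,000,000 indexed profiles.
rate = 60_000 / 7                                   # about 8,571 profiles per day
remaining = days_to_finish(2_000_000, 60_000, 7)    # about 226 more days
```

So even against the conservative 2 million figure, that's well over seven months of scraping at my parents' connection speed.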
Now for the bad stuff: the bot is slow. Using the built-in C# controls is easy to code, but it's very slow! If I were in a Linux environment, I would probably replace the C# WebBrowser control's functionality with wget and sed, and see a great performance boost (because I don't have to render the pages, and at most I will have one HTTP request per profile) on even a slow internet connection. The other issue is that the bot is a bit error prone on bad internet connections. Right now, the internet I'm working on will randomly drop for hours on end, and the bot has a hard time coping with that at times. I have a fix in place to see if the WebBrowser control is getting a 404 page, but it's a hackish fix at best (if you look at the code, you'll even notice the //HACK comment tags) and could be done right with another hour or so of work on the bot.
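To show what skipping the rendering step could look like, here's a sketch that swaps in Python's standard library where I mentioned wget and sed: one plain HTTP request per page, then pulling the text out of tags by ID straight from the raw HTML. None of this is in SpiderScrape; the names (fetch_html, IdTextExtractor, extract_by_id) are invented, and the parser assumes reasonably well-formed markup and doesn't handle one wanted ID nested inside another.

```python
import urllib.request
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def fetch_html(url, timeout=30):
    """One plain HTTP request, no rendering. Returns None on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:  # covers URLError, HTTPError (e.g. 404) and timeouts
        return None


class IdTextExtractor(HTMLParser):
    """Collects the text inside each element whose id is in `wanted_ids`."""

    def __init__(self, wanted_ids):
        super().__init__()
        self.wanted = set(wanted_ids)
        self.results = {}
        self._current = None   # id of the element being captured, if any
        self._depth = 0        # nesting depth inside that element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._current is not None:
            self._depth += 1
            return
        element_id = dict(attrs).get("id")
        if element_id in self.wanted:
            self._current = element_id
            self._depth = 1
            self.results[element_id] = ""

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._current = None   # left the element we were capturing

    def handle_data(self, data):
        if self._current is not None:
            self.results[self._current] += data


def extract_by_id(html, wanted_ids):
    parser = IdTextExtractor(wanted_ids)
    parser.feed(html)
    parser.close()
    return parser.results
```

Because fetch_html returns None for a dropped connection and for a 404 alike, the caller can treat "no page" as a single case to retry or skip, rather than sniffing the rendered page for an error message.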
But, overall, I'm as happy with the bot as I expected to be. It's working well, is as stable as my parents' internet connection, and is successfully scraping data from sites with minimal human intervention! There are a few areas where I have problems to solve, and a few missing features that would be nice (such as an option to choose from multiple configuration files when the bot isn't running, so you can bot multiple sites without having to move files on the disk) but for a quick, one-off bot, it's doing just fine!
Extra Fun Stuff!
Ok, so while this section isn't going to contain any technical information, I'd like to add it in here because I find it quite humorous. In the week or so that I've been botting, I've gotten quite a few (103 as of right now) messages from people whose profiles I've crawled. I got a lot of links to YouTube videos regarding robots, especially the Flight of the Conchords video for The Humans Are Dead. I also got a few people hitting on my robot self, and a couple even propositioned me! I guess I should have picked a less attractive guy from Google Images when I was setting up the profile! I just took the first dude with glasses I saw in my search! Anyway, here's a dump of some of the more interesting messages I've sent/received. Enjoy!
Act I: In which the word "analogue" appears
Act II: In which MUDs are discussed
Act III: In which despair rears its ugly head
Act IV: In which smiles occur
Act V: In which turtles are serialized
Act VI: In which motives are questioned
Act VII: In which "you" and "are" are misspelled
Act VIII: In which someone uses the fuck word
And thus ends the saga of i_am_cupidbot! I'll probably kill my spider tonight some time, as the data gathering is going too slow for my tastes. But it was a fun ride, at least!