A while back, I became a client moderator on my hosting company's forum in order to delete spammy posts outside of business hours. Yet the spammers gradually piled on more and more spam, and before I knew it, 20+ posts were coming in a day. Oftentimes they arrived at 4 AM, when nobody could moderate. Although my host is staffed 24/7, their support team is too busy to fill in on the forums. To make matters worse, every CAPTCHA we deployed on the forum was cracked before a REAL user had even registered. I began to devise a bot that would crawl the forums searching for new topics, and then delete them if the post contained keywords that identified spam, such as the "v word".
To implement this, I needed a bot that could:
a) Use the open-uri library to load the forum, and delete offending topics
b) Scan HTML for certain patterns such as topics and post bodies
c) Classify spam based on the "PLOW" system, which I'll explain later
Before we get started, I would like to state the dependencies:
- Ruby 1.8.4/1.8.5 and Rails, with Mongrel (easiest way to get a good appserver and the Ragel C compiler required for Hpricot)
- Hpricot (http://code.whytheluckystiff.net/hpricot)
To make the script reusable, I embedded the bot in a function called automate. It takes six parameters: forum_url (the URL of the forum without http etc.), user, password, forums (an array of the id numbers of the forums to crawl), delete (specifies whether the bot is allowed to delete posts), and log (specifies whether deleted posts will be archived).
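To make that interface concrete, here is a minimal sketch of what the signature could look like. Only the parameter list comes from the description above; the default values and the placeholder body (which just lists the pages that would be crawled) are assumptions, not the real bot.

```ruby
# Hypothetical skeleton of the entry point. The body is a placeholder:
# it only reports which forum pages a run would visit.
def automate(forum_url, user, password, forums, delete = false, log = false)
  forums.map do |forum_id|
    { :forum  => forum_id,
      :path   => "/viewforum.php?f=#{forum_id}",
      :delete => delete,
      :log    => log }
  end
end

plan = automate("forum.example.com", "modbot", "secret", [2, 5], true, true)
plan.each { |step| puts step[:path] }
```

Keeping delete and log as separate switches makes it possible to do a dry run that only reports what it would have removed.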
Because the bot is 214 lines of code, I will only focus on specific parts.
The first challenge was to find the URLs of topics. Because the bot is not logged in, the topics/posts will appear as aboutxxxx.html. After using GET on /viewforum.php?f=xx, we retrieve the topics by scanning for /about[0-9][0-9][0-9][0-9].html/.
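A minimal sketch of that scan, run against made-up markup rather than a live forum. The helper name topic_urls and the sample HTML are assumptions for illustration; the pattern is the one above with the dot escaped, and duplicates are dropped since a forum index may link the same topic more than once.

```ruby
# Returns the distinct topic pages linked from a forum index page.
# The default pattern mirrors the one in the text, with "." escaped so
# it only matches a literal dot.
def topic_urls(html, pattern = /about[0-9]{4}\.html/)
  html.scan(pattern).uniq
end

# Sample markup standing in for the body of /viewforum.php?f=xx.
sample = <<-HTML
<a href="about1234.html">Cheap pills</a>
<a href="about1234.html">1 reply</a>
<a href="about5678.html">Help with DNS</a>
HTML

puts topic_urls(sample).inspect
```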
One of the primary concerns when building the bot was bandwidth usage. In order to reduce overhead, the bot only crawls topics that are new. This means the bot keeps a table of clean topics which will not be crawled again. This assumes spammers won't piggyback on existing topics, but so far that hasn't been a problem.
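The table of clean topics can be as simple as a set of topic URLs. The sketch below is an assumption about how such a table might work, not the bot's actual storage; topics_to_crawl is a hypothetical helper, and a real run would only record a topic once it had been judged ham.

```ruby
require 'set'

# Hypothetical helper: keeps only the topics not yet recorded as clean.
def topics_to_crawl(found, clean)
  found.reject { |topic| clean.include?(topic) }
end

clean = Set.new(["about1234.html"])          # topics already checked
found = ["about1234.html", "about5678.html"] # topics on the current page

topics_to_crawl(found, clean).each do |topic|
  # ... fetch and classify the topic here ...
  clean << topic # remember it so the next pass skips it
end

puts clean.to_a.inspect
```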
When new topics are found, Hpricot is required, and the bot logs in so it can see the user's email. One of the ways that spammers are identified is their use of .ru or .pl addresses, as the software automatically registers spammers with mail.ru.
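The address check itself needs nothing more than string handling. This is a minimal sketch with a hypothetical helper name; the list of suffixes is taken from the paragraph above, and how the email is read from the profile page is left out.

```ruby
# Hypothetical helper: true when the address's domain ends in one of the
# suspicious top-level domains (.ru and .pl by default).
def suspicious_email?(email, tlds = %w[ru pl])
  address = email.to_s.downcase.strip
  return false unless address.include?("@")
  domain = address.split("@").last.to_s
  tlds.any? { |tld| domain.end_with?(".#{tld}") }
end

puts suspicious_email?("bot123@mail.ru")
puts suspicious_email?("admin@example.com")
```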
In order to log in, we fill in the following data:
path = "/login.php"
# Prepare POST data
data = "username=#{user}&password=#{password}&login=Log+In"
headers = {
  'Referer' => 'http://' + forum_url,
  'Content-Type' => 'application/x-www-form-urlencoded'
}
We can then log in with:
resp, data = http.post(path, data, headers)
Finally, we retrieve our session id with the following code:
cookie2 = resp.response['set-cookie'].split("=")[4].gsub("; path", "")
cookie = resp.response['location'].gsub("http://#{forum_url}/index.php?sid=", "")
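The string surgery above depends on the exact shape of the redirect URL. As a slightly more forgiving alternative, the session id can be pulled out with a regular expression. This is a sketch only: session_id_from is a hypothetical helper, and it assumes the sid is a hexadecimal string passed as a query parameter.

```ruby
# Hypothetical helper: extracts the sid query parameter from a redirect
# URL, or returns nil when there is none.
def session_id_from(location)
  match = location.to_s.match(/[?&]sid=([0-9a-f]+)/i)
  match && match[1]
end

puts session_id_from("http://forum.example.com/index.php?sid=0123456789abcdef0123456789abcdef")
```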
We append our session id to the headers, and then begin to crawl the post. Hpricot makes it easy to scrape HTML, and we can obtain the original poster and the number of posts they have with the following code:
op = (doc/"span.name")[0].inner_html
number = (doc/"a.postdetails")[0].inner_html.gsub("Posts: ", "")
We can also obtain just the first post in the same manner. The first part of the "PLOW" system is posts: the number of posts a user has is taken into account. If the user has only one post, the number of links in it, or L, is counted. We can do this with the following code:
if number.to_i == 1
  # Get the text of the first post only
  post = (doc/"div.postbody")[0].inner_html
  # Check for bbcode links
  @urls = post.scan(/.*/) # note, the closing tag is omitted, this was used to please blogger
  # Check for code links not in bbcode
  # You can delete this, but not many new users have 5 or more http links even in code samples
  @urlsamp = post.scan(/<a href="http:.*">/)
  # If they have 5 or more links on their first post, it is spam
  if @urls.size + @urlsamp.size >= 5 # Changed from 7
    spam = true
  end
end
If the user has only 1 post, and 5 or more links, their post will be declared spam.
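Here is a self-contained sketch of the same rule that can be tried without Hpricot. It is not the bot's code: link_count and too_many_links? are hypothetical helpers, and the sketch only counts rendered anchors, since the original bbcode pattern did not survive publication.

```ruby
# Counts rendered anchors that point at http(s) URLs. Negated character
# classes keep several links on one line from collapsing into one match.
def link_count(post_html)
  post_html.scan(/<a\s[^>]*href="https?:[^"]*"/i).size
end

# A first post with `threshold` or more links is treated as spam.
def too_many_links?(post_count, post_html, threshold = 5)
  post_count == 1 && link_count(post_html) >= threshold
end

post = (1..5).map { |i| %(<a href="http://spam#{i}.example/">buy</a>) }.join(" ")
puts link_count(post)
puts too_many_links?(1, post)
```

Matching each anchor separately matters here, because spam posts often put all of their links on a single line.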
The final part of the "PLOW" system is overt keywords. If the post contains the infamous "v word" or other words spammers use, it will be deleted. Mail.ru and Cashette.com have also been added to the list of keywords. Unfortunately, oftentimes the only way to identify a spam post was by using more common keywords such as "girl". I felt this was a slippery slope, but since this was a tech forum, I realized that "girl" would only be used if a flame war started. Since the bot only looks at the first post before it declares the topic "ham", and a word such as "girl" is very unlikely to appear in a legitimate first post, the risk of false positives is small.
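The keyword test can be sketched as a case-insensitive substring search. The helper name and the shortened keyword list below are illustrative assumptions; the real list is longer and includes the medical terms mentioned above.

```ruby
# Hypothetical helper: true when the post body contains any keyword,
# ignoring case.
def spam_keyword?(text, keywords)
  body = text.to_s.downcase
  keywords.any? { |word| body.include?(word.downcase) }
end

# Illustrative subset of the keyword list.
keywords = ["mail.ru", "cashette.com", "girl"]

puts spam_keyword?("Visit Cashette.com today!", keywords)
puts spam_keyword?("My cron job stopped running", keywords)
```

A plain substring match also catches variations such as "girls", at the cost of a few more false positives than a whole-word match would give.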
The final part of the bot is deletion. The bot prepares the following POST data:
post_data = "sid=#{cookie}&f=#{f}&delete=Delete&#{topics}confirm=Yes"
The trick to getting it to delete is adding the confirm variable. Otherwise it will prompt you to confirm.
The bot then posts to modcp.php and then the spam posts are deleted. Unfortunately, some spambots have been known to repost up to 107 times.
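For completeness, here is a sketch of how the #{topics} part of that string might be assembled. The field name topic_id_list[] is an assumption on my part, as is the helper name; check the moderator form on your own forum before relying on it.

```ruby
require 'cgi'

# Hypothetical helper: builds the body of the delete request.
# topic_id_list[] is an assumed field name (URL-encoded as %5B%5D).
def delete_post_data(sid, forum_id, topic_ids)
  topics = topic_ids.map { |id| "topic_id_list%5B%5D=#{CGI.escape(id.to_s)}&" }.join
  "sid=#{CGI.escape(sid)}&f=#{forum_id}&delete=Delete&#{topics}confirm=Yes"
end

puts delete_post_data("abc123", 2, [1234, 5678])
```

Sending all of the offending topics in one request keeps the number of round trips down when a spambot has reposted many times.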
Ultimately, the deployment of an anti-spam bot allowed 95% of spam to be deleted before a moderator or Google ever saw it, saving money, and protecting the reputation of the forum. The tactics that worked best were blocking .ru mail accounts, checking for the number of links the user posted if it was his or her first post, and the usage of medical keywords.
If you are interested in deploying a similar bot, please comment and I would be glad to send the full source.