To implement this, I needed a bot that could:
a). Use the open-uri library to load the forum and delete offending topics
b). Scan HTML for certain patterns, such as topics and post bodies
c). Classify spam based on the "PLOW" system, which I'll explain later
Before we get started, I would like to state the dependencies:
- Ruby and Rails 1.8.4/1.8.5 with Mongrel (easiest way to get a good appserver and the Ragel C compiler required for Hpricot)
- Hpricot (http://code.whytheluckystiff.net/hpricot)
Because the bot is 214 lines of code, I will only focus on specific parts.
The first challenge was to find the URLs of topics. Because the bot is not logged in, the topics/posts appear as aboutxxxx.html. After a GET on /viewforum.php?f=xx, we retrieve the topics by scanning for /about[0-9][0-9][0-9][0-9]\.html/ (the dot is escaped so that it only matches a literal period).
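As a rough sketch of that scan, assuming the forum index has already been fetched into a string: the helper name extract_topic_urls and the sample HTML are made up for illustration and are not taken from the bot.

```ruby
# Hypothetical helper: pull the unique topic paths out of a forum index page.
def extract_topic_urls(html)
  # \d{4} matches the four-digit topic id; the escaped dot matches a literal "."
  html.scan(/about\d{4}\.html/).uniq
end

# Made-up sample page; the same topic is often linked more than once.
sample = <<~HTML
  <a href="about1234.html">Ruby question</a>
  <a href="about5678.html">Cheap pills</a>
  <a href="about1234.html">Ruby question (last post)</a>
HTML

puts extract_topic_urls(sample).inspect
```

Calling uniq matters because a forum index usually links each topic several times (title, last post, page numbers).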
One of the primary concerns when building the bot was bandwidth usage. To reduce overhead, the bot only crawls topics that are new. This means the bot keeps a table of clean topics which are not crawled again. This assumes spammers won't piggyback on existing topics, but so far that hasn't been a problem.
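A minimal sketch of the "clean table" idea, assuming an in-memory Set. How the real bot stored its table is not shown here, and the names clean_topics and topics_to_crawl are hypothetical.

```ruby
require 'set'

# Hypothetical in-memory clean table; a real bot would likely persist this.
clean_topics = Set.new(["about1234.html"])

# Return only the topics that have not already been judged clean.
def topics_to_crawl(found, clean)
  found.reject { |topic| clean.include?(topic) }
end

found = ["about1234.html", "about5678.html"]
pending = topics_to_crawl(found, clean_topics)
puts pending.inspect

# For illustration, pretend every pending topic passed the spam checks,
# and remember it so it is skipped on the next run.
pending.each { |topic| clean_topics << topic }
```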
When new topics are found, Hpricot is required, and the bot logs in so it can see the user's email. One way spammers are identified is by their use of .ru or .pl addresses, as their software automatically registers accounts with mail.ru.
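The email check could look something like the following. This is only a sketch: the constant SUSPICIOUS_TLDS and the method suspicious_email? are names I made up, and the addresses are examples.

```ruby
# Hypothetical list of flagged top-level domains, taken from the text above.
SUSPICIOUS_TLDS = %w[ru pl].freeze

# True when the address's domain ends in one of the flagged TLDs.
def suspicious_email?(email)
  domain = email.to_s.downcase.split("@").last.to_s
  SUSPICIOUS_TLDS.any? { |tld| domain.end_with?(".#{tld}") }
end

puts suspicious_email?("bot@mail.ru")       # flagged
puts suspicious_email?("jane@example.com")  # not flagged
```

Matching on the ".ru" suffix rather than searching for "ru" anywhere avoids flagging domains that merely contain those letters.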
In order to login, we fill in the following data:
path = "/login.php"
# Prepare POST data
data = "username=#{user}&password=#{password}&login=Log+In"
headers = {
  'Referer' => 'http://' + forum_url,
  'Content-Type' => 'application/x-www-form-urlencoded'
}
We can then login with resp, data = http.post(path, data, headers).
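One caveat with building the POST body by string interpolation is that a password containing & or = would break the form encoding. A hedged alternative, using the standard library's URI.encode_www_form, is sketched below. The helper build_login_request is hypothetical and performs no network request; it only prepares the three values that would be passed to http.post.

```ruby
require 'uri'

# Hypothetical helper: build the path, body and headers for the login POST.
def build_login_request(user, password, forum_url)
  path = "/login.php"
  # encode_www_form escapes special characters and turns spaces into "+"
  data = URI.encode_www_form(
    "username" => user,
    "password" => password,
    "login"    => "Log In"
  )
  headers = {
    'Referer'      => 'http://' + forum_url,
    'Content-Type' => 'application/x-www-form-urlencoded'
  }
  [path, data, headers]
end

# Example values only; forum.example.com is a placeholder host.
path, data, headers = build_login_request("bot", "p&ss=word", "forum.example.com")
puts data
```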
Finally, we retrieve our session id with the following code:
cookie2 = resp.response['set-cookie'].split("=")[4].gsub("; path", "")
cookie = resp.response['location'].gsub("http://#{forum_url}/index.php?sid=", "")
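Stripping a fixed prefix works as long as the redirect always has exactly that shape. A slightly more forgiving sketch, assuming the Location header is a full URL with sid in its query string, parses the query instead. The method name session_id_from is my own, and the URL is a placeholder.

```ruby
require 'uri'

# Hypothetical helper: read the sid parameter out of a redirect URL.
def session_id_from(location)
  query = URI.parse(location).query
  return nil if query.nil?
  # decode_www_form returns [key, value] pairs; to_h makes lookup by name easy
  URI.decode_www_form(query).to_h["sid"]
end

puts session_id_from("http://forum.example.com/index.php?sid=abc123")
```

This returns nil rather than a mangled string when the login fails and the redirect carries no sid.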
We append our session id to the headers and then begin to crawl the post.
Hpricot makes it easy to scrape html, and we can obtain the original poster, and the number of posts they have with the following code:
op = (doc/"span.name")[0].inner_html
number = (doc/"a.postdetails")[0].inner_html.gsub("Posts: ", "")
We can obtain just the first post in the same manner. The first part of the "PLOW" system is posts: the number of posts a user has is taken into account. If the user has only one post, the number of links in that post, the L, is counted. We can do this with the following code:
if number.to_i == 1
  # Get the text of the first post only
  post = (doc/"div.postbody")[0].inner_html
  # Check for bbcode links
  @urls = post.scan(/.*/) # note, the closing tag is omitted, this was used to please blogger
  # Check for code links not in bbcode
  # You can delete this, but not many new users have 5 or more http links even in code samples
  @urlsamp = post.scan(/<a href="http:.*">/)
  # If they have 5 or more links on their first post, it is spam
  if @urls.size + @urlsamp.size >= 5 # Changed
    spam = true
  end
end
If the user has only 1 post, and 5 or more links, their post will be declared spam.
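Because the bbcode pattern above did not survive publishing, here is a simplified stand-in rather than the bot's actual check: it just counts http:// and https:// occurrences in the post. The names LINK_THRESHOLD, count_links and link_spam? are hypothetical, and this approach may count a link twice when its visible text repeats the URL.

```ruby
# Threshold taken from the text: 5 or more links on a first post is spam.
LINK_THRESHOLD = 5

# Simplified stand-in: count every http:// or https:// in the post body.
def count_links(post_html)
  post_html.scan(%r{https?://}i).size
end

# Only users with exactly one post are judged on their link count.
def link_spam?(post_count, post_html)
  post_count == 1 && count_links(post_html) >= LINK_THRESHOLD
end

# Made-up post body with six links.
spammy = (1..6).map { |i| "http://pills#{i}.example.com" }.join(" ")
puts link_spam?(1, spammy)
puts link_spam?(40, spammy)
```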
The final part of the "PLOW" system is overt keywords. If the post contains the infamous "v word" or other words spammers use, it will be deleted. Mail.ru and Cashette.com have also been added to the list of keywords. Unfortunately, often the only way to identify a spam post was by using more common keywords such as "girl". I felt this was a slippery slope, but since this was a tech forum, I realized that "girl" would only be used if a flame war started. Since the bot only looks at the first post before it declares the topic "ham", and a legitimate first post will rarely contain words such as "girl", the risk of false positives is small.
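A keyword check along these lines might look like the sketch below. Only the two keywords named above are included; the constant KEYWORDS and the method keyword_spam? are illustrative names, and the real list was longer.

```ruby
# Only the keywords mentioned in the text; the real list was longer.
KEYWORDS = ["mail.ru", "cashette.com"].freeze

# Case-insensitive substring match against each keyword.
def keyword_spam?(text, keywords = KEYWORDS)
  lowered = text.downcase
  keywords.any? { |word| lowered.include?(word.downcase) }
end

puts keyword_spam?("Visit Cashette.com today!")
puts keyword_spam?("How do I install Mongrel?")
```

Note that substring matching is blunt: a short keyword will also match inside longer words, which is part of why common words are risky to add.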
The final part of the bot is deletion. The bot prepares the following POST data:
post_data = "sid=#{cookie}&f=#{f}&delete=Delete&#{topics}confirm=Yes"
The bot then posts to modcp.php and then the spam posts are deleted. Unfortunately, some spambots have been known to repost up to 107 times.
The trick to getting it to delete is adding the confirm variable. Otherwise it will prompt you to confirm.
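For illustration, the deletion body could be assembled as follows. This is a sketch under a loud assumption: the field name used for each topic id (topic_id_list[] here) is my guess at what the moderation form expects and should be checked against the actual form on your forum. The helper build_delete_data is hypothetical and does not send anything.

```ruby
require 'uri'

# Hypothetical helper: build the form body for a bulk topic deletion.
# ASSUMPTION: "topic_id_list[]" is a guessed field name; verify it against
# the moderation form of the forum you are targeting.
def build_delete_data(sid, forum_id, topic_ids, field = "topic_id_list[]")
  pairs = [["sid", sid], ["f", forum_id], ["delete", "Delete"]]
  topic_ids.each { |id| pairs << [field, id] }
  # Including confirm=Yes skips the confirmation prompt described above.
  pairs << ["confirm", "Yes"]
  URI.encode_www_form(pairs)
end

puts build_delete_data("abc123", 3, [1234, 5678])
```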
Ultimately, the deployment of an anti-spam bot allowed 95% of spam to be deleted before a moderator or Google ever saw it, saving money, and protecting the reputation of the forum. The tactics that worked best were blocking .ru mail accounts, checking for the number of links the user posted if it was his or her first post, and the usage of medical keywords.
If you are interested in deploying a similar bot, please comment and I would be glad to send the full source.
3 comments:
Interesting! Would like to get the source. Rgds Johan
Very cool! I'm currently working on a bot that uses session variables and does some screen scraping. It would be really helpful if you could send over the source code for your forum bot.
Thanks!
Jake
Great, I am also currently working on it, It would be really helpful if u could send the Source code..
Thx
Saket