<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-7547572096336199062</id><updated>2011-04-21T20:55:55.365-07:00</updated><category term='debug'/><category term='anti-spam'/><category term='about'/><category term='Mechanize'/><category term='configuration'/><category term='Ruby'/><category term='Hpricot'/><category term='ajax'/><category term='upload progress'/><category term='mongrel'/><category term='rails'/><title type='text'>Quirky Projects on Rails</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://quirksonrails.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7547572096336199062/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://quirksonrails.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Admin</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>3</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-7547572096336199062.post-3486740284574995820</id><published>2007-03-11T12:20:00.000-07:00</published><updated>2007-03-11T12:28:24.759-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='configuration'/><category scheme='http://www.blogger.com/atom/ns#' 
term='debug'/><category scheme='http://www.blogger.com/atom/ns#' term='rails'/><category scheme='http://www.blogger.com/atom/ns#' term='mongrel'/><title type='text'>Dealing with Mongrel Crashes</title><content type='html'>If you're running a single Mongrel instance, you may sometimes see it crash when a request takes too long to execute. In the previous post, I talked about how I created an automated forum moderator. The problem is that it can sometimes crash if it has to archive too many spam posts with too many links at the same time. I can't recover the work in progress when it crashes, but the next time cron hits it, I can attempt to restart the server by setting the 502 (Bad Gateway) error document to mongrel.php, which contains the following code: &lt;blockquote&gt;&amp;lt;?php shell_exec('exec/mongrel.sh'); ?&amp;gt;&lt;/blockquote&gt; Furthermore, mongrel.sh contains a simple script:&lt;blockquote&gt;#!/bin/sh&lt;br /&gt;echo "Stopping any running ruby processes..."&lt;br /&gt;/usr/bin/ruby /usr/bin/mongrel_rails stop -c /vservers/path/to/my/app&lt;br /&gt;rm -f /vservers/path/to/my/app/log/mongrel.pid&lt;br /&gt;killall -9 ruby&lt;br /&gt;echo "Restarting Mongrel on port 4000.. 
Users may encounter 502 errors.."&lt;br /&gt;/usr/bin/ruby /usr/bin/mongrel_rails start -c /vservers/path/to/my/app -e production -dp 4000&lt;/blockquote&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7547572096336199062-3486740284574995820?l=quirksonrails.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://quirksonrails.blogspot.com/feeds/3486740284574995820/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7547572096336199062&amp;postID=3486740284574995820' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7547572096336199062/posts/default/3486740284574995820'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7547572096336199062/posts/default/3486740284574995820'/><link rel='alternate' type='text/html' href='http://quirksonrails.blogspot.com/2007/03/dealing-with-mongrel-crashes.html' title='Dealing with Mongrel Crashes'/><author><name>Admin</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7547572096336199062.post-1421299649237387402</id><published>2007-02-27T14:38:00.000-08:00</published><updated>2007-02-27T18:33:13.915-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='anti-spam'/><category scheme='http://www.blogger.com/atom/ns#' term='Ruby'/><category scheme='http://www.blogger.com/atom/ns#' term='rails'/><category scheme='http://www.blogger.com/atom/ns#' term='Mechanize'/><category scheme='http://www.blogger.com/atom/ns#' term='Hpricot'/><title type='text'>Creating an automated moderator with Rails + Hpricot</title><content type='html'>A while back, I became 
a client moderator on my hosting company's forum in order to delete spammy posts outside of business hours. Yet the spammers gradually piled on more and more spam, and before I knew it, 20+ spam posts were coming in a day. Often they arrived at 4 AM, when nobody could moderate. Although my host offers 24/7 support, the support staff are too busy to fill in on the forums. To make matters worse, every CAPTCHA we deployed on the forum would be cracked before a REAL user registered. I began to devise a bot that would crawl the forums searching for new topics, and then delete them if the post contained keywords that identified spam, such as the "v word".&lt;br /&gt;&lt;br /&gt;To implement this, I needed a bot that could:&lt;br /&gt;    a) Use the open-uri library to load the forum, and delete offending topics&lt;br /&gt;    b) Scan HTML for certain patterns such as topics and post bodies&lt;br /&gt;    c) Classify spam based on the "PLOW" system, which I'll explain later&lt;br /&gt;&lt;br /&gt;Before we get started, I would like to state the dependencies:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Ruby 1.8.4/1.8.5 and Rails, with Mongrel (easiest way to get a good appserver and the Ragel C compiler required for Hpricot)&lt;/li&gt;&lt;li&gt;Hpricot (&lt;a href="http://code.whytheluckystiff.net/"&gt;http://code.whytheluckystiff.net/hpricot&lt;/a&gt;)&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;To make the script reusable, I embedded the bot in a function called automate. It takes six arguments: forum_url (URL of the forum without http etc.), user, password, forums (an array of the id numbers of the forums to crawl), delete (specifies whether the bot is allowed to delete posts), and log (specifies whether deleted posts will be archived).&lt;br /&gt;&lt;br /&gt;Because the bot is 214 lines of code, I will only focus on specific parts.&lt;br /&gt;&lt;br /&gt;The first challenge was to find the URLs of topics. 
Because the bot is not logged in, the topics/posts will appear as &lt;span style="font-style: italic;"&gt;aboutxxxx.html&lt;/span&gt;. After using GET on &lt;span style="font-style: italic;"&gt;/viewforum.php?f=xx&lt;/span&gt;, we retrieve the topics by scanning for &lt;span style="font-style: italic;"&gt;/about[0-9][0-9][0-9][0-9]\.html/&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;One of the primary concerns when building the bot was bandwidth usage. In order to reduce overhead, the bot only crawls topics that are new. This means the bot keeps a table of clean topics which will not be crawled again. This assumes spammers won't piggyback on existing topics, but so far it hasn't been a problem.&lt;br /&gt;&lt;br /&gt;When new topics are found, &lt;span style="font-weight: bold;"&gt;Hpricot&lt;/span&gt; is required, and the bot logs in so it can see the user's email. One of the ways that spammers are identified is their use of .ru or .pl addresses, as the spam software automatically registers its accounts with mail.ru.&lt;br /&gt;&lt;br /&gt;In order to log in, we fill in the following data:&lt;br /&gt;&lt;div style="text-align: left;"&gt;&lt;blockquote&gt;path = "/login.php"&lt;br /&gt;     # Prepare POST data&lt;br /&gt;     data = "username=#{user}&amp;amp;password=#{password}&amp;amp;login=Log+In"&lt;br /&gt;     headers = {&lt;br /&gt;       'Referer' =&gt; 'http://' + forum_url,&lt;br /&gt;       'Content-Type' =&gt; 'application/x-www-form-urlencoded'&lt;br /&gt;     }&lt;/blockquote&gt;We can then log in with &lt;span style="font-style: italic;"&gt;resp, data = http.post(path, data, headers)&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;Finally, we retrieve our session id with the following code:&lt;br /&gt;&lt;span style="font-style: italic;"&gt;cookie2 = resp.response['set-cookie'].split("=")[4].gsub("; path", "")&lt;/span&gt;&lt;br /&gt;&lt;span style="font-style: italic;"&gt;     
   cookie = resp.response['location'].gsub("http://#{forum_url}/index.php?sid=", "")&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;We append our session id to the headers, and then begin to crawl the post.&lt;br /&gt;&lt;br /&gt;Hpricot makes it easy to scrape HTML, and we can obtain the original poster and the number of posts they have with the following code:&lt;blockquote&gt;op = (doc/"span.name")[0].inner_html&lt;br /&gt;number = (doc/"a.postdetails")[0].inner_html.gsub("Posts: ", "")&lt;/blockquote&gt;We can obtain the text of the first post in the same manner. The first part of the "PLOW" system is posts: the number of posts a user has is taken into account. If the user has only one post, the number of links in it, or L, is counted. We can do this with the following code:&lt;blockquote&gt;if number.to_i == 1&lt;br /&gt;            # Get the text of the first post only&lt;br /&gt;            post = (doc/"div.postbody")[0].inner_html&lt;br /&gt;            # Check for bbcode links&lt;br /&gt;            @urls = post.scan(/&amp;lt;a href="http:.*"&amp;gt;.*&amp;lt;\/a&amp;gt;/)&lt;br /&gt;            # Check for code links not in bbcode&lt;br /&gt;            # You can delete this, but not many new users have 5 or more http links even in code samples&lt;br /&gt;            @urlsamp = post.scan(/&amp;lt;a href="http:.*"&amp;gt;/)&lt;br /&gt;            # If they have 5 or more links on their first post, it is spam&lt;br /&gt;            if @urls.size + @urlsamp.size &gt;= 5 # Changed&lt;br /&gt;              spam = true&lt;br /&gt;            end&lt;br /&gt;end&lt;/blockquote&gt;If the user has only 1 post, and 5 or more links, their post will be declared spam.&lt;br /&gt;&lt;br /&gt;The final part of the "PLOW" system is overt keywords. 
If the post contains the infamous "v word" or other words spammers use, it will be deleted. Mail.ru and Cashette.com have also been added to the list of keywords. Unfortunately, the only way to identify some posts was by using more common keywords such as "girl". I felt this was a slippery slope, but since this is a tech forum, I realized that "girl" would only come up if a flame war started. Since the bot only looks at the first post before it declares a topic "ham", and a word such as "girl" is unlikely to appear in a legitimate first post, false positives should be rare.&lt;br /&gt;&lt;br /&gt;The final part of the bot is deletion. The bot prepares the following POST data:&lt;blockquote&gt;post_data = "sid=#{cookie}&amp;amp;f=#{f}&amp;amp;delete=Delete&amp;amp;#{topics}confirm=Yes"&lt;/blockquote&gt;The trick to getting it to delete is adding the confirm variable. Otherwise it will prompt you to confirm.&lt;br /&gt;&lt;br /&gt;The bot then POSTs this data to modcp.php, and the spam topics are deleted. Unfortunately, some spambots have been known to repost up to 107 times.&lt;br /&gt;&lt;br /&gt;Ultimately, the deployment of an anti-spam bot allowed 95% of spam to be deleted before a moderator or Google ever saw it, saving money and protecting the reputation of the forum. 
The tactics that worked best were blocking .ru mail accounts, checking for the number of links the user posted if it was his or her first post, and the usage of medical keywords.&lt;br /&gt;&lt;br /&gt;If you are interested in deploying a similar bot, please comment and I would be glad to send the full source.&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7547572096336199062-1421299649237387402?l=quirksonrails.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://quirksonrails.blogspot.com/feeds/1421299649237387402/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7547572096336199062&amp;postID=1421299649237387402' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7547572096336199062/posts/default/1421299649237387402'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7547572096336199062/posts/default/1421299649237387402'/><link rel='alternate' type='text/html' href='http://quirksonrails.blogspot.com/2007/02/creating-automated-moderator.html' title='Creating an automated moderator with Rails + Hpricot'/><author><name>Admin</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7547572096336199062.post-4347067264556883798</id><published>2007-02-25T09:10:00.000-08:00</published><updated>2007-02-25T09:38:40.788-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='about'/><category scheme='http://www.blogger.com/atom/ns#' term='upload progress'/><category scheme='http://www.blogger.com/atom/ns#' term='anti-spam'/><category 
scheme='http://www.blogger.com/atom/ns#' term='rails'/><category scheme='http://www.blogger.com/atom/ns#' term='ajax'/><title type='text'>What is this blog?</title><content type='html'>Having started developing with Rails 6 months ago, I have been amazed by the ability to branch out when coding with Rails. Things I never thought possible such as Upload Progress, one line Ajax, and anti-spam bots in 200 lines of code have traveled on the Rails. In this blog, I will share my various tinkering with Rails and Ruby to build projects out of the ordinary.&lt;br /&gt;&lt;br /&gt;In the next post (tomorrow, hopefully), I will share how I reduced spam by 95% on my hosting company's forum with an automated spam-deleting bot called "ModBot".&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7547572096336199062-4347067264556883798?l=quirksonrails.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://quirksonrails.blogspot.com/feeds/4347067264556883798/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=7547572096336199062&amp;postID=4347067264556883798' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7547572096336199062/posts/default/4347067264556883798'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7547572096336199062/posts/default/4347067264556883798'/><link rel='alternate' type='text/html' href='http://quirksonrails.blogspot.com/2007/02/what-is-this-blog.html' title='What is this blog?'/><author><name>Admin</name><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry></feed>
