This Blog continues on http://aliafshar.github.io/blog

Saturday, May 17, 2008

Blogger Comment Spam - Deleting it

Over recent months my blog has been getting comment spam. I imagine other bloggers out there experience the same thing, and it is a bit of a pain.

I have three immediate problems with this and blogger.com.

1. Blogger doesn't notify me of all comments when they are posted. I have of course configured it to notify me of every comment, but it only notifies me of some and seems to miss about 70%. So not only do I not notice the spam, I also miss a bunch of legitimate comments. Please get it together, Blogger! Ajax panel configuration is nice, but only if the core functions work.

2. Blogger should/could/might try to stop this spam before it happens. I won't guess how, but the people at the company that runs Blogger.com are much brighter than me, and I am sure they have a solution.

3. An interface for browsing comments and deleting many at a time simply does not exist. Having one would make the task of sifting through, identifying, and deleting spam much easier.

Now that I have had my grumble about it, I will offer my small solution. In praise of Google, they do provide a nice API and Python bindings to access all of their services, and Blogger is one of them. So I wrote a small script that goes through all the comments, does a little bit of flagging on dodgy-looking ones, and offers you a chance to delete them.

The script is uncommented, has no tests, and I have no plans to maintain it or release it, but for those people suffering the same problems, I provide it here.

It is worth noting that the spam detection is really pathetic, and it could be vastly improved. I targeted it at my particular spam.
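As one possible direction for that improvement, here is a minimal sketch of slightly stricter word matching. It ignores case and only counts whole-word hits, so a spam word does not match inside a longer, unrelated token. The name count_spam_words and the three-word list are purely illustrative (the words are the first entries of the list used in the script); this is not what the script itself does.

```python
import re

# Illustrative word list only: the first few entries of the script's list.
SPAM_WORDS = ['4u', 'adipex', 'advicer']


def count_spam_words(text, words=SPAM_WORDS):
    """Count case-insensitive, whole-word occurrences of spam words in text."""
    total = 0
    for word in words:
        # The \b anchors stop 'adipex' from matching inside 'xadipexy'.
        pattern = r'\b' + re.escape(word) + r'\b'
        total += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return total
```

Whether whole-word matching is actually better depends on the spam you get; spammers who run words together would slip past it, so treat this as a starting point only.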


Full script available here



"""
(c) Ali Afshar 2008
MIT License
"""

import sys, getpass

from gdata import service


def get_details():
    # Prompt for the account credentials; the password is not echoed.
    email = raw_input('email: ').strip()
    password = getpass.getpass()
    return email, password


def create_service(email, password):
    # Build a GData client pointed at Blogger and log in with it.
    blogger_service = service.GDataService(email, password)
    blogger_service.source = 'blogger_spam_killer'
    blogger_service.service = 'blogger'
    blogger_service.server = 'www.blogger.com'
    blogger_service.ProgrammaticLogin()
    return blogger_service


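def get_all_blog_ids(svc):
    # Yield the ID of every blog listed for the logged-in account.
    query = service.Query()
    query.feed = '/feeds/default/blogs'
    feed = svc.Get(query.ToUri())
    for entry in feed.entry:
        # The blog ID is the last path segment of the entry's self link.
        blog_id = entry.GetSelfLink().href.split("/")[-1]
        yield blog_id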


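def get_blog_comments(svc, blog_id):
    # Yield every comment entry in the given blog's comment feed.
    query = service.Query()
    query.feed = '/feeds/%s/comments/default' % blog_id
    # Ask for as many results as possible rather than the default page size.
    query.max_results = sys.maxint
    feed = svc.Get(query.ToUri())
    for entry in feed.entry:
        yield entry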


def get_all_comments(svc):
    for blog_id in get_all_blog_ids(svc):
        for comment in get_blog_comments(svc, blog_id):
            yield comment


def rank_comment(comment):
    words = 0
    for word in spamwords:
        words += comment.content.text.count(word)

    author = comment.author[0]
    has_uri = (author.uri is not None and
               # I figure no one who puts a URI would link to a blogger
               # profile. They would link to whatever they are spamming.
               'http://www.blogger.com/profile/' not in author.uri.text)
    print 'Spam words: %s' % words
    print 'Dodgy author uri: %s' % has_uri
    return bool(words) or has_uri


def delete_comment(svc, comment):
    svc.Delete(comment.GetEditLink().href)


def filter_all_comments(svc):
    for comment in get_all_comments(svc):
        print '--'
        # Show the first and last 70 characters of the comment body.
        t = comment.content.text
        print t[:70] + '...'
        print '...' + t[-70:]
        a = comment.author[0]
        print 'Author Info: ', a.name.text
        if rank_comment(comment):
            print '**** LOOKS DODGY'
        else:
            print '==== OK'
        # Nothing is deleted unless the user explicitly answers 'y'.
        s = raw_input('Delete? (y/N) ').strip()
        if s == 'y':
            print 'Deleting.'
            delete_comment(svc, comment)
        else:
            print 'Not deleting.'


# http://codex.wordpress.org/Spam_Words
spamwords = """
4u
adipex
advicer
...
""".strip().splitlines()


if __name__ == '__main__':
    em, pw = get_details()
    svc = create_service(em, pw)
    filter_all_comments(svc)
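
To experiment with the flagging rule without logging in to anything, the sketch below mirrors the two checks from rank_comment on plain strings instead of GData entries. The name looks_dodgy, its parameters, and the short word list are made up for illustration (the words are the first entries of the script's list), and the code is written so that it also runs on current Python versions.

```python
# Illustrative word list only: the first few entries of the script's list.
SPAM_WORDS = ['4u', 'adipex', 'advicer']

BLOGGER_PROFILE = 'http://www.blogger.com/profile/'


def looks_dodgy(text, author_uri=None, words=SPAM_WORDS):
    """Flag a comment if it contains a spam word or a non-Blogger author URI."""
    # Same substring counting as rank_comment: case-sensitive, no word boundaries.
    hits = sum(text.count(word) for word in words)
    # An author URI that is not a Blogger profile is treated as suspicious.
    has_uri = author_uri is not None and BLOGGER_PROFILE not in author_uri
    return bool(hits) or has_uri
```

For example, looks_dodgy('Nice post!', 'http://www.blogger.com/profile/123') is not flagged, while the same text with an author URI pointing anywhere else is. That second rule will also flag legitimate commenters who link to their own site, which is why the script asks before deleting anything.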