Perl:Copyright Violation Bot

This bot was written by Where, a user at the English wiki.

Below is the latest code as of 3/5/2007. There are no major problems with it as of the time of writing.

The code has only been tested on UNIX-like systems, but it should theoretically also work on Windows. Note that it was not intended for wide distribution, so it is not well commented. Sorry! Also note that it requires wget, pywikipediabot, Yahoo's Python search plugin, Perl, and the Bot::BasicBot and IPC::Open2 Perl modules. You may use the code under the GNU General Public License.
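
Before starting the bot it can help to confirm that the command-line tools listed above are installed. The following is a minimal sketch, assuming Python 3 is available for the check (the bot's own scripts are Python 2 and Perl). The function name `missing_tools` is made up for this example, and it does not check the Perl modules, pywikipediabot, or the Yahoo search plugin.

```python
import shutil

# Command-line tools named in the requirements above.
REQUIRED_TOOLS = ("wget", "perl", "python")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Calling `missing_tools()` returns an empty list when everything was found. The `which` parameter exists only so the lookup can be replaced in tests.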

If you want to modify Wherebot to run on a different wiki or language, some changes are needed. I have marked the places people may want to change on lines containing the text "#CONFIG".
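
To find every marked line quickly, something like the following sketch could be used. It assumes Python 3, and the helper name `find_config_lines` is invented for this example. It only lists the lines that contain the marker, it does not change them.

```python
def find_config_lines(source_text, marker="#CONFIG"):
    """Return (line_number, stripped_line) pairs for lines containing the marker."""
    hits = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        if marker in line:
            hits.append((number, line.strip()))
    return hits
```

For example, `find_config_lines(open("cv-watch.pl").read())` would list the marked lines of the main file together with their line numbers.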

Please go into edit mode to see the source of the program with proper linebreaks.

Here is the main file, cv-watch.pl. Place it where you wish:

 #!/usr/bin/perl
 use strict;
 
 #some of the IRC parts of this bot are based off of the Bot::BasicBot sample code
 
 Wherebot->new(channels => ["#ar.wikipedia"], nick=>"viobot", server => "irc.wikimedia.org")->run(); #CONFIG: change the channel to your wiki's feed and the nick (viobot) to something unique
 
 package Wherebot;
 use base qw/Bot::BasicBot/;
 use IPC::Open2;
 
 sub said {
    shift(); #don't care about the first parameter
    our %hash = %{shift()};
 
    our $rawMessage = $hash{"body"};
    our $channel = $hash{"channel"};
    our $site = $channel;
    $site =~ s&#&&;
    $rawMessage =~ m#02(http://\Q$site\E\.org[^ ]+)# or return; #no page URL in this message, so there is nothing to check
    our $url = $1;
 #CONFIG: the next four lines are to ignore certain pages. Customize if you like
    if ($url =~ /[Tt]alk:/) {return;}
    if ($url =~ /Wikipedia:Sandbox/) {return;}
    if ($url =~ /Articles_for_[Dd]eletion/) {return;} #URLs contain underscores, not spaces
    if ($url =~ /Wikipedia:Intro/) {return;}
    chop $rawMessage;
    if ($rawMessage =~ /N\x{03}10/) {
 #CONFIG: the next seven lines are to ignore certain namespaces. Customize if you like.
       if ($url =~ /User:/) {return;}
       if ($url =~ /Wikipedia:/) {return;}
       if ($url =~ /Portal:/) {return;}
       if ($url =~ /Help:/) {return;}
       if ($url =~ /Template:/) {return;}
       if ($url =~ /Category:/) {return;}
       if ($url =~ /Image:/) {return;}
 
       &act($channel, $url);
    }
 }
 sub URLDecode { #From http://glennf.com/writing/hexadecimal.url.encoding.html
   my $theURL = $_[0];
   $theURL =~ tr/+/ /;
   $theURL =~ s/%([a-fA-F0-9]{2,2})/chr(hex($1))/eg;
   $theURL =~ s/<!--(.|\n)*-->//g;
   return $theURL;
 }
 sub act {
    our $misc = "/home/where/misc";
    our $channel = shift;
    our $url = shift;
    $url =~ s#'##g; #just in case, although this would never be necessary
    chop $url;
    our $term = `wget '$url?action=raw' -q -O - | head -n 1`;
    chomp $term;
 
    our $origUrl = $url;
    $url =~ m#/wiki/(.*)#;
    our $page = $1;
    $url .= "?action=raw";
    $url =~ s#'##g; #shouldn't be a problem, but hey, I'm paranoid
    chomp $term;
    $term = &trim($term); #get it to <100 words so yahoo doesn't go crazy
    if ($term =~ /#redirect/i) {
       return;
    }
    if ($term =~ /^\{/) {
       return;
    }
    if ($term =~ /^</) {
       return;
    }
 
    $term =~ s#'''##g;
    $term =~ s#''##g;
    $term =~ s#\[\[##g;
    $term =~ s#\]\]##g;
    $term =~ s#\*##g;
    $term =~ s#"##g; #Yahoo chokes on quotes; yes, this will probably return false matches, but it is better than the alternative
    $term =~ s#\(##g;
    $term =~ s#\)##g;
 #   if (m#([^\(\)]+)[\(\)]#) { #same thing with parenthesis
 #      $term = $1;
 #   }
 
    if (length($term) < 75) {
       return;
    }
 
    our $firstLine;
    our $n=0;
    while (1) {
       our $pid = open2(*Reader, *Writer, "python", "/home/alnokta/local/lib/python2.3/site-packages/yahoo/search/web.py", "-t", "web", '"' . $term . '"'); #CONFIG: change the web.py path to wherever the Yahoo search API is installed on your system
       $firstLine = <Reader>;
      # print "($url): FL: $firstLine\n";
       if ($firstLine =~ /Internal WebService error, temporarily unavailable/ || $firstLine =~ /^Got an error/) {
         warn "Search failed; retrying\n";
         sleep 60;
         waitpid $pid, 0;
         ++$n;
         if ($n < 3) {
            next;
         }
         else {
            return; #give up after three failed searches rather than treating the error message as a result
         }
       }
       else {
         waitpid $pid, 0;
         last;
       }
    }
 
    if (!($firstLine =~ /^No results\s*/)) {
       <Reader>;<Reader>; #skip some lines
       our $from = <Reader>;
       $from =~ s#\s##g;
       if ($from =~ m#^http://ar\.wikipedia\.org# || $from =~ m#\.gov# || $from =~ m#^http://en.wikibooks#) {
         return;
       }
 
       $page = &URLDecode($page);
 
       $page =~ s#_# #g;
 
       our $strippedUrl = $from;
       $strippedUrl =~ s#^http://##;
       print "($page) copyvio from $from\n";
 
       if ($channel eq "#ar.wikipedia") { #CONFIG: change this line according to your language and version
         chdir "/home/alnokta/local/lib/python/pywiki"; #CONFIG: change this line according to where your pywikipedia directory is
       }
       print "Writing\n";
       open(APPEND_PY, "|nice -n 10 python append.py") or do { warn "Could not start append.py: $!\n"; return; };
       print APPEND_PY  "* [[$page]] -- [$from $strippedUrl]. Reported at~~~~~"; #CONFIG: change wording of how Wherebot reports if you like
       close APPEND_PY;
    }
 }
 
 sub trim { #cut parameter to <100 words
    our $in = shift;
    our @in = split / /, $in;
    our $out = "";
    our $i = 1;
    for (@in) {
       $out .= $_ . " ";
       ++$i;
       if ($i == 99) {
         last;
       }
    }
    chop $out; #get rid of last space
    return $out;
 }
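
The part of `act` that decides whether a new page is worth searching for can be hard to follow in the listing above. The sketch below restates it in Python 3 as I read the Perl: take the first line of the raw wikitext, keep at most 98 words, skip redirects and lines starting with `{` or `<`, strip some markup, and skip anything shorter than 75 characters. The names `prepare_search_term`, `MAX_WORDS` and `MIN_LENGTH` are made up for this illustration and are not part of Wherebot.

```python
import re

MAX_WORDS = 98   # trim() in cv-watch.pl stops after appending 98 words
MIN_LENGTH = 75  # shorter phrases are not searched for


def trim(text, max_words=MAX_WORDS):
    """Keep at most max_words space-separated words, roughly as trim() does."""
    return " ".join(text.split(" ")[:max_words])


def prepare_search_term(first_line):
    """Return the phrase to search for, or None if the page should be skipped."""
    term = trim(first_line.rstrip("\n"))
    if re.search(r"#redirect", term, re.IGNORECASE):
        return None  # redirects have no text of their own
    if term.startswith("{") or term.startswith("<"):
        return None  # line begins with a template or an HTML tag
    # Remove wiki markup and the characters the search query chokes on.
    for token in ("'''", "''", "[[", "]]", "*", '"', "(", ")"):
        term = term.replace(token, "")
    if len(term) < MIN_LENGTH:
        return None  # too short to give a meaningful exact-phrase match
    return term
```

A return value of `None` corresponds to the early `return` statements in `act`. Any other value is the phrase that is wrapped in double quotes and passed to the search script.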

For reasons unknown to me, the bot may shut down after running for a long time. I thus recommend running it using persist.pl:

 
 #!/usr/bin/perl
 
 while (1) {
    system "perl cv-watch.pl";
 }
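
The same restart loop can be sketched in Python 3 with a short pause between runs, so that a bot which exits immediately is not relaunched in a tight loop. This is only an illustration: the pause, the `max_runs` limit, and the function names are assumptions that are not in persist.pl, which restarts at once and forever.

```python
import subprocess
import time


def supervise(run_once, pause_seconds=5, max_runs=None, sleep=time.sleep):
    """Call run_once repeatedly, pausing after each run.

    max_runs=None loops forever, like persist.pl; a number bounds the loop.
    Returns the number of times run_once was called.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        run_once()
        runs += 1
        sleep(pause_seconds)
    return runs


def run_bot():
    """Run cv-watch.pl once and wait for it to exit."""
    return subprocess.call(["perl", "cv-watch.pl"])
```

Calling `supervise(run_bot)` from the directory containing cv-watch.pl would keep the bot running until the supervisor itself is stopped.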

The following file, append.py, should go in the pywikipediabot directory.

 #!/usr/bin/python
 # -*- coding: utf-8 -*-
 
 import wikipedia
 
 site = wikipedia.getSite()
 page = wikipedia.Page(site, "User:alnokta/copyvio") #CONFIG: change to the page where reports should be listed
 text = page.get()
 #read one report line from standard input (sent by cv-watch.pl) and append it
 text = unicode(text + "\n") + unicode(raw_input(), 'utf8')
 wikipedia.setAction("Adding a suspected copyright violation")
 page.put(text,minorEdit=False)
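
To make the hand-off between the two scripts concrete, here is a sketch in Python 3 of how the report line sent by cv-watch.pl is put together and what append.py then does with it. The function names are invented for this example. It uses `urllib.parse.unquote_plus` in place of the Perl `URLDecode` routine, which is an approximation: it decodes escapes as UTF-8 and does not strip HTML comments.

```python
from urllib.parse import unquote_plus


def build_report_line(page_path, source_url):
    """Build the wikitext bullet that cv-watch.pl pipes into append.py."""
    # Decode %XX escapes and '+', then turn underscores into spaces.
    title = unquote_plus(page_path).replace("_", " ")
    # The link label is the source URL without its leading "http://".
    label = source_url
    if label.startswith("http://"):
        label = label[len("http://"):]
    # Five tildes are expanded by MediaWiki into a timestamp when saved.
    return "* [[%s]] -- [%s %s]. Reported at~~~~~" % (title, source_url, label)


def append_report(existing_text, report_line):
    """Return the report page text with the new line added at the end."""
    return existing_text + "\n" + report_line
```

Here `page_path` stands for the part of the page URL after `/wiki/`, and `source_url` for the first search result.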

You need a user-config.py file in the pywikipediabot dir. Here's mine:

 mylang='en' #CONFIG: change for your wiki language
 usernames['wikipedia']['en']='Wherebot' #CONFIG: change for your wiki, wiki language and username
 
 maxthrottle=2
 put_throttle=3

Now run login.py in the pywikipediabot dir.

Finally, run persist.pl.
