Rewrite

This page is about the forthcoming 2.0 version of pywikibot. See subpages here.

Participants

Please sign in if you want to contribute to the rewrite project. All help is most welcome!

Classes

Structure proposed by Misza

Edit at will.

  • External packages (outside the standard Python library) such as simplejson that are used in the framework should be at the top level for development purposes; but instead of packaging them with the framework, we should simply mark them as dependencies in the install script (except that ez_install tools should be packaged).
  • pywikibot/
    • tests/ - unit tests!
    • site.py - implements Site class
    • page.py - implements Page and its subclasses - ImagePage and possibly CategoryPage (categories are pages after all) that could merge methods from catlib.Category
    • pagegenerators.py - should possibly be dismantled and merged as methods of other objects (site, page, category, image, also template - should TemplatePage be another subclass?)
    • families/
      • Dynamic: we should only provide basic info (namely the site name and where to find its api.php). I'm not sure whether we should store family-wide interwiki information, as the interwiki links are language-dependent. Valhallasw
        • We can get namespace ids and names from api.php; can we get valid interwiki prefixes somewhere? Also, if sites in a family don't all have the same namespaces (which is true in Wikipedia, for instance), this might break the current interwiki.py code, which assumes that namespace prefixes are the same in all related sites -- not that that matters, necessarily, as long as we're aware of the issue. Russ Blau
    • comms/ - implement low-level access to the site
      • http.py - HTTP connection stuff, auto reconnection, etc.
        • httplib2 is a Python module that supports persistent HTTP connections as well as authentication (multiple methods) and HTTPS (with an add-on module). I think we could use this for most MW access needs with thin wrappers (for cookies, user-agent headers, etc.). Can anyone come up with some suggested unit tests for this component? --Russ Blau 16:48, 12 December 2007 (UTC)
        • The problem with httplib2 is the lack of support for cookie security: cookies are forwarded when a page returns a 3xx-return code - which should not happen. There is a script under MIT/BSD3 that handles cookies using the python builtin cookiejar code, but we will need to add redirect support. Alternatively we could add cookie support to httplib2, but that may be more time-consuming. Valhallasw 11:27, 28 December 2007 (UTC)
      • html.py - good ol' screen scraping - we need this at least as long as api.php doesn't have editing capabilities
        • I think that screenscraping should be done in a generic dirtystuff.py or whatever you like to call it ;) http.py should handle the http connections, automatic reconnection, and connection pooling.
        • It's really HTML parsing, not http (which is a data protocol), so I've renamed it. Russ Blau 21:04, 12 November 2007 (UTC)
      • api.py - yay!
      • mysql.py - read-only access via direct DB queries - possibly useful on the toolserver
        • Why not adapt api.php to use the toolserver? Saves a lot of double work, really
          • I was rather thinking of an interface that can interface with the ts databases (example: getting a whatlinkshere generator from the result of a direct SQL) as well as handling any custom query
      • xml.py - XML dump handling?
      • ... (what else?)
    • bot.py - some very generic, yet powerful Bot class with the following wishlist:
      1. Handle program arguments like site and family, but also create generators that are passed to the bot
      2. Easy instantiation for beginners - as easy as subclassing and overriding def singleStep(self,page): that handles a single page - execution with correct parameters will create generators that will in turn feed singleStep with pages
      3. Flexible, so that experts need not ignore it just because it won't accommodate their sophisticated ideas (I say, if we successfully make the interwiki bot its subclass, we have accomplished the goal)
      4. Anything else?
      • User interfaces. Query for participants: do we want to keep the existing framework, which tries to be flexible enough to accommodate GUI interfaces but doesn't actually work for anything but a text/command-line interface, or do we want to be realistic and just program for the CLI? Russ Blau
        • User interfaces should not be part of the core framework, but provided as a separate package. Bryan 21:21, 15 November 2007 (UTC)
        • OK. That would suggest that instead of wikipedia.input() and wikipedia.output(), we should just read from stdin, write to stdout/stderr, and let external user interface scripts worry about what is at the other end of those streams. Alternatively, we could use the Python logging module, which would allow finer-grained control over output without having to worry about user interface details. Russ Blau
    • user.py: Class User: to provide basic attributes of a user, such as list of contributions (cross-wiki), blockedness, logs, rights, etc.; more useful for administration than for editing.
  • <insert project codename here>-bots/
    • Put all the bot cruft here (interwiki,template replacer and whatever else litters the pywikipedia/ folder now)
    • But user-specific files (user-config.py, cookie file, data dumps, logs, etc.) should go in a user's home directory, not in a directory under the framework code.
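
The "easy instantiation" item in the bot.py wishlist could look roughly like the sketch below. Only the name singleStep comes from the list above; the Bot constructor, the run method, and the TitleCollector subclass are hypothetical, and plain strings stand in for Page objects. Building the generator from program arguments is left out.

```python
class Bot(object):
    """Hypothetical generic bot: feeds pages from a generator to singleStep."""

    def __init__(self, generator):
        # Any iterable of pages; in the framework this would be built
        # from program arguments (site, family, page selection).
        self.generator = generator

    def singleStep(self, page):
        """Handle a single page. Subclasses override this."""
        raise NotImplementedError("subclasses must override singleStep")

    def run(self):
        """Feed every page from the generator to singleStep; return the count."""
        count = 0
        for page in self.generator:
            self.singleStep(page)
            count += 1
        return count


class TitleCollector(Bot):
    """Example subclass: records each page it is given."""

    def __init__(self, generator):
        Bot.__init__(self, generator)
        self.titles = []

    def singleStep(self, page):
        self.titles.append(page)


# Strings stand in for Page objects in this sketch.
bot = TitleCollector(["Main Page", "Sandbox"])
processed = bot.run()
```

A beginner only writes the subclass; an expert can still override run or pass in a custom generator.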

Proposed style guidelines

  • Python 2.5 compatibility
  • All files are UTF-8-encoded, with PEP 0263 # -*- coding: utf-8 -*- stanzas and u'unicodify all strings'
  • Docstrings mandatory, using Epydoc for describing parameters and return value. Defining the function in this way almost automatically defines unit tests for that function. Follow PEP 0257 for general docstring format.
  • Define __version__ in every file, and set the corresponding SVN property.
  • Comment your code well (explain what you meant to do, not what the code states)
  • Indentation with four spaces
  • ClassNamesInCamelCase
  • function_names_using_underscores
  • All code should be thread-safe: we want to use persistent HTTP connections, which means a single connection will be shared among several Page objects. If we want to be able to run that code in a thread, we really need to make sure it is thread-safe.
  • Focus on i18n when designing code, instead of trying to shoehorn it in later.
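
As an illustration of how an Epydoc-style docstring maps onto unit tests, here is a minimal sketch. The function normalize_title and its behaviour are invented for this example and are not part of the framework; '$Id$' is the usual SVN keyword placeholder, which is only expanded if the svn:keywords property is set on the file.

```python
# -*- coding: utf-8 -*-
import unittest

__version__ = '$Id$'


def normalize_title(title):
    """Return a title with underscores as spaces and outer whitespace stripped.

    @param title: the raw page title
    @type title: unicode
    @return: the normalized title
    @rtype: unicode
    """
    return title.replace(u'_', u' ').strip()


class NormalizeTitleTest(unittest.TestCase):
    """Each test restates one promise made in the docstring."""

    def test_underscores_become_spaces(self):
        self.assertEqual(normalize_title(u'Main_Page'), u'Main Page')

    def test_outer_whitespace_is_stripped(self):
        self.assertEqual(normalize_title(u'  Main Page '), u'Main Page')
```

Each sentence of the docstring becomes one test method, which is the sense in which the documentation "almost automatically" defines the tests.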


Overall structure

[Valhallasw] Because adding structure afterwards is much harder, I think we should first decide which modules and classes we want, then define which functions we want and what those functions should do. After that, we can start writing unit tests and functions.
[Valhallasw] Unit tests are there to help define what code is necessary. In general, first define what a function should do, then write unit tests, then write code to make these tests pass, and stop coding when all tests pass. If a function needs to do more, first update the definition, then update the unit tests, then update the code. This makes sure a) that documentation is always up to date with the code and b) that a change does not break existing behavior.
[Valhallasw] For more information about unit testing and how to code using unit tests, please read chapters 13, 14 and 15 of Dive Into Python.
  • [Bryan] What is very important is that we clearly separate the different layers that a framework consists of and get rid of functions like replaceExceptInWhatever in the main module. In my opinion a proper framework consists of three separate layers:
    • High-level
    • Middleware
    • Lowware or core
[Valhallasw] Sounds like a good starting point for adding structure.
  • [Bryan] The core functionality should consist of methods to get and put raw page data, such as Page.get, Page.categories, Site.recentchanges, etc. The middleware consists of commonly used functions such as replaceIn, replaceImage, replaceCategory. The high-level software is the bot itself. It performs tasks by calling the functions of the middleware and core.
[Valhallasw] I'm not sure whether bots should be part of the framework in the first place. In my opinion, the framework should be installable through easy_install, for example, and in any case in a central location. This means that the framework resides in /usr/lib/python2.5/site-packages/ while the bots reside in /home/valhallasw/bots (or something similar).
  • [Bryan] Related to this, the question of i18n. I strongly believe that low-ware should never output things to stdout or ask something from stdin.
[Valhallasw] Yes and no. I think there should be a core module that handles the output in a structured manner, but it should not be used in other core modules. I.e. core module 'comm layer' should not use core module 'stdout'. I do agree that i18n is a higher level function than the pure output, and hence should be 'middleware'.
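
One way to keep core modules away from stdout and stdin is the standard logging module, as Russ Blau suggests in the user-interface discussion. The sketch below is only an illustration: the logger name "pywikibot.comms" and the fetch function are made up, and an in-memory stream stands in for the terminal.

```python
import io
import logging

# Core layer: reports through a named logger and never touches stdout/stdin.
logger = logging.getLogger("pywikibot.comms")  # hypothetical logger name


def fetch(title):
    """Stand-in for a core function; it only reports what it is doing."""
    logger.info("Fetching %s", title)
    return u"text of " + title


# User-interface layer: decides where messages go and how they look.
stream = io.StringIO()  # a CLI would pass sys.stderr here instead
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

text = fetch(u"Main Page")
output = stream.getvalue()
```

The core code stays identical whether the handler writes to a terminal, a log file, or a GUI widget; translating messages would happen in the layer that attaches the handler.
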
  • [Bryan] The core itself can also be divided into sublayers:
    • Python equivalents of the functions that the API provides
    • Abstract Page/List objects
    • Generic API function
    • Generic communication layer
    • (...)
[Valhallasw] I think this is pretty much what we need. I was also wondering whether it would be easier to generate Page objects directly from an API query issued in the Python source. For example,
http://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Main%20Page
would generate a Page 'Main Page' with only the linkedpages loaded, while
http://en.wikipedia.org/w/api.php?action=query&generator=templates&titles=Main%20Page&prop=langlinks
would generate a list of Page objects that would be very suitable for interwiki.py
[Valhallasw] This does mean we will need a more sophisticated way of finding what data is already loaded and what data is not. Maybe we should use __getattr__ to more logically shape the class? Some nice warnings about non-preloaded data being used would be neat, too :)
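
A rough sketch of the __getattr__ idea, under several assumptions: the Page constructor signature, the LOADABLE tuple, and the loader callable are all hypothetical, and a fake loader replaces the real API call. Python only calls __getattr__ when normal attribute lookup fails, so preloaded data is served directly and anything else triggers a warning plus a fetch.

```python
import warnings


class Page(object):
    """Hypothetical Page that tracks which properties are already loaded."""

    # Properties that could be fetched from the API (illustrative subset).
    LOADABLE = ("links", "langlinks", "categories")

    def __init__(self, title, loader, **preloaded):
        self.title = title
        self._loader = loader  # callable(title, prop) -> data
        # Preloaded data becomes ordinary instance attributes.
        for prop, value in preloaded.items():
            setattr(self, prop, value)

    def __getattr__(self, name):
        # Reached only when normal lookup fails, i.e. the data is not loaded.
        if name not in self.LOADABLE:
            raise AttributeError(name)
        warnings.warn("%s of %r was not preloaded; fetching it now"
                      % (name, self.title))
        value = self._loader(self.title, name)
        setattr(self, name, value)  # cache, so the next access is direct
        return value


calls = []


def fake_loader(title, prop):
    """Stand-in for an API request; records what was asked for."""
    calls.append((title, prop))
    return [u"Example"]


page = Page(u"Main Page", fake_loader, links=[u"Wikipedia"])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    preloaded = page.links          # already loaded: no fetch, no warning
    fetched = page.langlinks        # warns, then fetches and caches
    fetched_again = page.langlinks  # served from the cache
```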
  • [Bryan] What is important to consider is where the error correction is put. Some errors are recoverable after a retry. The lower the level at which such a retry is placed, the less code duplication is required. However, retry code placed too low may end up catching errors that are too generic.
[Valhallasw] 'make sure you only catch errors you expect' :). In general, we should only catch errors that the user does not need to hear about. This means that even retry-after errors might need to traverse to the upper layer so the user at least gets a message 'Server load too high, retrying in ...'. If there is a way to get this information back to the user without backtracking completely, I'd like to hear :)
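
One possible answer, offered only as a sketch: the low-level retry loop takes a notify callback from the upper layer, so the user sees each retry message without the exception unwinding the whole call stack. ServerBusy, with_retries, and all parameter names are invented for this illustration.

```python
import time


class ServerBusy(Exception):
    """Hypothetical recoverable error carrying a suggested delay in seconds."""

    def __init__(self, retry_after):
        Exception.__init__(self, "server load too high")
        self.retry_after = retry_after


def with_retries(request, notify, max_tries=3, sleep=time.sleep):
    """Call request() until it succeeds, retrying only on ServerBusy.

    notify(message) is supplied by the upper layer and is called before
    each retry. Any other exception propagates untouched.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return request()
        except ServerBusy as err:
            if attempt == max_tries:
                raise  # out of retries: let the caller decide
            notify("Server load too high, retrying in %d s" % err.retry_after)
            sleep(err.retry_after)


messages = []
attempts = []


def flaky_request():
    """Fails twice with ServerBusy, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ServerBusy(retry_after=5)
    return "ok"


# The sleep function is replaced so the example does not actually wait.
result = with_retries(flaky_request, messages.append, sleep=lambda s: None)
```

Only the one expected exception type is caught, in line with "make sure you only catch errors you expect".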

Branching

The trunk should always (ideally) be kept up to date with the version of MediaWiki currently being used on Wikimedia Foundation projects. Whenever there is a significant API change, the code should be branched, so that the older version can continue to be used on wikis that don't upgrade to the current version. There is not to be any effort at maintaining backwards-compatibility with old MW versions.

Also, some release versions should be tagged, each working with one release version of MediaWiki.

Available code

What code is available already? Of course we have the current framework, but Bryan and Russell have also written API code. Valhallasw

mwclient. Bryan 21:25, 15 November 2007 (UTC)