User:CzechOut/Bot tricks: Difference between revisions

From Tardis Wiki, the free Doctor Who reference

Revision as of 16:35, 1 September 2012

The following is a list of tricks I've learned while using pywikipedia.

One of the harder things to do with bots is to work on pages that have no categories. This is because bots depend upon categories for many of their functions. However, bots can be used on pages without categories, as long as you go about things creatively.

If you have a user who is constantly uploading pictures without licenses, it may be easiest just to look for their work, to the exclusion of other people. Here's a run that'll look for only their additions to the file namespace:

python -text:"{{bbcvidcover}}" -namespace:6 -usercontribs:"Doctor Who 63"  -except:"\{\{[Bb]bcvidcover" 

Note that this goes through all their work in namespace 6, not just their unlicensed work there. Adding the -uncatfiles parameter doesn't help here. It doesn't hurt, but it doesn't confine the search to just those files in namespace 6 modified by Doctor Who 63 which are also uncategorised.

However, -uncatfiles is helpful if you don't have that many files to look after. This is what you use if you just want to add {{bbcvidcover}} to pages that aren't categorised.

python -text:"{{bbcvidcover}}" -uncatfiles

Course, this is a slow way to go about things, because you probably won't want to add a single template to all the uncategorised files. If you want to filter things a bit, you can instead try to find patterns in the titles of the uncategorised files.


allows you to make up your own matching rules. But if you can see a quick and dirty pattern at the beginning of a filename, you might try this instead:


This method is perfect for quickly licensing achievement badges, because they'll all start with the term "File:badge".

What if you want to replace something about a title that has both exclamation points and single quotes for italics? This is pretty dicey, because the exclamation point has to be escaped, and you've got to figure out a way to get around the single quotes. Here's a useful expression:

python -summary:"see [[forum:Prefix war: Doctor Who Adventures vs. Doctor Who Annuals]]: DWAN --> DWS" -regex "\[\[DWAN\]\]\: ''\[\[Grand Theft Planet(\!\]\]'')" "[[DWS]]: ''[[Grand Theft Planet\1" -ref:'Grand Theft Planet!' 

Note what's going on here. The -ref line must be in single quotes. The regex for the original term must have parentheses around the part of the page name that's causing the most difficulty, so that it can just be dumped into the replacement term as a \1. After trying for a bit, I couldn't find anything else that worked in command line operation of the bot. Of course, my guess is that you might well need something like this, even if using a user-fix.
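To see what the capture group is doing, here's a rough offline check using Python's own `re` module, which is only assumed here to behave like the bot's regex handling. The pattern and replacement are copied from the command above; the sample line is made up for illustration.

```python
import re

# Pattern and replacement copied from the command above.
pattern = r"\[\[DWAN\]\]\: ''\[\[Grand Theft Planet(\!\]\]'')"
replacement = r"[[DWS]]: ''[[Grand Theft Planet\1"

# Hypothetical line of wikitext, invented for this check.
sample = "* [[DWAN]]: ''[[Grand Theft Planet!]]''"

# Group 1 holds the awkward tail (exclamation point, closing brackets,
# closing italics), so the replacement never has to spell it out.
result = re.sub(pattern, replacement, sample)
print(result)
```

Because the awkward tail is carried over by \1, only the prefix actually changes.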

Cleanup after

python "\<(.*)(\[\[.*\]\])(.*)\>" "{{hidecat}}\2" -regex -subcatsr:"Articles containing potentially dated statements from 2015"

This would keep any categories you have in the sea of code that's unfortunately generated by

Update: Actually, it turns out that the code is only generated when using an .xml file. If you instead just use an .rtf or, better, a regular .txt file (saved as UTF-8), things work out nicely.
Update again: .txt files saved as UTF-8 are the very best option. If your text includes symbols like curly braces (as with template calls), you absolutely need a plain .txt file.

Regex snippets

  • This gets rid of empty sections, in this case External link/links:
-regex "\=\= External .*\n''.*''"
  • This replaces a multi-line series variable with a singular one, while at the same time preserving spaces between "series" and the = sign:
    python -regex 'series( *)=.*' "series\1=[[DWM comic stories|''DWM'' comic stories]]|" -summary:"Only one linked item per series variable.  Otherwise, it's VERY unclear what the previous/next line refers to" -cat:'Fourth Doctor DWM comic stories'
  • To automatically add brackets around things, generally use bracket. However, for dab pages, where you have a list of things all starting with the same letters, use this:
    python -regex -page:"Pagename" "StartingString(.*)\r" "* [[StartingString\1]]"
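The series snippet in the list above can be tried out offline before letting the bot loose. This is only a sketch with Python's `re` module, assumed to match the bot's behaviour, and the sample infobox line is invented.

```python
import re

# Pattern and replacement copied from the series snippet above.
pattern = r"series( *)=.*"
replacement = r"series\1=[[DWM comic stories|''DWM'' comic stories]]|"

# Hypothetical infobox line with three spaces before the equals sign.
sample = "|series   = [[Something]], [[Something else]]|"

# Group 1 captures the run of spaces between "series" and "=", so the
# original alignment survives while the rest of the line is replaced.
result = re.sub(pattern, replacement, sample)
print(result)
```

Note that `.*` stops at the end of the line, so only the line holding "series =" is rewritten in this check.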

Pagesfromfile: creating pages based on dabbed titles

  1. Strip the dab with
    python -regex "(.*)\((.*)\)\r" "\1" -page:User:CzechOut/Sandbox10
  2. Add on the "right side" of coding necessary to use
    python -regex "\r" " comic story images'''\n[[Category:TVC comic story images]]\nyyyy\nxxxx" -page:"User:CzechOut/Sandbox10"
  3. Add to the "left side":
    python -regex "\n(.*)'''" "'''Category:\1'''" -page:"User:CzechOut/Sandbox10"
If starting with a list of names of stories, each entry will go from:
  • Story name
to:
  • '''Category:Story name comic story images'''
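For comparison, the same end result can be sketched in plain Python on a list held in memory, rather than through three bot runs on a sandbox page. This is not the bot method above, just a stand-in for checking the output. The function name and its parameters are hypothetical, and it assumes that xxxx opens each page block and yyyy closes it.

```python
import re


def build_pagefromfile_text(titles, suffix="comic story images",
                            parent="TVC comic story images"):
    """Turn dabbed story titles into category-page blocks.

    Assumption: 'xxxx' opens and 'yyyy' closes each page block.
    """
    blocks = []
    for title in titles:
        # Step 1: strip a trailing "(dab term)" and the space before it.
        name = re.sub(r"\s*\([^()]*\)\s*$", "", title).strip()
        # Steps 2 and 3: wrap the name in the bolded category title and
        # add the parent category between the two markers.
        blocks.append(
            "xxxx\n"
            f"'''Category:{name} {suffix}'''\n"
            f"[[Category:{parent}]]\n"
            "yyyy"
        )
    return "\n".join(blocks)


print(build_pagefromfile_text(["Story name (comic story)"]))
```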

Simple duplication of an entry on a list

  • To create a duplicate on a list
     python -regex "(.*)\n" "\1\n\1 (comic story)" -page:User:CzechOut/Sandbox10

Log file --> something usable by

  1. Paste logfile onto a page, like user:CzechOut/Sandbox10
  2. Get rid of the "Getting" statements with
    python -regex "Getting.*\n" "" -page:User:CzechOut/Sandbox10
  3. Get rid of everything that's already disambigged with
    python -regex "\n(.*)\)\r" "" -page:User:CzechOut/Sandbox10
  4. Create duplicates of each name, then add (disambiguation term) with
     python -regex "(.*)\r" "\1\n\1 (comic story)" -page:User:CzechOut/Sandbox10
  5. Put brackets on the right side with
    python -regex "(.*)\r" "\1]]" -page:User:CzechOut/Sandbox10
  6. Put brackets on the left side with
    python -regex "(.*)\]\]" "[[\1]]" -page:User:CzechOut/Sandbox10
Depending on the number of items on your list, the last two steps can take a long time. It'll look like the bot is frozen, but it's not.

HTML bullet stripper

To strip HTML tags do this:

  1. python -regex "<ul>|<\/ul>|<li>|<\/li>" "" -cat:"Doctor Who (2005) television stories" -summary:"getting rid of bulleting in infobox"
    This will then leave you with a series of links directly abutting each other.
  2. python -regex -summary:"putting commas between links" "\]\[" "], [" -subcat:"Doctor Who (2005) television stories"
    This will put a comma and a space between two abutting links.
  3. python -regex -summary:"putting commas between links" "\)\[" "), [" -subcat:"Doctor Who (2005) television stories"
    This will take care of those few instances of a parenthesis abutting a link.
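The three passes above can be chained on a sample string to see what each one leaves behind. This is a sketch with Python's `re` module, assumed to match the bot's behaviour; the patterns are the ones from the commands, and the sample infobox value is made up.

```python
import re

# Hypothetical infobox value using HTML bullets.
sample = ("<ul><li>[[Person A]]</li><li>[[Person B]] (uncredited)</li>"
          "<li>[[Person C]]</li></ul>")

# Pass 1: strip the list tags, leaving links jammed together.
step1 = re.sub(r"<ul>|<\/ul>|<li>|<\/li>", "", sample)

# Pass 2: put a comma and space between two abutting links.
step2 = re.sub(r"\]\[", "], [", step1)

# Pass 3: do the same where a closing parenthesis abuts a link.
step3 = re.sub(r"\)\[", "), [", step2)

print(step1)
print(step2)
print(step3)
```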

Stripping a variable of its link

Many times it's better to have an unlinked variable than a linked one. To strip an existing variable of its linkage, do the following:

python -regex -summary:"stripping prev/next story, adding dab for better link" 'previous story( *)=(.*)\[\[(.*)\]\]' "previous story\1=\2\3 (TV story)" -cat:"Doctor Who (1963) television stories"

That works fine, as long as people have actually built the infobox in the "correct" way, i.e. one variable per line. But if they squash it all down so that the infobox and entire text of the article is on one line, the regex is far too greedy and will create unexpected replacements. The following is much better:

python -regex -summary:"stripping prev/next story, adding dab for better link" 'next story( *?)=(.*?)\[\[(.*?)\]\]' "next story\1=\2\3 (TV story)" -subcat:"television stories"
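The difference between the greedy and non-greedy versions shows up clearly on a squashed, one-line article. Here's a sketch with Python's `re` module, assumed to match the bot's behaviour. Both patterns use "next story" so they can be compared directly, and the sample text is invented.

```python
import re

# Hypothetical article with the infobox and body text on one line.
sample = ("{{Infobox|next story = [[Story B]]|doctor = [[Some Doctor]]}} "
          "See also [[Some planet]].")

greedy = r"next story( *)=(.*)\[\[(.*)\]\]"
lazy = r"next story( *?)=(.*?)\[\[(.*?)\]\]"
replacement = r"next story\1=\2\3 (TV story)"

# Greedy: (.*) runs to the end of the line and backtracks to the LAST
# link, so the wrong link loses its brackets.
greedy_result = re.sub(greedy, replacement, sample)

# Non-greedy: (.*?) stops at the FIRST link after the equals sign.
lazy_result = re.sub(lazy, replacement, sample)

print(greedy_result)
print(lazy_result)
```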

The quick and nasty way to build huge lists of stories

Let's say you have a list of stories with improper disambiguation terms. Or maybe a list without disambiguation terms at all. Instead of typing everything out by hand, like ya did with the British spell checker, use regex to instantly deliver a list that you can immediately plug into a user-fix.

python -page:user:CzechOut/Sandbox13 -regex "(.*?)\(comic story\)" "u'\1(short story)', u'\1(comic story)',\n"

What this does is take a raw dump of un-linked text, in this case things ending in (comic story). It then strips (comic story) and adds the basic structure for replacements. The resulting list will correct every instance where a story has been misidentified as a (short story) and convert it to a proper (comic story). Obviously, we're using u instead of r here because there's no regex in this replacement. It's totally literal, which is what allows us to use u.
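A quick way to preview the list this produces is to run the same pattern and replacement through Python's `re` module, which is assumed here to behave like the bot. The story names are placeholders.

```python
import re

# Pattern and replacement copied from the command above.
pattern = r"(.*?)\(comic story\)"
replacement = r"u'\1(short story)', u'\1(comic story)',\n"

# Hypothetical raw dump: one dabbed title per line.
sample = "Story One (comic story)\nStory Two (comic story)\n"

result = re.sub(pattern, replacement, sample)

# The original line breaks are kept, so drop the blank lines to get
# one (wrong title, right title) pair per line.
pairs = [line for line in result.splitlines() if line]
for pair in pairs:
    print(pair)
```

Each printed line is a wrong/right pair of the kind described above, ready to be pasted into a fix's replacement list.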

Creating mass categories

python -regex -page:User:CzechOut/Sandbox14 "\n(.*?)\r" "'''Category:\1'''\n{{ImageLink}}\n{{TitleSort}}\n[[Category:CON images]]\nyyyy\nxxxx\n"

Converting {{{appearances}}} to {{{only}}}

You'll have to go on semi-automatic, but this'll do the job:

python -regex -summary:"converting {{{appearances}}} to {{{only}}}" "appearances( *?)=( *?)\[\[(.*)\]\]\:( *?)''\[\[(.*)\]\]''\r" "only\1=\2\5"  -page:"Dave Finn" 

Getting rid of whole sections

Here's an example of how to get rid of a whole section. It depends on knowing the format the section is in, however. Any sections called "Timeline" that deviate from this pattern won't be affected.

python -regex "\=\= Timeline \=\=\r.*\n\*.*\r.*\n\*.*\n" "" -summary:"Getting rid of timeline sections per [[Forum:Timeline sections on pages]]" -catr:"stories"

Stripping double vertical spaces

python "\s\s\s" "" -page:"The End (audio story)" -regex