User:CzechOut/Bot tricks

From Tardis Wiki, the free Doctor Who reference

The following is a list of tricks I've learned while using pywikipedia.

add_text.py

One of the harder things to do with bots is to work on pages that have no categories. This is because bots depend upon categories for many of their functions. However, bots can be used on pages without categories, as long as you go about things creatively.

If you have a user who is constantly uploading pictures without licenses, it may be easiest just to look for their work, to the exclusion of other people. Here's a run that'll look for only their additions to the file namespace:

python add_text.py -text:"{{bbcvidcover}}" -namespace:6 -usercontribs:"Doctor Who 63"  -except:"\{\{[Bb]bcvidcover" 

Note that this goes through all of their work in namespace 6, not just their unlicensed work in that namespace. The -uncatfiles parameter doesn't help here. It doesn't hurt, but it doesn't confine the search to just those things in namespace 6 modified by Doctor Who 63 which are also uncategorised.
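To make the role of -except a bit more concrete: the value is a regular expression, and a page whose text already matches it gets skipped. Here's a minimal Python sketch of that check. It is not add_text.py's actual code, and the function name needs_template is made up for illustration.

```python
import re

# Same pattern as the -except argument above:
# matches "{{bbcvidcover" or "{{Bbcvidcover" anywhere in the page text.
EXCEPT_PATTERN = re.compile(r"\{\{[Bb]bcvidcover")


def needs_template(page_text):
    """Return True when the page does not already carry the template.

    Hypothetical helper: it only mirrors the idea behind -except.
    """
    return EXCEPT_PATTERN.search(page_text) is None


if __name__ == "__main__":
    print(needs_template("A cover scan with no licence."))
    print(needs_template("{{Bbcvidcover}}\nA cover scan."))
```

The character class [Bb] is what lets the check catch both capitalisations of the template name.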

However, -uncatfiles is helpful if you don't have that many files to look after. This is what you use if you just want to add {{bbcvidcover}} to pages that aren't categorised.

python add_text.py -text:"{{bbcvidcover}}" -uncatfiles

Of course, this is a slow way to go about things, because you probably won't want to add a single template to all the uncategorised files. If you want to filter things a bit, you can instead try to find patterns in the titles of the uncategorised files.

-titleregex:

allows you to make up your own matching rules. But if you can see a quick and dirty pattern at the beginning of a filename, you might try this instead:

-prefixindex:"File:<whatever>"

This method is perfect for quickly licensing achievement badges, because they'll start with the term "File:badge".
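The difference between the two filters can be sketched in plain Python. This is only an illustration of the idea, assuming a prefix filter is a simple "starts with" test and a title regex is a search over the title; whether the real -titleregex anchors the pattern at the start of the title isn't something this sketch settles. The file titles and function names are invented.

```python
import re

# Invented titles, standing in for a list of uncategorised files.
TITLES = [
    "File:badge-1.png",
    "File:badge-love.png",
    "File:Tardis console.jpg",
]


def by_prefix(titles, prefix):
    """Keep the titles that begin with the given prefix."""
    return [t for t in titles if t.startswith(prefix)]


def by_title_regex(titles, pattern):
    """Keep the titles in which the regex matches somewhere."""
    rx = re.compile(pattern)
    return [t for t in titles if rx.search(t)]


if __name__ == "__main__":
    print(by_prefix(TITLES, "File:badge"))
    print(by_title_regex(TITLES, r"\.jpg$"))
```

The prefix test is the quick and dirty option; the regex is worth the trouble only when the pattern sits somewhere other than the front of the filename.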


What if you want to replace something about a title that has both exclamation points and single quotes for italics? This is pretty dicey, because the exclamation point has to be escaped, and you've got to figure out a way to get around the single quotes. Here's a useful expression:

python replace.py -summary:"see [[forum:Prefix war: Doctor Who Adventures vs. Doctor Who Annuals]]: DWAN --> DWS" -regex "\[\[DWAN\]\]\: ''\[\[Grand Theft Planet(\!\]\]'')" "[[DWS]]: ''[[Grand Theft Planet\1" -ref:'Grand Theft Planet!' 

Note what's going on here. The -ref value must be in single quotes. The regex for the original term must have parentheses around the part of the page name that's causing the most difficulty, so that it can just be dumped into the replacement term as a \1. After trying for a bit, I couldn't find anything else that worked in command line operation of the bot. Of course, my guess is that you might well need something like this, even if using a user-fix.
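The capture-group trick can be tried outside the bot with Python's own re module. This is a sketch for checking the expression, not a replacement for running replace.py; the function name fix_prefix is made up, and the pattern and replacement are copied from the command above.

```python
import re

# Pattern and replacement copied from the replace.py command above.
# The group captures the awkward tail of the link: !]]''
PATTERN = r"\[\[DWAN\]\]\: ''\[\[Grand Theft Planet(\!\]\]'')"
REPLACEMENT = r"[[DWS]]: ''[[Grand Theft Planet\1"


def fix_prefix(text):
    """Swap the DWAN prefix for DWS, reusing the captured tail as \\1."""
    return re.sub(PATTERN, REPLACEMENT, text)


if __name__ == "__main__":
    print(fix_prefix("See [[DWAN]]: ''[[Grand Theft Planet!]]'' for details."))
```

Because the exclamation point and the closing italics live inside the group, the replacement string never has to spell them out, which is what sidesteps the shell quoting trouble.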

Cleanup after pagefromfile.py

python replace.py "\<(.*)(\[\[.*\]\])(.*)\>" "{{hidecat}}\2" -regex -subcatsr:"Articles containing potentially dated statements from 2015"

This would keep any categories you have in the sea of code that's unfortunately generated by pagefromfile.py.

Update: Actually, it turns out that the code is only generated when using an .xml file. If you instead just use an .rtf or, better, a regular .txt file (encoded as UTF-8), things work out nicely.
Update again: .txt files encoded as UTF-8 are the very best option. If your text includes symbols like curly braces (as with template calls), you absolutely need a plain .txt file.

Regex snippets

  • This gets rid of empty sections, in this case External link/links:
-regex "\=\= External .*\n''.*''"
  • This replaces a multi-line series variable with a singular one, while at the same time preserving spaces between "series" and the = sign:
    python replace.py -regex 'series( *)=.*' "series\1=[[DWM comic stories|''DWM'' comic stories]]|" -summary:"Only one linked item per series variable.  Otherwise, it's VERY unclear what the previous/next line refers to" -cat:'Fourth Doctor DWM comic stories'
  • To automatically add brackets around things, generally use bracket. However, for dab pages, where you have a list of things all starting with the same letters, use this:
    python replace.py -regex -page:"Pagename" "StartingString(.*)\r" "* [[StartingString\1]]"
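The series snippet from the list above can be checked in plain Python before letting the bot loose. This is a sketch using the re module with the same pattern and replacement; the function name collapse_series and the sample infobox line are invented.

```python
import re

# Pattern and replacement copied from the "series" snippet above.
# ( *) remembers the spaces before the = sign so they can be put back.
PATTERN = r"series( *)=.*"
REPLACEMENT = r"series\1=[[DWM comic stories|''DWM'' comic stories]]|"


def collapse_series(text):
    """Replace whatever follows 'series =' on that line with one link."""
    return re.sub(PATTERN, REPLACEMENT, text)


if __name__ == "__main__":
    sample = "|series   = [[A]], [[B]]\n|next = X"
    print(collapse_series(sample))
```

Since . doesn't match a newline by default, the .* only eats the rest of the line that the variable starts on.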

pagefromfile.py: creating pages based on dabbed titles

  1. Strip the dab with
    python replace.py -regex "(.*)\((.*)\)\r" "\1" -page:User:CzechOut/Sandbox10
  2. Add on the "right side" of coding necessary to use pagefromfile.py:
    python replace.py -regex "\r" " comic story images'''\n[[Category:TVC comic story images]]\nyyyy\nxxxx" -page:"User:CzechOut/Sandbox10"
  3. Add to the "left side":
    python replace.py -regex "\n(.*)'''" "'''Category:\1'''" -page:"User:CzechOut/Sandbox10"
If starting with a list of names of stories, the results will go from:
  • Story name
to
  • '''Category:Story name comic story images'''
yyyy
xxxx
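If you'd rather build the same blocks locally instead of running three passes over a sandbox page, something like the following might do. It's a sketch of the end result of the steps above, not a transcription of them: the function names are invented, the dab stripping uses a slightly different regex, and the yyyy/xxxx markers are simply the ones used in step 2.

```python
import re


def strip_dab(title):
    """Drop a trailing parenthetical such as ' (comic story)'."""
    return re.sub(r"\s*\([^()]*\)\s*$", "", title)


def page_block(title):
    """Build one pagefromfile-style block for a story title.

    Hypothetical helper; the markers and category name follow step 2 above.
    """
    name = strip_dab(title)
    return (
        "'''Category:%s comic story images'''\n"
        "[[Category:TVC comic story images]]\n"
        "yyyy\n"
        "xxxx" % name
    )


if __name__ == "__main__":
    for story in ["Story name (comic story)", "Another story"]:
        print(page_block(story))
```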

Simple duplication of an entry on a list

  • To create a duplicate on a list
     python replace.py -regex "(.*)\n" "\1\n\1 (comic story)" -page:User:CzechOut/Sandbox10

Log file --> something usable by movepages.py

  1. Paste logfile onto a page, like user:CzechOut/Sandbox10
  2. Get rid of the "Getting" statements with
    python replace.py -regex "Getting.*\n" "" -page:User:CzechOut/Sandbox10
  3. Get rid of everything that's already disambigged with
    python replace.py -regex "\n(.*)\)\r" "" -page:User:CzechOut/Sandbox10
  4. Create duplicates of each name, then add (disambiguation term) with
     python replace.py -regex "(.*)\r" "\1\n\1 (comic story)" -page:User:CzechOut/Sandbox10
  5. Put brackets on the right side with
    python replace.py -regex "(.*)\r" "\1]]" -page:User:CzechOut/Sandbox10
  6. Put brackets on the left side with
    python replace.py -regex "(.*)\]\]" "[[\1]]" -page:User:CzechOut/Sandbox10
Depending on the number of items on your list, the last two steps can take a long time. It'll look like the bot is frozen, but it's not.
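For long lists, the same clean-up can be sketched locally in Python, which avoids the slow last two passes. This is an assumption-laden sketch rather than the bot's behaviour: it treats any line beginning with "Getting" as log noise, treats any line ending in ")" as already disambiguated, and the function name move_list and the sample titles are invented.

```python
def move_list(log_text, dab="comic story"):
    """Turn a pasted log into bracketed name / name (dab) lines."""
    lines = []
    for raw in log_text.splitlines():
        title = raw.strip()
        # Skip blanks and the "Getting ..." progress lines (step 2).
        if not title or title.startswith("Getting"):
            continue
        # Skip anything that already carries a dab term (step 3).
        if title.endswith(")"):
            continue
        # Duplicate the name, add the dab term, bracket both (steps 4 to 6).
        lines.append("[[%s]]" % title)
        lines.append("[[%s (%s)]]" % (title, dab))
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "Getting 60 pages...\nThe Iron Legion\nAgent Provocateur (comic story)\n"
    print(move_list(sample))
```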

HTML bullet stripper

To strip HTML tags do this:

  1. python replace.py -regex "<ul>|<\/ul>|<li>|<\/li>" "" -cat:"Doctor Who (2005) television stories" -summary:"getting rid of bulleting in infobox"
    This will then leave you with a series of links directly abutting each other.
  2. python replace.py -regex -summary:"putting commas between links" "\]\[" "], [" -subcat:"Doctor Who (2005) television stories"
    This will put a comma and a space between two abutting links.
  3. python replace.py -regex -summary:"putting commas between links" "\)\[" "), [" -subcat:"Doctor Who (2005) television stories"
    This will take care of those few instances of a parenthesis abutting a link.
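The three passes can be chained in plain Python to see what they do to a line before running them across a whole category. This is a sketch with the same three regexes; the function name strip_bullets and the sample text are invented.

```python
import re


def strip_bullets(text):
    """Apply the three replacements above, in the same order."""
    # 1. Remove the list tags, leaving links directly abutting each other.
    text = re.sub(r"<ul>|<\/ul>|<li>|<\/li>", "", text)
    # 2. Put a comma and a space between two abutting links.
    text = re.sub(r"\]\[", "], [", text)
    # 3. Handle a closing parenthesis abutting a link.
    text = re.sub(r"\)\[", "), [", text)
    return text


if __name__ == "__main__":
    print(strip_bullets("<ul><li>[[Rose Tyler]]</li><li>[[Jack Harkness]]</li></ul>"))
```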

Stripping a variable of its link

Many times it's better to have an unlinked variable than a linked one. To strip an existing variable of its linkage, do the following:

python replace.py -regex -summary:"stripping prev/next story, adding dab for better link" 'previous story( *)=(.*)\[\[(.*)\]\]' "previous story\1=\2\3 (TV story)" -cat:"Doctor Who (1963) television stories"

That works fine, as long as people have actually built the infobox in the "correct" way, i.e. one variable per line. But if they squash it all down so that the infobox and entire text of the article is on one line, the regex is far too greedy and will create unexpected replacements. The following is much better:

python replace.py -regex -summary:"stripping prev/next story, adding dab for better link" 'next story( *?)=(.*?)\[\[(.*?)\]\]' "next story\1=\2\3 (TV story)" -subcat:"television stories"
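The difference between the greedy and the non-greedy version is easy to see on a squashed, one-line infobox. Here's a sketch in plain Python; both patterns are written for "previous story" so they can be compared on the same line, and the sample line and function names are invented.

```python
import re

GREEDY = r"previous story( *)=(.*)\[\[(.*)\]\]"
LAZY = r"previous story( *?)=(.*?)\[\[(.*?)\]\]"
REPLACEMENT = r"previous story\1=\2\3 (TV story)"

# An infobox squashed onto a single line.
LINE = "|previous story = [[Rose]] |next story = [[World War Three]]"


def strip_greedy(text):
    """Greedy version: .* runs on to the LAST link on the line."""
    return re.sub(GREEDY, REPLACEMENT, text)


def strip_lazy(text):
    """Non-greedy version: .*? stops at the FIRST link it can."""
    return re.sub(LAZY, REPLACEMENT, text)


if __name__ == "__main__":
    print(strip_greedy(LINE))
    print(strip_lazy(LINE))
```

The greedy run leaves the previous story linked and mangles the next story instead, which is exactly the kind of unexpected replacement described above.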

The quick and nasty way to build huge lists of stories

Let's say you have a list of stories with improper disambiguation terms. Or maybe a list without disambiguation terms at all. Instead of typing everything out by hand, like ya did with the British spell checker, use regex to instantly deliver a list that you can immediately plug into a user-fix.

python replace.py -page:user:CzechOut/Sandbox13 -regex "(.*?)\(comic story\)" "u'\1(short story)', u'\1(comic story)',\n"

What this does is take a raw dump of un-linked text, in this case things ending in (comic story). It then strips (comic story) and adds the basic structure for user-fixes.py replacements. This will then correct every instance where a story has been misidentified as a (short story) and convert it to a proper (comic story). Obviously here, we're using u instead of r because there's no regex to this replacement. It's totally literal, allowing us to use u.
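To preview what that expression spits out, the same pattern and replacement can be run through Python's re module. This is a sketch only; the function name build_pairs and the story title are invented.

```python
import re

# Pattern and replacement copied from the command above.
PATTERN = r"(.*?)\(comic story\)"
REPLACEMENT = r"u'\1(short story)', u'\1(comic story)',\n"


def build_pairs(raw):
    """Turn 'Title (comic story)' into a wrong-term / right-term pair."""
    return re.sub(PATTERN, REPLACEMENT, raw)


if __name__ == "__main__":
    print(build_pairs("The Iron Legion (comic story)"))
```

Note that the space before the bracket travels along inside \1, which is why the replacement doesn't add one of its own.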

Creating mass categories

python replace.py -regex -page:User:CzechOut/Sandbox14 "\n(.*?)\r" "'''Category:\1'''\n{{ImageLink}}\n{{TitleSort}}\n[[Category:CON images]]\nyyyy\nxxxx\n"

Converting {{{appearances}}} to {{{only}}}

You'll have to go on semi-automatic, but this'll do the job:

python replace.py -regex -summary:"converting {{{appearances}}} to {{{only}}}" "appearances( *?)=( *?)\[\[(.*)\]\]\:( *?)''\[\[(.*)\]\]''\r" "only\1=\2\5"  -page:"Dave Finn" 

Getting rid of whole sections

Here's an example of how to get rid of a whole section. It depends on knowing the format the section is in, however. Any sections called "Timeline" that deviate from this pattern won't be affected.

python replace.py -regex "\=\= Timeline \=\=\r.*\n\*.*\r.*\n\*.*\n" "" -summary:"Getting rid of timeline sections per [[Forum:Timeline sections on pages]]" -catr:"stories"

Using API to generate quick lists

Derive a list of things. This example will give you a list of all user blog comments:

http://tardis.wikia.com/api.php?action=query&list=allpages&apnamespace=501&default=500&aplimit=1000

Then, cut and paste results over at User:CzechOut/API. Then, run the following two strippers:

LEFT SIDE STRIP 

python replace.py -regex '( +?)\<p pageid\=\"(.*?)\" ns\=\"501\" title\=' '' -page:User:CzechOut/API

RIGHT SIDE STRIP

python replace.py -regex '\" \/\>' '' -page:User:CzechOut/API

You'll end up with a list to put into TextEdit. Convert to UTF-8 and save as a .txt file. That then lets you do the following final step:

python delete.py -file:Filename.txt
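An alternative to the two regex strippers is to parse the API output as XML and pull out the title attributes directly. This is a different technique from the one above, sketched under the assumption that the API response is XML with one p element per page, as the strippers imply. The sample response, page ids and titles are invented.

```python
import xml.etree.ElementTree as ET

# Invented sample, shaped like the <p pageid=... ns=... title=... /> lines
# that the two strippers above are written for.
SAMPLE = """<api>
  <query>
    <allpages>
      <p pageid="101" ns="501" title="User blog comment:Example/First post/@comment-1" />
      <p pageid="102" ns="501" title="User blog comment:Example/First post/@comment-2" />
    </allpages>
  </query>
</api>"""


def titles_from_api_xml(xml_text):
    """Return the title attribute of every <p> element, in document order."""
    root = ET.fromstring(xml_text)
    return [p.get("title") for p in root.iter("p")]


if __name__ == "__main__":
    # One title per line, ready to be saved as a UTF-8 .txt file.
    print("\n".join(titles_from_api_xml(SAMPLE)))
```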

Stripping double vertical spaces

python replace.py "(\n\r)(\n\r)" "" -page:"The End (audio story)" -regex

Fixing specific stories' prefixes

Begin by getting rid of the junk that pagegenerators.py creates:

python replace.py -regex "Getting.*\n" "" -page:User:CzechOut/Sandbox10

Then move on to create your user-fix. This takes into account citations that have the dab term, and those that don't (but it leaves behind a dab-termed reference).

python replace.py -page:user:CzechOut/Sandbox10 -regex "(.*?)( +?)\((comic story)\)" "(r'\[\[DWM\]\]\: \'\'\[\[\1\2\\(\3\)\|\1\]\]\'\'', r'[[COMIC]]: ''[[\1\2(\3)|\1]]'''),\n(r'\[\[DWM\]\]\: \'\'\[\[\1\]\]\'\'', r'[[COMIC]]: ''[[\1\2(\3)|\1]]'''),\n"

Then you need to make the single quotes on the replacement expression turn into double quotes, or the replacement won't be able to replace the single quotes used to denote italics.

LEFT SIDE

python replace.py -page:user:CzechOut/Sandbox10  "r'[[" 'r"[['
RIGHT SIDE

python replace.py -page:user:CzechOut/Sandbox10  "')," '"),'

Cut and paste the results of user:CzechOut/Sandbox10 into user-fixes.py, and you're off to the races.
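For orientation, here's roughly what a finished entry might look like once it's sitting in user-fixes.py, along with a quick local check. Treat it as a sketch: the fix name, summary and story title are invented, the fixes dictionary is defined here only so the snippet runs on its own (in a real user-fixes.py it's already there), and apply_fix is a stand-in for what the bot does, not its actual code.

```python
import re

fixes = {}  # defined here only so the sketch is self-contained

fixes["dwm-to-comic"] = {  # hypothetical fix name
    "regex": True,
    "msg": {"_default": "DWM --> COMIC"},  # hypothetical edit summary
    "replacements": [
        # Double-quoted strings, so the '' italics inside them are safe.
        (r"\[\[DWM\]\]\: ''\[\[The Iron Legion \(comic story\)\|The Iron Legion\]\]''",
         r"[[COMIC]]: ''[[The Iron Legion (comic story)|The Iron Legion]]''"),
        (r"\[\[DWM\]\]\: ''\[\[The Iron Legion\]\]''",
         r"[[COMIC]]: ''[[The Iron Legion (comic story)|The Iron Legion]]''"),
    ],
}


def apply_fix(text, fix):
    """Run each (pattern, replacement) pair over the text in order."""
    for old, new in fix["replacements"]:
        text = re.sub(old, new, text)
    return text


if __name__ == "__main__":
    print(apply_fix("[[DWM]]: ''[[The Iron Legion]]''", fixes["dwm-to-comic"]))
```

The first pair catches citations that already have the dab term and the second catches those that don't; both end up as the same piped link.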

Switching wikilinks for templates

python replace.py -regex "\[\[[Tt]he Master \(UNIT years\)(.*?)\]\]" "{{Delgado}}" -ref:"The Master (UNIT years)"
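A quick way to check that the expression catches both plain and piped links is to run it through Python's re module first. This is a sketch; the function name link_to_template and the sample sentence are invented.

```python
import re

# Pattern copied from the command above. (.*?) swallows an optional
# "|display text" part, stopping at the first closing ]].
PATTERN = r"\[\[[Tt]he Master \(UNIT years\)(.*?)\]\]"


def link_to_template(text):
    """Swap every link to the page for the {{Delgado}} template."""
    return re.sub(PATTERN, "{{Delgado}}", text)


if __name__ == "__main__":
    print(link_to_template(
        "[[The Master (UNIT years)|the Master]] met [[the Master (UNIT years)]]."
    ))
```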

Making sure that stubs have tags

Some people like to put stubs directly into a category rather than using a proper stub template. To fix this problem, first add the stub to everything in the right category.

python add_text.py -regex -text:"{{TV cast stub}}" -except:"\{\{[Tt]V cast stub\}\}" -category:"TV cast stubs" -summary:"adding stub tag"

Then, you need to go back and strip the category that was mistakenly put on the page:

python replace.py -regex "\[\[[Cc]ategory:[Tt]V cast stubs\]\]\r\n" "" -category:"TV cast stubs"

Creating date pages the right way

In addition to other things that need to be dumped on date pages, don't forget to make sure that {{DayNav}} only appears on the day page in question. You don't want it transcluding over on to Transmat pages.

{{#ifeq:{{PAGENAME}}|{{subst:PAGENAME}}|{{DayNav}}|}}

{{DayNav}} will probably also need a little rejiggering to ignore dab terms. It might have to process page names through {{StoryTitle}}, or something very like {{StoryTitle}}.

Refreshing pages

Not strictly a bot trick, but it is a Terminal thing: to refresh a page that's just not serving properly (like a CSS file), go into Terminal and perform the following curl:

curl -X purge "http://url.url.com"

That's a capital X. If you want to get headers, use -I instead.

Fix to js

This is what I'm using:

$('#WikiaRail').bind('DOMNodeInserted', function(event) { // fires after lazy-loading takes place
  if ($('#WikiaRecentActivity').size()) { // check that #WikiaRecentActivity has been loaded
    if (!$('#mosbox').size()) { // check to make sure it hasn't already been added
      $('#WikiaRecentActivity').before( **add your stuff here** );
    }
  }
}); // end of DOMNodeInserted block

So, where I said "**add your stuff here**", this will work:

$('#WikiaRecentActivity').before(comboString2);

Obviously, you can just stick a second block in there for your Twitter feed too.