User:CzechOut/Bot tricks: Difference between revisions

Revision as of 19:01, 26 January 2012

The following is a list of tricks I've learned while using pywikipedia.

add_text.py

One of the harder things to do with bots is to work on pages that have no categories. This is because bots depend upon categories for many of their functions. However, bots can be used on pages without categories, as long as you go about things creatively.

If you have a user who is constantly uploading pictures without licenses, it may be easiest just to look for their work, to the exclusion of other people. Here's a run that'll look for only their additions to the file namespace:

python add_text.py -text:"{{bbcvidcover}}" -namespace:6 -usercontribs:"Doctor Who 63"  -except:"\{\{[Bb]bcvidcover"

Note that this goes through all of their work in namespace 6, not only their unlicensed work in that namespace. The -except pattern is what keeps the bot from touching files that already carry the template. Note also that the parameter -uncatfiles doesn't actually help here. It doesn't hurt, but it doesn't confine the search to just those things in namespace 6 modified by Doctor Who 63 which are also uncategorised.
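To see what the -except pattern is doing, here is a minimal sketch using Python's own re module. It assumes that -except is applied as a regex search against the page text; the helper name would_be_skipped and the sample page texts are made up for illustration and are not part of pywikipedia.

```python
import re

# Same pattern as the -except argument in the run above.
EXCEPT_PATTERN = re.compile(r"\{\{[Bb]bcvidcover")


def would_be_skipped(page_text: str) -> bool:
    """Hypothetical helper: True if the page already carries the template."""
    return EXCEPT_PATTERN.search(page_text) is not None


# Made-up page texts, purely for illustration.
print(would_be_skipped("{{Bbcvidcover}}\nA DVD cover."))   # template present
print(would_be_skipped("A DVD cover with no licence."))    # template absent
```

The [Bb] character class is there because the template can be written with either capitalisation of its first letter.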

However, -uncatfiles is helpful if you don't have that many files to look after. This is what you use if you just want to add {{bbcvidcover}} to pages that aren't categorised.

python add_text.py -text:"{{bbcvidcover}}" -uncatfiles

Of course, this is a slow way to go about things, because you probably won't want to add a single template to all the uncategorised files. If you want to filter things a bit, you can instead try to find patterns in the titles of the uncategorised files.

-titleregex:

allows you to make up your own matching rules. But if you can see a quick and dirty pattern at the beginning of a filename, you might try this instead:

-prefixindex:"File:<whatever>"

This method is perfect for quickly licensing achievement badges, because they'll start with the term "File:badge".
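The difference between the two filters can be sketched in plain Python. This is only an illustration of prefix matching versus regex matching on titles; it does not reproduce how the bot actually queries the wiki, and the titles and function names below are hypothetical.

```python
import re

# Hypothetical file titles, for illustration only.
titles = ["File:Badge-gold.png", "File:Badge-silver.jpg", "File:TARDIS.jpg"]


def by_prefix(titles, prefix):
    """Keep titles that begin with the given prefix (the -prefixindex idea)."""
    return [t for t in titles if t.startswith(prefix)]


def by_title_regex(titles, pattern):
    """Keep titles matching an arbitrary regex (the -titleregex idea)."""
    rx = re.compile(pattern)
    return [t for t in titles if rx.search(t)]


print(by_prefix(titles, "File:Badge"))
print(by_title_regex(titles, r"^File:Badge.*\.png$"))
```

The prefix version is quicker to type; the regex version lets you narrow things further, here to PNG badges only.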

What if you want to replace something about a title that has both exclamation points and single quotes for italics? This is pretty dicey, because the exclamation point has to be escaped, and you've got to figure out a way to get around the single quotes. Here's a useful expression:

python replace.py -summary:"see [[forum:Prefix war: Doctor Who Adventures vs. Doctor Who Annuals]]: DWAN --> DWS" -regex "\[\[DWAN\]\]\: ''\[\[Grand Theft Planet(\!\]\]'')" "[[DWS]]: ''[[Grand Theft Planet\1" -ref:'Grand Theft Planet!'

Note what's going on here. The -ref line must be in single quotes. The regex for the original term must have parentheses around the part of the page name that's causing the most difficulty, so that it can just be dumped into the replacement term as a \1. After trying for a bit, I couldn't find anything else that worked in command line operation of the bot. Of course, my guess is that you might well need something like this, even if using a user-fix.
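The capture-group trick can be checked outside the bot with Python's re module. This sketch only exercises the regex and the replacement string from the command above; the sample line of wikitext is made up, and the shell quoting issues are not reproduced here.

```python
import re

# Pattern and replacement copied from the replace.py command above.
pattern = r"\[\[DWAN\]\]\: ''\[\[Grand Theft Planet(\!\]\]'')"
replacement = r"[[DWS]]: ''[[Grand Theft Planet\1"

# Made-up line of wikitext, for illustration only.
text = "* [[DWAN]]: ''[[Grand Theft Planet!]]''"

# Group 1 captures the awkward tail (the "!", the closing brackets and the
# italic quotes), and \1 pastes it back unchanged.
result = re.sub(pattern, replacement, text)
print(result)  # * [[DWS]]: ''[[Grand Theft Planet!]]''
```

Because the troublesome characters live only in the captured group, the replacement string never has to spell them out.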

Cleanup after pagefromfile.py

python replace.py "\<(.*)(\[\[.*\]\])(.*)\>" "{{hidecat}}\2" -regex -subcatsr:"Articles containing potentially dated statements from 2015"

This would keep any categories you have in the sea of code that's unfortunately generated by pagefromfile.py.
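Here is a rough sketch of what that expression does, run through Python's re module. The sample lines are hypothetical, since the exact code pagefromfile.py leaves behind isn't shown here; in Python's regex syntax, \< and \> simply match literal angle brackets.

```python
import re

# Pattern and replacement copied from the replace.py command above.
pattern = r"\<(.*)(\[\[.*\]\])(.*)\>"
replacement = r"{{hidecat}}\2"

# Hypothetical leftover markup wrapped around a single category link.
one_link = '<span class="x">[[Category:2015 stories]]</span>'
print(re.sub(pattern, replacement, one_link))

# Hypothetical line with two links: the greedy first group swallows the
# earlier link, so only the last one ends up in \2.
two_links = "<b>[[Category:A]] [[Category:B]]</b>"
print(re.sub(pattern, replacement, two_links))
```

In this sketch, a line carrying more than one bracketed link keeps only the last of them, so it may be worth checking such lines by hand.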

Regex snippets

This gets rid of empty sections, in this case External link/links:

-regex "\=\= External .*\n''.*''"
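As a quick check of that snippet, here it is in Python's re module. The snippet above gives only the pattern, so the empty replacement string is an assumption, as is the sample page text. As written, the pattern expects the heading to be followed by a line in italics, presumably a placeholder.

```python
import re

# Pattern copied from the snippet above.
pattern = r"\=\= External .*\n''.*''"

# Made-up page text: a heading followed only by an italic placeholder line.
text = "Body text.\n== External links ==\n''to be added''\n[[Category:X]]"

# Assumption: the section is replaced with nothing.
result = re.sub(pattern, "", text)
print(repr(result))
```

The ".*" after "External " is what lets one pattern cover both "External link" and "External links" headings.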

This replaces a multi-line series variable with a singular one, while at the same time preserving spaces between "series" and the = sign:

python replace.py -regex 'series( *)=.*' "series\1=[[DWM comic stories|''DWM'' comic stories]]|" -summary:"Only one linked item per series variable.  Otherwise, it's VERY unclear what the previous/next line refers to" -cat:'Fourth Doctor DWM comic stories'
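The space-preserving part can be seen in isolation with Python's re module. The pattern and replacement are the ones from the command above; the infobox lines are invented for illustration.

```python
import re

# Pattern and replacement copied from the replace.py command above.
pattern = r"series( *)=.*"
replacement = r"series\1=[[DWM comic stories|''DWM'' comic stories]]|"

# Made-up infobox fragment with three spaces between "series" and "=".
text = "|series   = [[A]], [[B]]\n|writer = X"

# Group 1 holds the run of spaces, so \1 puts the same spacing back.
result = re.sub(pattern, replacement, text)
print(result)
```

Note that "." does not match a newline here, so in this sketch only the remainder of the "series" line itself is replaced.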

To automatically add brackets around things, generally use bracket. However, for dab pages, where you have a list of things all starting with the same letters, use this:
```
python replace.py -regex -page:"Pagename" "StartingString(.*)\r" "* [[StartingString\1]]"
```
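A small sketch of that dab-page expression in Python's re module. It assumes the text the bot sees has a carriage return at the end of each line, which is what the \r in the pattern relies on; with bare newlines the pattern would not match. "Doctor" stands in for StartingString, and the list entries are made up.

```python
import re

# "Doctor" is a stand-in for StartingString in the command above.
pattern = r"Doctor(.*)\r"
replacement = r"* [[Doctor\1]]"

# Made-up dab list with \r\n line endings (an assumption, see above).
text = "Doctor Who (film)\r\nDoctor Who (TV story)\r\n"

# Each matching line becomes a bulleted link; the \r is consumed,
# leaving the \n in place.
result = re.sub(pattern, replacement, text)
print(result)
```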

