regex – Nick Falkner, Personal Projects

I’m trying to come up with a set of easy mistake catchers that will do a lot of the more mindless editorial work for me. Having stared at the same 100,000 words for about six months, I’m losing the ability to see the errors. Here is another short set of Scrivener Regular Expression (regex) patterns that will find easy-to-fix problems. (Look at yesterday’s post on #regex to get more information on the definitions I’m using here. As a note, \w means word characters, ^ means ‘start of the line’ and [ ] is used to allow a range of characters to match. I’ll add some more as I go.)

\w”
- This is looking for a single “word” character, followed by closing quotation marks. Scrivener usually replaces straight quotes with opening and closing quotation marks. It’s important to use the right one but you can look for both by modifying the pattern to \w[“”].
- This should catch every time you’ve written speech and missed the punctuation at the end.
^[““][a-z]
- You have to switch on case-sensitivity in the search to get this one to work but it will start at the beginning of lines (^) and look for opening or straight quotes that are followed by lower-case letters.
- This will catch sentences such as “the fish sang.” and “dang it all to heck, man!” but it will also pick up any deliberate use of the lower case, such as “iBooks is a distribution platform.”
\w,[””]$
- Now we have a word character (A-Z, a-z, 0-9, and _) followed by a comma. After this, we’re looking for closing quotes. But what about the $? That means “this has to be at the end of a line”. It’s the companion to ^.
- This will show us any time that we used a comma as the final punctuation inside quotation marks and then started a new line.
- This will find
  “This is a terrible thing,”
  
  but it won’t find
  
  “This is a terrible thing,” said Charlie, “What are we to do?”
  
  because the ,” isn’t at the end of a line.
\s\d\s
- This will find any digits that are sitting around by themselves. If you, like me, would prefer to write most of your numbers as words but keep forgetting to do it, this pattern is for you!
- \d means any digit from 0-9. The spaces are around it to stop the pattern matching digits that are part of bigger numbers. (\d by itself would match the 1, the 9, the 3 and the 2 in 1932 but as individual matches.)
- This will find patterns such “He said that there were 3 beasts”.
- Want to find longer numbers? \s\d{2}\s will find all numbers that are two digits long and are surrounded by spaces.
- Want to find all numbers between 1 and 4 digits long? \s\d{1,4}\s is the pattern and will find numbers that are 1, 2, 3, or 4 digits long.
(\w)(\w)(\w)\3\2\1
- This one is just for fun. It finds palindrome patterns that are six characters long. In my text, “sniffing” and “suffused” are matches.
- Adding a space character in the middle give us (\w)(\w)(\w) \3\2\1 and this will match “sword dropped” and “now wonder”.
- Really rather useless unless you have a habit of writing palindromic text and wish to stop.

I ran several other Scrivener checks today, once again using the amazingly handy Regular Expression (RegEx) facility to find patterns on things.

Punctuation is easy to mess up, especially where spaces are involved and I find some of the following patterns very handy. Remember that to do this in Scrivener, you will need to go to the global search the project field, set the Operator to RegEx and then you’ll be in the right mode. Here are some handy patterns, with explanation.

^\s+\w
- This pattern will find any sentence that has spaces before you start using words. The ^ means start at the beginning of the line. The \s+ means you are looking for 1 or more spaces (\s means space, + means ‘at least one or more’). \w represents a word character, from a-z, A-Z, 0-9, including the _ (underscore) character. This will catch lines such as (ignoring the quotation marks, here used for clarity) ” Bosco said…” and ” Bosco said…”, which can be very hard to pick up from visual inspection.
\w,\w
- This will pick up word characters that are tightly packed around a comma. If you are prone to writing “very,very high” then this will find that for you. (Sometimes this is a legitimate pattern! 100,000 is valid but will be detected by this.)
\w\s+,\w
- This is very similar to the above but now it picks up when you have the space in the wrong place. This will find “very ,very high”.
\s+(\w+)\s+(\1)\s
- This one is more complicated! I’ve mentioned it before but this is an optimised version. Remember that \s means “a space character” although formally it means any whitespace character including tab, space, carriage return or new line. Adding + means that \s+ will match ” ” and ” ” and ” newline tabcharacter “. \w+ means that we are now looking at words of at least one character in length.
- But what does ( and ) mean? This is a grouping operator and, because we’ve used it, anything that matches this now has a numerical reference. We call this a capturing group because we’ve ‘captured’ that pattern and numbered it! We can refer to this capturing group (our first) with the shorthand \1.
- Now we can explain the whole pattern. Starting with any number of spaces (but at least one), we look for characters that make up words, stopping when we find a space. Remember this captured group of characters is labelled as \1. By using \1 again, we are saying that we want to find all the situations where we have the same pattern of characters twice in a row, separated by spaces. Once we match that first group of characters, the regex system can then build a pattern where that group is repeated. The magic of regex!
- This pattern will find ” of of “, ” it IT”, ” and and ” and also things like ” R R ” if you’re spelling something out.
- If you simplify it to (\w+)\s+(\1), you’ll find all of those patterns plus things like “he heard” (where the he he is actually located), “pithy thyme” (thy thy) or “puny NYC” (ny ny pattern will be found here unless you use the search option ‘case sensitive’).
[ ]{2}
- There’s a single space between [ and ]. The {2} means that you are looking for exactly two of these in a row. This will find every time that you typed ” ” instead of ” “. (Some of you like to double space. Many of us do not.)

That’s it for today. Back to editing for me!

Nick Falkner, Personal Projects

Thinking, dreaming, writing and printing

Tag: regex

More handy #Scrivener #indiepub #regex

More automated checks. Fun with #scrivener and regex