linux regular expressions

Linux Pattern matching in Commands:

There are many Linux commands available, such as ls, rm, etc. We pass file names as arguments to many of these unix cmds, but sometimes we also use wildcard patterns with them to match more than one file. Before we talk about cmds, let's talk about pattern matching, as it forms the basis of how these cmds select files.

Pattern matching:

 


 

glob:  In early Unix, this expansion of wildcard characters in simple unix cmds was done by a separate program called glob, present in /etc/glob, and its output was then passed as args to the unix cmd. In later versions of Unix (and in Linux), glob() is provided as a library function, which can be used by any program (including the shell). The most common wildcards in glob are *, ? and [ ], with ! used for negation inside [ ]. These are called metacharacters, as we are not using them as characters to match. They have special meaning, as described below. Everything else is treated as a literal character.

  • * => matches 0 or more characters. ex: Law* matches Law, Lawyer, but not ByLaw. *Law* will match ByLaw. This happens because glob tries to match the entire name and not a substring (different from RE). So, Law* matches only a string starting with Law.
  • ? => matches exactly 1 character. ex: ?at matches cat, but not at
  • [abc] => matches one char from those in the bracket. The char can be anything, including *, ?, etc, with the exception of - and ], which are explained below. ex: [CB]at matches Cat but not cat. [aT[]r matches ar, Tr, [r.
  • [a-z] => matches one char from the range in the bracket. Typical ranges are a-z, A-Z, 0-9. Note that - is not treated as a literal character here, but as the special range char. To match "-" as a literal, it should be the first (or last) char in the list (i.e [-a-c] will match -, a, b, c). Similarly, matching an opening bracket [ is fine anywhere, but a closing bracket is matched only when it's the first char (i.e []a-c] will match ], a, b, c). ex: num[ab-g0-7XY] matches num0, numb, numX, but not num00 or numx
  • [!abc] => matches one char that is not in bracket. ex: [!bc]at matches rat, Bat, but not cat or bat
  • [!0-7] => matches one char that is not from range in bracket. ex: num[!a-f] matches numx, but not numa or numxx
  • \ => backslash is used to escape the special meaning of the metacharacters above. For ex, if we want ? to be treated as a literal instead of having its special meaning, we precede it with \ (i.e \? will treat ? as a literal). In that sense \ is also a metacharacter, used for escaping other metacharacters. One thing to note is that *, ?, [ ], ! and \ itself are the only special characters in glob that need to be escaped with "\" to be treated as literals; everything else is already a literal.
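The bullet examples above can be checked with a small sketch. It assumes Python's fnmatch module as a stand-in for shell globbing; the table GLOB_EXAMPLES and the helper check_glob_examples are names made up for this illustration. One known difference: fnmatch does not use \ for escaping, so a literal ? is written as [?] there.

```python
from fnmatch import fnmatchcase

# (pattern, name, expected) triples taken from the bullet examples above.
GLOB_EXAMPLES = [
    ("Law*", "Law", True),
    ("Law*", "Lawyer", True),
    ("Law*", "ByLaw", False),            # glob must match the whole name
    ("*Law*", "ByLaw", True),
    ("?at", "cat", True),
    ("?at", "at", False),                # ? needs exactly one character
    ("[CB]at", "Cat", True),
    ("[CB]at", "cat", False),            # matching is case sensitive
    ("num[ab-g0-7XY]", "numb", True),
    ("num[ab-g0-7XY]", "num00", False),  # the bracket matches a single char only
    ("[!bc]at", "rat", True),
    ("[!bc]at", "bat", False),
    ("what[?]", "what?", True),          # fnmatch's way of matching a literal ?
]


def check_glob_examples():
    """Return the triples whose fnmatchcase() result differs from the expected value."""
    return [(p, n, e) for p, n, e in GLOB_EXAMPLES if fnmatchcase(n, p) != e]


# An empty list means every example behaves as described.
print(check_glob_examples())
```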

 globbing on filenames is supported by all unix shells such as bash, csh, etc (both on cmd line and in scripts). PHP, Perl and Python all have a glob() function. Also, wildcards here are used only for file name matching (not text matching as in RE, explained later), and the meaning of * and ? is different from that in RE ([ ] is nearly the same in both, except that negation uses ! in glob and ^ in RE).

There are many variations of the glob cmd. The glob cmd used in tcl has multiple switches starting with -, where -- indicates end of options. The glob cmd in csh is slightly different again. In Linux, glob() is a plain library function with no switches. Instead there are symbolic constants (such as GLOB_ONLYDIR, etc), which modify the behaviour of glob (similar to options in the tcl glob cmd). One of the most common of these (GLOB_BRACE) adds support for curly braces {} (similar to csh style), to match complete strings. Which of these options are enabled depends on your particular Linux distro.

  • {string1,string2,...} => matches strings mentioned inside curly braces. {} can be nested too. strings themselves can be patterns as {*abc*,myname*,cd}*.c

ex: Linux: glob [a-c]*.so => finds all files starting with a,b,c and ending with .so

ex: Linux: glob {bti,chip}* => finds all files starting with bti or chip in their name. This is supported by default on CentOS.

ex: Tcl: glob -types {d f r w} * => find all types of file/dir which match types list. d=dir, f=plain file, r=rd permission, w=wrt permission.
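As a rough illustration of the first two examples, here is a sketch using Python's glob module on a scratch directory. The helper demo_glob and the file names are invented for the demo. Python's glob has no brace support, so the {bti,chip}* case is emulated by running two patterns and merging the results.

```python
import glob
import os
import tempfile


def demo_glob(names, pattern):
    """Create empty files called `names` in a scratch dir and return the sorted names matching `pattern`."""
    with tempfile.TemporaryDirectory() as tmp:
        for name in names:
            open(os.path.join(tmp, name), "w").close()
        # glob.escape() protects the directory part in case it contains wildcard chars.
        matches = glob.glob(os.path.join(glob.escape(tmp), pattern))
        return sorted(os.path.basename(m) for m in matches)


# Hypothetical file names, chosen only for this demo.
files = ["alpha.so", "beta.so", "core.so", "delta.so", "alpha.txt", "bti_top.v", "chip_io.v"]

# Equivalent of: glob [a-c]*.so
so_files = demo_glob(files, "[a-c]*.so")

# Emulation of: glob {bti,chip}*  (one pattern per brace alternative)
brace_files = sorted(demo_glob(files, "bti*") + demo_glob(files, "chip*"))

print(so_files)
print(brace_files)
```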

 


 

Regular Expression: One problem with glob is that it matches only simple patterns. It does not allow matching repetitions of a preceding char or string. This works fine for simple file name matching. For more complex pattern matching in text, "Regular Expression" (RE or regex) is used; unix tools such as ed and grep have supported RE since the 1970s. A RE can describe any regular language over a given finite alphabet. This is a concept heavily used in compilers, where programs need to be parsed: RE are used to split the program text into tokens. RE supports more metacharacters than glob, so it is able to match far more complex patterns (though not every conceivable pattern, e.g. arbitrarily nested balanced brackets are not a regular language). Tcl supports both globbing and RE.

A very good link on RE is: http://www.grymoire.com/Unix/Regular.html.

Another good link to play with any regex and see how it behaves is this link: https://regex101.com/

NOTE:

1. Even though RE shares many of the same metacharacters as glob, RE is very different from glob. Shells such as bash, csh use glob, and NOT RE, for file name expansion. The file name pattern is expanded by glob first, and the resulting list of names is then handed to cmds such as ls for processing. Similarly, find uses simple glob-style file pattern matching for file names.

2. A RE match can be either the longest (greedy) or the shortest (lazy) possible match. However, POSIX standards mandate that the longest match, starting at the leftmost possible position, be returned. So, A.*B matches AAB as well as AABCAB (even though AAB has already been matched in the 1st 3 letters of this word, the match will return the whole 6 letter word).

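3. Forward slash, /, which is used extensively in linux as the dir path separator, is not a metacharacter in glob or RE. This makes it very convenient, as a lot of searches are for paths, and luckily we don't need to escape these /. (The exception is tools such as sed, which use / as the delimiter around the pattern in cmds like s/old/new/; there a / inside the pattern must be written as \/, or a different delimiter chosen.)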
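To make notes 1 and 3 concrete, the sketch below feeds the same pattern text to a glob matcher and to a regex matcher. It assumes Python's fnmatch and re modules as stand-ins for shell globbing and for RE; Python's re is Perl-style rather than POSIX, but the two constructs used here behave the same way.

```python
import re
from fnmatch import fnmatchcase

words = ["Law", "Lawyer", "La", "ByLaw"]
pattern = "Law*"

# As a glob, * means "any run of characters" and the whole name must match.
glob_hits = [w for w in words if fnmatchcase(w, pattern)]

# As a regex, * means "zero or more of the preceding w" and a substring match is enough.
regex_hits = [w for w in words if re.search(pattern, w)]

# A path can be used as a regex as-is, because / is not a metacharacter.
path_match = re.search("/home/Joe", "cd /home/Joe/docs")

print(glob_hits)
print(regex_hits)
print(path_match.group(0))
```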

In 1980's (before the advent of Linux), there was no standard for RE. People started writing complex pattern matching in their programs, which was different for each utility such as vi, sed, etc. So, a company named "Sun Microsystems" went through every utility and forced each one to use one of two distinct regular expression libraries - regular or extended. So, we have "regular regular expression" and "extended regular expression". They are also known as regular/basic RE and advanced/extended RE. These are as per the IEEE POSIX standard. Both RE serve as a standard, which has been adopted by many tools. Perl has its own RE, which has no basic or extended variants. Perl RE have become a de-facto standard since they have a rich and powerful set of atomic expressions.

There are 3 parts to RE:

  1.  Anchors are used to specify the position of the pattern in relation to a line of text.
  2. Character Sets match one or more characters in a single position.
  3. Modifiers specify how many times the previous character set is repeated.

ex: ^ab.* => Here ^ is an anchor, "ab" are character sets and .* is a modifier.

These are the 2 types of RE:

1. Basic RE (BRE): smaller set. It added . ^ and $ as metacharacters (on top of *, [ ] and \ known from glob), but didn't add ? as in glob. Negation inside [ ] uses ^ instead of the ! of glob. ( ) { } < > were regarded as metacharacters only when preceded by \.

  • . => matches any single char except newline (exactly which character is considered newline is encoding and platform specific, but the LineFeed (LF) char is always considered newline). . inside square brackets is treated as a literal. ex: [a.c] matches any of a or . or c, but a.c matches abc, adc, etc. To match a newline, use \n in the pattern where the tool supports it, i.e .*\n.* will match 2 consecutive lines (see details on * in next bullet). Note that line-oriented tools such as grep look at one line at a time by default, so there the pattern never sees the newline.
  • * => matches 0 or more of the preceding char. Thus it's different from glob, where * itself matches 0 or more of any char. ex: Law* will match La (0 of w; the w is not required, as it only tells * what to repeat), Law, Laww, Lawww, but not Liw. It will also match Layer (the La part matches, and a RE is satisfied by a substring), Lawyer and ByLaw, as RE matches substrings too. We very commonly use .* to match anything (. says match any char except newline, and * following it says match 0 or more of this, basically implying match 0 or more of any char). i.e a.*b will match ab, artsb, acb, but not an "a" at the end of a line (a followed by newline).
  • [abc0-2z6-8] => same as glob.
  • Anchors: ^ and $ are used as beginning or end anchors. The use of "^" and "$" as indicators of the beginning or end of a line is a convention other utilities use. The vi editor uses these two characters as commands to go to the beginning or end of a line. The C shell uses "!^" to specify the first argument of the previous line, and "!$" is the last argument on the previous line. 
    • ^ => beginning of line anchor. Matches the starting position of any line. ex: ^Love will match any line starting with Love. ^ is an anchor only if it's the 1st char in a RE, otherwise it behaves as a literal.
    • $ => end of line anchor. Matches the ending position of any line. ex: Love$ will match any line ending with Love. $ is an anchor only if it's the last char in a RE, otherwise it behaves as a literal.
  • [^abc] or [^0-5] => here caret is used as a negation metacharacter when it's the 1st char inside square brackets (instead of ! in glob). Thus ^ has 2 meanings. Functionality is the same as in glob. ex: [^ ] => matches anything that's not a space (there's a space after the caret in this example). If "-" is the 1st or last char in [ ], then the hyphen is treated as a literal for matching purposes. ex: [^-0-9] will match anything except a hyphen or a digit. Similarly, if ] is the 1st char after the opening bracket, then ] is treated as a literal. ex: []0-9] will match ] or a digit.
  • \ => backslash is a special metacharacter that turns any metacharacter above into a literal for matching purposes. This is called "escaping a metacharacter". For ex, if we try to match "done[" (done followed by a square bracket), RE will see [ as a metacharacter and complain of an invalid RE if it doesn't find a closing ]. In order to signal that [ is to be used as a literal, we put a backslash before it. ex: done\[ will now match done[. If we want to match done\, then we need to escape the \ itself by writing done\\
  • () => defines a marked subexpression. Any string that matches the pattern in these brackets can be recalled later using \1, \2, ..., \9 (where \1 means the 1st matched subexpression and so on). BRE mode requires () be escaped as \( \), or else () will be treated as literals. ex: to match 5 letter palindromes (that read the same from front or back, eg: radar), do: \([a-z]\)\([a-z]\)[a-z]\2\1
  • {m,n} => matches the preceding char at least m times, but not more than n times. ex: a{3,5} matches aaa, aaaa, aaaaa, but not aa. a{1,} matches 1 or more of "a". BRE mode requires {} be escaped as \{ \}, or else {} will be treated as literals.
  • <the> => matches words only if they are on a word boundary (ideally word boundary means word having spaces on both beginning and end of word. However, here we have some exceptions as explained further) The character before the "t" must be either a new line character, or any character other than a number, letter, or underscore. The character after the "e" must also be a character other than a number, letter, or underscore or it could be the end of line character. This makes it easy to match words without worrying about spaces, punctuation marks, etc. Ex: <[tT]he> will match The, .the, "is the way", but not "they". BRE mode requires < > be escaped using \< \>, or else <> will be treated as literals.

NOTE: the reason that () { } <> were treated as literals, is because they weren't assigned special meaning in early days. They were added later as metacharacters in RE. So, to not break existing programs, only way was to use \ with ( ) { } < > when used as metacharacter.
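The BRE examples above can be tried out with the sketch below. It assumes Python's re module, which is Perl-style rather than BRE: ( ) and { } are written without the backslash, and \b stands in for \< and \>. The helper found and the table BRE_STYLE_CASES are names invented for this illustration.

```python
import re


def found(pattern, text):
    """Return the matched substring, or None when the pattern is not found in text."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# (pattern, text, expected match) rows following the bullets above.
BRE_STYLE_CASES = [
    (r"a.c", "xxabcxx", "abc"),          # . matches any single char
    (r"[a.c]", ".", "."),                # . is a literal inside [ ]
    (r"Law*", "Layer", "La"),            # zero w's is fine, substring match
    (r"Law*", "Liw", None),
    (r"a.*b", "artsb", "artsb"),
    (r"^Love", "Love me", "Love"),       # anchored at start of line
    (r"^Love", "I Love", None),
    (r"Love$", "I Love", "Love"),        # anchored at end of line
    (r"[^-0-9]", "42-x", "x"),           # not a hyphen and not a digit
    (r"done\[", "done[1]", "done["),     # escaped [ is a literal
    (r"([a-z])([a-z])[a-z]\2\1", "radar", "radar"),  # 5 letter palindrome
    (r"a{3,5}", "aaaaaaa", "aaaaa"),     # at most 5 are consumed
    (r"a{3,5}", "aa", None),
    (r"\b[tT]he\b", "is the way", "the"),
    (r"\b[tT]he\b", "they", None),       # no word boundary after "the"
]

failures = [(p, t) for p, t, want in BRE_STYLE_CASES if found(p, t) != want]
print(failures)
```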

2. Extended RE (ERE): It added ?, + and | metacharacters, and removed the need for escaping () {} (i.e it treats () {} as genuine metacharacters, so now you have to escape them to use them as literals. That's the exact opposite of BRE, which is confusing). BRE couldn't be fixed this way because backward compatibility was important there, but ERE was a newly defined RE, so no backward compatibility issue was present. <> was removed from ERE. ERE wasn't really needed, as whatever can be matched using ERE can be matched using BRE, with one exception (the "|" operator in ERE has no equivalent matching operator in BRE).

  • ? => matches 0 or 1 of preceding char. Thus it's different than glob. However, it's same as \{0,1\} of RE. ex: a.?b will match ab, acb, but not adcb (as .? will match 0 or 1 of any char except newline)
  • + => matches 1 or more of preceding char. It's same as \{1,\} of RE. ex: a.+b will match acb, acdb but not ab (as .+ will match 1 or more of any char except newline)
  • | => choice operator, matches the expression before or after the operator. ex: (cat|dog) will match cat or dog. This choice or alternation operator is the most useful addition to ERE, as without it, it's difficult to match a choice of different words. | is usually put inside ( ). Now, we can also have *, ?, + etc following () to look for repetition of what's matched. Eg: (Tom|Rob)+ will match TomRob or TomTom. The lack of <> matching can be made up for by using |. Ex: <the> is equiv to (^|[^_a-zA-Z0-9])the([^_a-zA-Z0-9]|$) => basically this says that "the" must not have an alphanumeric char or underscore right before or after it, though it may sit at the start or end of a line. So, there was really no need for ERE, as we could have added the "|" operator to BRE. To not break backward compatibility, we could write it as \| in BRE, and then BRE would have worked just the same as ERE. Unfortunately, that's not what happened, though emacs used this technique to get away from ERE altogether.
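The three ERE additions can be checked with a short sketch, again assuming Python's re module (whose syntax for ?, +, | and ( ) agrees with ERE for these cases). WORD_THE is a name made up here for the <the> emulation given in the last bullet.

```python
import re

# ? => 0 or 1 of the preceding char
optional = [s for s in ["ab", "acb", "adcb"] if re.fullmatch(r"a.?b", s)]

# + => 1 or more of the preceding char
one_or_more = [s for s in ["ab", "acb", "acdb"] if re.fullmatch(r"a.+b", s)]

# | inside ( ), followed by + to repeat the whole group
names = [s for s in ["TomRob", "TomTom", "Tim"] if re.fullmatch(r"(Tom|Rob)+", s)]

# Emulating <the> with alternation, as described in the bullet above.
WORD_THE = r"(^|[^_a-zA-Z0-9])the([^_a-zA-Z0-9]|$)"
word_hits = [s for s in ["the", ".the", "is the way", "they", "bathe"]
             if re.search(WORD_THE, s)]

print(optional, one_or_more, names, word_hits)
```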

NOTE: use of *, ?, + in BRE/ERE changes the meaning of the char preceding it, as that char is not used in its normal form for matching, but instead tells *, ?, + what to repeat. It behaves as if the previous character is glued to these *, ?, +. Ex: a.b would not match ab, as . implies a single char has to be in between a and b, but when we do a.*b, then it matches ab, as . loses its meaning of matching exactly one char. Instead . is glued to *, and combined together .* means match 0 or more of any char. Similarly .? means match 0 or 1 of any char, and .+ matches 1 or more of any char.

Using *? is tricky => it's a lazy match, trying to match as little as possible that satisfies the match criteria (by default, any match is greedy, as per POSIX std). This lazy form comes from Perl style RE and is not part of POSIX BRE/ERE, so not every tool supports it. So .*? will do a lazy match of .*, i.e the least possible match of 0 or more of any char. Ex: a.*b will match the complete abdbcb (greedy match), but a.*?b will match the first 2 letters (ab) only.
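A quick way to see the difference is the sketch below, assuming Python's re module, which supports the Perl-style lazy quantifier. The helper first_match is a name invented for this illustration.

```python
import re


def first_match(pattern, text):
    """Return the text of the first match of pattern in text, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


greedy = first_match(r"a.*b", "abdbcb")    # runs to the last b
lazy = first_match(r"a.*?b", "abdbcb")     # stops at the first b

# Same idea with the A.*B example from the notes above.
greedy_caps = first_match(r"A.*B", "AABCAB")
lazy_caps = first_match(r"A.*?B", "AABCAB")

print(greedy, lazy, greedy_caps, lazy_caps)
```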

IMPORTANT: Forward slash / is NOT a regex metacharacter. If you look at the BRE and ERE metacharacters above, none of them is / (the only slash with special meaning is backslash \). So, when matching patterns containing a linux path (i.e /home/Joe), you don't have to escape anything. Just paste the path in directly. So easy !!

 


 

Other class: There are also character classes, which provide shorthand notation for matching digits, letters, spaces, etc. Just as we used \1 to refer to the 1st matching substring, we can use \d, \w, \s etc, which are heavily used in Perl. However their definition and usage is not consistent across all tools. POSIX std defines [: ... :] for such char classes, but \d, \s, \w are also widely supported across many cmds and tools. These are the differences b/w POSIX [: :] and Perl \d etc.

  1. POSIX char classes can only be used within [], so we need to use [[:alpha:]0-9] to match alphabetic + numeric char. [:xxxxx:] is a substitute for the character set only, i.e [:digit:] is a substitute for 0-9, so [[:digit:]] is a replacement for [0-9].
  2. Perl style \d, \w do the matching too, with no enclosing [ ] needed. i.e \w is equiv to [_a-zA-Z0-9]. It matches any alphanumeric char or underscore.
  • [:alnum:]  => matches any alphanumeric char. [:alnum:] equiv to a-zA-Z0-9. [:alpha:] matches only letters (a-z,A-Z), not digits.
  • [:word:] or \w => alphanumeric + underscore ([:word:] is an extension found in some tools, not part of POSIX). [:word:] equiv to _a-zA-Z0-9. \w is equiv to [_a-zA-Z0-9]. \W is negation of \w i.e \W is "not matching \w", equiv to [^_a-zA-Z0-9]
  • [:digit:] or \d => digits. [:digit:] equiv to 0-9. \d is equiv to [0-9]. \D is negation of \d
  • [:space:] or \s => whitespace char. [:space:] is equiv to space, \t\r\n\v\f, while \s is equiv to [ \t\r\n\v\f]. \S is negation of \s
  • [:blank:] => space and tab only. \b is something different: it's a word boundary, i.e a zero-width position between a \w char and a \W char (or the start/end of the text). This is very common when searching for separate words. ex: \b[a-zA-Z]+\b will match every word containing letters only. \b roughly marks the position inside ^\w, \w$, \W\w or \w\W. \B is negation of \b (i.e non word boundary)
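The Perl-style shorthands can be tried with the sketch below, assuming Python's re module. Python's re does not support the POSIX [[:digit:]] form, so only the \d, \w, \s and \b forms are shown. The re.ASCII flag is used so that \d and \w mean exactly [0-9] and [_a-zA-Z0-9], as in the list above; without it Python also accepts Unicode digits and letters.

```python
import re

# \d => digits
digits = re.findall(r"\d+", "room 42, floor 7", flags=re.ASCII)

# \w => alphanumeric + underscore
identifiers = re.findall(r"\w+", "my_var = x1", flags=re.ASCII)

# \s => any whitespace char (space, tab, newline, ...)
fields = re.split(r"\s+", "a  b\tc\nd")

# \b => word boundary; picks out words made of letters only
letter_words = re.findall(r"\b[a-zA-Z]+\b", "the cat, 42 dogs")

print(digits, identifiers, fields, letter_words)
```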

Regex website:

Below site allows you to verify your regex. It gives you any error in your regex, and allows you to type pattern to match. It's very helpful to check for the correctness of your regex.

https://regex101.com/

ex: In regex, type me.*\n.*, and in test string, type

1st line: me coming

2nd line I go

3rd he is

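Now on right side, it shows any errors in regex, and then shows the matching part. In this ex, the match runs from "me" on the 1st line to the end of the 2nd line, since the \n in the pattern lets it cross exactly one line break.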
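The same experiment can be reproduced offline with a sketch like the one below. It assumes Python's re module and assumes the three test lines are separated by single newlines.

```python
import re

text = "1st line: me coming\n2nd line I go\n3rd he is"

# . does not match \n, so .* stops at the end of a line;
# the explicit \n then lets the match continue onto the next line only.
m = re.search(r"me.*\n.*", text)
matched = m.group(0) if m else None

print(repr(matched))
```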

 


 

UNIX cmds:

Different Linux cmds/apps use different kinds of pattern matching. glob/BRE/ERE/char_class are supported either by default or by adding options to the cmds. Most Linux utilities use BRE by default.

  • vi, one of the earliest editors, uses BRE as expected. Other common linux utilities also use BRE.
  • grep uses BRE by default. egrep (or grep -E) uses ERE. You can use BRE/ERE for the pattern, while the file names given to grep must still be in glob style.
  • sed uses BRE by default. "sed -r" (or "sed -E") uses ERE
  • awk uses ERE.
  • less supports ERE. However depending on the version of less installed (type less --version to check your version), it may use GNU regex or some other regex library. We have to type forward slash "/" once we are in the less screen to start a search. Then use backslash "\" as the escape char. So, /.* will match every line (since \n is not matched by .), /\d+ will match digits (if the regex library in use supports \d), etc. To match "ma bc", we can type either "ma bc" or "ma\sbc". Both match.
  • ls supports glob. ex: ls {mint*a,chip}* => this lists all file names starting with mint and having "a" somewhere after that, plus all file names starting with chip. ls doesn't use RE, as the ls cmd takes file names, not a text pattern.
  • emacs uses its own version of RE. See in emacs section.
  • find cmd has 2 main args. One is the path to search and the other is the file name to look for. The file name is always a glob, and the path can also be a glob. See the "Linux cmds" section for more details
  • Perl uses its own version of RE. See in perl section.
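To illustrate the find bullet above (a path to search plus a glob for the file name), here is a rough sketch of that behaviour. It assumes Python's os.walk and fnmatch modules and is not how find itself is implemented; find_by_name, build_demo_tree and the file names are invented for this demo.

```python
import fnmatch
import os
import tempfile


def find_by_name(root, name_glob):
    """Walk `root` and return paths (relative to root) whose file name matches `name_glob`."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # The glob is applied to the file name only, never to the directory part.
        for fname in fnmatch.filter(filenames, name_glob):
            hits.append(os.path.relpath(os.path.join(dirpath, fname), root))
    return sorted(hits)


def build_demo_tree(root):
    """Create a few empty files (hypothetical names) under root for the demo."""
    for rel in ["a.c", "src/b.c", "src/b.h", "src/lib/c.c"]:
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()


with tempfile.TemporaryDirectory() as tmp:
    build_demo_tree(tmp)
    c_files = find_by_name(tmp, "*.c")
    headers = find_by_name(tmp, "?.h")

print(c_files)
print(headers)
```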