Match regex

Purpose: Find, or find and replace patterns in strings using regex (regular expressions).

 match-regex <pattern> in <target> \
     [ \
         ( replace-with <replace pattern> \
             result <result> \
             [ status <status> ] )  \
     | \
         ( status <status> \
             [ case-insensitive [ <case-insensitive> ] ] \
             [ single-match [ <single-match> ] ] \
             [ utf8 [ <utf8> ] ] ) \
     ] \
     [ cache ] \
     [ clear-cache <clear cache> )

match-regex searches <target> string for regex <pattern>. If "replace-with" is specified, then instance(s) of <pattern> in <target> are replaced with <replace pattern> string, and the result is stored in <result> string.

The number of found or found/replaced patterns can be obtained in number <status> variable (in "status" clause).

If "replace-with" is not specified, then the number of matched <pattern>s within <target> is given in <status> number, which in this case must be specified.

If "case-insensitive" is used without boolean expression <case-insensitive>, or if <case-insensitive> evaluates to true, then searching for pattern is case insensitive. By default, it is case sensitive.

If "single-match" is specified without boolean expression <single-match>, or if <single-match> evaluates to true, then only the very first occurrence of <pattern> in <target> is processed. Otherwise, all occurrences are processed.

If "utf8" is used, then the pattern itself and all data strings used for matching are treated as UTF-8 strings.

<result> and <status> variables can be created within the statement.

If the pattern is bad (i.e. <pattern> is not a correct regular expression), Gliimly will error out with a message.

PCRE2 and Posix Regex

By default, PCRE2 regex syntax (Perl-compatible Regular Expressions v2) is used. To use extended regex syntax (Posix ERE), specify "--posix-regex" when building your application with gg. See more below in Limitations about differences.

Caching and performance

If "cache" clause is used, then regex compilation of <pattern> will be done only once and saved for future use. There is a significant performance benefit when match-regex executes repeatedly with "cache" (such as in case of web applications or in any kind of loop). If <pattern> changes and you need to recompile it once in a while, use "clear-cache" clause. <clear cache> is a "bool" variable; the regex cache is cleared if it is true, and stays if it is false. For example:

 set-bool cl_c
 if-true q equal 0
     set-bool cl_c = true
 end-if
 match-regex ps in look_in replace-with "Yes it is \\1!" result res cache clear-cache cl_c

In this case, when "q" is 0, cache will be cleared, and the pattern in variable "ps" will be recompiled. In all other cases, the last computed regex stays the same.

While every pattern is different, when using cache, even a relatively small pattern was seen in tests to speed up the match-regex by about 500%, or 5x faster. Use cache whenever possible as it brings parsing performance close to its theoretical limits.

Subexpressions and back-referencing

Subexpressions are referenced via a backslash followed by a number. Because in strings a backslash followed by a number is an octal number, you must use double backslash (\\). For example:

 match-regex "(good).*(day)" \
     in "Hello, good day!" \
     replace-with "\\2 \\1" \
     result res

will produce string "Hello, day good!" as a result in "res" variable. Each subexpression is within () parenthesis, so for instance "day" in the above pattern is the 2nd subexpression, and is back-referenced as \\2 in replacement.

There can be a maximum of 23 subexpressions.

Note that backreferences to non-existing subexpressions are ignored - for example \\4 when there are only 3 subexpressions. Gliimly is "smart" about using two digits and differentiating between \\1 and \\10 for instance - it takes into account the actual number of subexpressions and their validity, and selects a proper subexpression even when two digits are used in a backreference.

Lookaheads and lookbehinds

match-regex supports syntax for lookaheads (i.e. "(?=...)" and "(?!...)") and lookbehinds (i.e. "(?<=...)" and "(?<!...)"). See PCRE2 pattern matching for more details. For instance, the following matches "bar" only if preceded by "foo":

 match-regex "\\w*(?<=foo)bar" in "foobar" status st single-match

and the following matches "foo" if followed by "bar":

 match-regex "\\w*foo(?=bar)" in "foofoo" status st single-match

Limitations

If you are using older versions of PCRE2 (10.36 or earlier) such as by default in Debian 10 or Ubuntu 18, then instead of PCRE2, an Extended Regex Expressions (ERE) from the built-in Linux regex library is used, due to older PCRE2 versions having name conflicts with other libraries. In this case, "utf8" clause will have no effect, and lookaheads/lookbehinds functionality will not work; possibly a few others as well, however for the most part the two are compatible. If your system uses an older PCRE2 library, you can upgrade to 10.37 or later to use PCRE2.

In any case, if you need to use Posix ERE instead of PCRE2 (for compatibility, to reduce memory footprint or some other reason), you can use "--posix-regex" option of gg; the same limitations as above apply. Note that PCRE2 (the default) is generally faster than ERE.

Examples

Use match-regex statement to find out if a string matches a pattern, for example:

 match-regex "SOME (.*) IS (.*)" in "WOW SOME THING IS GOOD HERE" status st

In this case, the first parameter ("SOME (.*) IS (.*)") is a pattern and you're matching it with string ("WOW SOME THING IS GOOD HERE"). Since there is a match, status variable (defined on the fly as integer) "st" will be "1" (meaning one match was found) - in general it will contain the number of patterns matched.

Search for patterns and replace them by using replace-with clause, for example:

 match-regex "SOME (.*) IS ([^ ]+)" in "WOW SOME THING IS GOOD HERE FOR SURE" replace-with "THINGS ARE \\2 YES!" result res status st

In this case, the result from replacement will be in a new string variable "res" specified with the result clause, and it will be

 WOW THINGS ARE GOOD YES! HERE FOR SURE

The above demonstrates a typical use of subexpressions in regex (meaning "()" statements) and their referencing with "\\1", "\\2" etc. in the order in which they appear. Consult regex documentation for more information. Status variable specified with status clause ("st" in this example) will contain the number of patterns matched and replaced.

Matching is by default case sensitive. Use "case-insensitive" clause to change it to case insensitive, for instance:

 match-regex "SOME (.*) IS (.*)" in "WOW some THING IS GOOD HERE" status st case-insensitive

In the above case, the pattern would not be found without "case-insensitive" clause because "SOME" and "some" would not match. This clause works the same in matching-only as well as replacing strings.

If you want to match only the first occurrence of a pattern, use "single-match" option:

 match-regex "SOME ([^ ]+) IS ([^ ]+)" in "WOW SOME THING IS GOOD HERE AND SOME STUFF IS GOOD TOO" status st single-match

In this case there would be two matches by default ("SOME THING IS GOOD" and "SOME STUFF IS GOOD") but only the first would be found. This clause works the same for replacing as well - only the first occurrence would be replaced.