Introduction to Regular Expression in Go

Pattern matching through regular expressions is a common feature in popular programming languages like Java, Go, and JavaScript. Despite their widespread use, regular expressions (or regex) are infamously difficult to master. Still, it is hard to deny how elegantly they work, given the complex machinery operating behind the scenes. This Golang programming tutorial explores regular expressions and the concepts behind them, using Go as the implementing language.

Overview of Regular Expressions

In computing, we often need to find a particular pattern of characters inside another string. Regular expressions, together with the grammar that defines them, are the standard technique for describing such a pattern and searching for it in a given string. If the pattern is found in the target string, the search is successful; otherwise it is unsuccessful. The matched text can then be extracted, modified, replaced, or deleted according to the needs of the programmer or software application.

What is a Regular Expression?

A regular expression, at a glance, seems a very cryptic way to describe a set of strings using different symbols and character patterns. For example, the regular expression b[au]mble matches both bamble and bumble. Meanwhile, [^0-9] matches any single character except a digit. Consider the following regular expression example:

[^a-zA-Z0-9]

The regex above matches only non-alphanumeric characters because the hat (^) metacharacter, placed first inside the brackets, negates the ranges a-z (lowercase letters), A-Z (uppercase letters), and 0-9 (digits 0 to 9). Once these are excluded, what remains is symbols such as $, %, and &, along with whitespace and punctuation.
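The two character classes described above can be tried out with a short Go sketch. The helper name matchesWhole is hypothetical and introduced only for illustration; it anchors the pattern with ^ and $ so that only whole-string matches count.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesWhole is a hypothetical helper: it reports whether the entire
// input matches the pattern. The pattern is wrapped in ^(?:...)$ so that
// a match on only part of the input does not count.
func matchesWhole(pattern, input string) bool {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	return re.MatchString(input)
}

func main() {
	// b[au]mble accepts an 'a' or a 'u' in the second position only.
	for _, word := range []string{"bamble", "bumble", "bimble"} {
		fmt.Printf("%s: %v\n", word, matchesWhole(`b[au]mble`, word))
	}

	// A negated class matches any single character that is not listed.
	fmt.Println(matchesWhole(`[^a-zA-Z0-9]`, "$")) // symbol: accepted
	fmt.Println(matchesWhole(`[^a-zA-Z0-9]`, "x")) // letter: rejected
}
```

Without the anchors, MatchString would report a match as soon as the pattern occurs anywhere in the input.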

How to Process a Regular Expression

At a high level, a regular expression is nothing but a way to describe a string pattern that can be used to match and find text. Internally, however, the pattern written with regular expression symbols needs to be processed. The processing is done by a regular expression engine that works behind the scenes. Engine implementations differ, sometimes slightly and sometimes significantly, and vary in complexity. They typically fall into two classes: those that use a finite state machine and those that use backtracking.

In a finite state machine-based regex engine, the regular expression is turned into a finite state automaton made up of states and transitions between them, and the characters of the input string are fed into it one at a time. The automaton is always in some state, and each character read moves it from one state to another. The finite state machine can be of two types: a deterministic finite automaton (DFA) or a non-deterministic finite automaton (NFA). The difference is that in an NFA, a state may have more than one transition for the same input.
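To make the idea of states and transitions concrete, here is a minimal hand-written DFA for the earlier pattern b[au]mble. This is only an illustrative sketch: the function name acceptsBamble and the state numbering are assumptions made for this example, and real engines build their automata from the pattern rather than by hand.

```go
package main

import "fmt"

// acceptsBamble simulates a hand-built DFA for the pattern b[au]mble.
// States 0 through 6 record how many characters of the pattern have been
// matched so far. State 6 is the only accepting state.
func acceptsBamble(input string) bool {
	state := 0
	for _, r := range input {
		switch {
		case state == 0 && r == 'b':
			state = 1
		case state == 1 && (r == 'a' || r == 'u'):
			state = 2
		case state == 2 && r == 'm':
			state = 3
		case state == 3 && r == 'b':
			state = 4
		case state == 4 && r == 'l':
			state = 5
		case state == 5 && r == 'e':
			state = 6
		default:
			// No transition exists for this state and character: reject.
			return false
		}
	}
	return state == 6
}

func main() {
	for _, word := range []string{"bamble", "bumble", "bimble", "bambles"} {
		fmt.Printf("%s: %v\n", word, acceptsBamble(word))
	}
}
```

Because each state has at most one transition per character, the machine reads every input character exactly once and never has to retry.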

In a backtracking-based regex engine, each token in the regex is matched against the next character in the given string. If the whole pattern matches, the search succeeds; if a token fails to match, the engine backtracks to an earlier position and tries an alternative path. This is a common way to implement regex because it readily supports features such as backreferences and lazy quantifiers.
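The following is a toy backtracking matcher in the spirit of classic textbook examples, written here only to illustrate the idea. It supports literals, the wildcard ., the quantifier *, and the anchors ^ and $; the function names match, matchHere, and matchStar are choices made for this sketch. Go's own regexp package does not work this way.

```go
package main

import "fmt"

// match reports whether pattern matches anywhere in text.
func match(pattern, text string) bool {
	if len(pattern) > 0 && pattern[0] == '^' {
		return matchHere(pattern[1:], text)
	}
	// Try every starting position, including the empty suffix.
	for i := 0; i <= len(text); i++ {
		if matchHere(pattern, text[i:]) {
			return true
		}
	}
	return false
}

// matchHere reports whether pattern matches at the beginning of text.
func matchHere(pattern, text string) bool {
	if pattern == "" {
		return true
	}
	if len(pattern) >= 2 && pattern[1] == '*' {
		return matchStar(pattern[0], pattern[2:], text)
	}
	if pattern == "$" {
		return text == ""
	}
	if text != "" && (pattern[0] == '.' || pattern[0] == text[0]) {
		return matchHere(pattern[1:], text[1:])
	}
	return false
}

// matchStar handles c*: it first tries the rest of the pattern with zero
// repetitions of c, and each time that fails it consumes one more c and
// tries again. Returning from a failed attempt is the backtracking step.
func matchStar(c byte, pattern, text string) bool {
	for {
		if matchHere(pattern, text) {
			return true
		}
		if text == "" || (c != '.' && text[0] != c) {
			return false
		}
		text = text[1:]
	}
}

func main() {
	fmt.Println(match("ab*c", "xxabbbcxx"))
	fmt.Println(match("^a.c$", "abc"))
	fmt.Println(match("^a.c$", "abcd"))
}
```

The sketch works on bytes, so it is limited to ASCII input; it is meant to show the retry-on-failure structure rather than to be a usable engine.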

Behind all regular expressions, there is a set of production rules called a grammar, which describes how valid expressions are formed. It lies at the heart of the regular expression.

All of these together constitute the engine that processes a regular expression, so it is easy to see that processing a regular expression carries overhead. Trying to solve every problem with regular expressions is a bad idea, even where they are capable of doing the job. Sometimes, just picking the right tool for the job solves half of the problem.

A Quick Go Regex Cheat Sheet

Below is a list of commonly used regular expression constructs and their meanings. The list is not comprehensive:

  • ab: a followed by b
  • a|b: a or b
  • a*: Zero or more a’s
  • a?: Zero or one a’s
  • a{2}: Exactly two a’s (two or more is written a{2,})
  • [ab], [^ab]: Either a or b; any character except a or b (a leading ^ inside the brackets means not, i.e. not a or b)
  • [a-z]: Any character a to z
  • [0-9]: Any number 0 to 9.
  • \d: Any digit. Similarly, a non-digit is \D or [^0-9]
  • \s: A whitespace character: [\t\n\f\r ]. Similarly, \S is a non-whitespace character: [^\t\n\f\r ]
  • \w: A word character: [0-9A-Za-z_]. Similarly, \W means a non-word: [^0-9A-Za-z_]
  • [\t\n\f\r\v]: Means a tab=\011, newline=\012, form feed=\014, carriage return=\015, vertical tab=\013 respectively.
  • \123: Octal character code (up to three digits)
  • \x9E: Hex character code (exactly two digits)
  • \A or ^: Beginning of the text
  • $ or \z: End of the text
  • (?i): Case-insensitive flag; for example, (?i)abc also matches ABC

Note: To match a special character literally, escape it with a backslash. For example, to match a $, prefix it with a backslash: \$.
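When the text to be matched literally is not known in advance, escaping by hand is error-prone. The sketch below uses regexp.QuoteMeta from Go's standard library, which escapes all metacharacters in a string; the helper name countLiteral and the sample text are assumptions made for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// countLiteral is a hypothetical helper: it counts non-overlapping
// occurrences of a literal string by escaping it with regexp.QuoteMeta
// before compiling, so characters such as $ and . lose their special meaning.
func countLiteral(literal, text string) int {
	re := regexp.MustCompile(regexp.QuoteMeta(literal))
	return len(re.FindAllString(text, -1))
}

func main() {
	text := "Total: $5.00 + $3.50 (approx.)"

	// QuoteMeta inserts the backslashes for us.
	fmt.Println(regexp.QuoteMeta("$5.00")) // prints \$5\.00

	fmt.Println("dollar signs:", countLiteral("$", text))
	fmt.Println("dots:", countLiteral(".", text))
}
```

Unescaped, the pattern . would match every character in the text and $ would match only the end position, so both counts would be wrong.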

Regular Expression and Regex in Go

The package responsible for implementing regular expressions in Go is regexp. Its syntax is the same general syntax used by Perl, Python, and other popular languages; more precisely, it is the syntax accepted by RE2. RE2 syntax is largely a subset of PCRE with various caveats: for example, it does not support backreferences or lookaround assertions.

The Go package regexp contains several methods that match a regular expression and identify the matched text. For example, to quickly test a regular expression in Go, we can use the following code example:
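One practical consequence of the RE2 syntax is that some constructs familiar from PCRE are rejected at compile time. The short sketch below checks a few patterns with regexp.Compile; the helper name compiles is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles is a hypothetical helper: it reports whether Go's regexp
// package accepts the given pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	patterns := []string{
		`(\w+) (\w+)`, // capture groups are supported
		`(\w+) \1`,    // backreference: not part of RE2 syntax
		`foo(?=bar)`,  // lookahead: not part of RE2 syntax
	}
	for _, p := range patterns {
		fmt.Printf("%-12s compiles: %v\n", p, compiles(p))
	}
}
```

Use regexp.Compile rather than regexp.MustCompile when a pattern comes from user input, since MustCompile panics on an invalid pattern.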

package main

import (
	"fmt"
	"regexp"
)

func main() {
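	// MatchString compiles the pattern and reports whether it matches
	// anywhere in the input string.
	match, err := regexp.MatchString(`[^0-9]`, "sdfds")
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println(match)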
}

To extract all non-alphanumeric characters in a given string in Golang, we can write the code as follows (note that this class also matches the spaces between words):

package main

import (
	"fmt"
	"regexp"
)

func main() {

	text := "This is *$ // sample ## %% text"

	reg := regexp.MustCompile(`[^a-zA-Z0-9]`)
	fmt.Println(reg.MatchString(text))

	strs := reg.FindAllString(text, -1)

	for _, e := range strs {
		fmt.Println(e)
	}
}

To extract a date in a given format, say mm/dd/yyyy, we can edit the above Go code as follows:

text := "This is 11/02/1999 sample 10/14/2022 text"
reg := regexp.MustCompile(`\d{2}/\d{2}/\d{4}`)
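Put together as a complete program, the date extraction might look like the sketch below. It goes one step further and wraps each part of the date in a capture group so that month, day, and year can be read separately with FindAllStringSubmatch. The names dateRe and extractDates are assumptions made for this example, and the pattern checks only the shape of the text, not whether the date is valid.

```go
package main

import (
	"fmt"
	"regexp"
)

// dateRe captures month, day, and year from text shaped like mm/dd/yyyy.
// The \b word boundaries keep it from matching inside longer digit runs.
var dateRe = regexp.MustCompile(`\b(\d{2})/(\d{2})/(\d{4})\b`)

// extractDates returns every mm/dd/yyyy-shaped substring found in text.
func extractDates(text string) []string {
	return dateRe.FindAllString(text, -1)
}

func main() {
	text := "This is 11/02/1999 sample 10/14/2022 text"

	fmt.Println(extractDates(text))

	for _, m := range dateRe.FindAllStringSubmatch(text, -1) {
		// m[0] is the whole match; m[1], m[2], m[3] are the capture groups.
		fmt.Printf("date=%s month=%s day=%s year=%s\n", m[0], m[1], m[2], m[3])
	}
}
```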

To count the lowercase vowels in a given text, we may write a Go program as follows:

package main

import (
	"fmt"
	"regexp"
)

func main() {

	text := "This is 11/02/1999 sample 10/14/2022 text"
	reg := regexp.MustCompile(`[aeiou]`)
	fmt.Println(reg.MatchString(text))
	strs := reg.FindAllString(text, -1)
	fmt.Println("Number of vowels: ", len(strs))
}
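The class [aeiou] matches lowercase vowels only. A small variation, sketched below, adds the (?i) flag so that uppercase vowels are counted as well; the names vowelRe and countVowels are introduced just for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// vowelRe matches a single vowel; the (?i) flag makes it case-insensitive.
var vowelRe = regexp.MustCompile(`(?i)[aeiou]`)

// countVowels returns the number of vowels in text, ignoring case.
func countVowels(text string) int {
	return len(vowelRe.FindAllString(text, -1))
}

func main() {
	fmt.Println("Number of vowels:", countVowels("An Apple"))
}
```

Compiling the pattern once at package level avoids recompiling it on every call.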

To replace all text that matches the regular expression, we can use Go’s ReplaceAllString method as follows:

package main

import (
	"fmt"
	"regexp"
)

func main() {

	text := "This [sic] is sample [sic] 10/14/2022 text"
	reg := regexp.MustCompile(`\[sic\]`)
	strs := reg.ReplaceAllString(text, "[ref]")
	fmt.Println(strs)
}

Here, we have replaced `[sic]` with `[ref]`.
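ReplaceAllString can also refer to capture groups in the replacement text, where ${1}, ${2}, and so on expand to the text matched by the corresponding group. As a sketch, the program below rewrites mm/dd/yyyy dates as yyyy-mm-dd; the names dateRe and toISO are assumptions made for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// dateRe captures month, day, and year from text shaped like mm/dd/yyyy.
var dateRe = regexp.MustCompile(`(\d{2})/(\d{2})/(\d{4})`)

// toISO rewrites every mm/dd/yyyy-shaped date in text as yyyy-mm-dd.
// ${3}, ${1}, and ${2} expand to the year, month, and day groups.
func toISO(text string) string {
	return dateRe.ReplaceAllString(text, "${3}-${1}-${2}")
}

func main() {
	text := "This [sic] is sample [sic] 10/14/2022 text"
	fmt.Println(toISO(text))
}
```

The braces in ${1} are optional in some positions, but writing them avoids ambiguity when the reference is followed by letters, digits, or underscores.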

Final Thoughts on Go and Golang Regular Expressions

Pattern matching plays an important role in searching a string for a set of characters described by a regular expression and its grammar. A matched pattern allows us to extract the desired data from the string and manipulate it in the way we like. Understanding and using regular expressions is key to processing text. In practice, programmers keep a set of commonly used regular expressions handy for matching email addresses, phone numbers, and so on, and reuse them as and when required.