If you manage a Google Analytics account, then understanding regular expressions – and how to set them up – is a key part of your job. This tutorial is intended to jump start novice users into the world of regular expressions – specifically from a Google Analytics point of view.

As you will see from reading my books, most regular expressions I use are pretty straightforward – so you shouldn’t be deterred from delving in and understanding them inside-out. However, I have found over the years that many users are scared off the subject. My reasoning is that the general study of regular expressions can rapidly become complex and overwhelming. The truth is that for the vast majority of GA work, you don’t need the full power or complexity that regular expressions (built for the IT industry) can provide. Hence, I created this jump start tutorial for you to focus on…

A Quick Introduction…

Regular expressions, also referred to as regex, are a way for computer languages to match strings of text, such as specific characters, words, or patterns of characters. A simple everyday analogy to regular expressions is using wildcards for matching filenames on your computer. For example, *.pdf matches all filenames that end in .pdf. However, regex can be much more powerful than this. Within Google Analytics, regular expressions are primarily used when creating profile filters (see Chapter 8 of the book), advanced segments (also Chapter 8), and table filters (Chapter 4).

Understanding the Fundamentals of Regex

An important point to grasp when using regular expressions is that there are two types of characters: literals and metacharacters. Most characters are treated as literals. That is, if you wanted to match a URI containing advanced, you would type the literal character "a", followed by "d", followed by "v", and so forth (without quotes).

The exceptions to this are metacharacters. These are characters of special meaning to the regex engine and therefore interpreted differently. For example, the PDF example shown in the Introduction contains the metacharacter "*" (without quotes). The most common metacharacters are listed in Table 1. Ensure that you understand these before proceeding.

Table 1 – Common regular expression metacharacters

Metacharacter    Description
.                Matches any single character.
[   ]            Matches a single character that is contained within the square brackets. Referred to as a class.
[^   ]           Matches a single character that is not contained within the square brackets. Referred to as a negative class.
^                Matches the beginning of the string. This is referred to as an anchor.
$                Matches the end of the string. This is referred to as an anchor.
*                Matches zero or more of the previous item.
?                Matches zero or one of the previous item.
+                Matches one or more of the previous item.
|                The OR operator. Matches either the expression before or the expression after the operator.
\                The escape character. Allows you to use one of the metacharacters for your match.
(   )            Groups characters into substrings.
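To see the metacharacters in Table 1 in action, here is a minimal sketch using Python's standard re module. Python is only a stand-in here: it is an assumption that its behavior mirrors Google Analytics for these basic patterns, and all of the sample strings are invented for illustration.

```python
import re

# (pattern, sample that should match, sample that should not match)
# All sample strings are invented for illustration.
examples = [
    (r"gr.y", "grey", "gry"),                      # .    any single character
    (r"gr[ae]y", "gray", "groy"),                  # [ ]  one character from the class
    (r"gr[^ae]y", "groy", "gray"),                 # [^ ] one character not in the class
    (r"^blog", "blog/post", "my-blog"),            # ^    start of the string
    (r"html$", "index.html", "html5"),             # $    end of the string
    (r"ab*c", "ac", "adc"),                        # *    zero or more of the previous item
    (r"ab?c", "abc", "abbc"),                      # ?    zero or one of the previous item
    (r"ab+c", "abbc", "ac"),                       # +    one or more of the previous item
    (r"cat|dog", "hotdog", "bird"),                # |    OR
    (r"index\.html", "index.html", "index-html"),  # \    escape a metacharacter
    (r"(ab)+c", "ababc", "c"),                     # ( )  group characters
]


def is_match(pattern, text):
    """Return True if the pattern is found anywhere in the text (partial match)."""
    return re.search(pattern, text) is not None


results = [(p, is_match(p, good), is_match(p, bad)) for p, good, bad in examples]
for pattern, matched_good, matched_bad in results:
    print(f"{pattern:15} good={matched_good} bad={matched_bad}")
```

Every row should report good=True and bad=False.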

NOTE: Google Analytics uses a partial implementation of the Perl Compatible Regular Expressions (PCRE) library. I use the word partial because a full implementation is more powerful and flexible than a software-as-a-service vendor would want it to be! For example, if its use is unrestricted, it can be used maliciously to hack or break a website. Therefore, not every feature of PCRE is included in Google Analytics…

The best way to learn Regex is by example…

Using only literals, you can construct simple regular expressions. First, partial matches are allowed. For example, say you wanted to view only referrals from the website www.google.com. Using a regular expression, you could use the partial keyword "goog" in the table filter of your Traffic sources > Sources > All Traffic report. This will match all entries that have the letters "goog" in them, as shown in Figure 1.

Figure 1 – Table filter using a partial literal match


NOTE: The breakdown of geographic Google domains shown in Figure 1 is achieved by using the Custom SEO plugin for GA.
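The partial match described above can be tried outside of Google Analytics. The following is a small sketch using Python's re module as a stand-in (an assumption, since GA has its own regex engine); the list of referrers is hypothetical.

```python
import re

# Hypothetical referrer names, invented for illustration.
sources = ["google.com", "google.co.uk", "images.google.com", "bing.com", "yahoo.com"]

# re.search succeeds if the pattern occurs anywhere in the string,
# which is the partial-match behavior described in the text.
goog_sources = [s for s in sources if re.search(r"goog", s)]
print(goog_sources)
```

Only the three entries containing the letters goog are kept.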

Being simple to use, literals can be very powerful—as long as you can identify a unique pattern match that includes the string of interest. Taking the previous example, to be more specific, use the OR metacharacter, as in this example:

google\.(com|co\.uk|ca)

This matches the literal google, followed by a period (this must be escaped because it is also a metacharacter), followed by com OR co.uk (period also escaped) OR ca. The result is shown in Figure 2.
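As a quick check of the OR example, here is a sketch in Python (used as a stand-in for the GA table filter, which is an assumption). The referrer list is made up for illustration.

```python
import re

# The pattern from the text: google, an escaped period, then com OR co.uk OR ca.
pattern = re.compile(r"google\.(com|co\.uk|ca)")

# Hypothetical referrers, invented for illustration.
sources = [
    "google.com",
    "google.co.uk",
    "google.ca",
    "google.de",
    "images.google.com",
    "google.com.au",
]

matched = [s for s in sources if pattern.search(s)]
print(matched)
```

Note that images.google.com and google.com.au still match, because the pattern is allowed to match anywhere inside the string. Only google.de is filtered out.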

Figure 2 – Table filter using the metacharacter OR


NOTE: Google Analytics automatically escapes periods in the report table filter and advanced segments for you. Therefore, you can omit the escape character (\) for these. However, when you are learning regex, I advise you to always escape these yourself as best practice. This is because profile filters, as well as goal or funnel configurations, do not have the automatic escape feature.
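To see why escaping matters wherever there is no automatic escaping, consider this small sketch. It uses Python's re module, which never escapes periods for you; that makes it a reasonable stand-in (by assumption) for profile filters and goal configurations. The candidate strings are invented.

```python
import re

unescaped = re.compile(r"google.com")   # the period matches ANY single character
escaped = re.compile(r"google\.com")    # the period matches only a literal period

# Hypothetical strings, invented for illustration.
candidates = ["google.com", "google-com", "googlexcom"]

unescaped_hits = [c for c in candidates if unescaped.search(c)]
escaped_hits = [c for c in candidates if escaped.search(c)]

print("unescaped:", unescaped_hits)
print("escaped:  ", escaped_hits)
```

The unescaped pattern accepts all three strings, while the escaped pattern accepts only google.com.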

You will notice from Figure 2 that subdomains of Google are present in the reports. Suppose you wish to remove these from your matches. Modify the regex query as follows:

^google\.(com|co\.uk|ca)

As a result, only referrers that start with google, followed by .com, .co.uk, or .ca, are matched. Another example to practice with is:

^go.+le\.((com$)|(co\.uk)$|(ca)$)

This extends the previous example to explicitly match only Google domains that end in .com, .co.uk, and .ca. This removes referrers such as google.com.au, google.com.br, and so forth, as shown in Figure 3. Note that I have also been a little lazy and used go.+le to illustrate how to use the + metacharacter. That is, it is used to match one or more of the previous character—in this case, any character.
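The effect of the two anchored patterns can be compared side by side. This is a sketch in Python, assumed to behave like GA for these patterns, with an invented list of referrers (gobble.com is a deliberately hypothetical entry).

```python
import re

starts_with_google = re.compile(r"^google\.(com|co\.uk|ca)")
exact_domains = re.compile(r"^go.+le\.((com$)|(co\.uk)$|(ca)$)")

# Hypothetical referrers, invented for illustration.
sources = [
    "google.com",
    "google.co.uk",
    "google.ca",
    "images.google.com",
    "google.com.au",
    "google.com.br",
    "gobble.com",
]

start_hits = [s for s in sources if starts_with_google.search(s)]
exact_hits = [s for s in sources if exact_domains.search(s)]

print("^google...:", start_hits)
print("^go.+le...:", exact_hits)
```

The first pattern removes the subdomain but keeps google.com.au and google.com.br. The second removes those too, but the lazy go.+le also lets through anything of the form go...le, such as the made-up gobble.com, so in practice the literal google is the safer choice.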

Figure 3 – Table filter using multiple metacharacters


The following are examples to consider when matching URLs listed in your Content / Top Content reports:

\?(id|pid)=[^&]*

This matches URIs whose first query parameter is named id or pid, together with that parameter's value. If you have a report with URIs of the following form, this regex will match the first two URIs:

/blog/post?pid=101

/blog/post?id=101&lang=en&cat=hacks

/blog/post?lang=en&cat=hacks&id=102

/blog/about-this-blog

Typically, this regex format is used when defining a goal or funnel step. Note the use of the negative class to stop the regex match. That is, this regex will match all characters after id= or pid= up to, but not including, the next &. The asterisk (*) matches zero or more of these non-& characters, so even if there is no second query parameter present, as per the first URI, the regex will still match: the match simply runs to the end of the URI.
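To check exactly which part of each URI is matched, here is a sketch in Python (a stand-in for GA, by assumption) run against the four URIs listed above.

```python
import re

pattern = re.compile(r"\?(id|pid)=[^&]*")

uris = [
    "/blog/post?pid=101",
    "/blog/post?id=101&lang=en&cat=hacks",
    "/blog/post?lang=en&cat=hacks&id=102",
    "/blog/about-this-blog",
]

# Map each URI to the text the regex matched, or None if there is no match.
matched_text = {}
for uri in uris:
    m = pattern.search(uri)
    matched_text[uri] = m.group(0) if m else None
    print(uri, "->", matched_text[uri])
```

The first URI yields ?pid=101 and the second yields ?id=101, stopping before the &. The third URI does not match because its id parameter follows an & rather than the ?, and the fourth has no query string at all.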

An example that is useful when filtering within Keyword reports (search engines and internal site search) is to consider misspellings. Perhaps you need to find all matches for “colour” and “color.” The following regex will achieve this:

colo[u]*r

Here are some other misspelling examples:

Voda(ph|f)one

Ste(ph|v)en

Br[ai][ai]n

(My name is sometimes spelled Brain!)
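These misspelling patterns are easy to verify with a short script. The following sketch uses Python's re module as a stand-in for the GA keyword filter (an assumption); the alternative colou?r is not from the examples above and is shown only for comparison.

```python
import re

# Each pattern from the text, with the spellings it is meant to catch.
checks = {
    r"colo[u]*r": ["color", "colour"],
    r"Voda(ph|f)one": ["Vodaphone", "Vodafone"],
    r"Ste(ph|v)en": ["Stephen", "Steven"],
    r"Br[ai][ai]n": ["Brian", "Brain"],
}

all_caught = all(
    re.search(pattern, word) is not None
    for pattern, words in checks.items()
    for word in words
)
print("all spellings caught:", all_caught)

# The * allows any number of u characters, so this also matches.
loose = re.search(r"colo[u]*r", "colouur") is not None
# Using ? (zero or one) is a tighter alternative, shown for comparison only.
tight = re.search(r"colou?r", "colouur") is not None
print("colo[u]*r matches colouur:", loose)
print("colou?r matches colouur:", tight)
```

If you want to allow at most one u, the ? metacharacter from Table 1 is the stricter choice.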

Finally, although not directly relevant to Google Analytics, a common regex used in web development for processing forms is:

^(.+)@([^\(\);:,<>_]+\.[a-zA-Z.]{2,6})

Use this to test your understanding. Broken into its constituent parts, this regex checks an email address to ascertain if it is a valid format—that is, brian@mysite.com and not brian@@my_site:com, for example. From left to right, the English interpretation is as follows:

  • Match one or more of any character before the @
  • Match one or more characters after the @, none of which may be any of the following characters: ( ) ; : , < > _
  • Followed by a period
  • Followed by between two and six characters, each of which must be an alphabetic character (A–Z as either upper- or lowercase) or a period

To help guide your eye, the middle section of this regex, that is, the part between the @ and the period, is [^\(\);:,<>_]+
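You can test your reading of this regex with the two addresses mentioned above. This is a sketch in Python, where the pattern happens to be valid as written; it is not a complete email validator.

```python
import re

email_pattern = re.compile(r"^(.+)@([^\(\);:,<>_]+\.[a-zA-Z.]{2,6})")

addresses = ["brian@mysite.com", "brian@@my_site:com"]

# Map each address to its captured groups, or None if the format is rejected.
parsed = {}
for address in addresses:
    m = email_pattern.search(address)
    parsed[address] = m.groups() if m else None
    print(address, "->", parsed[address])
```

The first address is accepted, with brian captured by the first group and mysite.com by the second. The second address is rejected because the text after the @ contains an underscore and no period.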

If you have followed these examples, you are well on your way to understanding regular expressions for use with Google Analytics. If not, reread this post and use one of the regex tools listed in Appendix B of the book. Further regex examples are shown throughout the book, though none are more complicated than those shown here.

Tips for Building Regular Expressions

  • Make the regular expression as simple as possible. Complex expressions take longer to process or match than simple expressions.
  • Avoid the use of .* if possible because this expression matches everything zero or more times and may slow processing of the expression. For instance, if you need to match all of the following: index.html, index.htm, index.php, index.aspx, index.py, index.cgi

use

index\.(h|p|a|c)+.+

not

index.*

  • Try to group patterns together when possible. For instance, if you wish to match a file suffix of .pdf, .doc, and .ppt

use

\.(pdf|doc|ppt)

not

\.pdf|\.doc|\.ppt

  • Be sure to escape the regular expression wildcards or metacharacters if you wish to match those literal characters. Common ones are periods in filenames and parentheses in text.
  • Use anchors whenever possible (^ and $, which match the beginning and end of the string, respectively), because these speed up processing.
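The tips on avoiding .* and on grouping patterns can be illustrated with a short sketch. It uses Python's re module as a stand-in for GA (an assumption) and only compares what each pattern matches; it does not measure processing speed. The filenames indexing-report and index.txt are invented for illustration.

```python
import re

specific = re.compile(r"index\.(h|p|a|c)+.+")
broad = re.compile(r"index.*")

wanted = ["index.html", "index.htm", "index.php", "index.aspx", "index.py", "index.cgi"]
unwanted = ["indexing-report", "index.txt"]  # hypothetical names we do not want

# Both patterns match every wanted filename...
specific_ok = all(specific.search(name) for name in wanted)
broad_ok = all(broad.search(name) for name in wanted)

# ...but only the broad pattern also lets the unwanted names through.
specific_extra = [name for name in unwanted if specific.search(name)]
broad_extra = [name for name in unwanted if broad.search(name)]
print(specific_ok, broad_ok, specific_extra, broad_extra)

# Grouped and ungrouped suffix patterns select the same files.
grouped = re.compile(r"\.(pdf|doc|ppt)")
ungrouped = re.compile(r"\.pdf|\.doc|\.ppt")
files = ["report.pdf", "notes.doc", "slides.ppt", "image.png"]
grouped_hits = [f for f in files if grouped.search(f)]
ungrouped_hits = [f for f in files if ungrouped.search(f)]
print(grouped_hits == ungrouped_hits, grouped_hits)
```

The grouped form is shorter and easier to extend with another suffix, while giving the same matches as the ungrouped form.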

Some useful regex tools to help you

I have used all of these, though I would love to hear about others:
