Introduction to regular expressions in Java
Regular expressions are a special "language" for denoting patterns
that we want to match to strings. When used appropriately, they are a very powerful
feature of Java and other programming languages. Using the Java regular expression API,
we can perform tasks such as the following in just one or two lines of code:
- string matching and validation: determining that a given string matches
a particular pattern, e.g. that it is a correctly formatted telephone number
or e-mail address;
- matching with variable criteria such as case-sensitive vs
case-insensitive matching;
- substring matching or searching for instances of
a particular substring within a larger string;
- data extraction: finding and automatically extracting substrings
with a particular format from a given string (e.g. the country code, dialling prefix and
local telephone number from a fully specified phone number);
- find and replace operations, including
cases where the replacement string varies depending on the string that was found.
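As a rough sketch of what the tasks listed above can look like in practice, the snippet below uses only String.matches(), String.replaceAll() and the java.util.regex classes. The class name, method names, patterns and sample formats are assumptions made for illustration and do not come from the rest of this tutorial; real phone number or date formats would need more careful patterns.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class: one small method per task from the list above.
final class RegexTaskSketch {

    // Validation: the whole string must consist of exactly five ASCII digits.
    static boolean isFiveDigitCode(String s) {
        return s.matches("[0-9]{5}");
    }

    // Case-insensitive substring search for the text "java".
    static boolean mentionsJava(String s) {
        return Pattern.compile("java", Pattern.CASE_INSENSITIVE).matcher(s).find();
    }

    // Data extraction: capture the country code from a number assumed to be
    // written like "+44 20 7946 0000" (plus sign, 1-3 digits, then a space).
    static String countryCode(String phone) {
        Matcher m = Pattern.compile("\\+([0-9]{1,3}) ").matcher(phone);
        return m.find() ? m.group(1) : null;
    }

    // Find and replace where the replacement depends on what was found:
    // $1, $2 and $3 refer to the three captured groups (year, month, day).
    static String toDayFirst(String text) {
        return text.replaceAll("([0-9]{4})-([0-9]{2})-([0-9]{2})", "$3/$2/$1");
    }

    public static void main(String[] args) {
        System.out.println(isFiveDigitCode("12345"));
        System.out.println(mentionsJava("I like JAVA"));
        System.out.println(countryCode("+44 20 7946 0000"));
        System.out.println(toDayFirst("Due 2021-03-15"));
    }
}
```

Each method body is a single statement or two, which is the kind of succinctness referred to here. Compiling a Pattern inside the method on every call keeps the sketch short; code that runs a pattern many times would typically store the compiled Pattern in a constant.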
As well as avoiding sprawling lines of code, using the regular expression API to perform these tasks can also bring
us efficiency optimisations that we might not have considered
were we to write the equivalent code from scratch. As we shall see, this succinctness does sometimes come
at the price of readability. Many programmers therefore have a love-hate relationship with regular expressions!
To illustrate the advantages and disadvantages of using regular expressions, let's start with a simple example.
Example pattern matching: hand-coded vs regular expressions
Suppose you want to answer the question: does a given string contain a series of 10 digits?
You could hand-code this: cycle through the characters in the string until you hit a digit.
Then when you find a digit, cycle through checking that the next nine characters are digits.
So in Java, the code would look something like this:
public boolean hasTenDigits(String s) {
    // Length of the current unbroken run of digits
    int noDigitsInARow = 0;
    for (int len = s.length(), i = 0; i < len; i++) {
        char c = s.charAt(i);
        if (Character.isDigit(c)) {
            if (++noDigitsInARow == 10) {
                return true;
            }
        } else {
            // A non-digit breaks the run, so start counting again
            noDigitsInARow = 0;
        }
    }
    return false;
}
The strengths and weaknesses of this code are obvious:
- it's verbose: we use quite a few lines of code for a conceptually simple task;
- on the other hand, it is easily understandable.
We can perform the same task using a regular expression. The result would look something like this:
public boolean hasTenDigits(String s) {
    return s.matches(".*[0-9]{10}.*");
}
You'll probably agree that we've essentially reversed the strengths and weaknesses (verbosity vs understandability)
of the hand-coded version.
We now have a nice succinct piece of code, but it relies on the programmer understanding
a piece of syntax that is slightly obscure to the uninitiated.
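For readers new to the syntax, the pattern can be read piece by piece, as in the comments of the sketch below. The sketch also includes a variant based on Pattern and Matcher.find(), which is an alternative introduced here for comparison rather than something used elsewhere on this page; the class and method names are hypothetical. Two differences from the hand-coded loop are worth knowing about: by default "." does not match line terminators, so the matches() version returns false for a string containing a line break, and [0-9] accepts only ASCII digits whereas Character.isDigit() also accepts digits from other scripts.

```java
import java.util.regex.Pattern;

// Hypothetical class comparing two ways of asking "are there ten digits in a row?"
final class TenDigitsSketch {

    // ".*"         any run of characters (excluding line terminators), possibly empty
    // "[0-9]{10}"  exactly ten consecutive characters in the range 0 to 9
    // ".*"         any run of characters again, possibly empty
    // matches() requires the pattern to cover the entire string.
    static boolean viaMatches(String s) {
        return s.matches(".*[0-9]{10}.*");
    }

    // Compiled once and reused; find() looks for the pattern anywhere in the
    // string, so no leading or trailing ".*" is needed.
    private static final Pattern TEN_DIGITS = Pattern.compile("[0-9]{10}");

    static boolean viaFind(String s) {
        return TEN_DIGITS.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(viaMatches("call 0123456789 now"));
        System.out.println(viaFind("call 0123456789 now"));
        // The two versions disagree once the input contains a line break.
        System.out.println(viaMatches("abc\n0123456789"));
        System.out.println(viaFind("abc\n0123456789"));
    }
}
```

Which variant is appropriate depends on the input: for single-line strings of ASCII text the two should agree with each other and with the hand-coded loop.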
But despite the initial obscurity, the advantages of regular expressions include:
- succinctness in programs that use a lot of string matching features: once we understand
the regular expression syntax, regular
expressions allow us to "see the wood for the trees": we no longer have to pick through
long pieces of sprawling code when performing simple pattern matching (and indeed, quite complex pattern matching) on strings;
- regular expressions potentially offer various optimisations that may be time-consuming to implement
from scratch.
On the next page, we'll get started with basic
regular expressions using String.matches().
In case you already know something about regular expressions and want to
skip ahead, here are some of the later topics currently covered by this tutorial:
Regular expression examples
Finally, we'll look at a couple of examples of using regular expressions:
- guessing an IP's country code from the referrer string using regular expressions;
- scraping HTML: how to pull out data from the HTML or XML data at a particular URL, a task often called "HTML scraping" or "screen scraping".
Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.