Tokenising a string with regular expressions
It is possible to use a regular expression to split or
tokenise a string, with functionality similar to, but more flexible
than, that of StringTokenizer. To split a string, we can write
a line such as the following:
String[] words = str.split("\\s+");
The regular expression that we pass to the String's split()
method defines the pattern that we want to appear between tokens.
In case you haven't come across it, the sequence \s denotes any
whitespace character, including spaces, tabs and line breaks. (It is written
with a double backslash inside a string literal because the backslash is itself
the string escape character: "\\s" in Java source produces the two-character
regular expression \s.) For more information, see the page on
named character classes.
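As a quick illustration, the following sketch (using a made-up input string) shows how the "\\s+" pattern treats a run of spaces, tabs and line breaks as a single separator:

```java
public class WhitespaceSplitDemo {
    public static void main(String[] args) {
        // Hypothetical input mixing a tab, multiple spaces and a line break
        String str = "the\tquick  brown\nfox";
        String[] words = str.split("\\s+");
        for (int i = 0; i < words.length; i++) {
            System.out.println(words[i]);
        }
        // Prints each word on its own line: the, quick, brown, fox
    }
}
```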
Because essentially any regular expression can be used to define the split
points, the String.split() method is more flexible than a regular
StringTokenizer. For example, the following specifies that tokens must
be separated by between two and four spaces:
String[] words = str.split(" {2,4}");
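To see the effect, here is a short sketch with a hypothetical input in which tokens are separated by varying numbers of spaces. Note that a single space is not treated as a separator by this pattern:

```java
public class SpaceRangeSplitDemo {
    public static void main(String[] args) {
        // Two spaces after "first token", four after "second token";
        // the single spaces inside each token are not split points
        String str = "first token  second token    third";
        String[] words = str.split(" {2,4}");
        for (int i = 0; i < words.length; i++) {
            System.out.println(words[i]);
        }
        // Prints: "first token", "second token", "third"
    }
}
```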
Having the tokens directly in an array makes the syntax much less fussy
for looping through the tokens. Using the Java 5 foreach syntax,
we can write:
for (String word : str.split("\\s+")) {
...
}
Compare this to the clumsier syntax we'd have to use with a StringTokenizer.
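For comparison, a sketch of the equivalent loop with StringTokenizer, which requires explicit hasMoreTokens()/nextToken() calls (its default delimiter set already covers whitespace):

```java
import java.util.StringTokenizer;

public class TokenizerLoopDemo {
    public static void main(String[] args) {
        String str = "the quick brown fox";
        // No array to iterate over: we must drive the tokenizer by hand
        StringTokenizer tok = new StringTokenizer(str);
        while (tok.hasMoreTokens()) {
            String word = tok.nextToken();
            System.out.println(word);
        }
    }
}
```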
Performance
The String.split() method is more flexible than a StringTokenizer.
But of course, this flexibility comes at a price: using String.split() is
typically around twice as slow. The next page discusses the
performance of string splitting
in more detail.
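As a very rough illustration (this is a naive sketch, not a rigorous benchmark: real measurements would need a proper harness with warm-up and repeated runs), a simple timing comparison might look like this:

```java
import java.util.StringTokenizer;

public class SplitTimingSketch {
    public static void main(String[] args) {
        String str = "the quick brown fox jumps over the lazy dog";
        int iterations = 100_000;
        int sink = 0; // consume results so the loops aren't trivially dead code

        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            sink += str.split("\\s+").length;
        }
        long splitNanos = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            StringTokenizer tok = new StringTokenizer(str);
            while (tok.hasMoreTokens()) {
                tok.nextToken();
                sink++;
            }
        }
        long tokenizerNanos = System.nanoTime() - t1;

        System.out.println("split():         " + splitNanos / 1_000_000 + " ms");
        System.out.println("StringTokenizer: " + tokenizerNanos / 1_000_000 + " ms");
        System.out.println("(tokens counted: " + sink + ")");
    }
}
```

The exact ratio will vary with the JVM, the input and the pattern, which is why the next page's more detailed discussion is worth reading before drawing conclusions.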
Editorial page content written by Neil Coffey. Copyright © Javamex UK 2021. All rights reserved.