Matching against multi-line strings

Up until now, we haven't mentioned the issue of multi-line strings. It is perfectly valid and useful to match against a string containing several lines, separated by one of the usual line separators. However, there are a couple of complications:

The multiple lines may or may not have a logical meaning. For example, in HTML or XML, line breaks in the file are actually uninteresting in most cases. On the other hand, in some other types of file, line breaks are significant, and we may want to perform an operation such as finding 'one number per line' or 'the word at the start of the line';
Line breaks can be signalled by different control characters. On UNIX systems, it is typical to signal a line break with a single ASCII character 10 (newline). Windows, on the other hand, still maintains the tradition of combining carriage return (character 13) with the newline character. This is why, if you've had the misfortune to use an older version of Windows Notepad to open a README file from some UNIX software, you may have found everything displayed on one long line. (At least Wordpad manages to figure it out...)

Regular expressions provide some flags and tokens to allow all combinations of the above.

If line breaks are significant...

Set the Pattern.MULTILINE flag when constructing the Pattern. In the expression, you can then use ^ and $ to mean the beginning and end of a line respectively. (That includes the beginning and end of the entire input.) For example, the following pattern will find numbers at the beginning of lines:

Pattern p = Pattern.compile("^([0-9]+).*", Pattern.MULTILINE);
Matcher m = p.matcher(...);
while (m.find()) {
  String number = m.group(1);
  ...
}

Note that by default, the dot matches all characters except line break characters. So the .* in the above example will only match to the end of a line.
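To make this concrete, here is a minimal runnable sketch built around the same pattern. The class name LeadingNumbers, the helper method and the sample text are made up for illustration; only the pattern and the flag come from the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingNumbers {
    // Same pattern as above: digits at the start of a line, then the rest of that line.
    private static final Pattern LEADING_NUMBER =
            Pattern.compile("^([0-9]+).*", Pattern.MULTILINE);

    // Hypothetical helper: collects the number that starts each line, if any.
    static List<String> leadingNumbers(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher m = LEADING_NUMBER.matcher(text);
        while (m.find()) {
            numbers.add(m.group(1));
        }
        return numbers;
    }

    public static void main(String[] args) {
        String text = "10 apples\nno number here\n25 pears and 3 plums\n";
        // The 3 is not at the start of a line, so it is not reported.
        System.out.println(leadingNumbers(text)); // prints [10, 25]
    }
}
```

Because the dot stops at the line break, each match covers one line at most, and the next call to find() picks up on the following line.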

Note that \s matches any whitespace, including line breaks.

If line breaks are not significant...

If line breaks are not significant, then in many types of expression you will need to use the Pattern.DOTALL flag to denote that the dot can match literally any character, including line break characters. As mentioned, \s will match line breaks, so it may be the more useful choice in some situations.
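As a rough illustration of the difference, the sketch below tries the same expression with and without DOTALL on a fragment whose content spans two lines, and then uses \s instead. The class name, helper method and sample string are assumptions made for the example.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // Hypothetical helper: true if the expression matches anywhere in the text.
    static boolean found(String regex, int flags, String text) {
        return Pattern.compile(regex, flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String html = "<p>first\nsecond</p>";

        // Default mode: the dot stops at the line break, so no match.
        System.out.println(found("<p>.*</p>", 0, html));              // prints false

        // DOTALL: the dot also matches the line break.
        System.out.println(found("<p>.*</p>", Pattern.DOTALL, html)); // prints true

        // \s matches the line break without needing any flag.
        System.out.println(found("first\\s+second", 0, html));        // prints true
    }
}
```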

Without the MULTILINE flag, ^ and $ match only at the beginning and end of the entire input, not at each line break. (One exception: $ will also match just before a line terminator that ends the input.)
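The following sketch counts matches of an anchored expression against a three-line string, with and without MULTILINE. The class name, the countMatches helper and the sample text are illustrative assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AnchorsDemo {
    // Hypothetical helper: counts how many times the expression is found.
    static int countMatches(String regex, int flags, String text) {
        Matcher m = Pattern.compile(regex, flags).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "one\ntwo\nthree";

        // No flag: ^ and $ refer to the whole input, which is not a single word.
        System.out.println(countMatches("^\\w+$", 0, text));                 // prints 0

        // MULTILINE: ^ and $ match at every line, so each word is found.
        System.out.println(countMatches("^\\w+$", Pattern.MULTILINE, text)); // prints 3

        // No flag: only the word at the very start of the input matches.
        System.out.println(countMatches("^\\w+", 0, text));                  // prints 1
    }
}
```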

If the file has Windows-type line breaks (or if you're not sure...)

If the file or character sequence you're matching against denotes a line break by a carriage return followed by a newline, as is usual on Windows, then there's no special option to set. Provided the MULTILINE flag is enabled, ^ and $ will by default treat a carriage return plus newline sequence as a single line break. If you enable DOTALL, the dot will match both carriage return and newline (and all other characters); otherwise, it will exclude carriage return and newline, plus the other line separators.

Leaving the default setting is actually the most flexible line break mode. The following will all be recognised as line breaks: carriage return and newline (either individually or as a sequence), character 133 (U+0085, the 'next line' control character), plus Unicode characters U+2028 and U+2029 (which are rarely used in practice).
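As a quick check of that list, the sketch below runs a line-matching expression over a string that mixes all of these separators. The class name, helper method and sample string are invented for the example; the only flag used is MULTILINE.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DefaultLineBreaks {
    // MULTILINE only: every line terminator listed above ends a line.
    private static final Pattern LINE = Pattern.compile("^.+$", Pattern.MULTILINE);

    // Hypothetical helper: returns the non-empty lines found in the text.
    static List<String> nonEmptyLines(String text) {
        List<String> lines = new ArrayList<>();
        Matcher m = LINE.matcher(text);
        while (m.find()) {
            lines.add(m.group());
        }
        return lines;
    }

    public static void main(String[] args) {
        // CR+LF, lone CR, lone LF, character 133, U+2028 and U+2029 in turn.
        String mixed = "a\r\nb\rc\nd\u0085e\u2028f\u2029g";
        System.out.println(nonEmptyLines(mixed)); // prints [a, b, c, d, e, f, g]
    }
}
```

Note that the carriage return plus newline pair after the first line produces no empty line between a and b, since the pair counts as one break.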

If the file has UNIX-type line breaks

If you are sure that the input has UNIX-style line breaks (in other words, line breaks are denoted by a single newline, ASCII character 10), then you should enable the Pattern.UNIX_LINES flag. This flag makes ^ and $ recognise only character 10 as a line break; and, unless DOTALL is also set, the dot will match all characters except character 10.
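One practical consequence, sketched below under made-up names, is what happens if UNIX_LINES is applied to text that actually contains Windows-style breaks: the carriage return is no longer treated as part of the line break and ends up inside the matched line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnixLinesDemo {
    // Hypothetical helper: non-empty lines, using MULTILINE plus any extra flags.
    static List<String> nonEmptyLines(String text, int extraFlags) {
        Pattern p = Pattern.compile("^.+$", Pattern.MULTILINE | extraFlags);
        List<String> lines = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            lines.add(m.group());
        }
        return lines;
    }

    public static void main(String[] args) {
        String windowsText = "a\r\nb\n";

        // Default mode: CR+LF is one line break, so the first line is "a".
        System.out.println(nonEmptyLines(windowsText, 0).get(0).length());                  // prints 1

        // UNIX_LINES: only LF ends a line, so the CR stays attached to "a".
        System.out.println(nonEmptyLines(windowsText, Pattern.UNIX_LINES).get(0).length()); // prints 2
    }
}
```

So the flag is a reasonable choice when the input is known to use a single newline, and the default mode is the safer one when the origin of the text is uncertain.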

Note: if the string data you are dealing with is a normal page of HTML that came directly from a web server, there's a good chance it will use UNIX-style line breaks.
