Up until now, we haven't mentioned the issue of multi-line strings. It is
perfectly valid and useful to match against a string containing several lines,
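separated by one of the usual line separators. However, there are a couple of
complications: line breaks may or may not be significant to the match, and
the input may use any of several line break conventions.
Regular expressions provide some flags and tokens to allow all combinations
of the above.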
- If line breaks are significant...
Set the Pattern.MULTILINE flag when constructing the Pattern.
In the expression, you can use ^ and $ to mean the
beginning and end of the line respectively. (That includes the beginning
and end of the entire input.)
For example, the following pattern will find
numbers at the beginning of lines:
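    // MULTILINE makes ^ match at the start of every line
    Pattern p = Pattern.compile("^([0-9]+).*", Pattern.MULTILINE);
    Matcher m = p.matcher(...);
    while (m.find()) {
        // group 1 holds the digits found at the start of the line
        String number = m.group(1);
        ...
    }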
Note that by default, the dot matches all characters except line break characters. So the .* in the above example will only match to the end of a line.
Note that \s matches any whitespace, including line breaks.
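As a rough, self-contained sketch of the example above: the class name LineStartNumbers, the method leadingNumbers and the sample text are made up for illustration and are not part of the original example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only Pattern and Matcher come from the JDK.
class LineStartNumbers {
    // With MULTILINE, ^ matches at the start of every line, not just at the start of the input.
    private static final Pattern LEADING_NUMBER =
            Pattern.compile("^([0-9]+).*", Pattern.MULTILINE);

    // Collects the number found at the start of each line that begins with digits.
    static List<String> leadingNumbers(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher m = LEADING_NUMBER.matcher(text);
        while (m.find()) {
            numbers.add(m.group(1));
        }
        return numbers;
    }

    // Returns the whole first match; the .* part stops at the end of the line.
    static String firstWholeMatch(String text) {
        Matcher m = LEADING_NUMBER.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "12 apples\nno number here\n7 pears";
        System.out.println(leadingNumbers(text));   // prints [12, 7]
        System.out.println(firstWholeMatch(text));  // prints 12 apples
    }
}
```

The second line of the sample text is skipped because it does not start with a digit, and the whole match never runs past the end of its line because the dot does not match the line break.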
- If line breaks are not significant...
In this case, many types of expression will need the Pattern.DOTALL flag,
which denotes that the dot can match literally any character, including line break characters. As mentioned, \s
will match line breaks, so may be more useful in some situations.
Without the MULTILINE flag, ^ and $ match only at the beginning and end of
the entire input (strictly, $ also matches just before a final line break).
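The following is a minimal sketch of the difference these flags make. The class name DotAllDemo, its helper methods and the sample strings are invented for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only Pattern and Matcher come from the JDK.
class DotAllDemo {
    // Returns the text captured between "START" and "END", or null if there is no match.
    static String between(String text, int flags) {
        Matcher m = Pattern.compile("START(.*?)END", flags).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // True if "two" can be found with ^ immediately before it and $ immediately after it.
    static boolean findsAnchoredTwo(String text, int flags) {
        return Pattern.compile("^two$", flags).matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "START first\nsecond END";
        // Default flags: the dot cannot cross the line break, so nothing is found.
        System.out.println(between(text, 0));                       // prints null
        // DOTALL: the dot matches the line break too, so the capture spans both lines.
        System.out.println(between(text, Pattern.DOTALL) != null);  // prints true

        String lines = "one\ntwo\nthree";
        // Without MULTILINE, ^ and $ do not match at the internal line breaks.
        System.out.println(findsAnchoredTwo(lines, 0));                 // prints false
        System.out.println(findsAnchoredTwo(lines, Pattern.MULTILINE)); // prints true
    }
}
```

Flags can be combined with the bitwise OR operator, for example Pattern.DOTALL | Pattern.MULTILINE, if an expression needs both behaviours.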
- If the file has Windows-type line breaks (or if you're not sure...)
If the file or character sequence you're matching against uses the
convention, generally found on Windows, of denoting a line break by a
carriage return followed by a newline, then there's no special option
to set. Provided the MULTILINE flag is enabled, ^ and $
will by default treat a sequence of carriage return plus newline as a single line
break.
If you enable DOTALL, the dot will match against
both carriage return and newline (and all other characters); otherwise, it will
exclude carriage return and newline, plus the other line separators.
The default setting is actually the most flexible line break mode.
The following will all be recognised as line breaks: carriage return and
newline (either individually or as a sequence), character 133 (U+0085, "next line"), plus
Unicode characters U+2028 and U+2029 (which are rarely used in practice).
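To illustrate the default behaviour, here is a small sketch that splits a string into lines using ^ and $ with only the MULTILINE flag set. The class name DefaultLineBreaks, the method lines and the sample input are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only Pattern and Matcher come from the JDK.
class DefaultLineBreaks {
    // MULTILINE only: every default line terminator ends a line, and the dot skips them all.
    private static final Pattern LINE = Pattern.compile("^.*$", Pattern.MULTILINE);

    // Returns the content of each line, without its line terminator.
    static List<String> lines(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = LINE.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // One sample of each terminator: \r\n, \r, \n, U+0085, U+2028, U+2029.
        String text = "one\r\ntwo\rthree\nfour\u0085five\u2028six\u2029seven";
        System.out.println(lines(text));
        // prints [one, two, three, four, five, six, seven]
    }
}
```

Note that the carriage return plus newline pair after "one" produces a single line break rather than an empty line between the two characters.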
- If the file has UNIX-type line breaks
If you are sure that the input has UNIX-style line breaks (in other
words, line breaks are denoted by a single newline, ASCII character 10), then
you should enable the Pattern.UNIX_LINES flag. This flag makes ^
and $ recognise only character 10 as a line break; and by default, the dot
will match against all characters except character 10.
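A brief sketch of what UNIX_LINES changes, and why it is only safe when you are sure about the input. The class name UnixLinesDemo, the method lines and the sample input are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only Pattern and Matcher come from the JDK.
class UnixLinesDemo {
    // Splits text into lines with ^.*$, always using MULTILINE plus any extra flags.
    static List<String> lines(String text, int extraFlags) {
        Matcher m = Pattern.compile("^.*$", Pattern.MULTILINE | extraFlags).matcher(text);
        List<String> result = new ArrayList<>();
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // Deliberately mixed input: the first line ends in \r\n, the second in \n.
        String text = "one\r\ntwo\nthree";

        // Default mode: \r\n counts as a single line break.
        System.out.println(lines(text, 0)); // prints [one, two, three]

        // UNIX_LINES: only \n ends a line, so the dot matches the stray \r
        // and the first line comes back as "one\r" (length 4, not 3).
        System.out.println(lines(text, Pattern.UNIX_LINES).get(0).length()); // prints 4
    }
}
```

If the input turns out to contain Windows-style line breaks after all, the carriage returns end up inside the matched text, which is why the default mode is the safer choice when you are not sure.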
Note: if the string data you are dealing with is a normal page of HTML
that came directly from a web server, there's a good chance it will use UNIX-style line breaks.