Introduction
Regular Expression (regex) is extremely useful in programming, especially in processing text files.
I assume that you are familiar with regex and Java. Otherwise, read up on the regex syntax at:
- My article on "Regular Expressions".
- The Online Java Tutorial Trail on "Regular Expressions".
- JavaDoc for the java.util.regex package.
- JavaDoc for the java.util.regex.Pattern class, which summarizes the regex patterns.
Package java.util.regex (JDK 1.4)
Regular expression support was introduced in Java 1.4 in the package java.util.regex. This package contains two main classes:
- java.util.regex.Pattern: represents a compiled regular expression. You can get a Pattern object via the static method Pattern.compile(String regexStr).
- java.util.regex.Matcher: an engine that performs matching operations on an input CharSequence (such as String, StringBuffer, StringBuilder, CharBuffer, Segment) by interpreting a Pattern.
The steps are:
String regexStr = "......";  // Regex String
String inputStr = "......";  // Input for matching, any CharSequence such as String, StringBuffer, StringBuilder, CharBuffer

// Step 1: Compile a Regex String into a Pattern object
Pattern pattern = Pattern.compile(regexStr);

// Step 2: Allocate a matching engine for the regex pattern, bound to the input string
Matcher matcher = pattern.matcher(inputStr);

// Step 3: Perform the matching
matcher.matches()   : attempts to match the ENTIRE input sequence
matcher.find()      : scans the input sequence looking for the next subsequence that matches the pattern
matcher.lookingAt() : attempts to match the input sequence, starting at the beginning, against the pattern
matcher.replaceAll(replacementStr)   : finds and replaces all matches
matcher.replaceFirst(replacementStr) : finds and replaces the first match

// Step 4: Process the matching result
matcher.group() : returns the input subsequence matched by the previous match
matcher.start() : returns the start index of the previous match
matcher.end()   : returns the offset after the last character matched
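The steps above can be sketched as a small runnable program. This is only an illustration: the class name CharSequenceMatchSketch, the helper firstDigits() and the sample inputs are made up for this sketch. It also shows that matcher() accepts any CharSequence, not just a String.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch (hypothetical names): the four steps applied to any CharSequence. */
public class CharSequenceMatchSketch {
    // Returns the first run of digits in the input, or null if there is none.
    static String firstDigits(CharSequence input) {
        Pattern pattern = Pattern.compile("\\d+");       // Step 1: compile the regex
        Matcher matcher = pattern.matcher(input);        // Step 2: bind the matcher to the input
        return matcher.find() ? matcher.group() : null;  // Steps 3 and 4: match and read the result
    }

    public static void main(String[] args) {
        System.out.println(firstDigits("order 66 shipped"));                      // input is a String
        System.out.println(firstDigits(new StringBuilder("room 101, floor 3")));  // input is a StringBuilder
        System.out.println(firstDigits("no digits here"));                        // no match
    }
}
```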
Java Regex by Examples
Example: Check if the Input string Matches a Regex Pattern via matches()
For example, you want to check if the input is a 5-digit string.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Check if given input matches the specified regex */
public class RegexMatchTest {
    public static void main(String[] args) {
        // Method 1: one-liner matches()
        boolean isMatched1 = Pattern.matches("\\d{5}", "12345");  // 5-digit string
        System.out.println(isMatched1);

        // Method 2: compile(), matcher() and matches()
        Pattern p = Pattern.compile("\\d{5}");  // can be reused and more efficient
        Matcher m = p.matcher("1234");
        boolean isMatched2 = m.matches();
        System.out.println(isMatched2);

        // or
        boolean isMatched3 = Pattern.compile("\\d{5}").matcher("99999").matches();
        System.out.println(isMatched3);
    }
}
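When the same check runs many times, one possible arrangement is to compile the pattern once into a constant and reuse it. The sketch below assumes that arrangement. The class and method names are hypothetical. It also shows String.matches(), a standard convenience method that behaves like Pattern.matches() but recompiles the regex on each call.

```java
import java.util.regex.Pattern;

/** Illustrative sketch (hypothetical names): reusing a compiled Pattern for a 5-digit check. */
public class FiveDigitCheckSketch {
    // Compiled once and shared; Pattern instances are immutable and safe to reuse.
    private static final Pattern FIVE_DIGITS = Pattern.compile("\\d{5}");

    // matches() succeeds only if the ENTIRE input is exactly five digits.
    static boolean isFiveDigits(String input) {
        return FIVE_DIGITS.matcher(input).matches();
    }

    // Same check through String.matches(), which compiles the regex on every call.
    static boolean isFiveDigitsShortcut(String input) {
        return input.matches("\\d{5}");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"12345", "1234", "123456", "12a45"}) {
            System.out.println(s + " -> " + isFiveDigits(s) + ", " + isFiveDigitsShortcut(s));
        }
    }
}
```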
Example: Find Text
For example, given the input "This is an apple. These are 33 (thirty-three) apples.", you wish to find all occurrences of the pattern "Th" (either case-sensitive or case-insensitive).
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class TestRegexFindText {
    public static void main(String[] args) {

        // Input String for matching the regex pattern
        String inputStr = "This is an apple. These are 33 (thirty-three) apples.";
        // Regex to be matched
        String regexStr = "Th";

        // Step 1: Compile a regex via static method Pattern.compile(), default is case-sensitive
        Pattern pattern = Pattern.compile(regexStr);
        // Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE);  // for case-insensitive matching

        // Step 2: Allocate a matching engine from the compiled regex pattern,
        //         and bind to the input string
        Matcher matcher = pattern.matcher(inputStr);

        // Step 3: Perform matching and process the matching results
        // Try Matcher.find(), which finds the next match
        while (matcher.find()) {
            System.out.println("find() found substring \"" + matcher.group()
                    + "\" starting at index " + matcher.start()
                    + " and ending at index " + matcher.end());
        }

        // Try Matcher.matches(), which tries to match the entire input string
        if (matcher.matches()) {
            System.out.println("matches() found substring \"" + matcher.group()
                    + "\" starting at index " + matcher.start()
                    + " and ending at index " + matcher.end());
        } else {
            System.out.println("matches() found nothing");
        }

        // Try Matcher.lookingAt(), which tries to match from the beginning of the input string
        if (matcher.lookingAt()) {
            System.out.println("lookingAt() found substring \"" + matcher.group()
                    + "\" starting at index " + matcher.start()
                    + " and ending at index " + matcher.end());
        } else {
            System.out.println("lookingAt() found nothing");
        }
    }
}
Output
find() found substring "Th" starting at index 0 and ending at index 2
find() found substring "Th" starting at index 18 and ending at index 20
matches() found nothing
lookingAt() found substring "Th" starting at index 0 and ending at index 2
How It Works
- Three steps are required to perform regex matching:
  - Allocate a Pattern object. There is no public constructor for the Pattern class. Instead, you invoke the static method Pattern.compile(regexStr) to compile the regexStr, which returns a Pattern instance.
  - Allocate a Matcher object (a matching engine). Again, there is no public constructor for the Matcher class. Instead, you invoke the matcher(inputStr) method on the Pattern instance (created in Step 1), which binds the input string to this Matcher.
  - Use the Matcher instance (created in Step 2) to perform the matching and process the matching result. The Matcher class provides a few boolean methods for performing the matches:
    - boolean find(): scans the input sequence to look for the next subsequence that matches the pattern. If a match is found, you can use group(), start() and end() to retrieve the matched subsequence and its starting and ending indices, as shown in the above example.
    - boolean matches(): tries to match the entire input sequence against the regex pattern. It returns true only if the entire input sequence matches the pattern. This is as if the pattern were enclosed by the begin and end position anchors ^ and $.
    - boolean lookingAt(): tries to match the input sequence, starting from the beginning, against the regex pattern. It returns true if a prefix of the input sequence matches the pattern. This is as if the pattern were preceded by the begin position anchor ^.
- To perform case-insensitive matching, use Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE) to create the Pattern instance (as commented out in the above example).
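To see the three boolean methods and the CASE_INSENSITIVE flag side by side, here is a minimal sketch. The class MatchMethodsSketch and its helper methods are hypothetical names chosen for illustration. The input string is the one used in the example above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch (hypothetical names): find() vs matches() vs lookingAt(). */
public class MatchMethodsSketch {
    // Counts how many times find() succeeds; flags is 0 or a Pattern flag such as CASE_INSENSITIVE.
    static int countFinds(String regexStr, String input, int flags) {
        Matcher matcher = Pattern.compile(regexStr, flags).matcher(input);
        int count = 0;
        while (matcher.find()) {
            ++count;
        }
        return count;
    }

    // true only if the whole input matches the regex.
    static boolean matchesWhole(String regexStr, String input) {
        return Pattern.compile(regexStr).matcher(input).matches();
    }

    // true if some prefix of the input matches the regex.
    static boolean matchesPrefix(String regexStr, String input) {
        return Pattern.compile(regexStr).matcher(input).lookingAt();
    }

    public static void main(String[] args) {
        String inputStr = "This is an apple. These are 33 (thirty-three) apples.";
        System.out.println(countFinds("Th", inputStr, 0));                         // case-sensitive
        System.out.println(countFinds("Th", inputStr, Pattern.CASE_INSENSITIVE));  // also counts "th"
        System.out.println(matchesPrefix("Th", inputStr));   // a prefix matches
        System.out.println(matchesWhole("Th", inputStr));    // the whole input does not
        System.out.println(matchesWhole("Th.*", inputStr));  // .* lets the pattern cover the rest
    }
}
```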
Example: Find Pattern (Expressed in Regular Expression)
The above example, which finds a particular piece of text in an input sequence, is rather trivial. The power of regex is that you can use it to specify a pattern, e.g.,
- (\w)+ matches any word (delimited by space), where \w is a metacharacter matching any word character [a-zA-Z0-9_], and + is an occurrence indicator for one or more occurrences.
- \b[1-9][0-9]*\b matches any number with a non-zero leading digit, separated by spaces from other words, where \b is the position anchor for word boundary, [1-9] is a character class for any character in the range of 1 to 9, and * is an occurrence indicator for zero or more occurrences.
Try changing the regex pattern of the above example to the following and observe the outputs. Take note that you need to use the escape sequence '\\' for '\' inside a Java string.
String regexStr = "\\w+"; // escape sequence \\ for \
String regexStr = "\\b[1-9][0-9]*\\b";
Output for Regex \w+
find() found substring "This" starting at index 0 and ending at index 4
find() found substring "is" starting at index 5 and ending at index 7
find() found substring "an" starting at index 8 and ending at index 10
find() found substring "apple" starting at index 11 and ending at index 16
find() found substring "These" starting at index 18 and ending at index 23
find() found substring "are" starting at index 24 and ending at index 27
find() found substring "33" starting at index 28 and ending at index 30
find() found substring "thirty" starting at index 32 and ending at index 38
find() found substring "three" starting at index 39 and ending at index 44
find() found substring "apples" starting at index 46 and ending at index 52
matches() found nothing
lookingAt() found substring "This" starting at index 0 and ending at index 4
Output for Regex \b[1-9][0-9]*\b
find() found substring "33" starting at index 28 and ending at index 30
matches() found nothing
lookingAt() found nothing
Check out the Javadoc for the class java.util.regex.Pattern for the list of regular expression constructs supported by Java.
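A find() loop is often used to collect every match into a list. The following is a minimal sketch of that idea. The class FindAllSketch, the helper findAll() and the second sample input are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch (hypothetical names): collecting all matches of a pattern. */
public class FindAllSketch {
    // Returns every non-overlapping match of regexStr in input, in order of appearance.
    static List<String> findAll(String regexStr, CharSequence input) {
        Matcher matcher = Pattern.compile(regexStr).matcher(input);
        List<String> matches = new ArrayList<>();
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String numberRegex = "\\b[1-9][0-9]*\\b";  // whole number with a non-zero leading digit
        System.out.println(findAll(numberRegex, "This is an apple. These are 33 (thirty-three) apples."));
        // "007" and "0" are skipped because they start with a zero
        System.out.println(findAll(numberRegex, "Order 12 of 007 items, 305 left, 0 returned"));
    }
}
```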
Example: Find and Replace Text
Finding a pattern and replacing it with something else is probably one of the most frequent tasks in text processing. Regex allows you to express both the pattern to be matched and the replacement text/pattern flexibly. This is extremely useful in batch processing a huge text document or many text files, for example, searching for stock prices in many online HTML files, or renaming many files in a directory according to a certain pattern.
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class TestRegexFindReplace {
    public static void main(String[] args) {
        String inputStr = "This is an apple. These are 33 (Thirty-three) apples.";
        String regexStr = "apple";         // pattern to be matched
        String replacementStr = "orange";  // replacement pattern

        // Step 1: Allocate a Pattern object to compile a regex
        Pattern pattern = Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE);

        // Step 2: Allocate a Matcher object from the pattern, and provide the input
        Matcher matcher = pattern.matcher(inputStr);

        // Step 3: Perform the matching and process the matching result
        //String outputStr = matcher.replaceAll(replacementStr);   // all matches
        String outputStr = matcher.replaceFirst(replacementStr);   // first match only
        System.out.println(outputStr);
    }
}
Output for replaceAll()
This is an orange. These are 33 (Thirty-three) oranges.
Output for replaceFirst()
This is an orange. These are 33 (Thirty-three) apples.
How It Works
- First, create a Pattern object to compile a regex pattern. Next, create a Matcher object from the Pattern and bind it to the input string.
- The Matcher class provides replaceAll(replacementStr) to replace all the matched subsequences with the replacementStr, and replaceFirst(replacementStr) to replace the first match only.
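The replacement string is itself interpreted: '$' introduces a group reference and '\' is an escape character. If the replacement is meant to be literal text containing those characters, the standard Matcher.quoteReplacement() method can be applied first. The sketch below assumes that use case. The class and method names and the sample strings are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch (hypothetical names): replaceAll() with plain and with literal replacements. */
public class ReplaceSketch {
    // Replaces every case-insensitive match of regexStr with replacementStr.
    static String replaceAllIgnoreCase(String input, String regexStr, String replacementStr) {
        Matcher matcher = Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE).matcher(input);
        return matcher.replaceAll(replacementStr);
    }

    // Treats the replacement as literal text, so '$' and '\' lose their special meaning.
    static String replaceAllLiteral(String input, String regexStr, String literalReplacement) {
        Matcher matcher = Pattern.compile(regexStr).matcher(input);
        return matcher.replaceAll(Matcher.quoteReplacement(literalReplacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceAllIgnoreCase("An Apple and an apple", "apple", "orange"));
        // Without quoteReplacement(), "$5" would be read as a reference to group 5
        System.out.println(replaceAllLiteral("Cost: PRICE", "PRICE", "$5"));
    }
}
```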
Example: Find and Replace with Back References
Given the input "One:two:three:four", the following program produces "four-three-two-One" by matching the 4 words separated by colons, and uses the so-called parenthesized back references $1, $2, $3 and $4 in the replacement pattern.
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class TestRegexBackReference {
    public static void main(String[] args) {
        String inputStr = "One:two:three:four";
        String regexStr = "(.+):(.+):(.+):(.+)";  // pattern to be matched
        String replacementStr = "$4-$3-$2-$1";    // replacement pattern with back references

        // Step 1: Allocate a Pattern object to compile a regex
        Pattern pattern = Pattern.compile(regexStr);

        // Step 2: Allocate a Matcher object from the Pattern, and provide the input
        Matcher matcher = pattern.matcher(inputStr);

        // Step 3: Perform the matching and process the matching result
        String outputStr = matcher.replaceAll(replacementStr);     // all matches
        //String outputStr = matcher.replaceFirst(replacementStr); // first match only
        System.out.println(outputStr);   // Output: four-three-two-One
    }
}
Parentheses () have two meanings in regex:
- Grouping sub-expressions: For example, xyz+ matches one 'x', one 'y', followed by one or more 'z'. But (xyz)+ matches one or more groups of 'xyz', e.g., 'xyzxyzxyz'.
- Parenthesized Back Reference: Provide back references to the matched subsequences. The matched subsequence of the first pair of parentheses can be referred to as $1, the second pair of parentheses as $2, and so on. In the above example, there are 4 pairs of parentheses, which were referenced in the replacement pattern as $1, $2, $3, and $4. You can use groupCount() (of the Matcher) to get the number of capturing groups in the pattern, and group(groupNumber), start(groupNumber), end(groupNumber) to retrieve the matched subsequences and their indices. In Java, $0 denotes the entire match. Group 0 is not counted by groupCount(), so the loop must run up to and including groupCount(). Try the following code (in place of the replaceAll() call, or after calling matcher.reset()) and check the output:

while (matcher.find()) {
    System.out.println("find() found substring \"" + matcher.group()
            + "\" starting at index " + matcher.start()
            + " and ending at index " + matcher.end());
    System.out.println("Group count is: " + matcher.groupCount());
    for (int i = 0; i <= matcher.groupCount(); ++i) {   // group 0 is the entire match
        System.out.println("Group " + i + ": substring=" + matcher.group(i)
                + ", start=" + matcher.start(i) + ", end=" + matcher.end(i));
    }
}

find() found substring "One:two:three:four" starting at index 0 and ending at index 18
Group count is: 4
Group 0: substring=One:two:three:four, start=0, end=18
Group 1: substring=One, start=0, end=3
Group 2: substring=two, start=4, end=7
Group 3: substring=three, start=8, end=13
Group 4: substring=four, start=14, end=18
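As another small illustration of back references, the sketch below reorders the parts of a date. The ISO-style date format, the class ReorderDateSketch and the method names are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch (hypothetical names): reordering captured groups with $1, $2, $3. */
public class ReorderDateSketch {
    // Group 1 = year, group 2 = month, group 3 = day (assumes dates written as yyyy-mm-dd).
    private static final Pattern ISO_DATE = Pattern.compile("(\\d{4})-(\\d{2})-(\\d{2})");

    // Rewrites every yyyy-mm-dd date in the text as dd/mm/yyyy.
    static String toDayMonthYear(String text) {
        return ISO_DATE.matcher(text).replaceAll("$3/$2/$1");
    }

    // Returns the year of the first date found, or null if there is no date.
    static String firstYear(String text) {
        Matcher matcher = ISO_DATE.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String text = "Released 2004-09-30, updated 2014-03-18.";
        System.out.println(toDayMonthYear(text));
        System.out.println(firstYear(text));
    }
}
```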
Example: Rename Files of a Given Directory
The following program renames all the files ending with ".class" to ".out" in the specified directory.
import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.io.File;

public class RegexRenameFiles {
    public static void main(String[] args) {
        String regexStr = "\\.class$";    // ending with ".class" (the dot is escaped to match a literal '.')
        String replacementStr = ".out";   // replace with ".out"

        // Allocate a Pattern object to compile a regex
        Pattern pattern = Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE);
        Matcher matcher;

        File dir = new File(".");         // directory to be processed
        int count = 0;
        File[] files = dir.listFiles();   // list all files and directories
        if (files == null) {              // listFiles() returns null if the directory cannot be read
            System.out.println("Cannot list directory " + dir);
            return;
        }
        for (File file : files) {
            if (file.isFile()) {          // file only, not directory
                String inFilename = file.getName();     // get filename, exclude path
                matcher = pattern.matcher(inFilename);  // allocate Matcher with input
                if (matcher.find()) {
                    ++count;
                    String outFilename = matcher.replaceFirst(replacementStr);
                    System.out.print(inFilename + " -> " + outFilename);

                    // execute rename; File(parent, child) avoids a hard-coded path separator
                    if (file.renameTo(new File(dir, outFilename))) {
                        System.out.println(" SUCCESS");
                    } else {
                        System.out.println(" FAIL");
                    }
                }
            }
        }
        System.out.println(count + " files processed");
    }
}
You can use regex to specify the pattern, and back references in the replacement, as in the previous example.
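For instance, the new filename can be computed with a capturing group and a back reference, separately from the actual renaming. The sketch below only computes names and touches no files. The ".jpeg" to ".jpg" rule, the class name and the method name are hypothetical choices for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch (hypothetical names): computing a new filename with a back reference. */
public class RenameWithGroupsSketch {
    // Group 1 captures everything before a trailing ".jpeg" (case-insensitive).
    private static final Pattern JPEG_NAME =
            Pattern.compile("^(.+)\\.jpeg$", Pattern.CASE_INSENSITIVE);

    // Returns the name with its ".jpeg" extension changed to ".jpg", or the name unchanged.
    static String newName(String filename) {
        Matcher matcher = JPEG_NAME.matcher(filename);
        return matcher.matches() ? matcher.replaceFirst("$1.jpg") : filename;
    }

    public static void main(String[] args) {
        for (String name : new String[] {"holiday.JPEG", "a.b.jpeg", "notes.txt"}) {
            System.out.println(name + " -> " + newName(name));
        }
    }
}
```

The computed name could then be passed to File.renameTo() as in the program above.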
Other Usages of Regex in Java
The String.split() Method
The String class contains a method split(), which takes a regular expression and splits this String object into an array of Strings.
// In String class
public String[] split(String regexStr)
For example,
public class StringSplitTest {
    public static void main(String[] args) {
        String source = "There are thirty-three big-apple";
        String[] tokens = source.split("\\s+|-");  // whitespace(s) or -
        for (String token : tokens) {
            System.out.println(token);
        }
    }
}
There
are
thirty
three
big
apple
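One detail worth knowing is that split(regex) drops trailing empty strings, whereas the two-argument overload split(regex, limit) with a negative limit keeps them. The sketch below illustrates the difference. The class name, helper names and sample inputs are assumptions for this example.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch (hypothetical names): split() with and without a limit. */
public class SplitSketch {
    // Splits on commas with optional surrounding whitespace; trailing empty tokens are dropped.
    static List<String> splitOnCommas(String source) {
        return Arrays.asList(source.split("\\s*,\\s*"));
    }

    // Same delimiter, but a negative limit keeps trailing empty tokens.
    static List<String> splitKeepingEmpties(String source) {
        return Arrays.asList(source.split("\\s*,\\s*", -1));
    }

    public static void main(String[] args) {
        System.out.println(splitOnCommas("apple , banana,cherry"));
        System.out.println(splitOnCommas("a,b,,").size());        // trailing empties removed
        System.out.println(splitKeepingEmpties("a,b,,").size());  // trailing empties kept
    }
}
```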
The Scanner & useDelimiter()
The Scanner class, by default, uses whitespace as the delimiter in parsing input tokens. You can set the delimiter to a regex via the useDelimiter() methods:
public Scanner useDelimiter(Pattern pattern)
public Scanner useDelimiter(String pattern)
For example,
import java.util.Scanner;

public class ScannerUseDelimiterTest {
    public static void main(String[] args) {
        String source = "There are thirty-three big-apple";
        Scanner in = new Scanner(source);
        in.useDelimiter("\\s+|-");  // whitespace(s) or -
        while (in.hasNext()) {
            System.out.println(in.next());
        }
    }
}
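A regex delimiter also works with the typed Scanner methods such as hasNextInt() and nextInt(). The following is a minimal sketch that sums comma-separated integers. The class name, the method name and the sample input are made up for illustration.

```java
import java.util.Scanner;

/** Illustrative sketch (hypothetical names): Scanner with a regex delimiter and nextInt(). */
public class ScannerSumSketch {
    // Sums the leading run of integers separated by commas with optional surrounding whitespace.
    static int sumCommaSeparated(String source) {
        int sum = 0;
        try (Scanner in = new Scanner(source)) {  // try-with-resources closes the Scanner
            in.useDelimiter("\\s*,\\s*");         // comma, with optional whitespace around it
            while (in.hasNextInt()) {             // stops at the first token that is not an int
                sum += in.nextInt();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumCommaSeparated("10, 20 ,30"));
    }
}
```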
REFERENCES & RESOURCES