Java Programming Tutorial

Regular Expression (Regex) in Java

Introduction

Regular Expression (regex) is extremely useful in programming, especially in processing text files.

I assume that you are familiar with regex and Java. Otherwise, read up on the regex syntax at:

  1. My article on "Regular Expressions".
  2. The Online Java Tutorial Trail on "Regular Expressions".
  3. JavaDoc for java.util.regex Package.
  4. JavaDoc for java.util.regex.Pattern Class, which summarizes the regex patterns.

Package java.util.regex (JDK 1.4)

Regular expression support was introduced in Java 1.4 in the package java.util.regex. The package is built around two classes:

  1. java.util.regex.Pattern: represents a compiled regular expression. You can get a Pattern object via static method Pattern.compile(String regexStr).
  2. java.util.regex.Matcher: an engine that performs matching operations on an input CharSequence (such as String, StringBuffer, StringBuilder, CharBuffer, Segment) by interpreting a pattern.

The steps are:

String regexStr = "......";   // Regex String
String inputStr = "......";   // Input for matching, any CharSequence such as String, StringBuffer, StringBuilder, CharBuffer

// Step 1: Compile a Regex String into a Pattern object
Pattern pattern = Pattern.compile(regexStr);

// Step 2: Allocate a matching engine for the regex pattern, bound to the input string
Matcher matcher = pattern.matcher(inputStr);

// Step 3: Perform the matching 
matcher.matches()   : attempts to match the ENTIRE input sequence
matcher.find()      : scans the input sequence looking for the next subsequence that matches the pattern
matcher.lookingAt() : attempts to match the input sequence, starting at the beginning, against the pattern.
matcher.replaceAll(replacementStr):   Find and replace all matches.
matcher.replaceFirst(replacementStr): Find and replace the first match.

// Step 4: Processing matching result
matcher.group() : returns the input subsequence matched by the previous match.
matcher.start() : returns the start index of the previous match.
matcher.end()   : returns the offset after the last character matched.
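
The four steps above can be put together in a minimal runnable sketch. The class name RegexStepsSketch, the helper findAll(), the regex \d+ and the sample inputs are assumptions chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexStepsSketch {
   // Collects every match of regexStr in inputStr as "text@start-end"
   static List<String> findAll(String regexStr, CharSequence inputStr) {
      Pattern pattern = Pattern.compile(regexStr);   // Step 1: compile the regex
      Matcher matcher = pattern.matcher(inputStr);   // Step 2: bind to the input
      List<String> results = new ArrayList<>();
      while (matcher.find()) {                       // Step 3: perform the matching
         // Step 4: process the matching result
         results.add(matcher.group() + "@" + matcher.start() + "-" + matcher.end());
      }
      return results;
   }

   public static void main(String[] args) {
      System.out.println(findAll("\\d+", "a12b345"));                       // [12@1-3, 345@4-7]
      System.out.println(findAll("\\d+", new StringBuilder("no digits")));  // []
   }
}
```

Note that end() is exclusive: "12" occupies indices 1 and 2, so it is reported as 1-3.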

Java Regex by Examples

Example: Check if the Input string Matches a Regex Pattern via matches()

For example, you want to check if the input is a 5-digit string.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Check if given input matches the specified regex */
public class RegexMatchTest {
   public static void main(String[] args) {
      // Method 1: one-liner matches()
      boolean isMatched1 = Pattern.matches("\\d{5}", "12345");  // 5-digit string
      System.out.println(isMatched1);  // true

      // Method 2: compile(), matcher() and matches()
      Pattern p = Pattern.compile("\\d{5}");  // can be reused and more efficient
      Matcher m = p.matcher("1234");
      boolean isMatched2 = m.matches();
      System.out.println(isMatched2);  // false (only 4 digits)
      // or
      boolean isMatched3 = Pattern.compile("\\d{5}").matcher("99999").matches();
      System.out.println(isMatched3);  // true
   }
}

Example: Find Text

For example, given the input "This is an apple. These are 33 (thirty-three) apples.", you wish to find all occurrences of pattern "Th" (either case-sensitive or case-insensitive).

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class TestRegexFindText {
   public static void main(String[] args) {

      // Input String for matching the regex pattern
      String inputStr = "This is an apple. These are 33 (thirty-three) apples.";
      // Regex to be matched
      String regexStr = "Th";

      // Step 1: Compile a regex via static method Pattern.compile(), default is case-sensitive
      Pattern pattern = Pattern.compile(regexStr);
      // Pattern.compile(regex, Pattern.CASE_INSENSITIVE);  // for case-insensitive matching

      // Step 2: Allocate a matching engine from the compiled regex pattern,
      //         and bind to the input string
      Matcher matcher = pattern.matcher(inputStr);

      // Step 3: Perform matching and process the matching results

      // Try Matcher.find(), which finds the next match
      while (matcher.find()) {
         System.out.println("find() found substring \"" + matcher.group()
               + "\" starting at index " + matcher.start()
               + " and ending at index " + matcher.end());
      }

      // Try Matcher.matches(), which tries to match the entire input string
      if (matcher.matches()) {
         System.out.println("matches() found substring \"" + matcher.group()
               + "\" starting at index " + matcher.start()
               + " and ending at index " + matcher.end());
      } else {
         System.out.println("matches() found nothing");
      }

      // Try Matcher.lookingAt(), which tries to match from the beginning of the input string
      if (matcher.lookingAt()) {
         System.out.println("lookingAt() found substring \"" + matcher.group()
               + "\" starting at index " + matcher.start()
               + " and ending at index " + matcher.end());
      } else {
         System.out.println("lookingAt() found nothing");
      }
   }
}
Output
find() found substring "Th" starting at index 0 and ending at index 2
find() found substring "Th" starting at index 18 and ending at index 20
matches() found nothing
lookingAt() found substring "Th" starting at index 0 and ending at index 2
How It Works
  • Three steps are required to perform regex matching:
    1. Allocate a Pattern object. There is no constructor for the Pattern class. Instead, you invoke the static method Pattern.compile(regexStr) to compile the regexStr, which returns a Pattern instance.
    2. Allocate a Matcher object (a matching engine). Again, there is no constructor for the Matcher class. Instead, you invoke the matcher(inputStr) method from the Pattern instance (created in Step 1), and bind the input string to this Matcher.
    3. Use the Matcher instance (created in Step 2) to perform the matching and process the matching result. The Matcher class provides a few boolean methods for performing the matches:
      • boolean find(): scans the input sequence to look for the next subsequence that matches the pattern. If a match is found, you can use group(), start() and end() to retrieve the matched subsequence and its starting and ending indices, as shown in the above example.
      • boolean matches(): tries to match the entire input sequence against the regex pattern. It returns true only if the entire input sequence matches the pattern, as if the pattern were enclosed by the begin and end position anchors ^ and $.
      • boolean lookingAt(): tries to match the input sequence, starting from the beginning, against the regex pattern. It returns true if a prefix of the input sequence matches the pattern, as if the pattern began with the begin position anchor ^.
  • To perform case-insensitive matching, use Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE) to create the Pattern instance (as commented out in the above example).
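
The relationship between matches(), lookingAt() and an anchored find() can be checked with a small sketch. The class and method names are hypothetical, and the equivalence is only illustrated here for inputs without line terminators; treat it as an approximation rather than a definition.

```java
import java.util.regex.Pattern;

class AnchorEquivalenceSketch {
   // matches(): the whole input must match
   static boolean viaMatches(String regexStr, String inputStr) {
      return Pattern.compile(regexStr).matcher(inputStr).matches();
   }

   // find() with the regex wrapped as ^(?:regex)$
   static boolean viaAnchoredFind(String regexStr, String inputStr) {
      return Pattern.compile("^(?:" + regexStr + ")$").matcher(inputStr).find();
   }

   // lookingAt(): a prefix of the input must match
   static boolean viaLookingAt(String regexStr, String inputStr) {
      return Pattern.compile(regexStr).matcher(inputStr).lookingAt();
   }

   // find() with the regex wrapped as ^(?:regex)
   static boolean viaPrefixFind(String regexStr, String inputStr) {
      return Pattern.compile("^(?:" + regexStr + ")").matcher(inputStr).find();
   }

   public static void main(String[] args) {
      String inputStr = "This is an apple.";
      System.out.println(viaMatches("Th", inputStr));          // false
      System.out.println(viaAnchoredFind("Th", inputStr));     // false
      System.out.println(viaLookingAt("Th", inputStr));        // true
      System.out.println(viaPrefixFind("Th", inputStr));       // true
      System.out.println(viaMatches("\\d{5}", "12345"));       // true
      System.out.println(viaAnchoredFind("\\d{5}", "12345"));  // true
   }
}
```

The non-capturing group (?:...) is used so that an alternation such as a|b inside the regex stays within the anchors.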

Example: Find Pattern (Expressed in Regular Expression)

The above example, which finds a fixed piece of text in an input sequence, is rather trivial. The power of regex is that you can use it to specify a pattern, e.g.,

  1. (\w)+ matches any word (delimited by non-word characters such as spaces and punctuation), where \w is a metacharacter matching any word character [a-zA-Z0-9_], and + is an occurrence indicator for one or more occurrences.
  2. \b[1-9][0-9]*\b matches any number with a non-zero leading digit that stands alone as a word, where \b is the position anchor for word boundary, [1-9] is a character class for any character in the range of 1 to 9, and * is an occurrence indicator for zero or more occurrences.

Try changing the regex pattern of the above example to the following and observe the outputs. Take note that you need to use the escape sequence '\\' for '\' inside a Java string.

String regexStr = "\\w+";               // escape sequence \\ for \
// or
String regexStr = "\\b[1-9][0-9]*\\b";
Output for Regex \w+
find() found substring "This" starting at index 0 and ending at index 4
find() found substring "is" starting at index 5 and ending at index 7
find() found substring "an" starting at index 8 and ending at index 10
find() found substring "apple" starting at index 11 and ending at index 16
find() found substring "These" starting at index 18 and ending at index 23
find() found substring "are" starting at index 24 and ending at index 27
find() found substring "33" starting at index 28 and ending at index 30
find() found substring "thirty" starting at index 32 and ending at index 38
find() found substring "three" starting at index 39 and ending at index 44
find() found substring "apples" starting at index 46 and ending at index 52
matches() found nothing
lookingAt() found substring "This" starting at index 0 and ending at index 4
Output for Regex \b[1-9][0-9]*\b
find() found substring "33" starting at index 28 and ending at index 30
matches() found nothing
lookingAt() found nothing

Check out the Javadoc for the Class java.util.regex.Pattern for the list of regular expression constructs supported by Java.
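
To see what the word boundary \b and the leading [1-9] contribute, the number pattern can be tried on an input with a few near misses. This is only a sketch: the class name, the helper findNumbers() and the sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NumberPatternSketch {
   // Returns all standalone numbers with a non-zero leading digit
   static List<String> findNumbers(String inputStr) {
      Matcher matcher = Pattern.compile("\\b[1-9][0-9]*\\b").matcher(inputStr);
      List<String> numbers = new ArrayList<>();
      while (matcher.find()) {
         numbers.add(matcher.group());
      }
      return numbers;
   }

   public static void main(String[] args) {
      // "007" starts with 0, and "42" in "x42" is not preceded by a word boundary
      System.out.println(findNumbers("7 apples, 007 agents, 1200 oranges, x42"));  // [7, 1200]
   }
}
```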

Example: Find and Replace Text

Finding a pattern and replacing it with something else is probably one of the most frequent tasks in text processing. Regex allows you to express the pattern liberally, and also the replacement text/pattern. This is extremely useful in batch processing a huge text document or many text files, for example, searching for stock prices in many online HTML files, or renaming many files in a directory according to a certain pattern.

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class TestRegexFindReplace {
   public static void main(String[] args) {
      String inputStr = "This is an apple. These are 33 (Thirty-three) apples.";
      String regexStr = "apple";         // pattern to be matched
      String replacementStr = "orange";  // replacement pattern

      // Step 1: Allocate a Pattern object to compile a regex
      Pattern pattern = Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE);

      // Step 2: Allocate a Matcher object from the pattern, and provide the input
      Matcher matcher = pattern.matcher(inputStr);

      // Step 3: Perform the matching and process the matching result
      //String outputStr = matcher.replaceAll(replacementStr);     // all matches
      String outputStr = matcher.replaceFirst(replacementStr); // first match only
      System.out.println(outputStr);
   }
}
Output for replaceAll()
This is an orange. These are 33 (Thirty-three) oranges.
Output for replaceFirst()
This is an orange. These are 33 (Thirty-three) apples.
How It Works
  • First, create a Pattern object to compile a regex pattern. Next, create a Matcher object from the Pattern and bind it to the input string.
  • The Matcher class provides replaceAll(replacementStr) to replace all the matched subsequences with the replacementStr, and replaceFirst(replacementStr) to replace the first match only.
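
The two replacement methods can also be compared side by side. The sketch below uses a hypothetical class name, helper methods and sample input; it also shows that with CASE_INSENSITIVE the replacement text is inserted as given, regardless of the case of the matched text.

```java
import java.util.regex.Pattern;

class ReplaceBothSketch {
   // Replaces every case-insensitive match of regexStr
   static String replaceAllIgnoreCase(String inputStr, String regexStr, String replacementStr) {
      return Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE)
            .matcher(inputStr).replaceAll(replacementStr);
   }

   // Replaces only the first case-insensitive match of regexStr
   static String replaceFirstIgnoreCase(String inputStr, String regexStr, String replacementStr) {
      return Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE)
            .matcher(inputStr).replaceFirst(replacementStr);
   }

   public static void main(String[] args) {
      String inputStr = "Apple pie and apple juice";
      System.out.println(replaceAllIgnoreCase(inputStr, "apple", "orange"));
      // orange pie and orange juice
      System.out.println(replaceFirstIgnoreCase(inputStr, "apple", "orange"));
      // orange pie and apple juice
   }
}
```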

Example: Find and Replace with Back References

Given the input "One:two:three:four", the following program produces "four-three-two-One" by matching the 4 words separated by colons, and uses the so-called parenthesized back-references $1, $2, $3 and $4 in the replacement pattern.

import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class TestRegexBackReference {
   public static void main(String[] args) {
      String inputStr = "One:two:three:four";
      String regexStr = "(.+):(.+):(.+):(.+)";  // pattern to be matched
      String replacementStr = "$4-$3-$2-$1";    // replacement pattern with back references

      // Step 1: Allocate a Pattern object to compile a regex
      Pattern pattern = Pattern.compile(regexStr);

      // Step 2: Allocate a Matcher object from the Pattern, and provide the input
      Matcher matcher = pattern.matcher(inputStr);

      // Step 3: Perform the matching and process the matching result
      String outputStr = matcher.replaceAll(replacementStr);     // all matches
      //String outputStr = matcher.replaceFirst(replacementStr); // first match only
      System.out.println(outputStr);   // Output: four-three-two-One
   }
}

Parentheses () have two meanings in regex:

  1. Grouping sub-expressions: For example xyz+ matches one 'x', one 'y', followed by one or more 'z'. But (xyz)+ matches one or more groups of 'xyz', e.g., 'xyzxyzxyz'.
  2. Parenthesized Back Reference: Provide back references to the matched subsequences. The matched subsequence of the first pair of parentheses can be referred to as $1, the second pair of parentheses as $2, and so on. In the above example, there are 4 pairs of parentheses, which were referenced in the replacement pattern as $1, $2, $3, and $4. You can use groupCount() (of the Matcher) to get the number of capturing groups in the pattern, and group(groupNumber), start(groupNumber), end(groupNumber) to retrieve the matched subsequences and their indices. In Java, $0 denotes the entire match; group 0 is not included in groupCount(). Try the following code and check the output:
          while (matcher.find()) {
             System.out.println("find() found substring \"" + matcher.group()
                   + "\" starting at index " + matcher.start()
                   + " and ending at index " + matcher.end());
             System.out.println("Group count is: " + matcher.groupCount());
             for (int i = 0; i <= matcher.groupCount(); ++i) {  // group 0 is the entire match
                System.out.println("Group " + i + ": substring=" 
                      + matcher.group(i) + ", start=" + matcher.start(i) 
                      + ", end=" + matcher.end(i));
             }
          }
    find() found substring "One:two:three:four" starting at index 0 and ending at index 18
    Group count is: 4
    Group 0: substring=One:two:three:four, start=0, end=18
    Group 1: substring=One, start=0, end=3
    Group 2: substring=two, start=4, end=7
    Group 3: substring=three, start=8, end=13
    Group 4: substring=four, start=14, end=18
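
The first meaning of parentheses, grouping, can be verified with a short sketch. The class name and the helper fullMatch() are hypothetical; the patterns are the ones from item 1 above.

```java
import java.util.regex.Pattern;

class GroupingSketch {
   // True if the entire input matches the regex
   static boolean fullMatch(String regexStr, String inputStr) {
      return Pattern.matches(regexStr, inputStr);
   }

   public static void main(String[] args) {
      // Without parentheses, + applies to 'z' only
      System.out.println(fullMatch("xyz+", "xyzzz"));        // true
      System.out.println(fullMatch("xyz+", "xyzxyz"));       // false
      // With parentheses, + applies to the whole group 'xyz'
      System.out.println(fullMatch("(xyz)+", "xyzxyzxyz"));  // true
      System.out.println(fullMatch("(xyz)+", "xyzzz"));      // false
   }
}
```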

Example: Rename Files of a Given Directory

The following program renames all the files ending with ".class" to ".out" in the specified directory.

import java.util.regex.Pattern;
import java.util.regex.Matcher;
import java.io.File;

public class RegexRenameFiles {
   public static void main(String[] args) {
      String regexStr = "\\.class$";  // ending with ".class" (dot escaped to match a literal '.')
      String replacementStr = ".out"; // replace with ".out"

      // Allocate a Pattern object to compile a regex
      Pattern pattern = Pattern.compile(regexStr, Pattern.CASE_INSENSITIVE);
      Matcher matcher;

      File dir = new File(".");  // directory to be processed
      int count = 0;
      File[] files = dir.listFiles();   // list all files and directories
      for (File file : files) {
         if (file.isFile()) {   // file only, not directory
            String inFilename = file.getName();    // get filename, exclude path
            matcher = pattern.matcher(inFilename); // allocate Matcher with input
            if (matcher.find()) {
               ++count;
               String outFilename = matcher.replaceFirst(replacementStr);
               System.out.print(inFilename + " -> " + outFilename);

               if (file.renameTo(new File(dir, outFilename))) {  // execute rename
                  System.out.println(" SUCCESS");
               } else {
                  System.out.println(" FAIL");
               }
            }
         }
      }
      System.out.println(count + " files processed");
   }
}

You can use regex to specify the pattern, and back references in the replacement, as in the previous example.
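
As a sketch of that idea, the new filename can be computed with a capturing group and the back reference $1, without touching the file system. The class name, the helper toOutName() and the sample filenames are assumptions for illustration only.

```java
import java.util.regex.Pattern;

class RenameWithBackReferenceSketch {
   // Group 1 captures everything before the ".class" extension
   private static final Pattern CLASS_FILE =
         Pattern.compile("(.+)\\.class$", Pattern.CASE_INSENSITIVE);

   // Returns the filename with ".class" replaced by ".out",
   // or the filename unchanged if it does not end with ".class"
   static String toOutName(String filename) {
      return CLASS_FILE.matcher(filename).replaceFirst("$1.out");
   }

   public static void main(String[] args) {
      System.out.println(toOutName("Hello.class"));  // Hello.out
      System.out.println(toOutName("Hello.CLASS"));  // Hello.out
      System.out.println(toOutName("notes.txt"));    // notes.txt
      System.out.println(toOutName("myclass"));      // myclass
   }
}
```

Because the dot is escaped as \\. in the regex, a name such as "myclass" is left alone.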

Other Usages of Regex in Java

The String.split() Method

The String class contains a method split(), which takes a regular expression and splits this String object into an array of Strings.

// In String class
public String[] split(String regexStr)

For example,

public class StringSplitTest {
   public static void main(String[] args) {
      String source = "There are thirty-three big-apple";
      String[] tokens = source.split("\\s+|-");  // whitespace(s) or -
      for (String token : tokens) {
         System.out.println(token);
      }
   }
}
There
are
thirty
three
big
apple
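
Two related variants may be handy: Pattern also has a split() method, which lets a compiled pattern be reused, and String.split() has an overload that takes a limit on the number of tokens. The following is a minimal sketch; the class and helper names are hypothetical.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitVariantsSketch {
   // Compiled once, reusable for many inputs
   private static final Pattern DELIM = Pattern.compile("\\s+|-");

   // Splits on whitespace(s) or - using the precompiled pattern
   static String[] tokens(String source) {
      return DELIM.split(source);
   }

   // Splits at the first delimiter only: limit 2 gives at most 2 tokens
   static String[] firstTokenAndRest(String source) {
      return source.split("\\s+|-", 2);
   }

   public static void main(String[] args) {
      String source = "There are thirty-three big-apple";
      System.out.println(Arrays.toString(tokens(source)));
      // [There, are, thirty, three, big, apple]
      System.out.println(Arrays.toString(firstTokenAndRest(source)));
      // [There, are thirty-three big-apple]
   }
}
```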

The Scanner & useDelimiter()

The Scanner class, by default, uses whitespace as the delimiter in parsing input tokens. You can set the delimiter to a regex via the useDelimiter() methods:

public Scanner useDelimiter(Pattern pattern)
public Scanner useDelimiter(String pattern)

For example,

import java.util.Scanner;
public class ScannerUseDelimiterTest {
   public static void main(String[] args) {
      String source = "There are thirty-three big-apple";
      Scanner in = new Scanner(source);
      in.useDelimiter("\\s+|-");  // whitespace(s) or -
      while (in.hasNext()) {
         System.out.println(in.next());
      }
   }
}
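
A regex delimiter is also convenient for reading typed tokens. The sketch below, with a hypothetical class name, helper parseInts() and sample input, reads comma-separated integers while tolerating optional spaces around each comma.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class ScannerDelimiterSketch {
   // Parses integers separated by commas with optional surrounding whitespace
   static List<Integer> parseInts(String source) {
      List<Integer> values = new ArrayList<>();
      try (Scanner in = new Scanner(source)) {   // try-with-resources closes the Scanner
         in.useDelimiter("\\s*,\\s*");           // comma with optional whitespace(s)
         while (in.hasNextInt()) {
            values.add(in.nextInt());
         }
      }
      return values;
   }

   public static void main(String[] args) {
      System.out.println(parseInts("10, 20 ,30"));  // [10, 20, 30]
   }
}
```

The loop stops at the first token that is not an integer, so any trailing non-numeric text is ignored in this sketch.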
