Perl Tutorial

Regular Expressions, File IO & Text Processing

Perl is famous for processing text files via regular expressions.

Regular Expressions in Perl

A Regular Expression (or Regex) is a pattern (or filter) that describes a set of strings that matches the pattern.  In other words, a regex accepts a certain set of strings and rejects the rest.

I shall assume that you are familiar with regex syntax.

Perl makes extensive use of regular expressions with many built-in syntaxes and operators. In Perl (and JavaScript), a regex is delimited by a pair of forward slashes (default), in the form of /regex/. You can use built-in operators:

  • m/regex/modifier: Match against the regex.
  • s/regex/replacement/modifier: Substitute matched substring(s) by the replacement.

Matching Operator m//

You can use matching operator m// to check if a regex pattern exists in a string. The syntax is:

m/regex/
m/regex/modifiers  # Optional modifiers
/regex/            # Operator m can be omitted if forward-slashes are used as delimiter
/regex/modifiers
Delimiter

Instead of using forward slashes (/) as the delimiter, you could use another non-alphanumeric character such as !, @ or %, in the form of m!regex!modifiers, m@regex@modifiers or m%regex%modifiers. The operator m can be omitted only when the forward slash (/) is the delimiter, as in /regex/modifiers. Changing the default delimiter can be confusing, and is generally not recommended.
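
One case where another delimiter can help readability is a pattern that itself contains forward slashes, such as a file path. The following is a minimal sketch; the sample path and the variable names are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $path = '/usr/local/bin/perl';   # hypothetical sample string

# With / as the delimiter, every slash in the pattern must be escaped.
my $with_slashes = $path =~ /^\/usr\/local\// ? 1 : 0;

# With ! as the delimiter, the slashes need no escaping.
my $with_bang = $path =~ m!^/usr/local/! ? 1 : 0;

print "slash delimiter: $with_slashes, bang delimiter: $with_bang\n";
```

Both forms describe the same regex, so both report a match for this sample path.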

m//, by default, operates on the default variable $_. It returns true if $_ matches regex; and false otherwise.

Example 1: Regex [0-9]+
#!/usr/bin/env perl
# try_m_1.pl
use strict;
use warnings;
while (<>) {   # Read each line (from STDIN, or from files named on the command line) into $_
   print m/[0-9]+/ ? "Accept\n" : "Reject\n";   # one or more digits?
}
$ ./try_m_1.pl
123
Accept
00000
Accept
abc
Reject
abc123
Accept
Example 2: Extracting the Matched Substrings

The built-in array variables @- and @+ keep the start and end offsets of the last successful match: $-[0] is the offset of the start of the full match and $+[0] is the offset just past its end, while $-[n] and $+[n] give the same offsets for the back-references $1, $2, ..., $n.

#!/usr/bin/env perl
# try_m_2.pl
use strict;
use warnings;
while (<>) {   # Read each line (from STDIN, or from files named on the command line) into $_
   if (m/[0-9]+/) {
      print 'Accept substring: ' . substr($_, $-[0], $+[0] - $-[0]) . "\n";
   } else {
      print "Reject\n";
   }
}
$ ./try_m_2.pl
123
Accept substring: 123
00000
Accept substring: 00000
abc
Reject
abc123xyz
Accept substring: 123
abc123xyz456
Accept substring: 123
Example 3: Modifier 'g' (global)

By default, m// finds only the first match. To find all matches, include 'g' (global) modifier.

#!/usr/bin/env perl
# try_m_3.pl
use strict;
use warnings;

my $regex = '[0-9]+';   # Define regex pattern in non-interpolating string

while (<>) {   # Read each line (from STDIN, or from files named on the command line) into $_
   # Do m//g and save matched substring into an array
   my @matches = /$regex/g;
   print "Matched substrings (in array): @matches\n";        # print array

   # Do m//g in a loop
   print 'Matched substrings (in loop) : ';
   while (/$regex/g) {
      print substr($_, $-[0], $+[0] - $-[0]), ',';
   }
   print "\n";
}
$ ./try_m_3.pl
abc123xyz456_0_789
Matched substrings (in array): 123 456 0 789
Matched substrings (in loop) : 123,456,0,789,
abc
Matched substrings (in array):
Matched substrings (in loop) :
123
Matched substrings (in array): 123
Matched substrings (in loop) : 123,

Operators =~ and !~

By default, the matching operators operate on the default variable $_. To operate on another variable instead of $_, you could use the =~ and !~ operators as follows:

str =~ m/regex/modifiers    # Return true if str matches regex.
str !~ m/regex/modifiers    # Return true if str does NOT match regex.

When used with m//, =~ behaves like comparison (== or eq).

Example 4: =~ Operator
#!/usr/bin/env perl
# try_m_4.pl
use strict;
use warnings;

print 'yes or no? ';
my $reply;
chomp($reply = <>);     # Remove newline
print $reply =~ /^y/i ? "positive!\n" : "negative!\n";
   # Begins with 'y', case-insensitive
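
The negated operator !~ can be sketched in the same way. The example below keeps only the lines that do NOT match a pattern for blank lines and comments; the sample data and variable names are assumptions made for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical sample input: comments, settings and a blank line.
my @lines = ('# a comment', 'x = 1', '', 'y = 2');

my @kept;
foreach my $line (@lines) {
   # Keep the line unless it is blank or holds only a comment.
   push @kept, $line if $line !~ /^\s*(?:#.*)?$/;
}
print join('|', @kept), "\n";   # prints "x = 1|y = 2"
```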

Substitution Operator s///

You can substitute a string (or a portion of a string) with another string using s/// substitution operator. The syntax is:

s/regex/replacement/
s/regex/replacement/modifiers  # Optional modifiers

Similar to m//, s/// operates on the default variable $_ by default. To operate on another variable, use the =~ operator. When used with s///, =~ behaves like assignment (=): the variable on the left is modified in place, and the expression returns the number of substitutions made.

Example 5: s///
#!/usr/bin/env perl
# try_s_1.pl
use strict;
use warnings;

while (<>) {   # Read each line (from STDIN, or from files named on the command line) into $_
   s/\w+/\*\*\*/g;   # Match each word
   print "$_";
}
$ ./try_s_1.pl
this is an apple.
*** *** *** ***.

Modifiers

Modifiers (such as /g, /i, /e, /o, /s and /x) can be used to control the behavior of m// and s///.

  • g (global): By default, only the first occurrence of the matching string of each line is processed. You can use modifier /g to specify global operation.
  • i (case-insensitive): By default, matching is case-sensitive. You can use the modifier /i to enable case in-sensitive matching.
  • m (multiline): treats the string as multiple lines, so that the anchors ^ and $ also match at the start and end of each line within the string. \A and \Z are not affected.
  • s: permits metacharacter . (dot) to match the newline.
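
The effect of these modifiers can be checked with a short script. This is a minimal sketch; the two-line sample string and the variable names are assumptions made for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $text = "Apple pie\nbanana split\n";   # hypothetical two-line string

# /i: case-insensitive match.
my $has_apple = $text =~ /apple/i ? 1 : 0;        # 1

# Without /m, ^ matches only at the very start of the string.
my $plain_caret = $text =~ /^banana/ ? 1 : 0;     # 0

# With /m, ^ also matches just after each embedded newline.
my $multi_caret = $text =~ /^banana/m ? 1 : 0;    # 1

# Without /s, dot does not match a newline; with /s it does.
my $dot_plain = $text =~ /pie.banana/ ? 1 : 0;    # 0
my $dot_s     = $text =~ /pie.banana/s ? 1 : 0;   # 1

# /g in list context returns every match.
my @words = $text =~ /\w+/g;                      # 4 words

print "$has_apple $plain_caret $multi_caret $dot_plain $dot_s ",
      scalar(@words), "\n";                       # prints "1 0 1 0 1 4"
```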

Parenthesized Back-References & Matched Variables $1, ..., $9

Parentheses ( ) serve two purposes in regex:

  1. Firstly, parentheses ( ) can be used to group sub-expressions for overriding the precedence or applying a repetition operator. For example, /(a|e|i|o|u){3,5}/ matches a run of three to five vowels in any combination (such as "aei"), whereas without the parentheses, /a|e|i|o|u{3,5}/ would apply the repetition to u only.
  2. Secondly, parentheses are used to provide the so-called back-references. A back-reference contains the matched sub-string. For example, the regex /(\S+)/ creates one back-reference (\S+), which contains the first word (consecutive non-spaces) in the input string; the regex /(\S+)\s+(\S+)/ creates two back-references: (\S+) and another (\S+), containing the first two words, separated by one or more spaces \s+.

The back-references are stored in the special variables $1, $2, $3, and so on, where $1 contains the substring matched by the first pair of parentheses, $2 by the second pair, and so on. For example, /(\S+)\s+(\S+)/ creates two back-references which match the first two words. The matched words are stored in $1 and $2, respectively.

For example, the following expression swaps the first and second words:

s/(\S+) (\S+)/$2 $1/;   # Swap the first and second words separated by a single space

Back-references can also be referenced in your program.

For example,

(my $word) = ($str =~ /(\S+)/);

The parentheses create one back-reference, which matches the first word of $str if there is one, and the word is placed in the scalar variable $word. If there is no match, $word is undef.

Another example,

(my $word1, my $word2) = ($str =~ /(\S+)\s+(\S+)/);

The two pairs of parentheses place the first two words (separated by one or more whitespace characters) of $str into the variables $word1 and $word2 if there are at least two words; otherwise, both $word1 and $word2 are undef. Note that the entire pattern must match: if only part of it matches (for example, $str contains just one word), the match fails and nothing is captured.

\1, \2 and \3 have the same meaning as $1, $2 and $3, but are valid only inside the regex pattern of s/// or m//. For example, /(\S+)\s\1/ matches a pair of repeated words, separated by a single whitespace character.
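
As a minimal sketch of \1 in use, the script below finds and removes a doubled word. The sample sentence and variable names are assumptions; \b and \w+ are used here in place of \S+ so that the match cannot start in the middle of a longer word (for example, the "is" inside "this").

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $str = 'this is is a test';   # hypothetical sample with a doubled word

# \1 must match the same text that the first pair of parentheses captured.
my $repeated = '';
if ($str =~ /\b(\w+)\s\1\b/) {
   $repeated = $1;               # outside the pattern, use $1
}

# Replace the doubled word by a single copy, working on a copy of $str.
(my $fixed = $str) =~ s/\b(\w+)\s\1\b/$1/;

print "Repeated word: $repeated\n";   # prints "Repeated word: is"
print "Fixed: $fixed\n";              # prints "Fixed: this is a test"
```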

Character Translation Operator tr///

You can use translator operator to translate a character into another character. The syntax is:

tr/fromchars/tochars/modifiers

replaces or translates fromchars to tochars in $_, and returns the number of characters replaced.

For examples,

tr/a-z/A-Z/         # converts $_ to uppercase.
tr/dog/cat/         # translates d to c, o to a, g to t.
$str =~ tr/0-9/a-j/ # replace 0 by a, etc in $str.
tr/A-CG/KX-Z/       # replace A by K, B by X, C by Y, G by Z.

Instead of forward slash (/), you can use parentheses (), brackets [], curly bracket {} as delimiter, e.g.,

tr[0-9][##########]  # replace numbers by #.
tr{!.}(.!)           # swap ! and ., one pass.

If tochars is shorter than fromchars, the last character of tochars is used repeatedly.

tr/a-z/A-E/       # f to z is replaced by E.

tr/// returns the number of replaced characters. You can use it to count the occurrence of certain characters. For examples,

my $numLetters = ($string =~ tr/a-zA-Z/a-zA-Z/);
my $numDigits  = ($string =~ tr/0-9/0-9/);
my $numSpaces  = ($string =~ tr/ / /);
Modifiers /c, /d and /s for tr///
  • /c: complements (inverses) fromchars.
  • /d: deletes any matched but un-replaced characters.
  • /s: squashes duplicate characters into just one.

For examples,

tr/A-Za-z/ /c  # replaces all non-alphabets with space
tr/A-Z//d      # deletes all uppercase (matched with no replacement).
tr/A-Za-z//dc  # deletes all non-alphabets
tr/!//s        # squashes duplicate !
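
The modifiers can be tried out on a copy of a string, so that each result can be compared with the original. This is a minimal sketch; the sample string and variable names are assumptions made for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $s = 'Hello,   World!!!';   # hypothetical sample: 3 spaces, 3 bangs

# (my $copy = $s) =~ tr/.../.../ copies $s first, then translates the copy.
(my $letters_only = $s) =~ tr/A-Za-z//cd;   # /c /d: delete all non-letters
(my $no_upper     = $s) =~ tr/A-Z//d;       # /d: delete uppercase letters
(my $squeezed     = $s) =~ tr/ !//s;        # /s: squeeze runs of ' ' and '!'

# With an empty tochars and no /d, tr/// only counts; $s is not changed.
my $lowercase_count = ($s =~ tr/a-z//);

print "$letters_only\n";      # prints "HelloWorld"
print "$no_upper\n";          # prints "ello,   orld!!!"
print "$squeezed\n";          # prints "Hello, World!"
print "$lowercase_count\n";   # prints "8"
```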

String Functions: split and join

split(regex, str, [numItems]): Splits the given str using the regex as the separator, and returns the items as a list. The optional third parameter specifies the maximum number of items to return; the last item then holds the unsplit remainder of str.

join(joinStr, strList): Joins the items in strList with the given joinStr (possibly empty).

For examples,

#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';             # Enable say (Perl 5.10 or later)

my $msg = 'Hello, world again!';
my @words = split(/ /, $msg);  # ('Hello,', 'world', 'again!')
for (@words) { say; }          # Use default scalar variable

say join('--', @words);        # 'Hello,--world--again!'
my $newMsg = join '', @words;  # 'Hello,worldagain!'
say $newMsg;
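
The optional third parameter of split can be sketched as follows. The colon-separated record is a made-up sample in the style of a passwd entry, and the variable names are assumptions.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical colon-separated record.
my $record = 'alice:x:1000:1000:Alice Smith:/home/alice:/bin/sh';

# No limit: every field becomes a separate item.
my @all = split(/:/, $record);

# Limit of 2: the first field, then the unsplit remainder.
my ($user, $rest) = split(/:/, $record, 2);

print scalar(@all), " fields\n";   # prints "7 fields"
print "user: $user\n";             # prints "user: alice"
print "rest: $rest\n";             # prints "rest: x:1000:1000:Alice Smith:/home/alice:/bin/sh"
```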

Functions grep, map

  • grep(regex, array): selects those elements of the array that match regex.
  • map(expr, array): returns a new list constructed by evaluating expr (which may contain a regex, and sees each element as $_) for each element of the array.
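
A minimal sketch of both functions follows; the list of file names and the variable names are assumptions made for illustration. The block form map { ... } is used here, which evaluates the block once per element with the element in $_.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my @files = ('report.txt', 'data.csv', 'notes.txt');   # hypothetical names

# grep keeps the elements that match the regex.
my @txt = grep(/\.txt$/, @files);

# map builds a new list from the value of the block for each element.
my @upper = map { uc($_) } @files;

# map with a regex: keep the part before the first dot, skip non-matches.
my @stems = map { $_ =~ /^(\w+)\./ ? $1 : () } @files;

print "@txt\n";     # prints "report.txt notes.txt"
print "@upper\n";   # prints "REPORT.TXT DATA.CSV NOTES.TXT"
print "@stems\n";   # prints "report data notes"
```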

File Input/Output

Filehandle

A filehandle is a data structure which your program can use to manipulate files.  A filehandle acts as a gate between your program and the files, directories, or other programs.  Your program first opens a gate, then sends or receives data through the gate, and finally closes the gate. There are many types of gates: one-way vs. two-way, slow vs. fast, wide vs. narrow.

Naming Convention: use uppercase for the name of the filehandle, e.g., FILE, DIR, FILEIN, FILEOUT, etc.

Once a filehandle is created and connected to a file (or a directory, or a program), you can read or write to the underlying file through the filehandle using angle brackets, e.g., <FILEHANDLE>.

Example: Read and print the content of a text file via a filehandle.

#!/usr/bin/env perl
use strict;
use warnings;
   
# FileRead.pl: Read & print the content of a text file.
my $filename = shift;    # Get the filename from command line.
  
# Create a filehandle called FILE and connect to the file.
open(FILE, $filename) or die "Can't open $filename: $!";

while (<FILE>) {      # Set $_ to each line of the file 
   print;             # Print $_
}

Example: Search and print lines containing a particular search word.

#!/usr/bin/env perl
use strict;
use warnings;
   
# FileSearch.pl: Search for lines containing a search word.
(my $filename, my $word) = @ARGV;   # Get filename & search word.

# Create a filehandle called FILE and connect to the file.
open(FILE, $filename) or die "Can't open $filename: $!";
  
while (<FILE>) {           # Set $_ to each line of the file
   print if /\b$word\b/i;  # Match $_ with word, case insensitive
}

Example: Print the content of a directory via a directory handle.

#!/usr/bin/env perl
use strict;
use warnings;
   
# DirPrint.pl: Print the content of a directory.
my $dirname = shift;       # Get directory name from command-line
opendir(DIR, $dirname) or die "Can't open directory $dirname: $!";
my @files = readdir(DIR);
foreach my $file (@files) {
  # Display files not beginning with dot.
  print "$file\n" if ($file !~ /^\./);
}

You can use C-style printf for formatted output to a file.

File Handling Functions

Function open: open(filehandle, string) opens the file named by string and associates it with the filehandle. It returns true on success and undef otherwise.

  • If string begins with < (or nothing), it is opened for reading.
  • If string begins with >, it is opened for writing.
  • If string begins with >>, it is opened for appending.
  • If string begins with +<, +>, +>>, it is opened for both reading and writing.
  • If string is -, STDIN is opened.
  • If string is >-, STDOUT is opened.
  • If string begins with -| or |-, your process will fork() to execute the pipe command.

Function close: close(filehandle) closes the file associated with the filehandle. When the program exits normally, Perl closes all open filehandles. Closing a file flushes the output buffer to the file.  Closing a file explicitly as soon as you are done with it ensures that the data has been written out even if the program is aborted later.

A common procedure for modifying a file is to:

  1. Read in the entire file with open(FILE, $filename) and @lines = <FILE>.
  2. Close the filehandle.
  3. Operate upon @lines (which is in the fast RAM) rather than FILE (which is in the slow disk).
  4. Write the new file contents using open(FILE, ">$filename") and print FILE @lines.
  5. Close the file handle.

Example: Read the contents of the entire file into memory; modify and write back to disk.

#!/usr/bin/env perl
use strict;
use warnings;
   
# FileChange.pl
my $filename = shift;       # Get the filename from command line.

# Create a filehandle called FILE and connect to the file.
open(FILE, $filename) or die "Can't open $filename: $!";
# Read the entire file into an array in memory.
my @lines = <FILE>;
close(FILE);

open(FILE, ">$filename") or die "Can't write to $filename: $!";
foreach my $line (@lines) {
   print FILE uc($line);   # Change to uppercase
}
close(FILE);

Example: Reading from a file

#!/usr/bin/env perl
use strict;
use warnings;
   
open(FILEIN, "test.txt") or die "Can't open file: $!";
while (<FILEIN>) {     # set $_ to each line of the file.
   print;              # print $_
}

Example: Writing to a file

#!/usr/bin/env perl
use strict;
use warnings;
   
my $filename = shift;         # Get the file from command line.
open(FILE, ">$filename") or die "Can't write to $filename: $!";
print FILE "This is line 1\n";    # no comma after FILE.
print FILE "This is line 2\n";
print FILE "This is line 3\n";

Example: Appending to a file

#!/usr/bin/env perl
use strict;
use warnings;
   
my $filename = shift;             # Get the file from command line.
open(FILE, ">>$filename") or die "Can't append to $filename: $!";
print FILE "This is line 4\n";     # no comma after FILE.
print FILE "This is line 5\n";

In-Place Editing

Instead of reading from one file and writing to another, you could do in-place editing by specifying the -i flag or by setting the special variable $^I.

  • The -ibackupExtension flag tells Perl to edit files in-place.  If a backupExtension is provided, a backup file will be created with the backupExtension.
  • The special variable $^I=backupExtension does the same thing.

Example: In-place editing using -i flag

#!/usr/bin/perl -i.old
# In-place edit, backup as '.old'. The interpreter path is given directly
# (adjust it for your system) because env may not pass on the -i switch.
use strict;
use warnings;
   

while (<>) {
  s/line/TEST/g;
  print;           # Print to the file, not STDOUT.
}

Example:  In-place editing using $^I special variable.

#!/usr/bin/env perl
use strict;
use warnings;
   
$^I = '.bak';      # Enable in-place editing, backup in '.bak'.
while (<>) {
  s/TEST/line/g;
  print;           # Print to the file, not STDOUT.
}

Functions seek, tell, truncate

seek(filehandle, position, whence): moves the file pointer of the filehandle to position, as measured from whence. seek() returns 1 upon success and 0 otherwise.  File position is measured in bytes.  A whence of 0 measures from the beginning of the file; 1 from the current position; and 2 from the end.  For example:

seek(FILE, 0, 2);    # 0 bytes from end-of-file, i.e., move to the end.
seek(FILE, -2, 2);   # 2 bytes before end-of-file.
seek(FILE, -10, 1);  # Move file pointer 10 byte backward.
seek(FILE, 20, 0);   # 20 bytes from the begin-of-file.

tell(filehandle): returns the current file position of filehandle.

truncate(FILE, length): truncates FILE to length bytes.  FILE can be either a filehandle or a file name.

To find the length of a file, you could:

seek(FILE, 0, 2);   # Move file pointer to end of file.
print tell(FILE);   # Print the file size.

Example: Truncate the last 2 bytes if they begin with \x0D.

#!/usr/bin/env perl
use strict;
use warnings;
   
my $filename = shift;            # Get the file from command line.
open(FILE, "+<$filename") or die "Can't open $filename: $!";
seek(FILE, -2, 2);        # 2 byte before end-of-file.
my $pos = tell FILE;
my $data = <FILE>;        # read moves the file pointer.
if ($data =~ /^\x0D/) {   # begin with 0D
  truncate FILE, $pos;    # truncate last 2 bytes.
}

Function eof

eof(filehandle) returns 1 if the file pointer is positioned at the end of the file or if the filehandle is not opened.
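
One use of eof is to detect the last line while reading. The sketch below is written under some assumptions: to stay self-contained it reads from an in-memory string through a lexical filehandle and the three-argument form of open, rather than from a file on disk; the sample content and variable names are made up.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Assumption: an in-memory "file" stands in for a real file on disk.
my $content = "line 1\nline 2\nline 3\n";
open(my $fh, '<', \$content) or die "Can't open in-memory file: $!";

my @out;
while (my $line = <$fh>) {
   chomp $line;
   # After reading a line, eof is true only if nothing is left to read.
   push @out, eof($fh) ? "$line (last)" : $line;
}
close($fh);

print "$_\n" for @out;
```

With this input, only the third line is tagged as the last one.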

Reading Bytes Instead of Lines

The function read(filehandle, var, length, offset) reads length bytes from filehandle starting from the current file pointer, and saves them into variable var starting from offset (if omitted, the default is 0).  The bytes include \x0A, \x0D, etc.

Example
#!/usr/bin/env perl
use strict;
use warnings;
   
(my $numbytes, my $filename) = @ARGV;
open(FILE, $filename) or die "Can't open $filename: $!";
   
my $data;
read(FILE, $data, $numbytes);
print $data, "\n----\n";
   
read(FILE, $data, $numbytes);    # continue from current file ptr
print $data, "\n----\n";
   
read(FILE, $data, $numbytes, 2);  # save in $data offset 2
print $data, "\n----\n";

Piping Data To and From a Process

If you wish your program to receive data from a process or want your program to send data to a process, you could open a pipe to an external program.

  • open(handle, "command|") lets you read from the output of command.
  • open(handle, "|command") lets you write to the input of command.

Both of these statements return the Process ID (PID) of the command.

Example: The dir command lists the current directory.  By opening a pipe from dir, you can access its output.

#!/usr/bin/env perl
use strict;
use warnings;
  
open(PIPEFROM, "dir|") or die "Pipe failed: $!";
while (<PIPEFROM>) {
   print;
}
close PIPEFROM;

Example: This example shows how you can pipe input into the sendmail program.

#!/usr/bin/env perl
use strict;
use warnings;
  
my $my_login = 'test';
open(MAIL, "| sendmail -t -f$my_login") or die "Pipe failed: $!";
print MAIL "From: test101\@test.com\n";   # no comma after MAIL; escape @
print MAIL "To: test102\@test.com\n";
print MAIL "Subject: test\n";
print MAIL "\n";
print MAIL "Testing line 1\n";
print MAIL "Testing line 2\n";
close MAIL;

You cannot pipe data both to and from a command.  If you want to read the output of a command that you have opened with the |command, send the output to a file.  For example,

open (PIPETO, "|command > /output.txt");

Deleting file: Function unlink

unlink(FILES) deletes the FILES, returning the number of files deleted.  Do not use unlink() to delete a directory; use rmdir() instead. For example,

unlink $filename;
unlink "/var/adm/message";
unlink "message";
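
The return value can be used to check how many files were actually removed. The following is a minimal sketch that works in a scratch directory created with the core module File::Temp; the file names and variable names are assumptions made for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Scratch directory, removed automatically when the program exits.
my $dir = tempdir(CLEANUP => 1);

# Create three empty files (hypothetical names).
my @names = map { "$dir/file$_.tmp" } 1 .. 3;
foreach my $name (@names) {
   open(OUT, ">$name") or die "Can't write to $name: $!";
   close(OUT);
}

# Ask unlink to delete the three files plus one that does not exist.
my $deleted = unlink(@names, "$dir/no_such_file");
print "Deleted $deleted of 4 names\n";   # prints "Deleted 3 of 4 names"
```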

Inspecting Files

You can inspect a file using the (-test FILE) condition.  The condition returns true if FILE satisfies test.  FILE can be a filehandle or a filename.  The available tests are:

  • -e: exists.
  • -f: plain file.
  • -d: directory.
  • -T: seems to be a text file (data from 0 to 127).
  • -B: seems to be a binary file (data from 0 to 255).
  • -r: readable.
  • -w: writable.
  • -x: executable.
  • -s: returns the size of the file in bytes.
  • -z: empty (zero byte).
Example
#!/usr/bin/env perl
use strict;
use warnings;
  
my $dir = shift;
opendir(DIR, $dir) or die "Can't open directory: $!";
my @files = readdir(DIR);
closedir(DIR);
   
foreach my $file (@files) {
   if (-f "$dir/$file") {
      print "$file is a file\n";
      print "$file seems to be a text file\n" if (-T "$dir/$file");
      print "$file seems to be a binary file\n" if (-B "$dir/$file");
      my $size = -s "$dir/$file";
      print "$file size is $size\n";
      print "$file is empty\n" if (-z "$dir/$file");
   } elsif (-d "$dir/$file") {
      print "$file is a directory\n";
   }
   print "$file is readable\n" if (-r "$dir/$file");
   print "$file is writable\n" if (-w "$dir/$file");
   print "$file is executable\n" if (-x "$dir/$file");
}

Function stat and lstat

The function stat(FILE) returns a 13-element array giving the vital statistics of FILE. lstat(SYMLINK) returns the same information for the symbolic link SYMLINK itself, rather than for the file it points to.

The elements are:

Index  Value
0      The device
1      The file's inode
2      The file's mode
3      The number of hard links to the file
4      The user ID of the file's owner
5      The group ID of the file
6      The raw device
7      The size of the file
8      The last accessed time
9      The last modified time
10     The last time the file's status changed
11     The block size of the system
12     The number of blocks used by the file

For example, the command (quoted here for a Unix shell; on the Windows command prompt, swap the single and double quotes)

perl -e '$size = (stat("test.txt"))[7]; print $size'

prints the file size of "test.txt".
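
The same lookup can be sketched as a script. To keep it self-contained, the sketch creates its own temporary file with the core module File::Temp instead of relying on an existing test.txt; the file content and variable names are assumptions.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a temporary file holding exactly 10 bytes.
my ($fh, $filename) = tempfile(UNLINK => 1);
print $fh "0123456789";
close($fh);

my @info = stat($filename);          # the 13-element list described above
my ($size, $mtime) = @info[7, 9];    # index 7: size, index 9: modified time

print "size: $size bytes\n";         # prints "size: 10 bytes"
print "modified: ", scalar(localtime($mtime)), "\n";
```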

Accessing the Directories

  • opendir(DIRHANDLE, dirname) opens the directory dirname.
  • closedir(DIRHANDLE) closes the directory handle.
  • readdir(DIRHANDLE) returns the next file from DIRHANDLE in scalar context, or all the remaining files in list context.
  • glob(string) returns an array of filenames matching the wildcard in string,  e.g., glob('*.dat') and glob('test??.txt').
  • mkdir(dirname, mode) creates the directory dirname with the protection specified by mode.
  • rmdir(dirname) deletes the directory dirname, only if it is empty.
  • chdir(dirname) changes the working directory to dirname.
  • chroot(dirname) makes dirname the root directory "/" for the current process, used by superuser only.
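
Several of these functions can be combined in a short script. This is a minimal sketch that works inside a scratch directory created with the core module File::Temp; the directory and file names are made up, and it assumes the scratch path contains no spaces (glob splits its pattern on whitespace).

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $base = tempdir(CLEANUP => 1);    # scratch area, removed on exit
my $sub  = "$base/reports";          # hypothetical sub-directory

mkdir($sub, 0755) or die "Can't create $sub: $!";

# Create a few empty files inside the new directory.
foreach my $name ('a.dat', 'b.dat', 'notes.txt') {
   open(OUT, ">$sub/$name") or die "Can't write to $sub/$name: $!";
   close(OUT);
}

my @dat = glob("$sub/*.dat");        # the two .dat files

# rmdir refuses to delete a directory that is not empty.
my $removed_nonempty = rmdir($sub) ? 1 : 0;   # 0

unlink glob("$sub/*");               # empty the directory first
my $removed_empty = rmdir($sub) ? 1 : 0;      # 1

print scalar(@dat), " .dat files; rmdir: $removed_nonempty then $removed_empty\n";
```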

Example: Print the contents of a given directory.

#!/usr/bin/env perl
use strict;
use warnings;
  
my $dirname = shift;      # first command-line argument.
opendir(DIR, $dirname) or die "can't open $dirname: $!\n";
my @files = readdir(DIR);
closedir(DIR);
foreach my $file (@files) {
   print "$file\n";
}

Example:  Removing empty files in a given directory

#!/usr/bin/env perl
use strict;
use warnings;
  

my $dirname = shift;
opendir(DIR, $dirname) or die "Can't open directory: $!";
my @files = readdir(DIR);
foreach my $file (@files) {
   if ((-f "$dirname/$file") && (-z "$dirname/$file")) {
      print "deleting $dirname/$file\n";
      unlink "$dirname/$file";
   }
}
closedir(DIR);

Example: Display files matching "*.txt"

my @files = glob('*.txt');
foreach (@files) { print; print "\n" }

Example: Display files matching the command-line pattern.

my $file = shift;
my @files = glob($file);
foreach (@files) {
   print;
   print "\n";
}

Standard Filehandles

Perl defines the following standard filehandles:

  • STDIN – Standard Input, usually refers to the keyboard.
  • STDOUT – Standard Output, usually refers to the console.
  • STDERR – Standard Error, usually refers to the console.
  • ARGV – The files named on the command line (listed in @ARGV), read one after another.

For example:

my $line = <STDIN>;    # Set $line to the next line of user input
my $item = <ARGV>;     # Set $item to the next line of the files named on the command line
my @items = <ARGV>;    # Put all the remaining lines of those files into the array

The empty angle brackets <> read from the ARGV filehandle, which Perl opens for you on each file named on the command line in turn. If no files are named, <> reads from STDIN instead.  Whenever you use the print() function without a filehandle, it uses the STDOUT filehandle.

In other words, <> behaves like <ARGV> when files are given on the command line, and like <STDIN> otherwise.
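
The link between <> and the command-line files can be seen by filling @ARGV from inside the program. This is a minimal sketch; it writes a temporary file with the core module File::Temp and then pretends that the file was named on the command line. The file content and variable names are assumptions.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Prepare a small temporary file to read back.
my ($fh, $filename) = tempfile(UNLINK => 1);
print $fh "first\nsecond\n";
close($fh);

# Pretend the file was given on the command line.
@ARGV = ($filename);

my @lines;
while (my $line = <>) {    # reads from the files listed in @ARGV
   chomp $line;
   push @lines, $line;
}

print "Read: @lines\n";    # prints "Read: first second"
```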

Text Formatting

Function write

write(filehandle): prints formatted text to filehandle, using the format associated with filehandle. If filehandle is omitted, STDOUT is used.

Declaring format

format name =
text1
text2
.

Picture Field @<, @|, @>

  • @<: left-flushes the string on the next line of formatting texts.
  • @>: right-flushes the string on the next line of formatting texts.
  • @|: centers the string on the next line of the formatting texts.

@<, @>, @| can be repeated to control the number of characters to be formatted. The number of characters to be formatted is same as the length of the picture field. @###.## formats numbers by lining up the decimal points under ".".
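
A format declaration and write can be combined as in the following minimal sketch. The item data and the variable names are assumptions made for illustration; package variables declared with our are used because the format refers to them by name.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical rows: name, quantity, price.
my @items = (['Apple', 3, 1.5], ['Watermelon', 12, 23.5]);

# Variables referenced by the format below.
our ($name, $qty, $price);

# Each picture line is followed by the line of values that fill its fields.
# The single dot in the first column ends the format.
format STDOUT =
@<<<<<<<<< @>>>> @###.##
$name,     $qty, $price
.

foreach my $item (@items) {
   ($name, $qty, $price) = @$item;
   write;   # Print one formatted row to STDOUT
}
```

Each row shows the name left-flushed in 10 columns, the quantity right-flushed in 5 columns, and the price with two decimal places.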

Printing Formatting String printf

printf(filehandle template, list): prints a formatted string to filehandle (similar to C's fprintf()). Note that there is no comma after the filehandle. For example,

printf(FILE "The number is %d", 15);

The available formatting fields are:

Field  Expected Value
%s     String
%c     Character
%d     Decimal number
%ld    Long decimal number
%u     Unsigned decimal number
%x     Hexadecimal number
%lx    Long hexadecimal number
%o     Octal number
%lo    Long octal number
%f     Fixed-point floating-point number
%e     Exponential floating-point number
%g     Compact floating-point number
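
A few of these fields are tried out in the sketch below. It uses sprintf, which takes the same template as printf but returns the formatted string instead of printing it, so that each result can be stored and inspected. The sample values and variable names are assumptions; the width in %5d and %-5d is a standard addition to the plain fields listed above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my $text   = sprintf("%s has %d items", 'cart', 3);   # string and decimal
my $bases  = sprintf("%x %o", 255, 8);                # hexadecimal and octal
my $fixed  = sprintf("%.2f", 3.14159);                # two decimal places
my $char   = sprintf("%c", 65);                       # character code 65
my $padded = sprintf("[%5d] [%-5d]", 42, 42);         # right- and left-aligned

print "$text\n";     # prints "cart has 3 items"
print "$bases\n";    # prints "ff 10"
print "$fixed\n";    # prints "3.14"
print "$char\n";     # prints "A"
print "$padded\n";   # prints "[   42] [42   ]"

printf(STDOUT "The number is %d\n", 15);   # no comma after the filehandle
```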
