Perl is famous for processing text files via regular expressions.
Regular Expressions in Perl
A Regular Expression (or Regex) is a pattern (or filter) that describes a set of strings that matches the pattern. In other words, a regex accepts a certain set of strings and rejects the rest.
I shall assume that you are familiar with Regex syntax. Otherwise, you could read:
- "Regex Syntax Summary" for a summary of regex syntax and examples.
- "Regular Expressions" for full coverage.
Perl makes extensive use of regular expressions, with many built-in syntaxes and operators. In Perl (and JavaScript), a regex is delimited by a pair of forward slashes (by default), in the form of /regex/. You can use the built-in operators:
- m/regex/modifier: Match against the regex.
- s/regex/replacement/modifier: Substitute matched substring(s) by the replacement.
Matching Operator m//
You can use the matching operator m// to check if a regex pattern exists in a string. The syntax is:
m/regex/
m/regex/modifiers   # Optional modifiers
/regex/             # Operator m can be omitted if forward slashes are used as the delimiter
/regex/modifiers
Delimiter
Instead of using forward slashes (/) as the delimiter, you could use other non-alphanumeric characters such as !, @ and %, in the form of m!regex!modifiers, m@regex@modifiers or m%regex%modifiers. However, if the forward slash (/) is used as the delimiter, the operator m can be omitted, in the form of /regex/modifiers. Changing the default delimiter is confusing, and not recommended.
m//, by default, operates on the default variable $_. It returns true if $_ matches regex, and false otherwise.
Example 1: Regex [0-9]+
#!/usr/bin/env perl
# try_m_1.pl
use strict;
use warnings;
while (<>) {   # Read input from command-line into default variable $_
    print m/[0-9]+/ ? "Accept\n" : "Reject\n";   # one or more digits?
}
$ ./try_m_1.pl
123
Accept
00000
Accept
abc
Reject
abc123
Accept
Example 2: Extracting the Matched Substrings
The built-in array variables @- and @+ keep the start and end positions of the matched substring, where $-[0] and $+[0] are for the full match, and $-[n] and $+[n] are for the back-references $1, $2, ..., $n.
#!/usr/bin/env perl
# try_m_2.pl
use strict;
use warnings;
while (<>) {   # Read input from command-line into default variable $_
    if (m/[0-9]+/) {
        print 'Accept substring: ' . substr($_, $-[0], $+[0] - $-[0]) . "\n";
    } else {
        print "Reject\n";
    }
}
$ ./try_m_2.pl
123
Accept substring: 123
00000
Accept substring: 00000
abc
Reject
abc123xyz
Accept substring: 123
abc123xyz456
Accept substring: 123
Example 3: Modifier 'g' (global)
By default, m// finds only the first match. To find all matches, include the 'g' (global) modifier.
#!/usr/bin/env perl
# try_m_3.pl
use strict;
use warnings;
my $regex = '[0-9]+';   # Define regex pattern in non-interpolating string
while (<>) {   # Read input from command-line into default variable $_
    # Do m//g and save matched substrings into an array
    my @matches = /$regex/g;
    print "Matched substrings (in array): @matches\n";   # print array
    # Do m//g in a loop
    print 'Matched substrings (in loop) : ';
    while (/$regex/g) {
        print substr($_, $-[0], $+[0] - $-[0]), ',';
    }
    print "\n";
}
$ ./try_m_3.pl
abc123xyz456_0_789
Matched substrings (in array): 123 456 0 789
Matched substrings (in loop) : 123,456,0,789,
abc
Matched substrings (in array):
Matched substrings (in loop) :
123
Matched substrings (in array): 123
Matched substrings (in loop) : 123,
Operators =~ and !~
By default, the matching operators operate on the default variable $_. To operate on a variable other than $_, you could use the =~ and !~ operators as follows:
str =~ m/regex/modifiers   # Return true if str matches regex.
str !~ m/regex/modifiers   # Return true if str does NOT match regex.
When used with m//, =~ behaves like a comparison (== or eq).
Example 4: =~ Operator
#!/usr/bin/env perl
# try_m_4.pl
use strict;
use warnings;
print 'yes or no? ';
my $reply;
chomp($reply = <>);   # Remove newline
print $reply =~ /^y/i ? "positive!\n" : "negative!\n";   # Begins with 'y', case-insensitive
Substitution Operator s///
You can substitute a string (or a portion of a string) with another string using the s/// substitution operator. The syntax is:
s/regex/replacement/
s/regex/replacement/modifiers   # Optional modifiers
Similar to m//, s/// operates on the default variable $_ by default. To operate on another variable, bind it with the =~ operator. When used with s///, =~ behaves like assignment (=): the variable on the left is modified in place.
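The in-place behavior can be seen in a minimal sketch. The sample string and variable names are made up for illustration; the only feature relied on is that s/// returns the number of substitutions it made.

```perl
use strict;
use warnings;

# Hypothetical sample string, used only for illustration.
my $text = 'one apple, two apples';

# s///g modifies $text in place and returns the number of substitutions made.
my $count = ($text =~ s/apple/orange/g);

print "$count substitution(s): $text\n";
```

Because the return value is a count, it is also usable as a condition: it is false when nothing was replaced.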
Example 5: s///
#!/usr/bin/env perl
# try_s_1.pl
use strict;
use warnings;
while (<>) {   # Read input from command-line into default variable $_
    s/\w+/\*\*\*/g;   # Match each word
    print "$_";
}
$ ./try_s_1.pl
this is an apple.
*** *** *** ***.
Modifiers
Modifiers (such as /g, /i, /e, /o, /s and /x) can be used to control the behavior of m// and s///.
- g (global): By default, only the first occurrence of the matching string in each line is processed. You can use the modifier /g to specify global operation.
- i (case-insensitive): By default, matching is case-sensitive. You can use the modifier /i to enable case-insensitive matching.
- m (multiline): treats the string as multiple lines, so that the position anchors ^ and $ also match at the start and end of each line within the string. \A and \Z are not affected; they continue to anchor to the string as a whole.
- s (single line): permits the metacharacter . (dot) to match the newline.
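The effect of /i, /m and /s can be checked with a small sketch. The two-line sample string is an assumption made for the example; each match result is stored as 1 or 0 so that the differences are easy to compare.

```perl
use strict;
use warnings;

# Hypothetical two-line sample string.
my $text = "Hello\nworld";

my $plain       = $text =~ /^world/       ? 1 : 0;  # 0: ^ anchors only at the start of the string
my $multiline   = $text =~ /^world/m      ? 1 : 0;  # 1: /m lets ^ match after the embedded newline
my $dot_default = $text =~ /Hello.world/  ? 1 : 0;  # 0: . does not match "\n" by default
my $dot_single  = $text =~ /Hello.world/s ? 1 : 0;  # 1: /s lets . match "\n"
my $nocase      = $text =~ /hello/i       ? 1 : 0;  # 1: /i ignores case

print "plain=$plain multiline=$multiline dot=$dot_default dot_s=$dot_single nocase=$nocase\n";
```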
Parenthesized Back-References & Matched Variables $1, ..., $9
Parentheses ( ) serve two purposes in regex:
- Firstly, parentheses ( ) can be used to group sub-expressions for overriding the precedence or applying a repetition operator. For example, /(a|e|i|o|u){3,5}/ matches a run of three to five vowels (such as aei or ooo), just like /[aeiou]{3,5}/. It is not the same as /a{3,5}|e{3,5}|i{3,5}|o{3,5}|u{3,5}/, which requires the same vowel to be repeated.
- Secondly, parentheses are used to provide the so-called back-references. A back-reference contains the matched substring. For example, the regex /(\S+)/ creates one back-reference (\S+), which contains the first word (consecutive non-spaces) in the input string; the regex /(\S+)\s+(\S+)/ creates two back-references: (\S+) and another (\S+), containing the first two words, separated by one or more spaces \s+.
The back-references are stored in the special variables $1, $2, …, $9, where $1 contains the substring that matched the first pair of parentheses, and so on. For example, /(\S+)\s+(\S+)/ creates two back-references which match the first two words. The matched words are stored in $1 and $2, respectively.
For example, the following expression swaps the first and second words:
s/(\S+) (\S+)/$2 $1/;   # Swap the first and second words separated by a single space
Back-references can also be referenced in your program.
For example,
(my $word) = ($str =~ /(\S+)/);
The parentheses create one back-reference, which matches the first word of $str if there is one, and is placed inside the scalar variable $word. If there is no match, $word is undef.
Another example,
(my $word1, my $word2) = ($str =~ /(\S+)\s+(\S+)/);
The 2 pairs of parentheses place the first two words (separated by one or more white-spaces) of $str into the variables $word1 and $word2 if there are at least two words; otherwise, both $word1 and $word2 are undef. Note that the whole regex must match; there is no partial matching.
\1, \2, \3 have the same meaning as $1, $2, $3, but are valid only inside the regex of s/// or m//. For example, /(\S+)\s\1/ matches a pair of repeated words, separated by a white-space.
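As a rough sketch of both uses, the following captures the first two words in list context and then uses \1 to detect a doubled word. The sample sentence is hypothetical, and the \b anchors are an addition to the pattern above so that only whole words are compared.

```perl
use strict;
use warnings;

# Hypothetical sample sentence containing a doubled word.
my $str = 'this is is a test';

# List-context match: the two captures are returned as a list.
my ($first, $second) = $str =~ /(\S+)\s+(\S+)/;

# \1 inside the pattern refers to whatever group 1 matched;
# \b keeps the comparison on whole words.
my $repeated = '';
if ($str =~ /\b(\w+)\s+\1\b/) {
    $repeated = $1;
}

print "first=$first second=$second\n";
print "repeated word: $repeated\n";
```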
Character Translation Operator tr///
You can use the translation operator to translate a character into another character. The syntax is:
tr/fromchars/tochars/modifiers
It replaces or translates fromchars to tochars in $_, and returns the number of characters replaced.
For example,
tr/a-z/A-Z/           # converts $_ to uppercase.
tr/dog/cat/           # translates d to c, o to a, g to t.
$str =~ tr/0-9/a-j/   # replaces 0 by a, etc. in $str.
tr/A-CG/KX-Z/         # replaces A by K, B by X, C by Y, G by Z.
Instead of the forward slash (/), you can use parentheses (), brackets [] or curly brackets {} as delimiters, e.g.,
tr[0-9][##########]   # replaces digits by #.
tr{!.}(.!)            # swaps ! and ., in one pass.
If tochars is shorter than fromchars, the last character of tochars is used repeatedly.
tr/a-z/A-E/           # f to z are replaced by E.
tr/// returns the number of replaced characters. You can use it to count the occurrences of certain characters. For example,
my $numLetters = ($string =~ tr/a-zA-Z/a-zA-Z/);
my $numDigits  = ($string =~ tr/0-9/0-9/);
my $numSpaces  = ($string =~ tr/ / /);
Modifiers /c, /d and /s for tr///
- /c: complements (inverts) fromchars.
- /d: deletes any matched but un-replaced characters.
- /s: squashes duplicate characters into just one.
For example,
tr/A-Za-z/ /c   # replaces all non-alphabetic characters with a space
tr/A-Z//d       # deletes all uppercase letters (matched with no replacement).
tr/A-Za-z//dc   # deletes all non-alphabetic characters
tr/!//s         # squashes duplicate !
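The modifiers can be tried out on copies of a string, as in this sketch. The sample string is hypothetical, and the (my $copy = $s) =~ tr... form is used here only so that the original string stays intact.

```perl
use strict;
use warnings;

# Hypothetical sample string.
my $s = 'Hello,   World!!! 2024';

# Counting: with an empty tochars (and no /d), the characters are left
# unchanged and tr/// simply returns how many characters matched.
my $digits  = ($s =~ tr/0-9//);
my $letters = ($s =~ tr/a-zA-Z//);

(my $squeezed = $s) =~ tr/ !//s;        # squash runs of spaces and runs of '!'
(my $alpha    = $s) =~ tr/A-Za-z//cd;   # delete everything that is not a letter
(my $upper    = $s) =~ tr/a-z/A-Z/;     # uppercase a copy, leaving $s intact

print "digits=$digits letters=$letters\n";
print "$squeezed\n$alpha\n$upper\n";
```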
String Functions: split and join
split(regex, str, [numItems]): Splits the given str using the regex as the separator, and returns the items in an array. The optional third parameter specifies the maximum number of items to be returned.
join(joinStr, strList): Joins the items in strList with the given joinStr (possibly empty).
For example,
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';   # say() must be enabled explicitly
my $msg = 'Hello, world again!';
my @words = split(/ /, $msg);   # ('Hello,', 'world', 'again!')
for (@words) { say; }           # Use default scalar variable
say join('--', @words);         # 'Hello,--world--again!'
my $newMsg = join '', @words;   # 'Hello,worldagain!'
say $newMsg;
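The optional third argument of split can be sketched as follows. The colon-separated record is a made-up sample; with a limit of 3, the remainder of the string is left unsplit in the last item.

```perl
use strict;
use warnings;

# Hypothetical colon-separated record.
my $record = 'alice:x:1000:1000:Alice Smith';

my @all   = split(/:/, $record);      # every field
my @first = split(/:/, $record, 3);   # at most 3 items; the rest stays in the last one

print scalar(@all),   " items: ", join('|', @all),   "\n";
print scalar(@first), " items: ", join('|', @first), "\n";
```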
Functions grep, map
- grep(regex, array): selects those elements of the array that match regex.
- map(expr, array): returns a new array constructed by applying expr (for example, a substitution or a function call) to each element of the array.
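A short sketch of both functions, using the block form grep { ... } and map { ... }. The list of file names is hypothetical; inside each block, the current element is available in $_.

```perl
use strict;
use warnings;

# Hypothetical list of file names.
my @names = ('notes.txt', 'image.png', 'todo.txt', 'README');

my @txt    = grep { /\.txt$/ }  @names;   # keep elements matching the regex
my @others = grep { !/\.txt$/ } @names;   # keep elements that do not match
my @upper  = map  { uc($_) }    @txt;     # transform each element

# Strip the extension from a copy of each element, leaving @txt untouched.
my @bases  = map { (my $base = $_) =~ s/\.txt$//; $base } @txt;

print "txt   : @txt\n";
print "others: @others\n";
print "upper : @upper\n";
print "bases : @bases\n";
```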
File Input/Output
Filehandle
A filehandle is a data structure which your program can use to manipulate files. A filehandle acts as a gate between your program and the files, directories, or other programs. Your program first opens a gate, then sends or receives data through the gate, and finally closes the gate. There are many types of gates: one-way vs. two-way, slow vs. fast, wide vs. narrow.
Naming Convention: use uppercase for the name of the filehandle, e.g., FILE, DIR, FILEIN, FILEOUT, etc.
Once a filehandle is created and connected to a file (or a directory, or a program), you can read from the underlying file through the filehandle using angle brackets, e.g., <FILEHANDLE>, and write to it using print FILEHANDLE.
Example: Read and print the content of a text file via a filehandle.
#!/usr/bin/env perl
use strict;
use warnings;
# FileRead.pl: Read & print the content of a text file.
my $filename = shift;   # Get the filename from command line.
# Create a filehandle called FILE and connect to the file.
open(FILE, $filename) or die "Can't open $filename: $!";
while (<FILE>) {   # Set $_ to each line of the file
    print;         # Print $_
}
Example: Search and print lines containing a particular search word.
#!/usr/bin/env perl
use strict;
use warnings;
# FileSearch.pl: Search for lines containing a search word.
(my $filename, my $word) = @ARGV;   # Get filename & search word.
# Create a filehandle called FILE and connect to the file.
open(FILE, $filename) or die "Can't open $filename: $!";
while (<FILE>) {                # Set $_ to each line of the file
    print if /\b$word\b/i;      # Match $_ with word, case insensitive
}
Example: Print the content of a directory via a directory handle.
#!/usr/bin/env perl
use strict;
use warnings;
# DirPrint.pl: Print the content of a directory.
my $dirname = shift;   # Get directory name from command-line
opendir(DIR, $dirname) or die "Can't open directory $dirname: $!";
my @files = readdir(DIR);
foreach my $file (@files) {
    # Display files not beginning with dot.
    print "$file\n" if ($file !~ /^\./);
}
You can use the C-style printf for formatted output to a file.
File Handling Functions
Function open: open(filehandle, string) opens the file named by string and associates it with the filehandle. It returns true on success and undef otherwise.
- If string begins with < (or nothing), it is opened for reading.
- If string begins with >, it is opened for writing.
- If string begins with >>, it is opened for appending.
- If string begins with +<, +> or +>>, it is opened for both reading and writing.
- If string is -, STDIN is opened.
- If string is >-, STDOUT is opened.
- If string begins with -| or |-, your process will fork() to execute the pipe command.
Function close: close(filehandle) closes the file associated with the filehandle. Closing a file flushes the output buffer to the file. When the program exits, Perl closes all opened filehandles, so an explicit close is mainly needed to ensure data integrity, e.g., in case the user aborts the program.
A common procedure for modifying a file is to:
- Read in the entire file with open(FILE, $filename) and @lines = <FILE>.
- Close the filehandle.
- Operate upon @lines (which is in the fast RAM) rather than FILE (which is on the slow disk).
- Write the new file contents using open(FILE, ">$filename") and print FILE @lines.
- Close the filehandle.
Example: Read the contents of the entire file into memory; modify and write back to disk.
#!/usr/bin/env perl
use strict;
use warnings;
# FileChange.pl
my $filename = shift;   # Get the filename from command line.
# Create a filehandle called FILE and connect to the file.
open(FILE, $filename) or die "Can't open $filename: $!";
# Read the entire file into an array in memory.
my @lines = <FILE>;
close(FILE);
open(FILE, ">$filename") or die "Can't write to $filename: $!";
foreach my $line (@lines) {
    print FILE uc($line);   # Change to uppercase
}
close(FILE);
Example: Reading from a file
#!/usr/bin/env perl
use strict;
use warnings;
open(FILEIN, "test.txt") or die "Can't open file: $!";
while (<FILEIN>) {   # set $_ to each line of the file.
    print;           # print $_
}
Example: Writing to a file
#!/usr/bin/env perl
use strict;
use warnings;
my $filename = shift;   # Get the file from command line.
open(FILE, ">$filename") or die "Can't write to $filename: $!";
print FILE "This is line 1\n";   # no comma after FILE.
print FILE "This is line 2\n";
print FILE "This is line 3\n";
Example: Appending to a file
#!/usr/bin/env perl
use strict;
use warnings;
my $filename = shift;   # Get the file from command line.
open(FILE, ">>$filename") or die "Can't append to $filename: $!";
print FILE "This is line 4\n";   # no comma after FILE.
print FILE "This is line 5\n";
In-Place Editing
Instead of reading from one file and writing to another, you could do in-place editing by specifying the -i flag or by using the special variable $^I.
- The -ibackupExtension flag tells Perl to edit files in-place. If a backupExtension is provided, a backup file will be created with the backupExtension.
- The special variable $^I=backupExtension does the same thing.
Example: In-place editing using the -i flag
#!/usr/bin/perl -i.old
# In-place edit, backup as '.old'.
# The interpreter path is written out because many systems pass only a single
# argument on the #! line, so "env perl -i.old" may fail; adjust the path as needed.
use strict;
use warnings;
while (<>) {
    s/line/TEST/g;
    print;   # Print to the file, not STDOUT.
}
Example: In-place editing using the $^I special variable.
#!/usr/bin/env perl
use strict;
use warnings;
$^I = '.bak';   # Enable in-place editing, backup in '.bak'.
while (<>) {
    s/TEST/line/g;
    print;   # Print to the file, not STDOUT.
}
Functions seek, tell, truncate
seek(filehandle, position, whence): moves the file pointer of the filehandle to position, as measured from whence. seek() returns 1 upon success and 0 otherwise. File position is measured in bytes. A whence of 0 measures from the beginning of the file; 1 measures from the current position; and 2 measures from the end. For example:
seek(FILE, 0, 2);     # 0 bytes from end-of-file, i.e., the end of the file.
seek(FILE, -2, 2);    # 2 bytes before end-of-file.
seek(FILE, -10, 1);   # Move file pointer 10 bytes backward.
seek(FILE, 20, 0);    # 20 bytes from the beginning of the file.
tell(filehandle): returns the current file position of filehandle.
truncate(FILE, length): truncates FILE to length bytes. FILE can be either a filehandle or a file name.
To find the length of a file, you could:
seek(FILE, 0, 2);     # Move file pointer to end of file.
print tell(FILE);     # Print the file size.
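The three functions can be exercised together on a scratch file, as in this sketch. It assumes the core File::Temp module for creating the temporary file; the file contents and variable names are made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file with known contents (12 bytes); removed at exit.
my ($fh, $tmpname) = tempfile(UNLINK => 1);
binmode($fh);
print $fh "Hello\nworld\n";
close($fh);

open(FILE, "+<$tmpname") or die "Can't open $tmpname: $!";
binmode(FILE);
seek(FILE, 0, 2);        # move to end-of-file
my $size = tell(FILE);   # position at the end equals the size in bytes
seek(FILE, 6, 0);        # 6 bytes from the beginning
my $rest = <FILE>;       # reads the second line
truncate(FILE, 6) or die "Can't truncate $tmpname: $!";   # keep only the first line
close(FILE);

my $new_size = -s $tmpname;
print "size before: $size, after: $new_size\n";
```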
Example: Truncate the last 2 bytes if they begin with \x0D.
#!/usr/bin/env perl
use strict;
use warnings;
my $filename = shift;   # Get the file from command line.
open(FILE, "+<$filename") or die "Can't open $filename: $!";
seek(FILE, -2, 2);      # 2 bytes before end-of-file.
my $pos = tell FILE;
my $data = <FILE>;      # read moves the file pointer.
if ($data =~ /^\x0D/) {    # begins with 0D
    truncate FILE, $pos;   # truncate last 2 bytes.
}
Function eof
eof(filehandle) returns 1 if the file pointer is positioned at the end of the file or if the filehandle is not opened.
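One possible use of eof is to recognize the last line while reading a file. The sketch below assumes the core File::Temp module for a scratch file; the contents are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file with three lines; removed at exit.
my ($fh, $tmpname) = tempfile(UNLINK => 1);
print $fh "first\nsecond\nlast\n";
close($fh);

open(FILE, $tmpname) or die "Can't open $tmpname: $!";
my @marked;
while (my $line = <FILE>) {
    chomp $line;
    # eof(FILE) becomes true once the line just read was the final one.
    push @marked, eof(FILE) ? "$line (last line)" : $line;
}
close(FILE);

print "$_\n" for @marked;
```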
Reading Bytes Instead of Lines
The function read(filehandle, var, length, offset) reads length bytes from filehandle starting from the current file pointer, and saves them into the variable var starting from offset (if omitted, the default is 0). The bytes include \x0A, \x0D, etc.
Example
#!/usr/bin/env perl
use strict;
use warnings;
(my $numbytes, my $filename) = @ARGV;
open(FILE, $filename) or die "Can't open $filename: $!";
my $data;
read(FILE, $data, $numbytes);
print $data, "\n----\n";
read(FILE, $data, $numbytes);      # continue from current file ptr
print $data, "\n----\n";
read(FILE, $data, $numbytes, 2);   # save in $data from offset 2, keeping its first 2 bytes
print $data, "\n----\n";
Piping Data To and From a Process
If you wish your program to receive data from a process or want your program to send data to a process, you could open a pipe to an external program.
- open(handle, "command|") lets you read from the output of command.
- open(handle, "|command") lets you write to the input of command.
Both of these statements return the process ID (PID) of the command.
Example: The dir command lists the current directory. By opening a pipe from dir, you can access its output.
#!/usr/bin/env perl
use strict;
use warnings;
open(PIPEFROM, "dir|") or die "Pipe failed: $!";
while (<PIPEFROM>) {
    print;
}
close PIPEFROM;
Example: This example shows how you can pipe input into the sendmail program.
#!/usr/bin/env perl
use strict;
use warnings;
my $my_login = 'test';
open(MAIL, "| sendmail -t -f$my_login") or die "Pipe failed: $!";
print MAIL "From: test101\@test.com\n";   # no comma after MAIL.
print MAIL "To: test102\@test.com\n";
print MAIL "Subject: test\n";
print MAIL "\n";
print MAIL "Testing line 1\n";
print MAIL "Testing line 2\n";
close MAIL;
With open(), you cannot pipe data both to and from a command. If you want to read the output of a command that you have opened with |command, send the output to a file. For example,
open(PIPETO, "|command > /output.txt");
Deleting file: Function unlink
unlink(FILES) deletes the FILES, returning the number of files deleted. Do not use unlink() to delete a directory; use rmdir() instead. For example,
unlink $filename;
unlink "/var/adm/message";
unlink "message";
Inspecting Files
You can inspect a file using the (-test FILE) condition. The condition returns true if FILE satisfies test. FILE can be a filehandle or a filename. The available tests are:
- -e: exists.
- -f: plain file.
- -d: directory.
- -T: seems to be a text file (data from 0 to 127).
- -B: seems to be a binary file (data from 0 to 255).
- -r: readable.
- -w: writable.
- -x: executable.
- -s: returns the size of the file in bytes.
- -z: empty (zero byte).
Example
#!/usr/bin/env perl
use strict;
use warnings;
my $dir = shift;
opendir(DIR, $dir) or die "Can't open directory: $!";
my @files = readdir(DIR);
closedir(DIR);
foreach my $file (@files) {
    if (-f "$dir/$file") {
        print "$file is a file\n";
        print "$file seems to be a text file\n" if (-T "$dir/$file");
        print "$file seems to be a binary file\n" if (-B "$dir/$file");
        my $size = -s "$dir/$file";
        print "$file size is $size\n";
        print "$file is empty\n" if (-z "$dir/$file");
    } elsif (-d "$dir/$file") {
        print "$file is a directory\n";
    }
    print "$file is readable\n" if (-r "$dir/$file");
    print "$file is writable\n" if (-w "$dir/$file");
    print "$file is executable\n" if (-x "$dir/$file");
}
Function stat and lstat
The function stat(FILE) returns a 13-element array giving the vital statistics of FILE. lstat(SYMLINK) returns the same thing for the symbolic link SYMLINK itself, rather than for the file it points to.
The elements are:
Index | Value |
---|---|
0 | The device |
1 | The file's inode |
2 | The file's mode |
3 | The number of hard links to the file |
4 | The user ID of the file's owner |
5 | The group ID of the file |
6 | The raw device |
7 | The size of the file |
8 | The last accessed time |
9 | The last modified time |
10 | The last time the file's status changed |
11 | The block size of the system |
12 | The number of blocks used by the file |
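The indices in the table can be used with an array slice to pick out individual fields. The following is a minimal sketch that assumes the core File::Temp module for a scratch file; the 10-byte contents are made up so that the size is known in advance.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file with exactly 10 bytes; removed at exit.
my ($fh, $tmpname) = tempfile(UNLINK => 1);
binmode($fh);
print $fh "0123456789";
close($fh);

my @info = stat($tmpname) or die "Can't stat $tmpname: $!";

# Index 7 is the size, index 9 is the last modified time.
my ($size, $mtime) = @info[7, 9];

print "elements: ", scalar(@info), "\n";
print "size    : $size bytes\n";
print "modified: ", scalar(localtime($mtime)), "\n";
```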
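For example, the command
perl -e "$size= (stat('test.txt'))[7]; print $size"
prints the file size of "test.txt". On Unix-like shells, enclose the script in single quotes instead (with the file name in double quotes), so that the shell does not expand $size.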
Accessing the Directories
- opendir(DIRHANDLE, dirname) opens the directory dirname.
- closedir(DIRHANDLE) closes the directory handle.
- readdir(DIRHANDLE) returns the next file from DIRHANDLE in a scalar context, or the rest of the files in the array context.
- glob(string) returns an array of filenames matching the wildcard in string, e.g., glob('*.dat') and glob('test??.txt').
- mkdir(dirname, mode) creates the directory dirname with the protection specified by mode.
- rmdir(dirname) deletes the directory dirname, only if it is empty.
- chdir(dirname) changes the working directory to dirname.
- chroot(dirname) makes dirname the root directory "/" for the current process, used by superuser only.
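Several of these functions can be combined in one sketch. It assumes the core File::Temp module for a scratch directory and a temporary path without spaces or wildcard characters (glob treats those specially); the directory and file names are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $base = tempdir(CLEANUP => 1);   # scratch directory, removed at exit
my $sub  = "$base/data";            # hypothetical sub-directory name

mkdir($sub, 0755) or die "Can't create $sub: $!";

# Create three empty files: test01.dat, test02.dat, test03.dat.
for my $n (1 .. 3) {
    open(OUT, ">$sub/test0$n.dat") or die "Can't write: $!";
    close(OUT);
}

# readdir in array context returns all entries, including . and ..
opendir(DIR, $sub) or die "Can't open $sub: $!";
my @entries = sort grep { !/^\.\.?$/ } readdir(DIR);
closedir(DIR);

my @matched = glob("$sub/*.dat");   # full paths matching the wildcard

my $deleted = unlink(@matched);     # rmdir only works on an empty directory
my $removed = rmdir($sub) ? 1 : 0;

print "entries: @entries\n";
print "deleted $deleted file(s), directory removed: $removed\n";
```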
Example: Print the contents of a given directory.
#!/usr/bin/env perl
use strict;
use warnings;
my $dirname = shift; # first command-line argument.
opendir(DIR, $dirname) or die "can't open $dirname: $!\n";
my @files = readdir(DIR);   # "my" is required under "use strict"
closedir(DIR);
foreach my $file (@files) {
print "$file\n";
}
Example: Removing empty files in a given directory
#!/usr/bin/env perl
use strict;
use warnings;
my $dirname = shift;
opendir(DIR, $dirname) or die "Can't open directory: $!";
my @files = readdir(DIR);
foreach my $file (@files) {
    if ((-f "$dirname/$file") && (-z "$dirname/$file")) {
        print "deleting $dirname/$file\n";
        unlink "$dirname/$file";
    }
}
closedir(DIR);
Example: Display files matching "*.txt"
my @files = glob('*.txt');
foreach (@files) { print; print "\n" }
Example: Display files matching the command-line pattern.
my $file = shift;
my @files = glob($file);
foreach (@files) { print; print "\n" }
Standard Filehandles
Perl defines the following standard filehandles:
- STDIN: Standard Input, usually refers to the keyboard.
- STDOUT: Standard Output, usually refers to the console.
- STDERR: Standard Error, usually refers to the console.
- ARGV: reads, line by line, from the files named in the command-line arguments (@ARGV).
For example:
my $line  = <STDIN>;   # Set $line to the next line of user input
my $item  = <ARGV>;    # Set $item to the next line of the files named on the command line
my @items = <ARGV>;    # Put all the remaining lines of those files into the array
When you use empty angle brackets <>, Perl fills in ARGV or STDIN for you automatically: <> reads from the files named on the command line if any are given (like <ARGV>), and from the user's input (like <STDIN>) otherwise. Whenever you use the print() function without a filehandle, it uses the STDOUT filehandle.
Text Formatting
Function write
write(filehandle): prints formatted text to filehandle, using the format associated with filehandle. If filehandle is omitted, STDOUT would be used.
Declaring format
format name =
text1
text2
.
Picture Field @<, @|, @>
- @<: left-flushes the string on the next line of formatting texts.
- @>: right-flushes the string on the next line of formatting texts.
- @|: centers the string on the next line of the formatting texts.
The <, > and | characters can be repeated (e.g., @<<<<) to control the number of characters to be formatted. The number of characters to be formatted is the same as the length of the picture field, including the @.
@###.## formats numbers by lining up the decimal points under ".".
For examples,
[TODO]
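A minimal sketch of a format and write, under the assumption of a two-column report: a 10-character left-flushed name and a price with two decimal places. The variable names and the data are hypothetical; package variables (our) are used so that the format can see them.

```perl
use strict;
use warnings;

# Hypothetical report variables; "our" makes them visible to the format.
our ($name, $price);

# Picture line, followed by the line of values to fill in, ended by a lone dot.
format STDOUT =
@<<<<<<<<< @##.##
$name,     $price
.

for my $item (['apple', 1.5], ['watermelon', 12.25]) {
    ($name, $price) = @$item;
    write;   # prints one formatted line to STDOUT
}
```

Each call to write fills the picture line with the current values of the variables, so the loop prints one aligned row per item.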
Printing Formatting String printf
printf(filehandle template, array): prints a formatted string to filehandle (similar to C's fprintf()). As with print, there is no comma after the filehandle. For example,
printf(FILE "The number is %d", 15);
The available formatting fields are:
Field | Expected Value |
---|---|
%s | String |
%c | Character |
%d | Decimal number |
%ld | Long decimal number |
%u | Unsigned decimal number |
%x | Hexadecimal number |
%lx | Long hexadecimal number |
%o | Octal number |
%lo | Long octal number |
%f | Fixed-point floating-point number |
%e | Exponential floating-point number |
%g | Compact floating-point number |
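A few of these fields can be tried with sprintf, which takes the same template as printf but returns the string instead of printing it. The values below are arbitrary samples; the width and precision flags (such as %5s, %-5s, %05d and %.2f) are standard additions to the basic fields in the table.

```perl
use strict;
use warnings;

my @lines = (
    sprintf("%s|%5s|%-5s|", 'ab', 'ab', 'ab'),   # plain, right-aligned, left-aligned
    sprintf("%d %05d", 42, 42),                  # decimal, zero-padded to width 5
    sprintf("%x %o", 255, 8),                    # hexadecimal and octal
    sprintf("%.2f %g", 3.14159, 1234.5),         # fixed-point (2 decimals) and compact
    sprintf("%c", 65),                           # character with code 65
);

print "$_\n" for @lines;
```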