File Input/Output
File Input/Output (IO) requires 3 steps:
- Open the file for read or write or both.
- Read/Write data.
- Close the file to free the resources.
Python provides built-in functions and external modules to support these operations.
Opening/Closing a File
- open(file, [mode='r']) -> fileObj: Open the file and return a file object. The available modes are: 'r' (read-only, the default), 'w' (write; erases all contents of an existing file), 'a' (append), and 'r+' (read and write). You can also use 'rb', 'wb', 'ab' and 'rb+' for binary-mode (raw-byte) operations. You can optionally specify the text encoding via the keyword parameter encoding, e.g., encoding="utf-8".
- fileObj.close(): Flush and close the file stream.
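The modes listed above can be sketched as follows. This is a minimal illustration, assuming a scratch file named modes_demo.txt (a hypothetical name) created inside a temporary directory so that nothing in your working directory is touched:

```python
import os
import tempfile

# Work in a throwaway directory; 'modes_demo.txt' is a hypothetical filename
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'modes_demo.txt')

f = open(path, 'w', encoding='utf-8')   # 'w' creates (or truncates) the file
f.write('first\n')
f.close()

f = open(path, 'a', encoding='utf-8')   # 'a' appends to the existing contents
f.write('second\n')
f.close()

f = open(path, 'r', encoding='utf-8')   # 'r' (default) opens for read-only
text = f.read()
f.close()
print(repr(text))

f = open(path, 'rb')                    # 'rb' returns raw bytes instead of str
raw = f.read()
f.close()
print(type(raw))

# Clean up the scratch file and directory
os.remove(path)
os.rmdir(tmp_dir)
```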
Reading/Writing Text Files
The fileObj returned after the file is opened maintains a file pointer. It is initially positioned at the beginning of the file (or at the end, in append mode) and advances whenever read/write operations are performed.
Reading Line/Lines from a Text File
- fileObj.readline() -> str: (most commonly-used) Read next line (up to and including newline) and return a string (including newline). It returns an empty string after the end-of-file (EOF).
- fileObj.readlines() -> [str]: Read all lines into a list of strings.
- fileObj.read() -> str: Read the entire file into a string.
Writing Line to a Text File
- fileObj.write(str) -> int: Write the given string to the file and return the number of characters written. You need to explicitly terminate the str with a '\n', if needed. In text mode, the '\n' will be translated to the platform-dependent newline ('\r\n' for Windows or '\n' for Unixes/macOS).
Examples
# Open a file for writing and insert some records
>>> f = open('test.txt', 'w')
>>> f.write('apple\n')
>>> f.write('orange\n')
>>> f.write('pear\n')
>>> f.close()    # Always close the file
# Check the contents of the file created

# Open the file created for reading and read line(s) using readline() and readlines()
>>> f = open('test.txt', 'r')
>>> f.readline()         # Read next line into a string
'apple\n'
>>> f.readlines()        # Read all (next) lines into a list of strings
['orange\n', 'pear\n']
>>> f.readline()         # Return an empty string after EOF
''
>>> f.close()

# Open the file for reading and read the entire file via read()
>>> f = open('test.txt', 'r')
>>> f.read()             # Read entire file into a string
'apple\norange\npear\n'
>>> f.close()

# Read line-by-line using readline() in a while-loop
>>> f = open('test.txt')
>>> line = f.readline()  # include newline
>>> while line:
        line = line.rstrip()  # strip trailing spaces and newline
        # process the line
        print(line)
        line = f.readline()
apple
orange
pear
>>> f.close()
Processing Text File Line-by-Line
We can use a with-statement to open a file, which will be closed automatically upon exit, and a for-loop to read line-by-line as follows:
with open('path/to/file.txt', 'r') as f:  # Open file for read
    for line in f:           # Read line-by-line
        line = line.strip()  # Strip the leading/trailing whitespaces and newline
        # Process the line
# File closed automatically upon exit of with-statement
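The with-statement is roughly equivalent to the try-finally statement as follows:
f = open('path/to/file.txt')  # Opened before "try": if open() fails, there is nothing to close
try:
    for line in f:
        line = line.strip()
        # Process the line
finally:
    f.close()                 # Always runs, even if an exception is raised in the loop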
Example: Line-by-line File Copy
The following script copies a file into another line-by-line, prepending each line with the line number.
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
"""
file_copy: Copy file line-by-line from source to destination
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Usage: file_copy <src> <dest>
"""
import sys
import os

def main():
    # Check and retrieve command-line arguments
    if len(sys.argv) != 3:
        print(__doc__)
        sys.exit(1)   # Return a non-zero value to indicate abnormal termination
    fileIn  = sys.argv[1]
    fileOut = sys.argv[2]

    # Verify source file
    if not os.path.isfile(fileIn):
        print("error: {} does not exist".format(fileIn))
        sys.exit(1)

    # Verify destination file
    if os.path.isfile(fileOut):
        print("{} exists. Overwrite (y/n)?".format(fileOut))
        reply = input().strip().lower()
        if not reply.startswith('y'):   # Also handles an empty reply
            sys.exit(1)

    # Process the file line-by-line
    with open(fileIn, 'r') as fpIn, open(fileOut, 'w') as fpOut:
        lineNumber = 0
        for line in fpIn:
            lineNumber += 1
            line = line.rstrip()   # Strip trailing spaces and newline
            fpOut.write("{}: {}\n".format(lineNumber, line))
            # Need \n, which will be translated to platform-dependent newline
        print("Number of lines: {}\n".format(lineNumber))

if __name__ == '__main__':
    main()
Binary File Operations
[TODO] Intro
- fileObj.tell() -> int: returns the current stream position. The current stream position is the number of bytes from the beginning of the file in binary mode, and an opaque number in text mode.
- fileObj.seek(offset): sets the current stream position to offset bytes from the beginning of the file.
For example [TODO]
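As a minimal sketch of tell() and seek() in binary mode (the filename demo.bin is hypothetical and is created in a temporary directory):

```python
import os
import tempfile

# Create a small binary file; the name 'demo.bin' is hypothetical
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo.bin')
with open(path, 'wb') as f:
    f.write(b'0123456789')

with open(path, 'rb') as f:
    start = f.tell()        # positioned at the beginning of the file
    head = f.read(4)        # read 4 bytes; the position advances by 4
    after_read = f.tell()
    f.seek(7)               # jump to byte offset 7 from the beginning
    tail = f.read()         # read from offset 7 to the end of the file
    end = f.tell()

print(start, head, after_read, tail, end)

# Clean up the scratch file and directory
os.remove(path)
os.rmdir(tmp_dir)
```

In binary mode the positions are plain byte counts, so they can be computed and compared directly.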
Directory and File Management
In Python, directory and file management are supported by modules os, os.path, shutil, ...
Path Operations Using Module os.path
In Python, a path could refer to:
- a file,
- a directory, or
- a symlink (symbolic link).
A path could be absolute (beginning with root) or relative to the current working directory (CWD).
The path separator is platform-dependent (Windows uses '\', while Unixes/macOS use '/'). The os.path module supports platform-independent operations on paths, by handling the path separator intelligently.
Checking Path Existence and Type
- os.path.exists(path) -> bool: Check if the given path exists.
- os.path.isfile(file_path), os.path.isdir(dir_path), os.path.islink(link_path) -> bool: Check if the given path is a file, a directory, or a symlink.
For example,
>>> import os
>>> os.path.exists('/usr/bin')   # Check if the path exists (as a file/directory/symlink)
True
>>> os.path.isfile('/usr/bin')   # Check if path is a file
False
>>> os.path.isdir('/usr/bin')    # Check if path is a directory
True
Forming a New Path
For portability, it is important NOT to hardcode the path separator. Use the following os.path facilities to form paths instead:
- os.path.sep: the path separator of the current system.
- os.path.join(path, *paths): Form and return a path by joining one or more path components, inserting the platform-dependent path separator ('/' or '\'). To form an absolute path, you need to begin with os.path.sep, as root.
For example,
>>> import os
>>> print(os.path.sep)   # Path Separator
/
# Form an absolute path beginning with root
>>> print(os.path.join(os.path.sep, 'etc', 'apache2', 'httpd.conf'))
/etc/apache2/httpd.conf
# Form a relative path
>>> print(os.path.join('..', 'apache2', 'httpd.conf'))
../apache2/httpd.conf
Manipulating Directory-name and Filename
- os.path.dirname(path): Return the directory name of the given path (file, directory or symlink). The returned directory name could be absolute or relative, depending on the path given.
- os.path.abspath(path): Return the absolute path name (starting from the root) of the given path. This could be an absolute filename, an absolute directory-name, or an absolute symlink.
For example, to form an absolute path of a file called out.txt in the same directory as in.txt, you may extract the absolute directory name of in.txt, then join it with out.txt, as follows:
# Absolute filename
os.path.join(os.path.dirname(os.path.abspath('in.txt')), 'out.txt')
# Relative filename
os.path.join(os.path.dirname('in.txt'), 'out.txt')
For example,
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
"""
test_ospath.py
"""
import os
print('__file__:', __file__)                     # This filename
print('dirname():', os.path.dirname(__file__))   # directory component (relative or absolute)
print('abspath():', os.path.abspath(__file__))   # Absolute filename
print('dirname(abspath()):', os.path.dirname(os.path.abspath(__file__)))  # Absolute directory name
When a module is loaded from a file in Python, __file__ is set to the pathname of that file. Try running this script in the following ways and study how __file__ and the derived paths change in the output:
# cd to the directory containing "test_ospath.py"
$ python3 ./test_ospath.py
$ python3 test_ospath.py
$ python3 ../parent_dir/test_ospath.py   # Relative filename
$ python3 /path/to/test_ospath.py        # Absolute filename
Handling Symlink (Unixes/macOS)
- os.path.realpath(path): (for symlinks) Similar to abspath(), but return the canonical path, eliminating any symlinks encountered.
For example,
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
"""test_realpath.py"""
import os
print('__file__:', __file__)
print('abspath():', os.path.abspath(__file__))    # Absolute filename
print('realpath():', os.path.realpath(__file__))  # Filename with symlink resolved, if any
$ python3 test_realpath.py
# Same output for abspath() and realpath() because there is no symlink
# Make a symlink to the Python script
$ ln -s test_realpath.py test_realpath_link.py
# Run via symlink
$ python3 test_realpath_link.py
#abspath(): /path/to/test_realpath_link.py
#realpath(): /path/to/test_realpath.py (symlink resolved)
Directory & File Management Using Modules os and shutil
The modules os and shutil provide an interface to the operating system and the system shell.
However,
- If you just want to read or write a file, use the built-in function open().
- If you just want to manipulate paths (files, directories and symlinks), use the os.path module.
- If you want to read all the lines in all the files on the command-line, use the fileinput module.
- To create temporary files/directories, use the tempfile module.
Directory Management
- os.getcwd(): Return the current working directory (CWD).
- os.chdir(dir_path): Change the CWD.
- os.mkdir(dir_path, mode=0o777): Create a directory with the given mode expressed in octal (which will be further masked by the process's umask). mode is ignored in Windows.
- os.makedirs(dir_path, mode=0o777): Similar to mkdir, but create the intermediate sub-directories, if needed.
- os.rmdir(dir_path): Remove an empty directory. You could use os.path.isdir(path) to check if the path exists and is a directory.
- shutil.rmtree(dir_path): Remove a directory and all its contents.
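The directory functions above can be tried safely inside a temporary directory. The following is a minimal sketch; the directory names single, a, b and c are hypothetical, and exist_ok=True is an optional keyword of os.makedirs() that the list above does not mention:

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()                      # scratch area for the demo

single = os.path.join(base, 'single')          # hypothetical directory names
nested = os.path.join(base, 'a', 'b', 'c')

os.mkdir(single)                               # parent must already exist
os.makedirs(nested)                            # creates 'a', 'a/b' and 'a/b/c'
os.makedirs(nested, exist_ok=True)             # no error if it already exists

created = os.path.isdir(single) and os.path.isdir(nested)

os.rmdir(single)                               # works: 'single' is empty
try:
    os.rmdir(os.path.join(base, 'a'))          # fails: 'a' is not empty
    rmdir_failed = False
except OSError:
    rmdir_failed = True

shutil.rmtree(base)                            # remove base and everything under it
print(created, rmdir_failed, os.path.exists(base))
```

Note that shutil.rmtree() deletes without confirmation, so it is worth double-checking the path before calling it on anything other than a scratch directory.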
File Management
- os.rename(src_file, dest_file): Rename a file.
- os.remove(file) or os.unlink(file): Remove the file. You could use os.path.isfile(file) to check if the file exists.
For example [TODO],
>>> import os
>>> dir(os)            # List all attributes
......
>>> help(os)           # Show man page
......
>>> help(os.getcwd)    # Show man page for specific function
......
>>> os.getcwd()        # Get current working directory
... current working directory ...
>>> os.listdir()       # List the contents of the current directory
... contents of current directory ...
>>> os.chdir('test-python')           # Change directory
>>> exec(open('hello.py').read())     # Run a Python script
>>> os.system('ls -l')                # Run shell command
>>> os.name                           # Name of OS
'posix'
>>> os.mkdir('sub_dir')               # Create sub-directory
>>> os.makedirs('/path/to/sub_dir')   # Create sub-directory and the intermediate directories
>>> os.remove('filename')             # Remove file
>>> os.rename('oldFile', 'newFile')   # Rename file
List a Directory
- os.listdir(path='.') -> [str]: list the names of all the entries in a given directory (excluding '.' and '..'), defaulting to the current directory.
>>> import os
>>> help(os.listdir)
......
>>> os.listdir()   # Return a list of entries in the current directory
[..., ..., ...]
# You can use a for-loop to iterate thru the list
>>> for f in sorted(os.listdir('/usr')): print(f)
......
# listdir() returns bare names, so join them with the directory to get full paths
# (os.path.abspath(f) would wrongly resolve the names against the CWD)
>>> for f in sorted(os.listdir('/usr')): print(os.path.join('/usr', f))
......
List a Directory Recursively via os.walk()
- os.walk(top, topdown=True, onerror=None, followlinks=False): recursively walk the directory tree starting from top, yielding a 3-tuple (dirpath, dirnames, filenames) for each directory visited.
For example,
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
"""
file_list_oswalk.py - List files recursively from a given directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Usage: file_list_oswalk.py [<dir>|.]
"""
import sys
import os

def main():
    # Process command-line arguments
    if len(sys.argv) > 2:  # Command-line arguments are kept in a list 'sys.argv'
        print(__doc__)
        sys.exit(1)        # Return a non-zero value to indicate abnormal termination
    elif len(sys.argv) == 2:
        dir = sys.argv[1]  # directory given in command-line argument
    else:
        dir = '.'          # default current directory

    # Verify dir
    if not os.path.isdir(dir):
        print('error: {} does not exist'.format(dir))
        sys.exit(1)

    # Recursively walk thru from dir using os.walk()
    for curr_dir, subdirs, files in os.walk(dir):
        # os.walk() recursively walks thru the given "dir" and its sub-directories
        # For each iteration:
        #   - curr_dir: current directory being walked thru, recursively from "dir"
        #   - subdirs: list of sub-directories in "curr_dir"
        #   - files: list of files/symlinks in "curr_dir"
        print('D:', os.path.abspath(curr_dir))  # print currently walked dir
        for subdir in sorted(subdirs):          # print all subdirs under "curr_dir"
            # subdir is a bare name, so join it with curr_dir before making it absolute
            print('SD:', os.path.abspath(os.path.join(curr_dir, subdir)))
        for file in sorted(files):              # print all files under "curr_dir"
            print(os.path.join(os.path.abspath(curr_dir), file))  # full filename

if __name__ == '__main__':
    main()
List a Directory Recursively via Module glob (Python 3.5)
[TODO] Intro
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
"""
file_list_glob.py - List files recursively from a given directory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Usage: file_list_glob.py [<dir>|.]
"""
import sys
import os
import glob   # The recursive=True option requires Python 3.5 or above

def main():
    # Process command-line arguments
    if len(sys.argv) > 2:  # Command-line arguments are kept in a list 'sys.argv'
        print(__doc__)
        sys.exit(1)        # Return a non-zero value to indicate abnormal termination
    elif len(sys.argv) == 2:
        dir = sys.argv[1]  # directory given in command-line argument
    else:
        dir = '.'          # default current directory

    # Check dir
    if not os.path.isdir(dir):
        print('error: {} does not exist'.format(dir))
        sys.exit(1)

    # List *.txt only
    for file in glob.glob(os.path.join(dir, '**', '*.txt'), recursive=True):
        # ** matches any files and zero or more directories and subdirectories
        print(file)

    print('----------------------------')

    # List all files and subdirs
    for file in glob.glob(os.path.join(dir, '**'), recursive=True):
        # ** matches any files and zero or more directories and subdirectories
        if os.path.isdir(file):
            print('D:', file)
        else:
            print(file)

if __name__ == '__main__':
    main()
Copying File
- shutil.copyfile(src, dest): Copy the contents of the file src to the file dest.
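A minimal sketch of shutil.copyfile(), assuming two hypothetical filenames (src.txt and dest.txt) inside a temporary directory:

```python
import os
import shutil
import tempfile

tmp_dir = tempfile.mkdtemp()
src = os.path.join(tmp_dir, 'src.txt')    # hypothetical filenames
dest = os.path.join(tmp_dir, 'dest.txt')

with open(src, 'w') as f:
    f.write('apple\norange\n')

shutil.copyfile(src, dest)                # copy the file contents to dest

with open(dest) as f:
    copied = f.read()
print(repr(copied))

shutil.rmtree(tmp_dir)                    # clean up the scratch directory
```

If dest already exists, copyfile() overwrites it, so you may want to check with os.path.isfile(dest) first, as in the file-copy script earlier.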
Shell Command [TODO]
- os.system(command_str): Run a shell command. (In Python 3, the subprocess module, e.g., subprocess.run(), is recommended instead.)
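As a hedged illustration of the subprocess alternative, the sketch below runs a child process and captures its output. It launches the Python interpreter itself (sys.executable) only because that command is available on every platform; in practice you would pass your own command as a list of arguments. The capture_output and text keywords assume Python 3.7 or above.

```python
import subprocess
import sys

# Run a child process without going through the shell.
result = subprocess.run(
    [sys.executable, '-c', 'print("hello from child")'],
    capture_output=True,   # collect stdout/stderr instead of printing them
    text=True,             # decode the captured bytes to str
)
print(result.returncode)       # exit status of the child process
print(result.stdout.strip())   # what the child printed
```

Passing the command as a list avoids shell quoting problems, which is one reason this form is usually preferred over building a single command string.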
Environment Variables [TODO]
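- os.getenv(varname, value=None): Return the value of the environment variable if it exists, or value if it doesn't, with default of None.
- os.putenv(varname, value): Set environment variable to value.
- os.unsetenv(varname): Delete (Unset) the environment variable.
Take note that calls to os.putenv() and os.unsetenv() do not update the os.environ mapping. It is generally preferable to assign to, or delete from, os.environ, which calls them for you.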
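A minimal sketch of reading, setting and unsetting an environment variable through os.environ and os.getenv(). The variable name DEMO_TUTORIAL_VAR is hypothetical:

```python
import os

name = 'DEMO_TUTORIAL_VAR'                 # hypothetical variable name

os.environ.pop(name, None)                 # make sure it starts unset
missing = os.getenv(name)                  # not set, so the default None is returned
fallback = os.getenv(name, 'default')      # not set, so the given fallback is returned

os.environ[name] = 'hello'                 # set the variable for this process
current = os.getenv(name)

del os.environ[name]                       # unset the variable
after_delete = os.getenv(name)

print(missing, fallback, current, after_delete)
```

Changes made this way affect the current process and any child processes it starts afterwards, but not the parent shell.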
fileinput Module
The fileinput module provides support for processing lines of input from one or more files given in the command-line arguments (sys.argv). For example, create the following script called "test_fileinput.py":
#!/usr/bin/env python3
# -*- coding: UTF-8 -*-
"""
test_fileinput.py: Process all the files given in the command-line arguments
Usage: test_fileinput.py file1 file2 file3 ...
"""
import fileinput

def main():
    """Get lines from all the files given in the command-line arguments (sys.argv)"""
    lineNumber = 0
    for line in fileinput.input():
        line = line.rstrip()   # strip the trailing spaces/newline
        lineNumber += 1
        print('{}: {}'.format(lineNumber, line))

if __name__ == '__main__':
    main()
Text Processing
For simple text string operations such as string search and replacement, you can use the built-in string functions (e.g., str.replace(old, new)). For complex pattern search and replacement, you need to master regular expressions (regex).
String Operations
The built-in class str provides many member functions for text string manipulation. Suppose that s is a str object.
Strip whitespaces (blank, tab and newline)
- s.strip() -> str: Return a copy of the string s with leading and trailing whitespaces removed. Whitespaces include blank, tab and newline.
- s.strip([chars]) -> str: Strip the leading/trailing characters given, instead of whitespaces.
- s.rstrip(), s.lstrip() -> str: Strip the right (trailing) whitespaces and the left (leading) whitespaces, respectively.
s.rstrip() is the most commonly-used, to strip the trailing spaces/newline. The leading whitespaces are usually significant.
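The differences among the strip functions can be seen in a short sketch (the sample strings are arbitrary):

```python
s = '  \tapple pie \n'

both = s.strip()                    # remove whitespaces at both ends
right = s.rstrip()                  # trailing only; leading indentation is kept
left = s.lstrip()                   # leading only; trailing space/newline is kept
dashes = '--apple--'.strip('-')     # strip the given characters instead of whitespaces

print(repr(both), repr(right), repr(left), repr(dashes))
```

Using repr() when printing makes the remaining whitespace characters visible.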
Uppercase/Lowercase
- s.upper(), s.lower() -> str: Return a copy of string s converted to uppercase and lowercase, respectively.
- s.isupper(), s.islower() -> bool: Check if the string is uppercase/lowercase, respectively.
Find
- s.find(key_str, [start], [end]) -> int|-1: Return the lowest index at which key_str is found in slice s[start:end] (default to entire string); or -1 if not found.
- s.index(key_str, [start], [end]) -> int|ValueError: Similar to find(), but raises ValueError if not found.
- s.startswith(key_str, [start], [end]), s.endswith(key_str, [start], [end]) -> bool: Check if the string begins or ends with key_str.
For example,
>>> s = '/test/in.txt'
>>> s.find('in')
6
>>> s[0 : s.find('in')] + 'out.txt'
'/test/out.txt'
# You could use str.replace() described below
Find and Replace
- s.replace(old, new, [count]) -> str: Return a copy with all occurrences of old replaced by new. The optional parameter count limits the number of occurrences to replace, default to all occurrences.
str.replace() is ideal for simple text string replacement, without the need for pattern matching.
For example,
>>> s = 'hello hello hello, world'
>>> help(s.replace)
>>> s.replace('ll', '**')
'he**o he**o he**o, world'
>>> s.replace('ll', '**', 2)
'he**o he**o hello, world'
Split into Tokens and Join
- s.split([sep], [maxsplit=-1]) -> [str]: Return a list of words using sep as delimiter string. The default delimiter is whitespaces (blank, tab and newline). The maxsplit limits the maximum number of split operations, with default -1 meaning no limit.
- sep.join([str]) -> str: Reverse of split(). Join the list of strings with sep as separator.
For example,
>>> 'apple, orange, pear'.split()       # default delimiter is whitespaces
['apple,', 'orange,', 'pear']
>>> 'apple, orange, pear'.split(', ')   # Set the delimiter
['apple', 'orange', 'pear']
>>> 'apple, orange, pear'.split(', ', maxsplit=1)  # Set the maximum number of split operations
['apple', 'orange, pear']
>>> ', '.join(['apple', 'orange, pear'])
'apple, orange, pear'
Regular Expression in Module re
References:
- Python's Regular Expression HOWTO @ https://docs.python.org/3/howto/regex.html.
- Python's re - Regular expression operations @ https://docs.python.org/3/library/re.html.
I assume that you are familiar with regex; otherwise, you could read:
- "Regex By Examples" for a summary of regex syntax and examples.
- "Regular Expressions" for full coverage.
The re module provides support for regular expressions (regex).
>>> import re
>>> dir(re)    # List all attributes
......
>>> help(re)   # Show man page
......
# The man page lists all the special characters and metacharacters used by Python's regex.
Backslash (\), Python Raw String r'...' vs Regular String
Regex's syntax uses backslash (\):
- for metacharacters such as \d (digit), \D (non-digit), \s (space), \S (non-space), \w (word), \W (non-word)
- to escape special regex characters, e.g., \. for ., \+ for +, \* for *, \? for ?. You also need to write \\ to match \.
On the other hand, Python's regular strings also use backslash for escape sequences, e.g., \n for newline, \t for tab. Again, you need to write \\ for \.
To write the regex pattern \d+ (one or more digits) in a Python regular string, you need to write '\\d+'. This is cumbersome and error-prone.
Python's solution is using raw string with a prefix r in the form of r'...'. A raw string does not interpret Python's string escape sequences. For example, r'\n' is '\'+'n' (two characters) instead of newline (one character). Using raw string, you can write r'\d+' for regex pattern \d+ (instead of regular string '\\d+').
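The difference between regular and raw strings can be checked directly. A minimal sketch (the sample input 'a12b345' is arbitrary):

```python
import re

newline_len = len('\n')            # regular string: a single newline character
raw_len = len(r'\n')               # raw string: a backslash followed by 'n'
same_pattern = ('\\d+' == r'\d+')  # both spellings denote the regex \d+
numbers = re.findall(r'\d+', 'a12b345')

print(newline_len, raw_len, same_pattern, numbers)
```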
Furthermore, Python denotes parenthesized back references (or capturing groups) as \1, \2, \3, ..., which can be written as raw strings r'\1', r'\2' instead of regular strings '\\1' and '\\2'. Take note that some languages use $1, $2, ... for the back references.
I suggest that you use raw strings for regex pattern strings and replacement strings.
Compiling (Creating) a Regex Pattern Object
- re.compile(regexStr, [modifiers]) -> regexObj: Compile a regex pattern into a regex object, which can then be used for matching operations.
For example,
>>> import re
>>> p1 = re.compile(r'[1-9][0-9]*|0')   # zero or positive integers (begins with 1-9, followed by zero or more 0-9; or 0)
>>> type(p1)
<class '_sre.SRE_Pattern'>              # shown as <class 're.Pattern'> in Python 3.7+
>>> p2 = re.compile(r'^\w{6,10}$')      # 6-10 word-character line
# ^ matches the start-of-line, $ matches end-of-line, \w matches word character
>>> p3 = re.compile(r'xy*', re.IGNORECASE)   # with an optional modifier
# x followed by zero or more y, case-insensitive
Invoking Regex Operations
You can invoke most of the regex functions in two ways:
- regexObj.func(str): Apply the compiled regex object to str, via the pattern object's (SRE_Pattern's) member function func().
- re.func(regexObj|regexStr, str): Apply regexStr (uncompiled) to str, via re's module-level function func(). These module-level functions are shortcuts to the above that do not require you to compile a regex object first. If regexStr is used, the modifiers are passed via the optional flags argument instead of via re.compile().
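The two invocation styles can be compared side by side. This is a minimal sketch with an arbitrary sample string; note that the module-level functions also accept the modifiers through an optional flags argument:

```python
import re

text = 'xY Xyy xz'

# (1) Compile once, then call the pattern object's member function
p = re.compile(r'xy*', re.IGNORECASE)
via_object = p.findall(text)

# (2) Module-level function with the pattern string;
#     the modifiers go in the optional 'flags' argument
via_module = re.findall(r'xy*', text, flags=re.IGNORECASE)

print(via_object)
print(via_module)
```

Compiling is mainly worthwhile when the same pattern is applied many times, since the pattern object can be reused.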
Find using findall()
- regexObj.findall(str) -> [str]: Return a list of all the matching substrings.
- re.findall(regexObj|regexStr, str) -> [str]: same as above.
For example,
# (1) Using compiled regex object
>>> p1 = re.compile(r'[1-9][0-9]*|0')   # match integer
>>> p1.findall('123 456')
['123', '456']
>>> p1.findall('abc')
[]
>>> p1.findall('abc123xyz456_7_00')
['123', '456', '7', '0', '0']
# (2) Using re.findall() with uncompiled regex string
>>> re.findall(r'[1-9][0-9]*|0', '123 456')   # Provide the regex pattern string
['123', '456']
>>> re.findall(r'[1-9][0-9]*|0', 'abc')
[]
>>> re.findall(r'[1-9][0-9]*|0', 'abc123xyz456_7_00')
['123', '456', '7', '0', '0']
Replace using sub() and subn()
- regexObj.sub(replaceStr, inStr, [count=0]) -> outStr: Substitute (Replace) the matched substrings in the given inStr with the replaceStr, up to count occurrences, with default of all.
- regexObj.subn(replaceStr, inStr, [count=0]) -> (outStr, count): Similar to sub(), but return a new string together with the number of replacements in a 2-tuple.
- re.sub(regexObj|regexStr, replaceStr, inStr, [count=0]) -> outStr: same as above.
- re.subn(regexObj|regexStr, replaceStr, inStr, [count=0]) -> (outStr, count): same as above.
For example,
# (1) Using compiled regex object
>>> p1 = re.compile(r'[1-9][0-9]*|0')   # match integer
>>> p1.sub(r'**', 'abc123xyz456_7_00')
'abc**xyz**_**_****'
>>> p1.subn(r'**', 'abc123xyz456_7_00')
('abc**xyz**_**_****', 5)   # (outStr, count)
>>> p1.sub(r'**', 'abc123xyz456_7_00', count=3)
'abc**xyz**_**_00'
# (2) Using re module-level function
>>> re.sub(r'[1-9][0-9]*|0', r'**', 'abc123xyz456_7_00')   # Using regexStr
'abc**xyz**_**_****'
>>> re.sub(p1, r'**', 'abc123xyz456_7_00')   # Using pattern object
'abc**xyz**_**_****'
>>> re.subn(p1, r'**', 'abc123xyz456_7_00', count=3)
('abc**xyz**_**_00', 3)
>>> re.subn(p1, r'**', 'abc123xyz456_7_00', count=10)   # count exceeds matches
('abc**xyz**_**_****', 5)
Notes: For simple string replacement, use str.replace(old, new, [count]) -> str, which is more efficient. See above section.
Using Parenthesized Back-References \1, \2, ... in Substitution and Pattern
In Python, regex parenthesized back-references (capturing groups) are denoted as \1, \2, .... You could use raw string (e.g., r'\1') to avoid escaping backslash in regular string (e.g., '\\1').
For example,
# To swap the two words by using back-references in the replacement string
>>> re.sub(r'(\w+) (\w+)', r'\2 \1', 'aaa bbb ccc')
'bbb aaa ccc'
>>> re.sub(r'(\w+) (\w+)', r'\2 \1', 'aaa bbb ccc ddd')
'bbb aaa ddd ccc'
>>> re.subn(r'(\w+) (\w+)', r'\2 \1', 'aaa bbb ccc ddd eee')
('bbb aaa ddd ccc eee', 2)
# To remove duplicate words using back-reference
>>> re.subn(r'(\w+) \1', r'\1', 'hello hello world again again')
('hello world again', 2)
Find using search() and Match Object
- regexObj.search(inStr, [begin], [end]) -> matchObj:
- re.search(regexObj|regexStr, inStr) -> matchObj: the module-level function does not take the begin and end positions.
The search() returns a special Match object encapsulating the first match (or None if there is no match). You can then use the following methods to process the resultant Match object:
- matchObj.group(): Return the matched substring.
- matchObj.start(): Return the starting matched position (inclusive).
- matchObj.end(): Return the ending matched position (exclusive).
- matchObj.span(): Return a tuple of (start, end) matched position.
For example,
# (Python 3.7+ prints the Match objects below as <re.Match object; ...>)
>>> p1 = re.compile(r'[1-9][0-9]*|0')
>>> inStr = 'abc123xyz456_7_00'
>>> m = p1.search(inStr)
>>> m
<_sre.SRE_Match object; span=(3, 6), match='123'>
>>> m.group()
'123'
>>> m.span()
(3, 6)
>>> m.start()
3
>>> m.end()
6
# You can search further by providing the begin search positions
# in the form of search(str, [beginIdx]), e.g.,
>>> m = p1.search(inStr, m.end())
>>> m
<_sre.SRE_Match object; span=(9, 12), match='456'>
# Using a while loop
>>> m = p1.search(inStr)
>>> while m:
        print(m, m.group())
        m = p1.search(inStr, m.end())
<_sre.SRE_Match object; span=(3, 6), match='123'> 123
<_sre.SRE_Match object; span=(9, 12), match='456'> 456
<_sre.SRE_Match object; span=(13, 14), match='7'> 7
<_sre.SRE_Match object; span=(15, 16), match='0'> 0
<_sre.SRE_Match object; span=(16, 17), match='0'> 0
To retrieve the back-references (or capturing groups) inside the Match object:
- matchObj.groups(): return a tuple of captured groups (or back-references)
- matchObj.group(n): return the capturing group n, where n starts at 1.
- matchObj.lastindex: last index of the capturing group
>>> p2 = re.compile(r'(A)(\w+)', re.IGNORECASE)   # Two parenthesized back-references (capturing groups)
>>> inStr = 'This is an apple.'
>>> m = p2.search(inStr)
>>> while m:
        print(m)
        print(m.group())    # show full match
        print(m.groups())   # show capturing groups in tuple
        for idx in range(1, m.lastindex + 1):   # index starts at 1
            print(m.group(idx), end=',')        # show capturing group idx
        print()
        m = p2.search(inStr, m.end())
<_sre.SRE_Match object; span=(8, 10), match='an'>
an
('a', 'n')
a,n,
<_sre.SRE_Match object; span=(11, 16), match='apple'>
apple
('a', 'pple')
a,pple,
Find using match() and fullmatch()
- regexObj.match(inStr, [begin], [end]) -> matchObj:
- regexObj.fullmatch(inStr, [begin], [end]) -> matchObj:
- re.match(regexObj|regexStr, inStr) -> matchObj:
- re.fullmatch(regexObj|regexStr, inStr) -> matchObj:
The module-level functions do not take the begin and end positions.
The search() matches anywhere in the given inStr[begin:end]. On the other hand, the match() matches from the start of inStr[begin:end] (similar to regex pattern ^...); while the fullmatch() matches the entire inStr[begin:end] (similar to regex pattern ^...$).
For example,
# match()
>>> p1 = re.compile(r'[1-9][0-9]*|0')   # match integers
>>> m = p1.match('aaa123zzz456')        # NOT match the beginning 'a...'
>>> m
# None
>>> m = p1.match('123zzz456')           # match the beginning (index 0)
>>> m
<_sre.SRE_Match object; span=(0, 3), match='123'>
# fullmatch()
>>> m = p1.fullmatch('123456')          # Match entire string
>>> m
<_sre.SRE_Match object; span=(0, 6), match='123456'>
>>> m = p1.fullmatch('123456abc')
>>> m
# None
Find using finditer()
- regexObj.finditer(inStr) -> matchIterator
- re.finditer(regexObj|regexStr, inStr) -> matchIterator
The finditer() is similar to findall(). The findall() returns a list of matched substrings. The finditer() returns an iterator to Match objects. For example,
# Using findall()
>>> p1 = re.compile(r'[1-9][0-9]*|0')
>>> inStr = 'abc123xyz456_7_00'
>>> p1.findall(inStr)   # return a list of matched substrings
['123', '456', '7', '0', '0']
>>> for s in p1.findall(inStr):   # using for-in loop to process the list
        print(s, end=' ')
123 456 7 0 0
# Using finditer()
>>> for m in p1.finditer(inStr):  # using loop on iterator
        print(m)
<_sre.SRE_Match object; span=(3, 6), match='123'>
<_sre.SRE_Match object; span=(9, 12), match='456'>
<_sre.SRE_Match object; span=(13, 14), match='7'>
<_sre.SRE_Match object; span=(15, 16), match='0'>
<_sre.SRE_Match object; span=(16, 17), match='0'>
>>> for m in p1.finditer(inStr):
        print(m.group(), end=' ')
123 456 7 0 0
Splitting String into Tokens
- regexObj.split(inStr) -> [str]:
- re.split(regexObj|regexStr, inStr) -> [str]:
The split() splits the given inStr into a list, using the regex's Pattern as delimiter (separator). For example,
>>> p1 = re.compile(r'[1-9][0-9]*|0')
>>> p1.split('aaa123bbb456ccc')
['aaa', 'bbb', 'ccc']
>>> re.split(r'[1-9][0-9]*|0', 'aaa123bbb456ccc')
['aaa', 'bbb', 'ccc']
Notes: For a simple delimiter, use str.split([sep]), which is more efficient. See above section.
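To illustrate when re.split() is worth its cost, the sketch below splits an arbitrary sample line whose delimiters are inconsistent. str.split() can only handle one fixed delimiter string, while a regex can describe them all:

```python
import re

line = 'apple, orange;pear ,  kiwi'

# str.split() uses one fixed delimiter, so irregular separators are missed
fixed = line.split(', ')

# re.split() takes a pattern: optional spaces, a comma or semicolon, optional spaces
tokens = re.split(r'\s*[,;]\s*', line)

print(fixed)
print(tokens)
```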
Web Scraping
References:
- Beautiful Soup Documentation @ https://www.crummy.com/software/BeautifulSoup/bs4/doc/.
Web Scraping (or web harvesting or web data extraction) refers to reading the raw HTML page to retrieve desired data. Needless to say, you need to master HTML, CSS and JavaScript.
Python supports web scraping via packages requests and BeautifulSoup (bs4).
Install Packages
You could install the relevant packages using pip as follows (BeautifulSoup is published under the package name beautifulsoup4 and imported as bs4):
$ pip install requests
$ pip install beautifulsoup4
Step 0: Inspect the Target Webpage
- Press F12 on the target webpage to turn on the "F12 debugger".
- Choose "Inspector".
- Click the "Select" (the left-most icon with an arrow) and point your mouse at the desired part of the HTML page. Study the code.
Step 1: Send an HTTP GET request to the target URL to retrieve the raw HTML page using module requests
>>> import requests
# Set the target URL
>>> url = "http://your_target_webpage"
# Send an HTTP GET request to the target URL to retrieve the HTML page,
# which returns a "Response" object.
>>> response = requests.get(url)
# Inspect the "Response" object
>>> type(response)
<class 'requests.models.Response'>
>>> response
<Response [200]>
>>> help(response)
......
>>> print(response.text)      # content of the response, in unicode text
......
>>> print(response.content)   # content of the response, in raw bytes
......
Step 2: Parse the HTML Text into a Tree-Structure using BeautifulSoup and Search the Desired Data
# Continue from Step 1
>>> from bs4 import BeautifulSoup
# Parse the HTML text into a tree-structure "BeautifulSoup" object
>>> soup = BeautifulSoup(response.text, "html.parser")
# Inspect the "BeautifulSoup" object
>>> type(soup)
<class 'bs4.BeautifulSoup'>
>>> help(soup)
......
# Find the first appearance of a particular HTML tag (e.g., 'img') via find(tag) -> Tag (or None)
>>> img_tag = soup.find('img')
>>> img_tag
<img ...... />
# Find all the appearances of a particular tag, via find_all(tag) -> [Tag]
# (findAll() is the older spelling of find_all())
>>> img_tags = soup.find_all('img')
>>> img_tags
[<img ... />, <img ... />, <img ... />, ...]
# Find the first appearance of a tag with certain attributes
>>> soup.find('div', attrs = {'id':'test'})
# Find all appearances of a tag with certain attributes
>>> soup.find_all('div', attrs = {'class':'error'})
You could write out the selected data to a file:
with open(filename, 'w') as fp:
    for row in rows:
        fp.write(row + '\n')
You could also use the csv module to write out rows of data with a header:
>>> import csv
>>> with open(filename, 'w', newline='') as fp:   # newline='' lets the csv module control the line endings
        writer = csv.DictWriter(fp, ['colHeader1', 'colHeader2', 'colHeader3'])
        writer.writeheader()
        for row in rows:   # each row is a dict keyed by the column headers
            writer.writerow(row)
Step 3: Download Selected Document Using urllib.request
You may want to download documents such as text files or images.
# Continue from Step 2
>>> import urllib.request
# Download a selected URL and save the response in a local file
>>> downloadUrl = '.....'
>>> file = '......'
>>> urllib.request.urlretrieve(downloadUrl, file)
Step 4: Delay
To avoid spamming a website with download requests (and being flagged as a spammer), you need to pause your code for a while between requests.
>>> import time
# Pause (in seconds)
>>> time.sleep(1)
REFERENCES & RESOURCES