Python is a convenient language that’s often used for scripting, data science, and web development.
In this article, we'll look at matching newlines, case-insensitive matching, the sub method, and verbose mode.
Matching Newlines with the Dot Character
We can use the re.DOTALL constant to make the dot character match newlines as well.
For instance, we can use it as in the following code:
import re
regex = re.compile(r'.*', re.DOTALL)
matches = regex.search('Jane\nJoe')
Then we get 'Jane\nJoe' as the value returned by matches.group().
Without re.DOTALL, as in the following example:
import re
regex = re.compile(r'.*')
matches = regex.search('Jane\nJoe')
we get 'Jane' as the value returned by matches.group(), since the dot stops at the newline.
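To see the two behaviors side by side, here is a minimal sketch that runs the same pattern with and without the flag. The variable names are only illustrative.

```python
import re

text = 'Jane\nJoe'

# Without re.DOTALL the dot stops at the newline.
default_match = re.compile(r'.*').search(text)

# With re.DOTALL the dot also matches the newline.
dotall_match = re.compile(r'.*', re.DOTALL).search(text)

print(repr(default_match.group()))  # 'Jane'
print(repr(dotall_match.group()))   # 'Jane\nJoe'
```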
Summary of Regex Symbols
The following is a summary of regex symbols:
? - matches 0 or 1 of the preceding group
* - matches 0 or more of the preceding group
+ - matches 1 or more of the preceding group
{n} - matches exactly n of the preceding group
{n,} - matches n or more of the preceding group
{,n} - matches 0 to n of the preceding group
{n,m} - matches n to m of the preceding group
{n,m}?, *?, or +? - performs a non-greedy match of the preceding group
^foo - matches a string beginning with foo
foo$ - matches a string that ends with foo
. - matches any character except for a newline
\d, \w, and \s - match a digit, word, or whitespace character respectively
\D, \W, and \S - match anything except a digit, word, or whitespace character respectively
[abc] - matches any one character between the brackets, like a, b, or c
[^abc] - matches any character but a, b, or c
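A few of the symbols above can be tried out with a short sketch like the following. The sample strings and variable names are made up for illustration.

```python
import re

text = '2024 <b>bold</b> <i>it</i>'

# {n} matches exactly n repetitions: four digits in a row.
year = re.search(r'\d{4}', text).group()

# + is greedy: it grabs as much as it can.
greedy = re.search(r'<.+>', text).group()

# +? is non-greedy: it stops at the first possible '>'.
lazy = re.search(r'<.+?>', text).group()

# [aeiou] matches one listed character; [^aeiou] matches anything else.
vowels = re.findall(r'[aeiou]', 'regex')
non_vowels = re.findall(r'[^aeiou]', 'regex')

# ^ and $ anchor the match to the start and end of the string.
starts_with_foo = bool(re.search(r'^foo', 'foobar'))
ends_with_foo = bool(re.search(r'foo$', 'barfoo'))

print(year)        # 2024
print(greedy)      # <b>bold</b> <i>it</i>
print(lazy)        # <b>
print(vowels)      # ['e', 'e']
print(non_vowels)  # ['r', 'g', 'x']
```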
Case-Insensitive Matching
We can pass in re.I (short for re.IGNORECASE) to do case-insensitive matching.
For instance, we can write:
import re
regex = re.compile(r'foo', re.I)
matches = regex.findall('FOO foo fOo fOO Foo')
Then matches has the value ['FOO', 'foo', 'fOo', 'fOO', 'Foo'].
Substituting Strings with the sub() Method
We can use the sub method to replace all substring matches with the given string.
For instance, we can write:
import re
regex = re.compile(r'\d{3}-\d{3}-\d{4}')
new_string = regex.sub('SECRET', 'Jane\'s number is 123-456-7890. Joe\'s number is 555-555-1212')
Since sub replaces every match in the string passed in as the 2nd argument with the 1st argument and returns a new string, new_string has the value of:
"Jane's number is SECRET. Joe's number is SECRET"
Verbose Mode
We can use re.VERBOSE to ignore whitespace and comments in a regex.
For instance, we can write:
import re
regex = re.compile(r'\d{3}-\d{3}-\d{4} # phone regex', re.VERBOSE)
matches = regex.findall('Jane\'s number is 123-456-7890. Joe\'s number is 555-555-1212')
Then matches has the value ['123-456-7890', '555-555-1212'], since the whitespace and comment in our regex are ignored when we pass in the re.VERBOSE option.
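Verbose mode is most useful when the pattern is spread over several lines. The following is a minimal sketch of the same phone regex written that way. The name phone_regex and the comments on each part are illustrative.

```python
import re

# A triple-quoted raw string lets each part of the pattern sit on its own line.
phone_regex = re.compile(r'''
    \d{3}   # first three digits
    -       # separator
    \d{3}   # next three digits
    -       # separator
    \d{4}   # last four digits
''', re.VERBOSE)

matches = phone_regex.findall(
    "Jane's number is 123-456-7890. Joe's number is 555-555-1212"
)
print(matches)  # ['123-456-7890', '555-555-1212']
```

Note that in verbose mode a literal space or # in the pattern has to be escaped with a backslash or placed inside a character class, otherwise it is ignored.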
Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE
We can combine re.IGNORECASE, re.DOTALL, and re.VERBOSE with the pipe (|) operator.
For instance, we can do a case-insensitive search that also ignores whitespace and comments by writing:
import re
regex = re.compile(r'jane # jane', re.IGNORECASE | re.VERBOSE)
matches = regex.findall('Jane\'s number is 123-456-7890. Joe\'s number is 555-555-1212')
Then matches has the value ['Jane'], since we passed in re.IGNORECASE combined with re.VERBOSE using the | symbol, which makes the search case-insensitive and ignores the whitespace and comment in the pattern.
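All three flags can be combined in the same way. The sketch below uses a made-up pattern and input to show each flag contributing to the match.

```python
import re

pattern = r'''
    jane   # first name, in any letter case
    .*     # anything in between, including newlines
    joe    # second name, in any letter case
'''

text = 'JANE\nand\nJoe'

# All three flags: ignore case, let the dot cross newlines, allow comments.
all_flags = re.compile(pattern, re.IGNORECASE | re.DOTALL | re.VERBOSE)
match = all_flags.search(text)

# Without re.DOTALL the dot cannot cross the newlines, so nothing matches.
no_dotall = re.compile(pattern, re.IGNORECASE | re.VERBOSE)
no_match = no_dotall.search(text)

print(repr(match.group()))  # 'JANE\nand\nJoe'
print(no_match)             # None
```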
Conclusion
We can pass different flags to the re.compile function to adjust how regex searches are done.
re.IGNORECASE lets us do a case-insensitive search.
re.VERBOSE makes the regex engine ignore whitespace and comments in our regex.
re.DOTALL lets the dot character match newline characters.
The 3 constants above can be combined with the | operator.
The sub method returns a new string in which every match is replaced with the string we passed in, and leaves the original string unchanged.