More Things We Can Do With Regexes and Python

Python is a convenient language that’s often used for scripting, data science, and web development.

In this article, we’ll look at matching newlines, case-insensitive matching, the sub method, verbose mode, and combining regex flags.

Matching Newlines with the Dot Character

By default, the dot matches any character except a newline. We can pass the re.DOTALL constant to make the dot match newlines as well.

For instance, we can use it as in the following code:

import re
regex = re.compile(r'.*', re.DOTALL)
matches = regex.search('Jane\nJoe')

Then matches.group() returns 'Jane\nJoe'.

Without re.DOTALL, as in the following example:

import re
regex = re.compile(r'.*')
matches = regex.search('Jane\nJoe')

matches.group() returns 'Jane', since the dot stops at the newline.
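
The two cases above can be put side by side in one script. This is a minimal sketch; the variable names without_dotall and with_dotall are only illustrative.

```python
import re

text = 'Jane\nJoe'

# Default behaviour: '.' does not match the newline, so the match stops there.
without_dotall = re.compile(r'.*').search(text).group()

# With re.DOTALL: '.' matches the newline too, so the whole string matches.
with_dotall = re.compile(r'.*', re.DOTALL).search(text).group()

print(repr(without_dotall))  # 'Jane'
print(repr(with_dotall))     # 'Jane\nJoe'
```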

Summary of Regex Symbols

The following is a summary of regex symbols:

  • ? - matches 0 or 1 of the preceding group
  • * - matches 0 or more of the preceding group
  • + - matches one or more of the preceding group
  • {n} - matches exactly n of the preceding group
  • {n,} - matches n or more of the preceding group
  • {,n} - matches 0 to n of the preceding group
  • {n,m} - matches n to m of the preceding group
  • {n,m}? or *? or +? - performs a non-greedy match of the preceding group
  • ^foo - matches a string beginning with foo
  • foo$ - matches a string that ends with foo
  • . - matches any character except a newline
  • \d, \w, and \s - match a digit, a word character, or a whitespace character respectively
  • \D, \W, and \S - match anything except a digit, a word character, or a whitespace character respectively
  • [abc] - matches any single character between the brackets, such as a, b, or c
  • [^abc] - matches any single character except a, b, or c
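
A few of the symbols above can be tried out in a short script. The sample strings and variable names below are made up for illustration only.

```python
import re

# ? makes the preceding character optional.
optional = re.findall(r'colou?r', 'color colour')

# {n,m} matches between n and m repetitions.
ranged = re.findall(r'\d{2,3}', '1 22 333 4444')

# + is greedy by default; +? matches as little as possible.
tagged = '<b>one</b> <b>two</b>'
greedy = re.search(r'<.+>', tagged).group()
lazy = re.search(r'<.+?>', tagged).group()

# ^ and $ anchor the match to the start and end of the string.
starts_with_foo = bool(re.search(r'^foo', 'foobar'))
ends_with_foo = bool(re.search(r'foo$', 'barfoo'))

# [^abc] matches any single character other than a, b, or c.
negated = re.findall(r'[^abc]', 'abcde')

print(optional)   # ['color', 'colour']
print(ranged)     # ['22', '333', '444']
print(greedy)     # <b>one</b> <b>two</b>
print(lazy)       # <b>
print(starts_with_foo, ends_with_foo)  # True True
print(negated)    # ['d', 'e']
```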

Case-Insensitive Matching

We can pass re.I, which is short for re.IGNORECASE, to re.compile to do case-insensitive matching.

For instance, we can write:

import re  
regex = re.compile(r'foo', re.I)  
matches = regex.findall('FOO foo fOo fOO Foo')

Then matches has the value ['FOO', 'foo', 'fOo', 'fOO', 'Foo'].
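
As a small sketch of the alternatives, re.I and re.IGNORECASE name the same flag, and the inline flag (?i) at the start of a pattern has the same effect. The sample text below is made up for illustration.

```python
import re

text = 'FOO foo fOo'

# re.I is just the short name for re.IGNORECASE.
same_flag = re.I == re.IGNORECASE

# The inline flag (?i) at the start of the pattern turns on
# case-insensitive matching without passing a flag argument.
inline = re.findall(r'(?i)foo', text)

# Without any flag, only the exact lowercase spelling matches.
case_sensitive = re.findall(r'foo', text)

print(same_flag)       # True
print(inline)          # ['FOO', 'foo', 'fOo']
print(case_sensitive)  # ['foo']
```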

Substituting Strings with the sub() Method

We can use the sub method to replace all substring matches with the given string.

For instance, we can write:

import re
regex = re.compile(r'\d{3}-\d{3}-\d{4}')
new_string = regex.sub('SECRET', 'Jane\'s number is 123-456-7890. Joe\'s number is 555-555-1212')

sub replaces every match in the string passed in as the 2nd argument with the string passed in as the 1st argument and returns a new string, so new_string has the value of:

"Jane's number is SECRET. Joe's number is SECRET"

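sub also accepts a backreference in the replacement string and an optional count argument. The sketch below assumes we want to keep the last four digits visible, which is our own variation on the example above and not part of it.

```python
import re

text = "Jane's number is 123-456-7890. Joe's number is 555-555-1212"

# The parentheses capture the last four digits as group 1.
regex = re.compile(r'\d{3}-\d{3}-(\d{4})')

# \1 in the replacement inserts whatever group 1 matched.
masked = regex.sub(r'***-***-\1', text)

# count=1 limits the replacement to the first match only.
first_only = regex.sub('SECRET', text, count=1)

print(masked)
# Jane's number is ***-***-7890. Joe's number is ***-***-1212
print(first_only)
# Jane's number is SECRET. Joe's number is 555-555-1212
```

The original string is left unchanged in both cases, since sub always returns a new string.
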
Verbose Mode

We can use re.VERBOSE to make the regex engine ignore whitespace and comments inside a pattern.

For instance, we can write:

import re
regex = re.compile(r'\d{3}-\d{3}-\d{4} # phone regex', re.VERBOSE)
matches = regex.findall('Jane\'s number is 123-456-7890. Joe\'s number is 555-555-1212')

Then matches has the value ['123-456-7890', '555-555-1212'] since the whitespace and comment in our regex are ignored when we pass in the re.VERBOSE option.
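
Verbose mode is most useful when a pattern is spread over several lines. Here is a minimal sketch of the same phone regex written that way; the comments inside the pattern are our own labels.

```python
import re

# A triple-quoted raw string lets each part of the pattern sit on its
# own line with a comment. re.VERBOSE strips the whitespace and comments.
phone_regex = re.compile(r'''
    \d{3}   # first three digits
    -       # separator
    \d{3}   # next three digits
    -       # separator
    \d{4}   # last four digits
''', re.VERBOSE)

text = "Jane's number is 123-456-7890. Joe's number is 555-555-1212"
matches = phone_regex.findall(text)

print(matches)  # ['123-456-7890', '555-555-1212']
```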

Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE

We can combine re.IGNORECASE, re.DOTALL, and re.VERBOSE with the bitwise OR (|) operator.

For instance, we can do a case-insensitive search that also ignores whitespace and comments in the pattern by writing:

import re
regex = re.compile(r'jane # jane', re.IGNORECASE | re.VERBOSE)
matches = regex.findall('Jane\'s number is 123-456-7890. Joe\'s number is 555-555-1212')

Then matches has the value ['Jane'], since we combined re.IGNORECASE with re.VERBOSE using the | symbol to do a case-insensitive search with a commented pattern.
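
All three flags can be combined in the same way. The sketch below uses a made-up two-line string so that each flag has a visible effect.

```python
import re

text = 'JANE\nJoe'

# re.IGNORECASE: 'jane' and 'joe' match regardless of case.
# re.DOTALL: '.*' can cross the newline between the two names.
# re.VERBOSE: whitespace and comments in the pattern are ignored.
all_flags = re.compile(r'''
    jane   # first name
    .*     # anything in between, including newlines
    joe    # second name
''', re.IGNORECASE | re.DOTALL | re.VERBOSE)

# The same pattern without re.DOTALL cannot get past the newline.
no_dotall = re.compile(r'''
    jane
    .*
    joe
''', re.IGNORECASE | re.VERBOSE)

matched = all_flags.search(text).group()
unmatched = no_dotall.search(text)

print(repr(matched))  # 'JANE\nJoe'
print(unmatched)      # None
```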

Conclusion

We can pass different flags to the re.compile function to adjust how regex searches are done.

re.IGNORECASE lets us do a case-insensitive search.

re.VERBOSE makes the regex engine ignore whitespace and comments in our regex.

re.DOTALL makes the dot match newline characters as well.

The 3 constants above can be combined with the | operator.

The sub method leaves the original string unchanged and returns a new string in which every match is replaced with the string we passed in.
