Regular Expression with Python

What is a regular expression?

A regular expression (regex) is a sequence of characters that defines a search pattern. This definition is taken from Geeks for Geeks. Let's try another definition.

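A regular expression is a pattern of characters. The pattern is used for searching and replacing characters in strings. Some definitions go on to mention a RegExp object with added properties and methods, but that object belongs to JavaScript; in Python, regular expressions are handled through the re module.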

Still haven't got the grasp of it? OK, let me explain.

When we speak, read, or write a language (a human language, not a programming language), we follow a set of rules called grammar. Grammar tells us what belongs to the language and what does not. A regular expression is a small language of its own: it is a way to say whether a string matches a pattern or not. Let's take a look at some of the special characters used in regular expressions:

^ => Matches the beginning of the line

$ => Matches the end of the line

. => Matches any character

\s => Matches whitespace

\S => Matches any non-whitespace character

* => Repeats a character zero or more times

*? => Repeats a character zero or more times (non-greedy)

+ => Repeats a character one or more times

+? => Repeats a character one or more times (non-greedy)

[aeiou] => Matches a single character in the listed set

[^XYZ] => Matches a single character not in the listed set

[a-z] [0-9] => The set of characters can include a range

( => Indicates where string extraction is to start

) => Indicates where string extraction is to end
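
To make the list above a little more concrete, here is a minimal sketch that tries a few of these symbols on a made-up string. The sample string and the variable names are assumptions for illustration only; they are not part of the examples used later.

```python
import re

# Hypothetical sample string, chosen only to exercise the symbols above.
# Note the two spaces between "hat" and "bat".
sample = "cat hat  bat 42"

starts_with_c = re.search(r"^c", sample) is not None        # ^ anchors at the start
ends_with_digit = re.search(r"[0-9]$", sample) is not None  # $ anchors at the end
vowels = re.findall(r"[aeiou]", sample)      # one character from the listed set
words = re.findall(r"\S+", sample)           # runs of non-whitespace characters
gaps = re.findall(r"\s+", sample)            # runs of whitespace characters
not_letters = re.findall(r"[^a-z\s]", sample)  # anything that is not a-z or whitespace
first_letters = re.findall(r"(\S)at", sample)  # ( ) extracts only the part inside

print(starts_with_c, ends_with_digit)  # True True
print(vowels)         # ['a', 'a', 'a']
print(words)          # ['cat', 'hat', 'bat', '42']
print(gaps)           # [' ', '  ', ' ']
print(not_letters)    # ['4', '2']
print(first_letters)  # ['c', 'h', 'b']
```

The patterns are written as raw strings (the r prefix) so that Python passes backslashes such as \S to the regex engine unchanged.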

With these symbols, we set rules to match strings from a file, a paragraph, or a line. It is like searching for a string, but a much smarter form of search. Let's say we want to find "Hello" in the string "Hello World". In plain Python, it looks like this:

text = "Hello World"
splittedText = text.split(' ')  # split the text into words
for i in splittedText:
    if i == "Hello":
        print(True)

So, how is it gonna look with a regular expression? Let's see...

import re
text = "Hello World"
y = re.search(r"^H\S+", text)  # raw string, so \S reaches the regex engine unchanged
print(y)

Here you can see we are using the re module, which we have to import whenever we use regular expressions. In the third line, the string passed to the search function says: find a string in the text that starts (^) with H, followed by non-whitespace characters (\S), one or more of them (+). It's like coding, but with individual characters. It replaces a for loop and an if statement to find Hello.
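
Printing the result of re.search shows a match object rather than the matched text itself. A small sketch of how the text could be pulled out, using the match object's group() and span() methods (the variable names found and span are just for illustration):

```python
import re

text = "Hello World"
match = re.search(r"^H\S+", text)

# re.search returns a match object on success and None on failure,
# so check the result before asking for the matched text.
if match:
    found = match.group()  # the substring that matched the pattern
    span = match.span()    # (start, end) positions of the match in text
else:
    found = None
    span = None

print(found)  # Hello
print(span)   # (0, 5)
```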

Regular expressions are

  • very powerful and quite cryptic

  • fun once you understand them

  • a language unto themselves

  • a language of ‘marker characters’ – programming with characters

  • a kind of old-school language - compact

Let's understand it better with some examples:

Using Regular Expressions like find()

We have already used the re.search() function. But how well does it work compared to Python's regular string methods?

We are going to use this text for the rest of the coding. Save this text in a text file on your computer at the same location where you're going to practice the code. I have saved it as abcd.txt.

text = "From: To: As we continue to work on the upcoming project, I wanted to remind everyone to please use the following email addresses for all communication: - for general project-related inquiries - for any marketing-related questions - for technical assistance Also, please keep in mind that all updates and important information will be sent out through these email addresses, so be sure to check them regularly."

text = open('abcd.txt')
for line in text:
    line = line.rstrip()
    if line.find("From:") >= 0:
        print(line)

This is how it's done with the find method. Here, the find method searches for the 'From:' string in every line of the abcd.txt file. If it finds the string, it returns the index where the string starts (0 when the line begins with 'From:'); if it doesn't, it returns -1. So when line.find("From:") returns a value greater than or equal to 0, we print the line. As we search through the text file, the code skips most of the lines, because only one line contains the 'From:' string. The same kind of thing happens when we use regular expressions. Let's see how it's done with a regular expression.
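
To see why the check is written as >= 0, here is a quick sketch of what find returns in three situations. The strings are made up for illustration and are not taken from abcd.txt.

```python
# str.find returns the index of the first occurrence, or -1 if there is none.
at_start = "From: someone".find("From:")        # found at the very beginning
in_middle = "Sent From: someone".find("From:")  # found later in the string
missing = "To: someone".find("From:")           # not found at all

print(at_start)   # 0
print(in_middle)  # 5
print(missing)    # -1
```

Because 0 is a valid position, testing the result with a plain if would wrongly treat a match at the start of the line as "not found"; comparing against 0 (or against -1) avoids that.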

import re
text = open('abcd.txt')
for line in text:
    line = line.rstrip()  # strip the trailing newline
    if re.search('From:', line):
        print(line)

It is as simple as the find method, except that we import the regular expression library re. Here re.search looks for the 'From:' string in the line. When it finds one, the if condition passes and the line is printed. This is so simple that we would probably not use regular expressions for it.

Using Regular Expressions like startswith()

The startswith() method checks whether a string starts with a particular substring. If it does, startswith() returns True; otherwise, it returns False.

text = open('abcd.txt')
for line in text:
    line = line.rstrip()
    if line.startswith("From:"):
        print(line)

The code searches through the text file's lines and checks whether each line starts with the 'From:' string. If a line does, the code prints it. Now with regex:

import re
text = open('abcd.txt')
for line in text:
    line = line.rstrip()  # strip the trailing newline
    if re.search('^From:', line):  # ^ requires the match at the start of the line
        print(line)

In regular coding, we used a separate method for each task. With regular expressions, we used the same function and only made a slight change to the pattern string. The ^ we added tells the program that we want the lines from this text file that start with the 'From:' string. That's how we do it with a regular expression.
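
The difference the ^ makes can be seen side by side in a short sketch. The list of lines here is hypothetical and stands in for the lines of a file.

```python
import re

# Hypothetical lines standing in for the contents of a text file.
lines = ["From: alice", "Reply sent From: bob", "To: carol"]

# Without ^ the pattern may match anywhere in the line (like find).
anywhere = [line for line in lines if re.search(r"From:", line)]

# With ^ the pattern must match at the start of the line (like startswith).
at_start = [line for line in lines if re.search(r"^From:", line)]

print(anywhere)  # ['From: alice', 'Reply sent From: bob']
print(at_start)  # ['From: alice']
```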

I hope you all understand regular expressions by now. Let's try to do more with them.

Extracting Data from The Text

One application of regular expressions is extracting data: for example, extracting all hashtags from a tweet, or getting email IDs or phone numbers from large unstructured text. Now we are going to extract all the emails from our abcd.txt file.

import re
text = open('abcd.txt')
emails = []
for line in text:
    line = line.rstrip()
    email = re.findall(r'\S+@\S+', line)  # list of every match on this line
    if email:
        emails.extend(email)  # add each match, keeping emails a flat list
print(emails)

In this code, we used the findall() function of the re module. Here, the pattern says, "Hey, I want one or more (represented in code by +) non-blank characters (represented in code by \S), followed by @, followed by one or more non-blank characters". This is the grammar for finding emails in this text file. Any string that fits this pattern is treated as an email and added to the emails list. As a result, we have extracted the emails from the text file. This is a simple form of email extraction. It gets more complex if the text is noisier, meaning that strings other than emails contain an @, or that punctuation sits right next to an address. Then you have to design the regular expression with more care.
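
As a rough sketch of what "more care" could look like, the example below compares the simple pattern with a tighter one on a made-up sentence. The sample sentence, the example.com address, and the tighter pattern are all assumptions for illustration; the tighter pattern is one common approximation, not a complete email validator.

```python
import re

# Hypothetical noisy text: a trailing comma after the address and a stray "@".
sample = "Contact info@example.com, or meet @ noon."

# The simple pattern also picks up the comma that follows the address.
loose = re.findall(r"\S+@\S+", sample)

# A tighter pattern: allowed characters, then @, then a domain that ends in
# a dot (\.) followed by at least two letters ({2,}).
tight = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", sample)

# Parentheses mark the part to extract: here only the domain after the @.
domains = re.findall(r"@([A-Za-z0-9.-]+\.[A-Za-z]{2,})", sample)

print(loose)    # ['info@example.com,']
print(tight)    # ['info@example.com']
print(domains)  # ['example.com']
```

The last pattern also shows the parentheses from the symbol list in use: the whole pattern has to match, but findall returns only what is inside the parentheses.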

Greedy Matching

When we try to match strings using regular expressions, we sometimes run into greedy matching. Like in this problem:

import re
text = 'From: using the data: character'
x = re.findall('^F.+:', text)
print(x)

Here, we wrote the pattern to extract the 'From:' string from the text, but when we run this code we get 'From: using the data:' as output. We only wanted 'From:'. OK, let's think about what the regular expression means. It means we want a string that starts with F, then matches any character one or more times, and ends with a colon. This is true for the 'From:' string, but it is also true for the 'From: using the data:' string. When more than one match is possible, + takes as many characters as it can, so we get the longer one back. This is called greedy matching, and regular expressions do it by default. We can prevent it by putting a '?' right after the + to make it non-greedy. Let's try it.

import re
text = 'From: using the data: character'
x = re.findall('^F.+?:', text)
print(x)

Now we will just get the 'From:' string as output.
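
For comparison, here is a small sketch that runs the greedy and non-greedy patterns next to each other, plus one alternative that is not used above: a negated set [^:] that refuses to cross a colon in the first place. The variable names are just for illustration.

```python
import re

text = 'From: using the data: character'

greedy = re.findall(r'^F.+:', text)      # .+ grabs as much as it can
lazy = re.findall(r'^F.+?:', text)       # .+? stops at the first colon it can
negated = re.findall(r'^F[^:]+:', text)  # [^:]+ can never contain a colon

print(greedy)   # ['From: using the data:']
print(lazy)     # ['From:']
print(negated)  # ['From:']
```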

Besides the search and findall functions of the re module, there are more. You can visit https://docs.python.org/3/library/re.html#functions to find out about them and play with them.
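
As a starting point for playing around, here is a brief sketch of three other functions from that page: re.sub, re.split, and re.match. The sample text is reused from the greedy matching example; the particular patterns are only illustrations.

```python
import re

text = 'From: using the data: character'

# re.sub replaces every match of the pattern with the given string.
replaced = re.sub(r':', ' -', text)

# re.split cuts the string wherever the pattern matches.
parts = re.split(r':\s*', text)

# re.match only looks at the beginning of the string,
# while re.search scans the whole string.
match_result = re.match(r'using', text)
search_result = re.search(r'using', text)

print(replaced)       # From - using the data - character
print(parts)          # ['From', 'using the data', 'character']
print(match_result)   # None
print(search_result is not None)  # True
```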

Hope you will benefit from this article. Like the article for more like this. If you want a video explanation of this article please let me know.