As a data scientist, you’ll frequently encounter messy, unstructured text data. Before you can analyze this data, you need to clean it, extract relevant information, and transform it into a structured format. This is where regular expressions come in useful.
Think of regex as a specialized mini-language for describing patterns in text. Once you understand the core concepts, you’ll be able to perform complex text operations with just a few lines of code that would otherwise require dozens of lines using standard string methods.
🔗 Link to the code on GitHub. You can also check out this quick reference regex table.
How to Think About Regex
The key to mastering regex is developing the right mental model. At its core, a regular expression is simply a pattern that moves through text from left to right, trying to find matches.
Imagine you’re looking for a specific pattern in a book. You scan each page, looking for that pattern. That’s essentially what regex does—it scans through your text character by character, checking if the current position matches your pattern.
Let’s start by importing Python’s built-in re module:
import re
1. Literal Characters: Building Your First Regex Pattern
The simplest regex patterns match exact text. If you want to find the word “data” in a text, you can use:
text = "Data science is cool as you get to work with real-world data"
matches = re.findall(r"data", text)
print(matches)
Output >>> ['data']
Notice that this only found the lowercase “data” and missed “Data” at the beginning.
Regular expressions are case-sensitive by default. This brings us to our first important lesson: be specific about what you want to match.
matches = re.findall(r"data", text, re.IGNORECASE)
print(matches)
Output >>> ['Data', 'data']
The r before the string creates a “raw string.” This is important in regex because backslashes are used for special sequences, and raw strings prevent Python from interpreting these backslashes.
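To see why the raw prefix matters, compare the same word-boundary pattern written both ways (a minimal sketch):

```python
import re

text = "one two"

# In a normal string, "\b" is interpreted by Python as a backspace
# character, so the regex engine never sees a word-boundary token:
print(re.findall("\btwo\b", text))   # []

# A raw string passes the backslash through to the regex engine:
print(re.findall(r"\btwo\b", text))  # ['two']
```

Using raw strings for every pattern is a good habit, even when the pattern contains no backslashes.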
2. Metacharacters: Beyond Literal Matching
What makes regex useful is its ability to define patterns using metacharacters. These are special characters that have meaning beyond their literal representation.
The Wildcard: The Dot (.)
The dot matches any character except a newline. This is particularly useful when you know part of a pattern but not everything:
text = "The cat sat on the mat. The bat flew over the rat."
pattern = r"The ... "
matches = re.findall(pattern, text)
print(matches)
Here, we’re finding “The” followed by any three characters and a space.
Output >>> ['The cat ', 'The bat ']
The dot is powerful, but sometimes too powerful—it matches anything! This is where character classes come in.
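The flip side of the dot's power: when you want to match a literal period, you have to escape it with a backslash. A quick sketch of the difference:

```python
import re

prices = "Items cost 3.14, 3514, and 3x14 units"

# The unescaped dot matches any character, so all three strings match:
print(re.findall(r"3.14", prices))   # ['3.14', '3514', '3x14']

# Escaping the dot restricts it to a literal period:
print(re.findall(r"3\.14", prices))  # ['3.14']
```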
Character Classes: Getting Specific with []
Character classes let you define a set of characters to match:
text = "The cat sat on the mat. The bat flew over the rat."
pattern = r"[cb]at"
matches = re.findall(pattern, text)
print(matches)
This pattern finds “cat” or “bat”—any character in the set [cb] followed by “at”.
Output >>> ['cat', 'bat']
Character classes are perfect when you have a limited set of characters that could appear in a certain position.
You can also use ranges in character classes:
# Find all lowercase words that start with a-d
pattern = r"\b[a-d][a-z]*\b"
text = "apple banana cherry date elephant fig grape kiwi lemon mango orange"
matches = re.findall(pattern, text)
print(matches)
Here, \b represents a word boundary (more on this later), [a-d] matches any lowercase letter from a to d, and [a-z]* matches zero or more lowercase letters.
Output >>> ['apple', 'banana', 'cherry', 'date']
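The word boundary \b is what keeps a pattern from matching inside longer words. A small sketch of the difference it makes:

```python
import re

text = "cat catalog concatenate"

# Without boundaries, "cat" also matches inside longer words:
print(re.findall(r"cat", text))      # ['cat', 'cat', 'cat']

# \b on both sides restricts the match to the standalone word:
print(re.findall(r"\bcat\b", text))  # ['cat']
```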
Quantifiers: Specifying Repetition
Often, you’ll want to match a pattern that repeats. Quantifiers let you specify how many times a character or group should appear. Let’s find all phone numbers, whether they use hyphens or not:
text = "Phone numbers: 555-1234, 555-5678, 5551234"
pattern = r"\b\d{3}-?\d{4}\b"
matches = re.findall(pattern, text)
print(matches)
This gives the following:
Output >>> ['555-1234', '555-5678', '5551234']
Breaking down this pattern:
- \b ensures we’re at a word boundary
- \d{3} matches exactly 3 digits
- -? matches zero or one hyphen (the ? makes the hyphen optional)
- \d{4} matches exactly 4 digits
- \b ensures we’re at another word boundary
This is much more elegant than writing multiple patterns or complex string operations to handle different formats.
3. Anchors: Finding Patterns at Specific Positions
Sometimes you only want to find patterns at specific positions in the text. Anchors help with this:
text = "Python is popular in data science."
# ^ anchors to the start of the string
start_matches = re.findall(r"^Python", text)
print(start_matches)
# $ anchors to the end of the string
end_matches = re.findall(r"science\.$", text)
print(end_matches)
This outputs:
Output >>> ['Python']
Output >>> ['science.']
Anchors don’t match characters; they match positions. This is useful for validating formats like email addresses, where specific elements must appear at the beginning or end.
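As a sketch of anchor-based validation (the two-letter/four-digit ID format here is hypothetical, purely for illustration):

```python
import re

# Hypothetical format: exactly two uppercase letters then four digits,
# with nothing before or after.
pattern = r"^[A-Z]{2}\d{4}$"

print(bool(re.search(pattern, "AB1234")))    # True
print(bool(re.search(pattern, "xAB1234y")))  # False: anchors reject extra text

# re.fullmatch implies both anchors automatically:
print(bool(re.fullmatch(r"[A-Z]{2}\d{4}", "AB1234")))  # True
```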
4. Capturing Groups: Extracting Specific Parts
Often in data science, you don’t just want to find patterns—you want to extract specific parts of those patterns. Capturing groups, created with parentheses, let you do this:
text = "Dates: 2023-10-15, 2022-05-22"
pattern = r"(\d{4})-(\d{2})-(\d{2})"
# findall returns tuples of the captured groups
matches = re.findall(pattern, text)
print(matches)
# You can use these to create structured data
for year, month, day in matches:
    print(f"Year: {year}, Month: {month}, Day: {day}")
Here’s the output:
[('2023', '10', '15'), ('2022', '05', '22')]
Year: 2023, Month: 10, Day: 15
Year: 2022, Month: 05, Day: 22
This is especially helpful in extracting structured information from unstructured text, a common task in data science.
5. Named Groups: Making Your Regex More Readable
For complex patterns, remembering what each group captures can be challenging. Named groups solve this:
text = "Contact: john.doe@example.com"
pattern = r"(?P<username>[\w.]+)@(?P<domain>[\w.]+)"
match_ = re.search(pattern, text)
if match_:
    print(f"Username: {match_.group('username')}")
    print(f"Domain: {match_.group('domain')}")
This gives:
Username: john.doe
Domain: example.com
Named groups make your regex more self-documenting and easier to maintain.
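Named groups also work with the match object's groupdict() method, which returns all captured values as a dictionary, handy for building records. A sketch using the same email pattern:

```python
import re

text = "Contact: john.doe@example.com"
pattern = r"(?P<username>[\w.]+)@(?P<domain>[\w.]+)"

match = re.search(pattern, text)
if match:
    # groupdict() maps each group name to its captured value:
    print(match.groupdict())  # {'username': 'john.doe', 'domain': 'example.com'}
```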
Working with Real Data: Practical Examples
Let’s see how regex applies to common data science tasks.
Example 1: Cleaning Messy Data
Suppose you have a dataset with inconsistent product codes:
product_codes = [
    "PROD-123",
    "Product 456",
    "prod_789",
    "PR-101",
    "p-202"
]
You want to standardize these to extract just the numeric part:
cleaned_codes = []
for code in product_codes:
    # Extract just the numeric portion
    match = re.search(r"\d+", code)
    if match:
        cleaned_codes.append(match.group())
print(cleaned_codes)
Output:
['123', '456', '789', '101', '202']
This is much cleaner than writing multiple string operations to handle different formats.
Example 2: Extracting Information from Text
Imagine you have customer service logs and need to extract information:
log = "ISSUE #1234 [2023-10-15] Customer reported app crash on iPhone 12, iOS 15.2"
You can extract structured data with regex:
# Extract issue number, date, device, and OS version
pattern = r"ISSUE #(\d+) \[(\d{4}-\d{2}-\d{2})\].*?(iPhone \d+).*?(iOS \d+\.\d+)"
match = re.search(pattern, log)
if match:
    issue_num, date, device, ios_version = match.groups()
    print(f"Issue: {issue_num}")
    print(f"Date: {date}")
    print(f"Device: {device}")
    print(f"iOS Version: {ios_version}")
Output:
Issue: 1234
Date: 2023-10-15
Device: iPhone 12
iOS Version: iOS 15.2
Example 3: Data Validation
Regular expressions are useful for validating data formats:
def validate_email(email):
    """Validate email format with explanation of what makes it valid or invalid."""
    pattern = r"^[\w.%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"
    if not re.match(pattern, email):
        # Let's check specific issues
        if '@' not in email:
            return False, "Missing @ symbol"
        username, domain = email.split('@', 1)
        if not username:
            return False, "Username is empty"
        if '.' not in domain:
            return False, "Invalid domain (missing top-level domain)"
        return False, "Invalid email format"
    return True, "Valid email"
Now test with different emails:
# Test with different emails
emails = ["user@example.com", "invalid@.com", "no_at_sign.com", "user@example.co.uk"]
for email in emails:
    valid, reason = validate_email(email)
    print(f"{email}: {reason}")
Output:
user@example.com: Valid email
invalid@.com: Invalid email format
no_at_sign.com: Missing @ symbol
user@example.co.uk: Valid email
This function not only validates emails but explains what makes them valid or invalid, which is more useful than a simple true/false result.
Rather than just listing patterns, let’s understand the components that make them work:
Email Validation
pattern = r"^[\w.%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"
Breaking this down:
- ^ ensures we start at the beginning of the string
- [\w.%+-]+ matches one or more word characters, dots, percent signs, plus signs, or hyphens (common username characters)
- @ matches the literal @ symbol
- [A-Za-z0-9.-]+ matches one or more letters, numbers, dots, or hyphens (domain name)
- \. matches a literal dot
- [A-Za-z]{2,} matches two or more letters (top-level domain)
- $ ensures we end at the end of the string
This pattern allows for valid emails while rejecting invalid formats.
Date Extraction
pattern = r"\b(\d{4})-(\d{2})-(\d{2})\b"
This pattern matches ISO dates (YYYY-MM-DD):
- \b ensures we’re at a word boundary
- (\d{4}) captures exactly 4 digits for the year
- - matches the literal hyphen
- (\d{2}) captures exactly 2 digits for the month
- - matches the literal hyphen
- (\d{2}) captures exactly 2 digits for the day
- \b ensures we’re at a word boundary
Understanding this structure lets you adapt it for other date formats like MM/DD/YYYY.
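For instance, a US-style MM/DD/YYYY variant only needs the groups reordered and the separator swapped (a sketch with made-up sample text):

```python
import re

text = "Shipped 10/15/2023, delivered 10/18/2023"

# Same structure as the ISO pattern: boundaries, fixed-width digit
# groups, and a literal separator between them.
pattern = r"\b(\d{2})/(\d{2})/(\d{4})\b"
print(re.findall(pattern, text))  # [('10', '15', '2023'), ('10', '18', '2023')]
```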
Advanced Techniques: Beyond Basic Regex
As you become more comfortable with regex, you’ll encounter situations where basic patterns fall short. Here are some advanced techniques:
Lookaheads and Lookbehinds
These are “zero-width assertions” that check if a pattern exists without including it in the match:
# Password validation
password = "Password123"
has_uppercase = bool(re.search(r"(?=.*[A-Z])", password))
has_lowercase = bool(re.search(r"(?=.*[a-z])", password))
has_digit = bool(re.search(r"(?=.*\d)", password))
is_long_enough = len(password) >= 8
if all([has_uppercase, has_lowercase, has_digit, is_long_enough]):
    print("Password meets requirements")
else:
    print("Password does not meet all requirements")
Output:
Password meets requirements
The lookahead (?=.*[A-Z]) checks if there’s an uppercase letter anywhere in the string without actually capturing it.
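The separate checks can also be folded into a single pattern, a common idiom where each lookahead enforces one rule and the final quantifier enforces the length (a sketch):

```python
import re

# Each lookahead checks one rule from the start of the string;
# .{8,} then requires at least 8 characters overall.
policy = r"^(?=.*[A-Z])(?=.*[a-z])(?=.*\d).{8,}$"

print(bool(re.match(policy, "Password123")))  # True
print(bool(re.match(policy, "password")))     # False: no uppercase, no digit
```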
Non-Greedy Matching
Quantifiers are “greedy” by default, meaning they match as much text as possible. Adding a ? after a quantifier makes it “non-greedy”:
text = "<p>First content</p><p>Second content</p>"
# Greedy matching (default)
greedy = re.findall(r"<p>(.*)</p>", text)
print(f"Greedy: {greedy}")
# Non-greedy matching
non_greedy = re.findall(r"<p>(.*?)</p>", text)
print(f"Non-greedy: {non_greedy}")
Output:
Greedy: ['First content</p><p>Second content']
Non-greedy: ['First content', 'Second content']
Understanding the difference between greedy and non-greedy matching is necessary for parsing nested structures like HTML or JSON.
Learning and Debugging Regex
When you’re learning regular expressions:
- Start with literal matching: Match exact strings before adding complexity
- Add character classes: Learn to match categories of characters
- Master quantifiers: Understand repetition patterns
- Use capturing groups: Extract structured data
- Learn anchors and boundaries: Control where patterns match
- Explore advanced techniques: Lookaheads, non-greedy matching, etc.
The key is to keep practicing: start simple and gradually add complexity as needed.
When your regex isn’t working as expected:
- Break it down: Test simpler versions of your pattern to isolate the issue
- Visualize it: Use tools like regex101.com to see how your pattern matches step by step
- Test with sample data: Create small test cases that cover different scenarios
For example, if you’re trying to match phone numbers but your pattern isn’t working, try matching just the area code first, then add more components gradually.
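That incremental approach might look like this sketch (the sample text and separators are made up for illustration):

```python
import re

sample = "Call 555-1234 or 555.5678"

# Step 1: confirm the digit matching works on its own
print(re.findall(r"\d{3}", sample))           # ['555', '123', '555', '567']

# Step 2: add a separator class and the last four digits
print(re.findall(r"\d{3}[-.]\d{4}", sample))  # ['555-1234', '555.5678']
```

Each step narrows down exactly which part of the pattern is failing.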
Wrapping Up
Regular expressions are a powerful tool for text processing in data science. They allow you to:
- Extract structured information from unstructured text
- Clean and standardize inconsistent data formats
- Validate data against specific patterns
- Transform text through sophisticated search and replace operations
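The last point uses re.sub, which performs regex-based search and replace. A minimal sketch, with a made-up redaction example:

```python
import re

notes = "Contact alice@example.com or bob@example.org"

# re.sub replaces every match of the pattern with the given string:
redacted = re.sub(r"[\w.]+@[\w.]+", "[EMAIL]", notes)
print(redacted)  # Contact [EMAIL] or [EMAIL]
```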
Remember that regex is a skill that develops over time. Don’t try to memorize every metacharacter and technique—instead, focus on understanding the underlying principles and practice regularly with real-world data problems.
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.