The Python regular expression module re provides the split() method, which splits a string into a list of substrings at each match of a regular expression pattern. The re.split() method is similar to the Python string’s split() method, but it is more flexible and powerful, because the separator is a regex pattern rather than a fixed string.
This tutorial discusses the re.split() method in detail with the help of some examples. By the end of this article, you will have a solid understanding of how to use re.split() to split a string in Python.
Here is a short overview of the operations that we are going to cover.
Operation | Description |
re.split(pattern, string) | Splits the string into a list of substrings at each occurrence of the pattern. |
re.split(pattern, string, maxsplit=3) | Performs at most 3 splits; the rest of the string is returned as the last element. |
re.split(r'a|b', string) | Splits the string at either a or b. |
re.split(r'(a|b)', string) | Splits the string at either a or b and also includes the separators in the list. |
How to use the Regex re.split() function in Python?
Similar to the string’s split() method, the re.split() method splits a string into a list based on a separator. With re.split(), the separator is a regular expression pattern, and the function splits the target string at each match of that pattern.
re.split() Function Syntax
import re
re.split(pattern, string, maxsplit=0, flags=0)
The re.split() method has only two mandatory arguments, pattern and string. The other two, maxsplit and flags, are optional.
Arguments
- pattern: (regular expression): The regular expression that is used as the separator for the split.
- string: (str): The target string that we want to split.
- maxsplit: (int): An optional argument that limits the number of splits. Its default value is 0, which means the string is split at every occurrence of the pattern. A positive integer limits the number of splits to that value.
- flags: (regex flags): An optional argument that modifies how the pattern is matched. Its default value is 0, which means no flags are set. We can set it to, for example, re.I for case-insensitive matching or re.A for ASCII-only matching.
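As a quick illustration of the four arguments working together, here is a minimal sketch. The sample string is made up for this example and is not from a real dataset.

```python
import re

# Hypothetical sample: fields separated by the letter "x" in either case.
record = "alphaXbetaxgammaXdelta"

# pattern -> r"x" (the separator), string -> record (the text to split)
# maxsplit=2 stops after two splits; flags=re.I lets "x" match "X" too.
fields = re.split(r"x", record, maxsplit=2, flags=re.I)
print(fields)  # ['alpha', 'beta', 'gammaXdelta']
```

Passing maxsplit and flags by keyword is a deliberate choice here: a flag placed positionally in the third slot would be read as maxsplit, which is an easy mistake to make.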
Return Value of re.split() method
The re.split() method returns a list of substrings, using each occurrence of the pattern as a separator. If the pattern is not found in the target string, the method returns a list containing the target string as its only element.
Note: If the pattern contains a capturing group (parentheses), the text matched by the group is also included in the returned list.
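The return-value rules above can be checked with a short sketch. The input strings are arbitrary samples chosen for illustration.

```python
import re

# No match: the whole string comes back as a single-element list.
no_match = re.split(r";", "no semicolons here")
print(no_match)  # ['no semicolons here']

# A separator at the very start or end yields an empty string there.
edges = re.split(r",", ",a,b,")
print(edges)  # ['', 'a', 'b', '']

# With a capturing group, the matched separator is kept in the list.
kept = re.split(r"(,)", "a,b")
print(kept)  # ['a', ',', 'b']
```

The empty strings in the second case are worth remembering, since they often need to be filtered out before further processing.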
Examples
Use the re.split() method to split a string into words.
To split a string into a list of words, we split it on whitespace, so we use a whitespace pattern as the separator for the re.split() method. The \s pattern matches a single whitespace character, and \s+ matches one or more consecutive whitespace characters. We will use \s+ as the pattern so that a run of spaces, tabs, or newlines counts as a single separator.
import re
#pattern sequence for multiple white space
pattern = r'\s+'
#targeted string
string = "Hello Welcome to TechGeekBuzz Python RegEX Tutorial"
#split the string by white spaces
word_list = re.split(pattern, string)
print(word_list)
Output
['Hello', 'Welcome', 'to', 'TechGeekBuzz', 'Python', 'RegEX', 'Tutorial']
How to limit the number of splits?
The re.split() method accepts a maxsplit argument that limits the number of splits. By default, the value of maxsplit is 0, which means the string is split at every occurrence of the pattern. We can also set it to a positive integer to limit the number of splits.
For instance, if we set maxsplit to 3, at most three splits are performed on the string, so the resulting list has at most four elements. Let’s see the maxsplit argument in action with an example. Let’s say we have a string that contains details in id-name-DD-MM-YY format, and we need to split it into a list as id, name, and DD-MM-YY.
Here we only want to split the string at the first two hyphens.
import re
#pattern sequence for hyphen or any other non-word character
pattern = r'\W+'
#targeted string
string = "10-Rahul-23-09-1999"
#split the string by first 2 hyphens
detail = re.split(pattern, string, maxsplit=2)
print(detail)
Output
['10', 'Rahul', '23-09-1999']
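In the above example, \W+ matches one or more non-word characters, that is, anything other than letters, digits, and the underscore. As the hyphen is a non-word character, \W+ matches it. With maxsplit=2, only the first two hyphens act as separators, and the date is left intact.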
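To see how the result changes with the limit, the following sketch runs the same split with several maxsplit values. It uses a plain hyphen pattern instead of \W+ to keep the focus on maxsplit.

```python
import re

string = "10-Rahul-23-09-1999"

# Collect the result for each limit; 0 means "no limit".
results = {}
for limit in (0, 1, 2, 3):
    results[limit] = re.split(r"-", string, maxsplit=limit)
    print(limit, results[limit])

# 0 ['10', 'Rahul', '23', '09', '1999']
# 1 ['10', 'Rahul-23-09-1999']
# 2 ['10', 'Rahul', '23-09-1999']
# 3 ['10', 'Rahul', '23', '09-1999']
```

For a positive maxsplit of n, the list never has more than n + 1 elements, and whatever was not split stays together in the last one.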
How to split a string that has multiple delimiter characters?
With the Python string’s split() method, we can split a string by one fixed delimiter only. With the re.split() method, we can split a string that has multiple separators or delimiter characters.
For instance, we have a string in the format ‘Name,Department,Salary,DD-MM-YY’, and we want to extract the information from this string into a list as [Name, Department, Salary, DD, MM, YY]. How would we do that?
The answer is simple: we use the re.split() method, and it returns a list with the required items. In the given string, there are two delimiters, comma (,) and hyphen (-).
So in the pattern, we have to write a regular expression that matches either a comma or a hyphen.
Example
import re
#pattern sequence for hyphen or comma matching
pattern = r',|-'
#targeted string
string = "Rahul,Sales,20000,23-09-1999"
#split the string by commas or hyphen
detail = re.split(pattern, string)
print(detail)
Output
['Rahul', 'Sales', '20000', '23', '09', '1999']
In the above example, the pattern r',|-' is a raw string whose regular expression matches either a comma or a hyphen. For this particular string, we could get the same result with the r'\W+' pattern.
import re
#pattern sequence for non-alphanumeric characters
pattern = r'\W+'
#targeted string
string = "Rahul,Sales,20000,23-09-1999"
#split the string by non alphanumeric characters
detail = re.split(pattern, string)
print(detail)
Output
['Rahul', 'Sales', '20000', '23', '09', '1999']
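The two patterns only agree because this string contains no other non-word characters. A small sketch shows where they differ, using a character class [,-] as a third way to write "comma or hyphen". The name with a space in it is a hypothetical variation of the sample string.

```python
import re

# Hypothetical variation: the name now contains a space.
string = "Rahul Kumar,Sales,20000,23-09-1999"

# A character class lists exactly the delimiters we want to split on.
by_class = re.split(r"[,-]", string)
print(by_class)
# ['Rahul Kumar', 'Sales', '20000', '23', '09', '1999']

# \W+ treats the space as a delimiter as well, so the name is broken up.
by_nonword = re.split(r"\W+", string)
print(by_nonword)
# ['Rahul', 'Kumar', 'Sales', '20000', '23', '09', '1999']
```

When the set of delimiters is known in advance, naming them explicitly tends to be safer than relying on \W+.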
How to split the string and include the separator as well?
In the re.split() method, if the regular expression pattern is enclosed in a capturing group, that is, in parentheses (), the method also includes the matched separators in the returned list.
Example
Let’s repeat the above example where we are splitting a string that is present in the Name,Department,Salary,DD-MM-YY format. But here, we will enclose the \W+ in the capturing group such as (\W+).
import re
#pattern sequence for non-alphanumeric characters as capturing group
pattern = r'(\W+)'
#targeted string
string = "Rahul,Sales,20000,23-09-1999"
#split the string by non alphanumeric characters and include them in the list
detail = re.split(pattern, string)
print(detail)
Output
['Rahul', ',', 'Sales', ',', '20000', ',', '23', '-', '09', '-', '1999']
In this output, you can see that writing the pattern as a capturing group (\W+) not only splits the string at each match but also includes the matched separators in the list. This way of splitting comes in very handy when we want both the substrings and the separators.
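Two consequences of this behaviour are sketched below: the pieces can be joined back into the original string, and a non-capturing group (?:...) can be used when parentheses are needed for grouping but the separators should not be kept.

```python
import re

string = "Rahul,Sales,20000,23-09-1999"

# Capturing group: separators are kept, so nothing from the input is lost.
with_separators = re.split(r"(\W+)", string)
rebuilt = "".join(with_separators)
print(rebuilt == string)  # True

# Non-capturing group: groups the alternatives without keeping separators.
without_separators = re.split(r"(?:,|-)", string)
print(without_separators)
# ['Rahul', 'Sales', '20000', '23', '09', '1999']
```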
Flags argument in the re.split() method
The last argument of the re.split() method is flags. It is an optional argument whose default value is 0, which means no flags are set. The Python re module provides several flags that modify how a pattern is matched. Let’s say we have a string that contains only digits and letters, and we wish to split the string by the letters.
import re
#pattern sequence lowercase alphabets
pattern = r'[a-z]+'
#targeted string
string = "2a2bdh3HjdhHH8jd9pD3"
#split the string by given pattern
result = re.split(pattern, string)
print(result)
Output
['2', '2', '3H', 'HH8', '9', 'D3']
In this example, the pattern [a-z]+ matches only the lowercase letters and leaves the uppercase ones in the result. But we want to split the string into numbers regardless of the case of the letters.
In such cases, instead of defining a new pattern such as [a-zA-Z]+ (which will also do the trick), we can set the flags argument to re.I, which ignores case while matching the pattern for splitting.
import re
#pattern sequence lowercase alphabets
pattern = r'[a-z]+'
#targeted string
string = "2a2bdh3HjdhHH8jd9pD3"
#split the string by given pattern and ignore case
result = re.split(pattern, string, flags=re.I)
print(result)
Output
['2', '2', '3', '8', '9', '3']
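For comparison, here is a short sketch of three ways to get the same case-insensitive split on this string: the flags argument, an explicit [a-zA-Z]+ pattern, and a pattern precompiled with re.compile(). The variable names are chosen for this example only.

```python
import re

string = "2a2bdh3HjdhHH8jd9pD3"

# 1. Lowercase pattern plus the ignore-case flag.
with_flag = re.split(r"[a-z]+", string, flags=re.I)

# 2. Both cases spelled out in the character class, no flag needed.
explicit = re.split(r"[a-zA-Z]+", string)

# 3. Compile once with the flag, then call split() on the pattern object.
letters = re.compile(r"[a-z]+", re.I)
compiled = letters.split(string)

print(with_flag)  # ['2', '2', '3', '8', '9', '3']
print(with_flag == explicit == compiled)  # True
```

The compiled form is mainly useful when the same pattern is applied to many strings.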
Difference between String’s split() and regex split() methods
String split() | RegEx split() |
The string split() method splits the string into a list of substrings by a single fixed delimiter or separator. | The regex split() method can split a string into a list of substrings by multiple delimiters or separators. |
In the split() method, we cannot include the separator in the resulting list. | In the regex split() method, we can include the separator in the resulting list by using capturing groups. |
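The differences in the table can be illustrated with a small side-by-side sketch. The sample text is invented for this comparison.

```python
import re

# Hypothetical sample with two different separators.
text = "apple, banana;cherry"

# str.split(): one fixed separator, so the ';' is left untouched.
fixed = text.split(", ")
print(fixed)  # ['apple', 'banana;cherry']

# re.split(): one pattern can describe several separators.
flexible = re.split(r",\s*|;", text)
print(flexible)  # ['apple', 'banana', 'cherry']

# re.split() with a capturing group also keeps the separators.
kept = re.split(r"(,\s*|;)", text)
print(kept)  # ['apple', ', ', 'banana', ';', 'cherry']
```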
Some common examples of re.split() function
1: Split the string by five delimiters
To split a string by five delimiters, we can list all five delimiters in a character class (square brackets).
import re
string = "Hello-World;Welcome,to TechgeekBuzz.com"
#split by hyphen comma space dot semicolon
result = re.split(r"[-,\s.;]+", string)
print(result)
Output
['Hello', 'World', 'Welcome', 'to', 'TechgeekBuzz', 'com']
2: Split the string by specific words
Let’s split a string into a list by the words “and” or “or”.
import re
string = "12or1315or16and17and18"
#split by and or
result = re.split("and|or", string)
print(result)
Output
['12', '1315', '16', '17', '18']
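The pattern and|or works here because the string contains only digits around the words. In ordinary text, the same pattern would also match inside longer words. A hedged sketch of one way around this, assuming the words are surrounded by whitespace, is shown below; the sentence is a made-up sample.

```python
import re

# Hypothetical sample where "and" and "or" also occur inside other words.
sentence = "sand and order or candy"

# Naive pattern: also splits inside "sand", "order", and "candy".
naive = re.split(r"and|or", sentence)
print(naive)

# Require whitespace on both sides so only standalone words are separators.
careful = re.split(r"\s+(?:and|or)\s+", sentence)
print(careful)  # ['sand', 'order', 'candy']
```

The non-capturing group (?:and|or) keeps the alternatives together without adding the matched word to the result.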
Conclusion
This Python tutorial discussed the re.split() method. The call re.split(pattern, string, maxsplit=0, flags=0) splits the given string into a list, using each match of the given pattern as a separator.
In the method, pattern and string are the mandatory arguments, whereas maxsplit and flags are optional. The re.split() method is more powerful and flexible than the normal string split() method, which can only split a string by a fixed separator.