Python RegEx

Vinay Khatri
Last updated on November 24, 2024

    A Regular Expression (RegEx) is a sequence of characters that defines a search pattern. For example, ^c...r$ is a regular expression that matches any five-letter string that starts with c and ends with r.

    Match instances for the above example:

    Regular Expression    String     Matches
    ^c...r$               "caaar"    Match Found
                          "caar"     Not Matched
                          "Caaar"    Not Matched (case sensitive)

    To work with regular expressions in Python, we use the built-in re module.

    Example:

    import re  # import the regular expression module

    # ^ and $ are metacharacters that anchor the pattern to the start and end of the string
    str_pattern = r'^c...r$'

    string_1 = 'caaar'
    string_2 = 'Caaar'

    result_1 = re.match(str_pattern, string_1)
    result_2 = re.match(str_pattern, string_2)

    if result_1:
        print("string_1 matches the str_pattern")
    else:
        print("string_1 does not match the str_pattern")

    if result_2:
        print("string_2 matches the str_pattern")
    else:
        print("string_2 does not match the str_pattern")

    Output:

    string_1 matches the str_pattern
    string_2 does not match the str_pattern

    Behind the code: In this example, we used the re.match() function to test str_pattern against string_1 and string_2. The pattern matched string_1 but not string_2, because matching is case sensitive and string_2 starts with an uppercase C.
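
    The same check can be run over all three strings from the table above. This is only a small sketch; the names samples and results are introduced here for illustration.

```python
import re

pattern = r'^c...r$'  # the pattern from the example above

# The three sample strings from the table, checked in one pass.
samples = ['caaar', 'caar', 'Caaar']
results = {s: bool(re.match(pattern, s)) for s in samples}

for s, matched in results.items():
    print(s, '->', 'Match Found' if matched else 'Not Matched')
```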

    Metacharacters

    Metacharacters are special characters used to build a regular expression. The regular expression engine reads these characters and forms the search pattern from them. The metacharacters are [], ., ^, $, *, +, ?, {}, (), \ and |.

    Metacharacter examples:

    1. [] – Square Brackets:

    Square brackets contain a set of characters. The pattern matches if the string contains any one character from that set.

    Example:

    Regular Expression       String       Matches
    "[abcd]" or "[a-d]"      "a"          Match Found
                             "b"          Match Found
                             "c"          Match Found
                             "a man"      Match Found
                             "get out"    Match Not Found
                             "teCh"       Match Not Found

    Here the regular expression [abcd] or [a-d] searches for a string that contains any of a, b, c or d. Placing the ^ symbol at the start of the brackets complements the set, for example:

    • [^abcd] matches any character except a, b, c or d.
    • [^0-4] or [^01234] matches any character except 0, 1, 2, 3 or 4.
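
    A short sketch of both forms, using re.search() and re.findall() from the re module. The test strings come from the table above, except 'abcdef', which is assumed here for illustration.

```python
import re

strings = ['a', 'a man', 'get out', 'teCh']

# re.search() scans the whole string for any character in the set.
in_set = [s for s in strings if re.search(r'[a-d]', s)]
print(in_set)

# A leading ^ inside the brackets negates the set.
not_in_set = re.findall(r'[^a-d]', 'abcdef')
print(not_in_set)
```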

    2. . – Period:

    It matches any single character except the newline '\n'.

    Regular Expression    String    Matches
    ".."                  "a"       Match Not Found
                          "a "      1 Match Found
                          "bc"      1 Match Found
                          "aabb"    2 Matches Found (aa) and (bb)

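    The counts in the table above can be reproduced with re.findall(). This is a sketch only; the 'a\nb' string is an added assumption that shows the newline being skipped.

```python
import re

pairs = re.findall(r'..', 'aabb')    # two non-overlapping pairs
single = re.findall(r'..', 'a')      # too short for two characters
no_newline = re.findall(r'.', 'a\nb')  # the newline is not matched

print(pairs, single, no_newline)
```
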
    3. ^ - Caret:

    This metacharacter checks whether a string starts with the given character.

    Regular Expression    String      Matches
    "^c"                  "c"         Match Found
                          "car"       Match Found
                          "Cat"       Match Not Found
                          "in car"    Match Not Found

    4. $ - Dollar:

    This metacharacter checks whether a string ends with the given character:

    Regular Expression    String      Matches
    "7$"                  "cr7"       Match Found
                          "CR7"       Match Found
                          "OO7"       Match Found
                          "CR7 OO"    Match Not Found

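    Both anchors can be checked against the strings from the two tables above. This is a rough sketch that uses re.search(), so any match is the result of the anchor and not of the function.

```python
import re

start_samples = ['c', 'car', 'Cat', 'in car']
end_samples = ['cr7', 'CR7', 'OO7', 'CR7 OO']

# ^c only matches when c is the first character.
starts_with_c = [s for s in start_samples if re.search(r'^c', s)]

# 7$ only matches when 7 is the last character.
ends_with_7 = [s for s in end_samples if re.search(r'7$', s)]

print(starts_with_c)
print(ends_with_7)
```
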
    5. * - Star:

    The star metacharacter matches zero or more occurrences of the preceding character. The pattern "techge*k" matches strings in which "techg" is followed by zero or more e characters and then k.

    Regular Expression    String        Matches
    "techge*k"            "techgk"      Match Found
                          "techgek"     Match Found
                          "techgeek"    Match Found
                          "techgeak"    Match Not Found (the e is not followed by k)

    6. + - Plus:

    The plus metacharacter matches one or more occurrences of the preceding character. The pattern "techge+k" matches strings in which "techg" is followed by at least one e and then k.

    Regular Expression    String             Matches
    "techge+k"            "techgk"           Match Not Found
                          "techgek"          Match Found
                          "techgeeeeeeek"    Match Found
                          "techgeak"         Match Not Found (the e is not followed by k)

    7. ? – Question Mark:

    The question mark metacharacter matches zero or one occurrence of the preceding character. The pattern "techge?k" matches strings in which "techg" is followed by at most one e and then k.

    Regular Expression    String             Matches
    "techge?k"            "techgk"           Match Found
                          "techgek"          Match Found
                          "techgeeeeeeek"    Match Not Found
                          "techgeak"         Match Not Found (the e is not followed by k)

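    The three quantifiers *, + and ? can be compared side by side on the same word list. This is a minimal sketch; the helper matching() is a hypothetical name introduced here.

```python
import re

words = ['techgk', 'techgek', 'techgeek', 'techgeak']

def matching(pattern):
    """Return the words that contain a match for pattern."""
    return [w for w in words if re.search(pattern, w)]

star = matching(r'techge*k')      # zero or more e
plus = matching(r'techge+k')      # one or more e
question = matching(r'techge?k')  # zero or one e

print(star, plus, question, sep='\n')
```
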
    8. {} – Braces:

    Braces contain a number that specifies exactly how many times the preceding character must occur. The form {m,n} allows between m and n occurrences. The pattern "techge{2}k" matches strings that have exactly two e characters between "techg" and k.

    Regular Expression    String             Matches
    "techge{2}k"          "techgeek"         Match Found
                          "techgek"          Match Not Found
                          "techgeeeeeeek"    Match Not Found
                          "techgeak"         Match Not Found

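    A short sketch of the exact form {2} next to the range form {1,2}, which the re module also supports. The word list is an assumption chosen for illustration.

```python
import re

words = ['techgk', 'techgek', 'techgeek', 'techgeeek']

# {2} requires exactly two e characters before the k.
exactly_two = [w for w in words if re.search(r'techge{2}k', w)]

# {1,2} accepts one or two e characters.
one_or_two = [w for w in words if re.search(r'techge{1,2}k', w)]

print(exactly_two)
print(one_or_two)
```
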
    9. | - Alternation:

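    The alternation metacharacter acts as an "or" operator: the pattern matches if either alternative matches. Alternation has the lowest precedence, so "^c|k" means "^c" or "k": it matches strings that start with c or that contain k anywhere.

    Regular Expression    String    Matches
    "^c|k"                "cat"     Match Found
                          "kat"     Match Found
                          "eat"     Match Not Found

    To require c or k as the first character, group the alternatives: "^(c|k)".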

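    How | interacts with the ^ anchor is easy to misread, so here is a small sketch. The word 'take' is an assumption added to show the difference; it contains k but does not start with c or k.

```python
import re

words = ['cat', 'kat', 'eat', 'take']

# ^c|k reads as (^c) or (k): starts with c, or contains k anywhere.
loose = [w for w in words if re.search(r'^c|k', w)]

# ^(c|k) applies the anchor to both alternatives.
anchored = [w for w in words if re.search(r'^(c|k)', w)]

print(loose)
print(anchored)
```
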
    10. () – Group:

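    Parentheses group part of a pattern so that metacharacters such as | apply only inside the group. The text matched by a group is also captured and can be retrieved later. The pattern "(c|k)at" matches strings that contain cat or kat.

    Regular Expression    String        Matches
    "(c|k)at"             "kitkat"      Match Found
                          "cat food"    Match Found
                          "fish cat"    Match Found
                          "Eat food"    Match Not Found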

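    As a rough illustration of what a group captures, the sketch below reads the group back from the Match object with group(1). The variable names are assumptions.

```python
import re

match = re.search(r'(c|k)at', 'kitkat')

whole = match.group()    # the full matched text
first = match.group(1)   # the text captured by the first group
span = match.span()      # where the match sits in the string

no_match = re.search(r'(c|k)at', 'Eat food')  # None: "Eat" has no c or k before "at"

print(whole, first, span, no_match)
```
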
    11. \ - Backslash:

    The backslash is commonly used to escape special characters, and regular expressions use it the same way to escape metacharacters. A backslash before a metacharacter makes the regular expression treat it as a normal character. \$ = treated as a literal dollar sign by the RegEx. $ = treated as the end-of-string metacharacter by the RegEx.
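
    The difference between \$ and $ can be seen in a short sketch. The text 'price: $5' is an assumed sample; re.escape() is a helper in the re module that adds the backslashes for you.

```python
import re

text = 'price: $5'

# \$ matches the literal dollar sign.
escaped = re.findall(r'\$', text)

# A bare $ matches the empty position at the end of the string.
unescaped = re.findall(r'$', text)

# re.escape() escapes metacharacters in a plain string.
safe_pattern = re.escape('$5')

print(escaped, unescaped, safe_pattern)
```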

    12. \ - Special Sequence:

    In a regular expression, a backslash followed by certain normal characters forms a special sequence. Special sequences act as shorthand for commonly used patterns.

    Special Sequence    Description                                                 Expression              Matches
    \A                  Matches only at the start of the string                     r"\Ago"                 go for it
    \b                  Matches at the beginning or end of a word                   r"\btech", r"buzz\b"    techgeekbuzz
    \B                  Opposite of \b: matches where there is no word boundary     r"\Bgeek", r"geek\B"    TechgeekBuzz
    \d                  Matches any digit                                           r"\d"                   123ad23
    \D                  Opposite of \d: matches any non-digit character             r"\D"                   Techgeekbuzz
    \s                  Matches any whitespace character                            r"\s"                   Tech Geek Buzz
    \S                  Matches any non-whitespace character                        r"\S"                   TechGeekBuzz
    \w                  Matches any alphanumeric character or underscore            r"\w"                   Techgeekbuzz100
    \W                  Matches any character that is not alphanumeric or _         r"\W"                   $%&
    \Z                  Matches only at the end of the string                       r"buzz\Z"               Techgeekbuzz
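
    A few of these sequences applied to one string, as a quick sketch. The sample text 'Tech Geek Buzz 100' is an assumption made up for this illustration.

```python
import re

text = 'Tech Geek Buzz 100'

digits = re.findall(r'\d+', text)    # runs of digits
words = re.findall(r'\w+', text)     # runs of word characters
spaces = re.findall(r'\s', text)     # each whitespace character
g_words = re.findall(r'\bG\w+', text)  # words that begin with G

at_start = bool(re.search(r'\ATech', text))  # "Tech" is at the very start
at_end = bool(re.search(r'Buzz\Z', text))    # "Buzz" is not at the very end

print(digits, words, spaces, g_words, at_start, at_end)
```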

    Python RegEx

    To use the special sequences and metacharacters above, we need to import the regular expression module into our program. Python has a built-in module for regular expressions; to use it, write import re at the top of the program.

    r prefix on Regular Expression:

    In Python, we put the r prefix before a regular expression to make it a raw string, in which backslashes are not treated as escape characters. For instance, '\n' is a single newline character, whereas r'\n' is two characters: \ and n.
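
    The difference is easy to confirm by checking string lengths. This is a minimal sketch with assumed variable names.

```python
import re

normal = '\n'   # one character: a newline
raw = r'\n'     # two characters: a backslash and the letter n

print(len(normal), len(raw))

# Without the r prefix the backslash has to be doubled to get the same pattern.
same_pattern = ('\\d' == r'\d')
found = re.findall(r'\d', 'a1b2')

print(same_pattern, found)
```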

    Python Regular Expression Methods:

    The Python re module provides several functions for working with regular expressions.

    1. findall()

    The findall() method returns a list of all the substrings that match the regular expression.

    Example:

    import re
    
    pattern = r'\d+'
    string = "Hello user 1234 this is tech geek buzz 101"
    res = re.findall(pattern,string)
    print(res)

    Output

    ['1234', '101']

    Behind the code: The regular expression r'\d+' matches one or more consecutive digits, so findall() returns every run of digits in the string.
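
    Two more cases worth knowing, shown as a small sketch: findall() returns an empty list when nothing matches, and it works with any pattern, not only digits. The four-character pattern is an assumption chosen for illustration.

```python
import re

string = "Hello user 1234 this is tech geek buzz 101"

# No digits in this string, so the result is an empty list.
no_digits = re.findall(r'\d+', "Hello user")

# Whole words (or numbers) that are exactly four characters long.
four_chars = re.findall(r'\b\w{4}\b', string)

print(no_digits)
print(four_chars)
```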

    2. split()

    The split() method splits the string at every point where the pattern matches and returns the pieces as a list. Where findall() returns the matched text, split() returns the text between the matches.

    Example

    import re
    pattern = r'\d+'
    string = "Hello user 1234 this is tech geek buzz 101"
    res = re.split(pattern,string)
    print(res)

    Output

    ['Hello user ', ' this is tech geek buzz ', '']
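
    The trailing '' in the output appears because the string ends with a match, so nothing follows the last split point. The sketch below shows the maxsplit argument of re.split() and one possible way to drop empty pieces; the cleanup step is an assumption, not part of the original example.

```python
import re

string = "Hello user 1234 this is tech geek buzz 101"

# maxsplit=1 stops after the first split point.
first_split = re.split(r'\d+', string, maxsplit=1)

# Strip whitespace and discard empty pieces.
cleaned = [p.strip() for p in re.split(r'\d+', string) if p.strip()]

print(first_split)
print(cleaned)
```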

    3. sub()

    The sub() method replaces every match with the text of our choice.

    Example:

    import re
    pattern = r'\s'    # match white space
    string = "Tech Geek Buzz"
    replace_string = "-"
    
    res = re.sub(pattern,replace_string,string)
    print(res)

    Output:

    Tech-Geek-Buzz

    Behind the code: In this example, we use the re.sub() method to replace each white space with the - symbol.
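
    re.sub() also accepts a count argument that limits the number of replacements, and re.subn() reports how many were made. A brief sketch with assumed variable names:

```python
import re

string = "Tech Geek Buzz"

# Replace only the first white space.
first_only = re.sub(r'\s', '-', string, count=1)

# subn() returns the new string together with the number of replacements.
replaced, how_many = re.subn(r'\s', '-', string)

print(first_only)
print(replaced, how_many)
```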

    4. search()

    The search() method scans the string for a match. If it finds one, it returns a Match object; otherwise it returns None.

    Example:

    import re
    pattern = r'\s'  # match white space
    string = "Tech Geek"
    
    if re.search(pattern,string):
        print("yes there is space in the String")
    else:
        print("there is no space")

    Output:

    yes there is space in the String
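
    search() differs from the match() function used at the start of this article: match() only looks at the beginning of the string, while search() looks anywhere in it. A short sketch of the difference:

```python
import re

string = "Tech Geek"

# match() fails because "Geek" is not at the start of the string.
from_start = re.match(r'Geek', string)

# search() finds "Geek" further along.
anywhere = re.search(r'Geek', string)

print(from_start)
print(anywhere.span())
```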

    Match Object:

    When search() finds a match for the regular expression in the string, it returns a Match object. That object has attributes and methods that we can use to retrieve more information about the match.

    Match object attributes and methods:
    • span(): returns a tuple with the start and end positions of the match.
    • string: the string that was passed to the search.
    • group(): returns the part of the string where there was a match.
    • start(): returns the index where the match starts.
    • end(): returns the index just past the end of the match.

    Example:

    import re

    string = "Tech Geek Buzz Welcome you all!"

    # search for a word that starts with T, followed by one or more word characters
    match_object = re.search(r"\bT\w+", string)

    print("Using span() on match_object")
    print(match_object.span())

    print("Using string on match_object")
    print(match_object.string)

    print("Using group() on match_object")
    print(match_object.group())

    print("Using start() on match_object")
    print(match_object.start())

    print("Using end() on match_object")
    print(match_object.end())

    Output:

    Using span() on match_object
    (0, 4)
    Using string on match_object
    Tech Geek Buzz Welcome you all!
    Using group() on match_object
    Tech
    Using start() on match_object
    0
    Using end() on match_object
    4
