
I am trying to write a regex in Python to extract some information from a string.

Given:

"Only in Api_git/Api/folder A: new.txt"

I would like to print:

Folder Path: Api_git/Api/folder A
Filename: new.txt

After having a look at some examples on the re manual page, I'm still a bit stuck.

This is what I've tried so far:

m = re.match(r"(Only in ?P<folder_path>\w+:?P<filename>\w+)","Only in Api_git/Api/folder A: new.txt")

print m.group('folder_path')
print m.group('filename')

Can anybody point me in the right direction?

  • Is splitting by : possible? Commented Jul 17, 2014 at 14:37
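Splitting can work when every line has the same shape. A minimal sketch, assuming the line always starts with "Only in " and that the first ": " separates the folder from the file name (the variable names here are illustrative, not from the question):

```python
line = "Only in Api_git/Api/folder A: new.txt"
prefix = "Only in "

folder_path = filename = None
if line.startswith(prefix):
    # partition splits on the first occurrence of the separator only
    folder_path, _, filename = line[len(prefix):].partition(": ")

print("Folder Path: " + folder_path)
print("Filename: " + filename)
```

This avoids a regex entirely, but it gives no signal if a line has an unexpected layout beyond the prefix check.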

3 Answers


Use capturing groups and read the matched text from groups 1 and 2.

^Only in ([^:]*): (.*)$


sample code:

import re
p = re.compile(r'^Only in ([^:]*): (.*)$')
test_str = u"Only in Api_git/Api/folder A: new.txt"

# returns a list with one (folder path, filename) tuple
re.findall(p, test_str)
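To keep each captured value in its own variable, one option is to call match on the compiled pattern and unpack groups(). A minimal sketch, written with print as a function; the names pattern, folder_path and filename are illustrative:

```python
import re

pattern = re.compile(r'^Only in ([^:]*): (.*)$')
test_str = "Only in Api_git/Api/folder A: new.txt"

folder_path = filename = None
m = pattern.match(test_str)
if m:
    # groups() returns the captured groups in order as a tuple
    folder_path, filename = m.groups()
    print("Folder Path: " + folder_path)
    print("Filename: " + filename)
```

Checking m before unpacking matters because match returns None when the line does not fit the pattern.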

If you want to print in the format below, use substitution:

Folder Path: Api_git/Api/folder A 
Filename: new.txt


sample code:

import re
p = re.compile(r'^Only in ([^:]*): (.*)$')
test_str = u"Only in Api_git/Api/folder A: new.txt"
# Python's re module writes backreferences as \1 or \g<1>, not $1
subst = u"Folder Path: \\1\nFilename: \\2"

result = re.sub(p, subst, test_str)

4 Comments

Hey this looks good! Just a couple of questions, I think there's a typo because it only prints out "Folder Path: $1" at the moment. Also, is there a quick way to save each value into its own variable eg. folder_path = test_str('folder_path') filename = test_str('filename')
Get the matched groups from the first example. I included the second example only to print the desired output.
Nice detailed answer, +1 :)

Your pattern: (Only in ?P<folder_path>\w+:?P<filename>\w+) has a few flaws in it.

The ?P<name> construct is only valid immediately after an opening parenthesis, so we need this:

(Only in (?P<folder_path>\w+):(?P<filename>\w+))

The \w character class matches only letters, digits, and underscores. It won't match /, ., or a space, for example. We need a character class that more closely fits the requirements. In fact, we can just use ., which matches any character except a newline:

(Only in (?P<folder_path>.+):(?P<filename>.+))
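The difference between \w and . is easy to check on the folder path from the question. A small sketch (the variable names are illustrative):

```python
import re

text = "Api_git/Api/folder A"

# \w+ stops at the first character that is not a letter, digit, or underscore
word_match = re.match(r"\w+", text).group()
# .+ keeps going until the end of the line
any_match = re.match(r".+", text).group()

print(word_match)  # Api_git
print(any_match)   # Api_git/Api/folder A
```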

The colon has a space after it in your example text. We need to match it:

(Only in (?P<folder_path>.+): (?P<filename>.+))

The outermost parentheses are not needed. They aren't wrong, just not needed:

Only in (?P<folder_path>.+): (?P<filename>.+)

It is often convenient to provide the regular expression separate from the call to the regular expression engine. This is easily accomplished by creating a new variable, for example:

regex = r'Only in (?P<folder_path>.+): (?P<filename>.+)'
... # several lines later
m = re.match(regex, "Only in Api_git/Api/folder A: new.txt") 

The above is purely for the convenience of the programmer: it neither saves nor squanders time or memory space. There is, however, a technique that can save some of the time involved in regular expressions: compiling.

Consider this code segment:

regex = r'Only in (?P<folder_path>.+): (?P<filename>.+)'
for line in input_file:
    m = re.match(regex, line)
    ...

On each iteration of the loop, re.match must turn the pattern string into a compiled pattern (the re module caches recently used patterns, so in practice this is mostly a cache lookup) and then apply it to the line variable. The re module allows us to separate the interpretation from the application; we can interpret once but apply several times:

regex = re.compile(r'Only in (?P<folder_path>.+): (?P<filename>.+)')
for line in input_file:
    m = re.match(regex, line)
    ...
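A compiled pattern can be passed to re.match, as above, or used through its own match method; the two forms find the same groups. A minimal sketch for Python 3, with a hypothetical list of lines standing in for input_file:

```python
import re

regex = re.compile(r'Only in (?P<folder_path>.+): (?P<filename>.+)')

# Hypothetical lines standing in for input_file.
lines = [
    "Only in Api_git/Api/folder A: new.txt",
    "Only in Api_git/Api: old.txt",
]

all_groups = []
for line in lines:
    via_module = re.match(regex, line)  # module function, compiled pattern passed in
    via_method = regex.match(line)      # method on the compiled pattern
    # Both calls capture the same text.
    assert via_module.groups() == via_method.groups()
    all_groups.append(via_method.groups())

print(all_groups)
```

The method form is the one usually seen in code that compiles its patterns, since it makes the reuse explicit.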

Now, your original program should look like this:

import re
regex = re.compile(r'Only in (?P<folder_path>.+): (?P<filename>.+)')
m = re.match(regex, "Only in Api_git/Api/folder A: new.txt")
print(m.group('folder_path'))
print(m.group('filename'))

However, I'm a fan of using comments to explain regular expressions. My version, including some general cleanup, looks like this:

import re
regex = re.compile(r'''(?x)                # Verbose
            Only\ in\             # Literal match
            (?P<folder_path>.+)   # match longest sequence of anything, and put in 'folder_path'
            :\                    # Literal match
            (?P<filename>.+)      # match longest sequence of anything and put in 'filename'
            ''')

with open('diff.out') as input_file:
    for line in input_file:
        m = re.match(regex, line)
        if m:
            print(m.group('folder_path'))
            print(m.group('filename'))
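To try the loop without a diff.out file, the same idea can be run over an in-memory list. A sketch under that assumption: sample_lines is hypothetical data, and groupdict() is used to collect both named groups at once.

```python
import re

regex = re.compile(r'Only in (?P<folder_path>.+): (?P<filename>.+)')

# Hypothetical sample lines standing in for the contents of diff.out.
sample_lines = [
    "Only in Api_git/Api/folder A: new.txt\n",
    "Files a/x.py and b/x.py differ\n",
]

results = []
for line in sample_lines:
    m = regex.match(line)
    if m:
        # groupdict() maps each group name to the text it captured
        results.append(m.groupdict())

print(results)
```

Because . does not match a newline, the trailing "\n" on each line is not included in filename, and lines that do not start with "Only in " are skipped.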

4 Comments

Is there a way to extract the regex from re.match()?
What do you mean by "extract"?
Sorry - I meant to define the regex outside of re.match("regex")
@Tomcelic - See my edit. In short, use regex = re.compile(r'...'); re.match(regex, 'some text').

It really depends on how constrained the input is. If this is the only kind of input, this will do the trick:

^Only in (?P<folder_path>[a-zA-Z_/ ]*): (?P<filename>[a-z]*\.txt)$
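One detail worth checking in a pattern like this is the dot before txt: unescaped, it matches any character, while \. matches only a literal dot. A small sketch with two hypothetical filename-only patterns (loose and strict are illustrative names):

```python
import re

loose = re.compile(r'^[a-z]*.txt$')    # unescaped dot: any character
strict = re.compile(r'^[a-z]*\.txt$')  # escaped dot: a literal "."

loose_hit = bool(loose.match("newXtxt"))
strict_hit = bool(strict.match("newXtxt"))
strict_ok = bool(strict.match("new.txt"))

print(loose_hit, strict_hit, strict_ok)
```

The loose pattern accepts "newXtxt", which is probably not intended; the strict one accepts only names that really end in ".txt".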
