Using regex to extract information from string

Question

I am trying to write a regex in Python to extract some information from a string.

Given:

"Only in Api_git/Api/folder A: new.txt"

I would like to print:

Folder Path: Api_git/Api/folder A
Filename: new.txt

After having a look at some examples on the re manual page, I'm still a bit stuck.

This is what I've tried so far

m = re.match(r"(Only in ?P<folder_path>\w+:?P<filename>\w+)","Only in Api_git/Api/folder A: new.txt")

print m.group('folder_path')
print m.group('filename')

Can anybody point me in the right direction??

Is splitting by : possible?

– Dalorzo, Commented Jul 17, 2014 at 14:37
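
Splitting on the colon does work for this particular input. A minimal sketch, assuming every line starts with the fixed prefix "Only in " and that the first ": " separates the folder from the filename (the variable names are illustrative, not from the question):

```python
line = "Only in Api_git/Api/folder A: new.txt"
prefix = "Only in "

# partition() splits at the first occurrence of the separator only
head, sep, filename = line.partition(": ")

# sep is empty when the separator was not found
if sep and head.startswith(prefix):
    folder_path = head[len(prefix):]
    print("Folder Path:", folder_path)
    print("Filename:", filename)
```

This avoids a regex entirely, but it does no validation beyond the prefix check.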

Braj · Accepted Answer · 2014-07-17 14:50:19Z

4

Use two capturing groups and read the matches from group 1 (the folder path) and group 2 (the filename).

^Only in ([^:]*): (.*)$


sample code:

import re
p = re.compile(r'^Only in ([^:]*): (.*)$')
test_str = "Only in Api_git/Api/folder A: new.txt"

# findall returns a list with one (folder_path, filename) tuple per match
print(re.findall(p, test_str))
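
To keep each value in its own variable, the two capturing groups can be unpacked from a match object. This is a sketch using the same pattern; the names folder_path and filename are taken from the question, not from this answer:

```python
import re

p = re.compile(r'^Only in ([^:]*): (.*)$')
test_str = "Only in Api_git/Api/folder A: new.txt"

m = p.match(test_str)
if m:
    # groups() returns the captured groups as a tuple, in order
    folder_path, filename = m.groups()
    print("Folder Path:", folder_path)
    print("Filename:", filename)
```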

To print the result in the format below, use a substitution.

Folder Path: Api_git/Api/folder A 
Filename: new.txt


sample code:

import re
p = re.compile(r'^Only in ([^:]*): (.*)$')
test_str = "Only in Api_git/Api/folder A: new.txt"
# Python backreferences are written \1 and \2, not $1 and $2
subst = r"Folder Path: \1\nFilename: \2"

result = re.sub(p, subst, test_str)
print(result)

edited Jul 17, 2014 at 14:50

answered Jul 17, 2014 at 14:37

Braj


4 Comments

TomSelleck Over a year ago

Hey this looks good! Just a couple of questions, I think there's a typo because it only prints out "Folder Path: $1" at the moment. Also, is there a quick way to save each value into its own variable eg. folder_path = test_str('folder_path') filename = test_str('filename')

Braj Over a year ago

Get the matched groups from the first example. I added the second example only to print the desired output.

Braj Over a year ago

Read more Capturing group with findall?

zx81 Over a year ago

Nice detailed answer, +1 :)

Robᵩ · Accepted Answer · 2014-07-17 16:30:29Z

1

Your pattern: (Only in ?P<folder_path>\w+:?P<filename>\w+) has a few flaws in it.

The ?P construct is only valid as the first bit inside a parenthesized expression, so we need this.

(Only in (?P<folder_path>\w+):(?P<filename>\w+))

The \w character class matches only letters, digits, and underscores. It won't match /, ., or a space, for example. We need a character class that more closely fits the requirements. In fact, we can just use ., which matches any character except a newline:

(Only in (?P<folder_path>.+):(?P<filename>.+))
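
The difference between the two character classes can be checked directly. A small sketch, using the folder path from the question as the test text:

```python
import re

text = "Api_git/Api/folder A"

# \w+ stops at the first '/', which \w does not match
word_match = re.match(r"\w+", text).group()
print(word_match)   # Api_git

# .+ keeps going to the end of the line
dot_match = re.match(r".+", text).group()
print(dot_match)    # Api_git/Api/folder A
```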

The colon has a space after it in your example text. We need to match it:

(Only in (?P<folder_path>.+): (?P<filename>.+))

The outermost parentheses are not needed. They aren't wrong, just not needed:

Only in (?P<folder_path>.+): (?P<filename>.+)

It is often convenient to provide the regular expression separate from the call to the regular expression engine. This is easily accomplished by creating a new variable, for example:

regex = r'Only in (?P<folder_path>.+): (?P<filename>.+)'
... # several lines later
m = re.match(regex, "Only in Api_git/Api/folder A: new.txt")

The above is purely for the convenience of the programmer: it neither saves nor squanders time or memory space. There is, however, a technique that can save some of the time involved in regular expressions: compiling.

Consider this code segment:

regex = r'Only in (?P<folder_path>.+): (?P<filename>.+)'
for line in input_file:
    m = re.match(regex, line)
    ...

On each iteration of the loop, re.match must turn the pattern string into a compiled pattern (or look it up in the re module's internal cache of recently used patterns) and then apply it to the line variable. The re module allows us to separate the interpretation from the application; we can interpret once but apply several times:

regex = re.compile(r'Only in (?P<folder_path>.+): (?P<filename>.+)')
for line in input_file:
    m = regex.match(line)
    ...
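
A runnable sketch of the compile-once, match-many idea. The list of lines is a made-up stand-in for input_file, and collecting the results with groupdict() is an illustrative choice:

```python
import re

regex = re.compile(r'Only in (?P<folder_path>.+): (?P<filename>.+)')

# Hypothetical sample lines standing in for input_file
lines = [
    "Only in Api_git/Api/folder A: new.txt",
    "Files a.txt and b.txt differ",
    "Only in Api_git/Api: old.txt",
]

results = []
for line in lines:
    m = regex.match(line)   # method on the compiled pattern object
    if m:
        # groupdict() maps each named group to the text it captured
        results.append(m.groupdict())

print(results)
```

Lines that do not start with "Only in " produce no match and are skipped.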

Now, your original program should look like this:

import re
regex = re.compile(r'Only in (?P<folder_path>.+): (?P<filename>.+)')
m = regex.match("Only in Api_git/Api/folder A: new.txt")
print(m.group('folder_path'))
print(m.group('filename'))

However, I'm a fan of using comments to explain regular expressions. My version, including some general cleanup, looks like this:

import re
regex = re.compile(r'''(?x)                # Verbose
            Only\ in\             # Literal match
            (?P<folder_path>.+)   # match longest sequence of anything, and put in 'folder_path'
            :\                    # Literal match
            (?P<filename>.+)      # match longest sequence of anything and put in 'filename'
            ''')

with open('diff.out') as input_file:
    for line in input_file:
        m = regex.match(line)
        if m:
            print(m.group('folder_path'))
            print(m.group('filename'))
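
The "longest sequence" wording in the comments matters when a line contains more than one ": ". A sketch with a made-up input line, comparing the greedy .+ used above with a non-greedy .+? variant (the variant is an illustration, not part of the answer's pattern):

```python
import re

# Hypothetical line whose filename itself contains ": "
line = "Only in Api_git/Api/folder A: notes: draft.txt"

greedy = re.match(r'Only in (?P<folder_path>.+): (?P<filename>.+)', line)
lazy = re.match(r'Only in (?P<folder_path>.+?): (?P<filename>.+)', line)

# Greedy .+ splits at the last ": " in the line
print(greedy.group('folder_path'))   # Api_git/Api/folder A: notes
print(greedy.group('filename'))      # draft.txt

# Non-greedy .+? splits at the first ": "
print(lazy.group('folder_path'))     # Api_git/Api/folder A
print(lazy.group('filename'))        # notes: draft.txt
```

Which split is right depends on whether colons are more likely in folder names or in filenames for the input at hand.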

edited Jul 17, 2014 at 16:30

answered Jul 17, 2014 at 14:53

Robᵩ


4 Comments

TomSelleck Over a year ago

Is there a way to extract the regex from re.match()?

Robᵩ Over a year ago

What do you mean by "extract"?

TomSelleck Over a year ago

Sorry - I meant to define the regex outside of re.match("regex")

Robᵩ Over a year ago

@Tomcelic - See my edit. In short, use regex = re.compile(r'...'); re.match(regex, 'some text').

f.rodrigues · Accepted Answer · 2014-07-17 15:40:32Z

0

It really depends on how constrained the input is. If this is the only kind of input, the following will do the trick:

^Only in (?P<folder_path>[a-zA-Z_/ ]*): (?P<filename>[a-z]*\.txt)$
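
A sketch of how strict such a pattern is. It is written here with an escaped dot (\.) so that only a literal "." can precede "txt", and the second and third sample lines are made up for illustration:

```python
import re

pattern = re.compile(
    r'^Only in (?P<folder_path>[a-zA-Z_/ ]*): (?P<filename>[a-z]*\.txt)$'
)

samples = [
    "Only in Api_git/Api/folder A: new.txt",    # the input from the question
    "Only in Api_git/Api/folder A: new2.txt",   # digit in the filename
    "Only in Api_git/Api/folder A: new.csv",    # different extension
]

# True where the whole line fits the restrictive pattern
matched = [bool(pattern.match(s)) for s in samples]
print(matched)   # [True, False, False]
```

The tighter character classes reject anything unexpected, which is useful for validation but brittle if filenames can contain digits, uppercase letters, or other extensions.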

answered Jul 17, 2014 at 15:40

f.rodrigues

