Python. Regular expression not returning output

Question

I am trying to find all instances of the string "PB" and the digits that follow it, but when I call

number_all = re.findall(r'\bPB\b([0-9])\d+', ' '.join(number_list))

the ([0-9])\d+ part doesn't return any output. I check my output file, sequence.txt, but there is nothing inside it. If I just use \bPB\b it outputs "PB" but no numbers.

My input file, raw-sequence.txt looks like this:

WB (19, 21, 24, 46, 60)
WB (12, 11, 9, 23, 49)
PB (18, 21, 10, 5, 5)
WB (2, 14, 2, 29, 67)
WB (1, 8, 1, 16, 52)
PB (2, 11, 8, 3, 4)

How can I output the following lines to sequence.txt?

PB (18, 21, 10, 5, 5)
PB (2, 11, 8, 3, 4)

Here is my current code:

sequence_raw_buffer = open('c:\\sequence.txt', 'a')
with open('c:\\raw-sequence.txt') as f:
  number_list = f.read().splitlines()
  number_all = re.findall(r'\bPB\b([0-9])\d+', ' '.join(number_list))
  unique = list(set(number_all))
  for i in unique:
    sequence_raw_buffer.write(i + '\n')
  print "done"
  f.close()
  sequence_raw_buffer.close()

Yes, Mad, pretty much: at the moment all I want is the PB lines pulled out and written to sequence.txt.
– Keo Rithy, Commented May 22, 2017 at 17:03

Mad Physicist · Accepted Answer · 2017-05-22 17:43:50Z

2

Given the code you show, a regex is an unnecessary complication for your problem. You can just iterate over the lines of the input file and write out the ones for which line.startswith("PB") returns True.

with open(r'c:\raw-sequence.txt', 'r') as f, open(r'c:\sequence.txt', 'a') as sequence_raw_buffer:
    for line in f:
        if line.startswith("PB"):
            sequence_raw_buffer.write(line)

This illustrates the fact that files can be iterated over line by line. Each line keeps its trailing newline when read this way, so I write it out unchanged with write; calling print(line, file=sequence_raw_buffer) would add a second terminator unless you also pass end=''.

This example also shows you how to put multiple context managers into a single with block. You should open all your files in a with block, whether input or output, because I/O errors are a possibility in both directions.
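As a way to try the filter without touching the disk, here is a minimal sketch that uses io.StringIO objects in place of the two files. The sample text is copied from the question; the helper name copy_pb_lines is made up for illustration and is not part of the answer's code.

import io

# Sample data copied from the question; io.StringIO stands in for the real files.
RAW = (
    "WB (19, 21, 24, 46, 60)\n"
    "WB (12, 11, 9, 23, 49)\n"
    "PB (18, 21, 10, 5, 5)\n"
    "WB (2, 14, 2, 29, 67)\n"
    "WB (1, 8, 1, 16, 52)\n"
    "PB (2, 11, 8, 3, 4)\n"
)


def copy_pb_lines(source, sink):
    """Copy every line that starts with "PB" from source to sink (hypothetical helper)."""
    for line in source:
        # Iterating over a file-like object yields each line with its
        # trailing newline still attached, so it is written unchanged.
        if line.startswith("PB"):
            sink.write(line)


# Two context managers in a single with statement, as in the answer.
with io.StringIO(RAW) as source, io.StringIO() as sink:
    copy_pb_lines(source, sink)
    result = sink.getvalue()

print(result, end="")

Swapping the StringIO objects for open(...) calls should give the on-disk behaviour, since both kinds of object are iterated and written to in the same way.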

Now, if you are trying to use regex for practice or because the match is really more complicated than what you present here, you can try

PB\s*\((?:\d+,\s*)*\d+\)

This matches as follows:

Literal PB
Any number of whitespace characters, including none, \s*
Literal open parenthesis \(
Optional non-capturing group (?:...)*, repeated as many times as necessary, containing
- At least one digit \d+
- Literal comma ,
- Any number of whitespace characters \s*
A final number of at least one digit \d+
Literal close parenthesis \)
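To see the breakdown above in action, the following sketch runs the pattern against a few strings with re.fullmatch. The first two samples come from the question; "PB(7)" and "PB (1, 2,)" are invented edge cases added only to exercise the optional whitespace and the required final number.

import re

# The pattern from the answer, written as a raw string.
pattern = re.compile(r"PB\s*\((?:\d+,\s*)*\d+\)")

samples = [
    "PB (18, 21, 10, 5, 5)",    # from the question
    "WB (19, 21, 24, 46, 60)",  # from the question, wrong prefix
    "PB(7)",                    # invented: no space, single number
    "PB (1, 2,)",               # invented: trailing comma, no final number
]

# fullmatch succeeds only if the whole string fits the pattern.
results = {text: bool(pattern.fullmatch(text)) for text in samples}

for text, ok in results.items():
    print(text, "->", ok)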

I would not bother concatenating the whole file together and using findall on that though, unless your expression can span multiple lines. I would prefer to still use the approach shown above, because in all but a few cases that I can think of, textual data will generally be delimited by newlines:

pattern = re.compile(r'PB\s*\((?:\d+,\s*)*\d+\)')
...
            if pattern.match(line):
...

Compiling the pattern once outside the loop avoids a lookup in the re module's pattern cache on every iteration, which is a small speedup; you could call re.match(..., line) each time as well.
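One way the elided fragment could be filled in is sketched below. The generator name matching_lines and the constant PB_LINE are assumptions made for this example, and a list of strings stands in for the open file. Note that match only anchors at the start of the line, so trailing text after the closing parenthesis (including the newline) does not prevent a match.

import re

# Compiled once, outside the loop (name chosen for this sketch).
PB_LINE = re.compile(r"PB\s*\((?:\d+,\s*)*\d+\)")


def matching_lines(lines, pattern=PB_LINE):
    """Yield the lines that match the pattern at their start (hypothetical helper)."""
    for line in lines:
        if pattern.match(line):
            yield line


# Stand-in for iterating over the input file from the question.
raw = [
    "WB (19, 21, 24, 46, 60)\n",
    "WB (12, 11, 9, 23, 49)\n",
    "PB (18, 21, 10, 5, 5)\n",
    "WB (2, 14, 2, 29, 67)\n",
    "WB (1, 8, 1, 16, 52)\n",
    "PB (2, 11, 8, 3, 4)\n",
]

kept = list(matching_lines(raw))
print("".join(kept), end="")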

edited May 22, 2017 at 17:43

answered May 22, 2017 at 17:11

Mad Physicist


1 Comment

Sam Pearman Over a year ago

Ah, I coded this up myself almost exactly the same way in the time you wrote this. Beaten to the punch :) I just want to note that the output is dumb about newlines if you're running it multiple times (e.g. with different files), regardless of whether you're using print (too many newlines) or write (not enough). You can get around it by using print(line, file=outputFile, end='') inside the for loop and print('', file=outputFile) outside, or outputFile.write(line) inside and outputFile.write("\n") outside.

Alessandro Martini · Accepted Answer · 2017-05-22 17:12:41Z

0

You can try this regex: PB\s?\(([0-9]*,?\s?)*\)

answered May 22, 2017 at 17:12

Alessandro Martini


4 Comments

Kevin Over a year ago

This appears to return ['', ''] when run on the OP's input file.

Alessandro Martini Over a year ago

I tried it here, it seems to work, regex101.com/r/GtqFJg/1 . Can you send me the text you are using?

Kevin Over a year ago

Sure. Even better, here is the string in an interactive environment that shows the output too: repl.it/IOFL

Alessandro Martini Over a year ago

Ok, the problem was with the groups I was creating. This should work: PB \(.*?\)
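For readers following this exchange, the sketch below illustrates the group issue. When a pattern contains exactly one capturing group, re.findall returns what that group captured rather than the whole match, so the first regex cannot return the full "PB (...)" text. Making the group non-capturing, or using the simpler lazy pattern from the last comment, returns whole matches. The variable names are made up for this example, and the input is the question's data joined with spaces as in the asker's code.

import re

# The question's lines joined with spaces, as the asker's code does.
text = " ".join([
    "WB (19, 21, 24, 46, 60)",
    "WB (12, 11, 9, 23, 49)",
    "PB (18, 21, 10, 5, 5)",
    "WB (2, 14, 2, 29, 67)",
    "WB (1, 8, 1, 16, 52)",
    "PB (2, 11, 8, 3, 4)",
])

# One capturing group: findall returns the group's contents, not the match.
captured = re.findall(r"PB\s?\(([0-9]*,?\s?)*\)", text)

# Same pattern with a non-capturing group: findall returns the whole match.
whole = re.findall(r"PB\s?\((?:[0-9]*,?\s?)*\)", text)

# The lazy pattern suggested in the last comment.
lazy = re.findall(r"PB \(.*?\)", text)

print(captured)
print(whole)
print(lazy)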

rock321987 · Accepted Answer · 2017-05-22 17:30:40Z

0

There are a few things that you are missing:

You are missing a space between the word boundary (\b) and the opening bracket (.
Parentheses () have a special meaning in a regex: they denote a capturing group. To match a parenthesis literally you need to escape it.

Now to match the exact pattern you intend, you can use this

\bPB\s+\((?:\s*\d+\s*,\s*)*\d+\)
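A short sketch comparing the pattern from the question with the one above, run on the question's data joined with spaces as in the asker's code. The variable names are chosen for this example only. The original pattern requires a digit immediately after "PB", but the text has a space and an opening parenthesis there, so it finds nothing.

import re

# The question's lines joined with spaces, as the asker's code does.
text = " ".join([
    "WB (19, 21, 24, 46, 60)",
    "WB (12, 11, 9, 23, 49)",
    "PB (18, 21, 10, 5, 5)",
    "WB (2, 14, 2, 29, 67)",
    "WB (1, 8, 1, 16, 52)",
    "PB (2, 11, 8, 3, 4)",
])

# Pattern from the question: expects digits directly after "PB".
original = re.findall(r"\bPB\b([0-9])\d+", text)

# Pattern from this answer: allows the space and escapes the parentheses.
fixed = re.findall(r"\bPB\s+\((?:\s*\d+\s*,\s*)*\d+\)", text)

print(original)
print(fixed)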

If you only want to match lines containing PB, you can search for PB directly.

answered May 22, 2017 at 17:30

rock321987
