0

I am trying to find all instances of the string "PB" and the digits that follow it, but when I call

number_all = re.findall(r'\bPB\b([0-9])\d+', ' '.join(number_list))

the ([0-9])\d+ part doesn't return any output. I checked my output file, sequence.txt, but there is nothing in it. If I use just \bPB\b, it outputs "PB" but no numbers.

My input file, raw-sequence.txt looks like this:

WB (19, 21, 24, 46, 60)
WB (12, 11, 9, 23, 49)
PB (18, 21, 10, 5, 5)
WB (2, 14, 2, 29, 67)
WB (1, 8, 1, 16, 52)
PB (2, 11, 8, 3, 4)

How can I output the following lines to sequence.txt?

PB (18, 21, 10, 5, 5)
PB (2, 11, 8, 3, 4)

Here is my current code:

sequence_raw_buffer = open('c:\\sequence.txt', 'a')
with open('c:\\raw-sequence.txt') as f:
  number_list = f.read().splitlines()
  number_all = re.findall(r'\bPB\b([0-9])\d+', ' '.join(number_list))
  unique = list(set(number_all))
  for i in unique:
    sequence_raw_buffer.write(i + '\n')
  print "done"
  f.close()
  sequence_raw_buffer.close()
7
  • regex101.com is great for testing regular expressions Commented May 22, 2017 at 16:57
  • You should really read the re module documentation. Commented May 22, 2017 at 16:57
  • You just want lines starting with "PB"? Commented May 22, 2017 at 17:02
  • Because you really don't need regex for that. Commented May 22, 2017 at 17:02
  • yes Mad pretty at the moment all i want is PB lines called and output in sequence.txt Commented May 22, 2017 at 17:03

3 Answers

2

Given the code you show, a regex is an unnecessary complication for your problem. You can just iterate over the lines of the input file and write out the ones for which line.startswith("PB") returns True.

with open(r'c:\raw-sequence.txt', 'r') as f, open(r'c:\sequence.txt', 'a') as sequence_raw_buffer:
    for line in f:
        if line.startswith("PB"):
            # Each line keeps its trailing newline, so write it unchanged.
            sequence_raw_buffer.write(line)

This illustrates the fact that files can be iterated over line-by-line. I use write rather than print to dump the line because iteration leaves the line terminator on each line, and print would append a second one.

This example also shows you how to put multiple context managers into a single with block. You should open all your files in a with block, whether input or output, because I/O errors are a possibility in both directions.
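As a rough, self-contained sketch of the same idea, the filter can be written as a function that works on any file-like objects. The function name copy_pb_lines and the in-memory io.StringIO buffers are assumptions made for illustration; with real files you would pass the handles opened in the with block instead.

```python
import io


def copy_pb_lines(source, destination, prefix="PB"):
    """Copy every line of `source` that starts with `prefix` to `destination`."""
    kept = 0
    for line in source:
        if line.startswith(prefix):
            # The line still carries its own newline, so write it unchanged.
            destination.write(line)
            kept += 1
    return kept


# In-memory stand-ins for raw-sequence.txt and sequence.txt (illustrative only).
sample = io.StringIO(
    "WB (19, 21, 24, 46, 60)\n"
    "PB (18, 21, 10, 5, 5)\n"
    "WB (2, 14, 2, 29, 67)\n"
    "PB (2, 11, 8, 3, 4)\n"
)
output = io.StringIO()
count = copy_pb_lines(sample, output)
result = output.getvalue()
print(result, end="")
```

Keeping the file handling outside the function makes the filter easy to try without touching the disk.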

Now, if you are trying to use regex for practice or because the match is really more complicated than what you present here, you can try

PB\s*\((?:\d+,\s*)*\d+\)

This matches as follows:

  • Literal PB
  • Optional unlimited number of spaces \s*
  • Literal open parens \(
  • Optional non-capturing group (?:)*, repeated as many times as necessary, containing
    • At least one digit \d+
    • Literal comma ,
    • Any number of spaces \s*
  • A final number of at least one digit \d+
  • Literal close parens \)
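The breakdown above can be written directly into the pattern with re.VERBOSE, which lets each piece carry its own comment. This is only a sketch: the name PB_LINE and the sample strings (including the deliberately malformed last one) are made up for illustration.

```python
import re

# Same pattern as above, spread out with one comment per piece.
PB_LINE = re.compile(r"""
    PB              # literal PB
    \s*             # any amount of whitespace
    \(              # literal opening parenthesis
    (?:\d+,\s*)*    # zero or more numbers, each followed by a comma
    \d+             # the final number
    \)              # literal closing parenthesis
""", re.VERBOSE)

samples = [
    "PB (18, 21, 10, 5, 5)",
    "WB (1, 8, 1, 16, 52)",
    "PB (2, 11, 8, 3, 4)",
    "PB (1, 2,)",  # trailing comma, so there is no final number
]
matches = [s for s in samples if PB_LINE.fullmatch(s)]
print(matches)
```

fullmatch is used here so that the whole string has to fit the pattern, which is why the entry with the trailing comma is rejected.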

I would not bother concatenating the whole file together and using findall on that though, unless your expression can span multiple lines. I would prefer to still use the approach shown above, because in all but a few cases that I can think of, textual data will generally be delimited by newlines:

pattern = re.compile(r'PB\s*\((?:\d+,\s*)*\d+\)')
...
            if pattern.match(line):
...

Pre-compiling the pattern once saves a little work on every line, but since the re module caches recently used patterns, you could call re.match(..., line) every time as well.
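Put together, a line-by-line regex filter might look like the sketch below. The helper name matching_lines and the sample list are assumptions for illustration. Note that match() only looks at the start of each line, so a line that merely contains a PB tuple further in is skipped.

```python
import re

# Compiled once, reused for every line.
PATTERN = re.compile(r'PB\s*\((?:\d+,\s*)*\d+\)')


def matching_lines(lines):
    """Yield each line that begins with a PB tuple."""
    for line in lines:
        if PATTERN.match(line):  # match() anchors at the start of the line
            yield line


raw = [
    "WB (12, 11, 9, 23, 49)\n",
    "PB (18, 21, 10, 5, 5)\n",
    "note: PB (1, 2)\n",  # PB is not at the start, so match() skips it
    "PB (2, 11, 8, 3, 4)\n",
]
kept = list(matching_lines(raw))
print("".join(kept), end="")
```

If lines like the third one should be kept too, PATTERN.search(line) would be the call to use instead.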


1 Comment

Ah, I coded this up myself almost exactly the same way in the time you wrote this. Beaten to the punch :) I just want to note that the output is dumb about newlines if you're running it multiple times (e.g. with different files), regardless of whether you're using print (too many newlines) or write (not enough). You can get around it by using print(line, file=outputFile, end='') inside the for loop and print('', file=outputFile) outside, or outputFile.write(line) inside and outputFile.write("\n") outside.
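A different way to sidestep the newline issue, not one of the two options in the comment above, is to strip whatever terminator the line has and always write exactly one "\n". The sketch below assumes a hypothetical helper append_pb_lines and made-up in-memory inputs, the first of which lacks a trailing newline.

```python
import io


def append_pb_lines(source, destination):
    """Append PB lines to `destination`, each ending in exactly one newline."""
    for line in source:
        if line.startswith("PB"):
            # Drop any existing terminator, then add a single "\n".
            destination.write(line.rstrip("\r\n") + "\n")


# Two illustrative inputs; the first one has no newline after its last line.
first = io.StringIO(
    "PB (18, 21, 10, 5, 5)\nWB (1, 8, 1, 16, 52)\nPB (2, 11, 8, 3, 4)"
)
second = io.StringIO("PB (7, 7, 7, 7, 7)\n")

out = io.StringIO()
append_pb_lines(first, out)
append_pb_lines(second, out)
combined = out.getvalue()
print(combined, end="")
```

Because every written line is normalized, running the function over several inputs in a row neither glues lines together nor leaves blank lines between them.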
0

You can try this regex: PB\s?\(([0-9]*,?\s?)*\)

4 Comments

This appears to return ['', ''] when run on the OP's input file.
I tried it here and it seems to work: regex101.com/r/GtqFJg/1. Can you send me the text you are using?
Sure. Even better, here is the string in an interactive environment that shows the output too: repl.it/IOFL
Ok, the problem was with the groups I was creating. This should work: PB \(.*?\)
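The difference comes from how re.findall treats groups: when the pattern contains a capturing group, it returns the group's contents rather than the whole match. A small sketch of the three variants discussed in these comments, using the question's sample data joined with spaces (the variable names are illustrative):

```python
import re

text = (
    "WB (12, 11, 9, 23, 49) PB (18, 21, 10, 5, 5) "
    "WB (2, 14, 2, 29, 67) PB (2, 11, 8, 3, 4)"
)

# Capturing group: findall returns only what the group captured last.
captured = re.findall(r'PB\s?\(([0-9]*,?\s?)*\)', text)

# Non-capturing group (?:...): findall returns the whole match.
whole = re.findall(r'PB\s?\((?:[0-9]*,?\s?)*\)', text)

# No group at all, lazy match up to the first closing parenthesis.
lazy = re.findall(r'PB \(.*?\)', text)

print(captured)
print(whole)
print(lazy)
```

Switching the group to (?:...) or dropping it entirely are both ways to get the full "PB (...)" text back from findall.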
0

There are a few things that you are missing:

  1. You are missing the space between the word boundary (\b) and the opening bracket (.
  2. Parentheses () have a special meaning in a regex: they denote a capturing group. To match a parenthesis literally, you need to escape it.

Now to match the exact pattern you intend, you can use this

\bPB\s+\((?:\s*\d+\s*,\s*)*\d+\)
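To see the difference, here is a small check of the question's pattern against this one on a single sample line. The variable names are illustrative. The original pattern cannot match anything at all, since \b after PB demands a non-word character next while [0-9] demands a digit.

```python
import re

line = "PB (18, 21, 10, 5, 5)"

# Pattern from the question: a digit must follow PB immediately,
# yet \b forbids a word character (such as a digit) right after PB.
original = re.findall(r'\bPB\b([0-9])\d+', line)

# Pattern from this answer: whitespace, an escaped "(", the numbers, an escaped ")".
fixed = re.findall(r'\bPB\s+\((?:\s*\d+\s*,\s*)*\d+\)', line)

# Even with no space between PB and the digits, the original finds nothing.
no_space = re.findall(r'\bPB\b([0-9])\d+', "PB18")

print(original)
print(fixed)
print(no_space)
```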

If you only want to match lines containing PB, you can search for PB directly.
