I am trying to remove a specific paragraph from a file using a regular expression in Python.

Here is the input.txt file:

some random texts (100+ lines)
bbb
...
ttt
some random texts
ccc
...
fff    
paragraph_a A_story(

...
some random texts adfsasdsd

...
)

paragraph_b different_story(
...
some random texts
...
)

The expected output is:

some random texts (100+ lines)
bbb
...
ttt
some random texts
ccc
...
fff    

paragraph_b different_story(
...
some random texts
...
)

What I want to do is delete the entire paragraph_a block (including the parentheses). However, it has to be identified by the name of the paragraph directly below it (in this case, paragraph_b), because the contents of the paragraph to be deleted (in this case, paragraph_a) are random.

I've managed to write a regular expression that selects only the paragraph located right above paragraph_b:

https://regex101.com/r/pwGVbe/1

However, using this regular expression in my code does not delete what I want.

Here is what I've done so far:

import re

output = open ('output.txt', 'w')
input = open('input.txt', 'r')

for line in input:
#    print(line)
    t = re.sub('^(\w+ \w+\((?:(.|\n)*)\))\s*^paragraph_b','', line)
    output.write(t)

Is there a solution or a clue you could give me? Any answer or advice would be appreciated.

Thanks.

  • If your regex successfully matches paragraph_a content, then what's missing? You're not being very clear about your goal and what's lacking in your current solution. Commented Aug 21, 2022 at 13:56
  • Please add the expected output and the actual output to the question. Commented Aug 21, 2022 at 13:59
  • @PookyFan As I mentioned in the question, even though the regex itself matched, the code didn't work. Commented Aug 21, 2022 at 13:59
  • @rok I added the desired output. The current output from the code is blank even though the regular expression seems to match, so that's why I'm asking about the code. Commented Aug 21, 2022 at 14:04
  • @Parine I understand now, see my answer. Does it help? Commented Aug 21, 2022 at 14:16

2 Answers

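You can match the preceding paragraph by asserting, with a lookahead, that paragraph_b follows, while not letting the match cross into other paragraphs.

Note that input is the name of a built-in function (not a reserved keyword), so assigning to it shadows the built-in. Instead of writing input = open('input.txt', 'r') you might write input_file = open('file', 'r').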

^\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)

Regex demo

If the match also should not start with paragraph_b itself:

^(?!paragraph_b)\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)

Regex demo
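
For readers who find the pattern dense, here is a sketch of the second pattern above written with re.VERBOSE so that each part can carry a comment. The names PATTERN, sample and result are illustrative and not part of the original answer, and the sample string is made up. The literal space is written as [ ] because verbose mode ignores unescaped whitespace in the pattern.

```python
import re

# Hypothetical names; same pattern as above, split up with re.VERBOSE.
PATTERN = re.compile(
    r"""
    ^(?!paragraph_b)        # start of a line that is not paragraph_b itself
    \w+[ ]\w+\(             # paragraph header: two words, then an opening parenthesis
    (?:
        \n                  # step onto the next line...
        (?!^\w+[ ]\w+\()    # ...unless that line opens another paragraph
        .*                  # consume the rest of that line
    )*
    \)                      # closing parenthesis of the paragraph
    (?=\s*^paragraph_b)     # only match if paragraph_b comes next
    """,
    re.MULTILINE | re.VERBOSE,
)

# Made-up sample data standing in for the real file contents.
sample = (
    "fff\n"
    "paragraph_a A_story(\n\nsome text\n\n)\n"
    "\n"
    "paragraph_b different_story(\n...\n)\n"
)

result = PATTERN.sub("", sample)
print(result)
```

Because the trailing lookahead does not consume anything, the blank lines between the removed paragraph and paragraph_b stay in the output.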

Example, using input_file.read() to read the whole file:

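import re

# Raw string so the backslashes reach the regex engine unchanged.
pattern = r'^(?!paragraph_b)\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)'

# Read the whole file at once so the pattern can span several lines.
with open('file', 'r') as input_file:
    text = input_file.read()

# re.M lets ^ match at the start of every line.
t = re.sub(pattern, '', text, flags=re.M)

with open('file_out', 'w') as output_file:
    output_file.write(t)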

Contents of file_out:

some random texts (100+ lines)
bbb
...
ttt
some random texts
ccc
...
fff    


paragraph_b different_story(
...
some random texts
...
)

5 Comments

Thanks for the answer, but even after I replaced the pattern in re.sub with your suggestion, the code didn't work.
@Parine I have added example code to the answer.
Thanks, I've tried yours, but the output remains the same as the input file.
@Parine Did you try testing this with the data that you shared in the question? Can you share a part of the real file?
Oh, there was a one-character leading space in the pattern from your first answer ( ^\w+ \w+\((?:\n(?!^\w+ \w+\().*)*\)(?=\s*^paragraph_b)). When I deleted that space, it worked! Thank you very much. It helped me a lot.

Your code doesn't work because you're parsing the text line by line:

for line in input:

That way each call to re.sub only sees a single line, so a regex that spans several lines has no chance to match. You're better off reading the whole file at once, storing it in a single string variable, and then applying your regex to that string.
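
To illustrate the difference, here is a small sketch that uses an in-memory string instead of a file. The sample text, the simplified pattern and the variable names are all made up for the demonstration. The only point is that a pattern spanning several lines never matches when each line is processed separately.

```python
import re

# Illustrative data and a deliberately simplified pattern (assumptions,
# not the asker's real file or regex).
text = (
    "keep me\n"
    "paragraph_a A_story(\nrandom\n)\n"
    "\n"
    "paragraph_b different_story(\n...\n)\n"
)
pattern = re.compile(r"^\w+ \w+\([^()]*\)(?=\s*^paragraph_b)", re.MULTILINE)

# Line by line: each call only sees one line, so the multi-line
# pattern never matches and the text comes back unchanged.
per_line = "".join(
    pattern.sub("", line) for line in text.splitlines(keepends=True)
)

# Whole text at once: the pattern can span the line breaks.
whole = pattern.sub("", text)

print(per_line == text)  # True
print(whole)
```

With a real file, the whole-text variant corresponds to calling read() once on the opened file and passing the resulting string to re.sub.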
