How to extract text between two substrings from a file in Python

Question

I want to read the text between two markers (“#*” and “#@”) from a file. My file contains thousands of records in the format shown below. I have tried the code below, but it does not return the required output.

import re
start = '#*'
end = '#@'
myfile = open('lorem.txt')
for line in myfile:
    line = line.rstrip()
    print(line[line.find(start)+len(start):line.rfind(end)])
myfile.close()

My Input:

#*OQL[C++]: Extending C++ with an Object Query Capability

#@José A. Blakeley

#t1995

#cModern Database Systems

#index0

#*Transaction Management in Multidatabase Systems

#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz

#t1995

#cModern Database Systems

#index1

My Output:

51103
OQL[C++]: Extending C++ with an Object Query Capability

t199
cModern Database System
index
...

Expected output:

OQL[C++]: Extending C++ with an Object Query Capability
Transaction Management in Multidatabase Systems

Could you perhaps highlight your input, output, and expected output?
– norok2, Commented Jul 22, 2019 at 10:19
Remove your for loop, add contents = myfile.read(), and then print(re.findall(r'#\*(.*?)#@', contents, re.S))
– Wiktor Stribiżew, Commented Jul 22, 2019 at 10:24
Do you mean like this? ^#\*(.*)(?:\r?\n){2}#@ regex101.com/r/5ouxbw/1
– The fourth bird, Commented Jul 22, 2019 at 10:24
@Wiktor Stribiżew Thank you for the help. This code returns all the data between these two strings in the file at once. I want to handle the titles one by one: read the first title between "#*" and "#@", then the next title between "#*" and "#@", and so on.
– BiSarfraz, Commented Jul 22, 2019 at 10:36
So, no problem: for match in re.findall(r'#\*(.*?)#@', contents, re.S): # do something with the match
– Wiktor Stribiżew, Commented Jul 22, 2019 at 11:17
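
For handling the matches one at a time, re.finditer is an alternative to re.findall: it yields match objects lazily instead of building the full list first. A minimal sketch, in which the contents string is a made-up stand-in for the result of myfile.read():

```python
import re

# Hypothetical stand-in for the text returned by myfile.read().
contents = "#*Title one\n#@Author A\n#index0\n#*Title two\n#@Author B\n#index1\n"

titles = []
for match in re.finditer(r'#\*(.*?)#@', contents, re.S):
    # group(1) is the text between the markers; strip() drops the trailing line break.
    title = match.group(1).strip()
    titles.append(title)
    print(title)
```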

Wiktor Stribiżew · Accepted Answer · 2019-07-22 11:34:03Z

2

You are reading the file line by line, but each match spans more than one line: #* starts one line and #@ starts a later one. Read the whole file into a string and process it with a regex that can match any characters across lines:

import re
start = '#*'
end = '#@'
# Escape special chars and build the pattern dynamically; the group captures only the text between the markers
rx = r'{}(.*?){}'.format(re.escape(start), re.escape(end))
with open('lorem.txt') as myfile:
    contents = myfile.read()                     # Read the whole file into a variable
    for match in re.findall(rx, contents, re.S): # Note re.S will make . match line breaks, too
        print(match.strip())                     # Process each match individually

See the regex demo.
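
To check the approach without the file, here is a minimal sketch that runs the same kind of pattern over the sample input from the question, held in a string. The sample variable and the extract_between helper are hypothetical names introduced for illustration, and the strip() call is an added assumption to drop the line break that precedes #@.

```python
import re

def extract_between(text, start, end):
    """Return the stripped text found between each start/end marker pair."""
    # re.escape keeps '*' from being treated as a regex quantifier.
    pattern = '{}(.*?){}'.format(re.escape(start), re.escape(end))
    # re.S lets '.' match line breaks; '.*?' stops at the nearest end marker.
    return [m.strip() for m in re.findall(pattern, text, re.S)]

# Sample records copied from the question (blank lines omitted).
sample = (
    "#*OQL[C++]: Extending C++ with an Object Query Capability\n"
    "#@José A. Blakeley\n"
    "#t1995\n"
    "#cModern Database Systems\n"
    "#index0\n"
    "#*Transaction Management in Multidatabase Systems\n"
    "#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz\n"
    "#t1995\n"
    "#cModern Database Systems\n"
    "#index1\n"
)

titles = extract_between(sample, '#*', '#@')
for title in titles:
    print(title)
```

On this sample the loop prints the two titles listed under "Expected output" in the question.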



CinCout · Accepted Answer · 2019-07-22 10:24:56Z

2

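Use the following regex:

#\*([\s\S]*?)#@

The group lazily captures every character, whitespace or not and including line breaks, between #* and #@. The /g (global) flag used in regex testers has no Python counterpart; re.findall and re.finditer already return every match.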

Demo
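
In Python this pattern can be passed to re.findall with no flags, because [\s\S] already matches line breaks. A minimal sketch comparing it with the re.S form, assuming a small made-up string (text is a hypothetical example, not the asker's file):

```python
import re

# Hypothetical two-record example.
text = "#*First title\n#@Author A\n#t1995\n#*Second title\n#@Author B\n"

# [\s\S] matches any character, including '\n', so no flag is needed.
with_class = re.findall(r'#\*([\s\S]*?)#@', text)
# The same capture written with '.' requires re.S to cross line breaks.
with_flag = re.findall(r'#\*(.*?)#@', text, re.S)

print(with_class)               # ['First title\n', 'Second title\n']
print(with_class == with_flag)  # True
```

Each captured title keeps its trailing line break, so call strip() on it before further use.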

