3

I want to read the text between two markers (“#*” and “#@”) from a file. The file contains thousands of records in the format shown below. I tried the code below, but it does not return the required output.

import re
start = '#*'
end = '#@'
myfile = open('lorem.txt')
for line in myfile:
    line = line.rstrip()
    print (line[line.find(start)+len(start):line.rfind(end)])
myfile.close()

My Input:

#*OQL[C++]: Extending C++ with an Object Query Capability

#@José A. Blakeley

#t1995

#cModern Database Systems

#index0

#*Transaction Management in Multidatabase Systems

#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz

#t1995

#cModern Database Systems

#index1

My Output:

51103
OQL[C++]: Extending C++ with an Object Query Capability

t199
cModern Database System
index
...

Expected output:

OQL[C++]: Extending C++ with an Object Query Capability
Transaction Management in Multidatabase Systems
  • Could you perhaps highlight your input/output/expected output? Commented Jul 22, 2019 at 10:19
  • Remove your for loop, add contents = myfile.read(), and then print(re.findall(r'#\*(.*?)#@', contents, re.S)) Commented Jul 22, 2019 at 10:24
  • Do you mean like this? ^#\*(.*)(?:\r?\n){2}#@ regex101.com/r/5ouxbw/1 Commented Jul 22, 2019 at 10:24
  • @Wiktor Stribiżew thank you for your help. This code returns all the data between these two strings from the file. I basically want the matches one by one: first the title between "#*" and "#@", then the next title between "#*" and "#@", and so on. Commented Jul 22, 2019 at 10:36
  • So, no problem: for match in re.findall(r'#\*(.*?)#@', contents, re.S): // do something with the match Commented Jul 22, 2019 at 11:17

2 Answers

2

You are reading the file line by line, but your matches span across lines. You need to read the file in and process it with a regex that can match any chars across lines:

import re
start = '#*'
end = '#@'
rx = r'{}(.*?){}'.format(re.escape(start), re.escape(end)) # Escape special chars; the group captures only the text between the markers
with open('lorem.txt') as myfile:
    contents = myfile.read()                     # Read file into a variable
    for match in re.findall(rx, contents, re.S): # Note re.S will make . match line breaks, too
        print(match.strip())                     # Process each match individually; strip() drops the trailing line breaks

See the regex demo.
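
As a quick check, the approach can be run against the sample from the question. This is only a minimal sketch: the `contents` string stands in for `myfile.read()`, and the `titles` list, the capture group, and the `strip()` call are assumptions added here so that only the title text is kept.

```python
import re

# Sample records copied from the question; in practice this would be myfile.read().
contents = """#*OQL[C++]: Extending C++ with an Object Query Capability

#@José A. Blakeley

#t1995

#cModern Database Systems

#index0

#*Transaction Management in Multidatabase Systems

#@Yuri Breitbart,Hector Garcia-Molina,Abraham Silberschatz

#t1995

#cModern Database Systems

#index1
"""

start, end = '#*', '#@'
# The capture group keeps only the text between the two markers.
rx = r'{}(.*?){}'.format(re.escape(start), re.escape(end))

# re.S lets . match line breaks; strip() removes the line breaks before '#@'.
titles = [m.strip() for m in re.findall(rx, contents, re.S)]
for title in titles:
    print(title)
```

On this sample the loop prints the two titles listed under "Expected output" in the question.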

2

Use the following regex with re.findall or re.finditer (Python has no /g flag; these functions already return every match):

#\*([\s\S]*?)#@

This regex captures all whitespace and non-whitespace characters between #* and #@.
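
To handle the matches one at a time, as asked in the comments, the pattern could be wrapped in a small generator. This is a sketch under stated assumptions: `iter_titles`, `TITLE_RX`, and the `sample` string are hypothetical names and data introduced only for illustration.

```python
import re

# [\s\S] matches any character, including line breaks, so re.S is not needed.
TITLE_RX = re.compile(r'#\*([\s\S]*?)#@')

def iter_titles(text):
    """Yield each piece of text found between '#*' and '#@', one at a time."""
    for m in TITLE_RX.finditer(text):
        # group(1) is the captured text; strip() drops surrounding line breaks.
        yield m.group(1).strip()

# Hypothetical sample in the same layout as the question's input.
sample = "#*First title\n\n#@Author A\n\n#t1995\n\n#*Second title\n\n#@Author B\n"
for title in iter_titles(sample):
    print(title)
```

Because `finditer` is lazy, each title is produced only when the loop asks for it, though the file contents still have to be read into a string first.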
