
I'm working on a Python 3.x project that needs to read a large TXT file, filter it (for example, remove multiple spaces, blank lines, and lines that start with certain strings), and finally split it by regex matching.

What I am doing right now is using a pandas DataFrame to store each line (which makes it easy to delete lines using pandas' startswith() or endswith()). On the other hand, with each line of the text file corresponding to a row in the DataFrame, I can't figure out how to extract data between regex matches. Here is an example:

| 0 | REGEX MATCH   |
| 1 | data          |
| 2 | data          |
| 3 | REGEX MATCH   |
| 4 | data          |
| 5 | REGEX MATCH   |

So the question is: how can I extract the data between matches (in this example, rows 0 to 2, 3 to 4, and 5)? Is this even possible in pandas?
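It is possible in pandas. One approach (a sketch with made-up data, not from the original post) is to take a cumulative sum over a boolean "is this a match line" column; every row between one match and the next then shares a group id:

```python
import pandas as pd

# Toy data mirroring the table above: marker lines start a new group.
df = pd.DataFrame({"line": [
    "REGEX MATCH A", "data 1", "data 2",
    "REGEX MATCH B", "data 3",
    "REGEX MATCH C",
]})

# cumsum over the boolean match column labels every row with the
# id of the most recent match line above it.
is_match = df["line"].str.contains("REGEX MATCH", regex=False)
df["group"] = is_match.cumsum()

# Join each group back into one string (match line included).
blocks = df.groupby("group")["line"].apply(" ".join)
print(blocks.tolist())
# ['REGEX MATCH A data 1 data 2', 'REGEX MATCH B data 3', 'REGEX MATCH C']
```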

Another option is to use read() on the text file and go for regular string manipulation instead of a DataFrame (filtering, splitting, etc.), which I'm not sure is appropriate for big text files. In that case I have unwanted data between the regex matches. Example:

s = "This is REGEX_MATCH    while between another \n \n REGEX_MATCH there is some    unwanted data"

In the above, I would need to remove the extra blank spaces and \n characters and finally split on the regex matches. The only issue is that my source text file is really large.
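If memory is the worry, a middle ground is to stream the file line by line and accumulate one block at a time between matches, so the whole file is never held at once. A sketch (iter_blocks is a hypothetical helper, not an existing API):

```python
import re

def iter_blocks(path, pattern):
    """Yield one joined, whitespace-normalized string per block.

    A block starts at a line matching `pattern` (the match line is
    kept) and runs until the next match or end of file.  Lines
    before the first match are skipped.
    """
    marker = re.compile(pattern)
    current = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = " ".join(line.split())  # collapse spaces, drop \n
            if marker.search(line):
                if current:
                    yield " ".join(current)
                current = [line]
            elif line and current:         # skip blank lines
                current.append(line)
        if current:
            yield " ".join(current)
```

Because it yields blocks lazily, each block can be filtered or appended to a DataFrame as it arrives.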

Pandas is fast at deletion/filtering, while plain strings are easier to split.

Any ideas?

Thanks!

EDIT: Here is what my source text looks like. It's a mess, as you can see (extracted from a PDF). Each line is a row in the pandas DataFrame. The question is whether it is possible to extract all the data between the lines containing a series of numbers (including those lines).

13 - 0005761-52.2014.4.02.5101                 Lorem ipsum dolor sit amet.
Quisque eget velit a orci consectetur pharetra. Aliquam.
\n
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
a
Lorem ipsum dolor sit amet.
        Lorem ipsum dolor sit amet - Sed ut tempus neque.
Sed ut tempus neque.
2 - 0117333-76.2015.4.02.5101 Lorem ipsum dolor sit amet
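The marker lines in this sample could be detected with a regex built from the two numbered lines shown; this pattern is a guess from those examples (a counter, a dash, then a case number shaped like NNNNNNN-NN.NNNN.N.NN.NNNN) and may need adjusting for the real file:

```python
import re

# Assumed marker format: "<counter> - <7 digits>-<2>.<4>.<1>.<2>.<4>".
docket = re.compile(r"^\s*\d+\s*-\s*\d{7}-\d{2}\.\d{4}\.\d\.\d{2}\.\d{4}")

print(bool(docket.match("13 - 0005761-52.2014.4.02.5101  Lorem ipsum")))  # True
print(bool(docket.match("Sed ut tempus neque.")))                         # False
```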
  • What does the txt file look like? Can you post 4-5 lines? Commented Sep 20, 2017 at 5:32
  • I don't know how big your data is, but I have tried string operations many times with more than 30K subtitle files. One of those files is 10 KB on average, so there was 30 MB of data in total. Every time I tried it in Python, it took a few seconds. Commented Sep 20, 2017 at 6:04
  • @Vaishali Just edited my post to include a snippet from my text. Commented Sep 22, 2017 at 1:03
  • And you want everything between 13 - 0005761-52.2014.4.02.5101 and 2 - 0117333-76.2015.4.02.5101? Commented Sep 22, 2017 at 1:06
  • Exactly @Vaishali. Although I will still need to do some filtering to drop extra spaces and blank lines, essentially all the text data between those markers (markers included) is what I need put into a row of the pandas DataFrame. The filtering can be done after that. Commented Sep 25, 2017 at 17:39

1 Answer

You could read it all into a DataFrame using pandas.read_csv and select the rows that don't contain the match:

import pandas as pd

df = pd.read_csv('test.txt', header=None, delimiter='|')
df = df[~df[2].str.contains('MATCH', na=False)]  # keep rows where column 2 has no match

Alternatively, you could find the lines you want to ignore and then use the skiprows argument of pandas.read_csv:

with open('test.txt') as f:
    lines = f.readlines()

skiprows = [i for i, line in enumerate(lines) if 'MATCH' in line]
df = pd.read_csv('test.txt', skiprows=skiprows, header=None, delimiter='|')

To drop columns by column number if they are unwanted or empty:

df = df.drop(df.columns[[0, 1, 3]], axis=1)

To clean extra whitespace from all the values in column 2:

df[2] = [' '.join(x.split()) for x in df[2]]  

Or to clean the whitespace across the entire DataFrame:

cleaner = lambda x: ' '.join(x.split()) if isinstance(x, str) else x
df = df.applymap(cleaner)
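Putting those steps together, a minimal end-to-end sketch (using an in-memory string in place of test.txt; the data and column numbers are illustrative):

```python
import io
import pandas as pd

# Stand-in for test.txt: pipe-delimited, one line per row.
raw = io.StringIO(
    "0| REGEX MATCH \n"
    "1| data   one \n"
    "2| data   two \n"
)

df = pd.read_csv(raw, header=None, delimiter='|')
df = df[~df[1].str.contains('MATCH', na=False)]   # drop the match rows
df[1] = [' '.join(x.split()) for x in df[1]]      # collapse extra whitespace
print(df[1].tolist())
# ['data one', 'data two']
```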

5 Comments

I will definitely try this! I will need to make some changes, though. I can't just drop the entire line that contains the regex match, because sometimes there is data after the match (for example: | 0 | REGEX MATCH data data |).
I see, in that case it might make more sense to read in each line as one big column and apply operations from there.
Columns or rows? What I am currently doing is reading each line into a row, not a column. Also, if I read each line as one big column/row and apply all the filtering I need (extra spaces, blank lines, etc.), will I still be able to gather all the data between regex matches even if it is spread across many columns/rows? That's where I'm stuck. The regex filtering only applies to a specific row, so I'm not able to get all the data between multiple rows using this method. But yeah, I'm probably doing something wrong.
Hmmmm, I guess it might help to see more example data. If you want we can chat here chat.stackoverflow.com/rooms/155276/…
@Leandro I'm curious to know if you solved this, and if so how. I'm trying to find a solution to a very similar situation and any inspiration will be useful
