
I'm working on a Python 3.x project that needs to read a large TXT file, filter it (for example, remove multiple spaces, blank lines, and lines that start with certain strings), and finally split it by regex matching.

What I am doing right now is using a pandas DataFrame to store each line (which makes it easy to delete lines using pandas' startswith() or endswith()). On the other hand, with each line of the text file corresponding to a row in the DataFrame, I can't figure out how to extract data between regex matches. Here is an example:

| 0 | REGEX MATCH   |
| 1 | data          |
| 2 | data          |
| 3 | REGEX MATCH   |
| 4 | data          |
| 5 | REGEX MATCH   |

So the question is: how can I extract the data between matches (in this example, rows 0 to 2, 3 to 4, and 5)? Is this even possible in pandas?
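It is possible in pandas. One approach (a sketch with made-up data, not from the original post) is to take a cumulative sum over a boolean "is this a match line" column; every row between one match and the next then shares a group id:

```python
import pandas as pd

# Toy data mirroring the table above: marker lines start a new group.
df = pd.DataFrame({"line": [
    "REGEX MATCH A", "data 1", "data 2",
    "REGEX MATCH B", "data 3",
    "REGEX MATCH C",
]})

# cumsum over the boolean match column labels every row with the
# id of the most recent match line above it.
is_match = df["line"].str.contains("REGEX MATCH", regex=False)
df["group"] = is_match.cumsum()

# Join each group back into one string (match line included).
blocks = df.groupby("group")["line"].apply(" ".join)
print(blocks.tolist())
# ['REGEX MATCH A data 1 data 2', 'REGEX MATCH B data 3', 'REGEX MATCH C']
```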

Another option is to use read() on the text file and go for regular string manipulation instead of a DataFrame (filtering, splitting, etc.), which I'm not sure is appropriate for big text files. In that case I have unwanted data between the regex matches. Example:

s = "This is REGEX_MATCH    while between another \n \n REGEX_MATCH there is some    unwanted data"

In the above, I would need to remove the extra blank spaces and \n characters and finally split on the regex matches. The only issue is that my source text file is really large.
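If memory is the worry, a middle ground is to stream the file line by line and accumulate one block at a time between matches, so the whole file is never held at once. A sketch (iter_blocks is a hypothetical helper, not an existing API):

```python
import re

def iter_blocks(path, pattern):
    """Yield one joined, whitespace-normalized string per block.

    A block starts at a line matching `pattern` (the match line is
    kept) and runs until the next match or end of file.  Lines
    before the first match are skipped.
    """
    marker = re.compile(pattern)
    current = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = " ".join(line.split())  # collapse spaces, drop \n
            if marker.search(line):
                if current:
                    yield " ".join(current)
                current = [line]
            elif line and current:         # skip blank lines
                current.append(line)
        if current:
            yield " ".join(current)
```

Because it yields blocks lazily, each block can be filtered or appended to a DataFrame as it arrives.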

Pandas is fast at deletion/filtering, while plain strings are easier to split.

Any ideas?

Thanks!

EDIT: Here is what my source text looks like. It's a mess, as you can see (extracted from a PDF). Each line is a row in the pandas DataFrame. The question is whether it is possible to extract all the data between the lines containing a series of numbers (including those lines).

13 - 0005761-52.2014.4.02.5101                 Lorem ipsum dolor sit amet.
Quisque eget velit a orci consectetur pharetra. Aliquam.
\n
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
a
Lorem ipsum dolor sit amet.
        Lorem ipsum dolor sit amet - Sed ut tempus neque.
Sed ut tempus neque.
2 - 0117333-76.2015.4.02.5101 Lorem ipsum dolor sit amet
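The marker lines in this sample could be detected with a regex built from the two numbered lines shown; this pattern is a guess from those examples (a counter, a dash, then a case number shaped like NNNNNNN-NN.NNNN.N.NN.NNNN) and may need adjusting for the real file:

```python
import re

# Assumed marker format: "<counter> - <7 digits>-<2>.<4>.<1>.<2>.<4>".
docket = re.compile(r"^\s*\d+\s*-\s*\d{7}-\d{2}\.\d{4}\.\d\.\d{2}\.\d{4}")

print(bool(docket.match("13 - 0005761-52.2014.4.02.5101  Lorem ipsum")))  # True
print(bool(docket.match("Sed ut tempus neque.")))                         # False
```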
  • What does the txt file look like? Can you post 4-5 lines? Commented Sep 20, 2017 at 5:32
  • I don't know how big your data is, but I have tried string operations many times with more than 30K subtitle files. One of those files is 10 KB on average, so there was 30 MB of data in total. Every time I tried it in Python, it took a few seconds. Commented Sep 20, 2017 at 6:04
  • @Vaishali Just edited my post to include a snippet from my text. Commented Sep 22, 2017 at 1:03
  • And you want everything between 13 - 0005761-52.2014.4.02.5101 and 2 - 0117333-76.2015.4.02.5101? Commented Sep 22, 2017 at 1:06
  • Exactly @Vaishali. Although I will still need to do some filtering to drop extra spaces and blank lines, essentially all the text data between those markers (markers included) is what I need put into a row of the pandas DataFrame. The filtering can be done after that. Commented Sep 25, 2017 at 17:39

1 Answer

You could read it all into a DataFrame using pandas.read_csv and select the rows that don't contain the match:

import pandas as pd

df = pd.read_csv('test.txt', header=None, delimiter='|')
df = df[~df[2].str.contains('MATCH', na=False)]  # keep rows where column 2 has no match

Alternatively, you could find the lines you want to ignore and then use the skiprows argument of pandas.read_csv:

with open('test.txt') as f:
    lines = f.readlines()

skiprows = [i for i, line in enumerate(lines) if 'MATCH' in line]
df = pd.read_csv('test.txt', skiprows=skiprows, header=None, delimiter='|')

To drop columns by column number if they are unwanted or empty:

df = df.drop(df.columns[[0, 1, 3]], axis=1)

To clean extra whitespace from all the values in column 2:

df[2] = [' '.join(x.split()) for x in df[2]]  

Or to clean the whitespace across the entire DataFrame:

cleaner = lambda x: ' '.join(x.split()) if isinstance(x, str) else x
df = df.applymap(cleaner)
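Putting those steps together, a minimal end-to-end sketch (using an in-memory string in place of test.txt; the data and column numbers are illustrative):

```python
import io
import pandas as pd

# Stand-in for test.txt: pipe-delimited, one line per row.
raw = io.StringIO(
    "0| REGEX MATCH \n"
    "1| data   one \n"
    "2| data   two \n"
)

df = pd.read_csv(raw, header=None, delimiter='|')
df = df[~df[1].str.contains('MATCH', na=False)]   # drop the match rows
df[1] = [' '.join(x.split()) for x in df[1]]      # collapse extra whitespace
print(df[1].tolist())
# ['data one', 'data two']
```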

5 Comments

I will definitely try this! I will need to make some changes, though. I can't just drop the entire line that contains the regex match, because sometimes there is data after the match (for example: | 0 | REGEX MATCH data data |).
I see, in that case it might make more sense to read in each line as one big column and apply operations from there.
Columns or rows? What I am currently doing is reading each line into a row, not a column. Also, if I read each line as one big column/row and apply all the filtering I need (extra spaces, blank lines, etc.), will I still be able to gather all the data between regex matches even if it is spread across many columns/rows? That's where I'm stuck. The regex filtering only applies to a specific row, so I'm not able to get all the data between multiple rows using this method. But yeah, I'm probably doing something wrong.
Hmmmm, I guess it might help to see more example data. If you want we can chat here chat.stackoverflow.com/rooms/155276/…
@Leandro I'm curious to know if you solved this, and if so how. I'm trying to find a solution to a very similar situation and any inspiration will be useful
