How can I parse this text into a table in Python?

Question

I have a data file called text.txt (shown below), along with my code. I want to extract the value that follows each sequence line and build a table from those values. I would also like to know if there is a better way to do this. Thanks.

text.txt

Counting********************File:  bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: 
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: 
73764
Counting********************File:  bbduk_trimmed_Ago2_SsHV2L_2_CATTTT_L003_R1_001
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: 
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: 
78640
Counting********************File:  bbduk_trimmed_Ago2_VF_1_CAACTA_L003_R1_001.fastq
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: 
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: 
26267

Result I want:

   File Name                                         Seq_132582_1  Seq_483974_49238
0  bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001               0             73764
1  bbduk_trimmed_Ago2_SsHV2L_2_CATTTT_L003_R1_001               0             78640
2  bbduk_trimmed_Ago2_VF_1_CAACTA_L003_R1_001.fastq             0             26267

Code I tried:

import sys

if sys.version_info[0] < 3:
    raise Exception("Python 3 or a more recent version is required.")
import re
import pandas as pd
text = open("text.txt",'r').read()
print(type(text))
results = re.findall(r'(bbduk_trimmed.*.fastq)\nSeq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: \n(\d)\nSeq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: \n(\d*)',text)
df=pd.DataFrame(results)
# df.columns=['FileName','Seq_132582_1','Seq_483974_49238'] #This doesn't work
print(df)

Gahan · Accepted Answer · 2018-10-31 16:40:25Z


Replace your regex with the line below:

re.findall(r'Counting[*]+File:[ ]*([\w.]+)[ \n]*[ :\w]+[\n]*(\w+)[\n]*[ :\w]+[\n]*(\w+)', text)

Explanation:

[*]+ - matches one or more * characters
[ ]* - matches zero or more space characters
([\w.]+) - matches the file name and captures it as the first group
[ \n]* - matches zero or more space or newline characters
[ :\w]+ - matches the whole line that starts with Seq
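
As a quick check of the pattern explained above, here is a minimal sketch that runs it against two records copied from the question. The data is inlined in a variable named sample (a name chosen here for illustration) so that the snippet runs without the file.

```python
import re

# Two records copied from the question, inlined so the sketch is self-contained.
sample = (
    "Counting********************File:  bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001\n"
    "Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: \n"
    "0\n"
    "Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: \n"
    "73764\n"
    "Counting********************File:  bbduk_trimmed_Ago2_VF_1_CAACTA_L003_R1_001.fastq\n"
    "Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: \n"
    "0\n"
    "Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: \n"
    "26267\n"
)

# The pattern from the answer: one group for the file name, one per count.
pattern = re.compile(
    r"Counting[*]+File:[ ]*([\w.]+)[ \n]*[ :\w]+[\n]*(\w+)[\n]*[ :\w]+[\n]*(\w+)"
)

# findall returns one (file name, first count, second count) tuple per record.
rows = pattern.findall(sample)
for row in rows:
    print(row)
```

The counts come back as strings. To get the table from the question, the tuples can be passed to pandas with explicit column names, for example pd.DataFrame(rows, columns=["File Name", "Seq_132582_1", "Seq_483974_49238"]).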

The core of the regex, the part that picks up a sequence and its value, is:

([\w.]+)[ \n]*[ \w]+:[ :\w]+[\n]*(\w+)

After ([\w.]+) matches the file name, [ \n]* matches the spaces and newlines that follow it.
If you also want to capture the name of each sequence, keep [ \w]+:[ :\w]+ as a separate piece and write it as ([ \w]+):[ :\w]+, so that the parenthesized group extracts the sequence name (Seq_132582_1 or Seq_483974_49238). If the order of the sequences is not a concern and you do not need their names, simply replace that piece with [ :\w]+[\n]* to match the whole line, and capture the value you need on the next line with (\w+).
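
The idea of capturing the sequence names can be sketched in two stages: split the text into one block per file, then collect every (name, count) pair inside each block. This is a variation on the answer's regex rather than the same pattern, and it assumes each sequence line has the form "name: bases: " followed by a line of digits. The names sample and records are chosen here for illustration.

```python
import re

# Two records copied from the question, inlined so the sketch is self-contained.
sample = (
    "Counting********************File:  bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001\n"
    "Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: \n"
    "0\n"
    "Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: \n"
    "73764\n"
    "Counting********************File:  bbduk_trimmed_Ago2_SsHV2L_2_CATTTT_L003_R1_001\n"
    "Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: \n"
    "0\n"
    "Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: \n"
    "78640\n"
)

records = []
# Splitting on the header marker leaves one chunk per file; the chunk before
# the first header is empty and is skipped.
for block in re.split(r"Counting[*]+File:", sample):
    if not block.strip():
        continue
    header, _, body = block.partition("\n")
    record = {"File Name": header.strip()}
    # Group 1 is the sequence name, group 2 the count on the following line.
    for name, count in re.findall(r"(\w+):[ \w]*:[ ]*\n(\d+)", body):
        record[name] = int(count)
    records.append(record)

print(records)
```

Because the inner findall picks up however many pairs a block contains, the same code should cope with more than two sequences per file without changes to the regex.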

Another, easier way is to build the results without the re module, as shown below:

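results = []
# Each record spans five lines: a header, then a sequence line and its count, twice.
with open("text.txt", "r") as f:
    while True:
        line = f.readline()
        if not line:  # end of file
            break
        if not line.strip():  # ignore blank lines between records
            continue
        file_name = line.split(":")[-1].strip()  # text after "File:"
        f.readline()  # skip the first sequence line
        data_seq1 = f.readline().strip()  # count for the first sequence
        f.readline()  # skip the second sequence line
        data_seq2 = f.readline().strip()  # count for the second sequence
        results.append((file_name, data_seq1, data_seq2))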
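
A variation on the loop above that does not assume exactly two sequences per file is sketched below. It is only a sketch under stated assumptions: header lines start with "Counting", sequence lines start with "Seq_", and the line after a sequence line holds an integer count. The function name parse_counts and the third sequence in the demo data (Seq_extra_3) are hypothetical and not part of the question's file.

```python
def parse_counts(lines):
    """Turn the count report into a list of dicts, one per file."""
    records = []
    pending_name = None  # sequence name still waiting for its count line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("Counting"):
            # Header line: everything after the first colon is the file name.
            records.append({"File Name": line.split(":", 1)[1].strip()})
        elif line.startswith("Seq_"):
            # Sequence line: the name is the text before the first colon.
            pending_name = line.split(":", 1)[0]
        elif pending_name is not None and records:
            # Count line belonging to the most recent sequence line.
            records[-1][pending_name] = int(line)
            pending_name = None
    return records


# Demo data: the first record is from the question; Seq_extra_3 is a
# hypothetical third sequence added to show that extra sequences are handled.
demo_lines = [
    "Counting********************File:  bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001\n",
    "Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: \n",
    "0\n",
    "Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: \n",
    "73764\n",
    "Counting********************File:  bbduk_trimmed_Ago2_VF_1_CAACTA_L003_R1_001.fastq\n",
    "Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: \n",
    "0\n",
    "Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: \n",
    "26267\n",
    "Seq_extra_3: ACGT: \n",
    "5\n",
]

parsed = parse_counts(demo_lines)
print(parsed)
```

An open file can be passed in directly, since the function only iterates over lines. Handing the resulting list of dicts to pd.DataFrame should give one row per file and one column per sequence name, with missing values for any sequence a file does not report.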

edited Oct 31, 2018 at 16:40

answered Oct 31, 2018 at 16:07


1 Comment

Achal Neupane Over a year ago

Thanks! What if I have more Seqs? What regex do I need to add?
