
I have a data file called text.txt, and my code is below. I want to extract the file name and the count that follows each Seq line, and build a table from them. I would also like to know whether there is a better way to do this. Thanks.

text.txt

Counting********************File:  bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: 
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: 
73764
Counting********************File:  bbduk_trimmed_Ago2_SsHV2L_2_CATTTT_L003_R1_001
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: 
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: 
78640
Counting********************File:  bbduk_trimmed_Ago2_VF_1_CAACTA_L003_R1_001.fastq
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: 
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: 
26267

result I want:

   File Name                                         Seq_132582_1  Seq_483974_49238
0  bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001    0             73764
1  bbduk_trimmed_Ago2_SsHV2L_2_CATTTT_L003_R1_001    0             78640
2  bbduk_trimmed_Ago2_VF_1_CAACTA_L003_R1_001.fastq  0             26267

code I tried:

import sys

if sys.version_info[0] < 3:
    raise Exception("Python 3 or a more recent version is required.")
import re
import pandas as pd
text = open("text.txt",'r').read()
print(type(text))
results = re.findall(r'(bbduk_trimmed.*.fastq)\nSeq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: \n(\d)\nSeq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: \n(\d*)',text)
df=pd.DataFrame(results)
# df.columns=['FileName','Seq_132582_1','Seq_483974_49238'] #This doesn't work
print(df)

1 Answer

Just replace your regex with the line below:

re.findall(r'Counting[*]+File:[ ]*([\w.]+)[ \n]*[ :\w]+[\n]*(\w+)[\n]*[ :\w]+[\n]*(\w+)', text)

Explanation:

  • [*]+ - match one or more * characters
  • [ ]* - match zero or more space characters
  • ([\w.]+) - match the file name and capture it as the first group
  • [ \n]* - match zero or more space or newline characters
  • [ :\w]+ - match the whole line that starts with Seq
  • (\w+) - match the count on the following line and capture it
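To see what the pattern returns, here is a minimal sketch that runs it on two of the records from the question, pasted in as a string instead of being read from the file. The names `sample` and `PATTERN` are only for illustration.

```python
import re

# Two records copied from the question's sample data.
sample = (
    "Counting********************File:  bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001\n"
    "Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: \n"
    "0\n"
    "Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: \n"
    "73764\n"
    "Counting********************File:  bbduk_trimmed_Ago2_VF_1_CAACTA_L003_R1_001.fastq\n"
    "Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: \n"
    "0\n"
    "Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: \n"
    "26267\n"
)

# Same regex as in the answer: one group for the file name, two for the counts.
PATTERN = (
    r'Counting[*]+File:[ ]*([\w.]+)[ \n]*'
    r'[ :\w]+[\n]*(\w+)[\n]*[ :\w]+[\n]*(\w+)'
)

# findall returns one (file_name, count1, count2) tuple per record.
rows = re.findall(PATTERN, sample)
for row in rows:
    print(row)
```

Each tuple holds strings, so the counts still need `int()` if you want numbers. The list of tuples can then be passed to `pd.DataFrame(rows, columns=[...])` with three column names, as in the question.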

The core of the regex, the part that picks up one sequence count, is:

([\w.]+)[ \n]*[ \w]+:[ :\w]+[\n]*(\w+)

  • After matching the file name with ([\w.]+), we match the following space(s) and newline(s) with [ \n]*.
  • If you also want to capture the name of the sequence, keep [ \w]+:[ :\w]+ as a separate piece and write it as ([ \w]+):[ :\w]+, so that the parentheses capture the name (Seq_132582_1 or Seq_483974_49238).
  • If the order of the sequences is fixed and you do not need the names, simply use [ :\w]+[\n]* to match the whole Seq line, then match the value you need on the next line with (\w+).
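As a rough illustration of the name-capturing variant, the sketch below applies it to a single record from the question. The pattern is anchored with `File:[ ]*` here, which is an addition to the snippet above: without it, `re.search` can start matching inside the `File:` label itself. The variable names are hypothetical.

```python
import re

# One header line plus its first Seq entry, copied from the question.
record = (
    "Counting********************File:  bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001\n"
    "Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: \n"
    "0\n"
)

# Groups: file name, sequence name (text before the first colon), count.
pattern = r'File:[ ]*([\w.]+)[ \n]*([ \w]+):[ :\w]+[\n]*(\w+)'

match = re.search(pattern, record)
file_name, seq_name, count = match.groups()
print(file_name, seq_name, count)
```

Note that the `+` sits inside the parentheses. With `([ \w])+` the group would keep only the last character it matched.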

Another, easier way is to build the results without the re module, as shown below:

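results = []
# Each record is five lines: header, Seq line, count, Seq line, count.
with open("text.txt", "r") as f:  # the file is closed automatically
    while True:
        line = f.readline()  # "Counting***File:  <name>" header
        if not line:
            break  # end of file
        file_name = line.split(":")[-1].strip()
        f.readline()  # skip the Seq_132582_1 line
        data_seq1 = f.readline().strip()
        f.readline()  # skip the Seq_483974_49238 line
        data_seq2 = f.readline().strip()
        results.append((file_name, data_seq1, data_seq2))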

1 Comment

Thanks! What if I have more Seqs? What regex do I need to add?
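One possible way to handle more Seqs, instead of adding a regex piece for every new sequence, is to read the file line by line and let each Seq line name its own column. This is only a sketch: `parse_counts` is a hypothetical helper, the third entry `Seq_hypothetical_3` is invented for the example, and the code assumes the file is well formed (every Seq line is followed by exactly one integer line, and a `Counting...File:` header comes first).

```python
def parse_counts(lines):
    """Turn the 'Counting...File:' report into one dict per file."""
    rows = []
    current = None      # dict for the file currently being read
    pending_seq = None  # sequence name still waiting for its count line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith("Counting"):
            # Header line: start a new row keyed by the file name.
            current = {"File Name": line.split("File:", 1)[1].strip()}
            rows.append(current)
            pending_seq = None
        elif pending_seq is None:
            # Seq line: the name is everything before the first colon.
            pending_seq = line.split(":", 1)[0].strip()
        else:
            # Count line belonging to the Seq line just seen.
            current[pending_seq] = int(line)
            pending_seq = None
    return rows


# First record from the question plus one made-up extra sequence.
sample = """Counting********************File:  bbduk_trimmed_Ago2_SsHV2L_1_CATGGC_L003_R1_001
Seq_132582_1: ATCCGAATTAGTGTAGGGGTTAACATAACTCT: 
0
Seq_483974_49238: TCCGAATTAGTGTAGGGGTTAACATAACTC: 
73764
Seq_hypothetical_3: ACGTACGT: 
5
"""

rows = parse_counts(sample.splitlines())
print(rows)
```

For the real file you could call `parse_counts(open("text.txt"))`. The resulting list of dicts can be passed to `pd.DataFrame(rows)`, which takes the column names from the dict keys.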
