
I have written a piece of code that would read in one .fasta file, analyze a single genetic sequence, make calculations based on said sequence, and then organize the calculation results into a single pandas dataframe, which would subsequently be exported as .csv file.

I have updated the code recently in order for it to parse a .fasta file that contains multiple sequences, and although I figured out how to do it, the code in its current form exports one .csv file per sequence. When the .fasta file contains many sequences (over 100, for example), having to sort through so many .csv files might be somewhat laborious.

So I am trying to have all of the pandas dataframes exported to a single .csv file instead. However, I am not sure how to set up the code to do this. Right now, the code is built around a for loop that iterates over the keys of a dict (where the sequences from the .fasta file are stored). In each iteration, one function is called that creates a dict of the pertinent calculation results, and another function is called that creates a pandas dataframe and fills it with the information from that dict. The dataframe is then exported as a .csv file.

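import pandas as pd
from os import path

# seq_dict holds the sequences parsed from the .fasta file
for seq in seq_dict.keys():
    # Compute the results for this sequence and assemble them into a dataframe
    result_dict = calculator_func(seq_dict[seq])
    results_df = data_assembler(result_dict)
    # Export this sequence's dataframe as a .csv file
    results_df.to_csv(path.join(output_dir, "{}_dataframe.csv".format(project_name)))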

It should also be noted that the indices of the dataframes are all based on the numerical positions within the relevant sequence.

In any case, I am having a hard time figuring out how to combine all the dataframes into one .csv file such that the indices let the user tell (a) which sequence a row comes from and (b) which position within that sequence the row is based on. Can anybody recommend an approach?

1 Answer

You can set your index as whatever you want, including a string. Try this example:

import pandas as pd

test_frame = pd.DataFrame({"Sequence":[1,2],"Position":[3,4]})
test_frame.index = "Sequence:" + test_frame['Sequence'].astype(str) + "_" + "Position:" + test_frame['Position'].astype(str)
test_frame
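
Applied to the question's setup, where each dataframe is indexed by position within its sequence, the same idea might look like the sketch below. The helper `label_rows`, the column `gc_content`, and the name `seqA` are hypothetical and only serve the illustration.

```python
import pandas as pd

def label_rows(df, seq_name):
    """Return a copy of df whose index combines seq_name with the original position."""
    labeled = df.copy()
    labeled.index = [f"{seq_name}_{pos}" for pos in df.index]
    return labeled

# Hypothetical per-sequence result: the index is the position in the sequence.
example_df = pd.DataFrame({"gc_content": [0.5, 0.25]}, index=[1, 2])
labeled_df = label_rows(example_df, "seqA")
print(labeled_df.index.tolist())  # ['seqA_1', 'seqA_2']
```

A string index like this keeps both pieces of information in one column of the .csv file, at the cost of having to split the label again if the position is needed as a number later.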

4 Comments

I know that you can set the index of a dataframe to whatever you want, but how can I export all the dataframes produced by the for loop into one single .csv file? Should I create an empty dataframe before the loop and then fill it up on each iteration? How should I structure the indices then? Tell me if you need me to describe the structure of my functions.
Do all of the dataframes have the same column names? After you assign the new indices, you can append each one to a master dataframe, or concatenate them all into a master dataframe, and export that. For example: master_frame = test_frame1, followed by master_frame = master_frame.append(test_frame2). Note that DataFrame.append was removed in pandas 2.0; pd.concat([master_frame, test_frame2]) is the current equivalent. See pandas.pydata.org/pandas-docs/stable/merging.html
I ended up merging the dataframes by making an empty list before the for loop, appending each built dataframe to the list, and then using final_dataframe = pd.concat(total_list_of_dfs) to make the final dataframe. Thanks for your help!
Awesome, hope I helped!
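
For reference, the pd.concat approach described in the comments could be sketched roughly as follows. The bodies of `calculator_func` and `data_assembler` here are stand-ins (the real functions are not shown in the question), the sequences in `seq_dict` are made up, and the level names `sequence` and `position` are assumptions. Passing a dict to pd.concat uses its keys as an outer index level, so each row records both the sequence and the position.

```python
import io
import pandas as pd

def calculator_func(sequence):
    # Stand-in for the real calculation: flag G/C bases per position.
    return {"base": list(sequence), "is_gc": [b in "GC" for b in sequence]}

def data_assembler(result_dict):
    # Stand-in: build a dataframe indexed by 1-based position in the sequence.
    df = pd.DataFrame(result_dict)
    df.index = range(1, len(df) + 1)
    return df

# Made-up sequences standing in for the parsed .fasta contents.
seq_dict = {"seqA": "ATGC", "seqB": "GG"}

# One dataframe per sequence, keyed by sequence name.
frames = {name: data_assembler(calculator_func(seq)) for name, seq in seq_dict.items()}

# The dict keys become the outer index level; the positions stay as the inner level.
final_dataframe = pd.concat(frames, names=["sequence", "position"])

# Written to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
final_dataframe.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

In the resulting .csv file the first two columns are `sequence` and `position`, so rows can be filtered by either one after reading the file back with pd.read_csv(..., index_col=[0, 1]).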
