0

I have a csv file that looks like this:

CCC1=NNC(CC)=C1CC(=O)N[C@H]1C[C@@](C)(O)C1  Z3670357625 M
COC(=O)C1(CNC(=O)NCC2CC2(C)C)CCCC1  Z1366118392 M
CC1(C)CN(CCF)CCN1S(=O)(=O)C1=CNC=N1 Z3847735140 S
CC1=CC(N2CCC(C#N)(C3CCC3)CC2)=NC=C1C#N  Z4174841496 M
CN(CC1=NN=C2C=CC=CN12)C(=O)C1=CC=CC=C1Br    Z979185880  S
CC1=CC(C(NCC(C)N(C)C2CC2)C2CC2)=CC=C1F  Z1586839055 S
CN1N=NC2=CC(CSCC3=CC=C(Br)C=C3)=CC=C21  Z1449657345 S
CCC(C)(C)C(=O)N(C)C1=CC=CC(F)=C1OC  Z4888674994 S
CCC(NC1CC2CCCCN2C1=O)C1=NOC(C)=C1   Z4196015222 M
COCC1(CO)CCN(CC2=CC=CC(C(=O)O)=C2Cl)C1  Z4251400790 M
O=C(C[C@H]1CCNC1)N1CCC(OC2=CC=C(Cl)C=C2)C1  Z4062618130 S
O=C(C[C@@H]1CCNC1)N1CCC(OC2=CC=C(Cl)C=C2)C1 Z4062618465 S
CC1(C)CN(C2CCN(C(=O)OC3COC3)C2)C1   Z4000118072 M
CCC(CC)C(C)C(=O)OCC1=CC(=O)NC=C1    Z2861190319 S
COC1=CC(Br)=C(CN[C@@H]2C[C@@H]3C[C@H](O)[C@H]2C3)C=C1OC |&1:9,13,&2:11,15|  Z3808914614 S

The last line contains characters within ||, which leaves it with four columns instead of the three that the other rows have. How do I delete just |&1:9,13,&2:11,15|? I saw this post: Deleting part of a string pandas DataFrame but I need to delete just this selection, not the entire row starting from this point. I feel this would be easier than trying to just retain the three columns I need throughout the csv, but am open to other ideas. I will need to do this for a number of lines in a very large file. Thanks!

0

2 Answers 2

1

Assuming that your content is in a csv file called csvfile.csv try the following:

  1. Code to replace the offending string
#read input file
fin = open("csvfile.csv", "rt")
#read file contents to string
data = fin.read()
#replace all occurrences of the required string
data = data.replace('|&1:9,13,&2:11,15|', '')
#close the input file
fin.close()
#open the input file in write mode
fin = open("csvfile.csv", "wt")
#overrite the input file with the resulting data
fin.write(data)
#close the file
fin.close()

after this you should be able to do:

  1. Code to simply read the csv
import pandas as pd
df = pd.read_csv("csvfile.csv")
Sign up to request clarification or add additional context in comments.

Comments

0

A much faster way is to skip python altogether. From your terminal:

sed 's/|&1:9,13,&2:11,15|//g' path_to_csv > path_to_modified_csv

This will remove exclusively the undesired part, e.g. (last line):

COC1=CC(Br)=C(CN[C@@H]2C[C@@H]3C[C@H](O)[C@H]2C3)C=C1OC   Z3808914614 S

1 Comment

A more general way to write this would be sed 's/|.*.|//g' path_to_csv > path_to_modified_csv

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.