
I have a long text in which I have inserted a delimiter ";" exactly where I would like to split the text into separate columns. So far, whenever I try to split the text into 'ID' and 'ADText', I only get the first line. However, there should be 1439 rows in two columns.

My text looks like this: 1234; text in written from with multiple sentences going over multiple lines until at some point the next ID is written dwon 2345; then the new Ad-Text begins until the next ID 3456; and so on

I want to use the ";" to split my text into two columns, one with the ID and one with the ad text.

import pandas as pd

# Read the text file into pandas
jobads = pd.read_csv("jobads.txt", header=None)
print(jobads)

# Create a DataFrame
df = pd.DataFrame(jobads, index=None, columns=None)
type(df)
print(df)
# Name the column so it can be targeted for the split
df = df.rename(columns={0: "Job"})
print(df)

# Split it into two columns. Problem: I only get the first row.
print(pd.DataFrame(df.Job.str.split(';', n=1).tolist(),
                   columns=['ID', 'AD']))

Unfortunately, that only works for the first entry and then it stops. The output looks like this:

               ID                                                 AD
0            1234                                   text in written from with ...

Where am I going wrong? I would appreciate any advice =) Thank you!

  • Why don't you use the "sep" parameter of "pd.read_csv"? Commented Sep 4, 2020 at 14:22

1 Answer

sample text:

FullName;ISO3;ISO1;molecular_weight
Alanine;Ala;A;89.09
Arginine;Arg;R;174.20
Asparagine;Asn;N;132.12
Aspartic_Acid;Asp;D;133.10
Cysteine;Cys;C;121.16

Create columns based on ";" separator:

import pandas as pd

# Path to the file that holds the sample text above
f = "aminoacids"
df = pd.read_csv(f, sep=";")
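To try the `sep=";"` approach without creating a file first, the sample can be fed to `pd.read_csv` through an in-memory buffer. This is only a minimal sketch: the names `sample` and `amino` are made up for illustration, and only the first two amino acids from the sample are included.

```python
import io

import pandas as pd

# Hypothetical in-memory copy of the first lines of the sample text.
sample = """FullName;ISO3;ISO1;molecular_weight
Alanine;Ala;A;89.09
Arginine;Arg;R;174.20
"""

# io.StringIO lets read_csv treat the string like an open file.
amino = pd.read_csv(io.StringIO(sample), sep=";")
print(amino)
```

Each ";" on a line starts a new column, and each line becomes one row, so this only helps when every record sits on its own line.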


EDIT: Considering the comment, I assume the text looks more like this:

t = """1234; text in written from with multiple sentences going over multiple lines until at some point the next ID is written dwon 2345; then the new Ad-Text begins until the next ID 3456; and so on1234; text in written from with multiple """

In this case, a regex like this will split your string into IDs and texts, which you can then use to build a pandas DataFrame.

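import re

# The capture group keeps the matched IDs in the list returned by split()
r = re.compile(r"([0-9]+);")
a = re.split(r, t)  # keep the result; it is reused further down
a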

Output:

['',
 '1234',
 ' text in written from with multiple sentences going over multiple lines until at some point the next ID is written dwon ',
 '2345',
 ' then the new Ad-Text begins until the next ID ',
 '3456',
 ' and so on',
 '1234',
 ' text in written from with multiple ']
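Because the pattern contains a capture group, the list alternates between IDs and texts after the leading empty string, so the two can be paired by slicing. The following is a small sketch on a shortened, made-up string of the same shape; `text`, `parts`, and `pairs` are illustrative names only. It assumes that a run of digits followed by ";" never occurs inside an ad text.

```python
import re

# Hypothetical sample in the same shape as the text in the question.
text = "1234; first ad text 2345; second ad text 3456; and so on"

# The capture group makes split() return the IDs as well as the texts.
parts = re.split(r"([0-9]+);", text)

# parts[0] is whatever precedes the first ID (an empty string here),
# so the IDs sit at the odd positions and the texts at the even ones from 2 on.
ids = parts[1::2]
texts = [p.strip() for p in parts[2::2]]

pairs = list(zip(ids, texts))
print(pairs)
```

Stripping the texts is optional; it just removes the spaces left around each ad.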

EDIT 2: This is a response to the questioner's additional question in the comments: how to convert this string into a pandas DataFrame with two columns, IDs and Texts.

import pandas as pd

# a is the output list from the previous part of this answer.
# a[::2] takes every other item, starting with the FIRST one;
# [1:] then drops the leading empty string, leaving only the texts.
texts = a[::2][1:]
print(texts)
# a[1::2] takes every other item, starting with the SECOND one: the IDs.
ids = a[1::2]
print(ids)
df = pd.DataFrame({"IDs": ids, "Texts": texts})
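Put together, the whole route from raw string to DataFrame might look like the sketch below. The helper `ads_to_frame` and the sample string are hypothetical, and the column names follow the question ('ID' and 'ADText'). For the real data, the string would presumably come from reading jobads.txt in one go rather than through `pd.read_csv`; the encoding in the comment is an assumption.

```python
import re

import pandas as pd


def ads_to_frame(raw: str) -> pd.DataFrame:
    """Split 'ID; text ID; text ...' into a two-column DataFrame (illustrative helper)."""
    parts = re.split(r"([0-9]+);", raw)
    ids = parts[1::2]                          # captured IDs
    texts = [p.strip() for p in parts[2::2]]   # text following each ID
    return pd.DataFrame({"ID": ids, "ADText": texts})


# For the real file, something like this (encoding is an assumption):
# raw = open("jobads.txt", encoding="utf-8").read()
raw = "1234; first ad\nspanning lines 2345; second ad 3456; third ad"

df = ads_to_frame(raw)
print(df.shape)
```

Reading the file as one string sidesteps `pd.read_csv`, which treats line breaks and commas as structure and is what produced the single row in the question.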

9 Comments

Thank you so much for your answer. I have tried this and it gives me 0 rows and 19090 columns, since my text is not laid out like your example. The IDs are not written neatly at the start of each line but are free-flowing in the text.
Ah, I see, so you don't even have new lines? It is just one long single-line string?
Yes, it is one single-line string, sadly.
@Nina Have you checked the edit of my answer? Would it help, or are there any other scenarios that it doesn't capture?
Thank you so much for your time and answer! It helps a lot and it worked! Awesome really, thanks!