Separating data that overlaps between rows in a csv file using Pandas library

Question

I downloaded this Ecommerce dataset from Kaggle:
https://www.kaggle.com/datasets/kolawale/focusing-on-mobile-app-or-website

After converting it to a csv file, there seems to be an issue. The data from the 2nd row onwards (the 1st row contains the column names, such as Email, Address, Avatar, Avg. Session Length, Time on App, Yearly Amount Spent, etc.) seems to spill over between adjacent rows.

For example, the comma-separated data for one customer is not contained in a single row: part of it is in one row and the rest is in the row below it. When I use 'Text to Columns' in Excel, the data lands in the wrong columns, i.e. a customer's address ends up in the 'Email' column, 'Avg. Session Length' data ends up in the 'Time on App' column, and so on. Moreover, most of the cells end up empty, so they get converted to NaN when read in by pandas' read_csv() function. This is shown below:

Doing 'Text to Columns' on this data splits it across columns in such a way that important data is lost. Reading the resulting csv file with pandas' read_csv() function will almost certainly convert the empty cells to NaN. This is shown below:

What is the workaround for this? How do I combine the data in this csv file so that the data for each customer is contained in one row only? Any help or advice is much appreciated. Thanks.

rehaqds · Accepted Answer · 2025-01-29 23:24:35Z


The issue is that the Address field is a string with a newline character in the middle. Excel doesn't handle it well, but the pandas csv reader can deal with it:

import pandas as pd

df = pd.read_csv("Ecommerce Customers.csv")
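To see why this works, here is a minimal sketch on a made-up two-customer sample (the emails, addresses and values are hypothetical, not taken from the Kaggle file). It assumes the address is wrapped in double quotes in the raw file, which is what lets the csv parser treat the embedded newline as part of the field rather than as the end of a row:

```python
import io

import pandas as pd

# Hypothetical sample mimicking the layout described in the question:
# the quoted Address field contains a newline, so each customer spans
# two physical lines in the file.
sample = (
    "Email,Address,Avatar,Time on App\n"
    'a@example.com,"12 Main St\nSpringfield, IL 62701",Violet,12.5\n'
    'b@example.com,"PSC 4115, Box 7\nAPO AA 41000",Bisque,11.1\n'
)

# read_csv keeps a quoted field together even when it contains "\n".
df = pd.read_csv(io.StringIO(sample))

print(df.shape)                    # rows x columns as pandas sees them
print(repr(df.loc[0, "Address"]))  # the newline is still inside the value
```

Each customer comes back as one row, with the newline preserved inside the Address value.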

If you want to replace the newline character with an ordinary space:

df["Address"] = df["Address"].str.replace("\n", " ")

Then you can save the dataframe as a csv file with df.to_csv("Ecommerce Customers2.csv") and open it in Excel if you wish, but I guess you will keep using pandas to analyze the data.
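As a rough sketch of the clean-and-save step, again on made-up data: the snippet below writes to an in-memory buffer instead of a file so it can be run anywhere, and passes index=False, which is an optional choice on my part that keeps the row index from being written as an extra first column.

```python
import io

import pandas as pd

# Hypothetical one-row frame with a newline inside the address.
df = pd.DataFrame(
    {
        "Email": ["a@example.com"],
        "Address": ["12 Main St\nSpringfield, IL 62701"],
    }
)

# regex=False makes the pattern a literal string rather than a regex.
df["Address"] = df["Address"].str.replace("\n", " ", regex=False)

# Write to a buffer here; pass a file name instead to create a real file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# After the replacement, every customer occupies one physical line.
print(csv_text)
```

With the newline gone, the saved file has one header line plus one line per customer, which is the layout Excel expects.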


Is there anything you did before you used the pd.read_csv() function to read in the csv file? It seems the data is converted to a dataframe exactly the way it's supposed to be for you, but when I do the same thing, only one column appears at the top, containing the names of all the labels separated by commas, like this: 'Email, Address, Avg Session Length, Time on App, Time on Website' etc., instead of separate columns with separate names and data in them.

– Majoka, commented 2025-01-30 09:07:23 +00:00

No, I just added ".csv" to the file name, but I've just checked that it also works with the original file name. Did you do something to the file? Maybe delete the one you have and download it again to start fresh. What you get looks as if read_csv didn't use the comma separator or header=0, but both are default parameters, so that's strange. What is your pandas version?

– rehaqds, commented 2025-01-30 09:46:03 +00:00
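A few quick checks can narrow this down. The sketch below is only a guess at one possible cause, not something confirmed in this thread: it assumes the file was re-saved so that each whole row became a single quoted field, and the sample text and the helper name looks_unsplit are made up for illustration.

```python
import io

import pandas as pd

print(pd.__version__)  # the version asked about in the comment above


def looks_unsplit(df: pd.DataFrame) -> bool:
    """Return True if the frame has one column whose name contains a comma."""
    return df.shape[1] == 1 and "," in str(df.columns[0])


# Hypothetical stand-in for a damaged file: each row is one quoted field,
# so the commas inside the quotes are not treated as separators.
broken = (
    '"Email,Address,Time on App"\n'
    '"a@example.com,12 Main St,12.5"\n'
)
df_broken = pd.read_csv(io.StringIO(broken))

print(df_broken.shape)
print(looks_unsplit(df_broken))
```

If the same check is true for your dataframe, opening the file in a plain text editor and looking at the first line should show whether the header is wrapped in quotes or uses a different separator; downloading the original file again, as suggested above, avoids the problem either way.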
