0

I'm cleaning a CSV file with pandas, mainly removing special characters such as '/' and '#'. The file has 7 columns (none of which are dates).

Some of the columns contain purely numeric codes such as '11/6/1980'.

I've noticed that directly after reading the csv file,

df = pd.read_csv('report16.csv', encoding='ANSI')

this data becomes '11/6/80', and after cleaning it becomes '11 6 80' (the output file shows the same result). So wherever the data contains '/', it seems to be interpreted as a date, and Python drops the first 2 digits of the last group.

Data       Expected result  Actual Result
11/6/1980  11 6 1980        11 6 80
12/8/1983  12 8 1983        12 8 83

Both of the actual results are wrong: in the Actual Result column, the last group has lost its first 2 digits.

The data looks like this

Org Name  Code       Code copy
ABC       11/6/1980  11/6/1980
DEF       12/8/1983  12/8/1983
GH        11/5/1987  11/5/1987

As raw text:

OrgName,    Code,   Code copy
ABC,    11/6/1980,  11/6/1980
DEF,    12/8/1983,  12/8/1983
GH, 11/5/1987,  11/5/1987
KC,      9000494,          9000494

It's worth mentioning that the column also contains other data such as '900490', strings, etc., but those values cause no problems.

What can be done to prevent this conversion?

3
  • Welcome to SO! You will find help here, provided you ask questions in the way we are used to. Here, if you show us the code you use, with data exhibiting the problem - in fact if you provide a minimal reproducible example, you could get far more relevant and detailed answers. If you do not really understand what a minimal reproducible example is, please read How to Ask... Commented Mar 5, 2021 at 12:54
  • I cannot reproduce. Please see my (non) answer below for more details. Commented Mar 5, 2021 at 13:30
  • Still not a usable format here. As the problem occurs when reading the csv file, you should show the file not in a spreadsheet format but as raw text, as you would see it with type file.csv in a Windows CMD console or cat file.csv on a Unix-like system, or when opening it in a simple text editor such as Windows Notepad, vi or Notepad++. Commented Mar 5, 2021 at 13:47

3 Answers

1

Not an answer, but comments do not allow well-presented code and data.

Here is what I call a minimal reproducible example:

Content of the sample.csv file:

Data,Expected result,Actual Result
11/6/1980,11 6 1980,11 6 80
12/8/1983,12 8 1983,12 8 83

Code:

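import pandas as pd

# Read the sample file with default settings and show what pandas loaded
df = pd.read_csv('sample.csv')
print(df)
# Replace '/' with a space and compare against the expected column
s = df['Data'].str.replace('/', ' ')
print((df['Expected result'] == s).all())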

It gives:

        Data Expected result Actual Result
0  11/6/1980       11 6 1980       11 6 80
1  12/8/1983       12 8 1983       12 8 83
True

This proves that read_csv has correctly read the file and has not changed anything.

PLEASE SHOW THE CONTENT OF YOUR CSV FILE AS TEXT, along with enough code to reproduce your problem.
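If no console or plain text editor is at hand, the raw text can also be inspected from Python itself, before pandas is involved. The following is only a minimal sketch: `show_raw_lines` is a hypothetical helper, the temporary file merely stands in for the real CSV, and the `utf-8` default is an assumption (the question uses a different encoding).

```python
import os
import tempfile


def show_raw_lines(path, n=5, encoding="utf-8"):
    """Return the first n lines of a text file as stored, without any CSV parsing."""
    lines = []
    with open(path, "r", encoding=encoding, newline="") as fh:
        for i, line in enumerate(fh):
            if i >= n:
                break
            lines.append(line.rstrip("\r\n"))
    return lines


# Demo: write a small stand-in file, then look at its raw content.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8", newline=""
) as tmp:
    tmp.write("OrgName,Code,Code copy\nABC,11/6/1980,11/6/1980\n")
    demo_path = tmp.name

raw = show_raw_lines(demo_path)
for line in raw:
    print(repr(line))  # repr() makes stray spaces or odd characters visible

os.remove(demo_path)
```

If the raw lines already show a two-digit year, the value was shortened before pandas ever read the file.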


3 Comments

This is not an answer and will probably be deleted in a while... It should be anyway ;-)
This looks like an answer to me. A good answer. Shows OP a very good first step in problem solving: break the problem into pieces that can be proved/disproved ("proves that read_csv has correctly read the file"). Also shows OP how to get better support by providing the core of the problem: the text file. Well done.
@jim Thank you for the support :-) . I said that it is not an answer precisely for those reasons: it explains to OP how to present a question, and it explains what they should check on their own system before asking here, or at least what they should show if they need explanations. All things that should normally go into comments :-(
0

How about trying a string operation? First select the column that you would like to modify, then replace "/" or "#" with whitespace: column.str.replace("/", " "). I hope this works!
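A minimal sketch of that idea, under some assumptions: the inline text stands in for the real file, `dtype=str` is added so every cell stays text, and the character class `[/#]` covers only the two characters named in the question.

```python
import io

import pandas as pd

# Inline sample based on rows from the question (stands in for the real file).
csv_text = (
    "OrgName,Code,Code copy\n"
    "ABC,11/6/1980,11/6/1980\n"
    "KC,9000494,9000494\n"
)

# dtype=str keeps every cell as text, so nothing is converted while reading.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Replace each '/' or '#' with a space, column by column.
cleaned = df.apply(lambda col: col.str.replace(r"[/#]", " ", regex=True))
print(cleaned)
```

Applying the replacement to all columns is a simplification; a list of selected columns would work the same way.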

1 Comment

I tried this; however, the problem is already apparent upon reading the csv file. Immediately after the file is read, the values become dates and the first 2 digits of the 'year' part are lost. What I've done until now is replace the '/' with ' * ' in Excel and re-read the file, and that worked. However, I'd like to know if there's a workaround for this in Python.
0

The date conversion is not strictly a Python issue: you are using pandas' read_csv.

Try declaring the separator explicitly. By default read_csv assumes a comma; it only guesses the delimiter when sep=None is passed.

df = pd.read_csv('report16.csv', encoding='ANSI', sep=',')
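To check whether reading alone changes anything, here is a small sketch using the raw text posted in the question. Two options are added as assumptions, not as something the question used: `skipinitialspace=True` (to drop the padding after each comma) and `dtype=str` (to keep every cell as text). Note that read_csv does not parse dates unless asked to via parse_dates.

```python
import io

import pandas as pd

# Raw text as posted in the question, including the padding after the commas.
csv_text = (
    "OrgName,    Code,   Code copy\n"
    "ABC,    11/6/1980,  11/6/1980\n"
    "KC,      9000494,          9000494\n"
)

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=",",
    skipinitialspace=True,  # ignore spaces that follow each delimiter
    dtype=str,              # keep every value as text
)

# Columns are addressed by position to stay independent of the header text.
first_code = df.iloc[0, 1]
kc_code = df.iloc[1, 1]
print(first_code, kc_code)
```

If the four-digit year survives here but not with the real file, the shortened value is probably already stored in that file, for example after it was saved from a spreadsheet program.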
