
I'm using numpy and Python 3.4 to read data from a .csv file.

Here is a sample of the CSV file:

"05/27/2016 09:45:37.816","187666432","7921470.8554087048","0","95.202655176457412","82.717061054954783","1.4626657999999999","158","5"
"05/27/2016 09:45:38.819","206884864","10692185.668858336","0","101.33018029563618","93.535551042125718","2.4649584999999998","158","5"

And here is my code sample used to extract data from the CSV above:

import os
import numpy as np

path = os.path.abspath('sample.csv')
csv_contents = np.genfromtxt(path, dtype=None, delimiter=',', autostrip=True, skip_header=0,
                             usecols=(1, 2, 3, 4, 5, 6, 7, 8))

num_cols = csv_contents.shape[1]

for x in np.nditer(csv_contents):
    print('Original value: {0}'.format(x))
    print('Decoded value: {0}'.format(x.tostring().decode('utf-8')))
    val = x.tostring().decode('utf-8').replace('\x00', '').replace('"', '')
    print('Without hex and ": {0}'.format(val))

    try:
        print('Float value:\t{0}\n'.format(float(val)))
    except ValueError as e:
        raise e

Sample output:

Original value: b'"187666432"'
Decoded value: "187666432"���������
Without hex and ": 187666432
Float value:    187666432.0

Original value: b'"7921470.8554087048"'
Decoded value: "7921470.8554087048"
Without hex and ": 7921470.8554087048
Float value:    7921470.855408705

Original value: b'"0"'
Decoded value: "0"�����������������
Without hex and ": 0
Float value:    0.0

In my for loop, to convert the x value to a float, I've had to do this:

val = x.tostring().decode('utf-8').replace('\x00', '').replace('"', '')

This is not particularly elegant and is likely to be error-prone.

Question 1: Is there a better way to do this?

Question 2: Why does x.tostring().decode('utf-8') evaluate to something like "158"��������������� when dealing with integers? Where are the extra bytes (the \x00 characters) in x.tostring() coming from?

  • which version of numpy are you using? Can you print the output of list(b'"187666432"') etc. for these values (perhaps that will explain the �s). Commented May 27, 2016 at 20:42
  • I'm on numpy 1.11.0. For your other request, I'll check once I'm back on my laptop! :) Commented May 27, 2016 at 20:47
  • Perhaps it's a fixed-length value, padded with \0 or something like that? All three decoded values have the same length: "187666432"��������� "0"����������������� "7921470.8554087048" Commented May 27, 2016 at 20:47
  • @luis Ah yes, the dtype is S20. Commented May 27, 2016 at 20:54
  • @HEADLESS_0NE Running on Python 3.4.3, Ubuntu, numpy 1.11.0. I ran it in IPython, but just checked that it runs in python directly as well. Maybe some OSX end-of-line stuff? (ain't got no idea about OSX :P) Commented May 28, 2016 at 12:42

1 Answer

To answer the first question:

I strongly recommend using pandas to read in csv files:

In [11]: pd.read_csv(path, header=None)
Out[11]:
                         0          1             2  3           4          5         6    7  8
0  05/27/2016 09:45:37.816  187666432  7.921471e+06  0   95.202655  82.717061  1.462666  158  5
1  05/27/2016 09:45:38.819  206884864  1.069219e+07  0  101.330180  93.535551  2.464958  158  5

It "sniffs out" whether the fields are quoted or unquoted, though this can be made explicit.
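
As a rough sketch of making the quoting explicit: the snippet below reads the two sample rows from an in-memory string rather than from sample.csv, skips the timestamp column, and converts the rest to a float array. The names sample, df and values are mine, not from the question, and I'm assuming a pandas version recent enough to have DataFrame.to_numpy.

```python
import io

import pandas as pd

# The two sample rows from the question, held in memory for illustration.
sample = (
    '"05/27/2016 09:45:37.816","187666432","7921470.8554087048","0",'
    '"95.202655176457412","82.717061054954783","1.4626657999999999","158","5"\n'
    '"05/27/2016 09:45:38.819","206884864","10692185.668858336","0",'
    '"101.33018029563618","93.535551042125718","2.4649584999999998","158","5"\n'
)

# quotechar='"' states the quoting explicitly; usecols skips column 0 (the timestamp).
df = pd.read_csv(
    io.StringIO(sample),
    header=None,
    quotechar='"',
    usecols=list(range(1, 9)),
)

# Convert the remaining numeric columns to a plain float ndarray.
values = df.to_numpy(dtype=float)
print(values.shape)
print(values[0])
```

With a real file you would pass the path instead of the StringIO object; the other arguments stay the same.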


To answer the second question:

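The \x00s are padding: the array has dtype S20, so every element is stored as a fixed-width 20-byte string, and shorter values are filled up to 20 bytes with null bytes. nditer hands you views onto that raw storage, so tostring() returns all 20 bytes. If you iterate over flatten rather than nditer, you get scalars with the trailing \x00s already stripped: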

In [21]: a
Out[21]:
array([[b'"187666432"', b'"7921470.8554087048"', b'"0"',
        b'"95.202655176457412"', b'"82.717061054954783"',
        b'"1.4626657999999999"', b'"158"', b'"5"'],
       [b'"206884864"', b'"10692185.668858336"', b'"0"',
        b'"101.33018029563618"', b'"93.535551042125718"',
        b'"2.4649584999999998"', b'"158"', b'"5"']],
      dtype='|S20')

In [22]: [i.tostring() for i in np.nditer(a)]
Out[22]:
[b'"187666432"\x00\x00\x00\x00\x00\x00\x00\x00\x00',
 b'"7921470.8554087048"',
 b'"0"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
 b'"95.202655176457412"',
 b'"82.717061054954783"',
 b'"1.4626657999999999"',
 b'"158"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
 b'"5"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
 b'"206884864"\x00\x00\x00\x00\x00\x00\x00\x00\x00',
 b'"10692185.668858336"',
 b'"0"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
 b'"101.33018029563618"',
 b'"93.535551042125718"',
 b'"2.4649584999999998"',
 b'"158"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00',
 b'"5"\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00']

In [23]: [i.tostring() for i in a.flatten()]
Out[23]:
[b'"187666432"',
 b'"7921470.8554087048"',
 b'"0"',
 b'"95.202655176457412"',
 b'"82.717061054954783"',
 b'"1.4626657999999999"',
 b'"158"',
 b'"5"',
 b'"206884864"',
 b'"10692185.668858336"',
 b'"0"',
 b'"101.33018029563618"',
 b'"93.535551042125718"',
 b'"2.4649584999999998"',
 b'"158"',
 b'"5"']
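
If you would rather stay in numpy, here is a minimal sketch of the same idea on a small hand-built S20 array (the array a and the names padded, unpadded and floats are illustrative only). I use tobytes() here, which is the newer spelling of tostring(), and np.char.strip to drop the quotes for the whole array at once before converting to float.

```python
import numpy as np

# A small fixed-width byte-string array, like the one genfromtxt produced.
a = np.array([b'"187666432"', b'"7921470.8554087048"', b'"0"'], dtype='S20')

# nditer yields 0-d array views, so tobytes() returns the full 20-byte slot,
# including the null padding.
padded = [x.tobytes() for x in np.nditer(a)]

# Iterating over flatten() yields scalars with the trailing nulls stripped.
unpadded = [bytes(x) for x in a.flatten()]

# Strip the surrounding quotes element-wise, then convert the array to float.
floats = np.char.strip(a, b'"').astype(float)

print([len(p) for p in padded])
print(unpadded)
print(floats)
```

This avoids the per-element decode/replace chain from the question, since the padding never has to be handled by hand.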

1 Comment

Thanks for flatten! I hadn't thought about changing the way I was iterating over my array. I'll look into pandas; seems pretty handy.
