I am working with Python on Spark and reading my dataset from a .csv file whose first few rows are:

17  0.2  7
17  0.2  7
39  1.3  7
19   1   7
19   0   7

When I read from the file line by line with the code below:

# Load and parse the data
def parsePoint(line):
   values = [float(x) for x in line.replace(',', ' ').split(' ')]
   return LabeledPoint(values[0], values[1:])

I get this error:

Traceback (most recent call last):
  File "<stdin>", line 3, in parsePoint
ValueError: could not convert string to float: "17"

Any help is greatly appreciated.

  • You should be using .split('|'), not .split(' ') Commented Mar 20, 2016 at 11:58
  • I put those '|' characters to clarify the borders of the cells while posting my question. They do not exist in the actual file. Commented Mar 20, 2016 at 12:46
  • Ah, well the whitespace should do that. Just put the text into your question exactly how it is in the file. Commented Mar 20, 2016 at 12:49
  • I will keep that in mind next time. Thanks. Commented Mar 20, 2016 at 12:57
  • Could you edit the question then? Commented Mar 20, 2016 at 13:01

1 Answer

Following the comments below this answer, you should use:

[float(x.strip(' "')) for x in line.split(',')]

You do not need to replace ',' with ' '. Simply split on ',' and then strip leading and trailing whitespace and double quotes from each field (x.strip(' "')) before converting it to float.
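As a minimal sketch of the expression above: the function name parse_values is only illustrative, and the sample line is copied from the file excerpt posted in the comments. It shows the quoted field failing on its own and then being parsed once the quotes are stripped.

```python
def parse_values(line):
    """Split a line on commas, strip spaces and double quotes, convert to float."""
    return [float(x.strip(' "')) for x in line.split(',')]


line = '"35","1.3","7"'  # sample line from the file excerpt in the comments

# Without stripping, the first field is the 4-character string '"35"',
# which float() rejects.
try:
    float(line.split(',')[0])
    raw_field_failed = False
except ValueError:
    raw_field_failed = True

values = parse_values(line)
label, features = values[0], values[1:]
print(raw_field_failed, label, features)
```

The label and features can then be passed to LabeledPoint(label, features) exactly as in the question.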

Also, have a look at the csv module in the standard library, which may simplify your work.
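For example, here is a sketch that uses csv.reader on a single line, assuming every line is an ordinary comma-separated record whose fields may be wrapped in double quotes. The name parse_line is hypothetical. csv.reader accepts any iterable of strings, so one line can be wrapped in a list.

```python
import csv


def parse_line(line):
    """Parse one CSV line into a list of floats, letting csv handle the quotes."""
    # csv.reader yields one list of fields per input line; we feed it one line.
    row = next(csv.reader([line]))
    return [float(field) for field in row]


quoted = parse_line('"35","1.3","7"')
unquoted = parse_line('19,0,7')
print(quoted, unquoted)
```

The result can be used like the values list in the question, i.e. values[0] as the label and values[1:] as the features.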


Below is the answer to the original question (before comments).

You need to use .split() instead of .split(' '). You have multiple consecutive space characters in your line, so splitting on ' ' results in empty strings, e.g. your first line is split into:

['17', '', '0.2', '', '7']

The problem is those empty strings, which cannot be converted to float.

Using split() will solve the problem thanks to the behaviour of split when its sep argument is None (or not present):

If the optional second argument sep is absent or None, the words are separated by arbitrary strings of whitespace characters (space, tab, newline, return, formfeed).

See the documentation for split, and a small example that shows the difference:

>>> sp5 = ' ' * 5
>>> sp5.split()
[]
>>> sp5.split(' ')
['', '', '', '', '', '']

11 Comments

I tried exactly as you said. Now, the same error except that it complains about another value which is 35. Traceback (most recent call last): File "<stdin>", line 3, in parsePoint ValueError: could not convert string to float: "35"
@EmreBulut Could you show the line where this 35 is?
The file has more than 1 million rows. I do not know which 35 it is complaining about.. I opened the file with a text editor this time: "35","1.3","7" "29","1","7" "24","1.2","7" "24","1.1","7" "19","0","7" "36","0","7" "19","1.2","7" "24","1.3","7"
@EmreBulut One way to find the line is to wrap the conversion in a try/except block that prints the offending line and exits, i.e. try [float(x) for x in line.replace(',', ' ').split(' ')] and, on ValueError, print(line) and exit(1).
@EmreBulut But it looks like your values are wrapped between " characters?
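
A runnable sketch of the try/except idea from the comments, for locating the lines that fail to parse. The names find_bad_lines and limit are hypothetical, and the sample list just mixes lines from the question with one from the file excerpt above. Printing with repr makes stray quote characters visible.

```python
def find_bad_lines(lines, limit=5):
    """Return up to `limit` (line_number, line) pairs that fail float conversion."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            [float(x) for x in line.split()]
        except ValueError:
            bad.append((number, line))
            if len(bad) >= limit:
                break
    return bad


sample = ['17  0.2  7', '"35","1.3","7"', '19   1   7']
bad_lines = find_bad_lines(sample)
for number, line in bad_lines:
    print(number, repr(line))
```

On a file with over a million rows, stopping after a few failures is usually enough to see what the malformed lines have in common.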