Reading Data from CSV and fill Empty Values Python

Question

I am reading in a CSV file with the general schema of

  ,abv,ibu,id,name,style,brewery_id,ounces
14,0.061,60.0,1979,Bitter Bitch,American Pale Ale (APA),177,12.0
0 , 0.05,, 1436, Pub Beer, American Pale Lager, 408, 12.0

I run into problems when a field is missing, as in row 0, which lacks an IBU value. I would like to substitute a default in such cases: 0.0 for fields that should be floats and an empty string for fields that should be strings.

My code is along the lines of

import csv
import numpy as np

def dataset(path, filter_field, filter_value):
  with open(path, 'r') as csvfile:
    reader = csv.DictReader(csvfile)
    if filter_field:
      for row in filter(lambda row: row[filter_field]==filter_value, reader):
        yield row

def main(path):
  data = [(row["ibu"], float(row["ibu"])) for row in dataset(path, "style", "American Pale Lager")]

As of right now my code throws an error, since there are empty values in the "ibu" column, as in row 0.

How should one go about solving this problem?

Cobry · Accepted Answer · 2017-01-23 22:14:00Z

You can add a dictionary of default values, use it to fill in missing fields, and override specific fields under certain conditions, such as when ibu is empty.

This is your implementation, changed to do what you need. If I were you, I would use pandas ...

import csv, copy

def dataset(path, filter_field, filter_value, default={'brewery_id':-1, 'style': 'unknown style', '   ': -1, 'name': 'unknown name', 'abv':0.0, 'id': -1, 'ounces':-1, 'ibu':0.0}):
    with open(path, 'r') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            # skip rows that do not match the filter (values may be padded with spaces)
            if row[filter_field].strip() != filter_value:
                continue
            # start from a copy of the defaults, then overwrite with the values read
            default_row = copy.copy(default)
            default_row.update(row)
            # an empty string overwrites the default, so restore it here;
            # you might want to add conditions for other fields
            if default_row["ibu"] == "":
                default_row["ibu"] = default["ibu"]
            yield default_row

data = [(row["ibu"], float(row["ibu"])) for row in dataset('test.csv', "style", "American Pale Lager")]

print(data)

>> [(0.0, 0.0)]
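
The same idea can also be written so that the type decides the default, which is closer to what the question asks for (0.0 for float fields, an empty string for text fields). This is only a sketch: the names `FLOAT_FIELDS` and `clean_row` are made up for illustration, and the choice of which columns count as floats is an assumption based on the sample rows.

```python
import csv
import io

# Assumption: these are the columns that should hold floats.
FLOAT_FIELDS = ("abv", "ibu", "ounces")

def clean_row(row, float_fields=FLOAT_FIELDS):
    """Strip padding; empty float fields become 0.0, empty text fields ''."""
    cleaned = {}
    for key, value in row.items():
        value = (value or "").strip()  # DictReader gives None for short rows
        if key in float_fields:
            cleaned[key] = float(value) if value else 0.0
        else:
            cleaned[key] = value
    return cleaned

# The two sample rows from the question, read from memory instead of a file.
sample = io.StringIO(
    "  ,abv,ibu,id,name,style,brewery_id,ounces\n"
    "14,0.061,60.0,1979,Bitter Bitch,American Pale Ale (APA),177,12.0\n"
    "0 , 0.05,, 1436, Pub Beer, American Pale Lager, 408, 12.0\n"
)

rows = [clean_row(r) for r in csv.DictReader(sample)]
lagers = [r for r in rows if r["style"] == "American Pale Lager"]
print([(r["name"], r["ibu"]) for r in lagers])
```

Because the values are stripped before filtering, the comparison against "American Pale Lager" works even though the file pads fields with spaces.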

Gene Burinsky · Accepted Answer · 2017-01-23 22:22:25Z

Why don't you use pandas? It reads empty fields as NaN:

import pandas as pd

df = pd.read_csv(data_file)

The following is the result:

In [13]: df
Out[13]:
   Unnamed: 0    abv   ibu    id          name                    style  \
0          14  0.061  60.0  1979  Bitter Bitch  American Pale Ale (APA)
1           0  0.050   NaN  1436      Pub Beer      American Pale Lager

   brewery_id  ounces
0         177    12.0
1         408    12.0
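
From there, the NaN values can be replaced with `fillna`. The following is a rough sketch rather than part of the original answer: it reads a cleaned copy of the two sample rows from a string (the first header is named `row` here purely for illustration, and the stray spaces are removed), and the per-column defaults are an assumption about which columns are numeric and which are text.

```python
import io
import pandas as pd

# Cleaned copy of the sample rows; "row" is a made-up name for the blank header.
csv_text = (
    "row,abv,ibu,id,name,style,brewery_id,ounces\n"
    "14,0.061,60.0,1979,Bitter Bitch,American Pale Ale (APA),177,12.0\n"
    "0,0.05,,1436,Pub Beer,American Pale Lager,408,12.0\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# Fill missing numeric values with 0.0 and missing text values with "".
filled = df.fillna({"ibu": 0.0, "name": "", "style": ""})

lagers = filled[filled["style"] == "American Pale Lager"]
ibu_values = lagers["ibu"].tolist()
print(ibu_values)
```

Passing a dict to `fillna` lets each column get its own default, so float columns stay floats and text columns stay text.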

hpaulj · Accepted Answer · 2017-01-23 23:07:18Z

Simulating your file with a text string:

In [48]: txt=b"""  ,abv,ibu,id,name,style,brewery_id,ounces
    ...: 14,0.061,60.0,1979,Bitter Bitch,American Pale Ale (APA),177,12.0
    ...: 0 , 0.05,, 1436, Pub Beer, American Pale Lager, 408, 12.0
    ...: """

I can load it with numpy genfromtxt.

In [49]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=None,skip_header=1,
    ...: filling_values=0)

In [50]: data
Out[50]: 
array([ (14,  0.061,  60., 1979, b'Bitter Bitch', b'American Pale Ale (APA)', 177,  12.),
       ( 0,  0.05 ,   0., 1436, b' Pub Beer', b' American Pale Lager', 408,  12.)], 
      dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8'), ('f3', '<i4'), ('f4', 'S12'), ('f5', 'S23'), ('f6', '<i4'), ('f7', '<f8')])

I had to skip the header line because it is incomplete (a blank for the 1st field). The result is a structured array - a mix of ints, floats and strings (bytestrings in Py3).

After correcting the header line, and using names=True, I get

array([ (14,  0.061,  60., 1979, b'Bitter Bitch', b'American Pale Ale (APA)', 177,  12.),
       ( 0,  0.05 ,   0., 1436, b' Pub Beer', b' American Pale Lager', 408,  12.)], 
      dtype=[('f0', '<i4'), ('abv', '<f8'), ('ibu', '<f8'), ('id', '<i4'), ('name', 'S12'), ('style', 'S23'), ('brewery_id', '<i4'), ('ounces', '<f8')])

genfromtxt is the most powerful CSV reader in numpy. See its docs for more parameters. The pandas reader is faster and more flexible, but of course it produces a data frame, not an array.
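
Once the fields have names, the structured array can be filtered and indexed by column. A small sketch under a few assumptions: the blank first header is replaced with the made-up name `row`, `encoding="utf-8"` is passed so that text comes back as `str` rather than bytes, and `autostrip=True` is used to drop the padding around values.

```python
import io
import numpy as np

# Sample rows from the question, with the blank first header named "row".
txt = (
    "row,abv,ibu,id,name,style,brewery_id,ounces\n"
    "14,0.061,60.0,1979,Bitter Bitch,American Pale Ale (APA),177,12.0\n"
    "0 , 0.05,, 1436, Pub Beer, American Pale Lager, 408, 12.0\n"
)

data = np.genfromtxt(
    io.StringIO(txt),
    delimiter=",",
    names=True,        # take field names from the header line
    dtype=None,        # infer a type per column
    encoding="utf-8",  # return text fields as str instead of bytes
    autostrip=True,    # strip whitespace around each value
    filling_values=0,  # use 0 where a value is missing
)

# Select rows by a text field, then read a numeric field from the result.
lagers = data[data["style"] == "American Pale Lager"]
print(lagers["ibu"])
```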
