
I have a dataset like the one below (rows 0 to 5 shown). Every row has a different number of elements, such as 2 or 3.

whole milk,margarine
yogurt,brown bread,coffee
pork,yogurt,coffee
bottled water,bottled beer
whole milk
salty snack

I have tried

gro_arr = np.genfromtxt("groceries.csv", dtype=str, delimiter=",")

but it shows the following error:

Some errors were detected !
    Line #2 (got 3 columns instead of 4)
    Line #3 (got 1 columns instead of 4)
    Line #6 (got 5 columns instead of 4)
    Line #7 (got 1 columns instead of 4)

How can I solve it?

  • You can't do this using .genfromtxt, and numpy isn't really meant for non-numeric types, so it has no real use here even if you tweaked your data into a proper 2D structure. What are you trying to achieve here? Commented Mar 27, 2020 at 2:06
  • Is there another way to store this dataset? I want to calculate the number of elements in every row and find the max, min, and avg. For example, the first row has 2, the 2nd row has 3, and the 5th row has 1. Commented Mar 27, 2020 at 5:33
  • Do you mean you're trying to work with the frequency of word occurrences? Commented Mar 27, 2020 at 5:45
  • @JonClements yes Commented Mar 27, 2020 at 6:10
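
If the goal is the frequency of each item, as the comments above suggest, a minimal sketch using only the standard library could look like the following. It assumes comma-separated rows like those in the question. The helper name item_frequencies is hypothetical, and an in-memory sample stands in for groceries.csv so the snippet runs on its own.

```python
import csv
import io
from collections import Counter

def item_frequencies(lines):
    """Count how often each item occurs across all rows (hypothetical helper)."""
    counts = Counter()
    for row in csv.reader(lines):
        # Strip stray whitespace and ignore empty fields.
        counts.update(item.strip() for item in row if item.strip())
    return counts

# Sample rows from the question; for the real file, pass
# open("groceries.csv", newline="") instead of the StringIO object.
sample = io.StringIO(
    "whole milk,margarine\n"
    "yogurt,brown bread,coffee\n"
    "pork,yogurt,coffee\n"
    "bottled water,bottled beer\n"
    "whole milk\n"
    "salty snack\n"
)
freq = item_frequencies(sample)
print(freq["whole milk"])   # number of rows that mention whole milk
print(freq.most_common(3))  # the three most frequent items
```

Because csv.reader yields one list per row, ragged rows are not a problem here; no rectangular array is needed for counting.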

2 Answers


Solution

You can read in the lines in two different ways:

  1. Split each line on the separator, giving a list with as many items as that line contains.
  2. Additionally pad the split lines with empty strings and store them in a 2D numpy array.

I have made a class with the necessary methods to do this in a few lines, so you can apply it as-is with hardly any changes.
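
The two options above can also be sketched without a class. This is only a minimal illustration under the assumption that the text is already in memory, not the class from this answer; the names split_rows and pad_rows are hypothetical.

```python
def split_rows(text, sep=","):
    """Option 1: split every non-blank line into a list of items."""
    return [
        [item.strip() for item in line.split(sep)]
        for line in text.splitlines()
        if line.strip()
    ]

def pad_rows(rows, fill=""):
    """Option 2: pad ragged rows with `fill` so every row has the same length."""
    width = max((len(row) for row in rows), default=0)
    return [row + [fill] * (width - len(row)) for row in rows]

text = "whole milk,margarine\nyogurt,brown bread,coffee\nwhole milk\n"
rows = split_rows(text)   # ragged list of lists
padded = pad_rows(rows)   # rectangular list of lists
print(rows)
print(padded)
# The rectangular `padded` list can be passed to np.array(...) to get a
# 2D string array.
```

Keeping the split and the padding separate means the ragged list stays available for per-row work, while the padded form is only built when a rectangular array is actually needed.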

Example

s = """whole milk,margarine
yogurt,brown bread,coffee
pork,yogurt,coffee
bottled water,bottled beer
whole milk
salty snack
"""

tc = TextToColumns(file = 'source.txt', 
                   sep = ',', 
                   text2columns = True, 
                   savedata = False, 
                   usedummydata = True, 
                   dummydata = s)
tc.make_dummydata()
tc.toarray()

Output:

array([['whole milk', 'margarine', ''],
       ['yogurt', 'brown bread', 'coffee'],
       ['pork', 'yogurt', 'coffee'],
       ['bottled water', 'bottled beer', ''],
       ['whole milk', '', ''],
       ['salty snack', '', '']], dtype='<U13')

Code

import numpy as np
import os

class TextToColumns(object):
    """Reads in text data and converts text into 
    columns using user specified separator.

    Parameters
    ----------
    file: path to the file
    sep: (str) separator. Default: ","
    text2columns: (bool) if True, adds empty strings as a padding to create a 
                         2D array. Default: True
    savedata: (bool) if True, saves the data read-in after splitting with the 
                     separator, as a part of the object. Default: False
    usedummydata: (bool) if True, uses dummy data to write to a file.
                         Default: False
    dummydata: (str) the string to use as dummy data.
                         Default: ''

    Example:
    # test-data
    s = '''whole milk,margarine
    yogurt,brown bread,coffee
    pork,yogurt,coffee
    bottled water,bottled beer
    whole milk
    salty snack
    '''
    # Text-to-column transformation 
    tc = TextToColumns(file = 'source.txt', 
                       sep = ',', 
                       text2columns = True, 
                       savedata = False, 
                       usedummydata = True, 
                       dummydata = s)
    tc.make_dummydata()
    tc.toarray() 
    # Uncomment next line to clear any dummy data created
    # tc.clear_dummydata() 
    """
    def __init__(self, file, 
                 sep: str = ',', 
                 text2columns: bool = True, 
                 savedata: bool = False, 
                 usedummydata: bool = False, 
                 dummydata: str=''):
        self.file = file # 'source.txt'
        self.sep = sep
        self.text2columns = text2columns
        self.savedata = savedata
        self.usedummydata = usedummydata
        self.dummydata = dummydata

    def __repr__(self):
        return "TextToColumns object"

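    def make_dummydata(self, dummydata=''):
        """Save a string as a file to use as dummy data.

        Uses the `dummydata` argument if given, otherwise the string
        stored on the object, otherwise a built-in sample.
        """
        # Built line by line so the sample carries no indentation.
        s = ("whole milk,margarine\n"
             "yogurt,brown bread,coffee\n"
             "pork,yogurt,coffee\n"
             "bottled water,bottled beer\n"
             "whole milk\n"
             "salty snack\n")
        if (self.dummydata == ''):
            self.dummydata = s
        if (dummydata == ''):
            dummydata = self.dummydata

        # Write the file whether the data came from the argument or the object.
        with open(self.file, 'w') as f:
            f.write(dummydata)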

    def clear_dummydata(self):
        if os.path.isfile(self.file):
            os.remove(self.file)            

    def readlines(self):
        return self.toarray()        

    def read_file(self):
        if os.path.isfile(self.file):
            with open(self.file, 'r') as f:
                lines = f.readlines()
            return lines
        else:
            raise ValueError('Invalid file path.')

    def split_lines(self, lines=None):
        data = []
        self._max_length = 0
        if lines is None:
            lines = self.read_file()
        for line in lines:
            # Skip blank lines; ''.split(',') would otherwise yield [''].
            if not line.strip():
                continue
            # Use the separator stored on the object, not a bare `sep`.
            linedata = [e.strip() for e in line.split(self.sep)]
            length = len(linedata)
            if (length > self._max_length): 
                self._max_length = length
            data.append(linedata)
        if self.savedata:
            self.data = data
        return data

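    def toarray(self, data=None):
        if data is None:
            data = self.split_lines()
        if self.text2columns:
            # Compute the widest row here so that data passed in directly
            # (without calling split_lines first) is padded correctly.
            max_length = max((len(line) for line in data), default=0)
            padded_data = [line + ['']*(max_length - len(line)) for line in data]
            if self.savedata:
                self.padded_data = padded_data
            return np.array(padded_data)
        else:
            return data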

1 Comment

@4daJKong Please consider voting up.

You could read the CSV file using open, then read each line and split it on the comma delimiter.

allLines = []

with open("groceries.csv", 'r') as f:
    for line in f:
        line = line.rstrip("\n")  # drop the trailing newline, if there is one
        if not line:
            continue  # skip blank lines

        allLines.append(line.split(","))

for row in allLines:
    print(row)

Outputs:

['whole milk', 'margarine']
['yogurt', 'brown bread', 'coffee']
['pork', 'yogurt', 'coffee']
['bottled water', 'bottled beer']
['whole milk']
['salty snack']

Hope this helps.

1 Comment

Is there another way to store this dataset? I want to calculate the number of elements in every row and find the max, min, and avg.
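
For the per-row counts asked about in the comment above, a list of lists is already enough. The following is a minimal sketch, assuming the rows have been split on commas as in this answer; the variable names are illustrative only.

```python
# Rows as produced by splitting each line on commas.
all_lines = [
    ["whole milk", "margarine"],
    ["yogurt", "brown bread", "coffee"],
    ["pork", "yogurt", "coffee"],
    ["bottled water", "bottled beer"],
    ["whole milk"],
    ["salty snack"],
]

# Number of elements in every row.
lengths = [len(row) for row in all_lines]
longest = max(lengths)
shortest = min(lengths)
average = sum(lengths) / len(lengths)

print(lengths)                     # [2, 3, 3, 2, 1, 1]
print(longest, shortest, average)  # 3 1 2.0
```

Note that max, min, and the division all fail on an empty list, so an empty file would need a separate check.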
