  • I am new to pandas/Python. I have used Excel and Stata pretty extensively.
  • I get a .csv file with multiple tables in it from a supplier that will not change their format.
  • Each table has a header row, and there is a blank row between tables.
  • The number of rows in each table can vary.
  • The number of tables also seems to vary (I just discovered!).
  • There are 23 possible tables that can come in the file.
  • I have managed to create one big data frame from the file.
  • I can't seem to group by the first column (index=0).

Here is the code I have so far:

%matplotlib inline
import csv
from pandas import Series, DataFrame
import pandas as pd
# if len(row) == 0, new_table_coming_up = 1; if len(row) > 0, if new_table_coming_up == 0
import numpy as np
import matplotlib.pyplot as plt
import io
df = pd.read_csv(r'C:\Users\file.csv',names=range(25))
table_names = ["WAREHOUSE","SUPPLIER","PRODUCT","BRAND","INVENTORY","CUSTOMER","CONTACT","CHAIN","ROUTE","INVOICE","INVOICETRANS","SURVEY","FORECAST","PURCHASE","PURCHASETRANS","PRICINGMARKET","PRICINGMARKETCUSTOMER","PRICINGLINE","PRICINGLINEPRODUCT","EMPLOYEE"]
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,1]: g.iloc[0] for k,g in df.groupby(groups)}

Here is a sample of the .csv file with the first 3 tables:

Record Identifier   Sender ID   Receiver ID Action  Warehouse ID    Warehouse Name  System Close Date   DBA Address Address 2   City    State   Postal Code Phone   Fax Primary Contact Email   FEIN    DUNS    GLN             
WAREHOUSE   COX SUPPLIERX   Change  1   Richmond    20160127    Company 700 Court       Anywhere    CA  99999   5555555555  5555555555  na  na  0   50682020                    

Record Identifier   Sender ID   Receiver ID Sender Supplier ID  Supplier Name   Supplier Family                                                                     
SUPPLIER    COX SUPPLIERX   16  SUPPLIERX   SUPPLIERX                                                                       

Record Identifier   Sender ID   Receiver ID Supplier Product Number Sender Product ID   Product Name    Sender Brand ID Active  Cases Per Pallet    Cases Per Layer Case GTIN   Carrier GTIN    Unit GTIN   Package Name    Case Weight Case Height Case Width  Case Length Case Ounces Case Equivalents    Retail Units Per Case   Consumable Units Per Case   Selling Unit Of Measure Container Material
PRODUCT COX SUPPLIERX       53030   LAG DOGTOWN PALE ALE 4/6/12OZ NR    217 Active  70  10  7.2383E+11  7.2383E+11  7.2383E+11  4/6/12oz NR 31.9    9.5 10.75   15.5    288 1   4   24  Case    Aluminum
PRODUCT COX SUPPLIERX       53071   LAG DOGTOWN PALE ALE 1/2 KEG    217 Active  8   8       0       KEG-1/2 BBL 160.6   23.5    15.75   15.75   1984    6.888889    1   1   Each    Aluminum
PRODUCT COX SUPPLIERX   2100008003  53122   LAG CAPPUCCINO STOUT 12/22OZ NR 221 Active  75  15  7.2383E+11  7.2383E+11  7.2383E+11  12/22oz NR  33.6    9.5 10.75   14.2083 264 0.916667    12  12  Case    Aluminum
PRODUCT COX SUPPLIERX       53130   LAG SUCKS ALE 4/6/12OZ NR   1473    Active  70  10  7.23831E+11 7.2383E+11  7.2383E+11  4/6/12oz NR 31.9    9.5 10.75   15.5    288 1   4   24  Case    Aluminum
PRODUCT COX SUPPLIERX       53132   LAG SUCKS ALE 12/32oz NR    1473    Active  50  10  7.23831E+11 7.2383E+11  7.2383E+11  12/32oz NR  38.2    9.5 10.75   20.6667 384 1.333333    12  12  Case    Aluminum
PRODUCT COX SUPPLIERX       53170   LAG SUCKS ALE 1/4 KEG   1473    Inactive    1   1       0   1.11111E+11 KEG-1/4 BBL 87.2    11.75   17  17  992 3.444444    1   1   Each    Aluminum
PRODUCT COX SUPPLIERX       53171   LAG FARMHOUSE SAISON 1/2 KEG    1478    Inactive    16  1       0       KEG-1/2 BBL 160.6   23.5    15.75   15.75   1984    6.888889    1   1   Each    Aluminum
PRODUCT COX SUPPLIERX       53172   LAG SUCKS ALE 1/2 KEG   1473    Active  80  4       0       KEG-1/2 BBL 160.6   23.5    15.75   15.75   1984    6.888889    1   1   Each    Aluminum
PRODUCT COX SUPPLIERX       53255   LAG FARMHOUSE HOP STOOPID ALE 12/22 222 Active  75  15  7.23831E+11 7.2383E+11  7.2383E+11  12/22oz NR  33.6    9.5 10.75   14.2083 264 0.916667    12  12  Case    Aluminum
PRODUCT COX SUPPLIERX       53271   LAG FARMHOUSE HOP STOOPID 1/2 KEG   222 Active  8   8       0       KEG-1/2 BBL 160.6   23.5    15.75   15.75   1984    6.888889    1   1   Each    Aluminum
PRODUCT COX SUPPLIERX       53330   LAG CENSORED ALE 4/6/12OZ NR    218 Active  70  10  7.23831E+11 7.2383E+11  7.2383E+11  4/6/12oz NR 31.9    9.5 10.75   15.5    288 1   4   24  Case    Aluminum
PRODUCT COX SUPPLIERX       53331   LAG CENSORED ALE 2/12/12 OZ NR  218 Inactive    60  1   7.2383E+11  7.2383E+11  7.2383E+11  2/12/12oz NR    31.9    9.5 10.75   15.5    288 1   2   24  Case    Aluminum
PRODUCT COX SUPPLIERX       53333   LAG CENSORED ALE 24/12 OZ NR    218 Inactive    70  1           7.2383E+11  24/12oz NR  31.9    9.5 10.75   15.5    288 1   1   24  Case    Aluminum

2 Answers


The first thing you need is simply to load your data cleanly. I'm going to assume your input file is tab-separated, even though your code doesn't specify that. This code works for me:

from io import StringIO  # Python 3; on Python 2 this was cStringIO.StringIO
import pandas as pd

subfiles = [StringIO()]

with open('t.txt') as bigfile:
    for line in bigfile:
        if line.strip() == "":  # blank line, new subfile
            subfiles.append(StringIO())
        else:  # continuation of same subfile
            subfiles[-1].write(line)

for subfile in subfiles:
    if not subfile.getvalue():  # nothing written (e.g. consecutive blank lines)
        continue
    subfile.seek(0)  # rewind so read_csv starts at the beginning
    table = pd.read_csv(subfile, sep='\t')
    print('*****************')
    print(table)

Basically, I break the original file into subfiles wherever a blank line appears. Once that's done, reading each chunk with pandas is straightforward, so long as you specify the correct sep character.
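If you want to keep the chunks for later use rather than print them, one option is to store each table in a dict. The sketch below does that, keyed by the value in the first column of each table's first data row (the Record Identifier, e.g. PRODUCT). The `split_tables` helper name and the inline sample data are made up for illustration. The sample uses tabs, so pass `sep=','` if your file is comma-separated.

```python
from io import StringIO

import pandas as pd


def split_tables(text, sep="\t"):
    """Split text holding blank-line-separated tables into a dict of DataFrames.

    Hypothetical helper: keys come from the first column of each table's
    first data row (e.g. 'WAREHOUSE').
    """
    chunks, current = [], []
    for line in text.splitlines():
        # Treat a line as blank if nothing is left once separators are removed,
        # so a row made only of separators (e.g. ",,,,") also ends a table.
        if not line.replace(sep, "").strip():
            if current:
                chunks.append(current)
                current = []
        else:
            current.append(line)
    if current:  # the last table may not be followed by a blank line
        chunks.append(current)

    tables = {}
    for chunk in chunks:
        table = pd.read_csv(StringIO("\n".join(chunk)), sep=sep)
        if table.empty:  # header with no data rows: nothing to key on
            continue
        tables[table.iloc[0, 0]] = table
    return tables


# Made-up sample in the same shape as the supplier file
sample = (
    "Record Identifier\tSender ID\tWarehouse ID\n"
    "WAREHOUSE\tCOX\t1\n"
    "\n"
    "Record Identifier\tSender ID\tSender Product ID\n"
    "PRODUCT\tCOX\t53030\n"
    "PRODUCT\tCOX\t53071\n"
)
tables = split_tables(sample)
```

After this, `tables["PRODUCT"]` would be an ordinary DataFrame with its own header, and a table missing from a given file is simply absent from the dict, which should cope with the number of tables varying.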


1 Comment

Thanks @John Zwinck. I used sep=','. Tweaking a little, I had to use 'import io' and set the index: 'index=('Record Identifier') subfile.format('Record Identifier')'. Not quite there yet.

This worked; then I used slicing to create the tables:

df = pd.read_csv('filelocation.csv', delim_whitespace=True, names=range(25))
table_names = ["WAREHOUSE", "SUPPLIER", "PRODUCT"]
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0,1]: g.iloc[0] for k,g in df.groupby(groups)}
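A variant of the same cumsum grouping idea, offered only as a sketch: instead of matching against a list of table names, it marks table boundaries by the blank rows themselves. This is a different technique from the snippet above. It assumes the file is read with `skip_blank_lines=False` so that blank lines survive as all-NaN rows, that `names=range(n)` is at least as wide as the widest table, and that each header row has no gaps. The sample data and variable names are made up.

```python
from io import StringIO

import pandas as pd

# Made-up, tab-separated sample in the same shape as the supplier file
sample = (
    "Record Identifier\tSender ID\tWarehouse ID\n"
    "WAREHOUSE\tCOX\t1\n"
    "\n"
    "Record Identifier\tSender ID\tSender Product ID\n"
    "PRODUCT\tCOX\t53030\n"
    "PRODUCT\tCOX\t53071\n"
)

# Read everything as strings into one wide frame; blank lines become NaN rows
raw = pd.read_csv(
    StringIO(sample),
    sep="\t",
    header=None,
    names=range(4),          # assumed upper bound on the column count
    dtype=str,
    skip_blank_lines=False,
)

blank = raw.isna().all(axis=1)        # True on the separator rows
group_ids = blank.cumsum()[~blank]    # running count of blank rows so far

tables = {}
for _, block in raw[~blank].groupby(group_ids):
    header = block.iloc[0].dropna()               # first row of a block is its header
    body = block.iloc[1:, : len(header)].copy()   # keep only the columns the header names
    if body.empty:                                # header with no data rows
        continue
    body.columns = header.tolist()
    tables[body.iloc[0, 0]] = body.reset_index(drop=True)
```

Because everything is read with `dtype=str`, numeric columns would still need converting afterwards (for example with `pd.to_numeric`) before doing arithmetic on them.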
