
So, I have a csv file containing data like this:

id       type      sum_cost         date_time
--------------------------------------------------
a1        pound     500        2019-04-21T10:50:06    
b1        euro      100        2019-04-21T10:40:00    
c1        pound     650        2019-04-21T11:00:00    
d1        usd       410        2019-04-21T00:30:00     

What I want to do is insert this data into a database table whose schema is not the same as the CSV. The table's columns look like this:

_id , start_time, end_time, pound_cost, euro_cost, count

where I insert from the CSV into this table such that id = id, start_time is date_time - 1 hour, and end_time is date_time - 30 minutes. For pound_cost and euro_cost: if type is pound, insert the value from its sum_cost into pound_cost and put 0 in euro_cost, and the same way for euro. Also put 1 in the count column.

So, the result of the table will be like this:

_id   start_time           end_time              pound_cost  euro_cost  count
-----------------------------------------------------------------------------
 a1  2019-04-21T09:50:06  2019-04-21T10:20:06      500           0        1
 b1  2019-04-21T09:40:00  2019-04-21T10:10:00       0           100       1
 c1  2019-04-21T10:00:00  2019-04-21T10:30:00      650           0        1
 d1  2019-04-20T23:30:00  2019-04-21T00:00:00       0           410       1

So, how should I insert the data into the table while transforming the values from the CSV this way? This is my first time using PostgreSQL and I have not used SQL much, so I wonder if there is a function that can do this. If not, how can I use Python to transform the data and insert it into the table?

Thank you.

  • You could try Pandas.DataFrame.to_sql(). There's an append argument that you can pass which might be useful. More documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html Commented Apr 21, 2019 at 11:24
  • It would be better if you load that CSV into a temporary table, then write that SELECT query and copy its result into the required table; drop the temporary table after use. Commented Apr 21, 2019 at 11:30
  • @Jamiewp: After you've copied to the temp table, simply do an INSERT INTO yourmaintable SELECT col1_expression, col2_expression, ... FROM temp_table. Here col1_expression, col2_expression are the transformations from the columns of the temp table into the main table. For example: date_time - interval '1 hour' AS start_time Commented Apr 21, 2019 at 11:54
  • @Jamiewp using CASE WHEN ... Commented Apr 21, 2019 at 11:56
  • @nikhilsugandh Your initial answer is good. I posted some of the processing stuff, but unless there are a lot of files to process, creating a temp table and then appending that to the main table is the best idea. Commented Apr 21, 2019 at 12:15

2 Answers


As discussed in the comments, you can easily accomplish this using the COPY command and a temporary table to hold the data from your file.

Create a temporary table with the structure of your CSV. Note that all columns are of text datatypes; this makes the copying faster because validations are minimised.

CREATE TEMP TABLE temptable
      ( id        TEXT,
        type      TEXT,
        sum_cost  TEXT,
        date_time TEXT );

Use COPY to load from the file into this table. If you are loading the file from the server, use COPY; if it's on a client machine, use psql's \COPY. Change the delimiter appropriately if needed.

\COPY temptable from '/somepath/mydata.csv'  with delimiter ',' CSV HEADER;

Now, simply run an INSERT INTO .. SELECT using expressions for various transformations.

INSERT INTO maintable
        ( _id, start_time, end_time, pound_cost, euro_cost, count )
SELECT id,
       date_time::timestamp - INTERVAL '1 HOUR',
       date_time::timestamp - INTERVAL '30 MINUTES',
       CASE type WHEN 'pound' THEN sum_cost::numeric
            ELSE 0 END,
       CASE type WHEN 'euro' THEN sum_cost::numeric  -- you have not specified what
                                                     -- happens to USD; use as required
            ELSE 0 END,
       1 AS count  -- hardcoded based on your info; not sure what it actually means
FROM temptable t;

Now, the data is in your main table:

select * from maintable;

 _id |     start_time      |      end_time       | pound_cost | euro_cost | count
-----+---------------------+---------------------+------------+-----------+-------
 a1  | 2019-04-21 09:50:06 | 2019-04-21 10:20:06 |        500 |         0 |     1
 b1  | 2019-04-21 09:40:00 | 2019-04-21 10:10:00 |          0 |       100 |     1
 c1  | 2019-04-21 10:00:00 | 2019-04-21 10:30:00 |        650 |         0 |     1
 d1  | 2019-04-20 23:30:00 | 2019-04-21 00:00:00 |          0 |         0 |     1

2 Comments

This seems to be the solution to my question. But in addition, if I want my date_time to be clean, e.g. 10:20:06 to 10:30:00 (set seconds to 0 and round minutes up to 15-minute intervals: 00:01-15:00 to minute 15, 15:01-30:00 to minute 30, and so on), can this be done inside the SELECT statement?
@Jamiewp: Sure, you can do it. But since it's a specific problem and not directly related to this question, you should consider asking a new one separately with the specific details of the requirement. I hope this answers your original question. If you found the answer useful, you may accept it so that it helps others.
0

Here's how you might be able to reshape the data to your specification:

import os
import pandas as pd
import datetime as dt

csv_dir = r'C:\..\..'   # renamed so it doesn't shadow the built-in dir()
csv_name = 'my_raw_data.csv'
full_path = os.path.join(csv_dir, csv_name)
data = pd.read_csv(full_path)

def process_df(dataframe=data):
    df1 = dataframe.copy(deep=True)
    df1['date_time'] = pd.to_datetime(df1['date_time'])
    df1['count'] = 1

    ### Maybe get unique types to list for future needs
    _types = df1['type'].unique().tolist()

    ### Process time-series shifts
    df1['start_time'] = df1['date_time'] - dt.timedelta(hours=1)
    df1['end_time'] = df1['date_time'] - dt.timedelta(minutes=30)

    ### Create conditional masks for the dataframe
    pound_type = df1['type'] == 'pound'
    euro_type = df1['type'] == 'euro'

    ### Subsection the dataframe by currency; concatenate results
    df = pd.concat([df1[pound_type], df1[euro_type]]).reset_index(drop=True)

    ### Add conditional columns: sum_cost goes into the matching
    ### currency column, 0 into the other
    df['pound_cost'] = df['sum_cost'].where(df['type'] == 'pound', 0)
    df['euro_cost'] = df['sum_cost'].where(df['type'] == 'euro', 0)

    ### Manually input desired field arrangement
    fin_cols = [
        'id',
        'start_time',
        'end_time',
        'pound_cost',
        'euro_cost',
        'count',
        ]
    ### Return formatted dataframe
    return df.reindex(columns=fin_cols).copy(deep=True)

data1 = process_df()

Output:

   id          start_time            end_time  pound_cost  euro_cost  count
0  a1 2019-04-21 09:50:06 2019-04-21 10:20:06         500          0      1
1  c1 2019-04-21 10:00:00 2019-04-21 10:30:00         650          0      1
2  b1 2019-04-21 09:40:00 2019-04-21 10:10:00           0        100      1

To load into the main SQL table, you'd have to open a connection with SQLAlchemy or pyodbc. Then, assuming all data types match, you should be able to use pandas.DataFrame.to_sql() with if_exists='append' to add the data.
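For example, a minimal sketch of that last step, using an in-memory SQLite connection as a stand-in for your real database (with SQLAlchemy or pyodbc you would pass an engine/connection instead; the frame and the table name maintable here are assumptions matching the question):

```python
import sqlite3

import pandas as pd

# A small frame shaped like the processed output above
data1 = pd.DataFrame({
    'id': ['a1', 'b1'],
    'start_time': pd.to_datetime(['2019-04-21 09:50:06', '2019-04-21 09:40:00']),
    'end_time': pd.to_datetime(['2019-04-21 10:20:06', '2019-04-21 10:10:00']),
    'pound_cost': [500, 0],
    'euro_cost': [0, 100],
    'count': [1, 1],
})

# In-memory SQLite stands in for the real database connection
conn = sqlite3.connect(':memory:')

# if_exists='append' adds rows to an existing table (creating it if absent)
data1.to_sql('maintable', conn, if_exists='append', index=False)

# Verify the rows landed
rows = conn.execute('SELECT id, pound_cost, euro_cost FROM maintable').fetchall()
print(rows)  # -> [('a1', 500, 0), ('b1', 0, 100)]
```

Calling to_sql again with the next batch appends further rows, so the same code works for repeated CSV loads.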
