
So, I have a csv file containing data like this:

id       type      sum_cost         date_time
--------------------------------------------------
a1        pound     500        2019-04-21T10:50:06    
b1        euro      100        2019-04-21T10:40:00    
c1        pound     650        2019-04-21T11:00:00    
d1        usd       410        2019-04-21T00:30:00     

What I want to do is insert this data into a database table whose schema is not the same as the CSV. The table's columns look like this:

_id , start_time, end_time, pound_cost, euro_cost, count

where I insert from the CSV into this table such that id = id, start_time is date_time - 1 hour, and end_time is date_time - 30 minutes. For pound_cost and euro_cost: if type is pound, insert the value from its sum_cost into pound_cost and put 0 in euro_cost, and the same way for euro. Also put 1 in the count column.

So, the result of the table will be like this:

_id   start_time           end_time              pound_cost  euro_cost  count
-----------------------------------------------------------------------------
 a1  2019-04-21T09:50:06  2019-04-21T10:20:06      500           0        1
 b1  2019-04-21T09:40:00  2019-04-21T10:10:00       0           100       1
 c1  2019-04-21T10:00:00  2019-04-21T10:30:00      650           0        1
 d1  2019-04-20T23:30:00  2019-04-21T00:00:00       0           410       1

So, how should I insert the data into the table while transforming the values from the CSV this way? This is my first time using PostgreSQL and I have not used SQL much, so I wonder if there is a function that can do this. If not, how can I use Python to transform the data and insert it into the table?

Thank you.

  • You could try Pandas.DataFrame.to_sql(). There's an append argument that you can pass which might be useful. More documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html Commented Apr 21, 2019 at 11:24
  • It would be better if you load that CSV into a temporary table, then write that SELECT query and copy its result into the required table; drop the temporary table after use. Commented Apr 21, 2019 at 11:30
  • @Jamiewp: After you've copied to the temp table, simply do an INSERT INTO yourmaintable SELECT col1_expression, col2_expression, ... FROM temp_table. Here col1_expression, col2_expression are the transformations from the columns of the temp table into the main table. For example: date_time - interval '1 hour' AS start_time Commented Apr 21, 2019 at 11:54
  • @Jamiewp using CASE WHEN ... Commented Apr 21, 2019 at 11:56
  • @nikhilsugandh Your initial answer is good. I posted some of the processing stuff, but unless there are a lot of files to process, creating a temp table and then appending that to the main table is the best idea. Commented Apr 21, 2019 at 12:15

2 Answers


As discussed in the comments, you can easily accomplish this using the COPY command and a temporary table to hold the data from your file.

Create a temporary table with the structure of your CSV. Note that all columns are of text datatypes; this makes the copying faster because validations are minimised.

CREATE TEMP TABLE temptable
      ( id        TEXT,
        type      TEXT,
        sum_cost  TEXT,
        date_time TEXT );

Use COPY to load from the file into this table. If you are loading the file from the server, use COPY; if it's on a client machine, use psql's \COPY. Change the delimiter appropriately if needed.

\COPY temptable from '/somepath/mydata.csv'  with delimiter ',' CSV HEADER;

Now, simply run an INSERT INTO .. SELECT using expressions for various transformations.

INSERT INTO maintable
        ( _id, start_time, end_time, pound_cost, euro_cost, count )
SELECT id,
       date_time::timestamp - INTERVAL '1 HOUR',
       date_time::timestamp - INTERVAL '30 MINUTES',
       CASE type WHEN 'pound' THEN sum_cost::numeric
            ELSE 0 END,
       CASE type WHEN 'euro' THEN sum_cost::numeric  -- you have not specified what
                                                     -- happens to USD; use as required
            ELSE 0 END,
       1 AS count  -- hardcoded based on your info; not sure what it actually means
FROM temptable t;

Now, the data is in your main table:

select * from maintable;

 _id |     start_time      |      end_time       | pound_cost | euro_cost | count
-----+---------------------+---------------------+------------+-----------+-------
 a1  | 2019-04-21 09:50:06 | 2019-04-21 10:20:06 |        500 |         0 |     1
 b1  | 2019-04-21 09:40:00 | 2019-04-21 10:10:00 |          0 |       100 |     1
 c1  | 2019-04-21 10:00:00 | 2019-04-21 10:30:00 |        650 |         0 |     1
 d1  | 2019-04-20 23:30:00 | 2019-04-21 00:00:00 |          0 |         0 |     1

2 Comments

This seems to be the solution to my question. But in addition, if I want my date_time to be clean, e.g. 10:20:06 to 10:30:00 (set seconds to 0 and round minutes up to 15-minute intervals: 00:01-15:00 to minute 15, 15:01-30:00 to minute 30, and so on), can this be done inside the SELECT statement?
@Jamiewp: Sure, you can do it. But since it's a specific problem and not directly related to this question, you should consider asking a new one separately with the specific details of the requirement. I hope this answers your original question. If you found the answer useful, you may accept it so that it helps others.
0

Here's how you might be able to reshape the data to your specification:

import os
import pandas as pd
import datetime as dt

csv_dir = r'C:\..\..'   # renamed so it doesn't shadow the built-in dir()
csv_name = 'my_raw_data.csv'
full_path = os.path.join(csv_dir, csv_name)
data = pd.read_csv(full_path)

def process_df(dataframe=data):
    df1 = dataframe.copy(deep=True)
    df1['date_time'] = pd.to_datetime(df1['date_time'])
    df1['count'] = 1

    ### Maybe get unique types to list for future needs
    _types = df1['type'].unique().tolist()

    ### Process time-series shifts
    df1['start_time'] = df1['date_time'] - dt.timedelta(hours=1)
    df1['end_time'] = df1['date_time'] - dt.timedelta(minutes=30)

    ### Create conditional masks for the dataframe
    pound_type = df1['type'] == 'pound'
    euro_type = df1['type'] == 'euro'

    ### Subsection the dataframe by currency; concatenate results
    df = pd.concat([df1[pound_type], df1[euro_type]]).reset_index(drop=True)

    ### Add conditional columns: sum_cost goes into the matching
    ### currency column, 0 into the other
    df['pound_cost'] = df['sum_cost'].where(df['type'] == 'pound', 0)
    df['euro_cost'] = df['sum_cost'].where(df['type'] == 'euro', 0)

    ### Manually input desired field arrangement
    fin_cols = [
        'id',
        'start_time',
        'end_time',
        'pound_cost',
        'euro_cost',
        'count',
        ]
    ### Return formatted dataframe
    return df.reindex(columns=fin_cols).copy(deep=True)

data1 = process_df()

Output:

   id          start_time            end_time  pound_cost  euro_cost  count
0  a1 2019-04-21 09:50:06 2019-04-21 10:20:06         500          0      1
1  c1 2019-04-21 10:00:00 2019-04-21 10:30:00         650          0      1
2  b1 2019-04-21 09:40:00 2019-04-21 10:10:00           0        100      1

To load into the main SQL table, you'd have to open a connection with SQLAlchemy or pyodbc. Then, assuming all data types match, you should be able to use pandas.DataFrame.to_sql() with if_exists='append' to add the data.
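For example, a minimal sketch of that last step, using an in-memory SQLite connection as a stand-in for your real database (with SQLAlchemy or pyodbc you would pass an engine/connection instead; the frame and the table name maintable here are assumptions matching the question):

```python
import sqlite3

import pandas as pd

# A small frame shaped like the processed output above
data1 = pd.DataFrame({
    'id': ['a1', 'b1'],
    'start_time': pd.to_datetime(['2019-04-21 09:50:06', '2019-04-21 09:40:00']),
    'end_time': pd.to_datetime(['2019-04-21 10:20:06', '2019-04-21 10:10:00']),
    'pound_cost': [500, 0],
    'euro_cost': [0, 100],
    'count': [1, 1],
})

# In-memory SQLite stands in for the real database connection
conn = sqlite3.connect(':memory:')

# if_exists='append' adds rows to an existing table (creating it if absent)
data1.to_sql('maintable', conn, if_exists='append', index=False)

# Verify the rows landed
rows = conn.execute('SELECT id, pound_cost, euro_cost FROM maintable').fetchall()
print(rows)  # -> [('a1', 500, 0), ('b1', 0, 100)]
```

Calling to_sql again with the next batch appends further rows, so the same code works for repeated CSV loads.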
