
I've been querying a few APIs with Python, creating individual CSVs to build a table.

Instead of recreating the table each time, I would like to update the existing table with any new API data.

At the moment, the way the query works, I have a table that looks like this:

[screenshot of the table]

From this I am taking the suburbs of each state and copying them into a CSV for each state.

Then, using this script, I clean them into a list (the API needs %20 in place of any spaces):

suburbs = ["want this", "want this (meh)", "this as well (nope)"]

suburb_cleaned = []

dont_want = frozenset(["(meh)", "(nope)"])

for urb in suburbs:
    cleaned_name = []
    name_parts = urb.split()

    for part in name_parts:
        if part in dont_want:
            continue
        cleaned_name.append(part)

    suburb_cleaned.append('%20'.join(cleaned_name))
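For what it's worth, the stripping and encoding can be collapsed into one step with a regex and `urllib.parse.quote` (a sketch, assuming every parenthesised part is unwanted, not just the two listed tokens):

```python
import re
from urllib.parse import quote

suburbs = ["want this", "want this (meh)", "this as well (nope)"]

# drop any "(...)" part, then let quote() encode the remaining spaces as %20
suburb_cleaned = [
    quote(re.sub(r"\s*\([^)]*\)", "", s).strip())
    for s in suburbs
]

print(suburb_cleaned)  # ['want%20this', 'want%20this', 'this%20as%20well']
```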

Then I take the suburbs for each state and put them into this API call to return a CSV:

import time

import pandas as pd
import requests

timestr = time.strftime("%Y%m%d-%H%M%S")
Name = "price_data_NT" + timestr + ".csv"

url_price = "http://mwap.com/api"
string = 'gxg&state='

api_results = {}

n = 0
y = 2
for urbs in suburb_cleaned:
    url = url_price + urbs + string + "NT"
    print(url)
    print(urbs)
    request = requests.get(url)

    api_results[urbs] = pd.DataFrame(request.json())
    n = n + 1
    if n == y:
        # every second suburb, write the accumulated results out to CSV
        dfs = pd.concat(api_results).reset_index(level=1, drop=True).rename_axis(
            'key').reset_index().set_index(['key'])
        dfs.to_csv(Name, sep='\t', encoding='utf-8')
        y = y + 2
        continue

    print("made it through " + urbs)

dfs = pd.concat(api_results).reset_index(level=1, drop=True).rename_axis(
    'key').reset_index().set_index(['key'])
dfs.to_csv(Name, sep='\t', encoding='utf-8')

Then I add the states manually in Excel, and combine and clean the suburb names:

# use pd.concat to combine the per-state frames
df = pd.concat([act, vic, nsw, SA, QLD, WA]).reset_index().set_index(['key']) \
    .rename_axis('suburb').reset_index().set_index(['state'])
# apply a lambda to clean up the %20
f = lambda s: s.replace('%20', ' ')
df['suburb'] = df['suburb'].apply(f)
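As a side note, that lambda can be replaced with pandas' vectorised string method; this sketch shows the equivalent call on a throwaway frame:

```python
import pandas as pd

df = pd.DataFrame({"suburb": ["want%20this", "this%20as%20well"]})

# regex=False makes this a plain substring replacement
df["suburb"] = df["suburb"].str.replace("%20", " ", regex=False)

print(df["suburb"].tolist())  # ['want this', 'this as well']
```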

And then finally I insert it into a database:

from sqlalchemy import create_engine

engine = create_engine('mysql://username:password@localhost/dbname')
with engine.connect() as conn, conn.begin():
    df.to_sql('Price_historic', conn, if_exists='replace', index=False)

Leading to this sort of output:

[screenshot of the resulting table]

Now, this is a heck of a process. I would love to simplify it, so the database only updates the values that are needed from the API, without this much complexity in getting the data.

I would love some helpful tips on achieving this goal. I'm thinking I could do an UPDATE on the MySQL database instead of an INSERT? And with the querying of the API, I feel like I'm overcomplicating it.

Thanks!

2 Answers


I don't see any reason why you would be creating CSV files in this process. It sounds like you can just query the data and then load it into a MySQL table directly. You say that you are adding the states manually in Excel? Is that data not available through your prior API calls? If not, could you find that information and save it to a CSV, so you can automate that step by loading it into a table and having Python look up the values for you?

Generally, you wouldn't want to overwrite the MySQL table every time. When you have a table, you can identify the column or columns that uniquely identify a specific record, then create a UNIQUE INDEX on them. For example, if your street and price values designate a unique entry, then in MySQL you could run:

ALTER TABLE `Price_historic` ADD UNIQUE INDEX(street, price);

After this, your table will not allow duplicate records based on those values. Then, instead of creating a new table every time, you can insert your data into the existing table, with instructions to either update or ignore when you encounter a duplicate. For example:

final_str = "INSERT INTO Price_historic (state, suburb, property_price_id, type, street, price, date) " \
            "VALUES (%s, %s, %s, %s, %s, %s, %s) " \
            "ON DUPLICATE KEY UPDATE " \
            "state = VALUES(state), date = VALUES(date)"

con = pdb.connect(db_host, db_user, db_pass, db_name)
with con:
    cur = con.cursor()
    cur.executemany(final_str, insert_list)
    con.commit()
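For completeness, `insert_list` for `executemany` is just a sequence of row tuples in the same column order as the INSERT statement; it can be built straight from the DataFrame (a sketch with made-up values, column names taken from the statement above):

```python
import pandas as pd

# hypothetical frame with the columns the INSERT statement expects
df = pd.DataFrame([{
    "state": "NT", "suburb": "want this", "property_price_id": 1,
    "type": "house", "street": "1 Main St", "price": 500000,
    "date": "2020-01-01",
}])

cols = ["state", "suburb", "property_price_id", "type", "street", "price", "date"]

# itertuples(name=None) yields plain tuples, in the given column order
insert_list = list(df[cols].itertuples(index=False, name=None))

print(insert_list)
```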

2 Comments

I would try to carry it further -- do all the work in SQL, no Python.
Appreciate this. I'll add the states to a table, and query the database for them :)

If the setup you are building is something longer-term, I would suggest running 2 different processes in parallel:

Process 1: Query API 1, obtain the required data, and insert it into the DB table, with a binary/bit flag specifying that only API 1 has been called.

Process 2: Run a query on the DB to obtain all records that still need API call 2, based on the binary/bit flag set in process 1. For the corresponding data, run call 2 and update the data back into the DB table based on the primary key.

Database: I would suggest adding a primary key as well as a [bit flag][1] that gives the status of the different API calls. The bit flag also helps you double-check whether a specific API call has been made for a specific record, and lets you expand the project to additional API calls while still tracking the status of each call at the record level.

[1]: https://docs.oracle.com/cd/B28359_01/server.111/b28286/functions014.htm#SQLRF00612
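The two-process flow above could be sketched as three statements (the table, column, and flag names here are illustrative, not from the question):

```python
# Process 1: insert rows from API 1 with the api2_done flag cleared
insert_sql = (
    "INSERT INTO Price_historic (suburb, price, api2_done) "
    "VALUES (%s, %s, 0) "
    "ON DUPLICATE KEY UPDATE price = VALUES(price)"
)

# Process 2: find rows still waiting on the second API call...
select_sql = "SELECT id, suburb FROM Price_historic WHERE api2_done = 0"

# ...and flip the flag once API 2 has been applied to that record
update_sql = "UPDATE Price_historic SET api2_done = 1 WHERE id = %s"

print(select_sql)
```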
