
I'm developing a small website in Flask that relies on data from a CSV file to populate a table on the frontend using jQuery.

The user would select an ID from a drop-down on the front-end, then a function would run on the back-end where the ID would be used as a filter on the table to return data. The returned data would usually just be a single column from the dataframe.

The usual approach, from my understanding, would be to load the CSV data into a SQLite DB on startup and query it with SQL from Python at runtime.

However, in my case, the table is 15 MB in size (214K rows) and will never grow past that point. All the data will stay as-is for the duration of the app's lifecycle.

As such, would it be easier and less hassle to just load the dataframe into memory and filter a copy of it when requests come in? Is that scalable, or am I just kicking the can down the road?

Example:

import os

import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)


dir_path = os.path.abspath(os.path.dirname(__file__))


with app.app_context():
    print("Writing DB on startup")

    # Load the CSV into memory once at startup
    query_df = pd.read_csv(os.path.join(dir_path, 'query_file.csv'))


@app.route('/getData', methods=["POST"])
def get_data():

    id = request.get_json()

    print("Getting rows....")

    # Filter by ID, de-duplicate the names, and sort them
    data_list = sorted(set(query_df[query_df['ID'] == id]['Name'].tolist()))

    return jsonify({'items': data_list, 'ID': id})

This may be a tad naive on my end, but I could not find a straight answer for my particular use case.

  • Do you mean that the returned data would be a single column or a single row from the dataframe? It feels like you mean row but say column. Commented Jul 30, 2024 at 11:29
  • Have you tried the "usual approach" of using a sqlite3 database with indices defined on the appropriate columns? To share the connection among multiple threads you would need to lock the connection while the query runs. For example: select distinct Name from my_table where ID = 'some_id_value' order by Name (see the sketch after these comments). I suspect this would prove to be faster than processing a dataframe. Commented Jul 31, 2024 at 13:24
  • You can create an in-memory SQLite database. This will give you SQL or ORM-like functionality, and it will be easier to switch to a larger DB later if needed. Commented Jul 31, 2024 at 15:25
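
For illustration, here is a minimal sketch of the approach suggested in these comments, assuming the CSV layout from the question; the table name my_table, the index name, and the helper names_for_id are assumptions, not part of the original posts:

import sqlite3
import threading

import pandas as pd

# Load the CSV once and copy it into an in-memory SQLite database on startup.
query_df = pd.read_csv('query_file.csv')
conn = sqlite3.connect(':memory:', check_same_thread=False)
query_df.to_sql('my_table', conn, index=False)

# An index on ID avoids scanning all 214K rows on every request.
conn.execute('CREATE INDEX idx_my_table_id ON my_table (ID)')

# The connection is shared across Flask's worker threads, so serialize access.
db_lock = threading.Lock()

def names_for_id(id_value):
    with db_lock:
        rows = conn.execute(
            "SELECT DISTINCT Name FROM my_table WHERE ID = ? ORDER BY Name",
            (id_value,)
        ).fetchall()
    return [name for (name,) in rows]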

3 Answers


This line of code can be made much faster without adding any new dependencies, just by using the tools that Pandas gives you.

data_list = sorted(set(query_df[query_df['ID'] == id]['Name'].tolist()))

The following optimizations can be made:

  • sorted() can be replaced by pre-sorting the dataframe.
  • set() can be replaced by dropping duplicates with the same ID and Name.
  • query_df[query_df['ID'] == id] requires searching the entire dataframe for matching ID values, and can be replaced with an index.

To prepare the dataframe, on the startup of your program, after reading the dataframe with read_csv(), you would do the following:

name_lookup_series = query_df \
    .sort_values(['ID', 'Name']) \
    .drop_duplicates(['ID', 'Name']) \
    .set_index('ID')['Name']

To look up any particular value, you would do the following:

name_lookup_series.loc[[id_to_look_up]].tolist()

Benchmarking this shows it is roughly 100x faster, using the following benchmark program (note that %timeit is an IPython magic, so run it in IPython or Jupyter):

import pandas as pd
import numpy as np
np.random.seed(92034)

N = 200000

df = pd.DataFrame({
    'ID': np.random.randint(0, N, size=N),
    'Name': np.random.randint(0, N, size=N),
})
df['ID'] = 'ID' + df['ID'].astype('str')
df['Name'] = 'N' + df['Name'].astype('str')

print("Test dataframe")
print(df)
id_to_look_up = np.random.choice(df['ID'])
print("Looking up", id_to_look_up)
print("Result, method 1", sorted(set(df[df['ID'] == id_to_look_up]['Name'].tolist())))

%timeit sorted(set(df[df['ID'] == id_to_look_up]['Name'].tolist()))
name_lookup_series = df.copy() \
    .sort_values(['ID', 'Name']) \
    .drop_duplicates(['ID', 'Name']) \
    .set_index('ID')['Name']
print("Result, method 2", name_lookup_series.loc[[id_to_look_up]].tolist())
%timeit name_lookup_series.loc[[id_to_look_up]].tolist()
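
As a usage note, wiring this into the question's route could look roughly like the following (a sketch; the handling of IDs that are missing from the index is an addition here, since .loc raises a KeyError for unknown labels):

@app.route('/getData', methods=["POST"])
def get_data():
    id_value = request.get_json()
    try:
        # .loc with a list keeps the result a Series even when only one row matches
        data_list = name_lookup_series.loc[[id_value]].tolist()
    except KeyError:
        # ID not present in the pre-built index
        data_list = []
    return jsonify({'items': data_list, 'ID': id_value})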

1 Comment

Awarding this as the answer, as it speeds up lookups on what was already a pretty small memory footprint.

Your use case may be specific enough to warrant not using SQLite or Pandas dataframes.

If all you ever need to do is find a sorted set of unique names for a matching ID, it would likely (of course you'd need to measure things) be faster still to bake the CSV into a machine-generated Python file à la

data = {}
data["John"] = ["Foo", "Bar", "Baz"]
data["Madden"] = ["Moonbase", "Boing", "Blank"]

so your lookup becomes data_list = data[id] and no filtering, de-duplicating or sorting work is done on each request.
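
For example, a small generator script along those lines might look like this (a sketch, assuming the ID and Name columns from the question; the file name generated_data.py is made up):

import pandas as pd

query_df = pd.read_csv('query_file.csv')

# Pre-compute the sorted, de-duplicated name list for every ID.
lookup = {
    id_value: sorted(set(group['Name']))
    for id_value, group in query_df.groupby('ID')
}

# Bake the result into a Python module that the Flask app can import.
with open('generated_data.py', 'w') as f:
    f.write('data = ' + repr(lookup) + '\n')

The Flask app would then just do from generated_data import data and serve data.get(id, []) per request.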



For your use case, where the CSV data is relatively small (15 MB, 214K rows) and won't grow beyond this size, loading the entire dataset into memory and filtering the DataFrame directly is a feasible and efficient approach. Given that the data volume is modest and won't change, it also simplifies the application architecture and reduces overhead. Here's something you can try:

from flask import Flask, request, jsonify
import pandas as pd

app = Flask(__name__)

# Load the CSV file into a DataFrame once at startup
df = pd.read_csv('data.csv')

@app.route('/get-data', methods=['GET'])
def get_data():
    # Get the ID from the query parameters
    id = request.args.get('id')

    # Filter the DataFrame based on the ID
    filtered_data = df[df['ID'] == id]['Value'].tolist()

    return jsonify(filtered_data)

if __name__ == '__main__':
    app.run(debug=True)

For your current dataset size, the in-memory DataFrame approach is likely sufficient, though it may become less efficient as the data grows.

