Problem with grouping and counting in Python

Question

I have been working on a problem in Python where I have a matrix with 3 columns and more than a million rows. The first column is the origin country, the second the destination country, and the third the date. For example:

US AU 02/03/2020
US CN 03/04/2020
US MX 03/04/2020
AU US 02/03/2020
AU AU 02/03/2020
AU CN 03/04/2020
AU MX 03/04/2020
AU US 02/03/2020
US AU 02/03/2020
US CN 03/04/2020
US MX 03/04/2020
AU US 02/03/2020

I want to count all the flights between two countries on a given day, for example all flights from US to AU on 02/03/2020. I have done it with three for loops and some if statements, but it has been running for more than a week and still hasn't finished. Does anyone have a suggestion for handling this problem more efficiently?

Thanks

Welcome! Please read How to Ask. Whenever a question asks about code, the question should include that code. Otherwise, there is no way to know why the code fails. Minimal reproducible example is another good article to read.
– joseville, Commented Nov 11, 2021 at 20:48
Why would you need 3 for loops? sum(flight['from'] == 'US' and flight['to'] == 'AU' and flight['date'] == '02/03/2020' for flight in list_of_flights)
– Barmar, Commented Nov 11, 2021 at 20:48
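
For a single route and date, the one-liner in the comment above can be tried out as follows. This is only a minimal sketch: it assumes the rows have already been loaded into a list of dictionaries. The name `list_of_flights` comes from the comment, while the inline sample rows, the loading step, and the `count` variable are illustrative.

```python
# Sample rows from the question (illustrative; real data would come from a file).
rows = """US AU 02/03/2020
US CN 03/04/2020
US MX 03/04/2020
AU US 02/03/2020
AU AU 02/03/2020
AU CN 03/04/2020
AU MX 03/04/2020
AU US 02/03/2020
US AU 02/03/2020
US CN 03/04/2020
US MX 03/04/2020
AU US 02/03/2020
""".splitlines()

# Turn each "FROM TO DATE" line into a dictionary with the keys used in the comment.
list_of_flights = [
    dict(zip(("from", "to", "date"), row.split())) for row in rows
]

# Each chained comparison is True or False; sum() counts the True values.
count = sum(
    flight["from"] == "US" and flight["to"] == "AU" and flight["date"] == "02/03/2020"
    for flight in list_of_flights
)
print(count)  # 2 for the sample data
```

This scans the whole list once per query, so it suits a handful of lookups. Counting every route and date at once calls for a grouping approach instead.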

mozway · Accepted Answer · 2021-11-11 21:18:23Z

Use pandas. It is built on top of NumPy, so you will benefit from C speed.

Assuming this file as input:

file.csv

US AU 02/03/2020
US CN 03/04/2020
US MX 03/04/2020
AU US 02/03/2020
AU AU 02/03/2020
AU CN 03/04/2020
AU MX 03/04/2020
AU US 02/03/2020
US AU 02/03/2020
US CN 03/04/2020
US MX 03/04/2020
AU US 02/03/2020

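import pandas as pd

# Columns are separated by whitespace; the raw string avoids an invalid-escape warning
df = pd.read_csv('file.csv', sep=r'\s+', names=['from', 'to', 'date'])
# Dates are parsed month-first by default, which matches the output below
df['date'] = pd.to_datetime(df['date'])
# One row per (from, to, date) combination, with the number of flights in 'size'
df.groupby(['from', 'to', 'date'], as_index=False).size()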

output:

  from  to       date  size
0   AU  AU 2020-02-03     1
1   AU  CN 2020-03-04     1
2   AU  MX 2020-03-04     1
3   AU  US 2020-02-03     3
4   US  AU 2020-02-03     2
5   US  CN 2020-03-04     2
6   US  MX 2020-03-04     2
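
To answer the original question for one specific route and day, the grouped counts can be indexed directly. The sketch below is one possible way, not part of the answer as posted: it reads the sample from an inline string instead of `file.csv`, leaves the dates as plain strings, and uses the illustrative names `counts` and `us_to_au`.

```python
import io

import pandas as pd

# Inline copy of the sample data, used here instead of file.csv (assumption for the demo).
SAMPLE = """US AU 02/03/2020
US CN 03/04/2020
US MX 03/04/2020
AU US 02/03/2020
AU AU 02/03/2020
AU CN 03/04/2020
AU MX 03/04/2020
AU US 02/03/2020
US AU 02/03/2020
US CN 03/04/2020
US MX 03/04/2020
AU US 02/03/2020
"""

df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+", names=["from", "to", "date"])

# Without as_index=False, size() returns a Series indexed by (from, to, date).
counts = df.groupby(["from", "to", "date"]).size()

# A full (from, to, date) tuple selects a single count.
us_to_au = counts.loc[("US", "AU", "02/03/2020")]
print(us_to_au)  # 2 for the sample data
```

Keeping the three grouping columns in the index means each lookup is a single `.loc` call rather than a boolean filter over the whole table.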

timing on 3.4 million rows

NB. The test sample was generated by concatenating the example dataset 200k times.

320 ms ± 5.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Salindaw · Accepted Answer · 2021-11-11 21:38:44Z

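Another script using pandas:

# In a notebook, install pandas first with: !pip install pandas
import pandas as pd

# read_fwf infers the fixed-width column boundaries from the data
flights_df = pd.read_fwf("flights.txt", names=["FROM", "TO", "DATE"])
# Count the rows in each (FROM, TO, DATE) group
flights_df.groupby(["FROM", "TO", "DATE"])["DATE"].count()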

wovano · Accepted Answer · 2021-11-11 22:06:22Z

You could do this easily without any external packages (such as pandas), by using a dictionary to store the count for every line (i.e. every unique combination of source, destination, date). The standard defaultdict is very convenient in this case:

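import collections

# Missing keys start at 0, so every line can be incremented without checking first
flights = collections.defaultdict(int)
with open('file.csv', 'rt') as file:
    for line in file:
        # The whole stripped line ("FROM TO DATE") is used as the dictionary key
        flights[line.strip()] += 1
print(flights['AU US 02/03/2020'])  # prints 3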

This code takes roughly one second to run on a file with 1 million randomly generated "flights".
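
A variation on the same idea, offered as a sketch and not as the code from this answer: `collections.Counter` keyed by an `(origin, destination, date)` tuple. It assumes whitespace-separated columns and reads the sample from an inline string; `SAMPLE` and `us_on_date` are illustrative names.

```python
from collections import Counter

# Inline copy of the sample data (a real run would iterate over the file instead).
SAMPLE = """US AU 02/03/2020
US CN 03/04/2020
US MX 03/04/2020
AU US 02/03/2020
AU AU 02/03/2020
AU CN 03/04/2020
AU MX 03/04/2020
AU US 02/03/2020
US AU 02/03/2020
US CN 03/04/2020
US MX 03/04/2020
AU US 02/03/2020
"""

# Key each flight by an (origin, destination, date) tuple instead of the raw line,
# so differences in spacing between columns do not create separate keys.
flights = Counter(
    tuple(line.split()) for line in SAMPLE.splitlines() if line.strip()
)

print(flights[("AU", "US", "02/03/2020")])  # 3 for the sample data

# Tuple keys also allow partial queries, e.g. all flights leaving the US on one date.
us_on_date = sum(
    n for (origin, _dest, date), n in flights.items()
    if origin == "US" and date == "03/04/2020"
)
print(us_on_date)  # 4 for the sample data
```

The counting is still a single pass over the input. The tuple key only changes how the results are addressed afterwards.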
