0

I have been working on a problem in Python where I have a matrix with 3 columns and more than a million rows. The first column is the origin country, the second the destination country, and the third the date. For example:

US AU 02/03/2020
US CN 03/04/2020
US MX 03/04/2020
AU US 02/03/2020
AU AU 02/03/2020
AU CN 03/04/2020
AU MX 03/04/2020
AU US 02/03/2020
US AU 02/03/2020
US CN 03/04/2020
US MX 03/04/2020
AU US 02/03/2020

I want to count all the flights between two countries on a given day, for example all flights from US to AU on 02/03/2020. I have done it with three nested for loops and some if statements, but it has been running for more than a week and still hasn't finished. Does anyone have a suggestion for handling this problem more efficiently?

Thanks

  • 4
    Can you show us your code? Commented Nov 11, 2021 at 20:46
  • 1
    Use pandas and its built-in filtering and counting methods. Commented Nov 11, 2021 at 20:46
  • 1
    Welcome! Please read How to Ask. Whenever a question asks about code, the question should include that code. Otherwise, there is no way to know why the code fails. minimal reproducible example is another good article to read. Commented Nov 11, 2021 at 20:48
  • 2
    Why would you need 3 for's? sum(flight['from'] == 'US' and flight['to'] == 'AU' and flight['date'] == '02/03/2020' for flight in list_of_flights) Commented Nov 11, 2021 at 20:48

3 Answers 3

1

Use pandas. It is built on top of NumPy, so you will benefit from C speed.

Assuming this file as input:

file.csv

US AU 02/03/2020
US CN 03/04/2020
US MX 03/04/2020
AU US 02/03/2020
AU AU 02/03/2020
AU CN 03/04/2020
AU MX 03/04/2020
AU US 02/03/2020
US AU 02/03/2020
US CN 03/04/2020
US MX 03/04/2020
AU US 02/03/2020
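
import pandas as pd

# Columns are separated by whitespace and the file has no header row
df = pd.read_csv('file.csv', sep=r'\s+', names=['from', 'to', 'date'])
df['date'] = pd.to_datetime(df['date'])  # parsed month-first by default
# Count the rows for each (from, to, date) combination
df.groupby(['from', 'to', 'date'], as_index=False).size()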

output:

  from  to       date  size
0   AU  AU 2020-02-03     1
1   AU  CN 2020-03-04     1
2   AU  MX 2020-03-04     1
3   AU  US 2020-02-03     3
4   US  AU 2020-02-03     2
5   US  CN 2020-03-04     2
6   US  MX 2020-03-04     2

Timing on 3.4 million rows

NB. The test sample was generated by concatenating the example dataset 200k times.

320 ms ± 5.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
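
If only one route on one day is needed, as in the question, a boolean mask can count the matching rows directly without building the full grouped table. This is a minimal sketch, not part of the answer above: the helper names `load_flights` and `count_route` are made up for illustration, the sample rows are copied from the question, and dates are kept as plain strings so they compare exactly as written in the file.

```python
import io

import pandas as pd

# Sample rows copied from the question; a real run would pass a file path instead.
SAMPLE = """\
US AU 02/03/2020
US CN 03/04/2020
US MX 03/04/2020
AU US 02/03/2020
AU AU 02/03/2020
AU CN 03/04/2020
AU MX 03/04/2020
AU US 02/03/2020
US AU 02/03/2020
US CN 03/04/2020
US MX 03/04/2020
AU US 02/03/2020
"""


def load_flights(source):
    """Read whitespace-separated origin/destination/date rows (hypothetical helper)."""
    return pd.read_csv(source, sep=r"\s+", names=["from", "to", "date"], dtype=str)


def count_route(df, origin, destination, date):
    """Count rows matching one origin, destination and date string (hypothetical helper)."""
    mask = (df["from"] == origin) & (df["to"] == destination) & (df["date"] == date)
    # True counts as 1, so summing the mask gives the number of matching rows
    return int(mask.sum())


df = load_flights(io.StringIO(SAMPLE))
print(count_route(df, "US", "AU", "02/03/2020"))
```

The mask is evaluated column-wise over the whole frame, so each query is a single vectorized pass. For many different queries, the grouped table is likely the better choice because it is computed once.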

0

Another script using Pandas

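# Install first if needed: pip install pandas
import pandas as pd

# read_fwf infers the fixed-width column boundaries from the file
flights_df = pd.read_fwf("flights.txt", names=["FROM", "TO", "DATE"])
# Count the rows for each (FROM, TO, DATE) combination
flights_df.groupby(["FROM", "TO", "DATE"])["DATE"].count()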


0

You could do this easily without any external packages (such as pandas), by using a dictionary to store the count for every line (i.e. every unique combination of source, destination, date). The standard defaultdict is very convenient in this case:

import collections

flights = collections.defaultdict(int)
with open('file.csv', 'rt') as file:
    for line in file:
        flights[line.strip()] += 1
print(flights['AU US 02/03/2020'])  # prints 3

This code takes roughly one second to run on a file with 1 million randomly generated "flights".
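
A variation on the same idea, sketched here under the assumption that every valid line has exactly three whitespace-separated fields: keying the counts by an (origin, destination, date) tuple instead of the raw line makes it possible to aggregate over one field afterwards. The function name `count_flights` is made up for illustration, and `collections.Counter` stands in for the `defaultdict(int)` used above.

```python
import collections


def count_flights(lines):
    """Count occurrences of each (origin, destination, date) triple."""
    counts = collections.Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 3:  # skip blank or malformed lines
            continue
        counts[tuple(fields)] += 1
    return counts


# Sample rows copied from the question; an open file object works the same way.
sample = [
    "US AU 02/03/2020", "US CN 03/04/2020", "US MX 03/04/2020",
    "AU US 02/03/2020", "AU AU 02/03/2020", "AU CN 03/04/2020",
    "AU MX 03/04/2020", "AU US 02/03/2020", "US AU 02/03/2020",
    "US CN 03/04/2020", "US MX 03/04/2020", "AU US 02/03/2020",
]

counts = count_flights(sample)

# Exact route on one day
print(counts[("AU", "US", "02/03/2020")])

# Every flight leaving AU on that day, whatever the destination
leaving_au = sum(
    n for (src, dst, day), n in counts.items()
    if src == "AU" and day == "02/03/2020"
)
print(leaving_au)
```

Like the version above, this reads the data once, so the work grows linearly with the number of rows rather than with the number of country pairs and dates.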
