
I have two data frames: one with all my data (called 'data') and one with the latitudes and longitudes of the stations where each observation starts and ends (called 'info'). I am trying to get a data frame where I'll have the latitude and longitude next to each station in each observation. My code in Python:

for i in range(len(data)):        # about 15.5 million observations
    for j in range(len(info)):    # 542 stations
        if data.year[i] == '2018' and data.station[i] == info.station[j]:
            data.latitude[i] = info.latitude[j]
            data.longitude[i] = info.longitude[j]
            break

but since I have about 15 million observations, doing it this way takes a lot of time. Is there a quicker way of doing it?

Thank you very much (I am still new to this)

Edit:

My file info looks like this (about 500 observations, one per station):

[screenshot of the info data frame]

My file data looks like this (there are other variables not shown here; about 15 million observations, one per trip):

[screenshot of the data data frame]

What I am looking to get is that, when the station numbers match, the resulting data would look like this:

[screenshot of the desired output]

3 Comments

  • Could you post a few entries (as they appear in your memory) from the "data" and "info" dataframes, and use those to give us an example of what you want your output to look like? As written, this question is a bit vague. Commented May 21, 2018 at 23:07
  • Nor do I find myself in many places where I use Python or these kinds of dataframe structures, so any solution given by me might not be as informative as one would like. But how does the data in this structure actually look? Do you have an example? And is there any specific reason why you need to go through the entire frame every iteration? Could the frame be sorted by 'year' and searched with an O(log n) search rather than O(n)? Some more information about the specific case would help any potential helpers give you a better answer. Best regards Commented May 21, 2018 at 23:25
  • So I have tried it on a small scale, and my code actually doesn't work; I assumed that it did. Basically, what I am looking to get is one column with latitude, then one with longitude, then one with the station number, because I want to map the frequency of the observations on a map, and I assume that this is the easiest way to do it Commented May 21, 2018 at 23:54

2 Answers


This is one solution. You can also use pandas.merge to add 2 new columns to data and perform the equivalent mapping.

# create series mappings from info
s_lat = info.set_index('station')['latitude']
s_lon = info.set_index('station')['longitude']

# calculate Boolean mask on year
mask = data['year'] == '2018'

# apply mappings, if no map found use fillna to retrieve original data
data.loc[mask, 'latitude'] = data.loc[mask, 'station'].map(s_lat)\
                                 .fillna(data.loc[mask, 'latitude'])

data.loc[mask, 'longitude'] = data.loc[mask, 'station'].map(s_lon)\
                                  .fillna(data.loc[mask, 'longitude'])

1 Comment

Thank you for your reply, I'll try it out to see if it works

This is a very common and important issue when anyone starts to deal with large datasets. Big Data is a whole subject in itself; here is a quick introduction to the main concepts.

1. Prepare your dataset

In big data, 80% to 90% of the time is spent gathering, filtering and preparing your datasets. Create subsets of your data, optimized for your further processing.
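For example (a minimal sketch, assuming the column names from the question), filtering down to the year of interest once, up front, means every later step touches far fewer rows:

```python
import pandas as pd

# toy frame standing in for the 15M-row `data`
data = pd.DataFrame({'year': ['2018', '2017', '2018'],
                     'station': [10, 11, 12]})

# build the subset once; all further processing works on this smaller frame
data_2018 = data[data['year'] == '2018'].copy()
```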

2. Optimize your script

Short code does not always mean optimized code in terms of performance. In your case, without knowing your dataset, it is hard to say exactly how you should process it; you will have to figure out on your own how to avoid as much computation as possible while getting exactly the same result. Try to avoid any unnecessary computation.

You can also consider splitting the work over multiple threads if appropriate.

As a general rule, you should not use for loops and break out of them from inside. Whenever you don't know in advance how many iterations you will have to go through, you should prefer while or do...while loops.
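In the question's case, the nested loop performs up to 15 million × 542 comparisons. Building a station-to-coordinates dictionary once reduces each observation to a single O(1) lookup (a plain-Python sketch with made-up station numbers and coordinates):

```python
# made-up stations: (station, latitude, longitude)
stations = [(1, 45.50, -73.50), (2, 45.60, -73.60)]

# one pass over the ~542 stations builds the lookup table
coords = {station: (lat, lon) for station, lat, lon in stations}

# each of the ~15M observations then needs only one dictionary lookup
observations = [('2018', 1), ('2017', 2), ('2018', 2)]
located = [(year, st) + coords[st] for year, st in observations]
```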

3. Consider using distributed storage and computing

This is a subject in itself that is way too big to be all explained here.

Storing, accessing and processing data in a serialized way is fast for small amounts of data but very inappropriate for large datasets. Instead, we use distributed storage and computing frameworks.

These aim at doing everything in parallel, relying on a concept named MapReduce.
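The MapReduce idea can be illustrated in plain Python with the classic word-count toy example; in a real framework, the map and reduce phases run in parallel across many machines:

```python
from collections import Counter
from functools import reduce

# data split into chunks, as a distributed file system would store it
chunks = [["apple", "pear"], ["apple", "apple", "pear"]]

# map phase: each chunk is counted independently (parallelizable)
mapped = [Counter(chunk) for chunk in chunks]

# reduce phase: partial counts are combined into one final result
totals = reduce(lambda a, b: a + b, mapped)
print(totals)  # Counter({'apple': 3, 'pear': 2})
```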

The first distributed data storage framework was Hadoop (e.g. the Hadoop Distributed File System, or HDFS). This framework has its advantages and flaws, depending on your application.

In any case, if you are willing to use this framework, it will probably be more appropriate not to use MapReduce directly on top of HDFS, but to use a higher-level, preferably in-memory, framework such as Spark or Apache Ignite on top of HDFS. Also, depending on your needs, have a look at frameworks such as Hive, Pig or Sqoop, for example.

Again, this subject is a whole different world but might very well be suited to your situation. Feel free to read up on all these concepts and frameworks, and leave your questions in the comments if needed.

2 Comments

Thank you for your reply, I'll look into Spark and Apache Ignite further
Feel free to ask if I can be of any help. Please upvote this post and mark it as the answer, so that it can also be useful to other people. Thanks
