0

All,

I am going to compute some feature values using the following Python code. Because the inputs are very large, it is very time-consuming. Please help me optimize the code.

  leaving_volume=len([x for x in pickup_ids if x not in dropoff_ids])
  arriving_volume=len([x for x in dropoff_ids if x not in pickup_ids])
  transition_volume=len([x for x in dropoff_ids if x in pickup_ids])

  union_ids=list(set(pickup_ids + dropoff_ids))
  busstop_ids=[x for x in union_ids if self.geoitems[x].fare>0]
  busstop_density=np.sum([Util.Geodist(self.geoitems[x].orilat, self.geoitems[x].orilng, self.geoitems[x].destlat, self.geoitems[x].destlng)/(1000*self.geoitems[x].fare) for x in busstop_ids])/len(busstop_ids) if len(busstop_ids) > 0 else 0
  busstop_ids=[x for x in union_ids if self.geoitems[x].balance>0]
  smartcard_balance=np.sum([self.geoitems[x].balance for x in busstop_ids])/len(busstop_ids) if len(busstop_ids) > 0 else 0

Hi, All,

Here is my revised version. I ran this code on my GPS trace data, and it is faster.

intersect_ids=set(pickup_ids).intersection( set(dropoff_ids) )
union_ids=list(set(pickup_ids + dropoff_ids))
leaving_ids=set(pickup_ids)-intersect_ids
leaving_volume=len(leaving_ids)
arriving_ids=set(dropoff_ids)-intersect_ids
arriving_volume=len(arriving_ids)
transition_volume=len(intersect_ids)

busstop_density=np.mean([Util.Geodist(self.geoitems[x].orilat, self.geoitems[x].orilng, self.geoitems[x].destlat, self.geoitems[x].destlng)/(1000*self.geoitems[x].fare) for x in union_ids if self.geoitems[x].fare>0])
if not busstop_density > 0:
    busstop_density = 0
smartcard_balance=np.mean([self.geoitems[x].balance for x in union_ids if self.geoitems[x].balance>0])
if not smartcard_balance > 0:
    smartcard_balance = 0

Many thanks for the help.

1
  • I am not sure if np.sum will work on a list. Moreover, in Python 2.7 and lower, dividing two integers performs integer division (I don't know which version of Python you are using). Your first three expressions are two set differences and one set intersection, so use set operations for them. Doing the work in a Python for loop is not very efficient. Instead of a list of objects (geoitems), try using a dictionary of arrays or a record array. Commented Apr 4, 2014 at 7:06

2 Answers

3

Just a few things I noticed, as some Python efficiency trivia:

if x not in dropoff_ids

Checking for membership with the in operator is more efficient on a set (constant time on average) than on a list (a linear scan). Iterating with for, however, is probably at least as efficient over a list as over a set. So if you want your first two lines to be as efficient as possible, you should have both types of data structure available beforehand.
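A minimal sketch of this idea, assuming pickup_ids and dropoff_ids are plain Python lists of hashable IDs (the sample values below are made up): iterate over the lists, but test membership against sets that are built once. Because the loops still run over the lists, duplicates are counted the same way as in the original list comprehensions.

```python
# Hypothetical sample data; the real lists come from the asker's code.
pickup_ids = [1, 2, 2, 3, 5]
dropoff_ids = [2, 3, 4, 4]

# Build each set once. Membership tests against a set take constant
# time on average, instead of scanning the whole list every time.
pickup_set = set(pickup_ids)
dropoff_set = set(dropoff_ids)

# Iterate over the lists (keeps duplicates), look up in the sets.
leaving_volume = sum(1 for x in pickup_ids if x not in dropoff_set)
arriving_volume = sum(1 for x in dropoff_ids if x not in pickup_set)
transition_volume = sum(1 for x in dropoff_ids if x in pickup_set)
```

Note that this differs from counting with len(set(...) - set(...)), which counts each distinct ID only once.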

list(set(pickup_ids + dropoff_ids))

It's more efficient to create your sets before you combine the data, rather than creating one long list and constructing a set from it. Luckily you probably already have the set versions around now (see the first point)!
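As a rough sketch, assuming the two inputs are Python lists (sample values are made up), the union can be taken directly on the sets so that the concatenated list is never allocated:

```python
# Hypothetical sample data.
pickup_ids = [1, 2, 2, 3, 5]
dropoff_ids = [2, 3, 4, 4]

pickup_set = set(pickup_ids)
dropoff_set = set(dropoff_ids)

# Set union; no intermediate pickup_ids + dropoff_ids list is built.
union_ids = pickup_set | dropoff_set

# Same elements as the original expression (ordering is not guaranteed).
same_as_original = union_ids == set(pickup_ids + dropoff_ids)
```

If a list is needed later, list(union_ids) converts it back, but iterating over the set directly works as well.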

Above all you need to ask yourself the question:

Is the time I save by constructing extra data structures worth the time it takes to construct them?

Next one:

np.sum([...])

I've been trained by Python to think of constructing a list and then applying a function that theoretically only requires a generator as a code smell. I'm not sure if this applies in numpy, since from what I remember it's not completely straightforward to pull data from a generator and put it in a numpy structure.
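For what it's worth, here is a small sketch of two ways to avoid the intermediate list: the built-in sum() consumes a generator directly, and np.fromiter builds a 1-D array from an iterable when a NumPy array is wanted. The balances dict is a hypothetical stand-in for self.geoitems[x].balance, and the values are made up.

```python
import numpy as np

# Hypothetical stand-in for self.geoitems[x].balance.
balances = {1: 10.0, 2: 0.0, 3: 5.0, 4: 15.0}
union_ids = [1, 2, 3, 4]

# Built-in sum() pulls values from the generator one at a time.
total = sum(balances[x] for x in union_ids if balances[x] > 0)

# np.fromiter fills a 1-D array straight from the generator.
positive = np.fromiter(
    (balances[x] for x in union_ids if balances[x] > 0), dtype=float
)

# Guard the empty case explicitly instead of relying on a NaN mean.
smartcard_balance = positive.mean() if positive.size else 0.0
```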

It looks like this is just a small fragment of your code. If you're really concerned about efficiency, I'd recommend making use of numpy arrays rather than lists, and trying to stick with numpy's built-in data structures and functions as much as possible. They are likely more highly optimized for raw data crunching in C than the built-in Python functions.
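A sketch of what that could look like, assuming the per-item attributes are stored as parallel NumPy arrays instead of a list of objects. The array names and values are hypothetical, and dist_m is assumed to hold precomputed Util.Geodist results in metres, since that helper is not shown in the question.

```python
import numpy as np

# Hypothetical column arrays, one entry per ID in union_ids, replacing
# attribute lookups such as self.geoitems[x].fare.
fare = np.array([2.0, 0.0, 4.0, 1.0])
balance = np.array([10.0, 0.0, 5.0, 15.0])
# Assumption: precomputed Util.Geodist values, in metres.
dist_m = np.array([4000.0, 1000.0, 8000.0, 3000.0])

# Boolean masks replace the "if ... > 0" filters of the comprehensions.
has_fare = fare > 0
if has_fare.any():
    busstop_density = (dist_m[has_fare] / (1000 * fare[has_fare])).mean()
else:
    busstop_density = 0.0

has_balance = balance > 0
if has_balance.any():
    smartcard_balance = balance[has_balance].mean()
else:
    smartcard_balance = 0.0
```

The explicit any() checks play the role of the len(busstop_ids) > 0 guard in the original code.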

If you're really, really concerned about efficiency, then you should probably be doing this data analysis straight-up in C. Especially if you don't have much more code than what you've presented here, it might be pretty easy to translate over.

0

I can only support what machine yerning wrote in his post. If you are thinking of switching to numpy and your variables pickup_ids and dropoff_ids are numpy arrays (maybe they already are; otherwise convert them as follows):

dropoff_ids = np.array( dropoff_ids, dtype='i' )
pickup_ids = np.array( pickup_ids, dtype='i' )

then you can make use of the function np.in1d(), which gives you a True/False array that you can simply sum over to get the total number of True entries (the NumPy documentation recommends np.isin for new code).

# ~ inverts a boolean array; unary minus on booleans raises a TypeError in current NumPy
leaving_volume   = (~np.in1d( pickup_ids, dropoff_ids )).sum()
transition_volume= np.in1d( dropoff_ids, pickup_ids).sum()
arriving_volume  = (~np.in1d( dropoff_ids, pickup_ids)).sum()

Since every entry of dropoff_ids is either in pickup_ids or not, transition_volume = len(dropoff_ids) - arriving_volume, so one of the two in1d calls on dropoff_ids can be skipped.

Another function that could be useful to you is np.unique(), which gets rid of duplicate entries and in a way turns your array into a set.
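Putting these pieces together, here is a small sketch with made-up IDs. It uses np.isin, which behaves like np.in1d for 1-D inputs, and np.unique for the deduplicated union.

```python
import numpy as np

# Hypothetical sample data.
pickup_ids = np.array([1, 2, 2, 3, 5])
dropoff_ids = np.array([2, 3, 4, 4])

# One boolean per dropoff ID: is it also a pickup ID?
in_pickup = np.isin(dropoff_ids, pickup_ids)
transition_volume = int(in_pickup.sum())
arriving_volume = int((~in_pickup).sum())  # reuse the mask, inverted
leaving_volume = int((~np.isin(pickup_ids, dropoff_ids)).sum())

# np.unique returns the sorted distinct values, much like a set.
union_ids = np.unique(np.concatenate((pickup_ids, dropoff_ids)))
```

Reusing in_pickup for both counts means each dropoff ID is classified exactly once, so the two counts always add up to len(dropoff_ids).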
