I am creating a string that is about 30 million words long. As you can imagine, this takes absolutely forever to build with a for loop that appends roughly 100 words at a time. Is there a way to represent the string in a more memory-friendly way, such as a numpy array? I have very little experience with numpy.
```python
# df is a pandas DataFrame whose 'text' column holds the tweets
bigStr = ''
for tweet in df['text']:
    bigStr = bigStr + ' ' + tweet  # copies the entire accumulated string each iteration
len(bigStr)
```
`bigStr` is, and will be, a regular Python `str` value, no matter what compatible type `tweet` may have.
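What actually makes the loop slow is that every `bigStr + ' ' + tweet` copies the whole accumulated string, so the total work grows quadratically with the number of tweets. The usual fix is `str.join`, which gathers all the pieces and allocates the final string in a single pass. Here is a minimal sketch, assuming `df['text']` contains only plain strings; the tiny example DataFrame is just a stand-in for your data:

```python
import pandas as pd

# Stand-in data; in your case df already exists with millions of words of tweets.
df = pd.DataFrame({'text': ['first tweet', 'second tweet', 'third tweet']})

# join sizes and fills the result once, instead of re-copying bigStr
# on every loop iteration.
bigStr = ' '.join(df['text'])

print(len(bigStr))
```

Note that the loop version also prepends a space before the first tweet; if you need byte-for-byte identical output, use `' ' + ' '.join(df['text'])`. And if the column might contain NaN or non-string values, clean it first, e.g. `' '.join(df['text'].dropna().astype(str))`.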