51

I have a DataFrame with multiple rows. Is there any way in which they can be combined to form one string?

For example:

     words
0    I, will, hereby
1    am, gonna
2    going, far
3    to
4    do
5    this

Expected output:

I, will, hereby, am, gonna, going, far, to, do, this
2
  • What is the type of the elements? I am guessing 0, 1, etc. is the index, right? Commented Oct 22, 2015 at 11:30
  • The indexes are like 0, 1, 2, 3, 4, 5, 6, 7, ... Commented Oct 22, 2015 at 11:33

4 Answers

54

You can use str.cat to join the strings in each row. For a Series or column s, write:

>>> s.str.cat(sep=', ')
'I, will, hereby, am, gonna, going, far, to, do, this'
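For completeness, here is a minimal, self-contained sketch of the same call applied to the question's data. The DataFrame name `df` and the variable `combined` are my own choices for illustration; the column name `words` comes from the question.

```python
import pandas as pd

# Rebuild the example column from the question.
df = pd.DataFrame({
    "words": ["I, will, hereby", "am, gonna", "going, far", "to", "do", "this"]
})

# .str is only available on a Series (one column), not on the whole DataFrame.
combined = df["words"].str.cat(sep=", ")
print(combined)  # I, will, hereby, am, gonna, going, far, to, do, this
```

Selecting the column first (`df["words"]`) is what avoids the `'DataFrame' object has no attribute 'str'` error.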

8 Comments

I tried the code above. It gives me an error: AttributeError: 'DataFrame' object has no attribute 'str'. Is this because there are blank rows in the DataFrame as well? If so, how can I fix it?
The .str accessor only works on a Series or a single column of a DataFrame (not an entire DataFrame). If you want to apply this method to multiple columns of a DataFrame, you'll need to use it on each column individually in turn.
Thanks, could you also help me out with the syntax for the above? If I want to concatenate the rows of column 'words' of DataFrame df, how should I write it? Thanks for your help!
Sure - to apply the method to the 'words' column, you could write df['words'].str.cat(sep=', ') (where df is the name of your DataFrame).
I'm surprised that str.cat is slower than the join() method. Do check the solution and timings below.
28

How about Python's built-in str.join? It is also faster.

In [209]: ', '.join(df.words)
Out[209]: 'I, will, hereby, am, gonna, going, far, to, do, this'

Timings in Dec, 2016 on pandas 0.18.1

In [214]: df.shape
Out[214]: (6, 1)

In [215]: %timeit df.words.str.cat(sep=', ')
10000 loops, best of 3: 72.2 µs per loop

In [216]: %timeit ', '.join(df.words)
100000 loops, best of 3: 14 µs per loop

In [217]: df = pd.concat([df]*10000, ignore_index=True)

In [218]: df.shape
Out[218]: (60000, 1)

In [219]: %timeit df.words.str.cat(sep=', ')
100 loops, best of 3: 5.2 ms per loop

In [220]: %timeit ', '.join(df.words)
100 loops, best of 3: 1.91 ms per loop

3 Comments

Interesting timings, I get a similar result on 0.19.2. However, I think the trade-off here is that str.cat will seamlessly handle missing values like NaN and None (you can even supply the na_rep argument to choose how to represent these missing values). Python's join fails here. You can get around this by filtering-out/filling-in missing values and then joining, but this slows it right back down. Filling missing values like this also fails if the column holds categorical values; str.cat just works.
How does this work if I do not want a comma as the separator? What if my output should be: I will hereby am gonna going far to do this
@PV8 you can try " ".join(...) instead of ", ".join(...)
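The missing-value trade-off discussed in the comments above can be sketched as follows. The Series `s` and its contents are made up for illustration; `na_rep` is the `str.cat` argument mentioned in the first comment.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two kinds of missing values.
s = pd.Series(["I, will", np.nan, "do", None, "this"])

# With no `others` and no na_rep, str.cat leaves missing values out.
skipped = s.str.cat(sep=", ")

# With na_rep, missing values are replaced by the given placeholder.
filled = s.str.cat(sep=", ", na_rep="?")

# Plain str.join raises TypeError because a missing entry is not a str.
try:
    ", ".join(s)
    join_failed = False
except TypeError:
    join_failed = True

# Dropping the missing values first makes join work again.
joined = ", ".join(s.dropna())

print(skipped)  # I, will, do, this
print(filled)   # I, will, ?, do, ?, this
print(joined)   # I, will, do, this
```

The extra `dropna()` (or `fillna(...)`) pass is the step that costs `join` some of its speed advantage.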
18

If you have a DataFrame rather than a Series and you want to concatenate values (I think text values only) from different rows based on another column as a 'group by' key, then you can use the .agg method from the class DataFrameGroupBy. Here is a link to the API manual.

Sample code tested with Pandas v0.18.1:

import pandas as pd

df = pd.DataFrame({
    'category': ['A'] * 3 + ['B'] * 2,
    'name': ['A1', 'A2', 'A3', 'B1', 'B2'],
    'num': range(1, 6)
})

df.groupby('category').agg({
    'name': lambda x: ', '.join(x),
    'num': lambda x: x.max()
})

3 Comments

Minor comment: you need to assign the result to a new DataFrame, i.e. df2 = df.groupby(...)
groupby with agg and lambda is quite slow on larger dataframes... is there a way to speed this up?
dude thx for this, it rlly helps me to solve another groupby problem
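Picking up the two comments about assignment and speed, here is a hedged variant of the answer's snippet. It stores the result in `df2` (the name suggested in the comment) and replaces the lambdas with the bound method `', '.join` and the string alias `'max'`. The string alias lets pandas use its built-in aggregation instead of calling a Python function per group; I have not timed this, so whether it helps on your data is something to measure.

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["A"] * 3 + ["B"] * 2,
    "name": ["A1", "A2", "A3", "B1", "B2"],
    "num": range(1, 6),
})

# groupby(...).agg(...) returns a new DataFrame, so keep the result.
df2 = df.groupby("category").agg({
    "name": ", ".join,  # bound str.join, equivalent to lambda x: ', '.join(x)
    "num": "max",       # string alias for the built-in max aggregation
})
print(df2)
```

The string concatenation itself still runs in Python for each group, so only the `num` column is likely to benefit.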
0

For anyone who wants to know how to combine multiple rows of strings in a DataFrame,
here is a method that concatenates strings within a 'window-like' range of neighbouring rows:

# add a column built from a 'window-like' range of rows
# (str.cat already returns a Series, so no pd.Series(...) wrapper is needed)
df['windows_key_list'] = df['key'].str.cat(
    [df.groupby(['bycol']).shift(-i)['key'] for i in range(1, windows_size)], sep=' ')

Note: This can't be achieved with a plain groupby aggregation, because the goal is to combine neighbouring rows, not all rows that share the same id.
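A small runnable sketch of this windowed approach follows. The sample data and the value `windows_size = 2` are assumptions made for illustration; the column names `bycol`, `key` and `windows_key_list` follow the snippet above.

```python
import pandas as pd

# Hypothetical sample data: two groups in 'bycol'.
df = pd.DataFrame({
    "bycol": ["a", "a", "a", "b", "b"],
    "key": ["k1", "k2", "k3", "k4", "k5"],
})
windows_size = 2  # each window covers the current row plus the next one

# One shifted copy of 'key' per extra row in the window, shifted within each group.
shifted = [df.groupby("bycol")["key"].shift(-i) for i in range(1, windows_size)]

# Row-wise concatenation; rows whose window runs past the end of the group become NaN.
df["windows_key_list"] = df["key"].str.cat(shifted, sep=" ")
print(df)
```

With this data the first two rows become `k1 k2` and `k2 k3`, while the last row of each group is NaN because there is no following row to combine with. Passing `na_rep=""` to `str.cat` would keep those rows as strings instead.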
