
I have a specific series of datasets which come in the following general form:

import pandas as pd
import random

df = pd.DataFrame({'n': random.sample(range(1000), 3),
                   't0': ['a', 'b', 'c'], 't1': ['d', 'e', 'f'],
                   't2': ['g', 'h', 'i'], 't3': ['i', 'j', 'k']})

The number of tn columns (t0, t1, t2 ... tn) varies depending on the dataset, but is always <30. My aim is to merge the content of the tn columns for each row so that I achieve this result (note that for readability I need to keep the whitespace between elements):

df['result'] = df.t0 +' '+df.t1+' '+df.t2+' '+ df.t3

[screenshot: df with the new result column, e.g. 'a d g i', 'b e h j', 'c f i k']

So far so good. This code may be simple, but it becomes clumsy and inflexible as soon as I receive another dataset with more tn columns. This is where my question comes in:

Is there any other syntax to merge the content across multiple columns? Something agnostic to the number of columns, akin to:

df['result'] = ' '.join(df.iloc[:, 1:])

Basically, I want to achieve the same as the OP in the link below, but with whitespace between the strings: Concatenate row-wise across specific columns of dataframe

Comments:

  • Try this: stackoverflow.com/a/32529152/5276797 Commented Sep 19, 2016 at 11:19
  • Perfect! The solution suggested by Russ is flexible and simple. Thanks @IanS for pointing out that answer! Commented Sep 19, 2016 at 11:39
  • For two columns, this answer [stackoverflow.com/a/36911306/3903767] on the same question linked by @IanS is substantially faster using .str.cat (a rough sketch follows below). Commented May 16, 2018 at 9:44
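For reference, a minimal sketch of the two-column str.cat form mentioned in that last comment, using the t0 and t1 columns from the example above:

# Two-column case only: element-wise concatenation with a separator.
df['result'] = df['t0'].str.cat(df['t1'], sep=' ')    # 'a d', 'b e', 'c f'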

3 Answers


The key to operating on columns (Series) of strings en masse is the Series.str accessor.

I can think of two .str methods to do what you want.

str.cat()

The first is str.cat. You have to start from a Series, but you can pass it a list of Series to concatenate (unfortunately you can't pass a DataFrame), along with an optional separator. Using your example:

column_names = df.columns[1:]  # skipping the first, numeric, column
series_list = [df[c] for c in column_names]
# concatenate the first series with all the remaining ones, space-separated:
df['result'] = series_list[0].str.cat(series_list[1:], sep=' ')

Or, in one line:

df['result'] = df[df.columns[1]].str.cat([df[c] for c in df.columns[2:]], sep=' ')
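A note beyond the original answer: if I recall the release notes correctly, newer pandas versions (0.23 and later) do accept a DataFrame as the others argument of str.cat, which would allow a sketch like:

# Sketch, assuming pandas >= 0.23, where str.cat is said to accept a DataFrame:
t_cols = df.columns[1:]                               # the t0..tn columns
df['result'] = df[t_cols[0]].str.cat(df[t_cols[1:]], sep=' ')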

str.join()

The second is the .str.join() method, which works like the standard Python str.join(), but it needs a column (Series) of iterables to operate on, for example a column of tuples, which we can get by applying tuple row-wise to a sub-dataframe of the columns you're interested in:

tuple_series = df[column_names].apply(tuple, axis=1)
df['result'] = tuple_series.str.join(' ')

Or, in one line:

df['result'] = df[df.columns[1:]].apply(tuple, axis=1).str.join(' ')

BTW, don't try the above with list instead of tuple. As of pandas 0.20.1, if the function passed to DataFrame.apply() returns a list with the same number of entries as the columns of the original (sub-)dataframe, DataFrame.apply() returns a DataFrame instead of a Series.


Besides using apply to concatenate the strings, you can also use agg to do so:

df[df.columns[1:]].agg(' '.join, axis=1)
Out[118]: 
0    a d g i
1    b e h j
2    c f i k
dtype: object
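A caveat worth adding (my note, not part of the original answer): ' '.join raises a TypeError if any of the t-columns contain NaN, because NaN is a float, not a string. A minimal sketch that fills missing values first, assuming an empty string is an acceptable stand-in:

# Sketch: replace NaNs with '' before joining; leading/trailing gaps are
# stripped afterwards (interior NaNs would still leave double spaces).
df['result'] = df[df.columns[1:]].fillna('').agg(' '.join, axis=1).str.strip()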


Here is a slightly alternative solution:

In [57]: df['result'] = df.filter(regex=r'^t').apply(lambda x: x.add(' ')).sum(axis=1).str.strip()

In [58]: df
Out[58]:
     n t0 t1 t2 t3   result
0   92  a  d  g  i  a d g i
1  916  b  e  h  j  b e h j
2  363  c  f  i  k  c f i k
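What is nice about this variant is that filter picks the t-columns by name rather than by position, so the numeric n column never needs to be sliced off. Roughly the same idea, joining with agg instead of padding each value with a trailing space and summing (same assumption that the relevant columns all start with 't'):

# Sketch: select columns by name pattern, then join row-wise with a separator;
# no trailing-space strip needed.
df['result'] = df.filter(regex=r'^t').agg(' '.join, axis=1)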
