I have a DataFrame like:

import pandas as pd
df = pd.DataFrame({'author':["Melville","Hemingway","Faulkner"],
                   'title':["Moby Dick","The Sun Also Rises","The Sound and the Fury"],
                   'subject':["whaling","bullfighting","a messed-up family"]
                   })

I know that I can do this:

# produces desired output                   
("Some guy " + df['author'] + " wrote a book called " + 
   df['title'] + " that uses " + df['subject'] + 
   " as a metaphor for the human condition.")

but is it possible to write this more clearly using str.format(), something along the lines of:

# returns KeyError:'author'
["Some guy {author} wrote a book called {title} that uses "
   "{subject} as a metaphor for the human condition.".format(x) 
      for x in df.itertuples(index=False)]

2 Answers

>>> ["Some guy {author} wrote a book called {title} that uses "
   "{subject} as a metaphor for the human condition.".format(**x._asdict())
      for x in df.itertuples(index=False)]

['Some guy Melville wrote a book called Moby Dick that uses whaling as a metaphor for the human condition.', 'Some guy Hemingway wrote a book called The Sun Also Rises that uses bullfighting as a metaphor for the human condition.', 'Some guy Faulkner wrote a book called The Sound and the Fury that uses a messed-up family as a metaphor for the human condition.']
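For anyone wondering why the original attempt raised KeyError: .format(x) passes the whole row as positional argument 0, so there is no keyword named author to look up. Here is a minimal sketch with a plain namedtuple; Row is a hypothetical stand-in for what itertuples() yields, not something from the question.

```python
from collections import namedtuple

# Hypothetical stand-in for one row produced by df.itertuples(index=False)
Row = namedtuple('Row', ['author', 'title'])
x = Row(author="Melville", title="Moby Dick")

template = "{author} wrote {title}"

missing = None
try:
    template.format(x)            # x arrives as positional argument 0
except KeyError as err:
    missing = err.args[0]         # name of the field that could not be found

# Unpacking the row's dict turns each field into a keyword argument.
named = template.format(**x._asdict())

# Alternatively, refer to attributes of positional argument 0 in the template.
by_attribute = "{0.author} wrote {0.title}".format(x)

print(missing, named, by_attribute, sep="\n")
```

The {0.author} form keeps the original .format(x) call and only changes the template, which may be closer to what the question was after.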

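Note that _asdict() is a documented method of collections.namedtuple; the leading underscore is there to avoid clashing with field names, not to mark it as private. The catch is that itertuples() renames columns that are not valid Python identifiers to positional names, so the keys may not match your column labels.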

You could do this instead:

>>> ["Some guy {} wrote a book called {} that uses "
   "{} as a metaphor for the human condition.".format(*x)
      for x in df.values]
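One caveat with the positional version: the {} fields are filled in column order, so it only works if the DataFrame's columns happen to be in the order the sentence needs. A small sketch that selects the columns explicitly; the names cols and template are mine, and the frame is deliberately built with subject first to show the point.

```python
import pandas as pd

# Same data as the question, but with the columns in a different order
df = pd.DataFrame({
    'subject': ["whaling", "bullfighting", "a messed-up family"],
    'author': ["Melville", "Hemingway", "Faulkner"],
    'title': ["Moby Dick", "The Sun Also Rises", "The Sound and the Fury"],
})

template = ("Some guy {} wrote a book called {} that uses "
            "{} as a metaphor for the human condition.")

# Pick the columns in the order the template expects, then unpack each row.
cols = ['author', 'title', 'subject']
sentences = [template.format(*row) for row in df[cols].values]

print(sentences[0])
```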

1 Comment

Got it, so the * does the tuple part for me. Brilliant, thanks -- not sure why someone downvoted us

You could also use DataFrame.iterrows() like this:

["The book {title} by {author} uses "
   "{subject} as a metaphor for the human condition.".format(**x) 
     for i, x in df.iterrows()]

Which is nice if you want to:

  • use named arguments, so the order of use doesn't have to match the order of the columns (as above)
  • avoid calling the underscore-prefixed _asdict() method
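Two other ways to get the same two properties, sketched here as possibilities rather than as part of the answer above: convert the rows to plain dicts with to_dict(orient='records'), or use apply along rows if you want a Series back instead of a list. The variable names are just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'author': ["Melville", "Hemingway", "Faulkner"],
    'title': ["Moby Dick", "The Sun Also Rises", "The Sound and the Fury"],
    'subject': ["whaling", "bullfighting", "a messed-up family"],
})

template = ("The book {title} by {author} uses "
            "{subject} as a metaphor for the human condition.")

# Option 1: one plain {column: value} dict per row, unpacked as keyword arguments.
from_records = [template.format(**row) for row in df.to_dict(orient='records')]

# Option 2: apply along rows; each row is a Series, and the result is a
# Series that shares df's index.
as_series = df.apply(lambda row: template.format(**row), axis=1)

print(from_records[0])
```

Option 2 is handy when the sentence should become a new column, e.g. df['blurb'] = as_series.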

Timing: on this three-row DataFrame, the fastest appears to be M. Klugerford's second solution (the df.values version), even if we heed the caching warning and take the slowest run (about 5.9 × 18.9 µs, still well under the others).

# example
%%timeit
 ("Some guy " + df['author'] + " wrote a book called " + 
   df['title'] + " that uses " + df['subject'] + 
   " as a metaphor for the human condition.")
# 1000 loops, best of 3: 883 µs per loop

%%timeit
    ["Some guy {author} wrote a book called {title} that uses "
       "{subject} as a metaphor for the human condition.".format(**x._asdict())
          for x in df.itertuples(index=False)]
#1000 loops, best of 3: 962 µs per loop

%%timeit
    ["Some guy {} wrote a book called {} that uses "
     "{} as a metaphor for the human condition.".format(*x)
          for x in df.values]   
#The slowest run took 5.90 times longer than the fastest. This could mean that an intermediate result is being cached.
#10000 loops, best of 3: 18.9 µs per loop

%%timeit
    from collections import OrderedDict
    ["The book {title} by {author} uses "
       "{subject} as a metaphor for the human condition.".format(**x) 
         for x in [OrderedDict(row) for i, row in df.iterrows()]]
#1000 loops, best of 3: 308 µs per loop            

%%timeit 
    ["The book {title} by {author} uses "
       "{subject} as a metaphor for the human condition.".format(**x) 
         for i, x in df.iterrows()]
#1000 loops, best of 3: 413 µs per loop         

Why the next-to-last is faster than the last is beyond me.
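If you want to rerun the comparison outside IPython, something like the following should do it with the standard timeit module. This is only a sketch: the function names are made up, it covers four of the variants above, and the numbers will differ by machine, pandas version and (especially) DataFrame size, so I am not quoting any output.

```python
import timeit
import pandas as pd

df = pd.DataFrame({
    'author': ["Melville", "Hemingway", "Faulkner"],
    'title': ["Moby Dick", "The Sun Also Rises", "The Sound and the Fury"],
    'subject': ["whaling", "bullfighting", "a messed-up family"],
})

named = ("Some guy {author} wrote a book called {title} that uses "
         "{subject} as a metaphor for the human condition.")
positional = ("Some guy {} wrote a book called {} that uses "
              "{} as a metaphor for the human condition.")
cols = ['author', 'title', 'subject']

def concat():
    # Vectorised string concatenation, as in the question
    return ("Some guy " + df['author'] + " wrote a book called " +
            df['title'] + " that uses " + df['subject'] +
            " as a metaphor for the human condition.").tolist()

def via_values():
    return [positional.format(*row) for row in df[cols].values]

def via_itertuples():
    return [named.format(**row._asdict()) for row in df.itertuples(index=False)]

def via_iterrows():
    return [named.format(**row) for _, row in df.iterrows()]

candidates = {f.__name__: f for f in (concat, via_values, via_itertuples, via_iterrows)}

# Check that every variant builds the same sentences before timing them.
results = {name: fn() for name, fn in candidates.items()}

# Best of 3 repeats of 100 calls each, reported as seconds per call.
timings = {name: min(timeit.repeat(fn, number=100, repeat=3)) / 100
           for name, fn in candidates.items()}

for name, secs in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:15s} {secs * 1e6:10.1f} µs per call")
```

With only three rows, fixed per-call overhead dominates, so it is worth trying a frame of realistic size before drawing conclusions.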
