
(I suck at titling these questions...)

So I've gotten 90% of the way through a very laborious learning process with pandas, but I have one thing left to figure out. Let me show an example (the actual original is a comma-delimited CSV with many more rows):

 Name    Price    Rating    URL                Notes1       Notes2            Notes3
 Foo     $450     9         a.com/x            NaN          NaN               NaN
 Bar     $99      5         see over           www.b.com    Hilarious         Nifty
 John    $551     2         www.c.com          Pretty       NaN               NaN
 Jane    $999     8         See Over in Notes  Funky        http://www.d.com  Groovy

The URL column can say many different things, but the placeholders all include "see over," and they don't consistently indicate which column to the right contains the site.

I would like to do a few things here: first, move websites from any Notes column to URL; second, collapse all Notes columns into one column with a newline between them. So this (NaNs replaced with empty strings, since pandas requires that before I can use the columns in df.loc):

 Name    Price    Rating    URL                Notes1       
 Foo     $450     9         a.com/x            
 Bar     $99      5         www.b.com          Hilarious
                                               Nifty
 John    $551     2         www.c.com          Pretty
 Jane    $999     8         http://www.d.com   Funky
                                               Groovy

I got partway there by doing this:

 df['URL'] = df['URL'].fillna('')
 df['Notes1'] = df['Notes1'].fillna('')
 df['Notes2'] = df['Notes2'].fillna('')
 df['Notes3'] = df['Notes3'].fillna('')
 to_move = df['URL'].str.lower().str.contains('see over')
 df.loc[to_move, 'URL'] = df['Notes1']

What I don't know is how to find the Notes column that contains either "www" or ".com". If I, for example, try to use my above method as a condition, e.g.:

 if df['Notes1'].str.lower().str.contains('www'):
    df.loc[to_move, 'URL'] = df['Notes1']

I get back "ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()". But adding .any() or .all() has the obvious flaw that they don't give me what I'm looking for: with .any(), for example, every line that meets the to_move requirement in URL will get whatever's in Notes1. I need the check to occur row by row. For similar reasons, I can't even get started collapsing the Notes columns (and I don't know how to check for non-null empty-string cells either, a problem I created at this point).
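The row-by-row check can be expressed without any `if` at all: build one boolean Series per condition and combine them element-wise with `&`. A minimal sketch on made-up data (column names as in the question):

```python
import pandas as pd

# Toy frame standing in for the question's data
df = pd.DataFrame({
    'URL': ['a.com/x', 'see over', 'www.c.com'],
    'Notes1': ['', 'www.b.com', 'Pretty'],
})

to_move = df['URL'].str.lower().str.contains('see over')
has_url = df['Notes1'].str.lower().str.contains('www')

# & combines the two masks row by row; .loc then assigns only where both hold
df.loc[to_move & has_url, 'URL'] = df['Notes1']
```

After this, row 1's URL is 'www.b.com' while the other rows are untouched.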

As it stands, I know I also have to move Notes2 into Notes1, Notes3 into Notes2, and '' into Notes3 when the first condition is satisfied, because I don't want the leftover URLs in the Notes columns. I'm sure pandas has easier routes than what I'm doing, because it's pandas, and whenever I try to do anything with pandas, I find out it can be done in one line instead of my 20...

(PS: I don't care if the empty columns Notes2 and Notes3 are left over, because I'm not using them in my CSV import in the next step, though I can always stand to learn more than I need.)

UPDATE: I figured out a crummy, verbose solution using my non-pandas Python logic one step at a time. I came up with this (same first five lines as above, minus the df.loc line):

url_in1 = df['Notes1'].str.contains(r'\.com')
url_in2 = df['Notes2'].str.contains(r'\.com')
to_move = df['URL'].str.lower().str.contains('see over')
to_move1 = to_move & url_in1
to_move2 = to_move & url_in2
df.loc[to_move1, 'URL'] = df.loc[url_in1, 'Notes1']
df.loc[url_in1, 'Notes1'] = df['Notes2']
df.loc[url_in1, 'Notes2'] = ''
df.loc[to_move2, 'URL'] = df.loc[url_in2, 'Notes2']
df.loc[url_in2, 'Notes2'] = ''

(Lines are moved around and to_move is repeated in my actual code.) I know there has to be a more efficient method... This also doesn't collapse the Notes columns, but that should be easy using the same approach, except that I still don't know a good way to find the empty strings.
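On the empty-string point: cells filled with '' are no longer null, so isnull() misses them; comparing against '' directly (or converting '' back to NaN) finds them. A small sketch on made-up data:

```python
import pandas as pd
import numpy as np

s = pd.Series(['', 'Hilarious', '', 'Funky'])

is_empty = s.eq('')           # element-wise comparison: True where the cell is ''
s2 = s.replace('', np.nan)    # or turn '' back into NaN so isnull() works again
```

Either mask can then be used with df.loc exactly like the masks above.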

1 Answer

I'm still learning pandas, so some parts of this code may not be the most elegant, but the general idea is: get all the Notes columns, find all the URLs in there, combine them with the URL column, and then concatenate the remaining notes into the Notes1 column:

import pandas as pd
import numpy as np

# Return the first non-null value in a row, or NaN if there is none
def geturl(s):
    try:
        return next(e for e in s if not pd.isnull(e))
    except StopIteration:
        return np.nan

df = pd.read_csv("d:/temp/data2.txt")

dfnotes = df[[e for e in df.columns if 'Notes' in e]]

#       Notes1            Notes2  Notes3
# 0        NaN               NaN     NaN
# 1  www.b.com         Hilarious   Nifty
# 2     Pretty               NaN     NaN
# 3      Funky  http://www.d.com  Groovy

dfurls = dfnotes.apply(lambda x: x.str.contains(r'\.com'), axis=1)
dfurls = dfurls.fillna(False).astype(bool)

#   Notes1 Notes2 Notes3
# 0  False  False  False
# 1   True  False  False
# 2  False  False  False
# 3  False   True  False

turl = dfnotes[dfurls].apply(geturl, axis=1)

df['URL'] = np.where(turl.isnull(), df['URL'], turl)
df['Notes1'] = dfnotes[~dfurls].apply(lambda x: ' '.join(x.dropna()), axis=1)

del df['Notes2']
del df['Notes3']

df
#    Name Price  Rating               URL           Notes1
# 0   Foo  $450       9           a.com/x                 
# 1   Bar   $99       5         www.b.com  Hilarious Nifty
# 2  John  $551       2         www.c.com           Pretty
# 3  Jane  $999       8  http://www.d.com     Funky Groovy
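For comparison, the same idea fits in a few lines using only public pandas methods; a sketch with the sample data inlined instead of read from a CSV (column names assumed as above, and the notes joined with a newline as the question asked):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'URL': ['a.com/x', 'see over', 'www.c.com', 'See Over in Notes'],
    'Notes1': [np.nan, 'www.b.com', 'Pretty', 'Funky'],
    'Notes2': [np.nan, 'Hilarious', np.nan, 'http://www.d.com'],
    'Notes3': [np.nan, 'Nifty', np.nan, 'Groovy'],
})

notes = df.filter(like='Notes')                      # all Notes columns
is_url = notes.apply(lambda c: c.str.contains(r'\.com', na=False))

# First URL found in each row of the Notes columns (NaN if none)
found = notes.where(is_url).bfill(axis=1).iloc[:, 0]
df['URL'] = df['URL'].mask(df['URL'].str.contains('see over', case=False), found)

# Blank out the moved URLs, then collapse the rest into one column
df['Notes1'] = notes.mask(is_url).apply(lambda r: '\n'.join(r.dropna()), axis=1)
df = df.drop(columns=['Notes2', 'Notes3'])
```

This avoids the row-wise geturl helper by letting bfill(axis=1) pull the first match leftward across the Notes columns.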