60

Is there any way to use the mapping function or something better to replace values in an entire dataframe?

I only know how to perform the mapping on series.

I would like to replace the strings in the 'tesst' and 'set' columns with numbers, for example set = 1 and test = 2.

Here is an example of my dataset (the original dataset is very large):

ds_r
  respondent  brand engine  country  aware  aware_2  aware_3  age tesst   set
0          a  volvo      p      swe      1        0        1   23   set   set
1          b  volvo   None      swe      0        0        1   45   set   set
2          c    bmw      p       us      0        0        1   56  test  test
3          d    bmw      p       us      0        1        1   43  test  test
4          e    bmw      d  germany      1        0        1   34   set   set
5          f   audi      d  germany      1        0        1   59   set   set
6          g  volvo      d      swe      1        0        0   65  test   set
7          h   audi      d      swe      1        0        0   78  test   set
8          i  volvo      d       us      1        1        1   32   set   set

Final result should be

 ds_r
  respondent  brand engine  country  aware  aware_2  aware_3  age  tesst  set
0          a  volvo      p      swe      1        0        1   23      1    1
1          b  volvo   None      swe      0        0        1   45      1    1
2          c    bmw      p       us      0        0        1   56      2    2
3          d    bmw      p       us      0        1        1   43      2    2
4          e    bmw      d  germany      1        0        1   34      1    1
5          f   audi      d  germany      1        0        1   59      1    1
6          g  volvo      d      swe      1        0        0   65      2    1
7          h   audi      d      swe      1        0        0   78      2    1
8          i  volvo      d       us      1        1        1   32      1    1

10 Answers

91

What about DataFrame.replace?

In [9]: mapping = {'set': 1, 'test': 2}

In [10]: df.replace({'set': mapping, 'tesst': mapping})
Out[10]: 
   Unnamed: 0 respondent  brand engine  country  aware  aware_2  aware_3  age  \
0           0          a  volvo      p      swe      1        0        1   23   
1           1          b  volvo   None      swe      0        0        1   45   
2           2          c    bmw      p       us      0        0        1   56   
3           3          d    bmw      p       us      0        1        1   43   
4           4          e    bmw      d  germany      1        0        1   34   
5           5          f   audi      d  germany      1        0        1   59   
6           6          g  volvo      d      swe      1        0        0   65   
7           7          h   audi      d      swe      1        0        0   78   
8           8          i  volvo      d       us      1        1        1   32   

  tesst set
0     1   1
1     1   1
2     2   2
3     2   2
4     1   1
5     1   1
6     2   1
7     2   1
8     1   1

As @Jeff pointed out in the comments, in pandas versions < 0.11.1, manually tack .convert_objects() onto the end to properly convert tesst and set to int64 columns, in case that matters in subsequent operations.
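
On recent pandas versions, where convert_objects() has since been removed, one alternative is to apply Series.map to each target column. The following is only a minimal sketch: the small frame is a cut-down copy of the question's data, and the names cols and out are assumptions made for illustration. Unlike replace, map turns any value missing from the mapping into NaN instead of leaving it untouched.

```python
import pandas as pd

# Cut-down copy of the question's data (only the columns that matter here).
df = pd.DataFrame({
    "respondent": list("abcdefghi"),
    "tesst": ["set", "set", "test", "test", "set", "set", "test", "test", "set"],
    "set": ["set", "set", "test", "test", "set", "set", "set", "set", "set"],
})

mapping = {"set": 1, "test": 2}
cols = ["tesst", "set"]  # hypothetical name for the columns to recode

out = df.copy()
# Apply the same dictionary to each selected column.
out[cols] = out[cols].apply(lambda col: col.map(mapping))

print(out)
```

Because every value in these two columns appears in the mapping, the recoded columns hold only the integers 1 and 2.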

5 Comments

note that you might want to do a df.convert_objects() after the replacement to coerce to proper dtypes
@Dan Allan this will be default in 0.11.1, FYI (to convert_objects)
This is super old but you can also do this now: df.replace(to_replace=['set', 'test'], value=[1, 2])
I think we shouldn't have to hardcode the values; they should be picked up dynamically at run time and assigned numbers.
For pandas v2.2.0 this raises FutureWarning: Downcasting behavior in 'replace' is deprecated and will be removed in a future version. To retain.... Suggested infer_objects(copy=False) does not work.
32

I know this is old, but I am adding this for those searching as I was. Given a pandas DataFrame (df in this code):

ip_addresses = df.source_ip.unique()
ip_dict = dict(zip(ip_addresses, range(len(ip_addresses))))

That gives you a dictionary mapping each IP address to an integer, without having to write it out by hand.
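
A minimal sketch of how such a dictionary could then be applied with Series.map. The sample addresses and the source_ip_code column name are made up for illustration.

```python
import pandas as pd

# Illustrative data; the addresses are invented.
df = pd.DataFrame({"source_ip": ["10.0.0.1", "10.0.0.2", "10.0.0.1", "10.0.0.3"]})

# unique() keeps values in order of first appearance.
ip_addresses = df.source_ip.unique()
ip_dict = dict(zip(ip_addresses, range(len(ip_addresses))))

# Store the integer code in a new (hypothetical) column.
df["source_ip_code"] = df.source_ip.map(ip_dict)
print(df)
```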

19

You can use the applymap DataFrame function to do this:

In [26]: df = DataFrame({"A": [1,2,3,4,5], "B": ['a','b','c','d','e'],
                         "C": ['b','a','c','c','d'], "D": ['a','c',7,9,2]})
In [27]: df
Out[27]:
   A  B  C  D
0  1  a  b  a
1  2  b  a  c
2  3  c  c  7
3  4  d  c  9
4  5  e  d  2

In [28]: mymap = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}

In [29]: df.applymap(lambda s: mymap.get(s) if s in mymap else s)
Out[29]:
   A  B  C  D
0  1  1  2  1
1  2  2  1  3
2  3  3  3  7
3  4  4  3  9
4  5  5  4  2
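
In pandas 2.1 and later, applymap is deprecated in favour of DataFrame.map. A hedged sketch of the same idea that works on either side of that change is below; the elementwise and result names are assumptions, and mymap.get(s, s) is a shorter spelling of the conditional above.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": ['a', 'b', 'c', 'd', 'e'],
                   "C": ['b', 'a', 'c', 'c', 'd'], "D": ['a', 'c', 7, 9, 2]})
mymap = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}

# Use DataFrame.map when it exists (pandas >= 2.1), otherwise fall back to applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap

# dict.get(s, s) returns the mapped value, or s itself when s is not a key.
result = elementwise(lambda s: mymap.get(s, s))
print(result)
```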

5 Comments

I am working on a problem like this and followed the exact steps in your answer, but I am not getting the output. Code:
wc = pd.read_csv('PATH', usecols = ['Workclass'])
df = pd.DataFrame(wc)
wcdict = {"?":0,"Federal-gov":1,"Local-gov":2,"Never-worked":3,"Private":4,"Self-emp-inc":5, "Self-emp-n-inc":6,"State-gov":7,"Without-pay":8}
df.applymap(lambda s: wcdict.get(s) if s in wcdict else s)
print(df)
df.applymap(lambda s: mymap.get(s) if s in mymap else s) does not change df in place, so your print df statement will not reflect the results of the applymap. You need to do an assignment like df2 = df.applymap(lambda s: mymap.get(s) if s in mymap else s). print df2 will now reflect the changes.
That worked!! Thanks :) I have one more question: I need to work with pyspark rather than plain Python. Does the implementation of this logic differ in pyspark? When I created the data frame I gave the file path (as shown in the comments above), but I would like to give an RDD as the input to the data frame. I couldn't do that. Do you have any idea about this?
Glad it worked. I'm really not sure... perhaps this might be a start?
12

The simplest way to replace a value everywhere in the dataframe:

df = df.replace(to_replace="set", value=1)
df = df.replace(to_replace="test", value=2)

Hope this will help.

8

pandas.get_dummies() converts string columns such as brand ('volvo', 'bmw', ...) into indicator columns, one per distinct value, rather than into a single integer code. Read the data into a DataFrame first, then pass it to get_dummies():

  import pandas as pd

  df = pd.read_csv("myFile.csv", index_col=0)  # DataFrame.from_csv was removed in pandas 1.0
  df_transform = pd.get_dummies(df)
  print(df_transform)

A better fit for this question: pass a dictionary to map() on a single column (a pandas Series), for example brand:

df['brand'] = df['brand'].map({'volvo': 0, 'bmw': 1, 'audi': 2})
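
To make the difference between the two approaches concrete, here is a small sketch on a hypothetical three-row frame (the data and the dummies and codes names are illustrative): get_dummies yields one indicator column per distinct value, while map yields a single integer column.

```python
import pandas as pd

df = pd.DataFrame({"brand": ["volvo", "bmw", "audi"]})  # illustrative data

# One indicator column per distinct brand.
dummies = pd.get_dummies(df["brand"], prefix="brand")

# A single integer column, using an explicit dictionary.
codes = df["brand"].map({"volvo": 0, "bmw": 1, "audi": 2})

print(list(dummies.columns))
print(codes.tolist())
```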

4

pandas.factorize() does exactly this.

>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> codes
array([0, 0, 1, 2, 0]...)
>>> uniques
array(['b', 'a', 'c'], dtype=object)

With a DataFrame:

df["tesst"], tesst_key = pd.factorize(df["tesst"])
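
A short sketch of how this could look on the question's two columns, using a cut-down copy of the data. factorize numbers values from 0 in order of first appearance, so the sketch adds 1 to match the requested set = 1, test = 2. The second half uses pd.Categorical with a fixed category list (a swapped-in technique, not part of the answer above) so that both columns share the same codes regardless of row order. The names tesst_codes, categories and fixed are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "tesst": ["set", "set", "test", "test", "set", "set", "test", "test", "set"],
    "set": ["set", "set", "test", "test", "set", "set", "set", "set", "set"],
})

# factorize: codes follow order of first appearance, starting at 0.
codes, uniques = pd.factorize(df["tesst"])
tesst_codes = codes + 1  # shift to 1-based codes

# Fixed category order: the code for each string no longer depends on row order.
categories = ["set", "test"]
fixed = {col: pd.Categorical(df[col], categories=categories).codes + 1
         for col in ["tesst", "set"]}

print(list(uniques), tesst_codes.tolist())
print(fixed["set"].tolist())
```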

2

You can also do this with pandas rename_categories. You would first need to define the column as dtype="category" e.g.

In [66]: s = pd.Series(["a","b","c","a"], dtype="category")

In [67]: s
Out[67]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

and then rename them:

In [70]: s.cat.rename_categories([1,2,3])
Out[70]: 
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]

You can also pass a dict-like object to map the renaming, e.g.:

In [72]: s.cat.rename_categories({'a': 'x', 'b': 'y', 'c': 'z'})
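
Applied to a DataFrame column, the same idea might look like the sketch below. The four-row frame and the as_category and renamed names are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({"tesst": ["set", "test", "test", "set"]})  # illustrative data

# Convert the column to category dtype, then rename the categories via a dict.
as_category = df["tesst"].astype("category")
renamed = as_category.cat.rename_categories({"set": 1, "test": 2})

# Convert from category dtype to plain integers if needed downstream.
df["tesst"] = renamed.astype(int)
print(df["tesst"].tolist())
```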

2 Comments

in general, what is this category type for?
@HerrIvan there's plenty of documentation here pandas.pydata.org/pandas-docs/stable/categorical.html
2

When there are only a few distinct values:

mymap = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}
df = df.applymap(lambda s: mymap.get(s) if s in mymap else s)

When writing the mapping by hand is not practical:

# Build a temporary lookup table: one row per distinct string, with a new integer code.
temp_df2 = pd.DataFrame({'data': data['data'].unique(), 'data_new': range(len(data['data'].unique()))})
# Left-merge it so that every row receives the code for its string.
data = data.merge(temp_df2, on='data', how='left')
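
A self-contained sketch of the merge approach on made-up data, assuming a single string column called data as in the snippet above (the lookup and merged names are hypothetical):

```python
import pandas as pd

data = pd.DataFrame({"data": ["x", "y", "x", "z"]})  # illustrative data

# Lookup table: one row per distinct string, numbered in order of first appearance.
lookup = pd.DataFrame({"data": data["data"].unique()})
lookup["data_new"] = range(len(lookup))

# A left merge keeps every original row, in its original order.
merged = data.merge(lookup, on="data", how="left")
print(merged)
```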

1

You can build the dictionary from the column's own values and apply it like this:

x=df['Item_Type'].value_counts()
item_type_mapping={}
item_list=x.index
for i in range(0,len(item_list)):
    item_type_mapping[item_list[i]]=i

df['Item_Type']=df['Item_Type'].map(lambda x:item_type_mapping[x]) 

0

df.replace(to_replace=['set', 'test'], value=[1, 2]), from @Ishnark's comment on the accepted answer.
