Data:

import pandas as pd

# named `data` rather than `dict` to avoid shadowing the built-in
data = {'REF': ['A','B','C','D'],
        'ALT': [['E','F'], ['G'], ['H','I','J'], ['K,L']],
        'sample1': ['0', '0', '1', '2'],
        'sample2': ['1', '0', '3', '0']
        }
df = pd.DataFrame(data)

Problem: I need to replace the values in columns 'sample1' and 'sample2'. If the value is 0, the 'REF' column value should be used. If it is 1, the first element of the list in the 'ALT' column should be used; if 2, the second element of that list; and so on.
My Solution:

sample_list = ['sample1', 'sample2']
for sample in sample_list:

    #replace 0s
    df[sample] = df.apply(lambda x: x[sample].replace('0', x['REF']), axis=1)
    #replace other numbers
    for i in range(1,4):
        try:
            df[sample] = df.apply(lambda x: x[sample].replace(f'{i}', x['ALT'][i-1]), axis=1)
        except:
            pass

However, because the list length differs between rows of the 'ALT' column, an IndexError seems to occur, and values above 1 are not replaced. You can see it in the output:

'{"REF":{"0":"A","1":"B","2":"C","3":"D"},"ALT":{"0":["E","F"],"1":["G"],"2":["H","I","J"],"3":["K"]},"sample1":{"0":"A","1":"B","2":"H","3":"2"},"sample2":{"0":"E","1":"B","2":"3","3":"D"}}'
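
The failure can be reproduced in isolation. The following is a minimal sketch, assuming a two-row frame that is not part of the original data; `replace_second` is a hypothetical helper name. It suggests that the lookup `row["ALT"][1]` is evaluated for every row before `str.replace` runs, so one short list makes the whole `apply` raise, and the bare `except` then discards the replacement for all rows.

```python
import pandas as pd

# Assumed toy data: the second row has only one ALT element.
df = pd.DataFrame({
    "REF": ["A", "B"],
    "ALT": [["E", "F"], ["G"]],
    "sample1": ["2", "0"],
})

def replace_second(row):
    # row["ALT"][1] is evaluated eagerly as an argument, so it raises
    # IndexError even when the row's value contains no "2" at all.
    return row["sample1"].replace("2", row["ALT"][1])

try:
    df.apply(replace_second, axis=1)
    failed = False
except IndexError:
    # The exception aborts apply() for the whole column, not just one row.
    failed = True

print(failed)
```

Because the exception aborts the entire `apply` call, no row is updated for that value of `i`, including rows whose lists are long enough.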

How can I solve it?

UPDATE: If there is a NaN value in sample1 or sample2, I can't convert the values to int, and I don't know how to skip these values.

So NaN values should not be converted and should stay NaN.

Expected output:

[Image: expected output]

  • In sample 1 you have 2 but only one element in the list Commented Dec 22, 2020 at 9:09
  • Even if it is 2 elements, still doesn't work Commented Dec 22, 2020 at 9:12
  • My question was more, what should be done in those cases? Commented Dec 22, 2020 at 9:12
  • I think you have a typo in your ALT column, K and L should be separated. Commented Dec 22, 2020 at 9:39

3 Answers

You could do:

df['sample1'] = np.where(df['sample1'].eq(0), df['REF'],
                         [v[max(i - 1, 0)] for v, i in zip(df['ALT'], df['sample1'].astype(int))])

df['sample2'] = np.where(df['sample2'].eq(0), df['REF'],
                         [v[max(i - 1, 0)] for v, i in zip(df['ALT'], df['sample2'].astype(int))])

print(df)

Output

  REF        ALT sample1 sample2
0   A     [E, F]       E       E
1   B        [G]       G       G
2   C  [H, I, J]       H       J
3   D        [K]       K       K

Note that I used a different input, since the one in your example is not valid.
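
One caveat with this approach: `.eq(0)` compares against the integer 0, so if the sample columns hold strings such as '0' (as in the question), the condition is never true and REF is never selected. The following is a hedged sketch of the same `np.where` idea with the codes cast to int first. It assumes the question's data with 'K' and 'L' as separate list elements and no NaN values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "REF": ["A", "B", "C", "D"],
    # Assumption: 'K' and 'L' are separate elements (the question has 'K,L').
    "ALT": [["E", "F"], ["G"], ["H", "I", "J"], ["K", "L"]],
    "sample1": ["0", "0", "1", "2"],
    "sample2": ["1", "0", "3", "0"],
})

for col in ["sample1", "sample2"]:
    idx = df[col].astype(int)  # compare numbers, not strings
    # For code 0 the picked ALT value is a throwaway; np.where ignores it.
    alt_pick = [alts[max(i - 1, 0)] for alts, i in zip(df["ALT"], idx)]
    df[col] = np.where(idx.eq(0), df["REF"], alt_pick)

print(df)
```

The list comprehension still indexes every row, so a code larger than the length of its ALT list would raise an IndexError here rather than being skipped.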

1 Comment

Thanks! But what can I do if there is a NaN value in some rows of a sample column? Then df['sample2'].astype(int) will not work. How can I skip these rows?

Using a simple concatenation of the REF and ALT columns and apply:

import numpy as np
import pandas as pd

d = {'REF': ['A','B','C','D'],
     'ALT': [['E','F'], ['G'], ['H','I','J'], ['K','L']],
     'sample1': ['0', '0', '1', '2'],
     'sample2': ['1', '0', '3', '0']
     }
df = pd.DataFrame(d)


df["REF_ALT"] = df["REF"].map(list) + df["ALT"]  # concatenate REF and ALT into one list per row
# pd.isna accepts strings as well as NaN (np.isnan raises TypeError on a string)
df["sample1"] = df.apply(lambda row: np.nan if pd.isna(row["sample1"]) else row["REF_ALT"][int(row["sample1"])], axis=1)
df["sample2"] = df.apply(lambda row: np.nan if pd.isna(row["sample2"]) else row["REF_ALT"][int(row["sample2"])], axis=1)
df.pop("REF_ALT")  # drop the helper column
df

6 Comments

Thanks for the simple answer! But what can I do if there is a NaN value in some of the sample columns? Then int(row["sample"]) will not work.
In that case you need to replace the NaN values beforehand with .fillna()
But I need to keep these NaN values and not replace them, so I can neither use .fillna() nor convert to integer.
OK, so please clarify what the expected output is in the case of NaN.
The expected output is simply to keep the NaN values (not replace them) in the sample column, and to replace only the numbers.
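
The requirement discussed in these comments (keep NaN, replace only the numbers) can be sketched as follows. This is one possible approach rather than a definitive one: `pick_allele` is a hypothetical helper, the NaN positions in the data are assumed for illustration, and returning NaN for an out-of-range code is an assumption, since the thread does not say what should happen in that case.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "REF": ["A", "B", "C", "D"],
    "ALT": [["E", "F"], ["G"], ["H", "I", "J"], ["K", "L"]],
    "sample1": ["0", np.nan, "1", "2"],   # assumed NaN positions
    "sample2": ["1", "0", "3", np.nan],
})

def pick_allele(ref, alts, code):
    """Return REF for code 0, the (code-1)-th ALT element otherwise."""
    if pd.isna(code):
        return np.nan                      # leave missing values untouched
    alleles = [ref] + list(alts)           # index 0 -> REF, 1.. -> ALT
    i = int(code)                          # assumes codes like "2" or 2.0
    # Assumption: a code beyond the list length yields NaN instead of raising.
    return alleles[i] if 0 <= i < len(alleles) else np.nan

for col in ["sample1", "sample2"]:
    df[col] = [pick_allele(r, a, c)
               for r, a, c in zip(df["REF"], df["ALT"], df[col])]

print(df)
```

Checking `pd.isna` before calling `int` avoids the conversion error, and building the `[REF] + ALT` list per row means the lookup is done only for the code that row actually holds.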

A simple solution:

import pandas as pd

df = pd.DataFrame.from_dict({
 'REF': {0: 'A', 1: 'B', 2: 'C', 3: 'D'},
 'ALT': {0: ['E', 'F'], 1: ['G'], 2: ['H', 'I', 'J'], 3: ['K', 'L']},
 'sample1': {0: 0, 1: 0, 2: 1, 3: 2},
 'sample2': {0: 1, 1: 0, 2: 3, 3: 0},
})

# create a temp col s that holds REF plus all ALT letters as a single string
# (indexing into it works because every allele here is a single character):
df["s"] = df.REF + df.ALT.str.join("")
df["sample1"] = df.apply(lambda x: x["s"][x.sample1], axis=1)
df["sample2"] = df.apply(lambda x: x["s"][x.sample2], axis=1)
df = df.drop(columns="s")

output:

  REF        ALT sample1 sample2
0   A     [E, F]       A       E
1   B        [G]       B       B
2   C  [H, I, J]       H       J
3   D     [K, L]       L       D
