I’m having a problem changing values in a dataframe. I also want to consult regarding a problem I need to solve and the proper way to use pandas to solve it. I'll appreciate help on both. I have a file containing information about matching degree of audio files to speakers. The file looks something like that:
wave_path spk_name spk_example# score mark comments isUsed
190 122_65_02.04.51.800.wav idoD idoD 88 NaN NaN False
191 121_110_20.17.27.400.wav idoD idoD 87 NaN NaN False
192 121_111_00.34.57.300.wav idoD idoD 87 NaN NaN False
193 103_31_18.59.12.800.wav idoD idoD_0 99 HIT VP False
194 131_101_02.08.06.500.wav idoD idoD_0 96 HIT VP False
What I need to do, is some kind of a sophisticated counting. I need to group the results by speaker, and calculate for each speaker some calculation. I then proceed with the speaker that made the best calculation for me, but before proceeding I need to mark all the files which I used for the calculation as being used, i.e. changing the isUsed value for each row in which they appear (files can appear more than once) to TRUE. Then I make another iteration. Calculate for each speaker, mark the used files and so on until no more speakers left to be calculated.
I thought a lot about how to implement that process using pandas (it is quite easy to implement in regular python but it will take a lot of looping and data structuring that my guess will slow the process down significantly, and also I’m using this process to get to learn pandas abilities more deeply)
I came out with the following solution. As preparation steps, I’ll group by speaker name and set the file name as index by the set_index method. I will then iterate over the groupbyObj and apply the calculation function, which will return the selected speaker and the files to be marked as used.
Then I’ll iterate over the files and mark them as used (this would be fast and simple since I set them as indexes beforehand), and so on until I finish calculating.
First, I’m not sure about this solution, so feel free to tell me your thoughts on it. Now, I’ve tried implementing this, and got into trouble:
First I indexed by file name, no problem here:
In [53]:
marked_results['isUsed'] = False
ind_res = marked_results.set_index('wave_path')
ind_res.head()
Out[53]:
spk_name spk_example# score mark comments isUsed
wave_path
103_31_18.59.12.800.wav idoD idoD 99 HIT VP False
131_101_02.08.06.500.wav idoD idoD 99 HIT VP False
144_35_22.46.38.700.wav idoD idoD 96 HIT VP False
41_09_17.10.11.700.wav idoD idoD 93 HIT TEST False
122_188_03.19.20.400.wav idoD idoD 93 NaN NaN False
Then I choose a file and checked that I get the entries relevant to that file:
In [54]:
example_file = ind_res.index[0];
ind_res.ix[example_file]
Out[54]:
spk_name spk_example# score mark comments isUsed
wave_path
103_31_18.59.12.800.wav idoD idoD 99 HIT VP False
103_31_18.59.12.800.wav idoD idoD_0 99 HIT VP False
103_31_18.59.12.800.wav idoD idoD_1 97 HIT VP False
103_31_18.59.12.800.wav idoD idoD_2 95 HIT VP False
Now problems here too. Then I tried to change the isUsed value for that file to True, and that where I got the problem:
In [56]:
ind_res.ix[example_file]['isUsed'] = True
ind_res.ix[example_file].isUsed = True
ind_res.ix[example_file]
Out[56]:
spk_name spk_example# score mark comments isUsed
wave_path
103_31_18.59.12.800.wav idoD idoD 99 HIT VP False
103_31_18.59.12.800.wav idoD idoD_0 99 HIT VP False
103_31_18.59.12.800.wav idoD idoD_1 97 HIT VP False
103_31_18.59.12.800.wav idoD idoD_2 95 HIT VP False
So, you see the problem. Nothing has changed. What am I doing wrong? Is the problem described above should be solved using pandas?
And also: 1. How can I approach a specific group by a groupby object? bcz I thought maybe instead of setting the files as indexed, grouping by a file, and the using that groupby obj to apply a changing function to all of its occurrences. But I didn’t find a way to approach a specific group and passing the group name as parameter and calling apply on all the groups and then acting only on one of them seemed not "right" to me.
I hope it is not to long... :)
.ix[example_file,'isUsed']see here: pandas.pydata.org/pandas-docs/dev/…