I currently have a pandas DataFrame df:
         paper reference
2171686    p84       r51
3816503    p41       r95
4994553   p112        r3
2948201   p112       r61
2957375    p32       r41
2938471    p65       r41
...
Here, each row of df represents one citation: the element in paper cites the element in reference.
I need the following numbers for my analysis:
1. Frequency of elements of paper in df
2. When two elements from paper are randomly selected, the number of reference elements they cite in common
For number 1, I performed the following:
df_count = df.groupby(['paper'])['paper'].count()
For number 2, I performed the operation that returns pairs of elements in paper that cite the same element in reference:
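    from collections import defaultdict

    pair = []
    # Map each reference to the list of papers that cite it
    d = defaultdict(list)
    for idx, row in df.iterrows():
        d[row['reference']].append(row['paper'])

    # For every reference, emit each unordered pair of papers citing it
    for ref, lst in d.items():
        for i in range(len(lst)):
            for j in range(i+1, len(lst)):
                pair.append([lst[i], lst[j], ref])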
pair is a list of three-element lists: the first two elements are a pair of papers, and the third is an element of reference that both papers cite. Below is what pair looks like:
[['p88','p7','r11'],
['p94','p33','r11'],
['p75','p33','r43'],
['p5','p12','r79'],
...]
I would like to retrieve a DataFrame in the following format:
paper1 freq1 paper2 freq2 common
p17 4 p45 3 2
p5 2 p8 5 2
...
where paper1 and paper2 are the first two elements of each list in pair, freq1 and freq2 are the frequency counts of each paper taken from df_count, and common is the number of reference elements that paper1 and paper2 both cite.
How can I retrieve my desired dataset (in the desired format) from df, df_count, and pair?
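For what it's worth, one possible way to get this shape directly from df, without going through the pair list, is to merge df with itself on reference and count rows per pair. This is only a sketch: the toy data below is invented for illustration, and it assumes each (paper, reference) row appears at most once in df. Pairs with nothing in common simply do not show up in the result.

```python
import pandas as pd

# Toy data, invented for illustration (not the real df from the question)
df = pd.DataFrame({
    "paper":     ["p1", "p1", "p1", "p2", "p2", "p3", "p3"],
    "reference": ["r1", "r2", "r3", "r1", "r2", "r2", "r4"],
})

# Number 1: how many rows (citations) each paper has
df_count = df.groupby("paper")["paper"].count()

# Number 2: a self-merge on reference yields one row per
# (paper1, paper2, shared reference) combination
pairs = df.merge(df, on="reference", suffixes=("1", "2"))

# Keep each unordered pair once and drop a paper paired with itself
pairs = pairs[pairs["paper1"] < pairs["paper2"]]

# Count the shared references per pair
result = (
    pairs.groupby(["paper1", "paper2"])
         .size()
         .reset_index(name="common")
)

# Attach the per-paper frequencies and order the columns as desired
result["freq1"] = result["paper1"].map(df_count)
result["freq2"] = result["paper2"].map(df_count)
result = result[["paper1", "freq1", "paper2", "freq2", "common"]]
print(result)
```

The `<` comparison on the paper labels is only there to deduplicate pairs, so the lexicographic ordering of strings like 'p112' and 'p32' does not matter. On a large df the self-merge can get big, since a reference cited by n papers contributes n*n intermediate rows.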
crosstab or pivot. I can't see how your expected output comes from your input, though.
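A rough sketch of how the crosstab suggestion might be applied: build a paper-by-reference incidence matrix with pd.crosstab and multiply it by its transpose, so that each off-diagonal entry counts the references two papers share. The toy data is invented for illustration, and the variable names (inc, co, out) are just placeholders. Note that the co-citation matrix is dense and grows quadratically with the number of papers, so this is likely only practical for a moderate number of papers.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration (not the real df from the question)
df = pd.DataFrame({
    "paper":     ["p1", "p1", "p1", "p2", "p2", "p3", "p3"],
    "reference": ["r1", "r2", "r3", "r1", "r2", "r2", "r4"],
})

# Incidence matrix: 1 if the paper cites the reference, else 0
# (clip guards against duplicated rows in df)
inc = pd.crosstab(df["paper"], df["reference"]).clip(upper=1)

# co[i, j] = number of references cited by both paper i and paper j;
# the diagonal holds each paper's own number of distinct references
co = inc.dot(inc.T)
co_values = co.to_numpy()
freq = np.diag(co_values)

# Indices of the strict upper triangle: each unordered pair once
rows, cols = np.triu_indices(len(co), k=1)

out = pd.DataFrame({
    "paper1": co.index[rows],
    "freq1": freq[rows],
    "paper2": co.index[cols],
    "freq2": freq[cols],
    "common": co_values[rows, cols],
})

# Drop pairs that share no reference
out = out[out["common"] > 0].reset_index(drop=True)
print(out)
```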