I made a script that combines data from two CSV files and generates a txt file with one line (prompt) per name. I want to avoid repeating the same "fintag" value, so that all the prompts are different.

This script does exactly what I need, but it obviously repeats some of the values because ran is a random number.

I can't simply avoid repeating the same random number, because the random number is used across multiple columns. Creating a separate variable for each column would solve it, but the number of columns is high, and it might even change over time.

The alternative is to remove elements from the "asstag" list once they've been used, but the list is generated inside a for loop, and I have no idea how to remove elements from a list while a for loop is iterating over it.

Input:

people = {'Name': ['mark', 'bill', 'tim', 'frank'],
          'Tag': ['color', 'animal', 'clothes', 'animal']}
dic = {'color': ['blu', 'green', 'red', 'yellow'],
       'animal': ['dog', 'cat', 'horse', 'shark'],
       'clothes': ['gloves', 'shoes', 'shirt', 'socks']}

Expected Output:

mark blu (or green, or red, or yellow)
bill horse (or dog, or cat, or shark)
tim socks (or gloves, or shoes, or shirt)
frank dog (or cat, or shark, but not horse if horse is already assigned to bill)

Code:

import random

import pandas as pd

people = pd.read_csv("people.csv")
dic = pd.read_csv("dic.csv")

nam = list(people.loc[:, "Name"])
tag = list(people.loc[:, "Tag"])

with open("test.txt", "w+") as file:
    for n, t in zip(nam, tag):
        asstag = list(dic.loc[:, t])
        # Random index into this tag's column; the same index can come up again
        ran = random.randint(0, len(asstag) - 1)
        fintag = asstag[ran]
        prompt = str(n) + " " + str(fintag)
        print(prompt)
        file.write(prompt + "\n")
  • Please add input and expected output. What is othervariable? Commented Jul 19, 2022 at 7:34
  • don't worry about it, renamed Commented Jul 19, 2022 at 7:54
  • What happens if there are more names than possible unique tags? Commented Jul 19, 2022 at 8:03
  • That is not possible, there are always more tags than names. Commented Jul 19, 2022 at 8:04

1 Answer

One approach is to select unique elements per tag using random.sample:

import pandas as pd
import random
from collections import Counter

random.seed(42)

people = pd.DataFrame({'Name': ['mark', 'bill', 'tim', 'frank'],
                       'Tag': ['color', 'animal', 'clothes', 'animal']})
dic = pd.DataFrame({'color': ['blu', 'green', 'red', 'yellow'],
                    'animal': ['dog', 'cat', 'horse', 'shark'],
                    'clothes': ['gloves', 'shoes', 'shirt', 'socks']})

names = list(people.loc[:, "Name"])
tags = list(people.loc[:, "Tag"])

samples_by_tag = {tag: random.sample(dic.loc[:, tag].unique().tolist(), count) for tag, count in Counter(tags).items()}

for name, tag in zip(names, tags):
    print(name, samples_by_tag[tag].pop())

Output

mark blu
bill horse
tim shirt
frank dog

The idea is to draw n_i unique elements for each tag using random.sample, where n_i is the number of times the tag appears in tags. This is done in the line:

samples_by_tag = {tag: random.sample(dic.loc[:, tag].unique().tolist(), count) for tag, count in Counter(tags).items()}

For a given run, samples_by_tag can take the following value:

{'color': ['blu'], 'animal': ['dog', 'horse'], 'clothes': ['shirt']}
 # samples_by_tag 
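As a minimal sketch of the same idea, the sampling and the pairing can be wrapped in one function that returns the pairs instead of printing them. The function name assign_unique and its rng parameter are illustrative assumptions, not part of the code above:

```python
import random
from collections import Counter

import pandas as pd


def assign_unique(people, dic, rng=random):
    """Pair each name with a value from its tag's column, never reusing a value within a tag.

    assign_unique and rng are hypothetical names introduced for this sketch.
    """
    tags = people["Tag"].tolist()
    # Draw exactly as many distinct values per tag as that tag occurs in tags.
    samples_by_tag = {
        tag: rng.sample(dic[tag].unique().tolist(), count)
        for tag, count in Counter(tags).items()
    }
    # pop() consumes one pre-drawn value per name, so no value is handed out twice.
    return [(name, samples_by_tag[tag].pop()) for name, tag in zip(people["Name"], tags)]


people = pd.DataFrame({'Name': ['mark', 'bill', 'tim', 'frank'],
                       'Tag': ['color', 'animal', 'clothes', 'animal']})
dic = pd.DataFrame({'color': ['blu', 'green', 'red', 'yellow'],
                    'animal': ['dog', 'cat', 'horse', 'shark'],
                    'clothes': ['gloves', 'shoes', 'shirt', 'socks']})

pairs = assign_unique(people, dic)
for name, value in pairs:
    print(name, value)
```

Returning a list of pairs makes it straightforward to write each prompt to a file afterwards, one per line.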

Note that you need to remove:

random.seed(42)

to make the script give random results every time. See the documentation on random.seed and the notes on reproducibility.
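The effect of seeding can be illustrated with separate random.Random instances. This is only a sketch; the population list is an assumption chosen to match the example data:

```python
import random

population = ['dog', 'cat', 'horse', 'shark']

# Two generators seeded with the same value produce the same sample.
first = random.Random(42).sample(population, 2)
second = random.Random(42).sample(population, 2)
print(first == second)  # True

# A generator created without a seed is initialised from system entropy
# (or the current time), so its sample will generally differ between runs.
unseeded = random.Random().sample(population, 2)
print(unseeded)
```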

UPDATE

If one tag has fewer values than needed, and you have a list to replace them with, do the following:

other_colors = ['black', 'violet', 'green', 'brown']
populations = { tag : dic.loc[:, tag].unique().tolist() for tag in set(tags) }
populations["color"] = list(set(other_colors))

samples_by_tag = {tag: random.sample(populations[tag], count) for tag, count in Counter(tags).items()}

for name, tag in zip(names, tags):
    print(name, samples_by_tag[tag].pop())
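If you would rather switch to the replacement list only when a column is actually too small, one possible sketch is below. The helper build_populations and the fallbacks dictionary are hypothetical names, and the data is made up so that 'color' has only two unique values while three are needed:

```python
import random
from collections import Counter

import pandas as pd

people = pd.DataFrame({'Name': ['mark', 'bill', 'tim', 'frank'],
                       'Tag': ['color', 'color', 'color', 'animal']})
dic = pd.DataFrame({'color': ['blu', 'green', 'blu', 'green'],
                    'animal': ['dog', 'cat', 'horse', 'shark']})
# Hypothetical replacement lists, keyed by tag.
fallbacks = {'color': ['black', 'violet', 'green', 'brown']}


def build_populations(dic, tags, fallbacks):
    """Use a tag's column unless it has too few unique values, else its fallback list."""
    populations = {}
    for tag, count in Counter(tags).items():
        values = dic[tag].unique().tolist()
        if len(values) < count:
            # Disregard the column entirely; a missing fallback raises KeyError here.
            values = sorted(set(fallbacks[tag]))
        populations[tag] = values
    return populations


names = people["Name"].tolist()
tags = people["Tag"].tolist()
populations = build_populations(dic, tags, fallbacks)
samples_by_tag = {tag: random.sample(populations[tag], count)
                  for tag, count in Counter(tags).items()}
result = [(name, samples_by_tag[tag].pop()) for name, tag in zip(names, tags)]
for name, value in result:
    print(name, value)
```

Because the check happens before random.sample is called, the "Sample larger than population" ValueError is avoided as long as the fallback list itself is long enough.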

6 Comments

This is not random, it gives the exact same result every time it's run.
@FrancescoCalderone Because I set the seed for reproducibility, just remove the random.seed(42) line
What if there are actually more names than unique tags as you suggested before? Let's say that a specific tag called "colors" only has 2 entries. What I want to do is to disregard those 2 entries entirely, and get the values for an entirely different list (just for that tag). The list is stored in a variable named colors_list. How would I do that?
Just update the dictionary samples_by_tag: replace the values under the "colors" key with the new ones.
Not sure what that means. The samples_by_tag dictionary doesn't get generated because of "ValueError: Sample larger than population or is negative", so I can't change the values afterwards. And I'm not sure how to handle that exception before the samples_by_tag line is run.