I'm pretty sure I'm using yield improperly:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
from gensim import corpora, models, similarities
from collections import defaultdict
from pprint import pprint # pretty-printer
from six import iteritems
import openpyxl
import string
from operator import itemgetter
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
#Creating a stoplist from file
with open('stop-word-list.txt') as f:
    stoplist = [x.strip('\n') for x in f.readlines()]
corpusFileName = 'content_sample_en.xlsx'
corpusSheetName = 'content_sample_en'
class MyCorpus(object):
    def __iter__(self):
        wb = openpyxl.load_workbook(corpusFileName)
        sheet = wb.get_sheet_by_name(corpusSheetName)
        for i in range(1, (sheet.max_row+1)/2):
            title = str(sheet.cell(row = i, column = 4).value.encode('utf-8'))
            summary = str(sheet.cell(row = i, column = 5).value.encode('utf-8'))
            content = str(sheet.cell(row = i, column = 10).value.encode('utf-8'))
            yield reBuildDoc("{} {} {}".format(title, summary, content))
def removeUnwantedPunctuations(doc):
"change all (/, \, <, >) into ' ' "
newDoc = ""
for l in doc:
if l == "<" or l == ">" or l == "/" or l == "\\":
newDoc += " "
else:
newDoc += l
return newDoc
def reBuildDoc(doc):
"""
:param doc:
:return: document after being dissected to our needs.
"""
doc = removeUnwantedPunctuations(doc).lower().translate(None, string.punctuation)
newDoc = [word for word in doc.split() if word not in stoplist]
return newDoc
corpus = MyCorpus()
tfidf = models.TfidfModel(corpus, normalize=True)
In the example above you can see me trying to create a corpus from an xlsx file. I'm reading three cells from each row of the xlsx file (title, summary and content) and joining them into one big string. My reBuildDoc() and removeUnwantedPunctuations() functions then adjust the text to my needs and in the end return a big list of words (for example: [hello, piano, computer, ...]). In the end I yield the result, but I get the following error:
Traceback (most recent call last):
File "C:/Users/Eran/PycharmProjects/tfidf/docproc.py", line 101, in <module>
tfidf = models.TfidfModel(corpus, normalize=True)
File "C:\Anaconda2\lib\site-packages\gensim-0.13.1-py2.7-win-amd64.egg\gensim\models\tfidfmodel.py", line 96, in __init__
self.initialize(corpus)
File "C:\Anaconda2\lib\site-packages\gensim-0.13.1-py2.7-win-amd64.egg\gensim\models\tfidfmodel.py", line 119, in initialize
for termid, _ in bow:
ValueError: too many values to unpack
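For reference: TfidfModel iterates each yielded document as (term_id, frequency) pairs (the traceback shows the loop for termid, _ in bow), so yielding a plain list of word strings makes that two-way unpacking fail. A tiny illustration of the shape difference, with made-up ids:

# Shape gensim can unpack: a bag-of-words list of (token_id, count) tuples
bow_doc = [(0, 1), (3, 2), (7, 1)]
for termid, count in bow_doc:
    print termid, count

# Shape MyCorpus currently yields: plain strings, which cannot be
# unpacked into (termid, count) -> "too many values to unpack"
word_doc = ['hello', 'piano', 'computer']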
I know the error is from the yield line because I had a different yield line that worked. It looked like this:
yield [word for word in dictionary.doc2bow("{} {} {}".format(title, summary, content).lower().translate(None, string.punctuation).split()) if word not in stoplist]
It was a bit messy and hard to add functionality to, so I changed it as you can see in the first example.
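For comparison, here is a rough sketch of how the refactored iterator could keep that doc2bow conversion while still using reBuildDoc. It assumes a gensim dictionary has already been built over the same corpus; the dictionary variable and the MyBowCorpus name below are assumptions for illustration, not something defined in the code above:

# Assumed to exist already, e.g. built in a first pass over the corpus:
# dictionary = corpora.Dictionary(reBuildDoc(text) for text in all_texts)

class MyBowCorpus(object):
    def __iter__(self):
        wb = openpyxl.load_workbook(corpusFileName)
        sheet = wb.get_sheet_by_name(corpusSheetName)
        for i in range(1, (sheet.max_row+1)/2):
            title = str(sheet.cell(row = i, column = 4).value.encode('utf-8'))
            summary = str(sheet.cell(row = i, column = 5).value.encode('utf-8'))
            content = str(sheet.cell(row = i, column = 10).value.encode('utf-8'))
            words = reBuildDoc("{} {} {}".format(title, summary, content))
            # doc2bow turns the cleaned word list into (token_id, count) pairs,
            # which is the shape TfidfModel expects
            yield dictionary.doc2bow(words)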
removeUnwantedPunctuations is implemented incredibly inefficiently, particularly since you're performing a translate call on the result anyway. Just run the following at the top level of your code:

unwanted_to_space, deletepunc = string.maketrans(r'\/<>', '    '), string.punctuation.translate(None, r'\/<>')

then change

doc = removeUnwantedPunctuations(doc).lower().translate(None, string.punctuation)

to

doc = doc.translate(unwanted_to_space, deletepunc).lower()

In simple tests, that reduces run time by a factor of ~10-15x (higher end for longer/less punctuated strings). You can go further and replace unwanted_to_space with

string.maketrans(r'\/<>' + string.ascii_uppercase, '    ' + string.ascii_lowercase)

(if it's not visible, there should be four spaces at the start of the second argument), which allows you to omit the call to lower() (if the input were non-ASCII you'd want a true lower(), but for ASCII str, folding the case change into the translate call is equivalent and free), getting a >20x saving on runtime. Obviously, if your inputs are small the savings don't matter, but for big data, parsing the input could be a big cost.
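Putting that suggestion together, a minimal Python 2 sketch of reBuildDoc rewritten around the two translation tables (the table names follow the comment above; the assembled function is an illustration, not a tested drop-in replacement):

import string

# map \, /, < and > to spaces and fold ASCII upper case to lower in one pass
unwanted_to_space = string.maketrans(r'\/<>' + string.ascii_uppercase,
                                     '    ' + string.ascii_lowercase)
# delete every punctuation character except the four mapped to spaces above
deletepunc = string.punctuation.translate(None, r'\/<>')

def reBuildDoc(doc):
    """Clean and tokenize with a single translate call; removeUnwantedPunctuations
    and the separate lower() call are no longer needed."""
    cleaned = doc.translate(unwanted_to_space, deletepunc)
    return [word for word in cleaned.split() if word not in stoplist]

Like the original, this assumes byte strings (the cell values are encoded to UTF-8 str before reaching reBuildDoc).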