Problem using a dictionary of numpy arrays (indexing it wrong)

Question

I'm trying to code Gaussian Naive Bayes from scratch using Python and NumPy, but I'm having trouble creating the word frequency table.

I have a dictionary with N words as keys, and each of these N words has an associated numpy array.

Example:

freq_table['subject'] -> vector of occurrences of this word, of length nrows, where nrows is the number of rows in the dataset.

So for each row in the dataset I'm doing: freq_table[WORD][i] += 1

def train(self, X):
        # Creating the dictionary
        self.dictionary(X.data[:100])

        # Calculating the class prior probabilities
        self.p_class = self.prior_probs(X.target)

        # Calculating the likelihoods
        nrows = len(X.data[:100])
        freq = dict.fromkeys(self._dict, nrows * [0])

        for doc, target, i in zip(X.data[:2], X.target[:2], range(2)):
            print('doc [%d] out of %d' % (i, nrows))

            words = preprocess(doc)

            print(len(words), i)

            for j, w in enumerate(words):
                print(w, j)

                # Getting the vector assigned by the word w
                vec = freq[w]

                # Add one occurrence at the i-th position (observation id)
                vec[i] += 1

        print(freq['subject'])

The output is

Dictionary length 4606

doc [0] out of 100
43 0
wheres 0
thing 1
subject 2
nntppostinghost 3
racwamumdedu 4
organization 5
university 6
maryland 7
college 8
lines 9
wondering 10
anyone 11
could 12
enlighten 13
sports 14
looked 15
early 16
called 17
bricklin 18
doors 19
really 20
small 21
addition 22
front 23
bumper 24
separate 25
anyone 26
tellme 27
model 28
engine 29
specs 30
years 31
production 32
history 33
whatever 34
funky 35
looking 36
please 37
email 38
thanks 39
brought 40
neighborhood 41
lerxst 42
[43, 53, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

It seems that I'm indexing the dictionary and vector wrong.

The word 'subject' should not have 43 or 53 occurrences. Those numbers are the lengths of the preprocessed word lists for the first two documents/rows (43 and 53 words).

What are X.data and X.target here?

– willeM_ Van Onsem, Commented Aug 24, 2019 at 17:32

Craig · Accepted Answer · 2019-08-24 18:04:16Z

2

The code has at least two errors:

1) In the line

freq = dict.fromkeys(self._dict, nrows * [0])

you initialize all items in the freq dictionary with the same list. The expression nrows * [0] is evaluated once to create a single list, which is then passed to dict.fromkeys(). A reference to this one list is assigned to every key in the freq dictionary, so no matter which key you select, you get the same list back. This is a common gotcha in Python.

Instead, you can use a dictionary comprehension to create the entries with separate lists:

freq = {key: nrows * [0] for key in self._dict}
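The difference is easy to see on a toy example. This is a minimal sketch with made-up values (nrows = 3 and a two-word vocabulary stand in for the real dataset); it shows that an increment made through one key appears under every key when dict.fromkeys() is used. That matches the output in the question: every word of document 0 increments position 0 of the one shared list, giving 43, and every word of document 1 increments position 1, giving 53.

```python
nrows = 3                       # illustrative size, not the real dataset
words = ["subject", "lines"]    # illustrative vocabulary

# dict.fromkeys evaluates nrows * [0] once, so every key shares one list.
shared = dict.fromkeys(words, nrows * [0])
shared["subject"][0] += 1
print(shared["lines"])                       # [1, 0, 0]
print(shared["subject"] is shared["lines"])  # True

# The comprehension evaluates nrows * [0] once per key.
separate = {word: nrows * [0] for word in words}
separate["subject"][0] += 1
print(separate["lines"])                     # [0, 0, 0]
```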

2) You use i as your indexing variable for the vec, but you meant to use j:

vec[j] += 1

Using variables with descriptive names would help avoid this type of confusion.

edited Aug 24, 2019 at 18:04

answered Aug 24, 2019 at 17:59


1 Comment

murthaA Over a year ago

Thanks, Craig. My problem was the initialization; indeed, the keys were all referencing the same list. The i indexing is correct, and it worked out.
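For reference, here is a rough sketch of the table as the question describes it, with one separate NumPy array per word and the document row i as the index. The function name build_freq_table and the toy inputs are hypothetical, and the documents are assumed to be already tokenized (standing in for the question's preprocess step, which is not shown).

```python
import numpy as np

def build_freq_table(docs, vocabulary):
    """Return {word: array of per-document counts} (hypothetical helper)."""
    nrows = len(docs)
    # One fresh array per word, so no two keys share storage.
    freq = {word: np.zeros(nrows, dtype=int) for word in vocabulary}
    for i, words in enumerate(docs):
        for w in words:
            if w in freq:
                freq[w][i] += 1  # i is the document (row) index
    return freq

# Toy data: two already-tokenized documents.
docs = [["subject", "lines", "subject"], ["lines", "email"]]
vocab = {"subject", "lines", "email"}
freq = build_freq_table(docs, vocab)
print(freq["subject"])  # [2 0]
```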
