I'm building a Markov text poetry generator. The main method, which is attached below, performs a depth-first search through the Markov chain, which is implemented as a NetworkX digraph of Word objects, which have methods for determining whether e.g. a Word rhymes with another Word instance.
At each level of the search, I filter the successor nodes to those that match user-supplied constraints - for example, "first word in sequence must rhyme with last word in sequence", so lines of poetry with different poetic devices can be built from the Markov chain.
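For context, the Constraint and Predicate objects used by the generator are roughly this shape (a simplified sketch with the same attribute names as my code; the real classes live in the full source):

```python
from functools import partial


class Constraint:
    """A relation that must hold between the words at the given
    sequence indices, e.g. Constraint(Word.rhymeswith, [0, 3])."""
    def __init__(self, method, indices):
        self.method = method      # binary predicate on two words
        self.indices = indices    # sequence positions it relates

class Predicate:
    """A constraint partially applied to an already-chosen word,
    to be checked against candidates at one future index."""
    def __init__(self, partial_fn, index):
        self.partial = partial_fn  # unary test for candidate words
        self.index = index         # level at which to apply it
```

When the word at a constraint's first index is chosen, the remaining indices are turned into Predicates by currying the constraint's method with `functools.partial`, which is exactly what `apply_constraints` in the code below does.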
The code works, but with a large corpus (27,412 unique words, 160,892 edges between words), certain combinations of constraints, especially rhyming ones, take a very long time to produce a satisfying sequence, even assuming one exists in the chain. Here's an example:
>>> vg = VerseGenerator(some_very_long_text)
>>> cs = [Constraint(Word.rhymeswith, [0, 3])]
>>> vg.build_sequence(4, [], cs)
[Word(among), Word(the), Word(white), Word(young)]  # 204 ms via IPython %timeit
>>> cs = [Constraint(Word.rhymeswith, [0, 9])]
>>> vg.build_sequence(10, [], cs)  # doesn't terminate within at least 10 minutes
Can anyone see a way to speed this up? I'm experimenting with Cython, but it will be a while before I'm up to speed with it. Perhaps there is some graph-theoretic property or algorithm I could use to preprocess the chain?
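One preprocessing idea I've been sketching, in case it helps frame an answer: bucket the words into rhyme classes once per corpus, so that "rhymes with word 0" becomes a set lookup instead of a pairwise `rhymeswith` call on every candidate at the final level. This is untested, and `rhyme_key` is a stand-in for however Word derives its rhyme signature internally:

```python
from collections import defaultdict


def rhyme_classes(words, rhyme_key):
    """Bucket words by a hashable rhyme signature so that every word
    rhyming with a given word can be found with a single dict lookup.
    `rhyme_key` is hypothetical here; in practice it would be whatever
    Word.rhymeswith compares (e.g. final stressed vowel + coda)."""
    classes = defaultdict(set)
    for word in words:
        classes[rhyme_key(word)].add(word)
    return classes

# Toy demonstration with plain strings and a crude last-two-letters key.
classes = rhyme_classes(['young', 'among', 'white'], lambda w: w[-2:])
```

The buckets are built once; at the final index the search could then intersect the candidate successors with the bucket of the first word, rather than testing each candidate individually.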
The code in its entirety is available on my Github. I'm using Python 3. I hope I've explained it clearly enough; if not, please point out what other information I should include.
def build_sequence(self, length, predicates, constraints, order='roulette'):
    '''Traverse the graph in depth-first order to build a sequence
    of the given length which satisfies the supplied predicates. If
    no matching sequence is found, return the partially constructed
    sequence. Begins with a randomly selected word from the graph.
    '''
    # If there are any multiple-word constraints that begin at
    # this level, create predicates from them so they can be
    # applied at future search levels.
    def apply_constraints(cands, path, level, preds, cons):
        cs = [c for c in cons if c.indices[0] == level]
        search_nodes = []
        for can in cands:
            new_preds = []
            for con in cs:
                curried = partial(con.method, can)
                sub = con.indices[1:]
                new_preds.extend([Predicate(curried, i) for i in sub])
            rec = {'word': can, 'parent': path, 'level': level,
                   'preds': preds + new_preds}
            search_nodes.append(rec)
        return search_nodes

    stack = []
    level = 0

    # Get candidate starting nodes.
    apply = [p.partial for p in predicates if p.index == level]
    cands = self.filter_words(self.chain.nodes(), apply)
    random.shuffle(cands)
    branches = apply_constraints(cands, None, level, predicates, constraints)
    stack.extend(branches)

    while stack and level < length:
        path = stack.pop()
        prev = path['word']
        level = path['level'] + 1
        preds = path['preds']
        # Filter the successors of the previous word to get candidates
        # for the next word in the sequence.
        apply = [p.partial for p in preds if p.index == level]
        succ = self.shuffled_successors(prev, order=order)
        cands = self.filter_words(succ, apply)
        branches = apply_constraints(cands, path, level, preds, constraints)
        stack.extend(branches)

    # Walk backward through the final search record to build the
    # resulting sentence.
    result = []
    for i in range(level):
        result.append(path['word'])
        path = path['parent']
    return list(reversed(result))