Matching words with Regex (Python 3)

Question

I had been staring at this problem for hours, I don't know what regex format to use to solve this problem.

Problem:

Given the following input strings, find all possible output words 5 characters or longer.

qwertyuytresdftyuioknn
gijakjthoijerjidsdfnokg

Your program should find all possible words (5+ characters) that can be derived from the strings supplied. Use http://norvig.com/ngrams/enable1.txt as your search dictionary. The order of the output words doesn't matter.

queen question
gaeing garring gathering gating geeing gieing going goring

Assumptions about the input strings:

QWERTY keyboard
Lowercase a-z only, no whitespace or punctuation
The first and last characters of the input string will always match the first and last characters of the desired output word.
Don't assume users take the most efficient path between letters
Every letter of the output word will appear in the input string

Attempted solution:

First I downloaded the the words from that webpage and store them in a file in my computer ('words.txt'):

import requests
res = requests.get('http://norvig.com/ngrams/enable1.txt')
res.raise_for_status()
fp = open('words.txt', 'wb')
for chunk in res.iter_content(100000):
    fp.write(chunk)
fp.close()

I'm then trying to match the words I need using regex. The problem is that I don't know how to format my re.compile() to achieve this.

import re
input = 'qwertyuytresdftyuioknn'         #example
fp= open('words.txt')
string = fp.read()

regex = re.compile(input[0]+'\w{3,}'+input[-1])   #wrong need help here
regex.findall(string)

As it's obvious, it's wrong since I need to match letters from my input string going form left to right, not any letters which I'm mistakenly doing with \w{3,}. Any help into this would be greatly appreciated.

Peter · Accepted Answer · 2016-11-15 05:57:18Z

1

This feels a bit like a homework problem. Thus, I won't post the full answer, but will try to give some hints: Character groups to match are given between square brackets [adfg] will match any of the letters a, d, f or g. [adfg]{3,} will match any part with at least 3 of these letters. Looking at your list of words, you only want to match whole lines. If you pass re.MULTILINE as the second argument to re.compile, ^ will match the beginning and $ the end of a line.

Addition:

If the characters can only appear in the order given and assuming that each character can appear any number of times: 'qw*e*r*t*y*u*y*t*r*e*s*d*f*t*y*u*i*o*k*n*n'. However, we will also have to have at least 5 characters in total. A positive lookbehind assertion (?<=\w{5}) added to the end will ensure that.

edited Nov 15, 2016 at 5:57

answered Nov 15, 2016 at 5:04

Peter

5531 gold badge5 silver badges13 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

tadm123 Over a year ago

reddit.com/r/dailyprogrammer/comments/53ijnb/… It's actually a challenge from reddit just for practice.

tadm123 Over a year ago

Peter but I need to match letters sweeping the word from left to right... `[adfg]{3,} matches letters in any order.

Peter Over a year ago

From pythons regular expressions documentation: (?<=...) Matches if the current position in the string is preceded by a match for ... that ends at the current position. Thus ^qw*e*r*t*y*u*y*t*r*e*s*d*f*t*y*u*i*o*k*n*(?<=\w{4})n$ ?

tadm123 Over a year ago

Lastly is there a way to make it so that each character can appear just once?

Peter Over a year ago

? matches 0 or 1 repetition. * matches 0 or more repetitions.

Collectives™ on Stack Overflow

Matching words with Regex (Python 3)

1 Answer 1

5 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

5 Comments

Your Answer

Sign up or log in

Post as a guest

Related