1

I had been staring at this problem for hours, I don't know what regex format to use to solve this problem.

Problem:

Given the following input strings, find all possible output words 5 characters or longer.

  1. qwertyuytresdftyuioknn
  2. gijakjthoijerjidsdfnokg

Your program should find all possible words (5+ characters) that can be derived from the strings supplied. Use http://norvig.com/ngrams/enable1.txt as your search dictionary. The order of the output words doesn't matter.

  1. queen question
  2. gaeing garring gathering gating geeing gieing going goring

Assumptions about the input strings:

  • QWERTY keyboard
  • Lowercase a-z only, no whitespace or punctuation
  • The first and last characters of the input string will always match the first and last characters of the desired output word.
  • Don't assume users take the most efficient path between letters
  • Every letter of the output word will appear in the input string

Attempted solution:

First I downloaded the the words from that webpage and store them in a file in my computer ('words.txt'):

import requests
res = requests.get('http://norvig.com/ngrams/enable1.txt')
res.raise_for_status()
fp = open('words.txt', 'wb')
for chunk in res.iter_content(100000):
    fp.write(chunk)
fp.close()

I'm then trying to match the words I need using regex. The problem is that I don't know how to format my re.compile() to achieve this.

import re
input = 'qwertyuytresdftyuioknn'         #example
fp= open('words.txt')
string = fp.read()

regex = re.compile(input[0]+'\w{3,}'+input[-1])   #wrong need help here
regex.findall(string)

As it's obvious, it's wrong since I need to match letters from my input string going form left to right, not any letters which I'm mistakenly doing with \w{3,}. Any help into this would be greatly appreciated.

1 Answer 1

1

This feels a bit like a homework problem. Thus, I won't post the full answer, but will try to give some hints: Character groups to match are given between square brackets [adfg] will match any of the letters a, d, f or g. [adfg]{3,} will match any part with at least 3 of these letters. Looking at your list of words, you only want to match whole lines. If you pass re.MULTILINE as the second argument to re.compile, ^ will match the beginning and $ the end of a line.

Addition:

If the characters can only appear in the order given and assuming that each character can appear any number of times: 'qw*e*r*t*y*u*y*t*r*e*s*d*f*t*y*u*i*o*k*n*n'. However, we will also have to have at least 5 characters in total. A positive lookbehind assertion (?<=\w{5}) added to the end will ensure that.

Sign up to request clarification or add additional context in comments.

5 Comments

reddit.com/r/dailyprogrammer/comments/53ijnb/… It's actually a challenge from reddit just for practice.
Peter but I need to match letters sweeping the word from left to right... `[adfg]{3,} matches letters in any order.
From pythons regular expressions documentation: (?<=...) Matches if the current position in the string is preceded by a match for ... that ends at the current position. Thus ^qw*e*r*t*y*u*y*t*r*e*s*d*f*t*y*u*i*o*k*n*(?<=\w{4})n$ ?
Lastly is there a way to make it so that each character can appear just once?
? matches 0 or 1 repetition. * matches 0 or more repetitions.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.