Clarify grep & regex

Question

I need to find the set of words that are 10 characters long and that contain a substring of three consecutive vowels. So far I have tried these commands:

grep -E '^.{10}$'| grep 'a*.e*.i*.o*.u*' words2.txt
grep -E '^.{10}$&a*.e*.i*.o*.u*' words2.txt

Input data, extracted via OCR of this screenshot:

unpernicious
unperspicuous
unpervious
unpious
unpiteous
unpiteously
unpiteousness
unplebeian
unplenteous
unportmanteaued
unportuous
unprecarious
unprecious
unprecocious
unpredacious
unpresumptuous
unpresumptuously
unpretentious
unpretentiously
unpretentiousness
unpromiscuous
unpropitious
unpropitiously
unpropitiousness
unpugnacious
unpunctilious
unquailed
unquailing
unquailingly
unqueen
unqueened
unqueening
unqueenlike
unqueenly
unquiescence
unquiescent
unquiescently
unquiet
unquietable
unquieted
unquieting
unquietly
unquietness
unquietude
unrapacious
unrebellious
unreligious
unreligiously
unreligiousness
unrighteous
unrighteously
unrighteousness
unsacrilegious
Unsagacious
unsalubrious
unsanctimonious
unsanctimoniously
unsanctimoniousness
unsanguineous
unsanguineously
unseditious
unseeable
unseeing

Should it report words like plateauing (4 consecutive vowels)?
– Stéphane Chazelas, Commented Apr 26, 2017 at 12:01
Please be more specific in your question's title. That new "need a help and clarify with grep & regex" title is not useful and won't help you get answers or help people with a similar need find this Q&A. The original title ("Find the set of words that are exactly 10 characters long and that contain a substring of 3 consecutive vowels") was a lot better.
– Stéphane Chazelas, Commented Apr 27, 2017 at 11:19

Stéphane Chazelas · Accepted Answer · 2017-04-26 15:27:11Z

Your problem is (IMHO) better solved with awk, but I'll first point out a problem with your command:

grep -E '^.{10}$'| grep 'a*.e*.i*.o*.u*' words2.txt

To filter the contents of the file words2.txt through both grep invocations, this ought to look like

grep -E '^.{10}$' words2.txt | grep 'a*.e*.i*.o*.u*'

The second grep pattern should be [auoie]{3}, which requires -E because {3} is extended regular expression syntax. That lands us at

grep -E '^.{10}$' words2.txt | grep -E '[aouie]{3}'

The input to the first grep is your file. The input to the second grep is the output of the first grep, not your file.
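
A minimal sketch of that difference, assuming a throwaway file named sample.txt with three made-up entries (both the file name and the variable names are illustrative, not from the question). It shows that a grep given a file operand ignores whatever arrives on its standard input:

```shell
# Hypothetical sample: words of 9, 10 and 11 characters, each with 3 vowels in a row.
tmp=$(mktemp -d)
printf '%s\n' unpiteous unpervious unreligious > "$tmp/sample.txt"

# Stand-in for the right-hand side of the original pipeline: because a file
# operand is present, grep never reads stdin, so the length filter is lost.
wrong=$(grep -E '[aeiou]{3}' "$tmp/sample.txt" < /dev/null)

# Corrected order: the file goes to the first grep, the second reads the pipe.
right=$(grep -E '^.{10}$' "$tmp/sample.txt" | grep -E '[aeiou]{3}')

printf 'file on the second grep:\n%s\n\nfile on the first grep:\n%s\n' "$wrong" "$right"
rm -r "$tmp"
```

With the file on the second grep, all three words come back regardless of length; with the file on the first grep, only the 10-character word survives.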

Using a POSIX awk (like recent versions of GNU awk):

$ awk 'length == 10 && /[aouei]{3}/' words2.txt
unpervious
unplebeian
unportuous
unprecious
unquailing
unqueening
unquieting
unquietude

mawk, BSD awk and historical pre-POSIX implementations of awk don't support {n} in regular expressions as pointed out by Stéphane Chazelas.
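
For awk implementations without {n} support, one workaround is to spell the repetition out. This is a sketch only; the function name ten_char_three_vowels is made up for illustration:

```shell
# Hypothetical wrapper: same test as above, but without the {3} interval
# expression, so it should also work in awks that lack {n} support.
ten_char_three_vowels() {
    awk 'length == 10 && /[aeiou][aeiou][aeiou]/' "$@"
}

# Reads standard input when no file name is given.
printf '%s\n' unpiteous unpervious unreligious unquietude |
    ten_char_three_vowels
```

Called as ten_char_three_vowels words2.txt, it reads the file instead of standard input.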

Stephen Rauch · Accepted Answer · 2017-04-26 06:00:46Z


You had the 10 characters right, but to find 3 vowels in a row, look for a group [AEIOU]:

egrep '^.{10}$' | egrep -i '[AEIOU]{3}'

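To reject whitespace, use a character class (inside a bracket expression, \t means a backslash or the letter t, not a tab):

egrep '^[^[:space:]]{10}$' | egrep -i '[AEIOU]{3}'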

edited Apr 26, 2017 at 6:00

answered Apr 26, 2017 at 5:50


it worked, but some of them has words less than 10 characters :(
– Mariyam Mohammed Jalil, Commented Apr 26, 2017 at 5:51
Is there extra whitespace?
– Stephen Rauch, Commented Apr 26, 2017 at 5:53
nope, there aren't any whitespaces, but there are characters less than 10, I need characters either 10 or more than 10 that contain a substring of 3 consecutive vowels
– Mariyam Mohammed Jalil, Commented Apr 26, 2017 at 6:00
To help further, you are likely going to need to show some sample data in your post.
– Stephen Rauch, Commented Apr 26, 2017 at 6:01
@StephenRauch The OP is putting the input file name at the very end of the command line.
– Kusalananda, Commented Apr 26, 2017 at 11:59


score 2 · Accepted Answer · 2017-04-26 22:42:35Z


Assuming 1 word/line, you can do this:

sed -nE '/^.{10}$/!d;/[aAeEiIoOuU]{3}/p' words.txt
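
The same script spread over several lines, as a sketch; the wrapper name ten_char_vowels is hypothetical and only there to make the example callable:

```shell
# Hypothetical wrapper around the sed script, one command per line.
ten_char_vowels() {
    sed -nE '
        # Delete the line and start the next cycle unless it is exactly 10 characters.
        /^.{10}$/!d
        # Of the lines that remain, print those holding 3 vowels in a row.
        /[aAeEiIoOuU]{3}/p
    ' "$@"
}

printf '%s\n' Unpervious unquieted unsanguineous | ten_char_vowels
```

Because -n suppresses automatic printing, only lines that reach the p command are shown; here that is the 10-character entry, whatever its letter case.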

edited Apr 26, 2017 at 22:42

answered Apr 26, 2017 at 6:44

user218374


Stéphane Chazelas · Accepted Answer · 2017-04-26 12:17:04Z


With grep built with PCRE support:

grep -iPx '(?=.*[aeiou]{3}.*).{10}'

Or:

grep -wiP '(?=\w*[aeiou]{3}\w*)\w{10}'

to search for those words when they're not one per line (add -o, if your grep implementation supports it, to print only the matching words instead of the whole lines they're found in). Here, a word means any sequence of word characters: letters, digits and underscore. By default, letters are limited to the Latin script without diacritics; add (*UCP) to cover letters in any script, though that still won't treat characters like é or α as vowels.
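
Where grep lacks PCRE support, free-running text can instead be split into one word per line first. This is a different technique from the lookahead above, sketched with POSIX tr and grep -E; the function name ten_char_words is hypothetical, and a word here means a run of letters only (no digits or underscores):

```shell
# Hypothetical helper, not the PCRE approach: split, then filter twice.
ten_char_words() {
    # 1. tr turns every run of non-letters into a single newline.
    # 2. grep -Ex keeps lines that are exactly 10 characters long.
    # 3. grep -Ei keeps those with 3 consecutive vowels, ignoring case.
    tr -cs '[:alpha:]' '\n' |
        grep -Ex '.{10}' |
        grep -Ei '[aeiou]{3}'
}

printf 'An unquietude, so unpiteous; plateauing!\n' | ten_char_words
```

A word with four vowels in a row, such as plateauing, is reported as well, since it necessarily contains a run of three.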

edited Apr 26, 2017 at 12:17

answered Apr 26, 2017 at 11:55
