Python program to get the first word

Question

Given a string, the task is to find its first word, subject to some rules:

The string can contain periods and commas
A word can start with a letter, a period or a space
A word can contain one apostrophe and still be valid

For example:

assert first_word("Hello world") == "Hello"
assert first_word(" a word ") == "a"
assert first_word("don't touch it") == "don't"
assert first_word("greetings, friends") == "greetings"
assert first_word("... and so on ...") == "and"
assert first_word("hi") == "hi"
assert first_word("Hello.world") == "Hello"

The code:

import re

def first_word(text: str) -> str:
    """
        returns the first word in a given text.
    """
    text = re.sub(r"[^A-Za-z'\s.]", '', text)
    words = text.split()
    for word in words:
        for i in range(len(word)):
            if word[i].isalpha() or word[i] == "'":
                if i == len(word) - 1:
                    if word.find('.') != -1:
                        return word.split('.')[0]
                    else:
                        return word

How could we improve it?

Why is first_word(" a word ") == "a" if a word can start with a space? – Martin R, commented May 25, 2019 at 15:48

Justin · Accepted Answer · 2019-05-28 10:11:29Z

You could make the code better (and shorter) by using a regex to split the string around its first word. For example, for Hello.world the resulting list is ['', 'Hello', ''], so the first word, when there is one, is at index [1]. Like this:

import re
def first_word(s):
    return re.split(r"(\b[\w']+\b)(?:.+|$)", s)[1]
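
To see why index [1] holds the first word, it may help to print what re.split returns for a few inputs. This is only an illustrative sketch; the helper name show_split is made up for the example. Note the last case: when the text contains no word at all, the list has a single element, so first_word as written would raise an IndexError.

```python
import re

PATTERN = r"(\b[\w']+\b)(?:.+|$)"

def show_split(s):
    # Hypothetical helper: returns the raw list produced by re.split.
    # Because the pattern has a capturing group, the captured word is
    # kept in the result, between the text before and after the match.
    return re.split(PATTERN, s)

print(show_split("Hello.world"))        # ['', 'Hello', '']
print(show_split("... and so on ..."))  # ['... ', 'and', '']
print(show_split("..."))                # ['...'] (no word, so no index 1)
```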

Here are some tests:

tests = [
    ("Hello world", "Hello"),
    (" a word ", "a"),
    ("don't touch it", "don't"),
    ("greetings, friends", "greetings"),
    ("... and so on ...", "and"),
    ("hi", "hi"),
    ("Hello.world", "Hello"),
    ("Hello.world blah", "Hello"),
]

# Check each input against its expected first word, then print the result.
for text, expected in tests:
    assert first_word(text) == expected
    print(first_word(text))

(\b[\w']+\b)(?:.+|$) is used above, where (\b[\w']+\b) captures the first word of the string, so that it is kept in the list returned by re.split. \b allows you to perform a "whole words only" search using a regular expression in the form of \b"word"\b. Note that using [\w']+ (instead of \w+) leaves the apostrophe in don't. (?:.+|$) is a non-capturing group that consumes the rest of the string after the first word, or matches the end of the string when nothing follows.
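
The effect of the apostrophe in the character class can be checked directly. This is a small sketch that uses re.search purely for illustration (the answer itself uses re.split); the variable names are made up for the example.

```python
import re

text = "don't touch it"

# \w alone stops at the apostrophe, because ' is not a word character.
without_apostrophe = re.search(r"\b\w+\b", text).group()

# Adding ' to the character class keeps the contraction in one piece.
with_apostrophe = re.search(r"\b[\w']+\b", text).group()

print(without_apostrophe)  # don
print(with_apostrophe)     # don't
```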

Here are the expected outputs:

Hello
a
don't
greetings
and
hi
Hello
Hello

After timing it -

%timeit first_word(test)
>>> 1.54 µs ± 17.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
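
%timeit is an IPython magic. Outside IPython, a comparable measurement could be sketched with the standard timeit module. The input string and the number of runs below are arbitrary choices, and the resulting figure will vary from machine to machine.

```python
import re
import timeit

def first_word(s):
    return re.split(r"(\b[\w']+\b)(?:.+|$)", s)[1]

# Total time for `runs` calls, converted to microseconds per call.
runs = 100_000
total = timeit.timeit(lambda: first_word("Hello.world blah"), number=runs)
per_call_us = total / runs * 1e6
print(f"{per_call_us:.2f} us per call")
```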

NOTE - A delimiter is a sequence of one or more characters used to specify the boundary between separate, independent regions in plain text or other data streams. An example of a delimiter is the comma character, which acts as a field delimiter in a sequence of comma-separated values.

Hope this helps!

Jamal · Accepted Answer · 2019-07-20 02:53:59Z

Your code looks pretty great, much better than mine!

The beauty of regular expressions is that sometimes a single expression can do the entire task, as in our case here, which saves writing additional if/then logic. Here, we could use an expression similar to:

(\b[\w']+\b)(?:.+|$)

which wraps our desired first word in a capturing group:

(\b[\w']+\b)

followed by a non-capturing group:

(?:.+|$)

Of course, if we wish to add more boundaries or reduce our boundaries or change our char list [\w'], we can surely do so.
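
For instance, \w also matches digits and underscores, while the question only speaks of letters. One possible change to the char list, shown here as a sketch and not as part of the expression above, swaps [\w'] for [A-Za-z'] and uses re.search so that a text without any word yields None. The name first_word_letters is hypothetical.

```python
import re

# Same structure as the expression above, but the char list is
# restricted to ASCII letters and the apostrophe.
LETTERS_ONLY = re.compile(r"(\b[A-Za-z']+\b)(?:.+|$)")

def first_word_letters(text):
    # Return the first letters-only word, or None if there is none.
    match = LETTERS_ONLY.search(text)
    return match.group(1) if match else None

print(first_word_letters("don't touch it"))  # don't
print(first_word_letters("42 apples"))       # apples
print(first_word_letters("..."))             # None
```

With the original [\w'] char list, "42 apples" would give 42 instead, so which variant fits depends on whether numbers should count as words.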

Test

Let's test our expression with re.finditer to see if that would work:

# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility

import re

regex = r"(\b[\w']+\b)(?:.+|$)"

test_str = ("Hello world\n"
     " a word \n"
     "don't touch it\n"
     "greetings, friends\n"
     "... and so on ...\n"
     "hi\n"
     "Hello.world")

matches = re.finditer(regex, test_str, re.MULTILINE)

for matchNum, match in enumerate(matches, start=1):

    print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))

    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1

        print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))

# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.

Output

Match 1 was found at 0-11: Hello world
Group 1 found at 0-5: Hello
Match 2 was found at 13-20: a word 
Group 1 found at 13-14: a
Match 3 was found at 21-35: don't touch it
Group 1 found at 21-26: don't
Match 4 was found at 36-54: greetings, friends
Group 1 found at 36-45: greetings
Match 5 was found at 59-72: and so on ...
Group 1 found at 59-62: and
Match 6 was found at 73-75: hi
Group 1 found at 73-75: hi
Match 7 was found at 76-87: Hello.world
Group 1 found at 76-81: Hello

RegEx Circuit

jex.im visualizes regular expressions.

Basic Performance Test

const repeat = 1000000;
const start = Date.now();

for (var i = repeat; i >= 0; i--) {
	const regex = /(\b[\w']+\b)(?:.+|$)/gm;
	const str = `Hello.world`;
	const subst = `$1`;

	var match = str.replace(regex, subst);
}

const end = Date.now() - start;
console.log("YAAAY! \"" + match + "\" is a match 💚💚💚 ");
console.log(end / 1000 + " is the runtime of " + repeat + " times benchmark test. 😳 ");
