Python html extract number after a key word

Question

line1 = " The median income for a household in the city was $64,411, and the median income for a family was $78,940. The per capita income for the city was $22,466. About 4.3% of families and 5.9% of the population were below the poverty line, including 7.0% of those under age 18 and 12.3% of those age 65 or over."

line2 = " The median income for a household in the city was $31,893, and the median income for a family was $38,508. Males had a median income of $30,076 versus $20,275 for females. The per capita income for the city was $16,336. About 14.1% of families and 16.7% of the population were below the poverty line, including 21.8% of those under age 18 and 21.0% of those age 65 or over."

Expected output:

household median income: $64,411
family median income: $78,940
per capita income: $22,466



[householdIncome, familyIncome, perCapitalIncome] = re.findall(r"\d+,\d+", line1)

This works for line1, but line2 raises:

ValueError: too many values to unpack (expected 3)

The main objective is to find the first number that follows each keyword.

Some lines do not include the per capita income; an empty string "" is acceptable for those.

Jan · Accepted Answer · 2017-03-18 21:40:47Z

As pointed out by others, you'll need some additional programming logic. Consider the following example, which uses a regular expression to find the values in question and averages the two figures when an amount is given as a "versus" pair:

import re, locale
from locale import atoi
locale.setlocale( locale.LC_ALL, 'en_US.UTF-8' )

lines = ["The median income for a household in the city was $64,411, and the median income for a family was $78,940. The per capita income for the city was $22,466. About 4.3% of families and 5.9% of the population were below the poverty line, including 7.0% of those under age 18 and 12.3% of those age 65 or over.",
"The median income for a household in the city was $31,893, and the median income for a family was $38,508. Males had a median income of $30,076 versus $20,275 for females. The per capita income for the city was $16,336. About 14.1% of families and 16.7% of the population were below the poverty line, including 21.8% of those under age 18 and 21.0% of those age 65 or over."]

# define the regex
rx = re.compile(r'''
        (?P<type>household|family|per\ capita)
        \D+
        \$(?P<amount>\d[\d,]*\d)
        (?:
            \s+versus\s+
            \$(?P<amount2>\d[\d,]*\d)
        )?''', re.VERBOSE)

def afterwork(match):
    if match.group('amount2'):
        amount = (atoi(match.group('amount')) + atoi(match.group('amount2'))) / 2
    else:
        amount = atoi(match.group('amount'))
    return amount

result = {}
for index, line in enumerate(lines):
    result['line' + str(index)] = [(m.group('type'), afterwork(m)) for m in rx.finditer(line)]

print(result)
# {'line0': [('household', 64411), ('family', 78940), ('per capita', 22466)], 'line1': [('household', 31893), ('family', 38508), ('per capita', 16336)]}
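
Two caveats on the snippet above: locale.setlocale raises locale.Error on systems where the en_US.UTF-8 locale is not installed, and the result does not mark a keyword that is absent from a line. A minimal sketch of a variant that avoids locale and returns "" for a missing keyword, as the question allows, might look like the following. The names INCOME_RX and extract_incomes are made up for illustration and are not part of the answer above.

```python
import re

# Hypothetical helper names; this is a sketch, not the answer's own method.
# Group 1 is the keyword, group 2 the first dollar amount that follows it.
INCOME_RX = re.compile(r"(household|family|per capita)\D+(\$\d+(?:,\d{3})*)")

def extract_incomes(line):
    """Return the first amount after each keyword, or "" if the keyword is absent."""
    found = {"household": "", "family": "", "per capita": ""}
    for keyword, amount in INCOME_RX.findall(line):
        if not found[keyword]:  # keep only the first amount per keyword
            found[keyword] = amount
    return found

line2 = ("The median income for a household in the city was $31,893, and the "
         "median income for a family was $38,508. Males had a median income of "
         "$30,076 versus $20,275 for females. The per capita income for the "
         "city was $16,336.")
short = "The median income for a household in the city was $31,893."

print(extract_incomes(line2))
# {'household': '$31,893', 'family': '$38,508', 'per capita': '$16,336'}
print(extract_incomes(short))
# {'household': '$31,893', 'family': '', 'per capita': ''}
```

Keeping the amounts as strings such as "$64,411" matches the expected output in the question; int(amount[1:].replace(",", "")) converts one to a number if needed.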

Bill Bell · Accepted Answer · 2017-03-18 21:10:20Z

Executing re.findall("\d+,\d+", line2) returns ['31,893', '38,508', '30,076', '20,275', '16,336']. The immediate problem is that the regex produces five results and you have allowed for only three. There is a slightly deeper problem, however: the two sentences have different structures. In the first, household income, family income and per capita income are the first three amounts, but in the second the male and female median incomes come before the per capita income. You need a more careful analysis of the sentence than taking the numbers in order.
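
The mismatch can be reproduced with a short sketch like this one, assuming line2 holds the second sentence from the question (trimmed here after the per capita figure, which does not change the matches):

```python
import re

line2 = ("The median income for a household in the city was $31,893, and the "
         "median income for a family was $38,508. Males had a median income of "
         "$30,076 versus $20,275 for females. The per capita income for the "
         "city was $16,336.")

matches = re.findall(r"\d+,\d+", line2)
print(matches)       # ['31,893', '38,508', '30,076', '20,275', '16,336']
print(len(matches))  # 5

# Unpacking five values into three names raises the error from the question.
try:
    household, family, per_capita = matches
    error_message = ""
except ValueError as exc:
    error_message = str(exc)
print(error_message)  # too many values to unpack (expected 3)
```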

1 Comment

zhan2383 Over a year ago

You are exactly right. I just modified the question. Most lines have the same keywords ("household", "family", "per capita income"), but some do not. I hope to be able to identify each keyword and its related value.

Randolf Johnson · Accepted Answer · 2017-03-18 21:06:09Z

In line2, findall finds more than three matches, and you are trying to unpack them into only three variables.

Use something like this:

[householdIncome, familyIncome, perCapitalIncome] = re.findall(r"\d+,\d+", line1)[:3]
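
One thing to watch with slicing: it silences the ValueError, but on line2 the third match is the male median income rather than the per capita figure. The sketch below shows this and, as one possible alternative, anchors the search on the keyword itself. The variable names are illustrative only.

```python
import re

line2 = ("The median income for a household in the city was $31,893, and the "
         "median income for a family was $38,508. Males had a median income of "
         "$30,076 versus $20,275 for females. The per capita income for the "
         "city was $16,336.")

# Taking the first three matches no longer raises, but the third is wrong here.
household, family, third = re.findall(r"\d+,\d+", line2)[:3]
print(third)  # 30,076 (the male median income, not the per capita income)

# Searching from the keyword finds the intended amount, or "" when it is absent.
match = re.search(r"per capita\D+\$(\d+(?:,\d{3})*)", line2)
per_capita = match.group(1) if match else ""
print(per_capita)  # 16,336
```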
