1

I have a string that contains salary information in the following way:

salaryMixed = "£25,000 - £30,000"

Sometimes it will look like this:

salaryMixed = "EUR25,000 - EUR30,000"

And other times like this:

salaryMixed = "£37.50 - £50.00"

What I want to do is to remove all characters but the numeric values and then split the two values so as to place them into their own respective variables that reflect low banding and high banding. So far I have:

if salaryMixed.find('£')!=-1: # found £ char
    salaryMixed = salaryMixed.replace("£", "")
if salaryMixed.find('-')!=-1: # found hyphen
    salaryMixed = salaryMixed.replace("-", "")
if salaryMixed.find(',')!=-1: # found comma
    salaryMixed = salaryMixed.replace(",", "")
if salaryMixed.find('EUR')!=-1: # found EUR
    salaryMixed = salaryMixed.replace("EUR", "")
salaryMixed = re.sub('\s{2,}', ' ', salaryMixed) # to remove multiple space

if len(salaryList) == 1:
    salaryLow = map(int, 0) in salaryList
    salaryHigh = 00000
else:
    salaryLow = int(salaryList.index(1))
    salaryHigh = int(salaryList.index(2))

But I am stumped on how to split the two values up, and also on how to handle the decimal point when salaryMixed isn't an annual salary but an hourly rate, as in salaryMixed = "£37.50 - £50.00". Isn't that a float?

I am wanting to store this information in a MySQL DB later on in the code but I have described the table as:

CREATE TABLE jobs(
   job_id INT NOT NULL AUTO_INCREMENT,
   job_title VARCHAR(300) NOT NULL,
   job_salary_low INT(25),
   job_salary_high INT(25),
   PRIMARY KEY ( job_id )
);

What is the best approach here? Thanks.

3
  • Why not search for the regex [\d,.]+? That will tell you where both numbers are in the string. Then you can preprocess (remove commas etc) and transform it into a number. Commented Jun 12, 2013 at 3:58
  • Just a small point, your table schema has INT for the salary columns, but your values are floats. Better switch to DECIMAL at the MySQL end. Commented Jun 12, 2013 at 4:19
  • @Burhan: ALTERED to reflect your suggestion. Thanks. Commented Jun 12, 2013 at 14:10

3 Answers

1

This is a good case for a regular expression from the Python re module. You'll probably also want to upcast the hourly rates to annual (assuming you have a consistent average hourly

import re

def salary_band(val):
    currency = 'EUR' if 'EUR' in val else 'GBP'
    numbers = re.findall(r"[0-9.,]+", val) # runs of digits, dots and commas: the two numbers
    numbers = [i.replace(",", "") for i in numbers] # remove thousands separators
    numbers = [float(i) for i in numbers] # convert to floats
    annual = lambda p: int(p) if p > 2000 else int(p * 1800) # your number here...
    return currency, [annual(n) for n in numbers] # a list, so it works on Python 2 and 3

print(salary_band("gbp37.50 - gbp50.00"))
print(salary_band("EUR25,000 - EUR30,000"))
>> ('GBP', [67500, 90000])
>> ('EUR', [25000, 30000])

Here I'm returning the currency type and the low/high numbers as a tuple; you can unpack it easily into your table.

3 Comments

Can you explain the line annual = lambda p: int(p) if p > 2000 else int( p * 1800) to me?
Also, isn't the 2nd item of the tuple (the actual numbers) a list? So a list within a tuple? How do I "unpack" these numbers into respective variables? So far I have: salaryTuple = salary_band(salaryMixed) salaryLow = salaryTuple[0] salaryHigh = salaryTuple[1] Which isn't correct...
The lambda is an anonymous function. If the number is larger than 2000 it just returns it; otherwise it returns it times 1800 (which is a ballpark number for hours worked per year). It really should be the same number in both cases; I changed it in one place and forgot to change it in the other.
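
On the unpacking question raised in the comments: the function returns a two-item tuple whose second item is a list, so nested unpacking can pull out all three values at once. Here is a minimal sketch, assuming a version of salary_band that runs on Python 3 (list comprehensions instead of map, with 1800 kept as the ballpark hours-per-year figure). The names salary_low and salary_high are only illustrative.

```python
import re

def salary_band(val):
    """Return (currency, [low, high]) with hourly rates scaled to annual figures."""
    currency = 'EUR' if 'EUR' in val else 'GBP'
    # Runs of digits, dots and commas; strip the thousands separators.
    numbers = [float(n.replace(',', '')) for n in re.findall(r'[0-9.,]+', val)]
    # Assumption: values of 2000 or less are hourly rates;
    # 1800 is a ballpark hours-per-year figure.
    annual = lambda p: int(p) if p > 2000 else int(p * 1800)
    return currency, [annual(n) for n in numbers]

# Nested unpacking: the tuple's second item is a two-element list.
currency, (salary_low, salary_high) = salary_band("EUR25,000 - EUR30,000")
print(currency, salary_low, salary_high)
```

Indexing the result directly (salaryTuple[0], salaryTuple[1]) gives the currency and the whole list, which is why that attempt did not produce the two numbers.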
1

What I want to do is to remove all characters but the numeric values and then split the two values so as to place them into their own respective variables that reflect low banding and high banding. So far I have:

OK, taking this one step at a time. Remove all the characters but the numeric values (better to keep spaces and periods too):

>>> testcases =  ["£25,000 - £30,000", "EUR25,000 - EUR30,000", "£37.50 - £50.00"]
>>> res = [''.join(x for x in tc if x.isdigit() or x.isspace() or x == '.') for tc in testcases]
>>> res
['25000  30000', '25000  30000', '37.50  50.00']

OK, now split them:

>>> res = [x.split() for x in res]
>>> res
[['25000', '30000'], ['25000', '30000'], ['37.50', '50.00']]

Convert to floats (Decimal might be better)

>>> res = [[float(j) for j in i] for i in res]
>>> res
[[25000.0, 30000.0], [25000.0, 30000.0], [37.5, 50.0]]
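
As for the remark that Decimal might be better: floats cannot represent most decimal fractions exactly, which matters for money and for a DECIMAL column. Below is a minimal sketch of the same filter-and-split steps using decimal.Decimal from the standard library. The function name parse_band is hypothetical, and the sketch assumes the only dots in the input are decimal points inside numbers.

```python
from decimal import Decimal

def parse_band(text):
    """Return the numeric values found in text as a list of Decimal objects."""
    # Keep digits, whitespace and the decimal point; drop everything else.
    cleaned = ''.join(ch for ch in text if ch.isdigit() or ch.isspace() or ch == '.')
    # Decimal is built from the string, so "37.50" keeps its two decimal places.
    return [Decimal(part) for part in cleaned.split()]

low, high = parse_band("£37.50 - £50.00")
print(low, high)
```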

Put in separate variables

>>> for low, high in res:
...     print (low, high)
... 
25000.0 30000.0
25000.0 30000.0
37.5 50.0

Regex, as suggested by @Patashu, is the easy/lazy way to do it though.
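
The regex route can be sketched in a few lines. This uses the pattern [\d,.]+ from the comment under the question. The helper name find_amounts is hypothetical, and the sketch assumes commas only ever appear as thousands separators.

```python
import re

def find_amounts(text):
    """Return every number in text as a float, ignoring thousands separators."""
    # Each match is a run of digits, commas and dots, e.g. "25,000" or "37.50".
    matches = re.findall(r'[\d,.]+', text)
    return [float(m.replace(',', '')) for m in matches]

low, high = find_amounts("£25,000 - £30,000")
print(low, high)
```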

0

For storing the values in the DB, you can use the MySQLdb library in Python. It's easy to use and will store all your data in the database. Check it out here.

You can install it with apt-get install python-mysqldb

1 Comment

Thanks, though I forgot to mention but I already have that import.
