I'm trying to build a market analysis tool. The raw data input is formatted like this:

20,000 shares for 550 USD each

meaning "20,000 shares of stock at 550 USD per share".

Normally, I would grab the price with the following bit of code:

value = re.findall(re.compile('20,000 shares for (.*) USD each'), data)

However, this approach fails because the number of shares (in this case, 20 thousand) changes from line to line, as does the price. Is there a better way to extract this data?

I apologize in advance for the improper description of my problem; I'm a bit of a newbie to Python and I'm not sure about what technical terms to use in this scenario. If there is a better way to word my title, please feel free to edit, and thank you in advance!

    '([\d,]+) shares for ([\d,.]+) USD each' (Python's re module does not support POSIX classes such as [[:digit:]], so \d is used here) Commented Apr 13, 2013 at 4:07

2 Answers

You can use more general patterns such as:

([\d,.]+) shares for ([\d,.]+) USD each

Also if you want to stick to .* for matching values, it's better to make it less greedy by turning it into .*? so that it does not eat the rest of your input.
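As a minimal sketch of both points, the snippet below runs the general pattern through re.findall and then compares the greedy and lazy forms of .* on the same text. The sample string in data is made up for illustration and is not from the question's real input.

```python
import re

# Hypothetical sample input with two trades on one line.
data = "20,000 shares for 550 USD each; 1,500 shares for 12.75 USD each"

# Two capture groups: share count and price. findall returns one tuple per match.
pattern = re.compile(r"([\d,.]+) shares for ([\d,.]+) USD each")
trades = pattern.findall(data)
print(trades)

# Greedy .* runs to the LAST " USD each" on the line, swallowing the first trade.
greedy = re.findall(r"shares for (.*) USD each", data)
print(greedy)

# Lazy .*? stops at the FIRST " USD each" after each match start.
lazy = re.findall(r"shares for (.*?) USD each", data)
print(lazy)
```

With two capture groups, findall returns a list of (shares, price) tuples of strings, so each trade can be unpacked directly.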

If the input can use either phrasing ("for ... each" or "of stock at ... per share"), use the following instead:

([\d,.]+) shares(?: of stock)? (?:for|at) ([\d,.]+) USD (?:each|per share)

Putting ?: after the opening parenthesis makes it a non-capturing group: it still takes part in the match, but its text is not returned along with the numbers you are interested in.
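Here is a small sketch of how non-capturing groups behave with re.findall. It assumes the two phrasings shown in the question ("for ... each" and "of stock at ... per share"); the samples list is illustrative only.

```python
import re

# (?:...) groups participate in matching but are left out of the results,
# so findall still returns just the two numbers per trade.
pattern = re.compile(
    r"([\d,.]+) shares(?: of stock)? (?:for|at) ([\d,.]+) USD (?:each|per share)"
)

# Hypothetical inputs covering both phrasings from the question.
samples = [
    "20,000 shares for 550 USD each",
    "20,000 shares of stock at 550 USD per share",
]

results = [pattern.findall(line) for line in samples]
for line, found in zip(samples, results):
    print(line, "->", found)
```

If the optional and alternative parts were written with plain parentheses instead, each result tuple would gain extra elements for those groups.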


4 Comments

Would I implement this in the same way as my re.findall(re.compile()) format? Or is there another method I would use? Thank you!
@user2276631 I would think so. Also, as the Python help explains, you can access the matched groups (the contents within parentheses), so you can extract the numbers as well.
Again, sorry for my lack of ability. I currently have: list = re.findall(re.compile(('[\d,.]+) shares for ([\d,.]+) USD each'), data) The list returns empty. I know there's something obvious I'm doing wrong, but I'm not sure what it is.
@user2276631 should the sentence end in "each" or "per share"?
Use a character class to specify the share numbers and the share price in your regular expression.

(\d[\d,.]*) shares for ([\d,.]+) USD each

Depending on what your data looks like, you may not need to be as careful about capturing separators. For example, if only whole shares are traded, you don't need the decimal point in the first digit group.

If you might use the same regex on more than one dataset, it behooves you to compile it separately from using it in the findall.

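import re

# Compile once so the same pattern can be reused across datasets.
compiled_regex = re.compile(r"(\d[\d,.]*) shares for ([\d,.]+) USD each")

# data1 and data2 are the raw input strings for each dataset.
trades1 = compiled_regex.findall(data1)
trades2 = compiled_regex.findall(data2)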
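Since findall returns the captured numbers as strings, a market analysis tool will probably want to convert them. The sketch below is one way to do that, assuming whole shares with comma thousands separators and prices with an optional decimal point; parse_trades and the sample text are hypothetical names introduced here for illustration.

```python
import re
from decimal import Decimal

# Assumption: share counts are whole numbers, so no decimal point in group 1.
TRADE_RE = re.compile(r"(\d[\d,]*) shares for ([\d,.]+) USD each")


def parse_trades(text):
    """Return a list of (shares, price) tuples with numeric types."""
    trades = []
    for shares, price in TRADE_RE.findall(text):
        # Strip thousands separators before converting.
        trades.append((int(shares.replace(",", "")), Decimal(price.replace(",", ""))))
    return trades


# Hypothetical sample input, one trade per line.
sample = "20,000 shares for 550 USD each\n1,500 shares for 12.75 USD each"
parsed = parse_trades(sample)
print(parsed)
```

Decimal is used for the price rather than float so that values like 12.75 are kept exactly, which tends to matter for money.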
