I have a string that looks like this:

answer = """
models sold in last 4 weeks
+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+***For more information, please visit the company page.
"""

I need to extract just the table from the string, so that the end result looks like this:

+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+

I tried something like this:

answer.split("***")[0].split("\n")[1]

But this returns only the header line instead of the expected table.

How can I extract only the table from the string? Is there a regex that can be applied here?

Why regex? You can just find the first index of the +---------------+ string, then take the substring up to the last +---------------+ string. See ideone.com/HNYsmN (commented Oct 17, 2019 at 7:17)

3 Answers

I might try:

import re
answer = re.sub(r'^.*?(?=\+-)|\*\*\*.*$', '', answer, flags=re.DOTALL)
print(answer)

This prints:

+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+

The regex uses an alternation to trim the answer string at both the beginning and the end. The first part:

^.*?(?=\+-)

removes all content from the start of the string up to, but not including, the start of the table (+-). The second part:

\*\*\*.*$

removes all content from the start of the footnote (***) until the end of the string.
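To see what each half of the alternation does, the two parts can be applied one at a time. This is only a sketch: the sample string is a shortened stand-in for the one in the question, and the names without_prefix and table are made up for illustration.

```python
import re

# Shortened sample in the same shape as the question's string (assumed data).
answer = """
models sold in last 4 weeks
+--------+
|  pcid  |
+--------+
| 22bv03 |
+--------+***For more information, please visit the company page.
"""

# First alternative: lazily consume everything before the first "+-".
# The lookahead (?=\+-) checks for the border without consuming it.
without_prefix = re.sub(r'^.*?(?=\+-)', '', answer, flags=re.DOTALL)

# Second alternative: consume from "***" through the end of the string.
# DOTALL lets "." cross line breaks, so the trailing newline goes too.
table = re.sub(r'\*\*\*.*$', '', without_prefix, flags=re.DOTALL)

print(table)
```

Without re.DOTALL the first part could not cross the line break before the table, which is why the flag matters here.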

2 Comments

I do not want the header. Just the table.
@SouvikRay I had a slight problem in there, and should have been using DOTALL mode.

It looks as though you want to match from the first occurrence of a fixed delimiter to the last occurrence of the same delimiter.

In this case, you do not have to use a regex:

sep = '+---------------+'
start = answer.find(sep)
end = answer.rfind(sep)
print(answer[start:end+len(sep)])

See the Python demo, yielding:

+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+
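One thing to watch with this approach: str.find and str.rfind return -1 when the separator is missing, so the slice would quietly produce a wrong result instead of failing. A possible guard is sketched below. The function name extract_between and the choice to return None are assumptions for illustration, and the sample is a shortened version of the question's string.

```python
def extract_between(text, sep):
    """Return the span from the first to the last occurrence of sep, or None."""
    start = text.find(sep)
    if start == -1:
        return None  # separator not present at all
    end = text.rfind(sep)
    if end == start:
        return None  # only one occurrence, so there is no closed table
    return text[start:end + len(sep)]


# Shortened sample in the same shape as the question's string (assumed data).
sample = """
models sold in last 4 weeks
+--------+
|  pcid  |
+--------+
| 22bv03 |
+--------+***For more information, please visit the company page.
"""

table = extract_between(sample, "+--------+")
print(table)
print(extract_between("no table here", "+--------+"))  # prints None
```

Returning None for a single occurrence is a design choice; returning the lone separator, as the unguarded slice does, would be equally defensible.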

With a regex, you can match directly from the first to the last occurrence of the separator:

import re
answer = """
models sold in last 4 weeks
+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+***For more information, please visit the company page.
"""
sep = '+---------------+'
m = re.search(r'(?sm)^{0}.*{0}'.format(re.escape(sep)), answer)
if m:
    print(m.group())

See another regex demo

Regex details

  • (?sm) - dot now matches line breaks and ^ matches start of a line
  • ^ - start of a line
  • \+---------------\+ - a separator pattern
  • .* - any 0+ chars as many as possible
  • \+---------------\+ - separator pattern
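The pattern described above can be wrapped in a small helper so that any literal separator can be passed in. This is a sketch under assumptions: extract_table and the report string are hypothetical, and re.escape is what keeps characters such as + from being read as regex operators.

```python
import re


def extract_table(text, sep):
    """Match from the first line starting with sep up to the last sep."""
    # (?s): "." also matches line breaks; (?m): "^" matches at each line start.
    pattern = r'(?sm)^{0}.*{0}'.format(re.escape(sep))
    m = re.search(pattern, text)
    return m.group() if m else None


# Hypothetical input using a different separator than the question's.
report = "title\n=====\n| a |\n=====\n| b |\n=====  footer\n"
print(extract_table(report, "====="))

# The same helper with a "+"-based border, which needs escaping.
boxed = "intro\n+---+\n| x |\n+---+***note\n"
print(extract_table(boxed, "+---+"))
```

Because .* is greedy, the match always runs to the last separator in the text, so inner borders such as the one under the header stay inside the result.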

2 Comments

Yes, this is another solution without using regex. Thanks for the alternate approach.
@SouvikRay I actually added a direct dynamic regex extraction approach that will work for any kind of separator.

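I approached it as follows.

Step 1: Identify the index range by running the code below:

print(answer.index("ks"))

print(answer.index("***"))

This gives the index range of the table, i.e. [28:226]: "ks" (the end of "weeks") starts at index 26, so the text after it starts at 28, and "***" starts at 226. Comment out this code once you have found the range. Note that hard-coded indices only work for this exact string.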

Step 2:

print(answer[28:226])
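If the string may change, the same slicing idea can compute the bounds at run time instead of hard-coding them. The sketch below assumes the table always begins at the first "+-" and the footnote always begins at "***", as in the question. The sample is a shortened stand-in for the real string.

```python
# Shortened sample in the same shape as the question's string (assumed data).
answer = """
models sold in last 4 weeks
+--------+
|  pcid  |
+--------+
| 22bv03 |
+--------+***For more information, please visit the company page.
"""

# str.index raises ValueError if a marker is missing, so a malformed
# string fails loudly instead of yielding a wrong slice.
start = answer.index("+-")   # first character of the table's top border
end = answer.index("***")    # first character of the footnote

table = answer[start:end]
print(table)
```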
