How to extract certain substring from a multi line string in Python?

Question

I have a string that looks like this:

answer = """
models sold in last 4 weeks
+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+***For more information, please visit the company page.
"""

Now I need to extract just the table from the string such that the end result looks like

+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+

I tried something like this:

answer.split("***")[0].split("\n")[1]

But doing so, I get only the header line instead of the expected table.

How can I extract just the table from the string? Is there a regex that can be applied here?
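
For context, here is a minimal sketch of why the attempt above returns a single line. The string opens with a newline, so after splitting on the newline character, index 0 is an empty string and index 1 is the caption line. The shortened `answer` and the names `parts`, `caption` and `table` are illustrative assumptions, not part of the original post.

```python
# Shortened copy of the string from the question (fewer rows, same layout).
answer = """
models sold in last 4 weeks
+---------------+
|      pcid     |
+---------------+
|     22bv03    |
+---------------+***For more information, please visit the company page.
"""

# Everything before the footnote marker, split into lines.
parts = answer.split("***")[0].split("\n")

# parts[0] is '' (the string opens with a newline) and parts[1] is the
# caption, so indexing with [1] can never return the whole table.
caption = parts[1]

# Keeping every line from index 2 onward and re-joining restores the table.
table = "\n".join(parts[2:])
print(table)
```

This only works while the caption occupies exactly one line, which is why the answers below search for the table border instead.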

Why regex? You may just find the first index of the +---------------+ string, then get the substring up to the last +---------------+ string. See ideone.com/HNYsmN
– Wiktor Stribiżew, Commented Oct 17, 2019 at 7:17

Tim Biegeleisen · Accepted Answer · 2019-10-17 07:10:41Z

1

I might try:

import re

# Strip everything before the table and everything from the footnote onward.
answer = re.sub(r'^.*?(?=\+-)|\*\*\*.*$', '', answer, flags=re.DOTALL)
print(answer)

This prints:

+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+

The regex uses an alternation, to handle trimming the answer string at both the beginning and the end. First:

^.*?(?=\+-)

removes all content from the start of the string up to, but not including, the start of the table (+-). The second part:

\*\*\*.*$

removes all content from the start of the footnote (***) until the end of the string.
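
To see what each half of the alternation does, the sketch below applies the two branches separately and then together. The shortened `answer` and the variable names are assumptions made for illustration; the patterns themselves are the ones from the answer.

```python
import re

# Shortened copy of the string from the question (fewer rows, same layout).
answer = """
models sold in last 4 weeks
+---------------+
|      pcid     |
+---------------+
|     22bv03    |
+---------------+***For more information, please visit the company page.
"""

# Branch 1 alone: lazily consume from the start of the string up to,
# but not including, the first "+-". The footnote is still present.
head_trimmed = re.sub(r'^.*?(?=\+-)', '', answer, flags=re.DOTALL)

# Branch 2 alone: consume from "***" to the end of the string.
# The caption line is still present.
tail_trimmed = re.sub(r'\*\*\*.*$', '', answer, flags=re.DOTALL)

# Both branches joined with "|": each one fires once, at its own end.
table = re.sub(r'^.*?(?=\+-)|\*\*\*.*$', '', answer, flags=re.DOTALL)
print(table)
```

Without `re.DOTALL` the dot stops at line breaks, so neither branch could span the multi-line caption or footnote.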

edited Oct 17, 2019 at 7:10

answered Oct 17, 2019 at 7:07

Tim Biegeleisen

2 Comments

Souvik Ray Over a year ago

I do not want the header. Just the table.

Tim Biegeleisen Over a year ago

@SouvikRay I had a slight problem in there, and should have been using DOTALL mode.

Wiktor Stribiżew · Accepted Answer · 2019-10-17 07:23:12Z

1

It looks as though you wanted to match from the first occurrence of a fixed delimiter to the last occurrence of the same delimiter.

In this case, you do not have to use a regex:

sep = '+---------------+'
start = answer.find(sep)
end = answer.rfind(sep)
print(answer[start:end+len(sep)])

See the Python demo, yielding:

+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+
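
One thing to watch with this approach: `str.find` returns -1 when the separator is missing, and slicing with -1 silently produces the wrong substring. A minimal sketch that guards against that case follows; the helper name `extract_between` and the shortened `answer` are hypothetical, introduced only for illustration.

```python
def extract_between(text, sep):
    """Return the text from the first to the last occurrence of sep,
    inclusive, or None if sep does not occur at all."""
    start = text.find(sep)
    if start == -1:
        # No separator: there is nothing to extract.
        return None
    end = text.rfind(sep)
    # If sep occurs only once, start == end and the result is sep itself.
    return text[start:end + len(sep)]


# Shortened copy of the string from the question (fewer rows, same layout).
answer = """
models sold in last 4 weeks
+---------------+
|      pcid     |
+---------------+
|     22bv03    |
+---------------+***For more information, please visit the company page.
"""

sep = '+---------------+'
print(extract_between(answer, sep))
```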

With regex, you may directly match from the first to the last occurrence of the separator:

import re
answer = """
models sold in last 4 weeks
+---------------+
|      pcid     |
+---------------+
|     22bv03    |
|     3eer3d    |
|  fes44h2j555j |
| 4mee33ikj5sq1 |
|  99dkk3bvr32a |
| cv44trmq011sa |
|    lo33xc1a   |
+---------------+***For more information, please visit the company page.
"""
sep = '+---------------+'
m = re.search(r'(?sm)^{0}.*{0}'.format(re.escape(sep)), answer)
if m:
    print(m.group())

See another regex demo

Regex details

(?sm) - dot now matches line breaks and ^ matches start of a line
^ - start of a line
\+---------------\+ - a separator pattern
.* - any 0+ chars as many as possible
\+---------------\+ - separator pattern
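
Because the pattern is built with `re.escape`, the same expression can be reused for other separators, including ones made of regex metacharacters such as `+` or `*`. The sketch below wraps it in a helper; the name `extract_block` and the sample text are assumptions for illustration only.

```python
import re


def extract_block(text, sep):
    """Match from the first line that starts with sep through the last
    occurrence of sep. Returns None when there is no match."""
    # re.escape makes characters such as '+' and '*' match literally.
    pattern = r'(?sm)^{0}.*{0}'.format(re.escape(sep))
    m = re.search(pattern, text)
    return m.group() if m else None


# Hypothetical input using '***' as the separator instead of a table border.
sample = "title\n***\nrow 1\nrow 2\n***footer\n"
print(extract_block(sample, '***'))
```

Note that `^` requires the first separator to sit at the start of a line, while the closing one may appear anywhere, as it does in the question's string where the footnote follows the border directly.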

edited Oct 17, 2019 at 7:23

answered Oct 17, 2019 at 7:20

Wiktor Stribiżew

2 Comments

Souvik Ray Over a year ago

Yes, this is another solution without using regex. Thanks for the alternate approach.

Wiktor Stribiżew Over a year ago

@SouvikRay I actually added a direct dynamic regex extraction approach that will work for any kind of separators.

robsiemb · Accepted Answer · 2019-10-17 17:05:12Z

0

I tried this as follows.

Step 1: Identify the index range by running the code below:

print(answer.index("ks"))

print(answer.index("***"))

This gives you the index range of the table, i.e. [28:226]. Comment out this code once you have found the range.

Step 2:

print(answer[28:226])
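
Hard-coded indices stop working as soon as the caption or the table changes length. A minimal sketch of the same idea with the indices computed at run time is shown below. It assumes the first "+" in the string belongs to the table border and that the footnote always starts with "***"; the shortened `answer` is used for illustration.

```python
# Shortened copy of the string from the question (fewer rows, same layout).
answer = """
models sold in last 4 weeks
+---------------+
|      pcid     |
+---------------+
|     22bv03    |
+---------------+***For more information, please visit the company page.
"""

# str.index raises ValueError if the marker is missing, which makes a
# malformed input fail loudly instead of returning a wrong slice.
start = answer.index("+")    # first character of the top border
end = answer.index("***")    # first character of the footnote

table = answer[start:end]
print(table)
```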

edited Oct 17, 2019 at 17:05

robsiemb

answered Oct 17, 2019 at 16:34

Poison
