How to extract incomplete Python objects from string

Question

I am attempting to extract valid Python-parsable objects, such as dictionaries and lists, from strings. For example, from the string "[{'a' : 1, 'b' : 2}]", the script will extract [{'a' : 1, 'b' : 2}] since the {} and [] denote completed Python objects.

However, when the string is incomplete, such as "[{'a' : 1, 'b' : 2}, {'a' : 1'}]", I want to extract only {'a' : 1, 'b' : 2} and place it into a list, [{'a' : 1, 'b' : 2}], because the second Python object is not complete and therefore must be left out.

I tried to write a regex pattern to match completed {} or []. It works for simple output but fails on nested lists or dicts.

Code:

import re 

def match_dict_list(string): 
    pattern = r"\[?\{[^\}\]]*\}\]?|\[?\[[^\]\[]*\]\]?"
    matches = re.findall(pattern, string)
    return matches

But it fails on """[[1, 2, 3], [11, 12, 21]""": it matches [[1, 2, 3], [11, 12, 21], while the expected result is only the complete objects [1, 2, 3] and [11, 12, 21], placed in a list: [[1, 2, 3], [11, 12, 21]].

Some test cases

Case 1: "[{'a' : 1, 'b' : 2}, {'a' : 1'"

Expected output: [{'a': 1, 'b': 2}]

Case 2: '[[1, 2, 3], [11, 12, 21]'

Expected output: [[1, 2, 3], [11, 12, 21]]

Case 3: """[{'a': [{'a': 1, 'b': 2}, {'a': 1, 'b': 2}], 'b': [{'a':"""

Expected output: [{'a': 1, 'b': 2}, {'a': 1, 'b': 2}]

I am getting the output from APIs and can't change anything on their side; sometimes the server output is complete, and sometimes it's incomplete.

I also tried the updated pattern \[?\{[^\}\]]*\}\]?|\[[^\]\[]*\]|\[\[[^\]\[]*\]\], but it fails on the third case. What is the best option to solve this kind of issue?

I can't use ast.literal_eval because as I mentioned above the string output is incomplete such as " [ { 'a' : 1 } , {'b' : ".
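
For reference, a minimal reproduction of that failure. This is only a sketch; the variable names `incomplete` and `parsed` are just for illustration:

```python
import ast

# The truncated string from the sentence above.
incomplete = " [ { 'a' : 1 } , {'b' : "

try:
    ast.literal_eval(incomplete)
    parsed = True
except SyntaxError:
    # The unclosed brackets make the whole string unparsable,
    # even though the first element is complete.
    parsed = False

print(parsed)  # False
```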

There is a reason parsers don't try to fix "obvious" syntax errors. Why are you trying to do so?
– chepner, Commented Feb 19, 2023 at 23:02
Also, it's impossible to do this with regex, due to arbitrarily-deeply-nested expressions being possible.
– Mous, Commented Feb 19, 2023 at 23:03
Re: "I am getting the output from APIs but can't do anything from their side; sometimes, the server output is complete, and sometimes, it's incomplete." APIs should serialize data using a format like JSON, not Python reprs (incomplete, to boot). If you are able to provide feedback to the owner of the API, you should make them fix their output.
– dskrypa, Commented Feb 19, 2023 at 23:09
Re: "[{'a' : 1, 'b' : 2}, {'a' : 1'}]", I only attempt to extract {'a' : 1, 'b' : 2} [...], as the second Python object is not yet complete and therefore must be left out. How could that ever become complete, with an unpaired quotation mark inside and the closing curly brace and bracket already in place? It's not incomplete, but broken.
– tevemadar, Commented Feb 19, 2023 at 23:49

Driftr95 · Accepted Answer · 2023-02-20 20:07:11Z

I can't use ast.literal_eval because as I mentioned above the string output is incomplete such as " [ { 'a' : 1 } , {'b' : "

But you can iteratively try to apply ast.literal_eval, each time slicing off the last character and appending the closer(s) for the leading opening bracket(s) (similar to how json.loads was used in the solution suggested in @Mous's comment).

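import ast

def eval_brokenLiteral(litStr: str, defaultVal=None, printError=True):
    # Fast path: the string is already a complete literal.
    # literal_eval can also raise ValueError/TypeError on malformed input.
    try:
        return ast.literal_eval(litStr)
    except (SyntaxError, ValueError, TypeError) as se:
        evalError = se
    # litStr = litStr[:getattr(evalError, 'offset', len(litStr))]

    # Build candidate closers from the leading run of opening brackets,
    # e.g. a string starting with "[{" gives [']', '}]'].
    bracketPairs = {'{': '}', '[': ']'}
    closers, curCloser = [], ''
    for c in ''.join(litStr.split()):
        if c not in bracketPairs:
            break
        curCloser = bracketPairs[c] + curCloser
        closers.append(curCloser)

    # For each closer, trim one character at a time from the end until
    # the remainder plus the closer parses as a literal.
    for closer in closers:
        subStr = litStr.strip()
        while subStr[1:]:
            try:
                return ast.literal_eval(subStr + closer)
            except (SyntaxError, ValueError, TypeError):
                subStr = subStr[:-1].strip()

    if printError:
        print(repr(evalError))
    return defaultVal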

[If it can't find a valid literal, it will return None unless some other defaultVal is specified.]

Tests:

testCases = [
    "[{'a' : 1, 'b' : 2}, {'a' : 1'}]",
    '[[1, 2, 3], [11, 12, 21]',
    """[{'a': [{'a': 1, 'b': 2}, {'a': 1, 'b': 2}], 'b': [{'a':"""
]
for ti, tc in enumerate(testCases, 1):
    print(f'Case {ti}: {repr(tc)}\n  -  ----> ', end='')
    op = eval_brokenLiteral(tc)
    print(f'Output ---> {repr(op)}' if op else '', '\n\n---\n')

Case 1: "[{'a' : 1, 'b' : 2}, {'a' : 1'}]"

----> Output ---> [{'a': 1, 'b': 2}]

Case 2: '[[1, 2, 3], [11, 12, 21]'

----> Output ---> [[1, 2, 3], [11, 12, 21]]

Case 3: "[{'a': [{'a': 1, 'b': 2}, {'a': 1, 'b': 2}], 'b': [{'a':"

----> Output ---> [{'a': [{'a': 1, 'b': 2}, {'a': 1, 'b': 2}]}]
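
As a side note on the comment that regex can't handle arbitrarily deep nesting: keeping track of which brackets are still open takes a stack. Below is a minimal sketch of that idea, assuming a hypothetical helper named `pending_closers` (it is not part of the function above). It returns the closing brackets a truncated literal would still need and ignores brackets that appear inside simple quoted strings. It does not handle triple-quoted strings or string prefixes, so treat it as an illustration rather than a full tokenizer.

```python
def pending_closers(text: str) -> str:
    """Return the closing brackets still needed by a truncated literal."""
    pairs = {'{': '}', '[': ']', '(': ')'}
    stack = []        # closers for brackets opened but not yet closed
    quote = None      # active quote character while inside a string
    escaped = False   # True right after a backslash inside a string
    for ch in text:
        if quote:
            if escaped:
                escaped = False
            elif ch == '\\':
                escaped = True
            elif ch == quote:
                quote = None
        elif ch in '\'"':
            quote = ch
        elif ch in pairs:
            stack.append(pairs[ch])
        elif stack and ch == stack[-1]:
            stack.pop()
    # The innermost bracket must be closed first.
    return ''.join(reversed(stack))


print(pending_closers("[[1, 2, 3], [11, 12, 21]"))    # ]
print(pending_closers("[{'a': [1, 2], 'b': [{'a':"))  # }]}]
```

Knowing the pending closers at a given cut point would let you append exactly the right suffix before calling ast.literal_eval, instead of guessing from the leading brackets only. Whether that is worth the extra code depends on how long and how deeply nested the API output gets.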
