
I am attempting to extract valid Python-parsable objects, such as dictionaries and lists, from strings. For example, from the string "[{'a' : 1, 'b' : 2}]", the script will extract [{'a' : 1, 'b' : 2}] since the {} and [] denote completed Python objects.

However, when the string output is incomplete, such as "[{'a' : 1, 'b' : 2}, {'a' : 1'}]", I only attempt to extract {'a' : 1, 'b' : 2} and place it into a list [{'a' : 1, 'b' : 2}], as the second Python object is not yet complete and therefore must be left out.

I tried writing a regex pattern to match completed {} or []. It works for simple output but fails on nested lists or dicts.

Code:

import re 

def match_dict_list(string): 
    pattern = r"\[?\{[^\}\]]*\}\]?|\[?\[[^\]\[]*\]\]?"
    matches = re.findall(pattern, string)
    return matches

But it fails on """[[1, 2, 3], [11, 12, 21]""": the matches are [[1, 2, 3] and [11, 12, 21], whereas the expected output is only [1, 2, 3] and [11, 12, 21], placed in a list: [[1, 2, 3], [11, 12, 21]].

Some test cases

  • Case 1: "[{'a' : 1, 'b' : 2}, {'a' : 1'"

    Expected output: [{'a': 1, 'b': 2}]

  • Case 2: '[[1, 2, 3], [11, 12, 21]'

    Expected output: [[1, 2, 3], [11, 12, 21]]

  • Case 3: """[{'a': [{'a': 1, 'b': 2}, {'a': 1, 'b': 2}], 'b': [{'a':"""

    Expected output: [{'a': 1, 'b': 2}, {'a': 1, 'b': 2}]

I am getting the output from APIs but can't do anything from their side; sometimes, the server output is complete, and sometimes, it's incomplete.

I also tried the updated pattern \[?\{[^\}\]]*\}\]?|\[[^\]\[]*\]|\[\[[^\]\[]*\]\], but it fails on the third case. What is the best way to solve this kind of issue?

I can't use ast.literal_eval because as I mentioned above the string output is incomplete such as " [ { 'a' : 1 } , {'b' : ".
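For illustration, here is a minimal check of that failure mode, using the truncated string from the sentence above. The variable names are only for this example.

```python
import ast

# The truncated string from above: the second dict is never closed.
truncated = " [ { 'a' : 1 } , {'b' : "

parsed, error = None, None
try:
    parsed = ast.literal_eval(truncated)
except SyntaxError as err:
    # Truncated input is rejected outright; nothing is partially parsed.
    error = err

print(parsed, type(error).__name__)
```

So a single call gives me all or nothing, which is why I was looking for a way to keep the completed objects.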

  • Why did you get incomplete 'objects'? Commented Feb 19, 2023 at 22:53
  • There is a reason parsers don't try to fix "obvious" syntax errors. Why are you trying to do so? Commented Feb 19, 2023 at 23:02
  • Also, it's impossible to do this with regex, because arbitrarily deeply nested expressions are possible. Commented Feb 19, 2023 at 23:03
  • Re: "I am getting the output from APIs but can't do anything from their side; sometimes, the server output is complete, and sometimes, it's incomplete.": APIs should serialize data using a format like JSON, not Python reprs (incomplete ones, to boot). If you are able to provide feedback to the owner of the API, you should ask them to fix their output. Commented Feb 19, 2023 at 23:09
  • Re: "[{'a' : 1, 'b' : 2}, {'a' : 1'}]", I only attempt to extract {'a' : 1, 'b' : 2} [...], as the second Python object is not yet complete and therefore must be left out.": how could that ever become complete, with an unpaired quotation mark inside and the closing curly brace and bracket already in place? It's not incomplete, but broken. Commented Feb 19, 2023 at 23:49

1 Answer

I can't use ast.literal_eval because as I mentioned above the string output is incomplete such as " [ { 'a' : 1 } , {'b' : "

But you can repeatedly try ast.literal_eval, each time slicing off the last character and appending the closing bracket(s) that match the string's leading opening bracket(s) [similarly to how json.loads was used in this solution suggested in @mous's comment].

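import ast

# ast.literal_eval can raise any of these on malformed or non-literal input
EVAL_ERRORS = (SyntaxError, ValueError, TypeError)

def eval_brokenLiteral(litStr: str, defaultVal=None, printError=True):
    # Fast path: the string is already a complete literal
    try:
        return ast.literal_eval(litStr)
    except EVAL_ERRORS as e:
        evalError = e

    # Build candidate closers from the leading run of opening brackets,
    # e.g. "[{..." gives [']', '}]']
    bracketPairs = {'{': '}', '[': ']'}
    closers, curCloser = [], ''
    for c in ''.join(litStr.split()):
        if c not in bracketPairs: break
        curCloser = bracketPairs[c] + curCloser
        closers.append(curCloser)

    # For each closer, trim characters off the end until the remainder
    # plus the closer parses as a literal
    for closer in closers:
        subStr = litStr.strip()
        while subStr[1:]:
            try:
                return ast.literal_eval(subStr + closer)
            except EVAL_ERRORS:
                subStr = subStr[:-1].strip()

    if printError: print(repr(evalError))
    return defaultVal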

[If it can't find a valid literal, it will return None unless some other defaultVal is specified.]


Tests:

testCases = [
    "[{'a' : 1, 'b' : 2}, {'a' : 1'}]",
    '[[1, 2, 3], [11, 12, 21]',
    """[{'a': [{'a': 1, 'b': 2}, {'a': 1, 'b': 2}], 'b': [{'a':"""
]
for ti, tc in enumerate(testCases, 1):
    print(f'Case {ti}: {tc!r}\n  -  ----> ', end='')
    op = eval_brokenLiteral(tc)
    # Compare against None so that valid but empty results such as [] are still shown
    print(f'Output ---> {op!r}' if op is not None else '', '\n\n---\n')

Case 1: "[{'a' : 1, 'b' : 2}, {'a' : 1'}]"

  • ----> Output ---> [{'a': 1, 'b': 2}]

Case 2: '[[1, 2, 3], [11, 12, 21]'

  • ----> Output ---> [[1, 2, 3], [11, 12, 21]]

Case 3: "[{'a': [{'a': 1, 'b': 2}, {'a': 1, 'b': 2}], 'b': [{'a':"

  • ----> Output ---> [{'a': [{'a': 1, 'b': 2}, {'a': 1, 'b': 2}]}]
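If trimming one character at a time turns out to be slow on long strings, a single-pass variant may be worth trying. The following is only a sketch, not a tested drop-in replacement: repair_truncated_literal is a hypothetical name, and it assumes the input uses only [] and {} containers with ordinary single- or double-quoted strings (no triple quotes). It tracks open brackets with a stack, records every position where an element has just ended (a matched closing bracket or a comma), and then tries those cut points from last to first.

```python
import ast

def repair_truncated_literal(text, default=None):
    """Hypothetical helper: cut at the last complete element, close open brackets."""
    text = text.strip()
    pairs = {'[': ']', '{': '}'}
    stack = []        # closers still owed, innermost last
    candidates = []   # (cut index, closers to append) for each possible cut
    quote = None      # quote character of the string we are inside, if any
    escaped = False
    for i, ch in enumerate(text):
        if quote is not None:
            # Inside a string literal: brackets and commas do not count
            if escaped:
                escaped = False
            elif ch == '\\':
                escaped = True
            elif ch == quote:
                quote = None
            continue
        if ch in '\'"':
            quote = ch
        elif ch in pairs:
            stack.append(pairs[ch])
        elif ch in ']}':
            if not stack or stack[-1] != ch:
                break  # mismatched closer: stop scanning here
            stack.pop()
            candidates.append((i + 1, ''.join(reversed(stack))))
        elif ch == ',':
            # Cut just before the comma, dropping whatever follows it
            candidates.append((i, ''.join(reversed(stack))))
    # Prefer the longest prefix that parses once its brackets are closed
    for cut, closers in reversed(candidates):
        try:
            return ast.literal_eval(text[:cut] + closers)
        except (SyntaxError, ValueError, TypeError):
            continue
    return default

print(repair_truncated_literal('[[1, 2, 3], [11, 12, 21]'))
```

One design difference to be aware of: because this sketch only cuts at commas and closing brackets, it drops a trailing element that has no delimiter after it (for example, "[1, 2, 3" yields [1, 2]), since it cannot tell whether that last element was truncated. On the three test cases above it should give the same results as eval_brokenLiteral.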
