How to better write this Python String manipulation

Question

The following code creates a list of questions and options from a multiline string.

import json
import re
text_string = '''
Question 1
### Consider the following figure:
Select one:

* **a. The optimal solution**
* b. An infeasible solution
* c. An Alternate vertex
* d. None of these answers

***The correct answer is: The optimal solution***


-----

Question 2
### If the characters 'D', 'C', 'B', 'A' are placed in a queue (in that order), and then removed one at a time, in what order will they be removed?

```
 // initially called with low = 0, high = N - 1  
  BinarySearch_Right(A[0..N-1], v alue, low , high) {  
      // in variants: v alue >= A[i] for all i < low  
                     v alue < A[i] for all i > high  
      if (high < low)  
          return low  
      mid = (low + high) / 2  
      if (A[mid] > v alue) 
          return BinarySearch_Right(A, v alue, low , mid-1)  
      else  
          return BinarySearch_Right(A, v alue, mid+1, high)  
  }
```
Select one:
* a. ABCD
* b. ABDC
* c. DCAB
* **d. DCBA  **
* e. ACDB

***The correct answer is: DCBA***

'''
questions = text_string.split('-----')
quizzes = []
for ques in questions:
    # create array to get question text
    # this should remove the question number like (Question 1)
    question_array = ques.strip().split('*')[0].split('\n')
    # Question text
    question = '\n'.join(question_array[1:len(question_array)])
    # Remove ### if starts with ###
    if question.startswith("###"):
        question = question[3:]
    # build a dict item to add to quizzes array
    quiz_item = {
        'question': question.strip(),
        'options': [],
        'answer_string': ''
    }
    # get index of string starting 'select'
    for option in ques.strip().split('\n'):
       
        if option.startswith("*") and not (option.startswith("***") and option.endswith("***")):
            quiz_item['options'].append({
                'option': option.replace('*', '').strip(),
                 'answer': True if option.startswith("* **") and option.endswith("**") else False
            })
        if option.startswith("***") and option.endswith("***"):
            quiz_item['answer_string'] = option.replace('*', '').strip()
    quizzes.append(quiz_item)
print(json.dumps(quizzes, indent=2))

It works, in that I get the results I want. However, I feel it is not efficient enough. Is there a better way to write this? Thank you.

FMc · Accepted Answer · 2023-03-19 18:01:05Z

When parsing, attach meaningful labels to important markers, dividers, etc. Your code is littered with magic strings that drive the parsing logic, but those raw strings don't mean anything to a reader of the code. Help your reader (i.e., you in the future) by giving those entities meaningful names.

QUESTION_DIVIDER = '-----'
QUESTION_PREFIX = '#'
OPTION_MARK = '*'
ANSWER_MARK = '**'
ANSWER_TEXT_MARK = '***'
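
As a small illustration of the payoff, a check like the one in your inner loop can be written against the name instead of the raw string. This is only a sketch: the helper name `is_answer_text_line` is made up for the example and is not used in the code further down.

```python
ANSWER_TEXT_MARK = '***'

def is_answer_text_line(line):
    # True for lines such as '***The correct answer is: DCBA***'.
    # The length check rejects a bare '***', where the prefix and
    # the suffix would be the same three characters.
    return (
        len(line) > 2 * len(ANSWER_TEXT_MARK)
        and line.startswith(ANSWER_TEXT_MARK)
        and line.endswith(ANSWER_TEXT_MARK)
    )
```

If the marker ever changes, only the constant needs to be edited, and the function name tells the reader what the condition is for.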

When parsing, don't try to do everything at once. Parsing is often tricky and complicated. Focus your energies on trying to reduce that complexity. Your current approach does not do that: all within a single mega-loop you break the full text into chunks of question-text; nested under that, you perform all of the detailed steps to extract each bit of information. That's too much logical complexity to put in one place: the human brain has trouble keeping track of everything in a context like that. A better strategy is to break the problem down into a sequence of very simple operations, each in its own function, and each individually understandable at a glance. In the illustration below, parsing occurs in two phases: the first just breaks the question-text into its major sub-sections (question, options, answer). Then, those sections are handled individually.

Speaking of functions, all of your code should be in them. This discipline has many benefits. One could argue that this practice is the strongest and oldest lesson in the history of computer programming. You would be well advised to embrace this wisdom of the ancients even if you don't fully appreciate all of the benefits yet. At the top level, one typically has a main() function or something similar. Its job is to orchestrate things, not engage in the grubby details of parsing.

import json

# Your example text.
QUIZ_TEXT = """
...
"""

def main():
    questions = list(parse_quiz(QUIZ_TEXT))
    print(json.dumps(questions, indent = 4)) 

...

if __name__ == '__main__':
    main()

Then we have a top-level parsing function. By design, we don't want it doing much detailed work. Its job is to assemble the desired data structure (a dict in your case) by delegating lower-level operations to other functions.

def parse_quiz(text):
    # Takes the full text of a quiz.
    # Yields dicts, each representing a question
    # and its options and answer.
    for question_text in quiz_text_to_question_texts(text):
        qsection, osection, asection = parse_into_sections(question_text)
        yield dict(
            question = parse_question_section(qsection),
            options = parse_options_section(osection),
            answer = asection,
        )

Finally, there are the helper functions that perform the actual parsing. We want these functions to be short and narrowly focused on a specific task. Because they are narrowly defined and because they take advantage of named constants, most of their work is fairly easy to understand. Equally important, my prediction is that you will discover that your current parsing logic has some edge-case problems if you run it against a variety of inputs. When you have to modify your program to handle these unanticipated complications, your work will be much easier because the adjustments will be happening in these little utility functions rather than in a sprawling mass of code that tries to parse everything at once.

def quiz_text_to_question_texts(text):
    # Breaks the full text into separate blobs of question text.
    return [
        qt.strip()
        for qt in text.split(QUESTION_DIVIDER)
    ]

def parse_into_sections(text):
    # Breaks a single question into its primary sections:
    # question, options, answer.
    top, asection, rest = text.split(ANSWER_TEXT_MARK, 2)
    qsection, osection = top.split(OPTION_MARK, 1)
    return (qsection.strip(), osection.strip(), asection.strip())

def parse_question_section(qsection):
    # Extracts the question text from the question-section.
    q = '\n'.join(to_lines(qsection)[1:])
    return q.lstrip(QUESTION_PREFIX).strip()

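def parse_options_section(osection):
    # Extracts the options from the options-section.
    # The bullet ('* ') is dropped first, so that an answer line starts
    # with ANSWER_MARK wherever it sits in the list (the first line has
    # already lost its bullet in parse_into_sections).
    bullet = OPTION_MARK + ' '
    bodies = [
        line.strip().removeprefix(bullet)  # str.removeprefix: Python 3.9+
        for line in to_lines(osection)
    ]
    return [
        dict(
            option = body.strip(OPTION_MARK).strip(),
            is_answer = body.startswith(ANSWER_MARK),
        )
        for body in bodies
        if body
    ]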

def to_lines(text):
    return text.split('\n')
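
Here is one edge case of the kind predicted above. If the quiz text ends with a divider, `split` produces an empty trailing chunk, and `parse_into_sections` would then fail when unpacking its result. A possible variant of the first helper filters such chunks out. Treat it as a sketch under that assumption about your inputs, not as a tested fix for your real data.

```python
QUESTION_DIVIDER = '-----'

def quiz_text_to_question_texts(text):
    # Breaks the full text into separate blobs of question text,
    # skipping chunks that are empty after stripping (for example,
    # the one produced by a trailing divider).
    return [
        qt.strip()
        for qt in text.split(QUESTION_DIVIDER)
        if qt.strip()
    ]
```

Because the change is confined to one small function, nothing else in the parser has to know about it.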

Next steps: model your data with proper data objects. You are currently representing a quiz as a list of question dicts. I don't know how you intend to use those dicts, but I suspect that your larger project would benefit from representing those questions as dataclass Question instances.
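
A minimal sketch of what that could look like follows. The field names, the `Option` class, the `correct_options` property, and the `from_dict` constructor are all assumptions made for illustration. Adapt them to however you actually use the data. The `list[Option]` annotation assumes Python 3.9 or later.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Option:
    text: str
    is_answer: bool = False

@dataclass
class Question:
    text: str
    options: list[Option] = field(default_factory=list)
    answer: str = ''

    @property
    def correct_options(self):
        # The options flagged as answers, in their original order.
        return [o for o in self.options if o.is_answer]

    @classmethod
    def from_dict(cls, d):
        # Builds a Question from a dict shaped like the ones
        # yielded by parse_quiz().
        return cls(
            text=d['question'],
            options=[Option(o['option'], o['is_answer']) for o in d['options']],
            answer=d['answer'],
        )
```

With something like this in place, `main()` could build `Question.from_dict(d)` for each dict from `parse_quiz()`, and the rest of the program could work with attributes such as `question.correct_options` rather than string keys.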

Thanks so much! That was elaborate. I was more focused on the code working efficiently rather than the styling. You are right that writing reusable code is the way to go. Thanks so much for the input. I truly appreciate it.
– Kjobber, Commented Mar 20, 2023 at 0:58
