Read input from a text file into dictionaries with regex in certain cases

Question

I would like to create a dictionary from an input.txt file.

For example, here is a sample of the input.txt file:

%. VAR %first=Billy
%. VAR %last=Bob
%. PRINT VARS
%. VAR %petName=Gato
%. VAR %street="1234 Home Street"
%. VAR %city="New York" 
%. VAR %state=NY 
%. VAR %zip=21236 
%. VAR %title=Dr.
%. PRINT VARS
%. FORMAT LM=5  JUST=LEFT
%. PRINT FORMAT

So each variable line has the form VAR %varName=value.

For example, in the case of %first=Billy you would get something like varDict = {"first": "Billy"}, right? Now I want to know how to do that through the entire file.

There are two dictionaries that I need to populate: one for the variables and one for FORMAT, which just holds values and doesn't actually do anything for now.

As for the desired output, I would use the pprint function like this, pprint.pprint(varDict, width=30), and it would output something like this:

{'first': 'Billy',
 'last': 'Bob'}
{'city': 'New York',
 'first': 'Billy',
 'last': 'Bob',
 'petName': 'Gato',
 'state': 'NY',
 'street': '1234 Home Street',
 'title': 'Dr.',
 'zip': '21236'}
{'BULLET': 'o',
 'FLOW': 'YES',
 'JUST': 'LEFT',
 'LM': '5',
 'RM': '80'}

EDIT

I am going to input the code I have now for my setFormatWIP.py

import re
import sys
import pprint

input=(sys.argv[1])

regexFormat = re.compile(r'^%\.\s*?FORMAT\s*?((?:(?:\w+)=(?:\w+)\s*)*)$', re.MULTILINE)
regexPrintFORMAT = re.compile(r'^%\.\s*PRINT\s(FORMAT)',re.MULTILINE)

file = open(input)
line = file.readline()
formatDict = dict()

while line:
    formatList = regexFormat.findall(line)
    printFormatObj = regexPrintFORMAT.search(line)
    if printFormatObj != None:
            pprint.pprint(formatDict, width=30)
    for param in formatList[0].split():
        splitParam = param.split('=')
        formatDict[splitParam[0]] = splitParam[1]

    line = file.readline()
file.close()

Running that, I get this error:

Traceback (most recent call last):
File "formatTest.py", line 19, in <module>
for param in formatList[0].split():
IndexError: list index out of range

In your desired output you have 'first': 'Billy', 'last': 'Bob' in both the first and the second dictionary, but from the example file it looks like they should only be in the first one. Also, in your regex you are looking for @, but in the file each line starts with %. Are these things on purpose? Moreover, some of the values are surrounded by quotes (e.g. "New York") and others aren't (e.g. Gato).
– orKa, Commented Apr 26, 2020 at 7:24
@orKach Yeah, I fixed my regex, sorry about that. As for the desired output, entries like 'first': 'Billy' come from varDict, i.e. the first dictionary. There would be two dictionaries: you see that after those entries there is BULLET, FLOW, etc.? Those would come from the formatDict I would create. And yes, for the values surrounded by quotes, what should I do to capture them that way, including the space?
– user33plus1, Commented Apr 26, 2020 at 7:41

orKa · Accepted Answer · 2020-04-26 13:05:02Z

2

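If you can read the entire file into a string, then the following expression should retrieve your variables. Note that it keeps the surrounding double quotes on quoted values, and it skips %title=Dr. because \w+ does not match the trailing period: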

import re

var_pat = re.compile(r'^%\.\s*?VAR\s*?%(\w+)=(\w+|".*")\s*$', re.MULTILINE)
with open('input.txt') as f:
    text = f.read()

var_list = var_pat.findall(text)
print(var_list)

[('first', 'Billy'), ('last', 'Bob'), ('petName', 'Gato'), ('street', '"1234 Home Street"'), ('city', '"New York"'), ('state', 'NY'), ('zip', '21236')]

After that you can do something like this to get your dictionary:

var_dict = dict()
for k, v in var_list:
    var_dict[k] = v
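
The pattern above leaves the double quotes in quoted values and does not match %title=Dr. A minimal sketch of one way to handle both cases follows. It assumes that unquoted values never contain whitespace; VAR_PAT and parse_vars are names made up for this illustration and are not part of the original answer.

```python
import re

# Hypothetical variant of var_pat: group 2 captures the inside of a quoted
# value, group 3 captures an unquoted run of non-whitespace characters.
VAR_PAT = re.compile(r'^%\.\s*VAR\s+%(\w+)=(?:"([^"]*)"|(\S+))\s*$', re.MULTILINE)


def parse_vars(text):
    """Return a dict mapping variable names to values, with quotes removed."""
    result = {}
    for name, quoted, bare in VAR_PAT.findall(text):
        # findall() gives '' for the alternative that did not take part
        # in the match, so at most one of the two is non-empty.
        result[name] = quoted if quoted else bare
    return result


sample = '%. VAR %first=Billy\n%. VAR %city="New York" \n%. VAR %title=Dr.\n'
print(parse_vars(sample))
```

Using \S+ for unquoted values is a deliberate loosening of \w+, so punctuation such as the period in Dr. is kept.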

For the format pattern, this

format_pat = re.compile(r'^%\.\s*?FORMAT\s*?((?:(?:\w+)=(?:\w+)\s*)*)$', re.MULTILINE)
format = format_pat.findall(text)
print(format)

will yield

['LM=5  JUST=LEFT']

So you can get your dict by doing:

format_dict = dict()
for param in format[0].split():
    split_param = param.split('=')
    format_dict[split_param[0]] = split_param[1]
print(format_dict)

{'LM': '5', 'JUST': 'LEFT'}
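
As a more compact alternative, the key=value pairs can be pulled out with a second findall and fed straight into the dictionary. This is only a sketch: FORMAT_LINE_PAT, PAIR_PAT and parse_format are hypothetical names, and it assumes keys and values consist of word characters only, as in the sample file. It also returns an empty dict instead of raising when the text has no FORMAT line.

```python
import re

# Illustrative patterns: one finds FORMAT lines, the other splits the pairs.
FORMAT_LINE_PAT = re.compile(r'^%\.\s*FORMAT\s+(.*)$', re.MULTILINE)
PAIR_PAT = re.compile(r'(\w+)=(\w+)')


def parse_format(text):
    """Collect every KEY=VALUE pair found on FORMAT lines into one dict."""
    settings = {}
    for params in FORMAT_LINE_PAT.findall(text):
        # PAIR_PAT.findall() returns (key, value) tuples, which
        # dict.update() accepts directly.
        settings.update(PAIR_PAT.findall(params))
    return settings


print(parse_format('%. FORMAT LM=5  JUST=LEFT\n%. PRINT FORMAT\n'))
```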

You can learn about these regexes on the link Mace posted.

Edit

To get the desired output, instead of searching for all VARs at once, iterate over the lines of the file, try to match each pattern against the line, and handle the line according to which pattern matched:

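import pprint

var_dict = {}
with open('input.txt', 'r') as f:
    for line in f:
        m_var = var_pat.match(line)
        if m_var:
            # A VAR line: store the variable and move on to the next line
            var_dict[m_var.group(1)] = m_var.group(2)
            continue
        m_print = print_pat.match(line)
        if m_print:
            # A PRINT VARS line: show the dictionary as it stands so far
            pprint.pprint(var_dict, width=30)
        # ... (handle the remaining line types in the same way)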

Where print_pat is a regex pattern that matches the line PRINT VARS.
You can read more about Python regex functions such as re.match() in the documentation for the re module.
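
Putting the line-by-line idea together, a self-contained sketch could look like the following. All pattern names and the process function are assumptions made for illustration. The sketch strips the quotes from quoted values, and it does not add FORMAT defaults such as BULLET, FLOW or RM from the desired output, because the sample input never sets them.

```python
import pprint
import re

# All names below are illustrative, not taken from the original code.
VAR_PAT = re.compile(r'^%\.\s*VAR\s+%(\w+)=(?:"([^"]*)"|(\S+))\s*$')
FORMAT_PAT = re.compile(r'^%\.\s*FORMAT\s+(.*)$')
PRINT_PAT = re.compile(r'^%\.\s*PRINT\s+(VARS|FORMAT)\s*$')
PAIR_PAT = re.compile(r'(\w+)=(\w+)')


def process(lines):
    """Yield a copy of the relevant dict each time a PRINT line is reached."""
    var_dict, format_dict = {}, {}
    for line in lines:
        m = VAR_PAT.match(line)
        if m:
            name, quoted, bare = m.groups()
            # groups() gives None for the alternative that did not match
            var_dict[name] = quoted if quoted is not None else bare
            continue
        m = FORMAT_PAT.match(line)
        if m:
            format_dict.update(PAIR_PAT.findall(m.group(1)))
            continue
        m = PRINT_PAT.match(line)
        if m:
            source = var_dict if m.group(1) == 'VARS' else format_dict
            yield dict(source)


sample = [
    '%. VAR %first=Billy\n',
    '%. VAR %last=Bob\n',
    '%. PRINT VARS\n',
    '%. VAR %city="New York" \n',
    '%. VAR %title=Dr.\n',
    '%. PRINT VARS\n',
    '%. FORMAT LM=5  JUST=LEFT\n',
    '%. PRINT FORMAT\n',
]

for snapshot in process(sample):
    pprint.pprint(snapshot, width=30)
```

To run it against a real file, pass the open file object instead of the list, for example process(f) inside a with open('input.txt') as f: block. Yielding copies rather than printing inside the loop keeps the parsing separate from the output.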

edited Apr 26, 2020 at 13:05

answered Apr 26, 2020 at 8:28

orKa


5 Comments

user33plus1 Over a year ago

another question, when I encounter the Print lines in the input.txt files, i should have no problem with also creating a regex and when it finds PRINT VARS to print the varDict right?

orKa Over a year ago

You want to iterate over the rows of input.txt, and when you see a PRINT VARS line you want to print the dictionary up until now? Am I understanding correctly? Do you still want me to answer or did Mace answer your question?

user33plus1 Over a year ago

Yeah, that's correct, I'm breaking my head trying to figure that out. How could I do that? None of the code I've written has given me the desired output. See, I know that I can read the input line by line, and if it gets to a line that has PRINT VARS or PRINT FORMAT, I want to print that dictionary as it is at that point.

user33plus1 Over a year ago

and now i'm having trouble with the format portion of the code. even with what you added up top, it still isn't printing fully.

user33plus1 Over a year ago

going to edit up top what the error i am getting, and add the code i have so far! please give me a few minutes

Mace · Accepted Answer · 2020-04-26 07:50:52Z

1

Your main question seems to be about using regular expressions. Maybe this will help you get started. re.findall is fairly simple: it returns a list of the matches found for your expression.

import re

lines = [
    "%. VAR     %first=Billy",
    "%. VAR     %last=Bob",
    "%. PRINT VARS",
    "%. VAR     %petName=Gato",
    "%. VAR     %street=\"1234 Home Street\"",
    "%. VAR     %city=\"New York\" ",
    "%. VAR     %state=NY ",
    "%. VAR     %zip=21236 ",
    "%. VAR     %title=Dr.",
    "%. PRINT VARS",
    "%. FORMAT LM=5  JUST=LEFT",
    "%. PRINT FORMAT",
    ]

# find VAR
re_VAR = r'^\%\.\s+VAR\s+%'
VAR_list = []
for line in lines:
    re_result = re.findall(re_VAR, line)
    if re_result:
        text = line.replace(re_result[0], '')
        text_parts = text.split('=')
        VAR_list.append({text_parts[0]: text_parts[1]})

print(VAR_list)

Result

[{'first': 'Billy'}, {'last': 'Bob'}, {'petName': 'Gato'}, {'street': '"1234 Home Street"'}, {'city': '"New York" '}, {'state': 'NY '}, {'zip': '21236 '}, {'title': 'Dr.'}]
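
The result above is a list of one-item dictionaries, and some values still carry trailing spaces or double quotes. A small sketch of how the same prefix-matching approach could build a single cleaned-up dictionary is shown below; RE_VAR and collect_vars are hypothetical names, and it assumes values never legitimately start or end with a double quote.

```python
import re

# Same prefix pattern as above, compiled once.
RE_VAR = re.compile(r'^%\.\s+VAR\s+%')


def collect_vars(lines):
    """Build one dict from all VAR lines, trimming whitespace and quotes."""
    var_dict = {}
    for line in lines:
        m = RE_VAR.match(line)
        if not m:
            continue
        # Everything after the matched prefix has the form name=value
        name, _, value = line[m.end():].partition('=')
        # Drop surrounding whitespace first, then surrounding double quotes
        var_dict[name] = value.strip().strip('"')
    return var_dict


print(collect_vars([
    '%. VAR     %city="New York" ',
    '%. VAR     %state=NY ',
    '%. PRINT VARS',
]))
```

Using partition('=') instead of split('=') keeps any further = characters inside the value.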

You can test your regular expressions at regex101.com.

answered Apr 26, 2020 at 7:50

Mace


6 Comments

user33plus1 Over a year ago

interesting, i was considering just using a .split method, but didn't consider using a .replace method as well. so with this sort of method, i should be able to also read it in from a file as well right?

user33plus1 Over a year ago

another question, when I encounter the Print lines in the input.txt files, i should have no problem with also creating a regex and when it finds PRINT VARS to print the varDict right?

Mace Over a year ago

Just copy the line(s) you need into regex101 and try to make a regular expression which fits your needs. Copy the created expression into python r'.....'. and use it as in the example. If it can't be done in one step, do it in more. You will get the hang of it.

user33plus1 Over a year ago

thank you man. glad to know my mind is at least on the right route.

Mace Over a year ago

It means that formatList[0] doesn't exist, so your regexFormat didn't find anything on that line. It's good practice to check 'if formatList:' before indexing. Compare 'if re_result:' in my answer.
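
Applied to the loop from the question, that check might look like the sketch below. The two patterns are the ones from the question; wrapping the loop in a build_format_dict function that takes any iterable of lines is an assumption made here so the example can be run without an input file.

```python
import pprint
import re

# Patterns as posted in the question (applied to one line at a time).
regexFormat = re.compile(r'^%\.\s*?FORMAT\s*?((?:(?:\w+)=(?:\w+)\s*)*)$')
regexPrintFORMAT = re.compile(r'^%\.\s*PRINT\s(FORMAT)')


def build_format_dict(lines):
    """Hypothetical wrapper around the question's loop, with the guard added."""
    formatDict = {}
    for line in lines:
        formatList = regexFormat.findall(line)
        if formatList:
            # Only index the list once we know the line was a FORMAT line
            for param in formatList[0].split():
                key, value = param.split('=')
                formatDict[key] = value
        elif regexPrintFORMAT.search(line):
            pprint.pprint(formatDict, width=30)
    return formatDict


build_format_dict([
    '%. VAR %first=Billy\n',
    '%. FORMAT LM=5  JUST=LEFT\n',
    '%. PRINT FORMAT\n',
])
```

An open file object can be passed in place of the list, since iterating over a file yields its lines.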
