7

I'm reading a file into python 2.4 that's structured like this:

field1: 7
field2: "Hello, world!"
field3: 6.2

The idea is to parse it into a dictionary that takes fieldfoo as the key and whatever comes after the colon as the value.

I want to convert whatever is after the colon to its "actual" data type, that is, '7' should be converted to an int, "Hello, world!" to a string, etc. The only data types that need to be parsed are ints, floats and strings. Is there a function in the python standard library that would allow one to make this conversion easily?

The only things this should be used to parse were written by me, so (at least in this case) safety is not an issue.

9 Answers

6

First parse your input into a list of pairs like fieldN: some_string. You can do this easily with the re module, or probably even more simply by slicing left and right of the index line.strip().find(': '). Then use a literal eval on the value some_string:

>>> import ast
>>> ast.literal_eval('6.2')
6.2
>>> type(_)
<type 'float'>
>>> ast.literal_eval('"Hello, world!"')
'Hello, world!'
>>> type(_)
<type 'str'>
>>> ast.literal_eval('7')
7
>>> type(_)
<type 'int'>
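Putting the two steps together, here is a minimal sketch (the parse_fields name and splitting on ': ' are assumptions based on the sample input in the question):

```python
import ast

def parse_fields(text):
    # Hypothetical helper: split each "fieldN: value" line on the first
    # ': ' and let ast.literal_eval recover ints, floats and quoted strings.
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, _, raw = line.partition(': ')
        result[key] = ast.literal_eval(raw)
    return result

sample = 'field1: 7\nfield2: "Hello, world!"\nfield3: 6.2'
parse_fields(sample)
# {'field1': 7, 'field2': 'Hello, world!', 'field3': 6.2}
```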

7 Comments

The version of python I'm using doesn't have the ast module.
@MikeSamuel obviously the input must be preprocessed into fieldn: string pairs first, but that part is trivial. @julio.alegria _ is a handy shortcut for the last returned value in the interactive interpreter. @Dan ..erm.. now you tell me ;) upgrade python? is there a reason why you need to use such an old version?
@Mike Samuel: Safety isn't an issue for me. I don't need to parse anything that I haven't written myself with another program. +1 on your comment for pointing it out, though.
mail.python.org/pipermail/python-list/2009-September/… here someone backported literal_eval to 2.4, but it all sounds a bit hacky to me. i would prefer to upgrade python than use that, personally.
@wim: I figured out I could just use eval(). See answer below, and thanks for pointing me in the right direction.
4

You can use yaml to parse the literals, which is better than ast in that it does not throw an error if strings are not wrapped in an extra pair of apostrophes or quotation marks.

>>> import yaml
>>> yaml.safe_load('7')
7
>>> yaml.safe_load('Hello')
'Hello'
>>> yaml.safe_load('7.5')
7.5

Comments

2

You can attempt to convert it to an int first using the built-in function int(). If the string cannot be interpreted as an int, a ValueError exception is raised. You can then attempt to convert it to a float using float(). If this also fails, just return the initial string.

def interpret(val):
    try:
        return int(val)
    except ValueError:
        try:
            return float(val)
        except ValueError:
            return val
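A usage sketch applying interpret() to the question's file format (the function is repeated here so the example is self-contained; splitting on ': ' is an assumption based on the sample input). Note that, unlike the literal_eval approach, strings need no surrounding quotes:

```python
def interpret(val):
    # Try int first, then float, then fall back to the raw string.
    try:
        return int(val)
    except ValueError:
        try:
            return float(val)
        except ValueError:
            return val

lines = ['field1: 7', 'field2: Hello, world!', 'field3: 6.2']
parsed = dict((k, interpret(v)) for k, v in
              (line.split(': ', 1) for line in lines))
# {'field1': 7, 'field2': 'Hello, world!', 'field3': 6.2}
```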

Comments

1

For older Python versions, like the one in the question, the eval function can be used; to reduce its evilness, pass a dict as the second argument to serve as the global namespace, which blocks calls to the built-in functions.

>>> [eval(i, {"__builtins__":None}) for i in ['6.2', '"Hello, world!"', '7']]
[6.2, 'Hello, world!', 7]

1 Comment

It raises "SyntaxError: unexpected EOF while parsing" when applied to alphanumeric values, instead of interpreting them as a string.
1

Since the "only data types that need to be parsed are int, float and str", maybe something like this will work for you:

entries = {'field1': '7', 'field2': "Hello, world!", 'field3': '6.2'}

for k,v in entries.items():
    if v.isdecimal():
        conv = int(v)
    else:
        try:
            conv = float(v)
        except ValueError:
            conv = v
    entries[k] = conv

print(entries)
# {'field2': 'Hello, world!', 'field3': 6.2, 'field1': 7}
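One caveat: str.isdecimal() returns False for negative numbers such as '-7', so they would fall through to the float branch. A small variant (to_number is a hypothetical name) that handles the sign:

```python
def to_number(v):
    # str.isdecimal() rejects a leading minus sign, so check the
    # digits after an optional '-' before converting with int().
    body = v[1:] if v.startswith('-') else v
    if body.isdecimal():
        return int(v)
    try:
        return float(v)
    except ValueError:
        return v
```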

Comments

1

There is the strconv lib.

In [22]: import strconv
/home/tworec/.local/lib/python2.7/site-packages/strconv.py:200: UserWarning: python-dateutil is not installed. As of version 0.5, this will be a hard dependency of strconv fordatetime parsing. Without it, only a limited set of datetime formats are supported without timezones.
  warnings.warn('python-dateutil is not installed. As of version 0.5, '

In [23]: strconv.convert('1.2')
Out[23]: 1.2

In [24]: type(strconv.convert('1.2'))
Out[24]: float

In [25]: type(strconv.convert('12'))
Out[25]: int

In [26]: type(strconv.convert('true'))
Out[26]: bool

In [27]: type(strconv.convert('tRue'))
Out[27]: bool

In [28]: type(strconv.convert('12 Jan'))
Out[28]: str

In [29]: type(strconv.convert('12 Jan 2018'))
Out[29]: str

In [30]: type(strconv.convert('2018-01-01'))
Out[30]: datetime.date

1 Comment

Actually, it does not handle unicode strings, see github.com/bruth/strconv/issues/2
0

Hope this helps to do what you are trying to do:

#!/usr/bin/python

a = {'field1': 7}
b = {'field2': "Hello, world!"}
c = {'field3': 6.2}

temp1 = type(a['field1'])
temp2 = type(b['field2'])
temp3 = type(c['field3'])

print temp1
print temp2
print temp3

2 Comments

I don't want to get the types of objects in a dictionary, I want to convert strings in a dictionary that are annotated as python types to the types they represent.
Can you post example input and output? That will be easier to understand.
0

Thanks to wim for helping me figure out what I needed to search for to figure this out.

One can just use eval():

>>> a=eval("7")
>>> b=eval("3")
>>> a+b
10
>>> b=eval("7.2")
>>> a=eval("3.5")
>>> a+b
10.699999999999999
>>> a=eval('"Hello, "')
>>> b=eval('"world!"')
>>> a+b
'Hello, world!'

3 Comments

Great! Now make sure you don't import os in your source, to avoid evaluating values like os.system("rm *"). And that's not the only way. So this method works, but it's not recommended.
It's evil and insecure, but this entire script is a quick and dirty fix that should (ideally) be thrown away in a few months.
I had a Q&D awk script that I wrote in 1989 implementing a very crude commercial order processor “until the app we wait is ready” that was still being used up to 1996 that I know of, and a Q&D 1995 QBasic army service chores assigner (whatever you might understand of it :) that was still used in 2007 (albeit modified by others to no end, I presume), so I'm certain “quick&dirty” programs are as quick but lots more dirtier than people usually think they are.
0

I put together this function to help with the type inference of lists.

from typing import List

import numpy as np

def infer_dtypes(values: List, sample_size: int = 300, stop_after: int = 300):
    """
    Infers the data type by randomly sampling from a list. Values are explicitly converted to string before checking.

    Args:
        values (list): A list to infer data types from.
        sample_size (int, optional): The number of values to sample from the list. Entire list will be sampled if set to None. Defaults to 300.
        stop_after (int, optional): The maximum number of non-empty values needed for the test. Equal to sample_size if set to None. Defaults to 300.

    Returns:
        str: The inferred data type ('int', 'float', 'bool', 'str', 'mixed', 'empty').
    """
    found = 0
    non_empty_count = 0

    sample_size = sample_size if sample_size is not None else len(values)
    stop_after = stop_after if stop_after is not None else sample_size

    for v in np.random.choice(values, sample_size):
        v = str(v)
        if v != '':
            non_empty_count += 1
            if non_empty_count > stop_after:
                break
            try:
                int(v)
                found |= 1
            except ValueError:
                try:
                    float(v)
                    found |= 2
                except ValueError:
                    if v.lower() in ['true', 'false']:
                        found |= 4
                    else:
                        found |= 8


    # Check if the data is mixed
    if bin(found).count('1') > 1:
        return 'mixed'

    if found & 8:
        return 'str'
    elif found & 4:
        return 'bool'
    elif found & 2:
        return 'float'
    elif found & 1:
        return 'int'
    else:
        return 'empty'

Produces:

infer_dtypes(['', '', '1', '2', '3', '4', '5'])  # int
infer_dtypes(['', '', '1.0', '2.0', '', '3.0', '4.4', '5.0'])  # float
infer_dtypes(['', '', 'True', 'False', '', '', 'False', 'True'])  # bool
infer_dtypes(['', '', 'never', 'gonna', '', '', 'give', ''])  # str
infer_dtypes(['', '', 'never', '', '5', 'True', '5.2', ''])  # mixed
infer_dtypes(['', '', '', '', '', '', '', ''])  # empty

Rationale, feel free to skip this:

I wrote this function because currently Pandas' df.convert_dtypes, df.infer_objects and pd.to_numeric don't work nicely if you have columns with empty strings. This could be solved (source 1, source 2) if a DataFrame has columns of uniform data types; for example, if we know it only has floats, we could replace '' with np.nan and then infer. However, for a DataFrame with mixed column types (strings, floats, ints), replacing '' with np.nan wouldn't work. This function helps solve that issue by running:

values = np.where(pd.isnull(df.T.values), '', df.T.values)
for l in values:
    infer_dtypes(l)

See this GitHub Gist for a full example. Hope it helps!

Comments
