sscanf in Python

Question

I'm looking for an equivalent to sscanf() in Python. I want to parse /proc/net/* files. In C I could do something like this:

int matches = sscanf(
        buffer,
        "%*d: %64[0-9A-Fa-f]:%X %64[0-9A-Fa-f]:%X %*X %*X:%*X %*X:%*X %*X %*d %*d %ld %*512s\n",
        local_addr, &local_port, rem_addr, &rem_port, &inode);

At first I thought of using str.split, but it splits on the sep string as a whole rather than on any of the given characters:

>>> import string
>>> lines = open("/proc/net/dev").readlines()
>>> for l in lines[2:]:
...     cols = l.split(string.whitespace + ":")
...     print(len(cols))
...
1

I expected this to return 17 for each line, not 1.

Is there a Python equivalent to sscanf (not RE), or a string splitting function in the standard library that splits on any of a range of characters that I'm not aware of?

Is there any reason you are insisting on "not RE"? Regexes are the perfect tool for this job.
– Max Shawabkeh, Commented Feb 1, 2010 at 6:46
If you want to program in C, why not program in C? If you want to program in python, use a regular expression. There's even a helpful hint in the documentation for the re module telling you how to convert scanf formats into regular expressions. docs.python.org/library/re.html#simulating-scanf
– user97370, Commented Feb 1, 2010 at 7:47
@MattJoiner, I think it would be better to request/disallow features than to request/disallow implementations. "I would like to have format strings that specify the type of the output variable, to have the types converted for me, and to assert specific formatting of the input string" rather than "not regex" explains why you have this preference. After all, if someone used regex to build what you wanted, you'd use it, wouldn't you?
– interestinglythere, Commented Nov 13, 2015 at 14:43

Craig McQueen · Accepted Answer · 2017-07-14 00:00:18Z

103

There is also the parse module.

parse() is designed to be the opposite of format() (the newer string formatting function in Python 2.6 and higher).

>>> from parse import parse
>>> parse('{} fish', '1')
>>> parse('{} fish', '1 fish')
<Result ('1',) {}>
>>> parse('{} fish', '2 fish')
<Result ('2',) {}>
>>> parse('{} fish', 'red fish')
<Result ('red',) {}>
>>> parse('{} fish', 'blue fish')
<Result ('blue',) {}>

edited Jul 14, 2017 at 0:00

answered Oct 12, 2012 at 4:18

Craig McQueen


2 Comments

KY Lu Over a year ago

This solution is clean and clear, but you need to install the parse package.

daruma May 14 at 12:46

If this functionality is not present out of the box, then that's obviously not the best answer. Python developers did not get the power of sprintf / sscanf ?

Chris Dellin · Accepted Answer · 2017-11-15 16:22:05Z

78

When I'm in a C mood, I usually use zip and list comprehensions for scanf-like behavior. Like this:

# strtobool is the small conversion helper defined at the end of this answer
text = '1 3.0 false hello'
(a, b, c, d) = [t(s) for t, s in zip((int, float, strtobool, str), text.split())]
print(a, b, c, d)

Note that for more complex format strings, you do need to use regular expressions:

import re
text = '1:3.0 false,hello'
match = re.search(r'^(\d+):([\d.]+) (\w+),(\w+)$', text)
(a, b, c, d) = [t(s) for t, s in zip((int, float, strtobool, str), match.groups())]
print(a, b, c, d)

Note also that you need conversion functions for all types you want to convert. For example, above I used something like:

strtobool = lambda s: {'true': True, 'false': False}[s]
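The two snippets above can be folded into one small helper. The following is only a sketch: the name scan and its argument order are made up for illustration and are not part of any library. It pairs each captured group with a converter and returns None when the input does not match.

```python
import re


def strtobool(s):
    """Map the literal strings 'true'/'false' to booleans (KeyError otherwise)."""
    return {'true': True, 'false': False}[s]


def scan(pattern, converters, text):
    """Hypothetical helper: match the whole of `text` against `pattern`
    and convert each captured group with the converter at the same position.

    Returns a tuple of converted values, or None if `text` does not match.
    """
    match = re.fullmatch(pattern, text)
    if match is None:
        return None
    groups = match.groups()
    if len(groups) != len(converters):
        raise ValueError("pattern has %d groups but %d converters were given"
                         % (len(groups), len(converters)))
    return tuple(convert(group) for convert, group in zip(converters, groups))


result = scan(r'(\d+):([\d.]+) (\w+),(\w+)',
              (int, float, strtobool, str),
              '1:3.0 false,hello')
print(result)  # (1, 3.0, False, 'hello')
```

Returning None on a mismatch loosely mirrors how sscanf reports fewer matches than expected instead of failing hard; raising an exception would be an equally reasonable choice.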

edited Nov 15, 2017 at 16:22

answered Jun 18, 2012 at 14:55

Chris Dellin


5 Comments

JWL Over a year ago

I really like this approach, especially as my problem was not just a need for scanf, but sscanf.

Aky Over a year ago

This appeared to be a good solution; sadly bool("false") returns True, because only empty strings evaluate to False. However, all is not lost, you could replace bool with a custom function which behaves the way you'd like.

Chris Dellin Over a year ago

@Aky Nice catch! I fixed my answer.

Blair Houghton Over a year ago

@rookiepig t(s) gets replaced by int(substring1), float(substring2), strtobool(substring3), and str(substring4), in order.

dsz Over a year ago

Thanks for actually addressing the type conversion and not just the string-splitting

Mike Graham · Accepted Answer · 2010-02-01 06:51:50Z

38

Python doesn't have an sscanf equivalent built-in, and most of the time it actually makes a whole lot more sense to parse the input by working with the string directly, using regexps, or using a parsing tool.

People have implemented sscanf in Python, which is probably most useful when translating C code; see, for example, this module: http://hkn.eecs.berkeley.edu/~dyoo/python/scanf/
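To give a rough idea of what such an implementation involves, here is a minimal sketch that translates a scanf-style format into a regular expression. It is not the code of the linked module: the function name simple_sscanf is made up, and only %d, %x, %f, %s and the * (suppress assignment) flag are handled, with no field widths or character sets.

```python
import re

# Each supported conversion maps to (regex fragment, converter).
# This is an assumed, deliberately small subset of scanf.
_CONVERSIONS = {
    'd': (r'[-+]?\d+', int),
    'x': (r'[-+]?(?:0[xX])?[0-9a-fA-F]+', lambda s: int(s, 16)),
    'f': (r'[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?', float),
    's': (r'\S+', str),
}

# One format token: a conversion, a run of whitespace, or a literal character.
_TOKEN = re.compile(r'%(\*?)([dxfs])|(\s+)|([^%\s])')


def simple_sscanf(text, fmt):
    """Return a tuple of converted values, or None if `text` does not match."""
    parts, converters = [], []
    pos = 0
    while pos < len(fmt):
        token = _TOKEN.match(fmt, pos)
        if token is None:
            raise ValueError("unsupported format near %r" % fmt[pos:])
        suppress, spec, space, literal = token.groups()
        if spec:
            regex, convert = _CONVERSIONS[spec]
            if suppress:
                # %*d and friends: match the field but do not capture it
                parts.append(r'\s*(?:' + regex + ')')
            else:
                parts.append(r'\s*(' + regex + ')')
                converters.append(convert)
        elif space:
            # whitespace in the format matches any amount of whitespace
            parts.append(r'\s*')
        else:
            parts.append(re.escape(literal))
        pos = token.end()
    match = re.match(''.join(parts), text)
    if match is None:
        return None
    return tuple(c(g) for c, g in zip(converters, match.groups()))


print(simple_sscanf("1 3.14 Hello", "%d %f %s"))  # (1, 3.14, 'Hello')
```

Unlike the C function, this sketch is all-or-nothing: it returns None unless the whole format matches, whereas sscanf reports how many fields it managed to convert.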

In this particular case if you just want to split the data based on multiple split characters, re.split is really the right tool.

answered Feb 1, 2010 at 6:51

Mike Graham


4 Comments

Matt Joiner Over a year ago

i did say no re, but you justify it nicely

Janus Troelsen Over a year ago

here's a py3k version of the linked implementation: gist.github.com/3875529

nimig18 Over a year ago

It's not built in, but there is a library for it: pypi.org/project/scanf

Student4K Over a year ago

"most of the time it actually makes a whole lot more sense to parse the input by working with the string directly, using regexps, or using a parsing tool." This is a false statement. In most cases it makes sense to use (s-)scanf. I do not know about python2, but in python3 they have realized it already.

Dietrich Epp · Accepted Answer · 2010-02-01 06:41:49Z

25

You can split on a range of characters using the re module.

>>> import re
>>> r = re.compile('[ \t\n\r:]+')
>>> r.split("abc:def  ghi")
['abc', 'def', 'ghi']
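Splitting only yields strings, so a conversion step usually follows. A minimal sketch, assuming a made-up sample line in the style of /proc/net/dev (the interface name and counter values are invented):

```python
import re

FIELD_SEPARATOR = re.compile(r'[\s:]+')

# Assumed sample line; the values are invented for illustration.
sample = "  eth0: 1500 12 0 0\n"

# strip() first so leading/trailing separators do not produce empty fields
fields = FIELD_SEPARATOR.split(sample.strip())
name = fields[0]
counters = [int(field) for field in fields[1:]]

print(name, counters)  # eth0 [1500, 12, 0, 0]
```

Without the strip() call, the leading spaces would produce an empty first field, because re.split returns an empty string when the input starts with a separator.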

answered Feb 1, 2010 at 6:41

Dietrich Epp


4 Comments

ZAB Over a year ago

It is not funny to deal with regexes for textual float representations.

Dietrich Epp Over a year ago

@ZAB: Nothing funny here. You use the regular expression to split fields, and then you use float() to parse it.

ZAB Over a year ago

For this specific problem, parsing /proc/net/*, this ugly trick will work, though.

Beetle Over a year ago

Or, even better, r = re.compile(r'[\s:]+'). (It's a good habit to put regular expressions in raw strings I think, even though it doesn't make any difference in this case.)

orip · Accepted Answer · 2010-02-01 23:02:59Z

You can parse with the re module using named groups. It won't convert the substrings to their actual datatypes (e.g. int), but it's very convenient when parsing strings.

Given this sample line from /proc/net/tcp:

line="   0: 00000000:0203 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 335 1 c1674320 300 0 0 0"

An example mimicking your sscanf call, applied to the line variable above, could be:

import re
hex_digit_pattern = r"[\dA-Fa-f]"
pat = r"\d+: " + \
      r"(?P<local_addr>HEX+):(?P<local_port>HEX+) " + \
      r"(?P<rem_addr>HEX+):(?P<rem_port>HEX+) " + \
      r"HEX+ HEX+:HEX+ HEX+:HEX+ HEX+ +\d+ +\d+ " + \
      r"(?P<inode>\d+)"
pat = pat.replace("HEX", hex_digit_pattern)

values = re.search(pat, line).groupdict()

import pprint; pprint.pprint(values)
# prints:
# {'inode': '335',
#  'local_addr': '00000000',
#  'local_port': '0203',
#  'rem_addr': '00000000',
#  'rem_port': '0000'}
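If typed values are wanted, the captured strings can be converted in a second pass. The sketch below reuses the sample line and group names from above. The CONVERTERS table and the from_hex helper are assumptions added for illustration: ports are read as hexadecimal, the inode as decimal, and addresses are left as strings.

```python
import re

line = ("   0: 00000000:0203 00000000:0000 0A 00000000:00000000 "
        "00:00000000 00000000     0        0 335 1 c1674320 300 0 0 0")

HEX = r"[0-9A-Fa-f]+"
pattern = re.compile((
    r"\d+: "
    r"(?P<local_addr>{h}):(?P<local_port>{h}) "
    r"(?P<rem_addr>{h}):(?P<rem_port>{h}) "
    r"{h} {h}:{h} {h}:{h} {h} +\d+ +\d+ "
    r"(?P<inode>\d+)"
).format(h=HEX))


def from_hex(s):
    """Convert a hexadecimal string such as '0203' to an int."""
    return int(s, 16)


# Hypothetical mapping from group name to conversion function.
CONVERTERS = {
    "local_addr": str,
    "local_port": from_hex,
    "rem_addr": str,
    "rem_port": from_hex,
    "inode": int,
}

match = pattern.search(line)
values = {name: CONVERTERS[name](text)
          for name, text in match.groupdict().items()}

print(values["local_port"], values["rem_port"], values["inode"])  # 515 0 335
```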

Ryan M · Accepted Answer · 2022-05-05 04:45:40Z

2

There is an example in the official Python docs about how to use sscanf from libc:

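# import libc
import os
from ctypes import cdll, c_int, c_float, create_string_buffer, byref

if os.name == "nt":
    libc = cdll.msvcrt
else:
    # assuming a Unix-like environment where libc.so.6 is available
    libc = cdll.LoadLibrary("libc.so.6")
    # alternative: libc = CDLL("libc.so.6"), with CDLL imported from ctypes

# allocate vars
i = c_int()
f = c_float()
s = create_string_buffer(b'\000' * 32)

# parse with sscanf (in Python 3 both the input and the format must be bytes)
libc.sscanf(b"1 3.14 Hello", b"%d %f %s", byref(i), byref(f), s)

# read the parsed values
i.value  # 1
f.value  # approximately 3.14 (c_float is single precision)
s.value  # b'Hello'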

edited May 5, 2022 at 4:45

Ryan M♦


answered Feb 25, 2020 at 15:18

eadmaster


1 Comment

Dmytro Over a year ago

replace from ctypes import CDLL with from ctypes import cdll, c_int, c_float, create_string_buffer, byref or from ctypes import *

ghostdog74 · Accepted Answer · 2010-02-01 06:50:34Z

1

You can turn the ":" into a space and then do the split, e.g.:

>>> f = open("/proc/net/dev")
>>> for line in f:
...     line = line.replace(":", " ").split()
...     print(len(line))

no regex needed (for this case)
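One thing the replace-then-split approach gives up is any check that the colon was actually present. A minimal regex-free sketch that keeps that check uses str.partition. The function name split_interface_line and the sample line are made up for illustration.

```python
def split_interface_line(line):
    """Split 'name: v1 v2 ...' into (name, [values]).

    Raises ValueError if the line contains no colon, so malformed
    input is not silently accepted.
    """
    name, colon, rest = line.partition(":")
    if not colon:
        raise ValueError("expected 'name: values', got %r" % line)
    return name.strip(), rest.split()


# Assumed sample line; the values are invented for illustration.
print(split_interface_line("  eth0: 1500 12 0 0"))
# ('eth0', ['1500', '12', '0', '0'])
```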

answered Feb 1, 2010 at 6:50

ghostdog74


1 Comment

Kevin Over a year ago

You'd still have to verify that the original string was correct - for example, "abc def ghi" would parse the same as "abc:def:ghi". This distinction may matter.

gerrit · Accepted Answer · 2020-01-15 17:53:00Z

You could install pandas and use pandas.read_fwf for fixed width format files. Example using /proc/net/arp:

In [230]: df = pandas.read_fwf("/proc/net/arp")

In [231]: print(df)
       IP address HW type Flags         HW address Mask Device
0   141.38.28.115     0x1   0x2  84:2b:2b:ad:e1:f4    *   eth0
1   141.38.28.203     0x1   0x2  c4:34:6b:5b:e4:7d    *   eth0
2   141.38.28.140     0x1   0x2  00:19:99:ce:00:19    *   eth0
3   141.38.28.202     0x1   0x2  90:1b:0e:14:a1:e3    *   eth0
4    141.38.28.17     0x1   0x2  90:1b:0e:1a:4b:41    *   eth0
5    141.38.28.60     0x1   0x2  00:19:99:cc:aa:58    *   eth0
6   141.38.28.233     0x1   0x2  90:1b:0e:8d:7a:c9    *   eth0
7    141.38.28.55     0x1   0x2  00:19:99:cc:ab:00    *   eth0
8   141.38.28.224     0x1   0x2  90:1b:0e:8d:7a:e2    *   eth0
9   141.38.28.148     0x1   0x0  4c:52:62:a8:08:2c    *   eth0
10  141.38.28.179     0x1   0x2  90:1b:0e:1a:4b:50    *   eth0

In [232]: df["HW address"]
Out[232]:
0     84:2b:2b:ad:e1:f4
1     c4:34:6b:5b:e4:7d
2     00:19:99:ce:00:19
3     90:1b:0e:14:a1:e3
4     90:1b:0e:1a:4b:41
5     00:19:99:cc:aa:58
6     90:1b:0e:8d:7a:c9
7     00:19:99:cc:ab:00
8     90:1b:0e:8d:7a:e2
9     4c:52:62:a8:08:2c
10    90:1b:0e:1a:4b:50

In [233]: df["HW address"][5]
Out[233]: '00:19:99:cc:aa:58'

By default it tries to figure out the format automagically, but there are options you can give for more explicit instructions (see documentation). There are also other IO routines in pandas that are powerful for other file formats.

Lennart Regebro · Accepted Answer · 2010-02-01 11:38:34Z

-2

If the separators are ':', you can split on ':', and then use x.strip() on the strings to get rid of any leading or trailing whitespace. int() will ignore the spaces.
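A short sketch of that idea, using a made-up record of colon-separated integers with stray whitespace:

```python
# Assumed sample: colon-separated integers with stray whitespace.
record = " 12 : 34:  56 \n"

pieces = record.split(":")

# int() accepts leading and trailing whitespace, so no strip() is needed here
numbers = [int(piece) for piece in pieces]

# for fields that stay strings, strip() removes the surrounding whitespace
cleaned = [piece.strip() for piece in pieces]

print(numbers)  # [12, 34, 56]
print(cleaned)  # ['12', '34', '56']
```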

answered Feb 1, 2010 at 11:38

Lennart Regebro

