Regex on array of chars in Python?

Question

I have a buffer (an array of chars) that I am using to read data in from a socket, which contains an HTTP request. I have some regular expressions that work nicely for extracting relevant info from strings, and I am looking for a way to use those regular expressions to extract the same info from an array instead, without having to build a string out of the array. Is this possible with ctypes? This is an example of how I am getting the data right now.

import socket, array, ctypes
libc = ctypes.cdll.LoadLibrary('libc.so.6')
buff = array.array('c', '\0'*4096)
a, b = socket.socketpair()
fd = a.fileno()
buff_pointer = buff.buffer_info()[0]
b.send('a'*100)
bytes_read = libc.recv(fd, buff_pointer, len(buff), 0)
print buff #prints a zeroed array of length 4096 with 100 chars of 'a' in front

This is purely for fun/for lulz btw, inb4 it's unpythonic.

Dunno if it's officially supported, but when I try it, re seems to support searching in anything that supports the buffer interface. That includes array.array instances.
– user2357112, Commented May 28, 2014 at 4:22
Alternatively, buff = bytearray(4096); bytes_read = a.recv_into(buff).
– Eryk Sun, Commented May 28, 2014 at 4:27
@eryksun yeah, I am aware of that method, I am just using ctypes for kicks.
– 3uc1id, Commented May 28, 2014 at 4:39
OK, then I suggest using a ctypes array such as buff = (ctypes.c_char * 4096)(). Then you don't have to get buff_pointer, unless you're doing that for fun, too.
– Eryk Sun, Commented May 28, 2014 at 4:52
The pattern needs to be hashable because re caches them.
– Janne Karila, Commented May 28, 2014 at 7:02

mhawke · Accepted Answer · 2014-05-28 04:56:34Z

Just run your regexes directly on the array object, e.g.

>>> import re
>>> m = re.match('^aaaaa', buff)
>>> m
<_sre.SRE_Match object at 0x7fd4cd2cd030>
>>> m.group()
array('c', 'aaaaa')
>>> m.string[m.start():m.end()]
array('c', 'aaaaa')

etc...
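
For readers on Python 3, the same idea can be sketched with a bytearray and recv_into, as suggested in the comments on the question. This is only a sketch under stated assumptions: the REQUEST_LINE pattern and the read_request_line helper are hypothetical names made up for illustration, the pattern must be a bytes pattern, and the endpos argument keeps the match inside the bytes actually received so the zero padding is ignored.

```python
import re
import socket

# Hypothetical pattern for an HTTP request line; a bytes pattern is needed
# to search a bytes-like buffer in Python 3.
REQUEST_LINE = re.compile(
    rb'^(?P<method>[A-Z]+) (?P<path>\S+) HTTP/(?P<version>\d\.\d)\r\n'
)

def read_request_line(sock, buff):
    """Receive into buff and match the request line without building a str."""
    n = sock.recv_into(buff)
    # pos=0, endpos=n restricts matching to the bytes that were received.
    m = REQUEST_LINE.match(buff, 0, n)
    if m is None:
        return None
    # bytes() normalizes the group type across Python versions.
    return tuple(bytes(m.group(name)) for name in ('method', 'path', 'version'))

a, b = socket.socketpair()
try:
    b.sendall(b'GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n')
    buff = bytearray(4096)
    parsed = read_request_line(a, buff)
finally:
    a.close()
    b.close()

print(parsed)
```

The buffer is reused as-is; only the matched groups are copied out.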


2 Comments

Eryk Sun Over a year ago

Yes, the _sre extension module should work on any contiguous char or wchar_t buffer. See getstring in _sre.c.

Eryk Sun Over a year ago

Wide characters are 4 bytes on most POSIX builds, so you can use a buffer with either single-byte elements or four-byte elements. Wide characters are 2 bytes on Windows. re factors the character size into its iteration and pattern matching: re.match(b'ab', (ctypes.c_uint32 * 2)(97, 98)).group() == [97, 98].
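
A minimal Python 3 sketch of matching against a ctypes char array of the kind suggested in the comments on the question, assuming re accepts it through the buffer protocol like other bytes-like objects. The payload and the HOST pattern are made up for illustration, and ctypes.memmove stands in for whatever call actually fills the buffer (for example libc.recv).

```python
import ctypes
import re

# Zero-filled ctypes buffer, as in buff = (ctypes.c_char * 4096)().
buff = (ctypes.c_char * 64)()

# Stand-in for data arriving from a socket (illustrative payload).
payload = b'Host: example.com\r\n'
ctypes.memmove(buff, payload, len(payload))

# Hypothetical pattern; it must be a bytes pattern for a byte buffer.
HOST = re.compile(rb'Host: (\S+)')

# endpos limits the match to the bytes that were actually written.
m = HOST.match(buff, 0, len(payload))
host = bytes(m.group(1)) if m else None
print(host)
```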
