Python: regex to catch data

Question

I'd like to ask for your help.

I have a large piece of data, which looks like this:

     a
  b : c 901
   d : e sda
 v
     w : x ads
  any
   abc : def 12132
   ghi : jkl dasf
  mno : pqr fas
   stu : vwx utu

Description: the file begins with a line containing a single word (possibly with whitespace before and after it). It is followed by attribute lines, each holding two values separated by a colon (again with optional whitespace). After that comes either another attribute line or another single-word line. I can't come up with a regex that captures the data in this form:

{
  "a": [["b", "c 901"], ["d", "e sda"]],
  "v": [["w", "x ads"]],
  "any": [["abc", "def 12132"], ["ghi", "jkl dasf"]],
  # etc.
}

Here is what I've tried:

import re

regex = str()
regex += "^(?:(?:\\s*)(.*?)(?:\\s*))$"
regex += "(?:(?:^(?:\\s*)(.*?)(?:\\s*):(?:\\s*)(.*?)(?:\\s*))$)*$"
pattern = re.compile(regex, re.S | re.M)

However, it doesn't find what I need. Could you help me? I know I could process the file without a regex, iterating line by line and checking for the ":" symbol, but the file is too big to process that way. (If you know how to process it quickly without a regex, that would also be a valid answer, but the first approach that comes to mind is too slow.)

Thanks in advance!

P.S. In its canonical form, the file looks like this:

a
  b : c 901
  d : e sda

Every section begins with a single word, followed by attribute lines (each indented by two spaces) in which the attributes are separated by " : ". After that comes either another attribute line or a line with a single word. Any other whitespace is prohibited. That will probably make it easier.

+1 Super Clarity; Neatly framed question.

– Yavar, Commented Feb 14, 2013 at 10:37

freakish · Accepted Answer · 2013-02-14 10:25:55Z

3

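Are regular expressions really necessary here? Try something like this, where data is any iterable of lines (an open file, for example):

result = {}
last = None  # name of the section currently being filled

for raw_line in data:
    parts = raw_line.strip().split(":")
    if len(parts) == 1:
        # A line without a colon starts a new section.
        last = parts[0]
        if last not in result:
            result[last] = []
    elif len(parts) == 2:
        # A "key : value" line belongs to the current section.
        result[last].append([parts[0].strip(), parts[1].strip()])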

I hope I've understood your data structure correctly.
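For reference, here is a self-contained sketch of the same line-by-line idea that runs as written. The function name parse_sections and the sample text are illustrative assumptions, not part of the original answer. Unlike the snippet above, it also skips blank lines and splits on the first colon only, both of which are guesses about what the real file needs.

```python
import io


def parse_sections(lines):
    """Group 'key : value' lines under the most recent single-word line.

    `lines` is any iterable of strings, e.g. an open file, so the input
    is streamed rather than loaded into memory at once.
    """
    result = {}
    current = None  # name of the section being filled
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no meaning
        key, sep, value = line.partition(":")
        if not sep:
            # No colon: this line names a new section.
            current = line
            result.setdefault(current, [])
        elif current is not None:
            # Attribute line: only the first colon separates key and value.
            result[current].append([key.strip(), value.strip()])
    return result


# Hypothetical sample mirroring the irregular whitespace in the question.
sample = io.StringIO(
    "     a\n"
    "  b : c 901\n"
    "   d : e sda\n"
    " v\n"
    "     w : x ads\n"
)
parsed = parse_sections(sample)
print(parsed)
```

Passing an open file object instead of the StringIO keeps memory use flat, since only one line is held at a time. Whether that is fast enough for the asker's file is something only a measurement can settle.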

answered Feb 14, 2013 at 10:25

1 Comment

Inbar Rose Over a year ago

This is the correct approach; no regex is needed. I had an answer here that I deleted because it is unnecessary; this is the solution you need. (It may need a little tweaking, but it's what you want.) +1

Anirudha · Accepted Answer · 2013-02-14 10:54:29Z

0

You can use this regex:

 (?:[\n\r]+|^)\s*(\w+)\s*[\n\r]+(\s*\w+\s*:\s*.*?)(?=[\n\r]+\s*\w+\s*[\n\r]+|$)

You need to match the regex above with the single-line (dot-all) option enabled.

On each match, group 1 holds the section name and group 2 holds that section's attribute lines.
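As a rough illustration of how this pattern could be driven from Python, the sketch below compiles it with re.DOTALL and walks the matches with finditer. The names SECTION_RE, text and sections, and the sample input, are assumptions made for the example. Because group 2 comes back as one raw block of attribute lines, the block is split into pairs by hand.

```python
import re

# Pattern copied from the answer; re.DOTALL lets `.` span line breaks.
SECTION_RE = re.compile(
    r"(?:[\n\r]+|^)\s*(\w+)\s*[\n\r]+(\s*\w+\s*:\s*.*?)"
    r"(?=[\n\r]+\s*\w+\s*[\n\r]+|$)",
    re.DOTALL,
)

# Hypothetical sample in the canonical layout from the question.
text = "\n".join([
    "a",
    "  b : c 901",
    "  d : e sda",
    "v",
    "  w : x ads",
    "any",
    "  abc : def 12132",
    "  ghi : jkl dasf",
])

sections = {}
for match in SECTION_RE.finditer(text):
    name, block = match.group(1), match.group(2)
    # Group 2 is the whole attribute block, so split it line by line
    # and then on the first colon of each line.
    sections[name] = [
        [part.strip() for part in line.split(":", 1)]
        for line in block.splitlines()
    ]
print(sections)
```

Note that this needs the whole input as one string in memory, which may matter for a very large file.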

edited Feb 14, 2013 at 10:54

answered Feb 14, 2013 at 10:34

root · Accepted Answer · 2013-02-14 15:26:12Z

0

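# A more golf-like solution; `data` is the list of lines.
from itertools import groupby

# list(...) is needed on Python 3, where map() returns an iterator that has no len().
groups = groupby(map(lambda s: list(map(str.strip, s.split(':'))), data), len)
dict((next(i[1])[0], list(next(groups)[1])) for i in groups)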

out:

{'a': [['b', 'c 901'], ['d', 'e sda']],
 'any': [['abc', 'def 12132'],
  ['ghi', 'jkl dasf'],
  ['mno', 'pqr fas'],
  ['stu', 'vwx utu']],
 'v': [['w', 'x ads']]}
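The one-liner depends on groupby producing alternating runs of one-field rows (section names) and two-field rows (attributes). A spelled-out sketch of that same idea, with a hypothetical data list and variable names chosen for the example, might look like this. It assumes values never contain a colon.

```python
from itertools import groupby

# Hypothetical sample lines; in practice this could be an open file.
data = [
    "a", "  b : c 901", "  d : e sda",
    "v", "  w : x ads",
]

# Split every line on ':' and strip each piece: section lines yield
# one field, attribute lines yield two.
rows = ([field.strip() for field in line.split(":")] for line in data)

result = {}
section = None
for field_count, group in groupby(rows, key=len):
    if field_count == 1:
        # A run of single-word lines: each opens a section, and the
        # last one owns the attribute rows that follow.
        for row in group:
            section = row[0]
            result.setdefault(section, [])
    elif field_count == 2 and section is not None:
        # A run of attribute rows for the current section.
        result[section].extend(group)
print(result)
```

This version is longer than the golfed form, but it tolerates two section names in a row instead of raising StopIteration.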

edited Feb 14, 2013 at 15:26

answered Feb 14, 2013 at 10:54
