5

I'd like to ask for your help.

I have a large piece of data, which looks like this:

     a
  b : c 901
   d : e sda
 v
     w : x ads
  any
   abc : def 12132
   ghi : jkl dasf
  mno : pqr fas
   stu : vwx utu

Description: the file begins with a line containing a single word (possibly with leading and trailing whitespace). It is followed by attribute lines, each a key and a value separated by a colon (also possibly surrounded by whitespace). After each attribute line comes either another attribute line or another single-word line. I can't come up with a regex that captures it in this form:

{
  "a": [["b", "c 901"], ["d", "e sda"]],
  "v": [["w", "x ads"]],
  "any": [["abc", "def 12132"], ["ghi", "jkl dasf"]],
  # etc.
}

Here is what I've tried:

import re

regex = str()
regex += "^(?:(?:\\s*)(.*?)(?:\\s*))$"
regex += "(?:(?:^(?:\\s*)(.*?)(?:\\s*):(?:\\s*)(.*?)(?:\\s*))$)*$"
pattern = re.compile(regex, re.S | re.M)

However, it doesn't find what I need. Could you help me? I know I could process the file without a regex, iterating line by line and checking for the ":" character, but the file is too big to process that way. (If you know how to process it quickly without a regex, that would also be a valid answer, but the first approach that comes to mind is too slow.)

Thanks in advance!

P.S. In its canonical form, the file looks like this:

a
  b : c 901
  d : e sda

Every section begins with a single word, followed by attribute lines (indented by two spaces) in which the key and value are separated by " : ". After each attribute line comes either another attribute line or a line with a single word. No other whitespace is allowed. This form will probably be easier to handle.

1
  • +1 Super Clarity; Neatly framed question. Commented Feb 14, 2013 at 10:37

3 Answers

3

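Are regular expressions really necessary here? Try something like this:

result = {}
last = None  # name of the section currently being filled

for raw_line in data:  # data: any iterable of lines, e.g. an open file
    parts = raw_line.strip().split(":", 1)  # split on the first colon only
    if parts == [""]:
        continue  # skip blank lines
    if len(parts) == 1:
        # A line without a colon starts a new section.
        last = parts[0]
        result.setdefault(last, [])
    elif last is not None:
        # "key : value" line: attach it to the current section.
        result[last].append([parts[0].strip(), parts[1].strip()])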

I hope I've understood your data structure correctly.
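Since the question worries about file size, here is a minimal sketch of the same loop wrapped in a function that consumes lines lazily, so the whole file never has to sit in memory. The name `parse_sections` and the in-memory `sample` are illustrative assumptions, not part of the original answer. In real use you would pass an open file object instead of the `io.StringIO` stand-in.

```python
from collections import defaultdict
import io


def parse_sections(lines):
    """Group 'key : value' lines under the most recent single-word header.

    `lines` can be any iterable of strings, including an open file,
    which is read one line at a time.
    """
    result = defaultdict(list)
    current = None
    for raw in lines:
        # partition() splits on the first colon only and never raises.
        key, sep, value = raw.partition(":")
        key = key.strip()
        if not key and not sep:
            continue  # blank line
        if not sep:
            current = key
            result[current]  # touch the key so empty sections are kept
        elif current is not None:
            result[current].append([key, value.strip()])
    return dict(result)


# Hypothetical sample mirroring the start of the data in the question.
sample = io.StringIO(
    "     a\n"
    "  b : c 901\n"
    "   d : e sda\n"
    " v\n"
    "     w : x ads\n"
)
parsed = parse_sections(sample)
```

With a real file this would be called as `parse_sections(open(path))`, where `path` is whatever file you are reading. Whether this is fast enough for your data is something only a measurement can tell.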


1 Comment

This is the correct approach; no regex is needed. I had an answer here that I deleted because it is unnecessary: this is the solution you need. (It may need a little tweaking, but it's what you want.) +1

0

You can use this regex:

 (?:[\n\r]+|^)\s*(\w+)\s*[\n\r]+(\s*\w+\s*:\s*.*?)(?=[\n\r]+\s*\w+\s*[\n\r]+|$)

You need to match the regex above with the single-line (dot-all) option enabled.

On each match, group 1 holds the section name and group 2 holds the attribute lines that follow it.
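As a rough sketch of how the pattern above might be applied in Python (the language used in the question), `re.DOTALL` is Python's name for the dot-all option. The names `SECTION`, `text`, and `parsed` are illustrative assumptions, and splitting group 2 into pairs is an extra step the regex itself does not perform. Note that this approach needs the whole input in a single string.

```python
import re

# The pattern from the answer, compiled with the dot-all option.
SECTION = re.compile(
    r"(?:[\n\r]+|^)\s*(\w+)\s*[\n\r]+"
    r"(\s*\w+\s*:\s*.*?)"
    r"(?=[\n\r]+\s*\w+\s*[\n\r]+|$)",
    re.DOTALL,
)

# Hypothetical sample mirroring the start of the data in the question.
text = "     a\n  b : c 901\n   d : e sda\n v\n     w : x ads\n"

parsed = {}
for name, block in SECTION.findall(text):
    # Group 2 is the raw block of attribute lines; split it into pairs.
    pairs = []
    for line in block.splitlines():
        key, _, value = line.partition(":")
        pairs.append([key.strip(), value.strip()])
    parsed[name] = pairs
```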


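0

# a more golf-like solution
# (assumes data has no blank lines and every section has at least one attribute)
from itertools import groupby

# Build real lists rather than lazy map objects so that len() can be the
# groupby key on Python 3 as well as Python 2.
groups = groupby(([p.strip() for p in s.split(':')] for s in data), len)
dict((next(i[1])[0], list(next(groups)[1])) for i in groups)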

out:

{'a': [['b', 'c 901'], ['d', 'e sda']],
 'any': [['abc', 'def 12132'],
  ['ghi', 'jkl dasf'],
  ['mno', 'pqr fas'],
  ['stu', 'vwx utu']],
 'v': [['w', 'x ads']]}
