
I am trying to parse a line in an mmCIF protein file into separate tokens using Excel 2000/2003. In the worst case, a line could look something like this:

token1 token2 "token's 1a',1b'" 'token4"5"' 12 23.2 ? . 'token' tok'en to"ken

Which should become the following tokens:

token1  
token2  
token's 1a',1b' (note: the double quotes have disappeared)  
token4"5" (note: the single quotes have disappeared)  
12  
23.2  
?  
.  
token (note: the single quotes have disappeared)  
tok'en  
to"ken  

Is it even possible to split this kind of line into tokens with a single RegEx?

  • With six answered questions and no accepted answers, it seems that you don't care much about your mates here. Commented Sep 10, 2010 at 3:19
  • Actually, you couldn't be farther from the truth, belisarius. I posted this question last night, received no notification from StackOverflow to my email that my question had been answered, and I got a smartass response from you. Totally uncalled for and unappreciated. Commented Sep 10, 2010 at 17:06
  • belisarius, I do see what you are talking about. I wasn't aware of that protocol to accept an answer. My apologies to all involved. In my estimation, however, you could have been a bit more diplomatic. I have accepted answers now on all previous questions. Commented Sep 10, 2010 at 17:22

2 Answers

Nice puzzle. Thanks.

This pattern (aPatt below) gets the tokens separated, but I can't figure out how to remove the outer quotes.

tallpaul() produces:

 token1
 token2
 "token's 1a',1b'"
 'token4"5"'
 12
 23.2
 ?
 .
 'token'
 tok'en
 to"ken

If you can figure out how to lose the outer quotes, please let us know. This needs a reference to "Microsoft VBScript Regular Expressions" to work.

Option Explicit
''returns a list of matches
Function RegExpTest(patrn, strng)
   Dim regEx   ' Create variable.
   Set regEx = New RegExp   ' Create a regular expression.
   regEx.Pattern = patrn   ' Set pattern.
   regEx.IgnoreCase = True   ' Set case insensitivity.
   regEx.Global = True   ' Set global applicability.
   Set RegExpTest = regEx.Execute(strng)   ' Execute search.
End Function

Function tallpaul() As Boolean
    Dim aString As String
    Dim aPatt As String
    Dim aMatch, aMatches

    '' need to pad the string with leading and trailing spaces.
    aString = " token1 token2 ""token's 1a',1b'"" 'token4""5""' 12 23.2 ? . 'token' tok'en to""ken "
    aPatt = "(\s'[^']+'(?=\s))|(\s""[^""]+""(?=\s))|(\s[\w\?\.]+(?=\s))|(\s\S+(?=\s))"
    Set aMatches = RegExpTest(aPatt, aString)

    For Each aMatch In aMatches
          Debug.Print aMatch.Value
    Next
    tallpaul = True
End Function

4 Comments

Brilliant, Marc! A lot cleaner code than what I had written to handle the situation. I'm not a pro on reg exp like you, but if I can figure out how to get it to remove the quotes I'll add a comment. For now, I'll just test both ends of each token to see if it contains the same quote delimiter and remove accordingly.
Thanks Paul. I'm no pro, but work I'm doing on statistical models for planning and estimating (goodplan.ca) has me back working with Excel and VBA after a long absence. What I do have is some good books. If you plan to be doing much of this stuff, I'd recommend "Excel 2007 VBA Programmer's Reference" by John Green et al. and, of course, the MSDN web site.
Haven't tested it, but I think you want to add a backslash in front of the parentheses that are grabbing the entire quoted token. This will not "grab" it but simply group it, to make sure it has a lower order of operation than the | (just in case, since order of operations doesn't require it here). Then add a new set of (grabbing) parens just inside the quotes, leaving the quotes outside so they aren't returned in the match, e.g. aPatt = "\(\s'([^']+)'(?=\s)\)| ...
I tested a bit and I think the backslash doesn't work, so (forget backslashes and) just add a new set of inner parens to also capture the token only, without the outer quotes.
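For anyone landing here later, the inner-parentheses idea from the comments above could be sketched roughly as follows. This is not Marc's code: TokenizeLine and DemoTokenizeLine are hypothetical names, the pattern is a simplified three-alternative variant of aPatt, and it assumes the same reference to "Microsoft VBScript Regular Expressions 5.5". Each alternative keeps the quotes outside its capture group, so the token without its outer quotes ends up in SubMatches rather than in Match.Value.

```vba
' Sketch only: returns the tokens of one line with the outer quotes removed.
' Assumes a reference to "Microsoft VBScript Regular Expressions 5.5".
' TokenizeLine is a hypothetical helper, not part of the answer above.
Function TokenizeLine(ByVal lineText As String) As Collection
    Dim re As RegExp
    Dim m As Object              ' one Match from the result collection
    Dim i As Long
    Dim tokens As New Collection

    Set re = New RegExp
    re.Global = True
    ' Alternatives: single-quoted, double-quoted, then any run of non-spaces.
    ' The quotes sit outside the capture groups, so they are not captured.
    re.Pattern = "\s'([^']+)'(?=\s)|\s""([^""]+)""(?=\s)|\s(\S+)(?=\s)"

    ' Pad with spaces so the first and last tokens are delimited as well
    For Each m In re.Execute(" " & lineText & " ")
        ' Only the group of the alternative that matched is non-empty
        For i = 0 To m.SubMatches.Count - 1
            If Len(m.SubMatches(i)) > 0 Then
                tokens.Add CStr(m.SubMatches(i))
                Exit For
            End If
        Next i
    Next m

    Set TokenizeLine = tokens
End Function

Sub DemoTokenizeLine()
    Dim t As Variant
    For Each t In TokenizeLine("token1 token2 ""token's 1a',1b'"" 'token4""5""' 12 23.2 ? . 'token' tok'en to""ken")
        Debug.Print t
    Next t
End Sub
```

The design choice here is to read SubMatches instead of Match.Value, which also drops the leading space that \s pulls into each match. Treat it as a starting point: quoted tokens that contain their own delimiter followed by a space are not handled by this pattern.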

It is possible to do:

You'll need to reference "Microsoft VBScript Regular Expressions 5.5" in your VBA Project, then...

Private Sub REFinder(PatternString As String, StringToTest As String)
    ' Declarations added so the code compiles under Option Explicit
    Dim RE As RegExp
    Dim Matches As MatchCollection
    Dim Match As Object      ' one Match from the collection
    Dim Item As Variant      ' one entry of Match.SubMatches

    Set RE = New RegExp

    With RE
        .Global = True
        .MultiLine = False
        .IgnoreCase = False
        .Pattern = PatternString
    End With

    Set Matches = RE.Execute(StringToTest)

    For Each Match In Matches
        ' Prints the value, the zero-based start index, the length, and the same text recovered with Mid
        Debug.Print Match.Value & " ~~~ " & Match.FirstIndex & " - " & Match.Length & " = " & Mid(StringToTest, Match.FirstIndex + 1, Match.Length)

        ' You get a submatch for each of the other possible conditions (if using ORs)
        For Each Item In Match.SubMatches
            Debug.Print "Submatch:" & Item
        Next Item
        Debug.Print
    Next Match

    Set RE = Nothing
    Set Matches = Nothing
    Set Match = Nothing
End Sub

Sub DoIt()
    ' This simply splits by space...
    REFinder "([.^\w]+\s)|(.+$)", "Token1 Token2 65.56"
End Sub

This is obviously just a really simple example, as I'm not very knowledgeable about RegExp; it's more to show you HOW it can be done in VBA (you'd probably also want to do something more useful than Debug.Print with the resulting tokens!). I'll have to leave writing the actual regular expression to somebody else, I'm afraid!
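As one possible way of doing "something more useful" with the results, here is a minimal sketch that returns the matches as an array instead of printing them. GetMatches and DemoGetMatches are hypothetical names, and the sketch uses late binding through CreateObject("VBScript.RegExp"), so it assumes the VBScript regular expression library is installed but does not need the project reference.

```vba
' Sketch only: late-bound variant that returns the matched text as an array.
' GetMatches is a hypothetical name, not part of REFinder above.
Function GetMatches(ByVal patternString As String, ByVal stringToTest As String) As Variant
    Dim re As Object
    Dim matches As Object
    Dim result() As String
    Dim i As Long

    Set re = CreateObject("VBScript.RegExp")   ' late binding, no reference required
    re.Global = True
    re.IgnoreCase = False
    re.Pattern = patternString

    Set matches = re.Execute(stringToTest)
    If matches.Count = 0 Then
        GetMatches = Array()                   ' empty array when nothing matched
        Exit Function
    End If

    ' Copy each match value into a zero-based String array
    ReDim result(0 To matches.Count - 1)
    For i = 0 To matches.Count - 1
        result(i) = matches(i).Value
    Next i
    GetMatches = result
End Function

Sub DemoGetMatches()
    Dim v As Variant
    ' "\S+" is just a simple whitespace-delimited pattern for illustration
    For Each v In GetMatches("\S+", "Token1 Token2 65.56")
        Debug.Print v
    Next v
End Sub
```

Returning an array keeps the regular expression in one place and lets the caller decide what to do with the tokens, for example writing them to worksheet cells.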

Simon

1 Comment

Thanks, Simon. I was familiar with the RegExp VB Script option. I just can't figure out the single reg expression! :-) I am using Option Explicit so I had to add Dim statements to the VBA. I'll keep working on the right RegExp... I had forgotten the OR capabilities so you helped me there!
