
How can I split the input strings below by regex in Go? I know how to split on a dot, but how can I avoid splitting inside quotes?

Examples of strings:

"a.b.c.d" -> ["a", "b", "c", "d"]
"a."b.c".d" -> ["a", "b.c", "d"]
"a.'b.c'.d" -> ["a", "b.c", "d"]
  • What have you tried? Show your code and what doesn't work. Then you will get a better answer. Commented Nov 6, 2018 at 17:25
  • golang.org/pkg/regexp/#Regexp.Split Commented Nov 6, 2018 at 17:27
  • A pure regex solution will be tricky, involving lookaheads and lookbehinds. The best way to do this is probably to split by dot first, then recombine the pieces that fall within quotes, putting the dot back in between. Commented Nov 6, 2018 at 17:28
  • Are those all the possible inputs? Are other forms possible, such as "'a.b'.c.d", or even more deeply nested patterns such as "'a."x.y"'.c.d"? Commented Nov 6, 2018 at 17:29
  • I don't think the second value would compile in any language - quotes within quotes. Commented Nov 6, 2018 at 17:46

3 Answers


Here is another option with a somewhat less hacky regex. It uses the trash bin trick: text you don't want is matched by an alternative without a capturing group and thrown away, so the real data ends up in the first and second capturing groups.

It works even with nested quotes like this: "a.'b.c'.d.e."f.g.h"", as long as the quotes are not nested two or more levels deep (as in "a.'b."c.d"'", which has quotes inside quotes inside quotes).

The regex is this: ^"|['"](\w+(?:\.\w+)*)['"]|(\w+)

And the code:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // Alternatives: a leading quote (discarded), a quoted dotted run (group 1), or a bare word (group 2).
    re := regexp.MustCompile(`^"|['"](\w+(?:\.\w+)*)['"]|(\w+)`)
    str := `"a.'b.c'.d.e."f.g.h""`

    for _, m := range re.FindAllStringSubmatch(str, -1) {
        // Skip matches that captured nothing, such as the leading quote.
        if m[1] != "" || m[2] != "" {
            fmt.Println(m[1] + m[2])
        }
    }
}

Input:

"a.'b.c'.d.e."f.g.h""

Output:

a
b.c
d
e
f.g.h

Since Go's regexp package doesn't support lookaheads, I don't think finding a regular expression that matches only the . you want to split on will be easy, or even possible. Instead, you can match the surrounding text and capture only the appropriate parts.

The regular expression itself is a bit ugly, but here's the breakdown (ignoring the escaping needed inside a Go string literal):

(\'[^.'"]+(?:\.[^.'"]+)+\')|(\"[^.'"]+(?:\.[^.'"]+)+\")|(?:([^.'"]+)\.?)|(?:\.([^.'"]+))

There are four scenarios that this regular expression matches, and captures certain subsets of these matches:

  • (\'[^.'"]+(?:\.[^.'"]+)+\') - Match and capture single-quoted text
    • \' - Match ' literally
    • [^.'"]+ - Match sequence of non-quotes and non-periods
    • (?:\.[^.'"]+)+ - Match a period followed by a sequence of non-quotes and non-periods, repeated as many times as needed. Not captured.
    • \' - Match ' literally
  • (\"[^.'"]+(?:\.[^.'"]+)+\") - Match and capture double-quoted text
    • Same as above but with double quotes
  • (?:([^.'"]+)\.?) - Match text followed by an optional ., not capturing the .
    • ([^.'"]+) - Match and capture sequence of non-quotes and non-periods
    • \.? - Optionally match a period (optional to capture the last bit of delimited text)
  • (?:\.([^.'"]+)) - Match text preceded by a ., not capturing the .
    • Same as above but with the period coming before the capture group, and also non-optional

Example code that dumps the captures:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    re := regexp.MustCompile("('[^.'\"]+(?:\\.[^.'\"]+)+')|(\"[^.'\"]+(?:\\.[^.'\"]+)+\")|(?:([^.'\"]+)\\.?)|(?:\\.([^.'\"]+))")
    txt := "a.b.c.'d.e'"

    result := re.FindAllStringSubmatch(txt, -1)

    // Each v is one match: v[0] is the full match, v[1] through v[4] are the capture groups.
    for k, v := range result {
        fmt.Printf("%d. %q\n", k, v)
    }
}

1 Comment

Could you show how you'd do it with pcre?

Matching balanced delimiters is a complex problem for regular expressions, as evidenced by John's answer, unless you're using something like the Go pcre package.

Instead, the Go CSV parser can be adapted. Configure it to use . as the separator, lazy quotes, and variable-length records. Note that encoding/csv only recognizes " as a quote character, so this handles the double-quoted form; a single-quoted field is still split on its dots.

package main

import (
    "encoding/csv"
    "fmt"
    "io"
    "log"
    "strings"
)

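func main() {
    // Raw string literal: the quotes need no backslash escaping here.
    lines := `a.b.c.d
a."b.c".d
a.'b.c'.d
`

    r := csv.NewReader(strings.NewReader(lines))
    r.Comma = '.'          // split fields on dots instead of commas
    r.LazyQuotes = true    // tolerate stray quotes inside fields
    r.FieldsPerRecord = -1 // allow a different number of fields per line
    for {
        record, err := r.Read()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }

        // The single-quoted line splits on every dot: only " is a CSV quote.
        fmt.Printf("%#v\n", record)
    }
}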
