4

I'm working with a rather large set of strings I have to process as quickly as possible.

The format is quite fixed:

[name]/[type]:[list ([key] = [value],)]

or

 [name]/[type]:[key]

I hope my representation is okay. What this means is that I have a word (in my case I call it Name), then a slash, followed by another word (I call it Type), then a colon, and it is either followed by a comma-separated list of key-value pairs (key = value), or a single key.

Name, Type cannot contain any whitespaces, however the key and value fields can.

Currently I'm using Regex to parse this data, and a split:

var regex = @"(\w+)\/(\w+):(.*)";
var r = new Regex(regex, RegexOptions.IgnoreCase | RegexOptions.Singleline);
var m = r.Match(Id);
if (m.Success) {
    Name = m.Groups[1].Value;
    Type= m.Groups[2].Value;

    foreach (var intern in m.Groups[3].Value.Split(','))
    {
        var split = intern.Trim().Split('=');
        if (split.Length == 2)
            Items.Add(split[0], split[1]);
        else if (split.Length == 1)
            Items.Add(split[0], split[0]);
    }
}

Now I know this is not the most optional case, but I'm not sure which would be the fastest:

  • Split the string first by the : then by / for the first element, and , for the second, then process the latter list and split again by =
  • Use the current mixture as it is
  • Use a completely regex-based

Of course I'm open to suggestions, my main goal is to achieve the fastest processing of this single string.

13
  • If the keys or values can have : or /, you should stick to the regex. Commented Dec 16, 2016 at 12:57
  • 3
    "my main goal is to achieve the fastest processing of this single string" - so measure the different approaches. You can use the Stopwatch class for this: best off to do it in Release configuration, and to run the tests multiple times and use the average result. Commented Dec 16, 2016 at 12:57
  • 1
    regex is usually not an optimal solution as it's built for general use cases (generic code is usually not optimal). You may need to consider writing your own code to do this Commented Dec 16, 2016 at 12:57
  • 1
    Possible duplicate of String.Split VS. Regex.Split? Commented Dec 16, 2016 at 12:58
  • 3
    Using a bunch of IndexOf statements might turn out to be even faster, as no arrays are created. The added complexity might not be worth the small performance gain, though. Commented Dec 16, 2016 at 13:01

1 Answer 1

1

Its always fun to implement a custom parser. Obviously concerning code maintenance, Regex is probably the best choice, but if performance is an ultimate concern then you probably need a tailor made parser which even in the simplest syntaxes is quite a lot more of work.

I've whipped one up really quick (it might be a little hackish in some places) to see what it would take to implement one with some basic error recovery and information. This isn't tested in any way but I'd be curious, if its minimally functional, to know how well it stacks up with the Regex solution in terms of performance.

public class ParserOutput
{
    public string Name { get; }
    public string Type { get; }
    public IEnumerable<Tuple<string, string>> KeyValuePairs { get; }
    public bool ContainsKeyValuePairs { get; }
    public bool HasErrors { get; }
    public IEnumerable<string> ErrorDescriptions { get; }

    public ParserOutput(string name, string type, IEnumerable<Tuple<string, string>> keyValuePairs, IEnumerable<string> errorDescriptions)
    {
        Name = name;
        Type = type;
        KeyValuePairs = keyValuePairs;
        ContainsKeyValuePairs = keyValuePairs.FirstOrDefault()?.Item2?.Length > 0;
        ErrorDescriptions = errorDescriptions;
        HasErrors = errorDescriptions.Any();
    }
}

public class CustomParser
{
    private const char forwardSlash = '/';
    private const char colon = ':';
    private const char space = ' ';
    private const char equals = '=';
    private const char comma = ',';

    StringBuilder buffer = new StringBuilder();

    public ParserOutput Parse(string input)
    {
        var diagnosticsBag = new Queue<string>();

        using (var enumerator = input.GetEnumerator())
        {
            var name = ParseToken(enumerator, forwardSlash, diagnosticsBag);
            var type = ParseToken(enumerator, colon, diagnosticsBag);
            var keyValuePairs = ParseListOrKey(enumerator, diagnosticsBag);

            if (name.Length == 0)
            {
                diagnosticsBag.Enqueue("Input has incorrect format. Name could not be parsed.");
            }

            if (type.Length == 0)
            {
                diagnosticsBag.Enqueue("Input has incorrect format. Type could not be parsed.");
            }

            if (!keyValuePairs.Any() ||
                input.Last() == comma /*trailing comma is error?*/)
            {
                diagnosticsBag.Enqueue("Input has incorrect format. Key / Value pairs could not be parsed.");
            }

            return new ParserOutput(name, type, keyValuePairs, diagnosticsBag);
        }
    }

    private string ParseToken(IEnumerator<char> enumerator, char separator, Queue<string> diagnosticsBag)
    {
        buffer.Clear();
        var allowWhitespaces = separator != forwardSlash && separator != colon;

        while (enumerator.MoveNext())
        {
            if (enumerator.Current == space && !allowWhitespaces)
            {
                diagnosticsBag.Enqueue($"Input has incorrect format. {(separator == forwardSlash ? "Name" : "Type")} cannot contain whitespaces.");
            }
            else if (enumerator.Current != separator)
            {
                buffer.Append(enumerator.Current);
            }
            else
                return buffer.ToString();
        }

        return buffer.ToString();
    }

    private IEnumerable<Tuple<string, string>> ParseListOrKey(IEnumerator<char> enumerator, Queue<string> diagnosticsBag)
    {
        buffer.Clear();
        var isList = false;

        while (true)
        {
            var key = ParseToken(enumerator, equals, diagnosticsBag);
            var value = ParseToken(enumerator, comma, diagnosticsBag);

            if (key.Length == 0)
                break;

            yield return new Tuple<string, string>(key, value);

            if (!isList && value.Length != 0)
            {
                isList = true;
            }
            else if (isList && value.Length == 0)
            {
                diagnosticsBag.Enqueue($"Input has incorrect format: malformed [key / value] list.");
            }
        }
    }
}
Sign up to request clarification or add additional context in comments.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.