How to extract the common part from a string list using C#

Question

This is my scenario!

List<String> list = new List<String>();
list.Add("E9215001");
list.Add("E9215045");
list.Add("E1115001");
list.Add("E1115022");
list.Add("E1115003");
list.Add("E2115041");
list.Add("E2115042");
list.Add("E4115021");
list.Add("E5115062");

I need to extract the following common parts from the above list using C# and LINQ:

E92150 -> Extracted From {*E92150*01, *E92150*45}

E11150 -> Extracted From {*E11150*01, *E11150*22, *E11150*03}

E21150 -> Extracted From {*E21150*41, *E21150*42}

E41150 -> Extracted From {*E41150*21}

E51150 -> Extracted From {*E51150*62}

UPDATE: Thank you, everyone! With the help of @mlorbetske and @shelleybutterfly, I've figured it out.

Solution:

list.Select((item, index) => new {
        Index = index,
        Length = Enumerable.Range(1, item.Length - 2) // I'm ignoring the last 2 characters
                           .Reverse()
                           .First(proposedLength => list.Count(innerItem =>
                               innerItem.StartsWith(item.Substring(0, proposedLength))) > 1)
    })
    .Select(n => list[n.Index].Substring(0, n.Length))
    .Distinct()

Would "E1115001" and "E1115003" be considered common as "E111500", etc., or only if all elements start with a common value?
– sa_ddam213, Commented Dec 22, 2012 at 10:50
No, they are not always the same. The first 6 characters are not constant; they may also vary. I've edited my question, please check it now. Thanks in advance!
– Pradeep, Commented Dec 22, 2012 at 10:59
Okay, so does that mean you need to extract something different, re: sa_ddam213's question above? E.g., with the list you have above, will you need to extract both E11150 and E111500, since both of those are repeated? [oops, E11150 isn't repeated as far as I can see, actually]
– shelleybutterfly, Commented Dec 22, 2012 at 11:05
Okay; the other way I see this possibly working would be to extract the longest common string (a harder problem), which gives me the list: "E92150", "E111500", "E211504", and "E". If that's what you need, I will take a look.
– shelleybutterfly, Commented Dec 22, 2012 at 11:13

Tim Schmelter · Accepted Answer · 2012-12-22 10:38:43Z

5

I doubt that this is what you're looking for, however:

var result = list.Select(s => s.Substring(0, 6))
                 .Distinct();
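
To see which items each prefix was "extracted from", as in the question's expected output, the same fixed-length idea can be written with GroupBy. This is only a sketch: the class and method names (`PrefixGroups`, `GroupByPrefix`) and the `prefixLength` parameter are hypothetical, and it assumes the prefix length is known in advance (6 here), which the asker later says is not always the case.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixGroups
{
    // Groups items by their first prefixLength characters.
    // Assumption: null entries and entries shorter than prefixLength are skipped,
    // because Substring would throw on them.
    public static Dictionary<string, List<string>> GroupByPrefix(
        IEnumerable<string> items, int prefixLength)
    {
        return items
            .Where(s => s != null && s.Length >= prefixLength)
            .GroupBy(s => s.Substring(0, prefixLength))
            .ToDictionary(g => g.Key, g => g.ToList());
    }

    public static void Main()
    {
        var list = new List<string>
        {
            "E9215001", "E9215045", "E1115001", "E1115022", "E1115003",
            "E2115041", "E2115042", "E4115021", "E5115062"
        };

        // One line per prefix, followed by the items that share it.
        foreach (var pair in GroupByPrefix(list, 6))
            Console.WriteLine(pair.Key + " -> " + string.Join(", ", pair.Value));
    }
}
```

For the sample list this yields five groups (E92150, E11150, E21150, E41150, E51150), including the prefixes that only one item has.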


3 Comments

Simon Whitehead Over a year ago

Identical to my answer.. discarded!

Jeppe Stig Nielsen Over a year ago

Also, for strings s of lengths exceeding six, s.Remove(6) is equivalent to s.Substring(0, 6).

Pradeep Over a year ago

Thank you! But the first 6 characters are not always constant; they may also vary. I've edited my question, please check it now. Thanks in advance!

mlorbetske · Accepted Answer · 2012-12-22 12:47:13Z

1

I'm not sure what the criteria for determining matches are, so I've written this. It's completely novel, and it's a 99.9999% certainty that it's not actually what you want.

Essentially, the outer select gets all the substrings of the determined length.

The first inner select determines the maximum length of this string that was found in at least one other string in the list.

The group by (following the first inner select) groups the found lengths by themselves.

This grouping is then converted to a dictionary of the length versus the number of times it was found.

We then order that set of groupings by frequency (Value) that the length was found (ascending).

Next, we take that actual length (the least frequently occurring length - from Key) and spit it back out into the second parameter of Substring so we take the substrings from 0 to that length. Of course, we're back in the outer select now, so we're actually getting values (hooray!).

Now, we take the distinct set of values from that result and voila!

list.Select(
    item => item.Substring(0, 
        list.Select(
            innerItem => Enumerable.Range(1, innerItem.Length)
                           .Reverse()
                           .First(proposedlength => list.Count(innerInnerItem => innerInnerItem.StartsWith(innerItem.Substring(0, proposedlength))) > 1)
                   )
            .GroupBy(length => length)
            .ToDictionary(grouping => grouping.Key, grouping => grouping.Count())
            .OrderBy(pair => pair.Value)
            .Select(pair => pair.Key)
            .First())
        ).Distinct()
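
To make the steps described above easier to follow, here is a sketch that splits the query into its stages and prints the intermediate values for the list from the question. The class and method names (`FirstQueryTrace`, `LongestSharedPrefixLength`, `LeastFrequentLength`) are made up for illustration; the logic is intended to mirror the query, with the dictionary step replaced by an equivalent ordering of the groups by their count.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FirstQueryTrace
{
    // Longest prefix length of item that more than one entry in list starts with.
    // Like the original, First throws if no prefix is shared at all.
    public static int LongestSharedPrefixLength(List<string> list, string item)
    {
        return Enumerable.Range(1, item.Length)
            .Reverse()
            .First(len => list.Count(other => other.StartsWith(item.Substring(0, len))) > 1);
    }

    // The length that occurs least often among the per-item lengths.
    public static int LeastFrequentLength(List<string> list)
    {
        return list.Select(item => LongestSharedPrefixLength(list, item))
            .GroupBy(len => len)
            .OrderBy(g => g.Count())
            .First()
            .Key;
    }

    public static void Main()
    {
        var list = new List<string>
        {
            "E9215001", "E9215045", "E1115001", "E1115022", "E1115003",
            "E2115041", "E2115042", "E4115021", "E5115062"
        };

        var lengths = list.Select(item => LongestSharedPrefixLength(list, item));
        Console.WriteLine(string.Join(",", lengths)); // 6,6,7,6,7,7,7,1,1

        int chosen = LeastFrequentLength(list);
        Console.WriteLine(chosen); // 1 (it occurs twice; 6 occurs three times, 7 four times)

        var result = list.Select(item => item.Substring(0, chosen)).Distinct();
        Console.WriteLine(string.Join(",", result)); // E
    }
}
```

With this data the least frequent length is 1, so the query collapses every item to "E".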

After reading the comments above, I see that there's also an interest in finding the distinct longest substrings present in any of the others for each term. Here's more novel code for that:

list.Select((item, index) => new {
    Index=index, 
    Length=Enumerable.Range(1, item.Length)
                     .Reverse()
                     .First(proposedLength => list.Count(innerItem => innerItem.StartsWith(item.Substring(0, proposedLength))) > 1)
}).Select(n => list[n.Index].Substring(0, n.Length))
  .Distinct()

In short, iterate through each item in the list and collect the index of the entry and the longest substring from the beginning of that element that may be found in at least one other entry in the list. Follow that by collecting all the substrings from each Index/Length pair and taking only the distinct set of strings.
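
The same idea can also be spelled out with plain loops, which some readers may find easier to step through in a debugger. This is a sketch rather than a drop-in replacement: the names (`LongestSharedPrefixLoops`, `Find`) are hypothetical, it uses an ordinal comparison, and unlike the LINQ version it silently skips an item that shares no prefix with any other entry instead of throwing.

```csharp
using System;
using System.Collections.Generic;

public static class LongestSharedPrefixLoops
{
    public static List<string> Find(IList<string> list)
    {
        var result = new List<string>();
        foreach (var item in list)
        {
            // Try the longest candidate prefix first, then shrink it one character at a time.
            for (int len = item.Length; len >= 1; len--)
            {
                string prefix = item.Substring(0, len);
                int matches = 0;
                foreach (var other in list)
                {
                    if (other.StartsWith(prefix, StringComparison.Ordinal)) matches++;
                }

                // The item always matches itself, so more than one match
                // means at least one other entry shares this prefix.
                if (matches > 1)
                {
                    if (!result.Contains(prefix)) result.Add(prefix); // keep distinct values only
                    break;
                }
            }
        }
        return result;
    }

    public static void Main()
    {
        var list = new List<string>
        {
            "E9215001", "E9215045", "E1115001", "E1115022", "E1115003",
            "E2115041", "E2115042", "E4115021", "E5115062"
        };
        Console.WriteLine(string.Join(", ", Find(list)));
        // E92150, E111500, E11150, E211504, E
    }
}
```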

edited Dec 22, 2012 at 12:47

answered Dec 22, 2012 at 11:23


6 Comments

Pradeep Over a year ago

The 2nd Solution is Excellent!!!! Thank You!!!!! Thank You!!! I just achieved it by modifying Length = Enumerable.Range(1, (item.Length - 2)) also I thank @shelleybutterfly for your effort..

shelleybutterfly Over a year ago

I agree about the genius part. :) hey @mlorbetske; I tried generating some random strings of numbers to test out my solution (not yet posted) vs. yours, and it seems to choke; does this assume all the strings are the same length?

shelleybutterfly Over a year ago

Hmm, I set all the strings to the same length, and it still is failing, does it assume there will at least be some matches, maybe?

Pradeep Over a year ago

@shelleybutterfly I use this to ease the attendance-marking process for the faculty. Usually, within a class, the roll numbers are all the same length and only the last 2 digits differ for each student. When a comma is pressed after typing a roll number, I automatically load the most common part next, so staff can type just the remaining 2 digits instead of the whole roll number. However, in some cases a class may contain students who were transferred from other classes, and then I need to find multiple common parts. Now, thanks to you guys, I'm able to load the first most common part after a comma press; if Alt+Comma is pressed, I load the next most common part, and I cycle through them.

mlorbetske Over a year ago

@shelleybutterfly there are 3 assumptions made: (1) none of the strings are null; (2) all of the strings have at least one letter in common; (3) (depending on which one you used) all of the strings are at least as long as the common substring length.
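
If the input cannot be trusted to contain no nulls and at least one shared letter, one possible way to relax those two requirements in the second query is to filter the input and use FirstOrDefault instead of First. This is a hedged sketch, not mlorbetske's code: the names (`SafeSharedPrefixes`, `Find`) are hypothetical, ignoring null or empty entries is an assumption, and it does not address the length requirement, which concerns the first query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SafeSharedPrefixes
{
    public static List<string> Find(IEnumerable<string> source)
    {
        // Assumption: null and empty entries are simply ignored.
        var list = source.Where(s => !string.IsNullOrEmpty(s)).ToList();

        return list
            .Select(item => Enumerable.Range(1, item.Length)
                .Reverse()
                .Select(len => item.Substring(0, len))
                // FirstOrDefault yields null instead of throwing
                // when no other entry shares any prefix with this item.
                .FirstOrDefault(prefix =>
                    list.Count(other => other.StartsWith(prefix, StringComparison.Ordinal)) > 1))
            .Where(prefix => prefix != null)
            .Distinct()
            .ToList();
    }

    public static void Main()
    {
        // "zz" shares nothing with the other entries and the null is skipped.
        var result = Find(new[] { "ab1", "ab2", "zz", null });
        Console.WriteLine(string.Join(", ", result)); // ab
    }
}
```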


shelleybutterfly · Accepted Answer · 2012-12-22 10:45:14Z

1

Does it need to be inline query syntax? If so, how about:

var result =
    from item in list
    select item.Substring(0,6);

or with the Distinct requirement:

var result =
    (
        from item in list
        select item.Substring(0,6)
    )
    .Distinct();


1 Comment

Pradeep Over a year ago

Thank you! But the first 6 characters are also not constant; they may vary. I've edited my question, please check it now. Thanks in advance!

Pradeep · Accepted Answer · 2012-12-22 14:01:05Z

0

SOLVED! Thanks to @mlorbetske and @shelleybutterfly

list.Select((item, index) => new {
        Index = index,
        Length = Enumerable.Range(1, item.Length - 2) // I don't need the last 2 chars, so I'm ignoring them
                           .Reverse()
                           .First(proposedLength => list.Count(innerItem =>
                               innerItem.StartsWith(item.Substring(0, proposedLength))) > 1)
    })
    .Select(n => list[n.Index].Substring(0, n.Length))
    .Distinct()
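
For anyone checking this against the list in the question, here is the same query wrapped in a small runnable program. The wrapper names (`TrimmedSharedPrefixes`, `Find`) and the `suffixLength` parameter, which generalises the hard-coded 2, are hypothetical. Note that for roll numbers whose six-character prefix no other entry shares (E4115021 and E5115062), the longest shared prefix is just "E", so the result contains "E" rather than E41150 and E51150.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TrimmedSharedPrefixes
{
    // suffixLength: number of trailing characters to ignore (2 in the answer above).
    // Like the original, this throws if an item shares no prefix with any other entry.
    public static List<string> Find(IList<string> list, int suffixLength)
    {
        return list
            .Select((item, index) => new
            {
                Index = index,
                Length = Enumerable.Range(1, item.Length - suffixLength)
                    .Reverse()
                    .First(len => list.Count(other =>
                        other.StartsWith(item.Substring(0, len))) > 1)
            })
            .Select(n => list[n.Index].Substring(0, n.Length))
            .Distinct()
            .ToList();
    }

    public static void Main()
    {
        var list = new List<string>
        {
            "E9215001", "E9215045", "E1115001", "E1115022", "E1115003",
            "E2115041", "E2115042", "E4115021", "E5115062"
        };

        // The result holds four values: E92150, E11150, E21150 and E.
        Console.WriteLine(string.Join(", ", Find(list, 2)));
    }
}
```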

