Remove duplicate array elements based on part of their content?

Question

Edit:

Context: I inherited a process (from a former co-worker) that generates a generic file that, among other things, creates the following list of items. The list will later need to be turned into a series of unordered links with nesting levels preserved.

From the following array, I need to remove duplicates based on the href attribute's value, regardless of how many times a value shows up.

var array = [
 '<tag href="cheese.html">',
 '<tag href="cheddar.html"></tag>',
 '  <tag href="cheese.html"></tag>',
 '</tag>',
 '<tag href="burger.html">',
 ' <tag href="burger.html">',
 '   <tag href="burger.html"></tag>',
 ' </tag>',
 '</tag>',
 '<tag href="lettuce.html">',
 '  <tag href="lettuce.html">',
 '    <tag href="lettuce.html"></tag>',
 '  </tag>',
 '</tag>',
 '<tag href="tomato.html">',
 '  <tag href="tomato.html"></tag>',
 '  <tag href="tomato.html">',
 '    <tag href="tomato.html"></tag>',
 '    <tag href="tomato.html">',
 '      <tag href="tomato.html"></tag>',
 '      <tag href="tomato.html">',
 '        <tag href="tomato.html"></tag>',
 '      </tag>',
 '    </tag>',
 '  </tag>',
 '</tag>',
];

After the array has all duplicates removed, it should look like this:

'<tag href="cheese.html">',
'<tag href="cheddar.html"></tag>',
'</tag>',
'<tag href="burger.html">',
'</tag>',
'<tag href="lettuce.html">',
'</tag>',

From here, I have no problems extracting the info I need to generate my unordered list of links. I just need help figuring out how to remove the duplicates.

Why do you end up with two </tag> values?

– subwaymatch, Commented Mar 9, 2017 at 0:54
One tag element is nested within another.

– Jawa, Commented Mar 9, 2017 at 15:26

Rustem Kakimov · Accepted Answer · 2017-03-09 01:40:32Z

It would be helpful to know the context of your problem.

This function keeps the first string for each href value, but does nothing about the closing tags. Removing the matching closing tags is a more complex task. Also, I'm fairly sure parsing HTML with regex is not a good idea.

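// Despite its name, this filters rather than sorts: it keeps the first
// string seen for each href value and drops later ones.
function sortByHref (array) {
  // [^"]* stops at the closing quote; a greedy (.*) could run past it
  var hrefReg = /href="([^"]*)"/;
  var seen = {};
  return array.filter(function (x) {
    var match = hrefReg.exec(x);
    if (match) {
      var href = match[1];
      if (seen.hasOwnProperty(href)) return false;
      seen[href] = true;
    }
    // Strings without an href (such as closing tags) are always kept
    return true;
  });
}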

There is probably another way to solve your problem, if you describe what exactly you are trying to accomplish.
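One way the closing tags could be handled is sketched below. It is not part of the original answer. It assumes one tag per line and well-nested input, as in the question's array, and the name dedupeByHref is made up for illustration. Each opening tag that is left open pushes a flag saying whether it was kept, and each closing tag pops that flag to decide whether it stays.

```javascript
// Sketch: drop lines whose href was already seen, and also drop the
// closing tag that belongs to each dropped opening tag.
// Assumes one tag per line and properly nested input.
function dedupeByHref(lines) {
  var hrefPattern = /href="([^"]*)"/;
  var seen = new Set();
  var keptStack = []; // one boolean per currently open tag

  return lines.filter(function (line) {
    var match = hrefPattern.exec(line);
    if (match) {
      var isNew = !seen.has(match[1]);
      seen.add(match[1]);
      // A tag with no closing tag on the same line stays open
      if (line.indexOf('</tag>') === -1) keptStack.push(isNew);
      return isNew;
    }
    if (line.trim() === '</tag>') {
      // Keep the closing tag only if its opening tag was kept
      return keptStack.length > 0 ? keptStack.pop() : true;
    }
    return true;
  });
}

var sample = [
  '<tag href="burger.html">',
  ' <tag href="burger.html">',
  '   <tag href="burger.html"></tag>',
  ' </tag>',
  '</tag>'
];
console.log(dedupeByHref(sample));
// [ '<tag href="burger.html">', '</tag>' ]
```

Kept lines retain their original indentation. Input that is not well nested would need a real parser instead.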


3 Comments

subwaymatch Over a year ago

Very nice and elegant solution.

Jawa Over a year ago

Works well but like you said, it doesn't do anything with the closing tags.

Jawa Over a year ago

I think I found a solution that extends what you did: create a second array, loop through the cleaned array (the output of your function), and push every element except those that match this: cleanedArray[i].indexOf(' </tag>') > -1. In my tests, this removes any closing tag element that has a space in front of it. I'll run deeper tests and confirm whether this works. Cheers!
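The two-pass idea in the last comment might look roughly like the sketch below. The function names uniqueByHref and dropIndentedClosers are hypothetical. The first pass is a compact stand-in for the answer's filter, and the second pass applies the indexOf(' </tag>') check from the comment.

```javascript
// First pass: keep only the first line seen for each href value
function uniqueByHref(lines) {
  var hrefPattern = /href="([^"]*)"/;
  var seen = Object.create(null);
  return lines.filter(function (line) {
    var match = hrefPattern.exec(line);
    if (!match) return true;          // closing tags pass through
    if (seen[match[1]]) return false; // href already seen
    seen[match[1]] = true;
    return true;
  });
}

// Second pass: drop any closing tag that has a space in front of it
function dropIndentedClosers(lines) {
  return lines.filter(function (line) {
    return line.indexOf(' </tag>') === -1;
  });
}

var cleaned = dropIndentedClosers(uniqueByHref([
  '<tag href="burger.html">',
  ' <tag href="burger.html">',
  '   <tag href="burger.html"></tag>',
  ' </tag>',
  '</tag>'
]));
console.log(cleaned);
// [ '<tag href="burger.html">', '</tag>' ]
```

This relies on the generated file indenting every nested closing tag and never indenting a top-level one. A nested element that is kept and closed on its own indented line would lose its closing tag.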

subwaymatch · Accepted Answer · 2017-03-09 01:56:03Z

Here is a purposely verbose solution to make it easier to follow. I am assuming that strings without an href value should be de-duplicated by comparing the whole string.

var arr = [
    '<tag href="cheese.html">',
    '<tag href="cheddar.html"></tag>',
    '  <tag href="cheese.html"></tag>',
    '</tag>',
    '<tag href="burger.html">',
    ' <tag href="burger.html">',
    '   <tag href="burger.html"></tag>',
    ' </tag>',
    '</tag>'
];

// Trim whitespace from both ends of each string in the array
// Not strictly necessary, but it makes leading and trailing whitespace irrelevant to the comparisons below
arr = arr.map(function(tagString) {
    return tagString.trim(); 
}); 

// Regex to retrieve href value from tags
var hrefRegexp = /(\s+href=\")([^\"]+)(\")/g;

// Create an array with just the href values for easier lookup
var hrefArr = arr.map(function(tagString) {
    // Run regex against the tag string
    var href = hrefRegexp.exec(tagString); 

    // Reset `RegExp`'s index
    hrefRegexp.lastIndex = 0; 

    // If no href match is found, return null, 
    if (href === null) return null; 

    // Otherwise, return the href value
    else return href[2]; 
});

// Store array length (this value will be used in the for loop below)
var arrLength = arr.length; 

// Begin from the left and compare values on the right
for (var leftCompareIndex = 0; leftCompareIndex < arrLength; leftCompareIndex++) {
    for (var rightCompareIndex = leftCompareIndex + 1; rightCompareIndex < arrLength; rightCompareIndex++) {

        // A flag variable to indicate whether the value on the right is a duplicate
        var isRightValueDuplicate = false; 

        // If href value doesn't exist, simply compare whole string
        if (hrefArr[leftCompareIndex] === null) {
            if (arr[leftCompareIndex] === arr[rightCompareIndex]) {
                isRightValueDuplicate = true; 
            }
        }

        // If href value does exist, compare the href values
        else {
            if (hrefArr[leftCompareIndex] === hrefArr[rightCompareIndex]) {
                isRightValueDuplicate = true; 
            }
        }

        // Check flag and remove duplicate element from both original array and href values array
        if (isRightValueDuplicate === true) {
            arr.splice(rightCompareIndex, 1); 
            hrefArr.splice(rightCompareIndex, 1); 
            arrLength--; 
            rightCompareIndex--; 
        }
    }
}

console.log(arr); 

/* Should output
[ '<tag href="cheese.html">',
  '<tag href="cheddar.html"></tag>',
  '</tag>',
  '<tag href="burger.html">' ]
  */
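For comparison, the same rule (compare by href when present, otherwise by whole trimmed string) can be written as a single pass with a Set. This is only a sketch under the same assumptions as the code above, and the name dedupeLines is made up for illustration. Like the nested loops, it does not restore the final closing tag.

```javascript
// Sketch: one pass over the trimmed lines, remembering a key per line.
// The key is the href value when there is one, else the whole line.
function dedupeLines(lines) {
  var hrefPattern = /\shref="([^"]+)"/;
  var seenKeys = new Set();

  return lines
    .map(function (line) { return line.trim(); })
    .filter(function (line) {
      var match = hrefPattern.exec(line);
      // Prefixes keep href keys and whole-line keys from colliding
      var key = match ? 'href:' + match[1] : 'line:' + line;
      if (seenKeys.has(key)) return false;
      seenKeys.add(key);
      return true;
    });
}

console.log(dedupeLines([
  '<tag href="burger.html">',
  ' <tag href="burger.html">',
  ' </tag>',
  '</tag>'
]));
// [ '<tag href="burger.html">', '</tag>' ]
```

The Set lookup avoids the repeated splice calls and the index bookkeeping of the nested loops.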

I like the solution but it doesn't add in the last closing tag for <tag href="burger.html">.
