Sorting an array of strings by arbitrary numbers of substrings

Question

I'm attempting to modify the OrderedImportsFixer class in php-cs-fixer so I can clean up my files the way I want. What I want is to order my imports in a fashion similar to what you'd see in a filesystem listing, with "directories" listed before "files".

So, given this array:

$indexes = [
    26 => ["namespace" => "X\\Y\\Zed"],
    9 =>  ["namespace" => "A\\B\\See"],
    3 =>  ["namespace" => "A\\B\\Bee"],
    38 => ["namespace" => "A\\B\\C\\Dee"],
    51 => ["namespace" => "X\\Wye"],
    16 => ["namespace" => "A\\Sea"],
    12 => ["namespace" => "A\\Bees"],
    31 => ["namespace" => "M"],
];

I'd like this output:

$sorted = [
    38 => ["namespace" => "A\\B\\C\\Dee"],
    3 =>  ["namespace" => "A\\B\\Bee"],
    9 =>  ["namespace" => "A\\B\\See"],
    12 => ["namespace" => "A\\Bees"],
    16 => ["namespace" => "A\\Sea"],
    26 => ["namespace" => "X\\Y\\Zed"],
    51 => ["namespace" => "X\\Wye"],
    31 => ["namespace" => "M"],
];

As in a typical filesystem listing:

I've been going at uasort for a while (key association must be maintained) and have come close. Admittedly, this is due more to desperate flailing than any sort of rigorous methodology. Not really having a sense of how uasort works is kind of limiting me here.

// get the maximum number of namespace components in the list
$ns_counts = array_map(function($val){
    return count(explode("\\", $val["namespace"]));
}, $indexes);
$limit = max($ns_counts);

for ($depth = 0; $depth <= $limit; $depth++) {
    uasort($indexes, function($first, $second) use ($depth, $limit) {
        $fexp = explode("\\", $first["namespace"]);
        $sexp = explode("\\", $second["namespace"]);
        if ($depth === $limit) {
            // why does this help?
            array_pop($fexp);
            array_pop($sexp);
        }
        $fexp = array_slice($fexp, 0, $depth + 1, true);
        $sexp = array_slice($sexp, 0, $depth + 1, true);
        $fimp = implode(" ", $fexp);
        $simp = implode(" ", $sexp);
        //echo "$depth: $fimp <-> $simp\n";
        return strnatcmp($fimp, $simp);
    });
}
echo json_encode($indexes, JSON_PRETTY_PRINT);

This gives me properly sorted output, but with deeper namespaces on the bottom instead of the top:

{
    "31": {
        "namespace": "M"
    },
    "12": {
        "namespace": "A\\Bees"
    },
    "16": {
        "namespace": "A\\Sea"
    },
    "3": {
        "namespace": "A\\B\\Bee"
    },
    "9": {
        "namespace": "A\\B\\See"
    },
    "38": {
        "namespace": "A\\B\\C\\Dee"
    },
    "51": {
        "namespace": "X\\Wye"
    },
    "26": {
        "namespace": "X\\Y\\Zed"
    }
}

I'm thinking I may have to build a separate array for each level of namespace and sort it separately, but have drawn a blank on how I might do that. Any suggestions for getting the last step of this working, or something completely different that doesn't involve so many loops?

mighty_dev · Accepted Answer · 2020-10-23 18:58:13Z

2

We divide this into 4 steps.

Step 1: Create hierarchical structure from the dataset.

function createHierarchicalStructure($indexes){
    $data = [];
    foreach($indexes as $d){
        $temp = &$data;
        foreach(explode("\\",$d['namespace']) as $namespace){
            if(!isset($temp[$namespace])){
                $temp[$namespace] = [];
            }
            $temp = &$temp[$namespace];
        }
    }
    
    return $data;
}

Split the namespaces by \\ and maintain a $data variable. Use & address reference to keep editing the same copy of the array.

Step 2: Sort the hierarchy in first folders then files fashion.

function fileSystemSorting(&$indexes){
    foreach($indexes as $key => &$value){
        fileSystemSorting($value);
    }
    
    uksort($indexes,function($key1,$key2) use ($indexes){
        if(count($indexes[$key1]) == 0 && count($indexes[$key2]) > 0) return 1;
        if(count($indexes[$key2]) == 0 && count($indexes[$key1]) > 0) return -1;
        return strnatcmp($key1,$key2);
    });
}

Sort the subordinate folders and use uksort for the current level of folders. Vice-versa would also work. If both 2 folders in comparison have subfolders, compare them as strings, else if one is a folder and another is a file, make folders come above.

Step 3: Flatten the hierarchical structure now that they are in order.

function flattenFileSystemResults($hierarchical_data){
    $result = [];
    foreach($hierarchical_data as $key => $value){
        if(count($value) > 0){
            $sub_result = flattenFileSystemResults($value);
            foreach($sub_result as $r){
                $result[] = $key . "\\" . $r;
            }   
        }else{
            $result[] = $key;
        }
    }
    
    return $result;
}

Step 4: Restore the initial data keys back and return the result.

function associateKeys($data,$indexes){
    $map = array_combine(array_column($indexes,'namespace'),array_keys($indexes));
    $result = [];
    foreach($data as $val){
        $result[ $map[$val] ] = ['namespace' => $val];
    }
    return $result;
}

Driver code:

function foldersBeforeFiles($indexes){
   $hierarchical_data = createHierarchicalStructure($indexes);
   fileSystemSorting($hierarchical_data);
   return associateKeys(flattenFileSystemResults($hierarchical_data),$indexes);
}

print_r(foldersBeforeFiles($indexes));

Demo: https://3v4l.org/cvoB2

edited Oct 23, 2020 at 18:58

answered Oct 21, 2020 at 21:40

mighty_dev

17.8k3 gold badges24 silver badges39 bronze badges

Sign up to request clarification or add additional context in comments.

12 Comments

miken32 Over a year ago

Depends on the project. I know some of the core Laravel files have > 20 import statements, with namespaces up to 6 or 7 levels deep.

miken32 Over a year ago

And of course, there are always outliers. ;)

miken32 Over a year ago

Nice trick with the references to build the hierarchy, that's what I was trying to do originally but drew a blank. I appreciate the effort but unfortunately this doesn't preserve the array's keys. php-cs-fixer needs the keys to reconstruct the document after fixing it.

miken32 Over a year ago

It seems to, yes

miken32 Over a year ago

Thanks for the effort, I marked Jeto's answer accepted as it was easier to integrate into the existing code.

|

Jeto · Accepted Answer · 2020-10-21 22:23:23Z

1

I believe the following should work:

uasort($indexes, static function (array $entry1, array $entry2): int {  
    $ns1Parts = explode('\\', $entry1['namespace']);
    $ns2Parts = explode('\\', $entry2['namespace']);

    $ns1Length = count($ns1Parts);
    $ns2Length = count($ns2Parts);

    for ($i = 0; $i < $ns1Length && isset($ns2Parts[$i]); $i++) {
        $isLastPartForNs1 = $i === $ns1Length - 1;
        $isLastPartForNs2 = $i === $ns2Length - 1;

        if ($isLastPartForNs1 !== $isLastPartForNs2) {
            return $isLastPartForNs1 <=> $isLastPartForNs2;
        }

        $nsComparison = $ns1Parts[$i] <=> $ns2Parts[$i];

        if ($nsComparison !== 0) {
            return $nsComparison;
        }
    }

    return 0;
});

What it does is:

split namespaces into parts,
compare each part starting from the first one, and:
- if we're at the last part for one and not the other, prioritize the one with the most parts,
- otherwise, if the respective parts are different, prioritize the one that is before the other one alphabetically.

Demo

edited Oct 21, 2020 at 22:23

answered Oct 21, 2020 at 21:35

Jeto

15k2 gold badges39 silver badges50 bronze badges

3 Comments

miken32 Over a year ago

When I tried this one live, it had trouble with single-element namespaces (e.g. Throwable). Kept putting them at the bottom of the list.

Jeto Over a year ago

Yeah I just realized you probably can't do it with a single/basic sort pass, as two entries might be sorted differently depending on other ones. For instance, if A and X\Y\Zed were the only two entries, X\Y\Zed should come first (folders first), right? Whereas it's the other way around in your example's case.

miken32 Over a year ago

Actually yes, you're right I should expect those entries to be at the bottom of the list. Going to test with some more live data and see how it goes.

Chris Haas · Accepted Answer · 2020-10-21 21:55:09Z

1

Here's another version that breaks the steps down further that, although it might not be the most optimal, definitely helps my brain think about it. See the comments for more details on what is going on:

uasort(
    $indexes,
    static function (array $a, array $b) {

        $aPath = $a['namespace'];
        $bPath = $b['namespace'];

        // Just in case there are duplicates
        if ($aPath === $bPath) {
            return 0;
        }

        // Break into parts
        $aParts = explode('\\', $aPath);
        $bParts = explode('\\', $bPath);

        // If we only have a single thing then it is a root-level, just compare the item
        if (1 === count($aParts) && 1 === count($bParts)) {
            return $aPath <=> $bPath;
        }

        // Get the class and namespace (file and folder) parts
        $aClass = array_pop($aParts);
        $bClass = array_pop($bParts);

        $aNamespace = implode('\\', $aParts);
        $bNamespace = implode('\\', $bParts);

        // If the namespaces are the same, sort by class name
        if ($aNamespace === $bNamespace) {
            return $aClass <=> $bClass;
        }

        // If the first namespace _starts_ with the second namespace, sort it first
        if (0 === mb_strpos($aNamespace, $bNamespace)) {
            return -1;
        }

        // Same as above but the other way
        if (0 === mb_strpos($bNamespace, $aNamespace)) {
            return 1;
        }

        // Just only by namespace
        return $aNamespace <=> $bNamespace;
    }
);

Online demo

edited Oct 21, 2020 at 21:55

answered Oct 21, 2020 at 21:41

Chris Haas

55.7k13 gold badges152 silver badges293 bronze badges

4 Comments

Chris Haas Over a year ago

I should note, this uses the spaceship operator, <=> but your code uses strnatcmp which handles numbers in strings and might be better suited.

miken32 Over a year ago

Numbers are pretty rare in namespaces, though I do have some old code imported from PEAR2\\Net. I will see how it works...

miken32 Over a year ago

I liked this approach best because of no loops, but it didn't work once I fed some real data to it. In my sample data, replacing A\B\Bee and A\B\See with Am\B\Bee and Am\B\See results in those items being ordered before A\Bees.

Chris Haas Over a year ago

@miken32, thanks for more sample data, I'll see if I can improve this!

mickmackusa · Accepted Answer · 2020-12-06 10:02:27Z

I find no fault with Jeto's algorithmic design, but I decided to implement it more concisely. My snippet avoids iterated function calls and arithmetic in the for() loop, uses a single spaceship operator, and a single return. My snippet is greater than 50% shorter and I generally find it easier to read, but then everybody thinks their own baby is cute, right?

Code: (Demo)

uasort($indexes, function($a, $b) {  
    $aParts = explode('\\', $a['namespace']);
    $bParts = explode('\\', $b['namespace']);
    $aLast = count($aParts) - 1;
    $bLast = count($bParts) - 1;

    for ($cmp = 0, $i = 0; $i <= $aLast && !$cmp; ++$i) {
        $cmp = [$i === $aLast, $aParts[$i]] <=> [$i === $bLast, $bParts[$i]];
    }
    return $cmp;
});

Like Jeto's answer, it iterates each array simultaneously and

checks if either element is the last element of the array, if so it is bumped down the list (because we want longer paths to win tiebreakers);
if neither element is the last in its array, then compare the elements' current string alphabetically.
The process repeats until a non-zero evaluation is generated.
Since duplicate entries are not expected to occur, the return value should always be -1 or 1 (never 0)

Note, I am halting the for() loop with two conditions.

If $i is greater than the number of elements in $aParts (only) -- because if $bParts has fewer elements than $aParts, then $cmp will generate a non-zero value before a Notice is triggered.
If $cmp is a non-zero value.

Finally, to explain the array syntax on either side of the spaceship operator...
The spaceship operator will compare the arrays from left to right, so it will behave like:

leftside[0] <=> rightside[0] then leftside[1] <=> rightside[1]

Making multiple comparisons in this way does not impact performance because there are no function calls on either side of the <=>. If there were function calls involved, it would be more performant to make individual comparisons in a fallback manner like:

fun($a1) <=> fun($b1) ?: fun($a2) <=> fun($b2)

This way subsequent function calls are only actually made if a tiebreak is necessary.

Collectives™ on Stack Overflow

Sorting an array of strings by arbitrary numbers of substrings

4 Answers 4

12 Comments

3 Comments

4 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

12 Comments

3 Comments

4 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related