0

I have an optimization issue that I'm not sure where to go from here. I have a program that tries to find the best combination of inputs that return the highest predicted r squared value. The problem is that I have 21 total inputs (List) and I need them in a set of 15 inputs. The formula for total combinations is:

n! / r!(n - r)! = 21! / 15!(21 - 15)! = 54,264 possible combinations

So obviously running through each combination and calculating the predicted rsquared is not an ideal solution so is there an better way/algorithm/method I can use to try to skip or narrow down the bad combinations so that I'm only processing the fewest amount of combinations? Here is my current psuedo code for this issue:

public BestCombo GetBestCombo(List<List<MultipleRegressionInfo>> combosList)
{
   BestCombo bestCombo = new BestCombo();

   foreach (var combo in combosList)
   {
      var predRsquared = CalculatePredictedRSquared(combo);

      if (predRsquared > bestCombo.predRSquared)
      {
         bestCombo.predRSquared = predRsquared;
         bestCombo.BestRSquaredCombo = combo;
      }
   }

   return bestCombo;
}

public class BestCombo
    {
        public double predRSquared { get; set; }
        public IEnumerable<MultipleRegressionInfo> BestRSquaredCombo { get; set; }
    }

public class MultipleRegressionInfo
{
    public List<double> input { get; set; }
    public List<double> output { get; set; }
}

public double CalculatePredictedRSquared(List<MultipleRegressionInfo> combo)
{
    Matrix<double> matrix = BuildMatrix(combo.Select(i => i.input).ToArray());
    Vector<double> vector = BuildVector(combo.ElementAt(0).output);
    var coefficients = CalculateWithQR(matrix, vector);
    var y = CalculateYIntercept(coefficients, input, output);
    var estimateList = CalculateEstimates(coefficients, y, input, output);
    return GetPredRsquared(estimateList, output);
}
16
  • If you're trying to maximize r^2 value, one thing you could do is compute the overall r squared first. Next, calculate the difference between predicted and actual values. Now only use the 15 smallest residuals. Recomputing your r^2 one more time should yield you actual r^2. Trying to compute the best r^2 brute force is going to take forever. I believe my theoretical algorithm would have O(n^3) complexity if you use normal arrays. Using trees or other datastructures may reduce the complexity. Commented Nov 17, 2017 at 3:42
  • @BennettYeo I edited my question to show more pseudo code because what I'm currently doing is getting the overall rsquared for all of the inputs in the current combo and at the end, I output the combo that has the best rsquared. Hopefully this helps explains things a bit better unless I'm misunderstanding what you are suggesting. Can you show me pseudo code for what you are suggesting? Commented Nov 17, 2017 at 4:52
  • @Rufus oops I just fixed that Commented Nov 17, 2017 at 5:07
  • Given two combinations is there a way to tell which one is the better? Commented Nov 17, 2017 at 5:21
  • 1
    What I'm getting at is: optimization algorithms almost always involve being able to either "prune" combinations that you know are going to be bad, or "hill climb" going from one good combination to a better combination, until you find the best. Commented Nov 17, 2017 at 5:29

1 Answer 1

1

54,264 is not enormous for a computer - it might be worth timing a few calls to compute R^2 and multiplying up to see just how long this would take.

There is a branch and bound algorithm for this sort of problem, which relies on the fact that R^2(A,B,C) >= R^2(A,B) - that the R^2 can only decrease when you drop a variable. Recursively search the space of all sets of variables of size at least 15. After computing the R^2 for a set of variables, make recursive calls with sets produced by dropping a single variable from the set, where any such drop must be to the right of any existing gap (so A.CDE produces A..DE, A.C.E, and A.CD. but not ..CDE, which will be produced by .BCDE). You can terminate the recursion when you get down to the desired size of set, or when you find an R^2 that is no better than the best answer so far.

If it happens that you often find R^2 values no better than the best answer so far, this will save time - but this is not guaranteed. You can attempt to improve the efficiency by chosing to investigate the sets with highest R^2 first, hoping that you find a new best answer good enough to rule out their siblings by the time you come to them, and by using a procedure to calculate R^2 for A.CDE that makes use of the calculations you have already done for ABCDE.

Sign up to request clarification or add additional context in comments.

4 Comments

I apologize for not specifying better and I amended my question but I'm calculate predicted rsquared because regular rsquared will just keep going up as you add more inputs but predicted rsquared is much better for verifying if your data is good for future predictions.
FWIW because the standard R^2 is always higher than any adjusted R^2, a standard R^2 value for A,B,C is still an upper bound on the adjusted value for A,B: standard A,B,C >= standard A,B >= adjusted A,B. Of course this bound may be too loose to be of any practical use.
So there are some issues with your answer. I'm calculating predicted rsquared and not adjusted rsquared. Predicted rsquared takes out one datapoint at a time and recalculates the remaining datapoints again for the best fit line to get an estimate for the missing datapoint and then adds up all residuals for every datapoint so for example if I have 4000 datapoints for the input then that is 4000 * 54,264 = 217,056,000 so that is why I need to be optimizing this process. I get what you are recommending but it wouldn't work with predicted rsquared
If you are calculating predicted r^2 by recalculating from scratch for every data point you can save some time. See for example pj.freefaculty.org/guides/stat/Regression/RegressionDiagnostics/… equation 30 P 10. If you work through this document from the beginning you will see an explanation of this and a definition of the hat matrix equation 8 P 3. The hat matrix is of size n^2 but looking at that equation I think you can save time calculating it by only calculating the diagonal from X and (X'X)^-1X'. See also otexts.org/1580

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.