I am struggling to get my ML model to accept the input and outputs that I need.

My aim is to have it accept this as the input:

input_x = [ 
    ((4.11, 8.58, -2.2), (-1.27, -8.76, 2.23)),
    ((0.43, -6.53, 1.25), (7.91, -10.76, 0.06)),
    ...
]

and this as the output:

output_y = [ 
    (34,24),
    (13,30),
    ...
]

The (X1,X2,X3) and the (Y1,Y2,Y3) are different measurements, but they need to be considered together to produce the (X, Y) output. The issue I am running into is that LogisticRegression() does not accept the multidimensional input/output:

    Found array with dim 3, while dim <= 2 is required by LogisticRegression

So, just to get my code running, I flattened everything to:

input_x = [
    (4.11, 8.58, -2.2, -1.27, -8.76, 2.23),
    (0.43, -6.53, 1.25, 7.91, -10.76, 0.06),
    ...
]

output_y = [
    7,
    -17,
    ...
]
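This flattening doesn't have to be done by hand. A minimal NumPy sketch (the reshape layout is an assumption about how the pairs should be unpacked):

```python
import numpy as np

# Nested input: 2 samples, each a pair of 3-value tuples -> shape (2, 2, 3)
nested = np.array([
    ((4.11, 8.58, -2.2), (-1.27, -8.76, 2.23)),
    ((0.43, -6.53, 1.25), (7.91, -10.76, 0.06)),
])

# scikit-learn expects a 2-D X of shape (n_samples, n_features)
X = nested.reshape(len(nested), -1)  # shape (2, 6)
```

This produces exactly the (X1, X2, X3, Y1, Y2, Y3) rows shown above.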

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(input_x, output_y, test_size=0.4, random_state=20, shuffle=False)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print(f"Accuracy: {metrics.accuracy_score(y_test, y_pred)}")
print(f"Recall: {metrics.recall_score(y_test, y_pred, average='micro')}")
print(f"Precision: {metrics.precision_score(y_test, y_pred, average='micro')}")

which runs, but has an accuracy of 4.5% - so here is the first round of questions:

  1. Is it possible to input data like ((X1, X2, X3), (Y1, Y2, Y3)) and output data like (X, Y)?
  2. Is LogisticRegression() the right thing to use for this project, or is there something better that I should be looking at?
  3. Are there any materials/sites/resources that I should checkout?

Ideally, this would be the end goal of the project:

input_x = [ 
    [10, ((4.11, 8.58, -2.2), (-1.27, -8.76, 2.23))],
    [15, ((0.43, -6.53, 1.25), (7.91, -10.76, 0.06))],
    ...
]
  • The added 10 & 15 represent the number of items used to create (X1,X2,X3)/(Y1,Y2,Y3), so a row built from 15 items should be more accurate than one built from 10 (the counts range from 8 to 21). If possible, I would like to weight the 21s more heavily than the 8s.

  • X1/Y1 and X2/Y2 are more important than X3/Y3; if possible I would like to weight the 1s and 2s at 40% each and the 3s at 20%.
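For illustration, one way to express this kind of per-feature weighting is to scale the flattened columns before fitting. A sketch, assuming the 40/40/20 split above:

```python
import numpy as np

# Flattened features: (X1, X2, X3, Y1, Y2, Y3)
X = np.array([
    (4.11, 8.58, -2.2, -1.27, -8.76, 2.23),
    (0.43, -6.53, 1.25, 7.91, -10.76, 0.06),
])

# Hypothetical per-feature weights: the 1s and 2s at 40%, the 3s at 20%
feature_weights = np.array([0.4, 0.4, 0.2, 0.4, 0.4, 0.2])

X_weighted = X * feature_weights  # column-wise scaling via broadcasting
```

Note that this only influences models that are sensitive to feature scale (regularized linear models, SVR, k-NN); tree-based models such as random forests are largely insensitive to monotone rescaling of individual columns.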

So, given my end goal:

  1. Is the end goal even possible?
  2. Does that change any of your above answers of what I should be using?
  3. Do you have any code samples or know of any code sites, that can get me to my end goal?

2 Replies

I don't have the final magical solution but here are my 2 cents.

  1. It is possible to map multiple inputs to multiple outputs. Flattening the input seems a good idea to me.

  2. Logistic Regression is for classification (e.g., cat vs dog), not continuous numerical outputs. That's why you get poor accuracy. What you have here is a multivariate regression problem.

  3. You can try to look into Random Forest Regression, Linear Regression, SVR, ...

  4. Sure, many scikit-learn regressors support sample_weight during training:

weights = [10, 15, ...]  # from your data
model.fit(X_train, y_train, sample_weight=w_train)  # w_train: the training slice of weights

Example:

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Flatten your inputs (two illustrative rows; use your full dataset)
input_x = [
    (4.11, 8.58, -2.2, -1.27, -8.76, 2.23),
    (0.43, -6.53, 1.25, 7.91, -10.76, 0.06),
]
output_y = [
    (34, 24),
    (13, 30),
]

weights = [10, 15]

# Splitting the weights alongside X and y keeps the rows aligned
X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(input_x, output_y, weights, test_size=0.4, random_state=20)

# RandomForestRegressor handles a 2-column target natively
model = RandomForestRegressor()
model.fit(X_train, y_train, sample_weight=w_train)

y_pred = model.predict(X_test)

# Regression metrics (accuracy only applies to classification)
print("R2 Score:", r2_score(y_test, y_pred))  # needs at least two test samples to be meaningful
print("MSE:", mean_squared_error(y_test, y_pred))
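If you try SVR from the list above, note that unlike RandomForestRegressor it doesn't accept a 2-column target directly; it can be wrapped in MultiOutputRegressor, which fits one regressor per output column. A sketch reusing the two toy rows:

```python
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

X = [
    (4.11, 8.58, -2.2, -1.27, -8.76, 2.23),
    (0.43, -6.53, 1.25, 7.91, -10.76, 0.06),
]
y = [
    (34, 24),
    (13, 30),
]

# One SVR is fitted independently per output column
model = MultiOutputRegressor(SVR())
model.fit(X, y)
pred = model.predict(X)  # shape (2, 2): one (X, Y) pair per sample
```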

From the post it's not very clear what your data and goals are, so here are some questions:

The (X1,X2,X3) and the (Y1,Y2,Y3) are different measurements

In modelling, X's usually represent input features (factors) and y's represent target (dependent) variables. From your quote above, do you mean that both the Xs and the Ys are two sets of features? Do X1 and Y1 represent the same feature, or are they different?

What data is X, and what determines its shape? Considering the input_x example:

input_x = [ 
    ((4.11, 8.58, -2.2), (-1.27, -8.76, 2.23)),
    ((0.43, -6.53, 1.25), (7.91, -10.76, 0.06)),
    ...
]

Is X a matrix of 3 features, with the first feature being the vector (4.11, -1.27, 0.43, 7.91, ...), the second (8.58, -8.76, -6.53, -10.76, ...) and so on, so that these are 4 records of 3 features? Or is there some other logic behind this grouping?

What's your goal, and what does your target look like? output_y is grouped in pairs - is your target multi-label, meaning that in (34, 24) the 34 represents the label of the first target and the 24 the label of the second? Do any of these targets logically depend on one another, or are they separate? Do you want to predict class labels or probabilities? Is it even a classification problem? :)
One example of multi-label classification would be trying to predict the type of a flower and its colour based on some characteristics, so the target looks like ('iris', 'blue').
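That flower/colour case can be sketched with scikit-learn's MultiOutputClassifier, which trains one classifier per target column (the toy data below is made up):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Made-up characteristics -> (flower type, colour)
X = [
    [5.1, 3.5],
    [4.9, 3.0],
    [6.3, 2.5],
    [6.5, 3.2],
]
y = [
    ("iris", "blue"),
    ("iris", "blue"),
    ("rose", "red"),
    ("rose", "red"),
]

clf = MultiOutputClassifier(LogisticRegression())
clf.fit(X, y)
pred = clf.predict([[5.0, 3.4]])  # one (flower, colour) label pair
```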

The approach to your problem really depends on how you answer the questions above. Here are some links with more in-depth info about multiclass and multi-label classification:
Scikit-Learn's multiclass and multi-output algorithms

Scikit-multilearn package
