How can I call scikit-learn classifiers from Java?

Question

I have a classifier that I trained using Python's scikit-learn. How can I use the classifier from a Java program? Can I use Jython? Is there some way to save the classifier in Python and load it in Java? Is there some other way to use it?

ogrisel · Accepted Answer · 2012-10-05 09:05:29Z

54

You cannot use Jython: scikit-learn relies heavily on NumPy and SciPy, which contain many compiled C and Fortran extensions and therefore cannot run under Jython.

The easiest ways to use scikit-learn in a Java environment would be to:

expose the classifier as an HTTP/JSON service, for instance using a microframework such as Flask, Bottle or Cornice, and call it from Java using an HTTP client library
write a command-line wrapper application in Python that reads data on stdin and writes predictions to stdout using some format such as CSV or JSON (or some lower-level binary representation), and call the Python program from Java, for instance using Apache Commons Exec
make the Python program output the raw numerical parameters learnt at fit time (typically as an array of floating point values) and reimplement the predict function in Java (this is typically easy for predictive linear models, where the prediction is often just a thresholded dot product)
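
The second option (a Python process driven over stdin/stdout) could look roughly like this on the Java side. This is only a sketch: it uses the JDK's own ProcessBuilder rather than Apache Commons Exec, and both the class name SubprocessPredictor and the script name predict.py are hypothetical. It assumes the script reads one CSV row per line and prints one prediction per line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Sends feature rows to a Python script over stdin and reads one prediction per line from stdout. */
class SubprocessPredictor {

    /** Formats one feature vector as a CSV line, e.g. {1.5, 2.0} becomes "1.5,2.0". */
    static String toCsvLine(double[] features) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < features.length; i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(features[i]);
        }
        return line.toString();
    }

    /**
     * Runs the given command (for example ["python", "predict.py"], a hypothetical script),
     * writes all rows to its stdin, then collects the lines it prints on stdout.
     * Writing everything before reading is fine for small batches; for large ones,
     * write from a separate thread so that a full pipe buffer cannot block both sides.
     */
    static List<String> predict(List<String> command, List<double[]> rows)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectError(ProcessBuilder.Redirect.INHERIT) // show Python tracebacks on our stderr
                .start();

        try (BufferedWriter stdin = new BufferedWriter(
                new OutputStreamWriter(process.getOutputStream(), StandardCharsets.UTF_8))) {
            for (double[] row : rows) {
                stdin.write(toCsvLine(row));
                stdin.write('\n');
            }
        } // closing stdin signals end of input to the script

        List<String> predictions = new ArrayList<>();
        try (BufferedReader stdout = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = stdout.readLine()) != null) {
                predictions.add(line);
            }
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Python process exited with code " + exitCode);
        }
        return predictions;
    }
}
```

Starting a new Python interpreter and unpickling the model on every call is slow, so batching many rows into one invocation, as predict does here, is usually preferable to one process per sample.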

The last approach will be a lot more work if you need to re-implement feature extraction in Java as well.
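
For a binary linear model, reimplementing predict in Java can be as small as the sketch below. It assumes you have exported the learnt weights and bias from Python yourself (for scikit-learn linear classifiers these live in the coef_ and intercept_ attributes); the class name LinearClassifier is made up for illustration, and feature extraction is assumed to have happened already.

```java
/** Minimal re-implementation of the predict step of a binary linear classifier. */
class LinearClassifier {
    private final double[] weights;  // one weight per feature, exported from the Python side
    private final double intercept;  // bias term, exported from the Python side

    LinearClassifier(double[] weights, double intercept) {
        this.weights = weights.clone();
        this.intercept = intercept;
    }

    /** Computes the decision value w . x + b. */
    double decisionFunction(double[] features) {
        if (features.length != weights.length) {
            throw new IllegalArgumentException(
                    "Expected " + weights.length + " features but got " + features.length);
        }
        double score = intercept;
        for (int i = 0; i < weights.length; i++) {
            score += weights[i] * features[i];
        }
        return score;
    }

    /** Thresholds the decision value at zero: returns 1 for a positive score, 0 otherwise. */
    int predict(double[] features) {
        return decisionFunction(features) > 0.0 ? 1 : 0;
    }
}
```

With made-up parameters, new LinearClassifier(new double[]{2.0, -1.0}, -0.5).predict(new double[]{1.0, 1.0}) returns 1, because the decision value is -0.5 + 2.0 - 1.0 = 0.5. The returned 0 or 1 is an index into the classifier's class labels; a multiclass model would instead need one weight vector per class and an argmax over the scores.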

Finally, you can use a Java library such as Weka or Mahout that implements the algorithms you need, instead of trying to use scikit-learn from Java.

answered Oct 5, 2012 at 9:05


5 Comments

Thomas Johnson Over a year ago

One of my coworkers just suggested Jepp...is that something that would work for this?

ogrisel Over a year ago

Probably, I did not know about jepp. It indeed looks suited for the task.

rbanffy Over a year ago

For a web app, I personally like the HTTP exposure approach better. @user939259 could then use a classifier pool for various apps and scale it more easily (sizing the pool according to demand). I'd only consider Jepp for a desktop app. As much a Python lover as I am, unless scikit-learn has significantly better performance than Weka or Mahout, I'd go for a single-language solution. Having more than one language/framework should be considered technical debt.

ogrisel Over a year ago

I agree about the multi-language technical debt: it's hard to work in a team where all devs know both Java and Python, and having to switch from one technical culture to the other adds useless complexity to the management of the project.

Thomas Johnson Over a year ago

Maybe it is technical debt - but to stretch the metaphor, in machine learning you're constantly declaring bankruptcy anyways because you're trying stuff out, finding it doesn't work, and tweaking it / throwing it away. So maybe the debt isn't as big a deal in a case like that.

Dmitry Spikhalsky · Accepted Answer · 2022-11-16 14:57:20Z

25

There is the JPMML project for this purpose.

First, you can serialize a scikit-learn model to PMML (which is XML internally), either directly from Python using the sklearn2pmml library, or by dumping the model in Python first and converting it with jpmml-sklearn, in Java or from the command line that this library provides. Next, you can load the PMML file, deserialize it and execute the loaded model using jpmml-evaluator in your Java code.

This approach does not work with every scikit-learn model, but it works with many of them.

As some commenters correctly pointed out, it's important to note that the JPMML project is licensed under the GNU AGPL. The AGPL is a strong copyleft license, which may limit your ability to use the project, for example if you develop a publicly accessible service and want to keep its sources closed.

edited Nov 16, 2022 at 14:57

answered Aug 10, 2016 at 16:31


4 Comments

Andrea Bergonzo Over a year ago

How do you ensure that the feature transformation part is consistent between the one done in Python for training and the one done in Java (using pmml) for serving?

leon Over a year ago

I tried this, and it definitely works for converting sklearn transformers and xgboost model to Java. However, we didn't choose this in our production environment because of the AGPL license. (There is also a commercial license, but negotiating a license does not fit our project timeline.)

Indrajit Kanjilal Over a year ago

I tried this and kept all the feature extraction, cleaning and transformation logic in the Java program. It works fine on the Java side (jpmml-evaluator). It is a good option for a containerized Spring Boot application and greatly reduces the DevOps complexity, as the frequency and timeline of the Python training cannot be synchronized with the continuous integration of the Java program.

Christopher Schultz Over a year ago

@leon's comment is super important, especially for people who copy/paste solutions from SO answers as a significant part of their software development lifecycle. If you use jpmml-evaluator in your product, your users could force you to disclose all the source code to your product. This is the Big Bad Wolf that Microsoft was warning people about when they equated all Open Source Software to libraries licensed under GPL (not LGPL) and similar licenses. Always read your licenses!

gustavoresque · Accepted Answer · 2018-03-04 19:16:07Z

You can also use a porter. I have tested sklearn-porter (https://github.com/nok/sklearn-porter), and it works well for Java.

My code is the following:

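import pandas as pd
from sklearn import tree
from sklearn_porter import Porter

# Load the dataset as a NumPy array (as_matrix() was removed in pandas 1.0)
train_dataset = pd.read_csv('./result2.csv').to_numpy()

# First 90 rows for training: columns 0-7 are the features, the rest is the label
X_train = train_dataset[:90, :8]
Y_train = train_dataset[:90, 8:]

# Remaining rows are held out for testing
X_test = train_dataset[90:, :8]
Y_test = train_dataset[90:, 8:]

print(X_train.shape)
print(Y_train.shape)

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, Y_train)

# Transpile the fitted tree to Java source code
porter = Porter(clf, language='java')
output = porter.export(embed_data=True)
print(output)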

In my case, I'm using a DecisionTreeClassifier, and the output of

print(output)

is the following code as text in the console:

class DecisionTreeClassifier {

  private static int findMax(int[] nums) {
    int index = 0;
    for (int i = 0; i < nums.length; i++) {
        index = nums[i] > nums[index] ? i : index;
    }
    return index;
  }


  public static int predict(double[] features) {
    int[] classes = new int[2];

    if (features[5] <= 51.5) {
        if (features[6] <= 21.0) {

            // HUGE amount of ifs..........

        }
    }

    return findMax(classes);
  }

  public static void main(String[] args) {
    if (args.length == 8) {

        // Features:
        double[] features = new double[args.length];
        for (int i = 0, l = args.length; i < l; i++) {
            features[i] = Double.parseDouble(args[i]);
        }

        // Prediction:
        int prediction = DecisionTreeClassifier.predict(features);
        System.out.println(prediction);

    }
  }
}

Thanks for the info. Can you share your ideas on how to execute a pickled sklearn model using sklearn-porter and use it for prediction in Java? @gustavoresque

Volokh · Accepted Answer · 2018-02-16 13:05:25Z

Here is some code for the JPMML solution:

--PYTHON PART--

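from decimal import Decimal

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import LabelBinarizer
from sklearn_pandas import DataFrameMapper
from sklearn2pmml.pipeline import PMMLPipeline

# df is assumed to be a pandas DataFrame that already holds the input features.

# Helper function to determine the string columns that have to be one-hot encoded before an estimator can be applied.
def determine_categorical_columns(df):
    categorical_columns = []
    for name, dtype in zip(df.columns, df.dtypes):
        if dtype == 'object':
            # object columns that hold Decimal values are numeric, not categorical
            val = df[name].iloc[0]
            if not isinstance(val, Decimal):
                categorical_columns.append(name)
    return categorical_columns

categorical_columns = determine_categorical_columns(df)
other_columns = list(set(df.columns).difference(categorical_columns))


# Construction of the transformers for our example
labelBinarizers = [(d, LabelBinarizer()) for d in categorical_columns]
nones = [(d, None) for d in other_columns]
transformators = labelBinarizers + nones

mapper = DataFrameMapper(transformators, df_out=True)
gbc = GradientBoostingClassifier()

# Construction of the pipeline
lm = PMMLPipeline([
    ("mapper", mapper),
    ("estimator", gbc)
])

# Not shown here: the pipeline still has to be fitted and exported to the PMML file that the Java part loads (ScikitLearnNew.pmml).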

--JAVA PART--

//Initialisation.
String pmmlFile = "ScikitLearnNew.pmml";
PMML pmml = org.jpmml.model.PMMLUtil.unmarshal(new FileInputStream(pmmlFile));
ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
MiningModelEvaluator evaluator = (MiningModelEvaluator) modelEvaluatorFactory.newModelEvaluator(pmml);

//Determine which features are required as input
Map<String, Field> inputFieldMap = new HashMap<String, Field>();
for (int i = 0; i < evaluator.getInputFields().size();i++) {
  InputField curInputField = evaluator.getInputFields().get(i);
  String fieldName = curInputField.getName().getValue();
  inputFieldMap.put(fieldName.toLowerCase(),curInputField.getField());
}


//prediction

HashMap<String,String> argsMap = new HashMap<String,String>();
//... fill argsMap with input

Map<FieldName, ?> res;
// here we keep only features that are required by the model
Map<FieldName,String> args = new HashMap<FieldName, String>();
Iterator<String> iter = argsMap.keySet().iterator();
while (iter.hasNext()) {
  String key = iter.next();
  Field f = inputFieldMap.get(key);
  if (f != null) {
    FieldName name =f.getName();
    String value = argsMap.get(key);
    args.put(name, value);
  }
}
//the model is applied to input, a probability distribution is obtained
res = evaluator.evaluate(args);
SegmentResult segmentResult = (SegmentResult) res;
Object targetValue = segmentResult.getTargetValue();
ProbabilityDistribution probabilityDistribution = (ProbabilityDistribution) targetValue;

theyCallMeJun · Accepted Answer · 2018-05-11 12:50:16Z

1

I found myself in a similar situation. I'd recommend carving out a classifier microservice: run the classifier in a Python service and expose it through a RESTful API that exchanges data as JSON or XML. I think this is a cleaner approach.
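
On the Java side, calling such a service might look like the sketch below. It uses the JDK's built-in java.net.http client (Java 11 or later). The class name ClassifierClient, the endpoint URL and the JSON payload shape {"features": [...]} are all assumptions for illustration; the real request and response formats are whatever your Python service defines.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Arrays;
import java.util.stream.Collectors;

/** Thin client for a (hypothetical) prediction endpoint served by a Python process. */
class ClassifierClient {
    private final HttpClient http = HttpClient.newHttpClient();
    private final URI endpoint;

    ClassifierClient(URI endpoint) {
        this.endpoint = endpoint;
    }

    /** Serializes a feature vector as {"features": [1.0, 2.5]}; the payload shape is an assumption. */
    static String toJson(double[] features) {
        return Arrays.stream(features)
                .mapToObj(value -> Double.toString(value))
                .collect(Collectors.joining(", ", "{\"features\": [", "]}"));
    }

    /** Builds the POST request without sending it, which keeps this part easy to unit test. */
    HttpRequest buildRequest(double[] features) {
        return HttpRequest.newBuilder(endpoint)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(toJson(features)))
                .build();
    }

    /** Sends the features and returns the raw response body; parsing it is left to the caller. */
    String predict(double[] features) throws IOException, InterruptedException {
        HttpResponse<String> response =
                http.send(buildRequest(features), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Prediction service returned HTTP " + response.statusCode());
        }
        return response.body();
    }
}
```

A caller would create the client once, for example with URI.create("http://localhost:5000/predict") as a placeholder address, and reuse it across requests. In a real project a JSON library would be a safer choice than hand-built strings, since Double.toString produces NaN and Infinity, which are not valid JSON.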

answered May 11, 2018 at 12:50


Viktor Ershov · Accepted Answer · 2019-02-21 17:57:05Z

1

Alternatively, you can generate native code (Java, among other languages) from a trained model. Here is a tool that can help you with that: https://github.com/BayesWitnesses/m2cgen

answered Feb 21, 2019 at 17:57


Waleed · Accepted Answer · 2024-11-21 17:50:27Z

0

You can also use ONNX Runtime; in fact, it might be the best of these options. At a high level, this is what you can do:

Train the model in scikit-learn.
Convert the model to the ONNX format.
Save it to disk.
Load the model in Java via ONNX Runtime.
Execute inference.


answered Nov 21, 2024 at 17:50
