2

I am trying to create a multi-model endpoint in AWS sagemaker using Scikit-learn and a custom training script. When I attempt to train my model using the following code:

estimator = SKLearn(
    entry_point=TRAINING_FILE, # script to use for training job
    role=role,
    source_dir=SOURCE_DIR, # Location of scripts
    train_instance_count=1,
    train_instance_type=TRAIN_INSTANCE_TYPE,
    framework_version='0.23-1',
    output_path=s3_output_path,# Where to store model artifacts
    base_job_name=_job,
    code_location=code_location,# This is where the .tar.gz of the source_dir will be stored
    hyperparameters = {'max-samples'    : 100,
                       'model_name'     : key})

DISTRIBUTION_MODE = 'FullyReplicated'

train_input = sagemaker.s3_input(s3_data=inputs+'/train', 
                                  distribution=DISTRIBUTION_MODE, content_type='csv')
    
estimator.fit({'train': train_input}, wait=True)

where 'TRAINING_FILE' contains:


import argparse
import os

import numpy as np
import pandas as pd
import joblib
import sys

from sklearn.ensemble import IsolationForest

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    parser.add_argument('--max_samples', type=int, default=100)
    
    parser.add_argument('--model_dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--model_name', type=str)

    args, _ = parser.parse_known_args()

    print('reading data. . .')
    print('model_name: '+args.model_name)    
    
    train_file = os.path.join(args.train, args.model_name + '_train.csv')    
    train_df = pd.read_csv(train_file) # read in the training data
    train_tgt = train_df.iloc[:, 1] # target column is the second column
    
    clf = IsolationForest(max_samples = args.max_samples)
    clf = clf.fit([train_tgt])
    
    path = os.path.join(args.model_dir, 'model.joblib')
    joblib.dump(clf, path)
    print('model persisted at ' + path)

The training script succeeds but sagemaker throws an UnexpectedStatusException: enter image description here

Has anybody ever experienced anything like this before? I've checked all the cloudwatch logs and found nothing of use, and I'm completely stumped on what to try next.

1
  • Not a useful error message by AWS! Commented Oct 20, 2020 at 15:52

1 Answer 1

1

To anyone that comes across this issue in future, the problem has been solved.

The issue was nothing to do with the training, but with invalid characters in directory names being sent to S3. So the script would produce the artifacts correctly, but sagemaker would throw an exception when trying to save them to S3

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.