I'm not sure whether this is a bug (in which case I should file a report) or whether I did something wrong.

System information:

OS: Linux 17.04
TensorFlow version: 1.9.0
Python version: 2.7.13

Command I used:

gcloud ml-engine jobs submit training object_detection_$(date +%Y%m%d_%H%M%S)  \
    --job-dir="gs://mybucket/train" \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config /home/me/Desktop/die_detection/config.yml \
    -- \
    --train_dir="gs://mybucket/train" \
    --pipeline_config_path="gs://mybucket/data/pipeline_cloud.config"

I tried following this sample, but with my own data: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md

Training works fine locally. On Cloud ML Engine, the job exits with a non-zero status, and the logs suggest that the module object_detection.train cannot be found.

Source code / logs

E  The replica ps 0 exited with a non-zero status of 1. Termination reason: Error. To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=730275006403&resource=ml_job%2Fjob_id%2Fobject_detection_20180725_090524&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22object_detection_20180725_090524%22 
E  ps-replica-0 Command '['python', '-m', u'object_detection.train', u'--train_dir=gs://mybucket/train', u'--pipeline_config_path=gs://mybucket/data/pipeline_cloud.config', '--job-dir', u'gs://mybucket/train']' returned non-zero exit status 1 ps-replica-0 
E  ps-replica-0 /usr/bin/python: No module named object_detection.train ps-replica-0 

My pipeline.config:

# SSD with Mobilenet v1, configured for Oxford-IIIT Pets Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.

model {
  ssd {
    num_classes: 1
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
          batch_norm {
            train: true,
            scale: true,
            center: true,
            decay: 0.9997,
            epsilon: 0.001,
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v1'
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
    }
    loss {
      classification_loss {
        weighted_sigmoid {
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 0
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 24
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.0004
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }

  num_steps: 20000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options { 
    ssd_random_crop {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "gs://mybucket/data/train.record"
  }
  label_map_path: "gs://mybucket/data/object-detection.pbtxt"
}

eval_config: {
  metrics_set: "coco_detection_metrics"
  num_examples: 32
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "gs://mybucket/data/val.record"
  }
  label_map_path: "gs://mybucket/data/object-detection.pbtxt"
  shuffle: false
  num_readers: 1
}

My config.yml:

trainingInput:
  runtimeVersion: "1.0"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 1
  workerType: standard_gpu
  parameterServerCount: 1
  parameterServerType: standard

2 Answers

I assume you're using the unmodified object detection sample. According to https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md#starting-training-and-evaluation-jobs-on-google-cloud-ml-engine, --module-name should be object_detection.model_main rather than object_detection.train. Could you double-check whether your dist/object_detection-0.1.tar.gz actually contains object_detection/train.py?
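One way to do that check without unpacking the archive by hand is sketched below. It is only an illustration: the helper name sdist_has_module is made up for this answer, and it assumes the usual sdist layout, where every file sits under a single top-level folder such as object_detection-0.1/. It also only looks for plain .py files, not for packages defined by an __init__.py.

```python
import tarfile


def sdist_has_module(tarball_path, module_name):
    """Return True if the .tar.gz sdist contains a .py file for module_name.

    Hypothetical helper, not part of gcloud or the object detection API.
    """
    # "object_detection.train" -> "object_detection/train.py"
    wanted = module_name.replace(".", "/") + ".py"
    with tarfile.open(tarball_path, "r:gz") as archive:
        for name in archive.getnames():
            # Drop the top-level "<package>-<version>/" folder before comparing.
            relative = name.split("/", 1)[-1]
            if relative == wanted:
                return True
    return False
```

Run from models/research, sdist_has_module("dist/object_detection-0.1.tar.gz", "object_detection.train") returning False would match the "No module named object_detection.train" line in the logs, and the same call with "object_detection.model_main" would show whether the newer entry point was packaged instead.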

Copy train.py from your models\research\object_detection\legacy directory into models\research\object_detection, then cd to models\research and run: python setup.py sdist. This creates a new object_detection-0.1.tar.gz in models\research\dist that includes train.py. You can then run your command again:

gcloud ml-engine jobs submit training object_detection_$(date +%Y%m%d_%H%M%S)  \
--job-dir="gs://mybucket/train" \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--region us-central1 \
--config /home/me/Desktop/die_detection/config.yml \
-- \
--train_dir="gs://mybucket/train" \
--pipeline_config_path="gs://mybucket/data/pipeline_cloud.config"
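If you would rather script the copy-and-rebuild step, a minimal sketch could look like the following. The function names restore_legacy_train and rebuild_sdist are invented for this example, and the layout (object_detection/legacy/train.py under the research directory, with setup.py next to it) is assumed from the description above rather than checked against your checkout.

```python
import os
import shutil
import subprocess
import sys


def restore_legacy_train(research_dir):
    """Copy object_detection/legacy/train.py up into object_detection/.

    Hypothetical helper; returns the path of the restored module.
    """
    package_dir = os.path.join(research_dir, "object_detection")
    source = os.path.join(package_dir, "legacy", "train.py")
    target = os.path.join(package_dir, "train.py")
    if not os.path.isfile(source):
        raise IOError("legacy train.py not found at " + source)
    # Leave an existing train.py untouched instead of overwriting it.
    if not os.path.exists(target):
        shutil.copy(source, target)
    return target


def rebuild_sdist(research_dir):
    """Run `python setup.py sdist` inside research_dir to refresh dist/."""
    subprocess.check_call([sys.executable, "setup.py", "sdist"], cwd=research_dir)
```

Calling restore_legacy_train on your models/research path and then rebuild_sdist on the same path should leave a fresh tarball in dist/ for the --packages flag to pick up.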
