
I'm working on software that should do real-time people detection on multiple camera devices for a home surveillance system.

I'm currently running OpenCV to grab frames from an IP camera and TensorFlow to analyze them and find objects (the code is very similar to what can be found in the TF Object Detection API). I've also tried different frozen inference graphs from the TensorFlow Object Detection model zoo at this link:

https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md

I have a desktop PC with an Intel Core i7-6700 CPU @ 3.40GHz × 8 and an NVIDIA GeForce GTX 960 Ti GPU.

The software works as intended, but it is slower than expected (3-5 FPS) and CPU usage is quite high (80-90%) for a single Python script that handles only one camera device.

Am I doing something wrong? What are the best ways to optimize performance, achieve better FPS, and lower CPU usage so I can analyze more video feeds at once? So far I've looked into multithreading, but I have no idea how to implement it in my code.

Code snippet:

    with detection_graph.as_default():
        with tf.Session(graph=detection_graph) as sess:
            while True:
                frame = cap.read()
                frame_expanded = np.expand_dims(frame, axis=0)
                image_tensor = detection_graph.get_tensor_by_name("image_tensor:0")
                boxes = detection_graph.get_tensor_by_name("detection_boxes:0")
                scores = detection_graph.get_tensor_by_name("detection_scores:0")
                classes = detection_graph.get_tensor_by_name("detection_classes:0")
                num_detections = detection_graph.get_tensor_by_name("num_detections:0")

                (boxes, scores, classes, num_detections) = sess.run(
                    [boxes, scores, classes, num_detections],
                    feed_dict={image_tensor: frame_expanded})

                vis_util.visualize_boxes_and_labels_on_image_array(frame, ...)

                cv2.imshow("video", frame)
                if cv2.waitKey(25) & 0xFF == ord("q"):
                    cv2.destroyAllWindows()
                    cap.stop()
                    break

  • Is that 80% usage of your entire CPU or of a specific core? If it's the former, multithreading won't help much. Are you using the GPU? What does your "grab frames" code look like? A plausible bottleneck is unnecessary object creation. There's also been a recent surge in single-pass image recognition to avoid duplicated effort in models like these. Realistically, image recognition is computationally expensive, and speeding it up will require isolating the problem(s). Commented May 31, 2018 at 17:58
  • It uses 80% on all 8 cores, which I find absurd. Commented May 31, 2018 at 18:04
  • This varies from application to application, but I've had good success using reduced-resolution images (direct downsampling or more complicated imputations). For 3D conv nets, I was able to get comparable accuracy with <1% of the input size, drastically speeding up the application. There's a point where the extra pixels don't offer extra predictive accuracy. Commented May 31, 2018 at 18:13
  • And 80% on 8 cores is within the realm of reason. Applying some ballpark estimates to the structure of your models, that could easily correspond to ~10 operations per pixel per layer in a conv net. Commented May 31, 2018 at 18:15
  • I've posted the code in the question above. Commented May 31, 2018 at 18:17

1 Answer


A few things I tried for my project that may help:

  1. Use nvidia-smi -l 5 to monitor GPU utilization and memory usage.
  2. Create a small buffer between OpenCV and TF so they don't compete for the same GPU resources (a sketch of the run_inference_for_images helper used here follows this list):

    BATCH_SIZE = 200
    frameCount = 1
    images = []

    while cap.isOpened() and frameCount <= 10000:
        ret, image_np = cap.read()
        if ret:
            # Buffer frames until a full batch has been collected.
            frameCount += 1
            images.append(image_np)

            if frameCount % BATCH_SIZE == 0:
                # Run detection over the whole batch at once.
                start = timer()
                output_dict_array = run_inference_for_images(images, detection_graph)
                end = timer()
                avg = (end - start) / len(images)

                print("TF inference took: " + str(end - start) + " for [" + str(len(images)) + "] images, average[" + str(avg) + "]")
                print("output array has:" + str(len(output_dict_array)))

                # Draw the detections on each buffered frame and write it out.
                for idx in range(len(output_dict_array)):
                    output_dict = output_dict_array[idx]
                    image_np_org = images[idx]
                    vis_util.visualize_boxes_and_labels_on_image_array(
                        image_np_org,
                        output_dict['detection_boxes'],
                        output_dict['detection_classes'],
                        output_dict['detection_scores'],
                        category_index,
                        instance_masks=output_dict.get('detection_masks'),
                        use_normalized_coordinates=True,
                        line_thickness=6)

                    out.write(image_np_org)  # 'out' is a video writer opened elsewhere
                    ##cv2.imshow('object image', image_np_org)

                # Clear the buffers before collecting the next batch.
                del output_dict_array[:]
                del images[:]
        else:
            break
    
  3. Use MobileNet-based models (e.g. SSD MobileNet from the model zoo linked in the question); they are much lighter than the Faster R-CNN variants.

  4. Resize the capture to 1280 * 720, save it to a file, and run inference on the file.
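
The run_inference_for_images helper called in step 2 is not shown above; here is a minimal sketch of what such a batched-inference function might look like, assuming a TF1 frozen graph from the model zoo, frames that all share one resolution, and a purely illustrative batch_size:

    import numpy as np
    import tensorflow as tf

    def run_inference_for_images(images, graph, batch_size=16):
        """Run the frozen detection graph over a list of frames in batches."""
        output_dict_array = []
        with graph.as_default(), tf.Session(graph=graph) as sess:
            image_tensor = graph.get_tensor_by_name("image_tensor:0")
            fetches = {
                "detection_boxes": graph.get_tensor_by_name("detection_boxes:0"),
                "detection_scores": graph.get_tensor_by_name("detection_scores:0"),
                "detection_classes": graph.get_tensor_by_name("detection_classes:0"),
                "num_detections": graph.get_tensor_by_name("num_detections:0"),
            }
            for start in range(0, len(images), batch_size):
                # Stack frames of identical size into a [N, H, W, 3] uint8 batch.
                batch = np.stack(images[start:start + batch_size])
                out = sess.run(fetches, feed_dict={image_tensor: batch})
                for i in range(batch.shape[0]):
                    output_dict_array.append({
                        "detection_boxes": out["detection_boxes"][i],
                        "detection_scores": out["detection_scores"][i],
                        # The visualization util expects integer class ids.
                        "detection_classes": out["detection_classes"][i].astype(np.int64),
                        "num_detections": int(out["num_detections"][i]),
                    })
        return output_dict_array

In practice the session would be created once and reused across calls; it is opened inside the helper here only to keep the sketch self-contained.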

I did all of the above and achieved 12-16 FPS on a GTX 1060 (6 GB) laptop:

    2018-06-04 13:27:03.381783: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
    2018-06-04 13:27:03.381854: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
    2018-06-04 13:27:03.381895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929]      0 
    2018-06-04 13:27:03.381933: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0:   N 
    2018-06-04 13:27:03.382069: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5211 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060 with Max-Q Design, pci bus id: 0000:01:00.0, compute capability: 6.1)
    ===TF inference took: 8.62651109695 for [100] images, average[0.0862651109695]===

6 Comments

Hi Denny, thanks for sharing. I've achieved around 15-20 FPS myself using the VideoStream class from the imutils.video library (basically a threaded version of VideoCapture) with a Faster R-CNN model, which is slower but in general much more accurate than MobileNet. Try it yourself and see what the results are. I've also managed to run 3 cameras at once without losing much on the FPS count (5 FPS, 10 at best) using the threading and queue modules. I'll post the code tomorrow when I'm at work if you're interested.
Even with 1 camera, the idea is to let the OpenCV infinite loop run alone in a child thread and put the frames into a queue object; in the main (parent) thread you run the TF session, grabbing frames from the queue filled by OpenCV (see the sketch below).
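
A minimal sketch of that producer/consumer layout, assuming Python 3, plain cv2.VideoCapture, the standard threading and queue modules, and a placeholder camera URL; the queue of size 1 simply drops stale frames so inference always sees the newest one:

    import queue
    import threading
    import cv2

    frame_queue = queue.Queue(maxsize=1)  # keep only the newest frame

    def grab_frames(src):
        # Producer: runs in a child thread, reading frames as fast as the camera delivers them.
        cap = cv2.VideoCapture(src)
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            if frame_queue.full():
                try:
                    frame_queue.get_nowait()  # discard the stale frame
                except queue.Empty:
                    pass
            frame_queue.put(frame)
        cap.release()

    threading.Thread(target=grab_frames, args=("rtsp://camera-url",), daemon=True).start()

    # Consumer: the main thread pulls frames from the queue and runs the TF session
    # on them, exactly as in the question's snippet.
    while True:
        frame = frame_queue.get()
        # sess.run(...) and visualization go here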
Simon, for a home surveillance system I guess it won't be that often that an object is flying around, so you could sample 2-3 frames per second and assume the object's position won't change much in 0.5 seconds; in that case 10 FPS should be OK. It also depends on your use case: you could consider separating your VideoCapture and inference, having a low-end box capture the video and stream it to a more powerful box for inference.
Hi @simonEE, can you show how you implemented your method using imutils.video?
@Chaine, speed. I tried the original size of ~2240 * 1460, then 1280 * 720, 800 * 600, and finally 400 * 300. Guess what, the inference speed almost doubled at 400 * 300. The tradeoff is accuracy; even a small detection error can be unbearable. In my use case, 1280 * 720 is a good balance.
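
For reference, downscaling a frame before inference is a single OpenCV call (the 1280 * 720 target here is just the size discussed above, and the capture source is a placeholder):

    import cv2

    cap = cv2.VideoCapture(0)  # any capture source
    ret, frame = cap.read()
    if ret:
        # Downscale before feeding the detector; INTER_AREA works well for shrinking.
        small = cv2.resize(frame, (1280, 720), interpolation=cv2.INTER_AREA)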