TensorFlow Object-Detection for Videos on Windows-10

DeepVish
8 min read · Aug 14, 2020

Previous article: “TensorFlow Object Detection in Windows (under 30 lines)” covers about 95% of the code shown below, with an explanation of each line. Here we will only look at the amendments to that code which now enable it to run inference on videos instead of images.

Line 10–33

MODEL_NAME = 'ssd_mobilenet_v2_coco'
VIDEO_NAME = 'time_sq.mp4'
CWD_PATH = os.getcwd()
PATH_TO_CKPT = os.path.join(CWD_PATH, MODEL_NAME, 'frozen_inference_graph.pb')
PATH_TO_LABELS = os.path.join(CWD_PATH, 'data', 'mscoco_complete_label_map.pbtxt')
NUM_CLASSES = 90

label_map = label_map_util.load_labelmap(PATH_TO_LABELS)
categories = label_map_util.convert_label_map_to_categories(label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
category_index = label_map_util.create_category_index(categories)

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 1
detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:
        serialized_graph = fid.read()
        od_graph_def.ParseFromString(serialized_graph)
        tf.import_graph_def(od_graph_def, name='')
    sess = tf.Session(graph=detection_graph, config=config)

image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')
detection_boxes = detection_graph.get_tensor_by_name('detection_boxes:0')
detection_scores = detection_graph.get_tensor_by_name('detection_scores:0')
detection_classes = detection_graph.get_tensor_by_name('detection_classes:0')
num_detections = detection_graph.get_tensor_by_name('num_detections:0')

In summary, the lines above are responsible for the following tasks; an in-depth analysis can be seen here.

  • Model and video names are stored in variables.
  • The number of classes is defined, i.e. 90 for the COCO dataset.
  • The label map is used to create the category_index dict, which stores the relation between integer class ids and their respective class names (a small example follows this list).
  • The TensorFlow graph is loaded into a session defined as sess.
  • Python variables are declared that feed data into, and extract data from, the loaded TensorFlow session.
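
The category_index mentioned above is just a plain dictionary keyed by integer class id; here is a minimal sketch of inspecting it, assuming the code above has already run:

# category_index maps class ids to dicts holding the id and the display name,
# e.g. for the COCO label map: {1: {'id': 1, 'name': 'person'}, 2: {'id': 2, 'name': 'bicycle'}, ...}
print(category_index[1]['name'])   # -> 'person'
print(len(category_index))         # number of categories built from the label map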

Line 34–38:

video = cv2.VideoCapture(VIDEO_NAME)
while True:
    stime = time.time()
    ret, frame = video.read()
    if ret == True:

We are using OpenCV to read our .mp4/.mov video file. Since a video is just a series of images, we need a continuous loop to go through all the frames OpenCV hands us through ret, frame = video.read(), so we start a while True: loop that keeps running until we break out of it. OpenCV’s .read() call returns two values: the state of the read (True or False) and the array of pixel values representing the image (e.g. 1280x720x3). A False value of ret means there is nothing left to read and the accompanying frame variable will come up empty. Therefore ret acts as our condition for staying in the loop or breaking out of it. We use this to our advantage and only run the detection part if we have image data available, otherwise we break out of the while loop, which can be seen in line 38 (if ret == True:).

Line 39–40

        frame = cv2.resize(frame, (300, 300))
        frame_expanded = np.expand_dims(frame, axis=0)

The frames from our video may be of any size, and that can be a problem for the architecture we have loaded into memory (particularly if the frame is high resolution, e.g. a 1280x720 HD video). A high-resolution frame takes more time to process, reducing the frames-per-second smoothness we expect at the end of our program. To improve the FPS of the output video, we resize the frame to 300x300. We then add one extra dimension to the frame so that it can be fed into our graph, as we will see in the next couple of lines.
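
To make the shape change concrete, here is a minimal self-contained sketch; the 1280x720 input size is just an assumed example, not taken from the video above:

import cv2
import numpy as np

frame = np.zeros((720, 1280, 3), dtype=np.uint8)   # dummy HD frame: height x width x channels
resized = cv2.resize(frame, (300, 300))            # -> shape (300, 300, 3)
batched = np.expand_dims(resized, axis=0)          # -> shape (1, 300, 300, 3), the batch the graph expects
print(resized.shape, batched.shape)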

Line 41–43

        (boxes, scores, classes, num) = sess.run(
            [detection_boxes, detection_scores, detection_classes, num_detections],
            feed_dict={image_tensor: frame_expanded})
        vis_util.visualize_boxes_and_labels_on_image_array(
            frame, np.squeeze(boxes), np.squeeze(classes).astype(np.int32),
            np.squeeze(scores), category_index, use_normalized_coordinates=True,
            line_thickness=1, min_score_thresh=0.75)
        cv2.imshow('output', frame)

In short, we send the frame into our graph, the results are processed, and bounding boxes are drawn using the vis_util library; a longer explanation of the same can be found here.
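
If you also want the raw detections as text rather than only the drawn boxes, they can be read straight from the arrays returned by sess.run; a minimal sketch, assuming the same variable names as above and the same 0.75 threshold:

# boxes:   shape (1, N, 4), normalized [ymin, xmin, ymax, xmax]
# scores:  shape (1, N), confidence scores (highest first)
# classes: shape (1, N), float class ids matching category_index
for box, score, cls in zip(np.squeeze(boxes), np.squeeze(scores), np.squeeze(classes)):
    if score < 0.75:
        break                                    # scores are sorted, so the rest are below threshold
    label = category_index[int(cls)]['name']
    print('{}: {:.0%} at {}'.format(label, score, box))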

Line 44–51

        print('FPS {:.1f}'.format(1 / (time.time() - stime)))
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
    if ret == False:
        print('vid not Present')
        break
video.release()
cv2.destroyAllWindows()

Almost all of the work has already been done; here we just set up a termination condition so that pressing ‘q’ on the keyboard breaks the loop and ends the program. Also, when our ret variable is not True, meaning we have no data left to feed into the graph, we break the while loop and terminate the program. video.release() and cv2.destroyAllWindows() are responsible for releasing the video capture and closing all output windows, respectively.

source — https://www.videvo.net/video/mcdonalds-in-times-square-/1772/

Let's run our program on the above video; the download link is provided here. What is the logical expectation for the output?

This is what it looks like:

Shocked? Wondering what's wrong? What did we miss?

NOTE: Your mileage may vary depending on the configuration of your machine; it may be close to 30 FPS or worse than 5 FPS. I ran this code on an RTX 2060 with 6 GB GDDR6 memory coupled with an i5–8600 CPU @ 3.10 GHz.

Let's analyze the code again. Line 39: frame = cv2.resize(frame,(300,300)). This explains the loss of quality; it is why we see pixelated images and a loss of sharpness. But there is also a considerable amount of jitter/lag in the video; it doesn't seem as smooth as the original. What do you think the reason for that might be?

In BIG words the answer is FPS, i.e. the Frames Per Second of the video, which causes this jitter/lag effect. Why does FPS become an issue? To learn about it in detail, click here. In short, the FPS here is the output rate of our graph, the rate at which it can process images; for the jittery video above it averaged about 14 FPS.
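
The per-frame print in the code above is an instantaneous reading that jumps around a lot; if you want an average figure like the 14 FPS quoted here, a minimal sketch of a running average is shown below (this counter is my own addition, not part of the original script):

import time

frame_count = 0
start = time.time()

# inside the while loop, after each frame has been processed:
frame_count += 1
if frame_count % 30 == 0:                               # report every 30 frames
    print('avg FPS {:.1f}'.format(frame_count / (time.time() - start)))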

For a video to look smooth it should be above 25 FPS (the standard for playback in the UK is 25 frames per second and in the US it is 30 fps), whereas YouTube supports 24, 25, 30, 48, 50 and 60 FPS, and the majority of videos run at 29.97 FPS. That is what makes video playback look effortless.

So now we understand why our results look bad: the output is at 14 FPS. But why did we get only 14 FPS when our input is at 30 FPS? There are many factors behind this; the major ones are as follows:

  • Model-architecture limit: Every architecture comes with an upper limit, as it is nothing but a pre-arranged set of numbers that runs a series of calculations on an image. More numbers (a heavier model) means more calculation, more time spent on each image, and therefore fewer images processed per second. You can learn more about the SSD-MobileNet architecture and its internal workings here.
  • Image resolution: Image resolution plays a considerable role, as it is a direct representation of the amount of data fed to our graph per image; the higher the resolution, the more time it is bound to take, thus reducing the FPS (see the timing sketch after this list).
  • Hardware: The bigger, the better. If by any chance you are running this on TF-CPU you will have a hard time getting even 4–5 FPS. Make sure you have suitable hardware.
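
As a rough way to see the resolution effect on your own machine, you can time the same frame at a few input sizes; a minimal sketch, assuming the session, tensors and a decoded frame from the code above are already in memory (the sizes are arbitrary, and the first run includes graph warm-up):

import time
import numpy as np

for size in (300, 600, 1080):
    test = cv2.resize(frame, (size, size))                  # square resize just for timing
    batch = np.expand_dims(test, axis=0)
    t0 = time.time()
    sess.run([detection_boxes, detection_scores, detection_classes, num_detections],
             feed_dict={image_tensor: batch})
    print('{0}x{0}: {1:.3f} s per frame'.format(size, time.time() - t0))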
Full resolution (1920x1080): NODs and FPS per frame

For the given video, we compiled a DataFrame of the NODs (number of detections) and FPS for the full-resolution video and plotted the graph above. NODs and FPS share the same y-axis, with NODs taking values between 0–8 objects detected per frame and FPS ranging from 4–14.
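
The per-frame log behind that DataFrame only needs two numbers per frame; here is a minimal sketch of collecting and plotting it with pandas and matplotlib (the column names and toy values below are my own placeholders, not the original data):

import pandas as pd
import matplotlib.pyplot as plt

# in the real script, append one dict per frame inside the while loop, e.g.
# records.append({'frame': i, 'nod': int((np.squeeze(scores) > 0.75).sum()),
#                 'fps': 1.0 / (time.time() - stime)})
records = [{'frame': 0, 'nod': 2, 'fps': 13.5},   # toy placeholder values
           {'frame': 1, 'nod': 6, 'fps': 8.9},
           {'frame': 2, 'nod': 1, 'fps': 14.1}]

df = pd.DataFrame(records)
df.plot(x='frame', y=['nod', 'fps'])              # NODs and FPS share the same y-axis
plt.xlabel('frame number')
plt.show()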

The x-axis is the frame number; there are 391 frames in total in the 14-second video, making the FPS of the input video about 28. The orange line is the number of detections made by our architecture per image, and the blue line is the FPS value at that instant. It can clearly be seen that whenever NODs is high, the FPS at that instant drops. Remember that our model is looking for 90 types of objects in every frame. Over the stretches where there is no object to detect, we get a fairly high FPS output. To make the subject clearer, I will amalgamate data from 3 versions of the same video:

The image on the left shows the average FPS and the total number of detections. It's quite clear that the highest FPS was obtained on the lowest-resolution video and, as discussed earlier, we see a huge drop in NODs as well, meaning most of the data is lost as noise. The highest NODs were seen on the full-resolution video, but it came with the lowest FPS. To deploy the model, or at least make it useful for our needs, we have to find the right balance between acceptable results (accurate and precise bounding boxes when an object is detected) and FPS.

The image above is the graph showing the overlapped data obtained in the 3 runs with different resolutions.

Running inference at Full resolution 1920x1080

Now we know how to run TensorFlow object detection on video using a pre-trained model architecture, SSD-MobileNet in the example seen here. The results are in the acceptable range, and you can easily implement the same. But what if your need is very specific and this SSD-MobileNet architecture trained on the COCO dataset is unable to detect the class you are interested in? Maybe you have a completely different set of demands and the list of 90 object categories doesn't cater to your need. You may have your own dataset with totally different object categories; it might be MRI scans to look for an anomaly, or images of an apple orchard with the apples marked as annotations.

This deviation from the original dataset, and the resulting inability of our model, can be addressed by TRAINING the model on our own custom-created dataset. This dataset can be any set of properly annotated images, and its size is also up to the user. We then follow certain steps to TRAIN the existing MODEL to recognize our own object category. After training, we use this newly trained model for our specific need, which is now bound to produce better results.

As stated in the article: Practical aspects to select a Model for Object Detection.

We are aiming for results like those shown on the left side of the above video; the right side is the same model we just ran on our example video. Because our example video was almost home ground for the model, we saw acceptable performance out of it. The moment we change the conditions of the environment, the results are drastically affected, and it is quite clear how poor the performance becomes. The steps we need to take to CREATE and TRAIN on a custom dataset are discussed in the next article.

Code available here.

Other related articles are:

