TensorFlow Object Detection in Windows (under 30 lines)

DeepVish
10 min read · Aug 12, 2020

Before we start this journey:

  1. You need TF-1.14 (GPU) installed on your Windows machine; to see how to do that, please follow this link.
  2. The models-master repository from TensorFlow should also be set up correctly; follow this link to see how it's done.

We will start with a brief description of what a MODEL-ARCHITECTURE actually is; to know more about which model to select, please refer here. The model alone won't procure the desired results. There is more to it than meets the eye; to learn about some practical challenges you might face, please go through this post.

Models are nothing but a set of weights (fixed and structured), i.e. numbers that filter the data fed into them and produce a fixed set of numbers in return, numbers that ought to make some sense. These numbers generally represent bounding-box coordinates, classes, and confidence levels. A model is often termed a black box, as one can never really know how the results are obtained, but that is not entirely true. Yes, a huge amount of complicated mathematics is in play, but everything can be broken down into simple fundamental operations that can easily be learned individually. Some common terms you should have an idea about are activation functions, backpropagation, gradient descent, loss functions, and convolutional operations. We will not be getting into those fundamental operations in this article, but we will encounter these terms while training our own data-set, and having an idea about them helps a lot.

For this example, we will be taking ssd_mobilenet_v2_coco from here. Click the name to download the zip file. Unzip to see these files in the folder:
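A TF1 model-zoo archive typically unzips to something like this (the exact contents may vary slightly by release):

ssd_mobilenet_v2_coco/
    checkpoint
    frozen_inference_graph.pb
    model.ckpt.data-00000-of-00001
    model.ckpt.index
    model.ckpt.meta
    pipeline.config
    saved_model/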

For now, we will be dealing with frozen_inference_graph.pb; this is the final file, with all the weights stored in accordance with the graph structure, and it is ready to use. model.ckpt holds the graph structure and is used during training; it also acts as a backup when something goes wrong in training. The rest of the files can be ignored for now. We will also be altering pipeline.config later on, when we train on our own data-set.

Now we have a rough idea of what a model is and which files are important. As mentioned earlier, a model is nothing but a graphed architecture with fixed weights (a whole lot of weights), and every time we need to run an inference we have to feed the image in a fixed format and then decode the results, which are also returned in a fixed format. And since the data (structured weights) is huge even for today's machines, we need to give it some time and memory to LOAD into; this usually takes a few seconds. Be it a single image or a series of images (video), the process of loading the graph (model) remains the same; here we use the TensorFlow feature called a 'session'.

A session is initiated and frozen_inference_graph.pb is loaded into it. The session is responsible for reading data from the .pb file, loading it into memory, and initiating some variables and parameters which we will later use to feed and retrieve data and to tune memory allocation. For running inference on a video stream we need to load the graph only once. Remember, this is the part where system memory is allocated for running inference; we need to close the session in order to free up the allocated memory or to load in a new graph.
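As a bare-bones sketch of that lifecycle (the actual loading code appears at Line 20–29 below):

import tensorflow as tf

sess = tf.Session()  # memory is allocated when the session is created
# ... run inference as many times as needed ...
sess.close()  # frees the allocated memory so a new graph can be loaded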

Before we begin, make sure the unzipped folder is located like this: models-master>research>object_detection>ssd_mobilenet_v2_coco.

You will need one more file (label_map) with a list of classes; it will help us convert the integer output to a specific class. This data is stored in .pbtxt format at models-master>research>object_detection>data>mscoco_complete_label_map.pbtxt. It is there by default; just make sure it is present before proceeding. Upon opening it with Notepad it should look something like this:
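The entries in a COCO label map typically look like this (abridged; the real file has one item block per class):

item {
  name: "/m/01g317"
  id: 1
  display_name: "person"
}
item {
  name: "/m/0k4j"
  id: 3
  display_name: "car"
}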

Object detection in under 30 lines of Python: all the required additional files are available on my GitHub repo here.

Original image and results

Let's break down the code line by line:

Line 10–15

The 3 files needed for object detection (frozen_inference_graph.pb, Times_sq1.jpg, and mscoco_complete_label_map.pbtxt) are loaded into variables as shown below.

MODEL_NAME = 'ssd_mobilenet_v2_coco'  # folder holding the frozen graph
IMG_NAME = 'Times_sq1.jpg'  # image to run detection on
CWD_PATH = os.getcwd()  # current working directory
PATH_TO_CKPT = os.path.join(CWD_PATH,MODEL_NAME,'frozen_inference_graph.pb')
PATH_TO_LABELS = os.path.join(CWD_PATH,'data','mscoco_complete_label_map.pbtxt')
PATH_TO_IMG = os.path.join(CWD_PATH,IMG_NAME)

Line 16–19

Now we need to declare the total number of classes that our graph will search for in the image. This number will help translate the 'integer' result obtained from the graph into an actual category that we can make sense of. It is fixed, and the user doesn't need to change it unless it is a custom data-set, which we will discuss in the next post.

NUM_CLASSES = 90  # the COCO model is trained on 90 classes
label_map = label_map_util.load_labelmap(PATH_TO_LABELS)  # parse the .pbtxt file
categories = label_map_util.convert_label_map_to_categories(label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
category_index = label_map_util.create_category_index(categories)  # dict keyed by class id

We can now see that our label_map.pbtxt, which was previously shown in Notepad, has been converted to a 'dict' in Python with keys and values. Every key has a unique value associated with it.

If our graph returns 3 as a class, we will display 'car' along with that object, as we can see in the above results.
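To give a rough idea (an abridged sketch, not the full 90-entry dict), category_index looks like this, and the lookup works as shown:

category_index = {1: {'id': 1, 'name': 'person'},
                  2: {'id': 2, 'name': 'bicycle'},
                  3: {'id': 3, 'name': 'car'}}  # ...continues up to id 90
print(category_index[3]['name'])  # prints 'car'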

Line 20–29

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.2  # cap GPU memory usage at 20%
detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:
        serialized_graph = fid.read()  # read the .pb file from disk
        od_graph_def.ParseFromString(serialized_graph)
        tf.import_graph_def(od_graph_def, name='')  # load the graph definition
    sess = tf.Session(graph=detection_graph, config=config)

Line 20–21 calls the configuration part of the TensorFlow graph, where we set a limit on what percentage of GPU memory we want to allocate for our graph to load into and operate in. Here we have set the limit to 0.2, i.e. 20% of GPU memory will now be reserved for this purpose only.

Line 22–29 is where we load frozen_inference_graph.pb into a variable called 'sess'. TensorFlow has its own code structure that we need to follow to load the file in a specific manner. After loading the graph we need to declare the variables that will be responsible for feeding data to and fetching data from our 'sess'. Let's see how to do that:

Line 30–34

image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')  # input node
detection_boxes = detection_graph.get_tensor_by_name('detection_boxes:0')  # output: box coordinates
detection_scores = detection_graph.get_tensor_by_name('detection_scores:0')  # output: confidence levels
detection_classes = detection_graph.get_tensor_by_name('detection_classes:0')  # output: class ids
num_detections = detection_graph.get_tensor_by_name('num_detections:0')  # output: detection count

Our loaded graph contains multiple nodes as seen in the previous post, each node has a unique name and in order to get value from any node, we need to address them by their specific name.

tensor-flow feeds from frozen_inference_graph.pb

Using Netron we can search for all the variable names we are looking for in the graph itself, as shown in the above image. "detection_boxes" has an element_shape of [100, 4], as can be seen in the red circle at the bottom right of the image, where 100 represents the maximum number of detections the graph can return, and 4 represents the normalized (ymin, xmin, ymax, xmax) coordinates of the top-left and bottom-right points of a detected rectangle.

OUTPUT of SESS

The above image is a collage of the results obtained after the graph is fed with the input frame. boxes, scores, num, and classes are our Python variables, which store the values obtained from the graph when we request the data by name (detection_classes:0, detection_scores:0), as seen in the graph.pb diagram. Let's see how to feed the image into the loaded graph and retrieve results.

Before that we need to read the image and store it in a Python variable; previously, in Line 11, we only declared the image name, which we will now use to read the image. Also, keep in mind that images come in different shapes and sizes. An image can be stored in RGB format or HSV format, but since our graph is frozen (fixed), we have to send in our image in a certain format: 1 x height x width x number of channels, i.e. 1,h,w,3. The same format can be seen in the image fig: tensor-flow feeds, image_tensor (bottom left corner), as ?x?x?x3.

Line 35–36

frame = cv2.imread(PATH_TO_IMG)  # read the image from disk
frame_expanded = np.expand_dims(frame, axis = 0)  # (h,w,3) -> (1,h,w,3)

The frame variable stores the image data, which cv2.imread reads in BGR channel order (OpenCV's default, not RGB), and it may have a shape like (720, 1280, 3). frame_expanded adds an additional dimension to the frame, making the shape (1, 720, 1280, 3).
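A quick, self-contained way to verify this shape change (using a blank array as a stand-in for a real image):

import numpy as np

dummy = np.zeros((720, 1280, 3), dtype=np.uint8)  # stand-in for a decoded image
print(dummy.shape)  # (720, 1280, 3)
print(np.expand_dims(dummy, axis=0).shape)  # (1, 720, 1280, 3)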

Now we have prepared everything: our image is marinated, our 'sess' is pre-heated, and we just have to push the tray in and wait for fresh detections. As soon as our meal is out of the oven we will serve it hot, meaning we need to sort out the overcooked/undercooked parts before drawing the boxes and labeling them.

Line 37–38

(boxes, scores, classes, num) = sess.run(
    [detection_boxes, detection_scores, detection_classes, num_detections],
    feed_dict={image_tensor: frame_expanded})

vis_util.visualize_boxes_and_labels_on_image_array(
    frame,
    np.squeeze(boxes),
    np.squeeze(classes).astype(np.int32),
    np.squeeze(scores),
    category_index,
    use_normalized_coordinates=True,
    line_thickness=2,
    min_score_thresh=0.25)

Line 37 is the key to the whole program: this line is responsible for feeding the IMAGE in and storing the results in user-defined variables (given that we have correctly loaded the right model and are providing input data in the correct format). We have already seen the values of all the variables in FIG: OUTPUT of SESS.

  • boxes:: detection_boxes. A maximum of 100 boxes can be detected, and 4 floating-point numbers describing the top-left and bottom-right corners of each rectangle are stored in 'boxes'. Since there are only 3 detections in the EURO TRUCK SIMULATOR image, we only get the dimensions of 3 boxes and the rest are zero by default. The box coordinates obtained are all between 0 and 1 (0.498716, 0.125704, 0.524439, 0.15188); they are multiplied by the height and width of the image by the predefined vis_util function.
  • scores:: detection_scores. This is an array of 100 floating-point numbers with a maximum value of 1 and a minimum of zero. It stores the confidence our graph has in its predicted boxes. We can see that the values in our example image are 0.619, 0.567, 0.557, meaning it is 61.9%, 56.7%, and 55.7% sure about the 3 boxes it has predicted. By default, the values are zero.
  • classes:: detection_classes. Since we have our boxes, we need to know what objects are present in them, so the classes variable stores an array of float32 values (though always whole numbers): the key of each detected box. This key is later matched against the category_index we made, to look up its corresponding value, like 3:: car, 8:: truck. By default classes returns 1 (it doesn't mean our graph has seen the object stored under key 1).
  • num:: num_detections. The num variable stores only one value: the number of detections made by the graph. We saw above how every variable has a limit of 100 and returns a default value if no object is found. We need to be sure that we only process the number of boxes actually detected by the graph and not the garbage values; we use this as a safety net to cut out any values after the 3rd row of data in boxes (the arrays follow this order). Our graph returns whatever it sees: even if there is 0.00015 confidence that a pole looks like a person, it will return the box dimensions and the confidence level and count it as a detection. A small filtering sketch is shown after this list.
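Here is a minimal, hypothetical sketch (not part of the original 30 lines) of filtering these raw outputs by hand, assuming the same 0.25 threshold used in the vis_util call:

MIN_SCORE = 0.25  # assumed threshold
for i in range(int(num[0])):  # only loop over actual detections
    if scores[0][i] < MIN_SCORE:
        continue  # skip low-confidence garbage
    ymin, xmin, ymax, xmax = boxes[0][i]  # normalized 0-1 coordinates
    name = category_index[int(classes[0][i])]['name']
    print(name, scores[0][i], (ymin, xmin, ymax, xmax))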

Visualizing the result: in Line 37 we got everything we need to draw boxes on the image, but this may be a bit too much data. As discussed above, our graph looks for everything and just dumps out boxes with even a 0.0001 confidence level; there is no built-in filter for this (assume there is none for now). It is up to the user to decide how much confidence to place in the results produced by the graph: if you set a threshold of, say, 0.75 (min_score_thresh=0.75), any confidence level below it will be rejected. In our case (the Euro Truck Simulator image) this would eliminate all the boxes, as they have confidence levels of 0.61, 0.56, and 0.55 respectively. Results vary a lot depending on the threshold: too high (0.99) may not produce any result unless the graph is nearly 100% sure of its prediction, while too low (0.01) may produce all sorts of bad/false positives. Users must decide this value after a thorough analysis of the graph's performance.

  • “vis_util.visualize_boxes_and_labels_on_image_array” is a predefined function that takes in all the data obtained via “sess”, along with the frame itself and “min_score_thresh”, to sort the data. After the data is sorted, all the boxes are drawn with their respective names on top, obtained from classes and category_index. The user is also required to give a line_thickness for the boxes drawn on the image.

Line 39–40

cv2.imshow('FINAL IMG', frame)
cv2.waitKey(0)  # needed for the window to actually render; waits for a key press
cv2.imwrite('result1.jpg', frame)

Line 39 shows the output to the user, whereas line 40 saves the output in .jpg format. And this is how you can run detection in under 30 lines. We started from line 10 because we need to import some packages before beginning the program.

These 6 lines import the necessary packages and thus must be kept at the start of the file.

import os
import cv2
import numpy as np
import tensorflow as tf
from utils import label_map_util  # from models-master/research/object_detection/utils
from utils import visualization_utils as vis_util  # draws boxes and labels on images

All the code can be found here. In the next article, we will use the same model to run inference on videos and look at how we can improve the performance by altering a few lines.
