Practical Aspects of Selecting a Model for Object Detection

DeepVish
10 min read · Aug 11, 2020

Please refer to my last post: Setting up TensorFlow 1.14 in bare Windows (if you are starting from scratch).

Model architectures come in different shapes and sizes, and two minutes of Googling will turn up a model architecture that looks suitable for your need. It may work like a charm when given data that looks like what it was trained on (the COCO dataset), but it will give up the moment there is an anomaly. What if your need is very peculiar and doesn't even come close to any of the predefined classes in that pre-trained model (a custom dataset)? What if you only care about one class of the dataset, so that searching for the remaining classes only slows you down? Even if you have a clear sight of your requirement and proceed accordingly, the results may not be as expected, and it's obvious to think the model architecture is at fault. But that's not entirely true: we will see what other factors affect model performance and how to avoid those potholes, and then we will try to alter some very intricate properties of the model, such as the feature-extraction layers, for our own benefit.

Together we will work on a custom dataset (VisDrone), re-train, and see the difference. This article focuses on how to set up model-master for object detection and on some points to keep in mind before starting any project.

Object detection on images/videos does not depend on the model architecture alone; a lot of variables need to be set right for optimal performance. To set up the TensorFlow Object Detection API, follow this guide by Mark Labinski. Assuming you have already followed my previous post or already have TF-GPU installed correctly, you now have the MODEL repo installed and ready to go as directed by Mark.

Disclaimer: You might face import errors like “No module named object_detection” or “No module named nets”. Remember to install setup.py, located in model-master>research; to install it, just run python setup.py install in an Anaconda prompt.
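For reference, assuming you unzipped the repository as model-master, the install boils down to a couple of commands in the Anaconda prompt; the PYTHONPATH line is the usual fix for the “nets” import error (adjust the path to wherever your copy lives):

    cd model-master\research
    python setup.py install
    set PYTHONPATH=%CD%;%CD%\slim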

model-master is a collection of almost all the tools across AI domains, such as GANs, image2text, sentiment_analysis, LSTMs, object detection, etc.

We will focus on the object_detection folder present inside the research folder of the model-master.

This directory will be used in collaboration with the entire model-master to train your own model or to run inference on any pre-trained model such as SSD-MobileNet-v2_coco, ResNet, Inception, RCNN, etc. Make sure the model-master folder stays at a fixed location on your HDD; it will produce errors if its location is changed later.

List of Models Available in TF v1

Here is the link where you can download the weights for each of the pre-trained models. The COCO dataset is a collection of various objects with their annotations; various models such as SSD-MobileNet, Faster_RCNN, and SSD_Inception have already been trained on the COCO dataset, and their respective performance is listed as mAP (mean Average Precision) along with the time taken to run inference on a single image.
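If you would rather script the download than click through the page, a minimal sketch looks like this; the archive name and URL are assumptions based on the model zoo's naming convention for SSD-MobileNet-v2, so double-check them against the link above before running:

    # Download and unpack one of the TF v1 model-zoo checkpoints.
    # MODEL_NAME and BASE_URL are assumed from the zoo's naming scheme.
    import tarfile
    import urllib.request

    MODEL_NAME = "ssd_mobilenet_v2_coco_2018_03_29"
    BASE_URL = "http://download.tensorflow.org/models/object_detection/"

    archive = MODEL_NAME + ".tar.gz"
    urllib.request.urlretrieve(BASE_URL + archive, archive)  # fetch the weights

    with tarfile.open(archive) as tar:
        tar.extractall()  # yields frozen_inference_graph.pb, checkpoints, pipeline.config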

When choosing a model for a given task, a lot of choices have to be made; not every model can be the ONE. Each model has a different architecture and thus different limitations. Let's discuss some very crucial points in the matter. It is really important to list out the needs of the project before you start shopping: you should know exactly what you want, and no detail is too small, because a small error can cost you many times over during deployment. In no particular order, these are some important factors to consider:

1. Camera intrinsic properties: The camera is the source of all the data you will be building your model around, so know your camera beforehand. As discussed in the previous post, a video is nothing but a series of images coming your way at a high rate, giving you the illusion of moving objects. Camera cores are equipped with encoders, which are responsible for compressing the data (video) so that it can be transmitted to your system. H264 and H265 are two very popular encoding techniques that are very efficient and strike a good balance between quality and speed. Always keep in mind that there is a loss whenever data is compressed; the point to note is whether the loss is manageable or not.

One might reduce the loss by using different encoding techniques, but that increases the time taken to compress the video, introducing a very fickle property called LATENCY, and the size of the stream also grows dramatically. This is only one side of the spectrum: the receiving end of the data stream, i.e. your machine, also needs to decode the encoded data and reassemble the series of images in real time (in practice this never happens perfectly; there is always some latency present), and the trick is to keep that latency within an acceptable limit. The second important point is the transmission of the data. Once compressed, the data needs to get to your machine in some form, just as we consume data on cellphones via 4G/5G networks (songs via Bluetooth), Wi-Fi Direct for smart TVs, a webcam connected to your machine over USB, a security camera running on LAN, and so on. These physical differences between wired and wireless technology come with the PROTOCOL (RTSP, UDP, TCP) used for transferring the data from one point to another. So we have wired and wireless communication methods, and we have different protocols responsible for carrying data from point A to point B. Obviously, not all protocols are compatible with all types of data-carrying methods.

These protocols again have their own caveats, such as data loss, latency, and security, to name a few.

If you have a security camera installed and want to use its feed for object detection (person, vehicle, motion, etc.), then, assuming an average configuration, your camera will have a 2 MP sensor producing 15–25 FPS video in H264/H265 over LAN. Since the primary job of a security camera is to record movement and, at best, clearly capture the face of a person in the frame, these are meant to work in a range of 10–25 meters. Detecting a high-speed car with this camera is therefore not feasible, as we would only capture 1–2 frames that contain the car. GStreamer is a popular tool for receiving a feed from an RTSP/UDP source (among the popular ones).
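As a rough sketch of what receiving such a feed looks like in code, the snippet below opens an RTSP stream through a GStreamer pipeline with OpenCV. It assumes OpenCV was built with GStreamer support, and the camera address, user, and password are placeholders you must replace:

    # Pull an RTSP security-camera feed through GStreamer into OpenCV.
    import cv2

    pipeline = (
        "rtspsrc location=rtsp://user:pass@192.168.1.64:554/stream latency=100 ! "
        "decodebin ! videoconvert ! appsink"
    )

    cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)
    while cap.isOpened():
        ok, frame = cap.read()  # one decoded H264/H265 frame at a time
        if not ok:
            break
        # frame is now a plain NumPy image -- hand it to your detector here
    cap.release()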

High-speed traffic cameras, on the other hand, are installed on highways to capture and calculate the speed of cars in real time. These cameras capture video at very high frame rates, 250 FPS and above, and because they are FIXED we know the distance a car travels between entering and leaving the frame. Based on how many frames the vehicle stayed in view, we can calculate its speed; it is the basic speed-distance-time relation. So we end up with a very different application and very different deployment conditions: the camera changes, and so does every property associated with it, and that imposes a major penalty on your algorithm.
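To make the speed-distance-time relation concrete, here is a tiny back-of-the-envelope calculation; the 50 m field of view, 250 FPS, and 450-frame count are illustrative numbers, not measurements from a real installation:

    # Speed estimate for a fixed traffic camera from a frame count.
    DISTANCE_M = 50.0     # road length covered by the fixed camera's frame
    FPS = 250.0           # capture rate of the high-speed camera

    frames_in_view = 450                        # frames while the car stays in frame
    time_in_view_s = frames_in_view / FPS       # 1.8 s
    speed_kmh = (DISTANCE_M / time_in_view_s) * 3.6
    print(f"estimated speed: {speed_kmh:.1f} km/h")   # ~100 km/h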

2. Camera extrinsic properties: This means the environment the camera is installed in. The camera can be fixed, as in security/surveillance/traffic cameras, or it can very well be a moving camera mounted on autopilot cars, drones, cop cars, or in football stadiums (sky cams). These come with their own challenges; to name a few: the distance to the object can vary from just a couple of meters to a few kilometers (e.g. capturing a rocket launch). Light also plays an important role: improper or varying light conditions change the quality a lot and can impact your model's performance. Cameras mounted on drones/planes also show a jello effect in the video, caused by vibration in the body of the aircraft. Here comes the challenge of tracking the subject's movement while eliminating the movement in the scene caused by the camera itself (global motion). Using dampers where the camera is mounted can reduce the vibration, be it on a hovering aircraft or the underside of a flyover. Rolling shutter and global shutter are the two fundamental techniques used to capture a frame, and they play an important role when it comes to high-speed frame capture.

Magic is Real

3. Subject and its properties: The subject (the object to be detected) and its properties play a crucial role in selecting the type of model, and they also set the limits on what you can expect from inference.

credit- Rayan Millier from Pexels

Consider the image above. What are the possible subjects: yellow cabs (cars), a bus, a truck, pedestrians, traffic light signals, "no left turn" signs, the bike lane, the "left turn only" lane, zebra crossings, pedestrians in blue shirts? What difference does it make? To our eyes it doesn't, but for a model it does. The YOLO model is not good with small objects and may produce inaccurate results, also called FALSE POSITIVES, but it's not the model's fault: it wasn't designed to detect small objects, it was designed for speed and real-time applications, and models like YOLO can easily be deployed on a Jetson Nano. Going up the food chain, we can use SSD-MobileNet, which one might call the best of both worlds: it can detect small objects (only up to a certain point, no pun here) in real time (15–25 FPS) and can be deployed on mid-range GPUs like a GTX 1060, 1070, and so on. It is more accurate than YOLO but still has limitations. What if the subject is TOO SMALL and accuracy is of utmost importance (no tolerance for false positives)? RCNN-like models are to be used in such cases, and they bring their own cost: very high inference time, pulling the FPS down to 4–5 or maybe less, so they cannot be used for real-time applications. The subject decides which model to select, so study the subject before picking any model. An image may contain many subjects or just one; the subject's size within an image cannot be altered, but the image size as a whole is again a variable you need to set. The image above is 2186x2738 pixels, so it was very clear and all the relevant detail was distinguishable. What if we reduce the image to 224x224 (the input dimension of VGG) or 512x512?

Look at the image on the side at 224x224 and focus on the small objects: a lot of detail is lost when the image is shrunk in size. Many pedestrians are reduced to a few pixels, and depending on the model used, those pixels will be lost (as noise) when passed through 200 layers of convolution. You can learn more here; Jonathan Hui has very intuitively compared and contrasted all the famous architectures.
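To put numbers on that shrinkage, here is a quick check; the 40x90-pixel pedestrian is a hypothetical crop size, while 2186x2738 and 224x224 are the resolutions quoted above:

    # How much of a pedestrian survives the resize to a 224x224 network input?
    SRC_W, SRC_H = 2186, 2738     # original street photo
    DST = 224                     # network input size

    scale_x, scale_y = DST / SRC_W, DST / SRC_H      # ~0.10 and ~0.08
    ped_w, ped_h = 40, 90                            # pedestrian size in the original
    print(ped_w * scale_x, ped_h * scale_y)          # ~4 x 7 pixels after resizing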

4. Hardware: Now that we have determined what we want the results to look like, we must provide adequate hardware for the same. Models such as Inception and ResNet are huge, with 5 million (Inception-v1) and 23 million (ResNet-50) trainable parameters, which need a lot of GPU memory just to be loaded. Next comes the part where you might train the network on your own dataset (transfer learning), which we will see in the next article. If you have heard enough of Model/TF-model/model-master and want to actually see what the fuss is all about, let's try to visualize a simple SSD-MobileNet-v1 by uploading the .pb file here. It might take some time, but eventually you will see something like this:

a small part of SSD-Mobilenet

Netron is one of the best ways to actually see what the layer names are, what their dimensions are, how many layers they convolve into, and how they are connected to each other. You can load any model as long as it is in .pb format and see for yourself the journey an image will take through that model. In order to load such models you will need sufficient memory. If this memory is CPU memory, the speed will be considerably lower (I have not encountered any article saying CPU memory affects accuracy, so we will assume it doesn't); GPU memory, on the other hand, is many times faster than CPU memory and comes at a price: the CPU loads its data into system RAM, whereas the GPU has its own GDDR5/GDDR6-type memory, and GPU CUDA cores are designed specifically for the task of parallel-processing tonnes of data in the form of numbers (tensor arrays). That makes CUDA cores and GPU memory the prime factors to consider when training a model or using it for inference. Also, whenever your machine is used for inference or training, keep it in an open, well-ventilated space so the cooling fans can breathe, and monitor the temperature regularly: training may run from 12 hours to 8 days straight at 100% GPU utilization depending on the problem statement, and you can easily fry your system if you don't take care.
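If you prefer to poke at the graph from Python instead of Netron, the sketch below loads a frozen .pb with TF 1.x and prints the first few layer names; the path assumes the ssd_mobilenet_v2_coco_2018_03_29 archive extracted earlier, so adjust it to your model:

    # Load a frozen inference graph and list some of its operations (TF 1.x style).
    import tensorflow as tf

    PB_PATH = "ssd_mobilenet_v2_coco_2018_03_29/frozen_inference_graph.pb"

    graph_def = tf.GraphDef()
    with tf.gfile.GFile(PB_PATH, "rb") as f:
        graph_def.ParseFromString(f.read())

    with tf.Graph().as_default() as graph:
        tf.import_graph_def(graph_def, name="")

    for op in graph.get_operations()[:10]:   # first few layer/op names
        print(op.name)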

In the next article, we will run some inferences on images and videos from the model-master you installed following this guide.

This video is our problem statement (as it's clear the performance is below par for the vehicle-detection task). The things to do, in order, are:

  1. RUN inference using SSD-MobileNet-v2 (on an image and then a video).
  2. Optimize SSD-MobileNet-v2 for speed.
  3. Re-train SSD-MobileNet-v2 using the VisDrone dataset (custom training / transfer learning).
  4. RUN inference using the custom-trained SSD-MobileNet-v2.
  5. Practical applications: tracking suspects in a crowd, crowd monitoring, traffic management, etc.

Expected Results after Step 4 are as given below:

The next article will shed light on the basics of object detection on images.

DeepVish

AI enthusiast, Computer Vision Engineer. Self-driving cars need cameras, not LIDAR. Vision is the Future.