Computer Vision on CPU

DeepVish · Published in Analytics Vidhya · 14 min read · Jun 7, 2021

Year 2021: the world is suffering from a deadly pandemic and countries are in and out of lockdowns. Every business, trade, and market has seen its worst in recent times. One of them is the chip industry. As the whole world accepted work from home (WFH) as the new normal, demand and sales for electronics went sky high, and combined with global logistics problems and nationwide shutdowns there is now a huge gap between demand and supply. The chip is now the fifth element of life, just after air, water, fire, and earth; chips sit in every day-to-day device, and thanks to IoT even the bread toaster is operated via Alexa. The GPU, often called the heart of AI computation, has become a rare and expensive commodity. Another reason for the GPU shortage is cryptocurrency: Bitcoin and Ethereum have grown roughly 800% in the last 10 months, pulling ever more GPUs into the mining world. GPUs that used to cost a few hundred dollars are now well above the thousand-dollar range, making it very difficult for a beginner to invest in one, knowing that in 8–10 months the price will come down again.

So how can one start a journey when the road is unknown? Fortunately, a GPU is not the only device that can perform computer vision tasks. Intel has been working tirelessly to incorporate AI computing cores into its CPU lineup, and if you have an Intel CPU from 2017 or later you can probably run inference on it without needing a dedicated GPU. Obviously there are tradeoffs when you go from a dedicated GPU to your CPU, such as a more complicated setup, a steep learning curve, a lot of patience, and lower raw performance. We will walk through the entire process, and at the end you can decide whether it's worth the money, or worth investing the time to learn new skills and maybe get even better performance. Disclaimer: we are only dealing with the inferencing part of computer vision and not the training part; training still needs a GPU.

This article lists the process briefly without going too deep into the associated packages and libraries. For quick results in under 5 minutes, jump to my GitHub repo. No prior coding skill is required. To learn about inferencing on GPU, check this out. Computer vision is a very vast topic and it's not possible to cover every aspect in this article; instead, this article deals only with the last stretch of the job, i.e. running inference on a video. You might have heard about TF, PyTorch, and other frameworks that give you complete access to design your own CNN model from scratch, train it, optimize it, and run inference on it. But they are not easily scalable and require Python/C++ coding skills to get them running. Here we will only run inference on pre-trained weights converted to IR (Intermediate Representation) format. To do so we will use a very old and reliable pipeline-based multimedia framework called GStreamer. This framework works with elements similar to Lego pieces: arrange them in a particular fashion and you have yourself a running pipeline. A pipeline here means a set of elements arranged in such a manner that they take some input and produce some output. Our job will be to use these Lego pieces, along with some custom-designed ones from Intel, and arrange them so that they take a video input, run inference on it, and produce the output. The above-mentioned GitHub repo contains many such pipelines, which can be executed via a shell script with the .sh extension, OR you can directly copy-paste the pipelines listed below. Both serve the same purpose; the .sh file gives you a better understanding once you open it and see the Lego pieces separately. If at any point you need to run inference on your own .mp4 file, move the file to this directory, open the relevant .sh file, and under the LOCATION tab change the file name to the one you added.
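A hypothetical shortcut for that last step, assuming the script points at the sample file traffic_cam_intel.mp4 (as the pipelines later in this article do) and that your own file has already been copied into the repo folder:

# swap the sample video for your own file inside the script (filenames are examples)
sed -i 's/traffic_cam_intel.mp4/my_video.mp4/' only_detection_v1.sh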

To date, AI has been tied very deeply to GPUs and their performance stats; while it's good to have a GPU, it is not mandatory. Intel has been working on its core architecture to run inference on the CPU cores themselves, and if you need scalability there are various add-ons available to plug in and run inference on, such as the Intel NCS, which can be plugged into edge devices like a Raspberry Pi, Arm boards, or an Intel NUC. This gives you inference on demand. If you are looking for something more powerful and have space for a PCIe slot, you can look at IEI's Mustang-F100-A10. This will scale your model multiple times, and even if that's not up to the task you can always opt for server-grade Intel Xeon Gold processors, which are heavily optimized to run multiple instances of computer vision models.

Let's begin. Prerequisites: Ubuntu 18.04 LTS, minimum 15 GB HDD, minimum 4 GB RAM, an Intel processor.

GitHub: Computer_Vision_on_CPU

We will start with Docker first, as Docker gives you a completely separate playground to run your experiments in, plus it's 100% replicable. If you have not heard of Docker, don't worry; keep this repo handy and I will explain along the way. As of 2021, almost every major software/SDK/package/tool comes in the form of Docker images, which can be downloaded from here. To install Docker, follow this or the git link mentioned above. Docker lets you create an isolated environment for the package you are interested in. The typical method is to download a Docker image of the package we need and then create a container with a unique name. You can create multiple containers from the same image. These images also allow you to mount an existing folder from your OS. To begin our quest we will clone the GitHub repo into the "Documents" folder, then download the Docker image and run the container; both steps are shown below. Remember that this Docker image comes with every tool and library preinstalled, so you don't have to install anything else.
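A minimal sketch of the clone step; the placeholder stands for the repository URL linked above:

# clone the repo into Documents (replace the placeholder with the actual GitHub URL)
cd ~/Documents
git clone <URL-of-Computer_Vision_on_CPU-repo>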

sudo docker run -it -e DISPLAY=$DISPLAY --network=host -d --name opv1 -v $HOME/Documents/Computer_Vision_on_CPU/:/home/ --privileged --user root openvino/ubuntu18_data_dev

This will download the required image and create a container with the name 'opv1'; we are also mounting our GitHub repo from the Documents folder onto the home folder of the container. Before going inside the container we need to grant permission from the HOST so that the container can display the video output; to achieve that, run xhost + on the host. Now we can safely get inside our running container via

sudo docker exec -it --workdir /home --user root opv1 bash

By typing ls we can see the list of all the files and folders present. Our model has three components required for inferencing: the .xml, the .bin, and the model-proc file. For our needs, I have isolated and placed them conveniently in the folders. You can also see files ending with v1, v2, v3 in their respective names. This is because our method is built on top of a GST pipeline, and this pipeline has plugins/elements responsible for all of the video processing tasks. One of these is the DISPLAY-out; depending on the version of Intel CPU you have, you might face some compatibility issues. To check which display element (aka SINK) works for your machine, try

gst-launch-1.0 -v videotestsrc pattern=snow ! video/x-raw,width=1280,height=720 ! xvimagesink
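For reference, the fallback versions of the same test simply swap the sink at the end:

# try these if xvimagesink does not open a window
gst-launch-1.0 -v videotestsrc pattern=snow ! video/x-raw,width=1280,height=720 ! ximagesink
gst-launch-1.0 -v videotestsrc pattern=snow ! video/x-raw,width=1280,height=720 ! autovideosink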

If you did not see a snow-pattern video output, replace the last element 'xvimagesink' with 'ximagesink' or 'autovideosink', as shown above. Any one of the three will definitely work, and each has a corresponding version number allotted to it. Suppose you only get a display via 'ximagesink'; that means you stick with V2 for the rest of the tutorial. We are pretty much ready to see the results now. First up is running detection alone on the video; the model we are using has 3 detection classes: Person, Bike, and Vehicle. Let's see how to run it.

./only_detection_v1.sh

Yup, it's that simple. Every .sh file already has the entire pipeline prebuilt inside it; you just need to run it by typing the command above, using the version number that worked for you. As soon as you hit Enter you will see a couple of details about the model name, its path, and the files used in the pipeline. It will also print the pipeline, which looks like this:

gst-launch-1.0 filesrc location=traffic_cam_intel.mp4 ! decodebin ! gvadetect model=model_intel/person-vehicle-bike-detection-crossroad-0078/FP32/person-vehicle-bike-detection-crossroad-0078.xml model-proc=model_proc/person-vehicle-bike-detection-crossroad-0078.json device=CPU threshold=0.75 inference-interval=1 nireq=4 ! queue ! gvawatermark ! videoconvert ! gvafpscounter ! fpsdisplaysink video-sink=xvimagesink sync=false

For Object Detection + Tracking, type:

./only_detection_track_v1.sh 

which gives the following pipeline:

gst-launch-1.0 filesrc location=traffic_cam_intel.mp4 ! decodebin ! gvadetect model=model_intel/person-vehicle-bike-detection-crossroad-0078/FP32/person-vehicle-bike-detection-crossroad-0078.xml model-proc=model_proc/person-vehicle-bike-detection-crossroad-0078.json device=CPU threshold=0.75 inference-interval=1 nireq=4 ! queue ! gvatrack tracking-type=short-term ! queue ! queue ! gvawatermark ! videoconvert ! gvafpscounter ! fpsdisplaysink video-sink=xvimagesink sync=false

For Object Detection + Tracking + Classification x2:

./only_detection_track_person_car_classify_two_v1.sh

This one has Detection + Tracking + Classification of person attributes + Classification of vehicle attributes, and its GST pipeline looks like:

gst-launch-1.0 filesrc location=traffic_cam_intel.mp4 ! decodebin ! gvadetect model=model_intel/person-vehicle-bike-detection-crossroad-0078/FP32/person-vehicle-bike-detection-crossroad-0078.xml model-proc=model_proc/person-vehicle-bike-detection-crossroad-0078.json device=CPU threshold=0.75 inference-interval=1 nireq=4 ! queue ! gvatrack tracking-type=short-term ! queue ! gvaclassify model=model_intel/person-attributes-recognition-crossroad-0230/FP32/person-attributes-recognition-crossroad-0230.xml model-proc=model_proc/person-attributes-recognition-crossroad-0230.json reclassify-interval=1 device=CPU object-class=person ! queue ! gvaclassify model=model_intel/vehicle-attributes-recognition-barrier-0039/FP32/vehicle-attributes-recognition-barrier-0039.xml model-proc=model_proc/vehicle-attributes-recognition-barrier-0039.json reclassify-interval=10 device=CPU object-class=vehicle ! queue ! gvawatermark ! videoconvert ! gvafpscounter ! fpsdisplaysink video-sink=xvimagesink sync=false

Let's break down the pipeline and understand what is going on here:

  • filesrc is the element that locates the .mp4 we want to use; you are free to use your own video, just change the name in this pipeline.
  • gvadetect is an Intel-built element that takes a model address and runs inference; the video coming from filesrc goes here for detection.
  • Detected objects such as Vehicle, Person, and Bike need to be tracked; this is done by gvatrack, which assigns a unique ID to every object.
  • After tracking we classify the person attributes with gvaclassify, which also takes the address of a classification model stored in the model_intel folder. The same is repeated for vehicle classification.
  • Once we have all the data we need to draw the relevant bounding boxes on the video; this is done by gvawatermark, and without it you won't see any drawing on the video.
  • Since we are also rooting for optimal performance, we want to see the FPS numbers on the video as well as in the shell, so we use gvafpscounter & fpsdisplaysink respectively.

Learn more about gva elements here.
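Stripped of the specific model paths, every pipeline in this article follows roughly the same shape; the angle-bracket placeholders below stand for your own files:

# generic shape: decode -> detect -> track -> classify -> draw -> display
gst-launch-1.0 filesrc location=<input.mp4> ! decodebin \
  ! gvadetect model=<detection.xml> model-proc=<detection.json> device=CPU \
  ! queue ! gvatrack tracking-type=short-term \
  ! queue ! gvaclassify model=<classifier.xml> model-proc=<classifier.json> object-class=person device=CPU \
  ! queue ! gvawatermark ! videoconvert ! gvafpscounter \
  ! fpsdisplaysink video-sink=xvimagesink sync=false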

The remaining elements like decodebin, queue, and videoconvert are necessary to pass video from one element to another and to play with the video management side of things (resizing it, streaming it, and so on). Learn about GST elements here. Once you have your inference working you might look at the CPU usage, and chances are it is hovering around the 98–100% mark. That is because gvadetect & gvaclassify are using their default settings to utilize the CPU cores, which is not optimized. To use the optimized version of the same, type ./adv_detection_tracking_classification.sh. This runs the pipeline with an extra set of instructions for our gvadetect element.

gst-launch-1.0 filesrc location=traffic_cam_intel.mp4 ! qtdemux ! avdec_h264 max_threads=4 ! gvadetect model=model_intel/person-vehicle-bike-detection-crossroad-0078/FP32/person-vehicle-bike-detection-crossroad-0078.xml model-proc=model_proc/person-vehicle-bike-detection-crossroad-0078.json threshold=0.75 inference-interval=1 model-instance-id=detect cpu-throughput-streams=4 nireq=1 ie-config=CPU_BIND_THREAD=NO,CPU_THREADS_NUM=16 ! queue ! gvatrack tracking-type=short-term ! queue ! gvaclassify model=model_intel/person-attributes-recognition-crossroad-0230/FP32/person-attributes-recognition-crossroad-0230.xml model-proc=model_proc/person-attributes-recognition-crossroad-0230.json reclassify-interval=1 device=CPU object-class=person ! queue ! gvaclassify model=model_intel/vehicle-attributes-recognition-barrier-0039/FP32/vehicle-attributes-recognition-barrier-0039.xml model-proc=model_proc/vehicle-attributes-recognition-barrier-0039.json reclassify-interval=10 device=CPU object-class=vehicle ! queue ! gvawatermark ! videoconvert ! gvafpscounter ! fpsdisplaysink video-sink=xvimagesink sync=false

particularly:

inference-interval=1 model-instance-id=detect cpu-throughput-streams=4 nireq=1 ie-config=CPU_BIND_THREAD=NO,CPU_THREADS_NUM=16
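These numbers are tied to your core count; as a purely hypothetical starting point for a 4-core/8-thread machine, the same fragment might read:

# hypothetical values for a smaller machine; tune them as described below
inference-interval=1 model-instance-id=detect cpu-throughput-streams=2 nireq=1 ie-config=CPU_BIND_THREAD=NO,CPU_THREADS_NUM=8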

Now you have to try different values for cpu-throughput-streams & CPU_THREADS_NUM; as shipped they are configured for a 6-core machine and drop the usage to about 80%. Try values in multiples of 2 for cpu-throughput-streams and in multiples of 8 for CPU_THREADS_NUM. This helps with better core-performance management. For better FPS in the range of 60+, we will shift to INT8 model weights; these require less computational power compared to the FP32/FP16 weights we have been using since the beginning. All the INT8 weights are already present; we just change the path for gvadetect. Run: ./adv_detection_tracking_classification_liste.sh

gst-launch-1.0 filesrc location=traffic_cam_intel.mp4 ! qtdemux ! avdec_h264 max_threads=4 ! gvadetect model=model_intel/person-vehicle-bike-detection-crossroad-0078/FP16-INT8/person-vehicle-bike-detection-crossroad-0078.xml model-proc=model_proc/person-vehicle-bike-detection-crossroad-0078.json threshold=0.75 inference-interval=10 model-instance-id=detect cpu-throughput-streams=4 nireq=1 ie-config=CPU_BIND_THREAD=NO,CPU_THREADS_NUM=16 ! queue ! gvatrack tracking-type=short-term ! queue ! gvaclassify model=model_intel/person-attributes-recognition-crossroad-0230/FP16-INT8/person-attributes-recognition-crossroad-0230.xml model-proc=model_proc/person-attributes-recognition-crossroad-0230.json reclassify-interval=10 device=CPU object-class=person ! queue ! gvaclassify model=model_intel/vehicle-attributes-recognition-barrier-0039/FP32/vehicle-attributes-recognition-barrier-0039.xml model-proc=model_proc/vehicle-attributes-recognition-barrier-0039.json reclassify-interval=10 device=CPU object-class=vehicle ! queue ! gvawatermark ! videoconvert ! gvafpscounter ! fpsdisplaysink video-sink=xvimagesink sync=false
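Compared with the previous pipeline, the changes are just the FP16-INT8 model paths and the larger inference/reclassify intervals, particularly:

model=model_intel/person-vehicle-bike-detection-crossroad-0078/FP16-INT8/person-vehicle-bike-detection-crossroad-0078.xml inference-interval=10 reclassify-interval=10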

By using the above pipeline you will see a drastic improvement in FPS and a balanced CPU core utilization. There are a few extra pipelines in the GitHub repo that will let you run inference on your IP camera feed in real time. If you want to store the inference video in mp4 format, you can do so with the following pipeline:

gst-launch-1.0 filesrc location=traffic_cam_intel.mp4 ! decodebin ! gvadetect model=model_intel/person-vehicle-bike-detection-crossroad-0078/FP32/person-vehicle-bike-detection-crossroad-0078.xml model-proc=model_proc/person-vehicle-bike-detection-crossroad-0078.json device=CPU threshold=0.75 inference-interval=1 nireq=4 ! queue ! gvawatermark ! videoconvert ! x264enc ! mp4mux ! filesink location=traffic_cam_intel_output.mp4

These are a few working examples; you have many more possibilities, like relaying the inference video over UDP/RTSP, or attaching a database to record all the bounding-box data along with tracking IDs. You can customize the pipeline to your heart's content.
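As a hedged sketch of the UDP relay idea (the host address, port, and encoder settings here are illustrative, not taken from the repo):

# stream the annotated video as RTP/H.264 over UDP to another machine (values are examples)
gst-launch-1.0 filesrc location=traffic_cam_intel.mp4 ! decodebin ! gvadetect model=model_intel/person-vehicle-bike-detection-crossroad-0078/FP32/person-vehicle-bike-detection-crossroad-0078.xml model-proc=model_proc/person-vehicle-bike-detection-crossroad-0078.json device=CPU threshold=0.75 ! queue ! gvawatermark ! videoconvert ! x264enc tune=zerolatency ! rtph264pay ! udpsink host=192.168.1.50 port=5000

# a matching receiver on the other machine (also illustrative)
gst-launch-1.0 udpsrc port=5000 caps="application/x-rtp,media=video,encoding-name=H264,payload=96" ! rtph264depay ! avdec_h264 ! videoconvert ! autovideosink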

One typical example is ANPR (Automatic Number Plate Recognition). This example is already present in the Docker container you are running; to make it work you just have to run the .sh file located at:

root@kuk:/home# /opt/intel/openvino/deployment_tools/demo/demo_security_barrier_camera.sh

This will initiate the pipeline, and since we don't already have the relevant weights, downloading will begin; in a minute the demo output will appear on your screen.

Now you can clearly see the wide array of applications we can build with simple, easy-to-run Intel pipelines. gvapython is an element that can be integrated into your current pipeline and allows you to post-process the output image and the metadata (detection bounding boxes, tracking IDs, classes). You can use this to build your own model for keeping track of cars/pedestrians.
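Even before writing any Python, you can inspect that metadata by swapping the drawing and display stages for the gvametaconvert and gvametapublish elements mentioned in the use cases below; a hedged sketch, where the output file name is illustrative:

# write detection metadata as JSON to a file instead of drawing it on the video (file name is an example)
gst-launch-1.0 filesrc location=traffic_cam_intel.mp4 ! decodebin ! gvadetect model=model_intel/person-vehicle-bike-detection-crossroad-0078/FP32/person-vehicle-bike-detection-crossroad-0078.xml model-proc=model_proc/person-vehicle-bike-detection-crossroad-0078.json device=CPU threshold=0.75 ! queue ! gvametaconvert format=json ! gvametapublish method=file file-path=/home/detections.json ! fakesink sync=false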

Some of the use cases that can be easily implemented via this pipeline are:

  • Loitering Detection: You can define an area in a CCTV feed and, using gvapython (a sketch of how it plugs into the pipeline follows this list), check whether any person/car is breaching that area's limits. It's simple if/else logic, and with a small snippet you can attach your email ID; now you have a very simple AI-on-CPU program that emails you an image if someone breaches the predefined area.
  • No-Parking Detection: The same concept applies here, except we filter for vehicles only in our defined ROI (Region Of Interest). We get an e-mail as soon as any vehicle enters the prohibited area; we can also include the vehicle attribute classification shown in the pipelines above, and combined with ANPR we can stack up information like make, model, color, and number plate of the car parked in the no-parking zone.
  • Fire/Smoke Detection: This works on the same concept; you have to build a dedicated CNN model and train it on some architecture, say YOLOv3, and after training you can convert the YOLOv3 weights to IR format, which then simply plugs into our existing pipeline. gvametapublish & gvametaconvert will help you send alerts to the database of your choice.
  • Speed Detection: The gvatrack element tracks the detected objects; an object can be anything your detection model is trained to look for. Tracking helps us uniquely identify every detected object, and since we have a tracking ID we can use it to determine vehicle speed. We also have a bounding box per vehicle, and this BBox can be reduced to a single point per frame carrying the allotted tracking ID, so instead of the entire car we are looking at a single pixel that represents the car. As the camera is fixed, we take an approximation of real-world distance: say from a fixed Point A to a fixed Point B the distance is 1 meter, and our one-pixel car goes from Point A to Point B in 10 frames; assuming our camera records at 30 FPS, the vehicle took 1/3 of a second to cover 1 meter, which equates to a speed of roughly 3 m/s, or about 11 km/h. This is a rough estimate but won't lie far off the actual speed; depending on how accurately you tune the variables you can easily achieve accuracy above 90%.
  • People/Car Counter: This model finds its application in places that deal with crowds: hospitals, malls, concerts, stadiums, parking lots for vehicles, and many more. It also works on the principle of detecting and tracking objects: you draw a virtual line and, as soon as a detected object crosses that line, you increase the counter by one; the same logic applied in reverse gives you a count-out as well. That provides you the number of people who entered and left the facility, or of vehicles for parking lots.
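For the first two use cases, here is a hedged sketch of how gvapython slots into the detection-plus-tracking pipeline; the module path and class name are hypothetical placeholders for your own ROI-checking script:

# roi_alert.py / RoiAlert are hypothetical; gvapython hands them every frame's detections and tracking IDs
gst-launch-1.0 filesrc location=traffic_cam_intel.mp4 ! decodebin ! gvadetect model=model_intel/person-vehicle-bike-detection-crossroad-0078/FP32/person-vehicle-bike-detection-crossroad-0078.xml model-proc=model_proc/person-vehicle-bike-detection-crossroad-0078.json device=CPU threshold=0.75 ! queue ! gvatrack tracking-type=short-term ! queue ! gvapython module=/home/roi_alert.py class=RoiAlert ! gvawatermark ! videoconvert ! fpsdisplaysink video-sink=xvimagesink sync=false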

These are just a few examples; you always have the freedom to get creative and model your own use case to suit your own needs: airport cameras can have weapon detection, factory floors can have helmet detection, fast-food kitchens can have glove detection, and many more. All of this can be easily implemented and used for a small number of cameras; note that we are not recording anything here, we have simply analysed the live/stored feed in real time, on top of which we can build logic and implement an alert-generation system.

Consider the case of a smart city, a factory floor, a university, a stadium, a multi-storey parking lot, a hotel, and the like. These places are generally equipped with hundreds or thousands of cameras, and it becomes an exponentially complex task to handle so much via a simple pipeline. This colossal task is managed and supervised by Mirasys-DataVision, a company that provides a complete package for securely handling, recording, and retrieving video feeds from thousands of cameras with an easy-to-use interface. With 50+ use cases ready to deploy, it becomes a lot easier to monitor and respond to any alert in time. You have the freedom to run multiple use cases on a single camera or a single use case on multiple cameras. All the use cases can be integrated and viewed on a single platform to control every aspect of surveillance.

Learn more about Computer Vision on GPU here, and if you are just starting out and want to take the first step of writing a script to understand it all, follow TensorFlow Object Detection in Windows (under 30 lines).


AI enthusiast, Computer Vision Engineer. Self-driving cars need cameras, not LIDAR. Vision is the future.