Written by Venkatesh Ramamrat
“Of all the senses, sight must be the most delightful.” — Helen Keller
At Wranga, when we set out to make technology and the internet safer for children, we realized how much of the content is visual: movies, games, and apps. To rate and review such an enormous volume of data, we needed tools. Alongside Natural Language Processing, which we covered in the blog Language AI, we turned to the technology known as Computer Vision for visual content.
For many decades, people have dreamed of creating machines with the characteristics of human intelligence, machines that can think and act like humans. One of the most fascinating ideas was to give computers the ability to “see” and interpret the world around them. The “seeing” part was arguably solved by the camera years ago, but a machine's ability to understand what it sees is the result of advances in image processing and neural networks, both of which are driven by data and algorithms. According to Precedence Research, Computer Vision will grow at a CAGR of 38.1% from 2022 to 2030, from $87 billion in 2022 to $1,597 billion.
Types of Computer Vision
As a student of Visual Art, I learned that the color wheel is one of the most fundamental tools for understanding how we see and how colors interact with each other. Computers have a similar understanding. Red, Green, and Blue are the primary colors, and combining them in pairs gives a second set: Magenta, Cyan, and Yellow. Every image on a computer screen is a combination of these three primary colors. Many of the models used in computer vision ultimately work on combinations of these color values. Each primary color has a range from 0 to 255, and the higher the value, the brighter that color.
RGB values of a pixel in an image
This is how an image is formed on a computer screen. The red, green, and blue values of each pixel are read by an AI system and stored in the form of a matrix.
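The color mixing and matrix storage described above can be sketched in a few lines of Python. This is only an illustration: the helper names `mix` and `channel` and the tiny 2x2 image are made up for this example.

```python
# Additive color mixing with 8-bit channels (0 to 255 per channel).
RED = (255, 0, 0)
GREEN = (0, 255, 0)
BLUE = (0, 0, 255)


def mix(a, b):
    """Add two RGB colors channel by channel, clamping each channel at 255."""
    return tuple(min(x + y, 255) for x, y in zip(a, b))


yellow = mix(RED, GREEN)    # (255, 255, 0)
magenta = mix(RED, BLUE)    # (255, 0, 255)
cyan = mix(GREEN, BLUE)     # (0, 255, 255)

# A tiny 2x2 image stored as a matrix: rows of (R, G, B) pixels.
# The pixel values are invented for illustration.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (200, 200, 200)],
]


def channel(img, index):
    """Extract one color channel (0 = R, 1 = G, 2 = B) as its own matrix."""
    return [[pixel[index] for pixel in row] for row in img]


red_channel = channel(image, 0)  # [[255, 0], [0, 200]]
```

Splitting an image into one matrix per channel like this is a common first step before any further processing.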
Max pooling is a process that reduces computation by taking the largest number from every subsection of the array. It keeps the most prominent values in the array and makes the data easier to manage. It can be applied multiple times at different stages in the network.
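As a minimal sketch of max pooling, assuming non-overlapping 2x2 blocks and an input whose height and width are multiples of the block size, the function below (the name `max_pool` and the sample values are made up for this example) shrinks a 4x4 array to 2x2:

```python
def max_pool(matrix, size=2):
    """Downsample a 2D list by keeping the largest value in each size x size block.

    Assumes the height and width of `matrix` are multiples of `size`.
    """
    pooled = []
    for top in range(0, len(matrix), size):
        row = []
        for left in range(0, len(matrix[0]), size):
            block = [
                matrix[r][c]
                for r in range(top, top + size)
                for c in range(left, left + size)
            ]
            row.append(max(block))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 2],
    [7, 2, 9, 4],
    [1, 0, 3, 5],
]
pooled = max_pool(feature_map)  # [[6, 2], [7, 9]]
```

Each output value stands in for four input values, so the array that later stages have to process is a quarter of the original size.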
Edges are essentially large differences between neighboring pixel values, so most edge detection methods try to extract the regions where such differences occur. Sobel’s method detects edges by performing convolutions (indicated by the operator *) on 3x3 regions of the image with two special kernels. The kernels yield the horizontal and vertical components of the gradient, which can be used to calculate a vector representing the direction and strength of edges.
It would be difficult to analyze an image outright, as the pixel arrays can be complex and noisy. This is why researchers usually extract features such as edges and lines. These features can be represented by much simpler numerical values or relationships. There are many ways to detect edges, such as taking a derivative of the pixel intensities or using Canny’s method.
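Sobel’s method as described here can be sketched in plain Python. The two kernels are the standard Sobel kernels; the function names and the 4x4 test image are made up for this example. As a simplification, the kernels are applied as a sliding weighted sum without flipping them, which for these kernels differs from a strict convolution only in the sign of the gradient.

```python
import math

# Standard 3x3 Sobel kernels for the horizontal and vertical gradient.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def apply_kernel(image, kernel, row, col):
    """Weighted sum of the 3x3 neighborhood centered on (row, col)."""
    return sum(
        kernel[i][j] * image[row + i - 1][col + j - 1]
        for i in range(3)
        for j in range(3)
    )


def sobel(image):
    """Return (magnitude, direction) matrices for the interior pixels.

    Border pixels are skipped because they lack a full 3x3 neighborhood.
    """
    magnitude, direction = [], []
    for r in range(1, len(image) - 1):
        mag_row, dir_row = [], []
        for c in range(1, len(image[0]) - 1):
            gx = apply_kernel(image, SOBEL_X, r, c)
            gy = apply_kernel(image, SOBEL_Y, r, c)
            mag_row.append(math.hypot(gx, gy))   # edge strength
            dir_row.append(math.atan2(gy, gx))   # edge direction in radians
        magnitude.append(mag_row)
        direction.append(dir_row)
    return magnitude, direction


# Grayscale test image: dark left half, bright right half (a vertical edge).
image = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
magnitude, direction = sobel(image)
# Every interior pixel touches the edge, so each magnitude is 1020.0 and
# each direction is 0.0 (the gradient points horizontally, dark to bright).
```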
Edge Detection Techniques
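The simplest of these techniques, taking a derivative, can be illustrated on a single row of grayscale pixels. This is a rough sketch only: the function name, the sample row, and the threshold of 50 are assumptions chosen for the example, and real detectors such as Canny’s add smoothing and further steps.

```python
def horizontal_edges(row, threshold):
    """Return indices where brightness jumps by more than `threshold`
    between a pixel and its right-hand neighbor."""
    # Discrete derivative: difference between each pixel and the next one.
    differences = [row[i + 1] - row[i] for i in range(len(row) - 1)]
    return [i for i, d in enumerate(differences) if abs(d) > threshold]


# One row of pixels: dark, then bright, then dark again, with slight noise.
row = [10, 12, 11, 200, 205, 198, 20, 22]
edges = horizontal_edges(row, threshold=50)  # [2, 5]
```

The small fluctuations from noise stay below the threshold, while the two large jumps are reported as edges.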
Convolutional Neural Network
Computer vision is dominated by the Convolutional Neural Network (CNN or ConvNet) because of its high accuracy. CNNs rose to prominence after the 2012 ImageNet contest. A neural network is a set of algorithms used to recognize patterns and relationships in a dataset. It is loosely modeled on how neurons in a human brain function and interact, allowing us to perceive the natural world.
Another notable name is Joseph Redmon, who introduced the YOLO networks in 2016. The name is probably a play on the popular internet slang “You Only Live Once”; in the paper, YOLO stands for “You Only Look Once”, an approach that offers fast object detection based on neural networks.
What's easy for humans is often hard for computers. Artificial intelligence systems have long been better than people at doing mathematics or remembering large quantities of information, but for decades humans have had an advantage at recognizing everyday objects such as dogs, cats, tables, or chairs.
Recently, however, "neural networks" that mimic the brain have approached the human ability to identify objects, leading to technological advances supporting self-driving cars, facial recognition programs, and AI systems that help physicians spot abnormalities in radiological scans.
Computers interpreted the above images to be (from left) an electric guitar, an African grey parrot, a strawberry, and a peacock.
While humans pay attention to the shapes of pictured objects, deep-learning computer vision algorithms routinely latch on to the objects’ textures instead. This finding, presented at the International Conference on Learning Representations in May 2019, illustrates how misleading our intuitions can be about what makes artificial intelligence tick. It may also hint at why our vision evolved the way it did. Humans live in a three-dimensional world, where objects are seen from multiple angles under many different conditions, and where our other senses, such as touch, can contribute to object recognition as needed. So it makes sense for our vision to prioritize shape over texture.
“If we want machines to think, we need to teach them to see.” — Fei Fei Li
In 2018, Joy Buolamwini of the MIT Media Lab highlighted a huge problem. Face recognition technology often would not recognize her face, despite working perfectly for her white male coworkers. With a team of scientists, she assembled a dataset of 1,270 individuals, the Pilot Parliaments Benchmark (PPB).
They used pictures of members of the parliaments of three European countries (Iceland, Finland, and Sweden) and three African countries (Rwanda, Senegal, and South Africa), then classified them by gender and skin tone, producing a final dataset made up of 44.6% females and 46.4% darker-skinned individuals. They found that all of the classifiers were least accurate on darker-skinned females and most accurate on lighter-skinned males, with a discrepancy as large as 34.4%. Because our datasets are skewed towards a certain kind of profile, a common criticism is that the AI we train will only be as good as the data we give it.
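The kind of per-group comparison described above can be sketched as follows. The records here are invented toy data with neutral group names, not figures from the study, and the function name is made up for this example. The point is only to show how a respectable overall accuracy can hide a large gap between groups.

```python
from collections import defaultdict

# Invented toy records: (group, true_label, predicted_label).
records = [
    ("group_a", "x", "x"),
    ("group_a", "y", "y"),
    ("group_a", "x", "x"),
    ("group_a", "y", "y"),
    ("group_b", "x", "x"),
    ("group_b", "y", "x"),
    ("group_b", "x", "x"),
    ("group_b", "y", "x"),
]


def accuracy_by_group(rows):
    """Return the share of correct predictions for each group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, predicted in rows:
        total[group] += 1
        correct[group] += int(truth == predicted)
    return {group: correct[group] / total[group] for group in total}


per_group = accuracy_by_group(records)   # {"group_a": 1.0, "group_b": 0.5}
overall = sum(t == p for _, t, p in records) / len(records)  # 0.75
gap = max(per_group.values()) - min(per_group.values())      # 0.5
```

Reporting only the overall figure would miss that one group is classified correctly every time and the other only half the time.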
Content Moderation and Visual-AI
Wranga has been using Visual-AI because we believe it has the potential to make the online world far safer for users and to protect the integrity of online platforms such as social media websites, video-sharing sites, and messaging apps. While moderating text is an important aspect of protecting users, image and video moderation is essential to making these platforms safe environments, free from abusive and horrifying content.
Image moderation uses object detection to analyze media for items that may be unsuitable or harmful, such as weapons, drug use, abuse, nudity, and other critical categories. Text detection goes a step further by detecting potentially offensive or harmful words that appear inside the frame, which natural language processing alone would miss.
Video moderation uses the same technologies, analyzing the video frame by frame for offending visuals. With live streaming becoming increasingly common across social apps, the ability to process video in real time without introducing lag has become essential in content moderation. Visual AI can be deployed to monitor content on OTT platforms, games, apps, and advertisements, and using computer vision we can deploy our patented algorithms for rating content. We also go deeper into reviewing content through the use of Generative Adversarial Networks, which I will cover in a subsequent post.
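To make the frame-by-frame idea concrete, here is a purely illustrative sketch. It is not Wranga's algorithm: the label names, the confidence threshold, the `flag_frames` function, and the stand-in detector are all assumptions made for this example, and a real system would plug in a video decoder and a trained object-detection model.

```python
BLOCKED_LABELS = {"weapon", "nudity"}  # hypothetical category names


def flag_frames(frames, detect, min_confidence=0.8, step=1):
    """Return (frame_index, label) pairs for frames with blocked content.

    `frames` is any sequence of frames and `detect` is a callable returning
    a list of (label, confidence) pairs for one frame. Both are placeholders
    for a real decoder and a real detection model. `step` lets the caller
    check only every n-th frame to save time on live streams.
    """
    flagged = []
    for index in range(0, len(frames), step):
        for label, confidence in detect(frames[index]):
            if label in BLOCKED_LABELS and confidence >= min_confidence:
                flagged.append((index, label))
    return flagged


def fake_detect(frame):
    """Stand-in detector: each toy 'frame' is already a list of detections."""
    return frame


video = [
    [("person", 0.99)],
    [("person", 0.97), ("weapon", 0.91)],
    [("weapon", 0.40)],   # below the confidence threshold, so not flagged
    [("nudity", 0.85)],
]
flagged = flag_frames(video, fake_detect)  # [(1, "weapon"), (3, "nudity")]
```

Checking fewer frames (a larger `step`) reduces lag but risks missing short offending segments, which is the trade-off a real-time moderation system has to tune.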