How data science saved hundreds of hours of work time

In brief

  • 01
    Customer

    Video aggregator, generates income from ad impressions

  • 02
    Problem

    Each video has several preview images, and some of them are blurry. Users do not open such videos and so do not watch the ads in them, and the company loses income

  • 03
    Task

    Apply data science methods: develop a Python module that sorts images by quality to find and remove all poor-quality images

  • 04
    Result
    • Conducted research and proposed a principle of image quality assessment. Together with the customer, we determined the criteria, i.e., which previews are considered high-quality, and which are not
    • Developed a Python module that sorts images by degree of blur
    • The module analyzed 50 million images: it found 4.7 million blurry ones and 4.5 million of very good quality. The rest are of normal quality
    • Spent a week from technical specification to result
    • The customer was able to remove all low-quality previews from the site and increase the number of views

Customer: video aggregator

The company owns a video platform that collects entertaining videos from many other sites. When a visitor selects a video to watch, the aggregator redirects them to the content owner's website. The platform is rewarded when users watch ads on the source sites.

Screenshot from the platform
The company does not shoot its own videos, but collects videos from other sites.

There are over 13 million videos hosted on the platform. From content owners, the company receives images for previews. Each video has from 1 to 10 of them, more than 50 million images in total.

Issue: Some of the video preview images are of poor quality

Some of the previews were of poor quality, and the platform didn’t perform any verification before downloading. Users didn’t click on videos with blurry previews, and therefore didn’t watch ads, and the company was losing revenue.

The customer decided to remove all low-quality images, but doing so manually in such quantity is labor-intensive. If you looked at each image for even just one second, it would take about 14 thousand hours. The customer turned to OrbitSoft to automate the process.
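The estimate above can be checked with a few lines of arithmetic. This is only a sketch: the one-second viewing time comes from the text, while the 8-hour working day is our assumption for illustration.

```python
# Back-of-the-envelope estimate of the manual review time.
TOTAL_IMAGES = 50_000_000   # previews on the platform (from the text)
SECONDS_PER_IMAGE = 1       # viewing time per image (from the text)
HOURS_PER_WORKING_DAY = 8   # assumption, not stated in the article

total_hours = TOTAL_IMAGES * SECONDS_PER_IMAGE / 3600
working_days = total_hours / HOURS_PER_WORKING_DAY

print(f"{total_hours:,.0f} hours, about {working_days:,.0f} working days")
```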

Solution: developed a Python module that assessed image quality

High-quality images are clear, low-quality images are blurry. Our task was to teach the algorithm to mathematically determine the degree of image blur. We started by studying specialized literature regarding this issue.

For example, the article Analysis of focus measure operators for shape-from-focus considers 36 methods for calculating the blur index. We noted the options that are easier to implement.

A good solution is described in the article Diatom autofocusing in brightfield microscopy: A comparative study. In this variant, a single image channel is taken, and the absolute value of the Laplace operator is calculated. Based on this method, we developed a Python module that determines the amount of blur in an image.

How the module determines the amount of blur

Each picture is made up of pixels. If the image is blurry, a smooth gradient is obtained on the border of objects. That is, the color of neighboring pixels changes gradually. If the image is clear, then the boundaries between objects are clear, and the color value of neighboring pixels on the border of objects changes sharply.

Blurry and clear images comparison
The top image is blurry: it is difficult to find the border between the shirt and the background. The color of the pixels smoothly flows from dark red to light red. The bottom image is clear: there is a border between the shirt and the background. Neighboring pixels are very different in color
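The difference between a gradual and an abrupt transition can be illustrated on a single row of pixels. The brightness values below are invented for the example and are not taken from the customer's images.

```python
def neighbor_differences(row):
    """Absolute brightness change between each pair of adjacent pixels."""
    return [abs(b - a) for a, b in zip(row, row[1:])]

# Hypothetical brightness values (0-255) along one row that crosses an edge.
blurry_row = [50, 70, 90, 110, 130, 150, 170, 190]  # gradual transition
sharp_row = [50, 50, 50, 50, 190, 190, 190, 190]    # abrupt transition

print(max(neighbor_differences(blurry_row)))  # 20
print(max(neighbor_differences(sharp_row)))   # 140
```

In the blurry row the change is spread evenly over many pixels, while in the sharp row it is concentrated in a single large jump.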

To determine the blurriness of an image, the Python module calculates the magnitude of the color gradient using the Laplace operator:

1. Takes one color channel of an image and convolves it with the following kernel:

3x3 Laplace kernel

2. Calculates the variance of the result, i.e., the standard deviation squared.

3. Uses this variance as a coefficient for judging the degree of blur: the sharper the image, the greater the coefficient.
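The three steps above can be sketched in pure Python. This is not the module described in the article. The kernel below is the common 4-neighbor Laplace kernel, which we assume matches the one in the figure, and the function names are hypothetical. Border pixels are simply skipped here, whereas a library such as OpenCV extrapolates them.

```python
from statistics import pvariance

# Common 3x3 Laplace kernel (assumed; the figure is not reproduced in the text).
LAPLACE_KERNEL = [
    [0, 1, 0],
    [1, -4, 1],
    [0, 1, 0],
]

def laplacian_response(channel):
    """Apply the kernel at every interior pixel of a 2D list of brightness values."""
    height, width = len(channel), len(channel[0])
    response = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            total = 0
            for ky in range(3):
                for kx in range(3):
                    total += LAPLACE_KERNEL[ky][kx] * channel[y + ky - 1][x + kx - 1]
            response.append(total)
    return response

def blur_coefficient(channel):
    """Variance of the Laplacian response: higher means sharper."""
    return pvariance(laplacian_response(channel))

# Two tiny synthetic images: a hard vertical edge and a smooth ramp.
sharp = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
blurry = [[0, 51, 102, 153, 204, 255] for _ in range(6)]

print(blur_coefficient(sharp) > blur_coefficient(blurry))  # True
```

The smooth ramp gives a Laplacian of zero everywhere, so its variance is zero, while the hard edge produces large positive and negative responses and therefore a large variance.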

All previews on the platform are in color, and each color channel has its own gradient. Evaluating several channels takes longer than evaluating one, so we decided to convert the images to monochrome (grayscale).

How the module converts images to monochrome

Color wheel
To evaluate the blur of a color picture, you need to determine the gradation of each of its colors. In monochrome images, only shades of gray are used.

To convert color images to grayscale, we needed a library that can work with color. We tested several Python libraries and settled on OpenCV.

OpenCV is a computer vision library with which you can process, analyze, and classify images. It supports all popular file formats and copes well with images of different resolutions, which is just what we needed for our task.
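The article does not say which OpenCV call was used for the conversion. A plausible candidate is `cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)`, which computes a weighted sum of the three channels. The pure-Python sketch below reproduces that weighted sum so it runs without OpenCV. The function name is hypothetical, and the pixel order is blue, green, red, as in images loaded by OpenCV.

```python
def to_grayscale(image_bgr):
    """Convert a 2D list of (blue, green, red) pixels to gray levels 0-255."""
    return [
        # Standard luminance weights: 0.299 R + 0.587 G + 0.114 B.
        [round(0.114 * b + 0.587 * g + 0.299 * r) for (b, g, r) in row]
        for row in image_bgr
    ]

# A 2x2 test image: pure red, white, black, pure blue (in BGR order).
image = [
    [(0, 0, 255), (255, 255, 255)],
    [(0, 0, 0), (255, 0, 0)],
]

print(to_grayscale(image))  # [[76, 255], [0, 29]]
```

After this step each pixel is a single number, so the blur coefficient needs to be computed only once per image rather than once per color channel.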

How the module evaluates image quality

The result of the algorithm is a coefficient that takes a value from 0 to 2000. The higher the value, the better the image quality.

An important point was to determine the threshold value of this coefficient: which images are considered sufficiently clear, and which are already blurry. We agreed with the customer that we will consider quality previews to be those with a coefficient of more than 100.
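Applying the agreed threshold is a one-line comparison. In the sketch below the threshold of 100 comes from the text, while the function name, file names, and coefficient values are made up for illustration. A coefficient of exactly 100 is treated as not high quality, following the wording "more than 100".

```python
QUALITY_THRESHOLD = 100  # agreed with the customer (from the text)

def is_high_quality(coefficient, threshold=QUALITY_THRESHOLD):
    """A preview is high quality only if its coefficient exceeds the threshold."""
    return coefficient > threshold

# Hypothetical coefficients keyed by made-up file names.
coefficients = {"a.jpg": 12.5, "b.jpg": 100.0, "c.jpg": 100.1, "d.jpg": 1430.0}

to_remove = [name for name, value in coefficients.items() if not is_high_quality(value)]
print(to_remove)  # ['a.jpg', 'b.jpg']
```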

Script
The script analyzed the quality of each preview, adjusting for the number of pixels, and counted the images with a coefficient above and below the threshold

Result: the module found 4.7 million poor-quality images and 4.5 million excellent ones

Scheme
How the image quality evaluation module works:
1. Converts the previews uploaded to the platform to monochrome
2. Calculates the blur factor
3. Distributes images into one of two groups, depending on whether the coefficient is above or below the threshold value

Data science methods allowed us to automate image quality assessment. Done manually, it would have taken about 14 thousand working hours. It took us less than 40 hours, from setting the task to obtaining the result.

The Python module analyzed 50 million images and divided them into groups:

  1. 4.7 million blurry images: coefficient from 0 to 100
  2. The rest are quite clear, of which 4.5 million are very good quality, where the coefficient is more than 1000
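The grouping above can be sketched as a small counting routine. The boundaries of 100 and 1000 come from the text. The group labels, the function name, and the sample coefficients are hypothetical.

```python
from collections import Counter

def quality_group(coefficient):
    """Map a blur coefficient to one of three groups (labels are illustrative)."""
    if coefficient <= 100:
        return "blurry"
    if coefficient <= 1000:
        return "clear"
    return "very good"

# Made-up coefficients standing in for the real 50 million results.
sample = [3.2, 99.9, 100.0, 250.0, 1000.0, 1000.5, 1870.0]

counts = Counter(quality_group(value) for value in sample)
print(counts["blurry"], counts["clear"], counts["very good"])  # 3 2 2
```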

Over time, the customer removed all images of poor quality from the site. The number of views on the platform increased.
