{"id":36491,"date":"2023-05-22T13:05:32","date_gmt":"2023-05-22T10:05:32","guid":{"rendered":"https:\/\/orbitsoft.com\/blog\/?p=36491"},"modified":"2023-05-23T16:22:18","modified_gmt":"2023-05-23T13:22:18","slug":"training-data-quality","status":"publish","type":"post","link":"https:\/\/orbitsoft.com\/blog\/training-data-quality\/","title":{"rendered":"Case study: How poor-quality training data can ruin AI training, and what to do about it"},"content":{"rendered":"<div class=\"wp-block-lazyblock-case lazyblock-case-2cVQlB\"><div class=\"styled-block\">\n  <div class=\"styled-block__main\">\n          <h3 class=\"styled-block__title\">\n        In brief      <\/h3>\n        <ul class=\"case__list\">\n            \n                    <li class=\"case__item\">\n              \n          <span class=\"case__order\">01<\/span>\n          <div class=\"case__body\">\n            <div class=\"case__title\">\n              <span>Customer<\/span>\n            <\/div>\n            <p><span style=\"font-weight: 400;\">An advertising platform that works with RTB auctions<\/span><\/p>          <\/div>\n        <\/li>\n            \n                    <li class=\"case__item\">\n              \n          <span class=\"case__order\">02<\/span>\n          <div class=\"case__body\">\n            <div class=\"case__title\">\n              <span>Task<\/span>\n            <\/div>\n            <p><span style=\"font-weight: 400;\">Develop an AI model that wins auctions to place ads on sites with the right target audience<\/span><\/p>          <\/div>\n        <\/li>\n            \n                    <li class=\"case__item\">\n              \n          <span class=\"case__order\">03<\/span>\n          <div class=\"case__body\">\n            <div class=\"case__title\">\n              <span>Solution <\/span>\n            <\/div>\n            <ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Processed the provided training data, defined metrics for 
model building\u00a0<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Selected, trained, and tested suitable machine learning models<\/span><\/li>\n<\/ul>          <\/div>\n        <\/li>\n            \n                    <li class=\"case__item\">\n              \n          <span class=\"case__order\">04<\/span>\n          <div class=\"case__body\">\n            <div class=\"case__title\">\n              <span>Result<\/span>\n            <\/div>\n            <ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Determined that the data collected by the customer was not suitable for training<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Stopped at the research stage without spending the customer&#8217;s money on the full development and implementation cycle<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Suggested that the customer rebuild the data collection system to make the model predictions effective<\/span><\/li>\n<\/ul>          <\/div>\n        <\/li>\n          <\/ul>\n  <\/div>\n  <\/div><\/div>\n\n\n<p>Companies collect a huge amount of data: about users, the effectiveness of advertising, the performance of services, etc. Modern tools, such as artificial intelligence, help to analyze, interpret, and turn this data to the company&#8217;s advantage.<\/p>\n\n\n\n<p>For example, we developed an algorithm for predicting click-through rates for a large advertising platform that earns money from ad placements. The platform collects information on the reaction of users of various sites to ads: clicked, watched to the end, ignored, and closed without watching. We created an algorithm that predicts on which site the effectiveness of advertising will be higher. 
As a result, the CTR of the platform&#8217;s clients increased by an average of 20%, and with it, the company&#8217;s revenue.<\/p>\n\n\n\n<p>However, it\u2019s not always possible to make effective predictions using artificial intelligence. To develop and train an algorithm, you need a sufficient amount of high-quality training data. If its collection is not organized correctly, the result will be unreliable. Using a case study from another advertising company, we describe how data quality affects the results of AI algorithm development, and why preliminary research is needed.<\/p>\n\n\n<div class=\"wp-block-lazyblock-heading lazyblock-heading-1YXMlY\"><h2 class=\"article__h\">Customer: large advertising platform <\/h2><\/div>\n\n\n<p>An Israeli company works with real-time bidding (RTB): it helps advertisers select sites with the right target audience, and places ads on them at the lowest cost. Real-time bidding technology allows advertisers to buy advertising space and place ads in real time.<\/p>\n\n\n\n<p><strong>How RTB works:<\/strong><\/p>\n\n\n\n<p>1. Websites with high incoming traffic (publishers) sell advertising space. For a fee, they show banners, videos, pop-ups, and other ads to their visitors. To let advertisers see who is on the site, publishers collect data from visitors: location, device, login page, etc. This data is sent to the SSP-system (supply-side platform). Much advertising is generated.<\/p>\n\n\n\n<p>2. Advertisers search for sites with their target audience. They send a request to the DSP (demand-side platform) with a portrait of the audience they want, and a quote for the price at which they want to buy ad space.<\/p>\n\n\n\n<p>3. 
The SSP-system conducts the real-time bidding and chooses the winners: the ones who offered the highest price will place their ads on the publisher\u2019s site.<\/p>\n\n\n<div class=\"wp-block-lazyblock-figure lazyblock-figure-1vgBie\"><figure class=\"article__figure \">\n        <div class=\"article__figure-img\" >\n        <img decoding=\"async\" src=\"https:\/\/orbitsoft.com\/blog\/wp-content\/uploads\/tg_image_2008267304.png\" alt=\"RTB scheme\">\n    <\/div>\n                <figcaption><em>The customer&#8217;s company acts as an intermediary between advertisers and the DSP-platform: for a small commission it helps advertisers find the right audience on the RTB, and buy advertising placement at the lowest price<\/em><\/figcaption>\n    <\/figure><\/div>\n\n<div class=\"wp-block-lazyblock-heading lazyblock-heading-ZaClA8\"><h2 class=\"article__h\">Objective: to increase the number of auctions won <\/h2><\/div>\n\n\n<p>To win auctions to place ads, a price must be offered that will &#8220;outbid&#8221; other bidders, and be acceptable to the platform&#8217;s clients, the advertisers. But it&#8217;s not enough just to place ads: the platform will only be rewarded if users click on them. So, platforms still must be chosen with an interested audience.&nbsp;<\/p>\n\n\n\n<p>The company was tasked with developing an artificial intelligence model that selects sites with the right audience, and predicts what kind of price offer will win auctions on them.<\/p>\n\n\n<div class=\"wp-block-lazyblock-heading lazyblock-heading-2ubXQ8\"><h2 class=\"article__h\">Solution: create a price prediction model <\/h2><\/div>\n\n\n<p>The customer turned to OrbitSoft. They wanted to implement one of our developments into their platform code, which had already shown successful results in similar cases: the Predictor algorithm. It\u2019s a machine learning (ML) model that uses artificial intelligence to predict the probability of events. 
It\u2019s based on the ability to find patterns between events, draw conclusions, and apply them to predictions.<\/p>\n\n\n\n<p>To teach artificial intelligence to select auctions with the right audience and win them, a machine learning model needs to be created and trained on a lot of data about audience behavior, and on auctions that have already been won. The sequence of steps is as follows:<\/p>\n\n\n\n<ol>\n<li>Collect training data: information about auction bids, wins, impressions of purchased ads, and users who click on these ads. Establish links between the metrics.&nbsp;<\/li>\n\n\n\n<li>Create a mathematical model and load data into it.&nbsp;<\/li>\n\n\n\n<li>The model builds a dependency function and finds crossover points, the points where the values of variables coincide: bid, win, conversion, and metrics of users. For example, for users from country A to click on an advertisement, it should be placed on site B by making a bid of C in the auction.<\/li>\n\n\n\n<li>Test the model: load the next batch of data into it and check the probability of the event it predicts. Part of the test is performed on new data that the machine has not seen before. This will show whether the results of the training tests coincide with real results.&nbsp;<\/li>\n<\/ol>\n\n\n\n<p>Testing takes place many times. This helps the model to learn by itself. Patterns that lead to wrong predictions are removed from the system. We keep the data on successful predictions. 
We continue to test the algorithm until it learns to make realistic forecasts.<\/p>\n\n\n<div class=\"wp-block-lazyblock-figure lazyblock-figure-gpFSr\"><figure class=\"article__figure \">\n        <div class=\"article__figure-img\" >\n        <img decoding=\"async\" src=\"https:\/\/orbitsoft.com\/blog\/wp-content\/uploads\/tg_image_2621094389.png\" alt=\"Predictor scheme\">\n    <\/div>\n                <figcaption><em>Predictor scheme: the model learns and trains on incoming data<\/em><\/figcaption>\n    <\/figure><\/div>\n\n<div class=\"wp-block-lazyblock-banner lazyblock-banner-Zes2t5\"><div \n  class=\"banner\n   \n  \" \n  >\n    <div class=\"banner__body\">\n        <h2 class=\"banner__h\"><strong>Development and implementation of solutions:<\/strong><\/h2>\n        <div class=\"banner__content\">\n            <ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">To optimize work with large databases, advertising platforms<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Information systems using artificial intelligence<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Blockchain services, systems and applications, implementation of smart contracts<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Secure data exchange systems<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">As well as technical advice and outsourced development team support for your needs<\/span><\/p>        <\/div>\n                            <div \n              class=\"banner__button button js-form-modal\n               button_style_light-on-promo2\">\n              Get a free consultation                          <\/div>\n            <\/div>\n    <div class=\"banner__photo\">\n        <img decoding=\"async\" src=\"https:\/\/orbitsoft.com\/blog\/wp-content\/uploads\/friendship.png\" alt=\"\" 
class=\"banner__img\">\n    <\/div>\n<\/div><\/div>\n\n<div class=\"wp-block-lazyblock-heading lazyblock-heading-2lCHEp\"><h2 class=\"article__h\">Started to develop an ML model and faced the problem of poor data <\/h2><\/div>\n\n<div class=\"wp-block-lazyblock-heading3 lazyblock-heading3-Pjwr1\"><h3 class=\"article__h3\">Researched the data and determined the metrics on which we will build the model<\/h3><\/div>\n\n\n<p>The customer&#8217;s platform processes 500 billion unique events per month, enough for a machine learning model. However, analysis showed that there was a lot of repetition and &#8220;garbage&#8221; in the data. When the database was cleaned up, 47 million rows remained.&nbsp;<\/p>\n\n\n\n<p>We categorized the data and counted the number of events:<\/p>\n\n\n\n<ul>\n<li>Auctions held: 22,746,723<\/li>\n\n\n\n<li>Auctions won: 5,390,678<\/li>\n\n\n\n<li>Ad impressions: 5,389,276<\/li>\n\n\n\n<li>Ad clicks: 228,977<\/li>\n<\/ul>\n\n\n<div class=\"wp-block-lazyblock-heading3 lazyblock-heading3-Z1EnEPR\"><h3 class=\"article__h3\">Selected, trained and tested models<\/h3><\/div>\n\n\n<p>Given these constraints, we selected and tested 28 models for prediction training, e.g., linear regression, CCPM, FNN, and PNN. We also took multi-task models: shared bottom, ESMM, MMOE and others.<\/p>\n\n\n\n<p>The models were trained on 2 months of platform data. The data was split at a ratio of 80\/20 or 70\/30: the larger portion was used to train the model, and the smaller one to test its predictions of the dynamic price per thousand impressions, clicks, and conversions. 
Each result was entered into a table, compared, and analyzed.&nbsp;<\/p>\n\n\n\n<p><strong>Test results of one of 28 models &#8211; linear regression<\/strong><\/p>\n\n\n<div class=\"wp-block-lazyblock-figure lazyblock-figure-11QJhi\"><figure class=\"article__figure \">\n        <div class=\"article__figure-img\" >\n        <img decoding=\"async\" src=\"https:\/\/orbitsoft.com\/blog\/wp-content\/uploads\/tg_image_3015451779.jpeg\" alt=\"linear regression table\">\n    <\/div>\n                <figcaption><em>We compared the values obtained during training of the models (train metrics) and the estimated values on real data (eval metrics). The train metrics of the linear regression model differ sharply from the eval metrics, which means that the model does not carry over to real data<\/em><\/figcaption>\n    <\/figure><\/div>\n\n<div class=\"wp-block-lazyblock-heading3 lazyblock-heading3-ZfUuM8\"><h3 class=\"article__h3\">Evaluated the results of machine learning<\/h3><\/div>\n\n\n<p>Out of 28 trained models, we selected 5. The most effective one was tested even more. To understand whether a model is suitable for the task, we compared its f1-score with the threshold value of the metric.<\/p>\n\n\n\n<p>We set the threshold value of the f1-score at 0.4. The f1-score is the harmonic mean of precision and recall, so a value below the threshold means that the model misses too many real events, raises too many false alarms, or both, and will not give a meaningful result. The best f1-score for the most productive model was 0.19. This is less than half the threshold value. This means that this model should not be implemented: the forecasts will be unreliable.<\/p>\n\n\n<div class=\"wp-block-lazyblock-heading lazyblock-heading-Z2cqPnH\"><h2 class=\"article__h\">Result: stopped at the research stage, without spending money on the full cycle of development and implementation <\/h2><\/div>\n\n\n<p>After testing all suitable machine learning algorithms, we determined that the data collected by the customer was not suitable for training. 
None of the trained models produced reliable predictions.<\/p>\n\n\n\n<p>We stopped at the preliminary research stage and didn&#8217;t move on to implementing the predictor algorithm in the code of the advertising platform. This way the customer tested his idea and saved money by not investing in an algorithm that didn&#8217;t work.<\/p>\n\n\n\n<p>We suggested that the customer rebuild the data collection system. We will determine the categories and how much data to collect, and then go back to implementing the model. In this case it will be trained faster, and predictions will be effective.<\/p>\n\n\n<div class=\"wp-block-lazyblock-banner lazyblock-banner-1tkbFU\"><div \n  class=\"banner\n   \n  \" \n  >\n    <div class=\"banner__body\">\n        <h2 class=\"banner__h\">Development of big data processing systems and implementation of artificial intelligence models:<\/h2>\n        <div class=\"banner__content\">\n            <ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Let&#8217;s formulate what categories of data are needed to solve the problem<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Determine a sufficient amount of data collection<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Build a data collection and analysis system for machine learning<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Select, train, and test models to predict events<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Introduce smart algorithms into ready-made systems<\/span><\/li>\n<\/ul>        <\/div>\n                            <div \n              class=\"banner__button button js-form-modal\n               button_style_light-on-promo2\">\n              Get a free consultation                          <\/div>\n            <\/div>\n    <div 
class=\"banner__photo\">\n        <img decoding=\"async\" src=\"https:\/\/orbitsoft.com\/blog\/wp-content\/uploads\/earn-more.png\" alt=\"\" class=\"banner__img\">\n    <\/div>\n<\/div><\/div>","protected":false},"excerpt":{"rendered":"<p>Companies collect a huge amount of data: about users, the effectiveness of advertising, the performance of services, etc. Modern tools, such as artificial intelligence, help to analyze, interpret, and turn this data to the company&#8217;s advantage. For example, we developed an algorithm for predicting click-through rates for a large advertising platform that earns money from [&hellip;]<\/p>\n","protected":false},"author":214,"featured_media":36495,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[196],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.4 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Case study: How poor-quality training data can ruin AI training, and what to do about it - OrbitSoft Blog<\/title>\n<meta name=\"description\" content=\"How data quality affects the results of AI algorithm development, and why preliminary research is needed \u2014 an advertising company&#039;s example\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/orbitsoft.com\/blog\/training-data-quality\/\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:title\" content=\"Case study: How poor-quality training data can ruin AI training, and what to do about it - OrbitSoft Blog\" \/>\n<meta name=\"twitter:description\" content=\"How data quality affects the results of AI algorithm development, and why preliminary research is needed \u2014 an advertising company&#039;s example\" \/>\n<meta name=\"twitter:image\" 
content=\"https:\/\/orbitsoft.com\/blog\/wp-content\/uploads\/tg_image_4202863914.jpeg\" \/>\n<meta name=\"twitter:creator\" content=\"@orbitsoft\" \/>\n<meta name=\"twitter:site\" content=\"@orbitsoft\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"elevina\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"5 minutes\" \/>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Case study: How poor-quality training data can ruin AI training, and what to do about it - OrbitSoft Blog","description":"How data quality affects the results of AI algorithm development, and why preliminary research is needed \u2014 an advertising company's example","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/orbitsoft.com\/blog\/training-data-quality\/","twitter_card":"summary_large_image","twitter_title":"Case study: How poor-quality training data can ruin AI training, and what to do about it - OrbitSoft Blog","twitter_description":"How data quality affects the results of AI algorithm development, and why preliminary research is needed \u2014 an advertising company's example","twitter_image":"https:\/\/orbitsoft.com\/blog\/wp-content\/uploads\/tg_image_4202863914.jpeg","twitter_creator":"@orbitsoft","twitter_site":"@orbitsoft","twitter_misc":{"Written by":"elevina","Est. 
reading time":"5 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"WebPage","@id":"https:\/\/orbitsoft.com\/blog\/training-data-quality\/","url":"https:\/\/orbitsoft.com\/blog\/training-data-quality\/","name":"Case study: How poor-quality training data can ruin AI training, and what to do about it - OrbitSoft Blog","isPartOf":{"@id":"https:\/\/orbitsoft.com\/blog\/#website"},"datePublished":"2023-05-22T10:05:32+00:00","dateModified":"2023-05-23T13:22:18+00:00","author":{"@id":"https:\/\/orbitsoft.com\/blog\/#\/schema\/person\/f96c7f7c1bcb1cdf7e1750794548b6fa"},"description":"How data quality affects the results of AI algorithm development, and why preliminary research is needed \u2014 an advertising company's example","breadcrumb":{"@id":"https:\/\/orbitsoft.com\/blog\/training-data-quality\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/orbitsoft.com\/blog\/training-data-quality\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/orbitsoft.com\/blog\/training-data-quality\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/orbitsoft.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Case study: How poor-quality training data can ruin AI training, and what to do about it"}]},{"@type":"WebSite","@id":"https:\/\/orbitsoft.com\/blog\/#website","url":"https:\/\/orbitsoft.com\/blog\/","name":"OrbitSoft Blog","description":"Discover the latest in news and resources for OrbitSoft","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/orbitsoft.com\/blog\/?s={search_term_string}"},"query-input":"required 
name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/orbitsoft.com\/blog\/#\/schema\/person\/f96c7f7c1bcb1cdf7e1750794548b6fa","name":"elevina","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/orbitsoft.com\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/9f569b41ea8902fc571542fc77005a24?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/9f569b41ea8902fc571542fc77005a24?s=96&d=mm&r=g","caption":"elevina"},"url":"https:\/\/orbitsoft.com\/blog\/author\/elevina\/"}]}},"_links":{"self":[{"href":"https:\/\/orbitsoft.com\/blog\/wp-json\/wp\/v2\/posts\/36491"}],"collection":[{"href":"https:\/\/orbitsoft.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/orbitsoft.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/orbitsoft.com\/blog\/wp-json\/wp\/v2\/users\/214"}],"replies":[{"embeddable":true,"href":"https:\/\/orbitsoft.com\/blog\/wp-json\/wp\/v2\/comments?post=36491"}],"version-history":[{"count":5,"href":"https:\/\/orbitsoft.com\/blog\/wp-json\/wp\/v2\/posts\/36491\/revisions"}],"predecessor-version":[{"id":36500,"href":"https:\/\/orbitsoft.com\/blog\/wp-json\/wp\/v2\/posts\/36491\/revisions\/36500"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/orbitsoft.com\/blog\/wp-json\/wp\/v2\/media\/36495"}],"wp:attachment":[{"href":"https:\/\/orbitsoft.com\/blog\/wp-json\/wp\/v2\/media?parent=36491"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/orbitsoft.com\/blog\/wp-json\/wp\/v2\/categories?post=36491"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/orbitsoft.com\/blog\/wp-json\/wp\/v2\/tags?post=36491"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}