Open Access Research article

Testing and Trusting Machine Learning Systems

Sajjan Shiva1* and Deepak Venugopal1

1Department of Computer Science, The University of Memphis, USA

Corresponding Author
: Sajjan Shiva, Department of Computer Science, The University of Memphis, USA

Received Date: January 26, 2021; Published Date: February 24, 2021


Machine learning systems are now ubiquitous. These systems provide predictions in a black-box mode, masking their internal logic from the user. This absence of explanation creates both practical and ethical issues. Explaining a prediction reduces blind reliance on black-box ML classifiers. Trustworthy Artificial Intelligence is a current area of interest, yet the testing of such systems has not been formalized. We highlight these two issues in this paper.


Testing the performance of Machine Learning (ML) algorithms is often challenging. Specifically, ML methods are expected to make predictions about the future, and therefore their evaluation has inherent uncertainty. In general, it is impossible to obtain the true accuracy of an ML method, since tests are conducted on small samples of a dataset. Thus, when tests indicate that an ML method is 90% accurate, this is an estimate of the prediction accuracy based on empirical tests on a limited dataset. The traditional approach to testing and evaluating ML algorithms is to determine their accuracy on “unseen” data. That is, we learn the ML model on one dataset and then compute its predictive accuracy by testing the model on a new dataset that was not used during learning. To reduce the variance of these estimates, approaches such as cross-validation perform multiple tests to compute the accuracy of an ML algorithm. While traditional methods for testing ML algorithms have considered only how accurate an algorithm is, we also need to test ML algorithms on the basis of explainability. Specifically, consider an application of an ML algorithm in healthcare. A doctor who uses the ML method to make decisions needs to trust the predictions made by the method. Even if the algorithm is 99% accurate, without understanding the reasoning behind its predictions it is hard to trust it. In fact, it turns out that some of the most accurate ML algorithms (e.g., Deep Learning) are also the least interpretable. We need a testing framework for ML algorithms that focuses on their ability to generate human-interpretable predictions. Further, we need to utilize standard system- and software-testing methodologies in building the framework.
The framework should concentrate on formal testing of the raw dataset, the test dataset, the validation dataset, and the framework itself, starting with the corresponding requirements analysis and management.
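As a concrete illustration of the cross-validation testing described above, the following sketch estimates a classifier's accuracy as a mean and spread over held-out folds rather than a single number. The dataset and classifier here (scikit-learn's digits data and an SVM) are illustrative stand-ins, not the authors' setup.

```python
# Minimal sketch of cross-validated accuracy estimation (scikit-learn assumed).
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
clf = SVC(kernel="rbf")

# k-fold cross-validation: k accuracy estimates, each on a held-out fold
# that was not used to train that fold's model.
scores = cross_val_score(clf, X, y, cv=5)

# Report the mean and spread rather than a single point estimate.
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread across folds makes explicit that the accuracy figure is an empirical estimate with uncertainty, as discussed above.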


LIME [1] is a recent approach that explains complex ML algorithms via simpler models. In general, linear models are more interpretable than non-linear models. For example, if we consider a linear function Y = W^T X, we can explain the function by ranking the coefficients in W. Using this idea, given a non-linear ML classifier, LIME generates explanations for predictions through a linear approximation. Specifically, consider a classifier such as a support vector machine (SVM). SVMs learn complex non-linear functions from data that yield accurate results; however, explaining the results of the complex SVM classifier is difficult. Instead, we explain the prediction made by the SVM for a specific data instance as a linear function, and the parameters of this linear function are then used to rank the importance of features. One problem with LIME is that it produces explanations that are locally consistent but may not be globally consistent, which may lead to biased results. For example, consider an ML algorithm that identifies hand-written digits from their images. Traditional testing methods for ML have shown that classifiers can achieve greater than 95% accuracy on this task. However, to trust the classifier, each prediction should point to the key visual features in the hand-written digit as an explanation for the prediction. LIME can produce such an explanation independently for each example. However, to truly trust the ML method, it must also produce consistent explanations. That is, we should ensure that for similar digits, we pick similar visual features as the explanation. For example, when detecting the digit 4, even with variations in how the digit is written, humans are likely to recognize it based on a set of consistent features such as the cross lines. Therefore, for an ML algorithm to be trusted, it should produce consistent explanations. An illustration of our proposed framework is shown in Figure 1.
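The local linear approximation idea can be sketched as follows. This is not LIME itself but a minimal, simplified version of the same technique: perturb one instance, query the black-box classifier on the perturbations, weight samples by proximity, fit a weighted linear surrogate, and rank its coefficients. The data, classifier, kernel width, and noise scale are all illustrative assumptions.

```python
# Simplified LIME-style sketch: explain one prediction of a non-linear
# classifier with a locally weighted linear model (scikit-learn assumed).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Ridge
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
black_box = SVC(probability=True, random_state=0).fit(X, y)

def local_explanation(x, n_samples=500, width=1.0):
    """Rank features by the coefficients of a linear fit around x."""
    # 1. Perturb the instance with Gaussian noise.
    Z = x + rng.normal(scale=0.5, size=(n_samples, x.size))
    # 2. Query the black box for class-1 probabilities.
    p = black_box.predict_proba(Z)[:, 1]
    # 3. Weight perturbed samples by proximity to x (exponential kernel).
    w = np.exp(-np.linalg.norm(Z - x, axis=1) ** 2 / width ** 2)
    # 4. Fit a weighted linear surrogate and rank its coefficients.
    surrogate = Ridge(alpha=1.0).fit(Z, p, sample_weight=w)
    return np.argsort(-np.abs(surrogate.coef_))  # most influential first

print(local_explanation(X[0]))  # feature indices, most influential first
```

The returned ranking over features is exactly the kind of per-instance "local explanation" that the framework described below takes as input.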
As shown there, we will first extract explanations for predictions independently using LIME to generate what we call Local Explanations. Each explanation will be in the form of a ranking over the features. We will then generate a global explanation in which symmetrical instances have similar explanations. To identify symmetry among the data instances, we will use an approach that explains relationships among data instances [2]. Once we identify symmetries among the instances, we will modify the LIME explanations so that features are ranked in a similar order when instances are similar to each other, producing a reranking of features that we term a Global Explanation. Finally, we will use this global explanation to determine whether the ML algorithm produces trustworthy results (Figure 1).
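One way to quantify whether similar instances receive similar explanations is a rank-correlation statistic over their feature rankings. The sketch below uses Kendall's tau from SciPy as one plausible agreement measure; the two rankings are made-up placeholders, not actual LIME output, and the source does not prescribe this particular statistic.

```python
# Hypothetical consistency check: similar instances should yield similar
# feature rankings. Kendall's tau is one possible agreement measure.
from scipy.stats import kendalltau

ranking_a = [0, 2, 1, 3, 4]  # feature ranking explaining instance A
ranking_b = [0, 2, 1, 4, 3]  # ranking for a visually similar instance B

tau, _ = kendalltau(ranking_a, ranking_b)
print(f"explanation agreement (tau): {tau:.2f}")  # tau = 0.80 here
# tau near 1 suggests consistent explanations; near 0, inconsistent ones.
```

Aggregating such agreement scores over pairs of symmetric instances gives one concrete way to score the global consistency that the framework targets.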



We plan to use standard publicly available datasets from visual as well as text-processing tasks to evaluate our approach. In particular, we plan to use the MNIST handwritten-digit image dataset [3], where the task is to classify an image based on the digit written in it. Further, we plan to use language datasets from Yelp reviews [4,5], where the task is to determine the sentiment expressed in a review. For MNIST, the explanations will be visual regions, while for Yelp reviews, the explanations will be text. We plan to conduct a user study to test whether the global explanation produced by our approach is meaningful to a human user. If so, we can conclude that the ML algorithm produces interpretable and consistent explanations. We will use standard metrics such as t-test scores to evaluate the statistical significance of our results. We plan to adopt standard system-testing techniques in the framework.
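The mechanics of the significance testing mentioned above could look like the following. The ratings are invented placeholders standing in for per-user scores from the planned study, and a two-sample t-test from SciPy is assumed as the test; they are not actual results.

```python
# Illustrative significance test: compare hypothetical user ratings of two
# explanation conditions with a two-sample t-test (scipy assumed).
from scipy.stats import ttest_ind

global_scores = [4.1, 3.8, 4.5, 4.0, 4.3, 3.9]  # made-up ratings, condition A
local_scores = [3.2, 3.5, 3.0, 3.4, 3.1, 3.6]   # made-up ratings, condition B

t_stat, p_value = ttest_ind(global_scores, local_scores)
if p_value < 0.05:
    print("difference is statistically significant at the 5% level")
```

A paired test would be the natural variant if each user rates both conditions; the choice depends on the final study design.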



Conflict of Interest

The authors declare no conflict of interest.

