Executive Secretary

II Conferencia Internacional de Procesamiento de la Información

CIPI - IOTAI2019

A Quality Measure for Multi-label Datasets on the Apache Spark Framework

In the last years, the amounts of data have increased considerably and therefore, it is becoming more complex to handle these volumes of information. Measuring the data quality is a pivotal aspect to assess the classifier's discriminatory power as the classifiers accuracy heavily depends on the data used to build the model. Multi-label classification is one specific type of classification problem, which has generated an increasing interest in recent years. However, there are no quality measures for multi-label datasets implemented in cluster computing frameworks to evaluate large datasets. This work aims to implement a measure of data quality for multi-label datasets based on Granular Computing under the Apache Spark framework. As a result, it was possible to calculate the values of the quality measure for the datasets, and even in relatively short times.