A Quality Measure for Multi-label Datasets on the Apache Spark Framework

Proposed by Lic. Ricardo Sánchez Alba

Abstract

In recent years, the amount of available data has increased considerably, and handling these volumes of information has become correspondingly more complex. Measuring data quality is pivotal for assessing a classifier's discriminatory power, because the classifier's accuracy depends heavily on the data used to build the model. Multi-label classification is a specific type of classification problem that has attracted increasing interest in recent years. However, no quality measures for multi-label datasets are implemented in cluster computing frameworks for evaluating large datasets. This work implements a data quality measure for multi-label datasets, based on Granular Computing, on the Apache Spark framework. As a result, it was possible to compute the quality measure for the datasets considered, and to do so in relatively short times.

Speaker

Lic. Ricardo Sánchez Alba

UCLV

Practical information

Not defined
30 minutes
Not defined

Authors

  • Carlos Morell
  • Koen Vanhoof
  • Marilyn Bello
  • Rafael Bello
  • Lic. Ricardo Sánchez Alba

Keywords