Datasets

This page lists the datasets used in the experimental evaluation of my PhD thesis, which was carried out with Harpia and Mulan.

Hierarchical Classification Datasets

The following datasets are properly formatted for use with Harpia. Below, we provide a table of dataset statistics, followed by links to the files and references. For the datasets that we are not allowed to publish on this website, we provide contact information for obtaining the source files.

Statistics:

Files and References

  • Hglass
    • This dataset was adapted from the flat version available in the UCI repository.
    • My hierarchical version, containing all the data, can be found in [Hglass.arff]
    • my 5x2 folds split [Hglass-5x2folds.zip]
    • reference: Metz, J., Freitas, A. A., Monard, M. C., & Cherman, E. A. A study on the selection of local training sets for hierarchical classification tasks. In ENIA 2011: Anais do VIII Encontro Nacional de Inteligência Artificial, pages 1–12, 2011. [pdf]
  • Music datasets: IOIHC, Marsyas, RH and SSD
    • Unfortunately, these datasets are not publicly available yet. However, if you are interested, you may contact Mr. Silla Jr., who kindly gave us a copy to test our methods.
    • reference: Silla Jr, C. N., Koerich, A. L., & Kaestner, C. A. A. The latin music database. In ISMIR 2008: Proceedings of 9th International Conference on Music Information Retrieval, pages 451–456, 2008.
  • Gene function datasets: sequence, phenotype, cell-cycle, church, derisi, eisen, exp, gasch-1, gasch-2 and SPO
    • These datasets describe the yeast Saccharomyces cerevisiae and were used in experiments to predict the functional classes of yeast genes. The classes were taken from the MIPS functional catalog, so in their original form instances may be associated with classes in more than one branch of the class taxonomy. The versions used in the experimental evaluation of Harpia's methods, however, were pre-processed so that each instance is associated with classes in a single branch only. This pre-processing was carried out by Bruno Cordeiro Paes, who kindly gave us a copy to test our methods.
    • Unfortunately these pre-processed datasets are not publicly available yet. You may contact Mr. Paes ( ) to get a copy of these datasets.
    • Those interested can also obtain the original datasets, with multi-branch class assignments, from Aberystwyth University.
  • GPCR proteins datasets: Pfam, Prints, Prosite and Interpro
    • These datasets were pre-processed by Mr. Silla Jr.
    • my 5x2 folds split [gpcr-5x2folds.zip]
    • reference: Silla Jr, C. N. & Freitas, A. A. A global-model naive bayes approach to the hierarchical prediction of protein functions. In ICDM 2009: Proceedings of the 9th IEEE International Conference on Data Mining, pages 992–997, 2009.
  • EC proteins datasets: Pfam, Prints, Prosite and Interpro
    • These datasets were pre-processed by Mr. Silla Jr.
    • my 5x2 folds split [ec-5x2folds.zip]
    • reference: Silla Jr, C. N. & Freitas, A. A. A global-model naive bayes approach to the hierarchical prediction of protein functions. In ICDM 2009: Proceedings of the 9th IEEE International Conference on Data Mining, pages 992–997, 2009.
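
For readers unfamiliar with the term, a 5x2 folds split is usually read as five repetitions of 2-fold cross-validation, which gives ten train/test pairs per dataset. This page does not say how the published splits were generated (for instance, whether they were stratified by class), so the sketch below only assumes plain random halving. The function name five_by_two_splits and the seed are hypothetical and are not part of Harpia.

```python
import random


def five_by_two_splits(n_instances, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_indices, test_indices) tuples.

    Assumption: each repetition shuffles the instance indices and cuts
    them into two halves; no stratification by class is attempted.
    """
    rng = random.Random(seed)
    indices = list(range(n_instances))
    half = n_instances // 2
    for repeat in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        first, second = shuffled[:half], shuffled[half:]
        # Each half serves once as the training set and once as the test set.
        yield repeat, 0, first, second
        yield repeat, 1, second, first


# Example with a toy dataset of 10 instances.
splits = list(five_by_two_splits(10))
for repeat, fold, train, test in splits:
    print(f"repeat {repeat}, fold {fold}: {len(train)} train / {len(test)} test")
```

The files in the zip archives above remain the reference splits; the sketch is only meant to show the shape of the procedure.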

Multilabel Classification Datasets

The following multi-label datasets are properly formatted for use with Mulan. Below, we provide a table of dataset statistics, followed by links to the files and to the references on the Mulan website.

Statistics:

Files and References

  • Emotions
    • original files from Mulan web site [emotions.rar]
    • my 10 folds split [emotions-10folds.zip]
    • reference: K. Trohidis, G. Tsoumakas, G. Kalliris, I. Vlahavas. "Multilabel Classification of Music into Emotions". Proc. 2008 International Conference on Music Information Retrieval (ISMIR 2008), pp. 325-330, Philadelphia, PA, USA, 2008.
  • Scene
    • original files from Mulan web site [scene.rar]
    • my 10 folds split [scene-10folds.zip]
    • reference: M.R. Boutell, J. Luo, X. Shen, and C.M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757-1771, 2004.
  • Slashdot-f
    • original files can be found on [Meka's web site]
    • my 10 folds split [slashdotf-10folds.zip]
    • reference: Read, J., Pfahringer, B., & Holmes, G. Multi-label classification using ensembles of pruned sets. In ICDM 2008: Proceedings of International Conference on Data Mining, pages 995–1000, 2008.
  • Yeast
    • original files from Mulan web site [yeast.rar]
    • my 10 folds split [yeast-10folds.zip]
    • reference: A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In T.G. Dietterich, S. Becker, and Z. Ghahramani, (eds), Advances in Neural Information Processing Systems 14, 2002.
  • Enron
  • Medical