
Datasets

This page lists the datasets used, with Harpia and Mulan, in the experimental evaluation of my PhD thesis.

Hierarchical Classification Datasets

The following datasets are formatted for use with Harpia. Below, we provide a table of dataset statistics, followed by links to the files and the corresponding references. For the datasets that we are not allowed to publish on this web site, we provide contact information for obtaining the source files.


Statistics:
Name           Domain                MLN  Instances  Attributes  Labels  Labels per level  Cardinality
Hglass         glass identification  yes  214        9           6       2, 5, 2           2.68
IOIHC          music                 yes  4188       40          20      2, 9, 9           2.65
Marsyas        music                 yes  4188       30          20      2, 9, 9           2.65
RH             music                 yes  4188       60          20      2, 9, 9           2.65
SSD            music                 yes  4188       168         20      2, 9, 9           2.65
GPCR-Pfam      protein functions     no   7053       75          192     12, 52, 79, 49    2.84
GPCR-Prints    protein functions     no   5404       283         179     8, 46, 76, 49     3.01
GPCR-Prosite   protein functions     no   6246       129         187     9, 50, 79, 49     2.95
GPCR-Interpro  protein functions     no   7444       450         198     12, 54, 82, 50    2.82
EC-Pfam        protein functions     no   13987      708         333     6, 41, 96, 190    3.67
EC-Prints      protein functions     no   14025      382         351     6, 45, 92, 208    3.70
EC-Prosite     protein functions     no   14041      585         324     6, 42, 89, 187    3.69
EC-Interpro    protein functions     no   14027      1216        330     6, 41, 96, 187    3.66
Sequence       gene functions        yes  1680       437         180     4, 22, 70, 84     3.59
Phenotype      gene functions        yes  621        64          168     4, 22, 66, 76     3.59
Cell-cycle     gene functions        yes  1711       78          180     4, 22, 70, 84     3.58
Church         gene functions        yes  1677       24          180     4, 22, 70, 84     3.60
Derisi         gene functions        yes  1661       62          180     4, 22, 70, 84     3.60
Eisen          gene functions        yes  1163       80          170     4, 22, 66, 78     3.56
Exp            gene functions        yes  1688       544         180     4, 22, 70, 84     3.60
Gasch-1        gene functions        yes  1660       174         180     4, 22, 70, 84     3.61
Gasch-2        gene functions        yes  1678       53          180     4, 22, 70, 84     3.60
SPO            gene functions        yes  1649       79          180     4, 22, 70, 84     3.59
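
The "Labels per level" and "Cardinality" columns can be recomputed from the class paths of a dataset. The snippet below is only a minimal sketch under two assumptions that are not stated on this page: that each instance carries a single class path written with a "/" separator (a hypothetical format, chosen for illustration), and that cardinality is the average number of labels on an instance's path. The function name `hierarchy_stats` and the toy paths are illustrative and are not part of Harpia.

```python
from collections import defaultdict


def hierarchy_stats(paths, sep="/"):
    """Summarise single-path hierarchical labels.

    paths: one class path per instance, e.g. "1/2" (hypothetical format).
    Returns (labels_per_level, total_labels, cardinality).
    """
    nodes_per_level = defaultdict(set)
    total_depth = 0
    for path in paths:
        parts = path.split(sep)
        total_depth += len(parts)
        # A node is identified by its full prefix, so "1/2" and "2/2" differ.
        for depth in range(1, len(parts) + 1):
            nodes_per_level[depth].add(sep.join(parts[:depth]))
    labels_per_level = [len(nodes_per_level[d]) for d in sorted(nodes_per_level)]
    # Assumed definition: mean number of labels along each instance's path.
    cardinality = total_depth / len(paths)
    return labels_per_level, sum(labels_per_level), cardinality


# Toy data, not taken from any of the datasets above.
toy_paths = ["1/1", "1/2/1", "2/1", "1/2/2"]
levels, total, card = hierarchy_stats(toy_paths)
print(levels, total, card)
```

On the toy paths this reports two classes at the first level, three at the second and two at the third, with an average path length of 2.5.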



Files and References
  • Hglass
    This dataset was adapted from the flat version available in the UCI repository.
    My hierarchical version, containing all the data, can be found in [Hglass.arff]
    my 5x2 folds split [Hglass-5x2folds.zip]
    reference: Metz, J., Freitas, A. A., Monard, M. C., & Cherman, E. A. A study on the selection of local training sets for hierarchical classification tasks. In ENIA 2011: Anais do VIII Encontro Nacional de Inteligência Artificial, pages 1–12, 2011. [pdf]
  • Music datasets: IOIHC, Marsyas, RH and SSD
    Unfortunately, these datasets are not publicly available yet. However, if you are interested, you should contact Mr. Silla Jr., who kindly gave us a copy to test our methods.
    reference: Silla Jr, C. N., Koerich, A. L., & Kaestner, C. A. A. The Latin music database. In ISMIR 2008: Proceedings of the 9th International Conference on Music Information Retrieval, pages 451–456, 2008.
  • Gene function datasets: sequence, phenotype, cell-cycle, church, derisi, eisen, exp, gasch-1, gasch-2 and SPO
    These datasets describe the fungus Saccharomyces cerevisiae (yeast) and were used in experiments to predict the functional classes of yeast genes. The classes were taken from the MIPS functional catalog, so in the original datasets an instance may be associated with classes in more than one branch of the class taxonomy. The versions used in the experimental evaluation of Harpia's methods, however, were pre-processed so that only single-branch hierarchical classes remain. This pre-processing was carried out by Bruno Cordeiro Paes, who kindly gave us a copy to test our methods.
    Unfortunately, these pre-processed datasets are not publicly available yet. You may contact Mr. Paes to get a copy of them.
    The original multi-way classified datasets are available from Aberystwyth University.

  • GPCR proteins datasets: Pfam, Prints, Prosite and Interpro
    These datasets were pre-processed by Mr. Silla Jr.
    my 5x2 folds split [gpcr-5x2folds.zip]
    reference: Silla Jr, C. N. & Freitas, A. A. A global-model naive bayes approach to the hierarchical prediction of protein functions. In ICDM 2009: Proceedings of the 9th IEEE International Conference on Data Mining, pages 992–997, 2009.
  • EC proteins datasets: Pfam, Prints, Prosite and Interpro
    These datasets were pre-processed by Mr. Silla Jr.
    my 5x2 folds split [ec-5x2folds.zip]
    reference: Silla Jr, C. N. & Freitas, A. A. A global-model naive bayes approach to the hierarchical prediction of protein functions. In ICDM 2009: Proceedings of the 9th IEEE International Conference on Data Mining, pages 992–997, 2009.
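
For readers who want to build a comparable split for their own data, the sketch below shows one common reading of a "5x2 folds" split: five repetitions of 2-fold cross-validation, each repetition using a fresh shuffle. This is an assumption about the scheme, not a description of how the zip files above were actually generated. The function name `five_by_two_folds` and the seed are hypothetical, and the split is not stratified by class.

```python
import random


def five_by_two_folds(n_instances, repetitions=5, seed=0):
    """Return a list of (train_indices, test_indices) pairs.

    Each repetition shuffles the instance indices, cuts them into two
    halves, and uses each half once for training and once for testing.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits = []
    for _ in range(repetitions):
        indices = list(range(n_instances))
        rng.shuffle(indices)
        half = n_instances // 2
        fold_a, fold_b = indices[:half], indices[half:]
        splits.append((fold_a, fold_b))
        splits.append((fold_b, fold_a))
    return splits


# 214 is the number of Hglass instances listed in the statistics table.
splits = five_by_two_folds(214)
print(len(splits))
```

With five repetitions this yields ten train/test pairs; within each pair the two index sets are disjoint and together cover every instance.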


Multilabel Classification Datasets

The following multi-label datasets are formatted for use with Mulan. Below, we provide a table of dataset statistics, followed by links to the files and the corresponding references on the Mulan web site.


Statistics:

Name        Domain   Instances  Attributes  Labels  Cardinality  Density  Distinct
Emotions    music    593        72          6       1.869        0.311    27
Scene       image    2407       294         6       1.074        0.179    15
Slashdot-f  text     3782       1079        22      1.180        0.040    156
Yeast       biology  2417       103         14      4.237        0.303    198
Enron       text     1702       1001        53      3.378        0.064    753
Medical     text     978        1449        45      1.245        0.028    94
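
The last three columns follow the usual multi-label definitions: cardinality is the average number of labels per instance, density is cardinality divided by the number of labels, and "Distinct" counts the distinct label sets that occur in the data. The snippet below is a minimal sketch of these definitions on a toy binary label matrix; the function name `multilabel_stats` and the toy rows are illustrative and are not part of Mulan.

```python
def multilabel_stats(label_rows):
    """Compute (cardinality, density, distinct) for a binary label matrix.

    label_rows: list of rows, one per instance, each a list of 0/1 flags
    with one entry per label.
    """
    n_instances = len(label_rows)
    n_labels = len(label_rows[0])
    # Cardinality: mean number of active labels per instance.
    cardinality = sum(sum(row) for row in label_rows) / n_instances
    # Density: cardinality normalised by the number of labels.
    density = cardinality / n_labels
    # Distinct: number of different label combinations that occur.
    distinct = len({tuple(row) for row in label_rows})
    return cardinality, density, distinct


# Toy data, not taken from any of the datasets above.
toy = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
    [1, 1, 0],
]
cardinality, density, distinct = multilabel_stats(toy)
print(cardinality, density, distinct)
```

As a quick check against the table, Emotions has cardinality 1.869 over 6 labels, which gives a density of about 0.31.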



Files and References
  • Scene
    original files from Mulan web site [scene.rar]
    my 10 folds split [scene-10folds.zip]
    reference: M.R. Boutell, J. Luo, X. Shen, and C.M. Brown. Learning multi-label scene classification. Pattern Recognition, 37(9):1757-1771, 2004.
  • Slashdot-f
    original files can be found in [Meka's web site]
    my 10 folds split [slashdotf-10folds.zip]
    reference: Read, J., Pfahringer, B., & Holmes, G. Multi-label classification using ensembles of pruned sets. In ICDM 2008: Proceedings of the International Conference on Data Mining, pages 995–1000, 2008.
  • Yeast
    original files from Mulan web site [yeast.rar]
    my 10 folds split [yeast-10folds.zip]
    reference: A. Elisseeff and J. Weston. A kernel method for multi-labelled classification. In T.G. Dietterich, S. Becker, and Z. Ghahramani, (eds), Advances in Neural Information Processing Systems 14, 2002.



Attached files (all uploaded by Jean Metz, May 6, 2012, 3:55 PM):
  • Hglass-5x2folds.zip (48k)
  • Hglass.arff (11k)
  • ec-5x2folds.zip (10487k)
  • emotions-10folds.zip (1034k)
  • enron-10folds.zip (1368k)
  • gpcr-5x2folds.zip (2737k)
  • medical-10folds.zip (285k)
  • scene-10folds.zip (11844k)
  • slashdotf-10folds.zip (693k)
  • yeast-10folds.zip (5237k)