Imbalanced binary dataset of protein traits used to predict cellular localization sites.
yeast4
A data frame with 1484 instances, 51 of which belong to the positive class, and 9 variables:
McGeoch's method for signal sequence recognition. Continuous attribute.
Von Heijne's method for signal sequence recognition. Continuous attribute.
Score of the ALOM membrane spanning region prediction program. Continuous attribute.
Score of discriminant analysis of the amino acid content of the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins. Continuous attribute.
Presence of "HDEL" substring (thought to act as a signal for retention in the endoplasmic reticulum lumen). Binary attribute.
Peroxisomal targeting signal in the C-terminus. Continuous attribute.
Score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins. Continuous attribute.
Score of discriminant analysis of nuclear localization signals of nuclear and non-nuclear proteins. Continuous attribute.
Two possible classes: positive (membrane protein, uncleaved signal), negative (rest of localizations).
The original dataset is available in the UCI Machine Learning Repository.
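To make the degree of imbalance concrete, the counts stated in the description (1484 instances, 51 positive) can be turned into an imbalance ratio. The following is a minimal sketch in Python that uses only those two numbers. The variable names are illustrative and are not part of the dataset or of any package.

```python
# Counts taken from the dataset description; nothing is loaded from disk.
n_total = 1484      # all instances
n_positive = 51     # instances in the positive (minority) class

# Everything that is not positive belongs to the negative class.
n_negative = n_total - n_positive

# Imbalance ratio: majority-class size divided by minority-class size.
imbalance_ratio = n_negative / n_positive

# Share of the data that belongs to the positive class.
positive_share = n_positive / n_total

print(f"negative instances: {n_negative}")
print(f"imbalance ratio (negative:positive): {imbalance_ratio:.2f}")
print(f"positive share: {positive_share:.2%}")
```

With these counts there are roughly 28 negative instances for every positive one, so plain accuracy is a poor yardstick for classifiers trained on this data.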