Imbalanced binary classification dataset of protein traits, used to predict the cellular localization site of each protein.

yeast4

Format

A data frame with 1484 instances, 51 of which belong to the positive class, and 9 variables:

Mcg

McGeoch's method for signal sequence recognition. Continuous attribute.

Gvh

Von Heijne's method for signal sequence recognition. Continuous attribute.

Alm

Score of the ALOM membrane spanning region prediction program. Continuous attribute.

Mit

Score of discriminant analysis of the amino acid content of the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins. Continuous attribute.

Erl

Presence of "HDEL" substring (thought to act as a signal for retention in the endoplasmic reticulum lumen). Binary (discrete) attribute.

Pox

Peroxisomal targeting signal in the C-terminus. Continuous attribute.

Vac

Score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins. Continuous attribute.

Nuc

Score of discriminant analysis of nuclear localization signals of nuclear and non-nuclear proteins. Continuous attribute.

Class

Two possible classes: positive (membrane protein, uncleaved signal) and negative (all other localization sites).

Source

KEEL Repository.

See also

The original dataset is available in the UCI ML Repository.
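
Examples

The class counts given under Format imply a strong imbalance. The sketch below is only an illustration of that arithmetic: it rebuilds a stand-in `Class` column from the documented counts (1484 instances, 51 positive) rather than loading the real data, and the variable names `n_total`, `n_positive`, `class_column`, `imbalance_ratio` and `positive_share` are hypothetical. With the actual data frame loaded, and assuming it is exposed as `yeast4` with a `Class` column as described above, `yeast4$Class` could take the place of the reconstructed vector.

```r
# Counts taken from the Format description: 1484 instances, 51 positive.
n_total    <- 1484
n_positive <- 51
n_negative <- n_total - n_positive  # 1433

# Stand-in for the Class column only (assumption: the real column uses the
# labels "positive" and "negative"); the predictors are not needed here.
class_column <- factor(
  rep(c("positive", "negative"), times = c(n_positive, n_negative)),
  levels = c("negative", "positive")
)

# Imbalance ratio: number of majority (negative) instances per minority
# (positive) instance.
imbalance_ratio <- sum(class_column == "negative") / sum(class_column == "positive")

# Share of the positive class in the whole dataset.
positive_share <- mean(class_column == "positive")

print(table(class_column))
cat("Imbalance ratio:", round(imbalance_ratio, 1), "\n")             # about 28.1
cat("Positive share (%):", round(100 * positive_share, 2), "\n")     # about 3.44
```

A ratio of roughly 28 negatives per positive means that a classifier predicting "negative" for every instance would already reach about 96.6% accuracy, so accuracy alone is a poor measure on this dataset.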