Title: | Selecting Attributes |
---|---|
Description: | Functions for selecting attributes from a given dataset. Attribute subset selection is the process of identifying and removing as much of the irrelevant and redundant information as possible. |
Authors: | Piotr Romanski, Lars Kotthoff, Patrick Schratz |
Maintainer: | Lars Kotthoff <[email protected]> |
License: | GPL-2 |
Version: | 0.34 |
Built: | 2024-11-14 02:58:02 UTC |
Source: | https://github.com/larskotthoff/fselector |
Package containing functions for selecting attributes from a given dataset with respect to a destination (class) attribute.
This package contains:
- Algorithms for filtering attributes: cfs, chi.squared, information.gain, gain.ratio, symmetrical.uncertainty, linear.correlation, rank.correlation, oneR, relief, consistency, random.forest.importance
- Algorithms for wrapping classifiers and searching the attribute subset space: best.first.search, backward.search, forward.search, hill.climbing.search
- Algorithms for choosing a subset of attributes based on attributes' weights: cutoff.k, cutoff.k.percent, cutoff.biggest.diff
- A function for creating formulas: as.simple.formula (a typical workflow combining these groups is sketched after this list)
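A typical workflow combines the three groups: rank attributes with a filter, pick a cutoff, and turn the selection into a formula for a classifier. A minimal sketch on the iris data:

data(iris)
# 1. rank attributes with a filter algorithm
weights <- information.gain(Species ~ ., iris)
# 2. keep the two highest-ranked attributes
subset <- cutoff.k(weights, 2)
# 3. build a formula from the selection
f <- as.simple.formula(subset, "Species")
print(f)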
Piotr Romanski
Maintainer: Lars Kotthoff <[email protected]>
Converts a character vector of attributes' names and the destination attribute's name to a simple formula.
as.simple.formula(attributes, class)
attributes | character vector of attributes' names |
class | name of destination attribute |
A simple formula like "class ~ attr1 + attr2"
Piotr Romanski
data(iris)
result <- cfs(Species ~ ., iris)
f <- as.simple.formula(result, "Species")
An algorithm for searching the attribute subset space.
best.first.search(attributes, eval.fun, max.backtracks = 5)
attributes | a character vector of all attributes to search in |
eval.fun | a function taking as its first parameter a character vector of attributes (the subset to evaluate) and returning a numeric indicating how important the subset is |
max.backtracks | an integer indicating the maximum allowed number of backtracks, default is 5 |
The algorithm is similar to forward.search, except that it chooses the best node from all nodes evaluated so far and expands it. When no better node is found, the selection of the best node is repeated up to max.backtracks times.
A character vector of selected attributes.
Piotr Romanski
forward.search, backward.search, hill.climbing.search, exhaustive.search
library(rpart)
data(iris)
evaluator <- function(subset) {
  # k-fold cross validation
  k <- 5
  splits <- runif(nrow(iris))
  results <- sapply(1:k, function(i) {
    test.idx <- (splits >= (i - 1) / k) & (splits < i / k)
    train.idx <- !test.idx
    test <- iris[test.idx, , drop = FALSE]
    train <- iris[train.idx, , drop = FALSE]
    tree <- rpart(as.simple.formula(subset, "Species"), train)
    error.rate <- sum(test$Species != predict(tree, test, type = "c")) / nrow(test)
    return(1 - error.rate)
  })
  print(subset)
  print(mean(results))
  return(mean(results))
}
subset <- best.first.search(names(iris)[-5], evaluator)
f <- as.simple.formula(subset, "Species")
print(f)
The algorithm finds an attribute subset using correlation and entropy measures for continuous and discrete data.
cfs(formula, data)
formula | a symbolic description of a model |
data | data to process |
The algorithm makes use of best.first.search for searching the attribute subset space.
a character vector containing chosen attributes
Piotr Romanski
data(iris)
subset <- cfs(Species ~ ., iris)
f <- as.simple.formula(subset, "Species")
print(f)
The algorithm finds weights of discrete attributes based on a chi-squared test.
chi.squared(formula, data)
formula | a symbolic description of a model |
data | data to process |
The result is equal to Cramer's V coefficient between each source attribute and the destination attribute (a hand computation for a single attribute is sketched below).
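For a single attribute, the statistic can be checked by hand from a contingency table. A minimal sketch, assuming complete cases (table drops NAs) and no continuity correction, so values may differ slightly from the package's:

cramers.v <- function(x, y) {
  # chi-squared statistic without Yates' correction
  tab <- table(x, y)
  chi2 <- chisq.test(tab, correct = FALSE)$statistic
  # Cramer's V: sqrt(chi2 / (n * (min(rows, cols) - 1)))
  sqrt(as.numeric(chi2) / (sum(tab) * (min(dim(tab)) - 1)))
}
library(mlbench)
data(HouseVotes84)
cramers.v(HouseVotes84$V1, HouseVotes84$Class)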
a data.frame containing the worth of attributes in the first column and their names as row names
Piotr Romanski
library(mlbench)
data(HouseVotes84)
weights <- chi.squared(Class ~ ., HouseVotes84)
print(weights)
subset <- cutoff.k(weights, 5)
f <- as.simple.formula(subset, "Class")
print(f)
The algorithm finds an attribute subset using a consistency measure for continuous and discrete data.
consistency(formula, data)
formula | a symbolic description of a model |
data | data to process |
The algorithm makes use of best.first.search for searching the attribute subset space.
a character vector containing chosen attributes
Piotr Romanski
## Not run:
library(mlbench)
data(HouseVotes84)
subset <- consistency(Class ~ ., HouseVotes84)
f <- as.simple.formula(subset, "Class")
print(f)
## End(Not run)
The algorithm finds weights of continuous attributes based on their correlation with a continuous class attribute.
linear.correlation(formula, data)
rank.correlation(formula, data)
formula | a symbolic description of a model |
data | data to process |
linear.correlation uses Pearson's correlation; rank.correlation uses Spearman's correlation. Rows with NA values are not taken into consideration. (The per-attribute quantity is sketched below.)
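For a single attribute, the weight is essentially the strength of its correlation with the class. A minimal sketch of the underlying quantity, assuming the weight is the absolute correlation (the package's exact post-processing may differ):

library(mlbench)
data(BostonHousing)
# Pearson (linear.correlation) vs. Spearman (rank.correlation) by hand
abs(cor(BostonHousing$rm, BostonHousing$medv, method = "pearson"))
abs(cor(BostonHousing$rm, BostonHousing$medv, method = "spearman"))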
a data.frame containing the worth of attributes in the first column and their names as row names
Piotr Romanski
library(mlbench)
data(BostonHousing)
d <- BostonHousing[-4]  # only numeric variables
weights <- linear.correlation(medv ~ ., d)
print(weights)
subset <- cutoff.k(weights, 3)
f <- as.simple.formula(subset, "medv")
print(f)
weights <- rank.correlation(medv ~ ., d)
print(weights)
subset <- cutoff.k(weights, 3)
f <- as.simple.formula(subset, "medv")
print(f)
The algorithms select a subset of ranked attributes.
cutoff.k(attrs, k)
cutoff.k.percent(attrs, k)
cutoff.biggest.diff(attrs)
attrs | a data.frame containing ranks for attributes in the first column and their names as row names |
k | a positive integer in case of cutoff.k, and a numeric between 0 and 1 in case of cutoff.k.percent |
cutoff.k chooses the k best attributes, cutoff.k.percent chooses the best k * 100% of attributes, and cutoff.biggest.diff chooses a subset of attributes that are significantly better than the rest (a sketch of the biggest-gap idea follows).
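The biggest-difference rule can be pictured as sorting the weights in decreasing order and cutting where the drop between consecutive weights is largest. A minimal sketch of that idea (not necessarily the package's exact criterion):

biggest.diff.sketch <- function(attrs) {
  # sort attribute weights in decreasing order
  ord <- order(attrs[[1]], decreasing = TRUE)
  w <- attrs[[1]][ord]
  # keep everything before the largest drop between consecutive weights
  gaps <- -diff(w)
  rownames(attrs)[ord][seq_len(which.max(gaps))]
}
data(iris)
weights <- information.gain(Species ~ ., iris)
biggest.diff.sketch(weights)  # compare with cutoff.biggest.diff(weights)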
A character vector containing selected attributes.
Piotr Romanski
data(iris)
weights <- information.gain(Species ~ ., iris)
print(weights)
subset <- cutoff.k(weights, 1)
f <- as.simple.formula(subset, "Species")
print(f)
subset <- cutoff.k.percent(weights, 0.75)
f <- as.simple.formula(subset, "Species")
print(f)
subset <- cutoff.biggest.diff(weights)
f <- as.simple.formula(subset, "Species")
print(f)
The algorithms find weights of discrete attributes based on their correlation with a continuous class attribute.
information.gain(formula, data, unit)
gain.ratio(formula, data, unit)
symmetrical.uncertainty(formula, data, unit)
formula | A symbolic description of a model. |
data | Data to process. |
unit | Unit for computing entropy (passed to entropy); default is "log". |
Writing H for (joint) entropy:
information.gain is H(Class) + H(Attribute) - H(Class, Attribute).
gain.ratio is (H(Class) + H(Attribute) - H(Class, Attribute)) / H(Attribute).
symmetrical.uncertainty is 2 * (H(Class) + H(Attribute) - H(Class, Attribute)) / (H(Attribute) + H(Class)).
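For a single discrete attribute these quantities can be checked by hand from relative frequencies. A minimal sketch using natural-log entropy and a crude equal-width discretization (the package uses its own discretization, so values will differ):

H <- function(x) {
  # empirical entropy of a discrete variable (natural log)
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log(p))
}
data(iris)
cl <- iris$Species
attr <- cut(iris$Petal.Length, 5)  # crude discretization for illustration
ig <- H(cl) + H(attr) - H(paste(cl, attr))  # information gain
ig / H(attr)  # gain ratio
2 * ig / (H(attr) + H(cl))  # symmetrical uncertainty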
a data.frame containing the worth of attributes in the first column and their names as row names
Piotr Romanski, Lars Kotthoff
data(iris)
weights <- information.gain(Species ~ ., iris)
print(weights)
subset <- cutoff.k(weights, 2)
f <- as.simple.formula(subset, "Species")
print(f)
weights <- information.gain(Species ~ ., iris, unit = "log2")
print(weights)
weights <- gain.ratio(Species ~ ., iris)
print(weights)
subset <- cutoff.k(weights, 2)
f <- as.simple.formula(subset, "Species")
print(f)
weights <- symmetrical.uncertainty(Species ~ ., iris)
print(weights)
subset <- cutoff.biggest.diff(weights)
f <- as.simple.formula(subset, "Species")
print(f)
An algorithm for searching the attribute subset space.
exhaustive.search(attributes, eval.fun)
attributes | a character vector of all attributes to search in |
eval.fun | a function taking as its first parameter a character vector of attributes (the subset to evaluate) and returning a numeric indicating how important the subset is |
The algorithm searches the whole attribute subset space in breadth-first order. Note that the space contains 2^n subsets of n attributes, so the search is only feasible for small attribute sets.
A character vector of selected attributes.
Piotr Romanski
forward.search, backward.search, best.first.search, hill.climbing.search
library(rpart)
data(iris)
evaluator <- function(subset) {
  # k-fold cross validation
  k <- 5
  splits <- runif(nrow(iris))
  results <- sapply(1:k, function(i) {
    test.idx <- (splits >= (i - 1) / k) & (splits < i / k)
    train.idx <- !test.idx
    test <- iris[test.idx, , drop = FALSE]
    train <- iris[train.idx, , drop = FALSE]
    tree <- rpart(as.simple.formula(subset, "Species"), train)
    error.rate <- sum(test$Species != predict(tree, test, type = "c")) / nrow(test)
    return(1 - error.rate)
  })
  print(subset)
  print(mean(results))
  return(mean(results))
}
subset <- exhaustive.search(names(iris)[-5], evaluator)
f <- as.simple.formula(subset, "Species")
print(f)
Algorithms for searching the attribute subset space.
backward.search(attributes, eval.fun)
forward.search(attributes, eval.fun)
attributes | a character vector of all attributes to search in |
eval.fun | a function taking as its first parameter a character vector of attributes (the subset to evaluate) and returning a numeric indicating how important the subset is |
These algorithms implement greedy search. At first, the algorithm expands the starting node, evaluates its children, and chooses the best one, which becomes the new starting node. This process goes only in one direction: forward.search starts from an empty set of attributes and backward.search from the full set (a minimal sketch of the forward variant follows).
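The greedy scheme is short enough to state directly. A minimal sketch of forward selection (illustrative, not the package's implementation):

forward.sketch <- function(attributes, eval.fun) {
  selected <- character(0)
  best <- -Inf
  repeat {
    remaining <- setdiff(attributes, selected)
    if (length(remaining) == 0) break
    # evaluate every child of the current node
    scores <- sapply(remaining, function(a) eval.fun(c(selected, a)))
    if (max(scores) <= best) break  # no child improves the current node
    best <- max(scores)
    selected <- c(selected, remaining[which.max(scores)])
  }
  selected
}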
A character vector of selected attributes.
Piotr Romanski
best.first.search, hill.climbing.search, exhaustive.search
library(rpart)
data(iris)
evaluator <- function(subset) {
  # k-fold cross validation
  k <- 5
  splits <- runif(nrow(iris))
  results <- sapply(1:k, function(i) {
    test.idx <- (splits >= (i - 1) / k) & (splits < i / k)
    train.idx <- !test.idx
    test <- iris[test.idx, , drop = FALSE]
    train <- iris[train.idx, , drop = FALSE]
    tree <- rpart(as.simple.formula(subset, "Species"), train)
    error.rate <- sum(test$Species != predict(tree, test, type = "c")) / nrow(test)
    return(1 - error.rate)
  })
  print(subset)
  print(mean(results))
  return(mean(results))
}
subset <- forward.search(names(iris)[-5], evaluator)
f <- as.simple.formula(subset, "Species")
print(f)
An algorithm for searching the attribute subset space.
hill.climbing.search(attributes, eval.fun)
attributes | a character vector of all attributes to search in |
eval.fun | a function taking as its first parameter a character vector of attributes (the subset to evaluate) and returning a numeric indicating how important the subset is |
The algorithm starts with a random attribute set. It then evaluates all of the set's neighbours and moves to the best one, repeating until no neighbour improves the score. It is susceptible to local maxima (a minimal sketch follows).
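A minimal sketch of the idea (illustrative, not the package's implementation):

hill.climbing.sketch <- function(attributes, eval.fun) {
  # random, non-empty starting subset
  current <- attributes[runif(length(attributes)) > 0.5]
  if (length(current) == 0) current <- sample(attributes, 1)
  best <- eval.fun(current)
  repeat {
    # neighbours: subsets differing by exactly one attribute (empty set excluded)
    neighbours <- Filter(length, lapply(attributes, function(a)
      if (a %in% current) setdiff(current, a) else c(current, a)))
    scores <- sapply(neighbours, eval.fun)
    if (max(scores) <= best) return(current)  # local maximum reached
    best <- max(scores)
    current <- neighbours[[which.max(scores)]]
  }
}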
A character vector of selected attributes.
Piotr Romanski
forward.search, backward.search, best.first.search, exhaustive.search
library(rpart)
data(iris)
evaluator <- function(subset) {
  # k-fold cross validation
  k <- 5
  splits <- runif(nrow(iris))
  results <- sapply(1:k, function(i) {
    test.idx <- (splits >= (i - 1) / k) & (splits < i / k)
    train.idx <- !test.idx
    test <- iris[test.idx, , drop = FALSE]
    train <- iris[train.idx, , drop = FALSE]
    tree <- rpart(as.simple.formula(subset, "Species"), train)
    error.rate <- sum(test$Species != predict(tree, test, type = "c")) / nrow(test)
    return(1 - error.rate)
  })
  print(subset)
  print(mean(results))
  return(mean(results))
}
subset <- hill.climbing.search(names(iris)[-5], evaluator)
f <- as.simple.formula(subset, "Species")
print(f)
The algorithm finds weights of discrete attributes based on very simple association rules involving only one attribute in the condition part.
oneR(formula, data)
formula | a symbolic description of a model |
data | data to process |
The algorithm uses the OneR classifier to find the attributes' weights. For each attribute it creates a simple rule based only on that attribute and then calculates its error rate (a sketch of this computation follows).
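The per-attribute rule simply predicts the majority class within each attribute value. A minimal sketch of the resulting accuracy for one attribute (illustrative only):

oneR.accuracy <- function(attr, class) {
  # contingency table of attribute values vs. class (NAs dropped)
  tab <- table(attr, class)
  # the rule predicts the majority class in each row,
  # so correct predictions are the row maxima
  sum(apply(tab, 1, max)) / sum(tab)
}
library(mlbench)
data(HouseVotes84)
oneR.accuracy(HouseVotes84$V1, HouseVotes84$Class)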
a data.frame containing the worth of attributes in the first column and their names as row names
Piotr Romanski
library(mlbench)
data(HouseVotes84)
weights <- oneR(Class ~ ., HouseVotes84)
print(weights)
subset <- cutoff.k(weights, 5)
f <- as.simple.formula(subset, "Class")
print(f)
The algorithm finds weights of attributes using random forests.
random.forest.importance(formula, data, importance.type = 1)
formula | a symbolic description of a model |
data | data to process |
importance.type | either 1 or 2, specifying the type of importance measure (1 = mean decrease in accuracy, 2 = mean decrease in node impurity) |
This is a wrapper for importance from the randomForest package (a direct equivalent is sketched below).
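The same numbers can be obtained from randomForest directly; a minimal sketch (results vary with the forest's randomness):

library(randomForest)
library(mlbench)
data(HouseVotes84)
# fit a forest with importance tracking enabled, then extract
# type 1 importance (mean decrease in accuracy)
rf <- randomForest(Class ~ ., HouseVotes84, importance = TRUE, na.action = na.omit)
importance(rf, type = 1)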
a data.frame containing the worth of attributes in the first column and their names as row names
Piotr Romanski
library(mlbench)
data(HouseVotes84)
weights <- random.forest.importance(Class ~ ., HouseVotes84, importance.type = 1)
print(weights)
subset <- cutoff.k(weights, 5)
f <- as.simple.formula(subset, "Class")
print(f)
The algorithm finds weights of continuous and discrete attributes based on a distance between instances.
relief(formula, data, neighbours.count = 5, sample.size = 10)
formula | a symbolic description of a model |
data | data to process |
neighbours.count | number of neighbours to find for every sampled instance |
sample.size | number of instances to sample |
The algorithm samples instances and finds their nearest hits (nearest neighbours of the same class) and misses (nearest neighbours of a different class). Based on these, it estimates the weights of the attributes (a simplified sketch follows).
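A simplified single-neighbour sketch for numeric attributes (illustrative, not the package's implementation): each sampled instance pushes weights down by its distance to the nearest hit and up by its distance to the nearest miss.

relief.sketch <- function(data, class, sample.size = 10) {
  x <- scale(as.matrix(data))  # normalize attribute ranges
  w <- setNames(numeric(ncol(x)), colnames(x))
  for (i in sample(nrow(x), sample.size)) {
    # squared Euclidean distance from instance i to all others
    d <- rowSums((x - matrix(x[i, ], nrow(x), ncol(x), byrow = TRUE))^2)
    d[i] <- Inf
    hit <- which.min(ifelse(class == class[i], d, Inf))   # nearest same class
    miss <- which.min(ifelse(class != class[i], d, Inf))  # nearest other class
    w <- w - abs(x[i, ] - x[hit, ]) / sample.size +
      abs(x[i, ] - x[miss, ]) / sample.size
  }
  w
}
data(iris)
relief.sketch(iris[-5], iris$Species)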
a data.frame containing the worth of attributes in the first column and their names as row names
Piotr Romanski
- Igor Kononenko: Estimating Attributes: Analysis and Extensions of RELIEF. In: European Conference on Machine Learning, 171-182, 1994.
- Marko Robnik-Sikonja, Igor Kononenko: An adaptation of Relief for attribute estimation in regression. In: Fourteenth International Conference on Machine Learning, 296-304, 1997.
data(iris)
weights <- relief(Species ~ ., iris, neighbours.count = 5, sample.size = 20)
print(weights)
subset <- cutoff.k(weights, 2)
f <- as.simple.formula(subset, "Species")
print(f)