DecisionTree (version 2.2.2, 2014-May-3)
DecisionTree.py
Version: 2.2.2
Author: Avinash Kak (kak@purdue.edu)
Date: 2014-May-3
CHANGES:
Version 2.2.2:
In response to requests from users, this version includes scripts in
the Examples directory that demonstrate how to carry out bulk
classification of all your test data records placed in a CSV file in
one fell swoop. Also included are scripts that demonstrate the same
for the data records placed in the old-style `.dat' files. The main
module code remains unchanged.
Version 2.2.1:
The changes made are all in the part of the module that is used for
evaluating the quality of training data through a 10-fold cross
validation test. The previous version used the default values for the
constructor parameters when constructing the decision trees in each
iteration of the test. The new version correctly uses the user-supplied
values.
Version 2.2:
This version fixes a bug discovered in the best feature calculator
function. This bug was triggered by certain conditions related to the
distribution of values for the features in a training data file.
Additionally, and VERY IMPORTANTLY, Version 2.2 allows you to test the
quality of your training data by running a 10-fold cross-validation
test on the data. This test divides all of the training data into ten
parts, with nine parts used for training a decision tree and one part
used for testing its ability to classify correctly. This selection of
nine parts for training and one part for testing is carried out in all
of the ten different possible ways. This testing functionality in
Version 2.2 can also be used to find the best values to use for the
constructor parameters entropy_threshold, max_depth_desired, and
symbolic_to_numeric_cardinality_threshold.
Version 2.1:
This is a cleaned up version of v. 2.0 of the module. Should run more
efficiently for large training data files that contain both numeric and
symbolic features.
Version 2.0:
This was a major rewrite of the DecisionTree module. This revision was
prompted by a number of users wanting to see numeric features
incorporated in the construction of decision trees. So here it is!
This version allows you to use either purely symbolic features, or
purely numeric features, or a mixture of the two. (A feature is numeric
if it can take any floating-point value over an interval.)
Version 1.7.1:
This version includes a fix for a bug that was triggered by certain
comment words in a training data file. This version also includes
additional safety checks that are useful for catching errors and
inconsistencies in large training data files that do not lend
themselves to manual checking for correctness. As an example, the new
version makes sure that the number of values you declare in each sample
record matches the number of features declared at the beginning of the
training data file.
Version 1.7:
This version includes safety checks on the consistency of the data you
place in your training data file. When a training data file contains
thousands of records, it is difficult to manually check that you used
the same class names in your sample records that you declared at the
top of your training file or that the values you have for your features
are legal vis-a-vis the earlier declarations regarding such values in
the training file. Another safety feature incorporated in this version
is the non-consideration of classes that are declared at the top of the
training file but that have no sample records in the file.
Version 1.6.1:
Fixed a bug in the method that generates synthetic test data.
Version 1.6:
This version includes several upgrades: The module now includes code
for generating synthetic training and test data for experimenting with
the DecisionTree classifier. Another upgrade in the new version is
that, after training, a decision tree can now be used in an interactive
mode in which the user is asked to supply answers for the feature tests
at the nodes as the classification process descends down the tree.
Version 1.5:
This is a Python 3.x compliant version of the DecisionTree module.
This version should work with both Python 2.x and Python 3.x.
Version 1.0:
This is a Python implementation of the author's Perl module
Algorithm::DecisionTree, Version 1.41. The Python version should work
faster for large decision trees since it uses probability and entropy
caching much more extensively than Version 1.41 of the Perl module.
(Note: I expect my next release of the Perl module to catch up with
this Python version in terms of performance.)
USAGE:
If your training data includes numeric features (a feature is numeric
if it can take any floating point value over an interval), you are
expected to supply your training data through a CSV file and your call
for constructing an instance of the DecisionTree class will look like:
training_datafile = "stage3cancer.csv"
dt = DecisionTree.DecisionTree(
training_datafile = training_datafile,
csv_class_column_index = 2,
csv_columns_for_features = [3,4,5,6,7,8],
entropy_threshold = 0.01,
max_depth_desired = 8,
symbolic_to_numeric_cardinality_threshold = 10,
)
The constructor option `csv_class_column_index' informs the module as
to which column of your CSV file contains the class label. THE COLUMN
INDEXING IS ZERO BASED. The constructor option
`csv_columns_for_features' specifies which columns are to be used for
feature values. The first row of the CSV file must specify the names
of the features. See examples of CSV files in the `examples'
subdirectory.
The option `symbolic_to_numeric_cardinality_threshold' is also
important. For the example shown above, if an ostensibly numeric
feature takes on only 10 or fewer different values in your training
datafile, it will be treated like a symbolic feature. The option
`entropy_threshold' determines the granularity with which the entropies
are sampled for the purpose of calculating entropy gain with a
particular choice of decision threshold for a numeric feature or a
feature value for a symbolic feature.
After you have constructed an instance of the DecisionTree class, you
read in the training data file and initialize the probability cache by
calling:
dt.get_training_data()
dt.calculate_first_order_probabilities()
dt.calculate_class_priors()
Next you construct a decision tree for your training data by calling:
root_node = dt.construct_decision_tree_classifier()
where root_node is an instance of the DTNode class that is also defined
in the module file. Now you are ready to classify a new data record.
Let's say that your data record looks like:
test_sample = ['g2 = 4.2',
'grade = 2.3',
'gleason = 4',
'eet = 1.7',
'age = 55.0',
'ploidy = diploid']
You can classify it by calling:
classification = dt.classify(root_node, test_sample)
The call to `classify()' returns a dictionary whose keys are the class
names and whose values are the associated classification probabilities.
This dictionary also includes another key-value pair for the solution
path from the root node to the leaf node at which the final
classification was carried out.
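As an illustration, here is a hedged sketch of how you might examine the
returned dictionary. The key name 'solution_path' used below for the
solution-path entry is an assumption; check your version of the module
(or simply print the dictionary) to confirm the exact key:
    classification = dt.classify(root_node, test_sample)
    # Pull out the solution path, if present, so that only the class
    # probabilities remain.  The key name 'solution_path' is assumed here.
    solution_path = classification.pop('solution_path', None)
    for class_name in sorted(classification):
        print("%-20s %s" % (class_name, classification[class_name]))
    print("Solution path: %s" % str(solution_path))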
If your features are purely symbolic, you can continue to use the same
constructor syntax that was used in the older versions of this module.
However, your old `.dat' training files will not work with the new
version. The good news is that with just a small fix, you can continue
to use them. The fix and why it was needed is described in the file
README_for_dat_files in the `examples' directory. If you are going to
use a `.dat' file for supplying the training data, your constructor
syntax is likely to look like:
training_datafile = "training.dat"
dt = DecisionTree.DecisionTree(
training_datafile = training_datafile,
entropy_threshold = 0.01,
max_depth_desired = 5,
)
You'd still need to make the following calls for reading in the
training data, for initializing the probability cache, and for
constructing the decision tree:
dt.get_training_data()
dt.calculate_first_order_probabilities()
dt.calculate_class_priors()
root_node = dt.construct_decision_tree_classifier()
Now your test sample is likely to look like:
test_sample = ['exercising=never',
'smoking=heavy',
'fatIntake=heavy',
'videoAddiction=heavy']
You'd now call the classifier as before:
classification = dt.classify(root_node, test_sample)
A decision tree can quickly become much too large (and much too slow to
construct and to yield classification results) if the total number of
features is large and/or if the number of different possible values for
the symbolic features is large. You can control the size of the tree
through the constructor options `entropy_threshold' and
`max_depth_desired'. The latter option sets the maximum depth of your
decision tree to the value of max_depth_desired.  The parameter
`entropy_threshold' sets the granularity with which the entropies are
sampled. Its default value is 0.001. The larger the value you choose
for entropy_threshold, the smaller the tree.
INTRODUCTION:
DecisionTree is a Python module for constructing a decision tree from a
training data file containing multidimensional data in the form of a
table. In one form or another, decision trees have been around for over
fifty years. From a statistical perspective, they are closely related
to classification and regression by recursive partitioning of
multidimensional data. Early work that demonstrated the usefulness of
such partitioning for classification and regression can be traced, in
the statistics community, to the work of Terry Therneau in the early
1980's and, in the machine learning community, to the work of Ross
Quinlan in the mid 1990's.
For those not familiar with decision tree ideas, the traditional way to
classify multidimensional data is to start with a feature space whose
dimensionality is the same as that of the data. Each feature measures
a specific attribute of an entity. You use the training data to carve
up the feature space into different regions, each corresponding to a
different class. Subsequently, when you try to classify a new data
sample, you locate it in the feature space and find the class label of
the region to which it belongs. One can also give the data point the
same class label as that of the nearest training sample. This is
referred to as the nearest neighbor classification. There exist
hundreds of variations of varying power on this basic approach to the
classification of multidimensional data.
A decision tree classifier works differently. When you construct a
decision tree, you select for the root node a feature test that
partitions the training data in a way that causes maximal
disambiguation of the class labels associated with the data. In terms
of information content as measured by entropy, such a feature test
would cause maximum reduction in class entropy in going from all of the
training data taken together to the data as partitioned by the feature
test. You then drop from the root node a set of child nodes, one for
each partition of the training data created by the feature test at the
root node. When your features are purely symbolic, you'll have one
child node for each value of the feature chosen for the feature test at
the root. When the test at the root involves a numeric feature, you
find the decision threshold for the feature that best bipartitions the
data and you drop from the root node two child nodes, one for each
partition. Now at each child node you pose the same question that you
posed when you found the best feature to use at the root: Which feature
at the child node in question would maximally disambiguate the class
labels associated with the training data corresponding to that child
node?
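For readers who would like to see this entropy reasoning in concrete
terms, here is a small self-contained sketch (it is not the module's
internal code) that computes the class entropy before a candidate feature
test and the weighted average entropy after the resulting partition; the
feature test yielding the largest reduction would be the one chosen:
    import math

    def class_entropy(labels):
        # Shannon entropy (in bits) of a list of class labels.
        counts = {}
        for label in labels:
            counts[label] = counts.get(label, 0) + 1
        total = float(len(labels))
        return -sum((c/total) * math.log(c/total, 2) for c in counts.values())

    # Class labels for all the training samples taken together:
    all_labels = ['buy', 'buy', 'buy', 'sell', 'sell', 'sell']
    # The same samples as partitioned by a candidate feature test:
    partitions = [['buy', 'buy', 'buy', 'sell'], ['sell', 'sell']]

    before = class_entropy(all_labels)
    after  = sum(len(p)/float(len(all_labels)) * class_entropy(p) for p in partitions)
    print("entropy before the split:        %.3f" % before)
    print("average entropy after the split: %.3f" % after)
    print("entropy reduction:               %.3f" % (before - after))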
As the reader would expect, the two key steps in any approach to
decision-tree based classification are the construction of the decision
tree itself from a file containing the training data, and then using
the decision tree thus obtained for classifying new data.
What is cool about decision tree classification is that it gives you
soft classification, meaning it may associate more than one class label
with a given data record. When this happens, it may mean that your
classes are indeed overlapping in the underlying feature space. It
could also mean that you simply have not supplied sufficient training
data to the decision tree classifier. For a tutorial introduction to
how a decision tree is constructed and used, see
https://engineering.purdue.edu/kak/Tutorials/DecisionTreeClassifiers.pdf
WHAT PRACTICAL PROBLEM IS SOLVED BY THIS MODULE?
If you are new to the concept of a decision tree, its practical
utility is best understood with an example that only involves symbolic
features. However, as mentioned earlier, versions 2.0 and higher of
this module handle both symbolic and numeric features.
Consider the following scenario: Let's say you are running a small
investment company that employs a team of stockbrokers who make
buy/sell decisions for the customers of your company. Assume that your
company has asked the traders to make each investment decision on the
basis of the following five criteria:
price_to_earnings_ratio (P_to_E)
price_to_sales_ratio (P_to_S)
return_on_equity (R_on_E)
market_share (M_S)
sentiment (S)
Since you are the boss, you keep track of the buy/sell decisions made
by the individual traders. But one unfortunate day, all of your
traders decide to quit because you did not pay them enough. So what
are you to do? If you had a module like the one here, you could still
run your company and do so in such a way that your company would, on
the average, perform better than any of the individual traders who
worked for you previously. This is what you would need to do: You
would pool together the individual trader buy/sell decisions you
accumulated during the past year.  This pooled information is
likely to look like:
example buy/sell P_to_E P_to_S R_on_E M_S S
====================================================================
example_1 buy high low medium low high
example_2 buy medium medium low low medium
example_3 sell low medium low high low
....
....
This data would constitute your training file. Assuming that this training
file is called 'training.dat', you would need to feed this file
into the module by calling:
dt = DecisionTree( training_datafile = "training.dat" )
dt.get_training_data()
dt.calculate_first_order_probabilities()
dt.calculate_class_priors()
Subsequently, you would construct a decision tree by calling:
root_node = dt.construct_decision_tree_classifier()
Now you and your company (with practically no employees) are ready to
service the customers again. Suppose your computer needs to make a
buy/sell decision about an investment prospect that is best described
by:
price_to_earnings_ratio = low
price_to_sales_ratio = very_low
return_on_equity = none
market_share = medium
sentiment = low
All that your computer would need to do would be to construct a data
record like
test_case = [ 'P_to_E=low',
'P_to_S=very_low',
'R_on_E=none',
'M_S=medium',
'S=low' ]
and call the decision tree classifier you just constructed by
classification = dt.classify(root_node, test_case)
print "Classification: ", classification
The answer returned will be 'buy' and 'sell', along with the associated
probabilities. So if the probability of 'buy' is considerably greater
than the probability of 'sell', that's what you should instruct your
computer to do.
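A minimal sketch of that last step is shown below.  It assumes that the
values in the returned dictionary are probabilities (possibly as numeric
strings) and that the solution-path entry, if present, is stored under a
key named 'solution_path':
    classification = dt.classify(root_node, test_case)
    classification.pop('solution_path', None)   # keep only the class probabilities
    decision = max(classification, key=lambda c: float(classification[c]))
    print("Recommended action: %s" % decision)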
The chances are that, on the average, this approach would beat the
performance of any of your individual traders who worked for you
previously since the buy/sell decisions made by the computer would be
based on the collective wisdom of all your previous traders.
DISCLAIMER: There is obviously a lot more to good investing than what
is captured by the silly little example here. However, it does
convey the sense in which the current module can be used.
SYMBOLIC FEATURES VERSUS NUMERIC FEATURES
A feature is symbolic when its values are compared using string
comparison operators. By the same token, a feature is numeric when its
values are compared using numeric comparison operators. Having said
that, features that take only a small number of numeric values in
the training data can be treated symbolically provided you are careful
about handling their values in the test data. At the least, you have to
set the test data value for such a feature to its closest value in the
training data.
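A minimal sketch of that snapping step is shown below; the feature name,
the training values, and the variable names are all hypothetical and not
part of the module's API:
    # Values that the feature 'gleason' took in the training data (hypothetical):
    training_values_for_gleason = [3.0, 4.0, 5.0, 6.0]
    test_value = 4.3
    # Snap the test value to its closest value in the training data:
    closest = min(training_values_for_gleason, key=lambda v: abs(v - test_value))
    test_sample_entry = 'gleason = ' + str(closest)      # yields 'gleason = 4.0'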
The constructor parameter symbolic_to_numeric_cardinality_threshold
lets you tell the module when to treat an otherwise numeric feature
symbolically.  If you set this parameter to 10, all numeric-looking
features that take 10 or fewer different values in the training datafile
will be treated as symbolic features.
See the tutorial at
https://engineering.purdue.edu/kak/Tutorials/DecisionTreeClassifiers.pdf
for further information on the implementation issues related to the
symbolic and numeric features.
TESTING THE QUALITY OF YOUR TRAINING DATA:
Starting with version 2.2, the module includes a new class named
EvalTrainingData, derived from the main class DecisionTree, that runs a
10-fold cross-validation test on your training data to test its ability
to discriminate between the classes mentioned in the training file.
The 10-fold cross-validation test divides all of the training data into
ten parts, with nine parts used for training a decision tree and one
part used for testing its ability to classify correctly. This selection
of nine parts for training and one part for testing is carried out in
all of the ten different possible ways.
The following code fragment illustrates how you invoke the testing
function of the EvalTrainingData class:
training_datafile = "training3.csv"
eval_data = DecisionTree.EvalTrainingData(
training_datafile = training_datafile,
csv_class_column_index = 1,
csv_columns_for_features = [2,3],
entropy_threshold = 0.01,
max_depth_desired = 3,
symbolic_to_numeric_cardinality_threshold = 10,
)
eval_data.get_training_data()
eval_data.evaluate_training_data()
The last statement above prints out a Confusion Matrix and the value of
Training Data Quality Index on a scale of 100, with 100 designating
perfect training data. The Confusion Matrix shows how the different
classes were mis-identified in the 10-fold cross-validation test.
This testing functionality can also be used to find the best values to
use for the constructor parameters entropy_threshold,
max_depth_desired, and symbolic_to_numeric_cardinality_threshold.
The following two scripts in the Examples directory illustrate the use
of the EvalTrainingData class for testing the quality of your data:
evaluate_training_data1.py
evaluate_training_data2.py
HOW TO MAKE THE BEST CHOICES FOR THE CONSTRUCTOR PARAMETERS:
Assuming your training data is good, the quality of the results you get
from a decision tree would depend on the choices you make for the
constructor parameters entropy_threshold, max_depth_desired, and
symbolic_to_numeric_cardinality_threshold. You can optimize your
choices for these parameters by running the 10-fold cross-validation
test that is made available in Versions 2.2 and higher through the new
class EvalTrainingData that is included in the module file. A
description of how to run this test is in the section titled "TESTING
THE QUALITY OF YOUR TRAINING DATA" of this document.
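As a rough illustration, the following sketch sweeps over a few candidate
values for two of these parameters and runs the 10-fold cross-validation
test for each combination.  It assumes that evaluate_training_data()
reports its results by printing them, so you would compare the printed
Training Data Quality Index across the runs:
    import DecisionTree

    training_datafile = "training3.csv"
    for entropy_threshold in (0.01, 0.001):
        for max_depth_desired in (3, 5, 8):
            print("\n=== entropy_threshold = %s   max_depth_desired = %s ===" %
                                       (entropy_threshold, max_depth_desired))
            eval_data = DecisionTree.EvalTrainingData(
                            training_datafile = training_datafile,
                            csv_class_column_index = 1,
                            csv_columns_for_features = [2,3],
                            entropy_threshold = entropy_threshold,
                            max_depth_desired = max_depth_desired,
                            symbolic_to_numeric_cardinality_threshold = 10,
                        )
            eval_data.get_training_data()
            eval_data.evaluate_training_data()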
METHODS:
The module provides the following methods for constructing a decision
tree from training data in a disk file, and for data classification with
the decision tree.
Constructing a decision tree:
dt = DecisionTree( training_datafile = training_datafile,
csv_class_column_index = 2,
csv_columns_for_features = [3,4,5,6,7,8],
entropy_threshold = 0.01,
max_depth_desired = 8,
symbolic_to_numeric_cardinality_threshold = 10,
)
This yields a new instance of the DecisionTree class. For this call to
make sense, the training data in the training datafile must conform
to a certain format. For example, the first row must name the
features. It must begin with the empty string `""' as shown by the CSV
files in the Examples subdirectory. The first column for all
subsequent rows must carry a unique integer identifier for each data
record. When your features are purely symbolic, you are also allowed
to use the `.dat' files that were used in the previous versions of this
module.
The constructor option csv_class_column_index supplies to the module
the zero-based index of the column that contains the class label for the
training data records. In the example shown above, the class labels are
in the third column. The option csv_columns_for_features tells the
module which of the features are supposed to be used for decision tree
construction. The constructor option max_depth_desired sets the
maximum depth of the decision tree. The parameter entropy_threshold
sets the granularity with which the entropies are sampled. The
parameter symbolic_to_numeric_cardinality_threshold allows the module
to treat an otherwise numeric feature symbolically if it only takes a
small number of different values in the training data file. For the
constructor call shown above, if a feature takes on only 10 or fewer
different values in the training data file, it will be treated like a
symbolic feature.
The constructor parameters:
training_datafile:
This parameter supplies the name of the file that contains the
training data. This must be a CSV file if your training data
includes both numeric and symbolic features. If your data is
purely symbolic, you can use the old-style `.dat' file.
csv_class_column_index:
When using a CSV file for your training data, this parameter
supplies the zero-based column index for the column that contains
the class label for each data record in the training file.
csv_columns_for_features:
When using a CSV file for your training data, this parameter
supplies a list of columns corresponding to the features you wish
to use for decision tree construction. Each column is specified by
its zero-based index.
entropy_threshold:
This parameter sets the granularity with which the entropies are
sampled by the module. For example, a feature test at a node in
the decision tree is acceptable if the entropy gain achieved by the
test exceeds this threshold. The larger the value you choose for
this parameter, the smaller the tree. Its default value is 0.001.
max_depth_desired:
This parameter sets the maximum depth of the decision tree. For
obvious reasons, the smaller the value you choose for this
parameter, the smaller the tree.
symbolic_to_numeric_cardinality_threshold:
This parameter allows the module to treat an otherwise numeric
feature symbolically if the number of different values the feature
takes in the training data file does not exceed the value of this
parameter.
You can choose the best values to use for the last three constructor
parameters by running a 10-fold cross-validation test on your training
data through the embedded class EvalTrainingData that comes with
Versions 2.2 and higher of this module. See the section "TESTING THE
QUALITY OF YOUR TRAINING DATA" of this document page.
Reading in the training data:
After you have constructed a new instance of the DecisionTree class,
you must now read in the training data that is contained in the file
named above. This you do by:
dt.get_training_data()
IMPORTANT: The training data file must be in a format that makes sense
to the decision tree constructor. If you use numeric features, you
must use a CSV file for supplying the training data. The first row of
such a file must name the features and it must begin with the empty
string `""' as shown in the `stage3cancer.csv' file in the Examples
subdirectory. The first column for all subsequent rows must carry a
unique integer identifier for each training record.
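As an illustration of these formatting rules, here is a hypothetical CSV
fragment (the feature names and values are made up; see `stage3cancer.csv'
in the Examples subdirectory for a real training file):
    "",class,featureA,featureB
    1,benign,3.4,high
    2,malignant,7.1,low
    3,benign,2.9,high
For a file laid out like this, you would set csv_class_column_index to 1
and csv_columns_for_features to [2,3].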
Initializing the probability cache:
After a call to the constructor and the get_training_data() method, you
must call the following methods for initializing the probabilities:
dt.calculate_first_order_probabilities()
dt.calculate_class_priors()
Displaying the training data:
If you wish to see the training data that was just digested by the
module, call
dt.show_training_data()
Constructing a decision-tree classifier:
After the training data is ingested, it is time to construct a decision
tree classifier. This you do by
root_node = dt.construct_decision_tree_classifier()
This call returns an instance of type DTNode. The DTNode class is
defined within the main module file, at its end.  So don't forget that
root_node in the above example call is an instance of type DTNode.
Displaying the decision tree:
You display a decision tree by calling
root_node.display_decision_tree(" ")
This displays the decision tree in your terminal window by using a
recursively determined offset for each node as the display routine
descends down the tree.
I have intentionally left the syntax fragment root_node in the above
call to remind the reader that display_decision_tree() is NOT called on
the instance of the DecisionTree we constructed earlier, but on the
DTNode instance returned by the call to
construct_decision_tree_classifier().
Classifying new data:
You classify new data by first constructing a new data record:
test_sample = ['g2 = 4.2',
'grade = 2.3',
'gleason = 4',
'eet = 1.7',
'age = 55.0',
'ploidy = diploid']
and calling the classify() method as follows:
classification = dt.classify(root_node, test_sample)
where, again, root_node is an instance of type DTNode that was returned
by calling construct_decision_tree_classifier().  The variable
classification is a dictionary whose keys are the class labels and
whose values are the associated probabilities.  You can print it out by
print "Classification: ", classification
Displaying the number of nodes created:
You can print out the number of nodes in a decision tree by calling
root_node.how_many_nodes()
Using the decision tree interactively:
Starting with Version 1.6 of the module, you can use the DecisionTree
classifier in an interactive mode. In this mode, after you have
constructed the decision tree, the user is prompted for answers to the
questions regarding the feature tests at the nodes of the tree.
Depending on the answer supplied by the user at a node, the classifier
takes a path corresponding to the answer to descend down the tree to
the next node, and so on. The following method makes this mode
possible. Obviously, you can call this method only after you have
constructed the decision tree.
dt.classify_by_asking_questions(root_node)
Generating synthetic training data:
To generate synthetic training data, you first construct an instance of
the class TrainingDataGeneratorNumeric or TrainingDataGeneratorSymbolic,
both of which are incorporated in the DecisionTree module.  For numeric
training data, the constructor call will look like:
parameter_file = "param_numeric.txt"
output_csv_file = "training.csv"
training_data_gen = TrainingDataGeneratorNumeric(
output_csv_file = output_csv_file,
parameter_file = parameter_file,
number_of_samples_per_class = some_number,
)
training_data_gen.read_parameter_file_numeric()
training_data_gen.gen_numeric_training_data_and_write_to_csv()
The training data is generated according to the specifications
described in the parameter file. The structure of this file must be as
shown in the file `param_numeric.txt' for the numeric training data and
as shown in `param_symbolic.txt' for the case of symbolic training
data. Both these example parameter files are in the 'Examples'
subdirectory. The parameter file names the classes, the features for
the classes, and the possible values for the features.
If you want to generate purely symbolic training data, here is the
constructor call to make:
parameter_file = "param_symbolic.txt"
output_data_file = "training.dat"
training_data_gen = TrainingDataGeneratorSymbolic(
output_datafile = output_data_file,
parameter_file = parameter_file,
write_to_file = 1,
number_of_training_samples = some_number,
)
training_data_gen.read_parameter_file_symbolic()
training_data_gen.gen_symbolic_training_data()
training_data_gen.write_training_data_to_file()
Generating synthetic test data:
To generate synthetic test data, you first construct an instance of the
class TestDataGeneratorSymbolic that is incorporated in the
DecisionTree module. A call to the constructor of this class will look
like:
test_data_gen = TestDataGeneratorSymbolic(
output_test_datafile = an_output_data_file,
output_class_labels_file = a_file_for_class_labels,
parameter_file = a_parameter_file,
write_to_file = 1,
number_of_test_samples = some_number,
)
The main difference between the training data and the test data is that
the class labels are NOT mentioned in the latter. Instead, the class
labels are placed in a separate file whose name is supplied through the
constructor option `output_class_labels_file' shown above. The test
data is generated according to the specifications described in the
parameter file.  In general, this parameter file would be the same as
the one you used for generating the training data.
BULK CLASSIFICATION OF DATA RECORDS
For large test datasets, you would obviously want to process an entire
file of test data at a time. The following scripts in the Examples
directory illustrate how you can do that:
classify_test_data_in_a_file_numeric.py
classify_test_data_in_a_file_symbolic.py
These scripts require three command-line arguments: the first names the
training datafile, the second the test datafile, and the third the file
in which the classification results are to be deposited.
The first script is for the case of numeric/symbolic features and the
second for the purely symbolic features. An important point to
remember when using these scripts for bulk classification is that the
test file must have a column for class labels. In real-life
situations, obviously, the entries in that column in the test file will
be just the empty string "".
HOW THE CLASSIFICATION RESULTS ARE DISPLAYED
It depends on whether you apply the classifier at once to all the data
records in a file, or whether you feed one data record at a time into
the classifier.
In general, the classifier returns soft classification for a test data
record. What that means is that, in general, the classifier will list
all the classes to which a given data record could belong and the
probability of each such class label for the data record. Run the
example scripts in the Examples directory to see how the output of
classification can be displayed.
With regard to the soft classifications returned by this classifier, if
the probability distributions for the different classes overlap in the
underlying feature space, you would want the classifier to return all
of the applicable class labels for a test data record along with the
corresponding class probabilities. (However, keep in mind the fact
that the decision tree classifier may associate significant
probabilities with multiple class labels for a given test data record
if the training file contains an inadequate number of training samples
for one or more classes.) The good thing is that the classifier would
not lie to you (unlike, say, a hard classification rule that would
return a single class label corresponding to the partitioning of the
underlying feature space). The decision tree classifier will give you
the best classification that can be made given the training data you
feed into it.
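One way to act on such soft classifications is sketched below: report
every class whose probability exceeds a cutoff instead of forcing a
single label.  As before, the 'solution_path' key name is an assumption,
and the probabilities are assumed to be numeric or numeric strings:
    classification = dt.classify(root_node, test_sample)
    classification.pop('solution_path', None)
    cutoff = 0.2
    plausible = dict((c, p) for c, p in classification.items() if float(p) >= cutoff)
    print("Plausible classes and their probabilities: %s" % plausible)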
THE EXAMPLES DIRECTORY:
See the 'Examples' directory in the distribution for how to construct a
decision tree, and how to then classify new data using the decision
tree. To become more familiar with the module, run the scripts
construct_dt_and_classify_one_sample_case1.py
construct_dt_and_classify_one_sample_case2.py
construct_dt_and_classify_one_sample_case3.py
construct_dt_and_classify_one_sample_case4.py
The first script is for the purely symbolic case, the second for the
case that involves both numeric and symbolic features, the third for
the case of purely numeric features, and the last for the case when the
training data is synthetically generated by the script
generate_training_data_numeric.py
Next, run the following script as it is for bulk classification of data
records placed in a CSV file:
classify_test_data_in_a_file_numeric.py training4.csv test4.csv out4.csv
The script first constructs a decision tree using the training data in
the first-argument file, `training4.csv'. Subsequently, the script
calculates the class labels for each of the test records in the file
`test4.csv'.  The class labels are written out to the file `out4.csv'.  An
important thing to note here is that your test file --- in this case
`test4.csv' --- must have a column for the class labels.  Obviously, in
real-life situations, there will be no class labels in this column.
When that is the case, you can place the empty string "" for each data
record in this column.  A demonstration of that is given by the
following variation of the above call:
classify_test_data_in_a_file_numeric.py training4.csv test4_no_class_labels.csv out4.csv
If you want to use the old-style `.dat' files for the purely symbolic case,
you can do bulk classifications with those files also, as demonstrated by
the following examples:
classify_test_data_in_a_file_symbolic.py training4.dat test4.dat out4.dat
classify_test_data_in_a_file_symbolic.py training4.dat test4_no_class_labels.dat out4.dat
The point of the second example is to show that the format of the test
data file must be identical to that of the training data file, in the sense
that it must have a column for the class labels even when those labels
are just empty strings "".
The following script in the 'Examples' directory
classify_by_asking_questions.py
shows how you can use a decision-tree classifier interactively. In
this mode, you first construct the decision tree from the training data
and then the user is prompted for answers to the feature tests at the
nodes of the tree.
The 'Examples' directory also contains the following scripts:
generate_training_data_numeric.py
generate_training_data_symbolic.py
generate_test_data_symbolic.py
that show how you can use the module to generate synthetic training and
test data. Synthetic training and test data are generated according to
the specifications laid out in a parameter file. There are constraints
on how the information is laid out in the parameter file. See the
files `param_numeric.txt' and `param_symbolic.txt' in the 'Examples'
directory for how to structure these files.
The Examples directory of Versions 2.2 and higher of the DecisionTree
module also contains the following two scripts:
evaluate_training_data1.py
evaluate_training_data2.py
that illustrate how the Python class EvalTrainingData can be used to
evaluate the quality of your training data (as long as it resides in a
`.csv' file.) This new class is a subclass of the DecisionTree class
in the module file. See the README in the Examples directory for
further information regarding these two scripts.
INSTALLATION:
The DecisionTree class was packaged using Distutils. For installation,
execute the following command-line in the source directory (this is the
directory that contains the setup.py file after you have downloaded and
uncompressed the package):
python setup.py install
You have to have root privileges for this to work. On Linux
distributions, this will install the module file at a location that
looks like
/usr/lib/python2.7/dist-packages/
If you do not have root access, you have the option of working directly
off the directory in which you downloaded the software by simply
placing the following statements at the top of your scripts that use
the DecisionTree class:
import sys
sys.path.append( "pathname_to_DecisionTree_directory" )
To uninstall the module, simply delete the source directory, locate
where the DecisionTree module was installed with "locate DecisionTree"
and delete those files. As mentioned above, the full pathname to the
installed version is likely to look like
/usr/lib/python2.7/dist-packages/DecisionTree*
If you want to carry out a non-standard install of the DecisionTree
module, look up the on-line information on Distutils by pointing your
browser to
http://docs.python.org/dist/dist.html
BUGS:
Please notify the author if you encounter any bugs. When sending
email, please place the string 'DecisionTree' in the subject line.
ACKNOWLEDGMENTS:
The importance of the 'sentiment' feature in the "What Practical Problem
is Solved by this Module" section was mentioned to the author by John
Gorup. Thanks John.
AUTHOR:
Avinash Kak, kak@purdue.edu
If you send email, please place the string "DecisionTree" in your
subject line to get past my spam filter.
COPYRIGHT:
Python Software Foundation License
Copyright 2014 Avinash Kak
Data:
    __author__    = 'Avinash Kak (kak@purdue.edu)'
    __copyright__ = '(C) 2014 Avinash Kak. Python Software Foundation.'
    __date__      = '2014-May-3'
    __url__       = 'https://engineering.purdue.edu/kak/distDT/DecisionTree-2.2.2.html'
    __version__   = '2.2.2'