DecisionTree-1.7.html

Python: module DecisionTree DecisionTree (version 1.7, 2012-July-29) DecisionTree.py Version: 1.7 Author: Avinash Kak (kak@purdue.edu) Date: 2012-July-29 Download Version 1.7: gztar bztar View version 1.7 code in browser SWITCH TO VERSION 1.7.1 SWITCH TO VERSION 2.2.4 (Recommended) CHANGES: Version 1.7: This version includes safety checks on the consistency of the data you place in your training datafile. When a training file contains thousands of samples, it is difficult to manually check that you used the same class names in your sample records that you declared at the top of your training file or that the values you have for your features are legal in light of the earlier declarations regarding such values in the training file. Another safety feature incorporated in this version is the non-consideration of classes that are declared at the top of the training file but that have no sample records in the file. Version 1.6.1: Fixed a bug in the method that generates synthetic test data. Version 1.6: This version includes several upgrades: The module now includes code for generating synthetic training and test data for experimenting with the DecisionTree classifier. Another upgrade in the new version is that, after training, a decision tree can now be used in an interactive mode in which the user is asked to supply answers for the feature tests at the nodes as the classification process descends down the tree. Version 1.5: This is a Python 3.x compliant version of the DecisionTree module. This version should work with both Python 2.x and Python 3.x. Version 1.0: This is a Python implementation of the author's Perl module Algorithm::DecisionTree, Version 1.41. The Python version should work faster for large decision trees since it uses probability and entropy caching much more extensively than Version 1.41 of the Perl module. (Note: I expect my next release of the Perl module to catch up with this Python version in terms of performance.) USAGE: The DecisionTree module is a pure-Python implementation for constructing a decision tree from multidimensional training data and for using the decision tree thus constructed for classifying unlabeled data. For constructing a decision tree and for classifying a sample: dt = DecisionTree( training_datafile = "training.dat", debug1 = 0, debug2 = 0 ) dt.get_training_data() dt.show_training_data() root_node = dt.construct_decision_tree_classifier() root_node.display_decision_tree(" ") test_sample = ['exercising=>never', 'smoking=>heavy', 'fatIntake=>heavy', 'videoAddiction=>heavy'] classification = dt.classify(root_node, test_sample) print "Classification: ", classification print "Number of nodes created: ", root_node.how_many_nodes() For the above calls to work, the format in which the training data is made available to the decision-tree constructor must meet certain assumptions. (See the 'training.dat' file in the Examples directory for how to format the training data in a file.) The training datafile must declare the class names, the feature names and the names of the different possible values for the features. The rest of the training datafile is expected to contain the training samples in the form of a multi-column table. The `debug1' option when set will warn you about there being empty classes in your training data, the possibility that the decision tree may be too large with the default choice for the very important parameter entropy_threshold, and the information parsed out of the training data file, such as the class names, the feature names, and the legal values for the features. Since each warning issued by the `debug1' option is followed by a prompt to the user as to whether he/she wishes to continue, it is best to NOT use this option if you want to dump the debugging information into a file. The `debug2' option displays run-time information regarding the nodes that are instantiated during the process of decision-tree construction and the final classification probabilities for the test data. With this option, you can dump the debugging output into a diskfile. If your training file specifies a large number of features or a large number of values for the features, the above constructor call could result in a decision tree that is simply much too large (and much too slow to construct). For such cases, consider using following additional options in the constructor call shown above: dt = DecisionTree( training_datafile = "training.dat", max_depth_desired = some_number, entropy_threshold = some_value, debug1 = 0, debug2 = 0, ) where for max_depth_desired you should choose a number that is less than the number of features in your training file. This will set the depth of your decision tree to max_depth_desired. The algorithm will automatically use the BEST max_depth_desired features --- best in the sense of being the most discriminative --- for constructing the decision tree. The parameter entropy_threshold sets the granularity with which the entropies will be sampled. Its default value is 0.001. The larger the value you choose for entropy_threshold, the smaller the tree. INTRODUCTION: DecisionTree is a pure Python module for constructing a decision tree from a training datafile containing multidimensional data in the form of a table. In one form or another, decision trees have been around for the last fifty years. However, their popularity during the last decade is owing to the entropy-based method proposed by Ross Quinlan for their construction. Fundamental to Quinlan's approach is the notion that a decision node in a tree should be split only if the entropy at the ensuing child nodes will be less than the entropy at the node in question. The implementation presented here is based on the same idea. For those not familiar with decision tree ideas, the traditional way to classify multidimensional data is to start with a feature space whose dimensionality is the same as that of the data. Each feature in this space would correspond to the attribute that each dimension of the data measures. You then use the training data to carve up the feature space into different regions, each corresponding to a different class. Subsequently, when you are trying to classify a new data sample, you locate it in the feature space and find the class label of the region to which it belongs. One can also give the data point the same class label as that of the nearest training sample. (This is referred to as the nearest neighbor classification.) A decision tree classifier works differently. When you construct a decision tree, you select for the root node a feature test that can be expected to maximally disambiguate the class labels that could be associated with the data you are trying to classify. You then attach to the root node a set of child nodes, one for each value of the feature you chose at the root node. Now at each child node you pose the same question that you posed when you found the best feature to use at the root node: What feature at the child node in question would maximally disambiguate the class labels to be associated with a given data vector assuming that the data vector passed the root node on the branch that corresponds to the child node in question. The feature that is best at each node is the one that causes the maximal reduction in class entropy at that node. As the reader would expect, the two key steps in any approach to decision-tree based classification are the construction of the decision tree itself from a file containing the training data, and then using the decision tree thus obtained for classifying the data. What is cool about decision tree classification is that it gives you soft classification, meaning it may associate more than one class label with a given data vector. When this happens, it may mean that your classes are indeed overlapping in the underlying feature space. It could also mean that you simply have not supplied sufficient training data to the decision tree classifier. For a tutorial introduction to how a decision tree is constructed and used, visit https://engineering.purdue.edu/kak/DecisionTreeClassifiers.pdf WHAT PRACTICAL PROBLEM IS SOLVED BY THIS MODULE? Consider the following scenario: Let's say you are running a small investment company that employs a team of stockbrokers who make buy/sell decisions for the customers of your company. Assume that your company has asked the traders to make each investment decision on the basis of the following five criteria: price_to_earnings_ratio (P_to_E) price_to_sales_ratio (P_to_S) return_on_equity (R_on_E) market_share (M_S) sentiment (S) Since you are the boss, you keep track of the buy/sell decisions made by the individual traders. But one unfortunate day, all of your traders decide to quit because you did not pay them enough. So what do you do? If you had a module like the one here, you could still run your company and do so in such a way that, on the average, would do better than any of the individual traders who worked for your company. This is what you do: You pool together the individual trader buy/sell decisions you have accumulated during the last one year. This pooled information is likely to look like: example buy/sell P_to_E P_to_S R_on_E M_S S ==================================================================== example_1 buy high low medium low high example_2 buy medium medium low low medium example_3 sell low medium low high low .... .... This data would constitute your training file. You could feed this file into the module by calling: dt = DecisionTree( training_datafile = "training.dat" ) dt.get_training_data() and then construct a decision tree by calling: root_node = dt.construct_decision_tree_classifier() Now you and your company (with practically no employees) are ready to service the customers again. Suppose your computer needs to make a buy/sell decision about an investment prospect that is best described by: price_to_earnings_ratio => low price_to_sales_ratio => very_low return_on_equity => none market_share => medium sentiment => low All that your computer would need to do would be to construct a data vector like test_case = [ 'P_to_E=>low', 'P_to_S=>very_low', 'R_on_E=>none', 'M_S=>medium', 'S=>low' ] and call the decision tree classifier you just constructed by classification = dt.classify(root_node, test_case) print "Classification: ", classification The answer returned will be 'buy' and 'sell', along with the associated probabilities. So if the probability of 'buy' is considerably greater than the probability of 'sell', that's what you should instruct your computer to do. The chances are that, on the average, this approach would beat the performance of any of your individual traders who worked for you previously since the buy/sell decisions made by the computer would be based on the collective wisdom of all your previous traders. DISCLAIMER: There is obviously a lot more to good investing than what is captured by the silly little example here. However, it does convey the sense in which the current module can be used. WHAT HAPPENS IF THE NUMBER OF FEATURES AND/OR VALUES IS LARGE? If n is the number of features and m the largest number for the possible values for any of the features, then, in only the worst case, the algorithm would want to construct m**n nodes. In other words, in the worst case, the size of the decision tree grows exponentially as you increase either the number of features or the number of possible values for the features. That is the bad news. The good news is that you have two constructor parameters at your disposal for controlling the size of the decision tree: The parameter max_depth_desired controls the depth of the constructed tree from its root node, and the parameter entropy_threshold controls the granularity with which the entropy space will be sampled. The smaller the max_depth_desired and the larger the entropy_threshold, the smaller the size of the decision tree. The default value for max_depth_desired is the number of features specified in the training datafile, and the the default value for entropy_threshold is 0.001. The users of this module with a need to create very large decision trees should also consider storing the tree once constructed in a diskfile and then using the stored tree for classification work. The scripts store_dt_on_disk.py and classify_from_diskstored_dt.py in the Examples directory show you how you can do that with the help of Python's pickle module. (NOTE: At this time, this feature works only with Python 2.x.) Also note that, if you are NOT dumping the debugging output into a file, it is always a good idea to keep the `debug1' and `debug2' options set anytime you are experimenting with a new training datafile --- especially if this data is likely to create an inordinately large decision tree. Otherwise, first try just the `debug1' option to make sure that your training data looks good and that you have used the entropy_threshold parameter to control the size of the tree. Subsequently, you may try just the `debug2' option to dump out information regarding the nodes of the tree. WHAT HAPPENS WHEN THE FEATURE VALUES ARE NUMERIC? The current module will treat a numeric value for a feature as just a string. In that sense, there is no difference between a string value for a feature and a numeric value. This would obviously make the module unsuitable for applications in which a feature may take on a numeric value from a very large set of such values and you want feature values to be compared using numeric comparison predicates as opposed to string comparison predicates. (Consider, for example, using color as an object feature in a computer vision application.) The decision trees for applications in which the feature values are primarily numeric in nature are constructed differently, as explained in the tutorial at https://engineering.purdue.edu/kak/DecisionTreeClassifiers.pdf METHODS: The module provides the following methods for decision-tree induction from training data in a diskfile, and for data classification with the decision tree. Constructing a decision tree: dt = DecisionTree( training_datafile = "training.dat", debug1 = 0, debug2 = 0, ) This yields a new instance of the DecisionTree class. For this call to make sense, the training data in the training datafile must be according to a certain format that is shown below. (Also see the file training.dat in the Examples directory.) The `debug1' option when set will warn you about there being empty classes in your training data, the possibility that the decision tree may be too large with the default choice for the very important parameter entropy_threshold, and the information parsed out of the training data file, such as the class names, the feature names, and the legal values for the features. Since each warning issued by the `debug1' option is followed by a prompt to the user as to whether he/she wishes to continue, it is best to NOT use this option if you want to dump the debugging information into a file. The `debug2' option displays run-time information regarding the nodes that are instantiated during the process of decision-tree construction and the final classification probabilities for the test data. With this option, you can dump the debugging output into a diskfile. If your training file specifies a large number of features or a large number of values for the features, the above constructor call could result in a decision tree that is simply much too large (and much too slow to construct). For such cases, consider using following additional options in the constructor call shown above: dt = DecisionTree( training_datafile = "training.dat", max_depth_desired = some_number, entropy_threshold = some_value, debug1 = 0, debug2 = 0, ) where for max_depth_desired you should choose a number that is less than the number of features in your training file. This will set the depth of your decision tree to max_depth_desired. The algorithm will automatically use the BEST max_depth_desired features --- best in the sense of being the most discriminative --- for constructing the decision tree. The parameter entropy_threshold sets the granularity with which the entropies will be sampled. Its default value is 0.001. The larger the value you choose for entropy_threshold, the smaller the tree. Reading in the training data: After you have constructed a new instance of the DecisionTree class, you must now read in the training data that is contained in the file named above. This you do by: dt.get_training_data() IMPORTANT: The training data file must be in a format that makes sense to the decision tree constructor. The information in this file should look like Class names: malignant benign Feature names and their values: videoAddiction => none low medium heavy exercising => never occasionally regularly smoking => heavy medium light never fatIntake => low medium heavy Training Data: sample class videoAddiction exercising smoking fatIntake ========================================================================== sample_0 benign medium occasionally heavy low sample_1 malignant none occasionally heavy medium sample_2 benign low occasionally light heavy sample_3 malignant medium occasionally heavy heavy .... .... IMPORTANT: Note that the class names, the number of classes, the feature names, and the possible values for the features can be anything that your data requires them to be. Displaying the training data: If you wish to see the training data that was just digested by the module, call dt.show_training_data() Constructing a decision-tree classifier: After the training data is ingested, it is time to construct a decision tree classifier. This you do by root_node = dt.construct_decision_tree_classifier() This call returns an instance of type Node. The Node class is defined within the main package file, at its end. So, don't forget, that root_node in the above example call will be instantiated to an instance of type Node. Displaying the decision tree: You display a decision tree by calling root_node.display_decision_tree(" ") This will display the decision tree in your terminal window by using a recursively determined offset for each node as the display routine descends down the tree. I have intentionally left the syntax fragment root_node in the above call to remind the reader that display_decision_tree() is NOT called on the instance of the DecisionTree we constructed earlier, but on the Node instance returned by the call to construct_decision_tree_classifier(). Classifying new data: You classify new data by first constructing a new data vector: test_sample = ['exercising=>never', 'smoking=>heavy', 'fatIntake=>heavy', 'videoAddiction=>heavy'] and calling the classify() method as follows: classification = dt.classify(root_node, test_sample) where, again, root_node is an instance of type Node that was returned by calling construct_decision_tree_classifier(). The variable classification is a dictionary whose keys are the class labels and whose values the associated probabilities. You can print it out by print "Classification: ", classification Displaying the number of nodes created: You can print out the number of nodes in a decision tree by calling root_node.how_many_nodes() Determining the condition of the training data: The following method, automatically invoked when debug1 option is set in the call to the decision-tree constructor, displays useful information regarding your training data file. This method also warns you if you are trying to construct a decision tree that may simply be much too large. dt.determine_data_condition() Using the decision tree interactively: Starting with Version 1.6 of the module, you can use the DecisionTree classifier in an interactive mode. In this mode, after you have constructed the decision tree, the user is prompted for answers to the questions regarding the feature tests at the nodes of the tree. Depending on the answer supplied by the user at a node, the classifier takes a path corresponding to the answer to descend down the tree to the next node, and so on. The following method makes this mode possible. Obviously, you can call this method only after you have constructed the decision tree. dt.classify_by_asking_questions(root_node) Generating synthetic training data: To generate synthetic training data, you first construct an instance of the class TrainingDataGenerator that is incorporated in the DecisionTree module. A call to the constructor of this class will look like: training_data_gen = TrainingDataGenerator( output_datafile = an_output_data_file, parameter_file = a_parameter_file, write_to_file = 1, number_of_training_samples = some_number, ) The training data that is generated is according to the specifications described in the parameter file. The structure of this file must be as shown in the file 'param.txt' in the 'Examples' directory. The parameter file names the classes, the features for the classes, the possible values for the features, and these values are biased for the different classes. Generating synthetic test data: To generate synthetic test data, you first construct an instance of the class TestDataGenerator that is incorporated in the DecisionTree module. A call to the constructor of this class will look like: test_data_gen = TrainingDataGenerator( output_test_datafile = an_output_data_file, output_class_labels_file = a_file_for_class_labels, parameter_file = a_parameter_file, write_to_file = 1, number_of_test_samples = some_number, ) The main difference between the training data and the test data is that the class labels are NOT mentioned in the latter. Instead, the class labels are placed in a separate file whose name is supplied through the constructor option 'output_class_labels_file' shown above. The test data that is generated is according to the specifications described in the parameter file. In general, this parameter file would be the same that you used for generating the training data. HOW THE CLASSIFICATION RESULTS ARE DISPLAYED It depends on whether you apply the classifier at once to all the data samples in a file, or whether you feed one data vector at a time into the classifier. In general, the classifier returns soft classification for a data vector. What that means is that, in general, the classifier will list all the classes to which a given data vector could belong and the probability of each such class label for the data vector. For the sort of data that is in the 'training.dat' file in the Examples directory, the result of classification for a single data vector would look like: malignant with probability 0.744186046511628 benign with probability 0.255813953488372 For large test datasets, you would obviously want to process an entire file of test data at a time. The best way to do this is to follow my script classify_test_data_in_a_file.py in the 'Examples' directory. This script requires three command-line arguments, the first argument names the training datafile, the second the test datafile, and the third in which the classification results will be deposited. The test datafile must mention the order in which the features values are presented. For an example, see the file 'testdata.dat' in the 'Examples' directory. With regard to the soft classifications returned by this classifier, if the probability distributions for the different classes overlap in the underlying feature space, you would want the classifier to return all of the applicable class labels for a data vector along with the corresponding class probabilities. Another reason for why the decision tree classifier may associate significant probabilities with multiple class labels is that you used inadequate number of training samples to induce the decision tree. The good thing is that the classifier does not lie to you (unlike, say, a hard classification rule that would return a single class label corresponding to the partitioning of the underlying feature space). The decision tree classifier give you the best classification that can be made given the training data you fed into it. THE EXAMPLES DIRECTORY: See the 'Examples' directory in the distribution for how to induce a decision tree, and how to then classify new data using the decision tree. To become more familiar with the module, run the script construct_dt_and_classify_one_sample.py to classify the data record that is in the script. Next run the script as it is classify_test_data_in_a_file.py training.dat testdata.dat out.txt This call will first construct a decision tree using the training data in the file 'training.dat'. It will then calculate the class label for each data record in the file 'testdata.dat'. The estimated class labels will be written out to the file 'out.txt'. The 'Examples' directory also contains the script 'store_dt_on_disk.py' that shows how you can use Python's pickle module to store a decision tree in a disk file. The file 'classify_from_diskstored_dt.py' in the same directory shows how you can classify new data vectors with the stored decision tree. This is expected to be extremely useful for situations that involve tens of thousands or millions of decision nodes. (NOTE: At this time, this feature only works with Python 2.x) The following script in the 'Examples' directory classify_by_asking_questions.py* shows how you can use a decision-tree classifier interactively. In this mode, you first construct the decision tree from the training data and then the user is prompted for answers to the feature tests at the nodes of the tree. The 'Examples' directory also contains the following scripts: generate_training_data.py generate_test_data.py that show how you can use the module to generate synthetic training and test data. Synthetic training and test data are generated according to the specifications laid out in a parameter file. There are constraints on how the information is laid out in the parameter file. See the file 'param.txt' in the 'Examples' directory for an example. INSTALLATION: The DecisionTree class was packaged using Distutils. For installation, execute the following command-line in the source directory (this is the directory that contains the setup.py file after you have downloaded and uncompressed the package): python setup.py install You have to have root privileges for this to work. On Linux distributions, this will install the module file at a location that looks like /usr/lib/python2.6/site-packages/ If you do not have root access, you have the option of working directly off the directory in which you downloaded the software by simply placing the following statements at the top of your scripts that use the DecisionTree class: import sys sys.path.append( "pathname_to_DecisionTree_directory" ) To uninstall the module, simply delete the source directory, locate where the DecisionTree module was installed with "locate DecisionTree" and delete those files. As mentioned above, the full pathname to the installed version is likely to look like /usr/lib/python2.6/site-packages/DecisionTree* If you want to carry out a non-standard install of the DecisionTree module, look up the on-line information on Disutils by pointing your browser to http://docs.python.org/dist/dist.html BUGS: Please notify the author if you encounter any bugs. When sending email, please place the string 'DecisionTree' in the subject line. ACKNOWLEDGMENTS: The importance of the 'sentiment' feature in the "What Practical Problem is Solved by this Module" section was mentioned to the author by John Gorup. Thanks John. AUTHOR: Avinash Kak, kak@purdue.edu If you send email, please place the string "DecisionTree" in your subject line to get past my spam filter. COPYRIGHT: Python Software Foundation License Copyright 2012 Avinash Kak Modules functools math re sys Classes __builtin__.object DecisionTree Node TestDataGenerator TrainingDataGenerator class DecisionTree(__builtin__.object) Methods defined here: __init__(self, *args, **kwargs) best_feature_calculator(self, features_and_values_on_branch) check_names_used(self, features_and_values_test_data) class_entropy_for_a_given_sequence_of_features_values(self, array_of_features_and_values) class_entropy_on_priors(self) classify(self, root_node, features_and_values) classify_by_asking_questions(self, root_node) construct_decision_tree_classifier(self) determine_data_condition(self) get_class_names(self) get_training_data(self) interactive_recursive_descent_for_classification(self, node) prior_probability_for_class(self, class_name) probability_for_a_class_given_feature_value(self, class_name, feature_name, feature_value) probability_for_a_class_given_sequence_of_features_and_values(self, class_name, array_of_features_and_values) probability_for_feature_value(self, feature, value) probability_for_feature_value_given_class(self, feature_name, feature_value, class_name) probability_for_sequence_of_features_and_values_given_class(self, array_of_features_and_values, class_name) probability_of_a_sequence_of_features_and_values(self, array_of_features_and_values) recursive_descent(self, node) recursive_descent_for_classification(self, node, feature_and_values) show_training_data(self) Data descriptors defined here: __dict__ dictionary for instance variables (if defined) __weakref__ list of weak references to the object (if defined) class Node(__builtin__.object) # The nodes of the decision tree are instances of this class: Methods defined here: __init__(self, feature, entropy, class_probabilities, branch_features_and_values) add_child_link(self, new_node) add_to_branch_features_and_values(self, feature_and_value) delete_all_links(self) display_decision_tree(self, offset) display_node(self) get_branch_features_and_values(self) get_children(self) get_class_probabilities(self) get_entropy(self) get_feature(self) # this returns the feature test at the current node get_next_serial_num(self) get_serial_num(self) set_feature(self, feature) Static methods defined here: how_many_nodes() # This is a class method: Data descriptors defined here: __dict__ dictionary for instance variables (if defined) __weakref__ list of weak references to the object (if defined) Data and other attributes defined here: nodes_created = -1 class TestDataGenerator(__builtin__.object) Methods defined here: __init__(self, *args, **kwargs) find_longest_value(self) gen_test_data(self) read_parameter_file(self) # Read the parameter file for generating the TEST data write_test_data_to_file(self) Data descriptors defined here: __dict__ dictionary for instance variables (if defined) __weakref__ list of weak references to the object (if defined) class TrainingDataGenerator(__builtin__.object) Methods defined here: __init__(self, *args, **kwargs) find_longest_feature_or_value(self) gen_training_data(self) read_parameter_file(self) # Read the parameter for generating the TRAINING data write_training_data_to_file(self) Data descriptors defined here: __dict__ dictionary for instance variables (if defined) __weakref__ list of weak references to the object (if defined) Functions deep_copy_array(array_in) # Meant only for an array of strings (no nesting): minimum(arr) # Returns simultaneously the minimum value and its positional index in an # array. [Could also have used min() and index() defined for Python's # sequence types.] sample_index(sample_name) We assume that every record in the training datafile begins with the name of the record. The only requirement is that these names end in a suffix like '_23', as in 'sample_23' for the 23rd training record. This function returns the integer in the suffix. Data __author__ = 'Avinash Kak (kak@purdue.edu)' __copyright__ = '(C) 2012 Avinash Kak. Python Software Foundation.' __date__ = '2012-July-29' __url__ = 'https://engineering.purdue.edu/kak/distDT/DecisionTree-1.7.html' __version__ = '1.7'

Author Avinash Kak (kak@purdue.edu)

Data
		__author__ = 'Avinash Kak (kak@purdue.edu)' __copyright__ = '(C) 2012 Avinash Kak. Python Software Foundation.' __date__ = '2012-July-29' __url__ = 'https://engineering.purdue.edu/kak/distDT/DecisionTree-1.7.html' __version__ = '1.7'