DecisionTree-2.3.html

DecisionTree (version 2.3, 2015-March-15)

DecisionTree.py
Version: 2.3     Author: Avinash Kak (kak@purdue.edu) Date: 2015-March-15

Download Version 2.3: gztar bztar      Total number of downloads (all versions): 13760
     This count is automatically updated at every rotation of      the weblogs (normally once every two to four days)      Last updated: Thu Jun 18 06:01:01 EDT 2026

View version 2.3 code in browser SWITCH TO VERSION 2.3.1 CHANGES:   Version 2.3:     In response to requests from several users, this version includes a new     capability: You can now ask the module to introspect about the     classification decisions returned by the decision tree.  Toward that     end, the module includes a new class named DTIntrospection.  Perhaps     the most important bit of information you are likely to seek through DT     introspection is the list of the training samples that fall directly in     the portion of the feature space that is assigned to a node.  CAVEAT:     When training samples are non-uniformly distributed in the underlying     feature space, IT IS POSSIBLE FOR A NODE TO EXIST EVEN WHEN NO TRAINING     SAMPLES FALL IN THE PORTION OF THE FEATURE SPACE ASSIGNED TO THE NODE.     (This is the entire point of the generalization achieved by     probabilistic modeling of the training data.)  For additional     information related to DT introspection, see the section titled     "DECISION TREE INTROSPECTION" in this documentation page.   Version 2.2.6:     Removed a serious bug in the method get_training_data_from_dat() that     is used to read the training data from `.dat' files.  The class naming     convention internally in the code is that it is a concatenation of the     header of the column that contains the class labels and the actual     class name as mentioned in each row of the training data. There was a     problem with this concatenation. Another very important bug fix was in     the method probability_of_feature_value(). The probability calculation     in the previous versions was erroneous for those features that acquired     zero values.  The users of this module may not have noticed this error     in the past if the zero values for the features occurred relatively     infrequently.  This error has now been fixed.   Version 2.2.5:     If you are not using the synthetic data generation feature of this     module, the changes made in this version do not affect you. The code     that was changed is all in the class TrainingDataGeneratorNumeric.  The     changes to this class remove an important bug related to the ordering     of the feature names that are read from a user-supplied parameter     file. The basic Decision Tree construction and classification code     remains unchanged.   Version 2.2.4:     This version should prove more robust when the probability distribution     for the values of a feature is expected to be heavy-tailed; that is,     when the supposedly rare observations can occur with significant     probabilities.  A new option in the DecisionTree constructor lets the     user specify the precision with which the probability distributions     are estimated for such features.   Version 2.2.3:     This version fixes a bug that was caused by the explicitly set zero     values for numerical features being misconstrued as "false" in the     conditional statements in some of the method definitions.   Version 2.2.2:     In response to requests from users, this version includes scripts in     the Examples directory that demonstrate how to carry out bulk     classification of all your test data records placed in a CSV file in     one fell swoop.  Also included are scripts that demonstrate the same     for the data records placed in the old-style `.dat' files.  The main     module code remains unchanged.   Version 2.2.1:     The changes made are all in the part of the module that is used for     evaluating the quality of training data through a 10-fold cross     validation test.  The previous version used the default values for the     constructor parameters when constructing the decision trees in each     iteration of the test. The new version correctly uses the user-supplied     values.   Version 2.2:     This version fixes a bug discovered in the best feature calculator     function. This bug was triggered by certain conditions related to the     distribution of values for the features in a training data file.     Additionally, and VERY IMPORTANTLY, Version 2.2 allows you to test the     quality of your training data by running a 10-fold cross-validation     test on the data.  This test divides all of the training data into ten     parts, with nine parts used for training a decision tree and one part     used for testing its ability to classify correctly. This selection of     nine parts for training and one part for testing is carried out in all     of the ten different possible ways.  This testing functionality in     Version 2.2 can also be used to find the best values to use for the     constructor parameters entropy_threshold, max_depth_desired, and     symbolic_to_numeric_cardinality_threshold.   Version 2.1:     This is a cleaned up version of v. 2.0 of the module. Should run more     efficiently for large training data files that contain both numeric and     symbolic features.   Version 2.0:     This was a major rewrite of the DecisionTree module.  This revision was     prompted by a number of users wanting to see numeric features     incorporated in the construction of decision trees.  So here it is!     This version allows you to use either purely symbolic features, or     purely numeric features, or a mixture of the two. (A feature is numeric     if it can take any floating-point value over an interval.)   Version 1.7.1:     This version includes a fix for a bug that was triggered by certain     comment words in a training data file.  This version also includes     additional safety checks that are useful for catching errors and     inconsistencies in large training data files that do not lend     themselves to manual checking for correctness.  As an example, the new     version makes sure that the number of values you declare in each sample     record matches the number of features declared at the beginning of the     training data file.   Version 1.7:     This version includes safety checks on the consistency of the data you     place in your training data file.  When a training data file contains     thousands of records, it is difficult to manually check that you used     the same class names in your sample records that you declared at the     top of your training file or that the values you have for your features     are legal vis-a-vis the earlier declarations regarding such values in     the training file.  Another safety feature incorporated in this version     is the non-consideration of classes that are declared at the top of the     training file but that have no sample records in the file.   Version 1.6.1:     Fixed a bug in the method that generates synthetic test data.   Version 1.6:     This version includes several upgrades: The module now includes code     for generating synthetic training and test data for experimenting with     the DecisionTree classifier.  Another upgrade in the new version is     that, after training, a decision tree can now be used in an interactive     mode in which the user is asked to supply answers for the feature tests     at the nodes as the classification process descends down the tree.   Version 1.5:     This is a Python 3.x compliant version of the DecisionTree module.     This version should work with both Python 2.x and Python 3.x.   Version 1.0:     This is a Python implementation of the author's Perl module     Algorithm::DecisionTree, Version 1.41.  The Python version should work     faster for large decision trees since it uses probability and entropy     caching much more extensively than Version 1.41 of the Perl module.     (Note: I expect my next release of the Perl module to catch up with     this Python version in terms of performance.) USAGE:     If your training data includes numeric features (a feature is numeric     if it can take any floating point value over an interval), you are     expected to supply your training data through a CSV file and your call     for constructing an instance of the DecisionTree class will look like:         training_datafile = "stage3cancer.csv"         dt = DecisionTree.DecisionTree(                                 training_datafile = training_datafile,                                 csv_class_column_index = 2,                                 csv_columns_for_features = [3,4,5,6,7,8],                                 entropy_threshold = 0.01,                                 max_depth_desired = 8,                                 symbolic_to_numeric_cardinality_threshold = 10,              )     The constructor option `csv_class_column_index' informs the module as     to which column of your CSV file contains the class label.  THE COLUMN     INDEXING IS ZERO BASED.  The constructor option     `csv_columns_for_features' specifies which columns are to be used for     feature values.  The first row of the CSV file must specify the names     of the features.  See examples of CSV files in the `examples'     subdirectory.     The option `symbolic_to_numeric_cardinality_threshold' is also     important.  For the example shown above, if an ostensibly numeric     feature takes on only 10 or fewer different values in your training     datafile, it will be treated like a symbolic feature.  The option     `entropy_threshold' determines the granularity with which the entropies     are sampled for the purpose of calculating entropy gain with a     particular choice of decision threshold for a numeric feature or a     feature value for a symbolic feature.     After you have constructed an instance of the DecisionTree class as     shown above, you read in the training data file and initialize the     probability cache by calling:         dt.get_training_data()         dt.calculate_first_order_probabilities()         dt.calculate_class_priors()     Next you construct a decision tree for your training data by calling:         root_node = dt.construct_decision_tree_classifier()     where root_node is an instance of the DTNode class that is also defined     in the module file.  Now you are ready to classify a new data record.     Let's say that your data record looks like:         test_sample  = ['g2 = 4.2',                         'grade = 2.3',                         'gleason = 4',                         'eet = 1.7',                         'age = 55.0',                         'ploidy = diploid']     You can classify it by calling:         classification = dt.classify(root_node, test_sample)     The call to `classify()' returns a reference to a hash whose keys are     the class names and the values the associated classification     probabilities.  This hash also includes another key-value pair for the     solution path from the root node to the leaf node at which the final     classification was carried out.     If your features are purely symbolic, you can continue to use the same     constructor syntax that was used in the older versions of this module.     However, your old `.dat' training files will not work with the new     version.  The good news is that with just a small fix, you can continue     to use them.  The fix and why it was needed is described in the file     README_for_dat_files in the `examples' directory.  If you are going to     use a `.dat' file for supplying the training data, your constructor     syntax is likely to look like:         training_datafile = "training.dat"         dt = DecisionTree.DecisionTree(                                 training_datafile = training_datafile,                                 entropy_threshold = 0.01,                                 max_depth_desired = 5,              )     You'd still need to make the following calls for reading in the     training data, for initializing the probability cache, and for     constructing the decision tree:         dt.get_training_data()         dt.calculate_first_order_probabilities()         dt.calculate_class_priors()         root_node = dt.construct_decision_tree_classifier()     Now your test sample is likely to look like:         test_sample = ['exercising=never',                        'smoking=heavy',                        'fatIntake=heavy',                        'videoAddiction=heavy']     You'd now call the calssifier as before:         classification = dt.classify(root_node, test_sample)     A decision tree can quickly become much too large (and much too slow to     construct and to yield classification results) if the total number of     features is large and/or if the number of different possible values for     the symbolic features is large.  You can control the size of the tree     through the constructor options `entropy_threshold' and     `max_depth_desired'. The latter option sets the maximum depth of your     decision tree to max_depth_desired value.  The parameter     `entropy_threshold' sets the granularity with which the entropies are     sampled.  Its default value is 0.001.  The larger the value you choose     for entropy_threshold, the smaller the tree. INTRODUCTION:     DecisionTree is a Python module for constructing a decision tree from a     training data file containing multidimensional data in the form of a     table. In one form or another, decision trees have been around for over     fifty years. From a statistical perspective, they are closely related     to classification and regression by recursive partitioning of     multidimensional data. Early work that demonstrated the usefulness of     such partitioning of data for classification and regression can be     traced to the work of Terry Therneau in the early 1980's in the     statistics community, and to the work of Ross Quinlan in the mid 1990's     in the machine learning community,     For those not familiar with decision tree ideas, the traditional way to     classify multidimensional data is to start with a feature space whose     dimensionality is the same as that of the data.  Each feature measures     a specific attribute of an entity.  You use the training data to carve     up the feature space into different regions, each corresponding to a     different class.  Subsequently, when you try to classify a new data     sample, you locate it in the feature space and find the class label of     the region to which it belongs.  One can also give the data point the     same class label as that of the nearest training sample. This is     referred to as the nearest neighbor classification. There exist     hundreds of variations of varying power on these two basic approaches     to the classification of multidimensional data.     A decision tree classifier works differently.  When you construct a     decision tree, you select for the root node a feature test that     partitions the training data in a way that causes maximal     disambiguation of the class labels associated with the data.  In terms     of information content as measured by entropy, such a feature test     would cause maximum reduction in class entropy in going from all of the     training data taken together to the data as partitioned by the feature     test.  You then drop from the root node a set of child nodes, one for     each partition of the training data created by the feature test at the     root node. When your features are purely symbolic, you'll have one     child node for each value of the feature chosen for the feature test at     the root.  When the test at the root involves a numeric feature, you     find the decision threshold for the feature that best bipartitions the     data and you drop from the root node two child nodes, one for each     partition.  Now at each child node you pose the same question that you     posed when you found the best feature to use at the root: Which feature     at the child node in question would maximally disambiguate the class     labels associated with the training data corresponding to that child     node?     As the reader would expect, the two key steps in any approach to     decision-tree based classification are the construction of the decision     tree itself from a file containing the training data, and then using     the decision tree thus obtained for classifying new data.     What is cool about decision tree classification is that it gives you     soft classification, meaning it may associate more than one class label     with a given data record.  When this happens, it may mean that your     classes are indeed overlapping in the underlying feature space.  It     could also mean that you simply have not supplied sufficient training     data to the decision tree classifier.  For a tutorial introduction to     how a decision tree is constructed and used, see     https://engineering.purdue.edu/kak/Tutorials/DecisionTreeClassifiers.pdf WHAT PRACTICAL PROBLEM IS SOLVED BY THIS MODULE?     If you are new to the concept of a decision tree, their practical     utility is best understood with an example that only involves symbolic     features.  However, as mentioned earlier, versions 2.0 and higher of     this module handle both symbolic and numeric features.     Consider the following scenario: Let's say you are running a small     investment company that employs a team of stockbrokers who make     buy/sell decisions for the customers of your company.  Assume that your     company has asked the traders to make each investment decision on the     basis of the following five criteria:             price_to_earnings_ratio   (P_to_E)             price_to_sales_ratio      (P_to_S)             return_on_equity          (R_on_E)             market_share              (M_S)             sentiment                 (S)     Since you are the boss, you keep track of the buy/sell decisions made     by the individual traders.  But one unfortunate day, all of your     traders decide to quit because you did not pay them enough.  So what     are you to do?  If you had a module like the one here, you could still     run your company and do so in such a way that your company would, on     the average, perform better than any of the individual traders who     worked for you previously.  This is what you would need to do: You     would pool together the individual trader buy/sell decisions you     accumulated during the last one year.  This pooled information is     likely to look like:       example      buy/sell     P_to_E     P_to_S     R_on_E     M_S     S       ====================================================================       example_1     buy          high       low        medium    low    high       example_2     buy          medium     medium     low       low    medium       example_3     sell         low        medium     low       high   low       ....       ....     This data would constitute your training file. Assuming that this training     file is called 'training.dat', you would need to feed this file     into the module by calling:         dt = DecisionTree( training_datafile = "training.dat" )         dt.get_training_data()         dt.calculate_first_order_probabilities_for_numeric_features()         dt.calculate_class_priors()     Subsequently, you would construct a decision tree by calling:         root_node = dt.construct_decision_tree_classifier()     Now you and your company (with practically no employees) are ready to     service the customers again. Suppose your computer needs to make a     buy/sell decision about an investment prospect that is best described     by:         price_to_earnings_ratio   =  low         price_to_sales_ratio      =  very_low         return_on_equity          =  none         market_share              =  medium             sentiment                 =  low     All that your computer would need to do would be to construct a data     record like         test_case = [ 'P_to_E=low',                       'P_to_S=very_low',                       'R_on_E=none',                       'M_S=medium',                       'S=low'  ]     and call the decision tree classifier you just constructed by         classification = dt.classify(root_node, test_case)         print "Classification: ", classification     The answer returned will be 'buy' and 'sell', along with the associated     probabilities.  So if the probability of 'buy' is considerably greater     than the probability of 'sell', that's what you should instruct your     computer to do.     The chances are that, on the average, this approach would beat the     performance of any of your individual traders who worked for you     previously since the buy/sell decisions made by the computer would be     based on the collective wisdom of all your previous traders.     DISCLAIMER: There is obviously a lot more to good investing than what     is captured by the silly little example here. However, it does convey     the sense in which the current module can be used. SYMBOLIC FEATURES VERSUS NUMERIC FEATURES:     A feature is symbolic when its values are compared using string     comparison operators.  By the same token, a feature is numeric when its     values are compared using numeric comparison operators.  Having said     that, features that take only a small number of numeric values in the     training data can be treated symbolically provided you are careful     about handling their values in the test data.  At the least, you have     to set the test data value for such a feature to its closest value in     the training data.  For those numeric features that the module is     allowed to treat symbolically, this snapping of the values of the     features in the test data to the small set of values in the training     data is carried out automatically by the module.  That is, after a user     has told the module which numeric features to treat symbolically, the     user need not worry about how the feature values appear in the test     data.     The constructor parameter symbolic_to_numeric_cardinality_threshold     let's you tell the module when to consider an otherwise numeric feature     symbolically. Suppose you set this parameter to 10, that means that all     numeric looking features that take 10 or fewer different values in the     training datafile will be considered to be symbolic features.          See the tutorial at     https://engineering.purdue.edu/kak/Tutorials/DecisionTreeClassifiers.pdf     for further information on the implementation issues related to the     symbolic and numeric features. FEATURES WITH NOT SO "NICE" STATISTICAL PROPERTIES:     For the purpose of estimating the probabilities, it is necessary to     sample the range of values taken on by a numerical feature. For     features with "nice" statistical properties, this sampling interval is     set to the median of the differences between the successive feature     values in the training data.  (Obviously, as you would expect, you     first sort all the values for a feature before computing the successive     differences.)  This logic will not work for the sort of a feature     described below.     Consider a feature whose values are heavy-tailed, and, at the same     time, the values span a million to one range.  What I mean by     heavy-tailed is that rare values can occur with significant     probabilities.  It could happen that most of the values for such a     feature are clustered at one of the two ends of the range. At the same     time, there may exist a significant number of values near the end of     the range that is less populated.  (Typically, features related to     human economic activities --- such as wealth, incomes, etc. --- are of     this type.)  With the logic described in the previous paragraph, you     could end up with a sampling interval that is much too small, which     could result in millions of sampling points for the feature if you are     not careful.     Beginning with Version 2.2.4, you have two options in dealing with such     features.  You can choose to go with the default behavior of the     module, which is to sample the value range for such a feature over a     maximum of 500 points.  Or, you can supply an additional option to the     constructor that sets a user-defined value for the number of points to     use.  The name of the option is "number_of_histogram_bins".  The     following script           construct_dt_for_heavytailed.py     in the Examples directory shows an example of how to call the     constructor of the module with the "number_of_histogram_bins" option. TESTING THE QUALITY OF YOUR TRAINING DATA:     Starting with version 2.2, the module includes a new class named     EvalTrainingData, derived from the main class DecisionTree, that runs a     10-fold cross-validation test on your training data to test its ability     to discriminate between the classes mentioned in the training file.     The 10-fold cross-validation test divides all of the training data into     ten parts, with nine parts used for training a decision tree and one     part used for testing its ability to classify correctly. This selection     of nine parts for training and one part for testing is carried out in     all of the ten different possible ways.       The following code fragment illustrates how you invoke the testing     function of the EvalTrainingData class:         training_datafile = "training3.csv"         eval_data = DecisionTree.EvalTrainingData(                                 training_datafile = training_datafile,                                 csv_class_column_index = 1,                                 csv_columns_for_features = [2,3],                                 entropy_threshold = 0.01,                                 max_depth_desired = 3,                                 symbolic_to_numeric_cardinality_threshold = 10,                     )         eval_data.get_training_data()         eval_data.evaluate_training_data()     The last statement above prints out a Confusion Matrix and the value of     Training Data Quality Index on a scale of 100, with 100 designating     perfect training data.  The Confusion Matrix shows how the different     classes were mis-identified in the 10-fold cross-validation test.     This testing functionality can also be used to find the best values to     use for the constructor parameters entropy_threshold,     max_depth_desired, and symbolic_to_numeric_cardinality_threshold.     The following two scripts in the Examples directory illustrate the use     of the EvalTrainingData class for testing the quality of your data:         evaluate_training_data1.py         evaluate_training_data2.py HOW TO MAKE THE BEST CHOICES FOR THE CONSTRUCTOR PARAMETERS:     Assuming your training data is good, the quality of the results you get     from a decision tree would depend on the choices you make for the     constructor parameters entropy_threshold, max_depth_desired, and     symbolic_to_numeric_cardinality_threshold.  You can optimize your     choices for these parameters by running the 10-fold cross-validation     test that is made available in Versions 2.2 and higher through the new     class EvalTrainingData that is included in the module file.  A     description of how to run this test is in the section titled "TESTING     THE QUALITY OF YOUR TRAINING DATA" of this document. DECISION TREE INTROSPECTION:     Starting with Version 2.3, you can ask the DTIntrospection class of the     module to explain the classification decisions made at the different     nodes of the decision tree.     Perhaps the most important bit of information you are likely to seek     through DT introspection is the list of the training samples that fall     directly in the portion of the feature space that is assigned to a     node.     However, note that, when training samples are non-uniformly distributed     in the underlying feature space, it is possible for a node to exist     even when there are no training samples in the portion of the feature     space assigned to the node.  That is because the decision tree is     constructed from the probability densities estimated from the training     data.  When the training samples are non-uniformly distributed, it is     entirely possible for the estimated probability densities to be     non-zero in a small region around a point even when there are no     training samples specifically in that region.  (After you have created     a statistical model for, say, the height distribution of people in a     community, the model may return a non-zero probability for the height     values in a small interval even if the community does not include a     single individual whose height falls in that interval.)     That a decision-tree node can exist even where there are no training     samples in the feature space is an important contributor to the     generalization abilities of decision-tree-based classification.     In light of the explanation provided above, before the DTIntrospection     class supplies any answers at all, it asks you to accept the fact that     features can take on non-zero probabilities at a point in the feature     space even though there are zero training samples at that point (or in     a small region around that point).  If you do not accept this     rudimentary fact, the introspection class will not yield any answers     (since you are not going to believe the answers anyway).     The point made above implies that the path leading to a node in the     decision tree may test a feature for a certain value or threshold     despite the fact that the portion of the feature space assigned to that     node is devoid of any training data.     See the following two scripts in the Examples directory for how to     carry out DT introspection:         decision_tree_introspection_loop.py         decision_tree_introspection_single_node.py     The first script puts you in an interactive session in which you will     be asked for the node numbers you are interested in.  At each node, you     will be asked for whether or not you are interested in specific     questions that the introspector can provide answers for. The second     script does the same but only at a specific node.           METHODS:     The module provides the following methods for constructing a decision     tree from training data in a disk file, and for data classification with     the decision tree. Constructing a decision tree:         dt = DecisionTree( training_datafile = training_datafile,                            csv_class_column_index = 2,                            csv_columns_for_features = [3,4,5,6,7,8],                            entropy_threshold = 0.01,                            max_depth_desired = 8,                            symbolic_to_numeric_cardinality_threshold = 10,                          )     This yields a new instance of the DecisionTree class.  For this call to     make sense, the training data in the training datafile must be conform     to a certain format.  For example, the first row must name the     features.  It must begin with the empty string `""' as shown by the CSV     files in the Examples subdirectory.  The first column for all     subsequent rows must carry a unique integer identifier for each data     record.  When your features are purely symbolic, you are also allowed     to use the `.dat' files that were used in the previous versions of this     module.     The constructor option csv_class_column_index supplies to the module     zero-based index of the column that contains the class label for the     training data records. In the example shown above, the class labels are     in the third column.  The option csv_columns_for_features tells the     module which of the features are supposed to be used for decision tree     construction.  The constructor option max_depth_desired sets the     maximum depth of the decision tree. The parameter entropy_threshold     sets the granularity with which the entropies are sampled.  The     parameter symbolic_to_numeric_cardinality_threshold allows the module     to treat an otherwise numeric feature symbolically if it only takes a     small number of different values in the training data file.  For the     constructor call shown above, if a feature takes on only 10 or fewer     different values in the training data file, it will be treated like a     symbolic feature. The constructor parameters:     training_datafile:         This parameter supplies the name of the file that contains the         training data.  This must be a CSV file if your training data         includes both numeric and symbolic features.  If your data is         purely symbolic, you can use the old-style `.dat' file.     csv_class_column_index:         When using a CSV file for your training data, this parameter         supplies the zero-based column index for the column that contains         the class label for each data record in the training file.     csv_columns_for_features:         When using a CSV file for your training data, this parameter         supplies a list of columns corresponding to the features you wish         to use for decision tree construction.  Each column is specified by         its zero-based index.     entropy_threshold:         This parameter sets the granularity with which the entropies are         sampled by the module.  For example, a feature test at a node in         the decision tree is acceptable if the entropy gain achieved by the         test exceeds this threshold.  The larger the value you choose for         this parameter, the smaller the tree.  Its default value is 0.001.     max_depth_desired:         This parameter sets the maximum depth of the decision tree.  For         obvious reasons, the smaller the value you choose for this         parameter, the smaller the tree.     symbolic_to_numeric_cardinality_threshold:         This parameter allows the module to treat an otherwise numeric         feature symbolically if the number of different values the feature         takes in the training data file does not exceed the value of this         parameter.     number_of_histogram_bins:         This parameter gives the user the option to set the number of         points at which the value range for a feature should be sampled for         estimating the probabilities.  This parameter is effective only for         those features that occupy a large value range and whose         probability distributions are heavy tailed.     You can choose the best values to use for the constructor parameters by     running a 10-fold cross-validation test on your training data through     the embedded class EvalTrainingData that comes with Versions 2.2 and     higher of this module.  See the section "TESTING THE QUALITY OF YOUR     TRAINING DATA" of this document page. Reading in the training data:     After you have constructed a new instance of the DecisionTree class,     you must now read in the training data that is contained in the file     named above.  This you do by:         dt.get_training_data()     IMPORTANT: The training data file must be in a format that makes sense     to the decision tree constructor.  If you use numeric features, you     must use a CSV file for supplying the training data.  The first row of     such a file must name the features and it must begin with the empty     string `""' as shown in the `stage3cancer.csv' file in the Examples     subdirectory.  The first column for all subsequent rows must carry a     unique integer identifier for each training record. Initializing the probability cache:     After a call to the constructor and the get_training_data() method, you     must call the following methods for initializing the probabilities:         dt.calculate_first_order_probabilities()         dt.calculate_class_priors() Displaying the training data:     If you wish to see the training data that was just digested by the     module, call         dt.show_training_data() Constructing a decision-tree classifier:     After the training data is ingested, it is time to construct a decision     tree classifier.  This you do by         root_node = dt.construct_decision_tree_classifier()     This call returns an instance of type DTNode.  The DTNode class is     defined within the main package file, at its end.  So, don't forget,     that root_node in the above example call will be instantiated to an     instance of type DTNode. Displaying the decision tree:     You display a decision tree by calling         root_node.display_decision_tree("   ")     This displays the decision tree in your terminal window by using a     recursively determined offset for each node as the display routine     descends down the tree.     I have intentionally left the syntax fragment root_node in the above     call to remind the reader that display_decision_tree() is NOT called on     the instance of the DecisionTree we constructed earlier, but on the     Node instance returned by the call to     construct_decision_tree_classifier(). Classifying new data:     You classify new data by first constructing a new data record:         test_sample  = ['g2 = 4.2',                         'grade = 2.3',                         'gleason = 4',                         'eet = 1.7',                         'age = 55.0',                         'ploidy = diploid']     and calling the classify() method as follows:         classification = dt.classify(root_node, test_sample)     where, again, root_node is an instance of type Node that was returned     by calling construct_decision_tree_classifier().  The variable     classification is a dictionary whose keys are the class labels and     whose values the associated probabilities.  You can print it out by         print "Classification: ", classification Displaying the number of nodes created:     You can print out the number of nodes in a decision tree by calling         root_node.how_many_nodes() Using the decision tree interactively:     Starting with Version 1.6 of the module, you can use the DecisionTree     classifier in an interactive mode.  In this mode, after you have     constructed the decision tree, the user is prompted for answers to the     questions regarding the feature tests at the nodes of the tree.     Depending on the answer supplied by the user at a node, the classifier     takes a path corresponding to the answer to descend down the tree to     the next node, and so on.  The following method makes this mode     possible.  Obviously, you can call this method only after you have     constructed the decision tree.         dt.classify_by_asking_questions(root_node) Generating synthetic training data:     To generate synthetic training data, you first construct an instance of     the class TrainingDataGenerator that is incorporated in the     DecisionTree module.  A call to the constructor of this class will look     like:         parameter_file = "param_numeric.txt"         output_csv_file = "training.csv";         training_data_gen = TrainingDataGeneratorNumeric(                               output_csv_file   = output_csv_file,                               parameter_file    = parameter_file,                               number_of_samples_per_class = some_number,                             )         training_data_gen.read_parameter_file_numeric()         training_data_gen.gen_numeric_training_data_and_write_to_csv()     The training data that is generated is according to the specifications     described in the parameter file.  The structure of this file must be as     shown in the file `param_numeric.txt' for the numeric training data and     as shown in `param_symbolic.txt' for the case of symbolic training     data.  Both these example parameter files are in the 'Examples'     subdirectory.  The parameter file names the classes, the features for     the classes, and the possible values for the features.     If you want to generate purely symbolic training data, here is the     constructor call to make:         parameter_file = "param_symbolic.txt"         output_data_file = "training.dat";         training_data_gen = TrainingDataGeneratorSymbolic(                               output_datafile   = output_data_file,                               parameter_file    = parameter_file,                               write_to_file     = 1,                               number_of_training_samples = some_number,                             )         training_data_gen.read_parameter_file_symbolic()         training_data_gen.gen_symbolic_training_data()         training_data_gen.write_training_data_to_file() Generating synthetic test data:     To generate synthetic test data, you first construct an instance of the     class TestDataGeneratorSymbolic that is incorporated in the     DecisionTree module.  A call to the constructor of this class will look     like:         test_data_gen = TestDataGeneratorSymbolic(                           output_test_datafile     = an_output_data_file,                           output_class_labels_file = a_file_for_class_labels,                           parameter_file           = a_parameter_file,                           write_to_file            = 1,                           number_of_test_samples = some_number,                         )     The main difference between the training data and the test data is that     the class labels are NOT mentioned in the latter.  Instead, the class     labels are placed in a separate file whose name is supplied through the     constructor option `output_class_labels_file' shown above.  The test     data that is generated is according to the specifications described in     the parameter file.  In general, this parameter file would be the same     that you used for generating the training data.      Constructing and using an instance of the DTIntrospection class:     To construct an instance of DTIntrospection, you call         introspector = DecisionTree.DTIntrospection(dt)     where you supply the instance of the DecisionTree class you used for     constructing the decision tree through the parameter dt.  After you     have constructed an instance of the introspection class, you must     initialize it by         initialize()     on the introspector instance. Subsequently, you can invoke either of     the following methods:         explain_classification_at_one_node(node)         explain_classifications_at_multiple_nodes_interactively()     depending on whether you want introspection at a single specified node     or inside an infinite loop for an arbitrary number of nodes. BULK CLASSIFICATION OF DATA RECORDS     For large test datasets, you would obviously want to process an entire     file of test data at a time. The following scripts in the Examples     directory illustrate how you can do that:       classify_test_data_in_a_file_numeric.py       classify_test_data_in_a_file_symbolic.py     These scripts require three command-line arguments, the first argument     names the training datafile, the second the test datafile, and the     third the file in which the classification results are to be deposited.     The first script above is for the case of numeric/symbolic features and     the second for the purely symbolic features.  An important point to     remember when using these scripts for bulk classification is that the     test file must have a column for class labels.  In real-life     situations, obviously, the entries in that column in the test file will     be just the empty string "". HOW THE CLASSIFICATION RESULTS ARE DISPLAYED     It depends on whether you apply the classifier at once to all the data     records in a file, or whether you feed one data record at a time into     the classifier.     In general, the classifier returns soft classification for a test data     record.  What that means is that, in general, the classifier will list     all the classes to which a given data record could belong and the     probability of each such class label for the data record. Run the     examples scripts in the Examples directory to see how the output of     classification can be displayed.          With regard to the soft classifications returned by this classifier, if     the probability distributions for the different classes overlap in the     underlying feature space, you would want the classifier to return all     of the applicable class labels for a test data record along with the     corresponding class probabilities.  (However, keep in mind the fact     that the decision tree classifier may associate significant     probabilities with multiple class labels for a given test data record     if the training file contains an inadequate number of training samples     for one or more classes.)  The good thing is that the classifier would     not lie to you (unlike, say, a hard classification rule that would     return a single class label corresponding to the partitioning of the     underlying feature space).  The decision tree classifier will give you     the best classification that can be made given the training data you     feed into it. THE EXAMPLES DIRECTORY:     See the 'Examples' directory in the distribution for how to construct a     decision tree, and how to then classify new data using the decision     tree.  To become more familiar with the module, run the scripts         construct_dt_and_classify_one_sample_case1.py         construct_dt_and_classify_one_sample_case2.py         construct_dt_and_classify_one_sample_case3.py         construct_dt_and_classify_one_sample_case4.py     The first script is for the purely symbolic case, the second for the     case that involves both numeric and symbolic features, the third for     the case of purely numeric features, and the last for the case when the     training data is synthetically generated by the script     generate_training_data_numeric.py     Next, run the following script as it is for bulk classification of data     records placed in a CSV file:       classify_test_data_in_a_file_numeric.py  training4.csv  test4.csv  out4.csv     The script first constructs a decision tree using the training data in     the first-argument file, `training4.csv'.  Subsequently, the script     calculates the class labels for each of the test records in the file     `test4.csv'.  The class labels are written out the file `out4.csv'.  An     important thing to note here that your test file --- in this case     `test4.csv' --- must have a column for the class labels.  Obviously, in     real-life situations, there will be no class labels in this column.     When that is the case, you can place the empty string "" for each data     record in this column.  A demonstration of that is give by the     following variation of the above call:       classify_test_data_in_a_file_numeric.py  training4.csv  test4_no_class_labels.csv  out4.csv     If you want to use the old-style `.dat' files for the purely symbolic case,     you can do bulk classifications with those files also, as demonstrated by     the following examples:       classify_test_data_in_a_file_symbolic.py  training4.dat  test4.dat  out4.dat       classify_test_data_in_a_file_symbolic.py  training4.dat  test4_no_class_labels.dat  out4.dat     The point of the second example is to show the format of the test data     file must be identical to that of the training data file, in the sense     that it must have a column for the class labels even when those labels     are just empty strings "".     The following script in the 'Examples' directory         classify_by_asking_questions.py     shows how you can use a decision-tree classifier interactively.  In     this mode, you first construct the decision tree from the training data     and then the user is prompted for answers to the feature tests at the     nodes of the tree.     If your training data has a feature whose values span a large range     and, at the same time, are characterized by a heavy-tail distribution,     you should look at the script         construct_dt_for_heavytailed.py     to see how to use the option number_of_histogram_bins in the call to     the constructor.  This option was introduced in Version 2.2.4 for     dealing with such features.  If you do not set this option, the module     will use the default value of 500 for the number of points at which to     sample the value range for such a feature.     The 'Examples' directory also contains the following scripts:         generate_training_data_numeric.py         generate_training_data_symbolic.py         generate_test_data_symbolic.py     that show how you can use the module to generate synthetic training and     test data.  Synthetic training and test data are generated according to     the specifications laid out in a parameter file.  There are constraints     on how the information is laid out in the parameter file.  See the     files `param_numeric.txt' and `param_symbolic.txt' in the 'Examples'     directory for how to structure these files.     The Examples directory of Versions 2.2 and higher of the DecisionTree     module also contains the following two scripts:        evaluate_training_data1.py        evaluate_training_data2.py     that illustrate how the Python class EvalTrainingData can be used to     evaluate the quality of your training data (as long as it resides in a     `.csv' file.)  This new class is a subclass of the DecisionTree class     in the module file.  See the README in the Examples directory for     further information regarding these two scripts.     The Examples directory of Versions 2.3 and higher of the DecisionTree     module contains the following two scripts:         decision_tree_introspection_loop.py         decision_tree_introspection_single_node.py     that illustrate how to use the DTIntrospection class of the module for     generating explanations for the classification decisions at the nodes     of the decision tree.  The first script puts you in an interactive     session in which you will be asked for the node numbers you are     interested in.  At each node, you will be asked for whether or not you     are interested in specific questions that the introspector can provide     answers for. The second script does the same but only at a specific     node. INSTALLATION:     The DecisionTree class was packaged using setuptools.  For     installation, execute the following command-line in the source     directory (this is the directory that contains the setup.py file after     you have downloaded and uncompressed the package):             python setup.py install     You have to have root privileges for this to work.  On Linux     distributions, this will install the module file at a location that     looks like              /usr/lib/python2.7/dist-packages/     If you do not have root access, you have the option of working directly     off the directory in which you downloaded the software by simply     placing the following statements at the top of your scripts that use     the DecisionTree class:             import sys             sys.path.append( "pathname_to_DecisionTree_directory" )     To uninstall the module, simply delete the source directory, locate     where the DecisionTree module was installed with "locate DecisionTree"     and delete those files.  As mentioned above, the full pathname to the     installed version is likely to look like     /usr/lib/python2.7/dist-packages/DecisionTree*     If you want to carry out a non-standard install of the DecisionTree     module, look up the on-line information on Disutils by pointing your     browser to               http://docs.python.org/dist/dist.html BUGS:     Please notify the author if you encounter any bugs.  When sending     email, please place the string 'DecisionTree' in the subject line. ACKNOWLEDGMENTS:     The importance of the 'sentiment' feature in the "What Practical Problem     is Solved by this Module" section was mentioned to the author by John     Gorup.  Thanks John.     Thanks go to Wenshuai Hou for discovering and reporting the bug that     resulted in Version 2.2.3.  I should also thank Wenshuai for sending me     a training data file with heavy-tailed values for one of the     features. This datafile became the reason for the modifications to the     code that are incorporated in Version 2.2.4.     I owe many thanks to Eugene Kolotinsky for the bug discovery whose fix     was the primary reason for Version 2.2.6.  This bug related to the     erroneous calculation of the probability of a feature acquiring a     certain value if the training data contained zeros for the feature     values. AUTHOR:     Avinash Kak, kak@purdue.edu     If you send email, please place the string "DecisionTree" in your     subject line to get past my spam filter. COPYRIGHT:     Python Software Foundation License     Copyright 2015 Avinash Kak

`Download Version 2.3: gztar bztar`	`Total number of downloads (all versions): 13760` `This count is automatically updated at every rotation of the weblogs (normally once every two to four days) Last updated: Thu Jun 18 06:01:01 EDT 2026`

Imported Modules

functools
itertools
math
operator
re
sys

Classes



__builtin__.object

DTIntrospection
DTNode
DecisionTree

EvalTrainingData

TestDataGeneratorSymbolic
TrainingDataGeneratorNumeric
TrainingDataGeneratorSymbolic

class DTIntrospection(__builtin__.object)

    Instances constructed from this class can provide explanations for the classification decisions at the nodes of a decision tree.  At this time, the DT introspection made possible by this class provides answers to the following three questions: (1) List of the training samples that fall in the portion of the feature space that corresponds to the node; (2) The probabilities associated with the last feature test that led to the node; and (3) The class probabilities predicated on just the last feature test. IMPORTANT CAVEAT: It is possible for a node to exist even when there are no training samples in the portion of the feature space that corresponds to the node.  That is because a decision tree is based on the probability densities estimated from the training data. When training data is non-uniformly distributed, it is possible for the probability associated with a point in the feature space to be non-zero even when there are no training samples at or in the vicinity of that point. For a node to exist even where there are no training samples at the node may be thought of as the generalization ability of decision-tree based classification.

Methods defined here:

__init__(self, dt)

explain_classification_at_one_node(self, node_id)

explain_classifications_at_multiple_nodes_interactively(self)

extract_feature_op_val(self, feature_value_combo)

get_samples_for_feature_value_combo(self, feature_value_combo)

initialize(self)

recursive_descent(self, node)

Data descriptors defined here:

__dict__

dictionary for instance variables (if defined)

__weakref__

list of weak references to the object (if defined)

class DTNode(__builtin__.object)

    The nodes of the decision tree are instances of this class:

Methods defined here:

__init__(self, feature, entropy, class_probabilities, branch_features_and_values_or_thresholds, root_or_not=None)

add_child_link(self, new_node)

delete_all_links(self)

display_decision_tree(self, offset)

display_node(self)

get_branch_features_and_values_or_thresholds(self)

get_children(self)

get_class_probabilities(self)

get_feature(self)
Returns the feature test at the current node

get_next_serial_num(self)

get_node_entropy(self)

get_serial_num(self)

set_feature(self, feature)

Static methods defined here:

get_class_names()

how_many_nodes()

initialize_DTNode_class()

set_class_names(class_names_list)

Data descriptors defined here:

__dict__

dictionary for instance variables (if defined)

__weakref__

list of weak references to the object (if defined)

Data and other attributes defined here:

class_names = None

nodes_created = -1

class DecisionTree(__builtin__.object)

    Methods defined here:

__init__(self, *args, **kwargs)

best_feature_calculator(self, features_and_values_or_thresholds_on_branch, existing_node_entropy)
This is the heart of the decision tree constructor.  Its main job is to figure out the best feature to use for partitioning the training data samples that correspond to the current node.  The search for the best feature is carried out differently for symbolic features and for numeric features.  For a symbolic feature, the method estimates the entropy for each value of the feature and then averages out these entropies as a measure of the discriminatory power of that features.  For a numeric feature, on the other hand, it estimates the entropy reduction that can be achieved if we were to partition the set of training samples at each possible threshold for that numeric feature.  For a numeric feature, all possible sampling points relevant to the node in question are considered as candidates for thresholds.

calculate_class_priors(self)

calculate_first_order_probabilities(self)

class_entropy_for_a_given_sequence_of_features_and_values_or_thresholds(self, array_of_features_and_values_or_thresholds)

class_entropy_for_greater_than_threshold_for_feature(self, array_of_features_and_values_or_thresholds, feature, threshold)

class_entropy_for_less_than_threshold_for_feature(self, array_of_features_and_values_or_thresholds, feature, threshold)

class_entropy_on_priors(self)

classify(self, root_node, features_and_values)
Classifies one test sample at a time using the decision tree constructed from your training file.  The data record for the test sample must be supplied as shown in the scripts in the `Examples' subdirectory.  See the scripts construct_dt_and_classify_one_sample_caseX.py in that subdirectory.

classify_by_asking_questions(self, root_node)
If you want classification to be carried out by engaging a human user in a question-answer session, this is the method to use for that purpose.  See the script classify_by_asking_questions.py in the Examples subdirectory for an illustration of how to do that.

construct_decision_tree_classifier(self)
At the root node, we find the best feature that yields the greatest reduction in class entropy from the entropy based on just class priors. The logic for finding this feature is different for symbolic features and for numeric features.  That logic is built into the best feature calculator.

determine_data_condition(self)
This method estimates the worst-case fan-out of the decision tree taking into account the number of values (and therefore the number of branches emanating from a node) for the symbolic features.

entropy_scanner_for_a_numeric_feature(self, feature)

find_bounded_intervals_for_numeric_features(self, arr)
Given a list of branch attributes for the numeric features of the form, say, ['g2<1','g2<2','g2<3','age>34','age>36','age>37'], this method returns the smallest list that is relevant for the purpose of calculating the probabilities.  To explain, the probability that the feature `g2' is less than 1 AND, at the same time, less than 2, AND, at the same time, less than 3, is the same as the probability that the feature less than 1. Similarly, the probability that 'age' is greater than 34 and also greater than 37 is the same as `age' being greater than 37.

get_class_names(self)

get_training_data(self)
If your training data is purely symbolic, as in Version 1.7.1, you might find it easier to create a `.dat' file.  For purely numeric data, or mixed symbolic and numeric data, you MUST use a `.csv' file.  See examples of these files in the `Examples' subdirectory.

get_training_data_from_csv(self)

get_training_data_from_dat(self)
Meant for purely symbolic data (as in all versions up to v. 1.7.1)

interactive_recursive_descent_for_classification(self, node, answer, scratchpad_for_numerics)

prior_probability_for_class(self, class_name)

probability_of_a_class_given_sequence_of_features_and_values_or_thresholds(self, class_name, array_of_features_and_values_or_thresholds)

probability_of_a_sequence_of_features_and_values_or_thresholds(self, array_of_features_and_values_or_thresholds)
This method requires that all truly numeric types only be expressed as '<' or '>' constructs in the array of branch features and thresholds

probability_of_a_sequence_of_features_and_values_or_thresholds_given_class(self, array_of_features_and_values_or_thresholds, class_name)
This method requires that all truly numeric types only be expressed as '<' or '>' constructs in the array of branch features and thresholds

probability_of_feature_less_than_threshold(self, feature_name, threshold)

probability_of_feature_less_than_threshold_given_class(self, feature_name, threshold, class_name)

probability_of_feature_value(self, feature_name, value)

probability_of_feature_value_given_class(self, feature_name, feature_value, class_name)

recursive_descent(self, node)
After the root node of the decision tree is constructed by the previous methods, we invoke this method recursively to create the rest of the tree. At each node, we find the feature that achieves the largest entropy reduction with regard to the partitioning of the training data samples that correspond to that node.

recursive_descent_for_classification(self, node, feature_and_values, answer)

show_training_data(self)

Data descriptors defined here:

__dict__

dictionary for instance variables (if defined)

__weakref__

list of weak references to the object (if defined)

class EvalTrainingData(DecisionTree)


Method resolution order:

EvalTrainingData

DecisionTree

__builtin__.object

Methods defined here:

__init__(self, *args, **kwargs)

evaluate_training_data(self)

Methods inherited from DecisionTree:

best_feature_calculator(self, features_and_values_or_thresholds_on_branch, existing_node_entropy)
This is the heart of the decision tree constructor.  Its main job is to figure out the best feature to use for partitioning the training data samples that correspond to the current node.  The search for the best feature is carried out differently for symbolic features and for numeric features.  For a symbolic feature, the method estimates the entropy for each value of the feature and then averages out these entropies as a measure of the discriminatory power of that features.  For a numeric feature, on the other hand, it estimates the entropy reduction that can be achieved if we were to partition the set of training samples at each possible threshold for that numeric feature.  For a numeric feature, all possible sampling points relevant to the node in question are considered as candidates for thresholds.

calculate_class_priors(self)

calculate_first_order_probabilities(self)

class_entropy_for_a_given_sequence_of_features_and_values_or_thresholds(self, array_of_features_and_values_or_thresholds)

class_entropy_for_greater_than_threshold_for_feature(self, array_of_features_and_values_or_thresholds, feature, threshold)

class_entropy_for_less_than_threshold_for_feature(self, array_of_features_and_values_or_thresholds, feature, threshold)

class_entropy_on_priors(self)

classify(self, root_node, features_and_values)
Classifies one test sample at a time using the decision tree constructed from your training file.  The data record for the test sample must be supplied as shown in the scripts in the `Examples' subdirectory.  See the scripts construct_dt_and_classify_one_sample_caseX.py in that subdirectory.

classify_by_asking_questions(self, root_node)
If you want classification to be carried out by engaging a human user in a question-answer session, this is the method to use for that purpose.  See the script classify_by_asking_questions.py in the Examples subdirectory for an illustration of how to do that.

construct_decision_tree_classifier(self)
At the root node, we find the best feature that yields the greatest reduction in class entropy from the entropy based on just class priors. The logic for finding this feature is different for symbolic features and for numeric features.  That logic is built into the best feature calculator.

determine_data_condition(self)
This method estimates the worst-case fan-out of the decision tree taking into account the number of values (and therefore the number of branches emanating from a node) for the symbolic features.

entropy_scanner_for_a_numeric_feature(self, feature)

find_bounded_intervals_for_numeric_features(self, arr)
Given a list of branch attributes for the numeric features of the form, say, ['g2<1','g2<2','g2<3','age>34','age>36','age>37'], this method returns the smallest list that is relevant for the purpose of calculating the probabilities.  To explain, the probability that the feature `g2' is less than 1 AND, at the same time, less than 2, AND, at the same time, less than 3, is the same as the probability that the feature less than 1. Similarly, the probability that 'age' is greater than 34 and also greater than 37 is the same as `age' being greater than 37.

get_class_names(self)

get_training_data(self)
If your training data is purely symbolic, as in Version 1.7.1, you might find it easier to create a `.dat' file.  For purely numeric data, or mixed symbolic and numeric data, you MUST use a `.csv' file.  See examples of these files in the `Examples' subdirectory.

get_training_data_from_csv(self)

get_training_data_from_dat(self)
Meant for purely symbolic data (as in all versions up to v. 1.7.1)

interactive_recursive_descent_for_classification(self, node, answer, scratchpad_for_numerics)

prior_probability_for_class(self, class_name)

probability_of_a_class_given_sequence_of_features_and_values_or_thresholds(self, class_name, array_of_features_and_values_or_thresholds)

probability_of_a_sequence_of_features_and_values_or_thresholds(self, array_of_features_and_values_or_thresholds)
This method requires that all truly numeric types only be expressed as '<' or '>' constructs in the array of branch features and thresholds

probability_of_a_sequence_of_features_and_values_or_thresholds_given_class(self, array_of_features_and_values_or_thresholds, class_name)
This method requires that all truly numeric types only be expressed as '<' or '>' constructs in the array of branch features and thresholds

probability_of_feature_less_than_threshold(self, feature_name, threshold)

probability_of_feature_less_than_threshold_given_class(self, feature_name, threshold, class_name)

probability_of_feature_value(self, feature_name, value)

probability_of_feature_value_given_class(self, feature_name, feature_value, class_name)

recursive_descent(self, node)
After the root node of the decision tree is constructed by the previous methods, we invoke this method recursively to create the rest of the tree. At each node, we find the feature that achieves the largest entropy reduction with regard to the partitioning of the training data samples that correspond to that node.

recursive_descent_for_classification(self, node, feature_and_values, answer)

show_training_data(self)

Data descriptors inherited from DecisionTree:

__dict__

dictionary for instance variables (if defined)

__weakref__

list of weak references to the object (if defined)

class TestDataGeneratorSymbolic(__builtin__.object)

    This convenience class does basically the same thing as the TrainingDataGeneratorSymbolic except that it places the class labels for the sample records in a separate file.  Let's say you have already created a DT classifier and you would like to test its class discriminatory power.  You can use the classifier to calculate the class labels for the data records by the class shown here.  And then you can you can compare those class labels with those placed originally by this class in a separate file.  See the script generate_test_data_symbolic.py for how to use this class.

Methods defined here:

__init__(self, *args, **kwargs)

find_longest_value(self)

gen_test_data(self)
This method generates the test data according to the specifications laid out in the parameter file read by the previous method.

read_parameter_file(self)
This methods reads the parameter file for generating the test data.

write_test_data_to_file(self)

Data descriptors defined here:

__dict__

dictionary for instance variables (if defined)

__weakref__

list of weak references to the object (if defined)

class TrainingDataGeneratorNumeric(__builtin__.object)

    See the example script generate_training_data_numeric.py on how to use this class for generating your numeric training data.  The training data is generator in accordance with the specifications you place in a parameter file.

Methods defined here:

__init__(self, *args, **kwargs)

gen_numeric_training_data_and_write_to_csv(self)
After the parameter file is parsed by the previous method, this method calls on `numpy.random.multivariate_normal()' to generate the training data samples. Your training data can be of any number of of dimensions, can have any mean, and any covariance.

read_parameter_file_numeric(self)
The training data generated by an instance of the class TrainingDataGeneratorNumeric is based on the specs you place in a parameter that you supply to the class constructor through a constructor variable called `parameter_file.  This method is for parsing the parameter file in order to order to determine the names to be used for the different data classes, their means, and their variances.

Data descriptors defined here:

__dict__

dictionary for instance variables (if defined)

__weakref__

list of weak references to the object (if defined)

class TrainingDataGeneratorSymbolic(__builtin__.object)

    See the sample script generate_training_data_symbolic.py for how to use this class for generating symbolic training data.  The data is generated according to the specifications you place in a parameter file.

Methods defined here:

__init__(self, *args, **kwargs)

find_longest_feature_or_value(self)

gen_symbolic_training_data(self)
This method generates the training data according to the specifications placed in the parameter file that is read by the previous method.

read_parameter_file_symbolic(self)
Read the parameter file for generating symbolic training data. See the script generate_training_data_symbolic.py in the Examples directory for how to pass the name of the parameter file to the constructor of the TrainingDataGeneratorSymbolic class.

write_training_data_to_file(self)

Data descriptors defined here:

__dict__

dictionary for instance variables (if defined)

__weakref__

list of weak references to the object (if defined)

Functions

closest_sampling_point(value, arr)

convert(value)

deep_copy_array(array_in)
Meant only for an array of scalars (no nesting):

minimum(arr)
Returns simultaneously the minimum value and its positional index in an array. [Could also have used min() and index() defined for Python's sequence types.]

sample_index(sample_name)
When the training data is read from a CSV file, we assume that the first column of each data record contains a unique integer identifier for the record in that row. This training data is stored in a dictionary whose keys are the prefix 'sample_' followed by the identifying integers. For the data in the old-style `.dat' files, we assume that each record begins with the string `sample_xx' where `xx' is a unique integer. In both cases, the purpose of this function is to return the identifying integer associated with a data record.

Data

__author__ = 'Avinash Kak (kak@purdue.edu)'
__copyright__ = '(C) 2015 Avinash Kak. Python Software Foundation.'
__date__ = '2015-March-15'
__url__ = 'https://engineering.purdue.edu/kak/distDT/DecisionTree-2.3.html'
__version__ = '2.3'

Author

Avinash Kak (kak@purdue.edu)

Data
		__author__ = 'Avinash Kak (kak@purdue.edu)' __copyright__ = '(C) 2015 Avinash Kak. Python Software Foundation.' __date__ = '2015-March-15' __url__ = 'https://engineering.purdue.edu/kak/distDT/DecisionTree-2.3.html' __version__ = '2.3'