 
          IN ORDER TO BECOME FAMILIAR WITH THE DecisionTree MODULE
          ========================================================


(1) First run the scripts

        construct_dt_and_classify_one_sample_case1.py

        construct_dt_and_classify_one_sample_case2.py

        construct_dt_and_classify_one_sample_case3.py

        construct_dt_and_classify_one_sample_case4.py

    as they are.  The first script is for the purely symbolic case, the
    second for a case that involves both numeric and symbolic features, the
    third for the case of purely numeric features, and the last for the
    case when the training data is synthetically generated by the script
    generate_training_data_numeric.py

    Next, try to modify the test sample in these scripts and see what
    classification results you get for the new test samples.



(2) The second and the third scripts listed above use the training file
    `stage3cancer.csv'. The first script named above uses the training
    file `training.dat'.  And the last script named above uses the training
    data file `training.csv'.  These training files serve as examples for:

       stage3cancer.csv:   Example of a CSV training data file with both 
                           symbolic and numeric features

       training.csv    :   Example of a CSV training data file for the
                           purely numeric case.  Contains two classes, each
                           a Gaussian distribution in 2D.  The parameters of
                           the two Gaussians are in the file: 
                           `param_numeric.txt'

       training.dt     :   Example of a `.dat' file.

    Note again that you can use a CSV file for the cases of purely symbolic
    data, purely numeric data, or a mixture of the two. However, a `.dat' 
    file can only be used for symbolic data.

    There are two additional training data files in the directory:

          training2.csv

          training3.csv

    These are similar to the file `training.csv' in the sense that 
    they both contain two classes, each a 2D Gaussian distribution.
    The first, `training2.csv' was generated by 
    `generate_training_data_numeric.py ' using the parameter file

           param_numeric_strongly_overlapping_classes.txt

    and the second, `training3.csv' was generated by the script using
    the parameter file

           param_numeric_extremely_overlapping_classes.txt

    Study the training datafiles carefully.  Now create your own 
    datafiles that follow the formatting guidelines in these data files
    and create scripts similar to those in Item (1) above for 
    working with your data.




(3) So far we have talked about classifying one test data record at a time.
    You can place multiple test data records in a disk file and classify
    them all in one go.  To see how that can be done, execute the following
    two command lines in the `examples' directory:

     classify_test_data_in_a_file_numeric.py   training4.csv   test4.csv   out4.csv

     classify_test_data_in_a_file_symbolic.py  training4.dat   test4.dat   out4.dat

    Each of these two scripts constructs the decision tree from the data in
    the first argument file and then uses it to classify the data in the
    second argument file.  The computed class labels are deposited in the
    third argument file.

    In general, the test data files should look identical to the training
    data files.  Of course, for real-world test data, you will not have the
    class labels for the test samples.  You are still required to reserve a
    column for the class label, which now must be just the empty string ""
    for each data record.  For example, the test data supplied in the
    following two calls through the files test4_no_class_labels.csv and
    test4_no_class_labels.dat uses the empty string "" for the class
    labels:

        classify_test_data_in_a_file_numeric.py  training4.csv  test4_no_class_labels.csv out4.csv

        classify_test_data_in_a_file_symbolic.py  training4.dat   test4_no_class_labels.dat   out4.dat

    For bulk classification for the numeric case, the output file can also
    be a `.txt' file.  In that case, you will see white-space separate
    results in the output file.  When you mention a `.txt' file for the
    output for the numeric case, you can control the extent of information
    placed in the output file by setting the variable
    $show_hard_classifications in the script.  If this variable is set, the
    output will show only the most probable class for each test data
    record.

>   TO REMIND THE READER AGAIN, IF YOUR TRAINING DATA USES JUST NUMERIC
>   FEATURES OR A MIXTURE OF NUMERIC AND SYMBOLIC FEATURES, YOU MUST USE
>   CSV FILES FOR TRAINING AND TEST DATA.



(4) Let's now talk about how you can deal with features that, statistically
    speaking, are not so "nice".  We are talking about features with
    heavy-tailed distributions over large value ranges.  As mentioned in
    the HTML based API for this module, such features can create problems
    with the estimation of the probability distributions associated with
    them.  As mentioned there, the main problem that such features cause
    is with deciding how best to sample the value range. 

    Beginning with Version 2.2.4, you have two options in dealing with such
    features.  You can choose to go with the default behavior of the
    module, which is to sample the value range for such a feature over a
    maximum of 500 points.  Or, you can supply an additional option to the
    constructor that sets a user-defined value for the number of points to
    use.  The name of the option is "number_of_histogram_bins".  The
    following script

          construct_dt_for_heavytailed.py

    shows an example of a DecisionTree constructor with the
    "number_of_histogram_bins" option.






===========================================================================


             FOR USING A DECISION TREE CLASSIFIER INTERACTIVELY


    Starting with Version 1.6 of the module, you can use the DecisionTree
    classifier in an interactive mode.  In this mode, after you have
    constructed the decision tree, the user is prompted for answers to the
    questions regarding the feature tests at the nodes of the tree.
    Depending on the answer supplied by the user at a node, the classifier
    takes a path corresponding to the answer to descend down the tree to
    the next node, and so on.  To get a feel for using a decision tree in
    this mode, examine the script

        classify_by_asking_questions.py

    Execute the script as it is and see what happens.


===========================================================================


     EVALUATING THE CLASS DISCRIMINATORY POWER OF YOUR TRAINING DATA


Given a training data file that contains data records and the associated
class labels, one often wants to know the quality of the data in the file.
In other words, one wants to know if a training data file contains
sufficient information to discriminate between the different classes
mentioned in the file.

Starting with Version 2.2 of the DecisionTree module, you can now run a
10-fold cross-validation test on your training data to find out how much
class-discriminatory information is contained in the data.  The following
two scripts in the Examples directory:

       evaluate_training_data1.py

       evaluate_training_data2.py

As these scripts show, the following class 

       EvalTrainingData

defined in the main DecisionTree module file makes it straightforward to
evaluate the class discriminatory power your data (as long as it resides in
a `.csv' file.)  This new class is is a subclass of the DecisionTree class
in the module file.

Both the `evaluate' scripts mentioned above are identical in terms of the
usage logic shown.  The first is specifically for the training data file
`stage3cancer.csv' and second for the training data files `training.csv',
`training2.csv', and `training3.csv'.  The latter three data files contain
two Gaussian classes that are increasingly overlapping.  You can see for
yourself the decreasing quality of the training data as you evaluate first
the training file `training.csv', then the training file `training2.csv',
and finally the training file `training3.csv'.


===========================================================================


              GENERATING SYNTHETIC TRAINING AND TEST DATA


    Starting with Version 1.6, you can use the module itself to generate
    synthetic training and test data.  See the script

        generate_training_data_numeric.py

        generate_training_data_symbolic.py

    for how to generate training data for the decision-tree classifier for
    the purely numeric case and for the purely symbolic case.  The data is
    generated according to the information placed in a parameter file in
    each case.  These files must follow certain rules regarding the
    declaration of the classes, the features, the possible values for the
    features, etc.  An example of such a parameter file for the numeric
    case is:

        param_numeric.txt

    and for the symbolic case:

        param_symbolic.txt

    A test-data file looks very much like a training data file, except that
    the former does not contain the class labels for the different data
    records.  See the script

        generate_test_data_symbolic.py

    for an example of how you can generate test data for the purely
    symbolic case.  Note that the class labels for the test data are placed
    in a separate file whose name is supplied in the script named above.
    By comparing the classification labels obtained for each of the data
    records with their true labels you can assess the accuracy of the
    decision-tree classifier.

