- __builtin__.object
    - DTIntrospection
    - DecisionTree
        - EvalTrainingData
    - TestDataGeneratorSymbolic
    - TrainingDataGeneratorNumeric
    - TrainingDataGeneratorSymbolic
class DTIntrospection(__builtin__.object)
Instances constructed from this class can provide explanations for the
classification decisions at the nodes of a decision tree.
When used in the interactive mode, the DT introspection made possible by this
class answers three questions about a node: (1) Which training samples fall in
the portion of the feature space that corresponds to the node? (2) What are
the probabilities associated with the last feature test that led to the node?
(3) What are the class probabilities predicated on just the last feature test?
CAVEAT: It is possible for a node to exist even when there are no training
samples in the portion of the feature space that corresponds to the node. That
is because a decision tree is based on the probability densities estimated from
the training data. When training data is non-uniformly distributed, it is
possible for the probability associated with a point in the feature space to be
non-zero even when there are no training samples at or in the vicinity of that
point.
That a node can exist even when no training samples fall in the portion of the
feature space that belongs to it is an indication of the generalization ability
of decision-tree-based classification.
When used in a non-interactive mode, an instance of this class can be used to
create a tabular display that shows what training samples belong directly to the
portion of the feature space that corresponds to each node of the decision tree.
An instance of this class can also construct a tabular display that shows how the
influence of each training sample propagates in the decision tree. For each
training sample, this display first shows the list of nodes that came into
existence through feature test(s) that used the data provided by that sample.
Each sample's list is followed by a subtree of the nodes that owe their
existence indirectly to the sample. A training sample influences a
node indirectly if the node is a descendant of another node that is affected
directly by the training sample.
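
A minimal usage sketch follows (it assumes `dt' is a DecisionTree instance on
which construct_decision_tree_classifier() has already been called; the method
names are the ones listed for this class):

    import DecisionTree

    introspector = DecisionTree.DTIntrospection(dt)
    introspector.initialize()

    # Interactive question-answer session over the nodes of the tree:
    introspector.explain_classifications_at_multiple_nodes_interactively()

    # Non-interactive tabular displays described above:
    introspector.display_training_samples_at_all_nodes_direct_influence_only()
    introspector.display_training_samples_to_nodes_influence_propagation()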
Methods defined here:
- __init__(self, dt)
- display_training_samples_at_all_nodes_direct_influence_only(self)
- display_training_samples_to_nodes_influence_propagation(self)
- explain_classification_at_one_node(self, node_id)
- explain_classifications_at_multiple_nodes_interactively(self)
- extract_feature_op_val(self, feature_value_combo)
- get_samples_for_feature_value_combo(self, feature_value_combo)
- initialize(self)
- recursive_descent(self, node)
- recursive_descent_for_sample_to_node_influence(self, node_serial_num, nodes_already_accounted_for, offset)
- recursive_descent_for_showing_samples_at_a_node(self, node)
Data descriptors defined here:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)
class DecisionTree(__builtin__.object)
Methods defined here:
- __init__(self, *args, **kwargs)
- best_feature_calculator(self, features_and_values_or_thresholds_on_branch, existing_node_entropy)
- This is the heart of the decision tree constructor. Its main job is to figure
out the best feature to use for partitioning the training data samples that
correspond to the current node. The search for the best feature is carried
out differently for symbolic features and for numeric features. For a
symbolic feature, the method estimates the entropy for each value of the
feature and then averages out these entropies as a measure of the
discriminatory power of that feature. For a numeric feature, on the other
hand, it estimates the entropy reduction that can be achieved if we were to
partition the set of training samples at each possible threshold for that
numeric feature. For a numeric feature, all possible sampling points
relevant to the node in question are considered as candidates for thresholds.
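To make the symbolic-feature case concrete, here is a toy sketch of
value-averaged class entropy (an illustration of the idea only, not this
method's actual probability-density-based implementation):

    import math
    from collections import Counter, defaultdict

    def avg_entropy_for_symbolic_feature(samples, feature):
        # samples: list of (feature_dict, class_label) pairs -- a toy data
        # model used only for this illustration.
        by_value = defaultdict(list)
        for features, label in samples:
            by_value[features[feature]].append(label)
        total = float(len(samples))
        avg = 0.0
        for labels in by_value.values():
            p_value = len(labels) / total
            n = float(len(labels))
            entropy = -sum((c / n) * math.log(c / n, 2)
                           for c in Counter(labels).values())
            avg += p_value * entropy
        return avg   # lower average entropy => greater discriminatory power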
- calculate_class_priors(self)
- calculate_first_order_probabilities(self)
- class_entropy_for_a_given_sequence_of_features_and_values_or_thresholds(self, array_of_features_and_values_or_thresholds)
- class_entropy_for_greater_than_threshold_for_feature(self, array_of_features_and_values_or_thresholds, feature, threshold)
- class_entropy_for_less_than_threshold_for_feature(self, array_of_features_and_values_or_thresholds, feature, threshold)
- class_entropy_on_priors(self)
- classify(self, root_node, features_and_values)
- Classifies one test sample at a time using the decision tree constructed from
your training file. The data record for the test sample must be supplied as
shown in the scripts in the `Examples' subdirectory. See the scripts
construct_dt_and_classify_one_sample_caseX.py in that subdirectory.
- classify_by_asking_questions(self, root_node)
- If you want classification to be carried out by engaging a human user in a
question-answer session, this is the method to use for that purpose. See the
script classify_by_asking_questions.py in the Examples subdirectory for an
illustration of how to do that.
- construct_decision_tree_classifier(self)
- At the root node, we find the best feature that yields the greatest reduction
in class entropy from the entropy based on just class priors. The logic for
finding this feature is different for symbolic features and for numeric
features. That logic is built into the best feature calculator.
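Putting this method and classify() together, the canonical calling sequence
looks roughly as follows. The constructor options and the feature strings are
modeled on the scripts in the `Examples' subdirectory and should be treated as
assumptions; consult those scripts for the authoritative forms:

    import DecisionTree

    dt = DecisionTree.DecisionTree(
                 training_datafile = "training_data.csv",   # placeholder name
                 csv_class_column_index = 1,                # assumption
                 csv_columns_for_features = [2, 3],         # assumption
                 entropy_threshold = 0.01,
                 max_depth_desired = 5,
         )
    dt.get_training_data()
    dt.calculate_first_order_probabilities()
    dt.calculate_class_priors()
    root_node = dt.construct_decision_tree_classifier()

    # A test sample is supplied as 'feature = value' strings:
    test_sample = ['feature1 = 3.7', 'feature2 = low']
    classification = dt.classify(root_node, test_sample)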
- determine_data_condition(self)
- This method estimates the worst-case fan-out of the decision tree taking into
account the number of values (and therefore the number of branches emanating
from a node) for the symbolic features.
- entropy_scanner_for_a_numeric_feature(self, feature)
- find_bounded_intervals_for_numeric_features(self, arr)
- Given a list of branch attributes for the numeric features of the form, say,
['g2<1','g2<2','g2<3','age>34','age>36','age>37'], this method returns the
smallest list that is relevant for the purpose of calculating the
probabilities. To explain, the probability that the feature `g2' is less
than 1 AND, at the same time, less than 2, AND, at the same time, less than
3, is the same as the probability that the feature is less than 1. Similarly,
the probability that 'age' is greater than 34 and also greater than 37 is the
same as `age' being greater than 37.
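A toy sketch of that reduction (not this method's actual implementation):
keep only the tightest bound per feature and direction.

    import re

    def tightest_bounds(branch_attributes):
        # For each (feature, op) pair, keep the smallest threshold for '<'
        # constraints and the largest threshold for '>' constraints.
        best = {}
        for attr in branch_attributes:
            feature, op, val = re.match(r'(\w+)([<>])(.+)', attr).groups()
            val = float(val)
            key = (feature, op)
            if key not in best:
                best[key] = val
            else:
                best[key] = min(best[key], val) if op == '<' else \
                            max(best[key], val)
        return ['%s%s%g' % (f, op, v) for (f, op), v in best.items()]

    # tightest_bounds(['g2<1','g2<2','g2<3','age>34','age>36','age>37'])
    # returns ['g2<1', 'age>37']   (order may vary)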
- get_class_names(self)
- get_training_data(self)
- If your training data is purely symbolic, as in Version 1.7.1, you might find
it easier to create a `.dat' file. For purely numeric data, or mixed
symbolic and numeric data, you MUST use a `.csv' file. See examples of these
files in the `Examples' subdirectory.
- get_training_data_from_csv(self)
- get_training_data_from_dat(self)
- Meant for purely symbolic data (as in all versions up to v. 1.7.1)
- interactive_recursive_descent_for_classification(self, node, answer, scratchpad_for_numerics)
- prior_probability_for_class(self, class_name)
- probability_of_a_class_given_sequence_of_features_and_values_or_thresholds(self, class_name, array_of_features_and_values_or_thresholds)
- probability_of_a_sequence_of_features_and_values_or_thresholds(self, array_of_features_and_values_or_thresholds)
This method requires that all truly numeric features be expressed only as '<'
or '>' constructs in the array of branch features and thresholds.
- probability_of_a_sequence_of_features_and_values_or_thresholds_given_class(self, array_of_features_and_values_or_thresholds, class_name)
This method requires that all truly numeric features be expressed only as '<'
or '>' constructs in the array of branch features and thresholds.
- probability_of_feature_less_than_threshold(self, feature_name, threshold)
- probability_of_feature_less_than_threshold_given_class(self, feature_name, threshold, class_name)
- probability_of_feature_value(self, feature_name, value)
- probability_of_feature_value_given_class(self, feature_name, feature_value, class_name)
- recursive_descent(self, node)
- After the root node of the decision tree is constructed by the previous
methods, we invoke this method recursively to create the rest of the tree.
At each node, we find the feature that achieves the largest entropy reduction
with regard to the partitioning of the training data samples that correspond
to that node.
- recursive_descent_for_classification(self, node, feature_and_values, answer)
- show_training_data(self)
Data descriptors defined here:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)
Data and other attributes defined here:
- DTNode = <class 'DecisionTree.DTNode'>
The nodes of the decision tree are instances of this class.
class EvalTrainingData(DecisionTree)
- Method resolution order:
- EvalTrainingData
- DecisionTree
- __builtin__.object
Methods defined here:
- __init__(self, *args, **kwargs)
- evaluate_training_data(self)
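
A minimal sketch of how this class is typically driven (the constructor
options mirror those of DecisionTree and are assumptions here):

    import DecisionTree

    eval_data = DecisionTree.EvalTrainingData(
                        training_datafile = "training_data.csv",  # placeholder
                        csv_class_column_index = 1,               # assumption
                        csv_columns_for_features = [2, 3],        # assumption
                )
    eval_data.get_training_data()
    eval_data.evaluate_training_data()   # prints an assessment of the class
                                         # discriminatory quality of the data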
Methods inherited from DecisionTree:
- best_feature_calculator(self, features_and_values_or_thresholds_on_branch, existing_node_entropy)
- This is the heart of the decision tree constructor. Its main job is to figure
out the best feature to use for partitioning the training data samples that
correspond to the current node. The search for the best feature is carried
out differently for symbolic features and for numeric features. For a
symbolic feature, the method estimates the entropy for each value of the
feature and then averages out these entropies as a measure of the
discriminatory power of that feature. For a numeric feature, on the other
hand, it estimates the entropy reduction that can be achieved if we were to
partition the set of training samples at each possible threshold for that
numeric feature. For a numeric feature, all possible sampling points
relevant to the node in question are considered as candidates for thresholds.
- calculate_class_priors(self)
- calculate_first_order_probabilities(self)
- class_entropy_for_a_given_sequence_of_features_and_values_or_thresholds(self, array_of_features_and_values_or_thresholds)
- class_entropy_for_greater_than_threshold_for_feature(self, array_of_features_and_values_or_thresholds, feature, threshold)
- class_entropy_for_less_than_threshold_for_feature(self, array_of_features_and_values_or_thresholds, feature, threshold)
- class_entropy_on_priors(self)
- classify(self, root_node, features_and_values)
- Classifies one test sample at a time using the decision tree constructed from
your training file. The data record for the test sample must be supplied as
shown in the scripts in the `Examples' subdirectory. See the scripts
construct_dt_and_classify_one_sample_caseX.py in that subdirectory.
- classify_by_asking_questions(self, root_node)
- If you want classification to be carried out by engaging a human user in a
question-answer session, this is the method to use for that purpose. See the
script classify_by_asking_questions.py in the Examples subdirectory for an
illustration of how to do that.
- construct_decision_tree_classifier(self)
- At the root node, we find the best feature that yields the greatest reduction
in class entropy from the entropy based on just class priors. The logic for
finding this feature is different for symbolic features and for numeric
features. That logic is built into the best feature calculator.
- determine_data_condition(self)
- This method estimates the worst-case fan-out of the decision tree taking into
account the number of values (and therefore the number of branches emanating
from a node) for the symbolic features.
- entropy_scanner_for_a_numeric_feature(self, feature)
- find_bounded_intervals_for_numeric_features(self, arr)
- Given a list of branch attributes for the numeric features of the form, say,
['g2<1','g2<2','g2<3','age>34','age>36','age>37'], this method returns the
smallest list that is relevant for the purpose of calculating the
probabilities. To explain, the probability that the feature `g2' is less
than 1 AND, at the same time, less than 2, AND, at the same time, less than
3, is the same as the probability that the feature is less than 1. Similarly,
the probability that 'age' is greater than 34 and also greater than 37 is the
same as `age' being greater than 37.
- get_class_names(self)
- get_training_data(self)
- If your training data is purely symbolic, as in Version 1.7.1, you might find
it easier to create a `.dat' file. For purely numeric data, or mixed
symbolic and numeric data, you MUST use a `.csv' file. See examples of these
files in the `Examples' subdirectory.
- get_training_data_from_csv(self)
- get_training_data_from_dat(self)
- Meant for purely symbolic data (as in all versions up to v. 1.7.1)
- interactive_recursive_descent_for_classification(self, node, answer, scratchpad_for_numerics)
- prior_probability_for_class(self, class_name)
- probability_of_a_class_given_sequence_of_features_and_values_or_thresholds(self, class_name, array_of_features_and_values_or_thresholds)
- probability_of_a_sequence_of_features_and_values_or_thresholds(self, array_of_features_and_values_or_thresholds)
This method requires that all truly numeric features be expressed only as '<'
or '>' constructs in the array of branch features and thresholds.
- probability_of_a_sequence_of_features_and_values_or_thresholds_given_class(self, array_of_features_and_values_or_thresholds, class_name)
This method requires that all truly numeric features be expressed only as '<'
or '>' constructs in the array of branch features and thresholds.
- probability_of_feature_less_than_threshold(self, feature_name, threshold)
- probability_of_feature_less_than_threshold_given_class(self, feature_name, threshold, class_name)
- probability_of_feature_value(self, feature_name, value)
- probability_of_feature_value_given_class(self, feature_name, feature_value, class_name)
- recursive_descent(self, node)
- After the root node of the decision tree is constructed by the previous
methods, we invoke this method recursively to create the rest of the tree.
At each node, we find the feature that achieves the largest entropy reduction
with regard to the partitioning of the training data samples that correspond
to that node.
- recursive_descent_for_classification(self, node, feature_and_values, answer)
- show_training_data(self)
Data descriptors inherited from DecisionTree:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)
Data and other attributes inherited from DecisionTree:
- DTNode = <class 'DecisionTree.DTNode'>
The nodes of the decision tree are instances of this class.
class TestDataGeneratorSymbolic(__builtin__.object)
This convenience class does basically the same thing as the
TrainingDataGeneratorSymbolic except that it places the class labels for the
sample records in a separate file. Let's say you have already created a DT
classifier and you would like to test its class discriminatory power. You can
use the classifier to calculate the class labels for the data records
generated by the class shown here, and then compare those labels with the
ones this class originally placed in a separate file. See the script
generate_test_data_symbolic.py for how to use this class.
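
A hedged usage sketch (the constructor keywords are placeholders modeled on
the generate_test_data_symbolic.py script; the three method calls are the
ones listed below):

    import DecisionTree

    test_gen = DecisionTree.TestDataGeneratorSymbolic(
                       parameter_file  = "param_symbolic.txt",  # placeholder
                       output_datafile = "testdata.dat",        # placeholder
               )
    test_gen.read_parameter_file()
    test_gen.gen_test_data()
    test_gen.write_test_data_to_file()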
Methods defined here:
- __init__(self, *args, **kwargs)
- find_longest_value(self)
- gen_test_data(self)
- This method generates the test data according to the specifications
laid out in the parameter file read by the previous method.
- read_parameter_file(self)
This method reads the parameter file used for generating the test data.
- write_test_data_to_file(self)
Data descriptors defined here:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)
class TrainingDataGeneratorNumeric(__builtin__.object)
See the example script generate_training_data_numeric.py for how to use this
class to generate your numeric training data. The training data is generated
in accordance with the specifications you place in a parameter file.
Methods defined here:
- __init__(self, *args, **kwargs)
- gen_numeric_training_data_and_write_to_csv(self)
- After the parameter file is parsed by the previous method, this method calls
on `numpy.random.multivariate_normal()' to generate the training data
samples. Your training data can have any number of dimensions, any mean,
and any covariance.
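For reference, the underlying NumPy call has this shape (toy mean and
covariance values):

    import numpy as np

    mean = [10.0, 20.0]                  # one mean per dimension
    cov  = [[4.0, 1.0],
            [1.0, 9.0]]                  # symmetric covariance matrix
    samples = np.random.multivariate_normal(mean, cov, 300)  # 300 rows, 2 cols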
- read_parameter_file_numeric(self)
The training data generated by an instance of the class
TrainingDataGeneratorNumeric is based on the specs you place in a parameter
file that you supply to the class constructor through the constructor
variable `parameter_file'. This method parses the parameter file in order to
determine the names to be used for the different data classes, their means,
and their variances.
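
End to end, the class is typically driven like this (the constructor keywords
are placeholders modeled on the example script):

    import DecisionTree

    data_gen = DecisionTree.TrainingDataGeneratorNumeric(
                       output_csv_file = "training.csv",        # placeholder
                       parameter_file  = "param_numeric.txt",   # placeholder
               )
    data_gen.read_parameter_file_numeric()
    data_gen.gen_numeric_training_data_and_write_to_csv()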
Data descriptors defined here:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)