- __builtin__.object
  - DataGenerator
  - LocalitySensitiveHashing
class DataGenerator(__builtin__.object)
Methods defined here:
- __init__(self, *args, **kwargs)
- gen_data_and_write_to_csv(self)
- Note that a unit cube in N dimensions has 2^N corner points. The coordinates of all these
corner points are given by the bit patterns of the integers 0, 1, 2, ...., 2^N - 1.
For example, in a vector 3-space, a unit cube has 8 corners whose coordinates are given by
the bit patterns for the integers 0, 1, 2, 3, 4, 5, 6, 7. These bit patterns would be
000, 001, 010, 011, 100, 101, 110, 111.
This script uses only N of the 2^N vertices of a unit cube as the mean vectors for N
similarity groups. These N vertices correspond to the far endpoints of the cube edges that
emanate from the origin. For example, when N=3, it uses only 001, 010, 100 as the three mean
vectors for the AT MOST 3 similarity groups. If needed, we can add additional similarity
groups by selecting additional coordinate bit patterns from the integers 0 through 2^N - 1.
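The data-generation scheme described above can be illustrated with a small standalone sketch (not the module's actual implementation): the mean vectors are the one-hot corners of the unit cube, and random samples are drawn around them. The function name, the spread sigma, the sample count, and the output file name below are illustrative choices only.

```python
# Illustrative sketch of the unit-cube-corner data generation idea described above.
# Each similarity group gets a one-hot corner of the unit cube (e.g. 001, 010, 100)
# as its mean vector, and Gaussian samples are drawn around that mean.
import random

def gen_demo_data(dim=3, samples_per_group=4, sigma=0.1, csv_file='demo_lsh_data.csv'):
    with open(csv_file, 'w') as f:
        for group in range(dim):              # at most N groups for an N-dimensional cube
            mean = [0.0] * dim
            mean[group] = 1.0                 # a one-hot corner such as 001, 010, 100
            for i in range(samples_per_group):
                point = [random.gauss(m, sigma) for m in mean]
                f.write('sample%d_%d,' % (group, i) +
                        ','.join('%.3f' % v for v in point) + '\n')

gen_demo_data()
```

Note that the record names follow the 'sample<class>_<index>' pattern expected by evaluate_quality_of_similarity_groups() further below.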
Data descriptors defined here:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)
class LocalitySensitiveHashing(__builtin__.object)
Methods defined here:
- __init__(self, *args, **kwargs)
- display_contents_of_all_hash_bins_pre_lsh(self)
- evaluate_quality_of_similarity_groups(self, evaluation_similarity_groups)
- The argument to this method, evaluation_similarity_groups, is a list of sets, with each set being
a similarity group, which is the same thing as a cluster.
If you plan to invoke this method to evaluate the quality of clustering achieved by the values
used for the parameters r and b, you'd want the data records in the CSV datafile to look like:
sample0_3,0.925,-0.008,0.009,0.058,0.092,0.117,-0.076,0.239,0.086,-0.149
Note in particular the syntax used for naming a data record. The name 'sample0_3' means that this
is the 3rd sample generated randomly for data class 0. The goal of this method is to examine all
such sample names and figure out how many classes exist in the data.
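A small sketch of the naming convention just described, assuming record names follow the 'sample<class>_<index>' pattern shown above; the helper name is illustrative.

```python
# Illustrative: recover the set of data classes from sample names such as 'sample0_3'.
# The class label is the integer between 'sample' and the underscore.
import re

def classes_from_sample_names(names):
    labels = set()
    for name in names:
        m = re.match(r'sample(\d+)_\d+$', name)
        if m:
            labels.add(int(m.group(1)))
    return labels

print(classes_from_sample_names(['sample0_3', 'sample0_4', 'sample2_0']))   # {0, 2}
```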
- get_data_from_csv(self)
- hash_all_data(self)
- hash_all_data_with_one_hyperplane(self)
- initialize_hash_store(self)
- lsh_basic_for_nearest_neighbors(self)
- Regarding this implementation of LSH, note that each row of self.htable_rows corresponds to
one hash function. So if you have 3000 hash functions for 3000 different randomly chosen
orientations of a hyperplane passing through the origin of the vector space in which the
numerical data is defined, this table has 3000 rows. Each column of self.htable_rows is for
one data sample in the vector space. So if you have 80 samples, then the table has 80 columns.
The output of this method consists of an interactive session in which the user is asked to
enter the symbolic name of a data record in the dataset processed by the LSH algorithm. The
method then returns the names (some if not all) of the nearest neighbors of that data point.
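The table layout described above can be sketched with the standard random-hyperplane hashing idea: each row corresponds to one randomly oriented hyperplane through the origin, each column to one data sample, and each entry records which side of that hyperplane the sample falls on. The function and variable names below are illustrative, not the module's internals.

```python
# Illustrative sketch of the row/column layout described above (not the module's code).
import numpy as np

def build_hash_table(data, num_hyperplanes, seed=0):
    """data: dict mapping sample name -> 1-D numpy vector."""
    rng = np.random.default_rng(seed)
    names = sorted(data)
    X = np.stack([data[n] for n in names])               # shape: (num_samples, dim)
    hyperplanes = rng.standard_normal((num_hyperplanes, X.shape[1]))
    # table[i, j] is +1 or -1 depending on which side of hyperplane i
    # sample j (column j) falls on.
    table = np.sign(hyperplanes @ X.T)
    return names, table

data = {'sample0_0': np.array([0.9, 0.1, 0.0]),
        'sample1_0': np.array([0.0, 1.1, 0.1])}
names, table = build_hash_table(data, num_hyperplanes=5)
print(table.shape)    # (5, 2): one row per hash function, one column per sample
```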
- lsh_basic_for_neighborhood_clusters(self)
- This method is a variation on the method lsh_basic_for_nearest_neighbors() in the following
sense: Whereas the previous method outputs a hash table whose keys are the data sample names
and whose values are the immediate neighbors of the key sample names, this method merges
the keys with the values to create neighborhood clusters. These clusters are returned as
a list of similarity groups, with each group being a set.
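A minimal sketch of the key-with-value merging described above, assuming the neighbor information is available as a dict from sample name to a set of neighbor names (an illustrative data layout, not necessarily the module's internal one).

```python
# Illustrative: fold each key into its own neighbor set to form neighborhood clusters.
def neighborhood_clusters(neighbors_by_sample):
    clusters = []
    for sample, neighbors in neighbors_by_sample.items():
        clusters.append({sample} | set(neighbors))
    return clusters

print(neighborhood_clusters({'sample0_0': {'sample0_1'},
                             'sample1_0': {'sample1_1', 'sample1_2'}}))
```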
- merge_similarity_groups_with_coalescence(self, similarity_groups)
- The purpose of this method is to do something that, strictly speaking, is not the right thing to do
with an implementation of LSH. We take the clusters produced by the method
lsh_basic_for_neighborhood_clusters() and coalesce them on the basis of shared data samples.
That is, if two neighborhood clusters represented by the sets A and B have any data elements in
common, we merge A and B by forming the union of the two sets.
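The coalescence rule can be sketched as repeated unions of any two groups that share a sample, continuing until no two remaining groups overlap; this is an illustrative standalone function, not the module's code.

```python
# Illustrative: union any two groups with a non-empty intersection until none overlap.
def coalesce(groups):
    groups = [set(g) for g in groups]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if groups[i] & groups[j]:
                    groups[i] |= groups.pop(j)
                    merged = True
                    break
            if merged:
                break
    return groups

print(coalesce([{'a', 'b'}, {'b', 'c'}, {'d'}]))    # [{'a', 'b', 'c'}, {'d'}]
```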
- merge_similarity_groups_with_l2norm_sample_based(self, similarity_groups)
- The neighborhood set coalescence as carried out by the previous method will generally result
in a clustering structure that is likely to have more clusters than you may be expecting to
find in your data. This method first orders the clusters (called 'similarity groups') according
to their size. It then pools together the data samples in the trailing excess similarity groups.
Subsequently, for each data sample in the pool, it merges that sample into the retained
group whose mean vector is closest to it in the l2-norm sense.
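A sketch of the sample-based merging just described, assuming the sample vectors are available in a dict keyed by sample name; the function signature and the how_many_to_retain parameter are illustrative assumptions, not the module's API.

```python
# Illustrative: keep the largest groups, pool the samples from the trailing excess
# groups, and assign each pooled sample to the retained group with the nearest mean.
import numpy as np

def merge_sample_based(groups, vectors, how_many_to_retain):
    """groups: list of sets of sample names; vectors: dict name -> numpy vector."""
    groups = sorted((set(g) for g in groups), key=len, reverse=True)
    retained, excess = groups[:how_many_to_retain], groups[how_many_to_retain:]
    pooled = set().union(*excess) if excess else set()
    means = [np.mean([vectors[n] for n in g], axis=0) for g in retained]
    for name in pooled:
        distances = [np.linalg.norm(vectors[name] - m) for m in means]
        retained[int(np.argmin(distances))].add(name)
    return retained
```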
- merge_similarity_groups_with_l2norm_set_based(self, similarity_groups)
- The overall goal of this method is the same as that of
merge_similarity_groups_with_l2norm_sample_based(), except that we now merge the excess
similarity groups wholesale into the retained similarity
groups. For each excess similarity group, we find the closest retained similarity group,
closest in terms of the l2 norm distance between the mean values of the two groups.
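The set-based variant can be sketched in the same way, again under the assumption that sample vectors are available in a dict keyed by name; the function signature and parameter names below are illustrative.

```python
# Illustrative: merge each excess group wholesale into the retained group whose
# mean vector is closest to its own mean in the l2-norm sense.
import numpy as np

def merge_set_based(groups, vectors, how_many_to_retain):
    groups = sorted((set(g) for g in groups), key=len, reverse=True)
    retained, excess = groups[:how_many_to_retain], groups[how_many_to_retain:]

    def group_mean(g):
        return np.mean([vectors[n] for n in g], axis=0)

    retained_means = [group_mean(g) for g in retained]
    for g in excess:
        m = group_mean(g)
        closest = int(np.argmin([np.linalg.norm(m - rm) for rm in retained_means]))
        retained[closest] |= g
    return retained
```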
- prune_similarity_groups(self)
If your data produces too many similarity groups, you can get rid of the smallest ones with
this method. In order to use this method, you must specify a value for the parameter
'similarity_group_min_size_threshold' in the call to the constructor of the LSH module.
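A hedged usage sketch follows. The documentation above confirms only that the constructors take keyword arguments, that r, b, and 'similarity_group_min_size_threshold' are parameters, and the method names; the other keyword names, the import path, and the calling order shown here are assumptions and may differ from the actual module.

```python
# Assumed import path and keyword names; only the method names and the parameters
# r, b, and similarity_group_min_size_threshold are confirmed by the docs above.
from LocalitySensitiveHashing import LocalitySensitiveHashing

lsh = LocalitySensitiveHashing(
          datafile='data_for_lsh.csv',                 # assumed parameter name
          dim=10,                                      # assumed parameter name
          r=50,                                        # named above; meaning assumed
          b=100,                                       # named above; meaning assumed
          similarity_group_min_size_threshold=3,       # named in the docstring above
      )
lsh.get_data_from_csv()
lsh.initialize_hash_store()
lsh.hash_all_data()
similarity_groups = lsh.lsh_basic_for_neighborhood_clusters()
similarity_groups = lsh.merge_similarity_groups_with_coalescence(similarity_groups)
lsh.prune_similarity_groups()
```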
- show_data_for_lsh(self)
- show_sample_to_initial_similarity_group_mapping(self)
- write_clusters_to_file(self, clusters, filename)
Data descriptors defined here:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)