- __builtin__.object
  - DataGenerator
  - LocalitySensitiveHashing
class DataGenerator(__builtin__.object)
Methods defined here:
- __init__(self, *args, **kwargs)
- gen_data_and_write_to_csv(self)
- Note that a unit cube in N dimensions has 2^N corner points. The coordinates of all these
corner points are given by the bit patterns of the integers 0, 1, 2, ...., 2^N - 1.
For example, in a vector 3-space, a unit cube has 8 corners whose coordinates are given by
the bit patterns for the integers 0, 1, 2, 3, 4, 5, 6, 7. These bit patterns would be
000, 001, 010, 011, 100, 101, 110, 111.
This script uses only N of the 2^N vertices of a unit cube as the mean vectors for N
similarity groups. These N vertices correspond to the far endpoints of the cube edges that
emanate from the origin. For example, when N=3, it uses only 001, 010, 100 as the three mean
vectors for the AT MOST 3 similarity groups. If needed, we can add additional similarity
groups by selecting additional coordinate bit patterns from the integers 0 through 2^N - 1.
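The data-generation scheme described above can be illustrated with a small standalone sketch (not the module's actual implementation): the mean vectors are the one-hot corners of the unit cube, and random samples are drawn around them. The function name, the spread sigma, the sample count, and the output file name below are illustrative choices only.

```python
# Illustrative sketch of the unit-cube-corner data generation idea described above.
# Each similarity group gets a one-hot corner of the unit cube (e.g. 001, 010, 100)
# as its mean vector, and Gaussian samples are drawn around that mean.
import random

def gen_demo_data(dim=3, samples_per_group=4, sigma=0.1, csv_file='demo_lsh_data.csv'):
    with open(csv_file, 'w') as f:
        for group in range(dim):              # at most N groups for an N-dimensional cube
            mean = [0.0] * dim
            mean[group] = 1.0                 # a one-hot corner such as 001, 010, 100
            for i in range(samples_per_group):
                point = [random.gauss(m, sigma) for m in mean]
                f.write('sample%d_%d,' % (group, i) +
                        ','.join('%.3f' % v for v in point) + '\n')

gen_demo_data()
```

Note that the record names follow the 'sample<class>_<index>' pattern expected by evaluate_quality_of_similarity_groups() further below.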
Data descriptors defined here:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)
class LocalitySensitiveHashing(__builtin__.object)
Methods defined here:
- __init__(self, *args, **kwargs)
- display_contents_of_all_hash_bins_pre_lsh(self)
- evaluate_quality_of_similarity_groups(self, evaluation_similarity_groups)
- The argument to this method, evaluation_similarity_groups, is a list of sets, with each set being
a similarity group, which is the same thing as a cluster.
If you plan to invoke this method to evaluate the quality of clustering achieved by the values
used for the parameters r and b, you'd want the data records in the CSV datafile to look like:
sample0_3,0.925,-0.008,0.009,0.058,0.092,0.117,-0.076,0.239,0.086,-0.149
Note in particular the syntax used for naming a data record. The name 'sample0_3' means that this
is the 3rd sample generated randomly for data class 0. The goal of this method is to examine all
such sample names and figure out how many classes exist in the data.
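A small sketch of the naming convention just described, assuming record names follow the 'sample<class>_<index>' pattern shown above; the helper name is illustrative.

```python
# Illustrative: recover the set of data classes from sample names such as 'sample0_3'.
# The class label is the integer between 'sample' and the underscore.
import re

def classes_from_sample_names(names):
    labels = set()
    for name in names:
        m = re.match(r'sample(\d+)_\d+$', name)
        if m:
            labels.add(int(m.group(1)))
    return labels

print(classes_from_sample_names(['sample0_3', 'sample0_4', 'sample2_0']))   # {0, 2}
```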
- get_data_from_csv(self)
- hash_all_data(self)
- hash_all_data_with_one_hyperplane(self)
- initialize_hash_store(self)
- lsh_basic_for_nearest_neighbors(self)
- Regarding this implementation of LSH, note that each row of self.htable_rows corresponds to
one hash function. So if you have 3000 hash functions for 3000 different randomly chosen
orientations of a hyperplane passing through the origin of the vector space in which the
numerical data is defined, this table has 3000 rows. Each column of self.htable_rows is for
one data sample in the vector space. So if you have 80 samples, then the table has 80 columns.
The output of this method consists of an interactive session in which the user is asked to
enter the symbolic name of a data record in the dataset processed by the LSH algorithm. The
method then returns the names (some if not all) of the nearest neighbors of that data point.
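The table layout described above can be sketched with the standard random-hyperplane hashing idea: each row corresponds to one randomly oriented hyperplane through the origin, each column to one data sample, and each entry records which side of that hyperplane the sample falls on. The function and variable names below are illustrative, not the module's internals.

```python
# Illustrative sketch of the row/column layout described above (not the module's code).
import numpy as np

def build_hash_table(data, num_hyperplanes, seed=0):
    """data: dict mapping sample name -> 1-D numpy vector."""
    rng = np.random.default_rng(seed)
    names = sorted(data)
    X = np.stack([data[n] for n in names])               # shape: (num_samples, dim)
    hyperplanes = rng.standard_normal((num_hyperplanes, X.shape[1]))
    # table[i, j] is +1 or -1 depending on which side of hyperplane i
    # sample j (column j) falls on.
    table = np.sign(hyperplanes @ X.T)
    return names, table

data = {'sample0_0': np.array([0.9, 0.1, 0.0]),
        'sample1_0': np.array([0.0, 1.1, 0.1])}
names, table = build_hash_table(data, num_hyperplanes=5)
print(table.shape)    # (5, 2): one row per hash function, one column per sample
```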
- lsh_basic_for_neighborhood_clusters(self)
- This method is a variation on the method lsh_basic_for_nearest_neighbors() in the following
sense: Whereas the previous method outputs a hash table whose keys are the data sample names
and whose values are the immediate neighbors of the key sample names, this method merges
the keys with the values to create neighborhood clusters. These clusters are returned as
a list of similarity groups, with each group being a set.
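A minimal sketch of the key-with-value merging described above, assuming the neighbor information is available as a dict from sample name to a set of neighbor names (an illustrative data layout, not necessarily the module's internal one).

```python
# Illustrative: fold each key into its own neighbor set to form neighborhood clusters.
def neighborhood_clusters(neighbors_by_sample):
    clusters = []
    for sample, neighbors in neighbors_by_sample.items():
        clusters.append({sample} | set(neighbors))
    return clusters

print(neighborhood_clusters({'sample0_0': {'sample0_1'},
                             'sample1_0': {'sample1_1', 'sample1_2'}}))
```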
- merge_similarity_groups_with_coalescence(self, similarity_groups)
- The purpose of this method is to do something that, strictly speaking, is not the right thing to do
with an implementation of LSH. We take the clusters produced by the method
lsh_basic_for_neighborhood_clusters() and coalesce them on the basis of shared data samples.
That is, if two neighborhood clusters represented by the sets A and B have any data elements in
common, we merge A and B by forming the union of the two sets.
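The coalescence rule can be sketched as repeated unions of any two groups that share a sample, continuing until no two remaining groups overlap; this is an illustrative standalone function, not the module's code.

```python
# Illustrative: union any two groups with a non-empty intersection until none overlap.
def coalesce(groups):
    groups = [set(g) for g in groups]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if groups[i] & groups[j]:
                    groups[i] |= groups.pop(j)
                    merged = True
                    break
            if merged:
                break
    return groups

print(coalesce([{'a', 'b'}, {'b', 'c'}, {'d'}]))    # [{'a', 'b', 'c'}, {'d'}]
```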
- merge_similarity_groups_with_l2norm_sample_based(self, similarity_groups)
- The neighborhood set coalescence as carried out by the previous method will generally result
in a clustering structure that is likely to have more clusters than you may be expecting to
find in your data. This method first orders the clusters (called 'similarity groups') according
to their size. It then pools together the data samples in the trailing excess similarity groups.
Subsequently, for each data sample in the pool, it merges that sample into the retained
group whose mean vector is closest to it in the l2-norm sense.
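A sketch of the sample-based merging just described, assuming the sample vectors are available in a dict keyed by sample name; the function signature and the how_many_to_retain parameter are illustrative assumptions, not the module's API.

```python
# Illustrative: keep the largest groups, pool the samples from the trailing excess
# groups, and assign each pooled sample to the retained group with the nearest mean.
import numpy as np

def merge_sample_based(groups, vectors, how_many_to_retain):
    """groups: list of sets of sample names; vectors: dict name -> numpy vector."""
    groups = sorted((set(g) for g in groups), key=len, reverse=True)
    retained, excess = groups[:how_many_to_retain], groups[how_many_to_retain:]
    pooled = set().union(*excess) if excess else set()
    means = [np.mean([vectors[n] for n in g], axis=0) for g in retained]
    for name in pooled:
        distances = [np.linalg.norm(vectors[name] - m) for m in means]
        retained[int(np.argmin(distances))].add(name)
    return retained
```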
- merge_similarity_groups_with_l2norm_set_based(self, similarity_groups)
- The overall goal of this method is the same as that of
merge_similarity_groups_with_l2norm_sample_based(), except that we now merge the excess
similarity groups wholesale into the retained similarity
groups. For each excess similarity group, we find the closest retained similarity group,
closest in terms of the l2 norm distance between the mean values of the two groups.
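The set-based variant can be sketched in the same way, again under the assumption that sample vectors are available in a dict keyed by name; the function signature and parameter names below are illustrative.

```python
# Illustrative: merge each excess group wholesale into the retained group whose
# mean vector is closest to its own mean in the l2-norm sense.
import numpy as np

def merge_set_based(groups, vectors, how_many_to_retain):
    groups = sorted((set(g) for g in groups), key=len, reverse=True)
    retained, excess = groups[:how_many_to_retain], groups[how_many_to_retain:]

    def group_mean(g):
        return np.mean([vectors[n] for n in g], axis=0)

    retained_means = [group_mean(g) for g in retained]
    for g in excess:
        m = group_mean(g)
        closest = int(np.argmin([np.linalg.norm(m - rm) for rm in retained_means]))
        retained[closest] |= g
    return retained
```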
- prune_similarity_groups(self)
If your data produces too many similarity groups, you can get rid of the smallest ones with
this method. In order to use this method, you must specify a value for the parameter
'similarity_group_min_size_threshold' in the call to the constructor of the LSH module.
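A hedged usage sketch follows. The documentation above confirms only that the constructors take keyword arguments, that r, b, and 'similarity_group_min_size_threshold' are parameters, and the method names; the other keyword names, the import path, and the calling order shown here are assumptions and may differ from the actual module.

```python
# Assumed import path and keyword names; only the method names and the parameters
# r, b, and similarity_group_min_size_threshold are confirmed by the docs above.
from LocalitySensitiveHashing import LocalitySensitiveHashing

lsh = LocalitySensitiveHashing(
          datafile='data_for_lsh.csv',                 # assumed parameter name
          dim=10,                                      # assumed parameter name
          r=50,                                        # named above; meaning assumed
          b=100,                                       # named above; meaning assumed
          similarity_group_min_size_threshold=3,       # named in the docstring above
      )
lsh.get_data_from_csv()
lsh.initialize_hash_store()
lsh.hash_all_data()
similarity_groups = lsh.lsh_basic_for_neighborhood_clusters()
similarity_groups = lsh.merge_similarity_groups_with_coalescence(similarity_groups)
lsh.prune_similarity_groups()
```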
- show_data_for_lsh(self)
- show_sample_to_initial_similarity_group_mapping(self)
- write_clusters_to_file(self, clusters, filename)
Data descriptors defined here:
- __dict__
- dictionary for instance variables (if defined)
- __weakref__
- list of weak references to the object (if defined)