LocalitySensitiveHashing-1.0.1.html

LocalitySensitiveHashing (version 1.0.1, 2017-May-25)

LocalitySensitiveHashing.py
Version: 1.0.1     Author: Avinash Kak (kak@purdue.edu) Date: 2017-May-25

Download Version 1.0.1: gztar bztar      Total number of downloads (all versions): 1671
     This count is automatically updated at every rotation of      the weblogs (normally once every two to four days)      Last updated: Tue Dec 2 06:08:01 EST 2025

View version 1.0.1 code in browser CHANGES:   Version 1.0.1:     This version fixes the typos and other errors discovered in the     documentation.  The module code remains unchanged.          INTRODUCTION:     The LocalitySensitiveHashing module is an implementation of the     Locality Sensitive Hashing (LSH) algorithm for nearest neighbor search.     The main idea in LSH is to avoid having to compare every pair of data     samples in a large dataset in order to find the nearest similar     neighbors for the different data samples.  With LSH, one can expect a     data sample and its closest similar neighbors to be hashed into the     same bucket with a high probability.  By treating the data samples     placed in the same bucket as candidates for similarity checking, we     significantly reduce the computational burden associated with finding     nearest neighbors in large datasets.     While LSH algorithms have traditionally been used for finding nearest     neighbors, this module goes a step further and explores using LSH for     clustering the data.  Strictly speaking, this violates the basic     mandate of LSH, which is to return just the nearest neighbors. (A data     sample X being Y's nearest neighbor and Y being Z's nearest neighbor,     in the sense nearest neighbors are commonly defined with the Cosine     metric in LSH, does not always imply that X and Z will be sufficiently     close to be considered each other's nearest neighbors.)  Nonetheless,     if you believe that your datafile consists of non-overlapping data     clusters, this module may do a decent job of finding those clusters.     The rest of this section reviews the basic notions in LSH to help the     user of this module understand the implementation code.  As to how     these notions are used for clustering will be taken up in the next     section.     A hash function is locality sensitive if it places similar samples in     the same bucket with a high probability, and if it places dissimilar     samples in the same bucket with a low probability.  Two data samples     are considered to be similar if the "distance" between them is at most     d1 and two data samples are considered dissimilar if the "distance"     between them is at least d2 = c * d1, with c > 1. (As for how to     measure this "distance", for numerical data in a vector space, we have     several distance measures at our disposal.  For example, we could use     the Cosine distances or the L2-norm, etc.)  For given d1 and d2, the     quality of a locality-sensitive hash function is measured by two     probabilities, p1 and p2, where the former is the the least probability     of collision for similar samples (which is something that we want) and     the latter the largest probability of collision for dissimilar samples     (which is something that we do NOT want).  For obvious reasons, you'd     want p1 to be as high as possible and p2 to be as low as possible.  The     locality sensitivity of such a hash function is characterized by the     4-tuple (d1,d2,p1,p2).  We may refer to p1 as the least probability for     detecting the true positives and p2 as largest probability of     encountering false positives.          In practice, it is not possible to come up with a single hash function     with acceptably high true-positive probability and acceptably low     false-positive probability.  However, it has been shown that a large     number of hash functions working together in an AND-OR structure can     give us the desired values for p1 and p2.  One starts out with a basic     hash function that places similar samples in the same bucket with a     high probability, but that, at the same time, places any two dissimilar     samples in the same bucket with NOT a sufficiently low probability.     Subsequently, one can require that for any two given samples to be     considered candidates for similarity, they must be in the same bucket     for a set of r random variants of the same basic hash function.  This     is referred to as enforcing an ÁND over the r hash functions in order     to decrease the collisions for dissimilar samples.  Since the ÁND     operation can also diminish the probability of collisions for similar     samples, we introduce an OR operation, which consists of ORing over b     bands of ANDed operations.  Before visiting the collision probabilities     for similar and dissimilar samples associated with this AND-OR logic,     let me further illustrate in the next paragraph as to what is meant by     the AND and the OR operations.  I'll do so by alluding to the basic     data structure, htable_rows, that the module uses for the AND and the     OR operations.          This module implements the AND-OR idea described above in the following     manner: The module constructs a 2D bit array, represented by     htable_rows in the code, in which each row corresponds to a different     hash function and each column corresponds to a data sample.  The total     number of rows in htable_rows is b*r where r is the number of hash     functions over which we want to carry out the AND operation and b is     the number of such bands of r hash functions for the OR operation.  To     elaborate, assume that both r and b are set to 3 and that we have a     total of 5 data samples. The 2D array of 1's and 0's in htable_rows     could look like what is shown below:                                      data samples                                  x1     x2     x3     x4     x5                      ----------------------------------                     |                                                              h1  |  1      .      1      .      .        b=0  r=0                 h2  |  0      .      0      .      .        b=0  r=1     hash        h3  |  1      .      1      .      .        b=0  r=2     functions       |                 h4  |  .      1      .      1      .        b=1  r=0                 h5  |  .      1      .      1      .        b=1  r=1                 h6  |  .      0      .      0      .        b=1  r=2                     |                 h7  |  1      .      .      .      1        b=2  r=0                 h8  |  1      .      .      .      1        b=2  r=1                 h9  |  1      .      .      .      1        b=2  r=2                     |          where an entry of "1" means that the hyperplane hash function placed     that data sample in the "plus" bucket and an entry of "0" means that     the data sample was placed in the "minus" bucket.  When the entry is a     dot ".", that means we don't care about that entry for the sake of the     explanation here. (Showing 1s and 0s for all entries would create too     much visual clutter in the table.) Note also that, on the right of the     table, I have taken the liberty of using the symbol b as the index     vis-a-vis the parameter b, and the symbol r as the index vis-a-vis the     parameter r.  Each distinct value for the index b shows a separate band     of the hash functions.  And each distinct value for the index r for     each index b shows the output of a separate hash function within each     band.                  The AND-OR property says that for any two data samples to be considered     similar they must agree with respect to all the hash values in ALL of     the r rows IN AT LEAST ONE of the b bands.  (Now I am using b and r as     the parameters of the algorithm.  In the example shown above, both b     and r are set to 3.)  Based on this property, we claim that we have the     following candidate pairs that we should test for similarity: x1 and     x3, x2 and x4, and x1 and x5.          Let's now revisit the question of what probabilities to associate with     the collisions for similar and dissimilar pairs of samples through the     AND-OR construction shown above.          If p1 is the probability that two similar samples would be placed in     the same bucket by the basic hash function used, the probability that r     such functions would do the same in an AND aggregation of the hash     functions is p1^r.  Along the same lines, if p2 is the probability that     the basic hash function would place two dissimilar samples in the same     bucket, then the probability of r such hash functions doing the same in     an AND aggregation of the hash functions goes down to p2^r.  So, with     regard to any single set of r hash functions, we have a desirable     reduction in the collision probability associated with two dissimilar     samples, but, unfortunately, also an undesirable reduction in the     collision probability associated with two similar samples.  However,     assuming that p1 is quite high to begin with, say around 0.9, and p2 is     quite low to begin with, say, around 0.1. The p1^r value will be     impacted much less that the p2^r value. So if the basic hash function     is characterized by the 4-tuple (d1, d2, p1, p2) for its locality     sensitivity, an r-wise AND aggregation of random variations on the     basic hash function will be characterized by (d1, d2, p1^r,p2^r).          Let's now see what OR logic does to a set of random variants of a (d1,     d2, p1, p2) family of hash function. Let's say we are ORing over the     bucket entries as produced by a family of b hash functions.  Since p1     is the probability of collision for similar samples, we can say that     each of the b hash functions would declare two similar samples to be     dissimilar with a probability of (1 - p1).  That implies that, given a     set of b such hash functions, none of them would declare two similar     samples to be similar with a probability of (1 - p1)^b.  Therefore, the     probability that at least one of these b hash functions in an OR     construction would declare two similar samples to be similar is [1 - (1     - p1)^b].  The same logic tells us that an OR combination over b hash     functions would declare two dissimilar samples to be similar with a     probability of [1 - (1 - p2)^b].  Hence, we can characterize the     locality sensitivity of a set of ORed b hash functions by the 4-tuple     (d1, d2, [1 - (1 - p1)^b], [1 - (1 - p2)^b]).          It follows that if we apply the OR logic to b bands of hash functions,     with each band consisting of r hash function operating together     according to the AND logic, we can characterize the locality     sensitivity of the b*r hash functions by the 4-tuple                 (d1, d2, [1 - (1 - p1^r)^b], [1 - (1 - p2^r)^b]).     which we may express as (d1, d2, P1, P2) with P1 = 1 - (1 - p1^r)^b and     P2 = 1 - (1 - p2^r)^b.  We can think of P1 and P2 as our amplified     true-positive and false-positive probabilities.  Given target values     for P1 and P2, one would need to solve the nonlinear equations that     relate P1 and P2 with b and r for the values one should use for these     two parameters in an LSH implementation.          But what about the relationship between the distances d1 and d2, on the     one hand, and the collision probabilities p1 and p2?  That depends on     the choice of the basic hash function.  Consider what's surely the most     commonly used hash function for finding nearest neighbors in     multi-dimensional numerical data: the hyperplane hash function. Let's     say your data consists of vectors in an N dimensional space.  For a     hyperplane passing through the origin of the vector space, the     orientation of the hyperplane being given by its surface normal, we can     measure the similarity two data samples by finding out whether or not     their projection on the normal to the hyperplane is on the same side of     the hyperplane.  Let's say that the angle between the two data vectors     is d1 (the angle being measured in the plane that passes through both     the vectors).  We now say that for a randomly oriented hyperplane, the     probability that the projections of the two data vectors on the normal     to the hyperplane would fall on the same side of the hyperplane is (180     - d) / 180.  Obviously, as d approaches zero, this probability will     approach 1, which makes sense since when d=o, the projections of the     two vectors on the surface normal will always be on the same side of     the hyperplane for all its orientations.  On the other hand, when d     approaches 180 degrees, you have two data vectors that are as     dissimilar as they can be and in this case their projections on the     surface normal will be on the opposite sides of the hyperplane for all     its orientations.  For a non-zero angle d between the data vectors, as     you consider hyperplanes with different possible random orientations     that vary from 0 degrees to 180 degrees (each orientation being the     angle of the surface normal in the plane containing both data vectors),     over the orientation range 0 through 180-d degrees, both data vectors     would project onto the normal on the same side of the hyperplane.     These arguments apply to the relationship between d1 and p1 and also to     the relationship between d2 and p2.  So we write:                            180 - d1              p1  =   ----------                         180                            180 - d2              p2  =   ----------                         180     One can plug these formulas in the 4-tuple sensitivity characterization            (d1, d2, [1 - (1 - p1^r)^b], [1 - (1 - p2^r)^b]).     and given the desired targets for the probabilities for true-positive     collisions and false-positive collisions, solve for the best values to     use for b and r.     For further information regarding Locality Sensitive Hash Functions,     the reader is referred to Chapter 3 of the book "Mining of Massive     Datasets" by Jure Leskovec, Anand Rajaraman, and Jeffery Ullman.      CAN NEAREST NEIGHBORS RETURNED BY LSH BE USED TO CLUSTER THE DATA?     Strictly speaking, the answer is no.            The problem is that the nearest neighbor property as calculated by LSH     with hyperplane hash functions is not transitive.  Let's say that LSH     considers data samples X and Y to be similar because the algorithm has     placed them in the same bucket.  And, again, let's say that LSH     considers the data samples Y and Z to be similar because the algorithm     has placed them also in the same bucket (this bucket being different     from the first bucket).       X being similar to Y and Y being similar to Z in the sense described     here may not always imply that X is similar to Z.  Just imagine three     points on the same great circle on the surface of a sphere.  The     angular interval between X and Y, on the one hand, and between Y and Z,     on the other, could be sufficiently small so that X and Y would be     considered similar and Y and Z would be considered similar also.  Yet,     the angular interval between X and Z may exceed the threshold test for     similarity.          Despite the difficulty created by the non-transitivity of angle-based     measures of similarity between any two data samples, it may be possible     to get good clustering results if the data is known to reside in     non-overlapping clusters in the vector space. Although such a property     is not likely to be true for most practical applications, modifications     of the logic presented in this module may provide acceptable results     when the data clusters are somewhat overlapping.  Consider, for     example, the notion used in this module that two neighborhoods returned     by LSH can be clustered together if they share data samples.  We could     additionally predicate the joining of such neighborhoods on the     distance between their means, or the distance between the shared data     samples and the means of the two neighborhoods, with the idea that the     shared data samples closer to the respective means in each of the     neighborhoods are to be trusted more with regard to the two     neighborhoods belonging to the same cluster.  This additional logic is     not yet in the module.          Think of this module as just a first step in the direction of exploring     how one might exploit the neighborhood information returned by LSH in     forming similarity clusters in data.      USAGE:     If you want to find just the nearest neighbors of a given data sample     in a datafile that was processed by this module, your usage of the     module will look like:         lsh = LocalitySensitiveHashing(                    datafile = "data_for_lsh.csv",                    dim = 10,                    r = 50,                             b = 100,                         )         lsh.get_data_from_csv()         lsh.initialize_hash_store()         lsh.hash_all_data()         similarity_neighborhoods = lsh.lsh_basic_for_nearest_neighbors()     where 'dim' is the dimensionality of the numerical data in the file     that in the above example is named "data_for_lsh.csv".  The datafile     must be in the CSV format. (See the Examples directory for what a     datafile must look like.)  The parameter 'r' is the number of rows for     r-wise AND in each band of the hash functions, and the paramter 'b' is     the number of bands for b-wise OR over all the bands.     On the other hand, if you want to use this module for clustering your     data using LSH for discovering the neighbors of the individual data     samples, your usage of the module will look like:         lsh = LocalitySensitiveHashing(                    datafile = "data_for_lsh.csv",                    dim = 10,                    r = 50,                                b = 100,                                  expected_num_of_clusters = 10,               )         lsh.get_data_from_csv()         lsh.initialize_hash_store()         lsh.hash_all_data()         similarity_groups = lsh.lsh_basic_for_neighborhood_clusters()         coalesced_similarity_groups = lsh.merge_similarity_groups_with_coalescence( similarity_groups )         merged_similarity_groups = lsh.merge_similarity_groups_with_l2norm_sample_based( coalesced_similarity_groups )         lsh.write_clusters_to_file( merged_similarity_groups, "clusters.txt" )             where the constructor parameters 'dim', 'r', and 'b' carry the same    meaning as mentioned for the previous case.  The new constructor    parameter 'expected_num_of_clusters' specifies how many clusters you    expect to see in the data.  The last four statements are about    constructing clusters from the neighborhoods returned by LSH.  Their    roles are explained elsewhere on this documentation page.  The next to    the last statement shown above first orders the clusters formed    according to their size.  It then retains a certain number of the    largest clusters, the number being as specified by the constructor    parameter expected_num_of_clusters.  The samples in the excess clusters    are then pooled together and each sample in the pool assigned to the    closest retained cluster.    In the usage example shown above for clustering the data, you can    replace the call in the next to the last statement by         merged_similarity_groups = lsh.merge_similarity_groups_with_l2norm_set_based( coalesced_similarity_groups )    This invokes a set based approach to merging the excess clusters with    the retained clusters.  In the set based approach, you find the    difference between the means of the excess clusters and the retained    clusters.  An excess cluster is merged with a retained cluster if the    difference between the means of the two is the least. CONSTRUCTOR PARAMETERS:     datafile:            Must be a ".csv" file.  Each record in this file                          corresponds one data point in a vector space.                          Each record must have associated with it a                          unique symbolic name that must be in the first                          column.     dim:                 Is set to the dimensionality of the vector space                          in which the data is defined.     r:                   The number of rows in each band of the hash                          functions.  An r-wise AND operator is applied to                          the buckets in each such band.     b:                   The number of r-row bands of the hash functions.                            A b-wise OR operator is applied to the buckets                          that correspond to each of the bands.     expected_num_of_clusters:  This tell the module how many clusters                          you expect to see in your datafile.      METHODS:     (1)  display_contents_of_all_hash_bins_pre_lsh()          As mentioned elsewhere in this documentation page, each data          sample is hashed by a randomly oriented hyperplane (passing          through the origin of the vector space in which the data resides)          by projecting the sample on the normal to the hyperplane. This          creates a two-bin hash table, with some samples projected into the          positive half-space and the others into the negative half-space.          This method displays the orientation of the hyperplane, along with          two lists, one consisting of the sample names that projected into          the positive half-space and the other consisting of the sample          names that projected into the negative half-space.     (2)  evaluate_quality_of_similarity_groups( merged_similarity_groups )          If the symbolic names in your datafile are based on the          "sampleX_Y" format, where X is the integer ID of a cluster and Y          the integer ID of a data element in the cluster, you can call on          this method to evaluate the quality of the clusters produced by          the LSH module.     (3)  get_data_from_csv()          This method extracts the numerical data from your CSV file.     (4)  hash_all_data()          It is this method that hashes all of your data records in the CSV          file with r * b number of hash functions, each hash function being          a randomly oriented hyperplane passing through the origin of the          vector space in which the data resides.     (5)  initialize_hash_store()          This method must be called before the 'hash_all_data()' method.          The initialization consists of generating the desired number of          hyperplane orientations randomly and associating with each          orientation a two-bin hash table in the form of a dictionary with          two <key,value> pairs in it for the keys 'plus' and 'minus', with          'plus' standing for the positive half-space and 'minus' for the          negative half-space for each hyperplane.     (6)  lsh_basic_for_nearest_neighbors()          This method is this module's implementation of the hyperplane          based LSH algorithm.  This method's mandate is what is          traditionally accomplished with LSH --- finding nearest neighbors          for data elements.     (7)  lsh_basic_for_neighborhood_clusters()          This method is a slight variation on the previous method, in that,          instead of returning the nearest neighbors of a data element, it          merges the data element with its LSH-discovered neighbors to form          a cluster.  The method returns a list of such clusters.     (8)  merge_similarity_groups_with_coalescence()          This is where we violate the traditional mandate of the LSH          algorithm --- so use it with care.  Make doubly sure that this and          other similar methods are appropriate for your data.  This method          merges together those clusters produced by the previous method if          they share any data elements.  As you can well imagine, if the          data is noisy and the clusters in the data are overlapping, this          method may cluster all of your data into a single large cluster.     (9)  merge_similarity_groups_with_l2norm_sample_based()          A sign of the coalescence step in the previous method working          reasonably well is that the number of clusters produced by the          previous method will be larger than the number of clusters          actually present in the data.  If that is the case, you can use          this method to merge the excess clusters with those that are          retained.  This method first orders by size the clusters produced          by the previous method.  Subsequently, it pools together all the          samples in the excess clusters.  Each sample in this pool is then          merged with one of the retained clusters on the basis of the least          distance between the sample and cluster mean.     (10) merge_similarity_groups_with_l2norm_set_based()          The contract of this method is the same as that of the previous          method, except for the difference that the excess clusters are          merged wholesale with the retained clusters.  The methods computes          the difference in the mean vectors between a given excess cluster          and each of the retained clusters. An excess cluster is merged          with the nearest retained cluster on the basis of this difference          in the means of the two being the smallest.     (11) prune_similarity_groups(self):          If the module is producing too many small clusters and you don't          see any value in retaining them, you can call on this method to          get rid of them.  In order to use this method, you must first set          the constructor parameter 'similarity_group_min_size_threshold' to          the minimum cluster size you want.     (12) show_data_for_lsh()          For the purpose of verification, this method shows you what data          was extracted from the CSV file.     (13) write_clusters_to_file()              You can call on this method to write the clusters out to a disk          file.  The method takes two arguments, the first for the list of          clusters returned by any of the methods in items (8), (9), and          (10) above, and the second for the name of the disk file. The DataGenerator CLASS:     The module comes equipped with a DataGenerator class that you can use     to generate multi-class multi-variate data for experimenting with the     LSH module.     The DataGenerator class is programmed to generate N "balls" of     multi-variate Gaussian data, where N is the value for the parameter     'how_many_similarity_groups' shown in the constructor call     below. Consider an N dimensional cube in the positive quadrant of an     N-dimensional space.  Such a cube has 2^N vertices. The N Gaussian     balls are centered at the N vertices of the cube that are closest to     the origin.         import LocalitySensitiveHashing         dim = 10         covar = numpy.diag([0.01] * dim)         output_file = 'data_for_lsh.csv'         data_gen = DataGenerator(                                   output_csv_file   = output_file,                                   how_many_similarity_groups = 10,                                   dim = dim,                                   number_of_samples_per_group = 8,                                   covariance = covar,                                 )                  data_gen.gen_data_and_write_to_csv()          THE EXAMPLES DIRECTORY:     The best way to become familiar with this module is by executing the     following scripts in the Examples subdirectory:     1.  LSH_basic_for_demonstrating_nearest_neighbors.py             This script demonstrates the functionality one traditionally             associates with the LSH algorithm --- finding nearest             neighbors.  This script places the user in an interactive             session in which the user is asked to enter the symbolic name             of a data record in the datafile that was processed by the LSH             algorithm.  Subsequently, the user is shown the nearest             neighbors of that record.     2.  Clustering_with_LSH_with_sample_based_merging.py             This is one of the two scripts in the Examples directory that             attempt to cluster the data that is supplied to the module             through a CSV file. The data is clustered on the basis of the             neighborhoods supplied by the LSH algorithm.  If the             coalescence of the LSH generated neighborhood creates more             cluster than expected, this script calls on the             merge_similarity_groups_with_l2norm_sample_based() to merge             excess small clusters with the main retained clusters.     3.  Clustering_with_LSH_with_set_based_merging.py             This is the second of the cluster producing scripts in the             Examples directory.  Again, the starting points for forming the             clusters are the neighborhoods supplied by the LSH algorithm.             Unlike the previous script, this script calls on the             merge_similarity_groups_with_l2norm_set_based() to merge excess             small clusters with the main retained clusters.     4.  gen_data.py             This script uses the DataGenerator class that comes with the             LSH module to generate multi-variate Gaussian data for             experimenting with the LSH module.  The output file produced by             this method can be used directly for input to the LSH class. CAVEATS:     Assuming that your data contains non-overlapping clusters, as to what     sort of clustering results you'll get with this module depends a great     deal on your choice of b and r parameters.  If you can make a good     guess at the distances d1 and d2 appropriate for your application     (recall from the Introduction that we consider two data elements to be     similar if their angular difference in the vector space is at most d1;     and we consider them to be dissimilar if their angular difference is at     least d2 = c * d1 with c > 1), and if you can come up with the target     true-positive collision probability P1 and false-positive collision     probability P2 as defined in the Introduction, you can solve the     nonlinear equations presented there for the best values to use for b     and r. INSTALLATION:                                                                                                         The LSH class was packaged using setuptools.  For installation, execute     the following command in the source directory (this is the directory     that contains the setup.py file after you have downloaded and     uncompressed the package):                                                                                                                 sudo python setup.py install                                                                                                                                                                    and/or, for the case of Python 3,                                                                                                                                                                                sudo python3 setup.py install                                                                                                                                                                   On Linux distributions, this will install the module file at a location                             that looks like                                                                                                                                                                                                  /usr/local/lib/python2.7/dist-packages/                                                                                                                                                        and, for the case of Python 3, at a location that looks like                                                                                                                                                      /usr/local/lib/python3.4/dist-packages/     If you do not have root access, you have the option of working directly     off the directory in which you downloaded the software by simply     placing the following statements at the top of your scripts that use     the LSH class:         import sys         sys.path.append( "pathname_to_LSH_directory" )     To uninstall the module, simply delete the source directory, locate     where LSH was installed with "locate LSH" and delete those files.  As     mentioned above, the full pathname to the installed version is likely     to look like /usr/local/lib/python2.7/dist-packages/LSH*     If you want to carry out a non-standard install of the LSH module,     look up the on-line information on Disutils by pointing your     browser to           http://docs.python.org/dist/dist.html ACKNOWLEDGMENTS:     The author has learned much from his LSH-related discussions with Tommy     Chang who is currently finishing his Ph.D. in the Robot Vision Lab at     Purdue.  Tommy is using LSH to solve a truly big-data problem: the     problem of creating concise training and testing datasets for wide-area     land-cover classification algorithms that must work well on hundreds of     satellite images (as opposed to just one satellite image at a time,     which is the norm in the remote-sensing community).  Typically, these     images cover a region of the earth whose size may be as large as     hundreds of thousands of square kilometers. ABOUT THE AUTHOR:     The author, Avinash Kak, recently finished his 17-year long Objects     Trilogy project with the publication of the book "Designing with     Objects" by John-Wiley. If interested, check out his web page at Purdue     to find out what the Objects Trilogy project was all about. You might     like "Designing with Objects" especially if you enjoyed reading Harry     Potter as a kid (or even as an adult, for that matter). @endofdocs

`Download Version 1.0.1: gztar bztar`	`Total number of downloads (all versions): 1671` `This count is automatically updated at every rotation of the weblogs (normally once every two to four days) Last updated: Tue Dec 2 06:08:01 EST 2025`

Imported Modules

numpy
os
random
re
signal
string
sys

Classes



__builtin__.object

DataGenerator
LocalitySensitiveHashing

class DataGenerator(__builtin__.object)

    Methods defined here:

__init__(self, *args, **kwargs)

gen_data_and_write_to_csv(self)
Note that a unit cube in N dimensions has 2^N corner points.  The coordinates of all these corner points are given by the bit patterns of the integers 0, 1, 2, ...., 2^N - 1. For example, in a vector 3-space, a unit cube has 8 corners whose coordinates are given by the bit patterns for the integers 0, 1, 2, 3, 4, 5, 6, 7.  These bit patterns would be 000, 001, 010, 011, 100, 101, 110, 111. This script uses only N of the 2^N vertices of a unit cube as the mean vectors for N similarity groups.   These N vertices correspond to the far points on the cube edges that emanate at the origin.  For example, when N=3, it uses only 001,010,100 as the three mean vectors for the AT MOST 3 similarity groups.  If needed, we can add additional similarity groups by selecting additional coordinate bit patterns from the integers 0 through 2^N - 1.

Data descriptors defined here:

__dict__

dictionary for instance variables (if defined)

__weakref__

list of weak references to the object (if defined)

class LocalitySensitiveHashing(__builtin__.object)

    Methods defined here:

__init__(self, *args, **kwargs)

display_contents_of_all_hash_bins_pre_lsh(self)

evaluate_quality_of_similarity_groups(self, evaluation_similarity_groups)
The argument to this method, evaluation_similarity_groups, is a list of sets, with each set being a similarity group, which is the same thing as a cluster. If you plan to invoke this method to evaluate the quality of clustering achieved by the values used for the parameters r and b, you'd want the data records in the CSV datafile to look like:     sample0_3,0.925,-0.008,0.009,0.058,0.092,0.117,-0.076,0.239,0.086,-0.149 Note in particular the syntax used for naming a data record. The name 'sample0_3' means that this is the 3rd sample generated randomly for data class 0.  The goal of this method is to example all such  sample names and figure out how many classes exist in the data.

get_data_from_csv(self)

hash_all_data(self)

hash_all_data_with_one_hyperplane(self)

initialize_hash_store(self)

lsh_basic_for_nearest_neighbors(self)
Regarding this implementation of LSH, note that each row of self.htable_rows corresponds to one hash function.  So if you have 3000 hash functions for 3000 different randomly chosen orientations of a hyperplane passing through the origin of the vector space in which the numerical data is defined, this table has 3000 rows.  Each column of self.htable_rows is for one data sample in the vector space.  So if you have 80 samples, then the table has 80 columns. The output of this method consists of an interactive session in which the user is asked to enter the symbolic name of a data record in the dataset processed by the LSH algorithm. The method then returns the names (some if not all) of the nearest neighbors of that data point.

lsh_basic_for_neighborhood_clusters(self)
This method is a variation on the method lsh_basic_for_nearest_neighbors() in the following sense: Whereas the previous method outputs a hash table whose keys are the data sample names and whose values are the immediate neighbors of the key sample names, this method merges the keys with the values to create neighborhood clusters.  These clusters are returned as a list of similarity groups, with each group being a set.

merge_similarity_groups_with_coalescence(self, similarity_groups)
The purpose of this method is to do something that, strictly speaking, is not the right thing to do with an implementation of LSH.  We take the clusters produced by the method lsh_basic_for_neighborhood_clusters() and we coalesce them based on the basis of shared data samples. That is, if two neighborhood clusters represented by the sets A and B have any data elements in common, we merge A and B by forming the union of the two sets.

merge_similarity_groups_with_l2norm_sample_based(self, similarity_groups)
The neighborhood set coalescence as carried out by the previous method will generally result in a clustering structure that is likely to have more clusters than you may be expecting to find in your data. This method first orders the clusters (called 'similarity groups') according to their size.  It then pools together the data samples in the trailing excess similarity groups.   Subsequently, for each data sample in the pool, it merges that sample with the closest larger group.

merge_similarity_groups_with_l2norm_set_based(self, similarity_groups)
The overall goal of this method is the same as that of merge_similarity_groups_with_l2norm_sample_based(), except for the difference that we now merge the excess similarity groups wholesale with the retained similarity groups.  For each excess similarity group, we find the closest retained similarity group, closest in terms of the l2 norm distance between the mean values of the two groups.

prune_similarity_groups(self)
If your data produces too many similarity groups, you can get rid of the smallest with this method.  In order to use this method, you must specify a value for the parameter 'similarity_group_min_size_threshold' in the call to the constructor of the LSH module.

show_data_for_lsh(self)

show_sample_to_initial_similarity_group_mapping(self)

write_clusters_to_file(self, clusters, filename)

Data descriptors defined here:

__dict__

dictionary for instance variables (if defined)

__weakref__

list of weak references to the object (if defined)

Functions


Ctrl_c_handler(signum, frame)
# Needed for cleanly terminating the interactive method lsh_basic_for_nearest_neighbors():

band_hash_group_index(block_name)
The keys of the final output that is stored in the hash self.coalesced_band_hash are strings that look like:      "block3 10110" This function returns the block index, which is the integer that follows the word "block" in the first substring in the string that you see above.

cleanup_csv(line)

convert(value)

deep_copy_array(array_in)
Meant only for an array of scalars (no nesting):

l2norm(list1, list2)

sample_group_index(sample_group_name)
As the comment block for the previous function explains, the data sample for LSH are supposed to have a symbolic name at the beginning of the comma separated string.  These symbolic names look like 'sample0_0', 'sample3_4', etc., where the first element of the name, such as 'sample0', indicates the group affiliation of the sample.  The purpose of this function is to return just the integer part of the group name.

sample_index(sample_name)
We assume that the raw data is stored in the following form:    sample0_0,0.951,-0.134,-0.102,0.079,0.12,0.123,-0.03,-0.078,0.036,0.138    sample0_1,1.041,0.057,0.095,0.026,-0.154,0.231,-0.074,0.005,0.055,0.14    ...    ...    sample1_8,-0.153,1.083,0.041,0.086,-0.059,0.042,-0.172,0.014,-0.153,0.091    sample1_9,0.051,1.122,-0.014,-0.117,0.015,-0.044,0.011,0.008,-0.121,-0.017    ...    ... This function returns the second integer in the name of each data record. It is useful for sorting the samples and for visualizing whether or not the final clustering step is working correctly.

Data

__author__ = 'Avinash Kak (kak@purdue.edu)'
__copyright__ = '(C) 2017 Avinash Kak. Python Software Foundation.'
__date__ = '2017-May-25'
__url__ = 'https://engineering.purdue.edu/kak/distLSH/LocalitySensitiveHashing-1.0.1.html'
__version__ = '1.0.1'

Author

Avinash Kak (kak@purdue.edu)

Data
		__author__ = 'Avinash Kak (kak@purdue.edu)' __copyright__ = '(C) 2017 Avinash Kak. Python Software Foundation.' __date__ = '2017-May-25' __url__ = 'https://engineering.purdue.edu/kak/distLSH/LocalitySensitiveHashing-1.0.1.html' __version__ = '1.0.1'