Section 33.5 Standard Library Hash Functions and Custom Types
Rather than writing our own hash functions for built in types like doubles and strings, we can use the hash functions provided by the C++ standard library. The
std::hash class is a templated functor that can be used to compute a hash value (as a
size_t) for a given type. For example, to compute the hash value of a string, we can use the following code:
string name = "John Smith";
string name2 = "Sandra Jackson";
// Create a hash functor for strings
std::hash<string> stringHasher;
// Compute the hash values for the strings
size_t strHash = stringHasher(name);
size_t strHash2 = stringHasher(name2);
For custom data types, we need to write our own hash functions that use
std::hash to compute hash values for basic members and then combine them using bit shifting and xor to produce a single hash value for the entire object.
For example, if we have a
Point struct with
x and
y integer coordinates, we can write a hash function for it like the one below. It shifts the bits of the x coordinateโs hash value to the left by 8 bits, we ensure that (1, 2) will have a different hash value than a point with coordinates (2, 1).
size_t myHash(const Point& p) {
std::hash<int> intHasher;
size_t pointHash = (intHasher(p.x) << 8) ^ intHasher(p.y);
return pointHash;
}
For more complex objects, we need to carefully consider which members to include in the hash. There are a couple of general guidelines to follow when deciding which members to include in the hash:
-
We should include enough members to ensure that different objects that we want to treat as different will have different hash values. For example, we would not want to compute the hash value for a Person using just the age as then everyone with the age 23 would hash to the exact same value.
-
We should try to avoid including data that is likely to change frequently. The hash value will be used to โfindโ objects in the hash table. We want an objectโs hash value to remain consistent whenever possible.
If there is a single member that is unique for each object (like a social security number for a person), the best strategy is often to just use the hash value of that member as the hash value for the entire object. If there is not a single unique member, we need to decide on a set of members to include in the hash and then combine their hash values in a way that produces a good distribution of hash values.
We also need to make sure that the logic for the hash function matches the logic of equality. If two objects are considered equal by the equality operator, they should have the same hash value. If two objects are not considered equal, they should ideally have different hash values (although we do not have to guarantee this). Any object we are going to hash should also have an equality operator defined that relies on the same attributes as the hash function.
Letโs consider two different versions of a
Student class. The first will have a
studentID member that is unique for each student:
struct Student {
string firstName;
string lastName;
string dateOfBirth;
int earnedCredits;
double GPA;
int studentID; // Unique identifier for each student
};
bool operator==(const Student& s1, const Student& s2) {
return s1.studentID == s2.studentID; // Equality based on studentID
}
size_t myHash(const Student& s) {
std::hash<int> intHasher;
return intHasher(s.studentID); // Hash based on studentID
}
Here, the existence of a unique field makes things simple. We will just base the hash value on that unique field. Then, we will make sure that any two items that are
== always will produce the same hash value by defining the equality operator to also only rely on the unique field.
Now, letโs consider a second version of the
Student class where we just have
firstName,
lastName, and
dateOfBirth members, none of which are unique by themselves:
struct Student {
string firstName;
string lastName;
string dateOfBirth;
int earnedCredits;
double GPA;
bool operator==(const Student& s1, const Student& s2) {
// Equality based on first name, last name, and date of birth
return s1.firstName == s2.firstName &&
s1.lastName == s2.lastName &&
s1.dateOfBirth == s2.dateOfBirth;
}
size_t myHash(const Student& s) {
std::hash<string> stringHasher;
size_t h1 = stringHasher(s.firstName);
size_t h2 = stringHasher(s.lastName);
size_t h3 = stringHasher(s.dateOfBirth);
// Combine the hash values and shift each field by one additional bit
return h1 ^ (h2 << 1) ^ (h3 << 2);
}
In this second version, we do not have a single unique field to base the hash on, so we need to combine the hash values of multiple fields. We use bit shifting and xor to combine the hash values of the first name, last name, and date of birth. We will shift each value by a different number of bits (0, 1, and 2) to ensure that the order of the fields affects the final hash value. This way, we will produce different hash values for a student with the first name "John", last name "Smith", and date of birth "01/01/2000" than for a student with the first name "Smith", last name "John", and date of birth "01/01/2000". However, if we have two students with the same first name, last name, and date of birth, they will produce the same hash value, which is consistent with our equality operator. (It should be clear why having a unique student ID is a nice way to identify them.)
As before, we need to make sure that the logic of the hash function is consistent with the logic of the equality operator. By defining
== to rely on the same fields as the hash function, we ensure that any two students that are considered equal will produce the same hash value.
Note that neither version includes things like earned credits or GPA in the hash. If we included those values in the hash and then tried to update a studentโs GPA, the hash value for that student would change, which means its location in the table would need to change as well.
Checkpoint 33.5.1.
We are storing structs that contain information about stocks:
struct StockInfo {
string tickerSymbol;
string companyName;
int sharesOwned;
double currentPrice;
};
The ticker symbol is the short code used to identify stocks in listings, while the company name is the full legal name of the company. Thus, we might have a value like this:
{tickerSymbol: "GOOG", companyName: "Alphabet Inc.", sharesOwned: 100, currentPrice: 316.45}.
What members should we include in the hash function for this struct?
tickerSymbol
-
companyName
-
This is a tough call. If we believe the ticker symbol is unique (which it is), then using just it would be sufficient. And not including the companyName means that we can let that name change without having to worry about rehashing.
If tickerSymbols were not unique, we might want to include the companyName to ensure a unique identifier.
sharesOwned
Shares owned is likely to change. And it does not contribute to uniquely identifying the stock.
currentPrice
-
Current price is likely to change and does not contribute to uniquely identifying the stock.
It is also a double which can introduce floating-point precision issues.
Checkpoint 33.5.2.
Which statements are correct about hash functions and equality?
If two objects are equal, their hash values must be the same.
-
If two objects have the same hash value, they must be equal.
This is not necessarily true. Two objects can have the same hash value (a collision) without being equal. Though it is desirable for hash functions to minimize collisions, they cannot be completely eliminated.
If two objects are equal, they should have different hash values.
This is incorrect. Equal objects must have the same hash value.
If two objects have different hash values, they must not be equal.
Equal objects must have the same hash value, so if two objects have different hash values, they cannot be equal.
You have attempted
of
activities on this page.