15.7. The Set data structure
A data structure is a container for grouping a collection of data into a
single object. We have seen some examples already, including strings,
which are collections of characters, and vectors, which are collections
of any type.
An ordered set is a collection of items with two defining properties:
- Ordering: The elements of the set have indices associated with them. We can use these indices to identify elements of the set.
- Uniqueness: No element appears in the set more than once. If you try to add an element to a set, and it already exists, there is no effect.
In addition, our implementation of an ordered set will have the following property:
- Arbitrary size: As we add elements to the set, it expands to make room for new elements.
Both strings and vectors have an ordering; every element
has an index we can use to identify it. None of the data structures
we have seen so far have the properties of uniqueness or arbitrary size.
To achieve uniqueness, we have to write an add function that
searches the set to see if the element already exists. To make the set
expand as elements are added, we can take advantage of the resize
function on vectors.
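As a minimal sketch of the resize behavior mentioned above, assuming the standard vector: resizing to a larger length keeps the existing elements at their indices and fills the new slots with empty strings. The names doubled and resizeDemo are hypothetical, and the city names are made up for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>
using namespace std;

// Hypothetical helper: returns a copy of v with twice as many slots.
vector<string> doubled (vector<string> v)
{
  v.resize (v.size() * 2);   // existing elements keep their indices
  return v;
}

// Hypothetical demonstration: old elements survive, new slots are empty.
void resizeDemo ()
{
  vector<string> v (2);
  v[0] = "Atlanta";
  v[1] = "Boston";

  vector<string> w = doubled (v);
  assert (w.size() == 4);
  assert (w[0] == "Atlanta" && w[1] == "Boston");
  assert (w[2] == "" && w[3] == "");
}
```

Calling resizeDemo() from main runs the checks; if any of them failed, the program would stop with an assertion error.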
Here is the beginning of a class definition for a Set.
class Set {
private:
  vector<string> elements;
  int numElements;
public:
  Set (int n);
  int getNumElements () const;
  string getElement (int i) const;
  int find (const string& s) const;
  int add (const string& s);
};

Set::Set (int n)
{
  vector<string> temp (n);
  elements = temp;
  numElements = 0;
}
The instance variables are a vector of strings and an integer
that keeps track of how many elements there are in the set. Keep in mind
that the number of elements in the set, numElements, is not the same
thing as the size of the vector. Usually it will be smaller.
The Set constructor takes a single parameter, which is the initial
size of the vector. The initial number of elements is always zero.
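To make the difference between the vector's size and the number of elements concrete, here is a small sketch. SetSketch is a hypothetical stand-in for the Set class, and getCapacity is an accessor invented only so that the vector's length can be observed; neither is part of the class in the text. The constructor uses a member initializer list, which is one common alternative to building a temporary vector.

```cpp
#include <cassert>
#include <string>
#include <vector>
using namespace std;

// SetSketch is a hypothetical stand-in for the Set class in the text.
class SetSketch {
private:
  vector<string> elements;
  int numElements;
public:
  // Build the vector with n empty slots directly; the set starts empty.
  SetSketch (int n) : elements (n), numElements (0) {}

  int getNumElements () const { return numElements; }

  // Hypothetical accessor, added only to expose the vector's length.
  int getCapacity () const { return (int) elements.size(); }
};

// Hypothetical demonstration: five slots are allocated, none are in use.
void sizeDemo ()
{
  SetSketch s (5);
  assert (s.getCapacity() == 5);
  assert (s.getNumElements() == 0);
}
```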
getNumElements and getElement are accessor functions for the
instance variables, which are private. numElements is a read-only
variable, so we provide a get function but not a set function.
int Set::getNumElements () const
{
  return numElements;
}
Why do we have to prevent client programs from changing
numElements? What are the invariants for this type, and how could
a client program break an invariant? As we look at the rest of the
Set member functions, see if you can convince yourself that they all
maintain the invariants.
When we use the [] operator to access the vector, the index has to be
greater than or equal to zero and less than the length of the vector.
(The standard vector does not check this for us when we use [],
although its at function does.) To access the elements of a set, though,
we need to check a stronger condition. The index has to be at least zero
and less than the number of elements, which might be smaller than the
length of the vector.
string Set::getElement (int i) const
{
  // valid indices run from 0 up to numElements - 1
  if (i >= 0 && i < numElements) {
    return elements[i];
  } else {
    cout << "Set index out of range." << endl;
    exit (1);
  }
}
If getElement gets an index that is out of range, it prints an error
message (not the most useful message, I admit), and exits.
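One possible variation, which is not the approach taken in the text, is to throw an exception instead of exiting, so that the caller can decide how to recover. The sketch below is a hypothetical free function, checkedGet, that takes the vector and the element count as parameters; it reports the offending index in its message.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>
using namespace std;

// Hypothetical free-function version of the range check in getElement.
// It throws out_of_range instead of calling exit.
string checkedGet (const vector<string>& elements, int numElements, int i)
{
  if (i < 0 || i >= numElements) {
    throw out_of_range ("Set index out of range: " + to_string (i));
  }
  return elements[i];
}

// Hypothetical demonstration: index 2 fits in the vector (4 slots)
// but not in the set (2 elements), so it is rejected.
bool rejectsUnusedSlot ()
{
  vector<string> v (4);
  v[0] = "a";
  v[1] = "b";
  try {
    checkedGet (v, 2, 2);
  } catch (const out_of_range&) {
    return true;
  }
  return false;
}
```

The demonstration shows why the check uses the number of elements rather than the length of the vector: an index can be legal for the vector and still refer to an unused slot.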
The interesting functions are find and add. By now, the pattern
for traversing and searching should be old hat:
int Set::find (const string& s) const
{
  for (int i=0; i<numElements; i++) {
    if (elements[i] == s) return i;
  }
  return -1;
}
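For comparison, the same search can be sketched with the standard library's find algorithm from the algorithm header. This is a swapped-in technique, not the version in the text, and findIndex is a hypothetical free function that takes the vector and the element count as parameters. The important detail is that the search stops at numElements, so unused slots at the end of the vector are never examined.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>
using namespace std;

// Hypothetical free-function version of Set::find built on std::find.
int findIndex (const vector<string>& elements, int numElements, const string& s)
{
  // Search only the occupied slots, not the unused tail of the vector.
  auto last = elements.begin() + numElements;
  auto it = std::find (elements.begin(), last, s);
  if (it == last) return -1;
  return (int) (it - elements.begin());
}

// Hypothetical demonstration: the empty strings in unused slots
// are not reported as members of the set.
void findDemo ()
{
  vector<string> v (4);
  v[0] = "a";
  v[1] = "b";
  assert (findIndex (v, 2, "b") == 1);
  assert (findIndex (v, 2, "z") == -1);
  assert (findIndex (v, 2, "") == -1);
}
```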
So that leaves us with add. Often the return type for something like
add would be void, but in this case it might be useful to make it
return the index of the element.
int Set::add (const string& s)
{
  // if the element is already in the set, return its index
  int index = find (s);
  if (index != -1) return index;

  // if the vector is full, double its size
  // (an empty vector gets room for one element)
  if (numElements == (int) elements.size()) {
    elements.resize (elements.empty() ? 1 : elements.size() * 2);
  }

  // add the new element and return its index
  index = numElements;
  elements[index] = s;
  numElements++;
  return index;
}
The tricky thing here is that numElements is used in two ways. It is
the number of elements in the set, of course, but it is also the index
of the next element to be added.
It takes a minute to convince yourself that that works, but consider
this: when the number of elements is zero, the index of the next element
is 0. When the number of elements is equal to the length of the
vector, that means that the vector is full, and we have to
allocate more space (using resize) before we can add the new
element.
Here is a state diagram showing a Set object that initially contains
space for 2 elements.
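The growth of such a set can also be traced in code. The sketch below is a condensed, hypothetical copy of the class called TraceSet; getCapacity is an accessor invented for the trace, the city names are made up, and the resize step includes a guard so that an empty vector grows to one slot.

```cpp
#include <cassert>
#include <string>
#include <vector>
using namespace std;

// TraceSet is a hypothetical, condensed copy of the Set class.
class TraceSet {
private:
  vector<string> elements;
  int numElements;
public:
  TraceSet (int n) : elements (n), numElements (0) {}
  int getNumElements () const { return numElements; }
  // Hypothetical accessor: exposes the vector's length for the trace.
  int getCapacity () const { return (int) elements.size(); }

  int find (const string& s) const {
    for (int i = 0; i < numElements; i++) {
      if (elements[i] == s) return i;
    }
    return -1;
  }

  int add (const string& s) {
    int index = find (s);
    if (index != -1) return index;
    if (numElements == (int) elements.size()) {
      // double the vector; an empty vector gets one slot
      elements.resize (elements.empty() ? 1 : elements.size() * 2);
    }
    index = numElements;
    elements[index] = s;
    numElements++;
    return index;
  }
};

// Hypothetical trace: start with space for 2 elements and add cities.
void traceDemo ()
{
  TraceSet cities (2);
  assert (cities.add ("Atlanta") == 0);
  assert (cities.add ("Boston") == 1);
  assert (cities.getCapacity() == 2);      // vector is now full
  assert (cities.add ("Chicago") == 2);    // forces a resize
  assert (cities.getCapacity() == 4);
  assert (cities.add ("Boston") == 1);     // duplicate: no effect
  assert (cities.getNumElements() == 3);
}
```

After the third city is added, the vector has four slots but the set holds three elements, so index 3 is still out of range for the set.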
Now we can use the Set class to keep track of the cities we find in
the file. In main we create the Set with an initial size of 2:
Set cities (2);
Then in processLine we add both cities to the Set and store the
indices that get returned.
int index1 = cities.add (city1);
int index2 = cities.add (city2);
I modified processLine to take the cities object as a second
parameter.
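The modified function is not shown here, so the following is only a sketch under stated assumptions. CitySet is a hypothetical, minimal stand-in for the Set class (it uses push_back to stay short), and processCities is a hypothetical simplification of processLine that receives the two city names directly instead of parsing a line. The point it illustrates is that the set should be passed by reference, so the cities added inside the function are still there when it returns.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>
using namespace std;

// CitySet is a hypothetical, minimal stand-in for the Set class.
class CitySet {
private:
  vector<string> elements;
public:
  int getNumElements () const { return (int) elements.size(); }

  // Returns the index of s, adding it first if it is not present.
  int add (const string& s) {
    for (int i = 0; i < (int) elements.size(); i++) {
      if (elements[i] == s) return i;
    }
    elements.push_back (s);
    return (int) elements.size() - 1;
  }
};

// Hypothetical simplification of processLine: the parsing step is omitted.
// cities is passed by reference so additions persist after the call.
pair<int, int> processCities (const string& city1, const string& city2,
                              CitySet& cities)
{
  int index1 = cities.add (city1);
  int index2 = cities.add (city2);
  return pair<int, int> (index1, index2);
}
```

If cities were passed by value instead, each call would add to a copy, and the set in main would stay empty.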
Q-2: Which of the following are properties of an ordered set?
- the set grows to accommodate any new elements we add
  - Correct! This is the "arbitrary size" property.
- the set is sorted in some order (e.g., alphabetically or numerically)
  - Incorrect! This is not a requirement of a set.
- elements of the set have indices, which can be used to identify them
  - Correct! This is the "ordering" property.
- there is a limit on how large a set can be
  - Incorrect! This is not a requirement of a set. In fact, our set expands whenever it runs out of room for a new element.
- there are no repeat elements in the set
  - Correct! This is the "uniqueness" property.
Q-3: Why don’t we provide a set() function for numElements?
- numElements is a read-only variable.
  - Correct!
- The user might pick a value for numElements that is out of range.
  - Incorrect! While this could happen, it wouldn't make sense for the user to interact with numElements at all.
- numElements cannot be modified.
  - Incorrect! numElements is modified, just not by the user.
- We should provide a set function, we just haven't implemented it yet!
  - Incorrect! There is no need for the user to have access to a set function.