A Python-style Set in simple Java

Dan Hales
The Startup
Published in
9 min readJun 7, 2020

--

A Venn diagram showing the intersection of sets A and B

One of the highlights of my time as a high school teacher was having the opportunity to teach AP Computer Science A. I had no background in Java before taking on the class (just a lone course in C++ in college) and had to teach myself the language from the ground up, so when I took on AP Computer Science Principles, I decided to use that as an excuse to dive into another programming language: Python. One of the ways I bridged the gap between thinking in Python and thinking in Java was to reverse-engineer some of the behaviors of core Python objects that were outside of the curriculum I covered in Java. While reinventing the wheel like this is ill-advised for practical purposes, doing so for educational reasons is a great way to take a deep dive into nuance.

In this post, we’ll be exploring a Java implementation of some of the important behaviors of Python set objects. The code is restricted to the AP Computer Science A subset of Java, and does require basic understanding of the ArrayList class, the enhanced for loop, and implementation of the Iterable interface. You should also either understand (or just believe in the magic of) the Integer wrapper class’ autoboxing and unboxing. In order to focus on the behavior of a Set and avoid some rabbit-holes such as shallow versus deep copying, we will limit ourselves to a simple Set for holding Integer objects.

Both Set.java and SetDemo.java, which demonstrates its methods, are available on my github.

First, a UML diagram of the Set class can shed light on some of its behavior:

UML Diagram for the Set class

This Set class will implement the Iterable interface, so we can iterate over its elements one by one in an enhanced for loop (which implicitly calls the iterator method). The elements contained in the Set class will actually be stored in ArrayList of Integer objects — this is to highlight how a Set is different from an ArrayList. You will notice that some methods are simply calls to an underlying ArrayList method, while other methods from the ArrayList class are completely inaccessible.

The constructors, contains, size, and toString behave exactly as we would expect. In fact, contains and size are simply calls to the ArrayList object’s contains and size methods. The differences starts to appear when we examine the add method:

public void add(int new_value){    if (!contains(new_value))
{
set.add((int)(Math.random()*size()), new_value);
}
}

This method does the heavy lifting when it comes to implementing the two main properties that distinguish the Set class from an ArrayList of integers. In the source code, we can see that no value can be added to a Set object without passing through this filter method. The very first thing add does is check to see whether or not the value is already in the list — new_value is added if and only if it is not already present in the Set. This is the first major difference between a traditional ArrayList and a Set; while an ArrayList can contain duplicates, a Set, by definition, contains only unique values. This method does not refuse to add items to the Set; instead, it checks to see if they are already represented.

The second thing this method does is reinforce that the order of elements in a Set is arbitrary. In Python, elements in a set are arranged in memory according to their hash value (which is why asetholds only hashable objects), and that hash value can present them in unpredictable ways. Because hashing is outside the scope of APCSA, we instead randomize the order of the elements to simulate this unpredictability. When thinking about how a Set differs from an ArrayList, we need to throw the idea of ordered elements out the window, and this implementation forces us to do that.

Now that we’ve talked about adding elements to the Set, let’s look at how we can remove them.

public void discard(int value){    for (int i = 0; i < set.size(); i++)    {        if (set.get(i) == value)        {            set.remove(i);        }    }}

If we want to remove a specific item from the set, we can use the discard method, and pass the element you want removed from the Set. If the element is not found in the Set, the discard method does nothing. While this is similar to the ArrayList’s remove method, which deletes the first occurrence of the desired element, the first occurrence is the only occurrence in a Set, due to the uniqueness of the elements.

public int pop(){    return set.remove((int)(Math.random() * set.size()));}

A second way to remove elements from the Set is with the pop method. This method removes and returns an arbitrary (implemented here as random) element from the Set, and is useful in algorithms which process and remove elements from the Set one-by-one, but in no particular order. Similar to Python, attempting to pop an element from an empty Set will result in a runtime error.

Now that we’ve looked at how we can move elements into and out of the Set, let’s look at operations involving more than one Set. We’ll start with methods that check conditions.

First up is the notion of a subset. A set S is a subset of a set T if every element in S is also in T.

S is a subset of T

The Set lacks an implementation of order, so this differs from the idea of a subarray, in which the elements of the sub-array appear in exactly the same order. In order to check if the calling object is a subset of the passed object, we have the isSubset method:

public boolean isSubset(Set other){    for (Integer element : set)    {        if (!other.contains(element))        {            return false;        }    }    return true;}

We check every element in our internal set, and if we find any element in the calling Set that is not present in other, we return false. Notice that in this implementation, we will always return true if the calling Set is empty (which should sound familiar to anyone familiar with Set theory). The isSuperset method turns the tables, checking to see if every element in the passed Set is also in the calling Set.

The two Sets contain exactly the same elements, so they are equal.

Now that we have the notion of a subset in our toolkit, we can use a set-theoretic definition for the equals method. Two sets S and T are equal if and only if every element of S is also in T, and every element of T is also in S. In other words, this set is equal to some other set if this.isSubset(other) and other.isSubset(this) are both true:

public boolean equals(Set other){    return this.isSubset(other) && other.isSubset(this);}

With a Set, the order is arbitrary, so the same elements only need to be present, not in a particular order. This is one of the features that really distinguishes a Set from an ArrayList, in which two objects are equal if and only if they contain exactly the same elements in exactly the same order. (I am deliberately avoiding the actual implementation of an overridden version of the Object class’ equals method in order to focus on the logic of the Set class.)

Finally, let’s take a look at the methods that accept two Set objects and return a thirdSet: union, intersection, difference, and symmetricDifference. I have provided both a static method and an instance method for each of these; in each case, the instance method merely calls the static method, passing this as the first argument and the instance method’s parameter as the second.

Public Domain, https://commons.wikimedia.org/w/index.php?curid=3437444

The simplest of these methods to understand is union. The union of two Setobjects is a new Set that contains every element that is in at least one of the original Set objects. The logic of the union method is as simple as that: create a new Set object, then use the add method to add all elements from setA and all elements from setB. As we saw above, the add method filters out any duplicates for us.

public static Set union(Set setA, Set setB){    Set setUnion = new Set();    for (Integer element : setA)    {        setUnion.add(element);    }    for (Integer element: setB)    {        setUnion.add(element);    }    return setUnion;}

Notice that again, the union method can both operate on and return an empty Set.

Public Domain, https://commons.wikimedia.org/w/index.php?curid=3437020

The second useful binary operation on Set objects is taking the intersection. While the union includes all elements that are in at least one of the two Set objects, the intersection includes only the elements that are in both:

public static Set intersection(Set setA, Set setB){    Set setIntersection = new Set();    for (Integer element : setA)    {        if (setB.contains(element))        {            setIntersection.add(element);        }    }    return setIntersection;}

By definition, any element in the intersection must be in both setA and setB, so we start by examining the elements in setA, and add only the elements that are also in setB to setIntersection. Notice that the order of the sets does not matter; had we passed setA and setB in the opposite order, we would still wind up returning the same unordered collection of unique elements — that is, the same Set. The method testIntersection in SetDemo.java demonstrates that these are indeed the same. Furthermore, this method is capable of accepting and returning an empty Set.

Public Domain, https://commons.wikimedia.org/w/index.php?curid=3437431

Next, we have the only method in which the order of the arguments matters: difference. If S and T are two sets, the difference, denoted in math as S-T or S\T, is the set of elements in S that are not also in T. Because this expression is read from left to right, I have used those as the parameter names in the difference method to keep things straight:

public static Set difference(Set left, Set right){    Set diff = new Set();    for (Integer element : left)    {        if (!right.contains(element))        {            diff.add(element);        }    }    return diff;}

As a fan of symmetry, I wanted to highlight how we can implement intersection and difference with virtually identical code, where the only change isthe ! operator in the condition. This binary condition actually partitions left into two different subsets: one that overlaps right, and one that does not. The overlapping subset is returned by intersection, while the non-overlapping subset is returned by difference. This also shows why the order of the arguments matters with difference but not with intersection: the overlapping subset returned by intersection is included in both left and right, but the non-overlapping subset returned by difference is unique to left.

Public Domain, https://commons.wikimedia.org/w/index.php?curid=3437442

Speaking of symmetry, the final method the Set class is symmetricDifference. This method accepts two Set objects, and returns all the elements that are in exactly one of them. One way to think of this is the difference of the union and the intersection, and that is exactly how I have implemented it:

public static Set symmetricDifference(Set setA, Set setB){    return Set.difference(Set.union(setA, setB),                          Set.intersection(setA, setB));}

And there we have it! A Python-style Set built from the ground up using high-school-level Java.

A few closing remarks:

  1. This code is not optimized in any sense of the word. Its purpose was not to run efficiently, but to be written readably.
  2. These implementations are not even close to unique. For instance, we could have written symmetricDifference to return the union of the differences instead of the difference of the union and the intersection.
  3. To really emulate Python behavior, we should have been hashing our elements instead of simply storing them in an ArrayList. We emulated the unpredictable order by randomizing, but ordering of elements of a Python set is arbitrary, not random.
  4. In Python, a single set can hold objects of different data types, as long as they are hashable. We could have accomplished this with generics or polymorphism, but that would have distracted from the important behaviors that distinguish the Set class from ArrayList.
  5. This version of a Set is based on the set found in Python 2, which has since been deprecated. I opted for this older version because the latest set retains the fundamental behavior, but adds in a few additional features that could have made this post unruly.

For APCSA teachers and students, or Java programmers looking to get their toes wet with Python, I hope this post helped to shed some light on Python set functionality.

--

--