Taking a step back, it might seem that easy duplicate removal is the only benefit of using sets. We previously discussed how sets have no order; arrays have indexed element which could simply ignored and treated like a set. It appears that arrays can do the same job as a set, if not more.
However, this simplification enforced by sets opens way to different underlying implementations. In lists, elements are assigned indices to give each element a place in the order. Sets have no need to assign indices, so they instead implement a different approach of referencing: hash mapping. These operate by (pseudo)randomly allocating addresses to elements, as opposed to storing them in a row. The allocation is governed by hashing functions, which use the element as an input to output an address.
H(x) is deterministic, so the same input always gives the same output, ie. there is no RNG within the function H, so H(4) = 6 always in this case.
Running this function takes the same amount of time regardless of the size of the set, ie. hashing has O(1) time complexity. This means that the time taken to hash is independent of the size of the list, and remains at a constant, quick speed.
Because hashing is generally quick, a whole host of operations that are typically slow on large arrays can be executed very efficiently on a set.
Search or Membership Testing
Searching for elements in an array utilises an algorithm called Linear Search, by checking each item in the list one by one. In the worst case, where the item being searched for does not exist in the list, the algorithm traverses every element of the list (O(n)). In a very large list, this process takes a long time.
However, as hashing is O(1), Python hashes the item to be found, and either returns where it is in memory, or that it doesn’t exist- in a very small amount of time.
number_list = range(random.randint(1,000,000))
number_set = set(number_list)#Line 1
#BEGIN TIMER
print(-1 in number_list)
#END TIMER
#Line 2
#BEGIN TIMER
print(-1 in number_set)
#END TIMER
Note: Searching using a hashmap has an amortized time complexity of O(1). This means that in the average case, it runs at constant time but technically, in the worst case, searching is O(n). However, this is extremely unlikely and comes down to the hashing implementation having a chance of collisions, which is when multiple elements in a hashmap/set are hashed to the same address.
Deletion
Deleting an element from a list first involves a search to locate the element, and then removing reference to the element by clearing the address. In an array, after the O(n) time search, the index of every element following the deleted element needs to be shifted down one. This itself is another O(n) process.
Deleting an element from a set involves the O(1) lookup, and then erasure of the memory address which is an O(1) process so deletion also operates in constant time. Sets also have more ways to delete elements, such that errors are not raised, or such that multiple elements can be removed concisely.
#LIST
numbers = [1, 3, 4, 7, 8, 11]numbers.remove(4)
numbers.remove(5) #Raises ERROR as 5 is not in list
numbers.pop(0) #Deletes number at index 0, ie. 1
#SET
numbers = 1, 3, 4, 7, 8, 11
numbers.remove(4)
numbers.remove(5) #Raises ERROR as 5 is not in set
numbers.discard(5) #Does not raise error if 5 is not in the set
numbers -= 1,2,3 #Performs set difference, ie. 1, 3 are discarded
Insertion
Both appending to a list and adding elements to a set are constant operations; adding to a specified index in a list (.insert) however comes with the added time to shift elements around.
num_list = [1,2,3]
num_set = 1,2,3num_list.append(4)
num_set.add(4)
num_list += [5,6,7]
num_set += 5,6,7
Advanced Set Operations
Additionally, all the mathematical operations that can be performed on sets have implementation in python also. These operations are once again time consuming to manually perform on a list, and are once again optimised using hashing.
A = 1, 2, 3, 5, 8, 13
B = 2, 3, 5, 7, 13, 17# A n B
AintersectB = A & B
# A U B
AunionB = A | B
# A \ B
AminusB = A - B
# A U B - A n B or A Delta B
AsymmetricdiffB = A ^ B
This also includes comparison operators, namely proper and relaxed subsets and supersets. These operations once again run much faster than their list counterparts, operating in O(n) time, where n is the larger of the 2 sets.
A <= B #A is a proper subset of B
A > B #A is a superset of B
Frozen Sets
A final small, but underrated feature in python is the frozen set, which is essentially a read-only or immutable set. These offer greater memory efficiency and could be useful in cases where you frequently test membership in a tuple.
Conclusion
The essence of using sets to boost performance is encapsulated by the principle of optimisation by reduction.
Data structures like lists have the most functionality- being indexed and dynamic- but come at the cost of comparatively lower efficiency: speed and memory-wise. Identifying which features are essential vs unused to inform what data type to use will result in code that runs faster and reads better.
All technical diagrams by author.