Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Counted Set #132

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from
10 changes: 10 additions & 0 deletions .swiftpm/xcode/xcshareddata/xcschemes/Collections.xcscheme
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,16 @@
ReferencedContainer = "container:">
</BuildableReference>
</TestableReference>
<TestableReference
skipped = "NO"
<BuildableReference
BuildableIdentifier = "primary"
BlueprintIdentifier = "MultiSetsTests"
BuildableName = "MultiSetsTests"
BlueprintName = "MultiSetsTests"
ReferencedContainer = "container:">
</BuildableReference>
</TestableReference>
</Testables>
</TestAction>
<LaunchAction
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -104,6 +104,20 @@
ReferencedContainer = "container:">
</BuildableReference>
</BuildActionEntry>
<BuildActionEntry
buildForTesting = "YES"
buildForRunning = "YES"
buildForProfiling = "YES"
buildForArchiving = "YES"
buildForAnalyzing = "YES">
<BuildableReference
BuildableIdentifier = "primary"
BlueprintIdentifier = "MultiSets"
BuildableName = "MultiSets"
BlueprintName = "MultiSets"
ReferencedContainer = "container:">
</BuildableReference>
</BuildActionEntry>
</BuildActionEntries>
</BuildAction>
<TestAction
Expand Down Expand Up @@ -152,6 +166,16 @@
ReferencedContainer = "container:">
</BuildableReference>
</TestableReference>
<TestableReference
skipped = "NO">
<BuildableReference
BuildableIdentifier = "primary"
BlueprintIdentifier = "MultiSetsTests"
BuildableName = "MultiSetsTests"
BlueprintName = "MultiSetsTests"
ReferencedContainer = "container:">
</BuildableReference>
</TestableReference>
</Testables>
</TestAction>
<LaunchAction
Expand Down
11 changes: 11 additions & 0 deletions Package.swift
Original file line number Diff line number Diff line change
Expand Up @@ -53,6 +53,7 @@ let package = Package(
products: [
.library(name: "Collections", targets: ["Collections"]),
.library(name: "DequeModule", targets: ["DequeModule"]),
.library(name: "MultiSets", targets: ["MultiSets"]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We must not use a capital S here, as "multiset" is a single word.

I'm not sure I like this module name. What other type do you expect will under this module? It may make sense to just call this something boring like CountedMultisetModule -- although I do hope we will find a better name.

(Note that I don't know if we'll ever have a multiset that keeps duplicate items in a flat hash table -- that construct seems more likely to be based on a HAMT, and so it would probably fit better in a module dedicated to that family of types, such as the one @msteindorfer is working on in #31. We'll probably also have a SortedMultiset that will be based on Comparable instead of Hashable.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seemed to be some interest in signed multisets, specifically. I don’t entirely see the benefit myself, but after spending too much time agonizing over the module name I decided to just pick that and change it later if necessary.

I’m not a big fan of the “Module” suffix, but if that’s necessary it’s necessary.

.library(name: "OrderedCollections", targets: ["OrderedCollections"]),
.library(name: "PriorityQueueModule", targets: ["PriorityQueueModule"]),
],
Expand All @@ -61,6 +62,7 @@ let package = Package(
name: "Collections",
dependencies: [
"DequeModule",
"MultiSets",
"OrderedCollections",
"PriorityQueueModule",
],
Expand Down Expand Up @@ -94,6 +96,15 @@ let package = Package(
dependencies: ["DequeModule", "_CollectionsTestSupport"],
swiftSettings: settings),

// CountedSet<Element>
.target(
name: "MultiSets",
swiftSettings: settings),
.testTarget(
name: "MultiSetsTests",
dependencies: ["MultiSets", "_CollectionsTestSupport"],
swiftSettings: settings),

// OrderedSet<Element>, OrderedDictionary<Key, Value>
.target(
name: "OrderedCollections",
Expand Down
1 change: 1 addition & 0 deletions Sources/Collections/Collections.swift
Original file line number Diff line number Diff line change
Expand Up @@ -10,5 +10,6 @@
//===----------------------------------------------------------------------===//

@_exported import DequeModule
@_exported import MultiSets
@_exported import OrderedCollections
@_exported import PriorityQueueModule
105 changes: 105 additions & 0 deletions Sources/MultiSets/CountedSet/CountedSet+Collection.swift
Original file line number Diff line number Diff line change
@@ -0,0 +1,105 @@
//===----------------------------------------------------------------------===//
//
// This source file is part of the Swift Collections open source project
//
// Copyright (c) 2021 Apple Inc. and the Swift project authors
// Licensed under Apache License v2.0 with Runtime Library Exception
//
// See https://swift.org/LICENSE.txt for license information
//
//===----------------------------------------------------------------------===//

extension CountedSet: Collection {
/// The position of an element in a counted set.
@frozen
public struct Index: Comparable {
/// An index in the underlying dictionary storage of a counted set.
///
/// This is used to distinguish between indices that point to elements with
/// different values.
@usableFromInline
let storageIndex: Dictionary<Element, UInt>.Index
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Please follow The Leading Underscore Rule when naming non-public members and types.

(Throughout this PR)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, sorry. In my defense, that was discouraged for normal (that is, not Standard Library) Swift code until around three months ago.

I actually didn’t realize the guidance changed until I tried to pull it just now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting -- where was it discouraged?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I can’t find it now. Maybe I was thinking of Google’s Swift style guide? Never mind then.


/// The relative position of the element.
///
/// This doesn't actually correspond to a distinct part of memory. Instead,
/// it is used to distinguish between indices that point to elements of the
/// same value.
///
/// For example, the first index pointing to a value has an index of 0, the
/// second index pointing to that value will have an index of 1, and so on.
///
/// When a counted set is subscripted with an index, the stored multiplicity
/// is retrieved using the `storageIndex` and compared with the `position`
/// to determine the index's validity.
@usableFromInline
let position: UInt

@inlinable
public static func < (
lhs: CountedSet<Element>.Index,
rhs: CountedSet<Element>.Index
) -> Bool {
guard lhs.storageIndex != rhs.storageIndex else {
return lhs.storageIndex < rhs.storageIndex
}
return lhs.position < rhs.position
}

@usableFromInline
init(storageIndex: Dictionary<Element, UInt>.Index, position: UInt) {
self.storageIndex = storageIndex
self.position = position
}
}

@inlinable
public func index(after i: Index) -> Index {
guard i.position + 1 < rawValue[i.storageIndex].value else {
return Index(
storageIndex: rawValue.index(after: i.storageIndex),
position: 0
)
}

return Index(storageIndex: i.storageIndex, position: i.position + 1)
}

@inlinable
public subscript(position: Index) -> Element {
let keyPair = rawValue[position.storageIndex]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: it is usually more readable to bind the key-value pair to separate variables, as in

Suggested change
let keyPair = rawValue[position.storageIndex]
let (key, value) = rawValue[position.storageIndex]

precondition(
position.position < keyPair.value,
"Attempting to access CountedSet elements using an invalid index"
)
return keyPair.key
}

/// The number of elements in the set.
///
/// - Complexity: O(*k*), where *k* is the number of unique elements in the
/// set.
@inlinable
public var count: Int {
Int(rawValue.values.reduce(.zero, +))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine per the Collection protocol, but it seems like a missed opportunity.

What if we kept a running count of items in this set, updating it on every mutation?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing the count would mean this is no longer an implicit data structure. I’m not really sure that’s a worthwhile tradeoff.

Is there precedent for doing that with similar data structures?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh definitely -- Set and Dictionary do it, for example, to prevent having to scan their bitmap to figure out their element count.

I don't think having to store one extra integer matters much in this case -- the hash table in the underlying dictionary already comes with way too much memory overhead to call this a truly implicit data structure. 🙈

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking it was an implicit data structure for a dictionary, but okay then.

}

@inlinable
public var startIndex: Index {
Index(storageIndex: rawValue.startIndex, position: 0)
}

@inlinable
public var endIndex: Index {
Index(storageIndex: rawValue.endIndex, position: 0)
}

/// A value equal to the number of unique elements in the set.
///
/// - Complexity: O(*k*), where *k* is the number of unique elements in the
/// set.
@inlinable
public var underestimatedCount: Int {
rawValue.count
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is neat, but if we decide to add a running total instead of using reduce above, then we should return it here, too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

underestimatedCount has always been incredibly unclear to me, in terms of its intended role in Collections. I don’t believe anything actually says it should be equivalent to count in such cases, so I figured it’d be a useful way to express the (less computationally expensive) number of unique elements.

If we change it to reference count, we should add another property with this value.

By the way, would it be possible to add a more efficient conditional implementation of Swift Algorithms’ Unique methods for CountedSet, in case anyone ends up using them together? I’m guessing the answer is no, but it’d be a nice-to-have.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

underestimatedCount has always been incredibly unclear to me, in terms of its intended role in Collections. I don’t believe anything actually says it should be equivalent to count in such cases, so I figured it’d be a useful way to express the (less computationally expensive) number of unique elements.

This is a good idea in general, and Collection does allow this! underestimatedCount is typically only used in contexts that are generic over Sequence, but that doesn't mean it's a bad idea to make it diverge from count -- it can still be quite beneficial.

By the way, would it be possible to add a more efficient conditional implementation of Swift Algorithms’ Unique methods for CountedSet, in case anyone ends up using them together? I’m guessing the answer is no, but it’d be a nice-to-have.

The way to do this right now would require Algorithms to import Collections, which would be a bad idea.

However, Swift does have support for cross-module overlays, which are modules that automatically get loaded whenever some module A and module B are both imported to a source file. (Apple's SDKs use this to add, e.g., MapKit-related functionality to SwiftUI when a file imports both of these modules.) If this functionality was cleaned up & promoted to a full-blown language feature, including SwiftPM support for defining cross-import modules, then we could use that to provide uniquing overloads that take CountedSets.

Another alternative is to define a set of additional collection protocols in a separate package, and have swift-collections and swift-algorithms both depend on that. This would let swift-algorithms provide generic overloads for specific algorithms to speed them up when possible.

I don't think we need to do this for CountedMultiset right now, but it's something to think about for later. (FWIW, I think we'll eventually want to define some sort of a dictionary protocol at least, and possibly a UniqueCollection protocol.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, the easiest way is to simply provide the uniquing methods directly on the counted multiset type, without importing Algorithms. This would work, but perhaps it would be even better to provide a uniqueMembers view instead, exposing the underlying dictionary's keys view.

(I'm not sure we need to do this though -- we are already exposing the storage dictionary, after all, so folks can simply access keys directly.)

}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,36 @@
//===----------------------------------------------------------------------===//
//
// This source file is part of the Swift Collections open source project
//
// Copyright (c) 2021 Apple Inc. and the Swift project authors
// Licensed under Apache License v2.0 with Runtime Library Exception
//
// See https://swift.org/LICENSE.txt for license information
//
//===----------------------------------------------------------------------===//

extension CountedSet: ExpressibleByDictionaryLiteral {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the utility of this, but I do expect CountedMultiset to conform to ExpressibleByArrayLiteral instead of (or, more likely, in addition to) this conformance.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that protocol literally uses a CountedSet as an example, I couldn’t really overlook it in good conscience.

ExpressibleByArrayLiteral is actually inherited by SetAlgebra, so there’s already conformance with it. I can separate it, of course.

/// Creates a counted set initialized with a dictionary literal.
///
/// Do not call this initializer directly. It is called by the compiler to
/// handle dictionary literals. To use a dictionary literal as the initial
/// value of a counted set, enclose a comma-separated list of key-value pairs
/// in square brackets.
///
/// For example, the code sample below creates a counted set with string keys.
///
/// let countriesOfOrigin = ["BR": 2, "GH": 1, "JP": 5]
/// print(countriesOfOrigin)
/// // Prints "["BR", "BR", "JP", "JP", "JP", "JP", "JP", "GH"]"
///
/// - Parameter elements: The element-multiplicity pairs that will make up the
/// new counted set.
/// - Precondition: Each element must be unique.
@inlinable
public init(dictionaryLiteral elements: (Element, UInt)...) {
_storage = RawValue(
uniqueKeysWithValues: elements.lazy.filter { $0.1 > .zero }
)
}
}

41 changes: 41 additions & 0 deletions Sources/MultiSets/CountedSet/CountedSet+Sequence.swift
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
//===----------------------------------------------------------------------===//
//
// This source file is part of the Swift Collections open source project
//
// Copyright (c) 2021 Apple Inc. and the Swift project authors
// Licensed under Apache License v2.0 with Runtime Library Exception
//
// See https://swift.org/LICENSE.txt for license information
//
//===----------------------------------------------------------------------===//

extension CountedSet: Sequence {
/// Returns an iterator over the elements of this collection.
///
/// - Complexity: O(1)
/// - Remark: A type-erased wrapper is used instead of an opaque type purely
/// to preserve compatibility with older versions of Swift.
@inlinable
public func makeIterator() -> AnyIterator<Element> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll need to implement a custom Iterator type here -- AnyIterator will not do!

I recommend coding the next() method from scratch instead of using lazy algorithms in this case. (The iterator state can simply consist of the underlying dictionary iterator (or index) along with the current value and the number of times we have to return it still.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I figured as much. While I couldn’t see any downside to using AnyIterator, the fact that nothing else seemed to was a bad sign.

I don’t suppose you could explain why that’s frowned upon here? As the comment notes, I’d have used an opaque type if I was allowed. Is there some benefit to making a dedicated type?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, it's a performance issue. Like AnySequence, AnyCollection and AnyHashable, the type-erased iterator needs to box things up into a class (or a closure, I can't remember & I'm too lazy to check 😈) to get rid of the original types and then dispatch dynamically to the underlying next() method, and this tends to make things much slower than usual.

AnyIterator isn't a bad construct -- it does have its use cases. But it's almost always better to design things so that we don't need to use it, especially in cases where performance is important. (Which is definitely the case for this package.)

Copy link
Member

@lorentey lorentey Dec 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(AnyIterator is really nice in that it hides the actual iterator type we're using so that we remain able to change it later. I expect some IteratorProtocol will at some point let us do the same thing without the performance impact -- unfortunately currently we can't use it yet because we wouldn't be able to constrain its Element type.)

AnyIterator(
_storage
.lazy
.map { (key, value) -> [Repeated<Element>] in
let repetitions = value.quotientAndRemainder(
dividingBy: UInt(Int.max)
)

var result = [Repeated<Element>](
repeating: repeatElement(key, count: .max),
count: Int(repetitions.quotient)
)
result.append(repeatElement(key, count: Int(repetitions.remainder)))

return result
}
.joined()
.joined()
.makeIterator()
)
}
}
Loading