
Why is the kmeans algorithm column-oriented instead of row-oriented? #79

Open
paulhendricks opened this issue Jan 3, 2017 · 20 comments

@paulhendricks

In the docs (below), the kmeans algorithm takes a matrix where each column X[:, i] corresponds to an observed sample. This implementation goes against the idea of tidy data and differs from both Python's scikit-learn implementation of kmeans and R's base implementation of kmeans.

Is there a good reason for this? Should this algorithm be changed from column-oriented to row-oriented so as to be consistent with R and Python as well as with the concept of tidy data?

URL: http://clusteringjl.readthedocs.io/en/stable/overview.html

Inputs

A clustering algorithm, depending on its nature, may accept an input matrix in either of the following forms:

  • Sample matrix X, where each column X[:,i] corresponds to an observed sample.
  • Distance matrix D, where D[i,j] indicates the distance between samples i and j, or the cost of assigning one to the other.
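For concreteness, a minimal example of the current column-oriented convention (random data; kmeans and assignments are the package's documented API):

using Clustering

X = rand(3, 100)        # 3 features × 100 observations, one observation per column
result = kmeans(X, 4)   # cluster the 100 column-observations into 4 groups
assignments(result)     # cluster index for each of the 100 observations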
@ararslan
Member

ararslan commented Jan 3, 2017

There were discussions about this a while back (though I'm struggling to locate one now), and ultimately it came down to the original package author's personal preference. Apparently observations-as-columns is the norm in machine learning, at least according to some, and observations-as-rows is the norm in classical statistics. Personally I'm in the camp of observations being rows and have been bitten by this package's use of columns.

The problem with changing it at this point is that existing code will suddenly become very wrong, and it's not something we'd be able to catch and emit a deprecation warning for. We'd have to redesign the API and deprecate the existing API in favor of the new one.

@nalimilan
Member

nalimilan commented Jan 3, 2017

Unless we can find another package where clustering works that way, I'd favor moving to rows as observations. A possible deprecation path would be to add a vardim argument, and require passing e.g. vardim=1 (with a default of 2) to get the current behavior without a warning. Then we would switch the default in the next release, and possibly deprecate the option later.

EDIT: see also JuliaStats/Distances.jl#35

@ararslan
Member

ararslan commented Jan 3, 2017

@nalimilan That sounds perfect. Simple and largely non-disruptive. Nice idea! 💯

@paulhendricks
Author

Sounds like a plan. I can take a first stab at implementing this and submit a pull request for your review.

@ararslan
Member

ararslan commented Jan 4, 2017

That sounds great, @paulhendricks. Thanks!

@paulhendricks
Author

Awesome. As far as I can tell, kmeans is the only clustering algorithm that needs to be changed to close this issue.

Below is a first pass at implementing @nalimilan 's suggestion:

const _kmeans_default_init = :kmpp
const _kmeans_default_maxiter = 100
const _kmeans_default_tol = 1.0e-6
const _kmeans_default_display = :none
const _vardim = 2

function kmeans!{T<:AbstractFloat}(X::Matrix{T}, centers::Matrix{T};
                                   weights=nothing,
                                   maxiter::Integer=_kmeans_default_maxiter,
                                   tol::Real=_kmeans_default_tol,
                                   display::Symbol=_kmeans_default_display,
                                   vardim::Integer=_vardim)

    # Here, we use vardim to assess if matrix X is
    # column-oriented (vardim == 1) or row-oriented (vardim == 2).
    # By default, X should be row-oriented. Use vardim=1
    # to recover the old behavior where X is column-oriented.
    #
    # See this issue for more details:
    # https://github.com/JuliaStats/Clustering.jl/issues/79
    info("vardim defaults to 2 in version x.x.x and will default to 1 in x.x.x.")
    info("vardim will then be deprecated in version x.x.x.")
    if vardim == 2
      # If matrix X is row-oriented (vardim == 2), transpose
      # it to be column-oriented. When vardim is deprecated,
      # this transposition will be the first operation.
      X = transpose(X)
    end

    # Once X is column-oriented, the rest of algorithm
    # can run as normal with no need for modification.
    m, n = size(X)
    m2, k = size(centers)
    m == m2 || throw(DimensionMismatch("Inconsistent array dimensions."))
    (2 <= k < n) || error("k must have 2 <= k < n.")

    assignments = zeros(Int, n)
    costs = zeros(T, n)
    counts = Array(Int, k)
    cweights = Array(Float64, k)

Does this look good 👍 ?

@nalimilan
Member

Use Base.depwarn to print warnings; there's no need to say in which version it will be removed (it's always the next major version). Also, you need to check that vardim is equal to either 1 or 2.
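Applied to the draft above, that check and warning might look like this (message wording is illustrative):

vardim in (1, 2) || throw(ArgumentError("vardim must be 1 or 2, got $vardim"))
if vardim == 2
    Base.depwarn("the default value of vardim will change in a future release; " *
                 "pass vardim explicitly to avoid this warning", :kmeans!)
    X = transpose(X)
end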

@kmsquire
Contributor

Just as a point of reference, one of the original arguments for having observations as columns was that Julia arrays are in column-major order, so this allows data for individual observations to be stored contiguously.

See, e.g., http://julialang.org/blog/2013/09/fast-numeric, especially under "Write cache-friendly codes".
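A quick illustration of that point (hypothetical arrays; @view avoids copies):

X = rand(4, 10_000)          # observations as columns
x = @view X[:, 42]           # one observation: 4 contiguous Float64s in memory
Y = permutedims(X, (2, 1))   # same data, observations as rows
y = @view Y[42, :]           # same observation: elements now size(Y, 1) apart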

@nalimilan
Member

Right, that's probably why Dahua chose that arrangement. Benchmarks are presented in the Distances.jl README: https://github.com/JuliaStats/Distances.jl#benchmarks

Though I wonder whether it really makes a difference, given that the text says the computations are done by calling BLAS's gemm. Since the array is contiguous anyway, BLAS should be able to choose the best way to perform the computation efficiently. I can't believe SciKit would use a deliberately slow convention.

Somebody would have to run the benchmarks using the rowwise convention to check this.

@kmsquire
Contributor

I can't believe SciKit would use a deliberately slow convention.

SciKit Learn has a great interface, but in my experience, it doesn't always have the fastest implementation.

More importantly, Python/numpy are row-major order by default, so matching that convention already keeps the data contiguous/cache-friendly.

@kmsquire
Contributor

(But running the benchmarks is definitely the way to go!)

Since Julia just obtained a Transpose type, this actually shouldn't be that costly.

@nalimilan
Member

More importantly, Python/numpy are row-major order by default, so matching that convention already keeps the data contiguous/cache-friendly.

I didn't know that. It's indeed a good explanation. OTOH, R is column-major, but it doesn't care so much about speed...

@nalimilan
Member

A confirmation that R's dist function is slow due to storing observations in rows despite being column-major: https://stat.ethz.ch/pipermail/r-devel/2017-June/074492.html

@andreasnoack
Member

Most of the relevant comments have already been made here, but I just wanted to point out that ML algorithms are often expressed and written per observation, i.e. for OLS that would be (∑x*x')\(∑x*y) instead of the matrix notation (X'X)\(X'y). The former can be convenient if the data is considered a stream. However, it is often wrongly argued that the former is also faster because each x is contiguous; thanks to (mainly) BLAS3 calls, the matrix formulation, if available, will typically be quite a bit faster than the "streaming"/"vector" formulation.
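To make the two formulations concrete, here is a minimal sketch (names are illustrative; X holds observations as rows):

using LinearAlgebra

# Matrix formulation: dominated by a single BLAS3 call for X'X.
ols_matrix(X, y) = (X'X) \ (X'y)

# Per-observation ("streaming") formulation: accumulate ∑x*x' and ∑x*y
# one rank-1 update at a time.
function ols_stream(X, y)
    p = size(X, 2)
    A = zeros(p, p)
    b = zeros(p)
    for i in axes(X, 1)
        x = X[i, :]
        A .+= x .* x'
        b .+= x .* y[i]
    end
    return A \ b
end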

@nalimilan
Member

nalimilan commented Feb 13, 2019

pairwise in Distances has just gained support for a dims keyword argument, plus a deprecation of the default method which assumes observations are stored as columns (JuliaStats/Distances.jl#121). kmeans should probably do the same thing.

@alyst
Member

alyst commented Feb 13, 2019

It's fine to allow dims, especially if the implementation can choose the optimal algorithm depending on the data layout. But many methods are only optimized for the "observations as columns" case. Shouldn't we still keep it as the default dims= to guide users to the proper layout?

I'd also try to keep the signatures of all clustering methods in sync. So if we are deprecating the default dims= for kmeans(), we should also do it for the other methods.

I assume that the major motivation of the change is Tables.jl support, which has "observations as rows" layout. We definitely need to support Tables.jl, but maybe the default strategy should be to transpose the table before clustering?

@nalimilan
Member

I assume that the major motivation of the change is Tables.jl support, which has "observations as rows" layout. We definitely need to support Tables.jl, but maybe the default strategy should be to transpose the table before clustering?

Tables.jl really isn't my motivation for this change, as it can easily gain an option to transpose the columns, which should be relatively efficient compared with the cost of computing pairwise distances and clustering (JuliaData/Tables.jl#66). The main reason is that I want to ensure consistency of functions that take a dimension argument across the ecosystem, and cov in Statistics assumes dims=1 (observations as rows) by default. This is compounded by the fact that most other implementations of distances/clustering use observations as rows.

Shouldn't we still keep it as the default dims= to guide users to the proper layout?

Given what I said above, I'd rather require users to choose explicitly between consistency with the rest of Julia and the world (dims=1) and performance (dims=2). Hopefully, clearly documenting which layout is best for performance will be enough. Ironically, the fact that the dims argument is so hard to get right without reading the docs ensures people will read that warning. ;-)

We should also provide a Tables.jl-based interface which will automatically transpose the data under the hood, so that the efficient approach is used and people don't have to think about it.

@nalimilan
Member

JuliaStats/Distances.jl#123 implements the convenience Tables.jl interface.

I've tried measuring the performance difference between dims=2 and dims=1 on permutedims(a), and I couldn't observe any significant difference for Euclidean. Does anybody have an idea of a benchmark which could show a difference?
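A sketch of the measurement described above (shapes arbitrary; assumes Distances.jl with the dims keyword and BenchmarkTools):

using BenchmarkTools, Distances

a = rand(100, 2_000)   # 100 features × 2_000 observations as columns
b = permutedims(a)     # same data with observations as rows

@btime pairwise(Euclidean(), $a; dims=2)   # column-wise layout
@btime pairwise(Euclidean(), $b; dims=1)   # row-wise layout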

@mkborregaard

mkborregaard commented Feb 3, 2020

As far as I can see, MultivariateStats unfortunately also uses columns as observations; see, e.g., the PCA example:

julia> using MultivariateStats, RDatasets
julia> iris = dataset("datasets", "iris");
julia> # split half to training set
       Xtr = convert(Matrix,iris[:,1:4])'
4×150 LinearAlgebra.Adjoint{Float64,Array{Float64,2}}:
 5.1  4.9  4.7  4.6  5.0  5.4  4.6  5.0  …  6.8  6.7  6.7  6.3  6.5  6.2  5.9
 3.5  3.0  3.2  3.1  3.6  3.9  3.4  3.4     3.2  3.3  3.0  2.5  3.0  3.4  3.0
 1.4  1.4  1.3  1.5  1.4  1.7  1.4  1.5     5.9  5.7  5.2  5.0  5.2  5.4  5.1
 0.2  0.2  0.2  0.2  0.2  0.4  0.3  0.2     2.3  2.5  2.3  1.9  2.0  2.3  1.8

julia> fit(PCA, Xtr)
PCA(indim = 4, outdim = 3, principalratio = 0.9947878161267246)

julia> transform(ans, Xtr)
3×150 Array{Float64,2}:
 2.68413     2.71414    2.88899   …  -1.76435    -1.90094   -1.39019
 0.319397   -0.177001  -0.144949       0.0788589   0.116628  -0.282661
 0.0279148   0.210464  -0.0179003     -0.130482   -0.723252  -0.36291

In my view, observations-as-columns is practically never found in the wild outside some parts of Julia's ecosystem, and it seems necessary to converge on observations-as-rows throughout, in spite of the deprecation issues.

@jebej

jebej commented May 26, 2020

As a potential additional data point (against the observations-as-rows view), observations as columns are nice when converting between a vector of complex numbers and the array used for clustering. Since complex numbers are stored as (real, imag), converting to an observations-as-columns array is as easy as reinterpreting:

reshape(reinterpret(Float64,cvals::Vector{Complex{Float64}}),2,length(cvals))

No transpose is needed.
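For instance (hypothetical data, not from the comment):

cvals = [1.0 + 2.0im, 3.0 + 4.0im]
X = reshape(reinterpret(Float64, cvals), 2, length(cvals))
# X[:, 1] == [1.0, 2.0] and X[:, 2] == [3.0, 4.0]:
# real parts in row 1, imaginary parts in row 2, no copy made.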
