Add KNNImputer #303

srzeszut · 2024-10-20T13:28:02Z

I have added the KNNImputer and I am currently implementing tests to ensure that it behaves as expected across various scenarios, including edge cases.


josevalim · 2024-10-21T07:16:01Z

lib/scholar/impute/knn_imputer.ex

+    if opts[:missing_values] != :nan and
+         Nx.any(Nx.is_nan(x)) == Nx.tensor(1, type: :u8) do
+      raise ArgumentError,
+            ":missing_values other than :nan possible only if there is no Nx.Constant.nan() in the array"
+    end
+


This check does not really work in Nx. If you call fit inside Nx.Defn.jit, then x is an expression, and we can't read its values to find out if there is a nan or not. The best we can do is to remove this check and document it.

I found this check in simple imputer
https://github.com/elixir-nx/scholar/blob/main/lib/scholar/impute/simple_imputer.ex
Are you sure it won't work?

It is also broken there. :)

I have fixed it there: c024c5b
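A minimal sketch of why a value-dependent check like this breaks under jit. The module `EagerCheck` and its `fit/1` are hypothetical stand-ins, not the Scholar code; the only assumption is that the check reads the tensor's value with `Nx.to_number/1`, as the second check in this PR does.

```elixir
Mix.install([{:nx, "~> 0.9"}])

defmodule EagerCheck do
  # Hypothetical stand-in for a fit function that validates its input
  # by reading concrete values out of the tensor.
  def fit(x) do
    if Nx.to_number(Nx.any(Nx.is_nan(x))) == 1 do
      raise ArgumentError, "input contains NaN"
    end

    x
  end
end

x = Nx.tensor([[1.0, 2.0], [3.0, 4.0]])

# Eagerly, x holds real data, so the check can read it and passes.
eager = EagerCheck.fit(x)

# Under jit, x is traced as an expression with no values to read,
# so the same check fails while the function is being compiled.
jitted =
  try do
    Nx.Defn.jit(&EagerCheck.fit/1).(x)
    :ok
  rescue
    e -> {:error, Exception.message(e)}
  end

IO.inspect(Nx.shape(eager), label: "eager")
IO.inspect(jitted, label: "jitted")
```

Running the jitted call is the quickest way to confirm whether a function is safe to compile, which is why the check has to be dropped and documented as a precondition instead.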

josevalim · 2024-10-21T07:17:44Z

lib/scholar/impute/knn_imputer.ex

+
+    all_nan_rows_count = Nx.sum(all_nan_rows)
+
+    if num_neighbors > rows - 1 - Nx.to_number(all_nan_rows_count) do


Same here, this code won't work because, when you have an expression, you can't get a number from it. Can we remove this check? What happens if we don't check for this condition?

You can test this by jitting fit with Nx.Defn.jit and then calling it.


josevalim · 2024-10-21T07:21:11Z

lib/scholar/impute/knn_imputer.ex

+
+    # if potential neighbor has nan in nan_col, we don't want to calculate distance and the case if potential_neighbour is the row to impute
+    {potential_neighbor} =
+      if potential_neighbor[nan_col] == Nx.Constants.nan() do


I am not sure if this check is guaranteed to work, given two NaNs are not guaranteed to be equal. Using Nx.is_nan would be more appropriate.
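To illustrate the point, here is a small sketch comparing an equality test against NaN with `Nx.is_nan/1`. The tensor `row` is made-up example data, not taken from the PR.

```elixir
Mix.install([{:nx, "~> 0.9"}])

# Example row with one missing entry; Nx.tensor/1 accepts the :nan atom.
row = Nx.tensor([1.0, :nan, 3.0])

# Element-wise equality against NaN never matches, not even at the NaN position.
eq_mask = Nx.equal(row, Nx.Constants.nan())

# Nx.is_nan/1 flags the missing entry as intended.
nan_mask = Nx.is_nan(row)

IO.inspect(Nx.to_flat_list(eq_mask), label: "Nx.equal")
IO.inspect(Nx.to_flat_list(nan_mask), label: "Nx.is_nan")
```

Inside defn, `==` is rewritten to `Nx.equal/2`, so a branch guarded by `value == Nx.Constants.nan()` would never be taken.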


msluszniak · 2024-10-21T12:36:56Z

lib/scholar/impute/knn_imputer.ex

+
+    x =
+      if opts[:missing_values] != :nan,
+        do: Nx.select(Nx.equal(x, opts[:missing_values]), Nx.Constants.nan(), x),


Use Nx.is_nan here; NaN is not equal to itself.


msluszniak · 2024-10-21T12:57:35Z

lib/scholar/impute/knn_imputer.ex

+    coordinates = coordinates - 1
+
+    # inputes zeros in nan_col to calculate distance with squared_euclidean
+    new_row = Nx.indexed_put(row, Nx.new_axis(nan_col, 0), Nx.tensor(0))


Generally, when you write in defn, you don't need to wrap this zero in Nx.tensor. I prefer to explicitly use Nx.<type> or Nx.tensor(x, type: type) to indicate the type of the tensor. There are also some places where the imputter uses a fixed type like :f32. I think this might cause undesired upcasts when, e.g., I have a tensor of type :bf16, so I suggest checking whether there are any unwanted casts or upcasts.

I changed it, but I don't know how to change this line:
row_distances = Nx.iota({rows}, type: {:f, 32})
because I don't know what the type of the calculated distance will be at this point.
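A small sketch of the upcast concern and one possible way to handle the `Nx.iota` line. It assumes the distances should share the floating type of the input, which is a guess about the intended behaviour rather than something stated in the PR; all variable names are illustrative.

```elixir
Mix.install([{:nx, "~> 0.9"}])

x = Nx.tensor([[1.0, 2.0], [3.0, 4.0]], type: :bf16)
type = Nx.type(x)

# Mixing in a constant with a fixed :f32 type silently upcasts the result.
upcast = Nx.add(x, Nx.tensor(0.0, type: :f32))

# Deriving the constant's type from the input keeps the original precision.
kept = Nx.add(x, Nx.tensor(0.0, type: type))

# The same idea applies to the NaN placeholder.
placeholder = Nx.Constants.nan(type)

# Assumption: distances use the floating type derived from the input.
# Nx.Type.to_floating/1 leaves float types as they are and maps integers to floats.
{rows, _cols} = Nx.shape(x)
row_distances = Nx.iota({rows}, type: Nx.Type.to_floating(type))

IO.inspect(Nx.type(upcast), label: "fixed f32 constant")
IO.inspect(Nx.type(kept), label: "typed constant")
IO.inspect(Nx.type(row_distances), label: "row_distances")
```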

msluszniak · 2024-10-21T13:02:04Z

lib/scholar/impute/knn_imputer.ex

+
+    # if row has all nans we skip it
+    {weight, potential_neighbor} =
+      if present_coordinates == 0 do


As mentioned in the comment above, try to replace "bare" numbers with typed tensors.

msluszniak · 2024-10-21T13:03:24Z

lib/scholar/impute/knn_imputer.ex

@@ -0,0 +1,256 @@
+defmodule Scholar.Impute.KNNImputer do


I think it should be written with a double t, KNNImputter, like formatter etc.

msluszniak

Thanks for the PR, I dropped some comments :))

krstopro · 2024-10-22T16:21:43Z

Hi @srzeszut and thanks for the pull request. I’m traveling now and don’t have my laptop with me. Will be back this Sunday, so I will have a look probably next week.

srzeszut · 2024-10-27T10:40:27Z

Thanks for the review. I applied the suggested changes and left some comments.

josevalim · 2024-10-28T10:06:33Z

lib/scholar/impute/knn_imputter.ex

+
+    if num_neighbors > rows - 1 - Nx.to_number(all_nan_rows_count) do
+      raise ArgumentError,
+            "Number of neighbors rows must be less than number valid of rows - 1 (valid row is row with more than 1 non nan value)"


error messages start in lowercase. :)

Suggested change

"Number of neighbors rows must be less than number valid of rows - 1 (valid row is row with more than 1 non nan value)"

"number of neighbors rows must be less than number valid of rows - 1 (valid row is row with more than 1 non nan value)"

josevalim · 2024-10-28T10:07:12Z

lib/scholar/impute/knn_imputter.ex

+
+    all_nan_rows_count = Nx.sum(all_nan_rows)
+
+    if num_neighbors > rows - 1 - Nx.to_number(all_nan_rows_count) do


Can you please add some tests? In particular, please add a test where you jit this function and then call it: Nx.Defn.jit(...).(arg1, arg2). It should reveal some errors around here. :)

I added tests and checked it. I removed those checks and documented them in the description instead.

josevalim · 2024-10-29T19:01:37Z

lib/scholar/impute/knn_imputter.ex

+    `n_neighbors` nearest neighbors found in the training set. Two samples are
+    close if the features that neither is missing are close.


Suggested change

`n_neighbors` nearest neighbors found in the training set. Two samples are

close if the features that neither is missing are close.

`n_neighbors` nearest neighbors found in the training set. Two samples are

close if the features that neither is missing are close.

josevalim · 2024-10-29T19:01:57Z

lib/scholar/impute/knn_imputter.ex

+
+  Preconditions:
+    * `number_of_neighbors` is a positive integer.
+    *  number of neighbors must be less than number valid of rows - 1 (valid row is row with more than 1 non nan value) otherwise it is better to use simple imputter


Please try to break this long line :)


josevalim · 2024-10-29T19:05:32Z

test/scholar/impute/knn_imputter_test.exs

+    test "Wrong impute rank" do
+      x = Nx.tensor([1, 2, 2, 3])
+
+      assert_raise ArgumentError,
+                   "Wrong input rank. Expected: 2, got: 1",
+                   fn ->
+                     KNNImputter.fit(x, missing_values: 1, number_of_neighbors: 2)
+                   end
+    end
+
+    test "Invalid n_neighbors value" do


Test names start in lowercase :)

Suggested change

-    test "Wrong impute rank" do
-      x = Nx.tensor([1, 2, 2, 3])
-
-      assert_raise ArgumentError,
-                   "Wrong input rank. Expected: 2, got: 1",
-                   fn ->
-                     KNNImputter.fit(x, missing_values: 1, number_of_neighbors: 2)
-                   end
-    end
-
-    test "Invalid n_neighbors value" do
+    test "invalid impute rank" do
+      x = Nx.tensor([1, 2, 2, 3])
+
+      assert_raise ArgumentError,
+                   "Wrong input rank. Expected: 2, got: 1",
+                   fn ->
+                     KNNImputter.fit(x, missing_values: 1, number_of_neighbors: 2)
+                   end
+    end
+
+    test "invalid n_neighbors value" do

josevalim

I dropped the last round of nitpicks and we are good to go!

krstopro

First review. Some features we might wanna have:

Make k-NN algorithm configurable.
Make the metric configurable.

You can leave these for another pull request. Have a look at how it is done in e.g. KNNClassifier.
I should have another look tonight.

krstopro · 2024-10-30T09:39:38Z

lib/scholar/impute/knn_imputter.ex

+      The default value expects there are no NaNs in the input tensor.
+      """
+    ],
+    number_of_neighbors: [


I would suggest changing this to num_neighbors to be consistent with the rest of Scholar.

krstopro

Several minor comments for now. I have to go through the code at least once more as I don't exactly understand the logic here.

krstopro · 2024-10-30T14:36:48Z

lib/scholar/impute/knn_imputter.ex

+
+    x =
+      if opts[:missing_values] != :nan,
+        do: Nx.select(Nx.equal(x, opts[:missing_values]), Nx.Constants.nan(), x),


You should be able to use == instead of Nx.equal/2.

This is a deftransform, so Nx.equal is the proper function. == will be Elixir.Kernel.==
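A short sketch of the difference between the two contexts. The module `EqSketch` and its functions are hypothetical and exist only to show which `==` is in effect where.

```elixir
Mix.install([{:nx, "~> 0.9"}])

defmodule EqSketch do
  import Nx.Defn

  # Inside defn, == is rewritten to an element-wise tensor comparison.
  defn eq_defn(a, b) do
    a == b
  end

  # A deftransform body is plain Elixir, so == is Kernel.== and
  # compares the two tensor structs as ordinary terms.
  deftransform eq_transform(a, b) do
    a == b
  end

  # In a deftransform the element-wise comparison has to be spelled out.
  deftransform eq_transform_nx(a, b) do
    Nx.equal(a, b)
  end
end

a = Nx.tensor([1, 2])
b = Nx.tensor([1, 3])

defn_result = EqSketch.eq_defn(a, b)
kernel_result = EqSketch.eq_transform(a, b)
nx_result = EqSketch.eq_transform_nx(a, b)

IO.inspect(Nx.to_flat_list(defn_result), label: "defn ==")
IO.inspect(kernel_result, label: "deftransform ==")
IO.inspect(Nx.to_flat_list(nx_result), label: "deftransform Nx.equal")
```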

krstopro · 2024-10-30T14:42:59Z

lib/scholar/impute/knn_imputter.ex

+    placeholder_value = Nx.Constants.nan() |> Nx.tensor()
+
+    statistics = knn_impute(x, placeholder_value, num_neighbors: num_neighbors)
+    missing_values = opts[:missing_values]


I would move this line above so that you don't access opts[:missing_values] multiple times.

krstopro · 2024-10-30T14:49:00Z

lib/scholar/impute/knn_imputter.ex

+
+    {_, values_to_impute} =
+      while {{row = 0, mask, num_neighbors, num_rows, x}, values_to_impute},
+            Nx.less(row, num_rows) do


You can use < instead of Nx.less/2 over here.

krstopro · 2024-10-30T14:49:14Z

lib/scholar/impute/knn_imputter.ex

+            Nx.less(row, num_rows) do
+        {_, values_to_impute} =
+          while {{col = 0, mask, num_neighbors, num_cols, row, x}, values_to_impute},
+                Nx.less(col, num_cols) do


krstopro · 2024-10-30T14:52:05Z

lib/scholar/impute/knn_imputter.ex

+        {_, values_to_impute} =
+          while {{col = 0, mask, num_neighbors, num_cols, row, x}, values_to_impute},
+                Nx.less(col, num_cols) do
+            if mask[row][col] > 0 do


I think if mask[row][col] do should work here.
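The two suggestions above (`<` in the loop condition and a bare tensor as the `if` predicate) can be sketched together on a toy loop. `LoopSketch.count_set/1` is a hypothetical helper that only counts non-zero mask entries; it is not the imputation loop from the PR.

```elixir
Mix.install([{:nx, "~> 0.9"}])

defmodule LoopSketch do
  import Nx.Defn

  # Counts the non-zero entries of a 1-D mask.
  defn count_set(mask) do
    {_i, count, _mask} =
      # Inside defn, < works on tensors, so Nx.less/2 is not needed.
      while {i = 0, count = 0, mask}, i < Nx.axis_size(mask, 0) do
        # A scalar tensor can be used directly as the predicate:
        # zero is falsy, any other value is truthy.
        inc = if mask[i], do: 1, else: 0
        {i + 1, count + inc, mask}
      end

    count
  end
end

mask = Nx.tensor([1, 0, 1, 1], type: :u8)
count = LoopSketch.count_set(mask)

IO.inspect(Nx.to_number(count), label: "set entries")
```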

polvalente · 2024-10-30T19:40:44Z

lib/scholar/impute/knn_imputter.ex

+    * `number_of_neighbors` is a positive integer.
+    *  number of neighbors must be less than number valid of rows - 1 (valid row is row with more than 1 non nan value) otherwise it is better to use simple imputter


Suggested change

* `number_of_neighbors` is a positive integer.

* number of neighbors must be less than number valid of rows - 1 (valid row is row with more than 1 non nan value) otherwise it is better to use simple imputter

* The number of neighbors must be less than the number of valid rows - 1.

A valid row is a row with more than 1 non-NaN values. Otherwise it is better to use a simpler imputer.

polvalente · 2024-10-30T19:40:58Z

lib/scholar/impute/knn_imputter.ex

+  Preconditions:
+    * `number_of_neighbors` is a positive integer.
+    *  number of neighbors must be less than number valid of rows - 1 (valid row is row with more than 1 non nan value) otherwise it is better to use simple imputter
+    *  when you set a value different than :nan in `missing_values` there should be no NaNs in the input tensor


Suggested change

* when you set a value different than :nan in `missing_values` there should be no NaNs in the input tensor

* When you set a value different than `:nan` in `missing_values` there should be no NaNs in the input tensor

polvalente · 2024-10-30T19:41:48Z

lib/scholar/impute/knn_imputter.ex

+    * `:missing_values` - the same value as in `:missing_values`
+
+    * `:statistics` - The imputation fill value for each feature. Computing statistics can result in
+    [`Nx.Constant.nan/0`](https://hexdocs.pm/nx/Nx.Constants.html#nan/0) values.


Suggested change

[`Nx.Constant.nan/0`](https://hexdocs.pm/nx/Nx.Constants.html#nan/0) values.

[`Nx.Constants.nan/0`](https://hexdocs.pm/nx/Nx.Constants.html#nan/0) values.

Do you need the explicit linking in hexdoc?

polvalente · 2024-10-30T19:42:23Z

lib/scholar/impute/knn_imputter.ex

+
+    The function returns a struct with the following parameters:
+
+    * `:missing_values` - the same value as in `:missing_values`


Suggested change

* `:missing_values` - the same value as in `:missing_values`

* `:missing_values` - the same value as in the `:missing_values` option

polvalente · 2024-10-30T19:43:58Z

lib/scholar/impute/knn_imputter.ex

+
+    num_neighbors = opts[:number_of_neighbors]
+
+    placeholder_value = Nx.Constants.nan() |> Nx.tensor()


Suggested change

placeholder_value = Nx.Constants.nan() |> Nx.tensor()

placeholder_value = Nx.Constants.nan()

you probably want to pass the input type here to avoid upcasts

polvalente · 2024-10-30T19:45:39Z

lib/scholar/impute/knn_imputter.ex

+
+  opts_schema = [
+    missing_values: [
+      type: {:or, [:float, :integer, {:in, [:nan]}]},


Suggested change

type: {:or, [:float, :integer, {:in, [:nan]}]},

type: {:or, [:float, :integer, {:in, [:nan]}]},

I believe this should allow :infinity and :neg_infinity too for completeness

polvalente · 2024-10-30T19:50:59Z

lib/scholar/impute/knn_imputter.ex

+              indices =
+                [Nx.stack(row), Nx.stack(col)]
+                |> Nx.concatenate()
+                |> Nx.stack()


Suggested change

-              indices =
-                [Nx.stack(row), Nx.stack(col)]
-                |> Nx.concatenate()
-                |> Nx.stack()
+              indices = Nx.stack([row, col]) |> Nx.reshape({1, 2})

If I read the code correctly, row and col are scalars and this should yield the same result

polvalente · 2024-10-30T19:52:41Z

lib/scholar/impute/knn_imputter.ex

+                |> Nx.concatenate()
+                |> Nx.stack()
+
+              values_to_impute = Nx.indexed_put(values_to_impute, indices, Nx.stack(neighbor_avg))


Suggested change

values_to_impute = Nx.indexed_put(values_to_impute, indices, Nx.stack(neighbor_avg))

values_to_impute = Nx.put_slice(values_to_impute, [row, col], Nx.reshape(neighbor_avg, {1, 1}))

I think this is even simpler
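A quick sketch checking that the two ways of writing a single value agree. The tensor shapes and the values of `row`, `col` and `neighbor_avg` are made up for illustration, and the update passed to `Nx.indexed_put/3` is reshaped explicitly here rather than copied from the PR.

```elixir
Mix.install([{:nx, "~> 0.9"}])

values_to_impute = Nx.broadcast(0.0, {3, 2})
row = Nx.tensor(1)
col = Nx.tensor(0)
neighbor_avg = Nx.tensor(4.5)

# One {row, col} coordinate, shaped {1, 2}, with a single update value.
indices = Nx.stack([row, col]) |> Nx.reshape({1, 2})
via_indexed_put = Nx.indexed_put(values_to_impute, indices, Nx.reshape(neighbor_avg, {1}))

# put_slice writes a {1, 1} slice starting at [row, col].
via_put_slice = Nx.put_slice(values_to_impute, [row, col], Nx.reshape(neighbor_avg, {1, 1}))

IO.inspect(Nx.to_list(via_indexed_put), label: "indexed_put")
IO.inspect(Nx.to_list(via_put_slice), label: "put_slice")
```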

polvalente · 2024-10-30T20:00:11Z

lib/scholar/impute/knn_imputter.ex

+    {_, row_distances} =
+      while {{i = 0, x, row_with_value_to_fill, rows, nan_row, nan_col}, row_distances},
+            Nx.less(i, rows) do
+        potential_donor = x[i]
+
+        distance =
+          if i == nan_row do
+            Nx.Constants.infinity(Nx.type(row_with_value_to_fill))
+          else
+            nan_euclidian(row_with_value_to_fill, nan_col, potential_donor)
+          end
+
+        row_distances = Nx.indexed_put(row_distances, Nx.new_axis(i, 0), distance)
+        {{i + 1, x, row_with_value_to_fill, rows, nan_row, nan_col}, row_distances}
+      end


try this:

potential_donors = Nx.vectorize(x, :rows)
distances = nan_euclidean(row_with_value_to_fill, nan_col, potential_donors) |> Nx.devectorize()
row_distances = Nx.indexed_put(distances, [i], Nx.Constants.infinity())
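A self-contained sketch of the vectorization idea on toy data. It uses a plain squared Euclidean distance in place of the PR's NaN-aware distance function, and the names `target_index`, `target` and `donors` are illustrative assumptions.

```elixir
Mix.install([{:nx, "~> 0.9"}])

x = Nx.tensor([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
target_index = 1
target = x[target_index]

# Treat every row of x as a separate donor; operations now apply per row.
donors = Nx.vectorize(x, :rows)

# Plain squared Euclidean distance from the target row to each donor,
# computed without an explicit while loop.
distances =
  target
  |> Nx.subtract(donors)
  |> Nx.pow(2)
  |> Nx.sum()
  |> Nx.devectorize(keep_names: false)

# Exclude the target row itself by setting its distance to infinity.
row_distances =
  Nx.indexed_put(distances, Nx.tensor([[target_index]]), Nx.tensor([:infinity]))

IO.inspect(row_distances, label: "row_distances")
```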

srzeszut added 5 commits October 20, 2024 15:17

add KNNImputer

d6c7a55

fix doctests

47b4a65

mix format

eb8f245

change placeholder_value to tensor

642b15e

fix doctest

520633a


josevalim requested review from krstopro and msluszniak October 21, 2024 07:22


srzeszut added 2 commits October 27, 2024 11:20

apply suggested changes

926a1c7

added type tensors

a3e0eba


srzeszut and others added 2 commits October 28, 2024 12:49

Merge branch 'elixir-nx:main' into main

4757dd1

added tests and remove not working checks

108475d
