[🐛BUG] Negative sampling does not make use of explicit-feedback samples #2066

Open
HowardZJU opened this issue Jul 17, 2024 · 1 comment
Labels
bug Something isn't working

Comments


HowardZJU commented Jul 17, 2024

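Describe the bug
Take the ML-1M dataset as an example, where ratings range from 1 to 5.

The generated sparse interaction matrix stores only the user-item pairs whose rating is at or above the threshold. User-item pairs rated below the threshold are set to 0, exactly like unobserved user-item pairs.

This approach does not make effective use of explicit negative feedback: explicit negative samples and unobserved samples are both treated as negatives.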
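The collapse described above can be shown with a toy example. This is a minimal sketch in plain Python, not RecBole code; the threshold of 3 and the tiny rating table are assumptions chosen for illustration.

```python
# Toy illustration only: not RecBole code. The threshold value 3 is an assumption.
THRESHOLD = 3

# (user, item, rating) triples on a 1-5 scale, as in ML-1M.
ratings = [
    ("u1", "i1", 5),
    ("u1", "i2", 1),  # explicit negative: rated, but below the threshold
    ("u2", "i1", 4),
    ("u2", "i3", 2),  # explicit negative
]
users = ["u1", "u2"]
items = ["i1", "i2", "i3", "i4"]

# Binary interaction matrix: only ratings >= THRESHOLD are stored as 1.
positives = {(u, i) for u, i, r in ratings if r >= THRESHOLD}
matrix = {u: [1 if (u, i) in positives else 0 for i in items] for u in users}

# Split the zero cells into pairs that were rated and pairs never observed.
rated = {(u, i) for u, i, _ in ratings}
zero_cells = [(u, i) for u in users for i in items if (u, i) not in positives]
explicit_negatives = [cell for cell in zero_cells if cell in rated]
unobserved = [cell for cell in zero_cells if cell not in rated]

print(matrix)
print(len(explicit_negatives), "explicit negatives and",
      len(unobserved), "unobserved cells share the value 0")
```

Once the matrix is built, nothing in it records which zeros came from a low rating and which came from a missing rating.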

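Questions and requests

  1. Is it possible to access the explicit negative samples, i.e. samples with rating < threshold, during training?
  2. Is it possible to access both the explicit negative samples and the unobserved samples obtained through negative sampling during training, and to tell the two apart?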
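One way to picture the second request is a sampler that tags each training sample with its source. The sketch below is plain Python and does not use any RecBole API; `build_training_samples`, its parameters, and the tag names are hypothetical.

```python
import random


def build_training_samples(user_ratings, all_items, threshold, num_sampled, seed=0):
    """Hypothetical helper: split one user's items into three tagged groups."""
    rng = random.Random(seed)
    positives = [i for i, r in user_ratings.items() if r >= threshold]
    explicit_negatives = [i for i, r in user_ratings.items() if r < threshold]
    # Candidates for negative sampling are items the user never rated.
    unobserved = sorted(set(all_items) - set(user_ratings))
    sampled = rng.sample(unobserved, min(num_sampled, len(unobserved)))
    return (
        [(i, "positive") for i in positives]
        + [(i, "explicit_negative") for i in explicit_negatives]
        + [(i, "sampled_unobserved") for i in sampled]
    )


samples = build_training_samples(
    user_ratings={"i1": 5, "i2": 1, "i3": 4},
    all_items=["i1", "i2", "i3", "i4", "i5", "i6"],
    threshold=3,  # assumed value for illustration
    num_sampled=2,
)
for item, source in samples:
    print(item, source)
```

With the source kept alongside each item, a loss function could weight explicit negatives differently from sampled ones. Whether that helps is a modeling question this sketch does not answer.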

How to reproduce
Steps to reproduce the bug:
In the quick start, set a breakpoint at the following line and inspect the resulting data.
train_data, valid_data, test_data = data_preparation(config, dataset)

Environment:

  • Operating system: Linux
HowardZJU added the bug label Jul 17, 2024

HowardZJU commented Jul 17, 2024

For example, to address the problems raised above, would it be feasible to change the _set_label_by_threshold(self) function so that explicit negative samples are labeled -1 instead of 0?

  def _set_label_by_threshold(self):
      """Generate 0/1 labels according to value of features.

      According to ``config['threshold']``, those rows with value lower than threshold will
      be given negative label, while the other will be given positive label.
      See :doc:`../user_guide/data/data_args` for detail arg setting.

      Note:
          Key of ``config['threshold']`` is a field name.
          This field will be dropped after label generation.
      """
      threshold = self.config["threshold"]
      if threshold is None:
          return

      self.logger.debug(f"Set label by {threshold}.")

      if len(threshold) != 1:
          raise ValueError("Threshold length should be 1.")

      self.set_field_property(
          self.label_field, FeatureType.FLOAT, FeatureSource.INTERACTION, 1
      )
      for field, value in threshold.items():
          if field in self.inter_feat:
              self.inter_feat[self.label_field] = (
                  self.inter_feat[field] >= value
              ).astype(int)
          else:
              raise ValueError(f"Field [{field}] not in inter_feat.")
          if field != self.label_field:
              self._del_col(self.inter_feat, field)
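The proposed change can be sketched outside the library as follows. This is a standalone illustration in plain Python, not a patch to RecBole: `label_by_threshold` and its `negative_label` parameter are hypothetical, and the threshold of 3 is an assumed value.

```python
def label_by_threshold(values, threshold, negative_label=0):
    """Standalone sketch (not RecBole's API).

    Returns 1 where value >= threshold and negative_label elsewhere.
    """
    return [1 if v >= threshold else negative_label for v in values]


ratings = [5, 1, 3, 2]

# Current behavior of the method above: 0/1 labels.
current = label_by_threshold(ratings, threshold=3)

# Proposed behavior: explicit negatives become -1, so a 0 filled in later
# for an unobserved pair stays distinguishable.
proposed = label_by_threshold(ratings, threshold=3, negative_label=-1)

print(current)   # [1, 0, 1, 0]
print(proposed)  # [1, -1, 1, -1]
```

For the DataFrame column in the method above, the analogous step would map the boolean mask to 1 and -1 (for instance with `numpy.where`) instead of casting it to int. Any downstream code that assumes 0/1 labels, such as a binary cross-entropy loss, would probably need to be reviewed before adopting such a change.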
