
beeFormer

This is the official implementation provided with our paper beeFormer: Bridging the Gap Between Semantic and Interaction Similarity in Recommender Systems.

Main idea of beeFormer

(Figure: beeformer_explaining.png — illustration of the main idea of beeFormer.)

Collaborative filtering (CF) methods can capture patterns in interaction data that are not obvious at first glance. For example, when buying a printer, a user may also buy toner, paper, or a cable to connect the printer, and collaborative filtering can take such patterns into account. However, in the cold-start recommendation setting, when new items have no interactions yet, collaborative filtering methods cannot be used, and recommender systems are forced to fall back on other approaches, such as content-based filtering (CBF). The problem with content-based filtering is that it relies on item attributes, such as textual descriptions. In our printer example, a language model trained for semantic similarity would rank other printers ahead of the accessories the user is probably looking for. Our approach is to train a language model to learn these user-behavior patterns from interaction data and to transfer that knowledge to previously unseen items. Our experiments show that the performance gains of this approach are substantial.

Steps to start training the models:

  1. create a virtual environment with python3.10 -m venv beef and activate it with source beef/bin/activate
  2. clone this repository and navigate to it: cd beeformer
  3. install the required packages: pip install -r requirements.txt
  4. download the data for movielens: navigate to the _datasets/ml20m folder and run source download_data
  5. download the data for goodbooks: navigate to the _datasets/goodbooks folder and run source download_data
  6. download the data for amazonbooks: navigate to the _datasets/amazonbooks folder and run source download_data && python preprocess.py
  7. in the root folder of the project, run train.py, for example like this:
python train.py --seed 42 --scheduler None --lr 1e-5 --epochs 5 --dataset goodbooks --sbert "sentence-transformers/all-mpnet-base-v2" --max_seq_length 384 --batch_size 1024 --max_output 10000 --sbert_batch_size 200 --use_cold_start true --save_every_epoch true --model_name my_model
  8. Evaluate the results. To reproduce the numbers from the paper using our Hugging Face repository, run for example:
python evaluate_itemsplit.py --seed 42 --dataset goodbooks --sbert beeformer/Llama-goodbooks-mpnet

or

python evaluate_timesplit.py --seed 42 --dataset amazon-books --sbert beeformer/Llama-amazbooks-mpnet

Datasets and preprocessing

Preprocessing information

We consider ratings of 4.0 and above as interactions. We keep only users with at least 5 interactions.
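
A minimal pandas sketch of this preprocessing, assuming a ratings table with hypothetical user_id, item_id, and rating columns (the real file and column names differ per dataset):

import pandas as pd

ratings = pd.read_csv("ratings.csv")  # hypothetical input file

# Treat ratings of 4.0 and above as (binary) interactions.
interactions = ratings[ratings["rating"] >= 4.0][["user_id", "item_id"]]

# Keep only users with at least 5 interactions.
user_counts = interactions.groupby("user_id")["item_id"].transform("size")
interactions = interactions[user_counts >= 5]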

LLM Data augmentations

Since there are no textual descriptions in the original data, we manually joined several other datasets with the original data and trained our models on them. However, this approach has some limitations: texts from different sources have different styles and different lengths, which can affect the results. Therefore, we used the Llama-3.1-8b-instruct model to generate item descriptions for us. We used the following conversation template:

import pandas as pd
from tqdm import tqdm
from vllm import LLM, SamplingParams

# Items with side information joined from other sources.
items = pd.read_feather("items_with_gathered_side_info.feather")

# Llama-3.1-8B-Instruct served through vLLM.
llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", dtype="float16")
tokenizer = llm.get_tokenizer()

# Build one chat prompt per item using the model's chat template.
conversation = [ tokenizer.apply_chat_template(
    [
      {'role': 'system','content':"You are ecomerce shop designer. Given a item description create one paragraph long summarization of the product."},
      {'role': 'user', 'content': "Item description: "+x},
      {'role': 'assistant', 'content': "Sure, here is your one paragraph summary of your product:"},
    ],
    tokenize=False,
  ) for x in tqdm(items.gathered_features.to_list())]

# Generate the summaries; low temperature keeps the outputs close to deterministic.
output = llm.generate(
  conversation,
  SamplingParams(
    temperature=0.1,
    top_p=0.9,
    max_tokens=512,
    stop_token_ids=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")],
  )
)
items_descriptions = [o.outputs[0].text for o in output]

However, the LLM refused to generate descriptions for some items (for example, because it refuses to produce explicit content). We removed such items from the dataset. We also removed items for which we could not join any meaningful description from the other datasets, since this led the LLM to hallucinate item descriptions entirely.
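
A hedged sketch of one way such refusals could be filtered out; the marker phrases and the pandas column are illustrative assumptions, not the exact filter used to build the published datasets:

# Hypothetical refusal detection: drop items whose generated text looks like
# a refusal instead of a product summary.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am not able")  # illustrative list

def looks_like_refusal(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

items["generated_description"] = items_descriptions  # from the vLLM snippet above
items = items[~items["generated_description"].apply(looks_like_refusal)]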

We share the generated LLM item descriptions in the _datasets/ml20m, _datasets/goodbooks, and _datasets/amazonbooks folders.

Statistics of datasets used for evaluation

| | GoodBooks-10k | MovieLens-20M | Amazon Books |
|---|---|---|---|
| # of items in X | 9975 | 16902 | 63305 |
| # of users in X | 53365 | 136589 | 634964 |
| # of interactions in X | 4119623 | 9694668 | 8290500 |
| density of X [%] | 0.7739 | 0.4199 | 0.0206 |
| density of X^T X [%] | 41.22 | 26.93 | 7.59 |
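
For reference, the densities above can be computed directly from the sparse user-item interaction matrix X; a small scipy sketch (the toy matrix below only stands in for a real dataset):

import scipy.sparse as sp

# X: users x items interaction matrix; here a random toy matrix as a stand-in.
X = sp.random(1000, 500, density=0.01, format="csr")

density_X = X.nnz / (X.shape[0] * X.shape[1]) * 100

# Item-item co-occurrence matrix X^T X; its density reflects how many item pairs
# share at least one user.
XtX = (X.T @ X).tocsr()
XtX.eliminate_zeros()
density_XtX = XtX.nnz / (XtX.shape[0] * XtX.shape[1]) * 100

print(f"density of X [%]: {density_X:.4f}, density of X^T X [%]: {density_XtX:.2f}")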

Pretrained models

We share pretrained models at https://huggingface.co/beeformer.
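
Since the checkpoints are fine-tuned from sentence-transformers/all-mpnet-base-v2, they should load with the standard sentence-transformers API; a minimal usage sketch (the item texts are made up):

from sentence_transformers import SentenceTransformer

# Assumption: the published checkpoints load as regular sentence-transformers models.
model = SentenceTransformer("beeformer/Llama-goodbooks-mpnet")

# Hypothetical item descriptions; in the cold-start setting these can be items
# with no interactions at all.
item_texts = [
    "A laser printer with duplex printing and Wi-Fi connectivity.",
    "A pack of 500 sheets of A4 office paper.",
]
embeddings = model.encode(item_texts, normalize_embeddings=True)

# Item-item similarity from the interaction-trained embeddings.
print(embeddings @ embeddings.T)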

Hyperparameters

We trained our models with the following hyperparameters.

| hyperparameter | description | beeformer/Llama-goodbooks-mpnet | beeformer/Llama-movielens-mpnet | beeformer/Llama-goodlens-mpnet | beeformer/Llama-amazbooks-mpnet |
|---|---|---|---|---|---|
| seed | random seed used during training | 42 | 42 | 42 | 42 |
| scheduler | learning rate scheduling strategy | constant learning rate | constant learning rate | constant learning rate | constant learning rate |
| lr | learning rate | 1e-5 | 1e-5 | 1e-5 | 1e-5 |
| epochs | number of trained epochs | 5 | 5 | 10 | 5 |
| devices | the training script can train on multiple GPUs in parallel; we used 4x V100 | [0,1,2,3] | [0,1,2,3] | [0,1,2,3] | [0,1,2,3] |
| dataset | dataset used for training | goodbooks | ml20m | goodlens | amazon-books |
| sbert | original sentence transformer model used as the initial model for training | sentence-transformers/all-mpnet-base-v2 | sentence-transformers/all-mpnet-base-v2 | sentence-transformers/all-mpnet-base-v2 | sentence-transformers/all-mpnet-base-v2 |
| max_seq_length | limit on sequence length; shorter sequences train faster (the original mpnet model uses at most 512 tokens per sequence) | 384 | 384 | 384 | 384 |
| batch_size | number of users sampled in a random batch from the interaction matrix | 1024 | 1024 | 1024 | 1024 |
| max_output | negative sampling hyperparameter (m in the paper); negatives are sampled uniformly at random | 10000 | 10000 | 10000 | 12500 |
| sbert_batch_size | number of items processed together during a training step (gradient accumulation step size) | 200 | 200 | 200 | 200 |
| use_cold_start | split the dataset item-wise (some items are held out to test generalization to new items) | true | true | true | false |
| use_time_split | sort interactions by timestamp and use the last 20% of interactions as a test set (generalization from the past to the future) | false | false | false | true |

RecSys 2024 poster

(Poster: beeFormer-poster.pdf, also available as beeFormer-poster.png in this repository.)

Citation

If you find this repository helpful, feel free to cite our paper:

@inproceedings{10.1145/3640457.3691707,
    author = {Van\v{c}ura, Vojt\v{e}ch and Kord\'{\i}k, Pavel and Straka, Milan},
    title = {beeFormer: Bridging the Gap Between Semantic and Interaction Similarity in Recommender Systems},
    year = {2024},
    isbn = {9798400705052},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3640457.3691707},
    doi = {10.1145/3640457.3691707},
    booktitle = {Proceedings of the 18th ACM Conference on Recommender Systems},
    pages = {1102–1107},
    numpages = {6},
    keywords = {Cold-start recommendation, Recommender systems, Sentence embeddings, Text mining, Zero-shot recommendation},
    location = {Bari, Italy},
    series = {RecSys '24}
}
