beeFormer: A Hybrid CF and CBF Approach for Recommender Systems
Bridging the Gap Between Semantic and Interaction Similarity in Recommender Systems
License
CC-BY-SA-4.0 license
beeFormer
This is the official implementation provided with our paper beeFormer: Bridging the Gap Between Semantic and Interaction Similarity in Recommender Systems.
Main idea of beeFormer (figure: beeformer_explaining.png).
Collaborative filtering (CF) methods can capture patterns in interaction data that are not obvious at first glance. For example, when buying a printer, users may also buy toner, paper, or a cable to connect the printer, and collaborative filtering can account for such patterns. However, in the cold-start recommendation setting, when new items have no interactions yet, collaborative filtering methods cannot be used, and recommender systems are forced to fall back on other approaches, such as content-based filtering (CBF). The problem with content-based filtering is that it relies on item attributes, such as textual descriptions. In our printer example, a language model trained for semantic similarity would rank other printers above the accessories the user is likely searching for. Our approach trains language models to learn these user-behavior patterns from interaction data and to transfer that knowledge to previously unseen items. Our experiments show that the performance gains of this approach are substantial.
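To make the printer example concrete, the following is a minimal, hypothetical sketch of how a beeFormer-trained sentence transformer is meant to be used: item descriptions are embedded, and a user's scores for items come from embedding similarity to the items they already interacted with. The item texts, the toy interaction vector, and the plain dot-product decoder are illustrative assumptions, not the repository's actual inference code.

```python
# Illustrative sketch only: score cold-start items from text embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer

item_texts = [
    "Laser printer, 30 ppm, automatic duplex printing",   # the item the user bought
    "Toner cartridge compatible with laser printers",     # accessory a CF model would surface
    "USB-B cable for connecting printers",                # another accessory
]
user_history = np.array([1.0, 0.0, 0.0])  # the user interacted with the printer only

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")  # or a beeFormer checkpoint
A = model.encode(item_texts, normalize_embeddings=True)  # item embedding matrix A

# Score items by embedding similarity to the user's history; after beeFormer training,
# this similarity should reflect interaction patterns (printer -> toner/cable),
# not only semantic similarity (printer -> other printers).
scores = user_history @ A @ A.T
scores[user_history > 0] = -np.inf  # do not recommend items the user already has
print(item_texts[int(scores.argmax())])
```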
Steps to start training the models:
- create a virtual environment: `python3.10 -m venv beef` and activate it: `source beef/bin/activate`
- clone this repository and navigate to it: `cd beeformer`
- install packages: `pip install -r requirements.txt`
- download the data for MovieLens: navigate to the `_datasets/ml20m` folder and run `source download_data`
- download the data for GoodBooks: navigate to the `_datasets/goodbooks` folder and run `source download_data`
- download the data for Amazon Books: navigate to the `_datasets/amazonbooks` folder and run `source download_data && python preprocess.py`
- in the root folder of the project, run `train.py`, for example like this:
python train.py --seed 42 --scheduler None --lr 1e-5 --epochs 5 --dataset goodbooks --sbert "sentence-transformers/all-mpnet-base-v2" --max_seq_length 384 --batch_size 1024 --max_output 10000 --sbert_batch_size 200 --use_cold_start true --save_every_epoch true --model_name my_model
- Evaluate the results. To reproduce the numbers from the paper using our Hugging Face repository, run for example:
python evaluate_itemsplit.py --seed 42 --dataset goodbooks --sbert beeformer/Llama-goodbooks-mpnet
or
python evaluate_timesplit.py --seed 42 --dataset amazon-books --sbert beeformer/Llama-amazbooks-mpnet
Datasets and preprocessing
Preprocessing information
We consider ratings of 4.0 and above as interactions. We keep only users with at least 5 interactions.
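A minimal sketch of this filtering, assuming a pandas dataframe of raw ratings; the file name and column names are hypothetical, and the actual preprocessing lives in the scripts under `_datasets/`:

```python
# Hypothetical preprocessing sketch: binarize ratings and filter users.
import pandas as pd

ratings = pd.read_csv("ratings.csv")  # assumed columns: userId, itemId, rating
interactions = ratings[ratings["rating"] >= 4.0]  # ratings of 4.0 and above count as interactions
counts = interactions["userId"].value_counts()
active_users = counts[counts >= 5].index  # keep only users with at least 5 interactions
interactions = interactions[interactions["userId"].isin(active_users)]
```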
LLM Data augmentations
Since the original data contain no textual descriptions, we manually joined several datasets with the original data and trained our models on them. However, this approach has limitations: texts from different sources have different styles and lengths, which can affect the results. We therefore used the Llama-3.1-8b-instruct model to generate item descriptions for us, with the following conversation template:
import pandas as pd
from tqdm import tqdm
from vllm import LLM, SamplingParams

# items with textual side information gathered from the joined datasets
items = pd.read_feather("items_with_gathered_side_info.feather")

llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct", dtype="float16")
tokenizer = llm.get_tokenizer()

# build one prompt per item using the Llama 3.1 chat template
conversation = [
    tokenizer.apply_chat_template(
        [
            {'role': 'system', 'content': "You are ecomerce shop designer. Given a item description create one paragraph long summarization of the product."},
            {'role': 'user', 'content': "Item description: " + x},
            {'role': 'assistant', 'content': "Sure, here is your one paragraph summary of your product:"},
        ],
        tokenize=False,
    )
    for x in tqdm(items.gathered_features.to_list())
]

# generate the descriptions with low-temperature sampling
output = llm.generate(
    conversation,
    SamplingParams(
        temperature=0.1,
        top_p=0.9,
        max_tokens=512,
        stop_token_ids=[tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids("<|eot_id|>")],
    ),
)

items_descriptions = [o.outputs[0].text for o in output]
However, the LLM refused to generate descriptions for some items (for example, because it declined to produce explicit content). We removed such items from the dataset. We also removed items for which we could not join a meaningful description from the other datasets, since that caused the LLM to completely hallucinate item descriptions.
We share the generated LLM item descriptions in the `_datasets/ml20m`, `_datasets/goodbooks`, and `_datasets/amazonbooks` folders.
Statistics of datasets used for evaluation
| | GoodBooks-10k | MovieLens-20M | Amazon Books |
|---|---|---|---|
| # of items in X | 9975 | 16902 | 63305 |
| # of users in X | 53365 | 136589 | 634964 |
| # of interactions in X | 4119623 | 9694668 | 8290500 |
| density of X [%] | 0.7739 | 0.4199 | 0.0206 |
| density of X^TX [%] | 41.22 | 26.93 | 7.59 |
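The densities follow directly from the counts above, e.g. for GoodBooks-10k: 100 · 4119623 / (53365 · 9975) ≈ 0.774 %. Below is a small sketch of that computation for a sparse user-item matrix X, using scipy (an assumption; the repository may compute these statistics differently):

```python
import scipy.sparse as sp

def densities(X: sp.csr_matrix):
    """Density of X and of the item-item co-occurrence matrix X^T X, in percent."""
    density_x = 100.0 * X.nnz / (X.shape[0] * X.shape[1])
    gram = (X.T @ X).tocsr()  # item-item co-occurrence
    density_gram = 100.0 * gram.nnz / (gram.shape[0] * gram.shape[1])
    return density_x, density_gram
```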
Pretrained models
We share pretrained models at https://huggingface.co/beeformer.
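The checkpoints are initialized from `sentence-transformers/all-mpnet-base-v2`, so they can presumably be loaded with the Sentence Transformers library as in the sketch below (a usage assumption, not an official snippet):

```python
from sentence_transformers import SentenceTransformer

# Load a shared beeFormer checkpoint from the Hugging Face Hub.
model = SentenceTransformer("beeformer/Llama-goodbooks-mpnet")
embeddings = model.encode(["A mystery novel set in Victorian London."])
print(embeddings.shape)
```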
Hyperparameters
We used the following hyperparameters to train our models.
| hyperparameter | description | beeformer/Llama-goodbooks-mpnet | beeformer/Llama-movielens-mpnet | beeformer/Llama-goodlens-mpnet | beeformer/Llama-amazbooks-mpnet |
|---|---|---|---|---|---|
| seed | random seed used during training | 42 | 42 | 42 | 42 |
| scheduler | learning rate scheduling strategy | constant learning rate | constant learning rate | constant learning rate | constant learning rate |
| lr | learning rate | 1e-5 | 1e-5 | 1e-5 | 1e-5 |
| epochs | number of trained epochs | 5 | 5 | 10 | 5 |
| devices | the training script allows training on multiple GPUs in parallel; we used 4x V100 | [0,1,2,3] | [0,1,2,3] | [0,1,2,3] | [0,1,2,3] |
| dataset | dataset used for training | goodbooks | ml20m | goodlens | amazon-books |
| sbert | original sentence transformer model used as the initial model for training | sentence-transformers/all-mpnet-base-v2 | sentence-transformers/all-mpnet-base-v2 | sentence-transformers/all-mpnet-base-v2 | sentence-transformers/all-mpnet-base-v2 |
| max_seq_length | limit on sequence length; shorter sequences train faster (the original mpnet model uses at most 512 tokens per sequence) | 384 | 384 | 384 | 384 |
| batch_size | number of users sampled in a random batch from the interaction matrix | 1024 | 1024 | 1024 | 1024 |
| max_output | negative sampling hyperparameter (m in the paper); negatives are sampled uniformly at random | 10000 | 10000 | 10000 | 12500 |
| sbert_batch_size | number of items processed together during a training step (gradient accumulation step size) | 200 | 200 | 200 | 200 |
| use_cold_start | split the dataset item-wise (some items are hidden to test generalization towards new items) | true | true | true | false |
| use_time_split | sort interactions by timestamp and use the last 20% of interactions as a test set (generalization from the past to the future) | false | false | false | true |
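As a usage illustration, the `beeformer/Llama-amazbooks-mpnet` column corresponds roughly to a `train.py` call like the one below; `--devices` and `--use_time_split` are assumed flag names not shown in the earlier example, so check the argument parsing in `train.py` for the exact interface:

python train.py --seed 42 --scheduler None --lr 1e-5 --epochs 5 --devices "[0,1,2,3]" --dataset amazon-books --sbert "sentence-transformers/all-mpnet-base-v2" --max_seq_length 384 --batch_size 1024 --max_output 12500 --sbert_batch_size 200 --use_cold_start false --use_time_split true --model_name my_model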
RecSys 2024 poster: see beeFormer-poster.pdf in this repository.
Citation
If you find this repository helpful, feel free to cite our paper:
@inproceedings{10.1145/3640457.3691707,
author = {Van\v{c}ura, Vojt\v{e}ch and Kord\'{\i}k, Pavel and Straka, Milan},
title = {beeFormer: Bridging the Gap Between Semantic and Interaction Similarity in Recommender Systems},
year = {2024},
isbn = {9798400705052},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3640457.3691707},
doi = {10.1145/3640457.3691707},
booktitle = {Proceedings of the 18th ACM Conference on Recommender Systems},
pages = {1102–1107},
numpages = {6},
keywords = {Cold-start recommendation, Recommender systems, Sentence embeddings, Text mining, Zero-shot recommendation},
location = {Bari, Italy},
series = {RecSys '24}
}