RepoRoulette: Randomly Sample Repositories from GitHub

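RepoRoulette is a Python tool for randomly sampling repositories from GitHub. It offers several sampling methods: ID-based, temporal, BigQuery, and GH Archive sampling. BigQuery sampling uses the public GitHub dataset on Google BigQuery and provides advanced filtering, but requires a GCP account. GH Archive sampling draws its data from GitHub Archive. The tool can be used for academic research, learning resources, data science, trend analysis, and security research. Contributions are welcome.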

RepoRoulette 🎲: Randomly Sample Repositories from GitHub

Spin the wheel and see which GitHub repositories you get!

🚀 Installation

# Using pip
pip install reporoulette
# Install from source
git clone https://github.com/gojiplus/reporoulette.git
cd reporoulette
pip install -e .

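📖 Sampling Methods

RepoRoulette provides four different methods for randomly sampling GitHub repositories:

1. 🎯 ID-Based Sampling

Uses GitHub's sequential repository ID system to generate truly random samples by probing random IDs within the range of valid IDs. The downside is that the hit rate can be low (many IDs are invalid, partly because the repository is private, abandoned, and so on), and any filtering on repository characteristics has to wait until you have the repository names.

The function keeps sampling until it reaches either max_attempts or n_samples. You can pass seed for reproducibility.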

from reporoulette import IDSampler
# Initialize the sampler
sampler = IDSampler(token="your_github_token")
# Get 50 random repositories
repos = sampler.sample(n_samples=50)
# Print basic statistics
print(f"Success rate: {sampler.success_rate:.2f}%")
print(f"Samples collected: {len(repos)}")
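
The stopping rule described above (probe random IDs until either n_samples hits or max_attempts probes) can be sketched without any network access. This is a minimal illustration, not RepoRoulette's actual implementation: `sample_by_id`, `fake_lookup`, and the `max_id` parameter are hypothetical names, and the lookup function stands in for a real GitHub API call.

```python
import random
from typing import Callable, List, Optional, Tuple


def sample_by_id(
    lookup: Callable[[int], Optional[str]],
    n_samples: int,
    max_id: int,
    max_attempts: int,
    seed: Optional[int] = None,
) -> Tuple[List[str], float]:
    """Probe random IDs until n_samples hits or max_attempts probes."""
    rng = random.Random(seed)  # a seeded generator makes the run repeatable
    found: List[str] = []
    seen = set()
    attempts = 0
    while len(found) < n_samples and attempts < max_attempts:
        attempts += 1
        repo_id = rng.randint(1, max_id)
        if repo_id in seen:
            continue  # counted as an attempt, but never probed twice
        seen.add(repo_id)
        name = lookup(repo_id)  # None means the ID is invalid or inaccessible
        if name is not None:
            found.append(name)
    success_rate = 100.0 * len(found) / attempts if attempts else 0.0
    return found, success_rate


# Hypothetical stand-in for a GitHub lookup: only every fourth ID "exists".
def fake_lookup(repo_id: int) -> Optional[str]:
    return f"owner/repo-{repo_id}" if repo_id % 4 == 0 else None


repos, rate = sample_by_id(
    fake_lookup, n_samples=5, max_id=1000, max_attempts=200, seed=42
)
print(f"Collected {len(repos)} samples, success rate {rate:.2f}%")
```

Because invalid IDs cost an attempt without yielding a sample, a low hit rate translates directly into more API calls per collected repository, which is why max_attempts acts as a budget.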

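2. ⏱️ Temporal Sampling

Randomly selects time points (date/hour combinations) within the specified range, then retrieves repositories that were updated during those periods.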

from reporoulette import TemporalSampler
from datetime import datetime, timedelta
# Define a date range (the last 3 months)
end_date = datetime.now()
start_date = end_date - timedelta(days=90)
# Initialize the sampler
sampler = TemporalSampler(
    token="your_github_token",
    start_date=start_date,
    end_date=end_date
)
# Get 100 random repositories
repos = sampler.sample(n_samples=100)
# Get repositories with specific characteristics
filtered_repos = sampler.sample(
    n_samples=50,
    min_stars=10,
    languages=["python", "javascript"]
)
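
To make the "date/hour combinations" idea concrete, here is a small sketch of how random one-hour windows could be drawn from a date range. It is an illustration under assumptions, not the library's code: `random_hour_windows` and `pushed_qualifier` are hypothetical helpers, and mapping a window to a GitHub search `pushed:` date-range qualifier is only one plausible way to retrieve repositories updated in that hour.

```python
import random
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Window = Tuple[datetime, datetime]


def random_hour_windows(
    start: datetime, end: datetime, n_windows: int, seed: Optional[int] = None
) -> List[Window]:
    """Pick distinct one-hour windows uniformly at random within [start, end]."""
    rng = random.Random(seed)
    # Round the start up to a full hour so every window lies inside the range.
    first = start.replace(minute=0, second=0, microsecond=0)
    if first < start:
        first += timedelta(hours=1)
    total_hours = int((end - first) // timedelta(hours=1))
    if total_hours < 1:
        raise ValueError("the range must contain at least one full hour")
    offsets = rng.sample(range(total_hours), min(n_windows, total_hours))
    return [
        (first + timedelta(hours=h), first + timedelta(hours=h + 1))
        for h in sorted(offsets)
    ]


def pushed_qualifier(window: Window) -> str:
    """Format a window as a search date-range qualifier (times treated as UTC)."""
    begin, finish = window
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return f"pushed:{begin.strftime(fmt)}..{finish.strftime(fmt)}"


windows = random_hour_windows(
    datetime(2024, 1, 1), datetime(2024, 3, 31), n_windows=5, seed=42
)
for window in windows:
    print(pushed_qualifier(window))
```

Sampling hours rather than whole days keeps each query's result set small, at the price of needing more queries to reach a given sample size.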

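3. 🔍 BigQuery Sampling

BigQuerySampler uses the public GitHub dataset on Google BigQuery to sample repositories, with advanced filtering capabilities.

BigQuery Sampler Setup

Create a Google Cloud Platform (GCP) project:
- Go to the GCP Console
- Create a new project
Enable the BigQuery API:
- In your project, go to "APIs & Services" > "Library"
- Search for "BigQuery API" and enable it
Create a service account:
- Go to "IAM & Admin" > "Service Accounts"
- Create a new service account
- Grant it the "BigQuery User" role
- Create and download a JSON key file
Install the required dependencies: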

pip install google-cloud-bigquery google-auth

Using BigQuerySampler:

from reporoulette import BigQuerySampler
# Initialize with service account credentials
sampler = BigQuerySampler(
    credentials_path="path/to/your-service-account-key.json",
    project_id="your-gcp-project-id",
    seed=42
)
# Sample active repositories with commits in the past year
active_repos = sampler.sample(
    n_samples=50,
    population="active",
    languages=["Python", "JavaScript"]  # Optional language filter
)
# Sample repositories across random days
random_repos = sampler.sample_by_day(
    n_samples=50,
    days_to_sample=10,
    years_back=5
)
# Get language information for the sampled repositories
languages = sampler.get_languages(random_repos)
# Print the results
for repo in random_repos:
    print(f"Repository: {repo['full_name']}")
    repo_languages = languages.get(repo['full_name'], [])
    if repo_languages:
        print(f"Primary language: {repo_languages[0]['language']}")
    print("---")

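Advantages:

- Handles large samples efficiently
- Powerful filtering and stratification options
- Not subject to GitHub API rate limits
- Access to historical data

Disadvantages:

- Can be expensive
- Requires a Google Cloud Platform account with billing enabled
- The dataset may lag slightly behind GitHub (typically 24-48 hours)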
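
For readers curious what a seeded, language-filtered sample might look like in SQL, the sketch below builds one such query against the public `bigquery-public-data.github_repos.languages` table. It is an assumption about one workable approach, not necessarily the query BigQuerySampler runs: `build_sample_query` is a hypothetical helper, and ordering by a hash of the repository name plus a seed is a swapped-in technique for reproducible pseudo-random ordering.

```python
from typing import Dict, Sequence, Tuple

# Public table with one row per repository and a repeated `language` record.
LANGUAGES_TABLE = "bigquery-public-data.github_repos.languages"


def build_sample_query(
    n_samples: int, seed: int, languages: Sequence[str] = ()
) -> Tuple[str, Dict[str, object]]:
    """Return (sql, params) for a seeded pseudo-random sample of repo names."""
    # The language filter is only added when languages are given; values are
    # passed as named query parameters rather than spliced into the SQL.
    where = "WHERE lang.name IN UNNEST(@languages)" if languages else ""
    sql = f"""
SELECT repo_name
FROM (
  SELECT DISTINCT repo_name
  FROM `{LANGUAGES_TABLE}`, UNNEST(language) AS lang
  {where}
)
ORDER BY FARM_FINGERPRINT(CONCAT(repo_name, @seed))
LIMIT {int(n_samples)}
""".strip()
    params: Dict[str, object] = {"seed": str(seed)}
    if languages:
        params["languages"] = list(languages)
    return sql, params


sql, params = build_sample_query(50, seed=42, languages=["Python", "JavaScript"])
print(sql)
print(params)
```

With the google-cloud-bigquery client, the returned values would typically be supplied through `ScalarQueryParameter` and `ArrayQueryParameter` in a `QueryJobConfig`. Hashing with the seed gives the same ordering on every run, so the same seed reproduces the same sample as long as the underlying table has not changed.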

4. GH Archive Sampler

GHArchiveSampler obtains repositories by sampling events from GitHub Archive, a project that records the public GitHub timeline.

from reporoulette import GHArchiveSampler
# Initialize with optional parameters
sampler = GHArchiveSampler(seed=42)  # Set a seed for reproducibility
# Sample repositories
repos = sampler.sample(
    n_samples=100,      # Number of repositories to sample
    days_to_sample=5,   # Number of random days to sample
    repos_per_day=20,   # Number of repositories to sample per day
    years_back=2,       # How many years to look back
    event_types=["PushEvent", "CreateEvent", "PullRequestEvent"]  # Event types to consider
)
# Access the results
for repo in repos:
    print(f"Repository: {repo['full_name']}")
    print(f"Event type: {repo['event_type']}")
    print(f"Sampled from: {repo['sampled_from']}")
    print("---")
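
The two steps behind this method (choose random archive files, then pull repository names out of matching events) can be sketched offline. This is an illustration under assumptions rather than the library's implementation: `random_archive_urls` and `repos_from_events` are hypothetical helpers, the URL pattern follows GH Archive's published hourly file naming, and the event layout (`type` and `repo.name` fields) is assumed to be the format used since 2015.

```python
import json
import random
from datetime import date, timedelta
from typing import Dict, Iterable, List, Optional


def random_archive_urls(
    days_to_sample: int, years_back: int, today: date, seed: Optional[int] = None
) -> List[str]:
    """Pick distinct random days in the look-back period, one random hour each."""
    rng = random.Random(seed)
    span_days = 365 * years_back
    offsets = rng.sample(range(1, span_days + 1), days_to_sample)
    urls = []
    for offset in offsets:
        day = today - timedelta(days=offset)
        hour = rng.randrange(24)  # hourly files are named with an unpadded hour
        urls.append(f"https://data.gharchive.org/{day:%Y-%m-%d}-{hour}.json.gz")
    return urls


def repos_from_events(
    lines: Iterable[str], event_types: Iterable[str]
) -> List[Dict[str, str]]:
    """Collect unique repositories from JSON event lines of the wanted types."""
    wanted = set(event_types)
    first_event: Dict[str, str] = {}
    for line in lines:
        event = json.loads(line)
        if event.get("type") in wanted:
            # Keep the first matching event type seen for each repository.
            first_event.setdefault(event["repo"]["name"], event["type"])
    return [{"full_name": n, "event_type": t} for n, t in first_event.items()]


urls = random_archive_urls(days_to_sample=3, years_back=2, today=date(2024, 6, 1), seed=42)
print(urls)

# Hand-written sample lines standing in for a decompressed archive file.
sample_lines = [
    '{"type": "PushEvent", "repo": {"name": "octocat/hello-world"}}',
    '{"type": "WatchEvent", "repo": {"name": "octocat/spoon-knife"}}',
    '{"type": "CreateEvent", "repo": {"name": "octocat/hello-world"}}',
    '{"type": "PullRequestEvent", "repo": {"name": "octocat/linguist"}}',
]
sampled = repos_from_events(sample_lines, ["PushEvent", "PullRequestEvent"])
print(sampled)
```

Note that sampling from events favors active repositories: a repository that generates many events in the chosen hours is more likely to be picked than a dormant one.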

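📊 Example Use Cases

- Academic research: study coding practices across different languages and communities
- Learning resources: discover diverse code examples for teaching
- Data science: build datasets for machine learning models of code patterns
- Trend analysis: identify emerging technologies and practices
- Security research: find vulnerability patterns across different types of repositories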

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📜 License

This project is licensed under the MIT License. See the LICENSE file for details.

🔗 Related Projects

- GHTorrent: GitHub data archive project
- GitHub Archive: archive of the public GitHub timeline
- PyGithub: Python library for the GitHub API

Built with ❤️ by Gojiplus