Benchmark Leaderboard

What is GISA?

GISA is a benchmark for General Information-Seeking Assistants with 373 human-crafted queries that reflect real-world information needs. It includes both stable and live subsets, four structured answer formats (item, set, list, table), and complete human search trajectories for every query.

Diverse answer formats with deterministic evaluation. GISA uses four structured answer types (item, set, list, table) with strict matching metrics for reproducible evaluation, avoiding subjective LLM judging while preserving task diversity.
Unified deep + wide search capabilities. Tasks require both vertical reasoning and horizontal information aggregation across sources, evaluating long-horizon exploration and summarization in one benchmark.
Dynamic, anti-static evaluation. Queries are split into stable and live subsets; the live subset is periodically updated to reduce memorization and keep the benchmark challenging over time.
Process-level supervision via human trajectories. Full human search trajectories are provided for every query, serving as gold references for process reward modeling and imitation learning while validating task solvability.

Citation

@article{GISA,
  title     = {GISA: A Benchmark for General Information Seeking Assistant},
  author    = {Yutao Zhu and
               Xingshuo Zhang and
               Maosen Zhang and
               Jiajie Jin and
               Liancheng Zhang and
               Xiaoshuai Song and
               Kangzhi Zhao and
               Wencong Zeng and
               Ruiming Tang and
               Han Li and
               Ji-Rong Wen and
               Zhicheng Dou},
  journal      = {CoRR},
  volume       = {abs/2602.08543},
  year         = {2026},
  url          = {https://doi.org/10.48550/arXiv.2602.08543},
  doi          = {10.48550/ARXIV.2602.08543},
  eprinttype    = {arXiv},
  eprint       = {2602.08543}
}

Leaderboard

Model rankings on the official test split. Click a column to sort. Use search to filter by model name.

Rank	Model / System	Framework	Date	Overall EM	Item	Set		List			Table
Rank	Model / System	Framework	Date	Overall EM	EM	EM	F1	EM	F1	Order	EM	Row-F1	Item-F1

Submit Results

Please follow our submission instructions (link coming soon) and open a pull request on the GitHub repository. We review PRs periodically and merge approved results.