MobileDev-Bench: A Benchmark for Issue Resolution in Mobile Application Development

Moshood Fakorede1 Krishna Upadhyay1 A.B. Siddique2 Umar Farooq1
1Louisiana State University 2University of Kentucky

TL;DR: MobileDev-Bench evaluates whether LLM coding agents can resolve real mobile app issues across Android, Flutter, and React Native. On 407 verified tasks, frontier models resolve only 3.23%-4.23% under automated retrieval.

Key Takeaways

  • Mobile-specific gap: Current issue-resolution benchmarks focus mainly on library-style repositories; MobileDev-Bench targets app-level mobile systems with framework build constraints, UI/resource artifacts, and platform APIs.
  • Patch complexity: Fixes modify 12.9 files and 334.6 lines on average, and 41% of instances require coordinated changes across multiple artifact types.
  • Coordination bottleneck: Single-file tasks reach 12.7%-15.5% automated resolution, but tasks requiring 6 or more files drop to 0% across all evaluated models.

MobileDev-Bench Leaderboard

The leaderboard summarizes end-to-end issue resolution under two retrieval settings.

Tracks
  • Automated Retrieval: the agent must retrieve context and generate a patch.
  • Oracle Retrieval: the agent receives the ground-truth files, isolating patch generation.
| # | Model | Resolved Rate | Single File | Multi-Artifact |
|---|-------|---------------|-------------|----------------|

Values are resolution rates on 407 verified tasks. Single-file and multi-artifact columns show category-specific rates.

What is MobileDev-Bench?

Figure: MobileDev-Bench construction pipeline.

MobileDev-Bench instances are mined from issue-linked pull requests in production mobile applications. Each task includes a base commit, issue statement, fix patch, test patch, and reproducible container environment. Validation executes mobile builds and tests to verify whether a generated patch resolves the issue.
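A single task can be pictured as a record like the following. This is an illustrative sketch only: the field names and the example values are hypothetical, not the benchmark's actual schema, which is defined in the repository.

```python
from dataclasses import dataclass

@dataclass
class TaskInstance:
    """One MobileDev-Bench task (illustrative field names)."""
    instance_id: str      # hypothetical id, e.g. "<repo>-<pr-number>"
    repo: str             # production mobile app repository
    framework: str        # "android" | "flutter" | "react-native"
    base_commit: str      # commit the agent starts from
    issue_statement: str  # natural-language issue description
    fix_patch: str        # ground-truth patch (hidden from the agent)
    test_patch: str       # executable tests used for validation
    container_image: str  # reproducible build/test environment

# Hypothetical example instance:
task = TaskInstance(
    instance_id="example-app-1234",
    repo="org/example-app",
    framework="flutter",
    base_commit="abc123",
    issue_statement="App crashes when rotating the settings screen.",
    fix_patch="diff --git ...",
    test_patch="diff --git ...",
    container_image="mobiledev-bench/example-app:abc123",
)
```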

MobileDev-Bench at a Glance

  • 407 verified tasks
  • 19 production repos
  • 3 mobile frameworks
  • 12.9 avg. files edited
  • 334.6 avg. lines edited
  • 41% multi-artifact fixes

How are tasks evaluated?

Execution validation: Generated patches are applied to the base repository and checked against executable test patches inside mobile build environments.

Resolution rate: A task is resolved only when the generated patch passes the validation tests for the issue.
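The two steps above can be sketched as follows. This is a minimal illustration of the apply-then-test loop, assuming a `git apply`-able patch format and a per-framework test command; the actual harness, commands, and container orchestration live in the repository.

```python
import subprocess

def validate(workdir: str, model_patch: str, test_patch: str,
             test_cmd: list[str]) -> bool:
    """Apply the model's patch and the benchmark's test patch at the base
    commit, then run the framework's test command (illustrative only)."""
    for patch in (model_patch, test_patch):
        applied = subprocess.run(
            ["git", "apply", "-"], cwd=workdir, input=patch,
            text=True, capture_output=True,
        )
        if applied.returncode != 0:  # malformed/non-applying patch -> unresolved
            return False
    tests = subprocess.run(test_cmd, cwd=workdir)  # e.g. Gradle or Flutter tests
    return tests.returncode == 0     # resolved iff the validation tests pass

def resolution_rate(results: dict[str, bool]) -> float:
    """Fraction of tasks whose generated patch passed validation tests."""
    return sum(results.values()) / len(results)
```

For example, if one of three tasks passes validation, `resolution_rate({"a": True, "b": False, "c": False})` is 1/3.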

Retrieval metrics: File-level recall, precision, and F1 quantify whether agents identify the files that must be modified.
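File-level retrieval metrics compare the set of files the agent retrieves against the set of files the gold fix actually modifies. A minimal sketch (the function name and example file names are illustrative):

```python
def file_retrieval_metrics(retrieved: set[str],
                           gold: set[str]) -> tuple[float, float, float]:
    """File-level precision, recall, and F1 for one task."""
    hits = len(retrieved & gold)                 # correctly retrieved files
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical example: the agent retrieved 3 files; 2 of them are among
# the 4 files the gold patch edits.
p, r, f1 = file_retrieval_metrics(
    {"MainActivity.kt", "strings.xml", "build.gradle"},
    {"MainActivity.kt", "strings.xml", "layout.xml", "Theme.kt"},
)
# p = 2/3, r = 1/2, f1 = 4/7
```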

Figure: Resolution rates by file count and patch size.
Figure: Retrieval metrics by file count.

How can I evaluate my model?

To run your own agent on MobileDev-Bench and evaluate its performance, please consult the repository README.

Citation

@misc{fakorede2026mobiledevbench,
      title={MobileDev-Bench: A Benchmark for Issue Resolution in Mobile Application Development}, 
      author={Moshood A. Fakorede and Krishna Upadhyay and A. B. Siddique and Umar Farooq},
      year={2026},
      eprint={2603.24946},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2603.24946}, 
}