Technical Report

GUI-Perturbed: Breaking Browser-use Models using Domain Randomization

By Fig Team Jun 20, 2026

Yangyue Wang^{1, 2}, Harshvardhan Sikka^{1, 2}, Yash Mathur^*², Tony Zhou^*², Jinu Nyachhyon^*², Pranav Guruprasad^{1, 2}

^* Equal contributions. ¹Fig; ²Manifold Research Group.

0:00

/0:05

TL;DR

We introduce a baseline study of 7B GUI models using GUI-Perturbed to stress test CUA models and understand what agents fail at and why.
A detailed failure mode analysis showcases common GUI failure modes from spatial reasoning, false visual heuristics to CoT reasoning's effect.

Key Sections · How do GUI models fail · The triple alignment problem · Experimental setup · Results · Discussion · Model failure modes

Relevant links · Code · Dataset · Results Viewer · Cite this

GUI Perturbation — Research Series

Part 1 · Previous report

Data Augmentation Pipeline

GUI grounding failures under controlled UI perturbations. Data, tooling, and evaluation protocol.

Part 2 · This report

Dataset Release & Baseline Evaluations

How leading CUA models perform across perturbation types. Structured failure analysis.

Part 3 · Coming soon

Fine-tuning Experiments & Model Checkpoint

Training on perturbation-augmented data. How does fine-tuning on training data generated via perturbation affect model failure modes.

GUI models scoring above 90% on ScreenSpot-v2 fail to find the target element when you set web page zoom to 70% [1]. Same website, same layout, same UI elements. Just smaller.

These models were trained on hundreds of thousands to millions of GUI screenshots through supervised fine-tuning and reinforcement learning stages yet they still cannot adapt to a change in zoom.

Original

Model Result

✓ Correct

Precision Variant (70% zoom)

Model Result

✗ Clicked on fake 'View Deal' ad

Figure 1: Sample: 21 of 390: "Click on 'View Deal' button for flight '#2125, #2126'", Direct Instruction, No Reasoning. Result: UI-TARS1.5-7B clicked on the fake 'view deal' button in the ads after the 70% zoom.

GUI grounding looks solved. On fixed-scene benchmarks like ScreenSpot-v2, 7B models now score above 90%, and it is tempting to read those numbers as evidence that perception is no longer the bottleneck for computer-use agents (CUAs). The numbers are real, but they are measured on screens that never move. Real users zoom, restyle, and resize, and production websites are redesigned constantly. The question the industry should care about is not how well a model does on a frozen screenshot, but how much of that performance survives contact with ordinary variation.

So we ask one question: how much of a GUI grounding model's benchmark accuracy is stable under perturbation, and how much of it is memorized? Because the three models we study share a base checkpoint but differ in post-training, we can ask a sharper version too: does each additional stage of GUI-specialized post-training buy real robustness, or does it only raise the fixed-scene score?

Our goal here is to separate those two things. In Part 1, we introduced GUI-DR, a data augmentation pipeline that varies visual scenes and instructions along controlled axes to stress test CUA model's GUI grounding capability. In this post, we use it to create a dataset composed of visual scene and the instruction variations created along controlled axes. This dataset serves as a benchmark to evaluate three state-of-the-art models that share the same base checkpoint but differ in their post-training recipes, and we report where they break. We find that:

Visual perturbations degrade models that benchmarks call production-ready. A change as small as setting browser zoom to 70% drops accuracy by 2 to 6 points across all three models.
Spatial relational instructions are the weakest point. Asking for "the button above X" instead of "the submit button" costs 27 to 56 points, the largest single effect we see.
Reasoning is not uniformly good. A chain of thought helps on hard relational tasks and hurts on easy direct ones, and a model post-trained for direct coordinate prediction is harmed by it everywhere.
More GUI-specialized post-training does not fix any of this. The same weaknesses persist from the base model through two further stages of GUI training.

The Triple Alignment Problem

Grounding a GUI instruction is harder than it looks, because the model has to align three different things at once and a benchmark score collapses all three into one number.

Visual alignment: identifying an element's appearance in pixel space, its shape, color, size, and boundaries.
Functional alignment: knowing what the element does, telling an input field from a display label or a clickable button from a static icon.
Geometric alignment: resolving spatial relationships between elements, "above," "next to," "the one between X and Y."

Figure 2: Triple alignment in GUI agent perception

Most benchmarks test the three entangled together [2, 3], so when a model fails the score cannot say which alignment broke. GUI-Perturbed is built to stress visual and geometric alignment independently, which lets us attribute a failure to one axis rather than guess. (Our current perturbations do not isolate functional alignment the way AutoGUI's instructions do [4]; we leave that for future work.)

This framing carries through the rest of the post. For every result, we name the alignment axis it implicates.

Problem Formulation

A computer-use agent acts over many steps: it sees a screen, picks an action, the screen changes, and it sees the next one. That full setting is a partially observable Markov decision process (POMDP) [9-11], and recent agentic models are trained against it [12, 13]. We do not need the full machinery here, because our evaluation isolates a single step.

The POMDP tuple defines the CUA problem structure: hidden app states, observations (screenshots + goal), actions, transition dynamics, and reward [12].

\[ \mathcal{M} \;=\; \langle\, \mathcal{S},\; \mathcal{A},\; \mathcal{O},\; \underbrace{\mathcal{T}(s_{t+1}\mid s_t, a_t)}_{\text{transition}},\; \mathcal{R} \,\rangle \]

At each step, a CUA model receives an instruction I and an observation OO (the screenshot), optionally produces a chain of thought T, and outputs an action A. The full step can be written as:

\[ O_t \;\sim\; \mathcal{Z}(O_t \mid s_t), \qquad (t_t,\, a_t) \;=\; \mathrm{VLM}_\theta\!\left( I,\; O_{1:t},\; t_{1:t-1},\; a_{1:t-1} \right), \quad I \in \mathcal{I} \]

Triple alignment is the key representational challenge: to select a correct action at at , the model must be able to interpret Ot along three axes visual appearance, geometric relation, functional affordance of the GUI from screenshots simultaneously.

\[ a_t \;=\; \pi_\theta\!\left(I,\; O_{1:t},\; t_{1:t-1},\; a_{1:t-1} \right), \qquad O_t \;\supseteq\; \left(\, \underbrace{O_t^{\mathrm{vis}}}_{\substack{\text{visual}\\\text{appearance}}},\; \underbrace{O_t^{\mathrm{geo}}}_{\substack{\text{geometric}\\\text{layout}}},\; \underbrace{O_t^{\mathrm{func}}}_{\substack{\text{functional}\\\text{affordance}}} \,\right) \]

Our evaluation isolates the grounding step: given (I,O), predict the correct element. This removes multi-step dependencies and focuses the evaluation on the single-step alignment problem described above. We do evaluate models in both reasoning and no-reasoning configurations, which introduces a planning-like component through the thought trace T, and we report on its effects below.

Experimental Setup

Three Models, One Base Checkpoint

① Pretrain

Qwen2.5VL-7B

Vision-Language Model

4.1T token pretraining

Builds on

② Specialized Fine-Tune

UI-TARS1.5-7B

GUI-Specialized Trajectories

~50B GUI-focused tokens

Builds on

③ GUI GRPO Training

GTA1-7B

Grounding Agent + o3 Planner

Test-time scaling

Figure 3: Model lineage diagram

We study three 7B models built on the same base weights, so that any difference in robustness is attributable to post-training rather than to scale or architecture:

Qwen2.5-VL-7B, the base vision-language model (VLM). It saw some GUI trajectories during long-context pre-training but had no dedicated GUI fine-tuning [5].
UI-TARS-1.5-7B, initialized from Qwen2.5-VL and trained further on end-to-end GUI trajectories [8]. The exact recipe is not public; the UI-TARS papers [6] point to supervised fine-tuning (SFT) and reinforcement learning.
GTA1-7B, initialized from UI-TARS-1.5 and trained further on GUI data with step-level Group Relative Policy Optimization (GRPO) [7].

	Qwen2.5VL-7B	UI-TARS1.5-7B	GTA1-7B
Architecture
Base model	Qwen2.5VL-7B	Qwen2.5VL-7B	UI-TARS1.5-7B (grounding) + o3 (planner)
Training
Stages	Visual Pre-Training (1.5T tokens) Multimodal Pre-Training (2T tokens) Long-Context Pre-Training (0.6T tokens) + SFT + DPO	Continual Pre-training (GUI knowledge) Annealing Phase (SFT) DPO Phase Based on UI-TARS recipe, UI-TARS1.5 training recipe not public [7]	RL Optimization: GRPO (Group Relative Policy Optimization) Click reward mechanism
Training Data
Volume	4.1T tokens	$\sim$50B tokens	—
Sources	Interleaved image-text VQA Image captions & OCR Visual knowledge Video grounding Document parsing Agent interaction data	18.4M grounding elements (web, mobile, desktop) 6M GUI tutorials 151.4k action traces Reflective online traces	Aria-UI [15] OmniAct [16] Widget Caption [17] UI-Vision [18] OS-Atlas [19] (lightly cleaned)

Table 1: Model architecture & training comparison

We chose this lineage deliberately, the models share the same architecture and same base weights with different training recipes. Any performance variation between the three is directly attributable to post-training, not architecture, which allows us to precisely ask: how does each additional stage of GUI-specialized training affect robustness under perturbation?

Stress Test Matrix

We evaluate three 7-B models from the same lineage on eight task variants from GUI-Perturbed, enabling comparisons along three axes visual variants, instruction types, and reasoning mode:

Axis	Levels	Cardinality
Visual Perturbation	$\{\text{Original},\ \text{Style},\ \text{Precision},\ \text{Text Shrink}\}$	4
Instruction Type	$\{\text{Directional},\ \text{Relational}\}$	2
Reasoning Mode	$\{\text{With CoT},\ \text{Without CoT}\}$	2
Total Configurations per Model		$4 \times 2 \times 2 = 16$

Table 2: Stress-testing configuration space using GUI-Perturbed

Evaluation Metrics

We focus our analysis primarily on hit rate, flip rate, and net delta, which measure the models' overall performance and sensitivity under perturbation:

Hit rate measures whether the predicted coordinate falls within the bounding box of the target element. We report 95% bootstrap confidence intervals (10,000 resamples) for all hit rates; exact binomial (Clopper--Pearson) intervals agreed within 0.2 pp throughout and are omitted for brevity. This is our primary metric: a grounding prediction either lands on the right element or it does not.
Flip rate measures the fraction of matched pairs whose binary outcome (hit/miss) changed between the original and the perturbed condition. Each perturbation test uses n = 390 matched sample pairs (same task and step evaluated on both the original and perturbed screenshots). A high flip rate indicates the model's output is sensitive to the perturbation, regardless of whether accuracy improves or degrades on average.
Net Delta measures the difference in hit rate between the original and perturbed conditions (original minus perturbed), with 95% bootstrap CI. A positive Delta indicates overall degradation.

We also test significance with McNemar's test, which compares the number of samples that degraded (b: correct to incorrect) against those that improved (c: incorrect to correct) under perturbation, ignoring samples whose outcome did not change. We report p-values with continuity correction; when the number of discordant pairs (b + c) is below 25, we use the exact binomial test instead.

Additionally, MSE, normalized MSE, and normalized distance provide additional signal on error magnitude but did not reveal trends beyond what the primary metrics capture.

Results

Visual Perturbations Break Models that Benchmarks Call Robust

All three models degrade under visual perturbations, including a change as small as browser zoom. Models scoring above 90% on fixed-scene benchmarks drop on perturbed versions of the same pages.

			Flip Rate		Net $\Delta$ (%)
Model	Pert.	Base Acc.	Dir.	Rel.	Dir.	Rel.	$\boldsymbol{b}$ / $\boldsymbol{c}$	Sig.
GTA-1	Precision	79.3	10.3%	21.5%	+3.6**	+7.9***	169/79	3/4
	Style		9.7%	21.5%	+1.3	−0.3	126/118	0/4
	Text Shrink		4.2%	16.7%	+0.4	+1.8	90/73	0/4
Qwen2.5-VL	Precision	66.0	13.1%	16.4%	+3.3	+4.9*	147/83	2/4
	Style		8.7%	19.1%	+2.8**	+0.1	120/97	1/4
	Text Shrink		7.8%	14.4%	+0.1	+2.8	98/75	0/4
UI-TARS-1.5	Precision	63.0	13.1%	18.7%	+6.2***	+5.4**	169/79	4/4
	Style		11.2%	19.2%	+2.4	+1.0	132/105	0/4
	Text Shrink		6.9%	14.1%	−0.8	+0.0	79/85	0/4

Table 3: Perturbation robustness of baseline models (n = 390 matched sample pairs per test).

The zoom and text-size perturbations are worth emphasizing as they are not exotic transformations. Any user adjusting their browser zoom or system font size produces exactly this kind of variation. The fact that models trained on millions of GUI screenshots cannot handle a zoom change suggests they are memorizing absolute spatial positions rather than understanding relational structure.

Figure 4: A cross model comparison against hit accuracy. All three models degrade 2-6% on precision variant (70% zoom). UI-TARS1.5 shows the largest drop.

In our experiments, all 3 models experience 2-6% degradation on precision variant (70% zoom) compared to the original variant with direct instructions as seen in figure 4. UI-TARS1.5-7B shows the most degradation with precision variant.

Specifically, precision perturbation (70% zoom) produced statistically significant accuracy drops in 9 of 12 paired comparisons (McNemar's p < 0.05) as seen in table 3. Although, style perturbation reached significance in only 1 of 12, and text-shrink in 0 of 12, the lack of significant net degradation does not mean those perturbations left predictions unchanged. Style perturbation flipped 14.9% of all predictions (698 of 4,680), nearly matching precision's 15.5% flip rate (726 of 4,680). Style flips were roughly symmetric between degraded and improved, so the net effect washed out. This points to a useful distinction between robustness (whether net accuracy holds) and consistency (whether individual predictions stay stable). All three visual perturbation types destabilize predictions substantially; only precision does so in a systematically harmful direction.

Some combinations of perturbations unexpectedly improved accuracy on individual configurations (e.g., style on relational+CoT for GTA-1: 63.1% to 65.1%, +2.1 pp), though none reached significance (all p > 0.4). Similar non-significant improvements appeared for text-shrink on UI-TARS-1.5 direct queries (92.8% to 93.8%, p = 0.45). Our hypothesis is that some perturbations inadvertently increase whitespace between elements or enlarge text in ways that make grounding easier. This is a useful diagnostic signal in itself: it tells us which visual properties these models are most sensitive to.

Spatial Reasoning is the Weakest Link

The sharpest performance drops come from relational instruction variants. When the instruction asks the model to identify an element by its spatial relationship to a neighbor (“click the button above X”), performance falls well below what the same models achieve on direct instructions (“click the submit button”).

Benchmark	Qwen2.5VL-7B	UI-TARS1.5-7B	GTA1-7B
ScreenSpot-v2	88.8	89.7	92.4
ScreenSpot-Pro	27.6	42.0	50.1
OSWorld	—	$27.4 \pm 2.2\%$	45.2 (with o3)
OSWorld-G	27.7	64.2	67.7
GP-Unperturbed (Direct)	86.9	91.0	92.8
GP-Unperturbed (Relational)	45.0 (↓$-41.9$)	35.0 (↓$-56.0$)	65.8 (↓$-27.1$)

Table 4: Comparing our direct and relational query approaches on unperturbed data (GP-Unperturbed, Direct) with the queries of other datasets; Bold = best score per benchmark.

This gap between direct and relational performance is consistent across all three models and both reasoning modes ranging from 27.1% to 56.0% as seen in table 4. It is the single largest effect we observe in the evaluation, larger than any visual perturbation effect.

We also examined whether models exhibit systematic directional biases in their spatial errors. Our directional hit rate analysis shows that models perform unevenly across spatial directions, with consistently higher accuracy on instructions involving the direction “right” compared to other directions.

Figure 5: Hit accuracy by model for relational query directions. Models show consistently higher accuracy on 'right' instructions. Directional bias may reflect training distribution or patchification artifacts.

We note that our directional analysis relies partly on manual labeling to correct reference landmark coordinates, which limits the scale at which we can draw conclusions. The directional asymmetry we observe is suggestive of biases introduced during training or by the visual patchification process, but confirming this would require a larger controlled study.

Reasoning Helps on Hard Tasks, Hurts on Easy Ones

The effect of reasoning mode is not uniformly positive. Enabling reasoning (the thought trace TT in our formulation) produces different outcomes depending on task difficulty.

On simple direct grounding tasks, reasoning introduces unnecessary deliberation that can actively mislead the final prediction. The model “overthinks” a task that the base visual grounding would handle correctly without intermediate reasoning.

On more complex relational tasks, reasoning recovers some performance by providing useful intermediate structure: the model can reason about spatial relationships step by step rather than attempting to resolve them in a single forward pass.

Figure 6: Instruction: "*Click on 'Notifications' div*". Model output: "Thought: I noticed that there is a "Notifications" option in the left sidebar, which is exactly what I need to click on. This option is located just below "Privacy and data" and above "Security and logins." By clicking on it, I can access the notification management page. Action: (None, 'Notifications')"

GTA1 provides a particularly instructive case. It was further trained to predict coordinates directly, and it suffers degradation from reasoning on both simple and harder relational tasks. Its post-training has optimized it for direct coordinate prediction, and the reasoning trace interferes with that pipeline regardless of task complexity.

The implication is that blanket “enable reasoning” is not the right strategy. Models need exposure to diverse reasoning styles during post-training, and they need to calibrate when to reason and how much. The optimal reasoning style and length likely varies by task.

Failure Mode Taxonomy

We conducted a qualitative analysis of representative failures across all models and configurations. Several recurring patterns emerge.

Failure Mode	Definition
spatialClick Region Error	The model selects the correct UI element conceptually but clicks the wrong physical area of it.
spatialLocation Hallucination	The model correctly identifies what to click but fabricates or misplaces its on-screen coordinates.
reasoningReasoning Drift	Adding explicit reasoning causes the model's grounded action prediction to deteriorate.
semanticGoal Hallucination	The model invents user intentions or interface functionality not specified in the instruction.
semanticInstruction Misinterpretation	The model misunderstands what the instruction refers to and selects a related but incorrect element.
visualVisual Confusion	The model relies on superficial visual cues (shape, color, position) and misidentifies the functional element.
spatialSpatial Reasoning Error	The model incorrectly interprets relative spatial relationships such as left, right, above, or below.
semanticText Matching Bias	The model interacts with visible text matching the instruction without properly grounding it to the correct UI element.

Table 5: Failure mode definitions

The full set of annotated failure examples is available in the appendix section 2.

Discussion

Models Lack Spatial Relational Understanding

Figure 7: **Sample 216 of 390**: “Click on the button above ‘FOURMIDABLE’ img”; UI-TARS1.5-7B prediction: "Thought: I noticed that there is a green heart icon located above the "FOURMIDABLE" logo. According to the task requirements, I need to click on this heart icon. It's positioned right next to the logo, and clicking on it should allow me to complete the task. Action: click(start_box='(1899,138)')"; UI-TARS1.5 confuses what is on the right with what is above.

The sharpest performance drops in our evaluation come from relational instructions. Models can find an element by name but cannot resolve “the field above X.” This is not a data quantity problem. These models have been trained on millions of screenshots. The issue is representational: current architectures do not build structured spatial models of GUI layouts. They encode visual features at the patch level without maintaining an explicit spatial graph of element relationships.

This maps directly to the geometric alignment axis from our framing. Visual and functional alignment may be adequate for direct grounding tasks, but geometric alignment, the ability to reason about spatial relationships between elements, is where current models fall short.

Visual Heuristics are Static and Fragile

Models learn fixed visual associations (white rectangle at the top of the screen equals search bar) that break on any layout or style change. In the zoom perturbation results, we see models clicking on advertisement elements that happen to occupy the spatial position where the target element used to be at the original zoom level. The model has memorized a position, not learned a function.

In production, websites update their designs regularly. A model relying on static visual heuristics is one deployment away from failure. This is a visual alignment problem: the model’s visual representations are too tightly coupled to the specific pixel-level appearances in the training distribution.

Figure 8: A new UI version change could render many GUI agents useless on the same website [14]

Reasoning is a Double-Edged Sword for Grounding

Pulling the reasoning results together: deliberation helps on hard relational tasks and hurts on easy direct ones, and GTA-1 sharpens the point. A model post-trained for direct coordinate prediction is harmed by reasoning in every condition, because the trace disrupts the pipeline its training optimized for.

The implication for post-training is that reasoning should be treated as a learnable, task-conditioned skill rather than a switch to flip on or off. Models need exposure to varied reasoning styles and lengths during training so they can learn when deliberation helps and when it does not.

Scope and Limitations

Model coverage. We evaluate three models from one base checkpoint lineage and while this approach isolates the effect of post-training recipes, it may not generalize to models with different base architectures or scales. Broader coverage is future work.

Perturbation coverage. We evaluate on eight variants from GUI-Perturbed. More perturbation types and combinations are possible, and interactions between perturbation types (ex: zoom combined with relational instructions) remain unexplored.

What’s Next

This report used GUI-Perturbed as a benchmark to uncover systematic weaknesses in state-of-the-art GUI grounding models that persist across all three models despite increasingly specialized post-training.

Part 3 will explore the natural next question: whether we can fix these weaknesses with better training data. We use GUI-Perturbed for data augmentation and measure whether targeted training closes the gaps we identified here.

At Fig, we're building the control layer for AI: systems that perceive an environment, act in it reliably, and improve from the experience. Frontier models have largely solved perception. Agency is the open problem — acting dependably in real environments, recovering from mistakes, and compounding over time — and it will not come from scaling language models alone. Computer use is where we begin, and grounding is where the gap first surfaces: a model that can read a screen but cannot reliably act on it is not yet in control. GUI-Perturbed measures that gap precisely. Closing it, across software today and every environment over time, is the work ahead.

Subscribe to stay updated as we make progress! If you'd like to get involved, reach out to contact@fig.inc

Citation

Please cite this work as:

@online{measuring_gui_models_robustness_technical_report_2026,
  title   = {GUI-Perturbed: Breaking Browser-use Models using Domain Randomization},
  author  = {Yangyue Wang and Harshvardhan Sikka and Yash Mathur and Tony Zhou and Jinu Nyachhyon and Pranav Guruprasad},
  year    = {2026},
  url     = {www.fig.inc/blog/gui-pertubed-breaking-browser-use-models/},
  note    = {Part 2: Baseline evaluation}
}

References

[1] K. Cheng et al., "SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents," Feb. 23, 2024, arXiv: arXiv:2401.10935. doi: 10.48550/arXiv.2401.10935.

[2] T. Xie et al., "OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments," May 30, 2024, arXiv: arXiv:2404.07972. doi: 10.48550/arXiv.2404.07972.

[3] K. Li et al., "ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use," Apr. 04, 2025, arXiv: arXiv:2504.07981. doi: 10.48550/arXiv.2504.07981.

[4] H. Li, J. Chen, J. Su, Y. Chen, Q. Li, and Z. Zhang, "AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs," Jun. 07, 2025, arXiv: arXiv:2502.01977. doi: 10.48550/arXiv.2502.01977.

[5] S. Bai et al., "Qwen2.5-VL Technical Report," Feb. 19, 2025, arXiv: arXiv:2502.13923. doi: 10.48550/arXiv.2502.13923.

[6] Y. Qin et al., "UI-TARS: Pioneering Automated GUI Interaction with Native Agents," Jan. 21, 2025, arXiv: arXiv:2501.12326. doi: 10.48550/arXiv.2501.12326.

[7] Y. Yang et al., "GTA1: GUI Test-time Scaling Agent," Oct. 03, 2025, arXiv: arXiv:2507.05791. doi: 10.48550/arXiv.2507.05791.

[8] U.-T. Team, "UI-TARS - Next-generation native GUI agent model," UI-TARS. Accessed: Mar. 04, 2026.

[9] M. Lu et al., "Scaling Agentic Reinforcement Learning for Tool-Integrated Reasoning in VLMs," Nov. 24, 2025, arXiv: arXiv:2511.19773. doi: 10.48550/arXiv.2511.19773.

[10] L. Zhao et al., "Seeing is Believing: Belief-Space Planning with Foundation Models as Uncertainty Estimators," Apr. 04, 2025, arXiv: arXiv:2504.03245. doi: 10.48550/arXiv.2504.03245.

[11] H. Li et al., "VLA-RFT: Vision-Language-Action Reinforcement Fine-tuning with Verified Rewards in World Simulators," Oct. 01, 2025, arXiv: arXiv:2510.00406. doi: 10.48550/arXiv.2510.00406.

[12] J. Wu et al., "OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks," Feb. 02, 2026, arXiv: arXiv:2601.20650. doi: 10.48550/arXiv.2601.20650.

[13] K. Team et al., "Kimi K2.5: Visual Agentic Intelligence," Feb. 02, 2026, arXiv: arXiv:2602.02276. doi: 10.48550/arXiv.2602.02276.

[14] B. Oliveira and C. Teixeira Lopes, "The Evolution of Web Search User Interfaces - An Archaeological Analysis of Google Search Engine Result Pages," in Proceedings of the 2023 Conference on Human Information Interaction and Retrieval, Austin TX USA: ACM, Mar. 2023, pp. 55–68. doi: 10.1145/3576840.3578320.

[15] Y. Yang et al., "Aria-UI: Visual Grounding for GUI Instructions," 2024, arXiv: arXiv:2412.16256.

[16] R. Kapoor et al., "OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web," 2024, arXiv: arXiv:2402.17553.

[17] Y. Li et al., "Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements," in Proc. EMNLP, 2020, pp. 5495–5510.

[18] S. Nayak et al., "UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction," 2025, arXiv: arXiv:2503.15661.

[19] Z. Wu et al., "OS-ATLAS: A Foundation Action Model for Generalist GUI Agents," 2024, arXiv: arXiv:2410.23218. doi: 10.48550/arXiv.2410.23218.

Appendix

1. Model Performance with MSE, NMSE, and D_norm (Normalized Distance)

Figure 9: Cross-model bounding box center MSE across perturbation conditions.

Figure 10: Cross-model normalized MSE (NMSE) across perturbation conditions.

Figure 11: Cross-model Euclidean distance from predicted point to bounding box center, normalised by box diagonal. Values above 1.0 indicate predictions outside the target box.

2. Failure Mode Qualitative Examples

Click Region Error

Instruction

Click on 'Done' buttonAction: click(start_box='(639,438)')

Model misidentifies clicking the area next to 'Done' as equivalent to clicking the 'Done' text itself.

Location Hallucination