A glaring issue with getting GPT-4V to generate pixel locations is robustness, but Vimium actually solves this! Instead of getting the model to generate pixel locations, just have it directly return the string to click on, for example the hint label that Vimium overlays on each clickable element. RPA tools do something similar, running OCR on the page to match objects. Humans don’t need the DOM to see and navigate the web, so computers shouldn’t need it either. This could be good for accessibility as well.
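
To make the "return a string, not pixels" idea concrete, here is a minimal sketch of the step that turns a model reply into keystrokes. It makes several assumptions: the model is prompted to answer with JSON of the form `{"click": "<hint>"}`, the set of hint labels currently on screen is known, and `f` is the key that opens Vimium's link hints (its default binding). The function name and the reply format are hypothetical, and how the keys are actually sent to the browser is left out.

```python
import json


def hint_keystrokes(model_reply: str, visible_hints: set[str]) -> list[str]:
    """Convert a model reply such as '{"click": "sd"}' into keys to type.

    Hypothetical helper: the JSON shape is an assumed prompt convention,
    not something Vimium or GPT-4V defines.
    """
    action = json.loads(model_reply)
    hint = action["click"].strip().upper()

    # Reject labels that are not on the page instead of typing blindly.
    if hint not in visible_hints:
        raise ValueError(f"hint {hint!r} is not visible on the page")

    # "f" opens link-hint mode (assumed default binding); the label's
    # letters, typed in lowercase, then select the element.
    return ["f", *hint.lower()]


if __name__ == "__main__":
    print(hint_keystrokes('{"click": "sd"}', {"SD", "A", "JK"}))
```

Because the model only has to pick one of a small set of on-screen labels, a wrong answer can be detected and retried, which is not possible with a free-form pixel coordinate.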