Touchpoint

Give your AI agent eyes and hands on any desktop.

pip install touchpoint-py

AI agent researches data in Chrome, then creates a formatted Excel table — full task completed in ~12 minutes

Touchpoint is a cross-platform Python library for reading and interacting with desktop UI through native accessibility APIs. One import, one API — works on Linux, macOS, and Windows, with built-in support for Chromium and Electron apps via CDP (Chrome DevTools Protocol).

Instead of scraping pixels or running vision models, Touchpoint reads the real accessibility tree: structured names, roles, states, and positions for every element on screen. That makes it fast and reliable, with no vision model required. It ships with an MCP server so LLM agents such as Claude, Cursor, or any local model can control desktop apps out of the box.

import touchpoint as tp

elements = tp.find("Send", role=tp.Role.BUTTON, app="Slack")
tp.click(elements[0])

Why Touchpoint?

	Screenshot / vision	Browser automation	Touchpoint
Native desktop apps	⚠️ inaccurate or slow	❌	✅
Browsers	⚠️ inaccurate or slow	✅	✅ via CDP
Electron apps (Slack, VS Code, ...)	⚠️ inaccurate or slow	⚠️ web content only	✅ native + web
Structured element data	❌ needs OCR/vision model	✅ web only	✅ names, roles, states, positions
Works with local / non-vision models	❌	✅ web only	✅ all apps
Works across Linux, macOS, Windows	✅	✅	✅

Table of Contents
Install
- Platform requirements
Quick Start
- Element IDs
- Output formats
MCP Server
Browser & Electron Apps (CDP)
- Setup
API Reference
Architecture
Configuration
Development
Status
- Known limitations
License

Install

Requires Python 3.10+.

pip install touchpoint-py

Everything is included: your platform's native backend, CDP support for browsers and Electron apps, the MCP server, and screenshot capabilities. Platform-specific dependencies are installed automatically via pip environment markers.

Platform requirements

Platform	Backend	Requirement
Linux	AT-SPI2	Install `xdotool` (required for input + `minimize_window`) and `wmctrl` (required for all window management — used for AT-SPI → X11 id mapping). Most desktops include `python3-gi` and `gir1.2-atspi-2.0` — install them if missing.
Windows	UI Automation	None — uses built-in COM APIs
macOS	Accessibility (AX)	Grant permission: System Settings → Privacy & Security → Accessibility

Quick Start

import touchpoint as tp

# Discover
apps = tp.apps()                            # ["Firefox", "Slack", "Terminal", ...]
windows = tp.windows()                      # Window objects with title, position, size
all_els = tp.elements(app="Firefox", named_only=True)  # only elements with text labels

# Find
results = tp.find("Search", role=tp.Role.TEXT_FIELD, app="Firefox")

# Act
tp.set_value(results[0], "touchpoint python", replace=True)
tp.press_key("enter")
tp.hotkey("ctrl", "s")                      # keyboard shortcuts

# Wait for UI changes
tp.wait_for("results", app="Firefox", timeout=10)

# Screenshot
img = tp.screenshot()                       # full desktop → PIL.Image
img = tp.screenshot(app="Firefox")           # cropped to app window

Element IDs

Every element has a unique ID like atspi:1234:1:2.0 or cdp:9222:TID:4. Action functions accept either an Element object or a bare ID string — useful for storing references across steps:

results = tp.find("Send", max_results=1)
element_id = results[0].id                  # "atspi:1234:1:5.2"

# later...
tp.click(element_id)                        # works with just the string
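
The either-or argument convention is easy to mirror in your own helpers. The following is a minimal sketch, assuming only that an element exposes an `id` attribute as shown above. `FakeElement`, `to_element_id`, and `backend_prefix` are hypothetical names for illustration and are not part of Touchpoint; the meaning of the ID fields after the first colon is not documented here, so the sketch does not interpret them.

```python
from dataclasses import dataclass


@dataclass
class FakeElement:
    """Stand-in for a Touchpoint Element; only the `id` attribute matters here."""
    id: str


def to_element_id(ref) -> str:
    """Return the ID string for either an element-like object or a bare ID."""
    if isinstance(ref, str):
        return ref
    return ref.id


def backend_prefix(element_id: str) -> str:
    """Return the text before the first colon, e.g. "atspi" or "cdp"."""
    return element_id.split(":", 1)[0]
```

Storing the plain string rather than the object keeps agent state serializable between steps.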

Output formats

Control how results are returned:

tp.elements(app="Slack", format="flat")     # one compact line per element (best for LLMs)
tp.elements(app="Slack", format="tree")     # indented parent/child hierarchy
tp.elements(app="Slack", format="json")     # full JSON with all fields

MCP Server

Touchpoint ships an MCP (Model Context Protocol) server ready for any MCP-compatible client. Use it to let LLM agents like Claude, Cursor, local models, or any tool that supports MCP control your desktop.

Two modes — vision and no-vision

Set TOUCHPOINT_MODE=no-vision (default: vision) to switch modes:

Vision mode — agents use screenshot() to see the screen and interact by element ID or coordinates. Best for frontier models with strong vision capabilities.
No-vision mode — agents use snapshot() to get a compact structured text tree of the active window, then act on element IDs directly. Works with any model including local ones that have no vision capability. Most action tools append auto-verify flags ((new window: ...), (focus moved), (no change detected)) so the agent can detect state changes without taking a screenshot.

Tools

Category	Vision mode	No-vision mode
Orient	`screenshot`, `snapshot`, `apps`, `windows`	`snapshot`, `diff_snapshot`, `apps`, `windows`
Find	`find`, `get_element`	`find`
Read	`read_text`	`read_text`
Actions	`click` (element or coordinates), `set_value`, `set_numeric_value`, `select_text`, `focus`, `action`	`click` (element only), `set_value`, `set_numeric_value`, `select_text`, `focus`, `action`
Keyboard	`type_text`, `press_key`	`type_text`, `press_key`
Mouse	`mouse_move`, `scroll`	`scroll`
Window	`activate_window`, `minimize_window`, `fullscreen_window`, `close_window`, `move_window`, `resize_window`	`activate_window`, `minimize_window`, `fullscreen_window`, `close_window`
Waiting	`wait_for`, `wait_for_app`, `wait_for_window`	`wait_for`, `wait_for_app`, `wait_for_window`
Health	`diagnostics`	`diagnostics`

The MCP server includes built-in instructions that teach agents the correct workflow for each mode — including the orient → act → verify loop, when to use read_text vs find, and how to recover from errors.

         ┌──────────┐
    ┌───▶│  ORIENT  │  screenshot · apps · windows
    │    └────┬─────┘
    │         ▼
    │    ┌──────────┐
    │    │  LOCATE  │  find · snapshot · get_element
    │    └────┬─────┘
    │         ▼
    │    ┌──────────┐
    │    │   ACT    │  click · set_value · type_text · press_key
    │    └────┬─────┘
    │         ▼
    │    ┌──────────┐
    │    │  VERIFY  │───▶ Done ✅
    │    └────┬─────┘
    │         │ not yet
    └─────────┘
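
The loop in the diagram can be sketched in plain Python. This is an illustration of the control flow only, not the instructions the MCP server actually ships. `run_step` and its callable parameters are hypothetical, and the locate step is folded into `act` for brevity.

```python
from typing import Callable


def run_step(
    orient: Callable[[], str],
    act: Callable[[str], None],
    verify: Callable[[str], bool],
    max_attempts: int = 3,
) -> bool:
    """Repeat orient -> act -> verify until verify passes or attempts run out."""
    for _ in range(max_attempts):
        before = orient()   # e.g. a snapshot or a screenshot summary
        act(before)         # e.g. find the target, then click / set_value / press_key
        after = orient()    # look again before deciding
        if verify(after):
            return True
    return False
```

Bounding the number of attempts keeps an agent from looping forever when the UI never reaches the expected state.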

Client setup

Claude Desktop

Config file location:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "touchpoint": {
      "command": "touchpoint-mcp"
    }
  }
}

If using a virtualenv, use the full path: "/path/to/venv/bin/touchpoint-mcp"

VS Code / GitHub Copilot

Add to .vscode/mcp.json in your workspace:

{
  "servers": {
    "touchpoint": {
      "command": "touchpoint-mcp"
    }
  }
}

Cursor

Create or edit ~/.cursor/mcp.json:

{
  "mcpServers": {
    "touchpoint": {
      "command": "touchpoint-mcp"
    }
  }
}

Windsurf

Edit ~/.codeium/windsurf/mcp_config.json:

{
  "mcpServers": {
    "touchpoint": {
      "command": "touchpoint-mcp"
    }
  }
}

Claude Code (CLI)

claude mcp add touchpoint -- touchpoint-mcp

OpenClaw

Add to mcpServers in ~/.openclaw/openclaw.json:

{
  "mcpServers": {
    "touchpoint": {
      "command": "touchpoint-mcp"
    }
  }
}

Environment variables

All settings are optional.

Variable	Example	Description
`TOUCHPOINT_CDP_DISCOVER`	`true`	Auto-discover CDP ports from running processes
`TOUCHPOINT_CDP_PORTS`	`{"Chrome": 9222}`	Explicit app-to-port mapping (JSON)
`TOUCHPOINT_CDP_APP`	`Google Chrome`	Single app name (pair with `_PORT`)
`TOUCHPOINT_CDP_PORT`	`9222`	Single port (pair with `_APP`)
`TOUCHPOINT_CDP_REFRESH_INTERVAL`	`5.0`	Seconds between CDP port scans
`TOUCHPOINT_SCALE_FACTOR`	`1.25`	Display scale override
`TOUCHPOINT_FUZZY_THRESHOLD`	`0.6`	Minimum match score for find() (0.0–1.0)
`TOUCHPOINT_FALLBACK_INPUT`	`true`	Use coordinate fallback when native actions fail
`TOUCHPOINT_MAX_ELEMENTS`	`5000`	Maximum elements per query
`TOUCHPOINT_MAX_DEPTH`	`20`	Default tree depth limit
`TOUCHPOINT_AX_MESSAGING_TIMEOUT`	`1.0`	Max seconds to wait for a macOS AX app reply
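
To make the value formats concrete, here is a rough sketch of how these variables could be parsed into the keyword arguments that `tp.configure()` takes. It is not Touchpoint's actual loader: `settings_from_env`, the set of strings treated as true, and the rule that `TOUCHPOINT_CDP_PORTS` wins over the single app/port pair are all assumptions.

```python
import json
import os
from typing import Any, Mapping


def _as_bool(raw: str) -> bool:
    # Assumption: these spellings count as true; everything else is false.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# Variable suffix -> (configure() keyword, parser)
_PARSERS = {
    "CDP_DISCOVER": ("cdp_discover", _as_bool),
    "CDP_PORTS": ("cdp_ports", json.loads),
    "CDP_REFRESH_INTERVAL": ("cdp_refresh_interval", float),
    "SCALE_FACTOR": ("scale_factor", float),
    "FUZZY_THRESHOLD": ("fuzzy_threshold", float),
    "FALLBACK_INPUT": ("fallback_input", _as_bool),
    "MAX_ELEMENTS": ("max_elements", int),
    "MAX_DEPTH": ("max_depth", int),
    "AX_MESSAGING_TIMEOUT": ("ax_messaging_timeout", float),
}


def settings_from_env(env: Mapping[str, str] = os.environ) -> dict[str, Any]:
    """Collect TOUCHPOINT_* variables into a dict of typed settings."""
    settings: dict[str, Any] = {}
    for suffix, (key, parse) in _PARSERS.items():
        raw = env.get(f"TOUCHPOINT_{suffix}")
        if raw is not None:
            settings[key] = parse(raw)
    # The single-app pair only makes sense when both halves are present.
    app, port = env.get("TOUCHPOINT_CDP_APP"), env.get("TOUCHPOINT_CDP_PORT")
    if app and port and "cdp_ports" not in settings:
        settings["cdp_ports"] = {app: int(port)}
    return settings
```

Note that `TOUCHPOINT_CDP_PORTS` must be valid JSON, so the value needs double quotes around the app name.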

Browser & Electron Apps (CDP)

Native accessibility APIs return limited data for Electron and Chromium apps (Slack, Discord, VS Code, etc.). Touchpoint's CDP backend connects via Chrome DevTools Protocol to get the full web content.

Auto-discovery is enabled by default — Touchpoint automatically finds running browsers and Electron apps that were launched with a debug port. No manual configuration needed beyond launching the app with the flag.

Setup

Launch the app with a debug port:

# Linux
google-chrome --remote-debugging-port=9222 --user-data-dir=/tmp/tp-chrome

# macOS
open -na "Google Chrome" --args --remote-debugging-port=9222 --user-data-dir=/tmp/tp-chrome

# Windows
start chrome --remote-debugging-port=9222 --user-data-dir=%TEMP%\tp-chrome
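
Before pointing Touchpoint at a port, it can help to confirm that the browser is really listening. Chromium-based browsers started with `--remote-debugging-port` serve a small JSON document at `/json/version`. The sketch below probes that endpoint with the standard library only; `cdp_version` is a hypothetical helper and is independent of Touchpoint's own discovery.

```python
import json
import urllib.error
import urllib.request


def cdp_version(port: int, host: str = "127.0.0.1", timeout: float = 2.0):
    """Return the /json/version payload as a dict, or None if nothing answers."""
    url = f"http://{host}:{port}/json/version"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply all mean "not CDP".
        return None


if __name__ == "__main__":
    info = cdp_version(9222)
    print("CDP reachable" if info else "nothing listening on 9222")
```

A `None` result usually means the app was started without the flag, or an already-running instance swallowed the new launch, which is why the commands above use a separate `--user-data-dir`.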

Configure Touchpoint (optional, since auto-discovery is enabled by default):

import touchpoint as tp

tp.configure(cdp_discover=True)             # auto-discover from running processes
# or
tp.configure(cdp_ports={"Google Chrome": 9222})  # explicit mapping

Control what you get with the source parameter:

tp.elements(app="Google Chrome", source="full")     # native chrome + web content (default)
tp.elements(app="Google Chrome", source="cdp_ax")   # web content only (CDP accessibility tree)
tp.elements(app="Google Chrome", source="native")   # native UI only (toolbar, tabs, menus)
tp.elements(app="Google Chrome", source="dom")      # DOM walker (catches what AX misses)

CDP results are merged with native backend results — you get the toolbar and window controls from AT-SPI2/UIA/AX, combined with the full web page content from CDP, in a single elements() call.

source="ax" remains accepted as a compatibility alias for source="cdp_ax". Prefer cdp_ax in new code so it is not confused with the native macOS AX backend.

API Reference

Discovery

Function	Description
`tp.apps()`	List application names in the accessibility tree
`tp.windows()`	All windows with id, title, app, position, size, active state
`tp.elements(app, role, states, ...)`	UI elements, with filtering, tree mode, and formatting
`tp.element_at(x, y)`	Deepest element at screen coordinates
`tp.get_element(id)`	Fresh snapshot of a single element by ID

Search & Wait

Function	Description
`tp.find(query, app, role, ...)`	Search by name — 4-stage matching: exact → contains → word → fuzzy
`tp.wait_for(query, ...)`	Poll until elements appear (or disappear with `gone=True`)
`tp.wait_for_app(app, ...)`	Poll until an app appears or disappears
`tp.wait_for_window(title, ...)`	Poll until a window appears or disappears

Actions

Function	Description
`tp.click(element)`	Click via accessibility action, with coordinate fallback
`tp.double_click(element)`	Double-click
`tp.right_click(element)`	Right-click / context menu
`tp.set_value(element, text)`	Set text content (`replace=True` to clear first)
`tp.set_numeric_value(element, n)`	Set slider or spinbox value
`tp.select_text(element, text)`	Select a substring within text content across Linux, Windows, macOS, and web/CDP
`tp.select_text_range(element, start, end)`	Select a character range when you already know the offsets
`tp.focus(element)`	Move keyboard focus
`tp.action(element, name)`	Execute a raw accessibility action by name
`tp.activate_window(window)`	Bring a window to the foreground (restores from minimized)
`tp.minimize_window(window)`	Minimize a window. Use `activate_window` to restore.
`tp.fullscreen_window(window, fullscreen=True)`	Enter or exit fullscreen for a window
`tp.close_window(window)`	Politely close a window
`tp.move_window(window, x, y)`	Move a window to a new screen position
`tp.resize_window(window, width, height)`	Resize a window to width × height pixels

Input

Function	Description
`tp.type_text(text)`	Type into the currently focused element
`tp.press_key(key)`	Press and release a key (`"enter"`, `"tab"`, `"escape"`)
`tp.hotkey(*keys)`	Key combination (`tp.hotkey("ctrl", "s")`)
`tp.click_at(x, y)`	Click at screen coordinates
`tp.double_click_at(x, y)`	Double-click at coordinates
`tp.right_click_at(x, y)`	Right-click at coordinates
`tp.mouse_move(x, y)`	Move the cursor
`tp.scroll(direction, amount)`	Scroll at current cursor position

Screenshot & Config

Function	Description
`tp.screenshot(app, element, ...)`	Full desktop or cropped to app/window/element/monitor
`tp.monitor_count()`	Number of connected monitors
`tp.configure(...)`	Set runtime options (see Configuration)
`tp.diagnostics()`	Report backend, input, CDP, timeout, and dependency health

All action functions accept an Element object or a string ID. elements(), find(), and get_element() support format="flat", format="json", or format="tree" (elements only) to return pre-formatted strings instead of objects. Window management is implemented across Linux AT-SPI2, Windows UIA, and macOS AX backends.

Architecture

┌───────────────────────────────────────────────────────┐
│               import touchpoint as tp                 │
│  tp.find() · tp.click() · tp.screenshot() · ...       │
│                    (Public API)                       │
├─────────────────────────┬─────────────────────────────┤
│     Backend (ABC)       │    InputProvider (ABC)      │
├─────────────────────────┼─────────────────────────────┤
│  AT-SPI2     (Linux)    │  Xdotool       (X11)        │
│  UIA         (Windows)  │  SendInput     (Win32)      │
│  AX          (macOS)    │  CGEvent       (macOS)      │
│  CDP         (browsers) │                             │
├─────────────────────────┴─────────────────────────────┤
│  Utilities: formatter · matcher · screenshot · scale  │
└───────────────────────────────────────────────────────┘

Two-layer design:

Backend reads the accessibility tree and runs structured actions (click, set_value, focus). Element-aware and reliable.
InputProvider simulates raw keyboard and mouse input. Coordinate-based and element-blind. Used as an automatic fallback when a native accessibility action isn't available.
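
The split between the two layers, and the fallback from one to the other, can be sketched with two abstract base classes. The class names come from the diagram above, but the method names (`click`, `center_of`, `click_at`) and the free function `click` are hypothetical simplifications; the real interfaces are described in ARCHITECTURE.md.

```python
from abc import ABC, abstractmethod


class Backend(ABC):
    """Element-aware layer: acts through the accessibility tree."""

    @abstractmethod
    def click(self, element_id: str) -> bool:
        """Return True if a native click action ran."""

    @abstractmethod
    def center_of(self, element_id: str) -> tuple[int, int]:
        """Return the element's on-screen center point."""


class InputProvider(ABC):
    """Element-blind layer: raw, coordinate-based input."""

    @abstractmethod
    def click_at(self, x: int, y: int) -> None:
        """Click at absolute screen coordinates."""


def click(backend: Backend, inputs: InputProvider, element_id: str,
          fallback_input: bool = True) -> str:
    """Try the native action first, then fall back to a coordinate click."""
    if backend.click(element_id):
        return "native"
    if not fallback_input:
        raise RuntimeError(f"no native click action for {element_id}")
    inputs.click_at(*backend.center_of(element_id))
    return "fallback"
```

Keeping the input layer behind its own interface is what lets a platform swap in a different mechanism without touching the element-aware code.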

CDP runs alongside the platform backend. Their results are merged: native window chrome (toolbar, tabs, menus) from AT-SPI2/UIA/AX, plus full web content from CDP, unified under one API.

For detailed internals, see ARCHITECTURE.md.

Configuration

tp.configure(
    fuzzy_threshold=0.6,          # minimum match score for find() (0.0–1.0)
    fallback_input=True,          # use InputProvider when native actions fail
    type_chunk_size=40,           # split long text into chunks for typing (0 = disable)
    max_elements=5000,            # max elements per query
    max_depth=20,                 # default tree depth limit
    scale_factor=None,            # display scale override (None = auto-detect)
    cdp_ports={"Chrome": 9222},   # explicit CDP port mapping
    cdp_discover=True,            # auto-discover CDP ports from running processes
    cdp_refresh_interval=5.0,     # seconds between CDP target scans
    ax_messaging_timeout=1.0,     # max seconds to wait for a macOS AX app reply
)
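
As an illustration of what `type_chunk_size` controls, the sketch below splits text into fixed-size pieces, with `0` disabling the split. `chunk_text` is a hypothetical helper; how Touchpoint actually paces or sends the chunks is not shown here.

```python
def chunk_text(text: str, chunk_size: int = 40) -> list[str]:
    """Split text into pieces of at most chunk_size characters (0 = no split)."""
    if not text:
        return []
    if chunk_size <= 0 or len(text) <= chunk_size:
        return [text]
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

Joining the chunks always reproduces the original string, so chunking changes only how the text is delivered, not what is typed.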

tp.diagnostics() returns a JSON-friendly health report. It includes the active backend, input provider, CDP targets, optional platform tools, configured timeouts, and macOS apps recently skipped after an AX messaging timeout.

Development

git clone https://github.com/Touchpoint-Labs/touchpoint.git
cd touchpoint
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest

Status

Alpha — fully functional and tested on all three platforms. The API may change before 1.0 based on user feedback.

Platform	Backend	Input	CDP	Tests
Linux (X11)	✅ AT-SPI2	✅ xdotool	✅	✅
Windows	✅ UIA	✅ SendInput	✅	✅
macOS	✅ AX	✅ CGEvent	✅	✅

Known limitations

Wayland input — The Linux InputProvider uses xdotool, which requires X11. On pure Wayland (no XWayland), keyboard/mouse simulation is unavailable. The accessibility tree and native actions still work.
Synchronous CDP — CDP calls block on WebSocket responses. JavaScript dialogs (alert, confirm, prompt) are auto-dismissed to prevent deadlocks. An async rewrite is planned.
No browser navigation API — Touchpoint doesn't have built-in URL navigation. Agents can navigate by interacting with UI elements directly: find the address bar, type a URL, press Enter.
CDP windows are page targets, not OS windows — but window management still works: tp.activate_window() brings the target forward via CDP, and minimize/fullscreen/close/move/resize on a surfaced cdp: window are routed to the underlying native OS window (resolved by owning PID) and handled by the platform backend. They raise ActionFailedError only if no native OS window for that target can be found (e.g. it has been closed).
Backend role/state parity is still uneven — macOS AX and Windows UIA both improved significantly in 0.3.0, but Windows still relies on more heuristics and has more unmapped long-tail roles than the other backends.
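
The navigation workaround from the list above (find the address bar, type a URL, press Enter) can be wrapped in a small helper. This sketch uses only calls that appear elsewhere in this README and takes the `touchpoint` module as a parameter so it can be exercised with a stand-in. The `navigate` function is hypothetical, and the default label "Address and search bar" is an assumption about Chrome's English UI; other browsers and locales will need a different query.

```python
def navigate(tp, url: str, app: str = "Google Chrome",
             address_bar: str = "Address and search bar") -> bool:
    """Type `url` into the browser's address bar and press Enter.

    Returns False when no matching text field is found.
    """
    results = tp.find(address_bar, role=tp.Role.TEXT_FIELD, app=app, max_results=1)
    if not results:
        return False
    field = results[0]
    tp.focus(field)                       # make sure Enter goes to this field
    tp.set_value(field, url, replace=True)
    tp.press_key("enter")
    return True
```

In real use, call it as `navigate(touchpoint_module, "https://example.com")` and follow it with `tp.wait_for(...)` on something the target page is expected to show.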

Roadmap

High Priority

Async CDP architecture — non-blocking WebSocket, proper dialog queuing, concurrent multi-tab queries

Medium Priority

Backend role/state parity — close remaining role mapping gaps, especially UIA long-tail roles on Windows
Wayland input backend — libei / xdg-desktop-portal RemoteDesktop when X11 isn't available

Lower Priority

Tooltip and notification visibility
Element caching

License

MIT
