AI Desktop

A simple AI desktop agent that uses OmniParser and a vision-language model to interact with the system. It can perform tasks such as opening applications, searching the web, and answering questions.

User Query: Open Google Chrome and search for google stock price

Demo video: sample_result.mp4

Table of Contents

How it works
Installation
OmniParser Setup
Configuration
Running the AI Desktop

How it works

graph TD;
    A[User Prompt: Open Chrome and buy me a milk] -->|User Input| B[VLMAgent];

    B -->|Parse Screen Content| C[Omniparser];
    C -->|Extracted Info| D[Computer];

    B -->|Analyze Screen, Determine Action| E[LLM OpenAI];
    E -->|Generate Action e.g., Mouse Move, Type| F[Action Execution];

    F -->|Execute Action on Computer| D;
    D -->|Get Result/Feedback| B;
    
    F -->|Repeat until Task Complete| G[Task Complete];

It takes a user prompt and processes it through a vision-language model (VLMAgent). The agent analyzes the screen, extracts information, and determines the required actions using an AI model. These actions are then executed on the computer, repeating until the task is complete.
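The loop below sketches this flow in Python. VLMAgent, Omniparser, and Computer stand in for the repository's components; the method names are illustrative, not the exact API.

def run_task(user_query: str, agent, omniparser, computer, max_steps: int = 25):
    """Hypothetical perceive-decide-act loop; names are illustrative."""
    for _ in range(max_steps):
        screenshot = computer.screenshot()           # capture the current screen
        elements = omniparser.parse(screenshot)      # extract labeled UI elements
        action = agent.decide(user_query, elements)  # LLM chooses the next action
        if action.type == "done":                    # agent signals completion
            return action.result
        computer.execute(action)                     # e.g., mouse move, click, type
    raise TimeoutError("Task did not complete within the step budget")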

Installation

Clone the repository along with the OmniParser submodule

git clone --recursive https://github.com/FareedKhan-dev/ai-desktop

Or, if already cloned, to update OmniParser submodule

git submodule update --init --recursive

To install the dependencies, run the following command:

cd ai-desktop/OmniParser
pip install -r requirements.txt

AI Desktop itself does not require any additional dependencies.

OmniParser Setup

Navigate to the OmniParser directory

cd OmniParser

Download the model checkpoints:

# Download the model checkpoints to the local directory OmniParser/weights/
mkdir -p weights/icon_detect weights/icon_caption_florence

for file in icon_detect/{train_args.yaml,model.pt,model.yaml} \
            icon_caption/{config.json,generation_config.json,model.safetensors}; do
    huggingface-cli download microsoft/OmniParser-v2.0 "$file" --local-dir weights
done

mv weights/icon_caption weights/icon_caption_florence

Make sure the checkpoints end up under the weights directory, in subfolders named icon_detect and icon_caption_florence respectively.

To start the Gradio API for OmniParser, run the following command:

python gradio_demo.py

The Gradio API will start at localhost:<port>, and a live sharing link will be generated.
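To verify the API is reachable from Python, the gradio_client package can connect to it. The URL below is an assumption; substitute your local port or the generated sharing link.

# Sanity-check the OmniParser Gradio API; the URL is an assumption
from gradio_client import Client

client = Client("http://localhost:7860/")  # or the generated sharing link
client.view_api()                          # prints the exposed endpoints and their signatures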

Configuration

Modify the config.py file to set up the API URLs, model names, and authentication keys.

OMNIPARSER_API_URL = "OMNIPARSER_Gradio_link"  # Set the OmniParser Gradio API link (see the OmniParser Setup section)
VLM_MODEL_NAME = "OPENAI/LOCAL_MODEL_NAME"  # Define the vision-language model
BASE_URL = "BASE_URL"  # Set the base URL for the API
API_KEY = "API_KEY"  # Provide the API key

The SYSTEM_PROMPT in config.py defines the AI agent behavior, guiding it to interact with the system using various actions like mouse movements, clicks, typing, and screenshots. Modify it as needed for custom AI interactions.
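As a rough sketch of how these values plug together, assuming an OpenAI-compatible endpoint (the repository's actual wiring may differ):

# Sketch: wiring config.py values into an OpenAI-compatible client.
# The chat call below illustrates the pattern, not the repository's exact code.
from openai import OpenAI

from config import API_KEY, BASE_URL, SYSTEM_PROMPT, VLM_MODEL_NAME

client = OpenAI(base_url=BASE_URL, api_key=API_KEY)
response = client.chat.completions.create(
    model=VLM_MODEL_NAME,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Open Google Chrome"},
    ],
)
print(response.choices[0].message.content)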

Running the AI Desktop

To start the AI Desktop, run the following command:

python main.py

You can modify the user_query variable in main.py to test different queries.
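For example (the query string is arbitrary):

# In main.py: change the query to test a different task
user_query = "Open Notepad and type hello world"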