Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add Olostep integration #1213

Open
wants to merge 1 commit into
base: dev
Choose a base branch
from

Conversation

aadithyanr
Copy link

@aadithyanr aadithyanr commented Mar 5, 2025

User description

Add Olostep Integration

This PR adds Olostep as a new integration provider to Julep, offering superior web scraping capabilities.

Features

  • Faster response times (1-6 seconds vs. traditional crawlers)
  • Built-in bot detection avoidance (no proxy needed)
  • Parallel processing up to 100K requests
  • Better content extraction with multiple formats

Implementation

  • Added TypeSpec definition for Olostep
  • Created integration implementation with full async support
  • Added documentation for usage
  • Updated provider registry

Files Added

  • typespec/tools/olostep.tsp - TypeSpec definition
  • integrations-service/integrations/utils/integrations/olostep.py - Integration logic
  • integrations-service/tests/mocks/olostep.py - Test mocks
  • documentation/docs/integrations/olostep.mdx - Documentation

Files Modified

  • typespec/tools/models.tsp - Added Olostep to provider list
  • integrations-service/integrations/providers.py - Registered provider
  • integrations-service/integrations/env.py - Added API key variable

PS: Let me know if you'd like any changes or have questions about the implementation.


PR Type

Enhancement, Documentation, Tests


Description

  • Added Olostep integration for advanced web scraping.

    • Implemented scrape method with async support.
    • Defined TypeSpec models for setup and arguments.
  • Updated provider registry and environment configuration.

    • Registered Olostep as a provider in providers.py.
    • Added OLOSTEP_API_KEY to environment variables.
  • Provided comprehensive documentation for Olostep usage.

    • Detailed configuration, methods, and examples in olostep.mdx.
  • Added test mocks for Olostep integration.


Changes walkthrough 📝

Relevant files
Configuration changes
env.py
Add Olostep API key to environment variables                         

integrations-service/integrations/env.py

  • Added olostep_api_key environment variable.
+1/-0     
Enhancement
olostep.py
Add Pydantic models for Olostep responses                               

integrations-service/integrations/models/olostep.py

  • Defined Pydantic models for Olostep responses.
  • Added OlostepResponse and OlostepOutput classes.
  • +12/-0   
    providers.py
    Register Olostep provider and methods                                       

    integrations-service/integrations/providers.py

  • Registered Olostep as a provider.
  • Added scrape method and provider info.
  • +22/-0   
    olostep.py
    Implement Olostep scrape method with retry logic                 

    integrations-service/integrations/utils/integrations/olostep.py

  • Implemented scrape method for Olostep API.
  • Added retry logic and payload handling.
  • Created helper functions for API key and response formatting.
  • +76/-0   
    models.tsp
    Update TypeSpec models with Olostep provider                         

    typespec/tools/models.tsp

    • Added olostep to integration provider list.
    +1/-0     
    olostep.tsp
    Add TypeSpec definitions for Olostep integration                 

    typespec/tools/olostep.tsp

  • Defined TypeSpec models for Olostep setup and arguments.
  • Added models for scraping parameters and responses.
  • +83/-0   
    Tests
    olostep.py
    Add test mocks for Olostep integration                                     

    integrations-service/tests/mocks/olostep.py

    • Added mock response for Olostep API.
    +13/-0   
    Documentation
    olostep.mdx
    Add documentation for Olostep integration                               

    documentation/docs/integrations/web_browser_automation/olostep.mdx

  • Documented Olostep integration setup and methods.
  • Included examples and best practices.
  • Detailed error handling and features.
  • +184/-0 

    Need help?
  • Type /help how to ... in the comments thread for any questions about Qodo Merge usage.
  • Check out the documentation for more information.

  • EntelligenceAI PR Summary

    Purpose: Enhance web scraping with Olostep for faster, bot-resistant, parallel processing.

    Changes:

    • New Feature: Integrated Olostep provider; added API key environment variable and response models.
    • Documentation: Comprehensive Olostep setup and usage guides introduced.
    • Test: Mock responses for Olostep API added for improved testing.
    • Enhancement: Updated models.tsp and created olostep.tsp for setup and argument models.

    Impact: Improved data extraction speed and reliability, enhanced user understanding, and more robust testing.

    Copy link
    Contributor

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 PR contains tests
    🔒 Security concerns

    API Key Exposure:
    The implementation uses API keys for authentication with the Olostep service. While the code properly retrieves keys from environment variables, there's no validation to ensure the API key is not empty or malformed before making requests, which could lead to unexpected behavior or errors that might expose debugging information.

    ⚡ Recommended focus areas for review

    Error Handling

    The error handling in the scrape function could be improved. It catches all exceptions generically and raises a RuntimeError with a message, but doesn't preserve the original exception type or traceback, which might make debugging harder.

    except Exception as e:
        # Log the exception or handle it as needed
        msg = f"Error executing Olostep scrape: {str(e)}"
        raise RuntimeError(msg)
    Response Parsing

    The create_olostep_response function assumes a specific response structure but doesn't handle cases where the expected fields might be missing, potentially causing KeyError exceptions.

    def create_olostep_response(result: Dict) -> List[OlostepResponse]:
        """Convert API response to our format"""
        return [OlostepResponse(
            url=result.get("url_to_scrape"),
            content=result.get("result", {}).get("markdown_content", ""),
            error=result.get("error"),
            status="success" if result.get("result") else "error"
        )]
    Metadata Handling

    The OlostepResponse model includes a metadata field, but the create_olostep_response function doesn't populate it, resulting in metadata always being None despite being defined in the model.

    def create_olostep_response(result: Dict) -> List[OlostepResponse]:
        """Convert API response to our format"""
        return [OlostepResponse(
            url=result.get("url_to_scrape"),
            content=result.get("result", {}).get("markdown_content", ""),
            error=result.get("error"),
            status="success" if result.get("result") else "error"
        )]

    Copy link
    Contributor

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Impact
    General
    Improve error handling

    The error handling in the scrape function doesn't properly populate the
    OlostepResponse with error information when an exception occurs. Instead of just
    raising an exception, create a proper error response that follows the defined
    model structure.

    integrations-service/integrations/utils/integrations/olostep.py [71-74]

     except Exception as e:
    -    # Log the exception or handle it as needed
    -    msg = f"Error executing Olostep scrape: {str(e)}"
    -    raise RuntimeError(msg)
    +    # Create error response instead of just raising
    +    error_msg = f"Error executing Olostep scrape: {str(e)}"
    +    final_result = [OlostepResponse(
    +        url=arguments.url,
    +        content="",
    +        error=error_msg,
    +        status="error"
    +    )]
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    __

    Why: The suggestion significantly improves error handling by creating a proper OlostepResponse object with error information instead of just raising an exception. This approach is more consistent with the API design pattern and provides better error information to consumers.

    Medium
    Include missing model field

    The create_olostep_response function doesn't properly handle the metadata field
    that's defined in the OlostepResponse model. Add the metadata field to the
    response creation to ensure all model fields are properly populated.

    integrations-service/integrations/utils/integrations/olostep.py [24-31]

     def create_olostep_response(result: Dict) -> List[OlostepResponse]:
         """Convert API response to our format"""
         return [OlostepResponse(
             url=result.get("url_to_scrape"),
             content=result.get("result", {}).get("markdown_content", ""),
             error=result.get("error"),
    -        status="success" if result.get("result") else "error"
    +        status="success" if result.get("result") else "error",
    +        metadata={"created": result.get("created")} if result.get("created") else None
         )]
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    __

    Why: The suggestion correctly identifies that the metadata field defined in the OlostepResponse model is not being populated in the create_olostep_response function. Adding this field improves consistency with the model definition and ensures complete data handling.

    Medium
    • More

    Copy link
    Contributor

    Walkthrough

    The update integrates Olostep for web scraping, enhancing capabilities with fast response times, bot detection avoidance, and parallel processing. It includes a detailed configuration guide, scraping methods, and error handling. An environment variable for the Olostep API key is added, along with models for handling responses. The Olostep provider is integrated into the system, and a mock response is included for testing.

    Changes

    File(s) Summary
    documentation/docs/integrations/web_browser_automation/olostep.mdx Added comprehensive documentation for Olostep integration, covering setup, methods, and best practices.
    integrations-service/integrations/ Introduced olostep_api_key in env.py, defined response models in models/olostep.py, integrated Olostep provider in providers.py, and implemented scrape function in utils/integrations/olostep.py.
    integrations-service/tests/mocks/olostep.py Added mock function for simulating Olostep API responses for testing purposes.
    typespec/tools/ Updated models.tsp to include Olostep in the provider list and created olostep.tsp for setup and argument models.
    Entelligence.ai can learn from your feedback. Simply add 👍 / 👎 emojis to teach it your preferences. More shortcuts below

    Emoji Descriptions:

    • ⚠️ Potential Issue - May require further investigation.
    • 🔒 Security Vulnerability - Fix to ensure system safety.
    • 💻 Code Improvement - Suggestions to enhance code quality.
    • 🔨 Refactor Suggestion - Recommendations for restructuring code.
    • ℹ️ Others - General comments and information.

    Interact with the Bot:

    • Send a message or request using the format:
      @bot + *your message*
    Example: @bot Can you suggest improvements for this code?
    
    • Help the Bot learn by providing feedback on its responses.
      @bot + *feedback*
    Example: @bot Do not comment on `save_auth` function !
    

    Comment on lines +93 to +110
    olostep = BaseProvider(
    provider="olostep",
    setup=OlostepSetup,
    methods=[
    BaseProviderMethod(
    method="scrape",
    description="Scrape website content using Olostep API",
    arguments=OlostepScrapeArguments,
    output=OlostepOutput,
    ),
    ],
    info=ProviderInfo(
    url="https://olostep.com/",
    docs="https://docs.olostep.com/",
    icon="https://www.olostep.com/images/olostep-logo-cropped.svg",
    friendly_name="Olostep",
    ),
    )
    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Missing imports for OlostepSetup, OlostepScrapeArguments, and OlostepOutput which are required for the new Olostep provider. This will cause runtime import errors.

    📝 Committable Code Suggestion

    ‼️ Ensure you review the code suggestion before committing it to the branch. Make sure it replaces the highlighted code, contains no missing lines, and has no issues with indentation.

    Suggested change
    olostep = BaseProvider(
    provider="olostep",
    setup=OlostepSetup,
    methods=[
    BaseProviderMethod(
    method="scrape",
    description="Scrape website content using Olostep API",
    arguments=OlostepScrapeArguments,
    output=OlostepOutput,
    ),
    ],
    info=ProviderInfo(
    url="https://olostep.com/",
    docs="https://docs.olostep.com/",
    icon="https://www.olostep.com/images/olostep-logo-cropped.svg",
    friendly_name="Olostep",
    ),
    )
    from .olostep.setup import OlostepSetup
    from .olostep.arguments import OlostepScrapeArguments
    from .olostep.output import OlostepOutput
    olostep = BaseProvider(
    provider="olostep",
    setup=OlostepSetup,
    methods=[
    BaseProviderMethod(
    method="scrape",
    description="Scrape website content using Olostep API",
    arguments=OlostepScrapeArguments,
    output=OlostepOutput,
    ),
    ],
    info=ProviderInfo(
    url="https://olostep.com/",
    docs="https://docs.olostep.com/",
    icon="https://www.olostep.com/images/olostep-logo-cropped.svg",
    friendly_name="Olostep",
    ),
    )

    Comment on lines +49 to +51
    try:
    async with get_olostep_client(api_key) as client:
    payload = {
    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    The aiohttp.ClientSession is not properly closed if an exception occurs before entering the async with block. Should use try/finally to ensure cleanup.

    📝 Committable Code Suggestion

    ‼️ Ensure you review the code suggestion before committing it to the branch. Make sure it replaces the highlighted code, contains no missing lines, and has no issues with indentation.

    Suggested change
    try:
    async with get_olostep_client(api_key) as client:
    payload = {
    client = await get_olostep_client(api_key)
    try:
    async with client as session:
    payload = {

    if value := arguments.params.get(key):
    payload[key] = value

    async with client.post("https://api.olostep.com/v1/scrapes", json=payload) as response:
    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Missing timeout configuration for HTTP requests which could lead to hanging requests. Should add timeout parameter to client.post().

    📝 Committable Code Suggestion

    ‼️ Ensure you review the code suggestion before committing it to the branch. Make sure it replaces the highlighted code, contains no missing lines, and has no issues with indentation.

    Suggested change
    async with client.post("https://api.olostep.com/v1/scrapes", json=payload) as response:
    async with client.post("https://api.olostep.com/v1/scrapes", json=payload, timeout=30) as response:

    Copy link
    Contributor

    @ellipsis-dev ellipsis-dev bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    ❌ Changes requested. Reviewed everything up to cbacd6c in 3 minutes and 20 seconds

    More details
    • Looked at 470 lines of code in 8 files
    • Skipped 0 files when reviewing.
    • Skipped posting 9 drafted comments based on config settings.
    1. documentation/docs/integrations/web_browser_automation/olostep.mdx:184
    • Draft comment:
      Missing newline at end of file. Please add one for POSIX compliance.
    • Reason this comment was not posted:
      Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%
      Missing newlines at end of files are a common convention and can cause issues with some tools. However, this is a documentation file and the impact would be minimal. Most modern editors handle this automatically. The comment is technically correct but very minor.
      The comment is about a real issue and provides a concrete fix. Missing newlines can cause issues with git diffs and some text processing tools.
      While technically correct, this is an extremely minor issue that most modern tools handle gracefully. It's more noise than value in the PR review.
      Delete the comment as it's too minor of an issue to warrant attention in code review, especially for a documentation file.
    2. integrations-service/integrations/providers.py:93
    • Draft comment:
      OlostepSetup and OlostepScrapeArguments are used but not imported. Please add the appropriate imports from autogen.Tools.
    • Reason this comment was not posted:
      Marked as duplicate.
    3. integrations-service/integrations/utils/integrations/olostep.py:71
    • Draft comment:
      Consider logging exception details before re-raising RuntimeError for better diagnostics.
    • Reason this comment was not posted:
      Confidence changes required: 50% <= threshold 50%
      None
    4. documentation/docs/integrations/web_browser_automation/olostep.mdx:184
    • Draft comment:
      Add a newline at the end-of-file to improve POSIX compliance.
    • Reason this comment was not posted:
      Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%
      While having a newline at EOF is generally good practice, this is a very minor issue that would typically be handled by editor configurations or linting rules. The comment doesn't point out a significant problem that needs human attention. Most modern text editors automatically add newlines at EOF.
      The suggestion is technically correct and follows Unix conventions. Not having a newline at EOF could cause issues with some tools.
      While technically valid, this is too minor of an issue to warrant a PR comment. It should be handled by automation rather than human review.
      Delete this comment as it's too minor of an issue to warrant human attention in code review. This should be handled by editor configs or linting rules.
    5. integrations-service/integrations/utils/integrations/olostep.py:51
    • Draft comment:
      Handle the optional 'params' gracefully to avoid AttributeError when it is None.
    • Reason this comment was not posted:
      Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%
      The code uses @beartype decorator and has type assertions checking that arguments is OlostepScrapeArguments. If this is properly typed, params should be defined. Without seeing OlostepScrapeArguments definition, we can't be certain if params can be None. The code seems to follow a consistent pattern assuming params exists.
      I don't have visibility into OlostepScrapeArguments class definition. The params field could potentially be optional.
      Given the strict typing and assertions, plus consistent usage pattern, the code likely assumes params is a required field of OlostepScrapeArguments. If it wasn't, we'd expect to see null checks elsewhere too.
      The comment should be deleted as it appears to be speculative without strong evidence that params can be None, especially given the strict typing in place.
    6. integrations-service/integrations/utils/integrations/olostep.py:26
    • Draft comment:
      Consider including additional metadata from the API response (e.g., a 'created' timestamp) in the response's metadata field.
    • Reason this comment was not posted:
      Confidence changes required: 50% <= threshold 50%
      None
    7. integrations-service/integrations/utils/integrations/olostep.py:76
    • Draft comment:
      The file does not end with a newline which might create issues in some setups. Please add a newline at the end of the file.
    • Reason this comment was not posted:
      Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%
      Missing newlines at end of file can cause issues with some tools and is generally considered good practice. However, this is a very minor issue that would likely be caught by linters or formatters. It's not a logic issue or code quality problem that needs human attention.
      The comment is technically correct and follows a common convention. Missing newlines can cause subtle issues in some scenarios.
      While technically valid, this is exactly the kind of minor formatting issue that should be handled by automated tools rather than manual review comments. It doesn't affect functionality or code quality in a meaningful way.
      Delete the comment as it addresses a minor formatting issue that should be handled by automated tools rather than taking up reviewer and author attention.
    8. typespec/tools/models.tsp:62
    • Draft comment:
      Typo: In the comment on line 61/62, "overriden" should be corrected to "overridden" for clarity and accuracy.
    • Reason this comment was not posted:
      Comment was not on a location in the diff, so it can't be submitted as a review comment.
    9. typespec/tools/models.tsp:65
    • Draft comment:
      Typo: In the comment on line 64/65, "overriden" should be changed to "overridden" for better readability and correctness.
    • Reason this comment was not posted:
      Comment was not on a location in the diff, so it can't be submitted as a review comment.

    Workflow ID: wflow_EcaolMJRkrRXAfie


    Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

    @@ -50,6 +50,8 @@
    WikipediaSearchOutput,
    )

    from .utils.integrations.olostep import scrape
    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Explicitly import OlostepSetup and OlostepScrapeArguments to ensure they’re defined.

    Suggested change
    from .utils.integrations.olostep import scrape
    from .utils.integrations.olostep import scrape, OlostepSetup, OlostepScrapeArguments

    except Exception as e:
    # Log the exception or handle it as needed
    msg = f"Error executing Olostep scrape: {str(e)}"
    raise RuntimeError(msg)
    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    Use exception chaining (e.g., raise RuntimeError(...) from e) to preserve original traceback details.

    Suggested change
    raise RuntimeError(msg)
    raise RuntimeError(msg) from e

    @creatorrr
    Copy link
    Contributor

    sorry for the long wait @aadithyanr , this slipped through the cracks. picking this up this week. in the meantime, it'd be super helpful if you are able to resolve merge conflicts and as many of the AI reviewer comments as possible

    @creatorrr creatorrr requested a review from HamadaSalhab March 22, 2025 19:28
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    2 participants