diff --git a/README.md b/README.md new file mode 100644 index 0000000..fd6254f --- /dev/null +++ b/README.md @@ -0,0 +1,162 @@ +## **Fetch Rewards - Health Monitor Script** + +This project is a Python-based health check monitor that continuously checks the availability and responsiveness of web service endpoints defined in a YAML configuration file. + +## **1\. Given Requirements** + +The following were the original project requirements: + +- Must accept a YAML configuration file as a command-line argument. +- YAML format must match that in the provided sample. +- Must accurately determine the availability of all endpoints during every check cycle. +- Output must include the availability percentage per domain. +- Should perform checks every 15 seconds. +- Should be fault-tolerant and log errors without crashing. + +## **2\. Installation & Usage** + +### **Clone the Repository** +First, clone the repository from GitHub to your local machine: + +```bash +git clone https://github.com/iamsanjayboga/sre-take-home-exercise-python +cd sre-take-home-exercise-python +``` + +### **Installation** +After cloning the repository, install the dependencies listed in `requirements.txt`: + +```bash +pip install -r requirements.txt +``` + +### **Running the Script** +Once the dependencies are installed, run the script with the path to a YAML configuration file as its only argument: + +```bash +python main.py + +python main.py Sample.yaml # Example 1 +python main.py customConfig.yaml # Example 2 +``` + +## **3\. Issues Identified in Initial Code** + +1. **Default HTTP Method Handling** + **Issue**: The original code did not provide a default HTTP method when the `method` field was missing in the endpoint configuration. + **Fix**: Assigned `"GET"` as the default method if none is specified, ensuring consistent request behavior. + ``` + method = endpoint.get('method', 'GET') + ``` + +2. 
**YAML Format Validation** + **Issue**: The original code did not validate whether the YAML configuration file was in the correct format (list of endpoints). + **Fix**: Added a check to ensure the loaded YAML is a list. If not, the program exits with an error, preventing invalid configurations from causing runtime issues. + ``` + if not isinstance(config, list): + raise ValueError("Configuration should be a list of endpoints") + return config + ``` + +3. **Timeout Handling** + **Issue**: The original code did not enforce a response timeout, which could lead to hanging if a server did not respond. + **Fix**: Added a 500ms timeout to all HTTP requests and validated both response time and status code to mark endpoints as UP or DOWN. + ``` + response = requests.request(method, url, headers=headers, json=body if body else None, timeout=0.5) + ``` + +4. **Missing URL Handling** + **Issue**: If an endpoint configuration was missing a URL field, it could cause a crash or undefined behavior. + **Fix**: Added a guard clause that checks for the URL field and logs a meaningful error if it's missing, returning "DOWN" for that endpoint. + ``` + if not url: + logger.error(f"Missing URL for endpoint: {endpoint.get('name', 'unknown')}") + return "DOWN" + ``` + +5. **Improved Error Logging** + **Issue**: Errors were logged vaguely or printed directly, making debugging difficult. + **Fix**: Switched to structured `logger.debug`/`logger.error` messages with required information like endpoint names and specific exception messages. + ``` + logger.debug(f"{endpoint.get('name', url)} is DOWN - Error: {str(e)}") + ``` + +6. **Domain Extraction Accuracy** + **Issue**: Domain names were extracted along with port numbers, which caused inaccurate grouping of stats per domain. + **Fix**: Used `urllib.parse.urlparse` to extract the domain cleanly and strip any port numbers, ensuring correct domain-based stats. 
+ ``` + parsed_url = urlparse(url) + return parsed_url.netloc.split(':')[0] + ``` + +## **4\. Summary - Fixes Made for the Above Issues** + +| **Bug #** | **Description** | **Fix** | +| --- | -------------------------------------- |----------------------------------------- | +| #1 | Default method not handled | Added default GET for missing method | +| #2 | Invalid YAML structure | Added type validation after loading YAML | +| #3 | No timeout in requests | Added 500ms timeout to HTTP requests | +| #4 | URL field may be missing | Added guard clause to check for URL | +| #5 | Vague logging | Switched to logger.debug / logger.error with required messages | +| #6 | Port in domain affected stats | Extracted domain name without port | + + +## **5. Key Enhancements and Their Benefits** + +| **Enh. #** | **Enhancement Description** | **Implementation** | +|------------|------------------------------------------------------|-------------------------------------------------------------------------------------| +| #1 | Added concurrency for faster health checks | Integrated `ThreadPoolExecutor` to execute multiple endpoint checks in parallel | +| #2 | Enabled daily rotation of log files | Configured logging to store output in date-based files inside the `logs/` directory | +| #3 | Maintained consistent monitoring intervals | Ensured each health check cycle runs every 15 seconds by adjusting sleep time | +| #4 | Enhanced logging for better observability | Added detailed logs showing both success and failure with contextual messages | + + +## **6. Benefits of Enhancements** + +These enhancements significantly improve the robustness, clarity, and efficiency of the monitoring system: + +- **Faster Monitoring with Concurrency** + Parallel health checks reduce the time required per monitoring cycle, allowing quicker detection of failures. 
+ +- **Simplified Log Management** + Date-stamped log files in the `logs/` directory keep log output organized and manageable over time, aiding debugging and traceability. + +- **Consistent Monitoring Intervals** + Each cycle's sleep is shortened by the time the checks took, so cycles start about every 15 seconds and drift in monitoring frequency stays small. + +- **Better Observability and Debugging** + Enhanced logging provides clear insights into system behavior, making troubleshooting faster and easier. + + +## **7. Screenshots: Error vs Working** + +### Error State + +This is a screenshot showing the error state before the fixes were implemented. + +Input File: Sample.yaml + +![Error Screenshot](images/error_screenshot.png) + +### Working State + +This is a screenshot showing the system after the fixes, demonstrating it is now working correctly. + +Input File: Sample.yaml + +![Working Screenshot](images/working_screenshot.png) + +![Working Screenshot](images/working_screenshot_ended.png) + +Input File: customConfig.yaml + +![Working Screenshot](images/customConfig_working.png) + +## **8. Conclusion** + +All the original requirements have been implemented successfully, and additional enhancements have been made to improve performance, logging, and reliability. 
+ + +### + +### diff --git a/customConfig.yaml b/customConfig.yaml new file mode 100644 index 0000000..45cf2bd --- /dev/null +++ b/customConfig.yaml @@ -0,0 +1,19 @@ +- name: "Google" + url: "https://www.google.com" + method: "GET" + headers: + User-Agent: "SRE Monitor" + +- name: "GitHub" + url: "https://github.com" + method: "GET" + headers: + User-Agent: "SRE Monitor" + timeout: 0.5 # Custom timeout value for testing + + +- name: "Stack Overflow" + url: "https://stackoverflow.com" + method: "GET" + headers: + User-Agent: "SRE Monitor" diff --git a/images/customConfig_working.png b/images/customConfig_working.png new file mode 100644 index 0000000..0263d30 Binary files /dev/null and b/images/customConfig_working.png differ diff --git a/images/error_screenshot.png b/images/error_screenshot.png new file mode 100644 index 0000000..a5e310c Binary files /dev/null and b/images/error_screenshot.png differ diff --git a/images/working_screenshot.png b/images/working_screenshot.png new file mode 100644 index 0000000..f294653 Binary files /dev/null and b/images/working_screenshot.png differ diff --git a/images/working_screenshot_ended.png b/images/working_screenshot_ended.png new file mode 100644 index 0000000..f84a8b3 Binary files /dev/null and b/images/working_screenshot_ended.png differ diff --git a/logs/monitor_2025-04-20.log b/logs/monitor_2025-04-20.log new file mode 100644 index 0000000..1982f1a --- /dev/null +++ b/logs/monitor_2025-04-20.log @@ -0,0 +1,109 @@ +2025-04-20 17:02:44 - INFO - Starting monitoring of 4 endpoints +2025-04-20 17:02:45 - INFO - --- Health Check Results --- +2025-04-20 17:02:45 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 25% availability percentage +2025-04-20 17:03:00 - INFO - --- Health Check Results --- +2025-04-20 17:03:00 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 25% availability percentage +2025-04-20 17:03:15 - INFO - --- Health Check 
Results --- +2025-04-20 17:03:15 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 25% availability percentage +2025-04-20 19:17:15 - INFO - Starting monitoring of 4 endpoints +2025-04-20 19:17:16 - INFO - --- Health Check Results --- +2025-04-20 19:17:16 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 0% availability percentage +2025-04-20 19:17:31 - INFO - --- Health Check Results --- +2025-04-20 19:17:31 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 12% availability percentage +2025-04-20 19:17:46 - INFO - --- Health Check Results --- +2025-04-20 19:17:46 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 17% availability percentage +2025-04-20 19:18:01 - INFO - --- Health Check Results --- +2025-04-20 19:18:01 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 19% availability percentage +2025-04-20 19:18:22 - INFO - Starting monitoring of 4 endpoints +2025-04-20 19:18:23 - INFO - --- Health Check Results --- +2025-04-20 19:18:23 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 25% availability percentage +2025-04-20 19:18:38 - INFO - --- Health Check Results --- +2025-04-20 19:18:38 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 12% availability percentage +2025-04-20 19:20:35 - INFO - Starting monitoring of 4 endpoints +2025-04-20 19:20:36 - INFO - --- Health Check Results --- +2025-04-20 19:20:36 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 25% availability percentage +2025-04-20 19:20:51 - INFO - --- Health Check Results --- +2025-04-20 19:20:51 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 25% availability percentage +2025-04-20 19:21:06 - INFO - --- Health Check 
Results --- +2025-04-20 19:21:06 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 25% availability percentage +2025-04-20 20:53:42 - INFO - Starting monitoring of 4 endpoints +2025-04-20 20:53:43 - INFO - --- Health Check Results --- +2025-04-20 20:53:43 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 0% availability percentage +2025-04-20 20:53:58 - INFO - --- Health Check Results --- +2025-04-20 20:53:58 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 12% availability percentage +2025-04-20 20:54:13 - INFO - --- Health Check Results --- +2025-04-20 20:54:13 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 17% availability percentage +2025-04-20 20:54:28 - INFO - --- Health Check Results --- +2025-04-20 20:54:28 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 19% availability percentage +2025-04-20 20:54:43 - INFO - --- Health Check Results --- +2025-04-20 20:54:43 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 20% availability percentage +2025-04-20 20:54:58 - INFO - --- Health Check Results --- +2025-04-20 20:54:58 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 21% availability percentage +2025-04-20 20:55:13 - INFO - --- Health Check Results --- +2025-04-20 20:55:13 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 21% availability percentage +2025-04-20 20:55:28 - INFO - --- Health Check Results --- +2025-04-20 20:55:28 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 22% availability percentage +2025-04-20 20:55:43 - INFO - --- Health Check Results --- +2025-04-20 20:55:43 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 22% 
availability percentage +2025-04-20 20:55:58 - INFO - --- Health Check Results --- +2025-04-20 20:55:58 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 22% availability percentage +2025-04-20 20:56:13 - INFO - --- Health Check Results --- +2025-04-20 20:56:13 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 23% availability percentage +2025-04-20 20:56:28 - INFO - --- Health Check Results --- +2025-04-20 20:56:28 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 23% availability percentage +2025-04-20 20:56:43 - INFO - --- Health Check Results --- +2025-04-20 20:56:43 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 23% availability percentage +2025-04-20 20:56:58 - INFO - --- Health Check Results --- +2025-04-20 20:56:58 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 23% availability percentage +2025-04-20 20:57:13 - INFO - --- Health Check Results --- +2025-04-20 20:57:13 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 23% availability percentage +2025-04-20 20:57:28 - INFO - --- Health Check Results --- +2025-04-20 20:57:28 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 23% availability percentage +2025-04-20 20:57:43 - INFO - --- Health Check Results --- +2025-04-20 20:57:43 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 24% availability percentage +2025-04-20 20:57:58 - INFO - --- Health Check Results --- +2025-04-20 20:57:58 - INFO - dev-sre-take-home-exercise-rubric.us-east-1.recruiting-public.fetchrewards.com has 24% availability percentage +2025-04-20 21:11:49 - INFO - Starting monitoring of 4 endpoints +2025-04-20 21:11:50 - INFO - --- Health Check Results --- +2025-04-20 21:11:50 - INFO - www.google.com has 
100% availability percentage +2025-04-20 21:11:50 - INFO - github.com has 100% availability percentage +2025-04-20 21:11:50 - INFO - stackoverflow.com has 100% availability percentage +2025-04-20 21:11:50 - INFO - example.com has 100% availability percentage +2025-04-20 21:12:05 - INFO - --- Health Check Results --- +2025-04-20 21:12:05 - INFO - www.google.com has 100% availability percentage +2025-04-20 21:12:05 - INFO - github.com has 100% availability percentage +2025-04-20 21:12:05 - INFO - stackoverflow.com has 100% availability percentage +2025-04-20 21:12:05 - INFO - example.com has 100% availability percentage +2025-04-20 21:12:20 - INFO - --- Health Check Results --- +2025-04-20 21:12:20 - INFO - www.google.com has 100% availability percentage +2025-04-20 21:12:20 - INFO - github.com has 100% availability percentage +2025-04-20 21:12:20 - INFO - stackoverflow.com has 100% availability percentage +2025-04-20 21:12:20 - INFO - example.com has 100% availability percentage +2025-04-20 21:12:35 - INFO - --- Health Check Results --- +2025-04-20 21:12:35 - INFO - www.google.com has 100% availability percentage +2025-04-20 21:12:35 - INFO - github.com has 100% availability percentage +2025-04-20 21:12:35 - INFO - stackoverflow.com has 100% availability percentage +2025-04-20 21:12:35 - INFO - example.com has 100% availability percentage +2025-04-20 21:12:50 - INFO - --- Health Check Results --- +2025-04-20 21:12:50 - INFO - www.google.com has 100% availability percentage +2025-04-20 21:12:50 - INFO - github.com has 100% availability percentage +2025-04-20 21:12:50 - INFO - stackoverflow.com has 100% availability percentage +2025-04-20 21:12:50 - INFO - example.com has 100% availability percentage +2025-04-20 21:13:05 - INFO - --- Health Check Results --- +2025-04-20 21:13:05 - INFO - www.google.com has 100% availability percentage +2025-04-20 21:13:05 - INFO - github.com has 100% availability percentage +2025-04-20 21:13:05 - INFO - stackoverflow.com has 100% 
availability percentage +2025-04-20 21:13:05 - INFO - example.com has 100% availability percentage +2025-04-20 21:13:42 - INFO - Starting monitoring of 3 endpoints +2025-04-20 21:13:42 - INFO - --- Health Check Results --- +2025-04-20 21:13:42 - INFO - www.google.com has 100% availability percentage +2025-04-20 21:13:42 - INFO - github.com has 100% availability percentage +2025-04-20 21:13:42 - INFO - stackoverflow.com has 100% availability percentage +2025-04-20 21:13:58 - INFO - --- Health Check Results --- +2025-04-20 21:13:58 - INFO - www.google.com has 100% availability percentage +2025-04-20 21:13:58 - INFO - github.com has 100% availability percentage +2025-04-20 21:13:58 - INFO - stackoverflow.com has 100% availability percentage +2025-04-20 21:14:12 - INFO - --- Health Check Results --- +2025-04-20 21:14:12 - INFO - www.google.com has 100% availability percentage +2025-04-20 21:14:12 - INFO - github.com has 100% availability percentage +2025-04-20 21:14:12 - INFO - stackoverflow.com has 100% availability percentage diff --git a/main.py b/main.py index e3f2bef..1aa2127 100644 --- a/main.py +++ b/main.py @@ -1,57 +1,121 @@ import yaml import requests import time +import sys +import logging from collections import defaultdict +from urllib.parse import urlparse +import concurrent.futures +import os +from datetime import datetime + +# ENHANCEMENT 2: Added daily rotating log file in 'logs' directory +os.makedirs("logs", exist_ok=True) +log_filename = datetime.now().strftime("logs/monitor_%Y-%m-%d.log") + +logging.basicConfig( + level=logging.INFO, + format='%(asctime)s - %(levelname)s - %(message)s', + datefmt='%Y-%m-%d %H:%M:%S', + handlers=[ + logging.FileHandler(log_filename), + logging.StreamHandler(sys.stdout) + ] +) +logger = logging.getLogger(__name__) + # Function to load configuration from the YAML file def load_config(file_path): - with open(file_path, 'r') as file: - return yaml.safe_load(file) + try: + with open(file_path, 'r') as file: + config = 
yaml.safe_load(file) + if not isinstance(config, list): # BUG FIX #2 - Validation of YAML format + raise ValueError("Configuration should be a list of endpoints") + return config + except (yaml.YAMLError, FileNotFoundError, ValueError) as e: + logger.error(f"Error loading configuration file: {str(e)}") # BUG FIX #5 - Better error logging + sys.exit(1) # Function to perform health checks def check_health(endpoint): - url = endpoint['url'] - method = endpoint.get('method') + url = endpoint.get('url') + if not url: + logger.error(f"Missing URL for endpoint: {endpoint.get('name', 'unknown')}") # BUG FIX #4: Handle missing URL + return "DOWN" + + method = endpoint.get('method', 'GET') # BUG FIX #1 - Default method to GET headers = endpoint.get('headers') body = endpoint.get('body') - + try: - response = requests.request(method, url, headers=headers, json=body) - if 200 <= response.status_code < 300: + start_time = time.time() + response = requests.request(method, url, headers=headers, json=body if body else None, timeout=0.5) # BUG FIX #3 - Timeout added + response_time = time.time() - start_time + + # BUG FIX #3 (continued): Check both status code and response time + if 200 <= response.status_code < 300 and response_time <= 0.5: + logger.debug(f"{endpoint.get('name', url)} is UP - Status: {response.status_code}, Time: {response_time:.3f}s") return "UP" else: + reason = f"Status: {response.status_code}" if response_time <= 0.5 else f"Timeout: {response_time:.3f}s" + logger.debug(f"{endpoint.get('name', url)} is DOWN - {reason}") return "DOWN" - except requests.RequestException: + except requests.RequestException as e: + logger.debug(f"{endpoint.get('name', url)} is DOWN - Error: {str(e)}") # BUG FIX #5 - Better error logging return "DOWN" + +def extract_domain(url): + parsed_url = urlparse(url) + return parsed_url.netloc.split(':')[0] # BUG FIX #6 - Extract domain and strip port + # Main function to monitor endpoints def monitor_endpoints(file_path): config = load_config(file_path) 
domain_stats = defaultdict(lambda: {"up": 0, "total": 0}) - + + logger.info(f"Starting monitoring of {len(config)} endpoints") + while True: - for endpoint in config: - domain = endpoint["url"].split("//")[-1].split("/")[0] - result = check_health(endpoint) - - domain_stats[domain]["total"] += 1 - if result == "UP": - domain_stats[domain]["up"] += 1 - - # Log cumulative availability percentages + start_time = time.time() + check_results = [] + + # ENHANCEMENT 1 - Added concurrency using ThreadPoolExecutor + with concurrent.futures.ThreadPoolExecutor() as executor: + futures = [executor.submit(check_health, endpoint) for endpoint in config] + + for endpoint, future in zip(config, futures): + domain = extract_domain(endpoint.get("url", "")) or "unknown" # Tolerate a missing URL (see BUG FIX #4) + result = future.result() + + domain_stats[domain]["total"] += 1 + if result == "UP": + domain_stats[domain]["up"] += 1 + + check_results.append({ + "domain": domain, + "name": endpoint.get("name", endpoint.get("url", "unknown")), + "result": result + }) + + logger.info("--- Health Check Results ---") for domain, stats in domain_stats.items(): availability = round(100 * stats["up"] / stats["total"]) - print(f"{domain} has {availability}% availability percentage") - - print("---") - time.sleep(15) + logger.info(f"{domain} has {availability}% availability percentage") + + # ENHANCEMENT 3: Maintain 15-second intervals between checks + elapsed_time = time.time() - start_time + sleep_time = max(0, 15 - elapsed_time) + logger.debug(f"Check cycle completed in {elapsed_time:.2f}s, sleeping for {sleep_time:.2f}s") # ENHANCEMENT 4: Improved Logging for Success and Errors + time.sleep(sleep_time) # Entry point of the program if __name__ == "__main__": import sys if len(sys.argv) != 2: - print("Usage: python monitor.py ") + print("Usage: python main.py ") sys.exit(1) config_file = sys.argv[1] diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..b3663cb --- /dev/null +++ b/requirements.txt @@ -0,0 +1,2 @@ +requests +PyYAML \ No
newline at end of file
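
Review note on BUG FIX #6: `extract_domain` in the `main.py` hunk strips the port by splitting `netloc` on `:`. That works for plain `host:port` URLs, but it would likely misreport URLs that carry credentials or an IPv6 literal. A minimal sketch of one alternative, assuming such URLs can appear in a configuration file, uses the `hostname` attribute returned by `urlparse`. The name `extract_hostname` is hypothetical and not part of this change.

```python
from urllib.parse import urlparse


def extract_hostname(url: str) -> str:
    """Return the host of `url` without port, credentials, or IPv6 brackets.

    Hypothetical helper for illustration; it is not part of the change above.
    """
    # `hostname` is None when the URL has no network location (for example,
    # when the scheme is missing), so fall back to an empty string.
    return urlparse(url).hostname or ""


if __name__ == "__main__":
    for url in (
        "https://example.com:8443/health",
        "https://user:secret@example.com/health",
        "http://[::1]:8080/health",
    ):
        print(url, "->", extract_hostname(url))
```

Note that `hostname` also lowercases the host, so `EXAMPLE.com` and `example.com` would be grouped under one domain in the availability stats. Whether that grouping is wanted here is a design choice for the author.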