We hope you're eagerly anticipating a new challenge 😉
Today, you will gain skills in using Bloom filters and the HyperLogLog algorithm to solve problems.
This homework consists of two independent tasks.
The first task—checking password uniqueness—will prepare you for developing systems that handle large-scale data, where memory efficiency and processing are critical.
In the second task, you will learn to work with different approaches for counting unique elements in large datasets, including both exact methods and approximate algorithms like HyperLogLog. You will also gain experience analyzing algorithm performance using execution time metrics. This task will help you develop data optimization skills and apply modern methods for efficient problem-solving.
Let's get started! 💪🏼
Create a function to check password uniqueness using a Bloom filter. This function should determine whether a password has been used before without storing the actual passwords.
- Implement a
BloomFilter
class that supports adding elements to the filter and checking for their existence. - Implement a
check_password_uniqueness
function that utilizes aBloomFilter
instance to check a list of new passwords for uniqueness. It should return the validation results for each password. - Ensure proper handling of all data types. Passwords should be processed as plain strings without hashing. Empty or invalid values must also be considered and handled appropriately.
- The function and class should be capable of handling large datasets while using minimal memory.
📌 The acceptance criteria for this homework are mandatory for review by the
mentor. If any of these criteria are not met, the mentor will return the
assignment for revision without evaluation.
If you need "just a clarification" 😉 or get "stuck" at any stage—reach out
to your mentor on Slack.
- The
BloomFilter
class correctly implements Bloom filter logic (20 points). - The
check_password_uniqueness
function verifies new passwords using the provided filter (20 points). - The code executes as expected based on given usage examples (10 points).
if __name__ == "__main__":
# Initialize Bloom filter
bloom = BloomFilter(size=1000, num_hashes=3)
# Add existing passwords
existing_passwords = ["password123", "admin123", "qwerty123"]
for password in existing_passwords:
bloom.add(password)
# Check new passwords
new_passwords_to_check = ["password123", "newpassword", "admin123", "guest"]
results = check_password_uniqueness(bloom, new_passwords_to_check)
# Print results
for password, status in results.items():
print(f"Password '{password}' - {status}.")
Create a script to compare exact unique element counting with the HyperLogLog approximation.
- Load a dataset from a real log file (
lms-stage-access.log
) containing IP address information. - Implement a method for exact unique IP address counting using a
set
structure. - Implement a method for approximate unique IP address counting using HyperLogLog.
- Compare both methods in terms of execution time.
- The data loading method processes the log file while ignoring incorrect lines (10 points).
- The exact counting function returns the correct number of unique IP addresses (10 points).
- HyperLogLog provides an approximation within an acceptable error margin (10 points).
- The comparison results are presented in a table format (10 points).
- The code is scalable for large datasets (10 points).
Comparison Results:
Method | Exact Counting | HyperLogLog |
---|---|---|
Unique Elements | 100000.0 | 99652.0 |
Execution Time (s) | 0.45 | 0.1 |