From ecf56ff98f6b11ed9e6fa2e1d702bf0423483230 Mon Sep 17 00:00:00 2001
From: Rashmi Banthia
Date: Sun, 6 Oct 2024 22:15:11 -0400
Subject: [PATCH] m2

---
 _site/assets/js/search-data.json |  2 +-
 _site/milestone2/index.html      |  2 +-
 milestone2.md                    | 53 ++++++++++++++++++++++++++++++--
 3 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/_site/assets/js/search-data.json b/_site/assets/js/search-data.json
index 302560c..42f45de 100644
--- a/_site/assets/js/search-data.json
+++ b/_site/assets/js/search-data.json
@@ -169,7 +169,7 @@
 },"24": {
 "doc": "Milestone 2",
 "title": "Milestone 2",
- "content": "Milestone 2 . Coming soon! . Key dates: . | Due date: Oct 18th | . ",
+ "content": "Milestone 2: MLOps Infrastructure & Advanced Training Workflows . Building Atomic Containers, Versioned Data Pipelines, and Scalable Computing Solutions . This milestone focuses on establishing the core infrastructure necessary for an MLOps pipeline. Teams are expected to create functional environments, containerized components, and a versioned data management strategy to ensure their work is reproducible and scalable. For teams utilizing Large Language Models (LLMs), the emphasis is on setting up a RAG workflow, including data chunking and integration with a vector database. Teams focusing on computer vision or other modalities will fine-tune models and conduct experiments to optimize performance. By the end of this milestone, teams will have built foundational elements for their project, enabling integration of components and supporting the continued evolution of their models and applications. They will also be required to create a mock-up of their final application, either refining or extending previous submissions. Key dates: . | Due date: Oct 18th | . Template Repository . Objectives: . Virtual Environment Setup: Virtual machines and environments tailored to support containerized components must be fully implemented. This should include detailed documentation on the setup process. Deliverables: . | A screenshot of the running instances in the cloud or local environment. | . Containerized Components: All individual project components should be containerized using Docker, ensuring atomicity and isolation. Each container must perform a specific function (e.g., data scraping, preprocessing, data labeling) and be ready for integration into the project architecture. Deliverables: . | Dockerfiles for each container and build instructions | Pipfiles for package management within each container | Shell scripts or docker-compose.yml for orchestration, if multiple containers need to be run together | Documentation explaining the purpose of each container and instructions for running them. | . Versioned Data Strategy: Implement a data versioning strategy using tools like DVC or other suitable solutions. If feasible, this strategy should also be containerized to ensure portability and reproducibility of data processes. (Optional but recommended) . Deliverables: . | Documentation on the data versioning strategy chosen (e.g., DVC) and why | A working containerized version of the data versioning pipeline (if applicable) | Version control history showing tracked datasets, along with their respective versions, commits, and logs. | . Teams Utilizing LLMs (Large Language Models): Teams working with LLMs should implement a RAG setup. This setup should include data collection, chunking into appropriate sizes for processing, and the integration of a vector database. Teams should also fine-tune models and document the experimentation process. Deliverables: . | A containerized RAG pipeline, including scripts for data chunking, vectorization, and integration with a vector database | Documentation of the fine-tuning process, including datasets used, hyperparameters, and models | Experiment logs showing model performance across different fine-tuning and RAG configurations. | . Teams Focusing on Computer Vision or Other Modalities: Teams working on computer vision or other modalities (e.g., audio, time series) should focus on creating a robust data pipeline, fine-tuning models for their respective tasks, and experimenting with different model architectures. Deliverables: . | A containerized pipeline for data ingestion and preprocessing | Model fine-tuning scripts with detailed documentation on hyperparameters, datasets, and model versions | Experiment logs, including results of different models, architectures, or techniques used. | . Mock-up of the Application: A working prototype or mock-up of the final application that integrates all project components. Teams that have already submitted this in Milestone 1 should refine or extend their prototype based on feedback or new progress. Deliverables: . | An application mock-up or wireframe, including user interface elements and how the app will interact with back-end components | . ",
 "url": "/milestone2/",
 "relUrl": "/milestone2/"
diff --git a/_site/milestone2/index.html b/_site/milestone2/index.html
index 6facafc..7ad992b 100644
--- a/_site/milestone2/index.html
+++ b/_site/milestone2/index.html
@@ -1 +1 @@
- Milestone 2 | AC215

Milestone 2

Coming soon!

Key dates:

  • Due date: Oct 18th
+ Milestone 2 | AC215

Milestone 2: MLOps Infrastructure & Advanced Training Workflows

Building Atomic Containers, Versioned Data Pipelines, and Scalable Computing Solutions

This milestone focuses on establishing the core infrastructure necessary for an MLOps pipeline. Teams are expected to create functional environments, containerized components, and a versioned data management strategy to ensure their work is reproducible and scalable.

For teams utilizing Large Language Models (LLMs), the emphasis is on setting up a RAG workflow, including data chunking and integration with a vector database. Teams focusing on computer vision or other modalities will fine-tune models and conduct experiments to optimize performance.

By the end of this milestone, teams will have built foundational elements for their project, enabling integration of components and supporting the continued evolution of their models and applications. They will also be required to create a mock-up of their final application, either refining or extending previous submissions.

Key dates:

  • Due date: Oct 18th

Template Repository

Objectives:

Virtual Environment Setup: Virtual machines and environments tailored to support containerized components must be fully implemented. This should include detailed documentation on the setup process.

Deliverables:

  • A screenshot of the running instances in the cloud or local environment.

Containerized Components: All individual project components should be containerized using Docker, ensuring atomicity and isolation. Each container must perform a specific function (e.g., data scraping, preprocessing, data labeling) and be ready for integration into the project architecture.

Deliverables:

  • Dockerfiles for each container and build instructions
  • Pipfiles for package management within each container
  • Shell scripts or docker-compose.yml for orchestration, if multiple containers need to be run together
  • Documentation explaining the purpose of each container and instructions for running them.

Versioned Data Strategy: Implement a data versioning strategy using tools like DVC or other suitable solutions. If feasible, this strategy should also be containerized to ensure portability and reproducibility of data processes. (Optional but recommended)

Deliverables:

  • Documentation on the data versioning strategy chosen (e.g., DVC) and why
  • A working containerized version of the data versioning pipeline (if applicable)
  • Version control history showing tracked datasets, along with their respective versions, commits, and logs.
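
As a rough illustration of what a data versioning tool keeps track of, the sketch below records a content hash per file and derives one version id for the whole dataset, so any change to the data produces a new, identifiable version. This is not DVC's API: `file_digest`, `build_manifest`, and `dataset_version` are hypothetical helper names, and SHA-256 with a 12-character id is an arbitrary choice made for the example.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path, block_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict[str, str]:
    """Map each file's relative path (POSIX style) to its content hash."""
    return {
        path.relative_to(data_dir).as_posix(): file_digest(path)
        for path in sorted(data_dir.rglob("*"))
        if path.is_file()
    }


def dataset_version(manifest: dict[str, str]) -> str:
    """Derive one short version id from the whole manifest (hypothetical scheme)."""
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()[:12]
```

Committing the manifest (or the small metadata files a real tool writes) alongside the code is what produces the version history asked for in the deliverables; the data itself can live in remote storage.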

Teams Utilizing LLMs (Large Language Models): Teams working with LLMs should implement a RAG setup. This setup should include data collection, chunking into appropriate sizes for processing, and the integration of a vector database. Teams should also fine-tune models and document the experimentation process.

Deliverables:

  • A containerized RAG pipeline, including scripts for data chunking, vectorization, and integration with a vector database
  • Documentation of the fine-tuning process, including datasets used, hyperparameters, and models
  • Experiment logs showing model performance across different fine-tuning and RAG configurations.
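
The chunking and retrieval steps of a RAG setup can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not a required design: `chunk_words`, `embed`, and `InMemoryIndex` are hypothetical names, the word-count "embedding" is a toy stand-in for a real embedding model, and the in-memory list stands in for whichever vector database a team chooses.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of `size` words; consecutive chunks share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryIndex:
    """Toy stand-in for a vector database: keeps (chunk, vector) pairs in a list."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, Counter]] = []

    def add(self, chunks: list[str]) -> None:
        self._rows.extend((chunk, embed(chunk)) for chunk in chunks)

    def query(self, question: str, k: int = 3) -> list[str]:
        """Return the k stored chunks most similar to the question."""
        q = embed(question)
        ranked = sorted(self._rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]
```

Chunk size and overlap are exactly the kind of settings worth varying and recording in the experiment logs, since they change which passages the retriever can return.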

Teams Focusing on Computer Vision or Other Modalities: Teams working on computer vision or other modalities (e.g., audio, time series) should focus on creating a robust data pipeline, fine-tuning models for their respective tasks, and experimenting with different model architectures.

Deliverables:

  • A containerized pipeline for data ingestion and preprocessing
  • Model fine-tuning scripts with detailed documentation on hyperparameters, datasets, and model versions
  • Experiment logs, including results of different models, architectures, or techniques used.
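
Experiment logs do not need heavy tooling to be useful. The sketch below, assuming a plain JSON Lines file, appends one record per training run and picks the best run by a chosen metric. `log_run` and `best_run` are hypothetical helpers and the record layout is an assumption; a dedicated experiment tracker would serve the same purpose.

```python
import json
import time
from pathlib import Path


def log_run(log_path: Path, params: dict, metrics: dict) -> dict:
    """Append one experiment record as a JSON line and return it."""
    record = {"timestamp": time.time(), "params": params, "metrics": metrics}
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def best_run(log_path: Path, metric: str, higher_is_better: bool = True) -> dict:
    """Read every logged run and return the one with the best value of `metric`."""
    with log_path.open(encoding="utf-8") as handle:
        runs = [json.loads(line) for line in handle if line.strip()]
    pick = max if higher_is_better else min
    return pick(runs, key=lambda run: run["metrics"][metric])
```

Logging the hyperparameters, dataset version, and model name in `params` keeps each result traceable to the configuration that produced it.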

Mock-up of the Application: A working prototype or mock-up of the final application that integrates all project components. Teams that have already submitted this in Milestone 1 should refine or extend their prototype based on feedback or new progress.

Deliverables:

  • An application mock-up or wireframe, including user interface elements and how the app will interact with back-end components
diff --git a/milestone2.md b/milestone2.md
index 69969df..be0c8e3 100644
--- a/milestone2.md
+++ b/milestone2.md
@@ -5,10 +5,59 @@ parent: Projects
 nav_order: 2
 ---
-### Milestone 2
+### Milestone 2: _MLOps Infrastructure & Advanced Training Workflows_
+**Building Atomic Containers, Versioned Data Pipelines, and Scalable Computing Solutions**
-Coming soon!
+This milestone focuses on establishing the core infrastructure necessary for an MLOps pipeline. Teams are expected to create functional environments, containerized components, and a versioned data management strategy to ensure their work is reproducible and scalable.
+
+For teams utilizing Large Language Models (LLMs), the emphasis is on setting up a RAG workflow, including data chunking and integration with a vector database. Teams focusing on computer vision or other modalities will fine-tune models and conduct experiments to optimize performance.
+
+By the end of this milestone, teams will have built foundational elements for their project, enabling integration of components and supporting the continued evolution of their models and applications. They will also be required to create a mock-up of their final application, either refining or extending previous submissions.
 ### Key dates:
 - Due date: Oct 18th
+
+### [Template Repository](https://github.com/ac2152024/ac2152024_template/tree/milestone2)
+
+### Objectives:
+
+**Virtual Environment Setup:** Virtual machines and environments tailored to support containerized components must be fully implemented. This should include detailed documentation on the setup process.
+
+Deliverables:
+- A screenshot of the running instances in the cloud or local environment.
+
+**Containerized Components:** All individual project components should be containerized using Docker, ensuring atomicity and isolation. Each container must perform a specific function (e.g., data scraping, preprocessing, data labeling) and be ready for integration into the project architecture.
+
+Deliverables:
+- Dockerfiles for each container and build instructions
+- Pipfiles for package management within each container
+- Shell scripts or docker-compose.yml for orchestration, if multiple containers need to be run together
+- Documentation explaining the purpose of each container and instructions for running them.
+
+
+**Versioned Data Strategy:** Implement a data versioning strategy using tools like DVC or other suitable solutions. If feasible, this strategy should also be containerized to ensure portability and reproducibility of data processes. (Optional but recommended)
+
+Deliverables:
+- Documentation on the data versioning strategy chosen (e.g., DVC) and why
+- A working containerized version of the data versioning pipeline (if applicable)
+- Version control history showing tracked datasets, along with their respective versions, commits, and logs.
+
+**Teams Utilizing LLMs (Large Language Models):** Teams working with LLMs should implement a RAG setup. This setup should include data collection, chunking into appropriate sizes for processing, and the integration of a vector database. Teams should also fine-tune models and document the experimentation process.
+
+ Deliverables:
+ - A containerized RAG pipeline, including scripts for data chunking, vectorization, and integration with a vector database
+ - Documentation of the fine-tuning process, including datasets used, hyperparameters, and models
+ - Experiment logs showing model performance across different fine-tuning and RAG configurations.
+
+**Teams Focusing on Computer Vision or Other Modalities:** Teams working on computer vision or other modalities (e.g., audio, time series) should focus on creating a robust data pipeline, fine-tuning models for their respective tasks, and experimenting with different model architectures.
+
+Deliverables:
+- A containerized pipeline for data ingestion and preprocessing
+- Model fine-tuning scripts with detailed documentation on hyperparameters, datasets, and model versions
+- Experiment logs, including results of different models, architectures, or techniques used.
+
+**Mock-up of the Application:** A working prototype or mock-up of the final application that integrates all project components. Teams that have already submitted this in Milestone 1 should refine or extend their prototype based on feedback or new progress.
+
+Deliverables:
+- An application mock-up or wireframe, including user interface elements and how the app will interact with back-end components
\ No newline at end of file