Feature Release Plan 2021 05 30
Vinay Kulkarni edited this page Jun 3, 2021
-
Label based network policies – Security Groups for IaaS network
- Owner: Hong Chang
- Summary: Insert a label (uint/uint64) into the packet's GENEVE header options / encap IP options to speed up ingress policy enforcement.
- Status:
- Requirement clear, 5/30 scope identified.
- Design doc complete and reviewed - See Label-based Network Policies.
- Coding complete and checked in.
- Integration testing in progress.
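A minimal sketch of what carrying the label as a Geneve option could look like. It follows the option TLV layout in RFC 8926 (16-bit option class, 8-bit type, 3 reserved bits, 5-bit length counted in 4-byte words, then the option data). The option class and type values are placeholders, not the values Mizar uses, and the real encoding lives in Mizar's XDP C code; this Python version only illustrates the byte layout.

```python
import struct

# Placeholder identifiers: this plan does not say which option class/type Mizar uses.
LABEL_OPT_CLASS = 0xFFFF  # hypothetical
LABEL_OPT_TYPE = 0x01     # hypothetical

_FORMATS = {4: "!I", 8: "!Q"}  # uint32 or uint64 label, network byte order


def encode_label_option(label: int, width: int = 8,
                        opt_class: int = LABEL_OPT_CLASS,
                        opt_type: int = LABEL_OPT_TYPE) -> bytes:
    """Encode a policy label as a single Geneve option TLV."""
    if width not in _FORMATS:
        raise ValueError("width must be 4 or 8 bytes")
    if not 0 <= label < 1 << (8 * width):
        raise ValueError("label does not fit in the requested width")
    data = struct.pack(_FORMATS[width], label)
    # The length field counts the option data in 4-byte words, excluding this
    # 4-byte header; the 3 reserved bits above it stay zero.
    header = struct.pack("!HBB", opt_class, opt_type, len(data) // 4)
    return header + data
```

For example, `encode_label_option(42)` yields 12 bytes: a 4-byte option header followed by the 8-byte label. The sender also has to account for the added bytes in the Geneve base header's Opt Len field, which counts all options in 4-byte words.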
-
Bandwidth QoS monitor & control (for container networking)
- Owner: Vinay
- Summary: Guarantee network bandwidth for high priority containers
- Status:
- Requirement clear, 5/30 scope OK.
- Design doc complete. See Bandwidth QoS for Kubernetes Pods
- Coding complete.
- Scoped down the deliverable to the core feature.
- High-priority pod bandwidth usage and low-priority quota adjustment moved to the next release.
- PR in review. Integration testing in progress.
- Implemented deployment of a K8s cluster in AWS with Mizar as the CNI.
- Single-YAML Mizar deployment does not work in my case (it works with a manually created kubeadm cluster). Investigating.
-
Mizar Stability and Robustness
- Owner: Phu
- Summary: Improvements to Mizar so that it works in a stable, robust manner for Arktos and upstream K8s.
- Investigate and fix issues to get Mizar single yaml deployment to work.
- Investigate and fix the broken Mizar CNI.
- Fix reliability issues affecting 'kind-setup.sh' and 'kind-setup.sh dev'.
- Investigate and fix issues causing coredns and lpp pods to not come up and run successfully.
- Status:
- Coding complete and checked in.
- Investigating corner-case issues.
-
Network Policy Support for Arktos
- Owner: Cathy
- Summary: Network Controller changes to add network policy support in Arktos. Design & implement.
- Status:
- Design doc complete and reviewed. See Mizar Network Policies Support in Arktos
- Prototyping in progress.
- Cut from 5/30 release due to unforeseen issues delaying schedule.
-
Mizar <--> Arktos improvements
- Owner: Community effort (C2C + our team)
- Summary: Identify bugs and improvements to get Mizar working for Arktos.
- Working on simplifying single-node Mizar setup for Arktos.
- Status:
- Not 5/30 scoped, on-going effort.
- Identified a few tasks C2C can help with and contribute to the Mizar/Arktos projects; engaging with them during weekly community meetings and on Slack.
-
Mizar Performance Test & eBPF/XDP offloading to NIC card - DE-PRIORITIZED FOR 5/30 DUE TO BLOCKING ISSUES
- Owner: Phu
- Summary: Compare Mizar performance (non-offloaded) with peer CNI solutions, e.g., Cilium and ovn-k8s.
- Metrics: scalability, latency (TODO: what else?)
- Status:
- 5/30 scope OK.
- Test setup identified.
- Driving collaboration with the USTC team, engaging with them during weekly meetings and on Slack.
- Potential contributions:
- Move egress traffic handling from the veth-pair host Rx to the host NIC to reduce Mizar's XDP memory footprint. USTC has started working on it.
- Add/enrich statistics gathering in Mizar. The USTC team is starting to work on this.
- Summary: Evaluate Netronome NIC support for Mizar eBPF/XDP offload. If feasible, design & implement. Compare with DPDK.
- Status:
- 5/30 at risk: external dependency, large task.
- NICs arrived and installed; switch configured.
- Working on offloading Mizar eBPF to the NIC.
-
Stretch goal: Make Mizar work for scaleout architecture
- Owner: TBD
Vinay:
- Investigating Google EDT for Bandwidth QoS project
- Set up a talk to go over the k8s components used by arktos-up.
- Set up a session to understand Mizar and how things flow through it.
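For reference, EDT (Earliest Departure Time) rate limiting as used by Cilium's bandwidth manager works per packet at the TC egress hook: compute how long the packet occupies the rate-limited link, stamp the packet with the resulting departure time, and let the fq qdisc hold it until then; packets that would be scheduled beyond a drop horizon are dropped. The sketch below is a Python model of that scheduling step only, loosely following Cilium's logic. The names are illustrative and this is not Mizar code.

```python
from dataclasses import dataclass
from typing import Optional

NSEC_PER_SEC = 1_000_000_000


@dataclass
class EdtState:
    rate_bytes_per_sec: int  # configured rate limit
    horizon_ns: int          # drop packets scheduled this far (or more) ahead
    t_last_ns: int = 0       # departure time given to the previous packet


def schedule_departure(state: EdtState, pkt_len: int, now_ns: int) -> Optional[int]:
    """Return the packet's earliest departure time in ns, or None to drop it."""
    # Time this packet occupies the rate-limited link.
    delay_ns = pkt_len * NSEC_PER_SEC // state.rate_bytes_per_sec
    t_next = state.t_last_ns + delay_ns
    if t_next <= now_ns:
        # Flow is under its rate: send now and restart the schedule from now.
        state.t_last_ns = now_ns
        return now_ns
    if t_next - now_ns >= state.horizon_ns:
        return None  # backlog would exceed the horizon: drop
    state.t_last_ns = t_next
    return t_next
```

In the kernel version the returned time is written into the packet's `skb->tstamp`, which is why this mechanism needs an sk_buff (TC hook) rather than an XDP buffer.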
Phu:
- Working on understanding the data plane and DPDK.
Cathy:
- Working on Arktos<-->Mizar bugs
Hong:
- Design doc for Net policy WIP, ETA 03/15
Vinay:
- Upload the arktos-up & kube-up overview and share.
- Design doc for QoS and colocation of high/low pri pods.
- Use a mutating webhook to add anti-affinity to pods annotated as high network priority.
- Investigating how to use EDT.
- Change sync meetings to Tue & Thu @ 2pm instead of M-W-F
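A minimal sketch of the mutating-webhook idea: for pods annotated as high network priority, return a JSON Patch that adds a required pod anti-affinity term. The key `network-priority` and the "at most one high-priority pod per node" policy are assumptions for illustration, not decisions recorded in this plan. Because anti-affinity selectors match labels rather than annotations, the sketch also assumes these pods carry a matching label. The HTTPS server and TLS setup a real webhook needs are omitted.

```python
import base64
import json

# Hypothetical key; the plan does not name the annotation or label to use.
PRIORITY_KEY = "network-priority"
REQUIRED = "requiredDuringSchedulingIgnoredDuringExecution"


def anti_affinity_patch(pod: dict) -> list:
    """Build a JSON Patch that adds pod anti-affinity to high-priority pods."""
    annotations = (pod.get("metadata") or {}).get("annotations") or {}
    if annotations.get(PRIORITY_KEY) != "high":
        return []  # leave other pods untouched
    term = {
        # Selectors match labels, so this assumes high-priority pods also
        # carry the label PRIORITY_KEY=high.
        "labelSelector": {"matchLabels": {PRIORITY_KEY: "high"}},
        # Assumed policy: at most one high-priority pod per node.
        "topologyKey": "kubernetes.io/hostname",
    }
    affinity = (pod.get("spec") or {}).get("affinity")
    if not affinity:
        return [{"op": "add", "path": "/spec/affinity",
                 "value": {"podAntiAffinity": {REQUIRED: [term]}}}]
    anti = affinity.get("podAntiAffinity")
    if not anti:
        return [{"op": "add", "path": "/spec/affinity/podAntiAffinity",
                 "value": {REQUIRED: [term]}}]
    if not anti.get(REQUIRED):
        return [{"op": "add", "path": f"/spec/affinity/podAntiAffinity/{REQUIRED}",
                 "value": [term]}]
    # Append to the pod's existing list of required terms.
    return [{"op": "add", "path": f"/spec/affinity/podAntiAffinity/{REQUIRED}/-",
             "value": term}]


def admission_response(uid: str, patch: list) -> dict:
    """Wrap the patch in an admission.k8s.io/v1 AdmissionReview response."""
    response = {"uid": uid, "allowed": True}
    if patch:
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview",
            "response": response}
```

Choosing `required...` rather than `preferred...` anti-affinity makes the scheduler refuse placement instead of merely avoiding it; which one fits the QoS design is an open question for the design doc.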
Phu:
- Investigating how to load the XDP program onto an offload-capable NIC. Add a CLI option for it.
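iproute2 can already attach an XDP object in a chosen mode (`xdpgeneric`, `xdpdrv`, or `xdpoffload`), so a CLI option could map onto those keywords. The helper below is a hypothetical sketch that only builds the command line; the function name, object file, and section name are placeholders, and Mizar's actual loader may attach programs through libbpf instead.

```python
from typing import List

# iproute2 (ip-link) keywords for each XDP attach mode.
_MODE_KEYWORD = {
    "generic": "xdpgeneric",  # skb-based hook, works on any netdev
    "native": "xdpdrv",       # driver-level hook
    "offload": "xdpoffload",  # program runs on the NIC itself
}


def xdp_attach_cmd(dev: str, obj: str, mode: str = "native",
                   section: str = "xdp") -> List[str]:
    """Build an `ip link` command that attaches an XDP object in `mode`."""
    try:
        keyword = _MODE_KEYWORD[mode]
    except KeyError:
        raise ValueError(f"unknown XDP mode: {mode!r}") from None
    return ["ip", "link", "set", "dev", dev, keyword,
            "object", obj, "section", section]
```

Usage would be something like `subprocess.run(xdp_attach_cmd("eth1", "prog.o", "offload"), check=True)`, run as root on a host whose NIC and driver support offload.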
Cathy:
- Bisecting Arktos master CLs to narrow down where the Mizar CNI integration broke.
Hong:
- Working on adding E2E workflow diagram.
- Need pointers on how GENEVE is used today in Mizar.
- Check whether Phu's video or Sherif's KC talk has this info.
Vinay:
- Upload the arktos-up & kube-up overview and share.
- Design doc for QoS and colocation of high/low priority pods.
- Going through workflow of CNI network interface addition in Mizar.
- The team uses kind; check which k8s version it uses.
- Try k8s with Mizar in GCE using the k8s version kind uses, which is known to work.
- Investigating how to use EDT.
- Change sync meetings to Tue & Thu @ 2pm instead of M-W-F
Phu:
- Out sick today.
Cathy:
- The issue exists in the July version of Arktos as well.
- Create an issue with all details of the CLs we tried, and loop in me and XiaoNing.
Hong:
- Hong to schedule design review meeting for Monday.
- Need pointers on how GENEVE is used today in Mizar.
- Find and send a video on overlay networks.
- Check whether Phu's video or Sherif's KC talk has this info.
Vinay:
- Design doc for QoS and colocation of high/low priority pods.
- Going through workflow of CNI network interface addition in Mizar.
- The team uses kind; check which k8s version it uses.
- Investigating how to use EDT.
- Trying Cilium EDT code.
Phu:
- CLI update to install the XDP program in the kernel vs. offloading it to the NIC.
- NICs may arrive tomorrow.
Cathy:
- Hongwei fixed containerd version, it works now.
- Network controller changes to add network policy to arktos.
- Working on the design doc; need to identify the Arktos changes required to work with network policy in Mizar.
Hong:
- Update design doc with details of E2E flow.
Vinay:
- Design doc for QoS in progress.
- Trying out the XDP tutorial.
- Trying out a tc prototype to limit bandwidth.
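A tc-based limiter such as the `tbf` qdisc is a token-bucket shaper: tokens accrue at the configured rate up to a burst size, and a packet may be sent only when enough tokens are available. The Python model below illustrates that conformance check. Unlike `tbf`, which queues a non-conforming packet until tokens accumulate (dropping only once its queue limit is exceeded), this sketch simply reports whether the packet conforms right now.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    rate: float          # refill rate in bytes per second
    burst: float         # bucket capacity in bytes
    last: float = 0.0    # time of the previous check, in seconds
    tokens: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.tokens = self.burst  # start with a full bucket

    def conforms(self, size: int, now: float) -> bool:
        """Refill for the elapsed time, then spend `size` tokens if available."""
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= size:
            self.tokens -= size
            return True
        return False
```

With `rate=1000.0` and `burst=1500.0`, a 1500-byte packet at t=0 conforms and empties the bucket, a 1000-byte packet at t=0.5 does not (only 500 tokens), and the same packet at t=1.0 does.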
Phu:
- CLI update done. PR out.
- NICs received and dropped off at the office.
- Starting to look into how to run the perf test.
- Vinay to look into what switch to buy. Prefer to keep it local. Check with David on our own switch install.
Cathy:
- Arktos keeps sending pod create requests without reason - investigating.
- Send Cathy the Arktos PR submission pre-checks.
Hong:
- Update design doc with details of E2E flow.
- Set up a follow-up design doc review.
Vinay:
- Design doc for QoS in progress.
- Investigating how to leverage linux TC after encapsulating outgoing packet at veth Rx.
- Working with David to get the cabling needed to hook up the Netronome NICs.
Phu:
- Investigating DPDK setup for comparison with XDP.
- Proposed test-setup:
- 1 master, 2 workers [physical servers with 1 netronome NIC each] = total of 3 baremetal servers
- Identify upstream version of k8s to use for cluster.
- Compare Mizar vs. OVN vs. Cilium vs. DPDK.
Cathy:
- Arktos keeps sending pod create requests without reason - still investigating.
- Arktos PR - hit CI issues. No progress yet. Not urgent, postponed until later.
- Working on design doc for Network policy in Arktos. Review ETA tomorrow afternoon. #1 priority.
Hong:
- Update design doc with details of E2E flow.
- Struct change details identified.
Vinay:
- Going to office tomorrow to install NICs
Phu:
- Investigating DPDK setup for comparison with XDP.
- Working on perf test setup and plan.
Cathy:
- Network Policy Design Doc in review, looks promising.
Hong:
- Out today.
Vinay:
- Switch config for Phu - working now.
- Attending NSDI talks.
- Single-node deployment debugging - Pods stuck in container creating.
- VMware Fusion deployment (1-master, 2-worker cluster, Ubuntu 20.04 with the latest kernel) of Hongwei's Mizar YAML hit several issues.
- Arktos AWS deployment with Mizar support. (kube-up).. arktos aws-kube-up has regressed.
- Upstream k8s v1.19.2 + Ubuntu 20.10.
Phu:
- kubeadm setup worked -- v1.21 + Ubuntu 18.04 + updated kernel.
- Mizar deploy.mizar.components.yaml did not work. Working on a fix.
- Also hitting 'interface not found' in mizarcni.log.
- Loading XDP in offload mode is hitting issues.
Cathy:
- Prototyping the network policy changes.
- Blocked on a gRPC error. Hong Chang will help.
- Review Amit's change to simplify single-node arktos deployment for mizar.
Hong:
- Good progress implementing label plumbing to XDP. PR needs review.
- Vinay to review PR.
Vinay:
- Sent PR to fix docker image in kind-setup - fixes the issue USTC folks hit.
- Submitted PR (in my k8s fork) to deploy k8s with Mizar in AWS (Flannel works, mizar single-yaml currently does not)
- Investigating single-YAML in AWS & arktos-up. Phu also hit this issue.
- arktos-up seems currently broken.
- Reviewed Hong's PR, have some questions.
Phu:
- Out sick today.
Cathy:
- gRPC issue resolved. Progress is being made.
- Reviewed Amit's changes; will provide more comments to update the user guide.
Hong:
- Good progress implementing label plumbing to XDP. PR is in review.
- Discussed the idea of creating a new generic struct to hold Geneve options data rather than using endpoint struct.
Vinay:
- Working on fixing the single-yaml deployment. Pod-to-pod pings don't work.
- Mizar currently broken - regression. Need CI tests. Looking for commit that caused regression.
- Next: pick back up on Bandwidth QoS project.
Phu:
- Found issue with droplets not coming up in single Yaml. Sending PR today / tomorrow.
- This is blocking Phu as well.
- Will be mentoring Wei with kind-setup CI test task next week.
Cathy:
- Two PRs out for review.
- Working on converting arktos n/w policy to mizar for py.
Hong:
- Investigating creating a separate BPF map to store packet options info.
Vinay:
- PR to cleanup kind-setup merged.
- Plan to start working on QoS project soon.
Phu:
- Found issue with droplets not coming up in single Yaml.
- Fixed issue with droplet not being main interface.
- Looking into CNI issue.
- Will be mentoring Wei with kind-setup CI test task next week.
Cathy:
- PRs are merged.
- Working on converting arktos n/w policy to mizar for py.
Hong:
- Investigating creating a separate BPF map to store packet options.
- Need help with creating a new BPF map
Vinay:
- Started working on QoS project.
- Blocked on finding the underlying sk_buff in the XDP code (xdp_md / xdp_buff).
- Project proposal for Summer of Code.
Phu:
- Netronome does not support XDP_REDIRECT. This is a blocker for us.
- Found issue with droplets not coming up in single Yaml.
- Working on pods not being created. CNI ADD call is not handled correctly.
- Will be mentoring Wei with kind-setup CI test task next week.
Cathy:
- Out sick - getting COVID vaccine, maybe out tomorrow as well.
Hong:
- Reviewed PR. Fixing issues, adding unit tests.
Vinay:
- Started working on QoS project.
- The sk_buff is not yet created at the point where we intercept egress packets, so we cannot use the EDT mechanism the way Cilium and Google do (at the TC hook).
- Need a different approach - investigating.
- Project proposal for Summer of Code written up and reviewed.
Phu:
- Netronome does not support XDP_REDIRECT. This is a blocker for us.
- Found issue with droplets not coming up in single Yaml.
- Continuing to investigate why the CNI ADD call is not handled correctly.
Cathy:
- Working on JSON conversion from mizar <--> arktos.
Hong:
- Updated PR needs review.
- Vinay will try and review it today.
Vinay:
- Continuing work on QoS project. Going SLOW...
Phu:
- Node not ready issue fixed.
- Investigating another issue in pod create - CNI issue.
Cathy:
- Working on JSON conversion from mizar <--> arktos.
- Ingress rules fixed. Working on egress & podSelector rules.
- PR by EOW.
Hong:
- Updated PR for plumbing labels for egress processing.
- Working on XDP code to build Geneve frame with label options.
- Vinay will try and review it today.
Wei:
- Ramping up.
Vinay:
- Continuing work on QoS project.
Phu:
- Out today
Cathy:
- PR by EoW.
- Code for Arktos side is getting done.
- Minor changes needed for Mizar side.
Hong:
- PR for plumbing labels for egress processing done and submitted.
- XDP code to build Geneve frame with label options is working.
- Vinay will review XDP PR.
- Working on extracting labels from ingress packet and applying ingress policy processing.
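The ingress side is the inverse: walk the option TLVs that follow the Geneve base header and pull out the label. The Python model below only illustrates the parsing logic under the RFC 8926 option layout; in XDP C code the same loop needs a fixed upper bound and explicit bounds checks against `data_end` to pass the verifier. The option class and type are parameters because the values Mizar uses are not stated in this plan.

```python
import struct
from typing import Optional


def find_label(options: bytes, opt_class: int, opt_type: int) -> Optional[int]:
    """Return the uint32/uint64 label from the matching Geneve option, if any."""
    offset = 0
    while offset + 4 <= len(options):
        cls, typ, rlen = struct.unpack_from("!HBB", options, offset)
        data_len = (rlen & 0x1F) * 4  # low 5 bits: data length in 4-byte words
        start = offset + 4
        end = start + data_len
        if end > len(options):
            return None  # truncated option: treat the packet as unlabeled
        # Note: the high bit of the type byte is the "critical" flag; this
        # sketch simply compares the whole byte.
        if cls == opt_class and typ == opt_type:
            if data_len == 4:
                return struct.unpack_from("!I", options, start)[0]
            if data_len == 8:
                return struct.unpack_from("!Q", options, start)[0]
            return None  # unexpected size for a label
        offset = end
    return None
```

Returning `None` for a missing or malformed label leaves the pass/drop decision to the policy logic rather than guessing a default label.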
Wei:
- Ramping up.
Vinay:
- Continuing work on the QoS project. Added code to XDP_PASS low-priority pods' traffic and apply EDT.
- Facing various issues, investigating.
- Working on SoW metrics for C2C
Phu:
- Mentoring & ramping up Wei.
- Fixing the yaml. Targeting a PR by tomorrow.
Cathy:
- PR in review. Fixing feedback.
- Troubleshooting Mizar gRPC issue.
Hong:
- PR for plumbing labels for egress processing done and submitted.
- XDP code to build Geneve frame with label options is working.
- Vinay will review XDP PR. Review is done. Will merge today.
- Working on extracting labels from ingress packet and applying ingress policy processing.
Wei:
- Ramping up, no blocking issues.
Vinay:
- Partially working prototype of steering low priority traffic into TC framework for EDT rate-limiting is done.
- Facing an issue with bpf_debug causing the agent XDP program to fail to load. Debugging.
Phu:
- Mentoring & ramping up Wei.
- UW presentation and zeta meeting presentation went very well.
- PR out for fixing yaml. Needs review. Fixing it for kind-setup.
Cathy:
- PR merged.
- Working on Mizar code changes, then testing.
Hong:
- PR for plumbing labels for egress processing done and submitted.
- Plumb the labels data to ingress XDP maps and do e2e tests.
Wei:
- Ramping up. Tried out arktos-mizar with Cathy's help in AWS.
Vinay:
- Partially working prototype of steering low priority traffic into TC framework for EDT rate-limiting is done.
- Filed issue for benchmark=True causing _agent XDP program to load. We can look at this after 5/30.
- Ping drop in the TC slow path was due to bridge calling into iptables. Disabled iptables and things work.
- Writing EDT program for slow path.
Phu:
- Mentoring & ramping up Wei.
- PR out for fixing yaml. Needs review. Fixed it for kind-setup.
- Resolve conflicts and update PR.
- Wei will help test the change in kind & real-cluster in AWS.
- Working on USTC tasks (tail-call to offloaded XDP program) & CI.
Cathy:
- PR merged.
- Mizar code changes done.
- Deployment scripts need changes to test the mizar and arktos changes together.
- Phu can help with it once done with his work.
Hong:
- Plumb the labels data to ingress XDP maps and do e2e tests.
- Running into issues with plumbing data to maps. Investigating.
Wei:
- Ramping up. Tried out arktos-mizar with Cathy's help in AWS.
- Will work with Phu to test kind-setup & deploy.yaml in AWS.
Vinay:
- Partially working prototype of steering low priority traffic into TC framework for EDT rate-limiting is done.
- Filed issue for benchmark=True causing _agent XDP program to load. We can look at this after 5/30.
- Ping drop in the TC slow path was due to bridge calling into iptables. Disabled iptables and things work.
- Writing EDT program for slow path.
- Send docker credentials to Phu for image update.
Phu:
- Mentoring & ramping up Wei.
- PR out for fixing yaml. Needs review. Fixed it for kind-setup.
- Resolve conflicts and update PR.
- Wei will help test the change in kind - this works.
- Wei to try it in real-cluster (latest k8s) in AWS.
- Working on USTC tasks (tail-call to offloaded XDP program) & CI.
- Working on CI test framework.
- Meeting with Peng Du.
Cathy:
- PR merged.
- Mizar code changes done.
- Deployment scripts need changes to test the mizar and arktos changes together.
- Phu can help with it once done with his work.
Hong:
- Plumb the labels data to ingress XDP maps and do e2e tests.
- PR ready to review. E2E testing completed.
Wei:
- Ramping up.
- Will work with Phu to test kind-setup & deploy.yaml in AWS with k8s latest release (kubeadm cluster).
Vinay:
- Partially working prototype of steering low priority traffic into TC framework for EDT rate-limiting is done.
- Working on code changes to add mizar bridge, attach tc edt program to eth0.
- Planning to send out a draft PR tomorrow.
- Send docker credentials to Phu for image update.
Phu:
- Mentoring & ramping up Wei.
- Wei to try it in real-cluster (latest k8s) in AWS.
- Working on USTC tasks (tail-call to offloaded XDP program) & CI.
- Working on CI test framework.
- Peng Du gave us an overview.
- Phu out Thu/Fri.
Cathy:
- Arktos deployment PR needs review.
- PR for param-diff issues needs review.
- PRs merged in arktos repo.
- Deployment scripts need changes to test the mizar and arktos changes together.
- Issue: the pod fetched from the store does not have a pod IP. Hong Chang suggested finding the pod's endpoint and taking the IP from there.
Hong:
- Plumb the labels data to ingress XDP maps and do e2e tests.
- PR merged.
- Will demo simple policy (calico example) on Friday.
- Review next PR that decides whether to pass or block traffic.
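One way to picture the pass/block decision: look up the destination endpoint's allowed source labels and compare them with the label carried in the packet. This is a Python model, not the PR's implementation. The map layout (destination IP to a set of allowed labels) is an assumption, and the default follows Kubernetes NetworkPolicy semantics, where an endpoint not selected by any ingress policy accepts all traffic.

```python
from enum import IntEnum
from typing import Dict, FrozenSet, Optional


class XdpAction(IntEnum):
    # Values match enum xdp_action in linux/bpf.h.
    XDP_ABORTED = 0
    XDP_DROP = 1
    XDP_PASS = 2


def ingress_verdict(allowed_labels: Dict[str, FrozenSet[int]],
                    dst_ip: str, src_label: Optional[int]) -> XdpAction:
    """Decide pass/drop for an ingress packet from its source label.

    allowed_labels maps a destination endpoint IP to the source labels its
    ingress policies allow (a hypothetical stand-in for a BPF map).
    """
    allowed = allowed_labels.get(dst_ip)
    if allowed is None:
        # No ingress policy selects this endpoint: allow all traffic.
        return XdpAction.XDP_PASS
    if src_label is not None and src_label in allowed:
        return XdpAction.XDP_PASS
    # Endpoint is isolated by a policy and the label is missing or not allowed.
    return XdpAction.XDP_DROP
```

An endpoint that is selected by a policy but has an empty allowed set drops all ingress traffic, which matches a default-deny policy.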
Wei:
- Working with Phu to test kind-setup & deploy.yaml in AWS with k8s latest release (kubeadm cluster).
- Recommend using t2.2xlarge and t3.2xlarge for future iterations.
- Experimenting with XDP offload in Netronome.
Vinay:
- Sent out PR 492 for review covering the low priority b/w limiting feature.
- Send docker credentials to Phu for image update.
Phu:
- Mentoring & ramping up Wei.
- Wei tried Phu's change (latest k8s) in AWS. It works now.
- Working on USTC tasks (tail-call to offloaded XDP program).
- Working on the CI test framework (Travis is moving to a paid model, so using GitHub Actions).
- Phu out Thu/Fri.
Cathy:
- Arktos deployment PR needs review.
- PR for param-diff issues needs review.
- Deployment scripts need changes to test mizar and arktos changes together.
- Issue: the pod fetched from the store does not have a pod IP. Hong Chang suggested finding the pod's endpoint and taking the IP from there.
Hong:
- Plumb the labels data to ingress XDP maps and do e2e tests. Found and fixed an issue.
- Will send updated PR.
- Review updated PR that decides whether to pass or block traffic.
- Plan to demo simple policy (calico example) on Friday is still on.
Wei:
- Working with Phu to test kind-setup & deploy.yaml in AWS with k8s latest release (kubeadm cluster) - DONE.
- Recommend using t2.2xlarge and t3.2xlarge for future iterations.
- Experimenting with XDP offload in Netronome.
- Help Vinay with TC EDT program bandwidth rate-limit configuration.
Vinay:
- Sent out PR 492 for review covering the low priority b/w limiting feature.
- Working on automated unit tests and e2e tests.
- I will schedule time for reviewing PR 492.
Phu:
- Working on USTC tasks (tail-call to offloaded XDP program).
- Working on the CI test framework (Travis is moving to a paid model, so using GitHub Actions).
Cathy:
- Arktos deployment PR needs review. Vinay will review by Thursday.
- PR for param-diff issues needs review. (This can wait post 5/30 release)
- (Post 5/30) Deployment scripts need changes to test mizar and arktos changes together.
- (Post 5/30) Issue: the pod fetched from the store does not have a pod IP. Hong Chang suggested finding the pod's endpoint and taking the IP from there.
Hong:
- Plumb the labels data to ingress XDP maps and do e2e test.
- PR in review. Cathy/Phu will review today.
- Demo on 5/28 went well. Thanks!
- Working on another round of E2E tests. Can this be automated (e.g., "make teste2e")?
Wei:
- Helping Vinay with TC EDT program bandwidth rate-limit configuration.
- Clean kind-setup works. Able to try out my code and verify iperf works, ping works.
- Modifying trn_edt_tc.c to figure out the disconnect between iperf-measured bandwidth and the configured rate-limit BPS.
- Add code to configure the egress bandwidth (bps) from user mode instead of a default/fixed value in kernel mode.
- Translating slides from Tencent.
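One common cause of an iperf-vs-rate-limit disconnect is units: iperf reports bits per second, so if the kernel program's rate is in bytes per second, mixing the two is off by a factor of 8. Protocol header overhead can also make iperf's measured goodput sit somewhat below the configured wire rate. The hypothetical user-mode helper below parses a bit rate such as "100M" and converts it to bytes per second before it would be handed to the kernel program; the accepted input format is an assumption.

```python
import re

_SI = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9}
_RATE_RE = re.compile(r"\s*(\d+(?:\.\d+)?)\s*([kmg]?)(?:bits?)?\s*", re.IGNORECASE)


def parse_bit_rate(text: str) -> int:
    """Parse a bit rate like '100M', '800kbit' or '1.5G' into bits per second."""
    m = _RATE_RE.fullmatch(text)
    if m is None:
        raise ValueError(f"unrecognized rate: {text!r}")
    return int(float(m.group(1)) * _SI[m.group(2).lower()])


def bits_to_bytes_per_sec(bits_per_sec: int) -> int:
    """Convert bits/s (what iperf reports) to bytes/s (a BPS rate limit)."""
    return bits_per_sec // 8
```

For example, a 100 Mbit/s target becomes 12,500,000 bytes per second; configuring 100,000,000 as a bytes-per-second limit would instead allow roughly 800 Mbit/s.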
Vinay:
- Sent out PR 496 - code cleanup.
- Working on e2e tests.
- I will schedule time for reviewing PR 492.
Phu:
- Updated the PR for the single YAML and Docker image.
- PR for the bug fixes
- Working on CI test framework with github actions.
- USTC collaboration (tail-call to offloaded XDP program): will do this next week.
Cathy:
- Arktos deployment PR needs review. Vinay will review by Thursday.
- PR for param-diff issues needs review. (This can wait post 5/30 release)
- Fixed failing network policy unit tests. PR ready for review.
- (Post 5/30) Deployment scripts need changes to test mizar and arktos changes together.
- (Post 5/30) Issue: the pod fetched from the store does not have a pod IP. Hong Chang suggested finding the pod's endpoint and taking the IP from there.
Hong:
- Plumb the labels data to ingress XDP maps and do e2e test.
- PR in review. Cathy/Phu will review today.
- Need to do another round of E2E test after Vinay's PR is merged.
Wei:
- Discussed Tencent slides
- Overview of QoS project I have worked on. Potentially work on the next phase.
- Deleting a pod and then re-creating the same pod does not work.
- Create a GitHub issue with repro steps and details.