
Migrate to sitrep mechanism for T5X and PAXML MGMN tests #401

Merged 54 commits from hemil/fix-badge-mgmn-tests into main on Dec 8, 2023.

Commits:
6d709be  Migrate to sitrep mechanism for T5X MGMN tests (hemildesai, Nov 28, 2023)
187cee2  Fix missing output file (hemildesai, Nov 29, 2023)
3328f46  Fix missing output dir (hemildesai, Nov 29, 2023)
b0fdebf  Upload artifacts (hemildesai, Nov 29, 2023)
6263408  bash fix (hemildesai, Nov 29, 2023)
ba4f05e  Download artifacts in sitrep (hemildesai, Nov 29, 2023)
cf6574b  Source to_json (hemildesai, Nov 29, 2023)
325277e  Checkout repo (hemildesai, Nov 29, 2023)
31b5809  change order (hemildesai, Nov 29, 2023)
dfd1a34  bash fix (hemildesai, Nov 29, 2023)
9d42afb  Write output at end (hemildesai, Nov 29, 2023)
fcaba7f  Fix sitrep output (hemildesai, Nov 29, 2023)
550e490  Finalize (hemildesai, Nov 29, 2023)
4dcf31b  Fix message and summary (hemildesai, Nov 29, 2023)
4dbe474  Publish badge (hemildesai, Nov 29, 2023)
8d73989  Remove publish-completion (hemildesai, Nov 29, 2023)
da3f66b  Refactor sitrep generation (hemildesai, Nov 29, 2023)
f882f2f  Comment out publish (hemildesai, Nov 29, 2023)
cd90c6a  Change name (hemildesai, Nov 29, 2023)
7af36b0  Revert changes to jobs (hemildesai, Nov 29, 2023)
3ff47ce  Add EOL (hemildesai, Nov 29, 2023)
abf6d4e  Add sitrep to pax mgmn (hemildesai, Nov 29, 2023)
fd5baf3  disable publish for full run t5x (hemildesai, Nov 29, 2023)
86a4f1d  reenable publish t5x (hemildesai, Nov 30, 2023)
9cd1ebd  PR feedback part 1 (hemildesai, Nov 30, 2023)
06794e2  Parameterize FW name (hemildesai, Nov 30, 2023)
eac4c8f  Fix bug (hemildesai, Nov 30, 2023)
e7f0d89  Fix bug (hemildesai, Nov 30, 2023)
c794476  fix bug (hemildesai, Nov 30, 2023)
e925149  fix bug (hemildesai, Nov 30, 2023)
6456d7e  fix bug (hemildesai, Nov 30, 2023)
319b1e1  fix bug (hemildesai, Nov 30, 2023)
ad70713  fix bug (hemildesai, Nov 30, 2023)
2951a49  Rewrite to_json for multiline variables (hemildesai, Nov 30, 2023)
b7219fb  Fix bug (hemildesai, Nov 30, 2023)
1db7ea6  Reenable t5x jobs (hemildesai, Nov 30, 2023)
d388d14  disable t5x mgmn downstream (hemildesai, Nov 30, 2023)
643624b  disable pax mgmn downstream (hemildesai, Nov 30, 2023)
0bf70f7  Fix numerics in to_json (hemildesai, Dec 1, 2023)
cf95235  Fix pax extra metrics (hemildesai, Dec 2, 2023)
cdc7a48  Fix badge message (hemildesai, Dec 3, 2023)
40aebf5  PR feedback (hemildesai, Dec 4, 2023)
276ac33  Merge branch 'main' into hemil/fix-badge-mgmn-tests (hemildesai, Dec 5, 2023)
ae25d1e  Fix (hemildesai, Dec 5, 2023)
21cf73b  Extract out FW name (hemildesai, Dec 5, 2023)
00d0e00  Fix indent (hemildesai, Dec 5, 2023)
827a6ce  Use correct field in metrics summary (hemildesai, Dec 5, 2023)
c89367d  Merge branch 'main' into hemil/fix-badge-mgmn-tests (hemildesai, Dec 5, 2023)
5f9f401  Fix (hemildesai, Dec 5, 2023)
70bea03  Merge branch 'main' into hemil/fix-badge-mgmn-tests (yhtang, Dec 7, 2023)
9f5d230  simplify job dependency and conditional execution (yhtang, Dec 7, 2023)
339a228  fix typo (yhtang, Dec 7, 2023)
b496f90  Merge branch 'main' into hemil/fix-badge-mgmn-tests (yhtang, Dec 7, 2023)
1e2389e  Merge branch 'main' into hemil/fix-badge-mgmn-tests (hemildesai, Dec 8, 2023)
2 changes: 1 addition & 1 deletion .github/workflows/_publish_t5x_pax_results.yaml
@@ -63,4 +63,4 @@ jobs:
[view metrics](https://${{ vars.HOSTNAME_TENSORBOARD }}/#scalars&regexInput=$(jq -nr --arg url "${FOLDER}" '$url|@uri')&_smoothingWeight=0&tagFilter=seqs_per)

EOF
) | tee $GITHUB_STEP_SUMMARY
) | tee $GITHUB_STEP_SUMMARY
134 changes: 134 additions & 0 deletions .github/workflows/_sitrep_mgmn.yaml
@@ -0,0 +1,134 @@
name: ~Generate sitrep for Multi-Node Multi-GPU tests

on:
workflow_call:
inputs:
BADGE_FILENAME:
type: string
description: 'Name of the endpoint JSON file for shields.io badge'
required: true
ARTIFACT_NAME:
type: string
description: 'Name of the artifact zip file'
required: true
FW_NAME:
type: string
description: 'Name of the framework being used'
required: true
outputs:
STATUS:
description: 'Summary of all tests run for the workflow. Set to "success" when all metrics per job and all jobs pass, whereas a single metric failure or job error sets the status to "failure"'
value: ${{ jobs.sitrep.outputs.STATUS }}

jobs:
sitrep:
runs-on: ubuntu-22.04
outputs:
STATUS: ${{ steps.gen-sitrep.outputs.STATUS }}
steps:
- name: Check out repository
uses: actions/checkout@v3

- name: Download all artifacts from the previous jobs
uses: actions/download-artifact@v3

- name: Write exit status summary
id: exit-status
shell: bash -x -e {0}
run: |
EXIT_STATUSES="${{ inputs.FW_NAME }}-${GITHUB_RUN_ID}-*/*-status.json"
EXIT_STATUS_SUMMARY_FILE="exit_status_summary.json"
echo -e "\n\n## ${{ inputs.FW_NAME }} MGMN+SPMD Test Status" >> $EXIT_STATUS_SUMMARY_FILE
cat <<EOF >>$EXIT_STATUS_SUMMARY_FILE
| Test Case | State | Exit Code |
| --- | --- | --- |
EOF

for i in $EXIT_STATUSES; do
# Files are named <FW_NAME>-<GHID>-<NAME>/<NAME>-status.json
echo "| $(echo $i | cut -d/ -f1 | awk -F- '{print $NF}') | $(jq -r .state $i) | $(jq -r .exitcode $i)"
done | tee -a $EXIT_STATUS_SUMMARY_FILE

echo "Test statuses:"
jq -rc 'input_filename,.' $EXIT_STATUSES

echo "EXIT_STATUS_SUMMARY_FILE=$EXIT_STATUS_SUMMARY_FILE" >> ${GITHUB_OUTPUT}
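The `cut`/`awk` pipeline in the step above derives each table row's test-case name from the status-file path. A standalone sketch of that extraction (the path below is a hypothetical example, not taken from a real run):

```shell
#!/bin/bash
# Status files are laid out as <FW_NAME>-<GITHUB_RUN_ID>-<NAME>/<NAME>-status.json.
# The summary table takes the directory component and keeps its last
# dash-separated field as the test-case name.
path="t5x-7053172928-8DP1TP1PP/8DP1TP1PP-status.json"   # hypothetical example
dir="$(echo "$path" | cut -d/ -f1)"                     # t5x-7053172928-8DP1TP1PP
case_name="$(echo "$dir" | awk -F- '{print $NF}')"      # last field: 8DP1TP1PP
echo "$case_name"
```

Note this assumes the test-case name itself contains no dashes; a dashed name would be truncated to its final segment by `$NF`.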

- name: Write metrics summary
id: metrics
shell: bash -x -e {0}
run: |
METRICS_SUMMARY_FILE="metrics_summary.json"
echo -e "\n\n## ${{ inputs.FW_NAME }} MGMN Test Metrics" >> $METRICS_SUMMARY_FILE
for i in metrics-test-log/*_metrics.json; do
echo $i | cut -d'.' -f1
echo '```json'
jq . $i
echo '```'
done | tee -a $METRICS_SUMMARY_FILE

echo "METRICS_SUMMARY_FILE=$METRICS_SUMMARY_FILE" >> ${GITHUB_OUTPUT}

- name: Generate sitrep
id: gen-sitrep
shell: bash -x -e {0}
run: |
source .github/workflows/scripts/to_json.sh

EXIT_STATUSES="${{ inputs.FW_NAME }}-${GITHUB_RUN_ID}-*/*-status.json"

passed_tests=$(jq -r '. | select ((.state == "COMPLETED") and (.exitcode == "0")) | .state' $EXIT_STATUSES | wc -l)
failed_tests=$(jq -r '. | select ((.state != "COMPLETED") or (.exitcode != "0")) | .state' $EXIT_STATUSES | wc -l)
total_tests=$(ls $EXIT_STATUSES | wc -l)

METRICS_LOG=metrics-test-log/report.jsonl
all_outcomes() {
cat $METRICS_LOG | jq -r '. | select((.["$report_type"] == "TestReport") and (.when == "call")) | .outcome'
}
cnt_type() {
cat $METRICS_LOG | jq '. | select((.["$report_type"] == "TestReport") and (.when == "call") and (.outcome | contains("'${1}'"))) | .outcome' | wc -l
}
pytest_failed_tests=$(cnt_type failed)
pytest_passed_tests=$(cnt_type passed)
pytest_total_tests=$(all_outcomes | wc -l)

if ([[ $failed_tests -eq 0 ]] && [[ $total_tests -gt 0 ]] && \
[[ $pytest_failed_tests -eq 0 ]] && [[ $pytest_total_tests -gt 0 ]]); then
status=success
badge_color=brightgreen
elif [[ $passed_tests -eq 0 ]] || [[ $pytest_passed_tests -eq 0 ]]; then
status=failure
badge_color=red
else
status=failure
badge_color=yellow
fi
badge_message="${passed_tests}/${total_tests} jobs | ${pytest_passed_tests}/${pytest_total_tests} metrics"

badge_label='Upstream Tests'
summary="# ${{ inputs.FW_NAME }} MGMN Test: $badge_message"
summary+=`cat ${{ steps.exit-status.outputs.EXIT_STATUS_SUMMARY_FILE }}`
summary+=`cat ${{ steps.metrics.outputs.METRICS_SUMMARY_FILE }}`

to_json \
summary \
total_tests passed_tests failed_tests \
badge_label badge_color badge_message \
> sitrep.json

schemaVersion=1 \
label="${badge_label}" \
message="${badge_message}" \
color="${badge_color}" \
to_json schemaVersion label message color \
> ${{ inputs.BADGE_FILENAME }}

echo "STATUS='${status}'" >> ${GITHUB_OUTPUT}
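The status/color decision in the step above can be isolated as a small function. A sketch under the same rules (success only when every SLURM job and every metric test passes and at least one of each ran; red when no jobs or no metric tests passed at all; yellow for a partial failure):

```shell
#!/bin/bash
# decide <passed> <failed> <total> <py_passed> <py_failed> <py_total>
# Mirrors the badge logic: brightgreen only on a clean sweep, red when
# nothing passed in either category, yellow otherwise (still a failure).
decide() {
  local passed=$1 failed=$2 total=$3 py_passed=$4 py_failed=$5 py_total=$6
  if [[ $failed -eq 0 && $total -gt 0 && $py_failed -eq 0 && $py_total -gt 0 ]]; then
    echo "success brightgreen"
  elif [[ $passed -eq 0 || $py_passed -eq 0 ]]; then
    echo "failure red"
  else
    echo "failure yellow"
  fi
}

r1=$(decide 3 0 3 10 0 10)   # all jobs and all metrics pass
r2=$(decide 0 3 3 0 10 10)   # no job passed
r3=$(decide 2 1 3 10 0 10)   # one job failed, metrics fine
echo "$r1"; echo "$r2"; echo "$r3"
```

This matches the documented semantics of the workflow's STATUS output: a single metric failure or job error is enough to report "failure".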

- name: Upload artifacts
uses: actions/upload-artifact@v3
with:
name: ${{ inputs.ARTIFACT_NAME }}
path: |
sitrep.json
${{ inputs.BADGE_FILENAME }}
121 changes: 39 additions & 82 deletions .github/workflows/_test_pax.yaml
@@ -13,15 +13,25 @@ on:
description: Extra command line args to pass to test-pax.sh
default: ""
required: false
BADGE_FILENAME:
type: string
description: 'Name of the endpoint JSON file for shields.io badge'
required: false
default: 'badge-pax-mgmn-test.json'
ARTIFACT_NAME:
type: string
description: If provided, will prepend a prefix to the artifact name. Helpful if re-running this reusable workflow to prevent clobbering of artifacts
default: ""
description: 'Name of the artifact zip file'
required: false
default: 'artifact-pax-mgmn-test'
FW_NAME:
type: string
description: 'Name of the framework being used'
required: false
default: 'pax'
outputs:
TEST_STATUS:
description: 'Summary pass/fail value indicating if results from tests are acceptable'
value: ${{ jobs.publish-test.outputs.STATUS }}
value: ${{ jobs.sitrep.outputs.STATUS }}
yhtang (Collaborator), Nov 30, 2023:

This is similar to our current practice, which uses a final postprocessing job to determine whether the overall MGMN test suite succeeds. It has the problem that error feedback is delayed, making it difficult to investigate which individual tests failed.

For the new sitrep reporting system, let's do the following:

  • Individual jobs should fail immediately if the tests they ran do not succeed. However, they should still collect and generate test artifacts regardless of whether the tests pass or fail. The continue-on-error option can be helpful here.
  • The sitrep/postprocessing step should then run regardless of the success/fail status of the individual test jobs.

This helps to localize the error status for easier debugging and tracking.

hemildesai (Contributor, Author):

If a single job fails, what should be the outcome of the entire workflow? Also, I think we still need this output for downstream jobs like triage or publish container, since those depend on this overall outcome.

hemildesai (Contributor, Author):

For the MGMN use case, the tests are submitted to Slurm via SSH. To mark an individual job as a failure I can inspect the exit code and mark it accordingly. We would still need the overall status, though; if there's any other recommended way to get this output, I'm happy to incorporate it.

yhtang (Collaborator):

Makes sense. How about we make individual jobs fail while also letting the overall status be the Boolean AND of the individual job states?

hemildesai (Contributor, Author):

Actually, a single job failure is already indicated (ref: https://github.com/NVIDIA/JAX-Toolbox/actions/runs/7053172928/job/19199718748). The STATUS has to be derived not only from the job result but also from its metrics, so it is not possible to set the status based on the job states alone.

yhtang (Collaborator):

At face value, the specific error message from the job that you provided looks coincidental. I understand that it may be the result of the true underlying error, but the message itself does not clearly indicate that. Could we improve on it?

yhtang (Collaborator), Dec 4, 2023:

Also, could you please add a natural-language description of the logic regarding "The STATUS has to be derived not only from the job result, but also from its metrics", so that it serves as a form of documentation?

hemildesai (Contributor, Author):

Regarding the error message: not sure if there's a way to provide a more informative error based on the failure; we might have to look at it outside the scope of this PR.

hemildesai (Contributor, Author), Dec 4, 2023:

Regarding the natural-language description of the STATUS logic: done in 40aebf5.

jobs:

@@ -63,7 +73,7 @@ jobs:
MAX_GPUS_PER_NODE=8
NODES=1
GPUS_PER_NODE=8
JOB_NAME=${{ inputs.ARTIFACT_NAME }}pax-${GITHUB_RUN_ID}-${TEST_CASE_NAME}
JOB_NAME=${{ inputs.FW_NAME }}-${GITHUB_RUN_ID}-${TEST_CASE_NAME}
LOG_FILE=/nfs/cluster/${JOB_NAME}.log
MODEL_PATH=/nfs/cluster/${JOB_NAME}
for var in IMAGE TEST_CASE_NAME TOTAL_TASKS NODES GPUS_PER_NODE JOB_NAME LOG_FILE MODEL_PATH; do
@@ -74,7 +84,7 @@
shell: bash -O expand_aliases -x -e {0}
run: |
alias sshx='ssh -o "ServerAliveInterval 7" ${{ secrets.CLUSTER_LOGIN_USER }}@${{ vars.HOSTNAME_SLURM_LOGIN }}'
sshx "date && hostname && sinfo"
sshx "date && hostname && sinfo"
sshx mkdir -p ${{ steps.meta.outputs.MODEL_PATH }}
JOB=$(sshx sbatch --parsable << EOF
#!/bin/bash
@@ -129,7 +139,7 @@ jobs:
output/ || true
rsync -rtz --progress \
output/ \
${{ secrets.TENSORBOARD_UPLOAD_USER }}@${{ vars.HOSTNAME_TENSORBOARD }}:/tensorboard-logs/${{ inputs.ARTIFACT_NAME }}pax-${GITHUB_RUN_ID}/ || true
${{ secrets.TENSORBOARD_UPLOAD_USER }}@${{ vars.HOSTNAME_TENSORBOARD }}:/tensorboard-logs/${{ inputs.FW_NAME }}-${GITHUB_RUN_ID}/ || true
- name: Write SLURM job status to file
shell: bash -x -e {0}
run: |
@@ -139,7 +149,7 @@
dump = {'state': "${{ steps.submit.outputs.SLURM_STATE }}", 'exitcode': "${{ steps.submit.outputs.SLURM_EXITCODE }}"}
json.dump(dump, f)
EOF

- name: Upload training logs as artifacts
uses: actions/upload-artifact@v3
with:
@@ -191,7 +201,7 @@ jobs:
NODES=$(((TOTAL_TASKS+MAX_GPUS_PER_NODE-1)/MAX_GPUS_PER_NODE))
GPUS_PER_NODE=$((TOTAL_TASKS/NODES))

JOB_NAME=${{ inputs.ARTIFACT_NAME }}pax-${GITHUB_RUN_ID}-${TEST_CASE_NAME}
JOB_NAME=${{ inputs.FW_NAME }}-${GITHUB_RUN_ID}-${TEST_CASE_NAME}
LOG_FILE=/nfs/cluster/${JOB_NAME}.log
MODEL_PATH=/nfs/cluster/${JOB_NAME}
for var in IMAGE TEST_CASE_NAME TOTAL_TASKS NODES GPUS_PER_NODE JOB_NAME LOG_FILE MODEL_PATH; do
@@ -203,7 +213,7 @@
shell: bash -O expand_aliases -x -e {0}
run: |
alias sshx='ssh -o "ServerAliveInterval 7" ${{ secrets.CLUSTER_LOGIN_USER }}@${{ vars.HOSTNAME_SLURM_LOGIN }}'
sshx "date && hostname && sinfo"
sshx "date && hostname && sinfo"
sshx mkdir -p ${{ steps.meta.outputs.MODEL_PATH }}
JOB=$(sshx sbatch --parsable << EOF
#!/bin/bash
@@ -265,7 +275,7 @@ jobs:
output/ || true
rsync -rtz --progress \
output/ \
${{ secrets.TENSORBOARD_UPLOAD_USER }}@${{ vars.HOSTNAME_TENSORBOARD }}:/tensorboard-logs/${{ inputs.ARTIFACT_NAME }}pax-${GITHUB_RUN_ID}/ || true
${{ secrets.TENSORBOARD_UPLOAD_USER }}@${{ vars.HOSTNAME_TENSORBOARD }}:/tensorboard-logs/${{ inputs.FW_NAME }}-${GITHUB_RUN_ID}/ || true

- name: Write SLURM job status to file
shell: bash -x -e {0}
@@ -276,7 +286,7 @@
dump = {'state': "${{ steps.submit.outputs.SLURM_STATE }}", 'exitcode': "${{ steps.submit.outputs.SLURM_EXITCODE }}"}
json.dump(dump, f)
EOF

- name: Upload training logs as artifacts
uses: actions/upload-artifact@v3
with:
@@ -321,7 +331,7 @@ jobs:
NODES=1
GPUS_PER_NODE=8

JOB_NAME=${{ inputs.ARTIFACT_NAME }}pax-${GITHUB_RUN_ID}-${TEST_CASE_NAME}
JOB_NAME=${{ inputs.FW_NAME }}-${GITHUB_RUN_ID}-${TEST_CASE_NAME}
LOG_FILE=/nfs/cluster/${JOB_NAME}.log
MODEL_PATH=/nfs/cluster/${JOB_NAME}
for var in IMAGE TEST_CASE_NAME TOTAL_TASKS NODES GPUS_PER_NODE JOB_NAME LOG_FILE MODEL_PATH; do
@@ -333,7 +343,7 @@
shell: bash -O expand_aliases -x -e {0}
run: |
alias sshx='ssh -o "ServerAliveInterval 7" ${{ secrets.CLUSTER_LOGIN_USER }}@${{ vars.HOSTNAME_SLURM_LOGIN }}'
sshx "date && hostname && sinfo"
sshx "date && hostname && sinfo"
sshx mkdir -p ${{ steps.meta.outputs.MODEL_PATH }}
JOB=$(sshx sbatch --parsable << EOF
#!/bin/bash
@@ -396,7 +406,7 @@ jobs:
output/ || true
rsync -rtz --progress \
output/ \
${{ secrets.TENSORBOARD_UPLOAD_USER }}@${{ vars.HOSTNAME_TENSORBOARD }}:/tensorboard-logs/${{ inputs.ARTIFACT_NAME }}pax-${GITHUB_RUN_ID}/ || true
${{ secrets.TENSORBOARD_UPLOAD_USER }}@${{ vars.HOSTNAME_TENSORBOARD }}:/tensorboard-logs/${{ inputs.FW_NAME }}-${GITHUB_RUN_ID}/ || true

- name: Write SLURM job status to file
shell: bash -x -e {0}
@@ -407,7 +417,7 @@
dump = {'state': "${{ steps.submit.outputs.SLURM_STATE }}", 'exitcode': "${{ steps.submit.outputs.SLURM_EXITCODE }}"}
json.dump(dump, f)
EOF

- name: Upload training logs as artifacts
uses: actions/upload-artifact@v3
with:
Expand All @@ -429,83 +439,30 @@ jobs:
shell: bash -x {0}
run: |
pip install pytest pytest-reportlog tensorboard
for i in ${{ inputs.ARTIFACT_NAME }}pax-${GITHUB_RUN_ID}-*; do
for i in ${{ inputs.FW_NAME }}-${GITHUB_RUN_ID}-*; do
SUBDIR=$(echo $i | cut -d'-' -f3)
mv $i/$SUBDIR* .
python3 .github/workflows/baselines/summarize_metrics.py $SUBDIR # create result json in baseline format
done

echo '## PAX MGMN Test Metrics' >> $GITHUB_STEP_SUMMARY
for i in *_metrics.json; do
echo $i | cut -d'.' -f1
echo '```json'
jq . $i
echo '```'
done | tee -a $GITHUB_STEP_SUMMARY

RESULTS_DIR=$PWD BASELINES_DIR=PAX_MGMN/upstream pytest --report-log=report.jsonl .github/workflows/baselines/test_pax_mgmn_metrics.py || true

- name: Upload metrics test json logs
uses: actions/upload-artifact@v3
with:
name: metrics-test-log
path: report.jsonl
path: |
report.jsonl
*_metrics.json


publish-test:
sitrep:
needs: [single-process-multi-device, pax-multi-node, single-process-evaluation, metrics]
uses: ./.github/workflows/_publish_badge.yaml
if: ( always() )
secrets: inherit
if: success() || failure()
uses: ./.github/workflows/_sitrep_mgmn.yaml
with:
ENDPOINT_FILENAME: '${{ inputs.ARTIFACT_NAME }}pax-test-status.json'
PUBLISH: false
SCRIPT: |
EXIT_STATUSES="${{ inputs.ARTIFACT_NAME }}pax-${GITHUB_RUN_ID}-*DP*FSDP*TP*PP*/*-status.json"
PASSED_TESTS=$(jq -r '. | select ((.state == "COMPLETED") and (.exitcode == "0")) | .state' $EXIT_STATUSES | wc -l)
FAILED_TESTS=$(jq -r '. | select ((.state != "COMPLETED") or (.exitcode != "0")) | .state' $EXIT_STATUSES | wc -l)
TOTAL_TESTS=$(ls $EXIT_STATUSES | wc -l)

cat <<EOF >>$GITHUB_STEP_SUMMARY
## Pax MGMN+SPMD Test Status
| Test Case | State | Exit Code |
| --- | --- | --- |
EOF
for i in $EXIT_STATUSES; do
# Files are named pax-<GHID>-<NAME>/<NAME>-status.json
echo "| $(echo $i | cut -d/ -f1 | cut -d- -f3) | $(jq -r .state $i) | $(jq -r .exitcode $i)"
done | tee -a $GITHUB_STEP_SUMMARY

echo "Test statuses:"
jq -rc 'input_filename,.' $EXIT_STATUSES

METRICS_LOG=metrics-test-log/report.jsonl
all_outcomes() {
cat $METRICS_LOG | jq -r '. | select((.["$report_type"] == "TestReport") and (.when == "call")) | .outcome'
}
cnt_type() {
cat $METRICS_LOG | jq '. | select((.["$report_type"] == "TestReport") and (.when == "call") and (.outcome | contains("'${1}'"))) | .outcome' | wc -l
}
PYTEST_FAILED_TESTS=$(cnt_type failed)
PYTEST_PASSED_TESTS=$(cnt_type passed)
PYTEST_TOTAL_TESTS=$(all_outcomes | wc -l)

if ([[ $FAILED_TESTS -eq 0 ]] && [[ $TOTAL_TESTS -gt 0 ]] && \
[[ $PYTEST_FAILED_TESTS -eq 0 ]] && [[ $PYTEST_TOTAL_TESTS -gt 0 ]]); then
STATUS=success
BADGE_COLOR=brightgreen
elif [[ $PASSED_TESTS -eq 0 ]] || [[ $PYTEST_PASSED_TESTS -eq 0 ]]; then
STATUS=failure
BADGE_COLOR=red
else
STATUS=failure
BADGE_COLOR=yellow
fi
echo "STATUS='${STATUS}'" >> ${GITHUB_OUTPUT}
echo "LABEL='Completion'" >> $GITHUB_OUTPUT
echo "MESSAGE='${PASSED_TESTS}/${TOTAL_TESTS} ran ${PYTEST_PASSED_TESTS}/${PYTEST_TOTAL_TESTS} pass loss+perf'" >> $GITHUB_OUTPUT
echo "COLOR='${BADGE_COLOR}'" >> $GITHUB_OUTPUT

BADGE_FILENAME: ${{ inputs.BADGE_FILENAME }}
ARTIFACT_NAME: ${{ inputs.ARTIFACT_NAME }}
FW_NAME: ${{ inputs.FW_NAME }}

summary:
runs-on: ubuntu-22.04
@@ -518,18 +475,18 @@

## PAX MGMN training

[view metrics](https://${{ vars.HOSTNAME_TENSORBOARD }}/#scalars&regexInput=${{ inputs.ARTIFACT_NAME }}pax-${GITHUB_RUN_ID}&_smoothingWeight=0&tagFilter=seqs_per)
[view metrics](https://${{ vars.HOSTNAME_TENSORBOARD }}/#scalars&regexInput=${{ inputs.FW_NAME }}-${GITHUB_RUN_ID}&_smoothingWeight=0&tagFilter=seqs_per)

EOF
) | tee $GITHUB_STEP_SUMMARY

outcome:
needs: publish-test
needs: sitrep
yhtang (Collaborator), Nov 30, 2023:

This job would be unnecessary if we localize the error status to individual test jobs. However, there may still be situations where it is needed, e.g. if some checks use the collective results of many/all test jobs.

hemildesai (Contributor, Author):

I think this job is still needed to publish the overall badge, since that is a collection of the results of all tests.

yhtang (Collaborator):

Got it.
runs-on: ubuntu-22.04
if: ( always() )
steps:
- name: Sets workflow status based on test outputs
- name: Sets workflow status based on test outputs
run: |
if [[ ${{ needs.publish-test.outputs.STATUS }} != success ]]; then
if [[ ${{ needs.sitrep.outputs.STATUS }} != 'success' ]]; then
exit 1
fi
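One subtlety in the gate above: the sitrep step writes `STATUS='${status}'` to GITHUB_OUTPUT, so the stored output value includes literal single quotes. The comparison still works because `${{ ... }}` is substituted textually into the script before bash runs it, and bash then strips the quotes during quote removal. A sketch simulating that substitution:

```shell
#!/bin/bash
# The stored output is the word success wrapped in literal single quotes.
stored="'success'"
# Textual substitution yields: [[ 'success' != 'success' ]] && ... || ...
script="[[ $stored != 'success' ]] && echo failure || echo ok"
outcome=$(eval "$script")   # bash strips the quotes, so the values compare equal
echo "$outcome"
```

This pattern is fragile (an unquoted value containing spaces or metacharacters would break the generated script), which is one reason to prefer writing unquoted values to GITHUB_OUTPUT and quoting at the comparison site instead.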