refactor: improve experimental source code pattern analysis of pypi packages #965

Open
wants to merge 23 commits into
base: staging
23 commits
0bd7992
refactor: refactoring existing source code analysis functionality
art1f1c3R Jan 17, 2025
307d815
build: updated project to include semgrep as an experimental dependency
art1f1c3R Jan 20, 2025
b3d94bd
refactor: support for semgrep as the code analysis tool
art1f1c3R Jan 23, 2025
5b95eca
fix: entire source code is no longer stored in memory
art1f1c3R Jan 23, 2025
f90a4c2
feat: support for semgrep rules, currently two implemented, with cust…
art1f1c3R Jan 30, 2025
f0136a7
test: setup test environment for source code analyzer
art1f1c3R Feb 3, 2025
2cb17d5
test: finished sample test files for obfuscation rules
art1f1c3R Feb 4, 2025
4650aca
fix: obfuscation tests were incorrect
art1f1c3R Feb 4, 2025
d15646e
test: tests for exfiltration and fixes to semgrep rules
art1f1c3R Feb 4, 2025
e027004
test: testing for invalid pathways in defaults configuration
art1f1c3R Feb 5, 2025
49e4b84
feat: dependency on empty project link, and context manager for sourc…
art1f1c3R Feb 5, 2025
e79098b
chore: added pre-commit hook for sourcecode sample files execution pe…
art1f1c3R Feb 5, 2025
7cb796f
fix: path outputs are now relative to package, making tests work and …
art1f1c3R Feb 6, 2025
489e858
fix: semgrep now only runs open-source functionality, and disabled th…
art1f1c3R Feb 6, 2025
7e42ecc
test: added experimental feature to main malware check, tests updated…
art1f1c3R Feb 11, 2025
da0853a
chore: updated pre-commit hook to only consider tracked files
art1f1c3R Feb 12, 2025
c162b40
chore: added oss only to semgrep validate
art1f1c3R Feb 12, 2025
8933aa6
chore: removed old code
art1f1c3R Feb 24, 2025
16f92da
feat: updated semgrep rules to reduce false positives based on ICSE25…
art1f1c3R Feb 27, 2025
e99d4dc
test: fixed broken tests for semgrep rules
art1f1c3R Feb 27, 2025
386481e
fix: obfuscation rules has updated socket patterns
art1f1c3R Feb 27, 2025
df90484
feat: added new, refined inline imports rule back in
art1f1c3R Feb 27, 2025
df42e78
docs: made API docs and updated malware analyzer README
art1f1c3R Feb 27, 2025
20 changes: 19 additions & 1 deletion .pre-commit-config.yaml
@@ -1,4 +1,4 @@
- # Copyright (c) 2022 - 2024, Oracle and/or its affiliates. All rights reserved.
+ # Copyright (c) 2022 - 2025, Oracle and/or its affiliates. All rights reserved.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.

# See https://pre-commit.com for more information
@@ -30,6 +30,7 @@ repos:
- id: isort
name: Sort import statements
args: [--settings-path, pyproject.toml]
exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*

# Add Black code formatters.
- repo: https://github.com/ambv/black
@@ -38,6 +39,7 @@ repos:
- id: black
name: Format code
args: [--config, pyproject.toml]
exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
- repo: https://github.com/asottile/blacken-docs
rev: 1.16.0
hooks:
@@ -64,6 +66,7 @@ repos:
name: Check flake8 issues
files: ^src/macaron/|^tests/
types: [text, python]
exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
additional_dependencies: [flake8-bugbear==22.10.27, flake8-builtins==2.0.1, flake8-comprehensions==3.10.1, flake8-docstrings==1.6.0, flake8-mutable==1.2.0, flake8-noqa==1.3.0, flake8-pytest-style==1.6.0, flake8-rst-docstrings==0.3.0, pep8-naming==0.13.2]
args: [--config, .flake8]

@@ -82,6 +85,7 @@ repos:
entry: pylint
language: python
files: ^src/macaron/|^tests/
exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
types: [text, python]
args: [--rcfile, pyproject.toml]

@@ -94,6 +98,7 @@ repos:
language: python
files: ^src/macaron/|^tests/
types: [text, python]
exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
args: [--show-traceback, --config-file, pyproject.toml]

# Check for potential security issues.
@@ -106,6 +111,7 @@ repos:
files: ^src/macaron/|^tests/
types: [text, python]
additional_dependencies: ['bandit[toml]']
exclude: ^tests/malware_analyzer/pypi/resources/sourcecode_samples.*
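The same `exclude` pattern is added to the isort, black, flake8, pylint, mypy and bandit hooks. As a rough illustration of which paths it covers, here is a minimal Python sketch that applies the pattern with `re.search`, which is how pre-commit documents its file patterns as being matched. The example paths are made up for illustration and are not taken from the PR.

```python
import re

# The exclude pattern added to the linting hooks in this diff.
EXCLUDE = re.compile(r"^tests/malware_analyzer/pypi/resources/sourcecode_samples.*")


def is_excluded(path: str) -> bool:
    """Return True if a repository-relative path matches the exclude pattern."""
    return EXCLUDE.search(path) is not None


# Hypothetical paths, used only to show the matching behaviour.
for candidate in (
    "tests/malware_analyzer/pypi/resources/sourcecode_samples/obfuscation/sample.py",
    "tests/malware_analyzer/pypi/test_analyzer.py",
    "tests/malware_analyzer/pypi/resources/sourcecode_samples_old/x.py",
):
    print(candidate, is_excluded(candidate))
```

Because the pattern has no trailing slash, it also matches sibling paths that merely start with `sourcecode_samples`, such as the third path above. That may or may not be intended.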

# Enable a whole bunch of useful helper hooks, too.
# See https://pre-commit.com/hooks.html for more hooks.
@@ -197,6 +203,18 @@ repos:
always_run: true
pass_filenames: false

# Checks that tests/malware_analyzer/pypi/resources/sourcecode_samples files do not have executable permissions
# This is another measure to make sure the files can't be accidentally executed
- repo: local
hooks:
- id: sourcecode-sample-permissions
name: Sourcecode sample executable permissions checker
entry: scripts/dev_scripts/samples_permissions_checker.sh
language: system
always_run: true
pass_filenames: false


# A linter for Golang
- repo: https://github.com/golangci/golangci-lint
rev: v1.61.0
1 change: 1 addition & 0 deletions .semgrepignore
@@ -0,0 +1 @@
# Items added to this file will be ignored by Semgrep.
2 changes: 1 addition & 1 deletion Makefile
@@ -205,7 +205,7 @@ upgrade: .venv/upgraded-on
.venv/upgraded-on: pyproject.toml
python -m pip install --upgrade pip
python -m pip install --upgrade wheel
- python -m pip install --upgrade --upgrade-strategy eager --editable .[actions,dev,docs,hooks,test,test-docker]
+ python -m pip install --upgrade --upgrade-strategy eager --editable .[actions,dev,docs,hooks,test,test-docker,experimental]
$(MAKE) upgrade-quiet
force-upgrade:
rm -f .venv/upgraded-on
@@ -9,6 +9,14 @@ macaron.malware\_analyzer.pypi\_heuristics.sourcecode package
Submodules
----------

macaron.malware\_analyzer.pypi\_heuristics.sourcecode.pypi\_sourcecode\_analyzer module
---------------------------------------------------------------------------------------

.. automodule:: macaron.malware_analyzer.pypi_heuristics.sourcecode.pypi_sourcecode_analyzer
:members:
:undoc-members:
:show-inheritance:

macaron.malware\_analyzer.pypi\_heuristics.sourcecode.suspicious\_setup module
------------------------------------------------------------------------------

12 changes: 6 additions & 6 deletions pyproject.toml
@@ -1,4 +1,4 @@
- # Copyright (c) 2022 - 2024, Oracle and/or its affiliates. All rights reserved.
+ # Copyright (c) 2022 - 2025, Oracle and/or its affiliates. All rights reserved.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.

# https://flit.pypa.io/en/latest/pyproject_toml.html
@@ -105,6 +105,10 @@ test-docker = [
"ruamel.yaml >=0.18.6,<1.0.0",
]

experimental = [
"semgrep == 1.102.0",
]
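Since Semgrep is only installed through the new `experimental` extra, code that relies on it may want to check that it is present first. The sketch below shows one way to do that with `importlib.metadata`. The helper names are hypothetical and are not part of this PR; only the pinned version comes from the diff.

```python
from importlib import metadata
from typing import Optional

# Version pinned by the `experimental` extra in this diff.
PINNED_SEMGREP = "1.102.0"


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def semgrep_matches_pin() -> bool:
    """Return True only if semgrep is installed at exactly the pinned version."""
    return installed_version("semgrep") == PINNED_SEMGREP


if __name__ == "__main__":
    print("semgrep installed:", installed_version("semgrep"))
    print("matches pin:", semgrep_matches_pin())
```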

[project.urls]
Homepage = "https://github.com/oracle/macaron"
Changelog = "https://github.com/oracle/macaron/blob/main/CHANGELOG.md"
@@ -118,12 +122,10 @@ Issues = "https://github.com/oracle/macaron/issues"
tests = []
skips = ["B101"]


# https://github.com/psf/black#configuration
[tool.black]
line-length = 120


# https://github.com/commitizen-tools/commitizen
# https://commitizen-tools.github.io/commitizen/bump/
[tool.commitizen]
@@ -168,7 +170,6 @@ exclude = [
"SECURITY.md",
]


# https://pycqa.github.io/isort/
[tool.isort]
profile = "black"
@@ -179,7 +180,6 @@ skip_gitignore = true

# https://mypy.readthedocs.io/en/stable/config_file.html#using-a-pyproject-toml
[tool.mypy]
# exclude=
show_error_codes = true
show_column_numbers = true
check_untyped_defs = true
@@ -206,7 +206,6 @@ module = [
]
ignore_missing_imports = true


# https://pylint.pycqa.org/en/latest/user_guide/configuration/index.html
[tool.pylint.MASTER]
fail-under = 10.0
@@ -258,6 +257,7 @@ addopts = """-vv -ra --tb native \
--doctest-modules --doctest-continue-on-failure --doctest-glob '*.rst' \
--cov macaron \
--ignore tests/integration \
--ignore tests/malware_analyzer/pypi/resources/sourcecode_samples \
""" # Consider adding --pdb
# https://docs.python.org/3/library/doctest.html#option-flags
doctest_optionflags = "IGNORE_EXCEPTION_DETAIL"
20 changes: 20 additions & 0 deletions scripts/dev_scripts/samples_permissions_checker.sh
@@ -0,0 +1,20 @@
#!/usr/bin/env bash

# Copyright (c) 2022 - 2025, Oracle and/or its affiliates. All rights reserved.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.

#
# Checks if the files in tests/malware_analyzer/pypi/resources/sourcecode_samples have executable permissions,
# failing if any do.
#

MACARON_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && cd ../.. && pwd)"
SAMPLES_PATH="${MACARON_DIR}/tests/malware_analyzer/pypi/resources/sourcecode_samples"

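# Collect the tracked files that have any of the executable bits set: find lists the executable files,
# git ls-files lists the tracked files, and uniq -d keeps only the paths that appear in both lists.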
executables=$( ( find "$SAMPLES_PATH" -type f -perm -u+x -o -type f -perm -g+x -o -type f -perm -o+x | sed "s|$MACARON_DIR/||"; git ls-files "$SAMPLES_PATH" --full-name) | sort | uniq -d)
if [ -n "$executables" ]; then
echo "The following files should not have any executable permissions:"
echo "$executables"
exit 1
fi
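For readers less familiar with `find -perm`, the core of the check can be sketched in Python. This is an illustrative equivalent and not part of the PR. Unlike the shell script, it does not restrict the result to files tracked by git, and the function names are hypothetical.

```python
import stat
from pathlib import Path

# Any of the user, group, or other executable bits.
EXEC_BITS = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH


def find_executables(root: Path) -> list[Path]:
    """Return the regular files under root that have any executable bit set, sorted."""
    return sorted(p for p in root.rglob("*") if p.is_file() and p.stat().st_mode & EXEC_BITS)


def main(samples_dir: str) -> int:
    """Print offending files and return 1 if any exist, mirroring the script's exit code."""
    offenders = find_executables(Path(samples_dir))
    if offenders:
        print("The following files should not have any executable permissions:")
        for path in offenders:
            print(path)
        return 1
    return 0
```

On a POSIX system, `main("tests/malware_analyzer/pypi/resources/sourcecode_samples")` would return a non-zero status as soon as one sample file is marked executable.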
11 changes: 7 additions & 4 deletions src/macaron/__main__.py
@@ -1,4 +1,4 @@
- # Copyright (c) 2022 - 2024, Oracle and/or its affiliates. All rights reserved.
+ # Copyright (c) 2022 - 2025, Oracle and/or its affiliates. All rights reserved.
# Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl/.

"""This is the main entrypoint to run Macaron."""
@@ -173,7 +173,7 @@ def analyze_slsa_levels_single(analyzer_single_args: argparse.Namespace) -> None
analyzer_single_args.sbom_path,
deps_depth,
provenance_payload=prov_payload,
- validate_malware_switch=analyzer_single_args.validate_malware_switch,
+ analyze_source=analyzer_single_args.analyze_source,
)
sys.exit(status_code)

@@ -470,10 +470,13 @@ def main(argv: list[str] | None = None) -> None:
)

single_analyze_parser.add_argument(
- "--validate-malware-switch",
+ "--analyze-source",
required=False,
action="store_true",
- help=("Enable malware validation."),
+ help=(
+ "EXPERIMENTAL. For improved malware detection, analyze the source code of the"
+ + " (PyPI) package using a textual scan and dataflow analysis."
+ ),
)

# Dump the default values.
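To show how the renamed flag behaves, here is a small standalone `argparse` sketch. The parser layout is a simplified stand-in for Macaron's CLI (the `macaron-sketch` program name and `build_parser` helper are hypothetical). Only the `--analyze-source` option and its `store_true` action are taken from the diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with an `analyze` subcommand carrying the new flag."""
    parser = argparse.ArgumentParser(prog="macaron-sketch")
    subparsers = parser.add_subparsers(dest="action", required=True)
    analyze = subparsers.add_parser("analyze")
    analyze.add_argument(
        "--analyze-source",
        required=False,
        action="store_true",
        help="EXPERIMENTAL. Analyze the source code of the (PyPI) package.",
    )
    return parser


# argparse exposes the option as `analyze_source`; it is False unless the flag is given.
args = build_parser().parse_args(["analyze", "--analyze-source"])
print(args.analyze_source)
```

The dash-to-underscore conversion is why the call site in `analyze_slsa_levels_single` reads `analyzer_single_args.analyze_source`.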
4 changes: 4 additions & 0 deletions src/macaron/config/defaults.ini
@@ -594,3 +594,7 @@ major_threshold = 20
epoch_threshold = 3
# The number of days +/- the day of publish the calendar versioning day may be.
day_publish_error = 4

# Absolute path to the directory where a custom set of Semgrep rules for source code analysis is stored. These rules
# are used in addition to Macaron's default rules. The path will be normalised to the OS path type.
custom_semgrep_rules =
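A reader of this option has to handle three cases: an empty value, a relative path, and an absolute path that needs normalising. The following sketch shows one way to do that with `configparser`. The section name `heuristic.pypi` and the function name are assumptions made for illustration, since the diff does not show which section the option belongs to or how Macaron actually reads it.

```python
import configparser
import os
from typing import Optional


def load_custom_rules_path(ini_text: str, section: str = "heuristic.pypi") -> Optional[str]:
    """Return the normalised custom rules path, or None if the option is empty or missing."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    raw = parser.get(section, "custom_semgrep_rules", fallback="").strip()
    if not raw:
        # Empty value: only the default rules would be used.
        return None
    path = os.path.normpath(raw)
    if not os.path.isabs(path):
        raise ValueError(f"custom_semgrep_rules must be an absolute path, got: {raw}")
    return path
```

Raising on a relative path is a design choice of this sketch. The diff only states that the path should be absolute and does not say how an invalid value is reported.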
4 changes: 4 additions & 0 deletions src/macaron/errors.py
@@ -105,3 +105,7 @@ class HeuristicAnalyzerValueError(MacaronError):

class LocalArtifactFinderError(MacaronError):
"""Happens when there is an error looking for local artifacts."""


class SourceCodeError(MacaronError):
"""Error for operations on package source code."""
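As an illustration of where such an error type could be useful, the sketch below wraps low-level archive failures in `SourceCodeError`, so callers only need to catch one exception. `MacaronError` is redefined here as a stand-in for the real base class, and `list_python_members` is a hypothetical helper, not a function from this PR.

```python
import tarfile


class MacaronError(Exception):
    """Stand-in for Macaron's base error class."""


class SourceCodeError(MacaronError):
    """Error for operations on package source code."""


def list_python_members(tarball_path: str) -> list[str]:
    """Return the names of the .py files in a source tarball."""
    try:
        with tarfile.open(tarball_path) as archive:
            return [m.name for m in archive.getmembers() if m.isfile() and m.name.endswith(".py")]
    except (tarfile.TarError, OSError) as error:
        # Missing files and corrupt archives are reported through a single error type.
        raise SourceCodeError(f"Unable to read source tarball {tarball_path}: {error}") from error
```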
11 changes: 11 additions & 0 deletions src/macaron/malware_analyzer/README.md
@@ -52,6 +52,17 @@ When a heuristic fails, with `HeuristicResult.FAIL`, then that is an indicator b
- **Rule**: Return `HeuristicResult.FAIL` if the major or epoch is abnormally high; otherwise, return `HeuristicResult.PASS`.
- **Dependency**: Will be run if the One Release heuristic fails.

### Experimental: Source Code Analysis with Semgrep

The following analyzer is available as an experimental feature. Enable it by passing `--analyze-source` to `macaron analyze`:

**PyPI Source Code Analyzer**
- **Description**: Uses Semgrep to scan the package `.tar` source code. The default rules are in `src/macaron/resources/pypi_malware_rules`; custom rules can be added by setting `custom_semgrep_rules` in `defaults.ini` to the path where they are stored.
- **Rule**: If any Semgrep rule is triggered, return `HeuristicResult.FAIL`, which in turn fails the package with `CheckResultType.FAILED`. If no rule is triggered, return `HeuristicResult.PASS`, and the `CheckResultType` produced by the combination of all other heuristics is kept.
- **Dependency**: Will be run if the Source Code Repo heuristic fails.

This feature is a work in progress. It currently detects code obfuscation techniques and remote exfiltration behaviors, using Semgrep OSS.
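The pass/fail rule described above can be sketched as a function over Semgrep's JSON report, which contains a top-level `results` list whose entries carry `check_id`, `path`, and `start.line`. This is a minimal sketch, not the analyzer from this PR: the `HeuristicResult` enum is a local stand-in, and the rule id and file path in the sample report are invented.

```python
import json
from enum import Enum


class HeuristicResult(str, Enum):
    """Local stand-in for Macaron's HeuristicResult."""

    PASS = "PASS"
    FAIL = "FAIL"


def evaluate(semgrep_json: str) -> tuple[HeuristicResult, dict[str, list[str]]]:
    """Fail if any rule was triggered, and group the finding locations by rule id."""
    report = json.loads(semgrep_json)
    findings: dict[str, list[str]] = {}
    for result in report.get("results", []):
        location = f"{result['path']}:{result['start']['line']}"
        findings.setdefault(result["check_id"], []).append(location)
    return (HeuristicResult.FAIL if findings else HeuristicResult.PASS), findings


# Hypothetical report with a single finding; the rule id is invented for illustration.
SAMPLE_REPORT = json.dumps(
    {
        "results": [
            {"check_id": "example.obfuscation-rule", "path": "pkg/setup.py", "start": {"line": 3}}
        ],
        "errors": [],
    }
)

print(evaluate(SAMPLE_REPORT))
```

Keeping the per-rule locations alongside the result makes it possible to report why a package failed, not only that it failed.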

### Confidence Score Motivation

The original seven heuristics which started this work were Empty Project Link, Unreachable Project Links, One Release, High Release Frequency, Unchanged Release, Closer Release Join Date, and Suspicious Setup. These heuristics (excluding those with a dependency) were run on 1167 packages from trusted organizations, with the following results:
3 changes: 3 additions & 0 deletions src/macaron/malware_analyzer/pypi_heuristics/heuristics.py
@@ -37,6 +37,9 @@ class Heuristics(str, Enum):
#: Indicates that the package has an unusually large version number for a single release.
ANOMALOUS_VERSION = "anomalous_version"

#: Indicates that the package source code contains suspicious code patterns.
SUSPICIOUS_PATTERNS = "suspicious_patterns"


class HeuristicResult(str, Enum):
"""Result type indicating the outcome of a heuristic."""