Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: improve experimental source code pattern analysis of pypi packages #965

Open
wants to merge 23 commits into
base: staging
Choose a base branch
from

Conversation

art1f1c3R
Copy link
Member

@art1f1c3R art1f1c3R commented Jan 17, 2025

Summary

This pull request builds upon the work completed in #851, refactoring and refining the source code analysis of PyPI packages by identifying suspicious patterns and dataflows. This PR introduces the open source components of the tool Semgrep, using Semgrep rules written in .yaml files to specify suspicious code patterns, and suspicious dataflows using its taint tracking features. Included in the changes are default Semgrep rules, the option to include user-created Semgrep rules, and unit tests for each default Semgrep ruleset. This new feature is encompassed in a refactored version of the PyPISourcecodeAnalyzer, which analyzes the source code downloaded from the package tarball.

Usage

This feature has been added in as an "experimental" feature, that is, it is only intended to be used when the user has explicitly installed the extra required components and includes the argument --analyze-source along with the macaron analyze command. The reason for this feature being experimental is that the dataflow and pattern analysis rules are still under development, as many more are intended to be added to the default ruleset, along with further refinements, before this may produce results sufficient enough to include as a complete feature. The user may provide their own Semgrep rules by including a directory with one or many semgrep .yaml rule files in the defaults.ini file under custom_semgrep_rules. By default, this is blank. If rules are provided, they are run in addition to the default rules provided by Macaron.

For post-run analysis, provided in the information returned by this analyzer is the rule identifier, along with the file path, start, and end line number of the file where the suspicious pattern was detected.

Function

If this feature is turned on, the PyPISourcecodeAnalyzer is run after all other current heuristics have been run. If this analyzer fails and determines there were suspicious patterns in the source code, the overall package analysis is failed. If it had previously passed, this is done with low confidence (Confidence.LOW). The reason this analyzer may override the previous decision by the metadata heuristics is that it is only designed to be run when certain metadata heuristics fail, so if there is enough confidence in the results of the metadata heuristics, the source code analysis will not be run. Currently, this is done by including a dependency on the EmptyProjectLinks heuristic failing, intending to imply that there is no source code repository available. In this instance, the package is trusted less as the source code is not published, so the tarball is downloaded and analyzed. As heuristics change, this dependency may be updated.

Testing

A new unit test was added for this analyzer. It tests error pathways through the use and instantiation of the analyzer, as well as its ability to detect the patterns outlined in the Semgrep rules. To do this, sample files with suspicious patterns were included for each suspicious pattern category under tests/malware_analyzer/pypi/resources/sourcecode_samples. Each directory has an expected_results.json file with the expected file names and line numbers relative to that directory that should be detected, categorized in rule IDs. None of the code within these sample files is malicious in any way, the samples are all variants of printing Hello world!. However, to maintain their code patterns and behaviors, they must be excluded from many of the pre-commit hooks. To take precautions to ensure excluding them is safe, the following measures are put in place:

  • Each file calls sys.exit() immediately at every possible entry point.
  • The exported symbols are set to empty (using __all__) to ensure nothing can be imported from these files.
  • All patterns are isolated to a single function, test_function.
  • A pre-commit hook has been added which uses the script samples_permissions_checker.sh to ensure no execute permissions are available on any of the sample files.

Behavior Without Experimental Features

If this feature is not installed by the user, then Macaron's behavior will remain unchanged. If the --analyze-source command line argument is provided without the installation of this feature, then the only expected error that may arise is when the tool attempts to run semgrep. If this happens, it will be caught by the analyzer as a subprocess error, raised appropriately and caught by detect_malicious_metadata_check, and the result of the analysis will be UNKNOWN.

Further Development

The following is recommended to be implemented and/or considered for further development of this feature:

  • Updates to obfuscation rules (obfuscation.yaml): improvements to this rule may include more decoding and execution patterns, and coverage of more obfuscation tool behaviors.
  • Updates to exfiltration rules (exfiltration.yaml): further common sensitive information behaviors and accesses should be included (e.g. OS-specific accesses of things like the Windows registry or Linux /etc files, keystroke reading for keylogging, clipboard access), along with further methods of creating remote connections and sending data (e.g. FTP).
  • Network operations (new rule): detect the flow of network operations to sensitive calls and behaviors, such as remote connections downloading files to then write then to disk or running malicious scripts downloaded from a remote connection, or spawning processes and shells after a remote connection. Operations like keylogging may required some more special treatment due to their typical flow.
  • Suspicious constants (new rule): detect hard-coded constant values to suspicious URL links for exfiltrating data or downloading malicious payloads, or sensitive constants such as files or directories commonly containing sensitive information. Constants included in this should also be included in exfiltration and network operation rules. This rule would need to be carefully crafted to avoid false positives.
  • The current suspicious_setup.py heuristic may benefit from using a custom Semgrep rule as opposed to its current method.
  • Source code analysis should only be performed when necessary, given it is an expensive operation. This analyzer's dependency on metadata heuristics and other heuristics should be refined to ensure this is the case.
  • Function Arguments: changes to all heuristics pertaining to suspicious function calls may assist in reducing false positives by attempting to discern what is passed as those arguments (e.g. is a hard-coded string passed to a base64 decode call, or something pulled from elsewhere that may be benign)

Development Notes

  • A new top-level file, .semgrepignore, has been added but is blank except for a single comment; this is intentional. This file is used by Semgrep with the same syntax as a .gitignore file to exclude items from being scanned by Semgrep. As our intention is to scan only specific directories, but ignore nothing (given we are scanning for malicious behavior in any part of the package), this file has been created so nothing is ignored. If Semgrep does not detect a .semgrepignore, it uses its own default one, which is not empty, hence this file's presence. For a similar reason, the --disable-nosem argument has been added to the main scan, so packages may not bypass this analyzer's scan by adding a # nosemgrep comment in the source code. This is tested in the unit tests (see remote-exfiltration.py).

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Jan 17, 2025
@art1f1c3R art1f1c3R self-assigned this Jan 22, 2025
@art1f1c3R art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch from f599234 to 38cc36b Compare January 28, 2025 02:23
@art1f1c3R art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch from 8bc6532 to 6dfa198 Compare February 5, 2025 06:42
@art1f1c3R art1f1c3R marked this pull request as ready for review February 6, 2025 05:31
@art1f1c3R art1f1c3R requested a review from tromai as a code owner February 6, 2025 05:31
.semgrepignore Outdated
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file has been added to ensure semgrep does not ignore any files during its scan, so packages can not bypass this analyzer by naming malicious files in the same manner that is ignored by the default .semgrepignore. See the development notes in the PR description for a more in-depth description.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been run through Shellcheck, no issues are detected.

@art1f1c3R art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch 3 times, most recently from f449d13 to 0d8e735 Compare February 14, 2025 07:49
@art1f1c3R art1f1c3R force-pushed the art1f1c3R/pypi_code_pattern_analysis branch from 0d8e735 to c162b40 Compare February 24, 2025 05:50
Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
… dataset results

Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
@art1f1c3R
Copy link
Member Author

Semgrep rules have been refined and updated based on the results of scanning the ICSE25 dataset as was provided to Hercule. A summary of what was observed and what was changed as a result:

  • The obfuscation_default-assigning rule triggered many false positives, as packages often perform these operations for custom typing purposes, as well as other unknown but benign behaviors. The rule did not, by itself, detect many malware within the dataset, and was often detected in conjunction with some other rule. For this reason, it has been removed.
  • The rules obfuscation_decode-and-execute and exfiltration_remote-exfiltration have been rewritten to make them more context aware. What is meant by this is, many false positives were being generated for patterns such as $BYTES.decode(...), as since $BYTES is not type checked to actually be a bytes object, anything with a .decode(...) method was triggered. This had a heavy impact on the socket patterns and file read/write patterns. These have been rewritten to ensure that any file read/write, or any socket action, is taken when a relevant object is available and detected. This has significantly reduced false positives.
  • The ast.literal_eval() function has been removed, as it does not, by itself, offer any real malicious functionality (only parsing types and other strings, not executing code).
  • The obfuscation_inline-imports rule has been rewritten to detect hexadecimal and octal import after observing malicious behaviors similar in some of the malware datasets, as well as only detecting certain, hardcoded suspicious imports. This is to avoid the many false positives it triggered for trusted packages which use __import__(...) for configuration purposes, and use this function as opposed to the suggested importlib.import_module(...).

Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
OCA Verified All contributors have signed the Oracle Contributor Agreement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants