-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: improve experimental source code pattern analysis of pypi packages #965
base: staging
Are you sure you want to change the base?
Conversation
f599234
to
38cc36b
Compare
8bc6532
to
6dfa198
Compare
.semgrepignore
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file has been added to ensure semgrep
does not ignore any files during its scan, so packages can not bypass this analyzer by naming malicious files in the same manner that is ignored by the default .semgrepignore
. See the development notes in the PR description for a more in-depth description.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has been run through Shellcheck, no issues are detected.
f449d13
to
0d8e735
Compare
chore: Merging staging changes with sourcecode analysis.
…e nosemgrep feature
… to use MACARON_PATH
0d8e735
to
c162b40
Compare
Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
… dataset results Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
Semgrep rules have been refined and updated based on the results of scanning the ICSE25 dataset as was provided to Hercule. A summary of what was observed and what was changed as a result:
|
Signed-off-by: Carl Flottmann <carl.flottmann@oracle.com>
Summary
This pull request builds upon the work completed in #851, refactoring and refining the source code analysis of PyPI packages by identifying suspicious patterns and dataflows. This PR introduces the open source components of the tool Semgrep, using Semgrep rules written in
.yaml
files to specify suspicious code patterns, and suspicious dataflows using its taint tracking features. Included in the changes are default Semgrep rules, the option to include user-created Semgrep rules, and unit tests for each default Semgrep ruleset. This new feature is encompassed in a refactored version of thePyPISourcecodeAnalyzer
, which analyzes the source code downloaded from the package tarball.Usage
This feature has been added in as an "experimental" feature, that is, it is only intended to be used when the user has explicitly installed the extra required components and includes the argument
--analyze-source
along with themacaron analyze
command. The reason for this feature being experimental is that the dataflow and pattern analysis rules are still under development, as many more are intended to be added to the default ruleset, along with further refinements, before this may produce results sufficient enough to include as a complete feature. The user may provide their own Semgrep rules by including a directory with one or many semgrep.yaml
rule files in thedefaults.ini
file undercustom_semgrep_rules
. By default, this is blank. If rules are provided, they are run in addition to the default rules provided by Macaron.For post-run analysis, provided in the information returned by this analyzer is the rule identifier, along with the file path, start, and end line number of the file where the suspicious pattern was detected.
Function
If this feature is turned on, the
PyPISourcecodeAnalyzer
is run after all other current heuristics have been run. If this analyzer fails and determines there were suspicious patterns in the source code, the overall package analysis is failed. If it had previously passed, this is done with low confidence (Confidence.LOW
). The reason this analyzer may override the previous decision by the metadata heuristics is that it is only designed to be run when certain metadata heuristics fail, so if there is enough confidence in the results of the metadata heuristics, the source code analysis will not be run. Currently, this is done by including a dependency on theEmptyProjectLinks
heuristic failing, intending to imply that there is no source code repository available. In this instance, the package is trusted less as the source code is not published, so the tarball is downloaded and analyzed. As heuristics change, this dependency may be updated.Testing
A new unit test was added for this analyzer. It tests error pathways through the use and instantiation of the analyzer, as well as its ability to detect the patterns outlined in the Semgrep rules. To do this, sample files with suspicious patterns were included for each suspicious pattern category under
tests/malware_analyzer/pypi/resources/sourcecode_samples
. Each directory has anexpected_results.json
file with the expected file names and line numbers relative to that directory that should be detected, categorized in rule IDs. None of the code within these sample files is malicious in any way, the samples are all variants of printingHello world!
. However, to maintain their code patterns and behaviors, they must be excluded from many of the pre-commit hooks. To take precautions to ensure excluding them is safe, the following measures are put in place:sys.exit()
immediately at every possible entry point.__all__
) to ensure nothing can be imported from these files.test_function
.samples_permissions_checker.sh
to ensure no execute permissions are available on any of the sample files.Behavior Without Experimental Features
If this feature is not installed by the user, then Macaron's behavior will remain unchanged. If the
--analyze-source
command line argument is provided without the installation of this feature, then the only expected error that may arise is when the tool attempts to runsemgrep
. If this happens, it will be caught by the analyzer as asubprocess
error, raised appropriately and caught bydetect_malicious_metadata_check
, and the result of the analysis will beUNKNOWN
.Further Development
The following is recommended to be implemented and/or considered for further development of this feature:
obfuscation.yaml
): improvements to this rule may include more decoding and execution patterns, and coverage of more obfuscation tool behaviors.exfiltration.yaml
): further common sensitive information behaviors and accesses should be included (e.g. OS-specific accesses of things like the Windows registry or Linux/etc
files, keystroke reading for keylogging, clipboard access), along with further methods of creating remote connections and sending data (e.g. FTP).suspicious_setup.py
heuristic may benefit from using a custom Semgrep rule as opposed to its current method.base64
decode call, or something pulled from elsewhere that may be benign)Development Notes
.semgrepignore
, has been added but is blank except for a single comment; this is intentional. This file is used by Semgrep with the same syntax as a.gitignore
file to exclude items from being scanned by Semgrep. As our intention is to scan only specific directories, but ignore nothing (given we are scanning for malicious behavior in any part of the package), this file has been created so nothing is ignored. If Semgrep does not detect a.semgrepignore
, it uses its own default one, which is not empty, hence this file's presence. For a similar reason, the--disable-nosem
argument has been added to the main scan, so packages may not bypass this analyzer's scan by adding a# nosemgrep
comment in the source code. This is tested in the unit tests (seeremote-exfiltration.py
).