
Commit: Create rule S7187: PySpark Pandas DataFrame columns should not use a reserved name
joke1196 committed Jan 30, 2025
1 parent 7cab163 commit 55266d9
Showing 2 changed files with 23 additions and 25 deletions.
10 changes: 5 additions & 5 deletions rules/S7187/python/metadata.json
@@ -1,24 +1,24 @@
 {
-  "title": "FIXME",
+  "title": "PySpark Pandas DataFrame columns should not use a reserved name",
   "type": "CODE_SMELL",
   "status": "ready",
   "remediation": {
     "func": "Constant\/Issue",
     "constantCost": "5min"
   },
   "tags": [
     "data-science",
     "pyspark"
   ],
   "defaultSeverity": "Major",
   "ruleSpecification": "RSPEC-7187",
   "sqKey": "S7187",
   "scope": "All",
   "defaultQualityProfiles": ["Sonar way"],
-  "quickfix": "unknown",
+  "quickfix": "infeasible",
   "code": {
     "impacts": {
       "MAINTAINABILITY": "HIGH",
-      "RELIABILITY": "MEDIUM",
-      "SECURITY": "LOW"
+      "RELIABILITY": "MEDIUM"
     },
     "attribute": "CONVENTIONAL"
   }
38 changes: 18 additions & 20 deletions rules/S7187/python/rule.adoc
@@ -1,44 +1,42 @@
-FIXME: add a description
-
-// If you want to factorize the description uncomment the following line and create the file.
-//include::../description.adoc[]
+This rule raises an issue when a PySpark Pandas DataFrame column name is set to a reserved name.
 
 == Why is this an issue?
 
-FIXME: remove the unused optional headers (that are commented out)
+PySpark offers powerful APIs to work with Pandas DataFrames in a distributed environment.
+While the integration between PySpark and Pandas is seamless, there are some caveats that should be taken into account.
 
-//=== What is the potential impact?
+The Spark Pandas API uses some special column names for internal purposes.
+These column names have a leading and a trailing `++__++`.
+When naming or renaming columns with the PySpark Pandas API, such reserved
+column names should therefore be avoided, as they are not guaranteed to yield the expected results.
 
 == How to fix it
-//== How to fix it in FRAMEWORK NAME
 
+To fix this issue, provide a column name without a leading and trailing `++__++`.
 
 === Code examples
 
 ==== Noncompliant code example
 
 [source,python,diff-id=1,diff-type=noncompliant]
 ----
-FIXME
+import pyspark.pandas as ps
+df = ps.DataFrame({'__value__': [1, 2, 3]}) # Noncompliant: __value__ is a reserved column name
 ----
 
 ==== Compliant solution
 
 [source,python,diff-id=1,diff-type=compliant]
 ----
-FIXME
+import pyspark.pandas as ps
+df = ps.DataFrame({'value': [1, 2, 3]}) # Compliant
 ----
 
-//=== How does this work?
-//=== Pitfalls
-//=== Going the extra mile
+== Resources
+=== Documentation
+
+* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/best_practices.html#avoid-reserved-column-names[Best Practices]
 
-//== Resources
-//=== Documentation
-//=== Articles & blog posts
-//=== Conference presentations
-//=== Standards
-//=== External coding guidelines
-//=== Benchmarks
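For illustration, the naming convention the new rule text describes (a leading and a trailing `++__++`) can be sketched as a plain-Python check. This is a hypothetical helper, not the analyzer's actual implementation, and `is_reserved_column_name` is a name invented here:

```python
import re

# Sketch only: pyspark.pandas treats column names with both a leading and a
# trailing double underscore as reserved for internal use.
RESERVED_NAME = re.compile(r"^__.*__$")

def is_reserved_column_name(name: str) -> bool:
    """Return True when `name` matches the reserved dunder pattern."""
    return bool(RESERVED_NAME.match(name))

print(is_reserved_column_name("__value__"))  # True
print(is_reserved_column_name("value"))      # False
```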