Table of contents

  1. Introduction
  2. PySpark
    1. Basic PySpark operations
  3. Python
  4. SQL
  5. Markdown
    1. Basic Formatting
    2. Creating Diagrams
    3. Highlighting Syntax
    4. Creating Sections
    5. Emojis
    6. Checklists
    7. Collapsed Sections

Coding Resources sorted by coding language

A range of cheat sheets, coding resources, videos, etc. that I want to keep track of and that others may find helpful.

PySpark 🐍

Basic PySpark Operations

PySpark_RDD_Cheat_Sheet.pdf

Source: https://www.datacamp.com/cheat-sheet/pyspark-cheat-sheet-spark-in-python

Source: https://aeshantechhub.co.uk/databricks-dbutils-cheat-sheet-and-pyspark-amp-sql-best-practice-cheat-sheet/

    # Create a DataFrame from a CSV file:
    df = spark.read.csv("/mnt/datasets/sample.csv", header=True, inferSchema=True)
    
    # Display the first few rows of the DataFrame:
    df.show(5)

    # Select columns from a DataFrame:
    df.select("column1", "column2").show()

Best Practice: Avoid using collect()

Avoid using collect() as it brings all data to the driver node, which can cause memory issues.

    # Bad practice - using collect() to bring all data to driver:
    data = df.collect()

    # Better practice - use show() or take() instead:
    df.show(5)

Repartitioning DataFrames:

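Repartition your DataFrames to optimize performance when dealing with large datasets. Repartitioning by a column places rows with the same value in the same partition, which can help later joins or aggregations on that column. Note that repartition() itself triggers a full shuffle, so use it deliberately.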

    # Repartition the DataFrame based on a column:
    df = df.repartition("column_name")

Check if your df is pandas or pyspark

print(type(example_df))
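
The printed type can also be checked in code. The sketch below is a hypothetical helper (`dataframe_kind` is not part of pandas or PySpark) that looks only at the module name of the object's class, so it runs without importing either library. It assumes pandas DataFrames come from a module under `pandas` and PySpark ones from a module under `pyspark`; a pandas-on-Spark DataFrame would also be reported as `pyspark`.

```python
def dataframe_kind(obj) -> str:
    """Return "pandas", "pyspark", or "unknown" based on the class's module.

    Hypothetical helper: it only inspects type(obj).__module__, so neither
    pandas nor pyspark has to be imported for the check itself.
    """
    # e.g. "pandas.core.frame" -> "pandas", "pyspark.sql.dataframe" -> "pyspark"
    root = type(obj).__module__.split(".")[0]
    if root in ("pandas", "pyspark"):
        return root
    return "unknown"


# A plain list is neither kind of DataFrame.
print(dataframe_kind([1, 2, 3]))
```

Calling `dataframe_kind(example_df)` before converting avoids passing an object that is already a PySpark DataFrame to `spark.createDataFrame`.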

Convert pandas to pyspark df

from pyspark.sql import SparkSession

# Initialize Spark session if not already done
spark = SparkSession.builder.getOrCreate()

# Convert Pandas DataFrames to PySpark
example_df = spark.createDataFrame(example_df)

Joins

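from pyspark.sql import functions as fn

# Left join: keeps every row of df (df stands in for the left DataFrame, which the original snippet omits)
joined_df = df.join(df1, fn.col("Code") == fn.col("Der_Code"), how="left")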

See the df in different ways

# Prints the columns & types
df.printSchema()

# Prints the column names as a Python list
print(df.columns)

# Prints distinct values in a given column ("column1" is a placeholder)
df.select("column1").distinct().show()

Drop columns

df = df.drop("column1", "column2")

Python 🐍

Cheat Sheets

python-cheat-sheet.pdf

Source: https://www.datacamp.com/cheat-sheet/python-for-data-science-a-cheat-sheet-for-beginners

Numpy_Cheat_Sheet.pdf

Source: https://www.datacamp.com/cheat-sheet/numpy-cheat-sheet-data-analysis-in-python

Pandas_Cheat_Sheet.pdf

Source: https://www.datacamp.com/cheat-sheet/pandas-cheat-sheet-for-data-science-in-python

Data_Wrangling_Cheat_Sheet.pdf

Source: https://www.datacamp.com/cheat-sheet/pandas-cheat-sheet-data-wrangling-in-python

Cheat-Importing Data.pdf

Cheat-Jupyiter Notebooks.pdf

Cheat-Pandas Basics.pdf

Python-Cheat-Sheet-for-Scikit-learn-Edureka.pdf

scikit_learn_cheat.pdf

Sklearn-cheat-sheet.pdf


SQL 📜

SQL Best Practices Cheat Sheet

Source: https://aeshantechhub.co.uk/databricks-dbutils-cheat-sheet-and-pyspark-amp-sql-best-practice-cheat-sheet/

SQL is widely used in Databricks for data querying and transformation. Below are some best practices to keep your queries optimized.

Common SQL Operations:

    # Register the DataFrame as a temporary view so it can be queried with SQL:
    df.createOrReplaceTempView("temp_table")
    
    # Running a SQL query on the DataFrame:
    spark.sql("SELECT * FROM temp_table").show()

Best Practice: Use LIMIT when previewing data

Avoid fetching large datasets during development. Use LIMIT to preview data instead.

    # Use LIMIT to preview data in SQL:
    spark.sql("SELECT * FROM temp_table LIMIT 10").show()
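
The effect of LIMIT can be tried locally without a cluster. The sketch below swaps Spark for Python's built-in `sqlite3` module; the table name `temp_table` mirrors the one used above, while its columns and rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for the Spark temp view.
# Assumption: the columns and rows are invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temp_table (column1 INTEGER, column2 TEXT)")
conn.executemany(
    "INSERT INTO temp_table VALUES (?, ?)",
    [(i, f"name_{i}") for i in range(1000)],
)

# Without LIMIT, every row is fetched into Python.
all_rows = conn.execute("SELECT * FROM temp_table").fetchall()

# With LIMIT, at most 10 rows are returned.
preview = conn.execute("SELECT * FROM temp_table LIMIT 10").fetchall()

print(len(all_rows), "rows without LIMIT")
print(len(preview), "rows with LIMIT 10")
conn.close()
```

The same idea carries over to `spark.sql`: the preview query returns a bounded number of rows no matter how large the underlying table is.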

Best Practice: Leverage Caching

Cache intermediate results in memory to optimise performance for iterative queries. cache() is lazy, so the data is only stored once an action such as count() or show() runs.

    # Cache a DataFrame for future use:
    df.cache()

Avoid using SELECT * in production

Using SELECT * can lead to unnecessary data transfer and slow performance, especially with large datasets.

    # Bad practice - using SELECT *:
    spark.sql("SELECT * FROM temp_table")

    # Better practice - select only needed columns (column1 and column2 are placeholders):
    spark.sql("SELECT column1, column2 FROM temp_table")

Markdown 📘

Complete formatting cheat sheet: Markdown Cheatsheet

Some bits from the above that I use a lot:

General Formatting

| Style | Syntax | Example |
| --- | --- | --- |
| Bold | `** **` or `__ __` | `**This is bold text**` |
| Italic | `* *` or `_ _` | `_This text is italicized_` |
| Strikethrough | `~~ ~~` | `~~This was mistaken text~~` |
| Bold and nested italic | `** **` and `_ _` | `**This text is _extremely_ important**` |
| All bold and italic | `*** ***` | `***All this text is important***` |
| Subscript | `<sub> </sub>` | `This is a <sub>subscript</sub> text` |
| Superscript | `<sup> </sup>` | `This is a <sup>superscript</sup> text` |
| Underline | `<ins> </ins>` | `This is an <ins>underlined</ins> text` |

Horizontal rules

Three or more hyphens, asterisks, or underscores on a line of their own each produce a horizontal rule:

    ---
    ***
    ___

Insert a hyperlink

[Insert your text here](https://www.google.com)

Creating Diagrams

You can also use code blocks to create diagrams in Markdown. GitHub supports Mermaid, GeoJSON, TopoJSON, and ASCII STL syntax. For more information, see Creating diagrams.

Syntax Highlighting

Source: https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks

You can add an optional language identifier to enable syntax highlighting in your fenced code block.

Syntax highlighting changes the color and style of source code to make it easier to read.

For example, to syntax highlight Markdown code:

```markdown
hello this is my markdown

You can put code in this block & it will show in a grey box. Change your language as required. More info can be found below on supported languages.
```

This will display the code block. If you set the language identifier to python, the code will be highlighted with colours.

When you create a fenced code block that you also want to have syntax highlighting on a GitHub Pages site, use lower-case language identifiers. For more information, see About GitHub Pages and Jekyll.

We use Linguist to perform language detection and to select third-party grammars for syntax highlighting. You can find out which keywords are valid in the languages YAML file.

Creating sections

To create a table of contents and sections in your Markdown file, you can use the code below.

Source: https://stackoverflow.com/questions/11948245/markdown-to-create-pages-and-table-of-contents

# Table of contents
1. [Introduction](#introduction)
2. [Some paragraph](#paragraph1)
    1. [Sub paragraph](#subparagraph1)
3. [Another paragraph](#paragraph2)

## This is the introduction <a name="introduction"></a>
Some introduction text, formatted in heading 2 style

## Some paragraph <a name="paragraph1"></a>
The first paragraph text

### Sub paragraph <a name="subparagraph1"></a>
This is a sub paragraph, formatted in heading 3 style

## Another paragraph <a name="paragraph2"></a>
The second paragraph text
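
Writing the list by hand gets tedious for long files. As a rough sketch, the table of contents can be generated from the headings with Python. The function name `build_toc` is hypothetical, and the code assumes every heading carries an `<a name="..."></a>` anchor and that `##` headings are the top level, as in the snippet above.

```python
import re

# Matches headings such as: ## Some paragraph <a name="paragraph1"></a>
HEADING = re.compile(r'^(#{2,6})\s+(.*?)\s*<a name="([^"]+)"></a>\s*$')


def build_toc(markdown: str) -> str:
    """Build a nested, numbered table of contents from anchored headings.

    Assumes level-2 headings (##) are top-level entries and that deeper
    headings nest by four spaces per level.
    """
    lines = []
    for raw in markdown.splitlines():
        match = HEADING.match(raw)
        if not match:
            continue  # skip body text and headings without an anchor
        hashes, title, anchor = match.groups()
        indent = "    " * (len(hashes) - 2)
        # Markdown renumbers ordered lists, so "1." works for every item.
        lines.append(f"{indent}1. [{title}](#{anchor})")
    return "\n".join(lines)


sample = """\
## Some paragraph <a name="paragraph1"></a>
The first paragraph text

### Sub paragraph <a name="subparagraph1"></a>
This is a sub paragraph
"""
print(build_toc(sample))
```

Unlike the hand-written list, this uses the full heading text as the link label, which keeps the contents and the headings in sync.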

Emojis

See full list here 😄:

https://gist.github.com/rxaviers/7360908

Create a checklist

To create a task list, preface list items with a hyphen and space followed by [ ]. To mark a task as complete, use [x].

Source: https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/about-task-lists#creating-task-lists

- [x] #739
- [ ] https://github.com/octo-org/octo-repo/issues/740
- [ ] Add delight to the experience when all tasks are complete :tada:
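
Because task lists are plain text, progress can be counted with a short script. This is a minimal sketch, assuming items follow the `- [ ]` / `- [x]` form shown above; the helper name `task_progress` is hypothetical.

```python
import re

# "- [ ] item" is an open task, "- [x] item" (or "[X]") is a completed one.
TASK = re.compile(r"^\s*-\s+\[( |x|X)\]\s+\S")


def task_progress(markdown: str) -> tuple[int, int]:
    """Return (completed, total) for the task-list items in `markdown`."""
    completed = total = 0
    for line in markdown.splitlines():
        match = TASK.match(line)
        if match:
            total += 1
            if match.group(1) != " ":
                completed += 1
    return completed, total


checklist = """\
- [x] #739
- [ ] https://github.com/octo-org/octo-repo/issues/740
- [ ] Add delight to the experience when all tasks are complete :tada:
"""
print(task_progress(checklist))  # prints (1, 3)
```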

Create a collapsed section

Markdown code:

<details>

<summary>This is a collapsed section</summary>

### You can add a header

You can add text within a collapsed section. 

You can add an image or a code block, too.

```python
print("Hello World")
```

</details>

Preview how it looks:

This is a collapsed section

You can add a header

You can add text within a collapsed section.

You can add an image or a code block, too.

   print("Hello World")
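
If you build README content from a script, the same block can be produced programmatically. A small sketch, where `collapsed_section` is a hypothetical helper and the summary text is assumed to contain no HTML that needs escaping:

```python
def collapsed_section(summary: str, body: str) -> str:
    """Wrap Markdown `body` in a <details> block titled `summary`."""
    # The blank lines after <summary> and before </details> keep the body
    # separate from the HTML tags, matching the layout shown above.
    return (
        "<details>\n\n"
        f"<summary>{summary}</summary>\n\n"
        f"{body.strip()}\n\n"
        "</details>\n"
    )


print(collapsed_section("This is a collapsed section", "You can add text here."))
```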