pysum takes a pandas dataframe (and a few other arguments to
customize the output) and creates a markdown, html, or xlsx report with
a summary of each of the variables in the dataframe.
The program iterates through each column in the dataframe, creates summary statistics based on the column's datatype, and prints them to a table.
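The per-column dispatch described above can be sketched roughly as follows. This is an illustration only, not pysum's actual code: the helper `describe_column` and the keys of the dictionary it returns are hypothetical, and the sketch computes far fewer statistics than the real report.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def describe_column(series: pd.Series) -> dict:
    """Hypothetical helper: pick a summary based on the column's dtype."""
    n = len(series)
    summary = {
        "distinct": int(series.nunique(dropna=True)),
        "pct_missing": round(100 * int(series.isna().sum()) / n, 2) if n else 0.0,
    }
    if is_numeric_dtype(series):
        # Numeric columns get numeric statistics (only the mean is shown here).
        summary["kind"] = "numeric"
        summary["mean"] = round(float(series.mean()), 2)
    else:
        # Everything else is treated as a character column.
        summary["kind"] = "character"
        has_values = series.notna().any()
        summary["top_value"] = series.mode(dropna=True).iloc[0] if has_values else None
    return summary


df = pd.DataFrame({
    "height": [1.0, 2.0, None, 3.0],
    "color": ["red", "blue", "red", None],
})

# One summary per column, keyed by column name.
report = {col: describe_column(df[col]) for col in df.columns}
print(report)
```

The real function then renders these per-column summaries as rows of the output table.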
The function takes the following arguments:
- dataframe: pandas dataframe. No default. The passed dataframe must also have an attribute called name that carries the name of the dataframe. See examples for clarification.
- round_digits: Integer. Digits to which the numbers reported should be rounded. The default is 2.
- var_numbers: Boolean. Whether or not to add a column indicating the column number. The default is True.
- missing_col: Boolean. Adds a column that reports the proportion missing. The default is True.
- max_distinct_values: Numeric. The maximum number of values to display frequencies for. If a variable has more distinct values than this number, the remaining frequencies will be reported as a whole, along with the number of additional distinct values. The default is 10.
- max_string_width: Integer. Limits the number of characters to display in the frequency tables. The default is 25.
- output_type: String. The file format of the output file: xlsx, html, or markdown. The default is html.
- output_file: String. The path and filename to which the script should output the results. The default is summary.html in the local directory.
- append: Boolean. If there is an existing file, should the results be appended, or should the file be overwritten? The default is True. When append is True, the results are appended. When it is False, the file is overwritten.
The html output also depends on custom.css in the local folder.
The output is an xlsx, html, or markdown file. For numeric columns, it reports the mean, standard deviation, minimum, maximum, median, IQR, number of distinct values, percentage that are valid, and percentage missing, by default.
Definitions of Things in Output
- Valid = entries with non-missing values
- mean (sd) = mean (standard deviation).
- min = minimum
- med = median
- max = maximum
- IQR = Interquartile range
- CV = Coefficient of variation
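As a rough illustration of the definitions above, the following sketch computes the same quantities with pandas. It is not taken from pysum: the function name `numeric_summary` is hypothetical, and it assumes the sample standard deviation (pandas' default), CV defined as sd divided by the mean, and pandas' default linear interpolation for quartiles.

```python
import pandas as pd


def numeric_summary(series: pd.Series, round_digits: int = 2) -> dict:
    """Hypothetical helper: summary statistics for one numeric column."""
    valid = series.dropna()  # "Valid" = entries with non-missing values
    q1, q3 = valid.quantile(0.25), valid.quantile(0.75)
    mean, sd = valid.mean(), valid.std()  # sample sd (ddof=1), an assumption
    stats = {
        "mean": mean,
        "sd": sd,
        "min": valid.min(),
        "med": valid.median(),
        "max": valid.max(),
        "IQR": q3 - q1,
        "CV": sd / mean,  # assumed definition; undefined when the mean is 0
    }
    stats = {key: round(float(value), round_digits) for key, value in stats.items()}
    stats["pct_valid"] = round(100 * len(valid) / len(series), round_digits)
    return stats


values = pd.Series([2.0, 4.0, 4.0, 6.0, None])
summary = numeric_summary(values)
print(summary)
```

Here one of the five entries is missing, so 80% of the entries are valid, and the statistics are computed on the remaining four values.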
For character vectors, it reports frequencies for as many as max_distinct_values
distinct values, then reports the number of other values and their percentage.
It also reports the percentage of observations that are valid and that are
missing by default.
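The frequency reporting for character columns can be sketched as below. This is only an approximation of the behaviour described here: `frequency_table` is a hypothetical helper, the label used for the grouped remainder is made up, and percentages are assumed to be taken over valid (non-missing) entries.

```python
import pandas as pd


def frequency_table(series: pd.Series,
                    max_distinct_values: int = 10,
                    max_string_width: int = 25) -> list:
    """Hypothetical helper: (label, count, percent) rows for one column."""
    counts = series.dropna().astype(str).value_counts()  # sorted, most frequent first
    total = int(counts.sum())
    rows = [
        # Truncate long labels to max_string_width characters.
        (label[:max_string_width], int(count), round(100 * int(count) / total, 2))
        for label, count in counts.head(max_distinct_values).items()
    ]
    rest = counts.iloc[max_distinct_values:]
    if len(rest) > 0:
        # Remaining values are reported as a whole, with their number and share.
        rest_count = int(rest.sum())
        rows.append((f"{len(rest)} other values", rest_count,
                     round(100 * rest_count / total, 2)))
    return rows


colors = pd.Series(["red", "red", "red", "blue", "blue", "green", "teal", None])
table = frequency_table(colors, max_distinct_values=2)
for row in table:
    print(row)
```

With max_distinct_values set to 2, the two most frequent values get their own rows and the two remaining distinct values are collapsed into a single row.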
Limitations: Dates by default are parsed as characters. Dates are best handled as numeric. But given the variety of formats in which dates appear, no standard support is offered for now.
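One possible workaround, sketched here under assumptions, is to convert a date column to a number yourself before summarizing. The column names `visit_date` and `visit_day` and the date format are made up for illustration; only standard pandas calls are used, and nothing here is part of pysum.

```python
import pandas as pd

df = pd.DataFrame({"visit_date": ["2021-01-01", "2021-01-11", None]})

# Parse with an explicit format; unparseable or missing entries become NaT.
parsed = pd.to_datetime(df["visit_date"], format="%Y-%m-%d", errors="coerce")

# Represent each date as the number of days since the earliest date.
# Missing dates stay missing (NaN) in the numeric column.
df["visit_day"] = (parsed - parsed.min()).dt.days

print(df)
```

The resulting numeric column should then be summarized like any other numeric variable, while the original string column would still be treated as characters.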
Install the requirements:
pip install -r requirements.txt
You also need pandoc installed on your machine.
    import pandas
    import pysum

    # Load dataset
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
    dataset = pandas.read_csv(url, names=names)

    # Pass the name of the dataset; it is required
    dataset.name = 'iris'

    # Default output: html report
    pysum.summarizeDF(dataset)
    # Overwrite any existing file instead of appending
    pysum.summarizeDF(dataset, output_type="xlsx", append=False)
    pysum.summarizeDF(dataset, output_type="markdown", append=False)
Markdown Output, HTML Output and XLSX Output
The package is based on https://github.com/dcomtois/summarytools