All functionalities of this repository plus more features has been implemented in a new package called InMemoryDatasets.jl
DFTranspose.jl
is a Julia
package to explore the implementation of a new function for reshaping DataFrame
s.
See JuliaData/DataFrames.jl#2732 (comment)
run the following code inside a Julia
session.
julia> using Pkg
julia> Pkg.add(url = "https://github.com/sl-solution/DFTranspose.jl")
The DFTranspose.jl
package only exports one function, df_transpose
, which can be used to reshape a DataFrame
.
df_transpose(df, cols [, groupbycols]; kwargs...)
In its simplest form df_transpose
transposes the specified columns, cols
, of a DataFrame
, and attaches a new column to the output data frame to track the original names of the transposed variables. The label of this column can be controlled by variable_name
keyword.
When an id
variable is specified, df_transpose
transpose the data as above, however, it uses the values of the id
variable to label the columns of the output data frame. renamecolid
can be used to modify these labels on fly. When there are multiple id
variables, the default renamecolid
function assigns each label by a Tuple of the values of those variables (see examples).
When a set of groupby variables, groupbycols
, are specified, the df_transpose
function repeats the simple transposing of data within each group constructed by groupby variables. Like the simplest case, an id
variable can be used to label the columns of the output data frame.
The following behaviours may change in future
- The order of the output data frame is based on the order returned by
groupby
function.- The order of columns labels is based on their order of appearance in the original data.
- Currently if
id
value(s) is repeated within a group,df_transpose
throws an error.- Missing values can be a group level or a value for the
id
variable. They will be treated as a category.
Basic usage
julia> df = DataFrame(x1 = [1,2,3,4], x2 = [1,4,9,16])
4ร2 DataFrame
Row โ x1 x2
โ Int64 Int64
โโโโโโผโโโโโโโโโโโโโโ
1 โ 1 1
2 โ 2 4
3 โ 3 9
4 โ 4 16
julia> df_transpose(df, [:x1,:x2])
2ร5 DataFrame
Row โ _variables_ _c1 _c2 _c3 _c4
โ String Int64 Int64 Int64 Int64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ x1 1 2 3 4
2 โ x2 1 4 9 16
Specifying ID variable
julia> df = DataFrame(id = ["r1", "r2", "r3" , "r4"], x1 = [1,2,3,4], x2 = [1,4,9,16])
4ร3 DataFrame
Row โ id x1 x2
โ String Int64 Int64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโ
1 โ r1 1 1
2 โ r2 2 4
3 โ r3 3 9
4 โ r4 4 16
julia> df_transpose(df, [:x1,:x2], id = :id)
2ร5 DataFrame
Row โ _variables_ r1 r2 r3 r4
โ String Int64 Int64 Int64 Int64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ x1 1 2 3 4
2 โ x2 1 4 9 16
Specifying groupby variables
julia> df = DataFrame(group = repeat(1:3, inner = 2),
b = repeat(1:2, inner = 3),
c = repeat(1:1, inner = 6),
d = repeat(1:6, inner = 1),
e = string.('a':'f'))
6ร5 DataFrame
Row โ group b c d e
โ Int64 Int64 Int64 Int64 String
โโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1 1 1 1 a
2 โ 1 1 1 2 b
3 โ 2 1 1 3 c
4 โ 2 2 1 4 d
5 โ 3 2 1 5 e
6 โ 3 2 1 6 f
julia> df_transpose(df, 2:4, [:group])
9ร4 DataFrame
Row โ group _variables_ _c1 _c2
โ Int64 String Int64? Int64?
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1 b 1 1
2 โ 1 c 1 1
3 โ 1 d 1 2
4 โ 2 b 1 2
5 โ 2 c 1 1
6 โ 2 d 3 4
7 โ 3 b 2 2
8 โ 3 c 1 1
9 โ 3 d 5 6
julia> df_transpose(df, 2:4, :group, id = :e)
9ร8 DataFrame
Row โ group _variables_ a b c d e f
โ Int64 String Int64? Int64? Int64? Int64? Int64? Int64?
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1 b 1 1 missing missing missing missing
2 โ 1 c 1 1 missing missing missing missing
3 โ 1 d 1 2 missing missing missing missing
4 โ 2 b missing missing 1 2 missing missing
5 โ 2 c missing missing 1 1 missing missing
6 โ 2 d missing missing 3 4 missing missing
7 โ 3 b missing missing missing missing 2 2
8 โ 3 c missing missing missing missing 1 1
9 โ 3 d missing missing missing missing 5 6
julia> df_transpose(df, 2:4, :group, id = :e, default_fill = 0)
9ร8 DataFrame
Row โ group _variables_ a b c d e f
โ Int64 String Int64 Int64 Int64 Int64 Int64 Int64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1 b 1 1 0 0 0 0
2 โ 1 c 1 1 0 0 0 0
3 โ 1 d 1 2 0 0 0 0
4 โ 2 b 0 0 1 2 0 0
5 โ 2 c 0 0 1 1 0 0
6 โ 2 d 0 0 3 4 0 0
7 โ 3 b 0 0 0 0 2 2
8 โ 3 c 0 0 0 0 1 1
9 โ 3 d 0 0 0 0 5 6
Advanced usage
julia> pop = DataFrame(country = ["c1","c1","c2","c2","c3","c3"],
sex = ["male", "female", "male", "female", "male", "female"],
pop_2000 = [100, 120, 150, 155, 170, 190],
pop_2010 = [110, 120, 155, 160, 178, 200],
pop_2020 = [115, 130, 161, 165, 180, 203])
6ร5 DataFrame
Row โ country sex pop_2000 pop_2010 pop_2020
โ String String Int64 Int64 Int64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ c1 male 100 110 115
2 โ c1 female 120 120 130
3 โ c2 male 150 155 161
4 โ c2 female 155 160 165
5 โ c3 male 170 178 180
6 โ c3 female 190 200 203
julia> df_transpose(pop, r"pop_", :country,
id = :sex,
variable_name = "year",
renamerowid = x -> replace(x, "pop_"=>""),
renamecolid = x -> x * "_pop")
9ร4 DataFrame
Row โ country year male_pop female_pop
โ String String Int64? Int64?
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ c1 2000 100 120
2 โ c1 2010 110 120
3 โ c1 2020 115 130
4 โ c2 2000 150 155
5 โ c2 2010 155 160
6 โ c2 2020 161 165
7 โ c3 2000 170 190
8 โ c3 2010 178 200
9 โ c3 2020 180 203
julia> using Dates
julia> df = DataFrame([[1, 2, 3], [1.1, 2.0, 3.3],[1.1, 2.1, 3.0],[1.1, 2.0, 3.2]]
,[:person, Symbol("11/2020"), Symbol("12/2020"), Symbol("1/2021")])
3ร4 DataFrame
Row โ person 11/2020 12/2020 1/2021
โ Float64 Float64 Float64 Float64
โโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1.0 1.1 1.1 1.1
2 โ 2.0 2.0 2.1 2.0
3 โ 3.0 3.3 3.0 3.2
julia> df_transpose(df, Not(:person), :person,
variable_name = "Date",
renamerowid = x -> Date(x, dateformat"m/y"),
renamecolid = x -> "measurement")
9ร3 DataFrame
Row โ person Date measurement
โ Float64 Date Float64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1.0 2020-11-01 1.1
2 โ 1.0 2020-12-01 1.1
3 โ 1.0 2021-01-01 1.1
4 โ 2.0 2020-11-01 2.0
5 โ 2.0 2020-12-01 2.1
6 โ 2.0 2021-01-01 2.0
7 โ 3.0 2020-11-01 3.3
8 โ 3.0 2020-12-01 3.0
9 โ 3.0 2021-01-01 3.2
julia> df = DataFrame(region = repeat(["North","North","South","South"],2),
fuel_type = repeat(["gas","coal"],4),
load = rand(8),
time = [1,1,1,1,2,2,2,2],
)
8ร4 DataFrame
Row โ region fuel_type load time
โ String String Float64 Int64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ North gas 0.877347 1
2 โ North coal 0.412013 1
3 โ South gas 0.969407 1
4 โ South coal 0.641831 1
5 โ North gas 0.856583 2
6 โ North coal 0.409253 2
7 โ South gas 0.235768 2
8 โ South coal 0.655087 2
julia> df_transpose(df, :load, :time, id = 1:2)
2ร6 DataFrame
Row โ time _variables_ ("North", "gas") ("North", "coal") ("South", "gas") ("South", "coal")
โ Int64 String Float64? Float64? Float64? Float64?
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1 load 0.877347 0.412013 0.969407 0.641831
2 โ 2 load 0.856583 0.409253 0.235768 0.655087
julia> df = DataFrame(A_2018=1:4, A_2019=5:8, B_2017=9:12,
B_2018=9:12, B_2019 = [missing,13,14,15],
ID = [1,2,3,4])
4ร6 DataFrame
Row โ A_2018 A_2019 B_2017 B_2018 B_2019 ID
โ Int64 Int64 Int64 Int64 Int64? Int64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1 5 9 9 missing 1
2 โ 2 6 10 10 13 2
3 โ 3 7 11 11 14 3
4 โ 4 8 12 12 15 4
julia> f(x) = match(r"[0-9]+",x).match
julia> dfA = df_transpose(df, r"A", :ID, renamerowid = f, variable_name = "Year", renamecolid = x->"A");
julia> dfB = df_transpose(df, r"B", :ID, renamerowid = f, variable_name = "Year", renamecolid = x->"B");
julia> outerjoin(dfA, dfB, on = [:ID, :Year])
12ร4 DataFrame
Row โ ID Year A B
โ Int64 SubStrinโฆ Int64? Int64?
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1 2018 1 9
2 โ 1 2019 5 missing
3 โ 2 2018 2 10
4 โ 2 2019 6 13
5 โ 3 2018 3 11
6 โ 3 2019 7 14
7 โ 4 2018 4 12
8 โ 4 2019 8 15
9 โ 1 2017 missing 9
10 โ 2 2017 missing 10
11 โ 3 2017 missing 11
12 โ 4 2017 missing 12
# emulating the pivoting in python pandas
julia> df = DataFrame(paddockId= [0, 0, 1, 1, 2, 2],
color= ["red", "blue", "red", "blue", "red", "blue"],
count= [3, 4, 3, 4, 3, 4],
weight= [0.2, 0.3, 0.2, 0.3, 0.2, 0.2])
6ร4 DataFrame
Row โ paddockId color count weight
โ Int64 String Int64 Float64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 0 red 3 0.2
2 โ 0 blue 4 0.3
3 โ 1 red 3 0.2
4 โ 1 blue 4 0.3
5 โ 2 red 3 0.2
6 โ 2 blue 4 0.2
julia> df_transpose( df_transpose(df, [:count,:weight], [:paddockId,:color]),
:_c1, :paddockId, id = 2:3)
3ร6 DataFrame
Row โ paddockId _variables_ ("red", "count") ("red", "weight") ("blue", "count") ("blue", "weight")
โ Int64 String Float64? Float64? Float64? Float64?
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 0 _c1 3.0 0.2 4.0 0.3
2 โ 1 _c1 3.0 0.2 4.0 0.3
3 โ 2 _c1 3.0 0.2 4.0 0.2
df_transpose(df::AbstractDataFrame, cols)
is similar to permutedims()
with some differences.
df_transpose
only permutes the columnscols
.- If no variable is provided as
id
, it generates the column names of the new data set by mapping a function (renamecolid
) on the sequence of rows indf
. - A function (
renamerowid
) applies to the rows ids in the output data frame before generating thevariable_name
columns.
julia> df = DataFrame(x1 = [9,2,8,6,8], x2 = [8,1,6,2,3], x3 = [6,5,3,10,8])
5ร3 DataFrame
Row โ x1 x2 x3
โ Int64 Int64 Int64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโ
1 โ 9 8 6
2 โ 2 1 5
3 โ 8 6 3
4 โ 6 2 10
5 โ 8 3 8
julia> df_transpose(df, r"x")
3ร6 DataFrame
Row โ _variables_ _c1 _c2 _c3 _c4 _c5
โ String Int64 Int64 Int64 Int64 Int64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ x1 9 2 8 6 8
2 โ x2 8 1 6 2 3
3 โ x3 6 5 3 10 8
julia> df_transpose(df, r"x", renamerowid = x -> match(r"[0-9]+",x).match,
renamecolid = x -> "_column_" * string(x))
3ร6 DataFrame
Row โ _variables_ _column_1 _column_2 _column_3 _column_4 _column_5
โ SubStringโฆ Int64 Int64 Int64 Int64 Int64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1 9 2 8 6 8
2 โ 2 8 1 6 2 3
3 โ 3 6 5 3 10 8
julia> df1 = DataFrame(a=["x", "y"], b=[1.0, 2.0], c=[3, 4], d=[true, false])
2ร4 DataFrame
Row โ a b c d
โ String Float64 Int64 Bool
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ x 1.0 3 true
2 โ y 2.0 4 false
julia> df_transpose(df, [:b, :c, :d], id = :a) # note the column types
3ร3 DataFrame
Row โ _variables_ x y
โ String Float64 Float64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ b 1.0 2.0
2 โ c 3.0 4.0
3 โ d 1.0 0.0
julia> df2 = DataFrame(a=["x", "y"], b=[1, "two"], c=[3, 4], d=[true, false])
2ร4 DataFrame
Row โ a b c d
โ String Any Int64 Bool
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ x 1 3 true
2 โ y two 4 false
julia> df_transpose(df2, [:b, :c, :d], id = :a, variable_name = "new_col")
3ร3 DataFrame
Row โ new_col x y
โ String Any Any
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโ
1 โ b 1 two
2 โ c 3 4
3 โ d true false
df_transpose(df::AbstractDataFrame, cols, gcols)
can be used to emulate stack
functionalities. Generally speaking, stack
transposes each row of a data frame. Thus, to achieve the stack
functionality each row of the input data frame should be an individual group. This can be done by inserting a new column to the input data frame or using df_transpose
twice.
df_transpose
cannot create the output data as a view
(see help for stack
function).
julia> df = DataFrame(a = repeat(1:3, inner = 2),
b = repeat(1:2, inner = 3),
c = repeat(1:1, inner = 6),
d = repeat(1:6, inner = 1),
e = string.('a':'f'))
6ร5 DataFrame
Row โ a b c d e
โ Int64 Int64 Int64 Int64 String
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1 1 1 1 a
2 โ 1 1 1 2 b
3 โ 2 1 1 3 c
4 โ 2 2 1 4 d
5 โ 3 2 1 5 e
6 โ 3 2 1 6 f
julia> df_transpose(df, [:c, :d], Not([:c, :d]),
variable_name = "variable",
renamecolid = x -> "value")
12ร5 DataFrame
Row โ a b e variable value
โ Int64 Int64 String String Int64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1 1 a c 1
2 โ 1 1 a d 1
3 โ 1 1 b c 1
4 โ 1 1 b d 2
5 โ 2 1 c c 1
6 โ 2 1 c d 3
7 โ 2 2 d c 1
8 โ 2 2 d d 4
9 โ 3 2 e c 1
10 โ 3 2 e d 5
11 โ 3 2 f c 1
12 โ 3 2 f d 6
julia> insertcols!(df, 1, :RowID => 1:nrow(df))
julia> df_transpose(df, [:c, :d], [:RowID, :a])
12ร4 DataFrame
Row โ RowID a _variables_ _c1
โ Int64 Int64 String Int64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1 1 c 1
2 โ 1 1 d 1
3 โ 2 1 c 1
4 โ 2 1 d 2
5 โ 3 2 c 1
6 โ 3 2 d 3
7 โ 4 2 c 1
8 โ 4 2 d 4
9 โ 5 3 c 1
10 โ 5 3 d 5
11 โ 6 3 c 1
12 โ 6 3 d 6
When only one column is set for cols
argument df_transpose(df::AbstractDataFrame, cols, gcols)
can be used to emulate unstack
functionality.
wide = DataFrame(id = 1:6,
a = repeat(1:3, inner = 2),
b = repeat(1.0:2.0, inner = 3),
c = repeat(1.0:1.0, inner = 6),
d = repeat(1.0:3.0, inner = 2))
6ร5 DataFrame
Row โ id a b c d
โ Int64 Int64 Float64 Float64 Float64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1 1 1.0 1.0 1.0
2 โ 2 1 1.0 1.0 1.0
3 โ 3 2 1.0 1.0 2.0
4 โ 4 2 2.0 1.0 2.0
5 โ 5 3 2.0 1.0 3.0
6 โ 6 3 2.0 1.0 3.0
# stacking to make data long
julia> long = df_transpose(wide, 3:5, [:id, :a],
variable_name = "variable",
renamecolid = x->"value")
18ร4 DataFrame
Row โ id a variable value
โ Int64 Int64 String Float64
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1 1 b 1.0
2 โ 1 1 c 1.0
3 โ 1 1 d 1.0
4 โ 2 1 b 1.0
5 โ 2 1 c 1.0
6 โ 2 1 d 1.0
7 โ 3 2 b 1.0
โฎ โ โฎ โฎ โฎ โฎ
12 โ 4 2 d 2.0
13 โ 5 3 b 2.0
14 โ 5 3 c 1.0
15 โ 5 3 d 3.0
16 โ 6 3 b 2.0
17 โ 6 3 c 1.0
18 โ 6 3 d 3.0
4 rows omitted
# unstack to make the long data wide again
julia> df_transpose(long, [:value], [:id, :a], id = :variable)
6ร6 DataFrame
Row โ id a _variables_ b c d
โ Int64 Int64 String Float64? Float64? Float64?
โโโโโโผโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
1 โ 1 1 value 1.0 1.0 1.0
2 โ 2 1 value 1.0 1.0 1.0
3 โ 3 2 value 1.0 1.0 2.0
4 โ 4 2 value 2.0 1.0 2.0
5 โ 5 3 value 2.0 1.0 3.0
6 โ 6 3 value 2.0 1.0 3.0