Full dumps, when imported into Spark, are partitioned by listened_at's year and month for storage in HDFS. Incremental dumps are imported every day and appended to a single incremental.parquet. Deleted listens are similarly stored in deleted-listens.parquet and deleted-user-listen-history.parquet. At run time, the full dump is read, concatenated with the incremental dumps, and the deleted listens are filtered out of the union. When a new full dump is imported, it already contains all listens up to that time with the deleted listens removed, so the additional parquet files for incremental and deleted listens are dropped. This currently happens on a biweekly schedule.

Full dumps are cumbersome to produce, so we want to reduce our dependence on them inside ListenBrainz and Spark. After an initial full dump import to seed the cluster, we intend to get rid of the biweekly full dump imports and rely on incremental dumps continuously. Hence, we need to rethink how incremental listens are stored in the Spark cluster and how deletions are implemented.

The solution I have come up with is to replace the full dump import step with a compaction step that reads all the partitioned base listens, combines them with the incremental listens, removes the deleted listens, and writes the result back to HDFS in the same partitioned format. Everything else remains the same.
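A minimal PySpark sketch of that run-time read path, assuming hypothetical HDFS paths and join columns; the actual logic lives in get_listens_from_dump and related helpers, so this is only an illustration of the approach:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical HDFS locations, for illustration only.
base = spark.read.parquet("/data/listens/base")                      # partitioned by year/month
incremental = spark.read.parquet("/data/listens/incremental.parquet")
deleted = spark.read.parquet("/data/listens/deleted-listens.parquet")

# Drop the partition columns so the schemas line up, concatenate base and
# incremental listens, then remove any listen that has a matching row in the
# deleted-listens table (the join keys here are assumptions).
listens = (
    base.drop("year", "month")
        .unionByName(incremental)
        .join(deleted, on=["user_id", "listened_at", "recording_msid"], how="left_anti")
)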
Showing 5 changed files with 85 additions and 49 deletions.
@@ -0,0 +1,66 @@
import os

import listenbrainz_spark
from listenbrainz_spark import hdfs_connection
from listenbrainz_spark.hdfs.utils import path_exists
from listenbrainz_spark.listens.cache import unpersist_incremental_df
from listenbrainz_spark.listens.data import get_listens_from_dump
from listenbrainz_spark.listens.metadata import get_listens_metadata, generate_new_listens_location, \
    update_listens_metadata


def main():
    """
    Compacts listen storage by processing base and incremental listen records.
    Reads base and incremental listen records, removes deleted listens, and stores the final
    processed data partitioned by year and month in a new HDFS location.
    """
    table = "listens_to_compact"
    old_df = get_listens_from_dump(include_incremental=True, remove_deleted=True)
    old_df.createOrReplaceTempView(table)

    write_partitioned_listens(table)


def write_partitioned_listens(table):
    """ Read listens from the given table and write them to a new HDFS location partitioned
    by listened_at's year and month. """
    query = f"""
        select extract(year from listened_at) as year
             , extract(month from listened_at) as month
             , *
          from {table}
    """
    new_location = generate_new_listens_location()
    new_base_listens_location = os.path.join(new_location, "base")

    listenbrainz_spark \
        .sql_context \
        .sql(query) \
        .write \
        .partitionBy("year", "month") \
        .mode("overwrite") \
        .parquet(new_base_listens_location)

    query = f"""
        select max(listened_at) as max_listened_at, max(created) as max_created
          from parquet.`{new_base_listens_location}`
    """
    result = listenbrainz_spark \
        .sql_context \
        .sql(query) \
        .collect()[0]

    metadata = get_listens_metadata()
    if metadata is None:
        existing_location = None
    else:
        existing_location = metadata.location

    update_listens_metadata(new_location, result.max_listened_at, result.max_created)

    unpersist_incremental_df()

    if existing_location and path_exists(existing_location):
        hdfs_connection.client.delete(existing_location, recursive=True, skip_trash=True)
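Note the ordering in write_partitioned_listens: the compacted listens are written to a freshly generated location, the metadata is switched to point at it, the cached incremental dataframe is unpersisted, and only then is the previous location deleted, which avoids pointing readers at a half-written dataset. A side benefit of keeping the year/month layout is partition pruning for downstream jobs; a hypothetical example (path and dates are assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The year/month partitioning lets Spark prune partitions when a job only
# needs a time window instead of scanning the full listen history.
recent = spark.read.parquet("/data/listens/base") \
    .where("year = 2025 and month >= 1")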