template is ready
zjysteven committed Dec 19, 2024
1 parent 185aed8 commit 256143a
Showing 2 changed files with 46 additions and 19 deletions.
15 changes: 4 additions & 11 deletions papers.json
@@ -1,16 +1,9 @@
 [
     {
         "title": "MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers",
-        "authors": "Alice, Yu",
-        "year": 2024,
-        "link": "https://arxiv.org/abs/1234.5678",
-        "summary": "A survey on byte-based large language models."
-    },
-    {
-        "title": "Efficient Byte-Based Transformers",
-        "authors": "Charlie, Dana",
-        "year": 2023,
-        "link": "https://arxiv.org/abs/8765.4321",
-        "summary": "Exploration of efficient methods for byte-level transformers."
+        "date": "2023-05",
+        "link": "https://arxiv.org/pdf/2305.07185",
+        "conference": "NeurIPS'23",
+        "summary": "The paper introduces MEGABYTE, a multiscale Transformer architecture that segments sequences into patches, enabling efficient modeling of million-byte sequences with sub-quadratic self-attention, enhanced feedforward computation, and improved decoding parallelism, achieving competitive performance on tasks like long-context language modeling, image generation, and audio modeling."
     }
 ]
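Review note: the updated generator indexes `title`, `link`, `date`, and `conference` directly, so an entry still in the old `authors`/`year` shape would raise a `KeyError`. A minimal sketch of a pre-flight check follows; the names `REQUIRED_FIELDS`, `missing_fields`, and `check_papers` are hypothetical and not part of this commit.

```python
# Hypothetical helper, not part of the commit: flags papers.json entries
# that lack a field the updated generate_markdown() reads via paper['...'].
REQUIRED_FIELDS = ("title", "link", "date", "conference")


def missing_fields(paper):
    """Return the required fields absent from one papers.json entry."""
    return [field for field in REQUIRED_FIELDS if field not in paper]


def check_papers(papers):
    """Map the index of each incomplete entry to the fields it lacks."""
    problems = {}
    for index, paper in enumerate(papers):
        missing = missing_fields(paper)
        if missing:
            problems[index] = missing
    return problems
```

One possible use is to call `check_papers(json.load(f))` on the opened `papers.json` before generating the README and stop if the result is non-empty.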
50 changes: 42 additions & 8 deletions readme.py
@@ -5,17 +5,51 @@ def generate_markdown(json_file, output_file):
         papers = json.load(f)

     with open(output_file, 'w') as f:
-        # Write title and description
-        f.write("# Awesome Byte-Based Large Language Models\n")
-        f.write("A curated list of papers on byte-based large language models.\n\n")
+        f.write("<p align='center'>\n")
+        f.write(" <img src='assets/teaser.webp' alt='A teaser figure generated by DALL-E' width=70%>\n")
+        f.write("</p>\n\n")
+
+        # title
+        f.write("# Awesome Byte-Based Large Language Models\n\n")
+
+        # intro
+        f.write("## Introduction\n\n")
+
+        f.write(
+            "Recently, there has been growing interest in byte-based large language models (LLMs). These models eliminate the need for tokenization and operate directly on raw bytes, the universal format of the digital world. Byte-based models offer several promising advantages, including:\n\n" \
+            "- **Enhanced robustness and generalization:** By removing the heuristic biases introduced by tokenization, these models could achieve better adaptability.\n" \
+            "- **Cross-modality scalability:** Since all data can be represented as bytes, these models naturally extend to multiple modalities.\n\n"
+        )

-        # Table header
-        f.write("| Title | Authors | Year | Link | Summary |\n")
-        f.write("|-------|---------|------|------|---------|\n")
+        f.write("This repository serves as an ongoing collection of papers and resources focused on byte-based LLMs.\n\n")
+
+        # papers
+        f.write("## Papers\n\n")
+        f.write("""Most field names are self-explanatory. "Date" is when the work was first made public (e.g., the timestamp of its first arXiv version). "Summary" is a one-sentence summary of the paper.\n\n""")
+
+        # table header
+        f.write("| Title | Date | Conference | Code | Summary |\n")
+        f.write("|-------|:----:|:----------:|:----:|---------|\n")

-        # Populate the table
+        # populate the table
         for paper in papers:
-            f.write(f"| [{paper['title']}]({paper['link']}) | {paper['authors']} | {paper['year']} | [Link]({paper['link']}) | {paper['summary']} |\n")
+            if 'code' not in paper or paper['code'] in ['N/A', 'NA', 'na', 'n/a', '-']:
+                code = '-'
+            else:
+                code = f"[Code]({paper['code']})"
+
+            if 'summary' not in paper or paper['summary'] in ['N/A', 'NA', 'na', 'n/a', '-']:
+                summary = '-'
+            else:
+                summary = "<details>" + paper['summary'] + "</details>"
+
+            f.write(f"| [{paper['title']}]({paper['link']}) | {paper['date']} | {paper['conference']} | {code} | {summary} |\n")
+
+        # contributing
+        f.write("\n\n## Contributing\n\n")
+        f.write("Contributions are always welcome. There are two ways to add a new paper:\n")
+        f.write("1. The easiest way is to open an issue, where I have a template for you to fill out.\n")
+        f.write("2. If you would like to be listed as a contributor, add the paper to `papers.json` and open a pull request. Please do NOT edit `README.md` directly.\n")

 if __name__ == "__main__":
     generate_markdown('papers.json', 'README.md')
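Review note: the two placeholder checks in the loop repeat the same list of "empty" markers. A minimal sketch of how the row logic could be factored out and tested without file I/O is given below. `PLACEHOLDERS`, `optional_field`, and `format_row` are hypothetical names that do not appear in this commit; the behavior is meant to mirror the committed loop, not replace it.

```python
# Hypothetical refactor, not part of the commit: the same row-building
# logic as the loop in generate_markdown(), isolated for testing.
PLACEHOLDERS = {"N/A", "NA", "na", "n/a", "-"}


def optional_field(paper, key):
    """Return the field's value, or None if it is missing or a placeholder."""
    value = paper.get(key)
    if value is None or value in PLACEHOLDERS:
        return None
    return value


def format_row(paper):
    """Build one Markdown table row for a papers.json entry."""
    code = optional_field(paper, "code")
    summary = optional_field(paper, "summary")
    code_cell = f"[Code]({code})" if code is not None else "-"
    summary_cell = f"<details>{summary}</details>" if summary is not None else "-"
    # title, link, date, and conference are still required, as in the commit.
    return (
        f"| [{paper['title']}]({paper['link']}) | {paper['date']} "
        f"| {paper['conference']} | {code_cell} | {summary_cell} |\n"
    )
```

With a helper like this, the loop body would reduce to `f.write(format_row(paper))`, and the placeholder list would live in one place.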
