
Commit bbf733c

deploy: f36a0e8
mallamanis committed Aug 20, 2024
1 parent 5868350 commit bbf733c
Showing 39 changed files with 9,637 additions and 5,867 deletions.
etc/compute_related.py (2 changes: 1 addition & 1 deletion)
@@ -6,7 +6,7 @@

nltk.download('stopwords')
nltk.download('wordnet')
-nltk.download('punkt')
+nltk.download('punkt_tab')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
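For context, recent NLTK releases (3.8.2 and newer) replaced the pickled 'punkt' tokenizer package with the 'punkt_tab' resource, which is why both scripts swap the download call. Below is a minimal sketch of the updated setup; the word_tokenize call is an assumed, illustrative usage and is not part of this hunk.

import nltk

# Newer NLTK releases ship the Punkt sentence tokenizer as 'punkt_tab'
# rather than the old pickled 'punkt' package, hence the change above.
nltk.download('punkt_tab')

from nltk.tokenize import word_tokenize

# Illustrative only: tokenize a paper abstract the way a related-papers
# script might before removing stopwords and lemmatizing.
tokens = word_tokenize("Machine learning for source code is an active research area.")
print(tokens)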
etc/compute_topics.py (2 changes: 1 addition & 1 deletion)
@@ -5,7 +5,7 @@
nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('wordnet')
-nltk.download('punkt')
+nltk.download('punkt_tab')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
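The same resource swap appears in compute_topics.py. If these scripts ever needed to run against both old and new NLTK installations, a version-tolerant variant could try 'punkt_tab' first and fall back to 'punkt'; this is only a sketch, not code from the repository.

import nltk

# Sketch only (not the repository's code): prefer the newer 'punkt_tab'
# resource and fall back to 'punkt' on older NLTK releases that lack it.
# nltk.download() returns False, rather than raising, for unknown resources.
if not nltk.download('punkt_tab', quiet=True):
    nltk.download('punkt')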
index.html (3 changes: 3 additions & 0 deletions)
@@ -123,7 +123,9 @@ <h4 id="-browse-papers-by-tag">🏷 Browse Papers by Tag</h4>
<tag><a href="/tags.html#API">API</a></tag>
<tag><a href="/tags.html#autocomplete">autocomplete</a></tag>
<tag><a href="/tags.html#benchmark">benchmark</a></tag>
+<tag><a href="/tags.html#benchmarking">benchmarking</a></tag>
<tag><a href="/tags.html#bimodal">bimodal</a></tag>
+<tag><a href="/tags.html#Binary Code">Binary Code</a></tag>
<tag><a href="/tags.html#clone">clone</a></tag>
<tag><a href="/tags.html#code completion">code completion</a></tag>
<tag><a href="/tags.html#code generation">code generation</a></tag>
@@ -173,6 +175,7 @@ <h4 id="-browse-papers-by-tag">🏷 Browse Papers by Tag</h4>
<tag><a href="/tags.html#repair">repair</a></tag>
<tag><a href="/tags.html#representation">representation</a></tag>
<tag><a href="/tags.html#retrieval">retrieval</a></tag>
+<tag><a href="/tags.html#Reverse Engineering">Reverse Engineering</a></tag>
<tag><a href="/tags.html#review">review</a></tag>
<tag><a href="/tags.html#search">search</a></tag>
<tag><a href="/tags.html#static">static</a></tag>
paper-abstracts.json (2 changes: 2 additions & 0 deletions)
@@ -89,8 +89,10 @@
{"key": "chen2021evaluating", "year": "2021", "title":"Evaluating Large Language Models Trained on Code", "abstract": "<p>We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.</p>\n", "tags": ["language model","synthesis"] },
{"key": "chen2021plur", "year": "2021", "title":"PLUR: A Unifying, Graph-Based View of Program Learning, Understanding, and Repair", "abstract": "<p>Machine learning for understanding and editing source code has recently attracted significant interest, with many developments in new models, new code representations, and new tasks.This proliferation can appear disparate and disconnected, making each approach seemingly unique and incompatible, thus obscuring the core machine learning challenges and contributions.In this work, we demonstrate that the landscape can be significantly simplified by taking a general approach of mapping a graph to a sequence of tokens and pointers.Our main result is to show that 16 recently published tasks of different shapes can be cast in this form, based on which a single model architecture achieves near or above state-of-the-art results on nearly all tasks, outperforming custom models like code2seq and alternative generic models like Transformers.This unification further enables multi-task learning and a series of cross-cutting experiments about the importance of different modeling choices for code understanding and repair tasks.The full framework, called PLUR, is easily extensible to more tasks, and will be open-sourced (https://github.com/google-research/plur).</p>\n", "tags": ["repair"] },
{"key": "chen2022codet", "year": "2022", "title":"CodeT: Code Generation with Generated Tests", "abstract": "<p>Given a programming problem, pre-trained language models such as Codex have demonstrated the ability to generate multiple different code solutions via sampling. However, selecting a correct or best solution from those samples still remains a challenge. While an easy way to verify the correctness of a code solution is through executing test cases, producing high-quality test cases is prohibitively expensive. In this paper, we explore the use of pre-trained language models to automatically generate test cases, calling our method CodeT: Code generation with generated Tests. CodeT executes the code solutions using the generated test cases, and then chooses the best solution based on a dual execution agreement with both the generated test cases and other generated solutions. We evaluate CodeT on five different pre-trained models with both HumanEval and MBPP benchmarks. Extensive experimental results demonstrate CodeT can achieve significant, consistent, and surprising improvements over previous methods. For example, CodeT improves the pass@1 on HumanEval to 65.8%, an increase of absolute 18.8% on the code-davinci-002 model, and an absolute 20+% improvement over previous state-of-the-art results.</p>\n", "tags": ["synthesis","Transformer","execution"] },
{"key": "chen2022learning.md", "year": "2022", "title":"Learning to Reverse DNNs from AI Programs Automatically", "abstract": "<p>With the privatization deployment of DNNs on edge devices, the security of on-device DNNs has raised significant concern. To quantify the model leakage risk of on-device DNNs automatically, we propose NNReverse, the first learning-based method which can reverse DNNs from AI programs without domain knowledge. NNReverse trains a representation model to represent the semantics of binary code for DNN layers. By searching the most similar function in our database, NNReverse infers the layer type of a given function’s binary code. To represent assembly instructions semantics precisely, NNReverse proposes a more finegrained embedding model to represent the textual and structural-semantic of assembly functions.</p>\n", "tags": ["Reverse Engineering","Binary Code"] },
{"key": "chen2023diversevul", "year": "2023", "title":"DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection", "abstract": "<p>We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites, extracting vulnerability-fixing commits and source codes from the corresponding projects. Our new dataset contains 150 CWEs, 26,635 vulnerable functions, and 352,606 non-vulnerable functions extracted from 7,861 commits. Our dataset covers 305 more projects than all previous datasets combined. We show that increasing the diversity and volume of training data improves the performance of deep learning models for vulnerability detection.\nCombining our new dataset with previous datasets, we present an analysis of the challenges and promising research directions of using deep learning for detecting software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection, due to high false positive rate, low F1 score, and difficulty of detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models.\nHowever, we also identify hopeful future research directions. We demonstrate that large language models (LLMs) are the future for vulnerability detection, outperforming Graph Neural Networks (GNNs) with manual feature engineering. Moreover, developing source code specific pre-training objectives is a promising research direction to improve the vulnerability detection performance.</p>\n", "tags": ["dataset","Transformer","vulnerability"] },
{"key": "chen2023supersonic", "year": "2023", "title":"Supersonic: Learning to Generate Source Code Optimizations in C/C++", "abstract": "<p>Software optimization refines programs for resource efficiency while preserving functionality. Traditionally, it is a process done by developers and compilers. This paper introduces a third option, automated optimization at the source code level. We present Supersonic, a neural approach targeting minor source code modifications for optimization. Using a seq2seq model, Supersonic is trained on C/C++ program pairs ($x_{t}$, $x_{t+1}$), where $x_{t+1}$ is an optimized version of $x_{t}$, and outputs a diff. Supersonic’s performance is benchmarked against OpenAI’s GPT-3.5-Turbo and GPT-4 on competitive programming tasks. The experiments show that Supersonic not only outperforms both models on the code optimization task but also minimizes the extent of the change with a model more than 600x smaller than GPT-3.5-Turbo and 3700x smaller than GPT-4.</p>\n", "tags": ["optimization"] },
{"key": "chen2024ppm.md", "year": "2024", "title":"PPM: Automated Generation of Diverse Programming Problems for Benchmarking Code Generation Models", "abstract": "<p>In recent times, a plethora of Large Code Generation Models (LCGMs) have been proposed, showcasing significant potential in assisting developers with complex programming tasks. Benchmarking LCGMs necessitates the creation of a set of diverse programming problems, and each problem comprises the prompt (including the task description), canonical solution, and test inputs. The existing methods for constructing such a problem set can be categorized into two main types: manual methods and perturbation-based methods. However, manual methods demand high effort and lack scalability, while also risking data integrity due to LCGMs’ potentially contaminated data collection, and perturbation-based approaches mainly generate semantically homogeneous problems with the same canonical solutions and introduce typos that can be easily auto-corrected by IDE, making them ineffective and unrealistic. In this work, we propose the idea of programming problem merging (PPM) and provide two implementation of this idea, we utilize our tool on two widely-used datasets and compare it against nine baseline methods using eight code generation models. The results demonstrate the effectiveness of our tool in generating more challenging, diverse, and natural programming problems, comparing to the baselines.</p>\n", "tags": ["benchmarking","evaluation"] },
{"key": "chibotaru2019scalable", "year": "2019", "title":"Scalable Taint Specification Inference with Big Code", "abstract": "<p>We present a new scalable, semi-supervised method for inferring\ntaint analysis specifications by learning from a large dataset of programs.\nTaint specifications capture the role of library APIs (source, sink, sanitizer)\nand are a critical ingredient of any taint analyzer that aims to detect\nsecurity violations based on information flow.</p>\n\n<p>The core idea of our method\nis to formulate the taint specification learning problem as a linear\noptimization task over a large set of information flow constraints.\nThe resulting constraint system can then be efficiently solved with\nstate-of-the-art solvers. Thanks to its scalability, our method can infer\nmany new and interesting taint specifications by simultaneously learning from\na large dataset of programs (e.g., as found on GitHub), while requiring \nfew manual annotations.</p>\n\n<p>We implemented our method in an end-to-end system,\ncalled Seldon, targeting Python, a language where static specification\ninference is particularly hard due to lack of typing information.\nWe show that Seldon is practically effective: it learned almost 7,000 API\nroles from over 210,000 candidate APIs with very little supervision\n(less than 300 annotations) and with high estimated precision (67%).\nFurther,using the learned specifications, our taint analyzer flagged more than\n20,000 violations in open source projects, 97% of which were\nundetectable without the inferred specifications.</p>\n", "tags": ["defect","program analysis"] },
{"key": "chirkova2020empirical", "year": "2020", "title":"Empirical Study of Transformers for Source Code", "abstract": "<p>Initially developed for natural language processing (NLP), Transformers are now widely used for source code processing, due to the format similarity between source code and text. In contrast to natural language, source code is strictly structured, i. e. follows the syntax of the programming language. Several recent works develop Transformer modifications for capturing syntactic information in source code. The drawback of these works is that they do not compare to each other and all consider different tasks. In this work, we conduct a thorough empirical study of the capabilities of Transformers to utilize syntactic information in different tasks. We consider three tasks (code completion, function naming and bug fixing) and re-implement different syntax-capturing modifications in a unified framework. We show that Transformers are able to make meaningful predictions based purely on syntactic information and underline the best practices of taking the syntactic information into account for improving the performance of the model.</p>\n", "tags": ["Transformer"] },
{"key": "chirkova2021embeddings", "year": "2021", "title":"On the Embeddings of Variables in Recurrent Neural Networks for Source Code", "abstract": "<p>Source code processing heavily relies on the methods widely used in natural language processing (NLP), but involves specifics that need to be taken into account to achieve higher quality. An example of this specificity is that the semantics of a variable is defined not only by its name but also by the contexts in which the variable occurs. In this work, we develop dynamic embeddings, a recurrent mechanism that adjusts the learned semantics of the variable when it obtains more information about the variable’s role in the program. We show that using the proposed dynamic embeddings significantly improves the performance of the recurrent neural network, in code completion and bug fixing tasks.</p>\n", "tags": ["autocomplete"] },