SAFIM Leaderboard

Syntax-Aware Fill-in-the-Middle (SAFIM) is a benchmark for evaluating Large Language Models (LLMs) on the code Fill-in-the-Middle (FIM) task. SAFIM has three subtasks: Algorithmic Block Completion, Control-Flow Expression Completion, and API Function Call Completion. SAFIM is sourced from code submitted from April 2022 to January 2023 to minimize the impact of data contamination on evaluation results.


Algorithmic Block Completion

Calculate max path sum in grid,
only right or down moves allowed

n, m = len(a), len(a[0])
f = np.zeros((n + 1, m + 1))
for i in range(1, n + 1):
  for j in range(1, m + 1):
    v = max(f[i-1,j], f[i,j-1])
    f[i, j] = a[i-1][j-1] + v
print(f[n, m])

Control-Flow Expression Completion

Calculate (a ** b) mod m, i.e. a raised to the
power b, for large positive integers a, b, m

result = 1
while b > 0:
  if b % 2:
    result = (result * a) % m
  a = (a * a) % m
  b //= 2
print(result)
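
The loop in this example is binary (square-and-multiply) exponentiation. As a minimal sanity-check sketch, it can be wrapped in a function and compared against Python's built-in three-argument pow. The function name mod_pow, the initial reduction of a, and the 1 % m initialisation (so that m == 1 yields 0) are illustrative additions and are not part of the benchmark example.

```python
def mod_pow(a: int, b: int, m: int) -> int:
    """Compute (a ** b) % m by repeated squaring (hypothetical helper)."""
    result = 1 % m          # assumption: also handle m == 1, where the answer is 0
    a %= m                  # keep the base small before the loop
    while b > 0:
        if b % 2:           # lowest bit of b is set: multiply the current power in
            result = (result * a) % m
        a = (a * a) % m     # square the base for the next bit
        b //= 2             # shift b right by one bit
    return result


# Compare against the built-in modular power on a few inputs.
for a, b, m in [(2, 10, 1000), (3, 200, 13), (7, 0, 5), (10**18 + 9, 10**18, 998244353)]:
    assert mod_pow(a, b, m) == pow(a, b, m)
```

The number of loop iterations grows with the bit length of b rather than with b itself, which is why the approach works for large exponents.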

API Function Call Completion

Define word embedding & learned
positional embedding layers

d_model = args.model_dim
n_words = args.vocab_size
max_len = args.max_src_len
self.word_emb = nn.Embedding(
  n_words, d_model)
self.pos_emb = nn.Embedding(
  max_len, d_model)

Each example consists of a problem description and a corresponding code file in Python, Java, C++, or C#. Within each code file, one abstract syntax tree (AST) structure is masked (highlighted by an underline in the examples): a code block, a control-flow expression, or an API function call, depending on the subtask. The model's task is to complete the masked part.
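
To make the idea of masking an AST structure concrete, here is a rough sketch that uses Python's standard ast module to locate the condition of the first while loop in the control-flow example and split the file into prefix, masked middle, and suffix. This is only an illustration under stated assumptions: the helper name split_at_first_while_condition and the "<MASK>" placeholder are hypothetical, the source is assumed to be ASCII (ast column offsets count UTF-8 bytes), and SAFIM's actual construction pipeline may differ.

```python
import ast

# The control-flow example from above, as a standalone source string.
source = """\
result = 1
while b > 0:
    if b % 2:
        result = (result * a) % m
    a = (a * a) % m
    b //= 2
print(result)
"""


def split_at_first_while_condition(code: str) -> tuple[str, str, str]:
    """Return (prefix, middle, suffix), where middle is the first while-loop condition."""
    tree = ast.parse(code)
    loop = next(n for n in ast.walk(tree) if isinstance(n, ast.While))
    cond = loop.test  # the control-flow expression to mask

    # Convert (line, column) positions into absolute string offsets.
    # Assumes ASCII source, so byte columns equal character columns.
    lines = code.splitlines(keepends=True)
    start = sum(len(line) for line in lines[: cond.lineno - 1]) + cond.col_offset
    end = sum(len(line) for line in lines[: cond.end_lineno - 1]) + cond.end_col_offset
    return code[:start], code[start:end], code[end:]


prefix, middle, suffix = split_at_first_while_condition(source)
masked = prefix + "<MASK>" + suffix  # hypothetical placeholder shown to the model

print(masked)
print("ground truth:", middle)
```

A model under evaluation would see the prefix and suffix and be asked to produce the middle; because the span boundaries come from the syntax tree, the masked region is always a complete expression rather than an arbitrary character range.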