bigcode-project / bigcode-evaluation-harness Public


Issues: bigcode-project/bigcode-evaluation-harness


32 Open 97 Closed


Issues list

[Possibly system specific] Wild (12% vs 20%) run-to-run swings in multiple-cpp reported scores

#258 opened Jul 18, 2024 by alat-rights

Need some context for certain args for Instruct Human Eval

#256 opened Jul 18, 2024 by teknium1

The evaluation results are inconsistent across different GPUs

#252 opened Jul 8, 2024 by DonteFlynn

Using the humanevalpack to test the ChatGLM3 model results in an abnormal score.

#251 opened Jul 5, 2024 by burger-pb

MBPP Llama3-8B-Instruct: lower pass@1 score than expected

#246 opened Jun 16, 2024 by YangZhou08

Using custom prompts and postprocessing

#245 opened Jun 14, 2024 by anil-gurbuz

API-based evaluation support (humanevalpack_openai.py is too old)

#234 opened May 10, 2024 by s-natsubori

If I want to add my own designed prompts before each question, how should I modify the code

#230 opened Apr 27, 2024 by ALLISWELL8

The Llama3-8b pass@1 result is worse than reported

#228 opened Apr 22, 2024 by shuaiwang2022

ignore --use_auth_token if model doesn't require it

#227 opened Apr 21, 2024 by Vipitis

[FR] include "config" data in generations_only

#226 opened Apr 21, 2024 by Vipitis

Multiple-E Go test file name suffix does not contain _test.go

#224 opened Apr 20, 2024 by hitesh-1997

Support for vLLM

#221 opened Apr 18, 2024 by noforit

Please add flag to log score for each sample (akin to Eleuther's LM Evaluation Harness)

#215 opened Apr 8, 2024 by RylanSchaeffer

Finetune starcoderbase-1b

#214 opened Apr 6, 2024 by SummCoder

MultiPL-E generations step is hung

#213 opened Apr 3, 2024 by Santhoshkumar-p

why change n_copies from 1 to 2?

#209 opened Mar 25, 2024 by Reeleon

Support for StudentEval Dataset (Again)

#198 opened Feb 12, 2024 by guanqun-yang

AATK process_results is missing

#197 opened Feb 9, 2024 by adiprasad

To evaluate Github copilot?

#195 opened Jan 31, 2024 by liw8hz

[FEATURE REQUEST] Support HumanEval+ tests for MultiPL-E

#193 opened Jan 27, 2024 by Randl

Potentially extra slow inference when using LoRA adapter

#192 opened Jan 25, 2024 by sadaisystems

Evaluation speed with multi-gpu

#169 opened Nov 25, 2023 by Cheungki

exact commandline to reproduce leaderboard

#162 opened Nov 13, 2023 by Derekglk

A common interface for APIs and Models.

#161 opened Nov 12, 2023 by Anindyadeep
