Token-by-Token Regeneration and Domain Biases: A Benchmark of LLMs on Advanced Mathematical Problem-Solving
Large language models (LLMs) excel in many natural language tasks, yet they struggle with complex mathematical problem-solving, particularly in symbolic reasoning and maintaining consistent output. This study evaluates 10 LLMs with 7 to 8 billion parameters using 945 competition-level problems from the MATH dataset. The focus is on their ability to generate executable Python code as a step in their reasoning process, involving over 9,450 code executions. The research introduces an evaluation framework using mistral-large-2411 to rate answers on a 5-point scale, which helps address inconsistencies in mathematical notation. It also examines the impact of regenerating output token-by-token on refining results. The findings reveal a significant 34.5% performance gap between the top commercial model (gpt-4o-mini, scoring 83.7%) and the least effective open-source model (open-codestral-mamba:v0.1, scoring 49.2%). This disparity is especially noticeable in complex areas like Number Theory. While token-by-token regeneration slightly improved accuracy (+0.8%) for the model llama3.1:8b, it also reduced code execution time by 36.7%, highlighting a trade-off between efficiency and precision. The study also noted a consistent trend where harder problems correlated with lower accuracy across all models. Despite using controlled execution environments, less than 1% of the generated code was unsafe, and 3.17% of problems remained unsolved after 10 attempts, suggesting that hybrid reasoning methods may be beneficial.
Stop Using 0.5 as the Threshold for Your Binary Classifier
To produce a binary response, classifiers output a real-valued score that is then thresholded. For example, logistic regression outputs a probability (a value between 0.0 and 1.0), and observations with a score equal to or higher than 0.5 produce a positive binary output (many other models use the 0.5 threshold by default). However, using the default 0.5 threshold is suboptimal. In this blog post, I'll show you how you can choose the best threshold for your binary classifier. We'll be using Ploomber to execute our experiments in parallel and sklearn-evaluation to generate the plots.
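The idea of picking a threshold other than 0.5 can be sketched in a few lines of plain Python. This is an illustrative sketch only: it does not use Ploomber or sklearn-evaluation, the function names `f1_at_threshold` and `best_threshold` are hypothetical, the labels and scores are made up, and choosing F1 as the criterion is an assumption (the right metric depends on your problem).

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 score when observations with score >= threshold are labeled positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    if tp == 0:
        return 0.0  # no true positives: precision/recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, scores, candidates=None):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
    return max(candidates, key=lambda t: f1_at_threshold(y_true, scores, t))


# Made-up labels and classifier scores, for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.80]

best = best_threshold(y_true, scores)
f1_default = f1_at_threshold(y_true, scores, 0.5)
f1_best = f1_at_threshold(y_true, scores, best)
print(f"default 0.5: F1={f1_default:.3f}; best {best:.2f}: F1={f1_best:.3f}")
```

On this toy data the default threshold misses two of the three positives, while a lower threshold recovers them at the cost of one false positive. In practice the threshold should be selected on a validation set, not on the data used for the final evaluation.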
Who needs MLflow when you have SQLite?
I spent about six years working as a data scientist and tried to use MLflow several times (and other tools as well) to track my experiments; however, every time I tried it, I abandoned it a few days later. There were a few things I didn't like: it seemed too much to have to start a web server to look at my experiments, and I found the query feature extremely limiting (if my experiments are stored in a SQL table, why not allow me to query them with SQL?). I also found comparing experiments limited. I rarely have a project where a single metric (or a couple of metrics) is enough to evaluate a model. It's mostly a combination of metrics and evaluation plots that I need to look at to assess a model.
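To make the "query experiments with SQL" idea concrete, here is a minimal sketch using only Python's standard `sqlite3` module. It is not the tool the post goes on to describe: the table layout, the `log_experiment` helper, and the parameter and metric values are all hypothetical and chosen for illustration.

```python
import json
import sqlite3

# In-memory database for the example; pass a file path to persist experiments.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE experiments (
        id INTEGER PRIMARY KEY,
        created TEXT DEFAULT CURRENT_TIMESTAMP,
        params TEXT,      -- model parameters stored as a JSON string
        accuracy REAL,
        f1 REAL
    )
    """
)


def log_experiment(conn, params, accuracy, f1):
    """Insert one experiment row and return its id (hypothetical helper)."""
    cur = conn.execute(
        "INSERT INTO experiments (params, accuracy, f1) VALUES (?, ?, ?)",
        (json.dumps(params), accuracy, f1),
    )
    conn.commit()
    return cur.lastrowid


# Made-up results, for illustration only.
log_experiment(conn, {"model": "logreg", "C": 1.0}, accuracy=0.81, f1=0.74)
log_experiment(conn, {"model": "forest", "n_estimators": 100}, accuracy=0.86, f1=0.79)
log_experiment(conn, {"model": "svm", "kernel": "rbf"}, accuracy=0.78, f1=0.70)

# Plain SQL is enough to filter on one metric and rank by another.
rows = conn.execute(
    "SELECT id, params, accuracy, f1 FROM experiments "
    "WHERE accuracy > ? ORDER BY f1 DESC",
    (0.8,),
).fetchall()

for exp_id, params, accuracy, f1 in rows:
    print(exp_id, json.loads(params), accuracy, f1)
```

Because the store is a single SQLite file, no web server is needed to inspect results, and any combination of metrics can be compared with an ordinary query.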
vpj/lab
This library lets you organize machine learning experiments. It maintains logs, summaries and checkpoints of all the experiments in a folder structure without you having to worry about them explicitly. It keeps a reference to the git commit at which each experiment was run, along with other information such as the date, the Python file executed and the experiment description. Optionally, the library can update the Python file by automatically inserting experiment results as a comment. You can use monitored code segments to measure time and to get status updates on the console.
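The notion of a monitored code segment can be illustrated with a small context manager. This is a generic sketch and not the vpj/lab API: the name `monitored_section`, the `timings` dictionary, and the console format are assumptions made for the example.

```python
import time
from contextlib import contextmanager


@contextmanager
def monitored_section(name, timings):
    """Print a status line for a code segment and record its elapsed time.

    Hypothetical helper for illustration; not part of any library.
    """
    print(f"{name}...", end="", flush=True)
    start = time.perf_counter()
    try:
        yield
    except Exception:
        print(" [FAIL]")
        raise
    else:
        elapsed = time.perf_counter() - start
        timings[name] = elapsed  # seconds
        print(f" [DONE] {elapsed * 1000:.2f} ms")


timings = {}

with monitored_section("load data", timings):
    data = [i * i for i in range(100_000)]

with monitored_section("sum data", timings):
    total = sum(data)
```

Each segment prints its name when it starts and a status with the elapsed time when it finishes, and the collected `timings` can be written to the experiment log afterwards.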