DeepSeek-R1 has generated a lot of excitement and concern, especially around OpenAI's competing o1 model. So we put the two through a small comparison on some simple data-analysis and market-research tasks.
To put the models on an equal footing, we used Perplexity Pro Search, which now supports both o1 and R1. Our goal was to look beyond the benchmarks and examine whether the models can actually perform ad hoc tasks that require gathering information from the web, picking out the right data, and carrying out simple jobs that would otherwise take considerable manual effort.
Both models are impressive, but they make mistakes when the prompts lack specificity. o1 is slightly better at reasoning tasks, but R1's transparency gives it an advantage in the cases (and there will be some) where things go wrong.
Here is a breakdown of a few of our experiments and links to the Perplexity pages where you can check the results yourself.
Calculating investment returns from the web
Our first test measured whether the models could calculate return on investment (ROI). We considered a scenario in which the user invested $140 in the Magnificent Seven (Alphabet, Amazon, Apple, Meta, Microsoft, Nvidia, Tesla) on the first day of every month from January to December 2024, split evenly across the stocks. We asked the model to calculate the value of the portfolio at the current date.
To accomplish this task, the model would have to pull the price of each of the seven stocks for the first day of every month, split the monthly investment evenly across them ($20 per stock), sum up the shares purchased, and value the portfolio according to the stock prices at the current date.
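To make the expected arithmetic concrete, here is a minimal Python sketch for a single ticker, using made-up placeholder prices rather than real market data; the full task repeats this for all seven stocks and sums the results.

MONTHLY_PER_STOCK = 140 / 7  # $140 split evenly across 7 stocks = $20 each

# Price on the first trading day of each month, Jan-Dec 2024 (illustrative only)
monthly_prices = [185.0, 184.0, 180.0, 170.0, 173.0, 192.0,
                  212.0, 220.0, 229.0, 226.0, 222.0, 239.0]
current_price = 236.0  # illustrative price "today"

# Shares bought each month ($20 / that month's price), accumulated over the year
shares_held = sum(MONTHLY_PER_STOCK / price for price in monthly_prices)
stock_value = shares_held * current_price
print(f"Value of this position today: ${stock_value:,.2f}")

# Summing stock_value over all seven tickers gives the portfolio value;
# ROI = (portfolio_value - 1680) / 1680, since $140 x 12 months = $1,680 invested.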
Both models failed at this task. o1 returned a list of stock prices for January 2024 and January 2025, along with a formula for calculating the portfolio value, but it could not compute the correct values and essentially said the ROI could not be calculated. R1, on the other hand, made the mistake of only investing in January 2024 and calculating the returns for January 2025.
What was interesting, however, was the models' reasoning process. o1 did not provide much detail about how it had reached its results. But R1's reasoning trace showed that it lacked the information because Perplexity's retrieval engine had not fetched the monthly stock-price data (many retrieval-augmented generation failures happen not because of the model but because of poor retrieval). This turned out to be an important piece of feedback that led us to the next experiment.

Reasoning over file contents
We decided to run the same experiment as before, but instead of asking the model to retrieve the information from the web, we provided it in a text file. To do this, we copied the monthly stock data for each share from Yahoo! Finance into a text file and gave it to the model. The file contained the name of each stock plus the HTML table with the price for the first day of each month from January to December 2024, along with the last recorded price. The data was not cleaned up, both to minimize manual effort and to test whether the model could pick the right parts out of the data.
Here, too, neither model provided the correct answer. o1 seemed to have extracted the data from the file, but suggested that the calculation be done manually in a tool such as Excel. Its reasoning trace was very vague and contained nothing useful for troubleshooting the model. R1 also failed and provided no answer, but its reasoning trace contained a lot of useful information.
For example, it was clear that the model had correctly parsed the HTML data for each stock and was able to extract the right information. It was also able to compute the monthly investments, add them up, and calculate the final value according to each stock's last price in the table. However, that final value remained in its chain of reasoning and never made it into the final answer. The model had also been confused by a row in the Nvidia table that marked the company's 10:1 stock split on June 10, 2024, and ended up misreporting the final value of the portfolio.
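As a rough illustration of what cleaning that file involves, here is a sketch using pandas; the file name is hypothetical, and the column names assume the standard Yahoo! Finance historical table layout (Date, Open, High, Low, Close, ...), which is an assumption on our part.

import pandas as pd

# One HTML table per stock in the pasted file (hypothetical file name)
tables = pd.read_html("magnificent_seven_prices.html")

def monthly_closes(table: pd.DataFrame) -> pd.Series:
    df = table.copy()
    # Dividend and split rows (e.g. NVDA's 10:1 split on June 10, 2024) don't
    # contain numeric prices; coercing to numbers and dropping NaNs removes them.
    df["Close"] = pd.to_numeric(df["Close"], errors="coerce")
    df = df.dropna(subset=["Close"])
    df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
    # Return closing prices indexed by date, oldest first
    return df.set_index("Date")["Close"].sort_index()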

Here, too, the real differentiator was not the result itself but the ability to investigate how the model arrived at its response. In this case, R1 gave us a better experience, letting us understand the model's limitations and how to reformulate our prompt and format our data to get better results in the future.
Comparing data across the web
Another experiment we ran required the model to compare the stats of four leading NBA centers and determine which one showed the best improvement in field goal percentage (FG%) from the 2022/2023 season to the 2023/2024 season. This task required the model to perform multi-step reasoning over different data points. The catch in the prompt was that it included Victor Wembanyama, who only entered the league as a rookie in 2023.
Retrieval for this prompt was much easier, since player stats are widely reported on the web and are usually included in their Wikipedia and NBA profiles. Both models answered correctly (it's Giannis, if you were curious), although depending on the sources they used, their figures differed slightly. However, they did not realize that Wemby did not qualify for the comparison and pulled in other stats from his time in the European league.
In its answer, R1 offered a better breakdown of the results, with a comparison table along with links to the sources it used. The added context enabled us to correct the prompt. After we modified the prompt to specify that we were looking for FG% from NBA seasons, the model correctly ruled Wemby out of the results.
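To illustrate the comparison logic and why Wembanyama has to be excluded, here is a small Python sketch; the player list and FG% values are illustrative stand-ins, not figures taken from the experiment.

# FG% per NBA regular season (illustrative values, expressed as fractions)
fg_pct = {
    "Nikola Jokic": {"2022-23": 0.632, "2023-24": 0.583},
    "Joel Embiid": {"2022-23": 0.548, "2023-24": 0.529},
    "Giannis Antetokounmpo": {"2022-23": 0.553, "2023-24": 0.611},
    "Victor Wembanyama": {"2023-24": 0.465},  # no 2022-23 NBA season (rookie)
}

# Only players with both NBA seasons qualify, which rules Wemby out
improvements = {
    player: seasons["2023-24"] - seasons["2022-23"]
    for player, seasons in fg_pct.items()
    if "2022-23" in seasons and "2023-24" in seasons
}
best = max(improvements, key=improvements.get)
print(best, f"{improvements[best]:+.1%}")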

Conclusion
Reasoning models are powerful tools, but they still have a way to go before they can be fully trusted with tasks, especially as other components of LLM applications (beyond the language model itself) continue to evolve. From our experiments, both o1 and R1 can still make basic mistakes. Despite their impressive results, they still need a bit of hand-holding to deliver accurate answers.
Ideally, a reasoning model should be able to tell the user when it lacks the information for the task. Alternatively, the model's reasoning trace should help users understand errors and correct their prompts to increase the accuracy and stability of the model's answers. In this regard, R1 had the upper hand. Hopefully, future reasoning models, including OpenAI's upcoming o3 series, will give users more visibility and control.