In this post, we’ll dive into the shifting landscape of generative AI and data analysis. We’ll take a look at 8 leading models and see how they fare on the same data analysis tasks. You might be surprised by how wildly the results vary.
A Shifting Landscape in the Numbers Game
Data analysis can be challenging as it requires sifting through mountains of information, determining meaningful insights, and translating those insights into actionable strategies. It’s a complex dance of logic, statistics, and domain expertise.
With the rise of generative AI, we have the potential to transform numerous aspects of our lives. We’ve been wowed by AI models’ abilities to produce human-like text, generate stunning images, and even write code. These models excel at understanding and identifying relationships in language, making them invaluable for tasks like content creation, translation, and customer service. However, generative AI models are primarily trained on patterns in text and images, so when it comes to crunching numbers, their performance has traditionally been less impressive.

But the winds are changing. The AI landscape is evolving rapidly, and we’re witnessing significant strides in AI’s ability to tackle data analysis tasks. Researchers are developing new architectures and training methods that enable AI models to better understand and manipulate numerical data. We’re seeing:
- Improved numerical reasoning: Models are learning to perform more complex calculations and statistical analyses with greater accuracy.
- Enhanced data visualization: AI is helping to automate the creation of insightful charts and graphs, making it easier to understand complex datasets.
- Automated data cleaning and preprocessing: AI is streamlining the often tedious task of preparing data for analysis, freeing up human analysts to focus on higher-level tasks.
- More accurate predictions: AI-powered predictive analytics are becoming more accurate, helping businesses make better decisions.
While we’re not yet at a point where AI can fully replace human data analysts, the technology is rapidly becoming a powerful tool for augmenting their capabilities. The near future of data analysis is likely to be a collaborative one, where human experts and AI work together to unlock the full potential of data. AI can handle the grunt work of data processing and analysis, allowing human experts to focus on interpreting the results and applying their domain expertise. Without further ado, let’s give some AI models a whirl with some datasets and analysis tasks and see how they get on.
The Experiment Setup
This experiment focused specifically on how well these models could process air quality data from multiple locations, generate insightful visualizations, and identify meaningful patterns across datasets.
The Contenders
The analysis included eight leading AI models/tools:
- Claude by Anthropic
- ChatGPT
- Llama by Meta
- Gemini by Google
- DeepSeek
- Le Chat by Mistral
- Grok by X
- Julius AI
The Challenge
Each model was evaluated on four key tasks:
- Ability to generate figures and graphs from a single dataset
- Capability to create specific visualizations (calendar cards with EPA colors) from the given dataset
- Ability to compare multiple datasets and provide analysis
- Capability to perform deeper analysis, such as identifying notable events with high readings
Each model was assigned 0-10 points based on how well it accomplished each task above, for a max score of 40.
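To make the calendar-card task concrete, here’s a minimal sketch of how such a visualization could be built by hand with matplotlib. The EPA AQI category breakpoints and colors are the standard published scale; the `calendar_card` helper and its layout are our own illustrative choices, not any model’s actual output.

```python
import calendar
import matplotlib.pyplot as plt
import matplotlib.patches as patches

# Standard EPA AQI categories as (upper bound, color) pairs.
EPA_SCALE = [
    (50, "#00e400"),   # Good
    (100, "#ffff00"),  # Moderate
    (150, "#ff7e00"),  # Unhealthy for Sensitive Groups
    (200, "#ff0000"),  # Unhealthy
    (300, "#8f3f97"),  # Very Unhealthy
    (500, "#7e0023"),  # Hazardous
]

def aqi_color(aqi):
    """Return the EPA color for an AQI value."""
    for upper, color in EPA_SCALE:
        if aqi <= upper:
            return color
    return EPA_SCALE[-1][1]  # values beyond 500 stay Hazardous maroon

def calendar_card(year, month, daily_aqi, ax):
    """Draw one month as a grid of day cells colored by AQI.

    daily_aqi maps day-of-month (int) -> AQI value; days without
    data are drawn in gray.
    """
    ax.set_title(f"{calendar.month_name[month]} {year}")
    for week_idx, week in enumerate(calendar.monthcalendar(year, month)):
        for day_idx, day in enumerate(week):
            if day == 0:
                continue  # padding cell outside the month
            color = aqi_color(daily_aqi[day]) if day in daily_aqi else "#dddddd"
            ax.add_patch(patches.Rectangle((day_idx, -week_idx), 1, 1,
                                           facecolor=color, edgecolor="white"))
            ax.text(day_idx + 0.5, -week_idx + 0.5, str(day),
                    ha="center", va="center", fontsize=8)
    ax.set_xlim(0, 7)
    ax.set_ylim(-6, 1)
    ax.axis("off")

# Example: render November 2024 with a hypothetical spike on the 9th.
fig, ax = plt.subplots(figsize=(4, 4))
calendar_card(2024, 11, {8: 42, 9: 162, 10: 85}, ax)
fig.savefig("november_aqi_card.png")
```

A full calendar card would add a color legend and one card per month, but the core of the task is exactly this mapping from dates to EPA-colored cells.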
The Datasets
The experiment utilized air quality index (AQI) data from three locations:
- For the first and second tasks (single-dataset analysis):
  - Location 1: Daily max AQI (Oct-Feb), with 120 data points
- For the comparative analysis (third and fourth tasks):
  - Location 1: AQI readings (Oct-Feb), with 2,849 data points
  - Location 2: AQI readings (Oct-Feb), with 2,849 data points
  - Location 3: AQI readings (Nov-Feb), with 2,134 data points
Results: Model-by-Model Breakdown
Claude by Anthropic (Total Score: 40/40)
Claude was the clear winner, scoring perfectly across all categories. When asked to create data visualizations, it delivered a “social media-friendly dashboard” with an intuitive line chart, stat cards, and detailed analysis. For the calendar visualization with EPA colors, Claude provided a clean, properly formatted layout with an appropriate color legend.
In the multiple dataset comparison, Claude identified key patterns, noting that Location 1 had both the lowest (7) and highest (162) individual readings while maintaining similar average air quality across all three locations. For notable event analysis, Claude pinpointed specific dates with high readings (November 9, 2024; December 28, 2024; January 26, 2025) and recognized synchronized events affecting all three locations, suggesting regional air quality issues.
Claude also presented the most significant spikes and their timing through a visualization to show the high AQI events more clearly.
ChatGPT (Total Score: 37/40)
ChatGPT performed strongly, losing points only on the calendar visualization task. While it successfully created line charts, bar charts, and summary statistics, the calendar visualization required multiple revisions to correct issues like reverse-ordered weeks and incorrect date formats.
For comparative analysis, ChatGPT effectively identified that Location 3 had the lowest AQI variability, while Location 1 and Location 2 experienced greater fluctuations with higher peak values. Its notable event analysis correctly identified November 9, 2024, as a significant pollution event for Location 1 and Location 2.
Gemini by Google (Total Score: 25/40)
Gemini scored well on multi-dataset comparisons but struggled with specific visualizations. It created monthly median AQI charts, but the months weren’t arranged chronologically, making the graphics difficult to interpret. It failed to produce calendar visualizations with EPA colors, instead offering a heat map that didn’t utilize the EPA AQI color scale.
Despite these visualization shortcomings, Gemini performed well in comparative analysis, correctly identifying that the three locations had similar average AQI levels while Location 1 showed greater variability and experienced the highest individual readings.
Le Chat by Mistral (Total Score: 21/40)
Le Chat couldn’t directly generate visualizations but provided useful comparative analysis. Instead of creating graphics, it offered guidance on using Excel, Python, or online tools to visualize the data.
Where Le Chat shone was in its comparative analysis. It correctly identified similar mean AQI values across datasets (35.23, 34.55, and 35.27) and recognized that the second dataset (Location 1) had the highest maximum AQI value (162). For notable events, it systematically listed all readings above 100 AQI across the datasets.
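As a rough illustration of what this kind of comparative analysis involves under the hood, here is a small pandas sketch that computes the same style of summary statistics and lists readings above the 100-AQI threshold. The column names and sample values are made up for the example; they are not the experiment’s data.

```python
import pandas as pd

def summarize(aqi):
    """Mean, max, and min of an AQI series, rounded like a report."""
    return {"mean": round(aqi.mean(), 2),
            "max": int(aqi.max()),
            "min": int(aqi.min())}

def notable_events(df, threshold=100):
    """Readings above the EPA 'Unhealthy for Sensitive Groups' cutoff."""
    return (df[df["aqi"] > threshold]
            .sort_values("aqi", ascending=False)
            .reset_index(drop=True))

# Tiny illustrative dataset (invented values, one column per reading).
df = pd.DataFrame({
    "date": pd.to_datetime(["2024-11-08", "2024-11-09",
                            "2024-12-28", "2025-01-26"]),
    "aqi": [42, 162, 118, 131],
})

stats = summarize(df["aqi"])
events = notable_events(df)
print(stats)   # → {'mean': 113.25, 'max': 162, 'min': 42}
print(events[["date", "aqi"]])
```

Running the same two functions over each location’s file and comparing the resulting dictionaries is essentially the analysis the stronger models performed in one step.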
Grok by X (Total Score: 18/40)
Grok attempted to generate visuals but produced chart-like AI-generated images that weren’t actual data visualizations. Its calendar visualizations were interesting but ultimately nonsensical.
In its comparative analysis, Grok made some data import errors, incorrectly reporting the number of readings for each dataset. However, it offered unique insights, such as noting how Location 3’s dataset might show higher average AQI because it was missing the typically cleaner October data.
Julius AI (Total Score: 17/40)
Julius AI performed well on data visualization tasks but couldn’t be evaluated on multi-dataset comparisons due to paid-tier restrictions. It successfully created line charts and calendar visualizations, though the calendar weeks were in reverse order. Julius AI also produced a line chart with EPA AQI categories clearly marked.
Since multiple file uploads required a paid upgrade, its capabilities for comparative analysis could not be assessed in the context of this experiment – we were using free plans only.
Llama by Meta (Total Score: 3/40)
Llama faced significant limitations in handling data uploads and visualization tasks. It provided text-based data summaries and attempted ASCII art representations of charts but couldn’t generate proper visualizations. Since Llama doesn’t allow CSV file uploads, comparative analysis couldn’t be performed.
DeepSeek (Total Score: 2/40)
DeepSeek performed the poorest, unable to generate visualizations or properly analyze multiple datasets. Like Le Chat, it offered guidance on creating visualizations using external tools but couldn’t produce them directly. When attempting to upload multiple datasets, there were size limitations and server issues that prevented proper analysis.
Key Findings
Data Visualization Capabilities
Claude, ChatGPT, and Julius AI (which uses models from Claude and OpenAI behind the scenes) led in generating useful graphics from datasets. These were the only models that could create visualizations similar to the requested calendar cards with EPA colors. The lower-ranked models either produced confusing graphics or could only offer text-based guidance.
Data Comparison Capabilities
Claude, ChatGPT, Gemini, and Le Chat tied in their ability to compare multiple datasets. All four models correctly identified that Location 1 showed the highest variability, Location 3 had more stability, and all three locations had comparable mean values. They also consistently identified the same high AQI events.
Interestingly, despite data import issues, Grok offered unique insights that other models missed, such as noting how missing October data might skew the comparison and being the only model to provide a comparison over a common time period.
Conclusion
Claude emerged as the clear winner in this comparison, with ChatGPT following closely behind. For data visualization specifically, Claude, ChatGPT, and Julius AI showed the strongest capabilities. For comparative analysis, Claude, ChatGPT, Gemini, and Le Chat performed equally well.
While Julius AI showed promise in visualization tasks, its full potential couldn’t be evaluated due to limitations in the free version. Grok, despite its data import issues, offered some of the most unique insights in its analysis.
For users seeking AI assistance with data analysis, the choice of model can significantly impact the quality and depth of insights obtained. Claude and ChatGPT currently offer the most comprehensive capabilities in this domain. Looking ahead, though, unlocking the full potential of data will still require collaboration between human experts and AI.
Are you interested in exploring how AI can help take your data analysis to the next level? Get in touch with us – we’d love to hear about your project.
