Home > Digital technologies > Development > Why Python is better than Excel for processing massive data

Why Python is better than Excel for processing massive data

Published on July 7, 2022
Share this page :
For big data, Python rules the roost

Excel is the software most used by businesses to process and present data. However, it reaches its limits as soon as we wish to manipulate large quantities of information. To overcome technical limits, there is a powerful solution accessible to everyone: Python. Senior data scientist and ORSYS trainer, Audrey Quessada Vial* demonstrates by example the superiority of Python for analyzing data. 

Since the 2010s, data has become the new black gold for businesses. Big data, digital transformation and the development of artificial intelligence have changed the situation. How to use data effectively?

Of course, Excel comes first among the solutions. Very widespread and easy to use, it remains essential. But as soon as we want to exploit massive data, beyond 1 million entries, we come up against technical limitations. The spreadsheet exceeds 1 million lines (1,048,576 lines to be precise, or 220).

Many datasets exceed this value. This is the case for stock price histories (809 MB in zip format for American stocks), the Sirene database of French companies from INSEE (1.3 GB), the national database of buildings of the Scientific and Technical Center of the building (5.3 GB), or the database of French INPI brands (15 GB).

To remove these limitations, you must abandon Excel and use Python. Why Python? This language stands out from other solutions for processing massive data (R, Julia, MATLAB, Scala, SQL, etc.) by its simplicity of syntax, its memory management and above all its imposing number of libraries (or bookstores).

Python is therefore aimed at data scientists as well as non-specialists such as finance and marketing professionals, and more generally all advanced Excel users.

Most often open source, Python libraries allow you to extend its capabilities in many areas. For data, THE flagship library is called Pandas.

Python and Pandas, your allies for data

Getting started with Python and Pandas is relatively simple and quick, suitable for programming newbies. No wonder Python will be the most used language in the world in 2022, ahead of Java and C/C++, according to the PYPL popularity index. Python skills are also highly sought after by companies: it is the second most requested language in job advertisements after JavaScript according to a study by theCalifornia University of Berkeley.

Pandas does not come from the name of the friendly animal, symbol of China, but from Bangel Data, a data structure widely used in econometrics, the statistical study of economic data.

The power of Pandas comes from its speed in processing data. It can easily handle a million entries in a matter of seconds.

Flexible, it is also used for basic data visualization and to create different views of the dataset using dynamic cross tables. It is also possible to group data by category and perform aggregation operations from the simplest (cumulative or average sums for example) to the most complex. In addition, with one line of code, you can open your file and then work on the data. It is even possible to perform SQL queries with Pandas.

Another advantage, Pandas accepts an impressive list of file formats: CSV, XLSX, SQL, Apache Parquet, HDF5, JSON… and many others.

To summarize, the major advantages of Pandas compared to Excel are:

  • managing millions of lines
  • the rapidity
  • task automation: with Excel, this automation is limited when using VBA or macros
  • reporting using the Jupyter Notebook platform for data visualization
  • cross-platform compatibility: you can work on both macOS and Windows.

Moving from Excel to Python and Pandas

With some knowledge of Python, it is easy to move from Microsoft Excel to Pandas. Most tasks that are done in Excel can also be done with Pandas.

We will take an example to illustrate the ease and power of Python and Pandas. Here is a dataset from the World Bank that can be found on the site Kaggle. This file is 574.3 MB in size and contains almost 6 million entries. It brings together a certain number of economic and social indicators by country and by year.

It takes less than 3 seconds to read this file with a single line of code.

We will now view the first five rows to see what this data represents. Here again, a single line of code is enough:

It is also possible to retrieve the main statistical indicators associated with each column:

To filter data, nothing could be simpler. In the following example, we will filter the data to keep only GDP per capita as an indicator:

What if we calculated the average and maximum GDP per capita over all years by country? Again, only one line of code with Pandas is needed:

These few examples illustrate the power of Pandas. They are far from exhaustive of the possibilities that are within your reach.

We can summarize in a table for which uses to use Python and Pandas instead of Excel.

Comparison of Excel and Python Pandas according to uses

Pandas is an extremely powerful library for working with data. It has many advantages over Excel and allows you to go much further in data processing and task automation. Its flexibility and speed make it an essential tool for data science.

However, Pandas is not the only Python library that has advantages for processing data. Visualization tools make it possible to understand the interactions between parameters. Many libraries offer visualization tools: geovisualization with Folium, interactive visualization with Plotly, creation of interactive dashboards with Dash, visualization of big data with Holoviews, there is no shortage of choice.

To conclude, Python has a lot of advantages. Thanks to its freely accessible libraries, it outperforms Excel to process massive data (structured or not) and visualize it. Our training courses will allow you to quickly get to grips with this powerful tool.

Our expert

Audrey QUESSADA VIAL

Python

Holder of a doctorate in laser physics, she does fundamental research in interface […]

associated domain

Software development

associated training

Big Data Analytics with Python

Python for data science

Introduction to programming with Python