What is a dataframe in Python?

You’ll frequently hear the word dataframe thrown around, sometimes by data scientists or Python programmers. It can be an intimidating subject, but it doesn’t need to be. Let’s talk about what a dataframe is in Python, in layperson’s terms, and how you can use one.

What a dataframe is in Python

Think of a dataframe like a hotrodded spreadsheet with rows and columns you can manipulate in code.

Think of a dataframe as an overgrown spreadsheet, with columns that you can address by position or by their name, and without that pesky 1,048,576-row limit that Excel has, or the 65,536-row limit that older spreadsheets had.
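
For example, here’s what addressing a column by name versus by position looks like in Pandas. This is a minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'host': ['web01', 'web02'], 'vulns': [12, 7]})

print(df['vulns'])    # address a column by its name
print(df.iloc[:, 1])  # address the same column by its position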

It is a data structure that you have to initialize and populate, but Python’s Pandas module makes it extremely easy to convert common data formats like Excel, CSV, and JSON into a dataframe. You can convert XML too; it’s just not quite as easy. But it’s still very doable.
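
Here’s a minimal sketch of those conversions. The filenames are hypothetical stand-ins for your own files:

import pandas as pd

df_from_excel = pd.read_excel('findings.xlsx')
df_from_csv = pd.read_csv('findings.csv')
df_from_json = pd.read_json('findings.json')
df_from_xml = pd.read_xml('findings.xml')  # fine for shallow XML; deeply nested XML takes more work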

The upside to dataframes is that you can manipulate them in code, even on large datasets. You can add rows and columns, and easily delete columns with data you don’t care about. You can fill in data from other columns. And you can do any calculation that a spreadsheet would allow, and even do pivots.
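
Here’s a quick sketch of what those operations look like, again with made-up column names:

import pandas as pd

df = pd.DataFrame({'host': ['web01', 'web02', 'db01'],
                   'vulns': [12.0, 7.0, None],
                   'scratch': ['', '', '']})

df = df.drop(columns=['scratch'])              # delete a column you don't care about
df['vulns'] = df['vulns'].fillna(0)            # fill in missing data with a default
df['share'] = df['vulns'] / df['vulns'].sum()  # calculate a new column from another
print(df.pivot_table(index='host', values='vulns', aggfunc='sum'))  # a simple pivot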

Advantages of dataframes

Dataframes were a game changer for me. I’m a security analyst by trade, and I specialize in analyzing vulnerability scan data. These datasets become huge in a hurry. Excel’s limit of 1,048,576 rows is the one that gets me most frequently, since the datasets can often be several million rows. Those aren’t necessarily all vulnerabilities, mind you. Some of them are informational findings that I need in order to explain some of the other items. I can filter them out to save rows, but then I lose valuable data like system uptime, how long a system takes to scan, and other information that I need to do analysis well.

If I can tell you how many vulnerabilities you have per system on average and what that number was last month, I’m an average or maybe an above-average security analyst. But that doesn’t give you any insight into why. If I can’t tell you why those counts are what they are, I’m just going to be watching those numbers track up and up.

But if I load that huge dataset into a dataframe so I can analyze all of it, I can do cool things like prove whether there is a correlation between vulnerability count and system uptime.
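
Once the data is in a dataframe, that correlation check is a one-liner. Here’s a hedged sketch, assuming hypothetical columns named vuln_count and uptime_days:

import pandas as pd

df = pd.DataFrame({'vuln_count': [12, 7, 30, 4],
                   'uptime_days': [90, 45, 400, 10]})

# Pearson correlation: values near 1 or -1 suggest a relationship,
# values near 0 suggest there isn't one
print(df['vuln_count'].corr(df['uptime_days']))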

But the number of columns can be overwhelming too. I’m going to provide an example dataset from a government source. It’s public domain, so you can run the code yourself if you want, and study the code, the data, and anything else you like. It’s 524 columns when fully decoded. I care about 20 of them. Manually deleting 504 columns in a spreadsheet is slow going even on the best of days. But with a dataframe, I can drop all but the columns I want in seconds.
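
Dropping the extra columns takes just a couple of lines. The column names below are placeholders for whichever ones you actually need:

import pandas as pd

df = pd.read_csv('big_dataset.csv')  # hypothetical wide file
keep = ['cve.id', 'cve.published', 'cve.vulnStatus']
df = df[keep]                        # every other column is gone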

Dataframes aren’t nearly as user-friendly as Excel, but you don’t need a degree in computer science to learn them either. I don’t have one. And with the right library, you can output the results of your analysis to Microsoft Office formats such as Excel.

Use cases in your own job

There are plenty of other jobs that deal with huge datasets, and any of them can benefit from learning how to use dataframes. And even if you don’t deal in huge datasets, but you do a lot of repetitive work with data, dataframes can be helpful by allowing you to automate some of that process. You can write a script that imports the data into a dataframe, performs repetitive calculations, filters the data, then writes out the file for you, as in the sketch below. This allows you to produce the report more quickly, but just as importantly, it ensures that you follow the same process every time, so the results will be more consistent as well.
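
The skeleton of that kind of script is short. Everything below (the filenames, columns, and filter) is a hypothetical example of the pattern, not a recipe for any specific dataset:

import pandas as pd

df = pd.read_csv('weekly_export.csv')                      # import the data
df['avg_per_host'] = df['total_vulns'] / df['host_count']  # repetitive calculation
df = df[df['severity'] != 'Informational']                 # filter out rows you don't need
df.to_excel('weekly_report.xlsx', index=False)             # write out the report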

Not only that, it gives you a new buzzword or two to include in your LinkedIn profile. Recruiters and hiring managers may or may not care if you know what a dataframe is, but if you know how to automate stuff, they’re always interested.

An example

This example downloads CVE data from the National Vulnerability Database. At the time I write this, it’s about 210,000 rows by 524 columns, which is an enormous dataset. You can load it into Excel or LibreOffice Calc, but it will get really sluggish. It’s much faster to manipulate it as a dataframe, then export the parts you need.

This example dataframe code is written in Python 3.11. I explain the API pagination part of the code in another blog post.

from flatten_json import flatten
import json
import pandas as pd
import requests
import time

nvddf = pd.DataFrame() # initialize the dataframe

# download NVD CVE data and populate the dataframe
start = 0
total = 1
url = 'https://services.nvd.nist.gov/rest/json/cves/2.0'
while start < total:
    response = requests.get(url, params={'startIndex': start}) # startIndex and resultsPerPage are the parameters
    data = json.loads(response.text)
    print('startIndex: ', data['startIndex'])
    start += data['resultsPerPage'] # startIndex is a record offset, so advance by a full page, not by one
    total = data['totalResults']
    time.sleep(6) # stay under the public NVD API's rate limit for unauthenticated requests

    # flatten the json - pd.json_normalize would be simpler but this one is too deeply nested for that
    d = data['vulnerabilities']
    nvd_flat = (flatten(record, '.') for record in d)
    nvd = pd.DataFrame(nvd_flat)

    # concatenate this page to the rest, renumbering the rows as we go
    nvddf = pd.concat([nvddf, nvd], ignore_index=True)

print('Original dataframe:')
print(nvddf)

# drop the columns we don't need; reindex keeps the listed columns and tolerates any that are missing
nvddf = nvddf.reindex(columns=['cve.id','cve.sourceIdentifier','cve.published','cve.lastModified','cve.vulnStatus',
                               'cve.descriptions.0.value','cve.references','cve.evaluatorSolution',
                               'cve.evaluatorImpact','cve.vendorComments.0.comment','cve.evaluatorComment',
                               'cve.metrics.cvssMetricV2.0.cvssData.baseScore','cve.metrics.cvssMetricV2.0.baseSeverity',
                               'cve.metrics.cvssMetricV31.0.cvssData.baseScore','cve.metrics.cvssMetricV31.0.cvssData.baseSeverity',
                               'cve.metrics.cvssMetricV30.0.cvssData.baseScore','cve.metrics.cvssMetricV30.0.cvssData.baseSeverity'])
print('Truncated dataframe:')
print(nvddf)

# write the result out to Excel; the with block closes the writer and saves the file
with pd.ExcelWriter("NVD-CVEs.xlsx", engine='xlsxwriter') as writer:
    nvddf.to_excel(writer, sheet_name='CVE data', index=False)

What a dataframe is: In conclusion

It would be easy to add code that creates more columns to calculate averages, fills in missing data from other columns or with default values, and much more. If you’re feeling really ambitious, you can make pivot tables from it too, and it’s infinitely less likely to choke on large datasets than Excel. Think of a dataframe as a hotrod spreadsheet with a steeper learning curve, but an infinitely higher payoff to go with it.
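
For instance, counting CVEs by status and severity is a short pivot on the dataframe built above. This sketch assumes the truncated nvddf from the example, and that the CVSS v3.1 severity column came through in your download:

# count CVEs by status and CVSS v3.1 severity
pivot = nvddf.pivot_table(index='cve.vulnStatus',
                          columns='cve.metrics.cvssMetricV31.0.cvssData.baseSeverity',
                          values='cve.id',
                          aggfunc='count')
print(pivot)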

If you found this post informative or helpful, please share it!