🥇 Gold Mining PitchBook Data with ChatGPT

Unlocking the Wealth of Private Market Data

[Disclaimer: This post is strictly my personal view and written for educational purposes only. Please exercise diligence when consuming third-party data and/or source code. Consult with your investment and/or legal team(s) when in doubt.]

On a coaching call last week, a client (fictitiously known as A Venture) was looking to build a performance benchmarking tool using PitchBook private equity data.

Only a few hundred of A Venture’s highest-profile investors would see the presentation at the annual meeting.

Within 24 hours.

No pressure! We got busy right away.

❓What is PitchBook Direct Data?

In the world of finance and investment, data is king.

PitchBook is a leading provider of private market data, with a vast database covering private capital markets, including venture capital, private equity, and M&A transactions.

PitchBook Direct Data provides direct access to its extensive data sets via an Application Programming Interface (API) and direct feeds.

If you seek detailed insights into the private equity and venture capital markets, PitchBook absolutely must be on your short list.

🗝️ Key to the Kingdom

To get the most out of PitchBook Direct Data, you first need to match your target entities with the corresponding ones within the PitchBook database.

PitchBook uses a ranking algorithm to match entities.

It takes into account various criteria: entity name, address information, and most importantly, the URL.

The URL field stands out as the most reliable: the key to this data kingdom.

Unlike names, which can be common, or locations, which can be shared, the URL is a distinct identifier that is exclusive to each entity (to a large extent).

Using the URL as the primary matching criterion significantly enhances the accuracy of entity matching.

Let's start matchmaking!

🐍 Python vs Excel

So. How does Python get in the picture? Won't Excel suffice?

At first glance, Python and Excel might seem as different as humanly possible.

Excel is user-friendly, interactive, and excellent for handling small to medium-sized datasets.

However, when it comes to dealing with large volumes of data or complex operations, Excel starts to fade.

This is where Python comes into play.

Python is built for large datasets. It offers robust tools for cleaning, transforming, and analyzing data.

It has a steeper learning curve compared to Excel.

But its potential for automation and performing complex operations is unmatched.

Clock is ticking. Let's start coding!

🛠️ The Setup

You need a development environment for Python.

Since we're pressed for time, let's keep things simple and go for a cloud platform with minimal setup.

Google Colab is a free cloud service that supports Python and GPU usage. It seamlessly integrates with Google Drive and other Google services.

Other candidates include Jupyter Notebook and Microsoft Azure Notebooks.

You don't have time to overanalyze tooling. March on!

🧹 Preparing Source Data with Python and ChatGPT

Your firm has a prolific deal flow, with 50,000 private companies to be matched.

Even the most diligent analysts don't work that fast.

Python does.

ChatGPT Prompt

Write Python code. Use Pandas to read an Excel file containing a table of companies. Columns are CompanyName, URL, City, State, Country. Name this dataframe as "df".

#ChatGPT Output

import pandas as pd

# Replace 'your_file.xlsx' with the path to your Excel file
excel_file = 'your_file.xlsx'

# Read the Excel file
df = pd.read_excel(excel_file)

# Display the DataFrame
print(df)

You notice lots of inconsistencies in the URL column. For instance:

  • https://www.acme.com

  • http://acme.com/investors/

  • acme.com/en

Here comes the secret weapon.

You’ll extract the domain name (e.g. acme.com) for cleaner comparisons.

Write code to extract top-level domains out of the URL column.

from urllib.parse import urlparse

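# Extract the domain from the URL. urlparse only finds the host when a scheme
# is present, so add one if it is missing, then drop a leading "www."
def extract_domain(url):
    url = url.strip().lower()
    if '://' not in url:
        url = 'http://' + url
    return urlparse(url).netloc.removeprefix('www.')

df['domain'] = df['URL'].apply(extract_domain)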

# Display the DataFrame
print(df)
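
Before trusting the new column, it is worth a quick check that the three sample URLs above really collapse to the same value. The sketch below is one way to do that, not the only one. The helper name normalize_domain is hypothetical, and the choices to return None for blank or missing URLs and to strip a leading "www." are assumptions made for illustration.

```python
from urllib.parse import urlparse

import pandas as pd


def normalize_domain(url):
    """Return a bare host such as 'acme.com', or None if the value is unusable."""
    # Missing cells arrive as None/NaN rather than strings
    if not isinstance(url, str) or not url.strip():
        return None
    url = url.strip().lower()
    # urlparse only fills in the host when a scheme is present
    if "://" not in url:
        url = "http://" + url
    host = urlparse(url).hostname  # hostname also drops any port number
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host


# The three sample URLs from above, plus one missing value
samples = pd.Series([
    "https://www.acme.com",
    "http://acme.com/investors/",
    "acme.com/en",
    None,
])
domains = samples.apply(normalize_domain)
print(domains.tolist())
```

All three sample URLs reduce to acme.com, and the missing value stays missing instead of raising an error.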

The source data is looking suave.

🛁 Preparing PitchBook Data

PitchBook data comes as CSV files and is highly normalized.

In this case, you'll simply read the CSV content and massage the URL column to match with the source data.

Write Python code. Use Pandas to read a CSV file containing a table of companies. Columns are PitchBookID, PitchBookCompany, URL, City, State, Country. Name this dataframe as "df_pb".

# Replace 'your_file.csv' with the path to your CSV file
csv_file_pb = 'your_file.csv'

# Read the CSV file
df_pb = pd.read_csv(csv_file_pb)

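# Extract the domain from the URL, normalized like the source data:
# add a scheme when missing (urlparse needs one), drop a leading "www."
df_pb['domain_pb'] = df_pb['URL'].str.strip().str.lower().apply(
    lambda url: urlparse(url if '://' in url else 'http://' + url).netloc.removeprefix('www.')
)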

# Display the DataFrame
print(df_pb)

The PitchBook data smells like fresh daisies.

💵 Millionaire Matchmaking

With domains standardized in both the Source and PitchBook Data, matchmaking comes down to finding identical values in the two domain columns.

Write Python code. Merge "df" and "df_pb" by the domain columns.

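# Merge df and df_pb on the 'domain' and 'domain_pb' columns (inner join: only
# matched rows are kept); column names shared with PitchBook get a '_pb' suffix
merged_df = df.merge(df_pb, left_on='domain', right_on='domain_pb', suffixes=('', '_pb'))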

# Display the merged DataFrame
print(merged_df)
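
With 50,000 companies, it helps to know how many actually matched and which ones did not. The sketch below shows one possible way to get that, using a left join with pandas' indicator and validate options. The toy rows and IDs are made up for illustration, and dropping rows with missing or duplicate PitchBook domains is an assumption about how you would want to treat them, since pandas pairs missing keys with each other during a merge.

```python
import pandas as pd

# Toy stand-ins for df and df_pb; every value here is made up
df = pd.DataFrame({
    "CompanyName": ["Acme", "Globex", "Initech", "No Website Co"],
    "domain": ["acme.com", "globex.com", "initech.com", None],
})
df_pb = pd.DataFrame({
    "PitchBookID": ["PB-1", "PB-2", "PB-3"],
    "domain_pb": ["acme.com", "globex.com", None],
})

# Missing keys would otherwise pair up with each other, and duplicate
# domains on the PitchBook side would multiply rows
pb_keys = df_pb.dropna(subset=["domain_pb"]).drop_duplicates(subset=["domain_pb"])

# A left join keeps every source company; _merge records whether it matched
matched = df.merge(
    pb_keys,
    how="left",
    left_on="domain",
    right_on="domain_pb",
    indicator=True,
    validate="many_to_one",
)

unmatched = matched[matched["_merge"] == "left_only"]
match_rate = (matched["_merge"] == "both").mean()
print(f"Matched {match_rate:.0%} of {len(matched)} companies")
print(unmatched["CompanyName"].tolist())
```

The unmatched list is the natural work queue for a fallback pass on name and location.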

You just unknowingly became a legendary matchmaker in your zip code. Congratulations!

🥂 Home Run

Once the matches are made, it's a smooth ride home.

Getting metrics like Vintage Year, AUM, IRR, or TVPI from PitchBook Direct Data is a breeze.
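
As a rough illustration of that last step, the sketch below joins matched companies to a metrics table on PitchBookID. The df_metrics table, its column names (VintageYear, TVPI), and all numbers are hypothetical stand-ins, not the actual PitchBook Direct Data layout, so adapt them to the files you receive.

```python
import pandas as pd

# Toy stand-ins; IDs, column names, and numbers are all invented for illustration
merged_df = pd.DataFrame({
    "CompanyName": ["Acme", "Globex"],
    "PitchBookID": ["PB-1", "PB-2"],
})
df_metrics = pd.DataFrame({
    "PitchBookID": ["PB-1", "PB-2", "PB-3"],
    "VintageYear": [2015, 2018, 2020],
    "TVPI": [1.8, 1.2, 1.0],
})

# Left join so a matched company without metrics is kept rather than dropped
benchmark = merged_df.merge(
    df_metrics, on="PitchBookID", how="left", validate="one_to_one"
)

# Example roll-up: median TVPI by vintage year
summary = benchmark.groupby("VintageYear")["TVPI"].median()
print(summary)
```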

I sent another delighted client home.

Happy day.

🥡 Takeaway

Armed with the power of ChatGPT, Python, and PitchBook, investment professionals can readily extract actionable intelligence out of an ocean of private market data.

With ChatGPT's vast knowledge base, Python's data prowess, and PitchBook's comprehensive insights, mere mortals can forge ahead and seize opportunities in the ever-evolving financial landscape.

🧰 Resources