CodeForFinance

Web Scraping Financial Data with Python

Extract stock prices, news headlines, and market data from the web using requests and BeautifulSoup.

Why Scrape Financial Data?

Not all financial data is available through clean APIs. Earnings calendars, analyst ratings, insider transactions, and niche market data often live on web pages with no official API. Web scraping lets you extract that data programmatically and turn it into structured datasets you can analyse.

Python is the go-to language for web scraping thanks to two libraries: requests for fetching pages and BeautifulSoup for parsing HTML. Together they handle most scraping tasks on static, server-rendered pages.

pip install requests beautifulsoup4 lxml pandas

Basic Page Scraping

Every scrape follows the same pattern: fetch the page with requests, parse the HTML with BeautifulSoup, then use CSS selectors or tag names to find the data you need. Always set a User-Agent header so the server knows what is making the request.

import requests
from bs4 import BeautifulSoup

def scrape_stock_page(url):
    """Fetch a page and parse it with BeautifulSoup."""
    # Identify the scraper honestly rather than posing as a browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (compatible; FinanceBot/1.0)'
    }
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    return BeautifulSoup(response.text, 'lxml')

# Example: scrape a financial news page
soup = scrape_stock_page('https://example-finance-site.com/news')

# Find all headline elements
headlines = soup.find_all('h3', class_='article-title')
for h in headlines:
    print(h.get_text(strip=True))
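
The example above uses find_all with a tag name and a class. The CSS-selector route goes through select instead. Here is a minimal sketch that runs against an inline HTML snippet, so it needs no network access. The markup, class names, and the extract_headlines helper are made up for illustration and are not taken from any real site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real news page
SAMPLE_HTML = """
<div class="news">
  <h3 class="article-title"><a href="/a/1">Fed holds rates steady</a></h3>
  <h3 class="article-title"><a href="/a/2">Tech stocks rally</a></h3>
  <h3 class="sponsored">Advert</h3>
</div>
"""

def extract_headlines(html):
    """Return (title, link) pairs for every article headline."""
    # html.parser ships with Python; 'lxml' would work the same way here
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    # CSS selector: <a> tags inside <h3 class="article-title">
    for link in soup.select('h3.article-title a'):
        results.append((link.get_text(strip=True), link.get('href')))
    return results

for title, href in extract_headlines(SAMPLE_HTML):
    print(f'{title} -> {href}')
```

A selector like this picks out the nested link in one step, which tends to be easier to read than chained find calls once the page structure gets deeper.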

Ethical Scraping: Check robots.txt

Before scraping any site, check its robots.txt file. This file tells bots which pages they are allowed and forbidden to access. Ignoring it can get your IP banned and potentially land you in legal trouble. Python has a built-in parser for it.

Beyond robots.txt, follow these rules: do not hammer servers with rapid-fire requests, do not scrape data behind a login unless you have explicit permission, and do not republish copyrighted content. Scraping for personal analysis is generally fine. Reselling someone else's data is not.

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def can_scrape(url):
    """Check robots.txt before scraping."""
    # Build the robots.txt URL from the page's scheme and host
    parsed = urlparse(url)
    robots_url = f'{parsed.scheme}://{parsed.netloc}/robots.txt'
    rp = RobotFileParser()
    rp.set_url(robots_url)
    rp.read()  # Downloads and parses robots.txt
    # '*' checks the rules that apply to any crawler
    return rp.can_fetch('*', url)

url = 'https://example-finance-site.com/stocks'
if can_scrape(url):
    print('Allowed to scrape this page')
else:
    print('Blocked by robots.txt - do not scrape')
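
To see how the parser interprets individual rules without hitting a live site, you can feed it robots.txt text directly. The sketch below assumes a made-up robots.txt and a hypothetical bot name, FinanceBot. It also reads the Crawl-delay directive, which is not part of the robots.txt standard but which some sites publish.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

def load_rules(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    rp = RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp

rules = load_rules(SAMPLE_ROBOTS)
base = 'https://example-finance-site.com'
print(rules.can_fetch('FinanceBot', f'{base}/stocks'))        # path with no matching rule
print(rules.can_fetch('FinanceBot', f'{base}/private/data'))  # path under Disallow
print(rules.crawl_delay('FinanceBot'))  # requested delay in seconds, or None
```

If crawl_delay returns a number, it is a reasonable lower bound for the pause between your requests to that site.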

Rate Limiting: Be a Good Citizen

Never blast a server with hundreds of requests per second. Add random delays between requests so your traffic looks like ordinary browsing rather than a burst. Two to five seconds between pages is a sensible default. If the site sends rate-limit headers such as Retry-After, respect them.

import time
import random
import requests
from bs4 import BeautifulSoup

def polite_scrape(urls, min_delay=2, max_delay=5):
    """Scrape multiple pages with random delays."""
    results = []
    headers = {
        'User-Agent': 'Mozilla/5.0 (compatible; FinanceBot/1.0)'
    }

    for i, url in enumerate(urls):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'lxml')
            results.append({
                'url': url,
                'title': soup.title.string if soup.title else 'No title',
                'status': response.status_code
            })
            print(f'[{i+1}/{len(urls)}] Scraped: {url}')
        except requests.RequestException as e:
            print(f'Error scraping {url}: {e}')
            results.append({'url': url, 'error': str(e)})

        # Be polite - random delay between requests
        if i < len(urls) - 1:
            delay = random.uniform(min_delay, max_delay)
            print(f'Waiting {delay:.1f}s...')
            time.sleep(delay)

    return results
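
Respecting rate-limit headers mostly comes down to reading Retry-After, which a server may send with a 429 (Too Many Requests) response. The value is either a number of seconds or an HTTP date. The helper below is a sketch of one way to turn it into a wait time. The function name, the default of 5 seconds, and the now parameter (there only to make the function easy to test) are all assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(header_value, default=5.0, now=None):
    """Convert a Retry-After header value into seconds to wait.

    The header may hold either a number of seconds or an HTTP date.
    Falls back to `default` when the value is missing or unparseable.
    """
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        retry_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if retry_at.tzinfo is None:
        # Treat a date without an offset as UTC
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # Never return a negative wait if the date is already in the past
    return max(0.0, (retry_at - now).total_seconds())

print(retry_after_seconds('120'))
print(retry_after_seconds(None))
```

Inside polite_scrape, one option would be to check for a 429 status before calling raise_for_status and sleep for retry_after_seconds(response.headers.get('Retry-After')) before trying that URL again.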

Saving Scraped Data to CSV

Once you have parsed the HTML, save the data to CSV so you can load it into pandas, Excel, or a database later. Always add a timestamp column so you know when each row was scraped.

import csv
import requests
from bs4 import BeautifulSoup
from datetime import datetime

def scrape_prices_to_csv(url, output_file):
    """Scrape a table of stock prices and save to CSV."""
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; FinanceBot/1.0)'}
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()  # Stop on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(response.text, 'lxml')

    # Find the data table
    table = soup.find('table', class_='stock-table')
    if not table:
        print('No table found on page')
        return

    rows = table.find_all('tr')

    with open(output_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Symbol', 'Price', 'Change', 'Scraped_At'])

        for row in rows[1:]:  # Skip header row
            cols = row.find_all('td')
            if len(cols) >= 3:
                symbol = cols[0].get_text(strip=True)
                price = cols[1].get_text(strip=True)
                change = cols[2].get_text(strip=True)
                writer.writerow([
                    symbol, price, change,
                    datetime.now().isoformat()
                ])

    print(f'Saved to {output_file}')

scrape_prices_to_csv(
    'https://example-finance-site.com/prices',
    'stock_prices.csv'
)
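
The function above overwrites the file on every run. If you want to build up a price history instead, one approach is to append rows and write the header only once. This is a sketch under a few assumptions: the append_rows helper and FIELDS constant are hypothetical, rows arrive as (symbol, price, change) tuples, and a single UTC timestamp is shared by all rows from one run.

```python
import csv
import os
from datetime import datetime, timezone

FIELDS = ['Symbol', 'Price', 'Change', 'Scraped_At']

def append_rows(output_file, rows):
    """Append (symbol, price, change) tuples, stamped with one UTC time."""
    # Write the header only when the file is new or empty
    needs_header = (not os.path.exists(output_file)
                    or os.path.getsize(output_file) == 0)
    # One timestamp per run, so rows from the same scrape group together
    scraped_at = datetime.now(timezone.utc).isoformat(timespec='seconds')

    with open(output_file, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        if needs_header:
            writer.writerow(FIELDS)
        for symbol, price, change in rows:
            writer.writerow([symbol, price, change, scraped_at])
    return scraped_at
```

Calling append_rows('price_history.csv', rows) after each scrape gives you a file you can later group by Scraped_At. Using UTC avoids ambiguity if the script runs on machines in different time zones.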

Shortcut: pandas read_html

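If the data you need is already in an HTML table, pandas can extract it in a single line. The read_html function finds every table on the page and returns them as a list of DataFrames. You do not need to write any BeautifulSoup code yourself, although pandas still relies on an HTML parser such as lxml under the hood.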

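from io import StringIO

import pandas as pd
import requests

# Many financial sites have HTML tables that pandas
# can extract directly - no BeautifulSoup code needed
url = 'https://example-finance-site.com/market-data'
headers = {'User-Agent': 'Mozilla/5.0 (compatible; FinanceBot/1.0)'}
# Fetch with requests so the User-Agent and timeout are under your control
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
tables = pd.read_html(StringIO(response.text))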

# read_html returns a list of all tables on the page
print(f'Found {len(tables)} tables')

# Grab the first table and clean it up
df = tables[0]
df.columns = ['Symbol', 'Price', 'Change', 'Volume']
# Strip '$' and thousands separators before converting to float
df['Price'] = (df['Price'].astype(str)
               .str.replace(r'[$,]', '', regex=True)
               .astype(float))

# Save to CSV
df.to_csv('market_data.csv', index=False)
print(df.head())
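
Scraped tables usually arrive as strings, and the other columns often need the same kind of clean-up as Price. The sketch below assumes percentage signs in Change and K/M/B shorthand in Volume. Those formats, the sample rows, and the to_number helper are illustrative guesses, so check what your target site actually produces.

```python
import pandas as pd

# Multipliers for the shorthand suffixes some sites use for volume
SUFFIXES = {'K': 1e3, 'M': 1e6, 'B': 1e9}

def to_number(text):
    """Convert strings like '$1,234.50', '+1.25%' or '3.4M' to float."""
    cleaned = str(text).strip().replace('$', '').replace(',', '').rstrip('%')
    if not cleaned:
        return float('nan')
    multiplier = SUFFIXES.get(cleaned[-1].upper())
    if multiplier is not None:
        return float(cleaned[:-1]) * multiplier
    return float(cleaned)

# Made-up rows in the shape a scraped table might have
df = pd.DataFrame({
    'Symbol': ['AAA', 'BBB'],
    'Price': ['$1,234.50', '$98.10'],
    'Change': ['+1.25%', '-0.40%'],
    'Volume': ['3.4M', '812K'],
})

for column in ['Price', 'Change', 'Volume']:
    df[column] = df[column].map(to_number)

print(df.dtypes)
```

Note that Change stays in percentage points here (1.25, not 0.0125). Divide by 100 if you need a fraction.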

Key Takeaways

  • Use requests + BeautifulSoup for most scraping tasks
  • Always check robots.txt before scraping any website
  • Add random delays between requests to avoid getting banned
  • Set a descriptive User-Agent header so sites know what you are
  • Save scraped data to CSV with timestamps for traceability
  • Use pandas read_html as a shortcut for table-based data
  • For JavaScript-rendered pages, look into Selenium or Playwright