Have you ever needed to keep up with news from a specific website without opening a browser every few minutes? Building a news web scraper that sends headlines to Telegram is one of the most practical ways to put Python automation to work. In this complete guide, you will build a bot from scratch that reads the main headlines from a news portal and delivers real-time alerts directly to your phone. This technique is essential for professionals who need real-time monitoring and for developers who want to understand how machines read and interpret the web.
What Is Web Scraping and Why Use Python?
Web scraping is the process of extracting information from websites in an automated way. Instead of copying and pasting manually, you create a script that simulates a browser visit and collects only what you need, such as titles, links, or prices. Python has become the preferred language for this task because of its clean syntax and its rich ecosystem of Python libraries specialized in HTML processing. Two of the most important external references for going deeper on this topic are the official BeautifulSoup documentation, which explains the HTML node tree in detail, and the MDN Web Docs on HTTP, which covers how browser requests work at the protocol level.
Setting Up the Development Environment
Before writing the first line of code, prepare your environment. Using a dedicated Python virtual environment is strongly recommended to keep the project dependencies isolated and avoid conflicts with other scripts on your system. Open your terminal and install the required tools:
pip install requests
pip install beautifulsoup4
pip install python-dotenv

The requests library sends the HTTP request to the target site, beautifulsoup4 parses the downloaded HTML so you can find specific tags, and python-dotenv lets you manage your Telegram API credentials securely without exposing them in your source code.
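As a sketch of how python-dotenv keeps credentials out of source code, the snippet below reads a `.env` file at startup. The variable names (TELEGRAM_TOKEN, TELEGRAM_CHAT_ID) are illustrative, and the import is wrapped in a try/except so the snippet also runs where python-dotenv is not installed:

```python
import os

# Sketch: read Telegram credentials from a .env file via python-dotenv.
# Assumed .env contents (illustrative names):
#   TELEGRAM_TOKEN=123456:ABC-DEF...
#   TELEGRAM_CHAT_ID=987654321
try:
    from dotenv import load_dotenv
    load_dotenv()  # copies .env entries into os.environ
except ImportError:
    pass  # fall back to ordinary environment variables

TOKEN = os.getenv("TELEGRAM_TOKEN", "")
CHAT_ID = os.getenv("TELEGRAM_CHAT_ID", "")
print("Token configured:", bool(TOKEN))
```

Keeping the `.env` file out of version control (add it to `.gitignore`) is what actually protects the token.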
Creating Your Telegram Bot
To send the news, you need a messenger. Telegram makes this straightforward through BotFather. Follow these steps to get your bot set up in under five minutes:
- Open Telegram and search for @BotFather.
- Type the command /newbot and follow the prompts to give your bot a name and a username.
- At the end of the process, you will receive an API Token. Save this string. It is the access key your script will use to authenticate with Telegram.
- Start a conversation with your new bot and send it any message. This is required before the bot can send messages to you.
- To find your Chat ID, open the following URL in your browser after replacing the token: https://api.telegram.org/botYOUR_TOKEN/getUpdates. Look for the “id” field inside the “chat” block in the response.
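The same lookup can be scripted instead of done by eye. The sketch below parses a getUpdates-style response; the JSON structure mirrors what the Bot API returns, but the values are invented:

```python
# Sketch: pull the chat ID out of a getUpdates response.
# In practice you would fetch this JSON with requests.get(); here it is
# inlined so the example runs offline. All values are made up.
sample_response = {
    "ok": True,
    "result": [
        {"update_id": 1,
         "message": {"message_id": 10,
                     "chat": {"id": 987654321, "type": "private"},
                     "text": "hello"}}
    ],
}

def extract_chat_id(data):
    """Return the chat ID of the most recent update, or None if there are none."""
    updates = data.get("result", [])
    if not updates:
        return None
    return updates[-1]["message"]["chat"]["id"]

print(extract_chat_id(sample_response))  # → 987654321
```

If `result` comes back empty, send your bot another message first; getUpdates only shows recent activity.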
If you want a deeper walkthrough with screenshots, the guide on how to create a Telegram bot with Python covers the full setup process visually.
Building the News Scraper
With the bot ready, focus on extracting the data. The key to web scraping is identifying the HTML tag pattern of the target site. In most news portals, headlines are inside <h2> or <h3> tags with a specific CSS class. Use your browser’s developer tools (F12) to inspect the page and find the exact selector you need.
Making the HTTP Request
The first step is downloading the page source. Always set a User-Agent header to simulate a real browser request, which reduces the chance of being blocked by the server’s bot detection:
import requests
from bs4 import BeautifulSoup
url = "https://example-news-site.com"
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)
if response.status_code == 200:
    print("Connection successful!")
    html = response.text
else:
    print(f"Error accessing the site: {response.status_code}")

Parsing and Filtering the Data
With the HTML in hand, BeautifulSoup finds the specific elements. The example below looks for <h3> tags with the class “news-title”. If you encounter character encoding issues in the terminal when printing results, the guide on resolving UTF-8 encoding errors in Python covers the most common fixes:
soup = BeautifulSoup(html, "html.parser")
headlines = soup.find_all("h3", class_="news-title")
messages = []
for item in headlines:
    title = item.get_text().strip()
    link = item.find("a")["href"] if item.find("a") else "No link available"
    messages.append(f"{title}\nRead more: {link}")

Connecting the Scraper to Telegram
With the list of headlines ready, you need to send it. The Telegram API works through simple web requests. For clean code organization, use Python functions to separate the sending logic from the scraping logic. This makes the code easier to maintain and test independently:
def send_to_telegram(message, token, chat_id):
    api_url = f"https://api.telegram.org/bot{token}/sendMessage"
    payload = {
        "chat_id": chat_id,
        "text": message,
        "parse_mode": "HTML"
    }
    response = requests.post(api_url, data=payload)
    return response.status_code == 200

Automating the Process
A news tracker is only useful if it runs on its own. You can use a Python while loop with a timer to run the check every hour, or schedule it as a system task using Windows Task Scheduler or a Linux cron job. For production-level reliability where you need the bot to never stop, running the script inside a container via Docker is the professional solution, as it automatically restarts on failure and runs independently of your local machine.
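The while-loop approach can be sketched as below. The `max_runs` parameter is not part of the original idea; it is added purely so the example (and any test of it) can terminate:

```python
import time

def run_forever(check, interval, max_runs=None):
    """Call `check()` every `interval` seconds.

    `max_runs` exists only so this sketch can finish; leave it as None
    in real use for an endless loop.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        check()
        runs += 1
        time.sleep(interval)

# Demonstration with a tiny interval so it finishes quickly:
calls = []
run_forever(lambda: calls.append("checked"), interval=0.01, max_runs=3)
print(len(calls))  # → 3
```

The cron equivalent of an hourly check would be a crontab line such as `0 * * * * python3 /path/to/bot.py`.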
Complete Project Code
Here is the full unified script. Replace the TOKEN and CHAT_ID placeholders with your real values, and adjust the HTML tag selectors to match the news site you want to monitor:
import requests
from bs4 import BeautifulSoup
import time
# Telegram credentials (use environment variables in production)
TOKEN = "YOUR_TOKEN_HERE"
CHAT_ID = "YOUR_CHAT_ID_HERE"
def extract_headlines(target_url):
    """Extracts top headlines from a news website."""
    try:
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
        response = requests.get(target_url, headers=headers, timeout=10)
        soup = BeautifulSoup(response.content, "html.parser")
        # Adjust this selector to match the target site's HTML structure
        posts = soup.find_all("h2", limit=5)
        results = []
        for post in posts:
            text = post.get_text().strip()
            anchor = post.find("a")
            link = anchor["href"] if anchor else "Link not available"
            results.append(f'<b>Headline:</b> {text}\n<a href="{link}">Read the full article</a>')
        return results
    except Exception as e:
        print(f"Extraction error: {e}")
        return []

def send_to_telegram(message):
    """Sends a formatted message to the Telegram bot."""
    api_url = f"https://api.telegram.org/bot{TOKEN}/sendMessage"
    payload = {
        "chat_id": CHAT_ID,
        "text": message,
        "parse_mode": "HTML"
    }
    response = requests.post(api_url, data=payload)
    return response.status_code == 200

def run_bot():
    """Main execution loop."""
    target_site = "https://techcrunch.com"  # Replace with your preferred news site
    print("Starting news monitoring...")
    headlines = extract_headlines(target_site)
    if headlines:
        for headline in headlines:
            success = send_to_telegram(headline)
            if success:
                print("Headline sent successfully.")
            else:
                print("Failed to send headline.")
            time.sleep(2)  # Pause between messages to avoid rate limiting
        print(f"\nDone! {len(headlines)} headlines were sent.")
    else:
        print("No headlines found. Check your HTML selectors.")

if __name__ == "__main__":
    run_bot()

Ethical and Legal Considerations
When building any web scraper, it is essential to respect the target website’s robots.txt file, which specifies which parts of the site bots are allowed to access. Never send requests in extremely short intervals like milliseconds, as this can overload the news portal’s servers and result in your IP being permanently banned. Using a User-Agent header that simulates a real browser, combined with a reasonable delay between requests, is a sign of responsible scraping behavior.
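Python's standard library can check robots.txt rules for you. The sketch below feeds the rules in directly so it runs offline; in practice you would call `rfp.set_url(".../robots.txt")` followed by `rfp.read()`, and the rules shown are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

# Sketch: check robots.txt rules before scraping a path.
# These example rules allow everything except /private/.
rules = """
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

rfp = RobotFileParser()
rfp.parse(rules)

print(rfp.can_fetch("MyNewsBot/1.0", "https://example-news-site.com/news"))       # → True
print(rfp.can_fetch("MyNewsBot/1.0", "https://example-news-site.com/private/x"))  # → False
```

Calling `can_fetch` once at startup, before the scraping loop, is usually enough.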
Always check whether the site’s terms of service explicitly prohibit automated access. Scraping for personal use or research is generally lower risk than building a commercial product on top of someone else’s content.
Frequently Asked Questions
My script returns an empty list. What should I do?
This usually happens because the site’s HTML structure uses CSS classes or IDs that have changed since you wrote the script, or because the content loads dynamically with JavaScript. Open the browser developer tools (F12), inspect the headline elements on the live page, and update your selectors to match the current structure.
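A quick programmatic check can complement the developer tools: if the class name does not appear anywhere in the raw HTML your script downloaded, the content is almost certainly renamed or JavaScript-rendered. The helper and the sample page below are hypothetical:

```python
def selector_hint(html, class_name):
    """Quick diagnostic: is the CSS class present in the raw HTML at all?"""
    if class_name in html:
        return "class found in raw HTML; re-check the tag name or nesting"
    return "class absent from raw HTML; it was renamed or is JavaScript-rendered"

# Hypothetical page where the site renamed its headline class:
page = '<html><body><h3 class="headline-v2">Title</h3></body></html>'
print(selector_hint(page, "news-title"))
print(selector_hint(page, "headline-v2"))
```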
Can I send images alongside the headlines?
Yes. The Telegram API offers a /sendPhoto endpoint. You would need to extract the image URL from the article with BeautifulSoup and pass it to that endpoint instead of /sendMessage.
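The /sendPhoto call has the same shape as /sendMessage. The helper below only builds the URL and payload so the structure is visible without hitting the network; in a real script you would pass both to requests.post(). The function name is invented for this sketch:

```python
# Sketch: build a /sendPhoto request. The "photo", "chat_id", and "caption"
# fields are the parameters the Bot API expects for this endpoint.
def build_send_photo_request(token, chat_id, photo_url, caption=""):
    api_url = f"https://api.telegram.org/bot{token}/sendPhoto"
    payload = {"chat_id": chat_id, "photo": photo_url, "caption": caption}
    return api_url, payload

url, payload = build_send_photo_request(
    "TOKEN", "123", "https://example.com/img.jpg", "Headline image"
)
print(url.endswith("/sendPhoto"))  # → True
```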
Can Telegram ban my bot for sending too many messages?
Yes, if you send hundreds of messages per minute. Always keep at least a one to two second pause between messages and only send the most relevant updates. The Telegram Bot API has rate limits that, if exceeded, will result in temporary blocks.
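When you do hit the limit, Telegram answers with HTTP 429, so a small retry wrapper is a reasonable safeguard. In this sketch the `post` callable is injected so the example runs without the network; the fake transport simulates one rate-limited attempt:

```python
import time

# Sketch: retry a send after a growing pause when the API answers 429.
def send_with_backoff(post, url, payload, retries=3, pause=1.0):
    for attempt in range(retries):
        status = post(url, payload)
        if status != 429:
            return status
        time.sleep(pause * (attempt + 1))  # back off a little longer each time
    return 429

# Fake transport: fails once with 429, then succeeds.
responses = [429, 200]
fake_post = lambda url, payload: responses.pop(0)
print(send_with_backoff(fake_post, "https://api.telegram.org/...", {}, pause=0.01))  # → 200
```

The real 429 response body also carries a retry_after value, which you can use in place of the fixed pause.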
Can I run this script on a free cloud server?
Yes. Platforms like Google Cloud (free tier), Railway, Render, or GitHub Actions allow scheduled Python script execution at no initial cost. Each has different limitations, so check their documentation for script runtime and scheduling options.
How do I filter only headlines that contain specific keywords?
Inside your for loop, add an if/elif/else check to verify whether a keyword (like “AI” or “market”) appears in the headline text before appending it to the results list.
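A minimal sketch of that filter, with invented keywords and headlines:

```python
# Sketch: keep only headlines containing at least one keyword (case-insensitive).
KEYWORDS = ["ai", "market"]

def matches_keywords(title, keywords=KEYWORDS):
    lowered = title.lower()
    return any(kw in lowered for kw in keywords)

headlines = ["AI startup raises funds", "Local weather update", "Market rally continues"]
filtered = [h for h in headlines if matches_keywords(h)]
print(filtered)  # → ['AI startup raises funds', 'Market rally continues']
```

Note that plain substring matching is crude: "ai" also matches words like "rain". A regex with word boundaries (`r"\bai\b"`) is the usual refinement.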
How do I avoid sending duplicate headlines?
Save the URL of each headline you have already sent in a simple SQLite database or a plain text file. Before sending, check whether the link already exists in your “sent” record. If it does, skip it.
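The plain-text-file variant can be sketched like this; the file name and temp directory are chosen only for the example:

```python
import os
import tempfile

# Sketch: track already-sent links in a plain text file, one URL per line.
def load_sent(path):
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

def mark_sent(path, link):
    with open(path, "a", encoding="utf-8") as f:
        f.write(link + "\n")

record = os.path.join(tempfile.mkdtemp(), "sent_links.txt")
for link in ["https://site.com/a", "https://site.com/b", "https://site.com/a"]:
    if link in load_sent(record):
        print("skip duplicate:", link)
    else:
        mark_sent(record, link)
        print("send:", link)
```

For larger histories, the SQLite route (a table with a UNIQUE constraint on the URL) avoids re-reading the whole file on every check.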
What should I do if the site requires a login?
For sites that require authentication, you need a tool that can interact with forms and click buttons like a real user. Selenium is the standard choice for this, as it controls a real browser instance rather than just downloading static HTML.
Is the Requests library the best tool for this job?
For simple sites that serve static HTML, yes. For projects that require high performance and thousands of simultaneous requests, consider combining the approach with asyncio in Python to run non-blocking HTTP calls concurrently and dramatically increase throughput.
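The non-blocking pattern looks like the sketch below. `asyncio.sleep` stands in for a real async HTTP call (which would use a library such as aiohttp) so the example runs offline; the point is that ten "fetches" complete in roughly the time of one:

```python
import asyncio
import time

async def fetch(site, delay=0.1):
    # Placeholder for the network round-trip of a real async HTTP client.
    await asyncio.sleep(delay)
    return f"{site}: ok"

async def fetch_all(sites):
    # Schedule all fetches at once and wait for them together.
    return await asyncio.gather(*(fetch(s) for s in sites))

sites = [f"https://site-{i}.example" for i in range(10)]
start = time.perf_counter()
results = asyncio.run(fetch_all(sites))
elapsed = time.perf_counter() - start
print(len(results), "sites in", round(elapsed, 2), "seconds")  # ~0.1s, not ~1s
```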