Version 3.1

This commit is contained in:
𝓾𝓷𝓷𝓸𝓱𝔀𝓷
2025-12-12 15:38:09 +01:00
committed by GitHub
parent 8d4e092b1b
commit fb7ad3742e
2 changed files with 229 additions and 107 deletions


@@ -11,41 +11,39 @@ ___________________ _________
\/ \/ \/ \/
``` ```
## What's New in v3.0 🎉 ## What's New in v3.1 🎉
**QR Code Authentication:** **Enhanced Message Data:**
- **No phone number required** - Login with QR code scanning (still need API credentials) - **Message statistics** - Captures views, forwards, and post_author for each message
- **Faster authentication** - Just scan with your phone after API setup - **Reactions support** - Records all emoji reactions with counts (e.g., "😀 12 👍 3")
- **Secure login** - Recommended authentication method - **Automatic database migration** - Seamlessly adds new columns to existing databases
- **2FA support** for both QR and phone methods - **Richer exports** - All new data included in CSV/JSON exports
**Enhanced User Experience:** **Improved Channel Management:**
- **Numbered channel selection** - Use 1,2,3 instead of full channel IDs - **Channel names displayed** - Shows channel names alongside IDs everywhere
- **Multi-channel operations** - Add, remove, and scrape multiple channels at once - **Smart filtering** - List option now only shows Channels and Groups (no private chats)
- **Streamlined menu** - Cleaner interface with fewer redundant options - **channels_list.csv export** - Automatically saves channel list with names, IDs, usernames, and types
- **Progress bars** for media downloads with visual feedback - **"all" selection** - Quickly add all listed channels at once
- **Better export naming** - Files now named as `ID_username.csv` and `ID_username.json`
**Media Download Improvements:** **Bug Fixes:**
- **Fixed file overwriting** - Unique naming prevents media files from being overwritten - **Fixed channel ID parsing** - Resolved the "invalid literal for int()" error when fixing missing media
- **5x concurrent downloads** - Increased from 3 to 5 for faster media processing - **Better entity resolution** - Handles both numeric IDs and channel usernames
- **Better error handling** - Improved retry logic and recovery - **Improved error messages** - Shows channel names with IDs for clearer debugging
**Performance & Stability:**
- **Database optimizations** - WAL mode and faster operations
- **Hidden warnings** - Cleaner output without technical messages
- **Better error recovery** - More robust handling of network issues
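The channel ID parsing fix and the entity resolution change both depend on telling a numeric ID apart from a username before calling `int()`. The sketch below shows one way to make that decision. It is not the project's code: `describe_channel_ref` is a hypothetical helper, and the regular expression is a slightly stricter variant of the `channel.lstrip('-').isdigit()` check that appears in the diff.

```python
import re

# Matches an optional single leading minus followed by digits, e.g. "-1002116176890".
_NUMERIC_ID = re.compile(r"-?\d+")


def is_numeric_channel_id(channel: str) -> bool:
    """Return True when the stored channel key looks like a numeric ID."""
    return _NUMERIC_ID.fullmatch(channel) is not None


def describe_channel_ref(channel: str) -> str:
    """Hypothetical helper: classify a channel key without contacting Telegram.

    In the scraper, the numeric branch would wrap the value in
    PeerChannel(int(channel)); the username branch would pass the string as-is.
    """
    if is_numeric_channel_id(channel):
        return f"numeric:{int(channel)}"
    return f"username:{channel}"
```

Checking the shape of the string first means `int()` is only called on input it can parse, which avoids the "invalid literal for int()" error for keys such as `technews`.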
## Features 🚀 ## Features 🚀
- **QR Code & Phone Authentication** - Choose your preferred login method - **QR Code & Phone Authentication** - Choose your preferred login method
- Scrape messages from multiple Telegram channels - Scrape messages with full metadata (views, forwards, reactions, post author)
- Download media files with parallel processing and unique naming - Download media files with parallel processing and unique naming
- Real-time continuous scraping - Real-time continuous scraping
- Export data to JSON and CSV formats - Export data to JSON and CSV formats with enhanced metadata
- SQLite database storage with optimized performance - SQLite database storage with automatic schema migration
- Resume capability (saves progress) - Resume capability (saves progress)
- Interactive menu with numbered channel selection - Interactive menu with channel names and numbered selection
- Smart channel filtering (only shows channels/groups)
- Progress tracking with visual progress bars - Progress tracking with visual progress bars
- Automatic channels list export to CSV
## Prerequisites 📋 ## Prerequisites 📋
@@ -128,11 +126,18 @@ Instead of typing long channel IDs, use numbers:
**Adding Channels:** **Adding Channels:**
``` ```
[1] The News (Chat) (id: -1002116176890) [1] Tech News (ID: -1002116176890, Type: Channel, Username: @technews)
[2] Python Channel (id: -1001597139842) [2] Python Dev (ID: -1001597139842, Type: Group, Username: @pythondev)
[3] The Corner (id: -1002274713954) [3] Daily Updates (ID: -1002274713954, Type: Channel, Username: @dailyupdates)
Enter: 1,3 (adds channels 1 and 3) Enter: 1,3 (adds channels 1 and 3)
Or: all (adds all listed channels)
```
**Viewing Your Channels:**
```
[1] Tech News (ID: -1002116176890), Last Message ID: 5234, Messages: 12450
[2] Python Dev (ID: -1001597139842), Last Message ID: 8192, Messages: 45782
``` ```
**Scraping Channels:** **Scraping Channels:**
@@ -149,6 +154,8 @@ Data is stored in SQLite databases, one per channel:
- Location: `./channelname/channelname.db` - Location: `./channelname/channelname.db`
- Optimized with indexes for fast queries - Optimized with indexes for fast queries
- WAL mode for better performance - WAL mode for better performance
- Schema includes: message_id, date, sender info, message text, media info, reply_to, post_author, views, forwards, reactions
- Automatic migration adds new columns to existing databases
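The automatic migration can be illustrated with a minimal, self-contained sketch that uses an in-memory database. It follows the same idea as the `migrate_database` method in this commit (read `PRAGMA table_info`, then add whichever columns are missing), but the names `NEW_COLUMNS` and `add_missing_columns` are assumptions made for this example.

```python
import sqlite3

# Columns introduced in v3.1, with the SQLite types used in the schema.
NEW_COLUMNS = {
    "post_author": "TEXT",
    "views": "INTEGER",
    "forwards": "INTEGER",
    "reactions": "TEXT",
}


def add_missing_columns(conn: sqlite3.Connection, table: str = "messages") -> list:
    """Add any v3.1 columns the table lacks and return the names that were added."""
    # Row index 1 of PRAGMA table_info holds the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
            added.append(name)
    if added:
        conn.commit()
    return added


# Simulate a pre-3.1 database that lacks the new columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, message_id INTEGER UNIQUE, message TEXT)"
)
first = add_missing_columns(conn)   # adds all four columns
second = add_missing_columns(conn)  # nothing left to add
```

Because the function checks the existing columns first, running it again on an already migrated database changes nothing. Rows written before the migration hold `NULL` in the new columns.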
### Media Storage 📁 ### Media Storage 📁
@@ -160,8 +167,11 @@ Media files are stored with unique naming:
### Exported Data 📊 ### Exported Data 📊
Export formats: Export formats:
1. **CSV**: `./channelname/channelname.csv` 1. **CSV**: `./channelname/channelid_username.csv`
2. **JSON**: `./channelname/channelname.json` 2. **JSON**: `./channelname/channelid_username.json`
3. **Channel List**: `./channels_list.csv` (automatically created when using [L] option)
All exports include complete message metadata: views, forwards, reactions, and post author information.
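Reactions are exported as a single string of space-separated emoji and count pairs (for example `😀 12 👍 3`). If you need them as structured data after export, a small parser along these lines may help. `parse_reactions` is a hypothetical helper, not part of the scraper, and it assumes each emoji is a single token with no embedded spaces.

```python
def parse_reactions(reactions):
    """Turn a string such as '😀 12 👍 3' into {'😀': 12, '👍': 3}.

    Returns an empty dict for None or an empty string, which is what
    messages without reactions would contain.
    """
    if not reactions:
        return {}
    tokens = reactions.split()
    # Tokens alternate: emoji, count, emoji, count, ...
    return {emoji: int(count) for emoji, count in zip(tokens[0::2], tokens[1::2])}
```

The same function can be applied to the `reactions` field of each row when loading the CSV or JSON export.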
## Performance Features ⚙️ ## Performance Features ⚙️


@@ -12,7 +12,7 @@ from typing import Dict, List, Optional, Any
from pathlib import Path from pathlib import Path
from io import StringIO from io import StringIO
from telethon import TelegramClient from telethon import TelegramClient
from telethon.tl.types import MessageMediaPhoto, MessageMediaDocument, MessageMediaWebPage, User, PeerChannel from telethon.tl.types import MessageMediaPhoto, MessageMediaDocument, MessageMediaWebPage, User, PeerChannel, Channel, Chat
from telethon.errors import FloodWaitError, SessionPasswordNeededError from telethon.errors import FloodWaitError, SessionPasswordNeededError
import qrcode import qrcode
@@ -43,6 +43,10 @@ class MessageData:
media_type: Optional[str] media_type: Optional[str]
media_path: Optional[str] media_path: Optional[str]
reply_to: Optional[int] reply_to: Optional[int]
post_author: Optional[str]
views: Optional[int]
forwards: Optional[int]
reactions: Optional[str]
class OptimizedTelegramScraper: class OptimizedTelegramScraper:
def __init__(self): def __init__(self):
@@ -66,6 +70,7 @@ class OptimizedTelegramScraper:
'api_id': None, 'api_id': None,
'api_hash': None, 'api_hash': None,
'channels': {}, 'channels': {},
'channel_names': {},
'scrape_media': True, 'scrape_media': True,
} }
@@ -86,16 +91,44 @@ class OptimizedTelegramScraper:
conn.execute('''CREATE TABLE IF NOT EXISTS messages conn.execute('''CREATE TABLE IF NOT EXISTS messages
(id INTEGER PRIMARY KEY, message_id INTEGER UNIQUE, date TEXT, (id INTEGER PRIMARY KEY, message_id INTEGER UNIQUE, date TEXT,
sender_id INTEGER, first_name TEXT, last_name TEXT, username TEXT, sender_id INTEGER, first_name TEXT, last_name TEXT, username TEXT,
message TEXT, media_type TEXT, media_path TEXT, reply_to INTEGER)''') message TEXT, media_type TEXT, media_path TEXT, reply_to INTEGER,
post_author TEXT, views INTEGER, forwards INTEGER, reactions TEXT)''')
conn.execute('CREATE INDEX IF NOT EXISTS idx_message_id ON messages(message_id)') conn.execute('CREATE INDEX IF NOT EXISTS idx_message_id ON messages(message_id)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_date ON messages(date)') conn.execute('CREATE INDEX IF NOT EXISTS idx_date ON messages(date)')
conn.execute('PRAGMA journal_mode=WAL') conn.execute('PRAGMA journal_mode=WAL')
conn.execute('PRAGMA synchronous=NORMAL') conn.execute('PRAGMA synchronous=NORMAL')
conn.commit() conn.commit()
self.migrate_database(conn)
self.db_connections[channel] = conn self.db_connections[channel] = conn
return self.db_connections[channel] return self.db_connections[channel]
def migrate_database(self, conn: sqlite3.Connection):
cursor = conn.cursor()
cursor.execute("PRAGMA table_info(messages)")
columns = {row[1] for row in cursor.fetchall()}
migrations = []
if 'post_author' not in columns:
migrations.append('ALTER TABLE messages ADD COLUMN post_author TEXT')
if 'views' not in columns:
migrations.append('ALTER TABLE messages ADD COLUMN views INTEGER')
if 'forwards' not in columns:
migrations.append('ALTER TABLE messages ADD COLUMN forwards INTEGER')
if 'reactions' not in columns:
migrations.append('ALTER TABLE messages ADD COLUMN reactions TEXT')
for migration in migrations:
try:
conn.execute(migration)
except sqlite3.OperationalError:  # e.g. duplicate column name
pass
if migrations:
conn.commit()
def close_db_connections(self): def close_db_connections(self):
for conn in self.db_connections.values(): for conn in self.db_connections.values():
conn.close() conn.close()
@@ -108,12 +141,14 @@ class OptimizedTelegramScraper:
conn = self.get_db_connection(channel) conn = self.get_db_connection(channel)
data = [(msg.message_id, msg.date, msg.sender_id, msg.first_name, data = [(msg.message_id, msg.date, msg.sender_id, msg.first_name,
msg.last_name, msg.username, msg.message, msg.media_type, msg.last_name, msg.username, msg.message, msg.media_type,
msg.media_path, msg.reply_to) for msg in messages] msg.media_path, msg.reply_to, msg.post_author, msg.views,
msg.forwards, msg.reactions) for msg in messages]
conn.executemany('''INSERT OR IGNORE INTO messages conn.executemany('''INSERT OR IGNORE INTO messages
(message_id, date, sender_id, first_name, last_name, username, (message_id, date, sender_id, first_name, last_name, username,
message, media_type, media_path, reply_to) message, media_type, media_path, reply_to, post_author, views,
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''', data) forwards, reactions)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''', data)
conn.commit() conn.commit()
async def download_media(self, channel: str, message) -> Optional[str]: async def download_media(self, channel: str, message) -> Optional[str]:
@@ -196,6 +231,17 @@ class OptimizedTelegramScraper:
try: try:
sender = await message.get_sender() sender = await message.get_sender()
reactions_str = None
if message.reactions and message.reactions.results:
reactions_parts = []
for reaction in message.reactions.results:
emoji = getattr(reaction.reaction, 'emoticon', '')
count = reaction.count
if emoji:
reactions_parts.append(f"{emoji} {count}")
if reactions_parts:
reactions_str = ' '.join(reactions_parts)
msg_data = MessageData( msg_data = MessageData(
message_id=message.id, message_id=message.id,
date=message.date.strftime('%Y-%m-%d %H:%M:%S'), date=message.date.strftime('%Y-%m-%d %H:%M:%S'),
@@ -206,7 +252,11 @@ class OptimizedTelegramScraper:
message=message.message or '', message=message.message or '',
media_type=message.media.__class__.__name__ if message.media else None, media_type=message.media.__class__.__name__ if message.media else None,
media_path=None, media_path=None,
reply_to=message.reply_to_msg_id if message.reply_to else None reply_to=message.reply_to_msg_id if message.reply_to else None,
post_author=message.post_author,
views=message.views,
forwards=message.forwards,
reactions=reactions_str
) )
message_batch.append(msg_data) message_batch.append(msg_data)
@@ -289,14 +339,19 @@ class OptimizedTelegramScraper:
cursor.execute('SELECT message_id FROM messages WHERE media_type IS NOT NULL AND media_type != "MessageMediaWebPage" AND media_path IS NULL') cursor.execute('SELECT message_id FROM messages WHERE media_type IS NOT NULL AND media_type != "MessageMediaWebPage" AND media_path IS NULL')
message_ids = [row[0] for row in cursor.fetchall()] message_ids = [row[0] for row in cursor.fetchall()]
channel_name = self.state.get('channel_names', {}).get(channel, 'Unknown')
if not message_ids: if not message_ids:
print(f"No media files to reprocess for channel {channel}") print(f"No media files to reprocess for {channel_name} (ID: {channel})")
return return
print(f"📥 Reprocessing {len(message_ids)} media files for channel {channel}") print(f"📥 Reprocessing {len(message_ids)} media files for {channel_name} (ID: {channel})")
try: try:
entity = await self.client.get_entity(PeerChannel(int(channel))) if channel.lstrip('-').isdigit():
entity = await self.client.get_entity(PeerChannel(int(channel)))
else:
entity = await self.client.get_entity(channel)
semaphore = asyncio.Semaphore(self.max_concurrent_downloads) semaphore = asyncio.Semaphore(self.max_concurrent_downloads)
completed_media = 0 completed_media = 0
successful_downloads = 0 successful_downloads = 0
@@ -348,7 +403,8 @@ class OptimizedTelegramScraper:
missing_count = total_with_media - total_with_files missing_count = total_with_media - total_with_files
print(f"\n📊 Media Analysis for {channel}:") channel_name = self.state.get('channel_names', {}).get(channel, 'Unknown')
print(f"\n📊 Media Analysis for {channel_name} (ID: {channel}):")
print(f"Messages with media: {total_with_media}") print(f"Messages with media: {total_with_media}")
print(f"Media files downloaded: {total_with_files}") print(f"Media files downloaded: {total_with_files}")
print(f"Missing media files: {missing_count}") print(f"Missing media files: {missing_count}")
@@ -367,7 +423,10 @@ class OptimizedTelegramScraper:
print(f"\n🔧 Attempting to download {len(missing_media)} missing media files...") print(f"\n🔧 Attempting to download {len(missing_media)} missing media files...")
try: try:
entity = await self.client.get_entity(PeerChannel(int(channel))) if channel.lstrip('-').isdigit():
entity = await self.client.get_entity(PeerChannel(int(channel)))
else:
entity = await self.client.get_entity(channel)
semaphore = asyncio.Semaphore(self.max_concurrent_downloads) semaphore = asyncio.Semaphore(self.max_concurrent_downloads)
completed_media = 0 completed_media = 0
successful_downloads = 0 successful_downloads = 0
@@ -432,9 +491,14 @@ class OptimizedTelegramScraper:
finally: finally:
self.continuous_scraping_active = False self.continuous_scraping_active = False
def get_export_filename(self, channel: str):
username = self.state.get('channel_names', {}).get(channel, 'no_username')
return f"{channel}_{username}"
def export_to_csv(self, channel: str): def export_to_csv(self, channel: str):
conn = self.get_db_connection(channel) conn = self.get_db_connection(channel)
csv_file = Path(channel) / f'{channel}.csv' filename = self.get_export_filename(channel)
csv_file = Path(channel) / f'{filename}.csv'
cursor = conn.cursor() cursor = conn.cursor()
cursor.execute('SELECT * FROM messages ORDER BY date') cursor.execute('SELECT * FROM messages ORDER BY date')
@@ -452,7 +516,8 @@ class OptimizedTelegramScraper:
def export_to_json(self, channel: str): def export_to_json(self, channel: str):
conn = self.get_db_connection(channel) conn = self.get_db_connection(channel)
json_file = Path(channel) / f'{channel}.json' filename = self.get_export_filename(channel)
json_file = Path(channel) / f'{filename}.json'
cursor = conn.cursor() cursor = conn.cursor()
cursor.execute('SELECT * FROM messages ORDER BY date') cursor.execute('SELECT * FROM messages ORDER BY date')
@@ -504,20 +569,45 @@ class OptimizedTelegramScraper:
cursor = conn.cursor() cursor = conn.cursor()
cursor.execute('SELECT COUNT(*) FROM messages') cursor.execute('SELECT COUNT(*) FROM messages')
count = cursor.fetchone()[0] count = cursor.fetchone()[0]
print(f"[{i}] Channel ID: {channel}, Last Message ID: {last_id}, Messages: {count}") channel_name = self.state.get('channel_names', {}).get(channel, 'Unknown')
print(f"[{i}] {channel_name} (ID: {channel}), Last Message ID: {last_id}, Messages: {count}")
except: except:
print(f"[{i}] Channel ID: {channel}, Last Message ID: {last_id}") channel_name = self.state.get('channel_names', {}).get(channel, 'Unknown')
print(f"[{i}] {channel_name} (ID: {channel}), Last Message ID: {last_id}")
async def list_channels(self): async def list_channels(self):
try: try:
print("\nList of channels joined by account:") print("\nList of channels and groups joined by account:")
count = 1 count = 1
channels_data = []
async for dialog in self.client.iter_dialogs(): async for dialog in self.client.iter_dialogs():
if dialog.id != 777000: entity = dialog.entity
print(f"[{count}] {dialog.title} (id: {dialog.id})") if dialog.id != 777000 and (isinstance(entity, Channel) or isinstance(entity, Chat)):
channel_type = "Channel" if isinstance(entity, Channel) and entity.broadcast else "Group"
username = getattr(entity, 'username', None) or 'no_username'
print(f"[{count}] {dialog.title} (ID: {dialog.id}, Type: {channel_type}, Username: @{username})")
channels_data.append({
'number': count,
'channel_name': dialog.title,
'channel_id': str(dialog.id),
'username': username,
'type': channel_type
})
count += 1 count += 1
if channels_data:
csv_file = Path('channels_list.csv')
with open(csv_file, 'w', newline='', encoding='utf-8') as f:
writer = csv.DictWriter(f, fieldnames=['number', 'channel_name', 'channel_id', 'username', 'type'])
writer.writeheader()
writer.writerows(channels_data)
print(f"\n✅ Saved channels list to {csv_file}")
return channels_data
except Exception as e: except Exception as e:
print(f"Error listing channels: {e}") print(f"Error listing channels: {e}")
return []
def display_qr_code_ascii(self, qr_login): def display_qr_code_ascii(self, qr_login):
qr = qrcode.QRCode(box_size=1, border=1) qr = qrcode.QRCode(box_size=1, border=1)
@@ -736,44 +826,66 @@ class OptimizedTelegramScraper:
await self.export_data() await self.export_data()
elif choice == 'l': elif choice == 'l':
channels_list = [] channels_data = await self.list_channels()
async for dialog in self.client.iter_dialogs():
if dialog.id != 777000: if not channels_data:
channels_list.append(str(dialog.id)) continue
await self.list_channels()
print("\nTo add channels from the list above:") print("\nTo add channels from the list above:")
print("• Single: 1 or -1001234567890") print("• Single: 1 or -1001234567890")
print("• Multiple: 1,3,5 or mix formats") print("• Multiple: 1,3,5 or mix formats")
print("• All channels: all")
print("• Press Enter to skip adding") print("• Press Enter to skip adding")
selection = input("\nEnter selection (or Enter to skip): ").strip() selection = input("\nEnter selection (or Enter to skip): ").strip()
if selection: if selection:
added_count = 0 added_count = 0
for sel in [x.strip() for x in selection.split(',')]:
try:
if sel.startswith('-'):
channel = sel
else:
num = int(sel)
if 1 <= num <= len(channels_list):
channel = channels_list[num - 1]
else:
print(f"Invalid number: {num}. Choose 1-{len(channels_list)}")
continue
if channel in self.state['channels']: if selection.lower() == 'all':
print(f"Channel {channel} already added") for channel_info in channels_data:
else: channel_id = channel_info['channel_id']
self.state['channels'][channel] = 0 if channel_id not in self.state['channels']:
self.save_state() self.state['channels'][channel_id] = 0
print(f"✅ Added channel {channel}") if 'channel_names' not in self.state:
self.state['channel_names'] = {}
self.state['channel_names'][channel_id] = channel_info['username']
print(f"✅ Added channel {channel_info['channel_name']} (ID: {channel_id})")
added_count += 1 added_count += 1
else:
print(f"Channel {channel_info['channel_name']} already added")
else:
for sel in [x.strip() for x in selection.split(',')]:
try:
if sel.startswith('-'):
channel_id = sel
channel_info = next((c for c in channels_data if c['channel_id'] == channel_id), None)
if not channel_info:
print(f"Channel ID {channel_id} not found")
continue
else:
num = int(sel)
if 1 <= num <= len(channels_data):
channel_info = channels_data[num - 1]
channel_id = channel_info['channel_id']
else:
print(f"Invalid number: {num}. Choose 1-{len(channels_data)}")
continue
except ValueError: if channel_id in self.state['channels']:
print(f"Invalid input: {sel}") print(f"Channel {channel_info['channel_name']} already added")
else:
self.state['channels'][channel_id] = 0
if 'channel_names' not in self.state:
self.state['channel_names'] = {}
self.state['channel_names'][channel_id] = channel_info['username']
print(f"✅ Added channel {channel_info['channel_name']} (ID: {channel_id})")
added_count += 1
except ValueError:
print(f"Invalid input: {sel}")
if added_count > 0: if added_count > 0:
self.save_state()
print(f"\n🎉 Added {added_count} new channel(s)!") print(f"\n🎉 Added {added_count} new channel(s)!")
await self.view_channels() await self.view_channels()
else: else: