Version 3.1

2025-12-12 15:38:09 +01:00
parent 8d4e092b1b
commit fb7ad3742e
2 changed files with 229 additions and 107 deletions
--- a/README.md
+++ b/README.md
@@ -11,41 +11,39 @@ ___________________  _________
                 \/        \/
 ```

-## What's New in v3.0 🎉
+## What's New in v3.1 🎉

-**QR Code Authentication:**
- **No phone number required** - Login with QR code scanning (still need API credentials)
- **Faster authentication** - Just scan with your phone after API setup
- **Secure login** - Recommended authentication method
- **2FA support** for both QR and phone methods
+**Enhanced Message Data:**
+- **Message statistics** - Captures views, forwards, and post_author for each message
+- **Reactions support** - Records all emoji reactions with counts (e.g., "😀 12 👍 3")
+- **Automatic database migration** - Seamlessly adds new columns to existing databases
+- **Richer exports** - All new data included in CSV/JSON exports

-**Enhanced User Experience:**
- **Numbered channel selection** - Use 1,2,3 instead of full channel IDs
- **Multi-channel operations** - Add, remove, and scrape multiple channels at once
- **Streamlined menu** - Cleaner interface with fewer redundant options
- **Progress bars** for media downloads with visual feedback
+**Improved Channel Management:**
+- **Channel names displayed** - Shows channel names alongside IDs everywhere
+- **Smart filtering** - List option now only shows Channels and Groups (no private chats)
+- **channels_list.csv export** - Automatically saves channel list with names, IDs, usernames, and types
+- **"all" selection** - Quickly add all listed channels at once
+- **Better export naming** - Files now named as `ID_username.csv` and `ID_username.json`

-**Media Download Improvements:**
- **Fixed file overwriting** - Unique naming prevents media files from being overwritten
- **5x concurrent downloads** - Increased from 3 to 5 for faster media processing
- **Better error handling** - Improved retry logic and recovery
-
-**Performance & Stability:**
- **Database optimizations** - WAL mode and faster operations
- **Hidden warnings** - Cleaner output without technical messages
- **Better error recovery** - More robust handling of network issues
+**Bug Fixes:**
+- **Fixed channel ID parsing** - Resolved "invalid literal for int()" error in fix missing media
+- **Better entity resolution** - Handles both numeric IDs and channel usernames
+- **Improved error messages** - Shows channel names with IDs for clearer debugging

 ## Features 🚀

 - **QR Code & Phone Authentication** - Choose your preferred login method
- Scrape messages from multiple Telegram channels
+- Scrape messages with full metadata (views, forwards, reactions, post author)
 - Download media files with parallel processing and unique naming
 - Real-time continuous scraping
- Export data to JSON and CSV formats
- SQLite database storage with optimized performance
+- Export data to JSON and CSV formats with enhanced metadata
+- SQLite database storage with automatic schema migration
 - Resume capability (saves progress)
- Interactive menu with numbered channel selection
+- Interactive menu with channel names and numbered selection
+- Smart channel filtering (only shows channels/groups)
 - Progress tracking with visual progress bars
+- Automatic channels list export to CSV

 ## Prerequisites 📋

@@ -128,11 +126,18 @@ Instead of typing long channel IDs, use numbers:

 **Adding Channels:**
 ```
-[1] The News (Chat) (id: -1002116176890)
-[2] Python Channel (id: -1001597139842)
-[3] The Corner (id: -1002274713954)
+[1] Tech News (ID: -1002116176890, Type: Channel, Username: @technews)
+[2] Python Dev (ID: -1001597139842, Type: Group, Username: @pythondev)
+[3] Daily Updates (ID: -1002274713954, Type: Channel, Username: @dailyupdates)

 Enter: 1,3 (adds channels 1 and 3)
+Or: all (adds all listed channels)
+```
+
+**Viewing Your Channels:**
+```
+[1] Tech News (ID: -1002116176890), Last Message ID: 5234, Messages: 12450
+[2] Python Dev (ID: -1001597139842), Last Message ID: 8192, Messages: 45782
 ```

 **Scraping Channels:**
@@ -149,6 +154,8 @@ Data is stored in SQLite databases, one per channel:
 - Location: `./channelname/channelname.db`
 - Optimized with indexes for fast queries
 - WAL mode for better performance
+- Schema includes: message_id, date, sender info, message text, media info, reply_to, post_author, views, forwards, reactions
+- Automatic migration adds new columns to existing databases

 ### Media Storage 📁

@@ -160,8 +167,11 @@ Media files are stored with unique naming:
 ### Exported Data 📊

 Export formats:
-1. **CSV**: `./channelname/channelname.csv`
-2. **JSON**: `./channelname/channelname.json`
+1. **CSV**: `./channelname/channelid_username.csv`
+2. **JSON**: `./channelname/channelid_username.json`
+3. **Channel List**: `./channels_list.csv` (automatically created when using [L] option)
+
+All exports include complete message metadata: views, forwards, reactions, and post author information.

 ## Performance Features ⚙️

--- a/telegram-scraper.py
+++ b/telegram-scraper.py
@@ -12,7 +12,7 @@ from typing import Dict, List, Optional, Any
 from pathlib import Path
 from io import StringIO
 from telethon import TelegramClient
-from telethon.tl.types import MessageMediaPhoto, MessageMediaDocument, MessageMediaWebPage, User, PeerChannel
+from telethon.tl.types import MessageMediaPhoto, MessageMediaDocument, MessageMediaWebPage, User, PeerChannel, Channel, Chat
 from telethon.errors import FloodWaitError, SessionPasswordNeededError
 import qrcode

@@ -43,6 +43,10 @@ class MessageData:
    media_type: Optional[str]
    media_path: Optional[str]
    reply_to: Optional[int]
+    post_author: Optional[str]
+    views: Optional[int]
+    forwards: Optional[int]
+    reactions: Optional[str]

 class OptimizedTelegramScraper:
    def __init__(self):
@@ -66,6 +70,7 @@ class OptimizedTelegramScraper:
            'api_id': None,
            'api_hash': None,
            'channels': {},
+            'channel_names': {},
            'scrape_media': True,
        }

@@ -86,16 +91,44 @@ class OptimizedTelegramScraper:
            conn.execute('''CREATE TABLE IF NOT EXISTS messages
                          (id INTEGER PRIMARY KEY, message_id INTEGER UNIQUE, date TEXT,
                           sender_id INTEGER, first_name TEXT, last_name TEXT, username TEXT,
-                           message TEXT, media_type TEXT, media_path TEXT, reply_to INTEGER)''')
+                           message TEXT, media_type TEXT, media_path TEXT, reply_to INTEGER,
+                           post_author TEXT, views INTEGER, forwards INTEGER, reactions TEXT)''')
            conn.execute('CREATE INDEX IF NOT EXISTS idx_message_id ON messages(message_id)')
            conn.execute('CREATE INDEX IF NOT EXISTS idx_date ON messages(date)')
            conn.execute('PRAGMA journal_mode=WAL')
            conn.execute('PRAGMA synchronous=NORMAL')
            conn.commit()
+
+            self.migrate_database(conn)
+
            self.db_connections[channel] = conn

        return self.db_connections[channel]

+    def migrate_database(self, conn: sqlite3.Connection):
+        cursor = conn.cursor()
+        cursor.execute("PRAGMA table_info(messages)")
+        columns = {row[1] for row in cursor.fetchall()}
+
+        migrations = []
+        if 'post_author' not in columns:
+            migrations.append('ALTER TABLE messages ADD COLUMN post_author TEXT')
+        if 'views' not in columns:
+            migrations.append('ALTER TABLE messages ADD COLUMN views INTEGER')
+        if 'forwards' not in columns:
+            migrations.append('ALTER TABLE messages ADD COLUMN forwards INTEGER')
+        if 'reactions' not in columns:
+            migrations.append('ALTER TABLE messages ADD COLUMN reactions TEXT')
+
+        for migration in migrations:
+            try:
+                conn.execute(migration)
+            except:
+                pass
+
+        if migrations:
+            conn.commit()
+
    def close_db_connections(self):
        for conn in self.db_connections.values():
            conn.close()
@@ -108,12 +141,14 @@ class OptimizedTelegramScraper:
        conn = self.get_db_connection(channel)
        data = [(msg.message_id, msg.date, msg.sender_id, msg.first_name,
                msg.last_name, msg.username, msg.message, msg.media_type,
-                msg.media_path, msg.reply_to) for msg in messages]
+                msg.media_path, msg.reply_to, msg.post_author, msg.views,
+                msg.forwards, msg.reactions) for msg in messages]

        conn.executemany('''INSERT OR IGNORE INTO messages
                           (message_id, date, sender_id, first_name, last_name, username,
-                            message, media_type, media_path, reply_to)
-                           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''', data)
+                            message, media_type, media_path, reply_to, post_author, views,
+                            forwards, reactions)
+                           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)''', data)
        conn.commit()

    async def download_media(self, channel: str, message) -> Optional[str]:
@@ -196,6 +231,17 @@ class OptimizedTelegramScraper:
                try:
                    sender = await message.get_sender()

+                    reactions_str = None
+                    if message.reactions and message.reactions.results:
+                        reactions_parts = []
+                        for reaction in message.reactions.results:
+                            emoji = getattr(reaction.reaction, 'emoticon', '')
+                            count = reaction.count
+                            if emoji:
+                                reactions_parts.append(f"{emoji} {count}")
+                        if reactions_parts:
+                            reactions_str = ' '.join(reactions_parts)
+
                    msg_data = MessageData(
                        message_id=message.id,
                        date=message.date.strftime('%Y-%m-%d %H:%M:%S'),
@@ -206,7 +252,11 @@ class OptimizedTelegramScraper:
                        message=message.message or '',
                        media_type=message.media.__class__.__name__ if message.media else None,
                        media_path=None,
-                        reply_to=message.reply_to_msg_id if message.reply_to else None
+                        reply_to=message.reply_to_msg_id if message.reply_to else None,
+                        post_author=message.post_author,
+                        views=message.views,
+                        forwards=message.forwards,
+                        reactions=reactions_str
                    )

                    message_batch.append(msg_data)
@@ -289,14 +339,19 @@ class OptimizedTelegramScraper:
        cursor.execute('SELECT message_id FROM messages WHERE media_type IS NOT NULL AND media_type != "MessageMediaWebPage" AND media_path IS NULL')
        message_ids = [row[0] for row in cursor.fetchall()]

+        channel_name = self.state.get('channel_names', {}).get(channel, 'Unknown')
+
        if not message_ids:
-            print(f"No media files to reprocess for channel {channel}")
+            print(f"No media files to reprocess for {channel_name} (ID: {channel})")
            return

-        print(f"📥 Reprocessing {len(message_ids)} media files for channel {channel}")
+        print(f"📥 Reprocessing {len(message_ids)} media files for {channel_name} (ID: {channel})")

        try:
-            entity = await self.client.get_entity(PeerChannel(int(channel)))
+            if channel.lstrip('-').isdigit():
+                entity = await self.client.get_entity(PeerChannel(int(channel)))
+            else:
+                entity = await self.client.get_entity(channel)
            semaphore = asyncio.Semaphore(self.max_concurrent_downloads)
            completed_media = 0
            successful_downloads = 0
@@ -348,7 +403,8 @@ class OptimizedTelegramScraper:

        missing_count = total_with_media - total_with_files

-        print(f"\n📊 Media Analysis for {channel}:")
+        channel_name = self.state.get('channel_names', {}).get(channel, 'Unknown')
+        print(f"\n📊 Media Analysis for {channel_name} (ID: {channel}):")
        print(f"Messages with media: {total_with_media}")
        print(f"Media files downloaded: {total_with_files}")
        print(f"Missing media files: {missing_count}")
@@ -367,7 +423,10 @@ class OptimizedTelegramScraper:
        print(f"\n🔧 Attempting to download {len(missing_media)} missing media files...")

        try:
-            entity = await self.client.get_entity(PeerChannel(int(channel)))
+            if channel.lstrip('-').isdigit():
+                entity = await self.client.get_entity(PeerChannel(int(channel)))
+            else:
+                entity = await self.client.get_entity(channel)
            semaphore = asyncio.Semaphore(self.max_concurrent_downloads)
            completed_media = 0
            successful_downloads = 0
@@ -432,9 +491,14 @@ class OptimizedTelegramScraper:
        finally:
            self.continuous_scraping_active = False

+    def get_export_filename(self, channel: str):
+        username = self.state.get('channel_names', {}).get(channel, 'no_username')
+        return f"{channel}_{username}"
+
    def export_to_csv(self, channel: str):
        conn = self.get_db_connection(channel)
-        csv_file = Path(channel) / f'{channel}.csv'
+        filename = self.get_export_filename(channel)
+        csv_file = Path(channel) / f'{filename}.csv'

        cursor = conn.cursor()
        cursor.execute('SELECT * FROM messages ORDER BY date')
@@ -452,7 +516,8 @@ class OptimizedTelegramScraper:

    def export_to_json(self, channel: str):
        conn = self.get_db_connection(channel)
-        json_file = Path(channel) / f'{channel}.json'
+        filename = self.get_export_filename(channel)
+        json_file = Path(channel) / f'{filename}.json'

        cursor = conn.cursor()
        cursor.execute('SELECT * FROM messages ORDER BY date')
@@ -504,20 +569,45 @@ class OptimizedTelegramScraper:
                cursor = conn.cursor()
                cursor.execute('SELECT COUNT(*) FROM messages')
                count = cursor.fetchone()[0]
-                print(f"[{i}] Channel ID: {channel}, Last Message ID: {last_id}, Messages: {count}")
+                channel_name = self.state.get('channel_names', {}).get(channel, 'Unknown')
+                print(f"[{i}] {channel_name} (ID: {channel}), Last Message ID: {last_id}, Messages: {count}")
            except:
-                print(f"[{i}] Channel ID: {channel}, Last Message ID: {last_id}")
+                channel_name = self.state.get('channel_names', {}).get(channel, 'Unknown')
+                print(f"[{i}] {channel_name} (ID: {channel}), Last Message ID: {last_id}")

    async def list_channels(self):
        try:
-            print("\nList of channels joined by account:")
+            print("\nList of channels and groups joined by account:")
            count = 1
+            channels_data = []
            async for dialog in self.client.iter_dialogs():
-                if dialog.id != 777000:
-                    print(f"[{count}] {dialog.title} (id: {dialog.id})")
+                entity = dialog.entity
+                if dialog.id != 777000 and (isinstance(entity, Channel) or isinstance(entity, Chat)):
+                    channel_type = "Channel" if isinstance(entity, Channel) and entity.broadcast else "Group"
+                    username = getattr(entity, 'username', None) or 'no_username'
+                    print(f"[{count}] {dialog.title} (ID: {dialog.id}, Type: {channel_type}, Username: @{username})")
+                    channels_data.append({
+                        'number': count,
+                        'channel_name': dialog.title,
+                        'channel_id': str(dialog.id),
+                        'username': username,
+                        'type': channel_type
+                    })
                    count += 1
+
+            if channels_data:
+                csv_file = Path('channels_list.csv')
+                with open(csv_file, 'w', newline='', encoding='utf-8') as f:
+                    writer = csv.DictWriter(f, fieldnames=['number', 'channel_name', 'channel_id', 'username', 'type'])
+                    writer.writeheader()
+                    writer.writerows(channels_data)
+                print(f"\n✅ Saved channels list to {csv_file}")
+
+            return channels_data
+
        except Exception as e:
            print(f"Error listing channels: {e}")
+            return []

    def display_qr_code_ascii(self, qr_login):
        qr = qrcode.QRCode(box_size=1, border=1)
@@ -736,44 +826,66 @@ class OptimizedTelegramScraper:
                    await self.export_data()
                    
                elif choice == 'l':
-                    channels_list = []
-                    async for dialog in self.client.iter_dialogs():
-                        if dialog.id != 777000:
-                            channels_list.append(str(dialog.id))
+                    channels_data = await self.list_channels()
+
+                    if not channels_data:
+                        continue

-                    await self.list_channels()
                    print("\nTo add channels from the list above:")
                    print("• Single: 1 or -1001234567890")
                    print("• Multiple: 1,3,5 or mix formats")
+                    print("• All channels: all")
                    print("• Press Enter to skip adding")
                    selection = input("\nEnter selection (or Enter to skip): ").strip()

                    if selection:
                        added_count = 0
-                        for sel in [x.strip() for x in selection.split(',')]:
-                            try:
-                                if sel.startswith('-'):
-                                    channel = sel
-                                else:
-                                    num = int(sel)
-                                    if 1 <= num <= len(channels_list):
-                                        channel = channels_list[num - 1]
-                                    else:
-                                        print(f"Invalid number: {num}. Choose 1-{len(channels_list)}")
-                                        continue

-                                if channel in self.state['channels']:
-                                    print(f"Channel {channel} already added")
-                                else:
-                                    self.state['channels'][channel] = 0
-                                    self.save_state()
-                                    print(f"✅ Added channel {channel}")
+                        if selection.lower() == 'all':
+                            for channel_info in channels_data:
+                                channel_id = channel_info['channel_id']
+                                if channel_id not in self.state['channels']:
+                                    self.state['channels'][channel_id] = 0
+                                    if 'channel_names' not in self.state:
+                                        self.state['channel_names'] = {}
+                                    self.state['channel_names'][channel_id] = channel_info['username']
+                                    print(f"✅ Added channel {channel_info['channel_name']} (ID: {channel_id})")
                                    added_count += 1
+                                else:
+                                    print(f"Channel {channel_info['channel_name']} already added")
+                        else:
+                            for sel in [x.strip() for x in selection.split(',')]:
+                                try:
+                                    if sel.startswith('-'):
+                                        channel_id = sel
+                                        channel_info = next((c for c in channels_data if c['channel_id'] == channel_id), None)
+                                        if not channel_info:
+                                            print(f"Channel ID {channel_id} not found")
+                                            continue
+                                    else:
+                                        num = int(sel)
+                                        if 1 <= num <= len(channels_data):
+                                            channel_info = channels_data[num - 1]
+                                            channel_id = channel_info['channel_id']
+                                        else:
+                                            print(f"Invalid number: {num}. Choose 1-{len(channels_data)}")
+                                            continue

-                            except ValueError:
-                                print(f"Invalid input: {sel}")
+                                    if channel_id in self.state['channels']:
+                                        print(f"Channel {channel_info['channel_name']} already added")
+                                    else:
+                                        self.state['channels'][channel_id] = 0
+                                        if 'channel_names' not in self.state:
+                                            self.state['channel_names'] = {}
+                                        self.state['channel_names'][channel_id] = channel_info['username']
+                                        print(f"✅ Added channel {channel_info['channel_name']} (ID: {channel_id})")
+                                        added_count += 1
+
+                                except ValueError:
+                                    print(f"Invalid input: {sel}")

                        if added_count > 0:
+                            self.save_state()
                            print(f"\n🎉 Added {added_count} new channel(s)!")
                            await self.view_channels()
                        else: