Update README.md

2025-09-11 17:32:56 +02:00
parent 65b221ade6
commit 7db46018ce
1 changed file with 80 additions and 136 deletions
--- a/README.md
+++ b/README.md
@@ -11,40 +11,41 @@ ___________________  _________
                 \/        \/
 ```
-## What's New in v2.0 🎉
+## What's New in v3.0 🎉
-**Major Performance Improvements:**
+**QR Code Authentication:**
- **5-10x faster scraping** with batch database operations
+- **No phone number required** - Login with QR code scanning (still need API credentials)
- **3x faster media downloads** with parallel processing (up to 3 concurrent downloads)
+- **Faster authentication** - Just scan with your phone after API setup
- **10-20x faster database operations** through connection pooling and batch insertions
+- **Secure login** - Recommended authentication method
- **Memory-efficient exports** that handle large datasets without running out of memory
+- **2FA support** for both QR and phone methods
 - **Enhanced progress reporting** with actual message counts and percentages
-**New Features:**
+**Enhanced User Experience:**
- **Message count display** in channel view
+- **Numbered channel selection** - Use 1,2,3 instead of full channel IDs
- **Configurable download concurrency** (adjustable in code)
+- **Multi-channel operations** - Add, remove, and scrape multiple channels at once
- **Better error handling** with exponential backoff retry mechanism
+- **Streamlined menu** - Cleaner interface with fewer redundant options
- **Optimized database structure** with indexes for faster queries
+- **Progress bars** for media downloads with visual feedback
 - **Object-oriented design** for better code maintainability
-**Technical Improvements:**
+**Media Download Improvements:**
- Database connection pooling
+- **Fixed file overwriting** - Unique naming prevents media files from being overwritten
- Batch message insertions (100 messages per batch)
+- **5x concurrent downloads** - Increased from 3 to 5 for faster media processing
- Streaming exports for large datasets
+- **Better error handling** - Improved retry logic and recovery
- Improved flood control handling
+
- Periodic state saving (every 50 messages)
+**Performance & Stability:**
 - **Database optimizations** - WAL mode and faster operations
 - **Hidden warnings** - Cleaner output without technical messages
 - **Better error recovery** - More robust handling of network issues
 ## Features 🚀
 - **QR Code & Phone Authentication** - Choose your preferred login method
 - Scrape messages from multiple Telegram channels
- Download media files (photos, documents) with parallel processing
+- Download media files with parallel processing and unique naming
 - Real-time continuous scraping
 - Export data to JSON and CSV formats
 - SQLite database storage with optimized performance
 - Resume capability (saves progress)
- Media reprocessing for failed downloads
+- Interactive menu with numbered channel selection
- Enhanced progress tracking with message counts
+- Progress tracking with visual progress bars
 - Interactive menu interface
 ## Prerequisites 📋
@@ -60,13 +61,6 @@ Before running the script, you'll need:
 pip install -r requirements.txt
 ```
 Contents of `requirements.txt`:
 ```
 telethon
 aiohttp
 asyncio
 ```
 ## Getting Telegram API Credentials 🔑
 1. Visit https://my.telegram.org/auth
@@ -88,7 +82,7 @@ Keep these credentials safe, you'll need them to run the script!
 1. Clone the repository:
 ```bash
-git clone https://github.com/robertaitch/telegram-scraper.git
+git clone https://github.com/unnohwn/telegram-scraper.git
 cd telegram-scraper
 ```
@@ -103,132 +97,86 @@ python telegram-scraper.py
 ```
 4. On first run, you'll be prompted to enter:
-   - Your API ID
+   - Your API ID (from my.telegram.org)
-   - Your API Hash
+   - Your API Hash (from my.telegram.org)
-   - Your phone number (with country code)
+   - **Choose authentication method:**
-   - Your phone number (with country code) or bot, but use the phone number option when prompted second time.
+     - **QR Code** (Recommended) - Scan with your phone (no phone number needed)
-   - Verification code (sent to your Telegram)
+     - **Phone Number** - Traditional SMS verification
 ## Initial Scraping Behavior 🕒
 When scraping a channel for the first time, please note:
 - The script will attempt to retrieve the entire channel history, starting from the oldest messages
 - **Significantly faster than previous versions** due to batch processing and parallel downloads
 - Initial scraping time depends on:
  - The total number of messages in the channel
  - Whether media downloading is enabled
  - The size and number of media files
  - Your internet connection speed
  - Telegram's rate limiting
 - The script uses pagination and maintains state, so if interrupted, it can resume from where it left off
 - **Enhanced progress display** shows actual message counts (e.g., "1,500/10,000 messages")
 - Messages are stored in the database in batches for optimal performance
 - **Media downloads run in parallel** (up to 3 simultaneous downloads) for faster processing
 ## Usage 📝
-The script provides an interactive menu with the following options:
+The script provides a clean interactive menu:
- **[A]** Add new channel
+```
-  - Enter the channel ID or channelname
+========================================
- **[R]** Remove channel
+           TELEGRAM SCRAPER
-  - Remove a channel from scraping list
+========================================
- **[S]** Scrape all channels
+[S] Scrape channels
-  - One-time scraping of all configured channels
+[C] Continuous scraping  
- **[M]** Toggle media scraping
+[M] Media scraping: ON
-  - Enable/disable downloading of media files
+[L] List & add channels
- **[C]** Continuous scraping
+[R] Remove channels
-  - Real-time monitoring of channels for new messages
+[E] Export data
- **[E]** Export data
+[T] Rescrape media
-  - Export to JSON and CSV formats (memory-efficient for large datasets)
+[Q] Quit
- **[V]** View saved channels
+========================================
-  - List all saved channels **with message counts**
+```
 - **[L]** List account channels
  - List all channels with their IDs for the account
 - **[Q]** Quit
-### Channel IDs 📢
+### Channel Selection Made Easy 🔢
-You can use either:
+Instead of typing long channel IDs, use numbers:
- Channel username (e.g., `channelname`)
+
- Channel ID (e.g., `-1001234567890`)
+**Adding Channels:**
 ```
 [1] The News (Chat) (id: -1002116176890)
 [2] Python Channel (id: -1001597139842)
 [3] The Corner (id: -1002274713954)
 Enter: 1,3 (adds channels 1 and 3)
 ```
 **Scraping Channels:**
 - Single: `1`
 - Multiple: `1,3,5` 
 - All: `all`
 - Mix formats: `1,-1001597139842,3`
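The selection formats above can be resolved with a small parser. This is a minimal sketch, not the script's actual code: `parse_selection` and its arguments are hypothetical names, and it assumes that a token with a leading minus is a full channel ID while a plain number is a position in the displayed list.

```python
def parse_selection(raw, channel_ids):
    """Resolve input such as "1,3", "all" or "1,-1001597139842,3" to channel IDs."""
    raw = raw.strip()
    if raw.lower() == "all":
        return list(channel_ids)
    selected = []
    for token in raw.split(","):
        token = token.strip()
        if not token:
            continue
        if token.startswith("-"):
            # Assumption: a leading minus marks a full channel ID, not a position.
            channel_id = token
        elif token.isdigit() and 1 <= int(token) <= len(channel_ids):
            # Displayed numbers are 1-based, list indexes are 0-based.
            channel_id = channel_ids[int(token) - 1]
        else:
            raise ValueError(f"Unknown selection: {token!r}")
        if channel_id not in selected:
            selected.append(channel_id)
    return selected


channels = ["-1002116176890", "-1001597139842", "-1002274713954"]
print(parse_selection("1,3", channels))
```

Duplicates are dropped so that a mixed input such as `1,-1002116176890` does not queue the same channel twice.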
 ## Data Storage 💾
 ### Database Structure
-Data is stored in SQLite databases, one per channel with **optimized indexes**:
+Data is stored in SQLite databases, one per channel:
 - Location: `./channelname/channelname.db`
- Table: `messages`
+- Optimized with indexes for fast queries
-  - `id`: Primary key
+- WAL mode for better performance
  - `message_id`: Telegram message ID (indexed)
  - `date`: Message timestamp (indexed)
  - `sender_id`: Sender's Telegram ID
  - `first_name`: Sender's first name
  - `last_name`: Sender's last name
  - `username`: Sender's username
  - `message`: Message text
  - `media_type`: Type of media (if any)
  - `media_path`: Local path to downloaded media
  - `reply_to`: ID of replied message (if any)
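As an illustration of the layout described above, the following sketch creates a `messages` table with the listed columns, indexes `message_id` and `date`, enables WAL mode, and inserts rows in one batch. The column types, index names, and the helpers `open_db` and `insert_batch` are assumptions; the script's real schema may differ.

```python
import sqlite3

# Column names follow the README; the types are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    message_id INTEGER,
    date TEXT,
    sender_id INTEGER,
    first_name TEXT,
    last_name TEXT,
    username TEXT,
    message TEXT,
    media_type TEXT,
    media_path TEXT,
    reply_to INTEGER
);
CREATE INDEX IF NOT EXISTS idx_messages_message_id ON messages(message_id);
CREATE INDEX IF NOT EXISTS idx_messages_date ON messages(date);
"""


def open_db(path):
    """Open (or create) a channel database with WAL journaling and indexes."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.executescript(SCHEMA)
    return conn


def insert_batch(conn, rows):
    """Insert many 10-column rows in a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO messages VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", rows
        )
```

Writing a whole batch inside one transaction avoids a disk sync per message, which is where most of the speed-up from batching usually comes from.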
 ### Media Storage 📁
-Media files are stored in:
+Media files are stored with unique naming:
 - Location: `./channelname/media/`
- Files are named using message ID or original filename
+- Format: `{message_id}-{unique_id}-{original_name}.ext`
- **Parallel downloads** for faster media acquisition
+- **No more file overwrites** - Each file gets a unique name
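One way to build names in the `{message_id}-{unique_id}-{original_name}.ext` format is sketched below. How the script derives its unique ID is not stated here, so the short random UUID fragment, the `"file"` fallback, and the function name `unique_media_path` are assumptions.

```python
import uuid
from pathlib import Path


def unique_media_path(media_dir, message_id, original_name):
    """Return a path like media_dir/42-1a2b3c4d-photo.jpg."""
    # Keep only the file name so a stray directory component cannot escape media_dir.
    original = Path(Path(original_name).name)
    unique_id = uuid.uuid4().hex[:8]  # assumption: a random 8-character suffix
    stem = original.stem or "file"    # fallback when no original name is available
    return Path(media_dir) / f"{message_id}-{unique_id}-{stem}{original.suffix}"


print(unique_media_path("./channelname/media", 42, "photo.jpg"))
```

Two attachments that share an original filename, or two files on the same message, then land in different files instead of overwriting each other.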
 ### Exported Data 📊
-Data can be exported in two formats with **memory-efficient processing**:
+Export formats:
 1. **CSV**: `./channelname/channelname.csv`
   - Human-readable spreadsheet format
   - Easy to import into Excel/Google Sheets
   - **Streaming export** handles large datasets
 2. **JSON**: `./channelname/channelname.json`
   - Structured data format
   - Ideal for programmatic processing
   - **Memory-optimized** for large files
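A streaming export can be approximated by reading the table in chunks rather than loading it all at once. The sketch below is an illustration only: `export_csv` and `chunk_size` are hypothetical names, and it assumes a `messages` table like the one described under Database Structure.

```python
import csv
import sqlite3


def export_csv(conn, csv_path, chunk_size=1000):
    """Write the messages table to CSV in chunks; return the number of rows written."""
    cursor = conn.execute("SELECT * FROM messages ORDER BY message_id")
    written = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # Header row taken from the column names of the query result.
        writer.writerow([column[0] for column in cursor.description])
        while True:
            rows = cursor.fetchmany(chunk_size)
            if not rows:
                break
            writer.writerows(rows)
            written += len(rows)
    return written
```

Only `chunk_size` rows are held in memory at a time, so memory use stays roughly flat regardless of how large the channel history is.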
-## Performance Tuning ⚙️
+## Performance Features ⚙️
-You can adjust these performance settings in the code:
+- **5 concurrent downloads** for faster media processing
- `max_concurrent_downloads = 3`: Number of simultaneous media downloads
+- **Batch database operations** for optimal speed
- `batch_size = 100`: Number of messages processed in each batch
+- **Progress bars** with real-time feedback
- `state_save_interval = 50`: How often to save progress
+- **Resume capability** - Continue where you left off
-
+- **Memory-efficient** exports for large datasets
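A limit of 5 concurrent downloads is commonly enforced with an `asyncio.Semaphore`. The sketch below shows that pattern under stated assumptions: `download_all`, `MAX_CONCURRENT_DOWNLOADS`, and the stand-in `fake_download` coroutine are hypothetical, not the script's actual code.

```python
import asyncio

MAX_CONCURRENT_DOWNLOADS = 5  # value taken from the README; the name is an assumption


async def download_all(items, download_one, limit=MAX_CONCURRENT_DOWNLOADS):
    """Run download_one(item) for every item, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(item):
        async with semaphore:  # waits here while `limit` downloads are in flight
            return await download_one(item)

    # gather preserves the input order of the results.
    return await asyncio.gather(*(bounded(item) for item in items))


async def fake_download(message_id):
    """Stand-in for a real media download."""
    await asyncio.sleep(0.01)
    return f"{message_id}.jpg"


print(asyncio.run(download_all(range(10), fake_download)))
```

Raising the limit speeds up downloads only until Telegram's rate limiting kicks in, so a higher value is not automatically faster.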
 ## Features in Detail 🔍
 ### Continuous Scraping
 The continuous scraping feature (`[C]` option) allows you to:
 - Monitor channels in real-time
 - Automatically download new messages
 - Download media as it's posted with parallel processing
 - Run indefinitely until interrupted (Ctrl+C)
 - Maintain state between runs
 ### Media Handling
 The script can download:
 - Photos
 - Documents
 - Other media types supported by Telegram
 - **Parallel downloads** for faster processing
 - **Improved retry mechanism** with exponential backoff
 - Skips existing files to avoid duplicates
 ## Error Handling 🛠️
-The script includes:
+- Automatic retry with exponential backoff
- **Enhanced retry mechanism** with exponential backoff for failed media downloads
+- Rate limit compliance
- State preservation in case of interruption
+- Network error recovery
- **Improved flood control** compliance
+- State preservation during interruptions
 - Comprehensive error logging for failed operations
 - **Better rate limit handling** with automatic waiting
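Retry with exponential backoff generally means doubling the wait after each failed attempt. The following is a minimal sketch of that idea: the function name, the defaults, and the choice of which exceptions count as retryable are assumptions. Telegram flood waits, which Telethon raises as `FloodWaitError` with a `seconds` attribute, would need to be handled separately by sleeping for the requested time.

```python
import asyncio
import random


async def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Await operation(); on network errors, retry with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except (OSError, asyncio.TimeoutError):
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # 1x, 2x, 4x, ... the base delay, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Small random jitter so parallel downloads do not retry in lockstep.
            delay += random.uniform(0, delay * 0.1)
            await asyncio.sleep(delay)
```

`ConnectionError` and its subclasses derive from `OSError`, so dropped connections are covered by the `except` clause above.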
 ## Limitations ⚠️
@@ -236,10 +184,6 @@ The script includes:
 - Can only access public channels or channels you're a member of
 - Media download size limits apply as per Telegram's restrictions
 ## Contributing 🤝
 Contributions are welcome! Please feel free to submit a Pull Request.
 ## License 📄
 This project is licensed under the MIT License - see the LICENSE file for details.