Update README.md

𝓾𝓷𝓷𝓸𝓱𝔀𝓷
2025-09-11 17:32:56 +02:00
committed by GitHub
parent 65b221ade6
commit 7db46018ce

README.md

@@ -11,40 +11,41 @@ ___________________ _________
\/ \/ \/ \/
``` ```
## What's New in v2.0 🎉 ## What's New in v3.0 🎉
**Major Performance Improvements:** **QR Code Authentication:**
- **5-10x faster scraping** with batch database operations - **No phone number required** - Login with QR code scanning (still need API credentials)
- **3x faster media downloads** with parallel processing (up to 3 concurrent downloads) - **Faster authentication** - Just scan with your phone after API setup
- **10-20x faster database operations** through connection pooling and batch insertions - **Secure login** - Recommended authentication method
- **Memory-efficient exports** that handle large datasets without running out of memory - **2FA support** for both QR and phone methods
- **Enhanced progress reporting** with actual message counts and percentages
**New Features:** **Enhanced User Experience:**
- **Message count display** in channel view - **Numbered channel selection** - Use 1,2,3 instead of full channel IDs
- **Configurable download concurrency** (adjustable in code) - **Multi-channel operations** - Add, remove, and scrape multiple channels at once
- **Better error handling** with exponential backoff retry mechanism - **Streamlined menu** - Cleaner interface with fewer redundant options
- **Optimized database structure** with indexes for faster queries - **Progress bars** for media downloads with visual feedback
- **Object-oriented design** for better code maintainability
**Technical Improvements:** **Media Download Improvements:**
- Database connection pooling - **Fixed file overwriting** - Unique naming prevents media files from being overwritten
- Batch message insertions (100 messages per batch) - **5x concurrent downloads** - Increased from 3 to 5 for faster media processing
- Streaming exports for large datasets - **Better error handling** - Improved retry logic and recovery
- Improved flood control handling
- Periodic state saving (every 50 messages) **Performance & Stability:**
- **Database optimizations** - WAL mode and faster operations
- **Hidden warnings** - Cleaner output without technical messages
- **Better error recovery** - More robust handling of network issues
## Features 🚀 ## Features 🚀
- **QR Code & Phone Authentication** - Choose your preferred login method
- Scrape messages from multiple Telegram channels - Scrape messages from multiple Telegram channels
- Download media files (photos, documents) with parallel processing - Download media files with parallel processing and unique naming
- Real-time continuous scraping - Real-time continuous scraping
- Export data to JSON and CSV formats - Export data to JSON and CSV formats
- SQLite database storage with optimized performance - SQLite database storage with optimized performance
- Resume capability (saves progress) - Resume capability (saves progress)
- Media reprocessing for failed downloads - Interactive menu with numbered channel selection
- Enhanced progress tracking with message counts - Progress tracking with visual progress bars
- Interactive menu interface
## Prerequisites 📋 ## Prerequisites 📋
@@ -60,13 +61,6 @@ Before running the script, you'll need:
pip install -r requirements.txt pip install -r requirements.txt
``` ```
Contents of `requirements.txt`:
```
telethon
aiohttp
asyncio
```
## Getting Telegram API Credentials 🔑 ## Getting Telegram API Credentials 🔑
1. Visit https://my.telegram.org/auth 1. Visit https://my.telegram.org/auth
@@ -88,7 +82,7 @@ Keep these credentials safe, you'll need them to run the script!
1. Clone the repository: 1. Clone the repository:
```bash ```bash
git clone https://github.com/robertaitch/telegram-scraper.git git clone https://github.com/unnohwn/telegram-scraper.git
cd telegram-scraper cd telegram-scraper
``` ```
@@ -103,132 +97,86 @@ python telegram-scraper.py
``` ```
4. On first run, you'll be prompted to enter: 4. On first run, you'll be prompted to enter:
- Your API ID - Your API ID (from my.telegram.org)
- Your API Hash - Your API Hash (from my.telegram.org)
- Your phone number (with country code) - **Choose authentication method:**
- Your phone number (with country code) or bot, but use the phone number option when prompted second time. - **QR Code** (Recommended) - Scan with your phone (no phone number needed)
- Verification code (sent to your Telegram) - **Phone Number** - Traditional SMS verification
## Initial Scraping Behavior 🕒
When scraping a channel for the first time, please note:
- The script will attempt to retrieve the entire channel history, starting from the oldest messages
- **Significantly faster than previous versions** due to batch processing and parallel downloads
- Initial scraping time depends on:
- The total number of messages in the channel
- Whether media downloading is enabled
- The size and number of media files
- Your internet connection speed
- Telegram's rate limiting
- The script uses pagination and maintains state, so if interrupted, it can resume from where it left off
- **Enhanced progress display** shows actual message counts (e.g., "1,500/10,000 messages")
- Messages are stored in the database in batches for optimal performance
- **Media downloads run in parallel** (up to 3 simultaneous downloads) for faster processing
## Usage 📝 ## Usage 📝
The script provides an interactive menu with the following options: The script provides a clean interactive menu:
- **[A]** Add new channel ```
- Enter the channel ID or channelname ========================================
- **[R]** Remove channel TELEGRAM SCRAPER
- Remove a channel from scraping list ========================================
- **[S]** Scrape all channels [S] Scrape channels
- One-time scraping of all configured channels [C] Continuous scraping
- **[M]** Toggle media scraping [M] Media scraping: ON
- Enable/disable downloading of media files [L] List & add channels
- **[C]** Continuous scraping [R] Remove channels
- Real-time monitoring of channels for new messages [E] Export data
- **[E]** Export data [T] Rescrape media
- Export to JSON and CSV formats (memory-efficient for large datasets) [Q] Quit
- **[V]** View saved channels ========================================
- List all saved channels **with message counts** ```
- **[L]** List account channels
- List all channels with ID:s for account
- **[Q]** Quit
### Channel IDs 📢 ### Channel Selection Made Easy 🔢
You can use either: Instead of typing long channel IDs, use numbers:
- Channel username (e.g., `channelname`)
- Channel ID (e.g., `-1001234567890`) **Adding Channels:**
```
[1] The News (Chat) (id: -1002116176890)
[2] Python Channel (id: -1001597139842)
[3] The Corner (id: -1002274713954)
Enter: 1,3 (adds channels 1 and 3)
```
**Scraping Channels:**
- Single: `1`
- Multiple: `1,3,5`
- All: `all`
- Mix formats: `1,-1001597139842,3`
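
The selection formats listed above (single number, comma-separated numbers, `all`, and numbers mixed with raw IDs) could be resolved roughly as follows. This is a minimal sketch, not the script's actual code: the function name `parse_selection` and the `listed_ids` argument are hypothetical, and it assumes any token with a leading minus sign is a raw channel ID rather than a menu number.

```python
def parse_selection(text, listed_ids):
    """Resolve a selection such as "1,3", "all", or "1,-1001597139842,3".

    listed_ids is the ordered list of channel IDs as shown in the numbered
    menu (hypothetical argument, introduced for this illustration).
    """
    text = text.strip()
    if text.lower() == "all":
        return list(listed_ids)

    selected = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue
        if token.startswith("-"):
            # Assumption: a leading minus marks a raw channel ID.
            channel_id = int(token)
        else:
            # Otherwise the token is a 1-based position in the menu.
            index = int(token)
            if not 1 <= index <= len(listed_ids):
                raise ValueError(f"no channel numbered {index}")
            channel_id = listed_ids[index - 1]
        if channel_id not in selected:
            selected.append(channel_id)
    return selected
```

Duplicates are dropped while the input order is kept, so `1,-1001597139842,3` and `1,2,3` resolve to the same list when channel 2 has that ID.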
## Data Storage 💾 ## Data Storage 💾
### Database Structure ### Database Structure
Data is stored in SQLite databases, one per channel with **optimized indexes**: Data is stored in SQLite databases, one per channel:
- Location: `./channelname/channelname.db` - Location: `./channelname/channelname.db`
- Table: `messages` - Optimized with indexes for fast queries
- `id`: Primary key - WAL mode for better performance
- `message_id`: Telegram message ID (indexed)
- `date`: Message timestamp (indexed)
- `sender_id`: Sender's Telegram ID
- `first_name`: Sender's first name
- `last_name`: Sender's last name
- `username`: Sender's username
- `message`: Message text
- `media_type`: Type of media (if any)
- `media_path`: Local path to downloaded media
- `reply_to`: ID of replied message (if any)
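
To illustrate what "WAL mode" and "indexes" mean for a per-channel SQLite file, here is a small sketch using Python's standard `sqlite3` module. It is not taken from the script: the function name `open_channel_db` and the index names are assumptions, and the table is reduced to a subset of the columns listed in the old README.

```python
import sqlite3


def open_channel_db(path):
    """Open (or create) a channel database with WAL journaling and indexes."""
    conn = sqlite3.connect(path)
    # Write-ahead logging lets readers proceed while a write is in progress.
    conn.execute("PRAGMA journal_mode=WAL")
    # Reduced schema for illustration; the real table has more columns.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id INTEGER PRIMARY KEY,
               message_id INTEGER,
               date TEXT,
               message TEXT
           )"""
    )
    # Index names are hypothetical.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_message_id ON messages(message_id)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_date ON messages(date)")
    conn.commit()
    return conn
```

WAL mode only applies to file-backed databases; an in-memory database keeps its own journal mode.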
### Media Storage 📁 ### Media Storage 📁
Media files are stored in: Media files are stored with unique naming:
- Location: `./channelname/media/` - Location: `./channelname/media/`
- Files are named using message ID or original filename - Format: `{message_id}-{unique_id}-{original_name}.ext`
- **Parallel downloads** for faster media acquisition - **No more file overwrites** - Each file gets a unique name
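
A file name in the `{message_id}-{unique_id}-{original_name}.ext` format could be built along these lines. The README does not say how the unique ID is generated, so the use of a shortened `uuid4` value here is an assumption, as is the helper name `unique_media_path`.

```python
import uuid
from pathlib import Path


def unique_media_path(media_dir, message_id, original_name):
    """Build a media path that will not collide with earlier downloads."""
    original = Path(original_name)
    # Assumption: any short random token works as the unique part.
    unique_id = uuid.uuid4().hex[:8]
    filename = f"{message_id}-{unique_id}-{original.stem}{original.suffix}"
    return Path(media_dir) / filename
```

Two attachments that share both a message ID and an original file name still get different paths, which is what prevents overwrites.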
### Exported Data 📊 ### Exported Data 📊
Data can be exported in two formats with **memory-efficient processing**: Export formats:
1. **CSV**: `./channelname/channelname.csv` 1. **CSV**: `./channelname/channelname.csv`
- Human-readable spreadsheet format
- Easy to import into Excel/Google Sheets
- **Streaming export** handles large datasets
2. **JSON**: `./channelname/channelname.json` 2. **JSON**: `./channelname/channelname.json`
- Structured data format
- Ideal for programmatic processing
- **Memory-optimized** for large files
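
The "streaming export" mentioned in the old README can be pictured as reading rows from SQLite in chunks and writing them straight to the CSV file, so the full table is never held in memory. The sketch below only uses the standard library; the function name `export_csv` and the `chunk_size` parameter are hypothetical, and it assumes a `messages` table with a `date` column as described in the old README.

```python
import csv
import sqlite3


def export_csv(db_path, csv_path, chunk_size=1000):
    """Write the messages table to CSV without loading it all at once."""
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute("SELECT * FROM messages ORDER BY date")
        columns = [description[0] for description in cursor.description]
        with open(csv_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(columns)
            while True:
                # Fetch a bounded number of rows per iteration.
                rows = cursor.fetchmany(chunk_size)
                if not rows:
                    break
                writer.writerows(rows)
    finally:
        conn.close()
```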
## Performance Tuning ⚙️ ## Performance Features ⚙️
You can adjust these performance settings in the code: - **5 concurrent downloads** for faster media processing
- `max_concurrent_downloads = 3`: Number of simultaneous media downloads - **Batch database operations** for optimal speed
- `batch_size = 100`: Number of messages processed in each batch - **Progress bars** with real-time feedback
- `state_save_interval = 50`: How often to save progress - **Resume capability** - Continue where you left off
- **Memory-efficient** exports for large datasets
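
A cap of five simultaneous downloads is commonly enforced with an `asyncio.Semaphore`. The following is a sketch of that pattern under stated assumptions, not the script's implementation: `download_all` and the `download_one` callback are hypothetical names, and `download_one` stands in for whatever coroutine actually fetches a media file.

```python
import asyncio

MAX_CONCURRENT_DOWNLOADS = 5  # value taken from the README text above


async def download_all(items, download_one, limit=MAX_CONCURRENT_DOWNLOADS):
    """Run download_one(item) for every item, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        # Each task waits here until one of the `limit` slots is free.
        async with semaphore:
            return await download_one(item)

    # gather() returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))
```

Raising the limit speeds up large batches only until Telegram's rate limiting becomes the bottleneck.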
## Features in Detail 🔍
### Continuous Scraping
The continuous scraping feature (`[C]` option) allows you to:
- Monitor channels in real-time
- Automatically download new messages
- Download media as it's posted with parallel processing
- Run indefinitely until interrupted (Ctrl+C)
- Maintains state between runs
### Media Handling
The script can download:
- Photos
- Documents
- Other media types supported by Telegram
- **Parallel downloads** for faster processing
- **Improved retry mechanism** with exponential backoff
- Skips existing files to avoid duplicates
## Error Handling 🛠️ ## Error Handling 🛠️
The script includes: - Automatic retry with exponential backoff
- **Enhanced retry mechanism** with exponential backoff for failed media downloads - Rate limit compliance
- State preservation in case of interruption - Network error recovery
- **Improved flood control** compliance - State preservation during interruptions
- Comprehensive error logging for failed operations
- **Better rate limit handling** with automatic waiting
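
"Retry with exponential backoff" means waiting longer after each failed attempt: for example 1 s, 2 s, 4 s, and so on up to a ceiling. A minimal sketch of the idea follows; the helper name `retry_with_backoff`, its default values, and the small random jitter are all assumptions rather than details from the script.

```python
import asyncio
import random


async def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Await operation(), retrying with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the last error.
                raise
            # Delay doubles each attempt: base, 2*base, 4*base, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Small jitter so parallel tasks do not retry in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
```

In a real scraper the `except` clause would usually be narrowed to network and rate-limit errors instead of catching every exception.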
## Limitations ⚠️ ## Limitations ⚠️
@@ -236,10 +184,6 @@ The script includes:
- Can only access public channels or channels you're a member of - Can only access public channels or channels you're a member of
- Media download size limits apply as per Telegram's restrictions - Media download size limits apply as per Telegram's restrictions
## Contributing 🤝
Contributions are welcome! Please feel free to submit a Pull Request.
## License 📄 ## License 📄
This project is licensed under the MIT License - see the LICENSE file for details. This project is licensed under the MIT License - see the LICENSE file for details.