Update README.md
216
README.md
@@ -11,40 +11,41 @@ ___________________ _________
 \/ \/
 ```
 
-## What's New in v2.0 🎉
+## What's New in v3.0 🎉
 
-**Major Performance Improvements:**
-- **5-10x faster scraping** with batch database operations
-- **3x faster media downloads** with parallel processing (up to 3 concurrent downloads)
-- **10-20x faster database operations** through connection pooling and batch insertions
-- **Memory-efficient exports** that handle large datasets without running out of memory
-- **Enhanced progress reporting** with actual message counts and percentages
+**QR Code Authentication:**
+- **No phone number required** - Login with QR code scanning (still need API credentials)
+- **Faster authentication** - Just scan with your phone after API setup
+- **Secure login** - Recommended authentication method
+- **2FA support** for both QR and phone methods
 
-**New Features:**
-- **Message count display** in channel view
-- **Configurable download concurrency** (adjustable in code)
-- **Better error handling** with exponential backoff retry mechanism
-- **Optimized database structure** with indexes for faster queries
-- **Object-oriented design** for better code maintainability
+**Enhanced User Experience:**
+- **Numbered channel selection** - Use 1,2,3 instead of full channel IDs
+- **Multi-channel operations** - Add, remove, and scrape multiple channels at once
+- **Streamlined menu** - Cleaner interface with fewer redundant options
+- **Progress bars** for media downloads with visual feedback
 
-**Technical Improvements:**
-- Database connection pooling
-- Batch message insertions (100 messages per batch)
-- Streaming exports for large datasets
-- Improved flood control handling
-- Periodic state saving (every 50 messages)
+**Media Download Improvements:**
+- **Fixed file overwriting** - Unique naming prevents media files from being overwritten
+- **5x concurrent downloads** - Increased from 3 to 5 for faster media processing
+- **Better error handling** - Improved retry logic and recovery
+
+**Performance & Stability:**
+- **Database optimizations** - WAL mode and faster operations
+- **Hidden warnings** - Cleaner output without technical messages
+- **Better error recovery** - More robust handling of network issues
 
 ## Features 🚀
 
+- **QR Code & Phone Authentication** - Choose your preferred login method
 - Scrape messages from multiple Telegram channels
-- Download media files (photos, documents) with parallel processing
+- Download media files with parallel processing and unique naming
 - Real-time continuous scraping
 - Export data to JSON and CSV formats
 - SQLite database storage with optimized performance
 - Resume capability (saves progress)
-- Media reprocessing for failed downloads
-- Enhanced progress tracking with message counts
-- Interactive menu interface
+- Interactive menu with numbered channel selection
+- Progress tracking with visual progress bars
 
 ## Prerequisites 📋
 
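Review note on the hunk above: the README now advertises 5 concurrent media downloads, up from 3. A minimal sketch of how such a cap is commonly enforced with `asyncio.Semaphore` is shown here. The names `fake_download`, `download_all`, and `MAX_CONCURRENT_DOWNLOADS` are hypothetical and chosen for illustration; they are not taken from `telegram-scraper.py`, and the real script may enforce its limit differently.

```python
import asyncio

# Assumed to mirror the "5 concurrent downloads" figure in the README.
MAX_CONCURRENT_DOWNLOADS = 5


async def fake_download(message_id, stats):
    """Stand-in for a real media download; it only tracks concurrency."""
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)  # simulate network I/O
    stats["active"] -= 1
    return f"{message_id}.bin"


async def download_all(message_ids, limit=MAX_CONCURRENT_DOWNLOADS):
    """Run all downloads concurrently, with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}

    async def guarded(message_id):
        # The semaphore blocks here once `limit` downloads are running.
        async with semaphore:
            return await fake_download(message_id, stats)

    paths = await asyncio.gather(*(guarded(m) for m in message_ids))
    return paths, stats["peak"]


if __name__ == "__main__":
    paths, peak = asyncio.run(download_all(range(20)))
    print(len(paths), "files downloaded, peak concurrency", peak)
```

Raising the limit speeds up media-heavy channels only until Telegram's rate limiting becomes the bottleneck, so the value is a trade-off rather than a free gain.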
@@ -60,13 +61,6 @@ Before running the script, you'll need:
 pip install -r requirements.txt
 ```
 
-Contents of `requirements.txt`:
-```
-telethon
-aiohttp
-asyncio
-```
-
 ## Getting Telegram API Credentials 🔑
 
 1. Visit https://my.telegram.org/auth
@@ -88,7 +82,7 @@ Keep these credentials safe, you'll need them to run the script!
 
 1. Clone the repository:
 ```bash
-git clone https://github.com/robertaitch/telegram-scraper.git
+git clone https://github.com/unnohwn/telegram-scraper.git
 cd telegram-scraper
 ```
 
@@ -103,132 +97,86 @@ python telegram-scraper.py
 ```
 
 4. On first run, you'll be prompted to enter:
-- Your API ID
-- Your API Hash
-- Your phone number (with country code)
-- Your phone number (with country code) or bot, but use the phone number option when prompted second time.
-- Verification code (sent to your Telegram)
+- Your API ID (from my.telegram.org)
+- Your API Hash (from my.telegram.org)
+- **Choose authentication method:**
+- **QR Code** (Recommended) - Scan with your phone (no phone number needed)
+- **Phone Number** - Traditional SMS verification
 
-## Initial Scraping Behavior 🕒
-
-When scraping a channel for the first time, please note:
-
-- The script will attempt to retrieve the entire channel history, starting from the oldest messages
-- **Significantly faster than previous versions** due to batch processing and parallel downloads
-- Initial scraping time depends on:
-- The total number of messages in the channel
-- Whether media downloading is enabled
-- The size and number of media files
-- Your internet connection speed
-- Telegram's rate limiting
-- The script uses pagination and maintains state, so if interrupted, it can resume from where it left off
-- **Enhanced progress display** shows actual message counts (e.g., "1,500/10,000 messages")
-- Messages are stored in the database in batches for optimal performance
-- **Media downloads run in parallel** (up to 3 simultaneous downloads) for faster processing
-
 ## Usage 📝
 
-The script provides an interactive menu with the following options:
+The script provides a clean interactive menu:
 
-- **[A]** Add new channel
-- Enter the channel ID or channelname
-- **[R]** Remove channel
-- Remove a channel from scraping list
-- **[S]** Scrape all channels
-- One-time scraping of all configured channels
-- **[M]** Toggle media scraping
-- Enable/disable downloading of media files
-- **[C]** Continuous scraping
-- Real-time monitoring of channels for new messages
-- **[E]** Export data
-- Export to JSON and CSV formats (memory-efficient for large datasets)
-- **[V]** View saved channels
-- List all saved channels **with message counts**
-- **[L]** List account channels
-- List all channels with ID:s for account
-- **[Q]** Quit
+```
+========================================
+TELEGRAM SCRAPER
+========================================
+[S] Scrape channels
+[C] Continuous scraping
+[M] Media scraping: ON
+[L] List & add channels
+[R] Remove channels
+[E] Export data
+[T] Rescrape media
+[Q] Quit
+========================================
+```
 
-### Channel IDs 📢
+### Channel Selection Made Easy 🔢
 
-You can use either:
-- Channel username (e.g., `channelname`)
-- Channel ID (e.g., `-1001234567890`)
+Instead of typing long channel IDs, use numbers:
+
+**Adding Channels:**
+```
+[1] The News (Chat) (id: -1002116176890)
+[2] Python Channel (id: -1001597139842)
+[3] The Corner (id: -1002274713954)
+
+Enter: 1,3 (adds channels 1 and 3)
+```
+
+**Scraping Channels:**
+- Single: `1`
+- Multiple: `1,3,5`
+- All: `all`
+- Mix formats: `1,-1001597139842,3`
 
 ## Data Storage 💾
 
 ### Database Structure
 
-Data is stored in SQLite databases, one per channel with **optimized indexes**:
+Data is stored in SQLite databases, one per channel:
 - Location: `./channelname/channelname.db`
-- Table: `messages`
-- `id`: Primary key
-- `message_id`: Telegram message ID (indexed)
-- `date`: Message timestamp (indexed)
-- `sender_id`: Sender's Telegram ID
-- `first_name`: Sender's first name
-- `last_name`: Sender's last name
-- `username`: Sender's username
-- `message`: Message text
-- `media_type`: Type of media (if any)
-- `media_path`: Local path to downloaded media
-- `reply_to`: ID of replied message (if any)
+- Optimized with indexes for fast queries
+- WAL mode for better performance
 
 ### Media Storage 📁
 
-Media files are stored in:
+Media files are stored with unique naming:
 - Location: `./channelname/media/`
-- Files are named using message ID or original filename
-- **Parallel downloads** for faster media acquisition
+- Format: `{message_id}-{unique_id}-{original_name}.ext`
+- **No more file overwrites** - Each file gets a unique name
 
 ### Exported Data 📊
 
-Data can be exported in two formats with **memory-efficient processing**:
+Export formats:
 1. **CSV**: `./channelname/channelname.csv`
-- Human-readable spreadsheet format
-- Easy to import into Excel/Google Sheets
-- **Streaming export** handles large datasets
-
 2. **JSON**: `./channelname/channelname.json`
-- Structured data format
-- Ideal for programmatic processing
-- **Memory-optimized** for large files
 
-## Performance Tuning ⚙️
+## Performance Features ⚙️
 
-You can adjust these performance settings in the code:
-- `max_concurrent_downloads = 3`: Number of simultaneous media downloads
-- `batch_size = 100`: Number of messages processed in each batch
-- `state_save_interval = 50`: How often to save progress
-
-## Features in Detail 🔍
-
-### Continuous Scraping
-
-The continuous scraping feature (`[C]` option) allows you to:
-- Monitor channels in real-time
-- Automatically download new messages
-- Download media as it's posted with parallel processing
-- Run indefinitely until interrupted (Ctrl+C)
-- Maintains state between runs
-
-### Media Handling
-
-The script can download:
-- Photos
-- Documents
-- Other media types supported by Telegram
-- **Parallel downloads** for faster processing
-- **Improved retry mechanism** with exponential backoff
-- Skips existing files to avoid duplicates
+- **5 concurrent downloads** for faster media processing
+- **Batch database operations** for optimal speed
+- **Progress bars** with real-time feedback
+- **Resume capability** - Continue where you left off
+- **Memory-efficient** exports for large datasets
 
 ## Error Handling 🛠️
 
-The script includes:
-- **Enhanced retry mechanism** with exponential backoff for failed media downloads
-- State preservation in case of interruption
-- **Improved flood control** compliance
-- Comprehensive error logging for failed operations
-- **Better rate limit handling** with automatic waiting
+- Automatic retry with exponential backoff
+- Rate limit compliance
+- Network error recovery
+- State preservation during interruptions
 
 ## Limitations ⚠️
 
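Review note on the hunk above: the new "Channel Selection Made Easy" section accepts single numbers, comma-separated lists, `all`, and mixed input such as `1,-1001597139842,3`. A minimal sketch of how such input could be resolved to channel IDs is shown here. The function name `parse_selection` and its error behavior are assumptions made for illustration; the README does not show how `telegram-scraper.py` parses this input.

```python
def parse_selection(raw, channels):
    """Resolve input like '1,3', 'all', or '1,-1001597139842,3' to channel IDs.

    `channels` is the ordered list of IDs displayed to the user as [1], [2], ...
    """
    text = raw.strip().lower()
    if text == "all":
        return list(channels)

    selected = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas such as "1,,3"
        value = int(token)  # raises ValueError on non-numeric input
        if 1 <= value <= len(channels):
            channel_id = channels[value - 1]  # 1-based menu number
        elif value in channels:
            channel_id = value  # raw channel ID typed directly
        else:
            raise ValueError(f"unknown channel selection: {token}")
        if channel_id not in selected:
            selected.append(channel_id)  # keep order, drop duplicates
    return selected


if __name__ == "__main__":
    listed = [-1002116176890, -1001597139842, -1002274713954]
    print(parse_selection("1,3", listed))
```

Telegram channel IDs in this README are large negative numbers, so they cannot collide with the small positive menu numbers; that is what makes the mixed format unambiguous in this sketch.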
@@ -236,10 +184,6 @@ The script includes:
 - Can only access public channels or channels you're a member of
 - Media download size limits apply as per Telegram's restrictions
 
-## Contributing 🤝
-
-Contributions are welcome! Please feel free to submit a Pull Request.
-
 ## License 📄
 
 This project is licensed under the MIT License - see the LICENSE file for details.
@@ -250,4 +194,4 @@ This tool is for educational purposes only. Make sure to:
 - Respect Telegram's Terms of Service
 - Obtain necessary permissions before scraping
 - Use responsibly and ethically
 - Comply with data protection regulations