diff --git a/README.md b/README.md
index fccb433..353f7fa 100644
--- a/README.md
+++ b/README.md
@@ -11,40 +11,41 @@ ___________________ _________
  \/ \/
 ```
 
-## What's New in v2.0 🎉
+## What's New in v3.0 🎉
 
-**Major Performance Improvements:**
-- **5-10x faster scraping** with batch database operations
-- **3x faster media downloads** with parallel processing (up to 3 concurrent downloads)
-- **10-20x faster database operations** through connection pooling and batch insertions
-- **Memory-efficient exports** that handle large datasets without running out of memory
-- **Enhanced progress reporting** with actual message counts and percentages
+**QR Code Authentication:**
+- **No phone number required** - Login with QR code scanning (still need API credentials)
+- **Faster authentication** - Just scan with your phone after API setup
+- **Secure login** - Recommended authentication method
+- **2FA support** for both QR and phone methods
 
-**New Features:**
-- **Message count display** in channel view
-- **Configurable download concurrency** (adjustable in code)
-- **Better error handling** with exponential backoff retry mechanism
-- **Optimized database structure** with indexes for faster queries
-- **Object-oriented design** for better code maintainability
+**Enhanced User Experience:**
+- **Numbered channel selection** - Use 1,2,3 instead of full channel IDs
+- **Multi-channel operations** - Add, remove, and scrape multiple channels at once
+- **Streamlined menu** - Cleaner interface with fewer redundant options
+- **Progress bars** for media downloads with visual feedback
 
-**Technical Improvements:**
-- Database connection pooling
-- Batch message insertions (100 messages per batch)
-- Streaming exports for large datasets
-- Improved flood control handling
-- Periodic state saving (every 50 messages)
+**Media Download Improvements:**
+- **Fixed file overwriting** - Unique naming prevents media files from being overwritten
+- **5x concurrent downloads** - Increased from 3 to 5 for faster media processing
+- **Better error handling** - Improved retry logic and recovery
+
+**Performance & Stability:**
+- **Database optimizations** - WAL mode and faster operations
+- **Hidden warnings** - Cleaner output without technical messages
+- **Better error recovery** - More robust handling of network issues
 
 ## Features 🚀
 
+- **QR Code & Phone Authentication** - Choose your preferred login method
 - Scrape messages from multiple Telegram channels
-- Download media files (photos, documents) with parallel processing
+- Download media files with parallel processing and unique naming
 - Real-time continuous scraping
 - Export data to JSON and CSV formats
 - SQLite database storage with optimized performance
 - Resume capability (saves progress)
-- Media reprocessing for failed downloads
-- Enhanced progress tracking with message counts
-- Interactive menu interface
+- Interactive menu with numbered channel selection
+- Progress tracking with visual progress bars
 
 ## Prerequisites 📋
 
@@ -60,13 +61,6 @@ Before running the script, you'll need:
 pip install -r requirements.txt
 ```
 
-Contents of `requirements.txt`:
-```
-telethon
-aiohttp
-asyncio
-```
-
 ## Getting Telegram API Credentials 🔑
 
 1. Visit https://my.telegram.org/auth
@@ -88,7 +82,7 @@ Keep these credentials safe, you'll need them to run the script!
 
 1. Clone the repository:
 ```bash
-git clone https://github.com/robertaitch/telegram-scraper.git
+git clone https://github.com/unnohwn/telegram-scraper.git
 cd telegram-scraper
 ```
 
@@ -103,132 +97,86 @@ python telegram-scraper.py
 ```
 
 4. On first run, you'll be prompted to enter:
-   - Your API ID
-   - Your API Hash
-   - Your phone number (with country code)
-   - Your phone number (with country code) or bot, but use the phone number option when prompted second time.
-   - Verification code (sent to your Telegram)
-
-## Initial Scraping Behavior 🕒
-
-When scraping a channel for the first time, please note:
-
-- The script will attempt to retrieve the entire channel history, starting from the oldest messages
-- **Significantly faster than previous versions** due to batch processing and parallel downloads
-- Initial scraping time depends on:
-  - The total number of messages in the channel
-  - Whether media downloading is enabled
-  - The size and number of media files
-  - Your internet connection speed
-  - Telegram's rate limiting
-- The script uses pagination and maintains state, so if interrupted, it can resume from where it left off
-- **Enhanced progress display** shows actual message counts (e.g., "1,500/10,000 messages")
-- Messages are stored in the database in batches for optimal performance
-- **Media downloads run in parallel** (up to 3 simultaneous downloads) for faster processing
+   - Your API ID (from my.telegram.org)
+   - Your API Hash (from my.telegram.org)
+   - **Choose authentication method:**
+     - **QR Code** (Recommended) - Scan with your phone (no phone number needed)
+     - **Phone Number** - Traditional SMS verification
 
 ## Usage 📝
 
-The script provides an interactive menu with the following options:
+The script provides a clean interactive menu:
 
-- **[A]** Add new channel
-  - Enter the channel ID or channelname
-- **[R]** Remove channel
-  - Remove a channel from scraping list
-- **[S]** Scrape all channels
-  - One-time scraping of all configured channels
-- **[M]** Toggle media scraping
-  - Enable/disable downloading of media files
-- **[C]** Continuous scraping
-  - Real-time monitoring of channels for new messages
-- **[E]** Export data
-  - Export to JSON and CSV formats (memory-efficient for large datasets)
-- **[V]** View saved channels
-  - List all saved channels **with message counts**
-- **[L]** List account channels
-  - List all channels with ID:s for account
-- **[Q]** Quit
+```
+========================================
+            TELEGRAM SCRAPER
+========================================
+[S] Scrape channels
+[C] Continuous scraping
+[M] Media scraping: ON
+[L] List & add channels
+[R] Remove channels
+[E] Export data
+[T] Rescrape media
+[Q] Quit
+========================================
+```
 
-### Channel IDs 📢
+### Channel Selection Made Easy 🔢
 
-You can use either:
-- Channel username (e.g., `channelname`)
-- Channel ID (e.g., `-1001234567890`)
+Instead of typing long channel IDs, use numbers:
+
+**Adding Channels:**
+```
+[1] The News (Chat) (id: -1002116176890)
+[2] Python Channel (id: -1001597139842)
+[3] The Corner (id: -1002274713954)
+
+Enter: 1,3 (adds channels 1 and 3)
+```
+
+**Scraping Channels:**
+- Single: `1`
+- Multiple: `1,3,5`
+- All: `all`
+- Mix formats: `1,-1001597139842,3`
 
 ## Data Storage 💾
 
 ### Database Structure
 
-Data is stored in SQLite databases, one per channel with **optimized indexes**:
+Data is stored in SQLite databases, one per channel:
 - Location: `./channelname/channelname.db`
-- Table: `messages`
-  - `id`: Primary key
-  - `message_id`: Telegram message ID (indexed)
-  - `date`: Message timestamp (indexed)
-  - `sender_id`: Sender's Telegram ID
-  - `first_name`: Sender's first name
-  - `last_name`: Sender's last name
-  - `username`: Sender's username
-  - `message`: Message text
-  - `media_type`: Type of media (if any)
-  - `media_path`: Local path to downloaded media
-  - `reply_to`: ID of replied message (if any)
+- Optimized with indexes for fast queries
+- WAL mode for better performance
 
 ### Media Storage 📁
 
-Media files are stored in:
+Media files are stored with unique naming:
 - Location: `./channelname/media/`
-- Files are named using message ID or original filename
-- **Parallel downloads** for faster media acquisition
+- Format: `{message_id}-{unique_id}-{original_name}.ext`
+- **No more file overwrites** - Each file gets a unique name
 
 ### Exported Data 📊
 
-Data can be exported in two formats with **memory-efficient processing**:
+Export formats:
 
 1. **CSV**: `./channelname/channelname.csv`
-   - Human-readable spreadsheet format
-   - Easy to import into Excel/Google Sheets
-   - **Streaming export** handles large datasets
-
 2. **JSON**: `./channelname/channelname.json`
-   - Structured data format
-   - Ideal for programmatic processing
-   - **Memory-optimized** for large files
 
-## Performance Tuning ⚙️
+## Performance Features ⚙️
 
-You can adjust these performance settings in the code:
-- `max_concurrent_downloads = 3`: Number of simultaneous media downloads
-- `batch_size = 100`: Number of messages processed in each batch
-- `state_save_interval = 50`: How often to save progress
-
-## Features in Detail 🔍
-
-### Continuous Scraping
-
-The continuous scraping feature (`[C]` option) allows you to:
-- Monitor channels in real-time
-- Automatically download new messages
-- Download media as it's posted with parallel processing
-- Run indefinitely until interrupted (Ctrl+C)
-- Maintains state between runs
-
-### Media Handling
-
-The script can download:
-- Photos
-- Documents
-- Other media types supported by Telegram
-- **Parallel downloads** for faster processing
-- **Improved retry mechanism** with exponential backoff
-- Skips existing files to avoid duplicates
+- **5 concurrent downloads** for faster media processing
+- **Batch database operations** for optimal speed
+- **Progress bars** with real-time feedback
+- **Resume capability** - Continue where you left off
+- **Memory-efficient** exports for large datasets
 
 ## Error Handling 🛠️
 
-The script includes:
-- **Enhanced retry mechanism** with exponential backoff for failed media downloads
-- State preservation in case of interruption
-- **Improved flood control** compliance
-- Comprehensive error logging for failed operations
-- **Better rate limit handling** with automatic waiting
+- Automatic retry with exponential backoff
+- Rate limit compliance
+- Network error recovery
+- State preservation during interruptions
 
 ## Limitations ⚠️
@@ -236,10 +184,6 @@ The script includes:
 - Can only access public channels or channels you're a member of
 - Media download size limits apply as per Telegram's restrictions
 
-## Contributing 🤝
-
-Contributions are welcome! Please feel free to submit a Pull Request.
-
 ## License 📄
 
 This project is licensed under the MIT License - see the LICENSE file for details.
@@ -250,4 +194,4 @@ This tool is for educational purposes only. Make sure to:
 - Respect Telegram's Terms of Service
 - Obtain necessary permissions before scraping
 - Use responsibly and ethically
-- Comply with data protection regulations
\ No newline at end of file
+- Comply with data protection regulations
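
Reviewer note on the "Channel Selection Made Easy" section added by this diff: the documented input forms (`1`, `1,3,5`, `all`, and mixed `1,-1001597139842,3`) could be resolved roughly as sketched below. This is a minimal illustration and not the script's actual code; `parse_selection` and its `channels` argument are hypothetical names, and treating any token that starts with `-` as a full channel ID is an assumption.

```python
def parse_selection(text, channels):
    """Resolve a selection string into a list of channel IDs.

    Hypothetical helper: `channels` is the ordered list of channel IDs
    exactly as they appear in the numbered menu ([1], [2], ...).
    """
    text = text.strip()
    if text.lower() == "all":
        return list(channels)

    selected = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas such as "1,,3"
        if token.startswith("-"):
            # Assumption: a leading "-" marks a full channel ID, used as-is.
            channel_id = token
        elif token.isdigit() and 1 <= int(token) <= len(channels):
            # Menu numbers are 1-based; list indexes are 0-based.
            channel_id = channels[int(token) - 1]
        else:
            raise ValueError(f"Unrecognized selection: {token!r}")
        if channel_id not in selected:
            selected.append(channel_id)  # keep order, drop duplicates
    return selected
```

With the three channels from the README example, `parse_selection("1,3", channels)` would pick the first and third IDs, and an out-of-range number such as `9` raises `ValueError` instead of being silently ignored.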