# Telegram Channel Scraper 📱

A powerful Python script that allows you to scrape messages and media from Telegram channels using the Telethon library. Features include real-time continuous scraping, media downloading, and data export capabilities.

```
___________________  _________
\__    ___/  _____/ /   _____/
  |    | /   \  ___ \_____  \
  |    | \    \_\  \/        \
  |____|  \______  /_______  /
                 \/        \/
```

## What's New in v2.0 🎉

**Major Performance Improvements:**
- **5-10x faster scraping** with batch database operations
- **3x faster media downloads** with parallel processing (up to 3 concurrent downloads)
- **10-20x faster database operations** through connection pooling and batch insertions
- **Memory-efficient exports** that handle large datasets without running out of memory
- **Enhanced progress reporting** with actual message counts and percentages

**New Features:**
- **Message count display** in channel view
- **Configurable download concurrency** (adjustable in code)
- **Better error handling** with exponential backoff retry mechanism
- **Optimized database structure** with indexes for faster queries
- **Object-oriented design** for better code maintainability

**Technical Improvements:**
- Database connection pooling
- Batch message insertions (100 messages per batch)
- Streaming exports for large datasets
- Improved flood control handling
- Periodic state saving (every 50 messages)
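
As an illustration of the batch insertion idea, here is a minimal sketch using the standard `sqlite3` module. It is not the script's actual code: the `insert_in_batches` helper and the two-column table are hypothetical and exist only to show how one commit per batch replaces one commit per message.

```python
import sqlite3

BATCH_SIZE = 100  # mirrors the batch size quoted above


def insert_in_batches(conn, rows, batch_size=BATCH_SIZE):
    """Insert (message_id, message) rows, committing once per batch.

    Hypothetical helper for illustration; returns the number of batches written.
    """
    batches = 0
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        conn.executemany(
            "INSERT INTO messages (message_id, message) VALUES (?, ?)", chunk
        )
        conn.commit()  # one commit per batch instead of one per row
        batches += 1
    return batches


# Demo against an in-memory database with a simplified table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (message_id INTEGER, message TEXT)")
rows = [(i, f"message {i}") for i in range(250)]
batch_count = insert_in_batches(conn, rows)
stored = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
print(batch_count, stored)
```

With 250 rows and a batch size of 100, the rows are written in three batches.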

## Features 🚀

- Scrape messages from multiple Telegram channels
- Download media files (photos, documents) with parallel processing
- Real-time continuous scraping
- Export data to JSON and CSV formats
- SQLite database storage with optimized performance
- Resume capability (saves progress)
- Media reprocessing for failed downloads
- Enhanced progress tracking with message counts
- Interactive menu interface

## Prerequisites 📋

Before running the script, you'll need:

- Python 3.7 or higher
- Telegram account
- API credentials from Telegram

### Required Python packages

```
pip install -r requirements.txt
```

Contents of `requirements.txt`:
```
telethon
aiohttp
asyncio
```

## Getting Telegram API Credentials 🔑

1. Visit https://my.telegram.org/auth
2. Log in with your phone number
3. Click on "API development tools"
4. Fill in the form:
   - App title: Your app name
   - Short name: Your app short name
   - Platform: Can be left as "Desktop"
   - Description: Brief description of your app
5. Click "Create application"
6. You'll receive:
   - `api_id`: A number
   - `api_hash`: A string of letters and numbers

Keep these credentials safe; you'll need them to run the script!

## Setup and Running 🔧

1. Clone the repository:
```bash
git clone https://github.com/robertaitch/telegram-scraper.git
cd telegram-scraper
```

2. Install requirements:
```bash
pip install -r requirements.txt
```

3. Run the script:
```bash
python telegram-scraper.py
```

4. On first run, you'll be prompted to enter:
   - Your API ID
   - Your API Hash
   - Your phone number (with country code)
   - Your phone number again when prompted a second time for a phone number or bot token (enter the phone number, not a bot token)
   - Verification code (sent to your Telegram)

## Initial Scraping Behavior 🕒

When scraping a channel for the first time, please note:

- The script will attempt to retrieve the entire channel history, starting from the oldest messages
- **Significantly faster than previous versions** due to batch processing and parallel downloads
- Initial scraping time depends on:
  - The total number of messages in the channel
  - Whether media downloading is enabled
  - The size and number of media files
  - Your internet connection speed
  - Telegram's rate limiting
- The script uses pagination and maintains state, so if interrupted, it can resume from where it left off
- **Enhanced progress display** shows actual message counts (e.g., "1,500/10,000 messages")
- Messages are stored in the database in batches for optimal performance
- **Media downloads run in parallel** (up to 3 simultaneous downloads) for faster processing
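
The resume and progress behavior described above can be sketched roughly as follows. This is an assumption-laden illustration, not the script's implementation: the JSON state file, the `last_message_id` key, and the helper names are all hypothetical.

```python
import json
import os
import tempfile

STATE_SAVE_INTERVAL = 50  # mirrors "every 50 messages" from the notes above


def format_progress(done, total):
    """Render progress such as '1,500/10,000 messages (15.0%)'."""
    if total <= 0:
        return f"{done:,} messages"
    return f"{done:,}/{total:,} messages ({done / total:.1%})"


def save_state(path, last_message_id):
    """Persist the last processed message ID (hypothetical state format)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"last_message_id": last_message_id}, fh)


def load_state(path):
    """Return the saved message ID, or 0 when no state file exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path, encoding="utf-8") as fh:
        return json.load(fh).get("last_message_id", 0)


# Simulate a run that is interrupted after message 120.
state_path = os.path.join(tempfile.mkdtemp(), "state.json")
for message_id in range(1, 121):
    if message_id % STATE_SAVE_INTERVAL == 0:
        save_state(state_path, message_id)

resume_from = load_state(state_path)
progress = format_progress(1500, 10000)
print(resume_from, progress)
```

In this simulation the last checkpoint was written at message 100, so a restart would resume from there and re-fetch at most one interval of messages.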

## Usage 📝

The script provides an interactive menu with the following options:

- **[A]** Add new channel
  - Enter the channel ID or channel username
- **[R]** Remove channel
  - Remove a channel from scraping list
- **[S]** Scrape all channels
  - One-time scraping of all configured channels
- **[M]** Toggle media scraping
  - Enable/disable downloading of media files
- **[C]** Continuous scraping
  - Real-time monitoring of channels for new messages
- **[E]** Export data
  - Export to JSON and CSV formats (memory-efficient for large datasets)
- **[V]** View saved channels
  - List all saved channels **with message counts**
- **[L]** List account channels
  - List all channels the account belongs to, with their IDs
- **[Q]** Quit

### Channel IDs 📢

You can use either:
- Channel username (e.g., `channelname`)
- Channel ID (e.g., `-1001234567890`)
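
One way to tell the two input forms apart is to try parsing the input as an integer first. The sketch below assumes this approach; `parse_channel_identifier` is a hypothetical helper, and the script may handle input differently.

```python
def parse_channel_identifier(raw):
    """Return an int for numeric channel IDs, otherwise a cleaned username."""
    text = raw.strip()
    try:
        return int(text)  # e.g. "-1001234567890"
    except ValueError:
        return text.lstrip("@")  # accept "@channelname" as well as "channelname"


numeric = parse_channel_identifier("-1001234567890")
named = parse_channel_identifier(" @channelname ")
print(numeric, named)
```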

## Data Storage 💾

### Database Structure

Data is stored in SQLite databases, one per channel, with **optimized indexes**:
- Location: `./channelname/channelname.db`
- Table: `messages`
  - `id`: Primary key
  - `message_id`: Telegram message ID (indexed)
  - `date`: Message timestamp (indexed)
  - `sender_id`: Sender's Telegram ID
  - `first_name`: Sender's first name
  - `last_name`: Sender's last name
  - `username`: Sender's username
  - `message`: Message text
  - `media_type`: Type of media (if any)
  - `media_path`: Local path to downloaded media
  - `reply_to`: ID of replied message (if any)
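
For orientation, a table with these columns could be created as sketched below. The column names come from the list above, but the column types and the index names (`idx_messages_message_id`, `idx_messages_date`) are assumptions; check the script for the actual schema.

```python
import sqlite3

# Column types and index names are assumptions for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    message_id INTEGER,
    date TEXT,
    sender_id INTEGER,
    first_name TEXT,
    last_name TEXT,
    username TEXT,
    message TEXT,
    media_type TEXT,
    media_path TEXT,
    reply_to INTEGER
);
CREATE INDEX IF NOT EXISTS idx_messages_message_id ON messages (message_id);
CREATE INDEX IF NOT EXISTS idx_messages_date ON messages (date);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Inspect the result: column names and the indexes attached to the table.
columns = [row[1] for row in conn.execute("PRAGMA table_info(messages)")]
indexes = sorted(row[1] for row in conn.execute("PRAGMA index_list(messages)"))
print(columns)
print(indexes)
```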

### Media Storage 📁

Media files are stored in:
- Location: `./channelname/media/`
- Files are named using message ID or original filename
- **Parallel downloads** for faster media acquisition

### Exported Data 📊

Data can be exported in two formats with **memory-efficient processing**:
1. **CSV**: `./channelname/channelname.csv`
   - Human-readable spreadsheet format
   - Easy to import into Excel/Google Sheets
   - **Streaming export** handles large datasets

2. **JSON**: `./channelname/channelname.json`
   - Structured data format
   - Ideal for programmatic processing
   - **Memory-optimized** for large files
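
A streaming export typically iterates over the database cursor and writes one row at a time, so the full result set never sits in memory. The following is a minimal sketch of that technique for CSV, assuming a simplified three-column table; `export_csv` is a hypothetical helper, not the script's function.

```python
import csv
import io
import sqlite3


def export_csv(conn, out):
    """Stream rows from the messages table to a file-like object."""
    cursor = conn.execute(
        "SELECT message_id, date, message FROM messages ORDER BY message_id"
    )
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cursor.description])  # header row
    count = 0
    for row in cursor:  # rows are fetched lazily, not loaded all at once
        writer.writerow(row)
        count += 1
    return count


# Demo with an in-memory database and an in-memory output buffer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (message_id INTEGER, date TEXT, message TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?, ?)",
    [(1, "2024-01-01", "hello"), (2, "2024-01-02", "a, b"), (3, "2024-01-03", "bye")],
)
buffer = io.StringIO()
exported = export_csv(conn, buffer)
lines = buffer.getvalue().splitlines()
print(exported, lines[0])
```

To write to disk instead, pass a file opened with `open(path, "w", newline="", encoding="utf-8")` in place of the buffer.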

## Performance Tuning ⚙️

You can adjust these performance settings in the code:
- `max_concurrent_downloads = 3`: Number of simultaneous media downloads
- `batch_size = 100`: Number of messages processed in each batch
- `state_save_interval = 50`: How often to save progress
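
To show what a concurrency limit like `max_concurrent_downloads` does in practice, here is a small sketch that caps parallel work with `asyncio.Semaphore`. The download itself is simulated with a short sleep, and `download_all` is a hypothetical function; the script's own mechanism may differ.

```python
import asyncio

MAX_CONCURRENT_DOWNLOADS = 3  # mirrors the default listed above


async def download_all(items, limit=MAX_CONCURRENT_DOWNLOADS):
    """Run simulated downloads with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def fake_download(item):
        nonlocal active, peak
        async with semaphore:  # waits here while `limit` downloads are running
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stands in for the real network transfer
            active -= 1
            return f"{item}.jpg"

    results = await asyncio.gather(*(fake_download(i) for i in items))
    return results, peak


results, peak = asyncio.run(download_all(range(10)))
print(len(results), peak)
```

Raising the limit can speed up media-heavy channels, but it also raises the chance of hitting Telegram's rate limits.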

## Features in Detail 🔍

### Continuous Scraping

The continuous scraping feature (`[C]` option) allows you to:
- Monitor channels in real-time
- Automatically download new messages
- Download media as it's posted with parallel processing
- Run indefinitely until interrupted (Ctrl+C)
- Maintain state between runs

### Media Handling

The script can download:
- Photos
- Documents
- Other media types supported by Telegram

Media downloads also include:
- **Parallel downloads** for faster processing
- **Improved retry mechanism** with exponential backoff
- Skipping of existing files to avoid duplicates

## Error Handling 🛠️

The script includes:
- **Enhanced retry mechanism** with exponential backoff for failed media downloads
- State preservation in case of interruption
- **Improved flood control** compliance
- Comprehensive error logging for failed operations
- **Better rate limit handling** with automatic waiting
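
Exponential backoff means doubling the wait after each failed attempt. The sketch below illustrates the general pattern under stated assumptions: `retry_with_backoff`, the attempt count, the base delay, and the use of `ConnectionError` as the retryable error are all hypothetical choices, not the script's actual settings.

```python
import asyncio


async def retry_with_backoff(operation, max_attempts=4, base_delay=0.01):
    """Retry an async operation, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return await operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt)


calls = {"count": 0}


async def flaky_download():
    """Simulated download that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated failure")
    return "media/42.jpg"


result = asyncio.run(retry_with_backoff(flaky_download))
print(result, calls["count"])
```

A real scraper would use a much larger base delay (seconds rather than hundredths of a second); the tiny value here only keeps the demo fast.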

## Limitations ⚠️

- Scraping speed is bounded by Telegram's rate limits, which the script respects
- Can only access public channels or channels you're a member of
- Media download size limits apply as per Telegram's restrictions

## Contributing 🤝

Contributions are welcome! Please feel free to submit a Pull Request.

## License 📄

This project is licensed under the MIT License - see the LICENSE file for details.

## Disclaimer ⚖️

This tool is for educational purposes only. Make sure to:
- Respect Telegram's Terms of Service
- Obtain necessary permissions before scraping
- Use responsibly and ethically
- Comply with data protection regulations