
# Telegram Channel Scraper 📱

> **⚠️ DISCONTINUED**
>
> This project is no longer maintained. Following strong support and interest from the community, a far more capable successor has been released:
>
> **➜ [Harrier — Telegram Scraping & Intelligence Platform](https://github.com/skuggrev/harrier)**
>
> Harrier includes everything this tool offered and more: a web UI, real-time progress, user lookup, webhook alerts, continuous scraping, and a proper export system. I recommend switching over.
>
> A huge thank you to everyone who used, starred, and supported this project.

---

A Python script that scrapes messages and media from Telegram channels using the Telethon library. Features include continuous real-time scraping, media downloading, and export to CSV and JSON.

```
___________________  _________
\__    ___/  _____/ /   _____/
  |    | /   \  ___ \_____  \
  |    | \    \_\  \/        \
  |____|  \______  /_______  /
                 \/        \/
```

## What's New in v3.1 🎉

**Enhanced Message Data:**

- **Message statistics** - Captures views, forwards, and post_author for each message
- **Reactions support** - Records all emoji reactions with counts (e.g., "😀 12 👍 3")
- **Automatic database migration** - Seamlessly adds new columns to existing databases
- **Richer exports** - All new data included in CSV/JSON exports
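
The reactions field is described as a compact string of emoji and counts. Below is a minimal sketch of how such a string could be built, assuming the reactions arrive as `(emoji, count)` pairs. The helper name `format_reactions` is hypothetical and is not taken from the script itself.

```python
from typing import Iterable, Tuple


def format_reactions(reactions: Iterable[Tuple[str, int]]) -> str:
    """Join (emoji, count) pairs into one string, e.g. '😀 12 👍 3'."""
    # Hypothetical helper: reactions with a zero count are skipped.
    return " ".join(f"{emoji} {count}" for emoji, count in reactions if count > 0)


print(format_reactions([("😀", 12), ("👍", 3)]))
```

Storing the summary as a single text column keeps the CSV export flat, at the cost of needing to parse the string again for per-emoji analysis.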

**Improved Channel Management:**

- **Channel names displayed** - Shows channel names alongside IDs everywhere
- **Smart filtering** - List option now only shows Channels and Groups (no private chats)
- **channels_list.csv export** - Automatically saves channel list with names, IDs, usernames, and types
- **"all" selection** - Quickly add all listed channels at once
- **Better export naming** - Files now named as `ID_username.csv` and `ID_username.json`

**Bug Fixes:**

- **Fixed channel ID parsing** - Resolved the "invalid literal for int()" error in the fix-missing-media option
- **Better entity resolution** - Handles both numeric IDs and channel usernames
- **Improved error messages** - Shows channel names with IDs for clearer debugging
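
To illustrate the kind of parsing that avoids an "invalid literal for int()" error, here is a small sketch that accepts either a numeric ID or a username. The function name `parse_channel_ref` is an assumption made for this example; the script's actual implementation may differ.

```python
from typing import Union


def parse_channel_ref(raw: str) -> Union[int, str]:
    """Return an int for numeric IDs, otherwise the username without '@'."""
    text = raw.strip()
    try:
        return int(text)  # e.g. "-1002116176890"
    except ValueError:
        # Not numeric, so treat it as a username such as "@technews".
        return text.lstrip("@")


print(parse_channel_ref("-1002116176890"))
print(parse_channel_ref("@technews"))
```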

## Features 🚀

- **QR Code & Phone Authentication** - Choose your preferred login method
- Scrape messages with full metadata (views, forwards, reactions, post author)
- Download media files with parallel processing and unique naming
- Real-time continuous scraping
- Export data to JSON and CSV formats with enhanced metadata
- SQLite database storage with automatic schema migration
- Resume capability (saves progress)
- Interactive menu with channel names and numbered selection
- Smart channel filtering (only shows channels/groups)
- Progress tracking with visual progress bars
- Automatic channels list export to CSV

## Prerequisites 📋

Before running the script, you'll need:

- Python 3.7 or higher
- Telegram account
- API credentials from Telegram

### Required Python packages

```bash
pip install -r requirements.txt
```

## Getting Telegram API Credentials 🔑

1. Visit https://my.telegram.org/auth
2. Log in with your phone number
3. Click on "API development tools"
4. Fill in the form:
   - App title: Your app name
   - Short name: Your app short name
   - Platform: Can be left as "Desktop"
   - Description: Brief description of your app
5. Click "Create application"
6. You'll receive:
   - `api_id`: A number
   - `api_hash`: A string of letters and numbers

Keep these credentials safe. You'll need them to run the script!

## Setup and Running 🔧

1. Clone the repository:

```bash
git clone https://github.com/unnohwn/telegram-scraper.git
cd telegram-scraper
```

2. Install requirements:

```bash
pip install -r requirements.txt
```

3. Run the script:

```bash
python telegram-scraper.py
```

4. On first run, you'll be prompted to enter:
   - Your API ID (from my.telegram.org)
   - Your API Hash (from my.telegram.org)
   - **Choose authentication method:**
     - **QR Code** (Recommended) - Scan with your phone (no phone number needed)
     - **Phone Number** - Traditional SMS verification

## Web Console (MVP) 🌐

You can run a simple web control panel that manages `.env` configuration and starts/stops the scraper process:

```bash
pip install -r requirements.txt
uvicorn app_web:app --host 0.0.0.0 --port 8000 --reload
```

Then open:

```text
http://127.0.0.1:8000
```

Features:
- Edit core config values from the web page (saved back to `.env`)
- Start / stop scraper process from browser
- View recent runtime logs
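
Saving a value back to `.env` can be as simple as rewriting one `KEY=value` line. The sketch below shows one way this might work on the file's text. The function name `set_env_value` and the key `API_ID` are illustrative assumptions, not names confirmed by the web console code.

```python
def set_env_value(env_text: str, key: str, value: str) -> str:
    """Return env_text with KEY=value replaced, or appended if missing."""
    lines = env_text.splitlines()
    replaced = False
    for index, line in enumerate(lines):
        is_comment = line.lstrip().startswith("#")
        name = line.split("=", 1)[0].strip()
        if not is_comment and name == key:
            lines[index] = f"{key}={value}"
            replaced = True
    if not replaced:
        lines.append(f"{key}={value}")
    return "\n".join(lines) + "\n"


# "API_ID" is a hypothetical key name used only for this example.
print(set_env_value("API_ID=1\n# note\n", "API_ID", "2"))
```

Comments and unrelated keys are left untouched, so hand-edited files survive a save from the browser.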

## Usage 📝

The script provides a clean interactive menu:

```
========================================
           TELEGRAM SCRAPER
========================================
[S] Scrape channels
[C] Continuous scraping
[M] Media scraping: ON
[L] List & add channels
[R] Remove channels
[E] Export data
[T] Rescrape media
[Q] Quit
========================================
```

### Channel Selection Made Easy 🔢

Instead of typing long channel IDs, use numbers:

**Adding Channels:**

```
[1] Tech News (ID: -1002116176890, Type: Channel, Username: @technews)
[2] Python Dev (ID: -1001597139842, Type: Group, Username: @pythondev)
[3] Daily Updates (ID: -1002274713954, Type: Channel, Username: @dailyupdates)

Enter: 1,3 (adds channels 1 and 3)
Or: all (adds all listed channels)
```

**Viewing Your Channels:**

```
[1] Tech News (ID: -1002116176890), Last Message ID: 5234, Messages: 12450
[2] Python Dev (ID: -1001597139842), Last Message ID: 8192, Messages: 45782
```

**Scraping Channels:**

- Single: `1`
- Multiple: `1,3,5`
- All: `all`
- Mix formats: `1,-1001597139842,3`
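
The selection formats above could be resolved along the following lines. This is a sketch under stated assumptions: small numbers are treated as menu positions, anything else as a raw channel ID, and the function name `parse_selection` is hypothetical.

```python
from typing import List


def parse_selection(text: str, channel_ids: List[int]) -> List[int]:
    """Resolve input such as '1,3', 'all', or '1,-1001597139842,3' to channel IDs."""
    text = text.strip().lower()
    if text == "all":
        return list(channel_ids)
    selected: List[int] = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue
        value = int(token)  # raises ValueError on non-numeric input
        if 1 <= value <= len(channel_ids):
            channel_id = channel_ids[value - 1]  # position shown in the menu
        else:
            channel_id = value  # assumed to be a raw channel ID
        if channel_id not in selected:
            selected.append(channel_id)
    return selected


ids = [-1002116176890, -1001597139842, -1002274713954]
print(parse_selection("1,-1001597139842,3", ids))
```

Because Telegram channel IDs are large negative numbers, they cannot collide with the small positive menu positions, which is what makes mixing the two formats unambiguous.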

## Data Storage 💾

### Database Structure

Data is stored in SQLite databases, one per channel:

- Location: `./channelname/channelname.db`
- Optimized with indexes for fast queries
- WAL mode for better performance
- Schema includes: message_id, date, sender info, message text, media info, reply_to, post_author, views, forwards, reactions
- Automatic migration adds new columns to existing databases
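
A minimal sketch of the WAL, index, and migration steps, using only the standard `sqlite3` module. The table name `messages`, the column names and types, the index name, and the function `ensure_schema` are all assumptions based on the schema summary above, not the script's actual definitions.

```python
import sqlite3

# Assumed base table; names and types are illustrative.
BASE_SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    message_id INTEGER PRIMARY KEY,
    date TEXT,
    sender_id INTEGER,
    message TEXT,
    media_path TEXT,
    reply_to INTEGER
)
"""

# Columns assumed to have been introduced by a later version.
NEW_COLUMNS = {
    "post_author": "TEXT",
    "views": "INTEGER",
    "forwards": "INTEGER",
    "reactions": "TEXT",
}


def ensure_schema(conn: sqlite3.Connection) -> None:
    """Create the table if needed and add any columns it is missing."""
    conn.execute("PRAGMA journal_mode=WAL")  # write-ahead logging
    conn.execute(BASE_SCHEMA)
    existing = {row[1] for row in conn.execute("PRAGMA table_info(messages)")}
    for name, col_type in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE messages ADD COLUMN {name} {col_type}")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_messages_date ON messages(date)")
    conn.commit()


conn = sqlite3.connect(":memory:")
ensure_schema(conn)
print([row[1] for row in conn.execute("PRAGMA table_info(messages)")])
```

Checking `PRAGMA table_info` before each `ALTER TABLE` makes the migration safe to run on every start, whether the database is new or was created by an older version.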

### Media Storage 📁

Media files are stored with unique naming:

- Location: `./channelname/media/`
- Format: `{message_id}-{unique_id}-{original_name}.ext`
- **No more file overwrites** - Each file gets a unique name
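
The naming format above could be produced as follows. This sketch assumes the unique part is a short random token from `uuid`; the script may derive it differently, and `media_filename` is a hypothetical name.

```python
import uuid
from pathlib import Path


def media_filename(message_id: int, original_name: str) -> str:
    """Build '{message_id}-{unique_id}-{original_name}.ext'."""
    name = Path(original_name).name  # drop any directory components
    stem, suffix = Path(name).stem, Path(name).suffix
    unique_id = uuid.uuid4().hex[:8]  # assumption: short random token
    return f"{message_id}-{unique_id}-{stem}{suffix}"


print(media_filename(5234, "photo.jpg"))
```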

### Exported Data 📊

Export formats:

1. **CSV**: `./channelname/channelid_username.csv`
2. **JSON**: `./channelname/channelid_username.json`
3. **Channel List**: `./channels_list.csv` (automatically created when using [L] option)

All exports include complete message metadata: views, forwards, reactions, and post author information.
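
As an illustration of how a CSV export can include every column without loading the whole table into memory, the sketch below streams rows from SQLite straight into a `csv.writer`. The table name `messages` and the function name `export_csv` are assumptions made for this example.

```python
import csv
import sqlite3
from typing import TextIO


def export_csv(conn: sqlite3.Connection, handle: TextIO) -> int:
    """Write all rows of the assumed 'messages' table to handle; return row count."""
    cursor = conn.execute("SELECT * FROM messages ORDER BY message_id")
    headers = [column[0] for column in cursor.description]
    writer = csv.writer(handle)
    writer.writerow(headers)
    rows_written = 0
    for row in cursor:  # rows are fetched lazily as the loop advances
        writer.writerow(row)
        rows_written += 1
    return rows_written
```

A typical call would open the target file with `open(path, "w", newline="", encoding="utf-8")` and pass the handle in. Taking the header row from `cursor.description` means newly migrated columns appear in the export without further changes.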

## Performance Features ⚙️

- **5 concurrent downloads** for faster media processing
- **Batch database operations** for optimal speed
- **Progress bars** with real-time feedback
- **Resume capability** - Continue where you left off
- **Memory-efficient** exports for large datasets
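
One common way to cap concurrent downloads at five is an `asyncio.Semaphore`. The sketch below shows the pattern with a caller-supplied `download_one` coroutine standing in for the real download call. Both `download_all` and `download_one` are hypothetical names, and the script may implement the limit differently.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List

MAX_CONCURRENT_DOWNLOADS = 5  # the figure quoted in the list above


async def download_all(
    message_ids: Iterable[int],
    download_one: Callable[[int], Awaitable[str]],
) -> List[str]:
    """Run download_one for every ID, with at most five running at once."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_DOWNLOADS)

    async def bounded(message_id: int) -> str:
        async with semaphore:  # waits here while five downloads are active
            return await download_one(message_id)

    # gather preserves the input order of the results.
    return list(await asyncio.gather(*(bounded(m) for m in message_ids)))


async def fake_download(message_id: int) -> str:
    # Stand-in for a real media download.
    await asyncio.sleep(0.01)
    return f"{message_id}.bin"


print(asyncio.run(download_all(range(3), fake_download)))
```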

## Error Handling 🛠️

- Automatic retry with exponential backoff
- Rate limit compliance
- Network error recovery
- State preservation during interruptions
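
Retry with exponential backoff can be sketched as below, where the wait doubles after each failed attempt. The function name, the default values, and the choice of exceptions to retry are assumptions for illustration. Telethon reports rate limits through `FloodWaitError`, whose `seconds` attribute gives the required wait; that case is not handled in this sketch.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying on network errors with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists so the delay can be replaced in tests; in normal use the default `time.sleep` applies.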

## Limitations ⚠️

- Scraping speed is bounded by Telegram's rate limits
- Can only access public channels or channels you're a member of
- Media downloads are subject to Telegram's file size limits

## License 📄

This project is licensed under the MIT License - see the LICENSE file for details.

## Disclaimer ⚖️

This tool is for educational purposes only. Make sure to:

- Respect Telegram's Terms of Service
- Obtain necessary permissions before scraping
- Use responsibly and ethically
- Comply with data protection regulations