Update README.md

2024-11-04 22:22:38 +01:00
parent ef022f2afd
commit 5fda668b9f
1 changed files with 218 additions and 1 deletions
--- a/README.md
+++ b/README.md
@@ -1,2 +1,219 @@
-# telegram-scraper
+# Telegram Channel Scraper 📱
+
 A powerful Python script that allows you to scrape messages and media from Telegram channels using the Telethon library. Features include real-time continuous scraping, media downloading, and data export capabilities.
+
+```
+___________________  _________
+\__    ___/  _____/ /   _____/
+  |    | /   \  ___ \_____  \ 
+  |    | \    \_\  \/        \
+  |____|  \______  /_______  /
+                 \/        \/
+```
+
+## Features 🚀
+
+- Scrape messages from multiple Telegram channels
+- Download media files (photos, documents)
+- Real-time continuous scraping
+- Export data to JSON and CSV formats
+- SQLite database storage
+- Resume capability (saves progress)
+- Media reprocessing for failed downloads
+- Progress tracking
+- Interactive menu interface
+
+## Prerequisites 📋
+
+Before running the script, you'll need:
+
+- Python 3.7 or higher
+- Telegram account
+- API credentials from Telegram
+
+### Required Python packages
+
+```bash
+pip install -r requirements.txt
+```
+
+Contents of `requirements.txt`:
+```
+telethon
+aiohttp
+sqlite3
+asyncio
+```
+
+Note: `sqlite3` and `asyncio` are part of the Python standard library and do not need to be installed separately. If `pip` reports an error for either entry, remove it from `requirements.txt`.
+
+## Getting Telegram API Credentials 🔑
+
+1. Visit https://my.telegram.org/auth
+2. Log in with your phone number
+3. Click on "API development tools"
+4. Fill in the form:
+   - App title: Your app name
+   - Short name: Your app short name
+   - Platform: Can be left as "Desktop"
+   - Description: Brief description of your app
+5. Click "Create application"
+6. You'll receive:
+   - `api_id`: A number
+   - `api_hash`: A string of letters and numbers
+   
+Keep these credentials safe; you'll need them to run the script.
+
+## Setup and Running 🔧
+
+1. Clone the repository:
+```bash
+git clone https://github.com/unnohwn/telegram-scraper.git
+cd telegram-scraper
+```
+
+2. Install requirements:
+```bash
+pip install -r requirements.txt
+```
+
+3. Run the script:
+```bash
+python telegram-scraper.py
+```
+
+4. On first run, you'll be prompted to enter:
+   - Your API ID
+   - Your API Hash
+   - Your phone number (with country code)
+   - Your phone number again, when the login prompt asks for a phone number or bot token (enter the phone number here, not a bot token)
+   - Verification code (sent to your Telegram)
+
+## Initial Scraping Behavior 🕒
+
+When scraping a channel for the first time, please note:
+
+- The script will attempt to retrieve the entire channel history, starting from the oldest messages
+- Initial scraping can take several minutes or even hours, depending on:
+  - The total number of messages in the channel
+  - Whether media downloading is enabled
+  - The size and number of media files
+  - Your internet connection speed
+  - Telegram's rate limiting
+- The script uses pagination and maintains state, so if interrupted, it can resume from where it left off
+- Progress percentage is displayed in real-time to track the scraping status
+- Messages are stored in the database as they are scraped, so you can start analyzing available data even before the scraping is complete
+
+## Usage 📝
+
+The script provides an interactive menu with the following options:
+
+- **[A]** Add new channel
+  - Enter the channel ID or username
+- **[R]** Remove channel
+  - Remove a channel from scraping list
+- **[S]** Scrape all channels
+  - One-time scraping of all configured channels
+- **[M]** Toggle media scraping
+  - Enable/disable downloading of media files
+- **[C]** Continuous scraping
+  - Real-time monitoring of channels for new messages
+- **[E]** Export data
+  - Export to JSON and CSV formats
+- **[V]** View channels
+  - List all configured channels
+- **[Q]** Quit
+
+### Channel IDs 📢
+
+You can use either:
+- Channel username (e.g., `channelname`)
+- Channel ID (e.g., `-1001234567890`)
+
+To get a channel's ID:
+1. Forward a message from the channel to @userinfobot
+2. The bot will reply with channel information including its ID
+
+## Data Storage 💾
+
+### Database Structure
+
+Data is stored in SQLite databases, one per channel:
+- Location: `./channelname/channelname.db`
+- Table: `messages`
+  - `id`: Primary key
+  - `message_id`: Telegram message ID
+  - `date`: Message timestamp
+  - `sender_id`: Sender's Telegram ID
+  - `first_name`: Sender's first name
+  - `last_name`: Sender's last name
+  - `username`: Sender's username
+  - `message`: Message text
+  - `media_type`: Type of media (if any)
+  - `media_path`: Local path to downloaded media
+  - `reply_to`: ID of replied message (if any)
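+
+The table above can be queried directly with Python's built-in `sqlite3` module. The sketch below is a minimal example, not code from the script: the column types and the `photo` media type value are assumptions (only the column names are documented here), and `messages_with_media` is a hypothetical helper. It runs against an in-memory database so it can be tried without scraped data.
+
+```python
+import sqlite3
+
+# Assumed column types; the README documents only the column names.
+SCHEMA = """
+CREATE TABLE IF NOT EXISTS messages (
+    id INTEGER PRIMARY KEY,
+    message_id INTEGER,
+    date TEXT,
+    sender_id INTEGER,
+    first_name TEXT,
+    last_name TEXT,
+    username TEXT,
+    message TEXT,
+    media_type TEXT,
+    media_path TEXT,
+    reply_to INTEGER
+)
+"""
+
+
+def messages_with_media(conn):
+    """Return (message_id, media_type, media_path) rows that have media, ordered by message ID."""
+    cur = conn.execute(
+        "SELECT message_id, media_type, media_path FROM messages "
+        "WHERE media_type IS NOT NULL ORDER BY message_id"
+    )
+    return cur.fetchall()
+
+
+# Demo against an in-memory database. For real data, connect to
+# ./channelname/channelname.db instead of ":memory:".
+conn = sqlite3.connect(":memory:")
+conn.execute(SCHEMA)
+conn.executemany(
+    "INSERT INTO messages (message_id, message, media_type, media_path) "
+    "VALUES (?, ?, ?, ?)",
+    [
+        (1, "hello", None, None),
+        (2, "a photo", "photo", "./channelname/media/2.jpg"),  # "photo" is an assumed value
+    ],
+)
+rows = messages_with_media(conn)
+print(rows)
+```
+
+Since the data is plain SQLite, the same query should also work from the `sqlite3` command-line shell or any other SQLite client.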
+
+### Media Storage 📁
+
+Media files are stored in:
+- Location: `./channelname/media/`
+- Files are named using message ID or original filename
+
+### Exported Data 📊
+
+Data can be exported in two formats:
+1. **CSV**: `./channelname/channelname.csv`
+   - Human-readable spreadsheet format
+   - Easy to import into Excel/Google Sheets
+
+2. **JSON**: `./channelname/channelname.json`
+   - Structured data format
+   - Ideal for programmatic processing
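+
+Both export formats can be read back with the Python standard library. The following is a rough sketch under stated assumptions: it assumes the JSON export is a list of objects keyed by column name and that the CSV export has a header row, neither of which is specified here. `load_json` and `load_csv` are hypothetical helpers, and the demo writes its own sample files to a temporary directory rather than reading real exports.
+
+```python
+import csv
+import json
+import tempfile
+from pathlib import Path
+
+
+def load_csv(path):
+    """Read a CSV export into a list of dicts (all values come back as strings)."""
+    with open(path, newline="", encoding="utf-8") as f:
+        return list(csv.DictReader(f))
+
+
+def load_json(path):
+    """Read a JSON export (assumed to be a list of message objects)."""
+    with open(path, encoding="utf-8") as f:
+        return json.load(f)
+
+
+# Demo: write small sample exports, then load them back.
+with tempfile.TemporaryDirectory() as tmp:
+    sample = [{"message_id": 1, "message": "hello"}]
+    json_path = Path(tmp) / "channelname.json"
+    csv_path = Path(tmp) / "channelname.csv"
+
+    json_path.write_text(json.dumps(sample), encoding="utf-8")
+    with open(csv_path, "w", newline="", encoding="utf-8") as f:
+        writer = csv.DictWriter(f, fieldnames=["message_id", "message"])
+        writer.writeheader()
+        writer.writerows(sample)
+
+    from_json = load_json(json_path)
+    from_csv = load_csv(csv_path)
+
+print(from_json)
+print(from_csv)
+```
+
+Note that JSON preserves numeric types while `csv.DictReader` returns every field as a string, so IDs loaded from CSV need an explicit `int()` conversion before numeric comparison.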
+
+## Features in Detail 🔍
+
+### Continuous Scraping
+
+The continuous scraping feature (`[C]` option) allows you to:
+- Monitor channels in real-time
+- Automatically download new messages
+- Download media as it's posted
+- Run indefinitely until interrupted (Ctrl+C)
+- Maintain state between runs
+
+### Media Handling
+
+The script can download:
+- Photos
+- Documents
+- Other media types supported by Telegram
+
+It also automatically retries failed downloads and skips files that already exist, which avoids duplicates.
+
+## Error Handling 🛠️
+
+The script includes:
+- Automatic retry mechanism for failed media downloads
+- State preservation in case of interruption
+- Flood control compliance
+- Error logging for failed operations
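+
+To illustrate the general idea behind a retry mechanism, here is a minimal sketch of retrying with exponential backoff. It is not the script's actual implementation: `retry`, its parameters, and `flaky_download` are hypothetical names, and the failure is simulated instead of coming from Telegram.
+
+```python
+import time
+
+
+def retry(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
+    """Call operation() until it succeeds or attempts run out, doubling the delay each time."""
+    for attempt in range(attempts):
+        try:
+            return operation()
+        except Exception as exc:
+            if attempt == attempts - 1:
+                raise  # out of attempts: surface the last error
+            delay = base_delay * (2 ** attempt)
+            print(f"attempt {attempt + 1} failed ({exc}); retrying in {delay}s")
+            sleep(delay)
+
+
+# Demo: a simulated download that fails twice, then succeeds.
+calls = {"count": 0}
+
+
+def flaky_download():
+    calls["count"] += 1
+    if calls["count"] < 3:
+        raise ConnectionError("simulated failure")
+    return "media/42.jpg"
+
+
+result = retry(flaky_download, base_delay=0.01)
+print(result, calls["count"])
+```
+
+For flood control specifically, Telethon raises `FloodWaitError`, whose `seconds` attribute gives the required wait time; a real handler would likely sleep for that long instead of using a fixed backoff.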
+
+## Limitations ⚠️
+
+- Scraping speed is limited by Telegram's rate limits
+- Can only access public channels or channels you're a member of
+- Media download size limits apply as per Telegram's restrictions
+
+## Contributing 🤝
+
+Contributions are welcome! Please feel free to submit a Pull Request.
+
+## License 📄
+
+This project is licensed under the MIT License - see the LICENSE file for details.
+
+## Disclaimer ⚖️
+
+This tool is for educational purposes only. Make sure to:
+- Respect Telegram's Terms of Service
+- Obtain necessary permissions before scraping
+- Use responsibly and ethically
+- Comply with data protection regulations