Update README.md

2024-11-04 22:22:38 +01:00
parent ef022f2afd
commit 5fda668b9f
1 changed files with 218 additions and 1 deletions
--- a/README.md
+++ b/README.md
@@ -1,2 +1,219 @@
-# telegram-scraper
+# Telegram Channel Scraper 📱
 A powerful Python script that allows you to scrape messages and media from Telegram channels using the Telethon library. Features include real-time continuous scraping, media downloading, and data export capabilities.
 ```
 ___________________  _________
 \__    ___/  _____/ /   _____/
  |    | /   \  ___ \_____  \ 
  |    | \    \_\  \/        \
  |____|  \______  /_______  /
                 \/        \/
 ```
 ## Features 🚀
 - Scrape messages from multiple Telegram channels
 - Download media files (photos, documents)
 - Real-time continuous scraping
 - Export data to JSON and CSV formats
 - SQLite database storage
 - Resume capability (saves progress)
 - Media reprocessing for failed downloads
 - Progress tracking
 - Interactive menu interface
 ## Prerequisites 📋
 Before running the script, you'll need:
 - Python 3.7 or higher
 - Telegram account
 - API credentials from Telegram
 ### Required Python packages
 ```
 pip install -r requirements.txt
 ```
 Contents of `requirements.txt`:
 ```
 telethon
 aiohttp
 sqlite3
 asyncio
 ```
 ## Getting Telegram API Credentials 🔑
 1. Visit https://my.telegram.org/auth
 2. Log in with your phone number
 3. Click on "API development tools"
 4. Fill in the form:
   - App title: Your app name
   - Short name: Your app short name
   - Platform: Can be left as "Desktop"
   - Description: Brief description of your app
 5. Click "Create application"
 6. You'll receive:
   - `api_id`: A number
   - `api_hash`: A string of letters and numbers
 Keep these credentials safe, you'll need them to run the script!
 ## Setup and Running 🔧
 1. Clone the repository:
 ```bash
 git clone https://github.com/unnohwn/telegram-scraper.git
 cd telegram-scraper
 ```
 2. Install requirements:
 ```bash
 pip install -r requirements.txt
 ```
 3. Run the script:
 ```bash
 python telegram-scraper.py
 ```
 4. On first run, you'll be prompted to enter:
   - Your API ID
   - Your API Hash
   - Your phone number (with country code)
   - Your phone number (with country code) or bot, but use the phone number option when prompted second time.
   - Verification code (sent to your Telegram)
 ## Initial Scraping Behavior 🕒
 When scraping a channel for the first time, please note:
 - The script will attempt to retrieve the entire channel history, starting from the oldest messages
 - Initial scraping can take several minutes or even hours, depending on:
  - The total number of messages in the channel
  - Whether media downloading is enabled
  - The size and number of media files
  - Your internet connection speed
  - Telegram's rate limiting
 - The script uses pagination and maintains state, so if interrupted, it can resume from where it left off
 - Progress percentage is displayed in real-time to track the scraping status
 - Messages are stored in the database as they are scraped, so you can start analyzing available data even before the scraping is complete
 ## Usage 📝
 The script provides an interactive menu with the following options:
 - **[A]** Add new channel
  - Enter the channel ID or username
 - **[R]** Remove channel
  - Remove a channel from scraping list
 - **[S]** Scrape all channels
  - One-time scraping of all configured channels
 - **[M]** Toggle media scraping
  - Enable/disable downloading of media files
 - **[C]** Continuous scraping
  - Real-time monitoring of channels for new messages
 - **[E]** Export data
  - Export to JSON and CSV formats
 - **[V]** View channels
  - List all configured channels
 - **[Q]** Quit
 ### Channel IDs 📢
 You can use either:
 - Channel username (e.g., `channelname`)
 - Channel ID (e.g., `-1001234567890`)
 To get a channel's ID:
 1. Forward a message from the channel to @userinfobot
 2. The bot will reply with channel information including its ID
 ## Data Storage 💾
 ### Database Structure
 Data is stored in SQLite databases, one per channel:
 - Location: `./channelname/channelname.db`
 - Table: `messages`
  - `id`: Primary key
  - `message_id`: Telegram message ID
  - `date`: Message timestamp
  - `sender_id`: Sender's Telegram ID
  - `first_name`: Sender's first name
  - `last_name`: Sender's last name
  - `username`: Sender's username
  - `message`: Message text
  - `media_type`: Type of media (if any)
  - `media_path`: Local path to downloaded media
  - `reply_to`: ID of replied message (if any)
 ### Media Storage 📁
 Media files are stored in:
 - Location: `./channelname/media/`
 - Files are named using message ID or original filename
 ### Exported Data 📊
 Data can be exported in two formats:
 1. **CSV**: `./channelname/channelname.csv`
   - Human-readable spreadsheet format
   - Easy to import into Excel/Google Sheets
 2. **JSON**: `./channelname/channelname.json`
   - Structured data format
   - Ideal for programmatic processing
 ## Features in Detail 🔍
 ### Continuous Scraping
 The continuous scraping feature (`[C]` option) allows you to:
 - Monitor channels in real-time
 - Automatically download new messages
 - Download media as it's posted
 - Run indefinitely until interrupted (Ctrl+C)
 - Maintains state between runs
 ### Media Handling
 The script can download:
 - Photos
 - Documents
 - Other media types supported by Telegram
 - Automatically retries failed downloads
 - Skips existing files to avoid duplicates
 ## Error Handling 🛠️
 The script includes:
 - Automatic retry mechanism for failed media downloads
 - State preservation in case of interruption
 - Flood control compliance
 - Error logging for failed operations
 ## Limitations ⚠️
 - Respects Telegram's rate limits
 - Can only access public channels or channels you're a member of
 - Media download size limits apply as per Telegram's restrictions
 ## Contributing 🤝
 Contributions are welcome! Please feel free to submit a Pull Request.
 ## License 📄
 This project is licensed under the MIT License - see the LICENSE file for details.
 ## Disclaimer ⚖️
 This tool is for educational purposes only. Make sure to:
 - Respect Telegram's Terms of Service
 - Obtain necessary permissions before scraping
 - Use responsibly and ethically
 - Comply with data protection regulations