From 5fda668b9f87be04e21d89a2b3c40576b0c34382 Mon Sep 17 00:00:00 2001 From: unnohwn <169650232+unnohwn@users.noreply.github.com> Date: Mon, 4 Nov 2024 22:22:38 +0100 Subject: [PATCH] Update README.md --- README.md | 219 +++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 218 insertions(+), 1 deletion(-) diff --git a/README.md b/README.md index 0758e1f..7f45f31 100644 --- a/README.md +++ b/README.md @@ -1,2 +1,219 @@ -# telegram-scraper +# Telegram Channel Scraper 📱 + A powerful Python script that allows you to scrape messages and media from Telegram channels using the Telethon library. Features include real-time continuous scraping, media downloading, and data export capabilities. + +``` +___________________ _________ +\__ ___/ _____/ / _____/ + | | / \ ___ \_____ \ + | | \ \_\ \/ \ + |____| \______ /_______ / + \/ \/ +``` + +## Features 🚀 + +- Scrape messages from multiple Telegram channels +- Download media files (photos, documents) +- Real-time continuous scraping +- Export data to JSON and CSV formats +- SQLite database storage +- Resume capability (saves progress) +- Media reprocessing for failed downloads +- Progress tracking +- Interactive menu interface + +## Prerequisites 📋 + +Before running the script, you'll need: + +- Python 3.7 or higher +- Telegram account +- API credentials from Telegram + +### Required Python packages + +``` +pip install -r requirements.txt +``` + +Contents of `requirements.txt`: +``` +telethon +aiohttp +sqlite3 +asyncio +``` + +## Getting Telegram API Credentials 🔑 + +1. Visit https://my.telegram.org/auth +2. Log in with your phone number +3. Click on "API development tools" +4. Fill in the form: + - App title: Your app name + - Short name: Your app short name + - Platform: Can be left as "Desktop" + - Description: Brief description of your app +5. Click "Create application" +6. You'll receive: + - `api_id`: A number + - `api_hash`: A string of letters and numbers + +Keep these credentials safe, you'll need them to run the script! + +## Setup and Running 🔧 + +1. Clone the repository: +```bash +git clone https://github.com/unnohwn/telegram-scraper.git +cd telegram-scraper +``` + +2. Install requirements: +```bash +pip install -r requirements.txt +``` + +3. Run the script: +```bash +python telegram-scraper.py +``` + +4. On first run, you'll be prompted to enter: + - Your API ID + - Your API Hash + - Your phone number (with country code) + - Your phone number (with country code) or bot, but use the phone number option when prompted second time. + - Verification code (sent to your Telegram) + +## Initial Scraping Behavior 🕒 + +When scraping a channel for the first time, please note: + +- The script will attempt to retrieve the entire channel history, starting from the oldest messages +- Initial scraping can take several minutes or even hours, depending on: + - The total number of messages in the channel + - Whether media downloading is enabled + - The size and number of media files + - Your internet connection speed + - Telegram's rate limiting +- The script uses pagination and maintains state, so if interrupted, it can resume from where it left off +- Progress percentage is displayed in real-time to track the scraping status +- Messages are stored in the database as they are scraped, so you can start analyzing available data even before the scraping is complete + +## Usage 📝 + +The script provides an interactive menu with the following options: + +- **[A]** Add new channel + - Enter the channel ID or username +- **[R]** Remove channel + - Remove a channel from scraping list +- **[S]** Scrape all channels + - One-time scraping of all configured channels +- **[M]** Toggle media scraping + - Enable/disable downloading of media files +- **[C]** Continuous scraping + - Real-time monitoring of channels for new messages +- **[E]** Export data + - Export to JSON and CSV formats +- **[V]** View channels + - List all configured channels +- **[Q]** Quit + +### Channel IDs 📢 + +You can use either: +- Channel username (e.g., `channelname`) +- Channel ID (e.g., `-1001234567890`) + +To get a channel's ID: +1. Forward a message from the channel to @userinfobot +2. The bot will reply with channel information including its ID + +## Data Storage 💾 + +### Database Structure + +Data is stored in SQLite databases, one per channel: +- Location: `./channelname/channelname.db` +- Table: `messages` + - `id`: Primary key + - `message_id`: Telegram message ID + - `date`: Message timestamp + - `sender_id`: Sender's Telegram ID + - `first_name`: Sender's first name + - `last_name`: Sender's last name + - `username`: Sender's username + - `message`: Message text + - `media_type`: Type of media (if any) + - `media_path`: Local path to downloaded media + - `reply_to`: ID of replied message (if any) + +### Media Storage 📁 + +Media files are stored in: +- Location: `./channelname/media/` +- Files are named using message ID or original filename + +### Exported Data 📊 + +Data can be exported in two formats: +1. **CSV**: `./channelname/channelname.csv` + - Human-readable spreadsheet format + - Easy to import into Excel/Google Sheets + +2. **JSON**: `./channelname/channelname.json` + - Structured data format + - Ideal for programmatic processing + +## Features in Detail 🔍 + +### Continuous Scraping + +The continuous scraping feature (`[C]` option) allows you to: +- Monitor channels in real-time +- Automatically download new messages +- Download media as it's posted +- Run indefinitely until interrupted (Ctrl+C) +- Maintains state between runs + +### Media Handling + +The script can download: +- Photos +- Documents +- Other media types supported by Telegram +- Automatically retries failed downloads +- Skips existing files to avoid duplicates + +## Error Handling 🛠️ + +The script includes: +- Automatic retry mechanism for failed media downloads +- State preservation in case of interruption +- Flood control compliance +- Error logging for failed operations + +## Limitations ⚠️ + +- Respects Telegram's rate limits +- Can only access public channels or channels you're a member of +- Media download size limits apply as per Telegram's restrictions + +## Contributing 🤝 + +Contributions are welcome! Please feel free to submit a Pull Request. + +## License 📄 + +This project is licensed under the MIT License - see the LICENSE file for details. + +## Disclaimer ⚖️ + +This tool is for educational purposes only. Make sure to: +- Respect Telegram's Terms of Service +- Obtain necessary permissions before scraping +- Use responsibly and ethically +- Comply with data protection regulations