aa

2026-04-27 13:23:23 +08:00 · 2026-04-27 12:06:02 +08:00 · 2026-04-27 11:43:10 +08:00 · 2026-04-27 02:07:31 +08:00 · 2026-04-27 02:02:46 +08:00 · 2026-04-27 02:00:03 +08:00
10 changed files with 4384 additions and 405 deletions
--- a/.dockerignore
+++ b/.dockerignore
@@ -0,0 +1,13 @@
 # Excluded from the build context when building the image (keeps it small; data lives in host volumes)
 .git
 .env
 .env.*
 *.session
 *.session-journal
 state.json
 -100*/
 __pycache__
 *.pyc
 .cursor
 .venv
 venv
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,41 @@
 # ========== Secrets and login sessions (never push to a remote) ==========
 .env
 .env.*
 *.session
 *.session-journal
 # ========== Runtime state and scrape progress (tied to channel data; do not commit) ==========
 state.json
 # ========== Per-channel scrape results (SQLite, media, export files) ==========
 # Directory names are usually the Telegram supergroup/channel ID (-100xxxxxxxxxx)
 -100*/
 # ========== Script-generated lists (can be regenerated at any time) ==========
 channels_list.csv
 # ========== Python ==========
 __pycache__/
 *.py[cod]
 *$py.class
 .Python
 venv/
 .venv/
 *.egg-info/
 .eggs/
 dist/
 build/
 # ========== Editors / local tools ==========
 .cursor/
 .vscode/
 .idea/
 *.swp
 *.swo
 .DS_Store
 Thumbs.db
 # ========== Logs and temporary files ==========
 *.log
 *.tmp
 *.temp
--- a/23
+++ b/23
@@ -0,0 +1,23 @@
 # Runs the web console; scraped data is mounted as a volume at /data, see the docker-compose notes
 FROM python:3.11-slim
 ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1
 WORKDIR /app
 RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates \
    && rm -rf /var/lib/apt/lists/*
 COPY requirements.txt .
 RUN pip install --no-cache-dir -U pip setuptools wheel \
    && pip install --no-cache-dir -r requirements.txt
 COPY telegram-scraper.py app_web.py ./
 COPY templates ./templates/
 COPY static ./static/
 EXPOSE 8000
 CMD ["uvicorn", "app_web:app", "--host", "0.0.0.0", "--port", "8000"]
--- a/README.md
+++ b/README.md
@@ -1,5 +1,17 @@
 # Telegram Channel Scraper 📱
 > **⚠️ DISCONTINUED**
 >
 > This project is no longer maintained. After a lot of support and interest from the community, a far more capable successor has been released:
 >
 > **➜ [Harrier — Telegram Scraping & Intelligence Platform](https://github.com/skuggrev/harrier)**
 >
 > Harrier has everything this tool had and much more - web UI, real-time progress, user lookup, webhook alerts, continuous scraping, and a proper export system. I recommend switching over.
 >
 > A huge thank you to everyone who used, starred, and supported this project.
 ---
 A powerful Python script that allows you to scrape messages and media from Telegram channels using the Telethon library. Features include real-time continuous scraping, media downloading, and data export capabilities.
 ```
@@ -10,17 +22,43 @@ ___________________  _________
  |____|  \______  /_______  /
                 \/        \/
 ```
 ## What's New in v3.1 🎉
 **Enhanced Message Data:**
 - **Message statistics** - Captures views, forwards, and post_author for each message
 - **Reactions support** - Records all emoji reactions with counts (e.g., "😀 12 👍 3")
 - **Automatic database migration** - Seamlessly adds new columns to existing databases
 - **Richer exports** - All new data included in CSV/JSON exports
 **Improved Channel Management:**
 - **Channel names displayed** - Shows channel names alongside IDs everywhere
 - **Smart filtering** - List option now only shows Channels and Groups (no private chats)
 - **channels_list.csv export** - Automatically saves channel list with names, IDs, usernames, and types
 - **"all" selection** - Quickly add all listed channels at once
 - **Better export naming** - Files now named as `ID_username.csv` and `ID_username.json`
 **Bug Fixes:**
 - **Fixed channel ID parsing** - Resolved "invalid literal for int()" error in fix missing media
 - **Better entity resolution** - Handles both numeric IDs and channel usernames
 - **Improved error messages** - Shows channel names with IDs for clearer debugging
 ## Features 🚀
- Scrape messages from multiple Telegram channels
+- **QR Code & Phone Authentication** - Choose your preferred login method
- Download media files (photos, documents)
+- Scrape messages with full metadata (views, forwards, reactions, post author)
 - Download media files with parallel processing and unique naming
 - Real-time continuous scraping
- Export data to JSON and CSV formats
+- Export data to JSON and CSV formats with enhanced metadata
- SQLite database storage
+- SQLite database storage with automatic schema migration
 - Resume capability (saves progress)
- Media reprocessing for failed downloads
+- Interactive menu with channel names and numbered selection
- Progress tracking
+- Smart channel filtering (only shows channels/groups)
- Interactive menu interface
+- Progress tracking with visual progress bars
 - Automatic channels list export to CSV
 ## Prerequisites 📋
@@ -36,13 +74,6 @@ Before running the script, you'll need:
 pip install -r requirements.txt
 ```
 Contents of `requirements.txt`:
 ```
 telethon
 aiohttp
 asyncio
 ```
 ## Getting Telegram API Credentials 🔑
 1. Visit https://my.telegram.org/auth
@@ -57,140 +88,149 @@ asyncio
 6. You'll receive:
   - `api_id`: A number
   - `api_hash`: A string of letters and numbers
-   
+
 Keep these credentials safe, you'll need them to run the script!
 ## Setup and Running 🔧
 1. Clone the repository:
 ```bash
 git clone https://github.com/unnohwn/telegram-scraper.git
 cd telegram-scraper
 ```
 2. Install requirements:
 ```bash
 pip install -r requirements.txt
 ```
 3. Run the script:
 ```bash
 python telegram-scraper.py
 ```
 4. On first run, you'll be prompted to enter:
-   - Your API ID
-   - Your API Hash
-   - Your phone number (with country code)
-   - Your phone number (with country code) or bot, but use the phone number option when prompted second time.
-   - Verification code (sent to your Telegram)
+   - Your API ID (from my.telegram.org)
+   - Your API Hash (from my.telegram.org)
+   - **Choose authentication method:**
+     - **QR Code** (Recommended) - Scan with your phone (no phone number needed)
+     - **Phone Number** - Traditional SMS verification
-## Initial Scraping Behavior 🕒
-When scraping a channel for the first time, please note:
- The script will attempt to retrieve the entire channel history, starting from the oldest messages
- Initial scraping can take several minutes or even hours, depending on:
-  - The total number of messages in the channel
-  - Whether media downloading is enabled
-  - The size and number of media files
-  - Your internet connection speed
-  - Telegram's rate limiting
- The script uses pagination and maintains state, so if interrupted, it can resume from where it left off
- Progress percentage is displayed in real-time to track the scraping status
- Messages are stored in the database as they are scraped, so you can start analyzing available data even before the scraping is complete
+## Web Console (MVP) 🌐
+You can run a simple web control panel that manages `.env` configuration and starts/stops the scraper process:
+```bash
+pip install -r requirements.txt
+uvicorn app_web:app --host 0.0.0.0 --port 8000 --reload
+```
+
+Then open:
+
+```text
+http://127.0.0.1:8000
+```
 Features:
 - Edit core config values from the web page (saved back to `.env`)
 - Start / stop scraper process from browser
 - View recent runtime logs
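The "saved back to `.env`" behaviour listed above could look roughly like the following. This is a minimal sketch that assumes a plain `KEY=VALUE` file; `update_env_file` and the key names in the usage note are hypothetical and are not taken from `app_web.py`, whose contents are not shown in this diff.

```python
from pathlib import Path


def update_env_file(path: Path, updates: dict) -> None:
    """Rewrite KEY=VALUE lines in a .env-style file, keeping comments and order."""
    lines = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    remaining = dict(updates)
    result = []
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("#") and "=" in stripped:
            key = stripped.split("=", 1)[0].strip()
            if key in remaining:
                # Replace the value of a key that already exists in the file.
                result.append(f"{key}={remaining.pop(key)}")
                continue
        result.append(line)
    # Keys that were not present yet are appended at the end.
    result.extend(f"{key}={value}" for key, value in remaining.items())
    path.write_text("\n".join(result) + "\n", encoding="utf-8")
```

For example, `update_env_file(Path(".env"), {"API_ID": "12345"})` would change only the `API_ID` line (a hypothetical key) and leave comments and other keys untouched.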
 ## Usage 📝
-The script provides an interactive menu with the following options:
- **[A]** Add new channel
-  - Enter the channel ID or channelname
- **[R]** Remove channel
-  - Remove a channel from scraping list
- **[S]** Scrape all channels
-  - One-time scraping of all configured channels
- **[M]** Toggle media scraping
-  - Enable/disable downloading of media files
- **[C]** Continuous scraping
-  - Real-time monitoring of channels for new messages
- **[E]** Export data
-  - Export to JSON and CSV formats
- **[V]** View saved channels
-  - List all saved channels
+The script provides a clean interactive menu:
+```
+========================================
+           TELEGRAM SCRAPER
+========================================
+[S] Scrape channels
+[C] Continuous scraping
+[M] Media scraping: ON
+[L] List & add channels
+[R] Remove channels
+[E] Export data
+[T] Rescrape media
+[Q] Quit
+========================================
+```
 - **[L]** List account channels
  - List all channels with their IDs for the account
 - **[Q]** Quit
-### Channel IDs 📢
-You can use either:
- Channel username (e.g., `channelname`)
- Channel ID (e.g., `-1001234567890`)
+### Channel Selection Made Easy 🔢
+Instead of typing long channel IDs, use numbers:
+
+**Adding Channels:**
 ```
 [1] Tech News (ID: -1002116176890, Type: Channel, Username: @technews)
 [2] Python Dev (ID: -1001597139842, Type: Group, Username: @pythondev)
 [3] Daily Updates (ID: -1002274713954, Type: Channel, Username: @dailyupdates)
 Enter: 1,3 (adds channels 1 and 3)
 Or: all (adds all listed channels)
 ```
 **Viewing Your Channels:**
 ```
 [1] Tech News (ID: -1002116176890), Last Message ID: 5234, Messages: 12450
 [2] Python Dev (ID: -1001597139842), Last Message ID: 8192, Messages: 45782
 ```
 **Scraping Channels:**
 - Single: `1`
 - Multiple: `1,3,5`
 - All: `all`
 - Mix formats: `1,-1001597139842,3`
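The selection formats above (`1`, `1,3,5`, `all`, and mixed input such as `1,-1001597139842,3`) can be illustrated with a small parser. This is a sketch under assumptions, not the scraper's actual code: `parse_selection` is a hypothetical helper, positive numbers are read as 1-based positions in the displayed list, and negative numbers are read as raw channel IDs.

```python
def parse_selection(raw: str, listed_ids: list) -> list:
    """Turn input such as "1,3", "all" or "1,-1001597139842" into channel IDs."""
    text = raw.strip().lower()
    if text == "all":
        return list(listed_ids)
    selected = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas such as "1,,3"
        value = int(token)  # raises ValueError on non-numeric input
        if value < 0:
            channel_id = value  # assumed to be a raw channel ID
        elif 1 <= value <= len(listed_ids):
            channel_id = listed_ids[value - 1]  # 1-based list position
        else:
            raise ValueError(f"selection {value} is out of range")
        if channel_id not in selected:
            selected.append(channel_id)  # keep order, drop duplicates
    return selected
```

With the three channels listed above, `parse_selection("1,3", ids)` would pick the first and third IDs, and mixing a position with the same channel's raw ID yields that channel only once.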
 ## Data Storage 💾
 ### Database Structure
 Data is stored in SQLite databases, one per channel:
 - Location: `./channelname/channelname.db`
- Table: `messages`
-  - `id`: Primary key
-  - `message_id`: Telegram message ID
-  - `date`: Message timestamp
+- Optimized with indexes for fast queries
+- WAL mode for better performance
+- Schema includes: message_id, date, sender info, message text, media info, reply_to, post_author, views, forwards, reactions
+- Automatic migration adds new columns to existing databases
  - `sender_id`: Sender's Telegram ID
  - `first_name`: Sender's first name
  - `last_name`: Sender's last name
  - `username`: Sender's username
  - `message`: Message text
  - `media_type`: Type of media (if any)
  - `media_path`: Local path to downloaded media
  - `reply_to`: ID of replied message (if any)
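The "automatic migration adds new columns" behaviour can be sketched with the standard `sqlite3` module. The column names come from the schema list above; the SQL types and the helper name `migrate_messages_table` are assumptions, since `telegram-scraper.py` is not shown in this diff.

```python
import sqlite3

# Column names are from the README; the SQL types are assumptions.
NEW_COLUMNS = {
    "post_author": "TEXT",
    "views": "INTEGER",
    "forwards": "INTEGER",
    "reactions": "TEXT",
}


def migrate_messages_table(conn: sqlite3.Connection) -> list:
    """Add any missing columns to `messages` and return the names that were added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(messages)")}
    added = []
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE messages ADD COLUMN {name} {sql_type}")
            added.append(name)
    conn.commit()
    return added
```

Because the check runs against the live schema, calling the function a second time is a no-op, which is what makes it safe to run on every start-up.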
 ### Media Storage 📁
-Media files are stored in:
+Media files are stored with unique naming:
 - Location: `./channelname/media/`
- Files are named using message ID or original filename
+- Format: `{message_id}-{unique_id}-{original_name}.ext`
 - **No more file overwrites** - Each file gets a unique name
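The naming format `{message_id}-{unique_id}-{original_name}.ext` can be illustrated as follows. The README does not say how the unique part is produced, so this sketch assumes a short random token; `unique_media_name` is a hypothetical helper.

```python
import uuid
from pathlib import Path
from typing import Optional


def unique_media_name(message_id: int, original_name: str,
                      unique_id: Optional[str] = None) -> str:
    """Build `{message_id}-{unique_id}-{original_name}.ext` for a downloaded file."""
    # Assumption: a random token; the real scraper may derive the ID differently.
    token = unique_id or uuid.uuid4().hex[:8]
    source = Path(original_name)
    stem = source.stem or "file"  # fall back when no original name is available
    return f"{message_id}-{token}-{stem}{source.suffix}"
```

Two attachments called `photo.jpg` on the same message then end up under different names, so the second download cannot overwrite the first.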
 ### Exported Data 📊
-Data can be exported in two formats:
+Export formats:
 1. **CSV**: `./channelname/channelname.csv`
   - Human-readable spreadsheet format
   - Easy to import into Excel/Google Sheets
-2. **JSON**: `./channelname/channelname.json`
-   - Structured data format
-   - Ideal for programmatic processing
-## Features in Detail 🔍
-### Continuous Scraping
-The continuous scraping feature (`[C]` option) allows you to:
- Monitor channels in real-time
- Automatically download new messages
- Download media as it's posted
- Run indefinitely until interrupted (Ctrl+C)
+1. **CSV**: `./channelname/channelid_username.csv`
+2. **JSON**: `./channelname/channelid_username.json`
+3. **Channel List**: `./channels_list.csv` (automatically created when using [L] option)
+All exports include complete message metadata: views, forwards, reactions, and post author information.
+## Performance Features ⚙️
+- **5 concurrent downloads** for faster media processing
+- **Batch database operations** for optimal speed
+- **Progress bars** with real-time feedback
+- **Resume capability** - Continue where you left off
+- **Memory-efficient** exports for large datasets
 - Maintains state between runs
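The "5 concurrent downloads" limit is commonly implemented with an `asyncio.Semaphore`. The sketch below shows that general pattern only; `download_all` and the `download_one` callback are hypothetical names, and the scraper's real download code may be organised differently.

```python
import asyncio


async def download_all(items, download_one, limit=5):
    """Run `download_one(item)` for every item with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        # Wait for a free slot before starting the download.
        async with semaphore:
            return await download_one(item)

    # gather() returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))
```

A caller would pass its own coroutine, for example `await download_all(messages, fetch_media)`, where `fetch_media` stands in for whatever performs a single download.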
 ### Media Handling
 The script can download:
 - Photos
 - Documents
 - Other media types supported by Telegram
 - Automatically retries failed downloads
 - Skips existing files to avoid duplicates
 ## Error Handling 🛠️
-The script includes:
- Automatic retry mechanism for failed media downloads
- State preservation in case of interruption
- Flood control compliance
+- Automatic retry with exponential backoff
+- Rate limit compliance
+- Network error recovery
+- State preservation during interruptions
 - Error logging for failed operations
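"Automatic retry with exponential backoff" generally means doubling the wait after each failed attempt, up to a cap. The following is a generic sketch of that idea, not the scraper's implementation: `retry_with_backoff`, its defaults, and the choice of exceptions to retry on are all assumptions.

```python
import asyncio
import random


async def retry_with_backoff(operation, max_attempts=5, base_delay=1.0,
                             max_delay=60.0, sleep=asyncio.sleep):
    """Await `operation()` until it succeeds, doubling the delay after each failure."""
    for attempt in range(max_attempts):
        try:
            return await operation()
        except (OSError, asyncio.TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter avoids retrying in lockstep.
            await sleep(delay + random.uniform(0, delay * 0.1))
```

With the defaults, the waits are roughly 1 s, 2 s, 4 s and 8 s before the fifth and final attempt. The `sleep` parameter exists only so the delay can be replaced in tests.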
 ## Limitations ⚠️
@@ -198,10 +238,6 @@ The script includes:
 - Can only access public channels or channels you're a member of
 - Media download size limits apply as per Telegram's restrictions
 ## Contributing 🤝
 Contributions are welcome! Please feel free to submit a Pull Request.
 ## License 📄
 This project is licensed under the MIT License - see the LICENSE file for details.
@@ -209,6 +245,7 @@ This project is licensed under the MIT License - see the LICENSE file for detail
 ## Disclaimer ⚖️
 This tool is for educational purposes only. Make sure to:
 - Respect Telegram's Terms of Service
 - Obtain necessary permissions before scraping
 - Use responsibly and ethically
--- a/app_web.py
+++ b/app_web.py
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -0,0 +1,17 @@
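 # Usage (from the project root):
 #   docker compose build
 #   docker compose up -d
 # Data persistence: the project directory on the host is mounted at /data, which is also the app's working directory (see volumes below)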
 services:
  web:
    build: .
    image: telegram-scraper-web:local
    container_name: telegram-scraper-web
    restart: unless-stopped
    ports:
      - "8000:8000"
    working_dir: /data
    volumes:
       # Change this to the absolute path on your server that holds the existing code, .env, state, session, and the -100* directories
      - ${HOST_PROJECT_DIR:-.}:/data
    command: ["uvicorn", "app_web:app", "--host", "0.0.0.0", "--port", "8000"]
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,3 +1,10 @@
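-telethon
-aiohttp
-asyncio
+# Direct dependencies (pip resolves transitive dependencies automatically)
+# If installation still fails: run python3 --version first. The Python 3.6 bundled with CentOS is too old; install python39/python311 and then run pip install
+Telethon>=1.28.0,<2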
 fastapi>=0.65.0,<1
 uvicorn>=0.17.0,<1
 itsdangerous>=2.0.0
 jinja2>=3.0.0,<4
 python-multipart>=0.0.5
 qrcode>=7.3.0
 PySocks>=1.7.0
--- a/static/favicon.png
+++ b/static/favicon.png
--- a/telegram-scraper.py
+++ b/telegram-scraper.py
--- a/templates/index.html
+++ b/templates/index.html
Author	SHA1	Message	Date
RISE	77f0d404fa	aa	2026-04-27 13:23:23 +08:00
RISE	cf64bc4703	aa	2026-04-27 12:06:02 +08:00
RISE	ff022bce5d	aa	2026-04-27 11:43:10 +08:00
RISE	4ae6898be0	aa	2026-04-27 02:07:31 +08:00
RISE	440416ba0c	aa	2026-04-27 02:02:46 +08:00
RISE	459a5299a0	aa	2026-04-27 02:00:03 +08:00
RISE	d4378afbc9	aa	2026-04-27 01:56:43 +08:00
RISE	dfb5fe0c89	aa	2026-04-27 01:45:59 +08:00
RISE	384d7e4838	aa	2026-04-27 01:42:47 +08:00
RISE	e30292e330	aa	2026-04-27 01:37:22 +08:00
RISE	ec804afc60	aa	2026-04-27 01:30:29 +08:00
RISE	4c48525b3a	aa	2026-04-27 01:28:50 +08:00
RISE	b00a0c40d8	aa	2026-04-27 01:25:45 +08:00
RISE	5ec4c38495	aa	2026-04-27 01:20:49 +08:00
RISE	2d0eeaa78f	aa	2026-04-27 01:19:38 +08:00
𝓾𝓷𝓷𝓸𝓱𝔀𝓷	c84141674a	Update README.md	2026-04-11 23:38:09 +02:00
𝓾𝓷𝓷𝓸𝓱𝔀𝓷	fb7ad3742e	Version 3.1	2025-12-12 15:38:09 +01:00
𝓾𝓷𝓷𝓸𝓱𝔀𝓷	8d4e092b1b	Update telegram-scraper.py v3.0	2025-09-11 17:34:59 +02:00
𝓾𝓷𝓷𝓸𝓱𝔀𝓷	e7bf2b1ed7	Update requirements.txt	2025-09-11 17:34:30 +02:00
𝓾𝓷𝓷𝓸𝓱𝔀𝓷	7db46018ce	Update README.md	2025-09-11 17:32:56 +02:00
Robert Aitch	65b221ade6	Update requirements.txt	2025-07-20 20:18:12 +02:00
Robert Aitch	ac7d6de06b	Performance improvements major performance overhaul with 5-10x speed improvements	2025-07-20 00:57:54 +02:00
Robert Aitch	57bf125ca1	Delete gai.py	2025-07-20 00:36:53 +02:00
Robert Aitch	f383f222c4	Update README.md	2025-07-20 00:35:41 +02:00
𝓾𝓷𝓷𝓸𝓱𝔀𝓷	6273c9c11c	Merge pull request #12 from chaseyoungcn/main get total_messages speed up	2025-07-18 10:33:00 +02:00
fxxk-research	85d3f0f935	Rename gai to gai.py rename	2025-06-26 13:36:58 +08:00
fxxk-research	30bda684fe	Update gai filiter no message channel	2025-06-26 13:36:15 +08:00
fxxk-research	aa9b756d37	Create gai (improve progress bar and logging system)	2025-06-23 11:03:53 +08:00
fxxk-research	6baf4bdd13	get total_messages speed up O(n) -> O(1)	2025-06-19 20:42:10 +08:00