Compare commits

...

29 Commits

Author SHA1 Message Date
77f0d404fa aa 2026-04-27 13:23:23 +08:00
cf64bc4703 aa 2026-04-27 12:06:02 +08:00
ff022bce5d aa 2026-04-27 11:43:10 +08:00
4ae6898be0 aa 2026-04-27 02:07:31 +08:00
440416ba0c aa 2026-04-27 02:02:46 +08:00
459a5299a0 aa 2026-04-27 02:00:03 +08:00
d4378afbc9 aa 2026-04-27 01:56:43 +08:00
dfb5fe0c89 aa 2026-04-27 01:45:59 +08:00
384d7e4838 aa 2026-04-27 01:42:47 +08:00
e30292e330 aa 2026-04-27 01:37:22 +08:00
ec804afc60 aa 2026-04-27 01:30:29 +08:00
4c48525b3a aa 2026-04-27 01:28:50 +08:00
b00a0c40d8 aa 2026-04-27 01:25:45 +08:00
5ec4c38495 aa 2026-04-27 01:20:49 +08:00
2d0eeaa78f aa 2026-04-27 01:19:38 +08:00
𝓾𝓷𝓷𝓸𝓱𝔀𝓷
c84141674a Update README.md 2026-04-11 23:38:09 +02:00
𝓾𝓷𝓷𝓸𝓱𝔀𝓷
fb7ad3742e Version 3.1 2025-12-12 15:38:09 +01:00
𝓾𝓷𝓷𝓸𝓱𝔀𝓷
8d4e092b1b Update telegram-scraper.py
v3.0
2025-09-11 17:34:59 +02:00
𝓾𝓷𝓷𝓸𝓱𝔀𝓷
e7bf2b1ed7 Update requirements.txt 2025-09-11 17:34:30 +02:00
𝓾𝓷𝓷𝓸𝓱𝔀𝓷
7db46018ce Update README.md 2025-09-11 17:32:56 +02:00
Robert Aitch
65b221ade6 Update requirements.txt 2025-07-20 20:18:12 +02:00
Robert Aitch
ac7d6de06b Performance improvements
major performance overhaul with 5-10x speed improvements
2025-07-20 00:57:54 +02:00
Robert Aitch
57bf125ca1 Delete gai.py 2025-07-20 00:36:53 +02:00
Robert Aitch
f383f222c4 Update README.md 2025-07-20 00:35:41 +02:00
𝓾𝓷𝓷𝓸𝓱𝔀𝓷
6273c9c11c Merge pull request #12 from chaseyoungcn/main
get total_messages speed up
2025-07-18 10:33:00 +02:00
fxxk-research
85d3f0f935 Rename gai to gai.py
rename
2025-06-26 13:36:58 +08:00
fxxk-research
30bda684fe Update gai
filiter no message channel
2025-06-26 13:36:15 +08:00
fxxk-research
aa9b756d37 Create gai
Improve the progress bar and logging system
2025-06-23 11:03:53 +08:00
fxxk-research
6baf4bdd13 get total_messages speed up
O(n) -> O(1)
2025-06-19 20:42:10 +08:00
10 changed files with 4384 additions and 405 deletions

13
.dockerignore Normal file

@@ -0,0 +1,13 @@
# Kept out of the build context when building the image (keeps it small; data lives in host volumes)
.git
.env
.env.*
*.session
*.session-journal
state.json
-100*/
__pycache__
*.pyc
.cursor
.venv
venv

41
.gitignore vendored Normal file

@@ -0,0 +1,41 @@
# ========== Secrets and login sessions (never push to a remote) ==========
.env
.env.*
*.session
*.session-journal
# ========== Runtime state and scraping progress (paired with channel data; do not commit) ==========
state.json
# ========== Scraped results stored per channel (SQLite, media, export files) ==========
# Directory names are usually the Telegram supergroup/channel ID, e.g. -100xxxxxxxxxx
-100*/
# ========== Script-generated lists (can be regenerated at any time) ==========
channels_list.csv
# ========== Python ==========
__pycache__/
*.py[cod]
*$py.class
.Python
venv/
.venv/
*.egg-info/
.eggs/
dist/
build/
# ========== Editors / local tools ==========
.cursor/
.vscode/
.idea/
*.swp
*.swo
.DS_Store
Thumbs.db
# ========== Logs and temporary files ==========
*.log
*.tmp
*.temp

23
Dockerfile Normal file

@@ -0,0 +1,23 @@
# Runs the web console; scraped data is mounted as a volume at /data (see the docker-compose notes)
FROM python:3.11-slim
ENV PYTHONUNBUFFERED=1 \
PYTHONDONTWRITEBYTECODE=1
WORKDIR /app
RUN apt-get update \
&& apt-get install -y --no-install-recommends ca-certificates \
&& rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
RUN pip install --no-cache-dir -U pip setuptools wheel \
&& pip install --no-cache-dir -r requirements.txt
COPY telegram-scraper.py app_web.py ./
COPY templates ./templates/
COPY static ./static/
EXPOSE 8000
CMD ["uvicorn", "app_web:app", "--host", "0.0.0.0", "--port", "8000"]

239
README.md

@@ -1,5 +1,17 @@
# Telegram Channel Scraper 📱 # Telegram Channel Scraper 📱
> **⚠️ DISCONTINUED**
>
> This project is no longer maintained. After a lot of support and interest from the community, a far more capable successor has been released:
>
> **➜ [Harrier — Telegram Scraping & Intelligence Platform](https://github.com/skuggrev/harrier)**
>
> Harrier has everything this tool had and much more - web UI, real-time progress, user lookup, webhook alerts, continuous scraping, and a proper export system. I recommend switching over.
>
> A huge thank you to everyone who used, starred, and supported this project.
---
A powerful Python script that allows you to scrape messages and media from Telegram channels using the Telethon library. Features include real-time continuous scraping, media downloading, and data export capabilities. A powerful Python script that allows you to scrape messages and media from Telegram channels using the Telethon library. Features include real-time continuous scraping, media downloading, and data export capabilities.
``` ```
@@ -10,17 +22,43 @@ ___________________ _________
|____| \______ /_______ / |____| \______ /_______ /
\/ \/ \/ \/
``` ```
## What's New in v3.1 🎉
**Enhanced Message Data:**
- **Message statistics** - Captures views, forwards, and post_author for each message
- **Reactions support** - Records all emoji reactions with counts (e.g., "😀 12 👍 3")
- **Automatic database migration** - Seamlessly adds new columns to existing databases
- **Richer exports** - All new data included in CSV/JSON exports
**Improved Channel Management:**
- **Channel names displayed** - Shows channel names alongside IDs everywhere
- **Smart filtering** - List option now only shows Channels and Groups (no private chats)
- **channels_list.csv export** - Automatically saves channel list with names, IDs, usernames, and types
- **"all" selection** - Quickly add all listed channels at once
- **Better export naming** - Files now named as `ID_username.csv` and `ID_username.json`
**Bug Fixes:**
- **Fixed channel ID parsing** - Resolved "invalid literal for int()" error in fix missing media
- **Better entity resolution** - Handles both numeric IDs and channel usernames
- **Improved error messages** - Shows channel names with IDs for clearer debugging
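
The reactions format quoted above (e.g., "😀 12 👍 3") can be produced with a one-line join. This is only a sketch: `format_reactions` is a hypothetical helper, and it assumes the reactions have already been reduced to `(emoji, count)` pairs rather than modelling Telethon's own reaction objects.

```python
def format_reactions(reactions):
    """Join (emoji, count) pairs into a string such as "😀 12 👍 3".

    Hypothetical helper for illustration; the input shape is an assumption.
    """
    return " ".join(f"{emoji} {count}" for emoji, count in reactions)
```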
## Features 🚀 ## Features 🚀
- Scrape messages from multiple Telegram channels - **QR Code & Phone Authentication** - Choose your preferred login method
- Download media files (photos, documents) - Scrape messages with full metadata (views, forwards, reactions, post author)
- Download media files with parallel processing and unique naming
- Real-time continuous scraping - Real-time continuous scraping
- Export data to JSON and CSV formats - Export data to JSON and CSV formats with enhanced metadata
- SQLite database storage - SQLite database storage with automatic schema migration
- Resume capability (saves progress) - Resume capability (saves progress)
- Media reprocessing for failed downloads - Interactive menu with channel names and numbered selection
- Progress tracking - Smart channel filtering (only shows channels/groups)
- Interactive menu interface - Progress tracking with visual progress bars
- Automatic channels list export to CSV
## Prerequisites 📋 ## Prerequisites 📋
@@ -36,13 +74,6 @@ Before running the script, you'll need:
pip install -r requirements.txt pip install -r requirements.txt
``` ```
Contents of `requirements.txt`:
```
telethon
aiohttp
asyncio
```
## Getting Telegram API Credentials 🔑 ## Getting Telegram API Credentials 🔑
1. Visit https://my.telegram.org/auth 1. Visit https://my.telegram.org/auth
@@ -57,140 +88,149 @@ asyncio
6. You'll receive: 6. You'll receive:
- `api_id`: A number - `api_id`: A number
- `api_hash`: A string of letters and numbers - `api_hash`: A string of letters and numbers
Keep these credentials safe, you'll need them to run the script! Keep these credentials safe, you'll need them to run the script!
## Setup and Running 🔧 ## Setup and Running 🔧
1. Clone the repository: 1. Clone the repository:
```bash ```bash
git clone https://github.com/unnohwn/telegram-scraper.git git clone https://github.com/unnohwn/telegram-scraper.git
cd telegram-scraper cd telegram-scraper
``` ```
2. Install requirements: 2. Install requirements:
```bash ```bash
pip install -r requirements.txt pip install -r requirements.txt
``` ```
3. Run the script: 3. Run the script:
```bash ```bash
python telegram-scraper.py python telegram-scraper.py
``` ```
4. On first run, you'll be prompted to enter: 4. On first run, you'll be prompted to enter:
- Your API ID - Your API ID (from my.telegram.org)
- Your API Hash - Your API Hash (from my.telegram.org)
- Your phone number (with country code) - **Choose authentication method:**
- Your phone number (with country code) or bot, but use the phone number option when prompted second time. - **QR Code** (Recommended) - Scan with your phone (no phone number needed)
- Verification code (sent to your Telegram) - **Phone Number** - Traditional SMS verification
## Initial Scraping Behavior 🕒 ## Web Console (MVP) 🌐
When scraping a channel for the first time, please note: You can run a simple web control panel that manages `.env` configuration and starts/stops the scraper process:
- The script will attempt to retrieve the entire channel history, starting from the oldest messages ```bash
- Initial scraping can take several minutes or even hours, depending on: pip install -r requirements.txt
- The total number of messages in the channel uvicorn app_web:app --host 0.0.0.0 --port 8000 --reload
- Whether media downloading is enabled ```
- The size and number of media files
- Your internet connection speed Then open:
- Telegram's rate limiting
- The script uses pagination and maintains state, so if interrupted, it can resume from where it left off ```text
- Progress percentage is displayed in real-time to track the scraping status http://127.0.0.1:8000
- Messages are stored in the database as they are scraped, so you can start analyzing available data even before the scraping is complete ```
Features:
- Edit core config values from the web page (saved back to `.env`)
- Start / stop scraper process from browser
- View recent runtime logs
## Usage 📝 ## Usage 📝
The script provides an interactive menu with the following options: The script provides a clean interactive menu:
- **[A]** Add new channel ```
- Enter the channel ID or channelname ========================================
- **[R]** Remove channel TELEGRAM SCRAPER
- Remove a channel from scraping list ========================================
- **[S]** Scrape all channels [S] Scrape channels
- One-time scraping of all configured channels [C] Continuous scraping
- **[M]** Toggle media scraping [M] Media scraping: ON
- Enable/disable downloading of media files [L] List & add channels
- **[C]** Continuous scraping [R] Remove channels
- Real-time monitoring of channels for new messages [E] Export data
- **[E]** Export data [T] Rescrape media
- Export to JSON and CSV formats [Q] Quit
- **[V]** View saved channels ========================================
- List all saved channels ```
- **[L]** List account channels
- List all channels with ID:s for account
- **[Q]** Quit
### Channel IDs 📢 ### Channel Selection Made Easy 🔢
You can use either: Instead of typing long channel IDs, use numbers:
- Channel username (e.g., `channelname`)
- Channel ID (e.g., `-1001234567890`) **Adding Channels:**
```
[1] Tech News (ID: -1002116176890, Type: Channel, Username: @technews)
[2] Python Dev (ID: -1001597139842, Type: Group, Username: @pythondev)
[3] Daily Updates (ID: -1002274713954, Type: Channel, Username: @dailyupdates)
Enter: 1,3 (adds channels 1 and 3)
Or: all (adds all listed channels)
```
**Viewing Your Channels:**
```
[1] Tech News (ID: -1002116176890), Last Message ID: 5234, Messages: 12450
[2] Python Dev (ID: -1001597139842), Last Message ID: 8192, Messages: 45782
```
**Scraping Channels:**
- Single: `1`
- Multiple: `1,3,5`
- All: `all`
- Mix formats: `1,-1001597139842,3`
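
The selection formats above (`1`, `1,3,5`, `all`, and the mixed form `1,-1001597139842,3`) could be parsed roughly as follows. This is a minimal sketch, not the script's actual code: `parse_selection` and the `channels` argument are hypothetical names, and treating any token that starts with `-` as a raw channel ID is an assumption.

```python
def parse_selection(text, channels):
    """Resolve a selection string to a list of channel IDs.

    `channels` is the ordered list of channel IDs as displayed ([1], [2], ...).
    Hypothetical helper for illustration only.
    """
    text = text.strip()
    if text.lower() == "all":
        return list(channels)

    selected = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue
        if token.startswith("-"):
            # Assumed to be a raw channel ID such as -1001597139842.
            channel_id = int(token)
        else:
            # Otherwise a 1-based position in the displayed list.
            index = int(token)
            if not 1 <= index <= len(channels):
                raise ValueError(f"no channel numbered {index}")
            channel_id = channels[index - 1]
        if channel_id not in selected:
            selected.append(channel_id)
    return selected
```

Duplicates are dropped so that a mixed selection naming the same channel twice (once by number, once by ID) scrapes it only once.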
## Data Storage 💾 ## Data Storage 💾
### Database Structure ### Database Structure
Data is stored in SQLite databases, one per channel: Data is stored in SQLite databases, one per channel:
- Location: `./channelname/channelname.db` - Location: `./channelname/channelname.db`
- Table: `messages` - Optimized with indexes for fast queries
- `id`: Primary key - WAL mode for better performance
- `message_id`: Telegram message ID - Schema includes: message_id, date, sender info, message text, media info, reply_to, post_author, views, forwards, reactions
- `date`: Message timestamp - Automatic migration adds new columns to existing databases
- `sender_id`: Sender's Telegram ID
- `first_name`: Sender's first name
- `last_name`: Sender's last name
- `username`: Sender's username
- `message`: Message text
- `media_type`: Type of media (if any)
- `media_path`: Local path to downloaded media
- `reply_to`: ID of replied message (if any)
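
The WAL mode and automatic migration listed above might look like this with the standard `sqlite3` module. It is a sketch under assumptions: the column types and the `migrate` helper are illustrative and not taken from `telegram-scraper.py`; only the column names come from the schema list.

```python
import sqlite3

# Columns the v3.1 notes describe as new; the SQL types are assumptions.
NEW_COLUMNS = {
    "post_author": "TEXT",
    "views": "INTEGER",
    "forwards": "INTEGER",
    "reactions": "TEXT",
}

def migrate(conn):
    """Add any missing columns to an existing `messages` table.

    Returns the names of the columns that were added.
    """
    # Write-ahead logging lets readers proceed while a writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    # Column 1 of each table_info row is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(messages)")}
    added = []
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE messages ADD COLUMN {name} {sql_type}")
            added.append(name)
    conn.commit()
    return added
```

Because the check compares against the live schema, running it again on an already migrated database is a no-op.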
### Media Storage 📁 ### Media Storage 📁
Media files are stored in: Media files are stored with unique naming:
- Location: `./channelname/media/` - Location: `./channelname/media/`
- Files are named using message ID or original filename - Format: `{message_id}-{unique_id}-{original_name}.ext`
- **No more file overwrites** - Each file gets a unique name
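
One way to build names in the `{message_id}-{unique_id}-{original_name}.ext` format is sketched below. The README does not say where `unique_id` comes from, so the random hex string here is an assumption, as is the `unique_media_name` helper itself.

```python
import uuid
from pathlib import Path

def unique_media_name(message_id, original_name, unique_id=None):
    """Build `{message_id}-{unique_id}-{original_name}.ext`.

    Hypothetical helper; `unique_id` defaults to 8 random hex characters.
    """
    path = Path(original_name)
    unique_id = unique_id or uuid.uuid4().hex[:8]
    # stem/suffix split "photo.jpg" into "photo" and ".jpg" and drop any directories.
    return f"{message_id}-{unique_id}-{path.stem}{path.suffix}"
```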
### Exported Data 📊 ### Exported Data 📊
Data can be exported in two formats: Export formats:
1. **CSV**: `./channelname/channelname.csv`
- Human-readable spreadsheet format
- Easy to import into Excel/Google Sheets
2. **JSON**: `./channelname/channelname.json` 1. **CSV**: `./channelname/channelid_username.csv`
- Structured data format 2. **JSON**: `./channelname/channelid_username.json`
- Ideal for programmatic processing 3. **Channel List**: `./channels_list.csv` (automatically created when using [L] option)
## Features in Detail 🔍 All exports include complete message metadata: views, forwards, reactions, and post author information.
### Continuous Scraping ## Performance Features ⚙️
The continuous scraping feature (`[C]` option) allows you to: - **5 concurrent downloads** for faster media processing
- Monitor channels in real-time - **Batch database operations** for optimal speed
- Automatically download new messages - **Progress bars** with real-time feedback
- Download media as it's posted - **Resume capability** - Continue where you left off
- Run indefinitely until interrupted (Ctrl+C) - **Memory-efficient** exports for large datasets
- Maintains state between runs
### Media Handling
The script can download:
- Photos
- Documents
- Other media types supported by Telegram
- Automatically retries failed downloads
- Skips existing files to avoid duplicates
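
The "5 concurrent downloads" limit above is commonly implemented with an `asyncio.Semaphore`. The sketch below shows that pattern only; `download_all` and the `download_one` callback are hypothetical names, and the script's real download code is not shown in this diff.

```python
import asyncio

MAX_CONCURRENT = 5  # matches the "5 concurrent downloads" figure above

async def download_all(items, download_one, limit=MAX_CONCURRENT):
    """Run `download_one(item)` for every item, at most `limit` at a time.

    Results are returned in the same order as `items`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        # Only `limit` coroutines can be inside this block at once.
        async with semaphore:
            return await download_one(item)

    return await asyncio.gather(*(guarded(item) for item in items))
```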
## Error Handling 🛠️ ## Error Handling 🛠️
The script includes: - Automatic retry with exponential backoff
- Automatic retry mechanism for failed media downloads - Rate limit compliance
- State preservation in case of interruption - Network error recovery
- Flood control compliance - State preservation during interruptions
- Error logging for failed operations
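
Retry with exponential backoff, as listed above, can be sketched as follows. `with_retry` and its parameters are hypothetical and the delays are illustrative defaults. With Telethon, a rate-limit response is raised as `FloodWaitError`, which carries the number of seconds to wait; a real implementation would presumably sleep for that duration instead of the computed delay.

```python
import asyncio
import random

async def with_retry(operation, attempts=5, base_delay=1.0, max_delay=60.0,
                     sleep=asyncio.sleep):
    """Await `operation()` and retry on failure with exponential backoff.

    Hypothetical helper; `sleep` is injectable so the delays can be tested.
    """
    for attempt in range(attempts):
        try:
            return await operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # The delay doubles each attempt (1s, 2s, 4s, ...), capped at
            # max_delay, plus up to 10% jitter so parallel retries spread out.
            delay = min(max_delay, base_delay * 2 ** attempt)
            await sleep(delay + random.uniform(0, delay * 0.1))
```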
## Limitations ⚠️ ## Limitations ⚠️
@@ -198,10 +238,6 @@ The script includes:
- Can only access public channels or channels you're a member of - Can only access public channels or channels you're a member of
- Media download size limits apply as per Telegram's restrictions - Media download size limits apply as per Telegram's restrictions
## Contributing 🤝
Contributions are welcome! Please feel free to submit a Pull Request.
## License 📄 ## License 📄
This project is licensed under the MIT License - see the LICENSE file for details. This project is licensed under the MIT License - see the LICENSE file for details.
@@ -209,6 +245,7 @@ This project is licensed under the MIT License - see the LICENSE file for detail
## Disclaimer ⚖️ ## Disclaimer ⚖️
This tool is for educational purposes only. Make sure to: This tool is for educational purposes only. Make sure to:
- Respect Telegram's Terms of Service - Respect Telegram's Terms of Service
- Obtain necessary permissions before scraping - Obtain necessary permissions before scraping
- Use responsibly and ethically - Use responsibly and ethically

1117
app_web.py Normal file

File diff suppressed because it is too large

17
docker-compose.yml Normal file

@@ -0,0 +1,17 @@
# Usage (from the project root):
# docker compose build
# docker compose up -d
# Data persistence: mount the project directory on the host at /data, matching the app's working directory (see volumes below)
services:
  web:
    build: .
    image: telegram-scraper-web:local
    container_name: telegram-scraper-web
    restart: unless-stopped
    ports:
      - "8000:8000"
    working_dir: /data
    volumes:
      # Change this to the absolute path on your server that holds the existing code, .env, state, session files, and the -100* directories
      - ${HOST_PROJECT_DIR:-.}:/data
    command: ["uvicorn", "app_web:app", "--host", "0.0.0.0", "--port", "8000"]


@@ -1,3 +1,10 @@
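telethon # Direct dependencies (transitive dependencies are resolved automatically by pip)
aiohttp # If installation still fails, run python3 --version first. The Python 3.6 that ships with CentOS is too old; install python39/python311 before running pip install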
asyncio Telethon>=1.28.0,<2
fastapi>=0.65.0,<1
uvicorn>=0.17.0,<1
itsdangerous>=2.0.0
jinja2>=3.0.0,<4
python-multipart>=0.0.5
qrcode>=7.3.0
PySocks>=1.7.0

BIN
static/favicon.png Normal file

Binary file not shown. Size: 619 KiB

File diff suppressed because it is too large

1564
templates/index.html Normal file

File diff suppressed because it is too large