Compare commits
29 Commits
6c98710320
...
main
| Author | SHA1 | Date | |
|---|---|---|---|
| | 77f0d404fa | | |
| | cf64bc4703 | | |
| | ff022bce5d | | |
| | 4ae6898be0 | | |
| | 440416ba0c | | |
| | 459a5299a0 | | |
| | d4378afbc9 | | |
| | dfb5fe0c89 | | |
| | 384d7e4838 | | |
| | e30292e330 | | |
| | ec804afc60 | | |
| | 4c48525b3a | | |
| | b00a0c40d8 | | |
| | 5ec4c38495 | | |
| | 2d0eeaa78f | | |
| | c84141674a | | |
| | fb7ad3742e | | |
| | 8d4e092b1b | | |
| | e7bf2b1ed7 | | |
| | 7db46018ce | | |
| | 65b221ade6 | | |
| | ac7d6de06b | | |
| | 57bf125ca1 | | |
| | f383f222c4 | | |
| | 6273c9c11c | | |
| | 85d3f0f935 | | |
| | 30bda684fe | | |
| | aa9b756d37 | | |
| | 6baf4bdd13 | | |
13
.dockerignore
Normal file
@@ -0,0 +1,13 @@
# Excluded from the build context when building the image (keeps it small; data lives in host volumes)
.git
.env
.env.*
*.session
*.session-journal
state.json
-100*/
__pycache__
*.pyc
.cursor
.venv
venv
41
.gitignore
vendored
Normal file
@@ -0,0 +1,41 @@
# ========== Secrets and login (never commit to the remote) ==========
.env
.env.*
*.session
*.session-journal

# ========== Runtime state and scrape progress (paired with channel data; do not commit) ==========
state.json

# ========== Per-channel scrape results (SQLite, media, export files) ==========
# Directory names are usually the Telegram supergroup/channel ID (-100xxxxxxxxxx)
-100*/

# ========== Script-generated list (can be regenerated at any time) ==========
channels_list.csv

# ========== Python ==========
__pycache__/
*.py[cod]
*$py.class
.Python
venv/
.venv/
*.egg-info/
.eggs/
dist/
build/

# ========== Editors / local tools ==========
.cursor/
.vscode/
.idea/
*.swp
*.swo
.DS_Store
Thumbs.db

# ========== Logs and temporary files ==========
*.log
*.tmp
*.temp
23
Dockerfile
Normal file
@@ -0,0 +1,23 @@
# Runs the web console; scraped data is mounted as a volume at /data (see the notes in docker-compose)
FROM python:3.11-slim

ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1

WORKDIR /app

RUN apt-get update \
    && apt-get install -y --no-install-recommends ca-certificates \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir -U pip setuptools wheel \
    && pip install --no-cache-dir -r requirements.txt

COPY telegram-scraper.py app_web.py ./
COPY templates ./templates/
COPY static ./static/

EXPOSE 8000

CMD ["uvicorn", "app_web:app", "--host", "0.0.0.0", "--port", "8000"]
239
README.md
@@ -1,5 +1,17 @@
|
|||||||
# Telegram Channel Scraper 📱
|
# Telegram Channel Scraper 📱
|
||||||
|
|
||||||
|
> **⚠️ DISCONTINUED**
|
||||||
|
>
|
||||||
|
> This project is no longer maintained. After a lot of support and interest from the community, a far more capable successor has been released:
|
||||||
|
>
|
||||||
|
> **➜ [Harrier — Telegram Scraping & Intelligence Platform](https://github.com/skuggrev/harrier)**
|
||||||
|
>
|
||||||
|
> Harrier has everything this tool had and much more - web UI, real-time progress, user lookup, webhook alerts, continuous scraping, and a proper export system. I recommend switching over.
|
||||||
|
>
|
||||||
|
> A huge thank you to everyone who used, starred, and supported this project.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
A powerful Python script that allows you to scrape messages and media from Telegram channels using the Telethon library. Features include real-time continuous scraping, media downloading, and data export capabilities.
|
A powerful Python script that allows you to scrape messages and media from Telegram channels using the Telethon library. Features include real-time continuous scraping, media downloading, and data export capabilities.
|
||||||
|
|
||||||
```
|
```
|
||||||
@@ -10,17 +22,43 @@ ___________________ _________
|
|||||||
|____| \______ /_______ /
|
|____| \______ /_______ /
|
||||||
\/ \/
|
\/ \/
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## What's New in v3.1 🎉
|
||||||
|
|
||||||
|
**Enhanced Message Data:**
|
||||||
|
|
||||||
|
- **Message statistics** - Captures views, forwards, and post_author for each message
|
||||||
|
- **Reactions support** - Records all emoji reactions with counts (e.g., "😀 12 👍 3")
|
||||||
|
- **Automatic database migration** - Seamlessly adds new columns to existing databases
|
||||||
|
- **Richer exports** - All new data included in CSV/JSON exports
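
The reactions format quoted above (`"😀 12 👍 3"`) can be sketched as a small helper. This is only an illustration: `format_reactions` and its list of `(emoji, count)` pairs are hypothetical names and inputs, not the script's actual code or Telethon's data model.

```python
def format_reactions(reactions):
    """Join (emoji, count) pairs into one string, e.g. '😀 12 👍 3'.

    `reactions` is assumed to be an iterable of (emoji, count) tuples;
    entries with a zero count are skipped.
    """
    return " ".join(f"{emoji} {count}" for emoji, count in reactions if count > 0)


print(format_reactions([("😀", 12), ("👍", 3)]))
```

Storing reactions as a single text value like this keeps the database schema and the CSV export flat, at the cost of having to split the string again for analysis.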
|
||||||
|
|
||||||
|
**Improved Channel Management:**
|
||||||
|
|
||||||
|
- **Channel names displayed** - Shows channel names alongside IDs everywhere
|
||||||
|
- **Smart filtering** - List option now only shows Channels and Groups (no private chats)
|
||||||
|
- **channels_list.csv export** - Automatically saves channel list with names, IDs, usernames, and types
|
||||||
|
- **"all" selection** - Quickly add all listed channels at once
|
||||||
|
- **Better export naming** - Files now named as `ID_username.csv` and `ID_username.json`
|
||||||
|
|
||||||
|
**Bug Fixes:**
|
||||||
|
|
||||||
|
- **Fixed channel ID parsing** - Resolved "invalid literal for int()" error in fix missing media
|
||||||
|
- **Better entity resolution** - Handles both numeric IDs and channel usernames
|
||||||
|
- **Improved error messages** - Shows channel names with IDs for clearer debugging
|
||||||
|
|
||||||
## Features 🚀
|
## Features 🚀
|
||||||
|
|
||||||
- Scrape messages from multiple Telegram channels
|
- **QR Code & Phone Authentication** - Choose your preferred login method
|
||||||
- Download media files (photos, documents)
|
- Scrape messages with full metadata (views, forwards, reactions, post author)
|
||||||
|
- Download media files with parallel processing and unique naming
|
||||||
- Real-time continuous scraping
|
- Real-time continuous scraping
|
||||||
- Export data to JSON and CSV formats
|
- Export data to JSON and CSV formats with enhanced metadata
|
||||||
- SQLite database storage
|
- SQLite database storage with automatic schema migration
|
||||||
- Resume capability (saves progress)
|
- Resume capability (saves progress)
|
||||||
- Media reprocessing for failed downloads
|
- Interactive menu with channel names and numbered selection
|
||||||
- Progress tracking
|
- Smart channel filtering (only shows channels/groups)
|
||||||
- Interactive menu interface
|
- Progress tracking with visual progress bars
|
||||||
|
- Automatic channels list export to CSV
|
||||||
|
|
||||||
## Prerequisites 📋
|
## Prerequisites 📋
|
||||||
|
|
||||||
@@ -36,13 +74,6 @@ Before running the script, you'll need:
|
|||||||
pip install -r requirements.txt
|
pip install -r requirements.txt
|
||||||
```
|
```
|
||||||
|
|
||||||
Contents of `requirements.txt`:
|
|
||||||
```
|
|
||||||
telethon
|
|
||||||
aiohttp
|
|
||||||
asyncio
|
|
||||||
```
|
|
||||||
|
|
||||||
## Getting Telegram API Credentials 🔑
|
## Getting Telegram API Credentials 🔑
|
||||||
|
|
||||||
1. Visit https://my.telegram.org/auth
|
1. Visit https://my.telegram.org/auth
|
||||||
@@ -57,140 +88,149 @@ asyncio
|
|||||||
6. You'll receive:
|
6. You'll receive:
|
||||||
- `api_id`: A number
|
- `api_id`: A number
|
||||||
- `api_hash`: A string of letters and numbers
|
- `api_hash`: A string of letters and numbers
|
||||||
|
|
||||||
Keep these credentials safe, you'll need them to run the script!
|
Keep these credentials safe, you'll need them to run the script!
|
||||||
|
|
||||||
## Setup and Running 🔧
|
## Setup and Running 🔧
|
||||||
|
|
||||||
1. Clone the repository:
|
1. Clone the repository:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
git clone https://github.com/unnohwn/telegram-scraper.git
|
git clone https://github.com/unnohwn/telegram-scraper.git
|
||||||
cd telegram-scraper
|
cd telegram-scraper
|
||||||
```
|
```
|
||||||
|
|
||||||
2. Install requirements:
|
2. Install requirements:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
pip install -r requirements.txt
|
pip install -r requirements.txt
|
||||||
```
|
```
|
||||||
|
|
||||||
3. Run the script:
|
3. Run the script:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
python telegram-scraper.py
|
python telegram-scraper.py
|
||||||
```
|
```
|
||||||
|
|
||||||
4. On first run, you'll be prompted to enter:
|
4. On first run, you'll be prompted to enter:
|
||||||
- Your API ID
|
- Your API ID (from my.telegram.org)
|
||||||
- Your API Hash
|
- Your API Hash (from my.telegram.org)
|
||||||
- Your phone number (with country code)
|
- **Choose authentication method:**
|
||||||
- Your phone number (with country code) or bot, but use the phone number option when prompted second time.
|
- **QR Code** (Recommended) - Scan with your phone (no phone number needed)
|
||||||
- Verification code (sent to your Telegram)
|
- **Phone Number** - Traditional SMS verification
|
||||||
|
|
||||||
## Initial Scraping Behavior 🕒
|
## Web Console (MVP) 🌐
|
||||||
|
|
||||||
When scraping a channel for the first time, please note:
|
You can run a simple web control panel that manages `.env` configuration and starts/stops the scraper process:
|
||||||
|
|
||||||
- The script will attempt to retrieve the entire channel history, starting from the oldest messages
|
```bash
|
||||||
- Initial scraping can take several minutes or even hours, depending on:
|
pip install -r requirements.txt
|
||||||
- The total number of messages in the channel
|
uvicorn app_web:app --host 0.0.0.0 --port 8000 --reload
|
||||||
- Whether media downloading is enabled
|
```
|
||||||
- The size and number of media files
|
|
||||||
- Your internet connection speed
|
Then open:
|
||||||
- Telegram's rate limiting
|
|
||||||
- The script uses pagination and maintains state, so if interrupted, it can resume from where it left off
|
```text
|
||||||
- Progress percentage is displayed in real-time to track the scraping status
|
http://127.0.0.1:8000
|
||||||
- Messages are stored in the database as they are scraped, so you can start analyzing available data even before the scraping is complete
|
```
|
||||||
|
|
||||||
|
Features:
|
||||||
|
- Edit core config values from the web page (saved back to `.env`)
|
||||||
|
- Start / stop scraper process from browser
|
||||||
|
- View recent runtime logs
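
Saving edited values back to `.env` could look roughly like the sketch below. The diff for `app_web.py` is suppressed on this page, so this is not the project's implementation; `update_env_text`, `update_env_file`, and the `API_ID` key used in the usage line are hypothetical.

```python
from pathlib import Path


def update_env_text(text, updates):
    """Return .env text with the given KEY=value pairs replaced or appended.

    Comments, blank lines, and keys that are not being updated are kept as-is.
    """
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_assignment = "=" in line and not line.lstrip().startswith("#")
        if is_assignment and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")  # replace existing value
        else:
            out.append(line)
    # Keys that were not present yet are appended at the end.
    out.extend(f"{key}={value}" for key, value in remaining.items())
    return "\n".join(out) + "\n"


def update_env_file(path, updates):
    """Apply update_env_text to the file at `path`, creating it if missing."""
    env_path = Path(path)
    current = env_path.read_text(encoding="utf-8") if env_path.exists() else ""
    env_path.write_text(update_env_text(current, updates), encoding="utf-8")
```

A call such as `update_env_file(".env", {"API_ID": "12345"})` would then rewrite only that key and leave the rest of the file untouched.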
|
||||||
|
|
||||||
## Usage 📝
|
## Usage 📝
|
||||||
|
|
||||||
The script provides an interactive menu with the following options:
|
The script provides a clean interactive menu:
|
||||||
|
|
||||||
- **[A]** Add new channel
|
```
|
||||||
- Enter the channel ID or channelname
|
========================================
|
||||||
- **[R]** Remove channel
|
TELEGRAM SCRAPER
|
||||||
- Remove a channel from scraping list
|
========================================
|
||||||
- **[S]** Scrape all channels
|
[S] Scrape channels
|
||||||
- One-time scraping of all configured channels
|
[C] Continuous scraping
|
||||||
- **[M]** Toggle media scraping
|
[M] Media scraping: ON
|
||||||
- Enable/disable downloading of media files
|
[L] List & add channels
|
||||||
- **[C]** Continuous scraping
|
[R] Remove channels
|
||||||
- Real-time monitoring of channels for new messages
|
[E] Export data
|
||||||
- **[E]** Export data
|
[T] Rescrape media
|
||||||
- Export to JSON and CSV formats
|
[Q] Quit
|
||||||
- **[V]** View saved channels
|
========================================
|
||||||
- List all saved channels
|
```
|
||||||
- **[L]** List account channels
|
|
||||||
- List all channels with ID:s for account
|
|
||||||
- **[Q]** Quit
|
|
||||||
|
|
||||||
### Channel IDs 📢
|
### Channel Selection Made Easy 🔢
|
||||||
|
|
||||||
You can use either:
|
Instead of typing long channel IDs, use numbers:
|
||||||
- Channel username (e.g., `channelname`)
|
|
||||||
- Channel ID (e.g., `-1001234567890`)
|
**Adding Channels:**
|
||||||
|
|
||||||
|
```
|
||||||
|
[1] Tech News (ID: -1002116176890, Type: Channel, Username: @technews)
|
||||||
|
[2] Python Dev (ID: -1001597139842, Type: Group, Username: @pythondev)
|
||||||
|
[3] Daily Updates (ID: -1002274713954, Type: Channel, Username: @dailyupdates)
|
||||||
|
|
||||||
|
Enter: 1,3 (adds channels 1 and 3)
|
||||||
|
Or: all (adds all listed channels)
|
||||||
|
```
|
||||||
|
|
||||||
|
**Viewing Your Channels:**
|
||||||
|
|
||||||
|
```
|
||||||
|
[1] Tech News (ID: -1002116176890), Last Message ID: 5234, Messages: 12450
|
||||||
|
[2] Python Dev (ID: -1001597139842), Last Message ID: 8192, Messages: 45782
|
||||||
|
```
|
||||||
|
|
||||||
|
**Scraping Channels:**
|
||||||
|
|
||||||
|
- Single: `1`
|
||||||
|
- Multiple: `1,3,5`
|
||||||
|
- All: `all`
|
||||||
|
- Mix formats: `1,-1001597139842,3`
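
The selection formats above can be parsed along these lines. This is a minimal sketch under stated assumptions: `parse_selection` is a hypothetical helper, it handles only numeric input (not usernames), and it treats any number outside the list positions as a raw channel ID.

```python
def parse_selection(text, listed_ids):
    """Resolve input such as '1,3', 'all', or '1,-1001597139842,3' to channel IDs.

    `listed_ids` is the list of channel IDs in the order they were displayed.
    """
    text = text.strip()
    if text.lower() == "all":
        return list(listed_ids)

    chosen = []
    for token in (part.strip() for part in text.split(",")):
        if not token:
            continue  # tolerate stray commas
        value = int(token)  # raises ValueError on non-numeric input
        if 1 <= value <= len(listed_ids):
            channel_id = listed_ids[value - 1]  # a list position
        else:
            channel_id = value  # assumed to be a raw channel ID
        if channel_id not in chosen:
            chosen.append(channel_id)  # keep order, drop duplicates
    return chosen


ids = [-1002116176890, -1001597139842, -1002274713954]
print(parse_selection("1,3", ids))
```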
|
||||||
|
|
||||||
## Data Storage 💾
|
## Data Storage 💾
|
||||||
|
|
||||||
### Database Structure
|
### Database Structure
|
||||||
|
|
||||||
Data is stored in SQLite databases, one per channel:
|
Data is stored in SQLite databases, one per channel:
|
||||||
|
|
||||||
- Location: `./channelname/channelname.db`
|
- Location: `./channelname/channelname.db`
|
||||||
- Table: `messages`
|
- Optimized with indexes for fast queries
|
||||||
- `id`: Primary key
|
- WAL mode for better performance
|
||||||
- `message_id`: Telegram message ID
|
- Schema includes: message_id, date, sender info, message text, media info, reply_to, post_author, views, forwards, reactions
|
||||||
- `date`: Message timestamp
|
- Automatic migration adds new columns to existing databases
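
One common way to add new columns to an existing SQLite database is to compare `PRAGMA table_info` against the expected columns and issue `ALTER TABLE ... ADD COLUMN` for whatever is missing. The sketch below assumes the table is called `messages` and guesses the column types; `migrate_messages_table` is a hypothetical name, and the script's own migration may differ.

```python
import sqlite3

# Columns named in the v3.1 notes; the SQL types are assumptions.
NEW_COLUMNS = {
    "post_author": "TEXT",
    "views": "INTEGER",
    "forwards": "INTEGER",
    "reactions": "TEXT",
}


def migrate_messages_table(conn):
    """Add any missing columns to `messages` and return the names added."""
    # Row index 1 of PRAGMA table_info is the column name.
    existing = {row[1] for row in conn.execute("PRAGMA table_info(messages)")}
    added = []
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE messages ADD COLUMN {name} {sql_type}")
            added.append(name)
    conn.commit()
    return added


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, message_id INTEGER, views INTEGER)"
)
print(migrate_messages_table(conn))  # columns added on the first run
print(migrate_messages_table(conn))  # nothing left to add on the second run
```

Because the check runs against the live schema, the function is safe to call on every startup.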
|
||||||
- `sender_id`: Sender's Telegram ID
|
|
||||||
- `first_name`: Sender's first name
|
|
||||||
- `last_name`: Sender's last name
|
|
||||||
- `username`: Sender's username
|
|
||||||
- `message`: Message text
|
|
||||||
- `media_type`: Type of media (if any)
|
|
||||||
- `media_path`: Local path to downloaded media
|
|
||||||
- `reply_to`: ID of replied message (if any)
|
|
||||||
|
|
||||||
### Media Storage 📁
|
### Media Storage 📁
|
||||||
|
|
||||||
Media files are stored in:
|
Media files are stored with unique naming:
|
||||||
|
|
||||||
- Location: `./channelname/media/`
|
- Location: `./channelname/media/`
|
||||||
- Files are named using message ID or original filename
|
- Format: `{message_id}-{unique_id}-{original_name}.ext`
|
||||||
|
- **No more file overwrites** - Each file gets a unique name
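
The naming format above can be sketched as follows. What the script actually uses for `unique_id` is not shown in this diff, so the random hex fallback here is an assumption, and `media_filename` is a hypothetical helper.

```python
import os
import uuid


def media_filename(message_id, original_name, unique_id=None):
    """Build '{message_id}-{unique_id}-{original_name}.ext'.

    If no unique_id is supplied, a short random hex string is used
    (an assumption made for this sketch).
    """
    stem, ext = os.path.splitext(os.path.basename(original_name))
    unique_id = unique_id or uuid.uuid4().hex[:8]
    return f"{message_id}-{unique_id}-{stem}{ext}"


print(media_filename(5234, "photo.jpg", unique_id="ab12cd34"))
```

Putting the message ID first keeps files for the same message adjacent when the media directory is sorted by name.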
|
||||||
|
|
||||||
### Exported Data 📊
|
### Exported Data 📊
|
||||||
|
|
||||||
Data can be exported in two formats:
|
Export formats:
|
||||||
1. **CSV**: `./channelname/channelname.csv`
|
|
||||||
- Human-readable spreadsheet format
|
|
||||||
- Easy to import into Excel/Google Sheets
|
|
||||||
|
|
||||||
2. **JSON**: `./channelname/channelname.json`
|
1. **CSV**: `./channelname/channelid_username.csv`
|
||||||
- Structured data format
|
2. **JSON**: `./channelname/channelid_username.json`
|
||||||
- Ideal for programmatic processing
|
3. **Channel List**: `./channels_list.csv` (automatically created when using [L] option)
|
||||||
|
|
||||||
## Features in Detail 🔍
|
All exports include complete message metadata: views, forwards, reactions, and post author information.
|
||||||
|
|
||||||
### Continuous Scraping
|
## Performance Features ⚙️
|
||||||
|
|
||||||
The continuous scraping feature (`[C]` option) allows you to:
|
- **5 concurrent downloads** for faster media processing
|
||||||
- Monitor channels in real-time
|
- **Batch database operations** for optimal speed
|
||||||
- Automatically download new messages
|
- **Progress bars** with real-time feedback
|
||||||
- Download media as it's posted
|
- **Resume capability** - Continue where you left off
|
||||||
- Run indefinitely until interrupted (Ctrl+C)
|
- **Memory-efficient** exports for large datasets
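
A limit of five concurrent downloads is typically enforced with a semaphore. The sketch below shows that pattern with `asyncio`; `download_all` and the `download_one` callback are hypothetical stand-ins, not the script's functions.

```python
import asyncio

MAX_CONCURRENT = 5  # matches the "5 concurrent downloads" figure above


async def download_all(items, download_one):
    """Run download_one(item) for every item, at most MAX_CONCURRENT at a time."""
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)

    async def guarded(item):
        # Each task waits here until one of the five slots is free.
        async with semaphore:
            return await download_one(item)

    # gather preserves the order of `items` in the returned list.
    return await asyncio.gather(*(guarded(item) for item in items))
```

Any coroutine function that takes one item can be passed as `download_one`, for example a wrapper around the real media download call.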
|
||||||
- Maintains state between runs
|
|
||||||
|
|
||||||
### Media Handling
|
|
||||||
|
|
||||||
The script can download:
|
|
||||||
- Photos
|
|
||||||
- Documents
|
|
||||||
- Other media types supported by Telegram
|
|
||||||
- Automatically retries failed downloads
|
|
||||||
- Skips existing files to avoid duplicates
|
|
||||||
|
|
||||||
## Error Handling 🛠️
|
## Error Handling 🛠️
|
||||||
|
|
||||||
The script includes:
|
- Automatic retry with exponential backoff
|
||||||
- Automatic retry mechanism for failed media downloads
|
- Rate limit compliance
|
||||||
- State preservation in case of interruption
|
- Network error recovery
|
||||||
- Flood control compliance
|
- State preservation during interruptions
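
Retry with exponential backoff can be illustrated as below. This is a generic sketch, not the script's code: `retry_with_backoff`, its parameters, and the choice of exceptions are assumptions, and Telegram-specific rate-limit handling is not covered.

```python
import asyncio
import random


async def retry_with_backoff(operation, retries=5, base_delay=1.0, max_delay=60.0):
    """Await operation(); on a network error wait base_delay * 2**attempt and retry."""
    for attempt in range(retries):
        try:
            return await operation()
        except (ConnectionError, TimeoutError):
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter avoids retrying in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay * 0.1))
```

With the defaults, the waits grow as roughly 1, 2, 4, and 8 seconds before the fifth and final attempt.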
|
||||||
- Error logging for failed operations
|
|
||||||
|
|
||||||
## Limitations ⚠️
|
## Limitations ⚠️
|
||||||
|
|
||||||
@@ -198,10 +238,6 @@ The script includes:
|
|||||||
- Can only access public channels or channels you're a member of
|
- Can only access public channels or channels you're a member of
|
||||||
- Media download size limits apply as per Telegram's restrictions
|
- Media download size limits apply as per Telegram's restrictions
|
||||||
|
|
||||||
## Contributing 🤝
|
|
||||||
|
|
||||||
Contributions are welcome! Please feel free to submit a Pull Request.
|
|
||||||
|
|
||||||
## License 📄
|
## License 📄
|
||||||
|
|
||||||
This project is licensed under the MIT License - see the LICENSE file for details.
|
This project is licensed under the MIT License - see the LICENSE file for details.
|
||||||
@@ -209,6 +245,7 @@ This project is licensed under the MIT License - see the LICENSE file for detail
|
|||||||
## Disclaimer ⚖️
|
## Disclaimer ⚖️
|
||||||
|
|
||||||
This tool is for educational purposes only. Make sure to:
|
This tool is for educational purposes only. Make sure to:
|
||||||
|
|
||||||
- Respect Telegram's Terms of Service
|
- Respect Telegram's Terms of Service
|
||||||
- Obtain necessary permissions before scraping
|
- Obtain necessary permissions before scraping
|
||||||
- Use responsibly and ethically
|
- Use responsibly and ethically
|
||||||
|
|||||||
1117
app_web.py
Normal file
File diff suppressed because it is too large
17
docker-compose.yml
Normal file
@@ -0,0 +1,17 @@
# Usage (from the project root):
#   docker compose build
#   docker compose up -d
# Data persistence: mount the host project directory at /data, matching the app's working directory (see volumes below)
services:
  web:
    build: .
    image: telegram-scraper-web:local
    container_name: telegram-scraper-web
    restart: unless-stopped
    ports:
      - "8000:8000"
    working_dir: /data
    volumes:
      # Change this to the absolute path on your server that holds the existing code, .env, state, session, and the -100* directories
      - ${HOST_PROJECT_DIR:-.}:/data
    command: ["uvicorn", "app_web:app", "--host", "0.0.0.0", "--port", "8000"]
@@ -1,3 +1,10 @@
|
|||||||
telethon
|
# Direct dependencies (transitive dependencies are resolved by pip automatically)
|
||||||
aiohttp
|
# If installation still fails: run python3 --version first. The Python 3.6 bundled with CentOS is too old; install python39/python311 and then run pip install
|
||||||
asyncio
|
Telethon>=1.28.0,<2
|
||||||
|
fastapi>=0.65.0,<1
|
||||||
|
uvicorn>=0.17.0,<1
|
||||||
|
itsdangerous>=2.0.0
|
||||||
|
jinja2>=3.0.0,<4
|
||||||
|
python-multipart>=0.0.5
|
||||||
|
qrcode>=7.3.0
|
||||||
|
PySocks>=1.7.0
|
||||||
|
|||||||
BIN
static/favicon.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 619 KiB |
1762
telegram-scraper.py
File diff suppressed because it is too large
1564
templates/index.html
Normal file
File diff suppressed because it is too large