Initial commit

commit d17c7efb3c
2025-11-18 03:36:49 +08:00
7078 changed files with 831480 additions and 0 deletions

.env (new file)

@@ -0,0 +1,28 @@
DJANGO_SECRET_KEY=GkfYWT9b6xmPNmthn4ficOrYkLnW3dmg8FGgICRR8aymCZC7MNVF0vvzhVGDzxJkSfploEzlPFN8EEP9d9K2RbdQllDYtdNTODOzcCbIFVzfjQ8CTfWVAmi6qp33qi90
DJANGO_DEBUG=True
DJANGO_ALLOWED_HOSTS=*
# Database config
DB_ENGINE=sqlite
DB_NAME=pygoedge
DB_USER=root
DB_PASSWORD=
DB_HOST=127.0.0.1
DB_PORT=3306
# GoEdge settings
# Admin API base URL (HTTP endpoint)
GOEDGE_ADMIN_API_BASE_URL=https://backend.dooki.cloud
# If no AccessKeyId/AccessKey is provided, a token can be set directly (not recommended for long-term use)
# GOEDGE_ACCESS_TOKEN=
# Recommended: provide an admin AccessKeyId/AccessKey; the system fetches and caches the token automatically
GOEDGE_ACCESS_KEY_ID=pmP4MVmhYl8fgVpu
GOEDGE_ACCESS_KEY=6YWEYwNC6dVKXv009MniUu37H8XOwG2v
# Default node cluster ID required for creating sites; must be set to a valid cluster ID
# Look it up under "Cluster Management" in the GoEdge admin console
GOEDGE_DEFAULT_NODE_CLUSTER_ID=
# System defaults
DEFAULT_FREE_TRAFFIC_GB_PER_DOMAIN=15
CNAME_TEMPLATE={sub}.cdn.example.com

.env.example (new file)

@@ -0,0 +1,23 @@
DJANGO_ENV=dev
# Django
DJANGO_SECRET_KEY=change-me
DJANGO_ALLOWED_HOSTS=localhost,127.0.0.1
DJANGO_DEBUG=True
DJANGO_CSRF_TRUSTED_ORIGINS=
# Database (default sqlite). To use MySQL, uncomment and fill:
# DB_ENGINE=mysql
# DB_NAME=pygoedge
# DB_USER=root
# DB_PASSWORD=your-password
# DB_HOST=127.0.0.1
# DB_PORT=3306
# DB_CONN_MAX_AGE=60
# GoEdge Admin API
GOEDGE_ADMIN_API_BASE_URL=https://your-goedge-host:port
GOEDGE_ACCESS_KEY_ID=your-access-key-id
GOEDGE_ACCESS_KEY=your-access-key
GOEDGE_ACCESS_TOKEN=
GOEDGE_DEFAULT_NODE_CLUSTER_ID=1


@@ -0,0 +1,80 @@
## Current Status Snapshot
- Accounts/auth: login, registration, profile, password change, and login history are in place (accounts/views.py:33, 81, 121).
- Domain onboarding: create a domain → call GoEdge to create the service → generate the CNAME → DNS check (domains/views.py:42, 224).
- Traffic statistics: management commands pull daily traffic and bandwidth into `DomainTrafficDaily` (domains/management/commands/pull_traffic.py:12, pull_daily_stats.py:11).
- Plans and features: Plan model plus create/edit/toggle/visibility in the operations panel (plans/models.py:4, admin_panel/views.py:154).
- Billing: invoices generated per calendar month (plan fee + overage fee); user-side and operations-side invoice list/detail/export/mark-as-paid (billing/management/commands/generate_invoices.py:40, billing/views.py:14, admin_panel/views.py:406).
- System settings: GoEdge API, default quotas, policy IDs, cluster ID, captcha, and anomaly-detection configuration (core/models.py:4, admin_panel/forms.py:9, admin_panel/views.py:47).
- GoEdge wrapper: token management, site creation, daily stats/bandwidth queries, webId/sslPolicyId lookups, and syncing of access logs/WebSocket/WAF/HTTP3 (core/goedge_client.py:20).
- Templates: the user and operations pages are basically complete; the domain detail page has stats and GoEdge status placeholders (templates/domains/detail.html).

## Target Increments (Two-Week Iteration)
1) Complete the sync of "domain feature settings" to GoEdge, covering common capabilities: cache, path rules, rewrites, headers, HTTPS redirect, Referer hotlink protection, RemoteAddr, and so on.
2) Add an "access log browser" page (paginated queries by day/hour/IP/keyword).
3) Enforce overage/unpaid policies (suspension or rate limiting), linking invoice state to domain state.
4) Chart visualization and UX polish on the domain detail page (7/30-day traffic curves with peak annotations).
5) Robustness: log every swallowed exception consistently; clearer hints for MySQL connection and configuration checks; more complete operation logs.
6) Run guide and scheduled tasks: commands and frequencies for Windows Task Scheduler or Linux cron.
7) Test cases: unit/integration tests for the key flows.

## Iteration Breakdown

### Iteration 1: Complete GoEdge sync capabilities (feature settings)
- Extend `GoEdgeClient`:
  - `updateHTTPWebCache` (HTTPWebService/updateHTTPWebCache): map `cache_rules_json` to a minimal configuration.
  - `updateHTTPWebLocations`: map paths/actions from `page_rules_json` (simplest form: route by prefix/regex to cache/rewrite).
  - `updateHTTPWebRewriteRules`: support simple rewrites (from → to).
  - `updateHTTPWebRequestHeader` / `updateHTTPWebResponseHeader`: support injecting common headers (e.g. HSTS, Cache-Control).
  - `updateHTTPWebRedirectToHTTPS`: one-click HTTPS redirect (driven by a plan feature flag or a custom field).
  - `updateHTTPWebReferers`: a minimal hotlink-protection whitelist/blacklist.
  - `updateHTTPWebRemoteAddr`: resolve the real client IP from X-Forwarded-For (ties into WAF/rate limiting).
- Extend `domains/views.py:340 domain_settings`: after saving, call the endpoints above to sync based on `webId/sslPolicyId`; on failure, emit `messages.warning` and write an `OperationLog`.
- JSON shape convention: the first version uses a simplified schema (key names match the GoEdge docs) with minimal backend validation and Base64 encoding; see the wrapper sketch below.
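
A minimal sketch of one such wrapper method (to live on `GoEdgeClient`), assuming a hypothetical authenticated `_post(path, payload)` helper and the simplified, Base64-encoded JSON schema described above:

```python
import base64
import json

def update_http_web_redirect_to_https(self, web_id: int, enabled: bool) -> dict:
    """Toggle the one-click HTTPS redirect for a site's web config.

    Sketch only: `_post` is an assumed helper that attaches the cached
    X-Edge-Access-Token and returns the decoded JSON response body.
    """
    config = {"isOn": enabled}
    payload = {
        "webId": web_id,
        # Per the convention above, the simplified JSON schema is Base64-encoded.
        "redirectToHTTPSJSON": base64.b64encode(json.dumps(config).encode()).decode(),
    }
    return self._post("/HTTPWebService/updateHTTPWebRedirectToHTTPS", payload)
```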
### Iteration 2: Access log browsing
- Add the route `/domains/<id>/logs/` and a template:
  - Use `HTTPAccessLogService/listHTTPAccessLogs`, supporting day/hourFrom/hourTo/size/reverse/ip/keyword filters.
  - Paginate via `requestId` and `hasMore` (see the sketch below).
  - Display fields: time, domain, IP, method, path, status code, bytes, UA, WAF hit.
- Permissions: only the domain owner or staff may access.
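
A hedged pagination sketch over the `requestId`/`hasMore` cursor; `_post` is the same assumed authenticated helper, and the request fields follow the filter list above:

```python
def iter_access_logs(client, server_id: int, day: str, keyword: str = "", size: int = 50):
    """Yield access-log entries page by page (sketch; field names assumed)."""
    request_id = "0"
    while True:
        resp = client._post("/HTTPAccessLogService/listHTTPAccessLogs", {
            "serverId": server_id,
            "day": day,                # e.g. "20251118"
            "keyword": keyword,
            "size": size,
            "reverse": True,
            "requestId": request_id,   # cursor from the previous page
        })
        for entry in resp.get("httpAccessLogs") or []:
            yield entry
        if not resp.get("hasMore"):
            break
        request_id = resp.get("requestId", request_id)
```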
### Iteration 3: Invoice linkage and policy enforcement
- After an invoice is generated (and remains unpaid):
  - Add an "enforce overage policy" entry in the operations panel:
    - Suspend: enable `HTTPWebService/updateHTTPWebShutdown`; set the domain status to `suspended`.
    - Rate limit: set a minimal rate-limit threshold via `HTTPWebService/updateHTTPWebRequestLimit` (readable from `SystemSettings.default_overage_policy`).
  - After payment: restore automatically or manually (disable Shutdown or lift the RequestLimit); the domain status returns to `active`.
- Management command or background task: periodically scan unpaid invoices and enforce the policy in bulk (safety guard: only `active` domains are touched); a sketch follows below.
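
A minimal sketch of the periodic scan as a management command; the model fields and the `update_http_web_shutdown` client method are assumptions mirroring the plan above:

```python
from django.core.management.base import BaseCommand

from billing.models import Invoice            # assumed app layout
from core.goedge_client import GoEdgeClient
from domains.models import Domain


class Command(BaseCommand):
    help = "Suspend active domains whose owners have unpaid invoices"

    def handle(self, *args, **options):
        client = GoEdgeClient()  # assumed to configure itself from SystemSettings
        for invoice in Invoice.objects.filter(status="unpaid"):
            # Safety guard from the plan: only act on currently active domains.
            for domain in Domain.objects.filter(user=invoice.user, status="active"):
                client.update_http_web_shutdown(domain.web_id, is_on=True)  # hypothetical
                domain.status = "suspended"
                domain.save(update_fields=["status"])
```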
### Iteration 4: Visualization and UX
- Domain detail page:
  - Use Chart.js to draw a 30-day dual-axis chart of traffic (GB) and peak bandwidth (Mbps), plus 7-day summary cards.
  - Clear CNAME activation hints with a retry entry; grouped GoEdge status cards (access logs/WebSocket/WAF/HTTP3).

### Iteration 5: Robustness and logging
- Replace `except Exception: pass` with logging:
  - e.g. `core/goedge_client.py:388,407`, `admin_panel/views.py:101,126,330,357,392,451,466,482`, `billing/views.py:51`, `accounts/views.py:68`, `settings/base.py:68`.
- MySQL check: if `DB_ENGINE=mysql` and `pymysql.install_as_MySQLdb()` fails, show clearer installation/environment hints.
- Operation logs: record failure reasons and result summaries for audit trails.

### Iteration 6: Run guide and scheduled tasks
- Documentation (add an explanation block to the "System Settings" page in the operations panel):
  - Run daily at 02:00: `python manage.py pull_daily_stats --days 1`
  - Run monthly on the 1st at 01:00: `python manage.py generate_invoices --overwrite`
- Windows Task Scheduler / Linux cron configuration examples (one possible crontab is sketched below) and common troubleshooting.
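
One possible Linux crontab matching the schedule above (all paths are illustrative; adjust to the actual deployment):

```
# min hour dom mon dow  command
0     2    *   *   *    /srv/pygoedge/venv/bin/python /srv/pygoedge/manage.py pull_daily_stats --days 1
0     1    1   *   *    /srv/pygoedge/venv/bin/python /srv/pygoedge/manage.py generate_invoices --overwrite
```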
### Iteration 7: Testing
- Unit tests:
  - `GoEdgeClient` token renewal and error handling (mocked requests).
  - Boundary cases for `check_cname_map` and `bytes_to_gb`.
- Integration tests:
  - The full loop: domain onboarding → DNS check → traffic ingestion → invoice generation → (policy enforcement).

## Acceptance and Delivery
- Domain settings sync: enabling a feature takes effect in GoEdge (access logs, WAF, WebSocket, HTTP/3, cache, rewrites, headers, HTTPS redirect, Referer).
- Access log browsing: paginated viewing within a given date range, with keyword/IP filtering.
- Invoice linkage: an unpaid invoice can suspend/rate-limit its domains with one click; service is restored after payment.
- Charts and UX: the domain detail page visualizes data clearly, with unambiguous operation feedback.
- Robustness: previously swallowed exceptions are logged; configuration errors produce visible hints.
- Tests: core flows are covered and pass.

Once this plan is confirmed, I will implement and commit the changes in iteration order (keeping the existing Django+Bootstrap style and reusing the established models and template structure).


@@ -0,0 +1,451 @@
# Project Development Requirements (Django + Bootstrap + MySQL)

> This document tells the AI Agent what to build: a **domain acceleration panel similar to Cloudflare / Baidu Cloud Acceleration**,
> using **Python + Django (no frontend/backend split) + Bootstrap + MySQL**, with the GoEdge admin API integrated as the underlying CDN engine.

---

## 1. Project Goals
- Build a **per-domain-billed CDN acceleration control panel** where users onboard domains via CNAME.
- Each domain gets **15 GB of free traffic per month** by default (configurable in the admin backend, adjustable per user/domain).
- Support multiple **plans (Plan)**; each plan controls:
  - The free/included traffic quota per domain
  - The overage price per GB
  - The available features (WAF, logs, HTTP/3, access control, advanced cache rules, etc.)
- The system provides:
  - User registration/login, domain onboarding management, traffic statistics display
  - Plan purchase and changes, invoices and payment (define the structure first; payment integration can be stubbed)
  - A capable admin backend (Django Admin + custom pages) for flexible configuration of plans, free quotas, feature flags, overage policies, etc.
- The backend renders the UI with Django templates + Bootstrap; **no frontend/backend split**.
---
## 2. Tech Stack and Base Requirements
- **Backend framework**: Python 3.x + Django 4.x (or the current LTS)
- **Frontend framework**: Django Template + Bootstrap 5 (or the current version)
- **Database**: MySQL 8.x
- **CDN engine integration**: HTTP calls to the **GoEdge admin API** (using X-Edge-Access-Token)
- **User authentication**:
  - Extend Django's built-in auth system (or use a custom User model)
  - Support email / username login
- **Language**:
  - The first version may be Chinese-only; keep the i18n structure to ease future multi-language support.
---
## 3. Core Business Concepts and Models

### 3.1 User
- The platform login account (the customer).
- One user can own multiple domains.

### 3.2 Domain (Domain / Site / Zone)
- A domain the user wants accelerated, e.g. `example.com`.
- The user onboards by pointing a subdomain (e.g. `www.example.com`) via CNAME to the acceleration domain the platform provides.
- Each domain:
  - Is bound to a current plan (Plan)
  - Has a monthly free/included traffic quota
  - Has its own feature configuration (WAF switch, cache rules, etc.)

### 3.3 Plan
- A per-domain billing plan template.
- Defines:
  - Base monthly fee: the price per domain per month
  - Included traffic: the GB included per domain per month
  - Overage unit price: the cost per GB beyond the included amount
  - Feature flags: whether WAF, log download, HTTP/3, WebSocket, custom rules, etc. are supported
  - Visibility/purchasability: whether it is shown to regular users (new purchase/upgrade) or is admin-only

### 3.4 Domain Plan Instance (DomainPlan / Subscription)
- The plan instance a given domain currently uses.
- Records:
  - `domain` + `plan`
  - The current billing cycle (start/end time)
  - The effective quota (the Plan values can be overridden for this specific domain):
    - Custom price/traffic/features

### 3.5 Traffic Usage
- Records traffic data at **domain + date** granularity.
- Used for:
  - Rendering traffic charts
  - Determining whether the "free quota" and "plan-included traffic" have been exceeded
  - Calculating overage fees

### 3.6 Invoice / Order
- The cost result for one billing cycle or span of usage.
- Includes:
  - Base plan fee
  - Overage traffic fee
  - Discounts/adjustments
  - Payment status
---
## 4. Feature Modules (User Side)

### 4.1 Registration / Login / Account Management
**Requirements:**
- Users can register with an email address.
- Login supports:
  - Username + password or email + password
  - A captcha system (reserved; a simple image captcha is fine).
- User center page:
  - Change password
  - View login history (recent login IPs and times)
  - Manage personal info (name, contact details)

### 4.2 Domain List & Overview
- Page: `/domains/`
- Displays:
  - Domain name
  - Current plan name
  - Traffic used this month / available monthly traffic
  - Status (not onboarded / DNS not active / normal / over quota / suspended)
- Supported actions:
  - Add a new domain
  - Open the domain detail page
  - Delete/disable a domain (soft delete or a status flag)

### 4.3 Add Domain Flow (CNAME Onboarding Wizard)
Page flow (can be a single multi-step page):
1. The user enters a domain (`example.com`).
2. The user chooses the subdomains to onboard (several allowed, e.g. `www.example.com`, `static.example.com`).
3. The user fills in origin info:
   - Origin address (IP or origin domain)
   - Back-to-origin protocol (HTTP/HTTPS)
   - Back-to-origin port
4. The user picks a plan (the Free plan by default, with optional upgrades).
5. After creation:
   - The backend calls the GoEdge Admin API to create the corresponding Server and configuration;
   - A CNAME target is generated, e.g. `www.example.com.cdn.platform.com`.
6. The page shows:
   - The DNS records that need to be added
   - The status "waiting for DNS configuration"
   - A "check DNS now" button (the backend verifies via a DNS query; a sketch follows below)
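
A minimal sketch of the backend DNS check using dnspython (an assumed dependency; the helper name is illustrative):

```python
import dns.exception
import dns.resolver  # pip install dnspython


def cname_points_to(subdomain: str, expected_target: str) -> bool:
    """Return True if the subdomain's CNAME record matches the platform target."""
    try:
        answers = dns.resolver.resolve(subdomain, "CNAME")
    except dns.exception.DNSException:
        # NXDOMAIN, no answer, timeout, etc. all count as "not active yet".
        return False
    expected = expected_target.rstrip(".").lower()
    return any(str(rr.target).rstrip(".").lower() == expected for rr in answers)


# e.g. cname_points_to("www.example.com", "www.example.com.cdn.platform.com")
```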
### 4.4 Domain Detail Page
- Example path: `/domains/<domain_id>/`
- Modules shown:
  1. **Overview**
     - Current plan info
     - Monthly traffic progress bar (used vs. total quota)
     - Overage status hint (e.g. X GB over the plan traffic)
  2. **Statistics (traffic / requests)**
     - Last 24 hours / 7 days / 30 days:
       - Traffic chart (MB/GB)
       - Request count
       - HTTP status code distribution (200/404/5xx)
       - Distribution by region/carrier (if GoEdge supports it)
  3. **Onboarding info**
     - Currently onboarded subdomains and their CNAME records
     - DNS status
  4. **Feature settings (visibility/editability depends on the plan)**
     - SSL/TLS settings
     - Cache rules
     - WAF master switch
     - IP allow/deny lists
     - Path rules (simple page rules)
> All modifying operations go through Django views that call the GoEdge Admin API to update the configuration.

### 4.5 Plan List & Upgrades
- Page: `/plans/`
- Shows all **plans visible to users**, including:
  - Plan name and monthly fee
  - Included traffic per domain
  - Overage unit price
  - A feature comparison (table form)
- From the domain detail page the user can:
  - Click "Upgrade plan"
  - Jump to the plan list and pick a Plan → the backend performs the upgrade (only the plan record changes; billing logic runs later)
> Payment can skip real third-party integration for now; reserve order records and a "mark as paid" action.

### 4.6 Billing / Costs
- Page: `/billing/`
- Users can view:
  - The historical invoice list: period, amount, status
  - Each invoice's details:
    - Traffic usage breakdown (per domain)
    - How the overage portion was calculated
> Styling can start simple; the payment flow keeps backend hooks reserved.
---
## 5. Feature Modules (Admin Backend)

### 5.1 Basic Management via Django Admin
- Enable Django Admin and register the models:
  - User management
  - Domain management
  - Plan management
  - Domain plan instances (DomainPlan/Subscription)
  - Traffic records
  - Invoices/orders

### 5.2 Custom Admin Panel (an operations console separate from the admin site)
> Goal: provide an operations console that fits the business better than Django Admin.

Suggested path: `/admin-panel/` (distinct from Django admin).

Features:

#### 5.2.1 Global System Settings
- e.g. a `SystemSettings` table with options such as:
  - `default_free_traffic_gb_per_domain` (default monthly free traffic per domain, default 15)
  - The default overage policy (whether to keep serving, rate-limit, or auto-suspend when over quota)
  - The GoEdge API address + the admin `X-Edge-Access-Token`
  - The default CNAME target domain template (e.g. `<subdomain>.cdn.xxx.com`)

#### 5.2.2 Plan Management
- Create/edit/delete plans:
  - Plan name and description
  - Monthly fee (per domain)
  - Included traffic per domain (GB)
  - Overage unit price (per GB)
  - Feature flags (booleans or a JSON config):
    - WAF enabled (yes/no)
    - Custom SSL certificates supported
    - Real-time logs supported
    - HTTP/3 supported
    - WebSocket supported
    - Custom rules supported (with a page-rule count limit)
  - Visibility:
    - Shown to regular users
    - New purchases allowed
    - Renewals allowed
    - Admin-assignment only
#### 5.2.3 User and Domain Management
- User list:
  - View basic info, registration time, status
  - View all domains under a user
- Domain list:
  - Search by domain, filter by user
  - View the current plan, usage, and status
  - Actions:
    - Manually switch plans
    - Grant the domain **extra free traffic quota** (current cycle)
    - Set a **custom overage unit price** for the domain
    - Suspend/resume the domain
    - View the GoEdge mapping (serverId / clusterId, etc.)

#### 5.2.4 Free Traffic and Quota Adjustments
- Support manual adjustments at these levels:
  1. The global default (all new domains get 15 GB free by default)
  2. A plan's included traffic
  3. A user's default free quota (all domains the user adds afterwards use this value)
  4. A specific domain's **extra gifted traffic** for the current cycle
- The admin panel should have a page that shows, for each domain:
  - The base plan traffic
  - The global default adjustment
  - The user-level extra
  - The domain-level extra
  - And the final computed "total free traffic available this cycle" (a resolution sketch follows below)
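
A hedged sketch of the layered quota resolution; all model/field names are assumptions mirroring sections 5.2.4 and 6:

```python
from core.models import SystemSettings  # assumed location of the settings singleton


def effective_free_gb(domain) -> float:
    """Resolve the total free traffic (GB) for a domain's current cycle."""
    settings = SystemSettings.objects.get()  # global default, e.g. 15 GB
    # Layers 1-2: the plan's included traffic, else the global default.
    if domain.current_plan:
        base = domain.current_plan.included_traffic_gb_per_domain
    else:
        base = settings.default_free_traffic_gb_per_domain
    # Layer 3: a user-level default override, if one is set.
    profile = getattr(domain.user, "profile", None)
    user_override = getattr(profile, "default_free_traffic_gb_per_domain_override", None)
    if user_override is not None:
        base = user_override
    # Layer 4: the domain's extra gifted traffic for this cycle.
    override = getattr(domain, "override", None)  # a domain_overrides row
    gift = getattr(override, "extra_free_traffic_gb_current_cycle", 0) or 0
    return base + gift
```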
#### 5.2.5 Billing and Invoice Management
- Invoice list:
  - Invoice number, user, billing period, total amount, paid status
- Supports:
  - Manually marking an invoice as paid
  - Regenerating the invoices for a period (for debugging or operations)
  - **Manual discounts or surcharges** on the overage portion (added as adjustment line items on the invoice)

#### 5.2.6 Ops Monitoring & Risk Control (can be phased)
- Simple version:
  - A platform-wide daily total traffic curve
  - A Top-N list of domains sorted by traffic
- Later:
  - Configure an "abnormal traffic detection threshold" (e.g. daily traffic exceeding 3× the past 7-day average) → flag anomalous domains
---
## 6. Database Design (Overview)
> This is only a **conceptual description** of tables and fields; concrete column types and indexes are up to the AI Agent, following MySQL conventions. (A Django model sketch for two of these tables follows at the end of this section.)

### 6.1 Users
- `users` (extend the Django User or use a custom model)
  - id
  - username
  - email
  - password_hash
  - is_active
  - is_staff / is_superuser
  - created_at, updated_at
- `user_profile` (optional extension)
  - user_id (FK)
  - display_name
  - contact_phone
  - default_free_traffic_gb_per_domain_override (user-level default free-traffic override)

### 6.2 Domains and Onboarding
- `domains`
  - id
  - user_id (FK)
  - name (`example.com`)
  - status (pending_dns / active / suspended / deleted)
  - current_plan_id (FK → plans)
  - current_cycle_start
  - current_cycle_end
  - cname_targets (JSON: mapping of subdomains to CNAME targets)
  - origin_config (JSON: origin address/port/protocol, etc.)
  - edge_server_id (the serverId in GoEdge)
  - created_at, updated_at
- `domain_overrides` (domain-level quota/price/feature overrides; could be merged into domains)
  - domain_id (FK)
  - extra_free_traffic_gb_current_cycle (extra gifted free traffic this cycle)
  - custom_overage_price_per_gb
  - custom_features (JSON)
  - note

### 6.3 Plans & Features
- `plans`
  - id
  - name
  - description
  - billing_mode (per_domain_monthly)
  - base_price_per_domain
  - included_traffic_gb_per_domain
  - overage_price_per_gb
  - allow_overage (whether overage billing is allowed)
  - is_active
  - is_public
  - allow_new_purchase
  - allow_renew
  - created_at, updated_at
- `plan_features` (or store as JSON directly in plans)
  - plan_id
  - key (`waf_enabled`, `http3_enabled`, etc.)
  - value (bool/int/string)

### 6.4 Traffic Statistics and Billing
- `domain_traffic_daily`
  - id
  - domain_id (FK)
  - day (date)
  - bytes
  - peak_bandwidth_mbps
  - created_at
- `invoices`
  - id
  - user_id
  - period_start
  - period_end
  - amount_plan_total (sum of plan fees for all domains in the period)
  - amount_overage_total
  - amount_adjustment (manual adjustment)
  - amount_total
  - status (unpaid / paid / cancelled)
  - created_at, paid_at
- `invoice_items`
  - id
  - invoice_id (FK)
  - domain_id (FK)
  - description (e.g. "base plan fee", "overage traffic fee")
  - quantity (e.g. a GB count)
  - unit_price
  - amount
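
A hedged Django sketch of two of the tables above (a field subset; names mirror the concept list, but the concrete layout is an assumption):

```python
from django.db import models


class Plan(models.Model):
    """Sketch of the plans table from section 6.3."""
    name = models.CharField(max_length=100)
    description = models.TextField(blank=True)
    base_price_per_domain = models.DecimalField(max_digits=10, decimal_places=2)
    included_traffic_gb_per_domain = models.PositiveIntegerField(default=15)
    overage_price_per_gb = models.DecimalField(max_digits=10, decimal_places=4)
    allow_overage = models.BooleanField(default=True)
    is_public = models.BooleanField(default=True)


class DomainTrafficDaily(models.Model):
    """Sketch of the per-domain, per-day traffic table from section 6.4."""
    domain = models.ForeignKey("domains.Domain", on_delete=models.CASCADE)
    day = models.DateField()
    bytes = models.BigIntegerField(default=0)
    peak_bandwidth_mbps = models.FloatField(default=0)

    class Meta:
        unique_together = [("domain", "day")]  # one row per domain per day
```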
---
## 7. Business Flows (Implementation Reference)

### 7.1 Domain Onboarding Flow
1. The user logs in → adds a domain → fills in the origin → picks a plan → a domain record is created + GoEdge is called to create a Server.
2. The system generates the CNAME target → shows it to the user.
3. The user configures DNS → the frontend offers a "check" button → the backend verifies activation via a DNS query.
4. Once DNS is active, the domain status moves from `pending_dns` to `active` and traffic metering begins.

### 7.2 Billing Cycles and Traffic Statistics (Monthly)
1. A daily scheduled task pulls each domain's traffic/request data from the GoEdge statistics API into `domain_traffic_daily`.
2. Each domain has its own billing cycle (a calendar month by default, or a cycle anchored to the first onboarding date).
3. At the end of a cycle, a background task:
   - Computes each domain's total traffic for the cycle
   - Compares it to the "free/plan-included traffic":
     - Not exceeded → charge only the plan fee
     - Exceeded → plan fee + overage × unit price (see the sketch below)
   - Generates the `invoice` and `invoice_items` records.
4. For unpaid invoices, depending on policy:
   - Suspend all of the user's domains before the next cycle starts (status `suspended`),
   - Or allow only the free quota and block traffic beyond it.
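
The charge rule in step 3 reduces to a small pure function; a sketch (parameter names are illustrative):

```python
from decimal import Decimal


def cycle_charge(total_gb: Decimal, included_gb: Decimal,
                 plan_fee: Decimal, overage_price: Decimal) -> Decimal:
    """Plan fee only when within quota, else plan fee + overage * unit price."""
    overage = max(Decimal("0"), total_gb - included_gb)
    return plan_fee + overage * overage_price


# Example: 20 GB used on a 15 GB plan at 10/month and 0.5 per extra GB
# -> 10 + 5 * 0.5 = 12.5
```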
---
## 8. UI and Frontend Requirements (Django + Bootstrap)
- The whole site uses Django templates with a Bootstrap layout:
  - Top navigation: logged-in user info plus entries for the domain list, plans, and billing
  - On the domain detail page, a Cloudflare-style left menu (Overview / Analytics / SSL / Cache / Firewall / Settings)
- Tables and lists use the Bootstrap table style.
- Forms use CSRF protection.
- Action buttons (e.g. "Upgrade plan", "Suspend domain") require a confirmation dialog.
---
## 9. Security and Permissions
- Protect user pages with Django Auth + the `login_required` decorator.
- Two separate admin backends:
  - Django Admin (superusers only)
  - The custom operations panel `/admin-panel/` (staff and users with the relevant permissions)
- All GoEdge API interactions happen strictly on the backend; no GoEdge access token is ever exposed to the frontend.
- Important operations (deleting a domain, switching plans, billing adjustments, etc.) are written to an "operation log" table recording the operator, the time, and the changes made.
---
## 10. Extension Points (Reserved for Later Versions)
- Payment integration (Alipay / WeChat / Stripe, etc.), based on Yipay (易支付):
  - For now, only define the order and invoice structures; integrate a payment gateway later.
- A ticket system (Ticket)

---

> **Implementation priorities for the AI Agent:**
> - First build the base models (users / domains / plans / traffic / invoices) and the admin backend structure.
> - Implement the full loop: "add domain → generate CNAME → pull traffic → generate invoice".
> - Use Django templates + Bootstrap to provide a simple but clear Cloudflare-style panel.


@@ -0,0 +1 @@
pip


@@ -0,0 +1,247 @@
Metadata-Version: 2.4
Name: asgiref
Version: 3.10.0
Summary: ASGI specs, helper code, and adapters
Home-page: https://github.com/django/asgiref/
Author: Django Software Foundation
Author-email: foundation@djangoproject.com
License: BSD-3-Clause
Project-URL: Documentation, https://asgi.readthedocs.io/
Project-URL: Further Documentation, https://docs.djangoproject.com/en/stable/topics/async/#async-adapter-functions
Project-URL: Changelog, https://github.com/django/asgiref/blob/master/CHANGELOG.txt
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Web Environment
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: BSD License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Requires-Python: >=3.9
License-File: LICENSE
Requires-Dist: typing_extensions>=4; python_version < "3.11"
Provides-Extra: tests
Requires-Dist: pytest; extra == "tests"
Requires-Dist: pytest-asyncio; extra == "tests"
Requires-Dist: mypy>=1.14.0; extra == "tests"
Dynamic: license-file
asgiref
=======
.. image:: https://github.com/django/asgiref/actions/workflows/tests.yml/badge.svg
    :target: https://github.com/django/asgiref/actions/workflows/tests.yml

.. image:: https://img.shields.io/pypi/v/asgiref.svg
    :target: https://pypi.python.org/pypi/asgiref
ASGI is a standard for Python asynchronous web apps and servers to communicate
with each other, and positioned as an asynchronous successor to WSGI. You can
read more at https://asgi.readthedocs.io/en/latest/
This package includes ASGI base libraries, such as:
* Sync-to-async and async-to-sync function wrappers, ``asgiref.sync``
* Server base classes, ``asgiref.server``
* A WSGI-to-ASGI adapter, in ``asgiref.wsgi``
Function wrappers
-----------------
These allow you to wrap or decorate async or sync functions to call them from
the other style (so you can call async functions from a synchronous thread,
or vice-versa).
In particular:
* AsyncToSync lets a synchronous subthread stop and wait while the async
function is called on the main thread's event loop, and then control is
returned to the thread when the async function is finished.
* SyncToAsync lets async code call a synchronous function, which is run in
a threadpool and control returned to the async coroutine when the synchronous
function completes.
The idea is to make it easier to call synchronous APIs from async code and
asynchronous APIs from synchronous code so it's easier to transition code from
one style to the other. In the case of Channels, we wrap the (synchronous)
Django view system with SyncToAsync to allow it to run inside the (asynchronous)
ASGI server.
Note that exactly what threads things run in is very specific, and aimed to
keep maximum compatibility with old synchronous code. See
"Synchronous code & Threads" below for a full explanation. By default,
``sync_to_async`` will run all synchronous code in the program in the same
thread for safety reasons; you can disable this for more performance with
``@sync_to_async(thread_sensitive=False)``, but make sure that your code does
not rely on anything bound to threads (like database connections) when you do.
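
For example, a minimal round trip through both wrappers::

    import asyncio
    import time

    from asgiref.sync import async_to_sync, sync_to_async

    @sync_to_async
    def blocking_io():
        time.sleep(0.1)  # runs in a threadpool (or the shared sync thread)
        return "done"

    async def main():
        return await blocking_io()

    print(async_to_sync(main)())  # drives the coroutine from sync code -> "done"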
Threadlocal replacement
-----------------------
This is a drop-in replacement for ``threading.local`` that works with both
threads and asyncio Tasks. Even better, it will proxy values through from a
task-local context to a thread-local context when you use ``sync_to_async``
to run things in a threadpool, and vice-versa for ``async_to_sync``.
If you instead want true thread- and task-safety, you can set
``thread_critical`` on the Local object to ensure this instead.
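
A short illustration of the proxying::

    import asyncio

    from asgiref.local import Local
    from asgiref.sync import sync_to_async

    state = Local()

    def read_state():
        # The value set in the async task is visible here via the proxying.
        print(state.user_id)

    async def handler():
        state.user_id = 42
        await sync_to_async(read_state)()

    asyncio.run(handler())  # prints 42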
Server base classes
-------------------
Includes a ``StatelessServer`` class which provides all the hard work of
writing a stateless server (as in, does not handle direct incoming sockets
but instead consumes external streams or sockets to work out what is happening).
An example of such a server would be a chatbot server that connects out to
a central chat server and provides a "connection scope" per user chatting to
it. There's only one actual connection, but the server has to separate things
into several scopes for easier writing of the code.
You can see an example of this being used in `frequensgi <https://github.com/andrewgodwin/frequensgi>`_.
WSGI-to-ASGI adapter
--------------------
Allows you to wrap a WSGI application so it appears as a valid ASGI application.
Simply wrap it around your WSGI application like so::

    asgi_application = WsgiToAsgi(wsgi_application)
The WSGI application will be run in a synchronous threadpool, and the wrapped
ASGI application will be one that accepts ``http`` class messages.
Please note that not all extended features of WSGI may be supported (such as
file handles for incoming POST bodies).
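
A self-contained sketch wrapping a trivial WSGI app::

    from asgiref.wsgi import WsgiToAsgi

    def wsgi_application(environ, start_response):
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"Hello from WSGI\n"]

    # Any ASGI server can now serve it, e.g.: uvicorn module:asgi_application
    asgi_application = WsgiToAsgi(wsgi_application)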
Dependencies
------------
``asgiref`` requires Python 3.9 or higher.
Contributing
------------
Please refer to the
`main Channels contributing docs <https://github.com/django/channels/blob/master/CONTRIBUTING.rst>`_.
Testing
'''''''
To run tests, make sure you have installed the ``tests`` extra with the package::

    cd asgiref/
    pip install -e .[tests]
    pytest
Building the documentation
''''''''''''''''''''''''''
The documentation uses `Sphinx <http://www.sphinx-doc.org>`_::

    cd asgiref/docs/
    pip install sphinx
To build the docs, you can use the default tools::

    sphinx-build -b html . _build/html  # or `make html`, if you've got make set up
    cd _build/html
    python -m http.server
...or you can use ``sphinx-autobuild`` to run a server and rebuild/reload
your documentation changes automatically::

    pip install sphinx-autobuild
    sphinx-autobuild . _build/html
Releasing
'''''''''
To release, first add details to CHANGELOG.txt and update the version number in ``asgiref/__init__.py``.

Then, build and push the packages::

    python -m build
    twine upload dist/*
    rm -r asgiref.egg-info dist
Implementation Details
----------------------
Synchronous code & threads
''''''''''''''''''''''''''
The ``asgiref.sync`` module provides two wrappers that let you go between
asynchronous and synchronous code at will, while taking care of the rough edges
for you.
Unfortunately, the rough edges are numerous, and the code has to work especially
hard to keep things in the same thread as much as possible. Notably, the
restrictions we are working with are:
* All synchronous code called through ``SyncToAsync`` and marked with
``thread_sensitive`` should run in the same thread as each other (and if the
outer layer of the program is synchronous, the main thread)
* If a thread already has a running async loop, ``AsyncToSync`` can't run things
on that loop if it's blocked on synchronous code that is above you in the
call stack.
The first compromise you get to might be that ``thread_sensitive`` code should
just run in the same thread and not spawn in a sub-thread, fulfilling the first
restriction, but that immediately runs you into the second restriction.
The only real solution is to essentially have a variant of ThreadPoolExecutor
that executes any ``thread_sensitive`` code on the outermost synchronous
thread - either the main thread, or a single spawned subthread.
This means you now have two basic states:
* If the outermost layer of your program is synchronous, then all async code
run through ``AsyncToSync`` will run in a per-call event loop in arbitrary
sub-threads, while all ``thread_sensitive`` code will run in the main thread.
* If the outermost layer of your program is asynchronous, then all async code
runs on the main thread's event loop, and all ``thread_sensitive`` synchronous
code will run in a single shared sub-thread.
Crucially, this means that in both cases there is a thread which is a shared
resource that all ``thread_sensitive`` code must run on, and there is a chance
that this thread is currently blocked on its own ``AsyncToSync`` call. Thus,
``AsyncToSync`` needs to act as an executor for thread code while it's blocking.
The ``CurrentThreadExecutor`` class provides this functionality; rather than
simply waiting on a Future, you can call its ``run_until_future`` method and
it will run submitted code until that Future is done. This means that code
inside the call can then run code on your thread.
Maintenance and Security
------------------------
To report security issues, please contact security@djangoproject.com. For GPG
signatures and more security process information, see
https://docs.djangoproject.com/en/dev/internals/security/.
To report bugs or request new features, please open a new GitHub issue.
This repository is part of the Channels project. For the shepherd and maintenance team, please see the
`main Channels readme <https://github.com/django/channels/blob/master/README.rst>`_.


@@ -0,0 +1,27 @@
asgiref-3.10.0.dist-info/INSTALLER,sha256=zuuue4knoyJ-UwPPXg8fezS7VCrXJQrAP7zeNuwvFQg,4
asgiref-3.10.0.dist-info/METADATA,sha256=TlcKOCn3FwSCGD62jZkbckPRh-RjAhkCLLDnfmDZTyA,9287
asgiref-3.10.0.dist-info/RECORD,,
asgiref-3.10.0.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
asgiref-3.10.0.dist-info/licenses/LICENSE,sha256=uEZBXRtRTpwd_xSiLeuQbXlLxUbKYSn5UKGM0JHipmk,1552
asgiref-3.10.0.dist-info/top_level.txt,sha256=bokQjCzwwERhdBiPdvYEZa4cHxT4NCeAffQNUqJ8ssg,8
asgiref/__init__.py,sha256=iKJAvc5i0UTDDSSefTGL0Tq-kWQ4S3OJJgvyaQfQNF8,23
asgiref/__pycache__/__init__.cpython-312.pyc,,
asgiref/__pycache__/compatibility.cpython-312.pyc,,
asgiref/__pycache__/current_thread_executor.cpython-312.pyc,,
asgiref/__pycache__/local.cpython-312.pyc,,
asgiref/__pycache__/server.cpython-312.pyc,,
asgiref/__pycache__/sync.cpython-312.pyc,,
asgiref/__pycache__/testing.cpython-312.pyc,,
asgiref/__pycache__/timeout.cpython-312.pyc,,
asgiref/__pycache__/typing.cpython-312.pyc,,
asgiref/__pycache__/wsgi.cpython-312.pyc,,
asgiref/compatibility.py,sha256=DhY1SOpOvOw0Y1lSEjCqg-znRUQKecG3LTaV48MZi68,1606
asgiref/current_thread_executor.py,sha256=42CU1VODLTk-_PYise-cP1XgyAvI5Djc8f97owFzdrs,4157
asgiref/local.py,sha256=ZZeWWIXptVU4GbNApMMWQ-skuglvodcQA5WpzJDMxh4,4912
asgiref/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
asgiref/server.py,sha256=3A68169Nuh2sTY_2O5JzRd_opKObWvvrEFcrXssq3kA,6311
asgiref/sync.py,sha256=CEKxFyePiksUoA7MronOKaF6mmNQxUYZjXlfJZXEQCM,22551
asgiref/testing.py,sha256=U5wcs_-ZYTO5SIGfl80EqRAGv_T8BHrAhvAKRuuztT4,4421
asgiref/timeout.py,sha256=LtGL-xQpG8JHprdsEUCMErJ0kNWj4qwWZhEHJ3iKu4s,3627
asgiref/typing.py,sha256=Zi72AZlOyF1C7N14LLZnpAdfUH4ljoBqFdQo_bBKMq0,6290
asgiref/wsgi.py,sha256=J8OAgirfsYHZmxxqIGfFiZ43uq1qKKv2xGMkRISNIo4,6742


@@ -0,0 +1,5 @@
Wheel-Version: 1.0
Generator: setuptools (80.9.0)
Root-Is-Purelib: true
Tag: py3-none-any


@@ -0,0 +1,27 @@
Copyright (c) Django Software Foundation and individual contributors.
All rights reserved.
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
3. Neither the name of Django nor the names of its contributors may be used
to endorse or promote products derived from this software without
specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.


@@ -0,0 +1 @@
asgiref


@@ -0,0 +1 @@
__version__ = "3.10.0"


@@ -0,0 +1,48 @@
import inspect

from .sync import iscoroutinefunction


def is_double_callable(application):
    """
    Tests to see if an application is a legacy-style (double-callable) application.
    """
    # Look for a hint on the object first
    if getattr(application, "_asgi_single_callable", False):
        return False
    if getattr(application, "_asgi_double_callable", False):
        return True
    # Uninstantiated classes are double-callable
    if inspect.isclass(application):
        return True
    # Instantiated classes depend on their __call__
    if hasattr(application, "__call__"):
        # We only check to see if its __call__ is a coroutine function -
        # if it's not, it still might be a coroutine function itself.
        if iscoroutinefunction(application.__call__):
            return False
    # Non-classes we just check directly
    return not iscoroutinefunction(application)


def double_to_single_callable(application):
    """
    Transforms a double-callable ASGI application into a single-callable one.
    """

    async def new_application(scope, receive, send):
        instance = application(scope)
        return await instance(receive, send)

    return new_application


def guarantee_single_callable(application):
    """
    Takes either a single- or double-callable application and always returns it
    in single-callable style. Use this to add backwards compatibility for ASGI
    2.0 applications to your server/test harness/etc.
    """
    if is_double_callable(application):
        application = double_to_single_callable(application)
    return application


@@ -0,0 +1,123 @@
import sys
import threading
from collections import deque
from concurrent.futures import Executor, Future
from typing import Any, Callable, TypeVar

if sys.version_info >= (3, 10):
    from typing import ParamSpec
else:
    from typing_extensions import ParamSpec

_T = TypeVar("_T")
_P = ParamSpec("_P")
_R = TypeVar("_R")


class _WorkItem:
    """
    Represents an item needing to be run in the executor.
    Copied from ThreadPoolExecutor (but it's private, so we're not going to rely on importing it)
    """

    def __init__(
        self,
        future: "Future[_R]",
        fn: Callable[_P, _R],
        *args: _P.args,
        **kwargs: _P.kwargs,
    ):
        self.future = future
        self.fn = fn
        self.args = args
        self.kwargs = kwargs

    def run(self) -> None:
        __traceback_hide__ = True  # noqa: F841
        if not self.future.set_running_or_notify_cancel():
            return
        try:
            result = self.fn(*self.args, **self.kwargs)
        except BaseException as exc:
            self.future.set_exception(exc)
            # Break a reference cycle with the exception 'exc'
            self = None  # type: ignore[assignment]
        else:
            self.future.set_result(result)


class CurrentThreadExecutor(Executor):
    """
    An Executor that actually runs code in the thread it is instantiated in.
    Passed to other threads running async code, so they can run sync code in
    the thread they came from.
    """

    def __init__(self, old_executor: "CurrentThreadExecutor | None") -> None:
        self._work_thread = threading.current_thread()
        self._work_ready = threading.Condition(threading.Lock())
        self._work_items = deque[_WorkItem]()  # synchronized by _work_ready
        self._broken = False  # synchronized by _work_ready
        self._old_executor = old_executor

    def run_until_future(self, future: "Future[Any]") -> None:
        """
        Runs the code in the work queue until a result is available from the future.
        Should be run from the thread the executor is initialised in.
        """
        # Check we're in the right thread
        if threading.current_thread() != self._work_thread:
            raise RuntimeError(
                "You cannot run CurrentThreadExecutor from a different thread"
            )

        def done(future: "Future[Any]") -> None:
            with self._work_ready:
                self._broken = True
                self._work_ready.notify()

        future.add_done_callback(done)
        # Keep getting and running work items until the future we're waiting for
        # is done and the queue is empty.
        while True:
            with self._work_ready:
                while not self._work_items and not self._broken:
                    self._work_ready.wait()
                if not self._work_items:
                    break
                # Get a work item and run it
                work_item = self._work_items.popleft()
            work_item.run()
            del work_item

    def submit(
        self,
        fn: Callable[_P, _R],
        /,
        *args: _P.args,
        **kwargs: _P.kwargs,
    ) -> "Future[_R]":
        # Check they're not submitting from the same thread
        if threading.current_thread() == self._work_thread:
            raise RuntimeError(
                "You cannot submit onto CurrentThreadExecutor from its own thread"
            )
        f: "Future[_R]" = Future()
        work_item = _WorkItem(f, fn, *args, **kwargs)
        # Walk up the CurrentThreadExecutor stack to find the closest one still
        # running
        executor = self
        while True:
            with executor._work_ready:
                if not executor._broken:
                    # Add to work queue
                    executor._work_items.append(work_item)
                    executor._work_ready.notify()
                    break
            if executor._old_executor is None:
                raise RuntimeError("CurrentThreadExecutor already quit or is broken")
            executor = executor._old_executor
        # Return the future
        return f


@@ -0,0 +1,131 @@
import asyncio
import contextlib
import contextvars
import threading
from typing import Any, Dict, Union


class _CVar:
    """Storage utility for Local."""

    def __init__(self) -> None:
        self._data: "contextvars.ContextVar[Dict[str, Any]]" = contextvars.ContextVar(
            "asgiref.local"
        )

    def __getattr__(self, key):
        storage_object = self._data.get({})
        try:
            return storage_object[key]
        except KeyError:
            raise AttributeError(f"{self!r} object has no attribute {key!r}")

    def __setattr__(self, key: str, value: Any) -> None:
        if key == "_data":
            return super().__setattr__(key, value)
        storage_object = self._data.get({}).copy()
        storage_object[key] = value
        self._data.set(storage_object)

    def __delattr__(self, key: str) -> None:
        storage_object = self._data.get({}).copy()
        if key in storage_object:
            del storage_object[key]
            self._data.set(storage_object)
        else:
            raise AttributeError(f"{self!r} object has no attribute {key!r}")


class Local:
    """Local storage for async tasks.

    This is a namespace object (similar to `threading.local`) where data is
    also local to the current async task (if there is one).

    In async threads, local means in the same sense as the `contextvars`
    module - i.e. a value set in an async frame will be visible:

    - to other async code `await`-ed from this frame.
    - to tasks spawned using `asyncio` utilities (`create_task`, `wait_for`,
      `gather` and probably others).
    - to code scheduled in a sync thread using `sync_to_async`

    In "sync" threads (a thread with no async event loop running), the
    data is thread-local, but additionally shared with async code executed
    via the `async_to_sync` utility, which schedules async code in a new thread
    and copies context across to that thread.

    If `thread_critical` is True, then the local will only be visible per-thread,
    behaving exactly like `threading.local` if the thread is sync, and as
    `contextvars` if the thread is async. This allows genuinely thread-sensitive
    code (such as DB handles) to be kept strictly to their initial thread and
    disables the sharing across `sync_to_async` and `async_to_sync` wrapped calls.

    Unlike plain `contextvars` objects, this utility is threadsafe.
    """

    def __init__(self, thread_critical: bool = False) -> None:
        self._thread_critical = thread_critical
        self._thread_lock = threading.RLock()

        self._storage: "Union[threading.local, _CVar]"

        if thread_critical:
            # Thread-local storage
            self._storage = threading.local()
        else:
            # Contextvar storage
            self._storage = _CVar()

    @contextlib.contextmanager
    def _lock_storage(self):
        # Thread safe access to storage
        if self._thread_critical:
            is_async = True
            try:
                # This is a test for whether we are in an async or sync
                # thread - it will raise RuntimeError if there is
                # no current loop
                asyncio.get_running_loop()
            except RuntimeError:
                is_async = False
            if not is_async:
                # We are in a sync thread, the storage is
                # just the plain thread local (i.e., "global within
                # this thread" - it doesn't matter where you are
                # in a call stack, you see the same storage)
                yield self._storage
            else:
                # We are in an async thread - storage is still
                # local to this thread, but additionally should
                # behave like a context var (is only visible within
                # the same async call stack)

                # Ensure context exists in the current thread
                if not hasattr(self._storage, "cvar"):
                    self._storage.cvar = _CVar()

                # self._storage is a thread local, so the members
                # can't be accessed in another thread (we don't
                # need any locks)
                yield self._storage.cvar
        else:
            # Lock for thread_critical=False as other threads
            # can access the exact same storage object
            with self._thread_lock:
                yield self._storage

    def __getattr__(self, key):
        with self._lock_storage() as storage:
            return getattr(storage, key)

    def __setattr__(self, key, value):
        if key in ("_local", "_storage", "_thread_critical", "_thread_lock"):
            return super().__setattr__(key, value)
        with self._lock_storage() as storage:
            setattr(storage, key, value)

    def __delattr__(self, key):
        with self._lock_storage() as storage:
            delattr(storage, key)


@@ -0,0 +1,173 @@
import asyncio
import logging
import time
import traceback

from .compatibility import guarantee_single_callable

logger = logging.getLogger(__name__)


class StatelessServer:
    """
    Base server class that handles basic concepts like application instance
    creation/pooling, exception handling, and similar, for stateless protocols
    (i.e. ones without actual incoming connections to the process)

    Your code should override the handle() method, doing whatever it needs to,
    and calling get_or_create_application_instance with a unique `scope_id`
    and `scope` for the scope it wants to get.

    If an application instance is found with the same `scope_id`, you are
    given its input queue, otherwise one is made for you with the scope provided
    and you are given that fresh new input queue. Either way, you should do
    something like:

        input_queue = self.get_or_create_application_instance(
            "user-123456",
            {"type": "testprotocol", "user_id": "123456", "username": "andrew"},
        )
        input_queue.put_nowait(message)

    If you try and create an application instance and there are already
    `max_application` instances, the oldest/least recently used one will be
    reclaimed and shut down to make space.

    Application coroutines that error will be found periodically (every 100ms
    by default) and have their exceptions printed to the console. Override
    application_exception() if you want to do more when this happens.

    If you override run(), make sure you handle things like launching the
    application checker.
    """

    application_checker_interval = 0.1

    def __init__(self, application, max_applications=1000):
        # Parameters
        self.application = application
        self.max_applications = max_applications
        # Initialisation
        self.application_instances = {}

    ### Mainloop and handling

    def run(self):
        """
        Runs the asyncio event loop with our handler loop.
        """
        event_loop = asyncio.get_event_loop()
        try:
            event_loop.run_until_complete(self.arun())
        except KeyboardInterrupt:
            logger.info("Exiting due to Ctrl-C/interrupt")

    async def arun(self):
        """
        Runs the asyncio event loop with our handler loop.
        """

        class Done(Exception):
            pass

        async def handle():
            await self.handle()
            raise Done

        try:
            await asyncio.gather(self.application_checker(), handle())
        except Done:
            pass

    async def handle(self):
        raise NotImplementedError("You must implement handle()")

    async def application_send(self, scope, message):
        """
        Receives outbound sends from applications and handles them.
        """
        raise NotImplementedError("You must implement application_send()")

    ### Application instance management

    def get_or_create_application_instance(self, scope_id, scope):
        """
        Creates an application instance and returns its queue.
        """
        if scope_id in self.application_instances:
            self.application_instances[scope_id]["last_used"] = time.time()
            return self.application_instances[scope_id]["input_queue"]
        # See if we need to delete an old one
        while len(self.application_instances) > self.max_applications:
            self.delete_oldest_application_instance()
        # Make an instance of the application
        input_queue = asyncio.Queue()
        application_instance = guarantee_single_callable(self.application)
        # Run it, and stash the future for later checking
        future = asyncio.ensure_future(
            application_instance(
                scope=scope,
                receive=input_queue.get,
                send=lambda message: self.application_send(scope, message),
            ),
        )
        self.application_instances[scope_id] = {
            "input_queue": input_queue,
            "future": future,
            "scope": scope,
            "last_used": time.time(),
        }
        return input_queue

    def delete_oldest_application_instance(self):
        """
        Finds and deletes the oldest application instance
        """
        oldest_time = min(
            details["last_used"] for details in self.application_instances.values()
        )
        for scope_id, details in self.application_instances.items():
            if details["last_used"] == oldest_time:
                self.delete_application_instance(scope_id)
                # Return to make sure we only delete one in case two have
                # the same oldest time
                return

    def delete_application_instance(self, scope_id):
        """
        Removes an application instance (makes sure its task is stopped,
        then removes it from the current set)
        """
        details = self.application_instances[scope_id]
        del self.application_instances[scope_id]
        if not details["future"].done():
            details["future"].cancel()

    async def application_checker(self):
        """
        Goes through the set of current application instance Futures and cleans up
        any that are done/prints exceptions for any that errored.
        """
        while True:
            await asyncio.sleep(self.application_checker_interval)
            for scope_id, details in list(self.application_instances.items()):
                if details["future"].done():
                    exception = details["future"].exception()
                    if exception:
                        await self.application_exception(exception, details)
                    try:
                        del self.application_instances[scope_id]
                    except KeyError:
                        # Exception handling might have already got here before us. That's fine.
                        pass

    async def application_exception(self, exception, application_details):
        """
        Called whenever an application coroutine has an exception.
        """
        logging.error(
            "Exception inside application: %s\n%s%s",
            exception,
            "".join(traceback.format_tb(exception.__traceback__)),
            f" {exception}",
        )


@@ -0,0 +1,647 @@
import asyncio
import asyncio.coroutines
import contextvars
import functools
import inspect
import os
import sys
import threading
import warnings
import weakref
from concurrent.futures import Future, ThreadPoolExecutor
from typing import (
    TYPE_CHECKING,
    Any,
    Awaitable,
    Callable,
    Coroutine,
    Dict,
    Generic,
    List,
    Optional,
    TypeVar,
    Union,
    overload,
)

from .current_thread_executor import CurrentThreadExecutor
from .local import Local

if sys.version_info >= (3, 10):
    from typing import ParamSpec
else:
    from typing_extensions import ParamSpec

if TYPE_CHECKING:
    # This is not available to import at runtime
    from _typeshed import OptExcInfo

_F = TypeVar("_F", bound=Callable[..., Any])
_P = ParamSpec("_P")
_R = TypeVar("_R")


def _restore_context(context: contextvars.Context) -> None:
    # Check for changes in contextvars, and set them to the current
    # context for downstream consumers
    for cvar in context:
        cvalue = context.get(cvar)
        try:
            if cvar.get() != cvalue:
                cvar.set(cvalue)
        except LookupError:
            cvar.set(cvalue)


# Python 3.12 deprecates asyncio.iscoroutinefunction() as an alias for
# inspect.iscoroutinefunction(), whilst also removing the _is_coroutine marker.
# The latter is replaced with the inspect.markcoroutinefunction decorator.
# Until 3.12 is the minimum supported Python version, provide a shim.
if hasattr(inspect, "markcoroutinefunction"):
    iscoroutinefunction = inspect.iscoroutinefunction
    markcoroutinefunction: Callable[[_F], _F] = inspect.markcoroutinefunction
else:
    iscoroutinefunction = asyncio.iscoroutinefunction  # type: ignore[assignment]

    def markcoroutinefunction(func: _F) -> _F:
        func._is_coroutine = asyncio.coroutines._is_coroutine  # type: ignore
        return func


class AsyncSingleThreadContext:
    """Context manager to run async code inside the same thread.

    Normally, AsyncToSync functions run either inside a separate ThreadPoolExecutor or
    the main event loop if it exists. This context manager ensures that all AsyncToSync
    functions execute within the same thread.

    This context manager is re-entrant, so only the outer-most call to
    AsyncSingleThreadContext will set the context.

    Usage:

    >>> import asyncio
    >>> with AsyncSingleThreadContext():
    ...     async_to_sync(asyncio.sleep(1))()
    """

    def __init__(self):
        self.token = None

    def __enter__(self):
        try:
            AsyncToSync.async_single_thread_context.get()
        except LookupError:
            self.token = AsyncToSync.async_single_thread_context.set(self)
        return self

    def __exit__(self, exc, value, tb):
        if not self.token:
            return
        executor = AsyncToSync.context_to_thread_executor.pop(self, None)
        if executor:
            executor.shutdown()
        AsyncToSync.async_single_thread_context.reset(self.token)


class ThreadSensitiveContext:
    """Async context manager to manage context for thread sensitive mode

    This context manager controls which thread pool executor is used when in
    thread sensitive mode. By default, a single thread pool executor is shared
    within a process.

    The ThreadSensitiveContext() context manager may be used to specify a
    thread pool per context.

    This context manager is re-entrant, so only the outer-most call to
    ThreadSensitiveContext will set the context.

    Usage:

    >>> import time
    >>> async with ThreadSensitiveContext():
    ...     await sync_to_async(time.sleep, 1)()
    """

    def __init__(self):
        self.token = None

    async def __aenter__(self):
        try:
            SyncToAsync.thread_sensitive_context.get()
        except LookupError:
            self.token = SyncToAsync.thread_sensitive_context.set(self)
        return self

    async def __aexit__(self, exc, value, tb):
        if not self.token:
            return
        executor = SyncToAsync.context_to_thread_executor.pop(self, None)
        if executor:
            executor.shutdown()
        SyncToAsync.thread_sensitive_context.reset(self.token)


class AsyncToSync(Generic[_P, _R]):
    """
    Utility class which turns an awaitable that only works on the thread with
    the event loop into a synchronous callable that works in a subthread.

    If the call stack contains an async loop, the code runs there.
    Otherwise, the code runs in a new loop in a new thread.

    Either way, this thread then pauses and waits to run any thread_sensitive
    code called from further down the call stack using SyncToAsync, before
    finally exiting once the async task returns.
    """

    # Keeps a reference to the CurrentThreadExecutor in local context, so that
    # any sync_to_async inside the wrapped code can find it.
    executors: "Local" = Local()

    # When we can't find a CurrentThreadExecutor from the context, such as
    # inside create_task, we'll look it up here from the running event loop.
    loop_thread_executors: "Dict[asyncio.AbstractEventLoop, CurrentThreadExecutor]" = {}

    async_single_thread_context: "contextvars.ContextVar[AsyncSingleThreadContext]" = (
        contextvars.ContextVar("async_single_thread_context")
    )

    context_to_thread_executor: "weakref.WeakKeyDictionary[AsyncSingleThreadContext, ThreadPoolExecutor]" = (
        weakref.WeakKeyDictionary()
    )

    def __init__(
        self,
        awaitable: Union[
            Callable[_P, Coroutine[Any, Any, _R]],
            Callable[_P, Awaitable[_R]],
        ],
        force_new_loop: bool = False,
    ):
        if not callable(awaitable) or (
            not iscoroutinefunction(awaitable)
            and not iscoroutinefunction(getattr(awaitable, "__call__", awaitable))
        ):
            # Python does not have very reliable detection of async functions
            # (lots of false negatives) so this is just a warning.
            warnings.warn(
                "async_to_sync was passed a non-async-marked callable", stacklevel=2
            )
        self.awaitable = awaitable
        try:
            self.__self__ = self.awaitable.__self__  # type: ignore[union-attr]
        except AttributeError:
            pass
        self.force_new_loop = force_new_loop
        self.main_event_loop = None
        try:
            self.main_event_loop = asyncio.get_running_loop()
        except RuntimeError:
            # There's no event loop in this thread.
            pass

    def __call__(self, *args: _P.args, **kwargs: _P.kwargs) -> _R:
        __traceback_hide__ = True  # noqa: F841

        if not self.force_new_loop and not self.main_event_loop:
            # There's no event loop in this thread. Look for the threadlocal if
            # we're inside SyncToAsync
            main_event_loop_pid = getattr(
                SyncToAsync.threadlocal, "main_event_loop_pid", None
            )
            # We make sure the parent loop is from the same process - if
            # they've forked, this is not going to be valid any more (#194)
            if main_event_loop_pid and main_event_loop_pid == os.getpid():
                self.main_event_loop = getattr(
                    SyncToAsync.threadlocal, "main_event_loop", None
                )

        # You can't call AsyncToSync from a thread with a running event loop
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            pass
        else:
            raise RuntimeError(
                "You cannot use AsyncToSync in the same thread as an async event loop - "
                "just await the async function directly."
            )

        # Make a future for the return information
        call_result: "Future[_R]" = Future()

        # Make a CurrentThreadExecutor we'll use to idle in this thread - we
        # need one for every sync frame, even if there's one above us in the
        # same thread.
        old_executor = getattr(self.executors, "current", None)
        current_executor = CurrentThreadExecutor(old_executor)
        self.executors.current = current_executor

        # Wrapping context in list so it can be reassigned from within
        # `main_wrap`.
        context = [contextvars.copy_context()]

        # Get task context so that parent task knows which task to propagate
        # an asyncio.CancelledError to.
        task_context = getattr(SyncToAsync.threadlocal, "task_context", None)

        # Use call_soon_threadsafe to schedule a synchronous callback on the
        # main event loop's thread if it's there, otherwise make a new loop
        # in this thread.
        try:
            awaitable = self.main_wrap(
                call_result,
                sys.exc_info(),
                task_context,
                context,
                # prepare an awaitable which can be passed as is to self.main_wrap,
                # so that `args` and `kwargs` don't need to be
                # destructured when passed to self.main_wrap
                # (which is required by `ParamSpec`)
                # as that may cause overlapping arguments
                self.awaitable(*args, **kwargs),
            )

            async def new_loop_wrap() -> None:
                loop = asyncio.get_running_loop()
                self.loop_thread_executors[loop] = current_executor
                try:
                    await awaitable
                finally:
                    del self.loop_thread_executors[loop]

            if self.main_event_loop is not None:
                try:
                    self.main_event_loop.call_soon_threadsafe(
                        self.main_event_loop.create_task, awaitable
                    )
                except RuntimeError:
                    running_in_main_event_loop = False
                else:
                    running_in_main_event_loop = True
                    # Run the CurrentThreadExecutor until the future is done.
                    current_executor.run_until_future(call_result)
            else:
                running_in_main_event_loop = False

            if not running_in_main_event_loop:
                loop_executor = None
                if self.async_single_thread_context.get(None):
                    single_thread_context = self.async_single_thread_context.get()
                    if single_thread_context in self.context_to_thread_executor:
                        loop_executor = self.context_to_thread_executor[
                            single_thread_context
                        ]
                    else:
                        loop_executor = ThreadPoolExecutor(max_workers=1)
                        self.context_to_thread_executor[
                            single_thread_context
                        ] = loop_executor
                else:
                    # Make our own event loop - in a new thread - and run inside that.
                    loop_executor = ThreadPoolExecutor(max_workers=1)
                loop_future = loop_executor.submit(asyncio.run, new_loop_wrap())
                # Run the CurrentThreadExecutor until the future is done.
                current_executor.run_until_future(loop_future)
                # Wait for future and/or allow for exception propagation
                loop_future.result()
        finally:
            _restore_context(context[0])
            # Restore old current thread executor state
            self.executors.current = old_executor

        # Wait for results from the future.
        return call_result.result()

    def __get__(self, parent: Any, objtype: Any) -> Callable[_P, _R]:
        """
        Include self for methods
        """
        func = functools.partial(self.__call__, parent)
        return functools.update_wrapper(func, self.awaitable)

    async def main_wrap(
        self,
        call_result: "Future[_R]",
        exc_info: "OptExcInfo",
        task_context: "Optional[List[asyncio.Task[Any]]]",
        context: List[contextvars.Context],
        awaitable: Union[Coroutine[Any, Any, _R], Awaitable[_R]],
    ) -> None:
        """
        Wraps the awaitable with something that puts the result into the
        result/exception future.
        """
        __traceback_hide__ = True  # noqa: F841

        if context is not None:
            _restore_context(context[0])

        current_task = asyncio.current_task()
        if current_task is not None and task_context is not None:
            task_context.append(current_task)

        try:
            # If we have an exception, run the function inside the except block
            # after raising it so exc_info is correctly populated.
            if exc_info[1]:
                try:
                    raise exc_info[1]
                except BaseException:
                    result = await awaitable
            else:
                result = await awaitable
        except BaseException as e:
            call_result.set_exception(e)
        else:
            call_result.set_result(result)
        finally:
            if current_task is not None and task_context is not None:
                task_context.remove(current_task)
            context[0] = contextvars.copy_context()


class SyncToAsync(Generic[_P, _R]):
    """
    Utility class which turns a synchronous callable into an awaitable that
    runs in a threadpool. It also sets a threadlocal inside the thread so
    calls to AsyncToSync can escape it.

    If thread_sensitive is passed, the code will run in the same thread as any
    outer code. This is needed for underlying Python code that is not
    threadsafe (for example, code which handles SQLite database connections).

    If the outermost program is async (i.e. SyncToAsync is outermost), then
    this will be a dedicated single sub-thread that all sync code runs in,
    one after the other. If the outermost program is sync (i.e. AsyncToSync is
    outermost), this will just be the main thread. This is achieved by idling
    with a CurrentThreadExecutor while AsyncToSync is blocking its sync parent,
    rather than just blocking.

    If executor is passed in, that will be used instead of the loop's default executor.
    In order to pass in an executor, thread_sensitive must be set to False, otherwise
    a TypeError will be raised.
    """

    # Storage for main event loop references
    threadlocal = threading.local()

    # Single-thread executor for thread-sensitive code
    single_thread_executor = ThreadPoolExecutor(max_workers=1)

    # Maintain a contextvar for the current execution context. Optionally used
    # for thread sensitive mode.
    thread_sensitive_context: "contextvars.ContextVar[ThreadSensitiveContext]" = (
        contextvars.ContextVar("thread_sensitive_context")
    )

    # Contextvar that is used to detect if the single thread executor
    # would be awaited on while already being used in the same context
    deadlock_context: "contextvars.ContextVar[bool]" = contextvars.ContextVar(
        "deadlock_context"
    )

    # Maintaining a weak reference to the context ensures that thread pools are
    # erased once the context goes out of scope. This terminates the thread pool.
    context_to_thread_executor: "weakref.WeakKeyDictionary[ThreadSensitiveContext, ThreadPoolExecutor]" = (
        weakref.WeakKeyDictionary()
    )

    def __init__(
        self,
        func: Callable[_P, _R],
        thread_sensitive: bool = True,
        executor: Optional["ThreadPoolExecutor"] = None,
    ) -> None:
        if (
            not callable(func)
            or iscoroutinefunction(func)
            or iscoroutinefunction(getattr(func, "__call__", func))
        ):
            raise TypeError("sync_to_async can only be applied to sync functions.")
        self.func = func
        functools.update_wrapper(self, func)
        self._thread_sensitive = thread_sensitive
        markcoroutinefunction(self)
        if thread_sensitive and executor is not None:
            raise TypeError("executor must not be set when thread_sensitive is True")
        self._executor = executor
        try:
            self.__self__ = func.__self__  # type: ignore
        except AttributeError:
            pass

    async def __call__(self, *args: _P.args, **kwargs: _P.kwargs) -> _R:
        __traceback_hide__ = True  # noqa: F841
        loop = asyncio.get_running_loop()

        # Work out what thread to run the code in
        if self._thread_sensitive:
            current_thread_executor = getattr(AsyncToSync.executors, "current", None)
            if current_thread_executor:
                # If we have a parent sync thread above somewhere, use that
                executor = current_thread_executor
            elif self.thread_sensitive_context.get(None):
                # If we have a way of retrieving the current context, attempt
                # to use a per-context thread pool executor
                thread_sensitive_context = self.thread_sensitive_context.get()
                if thread_sensitive_context in self.context_to_thread_executor:
                    # Re-use thread executor in current context
                    executor = self.context_to_thread_executor[thread_sensitive_context]
                else:
                    # Create new thread executor in current context
                    executor = ThreadPoolExecutor(max_workers=1)
                    self.context_to_thread_executor[thread_sensitive_context] = executor
            elif loop in AsyncToSync.loop_thread_executors:
                # Re-use thread executor for running loop
                executor = AsyncToSync.loop_thread_executors[loop]
            elif self.deadlock_context.get(False):
                raise RuntimeError(
                    "Single thread executor already being used, would deadlock"
                )
            else:
                # Otherwise, we run it in a fixed single thread
                executor = self.single_thread_executor
                self.deadlock_context.set(True)
        else:
            # Use the passed in executor, or the loop's default if it is None
            executor = self._executor

        context = contextvars.copy_context()
        child = functools.partial(self.func, *args, **kwargs)
        func = context.run
        task_context: List[asyncio.Task[Any]] = []

        # Run the code in the right thread
        exec_coro = loop.run_in_executor(
            executor,
            functools.partial(
                self.thread_handler,
                loop,
                sys.exc_info(),
                task_context,
                func,
                child,
            ),
        )
        ret: _R
        try:
            ret = await asyncio.shield(exec_coro)
        except asyncio.CancelledError:
            cancel_parent = True
            try:
                task = task_context[0]
                task.cancel()
                try:
                    await task
                    cancel_parent = False
                except asyncio.CancelledError:
                    pass
            except IndexError:
                pass
            if exec_coro.done():
                raise
            if cancel_parent:
                exec_coro.cancel()
            ret = await exec_coro
        finally:
            _restore_context(context)
            self.deadlock_context.set(False)

        return ret

    def __get__(
        self, parent: Any, objtype: Any
    ) -> Callable[_P, Coroutine[Any, Any, _R]]:
        """
        Include self for methods
        """
        func = functools.partial(self.__call__, parent)
        return functools.update_wrapper(func, self.func)

    def thread_handler(self, loop, exc_info, task_context, func, *args, **kwargs):
        """
        Wraps the sync application with exception handling.
        """
        __traceback_hide__ = True  # noqa: F841

        # Set the threadlocal for AsyncToSync
        self.threadlocal.main_event_loop = loop
        self.threadlocal.main_event_loop_pid = os.getpid()
        self.threadlocal.task_context = task_context

        # Run the function
        # If we have an exception, run the function inside the except block
        # after raising it so exc_info is correctly populated.
        if exc_info[1]:
            try:
                raise exc_info[1]
            except BaseException:
                return func(*args, **kwargs)
        else:
            return func(*args, **kwargs)


@overload
def async_to_sync(
    *,
    force_new_loop: bool = False,
) -> Callable[
    [Union[Callable[_P, Coroutine[Any, Any, _R]], Callable[_P, Awaitable[_R]]]],
    Callable[_P, _R],
]:
    ...


@overload
def async_to_sync(
    awaitable: Union[
        Callable[_P, Coroutine[Any, Any, _R]],
        Callable[_P, Awaitable[_R]],
    ],
    *,
    force_new_loop: bool = False,
) -> Callable[_P, _R]:
    ...


def async_to_sync(
    awaitable: Optional[
        Union[
            Callable[_P, Coroutine[Any, Any, _R]],
            Callable[_P, Awaitable[_R]],
        ]
    ] = None,
    *,
    force_new_loop: bool = False,
) -> Union[
    Callable[
        [Union[Callable[_P, Coroutine[Any, Any, _R]], Callable[_P, Awaitable[_R]]]],
        Callable[_P, _R],
    ],
    Callable[_P, _R],
]:
    if awaitable is None:
        return lambda f: AsyncToSync(
            f,
            force_new_loop=force_new_loop,
        )
    return AsyncToSync(
        awaitable,
        force_new_loop=force_new_loop,
    )


@overload
def sync_to_async(
    *,
    thread_sensitive: bool = True,
    executor: Optional["ThreadPoolExecutor"] = None,
) -> Callable[[Callable[_P, _R]], Callable[_P, Coroutine[Any, Any, _R]]]:
    ...


@overload
def sync_to_async(
    func: Callable[_P, _R],
    *,
    thread_sensitive: bool = True,
    executor: Optional["ThreadPoolExecutor"] = None,
) -> Callable[_P, Coroutine[Any, Any, _R]]:
    ...


def sync_to_async(
    func: Optional[Callable[_P, _R]] = None,
    *,
    thread_sensitive: bool = True,
    executor: Optional["ThreadPoolExecutor"] = None,
) -> Union[
    Callable[[Callable[_P, _R]], Callable[_P, Coroutine[Any, Any, _R]]],
    Callable[_P, Coroutine[Any, Any, _R]],
]:
    if func is None:
        return lambda f: SyncToAsync(
            f,
            thread_sensitive=thread_sensitive,
            executor=executor,
        )
    return SyncToAsync(
        func,
        thread_sensitive=thread_sensitive,
        executor=executor,
    )


@@ -0,0 +1,137 @@
import asyncio
import contextvars
import time
from .compatibility import guarantee_single_callable
from .timeout import timeout as async_timeout
class ApplicationCommunicator:
"""
Runs an ASGI application in a test mode, allowing sending of
messages to it and retrieval of messages it sends.
"""
def __init__(self, application, scope):
self._future = None
self.application = guarantee_single_callable(application)
self.scope = scope
self._input_queue = None
self._output_queue = None
# For Python 3.9 we need to lazily bind the queues, on 3.10+ they bind the
# event loop lazily.
@property
def input_queue(self):
if self._input_queue is None:
self._input_queue = asyncio.Queue()
return self._input_queue
@property
def output_queue(self):
if self._output_queue is None:
self._output_queue = asyncio.Queue()
return self._output_queue
@property
def future(self):
if self._future is None:
# Clear context - this ensures that context vars set in the testing scope
# are not "leaked" into the application which would normally begin with
# an empty context. In Python >= 3.11 this could also be written as:
# asyncio.create_task(..., context=contextvars.Context())
self._future = contextvars.Context().run(
asyncio.create_task,
self.application(
self.scope, self.input_queue.get, self.output_queue.put
),
)
return self._future
async def wait(self, timeout=1):
"""
Waits for the application to stop itself and returns any exceptions.
"""
try:
async with async_timeout(timeout):
try:
await self.future
self.future.result()
except asyncio.CancelledError:
pass
finally:
if not self.future.done():
self.future.cancel()
try:
await self.future
except asyncio.CancelledError:
pass
def stop(self, exceptions=True):
future = self._future
if future is None:
return
if not future.done():
future.cancel()
elif exceptions:
# Give a chance to raise any exceptions
future.result()
def __del__(self):
# Clean up on deletion
try:
self.stop(exceptions=False)
except RuntimeError:
# Event loop already stopped
pass
async def send_input(self, message):
"""
Sends a single message to the application
"""
# Make sure there's not an exception to raise from the task
if self.future.done():
self.future.result()
# Give it the message
await self.input_queue.put(message)
async def receive_output(self, timeout=1):
"""
Receives a single message from the application, with optional timeout.
"""
# Make sure there's not an exception to raise from the task
if self.future.done():
self.future.result()
# Wait and receive the message
try:
async with async_timeout(timeout):
return await self.output_queue.get()
except asyncio.TimeoutError as e:
# See if we have another error to raise inside
if self.future.done():
self.future.result()
else:
self.future.cancel()
try:
await self.future
except asyncio.CancelledError:
pass
raise e
async def receive_nothing(self, timeout=0.1, interval=0.01):
"""
Checks that there is no message to receive in the given time.
"""
# Make sure there's not an exception to raise from the task
if self.future.done():
self.future.result()
# `interval` has precedence over `timeout`
start = time.monotonic()
while time.monotonic() - start < timeout:
if not self.output_queue.empty():
return False
await asyncio.sleep(interval)
return self.output_queue.empty()
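
A sketch of driving the communicator from a test; the `echo_app` below is an invented toy application, not part of the library:

```python
import asyncio

from asgiref.testing import ApplicationCommunicator


async def echo_app(scope, receive, send):
    # Toy ASGI-style app for the sketch: read one message, echo the body back.
    message = await receive()
    await send({"type": "echo", "body": message.get("body", b"")})


async def main() -> None:
    communicator = ApplicationCommunicator(echo_app, {"type": "http"})
    await communicator.send_input({"type": "http.request", "body": b"hi"})
    reply = await communicator.receive_output(timeout=1)
    assert reply == {"type": "echo", "body": b"hi"}
    await communicator.wait(timeout=1)  # the app has returned, so this resolves


asyncio.run(main())
```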

View File

@@ -0,0 +1,118 @@
# This code is originally sourced from the aio-libs project "async_timeout",
# under the Apache 2.0 license. You may see the original project at
# https://github.com/aio-libs/async-timeout
# It is vendored here to reduce chain-dependencies on this library, and
# modified slightly to remove some features we don't use.
import asyncio
import warnings
from types import TracebackType
from typing import Any # noqa
from typing import Optional, Type
class timeout:
"""timeout context manager.
Useful in cases when you want to apply timeout logic around block
of code or in cases when asyncio.wait_for is not suitable. For example:
>>> with timeout(0.001):
... async with aiohttp.get('https://github.com') as r:
... await r.text()
timeout - value in seconds or None to disable timeout logic
loop - asyncio compatible event loop
"""
def __init__(
self,
timeout: Optional[float],
*,
loop: Optional[asyncio.AbstractEventLoop] = None,
) -> None:
self._timeout = timeout
if loop is None:
loop = asyncio.get_running_loop()
else:
warnings.warn(
"""The loop argument to timeout() is deprecated.""", DeprecationWarning
)
self._loop = loop
self._task = None # type: Optional[asyncio.Task[Any]]
self._cancelled = False
self._cancel_handler = None # type: Optional[asyncio.Handle]
self._cancel_at = None # type: Optional[float]
def __enter__(self) -> "timeout":
return self._do_enter()
def __exit__(
self,
exc_type: Type[BaseException],
exc_val: BaseException,
exc_tb: TracebackType,
) -> Optional[bool]:
self._do_exit(exc_type)
return None
async def __aenter__(self) -> "timeout":
return self._do_enter()
async def __aexit__(
self,
exc_type: Type[BaseException],
exc_val: BaseException,
exc_tb: TracebackType,
) -> None:
self._do_exit(exc_type)
@property
def expired(self) -> bool:
return self._cancelled
@property
def remaining(self) -> Optional[float]:
if self._cancel_at is not None:
return max(self._cancel_at - self._loop.time(), 0.0)
else:
return None
def _do_enter(self) -> "timeout":
# Support Tornado 5- without timeout
# Details: https://github.com/python/asyncio/issues/392
if self._timeout is None:
return self
self._task = asyncio.current_task(self._loop)
if self._task is None:
raise RuntimeError(
"Timeout context manager should be used " "inside a task"
)
if self._timeout <= 0:
self._loop.call_soon(self._cancel_task)
return self
self._cancel_at = self._loop.time() + self._timeout
self._cancel_handler = self._loop.call_at(self._cancel_at, self._cancel_task)
return self
def _do_exit(self, exc_type: Type[BaseException]) -> None:
if exc_type is asyncio.CancelledError and self._cancelled:
self._cancel_handler = None
self._task = None
raise asyncio.TimeoutError
if self._timeout is not None and self._cancel_handler is not None:
self._cancel_handler.cancel()
self._cancel_handler = None
self._task = None
return None
def _cancel_task(self) -> None:
if self._task is not None:
self._task.cancel()
self._cancelled = True
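
A small usage sketch of the vendored context manager; the sleeps are stand-ins for real awaited work:

```python
import asyncio

from asgiref.timeout import timeout


async def main() -> None:
    async with timeout(0.5):
        await asyncio.sleep(0.1)  # completes well within the deadline
    try:
        async with timeout(0.1):
            await asyncio.sleep(10)  # cancelled when the deadline hits
    except asyncio.TimeoutError:
        print("timed out as expected")


asyncio.run(main())
```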

View File

@@ -0,0 +1,279 @@
import sys
from typing import (
Any,
Awaitable,
Callable,
Dict,
Iterable,
Literal,
Optional,
Protocol,
Tuple,
Type,
TypedDict,
Union,
)
if sys.version_info >= (3, 11):
from typing import NotRequired
else:
from typing_extensions import NotRequired
__all__ = (
"ASGIVersions",
"HTTPScope",
"WebSocketScope",
"LifespanScope",
"WWWScope",
"Scope",
"HTTPRequestEvent",
"HTTPResponseStartEvent",
"HTTPResponseBodyEvent",
"HTTPResponseTrailersEvent",
"HTTPResponsePathsendEvent",
"HTTPServerPushEvent",
"HTTPDisconnectEvent",
"WebSocketConnectEvent",
"WebSocketAcceptEvent",
"WebSocketReceiveEvent",
"WebSocketSendEvent",
"WebSocketResponseStartEvent",
"WebSocketResponseBodyEvent",
"WebSocketDisconnectEvent",
"WebSocketCloseEvent",
"LifespanStartupEvent",
"LifespanShutdownEvent",
"LifespanStartupCompleteEvent",
"LifespanStartupFailedEvent",
"LifespanShutdownCompleteEvent",
"LifespanShutdownFailedEvent",
"ASGIReceiveEvent",
"ASGISendEvent",
"ASGIReceiveCallable",
"ASGISendCallable",
"ASGI2Protocol",
"ASGI2Application",
"ASGI3Application",
"ASGIApplication",
)
class ASGIVersions(TypedDict):
spec_version: str
version: Union[Literal["2.0"], Literal["3.0"]]
class HTTPScope(TypedDict):
type: Literal["http"]
asgi: ASGIVersions
http_version: str
method: str
scheme: str
path: str
raw_path: bytes
query_string: bytes
root_path: str
headers: Iterable[Tuple[bytes, bytes]]
client: Optional[Tuple[str, int]]
server: Optional[Tuple[str, Optional[int]]]
state: NotRequired[Dict[str, Any]]
extensions: Optional[Dict[str, Dict[object, object]]]
class WebSocketScope(TypedDict):
type: Literal["websocket"]
asgi: ASGIVersions
http_version: str
scheme: str
path: str
raw_path: bytes
query_string: bytes
root_path: str
headers: Iterable[Tuple[bytes, bytes]]
client: Optional[Tuple[str, int]]
server: Optional[Tuple[str, Optional[int]]]
subprotocols: Iterable[str]
state: NotRequired[Dict[str, Any]]
extensions: Optional[Dict[str, Dict[object, object]]]
class LifespanScope(TypedDict):
type: Literal["lifespan"]
asgi: ASGIVersions
state: NotRequired[Dict[str, Any]]
WWWScope = Union[HTTPScope, WebSocketScope]
Scope = Union[HTTPScope, WebSocketScope, LifespanScope]
class HTTPRequestEvent(TypedDict):
type: Literal["http.request"]
body: bytes
more_body: bool
class HTTPResponseDebugEvent(TypedDict):
type: Literal["http.response.debug"]
info: Dict[str, object]
class HTTPResponseStartEvent(TypedDict):
type: Literal["http.response.start"]
status: int
headers: Iterable[Tuple[bytes, bytes]]
trailers: bool
class HTTPResponseBodyEvent(TypedDict):
type: Literal["http.response.body"]
body: bytes
more_body: bool
class HTTPResponseTrailersEvent(TypedDict):
type: Literal["http.response.trailers"]
headers: Iterable[Tuple[bytes, bytes]]
more_trailers: bool
class HTTPResponsePathsendEvent(TypedDict):
type: Literal["http.response.pathsend"]
path: str
class HTTPServerPushEvent(TypedDict):
type: Literal["http.response.push"]
path: str
headers: Iterable[Tuple[bytes, bytes]]
class HTTPDisconnectEvent(TypedDict):
type: Literal["http.disconnect"]
class WebSocketConnectEvent(TypedDict):
type: Literal["websocket.connect"]
class WebSocketAcceptEvent(TypedDict):
type: Literal["websocket.accept"]
subprotocol: Optional[str]
headers: Iterable[Tuple[bytes, bytes]]
class WebSocketReceiveEvent(TypedDict):
type: Literal["websocket.receive"]
bytes: Optional[bytes]
text: Optional[str]
class WebSocketSendEvent(TypedDict):
type: Literal["websocket.send"]
bytes: Optional[bytes]
text: Optional[str]
class WebSocketResponseStartEvent(TypedDict):
type: Literal["websocket.http.response.start"]
status: int
headers: Iterable[Tuple[bytes, bytes]]
class WebSocketResponseBodyEvent(TypedDict):
type: Literal["websocket.http.response.body"]
body: bytes
more_body: bool
class WebSocketDisconnectEvent(TypedDict):
type: Literal["websocket.disconnect"]
code: int
reason: Optional[str]
class WebSocketCloseEvent(TypedDict):
type: Literal["websocket.close"]
code: int
reason: Optional[str]
class LifespanStartupEvent(TypedDict):
type: Literal["lifespan.startup"]
class LifespanShutdownEvent(TypedDict):
type: Literal["lifespan.shutdown"]
class LifespanStartupCompleteEvent(TypedDict):
type: Literal["lifespan.startup.complete"]
class LifespanStartupFailedEvent(TypedDict):
type: Literal["lifespan.startup.failed"]
message: str
class LifespanShutdownCompleteEvent(TypedDict):
type: Literal["lifespan.shutdown.complete"]
class LifespanShutdownFailedEvent(TypedDict):
type: Literal["lifespan.shutdown.failed"]
message: str
ASGIReceiveEvent = Union[
HTTPRequestEvent,
HTTPDisconnectEvent,
WebSocketConnectEvent,
WebSocketReceiveEvent,
WebSocketDisconnectEvent,
LifespanStartupEvent,
LifespanShutdownEvent,
]
ASGISendEvent = Union[
HTTPResponseStartEvent,
HTTPResponseBodyEvent,
HTTPResponseTrailersEvent,
HTTPServerPushEvent,
HTTPDisconnectEvent,
WebSocketAcceptEvent,
WebSocketSendEvent,
WebSocketResponseStartEvent,
WebSocketResponseBodyEvent,
WebSocketCloseEvent,
LifespanStartupCompleteEvent,
LifespanStartupFailedEvent,
LifespanShutdownCompleteEvent,
LifespanShutdownFailedEvent,
]
ASGIReceiveCallable = Callable[[], Awaitable[ASGIReceiveEvent]]
ASGISendCallable = Callable[[ASGISendEvent], Awaitable[None]]
class ASGI2Protocol(Protocol):
def __init__(self, scope: Scope) -> None:
...
async def __call__(
self, receive: ASGIReceiveCallable, send: ASGISendCallable
) -> None:
...
ASGI2Application = Type[ASGI2Protocol]
ASGI3Application = Callable[
[
Scope,
ASGIReceiveCallable,
ASGISendCallable,
],
Awaitable[None],
]
ASGIApplication = Union[ASGI2Application, ASGI3Application]
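
These declarations are purely descriptive; as a sketch, a minimal ASGI 3 application annotated with them might look like this (the app itself is invented for illustration):

```python
from asgiref.typing import ASGIReceiveCallable, ASGISendCallable, Scope


async def app(
    scope: Scope, receive: ASGIReceiveCallable, send: ASGISendCallable
) -> None:
    # This sketch only handles plain HTTP scopes.
    assert scope["type"] == "http"
    await send(
        {
            "type": "http.response.start",
            "status": 200,
            "headers": [(b"content-type", b"text/plain")],
            "trailers": False,
        }
    )
    await send({"type": "http.response.body", "body": b"Hello!", "more_body": False})
```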

View File

@@ -0,0 +1,166 @@
import sys
from tempfile import SpooledTemporaryFile
from asgiref.sync import AsyncToSync, sync_to_async
class WsgiToAsgi:
"""
Wraps a WSGI application to make it into an ASGI application.
"""
def __init__(self, wsgi_application):
self.wsgi_application = wsgi_application
async def __call__(self, scope, receive, send):
"""
ASGI application instantiation point.
We return a new WsgiToAsgiInstance here with the WSGI app
and the scope, ready to respond when it is __call__ed.
"""
await WsgiToAsgiInstance(self.wsgi_application)(scope, receive, send)
class WsgiToAsgiInstance:
"""
Per-socket instance of a wrapped WSGI application
"""
def __init__(self, wsgi_application):
self.wsgi_application = wsgi_application
self.response_started = False
self.response_content_length = None
async def __call__(self, scope, receive, send):
if scope["type"] != "http":
raise ValueError("WSGI wrapper received a non-HTTP scope")
self.scope = scope
with SpooledTemporaryFile(max_size=65536) as body:
# Alright, wait for the http.request messages
while True:
message = await receive()
if message["type"] != "http.request":
raise ValueError("WSGI wrapper received a non-HTTP-request message")
body.write(message.get("body", b""))
if not message.get("more_body"):
break
body.seek(0)
# Wrap send so it can be called from the subthread
self.sync_send = AsyncToSync(send)
# Call the WSGI app
await self.run_wsgi_app(body)
def build_environ(self, scope, body):
"""
Builds a scope and request body into a WSGI environ object.
"""
script_name = scope.get("root_path", "").encode("utf8").decode("latin1")
path_info = scope["path"].encode("utf8").decode("latin1")
if path_info.startswith(script_name):
path_info = path_info[len(script_name) :]
environ = {
"REQUEST_METHOD": scope["method"],
"SCRIPT_NAME": script_name,
"PATH_INFO": path_info,
"QUERY_STRING": scope["query_string"].decode("ascii"),
"SERVER_PROTOCOL": "HTTP/%s" % scope["http_version"],
"wsgi.version": (1, 0),
"wsgi.url_scheme": scope.get("scheme", "http"),
"wsgi.input": body,
"wsgi.errors": sys.stderr,
"wsgi.multithread": True,
"wsgi.multiprocess": True,
"wsgi.run_once": False,
}
# Get server name and port - required in WSGI, not in ASGI
if "server" in scope:
environ["SERVER_NAME"] = scope["server"][0]
environ["SERVER_PORT"] = str(scope["server"][1])
else:
environ["SERVER_NAME"] = "localhost"
environ["SERVER_PORT"] = "80"
if scope.get("client") is not None:
environ["REMOTE_ADDR"] = scope["client"][0]
# Go through headers and make them into environ entries
for name, value in self.scope.get("headers", []):
name = name.decode("latin1")
if name == "content-length":
corrected_name = "CONTENT_LENGTH"
elif name == "content-type":
corrected_name = "CONTENT_TYPE"
else:
corrected_name = "HTTP_%s" % name.upper().replace("-", "_")
# HTTPbis say only ASCII chars are allowed in headers, but we latin1 just in case
value = value.decode("latin1")
if corrected_name in environ:
value = environ[corrected_name] + "," + value
environ[corrected_name] = value
return environ
def start_response(self, status, response_headers, exc_info=None):
"""
WSGI start_response callable.
"""
# Don't allow re-calling once response has begun
if self.response_started:
raise exc_info[1].with_traceback(exc_info[2])
# Don't allow re-calling without exc_info
if hasattr(self, "response_start") and exc_info is None:
raise ValueError(
"You cannot call start_response a second time without exc_info"
)
# Extract status code
status_code, _ = status.split(" ", 1)
status_code = int(status_code)
# Extract headers
headers = [
(name.lower().encode("ascii"), value.encode("ascii"))
for name, value in response_headers
]
# Extract content-length
self.response_content_length = None
for name, value in response_headers:
if name.lower() == "content-length":
self.response_content_length = int(value)
# Build and send response start message.
self.response_start = {
"type": "http.response.start",
"status": status_code,
"headers": headers,
}
@sync_to_async
def run_wsgi_app(self, body):
"""
Called in a subthread to run the WSGI app. We encapsulate like
this so that the start_response callable is called in the same thread.
"""
# Translate the scope and incoming request body into a WSGI environ
environ = self.build_environ(self.scope, body)
# Run the WSGI app
bytes_sent = 0
for output in self.wsgi_application(environ, self.start_response):
# If this is the first response, include the response headers
if not self.response_started:
self.response_started = True
self.sync_send(self.response_start)
# If the application supplies a Content-Length header
if self.response_content_length is not None:
# The server should not transmit more bytes to the client than the header allows
bytes_allowed = self.response_content_length - bytes_sent
if len(output) > bytes_allowed:
output = output[:bytes_allowed]
self.sync_send(
{"type": "http.response.body", "body": output, "more_body": True}
)
bytes_sent += len(output)
# The server should stop iterating over the response when enough data has been sent
if bytes_sent == self.response_content_length:
break
# Close connection
if not self.response_started:
self.response_started = True
self.sync_send(self.response_start)
self.sync_send({"type": "http.response.body"})

View File

@@ -0,0 +1 @@
pip

View File

@@ -0,0 +1,78 @@
Metadata-Version: 2.4
Name: certifi
Version: 2025.10.5
Summary: Python package for providing Mozilla's CA Bundle.
Home-page: https://github.com/certifi/python-certifi
Author: Kenneth Reitz
Author-email: me@kennethreitz.com
License: MPL-2.0
Project-URL: Source, https://github.com/certifi/python-certifi
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: Mozilla Public License 2.0 (MPL 2.0)
Classifier: Natural Language :: English
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Python: >=3.7
License-File: LICENSE
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: project-url
Dynamic: requires-python
Dynamic: summary
Certifi: Python SSL Certificates
================================
Certifi provides Mozilla's carefully curated collection of Root Certificates for
validating the trustworthiness of SSL certificates while verifying the identity
of TLS hosts. It has been extracted from the `Requests`_ project.
Installation
------------
``certifi`` is available on PyPI. Simply install it with ``pip``::
$ pip install certifi
Usage
-----
To reference the installed certificate authority (CA) bundle, you can use the
built-in function::
>>> import certifi
>>> certifi.where()
'/usr/local/lib/python3.7/site-packages/certifi/cacert.pem'
Or from the command line::
$ python -m certifi
/usr/local/lib/python3.7/site-packages/certifi/cacert.pem
Enjoy!
.. _`Requests`: https://requests.readthedocs.io/en/master/
Addition/Removal of Certificates
--------------------------------
Certifi does not support any addition/removal or other modification of the
CA trust store content. This project is intended to provide a reliable and
highly portable root of trust to python deployments. Look to upstream projects
for methods to use alternate trust.
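
As one common pattern (a sketch, not from the upstream docs), the bundle path can be handed straight to the standard library's `ssl` module:

```python
import ssl
import urllib.request

import certifi

# Build an SSL context that trusts exactly the Mozilla bundle shipped by certifi.
context = ssl.create_default_context(cafile=certifi.where())
with urllib.request.urlopen("https://www.python.org", context=context) as response:
    print(response.status)
```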

View File

@@ -0,0 +1,14 @@
certifi-2025.10.5.dist-info/INSTALLER,sha256=zuuue4knoyJ-UwPPXg8fezS7VCrXJQrAP7zeNuwvFQg,4
certifi-2025.10.5.dist-info/METADATA,sha256=RzyR4sT6xRN1pNNy24IHVOlZuDJh1BNfaMa04zEadtk,2474
certifi-2025.10.5.dist-info/RECORD,,
certifi-2025.10.5.dist-info/WHEEL,sha256=_zCd3N1l69ArxyTb8rzEoP9TpbYXkqRFSNOD5OuxnTs,91
certifi-2025.10.5.dist-info/licenses/LICENSE,sha256=6TcW2mucDVpKHfYP5pWzcPBpVgPSH2-D8FPkLPwQyvc,989
certifi-2025.10.5.dist-info/top_level.txt,sha256=KMu4vUCfsjLrkPbSNdgdekS-pVJzBAJFO__nI8NF6-U,8
certifi/__init__.py,sha256=jWkaYHMk4oIPSSBEK5bLMbO_qrkyNm_cRFx-D16-3Ks,94
certifi/__main__.py,sha256=xBBoj905TUWBLRGANOcf7oi6e-3dMP4cEoG9OyMs11g,243
certifi/__pycache__/__init__.cpython-312.pyc,,
certifi/__pycache__/__main__.cpython-312.pyc,,
certifi/__pycache__/core.cpython-312.pyc,,
certifi/cacert.pem,sha256=IIn8WiWDZAH67pn3IkYLAbOTmZdGoPuBeUNmbW7MBFg,291366
certifi/core.py,sha256=XFXycndG5pf37ayeF8N32HUuDafsyhkVMbO4BAPWHa0,3394
certifi/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0

View File

@@ -0,0 +1,5 @@
Wheel-Version: 1.0
Generator: setuptools (80.9.0)
Root-Is-Purelib: true
Tag: py3-none-any

View File

@@ -0,0 +1,20 @@
This package contains a modified version of ca-bundle.crt:
ca-bundle.crt -- Bundle of CA Root Certificates
This is a bundle of X.509 certificates of public Certificate Authorities
(CA). These were automatically extracted from Mozilla's root certificates
file (certdata.txt). This file can be found in the mozilla source tree:
https://hg.mozilla.org/mozilla-central/file/tip/security/nss/lib/ckfw/builtins/certdata.txt
It contains the certificates in PEM format and therefore
can be directly used with curl / libcurl / php_curl, or with
an Apache+mod_ssl webserver for SSL client authentication.
Just configure this file as the SSLCACertificateFile.#
***** BEGIN LICENSE BLOCK *****
This Source Code Form is subject to the terms of the Mozilla Public License,
v. 2.0. If a copy of the MPL was not distributed with this file, You can obtain
one at http://mozilla.org/MPL/2.0/.
***** END LICENSE BLOCK *****
@(#) $RCSfile: certdata.txt,v $ $Revision: 1.80 $ $Date: 2011/11/03 15:11:58 $

View File

@@ -0,0 +1 @@
certifi

View File

@@ -0,0 +1,4 @@
from .core import contents, where
__all__ = ["contents", "where"]
__version__ = "2025.10.05"

View File

@@ -0,0 +1,12 @@
import argparse
from certifi import contents, where
parser = argparse.ArgumentParser()
parser.add_argument("-c", "--contents", action="store_true")
args = parser.parse_args()
if args.contents:
print(contents())
else:
print(where())

File diff suppressed because it is too large

View File

@@ -0,0 +1,83 @@
"""
certifi.py
~~~~~~~~~~
This module returns the installation location of cacert.pem or its contents.
"""
import sys
import atexit
def exit_cacert_ctx() -> None:
_CACERT_CTX.__exit__(None, None, None) # type: ignore[union-attr]
if sys.version_info >= (3, 11):
from importlib.resources import as_file, files
_CACERT_CTX = None
_CACERT_PATH = None
def where() -> str:
# This is slightly terrible, but we want to delay extracting the file
# in cases where we're inside of a zipimport situation until someone
# actually calls where(), but we don't want to re-extract the file
# on every call of where(), so we'll do it once then store it in a
# global variable.
global _CACERT_CTX
global _CACERT_PATH
if _CACERT_PATH is None:
# This is slightly janky, the importlib.resources API wants you to
# manage the cleanup of this file, so it doesn't actually return a
# path, it returns a context manager that will give you the path
# when you enter it and will do any cleanup when you leave it. In
# the common case of not needing a temporary file, it will just
# return the file system location and the __exit__() is a no-op.
#
# We also have to hold onto the actual context manager, because
# it will do the cleanup whenever it gets garbage collected, so
# we will also store that at the global level as well.
_CACERT_CTX = as_file(files("certifi").joinpath("cacert.pem"))
_CACERT_PATH = str(_CACERT_CTX.__enter__())
atexit.register(exit_cacert_ctx)
return _CACERT_PATH
def contents() -> str:
return files("certifi").joinpath("cacert.pem").read_text(encoding="ascii")
else:
from importlib.resources import path as get_path, read_text
_CACERT_CTX = None
_CACERT_PATH = None
def where() -> str:
# This is slightly terrible, but we want to delay extracting the
# file in cases where we're inside of a zipimport situation until
# someone actually calls where(), but we don't want to re-extract
# the file on every call of where(), so we'll do it once then store
# it in a global variable.
global _CACERT_CTX
global _CACERT_PATH
if _CACERT_PATH is None:
# This is slightly janky, the importlib.resources API wants you
# to manage the cleanup of this file, so it doesn't actually
# return a path, it returns a context manager that will give
# you the path when you enter it and will do any cleanup when
# you leave it. In the common case of not needing a temporary
# file, it will just return the file system location and the
# __exit__() is a no-op.
#
# We also have to hold onto the actual context manager, because
# it will do the cleanup whenever it gets garbage collected, so
# we will also store that at the global level as well.
_CACERT_CTX = get_path("certifi", "cacert.pem")
_CACERT_PATH = str(_CACERT_CTX.__enter__())
atexit.register(exit_cacert_ctx)
return _CACERT_PATH
def contents() -> str:
return read_text("certifi", "cacert.pem", encoding="ascii")

View File

View File

@@ -0,0 +1,764 @@
Metadata-Version: 2.4
Name: charset-normalizer
Version: 3.4.4
Summary: The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet.
Author-email: "Ahmed R. TAHRI" <tahri.ahmed@proton.me>
Maintainer-email: "Ahmed R. TAHRI" <tahri.ahmed@proton.me>
License: MIT
Project-URL: Changelog, https://github.com/jawah/charset_normalizer/blob/master/CHANGELOG.md
Project-URL: Documentation, https://charset-normalizer.readthedocs.io/
Project-URL: Code, https://github.com/jawah/charset_normalizer
Project-URL: Issue tracker, https://github.com/jawah/charset_normalizer/issues
Keywords: encoding,charset,charset-detector,detector,normalization,unicode,chardet,detect
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: unicode-backport
Dynamic: license-file
<h1 align="center">Charset Detection, for Everyone 👋</h1>
<p align="center">
<sup>The Real First Universal Charset Detector</sup><br>
<a href="https://pypi.org/project/charset-normalizer">
<img src="https://img.shields.io/pypi/pyversions/charset_normalizer.svg?orange=blue" />
</a>
<a href="https://pepy.tech/project/charset-normalizer/">
<img alt="Download Count Total" src="https://static.pepy.tech/badge/charset-normalizer/month" />
</a>
<a href="https://bestpractices.coreinfrastructure.org/projects/7297">
<img src="https://bestpractices.coreinfrastructure.org/projects/7297/badge">
</a>
</p>
<p align="center">
<sup><i>Featured Packages</i></sup><br>
<a href="https://github.com/jawah/niquests">
<img alt="Static Badge" src="https://img.shields.io/badge/Niquests-Most_Advanced_HTTP_Client-cyan">
</a>
<a href="https://github.com/jawah/wassima">
<img alt="Static Badge" src="https://img.shields.io/badge/Wassima-Certifi_Replacement-cyan">
</a>
</p>
<p align="center">
<sup><i>In other language (unofficial port - by the community)</i></sup><br>
<a href="https://github.com/nickspring/charset-normalizer-rs">
<img alt="Static Badge" src="https://img.shields.io/badge/Rust-red">
</a>
</p>
> A library that helps you read text from an unknown charset encoding.<br /> Motivated by `chardet`,
> I'm trying to resolve the issue by taking a new approach.
> All IANA character set names for which the Python core library provides codecs are supported.
<p align="center">
>>>>> <a href="https://charsetnormalizerweb.ousret.now.sh" target="_blank">👉 Try Me Online Now, Then Adopt Me 👈 </a> <<<<<
</p>
This project offers you an alternative to **Universal Charset Encoding Detector**, also known as **Chardet**.
| Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) |
|--------------------------------------------------|:---------------------------------------------:|:--------------------------------------------------------------------------------------------------:|:-----------------------------------------------:|
| `Fast` | ❌ | ✅ | ✅ |
| `Universal**` | ❌ | ✅ | ❌ |
| `Reliable` **without** distinguishable standards | ❌ | ✅ | ✅ |
| `Reliable` **with** distinguishable standards | ✅ | ✅ | ✅ |
| `License` | LGPL-2.1<br>_restrictive_ | MIT | MPL-1.1<br>_restrictive_ |
| `Native Python` | ✅ | ✅ | ❌ |
| `Detect spoken language` | ❌ | ✅ | N/A |
| `UnicodeDecodeError Safety` | ❌ | ✅ | ❌ |
| `Whl Size (min)` | 193.6 kB | 42 kB | ~200 kB |
| `Supported Encoding` | 33 | 🎉 [99](https://charset-normalizer.readthedocs.io/en/latest/user/support.html#supported-encodings) | 40 |
<p align="center">
<img src="https://i.imgflip.com/373iay.gif" alt="Reading Normalized Text" width="226"/><img src="https://media.tenor.com/images/c0180f70732a18b4965448d33adba3d0/tenor.gif" alt="Cat Reading Text" width="200"/>
</p>
*\*\* : They clearly use specific code for a specific encoding, even if it covers most of the encodings in use.*<br>
## ⚡ Performance
This package offers better performance than its counterpart Chardet. Here are some numbers.
| Package | Accuracy | Mean per file (ms) | File per sec (est) |
|-----------------------------------------------|:--------:|:------------------:|:------------------:|
| [chardet](https://github.com/chardet/chardet) | 86 % | 63 ms | 16 file/sec |
| charset-normalizer | **98 %** | **10 ms** | 100 file/sec |
| Package | 99th percentile | 95th percentile | 50th percentile |
|-----------------------------------------------|:---------------:|:---------------:|:---------------:|
| [chardet](https://github.com/chardet/chardet) | 265 ms | 71 ms | 7 ms |
| charset-normalizer | 100 ms | 50 ms | 5 ms |
_updated as of december 2024 using CPython 3.12_
Chardet's performance on larger files (1 MB+) is very poor. Expect a huge difference on large payloads.
> Stats are generated using 400+ files with default parameters. For details on the files used, see the GHA workflows.
> And yes, these results might change at any time. The dataset can be updated to include more files.
> The actual delays depend heavily on your CPU capabilities. The factors should remain the same.
> Keep in mind that the stats are generous and that Chardet's accuracy versus ours is measured using Chardet's initial capability
> (e.g. supported encodings). Challenge them if you want.
## ✨ Installation
Using pip:
```sh
pip install charset-normalizer -U
```
## 🚀 Basic Usage
### CLI
This package comes with a CLI.
```
usage: normalizer [-h] [-v] [-a] [-n] [-m] [-r] [-f] [-t THRESHOLD]
file [file ...]
The Real First Universal Charset Detector. Discover originating encoding used
on text file. Normalize text to unicode.
positional arguments:
files File(s) to be analysed
optional arguments:
-h, --help show this help message and exit
-v, --verbose Display complementary information about file if any.
Stdout will contain logs about the detection process.
-a, --with-alternative
Output complementary possibilities if any. Top-level
JSON WILL be a list.
-n, --normalize Permit to normalize input file. If not set, program
does not write anything.
-m, --minimal Only output the charset detected to STDOUT. Disabling
JSON output.
-r, --replace Replace file when trying to normalize it instead of
creating a new one.
-f, --force Replace file without asking if you are sure, use this
flag with caution.
-t THRESHOLD, --threshold THRESHOLD
Define a custom maximum amount of chaos allowed in
decoded content. 0. <= chaos <= 1.
--version Show version information and exit.
```
```bash
normalizer ./data/sample.1.fr.srt
```
or
```bash
python -m charset_normalizer ./data/sample.1.fr.srt
```
🎉 Since version 1.4.0 the CLI produces easily usable stdout results in JSON format.
```json
{
"path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
"encoding": "cp1252",
"encoding_aliases": [
"1252",
"windows_1252"
],
"alternative_encodings": [
"cp1254",
"cp1256",
"cp1258",
"iso8859_14",
"iso8859_15",
"iso8859_16",
"iso8859_3",
"iso8859_9",
"latin_1",
"mbcs"
],
"language": "French",
"alphabets": [
"Basic Latin",
"Latin-1 Supplement"
],
"has_sig_or_bom": false,
"chaos": 0.149,
"coherence": 97.152,
"unicode_path": null,
"is_preferred": true
}
```
### Python
*Just print out normalized text*
```python
from charset_normalizer import from_path
results = from_path('./my_subtitle.srt')
print(str(results.best()))
```
*Upgrade your code without effort*
```python
from charset_normalizer import detect
```
The above code will behave the same as **chardet**. We ensure that we offer the best (reasonable) BC result possible.
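A small sketch of that drop-in usage; the sample string is invented, and very small inputs may legitimately detect as a sibling single-byte codec:

```python
from charset_normalizer import detect

# Same call shape as chardet.detect(); the result dict carries the same keys.
result = detect("Bonjour, où êtes-vous ?".encode("cp1252"))
print(result["encoding"], result["confidence"], result["language"])
```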
See the docs for advanced usage : [readthedocs.io](https://charset-normalizer.readthedocs.io/en/latest/)
## 😇 Why
When I started using Chardet, I noticed that it was not suited to my expectations, and I wanted to propose a
reliable alternative using a completely different method. Also! I never back down on a good challenge!
I **don't care** about the **originating charset** encoding, because **two different tables** can
produce **two identical rendered strings.**
What I want is to get readable text, the best I can.
In a way, **I'm brute forcing text decoding.** How cool is that ? 😎
Don't confuse the package **ftfy** with charset-normalizer or chardet. ftfy's goal is to repair Unicode strings, whereas charset-normalizer's is to convert a raw file in an unknown encoding to Unicode.
## 🍰 How
- Discard all charset encoding tables that could not fit the binary content.
- Measure the noise, i.e. the mess, once the content is opened (in chunks) with a candidate charset encoding.
- Extract matches with the lowest mess detected.
- Additionally, we measure coherence / probe for a language.
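
To make the first bullet concrete, here is a deliberately naive sketch (not the library's actual implementation): decode the payload with each candidate codec and keep only the tables that fit.

```python
from typing import List

# Toy illustration of the first step only; charset-normalizer's real pipeline
# also scores noise and language coherence on the surviving candidates.
CANDIDATES = ["ascii", "utf_8", "cp1252", "latin_1", "utf_16"]


def plausible_codecs(payload: bytes) -> List[str]:
    survivors = []
    for codec in CANDIDATES:
        try:
            payload.decode(codec)
        except UnicodeError:
            continue  # this table could not fit the binary content
        survivors.append(codec)
    return survivors


print(plausible_codecs("héllo".encode("cp1252")))  # -> ['cp1252', 'latin_1']
```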
**Wait a minute**, what is noise/mess and coherence according to **YOU ?**
*Noise :* I opened hundreds of text files, **written by humans**, with the wrong encoding table. **I observed**, then
**I established** some ground rules about **what is obvious** when **it seems like** a mess (aka. defining noise in rendered text).
I know that my interpretation of what is noise is probably incomplete, feel free to contribute in order to
improve or rewrite it.
*Coherence :* For each language on Earth, we have computed ranked letter-appearance occurrences (as best we can). I figured
that intel is worth something here, so I use those records against decoded text to check whether I can detect intelligent design.
## ⚡ Known limitations
- Language detection is unreliable when text contains two or more languages sharing identical letters. (e.g. HTML (English tags) + Turkish content (sharing Latin characters))
- Every charset detector heavily depends on sufficient content. In common cases, do not bother running detection on very tiny content.
## ⚠️ About Python EOLs
**If you are running:**
- Python >=2.7,<3.5: Unsupported
- Python 3.5: charset-normalizer < 2.1
- Python 3.6: charset-normalizer < 3.1
- Python 3.7: charset-normalizer < 4.0
Upgrade your Python interpreter as soon as possible.
## 👤 Contributing
Contributions, issues and feature requests are very much welcome.<br />
Feel free to check [issues page](https://github.com/ousret/charset_normalizer/issues) if you want to contribute.
## 📝 License
Copyright © [Ahmed TAHRI @Ousret](https://github.com/Ousret).<br />
This project is [MIT](https://github.com/Ousret/charset_normalizer/blob/master/LICENSE) licensed.
Characters frequencies used in this project © 2012 [Denny Vrandečić](http://simia.net/letters/)
## 💼 For Enterprise
Professional support for charset-normalizer is available as part of the [Tidelift
Subscription][1]. Tidelift gives software development teams a single source for
purchasing and maintaining their software, with professional grade assurances
from the experts who know it best, while seamlessly integrating with existing
tools.
[1]: https://tidelift.com/subscription/pkg/pypi-charset-normalizer?utm_source=pypi-charset-normalizer&utm_medium=readme
[![OpenSSF Best Practices](https://www.bestpractices.dev/projects/7297/badge)](https://www.bestpractices.dev/projects/7297)
# Changelog
All notable changes to charset-normalizer will be documented in this file. This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
## [3.4.4](https://github.com/Ousret/charset_normalizer/compare/3.4.2...3.4.4) (2025-10-13)
### Changed
- Bound `setuptools` to a specific constraint `setuptools>=68,<=81`.
- Raised upper bound of mypyc for the optional pre-built extension to v1.18.2
### Removed
- `setuptools-scm` as a build dependency.
### Misc
- Enforced hashes in `dev-requirements.txt` and created `ci-requirements.txt` for security purposes.
- Additional pre-built wheels for riscv64, s390x, and armv7l architectures.
- Restore `multiple.intoto.jsonl` in GitHub releases in addition to individual attestation file per wheel.
## [3.4.3](https://github.com/Ousret/charset_normalizer/compare/3.4.2...3.4.3) (2025-08-09)
### Changed
- mypy(c) is no longer a required dependency at build time if `CHARSET_NORMALIZER_USE_MYPYC` isn't set to `1`. (#595) (#583)
- automatically lower confidence on small bytes samples that are not Unicode in `detect` output legacy function. (#391)
### Added
- Custom build backend to overcome inability to mark mypy as an optional dependency in the build phase.
- Support for Python 3.14
### Fixed
- sdist archive contained useless directories.
- automatically fallback on valid UTF-16 or UTF-32 even if the md says it's noisy. (#633)
### Misc
- SBOM are automatically published to the relevant GitHub release to comply with regulatory changes.
Each published wheel comes with its SBOM. We choose CycloneDX as the format.
- Prebuilt optimized wheel are no longer distributed by default for CPython 3.7 due to a change in cibuildwheel.
## [3.4.2](https://github.com/Ousret/charset_normalizer/compare/3.4.1...3.4.2) (2025-05-02)
### Fixed
- Addressed the DeprecationWarning in our CLI regarding `argparse.FileType` by backporting the target class into the package. (#591)
- Improved the overall reliability of the detector with CJK Ideographs. (#605) (#587)
### Changed
- Optional mypyc compilation upgraded to version 1.15 for Python >= 3.8
## [3.4.1](https://github.com/Ousret/charset_normalizer/compare/3.4.0...3.4.1) (2024-12-24)
### Changed
- Project metadata are now stored using `pyproject.toml` instead of `setup.cfg` using setuptools as the build backend.
- Enforce annotation delayed loading for a simpler and consistent types in the project.
- Optional mypyc compilation upgraded to version 1.14 for Python >= 3.8
### Added
- pre-commit configuration.
- noxfile.
### Removed
- `build-requirements.txt` as per using `pyproject.toml` native build configuration.
- `bin/integration.py` and `bin/serve.py` in favor of downstream integration test (see noxfile).
- `setup.cfg` in favor of `pyproject.toml` metadata configuration.
- Unused `utils.range_scan` function.
### Fixed
- Converting content to Unicode bytes may insert `utf_8` instead of preferred `utf-8`. (#572)
- Deprecation warning "'count' is passed as positional argument" when converting to Unicode bytes on Python 3.13+
## [3.4.0](https://github.com/Ousret/charset_normalizer/compare/3.3.2...3.4.0) (2024-10-08)
### Added
- Argument `--no-preemptive` in the CLI to prevent the detector to search for hints.
- Support for Python 3.13 (#512)
### Fixed
- Relax the TypeError exception thrown when trying to compare a CharsetMatch with anything else than a CharsetMatch.
- Improved the general reliability of the detector based on user feedbacks. (#520) (#509) (#498) (#407) (#537)
- Declared charset in content (preemptive detection) not changed when converting to utf-8 bytes. (#381)
## [3.3.2](https://github.com/Ousret/charset_normalizer/compare/3.3.1...3.3.2) (2023-10-31)
### Fixed
- Unintentional memory usage regression when using large payload that match several encoding (#376)
- Regression on some detection case showcased in the documentation (#371)
### Added
- Noise (md) probe that identifies malformed Arabic representation due to the presence of letters in isolated form (credit to my wife)
## [3.3.1](https://github.com/Ousret/charset_normalizer/compare/3.3.0...3.3.1) (2023-10-22)
### Changed
- Optional mypyc compilation upgraded to version 1.6.1 for Python >= 3.8
- Improved the general detection reliability based on reports from the community
## [3.3.0](https://github.com/Ousret/charset_normalizer/compare/3.2.0...3.3.0) (2023-09-30)
### Added
- Allow to execute the CLI (e.g. normalizer) through `python -m charset_normalizer.cli` or `python -m charset_normalizer`
- Support for 9 forgotten encoding that are supported by Python but unlisted in `encoding.aliases` as they have no alias (#323)
### Removed
- (internal) Redundant utils.is_ascii function and unused function is_private_use_only
- (internal) charset_normalizer.assets is moved inside charset_normalizer.constant
### Changed
- (internal) Unicode code blocks in constants are updated using the latest v15.0.0 definition to improve detection
- Optional mypyc compilation upgraded to version 1.5.1 for Python >= 3.8
### Fixed
- Unable to properly sort CharsetMatch when both chaos/noise and coherence were close due to an unreachable condition in \_\_lt\_\_ (#350)
## [3.2.0](https://github.com/Ousret/charset_normalizer/compare/3.1.0...3.2.0) (2023-06-07)
### Changed
- Typehint for function `from_path` no longer enforce `PathLike` as its first argument
- Minor improvement over the global detection reliability
### Added
- Introduce function `is_binary` that relies on main capabilities, and optimized to detect binaries
- Propagate `enable_fallback` argument throughout `from_bytes`, `from_path`, and `from_fp` that allow a deeper control over the detection (default True)
- Explicit support for Python 3.12
### Fixed
- Edge case detection failure where a file would contain 'very-long' camel cased word (Issue #289)
## [3.1.0](https://github.com/Ousret/charset_normalizer/compare/3.0.1...3.1.0) (2023-03-06)
### Added
- Argument `should_rename_legacy` for legacy function `detect` and disregard any new arguments without errors (PR #262)
### Removed
- Support for Python 3.6 (PR #260)
### Changed
- Optional speedup provided by mypy/c 1.0.1
## [3.0.1](https://github.com/Ousret/charset_normalizer/compare/3.0.0...3.0.1) (2022-11-18)
### Fixed
- Multi-bytes cutter/chunk generator did not always cut correctly (PR #233)
### Changed
- Speedup provided by mypy/c 0.990 on Python >= 3.7
## [3.0.0](https://github.com/Ousret/charset_normalizer/compare/2.1.1...3.0.0) (2022-10-20)
### Added
- Extend the capability of explain=True when cp_isolation contains at most two entries (min one), will log in details of the Mess-detector results
- Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES
- Add parameter `language_threshold` in `from_bytes`, `from_path` and `from_fp` to adjust the minimum expected coherence ratio
- `normalizer --version` now specify if current version provide extra speedup (meaning mypyc compilation whl)
### Changed
- Build with static metadata using 'build' frontend
- Make the language detection stricter
- Optional: Module `md.py` can be compiled using Mypyc to provide an extra speedup up to 4x faster than v2.1
### Fixed
- CLI with opt --normalize fail when using full path for files
- TooManyAccentuatedPlugin induce false positive on the mess detection when too few alpha character have been fed to it
- Sphinx warnings when generating the documentation
### Removed
- Coherence detector no longer return 'Simple English' instead return 'English'
- Coherence detector no longer return 'Classical Chinese' instead return 'Chinese'
- Breaking: Method `first()` and `best()` from CharsetMatch
- UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflict with ASCII)
- Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches
- Breaking: Top-level function `normalize`
- Breaking: Properties `chaos_secondary_pass`, `coherence_non_latin` and `w_counter` from CharsetMatch
- Support for the backport `unicodedata2`
## [3.0.0rc1](https://github.com/Ousret/charset_normalizer/compare/3.0.0b2...3.0.0rc1) (2022-10-18)
### Added
- Extend the capability of explain=True when cp_isolation contains at most two entries (min one), will log in details of the Mess-detector results
- Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES
- Add parameter `language_threshold` in `from_bytes`, `from_path` and `from_fp` to adjust the minimum expected coherence ratio
### Changed
- Build with static metadata using 'build' frontend
- Make the language detection stricter
### Fixed
- CLI with opt --normalize fail when using full path for files
- TooManyAccentuatedPlugin induce false positive on the mess detection when too few alpha character have been fed to it
### Removed
- Coherence detector no longer return 'Simple English' instead return 'English'
- Coherence detector no longer return 'Classical Chinese' instead return 'Chinese'
## [3.0.0b2](https://github.com/Ousret/charset_normalizer/compare/3.0.0b1...3.0.0b2) (2022-08-21)
### Added
- `normalizer --version` now specify if current version provide extra speedup (meaning mypyc compilation whl)
### Removed
- Breaking: Method `first()` and `best()` from CharsetMatch
- UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflict with ASCII)
### Fixed
- Sphinx warnings when generating the documentation
## [3.0.0b1](https://github.com/Ousret/charset_normalizer/compare/2.1.0...3.0.0b1) (2022-08-15)
### Changed
- Optional: Module `md.py` can be compiled using Mypyc to provide an extra speedup up to 4x faster than v2.1
### Removed
- Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches
- Breaking: Top-level function `normalize`
- Breaking: Properties `chaos_secondary_pass`, `coherence_non_latin` and `w_counter` from CharsetMatch
- Support for the backport `unicodedata2`
## [2.1.1](https://github.com/Ousret/charset_normalizer/compare/2.1.0...2.1.1) (2022-08-19)
### Deprecated
- Function `normalize` scheduled for removal in 3.0
### Changed
- Removed useless call to decode in fn is_unprintable (#206)
### Fixed
- Third-party library (i18n xgettext) crashing not recognizing utf_8 (PEP 263) with underscore from [@aleksandernovikov](https://github.com/aleksandernovikov) (#204)
## [2.1.0](https://github.com/Ousret/charset_normalizer/compare/2.0.12...2.1.0) (2022-06-19)
### Added
- Output the Unicode table version when running the CLI with `--version` (PR #194)
### Changed
- Re-use decoded buffer for single byte character sets from [@nijel](https://github.com/nijel) (PR #175)
- Fixing some performance bottlenecks from [@deedy5](https://github.com/deedy5) (PR #183)
### Fixed
- Workaround potential bug in cpython with Zero Width No-Break Space located in Arabic Presentation Forms-B, Unicode 1.1 not acknowledged as space (PR #175)
- CLI default threshold aligned with the API threshold from [@oleksandr-kuzmenko](https://github.com/oleksandr-kuzmenko) (PR #181)
### Removed
- Support for Python 3.5 (PR #192)
### Deprecated
- Use of backport unicodedata from `unicodedata2` as Python is quickly catching up, scheduled for removal in 3.0 (PR #194)
## [2.0.12](https://github.com/Ousret/charset_normalizer/compare/2.0.11...2.0.12) (2022-02-12)
### Fixed
- ASCII miss-detection on rare cases (PR #170)
## [2.0.11](https://github.com/Ousret/charset_normalizer/compare/2.0.10...2.0.11) (2022-01-30)
### Added
- Explicit support for Python 3.11 (PR #164)
### Changed
- The logging behavior has been completely reviewed, now using only TRACE and DEBUG levels (PR #163 #165)
## [2.0.10](https://github.com/Ousret/charset_normalizer/compare/2.0.9...2.0.10) (2022-01-04)
### Fixed
- Fallback match entries might lead to UnicodeDecodeError for large bytes sequence (PR #154)
### Changed
- Skipping the language-detection (CD) on ASCII (PR #155)
## [2.0.9](https://github.com/Ousret/charset_normalizer/compare/2.0.8...2.0.9) (2021-12-03)
### Changed
- Moderating the logging impact (since 2.0.8) for specific environments (PR #147)
### Fixed
- Wrong logging level applied when setting kwarg `explain` to True (PR #146)
## [2.0.8](https://github.com/Ousret/charset_normalizer/compare/2.0.7...2.0.8) (2021-11-24)
### Changed
- Improvement over Vietnamese detection (PR #126)
- MD improvement on trailing data and long foreign (non-pure latin) data (PR #124)
- Efficiency improvements in cd/alphabet_languages from [@adbar](https://github.com/adbar) (PR #122)
- call sum() without an intermediary list following PEP 289 recommendations from [@adbar](https://github.com/adbar) (PR #129)
- Code style as refactored by Sourcery-AI (PR #131)
- Minor adjustment on the MD around european words (PR #133)
- Remove and replace SRTs from assets / tests (PR #139)
- Initialize the library logger with a `NullHandler` by default from [@nmaynes](https://github.com/nmaynes) (PR #135)
- Setting kwarg `explain` to True will add provisionally (bounded to function lifespan) a specific stream handler (PR #135)
### Fixed
- Fix large (misleading) sequence giving UnicodeDecodeError (PR #137)
- Avoid using too insignificant chunk (PR #137)
### Added
- Add and expose function `set_logging_handler` to configure a specific StreamHandler from [@nmaynes](https://github.com/nmaynes) (PR #135)
- Add `CHANGELOG.md` entries, format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) (PR #141)
## [2.0.7](https://github.com/Ousret/charset_normalizer/compare/2.0.6...2.0.7) (2021-10-11)
### Added
- Add support for Kazakh (Cyrillic) language detection (PR #109)
### Changed
- Further, improve inferring the language from a given single-byte code page (PR #112)
- Vainly trying to leverage PEP263 when PEP3120 is not supported (PR #116)
- Refactoring for potential performance improvements in loops from [@adbar](https://github.com/adbar) (PR #113)
- Various detection improvement (MD+CD) (PR #117)
### Removed
- Remove redundant logging entry about detected language(s) (PR #115)
### Fixed
- Fix a minor inconsistency between Python 3.5 and other versions regarding language detection (PR #117 #102)
## [2.0.6](https://github.com/Ousret/charset_normalizer/compare/2.0.5...2.0.6) (2021-09-18)
### Fixed
- Unforeseen regression with the loss of the backward-compatibility with some older minor of Python 3.5.x (PR #100)
- Fix CLI crash when using --minimal output in certain cases (PR #103)
### Changed
- Minor improvement to the detection efficiency (less than 1%) (PR #106 #101)
## [2.0.5](https://github.com/Ousret/charset_normalizer/compare/2.0.4...2.0.5) (2021-09-14)
### Changed
- The project now comply with: flake8, mypy, isort and black to ensure a better overall quality (PR #81)
- The BC-support with v1.x was improved, the old staticmethods are restored (PR #82)
- The Unicode detection is slightly improved (PR #93)
- Add syntax sugar \_\_bool\_\_ for results CharsetMatches list-container (PR #91)
### Removed
- The project no longer raise warning on tiny content given for detection, will be simply logged as warning instead (PR #92)
### Fixed
- In some rare case, the chunks extractor could cut in the middle of a multi-byte character and could mislead the mess detection (PR #95)
- Some rare 'space' characters could trip up the UnprintablePlugin/Mess detection (PR #96)
- The MANIFEST.in was not exhaustive (PR #78)
## [2.0.4](https://github.com/Ousret/charset_normalizer/compare/2.0.3...2.0.4) (2021-07-30)
### Fixed
- The CLI no longer raise an unexpected exception when no encoding has been found (PR #70)
- Fix accessing the 'alphabets' property when the payload contains surrogate characters (PR #68)
- The logger could mislead (explain=True) on detected languages and the impact of one MBCS match (PR #72)
- Submatch factoring could be wrong in rare edge cases (PR #72)
- Multiple files given to the CLI were ignored when publishing results to STDOUT. (After the first path) (PR #72)
- Fix line endings from CRLF to LF for certain project files (PR #67)
### Changed
- Adjust the MD to lower the sensitivity, thus improving the global detection reliability (PR #69 #76)
- Allow fallback on specified encoding if any (PR #71)
## [2.0.3](https://github.com/Ousret/charset_normalizer/compare/2.0.2...2.0.3) (2021-07-16)
### Changed
- Part of the detection mechanism has been improved to be less sensitive, resulting in more accurate detection results. Especially ASCII. (PR #63)
- According to the community wishes, the detection will fall back on ASCII or UTF-8 in a last-resort case. (PR #64)
## [2.0.2](https://github.com/Ousret/charset_normalizer/compare/2.0.1...2.0.2) (2021-07-15)
### Fixed
- Empty/Too small JSON payload miss-detection fixed. Report from [@tseaver](https://github.com/tseaver) (PR #59)
### Changed
- Don't inject unicodedata2 into sys.modules from [@akx](https://github.com/akx) (PR #57)
## [2.0.1](https://github.com/Ousret/charset_normalizer/compare/2.0.0...2.0.1) (2021-07-13)
### Fixed
- Make it work where there isn't a filesystem available, dropping assets frequencies.json. Report from [@sethmlarson](https://github.com/sethmlarson). (PR #55)
- Using explain=False permanently disable the verbose output in the current runtime (PR #47)
- One log entry (language target preemptive) was not show in logs when using explain=True (PR #47)
- Fix undesired exception (ValueError) on getitem of instance CharsetMatches (PR #52)
### Changed
- Public function normalize default args values were not aligned with from_bytes (PR #53)
### Added
- You may now use charset aliases in cp_isolation and cp_exclusion arguments (PR #47)
## [2.0.0](https://github.com/Ousret/charset_normalizer/compare/1.4.1...2.0.0) (2021-07-02)
### Changed
- 4x to 5 times faster than the previous 1.4.0 release. At least 2x faster than Chardet.
- Accent has been made on UTF-8 detection, should perform rather instantaneous.
- The backward compatibility with Chardet has been greatly improved. The legacy detect function returns an identical charset name whenever possible.
- The detection mechanism has been slightly improved, now Turkish content is detected correctly (most of the time)
- The program has been rewritten to ease readability and maintainability (now using static typing).
- utf_7 detection has been reinstated.
### Removed
- This package no longer require anything when used with Python 3.5 (Dropped cached_property)
- Removed support for these languages: Catalan, Esperanto, Kazakh, Basque, Volapük, Azeri, Galician, Nynorsk, Macedonian, and Serbo-Croatian.
- The exception hook on UnicodeDecodeError has been removed.
### Deprecated
- Methods coherence_non_latin, w_counter, chaos_secondary_pass of the class CharsetMatch are now deprecated and scheduled for removal in v3.0
### Fixed
- The CLI output used the relative path of the file(s). Should be absolute.
## [1.4.1](https://github.com/Ousret/charset_normalizer/compare/1.4.0...1.4.1) (2021-05-28)
### Fixed
- Logger configuration/usage no longer conflict with others (PR #44)
## [1.4.0](https://github.com/Ousret/charset_normalizer/compare/1.3.9...1.4.0) (2021-05-21)
### Removed
- Using standard logging instead of using the package loguru.
- Dropping nose test framework in favor of the maintained pytest.
- Choose to not use dragonmapper package to help with gibberish Chinese/CJK text.
- Require cached_property only for Python 3.5 due to constraint. Dropping for every other interpreter version.
- Stop support for UTF-7 that does not contain a SIG.
- Dropping PrettyTable, replaced with pure JSON output in CLI.
### Fixed
- BOM marker in a CharsetNormalizerMatch instance could be False in rare cases even if obviously present. Due to the sub-match factoring process.
- Not searching properly for the BOM when trying utf32/16 parent codec.
### Changed
- Improving the package final size by compressing frequencies.json.
- Huge improvement over the larges payload.
### Added
- CLI now produces JSON consumable output.
- Return ASCII if given sequences fit. Given reasonable confidence.
## [1.3.9](https://github.com/Ousret/charset_normalizer/compare/1.3.8...1.3.9) (2021-05-13)
### Fixed
- In some very rare cases, you may end up getting encode/decode errors due to a bad bytes payload (PR #40)
## [1.3.8](https://github.com/Ousret/charset_normalizer/compare/1.3.7...1.3.8) (2021-05-12)
### Fixed
- An empty payload given for detection could cause an exception when accessing the `alphabets` property. (PR #39)
## [1.3.7](https://github.com/Ousret/charset_normalizer/compare/1.3.6...1.3.7) (2021-05-12)
### Fixed
- The legacy detect function should return UTF-8-SIG if a SIG is present in the payload. (PR #38)
## [1.3.6](https://github.com/Ousret/charset_normalizer/compare/1.3.5...1.3.6) (2021-02-09)
### Changed
- Amend the previous release to allow prettytable 2.0 (PR #35)
## [1.3.5](https://github.com/Ousret/charset_normalizer/compare/1.3.4...1.3.5) (2021-02-08)
### Fixed
- Fixed an error when using the package with a Python pre-release interpreter (PR #33)
### Changed
- Dependencies refactoring, constraints revised.
### Added
- Added Python 3.9 and 3.10 to the supported interpreters
MIT License
Copyright (c) 2025 TAHRI Ahmed R.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

View File

@@ -0,0 +1,35 @@
../../Scripts/normalizer.exe,sha256=swCbbsYGphhCYhRLu1NPZcgzP9vs9NMw8xpifu59AwU,108386
charset_normalizer-3.4.4.dist-info/INSTALLER,sha256=zuuue4knoyJ-UwPPXg8fezS7VCrXJQrAP7zeNuwvFQg,4
charset_normalizer-3.4.4.dist-info/METADATA,sha256=Mg5oc0yfpVMtDcprHt_pPbbV0qUSHEeaEz4NG53pmyY,38067
charset_normalizer-3.4.4.dist-info/RECORD,,
charset_normalizer-3.4.4.dist-info/WHEEL,sha256=8UP9x9puWI0P1V_d7K2oMTBqfeLNm21CTzZ_Ptr0NXU,101
charset_normalizer-3.4.4.dist-info/entry_points.txt,sha256=ADSTKrkXZ3hhdOVFi6DcUEHQRS0xfxDIE_pEz4wLIXA,65
charset_normalizer-3.4.4.dist-info/licenses/LICENSE,sha256=GFd0hdNwTxpHne2OVzwJds_tMV_S_ReYP6mI2kwvcNE,1092
charset_normalizer-3.4.4.dist-info/top_level.txt,sha256=7ASyzePr8_xuZWJsnqJjIBtyV8vhEo0wBCv1MPRRi3Q,19
charset_normalizer/__init__.py,sha256=0NT8MHi7SKq3juMqYfOdrkzjisK0L73lneNHH4qaUAs,1638
charset_normalizer/__main__.py,sha256=2sj_BS6H0sU25C1bMqz9DVwa6kOK9lchSEbSU-_iu7M,115
charset_normalizer/__pycache__/__init__.cpython-312.pyc,,
charset_normalizer/__pycache__/__main__.cpython-312.pyc,,
charset_normalizer/__pycache__/api.cpython-312.pyc,,
charset_normalizer/__pycache__/cd.cpython-312.pyc,,
charset_normalizer/__pycache__/constant.cpython-312.pyc,,
charset_normalizer/__pycache__/legacy.cpython-312.pyc,,
charset_normalizer/__pycache__/md.cpython-312.pyc,,
charset_normalizer/__pycache__/models.cpython-312.pyc,,
charset_normalizer/__pycache__/utils.cpython-312.pyc,,
charset_normalizer/__pycache__/version.cpython-312.pyc,,
charset_normalizer/api.py,sha256=ODy4hX78b3ldTl5sViYPU1yzQ5qkclfgSIFE8BtNrTI,23337
charset_normalizer/cd.py,sha256=uq8nVxRpR6Guc16ACvOWtL8KO3w7vYaCh8hHisuOyTg,12917
charset_normalizer/cli/__init__.py,sha256=d9MUx-1V_qD3x9igIy4JT4oC5CU0yjulk7QyZWeRFhg,144
charset_normalizer/cli/__main__.py,sha256=-pdJCyPywouPyFsC8_eTSgTmvh1YEvgjsvy1WZ0XjaA,13027
charset_normalizer/cli/__pycache__/__init__.cpython-312.pyc,,
charset_normalizer/cli/__pycache__/__main__.cpython-312.pyc,,
charset_normalizer/constant.py,sha256=mCJmYzpBU27Ut9kiNWWoBbhhxQ-aRVw3K7LSwoFwBGI,44728
charset_normalizer/legacy.py,sha256=ui08NlKqAXU3Y7smK-NFJjEgRRQz9ruM7aNCbT0OOrE,2811
charset_normalizer/md.cp312-win_amd64.pyd,sha256=dqU14JU7SKI0i4dyNqV5nPHQHLIUIsfxeULzU2fLXI8,10752
charset_normalizer/md.py,sha256=LSuW2hNgXSgF7JGdRapLAHLuj6pABHiP85LTNAYmu7c,20780
charset_normalizer/md__mypyc.cp312-win_amd64.pyd,sha256=CDDD_25vg5Sn3xcPlfwQ3mWrnyKzD50jg_DMKZuN8QE,126976
charset_normalizer/models.py,sha256=ZR2PE-fqf6dASZfqdE5Uhkmr0o1MciSdXOjuNqwkmvg,12754
charset_normalizer/py.typed,sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU,0
charset_normalizer/utils.py,sha256=XtWIQeOuz7cnGebMzyi4Vvi1JtA84QBSIeR9PDzF7pw,12584
charset_normalizer/version.py,sha256=MhW8dOLls4GbbxBUqeS1huc7Rth1ArKi4nS90qTFwz8,123

View File

@@ -0,0 +1,5 @@
Wheel-Version: 1.0
Generator: setuptools (80.9.0)
Root-Is-Purelib: false
Tag: cp312-cp312-win_amd64

View File

@@ -0,0 +1,2 @@
[console_scripts]
normalizer = charset_normalizer.cli:cli_detect

View File

@@ -0,0 +1,21 @@
MIT License
Copyright (c) 2025 TAHRI Ahmed R.
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

View File

@@ -0,0 +1 @@
charset_normalizer

View File

@@ -0,0 +1,48 @@
"""
Charset-Normalizer
~~~~~~~~~~~~~~~~~~
The Real First Universal Charset Detector.
A library that helps you read text from an unknown charset encoding.
Motivated by chardet, this package tries to resolve the issue by taking a new approach.
All IANA character set names for which the Python core library provides codecs are supported.
Basic usage:
>>> from charset_normalizer import from_bytes
>>> results = from_bytes('Bсеки човек има право на образование. Oбразованието!'.encode('utf_8'))
>>> best_guess = results.best()
>>> str(best_guess)
'Bсеки човек има право на образование. Oбразованието!'
Other methods and usages are available; see the full documentation
at <https://github.com/Ousret/charset_normalizer>.
:copyright: (c) 2021 by Ahmed TAHRI
:license: MIT, see LICENSE for more details.
"""
from __future__ import annotations
import logging
from .api import from_bytes, from_fp, from_path, is_binary
from .legacy import detect
from .models import CharsetMatch, CharsetMatches
from .utils import set_logging_handler
from .version import VERSION, __version__
__all__ = (
"from_fp",
"from_path",
"from_bytes",
"is_binary",
"detect",
"CharsetMatch",
"CharsetMatches",
"__version__",
"VERSION",
"set_logging_handler",
)
# Attach a NullHandler to the top level logger by default
# https://docs.python.org/3.3/howto/logging.html#configuring-logging-for-a-library
logging.getLogger("charset_normalizer").addHandler(logging.NullHandler())
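# Editor's note: an illustrative sketch (not part of the original file) showing how a
# consumer might opt into this library's logs using only the standard library, since
# only a NullHandler is attached by default:
#
#   import logging
#   logging.basicConfig(format="%(levelname)s %(message)s")
#   logging.getLogger("charset_normalizer").setLevel(logging.DEBUG)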

View File

@@ -0,0 +1,6 @@
from __future__ import annotations
from .cli import cli_detect
if __name__ == "__main__":
cli_detect()

View File

@@ -0,0 +1,669 @@
from __future__ import annotations
import logging
from os import PathLike
from typing import BinaryIO
from .cd import (
coherence_ratio,
encoding_languages,
mb_encoding_languages,
merge_coherence_ratios,
)
from .constant import IANA_SUPPORTED, TOO_BIG_SEQUENCE, TOO_SMALL_SEQUENCE, TRACE
from .md import mess_ratio
from .models import CharsetMatch, CharsetMatches
from .utils import (
any_specified_encoding,
cut_sequence_chunks,
iana_name,
identify_sig_or_bom,
is_cp_similar,
is_multi_byte_encoding,
should_strip_sig_or_bom,
)
logger = logging.getLogger("charset_normalizer")
explain_handler = logging.StreamHandler()
explain_handler.setFormatter(
logging.Formatter("%(asctime)s | %(levelname)s | %(message)s")
)
def from_bytes(
sequences: bytes | bytearray,
steps: int = 5,
chunk_size: int = 512,
threshold: float = 0.2,
cp_isolation: list[str] | None = None,
cp_exclusion: list[str] | None = None,
preemptive_behaviour: bool = True,
explain: bool = False,
language_threshold: float = 0.1,
enable_fallback: bool = True,
) -> CharsetMatches:
"""
Given a raw bytes sequence, return the best possible charsets usable to render str objects.
If there are no results, it is a strong indicator that the source is binary/not text.
By default, the process will extract 5 blocks of 512 bytes each to assess the mess and coherence of a given sequence,
and will give up on a particular code page after 20% of measured mess. Those criteria are customizable at will.
The preemptive behavior DOES NOT replace the traditional detection workflow; it prioritizes a particular code page
but never takes it for granted. It can improve performance.
You may want to focus your attention on some code pages and/or exclude others; use cp_isolation and cp_exclusion
for that purpose.
This function will strip the SIG from the payload/sequence every time, except for UTF-16 and UTF-32.
By default the library does not set up any handler other than the NullHandler; if you set the 'explain'
toggle to True it will alter the logger configuration to add a StreamHandler that is suitable for debugging.
A custom logging format and handler can be set manually.
"""
if not isinstance(sequences, (bytearray, bytes)):
raise TypeError(
"Expected object of type bytes or bytearray, got: {}".format(
type(sequences)
)
)
if explain:
previous_logger_level: int = logger.level
logger.addHandler(explain_handler)
logger.setLevel(TRACE)
length: int = len(sequences)
if length == 0:
logger.debug("Encoding detection on empty bytes, assuming utf_8 intention.")
if explain: # Defensive: ensure exit path clean handler
logger.removeHandler(explain_handler)
logger.setLevel(previous_logger_level or logging.WARNING)
return CharsetMatches([CharsetMatch(sequences, "utf_8", 0.0, False, [], "")])
if cp_isolation is not None:
logger.log(
TRACE,
"cp_isolation is set. use this flag for debugging purpose. "
"limited list of encoding allowed : %s.",
", ".join(cp_isolation),
)
cp_isolation = [iana_name(cp, False) for cp in cp_isolation]
else:
cp_isolation = []
if cp_exclusion is not None:
logger.log(
TRACE,
"cp_exclusion is set. use this flag for debugging purpose. "
"limited list of encoding excluded : %s.",
", ".join(cp_exclusion),
)
cp_exclusion = [iana_name(cp, False) for cp in cp_exclusion]
else:
cp_exclusion = []
if length <= (chunk_size * steps):
logger.log(
TRACE,
"override steps (%i) and chunk_size (%i) as content does not fit (%i byte(s) given) parameters.",
steps,
chunk_size,
length,
)
steps = 1
chunk_size = length
if steps > 1 and length / steps < chunk_size:
chunk_size = int(length / steps)
is_too_small_sequence: bool = len(sequences) < TOO_SMALL_SEQUENCE
is_too_large_sequence: bool = len(sequences) >= TOO_BIG_SEQUENCE
if is_too_small_sequence:
logger.log(
TRACE,
"Trying to detect encoding from a tiny portion of ({}) byte(s).".format(
length
),
)
elif is_too_large_sequence:
logger.log(
TRACE,
"Using lazy str decoding because the payload is quite large, ({}) byte(s).".format(
length
),
)
prioritized_encodings: list[str] = []
specified_encoding: str | None = (
any_specified_encoding(sequences) if preemptive_behaviour else None
)
if specified_encoding is not None:
prioritized_encodings.append(specified_encoding)
logger.log(
TRACE,
"Detected declarative mark in sequence. Priority +1 given for %s.",
specified_encoding,
)
tested: set[str] = set()
tested_but_hard_failure: list[str] = []
tested_but_soft_failure: list[str] = []
fallback_ascii: CharsetMatch | None = None
fallback_u8: CharsetMatch | None = None
fallback_specified: CharsetMatch | None = None
results: CharsetMatches = CharsetMatches()
early_stop_results: CharsetMatches = CharsetMatches()
sig_encoding, sig_payload = identify_sig_or_bom(sequences)
if sig_encoding is not None:
prioritized_encodings.append(sig_encoding)
logger.log(
TRACE,
"Detected a SIG or BOM mark on first %i byte(s). Priority +1 given for %s.",
len(sig_payload),
sig_encoding,
)
prioritized_encodings.append("ascii")
if "utf_8" not in prioritized_encodings:
prioritized_encodings.append("utf_8")
for encoding_iana in prioritized_encodings + IANA_SUPPORTED:
if cp_isolation and encoding_iana not in cp_isolation:
continue
if cp_exclusion and encoding_iana in cp_exclusion:
continue
if encoding_iana in tested:
continue
tested.add(encoding_iana)
decoded_payload: str | None = None
bom_or_sig_available: bool = sig_encoding == encoding_iana
strip_sig_or_bom: bool = bom_or_sig_available and should_strip_sig_or_bom(
encoding_iana
)
if encoding_iana in {"utf_16", "utf_32"} and not bom_or_sig_available:
logger.log(
TRACE,
"Encoding %s won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.",
encoding_iana,
)
continue
if encoding_iana in {"utf_7"} and not bom_or_sig_available:
logger.log(
TRACE,
"Encoding %s won't be tested as-is because detection is unreliable without BOM/SIG.",
encoding_iana,
)
continue
try:
is_multi_byte_decoder: bool = is_multi_byte_encoding(encoding_iana)
except (ModuleNotFoundError, ImportError):
logger.log(
TRACE,
"Encoding %s does not provide an IncrementalDecoder",
encoding_iana,
)
continue
try:
if is_too_large_sequence and is_multi_byte_decoder is False:
str(
(
sequences[: int(50e4)]
if strip_sig_or_bom is False
else sequences[len(sig_payload) : int(50e4)]
),
encoding=encoding_iana,
)
else:
decoded_payload = str(
(
sequences
if strip_sig_or_bom is False
else sequences[len(sig_payload) :]
),
encoding=encoding_iana,
)
except (UnicodeDecodeError, LookupError) as e:
if not isinstance(e, LookupError):
logger.log(
TRACE,
"Code page %s does not fit given bytes sequence at ALL. %s",
encoding_iana,
str(e),
)
tested_but_hard_failure.append(encoding_iana)
continue
similar_soft_failure_test: bool = False
for encoding_soft_failed in tested_but_soft_failure:
if is_cp_similar(encoding_iana, encoding_soft_failed):
similar_soft_failure_test = True
break
if similar_soft_failure_test:
logger.log(
TRACE,
"%s is deemed too similar to code page %s and was consider unsuited already. Continuing!",
encoding_iana,
encoding_soft_failed,
)
continue
r_ = range(
0 if not bom_or_sig_available else len(sig_payload),
length,
int(length / steps),
)
multi_byte_bonus: bool = (
is_multi_byte_decoder
and decoded_payload is not None
and len(decoded_payload) < length
)
if multi_byte_bonus:
logger.log(
TRACE,
"Code page %s is a multi byte encoding table and it appear that at least one character "
"was encoded using n-bytes.",
encoding_iana,
)
max_chunk_gave_up: int = int(len(r_) / 4)
max_chunk_gave_up = max(max_chunk_gave_up, 2)
early_stop_count: int = 0
lazy_str_hard_failure = False
md_chunks: list[str] = []
md_ratios = []
try:
for chunk in cut_sequence_chunks(
sequences,
encoding_iana,
r_,
chunk_size,
bom_or_sig_available,
strip_sig_or_bom,
sig_payload,
is_multi_byte_decoder,
decoded_payload,
):
md_chunks.append(chunk)
md_ratios.append(
mess_ratio(
chunk,
threshold,
explain is True and 1 <= len(cp_isolation) <= 2,
)
)
if md_ratios[-1] >= threshold:
early_stop_count += 1
if (early_stop_count >= max_chunk_gave_up) or (
bom_or_sig_available and strip_sig_or_bom is False
):
break
except (
UnicodeDecodeError
) as e: # Lazy str loading may have missed something there
logger.log(
TRACE,
"LazyStr Loading: After MD chunk decode, code page %s does not fit given bytes sequence at ALL. %s",
encoding_iana,
str(e),
)
early_stop_count = max_chunk_gave_up
lazy_str_hard_failure = True
# We might want to check the sequence again with the whole content
# Only if the initial MD tests pass
if (
not lazy_str_hard_failure
and is_too_large_sequence
and not is_multi_byte_decoder
):
try:
sequences[int(50e3) :].decode(encoding_iana, errors="strict")
except UnicodeDecodeError as e:
logger.log(
TRACE,
"LazyStr Loading: After final lookup, code page %s does not fit given bytes sequence at ALL. %s",
encoding_iana,
str(e),
)
tested_but_hard_failure.append(encoding_iana)
continue
mean_mess_ratio: float = sum(md_ratios) / len(md_ratios) if md_ratios else 0.0
if mean_mess_ratio >= threshold or early_stop_count >= max_chunk_gave_up:
tested_but_soft_failure.append(encoding_iana)
logger.log(
TRACE,
"%s was excluded because of initial chaos probing. Gave up %i time(s). "
"Computed mean chaos is %f %%.",
encoding_iana,
early_stop_count,
round(mean_mess_ratio * 100, ndigits=3),
)
# Preparing those fallbacks in case we got nothing.
if (
enable_fallback
and encoding_iana
in ["ascii", "utf_8", specified_encoding, "utf_16", "utf_32"]
and not lazy_str_hard_failure
):
fallback_entry = CharsetMatch(
sequences,
encoding_iana,
threshold,
bom_or_sig_available,
[],
decoded_payload,
preemptive_declaration=specified_encoding,
)
if encoding_iana == specified_encoding:
fallback_specified = fallback_entry
elif encoding_iana == "ascii":
fallback_ascii = fallback_entry
else:
fallback_u8 = fallback_entry
continue
logger.log(
TRACE,
"%s passed initial chaos probing. Mean measured chaos is %f %%",
encoding_iana,
round(mean_mess_ratio * 100, ndigits=3),
)
if not is_multi_byte_decoder:
target_languages: list[str] = encoding_languages(encoding_iana)
else:
target_languages = mb_encoding_languages(encoding_iana)
if target_languages:
logger.log(
TRACE,
"{} should target any language(s) of {}".format(
encoding_iana, str(target_languages)
),
)
cd_ratios = []
# We shall skip the CD when it's about ASCII
# Most of the time it's not relevant to run "language-detection" on it.
if encoding_iana != "ascii":
for chunk in md_chunks:
chunk_languages = coherence_ratio(
chunk,
language_threshold,
",".join(target_languages) if target_languages else None,
)
cd_ratios.append(chunk_languages)
cd_ratios_merged = merge_coherence_ratios(cd_ratios)
if cd_ratios_merged:
logger.log(
TRACE,
"We detected language {} using {}".format(
cd_ratios_merged, encoding_iana
),
)
current_match = CharsetMatch(
sequences,
encoding_iana,
mean_mess_ratio,
bom_or_sig_available,
cd_ratios_merged,
(
decoded_payload
if (
is_too_large_sequence is False
or encoding_iana in [specified_encoding, "ascii", "utf_8"]
)
else None
),
preemptive_declaration=specified_encoding,
)
results.append(current_match)
if (
encoding_iana in [specified_encoding, "ascii", "utf_8"]
and mean_mess_ratio < 0.1
):
# If md says nothing to worry about, then... stop immediately!
if mean_mess_ratio == 0.0:
logger.debug(
"Encoding detection: %s is most likely the one.",
current_match.encoding,
)
if explain: # Defensive: ensure exit path clean handler
logger.removeHandler(explain_handler)
logger.setLevel(previous_logger_level)
return CharsetMatches([current_match])
early_stop_results.append(current_match)
if (
len(early_stop_results)
and (specified_encoding is None or specified_encoding in tested)
and "ascii" in tested
and "utf_8" in tested
):
probable_result: CharsetMatch = early_stop_results.best() # type: ignore[assignment]
logger.debug(
"Encoding detection: %s is most likely the one.",
probable_result.encoding,
)
if explain: # Defensive: ensure exit path clean handler
logger.removeHandler(explain_handler)
logger.setLevel(previous_logger_level)
return CharsetMatches([probable_result])
if encoding_iana == sig_encoding:
logger.debug(
"Encoding detection: %s is most likely the one as we detected a BOM or SIG within "
"the beginning of the sequence.",
encoding_iana,
)
if explain: # Defensive: ensure exit path clean handler
logger.removeHandler(explain_handler)
logger.setLevel(previous_logger_level)
return CharsetMatches([results[encoding_iana]])
if len(results) == 0:
if fallback_u8 or fallback_ascii or fallback_specified:
logger.log(
TRACE,
"Nothing got out of the detection process. Using ASCII/UTF-8/Specified fallback.",
)
if fallback_specified:
logger.debug(
"Encoding detection: %s will be used as a fallback match",
fallback_specified.encoding,
)
results.append(fallback_specified)
elif (
(fallback_u8 and fallback_ascii is None)
or (
fallback_u8
and fallback_ascii
and fallback_u8.fingerprint != fallback_ascii.fingerprint
)
or (fallback_u8 is not None)
):
logger.debug("Encoding detection: utf_8 will be used as a fallback match")
results.append(fallback_u8)
elif fallback_ascii:
logger.debug("Encoding detection: ascii will be used as a fallback match")
results.append(fallback_ascii)
if results:
logger.debug(
"Encoding detection: Found %s as plausible (best-candidate) for content. With %i alternatives.",
results.best().encoding, # type: ignore
len(results) - 1,
)
else:
logger.debug("Encoding detection: Unable to determine any suitable charset.")
if explain:
logger.removeHandler(explain_handler)
logger.setLevel(previous_logger_level)
return results
def from_fp(
fp: BinaryIO,
steps: int = 5,
chunk_size: int = 512,
threshold: float = 0.20,
cp_isolation: list[str] | None = None,
cp_exclusion: list[str] | None = None,
preemptive_behaviour: bool = True,
explain: bool = False,
language_threshold: float = 0.1,
enable_fallback: bool = True,
) -> CharsetMatches:
"""
Same as from_bytes, but uses a file pointer that is already opened and ready.
Will not close the file pointer.
"""
return from_bytes(
fp.read(),
steps,
chunk_size,
threshold,
cp_isolation,
cp_exclusion,
preemptive_behaviour,
explain,
language_threshold,
enable_fallback,
)
def from_path(
path: str | bytes | PathLike, # type: ignore[type-arg]
steps: int = 5,
chunk_size: int = 512,
threshold: float = 0.20,
cp_isolation: list[str] | None = None,
cp_exclusion: list[str] | None = None,
preemptive_behaviour: bool = True,
explain: bool = False,
language_threshold: float = 0.1,
enable_fallback: bool = True,
) -> CharsetMatches:
"""
Same as from_bytes, with one extra step: opening and reading the given file path in binary mode.
Can raise IOError.
"""
with open(path, "rb") as fp:
return from_fp(
fp,
steps,
chunk_size,
threshold,
cp_isolation,
cp_exclusion,
preemptive_behaviour,
explain,
language_threshold,
enable_fallback,
)
def is_binary(
fp_or_path_or_payload: PathLike | str | BinaryIO | bytes, # type: ignore[type-arg]
steps: int = 5,
chunk_size: int = 512,
threshold: float = 0.20,
cp_isolation: list[str] | None = None,
cp_exclusion: list[str] | None = None,
preemptive_behaviour: bool = True,
explain: bool = False,
language_threshold: float = 0.1,
enable_fallback: bool = False,
) -> bool:
"""
Detect if the given input (file, bytes, or path) points to binary content, i.e. not text.
Based on the same main heuristic algorithms and default kwargs, with the sole exception that fallback matches
are disabled, to be stricter with content that is ASCII-compatible but unlikely to be text.
"""
if isinstance(fp_or_path_or_payload, (str, PathLike)):
guesses = from_path(
fp_or_path_or_payload,
steps=steps,
chunk_size=chunk_size,
threshold=threshold,
cp_isolation=cp_isolation,
cp_exclusion=cp_exclusion,
preemptive_behaviour=preemptive_behaviour,
explain=explain,
language_threshold=language_threshold,
enable_fallback=enable_fallback,
)
elif isinstance(
fp_or_path_or_payload,
(
bytes,
bytearray,
),
):
guesses = from_bytes(
fp_or_path_or_payload,
steps=steps,
chunk_size=chunk_size,
threshold=threshold,
cp_isolation=cp_isolation,
cp_exclusion=cp_exclusion,
preemptive_behaviour=preemptive_behaviour,
explain=explain,
language_threshold=language_threshold,
enable_fallback=enable_fallback,
)
else:
guesses = from_fp(
fp_or_path_or_payload,
steps=steps,
chunk_size=chunk_size,
threshold=threshold,
cp_isolation=cp_isolation,
cp_exclusion=cp_exclusion,
preemptive_behaviour=preemptive_behaviour,
explain=explain,
language_threshold=language_threshold,
enable_fallback=enable_fallback,
)
return not guesses
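# Editor's note: a minimal usage sketch (not part of the original file), relying only on
# the signatures defined above; "sample.txt" is a hypothetical path:
#
#   from charset_normalizer import from_bytes, from_path, is_binary
#
#   matches = from_bytes("Bonjour tout le monde".encode("utf_8"), threshold=0.2)
#   best = matches.best()  # CharsetMatch or None
#   print(best.encoding if best else "undetermined")
#
#   guesses = from_path("sample.txt", cp_isolation=["utf_8", "latin_1"])
#   print(is_binary(b"\x00\xff\x00\xff"))  # likely True for a non-text payload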

View File

@@ -0,0 +1,395 @@
from __future__ import annotations
import importlib
from codecs import IncrementalDecoder
from collections import Counter
from functools import lru_cache
from typing import Counter as TypeCounter
from .constant import (
FREQUENCIES,
KO_NAMES,
LANGUAGE_SUPPORTED_COUNT,
TOO_SMALL_SEQUENCE,
ZH_NAMES,
)
from .md import is_suspiciously_successive_range
from .models import CoherenceMatches
from .utils import (
is_accentuated,
is_latin,
is_multi_byte_encoding,
is_unicode_range_secondary,
unicode_range,
)
def encoding_unicode_range(iana_name: str) -> list[str]:
"""
Return the associated unicode ranges of a single-byte code page.
"""
if is_multi_byte_encoding(iana_name):
raise OSError("Function not supported on multi-byte code page")
decoder = importlib.import_module(f"encodings.{iana_name}").IncrementalDecoder
p: IncrementalDecoder = decoder(errors="ignore")
seen_ranges: dict[str, int] = {}
character_count: int = 0
for i in range(0x40, 0xFF):
chunk: str = p.decode(bytes([i]))
if chunk:
character_range: str | None = unicode_range(chunk)
if character_range is None:
continue
if is_unicode_range_secondary(character_range) is False:
if character_range not in seen_ranges:
seen_ranges[character_range] = 0
seen_ranges[character_range] += 1
character_count += 1
return sorted(
[
character_range
for character_range in seen_ranges
if seen_ranges[character_range] / character_count >= 0.15
]
)
def unicode_range_languages(primary_range: str) -> list[str]:
"""
Return inferred languages used with a unicode range.
"""
languages: list[str] = []
for language, characters in FREQUENCIES.items():
for character in characters:
if unicode_range(character) == primary_range:
languages.append(language)
break
return languages
@lru_cache()
def encoding_languages(iana_name: str) -> list[str]:
"""
Single-byte encoding language association. Some code pages are heavily linked to particular language(s).
This function establishes that correspondence.
"""
unicode_ranges: list[str] = encoding_unicode_range(iana_name)
primary_range: str | None = None
for specified_range in unicode_ranges:
if "Latin" not in specified_range:
primary_range = specified_range
break
if primary_range is None:
return ["Latin Based"]
return unicode_range_languages(primary_range)
@lru_cache()
def mb_encoding_languages(iana_name: str) -> list[str]:
"""
Multi-byte encoding language association. Some code pages are heavily linked to particular language(s).
This function establishes that correspondence.
"""
if (
iana_name.startswith("shift_")
or iana_name.startswith("iso2022_jp")
or iana_name.startswith("euc_j")
or iana_name == "cp932"
):
return ["Japanese"]
if iana_name.startswith("gb") or iana_name in ZH_NAMES:
return ["Chinese"]
if iana_name.startswith("iso2022_kr") or iana_name in KO_NAMES:
return ["Korean"]
return []
@lru_cache(maxsize=LANGUAGE_SUPPORTED_COUNT)
def get_target_features(language: str) -> tuple[bool, bool]:
"""
Determine, for a supported language, whether it contains accents and whether it is purely Latin-based.
"""
target_have_accents: bool = False
target_pure_latin: bool = True
for character in FREQUENCIES[language]:
if not target_have_accents and is_accentuated(character):
target_have_accents = True
if target_pure_latin and is_latin(character) is False:
target_pure_latin = False
return target_have_accents, target_pure_latin
def alphabet_languages(
characters: list[str], ignore_non_latin: bool = False
) -> list[str]:
"""
Return the languages associated with the given characters.
"""
languages: list[tuple[str, float]] = []
source_have_accents = any(is_accentuated(character) for character in characters)
for language, language_characters in FREQUENCIES.items():
target_have_accents, target_pure_latin = get_target_features(language)
if ignore_non_latin and target_pure_latin is False:
continue
if target_have_accents is False and source_have_accents:
continue
character_count: int = len(language_characters)
character_match_count: int = len(
[c for c in language_characters if c in characters]
)
ratio: float = character_match_count / character_count
if ratio >= 0.2:
languages.append((language, ratio))
languages = sorted(languages, key=lambda x: x[1], reverse=True)
return [compatible_language[0] for compatible_language in languages]
def characters_popularity_compare(
language: str, ordered_characters: list[str]
) -> float:
"""
Determine if an ordered character list (by occurrence, from most frequent to rarest) matches a particular language.
The result is a ratio between 0.0 (absolutely no correspondence) and 1.0 (near-perfect fit).
Beware that this function is not strict on the match, in order to ease the detection (meaning a close match counts as 1).
"""
if language not in FREQUENCIES:
raise ValueError(f"{language} not available")
character_approved_count: int = 0
FREQUENCIES_language_set = set(FREQUENCIES[language])
ordered_characters_count: int = len(ordered_characters)
target_language_characters_count: int = len(FREQUENCIES[language])
large_alphabet: bool = target_language_characters_count > 26
for character, character_rank in zip(
ordered_characters, range(0, ordered_characters_count)
):
if character not in FREQUENCIES_language_set:
continue
character_rank_in_language: int = FREQUENCIES[language].index(character)
expected_projection_ratio: float = (
target_language_characters_count / ordered_characters_count
)
character_rank_projection: int = int(character_rank * expected_projection_ratio)
if (
large_alphabet is False
and abs(character_rank_projection - character_rank_in_language) > 4
):
continue
if (
large_alphabet is True
and abs(character_rank_projection - character_rank_in_language)
< target_language_characters_count / 3
):
character_approved_count += 1
continue
characters_before_source: list[str] = FREQUENCIES[language][
0:character_rank_in_language
]
characters_after_source: list[str] = FREQUENCIES[language][
character_rank_in_language:
]
characters_before: list[str] = ordered_characters[0:character_rank]
characters_after: list[str] = ordered_characters[character_rank:]
before_match_count: int = len(
set(characters_before) & set(characters_before_source)
)
after_match_count: int = len(
set(characters_after) & set(characters_after_source)
)
if len(characters_before_source) == 0 and before_match_count <= 4:
character_approved_count += 1
continue
if len(characters_after_source) == 0 and after_match_count <= 4:
character_approved_count += 1
continue
if (
before_match_count / len(characters_before_source) >= 0.4
or after_match_count / len(characters_after_source) >= 0.4
):
character_approved_count += 1
continue
return character_approved_count / len(ordered_characters)
def alpha_unicode_split(decoded_sequence: str) -> list[str]:
"""
Given a decoded text sequence, return a list of str. Unicode range / alphabet separation.
Ex. a text containing English/Latin with a bit of Hebrew will return two items in the resulting list;
one containing the Latin letters and the other the Hebrew ones.
"""
layers: dict[str, str] = {}
for character in decoded_sequence:
if character.isalpha() is False:
continue
character_range: str | None = unicode_range(character)
if character_range is None:
continue
layer_target_range: str | None = None
for discovered_range in layers:
if (
is_suspiciously_successive_range(discovered_range, character_range)
is False
):
layer_target_range = discovered_range
break
if layer_target_range is None:
layer_target_range = character_range
if layer_target_range not in layers:
layers[layer_target_range] = character.lower()
continue
layers[layer_target_range] += character.lower()
return list(layers.values())
def merge_coherence_ratios(results: list[CoherenceMatches]) -> CoherenceMatches:
"""
This function merges results previously given by the function coherence_ratio.
The return type is the same as that of coherence_ratio.
"""
per_language_ratios: dict[str, list[float]] = {}
for result in results:
for sub_result in result:
language, ratio = sub_result
if language not in per_language_ratios:
per_language_ratios[language] = [ratio]
continue
per_language_ratios[language].append(ratio)
merge = [
(
language,
round(
sum(per_language_ratios[language]) / len(per_language_ratios[language]),
4,
),
)
for language in per_language_ratios
]
return sorted(merge, key=lambda x: x[1], reverse=True)
def filter_alt_coherence_matches(results: CoherenceMatches) -> CoherenceMatches:
"""
We shall NOT return "English—" in CoherenceMatches because it is an alternative
of "English". This function only keeps the best match and removes the em-dash from it.
"""
index_results: dict[str, list[float]] = dict()
for result in results:
language, ratio = result
no_em_name: str = language.replace("", "")
if no_em_name not in index_results:
index_results[no_em_name] = []
index_results[no_em_name].append(ratio)
if any(len(index_results[e]) > 1 for e in index_results):
filtered_results: CoherenceMatches = []
for language in index_results:
filtered_results.append((language, max(index_results[language])))
return filtered_results
return results
@lru_cache(maxsize=2048)
def coherence_ratio(
decoded_sequence: str, threshold: float = 0.1, lg_inclusion: str | None = None
) -> CoherenceMatches:
"""
Detect ANY language that can be identified in the given sequence. The sequence will be analysed by layers.
A layer = Character extraction by alphabets/ranges.
"""
results: list[tuple[str, float]] = []
ignore_non_latin: bool = False
sufficient_match_count: int = 0
lg_inclusion_list = lg_inclusion.split(",") if lg_inclusion is not None else []
if "Latin Based" in lg_inclusion_list:
ignore_non_latin = True
lg_inclusion_list.remove("Latin Based")
for layer in alpha_unicode_split(decoded_sequence):
sequence_frequencies: TypeCounter[str] = Counter(layer)
most_common = sequence_frequencies.most_common()
character_count: int = sum(o for c, o in most_common)
if character_count <= TOO_SMALL_SEQUENCE:
continue
popular_character_ordered: list[str] = [c for c, o in most_common]
for language in lg_inclusion_list or alphabet_languages(
popular_character_ordered, ignore_non_latin
):
ratio: float = characters_popularity_compare(
language, popular_character_ordered
)
if ratio < threshold:
continue
elif ratio >= 0.8:
sufficient_match_count += 1
results.append((language, round(ratio, 4)))
if sufficient_match_count >= 3:
break
return sorted(
filter_alt_coherence_matches(results), key=lambda x: x[1], reverse=True
)
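# Editor's note: an illustrative sketch (not part of the original file) of the coherence
# helpers defined above; coherence_ratio returns (language, ratio) pairs:
#
#   from charset_normalizer.cd import coherence_ratio, encoding_languages
#
#   print(encoding_languages("cp1251"))  # e.g. languages tied to the Cyrillic range
#   print(coherence_ratio("Bonjour tout le monde. Comment allez-vous ?"))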

View File

@@ -0,0 +1,8 @@
from __future__ import annotations
from .__main__ import cli_detect, query_yes_no
__all__ = (
"cli_detect",
"query_yes_no",
)

View File

@@ -0,0 +1,381 @@
from __future__ import annotations
import argparse
import sys
import typing
from json import dumps
from os.path import abspath, basename, dirname, join, realpath
from platform import python_version
from unicodedata import unidata_version
import charset_normalizer.md as md_module
from charset_normalizer import from_fp
from charset_normalizer.models import CliDetectionResult
from charset_normalizer.version import __version__
def query_yes_no(question: str, default: str = "yes") -> bool:
"""Ask a yes/no question via input() and return their answer.
"question" is a string that is presented to the user.
"default" is the presumed answer if the user just hits <Enter>.
It must be "yes" (the default), "no" or None (meaning
an answer is required of the user).
The "answer" return value is True for "yes" or False for "no".
Credit goes to (c) https://stackoverflow.com/questions/3041986/apt-command-line-interface-like-yes-no-input
"""
valid = {"yes": True, "y": True, "ye": True, "no": False, "n": False}
if default is None:
prompt = " [y/n] "
elif default == "yes":
prompt = " [Y/n] "
elif default == "no":
prompt = " [y/N] "
else:
raise ValueError("invalid default answer: '%s'" % default)
while True:
sys.stdout.write(question + prompt)
choice = input().lower()
if default is not None and choice == "":
return valid[default]
elif choice in valid:
return valid[choice]
else:
sys.stdout.write("Please respond with 'yes' or 'no' (or 'y' or 'n').\n")
class FileType:
"""Factory for creating file object types
Instances of FileType are typically passed as type= arguments to the
ArgumentParser add_argument() method.
Keyword Arguments:
- mode -- A string indicating how the file is to be opened. Accepts the
same values as the builtin open() function.
- bufsize -- The file's desired buffer size. Accepts the same values as
the builtin open() function.
- encoding -- The file's encoding. Accepts the same values as the
builtin open() function.
- errors -- A string indicating how encoding and decoding errors are to
be handled. Accepts the same value as the builtin open() function.
Backported from CPython 3.12
"""
def __init__(
self,
mode: str = "r",
bufsize: int = -1,
encoding: str | None = None,
errors: str | None = None,
):
self._mode = mode
self._bufsize = bufsize
self._encoding = encoding
self._errors = errors
def __call__(self, string: str) -> typing.IO: # type: ignore[type-arg]
# the special argument "-" means sys.std{in,out}
if string == "-":
if "r" in self._mode:
return sys.stdin.buffer if "b" in self._mode else sys.stdin
elif any(c in self._mode for c in "wax"):
return sys.stdout.buffer if "b" in self._mode else sys.stdout
else:
msg = f'argument "-" with mode {self._mode}'
raise ValueError(msg)
# all other arguments are used as file names
try:
return open(string, self._mode, self._bufsize, self._encoding, self._errors)
except OSError as e:
message = f"can't open '{string}': {e}"
raise argparse.ArgumentTypeError(message)
def __repr__(self) -> str:
args = self._mode, self._bufsize
kwargs = [("encoding", self._encoding), ("errors", self._errors)]
args_str = ", ".join(
[repr(arg) for arg in args if arg != -1]
+ [f"{kw}={arg!r}" for kw, arg in kwargs if arg is not None]
)
return f"{type(self).__name__}({args_str})"
def cli_detect(argv: list[str] | None = None) -> int:
"""
CLI assistant using ARGV and ArgumentParser
:param argv:
:return: 0 if everything is fine, anything else equal trouble
"""
parser = argparse.ArgumentParser(
description="The Real First Universal Charset Detector. "
"Discover originating encoding used on text file. "
"Normalize text to unicode."
)
parser.add_argument(
"files", type=FileType("rb"), nargs="+", help="File(s) to be analysed"
)
parser.add_argument(
"-v",
"--verbose",
action="store_true",
default=False,
dest="verbose",
help="Display complementary information about file if any. "
"Stdout will contain logs about the detection process.",
)
parser.add_argument(
"-a",
"--with-alternative",
action="store_true",
default=False,
dest="alternatives",
help="Output complementary possibilities if any. Top-level JSON WILL be a list.",
)
parser.add_argument(
"-n",
"--normalize",
action="store_true",
default=False,
dest="normalize",
help="Permit to normalize input file. If not set, program does not write anything.",
)
parser.add_argument(
"-m",
"--minimal",
action="store_true",
default=False,
dest="minimal",
help="Only output the charset detected to STDOUT. Disabling JSON output.",
)
parser.add_argument(
"-r",
"--replace",
action="store_true",
default=False,
dest="replace",
help="Replace file when trying to normalize it instead of creating a new one.",
)
parser.add_argument(
"-f",
"--force",
action="store_true",
default=False,
dest="force",
help="Replace file without asking if you are sure, use this flag with caution.",
)
parser.add_argument(
"-i",
"--no-preemptive",
action="store_true",
default=False,
dest="no_preemptive",
help="Disable looking at a charset declaration to hint the detector.",
)
parser.add_argument(
"-t",
"--threshold",
action="store",
default=0.2,
type=float,
dest="threshold",
help="Define a custom maximum amount of noise allowed in decoded content. 0. <= noise <= 1.",
)
parser.add_argument(
"--version",
action="version",
version="Charset-Normalizer {} - Python {} - Unicode {} - SpeedUp {}".format(
__version__,
python_version(),
unidata_version,
"OFF" if md_module.__file__.lower().endswith(".py") else "ON",
),
help="Show version information and exit.",
)
args = parser.parse_args(argv)
if args.replace is True and args.normalize is False:
if args.files:
for my_file in args.files:
my_file.close()
print("Use --replace in addition of --normalize only.", file=sys.stderr)
return 1
if args.force is True and args.replace is False:
if args.files:
for my_file in args.files:
my_file.close()
print("Use --force in addition of --replace only.", file=sys.stderr)
return 1
if args.threshold < 0.0 or args.threshold > 1.0:
if args.files:
for my_file in args.files:
my_file.close()
print("--threshold VALUE should be between 0. AND 1.", file=sys.stderr)
return 1
x_ = []
for my_file in args.files:
matches = from_fp(
my_file,
threshold=args.threshold,
explain=args.verbose,
preemptive_behaviour=args.no_preemptive is False,
)
best_guess = matches.best()
if best_guess is None:
print(
'Unable to identify originating encoding for "{}". {}'.format(
my_file.name,
(
"Maybe try increasing maximum amount of chaos."
if args.threshold < 1.0
else ""
),
),
file=sys.stderr,
)
x_.append(
CliDetectionResult(
abspath(my_file.name),
None,
[],
[],
"Unknown",
[],
False,
1.0,
0.0,
None,
True,
)
)
else:
x_.append(
CliDetectionResult(
abspath(my_file.name),
best_guess.encoding,
best_guess.encoding_aliases,
[
cp
for cp in best_guess.could_be_from_charset
if cp != best_guess.encoding
],
best_guess.language,
best_guess.alphabets,
best_guess.bom,
best_guess.percent_chaos,
best_guess.percent_coherence,
None,
True,
)
)
if len(matches) > 1 and args.alternatives:
for el in matches:
if el != best_guess:
x_.append(
CliDetectionResult(
abspath(my_file.name),
el.encoding,
el.encoding_aliases,
[
cp
for cp in el.could_be_from_charset
if cp != el.encoding
],
el.language,
el.alphabets,
el.bom,
el.percent_chaos,
el.percent_coherence,
None,
False,
)
)
if args.normalize is True:
if best_guess.encoding.startswith("utf") is True:
print(
'"{}" file does not need to be normalized, as it already came from unicode.'.format(
my_file.name
),
file=sys.stderr,
)
if my_file.closed is False:
my_file.close()
continue
dir_path = dirname(realpath(my_file.name))
file_name = basename(realpath(my_file.name))
o_: list[str] = file_name.split(".")
if args.replace is False:
o_.insert(-1, best_guess.encoding)
if my_file.closed is False:
my_file.close()
elif (
args.force is False
and query_yes_no(
'Are you sure you want to normalize "{}" by replacing it?'.format(
my_file.name
),
"no",
)
is False
):
if my_file.closed is False:
my_file.close()
continue
try:
x_[0].unicode_path = join(dir_path, ".".join(o_))
with open(x_[0].unicode_path, "wb") as fp:
fp.write(best_guess.output())
except OSError as e:
print(str(e), file=sys.stderr)
if my_file.closed is False:
my_file.close()
return 2
if my_file.closed is False:
my_file.close()
if args.minimal is False:
print(
dumps(
[el.__dict__ for el in x_] if len(x_) > 1 else x_[0].__dict__,
ensure_ascii=True,
indent=4,
)
)
else:
for my_file in args.files:
print(
", ".join(
[
el.encoding or "undefined"
for el in x_
if el.path == abspath(my_file.name)
]
)
)
return 0
if __name__ == "__main__":
cli_detect()
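# Editor's note: illustrative invocations (not part of the original file), using only the
# flags registered above; "my_file.txt" is a hypothetical path:
#
#   cli_detect(["--version"])  # prints the version banner, then raises SystemExit
#   cli_detect(["-m", "my_file.txt"])  # minimal output: the detected charset only
#   cli_detect(["-t", "0.4", "-a", "my_file.txt"])  # custom threshold, with alternatives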

File diff suppressed because it is too large

View File

@@ -0,0 +1,80 @@
from __future__ import annotations
from typing import TYPE_CHECKING, Any
from warnings import warn
from .api import from_bytes
from .constant import CHARDET_CORRESPONDENCE, TOO_SMALL_SEQUENCE
# TODO: remove this check when dropping Python 3.7 support
if TYPE_CHECKING:
from typing_extensions import TypedDict
class ResultDict(TypedDict):
encoding: str | None
language: str
confidence: float | None
def detect(
byte_str: bytes, should_rename_legacy: bool = False, **kwargs: Any
) -> ResultDict:
"""
chardet legacy method
Detect the encoding of the given byte string. It should be mostly backward-compatible.
The encoding name will match Chardet's own spelling whenever possible (not for encoding names it does not support).
This function is deprecated and should only be used to ease the migration of your project; consult the documentation for
further information. Not planned for removal.
:param byte_str: The byte sequence to examine.
:param should_rename_legacy: Should we rename legacy encodings
to their more modern equivalents?
"""
if len(kwargs):
warn(
f"charset-normalizer disregard arguments '{','.join(list(kwargs.keys()))}' in legacy function detect()"
)
if not isinstance(byte_str, (bytearray, bytes)):
raise TypeError( # pragma: nocover
f"Expected object of type bytes or bytearray, got: {type(byte_str)}"
)
if isinstance(byte_str, bytearray):
byte_str = bytes(byte_str)
r = from_bytes(byte_str).best()
encoding = r.encoding if r is not None else None
language = r.language if r is not None and r.language != "Unknown" else ""
confidence = 1.0 - r.chaos if r is not None else None
# automatically lower confidence
# on small bytes samples.
# https://github.com/jawah/charset_normalizer/issues/391
if (
confidence is not None
and confidence >= 0.9
and encoding
not in {
"utf_8",
"ascii",
}
and r.bom is False # type: ignore[union-attr]
and len(byte_str) < TOO_SMALL_SEQUENCE
):
confidence -= 0.2
# Note: CharsetNormalizer does not return 'UTF-8-SIG' as the SIG gets stripped in the detection/normalization process
# but chardet does return 'utf-8-sig' and it is a valid codec name.
if r is not None and encoding == "utf_8" and r.bom:
encoding += "_sig"
if should_rename_legacy is False and encoding in CHARDET_CORRESPONDENCE:
encoding = CHARDET_CORRESPONDENCE[encoding]
return {
"encoding": encoding,
"language": language,
"confidence": confidence,
}
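# Editor's note: a drop-in usage sketch (not part of the original file) mirroring the
# chardet API; the keys come from the ResultDict defined above:
#
#   from charset_normalizer import detect
#
#   result = detect("Übergrößenträger".encode("latin_1"))
#   print(result["encoding"], result["language"], result["confidence"])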

View File

@@ -0,0 +1,635 @@
from __future__ import annotations
from functools import lru_cache
from logging import getLogger
from .constant import (
COMMON_SAFE_ASCII_CHARACTERS,
TRACE,
UNICODE_SECONDARY_RANGE_KEYWORD,
)
from .utils import (
is_accentuated,
is_arabic,
is_arabic_isolated_form,
is_case_variable,
is_cjk,
is_emoticon,
is_hangul,
is_hiragana,
is_katakana,
is_latin,
is_punctuation,
is_separator,
is_symbol,
is_thai,
is_unprintable,
remove_accent,
unicode_range,
is_cjk_uncommon,
)
class MessDetectorPlugin:
"""
Base abstract class used for mess detection plugins.
All detectors MUST extend and implement given methods.
"""
def eligible(self, character: str) -> bool:
"""
Determine if the given character should be fed in.
"""
raise NotImplementedError # pragma: nocover
def feed(self, character: str) -> None:
"""
The main routine to be executed for each character.
Insert the logic by which the text would be considered chaotic.
"""
raise NotImplementedError # pragma: nocover
def reset(self) -> None: # pragma: no cover
"""
Reset the plugin to its initial state.
"""
raise NotImplementedError
@property
def ratio(self) -> float:
"""
Compute the chaos ratio based on what your feed() has seen.
Must NOT be lower than 0.0; there is no upper bound.
"""
raise NotImplementedError # pragma: nocover
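# Editor's note: an illustrative custom plugin (not part of the original file). mess_ratio()
# discovers detectors via MessDetectorPlugin.__subclasses__(), so merely defining a subclass
# registers it; the sketch is kept commented out here for that reason.
#
#   class ReplacementCharacterPlugin(MessDetectorPlugin):
#       """Count U+FFFD replacement characters as a chaos signal."""
#
#       def __init__(self) -> None:
#           self._character_count = 0
#           self._replacement_count = 0
#
#       def eligible(self, character: str) -> bool:
#           return True
#
#       def feed(self, character: str) -> None:
#           self._character_count += 1
#           if character == "\ufffd":
#               self._replacement_count += 1
#
#       def reset(self) -> None:
#           self._character_count = 0
#           self._replacement_count = 0
#
#       @property
#       def ratio(self) -> float:
#           if self._character_count == 0:
#               return 0.0
#           return self._replacement_count / self._character_count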
class TooManySymbolOrPunctuationPlugin(MessDetectorPlugin):
def __init__(self) -> None:
self._punctuation_count: int = 0
self._symbol_count: int = 0
self._character_count: int = 0
self._last_printable_char: str | None = None
self._frenzy_symbol_in_word: bool = False
def eligible(self, character: str) -> bool:
return character.isprintable()
def feed(self, character: str) -> None:
self._character_count += 1
if (
character != self._last_printable_char
and character not in COMMON_SAFE_ASCII_CHARACTERS
):
if is_punctuation(character):
self._punctuation_count += 1
elif (
character.isdigit() is False
and is_symbol(character)
and is_emoticon(character) is False
):
self._symbol_count += 2
self._last_printable_char = character
def reset(self) -> None: # Abstract
self._punctuation_count = 0
self._character_count = 0
self._symbol_count = 0
@property
def ratio(self) -> float:
if self._character_count == 0:
return 0.0
ratio_of_punctuation: float = (
self._punctuation_count + self._symbol_count
) / self._character_count
return ratio_of_punctuation if ratio_of_punctuation >= 0.3 else 0.0
class TooManyAccentuatedPlugin(MessDetectorPlugin):
def __init__(self) -> None:
self._character_count: int = 0
self._accentuated_count: int = 0
def eligible(self, character: str) -> bool:
return character.isalpha()
def feed(self, character: str) -> None:
self._character_count += 1
if is_accentuated(character):
self._accentuated_count += 1
def reset(self) -> None: # Abstract
self._character_count = 0
self._accentuated_count = 0
@property
def ratio(self) -> float:
if self._character_count < 8:
return 0.0
ratio_of_accentuation: float = self._accentuated_count / self._character_count
return ratio_of_accentuation if ratio_of_accentuation >= 0.35 else 0.0
class UnprintablePlugin(MessDetectorPlugin):
def __init__(self) -> None:
self._unprintable_count: int = 0
self._character_count: int = 0
def eligible(self, character: str) -> bool:
return True
def feed(self, character: str) -> None:
if is_unprintable(character):
self._unprintable_count += 1
self._character_count += 1
def reset(self) -> None: # Abstract
self._unprintable_count = 0
@property
def ratio(self) -> float:
if self._character_count == 0:
return 0.0
return (self._unprintable_count * 8) / self._character_count
class SuspiciousDuplicateAccentPlugin(MessDetectorPlugin):
def __init__(self) -> None:
self._successive_count: int = 0
self._character_count: int = 0
self._last_latin_character: str | None = None
def eligible(self, character: str) -> bool:
return character.isalpha() and is_latin(character)
def feed(self, character: str) -> None:
self._character_count += 1
if (
self._last_latin_character is not None
and is_accentuated(character)
and is_accentuated(self._last_latin_character)
):
if character.isupper() and self._last_latin_character.isupper():
self._successive_count += 1
# Worse if it's the same char duplicated with a different accent.
if remove_accent(character) == remove_accent(self._last_latin_character):
self._successive_count += 1
self._last_latin_character = character
def reset(self) -> None: # Abstract
self._successive_count = 0
self._character_count = 0
self._last_latin_character = None
@property
def ratio(self) -> float:
if self._character_count == 0:
return 0.0
return (self._successive_count * 2) / self._character_count
class SuspiciousRange(MessDetectorPlugin):
def __init__(self) -> None:
self._suspicious_successive_range_count: int = 0
self._character_count: int = 0
self._last_printable_seen: str | None = None
def eligible(self, character: str) -> bool:
return character.isprintable()
def feed(self, character: str) -> None:
self._character_count += 1
if (
character.isspace()
or is_punctuation(character)
or character in COMMON_SAFE_ASCII_CHARACTERS
):
self._last_printable_seen = None
return
if self._last_printable_seen is None:
self._last_printable_seen = character
return
unicode_range_a: str | None = unicode_range(self._last_printable_seen)
unicode_range_b: str | None = unicode_range(character)
if is_suspiciously_successive_range(unicode_range_a, unicode_range_b):
self._suspicious_successive_range_count += 1
self._last_printable_seen = character
def reset(self) -> None: # Abstract
self._character_count = 0
self._suspicious_successive_range_count = 0
self._last_printable_seen = None
@property
def ratio(self) -> float:
if self._character_count <= 13:
return 0.0
ratio_of_suspicious_range_usage: float = (
self._suspicious_successive_range_count * 2
) / self._character_count
return ratio_of_suspicious_range_usage
class SuperWeirdWordPlugin(MessDetectorPlugin):
def __init__(self) -> None:
self._word_count: int = 0
self._bad_word_count: int = 0
self._foreign_long_count: int = 0
self._is_current_word_bad: bool = False
self._foreign_long_watch: bool = False
self._character_count: int = 0
self._bad_character_count: int = 0
self._buffer: str = ""
self._buffer_accent_count: int = 0
self._buffer_glyph_count: int = 0
def eligible(self, character: str) -> bool:
return True
def feed(self, character: str) -> None:
if character.isalpha():
self._buffer += character
if is_accentuated(character):
self._buffer_accent_count += 1
if (
self._foreign_long_watch is False
and (is_latin(character) is False or is_accentuated(character))
and is_cjk(character) is False
and is_hangul(character) is False
and is_katakana(character) is False
and is_hiragana(character) is False
and is_thai(character) is False
):
self._foreign_long_watch = True
if (
is_cjk(character)
or is_hangul(character)
or is_katakana(character)
or is_hiragana(character)
or is_thai(character)
):
self._buffer_glyph_count += 1
return
if not self._buffer:
return
if (
character.isspace() or is_punctuation(character) or is_separator(character)
) and self._buffer:
self._word_count += 1
buffer_length: int = len(self._buffer)
self._character_count += buffer_length
if buffer_length >= 4:
if self._buffer_accent_count / buffer_length >= 0.5:
self._is_current_word_bad = True
# Words/buffers ending with an uppercase accentuated letter are so rare
# that we consider them all suspicious. Same weight as a foreign_long suspicion.
elif (
is_accentuated(self._buffer[-1])
and self._buffer[-1].isupper()
and all(_.isupper() for _ in self._buffer) is False
):
self._foreign_long_count += 1
self._is_current_word_bad = True
elif self._buffer_glyph_count == 1:
self._is_current_word_bad = True
self._foreign_long_count += 1
if buffer_length >= 24 and self._foreign_long_watch:
camel_case_dst = [
i
for c, i in zip(self._buffer, range(0, buffer_length))
if c.isupper()
]
probable_camel_cased: bool = False
if camel_case_dst and (len(camel_case_dst) / buffer_length <= 0.3):
probable_camel_cased = True
if not probable_camel_cased:
self._foreign_long_count += 1
self._is_current_word_bad = True
if self._is_current_word_bad:
self._bad_word_count += 1
self._bad_character_count += len(self._buffer)
self._is_current_word_bad = False
self._foreign_long_watch = False
self._buffer = ""
self._buffer_accent_count = 0
self._buffer_glyph_count = 0
elif (
character not in {"<", ">", "-", "=", "~", "|", "_"}
and character.isdigit() is False
and is_symbol(character)
):
self._is_current_word_bad = True
self._buffer += character
def reset(self) -> None: # Abstract
self._buffer = ""
self._is_current_word_bad = False
self._foreign_long_watch = False
self._bad_word_count = 0
self._word_count = 0
self._character_count = 0
self._bad_character_count = 0
self._foreign_long_count = 0
@property
def ratio(self) -> float:
if self._word_count <= 10 and self._foreign_long_count == 0:
return 0.0
return self._bad_character_count / self._character_count
class CjkUncommonPlugin(MessDetectorPlugin):
"""
Detect messy CJK text that probably means nothing.
"""
def __init__(self) -> None:
self._character_count: int = 0
self._uncommon_count: int = 0
def eligible(self, character: str) -> bool:
return is_cjk(character)
def feed(self, character: str) -> None:
self._character_count += 1
if is_cjk_uncommon(character):
self._uncommon_count += 1
return
def reset(self) -> None: # Abstract
self._character_count = 0
self._uncommon_count = 0
@property
def ratio(self) -> float:
if self._character_count < 8:
return 0.0
uncommon_form_usage: float = self._uncommon_count / self._character_count
# We can be pretty sure it's garbage when uncommon characters are widely
# used; otherwise it could just be Traditional Chinese, for example.
return uncommon_form_usage / 10 if uncommon_form_usage > 0.5 else 0.0
class ArchaicUpperLowerPlugin(MessDetectorPlugin):
def __init__(self) -> None:
self._buf: bool = False
self._character_count_since_last_sep: int = 0
self._successive_upper_lower_count: int = 0
self._successive_upper_lower_count_final: int = 0
self._character_count: int = 0
self._last_alpha_seen: str | None = None
self._current_ascii_only: bool = True
def eligible(self, character: str) -> bool:
return True
def feed(self, character: str) -> None:
is_concerned = character.isalpha() and is_case_variable(character)
chunk_sep = is_concerned is False
if chunk_sep and self._character_count_since_last_sep > 0:
if (
self._character_count_since_last_sep <= 64
and character.isdigit() is False
and self._current_ascii_only is False
):
self._successive_upper_lower_count_final += (
self._successive_upper_lower_count
)
self._successive_upper_lower_count = 0
self._character_count_since_last_sep = 0
self._last_alpha_seen = None
self._buf = False
self._character_count += 1
self._current_ascii_only = True
return
if self._current_ascii_only is True and character.isascii() is False:
self._current_ascii_only = False
if self._last_alpha_seen is not None:
if (character.isupper() and self._last_alpha_seen.islower()) or (
character.islower() and self._last_alpha_seen.isupper()
):
if self._buf is True:
self._successive_upper_lower_count += 2
self._buf = False
else:
self._buf = True
else:
self._buf = False
self._character_count += 1
self._character_count_since_last_sep += 1
self._last_alpha_seen = character
def reset(self) -> None: # Abstract
self._character_count = 0
self._character_count_since_last_sep = 0
self._successive_upper_lower_count = 0
self._successive_upper_lower_count_final = 0
self._last_alpha_seen = None
self._buf = False
self._current_ascii_only = True
@property
def ratio(self) -> float:
if self._character_count == 0:
return 0.0
return self._successive_upper_lower_count_final / self._character_count
class ArabicIsolatedFormPlugin(MessDetectorPlugin):
def __init__(self) -> None:
self._character_count: int = 0
self._isolated_form_count: int = 0
def reset(self) -> None: # Abstract
self._character_count = 0
self._isolated_form_count = 0
def eligible(self, character: str) -> bool:
return is_arabic(character)
def feed(self, character: str) -> None:
self._character_count += 1
if is_arabic_isolated_form(character):
self._isolated_form_count += 1
@property
def ratio(self) -> float:
if self._character_count < 8:
return 0.0
isolated_form_usage: float = self._isolated_form_count / self._character_count
return isolated_form_usage
@lru_cache(maxsize=1024)
def is_suspiciously_successive_range(
unicode_range_a: str | None, unicode_range_b: str | None
) -> bool:
"""
Determine whether two Unicode ranges seen next to each other can be considered suspicious.
"""
if unicode_range_a is None or unicode_range_b is None:
return True
if unicode_range_a == unicode_range_b:
return False
if "Latin" in unicode_range_a and "Latin" in unicode_range_b:
return False
if "Emoticons" in unicode_range_a or "Emoticons" in unicode_range_b:
return False
# Latin characters can be accompanied with a combining diacritical mark
# eg. Vietnamese.
if ("Latin" in unicode_range_a or "Latin" in unicode_range_b) and (
"Combining" in unicode_range_a or "Combining" in unicode_range_b
):
return False
keywords_range_a, keywords_range_b = (
unicode_range_a.split(" "),
unicode_range_b.split(" "),
)
for el in keywords_range_a:
if el in UNICODE_SECONDARY_RANGE_KEYWORD:
continue
if el in keywords_range_b:
return False
# Japanese Exception
range_a_jp_chars, range_b_jp_chars = (
unicode_range_a
in (
"Hiragana",
"Katakana",
),
unicode_range_b in ("Hiragana", "Katakana"),
)
if (range_a_jp_chars or range_b_jp_chars) and (
"CJK" in unicode_range_a or "CJK" in unicode_range_b
):
return False
if range_a_jp_chars and range_b_jp_chars:
return False
if "Hangul" in unicode_range_a or "Hangul" in unicode_range_b:
if "CJK" in unicode_range_a or "CJK" in unicode_range_b:
return False
if unicode_range_a == "Basic Latin" or unicode_range_b == "Basic Latin":
return False
# Chinese/Japanese use dedicated range for punctuation and/or separators.
if ("CJK" in unicode_range_a or "CJK" in unicode_range_b) or (
unicode_range_a in ["Katakana", "Hiragana"]
and unicode_range_b in ["Katakana", "Hiragana"]
):
if "Punctuation" in unicode_range_a or "Punctuation" in unicode_range_b:
return False
if "Forms" in unicode_range_a or "Forms" in unicode_range_b:
return False
if unicode_range_a == "Basic Latin" or unicode_range_b == "Basic Latin":
return False
return True
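# Editor's sketch (not upstream code): two hedged examples of the pairing
# heuristic above, using block names from UNICODE_RANGES_COMBINED.
def _demo_suspicious_ranges() -> None:
    # Unrelated scripts glued together are treated as suspicious.
    print(is_suspiciously_successive_range("Basic Latin", "Cyrillic"))  # True
    # The Japanese exception lets Kana sit next to CJK ideographs.
    print(is_suspiciously_successive_range("Hiragana", "CJK Unified Ideographs"))  # False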
@lru_cache(maxsize=2048)
def mess_ratio(
decoded_sequence: str, maximum_threshold: float = 0.2, debug: bool = False
) -> float:
"""
Compute a mess ratio given a decoded bytes sequence. The maximum threshold stops the computation early.
"""
detectors: list[MessDetectorPlugin] = [
md_class() for md_class in MessDetectorPlugin.__subclasses__()
]
length: int = len(decoded_sequence) + 1
mean_mess_ratio: float = 0.0
if length < 512:
intermediary_mean_mess_ratio_calc: int = 32
elif length <= 1024:
intermediary_mean_mess_ratio_calc = 64
else:
intermediary_mean_mess_ratio_calc = 128
for character, index in zip(decoded_sequence + "\n", range(length)):
for detector in detectors:
if detector.eligible(character):
detector.feed(character)
if (
index > 0 and index % intermediary_mean_mess_ratio_calc == 0
) or index == length - 1:
mean_mess_ratio = sum(dt.ratio for dt in detectors)
if mean_mess_ratio >= maximum_threshold:
break
if debug:
logger = getLogger("charset_normalizer")
logger.log(
TRACE,
"Mess-detector extended-analysis start. "
f"intermediary_mean_mess_ratio_calc={intermediary_mean_mess_ratio_calc} mean_mess_ratio={mean_mess_ratio} "
f"maximum_threshold={maximum_threshold}",
)
if len(decoded_sequence) > 16:
logger.log(TRACE, f"Starting with: {decoded_sequence[:16]}")
logger.log(TRACE, f"Ending with: {decoded_sequence[-16::]}")
for dt in detectors:
logger.log(TRACE, f"{dt.__class__}: {dt.ratio}")
return round(mean_mess_ratio, 3)
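# Editor's sketch (not upstream code): a minimal use of mess_ratio(). Plain
# text should score near 0.0, while symbol-heavy noise should score clearly
# higher (the scan may stop early once maximum_threshold is crossed).
def _demo_mess_ratio() -> None:
    print(mess_ratio("The quick brown fox jumps over the lazy dog."))  # near 0.0
    print(mess_ratio("~~###|||===###~~ \x82\x94 garbled payload ###~~|||"))  # higher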

View File

@@ -0,0 +1,360 @@
from __future__ import annotations
from encodings.aliases import aliases
from hashlib import sha256
from json import dumps
from re import sub
from typing import Any, Iterator, List, Tuple
from .constant import RE_POSSIBLE_ENCODING_INDICATION, TOO_BIG_SEQUENCE
from .utils import iana_name, is_multi_byte_encoding, unicode_range
class CharsetMatch:
def __init__(
self,
payload: bytes,
guessed_encoding: str,
mean_mess_ratio: float,
has_sig_or_bom: bool,
languages: CoherenceMatches,
decoded_payload: str | None = None,
preemptive_declaration: str | None = None,
):
self._payload: bytes = payload
self._encoding: str = guessed_encoding
self._mean_mess_ratio: float = mean_mess_ratio
self._languages: CoherenceMatches = languages
self._has_sig_or_bom: bool = has_sig_or_bom
self._unicode_ranges: list[str] | None = None
self._leaves: list[CharsetMatch] = []
self._mean_coherence_ratio: float = 0.0
self._output_payload: bytes | None = None
self._output_encoding: str | None = None
self._string: str | None = decoded_payload
self._preemptive_declaration: str | None = preemptive_declaration
def __eq__(self, other: object) -> bool:
if not isinstance(other, CharsetMatch):
if isinstance(other, str):
return iana_name(other) == self.encoding
return False
return self.encoding == other.encoding and self.fingerprint == other.fingerprint
def __lt__(self, other: object) -> bool:
"""
Implemented to make sorted() available on CharsetMatch items.
"""
if not isinstance(other, CharsetMatch):
raise ValueError
chaos_difference: float = abs(self.chaos - other.chaos)
coherence_difference: float = abs(self.coherence - other.coherence)
# Below 1% difference --> Use Coherence
if chaos_difference < 0.01 and coherence_difference > 0.02:
return self.coherence > other.coherence
elif chaos_difference < 0.01 and coherence_difference <= 0.02:
# When the decision is difficult, prefer the result that decoded the most
# multi-byte characters (skipped for large payloads to preserve RAM).
if len(self._payload) >= TOO_BIG_SEQUENCE:
return self.chaos < other.chaos
return self.multi_byte_usage > other.multi_byte_usage
return self.chaos < other.chaos
@property
def multi_byte_usage(self) -> float:
return 1.0 - (len(str(self)) / len(self.raw))
def __str__(self) -> str:
# Lazy Str Loading
if self._string is None:
self._string = str(self._payload, self._encoding, "strict")
return self._string
def __repr__(self) -> str:
return f"<CharsetMatch '{self.encoding}' bytes({self.fingerprint})>"
def add_submatch(self, other: CharsetMatch) -> None:
if not isinstance(other, CharsetMatch) or other == self:
raise ValueError(
"Unable to add instance <{}> as a submatch of a CharsetMatch".format(
other.__class__
)
)
other._string = None # Unload RAM usage; dirty trick.
self._leaves.append(other)
@property
def encoding(self) -> str:
return self._encoding
@property
def encoding_aliases(self) -> list[str]:
"""
Encodings are known by many names; using this could help when searching for IBM855 while it's listed as CP855.
"""
also_known_as: list[str] = []
for u, p in aliases.items():
if self.encoding == u:
also_known_as.append(p)
elif self.encoding == p:
also_known_as.append(u)
return also_known_as
@property
def bom(self) -> bool:
return self._has_sig_or_bom
@property
def byte_order_mark(self) -> bool:
return self._has_sig_or_bom
@property
def languages(self) -> list[str]:
"""
Return the complete list of possible languages found in the decoded sequence.
Usually not very useful. The returned list may be empty even if the 'language' property returns something other than 'Unknown'.
"""
return [e[0] for e in self._languages]
@property
def language(self) -> str:
"""
Most probable language found in the decoded sequence. If none was detected or inferred, the property returns
"Unknown".
"""
if not self._languages:
# Try to infer the language from the given encoding:
# it's either English, or we should not commit ourselves in certain cases.
if "ascii" in self.could_be_from_charset:
return "English"
# doing it there to avoid circular import
from charset_normalizer.cd import encoding_languages, mb_encoding_languages
languages = (
mb_encoding_languages(self.encoding)
if is_multi_byte_encoding(self.encoding)
else encoding_languages(self.encoding)
)
if len(languages) == 0 or "Latin Based" in languages:
return "Unknown"
return languages[0]
return self._languages[0][0]
@property
def chaos(self) -> float:
return self._mean_mess_ratio
@property
def coherence(self) -> float:
if not self._languages:
return 0.0
return self._languages[0][1]
@property
def percent_chaos(self) -> float:
return round(self.chaos * 100, ndigits=3)
@property
def percent_coherence(self) -> float:
return round(self.coherence * 100, ndigits=3)
@property
def raw(self) -> bytes:
"""
Original untouched bytes.
"""
return self._payload
@property
def submatch(self) -> list[CharsetMatch]:
return self._leaves
@property
def has_submatch(self) -> bool:
return len(self._leaves) > 0
@property
def alphabets(self) -> list[str]:
if self._unicode_ranges is not None:
return self._unicode_ranges
# list detected ranges
detected_ranges: list[str | None] = [unicode_range(char) for char in str(self)]
# filter and sort
self._unicode_ranges = sorted(list({r for r in detected_ranges if r}))
return self._unicode_ranges
@property
def could_be_from_charset(self) -> list[str]:
"""
The complete list of encodings that output the exact SAME str result and therefore could be the originating
encoding.
This list includes the encoding available in the 'encoding' property.
"""
return [self._encoding] + [m.encoding for m in self._leaves]
def output(self, encoding: str = "utf_8") -> bytes:
"""
Method to get the re-encoded bytes payload using the given target encoding. Defaults to UTF-8.
Characters that cannot be encoded are replaced by the encoder (errors="replace"), not dropped.
"""
if self._output_encoding is None or self._output_encoding != encoding:
self._output_encoding = encoding
decoded_string = str(self)
if (
self._preemptive_declaration is not None
and self._preemptive_declaration.lower()
not in ["utf-8", "utf8", "utf_8"]
):
patched_header = sub(
RE_POSSIBLE_ENCODING_INDICATION,
lambda m: m.string[m.span()[0] : m.span()[1]].replace(
m.groups()[0],
iana_name(self._output_encoding).replace("_", "-"), # type: ignore[arg-type]
),
decoded_string[:8192],
count=1,
)
decoded_string = patched_header + decoded_string[8192:]
self._output_payload = decoded_string.encode(encoding, "replace")
return self._output_payload # type: ignore
@property
def fingerprint(self) -> str:
"""
Retrieve the unique SHA256 computed from the transformed (re-encoded) payload, not the original one.
"""
return sha256(self.output()).hexdigest()
class CharsetMatches:
"""
Container of CharsetMatch items, ordered by default from the most probable to the least.
Acts like a list (iterable) but does not implement all related methods.
"""
def __init__(self, results: list[CharsetMatch] | None = None):
self._results: list[CharsetMatch] = sorted(results) if results else []
def __iter__(self) -> Iterator[CharsetMatch]:
yield from self._results
def __getitem__(self, item: int | str) -> CharsetMatch:
"""
Retrieve a single item either by its position or encoding name (alias may be used here).
Raise KeyError upon an invalid index or an encoding not present in the results.
"""
if isinstance(item, int):
return self._results[item]
if isinstance(item, str):
item = iana_name(item, False)
for result in self._results:
if item in result.could_be_from_charset:
return result
raise KeyError
def __len__(self) -> int:
return len(self._results)
def __bool__(self) -> bool:
return len(self._results) > 0
def append(self, item: CharsetMatch) -> None:
"""
Insert a single match, placed so as to preserve the sort order.
It may be attached as a submatch of an existing result.
"""
if not isinstance(item, CharsetMatch):
raise ValueError(
"Cannot append instance '{}' to CharsetMatches".format(
str(item.__class__)
)
)
# We should disable the submatch factoring when the input file is too heavy (conserve RAM usage)
if len(item.raw) < TOO_BIG_SEQUENCE:
for match in self._results:
if match.fingerprint == item.fingerprint and match.chaos == item.chaos:
match.add_submatch(item)
return
self._results.append(item)
self._results = sorted(self._results)
def best(self) -> CharsetMatch | None:
"""
Simply return the first match. Strict equivalent to matches[0].
"""
if not self._results:
return None
return self._results[0]
def first(self) -> CharsetMatch | None:
"""
Redundant method that simply calls best(). Kept for backward-compatibility reasons.
"""
return self.best()
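# Editor's sketch (not upstream code): how these containers are typically
# consumed through the package's public entry point, from_bytes().
def _demo_charset_matches() -> None:
    from charset_normalizer import from_bytes  # public API of this package

    matches = from_bytes("Bonjour, le café est chaud.".encode("cp1252"))
    best = matches.best()  # most probable CharsetMatch, or None
    if best is not None:
        print(best.encoding)               # e.g. "cp1252" or a byte-identical sibling
        print(best.could_be_from_charset)  # every encoding yielding the same str
        print(best.fingerprint)            # sha256 of the re-encoded payload
        print(best.output()[:16])          # payload re-encoded to UTF-8 by default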
CoherenceMatch = Tuple[str, float]
CoherenceMatches = List[CoherenceMatch]
class CliDetectionResult:
def __init__(
self,
path: str,
encoding: str | None,
encoding_aliases: list[str],
alternative_encodings: list[str],
language: str,
alphabets: list[str],
has_sig_or_bom: bool,
chaos: float,
coherence: float,
unicode_path: str | None,
is_preferred: bool,
):
self.path: str = path
self.unicode_path: str | None = unicode_path
self.encoding: str | None = encoding
self.encoding_aliases: list[str] = encoding_aliases
self.alternative_encodings: list[str] = alternative_encodings
self.language: str = language
self.alphabets: list[str] = alphabets
self.has_sig_or_bom: bool = has_sig_or_bom
self.chaos: float = chaos
self.coherence: float = coherence
self.is_preferred: bool = is_preferred
@property
def __dict__(self) -> dict[str, Any]: # type: ignore
return {
"path": self.path,
"encoding": self.encoding,
"encoding_aliases": self.encoding_aliases,
"alternative_encodings": self.alternative_encodings,
"language": self.language,
"alphabets": self.alphabets,
"has_sig_or_bom": self.has_sig_or_bom,
"chaos": self.chaos,
"coherence": self.coherence,
"unicode_path": self.unicode_path,
"is_preferred": self.is_preferred,
}
def to_json(self) -> str:
return dumps(self.__dict__, ensure_ascii=True, indent=4)

View File

@@ -0,0 +1,414 @@
from __future__ import annotations
import importlib
import logging
import unicodedata
from codecs import IncrementalDecoder
from encodings.aliases import aliases
from functools import lru_cache
from re import findall
from typing import Generator
from _multibytecodec import ( # type: ignore[import-not-found,import]
MultibyteIncrementalDecoder,
)
from .constant import (
ENCODING_MARKS,
IANA_SUPPORTED_SIMILAR,
RE_POSSIBLE_ENCODING_INDICATION,
UNICODE_RANGES_COMBINED,
UNICODE_SECONDARY_RANGE_KEYWORD,
UTF8_MAXIMAL_ALLOCATION,
COMMON_CJK_CHARACTERS,
)
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
def is_accentuated(character: str) -> bool:
try:
description: str = unicodedata.name(character)
except ValueError: # Defensive: unicode database outdated?
return False
return (
"WITH GRAVE" in description
or "WITH ACUTE" in description
or "WITH CEDILLA" in description
or "WITH DIAERESIS" in description
or "WITH CIRCUMFLEX" in description
or "WITH TILDE" in description
or "WITH MACRON" in description
or "WITH RING ABOVE" in description
)
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
def remove_accent(character: str) -> str:
decomposed: str = unicodedata.decomposition(character)
if not decomposed:
return character
codes: list[str] = decomposed.split(" ")
return chr(int(codes[0], 16))
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
def unicode_range(character: str) -> str | None:
"""
Retrieve the Unicode range official name from a single character.
"""
character_ord: int = ord(character)
for range_name, ord_range in UNICODE_RANGES_COMBINED.items():
if character_ord in ord_range:
return range_name
return None
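# Editor's sketch (not upstream code): unicode_range() maps a character to
# its containing block name, or None for unassigned code points.
def _demo_unicode_range() -> None:
    print(unicode_range("a"))  # "Basic Latin"
    print(unicode_range("é"))  # "Latin-1 Supplement"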
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
def is_latin(character: str) -> bool:
try:
description: str = unicodedata.name(character)
except ValueError: # Defensive: unicode database outdated?
return False
return "LATIN" in description
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
def is_punctuation(character: str) -> bool:
character_category: str = unicodedata.category(character)
if "P" in character_category:
return True
character_range: str | None = unicode_range(character)
if character_range is None:
return False
return "Punctuation" in character_range
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
def is_symbol(character: str) -> bool:
character_category: str = unicodedata.category(character)
if "S" in character_category or "N" in character_category:
return True
character_range: str | None = unicode_range(character)
if character_range is None:
return False
return "Forms" in character_range and character_category != "Lo"
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
def is_emoticon(character: str) -> bool:
character_range: str | None = unicode_range(character)
if character_range is None:
return False
return "Emoticons" in character_range or "Pictographs" in character_range
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
def is_separator(character: str) -> bool:
if character.isspace() or character in {"", "+", "<", ">"}:
return True
character_category: str = unicodedata.category(character)
return "Z" in character_category or character_category in {"Po", "Pd", "Pc"}
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
def is_case_variable(character: str) -> bool:
return character.islower() != character.isupper()
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
def is_cjk(character: str) -> bool:
try:
character_name = unicodedata.name(character)
except ValueError: # Defensive: unicode database outdated?
return False
return "CJK" in character_name
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
def is_hiragana(character: str) -> bool:
try:
character_name = unicodedata.name(character)
except ValueError: # Defensive: unicode database outdated?
return False
return "HIRAGANA" in character_name
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
def is_katakana(character: str) -> bool:
try:
character_name = unicodedata.name(character)
except ValueError: # Defensive: unicode database outdated?
return False
return "KATAKANA" in character_name
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
def is_hangul(character: str) -> bool:
try:
character_name = unicodedata.name(character)
except ValueError: # Defensive: unicode database outdated?
return False
return "HANGUL" in character_name
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
def is_thai(character: str) -> bool:
try:
character_name = unicodedata.name(character)
except ValueError: # Defensive: unicode database outdated?
return False
return "THAI" in character_name
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
def is_arabic(character: str) -> bool:
try:
character_name = unicodedata.name(character)
except ValueError: # Defensive: unicode database outdated?
return False
return "ARABIC" in character_name
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
def is_arabic_isolated_form(character: str) -> bool:
try:
character_name = unicodedata.name(character)
except ValueError: # Defensive: unicode database outdated?
return False
return "ARABIC" in character_name and "ISOLATED FORM" in character_name
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
def is_cjk_uncommon(character: str) -> bool:
return character not in COMMON_CJK_CHARACTERS
@lru_cache(maxsize=len(UNICODE_RANGES_COMBINED))
def is_unicode_range_secondary(range_name: str) -> bool:
return any(keyword in range_name for keyword in UNICODE_SECONDARY_RANGE_KEYWORD)
@lru_cache(maxsize=UTF8_MAXIMAL_ALLOCATION)
def is_unprintable(character: str) -> bool:
return (
character.isspace() is False # includes \n \t \r \v
and character.isprintable() is False
and character != "\x1a" # Why? Its the ASCII substitute character.
and character != "\ufeff" # bug discovered in Python,
# Zero Width No-Break Space located in Arabic Presentation Forms-B, Unicode 1.1 not acknowledged as space.
)
def any_specified_encoding(sequence: bytes, search_zone: int = 8192) -> str | None:
"""
Extract, using an ASCII-only decoder, any specified encoding in the first n bytes.
"""
if not isinstance(sequence, bytes):
raise TypeError
seq_len: int = len(sequence)
results: list[str] = findall(
RE_POSSIBLE_ENCODING_INDICATION,
sequence[: min(seq_len, search_zone)].decode("ascii", errors="ignore"),
)
if len(results) == 0:
return None
for specified_encoding in results:
specified_encoding = specified_encoding.lower().replace("-", "_")
encoding_alias: str
encoding_iana: str
for encoding_alias, encoding_iana in aliases.items():
if encoding_alias == specified_encoding:
return encoding_iana
if encoding_iana == specified_encoding:
return encoding_iana
return None
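# Editor's sketch (not upstream code): pulling a declared encoding out of an
# XML prolog; the result is the Python-normalized name, not the IANA label.
def _demo_any_specified_encoding() -> None:
    payload = b'<?xml version="1.0" encoding="ISO-8859-1"?><root/>'
    print(any_specified_encoding(payload))  # expected: "latin_1"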
@lru_cache(maxsize=128)
def is_multi_byte_encoding(name: str) -> bool:
"""
Verify whether a specific encoding is a multi-byte one, based on its IANA name.
"""
return name in {
"utf_8",
"utf_8_sig",
"utf_16",
"utf_16_be",
"utf_16_le",
"utf_32",
"utf_32_le",
"utf_32_be",
"utf_7",
} or issubclass(
importlib.import_module(f"encodings.{name}").IncrementalDecoder,
MultibyteIncrementalDecoder,
)
def identify_sig_or_bom(sequence: bytes) -> tuple[str | None, bytes]:
"""
Identify and extract SIG/BOM in given sequence.
"""
for iana_encoding in ENCODING_MARKS:
marks: bytes | list[bytes] = ENCODING_MARKS[iana_encoding]
if isinstance(marks, bytes):
marks = [marks]
for mark in marks:
if sequence.startswith(mark):
return iana_encoding, mark
return None, b""
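# Editor's sketch (not upstream code): BOM detection returns both the
# matching encoding name and the raw mark so callers can strip it.
def _demo_identify_sig_or_bom() -> None:
    print(identify_sig_or_bom(b"\xef\xbb\xbfhello"))  # ("utf_8", b"\xef\xbb\xbf")
    print(identify_sig_or_bom(b"hello"))              # (None, b"")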
def should_strip_sig_or_bom(iana_encoding: str) -> bool:
return iana_encoding not in {"utf_16", "utf_32"}
def iana_name(cp_name: str, strict: bool = True) -> str:
"""Returns the Python normalized encoding name (Not the IANA official name)."""
cp_name = cp_name.lower().replace("-", "_")
encoding_alias: str
encoding_iana: str
for encoding_alias, encoding_iana in aliases.items():
if cp_name in [encoding_alias, encoding_iana]:
return encoding_iana
if strict:
raise ValueError(f"Unable to retrieve IANA for '{cp_name}'")
return cp_name
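# Editor's sketch (not upstream code): iana_name() normalizes spellings and
# aliases; with strict=False, unknown names pass through unchanged.
def _demo_iana_name() -> None:
    print(iana_name("UTF-8"))                   # "utf_8"
    print(iana_name("latin1"))                  # "latin_1"
    print(iana_name("nonsense", strict=False))  # "nonsense"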
def cp_similarity(iana_name_a: str, iana_name_b: str) -> float:
if is_multi_byte_encoding(iana_name_a) or is_multi_byte_encoding(iana_name_b):
return 0.0
decoder_a = importlib.import_module(f"encodings.{iana_name_a}").IncrementalDecoder
decoder_b = importlib.import_module(f"encodings.{iana_name_b}").IncrementalDecoder
id_a: IncrementalDecoder = decoder_a(errors="ignore")
id_b: IncrementalDecoder = decoder_b(errors="ignore")
character_match_count: int = 0
for i in range(255):
to_be_decoded: bytes = bytes([i])
if id_a.decode(to_be_decoded) == id_b.decode(to_be_decoded):
character_match_count += 1
return character_match_count / 254
def is_cp_similar(iana_name_a: str, iana_name_b: str) -> bool:
"""
Determine whether two code pages are at least 80% similar. The IANA_SUPPORTED_SIMILAR dict was generated
using the cp_similarity function.
"""
return (
iana_name_a in IANA_SUPPORTED_SIMILAR
and iana_name_b in IANA_SUPPORTED_SIMILAR[iana_name_a]
)
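# Editor's sketch (not upstream code): single-byte code pages agreeing on
# most of the 0x00-0xFE mappings score high; multi-byte codecs short-circuit.
def _demo_cp_similarity() -> None:
    print(cp_similarity("cp1252", "latin_1"))  # high, well above 0.8
    print(cp_similarity("utf_8", "cp1252"))    # 0.0, multi-byte short-circuit
    print(is_cp_similar("cp1252", "latin_1"))  # True if listed in IANA_SUPPORTED_SIMILAR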
def set_logging_handler(
name: str = "charset_normalizer",
level: int = logging.INFO,
format_string: str = "%(asctime)s | %(levelname)s | %(message)s",
) -> None:
logger = logging.getLogger(name)
logger.setLevel(level)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(format_string))
logger.addHandler(handler)
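# Editor's sketch (not upstream code): attach a handler so the package's
# internal detection traces become visible while debugging.
def _demo_logging() -> None:
    set_logging_handler(level=logging.DEBUG)
    logging.getLogger("charset_normalizer").debug("handler attached")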
def cut_sequence_chunks(
sequences: bytes,
encoding_iana: str,
offsets: range,
chunk_size: int,
bom_or_sig_available: bool,
strip_sig_or_bom: bool,
sig_payload: bytes,
is_multi_byte_decoder: bool,
decoded_payload: str | None = None,
) -> Generator[str, None, None]:
if decoded_payload and is_multi_byte_decoder is False:
for i in offsets:
chunk = decoded_payload[i : i + chunk_size]
if not chunk:
break
yield chunk
else:
for i in offsets:
chunk_end = i + chunk_size
if chunk_end > len(sequences) + 8:
continue
cut_sequence = sequences[i : i + chunk_size]
if bom_or_sig_available and strip_sig_or_bom is False:
cut_sequence = sig_payload + cut_sequence
chunk = cut_sequence.decode(
encoding_iana,
errors="ignore" if is_multi_byte_decoder else "strict",
)
# Multi-byte bad-cut detector and adjustment:
# not the cleanest way to perform this fix, but clever enough for now.
if is_multi_byte_decoder and i > 0:
chunk_partial_size_chk: int = min(chunk_size, 16)
if (
decoded_payload
and chunk[:chunk_partial_size_chk] not in decoded_payload
):
for j in range(i, i - 4, -1):
cut_sequence = sequences[j:chunk_end]
if bom_or_sig_available and strip_sig_or_bom is False:
cut_sequence = sig_payload + cut_sequence
chunk = cut_sequence.decode(encoding_iana, errors="ignore")
if chunk[:chunk_partial_size_chk] in decoded_payload:
break
yield chunk

View File

@@ -0,0 +1,8 @@
"""
Expose version
"""
from __future__ import annotations
__version__ = "3.4.4"
VERSION = __version__.split(".")

View File

@@ -0,0 +1 @@
pip

View File

@@ -0,0 +1,98 @@
Metadata-Version: 2.4
Name: Django
Version: 5.2.8
Summary: A high-level Python web framework that encourages rapid development and clean, pragmatic design.
Author-email: Django Software Foundation <foundation@djangoproject.com>
License-Expression: BSD-3-Clause
Project-URL: Homepage, https://www.djangoproject.com/
Project-URL: Documentation, https://docs.djangoproject.com/
Project-URL: Release notes, https://docs.djangoproject.com/en/stable/releases/
Project-URL: Funding, https://www.djangoproject.com/fundraising/
Project-URL: Source, https://github.com/django/django
Project-URL: Tracker, https://code.djangoproject.com/
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Web Environment
Classifier: Framework :: Django
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3 :: Only
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Internet :: WWW/HTTP :: Dynamic Content
Classifier: Topic :: Internet :: WWW/HTTP :: WSGI
Classifier: Topic :: Software Development :: Libraries :: Application Frameworks
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.10
Description-Content-Type: text/x-rst
License-File: LICENSE
License-File: LICENSE.python
Requires-Dist: asgiref>=3.8.1
Requires-Dist: sqlparse>=0.3.1
Requires-Dist: tzdata; sys_platform == "win32"
Provides-Extra: argon2
Requires-Dist: argon2-cffi>=19.1.0; extra == "argon2"
Provides-Extra: bcrypt
Requires-Dist: bcrypt; extra == "bcrypt"
Dynamic: license-file
======
Django
======
Django is a high-level Python web framework that encourages rapid development
and clean, pragmatic design. Thanks for checking it out.
All documentation is in the "``docs``" directory and online at
https://docs.djangoproject.com/en/stable/. If you're just getting started,
here's how we recommend you read the docs:
* First, read ``docs/intro/install.txt`` for instructions on installing Django.
* Next, work through the tutorials in order (``docs/intro/tutorial01.txt``,
``docs/intro/tutorial02.txt``, etc.).
* If you want to set up an actual deployment server, read
``docs/howto/deployment/index.txt`` for instructions.
* You'll probably want to read through the topical guides (in ``docs/topics``)
next; from there you can jump to the HOWTOs (in ``docs/howto``) for specific
problems, and check out the reference (``docs/ref``) for gory details.
* See ``docs/README`` for instructions on building an HTML version of the docs.
Docs are updated rigorously. If you find any problems in the docs, or think
they should be clarified in any way, please take 30 seconds to fill out a
ticket here: https://code.djangoproject.com/newticket
To get more help:
* Join the ``#django`` channel on ``irc.libera.chat``. Lots of helpful people
hang out there. `Webchat is available <https://web.libera.chat/#django>`_.
* Join the `Django Discord community <https://chat.djangoproject.com>`_.
* Join the community on the `Django Forum <https://forum.djangoproject.com/>`_.
To contribute to Django:
* Check out https://docs.djangoproject.com/en/dev/internals/contributing/ for
information about getting involved.
To run Django's test suite:
* Follow the instructions in the "Unit tests" section of
``docs/internals/contributing/writing-code/unit-tests.txt``, published online at
https://docs.djangoproject.com/en/dev/internals/contributing/writing-code/unit-tests/#running-the-unit-tests
Supporting the Development of Django
====================================
Django's development depends on your contributions.
If you depend on Django, remember to support the Django Software Foundation: https://www.djangoproject.com/fundraising/

File diff suppressed because it is too large

View File

@@ -0,0 +1,5 @@
Wheel-Version: 1.0
Generator: setuptools (80.9.0)
Root-Is-Purelib: true
Tag: py3-none-any

View File

@@ -0,0 +1,2 @@
[console_scripts]
django-admin = django.core.management:execute_from_command_line

View File

@@ -0,0 +1,27 @@
Copyright (c) Django Software Foundation and individual contributors.
All rights reserved.
Redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright notice,
this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
3. Neither the name of Django nor the names of its contributors may be used
to endorse or promote products derived from this software without
specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

View File

@@ -0,0 +1,288 @@
Django is licensed under the three-clause BSD license; see the file
LICENSE for details.
Django includes code from the Python standard library, which is licensed under
the Python license, a permissive open source license. The copyright and license
is included below for compliance with Python's terms.
----------------------------------------------------------------------
Copyright (c) 2001-present Python Software Foundation; All Rights Reserved
A. HISTORY OF THE SOFTWARE
==========================
Python was created in the early 1990s by Guido van Rossum at Stichting
Mathematisch Centrum (CWI, see https://www.cwi.nl) in the Netherlands
as a successor of a language called ABC. Guido remains Python's
principal author, although it includes many contributions from others.
In 1995, Guido continued his work on Python at the Corporation for
National Research Initiatives (CNRI, see https://www.cnri.reston.va.us)
in Reston, Virginia where he released several versions of the
software.
In May 2000, Guido and the Python core development team moved to
BeOpen.com to form the BeOpen PythonLabs team. In October of the same
year, the PythonLabs team moved to Digital Creations, which became
Zope Corporation. In 2001, the Python Software Foundation (PSF, see
https://www.python.org/psf/) was formed, a non-profit organization
created specifically to own Python-related Intellectual Property.
Zope Corporation was a sponsoring member of the PSF.
All Python releases are Open Source (see https://opensource.org for
the Open Source Definition). Historically, most, but not all, Python
releases have also been GPL-compatible; the table below summarizes
the various releases.
Release         Derived     Year        Owner       GPL-
                from                                compatible? (1)

0.9.0 thru 1.2              1991-1995   CWI         yes
1.3 thru 1.5.2  1.2         1995-1999   CNRI        yes
1.6             1.5.2       2000        CNRI        no
2.0             1.6         2000        BeOpen.com  no
1.6.1           1.6         2001        CNRI        yes (2)
2.1             2.0+1.6.1   2001        PSF         no
2.0.1           2.0+1.6.1   2001        PSF         yes
2.1.1           2.1+2.0.1   2001        PSF         yes
2.1.2           2.1.1       2002        PSF         yes
2.1.3           2.1.2       2002        PSF         yes
2.2 and above   2.1.1       2001-now    PSF         yes
Footnotes:
(1) GPL-compatible doesn't mean that we're distributing Python under
the GPL. All Python licenses, unlike the GPL, let you distribute
a modified version without making your changes open source. The
GPL-compatible licenses make it possible to combine Python with
other software that is released under the GPL; the others don't.
(2) According to Richard Stallman, 1.6.1 is not GPL-compatible,
because its license has a choice of law clause. According to
CNRI, however, Stallman's lawyer has told CNRI's lawyer that 1.6.1
is "not incompatible" with the GPL.
Thanks to the many outside volunteers who have worked under Guido's
direction to make these releases possible.
B. TERMS AND CONDITIONS FOR ACCESSING OR OTHERWISE USING PYTHON
===============================================================
Python software and documentation are licensed under the
Python Software Foundation License Version 2.
Starting with Python 3.8.6, examples, recipes, and other code in
the documentation are dual licensed under the PSF License Version 2
and the Zero-Clause BSD license.
Some software incorporated into Python is under different licenses.
The licenses are listed with code falling under that license.
PYTHON SOFTWARE FOUNDATION LICENSE VERSION 2
--------------------------------------------
1. This LICENSE AGREEMENT is between the Python Software Foundation
("PSF"), and the Individual or Organization ("Licensee") accessing and
otherwise using this software ("Python") in source or binary form and
its associated documentation.
2. Subject to the terms and conditions of this License Agreement, PSF hereby
grants Licensee a nonexclusive, royalty-free, world-wide license to reproduce,
analyze, test, perform and/or display publicly, prepare derivative works,
distribute, and otherwise use Python alone or in any derivative version,
provided, however, that PSF's License Agreement and PSF's notice of copyright,
i.e., "Copyright (c) 2001 Python Software Foundation; All Rights Reserved"
are retained in Python alone or in any derivative version prepared by Licensee.
3. In the event Licensee prepares a derivative work that is based on
or incorporates Python or any part thereof, and wants to make
the derivative work available to others as provided herein, then
Licensee hereby agrees to include in any such work a brief summary of
the changes made to Python.
4. PSF is making Python available to Licensee on an "AS IS"
basis. PSF MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PSF MAKES NO AND
DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON WILL NOT
INFRINGE ANY THIRD PARTY RIGHTS.
5. PSF SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON
FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS
A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON,
OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
6. This License Agreement will automatically terminate upon a material
breach of its terms and conditions.
7. Nothing in this License Agreement shall be deemed to create any
relationship of agency, partnership, or joint venture between PSF and
Licensee. This License Agreement does not grant permission to use PSF
trademarks or trade name in a trademark sense to endorse or promote
products or services of Licensee, or any third party.
8. By copying, installing or otherwise using Python, Licensee
agrees to be bound by the terms and conditions of this License
Agreement.
BEOPEN.COM LICENSE AGREEMENT FOR PYTHON 2.0
-------------------------------------------
BEOPEN PYTHON OPEN SOURCE LICENSE AGREEMENT VERSION 1
1. This LICENSE AGREEMENT is between BeOpen.com ("BeOpen"), having an
office at 160 Saratoga Avenue, Santa Clara, CA 95051, and the
Individual or Organization ("Licensee") accessing and otherwise using
this software in source or binary form and its associated
documentation ("the Software").
2. Subject to the terms and conditions of this BeOpen Python License
Agreement, BeOpen hereby grants Licensee a non-exclusive,
royalty-free, world-wide license to reproduce, analyze, test, perform
and/or display publicly, prepare derivative works, distribute, and
otherwise use the Software alone or in any derivative version,
provided, however, that the BeOpen Python License is retained in the
Software, alone or in any derivative version prepared by Licensee.
3. BeOpen is making the Software available to Licensee on an "AS IS"
basis. BEOPEN MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, BEOPEN MAKES NO AND
DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE SOFTWARE WILL NOT
INFRINGE ANY THIRD PARTY RIGHTS.
4. BEOPEN SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF THE
SOFTWARE FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS
AS A RESULT OF USING, MODIFYING OR DISTRIBUTING THE SOFTWARE, OR ANY
DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
5. This License Agreement will automatically terminate upon a material
breach of its terms and conditions.
6. This License Agreement shall be governed by and interpreted in all
respects by the law of the State of California, excluding conflict of
law provisions. Nothing in this License Agreement shall be deemed to
create any relationship of agency, partnership, or joint venture
between BeOpen and Licensee. This License Agreement does not grant
permission to use BeOpen trademarks or trade names in a trademark
sense to endorse or promote products or services of Licensee, or any
third party. As an exception, the "BeOpen Python" logos available at
http://www.pythonlabs.com/logos.html may be used according to the
permissions granted on that web page.
7. By copying, installing or otherwise using the software, Licensee
agrees to be bound by the terms and conditions of this License
Agreement.
CNRI LICENSE AGREEMENT FOR PYTHON 1.6.1
---------------------------------------
1. This LICENSE AGREEMENT is between the Corporation for National
Research Initiatives, having an office at 1895 Preston White Drive,
Reston, VA 20191 ("CNRI"), and the Individual or Organization
("Licensee") accessing and otherwise using Python 1.6.1 software in
source or binary form and its associated documentation.
2. Subject to the terms and conditions of this License Agreement, CNRI
hereby grants Licensee a nonexclusive, royalty-free, world-wide
license to reproduce, analyze, test, perform and/or display publicly,
prepare derivative works, distribute, and otherwise use Python 1.6.1
alone or in any derivative version, provided, however, that CNRI's
License Agreement and CNRI's notice of copyright, i.e., "Copyright (c)
1995-2001 Corporation for National Research Initiatives; All Rights
Reserved" are retained in Python 1.6.1 alone or in any derivative
version prepared by Licensee. Alternately, in lieu of CNRI's License
Agreement, Licensee may substitute the following text (omitting the
quotes): "Python 1.6.1 is made available subject to the terms and
conditions in CNRI's License Agreement. This Agreement together with
Python 1.6.1 may be located on the internet using the following
unique, persistent identifier (known as a handle): 1895.22/1013. This
Agreement may also be obtained from a proxy server on the internet
using the following URL: http://hdl.handle.net/1895.22/1013".
3. In the event Licensee prepares a derivative work that is based on
or incorporates Python 1.6.1 or any part thereof, and wants to make
the derivative work available to others as provided herein, then
Licensee hereby agrees to include in any such work a brief summary of
the changes made to Python 1.6.1.
4. CNRI is making Python 1.6.1 available to Licensee on an "AS IS"
basis. CNRI MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, CNRI MAKES NO AND
DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON 1.6.1 WILL NOT
INFRINGE ANY THIRD PARTY RIGHTS.
5. CNRI SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON
1.6.1 FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS
A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON 1.6.1,
OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
6. This License Agreement will automatically terminate upon a material
breach of its terms and conditions.
7. This License Agreement shall be governed by the federal
intellectual property law of the United States, including without
limitation the federal copyright law, and, to the extent such
U.S. federal law does not apply, by the law of the Commonwealth of
Virginia, excluding Virginia's conflict of law provisions.
Notwithstanding the foregoing, with regard to derivative works based
on Python 1.6.1 that incorporate non-separable material that was
previously distributed under the GNU General Public License (GPL), the
law of the Commonwealth of Virginia shall govern this License
Agreement only as to issues arising under or with respect to
Paragraphs 4, 5, and 7 of this License Agreement. Nothing in this
License Agreement shall be deemed to create any relationship of
agency, partnership, or joint venture between CNRI and Licensee. This
License Agreement does not grant permission to use CNRI trademarks or
trade name in a trademark sense to endorse or promote products or
services of Licensee, or any third party.
8. By clicking on the "ACCEPT" button where indicated, or by copying,
installing or otherwise using Python 1.6.1, Licensee agrees to be
bound by the terms and conditions of this License Agreement.
ACCEPT
CWI LICENSE AGREEMENT FOR PYTHON 0.9.0 THROUGH 1.2
--------------------------------------------------
Copyright (c) 1991 - 1995, Stichting Mathematisch Centrum Amsterdam,
The Netherlands. All rights reserved.
Permission to use, copy, modify, and distribute this software and its
documentation for any purpose and without fee is hereby granted,
provided that the above copyright notice appear in all copies and that
both that copyright notice and this permission notice appear in
supporting documentation, and that the name of Stichting Mathematisch
Centrum or CWI not be used in advertising or publicity pertaining to
distribution of the software without specific, written prior
permission.
STICHTING MATHEMATISCH CENTRUM DISCLAIMS ALL WARRANTIES WITH REGARD TO
THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS, IN NO EVENT SHALL STICHTING MATHEMATISCH CENTRUM BE LIABLE
FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT
OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
ZERO-CLAUSE BSD LICENSE FOR CODE IN THE PYTHON DOCUMENTATION
----------------------------------------------------------------------
Permission to use, copy, modify, and/or distribute this software for any
purpose with or without fee is hereby granted.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH
REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY
AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT,
INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM
LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THIS SOFTWARE.

View File

@@ -0,0 +1 @@
django

View File

@@ -0,0 +1,24 @@
from django.utils.version import get_version
VERSION = (5, 2, 8, "final", 0)
__version__ = get_version(VERSION)
def setup(set_prefix=True):
"""
Configure the settings (this happens as a side effect of accessing the
first setting), configure logging and populate the app registry.
Set the thread-local urlresolvers script prefix if `set_prefix` is True.
"""
from django.apps import apps
from django.conf import settings
from django.urls import set_script_prefix
from django.utils.log import configure_logging
configure_logging(settings.LOGGING_CONFIG, settings.LOGGING)
if set_prefix:
set_script_prefix(
"/" if settings.FORCE_SCRIPT_NAME is None else settings.FORCE_SCRIPT_NAME
)
apps.populate(settings.INSTALLED_APPS)
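# Editor's sketch (not upstream code): the documented pattern for standalone
# scripts; "myproject.settings" is a hypothetical placeholder module path.
def _demo_standalone_setup():
    import os

    os.environ.setdefault("DJANGO_SETTINGS_MODULE", "myproject.settings")
    setup()  # configures logging and populates the app registry

    from django.apps import apps  # safe to query only after setup()
    print([app_config.label for app_config in apps.get_app_configs()])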

View File

@@ -0,0 +1,10 @@
"""
Invokes django-admin when the django module is run as a script.
Example: python -m django check
"""
from django.core import management
if __name__ == "__main__":
management.execute_from_command_line()

View File

@@ -0,0 +1,4 @@
from .config import AppConfig
from .registry import apps
__all__ = ["AppConfig", "apps"]

View File

@@ -0,0 +1,274 @@
import inspect
import os
from importlib import import_module
from django.core.exceptions import ImproperlyConfigured
from django.utils.functional import cached_property
from django.utils.module_loading import import_string, module_has_submodule
APPS_MODULE_NAME = "apps"
MODELS_MODULE_NAME = "models"
class AppConfig:
"""Class representing a Django application and its configuration."""
def __init__(self, app_name, app_module):
# Full Python path to the application e.g. 'django.contrib.admin'.
self.name = app_name
# Root module for the application e.g. <module 'django.contrib.admin'
# from 'django/contrib/admin/__init__.py'>.
self.module = app_module
# Reference to the Apps registry that holds this AppConfig. Set by the
# registry when it registers the AppConfig instance.
self.apps = None
# The following attributes could be defined at the class level in a
# subclass, hence the test-and-set pattern.
# Last component of the Python path to the application e.g. 'admin'.
# This value must be unique across a Django project.
if not hasattr(self, "label"):
self.label = app_name.rpartition(".")[2]
if not self.label.isidentifier():
raise ImproperlyConfigured(
"The app label '%s' is not a valid Python identifier." % self.label
)
# Human-readable name for the application e.g. "Admin".
if not hasattr(self, "verbose_name"):
self.verbose_name = self.label.title()
# Filesystem path to the application directory e.g.
# '/path/to/django/contrib/admin'.
if not hasattr(self, "path"):
self.path = self._path_from_module(app_module)
# Module containing models e.g. <module 'django.contrib.admin.models'
# from 'django/contrib/admin/models.py'>. Set by import_models().
# None if the application doesn't have a models module.
self.models_module = None
# Mapping of lowercase model names to model classes. Initially set to
# None to prevent accidental access before import_models() runs.
self.models = None
def __repr__(self):
return "<%s: %s>" % (self.__class__.__name__, self.label)
@cached_property
def default_auto_field(self):
from django.conf import settings
return settings.DEFAULT_AUTO_FIELD
@property
def _is_default_auto_field_overridden(self):
return self.__class__.default_auto_field is not AppConfig.default_auto_field
def _path_from_module(self, module):
"""Attempt to determine app's filesystem path from its module."""
# See #21874 for extended discussion of the behavior of this method in
# various cases.
# Convert to list because __path__ may not support indexing.
paths = list(getattr(module, "__path__", []))
if len(paths) != 1:
filename = getattr(module, "__file__", None)
if filename is not None:
paths = [os.path.dirname(filename)]
else:
# For unknown reasons, sometimes the list returned by __path__
# contains duplicates that must be removed (#25246).
paths = list(set(paths))
if len(paths) > 1:
raise ImproperlyConfigured(
"The app module %r has multiple filesystem locations (%r); "
"you must configure this app with an AppConfig subclass "
"with a 'path' class attribute." % (module, paths)
)
elif not paths:
raise ImproperlyConfigured(
"The app module %r has no filesystem location, "
"you must configure this app with an AppConfig subclass "
"with a 'path' class attribute." % module
)
return paths[0]
@classmethod
def create(cls, entry):
"""
Factory that creates an app config from an entry in INSTALLED_APPS.
"""
# create() eventually returns app_config_class(app_name, app_module).
app_config_class = None
app_name = None
app_module = None
# If import_module succeeds, entry points to the app module.
try:
app_module = import_module(entry)
except Exception:
pass
else:
# If app_module has an apps submodule that defines a single
# AppConfig subclass, use it automatically.
# To prevent this, an AppConfig subclass can declare a class
# variable default = False.
# If the apps module defines more than one AppConfig subclass,
# the default one can declare default = True.
if module_has_submodule(app_module, APPS_MODULE_NAME):
mod_path = "%s.%s" % (entry, APPS_MODULE_NAME)
mod = import_module(mod_path)
# Check if there's exactly one AppConfig candidate,
# excluding those that explicitly define default = False.
app_configs = [
(name, candidate)
for name, candidate in inspect.getmembers(mod, inspect.isclass)
if (
issubclass(candidate, cls)
and candidate is not cls
and getattr(candidate, "default", True)
)
]
if len(app_configs) == 1:
app_config_class = app_configs[0][1]
else:
# Check if there's exactly one AppConfig subclass,
# among those that explicitly define default = True.
app_configs = [
(name, candidate)
for name, candidate in app_configs
if getattr(candidate, "default", False)
]
if len(app_configs) > 1:
candidates = [repr(name) for name, _ in app_configs]
raise RuntimeError(
"%r declares more than one default AppConfig: "
"%s." % (mod_path, ", ".join(candidates))
)
elif len(app_configs) == 1:
app_config_class = app_configs[0][1]
# Use the default app config class if we didn't find anything.
if app_config_class is None:
app_config_class = cls
app_name = entry
# If import_string succeeds, entry is an app config class.
if app_config_class is None:
try:
app_config_class = import_string(entry)
except Exception:
pass
# If both import_module and import_string failed, it means that entry
# doesn't have a valid value.
if app_module is None and app_config_class is None:
# If the last component of entry starts with an uppercase letter,
# then it was likely intended to be an app config class; if not,
# an app module. Provide a nice error message in both cases.
mod_path, _, cls_name = entry.rpartition(".")
if mod_path and cls_name[0].isupper():
# We could simply re-trigger the string import exception, but
# we're going the extra mile and providing a better error
# message for typos in INSTALLED_APPS.
# This may raise ImportError, which is the best exception
# possible if the module at mod_path cannot be imported.
mod = import_module(mod_path)
candidates = [
repr(name)
for name, candidate in inspect.getmembers(mod, inspect.isclass)
if issubclass(candidate, cls) and candidate is not cls
]
msg = "Module '%s' does not contain a '%s' class." % (
mod_path,
cls_name,
)
if candidates:
msg += " Choices are: %s." % ", ".join(candidates)
raise ImportError(msg)
else:
# Re-trigger the module import exception.
import_module(entry)
# Check for obvious errors. (This check prevents duck typing, but
# it could be removed if it became a problem in practice.)
if not issubclass(app_config_class, AppConfig):
raise ImproperlyConfigured("'%s' isn't a subclass of AppConfig." % entry)
# Obtain app name here rather than in AppClass.__init__ to keep
# all error checking for entries in INSTALLED_APPS in one place.
if app_name is None:
try:
app_name = app_config_class.name
except AttributeError:
raise ImproperlyConfigured("'%s' must supply a name attribute." % entry)
# Ensure app_name points to a valid module.
try:
app_module = import_module(app_name)
except ImportError:
raise ImproperlyConfigured(
"Cannot import '%s'. Check that '%s.%s.name' is correct."
% (
app_name,
app_config_class.__module__,
app_config_class.__qualname__,
)
)
# Entry is a path to an app config class.
return app_config_class(app_name, app_module)
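# Editor's note (sketch, not upstream code): both common INSTALLED_APPS
# spellings resolve through create(); the app names below are hypothetical.
#
#     INSTALLED_APPS = [
#         "orders",                        # app module; default AppConfig auto-detected
#         "payments.apps.PaymentsConfig",  # explicit dotted path to a subclass
#     ]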
def get_model(self, model_name, require_ready=True):
"""
Return the model with the given case-insensitive model_name.
Raise LookupError if no model exists with this name.
"""
if require_ready:
self.apps.check_models_ready()
else:
self.apps.check_apps_ready()
try:
return self.models[model_name.lower()]
except KeyError:
raise LookupError(
"App '%s' doesn't have a '%s' model." % (self.label, model_name)
)
def get_models(self, include_auto_created=False, include_swapped=False):
"""
Return an iterable of models.
By default, the following models aren't included:
- auto-created models for many-to-many relations without
an explicit intermediate table,
- models that have been swapped out.
Set the corresponding keyword argument to True to include such models.
Keyword arguments aren't documented; they're a private API.
"""
self.apps.check_models_ready()
for model in self.models.values():
if model._meta.auto_created and not include_auto_created:
continue
if model._meta.swapped and not include_swapped:
continue
yield model
def import_models(self):
# Dictionary of models for this app, primarily maintained in the
# 'all_models' attribute of the Apps this AppConfig is attached to.
self.models = self.apps.all_models[self.label]
if module_has_submodule(self.module, MODELS_MODULE_NAME):
models_module_name = "%s.%s" % (self.name, MODELS_MODULE_NAME)
self.models_module = import_module(models_module_name)
def ready(self):
"""
Override this method in subclasses to run code when Django starts.
"""

View File

@@ -0,0 +1,437 @@
import functools
import sys
import threading
import warnings
from collections import Counter, defaultdict
from functools import partial
from django.core.exceptions import AppRegistryNotReady, ImproperlyConfigured
from .config import AppConfig
class Apps:
"""
A registry that stores the configuration of installed applications.
It also keeps track of models, e.g. to provide reverse relations.
"""
def __init__(self, installed_apps=()):
# installed_apps is set to None when creating the main registry
# because it cannot be populated at that point. Other registries must
# provide a list of installed apps and are populated immediately.
if installed_apps is None and hasattr(sys.modules[__name__], "apps"):
raise RuntimeError("You must supply an installed_apps argument.")
# Mapping of app labels => model names => model classes. Every time a
# model is imported, ModelBase.__new__ calls apps.register_model which
# creates an entry in all_models. All imported models are registered,
# regardless of whether they're defined in an installed application
# and whether the registry has been populated. Since it isn't possible
# to reimport a module safely (it could reexecute initialization code)
# all_models is never overridden or reset.
self.all_models = defaultdict(dict)
# Mapping of labels to AppConfig instances for installed apps.
self.app_configs = {}
# Stack of app_configs. Used to store the current state in
# set_available_apps and set_installed_apps.
self.stored_app_configs = []
# Whether the registry is populated.
self.apps_ready = self.models_ready = self.ready = False
# For the autoreloader.
self.ready_event = threading.Event()
# Lock for thread-safe population.
self._lock = threading.RLock()
self.loading = False
# Maps ("app_label", "modelname") tuples to lists of functions to be
# called when the corresponding model is ready. Used by this class's
# `lazy_model_operation()` and `do_pending_operations()` methods.
self._pending_operations = defaultdict(list)
# Populate apps and models, unless it's the main registry.
if installed_apps is not None:
self.populate(installed_apps)
def populate(self, installed_apps=None):
"""
Load application configurations and models.
Import each application module and then each model module.
It is thread-safe and idempotent, but not reentrant.
"""
if self.ready:
return
# populate() might be called by two threads in parallel on servers
# that create threads before initializing the WSGI callable.
with self._lock:
if self.ready:
return
# An RLock prevents other threads from entering this section. The
# compare and set operation below is atomic.
if self.loading:
# Prevent reentrant calls to avoid running AppConfig.ready()
# methods twice.
raise RuntimeError("populate() isn't reentrant")
self.loading = True
# Phase 1: initialize app configs and import app modules.
for entry in installed_apps:
if isinstance(entry, AppConfig):
app_config = entry
else:
app_config = AppConfig.create(entry)
if app_config.label in self.app_configs:
raise ImproperlyConfigured(
"Application labels aren't unique, "
"duplicates: %s" % app_config.label
)
self.app_configs[app_config.label] = app_config
app_config.apps = self
# Check for duplicate app names.
counts = Counter(
app_config.name for app_config in self.app_configs.values()
)
duplicates = [name for name, count in counts.most_common() if count > 1]
if duplicates:
raise ImproperlyConfigured(
"Application names aren't unique, "
"duplicates: %s" % ", ".join(duplicates)
)
self.apps_ready = True
# Phase 2: import models modules.
for app_config in self.app_configs.values():
app_config.import_models()
self.clear_cache()
self.models_ready = True
# Phase 3: run ready() methods of app configs.
for app_config in self.get_app_configs():
app_config.ready()
self.ready = True
self.ready_event.set()
def check_apps_ready(self):
"""Raise an exception if all apps haven't been imported yet."""
if not self.apps_ready:
from django.conf import settings
# If "not ready" is due to unconfigured settings, accessing
# INSTALLED_APPS raises a more helpful ImproperlyConfigured
# exception.
settings.INSTALLED_APPS
raise AppRegistryNotReady("Apps aren't loaded yet.")
def check_models_ready(self):
"""Raise an exception if all models haven't been imported yet."""
if not self.models_ready:
raise AppRegistryNotReady("Models aren't loaded yet.")
def get_app_configs(self):
"""Import applications and return an iterable of app configs."""
self.check_apps_ready()
return self.app_configs.values()
def get_app_config(self, app_label):
"""
Import applications and return an app config for the given label.
Raise LookupError if no application exists with this label.
"""
self.check_apps_ready()
try:
return self.app_configs[app_label]
except KeyError:
message = "No installed app with label '%s'." % app_label
for app_config in self.get_app_configs():
if app_config.name == app_label:
message += " Did you mean '%s'?" % app_config.label
break
raise LookupError(message)
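    # Usage sketch (editor's note): with the registry populated,
    #     apps.get_app_config("admin").verbose_name
    # returns the admin app's verbose name, while passing a full dotted
    # name raises LookupError with a hint, per the code above:
    #     apps.get_app_config("django.contrib.admin")
    #     LookupError: No installed app with label 'django.contrib.admin'.
    #     Did you mean 'admin'?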
# This method is performance-critical at least for Django's test suite.
@functools.cache
def get_models(self, include_auto_created=False, include_swapped=False):
"""
Return a list of all installed models.
By default, the following models aren't included:
- auto-created models for many-to-many relations without
an explicit intermediate table,
- models that have been swapped out.
Set the corresponding keyword argument to True to include such models.
"""
self.check_models_ready()
result = []
for app_config in self.app_configs.values():
result.extend(app_config.get_models(include_auto_created, include_swapped))
return result
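    # Usage sketch (editor's note):
    #     from django.apps import apps
    #     [m._meta.label for m in apps.get_models()]
    #     ['admin.LogEntry', 'auth.User', ...]  # depends on INSTALLED_APPS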
def get_model(self, app_label, model_name=None, require_ready=True):
"""
Return the model matching the given app_label and model_name.
As a shortcut, app_label may be in the form <app_label>.<model_name>.
model_name is case-insensitive.
Raise LookupError if no application exists with this label, or no
model exists with this name in the application. Raise ValueError if
called with a single argument that doesn't contain exactly one dot.
"""
if require_ready:
self.check_models_ready()
else:
self.check_apps_ready()
if model_name is None:
app_label, model_name = app_label.split(".")
app_config = self.get_app_config(app_label)
if not require_ready and app_config.models is None:
app_config.import_models()
return app_config.get_model(model_name, require_ready=require_ready)
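    # Usage sketch (editor's note): both call forms resolve to the same
    # class once models are loaded, and model_name is case-insensitive:
    #     apps.get_model("auth", "User") is apps.get_model("auth.user")  # True
    # A single argument without exactly one dot raises ValueError.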
def register_model(self, app_label, model):
# Since this method is called when models are imported, it cannot
# perform imports because of the risk of import loops. It mustn't
# call get_app_config().
model_name = model._meta.model_name
app_models = self.all_models[app_label]
if model_name in app_models:
if (
model.__name__ == app_models[model_name].__name__
and model.__module__ == app_models[model_name].__module__
):
warnings.warn(
"Model '%s.%s' was already registered. Reloading models is not "
"advised as it can lead to inconsistencies, most notably with "
"related models." % (app_label, model_name),
RuntimeWarning,
stacklevel=2,
)
else:
raise RuntimeError(
"Conflicting '%s' models in application '%s': %s and %s."
% (model_name, app_label, app_models[model_name], model)
)
app_models[model_name] = model
self.do_pending_operations(model)
self.clear_cache()
def is_installed(self, app_name):
"""
Check whether an application with this name exists in the registry.
app_name is the full name of the app e.g. 'django.contrib.admin'.
"""
self.check_apps_ready()
return any(ac.name == app_name for ac in self.app_configs.values())
def get_containing_app_config(self, object_name):
"""
Look for an app config containing a given object.
object_name is the dotted Python path to the object.
Return the app config for the inner application in case of nesting.
Return None if the object isn't in any registered app config.
"""
self.check_apps_ready()
candidates = []
for app_config in self.app_configs.values():
if object_name.startswith(app_config.name):
subpath = object_name.removeprefix(app_config.name)
if subpath == "" or subpath[0] == ".":
candidates.append(app_config)
if candidates:
return sorted(candidates, key=lambda ac: -len(ac.name))[0]
def get_registered_model(self, app_label, model_name):
"""
Similar to get_model(), but doesn't require that an app exists with
the given app_label.
It's safe to call this method at import time, even while the registry
is being populated.
"""
model = self.all_models[app_label].get(model_name.lower())
if model is None:
raise LookupError("Model '%s.%s' not registered." % (app_label, model_name))
return model
@functools.cache
def get_swappable_settings_name(self, to_string):
"""
For a given model string (e.g. "auth.User"), return the name of the
corresponding settings name if it refers to a swappable model. If the
referred model is not swappable, return None.
This method is decorated with @functools.cache because it's performance
critical when it comes to migrations. Since the swappable settings don't
change after Django has loaded the settings, there is no reason to get
the respective settings attribute over and over again.
"""
to_string = to_string.lower()
for model in self.get_models(include_swapped=True):
swapped = model._meta.swapped
# Is this model swapped out for the model given by to_string?
if swapped and swapped.lower() == to_string:
return model._meta.swappable
# Is this model swappable and the one given by to_string?
if model._meta.swappable and model._meta.label_lower == to_string:
return model._meta.swappable
return None
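    # Worked example (editor's note): with the default user model,
    #     apps.get_swappable_settings_name("auth.user")  ->  "AUTH_USER_MODEL"
    # because auth.User declares swappable = "AUTH_USER_MODEL".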
def set_available_apps(self, available):
"""
Restrict the set of installed apps used by get_app_config[s].
available must be an iterable of application names.
set_available_apps() must be balanced with unset_available_apps().
Primarily used for performance optimization in TransactionTestCase.
This method is safe in the sense that it doesn't trigger any imports.
"""
available = set(available)
installed = {app_config.name for app_config in self.get_app_configs()}
if not available.issubset(installed):
raise ValueError(
"Available apps isn't a subset of installed apps, extra apps: %s"
% ", ".join(available - installed)
)
self.stored_app_configs.append(self.app_configs)
self.app_configs = {
label: app_config
for label, app_config in self.app_configs.items()
if app_config.name in available
}
self.clear_cache()
def unset_available_apps(self):
"""Cancel a previous call to set_available_apps()."""
self.app_configs = self.stored_app_configs.pop()
self.clear_cache()
def set_installed_apps(self, installed):
"""
Enable a different set of installed apps for get_app_config[s].
installed must be an iterable in the same format as INSTALLED_APPS.
set_installed_apps() must be balanced with unset_installed_apps(),
even if it exits with an exception.
Primarily used as a receiver of the setting_changed signal in tests.
This method may trigger new imports, which may add new models to the
registry of all imported models. They will stay in the registry even
after unset_installed_apps(). Since it isn't possible to replay
imports safely (e.g. that could lead to registering listeners twice),
models are registered when they're imported and never removed.
"""
if not self.ready:
raise AppRegistryNotReady("App registry isn't ready yet.")
self.stored_app_configs.append(self.app_configs)
self.app_configs = {}
self.apps_ready = self.models_ready = self.loading = self.ready = False
self.clear_cache()
self.populate(installed)
def unset_installed_apps(self):
"""Cancel a previous call to set_installed_apps()."""
self.app_configs = self.stored_app_configs.pop()
self.apps_ready = self.models_ready = self.ready = True
self.clear_cache()
def clear_cache(self):
"""
Clear all internal caches, for methods that alter the app registry.
This is mostly used in tests.
"""
self.get_swappable_settings_name.cache_clear()
# Call expire cache on each model. This will purge
# the relation tree and the fields cache.
self.get_models.cache_clear()
if self.ready:
            # Circumvent self.get_models() to prevent the cache from being
            # refilled. In particular, this avoids caching an empty value
            # while cloning.
for app_config in self.app_configs.values():
for model in app_config.get_models(include_auto_created=True):
model._meta._expire_cache()
def lazy_model_operation(self, function, *model_keys):
"""
Take a function and a number of ("app_label", "modelname") tuples, and
when all the corresponding models have been imported and registered,
call the function with the model classes as its arguments.
The function passed to this method must accept exactly n models as
arguments, where n=len(model_keys).
"""
# Base case: no arguments, just execute the function.
if not model_keys:
function()
# Recursive case: take the head of model_keys, wait for the
# corresponding model class to be imported and registered, then apply
# that argument to the supplied function. Pass the resulting partial
# to lazy_model_operation() along with the remaining model args and
# repeat until all models are loaded and all arguments are applied.
else:
next_model, *more_models = model_keys
# This will be executed after the class corresponding to next_model
# has been imported and registered. The `func` attribute provides
# duck-type compatibility with partials.
def apply_next_model(model):
next_function = partial(apply_next_model.func, model)
self.lazy_model_operation(next_function, *more_models)
apply_next_model.func = function
# If the model has already been imported and registered, partially
# apply it to the function now. If not, add it to the list of
# pending operations for the model, where it will be executed with
# the model class as its sole argument once the model is ready.
try:
model_class = self.get_registered_model(*next_model)
except LookupError:
self._pending_operations[next_model].append(apply_next_model)
else:
apply_next_model(model_class)
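    # Usage sketch (editor's note): run a callback once both models exist,
    # whether they are already registered or still to be imported:
    #     def connect(user_model, group_model):
    #         ...  # e.g. wire up behavior that needs both classes
    #     apps.lazy_model_operation(connect, ("auth", "user"), ("auth", "group"))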
def do_pending_operations(self, model):
"""
Take a newly-prepared model and pass it to each function waiting for
it. This is called at the very end of Apps.register_model().
"""
key = model._meta.app_label, model._meta.model_name
for function in self._pending_operations.pop(key, []):
function(model)
apps = Apps(installed_apps=None)
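# Editor's note: this is the main registry singleton; application code is
# expected to import it as `from django.apps import apps` rather than
# instantiating Apps directly.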

272
django/conf/__init__.py Normal file
View File

@@ -0,0 +1,272 @@
"""
Settings and configuration for Django.
Read values from the module specified by the DJANGO_SETTINGS_MODULE environment
variable, and then from django.conf.global_settings; see the global_settings.py
for a list of all possible variables.
"""
import importlib
import os
import time
import traceback
import warnings
from pathlib import Path
import django
from django.conf import global_settings
from django.core.exceptions import ImproperlyConfigured
from django.utils.deprecation import RemovedInDjango60Warning
from django.utils.functional import LazyObject, empty
ENVIRONMENT_VARIABLE = "DJANGO_SETTINGS_MODULE"
DEFAULT_STORAGE_ALIAS = "default"
STATICFILES_STORAGE_ALIAS = "staticfiles"
# RemovedInDjango60Warning.
FORMS_URLFIELD_ASSUME_HTTPS_DEPRECATED_MSG = (
"The FORMS_URLFIELD_ASSUME_HTTPS transitional setting is deprecated."
)
class SettingsReference(str):
"""
String subclass which references a current settings value. It's treated as
the value in memory but serializes to a settings.NAME attribute reference.
"""
def __new__(self, value, setting_name):
return str.__new__(self, value)
def __init__(self, value, setting_name):
self.setting_name = setting_name
class LazySettings(LazyObject):
"""
A lazy proxy for either global Django settings or a custom settings object.
The user can manually configure settings prior to using them. Otherwise,
Django uses the settings module pointed to by DJANGO_SETTINGS_MODULE.
"""
def _setup(self, name=None):
"""
Load the settings module pointed to by the environment variable. This
is used the first time settings are needed, if the user hasn't
configured settings manually.
"""
settings_module = os.environ.get(ENVIRONMENT_VARIABLE)
if not settings_module:
desc = ("setting %s" % name) if name else "settings"
raise ImproperlyConfigured(
"Requested %s, but settings are not configured. "
"You must either define the environment variable %s "
"or call settings.configure() before accessing settings."
% (desc, ENVIRONMENT_VARIABLE)
)
self._wrapped = Settings(settings_module)
def __repr__(self):
# Hardcode the class name as otherwise it yields 'Settings'.
if self._wrapped is empty:
return "<LazySettings [Unevaluated]>"
return '<LazySettings "%(settings_module)s">' % {
"settings_module": self._wrapped.SETTINGS_MODULE,
}
def __getattr__(self, name):
"""Return the value of a setting and cache it in self.__dict__."""
if (_wrapped := self._wrapped) is empty:
self._setup(name)
_wrapped = self._wrapped
val = getattr(_wrapped, name)
# Special case some settings which require further modification.
# This is done here for performance reasons so the modified value is cached.
if name in {"MEDIA_URL", "STATIC_URL"} and val is not None:
val = self._add_script_prefix(val)
elif name == "SECRET_KEY" and not val:
raise ImproperlyConfigured("The SECRET_KEY setting must not be empty.")
self.__dict__[name] = val
return val
def __setattr__(self, name, value):
"""
Set the value of setting. Clear all cached values if _wrapped changes
(@override_settings does this) or clear single values when set.
"""
if name == "_wrapped":
self.__dict__.clear()
else:
self.__dict__.pop(name, None)
super().__setattr__(name, value)
def __delattr__(self, name):
"""Delete a setting and clear it from cache if needed."""
super().__delattr__(name)
self.__dict__.pop(name, None)
def configure(self, default_settings=global_settings, **options):
"""
Called to manually configure the settings. The 'default_settings'
parameter sets where to retrieve any unspecified values from (its
argument must support attribute access (__getattr__)).
"""
if self._wrapped is not empty:
raise RuntimeError("Settings already configured.")
holder = UserSettingsHolder(default_settings)
for name, value in options.items():
if not name.isupper():
raise TypeError("Setting %r must be uppercase." % name)
setattr(holder, name, value)
self._wrapped = holder
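    # Usage sketch (editor's note): standalone scripts can configure settings
    # manually instead of setting DJANGO_SETTINGS_MODULE, e.g.
    #     from django.conf import settings
    #     if not settings.configured:
    #         settings.configure(DEBUG=True, INSTALLED_APPS=[])
    # Keyword names must be uppercase or configure() raises TypeError.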
@staticmethod
def _add_script_prefix(value):
"""
Add SCRIPT_NAME prefix to relative paths.
Useful when the app is being served at a subpath and manually prefixing
subpath to STATIC_URL and MEDIA_URL in settings is inconvenient.
"""
# Don't apply prefix to absolute paths and URLs.
if value.startswith(("http://", "https://", "/")):
return value
from django.urls import get_script_prefix
return "%s%s" % (get_script_prefix(), value)
@property
def configured(self):
"""Return True if the settings have already been configured."""
return self._wrapped is not empty
def _show_deprecation_warning(self, message, category):
stack = traceback.extract_stack()
# Show a warning if the setting is used outside of Django.
# Stack index: -1 this line, -2 the property, -3 the
# LazyObject __getattribute__(), -4 the caller.
filename, _, _, _ = stack[-4]
if not filename.startswith(os.path.dirname(django.__file__)):
warnings.warn(message, category, stacklevel=2)
class Settings:
def __init__(self, settings_module):
# update this dict from global settings (but only for ALL_CAPS settings)
for setting in dir(global_settings):
if setting.isupper():
setattr(self, setting, getattr(global_settings, setting))
# store the settings module in case someone later cares
self.SETTINGS_MODULE = settings_module
mod = importlib.import_module(self.SETTINGS_MODULE)
tuple_settings = (
"ALLOWED_HOSTS",
"INSTALLED_APPS",
"TEMPLATE_DIRS",
"LOCALE_PATHS",
"SECRET_KEY_FALLBACKS",
)
self._explicit_settings = set()
for setting in dir(mod):
if setting.isupper():
setting_value = getattr(mod, setting)
if setting in tuple_settings and not isinstance(
setting_value, (list, tuple)
):
raise ImproperlyConfigured(
"The %s setting must be a list or a tuple." % setting
)
setattr(self, setting, setting_value)
self._explicit_settings.add(setting)
if self.is_overridden("FORMS_URLFIELD_ASSUME_HTTPS"):
warnings.warn(
FORMS_URLFIELD_ASSUME_HTTPS_DEPRECATED_MSG,
RemovedInDjango60Warning,
)
if hasattr(time, "tzset") and self.TIME_ZONE:
# When we can, attempt to validate the timezone. If we can't find
# this file, no check happens and it's harmless.
zoneinfo_root = Path("/usr/share/zoneinfo")
zone_info_file = zoneinfo_root.joinpath(*self.TIME_ZONE.split("/"))
if zoneinfo_root.exists() and not zone_info_file.exists():
raise ValueError("Incorrect timezone setting: %s" % self.TIME_ZONE)
# Move the time zone info into os.environ. See ticket #2315 for why
# we don't do this unconditionally (breaks Windows).
os.environ["TZ"] = self.TIME_ZONE
time.tzset()
def is_overridden(self, setting):
return setting in self._explicit_settings
def __repr__(self):
return '<%(cls)s "%(settings_module)s">' % {
"cls": self.__class__.__name__,
"settings_module": self.SETTINGS_MODULE,
}
class UserSettingsHolder:
"""Holder for user configured settings."""
# SETTINGS_MODULE doesn't make much sense in the manually configured
# (standalone) case.
SETTINGS_MODULE = None
def __init__(self, default_settings):
"""
Requests for configuration variables not in this class are satisfied
from the module specified in default_settings (if possible).
"""
self.__dict__["_deleted"] = set()
self.default_settings = default_settings
def __getattr__(self, name):
if not name.isupper() or name in self._deleted:
raise AttributeError
return getattr(self.default_settings, name)
def __setattr__(self, name, value):
self._deleted.discard(name)
if name == "FORMS_URLFIELD_ASSUME_HTTPS":
warnings.warn(
FORMS_URLFIELD_ASSUME_HTTPS_DEPRECATED_MSG,
RemovedInDjango60Warning,
)
super().__setattr__(name, value)
def __delattr__(self, name):
self._deleted.add(name)
if hasattr(self, name):
super().__delattr__(name)
def __dir__(self):
return sorted(
s
for s in [*self.__dict__, *dir(self.default_settings)]
if s not in self._deleted
)
def is_overridden(self, setting):
deleted = setting in self._deleted
set_locally = setting in self.__dict__
set_on_default = getattr(
self.default_settings, "is_overridden", lambda s: False
)(setting)
return deleted or set_locally or set_on_default
def __repr__(self):
return "<%(cls)s>" % {
"cls": self.__class__.__name__,
}
settings = LazySettings()
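# Editor's note: this lazy singleton is what `from django.conf import settings`
# exposes; the settings module is only imported on first attribute access via
# _setup(), which is why DJANGO_SETTINGS_MODULE may be set late in startup.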

Some files were not shown because too many files have changed in this diff.