The Natural Language Toolkit (NLTK) is a widely used Python library for natural language processing. A path traversal vulnerability exists in NLTK versions up to and including 3.9.2, allowing attackers to read arbitrary files from the filesystem.
The root cause lies in the `nltk.data.load()` and `nltk.data.find()` functions, which resolve user-supplied resource names to filesystem paths. These functions perform safety checks on the raw, undecoded resource string using a regular expression (`_UNSAFE_NO_PROTOCOL_RE`) designed to block dangerous patterns like `../` and leading slashes. However, the path is subsequently normalized using `url2pathname()`, which decodes percent-encoded sequences (e.g., `%2e%2e` to `..`).
This creates a classic “decode-after-check” flaw, often likened to TOCTOU (Time-of-Check to Time-of-Use): the string that is validated is not the string that is later used. An attacker can supply a payload like `%2e%2e/etc/passwd`. The regex check sees the literal string `%2e%2e/etc/passwd`, which does not match the prohibited patterns, so it passes. The `url2pathname()` function then decodes this to `../etc/passwd`, which resolves to a location one level above the intended data directory; repeating the encoded segment climbs further up the tree. This allows an attacker to read any file the process has permissions to access, including sensitive system files, credentials, and application secrets. The vulnerability is present in multiple `CorpusReader` classes and is particularly critical in applications like machine learning APIs that process user-controlled file paths.
DailyCVE Form
Platform: NLTK (Python library)
Version: <= 3.9.2
Vulnerability: Path Traversal (CWE-22)
Severity: 7.5 (High)
Date: March 4, 2026
Prediction: Fixed in NLTK 3.9.3
What Undercode Say: Analytics
The vulnerability stems from a logical error where path validation is performed before URL decoding.
    # Vulnerable code in nltk/data.py
    # The regex check operates on the raw, encoded string.
    if _UNSAFE_NO_PROTOCOL_RE.search(resource_name):
        raise ValueError("...")
    # Later, url2pathname() decodes the string.
    p = os.path.join(path_, url2pathname(resource_name))
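The ordering problem in the excerpt above can be reproduced without NLTK. The following is a minimal sketch, not the library's actual code: `UNSAFE_RE` is a hypothetical stand-in for `_UNSAFE_NO_PROTOCOL_RE`, the two function names are made up for illustration, and `urllib.parse.unquote` stands in for the decoding that `url2pathname()` performs.

```python
import posixpath
import re
from urllib.parse import unquote

# Hypothetical stand-in for the library's check; NOT the actual NLTK regex.
# It rejects ".." path segments and a leading slash.
UNSAFE_RE = re.compile(r"(^|/)\.\.(/|$)|^/")


def resolve_check_then_decode(base, resource_name):
    """Flawed order: validate the raw string, then decode it."""
    if UNSAFE_RE.search(resource_name):
        raise ValueError("unsafe resource name")
    return posixpath.normpath(posixpath.join(base, unquote(resource_name)))


def resolve_decode_then_check(base, resource_name):
    """Safer order: decode first, then validate the string that is used."""
    decoded = unquote(resource_name)
    if UNSAFE_RE.search(decoded):
        raise ValueError("unsafe resource name")
    return posixpath.normpath(posixpath.join(base, decoded))


base = "/home/user/nltk_data"
# The encoded payload slips past the check and escapes the base directory.
print(resolve_check_then_decode(base, "%2e%2e/secret"))  # /home/user/secret
```

Calling `resolve_decode_then_check(base, "%2e%2e/secret")` raises `ValueError` instead, because the check runs on the decoded `../secret`.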
This flaw allows attackers to bypass the security regex. The following Python script demonstrates the vulnerability:
    import nltk.data
    # Set the NLTK data path
    nltk.data.path = ["/home/user/nltk_data"]
    # Payload with URL-encoded path traversal
    data = nltk.data.load("%2e%2e/SECRET_credentials.txt", format="raw")
    print(data)
    # Output: b'AWS_SECRET_KEY=AKIAIOSFODNN7EXAMPLE\nDATABASE_PASS=hunter2\n'
Multiple encoded variants can bypass the check and decode to the same traversal sequence:
| Payload | After `url2pathname()` |
| :--- | :--- |
| `%2e%2e/secret` | `../secret` |
| `.%2e/secret` | `../secret` |
| `%2e./secret` | `../secret` |
| `%2E%2E/secret` | `../secret` |
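The table can be checked directly against the standard library. This short sketch assumes a POSIX system, where `urllib.request.url2pathname()` amounts to percent-decoding; on Windows the function also rewrites path separators, so the decoded form would differ.

```python
from urllib.request import url2pathname

# The four encoded variants listed in the table above.
payloads = ["%2e%2e/secret", ".%2e/secret", "%2e./secret", "%2E%2E/secret"]

# None of the raw payloads contains a literal ".." for a regex to catch.
raw_has_dotdot = [".." in p for p in payloads]

# Decode each payload the way the vulnerable code path does.
decoded = {p: url2pathname(p) for p in payloads}

for raw, path in decoded.items():
    print(f"{raw:<16} -> {path}")
```

Every variant decodes to the same `../secret`, which is why a check on the raw string cannot enumerate its way out of the problem.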
How to Exploit
An attacker who can control the resource name passed to `nltk.data.load()` or `nltk.data.find()` can exploit this vulnerability. Common attack vectors include:
Web Applications: Applications that allow users to specify a corpus or model file name.
Hosted Notebook Services: Environments like JupyterHub where users can execute arbitrary code.
ML Pipelines: Systems that process untrusted input for model training or evaluation.
CI/CD Systems: Pipelines that use NLTK to process data from external sources.
By supplying a payload such as `nltk:%2fetc%2fpasswd`, an attacker can read the contents of `/etc/passwd`. More critically, they can read `/proc/self/environ` to leak environment variables containing API keys, database credentials, and cloud secrets.
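Why an encoded leading slash is enough can be seen from how POSIX path joining treats absolute components. The sketch below covers only the decoding and joining steps; how NLTK strips the `nltk:` prefix is not shown, and the `base` directory is an assumed example value.

```python
import posixpath
from urllib.parse import unquote

base = "/home/user/nltk_data"  # assumed example data directory

# The part of the payload after the "nltk:" prefix, still percent-encoded.
encoded = "%2fetc%2fpasswd"

decoded = unquote(encoded)
# An absolute second component makes join() discard the base entirely.
joined = posixpath.join(base, decoded)

print(decoded)  # /etc/passwd
print(joined)   # /etc/passwd
```

Because the raw payload does not start with a literal `/`, a leading-slash check on the undecoded string does not trigger.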
Protection
Upgrade NLTK: The primary mitigation is to upgrade to NLTK version 3.9.3 or later. This version includes a fix that performs path validation after URL decoding.
Input Validation: As a defense-in-depth measure, applications should never pass untrusted user input directly to `nltk.data.load()` or `nltk.data.find()`. Implement strict allowlists for resource names.
Sandboxing: Run NLTK in a restricted environment, such as a container with minimal filesystem access, to limit the impact of a successful exploit.
Principle of Least Privilege: Ensure the NLTK process runs with the minimum necessary filesystem permissions.
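For applications that cannot upgrade immediately, the allowlist and containment ideas above might be combined in a small wrapper in front of any NLTK call. This is a sketch under stated assumptions, not part of NLTK: `ALLOWED_RESOURCES`, `is_within()`, and `safe_resource_path()` are hypothetical names, and the allowlist entries are placeholders.

```python
import os
from urllib.parse import unquote

# Hypothetical allowlist; replace with the resources the application needs.
ALLOWED_RESOURCES = frozenset({"corpora/stopwords", "tokenizers/punkt"})


def is_within(root, candidate):
    """Return True if candidate resolves to a path inside root."""
    root = os.path.realpath(root)
    candidate = os.path.realpath(candidate)
    return os.path.commonpath([root, candidate]) == root


def safe_resource_path(data_dir, requested):
    """Validate an untrusted resource name before it reaches NLTK."""
    # Layer 1: exact-match allowlist on the raw value, so encoded
    # variants are rejected outright.
    if requested not in ALLOWED_RESOURCES:
        raise ValueError(f"resource not allowed: {requested!r}")
    # Layer 2: decode, resolve, and confirm the result stays in data_dir.
    candidate = os.path.join(data_dir, unquote(requested))
    if not is_within(data_dir, candidate):
        raise ValueError("resolved path escapes the data directory")
    return os.path.realpath(candidate)


if __name__ == "__main__":
    print(safe_resource_path("/home/user/nltk_data", "corpora/stopwords"))
```

The allowlist does most of the work here; the containment check is a second layer that still holds if the allowlist is later loosened to a pattern match.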
Impact
Arbitrary File Read: Attackers can read any file the NLTK process has access to, including:
`/etc/passwd`, `/etc/shadow`
`/proc/self/environ` (leaking environment variables)
Application source code and configuration files
SSH private keys and cloud metadata
Data Breach: Exposure of sensitive information can lead to further compromise, such as account takeover or lateral movement within a network.
Credential Theft: Leaked API keys and database credentials can be used to access external services and internal databases.
Compliance Violations: Unauthorized access to sensitive data can result in violations of regulations like GDPR or HIPAA.
Sources:
Reported By: github.com

