Cat and Mouse
Background
I have a Xiaohongshu link parser that extracts titles, authors, images, comments, and statistics from notes. One day, someone reported that a shared link failed to parse:
http://xhslink.com/o/4IVKoZeuj0O
Requesting with aiohttp and following redirects eventually landed on /404?errorCode=-510001. Opening the same link in a browser worked perfectly fine.
What happened?
Short Link Redirects and xsec_token
Following the redirect chain with curl:
xhslink.com/o/4IVKoZeuj0O
→ 302 → xiaohongshu.com/discovery/item/6955f790000000001f0042e2?xsec_token=...&type=normal
→ 302 → /404?errorCode=-510001
The short link first 302 redirects to the note page, and then the note page 302 redirects to 404. The key parameter is xsec_token in the URL, which is Xiaohongshu's anti-hotlinking token. The server verifies whether it matches the current session.
Comparing the requests in the browser's packet capture, the second request had an extra batch of Cookies:
a1, webId, websectiga, sec_poison_id, gid, web_session, acw_tc, abRequestId
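None of these cookies arrive via Set-Cookie on the redirect chain itself. They are either generated by JavaScript in the page or issued by security endpoints that the page's JavaScript calls, so a plain HTTP client that only follows redirects never gets them.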
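The redirect target already carries the pieces that later steps rely on (the note id, xsec_token, and type). As a minimal sketch, they can be pulled out with the standard library; the function name parse_note_url, the returned keys, and the token value in the usage lines are illustrative assumptions, not names from the parser itself.

```python
from urllib.parse import urlparse, parse_qs


def parse_note_url(url: str) -> dict:
    """Split a note-page URL into note id, xsec_token, type, and a blocked flag."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    # The note id is the last path segment of /discovery/item/<id> or /explore/<id>.
    note_id = parsed.path.rstrip("/").rsplit("/", 1)[-1]
    return {
        "note_id": note_id,
        "xsec_token": query.get("xsec_token", [""])[0],  # parse_qs percent-decodes the value
        "type": query.get("type", [""])[0],
        # The failure case from the redirect chain: /404?errorCode=...
        "blocked": parsed.path.startswith("/404") or "errorCode" in query,
    }


# Hypothetical token value, for illustration only.
info = parse_note_url(
    "https://www.xiaohongshu.com/discovery/item/6955f790000000001f0042e2"
    "?xsec_token=ABtoken%3D&type=normal"
)
print(info["note_id"], info["type"], info["blocked"])
```

Keeping the blocked check next to the parsing means the same helper can classify both the intermediate hop and the final /404 landing page.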
Two Sources of Identity Materials
To reproduce Route 2 and Route 3, one thing has to be clear first: cookies are not all the same kind of thing. Some are generated locally, and some are issued by the server.
| Type | Cookie | How to obtain |
|---|---|---|
| Locally generated | a1 / webId / abRequestId | Calculated by Python code using a fixed algorithm, completely independent of the network |
| Locally generated | loadts / webBuild / xsecappid | Local timestamp, version number, and application ID, written directly into the cookie jar |
| Server issued | websectiga / sec_poison_id | POST /api/sec/v1/scripting, the server issues a piece of JS, decoded locally by a fixed offset |
| Server issued | gid / acw_tc | POST /api/sec/v1/shield/webprofile, body contains encrypted fingerprint, server Set-Cookie |
| Server issued | web_session | POST /api/sns/web/v1/login/activate, issued by the server, prefix 03 / 04 |
Route 1 does not touch either of these two categories (so it's called zero session cost); Route 2 needs to finish all 9 steps and get all cookies, but does not call the signature API; Route 3 requires calculating 5 signature headers yourself in addition to all the cookies.
Route 1: Mobile UA + CDN Domain Replacement
This is the simplest path. The desktop requires JS to generate Cookies to pass verification, so let's change our thinking and disguise as a mobile phone. Directly follow the 302 with aiohttp, the final page is HTTP 200, and the HTML can be obtained without any Cookies.
MOBILE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) "
    "Version/17.0 Mobile/15E148 Safari/604.1"
)
async with aiohttp.ClientSession(
    timeout=aiohttp.ClientTimeout(total=30),
    headers={
        "User-Agent": MOBILE_UA,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.9",
    },
) as s:
    async with s.get(share_url, allow_redirects=True) as resp:
        final_url = str(resp.url)
        html = await resp.text(errors="replace")
if "/404" in final_url or "errorCode=" in final_url:
    raise RuntimeError("The note may have been deleted or hit risk control")
It makes sense when you think about it. The WebView in the App may not be able to run the complete anti-scraping JS. The mobile share link (app_platform=ios) takes a more lenient path and does not verify xsec-related Cookies.
The cost is that the image URLs have watermarks, looking like http://sns-webpic-qc.xhscdn.com/202604070308/<sign>/<image_id>!h5_1080jpg. The !h5_1080jpg at the end is a processing instruction at the CDN level, and the CDN synthesizes the watermark when outputting the image; removing the suffix will result in a 403 (the signature path does not match). The watermark is part of the image processing pipeline, not added by the client.
Fortunately, Xiaohongshu has another CDN domain sns-img-qc.xhscdn.com (and ci.xiaohongshu.com), which directly outputs the original image based on the bare image_id, without signature or watermark. The implementation of changing the domain to remove the watermark:
from urllib.parse import urlparse

def cdn_strip_watermark(url: str) -> str:
    clean = url.split("!")[0] if "!" in url else url  # Remove suffixes like !h5_1080jpg
    path = urlparse(clean).path
    parts = path.strip("/").split("/")
    if len(parts) >= 3:
        image_path = "/".join(parts[2:])  # Skip DATE and SIGN, keep only image_id
        return f"https://sns-img-qc.xhscdn.com/{image_path}"
    return url  # Unexpected path shape: leave the URL untouched
Extracting Fields from HTML
The HTML of the mobile share page does have the window.__INITIAL_STATE__ variable, but the note.noteDetailMap inside is an empty object—the mobile data injection method is different. Note fields are scattered in the HTML text in the form of JSON fragments (SSR pre-rendered script blocks, or template serialization products). Key-value pairs like "nickname":"..." / "desc":"..." / "title":"..." are readable in plaintext in the text, and can be scanned directly with regex.
Define a tier for each field, match them one by one according to priority, and use the first non-empty, non-placeholder value:
def _first(patterns, html: str) -> str:
    for p in patterns:
        for m in re.finditer(p, html, re.I | re.DOTALL):
            val = m.group(1).strip() if m.group(1) else ""
            if val and val not in ("小红书", "小红书 - 你的生活指南"):
                return val
    return ""
The title tier needs a placeholder filter: the <title> tag is often just the site name, such as "小红书" or "小红书 - 你的生活指南". When a pattern yields one of these, skip to the next pattern. Otherwise the fallback returns the site name as the title, and the user thinks parsing succeeded when nothing was actually extracted:
title = _first([
    r'<meta\s+property="og:title"\s+content="([^"]+)"',
    r'"title":"([^"]+)"',
    r"<title[^>]*>(.*?)</title>",
], html) or "小红书内容"
author_name = _first([r'"nickname":"([^"]+)"', r'"nickName":"([^"]+)"'], html)
author_id = _first([r'"userId":"([^"]+)"', r'"user_id":"([^"]+)"'], html)
content_raw = _first([r'"desc":"([^"]+)"', r'"content":"([^"]+)"', r'"text":"([^"]+)"'], html)
stats = {
    "liked": _first([r'"likedCount":"?(\d+)"?'], html),
    "comment": _first([r'"commentCount":"?(\d+)"?'], html),
    "collect": _first([r'"collectedCount":"?(\d+)"?'], html),
    "share": _first([r'"shareCount":"?(\d+)"?'], html),
}
pt_ms = _first([r'"time":(\d{13})'], html)  # Millisecond timestamp
The captured content is a JS-escaped string and needs to be unescaped: \u002F → /, \u0026 → &, \u003D → =, \u003F → ?, \u003A → :, plus \n, \t, \", and so on. If the unescaping is incomplete, URL-type fields (http:\u002F\u002F...) are unusable:
def _unescape_js_string(s: str) -> str:
    return (s.replace(r"\u002F", "/")
             .replace(r"\u0026", "&")
             .replace(r"\u003D", "=")
             .replace(r"\u003F", "?")
             .replace(r"\u003A", ":")
             .replace(r"\n", "\n").replace(r"\t", "\t")
             .replace(r"\"", '"'))
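The replacement chain above only covers the escapes it lists. As a hedged alternative sketch, assuming the captured text really is the body of a JSON string literal, the standard json module can decode every \uXXXX sequence (including escaped Chinese) in one step. The name unescape_via_json is hypothetical, and the fallback simply returns the input when the fragment is not valid JSON, for example when the regex cut it off at an escaped quote.

```python
import json


def unescape_via_json(s: str) -> str:
    """Decode the body of a JSON string literal (the text between the quotes)."""
    try:
        return json.loads(f'"{s}"')
    except json.JSONDecodeError:
        # Truncated fragment or non-JSON escape: hand the raw text back unchanged.
        return s


print(unescape_via_json(r"http:\u002F\u002Fexample.com\u002Fa?x=1\u0026y=2"))
```

One reason to keep the manual chain as well is that it never raises and tolerates fragments that are only partly valid.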
There is a cross-route difference in the statistics fields to note: in the mobile HTML, it is an integer string ("likedCount":"857"), while the API /v1/feed returns a readable format ("liked_count":"7.1万"). The semantics of the same field in the two routes are different. If the downstream needs to build a UI, it has to align them itself.
Topics: Must use tagList, cannot use desc
There is a common pitfall in topic extraction: trying to extract from the #topic_name[话题]# pattern in the desc field. It looks like it can match, but the results are often fragmented. The reason is that desc has been JSON escaped, and the Chinese in the topic name is serialized via \uXXXX, making the regex boundary judgment easy to cut wrong. The correct approach is to use the tagList array, locate first, then findall:
topics = []
m = re.search(r'"tagList":\s*\[(.{0,5000}?)\]', html, re.DOTALL)
if m:
    topics = re.findall(r'"name":"([^"]+)"', m.group(1))
if not topics:  # Fallback: scan HTML for #xxx[话题]
    topics = re.findall(r"#([^\s#\[]+)\[话题\]", html)
topics = list(dict.fromkeys(topics))[:20]  # Deduplicate and preserve order, max 20
tagList is a structured source without escape interference. The name field of each topic is readable as-is, allowing you to grab them all at once.
Images: Anchor the onix-carousel-item DOM
The mobile share page uses the structure <div class="onix-carousel-item"><img src="..."></div> to render the image carousel. This class name is distinctive enough that false matches are unlikely. A regex over src extracts all the images, and each URL is then passed through cdn_strip_watermark to switch the domain:
carousel = re.findall(
    r'class="onix-carousel-item"[^>]*>.*?<img[^>]*src=["\']([^"\'\s]+)["\']',
    html, re.DOTALL,
)
images = [
    {"index": i, "id": image_id_from_url(u), "url": u, "raw_url": cdn_strip_watermark(u)}
    for i, u in enumerate(carousel, 1)
]
Videos and Live Photos: Scan masterUrl + Filter Watermark Variants
Video URLs are scattered and scanned in three patterns, and each must filter out the watermarked _259.mp4 variant:
video_urls = []
for pat in [
    r'"masterUrl":"([^"]+)"',
    r'"master_url":"([^"]+)"',
    r'"url":"(https://v\.xhscdn\.com[^"]+)"',
]:
    for m in re.finditer(pat, html):
        v = _unescape_js_string(m.group(1))
        if "_259.mp4" in v:  # Watermarked video variant, skip
            continue
        if v not in video_urls:
            video_urls.append(v)
Other suffixes (_adapt_720p.mp4 / master URL, etc.) have no watermark by default.
Content Type Determination
content_type has three values: image / video / live_photo. Determination order:
type_param = parse_qs(urlparse(final_url).query).get("type", [""])[0]
if type_param == "video" and video_urls:
    content_type = "video"
    videos = video_urls[:1]  # One main video is enough
elif video_urls:
    # Videos on a non-video note: most likely a live photo, where each static image
    # is paired with a motion video (video count close to image count, e.g. both are 3)
    content_type = "live_photo"
    live_photos = [
        {"index": i, "image_url": images[i-1]["raw_url"], "video_url": video_urls[i-1]}
        for i in range(1, min(len(images), len(video_urls)) + 1)
    ]
else:
    content_type = "image"
When downloading live photo notes downstream, each pair is saved as live_01_still.jpg + live_01_motion.mp4. After packaging, the user can restore the live photo effect in the mobile photo album.
Actual Test
Image Note: http://xhslink.com/o/4IVKoZeuj0O
note_id : 6955f790000000001f0042e2
title : 𝐰𝐞𝐜𝐡𝐚𝐭|情侣头像
author : zhang / 5cd3f6730000000012033a83
content_type : image
images : 12 images (all changed CDN domain to get unwatermarked original images)
stats : likes=857 comments=22 collects=290 shares=200
topics : ['今天你换头像了吗', '情侣头像', 'cp', '头像分享', '今日头像分享',
'可爱小猫', '猫猫是世界上最可爱的生物', '每日分享', '小动物头像', '头像']
publish : 2026-01-01T12:26:56
Video Note: http://xhslink.com/o/Ap3mwS5Q0UD
note_id : 69e3114b000000002202916e
title : 仲夏可可很萌!
author : 用眼泪把你复习一遍 / 6690bced000000000f0348e9
content_type : video
videos : 1 master URL
stats : likes=1209 comments=154 collects=160 shares=42
topics : ['仲夏可可', '莓喵jk']
publish : 2026-04-18T13:06:19
Live Photo Note: http://xhslink.com/o/LRYdx90zeV
note_id : 69e1594b000000000b010eaf
title : 🇫🇷尼斯老城遇到杨超越董思成
author : 喵了个汪 / 6161f5460000000002022ced
content_type : live_photo
images : 3 static images + paired motion videos one by one
stats : likes=7 comments=3419 collects=5392 shares=5024
topics : ['偶遇明星', '偶遇', '杨超越', '董思成', '法国', '尼斯', '尼斯老城区']
publish : 2026-04-17T05:48:59
All three content types can be extracted, and the main data is basically complete. The limitations of this route are also clear:
- The comment body cannot be obtained. Comments are not in the share page HTML. You have to separately request /api/sns/web/v2/comment/page, and that API goes back to the world requiring full signatures.
- Statistical fields are integers rather than readable formats. The live photo above actually has 79k likes, but this route captures the integer 7: in the mobile HTML, the like count exists as scattered fragments truncated by CDN/SSR, lacking precision.
Route 2: PC Web Session plus INITIAL_STATE in HTML
Xiaohongshu's PC share page is server-side rendered, and the data is directly embedded in the HTML:
<script>
window.__INITIAL_STATE__ = { "note": { "noteDetailMap": { "<note_id>": { "note": { ... } } } }, ... }
</script>
Inside is an almost complete note JSON—title, body, image list (with infoList multi-resolution variants), author, interaction data, topic tags, and video stream information are all available, and the image URLs are unwatermarked original CDN links.
There is no need to call /v1/feed, so JSVMP signatures are not needed. But there is a cost: you must first be able to open this share page in a "browser-like" state. Directly hitting it with aiohttp plus a UA will redirect to /login or 404, so you have to run through the entire browser initialization process.
Cookie Generation Chain: 9-step bootstrap
Running the complete session initialization takes these 9 steps, and the cookie produced in each step will be relied upon by the next step:
1. GET / Load homepage
2. GET /api/sec/v1/ds?appId=xhs-pc-web Pre-pull JSVMP decryption script
3. POST /api/redcaptcha/v2/getconfig Captcha config
4. POST /api/sec/v1/scripting type=ds scripting channel warmup
5. POST /api/sec/v1/sbtsource Report sbt source
6. POST /api/sec/v1/scripting callback=seccallback Issue websectiga / sec_poison_id
7. POST /api/sec/v1/shield/webprofile Report fingerprint → Issue gid
8. POST /api/sns/web/v1/login/activate Guest activation → Issue web_session
9. runtime bootstrap: user/me, system/config, zones,
homefeed/category, global/config,
racing_get, racing_report
Miss one step, and some subsequent API will crash. The generation methods of several key cookies inside:
a1: This is the seed of the entire identity, generated entirely locally. Timestamp hex + 30 random characters + platform code + CRC32 checksum, truncated to the first 52 characters:
def gen_a1():
    hex_data = hex(int(time.time() * 1000))[2:]
    random_30 = ''.join(random.choices(
        "abcdefghijklmnopqrstuvwxyz1234567890", k=30))
    # GET_PLAT_FROM_CODE = 5 (Windows takes the other branch in the frontend getPlatformCode and returns 5)
    text = hex_data + random_30 + "5" + "0" + "000"
    crc32 = crc32_encode(text)
    return (text + str(crc32))[:52]  # 52 bytes fixed length
webId: MD5(a1), the device identifier bound to a1.
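As a minimal sketch of the webId derivation just described: the text only says MD5(a1), so the lowercase hex digest form and the helper names gen_web_id and seed_cookies are assumptions here.

```python
import hashlib


def gen_web_id(a1: str) -> str:
    """webId as described above: the MD5 of the a1 cookie value (hex form assumed)."""
    return hashlib.md5(a1.encode("utf-8")).hexdigest()


def seed_cookies(a1: str) -> dict[str, str]:
    """The locally generated identity pair that needs no network round trip."""
    return {"a1": a1, "webId": gen_web_id(a1)}


# Placeholder a1 of the right length; a real value would come from gen_a1().
cookies = seed_cookies("a" * 52)
print(len(cookies["webId"]))
```

Because webId is a pure function of a1, regenerating a1 always means regenerating webId with it.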
websectiga and sec_poison_id: Step 6 (POST /api/sec/v1/scripting with callback=seccallback) returns a JS string whose payload looks like ({"b":"<base64>","d":[...]}). The server expects you to run the VM in the browser to decrypt the 64-character key, but we decrypt it statically:
def gen_websectiga(js_text: str) -> str:
    b = re.search(r'"b":"(.*?)",', js_text).group(1)
    d = json.loads(re.search(r'"d":(.*?)\}\)', js_text).group(1))
    # 1. Base64 decode b, split into a list of 5 characters per group, each character value takes ord(c) - 1
    padding = len(b) % 4
    if padding:
        b += '=' * (4 - padding)
    decoded = base64.b64decode(b).decode('utf-8')
    decode_list = []
    chunk = []
    for c in decoded:
        if len(chunk) == 5:
            decode_list.append(chunk)
            chunk = []
        chunk.append(ord(c) - 1)
    if chunk:
        decode_list.append(chunk)
    # 2. Slice by d[92]:d[93]+1, then do a secondary table lookup by fixed offset to get 64 integers
    target = decode_list[d[92]:d[93]+1]
    key = [d[target[675 + i][2]] for i in range(0, 128, 2)]
    # 3. Concatenate 64 characters using a double loop of for i in range(56, -1, -8) for j in range(8)
    return "".join(chr(key[i + j]) for i in range(56, -1, -8) for j in range(8))
That string of offsets (92 / 93 / 675 / 56 / -1 / -8 / 8) are all magic numbers extracted from the JSVMP bytecode, which will be fine-tuned with versions. sec_poison_id is taken directly from another field in the same response.
gid and acw_tc: Serialize the 80+ field browser fingerprint (UA, screen, WebGL, Canvas hash, etc.) → base64 → DES-ECB encryption (key zbp30y86, zero-padded to 8-byte blocks) → hex. POST as profileData to webprofile, and the server returns these two cookies via Set-Cookie in the response:
def encrypt_profile_data(fp: dict) -> str:
    fp_json = json.dumps(fp, separators=(',', ':'), ensure_ascii=False)
    fp_b64 = base64.b64encode(fp_json.encode())
    cipher = DES.new(b"zbp30y86", DES.MODE_ECB)
    # Zero-pad to a multiple of 8 bytes (as written, an already aligned input gets a full extra block of zeros)
    pad_len = 8 - len(fp_b64) % 8
    padded = fp_b64 + b'\x00' * pad_len
    return cipher.encrypt(padded).hex()
web_session: The last step, guest activation POST /api/sns/web/v1/login/activate with an empty body, issued by the server. There are two types of prefixes: starting with 03 is the device-level guest state, which can be obtained by an empty body POST; starting with 04 is the real logged-in state, which can only be obtained by entering activate with the session cookie of a logged-in browser. 03 is sufficient for public data like /v1/feed and share page HTML. What really needs 04 are APIs bound to real user relationships like the following feed and private messages, which have nothing to do with note parsing.
In addition, several auxiliary cookies will be written along the way: loadts (timestamp for signature, updated on every encrypted request), webBuild (equals ARTIFACT_VERSION), xsecappid (equals xhs-pc-web), abRequestId (a UUID). Missing any of these will cause the server to treat it as an abnormal client.
Besides cookies, headers must also be strictly aligned. The Chrome version number in the UA (e.g., Chrome/147) must be consistent with the Chromium version number in sec-ch-ua, and sec-ch-ua-platform and sec-ch-ua-mobile must also be fully provided. If even one item does not match, the signature API will return 461.
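One way to keep these headers from drifting apart is to derive them all from a single Chrome major version. The sketch below rests on assumptions: the exact user-agent string, the brand list in sec-ch-ua (especially the "Not?A_Brand" entry, which real Chrome varies between releases), and the helper names are illustrative and should be copied from a real browser capture in practice.

```python
import re


def browser_headers(chrome_major: int) -> dict[str, str]:
    """Build UA and client-hint headers that agree on one Chrome major version."""
    ua = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        f"(KHTML, like Gecko) Chrome/{chrome_major}.0.0.0 Safari/537.36"
    )
    sec_ch_ua = (
        f'"Chromium";v="{chrome_major}", '
        f'"Google Chrome";v="{chrome_major}", '
        '"Not?A_Brand";v="99"'  # assumed placeholder brand entry
    )
    return {
        "user-agent": ua,
        "sec-ch-ua": sec_ch_ua,
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"Windows"',
    }


def versions_consistent(headers: dict[str, str]) -> bool:
    """Check that the UA's Chrome version equals the Chromium version in sec-ch-ua."""
    ua = re.search(r"Chrome/(\d+)", headers.get("user-agent", ""))
    hint = re.search(r'"Chromium";v="(\d+)"', headers.get("sec-ch-ua", ""))
    return bool(ua and hint and ua.group(1) == hint.group(1))


print(versions_consistent(browser_headers(147)))
```

Running the consistency check once at session start is cheap compared with debugging a 461 later.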
Version Synchronization: Don't hardcode ARTIFACT_VERSION
If ARTIFACT_VERSION is hardcoded, it will break sooner or later—in over a year, the online version climbed all the way from 4.83.1 to 6.7.0 (LANGUAGE_VERSION changed from 4.2.6 to 4.3.5), roughly one major version per quarter. The most typical symptom of falling behind in version is directly getting 471 verifyType=290 at the shield/webprofile stage:
{"msg": "The current version is too low, please refresh the page or close and reopen the page", "code": 300042}
A safe approach is to pull https://www.xiaohongshu.com/ at startup, find <script src="...vendor-dynamic.xxx.js"> from the returned HTML, download it, and use regex to extract the version number:
html = requests.get("https://www.xiaohongshu.com/").text
m = re.search(r'/vendor-dynamic\.([a-f0-9]+)\.js', html)
js = requests.get(f"https://static-resource.xhscdn.com/.../vendor-dynamic.{m.group(1)}.js").text
artifact_version = re.search(r'artifactVersion.*?(\d+\.\d+\.\d+)', js).group(1)
language_version = re.search(r'languageVersion.*?(\d+\.\d+\.\d+)', js).group(1)
Use the same method to extract sdkVersion and appId. If they cannot be extracted, fall back to local configuration, but keep that configuration up to date; don't leave values from two years ago in it.
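The extract-or-fall-back behavior can be sketched as a small pure function. This is an illustration under assumptions: extract_versions and FALLBACK_VERSIONS are hypothetical names, the sample JS text in the usage line is made up, and the gap between key and value is bounded to 40 characters (unlike the open-ended .*? above) so that a missing value does not match an unrelated version string further down the bundle.

```python
import re

# Local fallback values; these must be refreshed regularly, as noted above.
FALLBACK_VERSIONS = {"artifactVersion": "6.7.0", "languageVersion": "4.3.5"}


def extract_versions(js_text: str, fallback: dict[str, str] = FALLBACK_VERSIONS) -> dict[str, str]:
    """Return each version found in the bundle text, or the local fallback for it."""
    versions = {}
    for key, default in fallback.items():
        m = re.search(rf"{key}.{{0,40}}?(\d+\.\d+\.\d+)", js_text, re.DOTALL)
        versions[key] = m.group(1) if m else default
    return versions


# Made-up bundle fragment, for illustration only.
print(extract_versions('var c={artifactVersion:"6.8.1",languageVersion:"4.3.6"}'))
```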
Opening the Share Page
After the Session is built, try two URLs in order:
candidates = [
    f"https://www.xiaohongshu.com/discovery/item/{note_id}"
    f"?app_platform=ios&app_version=9.22.1&share_from_user_hidden=true"
    f"&xsec_source=app_share&type=normal&xsec_token={quote(xsec_token, safe='')}",
    f"https://www.xiaohongshu.com/explore/{note_id}"
    f"?xsec_token={quote(xsec_token, safe='')}&xsec_source=pc_feed",
]
The first simulates the origin of an App share, and the second simulates clicking in from the PC feed. The server will fill noteDetailMap[<note_id>].note directly into the HTML and return it. The request headers must imitate a document-level navigation for the server to treat it as a real browser:
doc_headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/avif,image/webp,image/apng,*/*;q=0.8",
    "referer": "https://www.xiaohongshu.com/",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "same-origin",
    "upgrade-insecure-requests": "1",
}
xsec_token must be fully escaped with quote(token, safe='') before being embedded in the URL, otherwise the = at the end will be treated as a URL query separator.
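A short check of what safe='' changes, using a made-up token (real tokens are opaque; this one only exists to contain +, / and a trailing =). build_explore_url is a hypothetical helper that mirrors the second candidate URL.

```python
from urllib.parse import quote

token = "ABcd+ef/gh="  # hypothetical token value

# Default quote() treats "/" as safe and leaves it unescaped.
print(quote(token))           # ABcd%2Bef/gh%3D
# safe="" escapes every reserved character, including "/".
print(quote(token, safe=""))  # ABcd%2Bef%2Fgh%3D


def build_explore_url(note_id: str, xsec_token: str) -> str:
    """Second candidate URL from above, with the token fully escaped."""
    return (
        f"https://www.xiaohongshu.com/explore/{note_id}"
        f"?xsec_token={quote(xsec_token, safe='')}&xsec_source=pc_feed"
    )
```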
Extracting INITIAL_STATE from HTML
After a certain update, the build of Xiaohongshu's PC frontend changed the assignment to a tight syntax window.__INITIAL_STATE__={...} (it used to be window.__INITIAL_STATE__ = {...}, with a space on each side). The old code html.find("window.__INITIAL_STATE__ = ") directly returns -1, the entire HTML parsing dies at the first step, and the whole system silently degrades to the <meta og:*> fallback—no errors on the surface, but actually only the title and cover remain.
A safe way to write it is to use regex to match the assignment symbol, and then start from the first { to do a brace balance scan to accurately intercept the entire JSON object, while avoiding brackets in string literals:
def extract_assigned_json_object(html: str, var_name: str) -> str:
    m = re.search(rf"{re.escape(var_name)}\s*=\s*", html, flags=re.IGNORECASE)
    if not m:
        return ""
    # Skip whitespace after the equals sign, locate the first '{'
    i = m.end()
    while i < len(html) and html[i].isspace():
        i += 1
    if i >= len(html) or html[i] != "{":
        return ""
    depth, in_string, quote_char, escaped = 0, False, "", False
    start = i
    for cursor in range(i, len(html)):
        ch = html[cursor]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == quote_char:
                in_string = False
            continue
        if ch in ('"', "'"):
            in_string, quote_char = True, ch
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return html[start : cursor + 1]
    return ""
The extracted text is not valid JSON yet, because the Xiaohongshu frontend leaves undefined literals in it:
"someField": undefined,
"someArray": [undefined]
json.loads will explode directly, so do two rounds of replacement first:
def sanitize_initial_state(raw: str) -> str:
    s = re.sub(r":\s*undefined(?=[,}])", ": null", raw)
    s = re.sub(r"\[\s*undefined\s*\]", "[null]", s)
    return s

state = json.loads(sanitize_initial_state(extract_assigned_json_object(html, "window.__INITIAL_STATE__")))
The note body is in state["note"]["noteDetailMap"][<note_id>]["note"]. If the first of the two candidate URLs hits, return directly; if neither hits, noteDetailMap may only have an "undefined" placeholder key left (the note is risk-controlled or deleted). The code specifically leaves a fallback to "find the first non-undefined key":
def choose_note_payload(state: dict, note_id: str) -> dict:
    detail_map = (state.get("note") or {}).get("noteDetailMap") or {}
    if note_id and note_id in detail_map:
        return (detail_map[note_id] or {}).get("note") or {}
    for key, value in detail_map.items():
        if key != "undefined" and isinstance(value, dict):
            return value.get("note") or {}
    return {}
After getting note, the fields are all structured: note["title"], note["desc"], note["user"]["nickname"], note["user"]["userId"], note["interactInfo"]["likedCount"] (readable format "7.9万"), note["tagList"] is [{"id", "name", "type"}, ...], each item in note["imageList"] comes with urlDefault / urlPre / url / infoList (multi-resolution variants) and a livePhoto flag, and note["video"]["media"]["stream"] contains master URLs and backup URLs for h264 / h265 / av1.
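To show how those structured fields might be flattened into one record, here is a rough sketch. The field names come from the description above; the function name summarize_note, the output keys, and the URL preference order (urlDefault, then url, then urlPre) are assumptions, and the sample payload is invented.

```python
def summarize_note(note: dict) -> dict:
    """Flatten the structured note payload into a simple record."""
    user = note.get("user") or {}
    interact = note.get("interactInfo") or {}
    images = []
    for img in note.get("imageList") or []:
        # Assumed preference order among the URL variants.
        url = img.get("urlDefault") or img.get("url") or img.get("urlPre") or ""
        images.append({"url": url, "live_photo": bool(img.get("livePhoto"))})
    return {
        "title": note.get("title", ""),
        "desc": note.get("desc", ""),
        "author": user.get("nickname", ""),
        "author_id": user.get("userId", ""),
        "liked": interact.get("likedCount", ""),  # readable format, e.g. "7.9万"
        "topics": [t["name"] for t in note.get("tagList") or [] if t.get("name")],
        "images": images,
    }


# Invented payload, for illustration only.
sample = {
    "title": "t",
    "user": {"nickname": "n", "userId": "u"},
    "interactInfo": {"likedCount": "7.9万"},
    "tagList": [{"id": "1", "name": "cp", "type": "topic"}],
    "imageList": [{"urlDefault": "https://example.com/1", "livePhoto": True}],
}
print(summarize_note(sample)["topics"])
```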
Supplementing Comments and Auxiliary Data
Comments are not in the HTML and must be requested separately via /api/sns/web/v2/comment/page. Like /api/sns/web/share/code (share code) and /api/sns/web/v2/widgets (widget info), this API goes through the complete signature process, using the same session and the same encryption headers as /v1/feed. In other words, once the PC session is set up cleanly and the signature headers can be computed, fetching auxiliary data such as comments, share codes, and widget info takes little extra work.
Actual Test
Route 2 runs the three test links on the same session. The main fields match Route 1, except for the live photo's like count:
| Field | Image Note | Video Note | Live Photo Note |
|---|---|---|---|
| note_id | 6955f790... | 69e3114b... | 69e1594b... |
| title | 𝐰𝐞𝐜𝐡𝐚𝐭|情侣头像 | 仲夏可可很萌! | 🇫🇷尼斯老城遇到杨超越董思成 |
| content_type | image | video | live_photo |
| images | 12 | — | 3 |
| videos | — | 1 master URL | 3 motion (one paired with each static image) |
| stats | 857 / 22 / 290 / 200 | 1209 / 154 / 160 / 42 | 79000 / 3419 / 5392 / 5024 |
Compared to Route 1, Route 2 has two advantages:
- The image URL defaults to the unwatermarked !nd_dft_wlteh_jpg_3 variant, so you can get clean images without changing the domain.
- Statistics are structured fields (the live photo above is 79000 at the HTML level, not the fragmented 7 in Route 1).
The cost is that you have to run the complete 9-step bootstrap every time, with a cold start of 10–30 seconds; if you want to get the comment body, you have to go through the signature API again.
Route 3: Using /api/sns/web/v1/feed
HTML can get the vast majority of fields, but there are a few scenarios where only the API is clean. For example, complete infoList multi-resolution variants, multiple backup CDN URLs for videos, precise readable format statistics, the stream.h264 structure of live photos, and complete metadata for certain topics. Therefore, a path to directly call the feed API is left at the innermost layer.
The core of getting through this route lies in two things: first, the session and cookie generation chain in Route 2 must be built first (/v1/feed reuses the same session, a1 / web_session / gid are all indispensable); second, each request must carry 5 self-calculated signature headers. Let's talk about them one by one below.
Request Signature Overview
| Header | Construction Method |
|---|---|
| x-s | 124-byte array concatenated from 11 byte segments → bytewise XOR with a fixed 124-byte key → Base58 encoding → mns0101_ prefix → wrapped in an outer layer of custom Base64 → XYS_ prefix |
| x-s-common | 18-field subset of the fingerprint encrypted by ARC4 to get b1 → outer dictionary containing plat_from_code / language_version / artifact_version / cookie_a1 / b1 / MRC checksum → custom Base64 |
| x-b3-traceid | 16 random hex characters |
| x-xray-traceid | (timestamp_ms << 23) \| seq padded to 16 hex characters + 64-bit random number padded to 16 hex characters, concatenated to 32 characters |
| x-t | Current millisecond timestamp |
The last three of these five headers are pure local calculations and can be implemented at a glance. The real highlights are x-s and x-s-common—let's break them down below.
x-s: 11 Segments of Bytes Concatenated, then XOR + Base58
In the browser, this header is calculated by window.mnsv2(). The actual logic is compiled into custom VM bytecode running in the JSVMP virtual machine. The static restoration method is to interpret the bytecode instruction by instruction, restoring the following Python snippet:
def _encrypt_headers_x3(a1, loadts, uri, params=None, data=None):
    # Concatenate query/body after uri to calculate the signature together
    if params:
        uri = f"{uri}?{urlencode(params).replace('%2C', ',')}"
    if data is not None:
        uri = uri + json.dumps(data, separators=(',', ':'))
    md5_url = hashlib.md5(uri.encode()).hexdigest()
    random_num = int(random.random() * 4294967295)
    timestamp = int(time.time() * 1000)
    # 11 segments of byte streams = 4+4+8+8+4+4+4+8+53+11+16 = 124 bytes
    part1 = [119, 104, 96, 41]  # Fixed magic
    part2 = list(random_num.to_bytes(4, 'little'))
    # timestamp 8 bytes LE, byte 0 changed to sum(b[1:5])+sum(b[5:8]) lower 8 bits, all bytes XOR 41
    b = list(timestamp.to_bytes(8, 'little'))
    b[0] = (sum(b[1:5]) & 255) + sum(b[5:8]) & 0xFF
    part3 = [x ^ 41 for x in b]
    part4 = list(loadts.to_bytes(8, 'little'))
    part5 = list((int(random.random() * 99) + 1).to_bytes(4, 'little'))
    part6 = list((1293).to_bytes(4, 'little'))  # Number of window properties
    part7 = list(len(uri.encode()).to_bytes(4, 'little'))
    part8 = [b ^ (random_num & 255) for b in bytes.fromhex(md5_url)][:8]
    part9 = [len(a1)] + list(a1.encode())  # 53
    part10 = [len('xhs-pc-web')] + list(b'xhs-pc-web')  # 11
    part11 = [1, (random_num & 255) ^ 115,
              249, 83, 103, 103, 201, 181, 131, 99, 94, 7, 68, 250, 132, 21]
    raw = part1 + part2 + part3 + part4 + part5 + part6 + part7 \
        + part8 + part9 + part10 + part11
    encrypted = [i ^ j for i, j in zip(raw, XOR_KEY_124)]  # Fixed key
    return "mns0101_" + base58_encode(bytes(encrypted), CUSTOM_BASE58_TABLE)
A few details that are easy to get wrong:
- In part3, the timestamp bytes are self-checked first and then all XORed with 41. If the check fails (for example, the timestamp you casually provided is wrong), the server will directly return 461.
- loadts is not the timestamp of the note API, but the timestamp of "this signature": before signing, first write loadts = str(int(time.time() * 1000)) back to the cookie jar, so that the cookie echoed back by the server always carries the latest value; then concatenate this millisecond value into the array as part4. In other words, it is recalculated for each signature, and is only one or two milliseconds different from x-t (but they are not the same value and cannot be mixed).
- 1293 is the hardcoded value of "Object.getOwnPropertyNames(window).length in Chrome". Chromium version iterations will fine-tune this number, and occasionally you have to update it against actual tests in the online browser.
- XOR_KEY_124 is a 124-byte constant table ([175, 87, 43, 149, ...]), which is dragged out of the JS all at once to use.
- The length of part9 is dynamic: len(a1) + a1 bytes, but a1 is always truncated to 52 bytes, so part9 is always 53 bytes, and the entire array length is stable at 124.
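The 124-byte invariant can be checked with plain arithmetic before any XOR happens. The sketch below only restates the segment sizes listed in the code comments above; SEGMENT_SIZES and segment_offsets are hypothetical names used for illustration.

```python
# Segment sizes in concatenation order, taken from the comments in _encrypt_headers_x3.
SEGMENT_SIZES = {
    "part1": 4,    # fixed magic
    "part2": 4,    # random number
    "part3": 8,    # checked and XORed timestamp
    "part4": 8,    # loadts
    "part5": 4,    # small random value
    "part6": 4,    # window property count
    "part7": 4,    # uri byte length
    "part8": 8,    # first 8 bytes of the XORed MD5
    "part9": 1 + 52,                  # length byte + a1
    "part10": 1 + len("xhs-pc-web"),  # length byte + app id
    "part11": 16,  # fixed tail
}


def segment_offsets(sizes: dict[str, int]) -> dict[str, tuple[int, int]]:
    """Map each segment name to its (start, end) byte range in the raw array."""
    offsets, pos = {}, 0
    for name, size in sizes.items():
        offsets[name] = (pos, pos + size)
        pos += size
    return offsets


offsets = segment_offsets(SEGMENT_SIZES)
assert sum(SEGMENT_SIZES.values()) == 124
print(offsets["part9"])
```

A length assertion like this is a cheap guard: if a1 ever comes back shorter than 52 characters, the mismatch surfaces locally instead of as a rejected signature.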
Wrap another layer on the outside:
p = {
    'x0': LANGUAGE_VERSION, 'x1': 'xhs-pc-web', 'x2': 'Windows',
    'x3': _encrypt_headers_x3(...),
    'x4': '' if data is None else 'object',
}
payload = url_quote(json.dumps(p, separators=(',', ':')))
return "XYS_" + custom_base64(utf8_to_bytes(payload), CUSTOM_BASE64_TABLE)
CUSTOM_BASE64_TABLE is Xiaohongshu's own code table ZmserbBoHQtNP+wOcza/LpngG8yJq42KWYj0DSfdikx3VT16IlUAFM97hECvuRX5—standard base64 cannot decode it, you have to implement a version of the encoder that looks up the table according to the code table.
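One compact way to write such an encoder, assuming the variant keeps standard Base64 bit packing and = padding and only swaps the alphabet, is to encode with the standard library and translate character by character. That assumption is not confirmed by the text above; custom_base64_decode is added here only to make the round trip checkable.

```python
import base64

STD_TABLE = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
CUSTOM_BASE64_TABLE = "ZmserbBoHQtNP+wOcza/LpngG8yJq42KWYj0DSfdikx3VT16IlUAFM97hECvuRX5"


def custom_base64(data: bytes, table: str = CUSTOM_BASE64_TABLE) -> str:
    """Standard Base64 encoding with each output character remapped to `table`."""
    return base64.b64encode(data).decode("ascii").translate(str.maketrans(STD_TABLE, table))


def custom_base64_decode(text: str, table: str = CUSTOM_BASE64_TABLE) -> bytes:
    """Inverse mapping, useful for inspecting captured header values."""
    return base64.b64decode(text.translate(str.maketrans(table, STD_TABLE)))


encoded = custom_base64(b"hello")
print(encoded, custom_base64_decode(encoded))
```

The translation is reversible because the custom table is a permutation of the same 64 characters as the standard alphabet.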
x-s-common: b1 Fingerprint + MRC Checksum
The main body of x-s-common is a dictionary. What's really difficult is the b1 field in it—a subset of the fingerprint is encrypted by ARC4 and then undergoes another custom Base64:
def _encrypt_b1(fp):
    # Pick 18 from the 80+ field fingerprint (x33~x39 + x42~x46 + x48~x52 + x82)
    subset = {k: fp[k] for k in (
        'x33','x34','x35','x36','x37','x38','x39',
        'x42','x43','x44','x45','x46','x48','x49','x50','x51','x52','x82',
    )}
    raw = json.dumps(subset, separators=(',', ':'), ensure_ascii=False).encode()
    cipher = ARC4.new(b'xhswebmplfbt')
    ct = cipher.encrypt(raw).decode('latin1')
    # URL encode and then manually split the percent sequence, the purpose is to split non-ASCII into single bytes
    encoded = url_quote(ct, safe="!*'()~_-")
    b = []
    for c in encoded.split('%')[1:]:
        chars = list(c)
        b.append(int(''.join(chars[:2]), 16))
        b.extend(ord(j) for j in chars[2:])
    return custom_base64(bytes(b), CUSTOM_BASE64_TABLE)
Outer dictionary:
source = {
    's0': 5,                  # GET_PLAT_FROM_CODE (Windows takes the other branch = 5)
    's1': '',
    'x0': '1',                # localStorage.getItem("b1b1"), hardcoded
    'x1': LANGUAGE_VERSION,   # 4.3.5
    'x2': 'Windows',
    'x3': 'xhs-pc-web',
    'x4': ARTIFACT_VERSION,   # 6.7.0
    'x5': cookie_a1,
    'x6': '', 'x7': '',       # Old versions were XS / XT, now hardcoded
    'x8': b1,
    'x9': diy_mrc('' + '' + b1),  # Self-implemented CRC32, verify b1
    'x10': fp['x39'],         # Call counter, hardcoding it also passes
    'x11': 'normal',
}
return custom_base64(utf8_bytes(url_quote(json.dumps(source, separators=(',', ':'), ensure_ascii=False))),
                     CUSTOM_BASE64_TABLE)
diy_mrc is a modified CRC32: the table is generated according to the standard polynomial 0xedb88320, but the last step will do a JS-style int32 wraparound on the accumulated value (num >= 2**31 ? num % 2**32 : num - 2**32). If this wraparound is not done in Python, it will not match the signed integer behavior of JS, and the server will judge x9 as illegal.
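The two standard building blocks mentioned here can be sketched on their own: a table-driven CRC-32 over polynomial 0xedb88320, and a JS-style conversion of an unsigned 32-bit value to a signed int32. This is not diy_mrc itself; its exact modification is not reproduced, and the function names below are hypothetical. The table-driven version is checked against zlib.crc32 to confirm that the table is the standard one.

```python
import zlib


def make_crc32_table(poly: int = 0xEDB88320) -> list[int]:
    """Build the 256-entry lookup table for the reflected CRC-32 polynomial."""
    table = []
    for n in range(256):
        c = n
        for _ in range(8):
            c = (c >> 1) ^ poly if c & 1 else c >> 1
        table.append(c)
    return table


_TABLE = make_crc32_table()


def crc32_unsigned(data: bytes) -> int:
    """Standard CRC-32, returned as an unsigned 32-bit integer."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def to_int32(n: int) -> int:
    """Wrap an integer the way JS bitwise operators do (signed 32-bit)."""
    n &= 0xFFFFFFFF
    return n - 0x1_0000_0000 if n >= 0x8000_0000 else n


value = crc32_unsigned(b"123456789")
assert value == zlib.crc32(b"123456789")
print(hex(value), to_int32(value))
```

Python integers never overflow, so the signed wraparound has to be applied explicitly wherever the JS code relies on it.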
The Other Three Headers
The remaining three headers are relatively simple:
x_b3_traceid = ''.join(random.choices('abcdef0123456789', k=16))
x_t = str(int(time.time() * 1000))
# x-xray carries a seq that auto-increments on each call; suspected to be the same kind of behavior counter as x10 in x-s-common
seq = initial_seq + call_count
part1 = hex(int(time.time() * 1000) << 23 | seq)[2:].zfill(16)
part2 = hex(random.getrandbits(64))[2:].zfill(16)  # 64-bit random number
x_xray_traceid = part1 + part2
A common misconception is that "the signature bytecode will change anyway, so it is better to reuse an implementation maintained by others". Actual testing shows this is only half right: the VM instruction set does not change; only certain constants in the bytecode do (the 124-byte XOR key, the Base58/Base64 code tables, the part11 tail, the part1 magic, and the d[92]/d[93]/675 offsets in websectiga). Once the static implementation is fully worked out, the next time the server rotates the key, diffing the old and new vendor-dynamic.xxx.js locates the constants to replace in about ten minutes. With a third-party implementation, by contrast, you can only wait until it is updated. The biggest gain from reverse engineering it yourself is knowing which constants change and which structures do not: the structural layer almost never changes.
Fingerprint context is dynamic
An easily overlooked detail is that the fingerprint is not one-time. b1 in x-s-common is recalculated with the current fp every time, and fp will be updated according to storage_state (page state) and page_context (current URL and referer)—for example, calling note details should be set to explore/{note_id}, and calling homepage recommendations should be set to the homepage. If "which page you are on" seen by the server does not match the request itself, risk control will directly pull the plug. Before calling feed, you must first switch page_context to explore/{note_id} and then calculate the signature. If you forget this step, you get 461 verifyType=301.
page_context is mapped to page_type based on the path in the URL:
from urllib.parse import urlparse

def normalize_page_context(ctx: dict | None) -> dict:
    # Defaults describe the explore homepage
    out = {
        "location": "https://www.xiaohongshu.com/explore",
        "referer": "https://www.xiaohongshu.com/",
        "page_type": "explore",
    }
    if ctx:
        out.update({k: v for k, v in ctx.items() if v is not None})
    # page_type is always re-derived from the location path
    path = urlparse(out["location"]).path or "/"
    if "/search_result" in path:
        out["page_type"] = "search"
    elif "/user/profile" in path:
        out["page_type"] = "user_profile"
    elif "/explore/" in path or "noteId=" in out["location"]:
        out["page_type"] = "note_detail"
    else:
        out["page_type"] = "explore"
    return out
The fingerprint itself is a dictionary of 80+ fields fp = {"x1": ua, "x2": "false", "x3": "zh-CN", "x4": 24, ...}—UA, language, color depth, device memory, CPU, screen resolution ("1920;1080"), available area ("1920;1040"), time zone (-480, "Asia/Shanghai"), GPU vendor/renderer, plugins, canvas fingerprint (x22), voice hash (x53), WebGL extension hash (x56), cookie copy (x57), DOM-related counts (x58 div / x59 resource / x61 window.* property count / x73 DOM node count), {referer, location, frame} in x66, {prefix}|{window_props}|{script_count} string in x69, and some hardcoded magic fields (x30 "swf object not loaded", x45 "__SEC_CAV__1-1-1-1-1|__SEC_WSA__|", etc.). The role of the entire dict is to describe the state of "the browser on a certain URL at a certain moment".
Before each call to an encrypted API, it will recalculate the fields in fp affected by the current request:
# Key: fields affected by the request
fp["x39"] = str(storage["sc"]) # session counter, +1 per request
fp["x44"] = str(int(time.time()*1000)) # Current millisecond
fp["x57"] = "; ".join(f"{k}={v}" for k, v in cookies.items()) # Current cookie snapshot
fp["x58"] = str(div_count) # Number of DOM divs under the current page_type
fp["x59"] = str(resource_count)
fp["x61"] = str(window_props)
fp["x66"] = {"referer": ctx["referer"], "location": ctx["location"], "frame": 0}
fp["x69"] = f"{prefix}|{window_props}|{script_count}"
fp["x73"] = str(dom_count)
x58/x59/x73 have different baselines on different page_types: explore is (204, 14, 1240), note_detail adds (+36, +12, +420), search adds (+18, +8, +180), user_profile adds (+22, +10, +260). These baselines are values captured against real browsers on various pages. Fine-tuning one or two numbers does not affect risk control, but the overall magnitude and page_type must match.
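The baseline arithmetic above can be written down as a small lookup. This is a sketch that assumes the "+" values are offsets relative to the explore baseline; the constant and function names are made up for illustration.

```python
# (div_count, resource_count, dom_count) captured on the explore page
EXPLORE_BASELINE = (204, 14, 1240)

# Offsets relative to the explore baseline, keyed by page_type
PAGE_TYPE_DELTAS = {
    "explore": (0, 0, 0),
    "note_detail": (36, 12, 420),
    "search": (18, 8, 180),
    "user_profile": (22, 10, 260),
}


def dom_counts_for(page_type: str) -> tuple:
    """Return (x58, x59, x73) for a page_type; unknown types fall back to explore."""
    deltas = PAGE_TYPE_DELTAS.get(page_type, PAGE_TYPE_DELTAS["explore"])
    return tuple(base + delta for base, delta in zip(EXPLORE_BASELINE, deltas))
```

Under that reading, note_detail works out to (240, 26, 1660), which are the values that would go into x58, x59, and x73 before signing a feed request.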
Feed Request
The POST body of /api/sns/web/v1/feed is very simple:
data = {
"source_note_id": note_id,
"image_formats": ["jpg", "webp", "avif"],
"extra": {"need_body_topic": "1"},
"xsec_source": "pc_feed", # Taken from the query of the share URL
"xsec_token": xsec_token,
}
The signature input is (a1, loadts, "/api/sns/web/v1/feed", None, data)—JSON serialize data (separators=(',', ':')) and concatenate it directly after the URI to MD5 together. The response body looks like:
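The "serialize, concatenate, MD5" step can be sketched as follows. It covers only the digest of URI plus body, not the rest of the x-s pipeline; the function name is hypothetical, and ensure_ascii=False is an assumption carried over from the other json.dumps calls in this article.

```python
import hashlib
import json


def feed_signature_digest(uri: str, data: dict) -> str:
    """MD5 over the URI immediately followed by the compact JSON body."""
    # Compact separators matter: a single extra space changes the digest.
    body = json.dumps(data, separators=(",", ":"), ensure_ascii=False)
    return hashlib.md5((uri + body).encode("utf-8")).hexdigest()
```

The body string hashed here has to be byte-for-byte the same string that is sent as the POST body, so it is safer to serialize once and reuse the result for both.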
{
"code": 0, "success": true, "msg": "成功",
"data": {
"items": [{
"id": "...", "model_type": "note",
"note_card": {
"note_id": "...", "type": "video",
"title": "...", "desc": "...",
"user": {"user_id": "...", "nickname": "..."},
"interact_info": {"liked_count": "7.9万", "comment_count": "3419", ...},
"image_list": [{"url_default": "...", "info_list": [...], "live_photo": true}],
"video": {"media": {"stream": {"h264": [{"master_url": "...", "backup_urls": [...]}]}}},
"tag_list": [{"id": "...", "name": "...", "type": "topic"}]
}
}]
}
}
Take items[0].note_card for downstream processing.
Initial Session Quality and Auto-Retry
Even if the signature, cookie, and bootstrap are all correct, the first time a newly created session hits the feed, there is still a probability of returning a response with a successful shell but empty data (success=True but items is empty). This is not a code issue, but a probabilistic demotion by the server for "first access by a brand new identity". With the same code and the same note, returning empty this time, rebuilding the session once and hitting it again will yield data.
Identification mark: success=True and items is empty. The handling method is an auto-retry loop—delete the persisted device state (device_state.json), rerun the 9-step bootstrap to change a set of identities, and hit the feed again. Try up to 3 rounds in a single call. The single hit rate of cold start is about 70-90%, and adding retry can pull it up to 95%+:
MAX_ATTEMPTS = 3
note_card = {}
last_err = None
for attempt in range(1, MAX_ATTEMPTS + 1):
    if attempt > 1:
        DEVICE_STATE_FILE.unlink(missing_ok=True)  # Delete persisted device state
    session = await create_xhs_session()  # Rerun 9-step bootstrap to change identity
    try:
        resp = await session.apis.note.note_detail(note_id, xsec_token)
        payload = await resp.json(content_type=None)
        items = (payload.get("data") or {}).get("items") or []
        if items and (items[0].get("note_card") or {}):
            note_card = items[0]["note_card"]
            break  # Success
    except NeedCaptchaVerify as e:
        # 216 semantic captcha / 102 slider verification; catch it and let the next round reset identity and retry
        last_err = f"attempt={attempt} verify={e.details.get('verifyType')}"
    finally:
        await session.close_session()
The return structure also carries two observable fields feed_attempts (how many times tried) and feed_recovered_on_attempt (which round was the successful one), so you can see at a glance whether it was saved by retrying—routes that do not go through feed are always 1/1, only the API route will change.
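As a rough sanity check on the hit rates quoted earlier, the effect of retrying can be computed directly. This sketch assumes each attempt succeeds independently with the same probability, which is a simplification of how risk control actually behaves.

```python
def hit_rate_with_retries(single_hit_rate: float, max_attempts: int) -> float:
    """Probability that at least one of `max_attempts` independent tries succeeds."""
    return 1 - (1 - single_hit_rate) ** max_attempts
```

With a single hit rate of 0.7 and 3 attempts this gives 1 - 0.3^3 = 0.973, and with 0.9 it gives 0.999, so the "95%+" figure is consistent with a 70-90% cold-start rate.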
Actual Test
Image Note:
note_id : 6955f790000000001f0042e2
title : 𝐰𝐞𝐜𝐡𝐚𝐭|情侣头像
author : zhang / 5cd3f6730000000012033a83
content_type : image
images : 12 images (each with infoList multi-resolution, url_pre, url_default, url)
stats : liked=857 comment=22 collect=290 share=200
publish : 2026-01-01T12:26:56
feed_attempts=2 feed_recovered_on_attempt=2 ← Initial jitter this time, recovered in the second round
Video Note:
note_id : 69e3114b000000002202916e
title : 仲夏可可很萌!
author : 用眼泪把你复习一遍 / 6690bced000000000f0348e9
content_type : video
videos : 6 items (master URL + multiple CDN backups)
stats : liked=1209 comment=154 collect=160 share=42
publish : 2026-04-18T13:06:19
Live Photo Note:
note_id : 69e1594b000000000b010eaf
title : 🇫🇷尼斯老城遇到杨超越董思成
author : 喵了个汪 / 6161f5460000000002022ced
content_type : live_photo
images : 3 static images
videos : 6 items (each static image paired with a motion, stream.h264 structure is regular)
stats : liked=7.9万 comment=3419 collect=5392 share=5024 ← Readable format
publish : 2026-04-17T05:48:59
Route 3 has the finest field granularity: multiple backup URLs for videos, regular live photo motion stream structure, and statistics are readable 7.9万. The cost is that you have to run the complete 9-step bootstrap every time, calculate the JS signature locally once, the constants in the signature may need to be updated with the server version, and occasional retries are required.
Combining the Three Routes
In actual operation, the three routes are stacked together sorted by data quality:
step 1 /v1/feed + self-calculated x-s / x-s-common → source=feed_api Cleanest
step 2 PC session opens share page, reads __INITIAL_STATE__ → source=initial_state Complete note
step 3 Reads <meta og:*> tags → source=meta_fallback Only title and cover left
Each output comes with an extract_source field, so you can see at a glance which tier the data is in this time.
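The stacking itself is a plain first-success-wins loop. The sketch below assumes each tier is a callable that takes the URL and returns a dict or None; the function name and the errors field are hypothetical, only the extract_source labels come from the list above.

```python
from typing import Callable, List, Optional, Tuple

Extractor = Callable[[str], Optional[dict]]


def extract_with_fallback(url: str, tiers: List[Tuple[str, Extractor]]) -> dict:
    """Try each tier in order and tag the first non-empty result with its source."""
    errors = []
    for source_name, extractor in tiers:
        try:
            result = extractor(url)
        except Exception as exc:  # a failed tier must not stop the chain
            errors.append(f"{source_name}: {exc}")
            continue
        if result:
            return {**result, "extract_source": source_name}
    # Every tier failed or came back empty
    return {"extract_source": None, "errors": errors}
```

Passing the tiers as [("feed_api", ...), ("initial_state", ...), ("meta_fallback", ...)] reproduces the order above, and an empty dict from a tier counts as a miss so the next tier still gets its turn.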
Route 1 is another entry point—scenarios that only need images and grab some metadata along the way use it, without building a PC session, and the overall overhead is one-tenth of the complete path. Pulling the results of the three test links on the three routes together, the three routes are actually almost equivalent in "what data can be obtained", the real difference is the cost of establishing the connection:
| Route | Session Cost | Signature Cost | Data Freshness | Comment Content | Image Watermark |
|---|---|---|---|---|---|
| Mobile UA + CDN Replacement | None | None | SSR Snapshot | Cannot get | !h5_1080jpg included (change domain to resolve) |
| PC Session + __INITIAL_STATE__ | 9-step bootstrap | None | Real-time | Requires separate API call | Not included by default |
| PC Session + /v1/feed API | 9-step bootstrap + fingerprint | Self-calculated x-s / x-s-common | Real-time | Returned together | Not included by default |
Unified output structure (all three routes are normalized to this shape for the convenience of downstream):
{
"source": "mobile | html | api",
"extract_source": "feed_api | initial_state | meta_fallback",
"original_url": "...",
"final_url": "...",
"note_id": "...",
"xsec_token": "...",
"title": "...", "content": "...",
"content_type": "image | video | live_photo",
"author": {"name": "...", "user_id": "..."},
"stats": {"liked": "...", "comment": "...", "collect": "...", "share": "..."},
"topics": ["..."],
"publish_time_ms": 0, "publish_time_iso": "...",
"images": [{"index", "id", "url", "raw_url", "live_photo"}],
"videos": ["..."],
"live_photos": [{"index", "image_url", "video_url"}],
"feed_attempts": 1,
"feed_recovered_on_attempt": 1,
}
Image Variant Lineage
Looking at the actual download results of the three routes together, although the image_id is exactly the same, the "default variant" given by the CDN differs significantly:
| Source (URL suffix) | Processing | Size | Watermark |
|---|---|---|---|
| Mobile HTML (!h5_1080jpg) | Scaled to 1080, jpg | ~77 KB | Included |
| Desktop HTML __INITIAL_STATE__ (!nd_dft_wlteh_jpg_3) | Default processing, jpg | ~147 KB | Not included |
| API /v1/feed (!nd_dft_wlteh_webp_3) | Default processing, webp | ~42 KB | Not included |
| Change domain to sns-img-qc.xhscdn.com/ | No suffix, no signature | ~147 KB | Not included |
A few rules worth noting down:
- The watermark is a feature of the specific suffix variant !h5_1080jpg, not a feature of the sns-webpic-qc domain. The !nd_dft_wlteh_* variants under the same domain do not add watermarks by default.
- The jpg variant given by the desktop HTML is almost the same size as the bare original image (147KB vs 147KB), indicating that variant is nearly lossless; changing the domain is not necessary on this route.
- The webp variant given by the API is truly small (42KB), with volume optimization dominating. To get the original image, you have to change the domain.
- The image_id of the three routes is exactly the same. So as long as you know the URL given by any route, you can get any other variant (including the unwatermarked original image) by "skipping date and sign, keeping image_id, and changing the domain".
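The "skip date and sign, keep image_id, change the domain" rule can be sketched like this. It assumes the signed URL has the shape host/date/sign/image_id!suffix, with the image_id as the last path segment; that layout and the function name are assumptions for illustration and should be checked against real URLs.

```python
from urllib.parse import urlparse

ORIGINAL_HOST = "sns-img-qc.xhscdn.com"


def to_original_image_url(url: str) -> str:
    """Rebuild an unsigned original-image URL from a signed variant URL."""
    path = urlparse(url).path
    # Assumption: the last path segment is "<image_id>!<suffix>"
    last_segment = path.rstrip("/").rsplit("/", 1)[-1]
    image_id = last_segment.split("!", 1)[0]
    if not image_id:
        raise ValueError(f"no image_id found in {url!r}")
    return f"https://{ORIGINAL_HOST}/{image_id}"
```

If an image_id turns out to span more than one path segment, the rsplit step is the part to adjust.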
Troubleshooting Order: Where to start checking when the link breaks
When the link breaks, the worst approach is to guess backward from the "last step". The right direction is to work forward from the earliest, cheapest check; each layer has its own specific signals:
Layer 1: Entry Parameters
First confirm that note_id and xsec_token are parsed correctly, especially whether the = at the end of xsec_token has been eaten. Troubleshooting method: print the resolved note_id and xsec_token, quote them as-is, and manually assemble an explore/{note_id}?xsec_token=... URL to open in the browser. If you can see the note, the entry parameters are fine.
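A minimal sketch of the entry-parameter parsing, using only the standard library. The function name is hypothetical, and taking note_id from the last path segment is an assumption based on the discovery/item/{note_id} and explore/{note_id} URL shapes shown in this article.

```python
from urllib.parse import parse_qs, urlparse


def parse_share_url(final_url: str) -> dict:
    """Pull note_id, xsec_token and xsec_source out of the final redirect URL."""
    parts = urlparse(final_url)
    # parse_qs splits each pair on the first '=', so trailing '=' padding survives
    query = parse_qs(parts.query, keep_blank_values=True)
    note_id = parts.path.rstrip("/").rsplit("/", 1)[-1]
    return {
        "note_id": note_id,
        "xsec_token": query.get("xsec_token", [""])[0],
        "xsec_source": query.get("xsec_source", [""])[0],
    }
```

One caveat: parse_qs decodes percent-escapes and turns a literal + into a space, so a token containing an unescaped + would need separate handling.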
Layer 2: Version Number
The typical symptom of hitting this layer is shield/webprofile returning 471 verifyType=290 ("当前版本过低"). Troubleshooting: see if the current ARTIFACT_VERSION / LANGUAGE_VERSION is the same as the one extracted from the online vendor-dynamic.xxx.js; if not, manually synchronize once and rerun.
Layer 3: Bootstrap
If any of the 9 steps fails, everything after it fails as a chain reaction. Troubleshooting: at each step, print the status code and whether the key cookies were written (websectiga / sec_poison_id / gid / acw_tc / web_session). Typical failure points:
- The type parameter of scripting in steps 6 and 7 is wrong → websectiga cannot be decrypted
- The fingerprint in profileData was escaped for non-ASCII by the JSON encoder → webprofile 401 / 461
- The cookie jar is not clean before activate → cannot get web_session, or get the wrong prefix
Layer 4: Signature (Route 3 Exclusive)
When feed returns 461, look at Verifytype in the header:
| Phenomenon | Stage | Meaning | Countermeasure |
|---|---|---|---|
| 461 verifyType=216 | /v1/feed | Signature invalid or JSVMP bytecode changed | Align new constants / re-decrypt bytecode |
| 461 verifyType=301 | /v1/feed | Session quality insufficient or page_context mismatch | Reset device state, switch to correct page_type |
| 461 verifyType=102 | /v1/feed | Trigger slider verification | Pass manually or hook up captcha service |
| 461 verifyType=216 + HTML | Share page HTML | Semantic captcha (select image and click words) | Parse theme/grid, call specialized branch |
| code=0 success=true data={} | /v1/feed | Soft risk control: shell successful but data emptied | First check if = at the end of xsec_token is truncated; otherwise reset device state and retry |
| SSLError / UNEXPECTED_EOF | Any | TLS fingerprint recognized or network jitter | Retry, change proxy |
| Redirected to /login | HTML share page | Session expired or guest state revoked | Clear device state, reactivate |
Be especially alert to code=0 data={}. It looks completely normal, but actually returns nothing, easily fooling rough success judgments.
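A success check that is not fooled by the empty shell has to look all the way down to note_card. A minimal sketch, with a hypothetical function name and the field names taken from the response body shown earlier:

```python
from typing import Optional


def extract_note_card(payload: dict) -> Optional[dict]:
    """Return note_card only when the feed response actually carries data."""
    if not isinstance(payload, dict) or payload.get("code") != 0:
        return None
    # data may be {}, None, or missing entirely in the soft risk control case
    items = (payload.get("data") or {}).get("items") or []
    if not items:
        return None
    return items[0].get("note_card") or None
```

Treating a None result as "retry with a fresh identity" rather than "note has no content" keeps code=0 data={} from being recorded as a successful parse.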
Layer 5: HTML Strong Extraction (Route 2)
If feed keeps failing but the HTML share page can be opened, fall back to Route 2 and check whether __INITIAL_STATE__ can be extracted. The typical failure is the marker regression mentioned earlier, where the spaces on both sides of the equals sign disappear: extract_source keeps falling to meta_fallback, and the title and body are still there but the author, images, and time are all empty. This is not the site cutting capabilities, but a local parser regression.
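A marker match that tolerates both spellings avoids this regression. The sketch below is an illustrative reimplementation, not the parser used in this project: it matches the marker with optional whitespace, walks braces while respecting string literals, and replaces bare undefined values in a single pass.

```python
import json
import re
from typing import Optional

# \s* on both sides accepts "__INITIAL_STATE__ = {" and "__INITIAL_STATE__={"
MARKER = re.compile(r"window\.__INITIAL_STATE__\s*=\s*")


def extract_assigned_json_object(html: str) -> Optional[str]:
    """Return the balanced {...} text assigned after the marker, or None."""
    match = MARKER.search(html)
    if not match:
        return None
    start = html.find("{", match.end())
    if start == -1:
        return None
    depth, in_string, escaped = 0, False, False
    for i in range(start, len(html)):
        ch = html[i]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return html[start:i + 1]
    return None


def sanitize_initial_state(raw: str) -> str:
    """Turn bare `undefined` after ':', '[' or ',' into null so json.loads accepts it."""
    return re.sub(r"(?<=[:\[,])\s*undefined\b", "null", raw)
```

The sanitize step is deliberately naive: it would also rewrite the text ",undefined" inside a string value, so a stricter parser should be string-aware there as well. Usage is json.loads(sanitize_initial_state(raw)) once the extraction returns a non-None string.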
Layer 6: Mobile Fallback (Route 1)
When both PC routes fluctuate, use the mobile route as the final verification: if this one also cannot produce data, it is highly likely that the note itself has been deleted or risk control has directly blacklisted this exit IP; if this one can produce basic fields, it means the content is still alive, just the PC side state is not clean, reset the session and start over.
Confirmed, Inferred, and Stop Mythologizing
After all this, I have accumulated a fair amount of experience. It is worth separating it into three categories so that later readers can avoid the same detours.
Confirmed:
- xsec_token tail = being truncated will cause /v1/feed to return code=0 success=true data={}
- Hardcoding the window.__INITIAL_STATE__ marker will cause the complete HTML extraction to degrade entirely to meta_fallback
- /v1/feed currently has real fluctuations of initial failure and recovery after secondary session rebuilding
- web_session is issued by the server; no matter how you piece it together locally, you cannot produce a "valid but unseen by the server" value
Highly Credible Inferences:
- Session quality (cookie completeness + fingerprint consistency + page_context matching) will significantly affect the initial success rate of Route 3
- A persisted "stable device state" is more like a real browser than a completely random new device state, with a higher pass rate
- The JSVMP VM instruction set is stable in the long term, and what changes are mainly the constants in the bytecode, which means the maintenance cost after reverse engineering it yourself is actually acceptable
Stop Mythologizing:
- "web_session with an 03 prefix definitely won't work, and an 04 prefix definitely will": False. 03 is sufficient for public data; 04 is only necessary on APIs like following feeds / private messages.
- "As long as a certain header is added, it will definitely pass feed": False. The success or failure of feed is the combined result of session quality, cookie combination, fingerprint context, and timing fluctuations.
- "Route 1 can only get images": False. Most fields are actually scattered in the mobile HTML, and scattered regex can scan them out; the comment body indeed cannot be obtained, but everything else can.
- "Watermarks are a feature of the sns-webpic-qc domain": False. Watermarks are a feature of the !h5_1080jpg processing suffix, and !nd_dft_wlteh_* variants of the same domain do not have watermarks by default.
What Xiaohongshu's Anti-Scraping Looks Like
Stacking the three routes together, we can roughly draw Xiaohongshu's protection hierarchy:
| Layer | Mechanism | Description |
|---|---|---|
| 1 | xsec_token | URL-level anti-hotlinking token, bound to session, don't get screwed by the = at the end |
| 2 | JS Cookie Generation | a1/webId/websectiga/gid etc. generated by JS runtime |
| 3 | Browser Fingerprint | 80+ field device fingerprint, DES encrypted reporting |
| 4 | Request Signature x-s | JSVMP virtual machine execution, periodically changing bytecode constants |
| 5 | TLS Fingerprint + UA/client hints | Inconsistent UA and sec-ch-ua directly recognized |
| 6 | Version Check | ARTIFACT_VERSION expired directly 471/290 |
| 7 | page_context Consistency | Request URL and referer/location mismatch triggers 301 |
| 8 | Semantic Captcha | Type 216, needs to recognize themes and grids in images |
| 9 | Behavior Analysis | Comprehensive risk control of frequency, trajectory, timing, etc. |
Layer 4 is the hardest nut to crack, but the first 8 layers combined already block 99% of scripts, and most people never even reach the door of Layer 9. JSVMP compiles the signature algorithm into custom bytecode. The VM instruction set is basically stable, and what changes are the constants in the bytecode (XOR key, code tables, magic bytes, etc.), so after reverse engineering it once, maintenance is just a matter of diffing the old and new vendor-dynamic.js to pick out the constants. What really requires a daily contest with the server is session quality: cookie issuance, version number synchronization, and fingerprint consistency are what change with every request.
Disclaimer
This article only records personal exploration and thinking during the technical learning process, and is for security research and educational purposes. All technical analyses involved in the text are aimed at publicly accessible pages and network requests, and do not involve any unauthorized access, bulk data scraping, or commercial use. Please abide by the terms of service of the relevant platforms and local laws and regulations, and do not use the content of this article in scenarios that infringe on the legitimate rights and interests of others.
Afterword
The three routes span quite a bit from "bare HTTP" to "complete Web session plus official API", but what really takes time is never the signature algorithm itself, but the quality of the session—whether UA and client hints need to be aligned, whether bootstrap can finish running, whether the web_session obtained is 03 or 04, whether page_context corresponds to the current request. The signature algorithm can be used for a long time once written, while session quality is the daily work of gaming with the server on every request.
Another observation is that the most expensive route (Route 3) and the cheapest route (Route 1) do not differ much in "what data can be obtained"—the field differences returned by the two are mainly in the comment body, multiple backup URLs for videos, and readable format statistics. Layered fallback stacks them together, diluting the cost of the expensive path, while using the extract_source field to explicitly mark which tier was reached each time, so you don't have to guess manually when problems occur.
The pair of CDN domains sns-webpic-qc and sns-img-qc is probably the most suitable example to conclude the entire system. The former takes the signature path and determines the image output specifications based on the URL suffix—whether to add a watermark, resolution, and encoding format are all hidden in the suffix; the latter directly gives the original image based on the bare image_id, without signature, and has no concept of suffix. The two domains share the same image_id namespace—knowing the id of an image on webpic-qc, getting the original image on img-qc is free. It is exactly this point that allows the watermarked version of !h5_1080jpg in the mobile share page to be easily bypassed: you don't need to know how it calculates the signature, just keep the image_id and change the domain. This door may close in the future. But at least for today, it is still open.
Appendix: Minimum Reproduction Checklist
Implementing the capabilities in this order gets all three routes running; the list also doubles as a checklist for troubleshooting:
Entry Layer
- Follow xhslink.com/o/... 302, get the final URL
- Extract note_id / xsec_token / xsec_source from the final URL, use parse_qs to keep = padding
Route 1 (Zero Session)
- iPhone UA requests share page, follow redirects
- Scattered regex to extract title/author/body/stats/topics/time, with "Xiaohongshu" placeholder filtering
- onix-carousel-item to grab images, _259.mp4 to filter watermarked videos
- cdn_strip_watermark change to sns-img-qc to get original images
- content_type three-branch determination (video / live_photo / image)
Cookie Generation (Shared by Route 2 and 3)
- a1 = hex(ts) + rand30 + "50000" + crc32[:52], webId = MD5(a1)
- Version sync: extract artifactVersion / languageVersion from online vendor-dynamic.xxx.js
- 9-step bootstrap runs through in order, check corresponding cookie writing at each step
- websectiga's base64 + 5-character grouping + secondary table lookup decoding
- gid's JSON → base64 → DES-ECB(zbp30y86) → hex
- web_session taken from activate response, distinguish 03 / 04
Route 2 (HTML)
- Two candidate URLs (discovery/item + explore), with document-level headers
- extract_assigned_json_object: regex marker + brace-balance + string escaping
- sanitize_initial_state: two rounds of undefined → null
- choose_note_payload: lookup by note_id, fallback to find the first non-undefined key
Route 3 (API)
- x-s: 11 segments of bytes → XORKEY124 → Base58(custom) → mns0101_ + outer Base64(custom) + XYS_
- x-s-common: ARC4(xhswebmplfbt) encrypt 18-field fingerprint → b1, outer dict + diy_mrc + Base64
- x-b3-traceid / x-xray-traceid / x-t
- Update loadts and dynamic fields of fp before each signature (x39/x44/x57/x58/x59/x61/x66/x69/x73)
- page_context switched to note_detail based on URL
- feed POST body fixed shape (source_note_id / image_formats / extra / xsec_source / xsec_token)
- 3 rounds of auto-retry, delete device_state.json + clear init cookie cache + rebuild session on each failure
Unified Output
- Three routes normalized to the same dict shape, with source / extract_source / feed_attempts / feed_recovered_on_attempt
- Actual test shows all three content types (image / video / live_photo) can output complete fields