Cat and Mouse

Background

I have a Xiaohongshu link parser that extracts titles, authors, images, comments, and statistics from notes. One day, someone reported that a shared link failed to parse:

http://xhslink.com/o/4IVKoZeuj0O

Requesting with aiohttp and following redirects eventually landed on /404?errorCode=-510001. Opening the same link in a browser worked perfectly fine.

What happened?

Short Link Redirects and xsec_token

Following the redirect chain with curl:

xhslink.com/o/4IVKoZeuj0O
  → 302 → xiaohongshu.com/discovery/item/6955f790000000001f0042e2?xsec_token=...&type=normal
    → 302 → /404?errorCode=-510001

The short link first 302 redirects to the note page, and then the note page 302 redirects to 404. The key parameter is xsec_token in the URL, which is Xiaohongshu's anti-hotlinking token. The server verifies whether it matches the current session.
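Each hop in the chain above can be inspected with a small helper. This is a minimal sketch, not the parser's actual code: the function name `describe_redirect_target` and the returned keys are hypothetical, and it assumes note pages live under /discovery/item/ or /explore/ as in the URLs shown in this article.

```python
from urllib.parse import urlparse, parse_qs

def describe_redirect_target(url: str) -> dict:
    """Pull the note ID, xsec_token and error code out of one redirect target."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    segments = [s for s in parsed.path.split("/") if s]
    info = {"note_id": "", "xsec_token": "", "error_code": ""}
    # Note pages end in the note ID (/discovery/item/<id> or /explore/<id>)
    if segments and segments[0] in ("discovery", "explore"):
        info["note_id"] = segments[-1]
    # parse_qs percent-decodes the token, so a trailing %3D comes back as "="
    info["xsec_token"] = query.get("xsec_token", [""])[0]
    # /404?errorCode=-510001 is where the failed request ends up
    if segments and segments[0] == "404":
        info["error_code"] = query.get("errorCode", [""])[0]
    return info
```

Running it on every Location header makes it obvious at which hop the note ID and token appear and at which hop the error code replaces them.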

Comparing the requests in the browser's packet capture, the second request had an extra batch of Cookies:

a1, webId, websectiga, sec_poison_id, gid, web_session, acw_tc, abRequestId

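These cookies do not arrive via Set-Cookie on the page response itself. They appear only after the page's JavaScript runs: some are computed locally, and others are issued by follow-up API calls that the script makes. A pure HTTP client that only follows redirects gets none of them.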

Two Sources of Identity Materials

To reproduce Route 2 and Route 3, a key understanding must first be established—cookies are not all the same kind of thing. Some are generated locally, and some are issued by the server.

Type	Cookie	How to obtain
Locally generated	`a1` / `webId` / `abRequestId`	Calculated by Python code using a fixed algorithm, completely independent of the network
Locally generated	`loadts` / `webBuild` / `xsecappid`	Local timestamp, version number, and application ID, written directly into the cookie jar
Server issued	`websectiga` / `sec_poison_id`	POST `/api/sec/v1/scripting`, the server issues a piece of JS, decoded locally by a fixed offset
Server issued	`gid` / `acw_tc`	POST `/api/sec/v1/shield/webprofile`, body contains encrypted fingerprint, server `Set-Cookie`
Server issued	`web_session`	POST `/api/sns/web/v1/login/activate`, issued by the server, prefix `03` / `04`

Route 1 does not touch either of these two categories (so it's called zero session cost); Route 2 needs to finish all 9 steps and get all cookies, but does not call the signature API; Route 3 requires calculating 5 signature headers yourself in addition to all the cookies.

Route 1: Mobile UA + CDN Domain Replacement

This is the simplest path. The desktop requires JS to generate Cookies to pass verification, so let's change our thinking and disguise as a mobile phone. Directly follow the 302 with aiohttp, the final page is HTTP 200, and the HTML can be obtained without any Cookies.

PYTHON

MOBILE_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) "
    "Version/17.0 Mobile/15E148 Safari/604.1"
)

async with aiohttp.ClientSession(
    timeout=aiohttp.ClientTimeout(total=30),
    headers={
        "User-Agent": MOBILE_UA,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.9",
    },
) as s:
    async with s.get(share_url, allow_redirects=True) as resp:
        final_url = str(resp.url)
        html = await resp.text(errors="replace")

if "/404" in final_url or "errorCode=" in final_url:
    raise RuntimeError("The note may have been deleted or hit risk control")

It makes sense when you think about it. The WebView in the App may not be able to run the complete anti-scraping JS. The mobile share link (app_platform=ios) takes a more lenient path and does not verify xsec-related Cookies.

The cost is that the image URLs have watermarks, looking like http://sns-webpic-qc.xhscdn.com/202604070308/<sign>/<image_id>!h5_1080jpg. The !h5_1080jpg at the end is a processing instruction at the CDN level, and the CDN synthesizes the watermark when outputting the image; removing the suffix will result in a 403 (the signature path does not match). The watermark is part of the image processing pipeline, not added by the client.

Fortunately, Xiaohongshu has another CDN domain sns-img-qc.xhscdn.com (and ci.xiaohongshu.com), which directly outputs the original image based on the bare image_id, without signature or watermark. The implementation of changing the domain to remove the watermark:

PYTHON

def cdn_strip_watermark(url: str) -> str:
    clean = url.split("!")[0] if "!" in url else url     # Remove suffixes like !h5_1080jpg
    path = urlparse(clean).path
    parts = path.strip("/").split("/")
    if len(parts) >= 3:
        image_path = "/".join(parts[2:])                 # Skip DATE and SIGN, keep only image_id
        return f"https://sns-img-qc.xhscdn.com/{image_path}"
    return url

Extracting Fields from HTML

The HTML of the mobile share page does have the window.__INITIAL_STATE__ variable, but the note.noteDetailMap inside is an empty object—the mobile data injection method is different. Note fields are scattered in the HTML text in the form of JSON fragments (SSR pre-rendered script blocks, or template serialization products). Key-value pairs like "nickname":"..." / "desc":"..." / "title":"..." are readable in plaintext in the text, and can be scanned directly with regex.

Define a tier for each field, match them one by one according to priority, and use the first non-empty, non-placeholder value:

PYTHON

def _first(patterns, html: str) -> str:
    for p in patterns:
        for m in re.finditer(p, html, re.I | re.DOTALL):
            val = m.group(1).strip() if m.group(1) else ""
            if val and val not in ("小红书", "小红书 - 你的生活指南"):
                return val
    return ""

The title tier must add a "Xiaohongshu" placeholder filter—the <title> tag is often a site name placeholder like "小红书" or "小红书 - 你的生活指南". If encountered, skip to the next pattern. Otherwise, the fallback will match the site name, see "Xiaohongshu", and return it. The user will think the parsing was successful, but actually got nothing:

PYTHON

title = _first([
    r'<meta\s+property="og:title"\s+content="([^"]+)"',
    r'"title":"([^"]+)"',
    r"<title[^>]*>(.*?)</title>",
], html) or "小红书内容"

author_name = _first([r'"nickname":"([^"]+)"', r'"nickName":"([^"]+)"'], html)
author_id   = _first([r'"userId":"([^"]+)"',   r'"user_id":"([^"]+)"'],  html)

content_raw = _first([r'"desc":"([^"]+)"', r'"content":"([^"]+)"', r'"text":"([^"]+)"'], html)

stats = {
    "liked":   _first([r'"likedCount":"?(\d+)"?'],     html),
    "comment": _first([r'"commentCount":"?(\d+)"?'],   html),
    "collect": _first([r'"collectedCount":"?(\d+)"?'], html),
    "share":   _first([r'"shareCount":"?(\d+)"?'],     html),
}

pt_ms = _first([r'"time":(\d{13})'], html)                  # Millisecond timestamp

The captured content is a JS-escaped string and needs to be unescaped: \u002F → /, \u0026 → &, \u003D → =, \u003F → ?, \u003A → :, plus \n, \t, \", and so on. If the unescaping is incomplete, URL-type fields (http:\u002F\u002F...) are unusable:

PYTHON

def _unescape_js_string(s: str) -> str:
    return (s.replace(r"\u002F", "/")
             .replace(r"\u0026", "&")
             .replace(r"\u003D", "=")
             .replace(r"\u003F", "?")
             .replace(r"\u003A", ":")
             .replace(r"\n", "\n").replace(r"\t", "\t")
             .replace(r"\"", '"'))
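The chained replace covers only the escapes listed. A more general variant, sketched here under the assumption that the captured text follows ordinary JS/JSON string escaping, decodes every \uXXXX sequence in a single pass and rejoins UTF-16 surrogate pairs (emoji are serialized as two \uXXXX units). The name `unescape_js_string_general` is hypothetical.

```python
import re

_SIMPLE = {"n": "\n", "t": "\t", "r": "\r", '"': '"', "/": "/", "\\": "\\"}
_ESCAPE = re.compile(r'\\(u[0-9a-fA-F]{4}|[ntr"/\\])')

def unescape_js_string_general(s: str) -> str:
    def repl(m):
        tok = m.group(1)
        if tok[0] == "u":                      # \uXXXX -> one UTF-16 code unit
            return chr(int(tok[1:], 16))
        return _SIMPLE[tok]                    # \n, \t, \", \/, \\ ...
    out = _ESCAPE.sub(repl, s)
    # Rejoin surrogate pairs; a lone surrogate becomes U+FFFD instead of raising
    return out.encode("utf-16", "surrogatepass").decode("utf-16", "replace")
```

Because everything happens in one left-to-right pass, an escaped backslash followed by n stays a backslash and an n rather than turning into a newline, which a sequence of independent replace calls can get wrong.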

There is a cross-route difference in the statistics fields to note: in the mobile HTML, it is an integer string ("likedCount":"857"), while the API /v1/feed returns a readable format ("liked_count":"7.1万"). The semantics of the same field in the two routes are different. If the downstream needs to build a UI, it has to align them itself.
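One way to align the two formats downstream is to normalize both to integers. The sketch below is an illustration rather than part of the parser: `parse_count` is a hypothetical helper, and it assumes the only unit suffixes are 万 (10,000) and 亿 (100,000,000). A readable value such as "7.1万" is already rounded, so the integer it yields is approximate.

```python
from decimal import Decimal, InvalidOperation

_UNITS = {"万": 10_000, "亿": 100_000_000}

def parse_count(text: str):
    """Return an int for '857' or '7.1万'; None when the text is not a count."""
    text = (text or "").strip().rstrip("+")
    if not text:
        return None
    multiplier = 1
    if text[-1] in _UNITS:
        multiplier = _UNITS[text[-1]]
        text = text[:-1]
    try:
        # Decimal avoids float artifacts such as 7.1 * 10000 != 71000
        return int(Decimal(text) * multiplier)
    except (InvalidOperation, ValueError, OverflowError):
        return None
```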

There is a common pitfall in topic extraction: trying to extract from the #topic_name[话题]# pattern in the desc field. It looks like it can match, but the results are often fragmented. The reason is that desc has been JSON escaped, and the Chinese in the topic name is serialized via \uXXXX, making the regex boundary judgment easy to cut wrong. The correct approach is to use the tagList array, locate first, then findall:

PYTHON

topics = []
m = re.search(r'"tagList":\s*\[(.{0,5000}?)\]', html, re.DOTALL)
if m:
    topics = re.findall(r'"name":"([^"]+)"', m.group(1))
if not topics:                                               # Fallback: scan HTML for #xxx[话题]
    topics = re.findall(r"#([^\s#\[]+)\[话题\]", html)
topics = list(dict.fromkeys(topics))[:20]                    # Deduplicate and preserve order, max 20

tagList is a structured source without escape interference. The name field of each topic is readable as-is, allowing you to grab them all at once.

Images: Anchor the onix-carousel-item DOM

The mobile share page uses the structure <div class="onix-carousel-item"><img src="..."></div> to render the image carousel. This class name is distinctive enough that false matches are unlikely. A regex over src extracts all the images, and each URL is then passed through cdn_strip_watermark to change the domain:

PYTHON

carousel = re.findall(
    r'class="onix-carousel-item"[^>]*>.*?<img[^>]*src=["\']([^"\'\s]+)["\']',
    html, re.DOTALL,
)
images = [
    {"index": i, "id": image_id_from_url(u), "url": u, "raw_url": cdn_strip_watermark(u)}
    for i, u in enumerate(carousel, 1)
]

Videos and Live Photos: Scan masterUrl + Filter Watermark Variants

Video URLs are scattered and scanned in three patterns, and each must filter out the watermarked _259.mp4 variant:

PYTHON

video_urls = []
for pat in [
    r'"masterUrl":"([^"]+)"',
    r'"master_url":"([^"]+)"',
    r'"url":"(https://v\.xhscdn\.com[^"]+)"',
]:
    for m in re.finditer(pat, html):
        v = _unescape_js_string(m.group(1))
        if "_259.mp4" in v:                                  # Watermarked video variant, skip
            continue
        if v not in video_urls:
            video_urls.append(v)

Other suffixes (_adapt_720p.mp4 / master URL, etc.) have no watermark by default.

Content Type Determination

content_type has three values: image / video / live_photo. Determination order:

PYTHON

type_param = parse_qs(urlparse(final_url).query).get("type", [""])[0]

if type_param == "video" and video_urls:
    content_type = "video"
    videos = video_urls[:1]                                  # One main video is enough
elif video_urls:
    # Videos exist but the URL is not type=video: treat the note as a live photo, where each
    # static image is paired with a motion clip (typically the counts match, e.g. 3 and 3)
    content_type = "live_photo"
    live_photos = [
        {"index": i, "image_url": images[i-1]["raw_url"], "video_url": video_urls[i-1]}
        for i in range(1, min(len(images), len(video_urls)) + 1)
    ]
else:
    content_type = "image"

When downloading live photo notes downstream, each pair is saved as live_01_still.jpg + live_01_motion.mp4. After packaging, the user can restore the live photo effect in the mobile photo album.

Actual Test

Image Note: http://xhslink.com/o/4IVKoZeuj0O

note_id      : 6955f790000000001f0042e2
title        : 𝐰𝐞𝐜𝐡𝐚𝐭｜情侣头像
author       : zhang / 5cd3f6730000000012033a83
content_type : image
images       : 12 images (all changed CDN domain to get unwatermarked original images)
stats        : likes=857  comments=22  collects=290  shares=200
topics       : ['今天你换头像了吗', '情侣头像', 'cp', '头像分享', '今日头像分享',
               '可爱小猫', '猫猫是世界上最可爱的生物', '每日分享', '小动物头像', '头像']
publish      : 2026-01-01T12:26:56

Video Note: http://xhslink.com/o/Ap3mwS5Q0UD

note_id      : 69e3114b000000002202916e
title        : 仲夏可可很萌！
author       : 用眼泪把你复习一遍 / 6690bced000000000f0348e9
content_type : video
videos       : 1 master URL
stats        : likes=1209  comments=154  collects=160  shares=42
topics       : ['仲夏可可', '莓喵jk']
publish      : 2026-04-18T13:06:19

Live Photo Note: http://xhslink.com/o/LRYdx90zeV

note_id      : 69e1594b000000000b010eaf
title        : 🇫🇷尼斯老城遇到杨超越董思成
author       : 喵了个汪 / 6161f5460000000002022ced
content_type : live_photo
images       : 3 static images + paired motion videos one by one
stats        : likes=7  comments=3419  collects=5392  shares=5024
topics       : ['偶遇明星', '偶遇', '杨超越', '董思成', '法国', '尼斯', '尼斯老城区']
publish      : 2026-04-17T05:48:59
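The publish values in these listings are rendered from the 13-digit millisecond value captured as pt_ms. A minimal sketch of that conversion follows; the helper name `format_publish_time` is hypothetical, the UTC+8 display zone is an assumption, and the timestamp in the usage line is an illustrative value rather than one taken from the page.

```python
from datetime import datetime, timedelta, timezone

# Assumption: times are displayed in UTC+8
DISPLAY_TZ = timezone(timedelta(hours=8))

def format_publish_time(pt_ms: str) -> str:
    """Turn a millisecond timestamp string into a naive ISO-8601 string."""
    if not pt_ms:
        return ""
    seconds = int(pt_ms) // 1000               # drop the millisecond part
    dt = datetime.fromtimestamp(seconds, tz=DISPLAY_TZ)
    return dt.replace(tzinfo=None).isoformat(timespec="seconds")

print(format_publish_time("1767241616000"))    # 2026-01-01T12:26:56
```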

All three content types can be extracted, and the main data is basically complete. The limitations of this route are also clear:

The comment body cannot be obtained. Comments are not in the share page HTML. You have to separately request /api/sns/web/v2/comment/page, and that API goes back to the world requiring full signatures.
Statistical fields are integers rather than readable formats. The live photo above actually has 79k likes, but this route captures the integer 7—in the mobile HTML, the like count exists as scattered fragments truncated by CDN/SSR, lacking precision.

Route 2: PC Web Session plus INITIAL_STATE in HTML

Xiaohongshu's PC share page is server-side rendered, and the data is directly embedded in the HTML:

HTML

<script>
  window.__INITIAL_STATE__ = { "note": { "noteDetailMap": { "<note_id>": { "note": { ... } } } }, ... }
</script>

Inside is an almost complete note JSON—title, body, image list (with infoList multi-resolution variants), author, interaction data, topic tags, and video stream information are all available, and the image URLs are unwatermarked original CDN links.

There is no need to call /v1/feed, so JSVMP signatures are not needed. But there is a cost: you must first be able to open this share page in a "browser-like" state. Directly hitting it with aiohttp plus a UA will redirect to /login or 404, so you have to run through the entire browser initialization process.

Cookie Generation Chain: 9-step bootstrap

Running the complete session initialization takes these 9 steps, and the cookie produced in each step will be relied upon by the next step:

1. GET  /                                       Load homepage
2. GET  /api/sec/v1/ds?appId=xhs-pc-web         Pre-pull JSVMP decryption script
3. POST /api/redcaptcha/v2/getconfig            Captcha config
4. POST /api/sec/v1/scripting  type=ds          scripting channel warmup
5. POST /api/sec/v1/sbtsource                   Report sbt source
6. POST /api/sec/v1/scripting  callback=seccallback   Issue websectiga / sec_poison_id
7. POST /api/sec/v1/shield/webprofile           Report fingerprint → Issue gid
8. POST /api/sns/web/v1/login/activate          Guest activation → Issue web_session
9. runtime bootstrap: user/me, system/config, zones,
                      homefeed/category, global/config,
                      racing_get, racing_report
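Since a skipped step tends to show up only later as a failing API call, it can help to check the cookie jar explicitly before moving on. The following is a sketch: the cookie names and the 03 / 04 prefixes come from this article, while `REQUIRED_COOKIES`, `missing_cookies` and `session_kind` are hypothetical names.

```python
# Cookie name -> where it comes from (local computation or a bootstrap endpoint)
REQUIRED_COOKIES = {
    "a1": "local", "webId": "local", "abRequestId": "local",
    "loadts": "local", "webBuild": "local", "xsecappid": "local",
    "websectiga": "scripting", "sec_poison_id": "scripting",
    "gid": "webprofile", "acw_tc": "webprofile",
    "web_session": "login/activate",
}

def missing_cookies(jar: dict) -> list:
    """Names of required cookies that are absent or empty in the jar."""
    return [name for name in REQUIRED_COOKIES if not jar.get(name)]

def session_kind(web_session: str) -> str:
    """Classify web_session by its prefix: 03 = guest, 04 = logged in."""
    if web_session.startswith("03"):
        return "guest"
    if web_session.startswith("04"):
        return "login"
    return "unknown"
```

Calling missing_cookies after step 8 and refusing to continue on a non-empty result turns a vague downstream failure into a precise one.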

Miss one step, and some subsequent API will crash. The generation methods of several key cookies inside:

a1: This is the seed of the entire identity and is generated entirely locally: timestamp hex + 30 random characters + platform code + CRC32 checksum, truncated to the first 52 characters:

PYTHON

def gen_a1():
    hex_data = hex(int(time.time() * 1000))[2:]
    random_30 = ''.join(random.choices(
        "abcdefghijklmnopqrstuvwxyz1234567890", k=30))
    # GET_PLAT_FROM_CODE = 5 (Windows takes the other branch in the frontend getPlatformCode and returns 5)
    text = hex_data + random_30 + "5" + "0" + "000"
    crc32 = crc32_encode(text)
    return (text + str(crc32))[:52]                      # 52 bytes fixed length

webId: MD5(a1), the device identifier bound to a1.
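In code this is a one-liner. The sketch assumes the cookie holds the lowercase hex digest of the UTF-8 bytes of a1; the article only states MD5(a1), and `gen_web_id` is a hypothetical name.

```python
import hashlib

def gen_web_id(a1: str) -> str:
    # Assumption: lowercase hex digest of the UTF-8 encoded a1 string
    return hashlib.md5(a1.encode("utf-8")).hexdigest()
```

Because webId is derived from a1, the two must be regenerated together; reusing an old webId with a fresh a1 would break the binding.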

websectiga and sec_poison_id: Step 6 (POST /api/sec/v1/scripting with callback=seccallback) returns a JS string whose payload looks like {"b":"<base64>","d":[...]}, wrapped in a callback call (hence the trailing }) that the regex below anchors on). The server expects you to run the VM in the browser to recover the 64-character key, but we decode it statically:

PYTHON

def gen_websectiga(js_text: str) -> str:
    b = re.search(r'"b":"(.*?)",', js_text).group(1)
    d = json.loads(re.search(r'"d":(.*?)\}\)', js_text).group(1))

    # 1. Base64 decode b, split into a list of 5 characters per group, each character value takes ord(c) - 1
    padding = len(b) % 4
    if padding:
        b += '=' * (4 - padding)
    decoded = base64.b64decode(b).decode('utf-8')
    decode_list = []
    chunk = []
    for c in decoded:
        if len(chunk) == 5:
            decode_list.append(chunk)
            chunk = []
        chunk.append(ord(c) - 1)
    if chunk:
        decode_list.append(chunk)

    # 2. Slice by d[92]:d[93]+1, then do a secondary table lookup by fixed offset to get 64 integers
    target = decode_list[d[92]:d[93]+1]
    key = [d[target[675 + i][2]] for i in range(0, 128, 2)]

    # 3. Concatenate 64 characters using a double loop of for i in range(56, -1, -8) for j in range(8)
    return "".join(chr(key[i + j]) for i in range(56, -1, -8) for j in range(8))

Those offsets (92 / 93 / 675 / 56 / -1 / -8 / 8) are all magic numbers extracted from the JSVMP bytecode, and they shift between versions. sec_poison_id is taken directly from another field in the same response.

gid and acw_tc: Serialize the 80+ field browser fingerprint (UA, screen, WebGL, Canvas hash, etc.) → base64 → DES-ECB encryption (key zbp30y86, zero-padded to 8-byte blocks) → hex. POST as profileData to webprofile, and the server returns these two cookies via Set-Cookie in the response:

PYTHON

def encrypt_profile_data(fp: dict) -> str:
    fp_json = json.dumps(fp, separators=(',', ':'), ensure_ascii=False)
    fp_b64 = base64.b64encode(fp_json.encode())
    cipher = DES.new(b"zbp30y86", DES.MODE_ECB)
    # Zero-pad to a multiple of 8 bytes
    pad_len = 8 - len(fp_b64) % 8
    padded = fp_b64 + b'\x00' * pad_len
    return cipher.encrypt(padded).hex()

web_session: The last step, guest activation POST /api/sns/web/v1/login/activate with an empty body, issued by the server. There are two types of prefixes: starting with 03 is the device-level guest state, which can be obtained by an empty body POST; starting with 04 is the real logged-in state, which can only be obtained by entering activate with the session cookie of a logged-in browser. 03 is sufficient for public data like /v1/feed and share page HTML. What really needs 04 are APIs bound to real user relationships like the following feed and private messages, which have nothing to do with note parsing.

In addition, several auxiliary cookies will be written along the way: loadts (timestamp for signature, updated on every encrypted request), webBuild (equals ARTIFACT_VERSION), xsecappid (equals xhs-pc-web), abRequestId (a UUID). Missing any of these will cause the server to treat it as an abnormal client.

Besides cookies, headers must also be strictly aligned. The Chrome version number in the UA (e.g., Chrome/147) must be consistent with the Chromium version number in sec-ch-ua, and sec-ch-ua-platform and sec-ch-ua-mobile must also be fully provided. If even one item does not match, the signature API will return 461.
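A mismatch like this is easy to catch before sending anything. The check below is a sketch under a few assumptions: header names are stored in lowercase, the sec-ch-ua value contains a "Chromium";v="NNN" entry as current Chrome sends it, and `chrome_headers_consistent` is a hypothetical name.

```python
import re

def chrome_headers_consistent(headers: dict) -> bool:
    """True when the UA and client hints agree on the Chrome major version."""
    ua_major = re.search(r"Chrome/(\d+)", headers.get("user-agent", ""))
    hint_major = re.search(r'"Chromium";v="(\d+)"', headers.get("sec-ch-ua", ""))
    if not ua_major or not hint_major:
        return False
    if ua_major.group(1) != hint_major.group(1):
        return False
    # The platform and mobile hints must be present as well
    return all(headers.get(k) for k in ("sec-ch-ua-platform", "sec-ch-ua-mobile"))
```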

Version Synchronization: Don't hardcode ARTIFACT_VERSION

If ARTIFACT_VERSION is hardcoded, it will break sooner or later—in over a year, the online version climbed all the way from 4.83.1 to 6.7.0 (LANGUAGE_VERSION changed from 4.2.6 to 4.3.5), roughly one major version per quarter. The most typical symptom of falling behind in version is directly getting 471 verifyType=290 at the shield/webprofile stage:

JSON

{"msg": "The current version is too low, please refresh the page or close and reopen the page", "code": 300042}

A safe approach is to pull https://www.xiaohongshu.com/ at startup, find <script src="...vendor-dynamic.xxx.js"> from the returned HTML, download it, and use regex to extract the version number:

PYTHON

import re
import requests

html = requests.get("https://www.xiaohongshu.com/").text
m = re.search(r'/vendor-dynamic\.([a-f0-9]+)\.js', html)     # literal dots around the content hash
if m is None:
    raise RuntimeError("vendor-dynamic script not found; fall back to local configuration")
js = requests.get(f"https://static-resource.xhscdn.com/.../vendor-dynamic.{m.group(1)}.js").text
artifact_version = re.search(r'artifactVersion.*?(\d+\.\d+\.\d+)', js).group(1)
language_version = re.search(r'languageVersion.*?(\d+\.\d+\.\d+)', js).group(1)

Use the same method to extract sdkVersion and appId. If they cannot be extracted, fall back to local configuration, but keep that configuration up to date; don't leave values from two years ago in it.
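The extract-or-fall-back logic can be wrapped in one function. This is a sketch: `extract_version` and `FALLBACK_VERSIONS` are hypothetical names, the fallback values are the versions quoted in this article, and the sample JS string in the usage lines is made up for illustration.

```python
import re

# Fallback values are the versions quoted in this article; refresh them regularly
FALLBACK_VERSIONS = {"artifactVersion": "6.7.0", "languageVersion": "4.3.5"}

def extract_version(js: str, key: str, fallback: dict = FALLBACK_VERSIONS) -> str:
    """Return the first x.y.z after `key` in the bundle, else the local fallback."""
    m = re.search(rf"{re.escape(key)}.*?(\d+\.\d+\.\d+)", js)
    return m.group(1) if m else fallback[key]

sample_js = 'artifactVersion:"6.8.1",languageVersion:"4.3.6"'   # illustrative only
print(extract_version(sample_js, "artifactVersion"))            # 6.8.1
print(extract_version("", "languageVersion"))                   # 4.3.5
```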

After the Session is built, try two URLs in order:

PYTHON

candidates = [
    f"https://www.xiaohongshu.com/discovery/item/{note_id}"
    f"?app_platform=ios&app_version=9.22.1&share_from_user_hidden=true"
    f"&xsec_source=app_share&type=normal&xsec_token={quote(xsec_token, safe='')}",
    f"https://www.xiaohongshu.com/explore/{note_id}"
    f"?xsec_token={quote(xsec_token, safe='')}&xsec_source=pc_feed",
]

The first simulates the origin of an App share, and the second simulates clicking in from the PC feed. The server will fill noteDetailMap[<note_id>].note directly into the HTML and return it. The request headers must imitate a document-level navigation for the server to treat it as a real browser:

PYTHON

doc_headers = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/avif,image/webp,image/apng,*/*;q=0.8",
    "referer": "https://www.xiaohongshu.com/",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "same-origin",
    "upgrade-insecure-requests": "1",
}

xsec_token must be fully escaped with quote(token, safe='') before being embedded in the URL, otherwise the = at the end will be treated as a URL query separator.
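The effect of safe='' is easiest to see on a token containing +, / and =. In this sketch the token value is invented and `build_explore_url` is a hypothetical helper that follows the second candidate URL format used in this article.

```python
from urllib.parse import quote, urlparse, parse_qs

def build_explore_url(note_id: str, xsec_token: str) -> str:
    # safe='' also escapes "/", which quote() leaves alone by default
    return (f"https://www.xiaohongshu.com/explore/{note_id}"
            f"?xsec_token={quote(xsec_token, safe='')}&xsec_source=pc_feed")

url = build_explore_url("6955f790000000001f0042e2", "AB+c/d=")   # made-up token
print(url)
# Round trip: the server-side query parser recovers the original token
print(parse_qs(urlparse(url).query)["xsec_token"][0])            # AB+c/d=
```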

Extracting INITIAL_STATE from HTML

After a certain update, the build of Xiaohongshu's PC frontend changed the assignment to a tight syntax window.__INITIAL_STATE__={...} (it used to be window.__INITIAL_STATE__ = {...}, with a space on each side). The old code html.find("window.__INITIAL_STATE__ = ") directly returns -1, the entire HTML parsing dies at the first step, and the whole system silently degrades to the <meta og:*> fallback—no errors on the surface, but actually only the title and cover remain.

A safe way to write it is to use regex to match the assignment symbol, and then start from the first { to do a brace balance scan to accurately intercept the entire JSON object, while avoiding brackets in string literals:

PYTHON

def extract_assigned_json_object(html: str, var_name: str) -> str:
    m = re.search(rf"{re.escape(var_name)}\s*=\s*", html, flags=re.IGNORECASE)
    if not m:
        return ""
    # Skip whitespace after the equals sign, locate the first '{'
    i = m.end()
    while i < len(html) and html[i].isspace():
        i += 1
    if i >= len(html) or html[i] != "{":
        return ""

    depth, in_string, quote_char, escaped = 0, False, "", False
    start = i
    for cursor in range(i, len(html)):
        ch = html[cursor]
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == quote_char:
                in_string = False
            continue
        if ch in ('"', "'"):
            in_string, quote_char = True, ch
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return html[start : cursor + 1]
    return ""

The extracted text is not yet valid JSON, because the Xiaohongshu frontend leaves bare undefined literals in it:

"someField": undefined,
"someArray": [undefined]

json.loads will explode directly, so do two rounds of replacement first:

PYTHON

def sanitize_initial_state(raw: str) -> str:
    s = re.sub(r":\s*undefined(?=[,}])", ": null", raw)
    s = re.sub(r"\[\s*undefined\s*\]", "[null]", s)
    return s

state = json.loads(sanitize_initial_state(extract_assigned_json_object(html, "window.__INITIAL_STATE__")))
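The two regexes cover the shapes shown above but not, for example, [undefined, 1], and they could in principle rewrite the word inside a string value. A string-aware variant is sketched below as an alternative, assuming the state uses double-quoted JSON strings; `sanitize_undefined` is a hypothetical name, not the function used in the parser.

```python
import json
import re

# Match a whole double-quoted string OR a bare `undefined` token
_TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|\bundefined\b')

def sanitize_undefined(raw: str) -> str:
    def repl(m):
        tok = m.group(0)
        # String literals are returned untouched; only bare tokens become null
        return tok if tok.startswith('"') else "null"
    return _TOKEN.sub(repl, raw)

raw = '{"a": undefined, "b": [undefined, 1], "c": "undefined, \\"x\\" undefined"}'
print(json.loads(sanitize_undefined(raw)))
```

Matching strings first means the regex engine consumes them as a unit, so an undefined that appears inside a note's text is never touched.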

The note body is in state["note"]["noteDetailMap"][<note_id>]["note"]. If the first of the two candidate URLs hits, return directly; if neither hits, noteDetailMap may only have an "undefined" placeholder key left (the note is risk-controlled or deleted). The code specifically leaves a fallback to "find the first non-undefined key":

PYTHON

def choose_note_payload(state: dict, note_id: str) -> dict:
    detail_map = (state.get("note") or {}).get("noteDetailMap") or {}
    if note_id and note_id in detail_map:
        return (detail_map[note_id] or {}).get("note") or {}
    for key, value in detail_map.items():
        if key != "undefined" and isinstance(value, dict):
            return value.get("note") or {}
    return {}

After getting note, the fields are all structured: note["title"], note["desc"], note["user"]["nickname"], note["user"]["userId"], note["interactInfo"]["likedCount"] (readable format "7.9万"), note["tagList"] is [{"id", "name", "type"}, ...], each item in note["imageList"] comes with urlDefault / urlPre / url / infoList (multi-resolution variants) and a livePhoto flag, and note["video"]["media"]["stream"] contains master URLs and backup URLs for h264 / h265 / av1.

Supplementing Comments and Auxiliary Data

Comments are not in the HTML and must be requested separately via /api/sns/web/v2/comment/page. This API, like /api/sns/web/share/code (share code) and /api/sns/web/v2/widgets (widget info), goes through the complete signature process—it uses the same session and the same encryption headers as /v1/feed. In other words, once the PC session is nurtured cleanly, getting auxiliary data like comments, share codes, and widget info is basically effortless.

Actual Test

Route 2 runs the three test links on the same session. The main fields match Route 1, with one exception: the live photo's like count, which is 79000 here and the fragmentary 7 in Route 1:

Field	Image Note	Video Note	Live Photo Note
note_id	6955f790...	69e3114b...	69e1594b...
title	𝐰𝐞𝐜𝐡𝐚𝐭｜情侣头像	仲夏可可很萌！	🇫🇷尼斯老城遇到杨超越董思成
content_type	image	video	live_photo
images	12	—	3
videos	—	1 master URL	3 motion (one paired with each static image)
stats	857 / 22 / 290 / 200	1209 / 154 / 160 / 42	79000 / 3419 / 5392 / 5024

Compared to Route 1, Route 2 has two advantages:

The image URL defaults to the unwatermarked !nd_dft_wlteh_jpg_3 variant, so you can get clean images without changing the domain.
Statistics are structured fields (the live photo above is 79000 at the HTML level, not the fragmented 7 in Route 1).

The cost is that you have to run the complete 9-step bootstrap every time, with a cold start of 10–30 seconds; if you want to get the comment body, you have to go through the signature API again.

Route 3: Using /api/sns/web/v1/feed

HTML can get the vast majority of fields, but there are a few scenarios where only the API is clean. For example, complete infoList multi-resolution variants, multiple backup CDN URLs for videos, precise readable format statistics, the stream.h264 structure of live photos, and complete metadata for certain topics. Therefore, a path to directly call the feed API is left at the innermost layer.

The core of getting through this route lies in two things: first, the session and cookie generation chain in Route 2 must be built first (/v1/feed reuses the same session, a1 / web_session / gid are all indispensable); second, each request must carry 5 self-calculated signature headers. Let's talk about them one by one below.

Request Signature Overview

Header	Construction Method
`x-s`	124-byte array concatenated from 11 segments of byte streams → bitwise XOR with fixed 124-byte key → Base58 encoding → `mns0101_` prefix → wrapped in an outer layer of custom Base64 → `XYS_` prefix
`x-s-common`	18-field subset of fingerprint encrypted by ARC4 to get `b1` → outer layer contains a dictionary of `plat_from_code` / `language_version` / `artifact_version` / `cookie_a1` / `b1` / MRC checksum → custom Base64
`x-b3-traceid`	16-character random hex string
`x-xray-traceid`	`(timestamp_ms << 23) | seq` padded to 16 hex characters + 64-bit random number padded to 16 hex characters, concatenated into 32 hex characters
`x-t`	Current millisecond timestamp

The last three of these five headers are pure local calculations and can be implemented at a glance. The real highlights are x-s and x-s-common—let's break them down below.

x-s: 11 Segments of Bytes Concatenated, then XOR + Base58

In the browser, this header is calculated by window.mnsv2(). The actual logic is compiled into custom VM bytecode running in the JSVMP virtual machine. The static restoration method is to interpret the bytecode instruction by instruction, restoring the following Python snippet:

PYTHON

def _encrypt_headers_x3(a1, loadts, uri, params=None, data=None):
    # Concatenate query/body after uri to calculate the signature together
    if params:
        uri = f"{uri}?{urlencode(params).replace('%2C', ',')}"
    if data is not None:
        uri = uri + json.dumps(data, separators=(',', ':'))

    md5_url = hashlib.md5(uri.encode()).hexdigest()
    random_num = int(random.random() * 4294967295)
    timestamp = int(time.time() * 1000)

    # 11 segments of byte streams = 4+4+8+8+4+4+4+8+53+11+16 = 124 bytes
    part1  = [119, 104, 96, 41]                                      # Fixed magic
    part2  = list(random_num.to_bytes(4, 'little'))
    # timestamp 8 bytes LE, byte 0 changed to sum(b[1:5])+sum(b[5:8]) lower 8 bits, all bytes XOR 41
    b = list(timestamp.to_bytes(8, 'little'))
    b[0] = (sum(b[1:5]) + sum(b[5:8])) & 0xFF                        # checksum replaces the lowest byte
    part3  = [x ^ 41 for x in b]
    part4  = list(int(loadts).to_bytes(8, 'little'))                 # loadts may arrive as the cookie string
    part5  = list((int(random.random() * 99) + 1).to_bytes(4, 'little'))
    part6  = list((1293).to_bytes(4, 'little'))                       # Number of window properties
    part7  = list(len(uri.encode()).to_bytes(4, 'little'))
    part8  = [b ^ (random_num & 255) for b in bytes.fromhex(md5_url)][:8]
    part9  = [len(a1)] + list(a1.encode())                             # 53
    part10 = [len('xhs-pc-web')] + list(b'xhs-pc-web')                 # 11
    part11 = [1, (random_num & 255) ^ 115,
              249, 83, 103, 103, 201, 181, 131, 99, 94, 7, 68, 250, 132, 21]

    raw = part1 + part2 + part3 + part4 + part5 + part6 + part7 \
        + part8 + part9 + part10 + part11
    encrypted = [i ^ j for i, j in zip(raw, XOR_KEY_124)]               # Fixed key
    return "mns0101_" + base58_encode(bytes(encrypted), CUSTOM_BASE58_TABLE)
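The snippet calls base58_encode(bytes, table) without showing it. A generic table-driven encoder is sketched below as an assumption about its shape: CUSTOM_BASE58_TABLE is not listed in this article, so the Bitcoin alphabet stands in purely for demonstration, and the leading-zero handling follows the Bitcoin convention, which may differ from Xiaohongshu's variant.

```python
# Stand-in alphabet for the demo; the real CUSTOM_BASE58_TABLE is not shown here
BITCOIN_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58_encode(data: bytes, table: str) -> str:
    num = int.from_bytes(data, "big")
    digits = []
    while num > 0:
        num, rem = divmod(num, 58)
        digits.append(table[rem])
    # Assumption: each leading zero byte maps to table[0] (Bitcoin convention)
    pad = len(data) - len(data.lstrip(b"\x00"))
    return table[0] * pad + "".join(reversed(digits))

def base58_decode(text: str, table: str) -> bytes:
    num = 0
    for ch in text:
        num = num * 58 + table.index(ch)
    pad = len(text) - len(text.lstrip(table[0]))
    return b"\x00" * pad + num.to_bytes((num.bit_length() + 7) // 8, "big")

print(base58_encode(b"hello", BITCOIN_ALPHABET))   # Cn8eVZg
```

The decoder is included only so that a round trip can confirm the encoder against a 124-byte input.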

A few details that are easy to get wrong:

In part3, the timestamp bytes are self-checked first and then all XORed with 41. If the check fails (for example, the timestamp you casually provided is wrong), the server will directly return 461.
loadts is not the timestamp of the note API, but the timestamp of "this signature"—before signing, first write loadts = str(int(time.time() * 1000)) back to the cookie jar, so that the cookie echoed back by the server always carries the latest value; then concatenate this millisecond value into the array as part4. In other words, it is recalculated for each signature, and is only one or two milliseconds different from x-t (but they are not the same value and cannot be mixed).
1293 is the hardcoded value of "Object.getOwnPropertyNames(window).length in Chrome". Chromium version iterations will fine-tune this number, and occasionally you have to update it against actual tests in the online browser.
XOR_KEY_124 is a 124-byte constant table ([175, 87, 43, 149, ...]), which is dragged out of the JS all at once to use.
The length of part9 is dynamic—len(a1) + a1 bytes, but a1 is always truncated to 52 bytes, so part9 is always 53 bytes, and the entire array length is stable at 124.
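The part3 self-check from the first point can be isolated and tested on its own. In this sketch `encode_timestamp_part` mirrors the logic of the signing code, while `timestamp_part_is_valid` is only a guess at what a verifier would do with those 8 bytes; both names are hypothetical.

```python
def encode_timestamp_part(timestamp_ms: int) -> list:
    b = list(timestamp_ms.to_bytes(8, "little"))
    # Byte 0 (the lowest 8 bits of the timestamp) is replaced by a checksum
    b[0] = (sum(b[1:5]) + sum(b[5:8])) & 0xFF
    return [x ^ 41 for x in b]

def timestamp_part_is_valid(part: list) -> bool:
    b = [x ^ 41 for x in part]                  # undo the XOR mask
    return len(b) == 8 and b[0] == (sum(b[1:8]) & 0xFF)

part = encode_timestamp_part(0x0102030405060708)
print(part)                                     # [53, 46, 47, 44, 45, 42, 43, 40]
print(timestamp_part_is_valid(part))            # True
```

Note that overwriting byte 0 discards the lowest 8 bits of the millisecond value, so the timestamp carried in the signature is only accurate to about a quarter of a second.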

Wrap another layer on the outside:

PYTHON

p = {
    'x0': LANGUAGE_VERSION, 'x1': 'xhs-pc-web', 'x2': 'Windows',
    'x3': _encrypt_headers_x3(...),
    'x4': '' if data is None else 'object',
}
payload = url_quote(json.dumps(p, separators=(',', ':')))
return "XYS_" + custom_base64(utf8_to_bytes(payload), CUSTOM_BASE64_TABLE)

CUSTOM_BASE64_TABLE is Xiaohongshu's own code table ZmserbBoHQtNP+wOcza/LpngG8yJq42KWYj0DSfdikx3VT16IlUAFM97hECvuRX5—standard base64 cannot decode it, you have to implement a version of the encoder that looks up the table according to the code table.
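Because the custom table is a permutation of the standard alphabet, one way to implement the encoder is to encode with the standard library and translate character by character. The sketch below assumes the = padding is kept unchanged, which this article does not confirm; `custom_base64_decode` is a hypothetical helper added only for round-trip checks.

```python
import base64

STD_BASE64_TABLE = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
CUSTOM_BASE64_TABLE = "ZmserbBoHQtNP+wOcza/LpngG8yJq42KWYj0DSfdikx3VT16IlUAFM97hECvuRX5"

def custom_base64(data: bytes, table: str = CUSTOM_BASE64_TABLE) -> str:
    std = base64.b64encode(data).decode("ascii")
    # Map each standard symbol to the symbol at the same index in the custom table
    return std.translate(str.maketrans(STD_BASE64_TABLE, table))

def custom_base64_decode(text: str, table: str = CUSTOM_BASE64_TABLE) -> bytes:
    std = text.translate(str.maketrans(table, STD_BASE64_TABLE))
    return base64.b64decode(std)
```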

x-s-common: b1 Fingerprint + MRC Checksum

The main body of x-s-common is a dictionary. What's really difficult is the b1 field in it—a subset of the fingerprint is encrypted by ARC4 and then undergoes another custom Base64:

PYTHON

def _encrypt_b1(fp):
    # Pick 18 from the 80+ field fingerprint (x33~x39 + x42~x46 + x48~x52 + x82)
    subset = {k: fp[k] for k in (
        'x33','x34','x35','x36','x37','x38','x39',
        'x42','x43','x44','x45','x46','x48','x49','x50','x51','x52','x82',
    )}
    raw = json.dumps(subset, separators=(',', ':'), ensure_ascii=False).encode()
    cipher = ARC4.new(b'xhswebmplfbt')
    ct = cipher.encrypt(raw).decode('latin1')
    # URL encode and then manually split the percent sequence, the purpose is to split non-ASCII into single bytes
    encoded = url_quote(ct, safe="!*'()~_-")
    b = bytearray()
    for c in encoded.split('%')[1:]:
        b.append(int(c[:2], 16))          # the two hex digits after '%' form one byte
        b.extend(ord(j) for j in c[2:])   # unescaped characters that follow are kept as-is
    return custom_base64(bytes(b), CUSTOM_BASE64_TABLE)
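
The `ARC4.new(...)` call above looks like the PyCryptodome `Crypto.Cipher.ARC4` interface (an assumption; the original does not show its imports). If adding that dependency is inconvenient, the cipher is small enough to write by hand. The following is a pure-Python sketch of standard RC4; the function name `rc4` is introduced here for illustration.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Standard RC4: key scheduling, then XOR the data with the keystream."""
    state = list(range(256))
    j = 0
    for i in range(256):                       # key-scheduling algorithm (KSA)
        j = (j + state[i] + key[i % len(key)]) % 256
        state[i], state[j] = state[j], state[i]
    i = j = 0
    out = bytearray()
    for byte in data:                          # pseudo-random generation (PRGA)
        i = (i + 1) % 256
        j = (j + state[i]) % 256
        state[i], state[j] = state[j], state[i]
        out.append(byte ^ state[(state[i] + state[j]) % 256])
    return bytes(out)
```

RC4 is symmetric, so `rc4(key, rc4(key, raw)) == raw`, which gives a cheap self-check before comparing against values captured from the browser.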

Outer dictionary:

PYTHON

source = {
    's0': 5,                       # GET_PLAT_FROM_CODE (Windows takes the other branch = 5)
    's1': '',
    'x0': '1',                     # localStorage.getItem("b1b1"), hardcoded
    'x1': LANGUAGE_VERSION,        # 4.3.5
    'x2': 'Windows',
    'x3': 'xhs-pc-web',
    'x4': ARTIFACT_VERSION,        # 6.7.0
    'x5': cookie_a1,
    'x6': '', 'x7': '',            # Old versions were XS / XT, now hardcoded
    'x8': b1,
    'x9': diy_mrc('' + '' + b1),   # Self-implemented CRC32, verify b1
    'x10': fp['x39'],              # Call counter, hardcoding it also passes
    'x11': 'normal',
}
return custom_base64(utf8_bytes(url_quote(json.dumps(source, separators=(',', ':'), ensure_ascii=False))),
                     CUSTOM_BASE64_TABLE)

diy_mrc is a modified CRC32: the table is generated from the standard polynomial 0xedb88320, but the last step applies a JS-style int32 wraparound to the accumulated value (num >= 2**31 ? num - 2**32 : num, after masking to 32 bits). Python integers are unbounded, so without this wraparound the result does not match JavaScript's signed 32-bit behavior, and the server rejects x9 as invalid.
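
The signed wraparound is the part that is easiest to get wrong in Python. The sketch below shows only that step, using `zlib.crc32` (the standard CRC32 over the same reflected polynomial) as a stand-in. It is an assumption that the rest of diy_mrc matches standard CRC32, so treat `crc32_signed` as an illustration of the int32 conversion, not a drop-in replacement.

```python
import zlib


def to_int32(num: int) -> int:
    """Reinterpret the low 32 bits of num as a signed 32-bit integer, as JS bitwise ops do."""
    num &= 0xFFFFFFFF
    return num - 2**32 if num >= 2**31 else num


def crc32_signed(data: bytes) -> int:
    """Standard CRC32 reported the way JavaScript would see it (possibly negative)."""
    return to_int32(zlib.crc32(data))
```

For example, the well-known CRC32 check value 0xCBF43926 for `b"123456789"` has its top bit set, so it comes out negative after the conversion.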

The Other Three Headers

The remaining three headers are relatively simple:

PYTHON

import random
import time

x_b3_traceid   = ''.join(random.choices('abcdef0123456789', k=16))
x_t            = str(int(time.time() * 1000))

# x-xray carries a seq that auto-increments on each call; it is suspected to be
# the same kind of behavior counter as x10 in x-s-common
seq = initial_seq + call_count        # both are per-session state kept by the caller
part1 = hex((int(time.time() * 1000) << 23) | seq)[2:].zfill(16)
part2 = hex((random.getrandbits(32) << 32) | random.getrandbits(32))[2:].zfill(16)
x_xray_traceid = part1 + part2

A common misconception is "the bytecode behind the signature keeps changing anyway, so it is better to reuse an implementation maintained by someone else". Actual testing shows this is only half right: the VM instruction set does not change; only the constants used by certain operations in the bytecode do (the 124-byte XOR key, the Base58/Base64 code tables, the part11 tail, the part1 magic, and the d[92]/d[93]/675 offsets in websectiga). Once you have a working static implementation, the next time the server rotates a key, diffing the old and new vendor-dynamic.xxx.js locates the constants to replace in about ten minutes. With a third-party implementation, if it is not updated, all you can do is wait. The biggest gain from reverse engineering it yourself is knowing which constants change and which structures do not; the structural layer almost never changes.

Fingerprint context is dynamic

An easily overlooked detail is that the fingerprint is not computed once and reused. b1 in x-s-common is recalculated from the current fp on every request, and fp is updated according to storage_state (page state) and page_context (current URL and referer). For example, a note-detail call should set the context to explore/{note_id}, and a homepage-recommendation call should set it to the homepage. If the page the server believes you are on does not match the request itself, risk control rejects the request outright. Before calling feed, you must first switch page_context to explore/{note_id} and only then compute the signature. If you forget this step, you get 461 verifyType=301.

page_context is mapped to page_type based on the path in the URL:

PYTHON

from urllib.parse import urlparse

def normalize_page_context(ctx: dict | None) -> dict:
    out = {
        "location": "https://www.xiaohongshu.com/explore",
        "referer": "https://www.xiaohongshu.com/",
        "page_type": "explore",
    }
    if ctx:
        out.update({k: v for k, v in ctx.items() if v is not None})
    path = urlparse(out["location"]).path or "/"
    if "/search_result" in path:
        out["page_type"] = "search"
    elif "/user/profile" in path:
        out["page_type"] = "user_profile"
    elif "/explore/" in path or "noteId=" in out["location"]:
        out["page_type"] = "note_detail"
    else:
        out["page_type"] = "explore"
    return out

The fingerprint itself is a dictionary of 80+ fields fp = {"x1": ua, "x2": "false", "x3": "zh-CN", "x4": 24, ...}—UA, language, color depth, device memory, CPU, screen resolution ("1920;1080"), available area ("1920;1040"), time zone (-480, "Asia/Shanghai"), GPU vendor/renderer, plugins, canvas fingerprint (x22), voice hash (x53), WebGL extension hash (x56), cookie copy (x57), DOM-related counts (x58 div / x59 resource / x61 window.* property count / x73 DOM node count), {referer, location, frame} in x66, {prefix}|{window_props}|{script_count} string in x69, and some hardcoded magic fields (x30 "swf object not loaded", x45 "__SEC_CAV__1-1-1-1-1|__SEC_WSA__|", etc.). The role of the entire dict is to describe the state of "the browser on a certain URL at a certain moment".

Before each call to an encrypted API, the fields in fp that depend on the current request are recalculated:

PYTHON

# Key: fields affected by the request
fp["x39"] = str(storage["sc"])                       # session counter, +1 per request
fp["x44"] = str(int(time.time()*1000))               # Current millisecond
fp["x57"] = "; ".join(f"{k}={v}" for k, v in cookies.items())  # Current cookie snapshot
fp["x58"] = str(div_count)                           # Number of DOM divs under the current page_type
fp["x59"] = str(resource_count)
fp["x61"] = str(window_props)
fp["x66"] = {"referer": ctx["referer"], "location": ctx["location"], "frame": 0}
fp["x69"] = f"{prefix}|{window_props}|{script_count}"
fp["x73"] = str(dom_count)

x58/x59/x73 have different baselines on different page_types: explore is (204, 14, 1240), note_detail adds (+36, +12, +420), search adds (+18, +8, +180), user_profile adds (+22, +10, +260). These baselines are values captured against real browsers on various pages. Fine-tuning one or two numbers does not affect risk control, but the overall magnitude and page_type must match.
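
The baseline arithmetic can be written down as a small lookup. This sketch assumes the "adds" figures are offsets relative to the explore baseline; `dom_counts` and the two constant names are illustrative, not from the original code.

```python
EXPLORE_BASELINE = (204, 14, 1240)   # (x58 div count, x59 resource count, x73 DOM node count)
PAGE_TYPE_OFFSETS = {
    "explore": (0, 0, 0),
    "note_detail": (36, 12, 420),
    "search": (18, 8, 180),
    "user_profile": (22, 10, 260),
}


def dom_counts(page_type: str) -> tuple:
    """Return (x58, x59, x73) for a page_type; unknown types fall back to explore."""
    offsets = PAGE_TYPE_OFFSETS.get(page_type, (0, 0, 0))
    return tuple(base + off for base, off in zip(EXPLORE_BASELINE, offsets))
```

Under that reading, note_detail comes out as (240, 26, 1660) and search as (222, 22, 1420).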

Feed Request

The POST body of /api/sns/web/v1/feed is very simple:

PYTHON

data = {
    "source_note_id": note_id,
    "image_formats": ["jpg", "webp", "avif"],
    "extra": {"need_body_topic": "1"},
    "xsec_source": "pc_feed",               # Taken from the query of the share URL
    "xsec_token": xsec_token,
}

The signature input is (a1, loadts, "/api/sns/web/v1/feed", None, data): data is JSON-serialized with separators=(',', ':'), appended directly to the URI, and the concatenated string is what gets MD5-hashed. The response body looks like:

JSON

{
  "code": 0, "success": true, "msg": "成功",
  "data": {
    "items": [{
      "id": "...", "model_type": "note",
      "note_card": {
        "note_id": "...", "type": "video",
        "title": "...", "desc": "...",
        "user": {"user_id": "...", "nickname": "..."},
        "interact_info": {"liked_count": "7.9万", "comment_count": "3419", ...},
        "image_list": [{"url_default": "...", "info_list": [...], "live_photo": true}],
        "video": {"media": {"stream": {"h264": [{"master_url": "...", "backup_urls": [...]}]}}},
        "tag_list": [{"id": "...", "name": "...", "type": "topic"}]
      }
    }]
  }
}

Take items[0].note_card for downstream processing.
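
Going back to the signing input for this request, the "URI plus compact JSON body, then MD5" step can be sketched as follows. The function name `signing_digest` is hypothetical, and `ensure_ascii=False` is an assumption: the text only specifies the separators.

```python
import hashlib
import json
from typing import Optional


def signing_digest(uri: str, data: Optional[dict]) -> str:
    """MD5 hex digest of the URI followed by the compact JSON body (empty for GET)."""
    body = "" if data is None else json.dumps(
        data, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.md5((uri + body).encode("utf-8")).hexdigest()
```

The compact separators matter: the default `json.dumps` output contains spaces after `,` and `:`, which changes the digest and therefore the signature.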

Initial Session Quality and Auto-Retry

Even if the signature, cookies, and bootstrap are all correct, the first feed request from a newly created session can still come back as a successful shell with no data (success=True but items is empty). This is not a code issue but a probabilistic demotion the server applies to the first access by a brand-new identity. With the same code and the same note, a request that returns empty this time will yield data after rebuilding the session and trying again.

How to recognize it: success=True and items is empty. The fix is an auto-retry loop: delete the persisted device state (device_state.json), rerun the 9-step bootstrap to get a new identity, and hit the feed again, up to 3 rounds per call. The cold-start hit rate of a single attempt is about 70-90%, and retrying raises it to 95%+:

PYTHON

MAX_ATTEMPTS = 3
note_card = {}
last_err = None                                        # last captcha error seen, kept for diagnostics
for attempt in range(1, MAX_ATTEMPTS + 1):
    if attempt > 1:
        DEVICE_STATE_FILE.unlink(missing_ok=True)     # Delete persisted device state
        session = await create_xhs_session()           # Rerun 9-step bootstrap to change identity
    try:
        resp = await session.apis.note.note_detail(note_id, xsec_token)
        payload = await resp.json(content_type=None)
        items = (payload.get("data") or {}).get("items") or []
        if items and (items[0].get("note_card") or {}):
            note_card = items[0]["note_card"]
            break                                      # Success
    except NeedCaptchaVerify as e:
        # 216 semantic captcha / 102 slider verification; catch it and let the next round reset identity and retry
        last_err = f"attempt={attempt} verify={e.details.get('verifyType')}"
    finally:
        await session.close_session()

The return structure also carries two observability fields, feed_attempts (how many attempts were made) and feed_recovered_on_attempt (which round succeeded), so you can see at a glance whether a retry saved the call. Routes that do not go through feed always report 1/1; only the API route varies.

Actual Test

Image Note:

note_id      : 6955f790000000001f0042e2
title        : 𝐰𝐞𝐜𝐡𝐚𝐭｜情侣头像
author       : zhang / 5cd3f6730000000012033a83
content_type : image
images       : 12 images (each with infoList multi-resolution, url_pre, url_default, url)
stats        : liked=857  comment=22  collect=290  share=200
publish      : 2026-01-01T12:26:56
feed_attempts=2  feed_recovered_on_attempt=2   ← Initial jitter this time, recovered in the second round

Video Note:

note_id      : 69e3114b000000002202916e
title        : 仲夏可可很萌！
author       : 用眼泪把你复习一遍 / 6690bced000000000f0348e9
content_type : video
videos       : 6 items (master URL + multiple CDN backups)
stats        : liked=1209  comment=154  collect=160  share=42
publish      : 2026-04-18T13:06:19

Live Photo Note:

note_id      : 69e1594b000000000b010eaf
title        : 🇫🇷尼斯老城遇到杨超越董思成
author       : 喵了个汪 / 6161f5460000000002022ced
content_type : live_photo
images       : 3 static images
videos       : 6 items (each static image paired with a motion, stream.h264 structure is regular)
stats        : liked=7.9万  comment=3419  collect=5392  share=5024   ← Readable format
publish      : 2026-04-17T05:48:59

Route 3 has the finest field granularity: multiple backup URLs for videos, a regular motion-stream structure for live photos, and statistics in the readable display format (7.9万, i.e. 79,000). The cost is that every call runs the complete 9-step bootstrap and computes the JS signature locally, the constants in the signature may need updating when the server version changes, and occasional retries are required.

Combining the Three Routes

In actual operation, the three routes are stacked together sorted by data quality:

step 1  /v1/feed + self-calculated x-s / x-s-common        → source=feed_api         Cleanest
step 2  PC session opens share page, reads __INITIAL_STATE__  → source=initial_state    Complete note
step 3  Reads <meta og:*> tags                       → source=meta_fallback    Only title and cover left

Each output comes with an extract_source field, so you can see at a glance which tier the data is in this time.
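
The stacking itself is a plain ordered fallback. A minimal sketch, assuming each tier is a callable that takes the URL and returns a dict (or nothing) and may raise; `extract_with_fallback` and the tier callables are hypothetical names, not the original implementation.

```python
from typing import Callable, List, Optional, Tuple

Extractor = Callable[[str], Optional[dict]]


def extract_with_fallback(url: str, tiers: List[Tuple[str, Extractor]]) -> dict:
    """Try each tier in order; return the first non-empty result, tagged with its tier."""
    errors = []
    for source, extractor in tiers:
        try:
            note = extractor(url)
        except Exception as exc:               # a failed tier must not block the next one
            errors.append(f"{source}: {exc}")
            continue
        if note:                               # empty dict / None counts as a miss
            return {**note, "extract_source": source}
    raise RuntimeError("all tiers failed: " + "; ".join(errors))
```

Passing the tiers as `[("feed_api", ...), ("initial_state", ...), ("meta_fallback", ...)]` reproduces the order above, and the collected errors keep the reason each higher tier was skipped.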

Route 1 is a separate entry point: it suits scenarios that only need the images plus a little metadata, requires no PC session, and costs roughly one-tenth of the full path overall. Putting the results of the three test links on the three routes side by side, the routes turn out to be almost equivalent in what data can be obtained; the real difference is the cost of establishing the connection:

Route	Session Cost	Signature Cost	Data Freshness	Comment Content	Image Watermark
Mobile UA + CDN Replacement	None	None	SSR Snapshot	Cannot get	`!h5_1080jpg` included (change domain to resolve)
PC Session + `__INITIAL_STATE__`	9-step bootstrap	None	Real-time	Requires separate API call	Not included by default
PC Session + `/v1/feed` API	9-step bootstrap + fingerprint	Self-calculated x-s / x-s-common	Real-time	Returned together	Not included by default

Unified output structure (all three routes are normalized to this shape for the convenience of downstream):

PYTHON

{
    "source": "mobile | html | api",
    "extract_source": "feed_api | initial_state | meta_fallback",
    "original_url": "...",
    "final_url": "...",
    "note_id": "...",
    "xsec_token": "...",
    "title": "...", "content": "...",
    "content_type": "image | video | live_photo",
    "author": {"name": "...", "user_id": "..."},
    "stats": {"liked": "...", "comment": "...", "collect": "...", "share": "..."},
    "topics": ["..."],
    "publish_time_ms": 0, "publish_time_iso": "...",
    "images": [{"index", "id", "url", "raw_url", "live_photo"}],
    "videos": ["..."],
    "live_photos": [{"index", "image_url", "video_url"}],
    "feed_attempts": 1,
    "feed_recovered_on_attempt": 1,
}

Image Variant Lineage

Looking at the actual download results of the three routes together, although the image_id is exactly the same, the "default variant" given by the CDN differs significantly:

Source	URL Suffix	Size	Watermark
Mobile HTML (`!h5_1080jpg`)	Scaled to 1080, jpg	~77 KB	Included
Desktop HTML `__INITIAL_STATE__` (`!nd_dft_wlteh_jpg_3`)	Default processing, jpg	~147 KB	Not included
API `/v1/feed` (`!nd_dft_wlteh_webp_3`)	Default processing, webp	~42 KB	Not included
Change to `sns-img-qc.xhscdn.com/`	No suffix, no signature	~147 KB	Not included

A few rules worth noting down:

The watermark is a feature of the specific suffix variant !h5_1080jpg, not a feature of the sns-webpic-qc domain. The !nd_dft_wlteh_* variants under the same domain do not add watermarks by default.
The jpg variant given by the desktop HTML is almost the same size as the bare original image (147KB vs 147KB), indicating that variant is nearly lossless; changing the domain is not necessary on this route.
The webp variant given by the API is much smaller (42KB) because it is optimized for size. To get the original image, you have to change the domain.
The image_id of the three routes is exactly the same. So as long as you know the URL given by any route, you can get any other variant (including the unwatermarked original image) by "skipping date and sign, keeping image_id, and changing the domain".
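
The last rule can be sketched as a small URL rewrite. This assumes the signed URL has the path shape `/<date>/<sign>/<image_id>!<suffix>`, which is inferred from the "skip date and sign, keep image_id" description and not confirmed here; `to_original_image_url` is an illustrative name and the sample URL in the usage note is made up.

```python
from urllib.parse import urlparse

ORIGINAL_HOST = "sns-img-qc.xhscdn.com"


def to_original_image_url(cdn_url: str) -> str:
    """Drop the date/sign segments and the !suffix, keep image_id, switch the host."""
    path = urlparse(cdn_url).path.lstrip("/")
    path = path.split("!", 1)[0]                 # remove the processing suffix
    parts = path.split("/")
    # Assumed layout: parts[0] is the date, parts[1] the signature, the rest the image_id.
    image_id = "/".join(parts[2:]) if len(parts) > 2 else parts[-1]
    return f"https://{ORIGINAL_HOST}/{image_id}"
```

For instance, `https://sns-webpic-qc.xhscdn.com/202601011226/0123abcd/1040gexample!h5_1080jpg` would map to `https://sns-img-qc.xhscdn.com/1040gexample`.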

Troubleshooting Order: Where to start checking when the link breaks

When the link breaks, the worst approach is guessing backward from the last step. The right direction is to work forward from the earliest, cheapest check; each layer has its own specific signals:

Layer 1: Entry Parameters

First confirm that note_id and xsec_token are parsed correctly, especially whether the = at the end of xsec_token has been eaten. Troubleshooting method: print the parsed note_id and xsec_token, URL-quote the token, and manually assemble an explore/{note_id}?xsec_token=... URL to open in the browser. If you can see the note, the entry parameters are fine.
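
A minimal sketch of that check using only the standard library. `parse_entry` and `explore_url` are hypothetical helper names, and the token in the usage note is a made-up placeholder. `parse_qs` splits each pair on the first `=` only, so trailing padding in the value survives.

```python
from urllib.parse import parse_qs, quote, urlparse


def parse_entry(final_url: str) -> dict:
    """Pull note_id from the last path segment and the xsec_* values from the query."""
    parsed = urlparse(final_url)
    query = parse_qs(parsed.query)
    return {
        "note_id": parsed.path.rstrip("/").rsplit("/", 1)[-1],
        "xsec_token": query.get("xsec_token", [""])[0],
        "xsec_source": query.get("xsec_source", [""])[0],
    }


def explore_url(note_id: str, xsec_token: str) -> str:
    """Rebuild a browser-checkable URL; the token is quoted so '=' becomes %3D."""
    token = quote(xsec_token, safe="")
    return f"https://www.xiaohongshu.com/explore/{note_id}?xsec_token={token}"
```

Feeding it `.../discovery/item/6955f790000000001f0042e2?xsec_token=ABtoken=&type=normal` should give back `ABtoken=` with the trailing `=` intact; if the printed token has lost it, the bug is in the entry layer and nothing downstream needs checking yet.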

Layer 2: Version Number

The typical symptom of hitting this layer is shield/webprofile returning 471 verifyType=290 ("当前版本过低", i.e. "current version too low"). Troubleshooting: check whether the current ARTIFACT_VERSION / LANGUAGE_VERSION matches the one extracted from the online vendor-dynamic.xxx.js; if not, synchronize it manually and rerun.

Layer 3: Bootstrap

If any of the 9 steps fails, everything after it fails as a chain reaction. Troubleshooting: at each step, print the status code and whether the key cookies (websectiga / sec_poison_id / gid / acw_tc / web_session) have been written. Typical failure points:

The type parameter of scripting in steps 6 and 7 is wrong → websectiga cannot be decrypted
The fingerprint in profileData was escaped for non-ASCII by the JSON encoder → webprofile 401 / 461
The cookie jar is not clean before activate → cannot get web_session, or get the wrong prefix
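
The per-step cookie check can be as small as the helper below. It is a sketch that assumes the cookie jar can be viewed as a plain name-to-value dict; `missing_key_cookies` is an illustrative name.

```python
from typing import Dict, List

KEY_COOKIES = ("websectiga", "sec_poison_id", "gid", "acw_tc", "web_session")


def missing_key_cookies(cookies: Dict[str, str]) -> List[str]:
    """Return the key cookies that are absent or empty, in bootstrap order."""
    return [name for name in KEY_COOKIES if not cookies.get(name)]
```

Logging its result after every bootstrap step shows exactly which step stopped shrinking the list.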

Layer 4: Signature (Route 3 Exclusive)

When feed returns 461, look at Verifytype in the header:

Phenomenon	Stage	Meaning	Countermeasure
`461 verifyType=216`	`/v1/feed`	Signature invalid or JSVMP bytecode changed	Align new constants / re-decrypt bytecode
`461 verifyType=301`	`/v1/feed`	Session quality insufficient or `page_context` mismatch	Reset device state, switch to correct `page_type`
`461 verifyType=102`	`/v1/feed`	Trigger slider verification	Pass manually or hook up captcha service
`461 verifyType=216` + HTML	Share page HTML	Semantic captcha (select image and click words)	Parse theme/grid, call specialized branch
`code=0 success=true data={}`	`/v1/feed`	Soft risk control: shell successful but data emptied	First check if `=` at the end of `xsec_token` is truncated; otherwise reset device state and retry
`SSLError / UNEXPECTED_EOF`	Any	TLS fingerprint recognized or network jitter	Retry, change proxy
Redirected to `/login`	HTML share page	Session expired or guest state revoked	Clear device state, reactivate

Be especially alert to code=0 data={}. It looks completely normal, but actually returns nothing, easily fooling rough success judgments.
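
One way to avoid being fooled is to classify every feed response explicitly instead of trusting `success`. The sketch below covers only the /v1/feed rows of the table; the function name and the returned labels are made up for illustration.

```python
from typing import Optional

VERIFY_TYPE_MEANING = {
    "216": "signature invalid or bytecode constants changed",
    "301": "session quality or page_context mismatch",
    "102": "slider captcha",
}


def classify_feed_response(status: int, verify_type: Optional[str],
                           payload: Optional[dict]) -> str:
    """Map an HTTP status, verifyType value, and JSON body to a diagnosis label."""
    if status == 461:
        return VERIFY_TYPE_MEANING.get(str(verify_type), "unknown 461")
    if status != 200:
        return f"http error {status}"
    items = ((payload or {}).get("data") or {}).get("items") or []
    if not items or not items[0].get("note_card"):
        return "soft block: success shell with empty data"   # code=0 but nothing inside
    return "ok"
```

The important branch is the last check: a 200 with `code=0` only counts as success when `items[0].note_card` is actually populated.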

Layer 5: HTML Strong Extraction (Route 2)

If feed keeps failing but the HTML share page can still be opened, fall back to Route 2 and check whether __INITIAL_STATE__ can be extracted. The typical failure is the marker regression mentioned earlier, where the spaces on both sides of the equals sign disappear. The symptom is that extract_source keeps falling to meta_fallback: the title and body are still there, but the author, images, and time are all empty. This is not the site removing capabilities but a regression in the local parser.
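
A whitespace-tolerant extractor avoids this regression. The following is a simplified sketch of the idea (regex marker, brace balancing that respects strings, `undefined` to `null`), not the original `extract_assigned_json_object`; the `undefined` replacement is naive and would also rewrite a matching sequence inside a string value.

```python
import json
import re
from typing import Optional

# \s* on both sides accepts "x = {", "x={", and anything in between.
MARKER = re.compile(r"window\.__INITIAL_STATE__\s*=\s*")
# A bare `undefined` right after ':', ',' or '[' is not valid JSON.
UNDEFINED = re.compile(r"(?<=[:,\[])\s*undefined\b")


def extract_initial_state(html: str) -> Optional[dict]:
    match = MARKER.search(html)
    if match is None or not html.startswith("{", match.end()):
        return None
    start = match.end()
    depth, in_string, escaped = 0, False, False
    for pos in range(start, len(html)):
        ch = html[pos]
        if in_string:                      # braces inside strings must not count
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:                 # matching close of the first '{'
                raw = html[start:pos + 1]
                return json.loads(UNDEFINED.sub("null", raw))
    return None
```

Keeping a test fixture for both the spaced and the compact marker form makes the regression visible before it reaches production.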

Layer 6: Mobile Fallback (Route 1)

When both PC routes are unstable, use the mobile route as the final check. If it also produces no data, the note itself has most likely been deleted or risk control has blacklisted this exit IP. If it returns the basic fields, the content is still alive and only the PC-side state is unclean; reset the session and start over.

Confirmed, Inferred, and Stop Mythologizing

Working through all this produced a lot of experience. It is worth separating it into three categories so that whoever comes next can avoid the same detours.

Confirmed:

Truncating the trailing = of xsec_token causes /v1/feed to return code=0 success=true data={}
Hardcoding the window.__INITIAL_STATE__ marker causes the full HTML extraction to degrade entirely to meta_fallback
/v1/feed currently does fluctuate: a first attempt can fail and then recover after the session is rebuilt
web_session is issued by the server; nothing you assemble locally produces a value that is valid but that the server has never seen

Highly Credible Inferences:

Session quality (cookie completeness + fingerprint consistency + page_context matching) will significantly affect the initial success rate of Route 3
A persisted "stable device state" is more like a real browser than a completely random new device state, with a higher pass rate
The JSVMP VM instruction set is stable in the long term, what changes are mainly the constants in the bytecode—meaning the maintenance cost after reverse engineering it yourself is actually acceptable

Stop Mythologizing:

"web_session with an 03 prefix definitely won't work, and an 04 prefix definitely will" — False. 03 is sufficient for public data, 04 is only necessary on APIs like following feeds / private messages.
"As long as a certain header is added, it will definitely pass feed" — False. The success or failure of feed is the comprehensive result of session quality, cookie combination, fingerprint context, and timing fluctuations.
"Route 1 can only get images" — False. Most fields are actually scattered in the mobile HTML, and scattered regex can scan them out; the comment body indeed cannot be obtained, but everything else can.
"Watermarks are a feature of the sns-webpic-qc domain" — False. Watermarks are a feature of the !h5_1080jpg processing suffix, and !nd_dft_wlteh_* variants of the same domain do not have watermarks by default.

What Xiaohongshu's Anti-Scraping Looks Like

Stacking the three routes together, we can roughly draw Xiaohongshu's protection hierarchy:

Layer	Mechanism	Description
1	xsec_token	URL-level anti-hotlinking token, bound to session, don't get screwed by the `=` at the end
2	JS Cookie Generation	a1/webId/websectiga/gid etc. generated by JS runtime
3	Browser Fingerprint	80+ field device fingerprint, DES encrypted reporting
4	Request Signature x-s	JSVMP virtual machine execution, periodically changing bytecode constants
5	TLS Fingerprint + UA/client hints	Inconsistent UA and sec-ch-ua directly recognized
6	Version Check	`ARTIFACT_VERSION` expired directly 471/290
7	`page_context` Consistency	Request URL and referer/location mismatch triggers 301
8	Semantic Captcha	Type 216, needs to recognize themes and grids in images
9	Behavior Analysis	Comprehensive risk control of frequency, trajectory, timing, etc.

Layer 4 is the toughest nut to crack, but the first 8 layers combined already block 99% of scripts, and most people never even reach Layer 9. JSVMP compiles the signature algorithm into custom bytecode. The VM instruction set is basically stable, and what changes are the constants in the bytecode (XOR key, code tables, magic bytes, etc.), so after reverse engineering it once, maintenance is just a matter of diffing the old and new vendor-dynamic.js to pick out the constants. What really requires daily sparring with the server is session quality: cookie issuance, version number synchronization, and fingerprint consistency are the things in play on every request.

Disclaimer

This article only records personal exploration and thinking during the technical learning process, and is for security research and educational purposes. All technical analyses involved in the text are aimed at publicly accessible pages and network requests, and do not involve any unauthorized access, bulk data scraping, or commercial use. Please abide by the terms of service of the relevant platforms and local laws and regulations, and do not use the content of this article in scenarios that infringe on the legitimate rights and interests of others.

Afterword

The three routes span quite a bit from "bare HTTP" to "complete Web session plus official API", but what really takes time is never the signature algorithm itself, but the quality of the session—whether UA and client hints need to be aligned, whether bootstrap can finish running, whether the web_session obtained is 03 or 04, whether page_context corresponds to the current request. The signature algorithm can be used for a long time once written, while session quality is the daily work of gaming with the server on every request.

Another observation is that the most expensive route (Route 3) and the cheapest route (Route 1) do not differ much in "what data can be obtained"—the field differences returned by the two are mainly in the comment body, multiple backup URLs for videos, and readable format statistics. Layered fallback stacks them together, diluting the cost of the expensive path, while using the extract_source field to explicitly mark which tier was reached each time, so you don't have to guess manually when problems occur.

The pair of CDN domains sns-webpic-qc and sns-img-qc is probably the most suitable example to conclude the entire system. The former takes the signature path and determines the image output specifications based on the URL suffix—whether to add a watermark, resolution, and encoding format are all hidden in the suffix; the latter directly gives the original image based on the bare image_id, without signature, and has no concept of suffix. The two domains share the same image_id namespace—knowing the id of an image on webpic-qc, getting the original image on img-qc is free. It is exactly this point that allows the watermarked version of !h5_1080jpg in the mobile share page to be easily bypassed: you don't need to know how it calculates the signature, just keep the image_id and change the domain. This door may close in the future. But at least for today, it is still open.

Appendix: Minimum Reproduction Checklist

Building the capabilities in this order gets all three routes running; the same list doubles as a troubleshooting checklist:

Entry Layer

Follow xhslink.com/o/... 302, get the final URL
Extract note_id / xsec_token / xsec_source from the final URL, use parse_qs to keep = padding

Route 1 (Zero Session)

iPhone UA requests share page, follow redirects
Scattered regex to extract title/author/body/stats/topics/time, with "Xiaohongshu" placeholder filtering
onix-carousel-item to grab images, _259.mp4 to filter watermarked videos
cdn_strip_watermark change to sns-img-qc to get original images
content_type three-branch determination (video / live_photo / image)

Cookie Generation (Shared by Route 2 and 3)

a1 = (hex(ts) + rand30 + "50000" + crc32)[:52], webId = MD5(a1)
Version sync: extract artifactVersion / languageVersion from online vendor-dynamic.xxx.js
9-step bootstrap runs through in order, check corresponding cookie writing at each step
websectiga's base64 + 5-character grouping + secondary table lookup decoding
gid's JSON → base64 → DES-ECB(zbp30y86) → hex
web_session taken from activate response, distinguish 03 / 04

Route 2 (HTML)

Two candidate URLs (discovery/item + explore), with document-level headers
extract_assigned_json_object: regex marker + brace-balance + string escaping
sanitize_initial_state: two rounds of undefined → null
choose_note_payload: lookup by note_id, fallback to find the first non-undefined key

Route 3 (API)

x-s: 11 segments of bytes → XOR_KEY_124 → Base58(custom) → mns0101_ + outer Base64(custom) + XYS_
x-s-common: ARC4(xhswebmplfbt) encrypt 18-field fingerprint → b1, outer dict + diy_mrc + Base64
x-b3-traceid / x-xray-traceid / x-t
Update loadts and dynamic fields of fp before each signature (x39/x44/x57/x58/x59/x61/x66/x69/x73)
page_context switched to note_detail based on URL
feed POST body fixed shape (source_note_id / image_formats / extra / xsec_source / xsec_token)
3 rounds of auto-retry, delete device_state.json + clear init cookie cache + rebuild session on each failure

Unified Output

Three routes normalized to the same dict shape, with source / extract_source / feed_attempts / feed_recovered_on_attempt
Actual test shows all three content types (image / video / live_photo) can output complete fields
