FinBERT NLP Cosine

Alex.Y
Science

2025-06-13 16:06:00


Step 1: Install dependencies

pip install transformers torch nltk scikit-learn

Step 2: Load the FinBERT model and tokenizer

from transformers import AutoTokenizer, AutoModel
import torch
import nltk
from nltk.tokenize import sent_tokenize
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

nltk.download('punkt')  # sentence tokenizer data required by sent_tokenize

# Load the FinBERT model and tokenizer
model_name = "yiyanghkust/finbert-tone"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

Step 3: Define a function to segment very long text

Split the text into segments of no more than 512 tokens each, while keeping sentences intact:

def split_into_paragraphs(text, tokenizer, max_tokens=512):
    """Split very long text into paragraphs of complete sentences, each at most max_tokens."""
    sentences = sent_tokenize(text)
    paragraphs = []
    current_para = []
    current_token_count = 0

    for sent in sentences:
        # Count the tokens in this sentence
        sent_tokens = tokenizer.tokenize(sent)
        sent_token_count = len(sent_tokens)

        # If adding this sentence would exceed the limit, save the current paragraph
        if current_token_count + sent_token_count > max_tokens:
            if current_para:
                paragraphs.append(' '.join(current_para))
                current_para = []
                current_token_count = 0
            # Handle a single sentence longer than max_tokens (hard truncation)
            if sent_token_count > max_tokens:
                chunk = tokenizer.decode(tokenizer.encode(sent, max_length=max_tokens, truncation=True))
                paragraphs.append(chunk)
                continue

        # Add the sentence to the current paragraph
        current_para.append(sent)
        current_token_count += sent_token_count

    # Append any remaining content
    if current_para:
        paragraphs.append(' '.join(current_para))

    return paragraphs
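The greedy sentence-packing logic above can be checked in isolation. The sketch below (hypothetical `pack_sentences` helper) uses whitespace splitting as a stand-in for the FinBERT tokenizer, so it runs without downloading the model or NLTK data:

```python
# Greedy sentence packing: the same budget logic as split_into_paragraphs,
# with len(sent.split()) as a stand-in for tokenizer.tokenize(sent).
def pack_sentences(sentences, max_tokens=8):
    paragraphs, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())  # stand-in token count
        if count + n > max_tokens and current:
            # Budget exceeded: flush the current paragraph and start a new one
            paragraphs.append(' '.join(current))
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        paragraphs.append(' '.join(current))
    return paragraphs

sents = ["Rates rose sharply.", "Markets fell on the news.",
         "Analysts expect more hikes.", "Bonds sold off."]
print(pack_sentences(sents, max_tokens=8))
# → ['Rates rose sharply. Markets fell on the news.',
#    'Analysts expect more hikes. Bonds sold off.']
```

With a budget of 8 "tokens", the first two sentences (3 + 5) fill one paragraph exactly, and the next sentence starts a new one, mirroring how the real function packs sentences up to the 512-token limit.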

Step 4: Generate and aggregate paragraph vectors

Generate a vector for each paragraph, then average them to represent the whole text:

def get_long_text_embedding(text, model, tokenizer):
    """Generate a semantic vector for very long text (mean of paragraph vectors)."""
    paragraphs = split_into_paragraphs(text, tokenizer)
    embeddings = []

    for para in paragraphs:
        # Encode the paragraph
        inputs = tokenizer(
            para,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512
        )
        with torch.no_grad():
            outputs = model(**inputs)
        # Take the [CLS] token's vector
        cls_embedding = outputs.last_hidden_state[0, 0, :].numpy()
        embeddings.append(cls_embedding)

    # Average all paragraph vectors
    if not embeddings:
        return np.zeros(model.config.hidden_size)  # handle empty input
    return np.mean(embeddings, axis=0)
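A common alternative to the [CLS] vector is mean pooling over all token vectors, masking out padding. The sketch below shows the arithmetic with toy NumPy arrays (assumed shapes matching `last_hidden_state` and the tokenizer's attention mask); it is illustrative, not the code used above:

```python
import numpy as np

# Toy hidden states, shape (batch=1, seq_len=3, hidden=2);
# the third position is padding and is masked out.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]])
mask = np.array([[1, 1, 0]])  # attention mask: 1 = real token, 0 = padding

# Zero out padded positions, then average over the real tokens only
masked = hidden * mask[..., None]
pooled = masked.sum(axis=1) / mask.sum(axis=1, keepdims=True)
print(pooled)  # → [[2. 3.]]
```

Mean pooling often yields more stable sentence representations than the [CLS] vector for vanilla BERT-style encoders, at the cost of one extra masking step.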

Step 5: Compute cosine similarity

# Example long texts
sentence1 = """
Apple Inc. announced a 4-for-1 stock split during its Q3 earnings call,
citing the desire to make shares more accessible to a broader base of investors.
The company also reported a 15% year-over-year increase in revenue, driven by
strong sales of the iPhone 12 and growth in services like Apple Music and iCloud.
CEO Tim Cook emphasized the company's commitment to returning value to shareholders
through increased dividends and share buybacks.
"""

sentence2 = """
In response to inflationary pressures, the Federal Reserve raised the benchmark
interest rate by 0.5 percentage points, marking the largest hike in over two decades.
Chair Jerome Powell indicated that further rate increases are likely throughout the year
to combat rising consumer prices. Analysts predict this move will impact borrowing costs
for businesses and consumers, potentially slowing economic growth in the near term.
"""

# Get the overall vectors
embedding1 = get_long_text_embedding(sentence1, model, tokenizer)
embedding2 = get_long_text_embedding(sentence2, model, tokenizer)

# Compute cosine similarity
similarity = cosine_similarity([embedding1], [embedding2])[0][0]
print(f"FinBERT cosine similarity: {similarity:.4f}")

Example output:

FinBERT cosine similarity: 0.2185
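For reference, `sklearn.metrics.pairwise.cosine_similarity` computes the standard dot-product-over-norms formula. A minimal hand-rolled version (hypothetical `cosine` helper) makes the definition explicit:

```python
import numpy as np

# Cosine similarity from its definition: dot(u, v) / (||u|| * ||v||)
def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine([1.0, 2.0], [2.0, 4.0]))  # parallel vectors → 1.0
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors → 0.0
```

Values near 1 indicate semantically similar texts; the 0.2185 above reflects the low topical overlap between the Apple stock-split text and the Fed rate-hike text.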

Key notes

  1. Segmentation strategy:
    • Split on complete sentences first, avoiding mid-sentence truncation.
    • If a single sentence exceeds the limit (e.g., a legal clause), hard-truncate it to 512 tokens.
  2. Vector aggregation methods:
    • Mean pooling: suitable for general cases; balances information across all segments.
    • Weighted average: weights can be adjusted by segment length or importance (e.g., higher weights for the first and last segments).
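The weighted-average variant from note 2 can be sketched as follows. This is an illustrative example, not code from the post: each paragraph vector is weighted by its token count instead of uniformly:

```python
import numpy as np

# Two toy paragraph vectors and their (hypothetical) token counts
para_vecs = np.array([[1.0, 0.0],
                      [0.0, 1.0]])
token_counts = np.array([300, 100])

# Normalize counts into weights, then take the weighted sum
weights = token_counts / token_counts.sum()       # → [0.75, 0.25]
weighted = (para_vecs * weights[:, None]).sum(axis=0)
print(weighted)  # → [0.75 0.25]
```

Swapping `token_counts` for any other importance score (e.g., boosted weights for the first and last segments) gives the positional weighting mentioned above.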