Sentiment Signal Aggregator
NLP pipeline that processes social media, news, and on-chain data to generate composite sentiment scores for crypto assets.
Project Background
The Sentiment Signal Aggregator (SSA-1) was an ambitious attempt to build a real-time NLP pipeline that could extract tradeable signals from the firehose of social media, news, and on-chain data surrounding crypto assets. The system was active from mid-2024 through early 2025 before being deprecated due to insufficient alpha generation relative to its operational complexity.
Despite being deprecated as a standalone trading system, the core NLP pipeline and sentiment scoring methodology were valuable enough to be integrated into the Neural Alpha Engine (NAE-v3) as an alternative data feed — so the engineering work wasn’t wasted, just repositioned.
NLP Pipeline Architecture
The system processed three primary data streams:
Social Media (Twitter/X and Reddit): A streaming ingestion layer built on the Twitter API v2 and the Pushshift archive of Reddit data consumed posts mentioning tracked crypto tickers and project names. Raw text was cleaned through a multi-stage preprocessing pipeline: URL/mention removal, emoji-to-text conversion (🚀 → “bullish_rocket”), slang normalization (“wen moon” → “when price increase”), and bot/spam filtering using a fine-tuned classifier trained on 50K manually labeled posts.
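The cleaning stages above can be sketched with the standard library alone. This is an illustration, not the production code: the function name `clean_post`, the two regular expressions, and the one-entry lookup tables are assumptions, seeded only with the examples given in the text. Bot/spam filtering is left out because it relied on a trained classifier.

```python
import re

# Hypothetical patterns; a production pipeline would likely be stricter.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

# Lookup tables seeded with the two examples from the text.
EMOJI_MAP = {"🚀": "bullish_rocket"}
SLANG_MAP = {"wen moon": "when price increase"}


def clean_post(text: str) -> str:
    """Strip URLs and mentions, then map emoji and slang to plain tokens."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    for emoji, token in EMOJI_MAP.items():
        # Pad with spaces so adjacent emoji become separate tokens.
        text = text.replace(emoji, f" {token} ")
    text = text.lower()
    for slang, plain in SLANG_MAP.items():
        text = re.sub(rf"\b{re.escape(slang)}\b", plain, text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


print(clean_post("Wen moon 🚀🚀 @whale https://t.co/abc"))
```

Lowercasing happens before slang normalization so that "Wen moon" and "wen moon" hit the same table entry.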
Cleaned text was passed through a fine-tuned FinBERT model (based on ProsusAI/finbert) that outputs sentiment probabilities across three classes: positive, negative, and neutral. The model was further fine-tuned on 10K crypto-specific labeled examples to handle domain-specific language that general financial sentiment models miss — “WAGMI,” “rug pull,” “diamond hands” carry strong sentiment signals that FinBERT’s original training corpus doesn’t cover.
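The write-up does not say how the three class probabilities were reduced to a single number per post. One common convention, shown here purely as an assumption, is the positive probability minus the negative probability, which gives a signed score in [-1, 1]. The helper names are hypothetical, and probabilities are passed by label name so the sketch does not depend on the model's label ordering.

```python
def signed_score(probs: dict) -> float:
    """Collapse class probabilities into one signed sentiment score.

    Assumed convention: P(positive) - P(negative). Neutral probability
    mass pulls the score toward zero without changing its sign.
    """
    return probs["positive"] - probs["negative"]


def mean_score(batch: list) -> float:
    """Average the signed scores of a batch of posts (0.0 if empty)."""
    if not batch:
        return 0.0
    return sum(signed_score(p) for p in batch) / len(batch)


# Made-up probabilities for two posts, for illustration only.
posts = [
    {"positive": 0.70, "negative": 0.10, "neutral": 0.20},
    {"positive": 0.05, "negative": 0.85, "neutral": 0.10},
]
print(mean_score(posts))
```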
News Articles: RSS feeds and API integrations from CoinDesk, The Block, Cointelegraph, and Bloomberg Crypto were processed through a separate pipeline. News articles required different treatment from social media posts: they’re longer, more structured, and the sentiment signal is often concentrated in specific paragraphs rather than spread across the whole text. I used spaCy for named entity recognition to extract mentioned assets, then applied sentiment analysis only to sentences containing those entities, weighted by their position in the article (headlines and first paragraphs carry more weight).
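The entity-filtered, position-weighted scoring described above might look like the sketch below. To stay dependency-free it swaps the NER step for plain substring matching against a list of asset aliases, which is a simplification, not the approach used in the pipeline. The `POSITION_WEIGHTS` values and all names are hypothetical; the text only says that headlines and first paragraphs carried more weight.

```python
from dataclasses import dataclass

# Hypothetical weights: the source gives the ordering, not the numbers.
POSITION_WEIGHTS = {"headline": 3.0, "lead": 2.0, "body": 1.0}


@dataclass
class ScoredSentence:
    text: str
    position: str     # "headline", "lead", or "body"
    sentiment: float  # signed score in [-1, 1] from the sentiment model


def article_sentiment(sentences, asset_aliases):
    """Weighted mean sentiment over sentences that mention the asset.

    Returns None when no sentence mentions the asset at all.
    """
    aliases = [alias.lower() for alias in asset_aliases]
    weighted_sum = 0.0
    weight_total = 0.0
    for sentence in sentences:
        lowered = sentence.text.lower()
        # Stand-in for NER: naive substring match on known aliases.
        if not any(alias in lowered for alias in aliases):
            continue
        weight = POSITION_WEIGHTS[sentence.position]
        weighted_sum += weight * sentence.sentiment
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None


article = [
    ScoredSentence("Bitcoin rallies past resistance", "headline", 0.8),
    ScoredSentence("Regulators met on Tuesday", "lead", -0.5),
    ScoredSentence("Analysts expect BTC to cool off", "body", -0.2),
]
print(article_sentiment(article, ["bitcoin", "btc"]))
```

Substring matching produces false hits for short tickers (an alias such as "eth" also matches "method"), which is one reason a real entity recognizer is the better fit here.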
On-Chain Data: Blockchain analytics APIs (Glassnode, Dune Analytics) provided quantitative on-chain metrics: exchange inflow/outflow ratios, active address growth, whale transaction counts, and the NVT (network value to transactions) ratio. These aren’t text-based sentiment signals but serve as a quantitative proxy for network-level sentiment. Sudden spikes in exchange inflows, for example, typically precede selling pressure.
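As an illustration of the exchange-inflow example, a spike can be flagged by comparing the latest reading with a trailing window. This is a sketch under stated assumptions: the function name, the z-score test, and the threshold of 3 standard deviations are hypothetical and not taken from the original system.

```python
from statistics import mean, stdev


def inflow_spike(inflows, threshold=3.0):
    """Return True when the latest inflow is an outlier versus its history.

    `inflows` is a chronological list of exchange inflow readings; the
    last element is tested against all earlier ones.
    """
    history, latest = inflows[:-1], inflows[-1]
    if len(history) < 2:
        return False  # stdev needs at least two points
    sigma = stdev(history)
    if sigma == 0:
        return False  # flat history: no meaningful z-score
    return (latest - mean(history)) / sigma > threshold


print(inflow_spike([100, 102, 98, 101, 99, 250]))
print(inflow_spike([100, 102, 98, 101, 99, 101]))
```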
Signal Aggregation
Raw signals from each data stream were aggregated into a composite sentiment score using a weighted blending approach:
- Per-source normalization: Each data stream’s sentiment score was z-scored against its own 30-day rolling distribution to account for source-specific biases (Twitter is structurally more bullish than Reddit, for example).
- Volume-weighted aggregation: Sentiment scores were weighted by mention volume, with a logarithmic dampening factor to prevent a single viral tweet from dominating the signal. The formula:
  composite = Σ_i (sentiment_i × log(1 + volume_i) × source_weight_i)
- Temporal decay: Older sentiment observations decayed exponentially with a half-life of 4 hours for social media and 12 hours for news. On-chain signals used a 24-hour half-life due to their slower-moving nature.
- Regime adjustment: During high-volatility periods (VIX > 25 or BTC 24-hour volatility > 5%), the system reduced its confidence in sentiment signals by 50%, acknowledging that sentiment becomes noise during market stress.
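The four steps above can be combined in a short sketch. The half-lives and the 50% regime reduction come from the text; everything else is an assumption made for illustration: the `SOURCE_WEIGHTS` values, the choice to apply temporal decay as a multiplicative factor on each term, the choice to model the confidence reduction as scaling the composite by 0.5, and all names.

```python
import math
from dataclasses import dataclass
from statistics import mean, stdev

HALF_LIFE_HOURS = {"social": 4.0, "news": 12.0, "onchain": 24.0}  # from the text
SOURCE_WEIGHTS = {"social": 0.4, "news": 0.4, "onchain": 0.2}     # hypothetical


@dataclass
class Observation:
    source: str       # "social", "news", or "onchain"
    sentiment: float  # already z-scored against the source's own history
    volume: float     # mention volume behind this observation
    age_hours: float  # time since the observation was made


def zscore(value, history):
    """Normalize a raw score against a rolling window (needs >= 2 points)."""
    sigma = stdev(history)
    return 0.0 if sigma == 0 else (value - mean(history)) / sigma


def decay(age_hours, half_life_hours):
    """Exponential decay: the weight halves every `half_life_hours`."""
    return 0.5 ** (age_hours / half_life_hours)


def composite(observations, high_volatility=False):
    """Sum of sentiment x log(1 + volume) x source weight x decay."""
    total = 0.0
    for obs in observations:
        total += (
            obs.sentiment
            * math.log1p(obs.volume)  # log dampening of mention volume
            * SOURCE_WEIGHTS[obs.source]
            * decay(obs.age_hours, HALF_LIFE_HOURS[obs.source])
        )
    # Assumed reading of "reduced its confidence by 50%": halve the score.
    return total * (0.5 if high_volatility else 1.0)


observations = [
    Observation("social", 1.2, 500, 1.0),
    Observation("news", -0.4, 12, 6.0),
]
print(composite(observations))
print(composite(observations, high_volatility=True))
```

Because the sum is not divided by total weight, the composite grows with the number of observations. Whether the original system normalized the result is not stated.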
Why It Was Deprecated
The system was deprecated for three interconnected reasons:
Insufficient standalone alpha. As a standalone trading signal, the composite sentiment score produced a Sharpe ratio of 0.9 — below the threshold I set for live deployment. The signal had predictive power for extreme moves (correctly flagging 70%+ of large drawdowns 2-6 hours in advance) but generated too many false positives during normal market conditions. The win rate of 54% was only marginally above random, and the average winner was only slightly larger than the average loser.
Data pipeline fragility. The system depended on multiple third-party APIs (Twitter, Reddit, news feeds, on-chain providers) that were individually unreliable. Twitter’s API rate limits and pricing changes in late 2024 were particularly disruptive, and the Pushshift archive of Reddit data experienced frequent downtime. The operational burden of maintaining 8+ API integrations, each with its own authentication, rate limiting, and schema changes, consumed more engineering time than the alpha justified.
Signal decay and adversarial dynamics. Crypto social media sentiment became increasingly gamified during the system’s operational period. Bot networks, paid shill campaigns, and coordinated FUD operations polluted the signal. The fine-tuned FinBERT model’s accuracy on crypto-specific text degraded from 78% to 71% over six months as the adversarial landscape evolved. Maintaining model accuracy would have required continuous labeling and retraining, which amounts to a full-time MLOps commitment.
Lessons Learned
The most valuable takeaway was that sentiment works best as a supporting signal rather than a primary alpha source. The sentiment pipeline’s best contributions were: (1) providing a confidence overlay that improved the performance of other strategies by reducing position sizes during negative sentiment divergence, and (2) flagging potential black swan events 2-4 hours before price impact. Both of these use cases are now integrated into NAE-v3’s alternative data module, where they add meaningful value without bearing the burden of standalone P&L generation.
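The confidence overlay in point (1) can be illustrated with a small position-sizing helper. This is a hypothetical sketch, not NAE-v3's actual logic: "negative sentiment divergence" is interpreted here as a long signal coinciding with a sentiment z-score below a threshold, and the threshold and reduction factor are placeholder values.

```python
def overlay_size(base_size, strategy_direction, sentiment_z,
                 divergence_threshold=-1.0, reduction=0.5):
    """Shrink a long position when sentiment diverges negatively.

    strategy_direction: +1 for long, -1 for short (from another strategy).
    sentiment_z: composite sentiment expressed as a z-score.
    """
    diverges = strategy_direction > 0 and sentiment_z < divergence_threshold
    return base_size * (reduction if diverges else 1.0)


print(overlay_size(100.0, +1, -1.5))  # long signal, strongly negative sentiment
print(overlay_size(100.0, +1, 0.2))   # long signal, neutral sentiment
```

In this form sentiment never generates a trade by itself. It only scales exposure chosen by another strategy, which matches the supporting-signal role described above.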