Insight into Returns: How to Build Price Prediction Models Using a Systematic Approach

This article systematically analyzes the entire process of constructing predictive signals in quantitative investing. Facing the extremely low signal-to-noise ratio of financial markets, it lays out a systematic approach to building effective predictive signals by deconstructing four core components: data preparation, feature engineering, machine learning modeling, and portfolio allocation. The content originates from an article by sysls, organized, compiled, and written by Foresight News.
(Background recap: Can we track the next insider trader on Polymarket? Absolutely, and the threshold is not high)
(Additional background: A comprehensive guide to trading concepts (IX): How many times leverage should be used? All-in or incremental positions?)

Table of Contents

  • Introduction
  • Core Process Framework
  • Feature Engineering: The Art and Science Combined
  • Model Selection Guide
    • Core Modeling Recommendations
  • The Art of Designing Prediction Targets
  • From Signal to Portfolio Implementation
  • Building a Robust System: Key Principles
  • Conclusion

How do you construct effective predictive signals in the extremely low signal-to-noise environment of financial markets? This article provides a systematic answer.

By deconstructing the four core stages of a quantitative strategy (data preparation, feature engineering, machine learning modeling, and portfolio allocation), the article shows that most strategy failures stem from issues at the data and feature levels rather than from the models themselves. It highlights key techniques for handling high-dimensional financial features, the scenarios each model family is suited to, and a crucial insight: improve signal purity by decomposing the sources of return and predicting specific signals. It serves as a reference for quantitative researchers and investors aiming to build robust, interpretable predictive systems.

Introduction

In the field of systematic investing, a predictive signal is a mathematical model that forecasts future asset returns from input feature data. The core architecture of many quantitative strategies is essentially an automated pipeline built around generating such signals, refining them, and allocating assets based on them.

This process appears straightforward: data collection → feature processing → machine learning prediction → portfolio construction. However, financial prediction is a quintessentially noisy, low signal-to-noise domain: daily volatility often runs around 2%, while the genuinely predictable component of a day's return is roughly 1 basis point.
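A rough order-of-magnitude illustration, using only the two figures quoted above: the predictable component is about 0.01% per day against noise of about 2%, a ratio of roughly 0.0001 / 0.02 = 0.005, or about 1:200.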

As a result, most of what a model sees is market noise. Building robust, effective predictive signals in such a harsh environment is therefore a fundamental capability of systematic investing.

Core Process Framework

A complete machine learning system for return prediction usually follows a standardized four-stage process, with each stage tightly interconnected:

Stage One: Data Layer — The “Raw Material” of Strategies

This includes traditional data such as asset prices, trading volumes, and fundamental reports, as well as alternative data (e.g., satellite imagery, consumption trends). Data quality directly determines the upper limit of a strategy's potential: most strategy failures can be traced back to issues at the data source rather than the model itself.

Stage Two: Feature Layer — The “Refinery” of Information

This stage transforms raw data into structured features that models can consume. It is the key step where domain knowledge gets condensed, for example:

  • Price series → Rolling returns (momentum factor)
  • Financial statements → Valuation ratios (value factor)
  • Market data → Liquidity indicators (transaction cost factors)

The quality of feature construction often has a greater impact than model choice.
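As a concrete illustration of the three mappings above, here is a minimal sketch that builds a momentum, a value, and a liquidity feature from raw panels. The DataFrame names, column layout, and window lengths are illustrative assumptions rather than details from the source:

```python
import pandas as pd

# Minimal sketch: raw panels in, factor panels out. `prices`, `volumes`, and
# `book_value_per_share` are assumed to be DataFrames indexed by date with one
# column per asset; the names and windows are illustrative.
def build_features(prices: pd.DataFrame,
                   volumes: pd.DataFrame,
                   book_value_per_share: pd.DataFrame) -> dict:
    # Momentum factor: 12-month rolling return, skipping the most recent month
    # (a common convention to sidestep short-term reversal).
    momentum = prices.shift(21) / prices.shift(252) - 1.0

    # Value factor: book-to-price ratio (higher means cheaper).
    book_to_price = book_value_per_share / prices

    # Liquidity / transaction-cost proxy: Amihud-style illiquidity,
    # |return| per unit of dollar volume, smoothed over 21 days.
    illiquidity = (prices.pct_change().abs() / (prices * volumes)).rolling(21).mean()

    return {"momentum": momentum, "value": book_to_price, "illiquidity": illiquidity}
```

Each resulting feature is a date × asset panel, which is exactly the shape the preprocessing steps described later expect.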

Stage Three: Prediction Layer — The “Engine” of Algorithms

Uses machine learning models to predict future returns based on features. The main challenge is balancing model complexity: capturing nonlinear patterns while avoiding overfitting to noise. Besides directly predicting returns, models can also target specific structured signals (e.g., event-driven returns) to obtain sources of alpha with low correlation.

Stage Four: Allocation Layer — The “Realizer” of Signals

Converts predicted values into actionable portfolio weights. Classic approaches include cross-sectional ranking, long-short pairs, etc. This stage must be closely coupled with transaction cost models and risk constraints.

The entire process is a chain dependency: a weakness in any link constrains the final performance. In practice, allocating more resources to data quality and feature engineering often yields higher returns.

Data Source Classification

  • Market Data: Prices, volumes, return series. Highly standardized but homogeneous; single signals tend to decay quickly.
  • Fundamental Data: Corporate financial reports reflecting operational quality, but subject to reporting lags and seasonal gaps (see the point-in-time alignment sketch after this list). Even in crypto, alternative fundamental indicators can be constructed from on-chain data, though their valuation logic differs from traditional assets.
  • Alternative Data: Non-traditional sources such as sentiment analysis, geospatial info, trading behavior, etc. Noisy and complex to process, but may contain undervalued information.
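One practical consequence of those reporting lags is that fundamental data should only enter the feature set from the date it actually became public. A minimal point-in-time alignment sketch with pandas, assuming hypothetical `date`, `asset`, and `publish_date` columns:

```python
import pandas as pd

# Sketch of aligning fundamentals to the date they became public, to avoid
# look-ahead bias from reporting lags. Column names are illustrative.
def point_in_time_merge(daily: pd.DataFrame, fundamentals: pd.DataFrame) -> pd.DataFrame:
    # daily: one row per (date, asset); fundamentals: one row per report,
    # with the public release date in `publish_date`.
    daily = daily.sort_values("date")
    fundamentals = fundamentals.sort_values("publish_date")
    # For each (asset, date), attach the latest report published on or before that date.
    return pd.merge_asof(
        daily,
        fundamentals,
        left_on="date",
        right_on="publish_date",
        by="asset",
        direction="backward",
    )
```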

Feature Engineering: The Art and Science Combined

Features are quantifiable attributes that can independently or jointly predict future returns. Their construction heavily depends on a deep understanding of market mechanisms. The academic and industry communities have established several classic factor systems, such as:

  • Value factors: valuation levels (e.g., P/B, P/E)
  • Momentum factors: trend strength (returns over different windows)
  • Quality factors: financial robustness (profitability, leverage)
  • Size factors: market capitalization
  • Volatility factors: historical volatility
  • Liquidity factors: trading friction (bid-ask spread, turnover)

Key Techniques in Feature Processing

  • Standardization: Remove scale effects, enabling models to treat features fairly (e.g., market cap vs. volatility).
  • Winsorization: Limit extreme values to prevent outliers from dominating parameter estimation.
  • Interaction features: Capture synergy effects by combining features (e.g., momentum × short position ratio).
  • Dimensionality reduction and selection: To combat the “curse of dimensionality,” use targeted feature selection (not just PCA) so that the information retained is the information most relevant to the prediction goal; a preprocessing sketch follows this list.
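A minimal sketch of the first three techniques, applied cross-sectionally (within each date, across assets). The MultiIndex layout, the 1%/99% winsorization bounds, and the `short_interest` column are illustrative assumptions:

```python
import pandas as pd

# `features`: DataFrame with a (date, asset) MultiIndex and one column per raw
# feature. All steps operate within each date's cross-section.
def winsorize(x: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    # Clip extreme values at the given quantiles so outliers cannot dominate.
    return x.clip(x.quantile(lower), x.quantile(upper))

def zscore(x: pd.Series) -> pd.Series:
    # Standardize to zero mean and unit variance, removing scale effects.
    return (x - x.mean()) / (x.std() + 1e-12)

def preprocess(features: pd.DataFrame) -> pd.DataFrame:
    by_date = features.groupby(level="date", group_keys=False)
    out = by_date.apply(lambda df: df.apply(winsorize).apply(zscore))
    # Example interaction feature: momentum conditioned on (hypothetical) short interest.
    if {"momentum", "short_interest"} <= set(out.columns):
        out["momentum_x_short"] = out["momentum"] * out["short_interest"]
    return out
```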

Model Selection Guide

After preparing features, the next step is choosing algorithms. There is no universally best model; each has its advantages suited to different scenarios.

Linear Models

  • Ridge Regression: Keeps all features, suitable for weak signals.
  • Lasso: Performs automatic feature selection, ideal for sparse signals.
  • Elastic Net: Balances Ridge and Lasso, handling correlated features.

Advantages: Highly interpretable, computationally efficient, good at preventing overfitting. Can incorporate nonlinearities via interaction terms.
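A compact linear baseline along these lines, using scikit-learn. The synthetic data exists only to keep the snippet self-contained, and the regularization strengths are placeholders that would normally be chosen by time-aware cross-validation:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real feature matrix: a weak signal buried in noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = 0.01 * X[:, 0] + rng.normal(scale=1.0, size=1000)

models = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=10.0)),  # keeps all features, shrinks coefficients
    "lasso": make_pipeline(StandardScaler(), Lasso(alpha=1e-3)),  # drives weak features to exactly zero
    "enet": make_pipeline(StandardScaler(), ElasticNet(alpha=1e-3, l1_ratio=0.5)),  # blends both penalties
}

for name, model in models.items():
    model.fit(X, y)  # in practice, fit only on past data and score out of sample
```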

Tree Ensemble Models

Random Forests and Gradient Boosting Trees (XGBoost, LightGBM) excel at capturing nonlinear relationships and interactions automatically.

  • Random Forest: Strong resistance to overfitting, stable.
  • Gradient Boosting: Usually higher predictive accuracy but requires careful tuning.

When complex interactions and nonlinearities are significant, these models are preferred. They are more computationally intensive but modern interpretability tools have improved their transparency.
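A comparable tree-ensemble baseline with scikit-learn; XGBoost and LightGBM expose essentially the same fit/predict interface. The deliberately shallow trees and large leaves reflect the noise concerns above, but the hyperparameters themselves are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Synthetic data with a weak nonlinear interaction, standing in for real features.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
y = 0.01 * np.tanh(X[:, 0] * X[:, 1]) + rng.normal(scale=1.0, size=2000)

# Shallow, heavily averaged trees to limit overfitting on a noisy target.
forest = RandomForestRegressor(n_estimators=300, max_depth=4, min_samples_leaf=50, n_jobs=-1)
boost = GradientBoostingRegressor(n_estimators=200, max_depth=3, learning_rate=0.05, subsample=0.7)

forest.fit(X, y)
boost.fit(X, y)
```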

Neural Networks

Neural networks offer powerful representation capabilities, capable of modeling highly complex patterns. However, they require large data volumes, are sensitive to hyperparameters, and tend to overfit noise in low signal-to-noise environments. Use only when data is abundant and the team has deep tuning expertise.

Core Modeling Recommendations

  • Use linear models as a strong baseline.
  • Upgrade to tree models if clear nonlinear patterns exist with sufficient data.
  • Consider neural networks as an advanced option, not the default.
  • The impact of model choice is often smaller than that of feature quality and rigorous out-of-sample testing (a walk-forward validation sketch follows this list).
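One simple way to enforce that out-of-sample discipline is walk-forward evaluation: always fit on the past and score on the next block of time. A minimal sketch using scikit-learn's TimeSeriesSplit, with synthetic data just to make it runnable:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

# Synthetic time-ordered data standing in for the real feature/target panel.
rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 10))
y = 0.02 * X[:, 0] + rng.normal(scale=1.0, size=1500)

scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = Ridge(alpha=5.0).fit(X[train_idx], y[train_idx])  # train on the past only
    preds = model.predict(X[test_idx])                        # score on the subsequent block
    scores.append(np.corrcoef(preds, y[test_idx])[0, 1])      # out-of-sample information coefficient

print("mean out-of-sample IC:", np.mean(scores))
```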

The Art of Designing Prediction Targets

The traditional approach predicts asset returns directly, but a return is a mixture of many factor signals, which makes it a difficult, noisy target. A better approach is to decompose the sources of return and model the specific dominant driver:

For example, stock price reactions after earnings revisions are mainly driven by the revision event itself. Predicting the “revision magnitude” or “event-period return” directly can avoid unrelated noise. Flexible design of prediction targets is a key way to improve signal purity.
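To make that concrete, the sketch below builds an event-period return target, the price change over a short window after each revision event, instead of a raw forward return. The frame names, columns, and 5-day window are illustrative assumptions:

```python
import pandas as pd

# `prices`: date-indexed DataFrame, one column per asset (unique date index).
# `events`: one row per revision event, with `asset` and `event_date` columns.
def event_window_return(prices: pd.DataFrame, events: pd.DataFrame, window: int = 5) -> pd.Series:
    targets = []
    for _, ev in events.iterrows():
        px = prices[ev["asset"]]
        if ev["event_date"] not in px.index:
            targets.append(float("nan"))
            continue
        i = px.index.get_loc(ev["event_date"])
        if i + window >= len(px):
            targets.append(float("nan"))  # event too close to the end of the sample
            continue
        # Return from the event close to `window` trading days later.
        targets.append(px.iloc[i + window] / px.iloc[i] - 1.0)
    return pd.Series(targets, index=events.index, name="event_return")
```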

From Signal to Portfolio Implementation

Predictions must be converted into actual holdings before they generate returns:

  • Basic method: Cross-sectional ranking to build long-short portfolios (sketched after this list).
  • Key insight: Prediction accuracy does not directly translate to real trading performance; transaction costs, liquidity constraints, and turnover must be considered.
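A minimal sketch of the basic method: rank one date's predictions cross-sectionally and form equal-weighted, dollar-neutral long-short weights. The 20% quantile is an illustrative choice, and cost and liquidity constraints are deliberately left out:

```python
import pandas as pd

def rank_long_short_weights(preds: pd.Series, quantile: float = 0.2) -> pd.Series:
    ranks = preds.rank(pct=True)              # percentile rank within the cross-section
    longs = ranks >= 1.0 - quantile
    shorts = ranks <= quantile
    weights = pd.Series(0.0, index=preds.index)
    weights[longs] = 1.0 / longs.sum()        # +100% notional spread across the longs
    weights[shorts] = -1.0 / shorts.sum()     # -100% notional spread across the shorts
    return weights

# Example: one date's predictions for five assets.
scores = pd.Series({"A": 0.8, "B": 0.1, "C": -0.3, "D": 0.5, "E": -0.9})
print(rank_long_short_weights(scores))
```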

Building a Robust System: Key Principles

  • Start with well-understood models: Fully exploit known effective factors before innovating.
  • Regularization is everywhere: Prevent overfitting in high-dimensional settings.
  • Preprocessing must be rigorous: Standardization, winsorization, outlier handling are essential.
  • Dimensionality reduction should be targeted: Ensure retained information is relevant to the prediction goal.
  • Focus on trading results: Use net returns after costs as the ultimate evaluation criterion (a net-of-cost sketch follows this list).
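As a closing illustration of that last principle, a minimal sketch that charges transaction costs against gross portfolio returns; the 10 bps per unit of turnover is purely an assumed cost level:

```python
import pandas as pd

# `weights` and `asset_returns`: date x asset DataFrames on the same index/columns.
def net_returns(weights: pd.DataFrame, asset_returns: pd.DataFrame,
                cost_per_turnover: float = 0.0010) -> pd.Series:
    gross = (weights.shift(1) * asset_returns).sum(axis=1)     # yesterday's weights earn today's returns
    turnover = (weights - weights.shift(1)).abs().sum(axis=1)  # notional traded each day
    return gross - cost_per_turnover * turnover                # evaluate on net, not gross
```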

Conclusion

Predictive signals are the cornerstone of systematic investing. Their effective construction relies on a systematic grasp of the entire chain—data, features, models, and allocation.

In the low signal-to-noise battlefield of financial data, simple models combined with rigorous out-of-sample validation often outperform overly complex black-box systems. Always start with concise, interpretable frameworks, and only increase complexity gradually when necessary.
