MaiMemo Memory Algorithm Experiment 2025 - Preliminary Report
Abstract
This document presents the "MaiMemo Memory Algorithm Experiment 2025 - Preliminary Report," detailing the initial outcomes of a large-scale A/B test conducted by the MaiMemo team to evaluate the practical effectiveness of spaced repetition algorithms. The core experiment (Exp.01) compared the performance of MMX-5 (a variant of FSRS-3) and MMX-6 (a variant of FSRS-6) among over 8,900 newly registered users on the MaiMemo flashcard (墨墨记忆卡) platform.
The study revealed significant discrepancies between theoretical benchmarks and actual user behavior. Although MMX-6 demonstrated superior performance in offline machine learning metrics (achieving lower RMSE/LogLoss and higher AUC), it underperformed MMX-5 in key operational and learning metrics during real-world application. Specifically, the MMX-6 group exhibited lower user retention rates (including daily task completion rates and learning engagement) and failed to demonstrate improvements in learning efficiency.
The report highlights a critical "gap" between laboratory research and real-world application. These findings underscore the necessity of establishing online A/B testing infrastructure to balance memory prediction accuracy with user motivation and long-term learning sustainability. Subsequent experimental phases will explore personalized parameter training and other algorithmic variables.
Keywords: Spaced Repetition, FSRS Algorithm, A/B Testing, Memory Prediction, User Retention, Learning Efficiency, MaiMemo.
1 Introduction
1.1 Background
Summary: This is the "MaiMemo Memory Algorithm Experiment 2025 - Preliminary Report" by the MaiMemo team. The report documents the team's A/B user experiments on memory algorithms, covering topics such as: the impact of algorithm upgrades, the effects of different algorithm frameworks, the influence of default parameters versus periodically personalized training parameters, the impact of adding random variations, the effects of displaying the next scheduled interval, and the impact of disallowing review completion with "Vague" feedback. The aim is to bridge a critical gap in memory algorithm research, facilitating its transition from the "laboratory stage" to achieving "real-world impact."
Purpose of this Publication: Primarily, we find this research inherently interesting and hope to discuss and exchange ideas with others who share this interest. MaiMemo has been dedicated to memory research for over a decade. During this time, we have not only gained many significant and even counter-intuitive insights but have also encountered our share of challenges. By sharing this journey, we hope to attract more attention to and foster a deeper understanding of memory research, thereby contributing to the long-term advancement of the field.
Related Background: The relevant research is based on our two core products: MaiMemo Vocabulary (墨墨背单词, launched in 2014) and MaiMemo Flashcard (墨墨记忆卡, launched in 2021). As of October 2025, the platform has accumulated 133.8 billion data points on user memory behavior from over 40 million registered users. Academically, we have successfully published two research papers in ACM SIGKDD 2022 and IEEE TKDE 2023. Furthermore, to promote the development of the field, we open-sourced a dataset of 220 million records in 2022.
For other types of inquiries or collaboration, please feel free to contact us via email at datascience@maimemo.com.
1.2 Experiment Overview
- Exp.01(2025.09), Exp.04(2025.10), Exp.05(2025.10), Exp.06(2025.10) have been launched.
- Exp.02, Exp.03 are pending launch.
1.3 Reporting Schedule
- MaiMemo Memory Algorithm Experiment 2025 - Preliminary Report, 2025.11
- Explain the overall experimental plan, with a focus on analyzing the preliminary findings of Exp.01.
- MaiMemo Memory Algorithm Experiment 2025 - Mid-term Report, 2026.05
- Provide an update on the status of Exp.04, 05, and 06; complete the reports for Exp.01, 04, 05, and 06; analyze the preliminary findings of Exp.02.
- MaiMemo Memory Algorithm Experiment 2025 - Final Report, 2026.11
- Complete all experiments (01, 02, 03, 04, 05, 06) and finalize the report.
2 Exp.01 Status
2.1 Experimental Design
A portion of newly registered users will be divided into two groups for an initial AA test to confirm there are no differences between the groups.
Afterward, an A/B test will commence: Group A will use the MMX-5 algorithm (a variant of FSRS-3), while Group B will use the MMX-6 algorithm (a variant of FSRS-6).
Both groups will operate with default parameters, without personalized parameter optimization for users, to observe the impact on overall business performance and learning outcomes.
The platform for this experiment is MaiMemo Flashcards (墨墨记忆卡). Due to the unique nature of the memorization materials in MaiMemo Vocabulary (墨墨背单词), its algorithm is more specialized and thus not discussed in this experiment.
2.2 Algorithm Description
2.2.1 MMX-5
MMX-5 is the internal version of FSRS-3.
Default Parameters
[-0.6051, 1.2609, 1.0101, -0.9817, -2.181, 2.5985, -0.7287, -0.0232, 1.2021, 0.3485, 0.7679, 0.8443]
Formulas
w_i represents the i-th parameter, user_params[i]. This version uses 12 parameters. The memory state is represented by Stability (S) and Difficulty (D).
Rating System: MMX-5 uses a unique rating mapping system. The user's choice (1-4) is mapped to an internal grade level (G) for calculations.
- 1: Familiar → G=3
- 2: Vague → G=2
- 3: Forgot → G=1
- 4: Well Familiar → G=4
- Initial stability after the first rating: the initial stability $S_0(G)$ is an exponential mapping of the parameters (see §2.2.4), where $S_0(1)$ is the initial stability when the first rating is Forgot and $S_0(4)$ is the initial stability when the first rating is Well Familiar.
- Initial difficulty after the first rating: the initial difficulty $D_0(G)$ decreases linearly with the grade (see §2.2.4), where $D_0(3)$ is the initial difficulty when the first rating is Familiar.
- New difficulty after a review: the new difficulty is adjusted from the current difficulty $D$ according to the grade $G$. When the rating is Familiar ($G=3$), the difficulty does not change. The result is clamped to the interval $[1, 10]$. Unlike FSRS, this formula does not include a "mean reversion" mechanism.
- Retrievability after $t$ days since the last review follows an exponential forgetting curve:
  $$R(t, S) = 0.9^{\,t/S}$$
  so that when $t = S$, $R(t, S) = 0.9$.
- The next interval is obtained by solving the above equation for $t$ with $R$ replaced by the Target Retention. MMX-5 uses a fixed Target Retention of 0.85, giving
  $$I = S \cdot \frac{\ln 0.85}{\ln 0.9} \approx 1.54\,S$$
- The final calculated interval is rounded to the nearest integer and clamped to the range $[1, 36500]$ days (see the sketch at the end of this subsection).
- New stability after a successful recall: this formula is applied for ratings of Familiar ($G=3$) or Well Familiar ($G=4$).
  - Similar to FSRS, the growth in stability ($\mathrm{SInc}$, the factor by which $S$ is multiplied) is influenced by the following factors:
    - Difficulty ($D$): the larger $D$ is, the smaller $\mathrm{SInc}$ becomes. The memory stability of difficult material grows more slowly.
    - Stability ($S$): the larger $S$ is, the smaller $\mathrm{SInc}$ becomes. The more stable a memory already is, the harder it is to make it even more stable.
    - Retrievability ($R$): the smaller $R$ is (i.e., the longer the review delay), the larger $\mathrm{SInc}$ becomes. This reflects the spacing effect.
  - If the review is successful, $\mathrm{SInc}$ is always greater than or equal to 1.
- Stability after forgetting (i.e., stability after an incorrect answer): this formula is applied for ratings of Forgot ($G=1$) or Vague ($G=2$), and calculates the new stability after a user fails to recall an item. It is worth noting that, unlike FSRS, this version's post-forgetting stability calculation does not depend on difficulty ($D$).
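To make the scheduling step above concrete, here is a minimal sketch in Python of how an MMX-5-style scheduler would turn a card's stability into its next interval. It assumes only what is stated in this subsection: the rating-to-grade mapping, the exponential forgetting curve with $R(S,S)=0.9$, a fixed target retention of 0.85, and rounding/clamping to $[1, 36500]$ days. The stability and difficulty update formulas are deliberately not reproduced, and the function names are ours, not MaiMemo's.

```python
import math

# Rating-to-grade mapping as listed above (user choice -> internal grade G).
RATING_TO_GRADE = {1: 3,   # Familiar
                   2: 2,   # Vague
                   3: 1,   # Forgot
                   4: 4}   # Well Familiar

TARGET_RETENTION = 0.85            # fixed in MMX-5
MIN_INTERVAL, MAX_INTERVAL = 1, 36500


def retrievability(t_days: float, stability: float) -> float:
    """Exponential forgetting curve; by construction R(S, S) = 0.9."""
    return 0.9 ** (t_days / stability)


def next_interval(stability: float) -> int:
    """Solve R(t, S) = TARGET_RETENTION for t, then round and clamp."""
    t = stability * math.log(TARGET_RETENTION) / math.log(0.9)
    return max(MIN_INTERVAL, min(MAX_INTERVAL, round(t)))


if __name__ == "__main__":
    grade = RATING_TO_GRADE[1]     # the user tapped choice 1 ("Familiar") -> G = 3
    stability = 10.0               # a card whose stability is 10 days
    print(grade, retrievability(10, stability), next_interval(stability))
    # -> 3, 0.9, 15 (about 1.54 x stability at the 85% target retention)
```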
2.2.2 Comparison between FSRS-3 and MMX-5
2.2.3 Comparison between FSRS-6 and MMX-6
Symbol Definitions
- $R$: Retrievability (probability of recall)
- $S$: Stability (the interval in days at which the probability of recall is 90%)
- $S'_r$: New stability after a successful recall
- $S'_f$: New stability after forgetting
- $D$: Difficulty ($D \in [1, 10]$)
- $G$: Grade (the rating in Anki):
  - $G = 1$: `again`, corresponding to "Forgot" in MaiMemo.
  - $G = 2$: `hard`, corresponding to "Vague" in MaiMemo.
  - $G = 3$: `good`, corresponding to "Familiar" in MaiMemo.
  - $G = 4$: `easy`, corresponding to "Well Familiar" in MaiMemo.
Default Parameters
[0.3265, 1.21955, 2.4329, 8.2956, 6.41275, 0.834, 3.0125, 0.0314, 1.89125, 0.2144, 0.8208, 1.56435, 0.0409, 0.3591, 1.74945, 0.69375, 1.8729, 0.5425, 0.0912, 0.0658, 0.1]
Changes Relative to FSRS-6
- MMX-6 is based on FSRS-6, with specific optimizations for the memory model. This version uses the same 21 parameters as FSRS-6 but modifies the formulas and updates the default parameters.
- MMX-6 does not consider feedback during the short-term learning phase.
Success Threshold Modification
- In MMX-6, the success threshold is changed from $G \geq 2$ to $G \geq 3$. This means:
  - Success (Recall): $G \geq 3$ (`good`, `easy`)
  - Failure (Forgetting): $G \leq 2$ (`again`, `hard`)
- This change affects the branching logic in the stability calculation, as sketched below.
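A minimal sketch of the changed branching, assuming nothing beyond the thresholds above; `success_branch` and `forgetting_branch` stand in for the actual MMX-6 stability formulas and are hypothetical names.

```python
from typing import Callable

StabilityFormula = Callable[[float, float, float, int], float]  # (D, S, R, G) -> new S


def next_stability(grade: int, d: float, s: float, r: float,
                   success_branch: StabilityFormula,
                   forgetting_branch: StabilityFormula,
                   success_threshold: int = 3) -> float:
    """Dispatch a review to the success or forgetting formula.

    FSRS-6 routes grades 2-4 (hard/good/easy) through the success branch;
    MMX-6 raises the threshold to 3, so grade 2 ("Vague") falls through to
    the forgetting branch instead.
    """
    if grade >= success_threshold:
        return success_branch(d, s, r, grade)
    return forgetting_branch(d, s, r, grade)
```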
Removal of the "Hard" Penalty in Successful Reviews
- In FSRS-6, the stability after a successful review includes a "Hard" penalty: when the rating is `hard` ($G = 2$), the stability increase is multiplied by a penalty coefficient smaller than 1.
- In MMX-6, this "Hard" penalty is removed, so the success formula no longer contains a $G = 2$ case.
- This change means that "Vague" feedback ($G = 2$) is now treated as a failure and handled by the forgetting formula.
Addition of an "Forgot" Penalty in Failed Reviews
- In FSRS-6, the stability after forgetting depends on difficulty, stability, and retrievability, with no dependence on the grade.
- In MMX-6, an additional "Forgot" penalty is applied when the rating is `again` ($G = 1$).
- This means that "Forgot" feedback ($G = 1$) receives an additional penalty compared to "Vague" feedback ($G = 2$), resulting in a lower post-failure stability.
Updated Default Parameters
- MMX-6 uses different optimized default parameters from FSRS-6, specifically:
- A lower decay parameter for the forgetting curve: 0.1 (compared to 0.1542 in FSRS-6)
- Adjusted penalty/reward parameters (the "Forgot" penalty and the "Well Familiar" reward coefficient)
- Modified initial stability parameters
Practical Implications
- "Vague" Feedback as Failure: "Vague" feedback is now incorporated into the forgetting model instead of being treated as a penalized successful recall.
- Enhanced Differentiation: The model better distinguishes between "Forgot" (complete failure) and "Vague" (partial failure) through the "Forgot" penalty.
- Improved Calibration: The updated parameters and modified penalty system provide better retention rate predictions for the specific use cases MMX-6 is optimized for.
- Consistent Forgetting Curve: All other formulas (initial difficulty, difficulty update, mean reversion, etc.) remain consistent with FSRS-6.
Interval Preview
- Target retention set to 0.85
2.2.4 Comparison between MMX-5 and MMX-6
Model Structure
- The number of parameters was increased from 12 to 21.
- The exponential forgetting curve in MMX-5 has been replaced with a power-law forgetting curve.
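To illustrate the difference between the two curve families, here is a small comparison sketch. The MMX-5 curve $R = 0.9^{t/S}$ follows from §2.2.1; for MMX-6 we assume an FSRS-6-style power law $R = (1 + f \cdot t/S)^{-c}$, with the factor $f$ chosen so that $R(S,S) = 0.9$ and the decay $c$ taken to be the last default parameter (0.1). The power-law form and that parameter reading are assumptions on our part, since the exact MMX-6 curve is not written out in this report.

```python
# Exponential (MMX-5-style) vs. assumed power-law (FSRS-6/MMX-6-style) forgetting curves.
import math


def r_exponential(t: float, s: float) -> float:
    return 0.9 ** (t / s)


def r_power_law(t: float, s: float, decay: float = 0.1) -> float:
    factor = 0.9 ** (-1.0 / decay) - 1.0      # chosen so that R(S, S) = 0.9
    return (1.0 + factor * t / s) ** (-decay)


def interval_at(retention: float, s: float, curve) -> float:
    """Numerically invert R(t, S) = retention by bisection (R is decreasing in t)."""
    lo, hi = 0.0, 36500.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if curve(mid, s) > retention:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    s = 10.0
    for name, curve in [("exponential", r_exponential), ("power law", r_power_law)]:
        print(name, round(interval_at(0.85, s, curve), 1))
    # The power-law curve has a heavier tail, so under these assumptions it reaches
    # the 0.85 target later than the exponential curve for the same stability.
```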
Initialization and Difficulty Update
- Initial stability is now read directly from the parameters instead of being an exponential mapping: MMX-5 derived the initial stability from an exponential function of its parameters, whereas MMX-6 reads a per-grade value straight from the parameter list, retaining separate customization for each of grades 1–4.
- Initial difficulty has changed from a linear decrease with the grade to an exponential decrease with the grade.
- The difficulty update is no longer a simple translation of $D$ by a grade-dependent amount; it now incorporates linear damping and mean reversion, in which $D_0(1)$, the initial difficulty for a grade of 1, enters the formula.
Stability Update Formula
- The recall branch has changed from MMX-5's multiplicative amplification of stability to an inverse adjustment based on difficulty and stability.
- MMX-6 removes the previous "difficulty penalty" but retains a reward coefficient for "Well Familiar" feedback; MMX-5 had no similar separate bonus for "Well Familiar".
- The forgetting branch has been changed from a power function that depended solely on stability to one that simultaneously considers difficulty, stability, and a "Forgot" penalty.
- The rating threshold for a successful recall remains $G > 2$. However, because "Vague" feedback is handled by the forgetting branch and "Forgot" feedback is penalized separately, the separation between success and failure is more pronounced than in MMX-5.
2.3 Machine Learning Metrics (SRS Benchmark)
2.3.1 Evaluation Description
- Dataset: a sample of users who newly registered in 2024 and each have more than 1,000 review records, totaling 12,370 users.
- Data split: The first 50% of each user's review records were used as the training set, and the last 50% as the test set (a minimal sketch follows this list).
- Evaluation was conducted using the SRS Benchmark rules established by FSRS for MMX-5-default, MMX-6-default, FSRS-3-default, and FSRS-6-default.
- A note: We consider default parameters as part of the algorithm, so MMX-5-default, FSRS-3-default, and FSRS-6-default were tested directly without training, while the parameters for MMX-6-default were newly trained.
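As an illustration, the per-user chronological 50/50 split described above could look like the following sketch; the DataFrame and its column names (`user_id`, `review_time`) are hypothetical, not the actual dataset schema.

```python
# Sketch: first half of each user's review history -> train, second half -> test.
import pandas as pd


def split_per_user(reviews: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    reviews = reviews.sort_values(["user_id", "review_time"])
    rank = reviews.groupby("user_id").cumcount()                    # 0..n-1 within each user
    size = reviews.groupby("user_id")["user_id"].transform("size")  # n for each user
    is_train = rank < size // 2
    return reviews[is_train], reviews[~is_train]
```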
2.3.2 Evaluation Results
- Evaluate the error between the model's predicted recall probability and the user's actual recall results.
- Smaller RMSE, Log Loss, and ICI values indicate lower prediction errors.
- A larger AUC value indicates more accurate ranking of recall probabilities.
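Before turning to the results, here is a minimal sketch of computing such metrics with NumPy and scikit-learn. It is not the SRS Benchmark implementation: the benchmark uses a binned, calibration-style RMSE and also reports ICI, both of which are omitted here, and the toy data below is made up.

```python
# Illustrative only: plain RMSE / log loss / AUC on predicted recall probabilities.
import numpy as np
from sklearn.metrics import log_loss, roc_auc_score


def evaluate(y_true: np.ndarray, p_pred: np.ndarray) -> dict:
    """y_true: 1 if the card was recalled, else 0; p_pred: predicted recall probability."""
    return {
        "RMSE": float(np.sqrt(np.mean((p_pred - y_true) ** 2))),
        "LogLoss": log_loss(y_true, p_pred, labels=[0, 1]),
        "AUC": roc_auc_score(y_true, p_pred),
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = rng.uniform(0.5, 0.99, size=1000)           # fake predictions
    y = (rng.uniform(size=1000) < p).astype(int)    # fake outcomes consistent with them
    print(evaluate(y, p))
```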
Based on the SRS Benchmark evaluation, we can observe that MMX-6-default performs best on the machine learning metrics, while MMX-5-default and FSRS-6-default also show strong performance.
2.4 User Experiment (A/B Testing)
We launched this experiment and are now conducting statistical analysis on the relevant data. The data can be found in the appendix.
Group A had 4,570 registered users, while Group B had 4,412 registered users.
The data covers business metrics (daily task completion, paid purchases) and learning metrics (learning feedback records) from Day 00 to Day 30 after user registration.
2.4.1 User Retention Rate - Completed
For a software product, user retention rate is a very important business metric.
Here, we calculate the daily retention of users in each group who completed their daily learning tasks.
Relevant data visualization:
- From the data visualization above, B.MMX-6-default (FSRS-6) maintained comparable retention through Day 06 after registration, but its user retention rate deteriorated after Day 06.
- By Day 30, its retention rate was roughly 10% lower in relative terms.
Relevant data details:
- Currently, B.MMX-6-default (FSRS-6) appears to show a somewhat negative trend in User Retention Rate - Completed, although the existing data is still insufficient.
- We conducted a statistical test on the Day 30 data, which yielded a p-value of 0.11. Since p < 0.05 is generally required for significance, it is worth increasing the number of participants and extending the observation period (a minimal sketch of such a test follows this list).
- At the same time, there is a peculiar phenomenon that needs to be addressed: the user retention rate for the A.MMX-5-default (FSRS-3) group actually increased after Day 18 instead of declining, which is counterintuitive.
- Additional context: some of the analyzed users registered between October 1 and October 8, 2025 (China's National Day holiday), which might have affected their learning patterns.
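For reference, a two-proportion z-test is one common way to compare two groups' Day 30 retention; the report does not state which test was actually used, and the retained-user counts below are invented for illustration (only the group sizes come from the report).

```python
# Two-proportion z-test on Day-30 completed retention (illustrative counts).
from statsmodels.stats.proportion import proportions_ztest

retained = [800, 720]          # hypothetical numbers of retained users on Day 30 (A, B)
n_users = [4570, 4412]         # group sizes reported in section 2.4
stat, p_value = proportions_ztest(count=retained, nobs=n_users)
print(f"z = {stat:.2f}, p = {p_value:.3f}")
```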
2.4.2 Paid Rate
Relevant data details:
- There is little difference in paid conversion between the two groups.
2.4.3 User Retention Rate - Active
Daily learning activity and daily completion of learning tasks are tracked by two separate data systems.
We have an independent data system (BMMS - Big Data Memory Matrix System, responsible for collecting, analyzing, and predicting memory data, which is the collective term for MaiMemo's memory algorithm implementation system) to gather user learning feedback.
The learning activity retention rate (User Retention Rate - Active) is collected through this system. Unlike the previous metric (2.4.1 User Retention Rate - Completed), here we can collect user learning data as long as they provide feedback, without requiring completion of all daily learning tasks.
Relevant data visualization:
- From the perspective of active-user retention, the rate holds up before Day 20 but shows a declining trend after Day 20.
- It is important to note that the User Retention Rate - Active is higher than the User Retention Rate - Completed, as users may have engaged in learning without finishing their daily learning tasks.
Relevant data details:
- There is a downward trend, but the p-value does not yet reach significance.
- Meanwhile, the difference between groups A and B shows a trend change point on Day 18, similar to what was observed in the User Retention Rate - Completed.
2.4.4 Cumulative - New Cards Learned/Repetitions/Time
Based on the data collected by BMMS, we can further analyze users' learning progress.
- In the `maimemo_flashcard_abtest.tsv` file within the appendix, the relevant columns are:
  - `acc_learn_cnt`: total number of new cards learned
  - `acc_learn_repeat`: total number of new-learning reviews
  - `acc_review_repeat`: total number of review repetitions
  - `acc_total_repeat`: total number of all reviews
  - `acc_learn_time`: total time spent on new learning
  - `acc_review_time`: total time spent on reviews
  - `acc_total_time`: total time spent overall
Relevant data visualization:
- Before Day 18, B.MMX-6-default (FSRS-6) showed a promising trend in cumulative new cards learned compared to A.MMX-5-default (FSRS-3). However, after Day 18, this advantage gradually diminished.
- In the end, the two groups performed similarly across three key metrics: cumulative new cards learned, cumulative repetitions, and cumulative time.
Relevant data details:
- We will not elaborate on this here, as it will be discussed in section 2.5.3 Learning Efficiency - Time Spent on New Learning and Review.
2.4.5 Feedback - New Learning Recognition Rate/Review Retention Rate
Further analyzing learning progress, we can assess the recognition rate of new cards and the retention rate of reviewed cards through user feedback, with these analyses also based on data collected by BMMS.
In MaiMemo Flashcards, there are four types of feedback: two ratings for failure (Forgot, Vague) and two for success (Familiar, Well Familiar).
The main interface of the software currently displays three options: Familiar, Vague, Forgot, as shown in the image.
The Well Familiar option requires a specific interaction to provide feedback.
- In the `maimemo_flashcard_abtest.tsv` file within the appendix, the relevant columns are:
  - `learn_rsp1_cnt`: number of new-learning responses marked "Familiar"
  - `learn_rsp2_cnt`: number of new-learning responses marked "Vague"
  - `learn_rsp3_cnt`: number of new-learning responses marked "Forgot"
  - `learn_rsp4_cnt`: number of new-learning responses marked "Well Familiar"
  - `review_rsp1_cnt`: number of review responses marked "Familiar"
  - `review_rsp2_cnt`: number of review responses marked "Vague"
  - `review_rsp3_cnt`: number of review responses marked "Forgot"
  - `review_rsp4_cnt`: number of review responses marked "Well Familiar"
- Based on the above data, we define the new learning recognition rate as the share of new-learning responses marked "Familiar" or "Well Familiar", and the review retention rate as the share of review responses marked "Familiar" or "Well Familiar".
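A minimal sketch of computing these two rates from the appendix TSV, under the assumptions that "Familiar" and "Well Familiar" count as success (as stated above) and that the file contains one row per user per day with the response-count columns listed above plus a grouping column; the `group` column name is hypothetical.

```python
# Sketch: new-learning recognition rate and review retention rate, per experiment group.
import pandas as pd

df = pd.read_csv("maimemo_flashcard_abtest.tsv", sep="\t")
agg = df.groupby("group")[[
    "learn_rsp1_cnt", "learn_rsp2_cnt", "learn_rsp3_cnt", "learn_rsp4_cnt",
    "review_rsp1_cnt", "review_rsp2_cnt", "review_rsp3_cnt", "review_rsp4_cnt",
]].sum()

learn_total = agg[[f"learn_rsp{i}_cnt" for i in range(1, 5)]].sum(axis=1)
review_total = agg[[f"review_rsp{i}_cnt" for i in range(1, 5)]].sum(axis=1)

recognition_rate = (agg["learn_rsp1_cnt"] + agg["learn_rsp4_cnt"]) / learn_total
review_retention = (agg["review_rsp1_cnt"] + agg["review_rsp4_cnt"]) / review_total
print(recognition_rate, review_retention, sep="\n")
```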
Relevant data visualization:
- The most puzzling data from this experiment emerged: the recognition rate of newly learned cards in B.MMX-6-default (FSRS-6) was higher than that in A.MMX-5-default (FSRS-3). Normally, spaced repetition algorithms primarily affect review scheduling and shouldn't influence new learning. This stands as the strangest phenomenon observed in this experiment.
- The review retention rates are similar between the two groups, with B.MMX-6-default (FSRS-6) possibly slightly higher than A.MMX-5-default (FSRS-3), but the difference is not statistically significant. Both groups only achieved around 75%, which still falls short of our target of 85%. This discrepancy might be because these are first-month data, during which no personalized parameter training was conducted, leading to some deviation.
Relevant data details:
- No additional details are provided for the data, as it has already been explained in the visualization analysis section above.
2.5 Discussion
2.5.1 Mutation Point Day 18
- The relative metrics of B.MMX-6-default (FSRS-6) weakened around Day 18 compared to A.MMX-5-default (FSRS-3), which is quite intriguing. We aimed to investigate further but could not formulate suitable hypotheses after several attempts. This issue will be set aside for future analysis. If interested, you can also run your own analysis on the data provided in the appendix.
2.5.2 User Retention Rate - Completed / Active
Relevant data visualization:
- The chart is divided into two main sections, defining "retention" from two different dimensions:
- Upper section (Completed): Users not only logged in but also completed their daily learning tasks.
- Lower section (Active): Users were merely active (possibly only completing part of the material) but didn't necessarily finish the tasks.
- B.MMX-6-default (FSRS-6) shows a negative trend in user retention compared to A.MMX-5-default (FSRS-3).
- Currently, we anticipate needing more data to make further judgments, but observable indicator changes exhibit a certain cyclical pattern, possibly related to specific interval arrangements.
Relevant data details:
2.5.3 Learning Efficiency - Time Spent on New Learning and Review
Learning Distribution Overview
This chart is a scatter plot that compares the relationship between "the number of new cards learned" and "the total time spent" for two user groups (Group A and Group B) when using different versions of spaced repetition algorithms (A.MMX-5-default vs. B.MMX-6-default).
Relevant data visualization:
- Time Efficiency: A.MMX-5-default (FSRS-3) shows slightly higher efficiency than B.MMX-6-default (FSRS-6).
- A lower slope indicates that more cards are learned per unit of time (or less time is spent learning each card).
- From the data, the A.MMX-5-default (FSRS-3) group learns slightly more in the same amount of time, or spends slightly less time on the same number of cards, than the B.MMX-6-default (FSRS-6) group (a fitting sketch follows this list).
- High Individual Variability (High Dispersion):
- Despite the trend line, the data points are highly scattered. Some points are well above the trend line (slower learning, more time spent), while others are far below (faster learning).
- This suggests that while the algorithm has an average performance, individual factors—such as the user's learning ability, card difficulty, or study habits—have a much larger impact on time efficiency than the minor differences between algorithm versions.
- Overall Trend:
- Both groups exhibit linear growth, meaning that as the number of cards increases, the time cost accumulates linearly, aligning with the expected learning curve.
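A minimal sketch of how the per-group time-per-card slope could be estimated from the appendix TSV. The columns `acc_learn_cnt` and `acc_total_time` are listed in section 2.4.4; the `day` and `group` column names and the choice of an ordinary least-squares fit through the origin are our assumptions.

```python
# Sketch: estimate time per new card learned, per group, from each user's Day-30 snapshot.
import pandas as pd

df = pd.read_csv("maimemo_flashcard_abtest.tsv", sep="\t")
day30 = df[df["day"] == 30]                    # hypothetical day column
for group, g in day30.groupby("group"):        # hypothetical group column
    x = g["acc_learn_cnt"].astype(float)
    y = g["acc_total_time"].astype(float)
    slope = (x * y).sum() / (x * x).sum()      # OLS through the origin: time per card
    print(group, round(slope, 3))
```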
User Group Observation
- Furthermore, we divided users into ten groups based on the cumulative number of newly learned cards over 30 days, and all subsequent analyses maintained this grouping.
- The users were segmented into 10 deciles (P00-P10 representing the bottom 10% in cumulative new learning, and P90-P100 representing the top 10% or "super users"). We then compared A.MMX-5-default (FSRS-3) and B.MMX-6-default (FSRS-6).
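The decile grouping could be reproduced roughly as in the sketch below; the `day` and `group` column names are again assumptions, and ties at the bin edges may land in different deciles than in the report's own grouping.

```python
# Sketch: split users into deciles (P00-P10 ... P90-P100) by cumulative new cards at Day 30.
import pandas as pd

df = pd.read_csv("maimemo_flashcard_abtest.tsv", sep="\t")
day30 = df[df["day"] == 30].copy()                       # hypothetical day column
labels = [f"P{10 * i:02d}-P{10 * (i + 1)}" for i in range(10)]
ranks = day30["acc_learn_cnt"].rank(method="first")      # break ties so qcut edges are unique
day30["decile"] = pd.qcut(ranks, q=10, labels=labels)
print(day30.groupby(["decile", "group"], observed=True)["acc_learn_cnt"].mean())
```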
Relevant data visualization:
- Key Finding: Severe polarization exists, with "super users" overshadowing the performance of average users.
- Majority of users (P20 - P80): Group A performs stronger
- Super users (P90 - P100): Group B overtakes
Relevant data details:
Relevant Repetitions
Relevant data visualization:
- There isn't much to analyze regarding the number of repetitions.
Relevant data details:
Relevant Time
Relevant data visualization:
- A deeper analysis of study time reveals a fascinating phenomenon.
- For users in the P90-P100 range, the cumulative number of new cards learned with B.MMX-6-default (FSRS-6) increases and review time rises accordingly, yet the time spent learning new cards decreases.
- This contradicts our expectation that the spaced repetition algorithm would reduce review time through better scheduling.
- For users in the P80-P90 range, B.MMX-6-default (FSRS-6) shows a decrease in cumulative new cards learned and a reduction in learning time, but review time actually increases, which is somewhat frustrating.
Relevant data details:
3 Conclusion, Limitations, and Future Work
Currently, MMX-6-default (FSRS-6) has performed poorly in user experiments.
Compared to MMX-5-default (FSRS-3), MMX-6-default (FSRS-6) has shown a negative trend on user retention, with declines in retention rates for both daily active learners and learners who complete their daily tasks. In terms of learning efficiency, it’s difficult to say whether MMX-6-default (FSRS-6) is superior to MMX-5-default (FSRS-3).
While MMX-6-default (FSRS-6) significantly outperforms MMX-5-default (FSRS-3) in machine learning metrics, its performance in user experiments has been unsatisfactory.
Some have pointed out that the essence of the FSRS series of algorithms lies in personalized training, which we believe is a valuable hypothesis. Two points to note:
- We will test this hypothesis in Exp.02.
- It’s worth adding that in FSRS, parameters are recommended to be updated monthly, with the first month using default parameters for learning. The current experiment essentially reflects the learning situation in the initial month.
Based on the current situation, we plan to proceed with Exp.02 using MMX-5, while MMX-6 requires further analysis to determine why its online performance is subpar.
At this point, we have a concern: the current FSRS, which iterates algorithms based on machine learning metrics (SRS Benchmark), does not seem to yield positive results in user experiments. Perhaps the current research has yet to find the right direction.
Additionally, we have two products based on spaced repetition theory: MaiMemo Flashcards (墨墨记忆卡) and MaiMemo Vocabulary (墨墨背单词), each developed and iterated by separate teams. We simultaneously launched FSRS-related algorithm experiments in both products. After the real-world rollout, both teams independently planned to incorporate human rule interventions into the algorithms based on the bad cases encountered. We believe that for machine learning algorithms of this nature, A/B testing is necessary for objective evaluation when serving real users at scale. There is also a gap between simulated experiments and actual learning.
Of course, this report is preliminary, and the current user sample size may still be insufficient; currently, there are just over 4,000 users in each group. We plan to eventually have 20,000 users per group in Exp.01 and extend the observation period to three months. Meanwhile, Exp.02 will be actively advanced, and we will monitor the progress of Exp.04, 05, and 06. We aim to release a mid-term report on these experiments by May 2026.
Appendix
Memory Algorithm Evaluation Methods
Evolution of Memory Algorithm Evaluation Methods
The previous experiment was quite fascinating. Currently, it appears that in addition to collecting user memory learning data and evaluating machine learning algorithms, another critical infrastructure for memory algorithm research should be a user A/B testing system. This would allow real-world user experiments on algorithms to observe the actual behavior of complex memory systems.
The current infrastructure for memory algorithm research primarily focuses on collecting memory learning data and evaluating machine learning metrics. However, existing machine learning metrics have certain limitations and fail to capture real-world data related to retention and learning time.
We believe that to advance memory algorithm research, it is essential to establish a platform capable of conducting A/B tests with different algorithms on real online users. Only then can we truly observe the actual impact of algorithms on complex memory systems.
Why A/B Testing Platforms Are "Critical Infrastructure"
It is not just a nice-to-have feature but the cornerstone that supports the iteration of upper-layer algorithms.
From "Proxy Metrics" to "Goal Metrics": Offline machine learning metrics (e.g., prediction accuracy) are merely proxy metrics—we hope they indirectly improve users' actual learning outcomes and experience.
A/B testing platforms directly measure your goal metrics, such as long-term retention, knowledge acquisition per unit time, and more. These are the ultimate objectives of the product.
Discovering "Unexpected Correlations": A tweak to a memory algorithm might not affect the expected review efficiency but could instead impact learning efficiency or user completion rates. The real world is highly interconnected and complex. Only through online experiments can we observe the ripple effects of a change across the entire system (product, users, algorithms).
Establishing a Scientific Iteration Loop:
Without A/B testing, the research process is: data collection → algorithm modeling → offline evaluation → launch (followed by gut feelings or vague aggregate data). This is a fragmented, ambiguous process.
With A/B testing, the process becomes: data collection → algorithm modeling → offline evaluation (as a pre-launch checkpoint) → online A/B testing → data-driven decision-making → new data feeding back into the model. This is a scientific, rigorous, and quantifiable closed loop.
Relevant User Experimental Data
https://huggingface.co/datasets/Maimemo/maimemo-algorithm-experiment-01