AI: Lean Six Sigma + Gemini

5 example Lean Six Sigma projects with AI enabled tools


By Martin Hayden

Lead Instructor, LearnSixSigma.co.uk

Published: 2nd Jul 2026

In this article

Introduction
Quality Control
- Part 2 — Advanced
Measurement Systems
Design of Experiments
- Screening
- Modelling

Coming Up: An intensive and exhaustive test of Google Sheets' Gemini used in Lean Six Sigma process improvement.

AI chatbots can be a helpful tool in the learning and application of Lean Six Sigma.

Summary of Findings

We looked at five types of improvement projects commonly undertaken by professionals to see how modern AI tools could be used to guide analysis for those inexperienced with Lean Six Sigma.

While the numerical analysis was identical to other statistical software, it took us considerably longer and required the right prompts and questions to generate useful output.

We found that for an experienced user, there is little to be gained, and for the inexperienced user, there is considerable risk of misinterpreting the results.

While AI tools are becoming more expensive, there are free alternatives for statistical analysis, which provide analyses instantly and repeatably, unlike AI tools, which take several minutes to respond to each new question.

Introduction

There’s been a lot of speculation about a coming revolution with AI augmenting traditional strategies like Lean Six Sigma. It is confidently predicted that AI will make Lean Six Sigma more proactive, more predictive, more analytical, and yield results more quickly.

Let’s examine that prediction by using AI on several improvement projects. We’ll use Google’s Gemini because it claims to have strong data analysis capabilities when integrated with Google Sheets, and it comes with a free 14-day license.

If this is to benefit many of us, then it has to be simple and approachable. That’s supposed to be the benefit of AI: it’s going to make us smarter and more efficient without needing to be experts in data analytics.

We'll test these predictions for a step change in productivity by following a standard improvement project to see how current AI capabilities will handle it. The kind of projects that might lend themselves to AI support could be data-rich projects in either manufacturing or business processes.

Note: Text responses have been summarised where necessary to keep them to a reasonable length.

Case: Manufacturing Quality Control

Findings

For this simple defect reduction project, Gemini was able to find the relevant data quickly and present it in a logical way to show the priority of action. However, it didn’t suggest the Pareto analysis from the perspective of cost as a priority until it was prompted.

Gemini was less capable of developing improvement strategies. Each of the faults identified requires a different approach and likely the collection of specific diagnostic data. Gemini could only propose a logical set of tools, but it really needs more data to provide deeper analysis.

Gemini was good at crunching the available data and producing a logical and useful summary quickly. However, collecting the right diagnostic data is key to resolving these problems, and that will likely depend on the skills and experience of trained specialists.

In the next case study, we investigate how Gemini can contribute to analysing a much larger file of inspection and test data.

Case Study Details

This example is a standard quality control problem from the assembly process for a complex laboratory instrument and uses real data from a previous Lean Six Sigma project I was involved in.

The data I have is a month of defect logging on the final fault inspection of each instrument. The process was achieving a first-pass yield of around 30%. The project charter set a goal to improve first-pass yield on assembly to 95%.

I’ll provide Gemini with the available data on fault descriptions and the cost of scrap or rework for each fault type. There were 121 faults in total listed in the first tab of this spreadsheet, including 15 different fault types. The second tab in the file contained the cost data representing the scrap and labour cost of replacement parts or adjustment and rework of each fault type.

Let’s start by asking simple questions and see where Gemini takes us.

Prompt:

Analyse the data in the Assembly Faults file.

Response:

Pareto chart of assembly fault types in a manufacturing process by frequency. — The Pareto Analysis chart produced by Gemini.

Gemini responded with the classic Pareto chart showing priority of improvement effort by fault count. It includes the usual cumulative percentage line, which indicates that fixing the first six faults would eliminate 80% of all defects. That would probably have saved the inexperienced user time in finding and using the right software or manually creating this chart in a spreadsheet package.
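For readers who want to check a chart like this by hand, the cumulative percentage line is simple to reproduce. The following is a minimal sketch in Python, assuming the faults have already been tallied by type; the fault names and counts are hypothetical placeholders, not the case study data, and `pareto_table` and `vital_few` are illustrative names.

```python
def pareto_table(counts):
    """Return (category, count, cumulative %) rows sorted by descending count."""
    total = sum(counts.values())
    rows = []
    running = 0
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        rows.append((name, count, 100.0 * running / total))
    return rows


def vital_few(counts, threshold=80.0):
    """Return the leading categories whose cumulative share first reaches the threshold."""
    selected = []
    for name, _, cumulative in pareto_table(counts):
        selected.append(name)
        if cumulative >= threshold:
            break
    return selected


# Hypothetical fault counts, NOT the data from the case study.
example_counts = {"Fault A": 40, "Fault B": 25, "Fault C": 15,
                  "Fault D": 10, "Fault E": 6, "Fault F": 4}

for name, count, cumulative in pareto_table(example_counts):
    print(f"{name:8s} {count:3d} {cumulative:6.1f}%")
print("Vital few:", vital_few(example_counts))
```

Passing the total cost per fault type instead of the count gives a cost-based ranking from the same functions.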

Prompt:

Which component is the biggest cost driver?

Response:

The 'Input' component is 21.9% of total faults at $18,630 total annual failure cost

Gemini suggested the next question in its previous answer:

Prompt:

Which four components account for over half of the annual cost?

Response:

Input component: $18,630
Main board: $10,247
Optical component: $9,247
PSU: $7,763

Gemini quickly found the relevant data without prompts by reading the file name and column headers and completed the cost analysis. This saved a lot of time and effort over extracting the data and calculating the cost of each fault.

Prompt:

Repeat the Pareto analysis by cost

Response:

I would have expected a Pareto chart indexed by cost this time, rather than count. That’s a fairly standard option for prioritising faults by greatest saving, but it seems Gemini didn’t answer appropriately. Clearly, Gemini completed the cost analysis before, as shown by the table above. Perhaps I needed to be more specific in my question.

Prompt:

Repeat the Pareto analysis prioritised by cumulative cost from data in sheet 2.

Response:

Pareto chart of assembly fault types in a manufacturing process by cost. - The Pareto Analysis chart of manufacturing faults by cost.

Gemini provided a running commentary on its strategy, more detailed than I could keep pace with, and then successfully created the Pareto chart by cost.

Gemini has redrawn the Pareto chart, clearly identifying the priority of action using cost-benefit to the business rather than a simple count of faults. Let’s ask what strategy we should use to prioritise fault investigation.

Prompt:

Which Pareto should we use to prioritise improvement action?

Response:

To prioritise improvement actions for assembly faults, you should primarily use the Cost-based Pareto. While both sets of data provide valuable insights, prioritising by cost ensures that your improvement efforts deliver the highest financial return and address the most significant business impacts.

A good answer, but remember this wasn’t Gemini’s first offer. It initially provided a Pareto analysis by count. If the fault data had been related to customer defects rather than inspection faults, then arguably, we should prioritise customer satisfaction and use the Pareto analysis of fault counts. Perhaps Gemini knew that all along?

Prompt:

What are the next steps in analysing Input component faults?

Response:

Breakdown faults into smaller groups to find patterns
Stratify the data using the following questions for each fault:
- What nature
- When in time
- Where located
- Who operated
- What material
Develop a Root Cause Analysis using the standard tools: Fishbone, ‘Five Whys’.
Run a Hypothesis test for the significance of differences, e.g. compare faults by each operator for significant change.

These are all standard tools in the Six Sigma toolbox. Gemini can’t tell you which ones are more likely to be useful with the information it has, so it just presents a generic strategy. One or more of these tools may be useful, but there is no real analysis here.

I repeated the same question on the following day to test the continuity of the response.

Prompt:

What are the next steps in analysing Input component faults?

Summary Response:

Individual Case Forensic Analysis
Supplier and Batch Investigation
Assembly Process Audit
Upstream Testing Enhancements
Cost Breakdown Verification

Gemini has changed its advice from the previous response to this question. This is still very generic advice on a strategy to investigate the root cause of faults. With little information, Gemini is unable to provide a clear direction.

In the next case study, we will investigate how Gemini can contribute to analysing a much larger file of inspection and test data.

Case: Manufacturing Quality Control — Part 2

Findings

Gemini is good at sifting through a large volume of data efficiently. It was good at flagging anomalies in that data, but it often needed to be prompted with the right question to usefully direct its analysis.

An open-ended prompt to find quality issues in the data revealed data integrity issues initially, but not product quality issues. A more specific question, like “Find anomalies in the distributions”, usefully identified a few areas worthy of investigation among the hundreds of data sets. This could save considerable time and effort in searching manually.

Having found anomalies to investigate, Gemini was less good at identifying root causes: it could only suggest generic lines of investigation to follow.

Some of the inferences drawn by Gemini were incorrect, e.g. the correlation it found between two output measures of system noise was due to them being different metrics of the same output property, while the weaker correlations with PSU noise discounted by Gemini did turn out to be important.

Gemini is a very useful support tool, but it doesn’t replace the problem-solving experience of a skilled engineer. With the aid of Gemini, they would be able to home in more quickly on the right questions to ask.

Case Study Details

In this next case study, we’ll present Gemini with a much larger data set to further investigate its analytical capabilities.

The data I’m using is from the inspection records of the final test on a complex product assembly. There are over 400 records with over 200 fields in each record. These include descriptive fields like name of the test engineer, date and time, serial number, product variants and also output results showing quality or performance metrics of elements or subsystems in the product.

No cleaning or pre-formatting has been carried out; the data is as it has been exported from the testing software, typos and errors included, as input by the test operators.

Prompt:

Find quality issues in this data.

Response:

Data Integrity & Junk Records: The dataset contains placeholder or test data (notably row 414) and duplicate rows that should be removed to ensure accurate analysis.
Inconsistent Metadata: Tester names and column headers lack standardisation, which complicates automated reporting and data grouping.
Technical Failures: Several units have been flagged as "NOT IN SPEC," with serial numbers 75224 and 75225 showing multiple test failures across different categories.

The first two issues are easily fixed by eliminating duplicates and junk data. Tester names like “L SMITH”, “Lesley Smith” and “L Smith” are easily standardised. Some column headers are repeated (“Max.”, “Min.”, “Average”), so Gemini couldn’t isolate the specific parameters. These columns are always preceded by a unique parameter name, so this is fairly easily fixed. This provides a useful lesson in data integrity.

Prompt:

Standardise the inconsistent Tester names

Response:

Most were correctly standardised, but Gemini failed to change Leslie Smith to L Smith. I’m not sure why, but it was easily fixed with ‘Find and Replace’.
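A rule-based clean-up is a repeatable alternative to ‘Find and Replace’. The sketch below is one possible approach, assuming every entry is a first name or initial followed by a surname; `standardise_tester` is a hypothetical helper, not something Gemini or Google Sheets provides.

```python
def standardise_tester(name):
    """Reduce a tester name to 'Initial Surname', for example 'L Smith'.

    Assumption: each entry is a first name or initial followed by a surname.
    Anything with fewer than two words is returned stripped but otherwise unchanged.
    """
    parts = name.replace(".", " ").split()
    if len(parts) < 2:
        return name.strip()
    initial = parts[0][0].upper()
    surname = parts[-1].capitalize()
    return f"{initial} {surname}"


for raw in ["L SMITH", "Lesley Smith", "L Smith", "Leslie Smith"]:
    print(f"{raw!r:16} -> {standardise_tester(raw)!r}")
```

The obvious limitation is that two different testers sharing an initial and a surname would be merged, so the output still deserves a quick check against the known list of testers.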

Prompt:

Find quality issues in the Product Test Data file

Response:

Several units failed to meet specifications in one or more tests:

Background Test: 8 failures, including units 74475, 74928, 75224, and 75225.
Stability Test: 4 failures, notably units 75201, 75224, and 75225.
Noise Test: 2 failures, 75224 and 75225.
Filter Wheel Calibration: 1 failure 75225.

There is no particular insight from Gemini here. These conclusions were inferred from simple text flags in the file for correction and re-test: e.g. entries like “NOT IN SPEC” or “PASSED”. Gemini did not identify any other technical issues; it had no information on the acceptable ranges for test parameters.

Prompt:

In the numerical data fields, identify anomalies in distributions.

Response Summary:

Noise and SNR Metrics are Highly Skewed:

Bimodal or multi-cluster distributions
Consistent outlier patterns

Detailed Anomaly Analysis

Identified five fields with the most statistical outliers or with extreme skew worthy of further analysis.

This is useful and represents a huge saving in effort over running histograms on upwards of 100 data sets. Skew in noise metrics is to be expected, but the bimodal or multi-cluster distributions suggest an unstable process. The five anomalous fields can now be examined in more detail. One that stands out is RMS SNR (Root Mean Square Signal to Noise Ratio).
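The same kind of first-pass screening can be approximated outside Gemini. The sketch below is an illustration only, using Python's standard library to flag fields with strong skew or with outliers beyond Tukey's fences; the field names, data and the skew limit of 1.0 are assumptions, and it does not attempt to detect bimodal clusters, which would still need a histogram or a more specialised test.

```python
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR outside the quartiles (Tukey's fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def screen_fields(table, skew_limit=1.0):
    """Return the fields whose |skewness| exceeds skew_limit or that contain outliers."""
    flagged = {}
    for field, values in table.items():
        skew = skewness(values)
        outliers = iqr_outliers(values)
        if abs(skew) > skew_limit or outliers:
            flagged[field] = {"skewness": round(skew, 2), "outliers": outliers}
    return flagged


# Hypothetical columns, NOT the inspection data from the case study.
example_table = {
    "Field A": [10, 11, 9, 10, 12, 10, 11, 9, 10, 11],
    "Field B": [1, 1, 2, 1, 2, 1, 1, 2, 1, 30],
}
flagged = screen_fields(example_table)
print(flagged)
```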

Prompt:

Show a histogram of RMS SNR

Response:

A histogram of signal to noise ratio. — The histogram of signal to noise ratio produced by Gemini.

This clearly shows two groups of data, matching the result of the same analysis run in other software. The lower cluster is out of spec and would have been corrected and retested. Approximately 25% of products requiring rework and retest is definitely a problem worth investigating.

Prompt:

Analyse the Signal to Noise data

Response:

Signal-to-Noise Ratio shows variation by test engineer, see plot below.

A box and whisker plot of signal to noise ratio by test engineer. — Box and whisker plot of signal to noise by test engineer.

Gemini responds with a box and whisker plot suggesting this as a potential cause for the variation in SNR data. However, the first two groups represent spurious individual data points, the third is a small sample, and the remaining three don’t really show enough variation to be remarkable. Gemini does eventually recognise this in its more detailed insights accompanying the plot above:

Tester Consistency: Testing is primarily distributed between two main testers, W.N. and L.S. The performance metrics (SNR and Amplitude) are consistent between testers, suggesting that the test results are objective and not heavily influenced by the individual operator.
Noise and Stability Correlation: Units that failed the "Noise Test" or "Stability Test" typically exhibited lower SNR values (ranging from 2,525 to 5,906), confirming that SNR is a reliable predictor of overall unit stability.

Gemini has also recognised that there is a correlation between the text flag “NOT IN SPEC” and low SNR (signal-to-noise ratio) values. That’s good, but this was already well known. The real question is: “What is the root cause of low SNR values and the high rate of rework?”

Prompt:

Analyse the correlation between low SNR values and other parameters to look for a relationship.

Response:

Gemini provided a fairly verbose commentary on its strategy and progress. Much of the detail was not relevant, but it indicated progress during the lengthy pauses.

The correlation analysis between PSU (power supply unit) noise and RMS (root mean square) SNR shows a very weak linear relationship across all measured power supply rails.

The low correlation values suggest that variations in PSU noise, within the ranges observed in this dataset, do not significantly drive changes in the RMS SNR. Other factors likely have a more dominant impact on the signal-to-noise ratio.

Gemini suggests there is no smoking gun here and that the root cause of system noise lies elsewhere. Although the correlation is weak, it may still be worth investigating.

A strong relationship suggested by Gemini is the one between SNR and RMS Noise.

Prompt:

Create a scatter plot between SNR and RMS Noise

Response:

A scatter plot of signal to noise ratio vs root mean square noise.

Since these are both outputs and different measures of the same basic property of signal quality, that’s expected, so this doesn’t tell us anything new.

Let’s instead examine the relationship between the SNR of the output and the PSU noise level. I want to partition the data to include only the low-SNR values to see if the relationship is more than just what Gemini describes as “weak”.

Prompt:

Show correlation coefficients between SNR values below 5000 and other related parameters

Response:

Parameter: Correlation with SNR
Noise RMS: -0.91
12VP Supply: -0.44
12VP Noise: +0.44
5VL Supply: -0.25

The very strong correlation between SNR and Noise RMS arises because they are two measures of the same output, and it may bias Gemini into considering the weaker correlations insignificant. If it is set aside, then the correlations of SNR with the 12VP supply voltage and 12VP noise seem to be worth investigating.
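The partitioned analysis described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the records are hypothetical, the field names simply mirror the labels in the table, and `correlations_below` is an invented helper whose `exclude` argument lets the analyst drop a metric known to duplicate the target.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlations_below(records, target, limit, exclude=()):
    """Correlate `target` with every other field, using only records where
    target < limit and skipping any field named in `exclude`.
    The result is ordered by absolute correlation, strongest first."""
    subset = [r for r in records if r[target] < limit]
    fields = [f for f in subset[0] if f != target and f not in exclude]
    result = {f: pearson([r[target] for r in subset], [r[f] for r in subset])
              for f in fields}
    return dict(sorted(result.items(), key=lambda kv: abs(kv[1]), reverse=True))


# Hypothetical records, NOT the test data from the case study.
records = [
    {"SNR": 3000, "Noise RMS": 9, "12VP Noise": 1.0},
    {"SNR": 4000, "Noise RMS": 7, "12VP Noise": 3.0},
    {"SNR": 4500, "Noise RMS": 6, "12VP Noise": 2.0},
    {"SNR": 6000, "Noise RMS": 4, "12VP Noise": 1.5},
    {"SNR": 7000, "Noise RMS": 3, "12VP Noise": 1.2},
]
full = correlations_below(records, "SNR", 5000)
filtered = correlations_below(records, "SNR", 5000, exclude=("Noise RMS",))
print(full)
print(filtered)
```

Excluding the duplicate metric does not change the remaining coefficients; it only stops a near-perfect but uninformative correlation from dominating the ranking.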

The next action in the real project was to identify PSUs with higher-than-usual noise levels and return them to the supplier for in-depth analysis. That eventually resulted in changes to the sourcing of some internal components, which led to a significant reduction in scrap and rework on assembly.

Gemini has drawn an incorrect inference here: that the high correlation was significant and that the weak correlations were not. Had Gemini known that the two noise metrics were the same output, that conclusion may have been different. However, it is a well-known risk that correlation does not always imply causation.

Case: Measurement Systems Analysis

Findings

Gemini was able to guide the measurement systems analysis (MSA) and would provide support to an inexperienced team. It produced the correct inference that the measurement system was inadequate, but the benchmarks it used were too limited, being based just on the sample standard deviation.

Having provided the part specification and long-term standard deviation for the product, Gemini was able to compute the correct metrics for evaluation, but it still tended to focus on the sample standard deviation. With only 10 components, this would be a poor indication of long-term product variation.

The analysis of comparative operator performance required an additional chart to identify differences, which wasn’t highlighted by Gemini but is available as standard in other software packages.

Gemini can provide useful support to those lacking experience, but the skills and knowledge to ask the right questions are still valuable.

Case Study Details

A common need in manufacturing is to assess the quality of instrument measurements. These are often used in critical inspection requirements during product assembly or testing and give a pass/fail quality criterion.

Conducting a measurement systems analysis or gauge repeatability and reproducibility analysis (gauge R&R) is a fairly straightforward process, but it may not be familiar to many. Could AI guide and support such an analytical process?

Digital calipers and a one pence piece. — Digital calipers are an ideal instrument to conduct an MSA with.

Any instrument or measurement system would be suitable for the case study, but for simplicity, the test case chosen is a familiar teaching exercise using a small digital caliper to measure the diameter of coins. I will assume no prior knowledge or experience and ask Gemini to guide the data collection and analysis.

Prompt:

How can I conduct a Measurement Systems Analysis?

The common advice is 10 parts, 3 operators and 2 repeats. When asked for a recommendation, Gemini volunteered 10, 3, 3, which is a safe if conservative sample.

I proposed 10 parts, 3 operators and 2 repeats. Gemini responded by creating the following table.

Response:

A template table for collecting the inputs to a measurement systems analysis. - The input table for a measurement systems analysis created by Gemini.

Note that other dedicated statistics software packages would randomise the data collection in a single column with the objective that the operator’s second measurement should not be influenced by their first. That is an important consideration; all measurements need to be independent and not subject to bias.
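A randomised collection sheet of the kind described above is straightforward to generate. The following is a minimal sketch for a crossed study of 10 parts, 3 operators and 2 repeats; the operator labels and the `msa_run_sheet` helper are illustrative assumptions, not the layout used by any particular statistics package.

```python
import random


def msa_run_sheet(parts, operators, repeats, seed=None):
    """Return a shuffled list of (part, operator, trial) measurement tasks.

    Shuffling the whole crossed study mixes the repeat measurements in with
    everything else, so an operator is less likely to recall an earlier reading.
    """
    runs = [(part, operator, trial)
            for trial in range(1, repeats + 1)
            for operator in operators
            for part in parts]
    random.Random(seed).shuffle(runs)
    return runs


# Hypothetical operator labels; a fixed seed makes the sheet reproducible.
sheet = msa_run_sheet(parts=range(1, 11), operators=["Op 1", "Op 2", "Op 3"],
                      repeats=2, seed=42)
for order, (part, operator, trial) in enumerate(sheet[:5], start=1):
    print(order, operator, f"part {part}", f"trial {trial}")
```

In practice the parts should also be presented without visible identification, so that the operator cannot tell which part is being re-measured.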

I used data from a previous MSA study here, and loaded the raw data into the Google sheet.

Input data to a measurement systems analysis. — Measurement data previously collected using the digital calipers.

Prompt:

Analyse the MSA data

Response:

The current measurement system is unacceptable for this process, as the Gage R&R exceeds the standard threshold of 30%.

Key Takeaways

Significant Interaction: The "Operator*Part" interaction contributes significantly to reproducibility (39.96% of study variation)
System Performance is Unacceptable: The Total Gage R&R (GRR) accounts for 41.17% of the study variation. According to AIAG standards, a GRR over 30% indicates the measurement system needs improvement.
Low Resolution: The Number of Distinct Categories (ndc) is 3.12. A reliable measurement system should have an NDC of 5 or higher to effectively distinguish between different parts.
Balanced Error Sources: Both Repeatability (28.55%) and Reproducibility (29.67%) contribute significantly to the total error. This suggests that the issues are split between the equipment's consistency and differences in how the three operators perform the measurements.

A measurement systems analysis. — The results of the MSA by Gemini.

(Note that Gemini does not know the specification limits, so it can only compare measurement repeatability and reproducibility with sample variation.)

The takeaway suggested by Gemini that there is significant interaction between “Operator” and “Part” data implies that operators measure different-sized parts with different repeatability. This is likely to be a false conclusion, with variation in measurement between operators more likely. There is a risk that this is accepted as a literal conclusion rather than just a feature of the data.

The inference that the measurement system is unacceptable is quite correct, although it is incomplete without reference to the part specification. So I’ll provide the specification tolerance and repeat the question.

Prompt:

Part spec. is 20.3 – 20.5 long term Std Dev is 0.038, analyse the MSA data

Response:

The MSA conducted with your new parameters reveals that the current measurement system is unacceptable for both process control and part specification assessment.

Key Takeaways

High Measurement System Variation: The Gage R&R accounts for 63.15% of the Study Variation and 32.75% of the Tolerance. Both values exceed the standard 30% threshold for an acceptable measurement system.
Operator Inconsistency (Reproducibility): Reproducibility (55.14% of Study Var) is significantly higher than Repeatability (30.79%). This indicates that the primary source of error is the variation between operators (Operators 1, 2, and 3) rather than the equipment itself.
Insufficient Resolution (ndc): The Number of Distinct Categories (ndc) is 1.73. A value of at least 5 is required to ensure the measurement system can distinguish between different parts in the process.

Before reviewing the analysis and results provided by Gemini, I’ll repeat the analysis using an industry-standard statistics software package that includes MSA analysis as a built-in module.

Source	Std. dev.	Study variation	% Study variation	% Tolerance
Total gauge R&R	0.011	0.065	28.73	32.75
Repeatability	0.005	0.032	14.01	15.97
Reproducibility	0.010	0.057	25.08	28.59
Operator	0.004	0.021	9.23	10.53
Part No * Operator	0.009	0.053	23.32	26.58
Part-to-part	0.036	0.218	95.78	109.1
Total variation	0.038	0.228	100	114

Repeat of the same MSA using other software.

Note. Historical standard deviation is used to calculate some values for std. dev., study variation, and %study variation.
Note. Number of distinct categories = 4
Note. Study Variation is calculated as std. dev. × 6
Note. Values for %process variation are not displayed because they are identical to values for %study variation.
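The headline metrics in the table and notes above follow from a few simple ratios, and working through them shows how sensitive the result is to the choice of benchmark. The sketch below uses the conventional definitions (study variation as 6 × standard deviation, and number of distinct categories as 1.41 × part variation ÷ gauge R&R, truncated to a whole number); the function names are illustrative, and the inputs are the rounded values from the table, so the percentages differ slightly from the software's figures.

```python
import math


def grr_metrics(sd_grr, sd_total, tolerance, k=6.0):
    """Summarise a gauge R&R result from standard deviations.

    sd_grr    : standard deviation of the measurement system (total gauge R&R)
    sd_total  : total standard deviation used as the benchmark
    tolerance : upper minus lower specification limit
    k         : study-variation multiplier (6 in the table above)
    """
    sd_part = math.sqrt(sd_total ** 2 - sd_grr ** 2)
    return {
        "pct_study_variation": 100.0 * sd_grr / sd_total,
        "pct_tolerance": 100.0 * k * sd_grr / tolerance,
        # The number of distinct categories is conventionally truncated.
        "ndc": int(1.41 * sd_part / sd_grr),
    }


def ndc_from_pct(pct_grr):
    """Untruncated ndc implied by a gauge R&R expressed as % of study variation."""
    g = pct_grr / 100.0
    return 1.41 * math.sqrt(1.0 - g ** 2) / g


# Rounded table values: gauge R&R sd 0.011, total sd 0.038, spec 20.3 to 20.5.
metrics = grr_metrics(sd_grr=0.011, sd_total=0.038, tolerance=20.5 - 20.3)
print(metrics)

# The same formula applied to the two % study variation figures quoted in this case.
print(round(ndc_from_pct(63.15), 2), round(ndc_from_pct(28.73), 2))
```

Feeding in 63.15% gives an ndc of about 1.7, while 28.73% gives about 4.7 (reported as 4), so the two ndc values are consistent with one formula applied to two different benchmarks.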

The underlying numerical analysis is identical (both report a gauge R&R of 32.75% of tolerance), with some differences in presentation, format and the benchmark used for % study variation. Let’s test the key takeaways that Gemini gave:

The measurement system is unacceptable. This is correct. The target % variation and % tolerance should be below 30% and ideally below 10%.
High system variation at 63% of the study. That’s not the important benchmark here: it compares measurement error with the study variation, which is 6 times the standard deviation of the sample. The product's long-term standard deviation is a much more realistic benchmark of the process and should always be used. On that basis, the statistics software calculates 29% of the study variation.
Operator inconsistency is high at 55% of the study. This refers to reproducibility. Again, this uses the wrong benchmark; the figure is 25% using the correct benchmark.
Number of distinct categories. Gemini reports 1.7, while the statistics software reports 4. This is due to the different benchmarks used in the analysis.
Gemini recommends in the commentary that a training exercise should be carried out to improve consistency between operators. That would address point 3 above and is very likely to be useful in this case. This is clearly illustrated by the range chart for operators produced by the same statistics software used before.

A range chart of measurement operators. - The range chart for measurement operators.

It would seem likely that Alice is a very experienced operator with highly consistent measurement skills, while Colin and Barry would benefit from additional training.

Design of Experiments — Screening

Findings

The initial analysis provided by Gemini was simple descriptive text. It identified the most influential factors that affected the distance achieved by the Statapult. Arguably, that is the primary purpose of a screening experiment. Having identified the most influential factors, a further modelling experiment would be used to develop a new model that more precisely predicts the output as a function of those primary inputs.

Gemini was able to calculate the statistical significance of each factor when prompted and correctly identified the factors to be retained in the model, including the rubber band with the marginal p-value below 0.1. It only eliminated the non-significant factors and recalculated the model when prompted to do so. However, when that process was repeated a day later, Gemini eliminated the non-significant factors without prompting. Could this be Gemini learning on the job?

It felt a little uncomfortable when Gemini followed a different path with a repeat analysis on a later date. It omitted previous steps and made assumptions about the output required. With the right prompts and questions, Gemini output the same final results and precisely copied the traditional statistical software analysis.

The model it produced had an R² value of 0.84, acceptable for a screening model and identical to the statistical software, but it took considerably longer to arrive at this result. It required prompting with the right questions to output all of the traditional results tables and charts; however, it did meet the primary purpose of screening with little or no expert knowledge required.

The next case study will take just the significant factors identified here and attempt to develop a full predictive model in Gemini to evaluate the R² value and the quality of the model compared to that from statistics software.

Case Study Details

To investigate the use of AI in a screening design of experiments (Screening DoE), I’ll make use of the well-known teaching example, the Statapult. This is a small catapult that has become an industry-standard tool for teaching statistical modelling.

It allows for the adjustment of multiple control factors (launch angle, arm radius, input energy from the rubber band tension, and stop pin settings) to evaluate the effect of those settings on the distance achieved and its repeatability or accuracy.

With good experimental technique and the right analysis software, a modelling equation can be developed for the Statapult with an R² (correlation coefficient squared) exceeding 0.99. This represents a model that explains 99% of the variation in the data, leaving only 1% as noise. Could AI provide that same quality of model prediction to an inexperienced user? The process would normally start with a screening experiment before proceeding to a full modelling experiment.
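For reference, R² is calculated by comparing the model's residual error with the total variation in the observed output. The following minimal sketch shows that calculation; the distances and predictions are hypothetical values chosen for illustration, not Statapult results.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1.0 - ss_residual / ss_total


# Hypothetical distances (cm) and model predictions, for illustration only.
observed = [100.0, 150.0, 200.0, 250.0]
predicted = [102.0, 148.0, 201.0, 249.0]
print(round(r_squared(observed, predicted), 4))  # 0.9992
```

A model that only ever predicts the overall mean scores 0 on this measure, which is why R² is read as the share of variation the model accounts for.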

Let’s assume little knowledge of DoE and ask Gemini for advice.

Prompt:

How to create a DoE model for 7 input factors?

Response:

How many runs can you practically perform?

8 runs (Resolution III)
16 runs (Resolution IV)
32-64 runs (High Resolution)
128 runs (Full Factorial)
Other

The use of an 8-run Resolution III design represents the most economical but least effective design in recognising interactions. Also, 32 runs or more are probably impractical. So let’s ask Gemini what it recommends. After a few follow-up questions, and my limit set at 16 runs, Gemini still recommended an 8-run Plackett-Burman design.

Prompt:

16 runs

Response:

Run	A	B	C	D	E	F	G	Result
1	High	High	High	Low	High	Low	Low
2	Low	High	High	High	Low	High	Low
3	Low	Low	High	High	High	Low	High
4	High	Low	Low	High	High	High	Low
5	Low	High	Low	Low	High	High	High
6	High	Low	High	Low	Low	High	High
7	High	High	Low	High	Low	Low	High
8	Low	Low	Low	Low	Low	Low	Low

This would work as a first-level screening experiment. The choice between an 8- or 12-run Plackett-Burman or a 16-run fractional factorial is probably not important here; all of them would work, but the greater the number of runs, the more data points and the more discriminating the design.

I am more familiar with a 12-run Plackett-Burman, so I’ll ask Gemini to analyse one that I have used many times before and compare its results to those from standard statistics software analysis. With 50% more data points, it should be more effective, but it is still only aimed at the first level of screening.

Prompt:

Replace this with a 12-run Plackett-Burman 2-level design

Response:

Run	A	B	C	D	E	F	G	Result
1	High	High	Low	High	High	High	Low
2	Low	High	High	Low	High	High	High
3	High	Low	High	High	Low	High	High
4	Low	High	Low	High	High	Low	High
5	Low	Low	High	Low	High	High	Low
6	Low	Low	Low	High	Low	High	High
7	High	Low	Low	Low	High	Low	High
8	High	High	Low	Low	Low	High	Low
9	High	High	High	Low	Low	Low	High
10	Low	High	High	High	Low	Low	Low
11	High	Low	High	High	High	Low	Low
12	Low	Low	Low	Low	Low	Low	Low

The standard advice in conducting this experiment is to randomise the run order. This is intended to spread the effects of any uncontrolled factors or environmental changes through the data. It may be of low relevance here and is often ignored. Gemini hasn’t mentioned it.
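Both the design and a randomised run order can be generated in a few lines. The sketch below builds a 12-run Plackett-Burman design by cyclically shifting the standard generator row and adding a final all-Low run, which reproduces the 12-run table above for seven factors, and then shuffles the execution order. The function names and the seed are illustrative assumptions; this is not a description of how Gemini built its table.

```python
import random

# Standard 12-run Plackett-Burman generator row (11 columns): +1 = High, -1 = Low.
PB12_GENERATOR = [+1, +1, -1, +1, +1, +1, -1, -1, -1, +1, -1]


def plackett_burman_12(n_factors):
    """Build a 12-run Plackett-Burman design for up to 11 two-level factors.

    Rows 1 to 11 are cyclic shifts of the generator; row 12 is all Low.
    Only the first n_factors columns are kept.
    """
    if not 1 <= n_factors <= 11:
        raise ValueError("a 12-run design supports 1 to 11 factors")
    rows = []
    row = PB12_GENERATOR[:]
    for _ in range(11):
        rows.append(row[:n_factors])
        row = [row[-1]] + row[:-1]  # shift right by one position
    rows.append([-1] * n_factors)
    return rows


def randomised_order(n_runs, seed=None):
    """Return the standard run numbers 1..n_runs in a random execution order."""
    order = list(range(1, n_runs + 1))
    random.Random(seed).shuffle(order)
    return order


design = plackett_burman_12(7)
labels = {+1: "High", -1: "Low"}
for run in randomised_order(len(design), seed=1):
    print(run, [labels[level] for level in design[run - 1]])
```

Keeping the standard run number alongside each row means the results can be sorted back into design order for analysis.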

Conducting this experiment and collecting the output data, measuring distance to the point of impact, requires good practical experimental technique. This is the step that will make the difference between a useful model and a failed one. Repeatability and accuracy of the input settings and the response or output measure are vital to model quality.

A Statapult made by Air Academy — A Statapult used for teaching DoE.

The Statapult, with its pin positions, makes that easy to achieve for some of the settings. However, for the start angle, there is just a simple scale measured in degrees. That requires control using some kind of position stop to guarantee repeatability. Similarly, the distance to impact needs a clear, measurable indication, usually achieved with the aid of powdered talc to provide a “splash”.

Gemini doesn’t provide this kind of practical input or expertise, but let's see how it analyses the data collected. This was from a classroom training example with the operators well briefed on the importance of repeatability and accuracy.

Start Angle	Stop Pin	Fixed Pin	Cup Posn	Rubber Band	Ball Size	Operator	Distance
170	3	3	5	2	Small	Bill	216.6
170	4	2	6	2	Small	Ann	279.2
150	4	3	5	3	Small	Ann	210
170	3	3	6	2	Large	Ann	220.4
170	4	2	6	3	Small	Bill	308.7
170	4	3	5	3	Large	Ann	262.5
150	4	3	6	2	Large	Bill	187.4
150	3	3	6	3	Small	Bill	267.5
150	3	2	6	3	Large	Ann	175
170	3	2	5	3	Large	Bill	193.6
150	4	2	5	2	Large	Bill	137.2
150	3	2	5	2	Small	Ann	184.1

Data previously collected from a Statapult for a series of screening experiments.

Collecting the data for this takes at least half an hour with the aid of three people, so I’ll use the data from a recent training class. It is shown here with the standard descriptions and values of the input settings. Notice that some of these are discrete, and really only the Start Angle is a continuous metric. This is a feature of the Statapult with its fixed pin settings, but it won’t affect the analysis.

During analysis, these values will be substituted or coded as +1 and -1 to simplify the computation and then replaced with descriptions on completion. The Distance, or output, is measured in cm with a resolution of 0.1 cm. That may imply a little more precision than is provided by a standard tape measure, but it won’t harm the analysis.
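The coding step can be sketched in a few lines of Python. This is only an illustration of the idea, using the screening data from the table above. Which level counts as "high" for Ball Size and Operator is an arbitrary assumption here and only flips the sign of those two effects. The main effect of each factor is then the average Distance at +1 minus the average Distance at -1.

```python
# Screening data from the table above: six settings, the operator, then Distance (cm)
RUNS = [
    (170, 3, 3, 5, 2, "Small", "Bill", 216.6),
    (170, 4, 2, 6, 2, "Small", "Ann", 279.2),
    (150, 4, 3, 5, 3, "Small", "Ann", 210.0),
    (170, 3, 3, 6, 2, "Large", "Ann", 220.4),
    (170, 4, 2, 6, 3, "Small", "Bill", 308.7),
    (170, 4, 3, 5, 3, "Large", "Ann", 262.5),
    (150, 4, 3, 6, 2, "Large", "Bill", 187.4),
    (150, 3, 3, 6, 3, "Small", "Bill", 267.5),
    (150, 3, 2, 6, 3, "Large", "Ann", 175.0),
    (170, 3, 2, 5, 3, "Large", "Bill", 193.6),
    (150, 4, 2, 5, 2, "Large", "Bill", 137.2),
    (150, 3, 2, 5, 2, "Small", "Ann", 184.1),
]
FACTORS = ["Start Angle", "Stop Pin", "Fixed Pin", "Cup Posn",
           "Rubber Band", "Ball Size", "Operator"]
# Setting treated as +1 for each factor (assumed for the two text factors)
HIGH = [170, 4, 3, 6, 3, "Large", "Bill"]

def code_run(run):
    """Replace the seven settings of one run with +1 / -1."""
    return [1 if value == high else -1 for value, high in zip(run[:7], HIGH)]

coded = [code_run(run) for run in RUNS]
distance = [run[7] for run in RUNS]

def main_effect(column):
    """Mean Distance at +1 minus mean Distance at -1 for one factor."""
    high = [y for row, y in zip(coded, distance) if row[column] == 1]
    low = [y for row, y in zip(coded, distance) if row[column] == -1]
    return sum(high) / len(high) - sum(low) / len(low)

effects = {name: main_effect(i) for i, name in enumerate(FACTORS)}
ranked = sorted(effects, key=lambda name: abs(effects[name]), reverse=True)
for name in ranked:
    print(f"{name:12s} {effects[name]:+7.2f} cm")
```

Sorting by absolute size gives the same ordering that a Pareto chart of effects would show.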

Prompt:

Analyse the DoE response data in 'Distance'

Response:

The analysis of the Design of Experiments (DoE) response for Distance reveals several key factors that significantly influence how far the object travels. The data includes 12 experimental runs with 7 factors: Start Angle, Stop Pin, Fixed Pin, Cup Position, Rubber Band, Ball Size, and Operator.

Primary Drivers of Distance: The Start Angle and Ball Size are the most influential factors.
Secondary Factors: Cup Position and the Rubber Band also show clear positive correlations with distance.
Process Robustness: The Operator (Ann vs. Bill) had the smallest impact on the results. This suggests that the distance response is relatively stable regardless of who performs the test.

The charts or visuals provided by Gemini were not easy to interpret or particularly useful. The text responses above, while correct in principle, leave out the important information on statistical significance.

Prompt:

Which factors are statistically significant in the prediction model?

Response:

Start Angle: 0.018
Ball Size: 0.025
Cup Position: 0.047
Rubber Band: 0.080 (marginal)
Stop Pin: 0.196 (not significant)
Fixed Pin: 0.353 (not significant)
Operator: 0.819 (not significant)

These are probability values, or p-values, showing the significance or relevance of each factor in the predictive model. The classic threshold used in hypothesis testing is 0.05: a p-value below 0.05 means there is less than a 5% probability of seeing an effect this large if the factor really had no influence. Anything less than 5% is assumed to be significant and therefore relevant.

However, in DoE, the threshold is commonly set at 0.1 or 10% with the aim that a factor that is marginal is not discarded.

The next step in DoE is to eliminate factors with p-values above 0.1 and recompute the model and the p-value table. Eliminating factors frees up degrees of freedom for estimating the experimental error, which changes the p-values used to assess significance.
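To see why the p-values move when terms are dropped, the sketch below computes a t-statistic for each factor in the full 7-factor model and again after removing the weak factors. It is a hand calculation under stated assumptions, not the method used by Gemini or the statistics software: it relies on the Plackett-Burman columns being orthogonal, it uses the same arbitrary +1 choice for Ball Size and Operator, and it compares against 10% two-sided critical values copied from standard t-tables because the Python standard library has no t-distribution.

```python
import math

# Same 12-run screening data: six settings, the operator, then Distance (cm)
RUNS = [
    (170, 3, 3, 5, 2, "Small", "Bill", 216.6), (170, 4, 2, 6, 2, "Small", "Ann", 279.2),
    (150, 4, 3, 5, 3, "Small", "Ann", 210.0), (170, 3, 3, 6, 2, "Large", "Ann", 220.4),
    (170, 4, 2, 6, 3, "Small", "Bill", 308.7), (170, 4, 3, 5, 3, "Large", "Ann", 262.5),
    (150, 4, 3, 6, 2, "Large", "Bill", 187.4), (150, 3, 3, 6, 3, "Small", "Bill", 267.5),
    (150, 3, 2, 6, 3, "Large", "Ann", 175.0), (170, 3, 2, 5, 3, "Large", "Bill", 193.6),
    (150, 4, 2, 5, 2, "Large", "Bill", 137.2), (150, 3, 2, 5, 2, "Small", "Ann", 184.1),
]
FACTORS = ["Start Angle", "Stop Pin", "Fixed Pin", "Cup Posn",
           "Rubber Band", "Ball Size", "Operator"]
HIGH = [170, 4, 3, 6, 3, "Large", "Bill"]      # +1 level (assumed for text factors)
T_CRIT_10PCT = {4: 2.132, 7: 1.895}            # two-sided 10% values from t-tables

n = len(RUNS)
y = [run[7] for run in RUNS]
signs = {name: [1 if run[i] == HIGH[i] else -1 for run in RUNS]
         for i, name in enumerate(FACTORS)}
mean_y = sum(y) / n
ss_total = sum((value - mean_y) ** 2 for value in y)

def effect(name):
    """Mean at +1 minus mean at -1 (balanced two-level design)."""
    return sum(s * value for s, value in zip(signs[name], y)) / (n / 2)

def t_statistics(terms):
    """Return error degrees of freedom and a t-statistic per term."""
    ss_model = sum(n * (effect(t) / 2) ** 2 for t in terms)  # valid for orthogonal columns
    df_error = n - 1 - len(terms)
    mse = (ss_total - ss_model) / df_error
    se_effect = math.sqrt(4 * mse / n)
    return df_error, {t: effect(t) / se_effect for t in terms}

full_df, full_t = t_statistics(FACTORS)
keep = [t for t in FACTORS if abs(full_t[t]) > T_CRIT_10PCT[full_df]]
reduced_df, reduced_t = t_statistics(keep)

print(f"Full model: {full_df} error df, reduced model: {reduced_df} error df")
for name in keep:
    print(f"{name:12s} t(full) = {full_t[name]:+.2f}   t(reduced) = {reduced_t[name]:+.2f}")
```

With this data the error degrees of freedom rise from 4 to 7. The t-statistics themselves change only slightly, but the critical value they are judged against falls from 2.132 to 1.895, which is why the p-values of the retained factors become smaller.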

Prompt:

Eliminate Stop Pin, Fixed Pin, Operator from the model and repeat analysis

Response:

Start Angle: 0.007
Ball Size: 0.012
Cup Position: 0.029
Rubber Band: 0.060 (marginal)

This has changed the p-values, but it hasn’t changed the decision on whether to include or discard the factor in the model. The p-values computed by Gemini are identical to those provided by statistical software below after the model has been reduced to the significant terms.

Design of Experiments coded coefficients — Results of the DoE

The usual presentation of this data is to use a Pareto chart, which Gemini provided on prompting.

Prompt:

Create a Pareto Chart of Effects

Response:

Design of Experiments Pareto chart — Results of the DoE as a Pareto chart

Design of Experiments — Modelling

Findings

The initial results provided by Gemini were accurate but were presented as descriptive text with embedded values. That was unfamiliar to me, but it may have been easier for an inexperienced user to interpret. Gemini did not volunteer multiple reps or randomisation, but responded well when asked for advice specifically. The crucial effect of the precision of input settings and response measurements on model quality was not mentioned.

Gemini accurately analysed the model with results identical to dedicated statistics software. It provided less supporting data unless asked specifically for additional information or a different presentation format. It commented correctly and accurately on the key takeaways from this modelling experiment. Gemini was much slower to respond, taking several minutes to respond to each question compared to the instant response from dedicated statistics software.

The numerical analysis provided by Gemini was identical to the results from dedicated software. It would, in my opinion, help the inexperienced user of DoE to follow a set of standard questions that direct the analysis. Without at least that minimum user experience, the results from Gemini were limited to a few key points.

Gemini will not remain free to use following my 14-day trial, unlike the dedicated statistics software JASP, which remains free to any user. My preference is to continue to use JASP, which is a fully featured rival package to the industry standard software Minitab. Like any dedicated software, it requires learning and experience to use.

Case Study Details

In this next case study, I’ll investigate using AI in a DoE modelling experiment. The objective is to create an accurate prediction model using just 3 factors with a Full Factorial design. This will include sufficient repeat measurements to provide a high-quality model with R² at 0.99 or greater, that is, a model that explains more than 99% of the variation in the data, leaving less than 1% as noise.

This is the natural next step after a screening experiment that determined the most important factors among a large number of possible inputs.

The example data that I have used comes from a classroom experiment with the Statapult in the hands of an experienced team. They developed their experimental process to eliminate variation, as far as is possible, from all input settings and from the response measurement. The model they achieved using statistics software has an R² value of 0.999, or less than 0.1% noise, a remarkable achievement. Note that this is a different Statapult with different set-up conditions than the previous screening experiment. We should not expect any similarity of models or significance of factors.

We will investigate the model they achieved using their input data in Gemini and compare it for accuracy and ease of use with their result from the statistics software.

We will again assume little knowledge of DoE and ask Gemini for advice.

Prompt:

Create a modelling DoE for 3 factors

Response:

Standard Order	Run Order	Factor A	Factor B	Factor C	Response
1	1	-1	-1	-1
2	2	1	-1	-1
3	3	-1	1	-1
4	4	1	1	-1
5	5	-1	-1	1
6	6	1	-1	1
7	7	-1	1	1
8	8	1	1	1

This is the standard format for a coded model, using just -1 and +1 input values and generic names for the inputs. It mentions neither randomisation nor the number of repeats that should be applied. First, we should update this with the common factor names and the real high and low input settings.

Prompt:

Reset Factor A = Start Angle 150–170 (angles of 150 deg and 170 deg)
Reset Factor B = Stop Pin 3, 4 (Pin positions 3 and 4)
Reset Factor C = Arm Pin 1, 2 (Pin positions 1 and 2)

Response:

Standard Order	Run Order	Start Angle	Stop Pin	Arm Pin	Response
1	1	150	3	1
2	2	170	3	1
3	3	150	4	1
4	4	170	4	1
5	5	150	3	2
6	6	170	3	2
7	7	150	4	2
8	8	170	4	2
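The same 8-run table can be generated programmatically. The sketch below builds a two-level full factorial in standard order (first factor changing fastest) and maps the coded -1/+1 levels to the real settings from the prompt. The names `LEVELS` and `full_factorial` are illustrative, not part of any DoE package.

```python
from itertools import product

# Low and high settings taken from the prompt above
LEVELS = {
    "Start Angle": (150, 170),
    "Stop Pin": (3, 4),
    "Arm Pin": (1, 2),
}

def full_factorial(levels):
    """Return (coded, actual) settings for every run, in standard order."""
    names = list(levels)
    runs = []
    # product() varies its last position fastest, so reverse each combination
    # to make the first factor the fastest-changing one
    for combo in product((-1, 1), repeat=len(names)):
        coded = combo[::-1]
        actual = tuple(levels[name][0] if c == -1 else levels[name][1]
                       for name, c in zip(names, coded))
        runs.append((coded, actual))
    return runs

design = full_factorial(LEVELS)
print("Std", *LEVELS, sep="\t")
for number, (coded, actual) in enumerate(design, start=1):
    print(number, *actual, sep="\t")
```

Adding a fourth entry to `LEVELS` would produce the 16-run design for four factors in the same way.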

Prompt:

How many reps is recommended?

Response:

1 Rep (8 total runs): Good for initial screening
2 Reps (16 total runs): The most common recommendation.
3+ Reps (24+ total runs): Recommended if your testing process is known to have high variability

My textbook recommends 5 reps to aim for 95% confidence in the S model (model of standard deviation in the process). Since we didn’t ask for an S model but implied only a Y model (models the average of the process), let's set 4 reps with the aim of getting a good Y model.

Prompt:

Adjust the design for 4 reps

Response:

Gemini responded by increasing the number of runs to 32 presented in a single vertical table.

Prompt:

Is a random run order recommended?

Response:

Yes, absolutely. Randomizing the run order is a fundamental best practice in Design of Experiments (DoE).

This is good advice from Gemini, and when asked, it provided a single vertical table of 32 runs in random order. However, that considerably extends the time for conducting this experiment by a factor of 4 or more. It is much more efficient to collect all 4 reps one after the other. For that reason, most experimenters usually persuade themselves that the effects of external environmental changes are minimal and that randomisation can be dispensed with.
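The trade-off can be illustrated by counting how often the Statapult has to be re-set between shots. The sketch below is a rough proxy only: it assumes that set-up changes dominate the time taken, and the seed is arbitrary. It compares the 32 shots taken with all 4 reps back to back against a fully randomised order.

```python
import random
from itertools import product

REPS = 4
# The 8 factorial settings as (Start Angle, Stop Pin, Arm Pin), in standard order
settings = [combo[::-1] for combo in product((1, 2), (3, 4), (150, 170))]

# All reps of a setting taken one after the other
grouped = [setting for setting in settings for _ in range(REPS)]

# Every one of the 32 shots placed in random order
rng = random.Random(7)              # arbitrary seed so the example is repeatable
randomised = grouped[:]
rng.shuffle(randomised)

def changeovers(shots):
    """Count the shots that need different settings from the shot before."""
    return sum(1 for previous, current in zip(shots, shots[1:]) if previous != current)

print("Grouped reps:    ", changeovers(grouped), "changeovers")
print("Fully randomised:", changeovers(randomised), "changeovers")
```

Grouped reps need only 7 changeovers, while a fully randomised order can need up to 31, which is consistent with the experiment taking several times longer.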

Our data was collected as an array table with four columns representing the repeats. The four measurements for each run D1, D2, D3, D4 were collected immediately one after the other. In this case the output was measured to the nearest cm. This whole experiment took around half an hour, similar to the first screening experiment. With randomisation it would have likely taken 2 hours.

Start Angle	Stop Pin	Fixed Pin	D1	D2	D3	D4
150	3	1	215	219	214	217
170	3	1	348	342	334	339
150	4	1	108	105	106	110
170	4	1	276	270	277	273
150	3	2	247	255	242	253
170	3	2	400	400	396	402
150	4	2	133	130	131	123
170	4	2	339	337	337	337

Data previously collected from a Statapult for a modelling experiment.

Prompt:

Analyse the DoE modelling data representing 4 reps in D1, D2, D3, D4

Response:

Key Takeaway 1: Dominant Factors and Main Effects

The main effects of all three factors are highly statistically significant. The relative importance and magnitude of these effects can be visualized in the Pareto Chart of DoE Effects (alpha = 0.05).

Start Angle (A) is the most dominant factor: Increasing the Start Angle from 150 to 170 degrees increases the response by 162.44 units on average (p < 0.001). This is by far the largest effect in the system.
Stop Pin (B) has a strong negative effect: Increasing the Stop Pin from 3 to 4 decreases the response by 89.44 units on average (p < 0.001).
Fixed Pin (C) has a moderate positive effect: Increasing the Fixed Pin from 1 to 2 increases the response by 44.31 units on average (p < 0.001).
The Main Effects Plot illustrates these linear trends clearly, showing the steep positive slope for Start Angle, a steep negative slope for Stop Pin, and a moderate positive slope for Fixed Pin.

Key Takeaway 2: Significant Factor Interactions

The factors do not behave independently; instead, there are strong, statistically significant interactions that must be considered when optimizing the system.

Start Angle & Stop Pin Interaction (A:B): This is the strongest interaction in the design (+25.06 effect, p < 0.001). As shown in Interaction: Start Angle & Stop Pin, the negative effect of changing the Stop Pin from 3 to 4 is much more severe at a Start Angle of 150 (decreasing the response by 114.50 units) than at a Start Angle of 170 (decreasing the response by 64.38 units).
Start Angle & Fixed Pin Interaction (A:C): This interaction is also highly significant (+16.81 effect, p < 0.001). As shown in Interaction: Start Angle & Fixed Pin, the positive effect of changing the Fixed Pin from 1 to 2 is much stronger at a Start Angle of 170 (increasing the response by 61.13 units) than at a Start Angle of 150 (increasing the response by 27.50 units).
Stop Pin & Fixed Pin Interaction (B:C): This interaction is not statistically significant (-1.56 effect, p = 0.256). The effect of the Stop Pin remains consistent regardless of which Fixed Pin level is used.
Three-Way Interaction (A:B:C): The three-way interaction is statistically significant but very small (+3.94 effect, p = 0.007), suggesting a minor joint dependency between all three factors.

Key Takeaway 3: Exceptional Model Quality and Precision

The experiment was conducted with extremely high precision, making the conclusions highly reliable.

High Reproducibility: The standard deviation across the 4 replicates (D1, D2, D3, D4) for each of the 8 runs is extremely low, ranging from a minimum of 1.00 (for Run 8: 170, 4, 2) to a maximum of only 5.91 (for Run 5: 150, 3, 2).
Excellent Model Fit ( $R^2 = 0.999$ ): A full factorial OLS regression model explains 99.9% of the total variance in the response. The residual mean square error is only 14.41, indicating that the average deviation of any individual replicate from its group mean is only about 3.79 units.
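The reproducibility figures quoted in this response are easy to cross-check from the four repeats of each run. The sketch below uses only Python's `statistics` module. It assumes equal reps per run, so that the residual mean square of the full factorial model is simply the average of the within-run variances.

```python
import math
from statistics import mean, stdev

# (Start Angle, Stop Pin, Fixed Pin) -> the four repeat distances D1..D4 in cm
DATA = {
    (150, 3, 1): [215, 219, 214, 217],
    (170, 3, 1): [348, 342, 334, 339],
    (150, 4, 1): [108, 105, 106, 110],
    (170, 4, 1): [276, 270, 277, 273],
    (150, 3, 2): [247, 255, 242, 253],
    (170, 3, 2): [400, 400, 396, 402],
    (150, 4, 2): [133, 130, 131, 123],
    (170, 4, 2): [339, 337, 337, 337],
}

# Sample standard deviation of the repeats within each run
sds = {setting: stdev(reps) for setting, reps in DATA.items()}
for setting, reps in DATA.items():
    print(setting, f"mean = {mean(reps):7.2f}", f"s = {sds[setting]:5.2f}")

# Pooled within-run variance = residual mean square when the model has a term
# for every run (full factorial with all interactions) and reps are equal
mse = mean(s ** 2 for s in sds.values())
rmse = math.sqrt(mse)
print(f"Residual mean square = {mse:.2f}, root mean square error = {rmse:.2f}")
```

The per-run means feed what the text calls the Y model, and the per-run standard deviations are the raw material for an S model.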

I have reproduced Gemini’s response at this stage in full, to show the text nature of its output. I found this quite difficult to interpret, being more used to tabulated values for coefficients and the significance of individual factors and interactions. I will ask Gemini to produce this in tabular form.

Prompt:

Present a table of coded coefficients and p-values for all model terms

Model Term	Factor(s)	Total Effect	Coded Coefficient	p-value	Significance
Intercept	(Overall Mean)	-	256.72	<0.001	***
A	Start Angle	162.44	81.22	<0.001	***
B	Stop Pin	-89.44	-44.72	<0.001	***
C	Fixed Pin	44.31	22.16	<0.001	***
A*B	Start Angle * Stop Pin	25.06	12.53	<0.001	***
A*C	Start Angle * Fixed Pin	16.81	8.41	<0.001	***
B*C	Stop Pin * Fixed Pin	-1.56	-0.78	0.256	ns
ABC	3-Way Interaction	3.94	1.97	0.007	**

Significance Codes:

*** = Highly Significant (p < 0.001)
** = Significant (p < 0.01)
* = Marginally Significant (p < 0.05)
ns = Not Significant (p > 0.05)

The table format is more familiar to me and, in my opinion, easier to interpret. For comparison, the results from statistics software are below, showing precise agreement on all the coefficients and p-values.

Prompt:

Calculate the model equation and the R²

$D = 256.7188 + 81.2188 A - 44.7188 B + 22.1563 C + 12.5313 AB + 8.4063 AC - 0.7813 BC + 1.9688 ABC$
$R^2 = 0.9988$
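The equation and R² above can also be reproduced by hand from the replicate data. The sketch below is one way to do it, assuming the balanced, orthogonal 2³ design used here, where each coded coefficient is simply the average of (sign × response) over all 32 measurements. The helper names `sign` and `predict` are illustrative, and the sketch does not compute p-values.

```python
from math import prod

# (Start Angle, Stop Pin, Fixed Pin) -> the four repeat distances D1..D4 in cm
DATA = {
    (150, 3, 1): [215, 219, 214, 217],
    (170, 3, 1): [348, 342, 334, 339],
    (150, 4, 1): [108, 105, 106, 110],
    (170, 4, 1): [276, 270, 277, 273],
    (150, 3, 2): [247, 255, 242, 253],
    (170, 3, 2): [400, 400, 396, 402],
    (150, 4, 2): [133, 130, 131, 123],
    (170, 4, 2): [339, 337, 337, 337],
}
LOW = {"A": 150, "B": 3, "C": 1}     # A = Start Angle, B = Stop Pin, C = Fixed Pin
TERMS = ["A", "B", "C", "AB", "AC", "BC", "ABC"]

def sign(term, setting):
    """Coded value (+1 or -1) of a main effect or interaction for one run."""
    levels = {k: (-1 if v == LOW[k] else 1) for k, v in zip("ABC", setting)}
    return prod(levels[letter] for letter in term)

observations = [(setting, y) for setting, reps in DATA.items() for y in reps]
n = len(observations)

intercept = sum(y for _, y in observations) / n
coef = {t: sum(sign(t, s) * y for s, y in observations) / n for t in TERMS}

def predict(setting):
    """Evaluate the coded model equation at one combination of settings."""
    return intercept + sum(coef[t] * sign(t, setting) for t in TERMS)

ss_total = sum((y - intercept) ** 2 for _, y in observations)
ss_resid = sum((y - predict(s)) ** 2 for s, y in observations)
r_squared = 1 - ss_resid / ss_total

print(f"Intercept {intercept:9.4f}")
for term in TERMS:
    print(f"{term:9s} {coef[term]:9.4f}")
print(f"R^2       {r_squared:9.4f}")
```

Because the model contains every possible term, `predict((170, 3, 2))` simply returns the average of the four repeats for that run, 399.5 cm. Each coefficient is half of the corresponding effect, since the coded scale runs from -1 to +1.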

For comparison, here is the equivalent model equation and R² from other statistics software.

Design of Experiments modelling equation — Model equation and associated R² from the DoE, produced by other statistics software.

Design of Experiments R^2 calculation — Model equation and associated R² from the DoE, produced by other statistics software.