Top 100 AI ML MCQs (2026) – With Answers & Explanations

This article covers the top 100 AI ML MCQ questions with answers and explanations. It’s a fast, beginner-friendly way to prepare for interviews and exams and to stay up to date with essential AI and Machine Learning concepts in 2026.
AI & Machine Learning MCQs for Interviews
Before 2026, when we went for interviews, we usually searched for “Top 100 Python MCQ questions”, “50 JavaScript concepts,” etc. But now, after the AI boom, things have changed.
Even if you’re going for a C interview, interviewers expect you to have at least basic knowledge of AI and Machine Learning. And if you’re not actively working in ML, it becomes even more important to understand the key concepts, terms, and fundamentals.
That’s exactly why I created this compact AI & ML MCQ guide. You can go through all 100 questions in about an hour, and the value it can add to your interview preparation is massive, potentially worth thousands if you negotiate well!
Here are your top 100 AI ML MCQs, along with their answers and explanations:
Q1. Which of the following best describes the relationship between Artificial Intelligence, Machine Learning, and Deep Learning?
A. AI is a subset of Machine Learning.
B. Deep Learning is a subset of AI, and AI is a subset of Machine Learning.
C. Machine Learning is a subset of AI, and Deep Learning is a subset of Machine Learning.
D. All three are distinct fields with no overlap.
Show Answer
Answer: C
Machine Learning is a subset of the broader AI field, while Deep Learning is a specialized subset of Machine Learning.
Q2. In the context of Machine Learning, what is the primary difference between supervised and unsupervised learning?
A. Supervised learning uses labeled data, while unsupervised learning uses unlabeled data.
B. Supervised learning is used for clustering, while unsupervised learning is used for classification.
C. Supervised learning requires no human intervention, while unsupervised learning does.
D. Supervised learning works with images, while unsupervised learning works with text.
Show Answer
Answer: A
Supervised learning algorithms learn from input-output pairs (labeled data), whereas unsupervised learning finds patterns in data without labels.
Q3. Which algorithm is commonly used for both classification and regression tasks in supervised learning?
A. K-Means Clustering
B. Apriori Algorithm
C. K-Nearest Neighbors (KNN)
D. Principal Component Analysis (PCA)
Show Answer
Answer: C
KNN is a versatile algorithm that can be adapted to predict discrete labels (classification) or continuous values (regression).
Q4. What is the primary goal of a regression algorithm in machine learning?
A. To group similar data points together.
B. To predict a continuous numerical value.
C. To classify data into distinct categories.
D. To reduce the dimensionality of the dataset.
Show Answer
Answer: B
Regression analysis is used to predict a dependent variable (output) which is continuous in nature, such as price or temperature.
Q5. Which Python library is most widely used for performing numerical computations in AI and ML projects?
A. Matplotlib
B. NumPy
C. Scikit-learn
D. TensorFlow
Show Answer
Answer: B
NumPy provides support for large, multi-dimensional arrays and matrices, along with a collection of mathematical functions to operate on these arrays.
Q6. In a Confusion Matrix for a binary classification problem, what does the term “True Positive” represent?
A. The model predicted negative, and the actual value was negative.
B. The model predicted positive, and the actual value was negative.
C. The model predicted positive, and the actual value was positive.
D. The model predicted negative, and the actual value was positive.
Show Answer
Answer: C
A True Positive indicates a correct prediction where the model correctly identified the positive class.
Q7. Which evaluation metric is best suited when dealing with an imbalanced dataset in a classification problem?
A. Accuracy
B. F1-Score
C. Mean Squared Error
D. R-Squared
Show Answer
Answer: B
The F1-Score is the harmonic mean of precision and recall, making it more robust than accuracy when class distribution is uneven.
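To see why accuracy can mislead on imbalanced data, here is a minimal sketch in plain Python. The function name and the counts are made up for illustration; in practice you would usually rely on a library metric instead.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical imbalanced dataset: 1,000 samples, only 10 positives.
# A model that always predicts "negative" never finds a positive case.
tp, fp, fn, tn = 0, 0, 10, 990
accuracy = (tp + tn) / (tp + fp + fn + tn)
_, _, f1 = precision_recall_f1(tp, fp, fn)
print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")
```

Accuracy looks excellent here even though the model is useless for the positive class, while the F1-Score drops to zero.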
Q8. What is the phenomenon called when a machine learning model learns the training data too well, including noise and outliers?
A. Underfitting
B. Overfitting
C. Regularization
D. Cross-validation
Show Answer
Answer: B
Overfitting occurs when a model captures noise in the training data, leading to poor generalization on unseen data.
Q9. Which technique is used to reduce the complexity of a model to prevent overfitting?
A. Feature Scaling
B. Regularization
C. Data Augmentation
D. Gradient Descent
Show Answer
Answer: B
Regularization (like L1 and L2) adds a penalty term to the loss function to discourage the model from fitting complex patterns in the noise.
Q10. In the context of neural networks, what is the function of an activation function?
A. To initialize the weights of the network.
B. To calculate the loss of the network.
C. To introduce non-linearity into the network.
D. To connect different layers of the network.
Show Answer
Answer: C
Activation functions allow neural networks to learn complex patterns by introducing non-linear properties to the output of neurons.
Q11. Which activation function outputs a value between 0 and 1 and is often used for binary classification output layers?
A. ReLU (Rectified Linear Unit)
B. Tanh
C. Sigmoid
D. Softmax
Show Answer
Answer: C
The Sigmoid function maps any input to a value between 0 and 1, making it suitable for modeling probabilities in binary classification.
Q12. What is the main purpose of using the “Softmax” activation function in a neural network?
A. To handle missing values.
B. To perform binary classification.
C. To convert output scores into probabilities for multi-class classification.
D. To speed up the training process.
Show Answer
Answer: C
Softmax converts a vector of numbers into a vector of probabilities, where the probabilities sum up to 1, used for multi-class problems.
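A small sketch of Softmax in plain Python may help here. The input scores are arbitrary example values, and subtracting the maximum score is a common numerical-stability trick rather than part of the definition.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(scores)                      # shift scores to avoid overflow in exp()
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])         # example scores for three classes
print(probs, sum(probs))
```

The largest score gets the largest probability, and the ordering of the classes is preserved.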
Q13. Which unsupervised learning algorithm is primarily used for dimensionality reduction?
A. K-Means
B. DBSCAN
C. Principal Component Analysis (PCA)
D. Linear Regression
Show Answer
Answer: C
PCA reduces the number of variables in a dataset while preserving as much variance (information) as possible.
Q14. In Reinforcement Learning, what is the entity that interacts with the environment and makes decisions called?
A. The Supervisor
B. The Agent
C. The Model
D. The Oracle
Show Answer
Answer: B
An agent in reinforcement learning takes actions within an environment to maximize a cumulative reward.
Q15. Which optimization algorithm is considered the standard for training deep neural networks due to its efficiency?
A. Stochastic Gradient Descent (SGD)
B. Adam (Adaptive Moment Estimation)
C. Batch Gradient Descent
D. Ridge Regression
Show Answer
Answer: B
Adam combines the benefits of two other extensions of SGD (AdaGrad and RMSProp) and generally converges faster.
Q16. What does the term “Epoch” refer to in the context of training a neural network?
A. One complete pass of the entire training dataset through the algorithm.
B. A single update of the model’s weights.
C. The time taken to train the model.
D. A subset of the training data.
Show Answer
Answer: A
An epoch means the entire dataset has been passed forward and backward through the neural network exactly once.
Q17. Which library in Python is specifically designed for data manipulation and analysis, often used in AI preprocessing?
A. NumPy
B. Pandas
C. PyTorch
D. Keras
Show Answer
Answer: B
Pandas offers data structures and operations for manipulating numerical tables and time series, crucial for data cleaning.
Q18. What is the primary disadvantage of using a very large batch size during training?
A. It makes the model overfit immediately.
B. It requires less memory.
C. It can lead to slower convergence and getting stuck in local minima.
D. It increases the number of epochs required.
Show Answer
Answer: C
Large batch sizes can reduce the stochastic nature of gradient descent, potentially causing the model to converge to sharp, less generalizable minima.
Q19. In the K-Means clustering algorithm, what does the letter “K” represent?
A. The number of iterations.
B. The number of clusters.
C. The distance metric.
D. The number of features.
Show Answer
Answer: B
K is a hyperparameter that specifies the number of centroids (clusters) the algorithm should find in the data.
Q20. Which type of machine learning problem involves predicting a category label?
A. Regression
B. Clustering
C. Classification
D. Dimensionality Reduction
Show Answer
Answer: C
Classification is the task of predicting a discrete class label, such as “Spam” or “Not Spam”.
Q21. What is the “Vanishing Gradient Problem” typically associated with in deep learning?
A. Gradients becoming too large, causing weights to explode.
B. Gradients becoming too small, preventing earlier layers from learning.
C. The loss function becoming zero.
D. The model training too quickly.
Show Answer
Answer: B
In deep networks, gradients can shrink exponentially as they backpropagate, leaving early layers with near-zero updates, halting learning.
Q22. Which deep learning architecture is best suited for processing sequential data like time series or text?
A. Convolutional Neural Network (CNN)
B. Recurrent Neural Network (RNN)
C. Random Forest
D. Logistic Regression
Show Answer
Answer: B
RNNs have loops allowing information to persist, making them effective for tasks where the order of data points matters.
Q23. What is the technique called where a pre-trained model is used as the starting point for a new task?
A. Regularization
B. Transfer Learning
C. Ensemble Learning
D. Feature Extraction
Show Answer
Answer: B
Transfer learning leverages knowledge learned from a large dataset to solve a similar problem with a smaller dataset.
Q24. Which metric represents the proportion of actual positives that were correctly identified by the model?
A. Precision
B. Recall (Sensitivity)
C. Accuracy
D. Specificity
Show Answer
Answer: B
Recall calculates how many of the actual positive cases the model was able to predict correctly.
Q25. In a decision tree, what is the name of the measure used to select the best split at a node?
A. Learning Rate
B. Information Gain / Gini Impurity
C. Root Mean Square Error
D. Correlation Coefficient
Show Answer
Answer: B
These metrics quantify the purity of the split, helping the algorithm decide which feature provides the most useful information.
Q26. What does the term “Bias” represent in the context of the Bias-Variance tradeoff?
A. Error introduced by approximating a real-world problem with a simplified model.
B. Error caused by sensitivity to small fluctuations in the training set.
C. The difference between predicted and actual values.
D. The noise in the data.
Show Answer
Answer: A
High bias usually leads to underfitting, where the model is too simple to capture the underlying pattern of the data.
Q27. Which layer in a Convolutional Neural Network (CNN) is responsible for extracting features like edges and textures?
A. Fully Connected Layer
B. Pooling Layer
C. Convolutional Layer
D. Dropout Layer
Show Answer
Answer: C
Convolutional layers apply filters (kernels) to the input image to create feature maps that highlight distinct features.
Q28. What is the primary function of the “Pooling Layer” in a CNN?
A. To increase the dimensionality of the image.
B. To reduce the spatial dimensions (downsampling).
C. To add non-linearity.
D. To connect all neurons.
Show Answer
Answer: B
Pooling layers reduce the number of parameters and computation in the network, helping to control overfitting.
Q29. Which method is commonly used to handle missing values in a dataset during preprocessing?
A. Backpropagation
B. Imputation
C. Normalization
D. One-Hot Encoding
Show Answer
Answer: B
Imputation involves replacing missing values with substituted values, such as the mean, median, or mode of the column.
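As a rough illustration, mean imputation can be sketched in a few lines of plain Python. Here `None` stands in for a missing value, and the function name is hypothetical; data libraries provide their own helpers for this.

```python
import statistics

def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in column]

ages = [20.0, None, 40.0, None, 30.0]    # made-up column with gaps
print(impute_mean(ages))
```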
Q30. What is the process of converting categorical variables into a form that could be provided to ML algorithms called?
A. Normalization
B. Standardization
C. One-Hot Encoding
D. Imputation
Show Answer
Answer: C
One-Hot Encoding creates binary columns for each category, preventing the algorithm from assuming an ordinal relationship.
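Here is a minimal sketch of the idea in plain Python. The function name and the color values are illustrative only, and the columns are sorted alphabetically just to make the output deterministic.

```python
def one_hot(values):
    """Return (categories, rows), one binary column per distinct category."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1            # exactly one "hot" position per row
        rows.append(row)
    return categories, rows

categories, encoded = one_hot(["Red", "Green", "Blue", "Green"])
print(categories)   # column order
print(encoded)
```

Each row contains a single 1, so no category looks "bigger" than another to the model.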
Q31. Which algorithm is based on Bayes’ Theorem and assumes independence between predictors?
A. Naive Bayes
B. Support Vector Machine
C. Decision Tree
D. Random Forest
Show Answer
Answer: A
Naive Bayes is a probabilistic classifier that assumes the presence of a particular feature in a class is unrelated to the presence of any other feature.
Q32. In Support Vector Machine (SVM), what is the name of the boundary that separates the data points of different classes?
A. Decision Boundary / Hyperplane
B. Centroid
C. Root Node
D. Weight Vector
Show Answer
Answer: A
The hyperplane is the line (in 2D) or plane (in 3D) that maximizes the margin between different classes.
Q33. What is “Feature Scaling” in the context of data preprocessing?
A. Selecting only the most important features.
B. Transforming features to a similar scale.
C. Removing duplicate features.
D. Adding new features.
Show Answer
Answer: B
Feature scaling (like Min-Max or Standardization) ensures that one feature does not dominate others due to larger magnitude.
Q34. Which popular deep learning framework was developed by the Facebook AI Research lab?
A. TensorFlow
B. Keras
C. PyTorch
D. Theano
Show Answer
Answer: C
PyTorch is an open-source machine learning library developed by Facebook, known for its flexibility and dynamic computation graphs.
Q35. What is the term for the error between the predicted value and the actual value in regression analysis?
A. Residual
B. Variance
C. Gradient
D. Bias
Show Answer
Answer: A
A residual is the difference between the observed value and the predicted value provided by the model.
Q36. Which layer is typically placed at the end of a CNN for performing the final classification task?
A. Convolutional Layer
B. Pooling Layer
C. Fully Connected (Dense) Layer
D. Batch Normalization Layer
Show Answer
Answer: C
Fully connected layers take the flattened feature maps and perform the high-level reasoning to output class probabilities.
Q37. What technique is used to prevent a neural network from overfitting by randomly disabling neurons during training?
A. Batch Normalization
B. Dropout
C. Gradient Clipping
D. Early Stopping
Show Answer
Answer: B
Dropout randomly sets a fraction of input units to 0 during training, forcing the network to learn more robust features.
Q38. In the context of NLP, what is the simplest way to convert text into numerical vectors?
A. Word2Vec
B. Bag of Words (BoW)
C. LSTM
D. Transformer
Show Answer
Answer: B
The Bag of Words model represents text by the frequency of words, disregarding grammar and word order.
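A quick sketch of Bag of Words, assuming simple lowercase whitespace tokenization (real pipelines usually tokenize more carefully). The function name and sentences are made up for illustration.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocab])  # word order is discarded
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the cat saw the dog"])
print(vocab)
print(vectors)
```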
Q39. Which loss function is most appropriate for a multi-class classification problem?
A. Mean Squared Error
B. Binary Cross-Entropy
C. Categorical Cross-Entropy
D. Hinge Loss
Show Answer
Answer: C
Categorical Cross-Entropy compares the predicted probability distribution with the actual distribution for multiple classes.
Q40. What is the role of the “Learning Rate” in gradient descent?
A. To determine the size of the step taken towards the minimum.
B. To determine the direction of the gradient.
C. To initialize the weights.
D. To calculate the bias.
Show Answer
Answer: A
The learning rate controls how much the model weights are updated in response to the estimated error during training.
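To make the effect of the step size concrete, here is a toy sketch that minimizes f(x) = x² with plain gradient descent. The function, starting point, and learning rates are arbitrary choices for illustration.

```python
def gradient_descent(learning_rate, steps=20, start=5.0):
    """Minimize f(x) = x**2, whose gradient is 2*x."""
    x = start
    for _ in range(steps):
        grad = 2 * x
        x -= learning_rate * grad        # step size is scaled by the learning rate
    return x

small = gradient_descent(0.01)      # tiny steps: still far from the minimum at 0
good = gradient_descent(0.1)        # reasonable steps: close to 0
too_large = gradient_descent(1.1)   # overshoots each time and diverges
print(small, good, too_large)
```

A learning rate that is too small wastes iterations, while one that is too large can make the updates bounce away from the minimum.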
Q41. Which ensemble method combines multiple models (typically decision trees) trained on random subsets of the data?
A. Boosting
B. Bagging (Random Forest)
C. Stacking
D. Blending
Show Answer
Answer: B
Bagging (Bootstrap Aggregating) trains models in parallel on different samples and aggregates their predictions, as seen in Random Forests.
Q42. What is the primary difference between Bagging and Boosting?
A. Bagging trains models sequentially, while Boosting trains them in parallel.
B. Bagging trains models in parallel, while Boosting trains them sequentially.
C. Bagging is for regression only, while Boosting is for classification only.
D. There is no difference.
Show Answer
Answer: B
Bagging builds independent models, whereas Boosting builds models sequentially, where each new model corrects the errors of the previous one.
Q43. What is the technique of providing input data to a model and checking its output against known results called?
A. Training
B. Testing / Evaluation
C. Validation
D. Inference
Show Answer
Answer: B
Testing/Evaluation uses a separate dataset to measure the performance of the trained model on unseen data.
Q44. Which algorithm is effective for anomaly detection and recommendation systems?
A. Linear Regression
B. K-Means Clustering
C. Logistic Regression
D. SVM
Show Answer
Answer: B
K-Means can identify anomalies as points that are far from any cluster centroid or form their own small clusters.
Q45. What does the “R-Squared” (R2) score indicate in regression?
A. The average error magnitude.
B. The proportion of the variance in the dependent variable that is predictable from the independent variables.
C. The square root of the mean error.
D. The correlation between predicted and actual values.
Show Answer
Answer: B
R-Squared represents the goodness of fit of the model. It typically ranges from 0 to 1, where 1 indicates a perfect fit, but it can be negative when the model fits worse than simply predicting the mean.
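As a sketch, R-Squared can be computed from the residual and total sums of squares. The function name and sample values are made up, and the code assumes the target values are not all identical.

```python
def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (assumes y_true is not constant)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y, [1.0, 2.0, 3.0, 4.0])    # predictions match exactly
baseline = r_squared(y, [2.5, 2.5, 2.5, 2.5])   # always predict the mean
reversed_fit = r_squared(y, [4.0, 3.0, 2.0, 1.0])  # worse than the mean baseline
print(perfect, baseline, reversed_fit)
```

Always predicting the mean gives a score of 0 under this formula, and predictions that are worse than that baseline push the score below 0.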
Q46. In NLP, which technique reduces words to their base or root form?
A. Tokenization
B. Lemmatization / Stemming
C. Parsing
D. One-Hot Encoding
Show Answer
Answer: B
Stemming and Lemmatization normalize text by reducing inflected words to their word stem or lemma.
Q47. What is a “Generative Adversarial Network” (GAN) composed of?
A. An Encoder and a Decoder.
B. A Generator and a Discriminator.
C. A Convolutional and a Pooling layer.
D. A Regressor and a Classifier.
Show Answer
Answer: B
GANs consist of two neural networks contesting with each other: a generator creates data, and a discriminator evaluates it.
Q48. What is the purpose of a validation set during model training?
A. To train the model weights.
B. To evaluate the model’s final performance.
C. To tune hyperparameters and prevent overfitting during training.
D. To store the data.
Show Answer
Answer: C
The validation set provides an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters.
Q49. Which metric is defined as the ratio of True Positives to the total predicted Positives?
A. Recall
B. Precision
C. Accuracy
D. F1-Score
Show Answer
Answer: B
Precision answers the question: “Of all the instances predicted as positive, how many were actually positive?”
Q50. What is the name of the process where raw text is broken down into smaller units (words or sentences)?
A. Filtering
B. Tokenization
C. Normalization
D. Vectorization
Show Answer
Answer: B
Tokenization is the first step in NLP preprocessing, breaking a stream of text into words, phrases, symbols, or other meaningful elements.
Q51. Which algorithm works by calculating the probability of a data point belonging to a class based on the distance to its neighbors?
A. K-Nearest Neighbors (KNN)
B. Logistic Regression
C. Naive Bayes
D. SVM
Show Answer
Answer: A
KNN classifies a data point based on how its neighbors are classified, assuming similar things exist in close proximity.
Q52. What is the main assumption made by the Naive Bayes classifier?
A. Features are highly correlated.
B. Features are independent of each other.
C. Data is normally distributed.
D. Data is linearly separable.
Show Answer
Answer: B
The “naive” assumption is that the presence of a particular feature in a class is unrelated to the presence of any other feature.
Q53. Which technique helps in handling the “exploding gradient” problem in RNNs?
A. Gradient Clipping
B. Dropout
C. Batch Normalization
D. ReLU Activation
Show Answer
Answer: A
Gradient clipping sets a threshold to cap the maximum value of gradients, preventing them from growing exponentially.
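Two common flavors of gradient clipping can be sketched as follows. Both function names and the gradient values are hypothetical; deep learning frameworks ship their own clipping utilities.

```python
import math

def clip_by_value(grads, threshold):
    """Cap each gradient component to the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grads]

def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

exploding = [30.0, -40.0]                       # made-up oversized gradients
by_value = clip_by_value(exploding, threshold=5.0)
by_norm = clip_by_norm(exploding, max_norm=5.0)
print(by_value, by_norm)
```

Clipping by value caps each component independently, while clipping by norm keeps the gradient's direction and only shrinks its length.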
Q54. What is “Data Augmentation” primarily used for in computer vision?
A. To clean the data.
B. To increase the diversity of the training set by applying transformations.
C. To reduce the number of images.
D. To label the images.
Show Answer
Answer: B
Data augmentation increases dataset size by rotating, flipping, or cropping existing images, helping models generalize better.
Q55. Which algorithm is specifically designed to solve the vanishing gradient problem in traditional RNNs?
A. LSTM (Long Short-Term Memory)
B. Perceptron
C. CNN
D. K-Means
Show Answer
Answer: A
LSTMs introduce gating mechanisms that allow the network to decide what to keep and what to forget, preserving gradients over longer sequences.
Q56. What is the purpose of the “Kernel Trick” in SVM?
A. To speed up training.
B. To map non-linear data into a higher-dimensional space where it is linearly separable.
C. To reduce the number of support vectors.
D. To handle missing values.
Show Answer
Answer: B
The Kernel Trick allows SVMs to create non-linear decision boundaries by implicitly mapping inputs to high-dimensional feature spaces.
Q57. Which Python visualization library is commonly used to plot heatmaps and statistical data?
A. Matplotlib
B. Seaborn
C. Plotly
D. Bokeh
Show Answer
Answer: B
Seaborn is built on top of Matplotlib and provides a high-level interface for drawing attractive and informative statistical graphics.
Q58. What does the term “Backpropagation” refer to?
A. The process of feeding input forward through the network.
B. The algorithm used to calculate the gradient of the loss function with respect to the weights.
C. The activation function logic.
D. The method of initializing weights.
Show Answer
Answer: B
Backpropagation moves backward from the output layer to the input layer, calculating gradients to update the weights via gradient descent.
Q59. Which type of scaling transforms features to have a mean of 0 and a standard deviation of 1?
A. Min-Max Scaling
B. Standardization (Z-score normalization)
C. Normalization
D. Robust Scaling
Show Answer
Answer: B
Standardization rescales data so that the distribution has a mean of 0 and a standard deviation of 1.
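Here is a minimal sketch of Z-score standardization using only the standard library. It uses the population standard deviation, which is one common convention; the sample values are invented.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]     # constant column: nothing to scale
    return [(v - mean) / std for v in values]

scaled = standardize([10.0, 20.0, 30.0, 40.0, 50.0])
print(scaled)
```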
Q60. In Reinforcement Learning, what is the “Reward” signal?
A. The input data given to the agent.
B. The feedback signal indicating how well the agent is performing.
C. The action taken by the agent.
D. The environment state.
Show Answer
Answer: B
The reward is a scalar feedback signal that the agent tries to maximize over time through its actions.
Q61. What is “Underfitting” in machine learning?
A. A model that performs too well on training data but poorly on test data.
B. A model that is too simple to capture the underlying structure of the data.
C. A model that has too many parameters.
D. A model that takes too long to train.
Show Answer
Answer: B
Underfitting occurs when a model is too simple (e.g., linear model for complex data) and fails to learn the pattern.
Q62. Which regularization technique adds the absolute value of weights to the loss function?
A. L1 Regularization (Lasso)
B. L2 Regularization (Ridge)
C. Dropout
D. Elastic Net
Show Answer
Answer: A
L1 regularization adds the magnitude of coefficients, often resulting in sparse models where some weights become zero.
Q63. Which regularization technique adds the squared value of weights to the loss function?
A. L1 Regularization
B. L2 Regularization (Ridge)
C. Lasso
D. Gradient Clipping
Show Answer
Answer: B
L2 regularization adds the squared magnitude of coefficients, preventing weights from becoming too large.
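The difference between the L1 and L2 penalties is easy to see in code. This is a sketch only: the weights, the data loss, and the strength `lam` are made-up numbers, and libraries differ in how they scale the penalty.

```python
def l1_penalty(weights, lam):
    """Lasso-style penalty: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge-style penalty: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

weights = [0.5, -2.0, 0.0, 3.0]   # hypothetical model weights
data_loss = 1.25                  # hypothetical loss before regularization
lam = 0.1                         # regularization strength (assumed value)

loss_l1 = data_loss + l1_penalty(weights, lam)
loss_l2 = data_loss + l2_penalty(weights, lam)
print(loss_l1, loss_l2)
```

Because the L2 term squares each weight, large weights are punished much more heavily than small ones.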
Q64. What is a “Tensor” in the context of deep learning frameworks?
A. A scalar value.
B. A mathematical operation.
C. A multi-dimensional array (generalization of matrices).
D. A type of activation function.
Show Answer
Answer: C
Tensors are the primary data structure in frameworks like TensorFlow and PyTorch, representing scalars, vectors, matrices, and n-dimensional arrays.
Q65. Which algorithm is widely used for market basket analysis?
A. K-Means
B. Apriori Algorithm
C. Linear Regression
D. SVM
Show Answer
Answer: B
The Apriori algorithm is used for association rule learning to find frequent itemsets in transactional databases.
Q66. What does the acronym “CNN” stand for?
A. Central Neural Network
B. Convolutional Neural Network
C. Connected Neural Node
D. Computational Neural Network
Show Answer
Answer: B
CNNs are a class of deep neural networks most commonly applied to analyzing visual imagery.
Q67. Which algorithm builds multiple decision trees and merges them to get a more accurate and stable prediction?
A. Decision Tree
B. Random Forest
C. Logistic Regression
D. KNN
Show Answer
Answer: B
Random Forest is an ensemble method that operates by constructing a multitude of decision trees at training time.
Q68. What is the purpose of the ROC curve?
A. To visualize the training loss.
B. To show the tradeoff between True Positive Rate and False Positive Rate.
C. To plot the accuracy vs epochs.
D. To visualize the data distribution.
Show Answer
Answer: B
The Receiver Operating Characteristic (ROC) curve illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied.
Q69. What does AUC stand for in classification metrics?
A. Area Under the Curve
B. Accuracy Under Classification
C. Average Unit Cost
D. Augmented Unit Count
Show Answer
Answer: A
AUC measures the entire two-dimensional area underneath the ROC curve, providing an aggregate measure of performance across all thresholds.
Q70. Which loss function is typically used for binary classification problems?
A. Mean Squared Error
B. Binary Cross-Entropy
C. Hinge Loss
D. Categorical Cross-Entropy
Show Answer
Answer: B
Binary Cross-Entropy measures the performance of a classification model whose output is a probability value between 0 and 1.
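A short sketch of Binary Cross-Entropy in plain Python. The labels and predicted probabilities are invented, and the small `eps` clamp is an assumption added to avoid taking log(0).

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average binary cross-entropy over a batch of predictions."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)    # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0]
confident_right = binary_cross_entropy(labels, [0.9, 0.1])
confident_wrong = binary_cross_entropy(labels, [0.1, 0.9])
print(confident_right, confident_wrong)
```

Confident but wrong predictions are penalized far more than confident correct ones.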
Q71. What is “Early Stopping” used for?
A. To stop the computer from sleeping.
B. To halt training when validation loss stops improving to prevent overfitting.
C. To stop the data loading process.
D. To reduce learning rate.
Show Answer
Answer: B
Early stopping is a form of regularization used to avoid overfitting when training a learner with an iterative method.
Q72. Which technique is used to visualize high-dimensional data in 2D or 3D?
A. PCA
B. t-SNE
C. LDA
D. Both A and B
Show Answer
Answer: D
Both PCA and t-SNE are dimensionality reduction techniques used to visualize complex datasets in lower dimensions.
Q73. What is the “Curse of Dimensionality”?
A. The difficulty of processing images.
B. The phenomenon where data becomes sparse as the number of features increases.
C. The inability to train deep networks.
D. The slow speed of gradient descent.
Show Answer
Answer: B
As dimensions increase, the volume of the space increases so fast that the available data becomes sparse, making analysis difficult.
Q74. Which optimizer uses momentum to accelerate gradient descent?
A. SGD with Momentum
B. AdaGrad
C. RMSProp
D. Adam
Show Answer
Answer: A
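Momentum accumulates a running average of past gradients, which accelerates updates in a consistent direction and dampens oscillations. Adam also includes a momentum-style term, but SGD with Momentum is the most direct answer here.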
Q75. In the context of NLP, what is TF-IDF?
A. A type of neural network.
B. A numerical statistic that reflects how important a word is to a document in a collection.
C. A tokenization method.
D. A stemming algorithm.
Show Answer
Answer: B
Term Frequency-Inverse Document Frequency weighs words by their frequency in a document versus their frequency across all documents.
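The weighting can be sketched as follows, assuming whitespace tokenization and the plain, unsmoothed formula tf × log(N / df). Libraries often use smoothed variants, so their numbers will differ. The function name and documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document (unsmoothed TF-IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))     # count each term once per document
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

scores = tf_idf(["the cat sat", "the dog barked", "the cat purred"])
print(scores[0])
```

A word like "the" that appears in every document scores 0, while a word unique to one document scores highest.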
Q76. Which method is used to save a trained machine learning model to disk?
A. Serialization (e.g., Pickle or Joblib)
B. Fitting
C. Transforming
D. Parsing
Show Answer
Answer: A
Serialization converts the model object into a byte stream to be stored or transmitted.
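A minimal round trip with Python's built-in `pickle` module looks like this. The dictionary is just a stand-in for a trained model object; to write to disk you would use `pickle.dump` with a file opened in binary mode.

```python
import pickle

# Hypothetical stand-in for a trained model's learned parameters.
model = {"weights": [0.4, -1.2], "bias": 0.1}

blob = pickle.dumps(model)        # serialize the object to a byte string
restored = pickle.loads(blob)     # deserialize it back into an object
print(type(blob), restored)
```

Only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.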
Q77. What is the “Input Layer” in a neural network?
A. The layer that makes the final prediction.
B. The first layer that receives the raw input data.
C. The layer with the most weights.
D. The layer that applies activation functions.
Show Answer
Answer: B
The input layer is the entry point of the network, containing one neuron per feature in the dataset.
Q78. What is the name of the process of adjusting weights during training?
A. Forward Propagation
B. Optimization / Weight Update
C. Activation
D. Pooling
Show Answer
Answer: B
Optimization algorithms (like Gradient Descent) adjust the weights to minimize the loss function.
Q79. Which library provides a high-level API for building and training deep learning models?
A. NumPy
B. Keras
C. Pandas
D. OpenCV
Show Answer
Answer: B
Keras acts as an interface for libraries like TensorFlow, allowing for fast experimentation with deep neural networks.
Q80. What is “Batch Normalization”?
A. Normalizing the entire dataset before training.
B. Normalizing the inputs of each layer to stabilize and accelerate training.
C. A regularization technique that drops neurons.
D. A method to increase batch size.
Show Answer
Answer: B
Batch Norm normalizes the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation.
Q81. Which distance metric is most commonly used in KNN for continuous variables?
A. Manhattan Distance
B. Euclidean Distance
C. Cosine Similarity
D. Hamming Distance
Show Answer
Answer: B
Euclidean distance is the straight-line distance between two points in Euclidean space, standard for continuous data.
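To show where the distance metric fits in, here is a tiny KNN classifier sketch built on `math.dist` (Euclidean distance, available in Python 3.8+). The function name, training points, and labels are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict the majority label among the k nearest training points."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Made-up 2D points in two well-separated groups.
train = [
    ((1.0, 1.0), "A"), ((1.0, 2.0), "A"), ((2.0, 1.0), "A"),
    ((8.0, 8.0), "B"), ((9.0, 8.0), "B"), ((8.0, 9.0), "B"),
]
print(knn_predict(train, (2.0, 2.0)))
print(knn_predict(train, (8.5, 8.5)))
```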
Q82. What is “Transfer Learning”?
A. Moving data from one database to another.
B. Reusing a pre-trained model on a new problem.
C. Transferring weights from one layer to another randomly.
D. Learning without data.
Show Answer
Answer: B
Transfer learning stores knowledge gained while solving one problem and applies it to a different but related problem.
Q83. What is the function of “Word Embeddings” like Word2Vec?
A. To convert words into sparse vectors.
B. To map words to dense vectors of real numbers where similar words have similar encodings.
C. To count word frequencies.
D. To correct spelling mistakes.
Show Answer
Answer: B
Word embeddings capture semantic meaning, placing similar words close together in the vector space.
Q84. Which technique allows a model to learn from a stream of data without storing the entire dataset?
A. Batch Learning
B. Online Learning
C. Transfer Learning
D. Ensemble Learning
Show Answer
Answer: B
Online learning incrementally updates the model as new data arrives, suitable for dynamic environments.
Q85. What is the role of a “Loss Function”?
A. To measure the accuracy of the model.
B. To quantify how far the predicted value is from the actual value.
C. To activate the neurons.
D. To initialize the network.
Show Answer
Answer: B
The loss function acts as a guide for the optimizer, telling it how wrong the model’s predictions are.
Q86. What does the “Flatten” layer do in a CNN?
A. It shrinks the image dimensions.
B. It converts the multi-dimensional feature maps into a single 1D vector.
C. It removes noise.
D. It adds depth to the image.
Show Answer
Answer: B
Flattening is necessary to transition from convolutional/pooling layers to the fully connected layers for classification.
Q87. Which activation function is known for being computationally efficient and mitigating the vanishing gradient problem?
A. Sigmoid
B. Tanh
C. ReLU
D. Softmax
Show Answer
Answer: C
ReLU (Rectified Linear Unit) outputs the input directly if positive, otherwise, it outputs zero, allowing faster training.
Q88. What is “Cross-Validation”?
A. Training the model only once.
B. A technique to assess how the results of a statistical analysis will generalize to an independent dataset.
C. Validating the code syntax.
D. Comparing two different programming languages.
Show Answer
Answer: B
Common methods include K-Fold CV, where the dataset is split into K folds, and the model is trained K times, each time using a different fold for validation.
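The K-Fold splitting logic can be sketched without any libraries. This version does not shuffle the data, and the function name is hypothetical; ML libraries offer ready-made splitters with shuffling and stratification.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)   # spread the remainder evenly
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Every sample lands in the validation set exactly once across the K rounds.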
Q89. What is the name of the standard dataset often used to test image classification algorithms?
A. MNIST
B. IMDB
C. Titanic
D. Boston Housing
Show Answer
Answer: A
MNIST contains handwritten digits (0-9) and is the “Hello World” of computer vision and deep learning.
Q90. What does a “Dense” layer imply in a neural network?
A. The layer has very few connections.
B. The layer is fully connected, meaning all neurons receive input from all neurons in the previous layer.
C. The layer has no weights.
D. The layer performs convolution.
Show Answer
Answer: B
Dense layers are standard fully connected layers where each input node is connected to each output node.
Q91. Which unsupervised algorithm detects anomalies by isolating data points with random splits instead of modeling what normal data looks like?
A. Isolation Forest
B. K-Means
C. PCA
D. Random Forest
Show Answer
Answer: A
Isolation Forest explicitly isolates anomalies rather than profiling normal data points. Anomalies need fewer random splits to separate, so they end up closer to the root of the trees.
Q92. What is the main disadvantage of the Sigmoid activation function in deep networks?
A. It is computationally expensive.
B. It outputs negative values.
C. It can cause vanishing gradients and outputs are not zero-centered.
D. It is linear.
Show Answer
Answer: C
Sigmoid squashes every input into the range (0, 1). For inputs of large magnitude the curve saturates, so gradients become very small during backpropagation.
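The numbers make the problem easy to see. This sketch evaluates the Sigmoid derivative, which never exceeds 0.25, and shows how quickly a product of such factors shrinks. The ten-layer figure is an upper bound that ignores the weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

peak_grad = sigmoid_grad(0.0)        # largest possible value: 0.25
saturated_grad = sigmoid_grad(10.0)  # tiny once the input is large
ten_layer_bound = peak_grad ** 10    # best case after 10 sigmoid layers

print(peak_grad, saturated_grad, ten_layer_bound)
```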
Q93. In Reinforcement Learning, what is the strategy called where the agent chooses the action believed to yield the highest reward?
A. Exploration
B. Exploitation
C. Pruning
D. Backtracking
Show Answer
Answer: B
Exploitation uses current knowledge to maximize immediate reward, whereas exploration seeks new information.
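One common way to balance the two is an epsilon-greedy policy, sketched below. The q_values list holds hypothetical reward estimates, and the function name choose_action is just for illustration.

```python
import random

def choose_action(q_values, epsilon, rng=random):
    """Explore with probability epsilon; otherwise exploit the best-known action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # exploration: pick any action
    # Exploitation: pick the action with the highest estimated reward.
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.2, 0.8, 0.5]  # hypothetical estimated rewards per action
greedy_action = choose_action(q_values, epsilon=0.0)
print(greedy_action)  # 1
```

With epsilon set to 0 the agent only exploits; raising it makes the agent try other actions more often.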
Q94. Which concept is used to handle categorical variables with no inherent order (like Red, Green, Blue)?
A. Label Encoding
B. Ordinal Encoding
C. One-Hot Encoding
D. Frequency Encoding
Show Answer
Answer: C
One-Hot Encoding prevents the model from assuming a numerical order exists between the categories.
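A hand-rolled sketch of the encoding for the Red, Green, Blue example. In practice you would likely use a library encoder, but the output has the same shape.

```python
def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]

colors = ["Red", "Green", "Blue"]
encoded = [one_hot(color, colors) for color in ["Green", "Red"]]
print(encoded)  # [[0, 1, 0], [1, 0, 0]]
```

No vector is "bigger" than another, so the model cannot read an order into the categories.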
Q95. What is the function of “Padding” in a CNN?
A. To reduce the image size.
B. To add extra pixels around the border of the input image.
C. To increase the number of channels.
D. To remove noise.
Show Answer
Answer: B
Padding (e.g., ‘same’ padding) preserves the spatial dimension of the volume after convolution, ensuring edge pixels are not ignored.
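The effect on size follows the standard formula output = (n + 2p - f) / s + 1, rounded down. The sketch below applies it to a 28x28 input with a 3x3 kernel; the function name is illustrative.

```python
def conv_output_size(input_size, kernel_size, padding=0, stride=1):
    """Spatial size of a convolution output along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

no_pad = conv_output_size(28, 3, padding=0)
same_pad = conv_output_size(28, 3, padding=1)
print(no_pad, same_pad)  # 26 28
```

With one pixel of padding, the 3x3 kernel keeps the output at 28, which is what 'same' padding does at stride 1.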
Q96. What is the primary difference between Linear Regression and Logistic Regression?
A. Linear Regression is for classification, Logistic is for regression.
B. Linear Regression predicts continuous values, Logistic Regression predicts probabilities for classification.
C. Linear Regression uses deep learning, Logistic uses shallow learning.
D. There is no difference.
Show Answer
Answer: B
Linear Regression fits a straight line to data, while Logistic Regression fits an “S” shaped curve (sigmoid) to predict categorical outcomes.
Q97. What is “Gradient Descent”?
A. An optimization algorithm to minimize the cost function.
B. A way to visualize data.
C. A type of neural network layer.
D. A method to clean data.
Show Answer
Answer: A
It iteratively adjusts parameters in the opposite direction of the gradient to find the minimum of the cost function.
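Here is a minimal sketch on a toy cost function, f(x) = (x - 3)^2, whose minimum is at x = 3. The learning rate and step count are arbitrary choices for the example.

```python
def gradient(x):
    """Derivative of the cost f(x) = (x - 3) ** 2."""
    return 2.0 * (x - 3.0)

x = 0.0              # starting guess
learning_rate = 0.1
for _ in range(100):
    x -= learning_rate * gradient(x)  # step against the gradient

print(round(x, 4))  # 3.0
```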
Q98. Which deep learning model architecture utilizes “Self-Attention” mechanisms?
A. CNN
B. RNN
C. Transformer
D. Autoencoder
Show Answer
Answer: C
Transformers use self-attention to process sequential data in parallel, forming the basis of models like BERT and GPT.
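The core computation can be sketched in a few lines. This is a simplified version that uses the inputs directly as queries, keys, and values; real Transformers apply learned projection matrices first and run several attention heads. The token vectors are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x (no learned projections)."""
    d = len(x[0])
    weights = []
    for query in x:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in x
        ]
        weights.append(softmax(scores))
    # Each output row is a weighted average of all value vectors.
    output = [
        [sum(w * value[j] for w, value in zip(row, x)) for j in range(d)]
        for row in weights
    ]
    return output, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical 2-d token vectors
output, weights = self_attention(tokens)
print(weights[0])
```

Each row of weights depends only on the input vectors, not on earlier rows, which is why all positions can be computed in parallel.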
Q99. What is the purpose of a “Bias” term in a neural network neuron?
A. To weigh the input.
B. To shift the activation function to the left or right.
C. To normalize the output.
D. To reduce the learning rate.
Show Answer
Answer: B
The bias shifts the activation function, which lets the model fit data that doesn’t pass through the origin.
Q100. Which of the following provides pre-trained models and APIs for vision, language, and speech tasks?
A. Scikit-learn
B. Google Cloud AI / Azure AI / AWS AI
C. Matplotlib
D. Pandas
Show Answer
Answer: B
Major cloud providers offer AI-as-a-Service, allowing developers to integrate advanced AI capabilities without building models from scratch.
Conclusion
I hope this hour of reading boosts your confidence and helps you stand out in today’s AI-driven hiring market. If you’re preparing for a Python interview, check out the articles below:
- Top 50 Python Interview Questions to Expect in 2026
- 100 Python MCQ with Answers (Python Quiz Test 2026)
And if you’re more into SQL, we’ve got a few helpful guides for you as well:
- 150+ SQL Commands Explained With Examples (2026 Update)
- 100 SQL MCQ with Answers (SQL Test 2026)
- Top 50 Essential SQL Interview Questions and Answers [2026]
Best of luck!




