The vanishing gradients problem is one example of unstable behavior that you may encounter when training a deep neural network.
It describes the situation where a deep multilayer feed-forward network or a recurrent neural network is unable to propagate useful gradient information from the output end of the model back to the layers near the input end of the model.
The result is that models with many layers are generally unable to learn on a given dataset, or they prematurely converge to a poor solution.
Many fixes and workarounds have been proposed and investigated, such as alternate weight initialization schemes, unsupervised pre-training, layer-wise training, and variations on gradient descent. Perhaps the most common change is the use of the rectified linear activation function that has become the new default, instead of the hyperbolic tangent activation function that was the default through the late 1990s and 2000s.
In this tutorial, you will discover how to diagnose a vanishing gradient problem when training a neural network model and how to fix it using an alternate activation function and weight initialization scheme.
After completing this tutorial, you will know:
- The vanishing gradients problem limits the development of deep neural networks with classically popular activation functions such as the hyperbolic tangent.
- How to fix a deep neural network Multilayer Perceptron for classification using ReLU and He weight initialization.
- How to use TensorBoard to diagnose a vanishing gradient problem and confirm the impact of ReLU to improve the flow of gradients through the model.
Let’s get started.
How to Fix the Vanishing Gradient By Using the Rectified Linear Activation Function
Photo by Liam Moloney, some rights reserved.
Tutorial Overview
This tutorial is divided into five parts; they are:
- Vanishing Gradients Problem
- Two Circles Binary Classification Problem
- Multilayer Perceptron Model for Two Circles Problem
- Deeper MLP Model with ReLU for Two Circles Problem
- Review Average Gradient Size During Training
Vanishing Gradients Problem
Neural networks are trained using stochastic gradient descent.
This involves first calculating the prediction error made by the model and using the error to estimate a gradient used to update each weight in the network so that less error is made next time. This error gradient is propagated backward through the network from the output layer to the input layer.
It is desirable to train neural networks with many layers, as the addition of more layers increases the capacity of the network, making it capable of learning a large training dataset and efficiently representing more complex mapping functions from inputs to outputs.
A problem with training networks with many layers (e.g. deep neural networks) is that the gradient diminishes dramatically as it is propagated backward through the network. The error may be so small by the time it reaches layers close to the input of the model that it may have very little effect. As such, this problem is referred to as the “vanishing gradients” problem.
Vanishing gradients make it difficult to know which direction the parameters should move to improve the cost function …
— Page 290, Deep Learning, 2016.
In fact, the error gradient can be unstable in deep neural networks and not only vanish, but also explode, where the gradient exponentially increases as it is propagated backward through the network. This is referred to as the “exploding gradient” problem.
The term vanishing gradient refers to the fact that in a feedforward network (FFN) the backpropagated error signal typically decreases (or increases) exponentially as a function of the distance from the final layer.
— Random Walk Initialization for Training Very Deep Feedforward Networks, 2014.
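To make this exponential shrinkage concrete, below is a toy, scalar caricature of backpropagation through a stack of tanh units. This snippet is an illustration added here, not code from the tutorial: each layer multiplies the backpropagated signal by the weight times the tanh derivative, and because both factors are typically less than one in magnitude, the signal decays rapidly with depth.

```python
# toy illustration: the backpropagated signal through a stack of tanh units
# shrinks roughly exponentially with depth, because each layer multiplies it
# by (weight * tanh'(pre-activation)), both typically less than 1 in magnitude
import numpy as np

np.random.seed(1)
signal = 1.0            # error signal arriving at the last layer
activation = 0.5        # an arbitrary starting value
for layer in range(1, 21):
    weight = np.random.uniform(-1.0, 1.0)        # assumed small random weight
    activation = np.tanh(weight * activation)    # pass through one tanh unit
    signal *= weight * (1.0 - activation ** 2)   # chain rule for one tanh unit
    if layer % 5 == 0:
        print('after %2d layers: |gradient| = %.3e' % (layer, abs(signal)))
```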
Vanishing gradients is a particular problem with recurrent neural networks as the update of the network involves unrolling the network for each input time step, in effect creating a very deep network that requires weight updates. A modest recurrent neural network may have 200-to-400 input time steps, resulting conceptually in a very deep network.
The vanishing gradients problem may manifest in a Multilayer Perceptron as a slow rate of improvement during training and perhaps premature convergence, e.g. continued training does not result in any further improvement. Inspecting the changes to the weights during training, we would see more change (i.e. more learning) occurring in the layers closer to the output layer and less change occurring in the layers closer to the input layer. One hedged way of doing this inspection in Keras is sketched below.
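The make_weight_logger helper below is an illustrative assumption, not code from the tutorial; it records the L1 norm of each layer's weights at the end of every epoch so that the amount of change per layer can be compared after training.

```python
# illustrative sketch: record the L1 norm of each Dense layer's kernel every epoch
import numpy as np
from keras.callbacks import LambdaCallback

weight_norms = []  # one list of per-layer norms appended per epoch

def make_weight_logger(model):
    def log_weights(epoch, logs):
        weight_norms.append([float(np.sum(np.abs(layer.get_weights()[0])))
                             for layer in model.layers])
    return LambdaCallback(on_epoch_end=log_weights)

# usage with any compiled Keras model, such as the MLPs built later in this tutorial:
# model.fit(trainX, trainy, epochs=500, verbose=0, callbacks=[make_weight_logger(model)])
```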
There are many techniques that can be used to reduce the impact of the vanishing gradients problem for feed-forward neural networks, most notably alternate weight initialization schemes and use of alternate activation functions.
Different approaches to training deep networks (both feedforward and recurrent) have been studied and applied [in an effort to address vanishing gradients], such as pre-training, better random initial scaling, better optimization methods, specific architectures, orthogonal initialization, etc.
— Random Walk Initialization for Training Very Deep Feedforward Networks, 2014.
In this tutorial, we will take a closer look at the use of an alternate weight initialization scheme and activation function to permit the training of deeper neural network models.
Two Circles Binary Classification Problem
As the basis for our exploration, we will use a very simple two-class or binary classification problem.
The scikit-learn library provides the make_circles() function that can be used to create a binary classification problem with a prescribed number of samples and amount of statistical noise.
Each example has two input variables that define the x and y coordinates of the point on a two-dimensional plane. The points are arranged in two concentric circles (they have the same center) for the two classes.
The number of points in the dataset is specified by the “n_samples” argument, half of which will be drawn from each circle. Gaussian noise can be added when sampling the points via the “noise” argument, which defines the standard deviation of the noise, where 0.0 indicates no noise, i.e. points drawn exactly from the circles. The seed for the pseudorandom number generator can be specified via the “random_state” argument, which allows the exact same points to be sampled each time the function is called.
The example below generates 1,000 examples from the two circles with noise and a value of 1 to seed the pseudorandom number generator.
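The original code listing for this step is not reproduced here; below is a minimal sketch of what it could look like, assuming the same make_circles() arguments used in the complete example later in this tutorial, with a simple scatter plot of the points colored by class value.

```python
# generate and plot the two circles dataset (illustrative sketch)
from sklearn.datasets import make_circles
from numpy import where
from matplotlib import pyplot

# generate 1,000 samples with statistical noise and a fixed seed
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)

# scatter plot the points, colored by class value
for class_value in range(2):
    row_ix = where(y == class_value)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(class_value))
pyplot.legend()
pyplot.show()
```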
Multilayer Perceptron Model for Two Circles Problem

We can develop a Multilayer Perceptron model to address the two circles problem. A simple model with a single hidden layer of five nodes is defined, using the hyperbolic tangent activation function and weights initialized from a uniform distribution between 0 and 1.

```python
# define a simple MLP with one tanh hidden layer
init = RandomUniform(minval=0, maxval=1)
model = Sequential()
model.add(Dense(5, input_dim=2, activation='tanh', kernel_initializer=init))
model.add(Dense(1, activation='sigmoid', kernel_initializer=init))
```
The model uses the binary cross entropy loss function and is optimized using stochastic gradient descent with a learning rate of 0.01 and a large momentum of 0.9.
```python
# compile model
opt = SGD(lr=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
```
The model is trained for 500 training epochs and the test dataset is evaluated at the end of each epoch along with the training dataset.
```python
# evaluate the model on the train and test sets
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
```
Finally, the accuracy of the model during each step of training is graphed as a line plot, showing the dynamics of the model as it learned the problem.
```python
# plot training history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
```
Tying all of this together, the complete example is listed below.
```python
# mlp for the two circles classification problem
from sklearn.datasets import make_circles
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.initializers import RandomUniform
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
# scale input data to [-1,1]
scaler = MinMaxScaler(feature_range=(-1, 1))
X = scaler.fit_transform(X)
# split into train and test
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
init = RandomUniform(minval=0, maxval=1)
model = Sequential()
model.add(Dense(5, input_dim=2, activation='tanh', kernel_initializer=init))
model.add(Dense(1, activation='sigmoid', kernel_initializer=init))
# compile model
opt = SGD(lr=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=500, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot training history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
```
Running the example fits the model in just a few seconds.
The model performance on the train and test sets is calculated and displayed. Your specific results may vary given the stochastic nature of the learning algorithm. Consider running the example a few times.
We can see that in this case, the model learned the problem well, achieving an accuracy of about 81.6% on both the train and test datasets.
A line plot of model accuracy on the train and test sets is created, showing the change in performance over all 500 training epochs.
The plot suggests, for this run, that the performance begins to slow around epoch 300 at about 80% accuracy for both the train and test sets.
Line Plot of Train and Test Set Accuracy Over Training Epochs for MLP in the Two Circles Problem
Now that we have seen how to develop a classical MLP using the tanh activation function for the two circles problem, we can look at modifying the model to have many more hidden layers.
Deeper MLP Model for Two Circles Problem
Traditionally, developing deep Multilayer Perceptron models was challenging.
Deep models using the hyperbolic tangent activation function do not train easily, and much of this poor performance is blamed on the vanishing gradient problem.
We can attempt to investigate this using the MLP model developed in the previous section.
The number of hidden layers can be increased from 1 to 5; for example:
```python
# define a deeper MLP with five tanh hidden layers
init = RandomUniform(minval=0, maxval=1)
model = Sequential()
model.add(Dense(5, input_dim=2, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(1, activation='sigmoid', kernel_initializer=init))
```
We can then re-run the example and review the results.
The complete example of the deeper MLP is listed below.
```python
# deeper mlp for the two circles classification problem
from sklearn.datasets import make_circles
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.initializers import RandomUniform
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
scaler = MinMaxScaler(feature_range=(-1, 1))
X = scaler.fit_transform(X)
# split into train and test
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
init = RandomUniform(minval=0, maxval=1)
model = Sequential()
model.add(Dense(5, input_dim=2, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(1, activation='sigmoid', kernel_initializer=init))
# compile model
opt = SGD(lr=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=500, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot training history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
```
Running the example first prints the performance of the fit model on the train and test datasets.
Your specific results may vary given the stochastic nature of the learning algorithm. Consider running the example a few times.
In this case, we can see that performance is quite poor on both the train and test sets achieving around 50% accuracy. This suggests that the model as configured could not learn the problem nor generalize a solution.
The line plots of model accuracy on the train and test sets during training tell a similar story. We can see that performance is bad and actually gets worse as training progresses.
Line Plot of Train and Test Set Accuracy over Training Epochs for Deep MLP in the Two Circles Problem
Deeper MLP Model with ReLU for Two Circles Problem
The rectified linear activation function has supplanted the hyperbolic tangent activation function as the new preferred default when developing Multilayer Perceptron networks, as well as other network types like CNNs.
This is because the activation function looks and acts like a linear function, making it easier to train and less likely to saturate, but is, in fact, a nonlinear function, forcing negative inputs to the value 0. It is claimed as one possible approach to addressing the vanishing gradients problem when training deeper models.
When using the rectified linear activation function (or ReLU for short), it is good practice to use the He weight initialization scheme. We can define the MLP with five hidden layers using ReLU and He initialization, listed below.
```python
# define a deeper MLP with ReLU hidden layers and He weight initialization
model = Sequential()
model.add(Dense(5, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
# output layer
model.add(Dense(1, activation='sigmoid', kernel_initializer='he_uniform'))
```
Tying this together, the complete code example is listed below.
```python
# deeper mlp with relu for the two circles classification problem
from sklearn.datasets import make_circles
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from matplotlib import pyplot
# generate 2d classification dataset
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
scaler = MinMaxScaler(feature_range=(-1, 1))
X = scaler.fit_transform(X)
# split into train and test
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(5, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='sigmoid', kernel_initializer='he_uniform'))
# compile model
opt = SGD(lr=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=500, verbose=0)
# evaluate the model
_, train_acc = model.evaluate(trainX, trainy, verbose=0)
_, test_acc = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
# plot training history
pyplot.plot(history.history['acc'], label='train')
pyplot.plot(history.history['val_acc'], label='test')
pyplot.legend()
pyplot.show()
```
Running the example prints the performance of the model on the train and test datasets.
Your specific results may vary given the stochastic nature of the learning algorithm. Consider running the example a few times.
In this case, we can see that this small change has allowed the model to learn the problem, achieving about 84% accuracy on both datasets, outperforming the single layer model using the tanh activation function.
A line plot of model accuracy on the train and test sets over training epochs is also created. The plot shows quite different dynamics to what we have seen so far.
The model appears to rapidly learn the problem, converging on a solution in about 100 epochs.
Line Plot of Train and Test Set Accuracy over Training Epochs for Deep MLP with ReLU in the Two Circles Problem
Use of the ReLU activation function has allowed us to fit a much deeper model for this simple problem, but this capability does not extend infinitely. For example, increasing the number of layers results in slower learning to a point at about 20 layers where the model is no longer capable of learning the problem, at least with the chosen configuration.
For example, below is a line plot of train and test accuracy of the same model with 15 hidden layers that shows that it is still capable of learning the problem.
Line Plot of Train and Test Set Accuracy over Training Epochs for Deep MLP with ReLU with 15 Hidden Layers
Below is a line plot of train and test accuracy over epochs with the same model with 20 layers, showing that the configuration is no longer capable of learning the problem.
Line Plot of Train and Test Set Accuracy over Training Epochs for Deep MLP with ReLU with 20 Hidden Layers
Although the use of ReLU worked, we cannot be confident that the tanh function failed because of vanishing gradients and that ReLU succeeded because it overcame this problem.
Review Average Gradient Size During Training
This section assumes that you are using the TensorFlow backend with Keras. If this is not the case, you can skip this section.
In the case of the deeper model using the tanh activation function, we know the network has more than enough capacity to learn the problem, but the increase in the number of layers has prevented it from doing so.
It is hard to diagnose a vanishing gradient as the cause of bad performance. One possible signal is the average size of the gradient per layer per training epoch.
We would expect layers closer to the output to have a larger average gradient than those layers closer to the input.
Keras provides the TensorBoard callback that can be used to log properties of the model during training such as the average gradient per layer. These statistics can then be reviewed using the TensorBoard interface that is provided with TensorFlow.
We can configure this callback to record the average gradient per-layer per-training epoch, then ensure the callback is used as part of training the model.
```python
# deeper mlp for the two circles classification problem with callback
from sklearn.datasets import make_circles
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.initializers import RandomUniform
from keras.callbacks import TensorBoard
# generate 2d classification dataset
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
scaler = MinMaxScaler(feature_range=(-1, 1))
X = scaler.fit_transform(X)
# split into train and test
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
init = RandomUniform(minval=0, maxval=1)
model = Sequential()
model.add(Dense(5, input_dim=2, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(5, activation='tanh', kernel_initializer=init))
model.add(Dense(1, activation='sigmoid', kernel_initializer=init))
# compile model
opt = SGD(lr=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
# prepare the tensorboard callback to log gradients each epoch
tb = TensorBoard(histogram_freq=1, write_grads=True)
# fit model
model.fit(trainX, trainy, validation_data=(testX, testy), epochs=500, verbose=0, callbacks=[tb])
```
Running the example creates a new “logs/” subdirectory with a file containing the statistics recorded by the callback during training.
We can review the statistics in the TensorBoard web interface. The interface can be started from the command line, requiring that you specify the full path to your logs directory.
For example, if you run the code in a “/code” directory, then the full path to the logs directory will be “/code/logs/“.
Below is the command to start the TensorBoard interface to be executed on your command line (command prompt). Be sure to change the path to your logs directory.
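The exact command was not preserved in this copy of the tutorial; a standard invocation, assuming the “/code/logs/” directory from the example above, would be:

```
tensorboard --logdir=/code/logs/
```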
Next, open your web browser and enter the following URL:
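TensorBoard serves its interface on port 6006 by default, so the address is typically:

```
http://localhost:6006
```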
If all went well, you will see the TensorBoard web interface.
Plots of the average gradient per layer per training epoch can be reviewed under the “Distributions” and “Histograms” tabs of the interface. The plots can be filtered to only show the gradients for the Dense layers, excluding the bias, using the search filter “kernel_0_grad“.
I have provided a copy of the plots below, although your specific results may vary given the stochastic nature of the learning algorithm.
First, line plots are created for each of the 6 layers (5 hidden, 1 output). The names of the plots indicate the layer, where “dense_1” indicates the hidden layer after the input layer and “dense_6” represents the output layer.
We can see that the output layer has a lot of activity over the entire run, with average gradients per epoch of around 0.05 to 0.1. We can also see some activity in the first hidden layer with a similar range. Therefore, gradients are getting through to the first hidden layer, but the output layer and the last hidden layer are seeing most of the activity.
TensorBoard Line Plots of Average Gradients Per Layer for Deep MLP With Tanh
TensorBoard Density Plots of Average Gradients Per Layer for Deep MLP With Tanh
We can collect the same information from the deep MLP with the ReLU activation function.
The complete example is listed below.
```python
# deeper mlp with relu for the two circles classification problem with callback
from sklearn.datasets import make_circles
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.callbacks import TensorBoard
# generate 2d classification dataset
X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
scaler = MinMaxScaler(feature_range=(-1, 1))
X = scaler.fit_transform(X)
# split into train and test
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(5, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='sigmoid', kernel_initializer='he_uniform'))
# compile model
opt = SGD(lr=0.01, momentum=0.9)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
# prepare the tensorboard callback to log gradients each epoch
tb = TensorBoard(histogram_freq=1, write_grads=True)
# fit model
model.fit(trainX, trainy, validation_data=(testX, testy), epochs=500, verbose=0, callbacks=[tb])
```
The TensorBoard interface can be confusing if you are new to it.
To keep things simple, delete the “logs” subdirectory prior to running this second example.
Once run, you can start the TensorBoard interface the same way and access it through your web browser.
The plots of the average gradient per layer per training epoch show a different story as compared to the gradients for the deep model with tanh.
We can see that the first hidden layer sees larger gradients more consistently, with a larger spread, perhaps 0.2 to 0.4, as opposed to the 0.05 to 0.1 seen with tanh. We can also see that the middle hidden layers see large gradients.
TensorBoard Line Plots of Average Gradients Per Layer for Deep MLP With ReLU
TensorBoard Density Plots of Average Gradients Per Layer for Deep MLP With ReLU
The ReLU activation function is allowing more gradient to flow backward through the model during training, and this may be the cause for improved performance.
Extensions
This section lists some ideas for extending the tutorial that you may wish to explore.
- Weight Initialization. Update the deep MLP with tanh activation to use Xavier uniform weight initialization and report the results.
- Learning Algorithm. Update the deep MLP with tanh activation to use an adaptive learning algorithm such as Adam and report the results.
- Weight Changes. Update the tanh and relu examples to record and plot the L1 vector norm of model weights each epoch as a proxy for how much each layer is changed during training and compare results.
- Study Model Depth. Create an experiment using the MLP with tanh activation and report the performance of models as the number of hidden layers is increased from 1 to 10.
- Increase Breadth. Increase the number of nodes in the hidden layers of the MLP with tanh activation from 5 to 25 and report performance as the number of layers is increased from 1 to 10.
If you explore any of these extensions, I’d love to know.
Further Reading
This section provides more resources on the topic if you are looking to go deeper.
Papers
- Random Walk Initialization for Training Very Deep Feedforward Networks, 2014.
- Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies, 2001.
Books
- Section 8.2.5 Long-Term Dependencies, Deep Learning, 2016.
- Chapter 5 Why are deep neural networks hard to train?, Neural Networks and Deep Learning.
Articles
- Vanishing gradient problem, Wikipedia.
Summary
In this tutorial, you discovered how to diagnose a vanishing gradient problem when training a neural network model and how to fix it using an alternate activation function and weight initialization scheme.
Specifically, you learned:
- The vanishing gradients problem limits the development of deep neural networks with classically popular activation functions such as the hyperbolic tangent.
- How to fix a deep neural network Multilayer Perceptron for classification using ReLU and He weight initialization.
- How to use TensorBoard to diagnose a vanishing gradient problem and confirm the impact of ReLU to improve the flow of gradients through the model.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
A corrupt database is probably one of a DBA's worst nightmares. It results in downtime, managers shouting, and all sorts of other unpleasant things.
In this article, I'm going to explain some of the things not to do to a corrupt database, and then go through some of the things that should be done, some common scenarios, and the fixes for those.
How to identify corruption
Corruption is typically pretty obvious when someone runs across the damaged pages. Queries fail with high severity errors. Backups or reindex jobs fail with high severity errors. Several kinds of error messages indicate corruption within a database.
The main problem is that, if regular database integrity checks are not being done, the corruption may be picked up hours, days or even months after it occurred, at which point it may be difficult to resolve.
I'm not going to cover the situation where the database is in a suspect state. Covering the possible reasons why a database is suspect, the methods to discover why it is suspect and the various means of fixing that are a whole article in themselves, if not a full book.
What to do when the database is corrupt
- Don't panic
- Don't detach the database
- Don't restart SQL
- Don't just run repair.
- Run an integrity check
- Afterwards, do a root-cause analysis
Don't panic
The most important thing when dealing with database corruption of any form is not to panic. Any decisions made or actions taken should be carefully thought through and made after careful consideration with all factors taken into account. It's very easy to make the situation worse with ill-thought through decisions.
Don't detach the database
While it is possible that the corruption message describes a transient condition, that is not the usual situation. In the vast majority of cases if SQL detects corruption within a database it means that there really are some damaged pages within the DB. Trying to trick SQL into not seeing that, by detaching and reattaching the database, backing up then restoring the database, restarting the SQL Service or rebooting the machine is not going to make corruption go away.
If there is corruption in the database, and SQL detects that corruption when it attaches the database, the attach will fail. There are ways to hack the database back into SQL, but it's much better to simply not detach the database in the first place.
Don't restart SQL
Similar to detaching, restarting the SQL service will not fix corruption if it is present.
As with detaching the database, restarting the service may make matters worse. If SQL Server encounters corruption while performing the restart-recovery on a database, that database will be marked suspect, making any necessary repairs much harder to achieve.
Don't just run repair
It may be tempting to just run CheckDB with one of the repair options (typically allow data loss) and believe that it will make everything better. In many cases running repair is not the recommended fix. It is not guaranteed to fix all errors and it may result in unacceptable data loss.
Repair is, in most cases, the last resort for fixing corruption. It should be done only when none of the alternatives are possible, not done as the first thing tried.
Run an integrity check
To decide on a method of repairing the corruption, the details of exactly what is damaged must be known. The only way to get this information is to run CheckDB with the All_ErrorMsgs option (on 2005 SP3, this is a default option in CheckDB and does not need to be specified. It does need to be specified on SQL 2008). Additionally, the No_Infomsgs option removes all of the information about the number of rows and number of pages in the table, which is unnecessary when dealing with corruption.
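As a hedged example (the database name is hypothetical), the check described above can be run like this:

```sql
-- full integrity check: keep every error message, suppress the informational output
DBCC CHECKDB ('MyDatabase') WITH ALL_ERRORMSGS, NO_INFOMSGS;
```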
CheckDB can take quite a while on larger databases, but it is necessary to let it run to completion. A repair strategy should not be considered without knowing all of the problems in the database.
Root cause
Once the corruption has been resolved, the work isn't over. If the root cause of the corruption isn't found, it may happen again. Typically the leading cause of corruption is problems with the IO subsystem. Other possible causes are misbehaving filter drivers (like an antivirus), human intervention or a bug in SQL.
Next steps
The steps to resolve corruption depend entirely on the results of CheckDB. I'm going to go through some of the more common scenarios here. This is by no means a comprehensive document of all possible corruptions within a database. [1]
This list is in approximate order of severity, from the least severe problem to the most. Each has an example of possible error messages that indicate that particular problem. In general, the most severe error that CheckDB finds determines the methods available to resolve the corruption problem.
If anyone encounters an error that is not detailed here, see the last section – Obtaining Help.
Inaccurate space metadata
This error indicates that the page has an incorrect value on it for the reserved space. In SQL 2000 it was possible for the row and page counts for a table or index to be incorrect, even negative, and CheckDB did not pick this up. On SQL 2005, the counts should be correctly kept, and CheckDB gives a warning when it finds this scenario.
This is not a serious problem and it's trivial to fix. As the message says, run DBCC UPDATEUSAGE on the database in question and the warnings will disappear. This is common on databases upgraded from SQL 2000, for the reasons mentioned above, and should not occur on databases created in SQL 2005/2008.
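For example, with a hypothetical database name:

```sql
-- recalculate and correct the page and row count metadata for every object
DBCC UPDATEUSAGE ('MyDatabase');
```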
This error indicates that the PFS page (page free space) that tracks how full pages are has incorrect values. This, like the above error, is not serious. The algorithm that tracked this in SQL 2000 was not always accurate. While fixing this does require running CheckDB with the Repair_Allow_Data_Loss option, if this is the only error that there is, it will not actually delete any data.
Corruption only in the nonclustered indexes
If all the errors that CheckDB returns refer to indexes with IDs of 2 or greater, then all of the corruption is within the nonclustered indexes. Since the data in a nonclustered index is redundant, these corruptions can be repaired without data loss.
If all of the errors that CheckDB picks up are in the nonclustered indexes, the recommended repair level will be Repair_Rebuild.
Those are just examples; there are many more possible errors.
In this case the corruption can be completely repaired by dropping the damaged nonclustered indexes and recreating them. Online index rebuilds (and some offline index rebuilds) read the old index to create the new one and hence will encounter the corruption, so it is necessary to drop the old index completely and create a new one.
This is mostly what CheckDB with the Repair_Rebuild option will do; however, the database must be in single-user mode for the repair to be done. Hence it's usually better to manually drop and recreate the indexes, as the database can remain online and in use while the affected indexes are recreated.
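A hedged sketch of the manual approach, with hypothetical table and index names:

```sql
-- drop the damaged nonclustered index and build a fresh copy from the intact base table
DROP INDEX ix_Orders_CustomerID ON dbo.Orders;
CREATE NONCLUSTERED INDEX ix_Orders_CustomerID ON dbo.Orders (CustomerID);
```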
If there is insufficient time available to rebuild the affected index and there is a clean backup with an unbroken log chain, the damaged pages can be restored from backup.
Corruption in the LOB pages
This indicates that there are LOB (large object) pages that are not referenced by any data row. This can come about if there was corruption of the clustered index or heap and the damaged pages were deallocated.
If these are the only errors that CheckDB returns, then running repair with the Repair_Allow_Data_Loss option will simply deallocate these pages. Since the data rows that the LOB data belongs to do not exist, this will not result in further data loss.
Data Purity errors
The data purity error indicates that there is a value in the column that's outside of the acceptable range of the column. That can be a datetime where the minutes past midnight exceed 1440, a Unicode string where the number of bytes is not a multiple of 2, or a float or real with an invalid floating point value.
These errors are not checked for by default on a database upgraded from SQL 2000 or lower. CheckDB must run successfully once with the DATA_PURITY option before these checks become part of the default CheckDB behaviour for that database.
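A hedged example of running these checks explicitly on an upgraded database (hypothetical name):

```sql
-- run the column-value (data purity) checks explicitly on an upgraded database
DBCC CHECKDB ('MyDatabase') WITH DATA_PURITY;
```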
CheckDB will not fix this. It doesn't know what values to put in the column to replace the invalid ones. The fix for this is fairly easy, but manual: the bad values have to be manually updated to something meaningful. The main challenge is finding the bad rows. This KB article goes over the steps in detail: http://support.microsoft.com/kb/923247
Corruption in the clustered index or heap
If there is corruption to the clustered index's leaf pages or to the heap, it means that data has been lost. The leaf pages of the clustered index are the actual data pages of the table, and hence this is not redundant information.
If any of the errors that CheckDB picks up are in the leaf pages of the clustered index, the recommended repair level will be Repair_Allow_Data_Loss.
Those are just examples; there are many more possible errors. The thing to notice is that the index ID mentioned is either 0 or 1. If any of the errors returned by CheckDB have an index id of 0 or 1, it means that there's damage to the base tables.
This kind of damage is repairable, but repairing it involves discarding rows or entire pages. If CheckDB deletes data to fix corruption, it will not check foreign keys and it will not fire triggers. The rows or pages will simply be deallocated. This can result in data integrity violations (child records without a parent) and it can result in logical database inconsistencies (nonclustered index rows or LOB pages that no longer reference a row). As such, repair is not the recommended route.
If there is a clean backup, restoring from the backup is usually the recommended method of fixing these errors. If the database is in full recovery and there is an unbroken log chain since the clean database backup, then it is possible to back up the tail of the transaction log and to restore either the entire database or just the damaged pages, with no data loss at all.
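Below is a hedged sketch of such a page restore; the database name, file paths, and page ID are all hypothetical, and the database is assumed to be in the full recovery model with an unbroken log chain:

```sql
-- restore only the damaged page (file 1, page 57) from the last clean full backup
RESTORE DATABASE MyDatabase PAGE = '1:57'
    FROM DISK = N'D:\Backups\MyDatabase_full.bak'
    WITH NORECOVERY;
-- restore any log backups taken since that full backup WITH NORECOVERY, then take a
-- fresh log backup and restore it to bring the restored page up to date
BACKUP LOG MyDatabase TO DISK = N'D:\Backups\MyDatabase_log_tail.trn';
RESTORE LOG MyDatabase FROM DISK = N'D:\Backups\MyDatabase_log_tail.trn' WITH RECOVERY;
```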
If there is no clean backup, then it's necessary to run CheckDB with the Repair_allow_data_loss option. This requires that the database be in single user mode for the duration of the repair.
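For completeness, a hedged sketch of this last-resort repair (hypothetical database name):

```sql
-- last resort: the repair options require the database to be in single-user mode
ALTER DATABASE MyDatabase SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB ('MyDatabase', REPAIR_ALLOW_DATA_LOSS);
ALTER DATABASE MyDatabase SET MULTI_USER;
```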
It may be possible to determine what CheckDB will delete for a clustered index; see this blog post: http://sqlskills.com/BLOGS/PAUL/post/CHECKDB-From-Every-Angle-Using-DBCC-PAGE-to-find-what-repair-will-delete.aspx
Corruption in the Metadata
This type of error usually appears in a database upgraded from SQL 2000, where someone did direct updates to the system tables.
There are no foreign keys enforced among the system tables in any version of SQL, so it was possible on SQL 2000 to delete a row from sysobjects (for example a table) and leave the rows in syscolumns and sysindexes that reference the deleted row.
On SQL 2000, CheckDB did not do a check of the system catalog, so this kind of problem often went completely unnoticed. On SQL 2005, CheckDB does do consistency checks of the system catalog, and so these errors can appear.
Fixing these is not trivial. CheckDB will not repair them, as the only fix is to delete records from the system tables, which may cause major data loss. If there's a backup of the database from before it was upgraded to SQL 2005 and the upgrade was very recent, then that backup can be restored to SQL 2000, the system tables fixed on SQL 2000 and then the database upgraded again.
If there is no SQL 2000 backup, or the upgrade was too long ago and the data loss is unacceptable, then there are two possible fixes. First, edit the system tables in SQL 2005, which is a complex and very risky process, as the system tables are not documented and are much more complex than they were on previous versions. See this blog post for details - http://www.sqlskills.com/BLOGS/PAUL/post/TechEd-Demo-Using-the-SQL-2005-Dedicated-Admin-Connection-to-fix-Msg-8992-corrupt-system-tables.aspx
The other solution is to generate scripts of all the objects in the database and export all of the data. Create a new database, recreate the objects and reload the data.
The second option is usually the recommended one.
Irreparable Corruption
CheckDB can't repair everything. Any errors like these are irreparable and the only way to resolve them is to restore a backup of the database that does not have the corruption. If there is a full, unbroken log chain from that backup up until the current time, then the log tail can be backed up and the database can be restored without any data loss.
If there is no clean backup, then the only remaining option is to generate scripts of the database objects and export the data that is accessible. It is quite likely, due to the corruption, that not all of the data will be accessible, and most likely not all of the objects will script without error.
Damaged system tables
CheckDB depends on a few of the critical system tables to get a view of what should be in the database. If those tables themselves are damaged, then CheckDB cannot know what the database should look like, and won't even be able to analyze it, let alone repair it.
Damaged allocation pages
In this case, one or more of the database allocation pages are damaged. The allocation pages are used to mark which pages and extents in the database are allocated and which are free. CheckDB will not repair damage to the allocation pages, as it is extremely difficult to work out, without those pages, what extents are allocated and which are not. Dropping the allocation page is not an option as that would discard up to 4GB of data.
Obtaining Help
If you're not sure what to do, get help. If you've run into a corruption message that you don't understand, that isn't described here, get help. If you're not sure the best way to recover, get help.
Help can come in many forms. If there's a senior DBA, ask them. If you have a mentor, ask them. Ask on a forum; the forums here, the Microsoft newsgroups or forums, or another forum if you prefer. Just be aware that not all advice given on the forums is good advice. In fact, there's some downright dangerous suggestions posted from time to time.
Finally, consider calling Microsoft's customer support people. They will charge, but they do know how to deal with corruption, and if it's a critical system database that's down, the cost of support may be far less than the cost of downtime while searching for a solution.
Conclusion
In this article I've given some suggestions on how to deal with corruption and, more importantly, how not to deal with corruption. I hope that people have a better understanding of the available methods of fixing such problems and of the importance of having good backups.
Footnote
[1] Paul Randal has written an 80-page chapter for the forthcoming SQL Server 2008 Internals book that covers, in detail, how CheckDB works and has a comprehensive list of all errors it can produce.