## Appendix 1. References

[1] Hsu, F.-H. (2002). Behind Deep Blue: Building the Computer That Defeated the World Chess Champion . Princeton University Press, Princeton, NJ, USA. [2] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural Comput. 18, 7 (July 2006), 1527-1554. [3] Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." Advances in neural information processing systems 19 (2007): 153. [4] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012. [5] Machine Learning, Tom Mitchell, McGraw Hill, 1997. [6] Machine Learning: A Probabilistic Perspective (Adaptive Computation and Machine Learning series), Kevin P. Murphy [7] O. Chapelle, B. Scholkopf and A. Zien Eds., "Semi-Supervised Learning (Chapelle, O. et al., Eds.; 2006) [Book reviews]," in IEEE Transactions on Neural Networks, vol. 20, no. 3, pp. 542-542, March 2009. [8] Y. Bengio. Learning deep architectures for AI. in Foundations and Trends in Machine Learning, 2(1):1–127, 2009. [9] G. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent DBNHMMs in large vocabulary continuous speech recognition. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2011. [10] A. Mohamed, G. Dahl, and G. Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, & Language Processing, 20(1), January 2012. [11] A. Mohamed, D. Yu, and L. Deng. Investigation of full-sequence training of deep belief networks for speech recognition. In Proceedings of Inter speech. 2010. [12] Indyk, Piotr, and Rajeev Motwani. "Approximate nearest neighbors: towards removing the curse of dimensionality." Proceedings of the thirtieth annual ACM symposium on Theory of computing. ACM, 1998. [13] Friedman, Jerome H. "On bias, variance, 0/1—loss, and the curse-of-dimensionality." Data mining and knowledge discovery 1.1 (1997): 55-77. [14] Keogh, Eamonn, and Abdullah Mueen. "Curse of dimensionality." Encyclopedia of Machine Learning. Springer US, 2011. 257-258. [15] Hughes, G.F. (January 1968). "On the mean accuracy of statistical pattern recognizers". IEEE Transactions on Information Theory. 14 (1): 55–63. [16] Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. "Learning long-term dependencies with gradient descent is difficult." IEEE transactions on neural networks 5.2 (1994): 157-166.

[17] Ivakhnenko, Alexey (1965). Cybernetic Predicting Devices. Kiev: Naukova Dumka. [18] Ivakhnenko, Alexey (1971). "Polynomial theory of complex systems". IEEE Transactions on Systems, Man and Cybernetics (4): 364–378. [19] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feed-forward neural networks. In Proceedings of Artificial Intelligence and Statistics (AISTATS). 2010. [20] G. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006 [21] M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In Proceedings of Neural Information Processing Systems (NIPS). 2006. [22] I. Goodfellow, M. Mirza, A. Courville, and Y. Bengio. Multi-prediction deep boltzmann machines. In Proceedings of Neural Information Processing Systems (NIPS). 2013. [23] R. Salakhutdinov and G. Hinton. Deep boltzmann machines. In Proceedings of Artificial Intelligence and Statistics (AISTATS). 2009. [24] R. Salakhutdinov and G. Hinton. A better way to pretrain deep boltzmann machines. In Proceedings of Neural Information Processing Systems (NIPS). 2012. [25] N. Srivastava and R. Salakhutdinov. Multimodal learning with deep boltzmann machines. In Proceedings of Neural Information Processing Systems (NIPS). 2012. [26] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Proceedings of Uncertainty in Artificial Intelligence. 2011. [27] R. Gens and P. Domingo. Discriminative learning of sum-product networks. Neural Information Processing Systems (NIPS), 2012. [28] R. Gens and P. Domingo. Discriminative learning of sum-product networks. Neural Information Processing Systems (NIPS), 2012. [29] S. Hochreiter. Untersuchungen zu dynamischen neuronalen netzen. Diploma thesis, Institut fur Informatik, Technische Universitat Munchen, 1991. [30] J.Martens. Deep learning with hessian-free optimization. In Proceedings of international Conference on Machine Learning (ICML). 2010. [31] Y. Bengio. Deep learning of representations: Looking forward. In Statistical Language and Speech Processing, pages 1–37. Springer, 2013. [32] I. Sutskever. Training recurrent neural networks. Ph.D. Thesis, University of Toronto, 2013. [33] J. Ngiam, Z. Chen, P. Koh, and A. Ng. Learning deep energy models. In Proceedings of International Conference on Machine Learning (ICML). 2011. [34] Y. LeCun, S. Chopra, M. Ranzato, and F. Huang. Energy-based models in document recognition and computer vision. In Proceedings of International Conference on Document Analysis and Recognition (ICDAR). 2007. [35] R. Chengalvarayan and L. Deng. Speech trajectory discrimination using the minimum classification error learning. IEEE Transactions on Speech and Audio Processing, 6(6):505–515, 1998.

[36] M. Gibson and T. Hain. Error approximation and minimum phone error acoustic model estimation. IEEE Transactions on Audio, Speech, and Language Processing, 18(6):1269–1279, August 2010 [37] X. He, L. Deng, andW. Chou. Discriminative learning in sequential pattern recognition — a unifying review for optimization-oriented speech recognition. IEEE Signal Processing Magazine, 25:14–36, 2008. [38] H. Jiang and X. Li. Parameter estimation of statistical models using convex optimization: An advanced method of discriminative training for speech and language processing. IEEE Signal Processing Magazine, 27(3):115–127, 2010. [39] B.-H. Juang, W. Chou, and C.-H. Lee. Minimum classification error rate methods for speech recognition. IEEE Transactions On Speech and Audio Processing, 5:257–265, 1997. [40] D. Povey and P. Woodland. Minimum phone error and I-smoothing for improved discriminative training. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2002 [41] D. Yu, L. Deng, X. He, and X. Acero. Large-margin minimum classification error training for large-scale speech recognition tasks. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2007. [42] A. Robinson. An application of recurrent nets to phone probability estimation. IEEE Transactions on Neural Networks, 5:298–305, 1994 [43] A. Graves. Sequence transduction with recurrent neural networks. Representation Learning Workshop, International Conference on Machine Learning (ICML), 2012. [44] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labeling unsegmented sequence data with recurrent neural networks. In Proceedings of International Conference on Machine Learning (ICML). 2006. [45] A. Graves, N. Jaitly, and A. Mohamed. Hybrid speech recognition with deep bidirectional LSTM. In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU). 2013. [46] A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In Proceedings of International Conference on Acoustics Speech and Signal Processing (ICASSP). 2013 [47] K. Lang, A. Waibel, and G. Hinton. A time-delay neural network architecture for isolated word recognition. Neural Networks, 3(1):23–43, 1990. [48] A.Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustical Speech, and Signal Processing, 37:328–339, 1989. [50] Moore, Gordon E. (1965-04-19). "Cramming more components onto integrated circuits". Electronics. Retrieved 2016-07-01. [51] http://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf [52] D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel, \Finding a needle in haystack: Facebooks photo storage," in OSDI, 2010, pp. 4760. [53] Michele Banko and Eric Brill. 2001. Scaling to very very large corpora for natural language disambiguation. In Proceedings of the 39th Annual Meeting on Association for Computational Linguistics (ACL '01). Association for Computational Linguistics, Stroudsburg, PA, USA, 26-33. [54] http://www.huffingtonpost.in/entry/big-data-and-deep-learnin_b_3325352 [55] X. W. Chen and X. Lin, "Big Data Deep Learning: Challenges and Perspectives," in IEEE Access, vol. 2, no. , pp. 514-525, 2014. [56] Bengio Y, LeCun Y (2007) Scaling learning algorithms towards, AI. In: Bottou L, Chapelle O, DeCoste D, Weston J (eds). Large Scale Kernel Machines. MIT Press, Cambridge, MA Vol. 34. pp 321–360. http://www.iro.umontreal.ca/~lisa/pointeurs/bengio+lecun_chapter2007.pdf [57] A. Coats, B. Huval, T. Wng, D. Wu, and A. Wu, ``Deep Learning with COTS HPS systems,'' J. Mach. Learn. Res., vol. 28, no. 3, pp. 1337-1345, 2013. [58] J.Wang and X. Shen, ``Large margin semi-supervised learning,'' J. Mach. Learn. Res., vol. 8, no. 8, pp. 1867-1891, 2007 [59] R. Fergus, Y. Weiss, and A. Torralba, ``Semi-supervised learning in gigantic image collections,'' in Proc. Adv. NIPS, 2009, pp. 522-530. [60] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Ng, ``Multimodal deep learning,'' in Proc. 28th Int. Conf. Mach. Learn., Bellevue, WA, USA, 2011 [61] N. Srivastava and R. Salakhutdinov, ``Multimodal learning with deep Boltzmann machines,'' in Proc. Adv. NIPS, 2012 [62] L. Bottou, ``Online algorithms and stochastic approximations,'' in On-Line Learning in Neural Networks, D. Saad, Ed. Cambridge, U.K.: Cambridge Univ. Press, 1998. [63] A. Blum and C. Burch, ``On-line learning and the metrical task system problem,'' in Proc. 10th Annu. Conf. Comput. Learn. Theory, 1997, pp. 45-53. [64] N. Cesa-Bianchi, Y. Freund, D. Helmbold, and M. Warmuth, ``On-line prediction and conversation strategies,'' in Proc. Conf. Comput. Learn. Theory Eurocolt, vol. 53. Oxford, U.K., 1994, pp. 205-216. [65] Y. Freund and R. Schapire, ``Game theory, on-line prediction and boosting,'' in Proc. 9th Annu. Conf. Comput. Learn. Theory, 1996, pp. 325-332. [66] Q. Le et al., ‘‘Building high-level features using large scale unsupervised learning,’’ in Proc. Int. Conf. Mach. Learn., 2012. [67] C. P. Lim and R. F. Harrison, ``Online pattern classifcation with multiple neural network systems: An experimental study,'' IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 33, no. 2, pp. 235-247, May 2003. [68] P. Riegler and M. Biehl, ``On-line backpropagation in two-layered neural networks,'' J. Phys. A, vol. 28, no. 20, pp. L507-L513, 1995 [69] M. Rattray and D. Saad, ``Globally optimal on-line learning rules for multi-layer neural networks,'' J. Phys. A, Math. General, vol. 30, no. 22, pp. L771-776, 1997.

[70] P. Campolucci, A. Uncini, F. Piazza, and B. Rao, ``On-line learning algorithms for locally recurrent neural networks,'' IEEE Trans. Neural Netw., vol. 10, no. 2, pp. 253-271, Mar. 1999 [71] N. Liang, G. Huang, P. Saratchandran, and N. Sundararajan, ``A fast and accurate online sequential learning algorithm for feedforward networks,'' IEEE Trans. Neural Netw., vol. 17, no. 6, pp. 1411-1423, Nov. 2006. [72] L. Bottou and O. Bousequet, ``Stochastic gradient learning in neural networks,'' in Proc. Neuro-Nimes, 1991. [73] S. Shalev-Shwartz, Y. Singer, and N. Srebro, ``Pegasos: Primal estimated sub-gradient solver for SVM,'' in Proc. Int. Conf. Mach. Learn., 2007. [74] D. Scherer, A. Müller, and S. Behnke, ``Evaluation of pooling operations in convolutional architectures for object recognition,'' in Proc. Int. Conf. Artif. Neural Netw., 2010, pp. 92-101. [75] J. Chien and H. Hsieh, ``Nonstationary source separation using sequential and variational Bayesian learning,'' IEEE Trans. Neural Netw. Learn. Syst., vol. 24, no. 5, pp. 681-694, May 2013. [76] W. de Oliveira, ``The Rosenblatt Bayesian algorithm learning in a nonstationary environment,'' IEEE Trans. Neural Netw., vol. 18, no. 2, pp. 584-588, Mar. 2007. [77] Hadoop Distributed File System,http://hadoop.apache.org/2012. [78] T. White. 2009. Hadoop: The Definitive Guide. OReilly Media, Inc. June 2009 [79] Shvachko, K.; Hairong Kuang; Radia, S.; Chansler, R., May 2010. The Hadoop Distributed File System,"2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). vol., no., pp.1,10 [80] Hadoop Distributed File System,https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/. [81] Dev, Dipayan, and Ripon Patgiri. "Dr. Hadoop: an infinite scalable metadata management for Hadoop—How the baby elephant becomes immortal." Frontiers of Information Technology & Electronic Engineering 17 (2016): 15-31. [82] http://deeplearning4j.org/ [83] Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113. [84] http://deeplearning.net/software/theano/ [85] http://torch.ch/ [86] Borthakur, Dhruba. "The hadoop distributed file system: Architecture and design." Hadoop Project Website 11.2007 (2007): 21. [87] Borthakur, Dhruba. "HDFS architecture guide." HADOOP APACHE PROJECT https://hadoop.apache.org/docs/r1.2.1/hdfs_design.pdf (2008): 39. [88] http://deeplearning4j.org/quickstart

[89] LeCun, Yann, and Yoshua Bengio. "Convolutional networks for images, speech, and time series." The handbook of brain theory and neural networks 3361.10 (1995): 1995. [90] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324. doi:10.1109/5.726791 [91] Gao, H., Mao, J., Zhou, J., Huang, Z., Wang, L., and Xu, W. (2015). Are you talking to a machine? Dataset and methods for multilingual image question answering. arXiv preprint arXiv:1505.05612. [92] Srinivas, Suraj, et al. "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision." arXiv preprint arXiv:1601.06615 (2016). [93] Zhou, Y-T., et al. "Image restoration using a neural network." IEEE Transactions on Acoustics, Speech, and Signal Processing 36.7 (1988): 1141-1151. [94] Maas, Andrew L., Awni Y. Hannun, and Andrew Y. Ng. "Rectifier nonlinearities improve neural network acoustic models." Proc. ICML. Vol. 30. No. 1. 2013. [95] He, Kaiming, et al. "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification." Proceedings of the IEEE International Conference on Computer Vision. 2015. [96] http://web.engr.illinois.edu/~slazebni/spring14/lec24_cnn.pdf [97] Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European Conference on Computer Vision. Springer International Publishing, 2014. [98] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014). [99] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. [100] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015). [101] Krizhevsky, Alex. "One weird trick for parallelizing convolutional neural networks." arXiv preprint arXiv:1404.5997 (2014). [102] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997. [103] Mikolov, Tomas, et al. "Recurrent neural network based language model." Interspeech. Vol. 2. 2010. [104] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by backpropagating errors. Nature, 323, 533–536. [105] Mikolov, T., Sutskever, I., Chen, K., Corrado, G., and Dean, J. (2013a). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. [106]Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv:1308.0850 [cs.NE]. [107] Pascanu, R., Mikolov, T., and Bengio, Y. (2013a). On the difficulty of training recurrent neural networks. In ICML’2013.

[108] Mikolov, T., Sutskever, I., Deoras, A., Le, H., Kombrink, S., and Cernocky, J. (2012a). Subword language modeling with neural networks. unpublished [109] Graves, A., Mohamed, A., and Hinton, G. (2013). Speech recognition with deep recurrent neural networks. ICASSP [110] Graves, A., Liwicki, M., Fernandez, S., Bertolami, R., Bunke, H., and Schmidhuber, J. (2009). A novel connectionist system for improved unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. [111] http://karpathy.github.io/2015/05/21/rnn-effectiveness/ [112] https://web.stanford.edu/group/pdplab/pdphandbook/handbookch8.html [113] Schuster, Mike, and Kuldip K. Paliwal. "Bidirectional recurrent neural networks." IEEE Transactions on Signal Processing 45.11 (1997): 2673-2681. [114] Graves, Alan, Navdeep Jaitly, and Abdel-rahman Mohamed. "Hybrid speech recognition with deep bidirectional LSTM." Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 2013 [115] Baldi, Pierre, et al. "Exploiting the past and the future in protein secondary structure prediction." Bioinformatics 15.11 (1999): 937-946 [116] Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780. [117] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009. [118] With QuickType, Apple wants to do more than guess your next text. It wants to give you an AI.". WIRED. Retrieved 2016-06-16 [119] Sak, Hasim, Andrew W. Senior, and Françoise Beaufays. "Long short-term memory recurrent neural network architectures for large scale acoustic modeling." INTERSPEECH. 2014. [120] Poultney, Christopher, Sumit Chopra, and Yann L. Cun. "Efficient learning of sparse representations with an energy-based model." Advances in neural information processing systems. 2006. [121] LeCun, Yann, et al. "A tutorial on energy-based learning." Predicting structured data 1 (2006): 0. [122] Ackley, David H., Geoffrey E. Hinton, and Terrence J. Sejnowski. "A learning algorithm for Boltzmann machines." Cognitive science 9.1 (1985): 147-169. [123] Desjardins, G. and Bengio, Y. (2008). Empirical evaluation of convolutional RBMs for vision. Technical Report 1327, Département d’Informatique et de Recherche Opérationnelle, Université de Montréal. [124] Hinton, G. E., Osindero, S., and Teh, Y. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18, 1527–1554. [125] Hinton, G. E. (2007b). Learning multiple layers of representation. Trends in cognitive sciences , 11(10), 428–434.

[126] Bengio, Yoshua, et al. "Greedy layer-wise training of deep networks." Advances in neural information processing systems 19 (2007): 153. [127] A.-R. Mohamed, T. N. Sainath, G. Dahl, B. Ramabhadran, G. E. Hinton, and M. A. Picheny, ``Deep belief networks using discriminative features for phone recognition,'' in Proc. IEEE ICASSP, May 2011, pp. 5060-5063. [128] R. Salakhutdinov and G. Hinton, ``Semantic hashing,'' Int. J. Approx. Reasoning, vol. 50, no. 7, pp. 969-978, 2009. [129] G. W. Taylor, G. E. Hinton, and S. T. Roweis, ``Modeling human motion using binary latent variables,'' in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2006,pp. 1345-1352. [130] Zhang, Kunlei, and Xue-Wen Chen. "Large-scale deep belief nets with mapreduce." IEEE Access 2 (2014): 395-403.

[131] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. Technical report, arXiv:1206.5538, 2012b. [132] Makhzani, Alireza, and Brendan Frey. "k-Sparse Autoencoders." arXiv preprint arXiv:1312.5663 (2013). [133] Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. "Reducing the dimensionality of data with neural networks." Science 313.5786 (2006): 504-507. [134] Vincent, Pascal, et al. "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion." Journal of Machine Learning Research 11.Dec (2010): 3371-3408. [135] Salakhutdinov, Ruslan, and Geoffrey Hinton. "Semantic hashing." RBM 500.3 (2007): 500. [136] Nesi, Paolo, Gianni Pantaleo, and Gianmarco Sanesi. "A hadoop based platform for natural language processing of web pages and documents." Journal of Visual Languages & Computing 31 (2015): 130-138.