Data Science and the Go Programming Language
Sponsored Content
Comments by Tom Miller, Faculty Director of Northwestern University’s MSDS program.
Years ago, as a student of applied statistics at the University of Minnesota, I learned a lesson about programming in academia. At the start of the course, the professor said,
“I don’t care what language you use for assignments, as long as you do your own work.”
I had experience with Fortran but was teaching myself Pascal, trying to adopt a structured programming style.
Taking the professor at his word, I programmed the first assignment in Pascal while my classmates used Fortran. The first assignment comes due. I walk my paper (a program listing) to the front of the room and hand it to the professor. He looks at it quizzically and asks, “What’s this?”
I explain, “It’s Pascal. You told us we could program in any language we like, as long as we do our own work.”
To which the professor says, “Pascal. I don’t read Pascal. I only read Fortran.”
Lesson learned: Academics are not especially open to new programming languages.
FORTRAN
Fortran was developed by John Backus at IBM and introduced in 1957. When you hear its name, think “formula translation.” Fortran is well-suited for numeric calculations, as needed for scientific and engineering applications. Fortran has seen a resurgence recently, perhaps due to the computational demands of large data sets and supercomputing.
PASCAL
Designed by Niklaus Wirth, a Swiss computer scientist, and introduced in 1970, Pascal is a derivative of ALGOL. Pascal was aligned with a movement toward structured programming at many universities in the 1970s and 1980s. Variations on Pascal have been used for systems programming at Apple and Microsoft.
Data science students at most universities today would have a similar experience if they were to submit assignments in Go, Rust, or any other contemporary language rather than Python or R.
With machine learning applications and AI, Python rules the day. Data scientists might feel content sailing along in a Python boat with life preservers such as NumPy, pandas, scikit-learn, and TensorFlow by their sides.
But watch out. Today’s data oceans are choppy. Sharks are approaching.
Recall the words of Chief Brody to Quint in the movie Jaws: “You’re gonna need a bigger boat.” I would suggest that a bigger, faster boat be built with Go.
GO (GOLANG)
Go was developed by three Google computer scientists: Robert Griesemer, Rob Pike, and Ken Thompson. It retains the performance advantages of C, while being easier and safer to work with than C. Go was introduced in 2009 and has been the primary systems programming language at Google. For mission-critical systems in many organizations, Go is replacing C/C++, C#, Java, and Python. Go is sometimes called “Golang” to distinguish it from the Go board game and to provide a more reliable term in search engines.
Data Science Careers: The Why of Go
In a presentation entitled “The Why of Go,” Carmen Andoh traced the development of computer languages from 1980 through 2017. She made a convincing argument for using Go in large programming projects. Her argument rings true today.
- Go Is Machine Efficient. It beats languages that are interpreted as well as languages that depend on virtual machines.
- Python joined the computer scene more than thirty years ago, before the prevalence of multi-core processors. Python is an interpreted language whose standard implementation has traditionally executed Python code on one thread at a time (the global interpreter lock), which makes it poorly suited for systems that demand concurrent processing.
- Data scientists may be writing in Python, but for compute-intensive tasks it is C or C++ that does the work. Python is just the “glue” that holds the pieces of the machine learning boat together.
- It does not take long to find examples of benchmarks demonstrating the advantages of Go over Python and R, the leading languages in data science.
Sometimes described as “C for the 21st century,” Go is a strongly typed language that compiles directly to machine code. It compiles much faster than C and executes almost as fast as C.
C, C++, AND C#
C was developed by Dennis Ritchie at Bell Labs and introduced in 1972. Because it provides low-level access to memory and maps easily to machine instructions, C has been a popular systems programming language for many years. C has performance advantages over most other programming languages. C++ adds object-oriented extensions to C while retaining much of C’s structure and performance. C#, developed by Microsoft, borrows C-style syntax but is a separate, managed language that runs on the .NET runtime.
Concurrent processing (never an easy task) is an intrinsic feature of Go
Go offers a rich set of tools for taking advantage of today’s multicore digital computers. Data science needs languages and systems that can handle the demands of today’s data-driven, data-intensive world. Data science needs Go.
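As a minimal sketch of what built-in concurrency looks like, the example below sums a slice of numbers by splitting it across goroutines and waiting for them with `sync.WaitGroup` from the standard library. The function name `parallelSum` and the choice of four workers are assumptions made for illustration, not taken from any particular project.

```go
package main

import (
	"fmt"
	"sync"
)

// parallelSum (a hypothetical helper) splits data into roughly equal
// chunks and sums each chunk in its own goroutine.
func parallelSum(data []float64, workers int) float64 {
	if workers < 1 {
		workers = 1
	}
	// One result slot per goroutine, so no locking is needed.
	partial := make([]float64, workers)
	chunk := (len(data) + workers - 1) / workers

	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		start := w * chunk
		if start >= len(data) {
			break
		}
		end := start + chunk
		if end > len(data) {
			end = len(data)
		}
		wg.Add(1)
		go func(w, start, end int) {
			defer wg.Done()
			for _, x := range data[start:end] {
				partial[w] += x
			}
		}(w, start, end)
	}
	wg.Wait() // block until every chunk has been summed

	total := 0.0
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	data := make([]float64, 1000)
	for i := range data {
		data[i] = float64(i + 1)
	}
	fmt.Printf("%.0f\n", parallelSum(data, 4)) // prints 500500
}
```

Each goroutine writes only to its own slot in `partial`, which is one simple way to avoid shared-state bugs. Channels would be another reasonable design for collecting the partial results.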
Go Is Programmer Efficient. Python is often touted as easy to learn. But I would argue that Go is easier to learn than Python. Go is simplicity by design, a language with only twenty-five keywords. Go is easy to read, easy to use, and easy to maintain over time.
Let’s be happy that the leaders of the Go community are reluctant to add new features. Donald Knuth had the right idea. After releasing version 3 of TeX, he declared that there would be no new features, only bug fixes. And with each bug-fix release, the version number would borrow another digit from π (pi): 3.1, 3.14, 3.141, and so on.
A mantra of Go programmers: “Keep it simple. Keep it running.”
Go has a well-defined structure with formatting utilities to ensure a common style across programmers, a style that is sometimes called “idiomatic Go.” Go has automated memory management (garbage collection), protecting programmers from memory leaks and errors. Go is safer than C and C++.
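The common style is enforced by tooling rather than by convention alone. The `gofmt` command is the usual way to apply it, and the same formatter is available to programs through the standard library's `go/format` package. Here is a small sketch, in which the helper name `tidy` and the sample source string are made up for illustration:

```go
package main

import (
	"fmt"
	"go/format"
)

// tidy (a hypothetical helper) returns src rewritten in the
// canonical gofmt style.
func tidy(src string) (string, error) {
	out, err := format.Source([]byte(src))
	if err != nil {
		return "", err // src did not parse as Go
	}
	return string(out), nil
}

func main() {
	// messy is deliberately badly laid out, but it is still valid Go.
	messy := "package main\nimport \"fmt\"\nfunc main(){\nfmt.Println( \"hi\" )\n}\n"
	clean, err := tidy(messy)
	if err != nil {
		fmt.Println("format error:", err)
		return
	}
	fmt.Print(clean)
}
```

Because every programmer's code passes through the same formatter, arguments about indentation and spacing tend not to arise in Go code reviews.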
Go core developers have a commitment to backward compatibility, and Go’s module system promotes safety, ensuring that the right packages are incorporated into each build at compile time. Go keeps track of software versions as the software stack grows.
Think of software development as a game of Jenga. We want to access the blocks at the bottom of the stack, while ensuring that the entire stack does not collapse. Go lets us do this.
Go Simplifies the Software Stack. What about the software stack, the infrastructure?
When Python (even bolstered by C or C++) is not up to the task, data scientists turn to other languages and systems. Here is a so-called solution to Python’s performance problems:
To implement high-performance solutions, data scientists turn to Spark, which is built on Scala, which depends on the Java Virtual Machine. And to provide easy access, these well-meaning data scientists add PySpark to the mix. Is this the best way to address Python’s performance problems? No.
Consider a simpler software stack. It’s Go, just Go.
With code examples from GopherCon conferences in 2021 and 2023, Daniel Whitenack shows how to implement machine learning and artificial intelligence solutions in Go. We can use Go to build integrated, intelligent web applications, including those that call on generative AI and large language models.
Go represents the quintessential systems programming language for today’s multicore, digital computers. Go is the language of the cloud. Go is the language of distributed computing. Data scientists who looked to Python as the “glue language” of the past can now look to Go as the “super glue.”
Go Is Widely Used in Industry. Companies value the safety, simplicity, and performance of Go. They also recognize Go’s strengths as a backend systems programming environment. Go is well-suited for developing web and database servers, application programming interfaces, and microservices. Go is well-suited for implementing scalable, high-performance systems.
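To give a sense of how little code a small data service needs, here is a sketch of a JSON endpoint written with only the standard library. The `/mean` route, the payload types, and the helper `callMean` are all hypothetical names chosen for this example. The handler is exercised in memory with `net/http/httptest` so that the program runs without opening a network port.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// meanRequest and meanResponse are hypothetical payload types.
type meanRequest struct {
	Values []float64 `json:"values"`
}

type meanResponse struct {
	Mean float64 `json:"mean"`
}

// meanHandler decodes a JSON list of numbers and responds with their mean.
func meanHandler(w http.ResponseWriter, r *http.Request) {
	var req meanRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil || len(req.Values) == 0 {
		http.Error(w, `expected JSON like {"values": [1, 2, 3]}`, http.StatusBadRequest)
		return
	}
	sum := 0.0
	for _, v := range req.Values {
		sum += v
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(meanResponse{Mean: sum / float64(len(req.Values))})
}

// callMean runs meanHandler in memory, with no network port opened,
// and returns the HTTP status code and response body.
func callMean(body string) (int, string) {
	req := httptest.NewRequest(http.MethodPost, "/mean", strings.NewReader(body))
	rec := httptest.NewRecorder()
	meanHandler(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := callMean(`{"values": [2, 4, 9]}`)
	fmt.Println(code) // prints 200
	fmt.Print(body)   // prints {"mean":5}
}
```

A deployed service would register the handler with `http.HandleFunc("/mean", meanHandler)` and start it with `http.ListenAndServe`. The standard library serves each request in its own goroutine, so the handler runs concurrently without extra code.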
Beginning with Google, the birthplace of Go, many companies rely on Go for large, mission-critical systems. If Go is good enough for Google, Netflix, Uber, Dropbox, PayPal, American Express, Capital One, Salesforce, Zillow, and many others, then Go is good enough for the rest of us.
If Go can provide an effective platform for building Docker, Kubernetes, Prometheus, Grafana, Pachyderm, Terraform, CrowdStrike, etcd, CockroachDB, Weaviate, Milvus, Aerospike, and a diverse array of distributed systems and cloud-native microservices, then Go can be an effective platform for building data science applications.
Computer science and data science educators should learn from industry. They should add Go to their courses. This is what we are doing at Northwestern.
Three Languages for Data Science at Northwestern
Using Go for data science does not imply that we must give up the good things that R and Python provide. We can be multilingual.
It is not hard to imagine projects for which a data scientist might explore data with R, develop models with Python, and implement systems in Go. Among the three languages for data science, Go is the newest. Go is trending upward and offers substantial job opportunities.
Northwestern’s data science program appreciates the strengths of the three languages for data science across specializations within the program.
- R, with numerous packages for analytics and modeling, is well-regarded by applied statisticians. It is an excellent choice for scientific programming and applied research. R is especially good for exploring and visualizing data. R is the primary language in most courses in Northwestern’s Analytics and Modeling specialization.
- Python is currently the most popular computer language in data science. It is especially strong in natural language processing and serves as the primary client to deep learning platforms. Python provides a feature-rich environment for developing models, and Python is the primary language in most courses in Northwestern’s Artificial Intelligence specialization.
- Go is a systems programming language designed for today’s multi-processor computers. It is well-suited for implementing scalable, high-performance systems for data science, including web applications and database servers. Go is the primary language in Northwestern’s Data Engineering specialization, as shown in the Learning Go for Data Science website.
Students in Northwestern University’s online MS in Data Science program build the essential analysis and leadership skills needed to analyze and interpret data to make informed, impactful decisions in a wide range of fields. Classes are led by an accomplished faculty of industry experts. Students develop expertise in their areas of interest by selecting a general data science track or one of five specializations: Analytics and Modeling, Analytics Management, Artificial Intelligence, Data Engineering, and Technology Entrepreneurship. Students learn part-time, at their own pace entirely online. Applications are accepted quarterly.