Chapter 4 Choosing and using coding tools
4.1 What are open source tools?
Open source tools are software applications that have been developed and released to the public under an open source licence. Open source licences are legal agreements that allow users to freely use, modify, and distribute the source code of a software application.
The main open source tools we will be referring to in this book are:
- Python is a general-purpose language that is known for its simplicity and readability.
- R is a specialised language that is designed specifically for statistical computing, graphics and data analysis.
- Git is an open source version control system that is widely used in software development to track changes to code and collaborate with others.
- SQL itself is neither open source nor proprietary; it is a standardised language used to query and manage databases. However, there are many open source database management systems (DBMS) that support SQL.
4.2 Why are we promoting these tools over things like SAS/SPSS?
The main advantages of open source coding tools for us at DfT are:
- Cost: Open source coding tools are free to use, distribute, and modify, which makes them a cost-effective option. There are no licensing fees or other costs associated with using the tools, which helps reduce the barriers to accessing advanced data science and statistical computing capabilities.
- Flexibility: R and Python can both be used for a wide range of tasks. They have vast libraries of packages and modules that allow users to extend the functionality of the languages for their specific needs.
- Community: Open source coding tools are supported by active communities of developers and users who collaborate on their development and support. This can lead to more frequent updates, bug fixes, and new features, as well as a broader range of support options and documentation.
- Integration: R and Python can easily integrate with other tools and languages, such as SQL and Git, which allows users to leverage existing resources and infrastructure.
4.3 The coding tools available to you
4.3.1 Data analyst languages
While there are an almost unlimited number of coding languages that you could theoretically use to process and analyse data, the two recommended consistently across the analyst community are R and Python. R is a specialist statistical coding language specifically designed for all stages of data ingestion and analysis, whereas Python is a more general coding language with strong support for analysis.
These data analyst languages are ideal for bringing in, processing, and visualising data ready to share with others. They are complementary to (and not replacements for!) the database tools covered in the next section.
4.3.1.1 R
R is the most commonly used language by analysts in DfT (and generally across the Civil Service). It is well supported and easy to get access to in the department through our Cloud R platform: a virtual instance of R which users can request access to through the CRAN R channel, and which does not need a high-performance laptop to run. It is also regularly upgraded to make sure a recent version of R is available, and allows users to download and install the packages they need.
The pros and cons of R:
Advantages:
- Comprehensive statistical analysis tools: R provides a comprehensive suite of statistical analysis tools that allow you to perform a wide range of data analyses, from basic descriptive statistics to advanced statistical modelling.
- Data visualisation capabilities: R provides a wide range of data visualisation tools, including packages such as ggplot2, which allow you to create high-quality and customisable plots and charts.
- Large and active user community: R has a large and active community of users and developers, both inside DfT and more widely across the web, which means that you can find a lot of resources, documentation, and help within the CRAN network and online.
- Good support across the analyst community: R is the most commonly used language in DfT, so you can be assured of high-quality R technical support, and good resilience when using R.
- Integration with other programming languages and software: R can easily integrate with other programming languages and software, such as BigQuery, Python and SQL, allowing you to take advantage of their respective strengths.
Disadvantages:
- Memory limitations: R holds the data it is working on in memory, which means that it can struggle with very large datasets or complex computations.
- Lack of a standardised style: R has a range of different packages and functions that can achieve the same result, which creates a lot of variation in what code looks like if best practice isn’t followed.
- Limited support for machine learning: While R does have some machine learning packages, it may not be the best choice for users who need to perform complex machine learning tasks.
Use cases:
Broadly, R is the preferred data analyst language across DfT, and is appropriate for the majority of tasks analysts would want to carry out on a daily basis. While it is not necessarily better than Python at these tasks, it is more widely supported across DfT, in terms of both technical support and training, so most teams will want to default to R where possible. It is ideal for a range of tasks including:
- Statistical analysis: R is ideal for statistics due to its wide range of statistical tools and packages, such as dplyr and tidyr, which allow you to easily explore, manipulate, and summarise data (see the sketch after this list).
- Production of outputs: R can format data outputs in a variety of neat formats, including HTML reports and slidepacks, dashboards and apps, and publication-ready tables.
- Data visualisation: R provides powerful data visualisation packages, which allow analysts to create high-quality and interactive visualisations.
- Time series analysis: R has a range of time series packages that allow data analysts to explore and model time series data for financial forecasting and demand prediction.
- Text mining: R is ideal for text mining, allowing analysts to process free-text data for sentiment analysis, topic modelling, and information extraction.
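As a flavour of what this looks like in practice, below is a minimal sketch of a typical R workflow: summarising a dataset with dplyr, then plotting the result with ggplot2. It uses the built-in mtcars dataset and assumes both packages are installed.

```r
# A minimal sketch of a typical R analysis workflow: summarise with dplyr,
# then visualise with ggplot2. Uses the built-in mtcars dataset.
library(dplyr)
library(ggplot2)

# Average fuel economy by number of cylinders
summary_df <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n_cars = n())

# Plot the summary as a bar chart
ggplot(summary_df, aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(
    x = "Number of cylinders",
    y = "Mean miles per gallon",
    title = "Fuel economy by engine size"
  )
```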
4.3.1.2 Python
Outside of DfT, Python use is ubiquitous across a wide range of applications, including data analysis, although within the Civil Service uptake has been slower. Access to Python within DfT is less well established than R, and is currently split between local access via PyCharm and Jupyter notebooks, and access on virtual machines in GCP, depending on use case. You can ask on the CRAN Python channel for details of the right platform for you.
Advantages:
- Large and active user community: Python has a large and active user community outside of DfT, and a growing one within the department. This means that while you may not find as much DfT-specific learning material as there is for R, you can find plenty of resources, libraries, and tutorials externally.
- Comprehensive data analysis libraries: Python has a wide range of powerful and comprehensive data analysis libraries, such as NumPy and pandas, that make it easy to perform a variety of data analysis tasks.
- Strong machine learning capabilities: Python has several powerful machine learning libraries which make it a popular choice for data analysis tasks that involve machine learning and AI.
Disadvantages:
- Memory limitations: Like R, Python can be a heavy user of computer memory, which can make it challenging to work with very large datasets.
- Limited visualisation capabilities: While Python has solid visualisation libraries, such as Matplotlib and Seaborn, its visualisation ecosystem is generally less rich than R’s.
- Lack of a standardised style: Like R, Python has several different styles of writing code, which can make it challenging to maintain code consistency across projects or teams without strict adherence to best practice.
- Steep learning curve for machine learning: While Python has powerful machine learning libraries, the learning curve for using them effectively can be steep, requiring a solid understanding of maths and statistics.
Use cases:
Broadly, while Python can achieve many of the tasks that R can, the fact that it is less widely used in DfT means that a strong business case for its use should be established before moving away from R. You should be confident that it is the best tool for the job, and that you have the skills pipeline to use and maintain the code in the long term. Some instances where Python should be considered over R include:
- Machine learning: Python is an excellent choice for machine learning, using libraries such as scikit-learn and TensorFlow to develop and deploy predictive models for various applications, such as fraud detection, customer churn prediction, and product recommendation.
- Natural language processing: Python has a range of natural language processing libraries, which are useful for a variety of qualitative applications, such as sentiment analysis, topic modelling, and information extraction.
- Modelling: Python’s modelling capabilities make it a popular choice, particularly for building complex predictive and economic models.
- Image processing: Python’s OpenCV library allows images to be processed and analysed for various applications, such as facial recognition and object detection.
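One middle ground worth being aware of: if your team is largely R-based but needs a specific Python library, the reticulate R package lets you call Python from R. This is an illustrative option rather than an official DfT route, and the sketch below assumes reticulate, a working Python installation, and the numpy module are all available.

```r
# A minimal sketch of calling Python from R via the reticulate package.
# Assumes reticulate is installed and a Python environment (with numpy)
# has been configured.
library(reticulate)

# Import a Python module and call it from R
np <- import("numpy")
np$mean(c(1, 5, 9))  # returns 5

# Short pieces of Python can also be run directly,
# with results accessible from R via the `py` object
py_run_string("greeting = 'hello from Python'")
py$greeting
```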
4.3.2 Database languages
Unlike data analysis languages, database languages are used to create and interact with databases: to store, extract and manipulate data within tables. As previously mentioned, one is not a substitute for the other, and it’s very normal to use a database language to store and query your data, and then pull that data into a data analysis language for further manipulation, processing and visualisation.
The only database language used in DfT by analysts is SQL. It is used across two main platforms (SQL Server and BigQuery on GCP), which have slightly different user interfaces, and small variations in the syntax used.
4.3.2.1 SQL Server
Microsoft SQL Server is widely used across the analyst community for storing large quantities of data across a few server instances. You can access the data directly in the SQL Server interface, or you can query it from R or Python (see the sketch below). Many older databases within DfT have been stored in SQL Server for a long time, and usage is expected to continue for a while.
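As an illustration of querying SQL Server from R, the sketch below uses the DBI and odbc packages. The driver, server, database and table names are hypothetical placeholders, so substitute the details for your own server.

```r
# A minimal sketch of querying SQL Server from R using DBI and odbc.
# The server, database and table names are hypothetical placeholders.
library(DBI)

con <- dbConnect(
  odbc::odbc(),
  Driver   = "SQL Server",
  Server   = "my-sql-server",   # replace with your server name
  Database = "my_database",     # replace with your database name
  Trusted_Connection = "Yes"    # authenticate with your network login
)

# Pull the first ten rows of a table into an R data frame
df <- dbGetQuery(con, "SELECT TOP 10 * FROM dbo.my_table")

dbDisconnect(con)
```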
Advantages:
- Familiarity: Most users have made use of the SQL Server interface before, and understand the program itself and the associated syntax.
- Integration: SQL Server is well integrated with other Microsoft products, such as Excel and Power BI, making it easy to move data between applications and perform complex data analysis.
- Cost transparency: The cost of hosting SQL Server is front-loaded, so users don’t have to worry about creating expensive queries by mistake!
Disadvantages:
- Cost: SQL Server can be expensive, particularly for larger deployments that require more advanced features and capabilities.
- Complexity: SQL Server instances can be complex to set up and manage, particularly when they are part of legacy systems and contain a lot of data within the same database.
- Compatibility: SQL Server is designed primarily for use with Microsoft products, so it may not be the best choice as DfT moves towards a mixture of coding platforms and other software.
- Proprietary product: Users are tied to Microsoft for updates, support, and future development.
Use cases:
Broadly, while SQL Server is still a practical way to access data which is already stored in a legacy database, it is not the platform of choice for new data storage. You will likely want to consider using BigQuery over SQL Server for new projects, even if you don’t plan to transfer over older data storage in the near future.
4.3.2.2 BigQuery
In contrast to SQL Server, BigQuery is a Google product which has been in use at DfT for a relatively short time. It is still queried using SQL, but stores data in a columnar format which makes it more efficient to store and query, as well as more scalable. Newer data storage is increasingly being done in BigQuery in DfT, as it offers a number of advantages for teams; it can also be queried directly from R (see the sketch below).
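As with SQL Server, you can query BigQuery directly from R; the bigrquery package is one common route. A minimal sketch follows, in which the GCP project, dataset and table names are hypothetical placeholders.

```r
# A minimal sketch of querying BigQuery from R using the bigrquery package.
# The project, dataset and table names are hypothetical placeholders.
library(bigrquery)

# Authenticate with Google (opens a browser the first time)
bq_auth()

# Run a query against a table and download the result as a data frame
tb <- bq_project_query(
  "my-gcp-project",  # project that is billed for the query
  "SELECT road_name, AVG(traffic_count) AS avg_count
   FROM `my-gcp-project.my_dataset.traffic`
   GROUP BY road_name"
)
df <- bq_table_download(tb)
```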
Advantages:
- Integration with other Google Cloud products: As BigQuery is part of the larger GCP infrastructure, data ingestion can be simplified using data pipelining products to bring data in from APIs and storage buckets automatically.
- Scalability: BigQuery is designed for massive scalability, allowing it to easily handle large volumes of data and complex queries without requiring additional infrastructure.
- Fast query performance: BigQuery is optimised for fast query performance, even on large datasets, thanks to its columnar storage and distributed architecture.
- Cost: BigQuery uses a pay-as-you-go pricing model, which can be more cost-effective than hosting a SQL Server instance.
Disadvantages:
- Pricing complexity: While the pay-as-you-go costs for BigQuery can be much cheaper, the pricing for queries isn’t immediately obvious, and can lead to concerns that complex or frequent queries could be very expensive.
- Limited SQL compatibility: BigQuery’s SQL dialect does not support every feature available in SQL Server, which can be a disadvantage when transferring existing code and databases to BigQuery.
Use cases:
While BigQuery is a newer technology, it is a fast and efficient way to store your data, and opens up a range of opportunities for pipelining data in and out. While you may not be planning to transfer existing data from SQL Server in the short term, you will likely want to consider using BigQuery over SQL Server for new projects.
4.3.3 Version control
Alongside both data analysis and database languages, version control is a powerful tool which allows you to track the when, what, who and how of every single change to your code. Using Git and GitHub doesn’t replace any other coding language, but provides a simple way to keep track of what you are coding.
4.3.3.1 Git/Github
Git and GitHub are used in combination to version control your code: Git is the software that tracks changes locally, and GitHub is the cloud-hosted service those changes are synchronised with.
Advantages:
- Version control: Git allows you to easily track changes made to your code and data, and roll back to previous versions if needed. This is important for quality data analysis, where you need to keep track of changes made to the data and code over time.
- Collaboration: Git and GitHub make it easy to collaborate on projects with others. Multiple people can work on the same code and data simultaneously, and merge changes together in a controlled and organised way.
- Code sharing: GitHub provides a platform for sharing code and data with others, making it easy to publish and share your code with others in DfT, or more widely with the public.
- Workflow management: Git and GitHub provide tools for managing coding workflows, which can help ensure that changes to the code are reviewed and approved before they are merged into the main code base, and that a log of these reviews is stored alongside the code.
- Documentation: GitHub provides a platform for documenting your code and analysis, making it easy to find and make use of documentation over time.
Disadvantages:
- Learning curve: Git and GitHub can have a steep learning curve, particularly for those who are new to version control and collaboration tools, and may take time to implement at first.
- Data storage: GitHub is designed for storing and versioning code, and is not suitable for storing large datasets or binary files such as Excel or Word documents.
- Security concerns: Storing sensitive information inappropriately on GitHub increases the risk of accidentally sharing that data with others.
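In practice, the day-to-day Git cycle of staging, committing and pushing changes can be run from the command line, from the RStudio Git pane, or scripted from R itself. Below is a minimal sketch of that cycle using the gert R package; it assumes gert is installed and that you are working inside a repository already linked to GitHub, and the commit message is illustrative.

```r
# A minimal sketch of the basic Git cycle (stage, commit, push) driven
# from R via the gert package. Assumes the working directory is a
# repository that is already linked to GitHub.
library(gert)

# Stage all changed files in the repository
git_add(".")

# Record the staged changes with a descriptive message
git_commit("Update traffic summary script")

# Send the new commit to the linked GitHub repository
git_push()
```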
4.3.4 How to get access to coding tools
4.3.4.1 R
- Contact an R Super User (you can do this by posting directly on the CRAN R channel), asking for access to Cloud R. You should provide the name of the server you want access to, or failing this, the name of the team you are in.
- You will be provided with a server link which you can log on to with your usual network details.
4.3.4.2 Python
- There are several different ways to get access to Python in DfT at present. Check the CRAN Python wiki to learn about the different platforms and to choose the right one for you.
4.3.4.3 SQL
- You can request both the Microsoft SQL Server software and access to the appropriate server and tables via your IT Focal Point.
4.3.4.4 BigQuery
- You can request access to an existing BigQuery instance through a Google Cloud Platform service request from your IT Focal Point.
- If you’d like to try out the features of BigQuery, you can also request a sandbox account to experiment with non-sensitive data (also through your IT Focal Point).
4.3.4.5 Git/Github
- Contact your IT Focal Point and ask them to raise a service request for you to be added to GitHub.
- You will receive an email invitation to join the DfT enterprise account.
- Follow the instructions in the CRAN wiki to set up your account; one common first-time setup step is sketched below.
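For illustration only (the CRAN wiki instructions take precedence), first-time setup usually involves telling Git who you are. The sketch below does this from R via the gert package, assuming it is installed; the name and email address are placeholders.

```r
# A minimal sketch of first-time Git configuration from R via the gert
# package. The name and email address are placeholders.
library(gert)

git_config_global_set("user.name", "A N Analyst")
git_config_global_set("user.email", "a.analyst@example.gov.uk")
```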