Linux Systems Programming


Kernel

Kernel functions

User mode vs Kernel mode

Modern processor architectures typically allow the CPU to operate in at least two different modes: user mode and kernel mode (sometimes also referred to as supervisor mode). Hardware instructions allow switching from one mode to the other. Correspondingly, areas of virtual memory can be marked as being part of user space or kernel space. When running in user mode, the CPU can access only memory that is marked as being in user space; attempts to access memory in kernel space result in a hardware exception. When running in kernel mode, the CPU can access both user and kernel memory space.

Certain operations can be performed only while the processor is operating in kernel mode. Examples include executing the halt instruction to stop the system, accessing the memory-management hardware, and initiating device I/O operations. By taking advantage of this hardware design to place the operating system in kernel space, operating system implementers can ensure that user processes are not able to access the instructions and data structures of the kernel, or to perform operations that would adversely affect the operation of the system.

Process

A process is started by the kernel, and it is also the kernel that can end it; all inputs into a process come through the kernel, and all outputs from a process pass through the kernel.

ps -e

Show all processes

If A and B are processes, they can't talk to each other directly; all communication must pass through the kernel.

In a sense this is like a client-server API: if the client (process) wants to access any resources, it asks the server (kernel).
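
As a sketch of this client-server relationship, spawning a child process and reading its output both go through the kernel: process creation, the pipe carrying the child's output, and the exit status. A minimal Rust example, assuming the echo program is available (the run_echo helper name is made up for this note):

```rust
use std::process::Command;

/// Run `echo <msg>` as a child process and return its trimmed stdout.
/// Everything here is kernel-mediated: creating the process, the pipe
/// that carries the child's output, and the reported exit status.
fn run_echo(msg: &str) -> String {
    let output = Command::new("echo")
        .arg(msg)
        .output()
        .expect("failed to spawn child process");
    assert!(output.status.success());
    String::from_utf8_lossy(&output.stdout).trim().to_string()
}

fn main() {
    println!("{}", run_echo("hello"));
}
```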

cat /proc/1/limits | grep processes

Get maximum number of processes on your system

  1. descriptor 0 is standard input
  2. descriptor 1 is standard output
  3. descriptor 2 is standard error
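
The standard streams in Rust expose these descriptor numbers directly; a small sketch confirming the mapping (Linux/Unix-specific, via the AsRawFd trait; the std_descriptors helper name is made up):

```rust
use std::io;
use std::os::unix::io::AsRawFd;

/// Return the raw file descriptor numbers of (stdin, stdout, stderr).
fn std_descriptors() -> (i32, i32, i32) {
    (
        io::stdin().as_raw_fd(),  // standard input
        io::stdout().as_raw_fd(), // standard output
        io::stderr().as_raw_fd(), // standard error
    )
}

fn main() {
    let (fd_in, fd_out, fd_err) = std_descriptors();
    println!("stdin={} stdout={} stderr={}", fd_in, fd_out, fd_err);
}
```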

Memory Mappings (mmap())

//TODO:

init

daemon

Interprocess Communication and Synchronization

Signals

Kernel signals

Process Time

  1. system CPU time: the time spent executing system calls and performing other kernel services on behalf of the process
  2. user CPU time: the time spent executing code in user mode (program code)
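
On Linux, both times can be read for the current process from /proc/self/stat, where utime and stime are fields 14 and 15 (in clock ticks). A hedged sketch (the cpu_times helper name is made up for illustration):

```rust
use std::fs;

/// Read this process's user and system CPU time (in clock ticks)
/// from /proc/self/stat: utime is field 14, stime is field 15.
fn cpu_times() -> Option<(u64, u64)> {
    let stat = fs::read_to_string("/proc/self/stat").ok()?;
    // The command name (field 2) is parenthesised and may itself
    // contain spaces, so split after the last closing parenthesis.
    let rest = stat.rsplit(')').next()?;
    let fields: Vec<&str> = rest.split_whitespace().collect();
    let utime = fields.get(11)?.parse().ok()?; // field 14: user time
    let stime = fields.get(12)?.parse().ok()?; // field 15: system time
    Some((utime, stime))
}

fn main() {
    let (utime, stime) = cpu_times().expect("could not read /proc/self/stat");
    println!("user CPU time: {} ticks, system CPU time: {} ticks", utime, stime);
}
```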

User Profile

Users are just another type of client to the server (kernel).

cat /etc/passwd | grep <USERNAME>

Show user profile

The above outputs something like:

username:password:UID:GID:comment:home:shell

The fields, separated by colons, are: the username; the password field (an x means the hashed password is stored in /etc/shadow); the user ID (UID); the group ID (GID); a comment describing the user account; the user's home directory; and the shell launched on user login.
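
As an illustration of this layout, here is a small parser for one such line (the parse_passwd_line helper is hypothetical, written for this note):

```rust
/// Split one /etc/passwd line into its seven colon-separated fields.
fn parse_passwd_line(line: &str) -> Option<(String, String, u32, u32, String, String, String)> {
    let fields: Vec<&str> = line.split(':').collect();
    if fields.len() != 7 {
        return None; // malformed entry
    }
    Some((
        fields[0].to_string(),   // username
        fields[1].to_string(),   // password field ("x" => /etc/shadow)
        fields[2].parse().ok()?, // UID
        fields[3].parse().ok()?, // GID
        fields[4].to_string(),   // comment (GECOS)
        fields[5].to_string(),   // home directory
        fields[6].to_string(),   // login shell
    ))
}

fn main() {
    let line = "root:x:0:0:root:/root:/bin/bash";
    let user = parse_passwd_line(line).expect("malformed passwd line");
    println!("user {} has UID {} and shell {}", user.0, user.2, user.6);
}
```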

Superuser

cat /etc/passwd | grep root

Show superuser profile

User groups

Users can be grouped together for administrative purposes; imagine some files that can only be accessed by users that are part of a specific group.

cat /etc/group | grep <GROUP_NAME>

Show a specific group's entry

The above outputs something like:

group_name:group_password:group_id:group_members

Filesystem

ls /

List the files and directories in the root directory

Permissions

User Permissions

Each file has an associated user ID (UID) and group ID (GID) that define the owner of the file and the group it belongs to. These properties are also the building blocks of file permissions.

In the context of permissions, there are 3 types of entities within the system; entities can interact with files depending on the file's permissions.

Here are the entities:

  1. Owner of the file
  2. Group members of the file
  3. Other users

There are 3 types of permissions:

  1. read allows an entity to read the file (for directories, read allows an entity to list the contents of the directory)
  2. write allows an entity to modify the file (for directories, write allows the contents of the directory to be changed)
  3. execute allows an entity to execute the file (for directories, execute allows access to files within the directory)
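
A short sketch of setting and reading these permission bits from Rust (Unix-specific; the /tmp path and helper name are made up for illustration):

```rust
use std::fs;
use std::os::unix::fs::PermissionsExt;

/// Create `path`, set its permission bits to `mode`, and return the
/// bits the filesystem stores (masked to the rwxrwxrwx bits).
fn set_and_read_mode(path: &str, mode: u32) -> std::io::Result<u32> {
    fs::write(path, b"demo")?;
    fs::set_permissions(path, fs::Permissions::from_mode(mode))?;
    // metadata().permissions().mode() also carries file-type bits,
    // so mask down to the owner/group/other permission bits.
    let stored = fs::metadata(path)?.permissions().mode() & 0o777;
    fs::remove_file(path)?;
    Ok(stored)
}

fn main() -> std::io::Result<()> {
    // rw-r--r--: owner read+write, group read, others read
    let mode = set_and_read_mode("/tmp/perm_demo.txt", 0o644)?;
    println!("stored mode: {:o}", mode);
    Ok(())
}
```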

Process Permissions

// TODO:

Syscalls

   +-----------------+   |              ...
   | Program/Process |   |      .-----> [ ]
   +-||----------||--+   |      |       [ ]
    \      libc     /    | .----'       [ ]
     '-------------'     | |            [ ]
           |             | |            [ ]
           |             | |            [ ]
           '---------------'            ...
                         |
       [User Space]      |     [Kernel Space]
                         |
                         |

IO (syscall) Buffering

This section has been adapted from: https://era.co/blog/unbuffered-io-slows-rust-programs

Programming languages have access to OS syscalls; these are used for things such as I/O.

Syscalls are slow to call, so when designing high-performance code all syscall usage should be analyzed.

No buffering, slow:

use std::fs;
use std::io::{self, Write};

fn main() -> io::Result<()> {
    let mut f = fs::File::create("/tmp/unbuffered.txt")?;
    f.write(b"foo")?;
    f.write(b"\n")?;
    f.write(b"bar\nbaz\n")?;
    Ok(())
}

We can use the strace program to see the syscalls used in a program:

$ strace --trace=write ./target/release/01_unbuffered
write(3, "foo", 3)                      = 3
write(3, "\n", 1)                       = 1
write(3, "bar\nbaz\n", 8)               = 8

We should rather use buffered I/O; BufWriter collects writes in memory and only issues a write syscall when its buffer fills, when flush is called, or when it is dropped:

use std::fs;
use std::io::{self, BufWriter, Write};

fn main() -> io::Result<()> {
    let mut f = BufWriter::new(fs::File::create("/tmp/buffered.txt")?);
    f.write(b"foo")?;
    f.write(b"\n")?;
    f.write(b"bar\nbaz\n")?;
    Ok(())
}
$ strace --trace=write ./target/release/02_buffered
write(3, "foo\nbar\nbaz\n", 12)         = 12

fsync

This section has been adapted from: https://bonsaidb.io/blog/durable-writes/#What%20are%20%27durable%20writes%27%3F

When writing data to a file, the data is cached in RAM by the OS; it is not immediately written to the filesystem. Writing to the disk is slow, so buffered I/O is used. But if power is suddenly cut, any data still sitting in RAM that had not yet reached the file is lost forever.

To prevent the loss of buffered/cached data, we need to flush, or sync, the data.

Rust uses the correct APIs for each platform when File::sync_all or File::sync_data is called, providing durable writes. For platform-specific behaviour beyond these, the standard library does not expose the underlying system calls directly; thankfully, the libc crate makes them easy to call.
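
A minimal sketch of a durable write using File::sync_all (the path and the durable_write function name are made up for illustration):

```rust
use std::fs::File;
use std::io::Write;

/// Write `data` to `path` and force it to stable storage before returning.
fn durable_write(path: &str, data: &[u8]) -> std::io::Result<()> {
    let mut f = File::create(path)?;
    f.write_all(data)?;
    // sync_all maps to fsync(2) on Linux: it flushes both file data and
    // metadata to the storage device. (sync_data maps to fdatasync(2)
    // and may skip non-essential metadata.)
    f.sync_all()
}

fn main() -> std::io::Result<()> {
    durable_write("/tmp/durable.txt", b"important data\n")
}
```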

SQLite

Transactions

In programs that use SQLite, there can be various actions for the database to perform. These actions can be grouped together in what's called a Transaction.

A transaction is a sequence of actions on data items.

Transactions help prevent problems that could arise, such as loss of durability when a program crashes, an unexpected power failure, or subtle bugs in complex concurrent code. (These restrictions/guarantees are basically ACID; more info below.)

Programs can start a transaction and execute operations as part of it. But for a transaction's changes to take effect, the transaction must be committed. To commit simply means to instruct the database to permanently update its state according to the operations contained within the transaction.

Transactions can be considered logical units of work for a database system. If a transaction fails, the database must remove its effects and revert to the state it was in before the transaction occurred.

Not only are transactions units of work that move the database state forward, they are also a database abstraction with the following guarantees (aka ACID):

  1. Atomicity: either all of a transaction's operations take effect, or none do
  2. Consistency: a transaction moves the database from one valid state to another
  3. Isolation: concurrent transactions do not see each other's intermediate states
  4. Durability: once committed, a transaction's effects survive crashes and power failures

To get a better view of Transactions, let us see them at work using Rust and the rusqlite crate:

use rusqlite::{params, Connection, Result};

/// A helper function for connecting the database
fn connect_db() -> Result<Connection> {
    let conn = Connection::open("/tmp/TEST_DB.db")?;

    conn.execute(
        "CREATE TABLE IF NOT EXISTS vals(
            v  INTEGER NOT NULL
        )",
        [],
    )?;

    Ok(conn)
}

/// A slow way to insert rows
fn slow_insert(conn: &Connection) -> Result<()> {
    for count in 1..=1000 {
        conn.execute("INSERT INTO vals (v) VALUES (?1)", params![count])?;
    }

    Ok(())
}

/// A fast way to insert rows
fn fast_insert(conn: &mut Connection) -> Result<()> {
    let tx = conn.transaction()?;

    for count in 0..1000 {
        tx.execute("INSERT INTO vals (v) VALUES (?1)", params![count])?;
    }
    tx.commit()?;
    Ok(())
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_slow_insert() {
        let conn = connect_db().unwrap();
        slow_insert(&conn).unwrap();
    }

    // #[test]
    // fn test_fast_insert() {
    //     let mut conn = connect_db().unwrap();

    //     fast_insert(&mut conn).unwrap();
    // }
}

In the above code we try out two ways to insert 1000 rows into an SQLite database. The code has three functions: connect_db (a helper that opens the database and creates the vals table), slow_insert and fast_insert.

The code also has two test functions, test_slow_insert and test_fast_insert; the latter is commented out because we only want to test the slow one first by running:

cargo test

We see that it is quite slow; the test output on my machine:

running 1 test
test tests::test_slow_insert has been running for over 60 seconds
test tests::test_slow_insert ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 165.22s

   Doc-tests st
   
running 0 tests
   
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

Let us try out the fast version by commenting out the test_slow_insert unit test and uncommenting the test_fast_insert unit test. Then after we run cargo test we get:

running 1 test
test tests::test_fast_insert ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.18s

   Doc-tests st
   
running 0 tests
   
test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s 

This time the test completes almost instantly.

So why is the first one slow? Specifically, why is this slow:

// A slow way to insert rows
fn slow_insert(conn: &Connection) -> Result<()> {
    for count in 1..=1000 {
        conn.execute("INSERT INTO vals (v) VALUES (?1)", params![count])?;
    }

    Ok(())
}

It is slow because it uses the connection's execute method, which results in a new transaction being created and committed to insert each and every row. This might be acceptable if the database were held in memory, but in this case the database lives on the filesystem, on a spinning disk drive. Interacting with the filesystem is slow: usually several syscalls have to be made. For example, for durability reasons (a key requirement of ACID) databases often make use of the fsync system call. All this means that creating 1000 transactions and committing each of them is very slow; it is much better to batch the database operations into a single transaction and commit it once, like this:

// A fast way to insert rows
fn fast_insert(conn: &mut Connection) -> Result<()> {
    let tx = conn.transaction()?;

    for count in 0..1000 {
        tx.execute("INSERT INTO vals (v) VALUES (?1)", params![count])?;
    }
    tx.commit()?;
    Ok(())
}

Tools

