Various small notes which I find useful to record.
Posts by: Sergey Voronin. Last edited: 2020.
With Java GUI applications, there can be an issue with small font size. For example, the following command sequence can be used to upscale the font sizes in the Java Weka application: GDK_SCALE=2 java -Dswing.aatext=true -Dswing.plaf.metal.controlFont=Tahoma-plain-22 -Dswing.plaf.metal.userFont=Tahoma-plain-22 -jar /path/to/weka.jar.
Here is a note on modeling the spread of the virus.
Producer/consumer example with a struct of options passed to the thread functions.
Automated algorithm and parameter selection for ML models is available via Auto-WEKA and auto-sklearn, which rely on parameter sweeps, coordinate descent, and Bayesian optimization methods. The drawback is of course the increased runtime, but an upper bound on the runtime can be passed as a parameter, along with a list of algorithms to try. See here for call sequence examples.
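Here is an implementation of an integer array counting sort which returns the sorted result and the permutation re-indexing information, useful for Burrows-Wheeler based compression. The cost is O(n + k) for n entries with values in a range of size k, i.e. O(n) when k grows no faster than n.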
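The linked implementation is not reproduced here. As a rough illustration of the idea, a stable counting sort that also returns the re-indexing permutation might look like the Python sketch below. The function name counting_sort_with_perm and the convention sorted_values[i] == values[perm[i]] are assumptions made for this example, and the inputs are assumed to be non-negative integers.

```python
def counting_sort_with_perm(values):
    """Stable counting sort for non-negative integers.

    Returns (sorted_values, perm) with sorted_values[i] == values[perm[i]].
    Runs in O(n + k) time, where k is the largest value plus one.
    """
    if not values:
        return [], []
    k = max(values) + 1
    # counts[v + 1] holds the number of occurrences of value v
    counts = [0] * (k + 1)
    for v in values:
        counts[v + 1] += 1
    # prefix sums: counts[v] becomes the first output slot for value v
    for v in range(k):
        counts[v + 1] += counts[v]
    sorted_values = [0] * len(values)
    perm = [0] * len(values)
    for idx, v in enumerate(values):
        slot = counts[v]
        sorted_values[slot] = v
        perm[slot] = idx      # remember where this entry came from
        counts[v] += 1
    return sorted_values, perm
```

Because equal values keep their original relative order, perm can be reused to re-index companion arrays, which is the property that rotation sorting in Burrows-Wheeler style schemes typically needs.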
Combining multiple images (e.g. plots) into a montage with ImageMagick: montage -mode concatenate -tile ncxnr files_in*.jpg res.jpg, where ncxnr is the tile geometry (number of columns x number of rows).
Managing multiple Python versions is often necessary due to library compatibility issues.
On many OS (e.g. Fedora, openSUSE) it is possible to install several Python versions
at once. One can then install pip for each wanted version with the
'--user' flag, as in python35 get-pip.py --user, so that the needed packages can be installed separately for each version. To launch IPython for a wanted version, install the
ipython package with pipXX, launch the pythonXX command and then type: from IPython import start_ipython; start_ipython();.
Lost track of a file? Make a list of files accessed in the last 24 hrs in the home directory: find /home/user/* -type f -atime -1 (a plain -atime 1 would instead match files last accessed between 24 and 48 hours ago).
A good way to use Microsoft applications in Linux is by running Windows in a virtual machine on the Linux based host. I recommend
the use of VirtualBox.
To make sure a good resolution for the guest is obtained,
increase available video memory and RAM in settings and run the command:
VBoxManage controlvm "name of vm" setvideomodehint 3200 1800 32 where e.g. 3200x1800
should match the max resolution of your host which you can check with the xrandr command.
To get file sharing working between host and guest, enable shared folders and download
guest additions iso and install from within the guest (from here). The version installed must match the vbox version, which you can find with the VBoxManage --version command.
Fast parallel sorting is available in Java (8 and up). See the
Arrays methods (serial sort and parallelSort). Here is an example program for sorting ";" separated strings.
How to OCR on Linux (e.g. turning a screenshot from Google Books into usable source
code). Install the tesseract packages. Save the screenshot; enhance it; convert to text:
convert code0.png -colorspace gray -fill white -resize 500% -sharpen 0x1 code1.jpg;
tesseract code1.jpg code1
The recognized text is written to code1.txt.
How to plot a basic confusion matrix with R:
library('caret');
confmat = confusionMatrix(predicted, actuals);
library(vcd);
mosaic(confmat$table);
To get column count in a csv file for every row (useful to check if csv was created correctly), you can use
Perl as follows: perl -nle 's/".*?"//g;print s/,//g+1' fname.csv.
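For comparison, the same per-row check can be sketched in Python with the standard csv module, which handles quoted fields containing commas. The helper names column_counts and inconsistent_rows are made up for this example.

```python
import csv

def column_counts(lines):
    """Return the number of fields in each row of CSV text given as lines."""
    return [len(row) for row in csv.reader(lines)]

def inconsistent_rows(lines):
    """Return (row_number, field_count) for rows that differ from the first row."""
    counts = column_counts(lines)
    expected = counts[0] if counts else 0
    return [(i + 1, n) for i, n in enumerate(counts) if n != expected]
```

For a file on disk, something like with open('fname.csv', newline='') as f: print(column_counts(f)) should give the same counts as the Perl one-liner for well-formed input.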
To remove the last few columns of a csv file (six in this example), you can use awk like this:
awk 'BEGIN{FS=OFS=","}{for(i=0; i<6; i++) NF--; print}' sample.csv > sample2.csv.
This is handy for large files.
When processing text files, two useful commands are awk and sed. To extract a given row, one can use (awk 'FNR==2' file.txt). To display a
given column of a csv file, one can use (awk -F "\"*,\"*" '{print $2}' file.csv).
It's often useful to remove a subset of the rows of a file (or set of files).
This can be efficiently accomplished with the sed command (e.g. sed --in-place '2,40d;' *txt will remove lines
2 to 40). If you are interested in removing a random subset of a select range of rows from a set of text files,
you can make use of this Perl script.
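The linked Perl script is not shown here. The idea of removing a random subset of rows from a given range can be sketched in Python as follows. The function name drop_random_rows, its arguments, and the 1-based inclusive range convention are assumptions made for this illustration.

```python
import random

def drop_random_rows(lines, first, last, count, seed=None):
    """Remove `count` randomly chosen lines from the 1-based inclusive range first..last."""
    last = min(last, len(lines))
    candidates = range(first - 1, last)          # 0-based indices eligible for removal
    if count > len(candidates):
        raise ValueError("cannot remove more rows than the range contains")
    rng = random.Random(seed)                    # seed makes the choice reproducible
    doomed = set(rng.sample(candidates, count))
    return [line for i, line in enumerate(lines) if i not in doomed]
```

To apply this to a set of text files, one could read each file with readlines(), pass the list through the function, and write the result back.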
Using NVIDIA CUDA and related packages on a Linux based system in most cases requires you to manually install
the driver. With recent kernels (4.4.X series) used e.g. by openSUSE, the nvidia-drm module has been
incompatible with the kernel build, resulting in failed installation and screen flicker. To get around this, notice from
here that
the --no-drm option can now be passed to the *.sh installer. To fix screen flicker: edit the kernel boot
entry in grub to include 'single ro' (e.g. BOOT_IMAGE=/boot/vmlinuz-4.4.90-28-default root=UUID=f45... single ro quiet showopts);
then boot to single user mode and run the installer with the --no-drm option.
If you work with machine learning, you want to be able to assess the accuracy of your classifier. For binary
classification, there are well defined notions such as the success rate and the false-positive rate. For multi-class
classification, these and related quantities can be defined on a per class basis
(see this paper).
Here is a code snippet
in Python which shows how to get individual per class classification measures using scikit-learn functions.
The main trick is to one-hot encode the actual and predicted labels.
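To make the one-hot trick concrete without depending on any library, here is a plain Python sketch (not the linked scikit-based snippet). After encoding, each class column is a binary problem, so the usual counts of true/false positives and negatives apply per class. The function names and the choice of reported measures are assumptions for this example.

```python
def one_hot(labels, classes):
    """Encode labels as rows of 0/1 indicators, one column per class."""
    return [[1 if label == c else 0 for c in classes] for label in labels]

def per_class_measures(actual, predicted):
    """Return {class: dict of precision, recall, false-positive rate}."""
    classes = sorted(set(actual) | set(predicted))
    A = one_hot(actual, classes)
    P = one_hot(predicted, classes)
    n = len(actual)
    measures = {}
    for k, c in enumerate(classes):
        # column k of the encodings is a binary problem: class c vs the rest
        tp = sum(a[k] * p[k] for a, p in zip(A, P))
        fp = sum((1 - a[k]) * p[k] for a, p in zip(A, P))
        fn = sum(a[k] * (1 - p[k]) for a, p in zip(A, P))
        tn = n - tp - fp - fn
        measures[c] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "fpr": fp / (fp + tn) if fp + tn else 0.0,
        }
    return measures
```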
The construction of factorizations (QR, LU) with pivoting (a shuffle of the columns represented by
the permutation matrix $P$) can be applied to system solves involving rank deficient matrices.
As an example, consider $A P = Q R$, i.e. $A = Q R P^T$. Plugging into $Ax = b$ and setting $y = P^T x$ yields $Q R P^T x = b \Rightarrow Q R y = b \Rightarrow R y = Q^T b$,
which is an upper triangular system, and can be solved by back substitution for $y$. A simple
permutation $P^T x = y \Rightarrow x = P y$ yields the solution $x$.
Similarly, suppose we have the pivoted LU factorization $AP = LU$. Then plugging
into $Ax = b$ yields $LU P^T x = b$. Next, set $z=U P^T x = U y$ with $y = P^T x$.
Then $L z = b$ can be solved by forward substitution for $z$, while
$U y = z$ can be solved by back substitution for $y$. Again applying a permutation
matrix to $y$ in $x = Py$ yields the result.
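The LU steps above can be traced in a small Python sketch. It assumes a square, nonsingular matrix (the rank deficient case would need extra handling of zero pivots), and the function names and the convention that perm[j] is the original column sitting in position j of $AP$ are choices made for this example.

```python
def lu_column_pivot(A):
    """Factor A P = L U with column pivoting; perm[j] is the original column in position j."""
    n = len(A)
    U = [list(map(float, row)) for row in A]
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    perm = list(range(n))
    for k in range(n):
        # pick the largest entry of row k among the remaining columns as pivot
        p = max(range(k, n), key=lambda j: abs(U[k][j]))
        if U[k][p] == 0.0:
            raise ValueError("zero pivot: matrix is singular")
        if p != k:
            for row in U:                      # swap columns k and p in every row
                row[k], row[p] = row[p], row[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]
            L[i][k] = m
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    return L, U, perm

def solve_with_column_pivoted_lu(A, b):
    """Solve A x = b via L z = b, U y = z, x = P y."""
    L, U, perm = lu_column_pivot(A)
    n = len(b)
    z = [0.0] * n
    for i in range(n):                          # forward substitution
        z[i] = b[i] - sum(L[i][j] * z[j] for j in range(i))
    y = [0.0] * n
    for i in reversed(range(n)):                # back substitution
        y[i] = (z[i] - sum(U[i][j] * y[j] for j in range(i + 1, n))) / U[i][i]
    x = [0.0] * n
    for j in range(n):                          # x = P y, i.e. x[perm[j]] = y[j]
        x[perm[j]] = y[j]
    return x
```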
The example codes for the algorithms in our conjugate
gradient acceleration paper are now available. L curve reconstruction using wavelet basis and regularization parameter estimation.
See the script build_and_run_continuation.m which builds the system and runs the continuation scheme along the L curve
traced out by $\|w\|_p$ (with $w = W x$) and $\|Ax - b\|_l$ (e.g. $p=1$, $l=2$).
The region of maximum curvature (computed using finite differences)
of the curve parametrized by the log of these quantities often gives the best guess for the optimal regularization parameter lambda.
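As a rough illustration of the curvature computation (not the code from the paper), the Python sketch below takes the two norm sequences recorded along the continuation, moves to log-log coordinates, and estimates the curvature at interior points with central differences, using the point index as the curve parameter. The function name max_curvature_index is an assumption, and all norms are assumed to be strictly positive.

```python
import math

def max_curvature_index(residual_norms, solution_norms):
    """Return the index of the interior point of largest curvature on the log-log L curve."""
    xs = [math.log(r) for r in residual_norms]
    ys = [math.log(s) for s in solution_norms]
    best_idx, best_kappa = None, -1.0
    for i in range(1, len(xs) - 1):
        # central differences with respect to the point index
        dx = (xs[i + 1] - xs[i - 1]) / 2.0
        dy = (ys[i + 1] - ys[i - 1]) / 2.0
        ddx = xs[i + 1] - 2.0 * xs[i] + xs[i - 1]
        ddy = ys[i + 1] - 2.0 * ys[i] + ys[i - 1]
        denom = (dx * dx + dy * dy) ** 1.5
        if denom == 0.0:
            continue
        kappa = abs(dx * ddy - dy * ddx) / denom   # curvature of a parametric curve
        if kappa > best_kappa:
            best_idx, best_kappa = i, kappa
    return best_idx
```

The returned index points at the entry of the lambda sequence sitting at the corner of the L curve, under the assumption that the norms were recorded in the same order as the lambda values.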
Notice that for the convolution based scheme, line search has to be used. We used a Taylor based approximation involving
the Hessian and gradient:
function alpha=approximate_line_search(A,At,AtA,b,p,lambda,sigma,xn,s)
  % step length minimizing the second order Taylor model of F along direction s
  gradxn = gradF(A,At,b,p,lambda,sigma,xn);
  Hessxn = hessianF(A,At,b,p,lambda,sigma,xn);
  alpha = -dot(gradxn,s)/(dot(s,Hessxn*s));
end
However, other line search techniques which do not require the Hessian can also be used.
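For reference, the step length formula used in the function above follows from a short calculation. Writing $g = \nabla F(x_n)$ and $H = \nabla^2 F(x_n)$ (notation introduced here for the derivation), the second order Taylor model along the search direction $s$ is

$$F(x_n + \alpha s) \approx F(x_n) + \alpha \, g^T s + \tfrac{1}{2} \alpha^2 \, s^T H s .$$

Setting the derivative with respect to $\alpha$ to zero gives $g^T s + \alpha \, s^T H s = 0$, so that

$$\alpha = -\frac{g^T s}{s^T H s},$$

which is a minimizer of the model provided $s^T H s > 0$.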
The Gauss-Newton (GN) method is a popular approach to solving non-linear least squares (NLS) problems.
In particular, it is very useful in fitting non-linear models to data. In this post, we
will investigate the application from the Wikipedia page in which a nonlinear model (f(x) = a*x/(b + x)) is fitted to the data.
The challenge consists of estimating the best parameters a,b of the model (contained in vector xs).
For the overall script, see: run_newton_gauss.m.
At the top we define the set of coordinates we are fitting and the initial guess. Since we are solving a non-linear
problem, the initial guess must be in the ball park of the true NLS minimizer; GN is not guaranteed to converge from an arbitrary guess.
At each iteration, we build the Jacobian matrix
and the residual vector.
Denoting the Jacobian by J and the residual vector by r, we must solve the linear system (J^T J) x = J^T r at every iteration,
via a direct or iterative solver. We choose to use a pivoted LU decomposition solve.
The solution is updated via xs = xs - alpha*x; where alpha is determined via a standard
line search technique.
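The iteration can be sketched in Python as follows. This is not run_newton_gauss.m: the 2x2 normal equations are solved in closed form instead of with a pivoted LU solve, the line search is simple step halving, and the data are noise-free synthetic values generated from known parameters at sample points chosen for illustration. All names here are assumptions made for the example.

```python
def model(a, b, x):
    return a * x / (b + x)

def gauss_newton_fit(xdata, ydata, a, b, iters=50, tol=1e-14):
    """Fit f(x) = a*x/(b + x) by Gauss-Newton with a step-halving line search."""
    def sse(a_, b_):
        return sum((model(a_, b_, x) - y) ** 2 for x, y in zip(xdata, ydata))
    for _ in range(iters):
        r = [model(a, b, x) - y for x, y in zip(xdata, ydata)]   # residual vector
        Ja = [x / (b + x) for x in xdata]                        # d f / d a
        Jb = [-a * x / (b + x) ** 2 for x in xdata]              # d f / d b
        # normal equations (J^T J) d = J^T r for the 2x2 case
        m11 = sum(u * u for u in Ja)
        m12 = sum(u * v for u, v in zip(Ja, Jb))
        m22 = sum(v * v for v in Jb)
        g1 = sum(u * ri for u, ri in zip(Ja, r))
        g2 = sum(v * ri for v, ri in zip(Jb, r))
        det = m11 * m22 - m12 * m12
        if det == 0.0:
            break
        d1 = (g1 * m22 - m12 * g2) / det
        d2 = (m11 * g2 - m12 * g1) / det
        # halve alpha until the sum of squares no longer increases
        alpha, current = 1.0, sse(a, b)
        while alpha > 1e-10 and sse(a - alpha * d1, b - alpha * d2) > current:
            alpha *= 0.5
        a, b = a - alpha * d1, b - alpha * d2
        if abs(alpha * d1) + abs(alpha * d2) < tol:
            break
    return a, b

# synthetic, noise-free data from known parameters (illustration only)
a_true, b_true = 0.362, 0.556
x_samples = [0.038, 0.194, 0.425, 0.626, 1.253, 2.5, 3.74]
y_samples = [model(a_true, b_true, x) for x in x_samples]
a_fit, b_fit = gauss_newton_fit(x_samples, y_samples, a=0.9, b=0.2)
```

With noise-free data the residual at the minimizer is zero, so the fitted pair should agree with the generating parameters up to rounding; with real measurements the iteration would instead settle at the least squares estimate.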
Real symmetric matrices have real eigenvalues. The eigenvalue of largest magnitude can be found
by power iteration. A slight generalization makes it possible to identify the correct eigenvalue sign.
To find other eigenvalues of the matrix, spectral deflation can then be performed.
That is, the original matrix A can be overwritten with A - eval*evec*evec' (with evec normalized to unit length) once
the eigenpair has been identified. Power iteration can then be performed on the new matrix.
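A compact Python sketch of the procedure is given below (the C library mentioned in this note is separate). Here the eigenvalue sign is recovered through the Rayleigh quotient v'Av, which is one possible way to handle it and not necessarily the approach taken in the C code. The function names, the fixed iteration count, and the deterministic start vector are assumptions for this example.

```python
import math

def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def power_iteration(A, iters=500):
    """Return (eigenvalue, unit eigenvector) for the dominant eigenpair of symmetric A."""
    n = len(A)
    v = [1.0 / (i + 1) for i in range(n)]        # deterministic start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = mat_vec(A, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # the Rayleigh quotient carries the sign of the eigenvalue
    eigval = sum(x * y for x, y in zip(v, mat_vec(A, v)))
    return eigval, v

def deflate(A, eigval, v):
    """Return A - eigval * v * v' for a unit eigenvector v."""
    n = len(A)
    return [[A[i][j] - eigval * v[i] * v[j] for j in range(n)] for i in range(n)]

def leading_eigenvalues(A, count):
    """Find `count` eigenvalues by repeated power iteration and deflation."""
    found = []
    B = [row[:] for row in A]
    for _ in range(count):
        eigval, v = power_iteration(B)
        found.append(eigval)
        B = deflate(B, eigval, v)
    return found
```

For example, leading_eigenvalues([[2.0, 1.0], [1.0, 2.0]], 2) should return values close to 3 and 1. Errors in the early eigenpairs propagate into the deflated matrix, so this is best suited to a handful of leading eigenvalues.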
See the eigensolver
code for an example in C. The small library also provides a good example of a basic multi-core mat-vec
implementation. For example, notice the matrix constructor:
/* initialize new matrix and set all entries to zero */
void matrix_new(mat **M, int nrows, int ncols)
{
    *M = malloc(sizeof(mat));
    (*M)->d = (double*)calloc(nrows*ncols, sizeof(double));
    (*M)->nrows = nrows;
    (*M)->ncols = ncols;
}
The double pointer ensures that the memory allocated (malloc) inside the function is attached to the caller's pointer
and remains allocated after the call returns. All matrix elements are stored in column major order
inside a one dimensional array.
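To illustrate the column major layout, here is a small Python sketch mirroring the struct fields d, nrows and ncols. The class and method names are made up for this example and are not part of the C library. Element (i, j) with zero-based indices lives at position i + j*nrows of the flat array.

```python
class Mat:
    """Dense matrix stored column by column in a flat list."""

    def __init__(self, nrows, ncols):
        self.nrows = nrows
        self.ncols = ncols
        self.d = [0.0] * (nrows * ncols)      # all entries start at zero

    def index(self, i, j):
        # consecutive entries of one column are adjacent in memory
        return i + j * self.nrows

    def get(self, i, j):
        return self.d[self.index(i, j)]

    def set(self, i, j, value):
        self.d[self.index(i, j)] = value
```

With this layout, looping over i in the inner loop for a fixed column j walks through memory contiguously.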