Tuesday, July 28, 2009

extract a file from the .tar file

Under stdsun:
Take a file named backup.tar, for example. We want to extract the file update_log inside the backup directory.

tar -xvf backup.tar backup/update_log

Done!
If the file is a .tar.gz, we can similarly use
tar -xzvf backup.tar.gz backup/update_log

a comment about the find command

To find all the directories matching a given name pattern under a directory and remove them, use the following command:
find /home/ye/Backup/ -type d -name apattern -exec rm -rf '{}' \;


-name pattern
Base of file name (the path with the leading directories removed) matches shell pattern pattern. The metacharacters (`*', `?', and `[]') do not match a `.' at the start of the base name. To ignore a directory and the files under it, use -prune; see an example in the description of -path.

-exec command ;
Execute command; true if 0 status is returned. All following arguments to find are taken to be arguments to the command until an argument consisting of `;' is encountered. The string `{}' is replaced by the current file name being processed everywhere it occurs in the arguments to the command, not just in arguments where it is alone, as in some versions of find. Both of these constructions might need to be escaped (with a `\') or quoted to protect them from expansion by the shell. The command is executed in the starting directory.

Monday, July 27, 2009

starting the implementation of the research project!

I now move on to implementing my outlier detection code.
First we need to find suitable datasets.
Here we target two datasets.

One is the climate dataset. It has the format of


The other one is the stock dataset. It has the format of


We use C++ to implement the algorithms. I will write more when I start coding.

A time schedule for the project:

Sunday, July 26, 2009

for fun~

In Hu Shih's Diary of My Studies Abroad (《胡适留学日记》) there is a passage to roughly this effect:

... July 4: Started this new diary, partly to push myself to work harder in the coming semester. First I must finish the copy of Shakespeare's Henry VIII at hand ...

July 13: Played cards.

July 14: Played cards.

July 15: Played cards.

July 16: Hu Shih, oh Hu Shih! How could you sink so low! Have you forgotten the study plan you set for yourself? Confucius said, "Each day I examine myself on three counts." ... This cannot go on!

July 17: Played cards.

July 18: Played cards.

Saturday, July 25, 2009

Using SSH without password

The original link for this post is: http://www.hostingrails.com/wiki/27/HowTo-SSHSCP-without-a-password



HowTo SSH/SCP without a password.



This is a wiki article created by HostingRails users.


This small HowTo will explain how to setup key-based authentication for password-less SSH and SCP usage.

This HowTo does assume the reader has some basic knowledge of ssh and a terminal, and is using an operating system that implements SSH. If you're using a Windows OS and want to use SSH, try PuTTY. For Putty, see key-based auth with Putty.

In the examples that follow please substitute 'servername' , 'ipaddress' and 'username' with the proper information for your setup. I have included a list of weblinks for the words in italic at the end of this document.

Step 1. Verify that you can connect normally (using a password) to the server you intend to setup keys for:

#### Examples ####

user@homebox ~ $ ssh username@'servername'

# Or:

user@homebox ~ $ ssh username@'ipaddress'

# If your username is the same on both the client ('homebox') and the server ('servername'):

user@homebox ~ $ ssh 'servername'

# Or:

user@homebox ~ $ ssh 'ipaddress'

# If this is your first time connecting to 'servername' (or 'ipaddress'), upon establishing a connection with the
# server you'll be asked if you want to add the server's fingerprint to the known_hosts file on your computer.
# Press 'enter' to add the fingerprint.

Step 2. Now that you're connected to the server and verified that you have everything you need for access (hopefully), disconnect by typing 'exit' .

#### Examples ####

user@servername ~ $ exit

# You should be back at:

user@homebox ~ $

Step 3. The next step is to copy a unique key generated on your 'homebox' to the server you are connecting to. First, before you generate a new key, check to see if you already have one:

#### Example ####

user@homebox ~ $ ls -l ~/.ssh
total 20
-rwx--xr-x 1 user user 601 Feb 2 01:58 authorized_keys
-rwx--xr-x 1 user user 668 Jan 1 19:26 id_dsa
-rwx--xr-x 1 user user 599 Jan 1 19:26 id_dsa.pub
-rwx--xr-x 1 user user 6257 Feb 2 21:04 known_hosts

# The file we need to copy to the server is named id_dsa.pub. As you can see above, the file needed exists. You may or may not have other files in ~/.ssh as I do. If the key doesn't exist, however, you can make one as follows:

#### Example ####

user@homebox ~ $ ssh-keygen -t dsa
Generating public/private dsa key pair.
Enter file in which to save the key (/home/user/.ssh/id_dsa): # Press 'enter' here
Enter passphrase (empty for no passphrase): # Press 'enter' here
Enter same passphrase again: # Press 'enter' here
Your identification has been saved in /home/user/.ssh/id_dsa.
Your public key has been saved in /home/user/.ssh/id_dsa.pub.
The key fingerprint is:
6f:c3:cb:50:e6:e9:90:f0:0f:68:d2:10:56:eb:1d:91 user@host

# Entering a passphrase when prompted during key generation would require you to enter it each time you SSH/SCP to the server, which defeats the purpose of this document.

Step 4. Regardless of whether you already had a key or had to generate a new one, the next step is the same. Now you're ready to copy the key to the server. Do so like this:

#### Example ####

user@homebox ~ $ ssh-copy-id -i ~/.ssh/id_dsa.pub user@'servername' (or 'ipaddress')

# If you are asked whether or not you wish to continue, say yes.

Step 5. Now it's time to test the setup. To do that, try to ssh to the server:

#### Example ####

user@homebox ~ $ ssh 'servername' (or 'ipaddress')

# You should log in to the remote host without being asked for a password.

Step 6. You can now SSH or SCP to the remote host without having to enter a password at each connection. To keep your private key safe from prying eyes, change the permissions to restrict access to ~/.ssh on 'homebox' and also on 'servername':

#### Example ####

user@homebox ~ $ chmod 600 ~/.ssh/id_dsa ~/.ssh/id_dsa.pub

# Verify the permissions on the files:

#### Example ####

user@homebox ~ $ ls -l ~/.ssh
-rw------- 1 user user 668 Feb 4 19:26 id_dsa
-rw------- 1 user user 599 Feb 4 19:26 id_dsa.pub

Links

1. OpenSSH

2. known_hosts

3. fingerprint

------
Nice post!

I've noticed that I don't have the command ssh-copy-id on my OS X machine (I didn't even know one existed!). To achieve the same effect I usually do the following:
user@homebox ~ $ scp ~/.ssh/id_dsa.pub user@'servername':.ssh/authorized_keys
This is assuming you've already created a .ssh directory on your server 'servername' (just ssh in as normal and `mkdir .ssh`). It also assumes that you don't already have an `authorized_keys` file in the .ssh directory on your server. If you do, just copy (scp) the id_dsa.pub file to a temporary file in your server's home directory and then:
user@homebox ~ $ scp .ssh/id_dsa.pub user@servername:homebox_dsa.pub
user@homebox ~ $ ssh user@servername
user@servername ~ $ cat homebox_dsa.pub >> .ssh/authorized_keys
user@servername ~ $ rm homebox_dsa.pub
If you've got it, the ssh-copy-id way is clearly a lot easier!

~ Mark

Hi Mark. Thanks for adding that bit. I don't have access to a Mac (new one anyway) so that's very nice to know.

Seth

Seth, I liked this post a lot, but felt the formatting and wording can be improved. I've made a few changes to the introduction.

Xin
(I wish I had used my name for my username now!)

-------

I found an elegant way of creating a new, or adding to an existing authorized_keys file with a single command:

ssh username@somedomain.com -n "echo `cat ~/.ssh/id_dsa.pub` >> ~/.ssh/authorized_keys"
-

I think it *is* good practice to use passphrases with ssh keys. You can use ssh-agent on Linux, and SSH Agent or SSHKeychain on Mac OS X, to avoid typing your passphrase every time you access a remote host. Also, you can forward your keys using 'ssh -A' if you need to hop onto some host in the middle.

-- Igor

-------

I'm using PuTTY (Pageant) on XP and on Vista. I use a 2048-bit RSA private key that is password protected. I typically use PuTTY to connect; fyi: my purpose is really to be able to use git (which uses SSH) without having to log in every time I commit. Assuming you are too and have a key generated already...

Load the key into PuTTYgen (enter the password for the key), copy the "public key for pasting into OpenSSH..." text from the window, and append it to ~/.ssh/authorized_keys on the server.

One comment: when I used the append command from above (the "elegant" one-liner), it did not add a \n at the end of the line, so it didn't work. I opened the file in vi and added a newline.

--Eric

Wednesday, July 22, 2009

How to run a Linux command regularly?

Using cron to schedule the jobs that need to be run regularly.

cron is a Linux system process that will execute a program at a preset time. To use cron you must prepare a text file that describes the program that you want executed and the times that cron should execute them. Then you use the crontab program to load the text file that describes the cron jobs into cron.

Here is the format of a cron job file:

[min] [hour] [day of month] [month] [day of week] [program to be run]

where each field is defined as

[min] Minutes the program should be executed on: 0-59. Do not set as * or the program will be run once a minute.
[hour] Hour the program should be executed on: 0-23. * for every hour.
[day of month] Day of the month the program should be executed on: 1-31. * for every day.
[month] Month the program should be executed on: 1-12. * for every month.
[day of week] Day of the week: 0-6, where Sunday = 0, Monday = 1, ..., Saturday = 6. * for every day of the week.
[program] Program to be executed. Include full path information.

Here are some examples:

0,15,30,45 * * * * /usr/bin/foo

Will run /usr/bin/foo every 15 minutes on every hour, day-of-month, month, and day-of-week. In other words, it will run every 15 minutes for as long as the machine is running.

10 3 * * * /usr/bin/foo

Will run /usr/bin/foo at 3:10am on every day.

10 * 1 * * /usr/bin/foo

Will run /usr/bin/foo at 10 minutes past every hour on the first day of the month.

10 * * 1 * /usr/bin/foo

Will run /usr/bin/foo at 10 minutes past every hour, every day, during January (the first month of the year).

10 14 * * 1 /usr/bin/foo

Will run /usr/bin/foo at 2:10pm on every Monday.

There are more options for these. See man 5 crontab.

You must use crontab to load cron jobs into cron. First create a text file that uses the above rule to describe the cron job that you want to load into cron. But before you load it, type crontab -l to list any jobs that are currently loaded in crontab.

If none are listed, then it is safe to load your job. For example, if you wanted to run /usr/bin/foo once a day at 3:10am, create a text file containing

10 3 * * * /usr/bin/foo

Save it as foo.cron. Then type crontab foo.cron. Check to see if it was loaded by typing crontab -l. It should display something like this:

# DO NOT EDIT THIS FILE - edit the master and reinstall.
# (ipwatch.cron installed on Thu Nov 18 11:48:02 1999)
# (Cron version -- $Id: crontab.c,v 2.13 1994/01/17 03:20:37 vixie Exp $)
10 3 * * * /usr/bin/foo

If you want to edit the cron job, edit foo.cron, remove the existing cron job (crontab -r), and load it again (crontab foo.cron). You can have multiple jobs; just put each one on a separate line in foo.cron.

crontab jobs will run under the user that was in effect when you loaded the job in crontab.

See man cron, man crontab, and man 5 crontab for more information.
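To make the five-field matching concrete, here is a small sketch (my own illustration, not part of cron) that checks a crontab line against a concrete time; it supports only `*` and comma-separated lists, as in the examples above:

```python
def field_matches(field, value):
    """True if a crontab field ('*' or a comma list of numbers) matches value."""
    if field == "*":
        return True
    return value in {int(v) for v in field.split(",")}

def cron_matches(line, minute, hour, dom, month, dow):
    """Check a five-field crontab entry against a concrete time.

    dow follows cron's convention: 0 = Sunday, ..., 6 = Saturday.
    """
    mi, hr, d, mo, wd, *_ = line.split()
    return all(field_matches(f, v) for f, v in
               [(mi, minute), (hr, hour), (d, dom), (mo, month), (wd, dow)])
```

So "0,15,30,45 * * * *" matches any minute in {0, 15, 30, 45} regardless of the other fields. Real cron has extra rules (ranges, steps, and special handling when both day fields are restricted), so treat this purely as an illustration of the format.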

The python debugger

Python has a debugging tool very similar to gdb. This post shows how to start the debugger and its basic commands.

Starting the debugger
There are mainly two ways to start the debugger.
One is in the Python interpreter:

The debugger’s prompt is (Pdb). Typical usage to run a program under control of the debugger is:

>>> import pdb
>>> import mymodule
>>> pdb.run('mymodule.test()')
> <string>(0)?()
(Pdb) continue
> <string>(1)?()
(Pdb) continue
NameError: 'spam'
> <string>(1)?()
(Pdb)

The other is from the command line:

pdb.py can also be invoked as a script to debug other scripts. For example:

python -m pdb myscript.py
Some basic commands include:
1. Setting and removing breakpoints:
break (b) lineno; clear (cl) bpnumber
2. Checking all the breakpoints:
break
3. Step into:
step (s)
4. Step over:
next (n)
5. Go to the next breakpoint:
continue (c)
6. Run until the current function returns:
return (r)
7. Jump to a certain place:
jump (j) lineno
8. List the current code:
list (l)
9. Change the value of a variable:
as Python is a scripting language, values can be changed at run time. Just assign a new value to the variable!

10. Printing:
p expression
How to print lists, dictionaries, sets? Use p, which evaluates any expression in the current frame; print also works because it is a Python statement, not part of the pdb module.

11. How to inspect a statement inside a loop?
Use conditions:
condition bpnumber expression. Here the condition is any expression that can be evaluated; the breakpoint takes effect only when it evaluates to true.
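A quick way to exercise a conditional breakpoint (a sketch; buggy_sum.py and the line number are illustrative names for this example):

```python
# buggy_sum.py -- a small loop worth inspecting under pdb.
def total(values):
    s = 0
    for i, v in enumerate(values):
        s += v  # line 5: a useful spot for a conditional breakpoint
    return s

print(total([3, -1, 4]))

# A pdb session that stops inside the loop only when v is negative:
#   $ python -m pdb buggy_sum.py
#   (Pdb) break 5
#   (Pdb) condition 1 v < 0   # breakpoint 1 now fires only if v < 0
#   (Pdb) continue
```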

Tuesday, July 21, 2009

the summary about the backup&recovery system

I think I have already finished about 70% of the system. Besides, I have also gained a feel for Python. This is really a good thing.

The overall picture for this system:
1. On the client, the backup system runs every day at a given time, collects all the changes in the system (or in a given location), compresses them, and sends them to the cse stdsun server.

2. On stdsun, a recovery module also runs regularly. After a certain time, it sets a checkpoint (recovery point). It can also remove compressed files that are too old to be of use. Finally, it can restore the file system structure as of any given time.

The modules that have been finished:

The file system for this module:
We store all the information in files throughout the project.
a. The file list of the previous day. In this list, each record contains the following attributes: document type (f means file, d means directory), file location (the absolute location of the file; some optimization could be applied here by using relative locations), and last access time (if the file type is directory, we can omit this attribute).

b. We also need to write a change log for the system every day. This change log is stored together with the backup files and contains the information needed to recover each file to its latest version. The format is different from the file list; it includes five attributes: document type (f or d), file location (the absolute location of the file), last access time, the renamed name, and the operation (U update, N newly added, D delete). Unchanged documents are not written to the change log.
The renamed name is needed because we put all the backed-up files into a single directory, which destroys the original structure: two files with the same name in different directories would have a name conflict. So we rename files in order to accommodate all files with the same name in a single directory.



c. The targzfile list on the server side (the place where you store the backups; in my case, the cse stdsun server).
This file lists all the compressed files, ordered by time. We regularly set checkpoints on the backups; each checkpoint is a recovery point. After we make a checkpoint, we write a line "checkpoint" in this targzfile list.

d. The file list on the server. This could be the same as the file list on the client. In fact, the last access time attribute is not needed; we keep it in case we need it in the future.

The size of the file is not needed here. The main purpose of maintaining this list is to compare the current list with the previous one and decide whether each file is newly added, updated, or identical to its previous version. We also need to identify removed files through the comparison. For files, we say two entries are the same if they are at the same location and their last access times are the same; otherwise the two data files are not the same. Directories are identical if they have the same location. Updates and newly added files can be identified in a similar way using the location and last access time attributes. For removal, we look for files or directories that appear in the previous list but not in the current list. Be sure to compare the file type! For example, if the previous list has a record (f, /home/ye/aaa, 111) and the current list has a record (d, /home/ye/aaa, 222), then the data file at that location was deleted and a new directory with the same name was created.

For performance, we make frequent use of hash tables.
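The comparison logic above can be sketched with ordinary Python dicts serving as the hash tables (the record layout and the N/U/D codes follow the change-log format described earlier; the function and parameter names are my own):

```python
def diff_lists(prev, curr):
    """Classify records as newly added (N), updated (U), or deleted (D).

    prev and curr map location -> (doc_type, last_access_time), where
    doc_type is 'f' or 'd'. A file is "the same" only if its type and
    last access time both match; directories are identical if the
    location (and type) match.
    """
    changes = []
    for loc, (typ, atime) in curr.items():
        if loc not in prev:
            changes.append(("N", typ, loc))
        elif prev[loc][0] != typ:
            # Same path, different type: the old entry was deleted
            # and a new one with the same name was created.
            changes.append(("D", prev[loc][0], loc))
            changes.append(("N", typ, loc))
        elif typ == "f" and prev[loc][1] != atime:
            changes.append(("U", typ, loc))
    for loc, (typ, _) in prev.items():
        if loc not in curr:
            changes.append(("D", typ, loc))
    return changes
```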

1. The automatic backup module
This module runs daily at midnight.
It compares the current file list with the previous one, writes the change logs, and copies all the changed files into a directory named with the timestamp. Finally, it sends the compressed files to the backup directory on the stdsun server.

2. The recovery module
It has three tasks:
a. Remove the old compressed files
b. Set checkpoint for the system regularly
c. Reconstruct the file system structure when required.

Monday, July 20, 2009

list comprehensions and looping techniques

List Comprehensions
Each list comprehension consists of an expression followed by a for clause, then zero or more for or if clauses. The result will be a list resulting from evaluating the expression in the context of the for and if clauses which follow it. If the expression would evaluate to a tuple, it must be parenthesized.

>>> freshfruit = [' banana', ' loganberry ', 'passion fruit ']
>>> [weapon.strip() for weapon in freshfruit]
['banana', 'loganberry', 'passion fruit']
#the format is [ (exp) for ... in ...]
>>> vec = [2, 4, 6]
>>> [3*x for x in vec]
[6, 12, 18]
>>> [3*x for x in vec if x > 3]
[12, 18]
>>> [3*x for x in vec if x < 2]
[]
>>> [[x,x**2] for x in vec]
[[2, 4], [4, 16], [6, 36]]
>>> [x, x**2 for x in vec] # error - parens required for tuples
File "<stdin>", line 1, in ?
[x, x**2 for x in vec]
^
SyntaxError: invalid syntax
>>> [(x, x**2) for x in vec]
[(2, 4), (4, 16), (6, 36)]
>>> vec1 = [2, 4, 6]
>>> vec2 = [4, 3, -9]
>>> [x*y for x in vec1 for y in vec2]
[8, 6, -18, 16, 12, -36, 24, 18, -54]
>>> [x+y for x in vec1 for y in vec2]
[6, 5, -7, 8, 7, -5, 10, 9, -3]
>>> [vec1[i]*vec2[i] for i in range(len(vec1))]
[8, 12, -54]

Looping over dictionaries and sequences

When looping through dictionaries, the key and corresponding value can be retrieved at the same time using the iteritems() method.
>>> knights = {'gallahad': 'the pure', 'robin': 'the brave'}
>>> for k, v in knights.iteritems():
... print k, v
...
gallahad the pure
robin the brave

When looping through a sequence, the position index and corresponding value can be retrieved at the same time using the enumerate() function.

>>> for i, v in enumerate(['tic', 'tac', 'toe']):
... print i, v
...
0 tic
1 tac
2 toe

To loop over a sequence in sorted order, use the sorted() function which returns a new sorted list while leaving the source unaltered.

>>> basket = ['apple', 'orange', 'apple', 'pear', 'orange', 'banana']
>>> for f in sorted(set(basket)):
... print f
...
apple
banana
orange
pear



Sunday, July 19, 2009

python regular expression-Grouping

Metacharacters are not active inside classes. For example, [akm$] will match any of the characters "a", "k", "m", or "$"; "$" is usually a metacharacter, but inside a character class it's stripped of its special nature.

Python provides very powerful grouping facilities. When you get a match object,
you can apply the group(index) or groups() method to it.
groups() returns the same as (group(1), group(2), ...).

The syntax for a named group is one of the Python-specific extensions: (?P<name>...). name is, obviously, the name of the group. Except for associating a name with a group, named groups behave identically to capturing groups. The MatchObject methods that deal with capturing groups all accept either integers, to refer to groups by number, or a string containing the group name. Named groups are still given numbers, so you can retrieve information about a group in two ways:

>>> p = re.compile(r'(?P<word>\b\w+\b)')
>>> m = p.search( '(((( Lots of punctuation )))' )
>>> m.group('word')
'Lots'
>>> m.group(1)
'Lots'
You can refer to the previous named groups by their names:
>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
>>> p.search('Paris in the the spring').group()
'the the'

P.S.:
Note that both match and search stop once they find ONE substring that fits the pattern.

One example about the grouping:
This is an example I met in the recovery and backup system.
Suppose we want to split the filename
2009-07-22-23-09.tar.gz (year-month-day-hour-minute.tar.gz)
We want to get the year, month, day, hour and minute when this file was created. How to write the regular expression?
the pattern should be

?P\d{4})-(?P\d{2})-(?P\d{2})-(?P\d{2})-(?P\d{2})\.tar\.gz
The file name is grouped into five parts, and we can retrieve each part by invoking the group(name) function.
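A short sketch putting the pattern to work (the helper name parse_stamp is my own; the filename format is the one above):

```python
import re

STAMP = re.compile(
    r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
    r"-(?P<hour>\d{2})-(?P<minute>\d{2})\.tar\.gz$"
)

def parse_stamp(filename):
    """Return (year, month, day, hour, minute) as ints, or None if no match."""
    m = STAMP.search(filename)
    if m is None:
        return None
    return tuple(int(m.group(g)) for g in
                 ("year", "month", "day", "hour", "minute"))

# parse_stamp("2009-07-22-23-09.tar.gz") -> (2009, 7, 22, 23, 9)
```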

A better example using VERBOSE:
pat = re.compile(r"""
 \s*                  # Skip leading whitespace
 (?P<header>[^:]+)    # Header name
 \s* :                # Whitespace, and a colon
 (?P<value>.*?)       # The header's value -- *? used to
                      # lose the following trailing whitespace
 \s*$                 # Trailing whitespace to end-of-line
""", re.VERBOSE)

Saturday, July 18, 2009

Summary about the backup system

I just finished a small part of the backup system. I should pay more attention to the design...

The part I have finished

1. Extract the file list I need to back up and write it to a file. This is in fact a walk over the given directory.

2. Compare this file list with the previous one, and identify the changes (updates, removals, and newly added files). Copy these changes to a directory named with the current date, and write them to the change logs.

The Part I need to work on tomorrow:

1. Copy these files to stdsun, zipping them if needed.

2. Write the recovery code for the files based on the logs.
This program is complex.


Tomorrow I need to learn Python debugging first, and write something about the research.

Wednesday, July 15, 2009

back to blog

After two weeks of rest, I feel very good now.

In the following days I will mainly do the following things:
1. Write a backup & restore system for my Linux system using scripts. It will back up files and put the backups onto stdsun via ssh. More importantly, it will do incremental backups.

2. Read the source code of Hadoop.

3. Provide some theoretical materials for the outlier detection.

Monday, July 6, 2009